Data Pipeline - Part II - Develop Locally, Deploy Globally - CDAP

One notable advantage of CDAP is that you can build data integration pipelines on your own computer with a local instance of CDAP Sandbox, then deploy those pipelines to run either in the cloud or on a remote Hadoop cluster.

In this article, I’ll show you how to use CDAP with the Google Cloud Platform (GCP) so that you can create pipelines on your own machine and run them in the cloud with Dataproc, Google’s managed service for Apache Hadoop and Apache Spark. I’ll also show you how to deploy the pipeline to Data Fusion, the managed CDAP service on GCP, for centralized management and execution.

Install CDAP Sandbox

To get started, follow the instructions in the linked blog post, “How to Install and Configure CDAP Sandbox Locally,” to set up your local sandbox version of CDAP.

With CDAP configured and working locally, we now turn to configuring access to GCP. To use GCP resources, you’ll need to configure IAM access via a service account.

Log in to your GCP console and navigate to APIs & Services, select Create credentials, and choose Service account key.
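Once a service account key has been downloaded, it can help to confirm the file looks like a service account key before pointing CDAP at it. The following is a minimal sketch, assuming the key was saved in JSON format. The function names are hypothetical helpers for illustration, not part of CDAP or any Google library, and the check only looks at the presence of the usual fields, not at whether the key is valid or has the right permissions.

```python
import json

# Fields normally present in a Google Cloud service account key saved as JSON.
REQUIRED_FIELDS = (
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
)


def check_service_account_key(key: dict) -> list:
    """Return a list of problems found in a parsed key file (empty if none).

    Hypothetical helper: it only inspects field presence and the "type" value.
    """
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)
    ]
    key_type = key.get("type")
    if key_type and key_type != "service_account":
        problems.append(f"unexpected type: {key_type!r}")
    return problems


def load_and_check(path: str) -> list:
    """Read a JSON key file from disk and run the same checks on it."""
    with open(path, encoding="utf-8") as handle:
        return check_service_account_key(json.load(handle))
```

Calling load_and_check with the path to your downloaded key file (wherever you chose to save it) returns an empty list when the expected fields are present, or a list of human-readable problems otherwise.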

