
Data Pipeline - Part III - Develop Locally, Deploy Globally - CDAP

In the previous blog we learned how to develop a pipeline locally using CDAP and how to deploy it to Cloud Data Fusion (CDF) to process data via Dataproc. In this article we’ll look at how we can leverage CDAP’s compute profiles to run a pipeline on a remote Hadoop cluster, either from a local CDAP instance or from Data Fusion, with both ephemeral and non-ephemeral clusters.

CDAP supports two profile types for compute resources: ephemeral clusters and remote clusters. An ephemeral cluster is created for the duration of the data processing job and is terminated promptly once the job completes. A remote cluster, on the other hand, is a pre-existing, long-lived (non-ephemeral) cluster that stays on continuously and awaits requests for data processing jobs.

We’ll use the exact same pipeline we created in the previous article and deploy it to Dataproc for processing in both ephemeral and non-ephemeral clusters.

Cloud Dataproc Setup

The reasons for having a non-ephemeral cluster range from a predictable configuration to reduced latency at job start. For real-time workloads this is critically important, as you may want a cluster that is always up and running and ready to crunch data as soon as it becomes available.

Create a non-ephemeral Dataproc cluster

To start off, let’s create a non-ephemeral Dataproc cluster on GCP. Navigate to the Dataproc console and create a new cluster consisting of one master and four worker nodes. If you prefer to use the gcloud CLI, simply enter the following command, and make sure you replace the project ID with your own.
The name I’ve given to this cluster is “small-cluster.”

    gcloud beta dataproc clusters create small-cluster \
      --enable-component-gateway \
      --region us-central1 \
      --subnet default \
      --zone "" \
      --master-machine-type n1-standard-4 \
      --master-boot-disk-type pd-ssd \
      --master-boot-disk-size 500 \
      --num-workers 4 \
      --worker-machine-type n1-standard-4 \
      --worker-boot-disk-type pd-ssd \
      --worker-boot-disk-size 500 \
      --image-version 1.3-deb9 \
      --project your_project_id
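If you expect to recreate this cluster more than once, it can help to wrap the command in a small script so the project ID and cluster name are not hard-coded. The following is only a sketch: the flags are the ones from the command above, while the variable names (PROJECT_ID, CLUSTER_NAME, REGION, NUM_WORKERS), the DRY_RUN switch, and the helper functions are my own assumptions for illustration. By default it only prints the command, so you can review it before anything is created.

```shell
#!/bin/sh
# Sketch only: variable and function names are illustrative assumptions.
# Defaults mirror the values used in the article's gcloud command.
PROJECT_ID="${PROJECT_ID:-your_project_id}"
CLUSTER_NAME="${CLUSTER_NAME:-small-cluster}"
REGION="${REGION:-us-central1}"
NUM_WORKERS="${NUM_WORKERS:-4}"

# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
# Note: the empty --zone value shows up as a blank in the printed line.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Same flags as the article's command, with the variable parts substituted.
create_cluster() {
  run gcloud beta dataproc clusters create "$CLUSTER_NAME" \
    --enable-component-gateway \
    --region "$REGION" \
    --subnet default \
    --zone "" \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-type pd-ssd \
    --master-boot-disk-size 500 \
    --num-workers "$NUM_WORKERS" \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-type pd-ssd \
    --worker-boot-disk-size 500 \
    --image-version 1.3-deb9 \
    --project "$PROJECT_ID"
}

create_cluster
```

Saved under a file name of your choice, running it with PROJECT_ID set to your own project prints the full command; setting DRY_RUN=0 as well makes it actually call gcloud and create the cluster.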
