Running Spark on Kubernetes with Dataproc

Apache Spark is now the de facto standard for data engineering, data exploration, and machine learning, just as Kubernetes (k8s) is for automating containerized application deployment, scaling, and management. The open source ecosystem is now converging on k8s as a compute platform in addition to YARN.

Today, Google is announcing the general availability of Dataproc on Google Kubernetes Engine (GKE), enabling you to leverage k8s to manage and optimize your compute platform. You can now create a Dataproc cluster and submit Spark jobs on a self-managed GKE cluster.

Dataproc on GKE for Spark (GA)

K8s builds on 15 years of experience running Google's containerized workloads, along with critical contributions from the open source community. Inspired by Google's internal cluster management system, Borg, k8s makes everything associated with deploying and managing your applications easier. With its widespread adoption, many customers are now standardizing on k8s for compute platform management. Google is observing a trend toward building applications as containers to simplify application management, along with other benefits such as improved agility, security, and portability.

Dataproc on GKE, now generally available, allows you to run Spark workloads on a self-managed GKE cluster, letting you derive the benefits of a fully automated, highly scalable, and cost-optimized k8s service. You bring your GKE cluster and create a Dataproc 'virtual' cluster on it. You can then submit jobs and monitor them the same way you would for Dataproc on Google Compute Engine (GCE). You use the Dataproc Jobs API to submit jobs to the cluster; you cannot use open source spark-submit directly. Jobs on Dataproc run as native containers, and you can even customize the containers to include additional libraries and data for your applications.
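
As a rough sketch of that job submission path, the snippet below uses the google-cloud-dataproc Python client to submit a Spark job to a virtual cluster through the Jobs API. The project ID, region, cluster name, and jar path are placeholders to adapt to your environment, not values from this post.

from google.cloud import dataproc_v1

# Placeholder values -- substitute your own project, region, and cluster.
project_id = "my-project"
region = "us-central1"
cluster_name = "my-dataproc-gke-cluster"

# The regional endpoint must match the region of the virtual cluster.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        # Assumed path to the examples jar inside the Dataproc Spark image.
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Job {response.reference.job_id} finished in state {response.status.state.name}")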

Concepts for existing Dataproc on GCE users

Node Pool Roles
Dataproc uses GKE node pools to manage the Dataproc cluster configuration. You can select the machine type for each node pool; all nodes in a pool share the same configuration. Assigning roles to node pools allows you to optimize the Dataproc cluster configuration, for example by placing Spark drivers and executors in separately sized pools, as sketched below.
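
For illustration only, here is a minimal sketch of creating a Dataproc on GKE virtual cluster with a role-tagged node pool using the dataproc_v1 Python client. The project, region, bucket, cluster names, role names, and Spark engine version below are assumptions to verify against your own environment.

from google.cloud import dataproc_v1

project_id = "my-project"    # placeholder
region = "us-central1"       # placeholder
gke_cluster = f"projects/{project_id}/locations/{region}/clusters/my-gke-cluster"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "my-dataproc-gke-cluster",
    "virtual_cluster_config": {
        "staging_bucket": "my-staging-bucket",  # placeholder
        "kubernetes_cluster_config": {
            "kubernetes_namespace": "dataproc",
            "gke_cluster_config": {
                "gke_cluster_target": gke_cluster,
                "node_pool_target": [
                    {
                        # A pool can carry one or more roles; other assumed role
                        # names are CONTROLLER, SPARK_DRIVER, and SPARK_EXECUTOR,
                        # which let you size driver and executor pools separately.
                        "node_pool": f"{gke_cluster}/nodePools/dp-default",
                        "roles": [dataproc_v1.GkeNodePoolTarget.Role.DEFAULT],
                    },
                ],
            },
            "kubernetes_software_config": {
                # Assumed Spark engine version string; check which versions
                # are available to your project before relying on it.
                "component_version": {"SPARK": "3.1-dataproc-7"},
            },
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Created virtual cluster: {operation.result().cluster_name}")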

Workload Identity
Dataproc uses GKE Workload Identity to allow pods within the GKE cluster to act with the authority of a linked Google Service Account. This is similar to the default service account used by Dataproc on GCE.
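
As a small illustrative sketch (the bucket name is a placeholder), application code running in a pod on a Workload Identity-enabled node pool picks up the linked Google Service Account transparently through Application Default Credentials, with no key files mounted or distributed:

import google.auth
from google.cloud import storage

# Inside a Workload Identity-enabled pod, Application Default Credentials
# resolve to the Google Service Account linked to the pod's Kubernetes
# service account.
credentials, project_id = google.auth.default()
print(f"Acting as: {getattr(credentials, 'service_account_email', 'unknown')}")

# Any Google Cloud client then acts with that service account's permissions.
client = storage.Client(project=project_id, credentials=credentials)
for blob in client.list_blobs("my-staging-bucket"):  # placeholder bucket
    print(blob.name)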

Autoscaling
Dataproc on GKE utilizes the GKE Cluster autoscaler. Once Dataproc has created the node pools, you can define autoscaling policies for the node pool to optimize your environment.
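For example, assuming the google-cloud-container Python client and placeholder resource names, autoscaling bounds could be set on a Dataproc-created node pool roughly like this (treat the pool name and limits as illustrative, not prescribed values):

from google.cloud import container_v1

# Placeholder resource name -- substitute your project, region, GKE cluster,
# and the node pool that Dataproc created (for example, an executor pool).
node_pool_name = (
    "projects/my-project/locations/us-central1"
    "/clusters/my-gke-cluster/nodePools/dp-executor-pool"
)

gke_client = container_v1.ClusterManagerClient()
operation = gke_client.set_node_pool_autoscaling(
    request={
        "name": node_pool_name,
        "autoscaling": {
            "enabled": True,
            "min_node_count": 0,   # scale down when no Spark executors are running
            "max_node_count": 10,  # cap spend during bursts
        },
    }
)
print(operation.status)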

Key Benefits

Preview customers with expertise in GKE were able to easily integrate Dataproc into their environments and are now looking forward to migrating Spark workloads and optimizing their execution environments to improve efficiency and reduce costs. More advanced customers are exploring GPUs for improved job performance to meet their stringent SLA needs. With general availability, these customers are excited about using advanced k8s compute management and resource sharing for their production workloads.

Running on GKE enables you to take advantage of advanced k8s capabilities, such as resource sharing and fine-grained compute management, to optimize costs and performance.

Dataproc on GKE Key Features

The following are some of the salient features of Dataproc on GKE.

Running Spark jobs with the infrastructure management style of your choice

With the general availability of Dataproc on GKE, organizations can now run Spark jobs with the infrastructure management style of their choice: Serverless Spark for a no-ops deployment; Spark on GKE for customers standardizing on k8s who want to improve resource utilization and simplify infrastructure management; and Spark on GCE for customers who prefer VM-style infrastructure management.
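
For comparison with the GKE path shown earlier, here is a hedged sketch of the no-ops option: the same SparkPi workload submitted as a Serverless Spark batch with the dataproc_v1 client. The project, region, batch ID, and jar URI are placeholders.

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

batch_client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = {
    "spark_batch": {
        "main_class": "org.apache.spark.examples.SparkPi",
        # Assumed jar URI; point this at a jar your batch can reach.
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}

operation = batch_client.create_batch(
    request={
        "parent": f"projects/{project_id}/locations/{region}",
        "batch": batch,
        "batch_id": "sparkpi-demo",  # placeholder batch ID
    }
)
response = operation.result()
print(f"Batch {response.name} finished in state {response.state.name}")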

What’s Next

Google is actively working on integrating Dataproc on GKE with Vertex AI Workbench for data scientists in the upcoming months. With this integration, data scientists can use notebooks for their interactive workloads and even schedule notebook executions. Google is also looking to extend Enhanced Flex Mode support to Dataproc on GKE, allowing you to maximize the benefit of preemptible VMs. To get started, check out this quickstart link.

You can now take your knowledge of k8s compute management and leverage Dataproc on GKE to run your Spark workloads.
