Apache Spark is now the de facto standard for data engineering, data exploration, and machine learning, just as Kubernetes (k8s) is for automating containerized application deployment, scaling, and management. The open source ecosystem is now converging on k8s as a compute platform in addition to YARN.

Today, Google is announcing the general availability of Dataproc on Google Kubernetes Engine (GKE), which lets you use k8s to manage and optimize your compute platforms. You can now create a Dataproc cluster and submit Spark jobs on a self-managed GKE cluster.

Dataproc on GKE for Spark (GA)

K8s builds on 15 years of running Google’s containerized workloads and on critical contributions from the open source community. Inspired by Google’s internal cluster management system, Borg, k8s makes everything associated with deploying and managing your application easier. With the widespread adoption of k8s, many customers are now standardizing on it for their compute platform management. Google is observing a trend towards building applications as containers to simplify application management, along with other benefits such as improved agility, security, and portability.

Dataproc on GKE, now in GA, allows you to run Spark workloads on a self-managed GKE cluster, letting you derive the benefits of the fully automated, most scalable, and cost-optimized k8s service in the market. You bring your GKE cluster and create a Dataproc ‘virtual’ cluster on it. You can then submit jobs and monitor them the same way you would for Dataproc on Google Compute Engine (GCE). You use the Dataproc Jobs API to submit jobs to the cluster; you cannot use the open source spark-submit directly. Jobs on Dataproc are submitted as native containers, and you can customize the containers to include additional libraries and data for your applications.
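As a rough illustration of submitting through the Jobs API rather than spark-submit, here is a minimal Python sketch. It assumes the google-cloud-dataproc client library is installed and credentials are configured; the cluster name is hypothetical, and the SparkPi class and example jar path are assumptions about what the default image ships. The job body is built as a plain dictionary so it can be inspected without any cloud access.

```python
from typing import Any, Dict, List, Optional


def build_spark_job(
    cluster_name: str,
    main_class: str,
    jar_uris: List[str],
    args: Optional[List[str]] = None,
    properties: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Return a Job body in the shape the Dataproc Jobs API accepts."""
    return {
        "placement": {"cluster_name": cluster_name},
        "spark_job": {
            "main_class": main_class,
            "jar_file_uris": list(jar_uris),
            "args": list(args or []),
            "properties": dict(properties or {}),
        },
    }


def submit(project_id: str, region: str, job: Dict[str, Any]):
    """Submit the job and wait for it (needs google-cloud-dataproc and credentials)."""
    # Imported lazily so that build_spark_job can be used without the library.
    from google.cloud import dataproc_v1

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()


job = build_spark_job(
    cluster_name="my-dp-gke-cluster",  # hypothetical virtual cluster name
    main_class="org.apache.spark.examples.SparkPi",  # assumed example class
    jar_uris=["local:///usr/lib/spark/examples/jars/spark-examples.jar"],  # assumed path
    args=["1000"],
)
```

Calling `submit("my-project", "us-central1", job)` with your own project and region would send the job to the virtual cluster; the project and region shown here are placeholders.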

Concepts for existing Dataproc on GCE users

Node Pool Roles
Dataproc uses GKE node pools to manage the Dataproc cluster configuration. You can select the machine type for each node pool, and all the nodes in a node pool use the same configuration. Assigning the following roles to node pools allows you to optimize the Dataproc cluster configuration.

  • Default: At least one node pool must have the default role. If other roles are not defined, the default node pool is used to run the workload. 
  • Controller: If defined, the Dataproc control plane runs in this node pool. This role has very low resource requirements. 
  • Spark Driver: If defined, Spark job drivers run in this node pool. This allows you to optimize the cluster configuration to the workload characteristics.
  • Spark Executor: If defined, Spark job executors run in this node pool. This allows you to optimize the job executor environment. 
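The role assignment above can be sketched as a small Python helper that renders one `--pools` value per node pool for `gcloud dataproc clusters gke create`. The key names (`name`, `roles`, `machineType`), the lowercase role identifiers, the pool names, and the machine types are all assumptions made for illustration; check them against the gcloud reference before use.

```python
from dataclasses import dataclass
from typing import List, Optional

# Role identifiers assumed to correspond to the four roles described above.
ROLES = ("default", "controller", "spark-driver", "spark-executor")


@dataclass(frozen=True)
class NodePool:
    name: str
    role: str
    machine_type: Optional[str] = None

    def to_flag(self) -> str:
        """Render this pool as a single --pools flag."""
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role!r}")
        parts = [f"name={self.name}", f"roles={self.role}"]
        if self.machine_type:
            parts.append(f"machineType={self.machine_type}")
        return "--pools=" + ",".join(parts)


def pool_flags(pools: List[NodePool]) -> List[str]:
    """Render all pools, insisting that one of them carries the default role."""
    if not any(p.role == "default" for p in pools):
        raise ValueError("at least one node pool must have the default role")
    return [p.to_flag() for p in pools]


# Hypothetical layout: small default pool, separate driver and executor pools.
pools = [
    NodePool("dp-default", "default"),
    NodePool("dp-driver", "spark-driver", "n2-standard-4"),
    NodePool("dp-executor", "spark-executor", "n2-highmem-8"),
]
flags = pool_flags(pools)
```

Separating drivers from executors this way lets each pool use a machine type suited to its part of the workload, which is the optimization the roles are meant to enable.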

Workload Identity
Dataproc uses GKE Workload Identity to allow pods within the GKE cluster to act with the authority of a linked Google Service Account. This is very similar to the default Service Account for Dataproc on GCE.

Autoscaling
Dataproc on GKE utilizes the GKE Cluster autoscaler. Once Dataproc has created the node pools, you can define autoscaling policies for the node pool to optimize your environment.
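One way to define autoscaling limits on an existing node pool is the standard `gcloud container clusters update` command. The Python sketch below only assembles the argument list; the cluster name, node pool name, region, and limits are hypothetical placeholders, and whether these are the right limits for a given workload is left to the reader.

```python
from typing import List


def autoscaling_command(
    gke_cluster: str,
    node_pool: str,
    min_nodes: int,
    max_nodes: int,
    region: str,
) -> List[str]:
    """Build a gcloud invocation that enables autoscaling on one node pool."""
    if not 0 <= min_nodes <= max_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    return [
        "gcloud", "container", "clusters", "update", gke_cluster,
        "--enable-autoscaling",
        f"--node-pool={node_pool}",
        f"--min-nodes={min_nodes}",
        f"--max-nodes={max_nodes}",
        f"--region={region}",
    ]


# Hypothetical names: let the executor pool grow from 1 to 10 nodes.
cmd = autoscaling_command("my-gke-cluster", "dp-executor", 1, 10, "us-central1")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where gcloud is installed and authenticated.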

Key Benefits

Preview customers with expertise in GKE were able to easily integrate Dataproc into their environments and are now looking forward to migrating Spark workloads and optimizing their execution environments to improve efficiency and save costs. Our advanced customers are exploring GPUs for improved job performance to meet their stringent SLA needs. As we go GA, these customers are excited about using the advanced k8s compute management and resource sharing for their production workloads.

Running on GKE lets you take advantage of the advanced capabilities of k8s, enabling you to optimize costs and performance by running:

  • Completely independent jobs on the same cluster.
    • You can now share a Dataproc cluster among multiple applications with distinct libraries and dependencies. Each job can run in its own container, allowing independent jobs with conflicting dependencies to run at the same time on the same cluster. Earlier, each job with a distinct environment required an exclusive cluster. Relaxing this constraint enables customers to further optimize their execution environment. 
  • Multiple clusters on the same node pool.
    • You can share the same infrastructure across multiple Dataproc clusters. You can run multiple Dataproc clusters on the same node pools, thereby allowing you to further optimize costs. Some customers are now sharing multiple development environments on the same infrastructure. The same is applicable for testing, validation, and certification environments. 
  • Multiple Spark versions on the same infrastructure.
    • You can easily migrate from one version of Spark to another with the support for multiple versions on the same node pool. Your cluster management is simplified because you do not need to create two distinct environments or plan for scaling down the ‘existing’ cluster and scaling up the ‘upgraded’ cluster. 
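To make the first benefit concrete, the Python sketch below builds two PySpark job bodies that target the same cluster but name different container images. It assumes the per-job image is selected through the Spark property `spark.kubernetes.container.image`; the cluster name, bucket paths, and image URIs are hypothetical.

```python
from typing import Any, Dict

# Standard Spark-on-Kubernetes property; assumed here to select the job's image.
IMAGE_PROPERTY = "spark.kubernetes.container.image"


def pyspark_job_with_image(
    cluster_name: str, main_py_uri: str, image: str
) -> Dict[str, Any]:
    """Return a Jobs API body for a PySpark job pinned to one container image."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_py_uri,
            "properties": {IMAGE_PROPERTY: image},
        },
    }


CLUSTER = "shared-dp-gke-cluster"  # hypothetical shared virtual cluster

# Two jobs with different (possibly conflicting) dependencies, one cluster.
etl_job = pyspark_job_with_image(
    CLUSTER, "gs://example-bucket/etl.py", "gcr.io/example-project/spark-etl:1.0"
)
ml_job = pyspark_job_with_image(
    CLUSTER, "gs://example-bucket/train.py", "gcr.io/example-project/spark-ml:2.3"
)
```

Because each job carries its own image reference, neither job needs to know about the other's libraries, and both can be submitted to the same cluster at the same time.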

Dataproc on GKE Key Features

The following are some of the salient features of Dataproc on GKE:

  • Spark Versions: You can run Spark 2.4 and Spark 3.1 jobs on Dataproc on GKE clusters.
  • Metastore Integration: You can integrate Dataproc on GKE with Dataproc Metastore.
  • Job-level access controls: You can now specify granular access controls at the job level using k8s RBAC and Workload Identity. 
  • Uniform Dataproc APIs: You can use the same Dataproc APIs to manage clusters and submit jobs, and the same monitoring capabilities, as Dataproc on GCE.

Running Spark jobs with the infrastructure management style of your choice

With the general availability of Dataproc on GKE, organizations can now run Spark jobs with the infrastructure management style of their choice. Serverless Spark offers a no-ops deployment. Customers standardizing on k8s for infrastructure management can run Spark on GKE to improve resource utilization and simplify infrastructure management. Customers looking for VM-style infrastructure management can run Spark on GCE.

What’s Next

Google is actively working on integrating Dataproc on GKE with Vertex AI Workbench for data scientists in the coming months. With this integration, data scientists can use notebooks for their interactive workloads and even schedule notebook executions. Google is also looking to extend Enhanced Flex Mode support to Dataproc on GKE, allowing you to maximize the benefit of preemptible VMs. To get started, check out this quickstart link.

You can now take your knowledge of k8s compute management and use Dataproc on GKE to run Spark workloads.