Dataproc Serverless: Now faster, easier and smarter

Google Cloud is thrilled to announce new capabilities that make running Dataproc Serverless even faster, easier, and more intelligent.

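Elevate your Spark experience with native query execution, a fully managed Spark UI, streamlined investigation (Preview), and proactive autotuning and assisted troubleshooting with Gemini (Preview).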

Accelerate your Spark jobs with native query execution

You can unlock considerable speed improvements for your Spark batch jobs in the Premium tier on Dataproc Serverless Runtimes 2.2.26+ or 1.2.26+ by enabling native query execution — no application changes required.
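As a rough illustration of the eligibility rule above, the following Python sketch checks whether a runtime version falls in the 2.2.26+ or 1.2.26+ range and, if so, adds the properties that would opt a batch into native query execution on the Premium tier. The property names (`spark.dataproc.runtimeEngine` and the two `compute.tier` keys) are not given in this post; they are assumptions here and should be confirmed against the current Dataproc Serverless documentation. The helper names are hypothetical, and the check covers only the two runtime lines named above.

```python
import re
from typing import Dict, Optional

# Minimum patch level for each runtime line named in the post (2.2.26+, 1.2.26+).
MIN_PATCH_BY_RUNTIME_LINE = {(2, 2): 26, (1, 2): 26}


def supports_native_execution(runtime_version: str) -> bool:
    """Return True if the version is one of the runtimes listed in the post."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", runtime_version.strip())
    if match is None:
        return False
    major, minor, patch = (int(part) for part in match.groups())
    minimum = MIN_PATCH_BY_RUNTIME_LINE.get((major, minor))
    return minimum is not None and patch >= minimum


def native_execution_properties() -> Dict[str, str]:
    # ASSUMPTION: these property names are not stated in the post.
    # Verify them in the Dataproc Serverless documentation before use.
    return {
        "spark.dataproc.runtimeEngine": "native",
        "spark.dataproc.driver.compute.tier": "premium",
        "spark.dataproc.executor.compute.tier": "premium",
    }


def build_batch_properties(
    runtime_version: str, extra: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Merge user properties with the native-execution ones when eligible."""
    properties = dict(extra or {})
    if supports_native_execution(runtime_version):
        properties.update(native_execution_properties())
    return properties
```

The resulting dictionary could then be passed as the Spark properties of a batch submission. Ineligible runtime versions simply keep their original properties, so the same helper can be applied across a mixed set of jobs.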

This new feature in the Dataproc Serverless Premium tier improved query performance by ~47% in our tests on queries derived from the TPC-DS and TPC-H benchmarks.

Note: Performance results are based on 1TB of Parquet data in GCS and queries derived from the TPC-DS and TPC-H standards. These runs are not comparable to published TPC-DS and TPC-H results, because they do not comply with all requirements of the TPC-DS and TPC-H specifications.

Start now by running the native query execution qualification tool that can help you easily identify eligible jobs and estimate potential performance gains. Once you have the list of batch jobs identified for native query execution, you can enable it and have the jobs run faster and potentially save costs.

Seamless monitoring with Spark UI

Tired of setting up and maintaining persistent history server (PHS) clusters just to debug your Spark batches? Wouldn’t it be easier if you could avoid the ongoing costs of the history server and still see the Spark UI in real time?

Until now, monitoring and troubleshooting Spark jobs in Dataproc Serverless required setting up and managing a separate Spark persistent history server. Crucially, each batch job had to be configured to use the history server. Otherwise, the open-source UI would be unavailable for analysis for the batch job. Additionally, the open-source UI suffered from slow navigation between applications.

Google Cloud has heard you loud and clear, and is excited to announce a fully managed Spark UI in Dataproc Serverless that makes monitoring and troubleshooting a breeze.

The new Spark UI is built-in and automatically available for every batch job and session in both Standard and Premium tiers of Dataproc Serverless at no additional cost. Simply submit your job and start analyzing performance in real time with the Spark UI right away.

Here’s why you’ll love the Serverless Spark UI:

| | Traditional approach | The new Dataproc Serverless Spark UI |
|---|---|---|
| Effort | Create and manage a Spark history server cluster. Configure each batch job to use the cluster. | No cluster setup or management required. Spark UI is available by default for all your batches without any extra configuration. The UI can be accessed directly from the Batch / Session details page in the Google Cloud console. |
| Latency | UI performance can degrade with increased load. Requires active resource management. | Enjoy a responsive UI that automatically scales to handle even the most demanding workloads. |
| Availability | The UI is only available as long as the history server cluster is running. | Access your Spark UI for 90 days after your batch job is submitted. |
| Data freshness | Wait for a stage to complete to see that its events are in the UI. | View regularly updated data without waiting for the stage to complete. |
| Functionality | Basic UI based on open-source Spark. | Enhanced UI with ongoing improvements based on user feedback. |
| Cost | Ongoing cost for the PHS cluster. | No additional charge. |
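To make the 90-day availability window concrete, here is a small Python sketch that computes the last moment a batch's Spark UI would be reachable. It assumes the window is exactly 90 × 24 hours counted from the submission timestamp; the service's actual cutoff may be measured differently, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention window stated in the comparison: 90 days after submission.
SPARK_UI_RETENTION = timedelta(days=90)


def spark_ui_available_until(submitted_at: datetime) -> datetime:
    """Return the assumed end of the Spark UI availability window."""
    return submitted_at + SPARK_UI_RETENTION


def spark_ui_still_available(submitted_at: datetime, now: datetime) -> bool:
    """Return True while `now` is inside the assumed 90-day window."""
    return now <= spark_ui_available_until(submitted_at)


submitted = datetime(2025, 1, 1, tzinfo=timezone.utc)
deadline = spark_ui_available_until(submitted)  # 2025-04-01 00:00 UTC
```

A check like this can be useful in cleanup or reporting scripts that need to know which past batches can still be inspected in the UI.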

Accessing the Spark UI

To gain deeper insights into your Spark batches and sessions, whether they’re still running or completed, simply navigate to the Batch Details or Session Details page in the Google Cloud console. You’ll find a “VIEW SPARK UI” link in the top right corner.

The new Spark UI provides the same powerful features as the open-source Spark History Server, giving you deep insights into your Spark job performance. Easily browse both running and completed applications, explore jobs, stages, and tasks, and analyze SQL queries for a comprehensive understanding of the execution of your application. Quickly identify bottlenecks and troubleshoot issues with detailed execution information. For even deeper analysis, the ‘Executors’ tab provides direct links to the relevant logs in Cloud Logging, allowing you to quickly investigate issues related to specific executors.

You can still use the “VIEW SPARK HISTORY SERVER” link to view the Persistent Spark History Server if you had already configured one.

Explore this feature now. Click “VIEW SPARK UI” in the top right corner of the Batch details page of any of your recent Spark batch jobs to get started. Learn more in the Dataproc Serverless user guide.

Streamlined investigation (Preview)

A new “Investigate” tab on the Batch details page gives you instant diagnostic highlights, collected in a single place.

In the “Metrics highlights” section, the essential metrics are automatically displayed, giving you a clear picture of your batch job’s health. You can further create a custom dashboard if you need more metrics.

Below the metrics highlights, a “Job Logs” widget shows the logs filtered by errors, so you can instantly spot and address problems. If you would like to dig further into the logs, you can go to the Logs Explorer.
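For digging further in the Logs Explorer, an error-only view of one batch can be approximated with a query string like the one built below. This is a minimal sketch: the resource type `cloud_dataproc_batch` and the `batch_id` label are assumptions not stated in this post, so check them against the log entries your batches actually produce. The function name is hypothetical.

```python
# Severity levels defined by Cloud Logging, lowest to highest.
SEVERITIES = (
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
)


def batch_error_log_filter(batch_id: str, min_severity: str = "ERROR") -> str:
    """Build a Logs Explorer query for one batch at or above a severity."""
    if min_severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {min_severity!r}")
    clauses = [
        # ASSUMPTION: resource type and label name for Dataproc Serverless batches.
        'resource.type="cloud_dataproc_batch"',
        f'resource.labels.batch_id="{batch_id}"',
        f"severity>={min_severity}",
    ]
    return " AND ".join(clauses)


query = batch_error_log_filter("my-batch")
```

Lowering `min_severity` to `"WARNING"` widens the view when the error-only logs do not explain a slow or failed job.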

Proactive autotuning and assisted troubleshooting with Gemini (Preview)

Last but not least, Gemini in BigQuery can help reduce the complexity of optimizing hundreds of Spark properties in your batch job configurations when you submit a job. If the job fails or runs slowly, Gemini can save you the effort of wading through several gigabytes of logs to troubleshoot it.

Optimize performance: Gemini can automatically fine-tune the Spark configurations of your Dataproc Serverless batch jobs for optimal performance and reliability.

Simplify troubleshooting: You can quickly diagnose and resolve issues with slow or failed jobs by clicking “Ask Gemini” for AI-powered analysis and guidance.

Sign up here for a free preview of the Gemini features and “Investigate” tab for Dataproc Serverless.
