Introducing the Hive-BigQuery open-source Connector

In Google work on the Big Data solutions architecture, BigQuery, and Dataproc teams, we talk to a lot of customers who are interested in migrating all or part of their data warehouse from Apache Hive to BigQuery, but who have hit some speed bumps along the way. To help, Google are proud to announce the public GA release of the Hive-BigQuery Connector.

With this open-source connector, you now can let Apache Hive workloads read and write to BigQuery and BigLake tables. The underlying data can be stored either in BigQuery native storage or in open-source data formats on Cloud Storage. Whether you’re fully migrating from Apache Hive to BigQuery or you want both systems to coexist and interact together, this new connector covers a wide range of use cases.

What is the Hive-BigQuery Connector?

If you’ve run Hadoop or Spark workloads on Google Cloud, you should already be familiar with the Cloud Storage Connector and the Apache Spark SQL connector for BigQuery. The former implements the Hadoop Compatible File System (HCFS) API that lets you store and access data files in Cloud Storage, Google Cloud’s highly-scalable and highly-available object storage service. The latter implements the Spark SQL Data Source API to allow reading BigQuery tables into Spark’s dataframes and writing DataFrames back into BigQuery.

Similarly, the Hive-BigQuery connector implements the Hive StorageHandler API to allow Hive workloads to integrate with BigQuery and BigLake tables. While Hive’s execution engine still handles all compute operations such as aggregates and joins, the connector manages all interactions with the data layer in BigQuery, whether the underlying data is stored in BigQuery native storage or in Cloud Storage buckets via a BigLake connection.

The following diagram illustrates how the Hive-BigQuery connector fits in the architecture:

For its part, Apache Hive is one of the most popular open-source data warehouses, and provides an SQL-like interface to query data stored in various databases and file systems that integrate with Apache Hadoop. Over time Hive evolved from using HDFS on-premises as its exclusive data storage layer, to using cloud storage services. And now, thanks to this new connector, Hive integrates with native storage solutions like BigQuery, simplifying migration.

Migrating a data warehouse to the cloud is a complex process, but it can offer significant benefits, including:

Google Cloud offers the BigQuery Migration Service as a comprehensive solution to accelerate migration from your Hive data warehouse to BigQuery. It includes free-to-use tools that help you with each phase of migration, including assessment and planning, data transfer, and data validation. Two of those tools, the BigQuery batch SQL translator and interactive SQL translator, enable you to translate your Hive queries to BigQuery’s own ANSI-compliant SQL syntax so you can then run those queries natively with BigQuery execution engine.

The new Hive-BigQuery connector offers one additional option: You can keep your original queries in their HiveQL dialect, continue to run those queries with the Hive execution engine on your cluster, but let those queries access data migrated to BigQuery and BigLake tables.

Here’s how the Connector assisted Flipkart’s data lake migration to Google Cloud:

“Flipkart places great importance on interoperability with open source technologies as a part of their reliance on and contribution to the open source community. The Hive-BigQuery Connector has played a crucial role in enabling queries on BigQuery data from Hive, as Hive is the primary query engine on our data lake. This integration has provided Flipkart the flexibility to utilize fast query engines like BigQuery without the need for data duplication or silos across various data stores.” – Venkata Ramana Gollamudi, Principal Architect, Flipkart; Apache Committer

Use cases

The Hive-BigQuery Connector can help you in at least the following core use cases:

Features

The Connector’s public preview release already ships with many features, including:

CREATE TABLE my_hive_table (int_val BIGINT, text STRING, ts TIMESTAMP)
STORED BY 'com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler'
TBLPROPERTIES (
    'bq.table'='myproject.mydataset.mytable',
    'bq.time.partition.field'='ts',
    'bq.time.partition.type'='MONTH',
    'bq.clustered.fields'='int_val,text'
);

Getting started

To get started, see the documentation on how to install and configure the connector on your Hive cluster.

Related posts

Automatic data risk management for BigQuery using DLP

by Cloud Ace Indonesia
2 years ago

How to set compliance controls for your Google Cloud Organization

by Cloud Ace Indonesia
3 weeks ago

Getting Started with Google Cloud Logging Python v3.0.0

by Cloud Ace Indonesia
2 years ago