Best practices of migrating Hive ACID Tables to BigQuery

Are you looking to migrate a large amount of Hive ACID tables to BigQuery? 

ACID enabled Hive tables support transactions that accept updates and delete DML operations. In this blog, we will explore migrating Hive ACID tables to BigQuery. The approach explored in this blog works for both compacted (major / minor) and non-compacted Hive tables. Let’s first understand the term ACID and how it works in Hive.

ACID stands for four traits of database transactions:  

Starting in Version 0.14, Hive supports all ACID properties which enables it to use transactions, create transactional tables, and run queries like Insert, Update, and Delete on tables.

Underlying the Hive ACID table, files are in the ORC ACID version. To support ACID features, Hive stores table data in a set of base files and all the insert, update, and delete operation data in delta files. At the read time, the reader merges both the base file and delta files to present the latest data. As operations modify the table, a lot of delta files are created and need to be compacted to maintain adequate performance.  There are two types of compactions, minor and major.

Organizations configure automatic compactions, but they also need to perform manual compactions when automated fails. If compaction is not performed for a long time after a failure, it results in a lot of small delta files. Running compaction on these large numbers of small delta files can become a very resource intensive operation and can run into failures as well. 

Some of the issues with Hive ACID tables are:

Benefits of migrating Hive ACIDs to BigQuery

Some of the benefits of migrating Hive ACID tables to BigQuery are:

Hive ACID table structure and sample data

Here is the sample Hive ACID  table  “employee_trans” Schema

  hive> show create table employee_trans;
OK
CREATE TABLE `employee_trans`(
  `id` int, 
  `name` string, 
  `age` int, 
  `gender` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'transactional'='true', 
  'transactional_properties'='default', 
  'transient_lastDdlTime'='1657906607')

This sample ACID table “employee_trans” has 3 records.

  hive> select * from employee_trans;
OK
1       James   30      M
3       Jeff    45      M
2       Ann     40      F
Time taken: 0.1 seconds, Fetched: 3 row(s)

For every insert, update and delete operation, small delta files are created. This is the underlying directory structure of the Hive ACID enabled table.

  hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delete_delta_0000005_0000005_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delete_delta_0000006_0000006_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000001_0000001_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000002_0000002_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000003_0000003_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000004_0000004_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000005_0000005_0000

These ORC files in an ACID table are extended with several columns:

  struct<
  operation: int,
  originalTransaction: bigInt,
  bucket: int,
  rowId: bigInt,
  currentTransaction: bigInt,
  row: struct<...>
>

Steps to Migrate Hive ACID tables to BigQuery

Migrate underlying Hive table HDFS data

Copy the files present under employee_trans hdfs directory and stage in GCS. You can use either HDFS2GCS solution or Distcp. HDFS2GCS solution uses open source technologies to transfer data and provide several benefits like status reporting, error handling, fault tolerance, incremental/delta loading,  rate throttling, start/stop, checksum validation, byte2byte comparison etc. Here is the high level architecture of the HDFS2GCS solution. Please refer to the public github URL HDFS2GCS to learn more about this tool.

The source location may contain extra files that we don’t necessarily want to copy. Here, we can use filters based on regular expressions to do things such as copying files with the .ORC extension only.

Load ACID Tables as-is to BigQuery

Once the underlying Hive acid table files are copied to GCS, use the BQ load tool to load data in BigQuery base table. This base table will have all the change events.

Data verification

Run  “select *” on the base table to verify if all the changes are captured. 

Note: Use of “select * …” is used for demonstration purposes and is not a stated best practice.

Loading to target BigQuery table

The following query will select only the latest version of all records from the base table, by discarding the intermediate delete and update operations.

You can either load the results of this query into a target table using scheduled query on-demand with the overwrite option or alternatively, you can also create this query as a view on the base table to get the latest records from the base table directly.

  WITH
 latest_records_desc AS (
 SELECT
   Row.*,
   operation,
   ROW_NUMBER() OVER (PARTITION BY originalTransaction ORDER BY originalTransaction ASC, bucket ASC, rowId ASC, currentTransaction DESC) AS rownum
 FROM
   `hiveacid-sandbox.hivetobq.basetable` )
SELECT id,name,age,gender
FROM
 latest_records_desc
WHERE
 rownum=1
 AND operation != 2

Once the data is loaded in target BigQuey table, you can perform validation using below steps:

a. Use the Data Validation Tool to validate the Hive ACID table and the target BigQuery table. DVT provides an automated and repeatable solution to perform schema and validation tasks. This tool supports the following validations:

b. If you have analytical HiveQLs running on this ACID table, translate them using the BigQuery SQL translation service and point to the target BigQuery table. 

Hive DDL Migration (Optional)

Since ORC is self-contained, leverage BigQuery’s schema inference feature when loading. 

There is no dependency to extract Hive DDLs from Metastore. 

But if you have an organization-wide policy to pre-create datasets and tables before migration, this step will be useful and will be a good starting point. 

a. Extract Hive ACID DDL dumps and translate them using BigQuery translation service to create equivalent BigQuery DDLs. 

There is a Batch SQL translation service to bulk translate exported HQL (Hive Query Language) scripts from a source metadata bucket in Google Cloud Storage to BigQuery equivalent SQLs  into a target GCS bucket. 

You can also use BigQuery interactive SQL translator which is a live, real time SQL translation tool across multiple SQL dialects to translate a query like HQL dialect into a BigQuery Standard SQL query. This tool can reduce time and effort to migrate SQL workloads to BigQuery. 

b. Create managed BigQuery tables using the translated DDLs. 

Here is the screenshot of the translation service in the BigQuery console.  Submit “Translate” to translate the HiveQLs and “Run” to execute the query. For creating tables from batch translated bulk sql queries, you can use Airflow BigQuery operator (BigQueryInsertJobOperator) to run multiple queries

After the DDLs are converted, copy the ORC files to GCS and perform ELT in BigQuery. 

The pain points of Hive ACID tables are resolved when migrating to BigQuery. When you migrate the ACID tables to BigQuery, you can leverage BigQuery ML and GeoViz capabilities for real-time analytics. If you are interested in exploring more, please check out the additional resources section. 

Additional Resources

Related posts

Supercharge Your Event-driven Architecture with New Cloud Functions (2nd gen)

by Kartika Triyanti
2 years ago

Building AI in the cloud: An easier way with Google Cloud and NVIDIA

by Kartika Triyanti
2 years ago

Log Analytics in Cloud Logging is now GA

by Cloud Ace Indonesia
1 year ago