Data security is a key focus for organizations moving their data warehouses from on-premises to cloud-first systems, such as BigQuery. In addition to storage-level encryption, whether using Google-managed or customer-managed keys, BigQuery also provides column-level encryption. Using BigQuery’s SQL AEAD functions, organizations can enforce a more granular level of encryption to help protect sensitive customer data, such as government identity or credit card numbers, and help comply with security requirements.
While BigQuery provides column-level encryption in the cloud, many organizations operate in hybrid-cloud environments. To prevent a scenario where data needs to be decrypted and re-encrypted each time it moves between locations, Google Cloud offers a consistent and interoperable encryption mechanism. This enables deterministically-encrypted data (which maintains referential integrity) to be immediately joined with on-prem tables for anonymized analytics.
To achieve a BigQuery-compatible encryption on-prem, customers can use Tink, a Google-developed open-source cryptography library. BigQuery uses Tink to implement its SQL AEAD functions. We can use the Tink library directly to encrypt data on-prem in a way that can later be decrypted using BigQuery SQL in the cloud, and decrypt BigQuery’s column level-encrypted data outside of BigQuery.
For Google customers who want to use Tink with BigQuery, we have put together a few helpful Python utilities and samples in the BigQuery Tink Toolkit GitHub repo. Let’s first walk through an example of how to use Tink directly to encrypt or decrypt on-prem data using the same keyset used for BigQuery, followed by how the BigQuery Tink Toolkit can help simplify working with Tink.
To start, Google need to retrieve the Tink keyset. We’ll assume that KMS-wrapped keysets are being used. These keysets need to be stored in BigQuery to use with BigQuery SQL. If needed, they can also be replicated to a secondary store on-prem.lang-py
from google.cloud import bigquery, kms
bq_client = bigquery.Client()
query_job = bq_client.query("""SELECT kms_resource_path, wrapped_keyset, associated_data FROM `my-keysets-table` WHERE column_name = "my-pii-column";""")
result_row = query_job.result()
Now that we have the encrypted keyset, we need to unwrap it to retrieve the usable Tink keyset. If Cloud KMS is not accessible from on-prem, the unwrapped keyset will need to be maintained in a secure keystore on-prem.lang-py
kms_client = kms.KeyManagementServiceClient()
decrypted_keyset_obj = kms_client.decrypt(
{
"name": result_row.kms_resource_path.split("gcp-kms://")[1],
"ciphertext": result_row.wrapped_keyset,
}
)
keyset = decrypted_keyset_obj.plaintext
Google can now use the keyset to generate a Tink primitive. This can be used to encrypt or decrypt data with the associated keyset. Note that different primitives should be used depending on whether the keyset is for a deterministic or nondeterministic key.lang-py
import tink
from tink import aead, cleartext_keyset_handle, daead
binary_keyset_reader = tink.BinaryKeysetReader(keyset)
keyset_handle = cleartext_keyset_handle.read(binary_keyset_reader)
# If using a determinisitic keyset:
daead.register()
cipher = keyset_handle.primitive(daead.DeterministicAead)
# If using a nondeterministic keyset instead:
aead.register()
cipher = keyset_handle.primitive(aead.Aead)
Once we have our cipher, we can use it to encrypt or decrypt data as needed.lang-py
plaintext = "Hello world!"
associated_data = result_row.associated_data
# To encrpyt:
# If using a determinisitic keyset
ciphertext = cipher.encrypt_deterministically(
plaintext.encode(), associated_data.encode()
)
# If using a nondeterministic keyset instead
ciphertext = cipher.encrypt(plaintext.encode(), associated_data.encode())
# To decrypt:
# If using a determinisitic keyset
plaintext = cipher.decrypt_deterministically(
ciphertext, associated_data.encode()
)
# If using a nondeterministic keyset instead
plaintext = cipher.decrypt(
ciphertext, associated_data.encode()
)
Google have provided the CipherManager class to help simplify this process, which handles four actions:
- Retrieving the required keysets from a BigQuery table
- Unwrapping those keysets
- Creating a Tink cipher for each column
- Providing a consistent interface to call encrypt and decrypt.
Google have also included a sample Spark job that shows how to use CipherManager to encrypt or decrypt columns for a given table. We hope these come in handy – happy Tinkering.