Analytics Hub is a data exchange platform within BigQuery that helps organizations streamline the accessibility of data and analytics assets. It provides a unified environment where data analysts, data scientists, and users can collaborate, share data, and gain insights from various data sources.

Here are some of the benefits of using Analytics Hub:

  • Data sharing at scale: Analytics Hub makes it easy for data analysts, data scientists, and users to collaborate and share data at scale.
  • Improved data security: Analytics Hub, built on top of BigQuery, enables robust security measures to protect data from unauthorized access.
  • Access to public datasets: Analytics hub provides access to thousands of valuable datasets, including first-party data from Google
  • Built-in features for data providers: Analytics hub provides access to subscriber’s usage metrics and subscription management for data providers.

This blog provides a detailed guide on setting up Analytics Hub for sharing datasets across organizations using Google Cloud, including essential steps and considerations.

Architecture overview

The above diagram shows cross-organizational representation. There are two organizations, Publisher and Subscriber. Project (A) under Publisher organization contains a private data exchange where datasets from all other projects under the same organization are listed.

The architecture above uses a dedicated project for setting up Analytics Hub data exchange. It helps in the following manner:

  1. Improved management: A dedicated project can make it easier to manage the multiple public and private data exchanges. You can create custom permissions and roles, and you can use billing and cost management tools to track your spending.
  2. Access to Subscriber usage metrics: A dedicated project will help you easily collect subscriber’s usage metrics by using INFORMATION_SCHEMA view from the project.
  3. Improved security: With a dedicated project, you can manage authorized subscribers from one place which helps in reducing the risk of unauthorized access or data breaches.

By default, a data exchange is private and only users or groups that have access to that exchange can view or subscribe to the data. After the user subscribes to the listing, a linked dataset is created in the target project accessible to its users as a READ ONLY dataset in BigQuery.

Create data exchanges and listings

First you should enable the Analytics Hub API using the command

gcloud services enable analyticshub.googleapis.com

Create data exchange

You should create a data exchange in the publisher project. The permission analyticshub.dataExchanges.create is required to perform the above step. Please follow the link to know how to provide IAM permissions.

Create listing

You should create a listing in the data exchange created above. At the least, following permissions are required for creating a listing:

  1. analyticshub.listings.create (On data exchange level or higher)
  2. bigquery.datasets.get (on dataset level or higher)
  3. bigquery.datasets.update (on dataset level or higher)

Above permissions are covered under Analytics Hub Publisher/Analytics Hub Admin and BigQuery Data Owner/BigQuery Admin roles.

Add subscriber role

To give subscribers access to a private listing, you must grant the Analytics Hub Subscriber role to the users for that listing.

In the following sections, you will learn how the subscriber user/group can view the listings available under a data exchange and subscribe to them programmatically.

Steps to view & subscribe to a listing

Below set of IAM roles/permissions are required to subscribe to a listing in your target project:

  1. bigquery.datasets.create permission at project level inside subscriber organization. This enables the user to create a linked dataset in their own project. This permission is covered as part of the role roles/bigquery.user
  2. roles/analyticshub.subscriber role on listing level or at data exchange level in the publisher organization.

Authenticate gcloud (or Cloud Shell) with an existing service account having required permissions using the following command:

gcloud auth activate-service-account SERVICE_ACCOUNT@DOMAIN.COM --key-file=/path/key.json --project=PROJECT_ID

If the key file generation is not allowed, the alternative option is to create a VM and assign the above service account. Login to the VM and you can proceed further.

You can also perform the following steps by authenticating as a user.

List available listings

To view all the listings available under the data exchange, user can run the below curl command:

curl -s -X GET "https://analyticshub.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataExchanges/DATA_EXCHANGE_ID/listings" -H "accept: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)"

Where project refers to a projectId or project_numberlocationId refers to a Google Cloud region and DATA_EXCHANGE_ID refers to Data Exchange ID. A truncated sample response is provided below.

Output

{
  "listings": [
    {
      "name": "projects/PROJECT_NUMBER/locations/LOCATION/dataExchanges/DATA_EXCHANGE_ID/listings/LISTING_ID",
      "displayName": "Listing display name",
      "bigqueryDataset": ...
    }
  ]
}

Subscribe to a listing

For subscribing to a listing in the Publisher organization, the Analytics Hub administrator must share a link to that data exchange & project ID with the subscribers. The link has fields data_exchange_id & listing_id which will be used by the user for the subscribing process. Run the below cURL command to subscribe to the desired listing. If the request is successful the response body will be empty. See API documentation for details.

curl -i -X POST "https://analyticshub.googleapis.com/v1/projects/PUBLISHER_PROJECT_ID/location/LOCATION/dataExchanges/DATA_EXCH_ID/listings/LISTINGS_ID:subscribe" -H "accept: */*" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -H "Content-Type: application/json" -d '{destination_dataset: {dataset_reference: {project_id: "SUBSCRIBER_PROJECT_ID", dataset_id: "DATASET_ID"}, location: "LOCATION"}}'

The request body ‘-d’ specifies the destination project and dataset where the linked dataset should be created in the subscriber’s organization.

Alerting users on new listings

Enable audit logs

You should enable BigQuery Analytics Hub API Audit logs before setting up log based alerts in the next step.

Setup alert for create listing

You can set alerts in cloud monitoring which will create an incident whenever a listing is created in the data exchange. Below is the policy.json file which creates a log based alert whenever a new listing is created in the defined project.

{
   "displayName": "Analytics_hub_create_listing_alert",
   "userLabels": {},
   "conditions": [
     {
       "displayName": "Log match condition",
       "conditionMatchedLog": {
         "filter": "protoPayload.methodName=\"google.cloud.bigquery.analyticshub.v1.AnalyticsHubService.CreateListing\"\n"
       }
     }
   ],
   "alertStrategy": {
     "notificationRateLimit": {
       "period": "86400s"
     },
     "autoClose": "604800s"
   },
   "combiner": "OR",
   "enabled": true,
   "notificationChannels": []
 }

The gcloud command to activate this alert is

gcloud alpha monitoring policies create --policy-from-file=policy.json

Subscription metrics using SCHEMATA_LINK view

The subscription metadata is exposed via INFORMATION_SCHEMA.SCHEMATA_LINKS view. And, the metadata can be aggregated to build useful metrics & to understand usage patterns of subscribers. Below are some sample queries to get started:

1. Number of linked datasets (i.e. subscriptions) created for each listing, sorted by number of subscriptions in descending order.

SELECT catalog_name, schema_name, COUNT(linked_schema_name) AS num_linked_datasets
FROM `[PROJECT_ID].region-REGION`.INFORMATION_SCHEMA.SCHEMATA_LINKS
GROUP BY 1, 2
ORDER BY 3 DESC

Where REGION is any dataset region. For example, region-us

2. Number of unique subscribers for each listing (based on project level or organization level)

The unique number of subscribers can be based on project level or at organization level. To get the number of unique subscribers at organization level, you can use the below query:

SELECT
  CATALOG_NAME,
  SCHEMA_NAME,
  COUNT(DISTINCT (LINKED_SCHEMA_ORG_DISPLAY_NAME)) AS NUM_UNIQUE_ORG_SUBSCRIBERS
FROM `[PROJECT_ID].region-REGION`.INFORMATION_SCHEMA.SCHEMATA_LINKS
GROUP BY 1, 2
ORDER BY 3 DESC

To get the number of unique subscribers based on project level, replace column LINKED_SCHEMA_ORG_DISPLAY_NAME with LINKED_SCHEMA_CATALOG_NAME in the above query.

This should give you a fair idea of how to share data across organizations using Bigquery Analytics Hub and analyze the usage metrics of subscribers. To learn more about Analytics Hub visit the Analytics Hub page and get started today.