Introducing new ML model monitoring capabilities in BigQuery

Monitoring machine learning (ML) models in production is now as simple as using a function in BigQuery! Today Google introducing a new set of functions that enable model monitoring directly within BigQuery. Now, you can describe data throughout the model workflow by profiling training or inference data, monitor skew between training and serving data, and monitor drift in serving data over time using SQL — for BigQuery ML models as well as any model whose feature training and serving data is available through BigQuery. With these new functions, you can ensure your production models continue to deliver value while simplifying their monitoring.

In this blog, Google present two companion notebooks to help you get hands-on with these features today!

The foundation of a model: the data

A model is only as good as the data it learns from. Understanding the data deeply is essential for effective feature engineering, model selection, and ensuring quality through MLOps. BigQuery’s table-valued function ML.DESCRIBE_DATA provides a powerful tool for this, allowing you to summarize and describe an entire table with a single query.

Example: Identifying data issues

In the accompanying introduction notebook, we profile the training data ( penguin classification dataset) using   the ML.DESCRIBE DATA  function and quickly identify a data issue.

SELECT *
FROM ML.DESCRIBE_DATA(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(3 AS top_k, 4 AS num_quantiles)
);

Here’s the resulting output table:

Notice that the min value for the sex column is a ‘.’. Ideally, we’d see the values MALEFEMALE or null as indicated in the top_values.values column. This means that in addition to the 10 null values (indicated by the num_null column) there are also some null values indicated by a string with value ‘.’. This should be corrected before using it as training data. 

The ML.DESCRIBE_DATA function is extra helpful because it summarizes each data type all in one table. There are also optional parameters that can be specified to control the number of quantiles for different numerical column types and the number of top values to return for categorical columns. The input data can be specified as a table or a query statement, allowing you to describe specific subsets of data (e.g., serving timeframes, or groups within your training data). The function’s flexibility extends beyond ML tasks: it even allows you to describe data stored outside of BigQuery, facilitating quick analysis for both model-building and broader data exploration purposes.

Detect skew at a glance

A trained model will perform only when the serving data is similar in distribution to the training data. Model monitoring helps ensure this by comparing training and serving data for shifts known as skew. BigQuery’s ML.VALIDATE_DATA_SKEW table valued function streamlines this process, allowing you to directly compare serving data to any BigQuery ML model’s training data.

Let’s see it in action:

SELECT *
FROM ML.VALIDATE_DATA_SKEW(
    MODEL `bqml_model_monitoring.classify_species_logistic`,
    (
        SELECT *
        FROM `bqml_model_monitoring.serving`
        WHERE DATE(instance_timestamp) = CURRENT_DATE()
    )
);

This query directly compares the data in the serving table to the BigQuery ML model classify_species_logistic. The accompanying introduction notebook has the full code in an interactive example. In that notebook the serving data is simulated to create change in two of the features: body_mass_g and flipper_length_mm. The results of the ML.VALIDATE_SKEW function show anomalies detect for each of these:

The detection of skew is as easy as comparing a model in BigQuery to a table of serving data. During training, BigQuery ML models automatically compute and store relevant statistics. This eliminates the need for reusing the entire training dataset, making skew monitoring simple and cost-efficient. Importantly, the function intelligently focuses on features present in the model, further enhancing efficiency and workflow. With optional parameters, you can customize anomaly detection thresholds, metric types for categorical features, and even set different thresholds for specific features. Later, we’ll demonstrate how easily you can monitor skew for any model!

Proactive monitoring for drift

Beyond comparing serving data to training data, it’s also important to keep an eye on changes within serving data over time. Comparing recent serving data to previous serving data is another type of model monitoring known as drift detection. This uses the same detection techniques of metrics that compare distributions between a baseline and comparison dataset and flag anomalies that exceed set threshold. With the table valued function ML.VALIDATE_DATA_DRIFT, you can compare any two tables, or query statements results, directly for detection. 

Drift detection in action:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
    (
        SELECT * EXCEPT(species, instance_timestamp)
        FROM `statmike-mlops-349915.bqml_model_monitoring.serving`
        WHERE DATE(instance_timestamp) = CURRENT_DATE()
    ),
    (
        SELECT * EXCEPT(species, instance_timestamp)
        FROM `statmike-mlops-349915.bqml_model_monitoring.serving`
        WHERE DATE(instance_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    ),
    STRUCT(
        0.03 AS categorical_default_threshold,
        0.03 AS numerical_default_threshold
    )
)

Here, the same serving table is used as the baseline and comparison table but with different WHERE statements to filter the rows and compare today to yesterday as an example. The results below show that while the detection values did not surpass the threshold, they are approaching the threshold between two consecutive days for the features that have simulated change.

Just like with skew detection, you can also adjust the default detection threshold for anomaly detection as well as the metric type used for categorical features, and specify different thresholds for different columns and feature types. There are additional parameters to control the binning of numerical features for the metrics calculations. 

Take TFDV monitoring to the next level

If you’re already familiar with the TensorFlow Data Validation (TFDV) library, you’ll appreciate how these new BigQuery functions enhance your model monitoring toolkit. They bring the power of TFDV directly into your BigQuery workflows, allowing you to generate rich statistics, detect anomalies, and leverage TFDV’s powerful visualization tools — all with SQL. And the best part is it uses BigQuery’s scalable, serverless compute. Leverage BigQuery’s scalable, serverless compute for near-instant analysis, empowering you to take rapid action on model monitoring insights!

Let’s explore how it works:

Generate statistics with ML.TFDV_DESCRIBE

You can generate in-depth statistics summaries with table valued function ML.TFDV_DESCRIBE for any table, or query, in the same format as the TensorFlow tfdv.generate_statistics_from_csv() API:

SELECT *
FROM ML.TFDV_DESCRIBE(
    (
        SELECT * EXCEPT(species)
        FROM `bqml_model_monitoring.training`
    )
)

The ML.TFDV_DESCRIBE function outputs statistics in a structured data format (a ‘proto’) that is directly compatible with TFDV: tfmd.proto.statistics_pb2.DatasetFeatureStatisticsList

Using a bit of Python code in a BigQuery notebook, we can import the TFDV package as well as TensorFlow Metadata package and then make a call to the tfdv.visualize_statistics method while converting the data to the expected format. The ML.TFDV_DESCRIBE results were loaded to Python for the training data as train_describe and for the current day’s serving data as today_describe. See the accompanying tutorial for complete details.

import tensorflow_data_validation as tfdv
import tensorflow_metadata as tfmd
from google.protobuf import json_format
tfdv.visualize_statistics(
    lhs_statistics = json_format.ParseDict(train_describe, tfmd.proto.statistics_pb2.DatasetFeatureStatisticsList()),
    rhs_statistics = json_format.ParseDict(today_describe, tfmd.proto.statistics_pb2.DatasetFeatureStatisticsList()),
    lhs_name = 'Training Data Stats',
    rhs_name = 'Serving Data Stats - For Today'
)

This generates the amazing visualizations shown below that directly highlight shifts in the two parameters that we purposefully shifted in the serving data for this example: body_mass_g and flipper_length_mm

This streamlined workflow brings the power and precision of TensorFlow Data Validation directly to BigQuery and enables you to quickly visualize how sets of data differ. This provides deeper insight to model health monitoring and informs how to proceed with model training iterations.

Detect anomalies With ML.TFDV_VALIDATE

You can also precisely detect skew or drift anomalies with the scalar function ML.TFDV_VALIDATE, which compares tables, or queries, pinpointing potential model-breaking shifts.

Example:

WITH
    TRAIN AS (
        SELECT * EXCEPT(species)
        FROM `bqml_model_monitoring.training`
    ),
    SERVE AS (
        SELECT * EXCEPT(species, instance_timestamp)
        FROM `bqml_model_monitoring.serving`
        WHERE DATE(instance_timestamp) = CURRENT_DATE()
    )
SELECT ML.TFDV_VALIDATE(
    (SELECT * FROM ML.TFDV_DESCRIBE(TABLE TRAIN)),
    (SELECT * FROM ML.TFDV_DESCRIBE(TABLE SERVE)),
    'SKEW', 0.03,'L_INFTY',0.03
) as validate

These results are formatted in a structured data format (‘proto’) that is specifically compatible with TFDV’s display tools: tfmd.proto.anomalies_pbs2.Anomalies. Passing this as input to Python method tfdv.display_anomalies presents an easy-to-read table of anomaly detection results as presented after the code snippet:

tfdv.display_anomalies(
    anomalies = json_format.ParseDict(validate, tfmd.proto.anomalies_pb2.Anomalies())
)
Feature nameAnomaly short descriptionAnomaly long description
‘culmen_depth_mm’High approximate Jensen-Shannon divergence between training and servingThe approximate Jensen-Shannon divergence between training and serving is 0.0483968 (up to six significant digits), above the threshold 0.03.
‘flipper_length_mm’High approximate Jensen-Shannon divergence between training and servingThe approximate Jensen-Shannon divergence between training and serving is 0.917495 (up to six significant digits), above the threshold 0.03.
‘body_mass_g’High approximate Jensen-Shannon divergence between training and servingThe approximate Jensen-Shannon divergence between training and serving is 0.356159 (up to six significant digits), above the threshold 0.03.
‘island’High Linfty distance between training and servingThe Linfty distance between training and serving is 0.118041 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: Dream
‘culmen_length_mm’High approximate Jensen-Shannon divergence between training and servingThe approximate Jensen-Shannon divergence between training and serving is 0.0594803 (up to six significant digits), above the threshold 0.03.
‘sex’High Linfty distance between training and servingThe Linfty distance between training and serving is 0.0513795 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: FEMALE

The default detection methods for numerical and categorical data, as well as thresholds are the same as for the other functions shown above. You can customize detection with parameters in the function for precision monitoring needs. For a deeper dive, the accompanying tutorial includes a section that demonstrates how these metrics are calculated manually and uses this function to compare to the manual calculation results as a validation.

Online and batch serving: A unified model monitoring approach

BigQuery’s model monitoring functions offer a streamlined solution whether you’re working with models deployed on Vertex AI Prediction Endpoints or using batch serving data stored within BigQuery (as shown above). Here’s how:

The accompanying tutorial provides a step-by-step walkthrough, demonstrating endpoint creation, model deployment, logging setup (for Vertex AI to BigQuery), and how to monitor both online and batch serving data within BigQuery.

Automate for scale

To achieve truly scalable monitoring of shifts and drifts, automation is essential. BigQuery’s procedural language offers a powerful way to streamline this process, as demonstrated in the SQL query from Google introductory notebook. This automation isn’t limited to monitoring; it can extend to continuous model retraining. In a production environment, continuous training would be accompanied by: proactively identifying data quality issues, adapting to real-world changes, and maintaining a rigorous deployment strategy aligned with your organization’s needs.

DECLARE skew_anomalies ARRAY<STRING>;
# Monitor Skew: latest serving compared to training
SET skew_anomalies = (
    SELECT ARRAY_AGG(input)
    FROM ML.VALIDATE_DATA_SKEW(
        MODEL `bqml_model_monitoring.classify_species_logistic`,
        (
            SELECT *
            FROM `bqml_model_monitoring.serving`
        WHERE DATE(instance_timestamp) >= CURRENT_DATE()
        )
    )
    WHERE is_anomaly = True
);
IF(ARRAY_LENGTH(skew_anomalies) > 0) THEN
    # retrain the model
    CREATE OR REPLACE MODEL `bqml_model_monitoring.classify_species_logistic`
    # find the full model training query in the introduction notebook
    ;
    
    # force alert with message
    SELECT ERROR(
        CONCAT(
            '\n\nFound data skew in features: ',
            ARRAY_TO_STRING(skew_anomalies, ', '),
            '. Model is retrained with latest up to date serving data.\n\n'
        )
    );
    ELSE SET skew_anomalies = ['No skew detected.'];
END IF;

Let’s take a look at what the results look like:

Found data skew in features: body_mass_g, flipper_length_mm. Model is retrained with the latest serving data.

A skew anomaly was detected and successfully triggered model retraining, restoring accuracy after the data changes. This demonstrates the value of automated monitoring and retraining for maintaining model performance in dynamic production environments.

To streamline this process, Google Cloud offers several powerful automation options::

Want a hands-on demonstration? Google accompanying tutorial dives into BigQuery scheduled queries, including historical backfilling, daily monitoring, and setting up email alerts for detected shifts and drifts. Google will also be releasing future tutorials covering the other automation tools.

The simplicity and power of model monitoring With BigQuery

Building trustworthy machine learning systems requires continuous monitoring. BigQuery’s new model monitoring functions streamline this to just a few SQL functions:

Getting Started

Extend from BigQuery ML models to Vertex AI Models and automate these new functions with Google Cloud offerings like BigQuery scheduled queries, Dataform, Workflows, Cloud Composer, or Vertex AI Pipelines. Dive into our hands-on notebooks to get started today:

Related posts

Transfer data from AWS to GCP using Storage Transfer Service

by Cloud Ace Indonesia
2 years ago

USAA and Google Cloud: Modernizing Insurance Operations with Machine Learning

by Kartika Triyanti
3 years ago

AlloyDB for PostgreSQL with Database Migration Service is now Generally Available

by Cloud Ace Indonesia
1 year ago