How to do multivariate time series forecasting in BigQuery ML

Companies across industries rely heavily on time series forecasting to project product demand, forecast sales, project online subscription/cancellation, and for many other use cases. This makes time series forecasting one of the most popular models in BigQuery ML.

What is multivariate time series forecasting? For example, if you want to forecast ice cream sales, it is helpful to forecast using the external covariant “weather” along with the target metric “past sales.” Multivariate time series forecasting in BigQuery lets you create more accurate forecasting models without having to move data out of BigQuery.

When it comes to time series forecasting, covariates or features besides the target time series are often used to provide better forecasting. Up until now, BigQuery ML has only supported univariate time series modeling using the ARIMA_PLUS model (documentation). It is one of the most popular BigQuery ML models.

While ARIMA_PLUS is widely used, forecasting using only the target variable is sometimes not sufficient. Some patterns inside the time series strongly depend on other features. Google see strong customer demand for multivariate time series forecasting support that allows you to forecast using covariate and features.

Google recently announced the public preview of multivariate time series forecasting with external regressors. Google are introducing a new model type ARIMA_PLUS_XREG, where the XREG refers to external regressors or side features. You can use the SELECT statement to choose side features with the target time series. This new model leverages the BigQuery ML linear regression model to include the side features and the BigQuery ML ARIMA_PLUS model to model the linear regression residuals.

The ARIMA_PLUS_XREG model supports the following capabilities:

Automatic feature engineering for numerical, categorical, and array features.
All the model capabilities of the ARIMA_PLUS model, such as detecting seasonal trends, holidays, etc.

Headlight, an AI-powered ad agency, is using a multivariate forecasting model to determine conversion volumes for down-funnel metrics like subscriptions, cancellations, etc. based on cohort age.

The following sections show some examples of the new ARIMA_PLUS_XREG model in BigQuery ML. In this example, we explore the bigquery-public-data.epa_historical_air_quality dataset, which has daily air quality and weather information. We use the model to forecast the PM_2.5¹ , based on its historical data and some covariates, such as temperature and wind speed.

An example: forecast Seattle’s air quality with weather information

Step 1. Create the dataset

The PM_2.5, temperature, and wind speed data are in separate tables. To simplify the queries, create a new table by joining those tables into a new table “bqml_test.seattle_air_quality_daily,” with the following columns:

date: the date of the observation
PM_2.5: the average PM_2.5 value for each day
wind_speed: the average wind speed for each day
temperature: the highest temperature for each day

The new table has daily data from 2009-08-11 to 2022-01-31.

CREATE TABLE `bqml_test.seattle_air_quality_daily`

AS

WITH

  pm25_daily AS (

    SELECT

      avg(arithmetic_mean) AS pm25, date_local AS date

    FROM

      `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary`

    WHERE

      city_name = 'Seattle'

      AND parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass'

    GROUP BY date_local

),

  wind_speed_daily AS (

    SELECT

      avg(arithmetic_mean) AS wind_speed, date_local AS date

    FROM

      `bigquery-public-data.epa_historical_air_quality.wind_daily_summary`

    WHERE

      city_name = 'Seattle' AND parameter_name = 'Wind Speed - Resultant'

    GROUP BY date_local

),

  temperature_daily AS (

    SELECT

      avg(first_max_value) AS temperature, date_local AS date

    FROM

      `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary`

    WHERE

      city_name = 'Seattle' AND parameter_name = 'Outdoor Temperature'

    GROUP BY date_local

SELECT

  pm25_daily.date AS date, pm25, wind_speed, temperature

FROM pm25_daily

JOIN wind_speed_daily USING (date)

JOIN temperature_daily USING (date)

Here is a preview of the data:

Step 2. Create Model

The “CREATE MODEL” query of the new multivariate model, ARIMA_PLUS_XREG, is very similar to the current ARIMA_PLUS model. The major differences are the MODEL_TYPE and inclusion of feature columns in the SELECT statement.

CREATE OR REPLACE

  MODEL

    `bqml_test.seattle_pm25_xreg_model`

  OPTIONS (

    MODEL_TYPE = 'ARIMA_PLUS_XREG',

    time_series_timestamp_col = 'date',

    time_series_data_col = 'pm25')

AS

SELECT

  date,

  pm25,

  temperature,

  wind_speed

FROM

  `bqml_test.seattle_air_quality_daily`

WHERE

  date

  BETWEEN DATE('2012-01-01')

  AND DATE('2020-12-31')

Step 3. Forecast the future data

With the created model, you can use the ML.FORECAST function to forecast the future data. Compared to the ARIMA_PLUS model, you have to specify the future covariates as an input.

SELECT

FROM

  ML.FORECAST(

    MODEL

      `bqml_test.seattle_pm25_xreg_model`,

    STRUCT(30 AS horizon),

      SELECT

        date,

        temperature,

        wind_speed

      FROM

        `bqml_test.seattle_air_quality_daily`

      WHERE

        date > DATE('2020-12-31')

))

After running the above query, you can see the forecasting results:

Step 4. Evaluate the model

You can use the ML.EVALUATE function to evaluate the forecasting errors. You can set perform_aggregation to “TRUE” to get the aggregated error metric or “FALSE” to see the per timestamp errors.

SELECT

FROM

  ML.EVALUATE(

    MODEL `bqml_test.seattle_pm25_xreg_model`,

      SELECT

        date,

        pm25,

        temperature,

        wind_speed

      FROM

        `bqml_test.seattle_air_quality_daily`

      WHERE

        date > DATE('2020-12-31')

),

    STRUCT(

      TRUE AS perform_aggregation,

      30 AS horizon))

The evaluation result of ARIMA_PLUS_XREG is as follows:

As a comparison, we also show the univariate forecasting ARIMA_PLUS result in the following table:

Compared to ARIMA_PLUS, ARIMA_PLUS_XREG performs better on all measured metrics on this specific dataset and date range.

Conclusion

In the previous example, Google demonstrated how to create a multivariate time series forecasting model, forecast future values using the model, and evaluate the forecasted results. The ML.ARIMA_EVALUATE and ML.ARIMA_COEFFICIENTS table value functions are also helpful for investigating your model. Based on the feedback from users, the model does the following to improve user productivity.

It shortens the time spent preprocessing data and lets users keep their data in BigQuery when doing machine learning.
It reduces overhead for the users who know SQL to do machine learning work in BigQuery.

Blog Post

How to do multivariate time series forecasting in BigQuery ML

An example: forecast Seattle’s air quality with weather information

Step 1. Create the dataset

Step 2. Create Model

Step 3. Forecast the future data

Step 4. Evaluate the model

Conclusion

Tags In

Leave a Reply Cancel reply

WhatsApp

Email

Phone

Blog Post

How to do multivariate time series forecasting in BigQuery ML

An example: forecast Seattle’s air quality with weather information

Step 1. Create the dataset

Step 2. Create Model

Step 3. Forecast the future data

Step 4. Evaluate the model

Conclusion

Tags In

Related Posts

Four steps to managing your Cloud Logging costs on a budget

Firestore Multiple Databases is now generally available

Limitless Data. All Workloads. For Everyone

Leave a Reply Cancel reply

WhatsApp

Email

Phone