In this year-end edition, Google are pleased to walk you through new SQL capabilities that Google launched in the categories of Enterprise Class, Data Quality and Schema Operations.
Enterprise Class
For our enterprise data analysts, we are pleased to share new capabilities in BigQuery SQL that allow them to manage their workloads using the same familiarity that they may have used in legacy databases and data warehouses.
Sessions (GA)
A database session (or session) represents a connection between a user or a program and a database or data warehouse that stores and queries data. A data analyst through a tool or a program may execute multiple commands which are tracked under this session. BigQuery at its heart is a stateless query engine which means it doesn’t maintain a persistent network connection, as traditional database connections do. This allows BigQuery to scale compute and storage independently without limits.
For those enterprise data administrators looking to manage BigQuery activity through sessions, they can optionally enable session support in BigQuery using the UI, API and CLI or can by setting the EnableSession parameter in the JDBC or ODBC driver. This allows enterprises to accrue all the benefits of sessions without the overhead of a persistent connection.
For data analysts, sessions provides the benefits of:
- Session variables: which allows the creation of a session-specific variable that is persisted for all commands that get executed in the session.
- System variables: Pre-build variables for timezone, dataset identifiers, projects, session identifiers, which if specified, allow the data analysts to override the organization or project level default values with session-specific settings.
- Session-duration temporary tables and temporary functions: When temporary tables are used in a session to stage results or a session-specific temporary function is used to make query operations efficient, they are persisted for the entire duration of the session and are accessible to all jobs in that session.
For data administrators, the benefits are:
- Session labels: By assigning labels for job tracking, data administrators can find all activities associated with that session using that label in the audit log.
- Session management: data administrators can manually terminate sessions by calling the BQ.ABORT_SESSION system procedure with the session ID as input.
BigQuery sessions also support multi-statement queries and transactions.
-- Create a session as a part of a transaction
BEGIN TRANSACTION
CREATE TEMP TABLE Flights(total INT64) AS SELECT * FROM UNNEST([10,23,3,14,55]) AS a;
SELECT * FROM Flights;
-- clear the contents of the same query window
DECLARE x INT64 DEFAULT 10;
COMMIT;
-- Lookup sessionID details from information schema
WITH running_transactions AS (
SELECT DISTINCT transaction_id
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
EXCEPT DISTINCT
SELECT transaction_id FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND statement_type = "COMMIT_TRANSACTION"
OR statement_type = "ROLLBACK_TRANSACTION"
)
SELECT
jobs.transaction_id AS transaction_id,
project_id,
user_email,
session_info.session_id,
query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS jobs, running_transactions
WHERE
creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND destination_table.table_id = "Flights"
AND jobs.transaction_id = running_transactions.transaction_id;
SELECT total * x AS total_a FROM Flights LIMIT 2;
-- Use session specific variables
BEGIN
SET x = 100;
SELECT total * x AS total_b FROM Flights LIMIT 2;
END;
-- Use and set system variables in a session
SELECT STRING(TIMESTAMP "2008-12-20 15:30:00+00", @@time_zone) AS default_time_zone;
SET @@time_zone = "America/Los_Angeles";
SELECT STRING(TIMESTAMP "2008-12-20 15:30:00+00", @@time_zone) AS new_time_zone;
Case-insensitive dataset and table names (Preview)
BigQuery tables names and dataset names are case sensitive, by default, reflecting its origins in big data analysis where dataset file names have maintained case sensitivity. To become more user-friendly to the tools and programs brought over by data analysts from their legacy data warehouses, BigQuery now supports case-insensitive table names and schema (dataset) names so that a dataset.Table ≡ DaTaSeT.TABLE ≡ Dataset.table ≡ DATASET.tAbLe in BigQuery. You can configure this in the definition (DDL) for dataset or table using the is_case_insensitive option.
-- Create a case-insensitive dataset
CREATE SCHEMA mydataset
OPTIONS(is_case_insensitive=TRUE)
-- Create a table in the case-insensitive dataset.
CREATE TABLE mydataset.table1(id STRING);
-- Query table with different casing
SELECT * FROM myDATASET.TaBLE1;
Data Quality
As data analytics platforms, such as BigQuery, bring data from different sources to help with decision making, data engineers need to maintain the quality of critical data elements. Data sources may be imperfect and have incomplete information. Therefore, data engineers need to configure intelligent defaulting logic when column data is missing to ensure that the right data gets populated. Similarly, data pipelines need to provide flexible logic for numeric data to reduce bias and ensure accurate outcomes. Data analysts need to match text information independent of case to ensure accurate reporting.
Default column value (Preview)
When new rows are inserted into a table, some columns may not have any data. Default value expression allows the specification of a default value using a literal value or a function that computes the default value when the associated column data is missing. The default value specification can be set when creating a new table or by altering the column properties of an existing table. In addition, DML statements such as INSERT or CREATE TABLE AS SELECT or MERGE allow the specification of the defaulting logic in place of an actual inserted value which uses the defaulting specification of the column being inserted into.
-- Create a table with a default value like CURRENT_TIMESTAMP() etc..
CREATE OR REPLACE TABLE mydataset.orders
(
customer STRING DEFAULT 'foo',
order_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
user STRING DEFAULT SESSION_USER(),
details STRUCT<a STRING, b STRING> DEFAULT ('a1', 'b1')
);
-- Insert to table with omitted columns
INSERT INTO mydataset.orders(customer, user)
VALUES ('customer1', DEFAULT),
('customer2', 'David');
-- Validates default value are used
SELECT * FROM mydataset.orders;
Case-insensitive string collation (Preview)
A collation specification when associated with a column operation, e.g. join, comparison or ORDER BY or GROUP BY clause, determines the logic used to compare or order string values. For example, the default collation specific in BigQuery is ‘binary’ which uses the code point order to specify the ordering sequence in Unicode by which all uppercase letters [A-Z] precede lower case letters [a-z]. We are pleased to offer case insensitive collation, specified using ‘und:ci’ by which [A,a] will be treated as equivalent characters and will precede [B. b] for string value operations. This allows data analysts to find matching string values independent of the case by treating Maclean and MacLean as equivalent.
-- Create a table with case-insensitive string columns
CREATE or replace TABLE mydataset.table1
(
id INT64,
column_ci STRING COLLATE 'und:ci'
);
insert into mydataset.table1
values (1,"alpha"),(2,"BETA"), (3,"Alpha");
-- Filter by column_ci case-insensitively
SELECT * FROM `mydataset`.table1 WHERE column_ci = 'Alpha';
-- Sort by column_ci case-insensitively
SELECT * FROM mydataset.table1 ORDER BY column_ci;
-- Group by column_ci case-insensitively
SELECT column_ci, sum(id) FROM mydataset.table1 GROUP BY column_ci;
Banker’s rounding (GA)
BigQuery has been natively rounding all values that overflow maximum precision using the most common rounding logic: “round half away from zero”. Using “Round half away from zero” exclusively however, can introduce a rounding bias in the data by causing the aggregate operation e.g. sum or average to drift away from the actual value. Other rounding techniques, such as Gaussian rounding, commonly known as banker’s rounding, provide an alternate rounding scheme which does not suffer from negative or positive bias as much as the round half away from zero method over aggregations over most reasonable distributions. With this, BigQuery is pleased to introduce support for multiple rounding modes for columns in table definitions and in the explicit ROUND function: the existing mode of rounding, called “round_half_away_from_zero”, and a new mode, Bankers Rounding, otherwise known as “round_half_even”. The “round_half_even” mode rounds towards the nearest “neighbor” unless both neighbors are equidistant, in which case, round towards the even neighbor, e.g., 3.1 & 3.2 will round to 3 and 3.5 & 4.5 will round to 4.
-- ROUND_HALF_EVEN will round down 2.25 to 2.2
SELECT ROUND(NUMERIC "2.25", 1, "ROUND_HALF_EVEN");
-- Create a table with 'ROUND_HALF_EVEN' numeric column
CREATE or REPLACE TABLE mydataset.table1 (
x NUMERIC(5,2) OPTIONS (rounding_mode='ROUND_HALF_EVEN'),
y NUMERIC(5,2) OPTIONS (rounding_mode='ROUND_HALF_AWAY_FROM_ZERO')
);
-- Insert values into the new columns, values that are in halfway between 2.22 and 2.23
INSERT INTO mydataset.table1 VALUES (2.225, 2.225);
-- Select values and view how they were rounded to scale of 2.
SELECT * FROM mydataset.table1;
Schema operations
Continuing the expansion of SQL syntax for schema operations, BigQuery is pleased to introduce two new SQL capabilities: LOAD DATA to load data into tables and RENAME column to allow existing table columns to be renamed
LOAD DATA (Preview)
This statement provides data engineers and analysts with a SQL interface to load one or more files into tables. It provides the ability to load into a named table, create a new table , or truncate an existing table as a part of the load from all the file types supported by BigQuery including CSV, AVRO, Parquet, JSON, etc. In addition to specifying table options, such as table metadata or table expiration, the LOAD DATA command also allows users to specify partition and clustering schemes as a part of the table specification or from a hive partitioning specification for external files.
LOAD DATA uses the same resource pool as the BigQuery load api to load data. If flat slots customers have configured a PIPELINE reservation, then the LOAD DATA will use the dedicated compute slots from the PIPELINE reservation assignment.
-- Load a CSVfile into a table.
LOAD DATA INTO mydataset.table6
FROM FILES(
format='CSV',
uris = ['gs://cloud-samples-data/bigquery/us-states/us-states-by-date.csv']
);
select * from mydataset.table6;
-- Truncate table and load a CSV file into it.
LOAD DATA OVERWRITE mydataset.table6
FROM FILES(
format='CSV',
uris = ['gs://cloud-samples-data/bigquery/us-states/us-states-by-date.csv']
);
select * from mydataset.table6;
-- Load a CSV file into a table with schema, partitioning and clustering
LOAD DATA INTO mydataset.table7(name STRING, post_abbr STRING, date DATE)
PARTITION BY date
CLUSTER BY post_abbr
FROM FILES(
skip_leading_rows=1,
format='CSV',
uris = ['gs://cloud-samples-data/bigquery/us-states/us-states-by-date.csv']
);
select * from mydataset.table7;
-- Load hive partitioned PARQUET files, infer partition layout
LOAD DATA INTO mydataset.table8
FROM FILES(
format='PARQUET',
uris = ['gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*'],
hive_partition_uri_prefix='gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout'
)
WITH PARTITION COLUMNS;
select * from mydataset.table8;
RENAME column (Preview)
RENAME COLUMN allows users to change a column name in an existing table; this is a metadata only change. Previously, renaming columns required a workaround which required column data to be rewritten. Now, when a column has to be renamed because the current naming is obsolete or was named incorrectly in error, data owners can run a zero cost metadata only command, RENAME column, to fix and correct the column name.
-- Rename two columns
CREATE or replace TABLE mydataset.table9
(columnA int64, columnB string);
select column_name, ordinal_position, data_type
from information_schema.columns
where table_schema = “mydataset”
and table_name = "table9";
ALTER TABLE mydataset.table9
RENAME COLUMN columnA TO columnA1,
RENAME COLUMN IF EXISTS columnB TO columnB1;
select column_name, ordinal_position, data_type
from information_schema.columns
where table_schema = “mydataset”
and table_name = "table9";
-- Swap two columns
ALTER TABLE mydataset.table9
RENAME COLUMN columnA1 TO temp,
RENAME COLUMN columnB1 TO columnA,
RENAME COLUMN temp TO columnB;
select column_name, ordinal_position, data_type
from information_schema.columns
where table_schema = “mydataset”
and table_name = "table9";
Google are excited about bringing new capabilities in the new year to you, our BigQuery data aficionados, to make BigQuery SQL more user-friendly than ever.