Push down queries when using the Google BigQuery Connector for AWS Glue

4 minute read
Intermediate
1

The objective of this article is to describe how to use the Google BigQuery Connector for AWS Glue to build an optimized Extract, transform, and load (ETL) job by pushing down your own query to BigQuery. Query push down helps limiting the amount of data that needs to be scanned and transferred. Hence, it can optimize the cost of running the extraction query and the cost of data transfer.

Prerequisites

This feature is available with the Google BigQuery Connector for AWS Glue versions 0.22.2 or 0.24.2. In order to implement it you should already have:

  • An account in Google Cloud, specifically a service account that has permissions to Google BigQuery.
  • Subscribed and configured the Google BigQuery Connector for AWS Glue.

If you have not yet done so, you can follow the instructions in the blog post Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors.

Create the ETL job in AWS Glue Studio

To create your ETL job, complete the following steps:

  1. On the AWS Glue console, open AWS Glue Studio.
  2. In the navigation pane, choose Jobs.
  3. Choose Create job.
  4. Select Source and target added to the graph.
  5. For Source, choose Google BigQuery Connector 0.24.2 for AWS Glue 3.0 (choose Google BigQuery Connector 0.24.2 for AWS Glue 1.0 and 2.0 if you plan to use AWS Glue 1.0 or AWS Glue 2.0).
  6. For Target, choose S3.

Enter image description here

  1. Choose Create.
  2. Choose your data source node (Google BigQuery Connector for AWS Glue 3.0).
  3. On the Data source properties – Connector tab, for Connection, choose BigQuery.
  4. Under Connection options, choose Add new option.

Enter image description here

  1. Add 5 key value pairs:

    a. For the first key pair, for Key, enter parentProject, and for Value, enter your Google project name.

    b. For the second key pair, for Key, enter query, and for Value, type your query: SELECT col1, col2, col3 FROM your_table WHERE col4 = 'yyy' . The filter of rows and selection of the columns to be returned will be pushed down to BigQuery, improving performance and reducing costs.

    c. For the third key pair, for Key , enter viewsEnabled , and for Value , enter true.

    d. For the forth key pair, forKey , enter materializationDataset , and for Value , enter a dataset where the GCP user has table creation permission. If you want to materialize the temporary table in a different project you can also add an optional key pair, for Key , enter materializationProject, and for Value enter the name of another Project where you have dataset and table creation permission.

    e. For the last key pair (optional), for Key, enter maxparallelism, for Value, enter a number between 1 and 1000. This defines the number of streams that will read from the BigQuery Storage APIs. For performance optimization the number of executor cores in your Glue Job should be higher than the maxparallelism value. By default, the connector creates one partition per 400MB in the table being read (before filtering). As of this writing, the Google Cloud BigQuery Storage API has a maximum of 1,000 parallel read streams.

Enter image description here

  1. Complete your job by configuring the S3 target node, or adding additional transforms as needed.

Conclusion

In this article, we learned how to easily customize an AWS Glue ETL job that connects to a BigQuery and pushdown a query to improve performance and reduce costs for your ingestion. If you are interested in understanding the performance you can achieve with AWS Glue and the Google BigQuery Connector for AWS Glue you can review the blogpost Migrate terabytes of data quickly from Google Cloud to Amazon S3 with AWS Glue Connector for Google BigQuery.

EXPERT
published 3 months ago450 views