
Getting Started with AWS Glue DataBrew: Simplified Data Preparation for Non-Technical Users

13 minute read
Content level: Foundational

An AWS Glue DataBrew hands-on tutorial for non-technical users


Introduction

Data is the foundation of modern business intelligence, but it's rarely analysis-ready in its raw form. According to industry research, data scientists and analysts spend up to 80% of their time cleaning, preparing, and organizing data before it can yield meaningful insights. This significant time investment creates a bottleneck in the analytics pipeline and often requires specialized coding skills in SQL, Python, or R.

AWS Glue DataBrew addresses this challenge by providing a visual, no-code interface for data preparation tasks. This service empowers business analysts, data scientists, and other non-technical users to clean and transform data efficiently without writing a single line of code.

In this guide, we'll walk through a practical scenario of preparing messy customer transaction data for analysis using AWS Glue DataBrew. By the end, you'll understand how to:

  • Set up projects and datasets in DataBrew
  • Apply common data cleaning transformations
  • Handle missing values and standardize formats
  • Join multiple datasets
  • Create and save reusable recipes
  • Optimize DataBrew jobs for large datasets
  • Integrate DataBrew into your data workflows

Understanding AWS Glue DataBrew

Before diving into our practical example, let's understand what AWS Glue DataBrew offers.

Key components of DataBrew:

  1. Projects: Interactive sessions where you explore data and create transformation recipes
  2. Datasets: References to your source data (S3, databases, or other AWS data stores)
  3. Recipes: Reusable collections of data transformation steps
  4. Jobs: Scheduled or on-demand executions of recipes against datasets
  5. Rulesets: Collections of data quality rules to validate your data

Benefits of using DataBrew:

  • No-code interface: more than 250 built-in transformations available through a point-and-click interface
  • Data profiling: Automatic statistics and visualization of your data characteristics
  • Scalability: Process datasets of any size without infrastructure management
  • Integration: Seamless workflow with other AWS analytics services
  • Collaboration: Share and reuse recipes across your organization

Our Example Scenario: Preparing Customer Transaction Data

For this tutorial, we'll work with a fictional e-commerce company that has messy customer transaction data spread across multiple files:

  1. customer_data.csv: Basic customer information with inconsistent formatting
  2. transactions.csv: Purchase records with missing values
  3. product_catalog.json: Product information in JSON format

Our goal is to prepare this data for a customer segmentation analysis by:

  • Cleaning and standardizing customer information
  • Filling in missing transaction values
  • Joining the datasets into a unified view
  • Exporting the clean data for analysis

Getting Started with DataBrew

Step 1: Create a Dataset

DataBrew also provides sample datasets to get you started if you don't have your own data handy.

First, let's set up our customer data as a DataBrew dataset:

  1. Navigate to the AWS Glue DataBrew console in the AWS Management Console
  2. In the left navigation pane, select Datasets
  3. Click Create dataset
  4. Configure your dataset:
       ◦ Dataset name: customer_data
       ◦ Connection type: Select "Upload from S3"
       ◦ S3 location: Browse to the S3 bucket where the customer_data.csv file is stored
       ◦ File type: CSV
       ◦ First row contains header: Yes
  5. Click Create dataset

Repeat this process for the transactions.csv and product_catalog.json files, making sure to select the appropriate file format for each.
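If you prefer to script dataset creation instead of using the console, the same step can be done with the AWS SDK for Python. The sketch below is a minimal example; the bucket name and object keys are placeholders you would replace with your own.

import boto3

databrew = boto3.client('databrew')

# Register the customer CSV file as a DataBrew dataset (bucket and key are placeholders)
databrew.create_dataset(
    Name='customer_data',
    Format='CSV',
    FormatOptions={'Csv': {'HeaderRow': True}},
    Input={'S3InputDefinition': {'Bucket': 'your-raw-data-bucket',
                                 'Key': 'raw/customer_data.csv'}}
)

# Register the JSON product catalog the same way
databrew.create_dataset(
    Name='product_catalog',
    Format='JSON',
    FormatOptions={'Json': {'MultiLine': True}},
    Input={'S3InputDefinition': {'Bucket': 'your-raw-data-bucket',
                                 'Key': 'raw/product_catalog.json'}}
)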

Step 2: Create a Project and Explore Your Data

Now that we have our datasets created, let's start exploring the customer data:

  1. From the DataBrew console, select Projects in the left navigation pane
  2. Click Create project
  3. Configure your project:
       ◦ Project name: customer_data_preparation
       ◦ Select a dataset: Choose the customer_data dataset we created
       ◦ Role name: Either select an existing IAM role with appropriate permissions or create a new one, entering a suitable suffix such as "Project"
  4. Click Create project

Once the project loads, you'll see a spreadsheet-like interface displaying your data. Notice the data profile panel on the right side showing statistics about your data.
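The console handles project creation for you, but if you ever need to automate it, the equivalent API call looks roughly like the sketch below. Note that create_project requires an existing recipe name and an IAM role ARN; the recipe name, role ARN, and sample size here are placeholders.

import boto3

databrew = boto3.client('databrew')

# Create an interactive project attached to the customer_data dataset.
# RecipeName must refer to a recipe that already exists (the console
# normally creates one for you); the role ARN is a placeholder.
databrew.create_project(
    Name='customer_data_preparation',
    DatasetName='customer_data',
    RecipeName='customer-data-preparation-recipe',
    RoleArn='arn:aws:iam::123456789012:role/AwsGlueDataBrewProjectRole',
    Sample={'Type': 'FIRST_N', 'Size': 500}
)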

Step 3: Clean and Transform Customer Data

Looking at our customer data, we notice several issues that need addressing:

3.1. Standardize Names (Upper/Lower Case)

  1. Click on the column header for customer_name
  2. Select Format > Change case > Title case
  3. You'll see a preview of the transformation before it's applied
  4. Apply Transform to all rows.
  5. Click Apply

3.2. Clean Phone Numbers

Our phone numbers are inconsistent, appearing in various formats such as (555) 123-4567, 555.123.4567, and 5551234567.

  1. Click on the phone_number column header
  2. Select Clean > Remove Special Characters > All Special Characters.
  3. Tick the checkbox to remove all whitespace as well.
  4. Apply Transform to all rows.
  5. Click Apply

3.3. Handle Missing Email Addresses

  1. Click on the email column header
  2. In the data profile panel, note the percentage of missing values
  3. Select Format > Remove or fill missing values > Fill with custom value
  4. Enter "customer@example.com" as the placeholder value
  5. Apply Transform to all rows.
  6. Click Apply

3.4. Extract Domain from Email

To enable analysis of email providers:

  1. Click on the email column header
  2. Select Extract > Extract details from email
  3. Select Email value breakdown to extract > Domain
  4. Name your new destination column email_domain
  5. Apply Transform to all rows.
  6. Click Apply

3.5. Categorize Customer Types

Let's create a simplified customer type column based on their age:

  1. First, fill in the null values in the age column: click Remove or fill missing values > Fill with numeric aggregate > Apply transform to All rows > Apply
  2. Then click Create > Based on conditions > Case
  3. Case 1: Value of age > Less than 25 (enter custom value) > Result value: Young Adult (enter custom value)
  4. Case 2: Value of age > Is between Greater than or equal to 25 and Less than or equal to 64 (enter custom values) > Result value: Adult (enter custom value)
  5. Case 3: Value of age > Greater than 64 (enter custom value) > Result value: Senior (enter custom value)
  6. Set the destination column name to customer_age_segment
  7. Click Preview to check the results, then Apply (the same logic is restated in the sketch after this list)
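The case logic above is entirely point-and-click, but if the boundaries are unclear, this small Python function restates the same rules. It is only an illustration, not DataBrew code.

def customer_age_segment(age):
    """Restates the three CASE conditions configured in DataBrew."""
    if age < 25:
        return 'Young Adult'
    elif age <= 64:
        return 'Adult'
    else:
        return 'Senior'

# Example: ages 22, 40, and 70 map to Young Adult, Adult, and Senior
print([customer_age_segment(a) for a in (22, 40, 70)])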

Step 4: Save Recipe and Create Job for Customer Data

Now that we've created a series of transformations for our customer data:

  1. Click Recipes in the top navigation bar
  2. Enter a version description, such as "Version 1 - basic cleansing of customer data"
  3. Click Publish
  4. Click Create job
  5. Configure your job:
       ◦ Job name: customer-data-cleaning-job
       ◦ Job output format: CSV
       ◦ Job output location: Browse to your desired S3 output location
       ◦ Role name: Use the same role as before
       ◦ Job run behavior: On-demand
  6. Click Create and run job

Your job will now process the full dataset using your recipe. You can monitor its progress from the Jobs section of the DataBrew console.
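For automation, these console actions map onto three API calls. The sketch below is a minimal example; the recipe name, role ARN, and output bucket are placeholders.

import boto3

databrew = boto3.client('databrew')

# Publish the working recipe so jobs can reference a fixed version
databrew.publish_recipe(
    Name='customer-data-preparation-recipe',
    Description='Version 1 - basic cleansing of customer data'
)

# Create a recipe job that writes cleaned CSV output to S3
databrew.create_recipe_job(
    Name='customer-data-cleaning-job',
    DatasetName='customer_data',
    RecipeReference={'Name': 'customer-data-preparation-recipe', 'RecipeVersion': '1.0'},
    RoleArn='arn:aws:iam::123456789012:role/AwsGlueDataBrewProjectRole',
    Outputs=[{'Location': {'Bucket': 'your-output-bucket', 'Key': 'clean/customer_data/'},
              'Format': 'CSV'}]
)

# Run the job on demand and capture the run ID for monitoring
run = databrew.start_job_run(Name='customer-data-cleaning-job')
print(f"Started job run with ID: {run['RunId']}")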

Step 5: Transform Transaction Data

Let's switch to working with the transaction data:

  1. Return to the Projects page and create a new project named transaction-data-preparation using the transactions.csv dataset
  2. Repeat the same dataset and project setup described in Steps 1 and 2
  3. Once the project loads, explore the data in the grid to familiarize yourself with it. For the transaction data, we'll perform the following transformations:

5.1. Fill missing values in quantity and price, and calculate Total Purchase Value

  1. For both the quantity and price columns, click Remove or fill missing values > Fill with numeric aggregate > Apply transform to All rows > Apply
  2. Click Create > Based on a function
  3. Select Multiply > Values using source columns > Source columns: quantity, price
  4. Name the destination column total_value
  5. Click Apply

5.2. Remove Duplicate Transactions

  1. From the top menu, select Group & aggregate > Remove duplicates
  2. Select the columns that should be unique (transaction_id)
  3. Click Apply

5.3. Bin Transaction Values

To create a categorical field for analysis:

  1. Click Create > Based on conditions > Case
  2. Configure the cases on the total_value column:
       ◦ Case 1: Value of total_value > Is between 0 and 50 > Result value: Small (enter custom value)
       ◦ Case 2: Value of total_value > Is between 51 and 200 > Result value: Medium (enter custom value)
       ◦ Case 3: Value of total_value > Greater than 200 > Result value: Large (enter custom value)
       ◦ Default: Enter custom value ("null")
  3. Name your destination column purchase_size
  4. Click Apply (the bins are restated in the sketch below)

Save this recipe as transaction_data_preparation_recipe and create a job as we did before.
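As with the age segments, here is a plain-Python restatement of the purchase_size bins, purely for illustration.

def purchase_size(total_value):
    """Restates the purchase_size CASE bins configured in DataBrew."""
    if total_value is None:
        return None
    if total_value <= 50:
        return 'Small'
    elif total_value <= 200:
        return 'Medium'
    else:
        return 'Large'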

Step 6: Join Datasets

Now that we've cleaned both our customer and transaction datasets, let's join them:

  1. Create a new project named joined-customer-transactions
  2. Select your cleaned “customer” dataset as the starting point
  3. Once the project loads, click Join in the top menu
  4. Select Join > Join datasets
  5. Configure the join:
       ◦ Select dataset: Choose your cleaned "transactions" dataset
       ◦ Join type: Inner join
       ◦ Join keys: Map customer_id from both datasets
       ◦ Columns to include: Select all columns from both datasets
  6. Click Apply

You'll now have a unified view of customer and transaction data. If needed, you can perform additional transformations on this joined dataset.
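Conceptually, this inner join is the same operation as a pandas merge on customer_id. The sketch below only illustrates the result using placeholder local file names; it is not how DataBrew executes the join.

import pandas as pd

# Placeholder file names for the cleaned outputs produced by the earlier jobs
customers = pd.read_csv('customer_data_clean.csv')
transactions = pd.read_csv('transactions_clean.csv')

# Inner join: keep only customers that have at least one matching transaction
joined = customers.merge(transactions, on='customer_id', how='inner')
print(joined.head())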

Step 7: Import product catalog and create project

Next, let's import the product catalog and set up a project for it:

  1. Follow Step 1 at the beginning of this guide to import the product_catalog.json dataset, selecting "JSON document" as the file type.
  2. From the DataBrew console, select Projects in the left navigation pane
  3. Click Create project
  4. Configure your project:
       ◦ Project name: product-catalog-preparation
       ◦ Select a dataset: Choose the product_catalog dataset we imported
       ◦ Role name: Either select an existing IAM role with appropriate permissions or create a new one, entering a suitable suffix such as "Project"
  5. Click Create project

Step 8: Create Product Enrichment

Let's add our product information to further enrich the data:

  1. With your joined-customer-transactions project open, click Join in the top menu bar, then Select dataset > product_catalog
  2. Configure this join:
       ◦ Select dataset: Choose the product_catalog dataset
       ◦ Join type: Left join (to keep all transactions even if product info is missing)
       ◦ Join keys: Map product_id from both datasets
       ◦ Columns to include: Select all product information
  3. Click Finish

Step 9: Finalize and Export

Let's perform a few final transformations:

  1. Sort data: Click Sort and choose to sort by customer_id
  2. Flag outliers: Click Outliers > Flag outliers > Choose the price column > Leave the remaining defaults > Preview changes (at the bottom). If there are any outliers, you can add a new column to flag them. In this sample dataset there are none, so click Cancel and return.
  3. Final data quality check: Click Profile (in the middle of the page) > Run data profile > Job name: customer-transactions-product-profile-job > Select a suitable S3 output bucket > Permissions: select the Glue service role created for this project > Leave the rest as defaults > Create and run job
  4. Use the resulting data profile to review the dataset overview and column statistics.

Note: You may still find a few null values in the datasets. As long as the amount is acceptable for the nature of your data and doesn't skew the results enough to cause unacceptable inaccuracies, that is fine.
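If you would rather script the final profile run, a profile job can be created and started with calls like the ones below. The dataset name, bucket, and role ARN are placeholders; point DatasetName at whichever dataset you want to profile.

import boto3

databrew = boto3.client('databrew')

# Profile the joined dataset and write the profile report to S3
databrew.create_profile_job(
    Name='customer-transactions-product-profile-job',
    DatasetName='joined_customer_transactions',
    OutputLocation={'Bucket': 'your-output-bucket', 'Key': 'profiles/'},
    RoleArn='arn:aws:iam::123456789012:role/AwsGlueDataBrewProjectRole'
)

databrew.start_job_run(Name='customer-transactions-product-profile-job')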

Optimizing DataBrew for Large Datasets

When working with larger datasets in production, consider these optimization tips:

  1. Partitioning: When creating jobs, use partitioning options to process your data in parallel chunks.
  2. Sample intelligently: During development, use DataBrew's sampling options to work with representative subsets:
       ◦ Random sampling for uniform data
       ◦ Stratified sampling for maintaining the distribution of key categories
  3. Monitor resource usage: In the job details page, observe runtime and resource consumption, and adjust sampling or partitioning if needed (see the sketch after this list for checking this programmatically).
  4. Recipe efficiency: Keep transformations in logical order, placing filters early to reduce downstream processing.
  5. Schedule during off-peak hours: For recurring jobs, schedule during your organization's off-peak times.
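For tip 3, job runtimes can also be checked programmatically. A minimal sketch, assuming a job named customer-data-cleaning-job already exists:

import boto3

databrew = boto3.client('databrew')

# List recent runs of a job and print their state and execution time (seconds)
runs = databrew.list_job_runs(Name='customer-data-cleaning-job')
for run in runs['JobRuns']:
    print(run['RunId'], run['State'], run.get('ExecutionTime'))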

Integrating DataBrew into Your Data Pipelines

DataBrew jobs can be integrated into broader AWS data pipelines:

AWS Step Functions Integration

Create a state machine that orchestrates a workflow including DataBrew jobs, for example:

  1. Start with raw data landing in S3
  2. Trigger DataBrew job via Step Functions
  3. Upon completion, initiate analytics in Amazon QuickSight or SageMaker Unified Studio
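Step Functions offers a service integration for DataBrew, so a state can start a job run and wait for it to complete. The fragment below is a minimal sketch of such a state machine; the job name, state machine name, and role ARN are assumptions.

import json
import boto3

# A one-state workflow: start the DataBrew job and wait for it to finish (.sync)
definition = {
    "StartAt": "RunDataBrewJob",
    "States": {
        "RunDataBrewJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::databrew:startJobRun.sync",
            "Parameters": {"Name": "customer-data-cleaning-job"},
            "End": True
        }
    }
}

sfn = boto3.client('stepfunctions')
sfn.create_state_machine(
    name='customer-data-prep-workflow',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::123456789012:role/StepFunctionsDataBrewRole'  # placeholder
)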

AWS EventBridge Integration

Configure DataBrew to work with EventBridge for event-driven workflows:

  1. Set up an EventBridge rule to detect new data arrivals in S3
  2. Trigger your DataBrew job automatically
  3. Configure notifications upon job completion or failure
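A common way to wire this up is an EventBridge rule on S3 "Object Created" events that invokes a small Lambda function, which in turn starts the DataBrew job. A minimal sketch of such a handler, with the job name as a placeholder:

import boto3

databrew = boto3.client('databrew')

def lambda_handler(event, context):
    """Invoked by an EventBridge rule matching S3 'Object Created' events."""
    key = event.get('detail', {}).get('object', {}).get('key', '')
    print(f"New object detected: {key}")
    # Start the cleaning job for the newly arrived data (placeholder job name)
    run = databrew.start_job_run(Name='customer-data-cleaning-job')
    return {'RunId': run['RunId']}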

Using with AWS Glue Workflows

Incorporate DataBrew jobs into AWS Glue workflows alongside ETL jobs:

# Example: trigger a DataBrew job from an AWS Glue Python shell job

import boto3

databrew = boto3.client('databrew')

response = databrew.start_job_run(
    Name='customer_transaction_analysis_job'
)

print(f"Started job run with ID: {response['RunId']}")

Data Quality Validation

DataBrew includes built-in data quality capabilities:

  1. From your project, click DQ rules in the left menu
  2. Select Create data quality ruleset
  3. Add rules like:
       ◦ Column completeness (e.g., email must be at least 95% complete)
       ◦ Value distribution (e.g., customer_segment must contain expected values)
       ◦ Statistical validations (e.g., total_value must have a mean within the expected range)
  4. Save your ruleset and associate it with your jobs to automatically validate outputs
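Rulesets can also be created through the API. The sketch below expresses the email-completeness rule from the list above; the dataset ARN is a placeholder, and the check expression follows the grammar described in the DataBrew data quality documentation.

import boto3

databrew = boto3.client('databrew')

# Require the email column to be at least 95% complete (no more than 5% missing)
databrew.create_ruleset(
    Name='customer-data-quality-ruleset',
    TargetArn='arn:aws:databrew:us-east-1:123456789012:dataset/customer_data',  # placeholder
    Rules=[{
        'Name': 'email_completeness',
        'CheckExpression': 'AGG(MISSING_VALUES_PERCENTAGE) <= :threshold',
        'SubstitutionMap': {':threshold': '5'},
        'ColumnSelectors': [{'Name': 'email'}]
    }]
)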

Cost Considerations

DataBrew pricing is based on the time your interactive sessions and jobs run. To optimize costs:

  • Develop and test recipes using sample data to minimize interactive session time
  • Consolidate transformations where possible to reduce job runtime
  • Schedule periodic jobs rather than running on-demand when feasible
  • Monitor job execution times and optimize recipes that take longer than expected

Conclusion

In this guide, we've walked through a sample workflow demonstrating the main capabilities of AWS Glue DataBrew for preparing messy customer transaction data. Without writing a single line of code, we've:

  • Cleaned and standardized customer information
  • Transformed transaction data
  • Joined multiple datasets into a unified view
  • Created reusable recipes for ongoing data preparation
  • Learned optimization techniques for production use

AWS Glue DataBrew democratizes data preparation by enabling non-technical users to perform sophisticated data transformations through an intuitive visual interface. By reducing the time spent on data preparation, DataBrew allows analysts to focus on extracting insights rather than cleaning data.

The recipes and techniques demonstrated in this guide can be adapted to a wide range of data preparation scenarios, from marketing analytics to operational reporting. By making data preparation accessible to more team members, organizations can accelerate their data-driven decision making and derive value from their data assets faster than ever before.

Additional Resources

Have questions about implementing AWS Glue DataBrew in your organization? Leave a comment below or reach out to your local AWS Analytics team.
