Getting Started with AWS Glue DataBrew: Simplified Data Preparation for Non-Technical Users
A hands-on AWS Glue DataBrew tutorial for non-technical users
Introduction
Data is the foundation of modern business intelligence, but it's rarely analysis-ready in its raw form. According to industry research, data scientists and analysts spend up to 80% of their time cleaning, preparing, and organizing data before it can yield meaningful insights. This significant time investment creates a bottleneck in the analytics pipeline and often requires specialized coding skills in SQL, Python, or R.
AWS Glue DataBrew addresses this challenge by providing a visual, no-code interface for data preparation tasks. This service empowers business analysts, data scientists, and other non-technical users to clean and transform data efficiently without writing a single line of code.
In this guide, we'll walk through a practical scenario of preparing messy customer transaction data for analysis using AWS Glue DataBrew. By the end, you'll understand how to:
- Set up projects and datasets in DataBrew
- Apply common data cleaning transformations
- Handle missing values and standardize formats
- Join multiple datasets
- Create and save reusable recipes
- Optimize DataBrew jobs for large datasets
- Integrate DataBrew into your data workflows
Understanding AWS Glue DataBrew
Before diving into our practical example, let's understand what AWS Glue DataBrew offers.
Key components of DataBrew:
- Projects: Interactive sessions where you explore data and create transformation recipes
- Datasets: References to your source data (S3, databases, or other AWS data stores)
- Recipes: Reusable collections of data transformation steps
- Jobs: Scheduled or on-demand executions of recipes against datasets
- Rulesets: Collections of data quality rules to validate your data
Benefits of using DataBrew:
- No-code interface: 250+ built-in transformations via point-and-click interface
- Data profiling: Automatic statistics and visualization of your data characteristics
- Scalability: Process datasets of any size without infrastructure management
- Integration: Seamless workflow with other AWS analytics services
- Collaboration: Share and reuse recipes across your organization
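If you prefer to work programmatically, the same building blocks are exposed through the AWS SDK. Here is a minimal sketch using boto3 (assuming credentials and a default region are already configured) that simply lists the DataBrew resources in an account:

```python
import boto3

# Assumes AWS credentials and a default region are already configured
databrew = boto3.client("databrew")

# Each console concept maps to an API resource
for dataset in databrew.list_datasets()["Datasets"]:
    print("dataset:", dataset["Name"])

for recipe in databrew.list_recipes()["Recipes"]:
    print("recipe:", recipe["Name"])

for job in databrew.list_jobs()["Jobs"]:
    print("job:", job["Name"], job["Type"])  # Type is PROFILE or RECIPE
```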
Our Example Scenario: Preparing Customer Transaction Data
For this tutorial, we'll work with a fictional e-commerce company that has messy customer transaction data spread across multiple files:
- customer_data.csv: Basic customer information with inconsistent formatting
- transactions.csv: Purchase records with missing values
- product_catalog.json: Product information in JSON format
Our goal is to prepare this data for a customer segmentation analysis by:
- Cleaning and standardizing customer information
- Filling in missing transaction values
- Joining the datasets into a unified view
- Exporting the clean data for analysis
Getting Started with DataBrew
Step 1: Create a Dataset
Sample datasets are available to get you started.
First, let's set up our customer data as a DataBrew dataset:
- Navigate to the AWS Glue DataBrew console in the AWS Management Console
- In the left navigation pane, select Datasets
- Click Create dataset
- Configure your dataset:
  - Dataset name: customer_data
  - Connection type: Select "Upload from S3"
  - S3 location: Browse to your S3 bucket where the customer_data.csv file is stored
  - File type: CSV
  - First row contains header: Yes
- Click Create dataset
Repeat this process for the **transactions.csv** and **product_catalog.json** files, making sure to select the appropriate file format for each.
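If you later want to automate this step, datasets can also be registered through the DataBrew API. A rough boto3 sketch, with the bucket name and keys as placeholders:

```python
import boto3

databrew = boto3.client("databrew")

# Register the CSV file stored in S3 as a DataBrew dataset (bucket/keys are placeholders)
databrew.create_dataset(
    Name="customer_data",
    Format="CSV",
    FormatOptions={"Csv": {"HeaderRow": True}},
    Input={"S3InputDefinition": {"Bucket": "my-raw-data-bucket", "Key": "customer_data.csv"}},
)

# The product catalog uses the JSON format; MultiLine roughly corresponds
# to the console's "JSON document" option
databrew.create_dataset(
    Name="product_catalog",
    Format="JSON",
    FormatOptions={"Json": {"MultiLine": True}},
    Input={"S3InputDefinition": {"Bucket": "my-raw-data-bucket", "Key": "product_catalog.json"}},
)
```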
Step 2: Create a Project and Explore Your Data
Now that we have our datasets created, let's start exploring the customer data:
- From the DataBrew console, select Projects in the left navigation pane
- Click Create project
- Configure your project:
  - Project name: customer_data_preparation
  - Select a dataset: Choose the customer_data dataset we created
  - Role name: Either select an existing IAM role with appropriate permissions or create a new one, entering a suitable role suffix such as "Project"
- Click Create project

Once the project loads, you'll see a spreadsheet-like interface displaying your data. Notice the data profile panel on the right side showing statistics about your data.
Step 3: Clean and Transform Customer Data
Looking at our customer data, we notice several issues that need addressing:
3.1. Standardize Names (Upper/Lower Case)
- Click on the column header for customer_name
- Select Format > Change case > Title case
- You'll see a preview of the transformation before it's applied
- Apply Transform to all rows.
- Click Apply
3.2. Clean Phone Numbers
Our phone numbers are inconsistent with various formats like (555) 123-4567, 555.123.4567, and 5551234567.
- Click on the phone_number column header
- Select Clean > Remove Special Characters > All Special Characters.
- Tick the checkbox to remove all whitespaces as well.
- Apply Transform to all rows.
- Click Apply
3.3. Handle Missing Email Addresses
- Click on the email column header
- In the data profile panel, note the percentage of missing values
- Select Format > Remove or fill missing values > Fill with custom value
- Enter "customer@example.com " as the placeholder
- Apply Transform to all rows.
- Click Apply
3.4. Extract Domain from Email
To enable analysis of email providers:
- Click on the email column header
- Select Extract > Extract details from email
- Select Email value breakdown to extract > Domain
- Name your new destination column email_domain
- Apply Transform to all rows.
- Click Apply
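DataBrew applies these steps without any code, but for readers who think in code, here is roughly what transformations 3.1 through 3.4 do, expressed as an illustrative pandas sketch (not how DataBrew executes them):

```python
import pandas as pd

df = pd.read_csv("customer_data.csv")

# 3.1 Title-case the customer names
df["customer_name"] = df["customer_name"].str.title()

# 3.2 Strip special characters and whitespace, keeping only digits
df["phone_number"] = df["phone_number"].astype(str).str.replace(r"\D", "", regex=True)

# 3.3 Fill missing email addresses with a placeholder
df["email"] = df["email"].fillna("customer@example.com")

# 3.4 Extract the domain portion of each address into a new column
df["email_domain"] = df["email"].str.split("@").str[-1]
```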
3.5. Categorize Customer Types
Let's create a simplified customer type column based on their age:
- Fill in the null values of the age column first: click Remove or fill missing values > Fill with numeric aggregate > Apply Transform to All rows > Apply
- Then click Create > Based on conditions > Case
  - Case 1: Value of age > Less than 25 (enter custom value) > Result value (enter custom value): Young Adult
  - Case 2: Value of age > Is between (greater than or equal to 25, less than or equal to 64) > Result value (enter custom value): Adult
  - Case 3: Value of age > Greater than 64 (enter custom value) > Result value (enter custom value): Senior
- Set the destination column name to customer_age_segment
- Click Preview to check results, then Apply
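Continuing the illustrative pandas sketch from above, the age-based segmentation is equivalent to filling missing ages with an aggregate (the mean here) and bucketing the result:

```python
# Fill missing ages with the column mean, matching the numeric-aggregate fill
df["age"] = df["age"].fillna(df["age"].mean())

# Bucket ages: <25 Young Adult, 25-64 Adult, 65+ Senior
df["customer_age_segment"] = pd.cut(
    df["age"],
    bins=[-float("inf"), 25, 65, float("inf")],
    labels=["Young Adult", "Adult", "Senior"],
    right=False,  # left-closed intervals, so 25 falls into "Adult"
)
```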
Step 4: Save Recipe and Create Job for Customer Data
Now that we've created a series of transformations for our customer data:
- Click Recipes in the top navigation bar
- Enter a version description, such as "Version 1 - basic cleansing of customer data"
- Click Publish
- Click Create job
- Configure your job:
  - Job name: customer-data-cleaning-job
  - Job output format: CSV
  - Job output location: Browse to your desired S3 output location
  - Role name: Use the same role as before
  - Job run behavior: On-demand
- Click Create and run job

Your job will now process the full dataset using your recipe. You can monitor its progress from the Jobs section of the DataBrew console.
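The same publish-and-run flow can also be scripted. A rough boto3 sketch, assuming the recipe is named customer-data-preparation-recipe and using placeholder role and bucket values:

```python
import boto3

databrew = boto3.client("databrew")

# Publish the recipe built in the project session (recipe name is an assumption)
databrew.publish_recipe(
    Name="customer-data-preparation-recipe",
    Description="Version 1 - basic cleansing of customer data",
)

# Create a job that applies the published recipe version to the full dataset
databrew.create_recipe_job(
    Name="customer-data-cleaning-job",
    DatasetName="customer_data",
    RecipeReference={"Name": "customer-data-preparation-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::111122223333:role/AWSGlueDataBrewServiceRole-Project",  # placeholder
    Outputs=[{
        "Location": {"Bucket": "my-databrew-output-bucket", "Key": "cleaned/customer_data/"},  # placeholder
        "Format": "CSV",
        "Overwrite": True,
    }],
)

# Run the job on demand and capture the run ID for monitoring
run = databrew.start_job_run(Name="customer-data-cleaning-job")
print("Started run:", run["RunId"])
```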
Step 5: Transform Transaction Data
Let's switch to working with the transaction data:
- Return to the Projects page and create a new project named transaction-data-preparation using the transactions.csv dataset
- Repeat the same dataset and project setup as in Steps 1 and 2
- Once the project loads, explore the data in the grid to familiarize yourself with it

For the transaction data, we'll perform these transformations:
5.1. Fill Missing Values in quantity and price, and Calculate Total Purchase Value
- For both quantity and price columns, click Remove or fill missing values > Fill with numeric aggregate > Apply Transform to All rows > Apply.
- Click Create > Based on a function
- Select Multiply > Values using source columns > Source columns: quantity, price
- Name the destination column total_value
- Click Apply
5.2. Remove Duplicate Transactions
- From the top menu, select Group & aggregate > Remove duplicates
- Select the columns that should be unique (transaction_id)
- Click Apply
5.3. Bin Transaction Values
To create a categorical field for analysis:
- Click Create > Based on conditions > Case
  - Case 1: Value of total_value > Is between 0 and 50 > Result value (enter custom value): Small
  - Case 2: Value of total_value > Is between 51 and 200 > Result value (enter custom value): Medium
  - Case 3: Value of total_value > Greater than 200 > Result value (enter custom value): Large
  - Default: Enter custom value ("null")
- Name the destination column purchase_size
- Click Apply

Save this recipe as transaction_data_preparation_recipe and create a job as we did before.
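As with the customer data, here is an illustrative pandas equivalent of steps 5.1 through 5.3, just to make the logic concrete (DataBrew performs these visually):

```python
import pandas as pd

tx = pd.read_csv("transactions.csv")

# 5.1 Fill missing quantity/price with a numeric aggregate (mean here), then compute the total
for col in ["quantity", "price"]:
    tx[col] = tx[col].fillna(tx[col].mean())
tx["total_value"] = tx["quantity"] * tx["price"]

# 5.2 Remove duplicate transactions, keyed on transaction_id
tx = tx.drop_duplicates(subset="transaction_id")

# 5.3 Bin total_value into purchase-size categories (0-50, 51-200, >200)
tx["purchase_size"] = pd.cut(
    tx["total_value"],
    bins=[0, 50, 200, float("inf")],
    labels=["Small", "Medium", "Large"],
)
```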
Step 6: Join Datasets
Now that we've cleaned both our customer and transaction datasets, let's join them:
- Create a new project named joined-customer-transactions
- Select your cleaned “customer” dataset as the starting point
- Once the project loads, click Join in the top menu
- Select Join > Join datasets
- Configure the join:
  - Select dataset: Choose your cleaned "transactions" dataset
  - Join type: Inner join
  - Join keys: Map customer_id from both datasets
  - Columns to include: Select all columns from both datasets
- Click Apply
You'll now have a unified view of customer and transaction data. If needed, you can perform additional transformations on this joined dataset.
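For reference, the join logic is the same as a relational or pandas merge. An illustrative sketch, with the cleaned file names as assumptions:

```python
import pandas as pd

# Hypothetical outputs of the two cleaning jobs
customers = pd.read_csv("cleaned_customer_data.csv")
transactions = pd.read_csv("cleaned_transactions.csv")

# Step 6: an inner join keeps only customers that have matching transactions
joined = customers.merge(transactions, on="customer_id", how="inner")

# Step 8 (later) uses a left join so transactions are kept even when product info is missing:
# products = pd.read_json("product_catalog.json")
# joined = joined.merge(products, on="product_id", how="left")
```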
Step 7: Import product catalog and create project
Next, let's import the product catalog and set up a project for it:
- Following Step 1 at the beginning, import the product_catalog.json dataset, selecting "JSON document" as the file type
- From the DataBrew console, select Projects in the left navigation pane
- Click Create project
- Configure your project:
  - Project name: product-catalog-preparation
  - Select a dataset: Choose the product_catalog dataset that was imported
  - Role name: Either select an existing IAM role with appropriate permissions or create a new one, entering a suitable role suffix such as "Project"
- Click Create project
Step 8: Create Product Enrichment
Let's add our product information to further enrich the data:
- With your joined-customer-transactions project open, click Join in the top menu bar and select the product_catalog dataset
- Configure this join:
  - Select dataset: Choose the product_catalog dataset
  - Join type: Left join (to keep all transactions even if product info is missing)
  - Join keys: Map product_id from both datasets
  - Columns to include: Select all product information
- Click Finish
Step 9: Finalize and Export
Let's perform a few final transformations:
- Sort data: Click Sort and choose to sort by customer_id
- Flag outliers: Click Outliers > Flag outliers, choose the price column for outlier transactions, leave the remaining settings at their defaults, and click Preview changes (at the bottom). If any outliers are found, you can add a new column to flag them; this sample dataset has none, so click Cancel and return.
- Final data quality check: Click Profile (in the middle of the page) > Run data profile, set the job name to customer-transactions-product-profile-job, select a suitable S3 output bucket, under Permissions select the Glue service role created for this project, leave the rest as defaults, and click Create and run job
- Use the resulting data profile to review the dataset overview and column statistics
Note: you may still find a few null values in the datasets. As long as the amount is acceptable (this depends on the nature of the data) and doesn't skew the data enough to cause unacceptable inaccuracies, that's fine.
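Profile jobs can also be created through the API. A rough boto3 sketch, with the dataset name, bucket, and role as placeholders:

```python
import boto3

databrew = boto3.client("databrew")

# Profile the joined dataset (names and locations are placeholders)
databrew.create_profile_job(
    Name="customer-transactions-product-profile-job",
    DatasetName="joined_customer_transactions",
    OutputLocation={"Bucket": "my-databrew-output-bucket", "Key": "profiles/"},
    RoleArn="arn:aws:iam::111122223333:role/AWSGlueDataBrewServiceRole-Project",
)

run = databrew.start_job_run(Name="customer-transactions-product-profile-job")
print("Profile run:", run["RunId"])
```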
Optimizing DataBrew for Large Datasets
When working with larger datasets in production, consider these optimization tips:
- Partitioning: When creating jobs, use partitioning options to process your data in parallel chunks.
- Sample intelligently: During development, use DataBrew's sampling options to work with representative subsets (see the sketch after this list):
  - Random sampling for uniform data
  - Stratified sampling for maintaining the distribution of key categories
- Monitor resource usage: In the job details page, observe runtime and resource consumption. Adjust sampling or partitioning if needed.
- Recipe efficiency: Keep transformations in logical order, placing filters early to reduce downstream processing.
- Schedule during off-peak hours: For recurring jobs, schedule during your organization's off-peak times.
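For example, the interactive sample can be adjusted on the project, and capacity, retries, and timeout can be tuned on the job. A rough boto3 sketch, with the role ARN and bucket as placeholders:

```python
import boto3

databrew = boto3.client("databrew")

# Develop against a random 5,000-row sample to keep interactive sessions fast
databrew.update_project(
    Name="customer_data_preparation",
    RoleArn="arn:aws:iam::111122223333:role/AWSGlueDataBrewServiceRole-Project",  # placeholder
    Sample={"Type": "RANDOM", "Size": 5000},
)

# Tune the production job's capacity (number of nodes), retries, and timeout (minutes)
databrew.update_recipe_job(
    Name="customer-data-cleaning-job",
    RoleArn="arn:aws:iam::111122223333:role/AWSGlueDataBrewServiceRole-Project",
    Outputs=[{"Location": {"Bucket": "my-databrew-output-bucket", "Key": "cleaned/"}, "Format": "CSV"}],
    MaxCapacity=10,
    MaxRetries=1,
    Timeout=60,
)
```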
Integrating DataBrew into Your Data Pipelines
DataBrew jobs can be integrated into broader AWS data pipelines:
AWS Step Functions Integration
Create a state machine that orchestrates a workflow including DataBrew jobs, for example:
- Start with raw data landing in S3
- Trigger the DataBrew job via Step Functions (a sketch of the state machine definition follows this list)
- Upon completion, initiate analytics in Amazon QuickSight or SageMaker Unified Studio
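Here is a sketch of what such a state machine definition could look like, built as a Python dictionary in Amazon States Language; the job name and the downstream step are placeholders, and the Step Functions DataBrew integration is used to wait for job completion with the .sync suffix:

```python
import json

definition = {
    "StartAt": "RunDataBrewJob",
    "States": {
        "RunDataBrewJob": {
            "Type": "Task",
            # Optimized Step Functions integration: start the job and wait for completion
            "Resource": "arn:aws:states:::databrew:startJobRun.sync",
            "Parameters": {"Name": "customer-data-cleaning-job"},
            "Next": "TriggerAnalytics",
        },
        "TriggerAnalytics": {
            "Type": "Pass",  # placeholder for a QuickSight/SageMaker step
            "End": True,
        },
    },
}

# json.dumps(definition) can be passed to stepfunctions.create_state_machine(...)
# together with a name and an execution role allowed to call databrew:StartJobRun.
print(json.dumps(definition, indent=2))
```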
AWS EventBridge Integration
Configure DataBrew to work with EventBridge for event-driven workflows:
- Set up an EventBridge rule to detect new data arrivals in S3
- Trigger your DataBrew job automatically (see the sketch after this list)
- Configure notifications upon job completion or failure
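One common way to wire this up is an EventBridge rule on S3 "Object Created" events that invokes a small Lambda function, which in turn starts the DataBrew job. A sketch of such a handler, assuming S3 notifications are delivered via EventBridge and using the job created earlier:

```python
import boto3

databrew = boto3.client("databrew")

def lambda_handler(event, context):
    """Invoked by an EventBridge rule matching new objects in the raw-data bucket."""
    # The S3 object key is available in the event detail if you want to branch per prefix
    key = event.get("detail", {}).get("object", {}).get("key")
    print("Triggered by object:", key)

    run = databrew.start_job_run(Name="customer-data-cleaning-job")
    return {"jobRunId": run["RunId"]}
```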
Using with AWS Glue Workflows
Incorporate DataBrew jobs into AWS Glue workflows alongside ETL jobs:
```python
# Example of triggering a DataBrew job from an AWS Glue Python shell job
import boto3

databrew = boto3.client('databrew')

response = databrew.start_job_run(
    Name='customer_transaction_analysis_job'
)
print(f"Started job run with ID: {response['RunId']}")
```
Data Quality Validation
DataBrew includes built-in data quality capabilities:
- From your project, click DQ rules in the left menu
- Select Create data quality ruleset
- Add rules like:
  - Column completeness (e.g., email must be at least 95% complete)
  - Value distribution (e.g., customer_segment must contain expected values)
  - Statistical validations (e.g., total_value must have a mean within expected ranges)
- Save your ruleset and associate it with your jobs to automatically validate outputs
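Rulesets can also be created programmatically. A rough boto3 sketch of the completeness rule above; the dataset ARN, account ID, and threshold are placeholders, and the exact check-expression grammar is documented in the DataBrew developer guide:

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_ruleset(
    Name="customer-data-quality-rules",
    # Rulesets are attached to a dataset by ARN (region/account are placeholders)
    TargetArn="arn:aws:databrew:us-east-1:111122223333:dataset/customer_data",
    Rules=[
        {
            "Name": "email-completeness",
            # email must be at least 95% complete, i.e. at most 5% missing
            "CheckExpression": "AGG(MISSING_VALUES_PERCENTAGE) <= :val1",
            "SubstitutionMap": {":val1": "5"},
            "ColumnSelectors": [{"Name": "email"}],
        },
    ],
)
```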
Cost Considerations
DataBrew pricing is based on the time your interactive sessions and jobs run. To optimize costs:
- Develop and test recipes using sample data to minimize interactive session time
- Consolidate transformations where possible to reduce job runtime
- Schedule periodic jobs rather than running on-demand when feasible
- Monitor job execution times and optimize recipes that take longer than expected
Conclusion
In this guide, we've walked through a sample workflow demonstrating the main capabilities of AWS Glue DataBrew for preparing messy customer transaction data. Without writing a single line of code, we've:
- Cleaned and standardized customer information
- Transformed transaction data
- Joined multiple datasets into a unified view
- Created reusable recipes for ongoing data preparation
- Learned optimization techniques for production use
AWS Glue DataBrew democratizes data preparation by enabling non-technical users to perform sophisticated data transformations through an intuitive visual interface. By reducing the time spent on data preparation, DataBrew allows analysts to focus on extracting insights rather than cleaning data.
The recipes and techniques demonstrated in this guide can be adapted to a wide range of data preparation scenarios, from marketing analytics to operational reporting. By making data preparation accessible to more team members, organizations can accelerate their data-driven decision making and derive value from their data assets faster than ever before.
Additional Resources
- AWS Glue DataBrew Documentation
- DataBrew Pricing
- Sample DataBrew Projects on GitHub
- AWS Glue DataBrew Workshop
Have questions about implementing AWS Glue DataBrew in your organization? Leave a comment below or reach out to your local AWS Analytics team.