Questions tagged with AWS Glue DataBrew
Test Data Management tool for file Anonymization in AWS
Hi all, I'm looking for a Test Data Management (TDM) tool in AWS that meets the requirements below:

1. Connect to a production S3 bucket, extract files for anonymization, and load them into a test S3 bucket.
2. Run as a daily scheduled job that anonymizes files from the prod S3 bucket and stores them in the test S3 bucket.
3. Identify PII columns in the S3 files and anonymize them; these files are later loaded into a Redshift database.
4. Maintain data integrity between the files and the database: for example, incremental daily data should match the mocked PII columns already in the Redshift database.

Kindly let me know how I can achieve the above requirements using AWS services.

Thanks & Regards, Aflah
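The fourth requirement (incremental loads matching already-mocked values in Redshift) usually pushes you toward *deterministic* tokenization rather than random masking: if the same source value always maps to the same token, day-30 files join cleanly against day-1 rows. A minimal sketch of that idea using a keyed HMAC — the key and column names here are placeholders, and in practice the key would live in AWS Secrets Manager:

```python
import hmac
import hashlib

SECRET_KEY = b"rotate-me-via-secrets-manager"  # placeholder secret

def anonymize(value: str) -> str:
    """Deterministic tokenization: the same input always yields the same
    token, so daily incremental files stay joinable against rows already
    anonymized and loaded into Redshift."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "anon_" + digest[:16]

# The same email on day 1 and day 30 produces the same token:
assert anonymize("alice@example.com") == anonymize("alice@example.com")
assert anonymize("alice@example.com") != anonymize("bob@example.com")
```

A scheduled Glue job (or Lambda on an EventBridge schedule) could apply a function like this to the PII columns while copying objects from the prod bucket to the test bucket; Macie or a Glue sensitive-data detection step could handle identifying which columns are PII.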
Importing Datasets into AWS Glue DataBrew from PostgreSQL tables with Case Sensitive names
Hello, I'm trying to import datasets into DataBrew from a PostgreSQL database. Some tables have case-sensitive names, others don't. When a table name contains only lowercase letters, everything works and I can use the dataset. However, when the table name mixes lower and upper case, data access fails with the following error:

> An error occurred while calling o106.count. ERROR: relation "genphensql.sequencingdata" does not exist

Indeed, the table is registered in Glue as genphensql.sequencingdata, but the real table name is "GenPhenSQL"."SequencingData". All my Glue ETL scripts that feed data into this PostgreSQL database work fine; it is only DataBrew, which uses the table name stored by Glue to access the table, that fails. Is there any plan from the Glue/DataBrew teams to resolve this? It seems to be a recurring problem, as I have seen other questions related to upper/lowercase table names. Thanks a lot for your help!
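For context on why this fails: PostgreSQL folds *unquoted* identifiers to lowercase, so once Glue stores the name as `genphensql.sequencingdata` and emits it unquoted (or lowercased), the server cannot find `"GenPhenSQL"."SequencingData"`. A small illustration of the quoting rule (the helper below is just a sketch, not part of any AWS API):

```python
def quote_ident(name: str) -> str:
    """Double-quote a PostgreSQL identifier so its case is preserved.
    Unquoted identifiers are folded to lowercase by the server; embedded
    double quotes must be doubled per the SQL standard."""
    return '"' + name.replace('"', '""') + '"'

table = f"{quote_ident('GenPhenSQL')}.{quote_ident('SequencingData')}"
query = f"SELECT count(*) FROM {table}"
# query is: SELECT count(*) FROM "GenPhenSQL"."SequencingData"
```

Until the catalog preserves case for your source, one common workaround is to create an all-lowercase PostgreSQL view (e.g. `CREATE VIEW genphensql.sequencingdata AS SELECT * FROM "GenPhenSQL"."SequencingData"`) and point DataBrew at the view, so the lowercase name Glue stores actually resolves.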
AWS Glue decimal cast of Redshift numeric column causing Athena query issue
I have a Glue catalog and use S3 as the database. The columns are NUMERIC in Redshift, which I'm casting to decimal in Glue, but I get an error when querying the data in Athena. A column value in Redshift looks like (-12.887686) with data type NUMERIC. We cannot change the data type in the source table.
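One likely cause worth checking: in Spark (which Glue runs on), casting to a bare `decimal` type defaults to `decimal(10,0)`, i.e. scale 0, which silently drops the fractional part of a value like -12.887686. Casting with an explicit precision and scale, e.g. `col("amount").cast("decimal(20,6)")`, and declaring the Glue catalog column the same way (`decimal(20,6)`) so Athena's view of the type matches the data, usually resolves this. A pure-Python illustration of what the scale means (column name and the chosen precision/scale are assumptions):

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal_scale_6(raw: str) -> Decimal:
    """Mirror of an explicit decimal(20,6) cast: keep six fractional
    digits instead of letting a bare 'decimal' cast default to scale 0."""
    return Decimal(raw).quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)

value = to_decimal_scale_6("-12.887686")
# value keeps its fraction; a scale-0 cast would have truncated it to -13 or -12
```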
Is there a way to logically group steps within a recipe?
One thing I want to be able to do is "group" steps of a recipe together, in the interest of maintainability. Often I need to go back and tweak a recipe, and it is very hard to tell which steps logically belong together (i.e. which ones are trying to solve the same goal). One idea off the top of my head would be to tag each step in a recipe. Is there any way to do this?

For example, say I have 5 logical steps in my preprocessing, where step 1 is "Filter out all purchases in region X with amount over $100". That one logical step can take, say, 4 recipe sub-steps. Once I have built the whole preprocessing recipe for all 5 logical steps, the recipe can have 20+ steps, and by looking at the recipe alone it is very hard to know which steps are working together toward the same goal.

My workaround right now is to add this as documentation when I publish a recipe. For example, I'd write:

Step 1: Filter out purchases in region X with amt > $100 -- recipe steps 1-4
Step 2: Blah blah blah -- recipe steps 5-11
Step 3: Blah blah blah -- recipe steps 12-20

Then when I go back and have to edit something, like changing the region of interest from region_X to region_Y in the filter, I know where to look. Is there a better way to group steps together?
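As far as I know there is no native per-step tag in a DataBrew recipe, so the grouping has to live outside it. One way to make the publish-notes workaround a bit more robust is a small sidecar mapping of group names to step index ranges, checked against the downloaded recipe JSON by a script — a sketch, where the group names, ranges, and step contents are illustrative:

```python
# Hypothetical sidecar: logical group -> (first, last) 1-based step indices.
groups = {
    "Filter region X purchases over $100": (1, 4),
    "Normalize amounts": (5, 11),
}

def annotate(recipe_steps, groups):
    """Yield (group_name, step) pairs so a downloaded recipe JSON
    can be reviewed group by group instead of as one flat list."""
    for name, (start, end) in groups.items():
        for i in range(start - 1, end):
            if i < len(recipe_steps):
                yield name, recipe_steps[i]

# Stand-in for json.load(open("recipe.json")) on a downloaded recipe:
recipe_steps = [{"Action": {"Operation": f"STEP_{i}"}} for i in range(1, 12)]
labeled = list(annotate(recipe_steps, groups))
```

The obvious weakness is the same as the documentation approach: the ranges go stale when steps are inserted, so the script is mainly useful as a review aid rather than a source of truth.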
Is there a way to add a step between other steps in a recipe with the UI?
It seems the only option when adding a new step to a recipe in the visual UI is to append it to the end, as the last step. There are cases where I'd like to modify a recipe and add a step between existing steps. The workaround I have found is to add the step I want to insert at the end, download the recipe as a JSON file, reorder the steps, then re-upload the recipe. This is a bit of a pain, though. Is there any way to do this without that workaround, or are there any plans to add this functionality?
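The JSON reordering itself can at least be scripted instead of done by hand. A sketch (the operation names below are illustrative, not taken from a real recipe):

```python
def insert_step(steps, new_step, position):
    """Insert new_step at a zero-based position in a DataBrew
    recipe's Steps list, returning a new list."""
    return steps[:position] + [new_step] + steps[position:]

steps = [
    {"Action": {"Operation": "REMOVE_VALUES"}},
    {"Action": {"Operation": "RENAME"}},
]
new = {"Action": {"Operation": "LOWER_CASE"}}
updated = insert_step(steps, new, 1)
# updated order: REMOVE_VALUES, LOWER_CASE, RENAME
```

If you script it with boto3, I believe `databrew.update_recipe(Name=..., Steps=updated)` can push the reordered list back without the manual re-upload, though you'd want to verify the behavior against your recipe's working version first.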
Difference between a Job that is tied to a Project vs Recipe + Dataset
When you create a job, you can associate it with a project OR explicitly enter the dataset and recipe to be used. Say we have Project_A that uses Recipe_A and Dataset_A. What is the difference between creating a job from Project_A versus specifying Recipe_A + Dataset_A directly? Would the two jobs be identical?
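As I understand it, the practical difference is binding: a project-tied job resolves the dataset and recipe from the project at run time (so it follows the project's current working recipe), while a dataset + recipe job is pinned to a specific published recipe version you name. The two boto3 call shapes look like this — no API call is made here, and the names/ARN are placeholders:

```python
# Two ways to define the "same" recipe job for boto3's create_recipe_job.
common = {
    "Name": "daily-clean-job",
    "RoleArn": "arn:aws:iam::123456789012:role/DataBrewRole",  # placeholder
    "Outputs": [{"Location": {"Bucket": "my-output-bucket"}}],  # placeholder
}

# 1) Tied to a project: dataset and recipe come from Project_A, so
#    later changes to the project's recipe flow into the job.
project_job = {**common, "ProjectName": "Project_A"}

# 2) Explicit dataset + pinned recipe version: the job is frozen to
#    the published Recipe_A version named here, independent of the project.
explicit_job = {
    **common,
    "DatasetName": "Dataset_A",
    "RecipeReference": {"Name": "Recipe_A", "RecipeVersion": "1.0"},
}

# boto3.client("databrew").create_recipe_job(**project_job)  # either form works
```

So the output is identical only as long as the project's recipe still matches the version you would have pinned; after you keep editing the project, the two jobs can diverge.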
Conditionals and comparison in DataBrew
Can DataBrew set a column _afternoon_ to _1_ if _hour > 12 and hour < 16_, and _0_ otherwise? This requires:

1. Numerical comparison ("less than" on numbers, not string comparison)
2. A boolean _and_

This can actually be done in DataBrew with some arithmetic transformations and the _sign_ function, but that gets complicated. Does DataBrew directly support _afternoon = (hour > 12 and hour < 16)_?
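For reference, the target logic is just a chained numeric comparison; DataBrew's CASE/IF-style conditional step, where available, should express it directly, and the equivalent in a Glue PySpark job is a one-line `when(...).otherwise(...)`. In plain Python:

```python
def afternoon_flag(hour: int) -> int:
    """Return 1 if 12 < hour < 16, else 0 — a numeric comparison
    combined with a boolean AND, which is what the recipe step needs
    to express without string comparison or sign() arithmetic."""
    return 1 if 12 < hour < 16 else 0

flags = [afternoon_flag(h) for h in (11, 12, 13, 15, 16)]
# flags == [0, 0, 1, 1, 0]
```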