
Questions tagged with Extract Transform & Load Data


Browse through the questions and answers listed below.

1 answer · 0 votes · 46 views · asked 14 days ago

AWS parameterized Glue concurrent jobs using Step Functions with bookmarks enabled - throws Version mismatch exception

I have a parameterized Glue job that is called in parallel (25 Glue jobs) through Step Functions. With bookmarks enabled, a version mismatch exception is thrown; with bookmarks disabled, it runs fine. Below is the input to the Step Functions execution:

```
{
    "JobName": "gluejobname3",
    "Arguments": {
        "--cdb": "catalogdbname",
        "--ctbl": "tablename",
        "--dest": "s3://thepath/raw/",
        "--extra-py-files": "s3://thepath/mytransformation.py.zip"
    }
}
```

When bookmarks are disabled, Step Functions calls the parameterized Glue job and loads the data into the different S3 locations. Below is the Glue job script:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import mytransformation

args = getResolvedOptions(sys.argv, ["JOB_NAME", "cdb", "ctbl", "dest"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Initialized this way so that each job gets its own unique id
job.init(args["JOB_NAME"] + args["ctbl"], args)

# Custom parameters that are passed dynamically with each call to the parameterized Glue job
db = args["cdb"]
tbl = args["ctbl"]
dest = args["dest"]

# Script generated for node MySQL table
# Dynamically creating a variable name so that transformation_ctx is unique for each Glue job
globals()[f'MySQLtable_sourceDf{tbl}'] = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx='"' + f'MySQLtable_sourceDf{tbl}' + '"'
)

# Passing the same frame into the destination frame
destination_Df = mytransformation.tansform(globals()[f'MySQLtable_sourceDf{tbl}'])

# Script generated for node Amazon S3
# : "s3://anvil-houston-preprod"
# Creating a unique transformation_ctx name since the jobs will run concurrently
globals()[f'S3bucket_node{tbl}'] = glueContext.write_dynamic_frame.from_options(
    frame=destination_Df,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": dest, "partitionKeys": []},
    format_options={"compression": "snappy"},
    transformation_ctx='"' + f'S3bucket_node{tbl}' + '"'
)

job.commit()
```

The above runs fine: when the execution is started through Step Functions (25 parallel parameterized Glue jobs), the jobs run and load to 25 different locations. When bookmarks are enabled, the jobs fail with a version mismatch:

```
An error occurred while calling z:com.amazonaws.services.glue.util.Job.commit.
Continuation update failed due to version mismatch. Expected version 1 but found version 2
(Service: AWSGlueJobExecutor; Status Code: 400; Error Code: VersionMismatchException)
```

Please help.
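One detail that might be worth isolating (a guess, not a confirmed cause): the transformation_ctx strings above embed literal double-quote characters, and job bookmarks key their continuation state on those context names. A minimal sketch of the same read/write with plain per-table context names, assuming everything else in the script stays as is:

```
# Sketch only: drop-in replacement for the source/sink calls in the script above,
# reusing glueContext, db, tbl, dest and mytransformation from that script. The only
# change is that transformation_ctx is a plain per-table string with no embedded quotes.
source_df = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx=f"MySQLtable_sourceDf_{tbl}",
)

destination_df = mytransformation.tansform(source_df)

glueContext.write_dynamic_frame.from_options(
    frame=destination_df,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": dest, "partitionKeys": []},
    format_options={"compression": "snappy"},
    transformation_ctx=f"S3bucket_node_{tbl}",
)
```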
0 answers · 0 votes · 22 views · asked 15 days ago

Generating Parquet files from Glue Data Catalog

I have a Glue job that writes to a Data Catalog table. In the Data Catalog I originally set the table up as CSV, and everything works fine. Now I would like to try Parquet for the Data Catalog. I thought I would just have to re-create the table and select Parquet instead of CSV, so I did this:

```
CREATE EXTERNAL TABLE `gp550_load_database_beta.gp550_load_table_beta`(
  `vid` string,
  `altid` string,
  `vtype` string,
  `time` timestamp,
  `timegmt` timestamp,
  `value` float,
  `filename` string)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ds905-load-forecast/data_store_beta/'
TBLPROPERTIES (
  'classification'='parquet')
```

I left my Glue job unchanged. It sends its output to the Data Catalog table like so:

```
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["year", "month", "day"]

# Data Catalog WRITE
DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame = dynamicDF,
    database = db_name,
    table_name = tbl_name,
    additional_options=additionalOptions,
    transformation_ctx = "DataCatalogtable_node2",
)
```

When I checked the files being created in s3://ds905-load-forecast/data_store_beta/, they look to just be CSV. What do I need to do to use Parquet? Can I just change the sink routine to use glueContext.write_dynamic_frame.from_options()?
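On that last question: a minimal sketch of what the from_options() sink could look like, reusing the frame, path and partition keys from the question (whether the existing catalog table definition then lines up with the written files is left open):

```
# Sketch: write glueparquet (snappy-compressed Parquet) straight to the table location,
# partitioned by the same keys as the catalog table. dynamicDF and glueContext come from
# the job in the question.
S3_parquet_sink = glueContext.write_dynamic_frame.from_options(
    frame=dynamicDF,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://ds905-load-forecast/data_store_beta/",
        "partitionKeys": ["year", "month", "day"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="S3_parquet_sink",
)
```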
0 answers · 0 votes · 17 views · asked 21 days ago

MSCK REPAIR TABLE behaves differently when executed via Spark Context vs Athena Console/boto3

I have a Glue ETL job which creates partitions during the job:

```
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["year", "month", "day"]
```

I don't have it update the Data Catalog because doing so changes all my table data types. So after the job is done, the way I get the Data Catalog updated with the correct partition information is to run MSCK REPAIR TABLE. If I do this inside the Glue ETL job using the Spark context like so:

```
spark.sql("use gp550_load_database_beta")
spark.sql("msck repair table gp550_load_table_beta").show()
```

then the following happens:

* The Serde properties of my table are updated with "serialization.format : 1"
* The table properties are updated with "EXTERNAL : TRUE" and "spark.sql.partitionProvider : catalog"
* All data types in my table are set to "string" with a comment of "from deserializer"

Basically, it makes a mess. If I instead run MSCK REPAIR TABLE from boto3, or manually from the Athena console, then there are no issues: no Serde properties change, no table properties change, no data types change; it simply adds my partitions to the Data Catalog like I want. I do it like so in my ETL job:

```
client = boto3.client('athena')
sql = 'MSCK REPAIR TABLE gp550_load_database_beta.gp550_load_table_beta'
context = {'Database': 'gp550_load_database_beta'}
client.start_query_execution(QueryString = sql,
                             QueryExecutionContext = context,
                             ResultConfiguration = { 'OutputLocation': 's3://aws-glue-assets-977465404123-us-east-1/temporary/' })
```

Why does it behave so differently? Is it because I somehow need to tell Spark to work with Hive? I had thought that since I already had a Spark context it would be easy to use it to kick off MSCK REPAIR TABLE, but I was obviously surprised by the result!
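For reference, a small sketch of the boto3 route from the question with a completion wait added, so the Glue job doesn't finish before Athena has registered the partitions (the polling loop is an addition, not part of the original job):

```
import time

import boto3

# Sketch: run MSCK REPAIR TABLE through Athena and block until it completes.
# Table, database and output location are the ones from the question.
athena = boto3.client('athena')
resp = athena.start_query_execution(
    QueryString='MSCK REPAIR TABLE gp550_load_database_beta.gp550_load_table_beta',
    QueryExecutionContext={'Database': 'gp550_load_database_beta'},
    ResultConfiguration={'OutputLocation': 's3://aws-glue-assets-977465404123-us-east-1/temporary/'},
)
query_id = resp['QueryExecutionId']
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)
```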
0 answers · 0 votes · 14 views · asked 23 days ago

AWS Glue problems reading from PostgreSQL DB that has uppercased table and column names

I have an RDS PostgreSQL database that has table names and column names with upper-case characters. I have created a Glue crawler that connects to the database via JDBC and populates the Glue Data Catalog with the database schemas, but in that translation the upper-case characters are converted to lower case when stored in the Data Catalog tables. When I run a Glue job to query, I get this error:

> An error occurred while calling o94.getDynamicFrame. ERROR: column "id" does not exist

I made a copy of this table and changed the table and column names to all lower-case characters, and the same Glue jobs and queries run successfully. Changing the table and column names to lower case in our production environment is just not an option due to the extensive work it would require.

I found the 'Edit Schema' option in the Glue UI where you can change column names and data types, and thought for a moment the solution had been found. However, when you change a character to upper case and then select "Save", it is reverted to lower case as it is saved.

I have edited the PySpark script directly and worked with the glueContext.create_dynamic_frame.from_catalog method, using the additional_options parameter to build my select statement with upper- and lower-case characters, but that still fails with the error message noted above:

```
# Script generated for node PostgreSQL table
PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mydatabase",
    table_name="mytable",
    additional_options={"query": "SELECT id from mytable;"},
    transformation_ctx="PostgreSQLtable_node1",
)
```

I believe the failure is because the schema as stored in the Data Catalog contains lower-case characters while the actual schema in the database is upper case, so when Glue tries to work with the table it is looking for "id" while the actual column is "ID", and "not found" is returned. I have read about the CaseSensitive option and am looking in that direction next for a solution. I have not seen any recent posts (less than a couple of years old) about this issue, so I'm not sure if I'm missing something. Any assistance would be greatly appreciated.
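One angle that might be worth testing (hedged, since it isn't clear how far Glue passes the query through unchanged): PostgreSQL folds unquoted identifiers to lower case, and only double-quoted identifiers keep their case. A sketch of the same call with quoted identifiers in the pushed-down query, using hypothetical stand-ins for the real mixed-case names:

```
# Sketch: quote the identifiers so PostgreSQL resolves the real upper-case names.
# "MyTable" / "ID" are placeholders for the actual mixed-case table and column names.
PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mydatabase",
    table_name="mytable",
    additional_options={"query": 'SELECT "ID" FROM "MyTable";'},
    transformation_ctx="PostgreSQLtable_node1",
)
```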
2 answers · 0 votes · 27 views · asked a month ago

Data Catalog schema table getting modified when I run my Glue ETL job

I created a Data Catalog with a table that I manually defined. I run my ETL job and all works well. I then added partitions to both the table in the Data Catalog and the ETL job. It creates the partitions and I see the folders being created in S3 as well. But my table data types change. I originally had:

| column | data type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | timestamp |
| timegmt | timestamp |
| value | float |
| filename | string |
| year | int |
| month | int |
| day | int |

But now, after the ETL job with partitions, my table ends up like this:

| column | data type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | bigint |
| timegmt | bigint |
| value | float |
| filename | string |
| year | bigint |
| month | bigint |
| day | bigint |

Before this change of data types, I could run queries in Athena, including a query like this:

```
SELECT * FROM "gp550-load-database"."gp550-load-table-beta"
WHERE vid IN ('F_NORTH', 'F_EAST', 'F_WEST', 'F_SOUTH', 'F_SEAST')
    AND vtype='LOAD'
    AND time BETWEEN TIMESTAMP '2021-05-13 06:00:00' and TIMESTAMP '2022-05-13 06:00:00'
```

But now, with the data types changed, I get an error when trying to run a query like the one above:

```
"SYNTAX_ERROR: line 1:154: Cannot check if bigint is BETWEEN timestamp and timestamp
This query ran against the "gp550-load-database" database, unless qualified by the query.
Please post the error message on our forum or contact customer support with Query Id: 2a5287bc-7ac2-43a8-b617-bf01c63b00d5"
```

If I then go into the table and change the data type back to "timestamp", I run the query and get a different error:

```
"HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.
The types are incompatible and cannot be coerced. The column 'time' in table
'gp550-load-database.gp550-load-table-beta' is declared as type 'timestamp', but partition
'year=2022/month=2/day=2' declared column 'time' as type 'bigint'.
This query ran against the "gp550-load-database" database, unless qualified by the query.
Please post the error message on our forum or contact customer support with Query Id: f788ea2b-e274-43fe-a3d9-22d80a2bbbab"
```

With Glue crawlers, you can influence how the partitions are created. You can have the crawler modify the Data Catalog table, or make no changes to the table schema other than adding new partitions:

```
{
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
        "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
    }
}
```

Basically, this *InheritFromTable* behavior is what I am looking for with ETL jobs. Does anyone know what is happening?
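A minimal sketch of one way to keep the written partitions aligned with the manually declared schema: cast the epoch columns back to timestamps inside the job before the sink, assuming the incoming frame really carries them as bigint (column names are the ones from the tables above; whether ApplyMapping interprets the epoch values the way you need is an assumption to verify):

```
from awsglue.transforms import ApplyMapping

# Sketch: explicit casts so the partition files are written with timestamp/int columns
# instead of bigint. dynamicDF is the frame the job already produces.
casted_df = ApplyMapping.apply(
    frame=dynamicDF,
    mappings=[
        ("vid", "string", "vid", "string"),
        ("altid", "string", "altid", "string"),
        ("vtype", "string", "vtype", "string"),
        ("time", "bigint", "time", "timestamp"),
        ("timegmt", "bigint", "timegmt", "timestamp"),
        ("value", "float", "value", "float"),
        ("filename", "string", "filename", "string"),
        ("year", "bigint", "year", "int"),
        ("month", "bigint", "month", "int"),
        ("day", "bigint", "day", "int"),
    ],
    transformation_ctx="casted_df",
)
```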
0 answers · 0 votes · 53 views · asked a month ago

Can't get Partitions to work with my Glue Data Catalog

I have S3 files that are uploaded to a single bucket. There are no folders or anything like that; it's just one file per hour uploaded to this bucket. I run a Glue ETL job on these files, do some transformations, and insert the data into a Glue Data Catalog stored in a different bucket. I can then query that Glue Data Catalog with Athena, and that works.

What I would like to do is store the files in the S3 folder of the Data Catalog as YEAR/MONTH/DAY, using partitions. Even though the SOURCE data is just files uploaded every hour with no partitions, I want to store them in the Data Catalog WITH partitions. So I extracted YEAR, MONTH, and DAY from the files during the Glue ETL job, created columns in my Data Catalog table accordingly, and marked them as partitions:

* Partition 1: YEAR
* Partition 2: MONTH
* Partition 3: DAY

The proper values are in these columns, and I have verified that. After creating the partitions I ran MSCK REPAIR TABLE on the table, and it came back with "Query Ok."

I then ran my Glue ETL job. When I look in the S3 bucket I do not see folders created; I just see regular r-part files. When I click on the table schema it shows the columns YEAR, MONTH, DAY marked as partitions, but when I click on View Partitions it just shows the headers year, month, day and "No partitions found".

What do I need to do? These are just CSV files. I can't control the process that is uploading the raw data to S3; it is just going to store hourly files in a bucket. I can control the ETL job and the Data Catalog. When I try to query after creating the partitions and running MSCK REPAIR TABLE, there is no data returned, yet I can go into the Data Catalog bucket and pull up one of the r-part files and the data is there.
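A minimal sketch of the sink side, on the assumption that the job currently writes without partition keys; folders like year=2022/month=5/day=13 only get created when the write is told which columns are partition keys (the destination bucket and frame name are placeholders):

```
# Sketch: pass partitionKeys at write time so the files land in year=/month=/day= folders.
# transformed_df and glueContext come from the existing job; the bucket name is hypothetical.
partitioned_sink = glueContext.write_dynamic_frame.from_options(
    frame=transformed_df,
    connection_type="s3",
    format="csv",
    connection_options={
        "path": "s3://my-data-catalog-bucket/",
        "partitionKeys": ["year", "month", "day"],
    },
    transformation_ctx="partitioned_sink",
)
```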
1 answer · 0 votes · 34 views · asked a month ago

Data Pipeline error when using RegEx data format

I'm working on a sample to output an Aurora query to a fixed-width file (each column is converted to a specific width in the file, regardless of column data length), but every attempt to use the RegEx data format results in a `This format cannot be written to` error. **Using the preset CSV or TSV formats is successful.** I'm currently outputting the Aurora query to CSV first (stored in S3), then pulling that file and attempting to do the conversion via RegEx. I'm following the example at https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-regex.html but for some reason it's not working for me. **Does anyone have any thoughts on what I'm missing or other things I could try to get this working?**

## Aurora

For this sample, it's just a simple 3-column table: INT, VARCHAR, VARCHAR

## CSV

Again, really simple:

> 1,xxxxxxxxxxx,yyyyyyyy
>
> 2,xxxxxxxxxxx,yyyyyyyyy
>
> 3,xxxxxxxxx,yyyyyyyy

## RegEx

* Input Reg Ex: `(.)`
* The above is just the most minimal grouping I could come up with after trying multiple others (including escaping the `\` and `,` in the example below)
* The real regex I would expect to use is `([^,]*)\,([^,]*)\,([^,]*)\n`
* Output Format: `%1$s`
* Ideally, I'd expect to see something like `%1$15s %2$15s %3$15s`
* Columns: tried every combination both with and without columns, which made no difference

## Pipeline Definition

```
{
  "objects": [
    {
      "s3EncryptionType": "NONE",
      "maximumRetries": "3",
      "dataFormat": { "ref": "DataFormatId_A0LHb" },
      "filePath": "s3://xxxx, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "Fixed Width File",
      "id": "DataNodeId_b4YBg",
      "type": "S3DataNode"
    },
    {
      "s3EncryptionType": "NONE",
      "dataFormat": { "ref": "DataFormatId_fIUdS" },
      "filePath": "s3://xxxxx, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3Bucket",
      "id": "DataNodeId_0vkYO",
      "type": "S3DataNode"
    },
    {
      "output": { "ref": "DataNodeId_0vkYO" },
      "input": { "ref": "DataNodeId_axAKH" },
      "name": "CopyActivity",
      "id": "CopyActivityId_4fSv7",
      "runsOn": { "ref": "ResourceId_B2kdU" },
      "type": "CopyActivity"
    },
    {
      "inputRegEx": "(.)",
      "name": "Fixed Width Format",
      "id": "DataFormatId_A0LHb",
      "type": "RegEx",
      "outputFormat": "%1$s"
    },
    {
      "subnetId": "subnet-xxxx",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "securityGroupIds": "sg-xxx",
      "instanceType": "t1.micro",
      "actionOnTaskFailure": "terminate",
      "name": "EC2",
      "id": "ResourceId_B2kdU",
      "type": "Ec2Resource",
      "terminateAfter": "5 Minutes"
    },
    {
      "name": "CSV Format",
      "id": "DataFormatId_fIUdS",
      "type": "CSV"
    },
    {
      "output": { "ref": "DataNodeId_b4YBg" },
      "input": { "ref": "DataNodeId_0vkYO" },
      "dependsOn": { "ref": "CopyActivityId_4fSv7" },
      "maximumRetries": "3",
      "name": "Change Format",
      "runsOn": { "ref": "ResourceId_B2kdU" },
      "id": "CopyActivityId_YytI2",
      "type": "CopyActivity"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://xxxxx/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "connectionString": "jdbc:mysql://xxxxxxx",
      "*password": "xxxx",
      "name": "Aurora",
      "id": "DatabaseId_HL9uz",
      "type": "JdbcDatabase",
      "jdbcDriverClass": "com.mysql.jdbc.Driver",
      "username": "xxxx"
    },
    {
      "database": { "ref": "DatabaseId_HL9uz" },
      "name": "Aurora Table",
      "id": "DataNodeId_axAKH",
      "type": "SqlDataNode",
      "selectQuery": "select * from policy",
      "table": "policy"
    }
  ],
  "parameters": []
}
```
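If it turns out the RegEx data format can only be read from and not written to, one fallback is to do the fixed-width formatting in a small script after the CSV lands in S3, outside Data Pipeline. A sketch under that assumption (bucket and key names are hypothetical; the 15-character width mirrors the `%1$15s` example above):

```
import csv
import io

import boto3

# Sketch: read the CSV export from S3, pad each column to 15 characters, write it back.
s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="exports/policy.csv")["Body"].read().decode("utf-8")
rows = csv.reader(io.StringIO(body))
fixed_width = "\n".join("".join(f"{col:<15}" for col in row) for row in rows)
s3.put_object(Bucket="my-bucket", Key="exports/policy-fixed-width.txt", Body=fixed_width.encode("utf-8"))
```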
1 answer · 0 votes · 12 views · asked 2 months ago

AWS trigger EventBatchingCondition/BatchWindow is not optional

Hi team,

I have a Glue workflow: a trigger (type = "EVENT") that starts a Glue job (to take data from S3 and push it to MySQL RDS). I configured the Glue triggering criteria to kick off the Glue job after 5 events are received. In the console it says:

> Specify the number of events received or maximum elapsed time before firing this trigger.
> Time delay in seconds (optional)

The AWS documentation also says it's not required:

```
BatchWindow
Window of time in seconds after which EventBridge event trigger fires.
Window starts when first event is received.

Type: Integer
Valid Range: Minimum value of 1. Maximum value of 900.
Required: No
```

I want my trigger to fire only after 5 events are received, and not depend on "Time delay in seconds (optional)". Currently, the time delay is set to 900 by default, and my job starts after 900 seconds even if 5 events have not been received. That's not the behaviour we want: we want the job to start ONLY after x events are received.

I tried to edit the trigger via the console and remove the 900 seconds from the "Time delay in seconds (optional)" input, but I can't save it until I put a value in it. It says it's optional, but it doesn't seem to be.

Is there a workaround to make the trigger ignore "Time delay in seconds (optional)" and fire only when it has received x events, and nothing else? Right now the job is triggered after 900 seconds; we want to eliminate that case so the job is triggered only if x events are received. How can I make the "Time delay in seconds (optional)" input actually optional, given that the console forces me to put a value in it?

Thank you.
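For what it's worth, the API exposes the same two fields as the console. A short boto3 sketch of setting them explicitly, assuming UpdateTrigger accepts the same EventBatchingCondition block that CreateTrigger does (the trigger name is a placeholder). Note this doesn't remove the window, since the documented range only goes up to 900 seconds; it only pins the values:

```
import boto3

# Sketch: set the batch size and batch window on an existing event trigger.
# "my-event-trigger" is a hypothetical trigger name.
glue = boto3.client("glue")
glue.update_trigger(
    Name="my-event-trigger",
    TriggerUpdate={
        "EventBatchingCondition": {
            "BatchSize": 5,      # fire after 5 events
            "BatchWindow": 900,  # seconds; the API requires a value between 1 and 900
        }
    },
)
```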
1 answer · 0 votes · 18 views · asked 2 months ago

Is there a way to create a Redshift Table from a Glue table's schema?

Athena tables can be created from Glue tables, which can have schemas based on crawlers. **Is it also possible to use the schema of a Glue table to generate a *Redshift-compatible* `CREATE TABLE` statement?**

I tried `SHOW CREATE TABLE encounter;` in Athena and then tried plugging the resulting `CREATE TABLE` statement into Redshift, but got an error:

```
ERROR: syntax error at or near "`" Position: 23.
```

I can go through the statement Athena generated and clean it up to fit Redshift requirements, like taking out the back-ticks, but I'm wondering if there's a more direct way to generate a table based on a Glue table. This is the `CREATE TABLE` statement that Athena generated:

```
CREATE EXTERNAL TABLE `encounter`(
  `resourcetype` string COMMENT 'from deserializer',
  `id` string COMMENT 'from deserializer',
  `meta` struct<lastupdated:string,profile:array<string>> COMMENT 'from deserializer',
  `identifier` array<struct<use:string,system:string,value:string>> COMMENT 'from deserializer',
  `status` string COMMENT 'from deserializer',
  `class` struct<system:string,code:string> COMMENT 'from deserializer',
  `type` array<struct<coding:array<struct<system:string,code:string,display:string>>,text:string>> COMMENT 'from deserializer',
  `subject` struct<reference:string,display:string> COMMENT 'from deserializer',
  `participant` array<struct<type:array<struct<coding:array<struct<system:string,code:string,display:string>>,text:string>>,period:struct<start:string,end:string>,individual:struct<reference:string,display:string>>> COMMENT 'from deserializer',
  `period` struct<start:string,end:string> COMMENT 'from deserializer',
  `location` array<struct<location:struct<reference:string,display:string>>> COMMENT 'from deserializer',
  `serviceprovider` struct<reference:string,display:string> COMMENT 'from deserializer',
  `reasoncode` array<struct<coding:array<struct<system:string,code:string,display:string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'paths'='class,id,identifier,location,meta,participant,period,reasonCode,resourceType,serviceProvider,status,subject,type')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket/Encounter/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'UPDATED_BY_CRAWLER'='healthlake-export-crawler',
  'averageRecordSize'='1561',
  'classification'='json',
  'compressionType'='none',
  'objectCount'='14',
  'recordCount'='53116',
  'sizeKey'='83059320',
  'typeOfData'='file')
```

Here's an example of what the original data looks like (it's synthetic data, so not PHI):

```
{
  "period": {
    "start": "2019-11-18T13:53:49-08:00",
    "end": "2019-11-18T14:23:49-08:00"
  },
  "subject": {
    "reference": "Patient/92e36d1e-66a2-4e77-9f50-155f7edf819c",
    "display": "Cyndi533 Bogan287"
  },
  "serviceProvider": {
    "reference": "Organization/3ecb1bdd-03d7-3fd2-b52d-8df2a04f5b0a",
    "display": "SOUTH SHORE SKIN CENTER, LLC"
  },
  "id": "b39745ae-14dd-46b3-9345-2916efa759ad",
  "type": [{
    "coding": [{
      "system": "http://snomed.info/sct",
      "code": "410620009",
      "display": "Well child visit (procedure)"
    }],
    "text": "Well child visit (procedure)"
  }],
  "class": {
    "system": "http://terminology.hl7.org/CodeSystem/v3-ActCode",
    "code": "AMB"
  },
  "participant": [{
    "period": {
      "start": "2019-11-18T13:53:49-08:00",
      "end": "2019-11-18T14:23:49-08:00"
    },
    "individual": {
      "reference": "Practitioner/c51e847b-fcd0-3f98-98a7-7e4274a2e6f3",
      "display": "Dr. Jacquelyne425 O'Reilly797"
    },
    "type": [{
      "coding": [{
        "system": "http://terminology.hl7.org/CodeSystem/v3-ParticipationType",
        "code": "PPRF",
        "display": "primary performer"
      }],
      "text": "primary performer"
    }]
  }],
  "resourceType": "Encounter",
  "status": "finished",
  "meta": {
    "lastUpdated": "2022-04-08T15:40:39.926Z"
  }
}
```
2 answers · 0 votes · 41 views · asked 2 months ago

Data transformation not taken into account in AWS Glue

I have an S3 bucket with folders in which we have files. I want to build a database to be able to query these documents on a few keys with an API based on Lambda. But for that I need to normalize the data. For example, I need to transform all the files in the folder `/jomalone/`, which look like the following:

```
{
  "data": {
    "products": {
      "items": [
        {
          "default_category": { "id": "25956", "value": "Bath & Body" },
          "description": "London's Covent Garden early morning market. Succulent nectarine, peach and cassis and delicate spring flowers melt into the note of acacia honey. Sweet and delightfully playful. Our luxuriously rich Body Crème with its conditioning oils of jojoba seed, cocoa seed and sweet almond, help to hydrate, nourish and protect the skin, while delicious signature fragrances leave your body scented all over.",
          "display_name": "Nectarine Blossom & Honey Body Crème",
          "is_hazmat": false,
          "meta": { "description": "The Jo Malone&trade; Nectarine Blossom & Honey Body Crème leaves skin beautifully scented with fruity notes of nectarine and peach sweetened with acacia honey." },
          ...
          { "currency": "EUR", "is_discounted": false, "include_tax": { "price": 68, "original_price": 68, "price_per_unit": 38.86, "price_formatted": "€68.00", "original_price_formatted": "€68.00", "price_per_unit_formatted": "€38.86 / 100ML" } } ],
          "sizes": [ { "value": "175ML", "key": 1 } ],
          "shades": [ { "name": "", "description": "", "hex_val": "" } ],
          "sku_id": "L4P801",
          "sku_badge": null,
          "unit_size_formatted": "100ML",
          "upc": "690251040254",
          "is_engravable": null,
          "perlgem": { "SKU_BASE_ID": 63584 },
          "media": {
            "large": [
              { "src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_0.png", "alt": "Nectarine Blossom & Honey Body Crème", "height": 1000, "width": 1000 },
              { "src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_1.png", "alt": "Nectarine Blossom & Honey Body Crème", "height": 1000, "width": 1000 }
            ],
            "medium": [
              { "src": "/media/export/cms/products/670x670/jo_sku_L4P801_670x670_0.png", "alt": "Nectarine Blossom & Honey Body Crème", "height": 670, "width": 670 }
            ],
            "small": [
              { "src": "/media/export/cms/products/100x100/jo_sku_L4P801_100x100_0.png", "alt": "Nectarine Blossom & Honey Body Crème", "height": 100, "width": 100 }
            ]
          },
          "collection": null,
          "recipient": [ { "key": "mom-recipient", "value": "mom_recipient" }, { "key": "bride-recipient", "value": "bride_recipient" }, { "key": "host-recipient", "value": "host_recipient" }, { "key": "me-recipient", "value": "me_recipient" }, { "key": "her-recipient", "value": "her_recipient" } ],
          "occasion": [ { "key": "thankyou-occasion", "value": "thankyou_occasion" }, { "key": "birthday-occasion", "value": "birthday_occasion" }, { "key": "treat-occasion", "value": "treat_occasion" } ],
          "location": [ { "key": "bathroom-location", "value": "bathroom_location" } ]
        } ] } } ] } } }
```

I want to turn each file into a JSON with the following schema:

```
brandName        String
productName      String
productLink      String
productType      ?
maleFemale       Male/Female
price            float
unitPrice        String
size             float
ingredients      String
notes            String
numReviews       Int
userIDs          float
locations        float
dates            Date
ages             int
sexes            M/F
ratings          Int
reviews          Array of String
sources          String
characteristics  String
specificRatings  String
```

So I have tried AWS Glue, but I don't know how to get rid of the nested data, such as the keys at the beginning:

```
"data": {
    "products": {
        "items": [
...
```

Indeed, I tried modifying the names:

[![enter image description here][1]][1]

But it doesn't seem to have any of the effects I was looking for, if I am to believe the Preview tab:

[![enter image description here][2]][2]

I had indeed deleted the first and last underlined fields and modified the others, but none of this seems to have been taken into account in the Preview. Indeed, there doesn't seem to be anything like even a mapping in the generated script:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://datahubpredicity/JoMalone/"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[("data.products.items", "array", "data.products.items", "array")],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="json",
    connection_options={"path": "s3://datahubpredicity/merged/", "partitionKeys": []},
    transformation_ctx="S3bucket_node3",
)

job.commit()
```

[1]: https://i.stack.imgur.com/xYHOu.png
[2]: https://i.stack.imgur.com/nWNrn.png
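On getting rid of the data.products.items nesting: a sketch of one possible approach using Glue's Relationalize transform to flatten the frame and then selecting the child frame that holds the items array (the staging path is a placeholder, and the exact child-frame key depends on the data, so it is worth printing the keys first):

```
from awsglue.transforms import Relationalize

# Sketch: flatten the nested JSON into a collection of frames, then pick the one
# corresponding to data.products.items. S3bucket_node1 and glueContext come from the
# generated script above; the staging path is hypothetical.
flattened = Relationalize.apply(
    frame=S3bucket_node1,
    staging_path="s3://datahubpredicity/tmp/",
    name="root",
    transformation_ctx="relationalize_node",
)
print(flattened.keys())  # inspect the generated frame names before selecting one
items_frame = flattened.select("root_data.products.items")
```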
0 answers · 0 votes · 9 views · asked 2 months ago

AWS Glue job fails for escaper char

Hi team,

I tried to load a big CSV file from S3 to RDS MySQL using AWS Glue. I have an escape character in the file (a special character), and this escape character is also defined on the crawled CSV table. Each time, the job fails with an error:

`An error occurred while calling o122.pyWriteDynamicFrame. Duplicate entry '123456' for key 'MySQL table.PRIMARY'`

I don't have any duplicate keys in my file, and the table is truncated each time before running the job. I tried to narrow down the issue by dividing the file into chunks: **every chunk runs successfully and I get the full data**, but the whole file in a single job always fails with the above error. I can't figure out why. Is this a Glue issue or a data issue?

I know the issue is related to my escape character, because when I remove it the whole file passes, **or** when I replace my special escape character with "\" the whole file also passes. Is that because Glue doesn't support certain escape characters? (I have this issue with big files.) I'm not sure why the whole file with the escaper fails while every sub-chunk passes. Any idea?

Glue script:

```
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_csv", table_name = "tbl_csvxx", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "string", "id", "string"), ("col1", "string", "col1", "string"), ("date1", "string", "date2", "timestamp"), ("col2", "string", "col2", "string"), ("col3", "string", "col3", "string"), ("col4", "string", "col24", "string"), ("col5", "string", "col5", "string"), ...], transformation_ctx = "applymapping1")
selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["col1", "col2", "col3", "id", "col4", "col5", ...], transformation_ctx = "selectfields2")
datasink3 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = selectfields2, catalog_connection = conn_name, connection_options = {"dbtable": "mysqltable", "database": db_name}, transformation_ctx = "datasink3")
```

Sample data:

```
"123","2018-02-09 12:16:38.000","2018-02-09 12:16:38.000","addr1 ®" addr2®" addr3®"",,,"22","city1","121",,,,,"CC"
"456","2018-02-09 12:16:38.000","2018-02-09 12:16:38.000","sds, dssdds®"F®", sds sds, dwr, re2",,,"ree364","ABD","288",,,,,"N"
"789","2018-02-09 12:16:38.000","2018-02-09 12:16:38.000","Alle# 02, Sept# 06, sdsx,",,"SAP# ®"C®"","DPPK# 05","dssd","313","Alkl",,,"1547","P"
```

Thank you.
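One way to isolate whether the crawled table's escaper definition is the problem (a debugging sketch, not a known fix): read the CSV directly from S3 with explicit quote/escape options and compare the row count and key values against the catalog-based read. The S3 path is a placeholder and `®` stands in for the actual special character:

```
# Sketch: bypass the crawled table and declare the CSV parsing options explicitly.
datasource_direct = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/path/to/bigfile.csv"]},
    format="csv",
    format_options={
        "withHeader": True,
        "separator": ",",
        "quoteChar": '"',
        "escaper": "®",
    },
    transformation_ctx="datasource_direct",
)
print(datasource_direct.count())
```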
1 answer · 0 votes · 31 views · asked 2 months ago

Best way to overcome HIVE_PARTITION_SCHEMA_MISMATCH error in Athena while preserving structure of structs?

I ran the following Amazon Athena query on a table created by AWS Glue, which had crawled an Amazon S3 export of Synthea data from Amazon HealthLake:

```
SELECT * FROM "glue01234567890_fhir_export_abcdefghijklmnopqrst";
```

That resulted in this error:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'code' in table 'demo.glue01234567890_fhir_export_abcdefghijklmnopqrst' is declared as type 'struct<coding:array<struct<system:string,code:string,display:string>>,text:string>', but partition 'partition_0=ImagingStudy' declared column 'id' as type 'string'. This query ran against the "demo" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id

I saw [one answer](https://repost.aws/questions/QU1vPk1069Q5qg4iyuQWTk6Q/while-importing-s-3-data-into-quicksight-from-glue-database-getting-the-following-error) suggesting editing the crawler, going to output --> configuration, and selecting "Update all new and existing partitions with metadata from the table." However, that didn't resolve the error for me.

When I edit the schema manually and change column 'code' from `struct` to `string`, the error goes away, and Athena brings my attention to the next mismatch:

> partition 'partition_0=Immunization' declared column 'patient' as type 'struct<reference:string>'.

I don't think, however, that manually replacing all structs with strings will get me the results I want, because I need Glue and Athena to be aware of the fields inside the structs. What is the best approach to overcoming this error while preserving the structure of the data, in this situation where I want to make the HealthLake export queryable by Athena? I understand I'll want to add a step to the ETL in Glue to first convert to Parquet or ORC for Athena performance reasons, but right now I'm dealing with a small sample dataset and just want to focus on getting a minimum viable end-to-end flow going.
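A sketch of one way to sidestep the mismatch while keeping the structs, under the assumption (suggested by partition_0 being the resource type) that the HealthLake export puts each resource type under its own S3 prefix: crawl each resource type as its own table instead of one combined, partitioned table. Crawler names, the role ARN and the bucket are placeholders:

```
import boto3

# Sketch: one crawler (and therefore one table) per FHIR resource type, so incompatible
# schemas such as Encounter vs ImagingStudy never share a partitioned table.
glue = boto3.client("glue")
for resource_type in ["Encounter", "Immunization", "ImagingStudy"]:
    crawler_name = f"fhir-export-{resource_type.lower()}"
    glue.create_crawler(
        Name=crawler_name,
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
        DatabaseName="demo",
        Targets={"S3Targets": [{"Path": f"s3://my-fhir-export-bucket/{resource_type}/"}]},
    )
    glue.start_crawler(Name=crawler_name)
```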
1 answer · 0 votes · 52 views · asked 3 months ago