
Questions tagged with AWS Glue

What should be the correct Exclude Pattern and Table level when dealing with folders with different names?

Hello, I have an S3 bucket with the following path: "s3://a/b/c". Inside this 'c' folder I have one folder for each table, and inside each of those table folders I have a folder for each version. Each version is a database snapshot obtained on a weekly basis by a workflow. To clarify, the structure inside 'c' looks like this:

1. products
    1. /version_0
        1. _temporary
            1. 0_$folder$
        2. part-00000-c5... ...c000.snappy.parquet
    2. /version_1
        1. _temporary
            1. 0_$folder$
        2. part-00000-c5... ...c000.snappy.parquet
2. locations
    1. /version_0
        1. _temporary
            1. 0_$folder$
        2. part-00000-c5... ...c000.snappy.parquet
    2. /version_1
        1. _temporary
            1. 0_$folder$
        2. part-00000-c5... ...c000.snappy.parquet

I have created a crawler (Include Path set to the same path mentioned above, "s3://a/b/c") with the intention of merging all the versions together into one table per table folder (products, locations). The schemas of the different partitions are always the same, and the structure of the different partitions is also always the same. The _temporary folder is generated automatically by the workflow.

**What should be the actual correct Exclude path (to ignore everything in the _temporary folder), and should I set any Table level, so that I end up with only ONE table merging all versions together for each table (products, locations)?**

In summary I should have 2 tables:

1. products (containing version_0 and version_1 rows)
2. locations (containing version_0 and version_1 rows)

I really have no way of testing the exclude patterns. Is there any sandbox where we can actually test the glob exclude patterns? I have found one online but it doesn't seem to match what AWS is using. I have tried these exclude patterns, but none worked (it still created a table for each table & each version):

1. version*/_temporary**
2. /**/version*/_temporary**
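For context on where these settings live outside the console, here is a minimal, untested boto3 sketch of a single crawler over s3://a/b/c with an exclude pattern for the _temporary folders and table-level grouping; the crawler name, role and database are placeholders, and the exact glob and level may need adjusting since exclude patterns are evaluated relative to the include path:

```python
import json
import boto3

# Sketch only: one crawler over s3://a/b/c, excluding Spark's _temporary
# folders and grouping everything below the table folders into one table
# per folder (products, locations). Name, role and database are placeholders.
glue = boto3.client("glue")

glue.create_crawler(
    Name="c-crawler",  # placeholder
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",  # placeholder
    DatabaseName="my_database",  # placeholder
    Targets={
        "S3Targets": [
            {
                "Path": "s3://a/b/c",
                # Glob patterns are evaluated relative to the include path.
                "Exclusions": ["**/_temporary/**"],
            }
        ]
    },
    # Counting from the bucket: a=1, b=2, c=3, table folders=4.
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {
            "TableGroupingPolicy": "CombineCompatibleSchemas",
            "TableLevelConfiguration": 4,
        },
    }),
)
```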
1 answer · 0 votes · 74 views · asked a month ago

Set correct Table level, Include Path and Exclude path.

Hello all, I have an S3 bucket with the following path: s3://a/b/c/products. Inside the products folder I have one folder for each version (each version is a database snapshot of the products table, obtained on a weekly basis by a workflow):

1. /version_0
    1. _temporary
        1. 0_$folder$
    2. part-00000-c5... ...c000.snappy.parquet
2. /version_1
    1. _temporary
        1. 0_$folder$
    2. part-00000-29... ...c000.snappy.parquet

I have created a crawler (Include Path set to the same path mentioned above, s3://a/b/c/products) with the intention of merging all the versions together into one table. The schemas of the different partitions are always the same, and the structure of the different partitions is also always the same. I have tried different Table levels (4, 5 and 6) in the "Grouping behavior for S3 data" section of the crawler settings, but it always created multiple tables (one table for each version). The _temporary folder seems to be generated automatically by the workflow; I don't know if I have to add it to the exclude path in order for this to work.

**What should be the correct Include path, Exclude path and Table level so that I create only ONE table merging all versions together?**

I have checked the general documentation links about this issue, but could you please provide an actual solution?
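As a reference point for where the table level sits when set programmatically rather than in the console, here is a minimal, untested boto3 sketch; the crawler name is a placeholder, and the level counting (bucket = level 1) is my reading of the docs:

```python
import json
import boto3

# Sketch only: update an existing crawler so everything under
# s3://a/b/c/products is grouped into a single table. Counting from the
# bucket: a=1, b=2, c=3, products=4, so level 4 should make version_0/ and
# version_1/ partitions of one "products" table. The name is a placeholder.
glue = boto3.client("glue")

glue.update_crawler(
    Name="products-crawler",  # placeholder
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {
            "TableGroupingPolicy": "CombineCompatibleSchemas",
            "TableLevelConfiguration": 4,
        },
    }),
)
```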
1 answer · 0 votes · 30 views · asked a month ago

Update a Glue trigger via CDK code

Hi team, I'm trying to use a CDK custom resource (AwsCustomResource) to update the EventBatchingCondition of a Glue trigger, as this is not supported natively by CloudFormation. This is my code:
```
new AwsCustomResource(this, "updateEventBatching", {
  policy: AwsCustomResourcePolicy.fromSdkCalls({
    resources: AwsCustomResourcePolicy.ANY_RESOURCE,
  }),
  onCreate: {
    service: "Glue",
    action: "updateTrigger",
    parameters: {
      Name: myGlueTrigger.name, // The name of the trigger to update.
      TriggerUpdate: {
        EventBatchingCondition: {
          BatchSize: "20",
          BatchWindow: "900",
        },
      },
    },
    physicalResourceId: PhysicalResourceId.of("updateEventBatching_id"),
  },
  onUpdate: {
    service: "Glue",
    action: "updateTrigger ",
    parameters: {
      Name: myGlueTrigger.name, // The name of the trigger to update.
      TriggerUpdate: {
        EventBatchingCondition: {
          BatchSize: "20",
          BatchWindow: "300",
        },
      },
    },
    physicalResourceId: PhysicalResourceId.of("updateEventBatching_id"),
  },
});
```
I followed this article to grab the service name, action, and parameters: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Glue.html#updateTrigger-property When I try to deploy I get the error below; I'm not sure whether the service action is incorrect. I'm also not sure what I should put in the `physicalResourceId` parameter in this case, so I just put a static string.
```
...
Error: Action 'glue:UpdateTrigger ' is invalid. An action string consists of a service namespace, a colon, and the name of an action. Action names can include wildcards.
    at new PolicyStatement (C:\xxxx\node_modules\aws-cdk-lib\aws-iam\lib\policy-statement.js:1:1371)
    at new AwsCustomResource (C:\xxxx\node_modules\aws-cdk-lib\custom-resources\lib\aws-custom-resource\aws-custom-resource.js:1:4109)
    at new CdkGlueEdwLoadStack (C:\xxxxxx\lib\cdk-glue-edw-load-stack.ts:634:5)
    at Object.<anonymous> (C:\xxxxx\bin\index.ts:115:1)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module.m._compile (C:\xxxxx\node_modules\ts-node\src\index.ts:1056:23)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Object.require.extensions.<computed> [as .ts] (C:\xxxxx\node_modules\ts-node\src\index.ts:1059:12)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Function.Module._load (node:internal/modules/cjs/loader:822:12)
```
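For reference, the SDK call that the custom resource wraps maps to Glue's UpdateTrigger; below is a minimal, untested boto3 sketch of the same update (the trigger name is a placeholder; note that in the boto3/HTTP API, BatchSize and BatchWindow are integers):

```python
import boto3

# Sketch only: the UpdateTrigger call expressed directly in boto3 rather than
# through AwsCustomResource. The trigger name is a placeholder.
glue = boto3.client("glue")

glue.update_trigger(
    Name="my-glue-trigger",  # placeholder
    TriggerUpdate={
        "EventBatchingCondition": {
            "BatchSize": 20,     # events to accumulate before firing
            "BatchWindow": 900,  # max seconds to wait before firing
        },
    },
)
```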
2 answers · 0 votes · 58 views · asked a month ago

MSK Connect Postgres connector fails when fetching Glue Avro schema details

I have a problem using the MSK Postgres Debezium connector with Glue Schema Registry Avro serialisation: I'm getting "connect timed out" errors to GSR. The logs are as follows:
```
[Worker-051272e114b69c525] [2022-08-17 08:47:55,387] ERROR [route-events-connector|task-0] WorkerSourceTask{id=route-events-connector-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:191)
...
com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter.fromConnectData(AWSKafkaAvroConverter.java:97)
[Worker-051272e114b69c525] at org.apache.kafka.connect.storage.Converter.fromConnectData(Converter.java:63)
[Worker-051272e114b69c525] at org.apache.kafka.connect.runtime.WorkerSourceTask.lambda$convertTransformedRecord$2(WorkerSourceTask.java:313)
[Worker-051272e114b69c525] at com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter.fromConnectData(AWSKafkaAvroConverter.java:95)
[Worker-051272e114b69c525] ... 15 more
[Worker-051272e114b69c525] Caused by: com.amazonaws.services.schemaregistry.exception.AWSSchemaRegistryException: Failed to get schemaVersionId by schema definition for schema name = key-schema
com.amazonaws.services.schemaregistry.common.AWSSchemaRegistryClient.getSchemaVersionIdByDefinition(AWSSchemaRegistryClient.java:144)
[Worker-051272e114b69c525] ... 28 more
[Worker-051272e114b69c525] Caused by: java.net.SocketTimeoutException: connect timed out
[Worker-051272e114b69c525] at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
```
The GSR-related converter config:
```
...
key.converter.region=eu-central-1
key.converter.registry.name=my-schema-registry
key.converter.schemaAutoRegistrationEnabled=true
key.converter.schemaName=key-schema
key.converter.avroRecordType=GENERIC_RECORD
key.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter.region=eu-central-1
value.converter.registry.name=my-schema-registry
value.converter.schemaAutoRegistrationEnabled=true
value.converter.schemaName=value-schema
value.converter.avroRecordType=GENERIC_RECORD
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
```
We have already configured MSK Connect JSON Postgres connectors, which are working fine and publishing data to MSK topics. Has anyone successfully configured MSK Connect with Glue Schema Registry for Avro serialization? Thanks.
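The converter looks up schema versions by calling the Glue API endpoint in the configured region, so a connect timeout usually points to the connector's network path rather than the registry itself. A minimal, untested boto3 sketch for checking reachability from a host in the same subnets as the connector (region and registry taken from the config above):

```python
import boto3

# Sketch only: run from an instance in the same subnets/security groups as
# the MSK Connect connector to confirm the Glue Schema Registry endpoint in
# eu-central-1 is reachable (e.g. via NAT or a Glue VPC interface endpoint).
glue = boto3.client("glue", region_name="eu-central-1")

# A timeout here would mirror the connector's SocketTimeoutException.
for registry in glue.list_registries()["Registries"]:
    print(registry["RegistryName"])
```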
1 answer · 0 votes · 76 views · asked a month ago

How to use a Glue job bookmark to read MongoDB data and track the last processed row using the id column

I have implemented an AWS Glue job bookmark to read data from MongoDB and write it to an S3 bucket, but every time we run the script it writes all of the data again to a separate file. Below is my code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import time
import logging
import urllib
from pymongo import MongoClient
import sys
import nad_config
from datetime import date

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

#Production DB
mongo_uri = "mongodb://ip_1_2_3_4.ap-south-1.compute.internal:27017/test?replicaSet=repomongo"
list = ['glue_bookmark']
today = date.today()
folder_name = today.strftime("%d-%m-%Y")

for i in list:
    org_id = i[12:18]
    read_mongo_options = 'read_mongo_options_' + org_id
    collection_name = i
    dynamic_frame = 'dynamic_frame' + org_id
    read_mongo_options = {
        "uri": mongo_uri,
        "database": "test",
        "collection": "test",
        "username": "test",
        "password": "test",
        "partitioner": "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "id"}
    sub_folder_name = org_id
    final_folder_path = folder_name + '/test/'
    datasource0 = glueContext.create_dynamic_frame_from_catalog(
        database=catalogDB,
        table_name=catalogTable,
        connection_type="mongodb",
        connection_options=read_mongo_options,
        transformation_ctx="datasource0",
        additional_options={"jobBookmarkKeys": ["id"], "jobBookmarkKeysSortOrder": "asc"})
    datasink1 = glueContext.write_dynamic_frame.from_options(
        frame=datasource0,
        connection_type="s3",
        connection_options={"path": "s3://aws-glue-assets-123456-ap-south-1/" + final_folder_path},
        format="json",
        transformation_ctx="datasink1")

job.commit()
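One detail not visible in the script itself: the transformation_ctx and jobBookmarkKeys settings only take effect when the job run actually has bookmarks enabled. A minimal, untested boto3 sketch of starting a run that way (the job name is a placeholder):

```python
import boto3

# Sketch only: job bookmarks also have to be enabled on the run itself,
# otherwise every run re-reads and re-writes the full dataset. The job name
# is a placeholder.
glue = boto3.client("glue")

glue.start_job_run(
    JobName="mongo-to-s3-bookmark-job",  # placeholder
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```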
1 answer · 0 votes · 46 views · asked a month ago

Data Mesh on AWS Lake Formation

Hi, I'm building a data mesh in AWS Lake Formation. The idea is to have 4 accounts:

account 0: main account
account 1: central data governance
account 2: data producer
account 3: data consumer

I have been looking for information about how to implement the mesh in AWS, and I'm following some tutorials that are very similar to what I'm doing:

https://catalog.us-east-1.prod.workshops.aws/workshops/78572df7-d2ee-4f78-b698-7cafdb55135d/en-US/lakeformation-basics/cross-account-data-mesh
https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/
https://aws.amazon.com/blogs/big-data/build-a-data-sharing-workflow-with-aws-lake-formation-for-your-data-mesh/

However, after creating the bucket and uploading some CSV data to it (in the producer account), I don't know whether I first have to register it in the Glue catalog in the producer account, or whether I just do it in Lake Formation as described here: https://catalog.us-east-1.prod.workshops.aws/workshops/78572df7-d2ee-4f78-b698-7cafdb55135d/en-US/lakeformation-basics/databases (does this depend on whether one uses Glue permissions or Lake Formation permissions in the Lake Formation configuration?)

In fact, I created the database and the table in Glue first, and when I then go to the Databases and Tables sections in Lake Formation, the database and table created from Glue appear there without my doing anything. Even if I disable the options "Use only IAM access control for new databases" and "Use only IAM access control for new tables in new databases", both the database and the table still appear there.

Do you know whether Glue and Lake Formation share the data catalog, and whether I'm doing this correctly?

Thanks, John
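For reference, the two "Use only IAM access control" checkboxes correspond to the default permissions granted to IAMAllowedPrincipals in the account's data lake settings; a minimal, untested boto3 sketch for inspecting them in the producer account (the region is a placeholder):

```python
import boto3

# Sketch only: read the Lake Formation data lake settings. The "Use only IAM
# access control" checkboxes map to default ALL grants to IAMAllowedPrincipals
# on new databases and tables.
lakeformation = boto3.client("lakeformation", region_name="us-east-1")  # region is a placeholder

settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]
print(settings.get("CreateDatabaseDefaultPermissions"))
print(settings.get("CreateTableDefaultPermissions"))
```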
1 answer · 0 votes · 50 views · asked a month ago