
Questions tagged with AWS Data Pipeline


0 answers · 0 votes · 30 views

Data Pipeline error when using RegEx data format

I'm working on a sample to output an Aurora query to a fixed-width file (each column is converted to a specific width in the file, regardless of column data length), but every attempt to use the RegEx data format results in a `This format cannot be written to` error. **Using the preset CSV or TSV formats is successful.** I'm currently outputting the Aurora query to CSV first (stored in S3), then pulling that file and attempting to do the conversion via RegEx. I'm following the example at https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-regex.html but for some reason it's not working for me. **Does anyone have any thoughts on what I'm missing or other things I could try to get this working?**

## Aurora

For this sample, it's just a simple 3-column table: INT, VARCHAR, VARCHAR.

## CSV

Again, really simple:

> 1,xxxxxxxxxxx,yyyyyyyy
>
> 2,xxxxxxxxxxx,yyyyyyyyy
>
> 3,xxxxxxxxx,yyyyyyyy

## RegEx

* Input Reg Ex: `(.)`
  * The above is just the most minimal grouping I could come up with after trying multiple others (including escaping the `\` and `,` in the example below).
  * The real regex I would expect to use is `([^,]*)\,([^,]*)\,([^,]*)\n`
* Output Format: `%1$s`
  * Ideally, I'd expect to see something like `%1$15s %2$15s %3$15s`
* Columns: I tried every combination, both with and without columns, which made no difference.

## Pipeline Definition

```
{
  "objects": [
    {
      "s3EncryptionType": "NONE",
      "maximumRetries": "3",
      "dataFormat": { "ref": "DataFormatId_A0LHb" },
      "filePath": "s3://xxxx, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "Fixed Width File",
      "id": "DataNodeId_b4YBg",
      "type": "S3DataNode"
    },
    {
      "s3EncryptionType": "NONE",
      "dataFormat": { "ref": "DataFormatId_fIUdS" },
      "filePath": "s3://xxxxx, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3Bucket",
      "id": "DataNodeId_0vkYO",
      "type": "S3DataNode"
    },
    {
      "output": { "ref": "DataNodeId_0vkYO" },
      "input": { "ref": "DataNodeId_axAKH" },
      "name": "CopyActivity",
      "id": "CopyActivityId_4fSv7",
      "runsOn": { "ref": "ResourceId_B2kdU" },
      "type": "CopyActivity"
    },
    {
      "inputRegEx": "(.)",
      "name": "Fixed Width Format",
      "id": "DataFormatId_A0LHb",
      "type": "RegEx",
      "outputFormat": "%1$s"
    },
    {
      "subnetId": "subnet-xxxx",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "securityGroupIds": "sg-xxx",
      "instanceType": "t1.micro",
      "actionOnTaskFailure": "terminate",
      "name": "EC2",
      "id": "ResourceId_B2kdU",
      "type": "Ec2Resource",
      "terminateAfter": "5 Minutes"
    },
    {
      "name": "CSV Format",
      "id": "DataFormatId_fIUdS",
      "type": "CSV"
    },
    {
      "output": { "ref": "DataNodeId_b4YBg" },
      "input": { "ref": "DataNodeId_0vkYO" },
      "dependsOn": { "ref": "CopyActivityId_4fSv7" },
      "maximumRetries": "3",
      "name": "Change Format",
      "runsOn": { "ref": "ResourceId_B2kdU" },
      "id": "CopyActivityId_YytI2",
      "type": "CopyActivity"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://xxxxx/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "connectionString": "jdbc:mysql://xxxxxxx",
      "*password": "xxxx",
      "name": "Aurora",
      "id": "DatabaseId_HL9uz",
      "type": "JdbcDatabase",
      "jdbcDriverClass": "com.mysql.jdbc.Driver",
      "username": "xxxx"
    },
    {
      "database": { "ref": "DatabaseId_HL9uz" },
      "name": "Aurora Table",
      "id": "DataNodeId_axAKH",
      "type": "SqlDataNode",
      "selectQuery": "select * from policy",
      "table": "policy"
    }
  ],
  "parameters": []
}
```
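
Editor's note: in the definition above, the `RegEx` format (`DataFormatId_A0LHb`) is attached to the output node ("Fixed Width File") of the second CopyActivity, and the wording of the error suggests this data format may only be usable for reading a data node, not for writing one. For reference, a minimal sketch of what the format object would look like with the intended regex and fixed-width output format filled in, in the style of the linked docs page; the column names and types are illustrative assumptions, not taken from the real table:

```
{
  "id": "DataFormatId_A0LHb",
  "name": "Fixed Width Format",
  "type": "RegEx",
  "inputRegEx": "([^,]*),([^,]*),([^,]*)",
  "outputFormat": "%1$15s %2$15s %3$15s",
  "column": [
    "id INT",
    "col_a STRING",
    "col_b STRING"
  ]
}
```

If the format does turn out to be read-only, two things worth testing are referencing a format like this from the input (CSV) node instead, or doing the padding in the `selectQuery` itself (for example with MySQL `LPAD`/`RPAD`) so that the output node can stay on the writable CSV format.
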
1 answer · 0 votes · 12 views · asked 2 months ago

KDA Studio app keeps throwing a Glue getFunction error, but I didn't use any Glue function

I followed [this AWS blog post](https://aws.amazon.com/blogs/aws/introducing-amazon-kinesis-data-analytics-studio-quickly-interact-with-streaming-data-using-sql-python-or-scala/) to create a KDA app, changing the output sink to S3 instead of a data stream. Everything works and I can see the results in S3. However, in the KDA error logs, Glue keeps throwing a getFunction error almost every second while the deployed app runs. I only use Glue to define the input/output schemas and don't use any Glue function, so I wonder where this comes from. Please help take a look.

```
@logStream kinesis-analytics-log-stream
@message {"locationInformation":"com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.getFunction(GlueMetastoreClientDelegate.java:1915)","logger":"com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate","message":"software.amazon.kinesisanalytics.shaded.com.amazonaws.services.glue.model.EntityNotFoundException: Cannot find function. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException; Request ID: <Request ID>; Proxy: null)","threadName":"Thread-20","applicationARN":<applicationARN>,"applicationVersionId":"1","messageSchemaVersion":"1","messageType":"ERROR"}
@timestamp <timestamp>
applicationARN <applicationARN>
applicationVersionId 1
locationInformation com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.getFunction(GlueMetastoreClientDelegate.java:1915)
logger com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate
message software.amazon.kinesisanalytics.shaded.com.amazonaws.services.glue.model.EntityNotFoundException: Cannot find function. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException; Request ID: <Request ID>; Proxy: null)
messageSchemaVersion 1
messageType ERROR
threadName Thread-20
```
0 answers · 0 votes · 9 views · asked 3 months ago

Can Data Pipelines be used for running Spark jobs on EMR 6.5.0?

Hi, I have a problem in that I make heavy use of EMR, and I orchestrate it with Data Pipeline: multiple daily runs are automated, and EMR clusters are launched and terminated on conclusion. However, I'd now like to use EMR 6.X.X releases via Data Pipeline, rather than the EMR 5.X.X releases I'm currently using. This is for two main reasons:

* Security compliance: the latest EMR 6.X.X releases have fewer vulnerabilities than the latest EMR 5.X.X releases
* Performance/functionality: EMR 6.X.X releases perform much better than EMR 5.X.X releases for what I'm doing, and have functionality I prefer to use

However, the current documentation for Data Pipeline says the following regarding EMR versions:

> AWS Data Pipeline only supports release version 6.1.0 (emr-6.1.0).

Version 6.1.0 of EMR was last updated on Oct 15, 2020, so it's pretty old. Now, if I try to use an EMR version > 6.1.0 with Data Pipeline, I hit the issue that has already been raised [here](https://forums.aws.amazon.com/thread.jspa?threadID=346359), i.e. during initial EMR cluster bring-up via the Data Pipeline there's a failure that renders the cluster unusable. It looks like a malformed attempt to create a symbolic link to a jar by one of the AWS scripts:

```
++ find /usr/lib/hive/lib/ -name 'opencsv*jar'
+ open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar'
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo ln -s /usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar /mnt/taskRunner/open-csv.jar
ln: target ‘/mnt/taskRunner/open-csv.jar’ is not a directory
Command exiting with ret '1'
```

So, I guess my questions are:

1. Is there a way to work around the above so that Data Pipeline can be used to launch EMR 6.5.0 clusters for Spark jobs?
2. If there isn't, is there a different way of automating runs of EMR 6.5.0 clusters, other than writing my own script and scheduling *that* to bring up the EMR cluster and add the required jobs/steps?

Thanks.
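
Editor's note on the trace above: the failure is ordinary `ln` semantics rather than anything Spark- or EMR-specific. Because `find` matched two `opencsv*` jars, the generated command passes two source files to `ln -s`, and in that form GNU `ln` requires the final argument to be an existing directory. A minimal sketch with illustrative paths:

```
# Two sources and a non-directory target: this is the shape of the failing line,
# and GNU ln rejects it with "target ... is not a directory".
ln -s opencsv-2.3.jar opencsv-3.9.jar /mnt/taskRunner/open-csv.jar

# A single source works, which is presumably the case the setup script was written for,
# i.e. images where find returns exactly one opencsv jar.
ln -s opencsv-2.3.jar /mnt/taskRunner/open-csv.jar
```

So any workaround would come down to making the setup script see (or tolerate) a single `opencsv*` jar; whether that can be done in a supported way from a Data Pipeline definition is exactly what question 1 above is asking.
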
0 answers · 0 votes · 32 views · asked 5 months ago
3 answers · 0 votes · 51 views · asked 7 months ago

Data Pipeline stops processing files in S3 bucket

I have a Data Pipeline which reads CSV files from an S3 bucket and copies the data into an RDS database. I specify the bucket/folder name and it goes through each CSV file in the bucket/folder and processes it. When it is done, a ShellCommandActivity moves the files to another 'folder' in the S3 bucket. That's how it works in testing.

With the real data it just stops after a few files. The last line in the logs is

`07 Dec 2021 09:57:55,755 [INFO] (TaskRunnerService-resource:df-1234xxx1_@Ec2Instance_2021-12-07T09:53:00-0) df-1234xxx1 amazonaws.datapipeline.connector.s3.RetryableS3Reader: Reopening connection and advancing 0`

The logs show that it usually downloads the CSV file, then writes the 'Reopening connection and advancing 0' line, then deletes a temp file, then moves on to the next file. But on the seventh file it just stops on 'Reopening connection and advancing 0'. It isn't the next file that is the problem, as it processes fine on its own.

I've already tried making the files smaller: originally it was stopping on the second file, but now that the file sizes are about 1.7 MB it gets through six of them before it stops. The status of each task (both DataLoadActivity and ShellCommandActivity) shows 'CANCELLED' after one attempt (3 attempts are allowed) and there is no error message. I'm guessing this is some sort of timeout. How can I make the pipeline reliable so that it processes all of the files?
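
Editor's note: since the tasks end up CANCELLED with no error after a single attempt, the timeout theory is plausible, and Data Pipeline activities accept an optional `attemptTimeout` field that bounds how long one attempt may run. A hedged fragment showing where such a limit would sit; the ids, refs, activity type and the `2 Hours` value are placeholders, not a recommendation:

```
{
  "id": "DataLoadActivity",
  "name": "DataLoadActivity",
  "type": "CopyActivity",
  "attemptTimeout": "2 Hours",
  "maximumRetries": "3",
  "input": { "ref": "S3InputDataNode" },
  "output": { "ref": "RdsOutputDataNode" },
  "runsOn": { "ref": "Ec2Instance" }
}
```

If the cancellation is coming from somewhere else (for example the resource's `terminateAfter`), this field won't help, so it is only a way to test the hypothesis.
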
2 answers · 0 votes · 16 views · asked 7 months ago

Unable to validate an instance profile with the role DataPipelineDefault

Hi, I am facing a weird issue while trying to set up a **DataPipeline** via **CloudFormation**. The CloudFormation YAML file is used to create the two needed roles (**DataPipelineDefaultRole** and **DataPipelineDefaultResourceRole**) and the DataPipeline itself, as described in the AWS doc: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-datapipeline-pipeline.html

I am using exactly that example, including the creation of the two roles by strictly following this AWS tutorial: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-iam-roles.html

**To make it short:** if I create the two roles via the **AWS Web Console** and then run the CloudFormation process, everything works as expected (the DataPipeline and all needed resources are properly created). But if I try to include the creation of the roles in the **CloudFormation** file and skip the Web Console, then I get the error below:

```
Pipeline Definition failed to validate because of following Errors: [{ObjectId = 'EmrClusterForBackup', errors = [Unable to validate an instance profile with the role name'DataPipelineDefaultResourceRole'.Please create an EC2 instance profile with the same name as your resource role]}] and Warnings: [{ObjectId = 'Default', warnings = ['pipelineLogUri'is missing. It is recommended to set this value on Default object for better troubleshooting.]}]
```

I have spent hours today trying to debug this issue and can guarantee that the generated roles are identical whether they are created via the Web Console or the CloudFormation definition. I have extracted their JSON definitions via the **iam get-role** command in both cases and they are indeed the same.

Can someone help out here?

Best,
M.

Edited by: tundraspar on Feb 6, 2019 1:22 PM
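
Editor's note: the validation error is asking for an EC2 *instance profile*. Creating an EC2-trusted role through the IAM console creates a matching instance profile automatically, but CloudFormation's `AWS::IAM::Role` does not, which would explain why the console path works and the template-only path fails. A hedged sketch of the extra resource the YAML template would typically need; the logical IDs are assumptions about how the roles are named in the template:

```
  DataPipelineResourceInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      # The error asks for an instance profile with the same name
      # as the resource role referenced by the pipeline objects.
      InstanceProfileName: DataPipelineDefaultResourceRole
      Roles:
        - !Ref DataPipelineDefaultResourceRole   # assumes this is the role's logical ID
```

The pipeline's `resourceRole` value is looked up as an instance profile name, which appears to be why the error insists the two names match.
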
2 answers · 0 votes · 106 views · asked 3 years ago