Failure migrating large table with DMS 3.5.1 (Postgres 12 to Redshift)

0

I have a large postgres table which is having issues replicating on DMS v3.5.1. The same task (and endpoints) work fine with v3.4.7, so I have rolled back to that for now. The job is a full load & ongoing replication of changes.

  • Destinaton: Redshift
  • Source: Postgres 12.x
  • 700M+ rows
  • Rows 128b to 1Kb, average ~256b
  • All other tables are fine, one has more records but is lighter

The error is not particularly helpful, but is likely accurate - session timeout. Are there any known issues with v3.5.1 and old postgres, or any debugging tips? I have an identical task that works, but that database is 1/5th the size for now, and PG 15.2.

I have attempted changing the LoadTimeout variable on the Redshift endpoint to 12 hours (currently takes ~5.5 hours to replicate), to no noticeable effect. It still crashed after 4.5 hours.

[TARGET_LOAD     ]E:  RetCode: SQL_ERROR  SqlState: 57P01 NativeError: 30 Message: 
    [Amazon][Amazon Redshift] (30) Error occurred while trying to execute a query: 
    [SQLState 57P01] FATAL:  terminating connection due to session timeout [1022500]  (ar_odbc_stmt.c:5007)
[TARGET_LOAD     ]E:  Failed to load public.table_name from S3, file name: LOAD0000123E.csv [1022509]  (cloud_imp.c:2518)
[TARGET_LOAD     ]E:  Failed to copy data of file /rdsdbdata/data/tasks/DK.../cloud/23/LOAD0000123E.csv 
[PERFORMANCE     ]I:  End load handler time for public.table_name = 480312 microseconds  (endpointshell.c:2992)
[PERFORMANCE     ]I:  Total load time for public.table_name = 13947613253 microseconds  (endpointshell.c:2994)
...
[TASK_MANAGER    ]W:  Table 'public'.'table_name' was errored/suspended (subtask 1 thread 1). 
    Failed (retcode -1) to execute statement; RetCode: SQL_ERROR  SqlState: 57P01 NativeError: 30 
    Message: [Amazon][Amazon Redshift] (30) Error occurred while trying to execute a query: 
    [SQLState 57P01] FATAL:  terminating connection due to session timeout ; 
    Failed to load public.table_name from S3, file name: LOAD0000123E.csv; 
    Failed to copy data of file /rdsdbdata/data/tasks/DK.../cloud/23/LOAD0000123E.csv to database; 
    Handling End of table 'public'.'table_name' loading failed by subtask 1 thread 1  (replicationtask.c:3023)
1 Answer
0

Hi Andrew,

Section "Identify the cause of long run times for commands" of https://repost.aws/knowledge-center/dms-error-canceling-statement-timeout may help you locate command that failed to run during the timeout period and also find locks on the associated tables.

This information will probably be very helpful to the DMS service team if, at the end, you need to open a ticket to AWS Support. And given what you describe, it seems highly probable.

Best,

Didier

profile pictureAWS
EXPERT
answered 9 months ago
  • Thanks for the response! I had not considered checking the PG logs in RDS as I figured it was a Redshift/Client Library issue. Depending on time I may spin up a new Redshift and try it again to see if I can catch the actual errors and report them properly. Retention was low so I do not have the PG logs covering that time window.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions