How do I troubleshoot issues with major version upgrades in Amazon RDS for PostgreSQL or Aurora PostgreSQL-Compatible?


My engine version upgrade for Amazon Relational Database Service (Amazon RDS) for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition is stuck or has failed.

Short description

Minor version upgrades are compatible with the earlier and later minor releases of the same major version. However, major version upgrades contain database changes that aren't backward-compatible with existing applications. These upgrades might change the internal format of system tables, data files, and data storage. Amazon RDS uses pg_upgrade to perform major version upgrades. For more information, see pg_upgrade on the PostgreSQL website.

During a major version upgrade, Amazon RDS runs a precheck procedure that identifies issues that might cause the upgrade to fail. The precheck looks for potentially incompatible conditions across all databases. If Amazon RDS identifies an issue during the precheck process, then it creates a log event for the failed precheck. The log event includes the file name, timestamp, and reason for the upgrade failure. For information about the precheck process for all databases, check the pg_upgrade_precheck.log log. However, for engine-specific issues, you must check the database log files for Amazon RDS for PostgreSQL or for Aurora PostgreSQL.

Resolution

The pg_upgrade utility that performs major version upgrades produces two logs: pg_upgrade_internal.log and pg_upgrade_server.log. Amazon RDS appends a timestamp to the file name for these logs. View these logs to get more information about the issues and errors encountered during the upgrade. For more information, see Monitoring Amazon RDS log files or Monitoring Amazon Aurora log files.

Upgrade takes a long time

Check for pending maintenance activities

Amazon RDS automatically applies pending maintenance activities, such as an operating system (OS) patch, as part of an engine version upgrade on your RDS instance. Amazon RDS applies the pending activity first, and then upgrades the engine version. If Amazon RDS must perform OS maintenance activities, then the upgrade takes longer.

If your RDS instance is in a Multi-AZ deployment, then OS maintenance results in a failover. In a Multi-AZ deployment, Amazon RDS typically takes backups from the secondary instance. After a failover, the new secondary instance doesn't have a recent backup to base an incremental backup on, so Amazon RDS takes a full backup instead. A full backup can take a long time, especially if the database is large.

To avoid this issue, search for pending maintenance activities for your RDS DB instance or for your Aurora DB instance. Or, run the following AWS Command Line Interface (AWS CLI) describe-pending-maintenance-actions command on your instance:

aws rds describe-pending-maintenance-actions --resource-identifier example-arn

Note: Replace example-arn with your DB instance ARN. If you receive errors when you run AWS CLI commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Complete pending maintenance activities before you perform the database engine version upgrades.

Create a snapshot before the upgrade

It's a best practice to create a snapshot of the RDS DB instance or of the Aurora DB cluster before you upgrade the version. If you already turned on backups for your instance, then Amazon RDS automatically creates a snapshot as part of the upgrade process. With a snapshot, you reduce the upgrade process time because Amazon RDS only needs to create an incremental backup as part of the upgrade.

Wait for the read replica to upgrade

When you perform a major version upgrade of your primary DB instance, Amazon RDS automatically upgrades all the read replicas in the same AWS Region. After the upgrade workflow starts, the read replicas wait for pg_upgrade to complete successfully on the primary DB instance. Then, the primary DB instance upgrade waits for the read replica upgrades to complete. You experience an outage until all upgrades are complete. If your acceptable downtime window is small, promote or delete your read replicas before the upgrade, and then recreate them after the upgrade completes.

For Aurora DB clusters, pg_upgrade upgrades the writer instance first. Then, each reader DB instance experiences an outage as pg_upgrade upgrades it to the new major version. Note that if you upgrade an Aurora global database, then there are additional requirements and processes.

Resolve long-running transactions or high workload before the upgrade

Long-running transactions or high workload can increase the time Amazon RDS takes to shut down the database and upgrade the database engine.

To identify long-running transactions, run the following query:

SELECT pid, datname, application_name, state, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query NOT ILIKE '%pg_stat_activity%'
      AND usename != 'rdsadmin'
ORDER BY query_start DESC;

If you identify a long-running query, then use pg_cancel_backend or pg_terminate_backend to end it. For more information about these functions, see Server signaling functions on the PostgreSQL website.
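For example, the following statements end a session by its process ID. The pid value 12345 is a placeholder; use the pid values that the preceding query returns:

SELECT pg_cancel_backend(12345);      -- cancels only the currently running query
SELECT pg_terminate_backend(12345);   -- ends the entire session if the query doesn't cancel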

Make sure that you have enough compute capacity

The pg_upgrade utility can be compute-intensive. It's a best practice to perform a dry-run upgrade before you upgrade your production databases to check that you have enough compute or memory capacity. The dry-run upgrade also checks whether you might encounter precheck or upgrade errors. You can restore a snapshot of the production instance and perform a dry run with the same instance class as that of the production database.

Upgrade fails

Check for unsupported DB instance classes

If the instance class of your DB instance isn't compatible with the PostgreSQL version that you're upgrading to, then the upgrade fails. Check the compatibility of the engine version with the instance class for Amazon RDS or for Amazon Aurora.

Check for open prepared transactions

If there are open prepared transactions on the database, then the upgrade fails. You receive the "There are uncommitted prepared transactions" error in the pg_upgrade.log file. Commit or roll back all open prepared transactions before you start an upgrade.

To check whether there are open prepared transactions on your instance, run the following query:

SELECT count(*) FROM pg_catalog.pg_prepared_xacts;
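If the count is greater than zero, then list the prepared transactions and commit or roll back each one by its global identifier, depending on whether you want to keep or discard its changes. The value example_gid is a placeholder; connect to the database that's listed for each transaction and use the gid values that pg_prepared_xacts returns:

SELECT gid, prepared, owner, database FROM pg_catalog.pg_prepared_xacts;
ROLLBACK PREPARED 'example_gid';
COMMIT PREPARED 'example_gid';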

Make sure that you use a supported data type

The pg_upgrade utility can't upgrade databases that include table columns that use reg* object identifier (OID) referencing data types, except for regclass, regrole, and regtype. If a table column uses the regcollation, regconfig, regdictionary, regnamespace, regoper, regoperator, regproc, or regprocedure data type, then the upgrade fails.

To resolve this issue, remove all uses of reg* data types, except for regclass, regrole, and regtype, before you upgrade the database engine. To check for unsupported reg* data types in your tables, run the following query:

SELECT count(*) FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n, pg_catalog.pg_attribute a  WHERE c.oid = a.attrelid
      AND NOT a.attisdropped
      AND a.atttypid IN ('pg_catalog.regproc'::pg_catalog.regtype,
                         'pg_catalog.regprocedure'::pg_catalog.regtype,
                         'pg_catalog.regoper'::pg_catalog.regtype,
                         'pg_catalog.regoperator'::pg_catalog.regtype,
                         'pg_catalog.regconfig'::pg_catalog.regtype,
                         'pg_catalog.regdictionary'::pg_catalog.regtype)
      AND c.relnamespace = n.oid
      AND n.nspname NOT IN ('pg_catalog', 'information_schema');
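The preceding query returns only a count. To identify the specific tables and columns that use the unsupported types, you can run a variation like the following sketch, which is based on the same catalogs:

SELECT n.nspname AS schema_name, c.relname AS table_name, a.attname AS column_name, a.atttypid::regtype AS data_type
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
JOIN pg_catalog.pg_attribute a ON c.oid = a.attrelid
WHERE NOT a.attisdropped
      AND a.atttypid IN ('pg_catalog.regproc'::pg_catalog.regtype,
                         'pg_catalog.regprocedure'::pg_catalog.regtype,
                         'pg_catalog.regoper'::pg_catalog.regtype,
                         'pg_catalog.regoperator'::pg_catalog.regtype,
                         'pg_catalog.regconfig'::pg_catalog.regtype,
                         'pg_catalog.regdictionary'::pg_catalog.regtype)
      AND n.nspname NOT IN ('pg_catalog', 'information_schema');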

Check for logical replication slots

If your instance has logical replication slots, then you can't upgrade the instance, and you receive the following error in the pg_upgrade.log file:

"The instance could not be upgraded because one or more databases have logical replication slots. Please drop all logical replication slots and try again."

Logical replication slots are typically used for AWS Database Migration Service (AWS DMS) migrations. They're also used to replicate tables from databases to data lakes, business intelligence tools, and other targets. Make sure that you know the purpose of any logical replication slots that are in use before you decide whether you can delete them. If the logical replication slots are still in use, then don't delete them. In that case, wait to upgrade the version until you no longer need the slots and can delete them.

If you don't need the logical replication slots, then run the following queries to delete them:

SELECT * FROM pg_replication_slots;
SELECT pg_drop_replication_slot(slot_name);

Note: Replace slot_name with the name of the logical replication slot.
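Before you drop a slot, you can check whether it's active and how much write-ahead log (WAL) it retains. The following query is a sketch that applies to PostgreSQL 10 and later, where pg_current_wal_lsn and pg_wal_lsn_diff are available:

SELECT slot_name, plugin, database, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;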

Check for storage issues

If the instance runs out of space when the pg_upgrade script runs, then the upgrade fails and you receive the following error:

"pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] could not execute query: ERROR: could not create file "base/12345/12345678": No space keyword" left on device"

To resolve this issue, make sure that the instance has sufficient free storage before you start the upgrade.
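To see how much space each database uses on the instance, you can run a query similar to the following, and then compare the totals against the free storage space that Amazon CloudWatch reports for the instance:

SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;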

Check for unknown data types

You can't use the unknown data type in PostgreSQL version 10 and later. For example, if a PostgreSQL version 9.6 database uses the unknown data type, then you receive the following error when you upgrade to version 10:

"The instance could not be upgraded because the 'unknown' data type is used in user tables. Please remove all usages of the 'unknown' data type and try again."

To resolve this issue, manually remove columns that use the unknown data type or modify them to a supported data type. To find the columns in your database that use the unknown data type, run the following query:

SELECT table_schema, table_name, column_name FROM information_schema.columns WHERE data_type ILIKE 'unknown';
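For example, if the query reports a column that uses the unknown data type, you can convert it to a supported type such as text. The table and column names in the following statement are placeholders:

ALTER TABLE example_table ALTER COLUMN example_column TYPE text;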

(RDS for PostgreSQL only) Check whether there's a read replica upgrade failure

If the PostgreSQL instance has read replicas, then read replica upgrade failures might cause your primary instance upgrade to get stuck or fail. Amazon RDS sets a failed read replica to the incompatible-restore state, and then stops replication on the DB instance.

A read replica upgrade might fail for one of the following reasons:

  • The read replica can't catch up with the primary DB instance even after the wait time.
  • The read replica is in a terminal or incompatible lifecycle state, such as storage-full or incompatible-restore.
  • When the primary DB instance upgrade starts, a separate minor version upgrade is running on the read replica.
  • The read replica uses incompatible parameters.
  • The read replica can't communicate with the primary DB instance to synchronize the data folder.

To resolve this issue, delete the read replica. After the primary instance upgrade completes, create a new read replica based on the upgraded primary instance.

Make sure that your primary user name is accurate

If the primary user name starts with pg_, then the upgrade fails, and you receive the following error message:

"PreUpgrade checks failed: The instance could not be upgraded because one or more role names start with 'pg_'. Please rename all roles with names that start with 'pg_' and try again".

To resolve this issue, create another user that has the rds_superuser role and a name that doesn't start with pg_.
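For example, the following statements create a replacement administrative user and grant it the rds_superuser role. The user name and password are placeholders:

CREATE ROLE admin_user WITH LOGIN PASSWORD 'example-password';
GRANT rds_superuser TO admin_user;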

Check for incompatible parameters

The incompatible parameters error occurs when the value of a memory-related parameter, such as shared_buffers or work_mem, is too high for your configuration. This error causes the upgrade script to fail. To resolve the issue, reduce the values of these parameters in the DB parameter group, and then rerun the upgrade.
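To check the current values of these parameters on the instance before you adjust the DB parameter group, you can run a query like the following:

SELECT name, setting, unit FROM pg_settings WHERE name IN ('shared_buffers', 'work_mem');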

Update your extensions before you upgrade

Major version upgrades don't upgrade PostgreSQL extensions. If you don't update the extensions before a major version upgrade, then you receive an error in the pg_upgrade.log file similar to the following example:

"The Logs indicates that the RDS instance ''abcd'' has older version of PostGIS extension or its dependent extensions (address_standardizer,address_standardizer_data_us, postgis_tiger_geocoder, postgis_topology, postgis_raster) installed as against the current version required for the upgrade."

The preceding example error shows an issue with the PostGIS extension. To resolve this issue, run the following query to check the default and installed versions for PostGIS and its dependent extensions:

SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE installed_version IS NOT NULL
      AND (name LIKE 'postgis%' OR name LIKE 'address%');

Note: Replace postgis% and address% with the names of your extensions.

If the value for installed_version is lower than the value of default_version, then you must update PostGIS to the default version. To upgrade your extension, run the following query:

ALTER EXTENSION extension_name UPDATE TO 'default_version_number';

Note: Replace extension_name with the name of your extension and default_version_number with the default_version value.

For more information, see Upgrades of the RDS for PostgreSQL DB engine or Upgrading Amazon Aurora PostgreSQL DB clusters.

Check for issues in views caused by changes in the system catalog of the target version

Columns in certain system views vary across PostgreSQL versions. For example, you might receive an error similar to the following example:

"PreUpgrade checks failed: The instance could not be upgraded because one or more databases have views or materialized views which depend on 'pg_stat_activity'. Please drop them and try again."

This error occurs when you upgrade the database from version 9.5 to 9.6. In the preceding example, the pg_stat_activity view is causing the issue. In version 9.6, PostgreSQL replaced the waiting column with wait_event_type and wait_event columns.

Or, you might receive an error similar to the following example:

"pg_restore: from TOC entry abc; abc abcd VIEW sys_user_constraints art pg_restore: error: could not execute query: ERROR: column c.consrc does not exist LINE 18: "c"."consrc" AS "search_condition", ^ HINT: Perhaps you meant to reference the column "c.conkey" or the column "c.conbin"."

In this example, the error occurs because the structure of the pg_constraint catalog changed in PostgreSQL version 12.

To resolve these issues, drop the views that depend on system catalogs that changed in the target version.
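To find user-defined views or materialized views whose definitions reference a particular system view, such as pg_stat_activity, a text match against pg_views and pg_matviews is often enough. The following query is a sketch and doesn't catch indirect dependencies:

SELECT schemaname, viewname FROM pg_views
WHERE definition ILIKE '%pg_stat_activity%' AND schemaname NOT IN ('pg_catalog', 'information_schema')
UNION ALL
SELECT schemaname, matviewname FROM pg_matviews
WHERE definition ILIKE '%pg_stat_activity%' AND schemaname NOT IN ('pg_catalog', 'information_schema');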

Important: It's a best practice to use pg_dump to back up your views or to capture each view's definition before you drop it. After the version upgrade, you must manually recreate the views. You might need to work with your database administrator.
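For example, you can capture a view's definition before you drop it, and then use the saved definition to recreate the view after the upgrade. The view name public.example_view is a placeholder:

SELECT pg_get_viewdef('public.example_view'::regclass, true);
DROP VIEW public.example_view;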

After you upgrade

After the upgrade is complete, run ANALYZE on all user databases to refresh the pg_statistic table. A major version upgrade doesn't transfer the contents of pg_statistic to the new version. If you don't regenerate the statistics, then you might encounter slow-running queries.
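Connect to each user database and run the command. For example, the following statement regenerates statistics for all tables in the current database and prints progress output:

ANALYZE VERBOSE;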

Related information

Overview of the Aurora PostgreSQL upgrade processes
