Why did I receive a "No space left on device" or "DiskFull" error on Amazon RDS for PostgreSQL?

I have a small Amazon Relational Database Service (Amazon RDS) for PostgreSQL database. The DB instance's free storage space is decreasing, and I receive the following error: "Error message: PG::DiskFull: ERROR: could not extend file "base/16394/5139755": No space left on device. HINT: Check free disk space."

Resolution

Note: If your workload is unpredictable, then activate storage autoscaling for your instance. With storage autoscaling, Amazon RDS automatically scales your storage when free database space is low.

To monitor your storage space, check the FreeStorageSpace Amazon CloudWatch metric. Set a CloudWatch alarm for free storage space to receive a notification when the space starts to decrease. If you receive an alarm, then check for the following resources that use Amazon RDS DB instance storage:

  • Temporary tables or files that are created by PostgreSQL transactions
  • Data files
  • Write ahead logs (WAL)
  • Replication slots
  • DB logs such as error files that are retained for too long
  • Other DB or Linux files that support the consistent state of the RDS DB instance

If your DB instance consumes more than the expected storage, then take the following troubleshooting actions.

Check the size of the DB log files

By default, Amazon RDS for PostgreSQL error log files have a retention value of 4,320 minutes (three days). Large log files can use more space because of higher workloads or excessive logging. To change the retention period for system logs, use the rds.log_retention_period parameter in the DB parameter group associated with your DB instance. For example, if you set the value to 1,440, then Amazon RDS retains the logs for one day. For more information, see RDS for PostgreSQL database log files.
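
To confirm the retention value that's currently in effect, you can check the parameter from a psql session. This is a minimal check and assumes that the rds.log_retention_period parameter is visible to your session:

psql=> SHOW rds.log_retention_period;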

To reduce excessive logging, change the error reporting and logging parameters in the DB parameter group. This action results in reduced log file size. For more information, see 19.8 Error reporting and logging on the PostgreSQL website.
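
Before you change the logging parameters, you can review the values that are currently in effect with the standard pg_settings view. The following parameter list is only a sample selection:

psql=> SELECT name, setting FROM pg_settings WHERE name IN ('log_statement', 'log_min_duration_statement', 'log_min_messages', 'log_connections', 'log_disconnections');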

Check for temporary files

Temporary files are created per backend or session connection when a query's working data doesn't fit in the memory that's available to the session. To review the temporary file statistics, run the following command:

psql=> SELECT datname, temp_files AS "Temporary files", temp_bytes AS "Size of temporary files" FROM pg_stat_database ;

Important: The temp_files and temp_bytes columns in the pg_stat_database view collect aggregate statistics. Amazon RDS resets these counters only after an immediate shutdown, a server crash, or a point-in-time recovery (PITR). For this reason, it's a best practice to monitor the growth of these files in number and size, not only review the output.

Amazon RDS creates temporary files for sorts, hashes, and temporary query results. To track the creation of temporary tables or files, in a custom parameter group, set log_temp_files to 0 to log all temporary file information. By default, log_temp_files is set to -1, so Amazon RDS doesn't log temporary files. If you set log_temp_files to a positive value, then Amazon RDS logs only files that are equal to or larger than that number of kilobytes.
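
To confirm the value that your session currently sees, run the following command:

psql=> SHOW log_temp_files;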

Run EXPLAIN ANALYZE on your query to review disk sorting. In the output, check the size of the temporary files that your query creates. For more information, see Tune sorting operations in PostgreSQL with work_mem.
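
For example, a sort that doesn't fit in work_mem is reported as an external (disk) sort in the plan. The following query and table name are hypothetical:

psql=> EXPLAIN ANALYZE SELECT * FROM orders ORDER BY created_at;

In the plan, a line such as "Sort Method: external merge  Disk: ..." indicates that the sort spilled to temporary files on disk.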

Check for a constant increase in transaction logs disk usage

Check the TransactionLogsDiskUsage metric to view the disk space that transaction WAL uses. Increases in transaction log disk usage can occur for the following reasons:

  • High DB loads from writes and updates that generate additional WALs
  • Streaming read replica lag for replicas in the same AWS Region, or a read replica in storage full state
  • Replication slots

AWS Database Migration Service (AWS DMS) might create replication slots as part of logical decoding. For logical replication, the rds.logical_replication parameter is set to 1. Replication slots retain the WAL files until an external consumer consumes the files. Example consumers include pg_recvlogical, extract, transform, and load (ETL) jobs, and AWS DMS.

If you set rds.logical_replication to 1, then Amazon RDS sets the wal_level, max_wal_senders, max_replication_slots, and max_connections parameters. These parameter changes might increase WAL generation. It's a best practice to set the rds.logical_replication parameter only when you're using logical slots. If there isn't a consumer for the retained WAL files, then transaction logs disk usage increases and free storage space consistently decreases.
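
To review how the related parameters are currently set on your instance, you can query the standard pg_settings view:

psql=> SELECT name, setting FROM pg_settings WHERE name IN ('wal_level', 'max_wal_senders', 'max_replication_slots', 'max_connections');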

To check for the presence and size of replication slots, run the following queries:

  • PostgreSQL v9:

    psql=> SELECT slot_name, pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),restart_lsn)) AS replicationSlotLag, active FROM pg_replication_slots ;
  • PostgreSQL v10 and later:

    psql=> SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),restart_lsn)) AS replicationSlotLag, active FROM pg_replication_slots ;

Example output:

slot_name                                                      | replicationslotlag | active
---------------------------------------------------------------+--------------------+--------
xc36ujql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d    | 129 GB             | f
7pajuy7htthd7sqn_00013322_a27bcebf_7d0f_4124_b336_92d0fb9f5130 | 704 MB             | t
zp2tkfo4ejw3dtlw_00013322_03e77862_689d_41c5_99ba_021c8a3f851a | 624 MB             | t

Replication slots that have an active state set to f (false) aren't being consumed. In this example, the xc36ujql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d slot has an active state of f. This slot isn't actively used, but retains 129 GB of transaction log files.

To drop unused slots, run the following query:

psql=> SELECT pg_drop_replication_slot('YOUR_SLOTNAME');

Note: Replace YOUR_SLOTNAME with the slot name.

Example:

psql=> SELECT pg_drop_replication_slot('xc36ujql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d');

If an AWS DMS task that you no longer require is the consumer, then delete the task and manually drop the replication slot.

Check the status of cross-Region or same-Region read replicas

Note: You can use the following resolution for same-Region read replicas only if they're running on PostgreSQL 14.1 or later.

When you use a cross-Region or same-Region read replica, Amazon RDS creates a physical replication slot on the primary instance. A read replica failure might affect the storage space on the primary DB instance. This situation occurs when the WAL files that the read replica hasn't consumed accumulate on the primary instance. Check the OldestReplicationSlotLag metric to determine how far behind the most lagging replica is, and the TransactionLogsDiskUsage metric to see how much storage WAL data uses.

To check the status of the read replica, run the following query:

psql=> SELECT * FROM pg_replication_slots;

For more information about pg_replication_slots, see 52.19 pg_replication_slots on the PostgreSQL website. If the output has the active state set to f, then the slot isn't used for replication.

You can also use the pg_stat_replication view on the source instance to check the statistics for the replication. For more information, see Table 27.14. pg_stat_replication view on the PostgreSQL website.
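
For example, the following query returns one row for each connected standby or WAL consumer. The selected columns are standard in PostgreSQL 10 and later:

psql=> SELECT client_addr, state, sent_lsn, replay_lsn, sync_state FROM pg_stat_replication;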

Check for bloat or improper removal of dead rows

In standard PostgreSQL operation, PostgreSQL doesn't immediately remove dead rows (tuples) that a DELETE removes or that an UPDATE makes obsolete. Because of the Multi-Version Concurrency Control (MVCC) implementation, when you perform a DELETE operation, the row isn't immediately removed from the data file. Instead, PostgreSQL sets the xmax field in the row header to mark the row as deleted. An UPDATE first marks the old row version as deleted, and then inserts the new version. This approach allows concurrency with minimal locking between transactions. As a result, PostgreSQL keeps multiple row versions as part of the MVCC process.

Dead rows that aren't cleaned up stay in the data files, but remain invisible to transactions. These rows can cause disk space issues. If a table has many DELETE and UPDATE operations, then the dead tuples might use a large amount of disk space (bloat).

Use the VACUUM operation to free the storage used by dead tuples. Note that VACUUM doesn't release the free storage to the file system. To release the storage to the file system, use VACUUM FULL. Note that when you run VACUUM FULL, PostgreSQL applies an access exclusive lock to the table. This method requires extra disk space because VACUUM FULL writes a new table copy and doesn't release the existing copy until the operation is complete. It's a best practice to use VACUUM FULL only when you must reclaim a significant amount of space from the table. It's also a best practice to perform periodic vacuum or autovacuum operations on tables that you update frequently. For more information, see VACUUM on the PostgreSQL website.
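
To find the tables that are most affected first, you can rank tables by their estimated dead tuple counts with the standard pg_stat_user_tables view. The counts are statistics-based estimates, not exact measurements:

psql=> SELECT relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;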

To check the estimated number of dead tuples, use the pg_stat_all_tables view. For more information, see the Table 27.29. pg_stat_all_tables view on the PostgreSQL website. In the following example table, there are 1,999,952 dead tuples in the n_dead_tup record:

psql => SELECT * FROM pg_stat_all_tables WHERE relname='test';
-[ RECORD 1 ]-------+------------------------------
relid               | 16395
schemaname          | public
relname             | test
seq_scan            | 3
seq_tup_read        | 5280041
idx_scan            | 
idx_tup_fetch       | 
n_tup_ins           | 2000000
n_tup_upd           | 0
n_tup_del           | 3639911
n_tup_hot_upd       | 0
n_live_tup          | 1635941
n_dead_tup          | 1999952
n_mod_since_analyze | 3999952
last_vacuum         | 
last_autovacuum     | 2024-08-16 04:49:52.399546+00
last_analyze        | 2024-08-09 09:44:56.208889+00
last_autoanalyze    | 2024-08-16 04:50:22.581935+00
vacuum_count        | 0
autovacuum_count    | 1
analyze_count       | 1
autoanalyze_count   | 1


psql => VACUUM TEST;

Check for orphaned files

Orphaned files can occur when no objects point to a file that's present in the database directory. This scenario occurs when your instance runs out of storage or the engine crashes during an operation such as ALTER TABLE, VACUUM FULL, or CLUSTER. To check for orphaned files, complete the following steps:

  1. Log in to PostgreSQL in each database.

  2. To get the used size of the database, run the following query:

    psql=> SELECT pg_size_pretty(pg_database_size('DATABASE_NAME')); 

    Note: Replace DATABASE_NAME with your database name.

  3. To get the real size of the database, run the following query:

    psql=> SELECT pg_size_pretty(SUM(pg_relation_size(oid))) FROM pg_class;
  4. Compare the used and real size of the database from the outputs of the preceding commands. If the difference is significant, then orphaned files might be using storage space.
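
As a rough sketch, you can also compute the difference in a single query. A small difference is normal because pg_database_size also counts free space maps, visibility maps, and other overhead that pg_relation_size doesn't include:

psql=> SELECT pg_size_pretty(pg_database_size(current_database()) - SUM(pg_relation_size(oid))) AS approximate_difference FROM pg_class;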

Related information

Working with read replicas for Amazon RDS for PostgreSQL

Automated monitoring tools