I've upgraded my Amazon Relational Database Service (Amazon RDS) instance running PostgreSQL to version 11 or higher. Why do I see Write Latency spike in Amazon CloudWatch every five minutes?
Resolution
With an idle Amazon Relational Database Service (Amazon RDS) for PostgreSQL instance, you might notice a spike in the Amazon CloudWatch metric Write Latency every five minutes. This spike correlates with a spike in the Enhanced Monitoring metric Write Total to approximately 64 MB after you upgrade to PostgreSQL version 11 or later. The value of the wal_segment_size parameter becomes 64 MB after the upgrade. These spikes might not be noticeable with version 10 or earlier because the value of wal_segment_size is 16 MB. For more information, see the update for September 7, 2018 in Amazon RDS - Document history.
The archive_timeout configuration in RDS for PostgreSQL is set to 5 minutes. This setting means that the archival process copies the Write-Ahead Logging (WAL) segments to be archived to Amazon Simple Storage Service (Amazon S3) every five minutes. In an idle system, this copy process is usually the only I/O operation, and so this operation might be visibly dominant in the CloudWatch graphs. However, you might not see this pattern in a busy system. For example, suppose that you run following workload for 30 minutes on a db.t3.small instance to simulate a busy system on an RDS for PostgreSQL instance with a 20 GB Amazon Elastic Block Store (Amazon EBS) volume:
#pgbench --host=$HOST --username=$USER --port=$PORT --protocol=simple --progress=2 --client=1 --jobs=1 $DB -T 1800
#pgbench --initialize --fillfactor=90 --scale=100 --port=$PORT --host=$HOST --username=$USER $DB
With this workload, you don't see spikes in the CloudWatch Write Latency metric.
But suppose that you have the following use cases:
- You have an I/O request that takes 10 milliseconds to complete, and nine other I/O requests that take 1 millisecond each to complete. The average latency of these requests is calculated as follows:
Average latency = (10 + (9 * 1)) / 10 = 1.9 milliseconds
- You have an I/O request that takes 10 milliseconds to complete, and you have no other I/O requests. The average latency in this case is calculated as follows:
Average latency = 10 / 1 = 10 milliseconds
Both use cases include the same I/O request that takes 10 milliseconds to complete. However, when you calculate the average latency, the slow request stands out in the second use case where you have fewer I/O requests running along with the slow request. If an idle system has one slow I/O request due to the high block size, the latency is calculated only from this request. In a busy system with multiple I/O requests, most with smaller block sizes, the average latency is calculated from all these requests.
In these cases, you might see a Write Latency spike every five minutes in an idle RDS for PostgreSQL system. If your database instance runs a busy workload, then you might not see these spikes.
Example: Suppose that you have two t2.small Amazon Elastic Compute Cloud (Amazon EC2) instances each with 8 gp2 volumes.
Run the following command on crontab to create a 64 MB file with a block size of 64 MB every five minutes in the first Amazon EC2 instance:
*/5 * * * * dd if=/dev/zero of=/tmp/big_file.txt bs=64MB count=1 oflag=dsync ; rm -rf /tmp/big_file.txt
Note: Be sure to replace big_file.txt in the command with the name of your file.
Also, run the following command on crontab to create 100 files each with a block size of 8 KB every one minute in the second Amazon EC2 instance:
* * * * * for i in {1..100} ; do dd if=/dev/zero of=/tmp/small_file.txt bs=8k count=100 oflag=dsync ; rm -rf /tmp/small_file.txt ; done
Note: Be sure to replace small_file.txt in the command with the name of your file.
You might notice that the second EC2 instance shows higher Writer Latency spikes in CloudWatch than the first EC2 instance.
Related information
Best practices for working with PostgreSQL