How do I troubleshoot issues with VACUUM in Amazon Redshift?

8 minute read
0

I have concerns about the performance of VACUUM on my Amazon Redshift cluster. Or, my VACUUM queries fail in my Amazon Redshift cluster.

Short description

VACUUM is a resource-intensive operation that can be slowed down by the following:

  • A high percentage of unsorted data
  • A large table with too many columns
  • The use of an interleaved sort key
  • Irregular or infrequent use of VACUUM
  • Concurrent tables, cluster queries, DDL statements, or ETL jobs

Use the svv_vacuum_progress query to check the status and details of your VACUUM operation. Then, follow the VACUUM best practices to troubleshoot and avoid any future issues.

Resolution

Troubleshoot VACUUM performance

Note: The following applies to provisioned Amazon Redshift clusters. The following system tables and queries don't work on Amazon Redshift Serverless.

To check if the VACUUM operation is in progress, run the svv_vacuum_progress query:

dev=# SELECT * FROM svv_vacuum_progress;
table_name |          status                 | time_remaining_estimate
-----------+---------------------------------+-------------------------
 data8     |  Vacuum: initialize merge data8 | 4m 55s
(1 row)

The svv_vacuum_progress query also includes the name of the table, the vacuum's status, and the estimated time that remains until completion. If no VACUUM is running, then the svv_vacuum_progress query shows the status of the last run VACUUM.

Note: The svv_vacuum_progress query returns only one row of results.

Check the details of the table being vacuumed. Specify the table and schema names in the WHERE clause:

SELECT schema, table_id, "table", diststyle, sortkey1, sortkey_num, unsorted, tbl_rows, estimated_visible_rows, stats_off 
FROM svv_table_info 
WHERE "table" IN ('data8');

Example output:

Schema     | table_id | table | diststyle | sortkey1 | sortkey_num | unsorted | tbl_rows  | est_visible_rows | stats_off 
------------+----------+-------+-----------+----------+-------------+----------+-----------+------------------+-----------
testschema | 977719   | data8 | EVEN      | order_id |  2          |    25.00 | 755171520 | 566378624        | 100.00

From the preceding output, the sortkey1 column shows the main sort key.

If the table has an interleaved sort key, then this column displays the INTERLEAVED state.

  • The sortkey_num column shows the number of columns in the sort key.
  • The unsorted column shows the percentage of rows that need to be sorted.
  • The tbl_rows column shows the total number of rows, including the deleted and updated rows.
  • The estimated_visible_rows is the number of rows that excludes the deleted rows.
  • After a complete vacuum (delete and sort), the value for tbl_rows and estimated_visible_rows resemble each other, and the unsorted value reaches 0. Note: Data in the table updates in real time. To check the progress of VACUUM, continue to run the query. Note that the unsorted rows gradually decrease as VACUUM progresses. To verify whether you have a high percentage of unsorted data, check the VACUUM information for a specific table.

Run the following query to check VACUUM information for a table. Specify the table ID from the previous query:

SELECT table_id, status, rows, sortedrows, blocks, eventtime
FROM stl_vacuum
WHERE table_id=977719
ORDER BY eventtime DESC LIMIT 20;

Example output:

table_id |             status             |    rows    | sortedrows | blocks |         eventtime
         ----------+--------------------------------+------------+------------+--------+----------------------------
  977719 | [VacuumBG] Finished            |  566378640 |          0 |  23618 | 2020-05-27 06:55:33.232536
  977719 | [VacuumBG] Started Delete Only | 1132757280 |  566378640 |  47164 | 2020-05-27 06:55:18.906008
  977719 | Finished                       |  566378640 |  566378640 |  23654 | 2020-05-27 06:46:04.086842
  977719 | Started                        | 1132757280 |  566378640 |  45642 | 2020-05-27 06:28:17.128345
(4 rows)

In the preceding example, the output lists the latest events first, followed by older events, in sorted order:

  • The last VACUUM was an automatic VACUUM DELETE that started at 2020-05-27 06:55:18.906008 UTC and completed in a few seconds.
  • This VACUUM released the space occupied by deleted rows. This state is confirmed by the number of rows and blocks that appear when the VACUUM operation started and completed.

Note the changes in the number of blocks occupied by the table from the start and completion of VACUUM.

Note: Amazon Redshift automatically performs VACUUM SORT and VACUUM DELETE operations on tables in the background. These background VACUUMs run during periods of reduced loads and are paused during periods of high load. This automatic VACUUM reduces the need to run the VACUUM command.

The sortedrows column shows the number of sorted rows in the table. In the last VACUUM, no sort was done because it was an automatic VACUUM DELETE operation. Because the active rows weren't sorted, the row marked for deletion displays the same number of sorted rows from when VACUUM started. After VACUUM DELETE completes, you see 0 sorted rows.

The initial VACUUM that started at 2020-05-27 06:28:17.128345 UTC shows a full VACUUM. The process released the space from deleted rows and sorted rows after about 18 minutes. When the VACUUM operation completed, the output shows the same values for rows and sortedrows because the VACUUM successfully sorted the rows.

For a VACUUM that's already in progress, continue to monitor its performance and incorporate best practices.

Troubleshoot VACUUM failure

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

To find out why a VACUUM query failed, use either SYS_QUERY_HISTORY or STL_QUERY to check for error messages.

If you use the STL_QUERY, then you must get the error details from STL_ERROR. Because STL_ERROR doesn't have a query ID column, find the PID field from the STL_QUERY. Then, use that field in the STL_ERROR query.

Example:

SELECT user_id, query_id, transaction_id, session_id,  status, start_time, end_time, execution_time, error_message FROM sys_query_history WHERE query_id IN (<failed queries>)


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

| user_id | query_id   | transaction_id |  session_id | status | start_time              |     end_time                  |  execution_time |   error_message |  
| 100        | 915082632 | 35599398     | 1096641177 | failed  | 2024-10-06 21:09:30.209587 | 2024-10

If you use the execute statement to run the VACUUM query, then use describe-statement to identify any error messages.

Example:

aws redshift-data describe-statement --id 7c823348d-be8b-437a-9a0-db8c0ca44f0f
{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2024-10-07T16:25:27.566000+00:00",
    "Duration": -1,
    "Error": "ERROR: VACUUM cannot run inside a multiple commands statement",
    "HasResultSet": false,
    "Id": "7c823348d-be8b-437a-9a0-db8c0ca44f0f",
    "QueryString": "vacuum full toptem;\nvacuum full tsupport;\nvacuum full supplierxbox;\nvacuum full party;",
    "RedshiftPid": 10723479554,
    "RedshiftQueryId": 42304,
    "ResultRows": -1,
    "ResultSize": -1,
    "Status": "FAILED",
    "UpdatedAt": "2024-10-07T16:25:33.566000+00:00"
}

VACUUM best practices

You can improve VACUUM performance with the following best practices:

  • Because VACUUM is a resource-intensive operation, run it during off-peak hours.
  • During off-peak hours, use wlm_query_slot_count to temporarily override the concurrency level in a queue for a VACUUM operation.
  • Run the VACUUM operation with a threshold parameter of up to 99% for large tables.
  • Determine the appropriate threshold and frequency of running VACUUM. For example, you might want to run VACUUM at a threshold of 100%, or have your data always sorted. Use the approach that optimizes your Amazon Redshift cluster's query performance.
  • Run a VACUUM FULL or VACUUM SORT ONLY often enough that a high unsorted Region doesn't accumulate in large tables.
  • If there is a large amount of unsorted data on a large table, then perform a deep copy. A deep copy can help you load the data into a new table instead of running VACUUM SORT on the existing table.
  • Run the VACUUM command with the BOOST option. The BOOST option allocates additional resources to VACUUM, such as available memory and disk space. With the BOOST option, VACUUM operates in one window and blocks concurrent deletes and updates for the duration of the VACUUM operation.
    Note: Running VACUUM with the BOOST option might affect query performance. It's a best practice to run the VACUUM BOOST operation only during maintenance operations or off-peak hours.
  • Divide any large tables into time-series tables to improve VACUUM performance. In some cases, when you use a time-series table, you can fulfill the need to run VACUUM.
  • Choose a column compression type for large tables. Compressed rows consume less disk space when you sort data.
  • Use the ANALYZE command after VACUUM operation to update the statistics. The query planner uses these values to choose the best plans.
  • If the cluster is fully idle, then run a MANUAL VACUUM for failed VACUUM attempts. For more information, see Vacuuming and analyzing tables manually.

Related information

STL_ERROR

Why was my query canceled in Amazon Redshift?

Why is my Amazon Redshift Serverless query canceled or stopped?

AWS OFFICIAL
AWS OFFICIALUpdated 5 months ago