Redshift SYS_QUERY_DETAIL returns incorrect entries for the tables that are scanned during query execution
We use Redshift for internal purposes. As a part of that, we want to determine the data lineage of the tables available in the Redshift cluster.
To generate lineage entries, we traditionally used STL_INSERT
and STL_SCAN
views.
For a given query ID, the entries in STL_SCAN
are considered source_tables
, and the entry in STL_INSERT
is considered as the destination_table
.
Thus we can create an edge in the Lineage graph.
Once we repeat this process for all DML queries, we have an excellent lineage of tables.
Note: For a given query there can be multiple entries in STL_INSERT
that reference redshift-centric internal intermediate destination tables,
however, there can only be a single entry corresponding to a user-defined destination table.
We ignore the entries for intermediate tables since they are used by Redshift internally and are not publicly accessible.
The query execution is divided into multiple steps e.g. scanning the source table, applying filters, inserting the final result in the destination table, etc. Again this is an over-simplification of the query execution process.
Based on this documentation: https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html redshift recommends using SYS_QUERY_DETAIL
view to get a complete list of steps executed for a given query. This also includes the SCAN
and INSERT
steps.
Furthermore, this view contains data from both the main and the concurrency scaling cluster thus it would result in comparatively accurate information.
We migrated our lineage systems to the above-recommended view successfully.
During validation, we found that the SYS_QUERY_DETAIL
contains incorrect table IDs for the scan steps.
Precisely, the table_id
associated with the first SCAN
step is incorrect and is the same as the table_id
corresponding to the INSERT
step.
Thus the lineage graph we generated was missing a bunch of edges and we saw a sudden increase in self-edges in the graph which is counter-intuitive since none of our tables inserts into themselves. Because of this, our integration tests started failing thereby leading to this detailed investigation.
Note: If we continue using the old views i.e. STL_INSERT
and STL_SCAN
the issue disappears.
The redshift version we are using is as follows:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.61678
How to reproduce the issue
- Run the following SQL to generate the test tables
create table parent_1 ( source varchar(32), a int ); insert into parent_1 values ('parent_1', 1), ('parent_1', 2), ('parent_1', 3); create table parent_2 ( source varchar(32), a int ); insert into parent_2 values ('parent_2', 1), ('parent_2', 2), ('parent_2', 3); create table parent_3 ( source varchar(32), a int ); insert into parent_3 values ('parent_3', 1), ('parent_3', 2), ('parent_3', 3); -- The following SQL is equivalent to what our DBT redshift connector would generate while running the models CREATE TABLE "downstream" AS (WITH SOURCE AS (SELECT * FROM "parent_1" UNION SELECT * FROM "parent_2" UNION SELECT * FROM "parent_3"), renamed AS (SELECT * FROM SOURCE) SELECT * FROM renamed)
- Find the table_id of the destination table
select * from pg_catalog.svv_table_info where "table" = 'downstream'; +----------------------------------+------+--------+----------+---------------+----------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+ |database |schema|table_id|table |encoded |diststyle |sortkey1 |max_varchar|sortkey1_enc|sortkey_num|size|pct_used|empty|unsorted|stats_off|tbl_rows|skew_sortkey1|skew_rows|estimated_visible_rows|risk_event|vacuum_sort_benefit|create_time | +----------------------------------+------+--------+----------+---------------+----------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+ |do_not_delete_used_for_intg_test_1|public|1580265 |downstream|Y, AUTO(ENCODE)|AUTO(EVEN)|AUTO(SORTKEY)|32 |null |0 |5 |0.0001 |0 |null |0.00 |9 |null |null |9 |null |null |2024-01-19 01:48:51.024926| +----------------------------------+------+--------+----------+---------------+----------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+
- Find the query ID of the above CTAS
select * from pg_catalog.svv_table_info where "table" = 'downstream'; +----------------------------------+------+--------+----------+---------------+----------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+ |database |schema|table_id|table |encoded |diststyle |sortkey1 |max_varchar|sortkey1_enc|sortkey_num|size|pct_used|empty|unsorted|stats_off|tbl_rows|skew_sortkey1|skew_rows|estimated_visible_rows|risk_event|vacuum_sort_benefit|create_time | +----------------------------------+------+--------+----------+---------------+----------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+ |do_not_delete_used_for_intg_test_1|public|1580265 |downstream|Y, AUTO(ENCODE)|AUTO(EVEN)|AUTO(SORTKEY)|32 |null |0 |5 |0.0001 |0 |null |0.00 |9 |null |null |9 |null |null |2024-01-19 01:48:51.024926| +----------------------------------+------+--------+----------+---------------+----------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+
- Find the query ID of the above CTAS
select * from sys_query_detail where table_id = <table id of the destination table> and step_name = 'insert' and start_time >= '<add a valid date filter here to avoid a full scan>'; +-------+--------+--------------------+---------+----------+-------+---------+--------+----------+---------+--------------------------+--------------------------+--------+-----+-----------+----------+------------+-----------+-----------+------------+-------------+--------------+------+-------------+-------------+---------+------------------------+-------------------------+--------------+ |user_id|query_id|child_query_sequence|stream_id|segment_id|step_id|step_name|table_id|table_name|is_rrscan|start_time |end_time |duration|alert|input_bytes|input_rows|output_bytes|output_rows|blocks_read|blocks_write|local_read_io|remote_read_io|source|data_skewness|time_skewness|is_active|spilled_block_local_disk|spilled_block_remote_disk|step_attribute| +-------+--------+--------------------+---------+----------+-------+---------+--------+----------+---------+--------------------------+--------------------------+--------+-----+-----------+----------+------------+-----------+-----------+------------+-------------+--------------+------+-------------+-------------+---------+------------------------+-------------------------+--------------+ |100 |17492523|1 |3 |5 |2 |insert |1580265 |downstream|f |2024-01-19 01:48:59.569734|2024-01-19 01:48:59.575038|5304 | |0 |0 |0 |9 |0 |0 |0 |0 | |50 |5 |f |0 |0 | | +-------+--------+--------------------+---------+----------+-------+---------+--------+----------+---------+--------------------------+--------------------------+--------+-----+-----------+----------+------------+-----------+-----------+------------+-------------+--------------+------+-------------+-------------+---------+------------------------+-------------------------+--------------+
- Plugin in the query ID in the following SQL to get the corresponding scan and insert steps
select * from sys_query_detail where table_name <> '' and query_id = <query id from above> and start_time >= '<add a valid date filter here to avoid a full scan>' order by start_time; +-------+--------+--------------------+---------+----------+-------+---------+--------+--------------------------------------------------+---------+--------------------------+--------------------------+--------+-----+-----------+----------+------------+-----------+-----------+------------+-------------+--------------+---------------+-------------+-------------+---------+------------------------+-------------------------+--------------+ |user_id|query_id|child_query_sequence|stream_id|segment_id|step_id|step_name|table_id|table_name |is_rrscan|start_time |end_time |duration|alert|input_bytes|input_rows|output_bytes|output_rows|blocks_read|blocks_write|local_read_io|remote_read_io|source |data_skewness|time_skewness|is_active|spilled_block_local_disk|spilled_block_remote_disk|step_attribute| +-------+--------+--------------------+---------+----------+-------+---------+--------+--------------------------------------------------+---------+--------------------------+--------------------------+--------+-----+-----------+----------+------------+-----------+-----------+------------+-------------+--------------+---------------+-------------+-------------+---------+------------------------+-------------------------+--------------+ |100 |17492523|1 |0 |0 |0 |scan |1580265 |do_not_delete_used_for_intg_test_1.public.parent_1|f |2024-01-19 01:48:55.341610|2024-01-19 01:48:59.708793|4367183 | |0 |12 |360 |12 |0 |0 |0 |0 |Redshift(local)|66 |40 |f |0 |0 | | |100 |17492523|1 |1 |1 |0 |scan |1580261 |do_not_delete_used_for_intg_test_1.public.parent_2|f |2024-01-19 01:48:55.351864|2024-01-19 01:48:55.355093|3229 | |0 |3 |72 |3 |0 |0 |0 |0 |Redshift(local)|50 |6 |f |0 |0 | | |100 |17492523|1 |2 |2 |0 |scan |1580263 |do_not_delete_used_for_intg_test_1.public.parent_3|f |2024-01-19 01:48:55.355709|2024-01-19 01:48:55.358835|3126 | |0 |3 |72 |3 |0 |0 |0 |0 |Redshift(local)|50 |7 |f |0 |0 | | |100 |17492523|1 |3 |5 |2 |insert |1580265 |downstream |f |2024-01-19 01:48:59.569734|2024-01-19 01:48:59.575038|5304 | |0 |0 |0 |9 |0 |0 |0 |0 | |50 |5 |f |0 |0 | | +-------+--------+--------------------+---------+----------+-------+---------+--------+--------------------------------------------------+---------+--------------------------+--------------------------+--------+-----+-----------+----------+------------+-----------+-----------+------------+-------------+--------------+---------------+-------------+-------------+---------+------------------------+-------------------------+--------------+
- The table_id for parent_1 is 1580259 as per the following result. However, the first entry in the above result shows the table_id as 1580265 i.e. exactly the same as the destination table
select * from pg_catalog.svv_table_info where "table" = 'parent_1'; +----------------------------------+------+--------+--------+---------------+---------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+ |database |schema|table_id|table |encoded |diststyle|sortkey1 |max_varchar|sortkey1_enc|sortkey_num|size|pct_used|empty|unsorted|stats_off|tbl_rows|skew_sortkey1|skew_rows|estimated_visible_rows|risk_event|vacuum_sort_benefit|create_time | +----------------------------------+------+--------+--------+---------------+---------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+ |do_not_delete_used_for_intg_test_1|public|1580259 |parent_1|Y, AUTO(ENCODE)|AUTO(ALL)|AUTO(SORTKEY)|32 |null |0 |5 |0.0001 |0 |null |0.00 |3 |null |null |3 |null |null |2024-01-19 01:47:31.986812| +----------------------------------+------+--------+--------+---------------+---------+-------------+-----------+------------+-----------+----+--------+-----+--------+---------+--------+-------------+---------+----------------------+----------+-------------------+--------------------------+
Cont'd in comments
- Neueste
- Die meisten Stimmen
- Die meisten Kommentare
Hello,
I executed the same queries on my provisioned cluster and saw correct table_id in SYS_QUERY_DETAIL. Could you check if this is one off issue or this happens for all your queries ? Are you running this on a provisioned or serverless workgroup ?
I would recommend reaching out to AWS support if you see more occurrences of the same issue.
- Sumeetvor 8 Monaten
Thanks, Ranjan for trying it out. We are using a provisioned cluster and the issue is consistently reproducible.
- Sumeetvor 8 Monaten
Hi Ranjan - can you please check the Redshift version you are running and confirm if it matches with the following version?
SELECT VERSION(); PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.61678
Relevanter Inhalt
- AWS OFFICIALAktualisiert vor 7 Monaten
- AWS OFFICIALAktualisiert vor 2 Jahren
- AWS OFFICIALAktualisiert vor 2 Jahren
Following similar steps on STL_INSERT AND STL_SCAN yields correct results
Just to add to the above issue:
We have created an AWS-supported ticket, the internal team was able to reproduce the issue and they are currently investigating the root cause.