I have a timestream table that occasionally receives large bursts of entries, but otherwise can go for days/weeks without getting any new rows. I want to construct a query to efficiently return the most recent row logged to this table.
Since this table can go long periods of time without receiving any new entries, it doesn't make sense to have a WHERE time BETWEEN ...
clause in my query. So my query looks like:
SELECT MAX(time), MAX_BY(other_field, time)
FROM my_database.my_table
However, the query data usage metrics show that I have to scan through all rows to execute this query. If an analogous query was run on a standard SQL database, this would be a near-instantaneous query because time
is indexed. Is there an approach I'm missing to make this query more efficient? I also tried something like:
SELECT time, other_field
FROM my_database.my_table
ORDER BY time DESC
LIMIT 1
This had similar data usage as the previous query. From what I tested, it doesn't seem like timestream queries take advantage of the indexing/partitioning on time
or any other partitioned fields unless they are a part of a WHERE
clause.
Is there an idiomatic way to retrieve the most recent row from a timestream table without scanning the entire table?
And as a followup question, is there a way to extend the above solution to a query like:
SELECT MAX(time), MAX_BY(other_field, time), my_partitioned_field
FROM my_database.my_table
GROUP BY my_partitioned_field
assuming my_partitioned_field
is a partition key on the table?
This approach, as I mentioned in my original post (though there was a typo, corrected now), still needs to scan the entire table to retrieve that result. Since there is an index on time, I'd expect there to be a way to query the latest row without scanning the entire table.