Redshift Window Functions

0

Hello,

Was just wondering if anyone could point me in the right direction to which redshift window function would be the best to use for the following scenario;

We have an IoT table with device reporting, included in the data which I wish to report on, if if the system is turned on or off.

We currently have about 1500 devices reporting every 20 minutes or so.

Business Question: Determine the total run time of the devices for the last 12 months.

The data looks something like the following

systemon macaddress timestamp
0 606405abb378 2019-06-21 06:46:36
0 606405abb378 2019-06-21 06:16:36
|<--- 2814 rows snipped out with systemon status of 0--->|
0 606405abb378 2019-04-22 06:20:53
0 606405abb378 2019-04-22 06:20:51
1 606405abb378 2019-04-22 06:20:49
1 606405abb378 2019-04-22 06:20:48
1 606405abb378 2019-04-22 06:20:40
|<--- 26 rows snipped out with systemon status of 1--->|
1 606405abb378 2019-04-22 03:31:28
1 606405abb378 2019-04-22 03:01:28
1 606405abb378 2019-04-22 02:45:07
0 606405abb378 2019-04-22 02:31:28
0 606405abb378 2019-04-22 02:01:27
|<--- 84 rows snipped out with systemon status of 0--->|
0 606405abb378 2019-04-20 07:01:21
0 606405abb378 2019-04-20 06:33:06
1 606405abb378 2019-04-20 06:31:21
1 606405abb378 2019-04-20 06:01:21
1 606405abb378 2019-04-20 05:31:21
1 606405abb378 2019-04-20 05:30:01
1 606405abb378 2019-04-20 05:29:51
0 606405abb378 2019-04-20 05:01:20

What I would like to query is that for each time the unit is turned on, what is the next system off report and the time difference in minutes.

The example data above there are 2944 records for one unit for a 2 month period, but for the data that I am interested in, it could be reduced to 2 instances of the unit being switched on;

This is what i am looking to have returned (based on the sample data above);

systemon macaddress timestamp_on timestamp_off time_running_mins
1 606405abb378 2019-04-22 02:45:07 2019-04-22 06:20:51 216
1 606405abb378 2019-04-20 05:29:51 2019-04-20 06:33:06 63

Is it possible with Redshift window functions? Can someone please point me in the right direction?

I need to consider scalability with this query, that table is currently tracking data for 1500 units, that report every 20-30 minutes, and that number is going to grow by approximately 500 units a month from now on.

If anyone can point me in the right direction, that would be greatly appreciated.

Bdob
asked 5 years ago178 views
3 Answers
0

Hola.

I've not worked out an answer for you, but I do want to reassure you about performed.

I think (although I've not tested, but I will be soon) that when the sort order of the table matches the partition and ordering of the window function, the window function is in effect free.

Once you do work out which window function and how to use it, performance will not be a problem.

Toebs2
answered 5 years ago
0

Your process can be completed in Redshift using "sql sessionization" techniques. Essentially you use a LAG() statement to find the start and end of each "session" (on/off period) and then aggregate over the session.

I recommend using temporary or staging tables for the phases of the sessionization. Breaking the logic into simpler pieces usually makes it easier to follow. Make sure to declare the dist key as the window partition column and the sort key as the partition column followed by the session timestamp. This will speed up the final calculation.

Several detailed articles on this technique:
• https://sonra.io/2017/08/14/redshifts-window-functions-advanced-use-case-sessionization/
https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
https://blog.modeanalytics.com/finding-user-sessions-sql/

answered 5 years ago
0

Worked example:

DROP TABLE IF EXISTS test_data
;
CREATE TEMP TABLE test_data (status CHAR(3), macaddress VARCHAR(32), event_ts TIMESTAMP)
DISTKEY (macaddress) SORTKEY (macaddress, event_ts)
;
INSERT INTO test_data (status, macaddress, event_ts)
          SELECT 'ON' , '606405abb378', '2019-04-22 06:00:00'::TIMESTAMP
UNION ALL SELECT 'OFF', '606405abb378', '2019-04-22 06:20:00'::TIMESTAMP
UNION ALL SELECT 'OFF', '606405abb378', '2019-04-22 06:40:00'::TIMESTAMP
UNION ALL SELECT 'OFF', '606405abb378', '2019-04-22 07:00:00'::TIMESTAMP
UNION ALL SELECT 'ON' , '606405abb378', '2019-04-22 07:20:00'::TIMESTAMP
UNION ALL SELECT 'ON' , '606405abb378', '2019-04-22 07:40:00'::TIMESTAMP
UNION ALL SELECT 'ON' , '606405abb378', '2019-04-22 08:00:00'::TIMESTAMP
UNION ALL SELECT 'ON' , '606405abb378', '2019-04-22 08:20:00'::TIMESTAMP
UNION ALL SELECT 'OFF', '606405abb378', '2019-04-22 08:40:00'::TIMESTAMP
;
--Has the status changed since the last message?
WITH status_changes AS (
    SELECT macaddress
         , event_ts
         --We will use the timestamp of the next event as the end of this event
         , NVL(LEAD(event_ts, 1) OVER (PARTITION BY macaddress ORDER BY event_ts),event_ts) next_event_ts
         , status
         , LAG(status, 1) OVER (PARTITION BY macaddress ORDER BY event_ts) lag_status
         , CASE WHEN lag_status IS NULL   THEN 1 
                WHEN lag_status <> status THEN 1 
           ELSE 0 END AS status_start
    FROM test_data
--Generate a unique ID for each session
), status_sessions AS (
    SELECT macaddress , event_ts, next_event_ts, status, lag_status, status_start
         , SUM(status_start) OVER (PARTITION BY macaddress ORDER BY event_ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) session_id
    FROM status_changes
)
--Final select
SELECT macaddress
     , status
     , session_id
     , MIN(event_ts) start_ts
     , MAX(next_event_ts) end_ts
     , DATEDIFF('minute',start_ts,end_ts) session_minutes
FROM status_sessions
GROUP BY macaddress, status, session_id
ORDER BY macaddress, session_id
;

Output

  macaddress  | status |    start_ts      |     end_ts       | minutes
--------------+--------+------------------+------------------+---------
 606405abb378 | ON     | 2019-04-22 06:00 | 2019-04-22 06:20 |      20
 606405abb378 | OFF    | 2019-04-22 06:20 | 2019-04-22 07:20 |      60
 606405abb378 | ON     | 2019-04-22 07:20 | 2019-04-22 08:40 |      80
 606405abb378 | OFF    | 2019-04-22 08:40 | 2019-04-22 08:40 |       0
answered 5 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions