Formatting CSV table data before upload to Redshift


Pretty basic newbie Redshift question here. We want to upload historical test data to a Redshift database, and the data as we have it is in multiple CSVs formatted in a typical wide table layout, with columns like this:

| Time | Var1 | Var2 | Var3 |
|------|------|------|------|
| 0    | 2    | 3    | 4    |
| 1    | 4    | 5    | 6    |
| 2    | 6    | 7    | 8    |

In the past, for other projects, we've had data imported to Redshift using COPY from S3, and the data was already formatted like the example below, which I've read is more efficient for querying in Redshift:

| Time | Var  | Value |
|------|------|-------|
| 0    | Var1 | 2     |
| 0    | Var2 | 3     |
| 0    | Var3 | 4     |
| 1    | Var1 | 4     |
| 1    | Var2 | 5     |
| 1    | Var3 | 6     |
| 2    | Var1 | 6     |
| 2    | Var2 | 7     |
| 2    | Var3 | 8     |

For this new data, do we need to transform it to match the key-value layout before uploading to Redshift? Are there situations where this reformatting isn't necessary?

• I would experiment both ways. That'd be a fun exercise regardless! Plus, would readability be a factor in deciding how you'd store the data?

2 answers

The idea of a columnar-storage database like Redshift for OLAP is that you can have tables with many columns and still efficiently query just a small subset of those columns across many rows. So I'd say there's no need to reformat as you're suggesting. However, I don't have broad enough experience in this area to say it's never a good idea; you may have special requirements.
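As a rough illustration (the table and column names here are hypothetical, loosely following the wide layout in the question):

```sql
-- Thanks to columnar storage, this query reads only the blocks for
-- the time_s and var2 columns from disk, no matter how many other
-- Var columns the wide table has.
SELECT time_s, var2
FROM historical_tests
WHERE time_s BETWEEN 0 AND 100;
```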

EXPERT
answered 2 years ago

The CSV file columns should match the table columns. From your example, the second layout looks like an unpivot of the first. Since the layouts differ, you can create a separate staging table matching the file you receive, load it with the COPY command, and then use PIVOT/UNPIVOT to transform the data into the other layout.
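A rough sketch of that staging-then-transform flow, oriented the way this question needs it (wide CSVs unpivoted into a key-value table), might look like the following; all table names, the S3 path, and the IAM role ARN are placeholders:

```sql
-- Staging table that mirrors the wide CSV layout
CREATE TABLE stage_wide (
    time_s INT,
    var1   DOUBLE PRECISION,
    var2   DOUBLE PRECISION,
    var3   DOUBLE PRECISION
);

-- Target table in the existing key-value layout
CREATE TABLE IF NOT EXISTS test_results (
    time_s INT,
    var    VARCHAR(32),
    value  DOUBLE PRECISION
);

-- Bulk-load the CSVs from S3 (the fast ingestion path)
COPY stage_wide
FROM 's3://your-bucket/historical-tests/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;

-- Unpivot the wide rows into the key-value layout
INSERT INTO test_results (time_s, var, value)
SELECT time_s, var, val
FROM (SELECT time_s, var1, var2, var3 FROM stage_wide)
UNPIVOT (val FOR var IN (var1, var2, var3));
```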

If the two files are independent, you can leave them as two separate tables. When running queries that join the two tables, you can use PIVOT/UNPIVOT to transform at query execution time.
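For example, the key-value table could be reshaped to wide form on the fly along these lines (names are again placeholders, the string labels in the IN list must match how the Var values are actually stored, and Redshift's PIVOT requires an aggregate such as MAX):

```sql
-- Pivot the key-value table to wide form at query time so it can be
-- joined column-for-column against a natively wide table.
SELECT time_s, var1, var2, var3
FROM (SELECT time_s, var, value FROM test_results)
PIVOT (MAX(value) FOR var IN ('var1' AS var1, 'var2' AS var2, 'var3' AS var3));
```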

Having data as key-value pairs gives you the flexibility of a varying schema, but when running queries you often need to transform it into a tabular column format so it is easy to join with other tables. If your originating system provides data as key-value pairs, use a staging schema that matches the source file format, which gives you fast performance for write queries, i.e., data ingestion. Then apply your transformations in an intermediate or target schema, which gives you fast join performance for your read queries.

AWS
EXPERT
answered 2 years ago
