Architecture choice for fast data reading with Python


Hi,

I retrieve data from an API every hour. The data is statistical information about cities. Until now I have been storing it as Parquet files in an S3 bucket with two partition levels: day and hour. With this layout, even if the API or my script fails for one hour, the rest of the data stays safe.

But as the data grows, it takes longer and longer to read with my Python script. Right now it takes 20 minutes to read the whole dataset, for less than 1 GB of data. There are a lot of partitions and it's far too slow for my purpose. The script calculates sliding indicators for each city and makes some predictions. As you can guess, I don't need every city each time I read the data to compute these indicators.

So a better partitioning scheme would be by city, but I'm afraid of overwriting previous data if the Python script that pulls from the API crashes. Maybe partitioning by city / date could work, but it would generate a lot of small partitions.
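To make the setup concrete, here is roughly what my hourly writer and the full read look like (simplified; the bucket name and paths are placeholders, and I'm using pandas with pyarrow and s3fs):

```python
import pandas as pd

# Hourly writer: one Parquet file per (day, hour) partition.
# Bucket name and prefix are placeholders, not the real ones.
def write_hour(df: pd.DataFrame, day: str, hour: int) -> None:
    path = f"s3://my-bucket/city-stats/day={day}/hour={hour:02d}/data.parquet"
    df.to_parquet(path, index=False)  # requires pyarrow and s3fs

# Reader: loads the whole partitioned dataset back.
# This is the step that now takes ~20 minutes for < 1 GB of data.
def read_all() -> pd.DataFrame:
    return pd.read_parquet("s3://my-bucket/city-stats/")  # walks every day/hour partition
```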
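And this is roughly what I had in mind for a city-first layout (again, names are placeholders). My worry is that a crashed or re-run hourly job could rewrite a city's file and lose the hours already stored in it:

```python
import pandas as pd

# City-first layout I'm considering.
# Writing city=<name>/data.parquet again replaces whatever was there before,
# which is why a crash or re-run of the hourly job scares me.
def write_city(df: pd.DataFrame, city: str) -> None:
    path = f"s3://my-bucket/stats-by-city/city={city}/data.parquet"
    df.to_parquet(path, index=False)

# The read I actually need for the sliding indicators: one city at a time,
# instead of scanning every day/hour partition.
def read_city(city: str) -> pd.DataFrame:
    return pd.read_parquet(f"s3://my-bucket/stats-by-city/city={city}/")
```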

What are your thoughts about that?

Maybe S3 isn't the right choice and DynamoDB would be better? Or maybe Parquet isn't the right format?

Thank you,

Ben
