How to navigate the data offering?

In a new pipeline we are adding to our product, we collect data from several sources using something similar to keywords. The collected data is mainly text with associated metadata. After collection, the data passes through a filtering stage and is then inserted into a database, where it is combined with user data and with feedback data recording whether it was successful in a few real-world tests.

I was wondering what people think would be a good database for the two stages (after data collection and after filtering). The collected data has to be stored and has to be queryable by its metadata. We are also thinking about adding embeddings to make some filtering easier at a later stage. After filtering, the data passes through a transformation layer, so I can store it in structured form together with its associated features (embeddings, ...), metadata, and feedback data. The setup should be able to cover a recommendation-system use case over time. At the moment I am thinking of PostgreSQL.

The data also has a short life cycle of use (about one month) before it is replaced with new data. After that, I could get away with keeping it in less available storage that is used only for training new models. I want to store the input texts, features, metadata, outputs, and feedback permanently. The feedback in particular is sparse: we do not get feedback for every output.
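To make the permanent data more concrete, here is a minimal sketch of one possible relational layout, not a settled design. All table and column names (`item`, `output`, `feedback`, `payload`, `success`) and the sample values are hypothetical, and SQLite stands in for PostgreSQL only so that the example runs without a server. The point it illustrates is that sparse feedback fits a separate table joined with `LEFT JOIN`, so outputs without feedback are still returned.

```python
import json
import sqlite3

# In-memory SQLite as a stand-in for PostgreSQL (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id         INTEGER PRIMARY KEY,
    input_text TEXT NOT NULL,
    metadata   TEXT NOT NULL,  -- JSON string here; PostgreSQL could use JSONB
    embedding  TEXT            -- JSON list here; a vector type would be an option
);
CREATE TABLE output (
    id      INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES item(id),
    payload TEXT NOT NULL
);
-- Feedback is sparse, so it lives in its own table: no row means no feedback.
CREATE TABLE feedback (
    output_id INTEGER PRIMARY KEY REFERENCES output(id),
    success   INTEGER NOT NULL  -- 1 = passed the real-world test, 0 = failed
);
""")

# Hypothetical sample rows: one item, two outputs, feedback for only one output.
conn.execute(
    "INSERT INTO item (id, input_text, metadata, embedding) VALUES (?, ?, ?, ?)",
    (1, "example text", json.dumps({"source": "feed-a"}), json.dumps([0.1, 0.2])),
)
conn.executemany(
    "INSERT INTO output (id, item_id, payload) VALUES (?, ?, ?)",
    [(1, 1, "output A"), (2, 1, "output B")],
)
conn.execute("INSERT INTO feedback (output_id, success) VALUES (?, ?)", (1, 1))

# LEFT JOIN keeps outputs that never received feedback (success is NULL/None).
rows = conn.execute("""
    SELECT o.id, o.payload, f.success
    FROM output AS o
    LEFT JOIN feedback AS f ON f.output_id = o.id
    ORDER BY o.id
""").fetchall()

# Metadata is decoded in Python here; a JSONB column could be filtered in SQL.
metadata = json.loads(conn.execute("SELECT metadata FROM item").fetchone()[0])
print(rows, metadata)
```

For training, the rows with a non-null `success` would be the labeled examples, while the remaining outputs stay available as unlabeled data.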

My concrete questions are:

  1. What databases seem most suitable for the (a) collected data, (b) filtered data, and (c) permanent data?
  2. What data model is suitable for recommendation systems?

I am not storing the collected data at the moment, but this will be necessary for optimizing the pipeline over time. The filtered data is roughly 50 MB/day, so storing it for one month comes to roughly 1,500 MB. The collected data should be about 20-30x that volume. Thank you very much for your help!
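The sizing above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes a 30-day month and 1 GB = 1,000 MB, and the constant and function names are made up for illustration.

```python
# Figures taken from the question; names are illustrative only.
FILTERED_MB_PER_DAY = 50
RETENTION_DAYS = 30            # assumption: "1 month" treated as 30 days
COLLECTED_FACTORS = (20, 30)   # collected data is 20-30x the filtered data


def monthly_volume_mb(mb_per_day: float, days: int = RETENTION_DAYS) -> float:
    """Volume accumulated over the retention window, in MB."""
    return mb_per_day * days


filtered_mb = monthly_volume_mb(FILTERED_MB_PER_DAY)
collected_mb = tuple(filtered_mb * factor for factor in COLLECTED_FACTORS)

print(f"filtered:  {filtered_mb} MB per month")
print(f"collected: {collected_mb[0] / 1000} to {collected_mb[1] / 1000} GB per month")
```

Under these assumptions the filtered data is 1,500 MB per month and the collected data lands between 30 and 45 GB per month, which is small for any of the stages.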

Nicolay
Asked 3 months ago · 138 views
No answers