In-place query of S3 data without provisioning DB or creating tables
We are exploring use cases where we want to achieve in-place transformation and querying of data in an S3 data lake. We don't want to provision a database or create tables (so we are not keen to consider Redshift or Athena), and we want the querying to be as cost-efficient as possible. While we can use S3 Select to query S3 data directly, it has its own limitations, such as only being able to query a single object per request. Are there any alternatives to achieve this? Please guide us.
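For context, the single-object restriction mentioned above looks like this in practice. A minimal boto3 sketch, assuming a CSV object with a header row; the bucket, key, and column names are placeholders, not from this thread:

```python
# Sketch of S3 Select with boto3: each call targets exactly ONE object.
# All bucket/key/column names below are illustrative placeholders.

def build_select_params(bucket, key, sql):
    """Build the request for s3.select_object_content (one object per call)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }

def s3_select(bucket, key, sql):
    """Run the query; requires AWS credentials at call time."""
    import boto3
    s3 = boto3.client("s3")
    resp = s3.select_object_content(**build_select_params(bucket, key, sql))
    # The response is an event stream; concatenate the Records payloads.
    return b"".join(
        ev["Records"]["Payload"] for ev in resp["Payload"] if "Records" in ev
    )

# Example (placeholder object and SQL):
# s3_select("my-bucket", "sales/2023/01.csv",
#           "SELECT s.region, s.amount FROM s3object s WHERE s.amount > '100'")
```

Querying a whole prefix this way means looping over `list_objects_v2` results and issuing one `select_object_content` call per object, which is exactly the limitation the question is about.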
It seems like you have deprived yourselves of all the options. Athena would be an obvious choice, although you have not disclosed any details about the format and structure of the files you will be querying. What is most cost-efficient depends on the size of the data and your usage pattern, neither of which you have included any information on.
What are your concerns around the need to maintain tables in Athena (Glue)? You will not incur any cost for them. Think of them as a trade-off that gives you higher-level access to your data, i.e. not having to re-declare the data structure, as well as being able to query without reference to any single S3 object.
Thank you very much, SMP. I will explore Athena (Glue) further. We thought that in Athena we need to create a table for every S3 object, hence we were exploring other options. But I get your point.
Our use case is to generate reports on data in an S3 data lake. S3 may contain CSV, Parquet, and highly nested JSON data. The client wants to make this reporting cost-efficient and simple (simple in the sense that they want to explore options where there is no need to provision a DB or create tables). I'm not sure of the latency requirements. The client is also keen to use Hudi wherever possible in many of their modules.
In addition to what has already been commented, you can also query S3 data using Spark, for example, which would give you additional options such as Glue, EMR (or EMR Serverless), or EMR on EKS.
Thank you Alex. Yes, right, Spark is a useful option.
In response to your comment on my previous answer I would like to point out that it is not the case that you need to create an Athena table for each individual S3 object.
As per https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html, the LOCATION refers to an S3 folder, and (at query time) "Athena reads all data stored in the Amazon S3 folder that you specify."
To use Athena in a practical way, you declare a table once per folder (of files with identical type and schema) and add (or remove) data at any time by writing (or deleting) S3 objects in that folder.
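As a concrete illustration of "one table per folder", here is a hedged sketch using boto3's Athena client. Every name below (database, table, bucket, columns, result location) is made up for the example:

```python
# Hypothetical example: one Athena table per S3 folder, not per object.
# All names (database, table, bucket, columns) are illustrative.

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS reports.sales (
  order_id string,
  region   string,
  amount   double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-data-lake/sales/'
TBLPROPERTIES ('skip.header.line.count' = '1')
""".strip()

def run_athena_query(sql, output_s3):
    """Submit a query; Athena reads every object under the table's LOCATION."""
    import boto3  # requires AWS credentials at call time
    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]

# Declare once:
#   run_athena_query(DDL, "s3://my-query-results/")
# Then query any time; new files dropped into s3://my-data-lake/sales/
# are picked up automatically on the next query:
#   run_athena_query("SELECT region, SUM(amount) FROM reports.sales GROUP BY region",
#                    "s3://my-query-results/")
```

The point is that the DDL runs once per folder; writing or deleting objects under the prefix changes query results with no further table maintenance.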
The answer is correct. If the user is worried about having to create the schema definition for the table, AWS Glue Crawlers could probably address that by automatically discovering the schema from the CSV, JSON, and Parquet files.
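If schema discovery is the sticking point, the crawler setup itself is small. A sketch with boto3; the crawler name, role ARN, database, and S3 paths are all placeholders:

```python
# Illustrative Glue crawler config: it infers schemas from CSV/JSON/Parquet
# files under the given prefixes and writes tables into the Glue Data
# Catalog, which Athena then queries. Every name and ARN is a placeholder.

CRAWLER_CONFIG = {
    "Name": "datalake-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "reports",
    "Targets": {
        "S3Targets": [
            {"Path": "s3://my-data-lake/sales/"},
            {"Path": "s3://my-data-lake/events/"},
        ]
    },
}

def create_and_start_crawler(cfg):
    """Create the crawler and kick off a first run (needs AWS credentials)."""
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**cfg)
    glue.start_crawler(Name=cfg["Name"])
```

After a run completes, the discovered tables appear in the `reports` database and are immediately queryable from Athena with no manual DDL.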
If I understand the requirements, you are looking for:
- a cost effective solution
- ability to query the data in place on the data lake (Amazon S3)
- ability to transform the data in place on the data lake (Amazon S3)
- generate reports
- simplicity: no need to maintain a database nor the schema design for the tables.
Based on those requirements, a serverless service that allows you to query the data, and even transform it, would be the best option from an execution perspective.
So, as already mentioned, Amazon Athena would actually match your requirements perfectly (see Athena supported file formats; it also supports Hudi). On top of that, if you do not want to create the tables or maintain their schemas, you could use AWS Glue Crawlers to do that for you automatically. The tables discovered by AWS Glue are stored in the AWS Glue Catalog, which is accessed transparently by Amazon Athena.
To query nested JSON with Athena you can look at this blog post; if you used Glue to crawl the data, you would not need to create the table manually.
If you are trying to make report generation easy for end users, querying complex nested structures could be too difficult; in that case you could use Glue transformations to relationalize the data.
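For readers unfamiliar with the term: Glue's Relationalize transform flattens nested structures into flat columns (and splits arrays into child tables). A minimal pure-Python illustration of the flattening idea — this is a conceptual sketch, not the Glue API itself:

```python
# Pure-Python sketch of what "relationalizing" a nested record means:
# nested keys are flattened into dotted column names, similar in spirit
# to AWS Glue's Relationalize transform (which additionally breaks
# arrays out into separate child tables).

def flatten(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# Placeholder nested record, invented for the example:
order = {"id": 7, "customer": {"name": "Ada", "address": {"city": "Pune"}}}
print(flatten(order))
# → {'id': 7, 'customer.name': 'Ada', 'customer.address.city': 'Pune'}
```

Once the data is in this flat shape (and written back to S3, e.g. as Parquet), plain column-based reporting tools can consume it without dealing with nesting.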
Hope this helps.
Thank you, Fabrizio. I agree and understand. Our client is somehow still expecting us to come up with options other than Athena/Redshift Spectrum; that's the reason we are exploring. Do you know about S3 connectors? I know that an S3 connector integrated with QuickSight/Tableau cannot replace Athena/Redshift, but are there any use cases where we could consume S3 data directly in reporting tools like Tableau or Qlik Sense via connectors?