How can I see the Amazon S3 source file for a row in an Athena table?

3 minute read
1

I'm querying a table in Amazon Athena. I want to find out which Amazon Simple Storage Service (Amazon S3) file is the source for each row in the output, or which rows correspond to a specific file.

Short description

  • To find out which Amazon S3 file contains data that is returned by a specific row in an Athena table, run a SELECT $path query.
  • To find out which Athena table rows are associated with a specific Amazon S3 file, run a SELECT query with the WHERE $path condition.

These queries are useful for investigating unexpected data and building queries that require information about the source data.

Resolution

Before you run the following queries, be sure that you have the following:

  • An Athena table. If you don't have a table, run a CREATE TABLE statement.
  • An AWS Identity and Access Management (IAM) user or role that has permissions to run Athena queries.

To find the S3 file that's associated with a row of an Athena table:

1.    Run a SELECT query against your table to return the data that you want:

SELECT * FROM "my_database"."my_table" WHERE year=2019;

2.    To find the Amazon S3 source file for the data, run a query similar to the following:

SELECT "$path" FROM "my_database"."my_table" WHERE year=2019;

The query returns the Amazon S3 path for the data:

s3://awsexamplebucket/datasets_mytable/year=2019/data_file1.json

To return Athena table rows that originate from a specific Amazon S3 file:

1.    Confirm the name and location of the Amazon S3 object that you want to retrieve rows for. If you don't know the object name and location, run the ls command to list objects under a specific S3 bucket or prefix. In the following example, replace s3://awsexamplebucket/my_table/my_partition/ with your partition path.

aws s3 ls s3://awsexamplebucket/my_table/my_partition/

2.    In Athena, run a SELECT query against your table to return data that's associated with the Amazon S3 file:

SELECT *,"$path" FROM "my_database"."my_table" WHERE regexp_like("$path", 's3://awsexamplebucket/my_table/my_partition/file-01.csv')

The query returns data that matches the Amazon S3 path:

id    name    year    $path
3     John    1999    's3://awsexamplebucket/my_table/my_partition/file-01.csv'
4     Jane    2000    's3://awsexamplebucket/my_table/my_partition/file-01.csv'

This query also supports some wildcard functionality. For example, to return all data that contains part of a file name, run a query like this:

SELECT *,"$path" FROM "my_database"."my_table" WHERE regexp_like("$path", '\-02')

This query returns all rows with file names that contain -02:

id    name    year    $path
13    Mia     2009    's3://awsexamplebucket/my_table/my_partition/file-02.csv'
14    Mary    2010    's3://awsexamplebucket/my_table/my_partition/file-02.csv'
15    Max     2011    's3://awsexamplebucket/my_table/my_partition/file-02.csv'
16    Merriam 2012    's3://awsexamplebucket/my_table/my_partition/file-02.csv'

Related information

Parquet SerDe

Connecting to data sources

AWS OFFICIAL
AWS OFFICIALUpdated 2 years ago