Skip to content

Create athena table for a CSV file in my S3 bucket where one field has multiple values

0

Hi all,

I hope you are doing well!!! I was building an application for myself with lambda, s3 and Athena. I have a CSV file stored in an S3 bucket. There is one field in this CSV file that has strings separated by comma. For example, "value1, value2, value3" in one cell. However, when I was trying to create an Athena table to query it, I couldn't figure out how to do with this field. I tried string and array<string> but the results didn't make sense whatsoever. value1 became the only value in this column and other values overwrote the values in the following fields. What should I handle this properly?

Many Thanks!

3 Answers
0
Accepted Answer

so as you have given the table with little change use this:

CREATE EXTERNAL TABLE employees ( userID string, Name string, Color string, -- Changed to string, array creation in query Region string, Country string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'field.delim' = ',', 'quoteChar' = ' " ', 'skip.header.line.count' = '1' ) LOCATION 's3://inventory/test.csv' TBLPROPERTIES ('classification' = 'csv');

[ so instead of Lazysimplesere need to use openCSVserde for CSV files, use quote char property too handle comma-separated strings within fields. use the split, quoteChar = ' " ' is crucial it tells Athena ti treat any comma-separated values enclosed in double as single fields. ]

here is the link where i find it. "https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html

After creating the table query with

SELECT userID, Name, split(Color, ',') AS ColorArray, Region, Country FROM employees;

this should give you updated output.

Hope it will resolve. :-)

answered a year ago
EXPERT
reviewed a year ago
0

To handle CSV fields containing comma-separated values in Athena, you need to use the ARRAY type with a custom ROW FORMAT definition:

CREATE EXTERNAL TABLE your_table (
    id string,
    multi_value_field array<string>,
    other_field string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '"',
    'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://your-bucket/path/';
AWS
answered a year ago
  • Got it! I will try this today and update it! So I looked into my query and I created the table with ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' and I also had STORED AS INPUTFORMAT in the query. Did these two statement cause the issue?

0

so that is what i understand "as you mention only one cell comma separated values like "red, blue, green" in CSV file something like userID, Name, Color.

in Athena console editor page where it says " Table and views" create section choose S3 bucket data. now it is going to ask bunch of questions, table name <mytable> create database name it <queryDB> dataset ---browse you s3 bucket where CSV file is---now data format Apache Hive from drop down---file formate --CSV---and here comes the main SerDeLibrary choose "org.apache.hadop.hive.....than SerDe properties though it is optional but choose --field.delim , --Quote character " ----- Escape character . (choose as your CSV file is) now Column details user_id type string, "name" type string---color string ( as given column in your CSV file, i take example as i gave you above csv example). it will give you preview table to query. (magic of aws :-) )

now query with simple select SELECT user_id, name, colors FROM your_table_name WHERE colors LIKE '%blue%';"

if you want each comma separated value as separate items: SELECT user_id, name, color_item FROM your_table_name CROSS JOIN UNNEST(split(color, ',')) AS t(color_item);

I hope this may help!

answered a year ago
  • Yes. You are right. I have "red, blue, green" in one cell. Let's say I have two other attributes in the CSV file following the Color field in your example, which are Region and Country. I ran this statement to create the table:

    CREATE EXTERNAL TABLE employees (
    `userID` string,
    `Name` string,
    `Color` array<string>,
    `Region` string,
    `Country` string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES ('field.delim' = ',')
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://inventory/test.csv'
    TBLPROPERTIES ('classification' = 'csv', "skip.header.line.count"="1");
    

    After I ran this statement, the value for the 'Color' attribute became 'red' alone. The value in 'Region' for this row became 'blue' and the value in 'Country' for this row became 'green'. My only guess is that since I set the field.delim to ',', are values that are delimited by ',' are separated. How could I create an array of strings for this attribute?

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.