Why is Redshift Spectrum showing the filename in the first column?


I'm trying to set up Redshift Spectrum to read some data in an S3 bucket. The files are stored as gzipped tar archives (tar.gz) of CSV files, with filenames like 12345.csv.tar.gz.

I've followed the instructions here to set up the external schema and table.

When I do a basic query (SELECT * FROM schema.table), I get the data back, but the first column is the filename with a bunch of numbers after it, rather than the data in the CSV. All of the other columns are fine.

I tried an experiment with some of the same data uncompressed, and the query returns the data as expected. It's only when I compress the files that I get this issue.

What am I missing?

  • Please add the link to the documentation you followed.

asked 2 years ago · 917 views
2 Answers

You may want to add the table property ('skip.header.line.count'='1') to your table DDL; this works for both Redshift Spectrum and Athena.
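For example, appended to the external table DDL (a minimal sketch; the table, columns, and location are placeholders):

create external table myspectrum_schema.my_table
(
  col_a varchar(10),
  col_b int
)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/csv/'
-- skip the first line of each file (the header row)
table properties ('skip.header.line.count'='1');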

AWS
answered 2 years ago
  • Question mentions issues with the first column, ski.header.line.count is for skipping rows.

  • Thank you, but unfortunately this drops the entire first row, which is valid data. When I run the SELECT in Redshift, all of the columns in the first row are correct except the first column, which contains the filename plus some extra characters, as stated.


@AWS-User-3359539 I was able to read it fine. I'm not sure why you are trying to read a tar archive, but in any case, Redshift Spectrum handles gzip out of the box without any special handling. Note that a .tar.gz file is not simply a compressed CSV: it is a tar archive that has been gzipped, and the tar format prepends a header block containing the archived filename and metadata (mode, size, timestamps) to the file contents. That header is what shows up as the filename plus numbers in your first column.

To demonstrate, I took the file s3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01.csv, which is publicly available and part of the Amazon Redshift Immersion Day labs.

Copy the data to your own S3 bucket, say my-bucket. You can use AWS CloudShell to create the gzip and tar files and move them to your bucket, like so:

[cloudshell-user@ip-XX-XX-XX-XX ~]$ aws s3 ls my-bucket/
                       PRE csv/
                       PRE gz/
                       PRE tar/

  1. Create a Spectrum Schema
create external schema myspectrum_schema 
from data catalog 
database 'myspectrum_db' 
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/XXXXXX-RedshiftIAMRole-XXXXXX'
create external database if not exists;

  2. Create a table for the CSV (skipping 2 lines, as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile 
location 's3://my-bucket/csv/'
table properties ('skip.header.line.count'='2');

  3. Create a table for the gzip file (skipping 2 lines, as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv_gz
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile 
location 's3://my-bucket/gz/'
table properties ('skip.header.line.count'='2');

  4. Create a table for the tar.gz file (skipping 2 lines, as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv_tar_gz
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile 
location 's3://my-bucket/tar/'
table properties ('skip.header.line.count'='2');

  5. Verify the results
SELECT * FROM "myspectrum_schema"."green_201601_csv";
SELECT * FROM "myspectrum_schema"."green_201601_csv_gz";
SELECT * FROM "myspectrum_schema"."green_201601_csv_tar_gz";

Hope this helps!

Kunal_G (AWS)
answered 2 years ago
reviewed 2 years ago by an AWS Expert
