Why is Redshift Spectrum showing the filename in the first column?


I'm trying to set up Redshift Spectrum to read some data in an S3 bucket. The files are stored as gzipped tar archives of CSVs, with filenames like 12345.csv.tar.gz.

I've followed the instructions here to set up the external schema and table.

When I run a basic query (SELECT * FROM schema.table), I get the data back, but the first column contains the filename followed by a bunch of numbers rather than the data from the CSV. All of the other columns are fine.

I tried an experiment with some of the same data uncompressed, and the query returned the data as expected. It's only when I gzip the files that I get this issue.

What am I missing?

  • Please add the link to the documentation you followed.

asked 2 years ago, 929 views
2 Answers

You may want to add the property ('skip.header.line.count'='1') to your table DDL; this works for both Spectrum and Athena.
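For example, the property goes in the table properties clause of the external table DDL. A minimal sketch, with a placeholder schema, table, and columns:

create external table myspectrum_schema.my_table (
  col1   VARCHAR(20),
  col2   INT
)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/csv/'
-- skip the header row in each underlying file
table properties ('skip.header.line.count'='1');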

AWS
answered 2 years ago
  • The question mentions an issue with the first column; skip.header.line.count is for skipping rows.

  • Thank you, but unfortunately this drops the entire first row, which is valid data. When I run the SELECT in Redshift, all of the columns in the first row are correct except the first column, which is the filename plus some characters, as stated above.


@AWS-User-3359539 I was able to read it fine. I don't understand why you're trying to read an archive, but in any case, Redshift Spectrum handles gzip out of the box without any special handling.

To prove it, I took the file s3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01.csv, which is publicly available and part of the Amazon Redshift Immersion Day Labs.

Copy the data to your own S3 bucket, say my-bucket. You can use AWS CloudShell to create the gzip and tar files and move them into your bucket (a sketch of the commands follows the listing below), giving a layout like:

[cloudshell-user@ip-XX-XX-XX-XX ~]$ aws s3 ls my-bucket/
                       PRE csv/
                       PRE gz/
                       PRE tar/
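A minimal sketch of those CloudShell commands, assuming my-bucket is a placeholder for your own bucket name:

# download the public sample file
aws s3 cp s3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01.csv .

# upload the plain CSV
aws s3 cp green_tripdata_2016-01.csv s3://my-bucket/csv/

# gzip a copy (-c writes to stdout, so the original is kept) and upload it
gzip -c green_tripdata_2016-01.csv > green_tripdata_2016-01.csv.gz
aws s3 cp green_tripdata_2016-01.csv.gz s3://my-bucket/gz/

# build a tar.gz archive (the layout the question describes) and upload it
tar -czf green_tripdata_2016-01.csv.tar.gz green_tripdata_2016-01.csv
aws s3 cp green_tripdata_2016-01.csv.tar.gz s3://my-bucket/tar/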

  1. Create a Spectrum Schema
create external schema myspectrum_schema 
from data catalog 
database 'myspectrum_db' 
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/XXXXXX-RedshiftIAMRole-XXXXXX'
create external database if not exists;

  2. Create a table for the CSV (skipping 2 lines, as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile 
location 's3://my-bucket/csv/'
table properties ('skip.header.line.count'='2');

  3. Create a table for the gzip file (skipping 2 lines, as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv_gz
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile 
location 's3://my-bucket/gz/'
table properties ('skip.header.line.count'='2');

  4. Create a table for the tar.gz archive (skipping 2 lines, as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv_tar_gz
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile 
location 's3://my-bucket/tar/'
table properties ('skip.header.line.count'='2');

  5. Verify the results
SELECT * FROM "myspectrum_schema"."green_201601_csv";
SELECT * FROM "myspectrum_schema"."green_201601_csv_gz";
SELECT * FROM "myspectrum_schema"."green_201601_csv_tar_gz";

Hope this helps!

AWS
Kunal_G
answered 2 years ago
AWS
EXPERT
reviewed 2 years ago
