How to read from CSV files in S3 that have headers?

Question

Is there any way to configure Glue to read, or at least ignore, a header row in a CSV file?

I wasn't able to find how to do that.

In case it is unclear what I mean, here are some implementations in related tools:

- `header` in [Spark](http://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv)
- `ignoreheader` in [Redshift's Copy](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-ignoreheader)
- `'skip.header.line.count'='1'` in [Redshift's external tables](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html)

Accepted Answer

If you're crawling the files with Glue to add them to the Glue catalog, you can set this table property:

    skip.header.line.count=1

I set that property manually in the console and was able to query successfully in Athena with header rows ignored.  You can also set the table property via the API or in a CloudFormation template.
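For the API route, here is a minimal sketch using boto3, not something I ran as part of the test above. The helper names `with_skip_header` and `set_skip_header` are made up for illustration. It assumes that the `Table` structure returned by `get_table` carries read-only fields (such as `DatabaseName` and `CreateTime`) that are not part of `TableInput`, so only the `TableInput` keys are copied before calling `update_table`.

```python
# Keys accepted by Glue's TableInput structure. get_table returns extra
# read-only fields (DatabaseName, CreateTime, ...) that are dropped here.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_skip_header(table, line_count=1):
    """Build a TableInput dict from a get_table result, adding the header property.

    Hypothetical helper: it does not modify the dict passed in.
    """
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    params = dict(table_input.get("Parameters", {}))
    # Table parameters are string-to-string, so the count is stored as text.
    params["skip.header.line.count"] = str(line_count)
    table_input["Parameters"] = params
    return table_input


def set_skip_header(database, table_name, line_count=1):
    """Fetch the table definition and write it back with the property set."""
    import boto3  # imported here so with_skip_header works without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=with_skip_header(table, line_count),
    )
```

With credentials configured, `set_skip_header("default", "headertest_headertest")` would apply it to the table used in the example here.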

The table property is also honored when you use Glue's Spark libraries to query the table through the catalog:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    glueContext = GlueContext(SparkContext.getOrCreate())
    df = glueContext.create_dynamic_frame.from_catalog(
           database = "default",
           table_name = "headertest_headertest")
    df.printSchema()
    df.toDF().show()

If you are reading the CSV directly into a dynamic frame, you can use the *withHeader* format option:

    dfs3 = glueContext.create_dynamic_frame_from_options(
        connection_type = "s3",
        connection_options = {"paths": ["s3://rd-mb3/headertest/"]},
        format = "csv",
        format_options = {"withHeader": True})
    dfs3.toDF().show()
