How to read from CSV files in S3 that have headers?

Is there any way to configure Glue to read, or at least ignore, a header row in a CSV file?

I wasn't able to find how to do that.

In case it is unclear what I mean, here are some implementations in related tools:

AWS
mkamp
Asked 6 years ago, 4402 views
1 Answer
Accepted Answer

If you're crawling the files with Glue to add them to the Glue catalog, you can set this table property:

skip.header.line.count=1

I set that property manually in the console and was able to query successfully in Athena with header rows ignored. You can also set the table property via the API or in a CloudFormation template.
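For the API route, a minimal sketch with boto3 might look like the following. It assumes the table already exists in the catalog and that AWS credentials are configured; the helper names with_header_skip and set_header_skip and the TABLE_INPUT_KEYS allow-list are illustrative, not part of any AWS library. The filtering step is there because get_table returns read-only fields that update_table does not accept in TableInput.

```python
# Keys copied from the get_table response into TableInput. The response also
# carries read-only fields (e.g. DatabaseName, CreateTime) that must be dropped.
# This allow-list is an assumption; check it against the current Glue API docs.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_header_skip(table, lines=1):
    """Build a TableInput dict from a get_table "Table" entry with the header property set."""
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    params = dict(table_input.get("Parameters", {}))  # copy, keep existing parameters
    params["skip.header.line.count"] = str(lines)     # table parameters are strings
    table_input["Parameters"] = params
    return table_input


def set_header_skip(database, table_name, lines=1):
    """Fetch the table, add the property, and write it back (needs AWS credentials)."""
    import boto3  # imported lazily so with_header_skip works without boto3 installed

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database,
                      TableInput=with_header_skip(table, lines))
```

With the table from this thread, the call would be set_header_skip("default", "headertest_headertest").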

This also works if you use Glue's Spark libraries to query the table using the catalog:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the table through the Glue catalog, where skip.header.line.count is set.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="default",
    table_name="headertest_headertest")
dyf.printSchema()
dyf.toDF().show()

If you are reading the CSV directly into a dynamic frame, you can use the withHeader connection option:

dfs3 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://rd-mb3/headertest/"]},
    format="csv",
    format_options={"withHeader": True})
dfs3.toDF().show()
AWS
Moderator
Answered 6 years ago
Expert
Reviewed 1 month ago
