How do I troubleshoot errors when I use AWS Glue Crawler to crawl .csv files?

I want to troubleshoot common issues that occur when I use AWS Glue Crawler to crawl data in .csv files.

Short description

Some common issues that cause errors for the built-in .csv classifier in AWS Glue Crawler include:

  • The first row of data isn't recognized as the header, so the resulting table shows generic column names, such as col1 and col2.
  • Data enclosed in double quotation marks, such as "ABC" and "XYZ", isn't recognized.
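
To see why quote handling matters, the following minimal sketch uses Python's standard csv module on a made-up sample. It isn't how AWS Glue parses files internally. It only shows how a parser that ignores the quote symbol splits a quoted field on its embedded comma, while a quote-aware parser keeps the field whole.

```python
import csv
import io

# Hypothetical sample: a header row and one row with quoted fields.
sample = 'id,name,city\n1,"Doe, Jane","New York"\n'

# Naive parsing: split on every comma and ignore quotation marks.
naive_rows = [line.split(",") for line in sample.splitlines()]

# Quote-aware parsing: treat the double quote as the quote symbol.
parsed_rows = list(csv.reader(io.StringIO(sample), delimiter=",", quotechar='"'))

print(len(naive_rows[1]), naive_rows[1])
print(len(parsed_rows[1]), parsed_rows[1])
```

The naive split yields four fields for a three-column row, which is the kind of mismatch that a classifier without a configured quote symbol can run into.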

Resolution

Create a customized .csv classifier, and then add the customized classifier to a new AWS Glue Crawler.

Create a customized classifier

Use the AWS Glue Console to create a custom classifier. Use the following parameters to define your classifier:

  • For Classifier name, enter a unique name.
  • For Classifier type, choose CSV.
  • For Column delimiter, select the comma symbol.
  • For Quote symbol, select the double-quote symbol (").
  • For Column headings, choose Has headings.
    (Optional) If you know the names of your columns, then enter the heading names. Make sure to separate the names with commas.

Note: By default, the .csv classifier uses Open CSV SerDe as its serialization library. Open CSV SerDe supports data with double quotes and the header that you specify. For more information, see CSV SerDe Libraries.
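
If you prefer to script this step, the console settings above map onto the CsvClassifier structure of the AWS Glue CreateClassifier API. The following is a minimal sketch, not the only way to do it. The classifier name and column names are hypothetical placeholders, and the helper build_csv_classifier_request is introduced here only for illustration.

```python
def build_csv_classifier_request(name, header=None):
    """Build keyword arguments for the Glue CreateClassifier API (CSV variant)."""
    csv_classifier = {
        "Name": name,                # Classifier name (must be unique)
        "Delimiter": ",",            # Column delimiter: comma
        "QuoteSymbol": '"',          # Quote symbol: double quote
        "ContainsHeader": "PRESENT", # Column headings: Has headings
    }
    if header:
        # Optional: explicit column names, in file order.
        csv_classifier["Header"] = list(header)
    return {"CsvClassifier": csv_classifier}


# Hypothetical classifier name and column names, for illustration only.
classifier_request = build_csv_classifier_request(
    "my-csv-classifier", header=["id", "name", "created_at"]
)
print(classifier_request)

# To send the request (assumes boto3 is installed and AWS credentials are configured):
# import boto3
# boto3.client("glue").create_classifier(**classifier_request)
```

Building the request as a plain dictionary first makes it easy to review the settings before anything is created in your account.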

Add your customized classifier to a new AWS Glue Crawler

Create a new AWS Glue Crawler. Use the following parameters to configure the crawler:

  • For Data source, select the data store where your .csv files are located.
  • For Include path, enter the include path to your .csv files.
  • For Custom classifiers, add the custom .csv classifier that you created to the list of classifiers.
  • For IAM Role, select an AWS Identity and Access Management (IAM) role that has the necessary permissions to crawl your .csv file.
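
The crawler settings above can also be sketched as a CreateCrawler API request. This example assumes that the data store is Amazon S3 and that the crawler writes to a Data Catalog database; neither assumption comes from the steps above. The crawler name, role ARN, database name, and S3 path are hypothetical placeholders, and build_crawler_request is a helper introduced only for illustration.

```python
def build_crawler_request(name, role, database, s3_path, classifier_names):
    """Build keyword arguments for the Glue CreateCrawler API with an S3 target."""
    return {
        "Name": name,
        "Role": role,                                   # IAM role name or ARN
        "DatabaseName": database,                       # Assumed target database
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # Include path
        "Classifiers": list(classifier_names),          # Custom classifiers
    }


# All values below are hypothetical placeholders.
crawler_request = build_crawler_request(
    name="my-csv-crawler",
    role="arn:aws:iam::111122223333:role/MyGlueCrawlerRole",
    database="my_database",
    s3_path="s3://amzn-s3-demo-bucket/csv-data/",
    classifier_names=["my-csv-classifier"],
)
print(crawler_request)

# To create and run the crawler (assumes boto3 is installed and AWS credentials are configured):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_request)
# glue.start_crawler(Name=crawler_request["Name"])
```

The names in Classifiers must match classifiers that already exist in your account, so create the custom classifier before you create the crawler.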

AWS OFFICIAL - Updated 6 months ago