Crawler can't skip header first line

0

Hi All,
I am trying to use a crawler to add tables to a Glue database from CSV files. That works for most folders/tables, but if a file contains only strings separated by commas, the crawler can't identify the first line as the column names, and the columns receive names like col1, col2, etc.

In the table properties of the tables with wrong schemas, I can't see this property: "skip.header.line.count": 1

Does anyone know how I can force the crawler to skip the first line?

Thank you.

asked 5 years ago, 4693 views
7 Answers
2

I know this is an extremely old topic, but for those of you finding this result in a search engine, the proper way to solve this is by using a classifier on your crawler. You could either explicitly specify the column headings, or allow auto detection of the column headings within the classifier, more details here: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html .
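
For illustration, here is a minimal sketch of what such a classifier definition could look like when built in code. It only constructs the CsvClassifier part of a Glue CreateClassifier request as a plain object; the helper name buildCsvClassifier and the example column names are hypothetical, and actually sending the request (for example with the AWS SDK) is left out.

```typescript
// Subset of the CsvClassifier fields accepted by Glue's CreateClassifier API.
interface CsvClassifierInput {
  Name: string;
  Delimiter?: string;
  QuoteSymbol?: string;
  // "PRESENT" tells Glue the file has a header row; "UNKNOWN" asks it to detect one.
  ContainsHeader?: "UNKNOWN" | "PRESENT" | "ABSENT";
  // Optional explicit column names.
  Header?: string[];
}

// Hypothetical helper: builds the request body for a comma-separated file
// whose first line is a header. Pass `header` to name the columns explicitly.
function buildCsvClassifier(
  name: string,
  header?: string[],
): { CsvClassifier: CsvClassifierInput } {
  const classifier: CsvClassifierInput = {
    Name: name,
    Delimiter: ",",
    QuoteSymbol: '"',
    ContainsHeader: "PRESENT",
  };
  if (header && header.length > 0) {
    classifier.Header = header;
  }
  return { CsvClassifier: classifier };
}

// Example: a classifier for an all-string file with two columns (names are made up).
const classifierRequest = buildCsvClassifier("csv-with-header", ["name", "city"]);
```

The classifier then has to be attached to the crawler, and the crawler rerun, before the new column names show up in the table.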

answered 4 years ago
  • Thanks for taking the time in 2020 to answer this question, it made my day here in 2023 a lot easier.

0

Make sure that the CSV file contains mixed data types (string, numeric) and rerun the crawler.

Example:

name,age
"john",10

Shivan
answered 5 years ago
0

Hi Shivan

Thank you for your answer.

This CSV file has only string data, not mixed data types.

Do you know why the crawler can't identify the header correctly in this case?

answered 5 years ago
0

I do not know. However, you can manually add the skip header table property and change the column names, but that defeats the purpose of the crawler.

Shivan
answered 5 years ago
0

Seems like Classifiers don't help when there are multiple pre-amble lines (e.g. 6 lines) in the file before the headers and data begin (for CSV format, at least). This is a pity as we have to do some manual data-cleansing outside of Glue.
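
As a rough sketch of that kind of cleansing step, the function below drops a fixed number of leading lines from the CSV text so that the header becomes the first line. The function name stripPreamble and the sample content are made up for illustration, and it assumes the preamble length is known in advance (6 lines in our case).

```typescript
// Remove the first `preambleLines` lines from a CSV document so that the
// header row becomes line 1. Handles both \n and \r\n line endings.
function stripPreamble(csvText: string, preambleLines: number): string {
  const lines = csvText.split(/\r?\n/);
  return lines.slice(preambleLines).join("\n");
}

// Example with a 2-line preamble in front of the header and one data row.
const rawFile = "Report generated by tool X\nInternal use only\nname,city\njohn,paris";
const cleanedFile = stripPreamble(rawFile, 2);
```

The cleaned text would then be written back to the location the crawler scans.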

answered 3 years ago
0

If you are using AWS CDK as the IaC tool, you can use the following code to skip the header:

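    // Assumes CDK v2, where the CloudFormation-level Glue constructs live in 'aws-cdk-lib/aws-glue'.
    import * as cfnglue from 'aws-cdk-lib/aws-glue';

    // `table` is an existing Glue Table construct; get its underlying CfnTable.
    const resource = table.node.defaultChild as cfnglue.CfnTable;
    // Set skip.header.line.count in the table's SerDe parameters.
    resource.addPropertyOverride('TableInput.StorageDescriptor.SerdeInfo', {
      Parameters: {
        'skip.header.line.count': '1',
      },
    });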
AWS
MarkusL
answered 10 months ago
0

Add a table property of skip.header.line.count with a value of 1.
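
If you set the property from code rather than the console, a minimal sketch of the idea is to merge the key into the table's existing properties before writing them back (for example as TableInput.Parameters in a Glue UpdateTable call). The helper name withSkipHeader and the sample properties are hypothetical; the update call itself is not shown.

```typescript
// Table properties are plain string-to-string pairs.
type TableParameters = Record<string, string>;

// Return a copy of the existing table properties with
// skip.header.line.count set, leaving all other keys untouched.
function withSkipHeader(
  existing: TableParameters,
  lineCount: number = 1,
): TableParameters {
  return { ...existing, "skip.header.line.count": String(lineCount) };
}

// Example: properties as a crawler might have left them (values are made up).
const updated = withSkipHeader({ classification: "csv" });
```

Keep in mind that a later crawler run may change the table definition again, depending on the crawler's schema change settings.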

Charles
answered 8 months ago
