Crawler can't skip header first line

0

Hi all,
I'm trying to use a crawler to add tables to a Glue database from CSV files. That works for most folders/tables, but if a file contains only strings separated by commas, the crawler can't recognize the first line as the column names, and the columns get names like col1, col2, etc.

In the table properties of the tables with wrong schemas, I can't see this property: "skip.header.line.count": 1

Does anyone know how I can force the crawler to skip the first line?

Thank you.

asked 5 years ago, 4786 views
7 Answers
2

I know this is an extremely old topic, but for those of you finding this result in a search engine, the proper way to solve this is to attach a custom classifier to your crawler. In the classifier you can either explicitly specify the column headings or let it auto-detect them. More details here: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
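As a rough illustration of the classifier approach, here is a minimal sketch of the request payload for a CSV classifier that declares a header row. The field names (`Name`, `Delimiter`, `QuoteSymbol`, `ContainsHeader`, `Header`) follow the Glue `CreateClassifier` API; the helper `buildCsvClassifier`, the classifier name, and the delimiter/quote values are assumptions made up for this example.

```typescript
// Subset of the CsvClassifier fields accepted by Glue's CreateClassifier API.
type ContainsHeader = 'UNKNOWN' | 'PRESENT' | 'ABSENT';

interface CsvClassifierInput {
  Name: string;
  Delimiter: string;
  QuoteSymbol: string;
  ContainsHeader: ContainsHeader;
  Header?: string[];
}

// Hypothetical helper: builds the payload. With no explicit header list the
// classifier is only told that a header row is present; with a list, the
// column names are supplied explicitly.
function buildCsvClassifier(
  name: string,
  header?: string[],
): { CsvClassifier: CsvClassifierInput } {
  const classifier: CsvClassifierInput = {
    Name: name,
    Delimiter: ',',            // assumed: comma-separated files
    QuoteSymbol: '"',          // assumed: double-quoted fields
    ContainsHeader: 'PRESENT', // treat the first line as column names
  };
  if (header && header.length > 0) {
    classifier.Header = header;
  }
  return { CsvClassifier: classifier };
}

// Illustrative classifier name, not taken from the thread.
const request = buildCsvClassifier('csv-with-header');
console.log(JSON.stringify(request, null, 2));
```

The resulting object is what you would hand to `CreateClassifierCommand` from `@aws-sdk/client-glue`. The classifier's name then has to be listed in the crawler's `Classifiers` setting, and the crawler rerun, before it has any effect.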

answered 4 years ago
  • Thanks for taking the time in 2020 to answer this question, it made my day here in 2023 a lot easier.

0

Make sure that the CSV file contains mixed data types (string, numeric) and rerun the crawler.

Example:

name,age
"john",10

Shivan
answered 5 years ago
0

Hi Shivan

Thank you for your answer.

This CSV file has only string data, not mixed data types.

Do you know why the crawler can't identify the header correctly in this case?

answered 5 years ago
0

I do not know. However, you can manually add the skip-header table property and rename the columns, but that defeats the purpose of the crawler.

Shivan
answered 5 years ago
0

It seems that classifiers don't help when the file has multiple preamble lines (e.g. 6 lines) before the headers and data begin (for CSV format, at least). This is a pity, as we have to do some manual data cleansing outside of Glue.
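For the manual cleansing step, a minimal sketch of stripping a fixed number of preamble lines before the file is uploaded for crawling might look like the following. The helper `stripPreamble` and the sample data are hypothetical, and the sketch assumes that the number of preamble lines is known in advance and that no preamble line contains an embedded line break.

```typescript
// Hypothetical helper: drop the first `preambleLines` lines so that the file
// starts at the header row. Handles both "\n" and "\r\n" line endings on
// input and writes "\n" on output.
function stripPreamble(csvText: string, preambleLines: number): string {
  const lines = csvText.split(/\r?\n/);
  return lines.slice(preambleLines).join('\n');
}

// Made-up sample: two preamble lines, then the header and one data row.
const raw = [
  'Report generated by ExampleTool',
  'Period: Q1',
  'name,city',
  'john,Berlin',
].join('\n');

console.log(stripPreamble(raw, 2));
```

Once the preamble is gone, the first line of the file is the header row, which is what a CSV classifier or the `skip.header.line.count` property expects.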

answered 3 years ago
0

If you are using AWS CDK as the IaC tool, you can use the following code to skip the header:

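    // Assumption: `cfnglue` is the L1 Glue module; `table` is an existing Glue table construct.
    import * as cfnglue from 'aws-cdk-lib/aws-glue';

    // Drop down to the underlying CloudFormation resource of the table.
    const resource = table.node.defaultChild as cfnglue.CfnTable;
    // Add skip.header.line.count to the SerDe parameters of the table.
    resource.addPropertyOverride('TableInput.StorageDescriptor.SerdeInfo', {
      Parameters: {
        'skip.header.line.count': '1',
      },
    });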
AWS
MarkusL
answered 10 months ago
0

Add a table property of skip.header.line.count with a value of 1.
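If you want to set the property programmatically rather than in the console, a minimal sketch of merging it into a table's existing parameters could look like this. The `TableInput` type here is a deliberately reduced stand-in for the Glue API's structure, and `withSkipHeader` and the table name are hypothetical names for this example.

```typescript
// Reduced stand-in for the Glue TableInput structure; the real one also
// carries StorageDescriptor, TableType, and other fields.
interface TableInput {
  Name: string;
  Parameters?: Record<string, string>;
}

// Hypothetical helper: returns a copy of the table input with
// skip.header.line.count set, keeping all existing parameters.
function withSkipHeader(table: TableInput, lines = 1): TableInput {
  return {
    ...table,
    Parameters: {
      ...(table.Parameters ?? {}),
      'skip.header.line.count': String(lines), // table parameters are strings
    },
  };
}

// Illustrative table, not taken from the thread.
const updated = withSkipHeader({
  Name: 'my_csv_table',
  Parameters: { classification: 'csv' },
});
console.log(updated.Parameters);
```

In practice the full table definition would first be fetched with `GetTableCommand` from `@aws-sdk/client-glue` and the modified copy sent back with `UpdateTableCommand`. Depending on the crawler's schema change settings, a later crawler run may overwrite manual changes, so this is worth verifying after the next crawl.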

Charles
answered 9 months ago
