OpenSearch CSV processor not using column header


Hello, I have a TSV (not CSV) file that looks like this

Firstname Lastname
aayush neupane

I'm parsing this using the OpenSearch CSV processor. The file's data is parsed, but it's generating output as

{
"column1": "aayush",
"column2": "neupane"
}

instead of

{
"Firstname": "aayush",
"Lastname": "neupane"
}

Any idea? I've tried all the configs mentioned at https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/csv/. The manual solution is to use column_names: ["Firstname", "Lastname"], but that is my last resort.

Is it because I'm using a TSV file instead of a CSV?
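
For reference, the column_names fallback I mentioned would look roughly like this in the processor section (option names taken from the docs linked above):

processor:
  - csv:
      source: "message"
      delimiter: "\t"
      column_names: ["Firstname", "Lastname"]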

2 Answers

Hello.

It is difficult to parse a tab-separated TSV file with the CSV processor. Generally, the separator is the standard CSV comma (,), and testing the tab separator (\t) with the CSV processor shows it is not recognized.

[+] CSV processor: Using the processor

https://opensearch.org/docs/latest/ingest-pipelines/processors/csv/

TEST: Below are sample test results; details can be found in the document linked above.

PUT _ingest/pipeline/csv-processor
{
  "description": "Split resource usage into individual fields",
  "processors": [
    {
      "csv": {
        "field": "resource_usage",
        "target_fields": ["cpu_usage", "memory_usage", "disk_usage"],
        "separator": "/t"
      }
    }
  ]
}
PUT testindex1/_doc/1?pipeline=csv-processor
{
  "resource_usage": "60 70  80"
}

Result: the tab (\t) separator is not properly recognized.

{
  "_id" : "137",
  "_score" : 1.0,
  "_source" : {
    "resource_usage" : "60 70  80",
    "cpu_usage" : "60 70  80"
  }
}

If you want to use the CSV processor, it seems best to convert the TSV to CSV first. Alternatively, if you want to bring the TSV source into OpenSearch directly, you can use Data Prepper.

[+] Announcing Data Prepper 2.0.0

https://opensearch.org/blog/Announcing-Data-Prepper-2.0.0/

Data Prepper can now import CSV or TSV formatted files from Amazon Simple Storage Service (Amazon S3) sources. This is useful for systems like Amazon CloudFront, which write their access logs as TSV files. Now you can parse these logs using Data Prepper.
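
As a rough sketch (option names based on the csv codec settings in the Data Prepper documentation; adjust to your environment), pointing the S3 source at tab-separated files would look something like this:

source:
  s3:
    codec:
      csv:
        detect_header: true   # use the first row of each file as the column names
        separator: "\t"       # treat the files as tab-separated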

[+] Data Prepper

https://opensearch.org/docs/latest/data-prepper/index/

Thank you.

AWS
SUPPORT ENGINEER
answered 5 months ago
  • Hello Hyunjoong, I'm using the CSV processor as part of Data Prepper 2.0, and it accepts \t as a valid delimiter. My problem is that I can't get it to use the TSV file's header row as the column names.

    https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-config-reference.html and https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/csv/

    This is how I'm using it:

    version: "2"
    log-pipeline:
      source:
        s3:
          codec:
            newline:
          compression: "none"
          aws:
            region: "my-region"
            sts_role_arn: "my-role"
          acknowledgments: true
          notification_type: "sqs"
          sqs:  
            queue_url: "my-queue" 
            maximum_messages: 10
            visibility_timeout: "30s"
            wait_time: "20s"
            poll_delay: "0s"
            visibility_duplication_protection: true
      processor:
        - csv:
            source: "message"
            delimiter: "\t"
            delete_header: false
      sink:
        - opensearch:
            hosts: [ "my-opensearch-serverless" ]
            aws:
              sts_role_arn: "my-role"
              region: "my-region"
              serverless: true
              serverless_options:
    

    This generates the output as

    {
    "column1": "aayush",
    "column2": "neupane"
    }
    

    instead of my required format

    {
    "Firstname": "aayush",
    "Lastname": "neupane"
    }
    

I figured it out. I had to use the csv codec to detect and use the header:

      codec:
        csv:
          detect_header: true
          separator: "\t"
          quote_character: "\""
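
In context, this codec replaces the newline codec under the s3 source in the pipeline I posted above (a sketch of the relevant part only):

source:
  s3:
    codec:
      csv:
        detect_header: true
        separator: "\t"
        quote_character: "\""
    compression: "none"
    # ... remaining s3 / aws / sqs options unchanged

With detect_header enabled, the first row supplies the field names, so the documents come out as {"Firstname": "aayush", "Lastname": "neupane"} and the separate csv processor on "message" no longer appears to be needed.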
answered 5 months ago
