OpenSearch CSV processor not using column header


Hello, I have a TSV (not CSV) file that looks like this:

Firstname Lastname
aayush neupane

I'm parsing this with the OpenSearch CSV processor. The file's data is parsed, but the output it generates is

{
"column1": "aayush",
"column2": "neupane"
}

instead of

{
"Firstname": "aayush",
"Lastname": "neupane"
}

Any ideas? I've tried all the configuration options listed at https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/csv/. The manual workaround is to set column_names: ["Firstname", "Lastname"], but I'd rather keep that as a last resort.

Is it because I'm using a TSV file instead of a CSV file?

2 Answers

Hello.

It is difficult to parse a tab-separated (TSV) file with the CSV processor. The separator it normally expects is the comma (,). When I tested the tab separator (\t) with the CSV processor, it was not recognized.

[+] CSV processor : Using the processor

https://opensearch.org/docs/latest/ingest-pipelines/processors/csv/

TEST: Below are the sample test request and its result. The options used here are described in the documentation linked above.

PUT _ingest/pipeline/csv-processor
{
  "description": "Split resource usage into individual fields",
  "processors": [
    {
      "csv": {
        "field": "resource_usage",
        "target_fields": ["cpu_usage", "memory_usage", "disk_usage"],
        "separator": "/t"
      }
    }
  ]
}
PUT testindex1/_doc/1?pipeline=csv-processor
{
  "resource_usage": "60 70  80"
}

Result: the tab (\t) is not recognized as a separator, so the whole value ends up in the first target field.

{
  "_id" : "137",
  "_score" : 1.0,
  "_source" : {
    "resource_usage" : "60 70  80",
    "cpu_usage" : "60 70  80"
  }
}

If you want to use the CSV processor, the best approach seems to be converting the TSV to CSV first and then applying the CSV processor. Alternatively, if you want to ingest the TSV source into OpenSearch directly, you can use Data Prepper.
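
As a rough illustration of the TSV-to-CSV conversion suggested above, here is a minimal Python sketch using only the standard library. The function name `tsv_to_csv` is hypothetical and is not part of OpenSearch or Data Prepper. It assumes the input is plain tab-separated text that uses the double quote as its quote character, if it quotes at all.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text.

    Hypothetical helper for illustration only; fields that contain a
    comma are quoted by csv.writer so they survive the conversion.
    """
    reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    print(tsv_to_csv("Firstname\tLastname\naayush\tneupane\n"), end="")
```

Running a conversion like this before ingestion leaves the header row in place as the first line of the CSV output.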

[+] Announcing Data Prepper 2.0.0

https://opensearch.org/blog/Announcing-Data-Prepper-2.0.0/

Data Prepper can now import CSV or TSV formatted files from Amazon Simple Storage Service (Amazon S3) sources. 
This is useful for systems like Amazon CloudFront, which write their access logs as TSV files. Now you can parse these logs using Data Prepper.

[+] Data Prepper

https://opensearch.org/docs/latest/data-prepper/index/

Thank you.

AWS
SUPPORT ENGINEER
answered 9 months ago
  • Hello Hyunjoong, I'm using the CSV processor as part of Data Prepper 2.0, and it accepts \t as a valid delimiter. My problem is that it does not use the TSV file's header.

    https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-config-reference.html and https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/csv/

    This is how I'm using it:

    version: "2"
    log-pipeline:
      source:
        s3:
          codec:
            newline:
          compression: "none"
          aws:
            region: "my-region"
            sts_role_arn: "my-role"
          acknowledgments: true
          notification_type: "sqs"
          sqs:  
            queue_url: "my-queue" 
            maximum_messages: 10
            visibility_timeout: "30s"
            wait_time: "20s"
            poll_delay: "0s"
            visibility_duplication_protection: true
      processor:
        - csv:
            source: "message"
            delimiter: "\t"
            delete_header: false
      sink:
        - opensearch:
            hosts: [ "my-opensearch-serverless" ]
            aws:
              sts_role_arn: "my-role"
              region: "my-region"
              serverless: true
              serverless_options:
    

    This is generating the output as

    {
    "column1": "aayush",
    "column2": "neupane"
    }
    

    instead of my required format

    {
    "Firstname": "aayush",
    "Lastname": "neupane"
    }
    

I figured it out. I had to use the csv codec in the S3 source, instead of the newline codec, so that the header is detected and used:

      codec:
        csv:
          detect_header: true
          separator: "\t"
          quote_character: "\""
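
To make the difference concrete, here is a small Python sketch of the two behaviors. It is only an analogy and not Data Prepper code. It assumes the newline codec hands each line to the processor as an independent event, and that header detection means reading the first row as field names. The function names and `SAMPLE_TSV` are hypothetical.

```python
import csv
import io

SAMPLE_TSV = "Firstname\tLastname\naayush\tneupane\n"


def parse_line_by_line(text, delimiter="\t"):
    """Parse each line on its own, with no knowledge of a header row.

    Field names fall back to generic column1, column2, ... keys.
    """
    records = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        values = next(csv.reader([line], delimiter=delimiter))
        records.append({f"column{i}": v for i, v in enumerate(values, start=1)})
    return records


def parse_with_header(text, delimiter="\t"):
    """Read the first row as the header and use it to name the fields."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    return [dict(row) for row in reader]


if __name__ == "__main__":
    print(parse_line_by_line(SAMPLE_TSV))
    print(parse_with_header(SAMPLE_TSV))
```

In the line-by-line version the header is just another record, which matches the column1/column2 output in the question. Only the version that sees the file as a whole can map Firstname and Lastname onto the data rows.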
answered 9 months ago
