Conversion into Gremlin-format CSV


I need to convert a single CSV file containing user information into two separate CSV files (nodes and edges) suitable for loading into a Neptune cluster. How can we do this?

  • Hello and thanks for the question! How big is the CSV file? If it is of modest size, writing a small script is likely an option. If it is large, then a more rigorous approach, such as using AWS Glue, may be more appropriate.

    Does the current CSV file already contain the relationships in some fashion or will they also have to be computed as part of the conversion?

    Knowing a little more about the source CSV will help recommend an appropriate approach.

  • Thank you for the response. User CSV headers: id, first_name, last_name, email, age, gender, state, street_address, postal_code, city, country, latitude, longitude, traffic_source, created_at

    Order CSV headers: order_id, user_id, status, gender, created_at, returned_at, shipped_at, delivered_at, num_of_item

    Now I want Nodes: Customer, Address, Order, ZipCode. Edges: HAS_ADDRESS (Customer -> Address), PLACE_ORDER (Customer -> Order), HAS_ZIP (Customer -> ZipCode).

    The CSV files are fairly large: we have 100,000 customers and each customer has 10-15 orders. The relationships are not present in the CSV files; we need to compute them by understanding the data connections in the source files.

asked 24 days ago · 96 views
2 Answers
1
Accepted Answer

A Python Lambda function is the fastest way I can think of. In pseudocode I would have it as:

FUNCTION lambda_handler(event, context):
    INPUT: 
        - event: S3 event triggered by object creation in a bucket
        - context: Lambda execution context
        
    1. Extract the bucket name and key from the S3 event.
    2. Download the CSV file from the specified bucket and key to the Lambda execution environment.
    3. Process the CSV file to identify node properties and edges.
    4. Write the node properties to a CSV file named "nodes.csv" and edges to a CSV file named "edges.csv".
    5. Upload the "nodes.csv" and "edges.csv" files to the same S3 bucket.
    
END FUNCTION

If you are happy with the pseudocode definitions above, the following will give you the two separate CSV files (check the bucket name):

import csv
import os
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Get the bucket and key from the S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    # Download the CSV file from S3 to the Lambda execution environment
    download_path = '/tmp/input.csv'
    s3.download_file(bucket, key, download_path)
    
    # Process the CSV file
    node_properties, edges = process_csv_file(download_path)
    
    # Write nodes and edges to CSV files
    write_nodes(node_properties)
    write_edges(edges)
    
    # Upload the output files to S3
    upload_file_to_s3('nodes.csv', bucket)
    upload_file_to_s3('edges.csv', bucket)

def process_csv_file(input_file):
    node_properties = set()
    edges = []
    with open(input_file, 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # Treat every column except source/target/relationship as a node property
            for key, value in row.items():
                if key not in ('source', 'target', 'relationship'):
                    node_properties.add((key, value))
            # Record an edge once per row, if the row carries relationship columns
            if 'source' in row and 'target' in row and 'relationship' in row:
                edges.append((row['source'], row['target'], row['relationship']))
    return node_properties, edges

def write_nodes(node_properties):
    # Write each (property, value) pair collected from the source file
    with open('/tmp/nodes.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['property', 'value'])
        for prop, value in node_properties:
            writer.writerow([prop, value])

def write_edges(edges):
    with open('/tmp/edges.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['source', 'target', 'relationship'])
        for edge in edges:
            writer.writerow(edge)

def upload_file_to_s3(file_name, bucket_name):
    s3.upload_file('/tmp/' + file_name, bucket_name, file_name)
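
If you want to sanity-check the handler outside of Lambda, a minimal, hypothetical S3 event like the one below is enough to drive it; the bucket name and key are placeholders, and running it locally still requires AWS credentials and the object to actually exist in S3:

# Hypothetical test event; replace the bucket and key with real ones.
test_event = {
    "Records": [
        {"s3": {"bucket": {"name": "mybucket"}, "object": {"key": "users.csv"}}}
    ]
}

lambda_handler(test_event, None)  # writes /tmp/nodes.csv and /tmp/edges.csv, then uploads them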


Then you can load them into Neptune with (check the bucket names):

neptune-loader load --iam-role-arn arn:aws:iam::123456789012:role/NeptuneBulkLoadRole --url https://neptune-cluster-url --files s3://mybucket/nodes.csv,s3://mybucket/edges.csv
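
Depending on your tooling, the same load can also be started with a plain HTTP request to the cluster's bulk loader endpoint, as described in the Neptune documentation. A minimal sketch in Python follows; the endpoint, bucket, IAM role ARN and region are placeholders, and the request has to be made from somewhere with network access to the cluster:

import json
import urllib.request

# Placeholders throughout: cluster endpoint, bucket, IAM role ARN and region.
loader_request = {
    "source": "s3://mybucket/",           # prefix containing nodes.csv and edges.csv
    "format": "csv",                      # Gremlin CSV load format
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneBulkLoadRole",
    "region": "us-east-1",
    "failOnError": "FALSE",
}
req = urllib.request.Request(
    "https://your-neptune-endpoint:8182/loader",
    data=json.dumps(loader_request).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # returns a loadId you can use to poll load status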
EXPERT
answered 24 days ago
  • I would like to know a little more about the source CSV file before proposing such a prescriptive approach, as the sample code above makes fairly broad assumptions about the nature of the source data. The question does not state whether the relationships already exist in the source file, nor does it give any other information about the nature of the source file.

  • Thanks, @Giovanni Lauria, for your response and for explaining the steps in a simple, step-by-step manner.

1

Thanks for the comments clarifying the source data composition.

In this case, where you have 100,000 customers and a million or so order records, you could probably equally well create a small script in a language like Python to do it, or use AWS Glue.

It might also be worth taking a look at an open source tool such as Nodestream (https://nodestream-proj.github.io/docs/docs/e2e-examples/movies/), which now supports Amazon Neptune and generates openCypher MERGE statements from source files, or the csv-to-neptune-bulk-format tool (https://github.com/awslabs/amazon-neptune-tools/tree/master/csv-to-neptune-bulk-format), which handles CSV-to-CSV conversions.

Given the CSV files are still relatively small, I might lean towards just writing a script to do it and producing one of the supported Amazon Neptune CSV formats described here: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format.html
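
As a rough, untested sketch of what such a script could look like for the schema described above: the code below assumes the two source files are named users.csv and orders.csv, uses the Gremlin load format's ~id/~label/~from/~to headers, and picks node IDs, properties and types purely for illustration. In practice you would likely write one node file per label so each label can carry its own property columns.

import csv

def convert(users_path="users.csv", orders_path="orders.csv"):
    # Hypothetical file names; adjust to wherever the source CSVs live.
    with open("nodes.csv", "w", newline="") as nf, open("edges.csv", "w", newline="") as ef:
        nodes = csv.writer(nf)
        edges = csv.writer(ef)
        # Gremlin load format: ~id and ~label are required, properties are name:type columns.
        nodes.writerow(["~id", "~label", "first_name:String", "last_name:String",
                        "email:String", "city:String", "postal_code:String"])
        edges.writerow(["~id", "~from", "~to", "~label"])
        edge_count = 0
        seen = set()  # avoid emitting the same Address/ZipCode vertex twice

        with open(users_path, newline="") as f:
            for row in csv.DictReader(f):
                cust_id = f"customer-{row['id']}"
                addr_id = f"address-{row['street_address']}-{row['postal_code']}"
                zip_id = f"zip-{row['postal_code']}"
                nodes.writerow([cust_id, "Customer", row["first_name"], row["last_name"],
                                row["email"], row["city"], row["postal_code"]])
                if addr_id not in seen:
                    seen.add(addr_id)
                    nodes.writerow([addr_id, "Address", "", "", "", row["city"], row["postal_code"]])
                if zip_id not in seen:
                    seen.add(zip_id)
                    nodes.writerow([zip_id, "ZipCode", "", "", "", "", row["postal_code"]])
                for target, label in ((addr_id, "HAS_ADDRESS"), (zip_id, "HAS_ZIP")):
                    edge_count += 1
                    edges.writerow([f"e{edge_count}", cust_id, target, label])

        with open(orders_path, newline="") as f:
            for row in csv.DictReader(f):
                order_id = f"order-{row['order_id']}"
                nodes.writerow([order_id, "Order", "", "", "", "", ""])
                edge_count += 1
                edges.writerow([f"e{edge_count}", f"customer-{row['user_id']}",
                                order_id, "PLACE_ORDER"])

if __name__ == "__main__":
    convert()

Since it streams both inputs row by row in a single pass, a data set of roughly 100,000 customers and 1-1.5 million orders should convert comfortably on a small machine.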

AWS
AWS-KRL
answered 24 days ago
  • Thank you, @AWS-KRL, for your response and for mentioning the various approaches. It's incredibly helpful.

  • Is there a method by which we can obtain a CSV file in the Gremlin load format directly from a Neo4j export?

    I was trying to export the node information using:

    // Only nodes extraction
    MATCH (n) WITH collect(distinct n) as nodes
    CALL apoc.export.csv.data(nodes, [], "nodes.csv", {})
    YIELD nodes AS exported_nodes, relationships, properties, data
    RETURN exported_nodes, relationships, properties, data

    and the edge information using:

    // Only relationships extraction
    MATCH ()-[r]-() WITH collect(distinct r) as rels
    CALL apoc.export.csv.data([], rels, "rels.csv", {})
    YIELD nodes AS exported_nodes, relationships, properties, data
    RETURN exported_nodes, relationships, properties, data

    But in the nodes CSV I am getting only the header names (e.g. id), not their data types (e.g. id:int). Is there any way to get the node information with the data type in the header as well?
