Splitting a Large S3 File into Lines per File (not bytes per file)


I have an 8 GB file of text lines (each line ends with a carriage return) in S3. The file is custom formatted and does NOT follow any common format like CSV, pipe-delimited, or JSON. I need to split that file into smaller files based on the number of lines, such that each file contains 100,000 lines or fewer (the last file gets the remainder and may therefore have fewer than 100,000 lines).

I need a method based on the number of lines, not the file size (i.e. bytes). A single line can't be split across two files.  
I need to use Python.  
I need to use a serverless AWS service such as Lambda or Glue; I can't spin up instances such as EC2 or EMR.  

So far I have found a lot of posts showing how to split by byte size, but not by number of lines. Also, I do not want to read the file line by line, as that would be too slow and inefficient.

Could someone show me starter code or a method that could accomplish splitting this 8 GB file, that would run fast and not require more than 10 GB of available memory (RAM) at any point?

I am looking for all possible options, as long as the basic requirements above are met...

BIG thank you!

Michael

asked 4 years ago · 4,019 views
2 Answers

If you want to process custom formatted files, you can use SparkContext.textFile or SparkContext.newAPIHadoopFile.

SparkContext's textFile method divides the data using the line delimiter (\n). If you want to use a different delimiter, I would recommend SparkContext.newAPIHadoopFile instead.
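For example, here is a minimal sketch of newAPIHadoopFile with a custom record delimiter (the S3 path and the "|" delimiter below are placeholders, not details from the question):

# Sketch: read records separated by a custom delimiter instead of \n.
# The path and the "|" delimiter are placeholders.
pairs = sc.newAPIHadoopFile(
    "s3://your-bucket/path/to/custom_file.txt",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "|"}
)
lines = pairs.map(lambda kv: kv[1])  # drop the byte-offset key, keep the record text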

Here's an example of reading a custom-formatted file with the textFile method. Although I used a CSV file here, you can use any format that uses \n as the line delimiter.

lines = sc.textFile("s3://covid19-lake/static-datasets/csv/countrycode/CountryCodeQS.csv")

Then, let's check the number of lines.

lines.count()

It returns 257, which is exactly the number of lines in this file.

Okay, then let's divide this data into 5 files (about 50 lines per file).

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
# Use the classic FileOutputCommitter when writing the output files to S3
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")
glueContext = GlueContext(sc)

# Read the file, spread it across 5 partitions, and write one output file per partition
lines = sc.textFile("s3://covid19-lake/static-datasets/csv/countrycode/CountryCodeQS.csv")
newLines = lines.repartition(5)
newLines.saveAsTextFile("s3://path_to_folder/")

Then you will see 5 files under 's3://path_to_folder/', each containing about 50 lines.
Although you won't have full control over the number of lines per file, it is a good start for your use case.

As a next step, Spark's DataFrame writer has a maxRecordsPerFile option that controls the maximum number of records per file, which is a great fit for your use case.
You need to convert your data from an RDD to a DataFrame because that option is not available for RDDs. To do that, you will need a schema.
These points will be your second step; a rough sketch follows.
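
Here is a minimal sketch of that second step, assuming each line is simply treated as one string column (the input and output paths are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Read the raw lines and treat each one as a single string column named "value".
lines = spark.sparkContext.textFile("s3://your-bucket/input/large_file.txt")  # placeholder path
df = spark.createDataFrame(lines, StringType())

# Write at most 100,000 records into any single output file.
df.write.option("maxRecordsPerFile", 100000).text("s3://your-bucket/output/")  # placeholder path

Note that maxRecordsPerFile applies per write task, so each partition may still produce a smaller trailing file.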
Hope this information helps.

AWS
answered 4 years ago

Hi michaelko,

This seems trivial in a shell script. I'd try running a Glue Python shell job, i.e., running it on a single node as a Python script instead of needing the full power of a Spark cluster.

In Python I'd just shell out and use a combination of head, tail, and wc -l (the line count) to break up your text file into smaller files.

I'm sure there is a much more sophisticated way to do this directly in Python, but the shell is just so simple here.
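
For what it's worth, here is a minimal sketch of that shell-out idea from Python, using split -l to do the line-based chunking (wc -l is only used to report the line count first). The local paths are placeholders and assume the S3 object has already been downloaded to disk with enough free space:

import os
import subprocess

input_path = "/tmp/large_file.txt"  # placeholder: local copy of the S3 object
output_dir = "/tmp/chunks"          # placeholder: where the pieces go
os.makedirs(output_dir, exist_ok=True)

# Report how many lines we are about to split (the wc -l step).
line_count = int(subprocess.check_output(["wc", "-l", input_path]).split()[0])
print(f"Splitting {line_count} lines")

# split -l writes 100,000-line pieces and never breaks a line across two files.
subprocess.run(
    ["split", "-l", "100000", "-d", input_path, os.path.join(output_dir, "part_")],
    check=True,
)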

I hope this helps,
-Kurt

klarson
answered 4 years ago
