If you want to process custom-formatted files, you can use SparkContext.textFile or SparkContext.newAPIHadoopFile.
SparkContext's textFile method splits the data on the line delimiter (\n). If you want to use a different delimiter, I would recommend using SparkContext.newAPIHadoopFile instead.
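For instance, here is a minimal sketch of mine (not part of the original answer; the path and the '|' delimiter are hypothetical) that sets a custom record delimiter through the Hadoop configuration:
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# Read records separated by '|' instead of '\n' (hypothetical delimiter and path).
conf = {"textinputformat.record.delimiter": "|"}
records = sc.newAPIHadoopFile(
    "s3://path_to_folder/custom_delimited.txt",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
).map(lambda kv: kv[1])  # keep only the text value from each (offset, text) pair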
Here's an example of reading a custom-formatted file with the textFile method. Although I use a CSV file here, you can use any format that uses \n as the line delimiter.
lines = sc.textFile("s3://covid19-lake/static-datasets/csv/countrycode/CountryCodeQS.csv")
Then, let's check the number of lines and RDD partitions.
lines.count()
It returns 257, which matches the number of lines in this file.
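To also check the partition count (not shown in the original answer), you can call getNumPartitions on the RDD:
lines.getNumPartitions()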
Okay, then let's divide this data into 5 files (about 50 lines per file).
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Reuse (or create) the SparkContext and wrap it in a GlueContext.
sc = SparkContext.getOrCreate()
# Use the standard Hadoop FileOutputCommitter when writing the output.
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")
glueContext = GlueContext(sc)

# Read the file as an RDD of lines, repartition it into 5 partitions,
# and write one output file per partition.
lines = sc.textFile("s3://covid19-lake/static-datasets/csv/countrycode/CountryCodeQS.csv")
newLines = lines.repartition(5)
newLines.saveAsTextFile("s3://path_to_folder/")
Then you will see 5 files under 's3://path_to_folder/', and each will contain about 50 lines.
Although you won't have full control over the number of lines per file, it is a good start for your use case.
As a next step, Spark's DataFrame writer has a maxRecordsPerFile option that controls the maximum number of records per file, which is a great fit for your use case.
You need to convert your data from an RDD to a DataFrame because the option cannot be used with an RDD. To do that, you will need a schema.
These points will be your second step; a rough sketch follows below.
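Here is a minimal sketch of that second step (mine, not from the original answer; the column names in the schema and the output path are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical two-column schema; adjust the field names and types to match your data.
schema = StructType([
    StructField("country", StringType(), True),
    StructField("code", StringType(), True),
])

# Re-read the file, split each line, and keep the first two fields to match the schema.
lines = spark.sparkContext.textFile("s3://covid19-lake/static-datasets/csv/countrycode/CountryCodeQS.csv")
rows = lines.map(lambda line: line.split(",")[:2])
df = spark.createDataFrame(rows, schema)

# maxRecordsPerFile caps each output file at 50 records.
df.write.option("maxRecordsPerFile", 50).csv("s3://path_to_folder_df/")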
Hope this information helps.
Hi michaelko,
This seems trivial in a shell script. I'd try running a Glue Python shell job, i.e., run it on a single node as a Python script instead of needing the full power of a Spark cluster.
In Python I'd just shell out and use a combo of head, tail, and wc -l (not the lines arg) to break up your text files into smaller files.
I'm sure there is a way more sophisticated way to do this directly in Python, but the shell is just so simple here.
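Here's a minimal sketch of that idea (my own, with a hypothetical local path), shelling out to wc, tail, and head from Python:
import subprocess

input_path = "/tmp/CountryCodeQS.csv"  # hypothetical local copy of the file to split
lines_per_file = 50

# Count the total number of lines with wc -l.
total_lines = int(subprocess.check_output(["wc", "-l", input_path]).split()[0])

# Carve each chunk of lines into its own file with tail and head.
part = 0
for start in range(1, total_lines + 1, lines_per_file):
    out_path = f"/tmp/part-{part:05d}.csv"
    with open(out_path, "w") as out:
        subprocess.run(
            f"tail -n +{start} {input_path} | head -n {lines_per_file}",
            shell=True, stdout=out, check=True,
        )
    part += 1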
I hope this helps,
-Kurt