1 Answer
- Newest
- Most votes
- Most comments
0
You can use the following PySpark script in AWS Glue to process flat files like the one you've described:
sample_df_1=sc.textFile('temp.txt')
sample_df_1.collect()
['1,2,3,4,5,6,7,8,9,10', 'A,B,C,D,E,F,G,H,I,K', 'foot,er']
hdr=sample_df_1.first()
sample_df_2=sample_df_1.filter(lambda l:l != hdr)
sample_df_2.collect()
['A,B,C,D,E,F,G,H,I,K', 'foot,er']
final_df=sample_df_2.map(lambda l:l.split(',')).filter(lambda l: len(l) > 2)
j=final_df.toDF()
j.show()
+---+---+---+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|
+---+---+---+---+---+---+---+---+---+---+
| A| B| C| D| E| F| G| H| I| K|
+---+---+---+---+---+---+---+---+---+---+
answered 3 years ago
Relevant content
- asked 6 months ago
- AWS OFFICIALUpdated a year ago
- AWS OFFICIALUpdated a year ago