1 Answer
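You can use the following PySpark script in AWS Glue to process flat files like the one you've described. It reads the file as an RDD of lines, drops the header line, filters out the short footer line, and converts the remaining rows to a DataFrame: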
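from pyspark.sql import SparkSession
# toDF() needs an active SparkSession; sc is its SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read the flat file as an RDD with one element per line
sample_df_1 = sc.textFile('temp.txt')
sample_df_1.collect()
# ['1,2,3,4,5,6,7,8,9,10', 'A,B,C,D,E,F,G,H,I,K', 'foot,er']

# Drop the header (every line equal to the first line)
hdr = sample_df_1.first()
sample_df_2 = sample_df_1.filter(lambda l: l != hdr)
sample_df_2.collect()
# ['A,B,C,D,E,F,G,H,I,K', 'foot,er']

# Split on commas and drop the footer, which has only 2 fields
final_df = sample_df_2.map(lambda l: l.split(',')).filter(lambda l: len(l) > 2)
j = final_df.toDF()
j.show()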
+---+---+---+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|
+---+---+---+---+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|  G|  H|  I|  K|
+---+---+---+---+---+---+---+---+---+---+
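The header and footer filtering above can be tried locally without a Spark cluster. The following is a minimal pure-Python sketch of the same logic, not part of the original answer; `split_data_rows`, `delimiter`, and `min_fields` are hypothetical names chosen for illustration.

```python
def split_data_rows(lines, delimiter=",", min_fields=3):
    """Mirror the RDD steps above on a plain list of lines.

    Hypothetical helper: drops lines equal to the first line (the header),
    splits the rest on `delimiter`, and keeps rows with at least
    `min_fields` fields, which removes a short footer such as 'foot,er'.
    """
    if not lines:
        return []
    header = lines[0]
    # Same as filter(lambda l: l != hdr) followed by map(split)
    rows = (line.split(delimiter) for line in lines if line != header)
    # Same as filter(lambda l: len(l) > 2) when min_fields is 3
    return [fields for fields in rows if len(fields) >= min_fields]


sample = ["1,2,3,4,5,6,7,8,9,10", "A,B,C,D,E,F,G,H,I,K", "foot,er"]
data_rows = split_data_rows(sample)
print(data_rows)
```

Two limits of this approach apply to the Spark version as well: a data row that is identical to the header is dropped, and a footer with three or more comma-separated fields is kept as if it were data.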
Answered 3 years ago