How can I process flat files with a footer record in AWS Glue?

I'm trying to use AWS Glue to process a flat file that has header information in the first row and footer information in the last row. The file has 10 data columns, but the footer has only two columns (the number of records in the file and the file origin).

What's the best way to process this type of file in AWS Glue?

AWS
Asked 3 years ago, 649 views
1 answer
Accepted answer

You can use the following PySpark script in AWS Glue to process flat files like the one you've described:

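from pyspark.sql import SparkSession

# toDF() on an RDD needs an active SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read the file as an RDD of raw lines
sample_df_1 = sc.textFile('temp.txt')
sample_df_1.collect()
# ['1,2,3,4,5,6,7,8,9,10', 'A,B,C,D,E,F,G,H,I,K', 'foot,er']

# Drop the header (the first line); any data line identical to it is dropped too
hdr = sample_df_1.first()
sample_df_2 = sample_df_1.filter(lambda l: l != hdr)
sample_df_2.collect()
# ['A,B,C,D,E,F,G,H,I,K', 'foot,er']

# Split each line on commas and drop the footer, which has only two fields
final_df = sample_df_2.map(lambda l: l.split(',')).filter(lambda l: len(l) > 2)
j = final_df.toDF()

j.show()
# +---+---+---+---+---+---+---+---+---+---+
# | _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|
# +---+---+---+---+---+---+---+---+---+---+
# |  A|  B|  C|  D|  E|  F|  G|  H|  I|  K|
# +---+---+---+---+---+---+---+---+---+---+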
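
The script above discards the footer, but the question says the footer carries the record count, so it may be worth checking that count before loading the data. Below is a minimal sketch in plain Python (no Spark) of that check. The function names `split_flat_file` and `footer_count_matches` are hypothetical, as is the `origin-system` sample value, and it assumes the footer's first field counts data rows only (excluding header and footer), which the question does not specify.

```python
def split_flat_file(lines, expected_columns=10, delimiter=","):
    """Split raw lines into (header, data_rows, footer) by position."""
    if len(lines) < 2:
        raise ValueError("expected at least a header line and a footer line")
    header = lines[0].split(delimiter)
    footer = lines[-1].split(delimiter)
    # Everything between the first and last line is treated as data
    rows = [line.split(delimiter) for line in lines[1:-1]]
    # Reject the file if any data row does not have the full column count
    bad = [row for row in rows if len(row) != expected_columns]
    if bad:
        raise ValueError(f"{len(bad)} data row(s) do not have {expected_columns} columns")
    return header, rows, footer


def footer_count_matches(rows, footer):
    # Assumption: footer[0] is the number of data rows, footer[1] the file origin
    return int(footer[0]) == len(rows)


sample = [
    "1,2,3,4,5,6,7,8,9,10",   # header
    "A,B,C,D,E,F,G,H,I,K",    # one data row
    "1,origin-system",        # footer: record count, file origin (made-up value)
]
header, rows, footer = split_flat_file(sample)
count_ok = footer_count_matches(rows, footer)
```

Separating the rows by position rather than by content avoids dropping a data line that happens to equal the header. In Spark, a comparable position-based split could probably be built on `RDD.zipWithIndex()`, though that is not shown here.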
AWS
Sundeep
Answered 3 years ago
