How can I process flat files with a footer record in AWS Glue?

Question

I'm trying to use AWS Glue to process a flat file that has header information in the first row and footer information in the last row. The file has 10 data columns, but the footer has only two columns (the number of records in the file and the file origin).

What's the best way to process this type of file in AWS Glue?

Accepted Answer

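You can process this kind of file in AWS Glue with a PySpark script like the following. It reads the file as an RDD of lines, removes the header line, splits each remaining line on commas, and drops the footer by keeping only rows that have more than two fields: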

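    from pyspark.sql import SparkSession

    # An active SparkSession is needed for RDD.toDF(); sc is its SparkContext.
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Read the file as an RDD of lines.
    sample_df_1 = sc.textFile('temp.txt')
    sample_df_1.collect()
    # ['1,2,3,4,5,6,7,8,9,10', 'A,B,C,D,E,F,G,H,I,K', 'foot,er']

    # Drop the header: take the first line and filter out lines equal to it.
    hdr = sample_df_1.first()
    sample_df_2 = sample_df_1.filter(lambda l: l != hdr)
    sample_df_2.collect()
    # ['A,B,C,D,E,F,G,H,I,K', 'foot,er']

    # Split each line on commas; the footer has only two fields, so keep longer rows.
    final_df = sample_df_2.map(lambda l: l.split(',')).filter(lambda l: len(l) > 2)
    j = final_df.toDF()

    j.show()
    # +---+---+---+---+---+---+---+---+---+---+
    # | _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|
    # +---+---+---+---+---+---+---+---+---+---+
    # |  A|  B|  C|  D|  E|  F|  G|  H|  I|  K|
    # +---+---+---+---+---+---+---+---+---+---+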
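
The script above discards the footer, but the question says the footer carries the number of records in the file. If you want to check that count before loading the data, the logic might look like the plain-Python sketch below. It is only an illustration, not part of the original answer: the helper names `split_sections` and `footer_count_matches` are hypothetical, and it assumes that the first footer field is the record count, that the count covers data rows only (not the header or footer), and that the file is small enough to inspect as a list of lines.

```python
def split_sections(lines, data_columns=10, sep=","):
    """Split raw lines into (header, data_rows, footer).

    Assumes the first line is the header and the last line is the footer.
    """
    if len(lines) < 2:
        raise ValueError("expected at least a header line and a footer line")
    header = lines[0].split(sep)
    footer = lines[-1].split(sep)
    data_rows = [line.split(sep) for line in lines[1:-1]]
    # Every data row should have the full set of columns.
    bad_rows = [row for row in data_rows if len(row) != data_columns]
    if bad_rows:
        raise ValueError(
            f"{len(bad_rows)} data row(s) do not have {data_columns} columns"
        )
    return header, data_rows, footer


def footer_count_matches(data_rows, footer):
    """Compare the footer's record count with the number of data rows.

    Assumption: footer[0] is the record count and excludes header and footer.
    """
    return int(footer[0]) == len(data_rows)


# Sample input modeled on the file in the answer, with a numeric footer count.
lines = [
    "1,2,3,4,5,6,7,8,9,10",
    "A,B,C,D,E,F,G,H,I,K",
    "1,origin-system",
]
header, data_rows, footer = split_sections(lines)
print(data_rows)
print(footer_count_matches(data_rows, footer))
```

Unlike the length-based filter in the script, this sketch identifies the header and footer by position, so a data line that happens to equal the header is not dropped. Whether that matters depends on your files.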
