Extract File metadata


Hi, I am uploading files directly to S3 and want to get metadata from these files, such as author, created date, and modified date. I was previously using Tika on an EC2 instance. Now I want to run Tika in Lambda, but whenever I run it, it shows the error: Tika server(VM) not started. I tried creating Layers, but that failed. Can you guide me on how to run Apache Tika in Lambda, or is there anything else I can use to get metadata such as author and modified date?

1 Answer

I am not familiar with Tika, but I guess that it uses the S3 APIs to get the data that you need. You can do the same yourself using the S3 SDK. The full list of operations is in the S3 API reference; GetObjectAttributes and HeadObject look the most relevant, and there might be other useful APIs as well. Depending on your programming language, you can find the appropriate SDK to call those APIs.

AWS
answered a year ago
  • I am referring to Apache Tika. I checked the list of S3 APIs but didn't find anything related to file metadata. For example, if you right-click a .docx file and open Properties, you get the number of pages, author, modified date, created date, and so on. That is the information I need, and it is not available through the S3 API.

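  • As Uri mentioned, the metadata is returned by the HeadObject call, not in the S3 event. Here is sample Python code for getting the metadata info. It reads the object's user-defined metadata, so the author and description keys must have been set when the object was uploaded.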

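    import boto3
    from urllib.parse import unquote_plus

    s3 = boto3.client("s3")
    bucket = record['s3']['bucket']['name']
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record['s3']['object']['key'])
    response = s3.head_object(Bucket=bucket, Key=key)
    print("Author : " + response['Metadata'].get('author', ''))
    print("Description : " + response['Metadata'].get('description', ''))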
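For the properties mentioned in the comment above (author, created and modified dates, page count of a .docx), the values live inside the file itself rather than in S3. A .docx is a ZIP package, and these properties are normally stored in `docProps/core.xml` and `docProps/app.xml`, so for this one format they can be read with the Python standard library alone, with no JVM or Tika server. The following is only a minimal sketch under that assumption: `extract_docx_properties`, `NS`, and `CORE_FIELDS` are illustrative names, not part of any AWS or Tika API, and other formats such as PDF would need a different parser.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> element path inside core.xml (names chosen for illustration).
CORE_FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def extract_docx_properties(data: bytes) -> dict:
    """Return embedded document properties from the raw bytes of a .docx file."""
    props = {}
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = set(package.namelist())
        if "docProps/core.xml" in names:
            core = ET.fromstring(package.read("docProps/core.xml"))
            for name, path in CORE_FIELDS.items():
                # Missing elements are reported as None.
                props[name] = core.findtext(path, default=None, namespaces=NS)
        if "docProps/app.xml" in names:
            app = ET.fromstring(package.read("docProps/app.xml"))
            # Match on the local tag name so the app.xml namespace
            # does not have to be declared here.
            for element in app.iter():
                if element.tag.rsplit("}", 1)[-1] == "Pages":
                    props["pages"] = element.text
    return props
```

In a Lambda handler, the bytes could come from `s3.get_object(Bucket=bucket, Key=key)['Body'].read()`, using the same `bucket` and `key` as in the snippet above. Note that the page count is whatever the authoring application last saved, so it may be missing or out of date.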
