Extract File metadata


Hi, I am uploading files directly to S3 and want to get metadata for these files, such as author, created date, and modified date. I was previously using Tika on an EC2 instance. Now I want to run Tika in Lambda, but whenever I run it, it shows the error: Tika server(VM) not started. I tried creating Layers, but that failed too. Can you guide me on how to run Apache Tika in Lambda, or suggest anything else I can use to get metadata like author, modified date, etc.?

1 Answer

I am not familiar with Tika, but I guess that it uses the S3 APIs to get the data that you need. You can do the same yourself using the S3 SDK. You can find the full list of APIs here. More specifically, look at GetObjectAttributes and HeadObject. There might be other relevant APIs as well. Depending on your programming language, you can find the appropriate SDK to call those APIs.

Uri (AWS, Expert)
answered 1 year ago
  • Well, I am referring to Apache Tika. I checked the list of S3 APIs but didn't find anything related to file metadata. For example, if you right-click a .docx file and open Properties, you see the number of pages, author, modified date, created date, etc. That is the information I need, and it is not available through the S3 API.

  • As Uri mentioned, the metadata is returned by a HeadObject call and is not included in the event. Here is sample Python code for getting the metadata info.

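    import urllib.parse
    import boto3
    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
            response = s3.head_object(Bucket=bucket, Key=key)
            # "Metadata" holds only user-defined (x-amz-meta-*) values set at upload
            print("Author : " + response["Metadata"].get("author", "n/a"))
            print("Description : " + response["Metadata"].get("description", "n/a"))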
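
For the properties stored inside the document itself (what the .docx Properties dialog shows), HeadObject will not help, because S3 does not parse file contents. A .docx file is a ZIP package, and its author and created/modified dates normally sit in the part `docProps/core.xml`. The following is a minimal sketch of reading that part with the Python standard library only, so no JVM or Tika server is needed in Lambda. The function name `read_docx_properties` is hypothetical, and the sketch assumes the file is a well-formed .docx that contains a core properties part; it does not cover PDFs or legacy .doc files.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def read_docx_properties(data: bytes) -> dict:
    """Return core document properties from the raw bytes of a .docx file.

    Hypothetical helper for illustration; missing properties come back as None.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))

    def text(path):
        node = root.find(path, NS)
        return node.text if node is not None else None

    return {
        "author": text("dc:creator"),
        "last_modified_by": text("cp:lastModifiedBy"),
        "created": text("dcterms:created"),
        "modified": text("dcterms:modified"),
    }
```

In a Lambda handler, the bytes could come from `s3.get_object(Bucket=bucket, Key=key)["Body"].read()`, which loads the whole object into memory, so this suits reasonably small documents. The page count is not part of the core properties; Word usually records it in a separate part, `docProps/app.xml`.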
    
