INVALID_DOCUMENT_TYPE with certain PDFs


Hi,

We have Textract jobs failing with INVALID_DOCUMENT_TYPE for certain PDFs. They are all quite small (around 200 KB) and single-page. It doesn't happen for all PDFs.

Is there any way of getting feedback as to why some fail?

  • This is still a problem. INVALID_DOCUMENT_TYPE provides no information about what is actually wrong; more granular messages would be invaluable. In my case, I think my documents are failing because they exceed the page size limits, but there's no way to know for sure until you "fix" it and see whether it succeeds.

asked 5 years ago, 693 views
2 Answers

Hmm... Strange. I am not able to reply to {message:id=893763} but can reply here.

The workaround mentioned in the other thread, printing the PDF document to a new PDF, works for me. However, I needed a programmatic solution, so I dug a little deeper. Using PyPDF2, I found that each of my documents that failed with INVALID_DOCUMENT_TYPE also triggered the warning "Xref table not zero-indexed. ID numbers for objects will be corrected." when read by PyPDF2. So I read them with PyPDF2.PdfFileReader(my_bad_pdf_stream, strict=False), which fixed the faulty Xref table. Now all my previously failing PDFs work.
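
For anyone wanting to try the same approach, here is a minimal sketch and not the exact code from this answer. It assumes the PyPDF2 warning fires when the cross-reference table's first subsection does not start at object 0, so first_xref_object_number (a hypothetical helper) looks for that condition directly in the raw bytes using only the standard library. resave_pdf (also hypothetical) then re-writes the file in non-strict mode using the PdfReader/PdfWriter names from PyPDF2 2.0 and later, which replaced PdfFileReader. The check is a heuristic: it returns None for files that use a cross-reference stream instead of a classic table.

```python
import io
import re

_STARTXREF = re.compile(rb"startxref\s+(\d+)")
_XREF_HEADER = re.compile(rb"xref\s+(\d+)\s+(\d+)")


def first_xref_object_number(pdf_bytes):
    """Return the first object number of the classic xref table that the
    last startxref entry points to, or None if no such table is found
    (for example when the file uses a cross-reference stream)."""
    pointers = _STARTXREF.findall(pdf_bytes)
    if not pointers:
        return None
    # Clamp the offset so a damaged pointer cannot exceed the buffer.
    offset = min(int(pointers[-1]), len(pdf_bytes))
    header = _XREF_HEADER.match(pdf_bytes, offset)
    return int(header.group(1)) if header else None


def resave_pdf(pdf_bytes):
    """Re-write a PDF through PyPDF2 in non-strict mode; return new bytes."""
    # Imported lazily so the check above runs without PyPDF2 installed.
    # Assumption: PyPDF2 >= 2.0, where PdfReader/PdfWriter are available.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes), strict=False)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    out = io.BytesIO()
    writer.write(out)
    return out.getvalue()
```

One possible use is to call resave_pdf only on files where first_xref_object_number returns a non-zero value and to send everything else to Textract untouched, which should limit how many otherwise healthy files get re-written.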

Note that this is not a complete solution because the PyPDF2 code introduced errors in some files that were not previously problematic. This other problem will have to wait as I can process all my files for now.

Hope that helps.

Edited by: wchan on Jul 25, 2019 10:39 PM

Edited by: wchan on Jul 25, 2019 10:43 PM

wchan
answered 5 years ago

Not sure if you ever got an answer to this, but I am running into it as well, and I think it's because my PDFs are too wide for the service.
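
Since the error itself gives no detail, one way to test the page-size theory before uploading is to read the page dimensions out of the file. The following is a rough sketch under stated assumptions: oversized_pages is a hypothetical helper, the 2880-point default is my understanding of the Textract PDF limit (40 inches) and should be checked against the current quotas page, and the regular expression only sees /MediaBox entries stored uncompressed in the file, so an empty result does not prove the pages are within limits.

```python
import re

# A PDF number: optional sign, digits with optional fraction, or ".digits".
_NUM = rb"([-+]?(?:\d+\.?\d*|\.\d+))"
_MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*" + rb"\s+".join([_NUM] * 4) + rb"\s*\]"
)


def oversized_pages(pdf_bytes, max_points=2880.0):
    """Return (width, height) in points for every /MediaBox found in the
    raw bytes whose width or height exceeds max_points.

    max_points is an assumed limit, not a value confirmed in this thread.
    """
    found = []
    for match in _MEDIABOX.finditer(pdf_bytes):
        x0, y0, x1, y1 = (float(v.decode("ascii")) for v in match.groups())
        width, height = abs(x1 - x0), abs(y1 - y0)
        if width > max_points or height > max_points:
            found.append((width, height))
    return found
```

For example, calling oversized_pages(open("scan.pdf", "rb").read()) on a failing file (the file name is a placeholder) and getting a non-empty list would support the "too wide" explanation.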

Zac Dan
answered a year ago
