INVALID_DOCUMENT_TYPE with certain PDFs

Hi,

We have Textract jobs failing with INVALID_DOCUMENT_TYPE for certain PDFs. They are all quite small (around 200 KB) and single-page. It doesn't happen for all PDFs.

Is there any way of getting feedback as to why some fail?

  • This is still a problem. INVALID_DOCUMENT_TYPE provides no information about what is actually wrong; more granular messages would be invaluable. In my case, I think my documents are failing because they exceed the page size limits, but there's no way to know for sure until you "fix" it and see whether it succeeds.

Asked 5 years ago · 726 views
2 Answers

Hmm... Strange. I am not able to reply to {message:id=893763} but can reply here.

The workaround mentioned in the other thread, printing the PDF document to a new PDF, works for me. However, I needed a programmatic solution, so I dug a little deeper. Using PyPDF2, I found that each of my documents that failed with INVALID_DOCUMENT_TYPE also produced the warning "Xref table not zero-indexed. ID numbers for objects will be corrected." when read in PyPDF2. So I used PyPDF2.PdfFileReader(my_bad_pdf_stream, strict=False), which fixed the faulty Xref table. All of my previously failing PDFs now work.

Note that this is not a complete solution, because the PyPDF2 code introduced errors in some files that were not previously problematic. That other problem will have to wait, as I can process all my files for now.
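For anyone wanting to try the same approach, here is a minimal sketch of it, not a definitive implementation. The function names has_nonzero_xref_start and rewrite_pdf are hypothetical. The first function is a rough standard-library heuristic for spotting the "not zero-indexed" condition before involving PyPDF2; it only understands classic "xref" tables (not cross-reference streams) and may flag incrementally updated files. The second assumes that the fix comes from reading with strict=False and writing the pages back out, which the answer above does not show explicitly, and it uses the PyPDF2 names from before version 3.0 (newer releases renamed them to PdfReader and PdfWriter).

```python
import io
import re

# A classic cross-reference section: the keyword "xref" followed by a
# subsection header "<first object number> <entry count>".
# The lookbehind keeps "startxref" from matching.
_XREF_HEADER = re.compile(rb"(?<![a-z])xref\s+(\d+)\s+(\d+)")


def has_nonzero_xref_start(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any classic xref table begins at a non-zero object number."""
    return any(int(m.group(1)) != 0 for m in _XREF_HEADER.finditer(pdf_bytes))


def rewrite_pdf(pdf_bytes: bytes) -> bytes:
    """Read leniently with PyPDF2 and write every page to a fresh PDF.

    Assumption: the write-back step is what produces a corrected xref table.
    """
    # Imported lazily so the check above works without PyPDF2 installed.
    # These are the PyPDF2 < 3.0 names, matching the answer above.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(io.BytesIO(pdf_bytes), strict=False)
    writer = PdfFileWriter()
    for page_number in range(reader.getNumPages()):
        writer.addPage(reader.getPage(page_number))

    out = io.BytesIO()
    writer.write(out)
    return out.getvalue()
```

Rewriting only the files that has_nonzero_xref_start flags, rather than every file, might also limit the side effect noted above where the rewrite breaks PDFs that were previously fine.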

Hope that helps.

Edited by: wchan on Jul 25, 2019 10:39 PM

Edited by: wchan on Jul 25, 2019 10:43 PM

wchan
Answered 5 years ago

Not sure if you ever got an answer to this, but I am running into it as well, and I think it's because my PDFs are too wide for the service.
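One way to test the page-size theory before submitting a job is to look at the page boxes in the file. The sketch below is a rough standard-library check, and the name oversized_media_boxes is hypothetical. The 2880-point (40-inch) limit is an assumption taken from memory of the Textract quotas documentation and should be verified against the current docs. The scan only sees /MediaBox entries stored uncompressed, and it ignores /UserUnit, /Rotate, and /CropBox, so an empty result does not prove the file is within limits.

```python
import re

# Assumed limit: maximum PDF page width/height of 2880 points (40 inches).
# Verify this against the current Amazon Textract quotas before relying on it.
MAX_SIDE_POINTS = 2880.0

_NUM = rb"([-+]?(?:\d+\.?\d*|\.\d+))"
_MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*" + _NUM + rb"\s+" + _NUM + rb"\s+" + _NUM + rb"\s+" + _NUM + rb"\s*\]"
)


def oversized_media_boxes(pdf_bytes: bytes, max_side: float = MAX_SIDE_POINTS):
    """Return (width, height) in points for each MediaBox with a side above max_side."""
    oversized = []
    for match in _MEDIABOX.finditer(pdf_bytes):
        x0, y0, x1, y1 = (float(value) for value in match.groups())
        width, height = abs(x1 - x0), abs(y1 - y0)
        if width > max_side or height > max_side:
            oversized.append((width, height))
    return oversized
```

Calling it on the raw bytes of a failing file, for example oversized_media_boxes(open(path, "rb").read()) with path being your own file, returns the dimensions of any page box above the assumed limit.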

Zac Dan
Answered 1 year ago
