INVALID_DOCUMENT_TYPE with certain PDFs

Hi,

We have Textract jobs failing with INVALID_DOCUMENT_TYPE for certain PDFs. They are all quite small (around 200 KB) and single-page. It doesn't happen for all PDFs.

Is there any way of getting feedback on why some fail?

  • This is still a problem. INVALID_DOCUMENT_TYPE provides no information about what is actually wrong; more granular messages would be invaluable. In my case, I think my documents are failing because they exceed the page size limits, but there's no way to know for sure until you "fix" it and see whether it succeeds.

Asked 5 years ago, 731 views
2 answers

Hmm... Strange. I am not able to reply to {message:id=893763} but can reply here.

The workaround mentioned in the other thread, printing the PDF document to a new PDF, works for me. However, I needed a programmatic solution, so I dug a little deeper. Using PyPDF2, I found that each of my documents that failed with INVALID_DOCUMENT_TYPE also triggered the warning "Xref table not zero-indexed. ID numbers for objects will be corrected." when read with PyPDF2. So I used PyPDF2.PdfFileReader(my_bad_pdf_stream, strict=False), which fixed the faulty Xref table. All of my previously failing PDFs now work.
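For anyone wanting to try the same approach, here is a minimal sketch of it. It assumes the repaired document is written back out with PdfFileWriter before being sent to Textract (the answer doesn't spell that step out), and it assumes an older PyPDF2 release that still exposes PdfFileReader and PdfFileWriter. The function name rewrite_pdf is made up for illustration.

```python
import io


def rewrite_pdf(bad_pdf_bytes: bytes) -> bytes:
    """Read a PDF leniently and re-serialize it into a new byte string."""
    # Imported lazily so this module loads even where PyPDF2 is not installed.
    # Assumption: an older PyPDF2 release with the PdfFileReader/PdfFileWriter API.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    # strict=False is the setting the answer above reports using.
    reader = PdfFileReader(io.BytesIO(bad_pdf_bytes), strict=False)

    # Copy every page into a fresh writer, which emits its own xref table.
    writer = PdfFileWriter()
    for page_number in range(reader.getNumPages()):
        writer.addPage(reader.getPage(page_number))

    out = io.BytesIO()
    writer.write(out)
    return out.getvalue()
```

The returned bytes could then be uploaded to S3 in place of the original file before starting the Textract job. Newer PyPDF2 releases renamed these classes and methods, so the calls may need adjusting there.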

Note that this is not a complete solution because the PyPDF2 code introduced errors in some files that were not previously problematic. This other problem will have to wait as I can process all my files for now.

Hope that helps.

Edited by: wchan on Jul 25, 2019 10:39 PM

Edited by: wchan on Jul 25, 2019 10:43 PM

wchan
Answered 5 years ago

Not sure if you ever got an answer to this, but I am running into it as well, and I think it's because my PDFs are too wide for the service.
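One way to test the page-size theory before submitting a job is to read each page's MediaBox and compare it against the service limit. The sketch below assumes a limit of 40 inches (2880 points) per side for PDFs, which should be checked against the current Textract quotas, and again assumes an older PyPDF2 API. The names page_too_large and oversized_pages are hypothetical, and the check ignores page rotation and UserUnit scaling.

```python
# Assumption: PDF pages may be at most 40 inches (2880 points) per side.
MAX_SIDE_POINTS = 40 * 72


def page_too_large(width_pt, height_pt, max_side_pt=MAX_SIDE_POINTS):
    """Return True if either side of the page exceeds the assumed limit."""
    return max(width_pt, height_pt) > max_side_pt


def oversized_pages(pdf_stream):
    """Return (page_number, width, height) for every page over the limit."""
    # Imported lazily; assumes an older PyPDF2 release with PdfFileReader.
    from PyPDF2 import PdfFileReader

    reader = PdfFileReader(pdf_stream, strict=False)
    flagged = []
    for index in range(reader.getNumPages()):
        box = reader.getPage(index).mediaBox
        width, height = float(box.getWidth()), float(box.getHeight())
        if page_too_large(width, height):
            flagged.append((index + 1, width, height))  # 1-based page number
    return flagged
```

An empty list from oversized_pages would suggest page size is not the cause for that file, at least under the assumed limit.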

Zac Dan
Answered a year ago
