Hi there,
It works by converting the text into a numerical representation that LLMs can understand.
Imagine you have a bunch of documents, like essays, articles, or reports, that you want to store in a database. The goal is to make it easy to find specific information in these documents when you need it.
The first step is to split the documents into smaller pieces, called "chunks." This makes it easier to search through the information later on. Think of it like breaking a big book into smaller chapters or sections.
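The exact chunking strategy Amazon Q uses isn't documented here, but as an illustration of the idea, a minimal fixed-size chunker with overlap (so sentences that straddle a boundary stay searchable from both neighboring chunks) might look like this:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping fixed-size chunks.

    The overlap means the tail of each chunk is repeated at the
    head of the next one, so no passage is cut off at a boundary.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Example: a long document becomes a list of small, overlapping pieces.
document = "A long report about quarterly results. " * 40
pieces = chunk_text(document)
```

Real systems often chunk on sentence or paragraph boundaries rather than raw character counts, but the principle is the same.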
Next, these chunks are converted into a special kind of code called "embeddings." Embeddings are like a way to represent the meaning of the text in a mathematical form that a computer can understand. This helps the computer figure out how similar the chunks are to each other, or to a question you might ask.
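Production systems use a learned embedding model (for example, an Amazon Titan embeddings model) to produce these vectors. Purely as a toy illustration of "text in, fixed-length vector out", here is a hashed bag-of-words embedding — not what any real service uses, but it shows why similar chunks end up with similar vectors:

```python
import math
from collections import Counter

def embed(text, dims=64):
    """Toy embedding: hash each word into one of `dims` buckets,
    count hits, then L2-normalize. Chunks that share many words
    end up with vectors that point in similar directions."""
    vec = [0.0] * dims
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

v1 = embed("the cat sat on the mat")
v2 = embed("a cat sat on a mat")        # similar meaning, similar words
v3 = embed("quarterly revenue grew sharply")  # unrelated topic
```

A dot product between two of these normalized vectors gives a similarity score, which is exactly what the vector index in the next step is built to compare quickly.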
The embeddings are then stored in a "vector index." This is like a special type of database that's optimized for quickly finding the most relevant chunks based on the embeddings. It keeps track of where each chunk came from, so you can go back to the original document if you need to.
Finally, when you have a question or search term, the computer can use the vector index to find the chunks that are most similar to your query. This helps you quickly find the information you're looking for, without having to read through the entire collection of documents.
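Putting the steps above together, here is a minimal in-memory "vector index" sketch. It is a brute-force stand-in for a real vector store (FAISS, OpenSearch, etc.), and the `embed` function is the same hypothetical hashed bag-of-words toy, not a real embedding model — the point is only to show storing (vector, chunk, source) entries and retrieving by cosine similarity:

```python
import math
from collections import Counter

def embed(text, dims=64):
    """Toy hashed bag-of-words embedding (illustration only)."""
    vec = [0.0] * dims
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorIndex:
    """Brute-force index: stores (embedding, chunk, source) triples
    and returns the chunks closest to a query vector."""

    def __init__(self):
        self.entries = []

    def add(self, chunk, source):
        # Keep the source document alongside the vector, so results
        # can be traced back to where they came from.
        self.entries.append((embed(chunk), chunk, source))

    def search(self, query, top_k=2):
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), chunk, source)
            for v, chunk, source in self.entries
        ]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:top_k]

index = VectorIndex()
index.add("Cats sleep most of the day", "pets.txt")
index.add("Quarterly revenue grew sharply", "finance.txt")
results = index.search("how long do cats sleep", top_k=1)
```

The query never has to match the stored text word for word in a real system; with a learned embedding model, semantically related chunks score highly even with different wording.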
The image shows how this whole process works, from splitting the documents into chunks, to creating the embeddings, to storing them in the vector index. It's a handy way to make sure you can find the information you need, even in a big collection of documents.
If you're referring to the source document file, it must be in one of the following supported formats:
| Format | Extension |
| --- | --- |
| Plain text | .txt |
| Markdown | .md |
| HyperText Markup Language | .html |
| Microsoft Word document | .doc/.docx |
| Comma-separated values | .csv |
| Microsoft Excel spreadsheet | .xls/.xlsx |
| Portable Document Format | .pdf |
⚡ You can locate this information in the guide on setting up a data source for your knowledge base, available at Set up a data source for your knowledge base.
💡 If this answer doesn't meet your expectations, could you please clarify your question so I can better address your concerns?
Hi,
Thanks for the response. I did check out the guide on setting up a data source for the knowledge base, but I’m specifically interested in delving deeper into the process of text extraction from documents.
To clarify, while I understand the general flow of actions involved in setting up a knowledge base, what I’m particularly keen on is understanding the intricacies of the "Text Extraction From documents" phase. Our documents often contain a lot of what we call "dirty" text in PDFs, and we've struggled to extract clear text from them. It seems like Amazon has been able to handle this effectively, and I'm curious if there are insights or techniques we could learn from your approach.
Essentially, I’m wondering if there are any specific methods or services Amazon employs to achieve high-quality text extraction from PDFs, and whether these are available for use or integration into our own processes.
Thanks for your help!
Moving your follow up to a new answer.
Thank you for your detailed explanation, and I'm sorry for not clarifying myself in the original post. I did check out the guide on setting up a data source for the knowledge base, but I'm specifically interested in delving deeper into the process of text extraction from documents.

To clarify, while I understand the general flow of actions involved in setting up a knowledge base, what I'm particularly keen on is understanding the intricacies of the "Text Extraction From documents" phase. Our documents often contain a lot of what we call "dirty" text in PDFs, and we've struggled to extract clear text from them. It seems like Amazon has been able to handle this effectively, and I'm curious if there are insights or techniques we could learn from your approach.

Essentially, I'm wondering if there are any specific methods or services Amazon employs to achieve high-quality text extraction from PDFs, and whether these are available for use or integration into our own processes.
It seems like you are asking how Amazon Q extracts text from PDFs. Q converts your PDF to HTML and then extracts the text. See https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/doc-types.html#doc-types-supported
Can you clarify what you mean by "dirty text"? If you are referring to handwriting or scanned PDF files, that isn't supported. If you are struggling with scanned PDFs or handwriting, you might want to run the documents through Amazon Textract first, and then ingest the raw text output into Q. Take a look at https://aws.amazon.com/blogs/machine-learning/process-text-and-images-in-pdf-documents-with-amazon-textract/