Optimizing Data Sources

I have 3 book-length PDFs about growing food. Questions:

  1. Should I convert them to TXT files and remove front matter, the index, and anything else that’s not important, or should I not sweat that stuff?
  2. Should I upload each chapter, or at least section, as a separate document? Would that be better or worse for performance and output accuracy?
  3. Alternatively, if each book has a chapter on composting, should I instead combine them all into one document called Composting?

Thank you!

Hi @philthrive,

Thanks for the post!

You don’t need to convert the PDFs to TXT. The text from every file you upload to Data Sources is automatically extracted and split into chunks. When you use a Query Data Source block or the RAG option in Chat, the system searches those chunks and returns the most relevant ones for your query. Since the output is then generated from the returned chunks, it’s a good idea to test your setup and make sure it’s pulling the right context. If the queries feel too vague, you can also add a Generate Text block to help refine them.
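If it helps to see what’s happening behind the scenes, here’s a rough sketch of that chunk-and-retrieve pattern in Python. It’s illustrative only — the chunk size, overlap, and embedding model are assumptions for the example, not our actual pipeline:

```python
# Rough sketch of chunk-and-retrieve (illustrative, not the platform's
# actual pipeline). Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split extracted text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def top_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    vecs = model.encode(chunks, normalize_embeddings=True)
    q = model.encode(query, normalize_embeddings=True)
    scores = vecs @ q  # cosine similarity, since vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```

The key point: once the text is extracted, the original file format is gone, so PDF vs. TXT makes no difference to retrieval.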

As for splitting by chapter or section: that’s a great question, and I’d start by uploading the files as they are and testing from there. A Query Data Source block can return up to five chunks, so you can have a Generate Text block create a few variations of the query and send them to multiple Query Data Source blocks to gather more relevant context.
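Sketched out, that multi-query idea looks something like this. Here `search` stands in for whatever retrieval call you have (a single Query Data Source block, or the `top_chunks` sketch above), and the variant list is just an example of what a Generate Text block might produce:

```python
from typing import Callable

def multi_query(variants: list[str],
                search: Callable[[str, int], list[str]],
                k: int = 5) -> list[str]:
    """Run several phrasings of the same question and pool the results,
    deduplicated, in rank order."""
    pooled: list[str] = []
    for q in variants:
        for c in search(q, k):
            if c not in pooled:
                pooled.append(c)
    return pooled

# Example: three phrasings of the same user question.
context = multi_query(
    ["How do I start a compost pile?",
     "compost pile setup for beginners",
     "building a backyard composting system"],
    search=lambda q, k: top_chunks(q, chunks, k),  # assumes the sketch above
)
```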

If you decide to combine files by topic instead, you could use the automatically generated Index Snippets in a Logic block to let the AI pick the most relevant file for a user’s question. From there, you can branch the workflow either to Query Data Source blocks or to Generate Text blocks with an Access Snippet, which simply inserts the whole text of the file into the prompt, giving the AI all the data from that file.
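Here’s a rough sketch of that routing pattern. The `descriptions` dict stands in for Index Snippets, and `llm` is a placeholder for whatever model call your Logic block makes — both are assumptions for the example:

```python
from typing import Callable

def route(question: str, descriptions: dict[str, str],
          llm: Callable[[str], str]) -> str:
    """Ask the model to pick the most relevant file from short descriptions
    (the role Index Snippets play). `llm` is a placeholder model call."""
    menu = "\n".join(f"- {name}: {desc}" for name, desc in descriptions.items())
    prompt = (f"Question: {question}\n\nFiles:\n{menu}\n\n"
              "Reply with the name of the single most relevant file, nothing else.")
    return llm(prompt).strip()

def build_context(question: str, name: str, files: dict[str, str],
                  full_text_limit: int = 20_000) -> str:
    """Branch: paste the whole file if it fits (the Access Snippet approach),
    otherwise fall back to retrieving chunks from it."""
    text = files[name]
    if len(text) <= full_text_limit:
        return text
    return "\n\n".join(top_chunks(question, chunk(text)))  # sketch above
```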

Let me know what you think!

Thanks, Alex,

I tried a few different free converters, since I figured it wouldn’t hurt to spend a little time removing unnecessary material, and I found some errors in the conversion: all of the ff’s became !’s (e.g., the word “offers” became “o!ers”). I did some find/replaces to fix things like that. It may not be the end of the world, but it does feel good to have cleaner documents.
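For anyone who hits the same thing, here’s roughly what my find/replaces boiled down to as a script. It assumes a “!” between letters always stands for a mangled “ff” ligature, which held for my files but is worth spot-checking before running over a whole book:

```python
import re

def repair_ligatures(text: str) -> str:
    """Repair ligature artifacts like "o!ers" -> "offers".
    Assumes "!" between letters is always a broken "ff" ligature,
    so review a sample of matches before replacing everywhere."""
    return re.sub(r"(?<=[A-Za-z])!(?=[a-z])", "ff", text)

print(repair_ligatures("The method o!ers real e!ort savings."))
# -> "The method offers real effort savings."
```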

That makes sense, but it’s still a bit confusing. I did some more digging yesterday, asking LLMs and reading articles, and the consensus was that breaking things apart often leads to more accurate responses. That didn’t make sense to me, since it all gets chunked and embedded anyway, but that’s what I read.

Thanks, I’ll dive into that more. Your documentation is sparse on that topic, but I’ll look elsewhere to learn more about it.

Thanks again!