Presently, I have only been building RAG databases from PDF files. Now I want to try building a chatbot that uses a company's website content as the database, instead of asking the company to give me their PDFs to build the database from. Since there doesn't appear to be a specific block for this, I was thinking of using a scrape block, capturing the output as one huge PDF, and then setting up the RAG database manually myself. Is there a smarter way of going about this? It would be neat if we had the ability to store whatever we scrape as persistent database information… or is that already possible?
Hmm, it sounds like this would be best split into two agents: one that creates the database PDF from the scraped data, and another in MS that has that PDF as a Data Source and uses the Retrieve Data Source block.
That said, when it comes to scraping tons of stuff, maybe something like https://www.hyperbrowser.ai/ might be useful here.
Alternatively, have Claude create a web scraper in Python that scrapes exactly what you need from the site. Claude will guide you through it, so it'll be a walk in the park even if you don't know a line of Python.
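To give a feel for what Claude would produce, here's a minimal stdlib-only sketch of that kind of scraper: it strips tags (and `<script>`/`<style>` contents) from a page's HTML to get plain text you could feed into a RAG database. The `TextExtractor` name, the sample HTML, and the placeholder URL are all just illustrative assumptions, not anything from a specific tool.

```python
# Minimal sketch of a stdlib-only scraper for building a RAG text corpus.
# Names and URLs below are placeholders; a real scraper would also handle
# crawling multiple pages, robots.txt, rate limiting, etc.
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace so the output chunks cleanly later
        return " ".join(" ".join(self.parts).split())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one clean string."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()


def scrape_page(url: str) -> str:
    """Fetch a single page and return its cleaned text."""
    with urlopen(url, timeout=10) as resp:
        return extract_text(resp.read().decode("utf-8", errors="replace"))


if __name__ == "__main__":
    # Offline demo on a sample snippet; swap in scrape_page("https://…")
    # for the site you actually want to index.
    sample = "<body><style>p{}</style><p>Pricing</p><p>Contact  us</p></body>"
    print(extract_text(sample))  # → Pricing Contact us
```

From there you'd chunk the extracted text and load it into whatever data source your platform expects, rather than going through a PDF at all.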