Scrape Websites

Scrape any webpage(s) to add all of the content to your knowledge base.

Data Quality in AI Agents

When it comes to creating AI agents, especially with platforms like Stammer, the quality of data is paramount. Here's why:

Bad Data Equals Bad Performance
An AI agent performs only as well as the data it is trained on. Simply scraping all the content from a website and feeding it to a bot won't yield good results, because many websites contain outdated, irrelevant, or inaccurate information that degrades the bot's answers.
The Structure of Data Matters
Not all content on a website is useful. For instance, some blog posts contain no relevant information about a product, and others are generated automatically by the application. Including such pages can misinform the bot.
Strategies for Effective Data Collection
- Selective Scraping: Instead of scraping everything, focus on the most relevant and accurate pages.
- Utilizing FAQ Pages: These pages are goldmines as they often contain question-answer pairs. Platforms like Notifier use knowledge base matching to find text that matches customer queries, making FAQ pages extremely valuable.
- Handling Unstructured Data: If a webpage contains unstructured data, such as paragraphs without clear headings, you can preprocess it with tools like ChatGPT and the WebPilot plugin. This restructures the data into a more bot-friendly form.
- Dealing with JavaScript-heavy Pages: Some web pages rely heavily on JavaScript to display content. In such cases, tools like the WebPilot plugin might not work effectively. However, using the Chrome extension 'Page Plain Text' can help extract all the text from such pages, ensuring the bot gets all the necessary information.
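
The strategies above can be sketched in a few lines of Python. This is a minimal illustration, not how Stammer's scraper works internally: the `ALLOWED_PREFIXES` list, the function names, and the rule that "a line ending in a question mark starts a new FAQ entry" are all assumptions made for the example. It uses only the standard library and does not execute JavaScript, so it applies to pages whose text is present in the raw HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allow-list of path prefixes considered worth scraping.
ALLOWED_PREFIXES = ("/faq", "/pricing", "/docs")


def select_urls(urls, allowed_prefixes=ALLOWED_PREFIXES):
    """Selective scraping: keep only URLs whose path starts with an allowed prefix."""
    return [u for u in urls if urlparse(u).path.startswith(allowed_prefixes)]


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def faq_pairs(text):
    """Group plain text into (question, answer) pairs.

    Assumption: a line ending in '?' is a question, and the lines that
    follow it (until the next question) form its answer.
    """
    pairs = []
    question, answer = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("?"):
            if question and answer:
                pairs.append((question, " ".join(answer)))
            question, answer = line, []
        elif question:
            answer.append(line)
    if question and answer:
        pairs.append((question, " ".join(answer)))
    return pairs
```

A typical flow would be to filter a list of candidate URLs with `select_urls`, fetch each remaining page, convert it with `html_to_text`, and review the output of `faq_pairs` before adding it to the knowledge base. Reviewing the pairs by hand is still worthwhile, since the question-mark heuristic will miss FAQ entries phrased as statements.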
Future Developments
There are plans to introduce a feature that allows users to preprocess data directly from the user interface. Feedback from users is crucial in refining and introducing such features.

Scraping a Google Doc

The website scraper can also read and scrape all of the text from a public Google Doc.


Last updated 1 year ago