# Scrape Websites

<figure><img src="https://1359281993-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FkLIQnOFYHkQtdWxUzzFE%2Fuploads%2FbZifE6ONo3JD2rV52Ul7%2FScreenshot%202024-03-07%20at%207.41.10%E2%80%AFPM.jpg?alt=media&#x26;token=43cb4c78-daa5-498d-94bb-371880805ac3" alt=""><figcaption></figcaption></figure>

{% embed url="<https://www.loom.com/share/d3afa7ca552e4f5c97e703312818f1ec?sid=2e555c28-095b-48cf-b968-87f77c3d2273>" %}

**Data Quality in AI Agents**

When you build AI agents, especially on platforms like Stammer, data quality is paramount. Here's why:

1. **Bad Data Equals Bad Performance**

   An AI agent performs only as well as the data it draws on. Simply scraping all the content from a website and feeding it to a bot won't yield good results. Many websites contain outdated, irrelevant, or inaccurate information that can hinder the bot's performance.
2. **The Structure of Data Matters**

   Not all content on a website is useful. For instance, some blog posts might lack relevant information about a product or might be generated automatically by the application. Including such data can misinform the bot.
3. **Strategies for Effective Data Collection**
   * Selective Scraping: Instead of scraping everything, focus on the most relevant and accurate pages.
   * Utilizing FAQ Pages: These pages are goldmines as they often contain question-answer pairs. Platforms like Notifier use knowledge base matching to find text that matches customer queries, making FAQ pages extremely valuable.
   * Handling Unstructured Data: If a webpage contains unstructured data, like paragraphs without clear headings, you can preprocess it using tools like ChatGPT and the WebPilot plugin. This structures the data in a more bot-friendly manner.
   * Dealing with JavaScript-heavy Pages: Some web pages rely heavily on JavaScript to display content. In such cases, tools like the WebPilot plugin might not work effectively. However, using the Chrome extension 'Page Plain Text' can help extract all the text from such pages, ensuring the bot gets all the necessary information.
4. **Future Developments**

   There are plans to introduce a feature that allows users to preprocess data directly from the user interface. Feedback from users is crucial in refining and introducing such features.
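
The selective-scraping and FAQ strategies above can be sketched in a few lines of Python. This is a minimal illustration, not part of Stammer: the function names `select_urls` and `split_faq`, the keyword lists, and the assumption that each FAQ question sits on its own line ending in `?` are all hypothetical and should be adapted to the site you are working with.

```python
from urllib.parse import urlparse

# Hypothetical keyword lists; tune them for the site being scraped.
INCLUDE_HINTS = ("faq", "help", "pricing", "docs", "support")
EXCLUDE_HINTS = ("blog", "archive", "changelog")


def select_urls(urls, include=INCLUDE_HINTS, exclude=EXCLUDE_HINTS):
    """Keep URLs whose path suggests relevant content; drop likely noise."""
    selected = []
    for url in urls:
        path = urlparse(url).path.lower()
        # Exclusions win, so "/blog/faq-roundup" is still skipped.
        if any(hint in path for hint in exclude):
            continue
        if any(hint in path for hint in include):
            selected.append(url)
    return selected


def split_faq(text):
    """Split plain FAQ text into (question, answer) pairs.

    Assumes each question is a single line ending in '?' and the answer
    is every non-empty line that follows until the next question.
    """
    pairs = []
    question, answer_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("?"):
            # A new question closes the previous pair, if there was one.
            if question is not None:
                pairs.append((question, " ".join(answer_lines)))
            question, answer_lines = line, []
        elif question is not None:
            answer_lines.append(line)
    if question is not None:
        pairs.append((question, " ".join(answer_lines)))
    return pairs
```

Filtering the URL list before scraping keeps outdated or auto-generated pages out of the knowledge base, and splitting FAQ text into explicit question-answer pairs makes it easy to review each pair by hand before adding it.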

## Scraping a Google Doc

The website scraper can also read and scrape all of the text from a public Google Doc.

{% embed url="<https://www.youtube.com/watch?v=WEvh-nVej2w>" %}
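
To preview the text a public Google Doc exposes before pointing the scraper at it, you can build the document's plain-text export link. The sketch below is an assumption-laden helper, not a description of how Stammer's scraper works internally: the function name `google_doc_text_url` is hypothetical, and it relies on the commonly used `/export?format=txt` URL pattern for publicly shared Google Docs.

```python
import re


def google_doc_text_url(share_url):
    """Turn a Google Docs share link into its plain-text export link.

    Assumes the document ID follows '/document/d/' in the share URL and
    that the document is shared publicly.
    """
    match = re.search(r"/document/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError("not a Google Docs document URL")
    doc_id = match.group(1)
    return f"https://docs.google.com/document/d/{doc_id}/export?format=txt"
```

Opening the resulting link in a browser shows the raw text of the document, which is a quick way to check that the doc is actually public and that its content is clean enough to feed to a bot.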


