V1.0.0 (#10)
* v.1.0.0

* continuity
robbiemu authored Dec 12, 2024
1 parent 4be7b2d commit f08057f
Showing 8 changed files with 265 additions and 111 deletions.
96 changes: 77 additions & 19 deletions README.md
@@ -1,21 +1,26 @@
# Documentation Crawler and Converter v1.0.0

This tool crawls a documentation website and converts the pages into a single Markdown document. It intelligently removes common sections that appear across multiple pages to avoid duplication, including them once at the end of the document.

**Version 1.0.0** introduces significant improvements, including support for JavaScript-rendered pages using Playwright and a fully asynchronous implementation.

## Features

- **JavaScript Rendering**: Utilizes Playwright to accurately render pages that rely on JavaScript, ensuring complete and up-to-date content capture.
- Crawls documentation websites and combines pages into a single Markdown file.
- Removes common sections that appear across many pages, including them once at the end of the document.
- Customizable threshold for similarity to control deduplication sensitivity.
- Configurable selectors to remove specific elements from pages.
- Supports robots.txt compliance with an option to ignore it.
- Ability to skip URLs based on ignore-paths, both pre-fetch (before requesting content) and post-fetch (after redirects).

### New in v1.0.0

- JavaScript rendering: waits for the page to stabilize before scraping (see the sketch below).
- Asynchronous operation: fully asynchronous crawling improves performance and scalability.
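
As an illustration of the rendering step, the minimal sketch below uses Playwright's async API to load a page, wait for network activity to quiet down, and return the rendered HTML. It is a sketch of the general technique, not the tool's actual implementation; the `render_page` helper and the `networkidle` wait strategy are assumptions.

```python
import asyncio
from playwright.async_api import async_playwright

async def render_page(url: str) -> str:
    """Sketch (assumed helper): render a JavaScript-heavy page and return its HTML."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        # Use "no network activity" as a proxy for a stable DOM.
        await page.wait_for_load_state("networkidle")
        html = await page.content()
        await browser.close()
        return html

if __name__ == "__main__":
    print(asyncio.run(render_page("https://example.com/docs/")))
```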

## Installation

### Prerequisites

- **Python 3.7 or higher** is required.
- (Optional) It is recommended to use a virtual environment to avoid dependency conflicts with other projects.

### 1. Installing the Package with `pip`
@@ -49,11 +54,13 @@ It is recommended to use a virtual environment to isolate the package and its dependencies
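
The collapsed hunk above includes creating the virtual environment; a typical command, assuming the `venv` directory name used by the activation steps below:

```bash
python -m venv venv
```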
2. **Activate the virtual environment**:

- On **macOS/Linux**:

```bash
source venv/bin/activate
```

- On **Windows**:

```bash
.\venv\Scripts\activate
```
@@ -66,15 +73,25 @@ It is recommended to use a virtual environment to isolate the package and its dependencies

This ensures that all dependencies are installed within the virtual environment.

### 4. Installing Playwright Browsers

After installing the package, you need to install the necessary Playwright browser binaries:

```bash
playwright install
```

This command downloads the required browser binaries (Chromium, Firefox, and WebKit) used by Playwright for rendering pages.
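
If you would rather not download all three engines, Playwright also supports installing a single browser, for example:

```bash
playwright install chromium
```

Whether one engine suffices depends on which browser the crawler launches, so installing all of them (as above) is the safe default.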

### 5. Installing from PyPI

Once the package is published on PyPI, you can install it directly using:

```bash
pip install libcrawler
```

### 6. Upgrading the Package

To upgrade the package to the latest version, use:

```bash
pip install --upgrade libcrawler
```

This will upgrade the package to the newest version available.

### 7. Verifying the Installation

You can verify that the package has been installed correctly by running:

@@ -102,7 +119,7 @@ crawl-docs BASE_URL STARTING_POINT [OPTIONS]

### Arguments

- `BASE_URL`: The base URL of the documentation site (e.g., _https://example.com_).
- `STARTING_POINT`: The starting path of the documentation (e.g., /docs/).

### Optional Arguments
@@ -117,29 +134,33 @@ crawl-docs BASE_URL STARTING_POINT [OPTIONS]
- `--ignore-paths PATH [PATH ...]`: List of URL paths to skip during crawling, either before or after fetching content.
- `--user-agent USER_AGENT`: Specify a custom User-Agent string (which will be harmonized with any additional headers).
- `--headers-file FILE`: Path to a JSON file containing optional headers. Only one of `--headers-file` or `--headers-json` can be used.
- `--headers-json JSON` (JSON string): Optional headers as JSON.

### Examples

#### Basic Usage

```bash
crawl-docs https://example.com /docs/ -o output.md
```

#### Adjusting Thresholds

```bash
crawl-docs https://example.com /docs/ -o output.md \
--similarity-threshold 0.7 \
--delay-range 0.3
```

#### Specifying Extra Selectors to Remove

```bash
crawl-docs https://example.com /docs/ -o output.md \
--remove-selectors ".sidebar" ".ad-banner"
```

#### Limiting to Specific Paths

```bash
crawl-docs https://example.com / -o output.md \
--allowed-paths "/docs/" "/api/"
```

@@ -148,24 +169,61 @@
#### Skipping URLs with Ignore Paths

```bash
crawl-docs https://example.com /docs/ -o output.md \
--ignore-paths "/old/" "/legacy/"
```
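
#### Supplying Custom Headers

Custom headers can be given inline as JSON or loaded from a file (only one of `--headers-file` and `--headers-json` may be used). The header values below are illustrative:

```bash
crawl-docs https://example.com /docs/ -o output.md \
    --user-agent "MyDocsBot/1.0" \
    --headers-json '{"Accept-Language": "en-US"}'
```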

## Dependencies

- **Python 3.7 or higher**
- [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for HTML parsing.
- [markdownify](https://github.com/matthewwithanm/python-markdownify) for converting HTML to Markdown.
- [Playwright](https://playwright.dev/python/docs/intro) for headless browser automation and JavaScript rendering.
- [aiofiles](https://github.com/Tinche/aiofiles) for asynchronous file operations.
- Additional dependencies are listed in `requirements.txt`.

### Installing Dependencies

After setting up your environment, install all required dependencies using:

```bash
pip install -r requirements.txt
```

**Note**: Ensure you have installed the Playwright browsers by running `playwright install` as mentioned in the Installation section.

## License

This project is licensed under the LGPLv3. See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please follow these steps to contribute:

1. **Fork the repository** on GitHub.
2. **Clone your fork** to your local machine:

   ```bash
   git clone https://github.com/your-username/libcrawler.git
   ```

3. **Create a new branch** for your feature or bugfix:

   ```bash
   git checkout -b feature-name
   ```

4. **Make your changes** and **commit** them with clear messages:

   ```bash
   git commit -m "Add feature X"
   ```

5. **Push** your changes to your fork:

   ```bash
   git push origin feature-name
   ```

6. **Open a Pull Request** on the original repository describing your changes.

Please ensure your code adheres to the project's coding standards and includes appropriate tests.

## Acknowledgements

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for HTML parsing.
- [Playwright](https://playwright.dev/) for headless browser automation.
- [Markdownify](https://github.com/matthewwithanm/python-markdownify) for converting HTML to Markdown.
- [aiofiles](https://github.com/Tinche/aiofiles) for asynchronous file operations.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -8,7 +8,7 @@ description = "A tool to crawl documentation and convert to Markdown."
authors = [
{ name="Robert Collins", email="[email protected]" }
]
requires-python = ">=3.6"
requires-python = ">=3.7"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)",
2 changes: 2 additions & 0 deletions requirements.txt
@@ -1,4 +1,6 @@
aiofiles~=24.1.0
beautifulsoup4~=4.12.3
datasketch~=1.6.5
markdownify~=0.13.1
playwright~=1.49.1
Requests~=2.32.3
Empty file added src/libcrawler/__init__.py
8 changes: 5 additions & 3 deletions src/libcrawler/__main__.py
@@ -1,3 +1,4 @@
import asyncio
import argparse
import json
from urllib.parse import urljoin
@@ -18,6 +19,7 @@ def main():
help='Delay between requests in seconds.')
parser.add_argument('--delay-range', type=float, default=0.5,
help='Range for random delay variation.')
parser.add_argument('--interval', type=int, help='Time step used while waiting for the DOM to stabilize, in milliseconds (default: 1000 ms).')
parser.add_argument('--remove-selectors', nargs='*',
help='Additional CSS selectors to remove from pages.')
parser.add_argument('--similarity-threshold', type=float, default=0.6,
@@ -55,7 +57,7 @@ def main():
start_url = urljoin(args.base_url, args.starting_point)

# Adjust crawl_and_convert call to handle ignore-paths and optional headers
asyncio.run(crawl_and_convert(
start_url=start_url,
base_url=args.base_url,
output_filename=args.output,
@@ -68,8 +70,8 @@
similarity_threshold=args.similarity_threshold,
allowed_paths=args.allowed_paths,
ignore_paths=args.ignore_paths # Pass the ignore-paths argument
))


if __name__ == '__main__':
main()
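
For library use, the same entry point can be driven programmatically. Below is a minimal sketch; the import path is an assumption, and the keyword arguments mirror those shown in the diff above:

```python
import asyncio

# Import path is an assumption; adjust to wherever crawl_and_convert is exposed.
from libcrawler import crawl_and_convert

asyncio.run(crawl_and_convert(
    start_url="https://example.com/docs/",  # page to start crawling from
    base_url="https://example.com",         # site root used to resolve links
    output_filename="output.md",            # combined Markdown output
    similarity_threshold=0.6,               # deduplication sensitivity
    ignore_paths=["/old/", "/legacy/"],     # URL paths to skip
))
```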