V1.0.0 (#10)
* v.1.0.0

* continuity
robbiemu authored Dec 12, 2024
1 parent 4be7b2d commit f08057f
Showing 8 changed files with 265 additions and 111 deletions.
96 changes: 77 additions & 19 deletions README.md
@@ -1,21 +1,26 @@
# Documentation Crawler and Converter v1.0.0

This tool crawls a documentation website and converts the pages into a single Markdown document. It intelligently removes common sections that appear across multiple pages to avoid duplication, including them once at the end of the document.

**Version 1.0.0** introduces significant improvements, including support for JavaScript-rendered pages using Playwright and a fully asynchronous implementation.

## Features

- **JavaScript Rendering**: Utilizes Playwright to accurately render pages that rely on JavaScript, ensuring complete and up-to-date content capture.
- Crawls documentation websites and combines pages into a single Markdown file.
- Removes common sections that appear across many pages, including them once at the end of the document.
- Customizable threshold for similarity to control deduplication sensitivity.
- Configurable selectors to remove specific elements from pages.
- Supports robots.txt compliance with an option to ignore it.
- Ability to skip URLs based on ignore-paths, both pre-fetch (before requesting content) and post-fetch (after redirects).

### New in v1.0.0

- JavaScript rendering: waits for the page to stabilize before scraping (see the sketch below).
- Asynchronous operation: fully asynchronous crawling improves performance and scalability.
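
As an illustration of the rendering step, the minimal sketch below uses Playwright's async API to load a page, wait for network activity to quiet down, and return the rendered HTML. It is a sketch of the general technique, not the tool's actual implementation; the `render_page` helper and the `networkidle` wait strategy are assumptions.

```python
import asyncio
from playwright.async_api import async_playwright

async def render_page(url: str) -> str:
    """Sketch (assumed helper): render a JavaScript-heavy page and return its HTML."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        # Use "no network activity" as a proxy for a stable DOM.
        await page.wait_for_load_state("networkidle")
        html = await page.content()
        await browser.close()
        return html

if __name__ == "__main__":
    print(asyncio.run(render_page("https://example.com/docs/")))
```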

## Installation

### Prerequisites

- **Python 3.7 or higher** is required.
- (Optional) It is recommended to use a virtual environment to avoid dependency conflicts with other projects.

### 1. Installing the Package with `pip`
@@ -49,11 +54,13 @@ It is recommended to use a virtual environment to isolate the package and its dependencies
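
The collapsed hunk above includes creating the virtual environment; a typical command, assuming the `venv` directory name used by the activation steps below:

```bash
python -m venv venv
```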
2. **Activate the virtual environment**:

- On **macOS/Linux**:

```bash
source venv/bin/activate
```

- On **Windows**:

```bash
.\venv\Scripts\activate
```
@@ -66,15 +73,25 @@ It is recommended to use a virtual environment to isolate the package and its dependencies

This ensures that all dependencies are installed within the virtual environment.

### 4. Installing Playwright Browsers

After installing the package, you need to install the necessary Playwright browser binaries:

```bash
playwright install
```

This command downloads the required browser binaries (Chromium, Firefox, and WebKit) used by Playwright for rendering pages.
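
If you would rather not download all three engines, Playwright also supports installing a single browser, for example:

```bash
playwright install chromium
```

Whether one engine suffices depends on which browser the crawler launches, so installing all of them (as above) is the safe default.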

### 5. Installing from PyPI

Once the package is published on PyPI, you can install it directly using:

```bash
pip install libcrawler
```

### 6. Upgrading the Package

To upgrade the package to the latest version, use:

```bash
pip install --upgrade libcrawler
```

This will upgrade the package to the newest version available.

### 7. Verifying the Installation

You can verify that the package has been installed correctly by running:

@@ -102,7 +119,7 @@ crawl-docs BASE_URL STARTING_POINT [OPTIONS]

### Arguments

- `BASE_URL`: The base URL of the documentation site (e.g., _https://example.com_).
- `STARTING_POINT`: The starting path of the documentation (e.g., /docs/).

### Optional Arguments
@@ -117,29 +134,33 @@ crawl-docs BASE_URL STARTING_POINT [OPTIONS]
- `--ignore-paths PATH [PATH ...]`: List of URL paths to skip during crawling, either before or after fetching content.
- `--user-agent USER_AGENT`: Specify a custom User-Agent string (which will be harmonized with any additional headers).
- `--headers-file FILE`: Path to a JSON file containing optional headers. Only one of `--headers-file` or `--headers-json` can be used.
- `--headers-json JSON` (JSON string): Optional headers as JSON.

### Examples

#### Basic Usage

```bash
crawl-docs https://example.com /docs/ -o output.md
```

#### Adjusting Thresholds

```bash
crawl-docs https://example.com /docs/ -o output.md \
--similarity-threshold 0.7 \
--delay-range 0.3
```

#### Specifying Extra Selectors to Remove

```bash
crawl-docs https://example.com /docs/ -o output.md \
--remove-selectors ".sidebar" ".ad-banner"
```

#### Limiting to Specific Paths

```bash
crawl-docs https://example.com / -o output.md \
--allowed-paths "/docs/" "/api/"
```

@@ -148,24 +169,61 @@
#### Skipping URLs with Ignore Paths

```bash
crawl-docs https://example.com /docs/ -o output.md \
--ignore-paths "/old/" "/legacy/"
```
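
#### Supplying Custom Headers

Custom headers can be given inline as JSON or loaded from a file (only one of `--headers-file` and `--headers-json` may be used). The header values below are illustrative:

```bash
crawl-docs https://example.com /docs/ -o output.md \
    --user-agent "MyDocsBot/1.0" \
    --headers-json '{"Accept-Language": "en-US"}'
```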

## Dependencies

- **Python 3.7 or higher**
- [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for HTML parsing.
- [markdownify](https://github.com/matthewwithanm/python-markdownify) for converting HTML to Markdown.
- [Playwright](https://playwright.dev/python/docs/intro) for headless browser automation and JavaScript rendering.
- [aiofiles](https://github.com/Tinche/aiofiles) for asynchronous file operations.
- Additional dependencies are listed in `requirements.txt`.

### Installing Dependencies

After setting up your environment, install all required dependencies using:

```bash
pip install -r requirements.txt
```

**Note**: Ensure you have installed the Playwright browsers by running `playwright install` as mentioned in the Installation section.

## License

This project is licensed under the LGPLv3. See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please follow these steps to contribute:

1. **Fork the repository** on GitHub.
2. **Clone your fork** to your local machine:

   ```bash
   git clone https://github.com/your-username/libcrawler.git
   ```

3. **Create a new branch** for your feature or bugfix:

   ```bash
   git checkout -b feature-name
   ```

4. **Make your changes** and **commit** them with clear messages:

   ```bash
   git commit -m "Add feature X"
   ```

5. **Push** your changes to your fork:

   ```bash
   git push origin feature-name
   ```

6. **Open a Pull Request** on the original repository describing your changes.

Please ensure your code adheres to the project's coding standards and includes appropriate tests.

## Acknowledgements

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for HTML parsing.
- [Playwright](https://playwright.dev/) for headless browser automation.
- [Markdownify](https://github.com/matthewwithanm/python-markdownify) for converting HTML to Markdown.
- [aiofiles](https://github.com/Tinche/aiofiles) for asynchronous file operations.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -8,7 +8,7 @@ description = "A tool to crawl documentation and convert to Markdown."
authors = [
{ name="Robert Collins", email="[email protected]" }
]
requires-python = ">=3.6"
requires-python = ">=3.7"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)",
2 changes: 2 additions & 0 deletions requirements.txt
@@ -1,4 +1,6 @@
aiofiles~=24.1.0
beautifulsoup4~=4.12.3
datasketch~=1.6.5
markdownify~=0.13.1
playwright~=1.49.1
Requests~=2.32.3
Empty file added src/libcrawler/__init__.py
8 changes: 5 additions & 3 deletions src/libcrawler/__main__.py
@@ -1,3 +1,4 @@
import asyncio
import argparse
import json
from urllib.parse import urljoin
@@ -18,6 +19,7 @@ def main():
help='Delay between requests in seconds.')
parser.add_argument('--delay-range', type=float, default=0.5,
help='Range for random delay variation.')
parser.add_argument('--interval', type=int, help='Time step used while waiting for the DOM to stabilize, in milliseconds (default: 1000 ms).')
parser.add_argument('--remove-selectors', nargs='*',
help='Additional CSS selectors to remove from pages.')
parser.add_argument('--similarity-threshold', type=float, default=0.6,
@@ -55,7 +57,7 @@ def main():
start_url = urljoin(args.base_url, args.starting_point)

# Adjust crawl_and_convert call to handle ignore-paths and optional headers
asyncio.run(crawl_and_convert(
start_url=start_url,
base_url=args.base_url,
output_filename=args.output,
@@ -68,8 +70,8 @@
similarity_threshold=args.similarity_threshold,
allowed_paths=args.allowed_paths,
ignore_paths=args.ignore_paths # Pass the ignore-paths argument
))


if __name__ == '__main__':
main()
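
For library use, the same entry point can be driven programmatically. Below is a minimal sketch; the import path is an assumption, and the keyword arguments mirror those shown in the diff above:

```python
import asyncio

# Import path is an assumption; adjust to wherever crawl_and_convert is exposed.
from libcrawler import crawl_and_convert

asyncio.run(crawl_and_convert(
    start_url="https://example.com/docs/",  # page to start crawling from
    base_url="https://example.com",         # site root used to resolve links
    output_filename="output.md",            # combined Markdown output
    similarity_threshold=0.6,               # deduplication sensitivity
    ignore_paths=["/old/", "/legacy/"],     # URL paths to skip
))
```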