GitHub Crawler is a Python-based tool designed to perform targeted searches on GitHub and retrieve relevant search results. It supports searching for repositories, issues, and wiki pages, utilizing proxy servers for request management.
- Executes GitHub searches based on user-defined keywords
- Supports three search types: repositories, issues, and wikis
- Implements proxy server usage for HTTP requests
- Retrieves additional metadata for repository results (owner, language statistics)
- Outputs search results in JSON format
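Conceptually, each crawl builds a GitHub search URL from the keywords and fetches it through one of the configured proxies before parsing result links out of the HTML. The sketch below illustrates that request flow only; the function and parameter names are illustrative assumptions, not the actual interface of `github_crawler.py`:

```python
import random
import urllib.parse

import requests


def fetch_search_page(keywords, proxies, search_type="Repositories"):
    """Illustrative sketch: fetch one GitHub search results page through a random proxy."""
    query = urllib.parse.quote(" ".join(keywords))
    url = f"https://github.com/search?q={query}&type={search_type.lower()}"
    proxy = random.choice(proxies)  # e.g. "88.216.34.140:50100"
    proxy_map = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    response = requests.get(url, proxies=proxy_map, timeout=10)
    response.raise_for_status()
    return response.text  # raw HTML, to be parsed for result URLs
```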
- Python 3.9+
- Required libraries as specified in `requirements.txt`
- Clone the repository:

  ```bash
  git clone https://github.com/ivayanc/github-crawler.git
  cd github-crawler
  ```

- Set up a virtual environment:

  ```bash
  python3 -m venv venv
  ```

- Activate the virtual environment:

  - On Windows:

    ```
    venv\Scripts\activate
    ```

  - On macOS and Linux:

    ```bash
    source venv/bin/activate
    ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the search parameters in `input_data.json`:

  ```json
  {
    "keywords": ["openstack", "nova", "css"],
    "proxies": ["88.216.34.140:50100", "89.117.250.68:50100"],
    "type": "Repositories"
  }
  ```
Input Data Specifications:

- `keywords`: An array of strings representing search terms
- `proxies`: An array of strings in IP:PORT format
- `type`: A string specifying the search type. Valid options are:
  - `"Repositories"`: For repository searches
  - `"Issues"`: For issue searches
  - `"Wikis"`: For wiki page searches
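Before crawling, the configuration can be loaded and sanity-checked against these rules. The following is a minimal sketch under that assumption; the function name and error messages are illustrative, not necessarily how `main.py` does it:

```python
import json

VALID_TYPES = {"Repositories", "Issues", "Wikis"}


def load_input(path="input_data.json"):
    """Illustrative sketch: read and validate the crawler configuration."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not data.get("keywords"):
        raise ValueError("'keywords' must be a non-empty list of search terms")
    if not data.get("proxies"):
        raise ValueError("'proxies' must be a non-empty list of IP:PORT strings")
    if data.get("type") not in VALID_TYPES:
        raise ValueError(f"'type' must be one of {sorted(VALID_TYPES)}")
    return data
```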
- Execute the script:

  ```bash
  python main.py
  ```

- Retrieve results from `crawler_result.json`
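As a minimal sketch of consuming the output, assuming `crawler_result.json` contains a JSON array of result objects:

```python
import json

# Load the crawler output; each entry describes one search result.
with open("crawler_result.json", encoding="utf-8") as f:
    results = json.load(f)

for item in results:
    # Repository results also carry extra metadata (owner, language statistics).
    print(item)
```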
- `main.py`: Entry point for the application
- `github_crawler.py`: Core logic implementation
- `test_github_crawler.py`: Unit test suite
- `input_data.json`: Configuration file for search parameters
- `requirements.txt`: Project dependencies
Execute the test suite:

```bash
python -m unittest test_github_crawler.py
```

For code coverage analysis:

```bash
coverage run -m unittest test_github_crawler.py
coverage report -m
```
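If you add tests, the HTTP layer can be stubbed with the standard library's `unittest.mock` so nothing touches GitHub or the proxies. This is a self-contained sketch of that pattern using a stand-in fetch function, not the project's actual test code:

```python
import unittest
from unittest.mock import patch

import requests


def fetch(url, proxy):
    """Stand-in for the crawler's HTTP call, used only to demonstrate the mocking pattern."""
    return requests.get(url, proxies={"https": f"http://{proxy}"}, timeout=10).text


class MockedHttpTest(unittest.TestCase):
    @patch("requests.get")
    def test_fetch_never_touches_the_network(self, mock_get):
        mock_get.return_value.text = "<html>stub</html>"
        html = fetch("https://github.com/search?q=nova&type=repositories", "127.0.0.1:8080")
        self.assertEqual(html, "<html>stub</html>")
        self.assertIn("proxies", mock_get.call_args.kwargs)


if __name__ == "__main__":
    unittest.main()
```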
- The crawler processes only the first page of search results.
- The provided proxy servers are examples and may not be operational.