
GitHub Crawler

Overview

GitHub Crawler is a Python tool that performs keyword searches on GitHub and collects the matching results. It supports three search types (repositories, issues, and wiki pages) and routes its HTTP requests through user-supplied proxy servers.

Features

  • Executes GitHub searches based on user-defined keywords
  • Supports three search types: repositories, issues, and wikis
  • Routes HTTP requests through user-supplied proxy servers (see the sketch after this list)
  • Retrieves additional metadata for repository results (owner, language statistics)
  • Outputs search results in JSON format
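
For illustration, the per-request flow might look like the sketch below. This is a minimal sketch, not the project's actual code: it assumes the requests library is used, and the function and parameter names are made up for the example.

    import random
    import requests

    def fetch_search_page(keywords, search_type, proxies):
        """Fetch one GitHub search results page through a random proxy.

        Illustrative only: the real crawler's function and parameter
        names may differ.
        """
        query = "+".join(keywords)
        url = f"https://github.com/search?q={query}&type={search_type}"
        # Pick one proxy per request, since GitHub may rate-limit a single address.
        proxy = random.choice(proxies)
        proxy_map = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        response = requests.get(url, proxies=proxy_map, timeout=10)
        response.raise_for_status()
        return response.text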

Prerequisites

  • Python 3.9+
  • Required libraries as specified in requirements.txt

Installation

  1. Clone the repository:

    git clone https://github.com/ivayanc/github-crawler.git
    cd github-crawler
    
  2. Set up a virtual environment:

    python3 -m venv venv
    
  3. Activate the virtual environment:

    • On Windows:
      venv\Scripts\activate
      
    • On macOS and Linux:
      source venv/bin/activate
      
  4. Install dependencies:

    pip install -r requirements.txt
    

Usage

  1. Configure the search parameters in input_data.json:

    {
      "keywords": ["openstack", "nova", "css"],
      "proxies": ["88.216.34.140:50100", "89.117.250.68:50100"],
      "type": "Repositories"
    }

    Input Data Specifications:

    • keywords: An array of strings representing search terms
    • proxies: An array of strings in IP:PORT format
    • type: A string specifying the search type. Valid options are:
      • "Repositories": For repository searches
      • "Issues": For issue searches
      • "Wikis": For wiki page searches
  2. Execute the script:

    python main.py
    
  3. Retrieve results from crawler_result.json
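
The exact output schema is set by the implementation, so inspect your own crawler_result.json to confirm it. Based on the features above (result URLs, plus owner and language statistics for repository searches), reading the file might look like this sketch; the field names are illustrative:

    import json

    # Field names below are assumptions; check a real crawler_result.json
    # from your own run to confirm the actual schema.
    with open("crawler_result.json", encoding="utf-8") as f:
        results = json.load(f)

    for item in results:
        print(item.get("url"))
        extra = item.get("extra", {})  # repository searches include extra metadata
        if extra:
            print("  owner:", extra.get("owner"))
            print("  languages:", extra.get("language_stats"))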

Project Structure

  • main.py: Entry point for the application
  • github_crawler.py: Core logic implementation
  • test_github_crawler.py: Unit test suite
  • input_data.json: Configuration file for search parameters
  • requirements.txt: Project dependencies

Testing

Execute the test suite:

python -m unittest test_github_crawler.py

For code coverage analysis:

coverage run -m unittest test_github_crawler.py
coverage report -m
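
When extending the suite, it is common to stub out the HTTP layer so tests run without network access or working proxies. Below is a minimal sketch using unittest.mock; the commented-out GitHubCrawler import is a hypothetical name, not the project's confirmed API:

    import unittest
    from unittest import mock


    class FakeResponse:
        """Stands in for requests.Response in tests."""
        status_code = 200
        text = '<a href="/openstack/nova">openstack/nova</a>'

        def raise_for_status(self):
            pass


    class TestCrawlerOffline(unittest.TestCase):
        @mock.patch("requests.get", return_value=FakeResponse())
        def test_no_network_needed(self, mock_get):
            # In a real test you would call into the crawler here, e.g.:
            #   from github_crawler import GitHubCrawler  # hypothetical name
            #   results = GitHubCrawler(...).run()
            # This sketch only shows that the patch intercepts requests.get.
            import requests
            response = requests.get("https://github.com/search?q=nova")
            self.assertEqual(response.status_code, 200)
            mock_get.assert_called_once()


    if __name__ == "__main__":
        unittest.main()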

Limitations

  • The crawler processes only the first page of search results

Notes

  • The proxy servers shown in the example input_data.json are placeholders and may not be operational.
