WebCrawler

WebCrawler is a simple Java-based framework that scans websites concurrently and stores the indexed data in persistent storage.

Building and running

Prerequisites

Version numbers below indicate the versions used.

Maven 3.3.9 (http://maven.apache.org)
Java 1.8.0_73 (http://java.oracle.com)

Building Steps

git clone https://github.com/avoloshko/WebCrawler
cd WebCrawler
mvn clean package

The most important results of the build are:

web/target/web-<version>.jar - A web application
cli/target/cli-<version>.jar - A console application

Running

java -jar cli/target/cli-<version>.jar crawl cli/dist/configuration.yml --href https://google.com
java -jar web/target/web-<version>.jar server web/dist/configuration.yml
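
The <version> part of each jar name depends on the build. As a minimal sketch, assuming a POSIX shell and that commands run from the repository root, a small helper can resolve the actual file name. The name find_jar is hypothetical and not part of the project:

```shell
# find_jar MODULE
# Prints the first jar matching MODULE/target/MODULE-*.jar.
# Returns a non-zero status if no such jar exists (e.g. before "mvn clean package").
find_jar() {
  module="$1"
  for jar in "$module"/target/"$module"-*.jar; do
    # If the glob matched nothing, $jar is the literal pattern and -e fails.
    if [ -e "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  return 1
}
```

For example: java -jar "$(find_jar cli)" crawl cli/dist/configuration.yml --href https://google.com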

By default the web application uses port 8080.

Working files

The database is saved in the webcrawler.db directory if LevelDB is configured for storage. (The in-memory data store is never persisted.) For a clean start, remove the entire webcrawler.db directory.
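
A clean start could look like the sketch below. It assumes a POSIX shell, that the crawler is stopped, and that webcrawler.db was created in the current working directory. The function name clean_db is hypothetical and not part of the project:

```shell
# clean_db [DIR]
# Removes the LevelDB data directory (default: webcrawler.db) if it exists.
clean_db() {
  db_dir="${1:-webcrawler.db}"
  if [ -d "$db_dir" ]; then
    rm -rf -- "$db_dir"
    echo "removed $db_dir"
  else
    echo "nothing to remove"
  fi
}

clean_db webcrawler.db
```

The next run with LevelDB storage then starts from an empty database.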

