Remove legacy code. (#178)
Fix #155.
bfabio authored Sep 11, 2020
1 parent edd01ac commit e4b9a1e
Showing 35 changed files with 69 additions and 1,767 deletions.
4 changes: 0 additions & 4 deletions .dockerignore
@@ -1,7 +1,3 @@
 data
-elasticsearch
-elasticsearch-searchguard
-kibana
-prometheus
 vendor
 .cache
11 changes: 1 addition & 10 deletions .env.example
@@ -1,12 +1,3 @@
-# Crawler
-ELASTIC_URL=http://devita_elasticsearch:9200
+ELASTIC_URL=https://elasticsearch.developers.italia.it
 #ELASTIC_USER=elastic
 #ELASTIC_PWD=changeme
-
-# Elasticsearch
-ES_JAVA_OPTS='-Xms256m -Xmx1g'
-
-# Kibana
-ELASTICSEARCH_PROTOCOL=http
-ELASTICSEARCH_HOST=devita_elasticsearch
-ELASTICSEARCH_PORT=9200
12 changes: 0 additions & 12 deletions Makefile

This file was deleted.

192 changes: 68 additions & 124 deletions README.md
# Crawler for the OSS catalog of Developers Italia

[![CircleCI](https://circleci.com/gh/italia/developers-italia-backend/tree/master.svg?style=shield)](https://circleci.com/gh/italia/developers-italia-backend/tree/master)
[![Go Report Card](https://goreportcard.com/badge/github.com/italia/developers-italia-backend)](https://goreportcard.com/report/github.com/italia/developers-italia-backend)
[![Join the #website channel](https://img.shields.io/badge/Slack%20channel-%23website-blue.svg?logo=slack)](https://developersitalia.slack.com/messages/C9R26QMT6)
[![Get invited](https://slack.developers.italia.it/badge.svg)](https://slack.developers.italia.it/)

## How it works

The crawler finds and retrieves the **`publiccode.yml`** files from the
organizations in the whitelist.

It then creates YAML files used by the
[Jekyll build chain](https://github.com/italia/developers.italia.it)
to generate the static pages of [developers.italia.it](https://developers.italia.it/).

[Elasticsearch 6.8](https://www.elastic.co/products/elasticsearch) is used to
store the data; it must be active and ready to accept connections before the
crawler is started.
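
A quick way to check that Elasticsearch is up before starting a crawl (a
sketch; the URL and credentials depend on your configuration — `ELASTIC_URL`
comes from [.env.example](.env.example)):

```shell
# Query the cluster health endpoint; expect "status": "green" or "yellow".
curl -s "${ELASTIC_URL:-http://localhost:9200}/_cluster/health?pretty"
```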

## Setup and deployment processes

The crawler can either run manually on the target machine, or it can be
deployed as a Docker container with
[its helm-chart](https://github.com/teamdigitale/devita-infra-kubernetes) in
Kubernetes.
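
For instance, a cluster deployment might look like this (a sketch only: the
chart path, release name, and values file are hypothetical — see the
[helm-chart repository](https://github.com/teamdigitale/devita-infra-kubernetes)
for the real chart):

```shell
# Hypothetical Helm 3 install from a local checkout of the chart repository;
# adjust the chart path and values file to match the actual repo layout.
helm install crawler ./devita-infra-kubernetes -f my-values.yaml
```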

### Manually configure and build the crawler

1. `cd crawler`

2. Save the auth tokens to `domains.yml`.

3. Rename `config.toml.example` to `config.toml` and set the variables.

   > **NOTE**: The application also supports environment variables in place of
   > the `config.toml` file. Environment variables take priority over the
   > values in the configuration file (see the example after this list).

4. Build the crawler binary with `make`.
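
For example, a value can be overridden at launch without editing `config.toml`
(a sketch — `ELASTIC_URL` is one of the variables listed in
[.env.example](.env.example); whether a given key can be overridden this way
is an assumption):

```shell
# Environment variables take priority over the configuration file.
ELASTIC_URL="http://localhost:9200" bin/crawler crawl whitelist/*.yml
```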

### Docker

The repository has a `Dockerfile`, used to build the production image,
and a `docker-compose.yml` file to facilitate local deployment.

Before proceeding with the build, copy [`.env.example`](.env.example)
into `.env` and edit the environment variables as needed.
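
A minimal `.env` for a local run might look like this (illustrative values;
the variable names come from [.env.example](.env.example)):

```shell
# Elasticsearch instance the crawler writes to.
ELASTIC_URL=http://localhost:9200
#ELASTIC_USER=elastic
#ELASTIC_PWD=changeme
```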

To build the crawler container run:

```shell
docker-compose up [-d] [--build]
```

where:

* *-d* runs the containers in the background (detached mode)

* *--build* forces the containers build

To destroy the container, use:

```shell
docker-compose down
```
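
Once the environment is up, the crawler container's logs can be followed with
(`devita_crawler` is the service name defined in
[docker-compose.yml](docker-compose.yml)):

```shell
# Stream logs from the crawler service.
docker-compose logs -f devita_crawler
```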

## Run the crawler

* Crawl mode (all the items in the whitelists): `bin/crawler crawl whitelist/*.yml`
  * `crawl` supports blacklists (see below for details). The crawler will try
    to match each repository URL in its list with the ones listed in the
    blacklists and, on a match, it will print a warning log and skip every
    operation on that repository. Furthermore, it will immediately remove the
    blacklisted repository from Elasticsearch, if present.

* One mode (single repository URL): `bin/crawler one [repo url] whitelist/*.yml`
  * In this mode a single repository at a time is evaluated. If the
    organization is present, its IPA code will be matched with the ones in the
    whitelist; otherwise it will be set to null and the `slug` will end with a
    random code (instead of the IPA code). Furthermore, the IPA code
    validation, which is a simple check within the whitelists (to ensure that
    the code belongs to the selected PA), will be skipped.
  * `one` supports blacklists (see below for details): if `[repo url]` is
    present in one of the indicated blacklists, the crawler exits immediately.
    This ignores every repository defined in those lists, preventing
    unauthorized loading into the catalog.

* `bin/crawler updateipa` downloads IPA data and writes them into Elasticsearch

* `bin/crawler delete [URL]` deletes software from Elasticsearch using its code
  hosting URL specified in `publiccode.url`

* `bin/crawler download-whitelist` downloads organizations and repositories
  from the [onboarding portal repository](https://github.com/italia/developers-italia-onboarding)
  and saves them to a whitelist file (a sketch of a whitelist file follows
  this list)
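
A whitelist file is a YAML list of organizations to crawl. As a purely
hypothetical illustration (the field names below are assumptions — the
authoritative schema is whatever `bin/crawler download-whitelist` produces):

```yaml
# Hypothetical whitelist entry; every key shown here is illustrative.
- id: "c_h501"                 # IPA code of the public administration
  pa: true                     # whether the organization is a PA
  organizations:
    - "https://github.com/example-pa-org"
```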

### Crawler blacklists

Blacklists are needed to exclude individual repositories that are not in line
with our
[guidelines](https://docs.italia.it/italia/developers-italia/policy-inserimento-catalogo-docs/it/stabile/approvazione-del-software-a-catalogo.html).

You can set `BLACKLIST_FOLDER` in `config.toml` to point to a directory
where the blacklist files are located.
Blacklisting is currently supported by the `one` and `crawl` commands.
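
As with whitelists, the exact schema is not documented here; a hypothetical
blacklist file inside that directory might look like:

```yaml
# Hypothetical blacklist entry — field names are illustrative. Repositories
# listed here are skipped by `crawl` and make `one` exit immediately.
- url: "https://github.com/example-org/excluded-repo"
  reason: "not compliant with the catalog guidelines"
```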

## See also

* [publiccode-parser-go](https://github.com/italia/publiccode-parser-go): the Go
package for parsing publiccode.yml files

* [developers-italia-onboarding](https://github.com/italia/developers-italia-onboarding):
the onboarding portal

## Authors

[Developers Italia](https://developers.italia.it) is a project by
[AgID](https://www.agid.gov.it/) and the
[Italian Digital Team](https://teamdigitale.governo.it/), which developed the
crawler and maintains this repository.
63 changes: 0 additions & 63 deletions docker-compose-es-searchguard.yml

This file was deleted.

30 changes: 0 additions & 30 deletions docker-compose.yml
@@ -2,7 +2,6 @@ version: '3.3'

 services:

-  # Crawler
   devita_crawler:
     container_name: devita_crawler
     image: italia/developers-italia-backend
@@ -11,35 +10,6 @@
       dockerfile: Dockerfile
     env_file:
       - .env
-    depends_on:
-      - devita_elasticsearch
     networks:
       - overlay
-
-  # Elasticsearch
-  devita_elasticsearch:
-    image: docker.elastic.co/elasticsearch/elasticsearch:6.8.7
-    container_name: devita_elasticsearch
-    env_file:
-      - .env
-    volumes:
-      - ./elasticsearch:/usr/share/elasticsearch/config
-    networks:
-      - overlay
-    ports:
-      - 9200:9200
-      - 9300:9300
-
-  # Kibana
-  devita_kibana:
-    container_name: "devita_kibana"
-    image: docker.elastic.co/kibana/kibana:6.8.7
-    env_file:
-      - .env
-    depends_on:
-      - devita_elasticsearch
-    ports:
-      - "5601:5601"
-    networks:
-      - overlay