update documentation #7
goseind committed Jun 15, 2022
1 parent 4b0907c commit 6c6d688
Showing 6 changed files with 368 additions and 89 deletions.
88 changes: 18 additions & 70 deletions README.md
@@ -6,18 +6,27 @@

[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/Miracle-Fruit/distributed-nosqldb)

## Ideation
## Documentation

* Neo4j clustering is only available in the enterprise edition (30-day trial available). Documentation for Docker Compose with the enterprise edition: https://neo4j.com/docs/operations-manual/current/docker/clustering/
* Alternatively, we can use ONgDB: https://www.graphfoundation.org/ongdb/ (a fork of the old Neo4j enterprise edition) or Cassandra?
Find the detailed lessons learned [here](lessons-learned.md).

## Documentation
## Architecture

### Architecture
### Infrastructure

![](architecture-infra.png)

### Makefile
### Database

![](architecture-cass.png)

## Setup

## Gitpod Setup

Gitpod starts by executing `make cass` and then opens two browser windows for Cassandra Web and the frontend website. For development reasons, this is currently disabled and must be started manually.

### Makefile

The Makefile allows running different setups:

@@ -36,71 +45,10 @@ make cass

The frontend is built with [Create React App](https://github.com/facebook/create-react-app).

### Cassandra Cluster
### Cassandra Cluster Details

Cassandra Cluster with three nodes can be accessed via web interface at http://localhost:3000/

*Note: health checks are not working correctly at the moment; it may be necessary to reboot containers manually!*
The Cassandra cluster with three nodes can be accessed via the web interface at http://localhost:3000/

**Useful commands:**
* `docker exec cass1 nodetool status` (status check) --> UN = Up and Normal
* `docker exec -it cass1 cqlsh` (open cqlsh)

![](cassandra-web.png)

```bash
docker cp data/tweets.csv cass1:/tweets.csv
cqlsh
CREATE TABLE twitter.tweetsss(author text, content text, country text, date_time text, id bigint PRIMARY KEY, language text, latitude text, longitude text, number_of_likes text, number_of_shares text);
COPY twitter.tweetsss (author,content,country,date_time,id,language,latitude,longitude,number_of_likes,number_of_shares) FROM 'tweets.csv' WITH DELIMITER=',' AND HEADER=TRUE;
```

The ID cannot be imported as a number; the following error occurs:

```bash
Failed to import 1 rows: ParseError - Failed to parse 5.34896E+17 : invalid literal for int() with base 10: '5.34896E+17', given up without retries
'builtin_function_or_method' object has no attribute 'error'
```
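The message is Python's own `int()` error: the CSV stores the tweet ID in scientific notation (`5.34896E+17`), which cannot be parsed as an integer literal. One possible workaround, sketched here under the assumption that the CSV can be pre-processed before `COPY`, is to rewrite such values as plain integers. `parse_id` is a hypothetical helper, not part of cqlsh. The digits already lost to the notation cannot be recovered, so distinct tweets may collapse to the same ID.

```python
from decimal import Decimal, InvalidOperation


def parse_id(raw: str) -> int:
    """Convert an ID such as '5.34896E+17' to an exact integer."""
    try:
        value = Decimal(raw.strip())
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {raw!r}") from exc
    # Reject NaN/Infinity and values with a fractional part.
    if not value.is_finite() or value != value.to_integral_value():
        raise ValueError(f"not an integer: {raw!r}")
    return int(value)


print(parse_id("5.34896E+17"))  # 534896000000000000
```

`Decimal` is used instead of `float` so that the conversion itself adds no rounding error on top of the precision already lost in the export.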

#### Optimizing Cassandra Performance

* Splitting the preexisting tables into the following structure:

![Cassandra Infra](architecture-cass.png)

* Optimize for read over write: `[..] WITH compaction = {'class' : 'LeveledCompactionStrategy'};`
* Rule for choosing consistency levels relative to the replication factor (reads then always overlap with the latest write): [read-consistency-level] + [write-consistency-level] > [replication-factor]
* Pre-sort data: `CLUSTERING ORDER BY (number_of_likes ASC);`
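
The consistency rule in the list above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from this repository; the function names are hypothetical, and consistency levels are expressed as the number of replicas that must respond. It uses the three-node cluster size from this setup as the replication factor, which is an assumption.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas in a QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True if every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # assumed replication factor for the three-node cluster
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True: 2 + 2 > 3
print(reads_see_latest_write(1, 1, rf))                    # False: 1 + 1 <= 3
```

With a replication factor of 3, reading and writing at QUORUM satisfies the rule, while reading and writing at ONE does not.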

## Problems & Lessons Learned

* Neo4j community does not support clustering
* Neo4j enterprise is complex to set up and we were not able to make it run
* The Docker Compose `service_healthy` startup check for the Cassandra cluster sometimes fails, so repeatedly restarting is the best option until all nodes, including the web interface, are up and running
* Container IP addresses need to be set statically in order for cassandra-web to find them
* Importing the tweets is challenging because a suitable data type for the primary key has to be found
* cassandra-web requires an older Ruby version; versions >3 seem to cause problems
* cassandra configuration yaml file and volume mapping
* Execution of startup script to run cql commands is not working: `Connection error: ('Unable to connect to any servers', {'172.20.0.6:9042': ConnectionRefusedError(111, "Tried connecting to [('172.20.0.6', 9042)]. Last error: Connection refused")})`
* cqlsh> CREATE MATERIALIZED VIEW twitter.user_14237490 AS select * from twitter.user where user_id=14237490 PRIMARY KEY (user_id); Warnings : Materialized views are experimental and are not recommended for production use.

## Social Media Queries

1. List the posts made by an account or attributed to it

`SELECT * FROM twitter.tweets WHERE author='katyperry' ALLOW FILTERING;`

![](example_query_1.png)

2. Find the 100 accounts with the most followers
3. Find the 100 accounts that follow the most of the accounts found in 1)
4. List the information for the personal home page of any account (ideally tried out with the accounts found in 2); the home page should contain the following (implemented as separate queries):
* the number of followers
* the number of followed accounts
* either the 25 newest or the 25 most popular posts of the followed accounts (via DB query)
5. Cache the posts for the home page (cf. 4); this requires a so-called fan-out into the cache of every follower when a new post is written
6. List the 25 most popular posts that contain a given word (if possible, also with an AND combination of several words)
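
The fan-out on write described in query 5 can be illustrated with a small in-memory sketch. This is not the repository's implementation: plain Python dictionaries stand in for the Cassandra tables, all names (`follow`, `write_post`, `home_feed`, the follower `alice`) are hypothetical, and the cache size of 25 is taken from query 4.

```python
from collections import defaultdict, deque

HOME_FEED_SIZE = 25  # query 4 asks for the 25 newest posts

# author -> set of accounts following that author
followers = defaultdict(set)
# account -> bounded cache of (author, content), newest first
home_cache = defaultdict(lambda: deque(maxlen=HOME_FEED_SIZE))


def follow(follower: str, author: str) -> None:
    followers[author].add(follower)


def write_post(author: str, content: str) -> None:
    """Fan-out on write: copy the new post into every follower's cache."""
    for follower in followers[author]:
        # appendleft on a full deque drops the oldest entry at the right end.
        home_cache[follower].appendleft((author, content))


def home_feed(account: str) -> list:
    """Reading the home page is a single cache lookup, no join needed."""
    return list(home_cache[account])


follow("alice", "katyperry")
write_post("katyperry", "first post")
write_post("katyperry", "second post")
print(home_feed("alice"))  # newest post first
```

The trade-off is that each write costs one cache update per follower, in exchange for home-page reads that touch only a single account's cache.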

## Gitpod Setup

Gitpod starts by executing `make cass` and then opens two browser windows for Cassandra Web and the frontend website. For development reasons, this is currently disabled and must be started manually.
* `docker exec -it cass1 cqlsh` (open cqlsh)