Update the docs (#182)
* Update the docs

* No queryplan docs yet

* Update docs/configuration.md

Co-authored-by: Michał Praszmo <[email protected]>

* Update docs/start.md

Co-authored-by: Michał Praszmo <[email protected]>

Co-authored-by: Michał Praszmo <[email protected]>
msm-code and nazywam authored Aug 29, 2022
1 parent 0f37555 commit ce7972f
Showing 8 changed files with 226 additions and 109 deletions.
64 changes: 1 addition & 63 deletions INSTALL.md
@@ -1,65 +1,3 @@
# Installation

## From pre-built package

UrsaDB is distributed in the form of pre-built Debian packages targeting Debian Buster and Ubuntu 18.04. You can get the packages from [GitHub Releases](https://github.com/CERT-Polska/ursadb/releases).

You may use this convenient one-liner to install the latest UrsaDB package along with the required dependencies:
```
curl https://raw.githubusercontent.com/CERT-Polska/ursadb/master/contrib/install_deb.sh | sudo bash
```

## From nixpkgs

```
nix-env -i ursadb
```

## From dockerhub

Change [index_dir] and [samples_dir] to the paths on your filesystem where you want to keep
the index and samples.

```
sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples certpl/ursadb
```

## From dockerfile

```
git clone https://github.com/CERT-Polska/ursadb.git
sudo docker image build -t ursadb .
sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples ursadb
```

## From source

1. Clone the repository:
```
git clone --recurse-submodules https://github.com/CERT-Polska/ursadb.git
```

2. Install necessary dependencies:
```
sudo apt update
sudo apt install -y gcc-7 g++-7 libzmq3-dev cmake build-essential clang-format git
```

3. Build project:
```
mkdir build
cd build
cmake -D CMAKE_C_COMPILER=gcc-7 -D CMAKE_CXX_COMPILER=g++-7 -D CMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
```

4. (Optional) Install binaries to `/usr/local/bin`:
```
sudo make install
```

5. (Optional) Consider registering UrsaDB as a systemd service:
```
cp contrib/systemd/ursadb.service /etc/systemd/system/
systemctl enable ursadb
```
See instructions in the [docs/install.md file](./docs/install.md).
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
# UrsaDB

A 3gram search engine for querying Terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).
A 3gram search engine for querying terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).

Created in [CERT.PL](https://cert.pl). Originally by Jarosław Jedynak ([tailcall.net](https://tailcall.net)), extended and improved by Michał Leszczyński.

24 changes: 20 additions & 4 deletions docs/README.md
@@ -1,6 +1,17 @@
# Ursadb documentation

## User guide
Ursadb is a search engine optimized for working with binary files (for example, malware dumps).

## Basic usage

Read this section for a quick start with Ursadb.

- [Install](./install.md): How to install Ursadb
- [Getting Started](./start.md): Read if you want to learn Ursadb quickly.

## User guide

Read if you want to understand more about Ursadb.

- [Query syntax](./syntax.md) guide: Read to learn about the commands used
to interact with the database.
@@ -9,9 +20,14 @@
- [Datasets](./datasets.md): Introduction to Ursadb's datasets.
- [Indexing in-depth](./indexing.md): Read if you need to index a considerable
number of files.
- [Configuration](./configuration.md): Configuration options exposed by Ursadb,
and how to use them.

## Advanced features

Random notes about Ursadb internals.

- [Performance and limits](./limits.md): Read in case you're not sure if Ursadb
can handle your collection.
- [On-disk format](./ondiskformat.md): Read if you want to understand Ursadb's on
disk format (hint: many files are just JSON and can be read/edited with vim).
- [Configuration](./configuration.md): Configuration options exposed by Ursadb,
and how to use them.
disk format.
59 changes: 28 additions & 31 deletions docs/configuration.md
@@ -1,10 +1,13 @@
# Configuration

Ursadb configuration is a simple set of key-value pairs. They can be read
by issuing a `config get` command with `ursacli`:
Ursadb configuration is a simple set of key-value pairs. The defaults are sane:
you don't need to change them unless you want to tune ursadb for your system.

They can be read by issuing a `config get` command with `ursacli`:

```
$ ursacli -c "config get;"
$ ursacli
ursadb> config get;
```

The response format is:
Expand All @@ -24,19 +27,7 @@ The response format is:
}
```

Values that have been changed from their default value can also be checked
by reading the database directly:

```bash
$ cat ~/tmp/ursadb/db.ursa | jq '.config'
{
"database_workers": 4,
"query_max_ngram": 256
}
```

To change a config value, you can edit the database file when it's turned off
(not recommended), or issue a `config set` command:
To change a config value, you may issue a `config set` command:

```
$ ursacli
@@ -45,9 +36,9 @@ ursadb> config set "database_workers" 10;

## Available configuration keys

- [database_workers](#database_workers) - Number of independent worker threads
- [query_max_edge](#query_max_edge) - Maximum query size (edge)
- [query_max_ngram](#query_max_ngram) - Maximum query size (ngram)
- [database_workers](#database_workers) - Number of independent worker threads.
- [query_max_edge](#query_max_edge) - Maximum query size (edge).
- [query_max_ngram](#query_max_ngram) - Maximum query size (ngram).
- [merge_max_datasets](#merge_max_datasets) - Maximum number of datasets involved
in a single merge.
- [merge_max_files](#merge_max_files) - Maximum number of files in a dataset
@@ -61,9 +52,10 @@ ursadb> config set "database_workers" 10;

Maximum number of values a first or last character in a sequence can take
to be considered when planning a query. The default is a conservative 1,
so query plam will never start or end with a wildcard.
Recommendation: Stick to the default value. If you have a good disk and
want to reduce false-positives, increase to 16.
so query plan will never start or end with a wildcard.

**Recommendation**: Stick to the default value. If you have a good disk and
want to reduce false-positives, increase (but no more than 16).

### query_max_ngram

@@ -75,7 +67,8 @@ want to reduce false-positives, increase to 16.
Maximum number of values an ngram can take to be considered when planning
a query. For example, with a default value of 16, trigram `11 2? 33` will
be expanded and included in the query, but `11 ?? 33` will be ignored.
Recommendation: Stick to the default value at first. If your queries are

**Recommendation**: Stick to the default value at first. If your queries are
fast, use many wildcards, but have many false positives, increase to 256.
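The expansion count behind this limit can be sketched in a few lines of Python. This is purely an illustration of the counting rule described above, not Ursadb's actual query planner:

```python
def ngram_expansions(ngram: str) -> int:
    """Count how many concrete values a wildcarded hex ngram can take.

    Every fixed hex digit contributes a factor of 1, and every '?'
    wildcard a factor of 16 (illustration only, not Ursadb code).
    """
    count = 1
    for ch in ngram.replace(" ", ""):
        count *= 16 if ch == "?" else 1
    return count

# With the default query_max_ngram of 16, `11 2? 33` (16 expansions)
# makes it into the query plan, but `11 ?? 33` (256 expansions) is ignored.
print(ngram_expansions("11 2? 33"))  # 16
print(ngram_expansions("11 ?? 33"))  # 256
```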


@@ -88,8 +81,10 @@ fast, use many wildcards, but have many false positives, increase to 256.
How many tasks can be processed at once? The default 4 is a very
conservative value for most workloads. Increasing it will make the
database faster, but at a certain point the disk becomes a bottleneck.
Recommendation: If your server is dedicated to ursadb, or your IO latency
is high (for example, files are stored on NFS), increase to at least 16.
This will also linearly increase memory usage in the worst case.

**Recommendation**: If your server is dedicated to ursadb, or your IO latency
is high (for example, files are stored on NFS), increase to 8 or more.

### merge_max_datasets

@@ -98,12 +93,13 @@ is high (for example, files are stored on NFS), increase to at least 16.
- **Maximum**: 1024

How many datasets can be merged at once? This has severe memory usage
implications - for merging datasets must be fully loaded, and every
implications - before merging, datasets must be fully loaded, and every
loaded dataset consumes a bit over 128MiB. Increasing this number makes
compacting huge datasets faster, but may run out of RAM.
Recommendation: merge_max_datasets * 128MiB can safely be set to around

**Recommendation**: merge_max_datasets * 128MiB can safely be set to around
1/4 of RAM dedicated to the database, so for example 8 for 4GiB server
or 32 for 16GiB server. Increasing past 10 gives diminishing returns, so
or 32 for 16GiB server. Increasing past 10 has diminishing returns, so
unless you have a lot of free RAM you can leave it at default.
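The sizing rule above (merge_max_datasets × 128MiB at roughly 1/4 of RAM) is simple enough to write out as a back-of-the-envelope calculation; the helper name below is made up for illustration:

```python
def suggested_merge_max_datasets(ram_gib: float) -> int:
    """Back-of-the-envelope sizing from the docs' rule of thumb:
    merge_max_datasets * 128MiB should stay around 1/4 of the RAM
    dedicated to the database. (Hypothetical helper, not Ursadb code.)
    """
    budget_mib = ram_gib * 1024 / 4  # a quarter of RAM, in MiB
    return int(budget_mib // 128)    # 128MiB per loaded dataset

print(suggested_merge_max_datasets(4))   # 8, matching the 4GiB example
print(suggested_merge_max_datasets(16))  # 32, matching the 16GiB example
```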

### merge_max_files
@@ -115,6 +111,7 @@ unless you have a lot of free RAM you can leave it at default.
When merging, what is the maximal allowed number of files in the
resulting dataset? Large datasets make the database faster, but also need
more memory to run efficiently.
Recommendation: ursadb was used with multi-million datasets in the wild,
but currently we recommend to stay on the safe side and don't create
datasets larger than 1 million files.

**Recommendation**: ursadb was used with multi-million datasets in the wild,
but currently we recommend keeping datasets smaller than 1 million files.
65 changes: 65 additions & 0 deletions docs/install.md
@@ -0,0 +1,65 @@
# Installation

## From dockerhub

Change [index_dir] and [samples_dir] to the paths on your filesystem where you want to keep
the index and samples.

```
sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples certpl/ursadb
```

## From dockerfile

```
git clone https://github.com/CERT-Polska/ursadb.git
sudo docker image build -t ursadb .
sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples ursadb
```

## From source

1. Clone the repository:
```
git clone --recurse-submodules https://github.com/CERT-Polska/ursadb.git
```

2. Install necessary dependencies:
```
sudo apt update
sudo apt install -y gcc-7 g++-7 libzmq3-dev cmake build-essential clang-format git
```

3. Build project:
```
mkdir build
cd build
cmake -D CMAKE_C_COMPILER=gcc-7 -D CMAKE_CXX_COMPILER=g++-7 -D CMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
```

4. (Optional) Install binaries to `/usr/local/bin`:
```
sudo make install
```

5. (Optional) Consider registering UrsaDB as a systemd service:
```
cp contrib/systemd/ursadb.service /etc/systemd/system/
systemctl enable ursadb
```

## From nixpkgs (may be outdated)

```
nix-env -i ursadb
```

## From .deb package (may be outdated)

UrsaDB is distributed in the form of pre-built Debian packages targeting Debian Buster and Ubuntu 18.04. You can get the packages from [GitHub Releases](https://github.com/CERT-Polska/ursadb/releases).

You may use this convenient one-liner to install the latest UrsaDB package along with the required dependencies:
```
curl https://raw.githubusercontent.com/CERT-Polska/ursadb/master/contrib/install_deb.sh | sudo bash
```
45 changes: 37 additions & 8 deletions docs/ondiskformat.md
@@ -18,7 +18,7 @@ it means that it's not used and can be safely removed (when the database is turn

Example:

```
```json
{
"config": {
"database_workers": 10
@@ -40,7 +40,7 @@ Most importantly, it contains references to all indexes in this dataset,
and a list of filenames tracked by this dataset.

Example:
```
```json
{
"filename_cache": "namecache.files.set.507718ac.db.ursa",
"files": "files.set.507718ac.db.ursa",
@@ -86,7 +86,7 @@ For example:
Finally, the last `(2**24 + 1) * 8` bytes of an index consist of an array of uint64_t
values, where sequence N starts at offset `array[N]` in the file, and ends at `array[N+1]`.
An index can be parsed with the following Python code. Warning this is just a demonstration,
An index can be parsed with the following Python code. Warning: this is just a demonstration,
and is way too slow for anything but very small indexes.
```python
@@ -127,18 +127,33 @@ def parse(fpath):
continue
run = decompress(fdata[offsets[i]:offsets[i+1]])
# trigram [i] contains files with ids [run]
print(f"{i:06x}: {run}")
if __name__ == '__main__':
parse(sys.argv[1])
```

## Names
## Files

Newline-separated list of filenames in the database.

Newline-separated list of filenames in the database. This file can be safely
edited or changed with any editor, for example when moving the collection to a
different folder. It's only important to:
```
$ head -n 10 files.set.35dcff87.db.ursa
/mnt/samples/001
/mnt/samples/002
/mnt/samples/003
/mnt/samples/004
/mnt/samples/005
/mnt/samples/006
/mnt/samples/007
/mnt/samples/008
/mnt/samples/009
/mnt/samples/010
```

Right now, the only way to change the base directory of files is to edit this file directly. It can be safely edited or changed with any editor. It's only important to:

- ensure the database is turned off
- remove the namecache file later
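Rewriting the path prefix in the file list could be scripted along these lines. The helper is hypothetical (not part of Ursadb); per the caveats above, run it only while the database is turned off, then delete the namecache file so it gets regenerated:

```python
from pathlib import Path

def rebase_file_list(list_path: str, old_base: str, new_base: str) -> None:
    """Rewrite the base directory in a dataset's newline-separated file list.

    Hypothetical maintenance helper: every filename starting with old_base
    gets that prefix replaced by new_base; other lines are left untouched.
    """
    path = Path(list_path)
    lines = path.read_text().splitlines()
    rebased = [
        new_base + line[len(old_base):] if line.startswith(old_base) else line
        for line in lines
    ]
    path.write_text("\n".join(rebased) + "\n")
```

For example, `rebase_file_list("files.set.35dcff87.db.ursa", "/mnt/samples", "/data/samples")` would move the collection shown above to `/data/samples`.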
@@ -147,10 +162,24 @@ different folder. It's only important to:

Contains an array of `uint64_t` offsets in the `names` file.
This is used to map file IDs to names for queries, without loading all the file
names into memory.
names into memory (and to speed up database startup).
If this file doesn't exist or was removed, it'll be regenerated when the database
starts.

```
$ xxd namecache.files.set.d2c6638f.db.ursa | head -n 10
00000000: 0000 0000 0000 0000 1f00 0000 0000 0000 ................
00000010: 3b00 0000 0000 0000 5d00 0000 0000 0000 ;.......].......
00000020: 8000 0000 0000 0000 9b00 0000 0000 0000 ................
00000030: b200 0000 0000 0000 d700 0000 0000 0000 ................
00000040: fb00 0000 0000 0000 2601 0000 0000 0000 ........&.......
00000050: 4c01 0000 0000 0000 6b01 0000 0000 0000 L.......k.......
00000060: 8d01 0000 0000 0000 c601 0000 0000 0000 ................
00000070: fd01 0000 0000 0000 3702 0000 0000 0000 ........7.......
00000080: 5402 0000 0000 0000 7f02 0000 0000 0000 T...............
00000090: a702 0000 0000 0000 d702 0000 0000 0000 ................
```
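Based on the description above, the namecache can be sketched as a flat array of little-endian `uint64_t` offsets into the names file, one per name. The helpers below are illustrative only, not Ursadb's code, and the real file layout may differ in details (such as a trailing end-of-blob offset):

```python
import struct

def build_namecache(names_blob: bytes) -> bytes:
    """Build an offset cache for a newline-separated names blob.

    Emits one little-endian uint64 per name: the byte offset where that
    name starts in the blob. (Sketch of the format described above.)
    """
    offsets = [0]
    for i, b in enumerate(names_blob):
        if b == ord(b"\n") and i + 1 < len(names_blob):
            offsets.append(i + 1)  # next name starts right after '\n'
    return b"".join(struct.pack("<Q", off) for off in offsets)

def name_at(names_blob: bytes, cache: bytes, file_id: int) -> bytes:
    """Resolve a file ID to its name using the offset cache."""
    (off,) = struct.unpack_from("<Q", cache, file_id * 8)
    end = names_blob.index(b"\n", off)
    return names_blob[off:end]
```

This mirrors why the cache exists: resolving `name_at(blob, cache, N)` needs only one 8-byte read from the cache plus one slice of the names file, instead of scanning all names.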

## Itermeta

Contains information about the current position of a given iterator. For example: