Update the docs (#182)
* Update the docs

* No queryplan docs yet

* Update docs/configuration.md

Co-authored-by: Michał Praszmo <[email protected]>

* Update docs/start.md

Co-authored-by: Michał Praszmo <[email protected]>

Co-authored-by: Michał Praszmo <[email protected]>
msm-code and nazywam authored Aug 29, 2022
1 parent 0f37555 commit ce7972f
Showing 8 changed files with 226 additions and 109 deletions.
64 changes: 1 addition & 63 deletions INSTALL.md
@@ -1,65 +1,3 @@
# Installation

## From pre-built package

UrsaDB is distributed in the form of pre-built Debian packages targeting Debian Buster and Ubuntu 18.04. You can get the packages from [GitHub Releases](https://github.com/CERT-Polska/ursadb/releases).

You may use this convenient one-liner to install the latest UrsaDB package along with the required dependencies:
```
curl https://raw.githubusercontent.com/CERT-Polska/ursadb/master/contrib/install_deb.sh | sudo bash
```

## From nixpkgs

```
nix-env -i ursadb
```

## From dockerhub

Change [index_dir] and [samples_dir] to the paths on your filesystem where you want to keep
the index and samples.

```
sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples certpl/ursadb
```

## From dockerfile

```
git clone https://github.com/CERT-Polska/ursadb.git
sudo docker image build -t ursadb .
sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples ursadb
```

## From source

1. Clone the repository:
```
git clone --recurse-submodules https://github.com/CERT-Polska/ursadb.git
```

2. Install necessary dependencies:
```
sudo apt update
sudo apt install -y gcc-7 g++-7 libzmq3-dev cmake build-essential clang-format git
```

3. Build project:
```
mkdir build
cd build
cmake -D CMAKE_C_COMPILER=gcc-7 -D CMAKE_CXX_COMPILER=g++-7 -D CMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
```

4. (Optional) Install binaries to `/usr/local/bin`:
```
sudo make install
```

5. (Optional) Consider registering UrsaDB as a systemd service:
```
cp contrib/systemd/ursadb.service /etc/systemd/system/
systemctl enable ursadb
```
See instructions in the [docs/install.md file](./docs/install.md).
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
# UrsaDB

A 3gram search engine for querying Terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).
A 3gram search engine for querying terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).

Created in [CERT.PL](https://cert.pl). Originally by Jarosław Jedynak ([tailcall.net](https://tailcall.net)), extended and improved by Michał Leszczyński.

24 changes: 20 additions & 4 deletions docs/README.md
@@ -1,6 +1,17 @@
# Ursadb documentation

## User guide
Ursadb is a search engine optimized for working with binary files (for example, malware dumps).

## Basic usage

Read this section for a quick start with Ursadb.

- [Install](./install.md): How to install Ursadb
- [Getting Started](./start.md): Read if you want to learn Ursadb quickly.

## User guide

Read if you want to understand more about Ursadb.

- [Query syntax](./syntax.md) guide: Read to learn about the commands used
to interact with the database.
@@ -9,9 +20,14 @@
- [Datasets](./datasets.md): Introduction to Ursadb's datasets.
- [Indexing in-depth](./indexing.md): Read if you need to index a considerable
number of files.
- [Configuration](./configuration.md): Configuration options exposed by Ursadb,
and how to use them.

## Advanced features

Random notes about Ursadb internals.

- [Performance and limits](./limits.md): Read in case you're not sure if Ursadb
can handle your collection.
- [On-disk format](./ondiskformat.md): Read if you want to understand Ursadb's on
disk format (hint: many files are just JSON and can be read/edited with vim).
- [Configuration](./configuration.md): Configuration options exposed by Ursadb,
and how to use them.
disk format.
59 changes: 28 additions & 31 deletions docs/configuration.md
@@ -1,10 +1,13 @@
# Configuration

Ursadb configuration is a simple set of key-value pairs. They can be read
by issuing a `config get` command with `ursacli`:
Ursadb configuration is a simple set of key-value pairs. The defaults are sane:
you don't need to change them unless you want to tune ursadb for your system.

They can be read by issuing a `config get` command with `ursacli`:

```
$ ursacli -c "config get;"
$ ursacli
ursadb> config get;
```

The response format is:
Expand All @@ -24,19 +27,7 @@ The response format is:
}
```

Values that have been changed from their default value can also be checked
by reading the database directly:

```bash
$ cat ~/tmp/ursadb/db.ursa | jq '.config'
{
"database_workers": 4,
"query_max_ngram": 256
}
```

To change a config value, you can edit the database file when it's turned off
(not recommended), or issue a `config set` command:
To change a config value, you may issue a `config set` command:

```
$ ursacli
@@ -45,9 +36,9 @@ ursadb> config set "database_workers" 10;

## Available configuration keys

- [database_workers](#database_workers) - Number of independent worker threads
- [query_max_edge](#query_max_edge) - Maximum query size (edge)
- [query_max_ngram](#query_max_ngram) - Maximum query size (ngram)
- [database_workers](#database_workers) - Number of independent worker threads.
- [query_max_edge](#query_max_edge) - Maximum query size (edge).
- [query_max_ngram](#query_max_ngram) - Maximum query size (ngram).
- [merge_max_datasets](#merge_max_datasets) - Maximum number of datasets involved
in a single merge.
- [merge_max_files](#merge_max_files) - Maximum number of files in a dataset
@@ -61,9 +52,10 @@ ursadb> config set "database_workers" 10;

Maximum number of values a first or last character in a sequence can take
to be considered when planning a query. The default is a conservative 1,
so query plam will never start or end with a wildcard.
Recommendation: Stick to the default value. If you have a good disk and
want to reduce false-positives, increase to 16.
so query plan will never start or end with a wildcard.

**Recommendation**: Stick to the default value. If you have a good disk and
want to reduce false-positives, increase (but no more than 16).

### query_max_ngram

@@ -75,7 +67,8 @@ want to reduce false-positives, increase to 16.
Maximum number of values an ngram can take to be considered when planning
a query. For example, with a default value of 16, trigram `11 2? 33` will
be expanded and included in the query, but `11 ?? 33` will be ignored.
Recommendation: Stick to the default value at first. If your queries are

**Recommendation**: Stick to the default value at first. If your queries are
fast, use many wildcards, but have many false positives, increase to 256.
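The expansion count behind this limit can be sketched in a few lines of Python. This is purely an illustration of the counting rule described above, not Ursadb's actual query planner:

```python
def ngram_expansions(ngram: str) -> int:
    """Count how many concrete values a wildcarded hex ngram can take.

    Every fixed hex digit contributes a factor of 1, and every '?'
    wildcard a factor of 16 (illustration only, not Ursadb code).
    """
    count = 1
    for ch in ngram.replace(" ", ""):
        count *= 16 if ch == "?" else 1
    return count

# With the default query_max_ngram of 16, `11 2? 33` (16 expansions)
# makes it into the query plan, but `11 ?? 33` (256 expansions) is ignored.
print(ngram_expansions("11 2? 33"))  # 16
print(ngram_expansions("11 ?? 33"))  # 256
```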


@@ -88,8 +81,10 @@ fast, use many wildcards, but have many false positives, increase to 256.
How many tasks can be processed at once? The default 4 is a very
conservative value for most workloads. Increasing it will make the
database faster, but at a certain point the disk becomes a bottleneck.
Recommendation: If your server is dedicated to ursadb, or your IO latency
is high (for example, files are stored on NFS), increase to at least 16.
This will also linearly increase memory usage in the worst case.

**Recommendation**: If your server is dedicated to ursadb, or your IO latency
is high (for example, files are stored on NFS), increase to 8 or more.

### merge_max_datasets

@@ -98,12 +93,13 @@ is high (for example, files are stored on NFS), increase to at least 16.
- **Maximum**: 1024

How many datasets can be merged at once? This has severe memory usage
implications - for merging datasets must be fully loaded, and every
implications - before merging, datasets must be fully loaded, and every
loaded dataset consumes a bit over 128MiB. Increasing this number makes
compacting huge datasets faster, but may run out of RAM.
Recommendation: merge_max_datasets * 128MiB can safely be set to around

**Recommendation**: merge_max_datasets * 128MiB can safely be set to around
1/4 of RAM dedicated to the database, so for example 8 for 4GiB server
or 32 for 16GiB server. Increasing past 10 gives diminishing returns, so
or 32 for 16GiB server. Increasing past 10 has diminishing returns, so
unless you have a lot of free RAM you can leave it at default.
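The sizing rule above (merge_max_datasets × 128MiB at roughly 1/4 of RAM) is simple enough to write out as a back-of-the-envelope calculation; the helper name below is made up for illustration:

```python
def suggested_merge_max_datasets(ram_gib: float) -> int:
    """Back-of-the-envelope sizing from the docs' rule of thumb:
    merge_max_datasets * 128MiB should stay around 1/4 of the RAM
    dedicated to the database. (Hypothetical helper, not Ursadb code.)
    """
    budget_mib = ram_gib * 1024 / 4  # a quarter of RAM, in MiB
    return int(budget_mib // 128)    # 128MiB per loaded dataset

print(suggested_merge_max_datasets(4))   # 8, matching the 4GiB example
print(suggested_merge_max_datasets(16))  # 32, matching the 16GiB example
```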

### merge_max_files
@@ -115,6 +111,7 @@ unless you have a lot of free RAM you can leave it at default.
When merging, what is the maximal allowed number of files in the
resulting dataset? Large datasets make the database faster, but also need
more memory to run efficiently.
Recommendation: ursadb was used with multi-million datasets in the wild,
but currently we recommend to stay on the safe side and don't create
datasets larger than 1 million files.

**Recommendation**: ursadb was used with multi-million datasets in the wild,
but currently we recommend keeping datasets smaller than 1 million files.
65 changes: 65 additions & 0 deletions docs/install.md
@@ -0,0 +1,65 @@
# Installation

## From dockerhub

Change [index_dir] and [samples_dir] to the paths on your filesystem where you want to keep
the index and samples.

```
sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples certpl/ursadb
```

## From dockerfile

```
git clone https://github.com/CERT-Polska/ursadb.git
sudo docker image build -t ursadb .
sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples ursadb
```

## From source

1. Clone the repository:
```
git clone --recurse-submodules https://github.com/CERT-Polska/ursadb.git
```

2. Install necessary dependencies:
```
sudo apt update
sudo apt install -y gcc-7 g++-7 libzmq3-dev cmake build-essential clang-format git
```

3. Build project:
```
mkdir build
cd build
cmake -D CMAKE_C_COMPILER=gcc-7 -D CMAKE_CXX_COMPILER=g++-7 -D CMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
```

4. (Optional) Install binaries to `/usr/local/bin`:
```
sudo make install
```

5. (Optional) Consider registering UrsaDB as a systemd service:
```
cp contrib/systemd/ursadb.service /etc/systemd/system/
systemctl enable ursadb
```

## From nixpkgs (may be outdated)

```
nix-env -i ursadb
```

## From .deb package (may be outdated)

UrsaDB is distributed in the form of pre-built Debian packages targeting Debian Buster and Ubuntu 18.04. You can get the packages from [GitHub Releases](https://github.com/CERT-Polska/ursadb/releases).

You may use this convenient one-liner to install the latest UrsaDB package along with the required dependencies:
```
curl https://raw.githubusercontent.com/CERT-Polska/ursadb/master/contrib/install_deb.sh | sudo bash
```
45 changes: 37 additions & 8 deletions docs/ondiskformat.md
@@ -18,7 +18,7 @@ it means that it's not used and can be safely removed (when the database is turn

Example:

```
```json
{
"config": {
"database_workers": 10
@@ -40,7 +40,7 @@ Most importantly, it contains references to all indexes in this dataset,
and a list of filenames tracked by this dataset.

Example:
```
```json
{
"filename_cache": "namecache.files.set.507718ac.db.ursa",
"files": "files.set.507718ac.db.ursa",
@@ -86,7 +86,7 @@ For example:
Finally, the last `(2**24 + 1) * 8` bytes of an index consist of an array of uint64_t
values, where sequence N starts at offset `array[N]` in the file, and ends at `array[N+1]`.
An index can be parsed with the following Python code. Warning this is just a demonstration,
An index can be parsed with the following Python code. Warning: this is just a demonstration,
and is way too slow for anything but very small indexes.
```python
@@ -127,18 +127,33 @@ def parse(fpath):
continue
run = decompress(fdata[offsets[i]:offsets[i+1]])
# trigram [i] contains files with ids [run]
print(f"{i:06x}: {run}")
if __name__ == '__main__':
parse(sys.argv[1])
```

## Names
## Files

Newline-separated list of filenames in the database.

Newline-separated list of filenames in the database. This file can be safely
edited or changed with any editor, for example when moving the collection to a
different folder. It's only important to:
```
$ head -n 10 files.set.35dcff87.db.ursa
/mnt/samples/001
/mnt/samples/002
/mnt/samples/003
/mnt/samples/004
/mnt/samples/005
/mnt/samples/006
/mnt/samples/007
/mnt/samples/008
/mnt/samples/009
/mnt/samples/010
```

Right now, the only way to change the base directory of files is to edit this file directly. It can be safely edited or changed with any editor. It's only important to:

- ensure the database is turned off
- remove the namecache file later
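Rewriting the path prefix in the file list could be scripted along these lines. The helper is hypothetical (not part of Ursadb); per the caveats above, run it only while the database is turned off, then delete the namecache file so it gets regenerated:

```python
from pathlib import Path

def rebase_file_list(list_path: str, old_base: str, new_base: str) -> None:
    """Rewrite the base directory in a dataset's newline-separated file list.

    Hypothetical maintenance helper: every filename starting with old_base
    gets that prefix replaced by new_base; other lines are left untouched.
    """
    path = Path(list_path)
    lines = path.read_text().splitlines()
    rebased = [
        new_base + line[len(old_base):] if line.startswith(old_base) else line
        for line in lines
    ]
    path.write_text("\n".join(rebased) + "\n")
```

For example, `rebase_file_list("files.set.35dcff87.db.ursa", "/mnt/samples", "/data/samples")` would move the collection shown above to `/data/samples`.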
@@ -147,10 +162,24 @@ different folder. It's only important to:

Contains an array of `uint64_t` offsets in the `names` file.
This is used to map file IDs to names for queries, without loading all the file
names into memory.
names into memory (and to speed up database startup).
If this file doesn't exist or was removed, it'll be regenerated when the database
starts.

```
$ xxd namecache.files.set.d2c6638f.db.ursa | head -n 10
00000000: 0000 0000 0000 0000 1f00 0000 0000 0000 ................
00000010: 3b00 0000 0000 0000 5d00 0000 0000 0000 ;.......].......
00000020: 8000 0000 0000 0000 9b00 0000 0000 0000 ................
00000030: b200 0000 0000 0000 d700 0000 0000 0000 ................
00000040: fb00 0000 0000 0000 2601 0000 0000 0000 ........&.......
00000050: 4c01 0000 0000 0000 6b01 0000 0000 0000 L.......k.......
00000060: 8d01 0000 0000 0000 c601 0000 0000 0000 ................
00000070: fd01 0000 0000 0000 3702 0000 0000 0000 ........7.......
00000080: 5402 0000 0000 0000 7f02 0000 0000 0000 T...............
00000090: a702 0000 0000 0000 d702 0000 0000 0000 ................
```
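Based on the description above, the namecache can be sketched as a flat array of little-endian `uint64_t` offsets into the names file, one per name. The helpers below are illustrative only, not Ursadb's code, and the real file layout may differ in details (such as a trailing end-of-blob offset):

```python
import struct

def build_namecache(names_blob: bytes) -> bytes:
    """Build an offset cache for a newline-separated names blob.

    Emits one little-endian uint64 per name: the byte offset where that
    name starts in the blob. (Sketch of the format described above.)
    """
    offsets = [0]
    for i, b in enumerate(names_blob):
        if b == ord(b"\n") and i + 1 < len(names_blob):
            offsets.append(i + 1)  # next name starts right after '\n'
    return b"".join(struct.pack("<Q", off) for off in offsets)

def name_at(names_blob: bytes, cache: bytes, file_id: int) -> bytes:
    """Resolve a file ID to its name using the offset cache."""
    (off,) = struct.unpack_from("<Q", cache, file_id * 8)
    end = names_blob.index(b"\n", off)
    return names_blob[off:end]
```

This mirrors why the cache exists: resolving `name_at(blob, cache, N)` needs only one 8-byte read from the cache plus one slice of the names file, instead of scanning all names.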

## Itermeta

Contains information about the current position of a given iterator. For example: