diff --git a/INSTALL.md b/INSTALL.md
index 5c8af3c..c2ee955 100644
--- a/INSTALL.md
+++ b/INSTALL.md
@@ -1,65 +1,3 @@
 # Installation

-## From pre-built package
-
-UrsaDB is distributed in a form of pre-built Debian packages targeting Debian Buster and Ubuntu 18.04. You can get the packages from [GitHub Releases](https://github.com/CERT-Polska/ursadb/releases).
-
-You may use this convenient one-liner to install the latest UrsaDB package along with the required dependencies:
-```
-curl https://raw.githubusercontent.com/CERT-Polska/ursadb/master/contrib/install_deb.sh | sudo bash
-```
-
-## From nixpkgs
-
-```
-nix-env -i ursadb
-```
-
-## From dockerhub
-
-Change [index_dir] and [samples_dir] to paths on your filesystem where you want to keep
-index and samples.
-
-```
-sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples certpl/ursadb
-```
-
-## From dockerfile
-
-```
-git clone https://github.com/CERT-Polska/ursadb.git
-sudo docker image build -t ursadb .
-sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples ursadb
-```
-
-## From source
-
-1. Clone the repository:
-```
-git clone --recurse-submodules https://github.com/CERT-Polska/ursadb.git
-```
-
-2. Install necessary dependencies:
-```
-sudo apt update
-sudo apt install -y gcc-7 g++-7 libzmq3-dev cmake build-essential clang-format git
-```
-
-3. Build project:
-```
-mkdir build
-cd build
-cmake -D CMAKE_C_COMPILER=gcc-7 -D CMAKE_CXX_COMPILER=g++-7 -D CMAKE_BUILD_TYPE=Release ..
-make -j$(nproc)
-```
-
-4. (Optional) Install binaries to `/usr/local/bin`:
-```
-sudo make install
-```
-
-5. (Optional) Consider registering UrsaDB as a systemd service:
-```
-cp contrib/systemd/ursadb.service /etc/systemd/system/
-systemctl enable ursadb
-```
+See instructions in the [docs/install.md file](./docs/install.md)
\ No newline at end of file
diff --git a/README.md b/README.md
index ca95a11..067f88c 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # UrsaDB

-A 3gram search engine for querying Terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).
+A 3gram search engine for querying terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).

 Created in [CERT.PL](https://cert.pl). Originally by Jarosław Jedynak ([tailcall.net](https://tailcall.net)), extended and improved by Michał Leszczyński.

diff --git a/docs/README.md b/docs/README.md
index fba34e9..a443c82 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,6 +1,17 @@
 # Ursadb documentation

-## User guide
+Ursadb is a search engine optimized for working with binary files (for example, malware dumps).
+
+## Basic usage
+
+Read this section for a quick start with Ursadb.
+
+- [Install](./install.md): How to install Ursadb
+- [Getting Started](./start.md): Read if you want to learn Ursadb quickly.
+
+## User guide
+
+Read if you want to understand more about Ursadb.

 - [Query syntax](./syntax.md) guide: Read to learn about the commands used
   to interact with the database.
@@ -9,9 +20,14 @@
 - [Datasets](./datasets.md): Introduction to Ursadb's datasets.
 - [Indexing in-depth](./indexing.md): Read if you need to index a considerable
   number of files.
+- [Configuration](./configuration.md): Configuration options exposed by Ursadb,
+  and how to use them.
+
+## Advanced features
+
+Random notes about Ursadb internals
+
 - [Performance and limits](./limits.md): Read in case you're not sure if
   Ursadb can handle your collection.
 - [On-disk format](./ondiskformat.md): Read if you want to understand Ursadb's on
-  disk format (hint: many files are just JSON and can be read/edited with vim).
-- [Configuration](./configuration.md): Configuration options exposed by Ursadb,
-  and how to use them.
+  disk format.
diff --git a/docs/configuration.md b/docs/configuration.md
index 4c2af55..6cf2927 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1,10 +1,13 @@
 # Configuration

-Ursadb configuration is a simple set of key-value pairs. They can be read
-by issuing a `config get` command with `ursacli`:
+Ursadb configuration is a simple set of key-value pairs. The defaults are sane.
+You don't need to change them unless you want to tune ursadb for your system.
+
+They can be read by issuing a `config get` command with `ursacli`:

 ```
-$ ursacli -c "config get;"
+$ ursacli
+ursadb> config get;
 ```

 The response format is:
@@ -24,19 +27,7 @@ The response format is:
 }
 ```

-Values that have been changed from their default value can also be checked
-by reading the database directly:
-
-```bash
-$ cat ~/tmp/ursadb/db.ursa | jq '.config'
-{
-    "database_workers": 4,
-    "query_max_ngram": 256
-}
-```
-
-To change a config value, you can edit the database file when it's turned off
-(not recommended), or issue a `config set` command:
+To change a config value, you may issue a `config set` command:

 ```
 $ ursacli
@@ -45,9 +36,9 @@ ursadb> config set "database_workers" 10;

 ## Available configuration keys

-- [database_workers](#database_workers) - Number of independent worker threads
-- [query_max_edge](#query_max_edge) - Maximum query size (edge)
-- [query_max_ngram](#query_max_ngram) - Maximum query size (ngram)
+- [database_workers](#database_workers) - Number of independent worker threads.
+- [query_max_edge](#query_max_edge) - Maximum query size (edge).
+- [query_max_ngram](#query_max_ngram) - Maximum query size (ngram).
 - [merge_max_datasets](#merge_max_datasets) - Maximum number of datasets involved
   in a single merge.
 - [merge_max_files](#merge_max_files) - Maximum number of datasets in a dataset
@@ -61,9 +52,10 @@ ursadb> config set "database_workers" 10;

 Maximum number of values a first or last character in sequence can take
 to be considered when planning a query. The default is a conservative 1,
-so query plam will never start or end with a wildcard.
-Recommendation: Stick to the default value. If you have a good disk and
-want to reduce false-positives, increase to 16.
+so the query plan will never start or end with a wildcard.
+
+**Recommendation**: Stick to the default value. If you have a good disk and
+want to reduce false-positives, increase it (but to no more than 16).

 ### query_max_ngram

@@ -75,7 +67,8 @@ want to reduce false-positives, increase to 16.
 Maximum number of values a ngram can take to be considered when planning a
 query. For example, with a default value of 16, trigram `11 2? 33` will be
 expanded and included in query, but `11 ?? 33` will be ignored.
-Recommendation: Stick to the default value at first. If your queries are
+
+**Recommendation**: Stick to the default value at first. If your queries are
 fast, use many wildcards, but have many false positives, increase to 256.


@@ -88,8 +81,10 @@ fast, use many wildcards, but have many false positives, increase to 256.
 How many tasks can be processed at once? The default 4 is a very conservative
 value for most workloads. Increasing it will make the database faster, but
 at a certain point the disk becomes a bottleneck.
-Recommendation: If your server is dedicated to ursadb, or your IO latency
-is high (for example, files are stored on NFS), increase to at least 16.
+This will also linearly increase memory usage in the worst case.
+
+**Recommendation**: If your server is dedicated to ursadb, or your IO latency
+is high (for example, files are stored on NFS), increase to 8 or more.
 ### merge_max_datasets

@@ -98,12 +93,13 @@ is high (for example, files are stored on NFS), increase to at least 16.
 - **Maximum**: 1024

 How many datasets can be merged at once? This has severe memory usage
-implications - for merging datasets must be fully loaded, and every
+implications - before merging, datasets must be fully loaded, and every
 loaded dataset consumes a bit over 128MiB. Increasing this number makes
 compacting huge datasets faster, but may run out of ram.
-Recommendation: merge_max_datasets * 128MiB can safely be set to around
+
+**Recommendation**: merge_max_datasets * 128MiB can safely be set to around
 1/4 of RAM dedicated to the database, so for example 8 for 4GiB server
-or 32 for 16GiB server. Increasing past 10 gives diminishing returns, so
+or 32 for 16GiB server. Increasing past 10 has diminishing returns, so
 unless you have a lot of free RAM you can leave it at default.

 ### merge_max_files
@@ -115,6 +111,7 @@ unless you have a lot of free RAM you can leave it at default.
 When merging, what is the maximal allowed number of files in the resulting
 dataset? Large datasets make the database faster, but also need more memory
 to run efficiently.
-Recommendation: ursadb was used with multi-million datasets in the wild,
-but currently we recommend to stay on the safe side and don't create
-datasets larger than 1 million files.
+
+**Recommendation**: ursadb was used with multi-million datasets in the wild,
+but currently we recommend keeping
+datasets smaller than 1 million files.
diff --git a/docs/install.md b/docs/install.md
new file mode 100644
index 0000000..94b1ad6
--- /dev/null
+++ b/docs/install.md
@@ -0,0 +1,65 @@
+# Installation
+
+## From dockerhub
+
+Change [index_dir] and [samples_dir] to paths on your filesystem where you want to keep
+index and samples.
+
+```
+sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples certpl/ursadb
+```
+
+## From dockerfile
+
+```
+git clone https://github.com/CERT-Polska/ursadb.git
+sudo docker image build -t ursadb .
+sudo docker run -v [index_dir]:/var/lib/ursadb:rw -v [samples_dir]:/mnt/samples ursadb
+```
+
+## From source
+
+1. Clone the repository:
+```
+git clone --recurse-submodules https://github.com/CERT-Polska/ursadb.git
+```
+
+2. Install necessary dependencies:
+```
+sudo apt update
+sudo apt install -y gcc-7 g++-7 libzmq3-dev cmake build-essential clang-format git
+```
+
+3. Build project:
+```
+mkdir build
+cd build
+cmake -D CMAKE_C_COMPILER=gcc-7 -D CMAKE_CXX_COMPILER=g++-7 -D CMAKE_BUILD_TYPE=Release ..
+make -j$(nproc)
+```
+
+4. (Optional) Install binaries to `/usr/local/bin`:
+```
+sudo make install
+```
+
+5. (Optional) Consider registering UrsaDB as a systemd service:
+```
+cp contrib/systemd/ursadb.service /etc/systemd/system/
+systemctl enable ursadb
+```
+
+## From nixpkgs (may be outdated)
+
+```
+nix-env -i ursadb
+```
+
+## From .deb package (may be outdated)
+
+UrsaDB is distributed in a form of pre-built Debian packages targeting Debian Buster and Ubuntu 18.04. You can get the packages from [GitHub Releases](https://github.com/CERT-Polska/ursadb/releases).
+
+You may use this convenient one-liner to install the latest UrsaDB package along with the required dependencies:
+```
+curl https://raw.githubusercontent.com/CERT-Polska/ursadb/master/contrib/install_deb.sh | sudo bash
+```
\ No newline at end of file
diff --git a/docs/ondiskformat.md b/docs/ondiskformat.md
index 977bfa9..31b8075 100644
--- a/docs/ondiskformat.md
+++ b/docs/ondiskformat.md
@@ -18,7 +18,7 @@ it means that it's not used and can be safely removed (when the database is turn

 Example:

-```
+```json
 {
     "config": {
         "database_workers": 10
@@ -40,7 +40,7 @@ Most importantly, it contains references to all indexes in this dataset,
 and a list of filenames tracked by this dataset.

 Example:
-```
+```json
 {
     "filename_cache": "namecache.files.set.507718ac.db.ursa",
     "files": "files.set.507718ac.db.ursa",
@@ -86,7 +86,7 @@ For example:
 Finally, the last `(2**24 + 1) * 8` bytes of an index consists of an array of uint64_t values,
 where sequence N starts at `array[N]` offset in the file, and ends at `array[N+1]`

-An index can be parsed with the following Python code. Warning this is just a demonstration,
+An index can be parsed with the following Python code. Warning: this is just a demonstration,
 and is way too slow to work with indexes bigger than really small ones.

 ```python
@@ -127,6 +127,7 @@ def parse(fpath):
             continue

         run = decompress(fdata[offsets[i]:offsets[i+1]])
+        # trigram [i] contains files with ids [run]
         print(f"{i:06x}: {run}")


@@ -134,11 +135,25 @@ if __name__ == '__main__':
     parse(sys.argv[1])
 ```

-## Names
+## Files
+
+Newline-separated list of filenames in the database.

-Newline-separated list of filenames in the database. This file can be safely
-edited or changed with any editor, for example when moving the collection to a
-different folder. It's only important to:
+```
+$ head -n 10 files.set.35dcff87.db.ursa
+/mnt/samples/001
+/mnt/samples/002
+/mnt/samples/003
+/mnt/samples/004
+/mnt/samples/005
+/mnt/samples/006
+/mnt/samples/007
+/mnt/samples/008
+/mnt/samples/009
+/mnt/samples/010
+```
+
+Right now, the only way to change the base directory of files is to edit this file directly. It can be safely edited or changed with any editor. It's only important to:

 - ensure the database is turned off
 - remove the namecache file later
@@ -147,10 +162,24 @@ different folder. It's only important to:

 Contains an array of `uint64_t` offsets in the `names` file. This is used to
 map file IDs to names for queries, without loading all the file
-names into memory.
+names into memory (and to speed up database startup).
 If this file doesn't exist or was removed, it'll be regenerated when the database starts.

+```
+$ xxd namecache.files.set.d2c6638f.db.ursa | head -n 10
+00000000: 0000 0000 0000 0000 1f00 0000 0000 0000  ................
+00000010: 3b00 0000 0000 0000 5d00 0000 0000 0000  ;.......].......
+00000020: 8000 0000 0000 0000 9b00 0000 0000 0000  ................
+00000030: b200 0000 0000 0000 d700 0000 0000 0000  ................
+00000040: fb00 0000 0000 0000 2601 0000 0000 0000  ........&.......
+00000050: 4c01 0000 0000 0000 6b01 0000 0000 0000  L.......k.......
+00000060: 8d01 0000 0000 0000 c601 0000 0000 0000  ................
+00000070: fd01 0000 0000 0000 3702 0000 0000 0000  ........7.......
+00000080: 5402 0000 0000 0000 7f02 0000 0000 0000  T...............
+00000090: a702 0000 0000 0000 d702 0000 0000 0000  ................
+```
+
 ## Itermeta

 Contains information about the current position of a given iterator.
 For example:
diff --git a/docs/start.md b/docs/start.md
new file mode 100644
index 0000000..58407aa
--- /dev/null
+++ b/docs/start.md
@@ -0,0 +1,63 @@
+# Getting started
+
+**This part of documentation is work in progress and will be improved in the future**
+
+### Installation
+
+The easiest way to start an ursadb instance is to run (substitute with your files and index paths):
+
+```
+mkdir -p /tmp/ursadb/index /tmp/ursadb/files
+sudo docker run -p 9281 -v /tmp/ursadb/index:/var/lib/ursadb:rw -v /tmp/ursadb/files:/mnt/samples certpl/ursadb
+```
+
+For other installation methods see [install.md](./install.md).
+
+To connect to the database you can build `ursacli` yourself, or use the tool from docker again:
+
+```
+sudo docker ps # look up container ID
+sudo docker exec -it [container ID] ursacli
+[2022-08-28 13:38:06.154] [info] Connecting to tcp://localhost:9281
+[2022-08-28 13:38:06.155] [info] Connected to UrsaDB v1.3.2+3797f9b (connection id: 006B8B4567)
+ursadb>
+```
+
+### Indexing
+
+Using another terminal, put some files in the files directory (`/tmp/ursadb/files` in the snippet above).
+I'll use the project source code in the example.
+
+```
+cd /tmp/ursadb/files
+git clone https://github.com/CERT-Polska/ursadb.git
+```
+
+Now send a command to the database:
+
+```
+ursadb> index "/mnt/samples";
+{
+    "result": {
+        "status": "ok"
+    },
+    "type": "ok"
+}
+```
+
+If everything worked correctly, you should have at least one dataset. Check that with a `topology` command:
+
+```
+ursadb> topology;
+dataset 20d30d28 [ 311] (gram3)
+```
+
+Finally, query the data for some strings:
+
+```
+ursadb> select "BSD";
+/mnt/samples/ursadb/extern/catch/Catch.h
+/mnt/samples/ursadb/LICENSE
+```
+
+That's it. For more available commands see [syntax.md](./syntax.md).
\ No newline at end of file
diff --git a/docs/syntax.md b/docs/syntax.md
index ce2e5d6..19418b5 100644
--- a/docs/syntax.md
+++ b/docs/syntax.md
@@ -1,5 +1,14 @@
 # syntax

+You can communicate with ursadb in its own query language. It's not very
+complicated, but longer queries can get complex.
+
+For example, to select files containing the "abc" trigram with ursacli, you would write:
+
+```
+ursadb> select "abc";
+```
+
 Available commands:

 - [`index`](#index)
@@ -20,7 +29,7 @@ All responses from the database use the JSON format.

 Additionally, all successful commands return response in the following format:

-```json
+```javascript
 {
     "result": {
         // json with type-specific information
@@ -31,7 +40,7 @@ Additionally, all successful commands return response in the following format:

 All failed commands return response in the following format:

-```json
+```javascript
 {
     // "message" key, instead of "result".
     "message": "Human-readable error message",