-
Notifications
You must be signed in to change notification settings - Fork 242
SSDeep In Datawave
In 2015, Brian Wallace from Cylance published the paper "Optimizing ssDeep for use at scale". In 2023 an approach inspired by this work has been submitted to Datawave in PR #2085. Read the original paper and you will know almost everything you need to know to understand the Datawave PR.
This feature assumes you have an index created in Accumulo already and provides the Datawave query logic to query that index. The code to generate this index will possibly make its way into Datawave in the future, however all of the building blocks are present in this PR. The code itself is well commented. Stay tuned for further high level documentation.
There are multiple rowId indexing schemes implemented for SSDeep hash ngrams.
One indexing scheme will partition the hashes based on their prefix, with the idea that the alphabet of possible prefixes will be evenly distributed across partitions. We have not empirically evaluated that this is indeed the case.
The second indexing scheme adds a bucket and chunk size prefix to the rowId, each a single character in length. Ngrams are assigned to a bucked based of a hash of the original hash from which the ngram was derived. The chunk size bucket is based on a notion called the chunk index.
SSDeep hashes have a fixed set of chunk sizes that generally follow the form 3 * 2^n
, where n
is the the chunk
index. In other words, a chunk index of '0' corresponds to a chunk size of 3, a chunk index of '1' corresponds to a
chunk size of '6' and a chunk index of '12' corresponds to a chunk size of 12,228.
Including a bucket allows us to partition the search space for paralel query execution, while incluing the chunkIndex allows us to constrain the scan range to chunk sizes of interest.
This table structure puts the ngram in the rowId, the chunk size in the column family and the original ssdeep hash in the column qualifier
Table | Purpose | Row | Column Family | Column Qualifier | Value |
---|---|---|---|---|---|
ssdeepIndex | Index Row | SSDeep Ngram | Chunk Size | SSDeep Hash | Empty |
This table structure introduces the concept of buckets to partition ngrams further than they would be normally. The numeric bucket id is derived from a hash of the original ssdeep hash and encoded as two characters by default. The chunk size of the original hash is also encoded as a single character. This three character prefix is prepended to the rowId and used to partition ngrams into several partitions (or buckets). All the ngrams for a given hash will appear in the same bucket.
Table | Purpose | Row | Column Family | Column Qualifier | Value |
---|---|---|---|---|---|
ssdeepIndex | Index Row | Encoded Bucket, Encoded Chunk Index, SSDeep Ngram | Encoded Chunk Index | SSDeep Hash | Empty |
The bucket is encoded with two characters and the chunk is encoded a single character, so Row ID's will be ngram size + 3.
The encoded chunk index is included a second time in the column family for isolation (if we want to take advantage of this).
The SSDeepSimilarityQueryLogic introduced in PR #2085 takes one or more SSDeep Hashes, generates nGrams of a predetermined size and searches for each of these in the ssdeepIndex
table, matching only those nGrams in the index with a compatible chunk size. Each ssdeep hash in the index that corresponds to one of these ngrams is returned in the query result set. Query results are scored two ways: (1) the original scoring algorithm based on the number of overlapping ngrams between the query and matching hatch (2) PR #2129 adds a weighted score between 0-100 that is based on the edit distance between the query hash and matching hash, with better matches ranked higher and poor matches ranked closer to zero.
The SSDeep Similarity Query Logic returns matches as they are found by a BatchScanner in Accumulo and does not attempt to sort or rank returned matches based on scores. To mitigate some of the issues related to unranked matches, PR #2129 also adds the ability to specify a minimum score threshold, where matches below a certain score will be dropped from the result set.
PR #2236 refactors the SSdeepSimilarityQueryLogic so that scores are calculated in the query logic implementation itself instead of being calculated in the query transformer component. This enables chaining the output of the ssdeep similarity logic with other query logics more effectively because scores can be readily merged into the output of the second logics in the chain.
In order to support improved testing of the SSDeep query code, and integration between the SSDeep similarity query code and other Datawave query logics (e.g, the Discovery Query Logic and Event Query Logic), PR #2229 moves the SSDeep-related classes to the ssdeep-common
submodule.
PR #2230 builds on this by creating an ingest-ssdeep submodule containing implementations of a prototype SSDeepIngestHandler, and components for the query test framework that provide the ability to load SSDeep hashes into both an ssdeepIndex
table and the standard Datawave shard (and related tables). PR #2230 includes unit tests that exercise this capability using the test framework and an in-memory accumulo instance.
PR #2242 implements the chained SSDeep Similarity and Discovery Query. This will allow a user to query for one-or-more SSDeep hashes, find similar hashes in the ssdeepIndex
table and then query the standard Datawave shard tables for items that are related to those matching hashes. The results returned not only include the standard Discovery query output, but also include the corresponding query hash and scores that resulted in the return of the discovered data.
A future PR will implement a chained SSDeep Similarity and Event Query. Like the Chained Discovery Query, this will allow a user to query for one or more SSDeep hashes to find similar hashes from the ssdeepIndex table and then run an Event query on the standard Datawave shard tables for items that have those similar hashes. Once again, the standard event results will be enriched by the query hash and scores that resulted in the return of the event data.