Context Triggered Piecewise Hashes
To understand rolling hashes, we first need to understand piecewise hashing.
Originally developed by Nicholas Harbour for dcfldd, piecewise hashing uses an arbitrary hashing algorithm to create many checksums for a file instead of just one. Rather than to generate a single hash for the entire file, a hash is generated for many discrete fixed-size segments of the file. For example, one hash is generated for the first 512 bytes of input, another hash for the next 512 bytes, and so on.
The technique was originally developed to mitigate errors during forensic imaging. If an error occurred, only one of the piecewise hashes would be invalidated. The remainder of the piecewise hashes, and thus the integrity of the remainder of the data, was still assured.
A rolling hash algorithm produces a pseudo-random value based only on the current context of the input. The rolling hash works by maintaining a state based solely on the last few bytes from the input. Each byte is added to the state as it is processed, and removed from the state after a set number of other bytes have been processed.
Whereas current piecewise hashing programs such as dcfldd used fixed offsets to determine when to start and stop the traditional hash algorithm, a CTPH algorithm uses the rolling hash. When the output of the rolling hash produces a specific output, or trigger value, the traditional hash is triggered. That is, while processing the input file, one begins to compute the traditional hash for the file. Simultaneously, one must also compute the rolling hash for the file. When the rolling hash produces a trigger value, the value of the traditional hash is recorded in the CTPH signature and the traditional hash is reset.
Consequently, each recorded value in the CTPH signature depends only on part of the input, and changes to the input will result in only localized changes in the CTPH signature. For instance, if a byte of the input is changed, at most two, and in many cases, only one of the traditional hash values will be changed; the majority of the CTPH signature will remain the same. Because the majority of the signature remains the same, files with modifications can still be associated with the CTPH signatures of known files.[1]
Example Implementation
python-ssdeep: Python wrapper for ssdeep fuzzy hashing library
This is a straightforward Python wrapper for ssdeep by Jesse Kornblum, which is a library for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length.
To compute a fuzzy hash, use hash
function:
>>> import ssdeep
>>> hash1 = ssdeep.hash('Also called fuzzy hashes, Ctph can match inputs that have homologies.')
>>> hash1
'3:AXGBicFlgVNhBGcL6wCrFQEv:AXGHsNhxLsr2C'
>>> hash2 = ssdeep.hash('Also called fuzzy hashes, CTPH can match inputs that have homologies.')
>>> hash2
'3:AXGBicFlIHBGcL6wCrFQEv:AXGH6xLsr2C'
The compare
function returns the match between 2 hashes, an integer value from 0 (no match) to 100.
>>> ssdeep.compare(hash1, hash2)
22
Relevant Note(s): Pyramid of Pain Hashing