hansken.query — Constructing Hansken queries

Calls like ProjectContext.search accept query arguments. Besides None, a search query can be either a str in the Hansken Query Language syntax that the user interfaces also use, or a Query object.

Query values

Most queries will search for traces where a particular property of a trace, like picture.width, matches a particular value, like 1024. picture.width is clearly a numeric property. Hansken will treat any value provided with a query for a numeric property as a number, attempting to convert it if needed. The Term query Term('picture.width', 1024) is equivalent to the queries Term('picture.width', '1024') and Term('picture.width', 1024.0). hansken.py will only format query values in the following two cases:

  • When the value is an instance of datetime.date or datetime.datetime, the value will be converted to its ISO 8601 string representation. hansken.py requires that a timezone is set on the value; so-called naive instances will cause a ValueError to be raised.

  • When the value is a list or tuple of length 2 containing only numbers, the value is assumed to be a location and will be converted to ISO 6709 string representation.
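The two conversions can be sketched with the standard library alone. The helper names format_datetime and format_location are hypothetical, only datetime.datetime values are handled, and the ISO 6709 layout is inferred from the latlong strings used elsewhere in this document; this is an illustration, not hansken.py's actual implementation.

```python
from datetime import datetime, timezone


def format_datetime(value):
    # hypothetical helper: reject naive datetimes, then emit ISO 8601
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError('datetime value requires a timezone')
    return value.isoformat()


def format_location(value):
    # hypothetical helper: (latitude, longitude) to an ISO 6709-style string,
    # signed and zero-padded to 2 and 3 integer digits respectively
    latitude, longitude = value
    return f'{latitude:+08.4f}{longitude:+09.4f}'


created = format_datetime(datetime(2014, 1, 1, 12, 0, tzinfo=timezone.utc))
location = format_location((12.5281, -70.0229))
```

In practice these conversions happen inside hansken.py, so passing the datetime or tuple itself as the query value should be enough.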

Note

Query types that accept text values (like Term and Phrase) can be made to match the ‘raw’, ‘original’ text, before tokenization. Hansken calls this a ‘full matching’ query. Full matching queries are only supported on metadata, so hansken.py will switch a full matching query like Term('Bomb/explos!ve', full=True) to match only metadata if no property to match is provided.

Combining queries

Combinations of queries typically concern the boolean operations and, or, not. To make these combinations, use either the corresponding query types And, Or and Not or Python’s operators &, | and ~:

# import query objects like Range, Term and others
from hansken.query import *

# traces containing either the term 'test' or 'term'
test_term = Or(Term('test'), Term('term'))
# the following is identical
test_term = Term('test') | Term('term')

# files containing the term 'test' in either data or meta-data
test_files = And(Term('type', 'file'), Term('test'))
# the following is identical
test_files = Term('type', 'file') & Term('test')

# pictures over 8MiB in size
big_pictures = Term('type', 'picture') & Range('data.raw.size', gt=8388608)

# pictures that are not a file (e.g.: carved pictures)
pictures = And(Term('type', 'picture'), Not(Term('type', 'file')))
# the following is identical
pictures = Term('type', 'picture') & ~Term('type', 'file')

Another way of combining queries is using the Nested type, searching for traces where a particular property matches values from traces resulting from another query. This query type can be very powerful for answering complex questions.

Take a question like finding files from one image that also occur on another image, ignoring files that are known from the NSRL database:

from hansken.query import Nested, Term

image_a = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa'
image_b = 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'
# step 1: define a query to find files from image A
from_a = Term('type', 'file') & Term('image', image_a)
# step 2: define a query to find non-nsrl files from image B
from_b = Term('type', 'file') & Term('image', image_b) & ~Term('data.raw.hashHits', 'nsrl')
# step 3: combine the two together to find files from image A also found on image B
cross_hits = from_a & Nested('data.raw.hash.sha256', from_b)

Note

There is a limit on the number of traces resulting from the inner query (from_b in the example): if that query matches more than 100,000 traces, an error will be returned. Additionally, due to the way Nested type queries are processed, they will tend to be slower than other query types.
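As a mental model, a Nested query behaves like a two-step lookup: the inner query is reduced to the set of values of the nested field, and the outer query keeps traces whose value for that field occurs in the set. The sketch below mimics this on a handful of made-up dicts; the keys and values are placeholders, and it says nothing about how Hansken evaluates such a query internally.

```python
# toy traces; keys loosely mirror the example above, the data is made up
traces = [
    {'uid': 'a1', 'image': 'A', 'sha256': 'h1', 'nsrl': False},
    {'uid': 'a2', 'image': 'A', 'sha256': 'h2', 'nsrl': False},
    {'uid': 'b1', 'image': 'B', 'sha256': 'h1', 'nsrl': False},
    {'uid': 'b2', 'image': 'B', 'sha256': 'h2', 'nsrl': True},
]

# inner query: non-NSRL files from image B, reduced to their hash values
inner_hashes = {t['sha256'] for t in traces
                if t['image'] == 'B' and not t['nsrl']}

# outer query: files from image A whose hash occurs in the inner result
cross_hits = [t['uid'] for t in traces
              if t['image'] == 'A' and t['sha256'] in inner_hashes]
```

Only a1 remains: a2 shares its hash with b2, but b2 is excluded from the inner result as an NSRL hit.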

Sorting results

Hansken allows you to sort search results the way you want. hansken.py defaults to sorting on traces’ unique identifier (property “uid”) to create reproducible results, but enables you to override this. Search calls take a sort argument that can be a single clause or multiple clauses to sort on, each expressed as a str or a Sort instance:

from hansken.query import Sort

# search for traces matching query, sort them by name
results = context.search(query, sort='name')
# search for traces matching query, sort them in reverse by name (note the -), then by uid
results = context.search(query, sort=('name-', 'uid'))
# the same as above, but the first clause expressed as a Sort instance
results = context.search(query, sort=[Sort('name', direction='descending'),
                                      'uid'])

Hansken also supports ‘filtered sort’, sorting on values if the traces being sorted match a certain query. This comes in handy when searching pictures that have classification predictions. In this example, we’d want pictures with the highest chances of containing a banana first:

from hansken.query import Sort, Term

bananas = context.search(
    # the query expressed as a Term object
    query=Term('type', 'picture'),
    # sort on property "prediction.confidence"
    sort=Sort('prediction.confidence',
              # in descending order, provide the highest confidence first
              direction='descending',
              # but only those confidences that belong to the "banana" class
              filter=Term('prediction.class', 'banana'))
)

bananas = context.search(
    query=Term('type', 'picture'),
    # the filter for a sort clause is just a query, so this can be expressed through a str
    sort=Sort('prediction.confidence', direction='descending', filter='prediction.class:banana')
)

# the entire sort clause can be expressed as a str too,
# using a minus to indicate descending order and a pipe character to specify the filter
bananas = context.search(query='type:picture', sort='prediction.confidence- | prediction.class:banana')

Note

Filtered sort is not universally supported on all properties. The primary use case for this is sorting on properties that occur multiple times for a trace (like prediction.confidence). The Hansken remote will return an error when an attempt is made to sort (or filter) on a property that doesn’t support it.

When searching for traces that have a certain similarity, like pictures that look alike, Hansken can sort on the similarity / distance of the associated vectors:

from hansken.query import Sort, Term

# let's say that the one banana I found earlier was tagged as "banana"
my_banana = context.search('tags:banana').takeone()
# for this example, use the first prediction.embedding vector (assuming there is one)
banana_prediction = my_banana.get('prediction')[0]
banana_vector = banana_prediction.embedding
# take note of the model that created this embedding vector
banana_model = banana_prediction.model_name

# search for all pictures, but sort the results by similarity of the vector obtained earlier
banana_candidates = context.search('type:picture', sort=Sort(
    # sort on the same vector property
    field='prediction.embedding',
    # use cosine similarity for the actual sorting
    mode='cosineSimilarity',
    # supply the vector to calculate distances from
    value=banana_vector,
    # apply this sorting only on the applicable embeddings, those from the same model as the banana_vector
    filter=Term('prediction.modelName', banana_model),
))

Note

Note that vector properties can ‘mean’ a lot of things, depending on what kind of algorithm created the vectors. Face embedding vectors might not be well suited for finding pictures portraying a particular concept, for example. As such, make sure to combine this with a filtered sort to make sure the sort is applied to the right kind of vectors. Consult the trace model and a domain expert when using results sorted like this.

Additionally, the sort direction will usually default to ascending order. When the sort mode is set to 'cosineSimilarity', however, the sort order is flipped to make sure the most similar vectors appear at the top of the result. This behaviour is only applied when no direction is supplied (the default is None, causing Sort to automatically determine the ‘right’ order).
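The reason for the flipped default can be seen from the measures themselves: a higher cosine similarity means "more alike", whereas a lower distance means "more alike". The sketch below uses made-up two-dimensional vectors and plain standard library math; it only illustrates the ordering and makes no claim about how Hansken computes these values.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def euclidean_distance(a, b):
    return math.dist(a, b)


reference = (1.0, 0.0)
candidates = {'near': (0.9, 0.1), 'far': (0.0, 1.0)}

# most alike first: similarity has to be sorted in descending order ...
by_similarity = sorted(candidates, reverse=True,
                       key=lambda name: cosine_similarity(reference, candidates[name]))
# ... while distance has to be sorted in ascending order
by_distance = sorted(candidates,
                     key=lambda name: euclidean_distance(reference, candidates[name]))
```

Both orderings put the 'near' vector first, but only because the sort direction differs between the two measures.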

Facets

The Facet classes can be used to create a histogram from a query result, like an overview of data sizes of traces in the result. To request such information, supply one or more Facet instances to calls like ProjectContext.search.

hansken.py supports three types of facets:

  • TermFacet: a facet on any field, counting occurrences of values. TermFacet('type'), for example, would request a histogram of trace types like file or email;

  • RangeFacet: a facet grouping numeric or date fields into ‘buckets’ of a specified size. RangeFacet('data.raw.entropy', scale='linear', interval=1), for example, would request a histogram of raw data entropy, grouped into buckets (0..1], (1..2], and so on;

  • GeohashFacet: a facet requesting geohashes on a lat/long field. GeohashFacet('gps.latlong', precision=5), for example, would request a histogram of geohashes of length 5.
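To give a feel for what a geohash "of length 5" is, the following sketch implements the commonly published geohash encoding: longitude and latitude ranges are halved alternately, and every five resulting bits select one base-32 character. The function name and the sample coordinates are chosen for illustration; Hansken computes geohashes itself, so this is not something a hansken.py user needs to run.

```python
BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'


def geohash(latitude, longitude, precision):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    while len(bits) < precision * 5:
        # bits alternate between longitude and latitude, longitude first
        if len(bits) % 2 == 0:
            bounds, value = lon_range, longitude
        else:
            bounds, value = lat_range, latitude
        middle = (bounds[0] + bounds[1]) / 2
        if value >= middle:
            bits.append(1)
            bounds[0] = middle
        else:
            bits.append(0)
            bounds[1] = middle
    # every 5 bits select one base-32 character
    chars = []
    for start in range(0, len(bits), 5):
        index = int(''.join(str(bit) for bit in bits[start:start + 5]), 2)
        chars.append(BASE32[index])
    return ''.join(chars)


cell = geohash(57.64911, 10.40744, precision=5)
```

Each extra character narrows the cell, which is why a higher precision yields more, smaller buckets in the resulting histogram.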

from hansken.query import TermFacet

# search for pictures, generate a histogram of the camera types in exif data
result = context.search('type:picture', facets=TermFacet('picture.exif.camera'))
# a single facet was specified, we'll expect a single facet result on the search result
cameras = result.facets[0]

Aside from the required field parameter, Facet takes a scale parameter that specifies the sizes of the buckets in the resulting histogram. Scales like this only make sense for datetime or numeric fields. For datetime fields, the possible values are year, month, day, hour, minute and second. Numeric fields can be faceted using a linear or logarithmic scale, coupled to a certain interval or base respectively:

from hansken.query import RangeFacet, Term, TermFacet

# search for files created in 2014, generate a histogram of all the days in 2014
context.search(Term('file.createdOn', '2014'),
               facets=RangeFacet('file.createdOn', scale='day'))
# search for files, generate a linear histogram of the entropy
context.search('type:file', facets=RangeFacet('data.raw.entropy', scale='linear', interval=1))
# search for everything, generate a logarithmic histogram of the raw sizes
# the histogram buckets will represent 0-1KB, 1KB-1MB, 1MB-1GB, …
context.search(facets=RangeFacet('data.raw.size', scale='log', base=10))
# multiple facets can be requested as a sequence
context.search(facets=[TermFacet('type'), RangeFacet('data.raw.size', scale='log', base=10)])
# the result of the term facet on type would be available on the search result.facets at index 0, data size at index 1

Note

There is a limit to the number of buckets any facet can have; specifying scale='second' within a date range of multiple years will likely fail. Likewise, using a linear scale of 10 bytes for data size will likely return an error.
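A quick back-of-the-envelope calculation shows how fast bucket counts grow. The actual limit is not stated here, and the three-year range and 1 GiB upper bound below are arbitrary assumptions picked for the arithmetic.

```python
import math
from datetime import datetime

# scale='second' over 2014 through 2016: one bucket per second in the range
span = datetime(2017, 1, 1) - datetime(2014, 1, 1)
second_buckets = int(span.total_seconds())

# linear scale with interval=10 over data sizes up to 1 GiB
size_buckets = math.ceil((1 << 30) / 10)
```

Both cases end up with roughly a hundred million buckets, so a coarser scale or a narrower min and max is the safer choice.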

Classes in hansken.query

class Query

Bases: object

Base class for Hansken query types. Implementations are required to implement as_dict for transformation to wire format.

as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

__and__(other)

Binary and operator (&) handling, resulting in an And query. Resulting query is flattened when one or more operands are already And queries.

__or__(other)

Binary or operator (|) handling, resulting in an Or query. Resulting query is flattened when one or more operands are already Or queries.

__invert__()

Unary not operator (~) handling, resulting in a Not query.
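The operator handling described above can be illustrated with a few stand-in classes. Everything in this sketch (the class bodies, the _combine helper and the queries attribute) is hypothetical and only mirrors the documented behaviour, including the flattening of nested And and Or queries; it is not the code of hansken.query.

```python
class Query:
    # stand-in base class for illustration, not hansken.query.Query itself
    def __and__(self, other):
        return _combine(And, self, other)

    def __or__(self, other):
        return _combine(Or, self, other)

    def __invert__(self):
        return Not(self)


class And(Query):
    def __init__(self, *queries):
        self.queries = list(queries)


class Or(Query):
    def __init__(self, *queries):
        self.queries = list(queries)


class Not(Query):
    def __init__(self, query):
        self.query = query


class Term(Query):
    def __init__(self, field, value):
        self.field, self.value = field, value


def _combine(kind, *operands):
    # flatten operands that already are a combination of the same kind
    queries = []
    for operand in operands:
        queries.extend(operand.queries if isinstance(operand, kind) else [operand])
    return kind(*queries)


a, b, c = Term('type', 'file'), Term('type', 'picture'), Term('file.name', 'query.py')
flat = a & b & c      # a single And holding three queries, not And(And(a, b), c)
mixed = (a | b) & ~c  # an And holding an Or and a Not
```

Flattening keeps long chains of & or | from producing deeply nested structures in the resulting query.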

class And(*queries)

Bases: Sized, Iterable, Query

Boolean conjunction of multiple queries; traces should match all contained queries, for example:

And(Term('file.name', 'query.py'),
    Range('data.raw.size', min=512))

__init__(*queries)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Or(*queries)

Bases: Sized, Iterable, Query

Boolean disjunction of multiple queries; traces should match any contained query, for example:

Or(Term('file.name', 'query.py'),
   Range('data.raw.size', max=1024))

__init__(*queries)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Not(query)

Bases: Query

Negates a single query, for example:

Not(Term('file.name', 'query.py'))

__init__(query)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Nested(field, query)

Bases: Query

Query a field for values matching the results of another query, for example:

Nested('data.raw.hash.md5', Term('file.name', 'query.py'))

__init__(field, query)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Tracelet(tracelet_type, query=None)

Bases: Query

Restrict a query for a tracelet type to the same tracelet instance of that tracelet type.

# find traces containing an entity
Tracelet('entity')

# find traces containing an entity that has both:
# - a value starting with "http://"
# - a confidence of at least 0.9
Tracelet('entity', Term('entity.value', 'http://*', full=True) & Range('entity.confidence', min=0.9))

Note that without the Tracelet query, the Term and Range queries above could match different entities, ultimately matching traces that contain any entity with a value starting with http:// and any entity with a confidence of at least 0.9 (not necessarily the same entity).
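The difference can be made concrete with a toy trace that carries two entities. The dict layout and the helper names below are made up for illustration and do not reflect how Hansken stores or matches tracelets.

```python
# a made-up trace with two entity tracelets
trace = {'entity': [
    {'value': 'http://example.com', 'confidence': 0.5},
    {'value': 'NL00BANK0123456789', 'confidence': 0.95},
]}


def is_url(entity):
    return entity['value'].startswith('http://')


def is_confident(entity):
    return entity['confidence'] >= 0.9


# without Tracelet: each condition may be satisfied by a different entity
loose_match = (any(is_url(e) for e in trace['entity'])
               and any(is_confident(e) for e in trace['entity']))
# with Tracelet: a single entity has to satisfy both conditions
strict_match = any(is_url(e) and is_confident(e) for e in trace['entity'])
```

The toy trace matches the loose variant but not the strict one, since its URL entity has a low confidence.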

__init__(tracelet_type, query=None)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Trace(query)

Bases: Query

Restrict a tracelet query to tracelets belonging to traces matching the inner query.

# match entities of type iban, but only if the trace they belong to is from a specific image
Term('entity.type', 'iban') & Trace(Term('image', '1234-abcd'))

__init__(query)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Term(field_or_value, value=None, full=False)

Bases: Query

Query for the value of a single field, for example:

# search for files with name "query.py"
Term('file.name', 'query.py')
# search for occurrences of the term "query" (in either data or metadata)
Term('query')

__init__(field_or_value, value=None, full=False)

Create a new Term query.

Parameters:
  • field_or_value – the field to search, or (when value is not supplied) the search value

  • value – value to search for (only needed when searching a specific field)

  • full – search the untokenized variant of any string, see full matches

as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Regex(field_or_pattern, pattern=None, full=False)

Bases: Query

Query a field for string-values matching a regular expression, for example:

# search for replies or forwards
Regex('email.subject', '(re|fw): .*', full=True)
# search for bombs, or some curious misspellings
Regex(re.compile(r'bo[mn]+bs'))

Either a str or a re.Pattern object is accepted; of a re.Pattern object, only the pattern property is used.

Note

  • Regular expressions always match entire terms or (in case of full=True) properties, as if the regular expression was anchored at both ends, see full matches.

  • Not every feature supported by Python’s re module (like particular character classes (\s / \w), start/end anchors (^/$), look ahead/behind or non-greedy quantifiers (?? / *?)) will be supported by Hansken. The use of these is not validated by hansken.py, but will result in errors when submitted.

  • Regular expression queries are always case insensitive and ignore diacritics in values.

__init__(field_or_pattern, pattern=None, full=False)

Create a new Regex query.

Parameters:
  • field_or_pattern – the field to match, or (when pattern is not supplied) the search pattern

  • pattern – pattern to match, either a str or re.Pattern

  • full – match the untokenized variant of the value, see full matches

as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Range(field, **ranges)

Bases: Query

Query a field for values in a particular range, for example:

# search for traces with entropy between 4.0 (exclusive) and 7.0 (inclusive)
Range('data.raw.entropy', gt=4.0, max=7)
# search for traces no larger than 1MiB (1 << 20 == 2 ** 20 == 1048576 bytes)
Range('data.raw.size', max=1 << 20)
# search for traces with peculiar names (matches file name aab.txt, but not ccb.txt)
Range('file.name', min='aa', max='cc')

__init__(field, **ranges)

Create a new Range query.

Parameters:
  • field – the field to query for

  • ranges

    keyword arguments of the following forms:

    • >, gt: value should be greater than supplied value;

    • >=, gte, min, minvalue, min_value: value should be greater than or equal to supplied value;

    • <, lt: value should be less than supplied value;

    • <=, lte, max, maxvalue, max_value: value should be less than or equal to supplied value;
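The keyword aliases listed above boil down to four comparisons. The sketch below maps each alias to the matching function from Python's operator module and evaluates the bounds locally; the in_range helper is hypothetical. Hansken evaluates Range queries remotely and may compare text values differently, so this only illustrates the meaning of the keywords.

```python
import operator

# alias table taken from the keyword list above
ALIASES = {
    '>': operator.gt, 'gt': operator.gt,
    '>=': operator.ge, 'gte': operator.ge, 'min': operator.ge,
    'minvalue': operator.ge, 'min_value': operator.ge,
    '<': operator.lt, 'lt': operator.lt,
    '<=': operator.le, 'lte': operator.le, 'max': operator.le,
    'maxvalue': operator.le, 'max_value': operator.le,
}


def in_range(value, **ranges):
    # hypothetical helper: True when value satisfies every supplied bound
    return all(ALIASES[name](value, bound) for name, bound in ranges.items())


exclusive_lower = in_range(4.0, gt=4.0, max=7)   # 4.0 is not greater than 4.0
inclusive_upper = in_range(7, gt=4.0, max=7)     # 7 is at most 7
symbolic = in_range(3, **{'>=': 3})              # symbols need dict unpacking
```

Forms such as > and <= are not valid Python identifiers, which is why they can only be passed by unpacking a dict.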

as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Exists(field)

Bases: Query

Search for traces that have a particular field, for example:

Exists('email.headers.In-Reply-To')

__init__(field)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class Phrase(field_or_value, value=None, distance=0)

Bases: Query

Search for a phrase of terms, occurring within a particular distance of each other, for example:

Phrase('email.subject', 'sell you a bomb')
# will also match "sell you a bomb", not restricted to just email.subject
Phrase('sell bomb', distance=2)

__init__(field_or_value, value=None, distance=0)

Create a new Phrase query.

Parameters:
  • field_or_value – the field to search, or (when value is not supplied) the search value

  • value – value to search for (only needed when searching a specific field)

  • distance – the max number of position displacements between terms in the phrase (0 being an exact phrase match)

as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class GeoBox(field, southwest, northeast)

Bases: Query

Search for traces with location data within the bounding box between two corner points: southwest and northeast, for example:

# a location can either be a 2-tuple (…)
GeoBox('gps.latlong', (-1, -2), (3, 4))
# (…) or an ISO 6709 latlong string
GeoBox('gps.latlong', '+12.5281-070.0229', '+13.5281-080.0229')

__init__(field, southwest, northeast)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

class HQLHuman(query)

Bases: Query

Search for traces using HQL Human query syntax, for example:

HQLHuman('file.name:query.py')
HQLHuman('data.raw.size>1024')

__init__(query)
as_dict()

Turns this query into a dict as specified by the Hansken Query Language Specification.

to_query(query)

Make sure query is a Query instance by either wrapping it with an HQLHuman or returning it as is.

Parameters:

query – either a str or a Query

Returns:

a Query instance

Raises:

TypeError – when query’s type is not acceptable
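The documented behaviour amounts to a small type dispatch. The sketch below uses stand-in classes rather than the real ones and only covers the str and Query cases described above; how the real function treats other values, beyond raising TypeError, is not specified here.

```python
class Query:
    # stand-in for hansken.query.Query
    pass


class HQLHuman(Query):
    # stand-in wrapping a query string in HQL Human syntax
    def __init__(self, query):
        self.query = query


def to_query(query):
    # Query instances pass through unchanged, strings get wrapped
    if isinstance(query, Query):
        return query
    if isinstance(query, str):
        return HQLHuman(query)
    raise TypeError(f'not a valid query type: {type(query).__name__}')


wrapped = to_query('file.name:query.py')
unchanged = to_query(wrapped)
```

A helper like this lets calls accept either form of query while working with Query objects internally.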

class Sort(field, direction=None, filter=None, mode=None, value=None)

Bases: object

__init__(field, direction=None, filter=None, mode=None, value=None)

Creates a sort clause for use with a search request.

The mode parameter determines what kind of sorting should be applied:

  • value: a regular sort-by-value (the default applied by the remote);

  • exists: simply sort on whether the sort field has a value;

  • cosineSimilarity: use parameter value to sort on the cosine similarity between value and the value of the sort field;

  • manhattanDistance: similarly sort on the Manhattan distance (or L1 norm) between value and the value of the sort field;

  • euclideanDistance: similarly sort on the Euclidean distance (or L2 norm) between value and the value of the sort field;

Parameters:
  • field – the field to sort on

  • direction – the sorting direction (ascending or descending, or None to auto-determine the sort direction from other arguments)

  • filter – an optional query to restrict the tracelets included for sorting

  • mode – a sort mode (see above)

  • value – a (vector) value to use for similarity / distance calculations in applicable sorting modes (see above)

as_dict()

classmethod from_str(sort)

Creates a Sort from sort, parsing field, direction and filter.

Formats supported:

  • some.field: sort on field “some.field”, ascending

  • some.field+: sort on field “some.field”, ascending

  • some.field-: sort on field “some.field”, descending

  • some.field | query*: sort on field “some.field” within matches for query “query*”, ascending (sorting non-matches after matches)

Parameters:

sort – a sorting string to parse

Returns:

a Sort instance
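The supported formats can be parsed with a few string operations. The parse_sort function below is a hypothetical stand-in that returns a plain (field, direction, filter) tuple instead of a Sort instance, and it assumes the pipe character only serves to separate the sort clause from the filter query.

```python
def parse_sort(sort):
    # hypothetical parser returning (field, direction, filter)
    clause, _, query = sort.partition('|')
    clause, query = clause.strip(), query.strip() or None
    direction = 'ascending'
    if clause.endswith('-'):
        clause, direction = clause[:-1], 'descending'
    elif clause.endswith('+'):
        clause = clause[:-1]
    return clause, direction, query


plain = parse_sort('some.field')
reverse = parse_sort('some.field-')
filtered = parse_sort('prediction.confidence- | prediction.class:banana')
```

The last call splits the banana example from the sorting section into the field prediction.confidence, a descending direction and the filter query prediction.class:banana.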

to_sort(sort)

class Facet(field, size=100, include_total=None, scale=None, filter=None)

Bases: object

__init__(field, size=100, include_total=None, scale=None, filter=None)
as_dict()

Turns this facet into a dict as specified by the Hansken Query Language Specification.

class TermFacet(field, size=100, include_total=None, filter=None)

Bases: Facet

__init__(field, size=100, include_total=None, filter=None)

Create a new TermFacet to use with a query. A term facet can be created on any type of field, counting the occurrences of any value.

Parameters:
  • field – field to create a facet on

  • size – the max number of facet counters to return, default is 100

  • filter – only count traces matching filter

class RangeFacet(field, scale, base=None, interval=None, min=None, max=None, include_total=None, filter=None)

Bases: Facet

__init__(field, scale, base=None, interval=None, min=None, max=None, include_total=None, filter=None)

Create a new RangeFacet to use with a query. A range facet can be made on either numeric or date fields.

Parameters:
  • field – field to create a facet on

  • scale

    • year, month, day, hour, minute or second for date fields

    • linear or log for numeric fields

  • base – logarithmic base when scale is 'log'

  • interval – interval or bucket size when scale is 'linear'

  • min – minimum value to include in the facet result

  • max – maximum value to include in the facet result

  • filter – only count traces matching filter

class GeohashFacet(field, size=100, include_total=None, precision=1, southwest=None, northeast=None, filter=None)

Bases: Facet

__init__(field, size=100, include_total=None, precision=1, southwest=None, northeast=None, filter=None)

Create a new GeohashFacet to use with a query.

Parameters:
  • field – field to create a facet on

  • size – the max number of facet counters to return, default is 100

  • precision – number of characters of the returned geohashes

  • southwest – south west bound / corner point

  • northeast – north east bound / corner point

  • filter – only count traces matching filter

to_facet(facet)