hansken.query: Constructing Hansken queries
Calls like ProjectContext.search accept query arguments.
Aside from passing None for those arguments, a search query can be either a str formatted as Hansken Query Language (the same language used in user interfaces) or a Query object.
Query values
Most queries will search for traces where a particular property of a trace, like picture.width, matches a particular value, like 1024.
picture.width is clearly a numeric property.
Hansken will treat any value provided with a query for a numeric property like a number, attempting to convert it if needed.
The Term query Term('picture.width', 1024) is equivalent to the queries Term('picture.width', '1024') or Term('picture.width', 1024.0).
hansken.py will only format query values in the following two cases:
- When the value is an instance of datetime.date or datetime.datetime, the value will be converted to its ISO 8601 string representation. hansken.py will require that a timezone is set on the value; so-called naive instances will cause a ValueError to be raised.
- When the value is a list or tuple of length 2 containing only numbers, the value is assumed to be a location and will be converted to its ISO 6709 string representation.
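The two conversions listed above can be approximated with the standard library alone. This is a minimal sketch and not hansken.py's own code: the helper name format_query_value is hypothetical, date values without a time are left out, and the four-decimal precision of the location string is an assumption chosen to resemble strings like '+12.5281-070.0229'.

```python
from datetime import datetime, timezone


def format_query_value(value):
    """Hypothetical helper mimicking the two documented conversions."""
    if isinstance(value, datetime):
        # naive datetimes carry no UTC offset, reject them
        if value.tzinfo is None or value.utcoffset() is None:
            raise ValueError('datetime values need a timezone')
        return value.isoformat()
    is_pair = isinstance(value, (list, tuple)) and len(value) == 2
    if is_pair and all(isinstance(n, (int, float)) and not isinstance(n, bool) for n in value):
        latitude, longitude = value
        # assumed layout: sign, zero-padded degrees, four decimals
        return f'{latitude:+08.4f}{longitude:+09.4f}'
    # anything else is passed along untouched
    return value


print(format_query_value(datetime(2014, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
print(format_query_value((12.5281, -70.0229)))
```

A naive value such as datetime(2014, 1, 1) makes this helper raise ValueError, mirroring the behaviour described above.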
Note
Query types that accept text values (like Term and Phrase) can be made to match the ‘raw’, ‘original’ text, before tokenization.
Hansken calls this a ‘full matching’ query.
Full matching queries are only supported on metadata, so hansken.py will switch a full matching query like Term('Bomb/explos!ve', full=True) to match only metadata if no property to match is provided.
Combining queries
Combinations of queries typically concern the boolean operations and, or, not.
To make these combinations, use either the corresponding query types And, Or and Not or Python’s bitwise operators &, | and ~:
# import query objects like Range, Term and others
from hansken.query import *
# traces containing either the term 'test' or 'term'
test_term = Or(Term('test'), Term('term'))
# the following is identical
test_term = Term('test') | Term('term')
# files containing the term 'test' in either data or meta-data
test_files = And(Term('type', 'file'), Term('test'))
# the following is identical
test_files = Term('type', 'file') & Term('test')
# pictures over 8MiB in size
big_pictures = Term('type', 'picture') & Range('data.raw.size', gt=8388608)
# pictures that are not a file (e.g.: carved pictures)
pictures = And(Term('type', 'picture'), Not(Term('type', 'file')))
# the following is identical
pictures = Term('type', 'picture') & ~Term('type', 'file')
Another way of combining queries is using the Nested type, searching for traces where a particular property matches values from traces resulting from another query.
This query type can be very powerful for answering complex questions.
Consider a question like finding files from one image that also occur on another image, ignoring files that are known from the NSRL database:
from hansken.query import Nested, Term
image_a = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa'
image_b = 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'
# step 1: define a query to find files from image A
from_a = Term('type', 'file') & Term('image', image_a)
# step 2: define a query to find non-nsrl files from image B
from_b = Term('type', 'file') & Term('image', image_b) & ~Term('data.raw.hashHits', 'nsrl')
# step 3: combine the two together to find files from image A also found on image B
cross_hits = from_a & Nested('data.raw.hash.sha256', from_b)
Note
There is a limit on the number of traces resulting from the inner query (from_b in the example): if that query matches more than 100,000 traces, an error will be returned.
Additionally, due to the way Nested type queries are processed, they will tend to be slower than other query types.
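To build intuition for what a Nested query expresses, its semantics can be mimicked locally on plain dicts. This is a toy emulation under stated assumptions, not how Hansken evaluates the query: nested_match, the flat trace dicts and the 'sha256' key are all hypothetical, and only the 100,000 limit is taken from the note above.

```python
def nested_match(outer, inner, field, limit=100_000):
    """Toy emulation: keep outer traces whose field value occurs in inner."""
    if len(inner) > limit:
        # the documented behaviour is an error beyond 100,000 inner traces
        raise ValueError('inner query resulted in too many traces')
    inner_values = {trace[field] for trace in inner if field in trace}
    return [trace for trace in outer if trace.get(field) in inner_values]


# hypothetical traces, represented as flat dicts
from_a = [{'name': 'a.doc', 'sha256': 'abc'}, {'name': 'b.doc', 'sha256': 'def'}]
from_b = [{'name': 'copy.doc', 'sha256': 'abc'}]

cross_hits = nested_match(from_a, from_b, 'sha256')
print([trace['name'] for trace in cross_hits])
```

Collecting every value of the inner result before matching the outer traces hints at why such queries tend to be slower than a plain Term.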
Sorting results
Hansken allows you to sort search results the way you want them.
hansken.py defaults to sorting on traces’ unique identifier (property “uid”) to create reproducible results, but enables you to override this.
Search calls take a sort argument that can be a single clause or multiple clauses to sort on, each expressed as a str or a Sort instance:
from hansken.query import Sort
# search for traces matching query, sort them by name
results = context.search(query, sort='name')
# search for traces matching query, sort them in reverse by name (note the -), then by uid
results = context.search(query, sort=('name-', 'uid'))
# the same as above, but the first clause expressed as a Sort instance
results = context.search(query, sort=[Sort('name', direction='descending'),
                                      'uid'])
Hansken also supports ‘filtered sort’, sorting on values if the traces being sorted match a certain query. This comes in handy when searching pictures that have classification predictions. In this example, we’d want pictures with the highest chances of containing a banana first:
from hansken.query import Sort, Term
bananas = context.search(
    # the query expressed as a Term object
    query=Term('type', 'picture'),
    # sort on property "prediction.confidence"
    sort=Sort('prediction.confidence',
              # in descending order, provide the highest confidence first
              direction='descending',
              # but only those confidences that belong to the "banana" class
              filter=Term('prediction.class', 'banana'))
)
bananas = context.search(
    query=Term('type', 'picture'),
    # the filter for a sort clause is just a query, so this can be expressed through a str
    sort=Sort('prediction.confidence', direction='descending', filter='prediction.class:banana')
)
# the entire sort clause can be expressed as a str too,
# using a minus to indicate descending order and a pipe character to specify the filter
bananas = context.search(query='type:picture', sort='prediction.confidence- | prediction.class:banana')
Note
Filtered sort is not universally supported on all properties.
The primary use case for this is sorting on properties that occur multiple times for a trace (like prediction.confidence).
The Hansken remote will return an error when an attempt is made to sort (or filter) on a property that doesn’t support it.
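The effect of a filtered sort can be illustrated locally with a sort key that only looks at the values matching the filter. This sketch is an approximation: the dict layout and the name filtered_sort_key are hypothetical, and using the highest matching confidence per picture is an assumption, as the text above does not state which value the remote picks when several match.

```python
def filtered_sort_key(trace, wanted_class):
    """Sort key: matching traces first, highest matching confidence first."""
    confidences = [prediction['confidence']
                   for prediction in trace['predictions']
                   if prediction['class'] == wanted_class]
    if not confidences:
        # traces without a matching prediction sort after all matches
        return (1, 0.0)
    return (0, -max(confidences))


pictures = [
    {'name': 'apple.jpg', 'predictions': [{'class': 'apple', 'confidence': 0.99}]},
    {'name': 'fruit.jpg', 'predictions': [{'class': 'banana', 'confidence': 0.6},
                                          {'class': 'apple', 'confidence': 0.9}]},
    {'name': 'banana.jpg', 'predictions': [{'class': 'banana', 'confidence': 0.95}]},
]

ordered = sorted(pictures, key=lambda trace: filtered_sort_key(trace, 'banana'))
print([trace['name'] for trace in ordered])
```

Without the filter, the 0.99 apple confidence would put apple.jpg first, which is exactly what the filter is meant to prevent.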
When searching for traces that have a certain similarity, like pictures that look alike, Hansken can sort on the similarity / distance of the associated vectors:
from hansken.query import Sort, Term
# let's say that the one banana I found earlier was tagged as "banana"
my_banana = context.search('tags:banana').takeone()
# for this example, use the first prediction.embedding vector (assuming there is one)
banana_prediction = my_banana.get('prediction')[0]
banana_vector = banana_prediction.embedding
# take note of the model that created this embedding vector
banana_model = banana_prediction.model_name
# search for all pictures, but sort the results by similarity of the vector obtained earlier
banana_candidates = context.search('type:picture', sort=Sort(
    # sort on the same vector property
    field='prediction.embedding',
    # use cosine similarity for the actual sorting
    mode='cosineSimilarity',
    # supply the vector to calculate distances from
    value=banana_vector,
    # apply this sorting only on the applicable embeddings, those from the same model as the banana_vector
    filter=Term('prediction.modelName', banana_model),
))
Note
Vector properties can ‘mean’ a lot of things, depending on what kind of algorithm created the vectors. Face embedding vectors might not be well suited for finding pictures portraying a particular concept, for example. As such, combine this with a filtered sort to make sure the sort is applied to the right kind of vectors. Consult the trace model and a domain expert when using results sorted like this.
Additionally, the sort direction will usually default to ascending order.
When the sort mode is set to 'cosineSimilarity', however, the sort order is flipped to make sure the most similar vectors appear at the top of the result.
This behaviour is only applied when no direction is supplied (the default is None, causing Sort to automatically determine the ‘right’ order).
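Why similarity sorting wants the highest values first can be seen from the measure itself: cosine similarity is 1.0 for vectors pointing the same way and drops towards 0.0 as they become orthogonal. The following is a plain Python sketch of the measure, not the remote's implementation; the candidate names and vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


reference = [1.0, 0.0]
candidates = {
    'same direction': [2.0, 0.0],
    'diagonal': [1.0, 1.0],
    'orthogonal': [0.0, 3.0],
}

# most similar first, so the similarity is sorted in descending order
ranked = sorted(candidates,
                key=lambda name: cosine_similarity(reference, candidates[name]),
                reverse=True)
print(ranked)
```

Distance-based measures behave the other way around: the closest vectors have the smallest value, so an ascending order already puts the best candidates first.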
Facets
The Facet classes can be used to create a histogram from a query result, like an overview of data sizes of traces in the result.
To request such information, supply one or more Facet instances to calls like ProjectContext.search.
hansken.py supports three types of facets:
- TermFacet: a facet on any field, counting occurrences of values. TermFacet('type'), for example, would request a histogram of trace types like file or email;
- RangeFacet: a facet grouping numeric or date fields into ‘buckets’ of a specified size. RangeFacet('data.raw.entropy', scale='linear', interval=1), for example, would request a histogram of raw data entropy, grouped into buckets (0..1], (1..2], and so on;
- GeohashFacet: a facet requesting geohashes on a lat/long field. GeohashFacet('gps.latlong', precision=5), for example, would request a histogram of geohashes of length 5.
# search for pictures, generate a histogram of the camera types in exif data
result = context.search('type:picture', facets=TermFacet('picture.exif.camera'))
# a single facet was specified, we'll expect a single facet result on the search result
cameras = result.facets[0]
Aside from the required field parameter, RangeFacet takes a scale parameter that specifies the sizes of the buckets in the resulting histogram.
Scales like this only make sense for datetime or numeric fields.
For datetime fields, the possible values are year, month, day, hour, minute and second.
Numeric fields can be faceted using a linear or logarithmic scale, coupled to a certain interval or base respectively:
# search for files created in 2014, generate a histogram of all the days in 2014
context.search(Term('file.createdOn', '2014'),
               facets=RangeFacet('file.createdOn', scale='day'))
# search for files, generate a linear histogram of the entropy
context.search('type:file', facets=RangeFacet('data.raw.entropy', scale='linear', interval=1))
# search for everything, generate a logarithmic histogram of the raw sizes
# the histogram buckets will represent 0-1KB, 1KB-1MB, 1MB-1GB, …
context.search(facets=RangeFacet('data.raw.size', scale='log', base=10))
# multiple facets can be requested as a sequence
context.search(facets=[TermFacet('type'), RangeFacet('data.raw.size', scale='log', base=10)])
# the result of the term facet on type would be available on the search result.facets at index 0, data size at index 1
Note
There is a limit to the number of buckets any facet can have; specifying scale='second' within a date range of multiple years will likely fail.
Likewise, using a linear scale of 10 bytes for data size will likely return an error.
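A quick back-of-the-envelope count shows why these combinations are problematic. The sketch below only estimates how many fixed-size buckets a range would need; the helper linear_bucket_count is hypothetical, the actual limit enforced by Hansken is not stated here, and the 0 to 8 entropy range and 1 GiB upper size are assumptions for the sake of the example.

```python
import math
from datetime import datetime, timezone


def linear_bucket_count(minimum, maximum, interval):
    """Number of fixed-size buckets needed to cover minimum..maximum."""
    return math.ceil((maximum - minimum) / interval)


# one bucket per second over three years
start = datetime(2014, 1, 1, tzinfo=timezone.utc)
end = datetime(2017, 1, 1, tzinfo=timezone.utc)
seconds = linear_bucket_count(0, (end - start).total_seconds(), 1)

# 10-byte buckets over data sizes up to 1 GiB (assumed upper bound)
sizes = linear_bucket_count(0, 1 << 30, 10)

# entropy, assumed to range from 0 to 8, with an interval of 1
entropy = linear_bucket_count(0, 8, 1)

print(seconds, sizes, entropy)
```

The first two cases need on the order of a hundred million buckets, the last one only a handful, which is why a coarser scale or a logarithmic one is the safer choice for wide ranges.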
Classes in hansken.query
- class Query[source]
Bases: object
Base class for Hansken query types. Implementations are required to implement as_dict for transformation to wire format.
- as_dict()[source]
Turns this query into a dict as specified by the Hansken Query Language Specification.
- __and__(other)[source]
Binary and operator (&) handling, resulting in an And query. Resulting query is flattened when one or more operands are already And queries.
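The flattening behaviour described for the & operator can be sketched with a few stand-in classes. These are toy classes written for illustration only: ToyQuery, ToyTerm and ToyAnd are hypothetical names and do not reproduce hansken.py's Query, Term or And.

```python
class ToyQuery:
    """Stand-in for a query base class, not hansken.py's Query."""

    def __and__(self, other):
        # a & b results in a conjunction of both operands
        return ToyAnd(self, other)


class ToyTerm(ToyQuery):
    def __init__(self, field, value):
        self.field = field
        self.value = value


class ToyAnd(ToyQuery):
    def __init__(self, *queries):
        self.queries = []
        for query in queries:
            if isinstance(query, ToyAnd):
                # flatten: adopt the operands of a nested conjunction
                self.queries.extend(query.queries)
            else:
                self.queries.append(query)

    def __len__(self):
        return len(self.queries)

    def __iter__(self):
        return iter(self.queries)


# (a & b) & c would nest two levels deep without flattening
combined = ToyTerm('type', 'picture') & ToyTerm('type', 'file') & ToyTerm('file.name', 'query.py')
print(len(combined))
```

Flattening keeps a long chain of & operators equivalent to a single conjunction with all operands side by side, rather than a deeply nested structure.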
- class And(*queries)[source]
Bases: Sized, Iterable, Query
Boolean conjunction of multiple queries; traces should match all contained queries, for example:
And(Term('file.name', 'query.py'), Range('data.raw.size', min=512))
- class Or(*queries)[source]
Bases: Sized, Iterable, Query
Boolean disjunction of multiple queries; traces should match any contained query, for example:
Or(Term('file.name', 'query.py'), Range('data.raw.size', max=1024))
- class Not(query)[source]
Bases: Query
Negates a single query, for example:
Not(Term('file.name', 'query.py'))
- class Nested(field, query)[source]
Bases: Query
Query a field for values matching the results of another query, for example:
Nested('data.raw.hash.md5', Term('file.name', 'query.py'))
- class Tracelet(tracelet_type, query=None)[source]
Bases: Query
Restrict a query for a tracelet type to the same tracelet instance of that tracelet type.
# find traces containing an entity
Tracelet('entity')
# find traces containing an entity that has both:
# - a value starting with "http://"
# - a confidence of at least 0.9
Tracelet('entity', Term('entity.value', 'http://*', full=True) & Range('entity.confidence', min=0.9))
Note that without the Tracelet query, the Term and Range queries above could match different entities, ultimately matching traces that contain any entity with a value starting with http:// and any entity with a confidence of at least 0.9 (not necessarily the same entity).
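The difference between the two behaviours can be made concrete with plain Python over a hypothetical trace. This sketch only illustrates the matching logic described above; the dict layout and both function names are made up and say nothing about how Hansken stores or evaluates tracelets.

```python
def matches_any_entity(trace):
    """No same-instance restriction: conditions may be met by different entities."""
    entities = trace['entities']
    return (any(entity['value'].startswith('http://') for entity in entities)
            and any(entity['confidence'] >= 0.9 for entity in entities))


def matches_same_entity(trace):
    """Same-instance restriction: a single entity must meet both conditions."""
    return any(entity['value'].startswith('http://') and entity['confidence'] >= 0.9
               for entity in trace['entities'])


# hypothetical trace: a low-confidence URL and a high-confidence non-URL
trace = {'entities': [
    {'value': 'http://example.com', 'confidence': 0.4},
    {'value': 'NL00BANK0123456789', 'confidence': 0.95},
]}

print(matches_any_entity(trace), matches_same_entity(trace))
```

The trace satisfies the looser check but not the stricter one, which is the situation a Tracelet query is meant to rule out.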
- class Trace(query)[source]
Bases: Query
Restrict a tracelet query to tracelets belonging to traces matching the inner query.
# match entities of type iban, but only if the trace they belong to is from a specific image
Term('entity.type', 'iban') & Trace(Term('image', '1234-abcd'))
- class Term(field_or_value, value=None, full=False)[source]
Bases: Query
Query for the value of a single field, for example:
# search for files with name "query.py"
Term('file.name', 'query.py')
# search for occurrences of the term "query" (in either data or metadata)
Term('query')
- __init__(field_or_value, value=None, full=False)[source]
Create a new Term query.
- Parameters:
field_or_value – the field to search, or (when value is not supplied) the search value
value – value to search for (only needed when searching a specific field)
full – search the untokenized variant of any string, see full matches
- class Regex(field_or_pattern, pattern=None, full=False)[source]
Bases: Query
Query a field for string-values matching a regular expression, for example:
# search for replies or forwards
Regex('email.subject', '(re|fw): .*', full=True)
# search for bombs, or some curious misspellings
Regex(re.compile(r'bo[mn]+bs'))
Either a str or a re.Pattern object is accepted, of which only the pattern property is used.
Note
Regular expressions always match entire terms or (in case of full=True) properties, as if the regular expression was anchored at both ends, see full matches.
Not every feature supported by Python’s re module (like particular character classes (\s / \w), start/end anchors (^ / $), look ahead/behind or non-greedy quantifiers (?? / *?)) will be supported by Hansken. The use of these is not validated by hansken.py, but will result in errors when submitted.
Regular expression queries are always case insensitive and ignore diacritics in values.
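To preview locally how an implicitly anchored, case insensitive pattern behaves, Python's own re module can serve as a rough stand-in. This is an approximation only: Hansken uses a different regular expression engine, the helper names are hypothetical, and stripping combining characters after NFKD normalization is just one way to imitate ignoring diacritics.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Drop combining marks, so that 'ó' compares equal to 'o'."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(char for char in decomposed if not unicodedata.combining(char))


def matches_like_remote(pattern, value):
    """Approximation: whole-value match, case insensitive, diacritics ignored."""
    return re.fullmatch(pattern, strip_diacritics(value), re.IGNORECASE) is not None


print(matches_like_remote(r'bo[mn]+bs', 'BOMBS'))
print(matches_like_remote(r'bo[mn]+bs', 'bómbs'))
# anchored at both ends: a longer term does not match…
print(matches_like_remote(r'bo[mn]+bs', 'firebombs'))
# …unless the pattern itself allows for the prefix
print(matches_like_remote(r'.*bo[mn]+bs', 'firebombs'))
```

Because the match is anchored, a pattern meant to find a fragment inside a longer term needs an explicit .* on the relevant side.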
- __init__(field_or_pattern, pattern=None, full=False)[source]
Create a new Regex query.
- Parameters:
field_or_pattern – the field to match, or (when pattern is not supplied) the search pattern
pattern – pattern to match, either a str or re.Pattern
full – match the untokenized variant of the value, see full matches
- class Range(field, **ranges)[source]
Bases: Query
Query a field for values in a particular range, for example:
# search for traces with entropy between 4.0 (exclusive) and 7.0 (inclusive)
Range('data.raw.entropy', gt=4.0, max=7)
# search for traces no larger than 1MiB (1 << 20 == 2 ** 20 == 1048576 bytes)
Range('data.raw.size', max=1 << 20)
# search for traces with peculiar names (matches file name aab.txt, but not ccb.txt)
Range('file.name', min='aa', max='cc')
- __init__(field, **ranges)[source]
Create a new Range query.
- Parameters:
field – the field to query for
ranges – keyword arguments of the following forms:
- >, gt: value should be greater than supplied value;
- >=, gte, min, minvalue, min_value: value should be greater than or equal to supplied value;
- <, lt: value should be less than supplied value;
- <=, lte, max, maxvalue, max_value: value should be less than or equal to supplied value;
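The aliases listed above boil down to four distinct bounds. The sketch below shows one way such keyword arguments could be normalized; it is not taken from hansken.py, and both the ALIASES table and the choice of gt, gte, lt and lte as canonical names are assumptions made for this example.

```python
# every documented spelling, mapped to an assumed canonical name
ALIASES = {
    '>': 'gt', 'gt': 'gt',
    '>=': 'gte', 'gte': 'gte', 'min': 'gte', 'minvalue': 'gte', 'min_value': 'gte',
    '<': 'lt', 'lt': 'lt',
    '<=': 'lte', 'lte': 'lte', 'max': 'lte', 'maxvalue': 'lte', 'max_value': 'lte',
}


def normalize_ranges(**ranges):
    """Translate any accepted spelling of a bound to its canonical name."""
    return {ALIASES[name]: value for name, value in ranges.items()}


print(normalize_ranges(gt=4.0, max=7))
# symbolic forms are not valid identifiers, they need dict unpacking
print(normalize_ranges(**{'>=': 'aa', '<=': 'cc'}))
```

Note that the symbolic spellings can only be supplied through ** unpacking of a dict, since Python does not accept them as literal keyword arguments.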
- class Exists(field)[source]
Bases: Query
Search for traces that have a particular field, for example:
Exists('email.headers.In-Reply-To')
- class Phrase(field_or_value, value=None, distance=0)[source]
Bases: Query
Search for a phrase of terms, occurring within a particular distance of each other, for example:
Phrase('email.subject', 'sell you a bomb')
# will also match "sell you a bomb", not restricted to just email.subject
Phrase('sell bomb', distance=2)
- __init__(field_or_value, value=None, distance=0)[source]
Create a new Phrase query.
- Parameters:
field_or_value – the field to search, or (when value is not supplied) the search value
value – value to search for (only needed when searching a specific field)
distance – the max number of position displacements between terms in the phrase (0 being an exact phrase match)
- class GeoBox(field, southwest, northeast)[source]
Bases: Query
Search for traces with location data within the bounding box between two corner points: southwest and northeast, for example:
# a location can either be a 2-tuple (…)
GeoBox('gps.latlong', (-1, -2), (3, 4))
# (…) or an ISO 6709 latlong string
GeoBox('gps.latlong', '+12.5281-070.0229', '+13.5281-080.0229')
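What it means for a location to fall inside such a box can be sketched as a pair of interval checks on (latitude, longitude) tuples. This is a simplified illustration: the function in_box is hypothetical, boxes crossing the antimeridian are ignored, and treating the box edges as inclusive is an assumption.

```python
def in_box(point, southwest, northeast):
    """Check whether a (latitude, longitude) point lies within the box."""
    latitude, longitude = point
    return (southwest[0] <= latitude <= northeast[0]
            and southwest[1] <= longitude <= northeast[1])


# the corners of the first example above
southwest, northeast = (-1, -2), (3, 4)

print(in_box((1.5, 0.0), southwest, northeast))
print(in_box((5.0, 0.0), southwest, northeast))
```

The first point lies between both pairs of bounds, the second is too far north.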
- class HQLHuman(query)[source]
Bases: Query
Search for traces using HQL Human query syntax, for example:
HQLHuman('file.name:query.py')
HQLHuman('data.raw.size>1024')
- to_query(query)[source]
Make sure query is a Query instance by either wrapping it with an HQLHuman or returning it as is.
- class Sort(field, direction=None, filter=None, mode=None, value=None)[source]
Bases: object
- __init__(field, direction=None, filter=None, mode=None, value=None)[source]
Creates a sort clause for use with a search request.
The mode parameter determines what kind of sorting should be applied:
- value: a regular sort-by-value (the default applied by the remote);
- exists: simply sort on whether the sort field has a value;
- cosineSimilarity: use parameter value to sort on the cosine similarity between value and the value of the sort field;
- manhattanDistance: similarly sort on the manhattan distance (or L1 norm) between value and the value of the sort field;
- euclideanDistance: similarly sort on the euclidean distance (or L2 norm) between value and the value of the sort field;
- Parameters:
field – the field to sort on
direction – the sorting direction (ascending or descending, or None to auto-determine the sort direction from other arguments)
filter – an optional query to restrict the tracelets included for sorting
mode – a sort mode (see above)
value – a (vector) value to use for similarity / distance calculations in applicable sorting modes (see above)
- classmethod from_str(sort)[source]
Creates a Sort from sort, parsing field, direction and filter.
Formats supported:
- some.field: sort on field “some.field”, ascending
- some.field+: sort on field “some.field”, ascending
- some.field-: sort on field “some.field”, descending
- some.field | query*: sort on field “some.field” within matches for query “query*”, ascending (sorting non-matches after matches)
- Parameters:
sort – a sorting string to parse
- Returns:
a Sort instance
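The supported formats are simple enough to parse by hand, which can help when validating user input before handing it to a search call. The function below is a toy parser written for this purpose, not the code behind from_str; parse_sort and its tuple return value are hypothetical.

```python
def parse_sort(sort):
    """Toy parser returning (field, direction, filter) for the formats above."""
    # anything after the first pipe is the filter query
    clause, _, query = sort.partition('|')
    clause = clause.strip()
    query = query.strip() or None
    direction = 'ascending'
    if clause.endswith('-'):
        clause, direction = clause[:-1], 'descending'
    elif clause.endswith('+'):
        clause = clause[:-1]
    return clause, direction, query


print(parse_sort('some.field'))
print(parse_sort('name-'))
print(parse_sort('prediction.confidence- | prediction.class:banana'))
```

Splitting on the first pipe only means a filter query that itself contains a pipe character stays intact.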
- class TermFacet(field, size=100, include_total=None, filter=None)[source]
Bases: Facet
- __init__(field, size=100, include_total=None, filter=None)[source]
Create a new TermFacet to use with a query. A term facet can be created on any type of field, counting the occurrences of any value.
- Parameters:
field – field to create a facet on
size – the max number of facet counters to return, default is 100
filter – only count traces matching filter
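Conceptually, a term facet is a value count with a cap on the number of counters returned. The sketch below reproduces that idea on a list of dicts with collections.Counter; term_histogram is a hypothetical name, and keeping the most frequent values when size is exceeded is an assumption about which counters survive the cap.

```python
from collections import Counter


def term_histogram(traces, field, size=100):
    """Count occurrences of each value of field, keeping at most size counters."""
    counts = Counter(trace[field] for trace in traces if field in trace)
    return counts.most_common(size)


# hypothetical traces, one of which lacks the faceted field
traces = [{'type': 'file'}, {'type': 'email'}, {'type': 'file'}, {'name': 'untyped'}]

print(term_histogram(traces, 'type'))
print(term_histogram(traces, 'type', size=1))
```

Traces that lack the field simply do not contribute to any counter.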
- class RangeFacet(field, scale, base=None, interval=None, min=None, max=None, include_total=None, filter=None)[source]
Bases: Facet
- __init__(field, scale, base=None, interval=None, min=None, max=None, include_total=None, filter=None)[source]
Create a new RangeFacet to use with a query. A range facet can be made on either numeric or date fields.
- Parameters:
field – field to create a facet on
scale – year, month, day, hour, minute or second for date fields; linear or log for numeric fields
base – logarithmic base when scale is 'log'
interval – interval or bucket size when scale is 'linear'
min – minimum value to include in the facet result
max – maximum value to include in the facet result
filter – only count traces matching filter
- class GeohashFacet(field, size=100, include_total=None, precision=1, southwest=None, northeast=None, filter=None)[source]
Bases: Facet
- __init__(field, size=100, include_total=None, precision=1, southwest=None, northeast=None, filter=None)[source]
Create a new GeohashFacet to use with a query.
- Parameters:
field – field to create a facet on
size – the max number of facet counters to return, default is 100
precision – number of characters of the returned geohashes
southwest – south west bound / corner point
northeast – north east bound / corner point
filter – only count traces matching filter
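The precision parameter is easier to reason about with the geohash encoding itself in mind: every extra character narrows the cell a location falls into. The following is a compact sketch of the standard geohash algorithm in plain Python, included for illustration; it is not part of hansken.py, and the function name geohash is chosen for this example.

```python
BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'


def geohash(latitude, longitude, precision):
    """Encode a location as a geohash of precision characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        bounds, value = (lon_range, longitude) if use_longitude else (lat_range, latitude)
        middle = (bounds[0] + bounds[1]) / 2
        if value >= middle:
            bits = (bits << 1) | 1
            bounds[0] = middle  # continue in the upper half
        else:
            bits = bits << 1
            bounds[1] = middle  # continue in the lower half
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:
            # every 5 bits form one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return ''.join(chars)


print(geohash(57.64911, 10.40744, 5))
print(geohash(57.64911, 10.40744, 3))
```

A shorter geohash is always a prefix of a longer one for the same location, so a lower precision groups nearby locations into larger cells and yields fewer histogram buckets.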