Recipes

Recipes are convenience modules that implement common use cases for hansken.py.

`hansken.recipes.export` — Export data and metadata

to_csv(traces, output, fields, to_dict=<function get_fields>, delimiter='\t', lineterminator='\n', encoding='utf-8', **fmtparams)

Writes values for fields from each trace in traces to output. Field names can be supplied as property names (e.g. 'file.createdOn') or as type_fields instances that automatically expand to properties defined for the specified types. Data can be included by using data_stream instances.

Note

Using type_fields instances requires that the traces argument carries the applicable trace model as attribute model. Using data_stream instances requires that the Trace objects in the traces argument carry a ProjectContext object. This is the case for SearchResult instances obtained from calls like ProjectContext.search:

# obtain search results as usual
results = context.search('query')
# export the results to a local CSV file
to_csv(results, 'path/to/export.csv',
       # explicit fields and automatically expanded fields can be mixed
       fields=['uid', 'name',
               # include all model-defined metadata fields for email traces
               type_fields('email'),
               # include the first kilobyte of data from the plain data stream (converted to text)
               data_stream('plain', max_size=1024)])

The exported file will have the explicitly provided fields like uid and name, but also fields from property names generated from the trace model retrieved from the ProjectContext, like email.subject and email.to. The data_stream instance will add a data.plain field.

Parameters:

traces – collection of traces
output – name of the file to write to
fields – fields to retrieve values for, a sequence of property names (str) or type_fields instances, used to generate field and property names from a trace type
to_dict – callable to create a dict {field: value}; passed kwargs: trace, fields
delimiter – field delimiter in the output
lineterminator – line terminator in the output
encoding – text encoding for the output
fmtparams – additional format parameters, see module csv in the standard lib
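
As a minimal sketch of the to_dict hook, assuming only what the parameter list states (the callable receives trace and fields as keyword arguments and returns a {field: value} dict). The function name upper_names and the dict used as a stand-in trace are hypothetical, and csv.DictWriter is used here only to show what the default delimiter and lineterminator produce; it is not a claim about how to_csv is implemented.

```python
import csv
import io


def upper_names(trace, fields, **kwargs):
    """Hypothetical to_dict callable: map each field to a value, upper-casing 'name'."""
    row = {field: trace.get(field) for field in fields}
    if row.get('name') is not None:
        row['name'] = row['name'].upper()
    return row


# stand-in for a Trace: anything with a dict-like get() is enough for this sketch
trace = {'uid': 'abc:0-1', 'name': 'report.docx'}
fields = ['uid', 'name']

# write one header line and one row using the defaults documented for to_csv
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, delimiter='\t', lineterminator='\n')
writer.writeheader()
writer.writerow(upper_names(trace=trace, fields=fields))
exported = buffer.getvalue()
```

With hansken.py available, such a callable would be passed as to_csv(..., to_dict=upper_names).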

class type_fields(*type_names)

Bases: object

An object to request that all fields of the requested types should be used as field names when exporting to CSV format. Multiple type names can be provided:

to_csv(..., fields=type_fields('file', 'link'))
# supplying two types at once is equivalent to supplying two separate type_fields
to_csv(..., fields=[type_fields('file'), type_fields('link')])

Both forms would result in headers like file.createdOn and link.target in the resulting export file.
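
The expansion described above can be pictured roughly as follows. This is an illustration only: the model dict is a hypothetical, heavily simplified stand-in for a trace model, tuples of type names stand in for type_fields instances, and expand is not part of hansken.py.

```python
# hypothetical, simplified stand-in for a trace model: type name -> property names
model = {
    'file': ['createdOn', 'name'],
    'link': ['target'],
}


def expand(fields, model):
    """Expand tuples of type names into dotted property names; pass str entries through."""
    names = []
    for field in fields:
        if isinstance(field, str):
            names.append(field)
        else:
            # a tuple of type names stands in for a type_fields instance here
            for type_name in field:
                names.extend('{}.{}'.format(type_name, prop) for prop in model[type_name])
    return names


# one entry with two types and two entries with one type each give the same headers
headers = expand(['uid', ('file', 'link')], model)
separate = expand(['uid', ('file',), ('link',)], model)
```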

class data_stream(data_type, max_size=4096, fallback_encoding=None)

Bases: object

An object to request text data for traces to be exported. Like type_fields, this can be mixed with regular metadata fields:

to_csv(..., fields=['data.raw.size', data_stream('raw', max_size=1024)])
to_csv(..., fields=data_stream('text', max_size=None))

A fallback text encoding can be provided for data stream exports that have no explicit or known text encoding:

to_csv(..., fields=data_stream('plain', fallback_encoding='ascii'))

Encoding errors will result in replacement characters (?’s) in the output CSV. The fallback encoding is unset by default, in which case data streams without an explicit or known text encoding yield no data in the output CSV.
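
The interaction between max_size, a fallback encoding and replacement characters can be sketched with the standard library alone. The helper decode_preview below is hypothetical and assumes that decoding uses Python's errors='replace' handling; the actual behaviour of data_stream may differ in detail.

```python
def decode_preview(data, max_size=4096, encoding=None, fallback_encoding=None, default=None):
    """Turn (a part of) a bytes value into a str, or return default when no encoding is known."""
    chunk = data if max_size is None else data[:max_size]
    codec = encoding or fallback_encoding
    if codec is None:
        # no explicit and no fallback encoding: nothing to export
        return default
    # errors='replace' substitutes U+FFFD for bytes that cannot be decoded
    return chunk.decode(codec, errors='replace')


raw = 'caf\u00e9 au lait'.encode('utf-8')
# the first five bytes include a two-byte UTF-8 sequence that ASCII cannot decode
as_ascii = decode_preview(raw, max_size=5, fallback_encoding='ascii')
no_codec = decode_preview(raw)
```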

For custom functionality (like including binary data encoded as hex or base64), this class can be extended. to_text is called by the exporter (get_fields when to_csv is used) to get a str from a Trace being exported.

Note

The maximum number of bytes that are read from the stream defaults to 4KiB. While it’s possible to include an entire data stream in the export, please note that data streams can grow quite large; use this with caution.

Note

As exporting data requires an additional HTTP request for each data stream of each Trace being exported, including data in an export slows down the export considerably.

to_text(trace, default=None)

Get a str from a Trace. Uses this data_stream’s data_type and max_size to retrieve the requested data and turns it into a str if possible.

Parameters:

trace – the Trace being exported
default – the value to be used when retrieving the data or turning it into a str fails

Returns:

a str representation of (a part of) the requested data stream

get_fields(trace, fields, prefix='system.extracted.', use_fallback=True, default=None)

Retrieves values for fields from trace by calling trace.get(prefix + field) for each field.

Parameters:

trace – collection of mapped values
fields – fields to retrieve a value for
prefix – prefix used with get
use_fallback – whether to try getting a field without prefix when a value for the full field name is not available
default – value to use when no value was mapped

Returns:

a dictionary mapping each of the requested fields to its value in trace, or to default (None unless specified otherwise) if trace has no value for the field

Return type:

dict
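
The lookup order described by the parameters above can be sketched as follows. This is a re-implementation for illustration under the documented semantics, not the library's code; get_fields_sketch and the dict used as a stand-in trace are hypothetical.

```python
def get_fields_sketch(trace, fields, prefix='system.extracted.', use_fallback=True, default=None):
    """Look up prefix + field first, then the bare field name, then fall back to default."""
    values = {}
    for field in fields:
        value = trace.get(prefix + field)
        if value is None and use_fallback:
            # retry without the prefix, e.g. for top-level properties like uid
            value = trace.get(field)
        values[field] = default if value is None else value
    return values


# stand-in for a Trace: anything with a dict-like get() works for this sketch
trace = {'system.extracted.file.name': 'a.txt', 'uid': 'x:1'}
with_fallback = get_fields_sketch(trace, ['file.name', 'uid', 'file.size'], default='')
without_fallback = get_fields_sketch(trace, ['file.name', 'uid'], use_fallback=False, default='')
```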

bulk(traces, dest, split=1000, stream='raw', fname=<function safe_name>, write=<function to_file>, on_error=None, side_effect=None, jobs=16)

Performs a bulk export of traces to dest.

Note

bulk is internally parallelized by default, requiring that the argument to write is thread-safe. safe_name, on_error and side_effect are all called from the calling thread after the export of a particular trace in traces was processed.

As on_error is not called from the except clause that catches the Exception instance, logging the exception with its traceback requires special care to pass the exc_info keyword to either logging or logbook. Leaving on_error as None will raise a ValueError on the thread calling bulk with the error that is processed first as its cause.

This also means that the order of traces with which write, on_error and side_effect are called need not be the same order as that of traces. To turn this parallelism off, pass jobs=False.
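
The threading behaviour described in this note can be illustrated with concurrent.futures from the standard library. The sketch below is not how bulk is implemented; export_all, write and the string items are hypothetical stand-ins. It only shows the general pattern: write runs on worker threads, results are handled on the calling thread in completion order, and a missing on_error turns the first processed error into a ValueError with that error as its cause.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def export_all(items, write, on_error=None, jobs=4):
    """Run write(item) on worker threads; handle results on the calling thread."""
    done = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(write, item): item for item in items}
        # as_completed yields in completion order, not submission order
        for future in as_completed(futures):
            item = futures[future]
            error = future.exception()
            if error is None:
                done.append(item)
            elif on_error is not None:
                # called outside of any except clause, on the calling thread
                on_error(item, error)
            else:
                raise ValueError('export of {!r} failed'.format(item)) from error
    return done


def write(item):
    """Stand-in for a thread-safe write callable; fails for one item."""
    if item == 'b':
        raise RuntimeError('cannot write ' + item)


errors = []
done = export_all(['a', 'b', 'c'], write,
                  on_error=lambda item, error: errors.append((item, error)))
```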

Parameters:

traces – collection of traces to export
dest – path to export traces to
split – max number of files per directory (when set to None, all files will be saved to the same directory)
stream – stream name to read from the traces, optionally supplied as a callable returning the stream name (trace will be omitted from export if the return value is falsy); passed kwargs: trace
fname – callable to generate a file name for a trace; passed kwargs: trace, num, split, stream (defaults to safe_name, resulting in a file name that uses both trace.image_id and trace.id, ensuring the name is unique within a project)
write – thread-safe callable to write a trace to a file name; passed kwargs: trace, output, stream (defaults to to_file)
on_error – callable to report an error thrown during write(); passed kwargs: num, trace, stream, output, exception
side_effect – callable to perform a side effect for each exported trace; passed kwargs: trace, stream, num, split, dest, folder, file, output
jobs – maximum number of data exports to run in parallel (an int), or False to turn parallel processing of traces off

Raises:

ValueError – on the first error result when on_error is not supplied (the error is set as the cause)
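
A possible on_error callable that keeps the traceback, as a sketch using the standard logging module. The logger name bulk_export and the function name log_error are hypothetical; the keyword arguments are the ones listed for on_error above. Since Python 3.5, logging accepts an exception instance for exc_info, which is what makes logging outside of an except clause work here.

```python
import logging

log = logging.getLogger('bulk_export')


def log_error(num, trace, stream, output, exception, **kwargs):
    """Log a failed export, passing the exception explicitly to include its traceback."""
    log.error('export #%s of stream %r to %s failed', num, stream, output,
              exc_info=exception)
```

With hansken.py available, this would be passed as bulk(..., on_error=log_error).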

to_file(trace, output, stream='raw', offset=0, size=None, key=<auto-fetch>, bufsize=1048576)

Writes a data stream of a trace to a file.

Parameters:

trace – trace to write
output – name of the file to write to
stream – named stream to read from the trace
offset – byte offset to start the stream on
size – the number of bytes to make available
key – key for the image of trace (default is to fetch the key automatically, if it’s available)
bufsize – buffer size to be used during the read/write loop
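
The roles of offset, size and bufsize in a buffered read/write loop can be sketched as follows. copy_stream is a hypothetical helper operating on in-memory file objects, not the implementation of to_file, which reads from a trace's data stream instead.

```python
import io


def copy_stream(source, target, offset=0, size=None, bufsize=1048576):
    """Copy up to size bytes from source to target, starting at offset, bufsize bytes at a time."""
    source.seek(offset)
    remaining = size
    written = 0
    while remaining is None or remaining > 0:
        # never request more than what is left of the requested size
        want = bufsize if remaining is None else min(bufsize, remaining)
        chunk = source.read(want)
        if not chunk:
            break
        target.write(chunk)
        written += len(chunk)
        if remaining is not None:
            remaining -= len(chunk)
    return written


source = io.BytesIO(b'0123456789')
target = io.BytesIO()
copied = copy_stream(source, target, offset=2, size=5, bufsize=2)
```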

safe_name(trace, num=None, split=1000, stream='raw', template='{trace.image_id}_{trace.id}_{stream}_{trace.name}')

Generate a file name for a trace. The resulting file name can contain unicode characters, but no slashes, backslashes or line endings.

Parameters:

trace – trace to name
num – the number of this trace within the set being exported
split – the max number of files in a directory
stream – the named stream to be exported
template – format string used as the file name, slashes and newlines are replaced by underscores in the result; passed kwargs: trace, num, split, stream

Returns:

generated file name
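
A custom fname callable for bulk could follow the same idea. The sketch below assumes only the keyword arguments documented for fname (trace, num, split, stream); numbered_name and the SimpleNamespace stand-in for a trace are hypothetical, and the sanitizing step mirrors the description above rather than the actual code of safe_name.

```python
import re
from types import SimpleNamespace


def numbered_name(trace, num=None, split=1000, stream='raw', **kwargs):
    """Name a file after its position in the export, the stream and the trace name."""
    name = '{num:06d}_{stream}_{trace.name}'.format(trace=trace, num=num or 0, stream=stream)
    # replace slashes, backslashes and line endings with underscores
    return re.sub(r'[/\\\r\n]', '_', name)


# stand-in for a Trace with only the attribute this sketch uses
example = numbered_name(SimpleNamespace(name='a/b\\c\n.txt'), num=7)
```

Note that a name built from num is only unique within a single export, whereas the default template uses trace.image_id and trace.id to stay unique within a project.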

`hansken.recipes.report` — Generate reports from Hansken

The report recipe is split into two parts: a set of templates with macros and a number of utility functions to render templates into content or write them to files. Templates mentioned here use the Jinja2 templating language and accompanying Python modules. A basic template in Jinja2 looks something like this:

{% extends 'hansken/base.html' %}

{% block extra_styles %}
    <style type="text/css">
        p.special {
            color: red;
        }
    </style>
{% endblock %}

{% block content %}
    <p class="special">
        Lorem ipsum dolor sit amet. <br />
        {{ template_variables }} are included as such. <br />
        The result of macros can be included easily: {{ hansken.some_macro(argument) }}. <br />
    </p>
{% endblock %}

{% block postamble %}
    So, in conclusion, it turns out that this templating stuff isn't hard.
{% endblock %}

The example above extends a ‘base template’ provided with hansken.py (covered in more detail below), which contains an HTML document skeleton and provides a number of ‘blocks’ to be filled by extensions of the template. The named blocks the skeleton defines (like extra_styles and content) and their roles are listed below. Template variables (also called arguments) are printed by surrounding them with double curly brackets. Likewise, macros are called like functions inside double curly brackets. Jinja2 can do a lot more than the simple example above. For additional information, see the Jinja2 documentation. In particular, the “Template Designer Documentation” section covers the templating side of Jinja2.

Utility functions that create a PDF version of a report use HTML content to render the PDF document with WeasyPrint. See the WeasyPrint documentation for notes on supported features, caveats and additional information on the use of particular parameters.

Note

Template definitions below are presented as classes and methods (this will likely change in the future).

hansken/base.html

Base template for reports generated by the report recipe. hansken/base.html provides an HTML document skeleton with a basic style sheet and defines named blocks available for overrides in extensions. See the “Template Inheritance” section in the Jinja2 docs for a more detailed explanation on how this is used. As this template is intended as a base for other templates, it imports the hansken/macros.html template by default, exposing its macros on a namespace hansken. This allows any extension of hansken/base.html to call the macros defined in hansken/macros.html as hansken.macro_name().

hansken/base.html defines the following blocks:

title: A block inside the <title> element, providing a document title. Defaults to {{ title }}, allowing a document title to be provided as a template variable as well.
extra_styles: An empty block inside the <head> element, intended for user-provided <style> elements. This block is itself located in a block named styles, which contains a provided style sheet. Override the styles block to discard the style sheet provided by hansken.py.
extra_scripts: An empty block inside the <head> element, intended for user-provided <script> elements. This block is itself located in a block named scripts.
preamble: An empty block intended as a place to put introductory content. Defaults to {{ preamble }}, allowing a preamble to be provided as a template variable as well. Note that
content: An empty block intended as the main content of a document. Defaults to {{ content }}, allowing content to be provided as a template variable as well.
postamble: An empty block intended as a place to put concluding content. Defaults to {{ postamble }}, allowing a postamble to be provided as a template variable as well.

hansken/macros.html

A template containing only macros. hansken/base.html imports these macros by default as the hansken namespace.

traces_table(traces, fields)

Parameters:

traces – a collection of Trace objects, typically a SearchResult
fields – sequence of table columns filled with values from the Trace objects passed to traces

hansken/table.html: A convenience template extending from hansken/base.html, overriding the content block with a call to the trace_table macro. Parameters to this template are identical to the trace_table macro.

default_environment: Default jinja2.Environment used by the render_* functions in this recipe. This environment is able to load the templates listed above as hansken/template.html.

template_path: Path to the template directory bundled with hansken.py.

environment_with(searchpath=None, loader=None, **kwargs)

Create an Environment loaded with hansken.py’s provided templates, while adding the provided search path or loader as a template source that precedes the provided templates. The resulting Environment has auto-escaping enabled unless autoescape is explicitly set to False in kwargs.

Parameters:

searchpath – str or sequence of str, paths containing templates
loader – a Jinja2 template loader to use in conjunction with hansken.py’s default_loader
kwargs – arguments passed to Environment, see the Jinja2 documentation

Returns:

an Environment, loading templates from both searchpath and template_path
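
The idea of a custom template source that precedes the bundled templates can be sketched with plain Jinja2. This is an illustration under assumptions, not the implementation of environment_with: both DictLoader instances and their template contents are hypothetical stand-ins (the real hansken/base.html is considerably richer), and ChoiceLoader is used because it tries loaders in the order given.

```python
from jinja2 import ChoiceLoader, DictLoader, Environment

# stand-in for the templates bundled with hansken.py (hypothetical content)
bundled = DictLoader({
    'hansken/base.html': '<h1>{{ title }}</h1>{% block content %}{% endblock %}',
})
# stand-in for a user-provided template source
custom = DictLoader({
    'report.html': "{% extends 'hansken/base.html' %}"
                   "{% block content %}<p>{{ note }}</p>{% endblock %}",
})

# the custom loader is listed first, so its templates take precedence
env = Environment(loader=ChoiceLoader([custom, bundled]), autoescape=True)
html = env.get_template('report.html').render(title='Case 1', note='a < b')
```

Because auto-escaping is enabled, the < in the note variable ends up as &lt; in the rendered HTML.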

render_template(template_name, environment=<hansken.py default environment>, **kwargs)

Render a named template.

Parameters:

template_name – the name of the template to be rendered (e.g. 'hansken/table.html')
environment – the Environment to be used, defaults to the template environment defined by hansken.py
kwargs – named arguments to pass to the template

Returns:

a str containing the rendered template

render_string(string, environment=<hansken.py default environment>, **kwargs)

Render an anonymous template, provided as a str.

Parameters:

string – the template content to be rendered
environment – the Environment to be used, defaults to the template environment defined by hansken.py
kwargs – named arguments to pass to the template

Returns:

a str containing the rendered template

to_html_table(output, traces, fields)

Render traces into an HTML table and save to output.

Parameters:

output – name of the file to write to
traces – collection of Trace objects to render (typically a SearchResult)
fields – fields to retrieve values for, a sequence of property names (str)

to_pdf(output, content, base_url='.', **pdf_options)

Save HTML content as PDF to output.

Parameters:

output – name of the file to write to
content – HTML content to write to PDF
base_url – base url for resolving linked resources in the template (typically only useful when providing custom style sheets, see WeasyPrint documentation)
pdf_options – keyword arguments passed verbatim to HTML.write_pdf (see WeasyPrint documentation)

to_pdf_table(output, traces, fields, base_url='.', **pdf_options)

Render traces into a table and save as PDF to output.

Parameters:

output – name of the file to write to
traces – collection of Trace objects to render (typically a SearchResult)
fields – fields to retrieve values for, a sequence of property names (str)
base_url – base url for resolving linked resources in the template (typically only useful when providing custom style sheets, see WeasyPrint documentation)
pdf_options – keyword arguments passed verbatim to HTML.write_pdf (see WeasyPrint documentation)
