The BDQC framework requires only Python 3.3.2 or later to run.
Installation additionally requires the GCC toolchain, because some of the
framework's components are C code that must be compiled.
After extracting the archive...

.. code-block:: shell

    python3 setup.py install

...installs the framework, after which...

.. code-block:: shell

    python3 -m bdqc.scan <directory>

...will analyze all files in <directory>, and

.. code-block:: shell

    python3 -m bdqc.scan --help

...provides further help.
The contents of the online help are not repeated in this document.
_`Overview`
###########
_`What is it?`
==============
BDQC is a Python3_ software framework and executable module.
Although it provides built-in capabilities that make it useful "out of the
box", being a "framework" means that users (knowledgeable in Python
programming) can extend its capabilities, and it is designed to
be so extended.
_`What is it for?`
==================
BDQC identifies anomalous files among large collections of files which are
*a priori* assumed to be "similar."
It is intended to:
1. validate primary input data.
2. validate output (or intermediate stages of) data processing pipelines.
3. discover potentially "interesting" outliers.
These use cases merely highlight different sources of anomalies in data.
In the first, anomalies might be due to faulty data handling or acquisition
(e.g. sloppy manual procedures or faulty sensors). In the second, anomalies
might appear in data due to bugs in pipeline software or runtime failures
(e.g. power outages or network unavailability). Finally, anomalies that
can't be discounted as being due to technical problems might actually be
"interesting" observations to be followed up in research.
.. In other words, although it was developed as a rapid means of spotting
.. problems in pipelines ("validating" or "QC'ing" data), it can serve
.. the goal of discovery as well.
Although it was developed in the context of genomics research, it is
expressly not tied to a specific knowledge domain. It can, however,
be customized (via the plugin mechanism) for specific knowledge domains.
.. motivated by realization that when faced with thousands of individual
.. files it becomes challenging to even confirm they all contain approximately
.. "what they should."
Importantly, *files* are its fundamental unit of operation.
This means that a file must constitute a meaningful unit of
information--one sample's data, for example--in any
application of BDQC.
.. (for \#3 above to be well-defined).
_`What does it do?`
===================
BDQC analyzes a collection of files in **two stages**.
First, it analyzes each file individually and produces a summary of the
file's content (*within-file* analysis).
Second, a configurable set of heuristics is applied to the aggregated
file summaries (*across-file* analysis) to identify possible anomalies.
BDQC can be run from the command line, and command line arguments control:

1. which files are analyzed,
2. how individual files are analyzed, and
3. how the aggregate file analyses are analyzed.

All command line arguments are optional; the framework will carry out
default actions. See command line help.
Alternatively, the ``bdqc.Executor`` Python class can be used directly from
third-party Python code, which allows BDQC to be incorporated into
pipelines.
Design goals
============
The BDQC framework was developed with several explicit goals in mind:
1. Identify an "anomalous" file among a large collection of *similar* files of *arbitrary* type with as little guidance from the user as possible, ideally none. In other words, it should be useful "out of the box" with almost no learning curve.
2. "Simple things should be simple; complex things should be possible" [#]_. Although basic use should involve almost no learning curve, it should be possible to extend it with arbitrarily complex (and possibly domain-specific) analysis capabilities.
3. Plugins should be simple (for a competent Python programmer) to develop, and the system must be robust to faults in plugins.
.. The third goal motivated the use of Python.
_`How does it work?`
####################
This section describes exhaustively how BDQC works internally.
Summary production: within-file analysis
========================================
The BDQC *framework* orchestrates the execution of *plugins*.
**All of the within-file analysis capabilities are provided by
plugins.** [#]_
That is, the plugins that are executed on a file entirely determine the
content of the summary generated for that file. The framework itself
*never* looks inside a file; only the plugins do. The framework:
1. assembles a list of paths identifying files to be analyzed,
2. executes a *dynamically-determined* subset of plugins on each filename,
3. combines the executed plugins' results into (JSON_) summaries for each file.
Plugins are described more fully elsewhere. Here it suffices to understand
that each plugin can declare (as part of its implementation) that it depends
on zero or more other plugins.
The framework:
1. ensures that a plugin's dependencies execute before the plugin itself, and
2. provides each plugin with the results of its dependencies' execution.
Thus, the set of all *candidate* plugins--that is, all plugins installed on
the user's machine [#]_ --constitutes an implicit DAG (directed acyclic graph),
and an "upstream" plugin can determine how (or even whether or not) a
downstream plugin is run. The framework minimizes work by only executing a
plugin when required.
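
The dependency handling described above can be illustrated with a minimal
sketch. It is not BDQC's actual implementation: the ``run_plugins`` function,
the plugin names (``extrinsic``, ``filetype``, ``text_stats``), and the plugin
call signature are all hypothetical, and the toy plugins inspect only the
filename so that the example runs without any files on disk.

.. code-block:: python

    def run_plugins(filename, plugins):
        """Run every plugin on filename, executing dependencies first.

        plugins maps a plugin name to (list of dependency names, function).
        Each function receives the filename and its dependencies' results.
        """
        results = {}
        in_progress = set()

        def run(name):
            if name in results:        # already executed; never repeat work
                return results[name]
            if name in in_progress:    # a cycle would violate the DAG property
                raise ValueError("dependency cycle at plugin %r" % name)
            in_progress.add(name)
            deps, func = plugins[name]
            upstream = {d: run(d) for d in deps}   # dependencies run first
            results[name] = func(filename, upstream)
            in_progress.discard(name)
            return results[name]

        for name in plugins:
            run(name)
        return results


    # Hypothetical plugins, for illustration only.
    def extrinsic(filename, upstream):
        return {"name_length": len(filename)}

    def filetype(filename, upstream):
        return {"is_text": filename.endswith(".txt")}

    def text_stats(filename, upstream):
        # An upstream result decides whether this plugin does any work.
        if not upstream["filetype"]["is_text"]:
            return None
        return {"checked": True}

    PLUGINS = {
        "text_stats": (["filetype"], text_stats),
        "filetype": ([], filetype),
        "extrinsic": ([], extrinsic),
    }

    summary = run_plugins("foo.txt", PLUGINS)
    skipped = run_plugins("foo.bin", PLUGINS)

In this sketch ``text_stats`` produces no result for ``foo.bin`` because its
upstream dependency reports that the file is not text, which is one way an
upstream plugin could determine whether a downstream plugin does any work.
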
By default, the summary for file foo.txt is left in an adjacent file named
foo.txt.bdqc.
Again, the BDQC *framework* does not touch files' content--it only
handles filenames.
Heuristic application: across-file analysis
===========================================
1. Summary (\*.bdqc) files are collected.
2. The JSON_ content of all files' summaries is *flattened* into a matrix.
3. A specified set of heuristics is applied to the columns of the matrix.
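
Steps 2 and 3 above can be sketched as follows. This is an illustration under
stated assumptions rather than BDQC's own code: the ``/`` path separator, the
function names, the example summaries, and the "differs from the most common
value" heuristic are all hypothetical, and only nested objects (not arrays)
are handled.

.. code-block:: python

    from collections import Counter


    def flatten(obj, prefix=""):
        """Map each scalar in a nested JSON object to a '/'-joined path."""
        flat = {}
        for key, value in obj.items():
            path = prefix + "/" + key if prefix else key
            if isinstance(value, dict):
                flat.update(flatten(value, path))
            else:
                flat[path] = value
        return flat


    def to_matrix(summaries):
        """Turn {filename: summary} into (row names, column paths, rows)."""
        flats = {name: flatten(s) for name, s in summaries.items()}
        columns = sorted(set().union(*flats.values()))
        names = sorted(flats)
        # A statistic missing from a file's summary becomes None in its row.
        rows = [[flats[n].get(c) for c in columns] for n in names]
        return names, columns, rows


    def minority_files(names, values):
        """Toy heuristic: files whose value differs from the most common one."""
        most_common, _ = Counter(values).most_common(1)[0]
        return [n for n, v in zip(names, values) if v != most_common]


    # Hypothetical summaries, as if loaded from three *.bdqc files.
    summaries = {
        "a.txt": {"filetype": {"is_text": True}, "stats": {"lines": 100}},
        "b.txt": {"filetype": {"is_text": True}, "stats": {"lines": 98}},
        "c.txt": {"filetype": {"is_text": False}, "stats": {"lines": 0}},
    }

    names, columns, rows = to_matrix(summaries)
    col = columns.index("filetype/is_text")
    flagged = minority_files(names, [row[col] for row in rows])

Each row of the matrix corresponds to one file and each column to one
statistic, so a heuristic only needs to examine one column at a time.
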
Plugins are described more fully elsewhere. Here it suffices to understand
that a plugin's output can be (*almost*) anything representable as JSON_
data. Since JSON_ is capable of representing complex/compound datatypes,
the individual statistics in plugins' summaries may exist in nested
representations and access to a particular statistic may involve specifying
a *path* through the object. For example, in the following JSON text...