
Table of Contents
#################

- Installation_
- Overview_

	- `What is it?`_
	- `What is it for?`_
	- `What does it do?`_
- `How does it work?`_
- Plugins_


Installation
############

The BDQC framework has no runtime requirements other than Python 3.3.2 or later.
Installation additionally requires the GCC toolchain because some of its
components are C code that must be compiled.

After extracting the archive...

.. code-block:: shell

	python3 setup.py install

...installs the framework, after which...

.. code-block:: shell

	python3 -m bdqc.scan <directory>

...will analyze all files in <directory>, and

.. code-block:: shell

	python3 -m bdqc.scan --help
	
...provides further help.
The contents of the online help are not repeated in this document.


Overview
########

What is it?
===========

BDQC is a Python3_ software framework and executable module.
Although it provides built-in capabilities that make it useful "out of the
box", being a "framework" means that users (knowledgeable in Python
programming) can extend its capabilities, and it is *intended* to
be so extended.

What is it for?
===============

BDQC identifies anomalous files among large collections of files which are
*a priori* assumed to be "similar."
It was motivated by the realization that when faced with many thousands of
individual files it can be challenging to even confirm they all contain
approximately what they should.

It is useful for:

1. validating primary input data to pipelines
2. validating output (or intermediate stages of) data processing pipelines
3. discovering potentially "interesting" outliers

These use cases merely highlight different sources of anomalies in data.
In the first, anomalies might be due to faulty data handling or acquisition
(e.g. sloppy manual procedures or faulty sensors). In the second, anomalies
might appear in data due to bugs in pipeline software or runtime failures
(e.g. power outages, network unavailability, etc.). Finally, anomalies that
can't be discounted as being due to technical problems might actually be
"interesting" observations to be followed up in research.

Although it was developed in the context of genomics research, it is 
expressly not tied to a specific knowledge domain. It can be customized
as much as desired (via the plugin mechanism) for specific knowledge domains.

Importantly, *files* are its fundamental unit of operation.
This means that a file must constitute a meaningful unit of
information--one sample's data, for example--in any
application of BDQC.

What does it do?
================

BDQC analyzes a collection of files in two stages.
First, it analyzes each file individually and produces a summary of the
file's content (`Within-file Analysis`_).
Second, the aggregated file summaries are analyzed heuristically
(`Between-file Analysis`_) to identify possible anomalies.

The two stages of operation can be run independently.

BDQC can be run from the command line, and command line arguments control
which files are analyzed,
how files are summarized,
how the summaries are aggregated and finally analyzed.
All command line arguments are optional; the framework will carry out
default actions. See command line help.

Alternatively, the bdqc.Executor Python class can be incorporated directly
into third party Python code. This allows it to be incorporated into
pipelines.
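
The interface of bdqc.Executor is not described in this document, so the
following minimal sketch takes the other route and drives the documented
command line from Python. The helper names build_scan_command and run_scan
are hypothetical, and no meaning is assumed for individual exit status values.

.. code-block:: python

    import subprocess
    import sys


    def build_scan_command(directory, python=sys.executable):
        """Return the argument list for ``python3 -m bdqc.scan <directory>``."""
        return [python, "-m", "bdqc.scan", str(directory)]


    def run_scan(directory):
        """Run bdqc.scan on a directory and return its exit status.

        The status is returned uninterpreted; see the command line help
        for what each value means.
        """
        return subprocess.run(build_scan_command(directory)).returncode

A pipeline step could call run_scan on its output directory and decide how to
proceed based on the returned status.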

Results
-------

A successful run of bdqc.scan ends with one of 3 general results:

1. nothing of interest found ("Everything is OK.")
2. two or more files were found to be *incomparable*
3. anomalies were detected in specific files

Files are considered "incomparable" when they are *so* different (e.g.
log files and JPEG image files) that comparison is essentially meaningless.
This rarely occurs because of the way `Between-file Analysis`_ works.

A file is considered "anomalous" when one or more of the statistics computed
on its content (`Within-file Analysis`_) are *outliers*, either in the usual
sense of the word or as explained in `Between-file Analysis`_.

In the second and third cases, a report is optionally generated (as text or HTML)
summarizing the evidence.

Design goals
============

The BDQC framework was developed with several explicit goals in mind:

1. Identify an "anomalous" file among a large collection of *similar* files of *arbitrary* type with as little guidance from the user as possible, ideally none.  In other words, it should be useful "out of the box" with almost no learning curve.
2. "Simple things should be simple; complex things should be possible." [#]_ Although basic use should involve almost no learning curve, it should be possible to extend it with arbitrarily complex (and possibly domain-specific) analysis capabilities.
3. Plugins should be simple (for a competent Python programmer) to develop, and the framework must be robust to faults in plugins.

How does it work?
#################

This section describes in more detail how BDQC works internally.
This and following sections are required reading for anyone
wanting to develop their own plugins.

The most important fact to understand about BDQC is that
**plugins, not the** *framework*, **carry out all within-file analysis of input files.**
The BDQC framework merely orchestrates the execution of plugins
and performs the final `Between-file Analysis`_, but only plugins
examine a file's content.
(The BDQC *package* includes several "built-in" plugins which ensure
it is useful "out of the box." Though they are built-in, they are
nonetheless plugins because they follow the plugin architecture.)

Plugins_ are simply Python modules installable like any Python module.
Plugins_ provide functions that can read a file and produce one or more
summary statistics about it.
The functions are expected to take certain forms, and the plugin is expected
to export certain symbols used by the BDQC framework.

.. image:: doc/dataflow2.png
	:align: center


Within-file Analysis
====================

The plugins that are executed on a file entirely determine
the content of the summary (the statistics) generated for that file.
The framework itself *never* looks inside a file; only the plugins examine
file content.

The framework:

1. assembles a list of paths identifying files to be analyzed,
2. executes a *dynamically-determined* subset of the available plugins on each file path,
3. merges the plugins' results into one (JSON_-format) summary per analyzed file.

Each plugin can declare (as part of its implementation) that it depends
on zero or more other plugins.

The framework:

1. ensures that a plugin's dependencies execute before the plugin itself, and
2. provides each plugin with the results of its *declared* dependencies' execution.

By virtue of their declared dependencies, the set of all plugins available
to BDQC (installed on the user's computer and visible on the PYTHONPATH)
constitutes a directed acyclic graph (DAG), and a plugin that is "upstream"
in the DAG can determine how (or even whether or not) a downstream plugin runs.
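
The ordering constraint can be illustrated with the standard library's
graphlib module (Python 3.9 or later, which is newer than BDQC itself
requires). This is a sketch of the idea, not BDQC's implementation; the
plugin names and the dependency edges in declared are invented for
illustration.

.. code-block:: python

    from graphlib import TopologicalSorter

    # Hypothetical plugins mapped to the names in their DEPENDENCIES lists.
    declared = {
        "example.filetype": [],
        "example.image": ["example.filetype"],
        "example.thumbnail": ["example.image"],
    }


    def execution_order(dependencies):
        """Return plugin names ordered so dependencies precede dependents.

        TopologicalSorter raises graphlib.CycleError if the declared
        dependencies do not form a DAG.
        """
        return list(TopologicalSorter(dependencies).static_order())


    order = execution_order(declared)

Running plugins in this order guarantees that the results of a plugin's
dependencies exist by the time the plugin itself is executed.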

The framework minimizes work by only executing a plugin when required.
The figure above represents the skipping of plugins; plugin *#3*, for example,
was not run on file *#N*.

.. TODO: cover the rerun decision tree.

By default, the summary for file foo.txt is left in an adjacent file named
foo.txt.bdqc.
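
As a small illustration of this default naming, the sketch below derives the
summary path for a data file and loads it with the standard json module.
The helper names are hypothetical; only the .bdqc suffix and the JSON format
come from this document.

.. code-block:: python

    import json


    def summary_path(data_path, suffix=".bdqc"):
        """Return the default location of the summary for a data file."""
        return str(data_path) + suffix


    def load_summary(data_path):
        """Load the JSON summary stored next to a data file."""
        with open(summary_path(data_path), encoding="utf-8") as fp:
            return json.load(fp)
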

Again, the BDQC *framework* does not read files' content; it only
handles filenames and paths.

Between-file Analysis
=====================

1. Collection_ - Summary (\*.bdqc) files are collected.
2. Filtering_ - The statistics in \*.bdqc files are filtered so that they only include the "leaves" in the dependency tree.
3. Flattening_ - All files' summaries (the JSON_-formatted content of all corresponding \*.bdqc files) are flattened into a matrix.
4. `Heuristic Analysis`_ is applied to the columns of the matrix to identify rows (corresponding to the original files) that might be anomalies.

The framework (bdqc.scan or bdqc.analysis) exits with a status code indicating
the overall analysis result: no anomalies, incomparable files, anomalies detected
(or an error occurred).

**Two or more files are considered incomparable when their summaries do not
contain the same set of statistics.** This typically only occurs when files
are so different that different plugins ran, and it is usually the result of
insufficiently constraining the bdqc.scan run
(see the --include and --exclude options).
It can also occur when \*.bdqc files from different bdqc.scan runs are
inappropriately aggregated in an independent bdqc.analysis run.

When incomparable files are detected it is impossible to determine which, if
any, are anomalous.
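
A minimal sketch of this comparability test, assuming each file's summary
has already been reduced to a mapping from statistic name to value (the
function name and the sample statistic names are hypothetical):

.. code-block:: python

    def comparable(flat_summaries):
        """True when every summary contains the same set of statistic names.

        ``flat_summaries`` maps a file name to a dict of statistic name
        to value.
        """
        name_sets = {frozenset(stats) for stats in flat_summaries.values()}
        return len(name_sets) <= 1


    same = comparable({
        "a.tsv": {"table/column_count": 170},
        "b.tsv": {"table/column_count": 169},
    })
    mixed = comparable({
        "a.tsv": {"table/column_count": 170},
        "b.jpg": {"image/width": 640},
    })

Note that differing *values* (as in the first call) do not make files
incomparable; only differing sets of statistic *names* do.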

Collection
----------

Typically bdqc.scan automatically invokes the `Between-file Analysis`_ on
the results of `Within-file Analysis`_.
However, `Between-file Analysis`_ can also be run independently, and files
listing and/or directories containing \*.bdqc files to analyze can be
specified exactly as with bdqc.scan. See

.. code-block:: shell

	python3 -m bdqc.analysis --help

Filtering
---------

Recall that plugins exist in DAGs ("trees") defined by their dependencies.
This arrangement facilitates reuse by allowing capabilities to be
modularized and dynamically chained together at runtime.
Typically, upstream plugins are the most general-purpose (domain-blind),
and, conversely, downstream plugins are the most specialized (domain-aware).
Thus, the leaves of the plugin DAG are the most authoritative with respect
to what constitutes an anomalous file.
**For this reason, only the results of "terminal plugins",
those in the "leaves" of the DAG, are included by default in**
`Between-file Analysis`_. (However, this does not apply when
`Between-file Analysis`_ is launched independently of the bdqc.scan module.)

For example, one might launch BDQC on a directory tree, specifying a single
image-processing plugin to analyze image files. The image-processing plugin
might depend on a filetype plugin to identify files that it should process.
The results of the filetype plugin are not of ultimate interest; it is being
used as a *filter* by the image-processing plugin.
Only the results of the image-processing plugin are relevant to anomaly
detection.

Thus, the statistics analyzed during `Between-file Analysis`_ typically
come from a subset of all the plugins run during `Within-file Analysis`_.
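
The default filtering can be sketched as follows: given the DEPENDENCIES
declared by each plugin that ran, the terminal plugins are those on which no
other plugin depends. The plugin names are hypothetical, and this is an
illustration rather than BDQC's own code.

.. code-block:: python

    def terminal_plugins(dependencies):
        """Return the plugins on which no other plugin depends (the leaves)."""
        depended_on = {dep for deps in dependencies.values() for dep in deps}
        return sorted(set(dependencies) - depended_on)


    # The image-processing scenario described above, with invented names.
    leaves = terminal_plugins({
        "example.filetype": [],
        "example.image": ["example.filetype"],
    })

Here only example.image is a leaf, so only its statistics would enter the
analysis by default.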

Flattening
----------

A plugin's output can be (almost) anything representable as JSON_ data.
In particular, the "statistic(s)" produced by a plugin need not be scalars
(numbers and strings); they can be compound data like matrices or sets.
However, only scalar statistics are (currently) used in subsequent analysis.

Since JSON_ includes compound types (Object and Array), it supports the
creation of hierarchical data representations.
Thus, the individual (scalar) statistics in plugins' summaries are
necessarily identified by *paths* in the JSON_ data.
For example, the following excerpt of output from the `bdqc.builtin.tabular`_
plugin's analysis of *one file* shows some of the many statistics it produces:

.. code-block:: JSON

	{
		"non_utf8": 0, 
		"table": {
			"metadata_prefix": "", 
			"lines_empty": 0, 
			"lines_data": 29, 
			"lines_meta": 0, 
			"lines_aberrant": 0,
			"column_count": 170, 
			"columns": [
				{
					"type": "string", 
					"class": "categorical",
					"label_set_hash": "E02B9961"
				}, 
				{
					"type": "string", 
					"class": "unknown"
				}, 
				{
					"type": "float", 
					"class": "quantitative",
					"stats": {
						"stddev": 3.812, 
						"mean": 47.38
					}
				}, 
				{
					"type": "int", 
					"class": "categorical",
					"label_set_hash": "8D4D4E1B"
				}, 
				...
			]
		}
	}

The plugin inferred that the 3rd column in the file contains quantitative
data ("class"), and the mean value of that column was 47.38.
The process of "flattening" the JSON summaries creates one column in the
aggregate matrix from the values of the mean statistic *for all files analyzed*,
and that column's *name* is the path:

	bdqc.builtin.tabular/table/columns/2/stats/mean.

These paths can be used to make heuristic analysis selective. (See
heuristic configuration (TODO)).

In summary, each \*.bdqc file contains all plugins' statistics for one
analyzed file; each column in the aggregate matrix contains one statistic
(from one plugin) for all files analyzed.
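
The following sketch shows the idea behind flattening. It assumes that a
\*.bdqc file is a JSON Object keyed by plugin name, as the column name above
suggests, and, unlike BDQC's default behaviour (described under Advanced
topics), it flattens Arrays exhaustively for simplicity. The function names
flatten and to_matrix are hypothetical.

.. code-block:: python

    def flatten(node, prefix=""):
        """Map each scalar in nested JSON-like data to a slash-separated path."""
        if isinstance(node, dict):
            items = node.items()
        elif isinstance(node, (list, tuple)):
            items = enumerate(node)  # Array elements are addressed by index
        else:
            return {prefix: node}
        flat = {}
        for key, child in items:
            path = "{}/{}".format(prefix, key) if prefix else str(key)
            flat.update(flatten(child, path))
        return flat


    def to_matrix(summaries):
        """Return (column names, {file name: row}) for a set of summaries.

        A statistic absent from a file's summary appears as None in its row.
        """
        flat = {name: flatten(summary) for name, summary in summaries.items()}
        columns = sorted(set().union(*flat.values()))
        rows = {name: [stats.get(column) for column in columns]
                for name, stats in flat.items()}
        return columns, rows


    example = {
        "bdqc.builtin.tabular": {
            "table": {
                "column_count": 170,
                "columns": [
                    {"type": "string"},
                    {"type": "string"},
                    {"type": "float", "stats": {"mean": 47.38}},
                ],
            }
        }
    }
    flat_example = flatten(example)

Because Array indices are zero-based, the mean of the 3rd column is found
under the path ending in columns/2/stats/mean.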

.. The columns of the matrix are the individual statistics that plugins produce
.. in their analysis summaries.

Heuristic Analysis
------------------

`Between-file Analysis`_ (and BDQC itself) is based on a simple heuristic:

	**Files that** *a priori* **are expected to be "similar" should be
	effectively** *identical* **in specific, measurable ways.**

For example, files that are known to contain tabular data typically should
have identical column counts. This need not *always* be the case, though,
which is why it is a *heuristic*.

In concrete terms this means that each column in the summary matrix should
contain *a single value*. (e.g. The bdqc.builtin.tabular/table/column_count
column in the summary matrix should contain only one value in all rows.)

If the column is not single-valued, then the analyzed files corresponding to
rows containing the minority value(s) will be reported as anomalies.
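
A minimal sketch of this rule for one column of the summary matrix. The
function name is hypothetical, and how BDQC treats ties between equally
common values is not specified here; this sketch simply keeps whichever
value collections.Counter reports first as the majority.

.. code-block:: python

    from collections import Counter


    def minority_rows(column):
        """Return the file names whose value is not the most common one.

        ``column`` maps a file name to a (hashable) categorical value.
        """
        counts = Counter(column.values())
        if len(counts) <= 1:
            return []  # single-valued column: nothing to report
        majority, _ = counts.most_common(1)[0]
        return sorted(name for name, value in column.items()
                      if value != majority)


    flagged = minority_rows({"a.tsv": 170, "b.tsv": 170, "c.tsv": 169})
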

Clearly, this heuristic cannot be applied to quantitative data since it
usually contains *noise* inherent in the phenomenon itself or its measurement.
However, a "relaxation" of the heuristic still applies:
a quantitative statistic should manifest *central tendency* and an *absence*
of outliers ("outliers" in the usual univariate statistical sense of the word).

For example, files containing genetic variant calls of many individuals
of the same species (one individual per file), performed on the same
sequencing platform, called by the same variant-calling algorithm, etc.
should typically be *approximately* the same size (in bytes).
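
This document does not say which outlier test BDQC applies to quantitative
columns. As one common univariate choice, the sketch below uses Tukey's
fences (values more than 1.5 interquartile ranges outside the quartiles);
it is an assumption made for illustration, not a description of BDQC's
method. It requires Python 3.8 or later for statistics.quantiles, and the
file names and sizes are invented.

.. code-block:: python

    import statistics


    def tukey_outliers(column, k=1.5):
        """Return the file names whose value lies outside Tukey's fences.

        ``column`` maps a file name to a number and needs at least two
        entries.
        """
        q1, _, q3 = statistics.quantiles(column.values(), n=4)
        spread = q3 - q1
        low, high = q1 - k * spread, q3 + k * spread
        return sorted(name for name, value in column.items()
                      if value < low or value > high)


    # Invented file sizes: nine similar files and one much larger one.
    sizes = {
        "s0.vcf": 97, "s1.vcf": 98, "s2.vcf": 99, "s3.vcf": 100,
        "s4.vcf": 100, "s5.vcf": 101, "s6.vcf": 101, "s7.vcf": 102,
        "s8.vcf": 103, "s9.vcf": 250,
    }
    flagged = tukey_outliers(sizes)

With these values only s9.vcf falls outside the fences.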

Note that inference of statistical class (quantitative, categorical)
relies on inference of data *type* (integer, floating-point, or
string). See `Type inference`_ below.

Finally, missing data is also treated as anomalous. A statistic that
contains a value of null (None in Python) is *always* considered an
anomaly.

Thus, BDQC identifies anomalous files by three different indicators:

	1. outliers in *quantitative* data (the usual sense of the word "outlier")
	2. outliers in categorical data, defined as the minority value(s) when a categorical column contains more than one value
	3. missing values

Obviously, **plugins must support this rationale** by only producing
statistics that satisfy them (when files are "normal").

Finally, because heuristics are *by definition* not universally applicable,
plugins' output (the statistics) can be filtered so that the heuristic is
applied selectively. For example, in a particular context "normal" files
containing tabular data may actually be expected to contain variable column
counts, so this should not be reported as an anomaly.
(See heuristic configuration).

Plugins
#######

The BDQC executable *framework* does not itself examine files' content.
All *within-file* analysis is performed by plugins.
Several plugins are included in (but are, nonetheless, distinct from) the
framework. These plugins are referred to as "`Built-ins`_".

A plugin is simply a Python module with several required and optional
elements shown in the example below.

.. code-block:: python

	VERSION=0x00010000
	DEPENDENCIES = ['bdqc.builtin.extrinsic','some.other.plugin']
	def process( filename, dependencies_results ):
		# Optionally, verify or use contents of dependencies_results.
		with open( filename ) as fp:
			pass # ...do whatever is required to compute the values
		# returned below...
		return {
			'a_quantitative_statistic':1.2345,
			'a_3x2_matrix_of_float_result':[[3.0,1.2],[0.0,1.0],[1.0,2.0]],
			'a_set_result':['foo','bar','baz'],
			'a_categorical_result':"yes" }

Plugins must satisfy several constraints:

1. Every plugin *must* provide a two-argument function called process.
2. A plugin *may* provide a list called DEPENDENCIES (which may be empty). Each dependency is a fully-qualified Python package name (as a string).
3. A plugin *may* include a VERSION declaration. If present, it must be convertible to an integer (using int()).
4. The process function *must* return data built entirely of the basic Python types:

	1. dict
	2. list
	3. tuple
	4. a scalar (int, float, string)
	5. None
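
While developing a plugin, the fourth constraint can be checked with a small
recursive helper such as the following sketch. The function name is
hypothetical, and the requirement that dict keys be strings is an assumption
based on the results being stored as JSON. Note that Python's bool is a
subclass of int and is therefore accepted by this check.

.. code-block:: python

    def is_valid_result(value):
        """True when value is built only from dict, list, tuple, scalars, None."""
        if value is None or isinstance(value, (int, float, str)):
            return True
        if isinstance(value, (list, tuple)):
            return all(is_valid_result(item) for item in value)
        if isinstance(value, dict):
            # Assumption: keys must be strings, as in a JSON Object.
            return all(isinstance(key, str) and is_valid_result(item)
                       for key, item in value.items())
        return False


    ok = is_valid_result({
        "a_quantitative_statistic": 1.2345,
        "a_set_result": ["foo", "bar", "baz"],
        "a_categorical_result": "yes",
    })
    not_ok = is_valid_result({"a_set_result": {"foo", "bar"}})  # a Python set
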

These requirements do not limit what a plugin can *do*.
They merely define a *packaging* that allows the plugin to be hosted
by the framework. In particular, a plugin may invoke compiled code (e.g.
C or Fortran) and/or use arbitrary 3rd party libraries using standard
Python mechanisms.

Moreover, while a plugin is free to return multiple statistics,
the `Unix philosophy`_ of "Do one thing and do it well" suggests that a
plugin *should* return few statistics (or even only one).
This promotes reuse, extensibility, and unit-testability of plugins, and is
part of the motivation behind the plugin architecture.

There is no provision for passing arguments to plugins from the framework
itself. Environment variables can be used when a plugin must be
parameterized.
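
For example, a plugin might read a limit from the environment as in the
sketch below. The variable name MYPLUGIN_SIZE_LIMIT, the default value and
the statistic name are all invented for this illustration; BDQC itself
defines no such variable.

.. code-block:: python

    import os

    VERSION = 0x00010000
    DEPENDENCIES = []

    # Hypothetical environment variable used to parameterize this plugin.
    LIMIT_VARIABLE = "MYPLUGIN_SIZE_LIMIT"


    def size_limit(default=1024):
        """Return the size limit in bytes, taken from the environment if set."""
        return int(os.environ.get(LIMIT_VARIABLE, default))


    def process(filename, dependencies_results):
        """Report whether the file is larger than the configured limit."""
        size = os.path.getsize(filename)
        return {"exceeds_limit": "yes" if size > size_limit() else "no"}

The variable would be set in the environment of the bdqc.scan process, for
instance by exporting it in the shell before launching the scan.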

Developers are advised to look at the source code of any of the built-in
plugins for examples of how to write their own. `bdqc.builtin.extrinsic`_
is a very simple plugin; `bdqc.builtin.tabular`_ is much more complex and
demonstrates how to use C code.

The framework will incorporate the VERSION number, if present, into the plugin's output
automatically. The plugin's code need not (and *should* not) include it in the
returned value. The version number is used by the framework (along with other factors) to decide
whether to *re*-run a plugin.

A plugin *should* return a Python dict with the name(s) of its statistic(s) as keys.
If a plugin returns any of the other allowed types, the framework will wrap it in
a dict and its value will be associated with the key "value".
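
The wrapping behaviour amounts to the following sketch (the function name is
hypothetical and this is not the framework's actual code):

.. code-block:: python

    def normalize_result(result):
        """Return a dict result unchanged; wrap anything else under "value"."""
        return result if isinstance(result, dict) else {"value": result}


    wrapped = normalize_result(1.2345)
    unchanged = normalize_result({"mean": 47.38})
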

Built-ins
=========

The BDQC software package includes several built-in plugins so that it is
useful "out of the box." These plugins provide very general purpose analyses
and assume *nothing* about the files they analyze.
Although their output is demonstrably useful on its own, the built-in plugins
may be viewed as a means to "bootstrap" more specific (more domain-aware)
analyses.

bdqc.builtin.extrinsic
----------------------

.. warning:: Unfinished.

bdqc.builtin.filetype
---------------------

.. warning:: Unfinished.

bdqc.builtin.tabular
--------------------

.. warning:: Unfinished.

.. Framework execution
.. ###################
.. 
.. After parsing command line arguments the framework (bdqc.scan):
.. 
.. 1. builds a list *P* of all candidate plugins
.. 2. identifies an ordering of plugins that respects all declared dependencies
.. 3. builds a list *F* of files to be (potentially) analyzed
.. 4. for each file *f* in *F*, for each plugin *p* in *P* it runs *p* on *f* *if it needs to be run*.
.. 
.. The files to be analyzed as well as the set of candidate plugins are
.. controlled by multiple command line options. See online help.
.. 
.. These steps always happen.
.. Aggregate analysis--that is, analysis of the plugins' analyses--is
.. carried out if and only if a file is specified (with the {\tt --accum}
.. option) to contain the plugins' results.
.. 
.. Whether a plugin is actually run on a file depends on global options,
.. the existence of earlier analysis results, the modification time of
.. the file and the version (if present) of the plugin.
.. 
.. A plugin is run on a file:
.. 1. if the --clobber flag is included in the command line; this forces (re)run and preempts all other considerations.
.. 2. if no results from the current plugin exist for the file.
.. 3. if results exist but their modification time is older than the file.
.. 4. if any of the plugin's dependencies were (re)run.
.. 5. when the plugin version is (present and) newer (greater) than the version that produced existing results.

Advanced topics
###############

Aggregation and "flattening" of JSON data
=========================================

The JSON_-formatted summaries generated by plugins are hierarchical in nature
since JSON_ Objects and Arrays can each contain other JSON_ Objects and Arrays.

The process of flattening the JSON_ to produce the summary matrix
need not, in general, result in columns of *scalars* (e.g. numbers and string
labels).
Although it is always possible to arrive at columns of scalars by flattening ("exploding")
JSON_ compound objects *exhaustively*, the process is intentionally *not* exhaustive by default.
Because we want plugins to be able to return compound values as results (e.g. sets,
vectors, matrices) *without complicating JSON by defining special labeling
requirements*, the following rules and conventions are observed:

	1.	Arrays of values of a single *scalar type* are not flattened (e.g. an Array with only Numbers).
	2.	Nested Arrays--Arrays that contain other Arrays of *identical dimension*--are also not flattened.

Arrays of the first type are interpreted as either vectors (1D matrices) or *sets*.
An Array is interpreted as a set when and only when it contains *non-repeated*
String values.

BDQC interprets the second use of JSON_ Arrays as matrices. For example, in...

.. code-block:: JSON

        "foo.bar": {
            "baz": [
                [ 1, 2 ],
                [ 3, 4 ],
                [ 5, 6 ],
                [ 7, 8 ]
            ],
            "fuz": [
                [ [ "a", "b", "c", "d" ], [ "e", "f", "g", "h" ] ],
                [ [ "i", "j", "k", "l" ], [ "m", "n", "o", "p" ] ],
                [ [ "q", "r", "s", "t" ], [ "u", "v", "w", "x" ] ]
            ],
            "woz": [ "none","of","these","strings","are","repeated" ],
            ...
        }

1. foo.bar/baz will be treated as a 4x2 (numeric) matrix.
2. foo.bar/fuz will be treated as a 3x2x4 (String-valued) matrix.
3. foo.bar/woz will be treated as a *set*.

An Array that contains *any* JSON_ Objects is *always* further flattened.
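
These conventions can be sketched as a small classifier. The function names
and the returned labels are hypothetical, and two details are assumptions of
this sketch rather than statements about BDQC: JSON Numbers (int and float)
are treated as one scalar type, and matrices are required to hold a single
scalar type.

.. code-block:: python

    def shape(node):
        """Return the dimensions of a rectangular nested list of scalars.

        Returns None for ragged, empty, or Object-containing Arrays.
        """
        if isinstance(node, (int, float, str)):
            return ()
        if not isinstance(node, list) or not node:
            return None
        inner = {shape(item) for item in node}
        if len(inner) != 1 or None in inner:
            return None
        return (len(node),) + inner.pop()


    def scalars(node):
        """Yield the scalar leaves of a nested list."""
        if isinstance(node, list):
            for item in node:
                yield from scalars(item)
        else:
            yield node


    def interpret(array):
        """Classify an Array as 'set', 'vector', 'matrix' or 'flatten'."""
        dims = shape(array)
        if dims is None:
            return ("flatten", None)
        values = list(scalars(array))
        kinds = {"string" if isinstance(v, str) else "number" for v in values}
        if len(kinds) != 1:
            return ("flatten", None)
        if len(dims) > 1:
            return ("matrix", dims)
        if kinds == {"string"} and len(set(values)) == len(values):
            return ("set", dims)
        return ("vector", dims)


    baz = [[1, 2], [3, 4], [5, 6], [7, 8]]
    woz = ["none", "of", "these", "strings", "are", "repeated"]

Applied to the example above, baz is classified as a 4x2 matrix and woz as a
set, while any Array containing an Object is marked for further flattening.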

Type inference
==============

TODO

Terms and Definitions
#####################

- within-file analysis
- between-file analysis
- summary matrix
- heuristic

Footnotes
#########

.. [#] `Alan Kay`_

.. Collected external URLS

..	_Python3: https://wiki.python.org/moin/Python2orPython3
..	_`Unix philosophy`: https://en.wikipedia.org/wiki/Unix_philosophy
..	_`Alan Kay`: https://en.wikipedia.org/wiki/Alan_Kay
..	_JSON: http://json.org