
Table of Contents
#################

- Installation_
- Overview_

  - `What is it?`_
  - `What is it for?`_
  - `What does it do?`_

- `How does it work?`_
- Plugins_


Installation
############

The BDQC framework has no requirements other than Python 3.3.2 or later.
Installation, however, requires the GCC toolchain because some of BDQC's
components are C code that must be compiled.

After extracting the archive...

.. code-block:: shell

	python3 setup.py install

...installs the framework, after which...

.. code-block:: shell

	python3 -m bdqc.scan <directory>

...will analyze all files in <directory>, and

.. code-block:: shell

	python3 -m bdqc.scan --help
	
...provides further help.
The contents of the online help are not repeated in this document.


Overview
########

What is it?
===========

BDQC is a Python3_ software framework and executable module.
Although it provides built-in capabilities that make it useful "out of the
box", being a "framework" means that users (knowledgeable in Python
programming) can extend its capabilities, and it is *intended* to
be so extended.

What is it for?
===============

BDQC identifies anomalous files among large collections of files which are
*a priori* assumed to be "similar."
It was motivated by the realization that when faced with many thousands of
individual files it can be challenging to even confirm they all contain
approximately what they should.

It is useful for:

1. validating primary input data to pipelines
2. validating output (or intermediate stages of) data processing pipelines
3. discovering potentially "interesting" outliers

These use cases merely highlight different sources of anomalies in data.
In the first, anomalies might be due to faulty data handling or acquisition
(e.g. sloppy manual procedures or faulty sensors). In the second, anomalies
might appear in data due to bugs in pipeline software or runtime failures
(e.g. power outages, network unavailability, etc.). Finally, anomalies that
can't be discounted as being due to technical problems might actually be
"interesting" observations to be followed up in research.

Although it was developed in the context of genomics research, it is 
expressly not tied to a specific knowledge domain. It can be customized
as much as desired (via the plugin mechanism) for specific knowledge domains.

Importantly, *files* are its fundamental unit of operation.
This means that a file must constitute a meaningful unit of
information--one sample's data, for example--in any
application of BDQC.

What does it do?
================

BDQC analyzes a collection of files in two stages.
First, it analyzes each file individually and produces a summary of the
file's content (`within-file analysis <Within-file analysis_>`_).
Second, the aggregated file summaries are analyzed heuristically
(`between-file analysis <Between-file analysis_>`_) to identify possible anomalies.

The two stages of operation can be run independently.

BDQC can be run from the command line, and command line arguments control
which files are analyzed,
how files are summarized, and
how the summaries are aggregated and finally analyzed.
All command line arguments are optional; the framework will carry out
default actions. See command line help.

Alternatively, the bdqc.Executor Python class can be incorporated directly
into third party Python code. This allows it to be incorporated into
pipelines.
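
When the Executor class is not needed, a pipeline step can also launch the
command line module as a child process. The following is only a sketch under
stated assumptions, not part of BDQC: build_scan_command and run_scan are
hypothetical helper names, and the only BDQC-specific detail used is the
python3 -m bdqc.scan <directory> invocation shown under Installation_.
The numeric meaning of the exit status is not documented here, so the sketch
merely returns it.

.. code-block:: python

	import subprocess
	import sys


	def build_scan_command(directory, python=sys.executable):
	    """Return the argument list for 'python3 -m bdqc.scan <directory>'."""
	    return [python, "-m", "bdqc.scan", directory]


	def run_scan(directory):
	    """Run bdqc.scan on a directory and return its exit status."""
	    # subprocess.call waits for the child process and returns its status.
	    return subprocess.call(build_scan_command(directory))

Keeping command construction separate from execution makes the helper easy to
test without BDQC installed.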

Results
-------

A successful run of bdqc.scan ends with one of 3 general results:

1. nothing of interest found ("Everything is OK.")
2. two or more files were found to be *incomparable*
3. anomalies were detected in specific files

Files are considered "incomparable" when they are *so* different (e.g.
log files and JPEG image files) that comparison is essentially meaningless.

A file is considered "anomalous" when one or more of the statistics that
plugins compute about it are "outliers," either in the usual sense of the
word or another sense explained more fully below (in `Between-file analysis`_).

In the second and third cases, a report is optionally generated (as text or HTML)
summarizing the evidence.

Design goals
============

The BDQC framework was developed with several explicit goals in mind:

1. Identify an "anomalous" file among a large collection of *similar* files of *arbitrary* type with as little guidance from the user as possible, ideally none.  In other words, it should be useful "out of the box" with almost no learning curve.
2. "Simple things should be simple; complex things should be possible." [#]_ Although basic use should involve almost no learning curve, it should be possible to extend it with arbitrarily complex (and possibly domain-specific) analysis capabilities.
3. Plugins should be simple (for a competent Python programmer) to develop, and the framework must be robust to faults in plugins.

How does it work?
#################

This section describes in more detail how BDQC works internally.
This and following sections are required reading for anyone
wanting to develop their own plugins.

The most important fact to understand about BDQC is that
**plugins, not the** *framework*, **carry out all within-file analysis of input files.**
The BDQC framework merely orchestrates the execution of `plugins <Plugins_>`_
and performs the final *across-file* analysis, but only plugins
examine a file's content.
(The BDQC *package* includes several "built-in" plugins which ensure
it is useful "out of the box." Though they are built-in, they are
nonetheless plugins because they follow the plugin architecture.)

A plugin is simply a Python module that is installable like any Python module.
Plugins provide functions that can read a file and produce one or more summary
statistics about it.
The functions are expected to take certain forms, and the plugin is expected to
export certain symbols used by the BDQC framework (described in detail
`below <Plugins_>`_).

.. image:: doc/dataflow2.png
	:align: center


Within-file analysis
====================

The plugins that are executed on a file entirely determine
the content of the summary (the statistics) generated for that file.
The framework itself *never* looks inside a file; only the plugins examine
file content.

The framework:

1. assembles a list of paths identifying files to be analyzed,
2. executes a *dynamically-determined* subset of the available plugins on each file path,
3. merges the plugins' results into one (JSON_-format) summary per analyzed file.

Each `plugin <Plugins_>`_ can declare (as part of its implementation) that it depends
on zero or more other plugins.

The framework:

1. ensures that a plugin's dependencies execute before the plugin itself, and
2. provides each plugin with the results of its *declared* dependencies' execution.

By virtue of their declared dependencies, the set of all plugins available
to BDQC (installed on the user's computer and visible on the PYTHONPATH)
constitutes a directed acyclic graph (DAG), and a plugin that is "upstream"
in the DAG can determine how (or even whether or not) a downstream plugin runs.
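
Running every plugin after its dependencies amounts to a topological sort of
that graph. The following is only a sketch of the idea, not BDQC's actual
scheduler: execution_order is a hypothetical helper, and it assumes the
dependencies are given as a dict mapping each plugin name to the list it
declares in DEPENDENCIES.

.. code-block:: python

	def execution_order(dependencies):
	    """Order plugins so that each one follows all of its dependencies."""
	    order = []   # plugin names in a valid execution order
	    state = {}   # plugin name -> "visiting" or "done"

	    def visit(name):
	        if state.get(name) == "done":
	            return
	        if state.get(name) == "visiting":
	            # Reaching a plugin that is still being visited means a cycle.
	            raise ValueError("dependency cycle at " + name)
	        state[name] = "visiting"
	        for dep in dependencies.get(name, []):
	            visit(dep)
	        state[name] = "done"
	        order.append(name)   # appended only after all of its dependencies

	    for name in sorted(dependencies):
	        visit(name)
	    return order

For example, execution_order({'a': [], 'b': ['a'], 'c': ['a', 'b']}) places
a before b and b before c, and a cyclic declaration raises ValueError instead
of looping forever.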

The framework minimizes work by only executing a plugin when required.
The figure above represents the skipping of plugins; plugin *#3*, for example,
was not run on file *#N*.

.. TODO: cover the rerun decision tree.

By default, the summary for file foo.txt is left in an adjacent file named
foo.txt.bdqc.
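
Because the summary is plain JSON in a predictably named file, it can be
inspected with a few lines of Python. This is a minimal sketch assuming the
default naming described above; summary_path and load_summary are
hypothetical helper names, not part of BDQC.

.. code-block:: python

	import json


	def summary_path(path, suffix=".bdqc"):
	    """Return the default location of the summary for the file at path."""
	    return path + suffix


	def load_summary(path):
	    """Load the JSON summary stored next to the analyzed file at path."""
	    with open(summary_path(path)) as fp:
	        return json.load(fp)

The suffix is a parameter only so that the sketch remains usable if the
default location is overridden.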

Again, the BDQC *framework* does not touch files' content; it only
handles filenames and paths.

Between-file analysis
=====================

1. Summary (\*.bdqc) files are `collected <Collection_>`_.
2. All files' summaries (the JSON_-formatted content of all corresponding \*.bdqc files) are `flattened <Flattening_>`_ into a matrix.
3. `Heuristic analysis is applied <Heuristic Analysis_>`_ to the columns of the matrix to identify rows (corresponding to the original files) that might be anomalies.

The framework (bdqc.scan or bdqc.analysis) exits with a status code indicating
the overall analysis result: no anomalies, incomparable files, anomalies detected
(or an error occurred).

**Two or more files are considered incomparable when their summaries do not
contain the same set of statistics.** This typically only occurs when files
are so different that different plugins ran, and it is usually the result of
insufficiently constraining the bdqc.scan run
(see the --include and --exclude options).
It can also occur when \*.bdqc files from different bdqc.scan runs are
inappropriately aggregated in an independent bdqc.analysis run.

When incomparable files are detected it is impossible to determine which, if
any, are anomalous.
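
The comparability rule can be stated compactly in code. The sketch below
assumes each summary has already been reduced to a dict that maps a
statistic's path to its value; incomparable is a hypothetical function name,
not BDQC's API.

.. code-block:: python

	def incomparable(summaries):
	    """Return True when the summaries do not all share one set of statistics."""
	    # Only the names (dict keys) of the statistics matter here, not their values.
	    distinct_key_sets = {frozenset(summary) for summary in summaries}
	    return len(distinct_key_sets) > 1

Two summaries with the same keys but different values are comparable; the
differing values are what the heuristic analysis then examines.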

Collection
----------

Typically bdqc.scan automatically invokes the between-file analysis on
the results of within-file analysis.
However, the between-file analysis can also be run independently, and files
listing and/or directories containing \*.bdqc files to analyze can be
specified exactly as with bdqc.scan. See

.. code-block:: shell

	python3 -m bdqc.analysis --help

Flattening
----------

A `plugin's <Plugins_>`_ output can be (almost) anything
representable as JSON_ data.
In particular, the "statistic(s)" produced by a plugin need not be scalars
(numbers and strings); they can be compound data like matrices or sets.
However, currently only scalar statistics are used in subsequent analysis.

Since JSON_ is inherently hierarchical (because it supports compound types),
the individual statistics in plugins' summaries are
necessarily identified by *paths* in the JSON_ data.
For example, the following excerpt of output from the bdqc.builtin.tabular_
plugin's analysis of *one file* shows some of the many statistics it produces:

.. code-block:: JSON

	{
		"non_utf8": 0,
		"table": {
			"metadata_prefix": "",
			"lines_empty": 0,
			"lines_data": 29,
			"lines_meta": 0,
			"lines_aberrant": 0,
			"column_count": 170,
			"columns": [
				{
					"type": "string",
					"class": "categorical",
					"label_set_hash": "E02B9961"
				},
				{
					"type": "string",
					"class": "unknown"
				},
				{
					"type": "float",
					"class": "quantitative",
					"stats": {
						"stddev": 3.812,
						"mean": 47.38
					}
				},
				{
					"type": "int",
					"class": "categorical",
					"label_set_hash": "8D4D4E1B"
				},
				...
			]
		}
	}

The plugin inferred that the 3rd column in the file contains quantitative
data ("class"), and the mean value of that column was 47.38.
The process of "flattening" the JSON summaries creates one column in the
aggregate matrix from the values of the mean statistic *for all files analyzed*,
and that column's *name* is the path (array indices are zero-based, so the
3rd column has index 2):

	bdqc.builtin.tabular/table/columns/2/stats/mean

These paths can be used to make heuristic analysis selective. (See
heuristic configuration (TODO)).

In summary, each \*.bdqc file contains all plugins' statistics for one
analyzed file; each column in the aggregate matrix contains one statistic
(from one plugin) for all files analyzed.
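
The path naming can be illustrated with a short recursive function. This is a
sketch under two assumptions, not the framework's implementation: the summary
is assumed to hold each plugin's output under the plugin's name, and, unlike
BDQC's default behaviour (see `Advanced topics`_), every Array is flattened
exhaustively. The name flatten is hypothetical.

.. code-block:: python

	def flatten(node, prefix=""):
	    """Return a dict mapping '/'-separated paths to the scalar leaves of node."""
	    if isinstance(node, dict):
	        items = node.items()
	    elif isinstance(node, (list, tuple)):
	        items = enumerate(node)   # array elements are addressed by index
	    else:
	        return {prefix: node}     # a scalar (or None) ends the path
	    flat = {}
	    for key, child in items:
	        path = "{}/{}".format(prefix, key) if prefix else str(key)
	        flat.update(flatten(child, path))
	    return flat

Applying flatten to every file's summary and collecting the values found under
one path yields one column of the aggregate matrix.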

.. The columns of the matrix are the individual statistics that plugins produce
.. in their analysis summaries.

Heuristic Analysis
------------------

Between-file analysis (and BDQC itself) is based on a simple heuristic:

	**Files that** *a priori* **are expected to be "similar" should be
	effectively** *identical* **in specific, measurable ways.**

For example, files that are known to contain tabular data typically should
have identical column counts. This need not *always* be the case, though,
which is why it is a *heuristic*.

In concrete terms this means that each column in the summary matrix should
contain *a single value*. (For example, the bdqc.builtin.tabular/table/column_count
column in the summary matrix should contain only one value in all rows.)

If the column is not single-valued, then the analyzed files corresponding to
rows containing the minority value(s) will be reported as anomalies.
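
A minimal sketch of this rule follows. It assumes a column is given as a list
with one value per analyzed file; minority_rows is a hypothetical name, and
how BDQC itself treats ties between equally common values is not specified
here (this sketch simply keeps whichever value Counter reports first).

.. code-block:: python

	from collections import Counter


	def minority_rows(column):
	    """Return the row indices whose value is not the column's most common one."""
	    if not column:
	        return []
	    majority_value, _ = Counter(column).most_common(1)[0]
	    return [row for row, value in enumerate(column) if value != majority_value]

For a column_count column holding 170, 170, 169 and 170, only the third file
(row index 2) would be reported.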

Clearly, this heuristic cannot be applied to quantitative data since it
usually contains *noise* inherent in the phenomenon itself or its measurement.
However, a "relaxation" of the heuristic still applies:
a quantitative statistic should manifest *central tendency* and an *absence*
of outliers ("outliers" in the usual univariate statistical sense of the word).
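
This document does not state which univariate outlier test BDQC applies, so
the sketch below substitutes a common one, Tukey's fences (values more than
1.5 interquartile ranges beyond the quartiles), purely for illustration. The
names quartiles and outlier_rows, the factor k and the minimum of four values
are all assumptions.

.. code-block:: python

	def quartiles(values):
	    """Return (Q1, Q3) as the medians of the lower and upper halves."""
	    ordered = sorted(values)
	    half = len(ordered) // 2

	    def median(seq):
	        mid = len(seq) // 2
	        return seq[mid] if len(seq) % 2 else (seq[mid - 1] + seq[mid]) / 2.0

	    # For odd lengths the middle element belongs to neither half.
	    return median(ordered[:half]), median(ordered[-half:])


	def outlier_rows(column, k=1.5):
	    """Return row indices whose value lies outside Tukey's fences."""
	    if len(column) < 4:
	        return []   # too few values to estimate quartiles
	    q1, q3 = quartiles(column)
	    spread = k * (q3 - q1)
	    low, high = q1 - spread, q3 + spread
	    return [row for row, value in enumerate(column) if value < low or value > high]

With file sizes of 100, 101, 99, 102, 98, 100 and 500 bytes, only the last
file (row index 6) falls outside the fences.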

For example, files containing genetic variant calls of many individuals
of the same species (one individual per file), performed on the same
sequencing platform, called by the same variant-calling algorithm, etc.
should typically be *approximately* the same size (in bytes).

Note that inference of statistical class (quantitative, categorical)
relies on inference of data *type* (integer, floating-point, or
string). See `Type inference`_ below.

Finally, missing data is also treated as anomalous. A statistic that
contains a value of null (None in Python) is *always* considered an
anomaly.

Thus, BDQC identifies anomalous files by three different indicators:

	1. outliers in *quantitative* data (the usual sense of the word "outlier")
	2. outliers in categorical data, defined as the minority value(s) when a categorical column contains more than one value
	3. missing values

Obviously, **plugins must support this rationale** by only producing
statistics that conform to it (when files are "normal").

Finally, because heuristics are *by definition* not universally applicable,
plugins' output (the statistics) can be filtered so that the heuristic is
applied selectively. For example, in a particular context "normal" files
containing tabular data may actually be expected to contain variable column
counts, so this should not be reported as an anomaly.
(See heuristic configuration).

Plugins
#######

The BDQC executable *framework* does not itself examine files' content.
All *within-file* analysis is performed by plugins.
Several plugins are included in (but are, nonetheless, distinct from) the
framework. These plugins are referred to as "`Built-ins`_".

A plugin is simply a Python module with several required and optional
elements shown in the example below.

.. code-block:: python

	VERSION = 0x00010000
	DEPENDENCIES = ['bdqc.builtin.extrinsic', 'some.other.plugin']
	def process(filename, dependencies_results):
		# Optionally, verify or use contents of dependencies_results.
		with open(filename) as fp:
			pass  # ...do whatever is required to compute the values
			# returned below...
		# Every value is built only from dicts, lists and scalars.
		return {
			'a_quantitative_statistic': 1.2345,
			'a_3x2_matrix_of_float_result': [[3.0, 1.2], [0.0, 1.0], [1.0, 2.0]],
			'a_set_result': ['foo', 'bar', 'baz'],
			'a_categorical_result': "yes"}

Plugins must satisfy several constraints:

1. Every plugin *must* provide a two-argument function called process.
2. A plugin *may* provide a list called DEPENDENCIES (which may be empty). Each dependency is a fully-qualified Python package name (as a string).
3. A plugin *may* include a VERSION declaration. If present, it must be convertible to an integer (using int()).
4. The process function *must* return data built entirely of the basic Python types:

   1. dict
   2. list
   3. tuple
   4. a scalar (int, float, string)
   5. None

These requirements do not limit what a plugin can *do*.
They merely define a *packaging* that allows the plugin to be hosted
by the framework. In particular, a plugin may invoke compiled code (e.g.
C or Fortran) and/or use arbitrary 3rd party libraries using standard
Python mechanisms.

Moreover, while a plugin is free to return multiple statistics,
the `Unix philosophy`_ of "Do one thing and do it well" suggests that a
plugin *should* return few statistics (or even only one).
This promotes reuse, extensibility, and unit-testability of plugins, and is
part of the motivation behind the plugin architecture.

There is no provision for passing arguments to plugins from the framework
itself. Environment variables can be used when a plugin must be
parameterized.
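
As an illustration of parameterizing through the environment, the sketch
below shows a complete (if trivial) plugin module. Everything specific to it
is made up for the example: the variable name EXAMPLE_MAX_LINE_LENGTH, its
default of 80 and the statistic name long_line_count are assumptions, not
BDQC conventions.

.. code-block:: python

	import os

	VERSION = 0x00010000
	DEPENDENCIES = []


	def process(filename, dependencies_results):
	    """Count the lines of a text file that exceed a configurable length."""
	    # EXAMPLE_MAX_LINE_LENGTH is a hypothetical variable name; 80 is an
	    # arbitrary default used when the variable is unset.
	    limit = int(os.environ.get("EXAMPLE_MAX_LINE_LENGTH", "80"))
	    long_lines = 0
	    with open(filename, errors="replace") as fp:
	        for line in fp:
	            if len(line.rstrip("\n")) > limit:
	                long_lines += 1
	    return {"long_line_count": long_lines}

The variable would be set in the environment from which bdqc.scan is
launched, since child processes inherit their parent's environment.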

Developers are advised to look at the source code of any of the built-in
plugins for examples of how to write their own. bdqc.builtin.extrinsic_
is a very simple plugin; bdqc.builtin.tabular_ is much more complex and
demonstrates how to use C code.

The framework will incorporate the VERSION number, if present, into the plugin's output
automatically. The plugin's code need not (and *should* not) include it in the
returned value. The version number is used by the framework (along with other factors) to decide
whether to *re*-run a plugin.

A plugin *should* return a Python dict with the name(s) of its statistic(s) as keys.
If a plugin returns any of the other allowed types, the framework will wrap it in
a dict and its value will be associated with the key "value".
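
The wrapping rule is simple enough to state in code. The function name
wrap_result is hypothetical; this sketch illustrates the behaviour described
above and is not the framework's source.

.. code-block:: python

	def wrap_result(result):
	    """Return result unchanged if it is a dict, else wrap it under 'value'."""
	    if isinstance(result, dict):
	        return result
	    return {"value": result}

A plugin that returns the bare number 3.5 therefore contributes a single
statistic named value.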

Built-ins
=========

The BDQC software package includes several built-in plugins so that it is
useful "out of the box." These plugins provide very general purpose analyses
and assume *nothing* about the files they analyze.
Although their output is demonstrably useful on its own, the built-in plugins
may be viewed as a means to "bootstrap" more specific (more domain-aware)
analyses.

bdqc.builtin.extrinsic
----------------------

.. warning:: Unfinished.

bdqc.builtin.filetype
---------------------

.. warning:: Unfinished.

bdqc.builtin.tabular
--------------------

.. warning:: Unfinished.

.. Framework execution
.. ###################
.. 
.. After parsing command line arguments the framework (bdqc.scan):
.. 
.. 1. builds a list *P* of all candidate plugins
.. 2. identifies an ordering of plugins that respects all declared dependencies
.. 3. builds a list *F* of files to be (potentially) analyzed
.. 4. for each file *f* in *F*, for each plugin *p* in *P* it runs *p* on *f* *if it needs to be run*.
.. 
.. The files to be analyzed as well as the set of candidate plugins are
.. controlled by multiple command line options. See online help.
.. 
.. These steps always happen.
.. Aggregate analysis--that is, analysis of the plugins' analyses--is
.. carried out if and only if a file is specified (with the {\tt --accum}
.. option) to contain the plugins' results.
.. 
.. Whether a plugin is actually run on a file depends on global options,
.. the existence of earlier analysis results, the modification time of
.. the file and the version (if present) of the plugin.
.. 
.. A plugin is run on a file:
.. 1. if the --clobber flag is included in the command line; this forces (re)run and preempts all other considerations.
.. 2. if no results from the current plugin exist for the file.
.. 3. if results exist but their modification time is older than the file.
.. 4. if any of the plugin's dependencies were (re)run.
.. 5. when the plugin version is (present and) newer (greater) than the version that produced existing results.

Advanced topics
###############

Aggregation and "flattening" of JSON data
=========================================

The JSON_-formatted summaries generated by plugins are hierarchical in nature
since JSON_ Objects and Arrays can each contain other JSON_ Objects and Arrays.

The process of flattening the JSON_ to produce the summary matrix
need not, in general, result in columns of *scalars* (e.g. numbers and string
labels).
Although it is always possible to arrive at columns of scalars by flattening ("exploding")
JSON_ compound objects *exhaustively*, the process is intentionally *not* exhaustive by default.
Because we want plugins to be able to return compound values as results (e.g. sets,
vectors, matrices) *without complicating JSON by defining special labeling
requirements*, the following rules and conventions are observed:

	1.	Arrays of values of a single *scalar type* are not flattened (e.g. an Array with only Numbers).
	2.	Nested Arrays--Arrays that contain other Arrays of *identical dimension*--are also not flattened.

Arrays of the first type are interpreted as either vectors (1D matrices) or *sets*.
An Array is interpreted as a set when and only when it contains *non-repeated*
String values.

BDQC interprets the second use of JSON_ Arrays as matrices. For example, in...

.. code-block:: JSON

        "foo.bar": {
            "baz": [
                [ 1, 2 ],
                [ 3, 4 ],
                [ 5, 6 ],
                [ 7, 8 ]
            ],
            "fuz": [
                [ [ "a", "b", "c", "d" ], [ "e", "f", "g", "h" ] ],
                [ [ "i", "j", "k", "l" ], [ "m", "n", "o", "p" ] ],
                [ [ "q", "r", "s", "t" ], [ "u", "v", "w", "x" ] ]
            ],
            "woz": [ "none","of","these","strings","are","repeated" ],
            ...
        }

1. foo.bar/baz will be treated as a 4x2 (numeric) matrix.
2. foo.bar/fuz will be treated as a 3x2x4 (String-valued) matrix.
3. foo.bar/woz will be treated as a *set*.

An Array that contains *any* JSON_ Objects is *always* further flattened.
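
The conventions above can be sketched as a small classifier over parsed JSON
(Python lists, dicts and scalars). This is an illustration of the stated
rules under assumptions, not BDQC's implementation: all function names are
hypothetical, and treating int and float alike as one Number type follows
JSON rather than anything stated here.

.. code-block:: python

	def scalar_kind(value):
	    """Name the JSON scalar type of a Python value."""
	    if value is None:
	        return "null"
	    if isinstance(value, bool):   # bool must be tested before int
	        return "boolean"
	    if isinstance(value, (int, float)):
	        return "number"
	    return "string"


	def shape_of(node):
	    """Return the dimensions of a regular nested Array, or None."""
	    if isinstance(node, dict):
	        return None               # Objects are always flattened further
	    if not isinstance(node, list):
	        return ()                 # a scalar contributes no dimension
	    shapes = {shape_of(item) for item in node}
	    if len(shapes) != 1 or None in shapes:
	        return None               # empty, ragged, or contains an Object
	    return (len(node),) + shapes.pop()


	def leaves(node):
	    """Yield the scalars of a nested Array in order."""
	    if isinstance(node, list):
	        for item in node:
	            for leaf in leaves(item):
	                yield leaf
	    else:
	        yield node


	def classify_array(array):
	    """Return (kind, shape); kind is 'set', 'vector', 'matrix' or 'flatten'."""
	    shape = shape_of(array)
	    scalars = list(leaves(array))
	    if shape is None or len({scalar_kind(s) for s in scalars}) != 1:
	        return ("flatten", None)
	    if len(shape) > 1:
	        return ("matrix", shape)
	    if scalar_kind(scalars[0]) == "string" and len(set(scalars)) == len(scalars):
	        return ("set", shape)
	    return ("vector", shape)

Applied to the example above, baz is classified as a matrix of shape (4, 2),
fuz as a matrix of shape (3, 2, 4), and woz as a set.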

Type inference
==============

TODO

Terms and Definitions
#####################

- within-file analysis
- between-file analysis
- summary matrix
- heuristic

Footnotes
#########

.. [#] `Alan Kay`_

.. Collected external URLS

..	_Python3: https://wiki.python.org/moin/Python2orPython3
..	_`Unix philosophy`: https://en.wikipedia.org/wiki/Unix_philosophy
..	_`Alan Kay`: https://en.wikipedia.org/wiki/Alan_Kay
..	_JSON: http://json.org