Hippocampus Gene Architecture Atlas -- Data Integrated Software
Author: Ian Bowman ibowman@loni.usc.edu
Copyright (c) 2017 USC Stevens Neuroimaging and Informatics Institute
Center for Integrative Connectomics
Summary
This repository contains the annotation data, analysis software, and processed results of the publication The genetic architecture of mouse hippocampal networks. It covers the verification, format conversion, modularity maximization/clustering, and visualization of the annotated data featured in the paper.
If you are a third party interested in the HGAA work, this software will allow you to verify and reproduce the published hippocampal results.
If you are interested in exploring alternative hypotheses or visualizations, the software in this repository will also allow you to process the hippocampal data with customized parameters.
Finally, if you are a researcher who has collected similarly formatted connectivity or gene expression data, this software will allow you to conduct clustering analysis and visualization on third-party connectivity or gene expression data.
Introduction
During the Center for Integrative Connectomics (CIC) HGAA study, we identified the need for, developed, and employed software serving a set of compatible goals. Listed roughly in chronological order, and from rudimentary to sophisticated, these overlapping goals include:
- Converting between file formats, especially representations of brain structure annotation
- Verification of input
- Efficient execution of clustering algorithms on structural and gene expression annotation
- Effective organization and storage of textual and binary output data
- Enabling and enhancing reproducibility
The last goal is a big one.
Data Integrated Software?
Despite the advent of software automation, or because of it, ensuring that results are reproducible is hard. Without careful planning from the outset, it is easy to generate a volume of data so vast that identifying the origin of the inputs leading to the results becomes intractable.
To fulfill the goals of open access, data sharing, and reproducibility, we agreed upon a software and data sharing plan. The resulting CIC Data Integrated Software Directive is an explicit commitment to provide access to digitized versions of all data published in CIC work, as well as the custom software involved. The aim is to encourage data access, software-driven research, and globally beneficial development of derivative works.
Organization
This repository consists of:
- Annotated and result data -- All data is available in the publish subdirectories. The data guide lists details.
- Software scripts and modules (/src/software_guide.md) -- All scripts and modules are located in src. The software documentation (/src/software_guide.md) identifies and categorizes functionality.
- Tests (/tests/testing_guide.md) -- All tests can be run via the run_tests.sh script, with individual tests located in the tests directory. Unit test coverage is good; at the time of writing this README, the smoke tests are incomplete. More details are in the test documentation (/tests/testing_guide.md).
Reproducing Results
Most result data is associated with a file containing the arguments used to generate it. These arguments can be retrieved with the pickle_juice.py script. (The name is derived from the Python pickle module, which the script makes use of.)
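The general idea of keeping arguments next to a result can be sketched with the pickle module. This is a minimal illustration and not the actual pickle_juice.py implementation: the sidecar naming rule (`.args.pickle`), the helper names, and the dictionary layout are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path


def sidecar_path(result_path):
    """Hypothetical naming rule: '<result file>.args.pickle' beside the result."""
    return Path(str(result_path) + ".args.pickle")


def save_args(result_path, args):
    """Pickle the argument dictionary that produced result_path."""
    with open(sidecar_path(result_path), "wb") as handle:
        pickle.dump(args, handle)


def load_args(result_path):
    """Read back the argument dictionary stored for result_path."""
    with open(sidecar_path(result_path), "rb") as handle:
        return pickle.load(handle)


def to_command_line(args):
    """Render the dictionary as '--name=value' tokens; True means a bare flag."""
    tokens = []
    for name, value in args.items():
        tokens.append(f"--{name}" if value is True else f"--{name}={value}")
    return " ".join(tokens)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a generated plot; the file name is illustrative only.
        plot = Path(tmp) / "example_plot.svg"
        plot.write_text("<svg/>")
        save_args(plot, {"format": "svg", "line_num": 5, "pretend_square": True})
        print(to_command_line(load_args(plot)))
```

Storing the arguments as a dictionary rather than a raw string keeps them easy to inspect and modify before they are turned back into a command line.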
Identifying the arguments helps both in reproducing results and in customizing parameters. In this example, we first retrieve the command line arguments used to create a gene expression matrix plot:
$ python src/pickle_juice.py \
-co publish/ms_char_clust_gene_annotation_dir_rect_row_col_mat/gene_annotation_dir_rect_row_col_mat_k-4-3_runs-1000_mps-44.8502090942_mod_reorder.svg
--format=svg --line_num=5 --char_cmt_str_csv=publish/ms_char_clust_gene_annotation_dir_rect_row_col_mat/ms_char_clust_gene_annotation_dir_rect_row_col_mat_k-001-200_runs_1000.csv --dictionary_list=None --partition="[['DGd', 'DGi', 'DG_pod'], ['CA3dd', 'CA3d'], ['CA3i'], ['CA2'], ['DGv', 'DG_pov'], ['CA1i', 'CA1v', 'CA3v', 'SUB_3'], ['CA1vv', 'CA3vv', 'SUB_2'], ['CA1d', 'SUB_1'], ['SUB_4', 'Putative HPF Interneurons']]" --pretend_square --matrix_csv=publish/ms_char_clust_gene_annotation_dir_rect_row_col_mat/gene_annotation_dir_rect_row_col_mat.csv --draw_subnetwork=None --community_line_weight=2.0 --rankdir=TB --fontsize=3 --row_value_community_sort --plot_type=mod_mat --inter_line_weight=0.1 --module_line_weight=0.5
There is a lot there, so let's focus on the first two arguments returned: --format=svg --line_num=5. We can reproduce the plot as a PNG image by passing the arguments to plot_mat.py with --format=svg changed to --format=png:
$ python src/plot_mat.py --format=png --line_num=5 \
--char_cmt_str_csv=publish/ms_char_clust_gene_annotation_dir_rect_row_col_mat/ms_char_clust_gene_annotation_dir_rect_row_col_mat_k-001-200_runs_1000.csv --dictionary_list=None --partition="[['DGd', 'DGi', 'DG_pod'], ['CA3dd', 'CA3d'], ['CA3i'], ['CA2'], ['DGv', 'DG_pov'], ['CA1i', 'CA1v', 'CA3v', 'SUB_3'], ['CA1vv', 'CA3vv', 'SUB_2'], ['CA1d', 'SUB_1'], ['SUB_4', 'Putative HPF Interneurons']]" --pretend_square --matrix_csv=publish/ms_char_clust_gene_annotation_dir_rect_row_col_mat/gene_annotation_dir_rect_row_col_mat.csv --draw_subnetwork=None --community_line_weight=2.0 --rankdir=TB --fontsize=3 --row_value_community_sort --plot_type=mod_mat --inter_line_weight=0.1 --module_line_weight=0.5
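Editing a long argument string by hand is error-prone, so the substitution above can also be scripted. The sketch below uses only the standard-library shlex module. The `override_args` helper is hypothetical and not part of this repository, and it assumes options are written in the `--name=value` form shown above.

```python
import shlex


def override_args(arg_string, **overrides):
    """Return arg_string with the named '--key=value' options replaced."""
    tokens = []
    remaining = dict(overrides)
    for token in shlex.split(arg_string):
        name, sep, _ = token.partition("=")
        key = name.lstrip("-")
        if sep and key in remaining:
            token = f"{name}={remaining.pop(key)}"
        tokens.append(token)
    # Options that were not present in the original string are appended.
    tokens.extend(f"--{key}={value}" for key, value in remaining.items())
    # shlex.quote re-quotes tokens containing spaces or quote characters,
    # such as the --partition value, so the result is safe to paste into a shell.
    return " ".join(shlex.quote(token) for token in tokens)


if __name__ == "__main__":
    # A shortened stand-in for the full argument string recovered earlier.
    recovered = "--format=svg --line_num=5 --pretend_square --fontsize=3"
    print(override_args(recovered, format="png"))
```

The same helper would cover other single-option changes, for example `override_args(recovered, line_num=3)`.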
Now suppose we look through ms_char_clust_gene_annotation_dir_rect_row_col_mat_k-001-200_runs_1000.csv and are interested in k-2, since that also has a high mean partition similarity. We can change the --line_num=5 argument to --line_num=3 to render an expression matrix with that k-value:
$ python src/plot_mat.py --format=png --line_num=3 \
--char_cmt_str_csv=publish/ms_char_clust_gene_annotation_dir_rect_row_col_mat/ms_char_clust_gene_annotation_dir_rect_row_col_mat_k-001-200_runs_1000.csv --dictionary_list=None --partition="[['DGd', 'DGi', 'DG_pod'], ['CA3dd', 'CA3d'], ['CA3i'], ['CA2'], ['DGv', 'DG_pov'], ['CA1i', 'CA1v', 'CA3v', 'SUB_3'], ['CA1vv', 'CA3vv', 'SUB_2'], ['CA1d', 'SUB_1'], ['SUB_4', 'Putative HPF Interneurons']]" --pretend_square --matrix_csv=publish/ms_char_clust_gene_annotation_dir_rect_row_col_mat/gene_annotation_dir_rect_row_col_mat.csv --draw_subnetwork=None --community_line_weight=2.0 --rankdir=TB --fontsize=3 --row_value_community_sort --plot_type=mod_mat --inter_line_weight=0.1 --module_line_weight=0.5
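The --partition value in these commands is written as a Python list-of-lists literal, so it can be inspected before it is modified or passed along. The following is a small sketch using ast.literal_eval from the standard library. The helper names are hypothetical, and nothing here is claimed about how plot_mat.py itself parses the argument.

```python
import ast

# The --partition value copied from the commands above.
PARTITION = (
    "[['DGd', 'DGi', 'DG_pod'], ['CA3dd', 'CA3d'], ['CA3i'], ['CA2'], "
    "['DGv', 'DG_pov'], ['CA1i', 'CA1v', 'CA3v', 'SUB_3'], "
    "['CA1vv', 'CA3vv', 'SUB_2'], ['CA1d', 'SUB_1'], "
    "['SUB_4', 'Putative HPF Interneurons']]"
)


def parse_partition(text):
    """Safely evaluate the literal and check that it is a list of lists."""
    partition = ast.literal_eval(text)
    if not isinstance(partition, list) or not all(
        isinstance(community, list) for community in partition
    ):
        raise ValueError("expected a list of lists of region names")
    return partition


def membership(partition):
    """Map each region name to the index of the community that contains it."""
    return {
        region: index
        for index, community in enumerate(partition)
        for region in community
    }


if __name__ == "__main__":
    partition = parse_partition(PARTITION)
    lookup = membership(partition)
    print(len(partition), "communities,", len(lookup), "regions")
    print("CA2 is in community index", lookup["CA2"])
```

ast.literal_eval only accepts Python literals, which makes it a safer choice than eval for argument strings recovered from files.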
The example in this section, along with further details, is presented in the data guide.
For more information
Please see the documentation referenced in the sections above.
Do not hesitate to contact the authors for more information, whether a general inquiry regarding the HGAA publication (@mbienkowski) or relating specifically to software, data and related documentation (@ibowman).
Acknowledgments
HGAA CIC Data Integrated Software would not exist without liberal imports of the excellent bctpy, BCT, matplotlib, scikit-learn and numpy libraries.