BioInterchange 2.0


Integrating and Scaling Genomics Data.


Features

(See also the brochure.)

Unified Data Model

One data model, easy to read for both humans and machines, for multiple genomic standards: Generic Feature Format Version 3 (GFF3), Genome Variation Format (GVF), and Variant Call Format (VCF). Access your data without having to know the specification details of every single genomics file format.

High-Performance API

High-performance data access via a Python API that outperforms both BioPython and BioRuby in computational speed and memory footprint. The API is perfect for data filtering, data annotation, and data analysis without the need for a database management system.

Data Integration

Straightforward data integration via JSON into NoSQL databases such as RethinkDB, MongoDB, CouchDB, and ArangoDB; SQL databases with JSON support, such as MySQL and PostgreSQL; search servers such as Elasticsearch and Solr; Hadoop-based solutions such as Apache Hive and Apache Pig; and even RDF triple stores via JSON-LD.

Distributions

CloudBioLinux

Direct Download

Docker Hub

Homebrew Science

In Detail

Unified Data Model

Genomic features and variations are encoded very differently across the genomics file formats. BioInterchange normalizes your genomic datasets into one canonical data model. That means that data analysis algorithms can be implemented against a single view of all genomic data. Instead of dealing with data encoding specifics, or with the various data representations that exist even within a single genomic file standard, you can focus on the actual data.

BioInterchange’s data model is extensively described in the documentation section, but here are a few intuitive examples.

Genomic Loci

(Data model excerpt.)

{
    "locus" : {
        "landmark" : "Chr1",
        "start" : 675,
        "end" : 675,
        "strand" : "+"
    }
}
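
As a minimal sketch of how such a locus can be consumed, the snippet below parses the excerpt with Python's standard `json` module (not the BioInterchange API) and derives the feature length. It assumes that `start` and `end` are 1-based and inclusive, as they are in GFF3; the helper name `locus_length` is hypothetical.

```python
import json

# The locus excerpt from above, as a JSON string.
record = json.loads("""
{
    "locus" : {
        "landmark" : "Chr1",
        "start" : 675,
        "end" : 675,
        "strand" : "+"
    }
}
""")


def locus_length(locus):
    """Length in bases, ASSUMING 1-based inclusive coordinates (as in GFF3)."""
    return locus["end"] - locus["start"] + 1


locus = record["locus"]
print(locus["landmark"], locus["strand"], locus_length(locus))
```

Because every supported format is mapped onto the same keys, a helper like this should not need to know whether the record originally came from a GFF3, GVF, or VCF file.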

Reference Sequences and Sequence Variations

(Data model excerpt.)

{
    "reference" : {
        "sequence" : "A",
        "codon" : "GAG"
    },
    "variants" : {
        "B" : {
            "sequence" : "G",
            "codon" : "GAG"
        },
        "C" : {
            "codon" : "GGG",
            "sequence" : "T"
        }
    }
}
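
The following sketch shows one way to compare variants against the reference using only the keys visible in the excerpt. It uses plain Python and the standard `json` module; the function name `codon_changes` is hypothetical, and treating a missing `codon` key as "no codon information" is an assumption rather than documented behavior.

```python
import json

# The reference/variants excerpt from above, as a JSON string.
record = json.loads("""
{
    "reference" : { "sequence" : "A", "codon" : "GAG" },
    "variants" : {
        "B" : { "sequence" : "G", "codon" : "GAG" },
        "C" : { "codon" : "GGG", "sequence" : "T" }
    }
}
""")


def codon_changes(record):
    """Map allele key -> (reference codon, variant codon) where the two differ."""
    ref_codon = record["reference"].get("codon")
    changes = {}
    for allele, variant in sorted(record["variants"].items()):
        alt_codon = variant.get("codon")
        # ASSUMPTION: a variant without a "codon" key carries no codon information.
        if alt_codon is not None and alt_codon != ref_codon:
            changes[allele] = (ref_codon, alt_codon)
    return changes


print(codon_changes(record))
```

For the excerpt, only variant `C` is reported, because the codon given for `B` is identical to the reference codon.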

Sample Data (Associated with a Genomic Feature)

(Data model excerpt.)

{
    "samples" : [
        {
            "id" : "WSB_EiJ",
            "depth" : 43,
            "mapping-quality-rms" : 55,
            "genotype-quality" : 127,
            "allele-total-number" : 2,
            "genotype" : {
                "sequences" : [
                    "T",
                    "T"
                ],
                "alleles" : "BB",
                "phased" : false
            },
            "AA" : {
                "genotype-probabilities-phred-scaled" : 287,
                "genotype-likelihood-phred-scaled" : 255
            },
            "AB" : {
                "genotype-likelihood-phred-scaled" : 129,
                "genotype-probabilities-phred-scaled" : 142
            },
            "BB" : {
                "genotype-likelihood-phred-scaled" : 0,
                "genotype-probabilities-phred-scaled" : 0
            }
        }
    ]
}
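
Sample data of this shape lends itself to simple filtering. The sketch below, again plain Python rather than the BioInterchange API, selects samples that reach a minimum read depth and whose genotype sequences are all identical. The function name `homozygous_samples` and the default threshold are illustrative assumptions, and the record is shortened to the keys the filter uses.

```python
import json

# Shortened version of the sample excerpt above (only the keys used below).
feature = json.loads("""
{
    "samples" : [
        {
            "id" : "WSB_EiJ",
            "depth" : 43,
            "genotype" : {
                "sequences" : [ "T", "T" ],
                "alleles" : "BB",
                "phased" : false
            }
        }
    ]
}
""")


def homozygous_samples(feature, min_depth=10):
    """IDs of samples with depth >= min_depth and identical genotype sequences."""
    ids = []
    for sample in feature.get("samples", []):
        sequences = sample.get("genotype", {}).get("sequences", [])
        # A single distinct sequence means all called alleles agree.
        if sample.get("depth", 0) >= min_depth and len(set(sequences)) == 1:
            ids.append(sample["id"])
    return ids


print(homozygous_samples(feature))
```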

High-Performance API

BioInterchange is written in C, makes use of some low-level system features for best performance, and has clever algorithms that minimize its memory footprint! In comparison to the BioPython and BioRuby projects [1]:

Computing Time

Shown: factor of time needed to process a GFF3 file in reference to the time needed by BioInterchange.

Memory Consumption

Shown: factor of memory allocated when processing a GFF3 file in reference to the memory allocated by BioInterchange. Yes, that baseline for BioInterchange is really that low.

Data Integration

On top of one data model for all your genomics data, you also benefit from JSON as the lingua franca for modern database management systems. JSON is used at its core in NoSQL database management systems such as MongoDB and RethinkDB, as well as in the popular search server Elasticsearch. Established relational database management systems, for example PostgreSQL and MySQL, natively support JSON nowadays too, and so do the Apache Hive data warehouse infrastructure and the Apache Pig data analysis platform, both of which are built on top of Hadoop.
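
To illustrate the search-server route, here is a hedged sketch that wraps already-parsed JSON records in the newline-delimited body expected by Elasticsearch's bulk API (an action line followed by the document itself, with a trailing newline). It is not part of BioInterchange; the function name `to_bulk_body` and the index name `genomic-features` are made up for this example, and the record is the locus excerpt shown earlier.

```python
import json


def to_bulk_body(records, index_name):
    """Build an Elasticsearch bulk-API request body (newline-delimited JSON)."""
    lines = []
    for record in records:
        # Action line: index the document that follows into `index_name`.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the record itself, serialized on a single line.
        lines.append(json.dumps(record))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


records = [
    {"locus": {"landmark": "Chr1", "start": 675, "end": 675, "strand": "+"}},
]
body = to_bulk_body(records, "genomic-features")
print(body, end="")
```

The resulting string would be sent to the server's `_bulk` endpoint with the content type `application/x-ndjson`; the HTTP call is left out to keep the sketch free of third-party dependencies.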

BioInterchange’s JSON is also JSON-LD (JSON for Linked Data), which enables the use of triple stores but requires far less storage than alternative triple store formats that are based on the Resource Description Framework (RDF). If your company builds on triple stores such as Virtuoso or Sesame, then you can benefit from BioInterchange’s JSON-LD contexts/types and easily turn the genomics data into any RDF format of your choice!

Last, but not least, BioInterchange fully supports conversions from JSON/JSON-LD data back into GFF3, GVF and VCF genomic files.

Compatible Software/Platforms/Frameworks

There are more database management systems, data analysis platforms, and data frameworks that are potentially compatible with BioInterchange. For specific questions, please use the contact form to get in touch.

Try It Now!

Current version: 2.0.3+100

Step 1: Install the Software

BioInterchange is available for OS X [2] and Linux [3]. Installation packages come in various flavors: some are better for easy manual installation, others are geared towards automatic deployment, for use in the cloud, or for installation on high performance computing clusters. If in doubt, choose the direct download method below.

Direct Download

Homebrew (Linux and OS X)

BioInterchange is part of Homebrew Science:

brew install homebrew/science/biointerchange

Or:

brew tap homebrew/science
brew install biointerchange

Docker (Linux)

Pull a shipshape image:

docker pull codamono/biointerchange
docker run -i -t codamono/biointerchange

Or, build your own with this Dockerfile:

# Dockerfile: BioInterchange 2.0
FROM debian:jessie

RUN apt-get clean && \
    apt-get update && \
    apt-get install -y \
        apt-transport-https \
        python3.4

RUN echo "deb https://www.codamono.com/debs/ stable main" >> /etc/apt/sources.list

RUN apt-get update && \
    apt-get install -y --allow-unauthenticated biointerchange

Debian Package (Linux)

Install from CODAMONO’s repository (you need to be root for this):

apt-get update
apt-get install -y apt-transport-https python3.4
echo "deb https://www.codamono.com/debs/ stable main" >> /etc/apt/sources.list
apt-get update
apt-get install -y --allow-unauthenticated biointerchange

CloudBioLinux (Linux)

Please visit the CloudBioLinux website for instructions on installing CloudBioLinux and the BioInterchange package.

Step 2: Get a Trial License

Trial licenses are valid for one month (30 days). Fill in the small form below and you will receive an e-mail with a trial license code. Save the license code in the file ~/.biointerchange/biointerchange-license and you are good to go!

License Request Form

Step 3: Enjoy

1. Save your license key in this file:

~/.biointerchange/biointerchange-license

2. Download (and unpack) some genomics data: cat lovers/dog lovers

3. Run BioInterchange (cat lovers’ data):

biointerchange -o Felis_catus_incl_consequences.ldj Felis_catus_incl_consequences.vcf
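
Once the command has produced the `.ldj` file, the output can be inspected with a few lines of Python. The sketch below assumes that the file is line-delimited JSON, meaning one JSON object per line, and it uses the standard `json` module rather than the BioInterchange Python API. The function names and the in-memory demo records are hypothetical stand-ins, not real BioInterchange output.

```python
import json


def parse_ldj(lines):
    """Yield one parsed JSON value per non-empty line (ASSUMES line-delimited JSON)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_with_key(lines, key="locus"):
    """Count the JSON objects that contain `key` at the top level."""
    return sum(
        1 for obj in parse_ldj(lines) if isinstance(obj, dict) and key in obj
    )


# Tiny in-memory stand-in for an .ldj file (made-up records).
demo = [
    '{"locus": {"landmark": "Chr1", "start": 675, "end": 675, "strand": "+"}}',
    "",
    '{"note": "a record without a locus"}',
]
print(count_with_key(demo))
```

Both functions accept any iterable of lines, so an open file handle for `Felis_catus_incl_consequences.ldj` can be passed in place of `demo`; since lines are parsed one at a time, the whole file never has to be held in memory.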

Next Steps

The documentation section has examples of getting the data into NoSQL databases, how to use the Python API, and extensive information on the JSON/JSON-LD objects.


  1. Average time/memory (at least 5 data points) shown for reading RegulatoryFeatures_HeLa-S3.gff and writing the output to /dev/null. BioRuby 1.4.3.0001, Ruby 2.0.0-p247, Darwin; BioPython 1.65/BCBio 0.6.2, Python 3.4.2_1, Darwin.

  2. OS X Yosemite (Version 10.10.3) or later.

  3. Debian 8.1 (jessie) or later; 64-bit PC (amd64) architecture.