BioInterchange 2.0


Scaling and Integrating Genomics Data.


Genomic data comes in files, which is pretty awesome if you are a file system! For data scientists and bioinformaticians, though, files are an unnecessary hurdle on the way to the juicy data.

BioInterchange gets you to the data straightaway:

  • one easy-to-read data model for multiple genomics standards (GFF3, GVF, VCF)
  • high-performance data access via a Python API that outperforms BioPython and BioRuby
  • data integration into RethinkDB, MongoDB, Elasticsearch, etc., via JSON
  • more data integration into RDF triple stores via JSON-LD

Data Model

Genomic features and variations are encoded very differently across the genomics file formats. BioInterchange normalizes your genomic datasets into one canonical data model.

Genomic Data Mashup
{
    "@context" : "https://www.codamono.com/jsonld/gvf-f1.json",
    "id" : "76",
    "locus" : {
        "landmark" : "Chr1",
        "start" : 675,
        "end" : 675,
        "strand" : "+"
    },
    "source" : "SGRP",
    "type" : "SNV",
    "dbxref" : [
        "SGRP:s01-675",
        "EMBL:AA816246"
    ],
    "reference" : {
        "sequence" : "A",
        "codon" : "GAG"
    },
    "variants" : {
        "B" : {
            "sequence" : "G",
            "codon" : "GAG"
        },
        "C" : {
            "codon" : "GGG",
            "sequence" : "T"
        }
    }
}
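To get a feel for the data model, here is a minimal sketch that reads an abridged copy of the object above with nothing but Python's standard json module. It does not use the BioInterchange Python API; the helper name describe is made up for this illustration.

```python
import json

# Abridged copy of the example object shown above.
record = json.loads("""
{
    "id": "76",
    "locus": {"landmark": "Chr1", "start": 675, "end": 675, "strand": "+"},
    "source": "SGRP",
    "type": "SNV",
    "reference": {"sequence": "A", "codon": "GAG"},
    "variants": {
        "B": {"sequence": "G", "codon": "GAG"},
        "C": {"sequence": "T", "codon": "GGG"}
    }
}
""")


def describe(feature):
    """Summarize a feature as 'landmark:start-end type ref>alt/alt' (hypothetical helper)."""
    locus = feature["locus"]
    # Collect the alternative alleles; sort them so the output is stable.
    alleles = sorted(v["sequence"] for v in feature["variants"].values())
    return "{}:{}-{} {} {}>{}".format(
        locus["landmark"],
        locus["start"],
        locus["end"],
        feature["type"],
        feature["reference"]["sequence"],
        "/".join(alleles),
    )


print(describe(record))  # prints: Chr1:675-675 SNV A>G/T
```

Because every format ends up in the same shape, a helper like this should work the same way whether the record started life in a GFF3, GVF or VCF file.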

High-Performance

BioInterchange is written in C, makes use of some nifty system features for best performance, and has some spiffy algorithms under the hood! It is super fast and memory efficient in comparison to the BioPython and BioRuby projects [1].

Computing Time

Shown: time needed to process a GFF3 file, as a multiple of the time needed by BioInterchange.

Memory Consumption

Shown: memory allocated when processing a GFF3 file, as a multiple of the memory allocated by BioInterchange. Yes, that baseline for BioInterchange is really that low.

Data Integration

On top of one data model for all your genomics data, you also benefit from JSON as the lingua franca of modern database management systems. There are the amazing NoSQL database management systems such as MongoDB and RethinkDB, which are all about JSON, as well as the famous search server Elasticsearch. Established relational database management systems, for example PostgreSQL, support JSON nowadays too, and so does the Apache Hive data warehouse infrastructure that is built on top of Hadoop.
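As a rough illustration of the JSON route, the sketch below turns line-delimited JSON into a request body for Elasticsearch's bulk endpoint, where every document is preceded by an action line. It assumes that BioInterchange output holds one JSON object per line and that "id" values are unique within a file; the function name bulk_body and the index name "features" are made up for this example.

```python
import json


def bulk_body(ldj_lines, index_name):
    """Build an Elasticsearch bulk request body from line-delimited JSON.

    Assumption: each non-empty line is one complete JSON object.
    """
    out = []
    for line in ldj_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        action = {"index": {"_index": index_name}}
        # Reuse the feature's own id as document id when it has one
        # (assumes ids are unique within the dataset).
        if "id" in doc:
            action["index"]["_id"] = doc["id"]
        out.append(json.dumps(action))
        out.append(json.dumps(doc))
    # The bulk format is newline-delimited and must end with a newline.
    return "\n".join(out) + "\n"


example = ['{"id": "76", "source": "SGRP", "type": "SNV"}']
print(bulk_body(example, "features"), end="")
```

The resulting string can be sent to the _bulk endpoint with any HTTP client. For MongoDB or RethinkDB the parsed dictionaries can be handed to the respective driver directly.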

Not enough yet? Well, BioInterchange’s JSON is also JSON-LD (JSON Linked Data), which plays well with triple stores, but requires far less storage than alternative triple store formats such as RDF N-Triples. If your company builds on triple stores such as Virtuoso or Sesame, then you can make full use of the JSON-LD contexts/types and easily turn the genomics data into any RDF format of your choice!
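To see why the same data takes more room as N-Triples, consider the toy expansion below. It is not a JSON-LD processor, and the context is invented: the real term-to-IRI mappings live at the "@context" URL and are not reproduced here, so every example.org IRI is a placeholder. It only handles top-level string values and lists of strings.

```python
# Hypothetical context and base IRI, for illustration only.
CONTEXT = {
    "source": "http://example.org/genomics#source",
    "dbxref": "http://example.org/genomics#dbxref",
}
BASE = "http://example.org/feature/"


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(feature):
    """Emit one N-Triples line per string value of the mapped top-level keys."""
    subject = "<{}{}>".format(BASE, feature["id"])
    lines = []
    for key in sorted(CONTEXT):
        if key not in feature:
            continue
        values = feature[key]
        if isinstance(values, str):
            values = [values]
        # A JSON-LD array yields one triple per element.
        for value in values:
            lines.append('{} <{}> "{}" .'.format(
                subject, CONTEXT[key], escape_literal(value)))
    return lines


feature = {"id": "76", "source": "SGRP",
           "dbxref": ["SGRP:s01-675", "EMBL:AA816246"]}
for line in to_ntriples(feature):
    print(line)
```

Each triple repeats the full subject and predicate IRIs, whereas the JSON-LD object states the short key once and leaves the IRI to the shared context. For real conversions, use a proper JSON-LD library together with the published context.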

Last, but not least, go full circle and turn your JSON/JSON-LD data back into GFF3, GVF and VCF genomic files! How many tools can claim this feature, please?
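The round trip is BioInterchange's job, but the idea is easy to picture. The sketch below formats a feature from the data model as one tab-separated GVF line (nine columns, with the attributes in the last one). It is an illustration only, not how BioInterchange does it: the function name to_gvf_line is made up, score and phase are always written as ".", and codon information is dropped.

```python
def to_gvf_line(feature):
    """Format a feature dictionary as a single GVF line (illustrative sketch)."""
    locus = feature["locus"]
    attributes = ["ID=" + feature["id"]]
    variants = feature.get("variants", {})
    if variants:
        # Order alleles by their key ("B", "C", ...) for a stable result.
        sequences = [variants[key]["sequence"] for key in sorted(variants)]
        attributes.append("Variant_seq=" + ",".join(sequences))
    if "reference" in feature:
        attributes.append("Reference_seq=" + feature["reference"]["sequence"])
    if feature.get("dbxref"):
        attributes.append("Dbxref=" + ",".join(feature["dbxref"]))
    columns = [
        locus["landmark"],      # seqid
        feature["source"],      # source
        feature["type"],        # type
        str(locus["start"]),    # start
        str(locus["end"]),      # end
        ".",                    # score (not carried in this sketch)
        locus["strand"],        # strand
        ".",                    # phase (not carried in this sketch)
        ";".join(attributes),   # attributes
    ]
    return "\t".join(columns)


snv = {
    "id": "76",
    "locus": {"landmark": "Chr1", "start": 675, "end": 675, "strand": "+"},
    "source": "SGRP",
    "type": "SNV",
    "dbxref": ["SGRP:s01-675", "EMBL:AA816246"],
    "reference": {"sequence": "A"},
    "variants": {"B": {"sequence": "G"}, "C": {"sequence": "T"}},
}
print(to_gvf_line(snv))
```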

Try It Now!

Step 1: Install the Software

BioInterchange is available for OS X [2] and Linux [3]. Installation packages come in various flavors: some are better for easy manual installation, others are geared towards automatic deployment, for use in the cloud, or for installation on high performance computing clusters. If in doubt, choose the direct download method below.

Direct Download

Homebrew (Linux and OS X)

BioInterchange is part of Homebrew Science:

brew install homebrew/science/biointerchange

Or:

brew tap homebrew/science
brew install biointerchange

Docker (Linux)

Pull a shipshape image:

docker pull codamono/biointerchange
docker run -i -t codamono/biointerchange

Or, build your own with this Dockerfile:

# Dockerfile: BioInterchange 2.0
FROM debian:jessie

RUN apt-get clean && \
    apt-get update && \
    apt-get install -y \
        apt-transport-https \
        python3.4

RUN echo "deb https://www.codamono.com/debs/ stable main" >> /etc/apt/sources.list

RUN apt-get update && \
    apt-get install -y --allow-unauthenticated biointerchange

Debian Package (Linux)

Install from CODAMONO’s repository (you need to be root for that):

apt-get update
apt-get install -y apt-transport-https python3.4
echo "deb https://www.codamono.com/debs/ stable main" >> /etc/apt/sources.list
apt-get update
apt-get install -y --allow-unauthenticated biointerchange

Step 2: Get a Trial License

Trial licenses are valid for one month (30 days). Fill in the small form below and you will receive an e-mail with a trial license code. Save the license code in the file ~/.biointerchange/biointerchange-license and you are good to go!

License Request Form

Step 3: Enjoy

1. Save your license code in this file:

~/.biointerchange/biointerchange-license

2. Download (and unpack) some genomics data: cat lovers/dog lovers

3. Get going (cat lovers’ data):

biointerchange -o Felis_catus_incl_consequences.ldj Felis_catus_incl_consequences.vcf
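Once the command has finished, a quick sanity check is to tally what ended up in the output. The sketch below assumes the .ldj file is line-delimited JSON (one object per line) and that feature objects carry a "type" key as in the data model example; lines without one are counted under "(no type)". The function name count_types is made up for this illustration.

```python
import json
from collections import Counter


def count_types(path):
    """Count objects per "type" value in a line-delimited JSON file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            obj = json.loads(line)
            counts[obj.get("type", "(no type)")] += 1
    return counts
```

For the cat lovers' data, count_types("Felis_catus_incl_consequences.ldj").most_common(5) would list the five most frequent types.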

Next Steps

Head over to the documentation page for examples of getting the data into NoSQL databases, instructions for the Python API, and extensive information on the JSON/JSON-LD objects.


  1. Average time/memory (at least 5 data points) shown for reading RegulatoryFeatures_HeLa-S3.gff and writing the output to /dev/null. BioRuby 1.4.3.0001, Ruby 2.0.0-p247, Darwin; BioPython 1.65/BCBio 0.6.2, Python 3.4.2_1, Darwin.

  2. OS X Yosemite (Version 10.10.3) or later.

  3. Debian 8.1 (jessie) or later; 64-bit PC (amd64) architecture.