bioSyntax Manual

Grok your data

The objective of bioSyntax is to bring you closer to your data, giving you an intuitive & empathetic understanding of biology. To appreciate all that bioSyntax has to offer, read this short manual (~10 minutes) and go explore.

  1. Getting Started
    1. See: Installing bioSyntax
    2. Reading large-data
      1. Read your large-data set with less directly
      2. Streaming your data directly into less with pipes |
      3. Bypassing bioSyntax (data in plain-text)
  2. Reading Data
    1. Nucleotides
    2. PHRED Scores
    3. CIGAR Strings
    4. Amino Acid Color Schemes
  3. Supported File Formats
    1. Core bioSyntax
    2. Auxiliary Syntaxes
    3. See Also: Alternative/User Syntax Definitions
    4. Science Syntaxes
  4. Support
    1. Report a bug / Ask a question
      1. Uninstallation Instructions
  5. Collaborating on bioSyntax
  6. See also: bioSyntax Manuscript

Getting Started

See: Installing bioSyntax

bioSyntax integrates seamlessly with vim (Linux / Mac / Win), sublime (Linux / Mac / Win), gedit (Linux / Win), & less (Linux / Mac). After installing bioSyntax, files will be detected automatically by file extension.

Reading large-data

Very large data sets are often slow to open in a text editor. It’s best to use the command-line program less, which reads your file as a data stream.

Read your large-data set with less directly
# If your file is uncompressed, it can be read directly.
# less will recognize the file extension (.XYZ)

cd ~/myData/

less dbSNP107_common.vcf

less hg19.fa
Streaming your data directly into less with pipes |
# If your file is compressed, you can 'pipe' the decompressed
# data directly into less using the "|" operator. File formats
# cannot be detected within a stream, so prefix less with the
# file extension you want (e.g. sam-less, vcf-less, fa-less).

cd ~/myCompressedData/ 

samtools view -h NA12878_hg38.bam | sam-less

gzip -dc dbSNP107_rare.vcf.gz | vcf-less

gzip -dc hg38.fa.gz | fa-less
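
Since the prefix has to be chosen by hand for every stream, a small helper can make the mapping explicit. The sketch below is not part of bioSyntax: `viewer_for` is a hypothetical function name, and it only knows the three prefixed commands shown above (sam-less, vcf-less, fa-less), falling back to plain less for anything else.

```shell
# Hypothetical helper (not shipped with bioSyntax): print the name of the
# prefixed less command that matches a file name, ignoring a trailing .gz.
viewer_for() {
  name=${1%.gz}                      # strip the compression suffix, if any
  case "$name" in
    *.sam|*.bam) echo sam-less ;;    # BAM is viewed as SAM via samtools view -h
    *.vcf)       echo vcf-less ;;
    *.fa)        echo fa-less ;;
    *)           echo less ;;        # unknown format: plain less
  esac
}

viewer_for dbSNP107_rare.vcf.gz      # prints vcf-less

# Possible use in a pipeline:
#   gzip -dc hg38.fa.gz | "$(viewer_for hg38.fa.gz)"
```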
Bypassing bioSyntax (data in plain-text)

For vim: type :syntax off

For less

# You may want to view your data without syntax highlighting
# such as where a file is improperly formatted or very large
# files where syntax highlighting may be slow (i.e. VCF files
# with hundreds of columns).

# 1. Pipe your data through cat
cat snp_1000genomes.vcf | less - 

# 2. Within less, switch to a visual editor
less snp_1000genomes.vcf
  # press 'CTRL-C' to stop process
  # press 'v' to switch to visual editor

Reading Data

Nucleotides

bioSyntax implements a novel, full IUPAC Nucleotide Code coloring. Ambiguous bases are colored by an approximately additive mixing of the colors of their parent bases. For example, Thymine (blue) and Cytosine (red) are both pYrimidines, so the ambiguity code Y is magenta.
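
To see which parent bases sit behind each ambiguity code, the standard IUPAC table can be written as a short lookup. This is only an illustrative sketch: `iupac_bases` is a hypothetical name, it expects a single upper-case letter, and it says nothing about the exact colors bioSyntax assigns.

```shell
# Hypothetical lookup: print the parent bases of one IUPAC nucleotide code.
iupac_bases() {
  case "$1" in
    A|C|G|T|U) echo "$1" ;;          # unambiguous bases map to themselves
    R) echo "A G" ;;                 # puRine
    Y) echo "C T" ;;                 # pYrimidine
    S) echo "C G" ;;                 # Strong
    W) echo "A T" ;;                 # Weak
    K) echo "G T" ;;                 # Keto
    M) echo "A C" ;;                 # aMino
    B) echo "C G T" ;;               # not A
    D) echo "A G T" ;;               # not C
    H) echo "A C T" ;;               # not G
    V) echo "A C G" ;;               # not T
    N) echo "A C G T" ;;             # any base
    *) echo "unknown"; return 1 ;;
  esac
}

iupac_bases Y                        # prints: C T
```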

An intuitive feature of the bioSyntax color scheme is that the ‘GC-content’ of a sequence can be quickly approximated by how warm (high GC, red-orange) or cool (low GC, blue-green) a sequence looks.

vim myc_gcContent.fa
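
If you want to check a visual impression against a number, GC content can be computed with awk. The function below is a minimal sketch, not a bioSyntax feature: `gc_content` is a hypothetical name, it reads FASTA from standard input, skips header lines, and (as an assumption) ignores ambiguous bases rather than apportioning them.

```shell
# Hypothetical helper: percent GC of all sequence lines read from stdin.
gc_content() {
  awk '
    !/^>/ {                          # skip FASTA header lines
      seq = toupper($0)
      gc += gsub(/[GC]/, "", seq)    # gsub returns the number of matches
      at += gsub(/[ATU]/, "", seq)
    }
    END {
      if (gc + at == 0) { print "NA"; exit }
      printf "%.1f\n", 100 * gc / (gc + at)
    }'
}

printf '>demo\nGGCC\nAATT\n' | gc_content    # prints 50.0
```

With the example file above, this could be run as `gc_content < myc_gcContent.fa`.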

PHRED Scores

When available, bioSyntax will highlight PHRED quality scores in a step-gradient of blacks (PHRED = 0-10) to whites (PHRED = 40+).
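
For reference, the numbers behind the gradient can be recovered from a FASTQ quality string. The sketch below assumes the common Phred+33 (Sanger) encoding, where each score is the character's ASCII code minus 33; `phred_scores` is a hypothetical helper name and not part of bioSyntax.

```shell
# Hypothetical helper: decode a quality string into PHRED scores,
# assuming Phred+33 encoding (score = ASCII code - 33).
phred_scores() {
  printf '%s' "$1" |
    od -An -v -t u1 |                # one unsigned decimal value per byte
    awk '{ for (i = 1; i <= NF; i++) printf "%s%d", (n++ ? " " : ""), $i - 33 }
         END { print "" }'
}

phred_scores '!+5I'                  # prints: 0 10 20 40
```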

CIGAR Strings

In .sam files the Query:Reference alignment is summarized efficiently but illegibly as a CIGAR String. With a little bit of highlighting these become much easier to read.
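
As a companion to the highlighting, a CIGAR string can also be split into its operations on the command line. This is a sketch under stated assumptions: `cigar_ops` is a hypothetical function, and the operation descriptions are paraphrased from the SAM specification.

```shell
# Hypothetical helper: print one CIGAR operation per line with a description.
cigar_ops() {
  printf '%s\n' "$1" | awk '
    BEGIN {
      name["M"] = "alignment match or mismatch"
      name["I"] = "insertion to the reference"
      name["D"] = "deletion from the reference"
      name["N"] = "skipped region"
      name["S"] = "soft clip"
      name["H"] = "hard clip"
      name["P"] = "padding"
      name["="] = "sequence match"
      name["X"] = "sequence mismatch"
    }
    {
      rest = $0
      # Peel off one <length><operation> pair at a time from the front.
      while (match(rest, /^[0-9]+[MIDNSHP=X]/)) {
        count = substr(rest, 1, RLENGTH - 1)
        op = substr(rest, RLENGTH, 1)
        printf "%s%s\t%s\n", count, op, name[op]
        rest = substr(rest, RLENGTH + 1)
      }
      if (rest != "") print "unparsed: " rest
    }'
}

cigar_ops 10M2I5M3D20M
```

Each output line holds the length and operation, a tab, and the description, so the example prints five lines, one per operation.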

Amino Acid Color Schemes

You can choose from several color schemes for amino-acid fasta files. The Fasta Clustal (default) and Fasta Hydrophobicity syntaxes color amino acids by their physicochemical properties, while Fasta Zappo and Fasta Taylor give better discrimination between individual amino acids.

Supported File Formats

File format and software compatibility matrix for bioSyntax.

  status
X Syntax Complete
o In Development
- Unavailable

Core bioSyntax

File Format  Description                  sublime  vim  gedit  less
.fasta       Generic nt/aa sequence       X        X    X      X
.fastq       Fasta + PHRED quality        X        X    X      X
.clustal     Multiple Sequence Alignment  X        X    X      X
.bed         Genomic Ranges               X        X    X      X
.gtf         Genomic Annotation           X        X    X      X
.pdb         Protein Structure            X        X    X      X
.vcf         Variant Call Format          X        X    X      X
.sam         NGS Sequence Data            X        X    X      X

Auxiliary Syntaxes

File Format  Description                  sublime  vim  gedit  less
.fasta       fasta alternative AA colors
             - Clustal                    X        -    X      -
             - Taylor                     X        -    X      -
             - Zappo                      X        -    X      -
             - Hydrophobicity             X        -    X      -
.fai         Fasta Index (faidx)          X        X    X      X
.flagstat    samtools flag summary        X        X    X      X
.cwl         Common Workflow Language     X        X    X      -
.wig         Wiggle data                  -        -    X      -
.nexus       Phylogenetics data           -        X    -      -
.pml         Pymol Script Language        X        X    -      -

See Also: Alternative/User Syntax Definitions

These syntaxes are not part of the unified bioSyntax suite but often serve specialized functions.

Science Syntaxes

File Format  Description                  sublime  vim  gedit  less
.gaussian    Gaussian File (chemistry)    -        X    -      -

If you’d like to add support for another file format, check the development page to get started.

Support

Report a bug / Ask a question

The fastest way to get an answer is to:

1) Search / Open an issue on the bioSyntax Repo.

Please Include:

  • A detailed and descriptive title.
  • Enough information about what you did for someone else to replicate the problem.
  • Information about the operating system / software you’re using (uname -a)
  • If it’s a syntax highlighting issue: a screenshot of the error and a small bit of the input file you used.
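
The system details requested above can be collected in one go. The snippet is only a suggestion: `bug_report_info` is a hypothetical helper, and it covers just less and vim; add the editor you actually use.

```shell
# Hypothetical helper: gather basic environment details for a bug report.
bug_report_info() {
  echo "System: $(uname -a)"
  for tool in less vim; do
    if command -v "$tool" >/dev/null 2>&1; then
      # The first line of --version output is enough to identify the build.
      echo "$tool: $("$tool" --version 2>/dev/null | head -n 1)"
    else
      echo "$tool: not installed"
    fi
  done
}

bug_report_info
```

Paste the output into the issue together with a small excerpt of the input file, for example the first lines from `head -n 20`.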

Open an Issue

2) If you really don’t want to make a (fake) GitHub account, email [email protected] and we’ll open the issue for you, but it will be slower.

Uninstallation Instructions

Collaborating on bioSyntax

bioSyntax is a community-oriented project for scientific syntax highlighting. We encourage you to change and customize it to suit your needs.

Check out the Development page to create syntax-highlighting for custom file-formats and for other ways to help out.

Collaborate!

See also: bioSyntax Manuscript