An Assembly Exercise
====================

:Author: C. Titus Brown
:Date: May 22, 2013

Start up an EC2 instance and log in
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Follow the instructions from yesterday (in
:doc:`start-up-an-ec2-instance`) BUT with one modification: **use the
machine image 'ami-c17ec8a8', instead of the other ami**.

Log in to the machine with SSH (as in :doc:`log-in-with-ssh-win`).  (If
you're using a Mac, read :doc:`log-in-with-ssh-mac`.)

Install the 'Velvet' assembler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At the command prompt, copy and paste the following::

   cd /root
   curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.08.tgz
   tar xzf velvet_1.2.08.tgz
   cd velvet
   make MAXKMERLENGTH=51
   cp velvet? /usr/local/bin

Grab some data
~~~~~~~~~~~~~~

Now, let's grab some read sequencing to assemble::

   cd /mnt
   curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli-reads-5m-dn-paired.fa.gz

This takes some data that I've prepared for you, and downloads it into the file 'ecoli-reads-5m-dn-paired.fa.gz'.

Let's take a quick look at the contents here::

   gunzip -c ecoli*.gz | head

This command uncompresses the data into text, and then shows you the first 10
lines.  You should see this::

   >EAS20_8_6_1_6_1407/1
   AGCGATTGCTGTGTGGCTTCAACCACTGGACAGGAGCGCGTTTTCAGGCTGATCACCAGTGCGTCGATTG
   >EAS20_8_6_1_6_1407/2
   TCTACCGGGAGCGAAATCATGATCAAGATTGGCGTTATCGCCGATGATTTTACCGGCGCGACGGATATCG
   >EAS20_8_6_1_6_1272/1
   TACCTTCGCACTGCCCGCAGCTCGAACAGCTTGGCGCACCAACGGTATTGCCGGAATTACTCAAGAGCGA
   >EAS20_8_6_1_6_1272/2
   AGATATTCAACAGGATCTTCCAGCGTCAGAATATGCGCATCGGCATGTTGATTGAGATAGCCAACCATCG
   >EAS20_8_6_1_6_668/1
   CGAAGTCGGGCGAAAATGGCGTGATTCAGGGAGGATTTTCCAGAACATCAACGCCGAGGCCAGCGCGAAA

which is a bunch of sequence reads in something called the FASTA
format: a '>' character followed by a name (in this case computer
generated, and more or less random), with a sequence right after it.

These sequences are generated by an instrument that takes shredded DNA
and "digitizes" it -- in this case it's an Illumina sequencer, but
there are many other such machines.

Assembling the data
~~~~~~~~~~~~~~~~~~~

Now run the following two commands::

   velveth g.31 31 -shortPaired ecoli-reads-5m-dn-paired.fa.gz 
   velvetg g.31 -exp_cov auto

The first command tells the Velvet assembler to load the sequences
into the directory 'g.31', using the value '31' for the required 'k'
parameter value.  (More on 'k' later...)  The name 'g.31' is just our
way of keeping track of things -- this can be any filename.  I'm using
'g' for 'genome' and '31' to remind me of what 'k' value I used.

The second command tells Velvet to assemble the shorter sequences into
longer contiguous sequences, or "contigs".

This essentially does what we did manually in class: looks for overlaps,
and then sticks the sequences together.

The output of the last command should end with something like::

   Final graph has 3590 nodes and n50 of 2328, max 11865, total 4580179, using 366068/371922 reads

which tells you a few statistics about the assembly --

 - Velvet could assemble about 4000 sequences (3590 "nodes");

 - the N50 of the sequences was about 2.3 kb (2328 bases) -- more on N50 later;

 - the maximum contig length assembled is 11kb, which means hundreds of those
   little reads were put together;

 - the sum of the bases that were assembled is about 4.5 mb (4580179), which
   is pretty close to the size of the E. coli genome!

The output of all of this is in the file 'g.31/contigs.fa', which you can
look at using 'head' again::

   head g.31/contigs.fa

You should see something like 

   >NODE_1_length_1698_cov_3.221437
   CCTGTTTATCTTGCCCGGCCCATAAGGCAATCTGTAACCAGTCAGCAATTTGGTTATTGC
   TGAGTGCTGATTTTAGTGCAAACCATGACAAAGCTGGCTGAGTATTACCCTTGCGAGCTT
   CAATAATCAATGCATCATAGGCGTTATTAACAGCACTCTTCGCCGCGGGACTCGCTGCTA
   AAAATGCGGCAGTAAGAAGTTTCAAAGCCCATTTGGTTTTCGGGCACCTTTTTCTGCTAC
   TTGAATACATCCTGTATTACTCCATGTATTGCCAAAATCTCTCTCTGTATCTAATTACAG
   GTAACTGAAAAGAAAGATATTTTTGCACCTCATAATCCGTTATTAAACGCGGAAGAGAGA
   CGTGAATTGTTGATGATGAGAAGAAGAAATGATGAGCAGAGTGTCCATATAAAATCCTTT
   TCTCGCCCGAAAATCCATTCCAATGATGAGGATCTTCAGGAATACGGCATAAATCCCAAT
   GCCTTTTTCAAAATAAATTAGGATTAAAATAATTAAATCAGTAAATTCCGATGCATGATT

Running multiple assemblies
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Do one more assembly -- for example, set the 'k' parameter to 21 (you can
set it to any odd number between 19 and 51, if you want to try something
different than 21). ::

   velveth g.21 21 -shortPaired ecoli-reads-5m-dn-paired.fa.gz 
   velvetg g.21 -exp_cov auto

Now we have *two* assemblies... the second one should look like this::

   Final graph has 2060 nodes and n50 of 6284, max 36734, total 4526331, using 370625/371922 reads

Is this better or worse than the k=31 assembly?  Why?

Generate a few more assemblies -- work with a pal to cover more
ground.  You should keep track of the velvetg statistics output; if
you lose it, you can recover it by doing 'tail g.31/Log'.

You can try:

 - varying k by choosing any odd number between 19 and 51;

 - removing the '-exp_cov auto' command from 'velvetg';

 - adding '-scaffolding no' to the 'velvetg' command;

 - Adding more read data.  Grab this file::

     https://s3.amazonaws.com/public.ged.msu.edu/ecoli-reads-5m-dn-orphan.fa.gz

   using 'curl' as above, and then append '-short
   ecoli-reads-5m-dn-orphan.fa.gz' to the 'velveth' command line.
   (The 'velvetg' command doesn't need to change.)

Which of these assemblies is "best" by some criterion?  Can you find an
assembly that is "best" by more than one (unrelated) criterion?

Finishing up for today
~~~~~~~~~~~~~~~~~~~~~~

Just leave your EC2 instances running so that we can access the data tomorrow.

Tomorrow, we'll cover ways of graphing some of your statistics.  One possible
project to present on Friday is your analysis of these various assemblies.

**Update**: `Here's an IPython Notebook <http://nbviewer.ipython.org/urls/raw.github.com/ged-lab/2013-srop-summer/master/srop-2013-assembly-stats.ipynb>`__ that shows you average length stats for a bunch of assemblies.

Questions and thoughts to address
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Things to meditate upon --

 - how do we manage complexity? Do we need to understand all these commands?
   What does each command do? In detail?

 - why don't we have a nice user interface?  Why is everything typing!?

 - why are we using this Amazon machine rather than the computer in
   front of us?

 - what is source code, anyway?

 - why do the assemblies change when you change k?

 - why might you get different numbers than me out of the velvet
   commands, sometime?  The data going in isn't changing...?

 - combinatorial explosion of parameters!!!

Reading
~~~~~~~

`Genome sequence assembly primer <http://www.cbcb.umd.edu/research/assembly_primer.shtml>`__

`What does the k parameter do in assembly? <http://ivory.idyll.org/blog/the-k-parameter.html>`__

`Assembly algorithms for next-generation sequencing data <http://www.ncbi.nlm.nih.gov/pubmed/20211242>`__