An Assembly Exercise ==================== :Author: C. Titus Brown :Date: May 22, 2013 Start up an EC2 instance and log in ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Follow the instructions from yesterday (in :doc:`start-up-an-ec2-instance`) BUT with one modification: **use the machine image 'ami-c17ec8a8', instead of the other ami**. Log in to the machine with SSH (as in :doc:`log-in-with-ssh-win`). (If you're using a Mac, read :doc:`log-in-with-ssh-mac`.) Install the 'Velvet' assembler ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ At the command prompt, copy and paste the following:: cd /root curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.08.tgz tar xzf velvet_1.2.08.tgz cd velvet make MAXKMERLENGTH=51 cp velvet? /usr/local/bin Grab some data ~~~~~~~~~~~~~~ Now, let's grab some read sequencing to assemble:: cd /mnt curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli-reads-5m-dn-paired.fa.gz This takes some data that I've prepared for you, and downloads it into the file 'ecoli-reads-5m-dn-paired.fa.gz'. Let's take a quick look at the contents here:: gunzip -c ecoli*.gz | head This command uncompresses the data into text, and then shows you the first 10 lines. You should see this:: >EAS20_8_6_1_6_1407/1 AGCGATTGCTGTGTGGCTTCAACCACTGGACAGGAGCGCGTTTTCAGGCTGATCACCAGTGCGTCGATTG >EAS20_8_6_1_6_1407/2 TCTACCGGGAGCGAAATCATGATCAAGATTGGCGTTATCGCCGATGATTTTACCGGCGCGACGGATATCG >EAS20_8_6_1_6_1272/1 TACCTTCGCACTGCCCGCAGCTCGAACAGCTTGGCGCACCAACGGTATTGCCGGAATTACTCAAGAGCGA >EAS20_8_6_1_6_1272/2 AGATATTCAACAGGATCTTCCAGCGTCAGAATATGCGCATCGGCATGTTGATTGAGATAGCCAACCATCG >EAS20_8_6_1_6_668/1 CGAAGTCGGGCGAAAATGGCGTGATTCAGGGAGGATTTTCCAGAACATCAACGCCGAGGCCAGCGCGAAA which is a bunch of sequence reads in something called the FASTA format: a '>' character followed by a name (in this case computer generated, and more or less random), with a sequence right after it. These sequences are generated by an instrument that takes shredded DNA and "digitizes" it -- in this case it's an Illumina sequencer, but there are many other such machines. Assembling the data ~~~~~~~~~~~~~~~~~~~ Now run the following two commands:: velveth g.31 31 -shortPaired ecoli-reads-5m-dn-paired.fa.gz velvetg g.31 -exp_cov auto The first command tells the Velvet assembler to load the sequences into the directory 'g.31', using the value '31' for the required 'k' parameter value. (More on 'k' later...) The name 'g.31' is just our way of keeping track of things -- this can be any filename. I'm using 'g' for 'genome' and '31' to remind me of what 'k' value I used. The second command tells Velvet to assemble the shorter sequences into longer contiguous sequences, or "contigs". This essentially does what we did manually in class: looks for overlaps, and then sticks the sequences together. The output of the last command should end with something like:: Final graph has 3590 nodes and n50 of 2328, max 11865, total 4580179, using 366068/371922 reads which tells you a few statistics about the assembly -- - Velvet could assemble about 4000 sequences (3590 "nodes"); - the N50 of the sequences was about 2.3 kb (2328 bases) -- more on N50 later; - the maximum contig length assembled is 11kb, which means hundreds of those little reads were put together; - the sum of the bases that were assembled is about 4.5 mb (4580179), which is pretty close to the size of the E. coli genome! The output of all of this is in the file 'g.31/contigs.fa', which you can look at using 'head' again:: head g.31/contigs.fa You should see something like >NODE_1_length_1698_cov_3.221437 CCTGTTTATCTTGCCCGGCCCATAAGGCAATCTGTAACCAGTCAGCAATTTGGTTATTGC TGAGTGCTGATTTTAGTGCAAACCATGACAAAGCTGGCTGAGTATTACCCTTGCGAGCTT CAATAATCAATGCATCATAGGCGTTATTAACAGCACTCTTCGCCGCGGGACTCGCTGCTA AAAATGCGGCAGTAAGAAGTTTCAAAGCCCATTTGGTTTTCGGGCACCTTTTTCTGCTAC TTGAATACATCCTGTATTACTCCATGTATTGCCAAAATCTCTCTCTGTATCTAATTACAG GTAACTGAAAAGAAAGATATTTTTGCACCTCATAATCCGTTATTAAACGCGGAAGAGAGA CGTGAATTGTTGATGATGAGAAGAAGAAATGATGAGCAGAGTGTCCATATAAAATCCTTT TCTCGCCCGAAAATCCATTCCAATGATGAGGATCTTCAGGAATACGGCATAAATCCCAAT GCCTTTTTCAAAATAAATTAGGATTAAAATAATTAAATCAGTAAATTCCGATGCATGATT Running multiple assemblies ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Do one more assembly -- for example, set the 'k' parameter to 21 (you can set it to any odd number between 19 and 51, if you want to try something different than 21). :: velveth g.21 21 -shortPaired ecoli-reads-5m-dn-paired.fa.gz velvetg g.21 -exp_cov auto Now we have *two* assemblies... the second one should look like this:: Final graph has 2060 nodes and n50 of 6284, max 36734, total 4526331, using 370625/371922 reads Is this better or worse than the k=31 assembly? Why? Generate a few more assemblies -- work with a pal to cover more ground. You should keep track of the velvetg statistics output; if you lose it, you can recover it by doing 'tail g.31/Log'. You can try: - varying k by choosing any odd number between 19 and 51; - removing the '-exp_cov auto' command from 'velvetg'; - adding '-scaffolding no' to the 'velvetg' command; - Adding more read data. Grab this file:: https://s3.amazonaws.com/public.ged.msu.edu/ecoli-reads-5m-dn-orphan.fa.gz using 'curl' as above, and then append '-short ecoli-reads-5m-dn-orphan.fa.gz' to the 'velveth' command line. (The 'velvetg' command doesn't need to change.) Which of these assemblies is "best" by some criterion? Can you find an assembly that is "best" by more than one (unrelated) criterion? Finishing up for today ~~~~~~~~~~~~~~~~~~~~~~ Just leave your EC2 instances running so that we can access the data tomorrow. Tomorrow, we'll cover ways of graphing some of your statistics. One possible project to present on Friday is your analysis of these various assemblies. **Update**: `Here's an IPython Notebook `__ that shows you average length stats for a bunch of assemblies. Questions and thoughts to address ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Things to meditate upon -- - how do we manage complexity? Do we need to understand all these commands? What does each command do? In detail? - why don't we have a nice user interface? Why is everything typing!? - why are we using this Amazon machine rather than the computer in front of us? - what is source code, anyway? - why do the assemblies change when you change k? - why might you get different numbers than me out of the velvet commands, sometime? The data going in isn't changing...? - combinatorial explosion of parameters!!! Reading ~~~~~~~ `Genome sequence assembly primer `__ `What does the k parameter do in assembly? `__ `Assembly algorithms for next-generation sequencing data `__