Homework 8

Genomics (Ecol 553) Computational Lab

Week 13:  Nov 14, 2006.

 

Course webpage: http://genomics.arizona.edu/553/computation

 

 

Homework 8

      To be completed by noon on Tuesday, Nov 21. 

 

On amadeus, create a directory called ~/homework/homework8. Place the program described below into this directory.  ** Do not copy the input files into your directory. **


1) Write a program, called parse_longform_blast.pl, which does the following:

            * Using BioPerl, reads in the file at /tmp/week13/blast_longform.out

            * For each query sequence,

                    * identify the hsp that covers the greatest length of that query sequence  (there may be more than one with the same maximum length - pick one)

                              (How much of the query sequence is covered by an hsp?  See the BioPerl docs:  $len = $hsp->length( 'query' ); )

                    * having identified the hsp for the current query sequence, print out the following:


                            - query sequence name

                            - hit sequence name

                            - length of the query sequence covered by the hsp

                            - an alignment of the query and hit sequences making up that hsp
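
The steps above can be sketched roughly as follows. This is a minimal sketch, not a reference solution: it assumes Bio::SearchIO with the 'blast' format can parse the long-form report, it breaks ties by keeping the first hsp seen, and it silently skips any query that has no hits (the assignment does not say what to do in that case). The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SearchIO;

# Input path given in the assignment; it is read in place, not copied.
my $blast_file = '/tmp/week13/blast_longform.out';

my $searchio = Bio::SearchIO->new(
    -format => 'blast',
    -file   => $blast_file,
);

# next_result returns one result object per query sequence in the report.
while ( my $result = $searchio->next_result ) {
    my ( $best_hit, $best_hsp );
    my $best_len = 0;

    # Look at every hsp of every hit for this query.
    while ( my $hit = $result->next_hit ) {
        while ( my $hsp = $hit->next_hsp ) {
            my $len = $hsp->length('query');

            # Strict '>' keeps the first hsp seen when lengths tie.
            if ( $len > $best_len ) {
                ( $best_hit, $best_hsp, $best_len ) = ( $hit, $hsp, $len );
            }
        }
    }

    # Assumption: a query with no hits produces no output.
    next unless defined $best_hsp;

    print "Query = ",        $result->query_name, "\n";
    print "\"Best\" hit = ", $best_hit->name,     "\n";
    print "Length = $best_len\n";
    print "Q: ", $best_hsp->query_string, "\n";
    print "H: ", $best_hsp->hit_string,   "\n\n";
}
```

Tracking the hit alongside the hsp matters because the hit name is printed later, after the inner loops have moved on to other hits.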



       sample (bogus) output:

        Query = sequence1
        "Best" hit = ENSG00000198767.ENSP00000355023
        Length = 120
       Q: KKIACPHKGCNKHFRDSSAMRKHLHTHGPRVHVCAECGKAFVESSKLKRHQLVHTGEKPFQCTFEGCGKRFSLDFNLRTHVRIHTGDRPFVCPFDACNKKFAQSTNLKSHILTHAKAKRN
       H: KTVPCSYSGCEKMFRDYAAMRKHLHIHGPRVHVCAECGKAFLESSKLRRHQLVHTGEKPFQCTFEGCGKRFSLDFNLRT----------TGDKPFVCPFDVCNRKFAQSTNLKTHILTHVKTKNN



        Query = sequence2
        "Best" hit = ENSG00000198768.ENSP00000345015
        Length = 38
        Q: KKIACPHKGCNKHFRDS----KHLHTHGPRVHVCAE--------SKLKRH
        H: KTVPCSYSG--------AAMRKHLHIHGPRVHVCAECGKAFLESS-----H


        etc.


       Note: the "alignment" need not contain the "homology string" that separates the sequences in the blast output file. Also, the sequence may very well overflow the first line so that you get a sloppy looking "alignment" (see below for an example). That's ok for the purposes of this assignment.

An example of a "sloppy looking" alignment (which is ok for this assignment):

        Query = sequence1
        "Best" hit = ENSG00000198767.ENSP00000355023
        Length = 120
       Q: KKIACPHKGCNKHFRDSSAMRKHLHTHGPRVHVCAECGKAFVESSKLKRHQLVHTGEKPFQCTFEGCGKRFSL
       DFNLRTHVRIHTGDRPFVCPFDACNKKFAQSTNLKSHILTHAKAKRN
       H: KTVPCSYSGCEKMFRDYAAMRKHLHIHGPRVHVCAECGKAFLESSKLRRHQLVHTGEKPFQCTFEGCGKRFSL
       DFNLRT----------TGDKPFVCPFDVCNRKFAQSTNLKTHILTHVKTKNN