Homework 7

Genomics (Ecol 553) Computational Lab

Week 10:  Oct 25, 2006.

 

Course webpage: http://genomics.arizona.edu/553/computation

 

 


      To be completed by noon on Thursday, Nov 2. 

 

On amadeus, create a directory called ~/homework/homework7. Place all programs described below into this directory.  ** Do not copy the input files into your directory. **


1) Write a program, called 99_words.pl, which does the following:

            * Reads in the file at /tmp/week10/99redballoons.lyr

            * Prints out a list of the words used in the song (in alphabetical order), and the number of times each was used.
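As a rough illustration of the counting idea (not a full solution), the sketch below tallies words with a hash and prints them in sorted order. The sub name count_words, the sample sentence, and the definition of a "word" (a case-insensitive run of letters, digits, or apostrophes) are all assumptions made for this example; the assignment itself does not define them.

```perl
use strict;
use warnings;

# Count how often each word occurs in a block of text.
# Assumption: a "word" is a run of letters, digits, or apostrophes,
# compared case-insensitively.
sub count_words {
    my ($text) = @_;
    my %count;
    for my $word (lc($text) =~ /[a-z0-9']+/g) {
        $count{$word}++;
    }
    return \%count;
}

# Made-up sample text standing in for the contents of the lyrics file.
my $counts = count_words("The cat saw the dog. The dog ran.");

# "sort keys" gives the words in alphabetical order.
for my $word (sort keys %$counts) {
    print "$word\t$counts->{$word}\n";
}
```

A hash is a natural fit here because each word is looked up and incremented in one step, and the keys can be sorted once at the end.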

 



2) This problem looks at the impact of file I/O decisions.  Write two programs, each of which does the following:

            * Take a single filename as an argument

            * Read the file and count the number of lines in the file, then print out that number.


     The two programs are:

      a) count_lines_using_cat.pl  -  this should use the cat method we've used up to this week [ $var = `cat $filename`; @arr = split("\n", $var) ;  ], then print out the number of elements in the array.

      b) count_lines_using_fh.pl  -   this should use file handles as we discussed on Tuesday  [ open (FH, "<$filename"); while( ... ) { }  ],  and count the number of lines by incrementing a variable named $i.
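For reference, the file handle idiom can be sketched as below. This is only one possible shape, assuming a lexical file handle and the three-argument form of open rather than the bareword FH shown above; the sub name count_lines_fh and the temporary test file are made up for the example. Unlike the cat approach, which holds the entire file in memory at once, this loop holds one line at a time.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count lines by reading one line at a time through a file handle.
sub count_lines_fh {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my $i = 0;
    while (my $line = <$fh>) {
        $i++;
    }
    close($fh);
    return $i;
}

# Build a small throwaway file so the sketch runs on its own.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "line one\nline two\nline three\n";
close($tmp_fh);

print count_lines_fh($tmp_name), "\n";
```

Checking the return value of open (the "or die" part) is worth the extra few characters: without it, a mistyped filename silently yields a count of zero.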

   

              After you've written these scripts (and are sure they work correctly), you can test their relative speeds using the unix command "time". Run these two commands:
              >  time  perl  count_lines_using_cat.pl  /tmp/week10/swissprot_000.seq
              -and-
              >  time  perl  count_lines_using_fh.pl  /tmp/week10/swissprot_000.seq

              The output of these commands will be whatever the perl script outputs, followed by a line that looks like this:
               2.72u 0.83s 0:03.73 95.1%
              The first two entries are the CPU seconds the script spent in user mode and in system (kernel) mode. The third entry is the elapsed wall-clock time, which tells you how long the script took to finish.  You can ignore the rest, if you want (or read all about them by running "man time").

               Create a file called file_io.txt. In that file, write a 1 or 2 sentence summary of the results you see in the comparison of count_lines_using_cat.pl and count_lines_using_fh.pl.

               When you've done this, and are comfortable that you understand the results you've seen, do the following:
               * Run the same two commands as above, but using /tmp/week10/swissprot_000.ref instead of /tmp/week10/swissprot_000.seq
                 (Please don't do this more than twice. The file is huge - ~360MB - and will drag the system to a crawl.)
               * Think of another way you know to count the number of lines in a file. Time it on each of the two swissprot files. How does it compare? Are you surprised?

                In file_io.txt, write another 2 or 3 sentences summarizing the results you see in these two tests.


3) Write a program, called filter_blast.pl, which does the following:

            * Reads in the file at /tmp/week10/blast_results.out

            * Prints out to a file "147.blastout" all lines that contain the string "ENSG00000198223", and have an alignment length of 147.

            * Prints out to a file "343_369.blastout" all lines that contain the string "ENSG000001982", and have an alignment length of 343 or 369.
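The two-part test on each line (contains a string, and has a given alignment length) can be sketched as a small helper. This assumes blast_results.out is tab-delimited with the alignment length in the fourth column, as in the common tabular BLAST layout; the handout does not say so, so check the real file first. The helper name line_matches and the sample lines are invented for illustration, and the sketch prints to the screen rather than to the output files.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line contains $id and its alignment
# length (ASSUMED to be the 4th tab-separated field) is one of @lengths.
sub line_matches {
    my ($line, $id, @lengths) = @_;
    chomp $line;
    return 0 unless index($line, $id) >= 0;
    my @fields = split /\t/, $line;
    return 0 unless defined $fields[3];
    return (grep { $fields[3] eq $_ } @lengths) ? 1 : 0;
}

# Invented sample lines, not taken from the real input file.
my @sample = (
    "seq1\tENSG00000198223.p1\t98.64\t147\t2\t0",
    "seq1\tENSG00000198223.p2\t91.20\t150\t9\t1",
    "seq2\tENSG00000198299.p1\t88.05\t369\t40\t2",
);

for my $line (@sample) {
    print "147: $line\n"     if line_matches($line, "ENSG00000198223", 147);
    print "343/369: $line\n" if line_matches($line, "ENSG000001982", 343, 369);
}
```

To send the matching lines to a file instead, one would open a second file handle for writing (for example with open and the '>' mode) and print to that handle.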



4) Write a program, called parse_longform_blast.pl, which does the following:

            * Reads in the file at /tmp/week10/blast_longform.out

            * For each hit associated with the query for "sequence3", print a line containing (tab-delimited):

                      * The name of the matching sequence

                      * The length, score, expectation and %-identities of the hit

            * As an example, the first line of your output should look like this:

                        ENSG00000147027.ENSP00000275954   181   322    1e-89    86%
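One way to approach a multi-line report like this is to walk through it line by line, remembering which query and which hit you are currently inside. The sketch below assumes the classic NCBI BLAST text layout ("Query=", ">name", "Length =", "Score = ... Expect = ...", "Identities = ..."), and assumes the length wanted is the one on the "Length =" line under each hit; neither is confirmed by the handout. The sub name parse_hits and the sample report are invented, with numbers chosen to mirror the example line above.

```perl
use strict;
use warnings;

# Collect one tab-delimited row per hit for the named query.
# The line patterns are ASSUMPTIONS about the report layout.
sub parse_hits {
    my ($text, $query) = @_;
    my @rows;
    my $in_query = 0;
    my %hit;
    for my $line (split /\n/, $text) {
        if ($line =~ /^Query=\s*(\S+)/) {
            $in_query = ($1 eq $query) ? 1 : 0;   # entering a new query block
        }
        elsif (!$in_query) {
            next;                                 # skip other queries
        }
        elsif ($line =~ /^>(\S+)/) {
            %hit = (name => $1);                  # start of a new hit
        }
        elsif ($line =~ /^\s*Length\s*=\s*(\d+)/) {
            $hit{length} = $1;
        }
        elsif ($line =~ /Score\s*=\s*(\S+)\s+bits.*Expect\s*=\s*([^\s,]+)/) {
            @hit{qw(score expect)} = ($1, $2);
        }
        elsif ($line =~ /Identities\s*=\s*\S+\s+\((\d+%)\)/ && defined $hit{name}) {
            push @rows, join("\t", @hit{qw(name length score expect)}, $1);
        }
    }
    return @rows;
}

# Invented miniature report for illustration only.
my $sample = <<'END';
Query= sequence2
>SOME.OTHER.HIT
          Length = 50
 Score =  40 bits (90), Expect = 0.002
 Identities = 30/50 (60%), Positives = 35/50 (70%)

Query= sequence3
>ENSG00000147027.ENSP00000275954
          Length = 181
 Score =  322 bits (825), Expect = 1e-89
 Identities = 156/181 (86%), Positives = 165/181 (91%)
END

print "$_\n" for parse_hits($sample, "sequence3");
```

Keeping the "which query am I in" flag separate from the per-hit hash makes it straightforward to extend the loop to every query, since only the test on the "Query=" line would need to change.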


            extra credit -

                    Modify parse_longform_blast.pl to loop through all the query sequences in blast_longform.out, printing out the same stats as above for each one.

                    Be sure to clearly mark where the stats for one query end and the stats for the next query begin.