Homework 6

Genomics (Ecol 553) Computational Lab

Week 8:  Oct 12, 2006.

 

Course webpage: http://genomics.arizona.edu/553/computation

 

 

Homework 6

      To be completed by noon on Thursday, Oct 19. 

 

On amadeus, create a directory called ~/homework/homework6. Place all programs described below into this directory.


1) Write a program, called blast_smallest_pct.pl, which does the following:

            * Reads in the blast result file at /tmp/week8/blast_result.out

            * Prints out the smallest percent identity among all hits.

 



2) Write a program, called k_biggest.pl, which does the following:

                  * Takes a number as an argument (I'll call that number k);

            * Reads the list of numbers from the file /tmp/week8/numbers.txt;

            * Prints the k largest numbers in that file.


            Sample program outputs:

                > perl k_biggest.pl 4

                        2343, 1126, 996, 978

                (i.e. the 4 biggest numbers found in numbers.txt should be printed.)


                > perl k_biggest.pl 50

                2343, 1126, 996, 978, 956, 926, 911, 906, 878, 868 ...

                        (i.e. the 50 biggest numbers found in numbers.txt should be printed.)

                    But be careful: your program should gracefully handle the case in which the argument (k) exceeds the number of entries in numbers.txt.
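
                    One way to approach the "be careful" case is sketched below: sort the numbers in descending numeric order, then shrink k if it is larger than the list. This is only an illustration, not the required solution. The subroutine name k_largest and the demo values are made up for the example, and the sketch works on an in-memory list rather than reading numbers.txt.

```perl
use strict;
use warnings;

# Return the k largest values from a list, largest first.
# If k exceeds the list length, the whole sorted list is returned.
# (k_largest is a hypothetical name chosen for this sketch.)
sub k_largest {
    my ($k, @numbers) = @_;
    my @sorted = sort { $b <=> $a } @numbers;   # numeric sort, descending
    $k = @sorted if $k > @sorted;               # clamp k to the list size
    return @sorted[0 .. $k - 1];                # empty when k is 0
}

# Made-up values for illustration; not the contents of numbers.txt.
my @demo = (12, 7, 30, 4);
print join(", ", k_largest(2, @demo)), "\n";    # prints "30, 12"
print join(", ", k_largest(10, @demo)), "\n";   # prints "30, 12, 7, 4"
```

                    Note the { $b <=> $a } block: without it, sort compares values as strings, so 996 would sort ahead of 1126.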



3) Write a program, called highest_gc.pl, which does the following:

                  * Opens the file /tmp/week8/gene_composition.csv

                         This is a comma-delimited file. Each row represents a gene, and contains the fields (in this order) :

                                           * name

                                           * number of adenines

                                           * number of cytosines

                                                   * number of guanines

                                           * number of thymines

                  * For each row, calculate the gc%, that is, (#Gs + #Cs) / (#Gs + #Cs + #As + #Ts)

            * Keep track of the highest gc%, and print the name and gc% of the gene that has it.
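
            The row calculation and the "keep track of the highest" idea can be sketched as follows. Treat this as one possible shape rather than the expected answer. The subroutine name gc_of_row, the variable names, and the three demo rows are invented for the example; the real program would read its rows from gene_composition.csv instead.

```perl
use strict;
use warnings;

# Split one comma-delimited row (name, #As, #Cs, #Gs, #Ts) and return
# the gene name together with its gc fraction.
# (gc_of_row is a hypothetical name chosen for this sketch.)
sub gc_of_row {
    my ($row) = @_;
    chomp $row;
    my ($name, $n_a, $n_c, $n_g, $n_t) = split /,/, $row;
    my $total = $n_a + $n_c + $n_g + $n_t;
    my $gc = $total ? ($n_g + $n_c) / $total : 0;   # avoid dividing by zero
    return ($name, $gc);
}

# Made-up rows for illustration; not taken from gene_composition.csv.
my @rows = ("geneX,10,20,30,40", "geneY,5,5,5,35", "geneZ,10,35,35,20");

my ($best_name, $best_gc) = ('', -1);   # -1 is below any possible gc fraction
for my $row (@rows) {
    my ($name, $gc) = gc_of_row($row);
    if ($gc > $best_gc) {               # new maximum so far
        ($best_name, $best_gc) = ($name, $gc);
    }
}
printf "%s %.2f\n", $best_name, $best_gc;   # prints "geneZ 0.70"
```

            The counts are deliberately not stored in variables called $a or $b, since Perl reserves those two names for sort comparisons.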





4) Write a program, called highest_k_gc.pl, which does the following:

                  * Uses the same file as in #3

                  * Takes a number as an argument (I'll call that number k)

             * Prints out the top k gc%s (don't worry about listing the associated gene names)

                        Hints:

                                This will require that you create an array to hold the gc%s, and add an entry to that array for each line in the input.

                                You'll need to use push (or unshift) for this purpose. Details about push are found on pages 100-102 (chapter 3) of the text (and will be discussed in Tuesday's class). 

                                Once you have the entire array filled in, the problem is roughly the same as #2.
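
                                As a small illustration of the hint, here is how push and unshift grow an array one entry at a time. The numbers are made up and simply stand in for the gc%s you would compute from each input line.

```perl
use strict;
use warnings;

my @values;                      # starts out empty
for my $x (0.41, 0.58, 0.37) {   # made-up numbers standing in for per-line gc%s
    push @values, $x;            # push appends to the end of the array
}
unshift @values, 0.99;           # unshift adds to the front instead
print scalar(@values), " entries: @values\n";   # prints "4 entries: 0.99 0.41 0.58 0.37"
```

                                For this problem the order in which entries are added does not matter, because the array gets sorted afterwards anyway.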







5) Write a program, called fetch_seqs.pl, which does the following:

            * Reads in the contents of /tmp/week8/misc_genes.fa, storing them in a hash with the name as the key and the sequence as the value, e.g.:

                                  >14_2dorA
                          LSVKLPGLNLKNPIMPASGCFGFGKEYSEY ...


                            would be stored in the hash as:

                                  $genes{'14_2dorA'} = "LSVKLPGLNLKNPIMPASGCFGFGKEYSEY ..."

             * Accepts a list of arguments, where each argument is a gene name.

             * For each gene name argument, prints out the gene name and the first 20 characters of the corresponding sequence.

            (Hashes are covered on pages 104-109 in chapter 3 of the reading, and will be discussed on Tuesday.)


                     Examples of the program in action:

                      >perl fetch_seqs.pl 14_2dorA
                      14_2dorA : LSVKLPGLNLKNPIMPASGC

                      >perl fetch_seqs.pl 14_2dorA  17_2dorA  33_1ep3A
                      14_2dorA : LSVKLPGLNLKNPIMPASGC
                      17_2dorA : MLETSICNIELRNPTILAAG
                      33_1ep3A : DLHVTIPSGLHGLELKNPVM
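
                     The hash-building step described above might look roughly like the sketch below. It makes several assumptions that the assignment does not state: that a sequence may span more than one line, that the name is whatever follows '>' up to the first whitespace, and that an unknown name should produce a "not found" message. The subroutine name read_fasta and the two demo records are made up, and the sketch reads from an in-memory string rather than misc_genes.fa.

```perl
use strict;
use warnings;

# Parse FASTA-formatted text from a filehandle into a hash: name => sequence.
# (read_fasta is a hypothetical name chosen for this sketch.)
sub read_fasta {
    my ($fh) = @_;
    my (%genes, $name);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {     # header line: the name follows '>'
            $name = $1;
            $genes{$name} = '';
        }
        elsif (defined $name) {       # sequence line: append to the current record
            $genes{$name} .= $line;
        }
    }
    return %genes;
}

# Made-up two-record input, read from an in-memory string instead of misc_genes.fa.
my $text = ">geneA\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\n>geneB\nMSE\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my %genes = read_fasta($fh);
close $fh;

# In the real program these names would come from @ARGV.
for my $name ('geneA', 'geneB', 'geneC') {
    if (exists $genes{$name}) {
        print "$name : ", substr($genes{$name}, 0, 20), "\n";
    }
    else {
        print "$name : not found\n";
    }
}
```

                     Names such as 14_2dorA begin with a digit, so they need quotes when written literally as a hash key; a key held in a variable, as in $genes{$name}, needs no quoting.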