cQC - Documentation: Program Description

A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence


Return to Basic Submission Form

cDNA sequences are produced by single-pass sequencing and therefore often contain errors. cQC takes cDNA sequences (in FASTA format) and outputs corrected cDNA sequences (the Cleaned Sequences file) based on the genomic sequence of the appropriate organism. cQC uses MegaBLAST (Zhang et al., 2000) to locate the matching genomic sequence, fastacmd (from the NCBI BLAST executable archives) to extract that genomic sequence, and then sim4 (Florea et al., 1998) to align the cDNA sequence back to the genomic sequence. cQC also evaluates the quality of the cDNA sequences: it counts insertions, deletions, and substitutions in the 5’UTR, ORF, and 3’UTR, flags cDNAs that contain frameshifts and premature termination codons caused by errors in the ORF, and writes these data to the Altered cDNA Frequencies file. At present, corrections are possible for rice (Oryza sativa ssp. japonica cv. Nipponbare) and Arabidopsis (Arabidopsis thaliana Columbia) sequences. We hope to incorporate other species into the program in the near future.

Basic Submission Form

This form uses all the values set out in the original study (Hayden et al., submitted) for both rice and Arabidopsis. On this form, you can change the E-value cutoff used by MegaBLAST when aligning the cDNAs to genomic sequence (presently set to require a minimum of ~100 nt of perfect homology) as well as the Percent ID cutoff. This form outputs all files: Cleaned Sequences, Altered cDNA Frequencies, Alignment of Altered cDNAs, Lacking End Similarity, Lacking Internal Similarity, No Genomic Counterpart, Chimeric Sequences, rDNA-Containing Sequences, and IS-Containing Sequences. For customized output files, use the Advanced Submission Form.

 

Advanced Submission Form

This form provides the E-value cutoff, Percent ID cutoff, and the Maximum intron size option.  Output of files can be customized.

E-value cutoff:  Maximum expect value used by MegaBLAST when aligning the cDNAs to genomic sequence (presently set to require a minimum of ~100nt of perfect homology)

Percent ID cutoff:  Minimum percent identity of the MegaBLAST high scoring pairs (HSPs) used in subsequent analysis (clustering of HSPs, and extracting genomic sequence)

Maximum intron size: Defines the maximum genomic distance between two HSPs in the same cluster.
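The interaction between the Percent ID cutoff and the Maximum intron size can be illustrated with a minimal sketch. This is not the cQC source code: the Hsp record, the cluster_hsps function, and the cutoff values used below are hypothetical, and strand is ignored for brevity. HSPs below the identity cutoff are discarded, and the remaining HSPs are grouped so that neighbouring HSPs in a cluster are never further apart on the genome than the maximum intron size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hsp:
    """Hypothetical minimal record for one MegaBLAST HSP."""
    chrom: str
    g_start: int   # genomic start, 1-based inclusive
    g_end: int     # genomic end, 1-based inclusive
    pct_id: float  # percent identity of the HSP


def cluster_hsps(hsps, min_pct_id, max_intron):
    """Drop HSPs below min_pct_id, then group the rest by genomic proximity.

    An HSP joins the current cluster when it lies on the same sequence and
    starts no more than max_intron nt past the cluster's rightmost end.
    """
    kept = sorted(
        (h for h in hsps if h.pct_id >= min_pct_id),
        key=lambda h: (h.chrom, h.g_start),
    )
    clusters = []
    for h in kept:
        last = clusters[-1] if clusters else None
        if (
            last is not None
            and last[-1].chrom == h.chrom
            and h.g_start - max(x.g_end for x in last) <= max_intron
        ):
            last.append(h)
        else:
            clusters.append([h])
    return clusters


# Illustrative values only; these are not the cQC defaults.
example = [
    Hsp("chr1", 1000, 1200, 99.0),
    Hsp("chr1", 1500, 1700, 98.5),
    Hsp("chr1", 50000, 50200, 99.5),
    Hsp("chr1", 1800, 1900, 80.0),  # removed by the identity cutoff
]
for cluster in cluster_hsps(example, min_pct_id=95.0, max_intron=5000):
    print([(h.g_start, h.g_end) for h in cluster])
```

Raising max_intron in this sketch merges the distant third HSP into the first cluster, which mirrors how a larger Maximum intron size keeps the exons of a long-intron gene together.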

Altered cDNA Frequencies:  Insertion, deletion, and substitution counts calculated by comparing the cDNA to the genomic sequence in the 5’UTR, ORF, and 3’UTR. Frameshifts and premature termination codons in the ORF are also tracked. Column headers are: cDNA ID, number of inserted nt in 5’UTR, number of deleted nt in 5’UTR, number of substitutions in 5’UTR, number of inserted nt in ORF, number of deleted nt in ORF, number of substitutions in ORF, number of inserted nt in 3’UTR, number of deleted nt in 3’UTR, number of substitutions in 3’UTR, ORF frameshift (True or False), ORF premature termination codon (True or False). At the end of this file, the total numbers of nucleotides present in the 5’UTR, ORF, and 3’UTR are given for all sequences that contained discrepancies as well as all sequences with no discrepancies (all sequences with a single genomic counterpart with good alignment throughout their length).
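For downstream analysis, one row of this file could be read into a named record along the following lines. This is a sketch under assumptions: the description above gives the column order but not the delimiter, so tab-separated fields are assumed here, and the names AlteredRow, parse_row, and total_changes are hypothetical. The summary totals at the end of the file are not handled.

```python
from typing import NamedTuple


class AlteredRow(NamedTuple):
    """One data row, with fields in the documented column order."""
    cdna_id: str
    ins_5utr: int
    del_5utr: int
    sub_5utr: int
    ins_orf: int
    del_orf: int
    sub_orf: int
    ins_3utr: int
    del_3utr: int
    sub_3utr: int
    frameshift: bool
    premature_stop: bool


def parse_row(line, sep="\t"):
    """Parse one row; the tab delimiter is an assumption, not documented."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    counts = [int(f) for f in fields[1:10]]
    flags = [f.strip().lower() == "true" for f in fields[10:12]]
    return AlteredRow(fields[0], *counts, *flags)


def total_changes(row):
    """Sum the nine insertion, deletion, and substitution counts."""
    return sum(row[1:10])


# Made-up example row for illustration.
row = parse_row("cDNA_001\t0\t1\t2\t1\t0\t3\t0\t0\t1\tTrue\tFalse")
print(row.cdna_id, total_changes(row), row.frameshift, row.premature_stop)
```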

Alignment of Altered cDNAs:  sim4 output (run with A=3) of the cDNA-to-genomic sequence alignments.

Cleaned Sequences:  This file contains more than just the final set of cDNAs described by Hayden et al. It is a compilation of cDNA sequences with no detectable errors, cDNAs with corrected errors (including sequences that are Lacking End Similarity and have had their non-aligning ends truncated), and cDNAs with No Genomic Counterpart. cDNAs from each of these groups can be identified by the description following the cDNA name on the definition line.

Lacking End Similarity: cDNA sequences for which the sim4 alignment does not reach to within 20 nt of the cDNA ends (20 nt are allowed for poor sequence reads or short stretches of remaining vector). Note: these sequences are included in the Cleaned Sequences file, but their non-aligning ends are truncated from the sequence.
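The 20 nt rule and the truncation of non-aligning ends can be sketched as follows. The function names and the 1-based inclusive coordinate convention are assumptions made for illustration, and whether an unaligned end of exactly 20 nt is tolerated is not stated above; this sketch tolerates it.

```python
END_TOLERANCE = 20  # nt allowed to remain unaligned at each cDNA end


def lacks_end_similarity(aln_start, aln_end, cdna_len, tol=END_TOLERANCE):
    """Return True if either cDNA end has more than tol unaligned nt.

    aln_start and aln_end are the 1-based inclusive cDNA coordinates of the
    first and last aligned positions (an assumed convention).
    """
    unaligned_5prime = aln_start - 1
    unaligned_3prime = cdna_len - aln_end
    return unaligned_5prime > tol or unaligned_3prime > tol


def truncate_to_alignment(seq, aln_start, aln_end):
    """Keep only the aligned part of the cDNA, dropping non-aligning ends."""
    return seq[aln_start - 1:aln_end]


# A 500 nt cDNA whose alignment starts at position 40 is flagged,
# because 39 nt at the 5' end do not align.
print(lacks_end_similarity(40, 500, 500))
print(lacks_end_similarity(15, 480, 500))
```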

Lacking Internal Similarity:  cDNA sequences for which sim4 alignment cannot align an internal portion of the cDNA with the extracted genomic sequence.  This may be due to a stretch of poor sequence quality, or short rDNA or IS contamination not detected by the initial MegaBLAST (these would normally be identified as chimeric sequences).

No Genomic Counterpart:  For a given E-value cutoff and Percent ID cutoff, no HSP can be found using MegaBLAST for these cDNA sequences.

Chimeric Sequences:  cDNAs in which two or more distantly separated genomic regions align to non-overlapping regions of the cDNA. Note: if the length of an intron in a cDNA's gene exceeds the Maximum intron size, the cDNA will spuriously appear in this file.
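The chimera criterion can be sketched as a check on the cDNA regions covered by each genomic cluster. This is an illustrative reading of the definition above, not the cQC implementation: the input layout (one list of cDNA intervals per genomic cluster) and the function names are hypothetical.

```python
from itertools import combinations


def cdna_span(cluster):
    """Smallest cDNA interval covering every HSP in one genomic cluster."""
    return min(s for s, _ in cluster), max(e for _, e in cluster)


def looks_chimeric(clusters):
    """Return True if two clusters cover non-overlapping parts of the cDNA.

    clusters is a list with one entry per distant genomic region; each entry
    is a list of (cdna_start, cdna_end) pairs, 1-based inclusive.
    """
    spans = [cdna_span(c) for c in clusters]
    for (s1, e1), (s2, e2) in combinations(spans, 2):
        if e1 < s2 or e2 < s1:
            return True
    return False


# Two genomic regions matching different halves of the cDNA: flagged.
print(looks_chimeric([[(1, 300), (301, 450)], [(460, 900)]]))
# Two genomic regions matching the same cDNA stretch: not flagged.
print(looks_chimeric([[(1, 500)], [(10, 490)]]))
```

Under this reading, a gene whose intron exceeds the Maximum intron size is split into two clusters that cover adjacent, non-overlapping parts of the cDNA, which is why such a cDNA would be reported as chimeric.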

rDNA-Containing Sequences:  rDNA sequences detected in cDNA (these are removed from the dataset).

IS-Containing Sequences:  IS elements detected in cDNA (these must be removed manually from the dataset).

