cDNA sequences are produced by single-pass sequencing and therefore often contain errors. cQC takes cDNA sequences (in FASTA format) and outputs corrected cDNA sequences (the Cleaned Sequences file) based on the genomic sequence of the appropriate organism. cQC uses MegaBLAST (Zhang et al., 2000) to locate the matching genomic sequence, fastacmd (NCBI BLAST executable archives) to extract the genomic sequence, and then runs sim4 (Florea et al., 1998) to align the cDNA sequence back to the genomic sequence. cQC also evaluates the quality of the cDNA sequences by counting insertions, deletions, and substitutions in the 5’UTR, ORF, and 3’UTR, flagging cDNAs that contain frameshifts and premature termination codons due to errors in the ORF, and writing the data to the Altered cDNA Frequencies file. For the moment, corrections are possible in rice (Oryza sativa ssp. japonica cv. Nipponbare) and Arabidopsis (Arabidopsis thaliana).
This form uses all the values set out in the original study (Hayden et al., submitted) for both rice and Arabidopsis. On this form, you can change the E-value cutoff used by MegaBLAST when aligning the cDNAs to genomic sequence (presently set to require a minimum of ~100 nt of perfect homology). The Percent ID cutoff can also be altered on this page. This webpage outputs all files: Cleaned Sequences, Altered cDNA Frequencies, Alignment of Altered cDNAs, Lacking End Similarity, Lacking Internal Similarity, No Genomic Counterpart, Chimeric Sequences, rDNA-Containing Sequences, and IS-Containing Sequences. For customized output files, use the Advanced Submission Form.
E-value cutoff: Maximum expect value used by MegaBLAST when aligning the cDNAs to genomic sequence (presently set to require a minimum of ~100 nt of perfect homology).
Percent ID cutoff: Minimum percent identity of the MegaBLAST high-scoring pairs (HSPs) used in subsequent analysis (clustering of HSPs and extracting genomic sequence).
Maximum intron size: Defines the maximum genomic distance between two HSPs in the same cluster.
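To illustrate how a Maximum intron size setting could group HSPs, here is a minimal sketch. It assumes each HSP has been reduced to a (genomic start, genomic end) pair on a single chromosome and strand, and that the distance is measured from the end of the current cluster to the start of the next HSP. The name cluster_hsps, the tuple representation, and the example coordinates are hypothetical and are not taken from cQC itself.

```python
from typing import List, Tuple

# Hypothetical representation: (genomic_start, genomic_end) of one HSP.
Hsp = Tuple[int, int]


def cluster_hsps(hsps: List[Hsp], max_intron: int) -> List[List[Hsp]]:
    """Group HSPs so that the genomic gap between the end of a cluster
    and the start of the next HSP never exceeds max_intron."""
    clusters: List[List[Hsp]] = []
    cluster_end = 0
    for hsp in sorted(hsps):
        if clusters and hsp[0] - cluster_end <= max_intron:
            # Close enough to the current cluster: treat the gap as an intron.
            clusters[-1].append(hsp)
            cluster_end = max(cluster_end, hsp[1])
        else:
            # Too far away (or first HSP): start a new cluster.
            clusters.append([hsp])
            cluster_end = hsp[1]
    return clusters


# Made-up coordinates: two nearby HSPs and one distant HSP.
example = [(100, 400), (900, 1200), (50000, 50300)]
clusters = cluster_hsps(example, max_intron=5000)
```

With these made-up coordinates, the first two HSPs fall into one cluster and the distant HSP forms a second one; raising max_intron far enough would merge all three.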
Altered cDNA Frequencies: Insertion, deletion, and substitution counts calculated by comparing cDNA to genomic sequence in the 5’UTR, ORF, and 3’UTR. Frameshifts and premature termination codons in the ORF are also tracked. Column headers are: cDNA ID, number of inserted nt in 5’UTR, number of deleted nt in 5’UTR, number of substitutions in 5’UTR, number of inserted nt in ORF, number of deleted nt in ORF, number of substitutions in ORF, number of inserted nt in 3’UTR, number of deleted nt in 3’UTR, number of substitutions in 3’UTR, ORF frameshift (True or False), ORF premature termination codon (True or False). At the end of this file, the total numbers of nucleotides present in the 5’UTR, ORF, and 3’UTR are given, both for all sequences that contained discrepancies and for all sequences with no discrepancies (that is, all sequences with a single genomic counterpart and good alignment throughout their length).
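For downstream processing, one row of this file could be read as sketched below. The column order follows the headers listed above, but the tab delimiter, the class name AlteredCdna, the function parse_row, and the sample row are assumptions made for illustration; check them against an actual output file before relying on them.

```python
from dataclasses import dataclass
from typing import Tuple

Counts = Tuple[int, ...]  # (inserted nt, deleted nt, substitutions)


@dataclass
class AlteredCdna:
    cdna_id: str
    utr5: Counts
    orf: Counts
    utr3: Counts
    frameshift: bool
    premature_stop: bool


def parse_row(line: str) -> AlteredCdna:
    """Parse one data row, ASSUMING 12 tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    counts = [int(value) for value in fields[1:10]]
    return AlteredCdna(
        cdna_id=fields[0],
        utr5=tuple(counts[0:3]),
        orf=tuple(counts[3:6]),
        utr3=tuple(counts[6:9]),
        frameshift=fields[10] == "True",
        premature_stop=fields[11] == "True",
    )


# Made-up row: the ID and the counts are illustrative only.
row = parse_row("cDNA_0001\t0\t1\t2\t1\t0\t3\t0\t0\t1\tTrue\tFalse")
```

The summary totals at the end of the file do not follow this column layout, so a real reader would need to stop before them or handle them separately.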
Alignment of Altered cDNAs: sim4 output (A=3) of the alignment of each altered cDNA to its genomic sequence.
Cleaned Sequences: This file contains more than just the final set of cDNAs described by Hayden et al. It is a compilation of cDNA sequences with no detectable errors, cDNAs with corrected errors (including sequences that are Lacking End Similarity and have had their non-aligning ends truncated), and cDNAs with No Genomic Counterpart. cDNAs from each of these groups can be identified by the description following the cDNA name (definition line).
Lacking End Similarity: cDNA sequences for which the sim4 alignment does not extend to within 20 nt of the cDNA ends (20 nt are allowed for poor sequence reads or short stretches of remaining vector). Note: these sequences are included in the Cleaned Sequences file, but their non-aligning ends are truncated from the sequence.
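The 20 nt rule can be expressed as a small check. This is a sketch under stated assumptions: alignment coordinates are 1-based and inclusive on the cDNA, exactly 20 unaligned nt still counts as acceptable, and the names lacks_end_similarity and truncate_to_alignment are hypothetical rather than part of cQC.

```python
END_TOLERANCE = 20  # nt allowed to remain unaligned at each cDNA end


def lacks_end_similarity(cdna_len: int, aln_start: int, aln_end: int,
                         tolerance: int = END_TOLERANCE) -> bool:
    """Return True if more than `tolerance` nt are unaligned at either end.

    aln_start and aln_end are ASSUMED to be 1-based, inclusive cDNA
    coordinates of the first and last aligned bases.
    """
    unaligned_5prime = aln_start - 1
    unaligned_3prime = cdna_len - aln_end
    return unaligned_5prime > tolerance or unaligned_3prime > tolerance


def truncate_to_alignment(seq: str, aln_start: int, aln_end: int) -> str:
    """Keep only the aligned part of the cDNA (1-based, inclusive)."""
    return seq[aln_start - 1:aln_end]


# Made-up examples: a 1000 nt cDNA aligned over 15..990 passes,
# while one aligned over 45..1000 is flagged.
ok = lacks_end_similarity(1000, 15, 990)
flagged = lacks_end_similarity(1000, 45, 1000)
```

A flagged sequence would then be passed through something like truncate_to_alignment before being written to the Cleaned Sequences file.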
Lacking Internal Similarity: cDNA sequences for which sim4 alignment cannot align an internal portion of the cDNA with the extracted genomic sequence. This may be due to a stretch of poor sequence quality, or short rDNA or IS contamination not detected by the initial MegaBLAST (these would normally be identified as chimeric sequences).
No Genomic Counterpart: For a given E-value cutoff and Percent ID cutoff, no HSP can be found using MegaBLAST for these cDNA sequences.
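The selection described here amounts to a two-threshold filter. The sketch below assumes the MegaBLAST hits have already been parsed into records with a cDNA ID, a percent identity, and an E-value; the record type Hsp, the function names, and the threshold values in the example are hypothetical and are not cQC's defaults.

```python
from typing import Iterable, List, NamedTuple


class Hsp(NamedTuple):
    cdna_id: str
    percent_id: float
    evalue: float


def passing_hsps(hsps: Iterable[Hsp], max_evalue: float,
                 min_percent_id: float) -> List[Hsp]:
    """Keep HSPs that satisfy both the E-value and Percent ID cutoffs."""
    return [h for h in hsps
            if h.evalue <= max_evalue and h.percent_id >= min_percent_id]


def no_genomic_counterpart(cdna_ids: Iterable[str], hsps: Iterable[Hsp],
                           max_evalue: float,
                           min_percent_id: float) -> List[str]:
    """Return the cDNA IDs that have no HSP passing both cutoffs."""
    matched = {h.cdna_id for h in passing_hsps(hsps, max_evalue, min_percent_id)}
    return [cdna for cdna in cdna_ids if cdna not in matched]


# Made-up hits and illustrative thresholds.
hits = [
    Hsp("cDNA_A", 99.5, 1e-120),
    Hsp("cDNA_B", 88.0, 1e-80),   # fails the percent identity cutoff
    Hsp("cDNA_C", 99.0, 1e-5),    # fails the E-value cutoff
]
orphans = no_genomic_counterpart(
    ["cDNA_A", "cDNA_B", "cDNA_C", "cDNA_D"], hits,
    max_evalue=1e-50, min_percent_id=95.0)
```

Loosening either cutoff moves sequences out of this file, which is why the result depends on the values chosen on the form.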
Chimeric Sequences: cDNAs in which two or more distantly separated genomic regions align to non-overlapping regions of the cDNA. Note: if an intron of a cDNA is longer than the Maximum intron size, the cDNA will spuriously appear in this file.
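One way to read this definition is as a test on the cDNA intervals covered by each distant genomic region. The sketch below assumes the HSPs have already been grouped by genomic proximity and that each group has been reduced to the (start, end) interval it covers on the cDNA, in 1-based inclusive coordinates. The function names and example intervals are hypothetical, and cQC's actual criteria may differ.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int]  # (cdna_start, cdna_end), 1-based inclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if two cDNA intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def looks_chimeric(region_spans: List[Span]) -> bool:
    """True if any two distant genomic regions cover non-overlapping
    parts of the same cDNA."""
    return any(not overlaps(a, b)
               for a, b in combinations(region_spans, 2))


# Made-up cases: one genomic region covering the whole cDNA, and two
# distant regions covering separate halves.
single_locus = looks_chimeric([(1, 1200)])
two_loci = looks_chimeric([(1, 600), (601, 1200)])
```

This also shows why an intron longer than the Maximum intron size causes a false positive: the two halves of a genuine transcript end up in separate genomic groups and cover separate parts of the cDNA.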
rDNA-Containing Sequences: rDNA sequences detected in cDNA (these are removed from the dataset).
IS-Containing Sequences: IS elements detected in cDNA (these must be removed manually from the dataset).