Potential introns are identified and assigned scores based on:
1) how well each of the sequences matches the consensus
2) the spacing between the branch-point and the right junction, and
3) the length of the resulting intron.
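The three criteria above can be sketched in code. This is a minimal illustration, not the program's actual scoring: the motifs are the widely cited S. cerevisiae signals (GTATGT, TACTAAC, AG), while the function names, the allowed ranges and the equal weighting are assumptions made here for the example. The text does not give the real weights.

```python
# Well-known S. cerevisiae splice signals, used here only for illustration.
DONOR = "GTATGT"      # left (5') junction
BRANCH = "TACTAAC"    # branch point
ACCEPTOR = "AG"       # right (3') junction, last two bases of the intron


def match_fraction(site, consensus):
    """Fraction of positions at which `site` agrees with `consensus`."""
    if len(site) != len(consensus):
        return 0.0
    return sum(a == b for a, b in zip(site, consensus)) / len(consensus)


def score_intron(seq, start, branch_pos, end,
                 spacing_range=(10, 60), length_range=(50, 1000)):
    """Score the candidate intron seq[start:end] (half-open span).

    The ranges and the equal weighting are hypothetical placeholders.
    """
    donor = seq[start:start + len(DONOR)]
    branch = seq[branch_pos:branch_pos + len(BRANCH)]
    acceptor = seq[end - len(ACCEPTOR):end]

    # 1) agreement of each signal with its consensus
    consensus_score = (match_fraction(donor, DONOR)
                       + match_fraction(branch, BRANCH)
                       + match_fraction(acceptor, ACCEPTOR)) / 3

    # 2) spacing between the branch point and the right junction
    spacing = end - (branch_pos + len(BRANCH))
    spacing_ok = spacing_range[0] <= spacing <= spacing_range[1]

    # 3) length of the resulting intron
    length = end - start
    length_ok = length_range[0] <= length <= length_range[1]

    return (consensus_score + float(spacing_ok) + float(length_ok)) / 3


# Example: a 75-base candidate with perfect signals and a 20-base spacer.
example = "AAA" + DONOR + "C" * 40 + BRANCH + "T" * 20 + ACCEPTOR + "AAA"
example_score = score_intron(example, start=3, branch_pos=49, end=78)
```

A real implementation would most likely weight the individual positions of each motif rather than count plain matches.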
The versions available here have been optimized for Aspergillus niger and Saccharomyces cerevisiae genomic DNA. Since most A.
niger genes are interrupted by several (up to 8) introns, the A. niger program was designed to handle potential open reading frames
(ORFs) containing up to 10 introns. The yeast program identifies ORFs having up to two introns. The following strategy is employed:
1) ORFs are constructed from each start site to the first in-frame stop codon.
2) These ORFs are then assigned a score based on their length.
3) Introns are sought which either excise the stop or alter the reading frame.
4) The next in-frame stop codon is identified.
5) The new ORF is given a score based on length and intron score.
6) Steps 3 to 5 are repeated for each extended ORF, up to the intron limit.
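The strategy above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names (spliced, first_stop, to_genomic, make_orf, extend) are invented for the example, introns are half-open genomic spans, and the score (codon count plus the sum of intron scores) is a placeholder for the program's empirically determined values.

```python
STOPS = {"TAA", "TAG", "TGA"}


def spliced(seq, start, introns):
    """Sequence from `start` onward with the given introns removed."""
    parts, pos = [], start
    for i_start, i_end in sorted(introns):
        parts.append(seq[pos:i_start])
        pos = i_end
    parts.append(seq[pos:])
    return "".join(parts)


def first_stop(mrna):
    """Index of the first in-frame stop codon in `mrna`, or None."""
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOPS:
            return i
    return None


def to_genomic(start, introns, mrna_index):
    """Map an index in the spliced sequence back to a genomic coordinate."""
    g = start + mrna_index
    for i_start, i_end in sorted(introns):
        if i_start <= g:
            g += i_end - i_start
    return g


def make_orf(seq, start, introns, intron_scores):
    """Steps 1, 2, 4 and 5: find the stop and score the ORF (placeholder score)."""
    stop = first_stop(spliced(seq, start, introns))
    if stop is None:
        return None
    return {"start": start, "introns": list(introns),
            "intron_scores": list(intron_scores), "stop_index": stop,
            "score": stop // 3 + sum(intron_scores)}


def extend(seq, orf, intron, intron_score):
    """Step 3: accept an intron only if it excises the stop or shifts the frame."""
    i_start, i_end = intron
    # The intron must begin after the start codon and any earlier intron.
    last_end = max((e for _, e in orf["introns"]), default=orf["start"] + 3)
    stop_end = to_genomic(orf["start"], orf["introns"], orf["stop_index"] + 2) + 1
    if not last_end <= i_start < stop_end:
        return None
    excises_stop = i_end >= stop_end
    shifts_frame = (i_end - i_start) % 3 != 0
    if not (excises_stop or shifts_frame):
        return None
    return make_orf(seq, orf["start"], orf["introns"] + [intron],
                    orf["intron_scores"] + [intron_score])


# Example: the intron at 6..15 removes the first stop (TAA) and extends the ORF.
seq = "ATGAAA" + "TAAGTCCAG" + "CCCTGA"
orf = make_orf(seq, 0, [], [])
longer = extend(seq, orf, (6, 15), 1.0)
```

Calling extend again on the returned ORF corresponds to step 6; a driver loop would stop once the intron limit is reached.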
Since billions of hypothetical ORFs can be constructed from a few kilobases of A. niger genomic sequence, a few restrictions were applied. First, only the 25 top-scoring A. niger ORFs are stored in memory. For yeast, the top 100 ORFs are kept. Second, only potential introns which exceed a certain score are considered. (Reducing this stringency level results in a much slower program.)
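One common way to apply both limits is a bounded min-heap plus a score cutoff. The sketch below is an assumption about how this could be done, not a description of the program's internals; the class name TopORFs, the function passes_threshold and the cutoff value are hypothetical.

```python
import heapq


class TopORFs:
    """Keep only the `capacity` best-scoring ORFs seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []       # min-heap of (score, counter, orf)
        self._counter = 0     # tie-breaker so ORF records are never compared

    def offer(self, score, orf):
        """Store `orf` if there is room or it beats the current worst entry."""
        entry = (score, self._counter, orf)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if score > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def best(self):
        """Stored ORFs as (score, orf) pairs, highest score first."""
        return [(s, o) for s, _, o in sorted(self._heap, reverse=True)]


def passes_threshold(intron_score, min_score):
    """Only introns whose score exceeds the cutoff are considered."""
    return intron_score > min_score


# Example: a store of capacity 3 (the text uses 25 for A. niger, 100 for yeast).
top = TopORFs(capacity=3)
for s in [5, 1, 9, 3, 7]:
    top.offer(s, f"orf{s}")
kept = top.best()
```

Lowering min_score admits more candidate introns at every extension step, which is consistent with the slowdown noted above.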
Note also that the sequence contexts of the start and stop sites are not taken into consideration in this version of the program. This is mainly to streamline processing: queries from the net are expected to be relatively small (kb, not Mb), whereas DNA sequence information up to 1 kb upstream of the start site is required to evaluate the likelihood that a start site is authentic. However, in A. niger at least, most authentic start sites are easily identified.
The parameters used in the A. niger program are based on the author's study of all A. niger sequences available in 1996. The
numerical values assigned to the various parameters were determined empirically. The parameters used in the yeast program are based on the
author's analysis of 33 intron-containing S. cerevisiae genes. Continuing studies of gene sequences may result in further optimization of
these values.