Finding where genes begin within genomic sequence is a formidable challenge for projects annotating the human genome. In the Advanced Online Publication of
They developed a new program, called FirstEF, that attempts to predict the starts of genes. They collected more than two thousand first exons to use as a training dataset and characterized those that were associated with a CpG island. FirstEF is designed to recognize CpG islands, promoter regions and first splice-donor sites.
FirstEF could predict 86% of all first exons (92% of CpG-related first exons and 74% of non-CpG-related first exons) with about 17% false positives. It gave a similar performance when tested against the finished sequences of human chromosomes 21 and 22.
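
Since the first exons were grouped by whether they were associated with a CpG island, it may help to see how a stretch of sequence is commonly scored for CpG-island character. The sketch below applies the widely used Gardiner-Garden and Frommer criteria (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6). This is an illustration only: the text does not say how FirstEF itself recognizes CpG islands, and the function names `cpg_stats` and `looks_like_cpg_island` and the default thresholds are assumptions made for this example.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    # Expected CpG count if C and G were placed independently: (C * G) / N
    expected = c * g / n
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5,
                          min_obs_exp: float = 0.6) -> bool:
    """Apply the (assumed) Gardiner-Garden and Frommer thresholds to one window."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp
```

In practice such a test would be applied to sliding windows along the genomic sequence rather than to a single fixed stretch, and a gene-start predictor would combine it with other signals such as promoter and splice-donor scores.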