Showing search results (1–1 of 1):

  1. HS3D
    Full Name of the Resource : Homo Sapiens Splice Sites Dataset
    Resource Category : Databases -> Nucleotide Sequence Databases -> Gene Structure, Introns and Exons, Splice Sites

    Brief Description : In recent years, many computational tools for gene identification and characterization [1,2,3,4,5,6,7,8 and many others], mostly based on machine learning approaches, have been used. In the machine learning approach, a learning algorithm receives a set of training examples, each labelled as belonging to a particular class. The algorithm's goal is to produce a classification rule that correctly assigns new examples to these classes. The success of these methods depends largely on the quality of the data sets used for training [9]. Furthermore, a common data set is necessary when the prediction accuracy of different programs needs to be compared [10,11]. The Irvine Primate Splice Junctions Dataset (UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html) is a de facto standard in the machine learning community [12,13,14,15 and many others], but it is now very out of date and does not include sufficient material for the needs of most learning algorithms. A recent, EST-confirmed data set [16] has the same limitation in the extent of its data. More recently, Burset et al. [17] developed an extensive database, but its data do not include false splice sites (negative examples) and, specifically, proximal false splice sites. The latter are a well-known critical point for classification systems [11]. We developed a new database (HS3D - Homo Sapiens Splice Site Dataset) of Homo Sapiens exon, intron and splice regions. The aim of this data set is to provide standardized material for training computational approaches to gene identification and characterization and for assessing their prediction accuracy. From the complete GenBank Primate Sequences Rel.123 (8436 entries), 697 entries of human nuclear DNA including a gene with complete CDS and with more than one exon have been selected according to assessed selection criteria [18] (file genbank_filtered.inf).
4450 exons and 3752 introns have been extracted from these entries (files exons.seq and introns.seq). Several statistics for these exons and introns are reported (files exons.stat and introns.stat): overall nucleotides, average GC content, number of exons/introns including bases other than A, G, C, T, number of exons/introns in which the annotated end is not found, exon/intron minimum, maximum and average length, exon/intron length standard deviation, number of introns whose sequence does not start with GT, and number of introns whose sequence does not end with AG. Then 3762 + 3762 donor and acceptor sites have been extracted as windows of 140 nucleotides around each splice site. After discarding sequences that do not include canonical GT-AG junctions (176+191), that include insufficient data (not enough material for a 140 nucleotide window) (590+547), or that include bases other than A, G, C, T (30+32), there are 2955+2992 windows (files GT_true.seq and AG_true.seq). Information and several statistics about the splice site extraction are reported (files GT_true.inf, AG_true.inf, GT_true.stat, and AG_true.stat). Finally, there are 287,296+348,370 windows of false splice sites, selected by searching for canonical GT-AG pairs at non-splicing positions. The false sites within a range of +/- 60 from a true splice site are marked as proximal (files GT_false.seq and AG_false.seq; related information: GT_false.inf and AG_false.inf). HS3D is available at the Web server of the University of Sannio: http://www.sci.unisannio.it/docenti/rampone/
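
The extraction steps described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names are hypothetical, and the layout of the 140-nucleotide window (the GT or AG dinucleotide starting 70 nucleotides into the window) is an assumption, since the description only says the window lies "around" the splice site. It also assumes that false sites are all occurrences of the dinucleotide that are not annotated splice sites, and that the +/- 60 range is measured in nucleotides between dinucleotide start positions.

```python
WINDOW = 140         # window length stated in the description
FLANK = 70           # assumption: the dinucleotide starts 70 nt into the window
PROXIMAL_RANGE = 60  # +/- 60 from a true site, as stated in the description


def extract_window(seq, pos):
    """Return the WINDOW-nt window placing seq[pos:pos+2] at offset FLANK,
    or None when the sequence is too short on either side."""
    start = pos - FLANK
    end = start + WINDOW
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def classify_true_site(seq, pos, dinucleotide):
    """Apply the three discard rules in the order given in the description.
    Returns (status, window); window is None unless status == 'ok'."""
    seq = seq.upper()
    if seq[pos:pos + 2] != dinucleotide:
        return "non_canonical", None
    window = extract_window(seq, pos)
    if window is None:
        return "insufficient_data", None
    if set(window) - set("ACGT"):
        return "non_acgt", None
    return "ok", window


def false_sites(seq, dinucleotide, true_positions):
    """Yield (pos, window, is_proximal) for each occurrence of the
    dinucleotide that is not an annotated (true) splice site."""
    seq = seq.upper()
    true_set = set(true_positions)
    pos = seq.find(dinucleotide)
    while pos != -1:
        if pos not in true_set:
            window = extract_window(seq, pos)
            if window is not None and not (set(window) - set("ACGT")):
                proximal = any(abs(pos - t) <= PROXIMAL_RANGE for t in true_set)
                yield pos, window, proximal
        pos = seq.find(dinucleotide, pos + 1)
```

For donor sites one would call these functions with the dinucleotide "GT" and the annotated intron start positions; for acceptor sites, with "AG" and the position of the last two intron bases. The exact offsets used in the HS3D files should be checked against the accompanying .inf files.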
    Subject Area : Homo Sapiens Splice Site


    Institute/s :
    Facoltà di Scienze MM.FF.NN. and INFM, Università del Sannio, Via Port'Arsa 11, I-82100 Benevento, ITALY

    Address of Institute/s :
    Facoltà di Scienze MM.FF.NN. and INFM, Università del Sannio, Via Port'Arsa 11, I-82100 Benevento, ITALY

    Country : Italy

    Associated Institutes :

    • Facoltà di Scienze MM.FF.NN. and INFM, Università del Sannio, Via Port'Arsa 11, I-82100 Benevento, ITALY

    Associated Country : Italy


    Authors/Contributors : Rampone, S.
    Contact Email : rampone@unisannio.it
    Year : 2002
    Language : English

    Keywords : Homo Sapiens Splice Gene Sequence