Seminar on Monday, May 29, 2017

May 29, 2017 @ 10:00 am - 11:00 am

Construction of Ensembles by Exploiting the Richness of Feature Variables in High-Dimensional Data with Application in Protein Homology

Full Title: Construction of Ensembles by Exploiting the Richness of Feature Variables in High-Dimensional Data with Application in Protein Homology
Speaker: Dr. Jabed Tomal
Assistant Professor
Department of Computer and Mathematical Sciences
The University of Toronto Scarborough, Canada.
Date/Time: Monday, May 29, 2017, 10 a.m.
Venue: ISRT seminar room

 

ABSTRACT

High-dimensional data may contain complementary subsets of useful feature variables that could be valuable in predicting a response. In this work, I have developed a prediction model that exploits the richness of information contained in the complementary subsets of useful feature variables in high-dimensional data. The proposed model, which is an aggregated collection of logistic regression models (LRMs), is called an ensemble, where each constituent LRM is fitted to a subset of feature variables. An algorithm is developed to cluster the feature variables into subsets such that the variables within a subset work well together in a single LRM, while the variables in different subsets work better in separate LRMs. Each subset of variables is called a “phalanx”, and the resulting ensemble is called an “ensemble of phalanxes (EPX)”. The strength of the ensemble depends on the algorithm’s ability to identify strong and diverse subsets of feature variables.
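As a rough illustration of the ensemble structure described above, the sketch below fits one logistic regression per feature subset and averages the predicted probabilities. It is only a minimal sketch under stated assumptions: the phalanxes are hand-specified here rather than found by the clustering algorithm from the talk, simple averaging is assumed as the aggregation rule, the data are synthetic, and all function names (`fit_lrm`, `fit_ensemble`, `predict_ensemble`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lrm(X, y, cols, steps=500, lr=0.5):
    """Fit a logistic regression on the columns in `cols` by batch gradient ascent."""
    w = [0.0] * (len(cols) + 1)  # intercept plus one weight per column
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            feats = [1.0] + [row[c] for c in cols]
            err = label - sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
            for j, fj in enumerate(feats):
                grad[j] += err * fj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    return w


def predict_lrm(w, row, cols):
    feats = [1.0] + [row[c] for c in cols]
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))


def fit_ensemble(X, y, phalanxes):
    # One LRM per phalanx (subset of feature columns).
    return [(cols, fit_lrm(X, y, cols)) for cols in phalanxes]


def predict_ensemble(models, row):
    # Assumed aggregation rule: plain average of the constituent probabilities.
    probs = [predict_lrm(w, row, cols) for cols, w in models]
    return sum(probs) / len(probs)


# Synthetic data: a rare positive class (about 20%) and four noisy features.
random.seed(0)
X, y = [], []
for _ in range(200):
    label = 1 if random.random() < 0.2 else 0
    X.append([random.gauss(1.5 * label, 1.0) for _ in range(4)])
    y.append(label)

phalanxes = [[0, 1], [2, 3]]  # hand-picked subsets, for illustration only
models = fit_ensemble(X, y, phalanxes)
scores = [predict_ensemble(models, row) for row in X]
```

The resulting `scores` can be used to rank observations, with higher values indicating a stronger predicted chance of belonging to the rare class.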
Homologous proteins are considered to have a common evolutionary origin, i.e., the bearers of homologous proteins share a common ancestor. To develop an evolutionary sequence of proteins, a scientist needs to predict their biological homogeneity. The proposed ensemble is applied to the protein homology data, obtained from the 2004 KDD Cup competition, and used to predict the biological homogeneity of proteins. In this application, the feature variables are various scores representing structural similarity and amino acid sequence identity of proteins. The underlying assumption for model building is that structural similarity and amino acid sequence identity are predictive of proteins’ biological homogeneity. As homologous proteins are rare, the prediction performance of the ensemble is evaluated by checking its ability to rank the rare homologous proteins ahead of the non-homologous proteins. While the prediction performance of an EPX is competitive with contemporary state-of-the-art ensembles, a large improvement in prediction performance is achieved by aggregating two diverse EPXs obtained by optimizing two complementary evaluation metrics. Here, the algorithm and the complementary metrics guaranteed increased strength and diversity, respectively, among the ensembles of phalanxes being aggregated. Importantly, the performance of the aggregate of the two EPXs is robust relative to either individual EPX when one EPX is good at detecting close homologs and the other is good at detecting distant homologs. Using parallel computing, the proposed ensemble is also shown to be computationally efficient.
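The ranking-based evaluation and the aggregation of two ensembles can be sketched as follows. This is an illustration only: the abstract does not name the evaluation metrics, so average precision is assumed here as one common rank-based metric; averaging the two score vectors is an assumed aggregation rule; and the scores are invented toy values, not results from the talk.

```python
def average_precision(scores, labels):
    """Average precision of the ranking induced by `scores` (higher is better)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / hits if hits else 0.0


# Toy example: two homologs (label 1) among six candidate proteins.
labels = [1, 0, 0, 1, 0, 0]
epx_a = [0.9, 0.2, 0.3, 0.1, 0.4, 0.05]  # hypothetical EPX that ranks only the first homolog highly
epx_b = [0.3, 0.2, 0.5, 0.8, 0.1, 0.05]  # hypothetical EPX that ranks the second homolog highest

# Assumed aggregation rule: average the scores of the two ensembles.
combined = [(a + b) / 2 for a, b in zip(epx_a, epx_b)]

ap_a = average_precision(epx_a, labels)
ap_b = average_precision(epx_b, labels)
ap_combined = average_precision(combined, labels)
```

In this toy case each ensemble alone ranks one homolog well and the other poorly, while the averaged scores place both homologs at the top, which is the kind of complementarity the abstract describes.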
