## EvoWorkshops2006: EvoBIO

### 4th European Workshop on Evolutionary Computation and Machine Learning in Bioinformatics

EvoBIO is the fourth workshop organized by the EvoNet working group on bioinformatics. EvoBIO covers research in all aspects of computational intelligence in bioinformatics and computational biology. The emphasis is on algorithms based on evolutionary computation, on neural networks and on other novel optimisation and machine learning methods, that address important problems in molecular biology, genomics and genetics, that are computationally efficient, and that have been implemented and tested in simulations and on real datasets.

#### Organising Committee

Program Chairs
Carlos Cotta
ccottap AT lcc DOT uma DOT es
University of Malaga (Spain)

Jason Moore
Jason.h.Moore AT Dartmouth DOT edu
Dartmouth College - CGL (USA)

EvoWorkshops2006 Chair
Franz Rothlauf
rothlauf AT uni-mannheim DOT de
University of Mannheim, Germany

Local Chair
Anikó Ekárt
ekart AT sztaki DOT hu

Publicity Chair
Steven Gustafson
smg AT cs DOT nott DOT ac DOT uk
University of Nottingham, UK

#### Programme Committee

Jesus Aguilar, Spain
Jacek Blazewicz, Poland
David Corne, UK
Vincenzo Cutello, Italy
Gary Fogel, USA
James Foster, USA
Alex Freitas, UK
Raul Giraldez, Spain
Rosalba Giugno, Italy
Jin-Kao Hao, France
Natalio Krasnogor, UK
Bill Langdon, UK
Bob MacCallum, Sweden
Elena Marchiori, The Netherlands
Andrew Martin, UK
Pablo Moscato, Australia
Ajit Narayanan, UK
Vic J. Rayward-Smith, UK
John Rowe, UK
Jem Rowland, UK
El-Ghazali Talbi, France
Antoine van Kampen, The Netherlands
Gwen Volkert, UK
Ray Walshe, Ireland
Eckart Zitzler, Switzerland
Igor Zwir, Spain

### Accepted Papers: titles and abstracts

Functional Classification of G-Protein Coupled Receptors, Based on Their Specific Ligand Coupling Patterns
Burcu Bakir, Osman Ugur Sezerman
(Nominated for Best Paper Award)

Functional identification of G-Protein Coupled Receptors (GPCRs) is one of the current focus areas of pharmaceutical research. Although thousands of GPCR sequences are known, many of them remain as orphan sequences (the activating ligand is unknown). Therefore, classification methods for automated characterization of orphan GPCRs are imperative. In this study, for predicting Level 2 subfamilies of Amine GPCRs, a novel method for obtaining fixed-length feature vectors, based on the existence of activating ligand specific patterns, has been developed and utilized for a Support Vector Machine (SVM)-based classification. Exploiting the fact that there is a non-promiscuous relationship between the specific binding of GPCRs into their ligands and their functional classification, our method classifies Level 2 subfamilies of Amine GPCRs with a high predictive accuracy of 97.02% in a ten-fold cross validation test. The presented machine learning approach, bridges the gulf between the excess amount of GPCR sequence data and their poor functional characterization.

Incorporating biological domain knowledge into cluster validity assessment
(Nominated for Best Paper Award)

This paper presents an approach for assessing cluster validity based on similarity knowledge extracted from the Gene Ontology (GO) and databases annotated to the GO. A knowledge-driven cluster validity assessment system for microarray data was implemented. Different methods were applied to measure similarity between yeast genes products based on the GO. This research proposes two methods for calculating cluster validity indices using GO-driven similarity. The first approach processes overall similarity values, which are calculated by taking into account the combined annotations originating from the three GO hierarchies. The second approach is based on the calculation of GO hierarchy-independent similarity values, which originate from each of these hierarchies. A traditional node-counting method and an information content technique have been implemented to measure knowledge-based similarity between genes products (biological distances). The results contribute to the evaluation of clustering outcomes and the identification of optimal cluster partitions, which may represent an effective tool to support biomedical knowledge discovery in gene expression data analysis.

A novel Mathematical Model for the Optimization of DNA--Chip Design and its Implementation
Korn´elia Danyi, Gabriella K´okai, J´ozsef Csontos

A variety of recent achievements in the field of biology, chemistry and information technology have made possible the development of DNA chips. They allow us to analyze the sequences and functions of different genes simultaneously and detect small differences in those. They are source of tremendous amount of data in the field of Bioinformatics. Moreover, the engineering process of DNA chip requires the latest results of information technology, too. In this paper, we address the mathematical problem of the prediction the hybridization process on the chip surface. A novel in situ in silico approach is presented and the obtained results are discussed.

A Hybrid GA/SVM Approach for Gene Selection and Classification of Microarray Data
Edmundo Bonilla Huerta, Beatrice Duval, Jin-Kao Hao

Abstract: We propose a Genetic Algorithm (GA) approach combined with Support Vector Machines (SVM) for the classification of high dimensional Microarray data. This approach is associated to a fuzzy logic based pre-filtering technique. The GA is used to evolve gene subsets whose fitness is evaluated by a SVM classifier. Using archive records of "good" gene subsets, a frequency based technique is introduced to identify the most informative genes. Our approach is assessed on two well-known cancer datasets and shows competitive results with six existing methods.

Multi-Stage Evolutionary Algorithms for Efficient Identification of Gene Regulatory Networks
Kee-Young Kim, Dong-Yeon Cho, Byoung-Tak Zhang

Abstract. With the availability of the time series data from the high-throughput technologies, diverse approaches have been proposed to model gene regulatory networks. Compared with others, S-system has the advantage for these tasks in the sense that it can provide both quantitative (structural) and qualitative (dy-namical) modeling in one framework. However, it is not easy to identify the structure of the true network since the number of parameters to be estimated is much larger than that of the available data. Moreover, conventional parameter estimation requires the time-consuming numerical integration to reproduce dy-namic profiles for the S-system. In this paper, we propose multi-stage evolu-tionary algorithms to identify gene regulatory networks efficiently. With the symbolic regression by genetic programming (GP), we can evade the numerical integration steps. This is because the estimation of slopes for each time-course data can be obtained from the results of GP. We also develop hybrid evolution-ary algorithms and modified fitness evaluation function to identify the structure of gene regulatory networks and to estimate the corresponding parameters at the same time. By applying the proposed method to the identification of an arti-ficial genetic network, we verify its capability of finding the true S-system.

Human Papillomavirus Risk Type Classification from Protein Sequences Using Support Vector Machines
Sun Kim, Byoung-Tak Zhang

Infection by the human papillomavirus (HPV) is associated with the development of cervical cancer. HPV can be classified to high- and low-risk type according to its malignant potential, and detection of the risk type is important to understand the mechanisms and diagnose potential patients. In this paper, we classify the HPV protein sequences by support vector machines. A string kernel is introduced to discriminate HPV protein sequences. The kernel emphasizes amino acids pairs with a distance. In the experiments, our approach is compared with previous methods in accuracy and F1-score, and it has showed better performance. Also, the prediction results for unknown HPV types are presented.

Hierarchical Clustering, Languages and Cancer
Pritha Mahata, Wagner Costa, Carlos Cotta, Pablo Moscato

In this paper, we introduce a novel objective function for the hierarchical clustering of data from distance matrices, a very relevant task in Bioinformatics. To test the robustness of the method, we test it in two areas: (a) the problem of deriving a phylogeny of languages and (b) subtype cancer classification from microarray data. For comparison purposes, we also consider both the use of ultrametric trees (generated via a two-phase evolutionary approach that creates a large number of hypothesis trees, and then takes a consensus), and the best-known results from the literature. We used a dataset of measured 'separation time' among 84 Indo-European languages. The hierarchy we produce agrees very well with existing data about these languages across a wide range of levels, and it helps to clarify and raise new hypothesis about the evolution of these languages. Our method also generated a classification tree for the different cancers in the NCI60 microarray dataset (comprising gene expression data for 60 cancer cell lines). In this case, the method seems to support the current belief about the heterogeneous nature of the ovarian, breast and non-small-lung cancer, as opposed to the relative homogeneity of other types of cancer. However, our method reveals a close relationship of the melanoma and CNS cell-lines. This is in correspondence with the fact that metastatic melanoma first appears in central nervous system (CNS).

Robust SVM-based biomarker selection with noisy mass spectrometric proteomic data
Elena Marchiori, Connie R. Jimenez, Mikkel West-Nielsen, Niels H.H.Heegaard
(Nominated for Best Paper Award)

Computational analysis of mass spectrometric (MS) proteomic data from sera is of potential relevance for diagnosis, prognosis, choice of therapy, and study of disease activity. To this aim, feature selection techniques based on machine learning can be applied for detecting potential biomarkes and biomaker patterns. A key issue concerns the interpretability and robustness of the output results given by such techniques. In this paper we propose a robust method for feature selection with MS proteomic data. The method consists of the sequentail application of a filter feature selection algorithm, RELIEF, followed by multiple runs of a wrapper feature selection technique based on support vector machines (SVM), where each run is obtained by changing the class label of one support vector. Frequencies of features selected over the runs are used to identify features which are robust with respect to perturbations of the data. This method is tested on a dataset produced by a specific MS technique, called MALDI-TOF MS. Two classes have been artificially generated by spiking. Moreover, the samples have been collected at different storage durations. Leave-one-out cross validation (LOOCV) applied to the resulting dataset, indicates that the proposed feature selection method is capable of identifying highly discriminatory proteomic patterns.

On the Use of Variable Complementarity for Feature Selection in Cancer Classification
Patrick Emmanuel Meyer, Gianluca Bontempi

The paper presents an original filter approach for effective feature selection in classification tasks with a very large number of input variables. The approach is based on the use of a new information theoretic selection criterion: the double input symmetrical relevance (DISR). The rationale of the criterion is that a set of variables can return an information on the output class that is higher than the sum of the informations of each variable taken individually. This property will be made explicit by defining the measure of variable complementarity. A feature selection filter based on the DISR criterion is compared in theoretical and experimental terms to recently proposed information theoretic criteria. Experimental results on a set of eleven microarray classification tasks show that the proposed technique is competitive with existing filter selection methods.

Comparison of Neural Network Optimization Approaches for Studies of Human Genetics
Alison Motsinger, Scott Dudek, Lance Hahn, Marylyn Ritchie

A major goal of human genetics is the identification of susceptibility genes associated with common, complex diseases. The preponderance of gene-gene and gene-environment interactions comprising the genetic architecture of common diseases presents a difficult challenge. To address this, novel computational approaches have been applied to studies of human disease. These novel approaches seek to capture the complexity inherent in common diseases. Previously, we developed a genetic programming neural network (GPNN) to optimize network architecture for the detection of disease susceptibility genes in association studies. While GPNN was a successful endeavor, we wanted to address the limitations in its flexibility and ease of development. To this end, we developed a grammatical evolution neural network (GENN) approach that accounts for the drawbacks of GPNN. In this study we show that this new method has high power to detect gene-gene interactions in simulated data. We also compare the performance of GENN to GPNN, a traditional back-propagation neural network (BPNN) and a random search algorithm. GENN outperforms both BPNN and the random search, and performs at least as well as GPNN. This study demonstrates the utility of using GE to evolve NN in studies of complex human disease.

Obtaining Biclusters in Microarrays with Population-based Heuristics
Pablo Palacios, David Pelta, Armando Blanco

In this article, we shall analyze the behavior of populationbased heuristics for obtaining biclusters from DNA microarray data. More specically, we shall propose an evolutionary algorithm, an estimation of distribution algorithm, and several memetic algorithms that dier in the local search used. In order to analyze the eectiveness of the proposed algorithms, the freely available yeast microarray dataset has been used. The results obtained have been compared with the algorithm proposed by Cheng and Church. Both in terms of the computation time and the quality of the solutions, the comparison reveals that a standard evolutionary algorithm and the estimation of distribution algorithm oer an ecient alternative for obtaining biclusters.

Multiple sequence alignment based on set covers
Alexandre H. L. Porto, Valmir C. Barbosa

We introduce a new heuristic for the multiple alignment of a set of sequences. The heuristic is based on a set cover of the residue alphabet of the sequences, and also on the determination of a significant set of blocks comprising subsequences of the sequences to be aligned. These blocks are obtained with the aid of a new data structure, called a suffix-set tree, which is constructed from the input sequences with the guidance of the residue-alphabet set cover and generalizes the well-known suffix tree of the sequence set. We provide performance results on selected BAliBASE amino-acid sequences and compare them with those yielded by some prominent approaches.

A methodology for determining amino-acid substitution matrices from set covers
Alexandre H. L. Porto, Valmir C. Barbosa

We introduce a new methodology for the determination of amino-acid substitution matrices for use in the alignment of proteins. The new methodology is based on a pre-existing set cover on the set of residues and on the undirected graph that describes residue exchangeability given the set cover. For fixed functional forms indicating how to obtain edge weights from the set cover and, after that, substitution-matrix elements from weighted distances on the graph, the resulting substitution matrix can be checked for performance against some known set of reference alignments and for given gap costs. Finding the appropriate functional forms and gap costs can then be formulated as an optimization problem that seeks to maximize the performance of the substitution matrix on the reference alignment set. We give computational results on the BAliBASE suite using a genetic algorithm for optimization. Initial results indicate that it is possible to obtain substitution matrices whose performance is either comparable to or surpasses that of several others.

Multi-Objective Evolutionary Algorithm for Discovering Peptide Binding Motifs
Menaka Rajapakse, Bertil Schmidt, Vladimir Brusic

Multi-Objective Evolutionary Algorithms (MOEA) use Genetic Algorithms (GA) to find a set of potential solutions, which are reached by compromising trade-offs between the multiple objectives. This paper presents a novel approach using MOEA to search for a motif which can unravel rules governing peptide binding to medically important receptors with applications to drugs and vaccines target discovery. However, the degeneracy of motifs due to the varying physicochemical properties at the binding sites across large number of active peptides poses a challenge for the detection of motifs of specific molecules such as MHC Class II molecule I-Ag7 of the non-obese diabetic (NOD) mouse. Several motifs have been experimentally derived for I-Ag7 molecule, but they differ from each other significantly. We have formulated the problem of finding a consensus motif for I-Ag7 by using MOEA as an outcome that satisfies two objectives: extract prior information by minimizing the distance between the experimentally derived motifs and the resulting matrix by MOEA; minimize the overall number of false positives and negatives resulting by using the putative MOEA-derived motif. The MOEA results in a Pareto optimal set of motifs from which the best motif is chosen by the Area under the Receiver Operator Characteristics (AROC) performance on an independent test dataset. We compared the MOEA-derived motif with the experimentally derived motifs and motifs derived by computational techniques such as MEME, RANKPEP, and Gibbs Motif Sampler. The overall predictive performance of the MOEA derived motif is comparable or better than the experimentally derived motifs and is better than the computationally derived motifs.

Mining Structural Databases: An Evolutionary Multi-Objetive Conceptual Clustering Methodology
Roc´ýo Romero-Zaliz, Cristina Rubio-Escudero, Oscar Cord´on, Oscar Harari, Coral del Val, Igor Zwir

The increased availability of biological databases containing representations of complex objects permits access to vast amounts of data. In spite of the recent renewed interest in knowledge-discovery techniques (or data mining), there is a dearth of data analysis methods intended to facilitate understanding of the represented objects and related systems by their most representative features and those relationship derived from these features (i.e., structural data). In this paper we propose a conceptual clustering methodology termed EMO-CC for Evolutionary Multi-Objective Conceptual Clustering that uses multi-objective and multi-modal optimization techniques based on Evolutionary Algorithms that uncover representative substructures from structural databases. Besides, EMO-CC provides annotations of the uncovered substructures, and based on them, applies an unsupervised classification approach to retrieve new members of previously discovered substructures. We apply EMO-CC to the Gene Ontology database to recover interesting substructures that describes problems from different points of view and use them to explain inmuno-inflammatory responses measured in terms of gene expression profiles derived from the analysis of longitudinal blood expression profiles of human volunteers treated with intravenous endotoxin compared to placebo.

Optimal Selection of Microarray Analysis Methods using a Conceptual Clustering Algorithm
Cristina Rubio-Escudero, Roc´ýo Romero-Zaliz, Oscar Cordon, Oscar Harari, Coral del Val, Igor Zwir

The rapid development of methods that select over/under expressed genes from microarray experiments have not yet matched the need for tools that identify informational profiles that differentiate between experimental conditions such as time, treatment and phenotype. Uncertainty arises when methods devoted to identify significantly expressed genes are evaluated: do all microarray analysis methods yield similar results from the same input dataset? do different microrray datasets require distinct analysis methods?. We performed a detailed evaluation of several microarray analysis methods, finding that none of these methods alone identifies all observable differential profiles, nor subsumes the results obtained by the other methods. Consequently, we propose a procedure that, given certain user-defined preferences, generates an optimal suite of statistical methods. These solutions are optimal in the sense that they constitute partial ordered subsets of all possible method-associations bounded by both, the most specific and the most sensitive available solution.

Microarray Probe Design using $\epsilon$-Multi-Objective Evolutionary Algorithms with Thermodynamic Criteria
Soo-Yong Shin, In-Hee Lee, Byoung-Tak Zhang
(Nominated for Best Paper Award)

As DNA microarrays have been widely used for gene expression profiling and other fields, the importance of reliable probe design for microarray has been highlighted. First, the probe design for DNA microarray was formulated as a constrained multi-objective optimization task by investigating the characteristics of probe design. Then the probe set for human paillomavrius (HPV) was found using $\epsilon$-multi-objective evolutionary algorithm with thermodynamic fitness calculation. The evolutionary optimization of probe set showed better results than the commercial microarray probe set made by Biomedlab Co.\ Korea.

An Algorithm for the Automated Verification of DNA Supercontig Assemblies
Nikola Stojanovic

Genome sequencing has achieved tremendous progress over the last few years. However, along with the speedup of the process and an ever increasing volume of data there are continuing concerns about the quality of the assembled sequence. Many genomes have been sequenced only to a draft, leaving the data in a series of more-or-less organized scaffolds, and many feature a small, but not negligible number of misassembled pieces. In this paper we present a new method for automated flagging of potential trouble spots in large assembled supercontigs. It can be incorporated into existing quality control pipelines and lead to a considerable improvement in the sensitivity to certain types of errors.

From HP Lattice Models to Real Proteins: coordination number prediction using Learning Classifier Systems
Michael Stout, Jaume Bacardit, Jonathan D. Hirst, Natalio Krasnogor,Jacek Blazewicz

Prediction of the coordination number (CN) of residues in proteins based solely on protein sequence has recently received renewed attention. At the same time, simplified protein models such as the HP model have been used to understand protein folding and protein structure prediction. These models represent the sequence of a protein using two residue types: hydrophobic and polar, and restrict the residue locations to those of a lattice. The aim of this paper is to compare CN prediction at three levels of abstraction a) 3D Cubic lattice HP model proteins, b) Real proteins represented by their HP sequence and c) Real proteins using residue sequence alone. For the 3D HP lattice model proteins the CN of each residue is simply the number of neighboring residues on the lattice. For the real proteins, we use a recent real-valued definition of CN proposed by Kinjo et al. To perform the predictions we use GAssist, a recent evolutionary computation based machine learning method belonging to the Learning Classifier System (LCS) family. Its performance was compared against some alternative learning techniques. Predictions using the HP sequence representation with only two residue types were only a little worse than those using a full 20 letter amino acid alphabet (64% vs 68% for two state prediction, 45% vs 50% for three state prediction and 30% vs 33% for five state prediction). That HP sequence information alone can result in predictions accuracies that are within 5% of those obtained using full residue type information indicates that hydrophobicity is a key determinant of CN and further justifies studies of simplified models.

Conditional Random Fields for Predicting and Analyzing Histone Occupancy, Acetylation and Methylation Areas in DNA Sequences
DangHung Tran, ThoHoan Pham, Kenji Satou, TuBao Ho

Eukaryotic genomes are packaged by the wrapping of DNA around histone octamers to form nucleosomes. Nucleosome occupancies together with their acetylation and methylation are important modification factors on all nuclear processes involving DNA. There have been recently many studies of mapping these modifications in DNA sequences and of relationship between them and various genetic activities, such as transcription, DNA repair, and DNA remodeling. However, most of these studies are experimental approaches. In this paper, we introduce a computational approach to both predicting and analyzing nucleosome occupancy, acetylation, and methylation areas in DNA sequences. Our method employs conditional random fields (CRFs) to discriminate between DNA areas with high and low relative occupancy, acetylation, or methylation; and rank features of DNA sequences based on their weight in the CRFs model trained from the datasets of these DNA modifications. The results from our method on the yeast genome reveal genetic area preferences of nucleosome occupancy, acetylation, and methylation are consistent with previous studies.

DNA Fragment Assembly: An Ant Colony System Approach
Wannasak Wetcharaporn, Nachol Chaiyaratana, Sissades Tongsima

This paper presents the use of an ant colony system (ACS) algorithm in DNA fragment assembly. The assembly problem generally arises during the sequencing of large strands of DNA where the strands are needed to be shotgun-replicated and broken into fragments that are small enough for sequencing. The assembly problem can thus be classified as a combinatorial optimisation problem where the aim is to find the right order of each fragment in the ordering sequence that leads to the formation of a consensus sequence that truly reflects the original DNA strands. The assembly procedure proposed is composed of two stages: fragment assembly and contiguous sequence (contig) assembly. In the fragment assembly stage, a possible alignment between fragments is determined with the use of a Smith-Waterman algorithm where the fragment ordering sequence is created using the ACS algorithm. The resulting contigs are then assembled together using a nearest neighbour heuristic (NNH) rule. The results indicate that in overall the performance of the combined ACS/NNH technique is superior to that of the NNH search and a CAP3 program. The results also reveal that the solutions produced by the CAP3 program contain a higher number of contigs than the solutions produced by the proposed technique. In addition, the quality of the combined ACS/NNH solutions is higher than that of the CAP3 solutions when the problem size is large.