Computational inference of difficult word boundaries in DNA languages

Guy Tsafnat, Paul Setzermann, Sally R. Partridge, Dominik Grimm

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Many applications in molecular and systems biology exploit similarities between DNA and languages to make predictions about cell function. This approach provides structure to an otherwise monotonous sequence of nucleotides. However, one of the major differences between DNA sequences and text is in how semantic units (e.g. words) are distinguished within them. Whereas words and sentences are separated by spaces and punctuation in natural languages, no such markers exist in DNA. Some semantic units in DNA (e.g. genes) can be identified relatively easily and with relatively high accuracy. Other units may have less known molecular mechanisms and are therefore harder to identify accurately. In this paper we discuss three machine learning methods to elucidate the boundaries of such difficult units: heuristic approaches use hypothesized models of the mechanism to identify word boundaries, supervised machine learning methods generalise labelled examples of word boundaries to a model that can be used to detect these boundaries, and unsupervised machine learning methods infer a model from unlabeled data. As an example, we use a bacterial transposable element called ISEcp1 that moves DNA segments of variable length. We assess the accuracy of each of the above methods using rediscovery experiments. We demonstrate the power of the methods by examining 9 instances of DNA segments associated with ISEcp1 that lack known boundaries. We identified 6 units that include genes that confer resistance to clinically important antibiotics.

Original language: English
Title of host publication: Proceedings of the 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL'11
State: Published - 2011
Externally published: Yes
Event: 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL'11 - Barcelona, Spain
Duration: 26 Oct 2011 to 29 Oct 2011

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL'11
Country/Territory: Spain
City: Barcelona
Period: 26/10/11 to 29/10/11

Keywords

  • DNA languages
  • antibiotic resistance
  • machine learning
  • translational bioinformatics
