Computational inference of difficult word boundaries in DNA languages

Guy Tsafnat, Paul Setzermann, Sally R. Partridge, Dominik Grimm

Publication: Chapter in book/report/conference proceedings › Conference contribution › Peer-reviewed

3 citations (Scopus)

Abstract

Many applications in molecular and systems biology exploit similarities between DNA and languages to make predictions about cell function. This approach provides structure to an otherwise monotonous sequence of nucleotides. However, one of the major differences between DNA sequences and text is in how semantic units (e.g. words) are distinguished within them. Whereas words and sentences are separated by spaces and punctuation in natural languages, no such markers exist in DNA. Some semantic units in DNA (e.g. genes) can be identified relatively easily and with relatively high accuracy. Other units may have less well-understood molecular mechanisms and are therefore harder to identify accurately. In this paper we discuss three machine learning methods to elucidate the boundaries of such difficult units: heuristic approaches use hypothesized models of the mechanism to identify word boundaries, supervised machine learning methods generalise labelled examples of word boundaries to a model that can be used to detect these boundaries, and unsupervised machine learning methods infer a model from unlabelled data. As an example, we use a bacterial transposable element called ISEcp1 that moves DNA segments of variable length. We assess the accuracy of each of the above methods using rediscovery experiments. We demonstrate the power of the methods by examining 9 instances of DNA segments associated with ISEcp1 that lack known boundaries. We identified 6 units that include genes that confer resistance to clinically important antibiotics.

Original language: English
Title: Proceedings of the 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL'11
Publication status: Published - 2011
Externally published: Yes
Event: 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL'11 - Barcelona, Spain
Duration: 26 Oct 2011 to 29 Oct 2011

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL'11
Country/Territory: Spain
City: Barcelona
Period: 26/10/11 to 29/10/11
