TY - GEN
T1 - Computational inference of difficult word boundaries in DNA languages
AU - Tsafnat, Guy
AU - Setzermann, Paul
AU - Partridge, Sally R.
AU - Grimm, Dominik
PY - 2011
Y1 - 2011
AB - Many applications in molecular and systems biology exploit similarities between DNA and languages to make predictions about cell function. This approach provides structure to an otherwise monotonous sequence of nucleotides. However, one of the major differences between DNA sequences and text is in how semantic units (e.g. words) are distinguished within them. Whereas words and sentences are separated by spaces and punctuation in natural languages, no such markers exist in DNA. Some semantic units in DNA (e.g. genes) can be identified relatively easily and with relatively high accuracy. Other units may have less known molecular mechanisms and are therefore harder to identify accurately. In this paper we discuss three machine learning methods to elucidate the boundaries of such difficult units: heuristic approaches use hypothesized models of the mechanism to identify word boundaries, supervised machine learning methods generalise labelled examples of word boundaries to a model that can be used to detect these boundaries, and unsupervised machine learning methods infer a model from unlabeled data. As an example, we use a bacterial transposable element called ISEcp1 that moves DNA segments of variable length. We assess the accuracy of each of the above methods using rediscovery experiments. We demonstrate the power of the methods by examining 9 instances of DNA segments associated with ISEcp1 that lack known boundaries. We identified 6 units that include genes that confer resistance to clinically important antibiotics.
KW - DNA languages
KW - antibiotic resistance
KW - machine learning
KW - translational bioinformatics
UR - http://www.scopus.com/inward/record.url?scp=84856703816&partnerID=8YFLogxK
U2 - 10.1145/2093698.2093709
DO - 10.1145/2093698.2093709
M3 - Conference contribution
AN - SCOPUS:84856703816
SN - 9781450309134
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL'11
T2 - 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL'11
Y2 - 26 October 2011 through 29 October 2011
ER -