TY - JOUR
T1 - Error-Correcting Codes for Nanopore Sequencing
AU - Banerjee, Anisha
AU - Yehezkeally, Yonatan
AU - Wachter-Zeh, Antonia
AU - Yaakobi, Eitan
N1 - Publisher Copyright: © 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
PY - 2024/7/1
Y1 - 2024/7/1
N2 - Nanopore sequencing, superior to other sequencing technologies for DNA storage in multiple aspects, has recently attracted considerable attention. Its high error rates, however, demand thorough research on practical and efficient coding schemes to enable accurate recovery of stored data. To this end, we consider a simplified model of a nanopore sequencer inspired by Mao et al., incorporating intersymbol interference and measurement noise. Essentially, our channel model passes a sliding window of length ℓ over a q-ary input sequence that outputs the composition of the enclosed ℓ bits, and shifts by δ positions with each time step. In this context, the composition of a q-ary vector x specifies the number of occurrences in x of each symbol in {0, 1, …, q − 1}. The resulting compositions vector, termed the read vector, may also be corrupted by t substitution errors. By employing graph-theoretic techniques, we deduce that for δ = 1, at least log log n symbols of redundancy are required to correct a single (t = 1) substitution. Finally, for ℓ ≥ 3, we exploit some inherent characteristics of read vectors to arrive at an error-correcting code that is of optimal redundancy up to a (small) additive constant for this setting. This construction is also found to be optimal for the case of reconstruction from two noisy read vectors.
AB - Nanopore sequencing, superior to other sequencing technologies for DNA storage in multiple aspects, has recently attracted considerable attention. Its high error rates, however, demand thorough research on practical and efficient coding schemes to enable accurate recovery of stored data. To this end, we consider a simplified model of a nanopore sequencer inspired by Mao et al., incorporating intersymbol interference and measurement noise. Essentially, our channel model passes a sliding window of length ℓ over a q-ary input sequence that outputs the composition of the enclosed ℓ bits, and shifts by δ positions with each time step. In this context, the composition of a q-ary vector x specifies the number of occurrences in x of each symbol in {0, 1, …, q − 1}. The resulting compositions vector, termed the read vector, may also be corrupted by t substitution errors. By employing graph-theoretic techniques, we deduce that for δ = 1, at least log log n symbols of redundancy are required to correct a single (t = 1) substitution. Finally, for ℓ ≥ 3, we exploit some inherent characteristics of read vectors to arrive at an error-correcting code that is of optimal redundancy up to a (small) additive constant for this setting. This construction is also found to be optimal for the case of reconstruction from two noisy read vectors.
KW - DNA sequences
KW - Sequence reconstruction
KW - composition errors
KW - error-correction codes
KW - nanopore sequencing
UR - http://www.scopus.com/inward/record.url?scp=85188963562&partnerID=8YFLogxK
U2 - 10.1109/TIT.2024.3380615
DO - 10.1109/TIT.2024.3380615
M3 - Article
AN - SCOPUS:85188963562
SN - 0018-9448
VL - 70
SP - 4956
EP - 4967
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 7
M1 - 10478160
ER -