TY - GEN
T1 - Achieving the Capacity of the DNA Storage Channel
AU - Lenz, Andreas
AU - Siegel, Paul H.
AU - Wachter-Zeh, Antonia
AU - Yaakobi, Eitan
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
AB - Significant advances in biochemical technologies, such as synthesizing and sequencing devices, have made DNA a competitive medium for archival data storage. In this paper we analyze storage systems based on these macromolecules from an information theoretic perspective. Using an appropriate channel model for the synthesis and sequencing steps, we study the maximum achievable information density per nucleotide for reliable and error resilient data storage. The channel model features the main attributes that characterize DNA-based data storage. That is, information is synthesized onto many short DNA strands, and each strand is copied many times. Due to the storage and sequencing methods, the receiver draws strands from these synthesized strands in an uncontrollable manner, where it is possible that strands are drawn multiple times and also that some strands are not drawn at all. Additionally, due to imperfections, the obtained strands can contain errors. Here we prove the achievability of a recently published upper bound on the Shannon capacity of this channel for a large range of parameters by proposing and analyzing a decoder that clusters received strands according to their similarity and then efficiently estimates the original strands based on these clusters.
UR - http://www.scopus.com/inward/record.url?scp=85089242159&partnerID=8YFLogxK
U2 - 10.1109/ICASSP40776.2020.9053049
DO - 10.1109/ICASSP40776.2020.9053049
M3 - Conference contribution
AN - SCOPUS:85089242159
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8846
EP - 8850
BT - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Y2 - 4 May 2020 through 8 May 2020
ER -