TY - GEN
T1 - On the Capacity of DNA-based Data Storage under Substitution Errors
AU - Lenz, Andreas
AU - Siegel, Paul H.
AU - Wachter-Zeh, Antonia
AU - Yaakobi, Eitan
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Advances in biochemical technologies, such as synthesizing and sequencing devices, have fueled manifold recent experiments on archival digital data storage using DNA. In this paper we review and analyze recent results on information-theoretic aspects of such storage systems. The discussion focuses on a channel model that incorporates the main properties of DNA-based data storage. Namely, the user data is synthesized many times onto a large number of short-length DNA strands. The receiver then draws strands from the stored sequences in an uncontrollable manner. Since the synthesis and sequencing are prone to errors, a received sequence can differ from its original strand, and their relationship is described by a probabilistic channel. Recently, the capacity of this channel was derived for the case of substitution errors inside the sequences. We review the main techniques used to prove a coding theorem and its converse, showing the achievability of the capacity and the fact that it cannot be exceeded. We further provide an intuitive interpretation of the capacity formula for relevant channel parameters, compare with sub-optimal decoding methods, and conclude with a discussion on cost-efficiency.
UR - http://www.scopus.com/inward/record.url?scp=85125266105&partnerID=8YFLogxK
U2 - 10.1109/VCIP53242.2021.9675410
DO - 10.1109/VCIP53242.2021.9675410
M3 - Conference contribution
AN - SCOPUS:85125266105
T3 - 2021 International Conference on Visual Communications and Image Processing, VCIP 2021 - Proceedings
BT - 2021 International Conference on Visual Communications and Image Processing, VCIP 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 International Conference on Visual Communications and Image Processing, VCIP 2021
Y2 - 5 December 2021 through 8 December 2021
ER -