TY - JOUR
T1 - Clustering-Correcting Codes
AU - Shinkar, Tal
AU - Yaakobi, Eitan
AU - Lenz, Andreas
AU - Wachter-Zeh, Antonia
N1 - Publisher Copyright:
© 1963-2012 IEEE.
PY - 2022/3/1
Y1 - 2022/3/1
N2 - A new family of codes, called clustering-correcting codes, is presented in this paper. This family of codes is motivated by the special structure of the data that is stored in DNA-based storage systems. The data stored in these systems has the form of unordered sequences, also called strands, and every strand is synthesized thousands to millions of times, where some of these copies are read back during sequencing. Due to the unordered structure of the strands, an important task in the decoding process is to place them in their correct order. This is usually accomplished by allocating part of the strand for an index. However, in the presence of errors in the index field, important information on the order of the strands may be lost. Clustering-correcting codes ensure that if the distance between the index fields of two strands is small, their data fields have large distance. It is shown how this property enables to place the strands together in their correct clusters even in the presence of errors. We present lower and upper bounds on the size of clustering-correcting codes and an explicit construction of these codes which uses only a single symbol of redundancy. The results are first presented for the Hamming metric and are then extended for the edit distance.
AB - A new family of codes, called clustering-correcting codes, is presented in this paper. This family of codes is motivated by the special structure of the data that is stored in DNA-based storage systems. The data stored in these systems has the form of unordered sequences, also called strands, and every strand is synthesized thousands to millions of times, where some of these copies are read back during sequencing. Due to the unordered structure of the strands, an important task in the decoding process is to place them in their correct order. This is usually accomplished by allocating part of the strand for an index. However, in the presence of errors in the index field, important information on the order of the strands may be lost. Clustering-correcting codes ensure that if the distance between the index fields of two strands is small, their data fields have large distance. It is shown how this property enables to place the strands together in their correct clusters even in the presence of errors. We present lower and upper bounds on the size of clustering-correcting codes and an explicit construction of these codes which uses only a single symbol of redundancy. The results are first presented for the Hamming metric and are then extended for the edit distance.
KW - Clusters
KW - DNA
KW - DNA-based storage systems
KW - Decoding
KW - Encoding
KW - Error correction codes
KW - Indexes
KW - Noise measurement
KW - Unordered sequences
UR - http://www.scopus.com/inward/record.url?scp=85119442844&partnerID=8YFLogxK
U2 - 10.1109/TIT.2021.3127174
DO - 10.1109/TIT.2021.3127174
M3 - Article
AN - SCOPUS:85119442844
SN - 0018-9448
VL - 68
SP - 1560
EP - 1580
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 3
ER -