Publications
Improvement of Decoding Performance by Preprocessing Methods and Soft Information for Fountain Code in DNA Storage System
Abstract
In this dissertation, two main contributions are given as: i) improving decoding performance in deoxyribonucleic acid (DNA) storage system by using preprocessing methods and ii) improving decoding performance in DNA storage system by an iterative soft-decision decoding algorithm. First, several preprocessing techniques to improve the erasure decoding performance for DNA storage system using fountain code and Reed-Solomon (RS) code are introduced. In DNA storage system, there are tradeoffs between writing and reading costs. Increasing the code rate of error correcting codes may save writing cost, but it will need more sequence reads for data retrieval. There is potentially a way to improve sequencing and decoding processes in such a way that the reading cost induced by this tradeoff is reduced without increasing the writing cost. In past researches, clustering, alignment, and decoding processes were considered as separate stages but I believe that using the information from all these processes together may improve decoding performance. Actual experiments of DNA synthesis and sequencing should be performed because simulations cannot be relied on to cover all error possibilities in practical circumstances. Therefore, I design the decoding process focusing on the cooperation of key components: Hamming-distance based clustering, discarding of abnormal sequence reads, RS error correction as well as detection, and quality score-based ordering of sequences. I synthesize 513.6KB data into DNA oligo pools and sequence this data successfully with Illumina MiSeq instrument. Compared to Erlichs research, the proposed decoding method additionally incorporates sequence reads with minor errors which had been discarded before. Thus, it is able to make use of 10.6-11.9% more sequence reads from the same sequencing environment, which results in 6.5-8.9% reduction in the reading cost. Channel characteristics including sequence coverage and read-length distributions are discussed as well. Second, a new iterative soft-decision decoding algorithm, where soft information is obtained from FASTQ files and channel statistics is proposed. Ever since DNA is considered as a next-generation data-storage medium, lots of research efforts have been made to correct errors occurred during the synthesis, storage, and sequencing processes using error correcting codes (ECCs). Previous works on recovering the data from the sequenced DNA oligo pool with errors have utilized decoding algorithms based on hard values and a majority decision rule. To improve the error correction capability by ECCs and robustness of the DNA storage system, I propose a new formula for log-likelihood ratio (LLR) calculation using quality scores (Q-scores) and a redecoding method which may be suitable for the error correction and detection in the DNA sequencing area. Based on the widely adopted encoding scheme of the fountain code structure proposed by Erlich et al., I use three different sets of sequenced data to show consistency for the performance evaluation. The proposed soft-decision decoding (soft decoding) algorithm gives 2.3-7.0% improvement of the reading number reduction compared to the state-of-the-art decoding method and it is shown that it can deal with erroneous sequenced oligo reads with insertion and deletion errors.
Product Used
Oligo Pools
Related Publications