Efficient Computation of Sequence Mappability

Straszyński, Juliusz; Radoszewski, Jakub; Pissis, Solon P.; Kociumaka, Tomasz; Iliopoulos, Costas S.; Charalampopoulos, Panagiotis

doi:10.1007/S00453-022-00934-Y

Journal article

License

CC-BY - Attribution

DOI

10.1007/S00453-022-00934-Y

Authors

Straszyński, Juliusz

Radoszewski, Jakub

Pissis, Solon P.

Kociumaka, Tomasz

Iliopoulos, Costas S.

Charalampopoulos, Panagiotis

Publication date

2022

Abstract (EN)

Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices j ≠ i such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of k = 1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for k = O(1), works in O(n) space and, with high probability, in O(n · min{m^k, log^k n}) time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop O(n^2)-time algorithms to compute all (k, m)-mappability tables for a fixed m and all k ∈ {0, …, m} or a fixed k and all m ∈ {k, …, n}. Finally, we show that, for k, m = Θ(log n), the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.
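To make the problem definition concrete, the following is a minimal brute-force sketch of the (k, m)-mappability table. It is not the algorithm from the paper: it compares every pair of length-m substrings directly and takes O(n^2 · m) time in the worst case, so it is useful only as a reference for small inputs. The function name `mappability_naive`, the 0-based indexing, and the choice to report entries only for positions where a full length-m substring exists are assumptions made for this illustration.

```python
def mappability_naive(t: str, m: int, k: int) -> list[int]:
    """Brute-force (k, m)-mappability table of t (illustrative baseline only).

    Entry i counts the positions j != i such that the length-m substrings
    starting at i and j differ in at most k positions (Hamming distance <= k).
    Positions are 0-based; only starts with a full length-m window are listed.
    """
    starts = len(t) - m + 1
    if m <= 0 or starts <= 0:
        return []
    table = [0] * starts
    for i in range(starts):
        for j in range(i + 1, starts):
            mismatches = 0
            for d in range(m):
                if t[i + d] != t[j + d]:
                    mismatches += 1
                    if mismatches > k:
                        break  # already too far apart, stop early
            if mismatches <= k:
                # The relation is symmetric, so credit both positions.
                table[i] += 1
                table[j] += 1
    return table


if __name__ == "__main__":
    # Length-2 substrings of "aabaa": aa, ab, ba, aa
    print(mappability_naive("aabaa", 2, 0))  # [1, 0, 0, 1]
    print(mappability_naive("aabaa", 2, 1))  # [3, 2, 2, 3]
```

With k = 0 only the two occurrences of "aa" match each other; with k = 1 every pair except ("ab", "ba") is within distance one, which gives the second table.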

Keywords (EN)

Sequence mappability

k-errata tree

Hamming distance

PBN discipline

computer science

Journal

Algorithmica

Volume

84

Issue

5

Pages

1418-1440

ISSN

0178-4617

Open access availability date

2022-02-02

Open access license

Attribution
