Project Description

Project name: Complexity of Haplotyping Problems
Leadership: Prof. Dr. Till Tantau
Duration and Funding: Started 2005, estimated completion in 2010. Supported by the DFG.
Project staff: Ilka Schnoor, Michael Elberfeld, Jakob Kuczewski, Jannis Pohlmann
Abstract: One of the next steps in human genome research will be the construction of the haplotype map. It will show which relationships exists between an individual's haplotypes (variations of the DNA-structure) and the indiviual's likelihood to react to a specific drug or to get a specific desease. Simple, inexpensive technology for determining an individual's haplotypes will become important both in medical treatment and screening. Current technology only allows us to quickly determine the genotypes of individuals. Genotypes contain all information about the genetic variations of an individual at sites of interest, but the mapping of the variations to the two chromosomes is lost. Under certain evolution-theoretic assumptions it is nevertheless possible to reconstruct the haplotypes from the genotypes. This reconstruction problem is called the haplotyping problem.

Project System Development

A software library was developped for this project.

Project Publications

2012

Michael Elberfeld, Ilka Schnoor, Till Tantau:
Influence of Tree Topology Restrictions on the Complexity of Haplotyping with Missing Data.
Theoretical Computer Science, 432:38–51, 2012.
Show PDF | Go to website | Show abstract
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. One fast haplotyping method is based on an evolutionary model where a perfect phylogenetic tree is sought that explains the observed data. Unfortunately, when data entries are missing, which is often the case in laboratory data, the resulting formal problem IPPH, which stands for incomplete perfect phylogeny haplotyping, is NP-complete. Even radically simplified versions, such as the restriction to phylogenetic trees consisting of just two directed paths from a given root, are still NP-complete; but here, at least, a fixed-parameter algorithm is known. Such drastic and ad hoc simplifications turn out to be unnecessary to make IPPH tractable: we present the first theoretical analysis of a parametrized algorithm, which we develop in the course of the paper, that works for arbitrary instances of IPPH. This tractability result is optimal insofar as we prove IPPH to be NP-complete whenever any of the parameters we consider is not fixed, but part of the input.
Michael Elberfeld, Till Tantau:
Phylogeny- and Parsimony-Based Haplotype Inference with Constraints.
Information and Computation, 213:33–47, 2012.
Show PDF | Go to website | Show abstract
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. One fast computational haplotyping method is based on an evolutionary model where a perfect phylogenetic tree is sought that explains the observed data. An extension of this approach tries to incorporate prior knowledge in the form of a set of candidate haplotypes from which the right haplotypes must be chosen. The objective is to increase the accuracy of haplotyping methods, but it was conjectured that the resulting formal problem constrained perfect phylogeny haplotyping might be NP-complete. In the paper at hand we present a polynomial-time algorithm for it. Our algorithmic ideas also yield new fixed-parameter algorithms for related haplotyping problems based on the maximum parsimony assumption.

2010

Michael Elberfeld, Till Tantau:
Phylogeny- and Parsimony-Based Haplotype Inference with Constraints.
Technical report SIIM-TR-A-10-01, Schriftenreihe der Institute für Informatik/Mathematik der Universität zu Lübeck, 2010.
Show PDF | Show abstract
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. One fast computational haplotyping method is based on an evolution- ary model where a perfect phylogenetic tree is sought that explains the observed data. In their CPM’09 paper, Fellows et al. studied an exten- sion of this approach that incorporates prior knowledge in the form of a set of candidate haplotypes from which the right haplotypes must be chosen. While this approach is attractive to increase the accuracy of haplotyping methods, it was conjectured that the resulting formal prob- lem constrained perfect phylogeny haplotyping might be NP-complete. In the paper at hand we present a polynomial-time algorithm for it. Our algorithmic ideas also yield new fixed-parameter algorithms for related haplotyping problems based on the maximum parsimony assumption.
Michael Elberfeld, Till Tantau:
Phylogeny- and Parsimony-Based Haplotype Inference with Constraints.
In Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching (CPM 2010), Volume 6129 of Lecture Notes in Computer Science, pp. 177-189. Springer, 2010.
Go to website | Show abstract
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. One fast computational haplotyping method is based on an evolutionary model where a perfect phylogenetic tree is sought that explains the observed data. In their CPM 2009 paper, Fellows et al. studied an extension of this approach that incorporates prior knowledge in the form of a set of candidate haplotypes from which the right haplotypes must be chosen. While this approach may help to increase the accuracy of haplotyping methods, it was conjectured that the resulting formal problem constrained perfect phylogeny haplotyping might be NP-complete. In the present paper we present a polynomial-time algorithm for it. Our algorithmic ideas also yield new fixed-parameter algorithms for related haplotyping problems based on the maximum parsimony assumption.

2009

Michael Elberfeld, Ilka Schnoor, Till Tantau:
Influence of Tree Topology Restrictions on the Complexity of Haplotyping with Missing Data.
In Proceedings of the 6th Annual Conference on Theory and Applications of Models of Computation (TAMC 2009), Volume 5532 of Lecture Notes in Computer Science, pp. 201-210. Springer, 2009.
Show PDF | Go to website | Show abstract
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes from genotype data. One fast haplotyping method is based on an evolutionary model where a perfect phylogenetic tree is sought that explains the observed data. Unfortunately, when data entries are missing as is often the case in laboratory data, the resulting incomplete perfect phylogeny haplotyping problem IPPH is NP-complete and no theoretical results are known concerning its approximability, fixed-parameter tractability, or exact algorithms for it. Even radically simplified versions, such as the restriction to phylogenetic trees consisting of just two directed paths from a given root, are still NP-complete; but here a fixed-parameter algorithm is known. We show that such drastic and ad hoc simplifications are not necessary to make IPPH fixed-parameter tractable: We present the first theoretical analysis of an algorithm, which we develop in the course of the paper, that works for arbitrary instances of IPPH. On the negative side we show that restricting the topology of perfect phylogenies does not always reduce the computational complexity: while the incomplete directed perfect phylogeny problem is well-known to be solvable in polynomial time, we show that the same problem restricted to path topologies is NP-complete.
Michael Elberfeld:
Perfect Phylogeny Haplotyping is Complete for Logspace.
Technical report abs/0905.0602 [cs.CC], Computing Research Repository (CoRR), 2009.
Go to website | Show abstract
Haplotyping is the bioinformatics problem of predicting likely haplotypes based on given genotypes. It can be approached using Gusfield's perfect phylogeny haplotyping (PPH) method for which polynomial and linear time algorithms exist. These algorithm use sophisticated data structures or do a stepwise transformation of the genotype data into haplotype data and, therefore, need a linear amount of space. We are interested in the exact computational complexity of PPH and show that it can be solved space-efficiently by an algorithm that needs only a logarithmic amount of space. Together with the recently proved L-hardness of PPH, we establish L-completeness. Our algorithm relies on a new characterization for PPH in terms of bipartite graphs, which can be used both to decide and construct perfect phylogenies for genotypes efficiently.
Jens Gramm, Tzvika Hartman, Till Nierhoff, Roded Sharan, Till Tantau:
On the complexity of SNP block partitioning under the perfect phylogeny model.
Discrete Mathematics, 309(18):5610-5617, 2009.
Go to website | Show abstract
ecent technologies for typing single nucleotide polymorphisms (SNPs) across a population are producing genome-wide genotype data for tens of thousands of SNP sites. The emergence of such large data sets underscores the importance of algorithms for large-scale haplotyping. Common haplotyping approaches first partition the SNPs into blocks of high linkage-disequilibrium, and then infer haplotypes for each block separately. We investigate an integrated haplotyping approach where a partition of the SNPs into a minimum number of non-contiguous subsets is sought, such that each subset can be haplotyped under the perfect phylogeny model. We show that finding an optimum partition is NP-hard even if we are guaranteed that two subsets suffice. On the positive side, we show that a variant of the problem, in which each subset is required to admit a perfect path phylogeny haplotyping, is solvable in polynomial time.

2008

Jens Gramm, Arfst Nickelsen, Till Tantau:
Fixed-Parameter Algorithms in Phylogenetics.
In Bioinformatics: Volume I: Data, Sequence Analysis and Evolution, Volume 452 of Methods in Molecular Biology, pp. 507-535. Springer, 2008.
Show abstract
We survey the use of fixed-parameter algorithms in phylogenetics. A central computational problem in this field is the construction of a likely phylogeny (genealogical tree) for a set of species based on observed differences in the phenotype, on differences in the genotype, or on given partial phylogenies. Ideally, one would like to construct so-called perfect phylogenies, which arise from an elementary evolutionary model, but in practice one must often be content with phylogenies whose "distance from perfection" is as small as possible. The computation of phylogenies also has applications in seemingly unrelated areas such as genomic sequencing and finding and understanding genes. The numerous computational problems arising in phylogenetics are often NP-complete, but for many natural parametrizations they can be solved using fixed-parameter algorithms.
Michael Elberfeld, Till Tantau:
Computational Complexity of Perfect-Phylogeny-Related Haplotyping Problems.
Technical report SIIM-TR-A-08-02, Schriftenreihe der Institute für Informatik/Mathematik der Universität zu Lübeck, 2008.
Show PDF | Show abstract
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. This problem, which has strong practical applications, can be approached using both statistical as well as combinatorial methods. While the most direct combinatorial approach, maximum parsimony, leads to NP-complete problems, the perfect phylogeny model proposed by Gusfield yields a problem, called PPH, that can be solved in polynomial (even linear) time. Even this may not be fast enough when the whole genome is studied, leading to the question of whether parallel algorithms can be used to solve the PPH problem. In the present paper we answer this question affirmatively, but we also give lower complexity bounds on its complexity. In detail, we show that the problem lies in Mod$_2$L, a subclass of the circuit complexity class NC$^2$, and is hard for logarithmic space and thus presumably not in NC$^1$. We also investigate variants of the PPH problem that have been studied in the literature, like the perfect path phylogeny haplotyping problem and the combined problem where a perfect phylogeny of maximal parsimony is sought, and show that some of these variants are TC$^0$-complete or lie in AC$^0$.
Michael Elberfeld, Till Tantau:
Computational Complexity of Perfect-Phylogeny-Related Haplotyping Problems.
In Proceedings of the 33rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2008), Volume 5162 of Lecture Notes in Computer Science, pp. 299-310. Springer, 2008.
Go to website | Show abstract
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. This problem, which has strong practical applications, can be approached using both statistical as well as combinatorial methods. While the most direct combinatorial approach, maximum parsimony, leads to NP-complete problems, the perfect phylogeny model proposed by Gusfield yields a problem, called PPH, that can be solved in polynomial (even linear) time. Even this may not be fast enough when the whole genome is studied, leading to the question of whether parallel algorithms can be used to solve the PPH problem. In the present paper we answer this question affirmatively, but we also give lower complexity bounds on its complexity. In detail, we show that the problem lies in Mod$_2$L, a subclass of the circuit complexity class NC$^2$, and is hard for logarithmic space and thus presumably not in NC$^1$. We also investigate variants of the PPH problem that have been studied in the literature, like the perfect path phylogeny haplotyping problem and the combined problem where a perfect phylogeny of maximal parsimony is sought, and show that some of these variants are TC$^0$-complete or lie in AC$^0$.
Michael Elberfeld, Ilka Schnoor, Till Tantau:
Influence of Tree Topology Restrictions on the Complexity of Haplotyping with Missing Data.
Technical report SIIM-TR-A-08-05, Schriftenreihe der Institute für Informatik/Mathematik der Universität zu Lübeck, 2008.
Show PDF | Show abstract
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. One fast haplotyping method is based on an evolutionary model where a perfect phylogenetic tree is sought that explains the observed data. Unfortunately, when data entries are missing, as is often the case in real laboratory data, the resulting formal problem IPPH, which stands for incomplete perfect phylogeny haplotyping, is NP-complete and no theoretical results are known concerning its approximability, fixed-parameter tractability or exact algorithms for it. Even radically simplified versions, such as the restriction to phylogenetic trees consisting of just two directed paths from a given root, are still NP-complete, but here, at least, a fixed-parameter algorithm is known. We generalize this algorithm to arbitrary tree topologies and present the first theoretical analysis of an algorithm that works on arbitrary instances of the original IPPH problem. At the same time we also show that restricting the tree topology does not always make finding phylogenies easier: while the incomplete directed perfect phylogeny problem is well-known to be solvable in polynomial time, we show that the same problem restricted to path topologies is NP-complete.
Jens Gramm, Arfst Nickelsen, Till Tantau:
Fixed-Parameter Algorithms in Phylogenetics.
The Computer Journal, 51(1):79--101, 2008.
Show PDF | Show abstract
We survey the use of fixed-parameter algorithms in the field of phylogenetics, which is the study of evolutionary relationships. The central problem in phylogenetics is the reconstruction of the evolutionary history of biological species, but its methods also apply to linguistics, philology, or architecture. A basic computational problem is the reconstruction of a likely phylogeny (genealogical tree) for a set of species based on observed differences in the phenotype like color or form of limbs, based on differences in the genotype like mutated nucleotide positions in the DNA sequence, or based on given partial phylogenies. Ideally, one would like to construct so-called perfect phylogenies, which arise from a very simple evolutionary model, but in practice one must often be content with phylogenies whose ``distance from perfection'' is as small as possible. The computation of phylogenies has applications in seemingly unrelated areas such as genomic sequencing and finding and understanding genes. The numerous computational problems arising in phylogenetics often are NP-complete, but for many natural parametrizations they can be solved using fixed-parameter algorithms.

2007

Jens Gramm, Till Nierhoff, Roded Sharan, Till Tantau:
Haplotyping with Missing Data via Perfect Path Phylogenies.
Discrete and Applied Mathematics, 155:788-805, 2007.
Show abstract
Computational methods for inferring haplotype information from genotype data are used in studying the association between genomic variation and medical condition. Recently, Gusfield proposed a haplotype inference method that is based on perfect phylogeny principles. A fundamental problem arises when one tries to apply this approach in the presence of missing genotype data, which is common in practice. We show that the resulting theoretical problem is NP-hard even in very restricted cases. To cope with missing data, we introduce a variant of haplotyping via perfect phylogeny in which a path phylogeny is sought. Searching for perfect path phylogenies is strongly motivated by the characteristics of human genotype data: 70% of real instances that admit a perfect phylogeny also admit a perfect path phylogeny. Our main result is a fixed-parameter algorithm for haplotyping with missing data via perfect path phylogenies. We also present a simple linear-time algorithm for the problem on complete data.

2006

Jens Gramm, Tzvika Hartman, Till Nierhoff, Roded Sharan, Till Tantau:
On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model.
In Proceedings of Workshop on Algorithms in Bioinformatics (WABI), Volume 4175 of Lecture Notes in Computer Science, pp. 92-102. Springer, 2006.
Show abstract
Recent technologies for typing single nucleotide polymorphisms (SNPs) across a population are producing genome-wide genotype data for tens of thousands of SNP sites. The emergence of such large data sets underscores the importance of algorithms for large-scale haplotyping. Common haplotyping approaches first partition the SNPs into blocks of high linkage-disequilibrium, and then infer haplotypes for each block separately. We investigate an integrated haplotyping approach where a partition of the SNPs into a minimum number of non-contiguous subsets is sought, such that each subset can be haplotyped under the perfect phylogeny model. We show that finding an optimum partition is NP-hard even if we are guaranteed that two subsets suffice. On the positive side, we show that a variant of the problem, in which each subset is required to admit a perfect path phylogeny haplotyping, is solvable in polynomial time.

2004

Jens Gramm, Till Nierhoff, Roded Sharan, Till Tantau:
On the Complexity of Haplotyping via Perfect Phylogeny.
In Proceedings of Second RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotypes, pp. 35-46. , 2004.
Show abstract
The problem of haplotyping via perfect phylogeny has received a lot of attention lately due to its applicability to real haplotyping problems and its theoretical elegance. However, two main research issues remained open: The complexity of haplotyping with missing data, and whether the problem is linear-time solvable. In this paper we settle the first question and make progress toward answering the second one. Specifically, we prove that Perfect Phylogeny Haplotyping with missing data is NP-complete even when the phylogeny is a path and only one allele of every polymorphic site is present in the population in its homozygous state. Our result implies the hardness of several variants of the missing data problem, including the general Perfect Phylogeny Haplotyping Problem with missing data, and Hypergraph Tree Realization with missing data. On the positive side, we give a linear-time algorithm for Perfect Phylogeny Haplotyping when the phylogeny is a path. This variant is important due to the abundance of yin-yang haplotypes in the human genome. Our algorithm relies on a reduction of the problem to that of deciding whether a partially ordered set has width~2.
Jens Gramm, Till Nierhoff, Till Tantau:
Perfect Path Phylogeny Haplotyping with Missing Data is Fixed-Parameter Tractable.
In Proceedings of the 2004 International Workshop on Parameterized and Exact Computation, Volume 3162 of Lecture Notes in Computer Science, pp. 174-186. Springer, 2004.
Show abstract
Haplotyping via perfect phylogeny is a method for retrieving haplotypes from genotypes. Fast algorithms are known for computing perfect phylogenies from complete and error-free input instances---these instances can be organized as a genotype matrix whose rows are the genotypes and whose columns are the single nucleotide polymorphisms under consideration. Unfortunately, in the more realistic setting of missing entries in the genotype matrix, even restricted forms of the perfect phylogeny haplotyping problem become NP-hard. We show that haplotyping via perfect phylogeny with missing data becomes computationally tractable when imposing additional biologically motivated constraints. Firstly, we focus on asking for perfect phylogenies that are paths, which is motivated by the discovery that yin-yang haplotypes span large parts of the human genome. A yin-yang haplotype implies that every corresponding perfect phylogeny <i>has</i> to be a path. Secondly, we assume that the number of missing entries in every column of the input genotype matrix is bounded. We show that the perfect path phylogeny haplotyping problem is fixed-parameter tractable when we consider the maximum number of missing entries per column of the genotype matrix as parameter. The restrictions we impose are met by a majority of the problem instances encountered in publicly available human genome data.

Staff Overview

Professors

Secretary

Contact

Research assistants

Technical staff

Institute for Theoretical Computer Science

Complexity of Haplotyping Problems

The Institute

Research

Courses and Studies

Quick links for

Project Description

Project System Development

Project Publications

2012

2010

2009

2008

2007

2006

2004