Department of Computer Science

Theses on Algorithmic Foundations

Requirements

Please respect our general guideline for theses in our group.

For theses in the area of algorithmic foundations, you should satisfy the following:

  • You enjoy working on algorithmic problems
  • You have knowledge of a memory- and runtime-efficient programming language; unless stated otherwise, we expect implementations in C++
  • You have experience with programming on the bit level / you feel comfortable with AND, OR, and shifts

These topics are suitable for students of computer science. If you are interested, please contact Prof. Dr. Johannes Fischer, Patrick Dinklage or Francesco Masillo.

Open Topics

Please contact us if you are interested in writing a bachelor's or master's thesis with our group.

Topics typically revolve around succinct data structures, text compression and text indexing.

The Lempel-Ziv (LZ) factorization of a text T[1..n] produces a sequence of factors, also called "phrases", which are substrings of T whose concatenation is T itself. The algorithm compresses the text greedily in a left-to-right manner: when processing a position i in T, it exploits substrings that the parser has already seen to the left of i. For each computed longest previous factor, it outputs a pair of integers (pos, len), where pos is the position of an occurrence of the longest previously parsed substring (or 0 if a never-before-seen character is processed) and len is its length (or the new character itself). For example, the LZ parse of T = abaababaaabab is LZ(T) = (0,a)(0,b)(1,1)(1,3)(2,3)(4,4).
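To make the greedy process concrete, here is a minimal C++ sketch of the parsing described above. It finds each longest previous factor by brute force in O(n^2) time (practical parsers use suffix arrays or similar indexes instead); source occurrences may overlap the current position, and output positions are 1-based, as in the example.

    #include <iostream>
    #include <string>

    // Minimal sketch of the greedy LZ factorization described above.
    // The longest previous factor is found by brute force; practical
    // parsers use suffix arrays or similar indexes instead.
    int main() {
        std::string T = "abaababaaabab";
        const int n = T.size();
        int i = 0;
        while (i < n) {
            int bestLen = 0, bestPos = 0;
            for (int j = 0; j < i; ++j) {  // candidate source positions
                int l = 0;
                while (i + l < n && T[j + l] == T[i + l]) ++l;
                if (l > bestLen) { bestLen = l; bestPos = j + 1; }  // 1-based
            }
            if (bestLen == 0) {            // never-before-seen character
                std::cout << "(0," << T[i] << ")";
                ++i;
            } else {
                std::cout << "(" << bestPos << "," << bestLen << ")";
                i += bestLen;
            }
        }
        std::cout << "\n";  // prints (0,a)(0,b)(1,1)(1,3)(2,3)(4,4)
    }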

The LZ factorization is widely known for achieving a strong compression ratio on a variety of texts. After the text has been compressed, a common task is accessing arbitrary characters while keeping the text in compressed form. On LZ parsings, this task can be arbitrarily slow: it can be shown that accessing a single character may require O(n) operations (also called jumps), where |T| = n is the length of the uncompressed text. Very recently, researchers have proposed parsing schemes that bound the number of jumps. Given an integer constant c, the resulting modified parsing guarantees that extracting a single character requires at most c jumps. The goal is to set c as low as possible while keeping the same number of phrases as the original LZ factorization. Unfortunately, finding the smallest modified LZ parse with a bound on the number of jumps is NP-hard for any constant c.
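The following C++ sketch illustrates what a jump is, using a toy representation of the parse (the Factor struct and the extract function are purely illustrative, not taken from the literature): to extract T[i], we locate the phrase containing position i and, if it is a reference, jump to the corresponding position inside its source occurrence, repeating until a literal character is reached.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Illustration-only phrase representation (0-based source positions).
    struct Factor {
        bool isLiteral;
        char lit;  // valid if isLiteral
        int src;   // valid otherwise: start of the source occurrence
    };

    // Extract T[i] by following jumps through the parse. start[p] is the
    // text position where phrase p begins. Each rewrite of i into a source
    // occurrence is one jump; without extra constraints on the parse, a
    // single access can require Theta(n) jumps.
    char extract(const std::vector<Factor>& parse, const std::vector<int>& start,
                 int i, int& jumps) {
        for (;;) {
            // index of the phrase containing text position i
            int p = std::upper_bound(start.begin(), start.end(), i)
                    - start.begin() - 1;
            if (parse[p].isLiteral) return parse[p].lit;
            i = parse[p].src + (i - start[p]);  // same character, earlier occurrence
            ++jumps;
        }
    }

    int main() {
        // LZ(abaababaaabab) = (0,a)(0,b)(1,1)(1,3)(2,3)(4,4), stored 0-based
        std::vector<Factor> parse = {{true, 'a', 0}, {true, 'b', 0},
                                     {false, 0, 0},  {false, 0, 0},
                                     {false, 0, 1},  {false, 0, 3}};
        std::vector<int> start = {0, 1, 2, 3, 6, 9};
        int jumps = 0;
        char c = extract(parse, start, 12, jumps);
        std::cout << c << " after " << jumps << " jumps\n";  // b after 2 jumps
    }

On the example parse, extracting the last character takes two jumps: position 13 falls into phrase (4,4) and is mapped to position 7, which falls into phrase (2,3) and is mapped to position 2, a literal b.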

The aim of this thesis is to explore practical heuristics that allow setting a very small c while keeping the number of factors close to the optimum.

A second thesis topic builds on the same background. The parsing algorithms proposed in the recent literature all employ space-demanding data structures, such as suffix trees and suffix arrays. The aim of this thesis is to implement the existing algorithms while reducing their space consumption by using compact data structures, such as compressed suffix trees/arrays, or compressed data structures, such as the r-index. The r-index is a powerful text index that takes space proportional to a well-studied compressibility measure: the number of runs r in the Burrows-Wheeler Transform (BWT) of T.

The Burrows-Wheeler Transform (BWT) is a reversible permutation of the input text T[1..n]. The BWT can be computed by lexicographically sorting all cyclic rotations of the input text. This yields a conceptual matrix whose first column (F) lists the characters of T in lexicographically sorted order, and whose last column (L) is the BWT. For example, the BWT of T = banana is BWT(T) = nnbaaa.
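Assuming a small input, a naive construction follows directly from this definition; the following C++ sketch is for illustration only (real implementations derive the BWT from a suffix array in linear time).

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Naive BWT construction: sort all cyclic rotations and concatenate
    // their last characters. O(n^2 log n) time, for illustration only.
    std::string bwt(const std::string& T) {
        const int n = T.size();
        std::vector<std::string> rot(n);
        for (int i = 0; i < n; ++i)
            rot[i] = T.substr(i) + T.substr(0, i);  // rotation starting at i
        std::sort(rot.begin(), rot.end());          // lexicographic order
        std::string L;
        for (const auto& r : rot) L += r.back();    // last column
        return L;
    }

    int main() {
        std::cout << bwt("banana") << "\n";  // prints nnbaaa
    }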

The BWT tends to group equal characters together; on repetitive texts in particular, we can observe a so-called clustering effect. Applying a simple run-length encoding (RLE) to the BWT can therefore improve the compression of the original text. Already in our small example, the BWT has 3 equal-letter runs (2n, 1b, 3a), while the original text has 6 (1b, 1a, 1n, 1a, 1n, 1a).
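The run-length encoding itself is a simple one-pass scan; this C++ sketch reproduces the run counts of the example.

    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Run-length encode a string into (length, character) pairs,
    // matching the notation above (e.g. nnbaaa -> 2n 1b 3a).
    std::vector<std::pair<int, char>> rle(const std::string& s) {
        std::vector<std::pair<int, char>> runs;
        for (char c : s) {
            if (!runs.empty() && runs.back().second == c) ++runs.back().first;
            else runs.push_back({1, c});
        }
        return runs;
    }

    int main() {
        for (auto [len, c] : rle("nnbaaa")) std::cout << len << c << ' ';
        std::cout << "\n";  // 2n 1b 3a  (3 runs)
        for (auto [len, c] : rle("banana")) std::cout << len << c << ' ';
        std::cout << "\n";  // 1b 1a 1n 1a 1n 1a  (6 runs)
    }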

Powerful text indexes, such as the FM-index and compressed suffix trees, are based on the BWT. More recent data structures focus on using space proportional to r, the number of runs in the BWT of the input text: RLFM-indexes and the more complex r-index allow for fast pattern matching in O(r) space.

One of the main open questions in BWT-related research is whether it is possible to achieve fast random access to the original text using O(r) space. It is trivial to achieve O(n) time in O(r) space, but this is, of course, not satisfactory. The best known trade-off is Õ(log n) time using O(r log n) space.
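To illustrate the trivial end of this spectrum, the following C++ sketch stores only the run-length encoded BWT plus O(r) words of bookkeeping and answers access(k) by stepping backwards through the text with the LF mapping, i.e., with up to n LF steps per query. It assumes the text ends with a unique sentinel $ (the banana example above omits the sentinel), and all class and member names are illustrative only.

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Illustrative run-length encoded BWT: O(r)-space storage with
    // character extraction by backward LF steps (O(n) steps per access).
    struct RLBWT {
        std::vector<char> runChar;   // character of each run
        std::vector<long> runStart;  // BWT position where each run starts
        std::vector<long> before;    // occurrences of runChar[j] in runs 0..j-1
        std::map<char, long> C;      // number of BWT characters smaller than c
        long n = 0;

        explicit RLBWT(const std::string& L) {
            std::map<char, long> cnt;
            for (long i = 0; i < (long)L.size(); ++i) {
                if (i == 0 || L[i] != L[i - 1]) {
                    runChar.push_back(L[i]);
                    runStart.push_back(i);
                    before.push_back(cnt[L[i]]);
                }
                ++cnt[L[i]];
            }
            n = L.size();
            long acc = 0;
            for (auto [c, k] : cnt) { C[c] = acc; acc += k; }
        }

        long runOf(long i) const {   // run containing BWT position i
            return std::upper_bound(runStart.begin(), runStart.end(), i)
                   - runStart.begin() - 1;
        }
        long LF(long i) const {      // row holding the preceding text character
            long j = runOf(i);
            return C.at(runChar[j]) + before[j] + (i - runStart[j]);
        }

        char access(long k) const {  // T[k], 0-based, T[n-1] = '$'
            if (k == n - 1) return '$';
            long i = 0, pos = n - 2; // row 0 starts with '$'; its L-char is T[n-2]
            while (pos != k) { i = LF(i); --pos; }
            return runChar[runOf(i)];
        }
    };

    int main() {
        RLBWT idx("annb$aa");        // BWT of "banana$"
        for (long k = 0; k < 7; ++k) std::cout << idx.access(k);
        std::cout << "\n";           // prints banana$
    }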

The aim of this thesis is to engineer data structures built on the RLE BWT that provide fast random access in practice, exploring possible trade-offs between time and space.