To content
Department of Computer Science

Theses on Algorithmic Foundations

Requirements

Please respect our general guideline for theses in our group.

For theses in the area of algorithmic fondations, your should satisfy the following:

  • You enjoy working on algorithmic problems
  • You have knowledge of a memory and runtime-efficient programming language; unless stated otherwise, we expect implementations in C++
  • You have experience with programming on the bit level / you feel comfortable with AND, OR and shifts

These topics are suitable for students of computer science. If you are interested, please contact Prof. Dr. Johannes Fischer, Patrick Dinklage or Francesco Masillo.

Open Topics

The following topics are currently available for bachelor's (BA) or master's (MA) theses.

In the Relative Lempel-Ziv (RLZ) compression scheme, the input text is compressed using a reference: patterns in the text that also occur in the reference are replaced by pointers into the reference. For example, if the reference is R = abab, then the text T=abbababababa can be compressed to T'=(1,2)(2,3)(1,4)(1,3), where each tuple (i, m) consists of the position i in the reference R at which a pattern was found, and its length m.

Naturally, the "better" the reference is, the better we can compress T. Of course, we could choose R=T, but because R needs to be written to the output so that a decoder can restore T, this would not achieve any compression (the output size is |R| + |T'|). In this regard, the reference R needs to cover as many and as long patterns of T as possible, and at the same time it should be much shorter than T itself. Constructing a good reference R given a text T is therefore not trivial.

In this thesis, we want to explore a new way to construct a reference. We partition T into blocks of some fixed size and compute an all-to-all similarity matrix that we can use to tell how "similar" a particular block is to any other block. This can be done using MinHashes. The matrix is then used to compute a score for each block with the intent that the higher the score of a block, the more urgently we want to include it in the reference R. To avoid R becoming too repetitive in itself, whenever we add a block to R, the scores of all other blocks should be adjusted accordingly. This is repeated until R reaches the desired size, which is a parameter of the algorithm. In the thesis, this new approach is to be implemented in C++ and compared to other approaches from the literature.

In the Relative Lempel-Ziv (RLZ) compression scheme, the input text is compressed using a reference: patterns in the text that also occur in the reference are replaced by pointers into the reference. For example, if the reference is R = abab, then the text T=abbababababa can be compressed to T'=(1,2)(2,3)(1,4)(1,3), where each tuple (i, m) consists of the position i in the reference R at which a pattern was found, and its length m.

Naturally, the "better" the reference is, the better we can compress T. Of course, we could choose R=T, but because R needs to be written to the output so that a decoder can restore T, this would not achieve any compression (the output size is |R| + |T'|). In this regard, the reference R needs to cover as many and as long patterns of T as possible, and at the same time it should be much shorter than T itself. Constructing a good reference R given a text T is therefore not trivial.

In this thesis, we want to explore a new way to construct a reference. From the suffix tree of T, step by step, we attempt to find a branch that covers a large portion of T and then prune the branch from the tree, continuing using the remaining suffix tree. The difficulty lies in balancing the selection of branches at large enough depths such that larger portions of T are covered against the desire to cover all of T (the full breadth of the suffix tree) without R becoming too long. In this regard, finding an "optimal" balance is NP-hard, and thus this thesis aims at finding practical heuristics.

Bitvectors capable of performing rank and select queries are fundamental to succinct data structures and are widely used in a wide spectrum of applications in computer science. A rank query on a bitvector B returns the number of bits of type b = {0, 1} that occur before a given position i, i.e. rankb(i) = |{j : 0 <= j < i,  B[j] = b}|. The select operation is the inverse of the rank operation. It returns the position of the kth occurrence of a specific bit. Formally, selectb(k) = min{j : 1 <= j <= |B|, rankb(j) = k}.

For example, given a bitvector B = 01000110101111, we have that query rank1(5) = 1 and select0(6) = 10

There exist numerous solutions for these two types of queries that run in constant time and require little space, i.e. o(|B|). However, most existing bitvector implementations supporting rank/select are static, which in turn limits more complex data structures built upon them to being static as well. If we look at a dynamic scenario, where we allow modifications to the bitvector, we also need to support insertions, deletions or changing the value of some positions in B. All these operations must be reflected in the future rank and select queries.

In terms of compression, many different schemes can be used to reduce the space used by the bitvector. One of the simplest types of compression is run-length encoding (RLE). This encoding consists of tuples of the form (number of elements, type of element). Looking at the previous example, RLE(B) = (1,0), (1,1), (3,0), (2,1), (1,0), (1,1), (1,0), (4,1).

Combining compressed and dynamic bitvectors supporting rank/select operations is non-trivial, and very few implementations are currently available. In this thesis, the aim will be exploring a new approach for implementing dynamic run-length encoded bitvectors using splay trees. Splay trees are a special kind of binary search trees that are very easy to implement and allow a wide range of operations and modifications on the tree structure. This will also allow the possibility of improving the running time of batched contiguous insertions/deletions, i.e. adding or removing parts of the bitvector all together instead of one at a time.