Theses on Algorithmic Foundations
Requirements
Please respect our general guideline for theses in our group.
For theses in the area of algorithmic fondations, your should satisfy the following:
- You enjoy working on algorithmic problems
- You have knowledge of a memory and runtime-efficient programming language; unless stated otherwise, we expect implementations in C++
- You have experience with programming on the bit level / you feel comfortable with AND, OR and shifts
These topics are suitable for students of computer science. If you are interested, please contact Prof. Dr. Johannes Fischer, Patrick Dinklage or Jonas Ellert.
Open Topics
The following topics are currently available for bachelor's (BA) or master's (MA) theses.
In the Relative Lempel-Ziv (RLZ) compression scheme, the input text is compressed using a reference: patterns in the text that also occur in the reference are replaced by pointers into the reference. For example, if the reference is R = abab, then the text T=abbababababa can be compressed to T'=(1,2)(2,3)(1,4)(1,3), where each tuple (i, m) consists of the position i in the reference R at which a pattern was found, and its length m.
Naturally, the "better" the reference is, the better we can compress T. Of course, we could choose R=T, but because R needs to be written to the output so that a decoder can restore T, this would not achieve any compression (the output size is |R| + |T'|). In this regard, the reference R needs to cover as many and as long patterns of T as possible, and at the same time it should be much shorter than T itself. Constructing a good reference R given a text T is therefore not trivial.
In this thesis, we want to explore a new way to construct a reference. We partition T into blocks of some fixed size and compute an all-to-all similarity matrix that we can use to tell how "similar" a particular block is to any other block. This can be done using MinHashes. The matrix is then used to compute a score for each block with the intent that the higher the score of a block, the more urgently we want to include it in the reference R. To avoid R becoming too repetitive in itself, whenever we add a block to R, the scores of all other blocks should be adjusted accordingly. This is repeated until R reaches the desired size, which is a parameter of the algorithm. In the thesis, this new approach is to be implemented in C++ and compared to other approaches from the literature.