I am a researcher in machine learning / artificial intelligence at the
University of Cambridge. I work in the CBL Laboratory at the Department
of Engineering. I am interested in Bayesian inference, information
theory, data compression, probabilistic programming, sampling methods,
language design, automated model construction, and other topics.

I have worked with Prof David J.C. MacKay and with Prof Zoubin
Ghahramani, and I am currently working on the Automatic Statistician
Project.

My Erdős number is 3, and I am a member of King's College.

### Contact

I can best be reached by email:

**tcs27 [at] cam.ac.uk**

### Some recent work

#### Making compression algorithms for Unicode text

ArXiv · IEEE

The majority of online content is written in languages other
than English, and is most commonly encoded in UTF-8, the
world's dominant Unicode character encoding. Traditional
compression algorithms typically operate on individual bytes.
While this approach works well for the single-byte ASCII
encoding, it works poorly for UTF-8, where characters often
span multiple bytes. Previous research has focused on
developing Unicode compressors from scratch, which often failed
to outperform established algorithms such as bzip2. We develop
a technique to modify byte-based compressors to operate
directly on Unicode characters, and implement variants of LZW
and PPM that apply this technique. We find that our method
substantially improves compression effectiveness on a UTF-8
corpus, with our PPM variant outperforming the state-of-the-art
PPMII compressor. On ASCII and binary files, our variants
perform similarly to the original unmodified compressors.
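The core idea admits a small sketch (an illustration of the approach, not the paper's implementation): decode UTF-8 into code points first, then run a dictionary compressor such as LZW over the code-point alphabet rather than over raw bytes.

```python
def lzw_encode(symbols):
    """Plain LZW over an arbitrary symbol alphabet (here: code points).
    For simplicity the dictionary is seeded from the input's own alphabet;
    a real codec would share a fixed initial alphabet with the decoder."""
    dictionary = {(s,): i for i, s in enumerate(sorted(set(symbols)))}
    output, phrase = [], ()
    for s in symbols:
        candidate = phrase + (s,)
        if candidate in dictionary:
            phrase = candidate          # keep extending the current match
        else:
            output.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)  # learn the new phrase
            phrase = (s,)
    if phrase:
        output.append(dictionary[phrase])
    return output

text = "αβαβαβγ"
byte_level = lzw_encode(list(text.encode("utf-8")))   # operates on UTF-8 bytes
char_level = lzw_encode([ord(c) for c in text])       # operates on code points

# The character-level pass sees the repeated 'αβ' unit directly, while the
# byte-level pass must first learn the multi-byte structure of each character.
print(len(char_level), len(byte_level))
```

On pure ASCII input the two passes see identical symbol streams, consistent with the similar performance reported there. (The token counts here ignore that code-point indices cost more bits each; a real comparison measures output bits.)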

*Proceedings of the Data Compression Conference, DCC 2017*.
[ CRH, BibTeX, RIS ]

#### Compressing combinatorial objects

Pre-print · ArXiv · IEEE

Most of the world's digital data is currently encoded in a
sequential form, and compression methods for sequences have
been studied extensively. However, there are many types of
non-sequential data for which good compression techniques are
still largely unexplored. This paper contributes insights and
concrete techniques for compressing various kinds of
non-sequential data via arithmetic coding, and derives
re-usable probabilistic data models from fairly generic
structural assumptions. Near-optimal compression methods are
described for certain types of permutations, combinations and
multisets; and the conditions for optimality are made explicit
for each method.
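For combinations, near-optimality has a compact illustration (a sketch using the standard combinatorial number system, not necessarily the paper's exact construction): every k-subset of an n-element universe maps bijectively to an integer below C(n, k), which is the information-theoretic minimum code length.

```python
from math import comb

def combination_rank(subset, n):
    """Rank a k-subset of {0, ..., n-1} in the combinatorial number system:
    the result is an integer in [0, C(n, k)), so the subset fits in
    ceil(log2(C(n, k))) bits."""
    subset = sorted(subset)
    # With c_1 < c_2 < ... < c_k, the rank is sum_i C(c_i, i).
    return sum(comb(c, i + 1) for i, c in enumerate(subset))

# A 3-subset of a 10-element universe needs ceil(log2(C(10, 3))) = 7 bits,
# versus 10 bits for a naive bitmap or 12 bits for three 4-bit indices.
r = combination_rank({2, 5, 9}, 10)
print(r, comb(10, 3))
```

Here the rank is 96, strictly below C(10, 3) = 120, and the map is invertible by greedily peeling off the largest binomial coefficients.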

*Proceedings of the Data Compression Conference, DCC 2016*.
[ CRH, BibTeX, RIS ]

#### Improving PPM with dynamic parameter updates

Pre-print · IEEE · slide transcript

This article makes several improvements to the classic PPM algorithm,
resulting in a new algorithm with superior compression effectiveness
on human text. The key differences of our algorithm to classic PPM are
that: (a) rather than the original escape mechanism, we use a
generalised blending method with explicit hyper-parameters that
control the way symbol counts are combined to form predictions;
(b) different hyper-parameters are used for classes of different contexts;
and (c) these hyper-parameters are updated dynamically using gradient
information. The resulting algorithm (PPM-DP) compresses human text
better than all currently published variants of PPM, CTW, DMC, LZ, CSE
and BWT, with runtime only slightly slower than classic PPM.
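To make the idea concrete, here is a sketch of what explicit blending hyper-parameters can look like (illustrative only: `alpha` and `beta` are generic strength/discount parameters, not necessarily the paper's exact parameterisation):

```python
def blend(counts, shallower_prob, alpha, beta):
    """One blending level with explicit strength (alpha) and discount (beta)
    hyper-parameters: discounted context counts are interpolated with the
    prediction of the next-shorter context, so no explicit escape symbol
    is needed. Assumes 0 <= beta <= 1 and integer counts >= 1."""
    total = sum(counts.values())
    unique = len(counts)
    def prob(symbol):
        c = counts.get(symbol, 0)
        discounted = max(c - beta, 0.0)
        # Mass reserved for backing off to the shorter context.
        backoff_mass = (alpha + beta * unique) / (total + alpha)
        return discounted / (total + alpha) + backoff_mass * shallower_prob(symbol)
    return prob

# Example: a context that has seen 'a' three times and 'b' once, backed off
# to a uniform distribution over the four-symbol alphabet {a, b, c, d}.
p = blend({"a": 3, "b": 1}, lambda s: 0.25, alpha=0.5, beta=0.5)
```

The dynamic-update part of the algorithm would adjust such alpha/beta values by gradient steps on the coding loss; this sketch shows only the blending, which by construction sums to one whenever the shorter-context prediction does.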

*Proceedings of the Data Compression Conference, DCC 2015*.
[ CRH, BibTeX, RIS ]

#### Compressing Sets and Multisets of Sequences

ArXiv · IEEE

This article describes lossless compression algorithms for multisets
of sequences, taking advantage of the multiset's unordered structure.
Multisets are a generalisation of sets where members are allowed to
occur multiple times. A multiset can be encoded naïvely by simply
storing its elements in some sequential order, but then information is
wasted on the ordering. We propose a technique that transforms the
multiset into an order-invariant tree representation, and derive an
arithmetic code that optimally compresses the tree. Our method
achieves compression even if the sequences in the multiset are
individually incompressible (such as cryptographic hash sums). The
algorithm is demonstrated practically by compressing collections of
SHA-1 hash sums, and multisets of arbitrary, individually encodable
objects.

*IEEE Transactions on Information Theory*, Vol 61, No. 3, March 2015.
[ CRH, BibTeX, RIS ]

#### Lossless Data Compression

PhD thesis

This thesis makes several contributions to the field of data
compression.
Compression algorithms are derived for a variety of
applications, employing
probabilistic modelling, Bayesian inference, and arithmetic coding;
and making the underlying probability distributions explicit
throughout.
A general compression toolbox is described, consisting of
practical algorithms for compressing data distributed by various
fundamental probability distributions, and mechanisms for combining
these algorithms in a principled way.
New mathematical theory is introduced for compressing objects
with an underlying combinatorial structure, such as permutations,
combinations, and multisets.
For text compression, a novel unifying construction is developed for a
family of context-sensitive compression algorithms, special cases of
which include the PPM algorithm and the Sequence Memoizer.
The work concludes with experimental results, example applications,
and a brief discussion on cost-sensitive compression and
adversarial sequences.

#### SAT-solving: Performance analysis of Survey Propagation and DPLL

Technical report

The Boolean Satisfiability Problem (SAT) belongs to the class of
NP-complete problems, meaning that there is no known deterministic
algorithm that can solve an arbitrary problem instance in less than
exponential time (parametrised by the length of the input). There is
great industrial demand for solving SAT, motivating the need for
algorithms which perform well. I present a comparison of two
approaches for solving SAT instances: DPLL (an exact algorithm from
classical computer science) and Survey Propagation (a probabilistic
algorithm from statistical physics). The two algorithms were compared
on randomly generated 3-SAT problems with varying clause to variable
ratios.
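The classical side of the comparison fits in a few lines; this is a toy sketch of DPLL (not the report's implementation): simplify clauses under the current assignment, propagate unit clauses, then branch.

```python
def dpll(clauses, assignment=None):
    """Minimal DPLL on CNF: clauses are lists of non-zero ints, where a
    negative int denotes a negated variable. Returns a satisfying
    assignment (dict: variable -> bool) or None if unsatisfiable."""
    if assignment is None:
        assignment = {}
    # Simplify each clause under the current partial assignment.
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue  # clause already satisfied
        remaining = [l for l in clause if abs(l) not in assignment]
        if not remaining:
            return None  # clause falsified: backtrack
        simplified.append(remaining)
    if not simplified:
        return assignment  # every clause satisfied
    # Unit propagation: a single-literal clause forces its variable.
    for clause in simplified:
        if len(clause) == 1:
            l = clause[0]
            return dpll(simplified, {**assignment, abs(l): l > 0})
    # Branch on the first variable of the first unresolved clause.
    var = abs(simplified[0][0])
    for value in (True, False):
        result = dpll(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
model = dpll([[1, 2], [-1, 3], [-2, -3]])
```

Survey Propagation replaces this exhaustive search with message passing on the factor graph of the formula, which is what makes the two approaches interesting to compare near the hard clause-to-variable ratio.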

Cavendish Laboratory, University of Cambridge.
[ CRH, BibTeX, RIS ]

#### Quantum computation on non-quantum computers

BA dissertation

This dissertation outlines the design of a statically typed
programming language for quantum computers,
and describes a working implementation of it
for classical computer systems.

Computer Laboratory, University of Cambridge.
[ CRH, BibTeX, RIS ]

### Teaching

I have taught various undergraduate courses at Cambridge University,
including:
*Probability Theory,
Machine Learning,
Functional Programming,
Type Theory,
Computation Theory,
Quantum Computation,
various programming languages (ML, Prolog, Java),*
and *Computer Science Foundations*.

I have also supervised several B.A. and MPhil dissertations,
on topics ranging from machine learning to computation theory.