I am a researcher in machine learning /
artificial intelligence
at the University of Cambridge.
I work in the
Computational and Biological Learning (CBL) Lab
at the Department of Engineering.
I am interested in Bayesian inference,
information theory,
language models,
data compression,
probabilistic programming,
automated model discovery,
and other topics.
I am working with Prof Zoubin Ghahramani,
and am the principal architect of the
Automatic Statistician Project.
I previously worked with
Prof Sir David J.C. MacKay,
with whom I co-taught courses on machine learning and information theory.
I have an associate faculty position at the Cambridge ELLIS unit.
My Erdős number is 3,
and I am a member of King's College.
I have been a co-founder, executive, and board member of multiple tech companies.
I have also acted as a technical or strategic advisor to venture capital funds
and private equity investors.
Contact
I can best be reached by email:
tcs27
[at] cam
. ac
. uk
Selected work
The Automatic Statistician
Springer ·
IEEE
The Automatic Statistician project aims to automate data
science, producing predictions and human-readable reports from
raw datasets with minimal human intervention. Alongside basic
graphs and statistics, the generated reports contain a curation
of high-level insights about the dataset that are obtained from
(1) an automated construction of models for the dataset, (2) a
comparison of these models, and (3) a software component that
turns these results into natural language descriptions. This
chapter describes the common architecture of such Automatic
Statistician systems, and discusses some of the design
decisions and technical challenges.
In
Automated Machine Learning (Springer Series on Challenges in Machine Learning, 2018).
[
CRH,
BibTeX,
RIS ]
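The three-stage pipeline described above (model construction, model comparison, report generation) can be sketched in miniature. Everything in this sketch is illustrative: the function names, the polynomial model family, and the BIC-based scoring are placeholder choices for exposition, not the actual components of the Automatic Statistician.

```python
import numpy as np

def build_models(x, y):
    """Stage 1: fit a small family of candidate models (polynomial degrees)."""
    return {f"degree-{d}": np.polyfit(x, y, d) for d in (1, 2, 3)}

def compare_models(models, x, y):
    """Stage 2: score each model; here by BIC under a Gaussian noise model."""
    n = len(x)
    scores = {}
    for name, coeffs in models.items():
        resid = y - np.polyval(coeffs, x)
        sigma2 = max(np.mean(resid ** 2), 1e-12)
        log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        scores[name] = len(coeffs) * np.log(n) - 2 * log_lik  # lower is better
    return scores

def describe(scores):
    """Stage 3: turn the comparison into a one-line natural-language summary."""
    best = min(scores, key=scores.get)
    return f"The best-fitting model is a {best} polynomial (BIC {scores[best]:.1f})."

x = np.linspace(0, 1, 50)
y = 2 * x ** 2 + 0.01 * np.random.default_rng(0).standard_normal(50)
models = build_models(x, y)
print(describe(compare_models(models, x, y)))
```

The real systems search a far richer, compositional model space (e.g. Gaussian process kernel structures) and generate multi-page reports; the point here is only the division of labour between the three stages.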
Making compression algorithms for Unicode text
ArXiv ·
IEEE
The majority of online content is written in languages other
than English, and is most commonly encoded in UTF-8, the
world's dominant Unicode character encoding. Traditional
compression algorithms typically operate on individual bytes.
While this approach works well for the single-byte ASCII
encoding, it works poorly for UTF-8, where characters often
span multiple bytes. Previous research has focused on
developing Unicode compressors from scratch, which often failed
to outperform established algorithms such as bzip2. We develop
a technique to modify byte-based compressors to operate
directly on Unicode characters, and implement variants of LZW
and PPM that apply this technique. We find that our method
substantially improves compression effectiveness on a UTF-8
corpus, with our PPM variant outperforming the state-of-the-art
PPMII compressor. On ASCII and binary files, our variants
perform similarly to the original unmodified compressors.
Proceedings of the Data Compression Conference, DCC 2017.
[
CRH,
BibTeX,
RIS ]
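The core idea, operating on whole code points rather than raw bytes, can be illustrated with a toy LZW encoder whose dictionary is keyed on arbitrary symbols. This is a simplified sketch for intuition, not the paper's implementation (a real codec would also need a dictionary scheme the decoder can reconstruct):

```python
def lzw_encode(symbols):
    """LZW over an arbitrary symbol alphabet (e.g. Unicode code points)."""
    dictionary = {}          # maps tuples of symbols -> integer code
    next_code = 0
    def code_for(seq):
        nonlocal next_code
        if seq not in dictionary:
            dictionary[seq] = next_code
            next_code += 1
        return dictionary[seq]
    for s in sorted(set(symbols)):   # pre-load single symbols
        code_for((s,))
    out, current = [], ()
    for s in symbols:
        if current + (s,) in dictionary:
            current = current + (s,)             # extend the current match
        else:
            out.append(dictionary[current])      # emit longest match
            code_for(current + (s,))             # grow the dictionary
            current = (s,)
    if current:
        out.append(dictionary[current])
    return out

text = "héllo héllo héllo"
byte_symbols = list(text.encode("utf-8"))   # byte-level alphabet: é = 2 symbols
char_symbols = [ord(c) for c in text]       # code-point alphabet: é = 1 symbol
print(len(lzw_encode(byte_symbols)), len(lzw_encode(char_symbols)))
```

On this tiny input the code-point variant already emits fewer dictionary codes than the byte variant, because multi-byte characters no longer fragment the matches.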
Compressing combinatorial objects
Pre-print ·
ArXiv ·
IEEE
Most of the world's digital data is currently encoded in a
sequential form, and compression methods for sequences have
been studied extensively. However, there are many types of
non-sequential data for which good compression techniques are
still largely unexplored. This paper contributes insights and
concrete techniques for compressing various kinds of
non-sequential data via arithmetic coding, and derives
re-usable probabilistic data models from fairly generic
structural assumptions. Near-optimal compression methods are
described for certain types of permutations, combinations and
multisets; and the conditions for optimality are made explicit
for each method.
Proceedings of the Data Compression Conference, DCC 2016.
[
CRH,
BibTeX,
RIS ]
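For one of these object types, combinations, the optimality target is concrete: a k-element subset of {0,…,n−1} carries exactly log2 C(n,k) bits, and lexicographic ranking/unranking attains this bound. The sketch below illustrates that idea; the function names are mine, and the paper's actual methods are arithmetic codes rather than this direct ranking.

```python
from math import comb

def rank_combination(subset, n):
    """Map a sorted k-subset of range(n) to its lexicographic rank."""
    k, r, prev = len(subset), 0, -1
    for i, x in enumerate(subset):
        for y in range(prev + 1, x):
            r += comb(n - 1 - y, k - 1 - i)   # subsets branching lower here
        prev = x
    return r

def unrank_combination(r, n, k):
    """Inverse of rank_combination."""
    subset, x = [], 0
    for i in range(k):
        while comb(n - 1 - x, k - 1 - i) <= r:
            r -= comb(n - 1 - x, k - 1 - i)
            x += 1
        subset.append(x)
        x += 1
    return subset

s = [1, 4, 5]
r = rank_combination(s, 8)
assert unrank_combination(r, 8, 3) == s
print(r, comb(8, 3))   # the rank fits in ceil(log2(56)) = 6 bits
```

Transmitting the rank in ceil(log2 C(n,k)) bits is within one bit of the information content; arithmetic coding closes even that gap on average.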
Improving PPM with dynamic parameter updates
Pre-print ·
IEEE ·
slide transcript
This article makes several improvements to the classic PPM algorithm,
resulting in a new algorithm with superior compression effectiveness
on human text. The key differences between our algorithm and classic PPM
are: Ⓐ rather than the original escape mechanism, we use a
generalised blending method with explicit hyper-parameters that
control the way symbol counts are combined to form predictions;
Ⓑ different hyper-parameters are used for classes of different contexts;
and Ⓒ these hyper-parameters are updated dynamically using gradient
information. The resulting algorithm (PPM-DP) compresses human text
better than all currently published variants of PPM, CTW, DMC, LZ, CSE
and BWT, with runtime only slightly slower than classic PPM.
Proceedings of the Data Compression Conference, DCC 2015.
[
CRH,
BibTeX,
RIS ]
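Point Ⓐ, blending with explicit hyper-parameters, can be sketched as follows: each context order contributes symbol counts, and a strength α and discount β control how much probability mass is passed down to the next-shorter context. This is a simplified, static illustration (no per-context classes, no gradient updates), not the PPM-DP implementation.

```python
from collections import Counter
from math import log2

ALPHABET = 256

def blended_prob(symbol, context, counts, alpha=0.5, beta=0.5):
    """P(symbol | context): blend this context's counts with shorter contexts."""
    if context is None:
        return 1.0 / ALPHABET                      # base case: uniform on bytes
    ctx_counts = counts.get(context, {})
    backoff = blended_prob(symbol, context[1:] if context else None,
                           counts, alpha, beta)
    n = sum(ctx_counts.values())
    if n == 0:
        return backoff
    u = len(ctx_counts)                            # distinct symbols seen here
    discounted = max(ctx_counts.get(symbol, 0) - beta, 0.0)
    # Discounted count, plus (alpha + beta*u) mass routed to the backoff.
    return (discounted + (alpha + beta * u) * backoff) / (n + alpha)

def codelength(data, max_order=2, alpha=0.5, beta=0.5):
    """Total bits to encode `data` under adaptive blended predictions."""
    counts, bits = {}, 0.0
    for i, s in enumerate(data):
        context = tuple(data[max(0, i - max_order):i])
        bits += -log2(blended_prob(s, context, counts, alpha, beta))
        for k in range(len(context) + 1):          # update every suffix context
            counts.setdefault(context[k:], Counter())[s] += 1
    return bits

text = list(b"abracadabra abracadabra")
print(round(codelength(text), 1), "bits, vs", 8 * len(text), "raw")
```

The blended distribution sums to one by construction (each seen symbol gives up β, and the α + βu mass funds the backoff), which is what makes the hyper-parameters amenable to the gradient-based updates of point Ⓒ.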
Compressing Sets and Multisets of Sequences
ArXiv ·
IEEE
This article describes lossless compression algorithms for multisets
of sequences, taking advantage of the multiset's unordered structure.
Multisets are a generalisation of sets where members are allowed to
occur multiple times. A multiset can be encoded naïvely by simply
storing its elements in some sequential order, but then information is
wasted on the ordering. We propose a technique that transforms the
multiset into an order-invariant tree representation, and derive an
arithmetic code that optimally compresses the tree. Our method
achieves compression even if the sequences in the multiset are
individually incompressible (such as cryptographic hash sums). The
algorithm is demonstrated practically by compressing collections of
SHA-1 hash sums, and multisets of arbitrary, individually encodable
objects.
IEEE Transactions on Information Theory, Vol 61, No. 3, March 2015.
[
CRH,
BibTeX,
RIS ]
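The "information wasted on the ordering" mentioned above is easy to quantify: a sequential encoding of an m-element multiset spends an extra log2(m! / ∏ k_i!) bits, where the k_i are the element multiplicities. A small sketch (illustrative only, not the paper's tree-based code):

```python
from math import factorial, log2
from collections import Counter

def ordering_overhead_bits(multiset):
    """Bits wasted by encoding a multiset as an arbitrarily ordered sequence."""
    perms = factorial(len(multiset))
    for k in Counter(multiset).values():
        perms //= factorial(k)      # orderings of identical elements coincide
    return log2(perms)

# 16 distinct hash-like values: any of 16! orderings encodes the same set,
# so an order-invariant code saves log2(16!) ~ 44.25 bits.
hashes = [f"{i:040x}" for i in range(16)]
print(round(ordering_overhead_bits(hashes), 2))
```

This overhead grows super-linearly with the collection size, which is why order-invariant codes pay off even when the individual elements (such as hash sums) are incompressible.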
Lossless Data Compression
PhD thesis
This thesis makes several contributions to the field of data
compression.
Compression algorithms are derived for a variety of
applications, employing
probabilistic modelling, Bayesian inference, and arithmetic coding;
and making the underlying probability distributions explicit
throughout.
A general compression toolbox is described, consisting of
practical algorithms for compressing data distributed by various
fundamental probability distributions, and mechanisms for combining
these algorithms in a principled way.
New mathematical theory is introduced for compressing objects
with an underlying combinatorial structure, such as permutations,
combinations, and multisets.
For text compression, a novel unifying construction is developed for a
family of context-sensitive compression algorithms, special cases of
which include the PPM algorithm and the Sequence Memoizer.
The work concludes with experimental results, example applications,
and a brief discussion on cost-sensitive compression and
adversarial sequences.
SAT-solving: Performance analysis of Survey Propagation and DPLL
Technical report
The Boolean Satisfiability Problem (SAT) belongs to the class of
NP-complete problems, meaning that no deterministic algorithm is known
that solves an arbitrary problem instance in less than
exponential time (as a function of the input length). There is
great industrial demand for solving SAT, motivating the need for
algorithms which perform well. I present a comparison of two
approaches for solving SAT instances: DPLL (an exact algorithm from
classical computer science) and Survey Propagation (a probabilistic
algorithm from statistical physics). The two algorithms were compared
on randomly generated 3-SAT problems with varying clause to variable
ratios.
Cavendish Laboratory, University of Cambridge.
[
CRH,
BibTeX,
RIS ]
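The first of the two contenders, DPLL, can be sketched compactly: a depth-first search over truth assignments with unit propagation. (Survey Propagation, the message-passing scheme from statistical physics, is omitted here.) This is a bare-bones illustration, not the solver benchmarked in the report.

```python
def dpll(clauses, assignment=None):
    """clauses: list of lists of non-zero ints (negative = negated variable).
    Returns a satisfying assignment {var: bool} or None if unsatisfiable."""
    assignment = dict(assignment or {})
    changed = True
    while changed:                       # unit propagation to a fixed point
        changed = False
        simplified = []
        for clause in clauses:
            lits, satisfied = [], False
            for lit in clause:
                val = assignment.get(abs(lit))
                if val is None:
                    lits.append(lit)
                elif (lit > 0) == val:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not lits:
                return None              # empty clause: conflict
            if len(lits) == 1:
                assignment[abs(lits[0])] = lits[0] > 0   # forced assignment
                changed = True
            else:
                simplified.append(lits)
        clauses = simplified
    if not clauses:
        return assignment                # every clause satisfied
    var = abs(clauses[0][0])             # branch on an unassigned variable
    for value in (True, False):
        result = dpll(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(dpll([[1, 2], [-1, 3], [-2, -3]]))
```

Near the critical clause-to-variable ratio of random 3-SAT (around 4.27), such backtracking search slows down dramatically, which is precisely the regime where the comparison with Survey Propagation becomes interesting.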
Quantum computation on non-quantum computers
BA dissertation
This dissertation outlines the design of a statically typed
programming language for quantum computers,
and describes a working implementation of it
for classical computer systems.
Computer Laboratory, University of Cambridge.
[
CRH,
BibTeX,
RIS ]
Teaching
I have given tutorials, invited lectures, or example classes on various topics in
machine learning and information theory.
At the Machine Learning Summer School (MLSS) in South Africa in 2019, I gave two
two-hour lectures: one on Bayesian non-parametrics and one on data compression.
The data compression lecture is available on YouTube:
[Data Compression – Part 1]
[Data Compression – Part 2].
I supervised various undergraduate courses at the University of Cambridge,
including:
Probability Theory,
Machine Learning,
Functional Programming,
Type Theory,
Computation Theory,
Quantum Computation,
various programming languages (ML, Prolog, Java),
and Computer Science Foundations.
I also proposed and supervised several
dissertations,
on topics ranging from computation theory to machine learning.