Table of Contents
- Notation 11
- 1 Introduction 15
- 1.1 Codes and communication 15
- 1.2 Data compression 16
- 1.3 Outline of this thesis 18
- 2 Classical compression algorithms 21
- 2.1 Symbol codes 22
- 2.1.1 Huffman coding 23
- 2.2 Integer codes 24
- 2.2.1 Unary code 25
- 2.2.2 Elias gamma code 26
- 2.2.3 Exponential Golomb codes 27
- 2.2.4 Other integer codes 28
- 2.3 Dictionary coding 29
- 2.3.1 LZW 29
- 2.3.2 History and significance 30
- 2.3.3 Properties 32
- 2.4 Transformations 32
- 2.4.1 Move-to-front encoding 32
- 2.4.2 Block compression 33
- 2.5 Summary 34
- 3 Arithmetic coding 37
- 3.1 Introduction 37
- 3.1.1 Information content and information entropy 37
- 3.1.2 The relationship of codes and distributions 38
- 3.1.3 Algorithms for building optimal compression codes 39
- 3.2 How an arithmetic coder operates 40
- 3.2.1 Mapping a sequence of symbols to nested regions 40
- 3.2.2 Mapping a region to a sequence of symbols 42
- 3.2.3 Encoding a region as a sequence of binary digits 42
- 3.3 Concrete implementation of arithmetic coding 43
- 3.3.1 Historical notes 49
- 3.3.2 Other generic compression algorithms 49
- 3.4 Arithmetic coding for various distributions 50
- 3.4.1 Bernoulli code 50
- 3.4.2 Discrete uniform code 51
- 3.4.3 Finite discrete distributions 51
- 3.4.4 Binomial code 52
- 3.4.5 Beta-binomial code 54
- 3.4.6 Multinomial code 55
- 3.4.7 Dirichlet-multinomial code 56
- 3.4.8 Codes for infinite discrete distributions 57
- 3.5 Combining distributions 58
- 3.5.1 Sequential coding 58
- 3.5.2 Exclusion coding 59
- 3.5.3 Mixture models 60
- 4 Adaptive compression 61
- 4.1 Learning a distribution 61
- 4.1.1 Introduction 61
- 4.1.2 Motivation 62
- 4.2 Histogram methods 62
- 4.2.1 A simple exchangeable method for adaptive compression 62
- 4.2.2 Dirichlet processes 65
- 4.2.3 Power-law variants 65
- 4.2.4 Non-exchangeable histogram learners 66
- 4.3 Other adaptive models 67
- 4.3.1 A Pólya tree symbol compressor 67
- 4.4 Online versus header-payload compression 71
- 4.4.1 Online compression 71
- 4.4.2 Header-payload compression 73
- 4.4.3 Which method is better? 75
- 4.5 Summary 76
- 5 Compressing structured objects 79
- 5.1 Introduction 79
- 5.2 Permutations 82
- 5.2.1 Complete permutations 82
- 5.2.2 Truncated permutations 84
- 5.2.3 Set permutations via elimination sampling 84
- 5.3 Combinations 85
- 5.4 Compositions 86
- 5.4.1 Strictly positive compositions with unbounded components 87
- 5.4.2 Compositions with fixed number of components 87
- 5.4.3 Uniform K-compositions 88
- 5.4.4 Multinomial K-compositions 88
- 5.5 Multisets 89
- 5.5.1 Multisets via repeated draws from a known distribution 90
- 5.5.2 Multisets drawn from a Poisson process 91
- 5.5.3 Multisets from unknown discrete distributions 92
- 5.5.4 Submultisets 93
- 5.5.5 Multisets from a Blackwell–MacQueen urn scheme 93
- 5.6 Ordered partitions 95
- 5.7 Sequences 96
- 5.7.1 Sequences of known distribution, known length 97
- 5.7.2 Sequences of known distribution, uncertain length 98
- 5.7.3 Sequences of uncertain distribution 99
- 5.7.4 Sequences with known ordering 99
- 5.8 Sets 100
- 5.8.1 Uniform sets 100
- 5.8.2 Bernoulli sets 100
- 5.9 Summary 100
- 6 Context-sensitive sequence compression 103
- 6.1 Introduction to context models 104
- 6.1.1 Pure bigram and trigram models 104
- 6.1.2 Hierarchical context models 106
- 6.2 General construction 107
- 6.2.1 Engines 108
- 6.2.2 Probabilistic models 108
- 6.2.3 Learning 110
- 6.3 The PPM algorithm 111
- 6.3.1 Basic operation 111
- 6.3.2 Probability estimation 113
- 6.3.3 Learning mechanism 115
- 6.3.4 PPM escape mechanisms 115
- 6.3.5 Other variants of PPM 116
- 6.4 Optimising PPM 117
- 6.4.1 The effects of context depth 117
- 6.4.2 Generalisation of the PPM escape mechanism 117
- 6.4.3 The probabilistic model of PPM, stated concisely 122
- 6.4.4 Why the escape mechanism is convenient 122
- 6.5 Blending 123
- 6.5.1 Optimising the parameters of a blending PPM 124
- 6.5.2 Backing-off versus blending 124
- 6.6 Unbounded depth 128
- 6.6.1 History 128
- 6.6.2 The Sequence Memoizer 132
- 6.6.3 Hierarchical Pitman–Yor processes 132
- 6.6.4 Deplump 133
- 6.6.5 What makes Deplump compress better than PPM? 134
- 6.6.6 Optimising depth-dependent parameters 136
- 6.6.7 Applications of PPM-like algorithms 139
- 6.7 Conclusions 139
- 7 Multisets of sequences 143
- 7.1 Introduction 143
- 7.2 Collections of fixed-length binary sequences 144
- 7.2.1 Tree representation for multisets of fixed-length strings 145
- 7.2.2 Fixed-depth multiset tree compression algorithm 146
- 7.3 Collections of binary sequences of arbitrary length 148
- 7.3.1 Compressing multisets of self-delimiting sequences 148
- 7.3.2 Encoding string termination via end-of-sequence markers 151
- 7.4 Conclusions 153
- 8 Cost-sensitive compression and adversarial sequences 155
- 8.1 Cost-sensitive compression 156
- 8.1.1 Deriving the optimal target distribution 156
- 8.1.2 Optimising Morse code 157
- 8.2 Compression and sampling 161
- 8.3 Worst case compression and adversarial samples 161
- 8.3.1 Adversarial sequences 163
- 8.3.2 Some results 166
- 8.3.3 Discussion 179
- Conclusions 181
- A Compression results 185
- Bibliography 207
- Keyword index 224