How to cite this



Probabilistic Data Structures and Algorithms

Christian Steinruecken, Alexandre K. W. Navarro








How to cite this presentation

@misc{steinruecken2013c,
   title = {Probabilistic Data Structures and Algorithms},
   author = {Christian Steinruecken and Alexandre K. W. Navarro},
   year = {2014},
   month = jan,
   url = {http://www.inference.phy.cam.ac.uk/tcs27/talks/probdata.html},
   howpublished = {Slide presentation. CBL Lab, Engineering Department, University of Cambridge},
}

Why use randomness?


Question:
Why use randomness in a deterministic algorithm?

Why would randomness help to answer a deterministic question?
What does randomness even mean?

Why build machines?


Humans construct machines to make life more predictable:
machines are mainly designed to guarantee a certain behaviour.
(Also: to do things humans can't.)

The most general machine


The most general machine is a computer, capable of simulating any other machine (except for hardware).

Quantum computers may be slightly more general, offering fundamentally different runtime behaviour for certain problems.

Computers aren't perfect


A computer's probability of error isn't zero. Normand (1996)

But computer scientists design software components to zero-error specifications.

Zero-error design targets can be costly.

The probability of error


Computers and storage components are engineered to have a low probability of error. (But it's non-zero.) Normand (1996)

The CS illusion: we design software components assuming that the probability of error is zero.

In some cases, this illusion comes at a cost.
Jaynes (1994)

From logic to probability


Probability theory generalises logic.

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Contents

Key themes of this talk


Randomness

                Non-chaotic      Chaotic
Predictable     non-random       pseudo-random
Unpredictable   weakly random    strongly random
A message is random to a recipient if the recipient cannot predict it.

A sequence of random numbers has several distinct properties: These properties are subjective.

Predictability may depend on e.g. a shared secret.
Chaos is the degree to which the sequence is statistically indistinguishable from true randomness.
"objectively random"

Unpredictability

Some sources of unpredictability:
These sources are weakly random.
Use randomness extractors or compression algorithms to get strongly random (unpredictable + chaotic) numbers.
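One classical extractor is the pairing trick of von Neumann (1951). The sketch below assumes the weak source produces independent bits with a fixed, unknown bias; real sources are often correlated and need stronger extractors. The function name and the simulated biased coin are illustrative, not part of the talk.

```python
import random

def von_neumann_extract(bits):
    """Turn independent, identically biased bits into unbiased bits.

    Bits are read in non-overlapping pairs: 01 emits 0, 10 emits 1,
    and the pairs 00 and 11 are discarded.
    """
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)
    return out

# Hypothetical weak source: a coin that shows 1 with probability 0.8.
rng = random.Random(0)
biased = [1 if rng.random() < 0.8 else 0 for _ in range(100_000)]
unbiased = von_neumann_extract(biased)
```

The price of removing the bias is throughput: most pairs from a heavily biased source are discarded.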
"subjectively random"

Chaotic behaviour

Some sources of deterministic chaos:
Properties of interest: invertibility, stochastic properties of the output, resource costs, dependence on a shared secret

Deterministic random number generators (PRNGs)




Uses of randomness

Hash functions


Properties of a good hash function:

Hash functions: there are lots


There are many well-known hash functions.

The use of hash functions is ubiquitous in computer science.
NIST (2012)

Hash functions: example


Hash function demo for string inputs.
Computationally impractical to reverse / invert.


(This demo uses the SHA-256 secure hashing algorithm.)
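The behaviour of the demo can be reproduced offline with Python's standard `hashlib` module. This is a minimal sketch; the helper name `sha256_hex` and the two sample strings are illustrative.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = sha256_hex("hello")
b = sha256_hex("hellp")   # one character changed
# Count how many of the 64 hex digits differ between the two digests.
differing = sum(x != y for x, y in zip(a, b))
```

A one-character change in the input alters most digits of the output, which is the chaotic behaviour a good hash function is expected to show.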

Cryptographic hash functions: Quiz


Q1: Does applying a hash function twice make it more secure? \(h(h(x))\)
Q2: We could create a new hash function by combining several different hash functions (e.g. by XOR).
Is that a good idea?

Demos

Hash maps / hash tables


Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Value                     

General idea:
Why hash? Because look-ups take \(O(1)\) expected time.
Expected problems: hash collisions.
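A minimal sketch of the idea, assuming collisions are resolved by chaining (each array slot holds a small list of key/value pairs). The class name `ChainedHashMap` is hypothetical, and Python's built-in `hash` stands in for the hash function.

```python
class ChainedHashMap:
    """Fixed-size hash table; colliding keys share a bucket (a Python list)."""

    def __init__(self, size=20):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        # The hash of the key decides the array position.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (possibly a collision)

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default
```

Look-ups stay fast as long as the buckets stay short, which is why the number of slots is normally grown with the number of stored keys.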

Arrays and hashes


Generalising this idea,
what other structures can be built from arrays with hash-mapped positions?

Whang et al. (1990)

Linear Counter

Estimates the cardinality (number of unique elements) of a set \(X\).

Operation:
Whang et al. (1990)

Linear Counter: Example


\(x\) \(h(x)\)
Zoubin 4
Carl 2
Zoubin 4
Richard 3
Máté 4
Zoubin 4
Carl 2
Bit array
Position   Value   Items mapped
1          0
2          1       Carl
3          1       Richard
4          1       Zoubin, Máté
5          0

Duplicates are removed...
... but collisions add errors.

In the example, cardinality\((X) \approx 3\) (the true value is 4: Zoubin and Máté collide at position 4).
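The example can be replayed in a few lines. The sketch below takes the hash positions from the example table as given, counts the set bits, and also evaluates the collision-corrected estimator \(-m \ln V\) of Whang et al. (1990), where \(V\) is the fraction of bits still zero. The function name `linear_count` is illustrative.

```python
import math

def linear_count(hashed_positions, m):
    """Linear counting over a bit array with positions 1..m.

    Returns (bits_set, estimate), where the estimate is
    -m * ln(fraction of zero bits), as in Whang et al. (1990).
    """
    bits = [0] * (m + 1)          # index 0 unused, to match the example
    for pos in hashed_positions:
        bits[pos] = 1             # duplicates hit the same bit again
    bits_set = sum(bits)
    zero_fraction = (m - bits_set) / m
    estimate = -m * math.log(zero_fraction) if zero_fraction > 0 else float("inf")
    return bits_set, estimate

# h(x) for: Zoubin, Carl, Zoubin, Richard, Máté, Zoubin, Carl
bits_set, estimate = linear_count([4, 2, 4, 3, 4, 4, 2], m=5)
```

With only five bits the array is heavily loaded, so either number is rough; the estimator is meant for arrays much larger than this toy example.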
Durand & Flajolet (2003)

Loglog Counter


item   hash   lead-0s
hash( ? ) = ?
Durand & Flajolet (2003)

Loglog Counter

In a string of random bits:
any bit \( \sim \text{Bernoulli}(\theta=\textstyle \frac12)\)
# of 0s before first 1 \( \sim \text{Geometric}(\theta=\textstyle \frac12)\)
Estimates the cardinality (number of unique elements) of a set \(X\).

Key idea:

Repeat the above with several different hash functions to improve the estimate of \(|X|\).
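A rough sketch of the estimator, with one substitution stated plainly: instead of several different hash functions, it uses the stochastic averaging of Durand & Flajolet (2003), where the first bits of a single hash select one of \(m\) registers and each register keeps the largest rank seen. SHA-256 as the hash, the register count, and the function name are assumptions made for illustration.

```python
import hashlib

def loglog_estimate(items, bucket_bits=8):
    """LogLog cardinality estimate using 2**bucket_bits registers."""
    m = 1 << bucket_bits
    width = 64 - bucket_bits
    registers = [0] * m
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64 hash bits
        bucket = h >> width                        # first bits pick a register
        rest = h & ((1 << width) - 1)              # remaining bits
        # rank = number of leading zeros in `rest`, plus one
        rank = width - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.39701   # asymptotic bias correction (Durand & Flajolet, 2003)
    return alpha * m * 2 ** (sum(registers) / m)
```

Because each register only stores a maximum, feeding the same item again never changes the estimate, and the memory used is independent of the stream length.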
Bloom (1970)

Bloom filter

Interactive demo: a bit vector, a set \(X\) (initially empty, so P(false positive) = 0%), an insertion box and a query box. Each item \(x\) is mapped to two positions, \(h_1(x)\) and \(h_2(x)\).
Script adapted from: http://billmill.org/bloomfilter-tutorial/
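The demo's behaviour can be sketched as follows. The two hash functions are simulated by salting SHA-256 with an index, which is an assumption of this sketch rather than what the demo script does; the class and method names are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive the k positions h_1(x), ..., h_k(x) from salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # "Maybe present" only if every one of the k bits is set.
        return all(self.bits[pos] for pos in self._positions(item))

    def false_positive_rate(self):
        """Chance that an absent item hits only set bits, assuming
        its k positions are independent and uniform."""
        fill = sum(self.bits) / self.num_bits
        return fill ** self.num_hashes
```

A query can return a false "yes" but never a false "no", since inserting an item sets exactly the bits that a later query for it checks.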
Bloom (1970)

Bloom filter: Summary

Count-min sketch

min( ? ) = ?
Cormode & Muthukrishnan (2005)

Count-min sketch
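A minimal sketch of the structure described by Cormode & Muthukrishnan (2005): a two-dimensional array of counters with one hash function per row. Salted SHA-256 stands in for the row hash functions, and the default width and depth are arbitrary illustrative values.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=100, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One (row, column) cell per row, chosen by that row's hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum over
        # the rows is the least contaminated value.
        return min(self.table[row][col] for row, col in self._cells(item))
```

The estimate never undercounts; it overcounts only when an item collides with other items in every row at once.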

Pagh & Rodler (2001)

Cuckoo hashing
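A compact sketch of the insertion scheme of Pagh & Rodler (2001): every key has one candidate slot in each of two tables, and an insertion evicts whatever occupies its slot, pushing the evicted key to its slot in the other table. The table size, the kick limit and the salted SHA-256 hashes are illustrative assumptions, and the rehash step that a full implementation performs on failure is left out.

```python
import hashlib

class CuckooSet:
    def __init__(self, size=101, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _pos(self, which, key):
        digest = hashlib.sha256(f"{which}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def __contains__(self, key):
        # A look-up inspects exactly two slots.
        return any(self.tables[t][self._pos(t, key)] == key for t in (0, 1))

    def add(self, key):
        """Insert key (must not be None). Returns False if the kick limit
        is reached; the key left without a slot is then dropped, where a
        full implementation would rehash with new hash functions."""
        if key in self:
            return True
        which = 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, key)
            # Place the key and pick up the previous occupant, if any.
            key, self.tables[which][pos] = self.tables[which][pos], key
            if key is None:
                return True
            which = 1 - which        # evicted key tries the other table
        return False
```

Look-ups cost two probes in the worst case; the randomness of the hash functions is what keeps eviction chains short on average.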

Monte Carlo tree search game playing

Figure: game tree with a wins/visits count at each node: 26/49 (root), 19/29, 7/20, 0/8, 8/15, 11/14, 4/7, 6/7, 3/8, 4/12.

Monte Carlo tree search game playing

Figure: the same tree (26/49, 19/29, 7/20, 0/8, 8/15, 11/14, 4/7, 6/7, 3/8, 4/12) with a new leaf added, labelled 0/0.

Monte Carlo tree search game playing

Figure: the same tree (26/49, 19/29, 7/20, 0/8, 8/15, 11/14, 4/7, 6/7, 3/8, 4/12); the new leaf now reads 1/1.

Monte Carlo tree search game playing

Figure: the tree after updating the counts: 27/50 (root), 20/30, 7/20, 0/8, 8/15, 12/15, 4/7, 7/8, 1/1, 3/8, 4/12.
Browne et al. (2012)

Monte Carlo tree search: Summary
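The count updates shown in the tree figures can be sketched as a backpropagation step. The path used below (26/49, 19/29, 11/14, 6/7, new leaf) is inferred from which counts change between the figures, and crediting the win to every node on the path mirrors those figures; many implementations instead alternate the credit between the two players. The `Node` class is illustrative.

```python
class Node:
    def __init__(self, wins=0, visits=0, parent=None):
        self.wins, self.visits, self.parent = wins, visits, parent

def backpropagate(leaf, won):
    """Add one visit (and one win, if the playout was won)
    to the leaf and to each of its ancestors."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.wins += 1 if won else 0
        node = node.parent

# Path read off the figures: 26/49 -> 19/29 -> 11/14 -> 6/7 -> new leaf 0/0
root = Node(26, 49)
a = Node(19, 29, root)
b = Node(11, 14, a)
c = Node(6, 7, b)
leaf = Node(0, 0, c)
backpropagate(leaf, won=True)   # one random playout from the leaf, won
```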

Digital Fountain codes: Encoding

Figure: encoding. Each encoded bit is the XOR of a subset of the original message bits.

Digital Fountain codes: Decoding

Figure: decoding in steps. The message starts mostly unknown (1 ? ? ? ? ?) and further bits are recovered from the encoded bits (1 ? 1 ? ? ?).
Luby (2002) MacKay (2005)

Digital Fountains: Summary

Luby (2002) MacKay (2005)

Digital Fountains: Operation

Degree distributions
\[\text{IdealSoliton}\left( d \mid K \right) = \begin{cases} \frac{1}{K} & d = 1 \\ \frac{1}{d(d-1)} & d = 2, \dots, K \end{cases} \]
\[\text{RobustSoliton}\left( d \mid K,R \right) = \begin{cases} \frac{R}{dK} & d = 1, \dots,\frac{K}{R} - 1 \\[0.5em] \frac{R \cdot \ln \left( R / \delta \right)}{K} & d = \frac{K}{R} \\[0.5em] 0 & d > \frac{K}{R} \end{cases} \] with \(R = c \cdot \ln(K/\delta)\sqrt{K}\) and \(c > 0\).
The quantity \(\delta\) is the allowable failure probability.
In Luby (2002), the robust soliton terms above are added to the ideal soliton probabilities and the sum is normalised to give the degree distribution used by the encoder.
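The two distributions can be tabulated directly. The sketch below follows Luby (2002) and MacKay (2005) in adding the robust terms to the ideal soliton probabilities and dividing by the total; rounding \(K/R\) to the nearest integer and the default values of `c` and `delta` are assumptions made for illustration.

```python
import math

def ideal_soliton(K):
    """rho[d] for d = 0..K (rho[0] is unused and stays 0)."""
    rho = [0.0] * (K + 1)
    rho[1] = 1.0 / K
    for d in range(2, K + 1):
        rho[d] = 1.0 / (d * (d - 1))
    return rho

def robust_soliton(K, c=0.1, delta=0.5):
    """Normalised degree distribution mu[d] for d = 0..K."""
    R = c * math.log(K / delta) * math.sqrt(K)
    spike = int(round(K / R))          # K/R rounded to an integer degree
    tau = [0.0] * (K + 1)
    for d in range(1, min(spike, K + 1)):
        tau[d] = R / (d * K)
    if 1 <= spike <= K:
        tau[spike] = R * math.log(R / delta) / K
    rho = ideal_soliton(K)
    Z = sum(rho) + sum(tau)            # normalising constant
    return [(r + t) / Z for r, t in zip(rho, tau)]
```

An encoder would draw the degree of each packet from this list, for example with `random.choices(range(K + 1), weights=robust_soliton(K))`.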

Random tree classifier

Breiman (2001)

Random forests

Karp & Rabin (1987)

Rabin–Karp pattern matching
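A sketch of the rolling-hash search. Karp & Rabin (1987) draw the modulus at random to bound the collision probability; this version fixes the modulus for brevity and confirms every hash match by direct comparison, so it never reports a false match. The parameter defaults are illustrative.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return the start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)       # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus
    matches = []
    for start in range(n - m + 1):
        # Equal hashes may be a collision, so confirm by comparison.
        if p_hash == t_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start + m < n:
            # Slide the window: drop text[start], append text[start + m].
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % modulus
    return matches
```

Each window hash is derived from the previous one in constant time, which is what makes the expected running time linear in the text length.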

Rabin (1980)

Miller–Rabin primality test: Summary

See also Kleinberg (2010) for an entry-level description.
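A sketch of the test in its usual textbook form. The trial division by a few small primes and the default of 20 rounds are conveniences of this sketch, not part of Rabin (1980); each round with a random base lets a composite number pass with probability at most 1/4.

```python
import random

def is_probable_prime(n, rounds=20, rng=random):
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 = 2**s * d with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)        # random base
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False   # a is a witness: n is certainly composite
    return True            # no witness found: n is probably prime
```

The answer "composite" is always correct; only the answer "prime" carries a small, controllable probability of error.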
Gallager (1963)

Low-density parity-check codes

Min-cut

Figure: a graph on the vertices a, b, c, d, e, drawn in seven successive panels.
Karger (1993)

Randomized Min-Cut
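A sketch of the contraction algorithm of Karger (1993) for a connected, undirected multigraph given as an edge list. Contracting edges in a uniformly random order and skipping edges whose endpoints are already merged is equivalent to repeatedly picking a random remaining edge. The trial count and seed are illustrative defaults.

```python
import random

def contract_once(num_vertices, edges, rng):
    """One run of random contraction; returns the size of the cut found."""
    parent = list(range(num_vertices))

    def find(v):
        # Representative of the super-vertex containing v.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    remaining = num_vertices
    order = edges[:]
    rng.shuffle(order)
    for u, v in order:
        if remaining == 2:
            break
        ru, rv = find(u), find(v)
        if ru != rv:                 # skip edges inside a super-vertex
            parent[ru] = rv          # contract the edge
            remaining -= 1
    # Edges between the two remaining super-vertices form the cut.
    return sum(1 for u, v in edges if find(u) != find(v))

def karger_min_cut(num_vertices, edges, trials=200, seed=0):
    rng = random.Random(seed)
    return min(contract_once(num_vertices, edges, rng) for _ in range(trials))
```

A single run finds a minimum cut with probability at least \(2/(n(n-1))\) on \(n\) vertices, so the run is repeated and the smallest cut is kept.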

Sedgewick (1978) Hoare (1962)

Randomized Quicksort
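A short sketch of quicksort with a uniformly random pivot. Unlike the in-place versions of Hoare (1962) and Sedgewick (1978), it builds new lists, trading memory for brevity.

```python
import random

def randomized_quicksort(items, rng=random):
    """Return a sorted copy of items, choosing each pivot at random."""
    if len(items) <= 1:
        return list(items)
    pivot = rng.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return (randomized_quicksort(smaller, rng) + equal
            + randomized_quicksort(larger, rng))
```

Because the pivot is random, no fixed input can force the quadratic worst case; the expected number of comparisons is \(O(n \log n)\) for every input.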

Heaps

Figure: two trees holding the keys 1, 2, 3, 5, 6, 8, 10, with nodes listed as 10 8 5 2 3 6 1 and 10 5 8 2 1 3 6.

Randomized Search Trees and Heaps (Treaps)

Figure: the same two trees (10 8 5 2 3 6 1 and 10 5 8 2 1 3 6), with a second set of node labels: 58 27 76 99 62 43 18.
Seidel & Aragon (1996)

Randomized Search Trees and Heaps (Treaps)

Pugh (1990)

Skip lists

Figure: a skip list with four levels (Level 1 to Level 4) holding the keys 10, 25, 30, 85, 96, starting from a head node at -∞.
Pugh (1990)

Skip lists: Summary
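A compact sketch of a skip list in the spirit of Pugh (1990). Each new node gets a random height by repeated coin flips; the maximum of 16 levels, the promotion probability of 1/2 and the class names are illustrative choices.

```python
import random

class _Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level    # forward[i] = next node on level i

class SkipList:
    MAX_LEVEL = 16

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.head = _Node(None, self.MAX_LEVEL)
        self.level = 1                   # number of levels currently in use

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def _predecessors(self, key):
        """For each level, the last node whose key is smaller than key."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def insert(self, key):
        update = self._predecessors(key)
        level = self._random_level()
        self.level = max(self.level, level)
        new = _Node(key, level)
        for i in range(level):           # splice into each of its levels
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def __contains__(self, key):
        candidate = self._predecessors(key)[0].forward[0]
        return candidate is not None and candidate.key == key

    def __iter__(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]
```

Searches start on the sparsest level in use and drop down a level whenever the next key would overshoot, giving expected logarithmic search time without any rebalancing.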

Metwally et al. (2005)

Stream Summary

Figure: a stream (7 5 2 8 1 3 5 ...) and the table of monitored elements, with next stream item 7:
Label         5    8    3    1    2
Occurrences   81   26   42   63   92
Max Error     0    0    0    0    0
Metwally et al. (2005)

Stream Summary

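Figure: processing the stream item 7. The label with the fewest occurrences (8, with 26 occurrences and max error 0) is detached; the new label 7 is attached and the estimates are updated (+1), giving 27 occurrences and max error 26. The other monitored elements (5: 81, 3: 42, 1: 63, 2: 92) are unchanged.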
Metwally et al. (2005)

Stream Summary
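The update step illustrated in the figures can be sketched with a plain dictionary. Metwally et al. (2005) keep the counters in a linked structure so that the least frequent label is found in constant time; this sketch scans for it instead. The function name and the dictionary layout are illustrative.

```python
def space_saving_update(counters, item, capacity):
    """Process one stream item.

    counters maps label -> [occurrences, max_error] and holds
    at most `capacity` labels.
    """
    if item in counters:
        counters[item][0] += 1
    elif len(counters) < capacity:
        counters[item] = [1, 0]
    else:
        # Detach the label with the fewest occurrences ...
        victim = min(counters, key=lambda label: counters[label][0])
        least = counters.pop(victim)[0]
        # ... and attach the new label, inheriting that count as its error.
        counters[item] = [least + 1, least]

# State read off the figure, followed by the arrival of item 7.
counters = {5: [81, 0], 8: [26, 0], 3: [42, 0], 1: [63, 0], 2: [92, 0]}
space_saving_update(counters, 7, capacity=5)
```

The max error records how much of a label's count may have been inherited from the element it replaced, so the true count lies between occurrences minus max error and occurrences.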

Recap: use of randomness


Use chaotic behaviour of systems to build efficient components.


Many of these data structures and algorithms underlie much of the software infrastructure supporting our daily lives.

Arrays with hash-mapped positions


Structure          Array type   Hash functions   Value type
Hashtable          linear       1                pointer to linked list
Linear counter     linear       1                single bit
Bloom filter       linear       \(k\)            single bit
Count-min sketch   2-dim.       \(k\)            integer count
yours :-)          ?            ?                ?

Recommended reading

Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Mitzenmacher & Upfal, 2005.
Entry-level textbook (non-free, few references).
Probabilistic Data Structures for Web Analytics and Data Mining. Blog post. Ilya Katsov (2012)
Nice motivation and presentation of various probabilistic data structures, including Bloom filter, count-min sketch, loglog counter, stream summary and others.
Other presentations:
Random forests: one tool for all your problems (Neil Houlsby & Novi Quadrianto, 2013)
Fountain Codes (Gauri Joshi et al, 2010)

References

Edwin Thompson Jaynes. Probability Theory: The Logic of Science, 2003. Book, ed. G. Larry Bretthorst. Cambridge University Press. ISBN 978-0-521-59271-0.
Eugene Normand. Single event upset at ground level, 1996-12. In IEEE Transactions on Nuclear Science, Vol. 43, No. 6, pp. 2742-2750. IEEE. ISSN 0018-9499.
Anand Rajaraman; Jure Leskovec; Jeffrey D. Ullman. Mining of massive datasets. Book, Version 1.3.
Michael Mitzenmacher; Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005. Book. Cambridge University Press. ISBN 978-0-521-83540-4.
Lenore Blum; Manuel Blum; Michael Shub. A Simple Unpredictable Pseudo-Random Number Generator, 1986-05. In SIAM Journal on Computing, Vol. 15, No. 2, pp. 364-383. SIAM. ISSN 0097-5397.
Makoto Matsumoto; Takuji Nishimura. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator, 1998-01. In ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, pp. 3-30. ACM. ISSN 1049-3301.
John von Neumann. Various Techniques used in Connection with Random Digits, 1951. In National Bureau of Standards, Applied Math Series, Vol. 11, pp. 36-38. ISSN 1049-4685.
Yadolah Dodge. A Natural Random Number Generator, 1996-12. In International Statistical Review / Revue Internationale de Statistique, Vol. 64, No. 3, pp. 329-344. International Statistical Institute (ISI).
NIST. Secure hash standard, 2012-03-06. Federal Information Processing Standards (FIPS), Publication 180-4. For the earliest version, see NIST (1995).
NIST. Secure hash standard, 1995-04-17. Federal Information Processing Standards (FIPS), Publication 180-1.
Geoff Pike; Jyrki Alakuijala. CityHash: fast hash functions for strings, 2012. Presentation Slides.
Burton Howard Bloom. Space/time trade-offs in hash coding with allowable errors, 1970-07. In Communications of the ACM, Vol. 13, pp. 422-426. ACM Press, New York, NY, USA. ISSN 0001-0782.
Rasmus Pagh; Flemming Friche Rodler. Cuckoo hashing, 2001. In Algorithms – ESA 2001, Lecture Notes in Computer Science, Vol. 2161, ed. Friedhelm Meyer auf der Heide. Springer Berlin Heidelberg. ISSN 0302-9743. ISBN 978-3-540-42493-2.
Ruslan Salakhutdinov; Geoffrey Hinton. Semantic hashing, 2009. In International Journal of Approximate Reasoning, Vol. 50, No. 7, pp. 969-978. Elsevier. ISSN 0888-613X.
Ilya Katsov. Probabilistic Data Structures for Web Analytics and Data Mining, 2012-05-01. Blog post in Highly Scalable Blog.
Graham Cormode; S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications, 2005-02-04. In Journal of Algorithms, Vol. 55, No. 1, pp. 58-75. Elsevier. ISSN 0196-6774.
Kyu-Yong Whang; Brad T. Vander-Zanden; Howard M. Taylor. A Linear-Time Probabilistic Counting Algorithm for Database Applications, 1990-06. In ACM Transactions on Database Systems, Vol. 15, pp. 208-229. ACM Press, New York, NY, USA.
M. Durand; P. Flajolet. Loglog Counting of Large Cardinalities, 2003. In Lecture Notes in Computer Science, Vol. 2832, pp. 605-617. Springer.
Cameron Browne; Edward Powley; Daniel Whitehouse; Simon Lucas; Peter I. Cowling; Phillip Rohlfshagen; Stephen Tavener; Diego Perez; Spyridon Samothrakis; Simon Colton. A Survey of Monte Carlo Tree Search Methods, 2012-03. In IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 1, pp. 1-49. IEEE.
William Pugh. Skip Lists: A Probabilistic Alternative to Balanced Trees, 1990-06. In Communications of the ACM, Vol. 33, No. 6, pp. 668-676. ACM, New York, NY, USA. ISSN 0001-0782.
David J. C. MacKay. Information Theory, Inference, and Learning Algorithms, 2003. Book. Cambridge University Press. ISBN 978-0-521-64298-9.
David J. C. MacKay. Digital Fountain Codes, 2003. Chapter 50 in Information Theory, Inference, and Learning Algorithms, pp. 588-596. Cambridge University Press. ISBN 978-0-521-64298-9.
Leo Breiman. Random Forests, 2001-10. In Machine Learning, Vol. 45, No. 1, pp. 5-32. Kluwer Academic Publishers, Amsterdam, Netherlands. ISSN 0885-6125.
Richard M. Karp; Michael O. Rabin. Efficient randomized pattern-matching algorithms, 1987-03. In IBM Journal of Research and Development, Vol. 31, No. 2, pp. 249-260. IBM. ISSN 0018-8646.
Bobby Kleinberg. The Miller-Rabin Randomized Primality test, 2010-05-05. Lecture Notes. Cornell University, USA.
Shafi Goldwasser; Silvio Micali; Charles Rackoff. The knowledge complexity of interactive proof systems, 1989-02. In SIAM Journal on Computing, Vol. 18, No. 1, pp. 186-208. SIAM, USA.
Robert G. Gallager. Low Density Parity Check Codes, 1963-01. In Transactions of the IRE Professional Group on Information Theory, Vol. IT-8, pp. 21-28. USA.
Charles Anthony Richard Hoare. Quicksort, 1962. In The Computer Journal, Vol. 5, No. 1, pp. 10-16, ed. Eric N. Mutch. Oxford University Press. ISSN 0010-4620.
Robert Sedgewick. Implementing Quicksort Programs, 1978-10. In Communications of the ACM, Vol. 21, No. 10, pp. 847-857. ACM Press, New York, NY, USA. ISSN 0001-0782.
Michael Luby. LT codes, 2002. In 43rd Annual IEEE Symposium on Foundations of Computer Science, pp. 271-280. ISSN 0272-5428. ISBN 978-0-7695-1822-0.
David J. C. MacKay. Fountain codes, 2005. In IEE Proceedings - Communications, Vol. 152, No. 6, pp. 1062-1068. ISSN 1350-2425. See also: MacKay (2003, ch50).
Michael O. Rabin. Probabilistic algorithm for testing primality, 1980. In Journal of Number Theory, Vol. 12, No. 1, pp. 128-138. Elsevier. ISSN 0022-314X.
Donald Ervin Knuth; James H Morris, Jr; Vaughan R Pratt. Fast pattern matching in strings, 1977. In SIAM Journal on Computing, Vol. 6, No. 2, pp. 323-350. SIAM.
Robert S. Boyer; J. Strother Moore. A fast string searching algorithm, 1977. In Communications of the ACM, Vol. 20, No. 10, pp. 762-772. ACM Press, New York, NY, USA. ISSN 0001-0782.
Neil Houlsby; Novi Quadrianto. Random Forests: one tool for all your problems, 2013-07-04. Slide Presentation, Machine Learning RCC.
Raimund Seidel; Cecilia R. Aragon. Randomized Search Trees, 1996. In Algorithmica, Vol. 16, pp. 464-497. USA.
Ahmed Metwally; Divyakant Agrawal; Amr El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2005. In Database Theory - ICDT 2005, Lecture Notes in Computer Science, Vol. 3363, pp. 398-412, ed. Thomas Eiter; Leonid Libkin. Springer Berlin Heidelberg. ISSN 0302-9743. ISBN 978-3-540-24288-8.
David Karger. Global Min-cuts in RNC and Other Ramifications of a Simple Min-cut Algorithm, 1993. In Proceedings of the 4th Annual ACM/SIGACT-SIAM Symposium on Discrete Algorithms, pp. 21-30. SIAM. Austin, Texas, USA. ISBN 0-89871-313-7.
Gauri Joshi; Joong Bum Rhim; John Sun; Da Wang. Fountain codes, 2010-12. Slide Presentation, MIT.