This page looks best with JavaScript enabled

Sampling

 ·   ·  โ˜• 2 min read  ·  ๐Ÿฆ‚ Kyle · ๐Ÿ‘€... views
    ๐Ÿท๏ธ

How to select k out of N objects.

Reservoir sampling

source code

Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items.

  • The size of the population n is not known and is typically very large. (scan from left to right, without looking back)
  • At any point, the current state is a simple random sample without replacement of size k over the part of the population seen so far.

Simple algorithm (Algorithm R)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
/**
 * A simple and popular but slow algorithm, commonly known as Algorithm R
 * time complexity O(n)
 */
void reservoir_sample(vector<int> &S, vector<int> &R, int k) {
    R.resize(k);
    copy(S.begin(), S.begin()+k, R.begin());
    for (int i = k, n = S.size(); i < n; ++i) {
        int j = rand_int(0, i);
        if (j < k) R[j] = S[i]; // choose ith with probability (k/(i+1))
                                // keep ith with
                                // probability not being swapped by remmaing elements [i+1...n-1]
                                // (k/(i+1)) * (1-1/(i+2))*...*(1-1/n) = k/n
    }
}

An optimal algorithm (Algorithm L)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
/**
 * Algorithm L improves upon this algorithm by
 * computing how many items are discarded before
 * the next item enters the reservoir
 * time complexity O(k(1+log(n/k)))
 */
void reservoir_sample(vector<int> &S, vector<int> &R, int k) {
    R.resize(k);
    copy(S.begin(), S.begin()+k, R.begin());
    double W = exp(log(rand_real(0, 1))/k);
    for (int i = k, n = S.size(); i < n;) {
        i += floor(log(rand_real(0, 1))/log(1-W)) + 1;
        if (i < n) {
            R[rand_int(0, k-1)] = S[i];
            W *= exp(log(rand_real(0, 1))/k);
        }
    }
}

Problems for practice

Refs

Share on