Reservoir sampling

19 Aug 2008

You want to obtain a random sample of n elements evenly distributed from an input stream of unknown length.

my $n = shift @ARGV;
my $i = 0;
my @reservoir;

while (<>) {
  $i++;
  push @reservoir, [$i, $_];
  last unless $i < $n;
}

while (<>) {
  $i++;
  $reservoir[int(rand($n))] = [$i, $_] if int(rand($i)) < $n;
}

@reservoir = sort { $a->[0] <=> $b->[0] } @reservoir;

for my $x (@reservoir) { print $x->[1]; }

Discussion and proof of correctness: Reservoir Sampling.

This recipe is available as a script on github: reservoir.