## Random sample from a long linked list

Sorry for the long no-update period. I had quite a few posts that I wrote half way in my posting queues but haven’t got the time to finish them since I was very busy with work and research. Plus I’m currently learning basic German and that really sucks out a lot of my free time. (Though German is really, really fun; I really hope that one day, I can actually work—maybe temporarily, maybe permanently— in Germany.)

Anyway, a friend of mine is currently applying for jobs in several “big” companies (you know, the Google-Microsoft-Apple kind?). So he was preparing for his interview by hunting quite a few online questions. One of the questions he asked reminded me of a question I used to use to exemplify the kind of questions these companies used in their technical interview:

Given a head pointer to a (possibly very long) linked list of unknown length, devise a method to sample m members at random. Ensure that the method yields equal probability for each element in the linked list to be picked.

Obviously we can traverse the list once to find the length, generate m random numbers from 1 to length of the list and traverse the list another time to pick up these items. A common question would be to prevent you from going through the list more than once. Say the list is very long, the disk I/O you need to get the items from harddisk (even after making sure the items are arranged well that it minimizes cache misses) would make traversal a very expensive operation and hence should be minimized.

The solution I have is inspired by the fact that if the linked list contained m or less elements, I have no choice but to pick all m of them. Hence, the main idea of the solution would be to pick the first m elements and keeping track of k—the number of elements we have seen so far. When we encounter a new element, the new element has an m/k chance of being picked. If it is picked, we dropped 1 of the previously selected m elements at random. We can easily prove that this indeed yields the correct probability via mathematical induction.

Now of course it won’t be fun to implement this in Java or C++, so I decided to use Scheme to code this, just for the fun of it. The solution is surprisingly short and pretty. Here goes:

(define (sample-m lst m) (define (helper curr k m-vec) (if (empty? curr) (if (< k m) lst (vector->list m-vec)) (let ((pos (random k))) (cond ((<= k m) (vector-set! m-vec (- k 1) (car curr))) ((< pos m) (vector-set! m-vec pos (car curr)))) (helper (cdr curr) (+ k 1) m-vec)))) (helper lst 1 (make-vector m))) ;;; Let's test this! (sample-m (build-list 1000 values) 7)

EDIT: I decided to try writing a version for Java as well. Here is the Java code:

public class LinkedListSampler { private static Random rand = new Random(); public static <T> List<T> sample( LinkedList<T> list, int m) { ArrayList<T> samples = new ArrayList<T>(m); int k = 0; for (T element : list) { int pos = k++; if (pos < m) { samples.add(element); } else if ((pos = rand.nextInt(k)) < m) { samples.set(pos, element); } } return samples; } }

EDIT 2: Btw, seriously, the Java code is just for fun. Just imagine that LinkedList does not have a size method, will ya? ;)

Tags: probability interview

November 5th, 2009 at 4:24 pm

Nice indeed! I attempted to prove it and realize it’s relatively easy as you said.

Like the Java implementation, very neat.

November 5th, 2009 at 4:40 pm

Thanks! Yes, the proof by induction is very straightforward, involving just a teensy bits of probability. (:

I didn’t think the Java implementation is that nice though, it’s just normal. Actually, I’m pretty sure someone can produce an even nicer code than this.

November 7th, 2009 at 3:28 pm

LOL. That must means I’m a very bad programmer ):

November 10th, 2009 at 7:57 am

No, that’s not true. (: I’m sure you can come up with this one too. Just take your time while coding, it’ll help. I’ll write one version that works. Unsatisfied, I’ll try rewriting it to be simpler. Usually you’ll learn a new trick or two whenever you do that. The next time you write a new code, the code will already be simpler.

November 11th, 2009 at 12:13 pm

lol! Yes, good assumption there with the size method ;)

December 14th, 2010 at 4:00 am

Hey,

can you pls give the proof of induction for the probability part?

Btw, your code is REALLY impressive

December 16th, 2010 at 8:23 pm

Can sb pls give the proof for this! why is the probablity of every element getting picked the same