## Random sample from a long linked list

November 5th, 2009Sorry for the long no-update period. I had quite a few posts that I wrote half way in my posting queues but haven’t got the time to finish them since I was very busy with work and research. Plus I’m currently learning basic German and that really sucks out a lot of my free time. (Though German is really, really fun; I really hope that one day, I can actually work—maybe temporarily, maybe permanently— in Germany.)

Anyway, a friend of mine is currently applying for jobs in several “big” companies (you know, the Google-Microsoft-Apple kind?). So he was preparing for his interview by hunting quite a few online questions. One of the questions he asked reminded me of a question I used to use to exemplify the kind of questions these companies used in their technical interview:

Given a head pointer to a (possibly very long) linked list of unknown length, devise a method to sample m members at random. Ensure that the method yields equal probability for each element in the linked list to be picked.

Obviously we can traverse the list once to find the length, generate m random numbers from 1 to length of the list and traverse the list another time to pick up these items. A common question would be to prevent you from going through the list more than once. Say the list is very long, the disk I/O you need to get the items from harddisk (even after making sure the items are arranged well that it minimizes cache misses) would make traversal a very expensive operation and hence should be minimized.

The solution I have is inspired by the fact that if the linked list contained m or less elements, I have no choice but to pick all m of them. Hence, the main idea of the solution would be to pick the first m elements and keeping track of k—the number of elements we have seen so far. When we encounter a new element, the new element has an m/k chance of being picked. If it is picked, we dropped 1 of the previously selected m elements at random. We can easily prove that this indeed yields the correct probability via mathematical induction.

Now of course it won’t be fun to implement this in Java or C++, so I decided to use Scheme to code this, just for the fun of it. The solution is surprisingly short and pretty. Here goes:

(define (sample-m lst m) (define (helper curr k m-vec) (if (empty? curr) (if (< k m) lst (vector->list m-vec)) (let ((pos (random k))) (cond ((<= k m) (vector-set! m-vec (- k 1) (car curr))) ((< pos m) (vector-set! m-vec pos (car curr)))) (helper (cdr curr) (+ k 1) m-vec)))) (helper lst 1 (make-vector m))) ;;; Let's test this! (sample-m (build-list 1000 values) 7)

EDIT: I decided to try writing a version for Java as well. Here is the Java code:

public class LinkedListSampler { private static Random rand = new Random(); public static <T> List<T> sample( LinkedList<T> list, int m) { ArrayList<T> samples = new ArrayList<T>(m); int k = 0; for (T element : list) { int pos = k++; if (pos < m) { samples.add(element); } else if ((pos = rand.nextInt(k)) < m) { samples.set(pos, element); } } return samples; } }

EDIT 2: Btw, seriously, the Java code is just for fun. Just imagine that LinkedList does not have a size method, will ya? ;)