<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Shattered Terminal &#187; probability interview</title>
	<atom:link href="http://shatteredterminal.com/tag/probability-interview/feed/" rel="self" type="application/rss+xml" />
	<link>http://shatteredterminal.com</link>
	<description>i don't have a tagline yet</description>
	<lastBuildDate>Sun, 11 Apr 2010 18:13:22 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Random sample from a long linked list</title>
		<link>http://shatteredterminal.com/2009/11/random-sample-from-a-long-linked-list/</link>
		<comments>http://shatteredterminal.com/2009/11/random-sample-from-a-long-linked-list/#comments</comments>
		<pubDate>Thu, 05 Nov 2009 04:08:35 +0000</pubDate>
		<dc:creator>shards</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Scheme]]></category>
		<category><![CDATA[probability interview]]></category>

		<guid isPermaLink="false">http://shatteredterminal.com/?p=192</guid>
		<description><![CDATA[Sorry for the long gap between updates. I have quite a few half-written posts sitting in my queue but haven&#8217;t had the time to finish them, since I have been very busy with work and research. Plus I&#8217;m currently learning basic German, and that really eats up a lot of my free time. [...]]]></description>
			<content:encoded><![CDATA[<p>Sorry for the long gap between updates. I have quite a few half-written posts sitting in my queue but haven&#8217;t had the time to finish them, since I have been very busy with work and research. Plus I&#8217;m currently learning basic German, and that really eats up a lot of my free time. (Though German is really, really fun; I really hope that one day I can actually work in Germany, maybe temporarily, maybe permanently.)</p>
<p>Anyway, a friend of mine is currently applying for jobs at several &#8220;big&#8221; companies (you know, the Google-Microsoft-Apple kind?), so he has been preparing for his interviews by hunting down practice questions online. One of the questions he asked me about reminded me of one I used to use as an example of what these companies ask in their technical interviews:</p>
<blockquote><p>Given a head pointer to a (possibly very long) linked list of unknown length, devise a method to sample m members at random. Ensure that the method yields equal probability for each element in the linked list to be picked.</p></blockquote>
<p>Obviously we can traverse the list once to find its length, generate m distinct random numbers from 1 to the length of the list, and traverse the list a second time to pick up those items. A common follow-up is to forbid going through the list more than once. Say the list is very long: the disk I/O needed to fetch the items from the hard disk (even if the items are laid out to minimize cache misses) makes traversal a very expensive operation, so it should be minimized.</p>
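<p>For reference, here is a rough sketch of that two-pass baseline in Java. The <code>Node</code> class and the <code>TwoPassSampler</code> name are made up for illustration. Instead of drawing m distinct indices up front, the second pass uses selection sampling (keep each element with probability &#8220;slots still needed&#8221; over &#8220;elements still unseen&#8221;), which also gives every element an equal chance:</p>
<pre name="code" class="java">import java.util.Random;

public class TwoPassSampler {
  /** Hypothetical singly linked node; the question only assumes a head pointer. */
  static final class Node {
    final int value;
    Node next;
    Node(int value, Node next) { this.value = value; this.next = next; }
  }

  /** Returns m values picked uniformly, or every value if the list holds m or fewer. Assumes m >= 0. */
  static int[] sample(Node head, int m, Random rand) {
    // Pass 1: count the elements.
    int n = 0;
    for (Node cur = head; cur != null; cur = cur.next) {
      n++;
    }
    int size = Math.min(m, n);
    int[] out = new int[size];
    // Pass 2: selection sampling. Keep each element with probability
    // (slots still needed) / (elements still unseen).
    int needed = size;
    int remaining = n;
    for (Node cur = head; needed > 0; cur = cur.next) {
      if (rand.nextInt(remaining) < needed) {
        out[size - needed] = cur.value;
        needed--;
      }
      remaining--;
    }
    return out;
  }
}</pre>
<p>The first loop always walks the entire list and the second one may have to as well, which is exactly the cost the follow-up question wants to avoid.</p>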
<p>The solution I have is inspired by the fact that if the linked list contains m or fewer elements, I have no choice but to pick all of them. Hence, the main idea is to pick the first m elements and keep track of k, the number of elements we have seen so far. When we encounter a new element (counting it in k), it has an m/k chance of being picked. If it is picked, it replaces one of the m previously selected elements, chosen at random. We can easily prove by mathematical induction that this yields the correct probability: after k elements, each of them is in the sample with probability m/k.</p>
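<p>The induction step can be sketched as follows, in LaTeX notation. P_k(i) is just shorthand introduced here (it is not part of the original question) for the probability that the i-th element is in the sample after k elements have been seen, with k at least m:</p>
<pre>% Base case: the first m elements are all kept.
\[ P_m(i) = 1 = \frac{m}{m} \qquad (1 \le i \le m) \]

% Step, new element: element k+1 is picked with probability m/(k+1).
\[ P_{k+1}(k+1) = \frac{m}{k+1} \]

% Step, earlier element: it was in the sample with probability m/k and is
% evicted only if the new element is picked AND lands on its slot (1 of m).
\[ P_{k+1}(i) = \frac{m}{k}\left(1 - \frac{m}{k+1}\cdot\frac{1}{m}\right)
             = \frac{m}{k}\cdot\frac{k}{k+1}
             = \frac{m}{k+1} \qquad (1 \le i \le k) \]</pre>
<p>So if every element has probability m/k after k elements, every element has probability m/(k+1) after one more.</p>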
<p>Now of course it won&#8217;t be fun to implement this in Java or C++, so I decided to use Scheme to code this, just for the fun of it. The solution is surprisingly short and pretty. Here goes:</p>
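<pre name="code" class="scheme">;; Pick m elements of lst uniformly at random in a single pass.
(define (sample-m lst m)
  ;; curr:  the unvisited rest of the list
  ;; k:     1-based position of (car curr)
  ;; m-vec: the current sample
  (define (helper curr k m-vec)
    (if (empty? curr)
        ;; k is now (length + 1); with m or fewer elements, return them all.
        (if (<= k m)
            lst
            (vector->list m-vec))
        (let ((pos (random k)))            ; uniform in [0, k), so pos < m with probability m/k
          (cond ((<= k m)                  ; the first m elements are always kept
                 (vector-set! m-vec
                              (- k 1)
                              (car curr)))
                ((< pos m)                 ; replace a uniformly chosen earlier pick
                 (vector-set! m-vec
                              pos
                              (car curr))))
          (helper (cdr curr) (+ k 1) m-vec))))
  (helper lst 1 (make-vector m)))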

;;; Let's test this!
(sample-m (build-list 1000 values) 7)</pre>
<p>EDIT: I decided to try writing a version for Java as well. Here is the Java code:</p>
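<pre name="code" class="java">import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

public class LinkedListSampler {
  private static final Random rand = new Random();

  public static &lt;T&gt; List&lt;T&gt; sample(
      LinkedList&lt;T&gt; list, int m) {
    ArrayList&lt;T&gt; samples = new ArrayList&lt;T&gt;(m);
    int k = 0;  // number of elements seen so far
    for (T element : list) {
      int pos = k++;
      if (pos &lt; m) {
        // The first m elements are always kept.
        samples.add(element);
      } else if ((pos = rand.nextInt(k)) &lt; m) {
        // Happens with probability m/k; overwrite a random earlier pick.
        samples.set(pos, element);
      }
    }
    return samples;
  }
}</pre>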
<p>EDIT 2: Btw, seriously, the Java code is just for fun. Just imagine that LinkedList does not have a size method, will ya? ;)</p>
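<p>A quick empirical sanity check is also possible. The sketch below (the <code>ReservoirCheck</code> class and its method names are made up for illustration) reruns the same one-pass scheme on the integers 0 to n-1 many times and prints how often each value ends up in the sample. With n = 10 and m = 3, every frequency should land close to 0.3:</p>
<pre name="code" class="java">import java.util.Random;

public class ReservoirCheck {
  /** Runs the one-pass scheme over the values 0..n-1 and returns the kept values. */
  static int[] sample(int n, int m, Random rand) {
    int[] kept = new int[Math.min(m, n)];
    for (int k = 0; k < n; k++) {
      if (k < m) {
        kept[k] = k;                     // the first m values are always kept
      } else {
        int pos = rand.nextInt(k + 1);   // uniform over the k + 1 values seen so far
        if (pos < m) {
          kept[pos] = k;                 // overwrite a uniformly chosen earlier pick
        }
      }
    }
    return kept;
  }

  /** Counts how often each value is sampled over the given number of trials. */
  static int[] frequencies(int n, int m, int trials, Random rand) {
    int[] counts = new int[n];
    for (int t = 0; t < trials; t++) {
      for (int value : sample(n, m, rand)) {
        counts[value]++;
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    int n = 10, m = 3, trials = 100000;
    int[] counts = frequencies(n, m, trials, new Random());
    // Each value should show up in roughly m/n = 30% of the trials.
    for (int value = 0; value < n; value++) {
      System.out.printf("%d: %.3f%n", value, counts[value] / (double) trials);
    }
  }
}</pre>
<p>This is only a smoke test, not a proof, but a biased implementation (say, one that calls <code>nextInt(k)</code> where <code>k</code> does not yet count the current element) tends to show up as visibly uneven frequencies.</p>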
]]></content:encoded>
			<wfw:commentRss>http://shatteredterminal.com/2009/11/random-sample-from-a-long-linked-list/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>
