Random sample from a long linked list

November 5th, 2009

Sorry for the long no-update period. I had quite a few posts that I wrote half way in my posting queues but haven’t got the time to finish them since I was very busy with work and research. Plus I’m currently learning basic German and that really sucks out a lot of my free time. (Though German is really, really fun; I really hope that one day, I can actually work—maybe temporarily, maybe permanently— in Germany.)

Anyway, a friend of mine is currently applying for jobs in several “big” companies (you know, the Google-Microsoft-Apple kind?). So he was preparing for his interview by hunting quite a few online questions. One of the questions he asked reminded me of a question I used to use to exemplify the kind of questions these companies used in their technical interview:

Given a head pointer to a (possibly very long) linked list of unknown length, devise a method to sample m members at random. Ensure that the method yields equal probability for each element in the linked list to be picked.

Obviously we can traverse the list once to find the length, generate m random numbers from 1 to length of the list and traverse the list another time to pick up these items. A common question would be to prevent you from going through the list more than once. Say the list is very long, the disk I/O you need to get the items from harddisk (even after making sure the items are arranged well that it minimizes cache misses) would make traversal a very expensive operation and hence should be minimized.

The solution I have is inspired by the fact that if the linked list contained m or less elements, I have no choice but to pick all m of them. Hence, the main idea of the solution would be to pick the first m elements and keeping track of k—the number of elements we have seen so far. When we encounter a new element, the new element has an m/k chance of being picked. If it is picked, we dropped 1 of the previously selected m elements at random. We can easily prove that this indeed yields the correct probability via mathematical induction.

Now of course it won’t be fun to implement this in Java or C++, so I decided to use Scheme to code this, just for the fun of it. The solution is surprisingly short and pretty. Here goes:

(define (sample-m lst m)
  (define (helper curr k m-vec)
    (if (empty? curr)
        (if (< k m)
            lst
            (vector->list m-vec))
        (let ((pos (random k)))
          (cond ((<= k m)
                 (vector-set! m-vec
                              (- k 1)
                              (car curr)))
                ((< pos m)
                 (vector-set! m-vec
                              pos
                              (car curr))))
          (helper (cdr curr) (+ k 1) m-vec))))
  (helper lst 1 (make-vector m)))

;;; Let's test this!
(sample-m (build-list 1000 values) 7)

EDIT: I decided to try writing a version for Java as well. Here is the Java code:

public class LinkedListSampler {
  private static Random rand = new Random();
  public static <T> List<T> sample(
      LinkedList<T> list, int m) {
    ArrayList<T> samples = new ArrayList<T>(m);
    int k = 0;
    for (T element : list) {
      int pos = k++;
      if (pos < m) {
        samples.add(element);
      } else if ((pos = rand.nextInt(k)) < m) {
        samples.set(pos, element);
      }
    }
    return samples;
  }
}

EDIT 2: Btw, seriously, the Java code is just for fun. Just imagine that LinkedList does not have a size method, will ya? ;)

Applying for J1 Waiver

October 1st, 2009

I am starting my J1 foreign residency requirement waiver application in the next few days. Anyone with such experience before? I’ve heard that it is a long and painful wait. Sigh.

ACL-IJCNLP ‘09 Blog

May 13th, 2009

I have been invited to co-ordinate the blog for ACL-IJCNLP ‘09, a joint conference on natural language processing to be held in Singapore this August. We will be posting both general articles about NLP and computational linguistics for general public, and articles on things to do in Singapore and nearby regions. Stay tune to the new blog! It will officially be launched next Monday, but you can already access it here (sorry, about the design, we haven’t figured what design to use for the blog yet). Also check out the conference website.

Chrome Shorts

April 30th, 2009

OH MY GOD! The Chrome Shorts are really cute! It’s as if I’m turning into a small kids awestruck by some of the videos! Good timing! Have been pretty stressed up getting things right in my mind (while getting ready for exams).

Anyway, this one is the cutest! Shit! The cat and the video is just so cute!

(Watch the other 10 videos from YouTube’s Chrome channel.

Introduction to Probability (book)

April 29th, 2009
Introduction to Probability, 2e

I have finally decided to get this book. Have been struggling quite a bit with some more complex probability for mathematical and systems modeling and for my research. So I thought getting more rigorous grounding would help. Seems that a lot of people are recommending this book, so I thought I’ll give it a try. Will update after I completed at least several chapters of the book. (Well, the book itself won’t arrive for another week or so.)

Closure in Java

April 20th, 2009

I believe Closure should be part of any modern language, so I was delighted to read about Closure for Java (currently a JSR draft I believe) from here. Currently, Closure is not implemented in Java and is impossible to implement in its natural form. We can attempt to implement closures using anonymous objects, but that comes with two main issue: (1) it can not access non-final variables defined in its enclosing code; (2) you have to define closure interface for every possible scenario. We really can’t do anything with the first problem; but, as we see later, we can simplify the syntax considerably. Here is an example:

public interface ClosureIntReturnInt {
  public int invoke(int arg1);
}

@Test
public void testNaiveClosure() {
  final int a = 1; // This must be final!
  ClosureIntReturnInt addA =
      new ClosureIntReturnInt() {
    public int invoke(int v) {
      return v + a;
    }
  };

  assertEquals(3, addA.invoke(2));
}

This is obviously not a very good solution. So my first instinct would be to genericize the closure interface. Sure enough, it simplifies the problem by quite a bit. Here is my genericize Closure interface:

public interface Closure0Arg<R> {
  public R invoke();
}
public interface Closure1Arg<R, A1> {
  public R invoke(A1 arg1);
}
public interface Closure2Args<R, A1, A2> {
  public R invoke(A1 arg1, A2 arg2);
}
.
.
.

Sure enough, this works fine and it becomes much easier to use. It is of course still inconvenient for people who came from functional programming world. The problem is: (1) there is no universal Closure, that means you have to keep track of the Closure type as type annotation, many, including me would love to ditch the cumbersome type annotation; (2) you have to write the interface for each number of arguments (much better than writing for each argument type of course, but still…).

Of course, OTOH, this is the best you can do while maintaining static checking ability. Here is an example usage:

@Test
public void testClosure1Arg() {
  Closure1Arg intToString =
      new Closure1Arg<String, Integer>() {
    public String invoke(Integer v) {
      return v.toString();
    }
  };

  assertEquals("5", intToString.invoke(5));
}

A sharp reader would have noticed several (not-very-correct?) things happening here. First, we eliminate the generic from intToString type annotation. This does not change the semantics of the code and make the code looks cleaner; however, it does create an “unchecked” warning with the Java compiler. Secondly, we utilize autoboxing freely in this implementation. Since we are utilizing generics, R, A1, …, An must all be objects. This is obviously undesirable and ugly. However, due to autoboxing, this is no longer an issue (except performance-wise).

So far, this is really the best we can do while preserving static type checking. However, if we ignore additional security provided by static type checking, we can go a step further: we can introduce a Closure base class; the rest of the closures interface (Closure1Arg, etc.) will become an abstract class that extends the Closure class. This is where all the interesting stuffs are happening. Here is the source code for the Closure class.

public class Closure<R> {
  @SuppressWarnings("unchecked")
  public R invoke(Object... args) {
    Class[] classes = new Class[args.length];
    // This is okay since generic type annotation
    // is stripped after compilation. We only
    // need to find invoke method with
    // the right number of params.
    Arrays.fill(classes, Object.class);

    try {
      // Search for correct invoke method
      // and execute it.
      Method method =
          this.getClass().getMethod(
              "invoke", classes);
      return (R) method.invoke(this, args);
    } catch (Exception e) {
      // TODO: handle exceptions correctly.
      // ClosureException is a runtime exception.
      throw new ClosureException("Error", e);
    }
  }
}

// e.g. of ClosureNArg abstract class.
public abstract class Closure1Arg<R, A1>
    extends Closure<R> {
  public abstract R invoke(A1 arg);
}

This code completely sacrifices static checking. Now we can simply have each closures to be stored in variables of Closure type. When invoke is called, the method from the base class (above) is called. It will find the correct (i.e. more specialized) invoke method (declared in the concrete anonymous object derived from the subclass) to call based on the type of the arguments to invoke. Here is how we can use it:

@Test
public void testClosure1() {
  Closure<Integer> addOne =
      new Closure1Arg<Integer, Integer>() {
    public Integer invoke(Integer arg) {
      return arg + 1;
    }
  };
  assertEquals(3, (int) addOne.invoke(2));
}

@Test
public void testClosure2() {
  Closure<Double> add =
      new Closure2Args<Double, Double, Double>() {
    public Double invoke(Double a, Double b) {
      return a + b;
    }
  };
  assertEquals(7.5, add.invoke(4.2, 3.3));
}

Sweet! Now this is neat! Of course, we have ignored much of the static checking and replace them with dynamic, runtime checking. I have removed much of the exception checking code in Closure<R> base class to save space and focus on the meat. Again, I’d draw readers’ attention to several important code fragments:

  1. We catch most of the checked exception and replaced them with unchecked runtime exception to make the usage much cleaner, this follows more closely to dynamic, interpreted language rather than static language (as with Java);
  2. We readily suppress warning with @SuppressWarnings (heh!);
  3. The code, arguably, takes a bit of performance hit, so users who are more concerned about performance should probably not use this (squeezing performance usually means messier code; closure is for elegance, not mess);
  4. You may want to write simple script to autogenerate ClosureNArgs (N from 0 to however many arguments you think is reasonable; I personally generate up to 7 arguments);
  5. You probably want to generate closures for void return value as well; that would make things messier (much messier) though. It is probably easier to return null instead.
  6. Here, I opt to go for clean ClosureNArgs subclasses. Others may opt to add some ugliness to the subclasses and make the base class cleaner (and likely result in faster closure code). I will likely experiment with this further, and try to implement variable binding as well.

I found it very interesting to implement Closure in Java; hope you do too. :)

EDIT: This is interesting, wonder how I missed this earlier: Closures for the Java Programming Language (v0.5).

EDIT 2: I simplified Closure&ltR> a little bit more. The simplification is a result of stripping of the generic type annotation after compilation. So the base class need only hunt for invoke method with the right number of parameters.