Archive for the ‘Scheme’ Category

Random sample from a long linked list

Thursday, November 5th, 2009

Sorry for the long no-update period. I had quite a few posts that I wrote half way in my posting queues but haven’t got the time to finish them since I was very busy with work and research. Plus I’m currently learning basic German and that really sucks out a lot of my free time. (Though German is really, really fun; I really hope that one day, I can actually work—maybe temporarily, maybe permanently— in Germany.)

Anyway, a friend of mine is currently applying for jobs in several “big” companies (you know, the Google-Microsoft-Apple kind?). So he was preparing for his interview by hunting quite a few online questions. One of the questions he asked reminded me of a question I used to use to exemplify the kind of questions these companies used in their technical interview:

Given a head pointer to a (possibly very long) linked list of unknown length, devise a method to sample m members at random. Ensure that the method yields equal probability for each element in the linked list to be picked.

Obviously we can traverse the list once to find the length, generate m random numbers from 1 to length of the list and traverse the list another time to pick up these items. A common question would be to prevent you from going through the list more than once. Say the list is very long, the disk I/O you need to get the items from harddisk (even after making sure the items are arranged well that it minimizes cache misses) would make traversal a very expensive operation and hence should be minimized.

The solution I have is inspired by the fact that if the linked list contained m or less elements, I have no choice but to pick all m of them. Hence, the main idea of the solution would be to pick the first m elements and keeping track of k—the number of elements we have seen so far. When we encounter a new element, the new element has an m/k chance of being picked. If it is picked, we dropped 1 of the previously selected m elements at random. We can easily prove that this indeed yields the correct probability via mathematical induction.

Now of course it won’t be fun to implement this in Java or C++, so I decided to use Scheme to code this, just for the fun of it. The solution is surprisingly short and pretty. Here goes:

(define (sample-m lst m)
  (define (helper curr k m-vec)
    (if (empty? curr)
        (if (< k m)
            lst
            (vector->list m-vec))
        (let ((pos (random k)))
          (cond ((<= k m)
                 (vector-set! m-vec
                              (- k 1)
                              (car curr)))
                ((< pos m)
                 (vector-set! m-vec
                              pos
                              (car curr))))
          (helper (cdr curr) (+ k 1) m-vec))))
  (helper lst 1 (make-vector m)))

;;; Let's test this!
(sample-m (build-list 1000 values) 7)

EDIT: I decided to try writing a version for Java as well. Here is the Java code:

public class LinkedListSampler {
  private static Random rand = new Random();
  public static <T> List<T> sample(
      LinkedList<T> list, int m) {
    ArrayList<T> samples = new ArrayList<T>(m);
    int k = 0;
    for (T element : list) {
      int pos = k++;
      if (pos < m) {
        samples.add(element);
      } else if ((pos = rand.nextInt(k)) < m) {
        samples.set(pos, element);
      }
    }
    return samples;
  }
}

EDIT 2: Btw, seriously, the Java code is just for fun. Just imagine that LinkedList does not have a size method, will ya? ;)

Optimize for the most common use case

Sunday, November 2nd, 2008

This is the best advice I’ve ever stick myself with when writing codes. Before delving deep into this, let’s take a look of one of my Scheme peeve (I was reminded of this when reading eric’s post in LispCast): the unavailability of length=.

Let’s write a simple Scheme code:

(define (two-len-list list)
  (= 2 (length list)))

What’s so wrong about this code? Well, a hint, a Scheme list is basically a linked-list. Got it? Yes! length will recurse through the entire list. The order of growth is O(n), where n is the length of the list for the above example. As suggested by eric in LispCast (his focus in the article above is that LISP should automatically optimized the above code to look like the code below, but that’s another matter), this should be easily optimized to:

(define (two-len-list list)
  (length= 2 list))

Apparently, that neat procedure is defined in LISP. In Scheme? AFAICT (Scheme is just a small hobby, I might have missed the procedure that does this), it’s not defined. Well, no matter, we can always write it ourselves.

(define (length= n lis)
  (cond ((= n 0) (empty? lis))
        ((empty? lis) false)
        (else (length= (- n 1) (cdr lis)))))

Neat! And it will only recurse n times at most. Used in two-len-list, it will make the procedure O(1). (Another thing that a lot of new Schemers do, including me during my first day learning about list, is to write this: (= 0 (length lis)), where it could be written as (empty? lis)).

Now, we come at the best part: optimize for the most common use case. Let’s use the same example, shall we? What if you know that in your code, the length of the list matters! Let’s say that in 70% of the call, the length of the list is required. Then make sure you keep the length of the list at all time, instead of recomputing it every time you need it. It used to be hard to do this when list was mutable, but with the advent of R6RS, list is now immutable, this becomes relatively straightforward. You can even abstract it into a new list-based data-structure.

(define (make-sclist lis)
  (list 'sclist (length lis) lis))
(define (is-sclist? sclist)
  (eq? 'sclist (car sclist)))
(define (sclist-length sclist)
  (if (not (is-sclist? sclist))
      (error "Not a self-counting list")
      (cadr sclist)))
(define (sclist-car sclist)
  (if (not (is-sclist? sclist))
      (error "Not a self-counting list")
      (car (get-plainlist sclist))))
(define (sclist-cdr sclist)
  (if (not (is-sclist? sclist))
      (error "Not a self-counting list")
      (list 'sclist
            (- (sclist-length sclist) 1)
            (cdr (get-plainlist sclist)))))
(define (get-plainlist sclist)
  (if (not (is-sclist? sclist))
      (error "Not a self-counting list")
      (caddr sclist)))
(define (sclist-cons new-member sclist)
  (if (not (is-sclist? sclist))
      (error "Not a self-counting list")
      (list 'sclist
            (+ 1 (sclist-length sclist))
            (cons new-member
                  (get-plainlist sclist)))))

When writing the above code, I realize how much I have coded in object-oriented paradigms. I kept wanting to write the above code in a message-passing style and encapsulated the length and the list in a let (similar to a private environment). At the end, I went for tag-based approach to increase the readability. You’ll also see the increase level of abstraction in the code.

Optimize for the most common use case is a very powerful advice. It allows you to focus your thoughts and willpower at that algorithms that will get accessed the most number of time, thus whatever small improvements you make there, it will be multiplied by the number of calls!

This does not only apply to programming. Take a look at database too. If you’re writing a blog, what’s the most common use case? Getting the last n posts, getting a posts between this date and that date, getting posts tagged with some tags, and searching. So optimize for them! Index the posts by date and tags. Try pre-scoring each posts to keywords (possibly, a keyword can be any word that is being mentioned in the post) and store them to speed up searching, at the expensive of higher storage.

On writing a webapps, make sure the most commonly accessed pages are the fastest. What is the point of optimizing those 10 pages that got accessed 3% of the time while another 5 pages were accessed 97% of the time (there is a point, but only after you optimize the 5 pages)? (Sometime it’s not easy to determine which pages will get accessed the most, but once you launched the webapps, you’d better be tracking the usage and allocate resources to those highly accessed pages.)

There is one place where it’s a little contradictory to follow this advice. Say in a webapps, you have the above scenario where 5 pages are being accessed 97% of the time. You would want links to these 5 pages in obvious places since users are very likely to want to go to these pages. At the same time, you may also want to publicize this really cool feature on the 6th page. So you also want to place a link to this page in obvious place. Well, all right, it’s not that contradictory, but in this case, you should optimize your screen real estate for both. If you opt for sidebar approach, it’s not too hard, a top navigation may be slightly more mischievous to deal with. Good luck!