## Archive for the ‘Computer Science’ Category

### Largest value sub-array [part 1]

Wednesday, November 11th, 2009

I was teaching a bunch of freshman the other day. One of the questions that came up was to determine the largest value sub-array given an array of integers. Obviously, the value is the array can be negative. There are two variants of the questions, one to find the actual maximum value when each of the members of the sub-array are added together, and the second variant is to actually find the start and ending location of the sub-array.

We will first consider the first variant to arrive at optimum answer.

A simple answer would be to perform brute force search in the array, considering each position as possible starting and ending point. Time complexity: O(n^2). Not very good eh? Can we improve on this? Yes, we can if we use dynamic programming. For many of us, the algorithm came up as obvious, but I realize that it is really not that simple for students new to programming to come up with a correct algorithm. To answer this, I tried to explain the concept behind the algorithm: for each element of the array, we find the maximum value sub-array that ends at that element. This value can either be the value of the element itself (meaning starting at the element and ending at the particular element) or the maximum value sub-array ending at the previous element added to the value of the element. (The latter only occurs if the maximum value sub-array at previous location is not negative.) There you go, we have a recurrence relation: given an array A, let v(k) be the maximum value of any sub-array ending at position k, where k ranging from 0 inclusive to N exclusive, N is the length of the array. Hence we have:

v(0) = A[0]
v(k) = max{v(k-1) + A[k], A[k]} if k > 0

The maximum value will be given by max{v_i} for i from 0 to N-1. To implement this in Java, we make several optimization. Firstly, we can opt for recursive method or dynamic programming. Both will take O(n) time but since we’re using Java, we would rather use dynamic programming since there is no tail call optimization in Java. Then, rather than finding the maximum value using max{v_i} at the end, we keep a running maximum throughout the running of the algorithm and return it. We also note that v(k-1) + A[k] will only be greater than A[k] iff v(k-1) is greater than 0. Here is how the code goes (this is a rather faithful implementation, it uses the same variable names as the recurrence relation above to make it easier to see):

```public static int calculateMaxValue(int[] A) {
int[] v = new int[A.length];
int max = v[0] = A[0];
for (int i = 1; i < A.length; ++i) {
v[i] = A[i];
if (v[i-1] > 0) {
v[i] += v[i-1];
}

if (v[i] > max) {
max = v[i];
}
}
return max;
}```

This algorithm has O(n) time complexity.

What else can we do to optimize this code? Well, we don’t really need to keep an array of maximum values. We notice that at any point of time, to calculate the value of v(k), we only need the value of A[k] and v(k-1). We just turn an O(n) space requirement to O(1).

```public static int calculateMaxValue(int[] A) {
int max = A[0], current = A[0];
for (int i = 1; i < A.length; ++i) {
if (current < 0) current = 0;
current += A[i];

if (current > max) max = current;
}
return max;
}```

Voila! We shall continue with the second variant next time. (:

### ACL-IJCNLP ‘09 Blog

Wednesday, May 13th, 2009

I have been invited to co-ordinate the blog for ACL-IJCNLP ‘09, a joint conference on natural language processing to be held in Singapore this August. We will be posting both general articles about NLP and computational linguistics for general public, and articles on things to do in Singapore and nearby regions. Stay tune to the new blog! It will officially be launched next Monday, but you can already access it here (sorry, about the design, we haven’t figured what design to use for the blog yet). Also check out the conference website.

### The teaching of computer science

Tuesday, April 7th, 2009

Kids & Computers

Today’s ACM TechNews includes an article on CMU trying to buck trends of having few women in Computer Science. I agree wholeheartedly.

However, before we even get there, computer science is seriously in need of polished publicity. I have talked with several undergraduates and high school students recently and encouraged them to take up computer science. In general, most of them were impressed when they heard that I majored in Computer Science. They thought that it is pretty cool as a major. However, when I asked them whether they are considering (especially the high school students) to major in computer science, the response I got was roughly this: no. “Why not?” They replied back that computer science involve lots of programming and they think that programming is hard, really hard. They also strongly equate computer science with programming and only programming. I do believe that programming is not easy, especially at the beginning, but it is not harder than, say, solving difficult physics questions, or designing chemistry experiment to confirm a hypothesis. (I was lucky to be introduced to programming, with LOGO and Basic—yes, Basic, unfortunately—in late elementary school; but most of these people have never touched any programming language before.) Few of them also think that programming is geeky, something that I would never refute; I think being geeky is cool.

Furthermore, even if programming were taught in schools before university, I have a strong feeling that they are being taught wrongly. Instead of being taught to appreciate the beauty of programming (recursion, abstraction, closure, memoization, etc.), teachers were more focused on getting the syntax right. That, in itself, is not too bad. However, the choice of language really do reveal the weakness of the method. Teaching C++ or Java as a first programming language is definitely not a good idea. These languages are very thick with syntax. Example to the point, I have seen loops being taught as a syntax; as for its concept, teachers keep reiterating that it is used to make a particular code loops over and over until the condition turns to become false. The explanation is correct. On the other hand, such explanation rarely extract the awe and understanding of the power of loop construct.

(Granted, at high school level, I am only familiar with Cambridge ‘A’ Level syllabus for Computing, which uses C/C++, I might have too narrow a viewpoint, or even outdated.)

So the first thing we should straighten out is to clean up the way programming should be taught to students. I would start with Javascript (hey, something that can be applied right away on their own websites is cool and fun!), Python, or Scheme. (Yes, you’re right, I think type system only make teaching programming more convoluted without offering much direct benefits.) Picking good teachers is no longer optional. The aim should be to make students appreciate computer science as a diverse, exciting, and fun field. It should definitely not be aimed to teach students to be proficient in one (or, worse, two) programming language. Probably introducing them at younger age would make it even more effective, as students get higher exposure to it and the curriculum gets more refined as it is harder to teach the younger students. (In fact, it is probably more rational to teach visual programming technique. About half a year ago, I watched a Google Tech Talk on tangible functional programming, where the concept of value, pair, and function/lambda is translated into intuitive visual system that allows even stronger form of composability.)

Certainly, putting some into good publicity will not hurt. Though I do think that we should put more effort into integrating computer science into early education (just like other sciences) to let the younger generation figures out how suitable is the major for themselves, before the stigma attached to CS becomes too deeply rooted in them.

### I need less entropy (spam filter)

Tuesday, December 23rd, 2008

So I have been getting a barrage of spam comments in the form of random letters with links. Fantastic. It relies on masking themselves as comments in another language. However, 90% of these spam comments are obviously meaningless.

Seriously, we need a powerful entropy-based spam filter for these. They are so easy to detect but computationally expensive unless you can develop a heuristics on what are considered high in entropy. A bunch of consonants with little vowels? No. Some languages have tonnes of consonants with little vowels. Long words? No. Some languages allow super long words (due to concatenation of words—ah, this problem is also what makes it hard for NLP guys to segment sentences in those languages).

Well, I guess I’m asking too much…

I think the best one would simply match against dictionary words. No not all dictionary words. It’s much easier if you have a substantial dictionary of stop words and basic grammar words. It will do well in most languages I know of. Say in english, how many sentences can you go without ‘is’ (and all other form of ‘to be’), ‘a/an/the’, ‘-ing’, or ‘hi/hello’? Not much. In French: ‘le’, l’, d’. In Chinese: 了，呢，不. In Japanese: の、わ、が? With a little more knowledge of what languages you’re expecting to receive, you can make such entropy-based filtering even better.

Machine learning? Probably not. Well, it might be possible but limited. Spam filter needs to be fast if it were to scale (say, to be used by GMail). What you can do though is to perform a training session that produces filtering rules that may be executed quickly against the message (a simple machine learning way would be to train against a corpus of spam and produce a dictionary lists of banned words). Oh well, but those would be out of my league. I don’t need too advanced a filter.

All right. Seems these spam comments have at least done me some good and let me think through some stuffs quite a bit. (And it makes me realize that some sentences means the same in Russian and Ukrainian.)

### Dynamic allocation: dynamic shadow copies

Thursday, November 27th, 2008

I was reading a book on advanced data structures over the past 2 days (and no, I have only managed to cover the first one and a half chapter). It has this neat idea bout shadow copies for dynamically allocated structures.

Let’s take an example of an array-based stack (based on vector or ArrayList, or even plain array). Normal dynamic allocation of a vector may be performed this way:

1. Sstarts with a dynamically allocated array of certain size (might be 0, 1, a predefined number, or number passed in a constructor) and keep the size in the array_size variable;
2. Keep in index variable that stores the position of the top of the element in the array;
3. When array is filled (index equals array_size), reallocate the array to twice the size: now reallocate may take one of the two forms, if there is enough space to extend the array, it will simply extend it, otherwise it will reallocate a new array and copied the entire original arrays

Now this is a pretty typical method of dynamic allocation. If you took a look at linebuffer.c from GNU readline, it uses exactly the same simple algorithms.

What is the drawback of the method described above? Of course, that very costly copying of the array when the array needs to be reallocated (instead of extended). So the book advised of using the technique called shadow copy. It basically a technique where you keep two arrays. One of the size array_size, and another of size m times array_size. For the purpose of this post, let’s assume that m = 2.

With m = 2, as you’re filling the smaller array, you also copies 2 elements from the smaller array to the larger array. As the small array fills up, it will just in time copied every elements from the smaller array to the larger array. Rinse and repeat.

The reason why the copying will catch up is that it will always start copying from index 0 when the smaller array is half-filled: if you think about it, if you copies one element from the first half of the array and write the element you’re writing to both the smaller and larger array, it will finish copying the first-half of the array by the time the second-half is filled, while at the same time, the second-half is already copied by virtue of writing new element to both array.

We return to this question again: what is the drawback of shadow copying? While the time taken for all operation will increase by at most a constant factor, the memory allocated to the array structure at any given time is always O((1+m)n) where n is array_size, a huge overhead. I kept asking myself, what should we do with this overhead? It’s huge, there must be something we can do about it. Moments later, I thought of one. I’m not claiming that this is the best use of such overhead, but surely it sounds interesting.

For ease of explanation, let’s start with an example. We start with just 1 array of size 16. We will fill the array halfway (8 elements) and attempt to extends the array to size 32 (note: not reallocate). If it succeeds, we do nothing and end up with array of size 32. Let’s continue from here, we’ll start filling the size 32 array with another 8 elements, reaching half capacity again. We then try to extend the array to size 64. Let’s assume it fails to extend to size 64. This is when we allocate a shadow copy array of size 64 and perform shadow copying as we fill the size 32 array.

What this meant is that under best circumstance you only need O(mn) space overhead, with worst case of O((1+m)n). While this may sound like a small difference, when we’re dealing with array of considerable size, such difference can account for 33% savings (for m = 2)! (And very slight savings in the amortized time of a given operation.)

### Caveats

The idea is interesting. You may argue that for large array where this method may be most beneficial, array extension may be virtually impossible. That is a really valid point with the current memory structure. However, this method does not induce much more complexity than shadow copying and it is certainly worthwhile to reclaim as much memory as possible. Furthermore, with evolving virtual memory model, there might be a time where we can allocate arrays over non-contiguous memory space (a virtual machine model), making such extension a way better option (and shadow copies worthless).

Now, as a reminder, it’s worthy to point out that even in the naive reallocation (the first method mentioned above), the worst case allocation is still O((1+m)n) when the reallocation occurs. With certain optimization, shadow copies is no worse than naive reallocation in terms of avoiding out-of-memory problem. One such optimization would be to not allocate a shadow copy array when it is impossible (out of memory) and simply reallocate when the array is full (failover mechanism).

Shadow copying will pose a problem for other data structures/processes, which may run out of memory due to the presence of shadow copies. This will be a problem in a system where memory allocation may vary wildly.