Swamped…

January 28th, 2009

Hello there… I’ve been pretty swamped recently (hence the lack of postings). I moved in early January and am still acclimatizing to the hot and humid climate after the cold pre-winter season. I sweated constantly for the first few days after the move, but I’m getting used to the weather. In fact, I’m starting to think it’s great. I’ve been sweating often, and that’s a good sign for my obese body.

Enough ranting. I have been working on quite a few interesting things. I’m also taking several advanced courses that should shore up my weak math, though, on the other hand, I’ve been pummeled by their difficulty as well. I hope I’ll survive.

So expect to hear more from me on the things I’m working on now: web app performance, general performance modeling, static analysis, concurrent programming, and more C++. Hopefully I’ll have more time soon and can restart my nascent blogging habit. Wish me luck!

I need less entropy (spam filter)

December 23rd, 2008

So I have been getting a barrage of spam comments in the form of random letters with links. Fantastic. They rely on masquerading as comments in another language, but 90% of them are obviously meaningless.

Seriously, we need a powerful entropy-based spam filter for these. They are easy to spot but computationally expensive to detect, unless you can develop heuristics for what counts as high entropy. A bunch of consonants with few vowels? No: some languages have tonnes of consonants and few vowels. Long words? No: some languages allow super-long words formed by concatenation (ah, this is also what makes it hard for NLP folks to segment sentences in those languages).

Well, I guess I’m asking too much…

I think the best approach would simply match against dictionary words. No, not all dictionary words: it’s much easier if you have a substantial dictionary of stop words and basic grammar words. That would do well in most languages I know of. Say, in English, how many sentences can you write without ‘is’ (or any other form of ‘to be’), ‘a/an/the’, ‘-ing’, or ‘hi/hello’? Not many. In French: ‘le’, l’, d’. In Chinese: 了,呢,不. In Japanese: の、は、が. With a little more knowledge of which languages you’re expecting to receive, you can make such entropy-based filtering even better.
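Something like this minimal sketch, where the stop-word sets and the threshold are tiny illustrative samples I made up, not tuned values:

    # Toy stop-word heuristic: flag a comment when almost none of its
    # tokens hit a small set of common function words.
    STOP_WORDS = {
        "en": {"is", "are", "was", "be", "a", "an", "the", "hi", "hello"},
        "fr": {"le", "la", "les", "de", "un", "une", "et"},
    }

    def looks_like_gibberish(comment: str, threshold: float = 0.05) -> bool:
        tokens = comment.lower().split()
        if not tokens:
            return True
        hits = sum(1 for t in tokens
                   if any(t in words for words in STOP_WORDS.values()))
        return hits / len(tokens) < threshold

(Languages written without spaces, like Chinese or Japanese, would need substring matching instead of whitespace tokens, which is exactly the segmentation headache mentioned above.)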

Machine learning? Probably not. Well, it might be possible, but limited: a spam filter needs to be fast if it is to scale (say, to be used by Gmail). What you can do, though, is run a training session that produces filtering rules which can be executed quickly against each message (a simple machine-learning approach would be to train against a corpus of spam and produce a dictionary of banned words). Oh well, those would be out of my league. I don’t need too advanced a filter.
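If you do want the training idea, a toy version might look like this (the frequency-ratio cutoff is made up, and a real system would need smoothing and far more data):

    # Toy "training" pass: words much more frequent in spam than in ham
    # become banned words for a fast runtime filter to match against.
    from collections import Counter

    def train_banned_words(spam_corpus, ham_corpus, ratio: float = 3.0):
        spam = Counter(w for msg in spam_corpus for w in msg.lower().split())
        ham = Counter(w for msg in ham_corpus for w in msg.lower().split())
        return {w for w, c in spam.items() if c >= ratio * (ham[w] + 1)}

    # train_banned_words(["buy cheap meds", "cheap cheap meds"],
    #                    ["lunch at noon?"])  ->  {"cheap"}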

All right. It seems these spam comments have at least done me some good and gotten me to think things through a bit. (And they made me realize that some sentences mean the same thing in Russian and Ukrainian.)

Solving the root of the problem

December 13th, 2008

I realized that there are two kinds of code maintainers in this world (to tell you upfront, I lied).

The first kind works around the symptoms of a problem. The second attempts to find the root cause and stem it at the source. They are very different kinds of maintainers. The first kind solves problems quickly and appears more productive, but only the second kind adds value to the whole enterprise of maintaining a piece of software. In fact, the first destroys value.

Recently, I fixed a major problem in one of my components by fixing its root cause. Then I went on to fix the other components that depend on it. Oh boy. You should have seen the spaghetti code people had added to those components (I wrote the original code for more than half of them, but I don’t recognize them anymore). People wanted to add new functionality, discovered there was a problem with the component I wrote, and instead of fixing it, tried all sorts of workarounds. I returned to the component because there was an interesting feature I was asked to write, and I thought it was a cool one. I hit upon the same problems, and so on and so forth. Anyway…

Let’s see. Imagine you have UI blocks that you use to, well, let’s say, show tabular data. My code is basically a renderer that renders this tabular data using the UI blocks from this mythical framework. I wrote the entire data model and renderer quite some time ago. It exposes many APIs and event hooks for other programmers to write additional features into the table (say, making the table sortable by its headings). Others and I added several such features, and the whole thing ballooned. The problem is that the renderer assumed one row per logical datum, and fairly recently it started being used to render tables with several rows per logical datum (where ‘several’ varies, even within one table). The model and renderer still work properly, but the APIs that expose the rendered table (say, getRowUIElement or insertRowOnIndex) fail miserably.

Instead of fixing the root cause (the renderer could not provide a correct API for this new kind of table), many people chose to re-code their features by accessing the UI blocks directly, breaking the abstraction the renderer provides. That was an easy fix for most of them, though the results are very awkward: a pluggable feature requires the user to pass in a function that converts a data-model index into an actual index in the UI block (if each logical datum renders into three rows, this function converts row 2 in the data model to index 6 in the UI block). Needless to say, the code became very messy.
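To make the awkwardness concrete, the conversion shim each feature had to carry around amounts to something like this (the names are hypothetical; the real feature code was far hairier):

    # Hypothetical index-conversion shim: each pluggable feature had to
    # map data-model rows onto raw UI-block rows once logical data
    # stopped being one row each.
    def model_row_to_ui_index(model_row: int, rows_per_datum: int) -> int:
        return model_row * rows_per_datum

    assert model_row_to_ui_index(2, 3) == 6  # row 2 of the model lands at UI index 6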

When I encountered the same problem, I was in a better position to solve it, since I knew the code well. So I practically rewrote the renderer, which turned out to be a straightforward job (really, the actual diff was less than 40 lines of code!). The nightmare began when I started making all the new features work without hacks (who loves ugly code? I don’t). It was insane! These people are no doubt smart to have come up with all these workarounds; some are actually worth a second look, if only to appreciate how the author slips around the problem altogether. Yet in almost all cases it would have been infinitely easier to just change the renderer than to render all these features nearly unreadable.

I’m short on sleep now, but I think it’s a job well done this time around (there isn’t much code I’m proud of, but this is certainly some of it).

Remember: think simple. Write simple code! Simple code is easy to maintain and easy to understand. Almost always, avoid workarounds. Try to discover the root cause.

Now, there is a third kind of maintainer (see, I lied). The ones who write a workaround, file a bug against the root cause, either fix the root cause themselves (or get people more familiar with the code to fix it), and then redo the workaround the proper way. These are the people you want handling critical systems. Yes, solving the root problem is a good thing, but sometimes you need to act decisively and fast: a workaround may be the only way to push a bugfix out before too many people are affected. But remember to fix the root cause afterwards! You do not want to build a workaround around a workaround around another workaround (which seems to be pretty common nowadays).

Thanksgiving resolution: Ruby & Python

November 28th, 2008

Finally, Thanksgiving is over. And I made a resolution: I’m going to learn Ruby, the language (probably Rails too, but let’s stick to Ruby first).

Enough people have told me by now that Ruby is a pretty language. Sure, there are tonnes of implementations out there that don’t meet industry standards, but they are surely improving. And you know, Java didn’t start out as fast as it is now. I am convinced.

While I’m at it, I have to learn Python too. (This one is a “have to”.) I’ll very likely need it next year, since I’m hoping to do quite a bit of text processing. While text processing (as in things like log processing) can be boring, it is really just a means to an end for me. I have a really interesting goal right in front of me, but I suspect it’ll involve a lot of experimentation, log crawling, and processing. Therefore: Python.

Okay, Ruby & Python, here I come!

Seinen manga

November 28th, 2008

Recently, I’ve started reading a manga (Japanese comics) genre called seinen. It is aimed at university students and adult men in general, as opposed to shounen, which is for younger males, or, well, boys. (If you’re interested, the female equivalents are josei and shoujo respectively.) Seinen manga is surprisingly… interesting. It’s very refreshing and different from the hack-and-slash shounen and the insanely lovey-dovey shoujo.

Seinen ranges from utterly comedic to deeply psychological. Some of these stories still scar me to this day when I recall them. Indeed, it is not a genre to be taken lightly. So today, instead of discussing the usual programming stuff, I shall introduce some of the seinen manga I’ve read recently.

1. The Lucifer and Biscuit Hammer (Wakusei no Samidare) by Mizukami Satoshi

This manga shocked me. I was laughing uncontrollably after only the first two pages! It was an awesome introduction (read it here: page 1 and 2). The story revolves around a college student who wakes up to find an… err… talking lizard. Hilarious! Oh, and let’s not forget that he later swears allegiance to a princess who wants to save the earth from being destroyed by a… biscuit hammer. Did I mention that the princess does that just so she can destroy the earth herself? Gosh!

Read it here.

2. Angel Heart by Hojo Tsukasa

The first thing I noticed about this manga is its drawing style, which is unique among manga: it looks more like an older American graphic novel than a Japanese comic. After several chapters, that no longer matters. The story revolves around an assassin who commits suicide, only to receive a stolen heart transplant and end up becoming the “daughter” of the man who was the fiancé of the dead girl from whom the heart was stolen. It’s a heartwarming story of redemption, of an ex-assassin doing good deeds as a city hunter.

Read it here.

3. Binbou Shimai Monogatari by Kazuto Izumi

The story revolves around two incredibly poor sisters living alone in a meagre apartment rented from a fierce landlord. They constantly have to skimp on food. It tells the tale of the two sisters coping with their lives with a smile, every day. A true slice-of-life story, it amazingly sticks to that premise right up to the end; it’s as if you’re watching two real sisters in real life, not in a manga. There is no happy ending, and I didn’t expect one. After all, the two sisters are already happy from the beginning, just by having each other’s company.

Read it here.

4. Real by Inoue Takehiko

This is a deeply psychological manga. It provides a looking glass into the world of handicapped basketball through the eyes of a school dropout (who used to be a darn good basketball player), an ex-sprinter who had his leg amputated, and a popular basketball captain who lost all his friends after an accident left him paralyzed from the waist down. It is a perfect twist on other basketball manga like Slam Dunk or Cross Over.

Read it here.

5. Skyhigh by Takahashi Tsutomu

Do you believe in the afterlife? This story brings you to the gate of heaven, where a lone girl serves as its guardian, offering each spirit three options: one, go through the gate and be reborn; two, stay on earth as a wandering spirit; or three, curse (read: kill) one person on earth and go to hell afterwards. Each chapter portrays the decision one spirit makes and its consequences. I was captivated. I read the continuation, Skyhigh Karma, but I still think the original Skyhigh is far more amazing.

Read it here. (Read Skyhigh Karma here.)


There are several other seinen I’ve read recently, notably Hot Milk, Bitter Virgin, and Lucu Lucu, but I forced myself to limit the list to five. I was really torn at #4, and decided to put up the two that are most different from the rest. These three are pretty awesome as well, though.

Seinen is a refreshing genre because it’s so diverse: there is no common plot among the popular seinen manga. I’m glad I read them, though I think they contribute to my headaches. Imagine working 10 hours in front of an LCD screen, then reading manga on another LCD screen at home. Surely there is a limit to what my eyes can take… :(

Dynamic allocation: dynamic shadow copies

November 27th, 2008

I was reading a book on advanced data structures over the past two days (and no, I have only managed to cover the first one and a half chapters). It has this neat idea about shadow copies for dynamically allocated structures.

Let’s take the example of an array-based stack (backed by a vector, an ArrayList, or even a plain array). Normal dynamic allocation for a vector may be performed this way:

  1. Start with a dynamically allocated array of a certain size (it might be 0, 1, a predefined number, or a number passed to the constructor) and keep that size in an array_size variable;
  2. Keep an index variable that stores the position of the top element in the array;
  3. When the array is full (index equals array_size), reallocate the array to twice its size. Reallocation takes one of two forms: if there is enough space to extend the array in place, it is simply extended; otherwise a new array is allocated and the entire original array is copied over.

Now this is a pretty typical method of dynamic allocation. If you take a look at linebuffer.c from GNU readline, it uses exactly this simple algorithm.
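In code, the naive strategy looks roughly like this (a Python sketch; Python lists resize themselves, so the fixed-size list below merely stands in for a raw array):

    # Naive doubling: when the backing array is full, allocate a new
    # array of twice the size and copy everything over.
    class ArrayStack:
        def __init__(self, capacity: int = 8):
            self.array = [None] * capacity   # backing "array"
            self.array_size = capacity
            self.index = 0                   # next free slot

        def push(self, value):
            if self.index == self.array_size:
                new_array = [None] * (2 * self.array_size)
                new_array[:self.index] = self.array   # the costly copy
                self.array = new_array
                self.array_size *= 2
            self.array[self.index] = value
            self.index += 1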

Shadow copying

What is the drawback of the method described above? That very costly copy of the array whenever it needs to be reallocated (instead of extended). So the book advises a technique called shadow copying: you keep two arrays, one of size array_size and another of size m times array_size. For the purposes of this post, let’s assume m = 2.

With m = 2, as you fill the smaller array, each insertion also puts two elements into the larger array. By the time the small array fills up, every element will have been copied, just in time, to the larger array. Rinse and repeat.

The reason the copying catches up is that it always starts copying from index 0 once the smaller array is half full. Think about it: on each insertion you copy one element from the first half of the array and write the new element to both the smaller and larger arrays. The first half finishes copying by the time the second half fills up, while the second half has already been copied by virtue of writing each new element to both arrays.
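Here is a runnable sketch of that invariant for m = 2 (push only; capacities are assumed to be powers of two, and a real implementation would live at the allocator level rather than on Python lists):

    # Shadow copying with m = 2. Once the small array is half full,
    # each push writes the new element to BOTH arrays and copies one
    # old element from the first half into the shadow array, so the
    # copy finishes exactly when the small array fills.
    class ShadowStack:
        def __init__(self, capacity: int = 8):
            self.small = [None] * capacity
            self.big = [None] * (2 * capacity)   # the shadow copy
            self.size = capacity
            self.index = 0                       # element count
            self.copied = 0                      # prefix already copied

        def push(self, value):
            if self.index == self.size:
                # Small is full and big already holds every element:
                # promote big to small and allocate a fresh shadow.
                self.small, self.big = self.big, [None] * (4 * self.size)
                self.size *= 2
                self.copied = 0
            self.small[self.index] = value
            if self.index >= self.size // 2:
                self.big[self.index] = value                   # dual write
                self.big[self.copied] = self.small[self.copied]
                self.copied += 1                               # catch up
            self.index += 1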

Dynamic shadow copying

We return to the same question: what is the drawback of shadow copying? While the time taken per operation increases by at most a constant factor, the memory allocated to the structure at any given time is always O((1+m)n), where n is array_size. That is a huge overhead. I kept asking myself: what can we do about it? It’s huge; there must be something. Moments later, I thought of one idea. I’m not claiming it is the best use of that overhead, but it certainly sounds interesting.

For ease of explanation, let’s start with an example. We begin with a single array of size 16. We fill it halfway (8 elements) and attempt to extend it in place to size 32 (note: extend, not reallocate). If that succeeds, we do nothing more and end up with an array of size 32. Continuing from there, we fill the size-32 array with another 8 elements, reaching half capacity again, and try to extend it to size 64. Let’s assume that extension fails. Only then do we allocate a shadow-copy array of size 64 and perform shadow copying as we fill the size-32 array.

What this means is that in the best case you only need O(mn) space, with a worst case of O((1+m)n). While that may sound like a small difference, for arrays of considerable size it amounts to 33% savings (for m = 2)! (And a very slight saving in the amortized time per operation.)
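The control flow looks roughly like this, reusing the ShadowStack fields from the sketch above. try_extend is a hypothetical primitive standing in for an allocator that can sometimes grow an array in place; the stub here always fails, so only the shadow-copy path ever runs:

    # Dynamic shadow copying: on reaching half capacity, first try to
    # extend the array in place; allocate a shadow copy only if that
    # fails. try_extend() is hypothetical: a real version would ask the
    # allocator, but this stub always reports failure.
    def try_extend(array, new_size) -> bool:
        return False

    def grow(stack, m: int = 2):
        new_size = m * stack.size
        if try_extend(stack.small, new_size):
            stack.size = new_size   # best case: memory stays O(m*n)
            stack.big = None        # no shadow copy needed
        else:
            stack.big = [None] * new_size   # worst case: O((1+m)*n) peak
            stack.copied = 0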

Caveats

The idea is interesting. You may argue that for the large arrays where this method would be most beneficial, in-place extension may be virtually impossible. That is a valid point given current memory layouts. However, this method adds little complexity over plain shadow copying, and it is certainly worthwhile to reclaim as much memory as possible. Furthermore, as virtual memory models evolve, there may come a time when we can allocate arrays over non-contiguous memory (a virtual-machine-like model), making extension a far better option (and shadow copies worthless).

Now, as a reminder, it’s worth pointing out that even with naive reallocation (the first method above), the worst-case allocation is still O((1+m)n) at the moment reallocation occurs. With certain optimizations, shadow copying is no worse than naive reallocation at avoiding out-of-memory problems. One such optimization is to skip allocating the shadow-copy array when doing so is impossible (out of memory) and simply reallocate when the array is full (a failover mechanism).

Shadow copying can still pose a problem for other data structures and processes, which may run out of memory due to the presence of shadow copies. This will be an issue in systems whose memory allocation varies wildly.