Archive for December, 2008

I need less entropy (spam filter)

Tuesday, December 23rd, 2008

So I have been getting a barrage of spam comments in the form of random letters with links. Fantastic. They rely on masquerading as comments in another language. However, 90% of these spam comments are obviously meaningless.

Seriously, we need a powerful entropy-based spam filter for these. They are so easy for a human to spot, but computationally expensive to detect unless you can develop a heuristic for what counts as high in entropy. A bunch of consonants with few vowels? No. Some languages have tonnes of consonants with few vowels. Long words? No. Some languages allow super long words (due to concatenation of words; ah, this problem is also what makes it hard for NLP guys to segment sentences in those languages).

Well, I guess I’m asking too much…

I think the best one would simply match against dictionary words. No, not all dictionary words. It’s much easier if you have a substantial dictionary of stop words and basic grammar words. It will do well in most languages I know of. Say, in English, how many sentences can you go without ‘is’ (and all the other forms of ‘to be’), ‘a/an/the’, ‘-ing’, or ‘hi/hello’? Not many. In French: ‘le’, l’, d’. In Chinese: 了,呢,不. In Japanese: の、は、が? With a little more knowledge of what languages you’re expecting to receive, you can make such entropy-based filtering even better.
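To make the idea concrete, here is a minimal sketch of the stop-word check in Python. Everything in it is an assumption for illustration: the tiny word lists, the names `function_word_ratio` and `looks_like_gibberish`, and the 0.1 threshold are made up, and a real filter would need much larger lists per language.

```python
import re

# Hypothetical, deliberately tiny lists; a real filter would need far more
# entries for each language it expects to receive.
FUNCTION_WORDS = {
    "is", "are", "am", "was", "were", "be", "a", "an", "the", "hi", "hello",
    "le", "la", "les", "de",
}
# Characters to look for in scripts that do not put spaces between words.
FUNCTION_CHARS = {"了", "呢", "不", "の", "は", "が"}


def function_word_ratio(text: str) -> float:
    """Return the fraction of letter-only tokens that look like grammar words."""
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(
        1
        for token in tokens
        if token in FUNCTION_WORDS or any(ch in FUNCTION_CHARS for ch in token)
    )
    return hits / len(tokens)


def looks_like_gibberish(text: str, threshold: float = 0.1) -> bool:
    """Flag text whose share of grammar words falls below the threshold."""
    return function_word_ratio(text) < threshold
```

A comment such as "This is a comment about the article" scores well above the threshold, while a string of random letters with a link scores zero. Very short legitimate comments can also score zero, so this would probably work better as one signal among several than as the only test.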

Machine learning? Probably not. Well, it might be possible, but limited. A spam filter needs to be fast if it is to scale (say, to be used by Gmail). What you can do, though, is perform a training session that produces filtering rules that can be executed quickly against each message (a simple machine-learning approach would be to train against a corpus of spam and produce a list of banned words). Oh well, those would be out of my league. I don’t need too advanced a filter.
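The "train once, filter fast" split could look something like the sketch below. It is not a real learning algorithm, just word counting: the function names, the `min_spam_count` parameter, and the idea of excluding any word seen in legitimate comments are all assumptions made for the example.

```python
from collections import Counter


def tokenize(message: str) -> set[str]:
    """Lower-case a message and return its distinct whitespace-separated words."""
    return set(message.lower().split())


def train_banned_words(spam, ham, min_spam_count=2):
    """Offline step: collect words seen in several spam messages but never in ham."""
    spam_counts = Counter(word for msg in spam for word in tokenize(msg))
    ham_words = {word for msg in ham for word in tokenize(msg)}
    return {
        word
        for word, count in spam_counts.items()
        if count >= min_spam_count and word not in ham_words
    }


def is_spam(message: str, banned: set[str]) -> bool:
    """Online step: a set intersection test, cheap enough to run per message."""
    return not banned.isdisjoint(tokenize(message))
```

The slow part (building the list) runs offline over the corpus, and the per-message check is only a handful of hash lookups, which is the property that matters if the filter has to scale.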

All right. It seems these spam comments have at least done me some good and made me think things through quite a bit. (And they made me realize that some sentences mean the same in Russian and Ukrainian.)

Solving the root of the problem

Saturday, December 13th, 2008

I realized that there are two kinds of code maintainers in this world (to tell you upfront, I lied).

The first kind will find workarounds for the symptoms of a problem. The second will attempt to find the root cause of the problem and stem it at the source. They are very different kinds of maintainers. The first kind will solve problems quickly and appear to have higher productivity. But only the second kind adds value to this entire enterprise of maintaining a piece of software. In fact, the first kind destroys value.

Recently, I fixed a major problem with one of my components by fixing its root cause. Then I went on to fix the other components that depend on it. Oh boy. You should have seen the spaghetti code people had added to those other components (I wrote the original code for more than half of them, but I don’t recognize them anymore). People wanted to add new functionality, discovered that there was a problem with the component I wrote, and instead of fixing it, tried all sorts of workarounds. I returned to the component because I was asked to write an interesting feature, one I thought was pretty cool. I hit the same problem, and so on and so forth. Anyway…

Let’s see. Imagine you have a set of UI blocks that you use to, well, let’s say, show tabular data. My code is basically a renderer that renders this tabular data with the UI blocks from this mythical framework. I wrote the entire data model and the renderer quite some time ago. It exposes many APIs and event hooks for other programmers to write additional features for the table (say, making the table sortable by its headings). Others and I have added several features to the table, and it has grown a lot. The problem is that the renderer assumes one row per logical data item, and pretty recently it started being used to render tables with several rows per logical data item (where ‘several’ is variable, even within one table). The model and renderer still work properly. But the APIs that expose the rendered table (say, getRowUIElement or insertRowOnIndex) fail miserably.

Instead of fixing the root cause (the renderer was not able to provide a correct API for this new kind of table), many chose to re-code their features by accessing the UI blocks directly, thus breaking the abstraction provided by the renderer. That was an easy fix for most of them, though the results are very awkward. One pluggable feature requires the user to pass in a function that converts a data model index into an actual index in the UI block: if each logical data item renders into three rows, this function converts row 2 of the data model to index 6 in the actual UI block. Needless to say, the code became very messy.
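For illustration only, here is roughly what that index conversion looks like when it lives in one place instead of being passed around by every feature. This is a Python sketch, not the actual renderer code: the class `RowIndexMap` and its method names are hypothetical, and it assumes zero-based indices and that the renderer knows how many UI rows each logical item produced.

```python
from bisect import bisect_right
from itertools import accumulate


class RowIndexMap:
    """Maps logical record indices to UI row indices and back (zero-based)."""

    def __init__(self, rows_per_record):
        # _starts[i] is the UI index of the first row rendered for record i;
        # the final entry is the total number of UI rows.
        self._starts = [0, *accumulate(rows_per_record)]

    def first_ui_row(self, record_index: int) -> int:
        """UI index of the first row belonging to the given record."""
        return self._starts[record_index]

    def ui_rows(self, record_index: int) -> range:
        """All UI row indices belonging to the given record."""
        return range(self._starts[record_index], self._starts[record_index + 1])

    def record_for_ui_row(self, ui_row: int) -> int:
        """Inverse lookup: which record a given UI row was rendered from."""
        if not 0 <= ui_row < self._starts[-1]:
            raise IndexError(f"UI row {ui_row} is out of range")
        return bisect_right(self._starts, ui_row) - 1
```

With three rows per item, `RowIndexMap([3, 3, 3]).first_ui_row(2)` gives 6, matching the example above, and the same lookup keeps working when the row count varies from item to item. Methods in the spirit of getRowUIElement could then be built on top of a map like this without each feature doing its own arithmetic.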

When I encountered the same problem, I was in a better position to solve it since I knew the code well. So I practically rewrote the renderer, which turned out to be a straightforward job (really, the actual diff was fewer than 40 lines of code!). The nightmare began when I started to make all the new features work without hacks (who loves ugly code? I don’t.). It was insane! These people are no doubt smart if they could think of all these sorts of workarounds. Some of them are actually worth a second look, if only to appreciate the way the writer slips around the problem altogether. Yet in almost all cases, it is infinitely easier to just change the renderer than to make all these features almost unreadable.

I lack sleep now, but I think it’s a job well done this time around (I don’t have much code I’m proud of, but this is certainly some of it).

Remember, think simple. Write simple code! Simple code is easy to maintain and easy to understand. Avoid workarounds whenever possible. Try to discover the root cause.

Now there is a third kind of maintainer (see, I lied): the ones who write a workaround, file a bug against the root cause, fix the root cause (or get people who are more familiar with the code to fix it), and then replace the workaround with the proper fix. These are the guys you want when handling a critical system. Hey, it’s true that solving the root problem is a good thing, but sometimes you need to act decisively and fast. A workaround may be the only way to push that bugfix out before too many people are affected. But remember to fix the root cause! You do not want to build a workaround around a workaround around another workaround (which seems to be pretty common nowadays).