So I have been getting a barrage of spam comments in the form of random letters with links. Fantastic. They try to pass themselves off as comments in another language, but 90% of them are obviously meaningless.
Seriously, we need a powerful entropy-based spam filter for these. They are so easy for a human to detect, but computationally expensive to filter unless you can develop heuristics for what counts as high entropy. A bunch of consonants with few vowels? No. Some languages have tonnes of consonants and few vowels. Long words? No. Some languages allow super long words (due to concatenation of words; ah, this is also what makes sentence segmentation hard for the NLP folks working on those languages).
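For what it's worth, the naive version of "entropy" here is just Shannon entropy over characters. A minimal sketch (my own illustration, not anything this post commits to):

```python
from collections import Counter
from math import log2

def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character of a string."""
    text = text.lower()
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A uniform random-letter string scores near log2(26) ~ 4.7 bits/char,
# while natural prose usually sits lower, but the margin is thin and
# language-dependent, which is exactly the problem described above.
```

As the paragraph above argues, a threshold on this number alone misfires on consonant-heavy or highly agglutinative languages.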
Well, I guess I’m asking too much…
I think the best approach would simply match against dictionary words. No, not all dictionary words: it’s much easier if you have a substantial dictionary of stop words and basic grammar words. It would do well in most languages I know of. Say, in English, how many sentences can you write without ‘is’ (and all the other forms of ‘to be’), ‘a/an/the’, ‘-ing’, or ‘hi/hello’? Not many. In French: ‘le’, l’, d’. In Chinese: 了，呢，不. In Japanese: の、は、が. With a little more knowledge of which languages you’re expecting to receive, you can make such entropy-based filtering even better.
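The stop-word idea above could be sketched roughly like this (the word lists and the 5% threshold are made-up placeholders, and splitting on whitespace would of course need substring matching instead for Chinese or Japanese):

```python
# Hypothetical sketch: flag a comment as gibberish if it contains
# almost none of the common function words of the expected languages.
STOP_WORDS = {
    "en": {"the", "a", "an", "is", "are", "was", "and", "of", "to", "hi", "hello"},
    "fr": {"le", "la", "les", "de", "un", "une", "et"},
}

def looks_like_gibberish(comment: str, threshold: float = 0.05) -> bool:
    """True if fewer than `threshold` of the words are known stop words."""
    words = comment.lower().split()
    if not words:
        return True
    hits = sum(1 for w in words
               for stops in STOP_WORDS.values() if w in stops)
    return hits / len(words) < threshold
```

The appeal is that this stays a cheap set lookup per word, so it scales in a way that heavier analysis would not.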
Machine learning? Probably not. Well, it might be possible, but it’s limited. A spam filter needs to be fast if it is to scale (say, to be used by Gmail). What you can do, though, is run an offline training session that produces filtering rules which can be executed quickly against each message (a simple machine-learning approach would be to train against a corpus of spam and produce a dictionary of banned words). Oh well, those would be out of my league. I don’t need too advanced a filter.
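That offline train-then-filter split could look something like this; the corpus, the `min_ratio` cutoff, and the add-one smoothing in the denominator are all my own placeholder choices:

```python
from collections import Counter

def train_banned_words(spam: list[str], ham: list[str],
                       min_ratio: float = 5.0) -> set[str]:
    """Offline 'training': words far more frequent in spam than in ham
    become a banned-word set that is cheap to check at filter time."""
    spam_counts = Counter(w for msg in spam for w in msg.lower().split())
    ham_counts = Counter(w for msg in ham for w in msg.lower().split())
    # +1 in the denominator avoids division by zero for spam-only words.
    return {w for w, c in spam_counts.items()
            if c / (ham_counts[w] + 1) >= min_ratio}
```

The expensive counting happens once, offline; the per-message check is then just a set membership test, which is what keeps the filter fast.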
All right. It seems these spam comments have at least done me some good and let me think through some stuff quite a bit. (And they made me realize that some sentences mean the same in Russian and Ukrainian.)