Code highlighting is wonderful, but we're having too much of it

There's been some debate about syntax highlighting, about how and even if it's useful at all. This article talks about how we're doing it wrong, and this much older article doubts its usefulness in general. I think the arguments against syntax highlighting are thought provoking and so I've since been forming an opinion of my own.

I admit that before I read any of the articles, I did not really question syntax highlighting. It's just been there for a long time, it certainly looks neat, and since it became some kind of bare minimums standard feature, I even started to associate its absence with inferior or misconfigured editors (in emacs, the absence of syntax highlighting in a buffer is a good sign that you may be missing a much desired major mode for editing that kind of file).

By now though, I do agree that a lot of the common over-abundant syntax highlighting is questionable at best. As Linus Akesson's article demonstrates, reading a sentence where almost every word is colored according to some syntactical class neither makes reading easier, nor does it really add any valuable information. With respect to actual code, I then find myself thinking that "yes, I do know that if is a keyword of the language I'm using right now, thank you very much".

But I do also think that there's a lot of (not necessarily syntax) highlighting that really does add some tremendous value.

Let's take a look at some examples:

Making stuff that I am looking for pop out


When I am looking for pairs of matching things (parens, braces or even start/end tags in XML) it is certainly useful when the editor lights the corresponding element up without any effort on my part. And when I search for expressions, it's most helpful if the editor highlights all found occurrences while the search is active, so that I may visually scan through the file.

One of the most valuable features of modern editors, however, is the ability to highlight all occurrences of the same syntactical element, like a specific variable or function, just by placing the cursor on it. IntelliJ is one editor which does this, and it is often a huge help when reading or modifying code, even more so if it's someone else's code with which I am not fully familiar yet (which I tend to dig in a lot). While there is also the exhaustive "Find Usages" feature which gives you a clickable list of global usages, when working locally a clear visual standout is more useful than any list in some other panel.


This screenshot, however, shows well how IntelliJ's default of subtly underlining the selected variable is drowned out in a sea of over-abundant formatting attributes. Granted, IntelliJ's default is also a little bit too subtle for my tastes, but looking for those underlined "i"s in that lit up mess hurts me more than it helps. And since this is not my code (it's part of jansi, by the way), I would appreciate the help.

Let's turn the symbol usage up for just a little bit and all the other stuff way down:


Much better. I can easily see where "i" gets used. The colorful stuff is now the stuff that matters.


Notice that in this screenshot, the variable "buff" is also still highlighted with a yellowy background. That's because it belongs to another class of stuff that I care about, which is:

Anomalies


What IntelliJ is trying to tell me is that the code may use a StringBuilder instead of a StringBuffer at that point for efficiency. This, like any other warning, error or anything with the string "TODO" in it, is information that I am always interested in. In general, anything that I consider an anomaly should stand out from the rest, because it tells me: "hey, look at this, there's something wrong or unusual here, you should probably do something about it".

Syntax Subtleties


Syntax highlighting itself becomes useful again when the syntax is subtle enough that I welcome further help about it. Take this example of string interpolation in Scala:


Here, highlighting helps reminding me that .name in $person.name will be interpreted as part of the string (something I most likely did not intend), that I escaped the $ in 40$ correctly ("aha, so it's not \$"), and where the expression in ${...} really ends.

This is different from, say, plain old Java code where I really do know what word is a keyword.

Less is More


In theory, a lot of common syntax highlighting aims to convey useful information that is not always immediately available from the directly surrounding code. IntelliJ for example colors local variables, fields, constants etc. differently. In practice, though, I don't need that information constantly shoved into my face. I often already know what is what, because I am actively working on the code in question. But even if I don't, I'm entirely content to do some small additional action (like command-clicking on the symbol) to find out.

Only for some of the more interesting cases like constants or static variables do I keep some unobtrusive formatting (like cursive script) enabled. With a constant visual overload, on the other hand, I tend to ignore any formatting attributes altogether.


The conclusion is simple: I would not want to live without code highlighting in certain situations. But in order for it to become actually useful, I really need less highlighting than what a lot of the defaults offer.

Fixing xterm-256color with NetBSD

Update: the patch made it into NetBSD 6.1, so if you can upgrade, this fixes the problem.

I am mostly writing this article because I could not really find anything about this problem, despite the fact that it should occur somewhat commonly nowadays.

If you have ever logged into a NetBSD machine using xterm-256color as value for your TERM-variable (this is the default for Mac OS X's Terminal.app, for example), you may have noticed that any program which tries to use colors in the terminal produces garbage among its normal output, essentially rendering every colored terminal application useless.

Things to try this with are vim, emacs, screen and tmux. Depending on your configuration, starting them up is often enough to see that something is wrong, but for extra fun try M-x list-colors-display in emacs.

emacs displaying a world devoid of any colors, corrupted and bland :(

Workaround

A workaround is fairly easy: set your TERM variable to xterm instead of xterm-256color. You can, for example, do this ad hoc by typing export TERM=xterm, or instruct your terminal emulator to set TERM accordingly on startup, or employ some shell startup script logic. However this is not only kludgy, but of course also disables the use of 256 color capabilities, which at least for emacs would be quite a nice thing to have.

Fix

The actual reason is not any incompatibility, nor any problem in NetBSD's terminfo entries, but in fact a bug in NetBSD's libterminfo. The full bug report is here, but the short version is that conditionals in terminfo are not correctly handled on NetBSD. While those are pretty rare, which is probably why this bug has gone unnoticed for so long, they are used in the color escape codes for xterm-like emulators with support for 256 colors, making the problem immediately visible in those.

I included a patch in the aforementioned bug report, so hopefully this will be fixed in one of the next releases. If you don't want to wait until then, you may apply the patch yourself.

And if you are really lazy, you may also download the patched sources to just libterminfo and compile and install them like this:

$ cp /usr/bin/tic .
$ make
...
$ sudo make install

The first step is needed because the build system somehow wants tic (the terminfo compiler) in the same directory when building like this.

(Keep in mind that I took the libterminfo-sources from NetBSD 6.0-rc2, which may or may not be a good version for you. I also don't know if or how much of the NetBSD sources in /usr/src you actually need to have for the compilation to work.)

Once it is installed, color codes with xterm-256color should work right away!

a happier emacs in a shiny 256 color world.

"It's really not that bad!"

Look, people, I don't care if you somehow manage to create feasible, readable code in the horrible programming language of your choice. Fact is, if the language requires you to follow meticulous coding guidelines and have extraordinary discipline, because freely using its features would create a maintenance nightmare, it is a bad language.

errno.h

They squeezed Text file busy into ETXTBSY and no such process into ESRCH (very obvious). They even left out an S in EACCES and when your Argument list is too long, that condition is mnemonically referred to as E2BIG.

But File name too long is, of course, called ENAMETOOLONG.

Google Treasure Hunt finished

Today was a slow day, so I just decided to finish the last question of the Google Treasure Hunt 2008.

Here's a quick summary of all the questions and a basic outline of how I solved each of them:

1. The Robot

A robot is in the left upper corner of a rectangular grid. It has to reach the goal on the right lower corner of that grid. With every move, the robot can go either down or right to reach the next square of the grid. No other directions are allowed. The question is: how many different paths can the robot take to reach the goal?

How I solved it: The number of further possible paths for each grid follows a pretty simple recursion: for each grid that is not adjoint to the goal, the number of possible paths is 2 (since it can go either down or right) plus the number of possible paths when standing on the right neighbour (which would be the next square if you chose to go right), plus the number of possible paths when standing on the lower neighbour. The base cases are for the squares directly above or left from the goal, there is just one possible path there.

Writing down that recursive function is pretty trivial. But just evaluating it is not enough: the width and height of Google's rectangular grid lies somewhere between 30 and 60, which drives the naive approach infeasible.

One solution is to apply a straightforward approach of dynamic programming: start by computing the number of paths for every square in the rightmost column (which is 1, because you can't go right when being in the last column), then proceed to compute the left neighbouring column, starting with the lowermost square. During each computation, you can immediately use the memorized path numbers of the square's lower and right neighbours without any recursion. This is enough to get the correct answer in a few seconds.

A better approach: I didn't notice it, but the problem can easily be reduced to the binomial coefficient or choose function: since the robot can't ever go back and it always has to reach the opposite corner, every path has the same amount of down moves (D) and right moves (R) and it's sufficient to calculate the number of possible combinations of width-1 R-moves and height-1 D-moves.

Some people apparently chose to do the whole thing in Excel or OpenOffice: every grid square is represented by a sheet cell and each cell contains a formula that calculates the number of paths based on the neighbouring cells, just like I explained above. That's actually a pretty funny way to do some limited functional programming with automatic memoization!


2. The Zip File

You have to download a ZIP archive, which contains several text files in several subdirectories. For each given path patterns (simple ones, the paths must contain a substring and the filenames have to end with a specific extension), find the matching files and take the sum of the numbers in a specific line number. Multiply all sums together, enter the result. Example:

Sum of line 1 for all files with path or name containing BCD and ending in .rtf
Sum of line 5 for all files with path or name containing EFG and ending in .js
Hint: If the requested line does not exist, do not increment the sum.
Multiply all the above sums together and enter the product below.


How I solved it: I don't know the exact command line anymore, because I solved it within my shell, and the history has gone since. But it was mostly find and sed. There may have been some awk in there, too. There's really a million ways to do, and provided you use your shell it's pretty straightforward.

3. The Network


This one was pretty easy. You get a number of hosts (named A,B,C...) and a routing table which contains four entries for every host (including a default gateway). You are given a starting host name and a destination IP and your goal is to give the sequence of hosts that a packet would travel.

I'm not quite sure what they are aiming at. You need to know next to nothing to solve it, and you don't even need to use your computer in any way. In fact, people who never cared about computer science or mathematics at all could easily solve it after a 5 minutes introduction of how a routing table works.

How I solved it: I did it manually, by following the routes and writing down which hops the packet went through. The only part of the computer that I used was the screen. By looking at it.

4. The Primes

This got more interesting again. You have to find the smallest number which can be expressed as the sum of several given amounts of consecutive prime numbers and which is also a prime number itself. Here's an example:

Find the smallest number that can be expressed as
the sum of 17 consecutive prime numbers,
the sum of 43 consecutive prime numbers,
the sum of 57 consecutive prime numbers,
the sum of 569 consecutive prime numbers,
and is itself a prime number.
How I solved it: Because the constraints are about consecutive prime numbers, all you need is contiguous windows over a (theoretically infinite) list of prime numbers. Except for the last constraint that states that the matching number should be a prime number, which is equivalent to the number being member of the list.

With a lazy language, where infinite lists are no problem, you could write a more or less complex function which generates that list of prime numbers, but don't: there are plenty of precomputed lists of prime numbers on the internet. Just read one in and use it, it's much less hassle.

The first thing to do is to find a prime number candidate. I did this by sliding the largest window (whose size is the amount of summed numbers in the last constraint) until a prime number is found. The check for a prime number itself is a simple search over the prime number list.

Once that candidate is found, it's time to check the other constraints, in descending order of their sum size. To do that, you slide a new, smaller window over your prime list, its size again being the number of summands in the current constraint, until its sum matches your candidate. However, you don't have to start the windows at the beginning of the list: quick reasoning tells you that any sum of numbers within or ahead of a bigger window whose number of summands is less than the previous window will be smaller than the sum of that window and thus can't be your candidate. It is sufficient to start at i-n+1, where i is the first index position after the bigger window and n is the size of the current window. Since you are checking your constraints in descending order of the window size, you can always use the upper bound of the previous constraint's window, subtract the current size and add 1.

All constraints have to match, so once the sum of any window gets larger than your candidate, you have to start over and look for a new candidate.

The sliding itself can be optimized a bit: instead of just increasing the indexes and recalculate the full sum, it's enough to simply subtract the first number in the window and add the next number outside of the windows upper bound. The algorithm will be fast enough in any way.

As you can see, I didn't make any assumptions about prime numbers at all. In fact, the whole algorithm works with any arbitrary list of numbers, as long as it's ordered.


This was fun, let's do it again next year. Next time, it might be a good idea to start solving as soon as the questions come out, maybe I'll have the chance to get one of the treasures :)

spreading a single request over multiple access logs impossible with lighttpd?

It seems that lighttpd can't spread one HTTP request over multiple, possibly differently formatted access logs. At least I couldn't find anything in mod_accesslog's documentation or with some quick Google searches.

This is an example of how to specify the format and destination for access logging:

accesslog.filename = "/var/log/lighttpd/access.log"
accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""

You may notice that there are just the two variables accesslog.filename and accesslog.format, only referencing "the" access log instead of several individual ones, and that the assigned values are strings, not lists (I tried specifying lists, lighttpd doesn't accept them).

Now, spreading different, individual HTTP requests over multiple log files is still possible. For example, you might want to log request for different virtual hosts into different log files, an intention which is quite common. To do this, simply use lighttpd's conditional configuration syntax, something which I like very much about lighttpd, and reassign different values to accesslog.filename (and/or accesslog.format):

$HTTP["host"] == "example.com" {
accesslog.filename = "/var/log/lighttpd/example.com-access.log"
}


But it seems that you can't log the same HTTP requests to multiple destinations. Perhaps you want to have one big access log that contains every request for every virtual host and several specific access logs that each contain requests to an individual virtual host. Or you want to have two access logs which contain exactly the same requests, but with different formats.

I did the latter one under apache, in order to log more information for debugging purposes, without disrupting the scripts that gather statistics from the common access logs. That's how I discovered lighttpd's apparent lack of ability to do so.

One workaround would be to pipe into a script instead of writing directly into a file, which lighttpd supports out of the box:

accesslog.filename = "|/path/to/some/log-script"


But that's adding an additional layer of complexity and you can't easily use lighttpd's conditional configuration to decide on the destination anymore. Additionally, in the above debugging scenario where different log files contain the same requests but with different formats, you'd have to specify the set union of all format options in your lighttpd configuration and parse and reformat inside of your script.

Is there anything I have overlooked or are there any plans for future lighttpd versions?

Internet Explorer 7 caches incomplete downloads sometimes

Internet Explorer 7 shows a strange issue with downloads that haven't finished for various reasons. After one incomplete download attempt, users are unable to properly restart the download in order to fetch the complete file. Subsequent tries seem to succeed immediately, without the truncated file actually being completed or replaced.

Thus, once a download was aborted, the user will have difficulties to download the file at all.

This problem occurs when some isolated issue caused a premature end of the HTTP connection, even though the server would normally be able to serve the file completely. I am not yet sure about the exact conditions, but it happened often enough for users to complain. The access logs are pretty clear about it when it happens: one 200 response followed by several 304 responses for the same IP.

Internet Explorer 7's misbehavior occurs because it sometimes doesn't remove incomplete downloads from its cache (which it usually seems to do). In all cases, IE7 should be able to make a distinction between complete and incomplete downloads, because the server did send a correct Content-Length header, but the amount of transferred and saved data was less than the announced value.

So, what IE7 should do on subsequent retries is either
  1. try to do a partial GET to acquire the rest of the file (this would be the preferred behaviour) or
  2. redownload the file with a normal, unconditional GET.
But what IE7 actually does is this: Because the file whose download attempt is being made is, allegedly, already in the browser's cache, it just sends a conditional GET to the web server. It does this by including an If-Modified-Since header in its request, containing the last modification time of the file.

However, since the file on the server really didn't change (it's just bigger, but file size is not checked), the web server correctly responds with a 304 Not Modified status code and no data. IE7 then proceeds to copy the file out of its cache, even though it's a truncated version from a previously aborted connection.

This happens for all subsequent requests and thus effectively blocks the user from downloading the corresponding file at all. Worse even: the user, which does not see any error messages, is instead just told his retry finished without any apparent issue. This leads him to think that the problem is a truncated file on the server, hence not a correctly retrievable file at all.

To resolve this problem, users have to manually clear their cache. However, it's hard to communicate this to possibly unknown users who (quite reasonably) assume that it's a persistent server problem.

Another workaround would be to tell the server that it shouldn't follow conditional GETs with If-Modified-Since header if MSIE7 is detected. But even if it's even possible at all to configure your web server to do this, you should only do this for the larger "download" content of your website (i.e. podcasts and other large static files), as you would be effectively disabling a part of the browser's cache.

So, if you are suffering some dropped connections for whatever reason, IE7 might make matters worse by turning temporary glitches into semipermanent problems. Let's hope this bug gets fixed in future browser versions. In the meantime, tell users which complain about truncated files to clear their cache.

converting videos for iPod video using ffmpeg

This should work with all platforms where ffmpeg is available, including Linux, Windows, Mac OS X and several BSD flavors.

After experimenting a bit, I finally got the right command lines working for ffmpeg which convert every video (at least every video that ffmpeg supports as input) to iPod video friendly format. Everything seems to work, including forward/backward seeking.

First of all, you need to compile xvid and support into ffmpeg. The corresponding arguments to ffmpeg's configure script are --enable-xvid and --enable-faac. You may need to compile and install faac and xvid first. You also need to specify --enable-gpl, since ffmpeg's configure script won't accept those options otherwise.

If you like to be able to encode videos into h264 format instead of MPEG4/xvid, you might also install x264 libraries and specify --enable-x264 to ffmpeg's configure.

Now that you have everything up and running, you may try to convert a video. For a 4:3 video, use:


ffmpeg -i foo.avi -f mov -b 1800 -maxrate 2500 \
-vcodec xvid -qmin 3 -qmax 5 -s 320x240 \
-acodec aac -ab 128 \
foo.ipod.mov


For some reason, I can't get my iPod to honor the aspect ratio settings and to squish a 16:9 video into letterbox format (or zoom into fullscreen, for that matter), although iTunes and Quicktime seem to have no problem doing so. To work this around, just tell ffmpeg to add the black bars on top and bottom itself and substract the height of the black bars from the actual video size:


ffmpeg -i foo.avi -f mov -b 1800 -maxrate 2500 \
-vcodec xvid -qmin 3 -qmax 5 \
-s 320x180 -padtop 30 -padbottom 30 \
-acodec aac -ab 128 \
foo.ipod.mov


It may be worth playing with bitrate and quality settings or also trying to use 2-pass-encoding, remember however that at least according to Apple's specification, the iPod with video only supports up to 2500kbps for MPEG4 and up to 768kbps for h264. Another thing that you might want to play with, especially if you plan on outputting your videos on a TV screen (using the optional A/V cable) instead of just on the iPod are the size and padding options.

If you know of any means to make the encoding process faster or to induce better quality, please drop me a comment. And now, enjoy your movies and clips on the go!

Update: If forward/backward skipping doesn't work, update your version of gtkpod.

This is it!

It is RAW.