Exploring pre-1990 versions of wc(1)

Last update: 2023-07-24 23:21:00

Can you blame a tool for not supporting standard X, if the tool was written before X was invented?

While a reasonable person would undoubtedly answer, “You can’t do that” (and add, “You shouldn’t do that”), it can sometimes be instructive to examine how the absence of X has shaped the assumptions of the tool’s creators.

Take one of the simplest Unix utils, wc(1). This is a man page from v7 Research UNIX (1979):

Note the clarity, the laconism (almost non-existent these days), and the following definition:

“A word is a maximum sequence of characters separated by spaces, tabs, or newlines.”

We can assume this is roughly equivalent to the following JavaScript:

> "та за шо\n".split(/[\t\n ]/).filter(Boolean).length
3

(I deliberately avoided \s and +.)
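
The same definition can also be sketched at the byte level in C. This is only an illustration of the man page's wording, not code from any wc implementation; the function name count_words_bytes is made up for the example.

```c
#include <stddef.h>

/* Count words as the v7 man page defines them: maximal runs of bytes
 * separated by spaces, tabs, or newlines. Bytes are never decoded, so
 * the pieces of a multi-byte UTF-8 character are just ordinary word bytes. */
long count_words_bytes(const char *s, size_t len)
{
    long words = 0;
    int in_word = 0;    /* nonzero while inside a run of non-separators */

    for (size_t i = 0; i < len; i++) {
        char c = s[i];
        if (c == ' ' || c == '\t' || c == '\n') {
            in_word = 0;            /* a separator ends the current word */
        } else if (!in_word) {
            in_word = 1;            /* first byte of a new word */
            words++;
        }
    }
    return words;
}
```

Fed the 15 bytes of "та за шо\n" in UTF-8, this returns 3, because no byte of a multi-byte UTF-8 sequence can collide with space, tab, or newline.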

The answer from the JavaScript snippet is exactly what the modern wc(1) of coreutils prints:

$ echo та за шо | wc -w
3

However, the wc from v7 Research UNIX has a different opinion:

$ locale | grep LC_CTYPE
LC_CTYPE=uk_UA.UTF-8
$ cc -Wall usr/src/cmd/wc.c -o wc 2>&1 | grep warning | wc -l
7
$ echo та за шо | ./wc -w
      0

Obviously they didn’t have UTF-8 in 1979 (it was invented in 1992), but if a word is just a sequence of bytes other than \t, \n, or space, shouldn’t even such a prehistoric version of wc have parsed the input correctly?

The oldest version of coreutils’ wc I could find is from 1989. I grabbed it from a GitHub mirror, and guess what:

$ curl -s https://raw.githubusercontent.com/coreutils/coreutils/b25038ce9a234ea0906ddcbd8a0012e917e6c661/src/wc.c > wc.coreutls.1989.c
$ cat patch
24c24,28
 #include 
> #include 
> #include 
> #include 
> #include 
$ patch wc.coreutls.1989.c -o wc.c < patch > /dev/null
$ cc -Wall wc.c -o wc 2>&1 | grep warning | wc -l
4
$ echo та за шо | ./wc -w
      3

It gives the correct answer, even though UTF-8 did not exist in 1989 either.

I was wondering why AT&T’s version wasn’t working.

v7, 1979

The source is pleasantly small. The gist of it is this loop:

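linect = 0;
wordct = 0;
charct = 0;
token = 0;
for(;;) {
    c = getc(fp);
    if (c == EOF)
        break;
    charct++;
    if(' '<c&&c<0177) {
        if(!token) {
            wordct++;
            token++;
        }
        continue;
    }
    if(c=='\n')
        linect++;
    else if(c!=' '&&c!='\t')
        continue;
    token = 0;
}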

The wordct variable is incremented only when the current character lies strictly between space (040 in octal) and DEL (0177, the last entry in ASCII). Most of our input was

$ echo та за шо | hexdump -b
0000000 321 202 320 260 040 320 267 320 260 040 321 210 320 276 012
000000f

slightly above that range.
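
The range check can be reproduced in isolation. Below is a minimal sketch modeled on the v7 rule described above, working on a buffer instead of a FILE; the name count_words_v7_rule is hypothetical.

```c
#include <stddef.h>

/* Word counting with the v7 rule: only bytes strictly between ' ' (040)
 * and DEL (0177) can start a word. Separators end a word; every other
 * byte, including anything above 0177, is skipped without counting. */
long count_words_v7_rule(const char *s, size_t len)
{
    long words = 0;
    int token = 0;      /* nonzero while inside a counted word */

    for (size_t i = 0; i < len; i++) {
        int c = (unsigned char)s[i];    /* same value range as getc() */
        if (' ' < c && c < 0177) {
            if (!token) {
                words++;
                token = 1;
            }
            continue;
        }
        if (c != '\n' && c != ' ' && c != '\t')
            continue;   /* not printable ASCII, not a separator: ignored */
        token = 0;
    }
    return words;
}
```

Every non-separator byte in the hexdump above is 0260 or higher, so the first branch is never taken and the function returns 0 for our input, while the plain ASCII "ta za sho\n" yields 3.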

v8, 1985

The complete rewrite of wc.c for v8 did not solve our problem:

$ cc -Wall usr/src/cmd/wc.c -o wc 2>&1 | grep warning | wc -l
17
$ echo та за шо | ./wc -w
      0

From an aesthetic standpoint, the beginning of v8’s wc.c looks absolutely fantastic. It’s such a beauty, I’ll leave it here as a screenshot:

This time the main loop does not live in main(); they moved the counting into a separate function:

count(fd, name)
    char *name;
{
    register token=0, n;
    register unsigned char *cp;
    register long chars=0, lines=0, words=0;
    while((n=read(fd, buf, sizeof buf))>0){
        chars+=n;
        cp=buf;
        while(--n>=0)
            switch(type(*cp++)|token){
            case NL:
                lines++;
                break;
            case NL|TOKEN:
                lines++;
                token=0;
                break;
            case SP:
                break;
            case SP|TOKEN:
                token=0;
                break;
            case ORD:
                token=TOKEN;
                words++;
                break;
            case ORD|TOKEN:
                break;
            case JUNK:
            case JUNK|TOKEN:
                break;
            }
    }
    close(fd);
    print(chars, words, lines, name);
    tchars+=chars;
    twords+=words;
    tlines+=lines;
}

Although the token variable cleverly keeps track of whether the previous character was part of a word, for our input its value never changes from 0, because v8 Research UNIX classifies anything above 0177 as JUNK, i.e. not worth noting.
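
To see why the switch never reaches the ORD case for our input, here is a minimal sketch in the spirit of v8's type() lookup. The flag values, the classify_byte helper, and the treatment of control characters as JUNK are all assumptions made for illustration; the real v8 definitions are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical flag values, chosen so that class|TOKEN is unique. */
enum { SP = 0, NL = 1, ORD = 2, JUNK = 3, TOKEN = 4 };

/* Hypothetical stand-in for v8's type(): classify a single byte. */
int classify_byte(unsigned char c)
{
    if (c == '\n') return NL;
    if (c == ' ' || c == '\t') return SP;
    if (c > ' ' && c < 0177) return ORD;
    return JUNK;    /* assumed: control chars, DEL, and all bytes above 0177 */
}

/* Same state machine as count(), reduced to the word counter. */
long count_words_v8_style(const char *s, size_t len)
{
    long words = 0;
    int token = 0;  /* either 0 or TOKEN */

    for (size_t i = 0; i < len; i++) {
        switch (classify_byte((unsigned char)s[i]) | token) {
        case NL | TOKEN:
        case SP | TOKEN:
            token = 0;      /* a separator ends the current word */
            break;
        case ORD:
            token = TOKEN;  /* first ordinary character of a word */
            words++;
            break;
        default:            /* NL, SP, ORD|TOKEN, JUNK, JUNK|TOKEN */
            break;
        }
    }
    return words;
}
```

With this classification every byte of "та за шо" other than the spaces is JUNK, so token stays 0 and the word count stays 0, matching what the compiled v8 wc printed.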

v9, 1986

Unfortunately, the major differences in wc.c between v8 and v9 did not affect the word count:

$ diff -u v8/usr/src/cmd/wc.c v9/cmd/wc.c
--- v8/usr/src/cmd/wc.c 1985-07-05 08:48:38.000000000 +0400
+++ v9/cmd/wc.c 1988-01-15 19:51:45.000000000 +0300
@@ -65,7 +65,7 @@
 {
        register i, fd, status=0;
        if(argc>1 && argv[1][0]=='-'){
-               opt=++argv[1];
+               opt= ++argv[1];
                --argc, argv++;
        }
        if(argc==1)

v10, 1989

An important year: Poland broke away from the Soviet bloc and the Berlin Wall fell.

The modest wc.c got a new version.

$ cc -Wall cmd/wc.c -o wc 2>&1 | grep warning | wc -l
16
$ echo та за шо | ./wc -w
      3

I can’t believe my eyes! It works. What happened? Did the sudden competition from GNU force them to fix some v9 bugs? Or did someone at AT&T get an email in CP437 from Italy? Who knows.

Take a look at this sample, which casually jumps between loops with goto. It is a model we should all strive for:

count(fd, name)
    char *name;
{
    register n;
    register unsigned char *cp, *cpend;
    register long chars=0, lines=0, words=0;

    for(;;){
        if((n=read(fd, buf, NBUF)) cpend)
                    break;
                goto doword;
            }
        }
    }
    for(;;){
        if((n=read(fd, buf, NBUF)) cpend)
                    break;
                words++;
                if(cp[-1] == '\n')
                    lines++;
                goto dospace;
            }
        }
    }
done:
    close(fd);
    printout(chars, words, lines, name);
    tchars+=chars;
    twords+=words;
    tlines+=lines;
}

I should close this by saying that I write sarcastic comments as a joke. I love everything the folks at Bell Labs have done. I got the code samples for v7-v10 from the tuhs.org archive.

Tags: ойті

Authors: ag