Wednesday, May 18, 2016

A Brief History of the C Programming Language

Note: This is the first of what will likely become a small series of posts about the C programming language.

It turns out that two of the most influential software projects in the history of computing1—the C programming language and the UNIX operating system—are deeply intertwined and share a common history.2

Unix and C

That history begins with an ambitious project to create a new time-sharing operating system for mainframe computers. MIT, General Electric, and Bell Labs began cooperating on this project, called Multics, in 1964. But development was relatively slow, so in 1969 Bell Labs decided to withdraw. Still wanting a new operating system, one of the Bell Labs researchers, Ken Thompson, took the lead in developing what became UNIX3. The first version of UNIX was written for the DEC PDP-7. That version was written entirely in assembly code, a low-level notation that maps almost directly onto the instructions the CPU executes. Desiring a higher-level programming language for applications on the new operating system, Thompson also started work on a language called B, which was loosely based on the existing language BCPL but heavily modified to work on the resource-constrained PDP-7.

In 1970, the still-nascent UNIX project was given a DEC PDP-11. Because the PDP-11 could not run assembly code written for the PDP-7, UNIX had to be rewritten from the ground up. B, too, was ported to the new machine, but several deficiencies in the language became apparent as it began to see wider use within Bell Labs. In 1971, another Bell Labs researcher, Dennis Ritchie, began making several changes to B, both to adapt it to the new hardware and to resolve some of the issues in the language. While he initially called his project NB (short for New B), by 1973 it was clear that a new language had emerged, which was given the name C. When C was mature enough to work with, the UNIX kernel itself was rewritten (a third time) in C, so that the code could be more easily updated. As a side benefit, having UNIX written in C meant that it would be easier to port to other computer architectures in the future.

Beyond Bell Labs

With UNIX up and running on the PDP-11 series of computers and with most of the UNIX code written in the highly portable C language, Bell Labs had something that should have been a great software product. But there was a problem. Bell Labs was, of course, owned by AT&T, and at the time AT&T remained a highly regulated monopoly. In particular, a 1956 consent decree entered into with the US Department of Justice restricted AT&T’s ability to enter new markets. While prohibiting the sale of UNIX, then, AT&T’s lawyers did at least allow Bell Labs to distribute the source code to universities and research institutions, as long as it charged only for the cost of the media and shipping. That decision made UNIX effectively free.

Several institutions requested and received copies of UNIX. Many of these same institutions had also been participating in the ARPANET project, the predecessor to the Internet. In 1975, RFC 681 was published, advocating the use of UNIX on ARPANET hosts. By 1978, UNIX had been ported to the IBM System/360 and the Interdata 8/32. UNIX’s low cost, high portability, and connection to the growing ARPANET combined to fuel considerable growth in its use, especially at universities. By the early 1980s, it was the de facto standard operating system for university computer science departments in many parts of the world.

And, of course, wherever UNIX went, C went too. UNIX itself was written in C and therefore so were the extensions to UNIX that the universities were writing and distributing. And because a C compiler was included with UNIX, most applications for UNIX were also written in C. UNIX and C became so pervasive in universities throughout the world that an entire generation of computer science majors was taught to program using C. C was also ported to other platforms, and today compilers for C can be found on every major platform (and most minor ones, as well).

Standardization and Diversification

In 1978, Brian Kernighan and Dennis Ritchie published The C Programming Language, a book detailing what the C language was. As the use of C spread, and especially as new compilers for C were written for new platforms, that book (often known simply and affectionately in the C community as K&R) effectively became the standard definition of the language. By the early 1980s, however, it was clear that something more was needed. C had continued to evolve after the book was published, and many of the new idioms and types available in most C compilers weren’t even mentioned in the book.

In 1983, the American National Standards Institute (ANSI) accepted the task of producing an official standard for the C programming language. The ANSI group working on C took a conservative, deliberate approach, so the first standard wasn’t published until 1989. The International Organization for Standardization (ISO) adopted that standard in 1990 and has maintained it since, with two major revisions so far: C99 and C11, published in 1999 and 2011 respectively.

Before C had even been standardized, however, new variants—C-like languages—had started to appear. As early as 1979, Bjarne Stroustrup began work on what initially was a series of object-oriented extensions to C (based on concepts from the Simula programming language) but eventually became its own language. Originally called “C with Classes”, it was renamed to C++ in 1983 and went on to have its own standardization process.

Another effort to bring object-oriented programming to C, called Objective-C, began in 1983. Objective-C was heavily influenced by the Smalltalk programming language and borrowed its concept of passing messages to objects as well as a square bracket syntax for message passing4. Objective-C was adopted by NeXT as its preferred programming language and as such went on to become the primary programming language of the NeXTSTEP operating system and its descendants, macOS and iOS.

C++ and Objective-C were the vanguard of what became a Cambrian explosion of C-based languages. Dozens of languages, including Java, PHP, JavaScript (aka ECMAScript), C#, D, Scala, Rust, Go, and Swift, share a C-style syntax, along with several other key concepts borrowed from C.

Fugit inreparabile tempus

According to the TIOBE index for April 2016, C is the second most popular programming language in use today. What’s curious is that if you look at the top 20 most popular programming languages,5 C is the oldest by about a decade. In fact, the next two oldest languages on that list, C++ and Objective-C, are C’s direct descendants mentioned above. Most of the rest of the languages on that list date from the 1990s or later.

It’s difficult to overstate just how much computers have changed since C was developed in the early ’70s. As outlined above, initial work on the B and C languages took place on the PDP-7 and PDP-11, machines that are extremely limited by modern standards. The PDP-7 had only 8 kilowords of memory.6 Although cutting-edge when it was delivered and more advanced than the PDP-7 that it replaced, the PDP-11 still only had 24 kilobytes of memory. The limitations of these machines had a significant impact on the designs of both B and C as programming languages.

One example of this impact is C’s type system. The initial release of the PDP-11 did not support performing floating-point arithmetic in hardware. Its manufacturer, however, promised that hardware support for floating-point arithmetic would come in a soon-to-be-released add-on module. To make it easier to support the PDP-11 both with and without the hardware floating-point unit, Ritchie decided to add a type system to C. C’s model for variable types, therefore, was driven by practical concerns about the specific hardware on which C was developed. Or, as Ritchie himself puts it (emphasis added), “…it seemed that a typing scheme was necessary to cope with characters and byte addressing, and to prepare for the coming floating-point hardware. Other issues, particularly type safety and interface checking, did not seem as important then as they became later.”

The C programming language was a remarkable achievement for its time. As is always the case, however, it is very much a product of its time. And that time was the early 1970s. The computing landscape has changed dramatically in the 45 years or so since C was being developed. As I’ll explore in a future post, C itself and the C-style syntax that so many other languages use feel very much like anachronisms to me: relics from an earlier time whose continued use is all the more puzzling given the technology industry’s relentless drive forward.

  1. The influence of the C programming language will become quite evident throughout the rest of this article. The influence of the UNIX operating system is too large a tangent to explore fully in this post, so I will simply say this: if you are reading this article on an iPhone, iPad, Mac, Android phone, Android tablet, or Chromebook, you are using an operating system derived, ultimately, from UNIX.

  2. Most of the information about the early days of C in this post comes from Dennis Ritchie’s recollections of the origins of C, first published in 1993. 

  3. The initial spelling, Unics, was suggested as a pun on Multics. It seems no one remembers exactly how the final spelling of UNIX came to be.

  4. This choice of syntax has long baffled me and other programmers. The square brackets comport very nicely with Smalltalk’s style of syntax but clash noticeably with C’s syntax style.

  5. The index also includes “Assembly Language”. TIOBE has certain criteria for deciding what is and what is not a programming language and by those criteria assembly language is included. Its inclusion, however, is problematic in a number of ways, not least of which is the fact that there is no one assembly language; every processor architecture has a different assembly language. Pinning down an age for “assembly language” is therefore quite difficult. It seems reasonable to assume that the most commonly used assembly language would be the language for the most commonly used processor architecture, which at the moment is x86. The x86 architecture was introduced in 1978, which means that C is still older. Given the difficulties of establishing exactly what “assembly language” is, the discussion above about popular programming languages and their ages ignores it. 

  6. The PDP-7 came before the computer industry had settled on bytes as the fundamental unit of memory. A bit is a single binary value, either 0 or 1. A byte is a group of 8 bits, or eight values of 0 or 1 grouped together. The memory in the PDP-7, however, was grouped into 18-bit words instead of 8-bit bytes. In his recollections of the origins of C, Ritchie stated that it had “8K 18-bit words of memory”. To calculate the raw memory capacity, you can convert the PDP-7’s words into bytes, taking 8K to mean 8 × 1,024 = 8,192 words: 8,192 words × 18 bits/word = 147,456 bits; 147,456 bits ÷ 8 bits/byte = 18,432 bytes, or 18 kilobytes. It’s important to remember, however, that the memory was addressed in 18-bit words and therefore no value in memory could occupy fewer than 18 bits. So even though the raw memory capacity was equivalent to 18 kilobytes, the maximum number of values that could be stored and accessed by the memory system was only 8,192 and not 18,432.
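
The conversion in footnote 6 can be sketched in a few lines of Python. This is only an illustrative calculation: the function and constant names are hypothetical, and it reads “8K” as 8 × 1,024 = 8,192 words, which is an assumption (a round 8,000 words would give 18,000 bytes instead).

```python
# Sketch: raw capacity of a word-addressed memory, expressed in 8-bit bytes.
# The constants below are assumptions drawn from the "8K 18-bit words" figure.

WORD_BITS = 18          # PDP-7 word size in bits
WORD_COUNT = 8 * 1024   # "8K" read as 8 x 1,024 words (assumption)
BYTE_BITS = 8           # size of a modern byte in bits


def raw_capacity_bytes(word_count: int, word_bits: int) -> int:
    """Return the total number of bits, regrouped into 8-bit bytes."""
    total_bits = word_count * word_bits
    return total_bits // BYTE_BITS


def addressable_values(word_count: int) -> int:
    """Each address names one whole word, so this is simply the word count."""
    return word_count


if __name__ == "__main__":
    print("raw capacity in bytes:", raw_capacity_bytes(WORD_COUNT, WORD_BITS))
    print("addressable values:   ", addressable_values(WORD_COUNT))
```

The two functions are kept separate to mirror the footnote's distinction: the raw bit count can be regrouped into bytes on paper, but the number of values the machine could actually address is fixed by the word count.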