The widely publicised and very serious "gotofail" bug in iOS7 took me back ...

Early in my career I spent seven years in a very special software development environment. I didn't know it at the time, but this experience set the scene for much of my understanding of information security two decades later. I was in a team with a rigorous software development lifecycle; we attained ISO 9001 certification way back in 1992. My company deployed 30 software engineers in product development, 10 of whom were dedicated to testing. Other programmers elsewhere independently wrote manufacture test systems. We spent a lot of time researching leading edge development methodologies, such as Cleanroom, and formal specification languages like Z.

We wrote our own real time multi-tasking operating system; we even wrote our own C compiler and device drivers! Literally every single bit of the executable code was under our control. "Anal" doesn't even begin to describe our corporate culture.

Why all the fuss? Because at Telectronics Pacing Systems, between 1985 and 1989, we wrote the code for the world's first software-controlled implantable defibrillator, the Guardian 4210.

The team spent relatively little time actually coding; we were mostly occupied writing and reviewing documents. And then there were the code inspections. We walked through pseudo-code during spec reviews, and source code during unit validation. And before finally shipping the product, we inspected the entire 40,000 lines of source code. That exercise took a five person team working five hours a day for two months.

For critical modules, like the kernel and error correction routines, we walked through the compiled assembly code. We took the time to simulate the step-by-step operation of the machine code using pen and paper, each team member role-playing parts of the microprocessor (Phil would pretend to be the accumulator, Lou the program counter, me the index register). By the end of it all, we had several people who knew the defib's software like the back of their hand.

And we had demonstrably the most reliable real time software ever written. After amassing several thousand implant-years, we measured a bug rate of less than one in 10,000 lines.

The implant software team had a deserved reputation as pedants. Over 25 person years, the average rate of production was one line of debugged C per team member per day. We were painstaking, perfectionist, purist. And argumentative! Some of our debates were excruciating to behold. We fought over definitions of "verification" and "validation"; we disputed algorithms and test tools, languages and coding standards. We were even precious about code layout.

Yet 20 years later, purists are looking good.

Last week saw widespread attention to a bug in Apple's iOS operating system which rendered a huge proportion of website security impotent. The problem arose from a single superfluous line of code - an extra goto statement - that nullified checking of SSL connections, leaving users totally vulnerable to fake websites. The Twitterverse nicknamed the flaw #gotofail.

There are all sorts of interesting quality control questions in the #gotofail experience.

  • Was the code inspected? Do companies even do code inspections these days?
  • The extra goto was said to be a recent change to the source; if that's the case, what regression testing was performed on the change?
  • How are test cases selected?
  • For something as important as SSL, are there not test rigs with simulated rogue websites to stress-test security systems before release?

There seem to have been egregious shortcomings at every level: code design, code inspection, and testing.

A lot of attention is being given to the code layout. The spurious goto is indented in such a way that it appears to be part of a branch, but it is not. If curly braces had been used religiously, or if an automatic indenting tool had been applied, the bug would have been more obvious (assuming that the code actually gets inspected by humans).
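
To make the layout point concrete, here is a minimal, self-contained sketch of the pattern; the function and helper names are mine, not Apple's (the real routine is SSLVerifySignedServerKeyExchange in the published sslKeyExchange.c). The duplicated goto is not governed by the if above it, so it always executes, the function jumps to its cleanup label with err still zero, and the one check that matters is never made:

    #include <stdio.h>

    /* Hypothetical stand-ins for the real hashing and signature routines. */
    static int hash_update(void)      { return 0; }   /* 0 = success            */
    static int hash_final(void)       { return 0; }
    static int verify_signature(void) { return 1; }   /* non-zero = forged      */

    /* Returns 0 for "connection trusted", non-zero otherwise. */
    static int check_server_key(void)
    {
        int err;

        if ((err = hash_update()) != 0)
            goto fail;
            goto fail;                    /* spurious: always taken, err is still 0      */
        if ((err = hash_final()) != 0)    /* never reached                               */
            goto fail;
        err = verify_signature();         /* never reached: the forgery is never checked */
    fail:
        /* cleanup would happen here */
        return err;                       /* returns 0: "trusted"                        */
    }

    int main(void)
    {
        printf("check_server_key() = %d\n", check_server_key());   /* prints 0 */
        return 0;
    }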

I agree of course that layout and coding standards are important, but there is a much more robust way to make source code clearer.  Beyond the lax testing and quality control, there is also a software-theoretic question in all this that is getting hardly any attention: Why are programmers using ANY goto statements at all?

I was taught at college and later at Telectronics that goto statements were to be avoided at all costs. Yes, on rare occasions a goto statement makes the code more compact, but with care, a program can almost always be structured to be compact in other ways. Don't programmers care anymore about elegance in logic design? Don't they make efforts to set out their code in a rigorous structured manner?
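
For comparison, here is one goto-free arrangement of the same checks, reusing the hypothetical helpers from the sketch above. This is my illustration of the structured style, not Apple's eventual fix: the result defaults to failure, the conditions are chained explicitly, and a stray duplicated line cannot quietly turn every path into success:

    /* Structured, fail-closed version using the same hypothetical helpers. */
    int check_server_key_structured(void)
    {
        int err = -1;                     /* fail closed: default is "not trusted"     */

        if (hash_update() == 0 &&
            hash_final()  == 0) {
            err = verify_signature();     /* only a successful verification clears err */
        }

        /* cleanup happens here, once, on every path */
        return err;
    }

The point is not that this particular arrangement is bulletproof, but that structured control flow gives a careless edit far fewer places to hide.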

The conventional wisdom is that goto statements make source code harder to understand, harder to test and harder to maintain. Kernighan and Ritchie - UNIX pioneers and authors of the classic C programming textbook - said the goto statement is "infinitely abusable" and should "be used sparingly if at all." The Telectronics implant software coding standard prohibited goto statements, without exception.

Hard to understand, hard to test and hard to maintain is exactly what we see in the flawed iOS7 code. The critical bug never would have happened if Apple, too, had banned the goto.

Now, I am hardly going to suggest that fanatical coding standards and intellectual rigor are sufficient to make software secure. It's unlikely that many commercial developers will be able to cost-justify exhaustive code walkthroughs when millions of lines are involved even in the humble mobile phone. It's not as if lives depend on commercial software.

Or do they?!

Let's leave aside that vexed question for now and return to fundamentals.

The #gotofail episode will become a textbook example of not merely the importance of attention to detail, but of disciplined logic, rigor, elegance, and fundamental coding theory.

Yet perhaps a deeper lesson in all this is the fragility of software. Prof Arie van Deursen nicely describes the iOS7 routine as "brittle". I want to suggest that all software is tragically fragile. It takes just one line of silly code to bring security to its knees. The sheer non-linearity of software - the ability for one line of software anywhere in a hundred million lines to have unbounded impact on the rest of the system - is what separates software development from conventional engineering practice. Software doesn't obey the laws of physics. No non-trivial software can ever be fully tested, and we have gone too far for the software we live with to be comprehensively proofread. We have yet to build the sorts of software tools and best practice and habits that would merit the title "engineering" (See also "Security Isn't Secure").

I'd like to close with a philosophical musing that might have appealed to my old mentors at Telectronics. We have reached a sort of pinnacle in post-modernism where the real world has come to pivot precariously on pure text. It is weird and wonderful that engineers are arguing about the layout of source code - as if they are poetry critics.

We have come to depend daily on great obscure texts, drafted not by people we can truthfully call "engineers" but by a largely anarchic community we would be better off calling playwrights.