Tuesday, May 18, 2004
A bald-faced bug
Any developer has their "toughest bug they ever cracked" story. I'm no different.
Those of you who know me know that I'm hair-challenged. Thinning on top. Approaching cleanhead status. Alright, I'm basically bald. This is the story of how I got bald over the period of two days. Ripping out clumps of hair in frustration. Cats and dogs, sleeping together. Total chaos.
I had this small utility program I'd created in ANSI C running on a mid-range HP/UX server. Mid-eighties. The program was no more than five or six hundred lines in length. It read some input files and did some calculations for motion control. I think it created pre-planned routes for a high-speed turning machine that cut specially shaped pistons for Cadillac. Either that or it was routing air traffic over LaGuardia.
Anyhow, the program would do some complicated calculations and run for a while. Most of the time it would work. In fact, it might work perfectly forty times in a row. But on the forty-first run, it would crash. And it was completely random. It might work forty times, crash twice, work another five times, crash, then work fifty times.
I threw in printfs to isolate the location of the crash (no IDEs available on the HP/UX back then, Jimbo). It was crashing in RANDOM FRICKING LOCATIONS every time. WTF?
I analyzed the time-of-day of each crash. Nothing. It didn't seem to be time-based.
I analyzed the input data. The same data file would crash sometimes and not crash other times.
Exasperated, I started stripping out large chunks of code. The math calculations got stripped out, over several iterations, until there were no calculations whatsoever. Still crashed. Cheezus. What in the...?
I excised the up-front error-checking. Nada. Same results. And it was crashing in random locations!
Now all I had was the loop that read the input file and pre-processed the data for calculation. Ripped out the pre-processor part. ARGGHHHGH! It still crashed.
All that remained was a loop and an fscanf that read the raw input data into the initial data variables. I removed the fscanf.
It worked. Well, it better, given that the whole damn program was just a freaking loop now. Something in the fscanf was toasting something else, given the random location of the crashes.
Long story short: one of the data items was out of range for the variable fscanf was loading it into. Just one in a series of eight or nine. And since HP/UX's I/O subsystem was handling the fscanf (to give time back to the system)... the subsystem was occasionally blowing away my process - randomly, just to ratchet up the fun factor.
So... the I/O subsystem was randomly crashing the application code. Hey, HP/UX designer! Nice separation of process and system code! Okay, that's my frustration talking, it really wasn't their fault.
Lesson learned: separate steps wherever possible. For example, read input data into a buffer, _then_ sscanf it. Don't mask too many operations in the interest of "more elegant" code. You might end up, ahem, hair-challenged.
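Here's roughly what I mean, as a minimal sketch rather than the original code (which is long gone). The record layout, the field names, and the helper names parse_record and read_records are all made up for illustration; the real file had eight or nine items per record. The point is the shape: fgets pulls one line into a fixed-size buffer, and only then does sscanf pick it apart, with a width limit on the string field and a check on the return value.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record layout; the real input format is not shown in this post. */
typedef struct {
    char   label[16];
    int    axis;
    double position;
} Record;

/* Step 2: parse a line that is already sitting in memory.
 * Returns 1 on success, 0 if the line is malformed. */
static int parse_record(const char *line, Record *out)
{
    Record tmp;
    char extra;

    /* %15s stores at most 15 characters plus the terminating NUL,
     * so it cannot run past label[16]. The trailing %c catches junk
     * after the third field. */
    int n = sscanf(line, "%15s %d %lf %c",
                   tmp.label, &tmp.axis, &tmp.position, &extra);
    if (n != 3)
        return 0;   /* too few fields, bad number, or trailing junk */

    *out = tmp;
    return 1;
}

/* Step 1: read raw lines into a buffer, then hand each one to the parser.
 * Returns the number of records stored; *bad_line is set to the 1-based
 * line number of the first rejected line, or 0 if none was rejected. */
static long read_records(FILE *fp, Record *recs, long max, long *bad_line)
{
    char buf[256];
    long count = 0;
    long lineno = 0;

    *bad_line = 0;
    while (count < max && fgets(buf, sizeof buf, fp) != NULL) {
        lineno++;
        /* No newline and not at end of file: the line did not fit in buf. */
        if (strchr(buf, '\n') == NULL && !feof(fp)) {
            *bad_line = lineno;
            break;
        }
        if (!parse_record(buf, &recs[count])) {
            *bad_line = lineno;
            break;
        }
        count++;
    }
    return count;
}
```

When a line is bad, you now know which line it was, and you can print the buffer to see exactly what the parser choked on. That beats chasing crashes in random locations.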
at 10:41 PM