[lug] software engineering
nate at natetech.com
Mon Nov 13 13:21:55 MST 2006
Zan Lynx wrote:
> On Sun, 2006-11-12 at 23:44 -0700, Nate Duehr wrote:
>> I was typing up a long reply to all the points, because I find many of
>> these things a lot of fun to talk about -- not being much of a software
>> developer but having worked on the receiving end (technical support) of
>> both good and bad software for most of my adult life, I (think) I have a
>> unique perspective, as do you.
>> I have contended for years that so-called Software Engineers don't play
>> by the same rules that Civil, Chemical, Structural, Electrical, and
>> other Engineers live by -- the industry just barely makes a half-hearted
>> effort at it. What I mean is, the creativity and drive of other
>> Engineers are there, but the discipline isn't.
>> It shows in the fact that open-source software blows away the
>> functionality and features of most "Engineered" code from most businesses.
> Civil engineers design and add a *hefty* safety margin. And they miss
> things, like resonance frequencies on bridges, and some of the mistakes
> made in "quake-proof" buildings in California. And the New Orleans
> flood control systems.
Nit-pick: New Orleans wasn't a mistake. The Army Corps of Engineers built
the levee system for a smaller storm size and surge, and clearly
indicated that to those paying for the job, the U.S. Congress. Congress
had an opportunity to pay for upgrades two years or so before the city
was destroyed and chose not to.
> Electrical engineers make *plenty* of mistakes. Please read the errata
> sheets for various components. Especially interesting are the computer
> related ones like SATA and Ethernet controllers, memory controllers, PCI
> and PCI Express, etc. CPUs have *many* bugs. There was the old Pentium
> divide bug, there was an Athlon 64 prefetch into protected memory
> segfault bug, plus hundreds of other little things I can't remember.
They just don't seem as critical as software's problems.
Every Joe on the street knows what a "computer software crash" is, and
how millions of dollars are lost to them every year. (Billions?)
But you rarely hear about a "building crash"... ya know? Buildings
aren't falling down often enough for people to notice, anyway.
There appear to be completely different levels of mistakes going on
here, differing by many orders of magnitude.
> It's complexity. Engineers cannot hold everything related to their
> project in their heads, and they cannot predict all possible
> interactions between components plus the surrounding environment.
Building a building isn't complex?
The thing is... concepts are heavily re-used in the construction of
buildings, and formalized job roles are in place -- Architects,
Engineers, Construction Supervisors, Workers of all sorts -- and they
all know their piece of the job very well.
Software companies, even big ones, typically still don't quite have that
level of consciousness about code.
The knowledge re-use in building structures is formalized to the point
where it's put into Building Codes.
Software has almost no equivalent, and is far more immature in this
regard (both in sheer number of years and in the attitude of many
developers).
> Software gets a bad rep because it isn't as critical to get right the
> first time as building a bridge, and it is easy to update and fix later,
> so perhaps less effort is put into verification. But the customer does
> get faster and cheaper in exchange for a few bugs.
I would contend that customer companies are starting to disagree, but
software companies are stuck in the rut of quick releases, etc.
When 40% of CxOs say their biggest problem is "software that didn't
deliver what was promised", that's enlightening, isn't it?
> I would say really critical software, like F-22 flight control software,
> *is* heavily analyzed and tested, and is probably just as reliable as
> the mechanical engineering that goes into the wings and engines.
Probably true. But even more so, it's designed with a failsafe
mentality and design philosophy.
It doesn't just fail and crash; it fails and switches to another,
perhaps "dumber" mode where the human has to do a little or a lot more
work to fly the aircraft. And for critical systems the human can't take
over, the redundancy is multiple layers deep.
Again back to testing too -- the F-22 (using your example) was also
heavily TESTED for YEARS by pilots who weren't the operational
day-to-day folks who fly it.
(That makes F-22 Raptor pilots sound like they're not
steely-eyed-missile-men, but they are... the test pilots are just that
razor's edge better even.)
Software simply doesn't go through anything close to this level of
testing -- it's orders of magnitude away.
Even at our systems' most core level, the kernel, the discipline is
thin: depending on a particular distro's release timeline, it could
pull in a section of code that was still being discussed on the
kernel-traffic mailing list only a few weeks earlier, if the release
dates line up just right.
> The discipline is there, when it is needed and cost effective.
It's ALWAYS needed, so cost effectiveness is really the driver. :-)
How many software shops do you know that ask their engineers to look at
ways to create code more inexpensively? I've never once heard that
question asked of a developer in the business world.
What you usually hear is that the underlying system changes so
dramatically that even asking the question at the application
developer/engineer level is silly. The OS will change so much
underneath you that you can't and won't ever attempt to stay on a stable
code-base for your application software.
OS-level upgrades or changes are a good opportunity for most software
houses to package up a new product, discontinue/end-of-life the old
one, and blame it on the OS upgrade -- in an awful lot of cases hiding
the reality that their old code was getting really crufty and needed
some heavy design-level thinking to re-build major components of it.
And patches in the software world don't bring new dollars with them;
only the next new version does. The whole sales cycle is economically
built to force instability. Some would say to force "innovation", but
I'd disagree. Most coders out there are probably spending 50% or more
of their time re-writing the same string-manipulation code they wrote
the day they joined the company, on the shiny new platform -- be that a
new OS, or a new OS and new hardware, etc.
> Really, consider how your boss and users would react if you claimed that
> you needed a week to design, analyze and test the Perl script that
> reports their maximum disk usage, to make sure it was 100% reliable.
Actually we do this, but hey -- I work in a weird environment. :-)
Scripts are never uploaded to production systems without peer and
Engineering review, and are never added during the Production day. Root
access is never allowed unless something is seriously broken during the
production day, and every single command that's going to be typed into
the machine at the shell prompt is written down in a formal
Method-Of-Procedure (MOP) document and reviewed and tested on a lab
system before it's done in production. Even a simple, "We'd like to remove
the excess junk in /tmp" requires a written document of exactly what
you'd type to do it.
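To make the /tmp example concrete, here's a minimal sketch of what the
command portion of such a MOP might look like -- the directory name and
the 30-day cutoff are illustrative assumptions only, not from any actual
customer document:

```shell
# Hypothetical MOP excerpt (path and cutoff are illustrative, not real).
TMPDIR_TO_CLEAN=/tmp/app-scratch
mkdir -p "$TMPDIR_TO_CLEAN"   # in a real MOP the directory already exists

# Step 1: list exactly what would be removed; this output is attached to
# the MOP and reviewed before anything destructive is approved.
find "$TMPDIR_TO_CLEAN" -type f -mtime +30 -print

# Step 2: run only after the reviewed list is approved word-for-word.
find "$TMPDIR_TO_CLEAN" -type f -mtime +30 -delete
```

The point isn't the commands themselves; it's that the destructive step
is written down verbatim and never improvised at the prompt.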
One customer, for example, has done their own analysis and only allows
root access and maintenance work on Friday nights. That's their
"safest" time, the one that matches their internal schedules. (Oh lucky
us! GRIN...)
Others have procedures to allow vendors (us) to formally request root
access any time it's needed, which includes submitting those MOPs for
their review prior to our logging in.
Once in a while after a great deal of trust is built, you can get a
verbal approval (and change of the root password to something you know
so you can get in) to log into the system to LOOK at a log file with
permissions you can't otherwise see. But you still don't change or
modify ANYTHING without permission.
Down-time is NOT in my customer's vocabulary. :-) Down-time caused by
a vendor typing the wrong command is very close to being an offense that
can lead to termination of the person who allowed us to do it.
> Right, it's ridiculous. You just fix it whenever you notice it doesn't
> work for filesystems with really long device names.
This would be listed in the caveats and/or design doc for the script if
it comes from our Engineering group -- that document is required, and
the script can't be released past the Change Control Board without it.
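For the curious, the long-device-name failure mode is the classic df
line-wrap: plain df may split a filesystem's entry across two lines when
the device name is long, which silently breaks naive line-by-line
parsing. A sketch of the wrap-proof version (assuming only a POSIX df
and awk):

```shell
# df -P requests the POSIX "portable" output format, which guarantees
# each filesystem stays on a single line no matter how long its name is.
# Column 5 is the capacity ("Use%") field; strip the % and track the max.
df -P | awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > max) max = $5 + 0 }
             END { print "maximum disk usage: " max "%" }'
```

Exactly the kind of caveat a one-week review would have caught up front
instead of "whenever you notice it."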