Wednesday, June 30, 2004

Paranoia == Good



Only the Paranoid Survive: How to Exploit the Crisis Points That Challenge Every Company

This will sound a little bizarre, but one of the proudest moments in my career as a consultant to Fortune 100 companies came as the result of a major system meltdown. And it highlights why my "paranoia on steroids" mindset is a pretty valuable trait in the software world.

As a bit of background, I need to explain how a large Lotus Notes shop works, at least as it pertains to email [1]. Notes, among other things, can serve as the backbone of a large, distributed email infrastructure. In this particular IT shop, Notes was distributed globally (not an uncommon occurrence) over quite a large number of servers.

The databases within Notes undergo a task known as "replication" or, more specifically, "multi-master replication". This means that changes made on any server are merged with changes from related servers on a near-real-time basis. This is one of the advantages of a distributed Notes system: changes can occur anywhere in the system and they will automatically ripple through the entire network without user intervention, so everyone is always up-to-date. Occasionally, the same database record will be modified on two servers at once. This situation is known as a "replication conflict", and Notes will automatically keep both copies of the record for you.

One of the key databases held in each Notes server is a list of user accounts, known as the NAB. The NAB stands for "Name & Address Book". The NAB is pretty important because it contains all valid Notes user accounts on a given system and all of their personal information.

One of my areas of expertise is directory servers (LDAP directories) and, as a consultant, I did a fair amount of architecture, design and deployment of directory services. In this shop, among other things, I created a system-level process that would take a feed from Notes (since, at that time, it was the authoritative source for internal user accounts) and apply it to the Directory.

The way my feed process worked, high-level, was something like this:

> Read all of the Directory's user accounts into a map

> Read all of Notes' user accounts into another map

> Rip through all of the Notes accounts, looking each one up in the Directory's map. If the account exists in both, remove it from both maps; if any of its fields differ, save the changed fields into an Updates list.

> When done, save the remaining Notes accounts into an Inserts list.

> Save the remaining Directory accounts into a Deletes list.
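For the code-minded, here is a minimal sketch of the comparison steps above. It is not the original feed process: the account shape (a dict of field names to values, keyed by a user ID) and the name `diff_accounts` are my assumptions for illustration. The point it shows is that nothing gets written; the function only builds the three lists.

```python
from typing import Dict, List, Tuple

# Hypothetical shapes: an account is a dict of field name -> value,
# and each map is keyed by some unique user ID.
Account = Dict[str, str]
Accounts = Dict[str, Account]


def diff_accounts(
    directory: Accounts, notes: Accounts
) -> Tuple[List[str], List[Tuple[str, Account]], List[str]]:
    """Return (inserts, updates, deletes) without writing anything."""
    remaining = dict(directory)  # shallow copy; the caller's map is untouched
    inserts: List[str] = []
    updates: List[Tuple[str, Account]] = []

    for key, notes_account in notes.items():
        directory_account = remaining.pop(key, None)
        if directory_account is None:
            # In Notes but not in the Directory: a candidate insert.
            inserts.append(key)
            continue
        # Keep only the fields whose Notes value differs from the Directory.
        # (Fields that exist only in the Directory are ignored in this sketch.)
        changed = {
            field: value
            for field, value in notes_account.items()
            if directory_account.get(field) != value
        }
        if changed:
            updates.append((key, changed))

    # Whatever is left exists only in the Directory: a candidate delete.
    deletes = list(remaining)
    return inserts, updates, deletes


directory = {
    "alice": {"mail": "alice@example.com"},
    "bob": {"mail": "bob@example.com"},
    "carol": {"mail": "carol@example.com"},
}
notes = {
    "alice": {"mail": "alice@example.com"},
    "bob": {"mail": "robert@example.com"},
    "dave": {"mail": "dave@example.com"},
}
inserts, updates, deletes = diff_accounts(directory, notes)
```

With the made-up data above, `dave` lands in the inserts, `bob` in the updates (with only the changed `mail` field), and `carol` in the deletes.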

Okay, that all seems pretty straightforward. But here's where my paranoia kicked in. Notice that instead of just doing the changes directly in the directory... I saved them into separate lists. Why?

Failsafe



Instead of doing the writes, our superhero, Paranoid-Boy, placed the following checks at this stage of the code:

> Is the number of Deletes greater than x% of the total number of directory accounts?

> Is the number of Updates greater than y% of the total number of directory accounts?

> Is the number of Inserts greater than z% of the total number of directory accounts?

If any of these values were exceeded, I assumed that either a catastrophic fault had occurred or that manual intervention for some major change in the user population was necessary. If this happened, the feed would reject the entire operation, notifying the operator and giving them the option to perform a "manual override".
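A rough sketch of that check might look like the following. The percentages are placeholders I made up (the real x, y and z are not given here), and `FeedRejected`, `exceeded_limits` and `sanity_check` are hypothetical names, not the original code.

```python
from typing import Dict, List, Mapping

# Placeholder thresholds in percent. These values are assumptions;
# tune them to how much churn is normal for your user population.
DEFAULT_LIMITS: Mapping[str, float] = {
    "deletes": 5.0,
    "updates": 20.0,
    "inserts": 10.0,
}


class FeedRejected(Exception):
    """Raised when a batch of changes is too large to apply unattended."""


def exceeded_limits(
    total_accounts: int,
    counts: Dict[str, int],
    limits: Mapping[str, float] = DEFAULT_LIMITS,
) -> List[str]:
    """Return the kinds of change whose share of the directory is over its limit."""
    exceeded = []
    for kind, limit_pct in limits.items():
        count = counts.get(kind, 0)
        if total_accounts > 0:
            pct = 100.0 * count / total_accounts
        else:
            # An empty directory gives no meaningful percentage,
            # so treat any change at all as suspicious.
            pct = 100.0 if count else 0.0
        if pct > limit_pct:
            exceeded.append(kind)
    return exceeded


def sanity_check(
    total_accounts: int,
    counts: Dict[str, int],
    manual_override: bool = False,
    limits: Mapping[str, float] = DEFAULT_LIMITS,
) -> None:
    """Reject the whole batch unless it is within limits or an operator overrides."""
    exceeded = exceeded_limits(total_accounts, counts, limits)
    if exceeded and not manual_override:
        raise FeedRejected("too many changes: " + ", ".join(exceeded))
```

Only after `sanity_check` returns without raising would the feed go on to apply the Inserts, Updates and Deletes lists. For example, with a hypothetical directory of 100,000 accounts, a batch containing 12,000 deletes is 12% of the total and trips the 5% placeholder limit, so the entire batch is rejected until an operator passes `manual_override=True`.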

That's why I didn't do the writes directly. I wanted to wait until I could 'sanity-check' the number of changes to the directory, which was the key repository of security data.

Sure enough, the alarm klaxons went off one day. Apparently, a bad feed process had corrupted one of the central NABs and removed 12,000 users. Notes, doing as instructed, replicated the change throughout the planet before the system administrators could get a hold of the problem. So in (probably) every NAB around the world, all 12,000 users got purged. And I think some pretty senior management folks (maybe even the CEO) temporarily lost their access to Notes services. That's a tad embarrassing for the IT execs.

Worse still, the systems downstream were at huge risk, including the Directory. Luckily, my feed process' paranoia check kicked in and rejected the feed from Notes: too many changes! No one lost their security privileges due to the Directory and the folks on our team were safe in their beds that night, sleeping like babies... while the Notes sys-admins were, I'm sure, pulling all-nighters trying to get it all back together.

A little long-winded, I know, but my point is simple. None of this would have happened if the rogue feed process had had some paranoia designed in from the start.

If your system is non-trivial, be paranoid. Be very paranoid.

[1] This information is somewhat dated, so don't expect you can get your Lotus Certification after reading this junk. In fact, don't even try to use Notes based upon anything I've ever written or said. Please use professionals like Pete for all things Lotus.
