Friday, February 19, 2021

A Rough Analogy

For me, one little irony of the past year plus is that I had set myself the task of digging into system engineering, especially reliability. I knew going in that systems behave in counterintuitive ways; jumping from the unit level to the system level can surprise you.

There's an awful lot of work to absorb in system engineering. One reference I've found very useful is Nancy Leveson's book Engineering a Safer World. Leveson has several books in this area; Engineering a Safer World delves into specific examples of system-level analysis, and so serves as a set of reminders of the challenges involved.

After Engineering a Safer World and other reading, and yeah, the events of the past 12+ months, I've had time to rough out an analogy for a system failure and its analysis. I thought I'd write it out as a memory exercise. Forgive me if it turns out to be boring.

Ok, suppose that Alice is a day-shift operator for a cogen unit. In particular, Alice runs a gas turbine generator; some fraction of her turbine's daily output goes to her plant, the other fraction is sold to a utility company for supply to the grid.

Now, suppose further that Alice's turbine has the following running profile, learned through years of work and effort: at 8000 rpm, the turbine runs optimally and indefinitely. But Alice's turbine can be run up to approximately 10000 rpm for periods of no more than 2 hours in any 26-hour period.

Basically, over the years, Alice and the engineers have learned that the turbine can operate safely for 2 hours at 10000 rpm, if and only if it is then returned to 8000 rpm for a full 24-hour re-stabilization period.

Anything beyond that 2-hour window means stress fractures can start to appear in the blades. Let's say that up to 4 hours of running at 10000 rpm typically means Alice has to drop to 4000 rpm, run for 24 hours at that level to make sure no cracks have appeared, and only then return to 8000 rpm. And if the turbine hits 10000 rpm for more than 4 hours, it must be spun down completely for a tedious 3-day inspection process.
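Just to pin the profile down, here's a minimal sketch of that operating curve as a decision rule, written in Python. The function name and the wording of the outcomes are mine, purely for illustration; nothing like this appears in Leveson's work or in any real control system.

    # Illustrative only: the analogy's made-up operating curve for Alice's
    # turbine, expressed as "what must follow a run at 10000 rpm".

    def required_action(overspeed_hours):
        """Return the required follow-up for a 10000 rpm run of the given length."""
        if overspeed_hours <= 2:
            return "return to 8000 rpm and hold for a 24-hour re-stabilization period"
        elif overspeed_hours <= 4:
            return "drop to 4000 rpm, hold 24 hours to check for cracks, then back to 8000 rpm"
        else:
            return "full spin-down and a 3-day inspection before restart"

    # Alice's situation later in the story: 3.5 hours at 10000 rpm.
    print(required_action(3.5))   # -> the 4000 rpm, 24-hour path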

Again: this running profile has been determined through the school of hard knocks. Alice's company lost a few turbines in the early days, before they fully worked out the unit level reliability curve.

And Alice has been trained thoroughly in this reliability curve. Written on the day board in her office is this: any request to run above 8000 rpm must be less than 2 hours in duration, followed by a 24-hour cooldown period! Call the supervisor on duty if necessary!

That's Alice's end of the stick. Now suppose that Bob is in sales for Alice's company. Bob gets the daily power requests; Bob has also been trained so that any request for more power than normal can be fulfilled by Alice's turbine only if it's shorter than 2 hours.

Yesterday, as happens occasionally, Bob got a demand request asking for 120% of the turbine's normal power output. Bob checked everything before calling Alice; unfortunately, for reasons that went unrecorded, Bob didn't let Alice know that the duration of the 120% demand was undetermined.

And so Alice threw the switch on the turbine to spin up to 10000 rpm. So far, so good. Only, Alice then went about her daily operating routine; she next checked the 10000 rpm runtime some 3.5 hours after spinning the turbine up.

Following protocol, Alice spun her turbine down to 4000 rpm, locked in the controls with a note for the night shift, and started making her calls.

Ok, that's the basic setup. Here's the question: who's at fault? One could well blame Alice. She's been trained not to run up to 10000 rpm without a specific, known timeframe.

One could also blame Bob. He too has been trained to not answer requests for power above that 2 hour limit.

Actually, from Leveson and the other research in systems engineering, fault and blame turn out to be trick questions. Assigning fault or blame to either Bob or Alice would likely be counterproductive.

To see why, let's blame Bob and then ask what happens. Assuming Bob didn't leave the company, if he got blamed, what's going to happen the next time there's a call for 120% power for 2 hours or less, a perfectly reasonable and fulfillable request?

Bob's never going to sell that available power again. Why should he, after catching the blame for the last time the turbine failed?

Or, similarly, let's blame Alice. I think we can guess what'll happen here: for the rest of her operating career, if the call comes in for More Power!, even for an interval of less than 2 hours, at least while Alice is in the control house there won't be any power to give.

We can go further: add Caryn the engineer, who's been pushing to get automatic overrides put in place that would spin the turbine down safely from 10000 rpm and lock it out at 8000 rpm without manual intervention. And Deon, everyone's boss, who has had to push those overrides below other priorities in the yearly budget process. Then there's Evena in QA/QC, who trained everyone but was told to ease up on sales because they didn't appreciate Evena's yearly reminders of the turbine's operating curve.

Obviously, I could go on. The point being, once you get into a system, especially one complicated by human or other outside factors, failures are almost always overdetermined, meaning there's almost always more than one cause of a failure.

And any attempt to ensure that that failure, or similar ones, cannot recur must take into account that blame and fault, i.e. assigning outsized weight to a single cause where multiple causes and interactions were at work, can and often does produce further faults, often hidden ones, once the system is back up and operating again.

In this simple case, I'd imagine two fixes coming down the pike: Caryn's automatic override system on the turbine, AND for IT to build a dashboard system for the salesforce that would automate some of the request/response processes. Then the salesforce training could shift to simply explaining why the restrictions exist, while both sales and operations now have a hard-coded safety net ensuring that the turbine's reliability curve (in this particular dimension, at least) is protected at multiple independent levels. A rough sketch of the sales-side half of that safety net follows.
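Here's one way that sales-side gate might look, again as a Python sketch. Everything here is my own illustration; the names, thresholds, and behavior are assumptions layered onto the analogy, not a description of any real dashboard.

    # Illustrative only: a sales-side check that refuses any above-normal power
    # request without a known, in-limit duration. Thresholds mirror the
    # analogy's made-up operating curve.

    from typing import Optional

    MAX_OVERSPEED_HOURS = 2.0

    def accept_request(demand_fraction: float, duration_hours: Optional[float]) -> bool:
        """Return True if the request may be forwarded to operations."""
        if demand_fraction <= 1.0:
            return True       # normal load: always acceptable
        if duration_hours is None:
            return False      # above-normal load, unknown duration: reject
        return duration_hours <= MAX_OVERSPEED_HOURS

    # Bob's request from the story: 120% demand, duration undetermined.
    assert accept_request(1.2, None) is False
    assert accept_request(1.2, 1.5) is True

With a gate like that in front of the operator, the 2-hour rule is enforced in the request path itself rather than relying on Bob remembering it under pressure, which is the whole point of the second fix.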

Thus ends the analogy. Is it complete? Enough to illustrate the issues, I think. This one's human-factors heavy; I'll have to think more to come up with one that's more hardware/software dependent. The human social dynamics here seem fairly obvious, but I know there are much more complicated possibilities, so I'll have to see if I can come up with stories along those lines.

Why not just point to Leveson's real-life examples? Because if I've learned a little of Leveson's research, then building a model that captures a little of that work is a useful way of examining whether I actually did learn something other than "go look in Nancy's book and pull out a canned example". So, a long-winded way of saying thanks to Nancy Leveson for wonderful work that I greatly enjoy learning from.

And, if you dear reader find it useful, maybe that means I also learned something well enough to possibly illuminate it for you. Assuming of course that I didn't just stub my toe on my own ignorance...
