Friday, 20 September 2013

Software & Hardware - How complexity and risk define engineering practice and culture.

Striking cultural differences exist between companies that specialise in software development and those that specialise in hardware development & manufacturing.

The key to understanding the divergence of these two cultures may be found in the differing approaches that each takes to risk and failure; driven both by a difference in the expected cost of failure and a difference in the cost associated with predicting failure.

One approach seeks to mitigate the impact of adverse events by emphasising flexibility and agility; the other approach seeks to minimise the chances that adverse events occur at all, by emphasising predictability and control.

In other words, do you design so you can fix your system quickly when it breaks (at the expense of having it break often), or do you design your system so that it very rarely breaks (but when it does, it is more expensive to fix)?

The answer to this question not only depends on how safety-critical the system is, but how complex it is too. The prediction-and-control approach rapidly becomes untenable when you have systems that reach a certain level of complexity -- the cost of accurately predicting when failures are going to occur rapidly becomes larger than the cost of the failure itself. As the complexity of the system under development increases, the activity looks less like development and more like research. The predictability of disciplined engineering falls apart in the face of sufficient complexity. Worse; complexity increases in a combinatorial manner, so we can very easily move from a predictable system to an unpredictable one with the addition of only a small number of innocuous looking components.

Most forms of mechanical engineering emphasise the second (predictive) approach, particularly for safety critical equipment, since the systems are simple (compared to a lot of software) and the costs of failure are high. On the other hand, a lot of software development emphasises the first (agile/reactive) approach, because the costs associated with failures are (normally) a lot less than the costs associated with development.

Of course, a lot of pejorative terms get mixed up in this, like: "sloppy engineering", and "cowboy developers", vs "expensive failures" and "moribund bureaucracy" but really, these approaches are just the result of the same cost/benefit analysis producing different answers given different input conditions.

Problems mainly arise when you use the wrong risk-management approach for the wrong application; or for the wrong *part* of the application. Things can get quite subtle quite quickly, and managers really need to be on top of their game to succeed.

One of the challenges in developing automotive ADAS systems is that a lot of the software is safety critical, and therefore very expensive to write, because of all of the (necessary) bureaucratic support that the OEMs require for traceability and accountability.

Equally, a lot of the advanced functionality for machine vision / Radar / Lidar signal processing is very advanced, and (unfortunately) has a lot of necessary complexity. As a result it is very very costly to develop when using the former approach; yet may be involved in safety-critical functions.

This is not by any means a solved problem, and very much requires detailed management on a case-by-case basis.

Certainly testing infrastructure becomes much more important as the sensor systems that we develop become more complex. (Disclaimer: my area of interest). Indeed, my experience indicates that for sophisticated sensor systems well over 80% of the effort (measured by both in hours of development & size of the code-base) is associated with test infrastructure, and less than 20% with the software that ends up in the vehicle.

Perhaps the word "test" is a misnomer here; since the role of this infrastructure is not so much to do V&V on the completed system, but to help to develop the system's requirements -- to do the "Data Science" and analytics that are needed to understand the operating environment well enough that you can correctly specify the behaviour of the application.