Monday, 30 July 2012

Cost-control and Amortization in Software Engineering: The importance of the source repository.

This discussion is not finished, but it is getting late, so I am putting it out regardless. Please excuse the jumps, the argument is put together like a tree, working down from the leaves, through the branches to the trunk. I will try to re-work it later to make it more linear and easier to read.


--


This is a discussion about controlling costs in Software Engineering. It is also a discussion about communication, bureaucracy, filing systems and human nature, but it starts with the debate about how best to organize one's source code repository; a debate dominated by the ascendance of the Distributed Version Control System (DVCS):

In the Distributed-vs-Centralized VCS debate I am generally agnostic, perhaps inclined to view DVCS systems slightly more favorably than their centralized counterparts.

I have used Svn and Hg and am happy with both. For a distributed team, working on a large code-base over low-bandwidth connections, Hg or Git have obvious advantages.

Many DVCS activists strongly advocate a fine-grained division of the code-base into multiple per-project repositories. In the context of a global, distributed development team, this approach is highly efficient as it allows developers to work independently with minimal need for synchronization or communication.

This is particularly striking when we remember just how difficult and time-consuming (i.e. hugely expensive) synchronization and communication can be if developers are working in different time-zones or only working part-time on a project.

However, this property is less relevant to a traditional, co-located and tightly integrated development team. In fact, I intend to demonstrate that splitting the repository in this situation has serious implications that need to be considered.

Before I do that, however, I need to describe a number of considerations that motivate and support my argument.

Amortization as the only effective mechanism for controlling development costs.

Firstly, developing and maintaining software is very expensive. The more complexity, the more cost. For a given problem, there is a limit to the amount of complexity that we can eliminate. There is also a limit to how much we can reduce developer salaries or outsource work before the complexity and cost imposed by the consequent deterioration in understanding and communication outweigh the savings.

The only other meaningful mechanism that we have to control costs is careful, meticulous amortization, through reuse across product lines, products & bespoke projects. I believe that reuse is fiendishly difficult to achieve, critically important, and requires a fundamental shift in thinking to tackle effectively. The difficulty of achieving reuse is supported by historical evidence: reuse is as rare as hen's teeth. Its importance, I hope, is self-evident, and the fundamental shift in thinking is required because we are working against human nature, against historical precedent, and against some unfortunate physics.

Much of this argument is a discussion about how to overcome these problems and facilitate the amortization of costs at a variety of different levels, and how a single, monolithic repository (or filing system) can support our efforts.


Reuse is difficult partly because of what software development is.

More than anything else, software development is the process of learning about a problem, exploring different solutions, and applying that learning to the development of a machine that embodies your understanding of the problem and its solution; software both defines a machine and describes a model of our understanding of the problem domain. The development of a piece of software is primarily a personal intellectual exercise undertaken by the individual developer, and only incidentally a group exercise undertaken by the team.

Reuse, by one person, of a software component written by another must, at some level, involve a transfer of understanding.


Communications bandwidth is insufficient.

The bandwidth offered by our limited senses (sight, hearing, smell, etc.) is insignificant when held up against the tremendous expressive power of our imagination; the ability of our brain to bring together distant memories with present facts to build a sophisticated and detailed mental model. The bandwidth offered by our channels of communication is more paltry still; even the most powerful means of communication at our disposal, carried out face-to-face, is pathetic in comparison; writing and documentation, more contemptible yet (although still necessary).

Communicating the understanding built up in the process of doing development work is very difficult because the means that we have at our disposal are totally inadequate.

Reuse, by one person, of a software component written previously by the same person is orders of magnitude easier to achieve than reuse of a software component written by another. Facilitating reuse between individuals will require us to bolster both the available bandwidth and the available transfer time by all means and mechanisms possible. We will need to consider the developer's every waking moment as a potential communications opportunity, and every possible thing that is seen or touched as a potential communications channel.



Re-Think the approach.

One solution to the problem is to side-step communication. A developer is already an expert in, and thus tightly bound to, the software components that he has written; a simple mechanism for reuse is to redeploy the original developer, along with his library of reusable components, on different projects.

As with the individual, so with the team. A team of individuals that has spent a long time together has a body of shared organizational knowledge; they have had opportunities to share experiences, have evolved a common vocabulary, culture and approach. Collectively, they are also tightly bound to the software components that they have written. A team, expert in a particular area, is a strategic asset to an organization, and the deployment of that asset also needs to be considered strategically, in some detail, by the executive.

(Make people's speciality part of their identity.)

Strategic planning needs to think in terms of capabilities and reuse of assets. The communication of the nature of those capabilities and assets needs to be a deliberate, planned activity.




Exploit pervasive communication mechanisms.

We need to think beyond the morning stand-up meeting; our current "managed" channels of communication are transient; but our eyes and ears are less fleetingly open. We need to take advantage of pervasive, persistent out-of-band communication mechanisms. Not intranets and wiki pages that are seldom read, nor emails and meetings that arrive when the recipient is not maximally receptive. We need communications channels that are in the background, not intrusive, yet always on. Do not place the message into the environment, make the environment the message.

Make the office layout communicate reuse. Make the way the company is structured communicate reuse, make where people sit communicate reuse, and above all else, make the way that your files are organized communicate reuse. 




The repository is a great pervasive communication mechanism.

As developers, we need to use the source repository frequently. The structure of the repository dictates the structure of our work, and defines how components are to be reused.

The structure of the repository and the structure of the organization should be, by design, tightly bound together, as they are so often by accident. If the two are in harmony, the repository then becomes much more useful. It provides a map that we can use to navigate by. It will help us to find other software components, and, if components and people are tightly bound together, other people also. The repository is more than a place to store code; it is a place to store documentation and organizational knowledge, and to synchronize it amongst the members of the organization. Indeed, it is the place where we define the structure and the nature of the organization. It is the source not just of binaries, but of all organizational behavior.





TODO:

Financial reporting still drives everything - it needs to be in harmony also.

Trust is required. (maybe not)

Not the whole story.

Another problem: reuse is difficult because it puts the cart before the horse. How can we decide what component to reuse before we have understood the problem? How can we truly understand the problem before we have done the development work? If we are solving externally defined and uncorrelated problems that we have no influence over, then we could never reuse anything.

Sounds difficult? Lessons from the Spartans on the nature of city walls.

Back to repositories.

Limiting code reuse. Difficult to do "hot" library development.

One of the frequent arguments that get trotted out in the Git/Hg-vs-SVN debate is that of merging.
The merge-conflict argument is bogus: proper organization and DRY prevent merge issues.

10,000 hours and too little time...

The 10,000 hour rule suggests that it takes 10,000 hours to master a discipline. That is over a year running continuously; almost two years at 16 hours per day, or over 3 years at a somewhat more reasonable 60 hours per week.
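The arithmetic is easy to check; a quick back-of-the-envelope sketch:

```python
# Back-of-the-envelope check of the 10,000-hour figures above.
HOURS = 10_000

years_continuous = HOURS / 24 / 365   # running 24 hours a day
years_at_16h = HOURS / 16 / 365       # 16 hours per day, every day
years_at_60h = HOURS / 60 / 52        # 60 hours per week

print(f"continuous: {years_continuous:.2f} years")   # just over a year
print(f"16 h/day:   {years_at_16h:.2f} years")       # almost two years
print(f"60 h/week:  {years_at_60h:.2f} years")       # over three years
```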

With millions of developers working away world-wide, an awful lot changes in technology in 3 years. New languages, new tools, new paradigms and approaches. We must learn not just one discipline, but many.

In this new world, none of us can be masters: all of us must be students. The faster paced the world, the more tools and disciplines to master, the more fleeting the experience and the shallower the knowledge.

Please remember, when you write for others, that you are, invariably, writing for the inexperienced student with too little time on his hands, no matter how worn and weathered the face.

Wednesday, 25 July 2012

The Reality Gap

A widely acknowledged truism in the development of complex systems is that, for any given project, we observe a huge dynamic range in individual developer ability. (Largely determined by level-of-obsession with the problem).

A less frequently noted corollary of this is that a huge dynamic range in team ability is also observed.

Team performance is, of course, driven to a large extent by the ability and combination of constituent members; however, we can also identify tools, techniques, methodologies and modes of organization that can help to optimise a team's performance within these constraints.

As a technologist, and a borderline obsessive-compulsive, I am particularly interested in the intersection between tooling and team organization, and how the tools that we use can help us to impose order and discipline on our approach to problem solving.

I am also a realist, and acknowledge that we cannot all be 10X-ers all of the time. In fact, by definition, most of us will not be 10X-ers most of the time. There is a reality gap between our technical aspirations and the reality of high-probability mediocrity.

So, having taken our medicinal dose of humility; here is an interesting question:

What is the simplest tool (or set of tools) that one could use to reduce the risk of poor performance in a development team?



--

[Aside]

This question is motivated in part by the too-seldom-made observation that success is more often achieved by getting the management and engineering basics right than by resting all hopes on a single piece of sophisticated key IP.

On the other hand, it can be argued (http://williamtpayne.blogspot.com/2012/05/complexity-it-had-better-be-worth-it_03.html) that the opposite sentiment applies; that a lot of risk is not controllable, and that the only variable that can be optimized is the potential payoff.


The key parameter here is the degree to which risk may be controlled.

Friday, 20 July 2012

Quantitative Investing in Startups


Considering the possibility of using quantitative techniques to invest in startups.

Quantitative investing is really quite hard, even when trying to analyse equities with decades of financial reports available. The world is complex and always changing, with no guarantee that historical conditions will prevail. Furthermore, the data that you have is generally extremely sparse relative to the dimensionality of the problem, and is full of wrinkles that must be ironed out to normalize it for comparison (retrospective corrections, stock splits and other corporate actions, etc.).

The natural (and only) solution is to approach the problem with really strong priors, in the form of a set of principled, theory-driven models of the fundamentals, backed by good engineering and thorough data management.

Because the problem domain itself is so complex, and the data so sparse, the models themselves have to be simple, which means that most of the opportunities for innovation are to be found in the search for new and previously underexploited data sources. This is particularly true when looking at early-stage startups, as financial history is either absent or not particularly predictive.
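To illustrate what "simple model, strong priors" can look like in practice, here is a minimal sketch of one-dimensional ridge regression, in which the penalty term shrinks the fitted coefficient toward a prior of zero; the data and penalty are invented purely for illustration:

```python
def ridge_slope(xs, ys, penalty):
    """Fit y ~ beta * x, shrinking beta toward 0 as the penalty grows.

    With sparse, noisy data, the penalty acts as a strong prior that
    stops the model from chasing wrinkles in the sample.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

# Invented toy data: a weak relationship observed at only five points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.3]

unregularized = ridge_slope(xs, ys, penalty=0.0)
shrunk = ridge_slope(xs, ys, penalty=50.0)

print(unregularized)  # close to 1.0
print(shrunk)         # pulled toward the prior of zero
```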
Perhaps fortunately, the current state of the art in data exploitation is really quite poor, meaning that many opportunities exist to improve upon it.

So, what opportunities can we identify?

Well, organizations are composed of people. Different organizations have different personalities and different cultures; sometimes the people in those organizations gel together and turn into a great and highly productive team, and sometimes they do not.

If you were able to develop a really good understanding of how people work together in teams, and how different personality types, personal circumstances, technical skills and work environments come together, you could build what could be a pretty strong factor based on staff surveys, psychometric profiles and whatever other behavioral data you can lay your hands on.

See also:  http://alistair.cockburn.us/Characterizing+people+as+non-linear%2c+first-order+components+in+software+development
(You could also use the same models to build a secondary business offering personal and organizational coaching…)

The cost of obtaining this data would be quite high, unfortunately, but there are other factors that could be attractive based simply on the ease with which large quantities of data may be collected.

For example, a statistical analysis of source code repositories and check-in histories might well yield insights into the ability of the organisation to respond to changing conditions, and to rapidly innovate.
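A sketch of the simplest kind of analysis one might run over a check-in history; the timestamps are invented, and a real implementation would extract them from the repository log (e.g. git log or svn log):

```python
from datetime import datetime

# Hypothetical commit timestamps, as might be extracted from a log.
commits = [
    datetime(2012, 7, 2, 9, 30),
    datetime(2012, 7, 2, 14, 5),
    datetime(2012, 7, 3, 10, 0),
    datetime(2012, 7, 6, 16, 45),
    datetime(2012, 7, 9, 11, 20),
]

def mean_gap_hours(stamps):
    """Mean time between consecutive commits, in hours."""
    gaps = [(b - a).total_seconds() / 3600
            for a, b in zip(stamps, stamps[1:])]
    return sum(gaps) / len(gaps)

# A short, stable cadence might indicate a team that integrates often
# and can respond quickly; long, erratic gaps might suggest the reverse.
print(f"mean gap: {mean_gap_hours(commits):.1f} hours")
```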

Monday, 16 July 2012

Libor, Malfeasance, Data and Complexity


Thomas Redman wrote about the Libor scandal; arguing that improvements to data quality will help to reduce malfeasance.  

http://blogs.hbr.org/cs/2012/07/libors_real_scandal_bad_data.html?cm_sp=blog_flyout-_-cs-_-libors_real_scandal_bad_data

I agree with a lot of what he says, but I believe that good data, on its own, is not enough. We also need visualisation and data exploration tools to understand the data, simplicity to communicate that understanding, and an educated and intelligent workforce to comprehend the implications.

So, here is my response, emphasizing simplicity partly for rhetorical effect, but mostly because it is the most difficult thing to achieve:

A safe, crisis-free future requires both good data and simplicity.

We can build a world so complex and hidden that there is no obvious fraud, or we can build one so simple and transparent that there is obviously no fraud.

The former is a recipe for catastrophe, but the latter requires considerably more effort.
Obtaining trustworthy data is hard enough, but building simplicity is subtle, and difficult. There is a degree of complexity which is intrinsic to every problem, and which cannot be eliminated, so beware of false simplicity that merely pushes complexity somewhere else. Non-essential complexity may be banished, but essential complexity must be consumed.

Scholarship as part of Agility

I found another John Boyd fan; increasingly relevant in this age of information overload:

http://blogs.hbr.org/cs/2012/07/act_fast_not_first.html

What this article fails to get across, however, is the emphasis that the OODA methodology places on understanding one's opponent, of getting inside his head so that you can, by your actions, disrupt his thinking.

It is unusual, in civilian life, to have an opponent that you are trying to outmaneuver, so Boyd's model does not apply directly to most of the situations that we encounter in our professional lives, but some lessons can still be drawn:
  • We live in an unpredictable world.
  • Sometimes we need to change direction because the world changes around us.
  • When this happens, we need to do it quickly.
  • To react quickly, we need to be prepared.
  • Part of being prepared is having a deep understanding and insight into the problem domain.
To be effective, warriors must also be scholars; and this applies equally to those of us in less violent occupations. To deal with a confusing, ever-changing world, we must be scholars, striving to understand the factors that drive those changes.

Friday, 13 July 2012

Elites and Meritocracy

I have just finished reading an interesting op-ed piece in the NYTimes on modern elites; arguing that meritocracies tend to develop corrupt, oligarchic tendencies.

http://www.nytimes.com/2012/07/13/opinion/brooks-why-our-elites-stink.html?_r=1&hp

A couple of thoughts speedily spring to mind.

Firstly, we have only limited ability to actually measure merit. As I have observed before, performance depends on a wide range of factors, most of them to do with situation and context, only a few to do with ability, attitude and aptitude.

Secondly, self-selection plays a strong role; as you get near to the top of an institution, certain personality types start to become more common, and man-management becomes harder. Self-serving decision making is by no means ubiquitous, but it is certainly more likely, and it becomes harder to trust the information being reported and the decisions being made.

This issue is really just a corollary of the Peter Principle writ large, and avoiding it is a non-trivial problem, but there are a couple of wrinkles worth exploring.



Different types of institution attract different personalities, so the flavor of the problem is different in different institutions, but it is particularly prevalent in institutions with a high public profile.

Within any given organization, the incidence and seriousness of the problem will tend to increase over time. This is partly a product of self-selecting individuals rising through the ranks, and partly a product of the spread of their ideas and culture. It is very hard to create a good, trustworthy culture, and very easy for a good culture to become cynical, jaded and self-serving. It might be worthwhile drawing the analogy between these memes and a pathogen, and thinking about the solution in terms of disease-control techniques.


Given that the incidence of the problem in an organization increases over the lifespan of the organization; increasing churn in the ecosystem, as new organizations and institutions replace old ones, should have the net effect of reducing these problems in the system as a whole.

This is another argument for an economy composed of many small, short-lived organisations and institutions, rather than a small number of large, long-lived organizations, as the damage done by the (relatively) small number of bad individuals is limited, and their malign influence quarantined and contained within their host organization.

Wednesday, 11 July 2012

An organizational model for an SME engaged in the development of complex systems

A lot of software engineering is really about organization. This is particularly important if you are working with a complex system. The better your organizational skills, the easier it is to learn about the system, and the less you need to think about day-to-day.

Most of my experience has been with smallish cross-disciplinary development teams working on commercial projects. Here are some best practices that I have been developing over the past 6-8 years, and continue to develop and improve over time. Please be aware that these practices may not be applicable to the very largest enterprises, nor to noncommercial (open-source) or academic development.

First of all, it helps if the business thinks about how it organizes its filing system. Just because it is on a computer and searchable does not detract from the importance of a well laid-out and well thought-through document store. Putting documents in a document management system like SharePoint, or even using Google Docs or some such, is, on its own, simply not enough. It is not just about having a document management system; it is about how that document management system is organized.

The structure of your filing system really is central to how your business is organized and, indeed, many other things. (More on that later)

Now, I know what you are saying: "How incredibly dull", and indeed it is, but that does not mean that it is not important. In fact, we will find that if we put a little bit of up-front effort into designing and standardizing the business' filing system, we will be able to reap enormous benefits from automation later on.

Now, most filing systems are either intrinsically hierarchical, or (at the very least) support hierarchical structures. A hierarchy is a conventional, effective way of organizing things. By creating an explicit hierarchy we are creating a common navigational frame-of-reference for the organization. We do sacrifice some of the flexibility offered by flat organizational systems, but we gain much more for the business in the form of more effective communication and coordination. This is because everybody has the same frame of reference; the same mental model of how everything hangs together.


I am aware that this seems antiquated in comparison with the modern trend to "tag & search", but hierarchy still has its place, particularly in creating a common frame of reference.


So that we can persuade people that this stuff is important, and to make it sound like something new and interesting, we need an impressive and official sounding name. Let us try "Organizational Model", and see if that works.

Note that the organizational model is in principle "just" a collection of files and directories on a conventional computer file system, although in practice it may be so large that only parts of it are "active" and instantiated at any one time.

Since we are organizing things in a hierarchy, we need to decide what distinctions are drawn to the top of the hierarchy, and which ones are pushed down to the bottom of the hierarchy. Note that the obvious approach, that of mirroring the structure of the business in the structure of the filing system, is not necessarily the best idea, particularly since one of our goals is to facilitate interdisciplinary communication and avoid the formation of informational "silos".

We take strong cues from common software engineering practices over the years, since automation will be a big part of what we hope to gain from a disciplined approach to organization. In particular, we are influenced by common ways of working with Subversion (the version control system).

So, one distinction that we will make is between input and output: the source files that contain design information, and the destination files that contain completed reports, web-pages, executable programs, generated documentation and so on. We use the terms "src" & "dst" to denote source and destination. At this level, we also want to consider the third-party tools, libraries and other resources upon which our systems and processes depend. We use the term "env" to denote the environment upon which our systems depend.

Another distinction that we will make is between current work being carried out on a day-to-day basis and historical archives. We use the term "daily" to denote current work (similar to how the term "trunk" is used in Subversion, but more forceful in suggesting how it is to be used), as well as "weekly", "monthly" and "annual" to denote archives taken at those intervals. We will also want to keep records of important moments - like product releases - as well as what is currently "in-production", for situations where that concept makes sense.

I prefer to place version-related distinctions at a higher level in the hierarchy than source-vs-destination distinctions, simply because I believe that more people make more switches between source and destination than between one version and another.


So, as a result, the most significant division is into what Subversion would call branches:

  • daily
  • archive
    • weekly
      • YYMMDD_foo
    • monthly
      • YYMMDD_foo
    • annual
      • YYMMDD_foo
  • release
    • stage - pre-release staging directory.
    • prod - current production system.
    • rollback - previous production system for quick emergency rollbacks.
    • YYMMDD_productA_v1.2 - Externally deployed or separately configuration-managed systems 
    • YYMMDD_productB_v3.4 - Externally deployed or separately configuration-managed systems
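The branch-level layout above is easy to stamp out with a small script; this is just a sketch (the YYMMDD_foo archive names are, of course, placeholders to be generated at archive time, and the /tmp root is only for demonstration):

```python
import os

# Top-level 'branches' of the organizational model, as described above.
SKELETON = [
    "daily",
    "archive/weekly",
    "archive/monthly",
    "archive/annual",
    "release/stage",     # pre-release staging
    "release/prod",      # current production system
    "release/rollback",  # previous production system
]

def create_skeleton(root):
    """Create the conventional branch-level directory layout."""
    for path in SKELETON:
        os.makedirs(os.path.join(root, path), exist_ok=True)

create_skeleton("/tmp/orgmodel_demo")
print(sorted(os.listdir("/tmp/orgmodel_demo")))  # → ['archive', 'daily', 'release']
```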

The next most significant division is into source vs derived documents:

src:-
  • Human-generated source documents.
dst:-
  • Automatically generated binaries, documentation, test results, performance, financial and management reports (dst/bin; dst/doc; dst/doc/performance, etc.).
env:-
  • Installed third-party dependencies (Python virtualenv etc...)
  • This could also include images of clean VMs to be picked up by the continuous integration process, or a reference to which EC2 images to use in deployment & testing.

The next most significant division is into units chosen to help manage development costs:

bespoke:-

  • One-off bespoke work for customers.
  • Cost not amortized.
  • Cost borne entirely by customer.
  • Ideally contains configuration information rather than software.

library:-

  • Re-usable soft components.
  • Cost amortized across multiple products, systems & projects.

systems:-

  • Re-usable components with a hardware component.
  • Ongoing costs amortized across multiple products, systems & projects.
  • Ideally contains configuration information rather than software.
  • Includes management and financial reporting systems 

products:-

  • Non-reusable components.
  • Cost amortized over multiple sales.

research:-

  • Activities not directly related to product or internal system development.
  • Cost borne by enterprise.

sandbox:-

  • Test activities of little or no consequence.
  • Costs not tracked.

thirdparty:-

  • Links to third-party source-level dependencies.

The final division is into activities, to help identify the nature of the documents contained in that particular folder. This is a bit unusual in that it mixes software development with other business activities. The idea is that we can take advantage of automation (scripts etc...) to pull together technical reports on spend, ROI and so on in an integrated manner.

spec:-

  • Specification documents

doc:-

  • Technical documentation

test:-

  • Tests

config:-

  • Configuration files

sales:-

  • Sales estimates, leads & marketing collateral

support:-

  • Customer Feedback, support tickets etc...

financials:-

  • Cost estimates, budgets etc...
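Putting the levels together, the conventional location of any document can be composed mechanically. A sketch, with the project name "widget" invented for illustration:

```python
import posixpath

# The four levels of the hierarchy described above.
BRANCHES   = {"daily", "archive", "release"}
DIVISIONS  = {"src", "dst", "env"}
UNITS      = {"bespoke", "library", "systems", "products",
              "research", "sandbox", "thirdparty"}
ACTIVITIES = {"spec", "doc", "test", "config",
              "sales", "support", "financials"}

def doc_path(branch, division, unit, project, activity):
    """Compose the conventional path for a document in the model."""
    assert branch in BRANCHES and division in DIVISIONS
    assert unit in UNITS and activity in ACTIVITIES
    return posixpath.join(branch, division, unit, project, activity)

print(doc_path("daily", "src", "products", "widget", "spec"))
# → daily/src/products/widget/spec
```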


Because the organizational model is fixed, we can automate lots of things, like financial & management reporting; perhaps "business-ops" as an equivalent to devops?
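For example, because cost documents always live in a folder named "financials", a report generator needs no per-project configuration to find them. A minimal sketch, with the example tree under /tmp invented for demonstration:

```python
import os

def gather_financials(root):
    """Collect every file under any 'financials' folder in the tree.

    Because the organizational model fixes where cost documents live,
    no per-project configuration is needed to find them.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == "financials":
            found.extend(os.path.join(dirpath, f) for f in filenames)
    return sorted(found)

# Build a tiny example tree with two hypothetical projects.
root = "/tmp/orgmodel_reports"
for project in ("products/widget", "library/parsers"):
    d = os.path.join(root, "daily", "src", project, "financials")
    os.makedirs(d, exist_ok=True)
    open(os.path.join(d, "budget.txt"), "w").close()

print(len(gather_financials(root)))  # → 2
```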

Security and access control is managed through the use of Sub-Repositories that are embedded below this level of the organisation.

The organizational model seeks to accomplish its goals using out-of-band communication & organizational structure rather than management fiat. People tend to self-moderate & change to fit in with what is being done around them rather than seeking to make reasoned arguments for change.

The first goal is to facilitate automation by creating a conventional location for documents to be stored.

The second goal is to help manage the cost of systems development by explicitly organizing documents by how costs are amortized.

The third goal is to encourage "DRY" (Don't Repeat Yourself) systems design by encouraging reuse and facilitating development automation techniques (like code generation).

Monday, 9 July 2012

Honesty.

This post (http://learntoduck.net/the-curse-of-bullshit) reminded me of some thoughts that occurred to me last year.


We have this set of expectations, this mental image of how we are supposed to behave; to get up at 5 every morning, be in work before 8; work through 'till 8 or 9 in the evening, all the time cheerful, aggressive, the day punctuated by the regular, next! next! stand 'em up! knock 'em down; solving a new problem every 20 minutes, never making a mistake, always presenting solutions, innovating, driving the project forwards.

Sometimes we can manage it; circumstances come together, the stars align, and everything works out fine. When it happens, great! Work becomes pleasure, and we slip into the flow-state for months at a time.

Alas, this is not and cannot be normality for all of us, all of the time.

The problem is this: we cannot bring ourselves to admit it. We demand of ourselves, and of other people, the image of perfection, which means that we hide behind the lie; because we must maintain the image of performing perfectly all of the time, we lose the ability to admit when we are not performing, and to deal with it effectively.


This is what humility is all about. It is not really a moral thing, it is deeply pragmatic. It is about giving yourself the freedom to analyse and understand your problems, and to do better.


Let us all be a bit more humble, let us admit, publicly, our flaws, and give each other the room to become better at what we do.

After all, if we demand perfection from ourselves and those around us, all we will get in return are lies.

Wednesday, 27 June 2012

Apple and strategic capability development

I just finished reading Someone is coming to eat you by Rands in Repose.

It is an interesting analysis of Apple's strategic stance. The comment about Tim Cook and the Apple operations team resonates particularly strongly with me.

Perhaps I am reading too much into it, but here I imagine that I can see an organization that, yes, has a legendary focus on product design, but also has the ability and the intelligence to stand back and design an organizational capability that acts as a strategic competitive differentiator.

I think that there are two complementary strands at work here. The first is simplicity. The organization acknowledges that simplicity is hard, and works hard to achieve it, not only in product design but in number of products and (presumably) also in organizational structure.

Having achieved that level of simplicity, the organization can then stand back from the coal face, and engineer itself into a better competitive condition by creating one, or possibly more, world-class capabilities as organizational differentiators.

In the article, the operations team is picked out as an example. To me, this has a little of the flavor of the discussion around "product line" engineering, but it feels much more fundamental, bigger in scope and goes far further into the heart of the company.

Initially prompted by Conway's law many years ago, I am now becoming increasingly convinced that we need to think of the structure, organization and culture of the enterprise as the foundation upon which the design of its products is laid, and that a single vision needs to shine through all of these aspects, from management and financial reporting through organizational culture, to build a set of strategic capabilities that will set the enterprise apart.

Behavioral aspects of Software Engineering

I recently started reading "Thinking, Fast and Slow" by Daniel Kahneman. This book, aside from being awesome, describes a lucid mechanism for analyzing human intellectual mechanisms.

Although I have only digested the first couple of chapters so far, I feel compelled to write about the parts of Kahneman's thesis that resonate with me the most.

His basic premise is that we can better understand human thinking if we decompose our abilities into two categories. "System 1", the "fast" system, corresponds loosely to "gut instinct", whereas "System 2", the "slow" system, corresponds loosely to "concentrated reasoning".

Immediately, I have a mental image of the fast "System 1" as a highly parallel, wide-but-deep, horizontally-scaled collection of special-purpose systems; a classic winner-takes-all architecture.

In contrast, the slow system seems rather less easily understood; I hesitate to de-emphasize parallelism, because virtually everything in biological neural systems is highly parallel; but it seems (intuitively) to have a fundamentally iterative aspect; involving some kind of interplay between working memory and decisioning/pattern recognition systems.


The fast System 1 operates more often than we think it does, consumes much less energy, but has a tendency towards bias and an over-simplification of the world. The slow System 2 operates much less frequently than we think it does, consumes much more energy, requires mental effort and concentration, but is able to perform somewhat more complex, sequential/algorithmic tasks than the fast System 1.


Anyway, my excitement does not derive from this indulgence of my mechanomorphistic instincts, but rather from the application of this analysis to my favorite topic: Software development and developer productivity. 


My basic thesis is that highly skilled, highly productive developers make more extensive use of the fast System 1, while less productive developers make more extensive use of the slow System 2.


The degree to which System 1 is preferred is a function both of experience and of the tools that are available to the developer. When faced with a new task or when one finds oneself in a new environment, frequently one has to concentrate very hard to get anything done; performing novel tasks requires a high level of slow System 2 involvement. As one acclimatizes oneself to the new problem or environment, one builds up "muscle memory", or the cognitive equivalent, and System 1 becomes gradually trained to take over more and more of the task.


To reiterate a point from one of my previous posts; software engineering is fundamentally a learning process. One comes across new problem domains, languages, tools, datasets, etc... all the time. A really good set of developer tools will revolve around the job of supercharging our ability to learn; to absorb new systems; new libraries, new concepts, new mathematics, to understand legacy code, to see patterns in data etc...


Personally, I have a very visual brain; I have always enjoyed drawing and painting, and found pleasure in the intricacies of models and 3D structures. I always studied for exams by drawing diagrams, because I could remember pictures far more easily than words. For me, the ideal developer tools will take the semantic content of the subject under study and translate it into rich, intricate visual images.

Not everybody has such a focus on visual information, (although many do), so visualisation is only one possible approach out of many.

The overarching motivation behind tools like these is, in my mind, to translate concepts into a form that enables the developer to engage his instincts, his fast System 1 thinking, and to learn the necessary domain-specific skills quickly and frictionlessly.

What representations do you like? What conditions enable you to quickly learn to perform at a high level? How do you make yourself better at what you do?

Sunday, 24 June 2012

OOP Product vs Product-Line modeling.

In response to:

http://www.jmolly.com/2012/06/24/functional-and-object-oriented-programming.html

Everything that jmolly says is sincere, and rings true enough, but decoupling data and functional behavior is only part of it. Indeed, OOP was designed specifically to couple data and behavior together, so some people (at least) think that this is a good thing.

In my mind, the difficulty lies not with OOP per se, but in the thought-patterns and practices that have grown up around modern OOP programming, and the mental confusion that we have between the construction of models, the design of algorithms, and the creation of machines to help solve a particular problem. To a large extent, we lack the intellectual tools required to understand the problem; or, at least, the use and understanding of those tools is not widespread.

Personally, I suspect that there is a significant difference between product-centric thinking (design the product, ignore all else), and product-line-centric (or capability-centric) thinking (design the machinery that will make it trivially easy to design the product). This, for me, is a more important distinction than OOP vs FP, or any other dichotomy that you might care to mention.

So in focussing our OOP modeling efforts on the product itself, it becomes easy to fail to model potential variations in the product. To compensate, we attempt to use "best practices" and "design patterns" to ensure that we produce software that is flexible and malleable, but because we do so without any concrete reference point, we end up lost, making software that is too abstract and too complex.

We need to take a step back, and realize that software development is not abstract. It is concrete and is embedded in a social, political and business environment. It only appears to be abstract because there is so little communication between the software itself and the other stakeholders in the environment.

We do not, rightly, start our engineering effort by writing code. We start it by engineering the social conditions and the organizational structures first.

Once this is done, and the goals and objectives of the organization are well aligned with those of the development effort, writing the software becomes perhaps more straightforward.

See also Conway's Law.

I am not suggesting that I am right here, mind you: this is just the way that I feel; my gut instinct ... so perhaps some Socratic questioning might be in order here:

Let me start.

What problem, as software developers, are we actually trying to solve?

What I learned from Prolog

Prolog gives me headaches in places that other programming languages cannot reach, and I would not for an instant want to try to use it to create any significant sort of application; but it does teach a very fundamental and very important lesson.

Programs should, as much as is practicable, try to function as executable specifications, and to state their purpose in declarative form.

To achieve this, we can write somewhat general purpose machines, the parameters for which serve as the behavior specification.

In Prolog, the inference engine is the general purpose machine, the rule database is the executable specification.

For the majority of real-world projects that I have encountered, it has been possible to write libraries that together constitute a set of (somewhat less) general purpose machines; the configuration for which, in declarative form, serves as the "executable" specification of the problem.

All of my work therefore tends to be divided into two parts: A set of general purpose libraries that provide some configurable functionality, and a set of products that combine and configure those libraries, doing double duty as specification-documentation and top-level executable.
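A minimal sketch of this two-part division, in Python. All of the names here are hypothetical; the point is only the shape: a tiny forward-chaining inference engine stands in for the general-purpose machine, and a rule database does double duty as the declarative specification.

```python
def run_rules(rules, facts):
    """The general-purpose machine: a tiny forward-chaining inference
    engine. Each rule is a (premises, conclusion) pair; any rule whose
    premises are all established facts contributes its conclusion."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)
                changed = True
    return facts


# The declarative part: a rule database that is also a readable
# specification of the program's behavior.
RULES = [
    (("has_feathers", "lays_eggs"), "is_bird"),
    (("is_bird", "can_fly"), "migrates"),
]

print(run_rules(RULES, {"has_feathers", "lays_eggs", "can_fly"}))
```

The engine never changes from product to product; only the rule database does, which is exactly the amortization of cost across products that I am arguing for.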

From Prolog comes the feeling that the declarative form is more suitable than the imperative form for the specification and communication of program function; and therefore the imperative parts of the program should be subsumed under a declarative layer.

Of course, none of this is new. Also, this notion does not by itself guarantee success, and must be tempered by simplicity and common sense.

Towards Semantic Code Search

In this discussion I will use the term "unit" to generalize over functions, classes, methods, procedures, macros and so on.

As I sit and write code, I would like to have a background process quietly searching open source repositories for units that are semantically similar to the ones that I am currently engaged in creating and/or modifying; these can then be presented as options for re-use; to limit the re-invention of the wheel, or to help identify and give advance warning of potential pitfalls and/or bugs.

To achieve this, we need to create some sort of metric space wherein similar units sit close together, and dissimilar units sit far apart. I would like this similarity metric to take more than two values, so that we do not limit ourselves to binary matching, but can tune the sensitivity of the algorithm. This approach is suitable for exploratory work, because it gives us a tool that we can use to build an intuitive understanding of the problem.

The algorithm can follow the prototypical pattern recognition architecture: A collection of 2-6 feature-extraction algorithms, each of which extracts a structural feature of the code under search. These structural features shall be designed so that their outputs are invariant under certain non-significant transformations of the source unit. (E.g. arbitrary renaming of variables, non-significant re-ordering of operations etc..).


These feature extraction algorithms could equally correctly be called normalisation algorithms.


Together, the outputs of the feature-extraction algorithms will define a point in some feature space. This point will be represented either by a numeric vector (if all the feature-extraction algorithms have scalar-numeric outputs), or will be something more complex, in the highly probable event that one or more of the feature extraction algorithms produces non scalar-numeric output (e.g. a tree structure).


Once we can construct such feature "vectors"/structures, we need a metric that we can use to measure similarity/dissimilarity between pairs of such structures. If all the features end up being scalar-numeric, and the resulting point in feature space is a numeric vector, then a wide range of possible metrics are immediately available, and something very simple will probably do the trick. If, on the other hand, the features end up being partially or wholly non-numeric, then the computation of the metric may end up being more involved.
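To make the non-numeric case concrete: if a feature extractor emits a string per unit, plain edit distance already gives a workable metric. A sketch, assuming nothing beyond the standard library:

```python
def edit_distance(a, b):
    """Levenshtein distance between two feature strings, computed with
    the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Structurally similar signatures score low; dissimilar ones score high.
print(edit_distance("{;;}{;}", "{;;}{;;}"))  # → 1
```

The distance is graded rather than binary, which is exactly the tunable-sensitivity property asked for above.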


If this all sounds too complex, then perhaps an example of a possible (non scalar-numeric) feature extraction algorithm will bring the discussion back down to earth and make it more real: http://c2.com/doc/SignatureSurvey/
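To make it more real still, here is a sketch (hypothetical and untuned) of two such normalisation-style features in Python: a signature-survey string in the spirit of the link above, and a token shape that is invariant under arbitrary renaming of identifiers.

```python
import io
import keyword
import re
import tokenize


def signature(source):
    """Signature-survey-style feature: discard everything except the
    structural punctuation, leaving a terse shape of the unit."""
    return re.sub(r"[^{}();:]", "", source)


def token_shape(source):
    """Token-level feature: canonicalize identifiers in order of first
    appearance, so arbitrary renaming of variables has no effect."""
    names, out = {}, []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append(names.setdefault(tok.string, f"v{len(names)}"))
        elif tok.type in (tokenize.NAME, tokenize.OP):
            out.append(tok.string)
    return out


# Two units that differ only in naming map to the same point in feature space.
assert token_shape("def f(a): return a + 1\n") == \
       token_shape("def g(x): return x + 1\n")
```

Each extractor deliberately throws information away; which information it is safe to throw away is the real design question, as discussed below.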

The number of features should be strictly limited by the number of samples that are available to search. Given repositories like Github and SourceForge, somewhere around 3 features would be appropriate. 

It is worth noting that we are not just looking for ways of identifying functional equivalence in programs. Functional equivalence is an important feature, but similarities in presentation are important also; so for some features, choice of function / class / variable name may not be important, but for others, it may be significant.


What types of normalization might be useful? What features of the source unit should we consider to be significant, and what features should we ignore?


Suggestions in comments gladly received.


Note: There exists some prior literature in this field. I have not yet started to explore it, so the above passages illustrate my uninformed initial thoughts on the problem.
http://www.cs.brown.edu/~spr/research/s6.html
http://adamldavis.com/post/24918802271/a-modest-proposal-for-open-source-project-hashes

Friday, 22 June 2012

How did I get here?


When I was young, I dreamt of becoming a physicist, but I very quickly discovered that I did not have the mathematical literacy required for that particular career choice.


(In retrospect, perhaps I should have just dug my heels in and persisted with it)


Well, if I cannot do the mathematics myself, I thought, perhaps I can program a computer to help me? This thought led, eventually, to my doing an undergraduate degree in Artificial Intelligence.


It was during this course that I began to think of reasoning as a process driven predominantly by knowledge, and about the problem of acquiring the vast amounts of knowledge that would so obviously be required to do any sort of useful reasoning in the real world. 


I was particularly taken by the potential for machine vision systems to help build knowledge bases such as these, and so I developed an enduring interest in machine vision (and statistical pattern recognition more generally).


In my early career, I was fortunate enough to work with scientists studying human perception, which built up my nascent interest in perceptual processes; a filter through which I still perceive many technical problems.


It has become clear, however, that the main barrier standing in the way of developing sophisticated software systems that can reason about the world and help us to understand our universe is the paucity and limited capability of the software development tools that we have at our disposal.


The latter half of my career has therefore largely turned towards improving the software tools available to the academics, scientists and engineers that I have been privileged to work with over the years.

Thursday, 7 June 2012

Credo

I believe that development cost control and cost amortization through software reuse need to be explicitly factored into the organization's management and financial structure, through the use of product-line-centric organizational structures.

I also believe that the structure of the source document repository used by the organization is a quick and easy way to communicate organizational structures and axes of reuse.

This is largely an extension of Conway's law.

Tuesday, 5 June 2012

Pushing complexity

"The simple rests upon the difficult" - Theodore Ayrault Dodge

I have often observed that many software engineering techniques or methods that aim to simplify merely push complexity around, rather than actually resolving anything.


As Fred Brooks has already told us: There is no silver bullet.


For example, I once came across a software development process that mandated the creation of elaborate and detailed specification documents together with an extremely formal and rigid process for translating those specifications into executable source documents (code).


The author of the process proudly claimed that his development method would eliminate all coding errors (implying that the majority of bugs are mere typos in the transliteration of specification to application logic).


To me, this seemed like hubris: it pushes the burden from the software engineer (who in this scheme is reduced to a mere automaton) onto the individual writing the specification, who in the process takes on the role of developer, absent the tools and feedback needed to actually do the job well.


Thus, whilst approaches like this might help bloat costs and fuel the specsmanship games that blight certain (nameless) industrial sectors, they do nothing whatsoever to help developers produce higher quality product with less effort.


We do not make progress by pushing complexity around. We make progress by consuming it and taming it. We need to do the work and tackle the problem to make things happen.


Rather than focus on the silver bullet of simplification, we would be better served by processes, methods and tools that explicitly acknowledge the development feedback loop, and aim to tighten and broaden it through automation.

Code considered harmful

As developers, we often glibly talk of code and coding.

These words do us a tremendous disservice, as they imply that the source documents that we write are in some way encoded, or obfuscated, and so only interpretable by the high priests of the technocracy.

This might give us a moment of egotistical warmth, and provide fuel to our collective superiority complex, but in the long run it does a tremendous amount of harm to the industry.

We would be better served if we insisted (as is indeed the case) that a source document, well written in a high-level programming language, is entirely legible to the intelligent layperson.

Whatever view you take on linguistic relativity; whether you believe that our choice of words actually affects our attitudes or not, I think (even from a purely aesthetic point of view) that the word "code" is as ugly as it is arrogant.

Perhaps, rather than talking about "code", we should talk about source documents, process descriptions, executable specifications, procedures or even just logic.

Let us acknowledge, in the words that we choose, that a central part of our jobs is to craft formal descriptions that are as easily interpretable by the human mind as by the grinding of an automaton.

Monday, 4 June 2012

Delays and Rates


From a recent post on news.ycombinator.com, vertically aligned for ease of comparison, with corresponding rates to better understand the implications:

(Edit: Expanded with numbers from a recent Ars Technica article on SSDs)

Register                                << 1 ns
L1 cache reference (lower bound)         < 1 ns 2,000,000,000 Hz
L1 cache reference (upper bound)           3 ns   333,333,333 Hz
Branch mispredict                          5 ns   200,000,000 Hz
L2 cache reference                         7 ns   142,857,143 Hz
L3 cache reference                        20 ns    50,000,000 Hz
Mutex lock/unlock                         25 ns    40,000,000 Hz
Main memory reference                    100 ns    10,000,000 Hz
Compress 1K bytes with Zippy           3,000 ns       333,333 Hz
Send 2K bytes over 1 Gbps network     20,000 ns        50,000 Hz
Read 1 MB sequentially from memory   250,000 ns         4,000 Hz
Round trip within same datacenter    500,000 ns         2,000 Hz
Disk seek (lower bound)            3,000,000 ns           333 Hz
Disk seek (upper bound)           10,000,000 ns           100 Hz
Read 1 MB sequentially from disk  20,000,000 ns            50 Hz
Send packet CA->Netherlands->CA  150,000,000 ns           < 7 Hz

Surprises and take home lesson(s):

1. Data intensive (I/O bound) systems are REALLY slow compared to the raw CPU grunt that is available.
2. Within-datacenter Network I/O is faster than disk I/O.
3. It makes sense to think about network I/O in the same way as we used to think about the SIMD/AltiVec/CUDA tradeoff. The payoff has to be worth while, because the packaging/transfer operations are expensive.
4. Branch mis-prediction is actually pretty expensive compared to L1 cache. For CPU bound inner-loop code, it makes sense to spend a bit more time trying to avoid branching. 

Here is the table from Ars Technica:

Level                Access time    Typical size
Registers        "instantaneous"    under 1 KB
Level 1 Cache                1-3 ns      64 KB per core
Level 2 Cache               3-10 ns     256 KB per core
Level 3 Cache              10-20 ns    2-20 MB per chip
Main Memory                30-60 ns    4-32 GB per system
Hard Disk   3,000,000-10,000,000 ns
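For completeness, the rate column in the table above is simply the reciprocal of the latency. A few illustrative entries, recomputed:

```python
# Rates are the reciprocal of the latencies: operations per second,
# assuming fully serialized (one-at-a-time) operations.
LATENCIES_NS = {
    "Main memory reference": 100,
    "Round trip within same datacenter": 500_000,
    "Disk seek (upper bound)": 10_000_000,
}

for name, ns in LATENCIES_NS.items():
    hz = 1_000_000_000 / ns
    print(f"{name:34s} {ns:>12,} ns {hz:>14,.0f} Hz")
```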

Friday, 1 June 2012

Recent Developments: Developing Future Development

In response to recent buzz around the idea of instant feedback in development environments:
Anything that tightens the feedback loop will increase velocity, so approaches like the ones espoused by Bret & Chris will definitely have a positive impact, although we may need to develop more advanced visualization techniques to help when the state of the system is not naturally visual. (http://tenaciousc.com/)

I am also convinced that there are a few more steps (in addition to instant feedback) that we need to take as well:

Firstly, the development environment needs to encourage (or enforce) a range of always-on, continuous development automation, including unit-testing, static analysis, linting, documentation generation etc... This should include automated meta-tests, such as mutation-based fuzz testing, so that the unit-test coverage itself is tested. This gives us confidence that no regressions have crept in unnoticed. (To compensate for our inability to pay attention to everything all of the time.)
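To illustrate what a mutation-based meta-test might look like, here is a toy sketch (not any real tool; `mutate_add_to_sub` is a name I have invented): mutate the code under test, then check that the unit tests notice.

```python
import ast


def mutate_add_to_sub(source):
    """One crude mutation operator: turn every '+' into '-'. A real
    mutation-testing tool would generate many mutants, one per operator
    and location, and report any that the test suite fails to kill."""
    class Flip(ast.NodeTransformer):
        def visit_BinOp(self, node):
            self.generic_visit(node)
            if isinstance(node.op, ast.Add):
                node.op = ast.Sub()
            return node
    return ast.unparse(Flip().visit(ast.parse(source)))


original = "def total(a, b):\n    return a + b\n"
mutant = mutate_add_to_sub(original)

# If our unit test still passed against the mutant, our coverage would be
# suspect; here the test "kills" the mutant, as it should.
namespace = {}
exec(mutant, namespace)
assert namespace["total"](2, 3) != 5
```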

Secondly, refactoring tools need to be supported, so that code can be mutated easily and the solution-space explored in a semi-automated manner. (To compensate for the fact that we can only type at a limited speed).

Thirdly, we need to start using pattern recognition to find similarities in the code that people write, so we can be guided to find other people and other software projects so that we can either re-use code if appropriate, or share experiences and lessons learned otherwise. (To compensate for the fact that we know a vanishingly small part of what is knowable).