Idle Conjectures in Search of Refutation

Sunday, 24 June 2012

Towards Semantic Code Search

In this discussion I will use the term "unit" to generalize over functions, classes, methods, procedures, macros and so on.

As I sit and write code, I would like to have a background process quietly searching open source repositories for units that are semantically similar to the ones that I am currently engaged in creating and/or modifying; these can then be presented as options for re-use; to limit the re-invention of the wheel, or to help identify and give advance warning of potential pitfalls and/or bugs.

To achieve this, we need to create some sort of metric space wherein similar units sit close together, and dissimilar (functions/classes) far apart. I would like this similarity metric to take > 2 values, so we do not just limit ourselves to binary matching, but can tune the sensitivity of the algorithm. This approach is suitable for exploratory work, because it helps give us a tool that we can use to build an intuitive understanding of the problem.

The algorithm can follow the prototypical pattern recognition architecture: A collection of 2-6 feature-extraction algorithms, each of which extracts a structural feature of the code under search. These structural features shall be designed so that their outputs are invariant under certain non-significant transformations of the source unit. (E.g. arbitrary renaming of variables, non-significant re-ordering of operations etc..).

These feature extraction algorithms could equally correctly be called normalisation algorithms.

Together, the outputs of the feature-extraction algorithms will define a point in some feature space. This point will be represented either by a numeric vector (if all the feature-extraction algorithms have scalar-numeric outputs), or will be something more complex, in the highly probable event that one or more of the feature extraction algorithms produces non scalar-numeric output (e.g. a tree structure).

Once we can construct such feature "vector"/structures, we need a metric that we can use to measure similarity/dissimilarity between pairs of such structures. If all the features end up being scalar-numeric, and the resulting point in feature space is a numeric vector, then a wide range of possible metrics are immediately available, and something very simple will probably do the trick. If, on the other hand, the features end up being partially or wholly non-numeric, then the computation of the metric may end being more involved.

If this all sounds too complex, then perhaps an example of a possible (non scalar-numeric) feature extraction algorithm will bring the discussion back down to earth and make it more real: http://c2.com/doc/SignatureSurvey/

The number of features should be strictly limited by the number of samples that are available to search. Given repositories like Github and SourceForge, somewhere around 3 features would be appropriate.

It is worth noting that we are not just looking for ways of identifying functional equivalence in programs. Functional equivalence is an important feature, but similarities in presentation are important also; so for some features, choice of function / class / variable name may not be important, but for others, it may be significant.

What types of normalization might be useful? What features of the source unit should we consider to be significant, and what features should we ignore?

Suggestions in comments gladly received.

Note: There exists some prior literature in this field. I have not yet started to explore it, so the above passages illustrate my uninformed initial thoughts on the problem.
http://www.cs.brown.edu/~spr/research/s6.html
http://adamldavis.com/post/24918802271/a-modest-proposal-for-open-source-project-hashes

Friday, 22 June 2012

How did I get here?

When I was young, I dreamt of becoming a physicist, but I very quickly discovered that I did not have the mathematical literacy required for that particular career choice.

(In retrospect, perhaps I should have just dug my heels in and persisted with it)

Well, if I cannot do the mathematics myself, I thought, perhaps I can program a computer to help me? This thought lead, eventually, to my doing an undergraduate degree in Artificial Intelligence.

It was during this course that I began to think of reasoning as a process driven predominantly by knowledge, and about the problem of acquiring the vast amounts of knowledge that would so obviously be required to do any sort of useful reasoning in the real world.

I was particularly taken by the potential for machine vision systems to help build knowledge bases such as these, and so I developed an enduring interest in machine vision (and statistical pattern recognition more generally).

In my early career, I was fortunate enough to work with scientists studying human perception, which built up my nascent interest in perceptual processes; a filter through which I still percieve many technical problems.

It has become clear, however, that the main barrier standing in the way of developing sophisticated software systems that can reason about the world and help us to understand our universe is the paucity and limited capability of the software development tools that we have at our disposal.

The latter half of my career has therefore largely turned towards improving the software tools available to the academics, scientists and engineers that I have been priviledged to work with over the years.

Thursday, 7 June 2012

Credo

I believe in the notion that development cost control & cost amortization through software reuse needs to be explicitly factored in to the organization's management and financial structure through the use of product line centric organizational structures.

I also believe that the structure of the source document repository used by the organization is a quick and easy way to communicate organizational structures and axes of reuse.

This is largely an extension of Conway's law.

Tuesday, 5 June 2012

Pushing complexity

"The simple rests upon the difficult" - Theodore Ayrault Dodge

I have often observed that many software engineering techniques or methods that aim to simplify actually just push complexity around rather than actually resolving anything.

As Fred Brooks has already told us: There is no silver bullet.

For example, I once came across a software development process that mandated the creation of elaborate and detailed specification documents together with an extremely formal and rigid process for translating those specifications into executable source documents (code).

The author of the process proudly claimed that his development method would eliminate all coding errors (implying that the majority of bugs are mere typos in the transliteration of specification to application logic).

To me, this seemed like hubris, as it pushes the burden from the software engineer (who in this scheme is reduced to a mere automaton) to the individual writing the specification, who (in the process) takes on the role of developer; (absent the tools and feedback needed to actually do the job well).

Thus, whilst approaches like this might help bloat costs and fuel the specsmanship games that blight certain (nameless) industrial sectors, they do nothing whatsoever to help developers produce higher quality product with less effort.

We do not make progress by pushing complexity around. We make progress by consuming it and taming it. We need to do the work and tackle the problem to make things happen.

Rather than focus on the silver bullet of simplification, we would be better served by processes, methods and tools that explicitly acknowledge the development feedback loop, and aim to tighten and broaden it through automation.

Code considered harmful

As developers, we often glibly talk of code and coding.

These words do us a tremendous disservice, as they imply that the source documents that we write are in some way encoded, or obfuscated, and so only interpretable by the high priests of the technocracy.

This might give us a moment of egotistical warmth, and provide fuel to our collective superiority complex, but in the long run it does a tremendous amount of harm to the industry.

We would be better served if we insisted (as is indeed the case) that a source document, well written in a high-level programming language is entirely legible to the intelligent layperson.

Whatever view you take on linguistic relativity; whether you believe that our choice of words actually affects our attitudes or not, I think (event from a purely aesthetic point of view) that the word "code" is as ugly as it is arrogant.

Perhaps, rather than talking about "code", we should talk about source documents, process descriptions, executable specifications, procedures or even just logic.

Let us acknowledge, in the words that we choose, that a central part of our jobs is to craft formal descriptions that are as easily interpretable by the human mind as by the grinding of an automaton.

Monday, 4 June 2012

Delays and Rates

From a recent post on news.ycombinator.com, vertically aligned for ease of comparison, with corresponding rates to better understand the implications:

(Edit: Expanded with numbers from a recent Ars Technica article on SSDs)

Register    << 1 ns
L1 cache reference (lower bound) < 1 ns 2,000,000,000 Hz
L1 cache reference (upper bound) 3 ns   333,333,333 Hz
Branch mispredict 5 ns 200,000,000 Hz
L2 cache reference 7 ns 142,857,143 Hz
L3 cache reference    20 ns    50,000,000 Hz
Mutex lock/unlock 25 ns 40,000,000 Hz
Main memory reference 100 ns 10,000,000 Hz
Compress 1K bytes with Zippy 3,000 ns 333,333 Hz
Send 2K bytes over 1 Gbps network 20,000 ns 50,000 Hz
Read 1 MB sequentially from memory 250,000 ns 4,000 Hz
Round trip within same datacenter 500,000 ns 2,000 Hz
Disk seek (lower bound)    3,000,000 ns           333 Hz

Disk seek (upper bound) 10,000,000 ns 100 Hz
Read 1 MB sequentially from disk 20,000,000 ns 50 Hz
Send packet CA->Netherlands->CA 150,000,000 ns < 7 Hz

Surprises and take home lesson(s):

1. Data intensive (I/O bound) systems are REALLY slow compared to the raw CPU grunt that is available.
2. Within-datacenter Network I/O is faster than disk I/O.
3. It makes sense to think about network I/O in the same way as we used to think about the SIMD/AltiVec/CUDA tradeoff. The payoff has to be worth while, because the packaging/transfer operations are expensive.
4. Branch mis-prediction is actually pretty expensive compared to L1 cache. For CPU bound inner-loop code, it makes sense to spend a bit more time trying to avoid branching.

Here is the table from Ars Technica:

Level                Access time    Typical size
Registers        "instantaneous"    under 1 KB
Level 1 Cache             1-3 ns    64 KB per core
Level 2 Cache               3-10 ns 256 KB per core
Level 3 Cache              10-20 ns    2-20 MB per chip
Main Memory                30-60 ns    4-32 GB per system
Hard Disk   3,000,000-10,000,000 ns

Friday, 1 June 2012

Recent Developments: Developing Future Development

In response to recent buzz around the idea of instant feedback in development environments:

Anything that tightens the feedback loop will increase velocity, so approaches like the ones espoused by Bret & Chris will definitely have a positive impact, although we may need to develop more advanced visualization techniques to help when the state of the system is not naturally visual. (http://tenaciousc.com/)

I am also convinced that there are a few more steps (in addition to instant feedback) that we need to take as well:

Firstly, the development environment needs to encourage (or enforce) a range of always-on continuous development automation, including unit-testing, static-analysis, linting, documentation generation etc... This should include automated meta-tests such as mutation-based fuzz testing so that the unit-test coverage itself is tested. This helps us to have confidence that we have not missed any regressions creeping in. (To compensate for our inability to pay attention to everything all of the time)

Secondly, refactoring tools need to be supported, so that code can be mutated easily and the solution-space explored in a semi-automated manner. (To compensate for the fact that we can only type at a limited speed).

Thirdly, we need to start using pattern recognition to find similarities in the code that people write, so we can be guided to find other people and other software projects so that we can either re-use code if appropriate, or share experiences and lessons learned otherwise. (To compensate for the fact that we know a vanishingly small part of what is knowable).

Tuesday, 29 May 2012

Artificial Intelligence: Prerequisites.

How do we go about developing "Artificial Intelligence"?

If I recall correctly, Von Neumann once defined Artificial Intelligence very simply as "Advanced Computer Science".

In a different context, Arthur C Clarke said that "Any sufficiently advanced technology is indistinguishable from magic."

Inverting the intended sense of the latter phrase, it seems that, when envisioning what Artificial Intelligence might look like, we naturally seek
conceptually advanced techniques; a brilliant and unified theory of intelligence seemingly being required to produce an intelligent machine.

In other words, we look for the silver bullet, which, as Fred Brooks (again, from a different context) reminds us, does not exist.

Let us be a little more humble then. The development of intelligent machines might well not be so special as to require the development brilliant theoretical underpinnings.

The little I know of biologicial brains leads me to suspect that their function is more readily understood as the aggregate of a few hundred relatively simple processes rather than a small number of stupendously sophisticated ones.

Indeed, to make Artifical Intelligence a reality, we should focus on the prosaic and mundane rather than the exotic.

In my mind, the basic thing that stands in our way is simple software engineering, and the generally rather crude way that we currently go about it.

We need better tools and processes for developing complex software. Not just better languages, but better IDEs, testing frameworks, build servers, static analysis tools, refactoring tools, management tools etc.. etc...

The seemingly exotic, once we are familiar with it, collapses to the banal and the mundane. What is left is just a lot of hard work.

Let us get to it, then.

Friday, 25 May 2012

Open Community development vs Closed Commercial Development

Many popular contemporary tools (DVCS etc...) and workflows have emerged from the open development community.

The open development community has a very different set of requirements from closed, commercial development shops.

Sometimes they just need different tools.

Wednesday, 23 May 2012

Great advice for developers, great advice for life.

What is past is prologue - William Shakespeare.

Learn from the mistakes of others. You can't live long enough to make them all yourself - Eleanor Roosevelt.

Let us be a little humble; let us think that the truth may not perhaps be entirely with us - Jawaharlal Nehru.

(courtesy of the good folk of Sophos)

Thursday, 10 May 2012

What is the role of documentation in an agile team?

The distinguishing feature of an "agile" techniques is it's approach to risk. We have the humility to admit that development is primarily a learning process, and that risk originates from factors that are initially unknown.

Agile techniques seek to reduce development risk by producing a minimum viable product as early as possible, by postponing critical decisions until sufficient information is available, and by learning-by-doing. Storage and dissemination of lessons-learned is a critical part of any non-trivial agile effort.

Documentation is therefore critically important, but because documentation may change significantly and rapidly (as the team's understanding of the problem evolves) the form that the documentation takes must be very different in an agile development organization from a classical development organization.

I cannot stress this enough: Development Automation is THE key enabling set of technologies that enables us to take an agile approach. Documentation must be written in a formal manner, accessible to automation and able to be changed rapidly with minimal effort.

It also must be parsimonious and accessible enough to be disseminated rapidly. If automation is limited, then documentation cannot be lengthy. If automation is sophisticated, then more documentation can exist. Ultimately, the quantity of documentation is limited by the ability of team members to absorb the information rather than the speed with which it can be re-written.

Well written, readable source documents meet these criteria.

Monday, 7 May 2012

The OOP/FP argument, yet again.

I like OOP when I am the one writing the software, because I can think very naturally in terms of modelling the problem domain, but my opinion rapidly turns on it's head when I need to read/debug some OOP software that somebody else has written (When I need to think about control flow and behavior).

Writing good (readable) OOP problems is hard. Debugging them is even harder. Navigating then when armed with just the source code is harder still. (Diagrams please!)

Whilst most individual objects and methods in OOP systems are small and easy to understand in isolation (this is a good thing), I find that the flow of execution and the behavior of the system in the large becomes very difficult and time-consuming to understand. It is as if the fundamental complexity of the problem has been shoveled around from the small scale to the large scale.

To be fair, the same complaint probably applies to FP as much as to OOP; the tension driving this dialectic exists between the Functional Decomposition and the DRY principle on the one hand, and readability and narrative flow on the other. (Or the battle between reader and writer, to put it more colorfully)

Saturday, 5 May 2012

Tools for individual traits

One of my former employers (FIL) was noteworthy for it's culture of introspection. It encouraged staff to discover their own strengths & weaknesses, biases & predilections, and to use that knowledge to work better and to make better decisions. (Other financial institutions encourage similar cultures, to a greater or lesser degree). This was a fantastic and valuable lesson to learn.

So, swallowing a dose of humility, here goes:

I make mistakes all the time. Embarrassingly often.

Most of these mistakes are errors of omission: oversights. The spotlight of consciousness operating in my brain is unusually narrow. This means that I am reasonably good at something if I am focussing on it, but if I am not paying attention (which is most of the time for most things), I have a tendency to miss things in a way that is, well, rather absent minded.

This is not an uncommon tendency. Most people get over it by deploying organizational systems and practices to help them concentrate. Lists, and notes and obsessive habits and the like. I am a software developer. I use automated tests & other forms of development automation.

Without these, I tend to make a lot of mistakes and move very slowly. With them, I can move quickly, be creative & productive, and focus on making new things without worrying (too much) about what I have missed.

Standing back for a moment, we can observe that the usefulness of the tools that we use are really driven by the capabilities and characteristics of the people who use them. I am a bit obsessed by development automation because I rely on it to such a great extent. Other people will find other tools useful to a greater or lesser extent because of their own unique capabilities and weaknesses.

Exponential Growth + Network Effects.

Dr Albert A Bartlett's lecture on Arithmetic, Population and Energy is a really great introduction to the exponential function, and how we generally fail to understand it's implications. In the unlikely event that you have not already seen it, go watch it now, it is worth your time.

So, what things are growing? (Perhaps not exponentially due to limiting factors, but at a dramatic and most likely super-linear pace even so):

The total population of the world.
The percentage of the population who are educated to a certain level.
The percentage of the population connected to the internet.

Another good read is The Mythical Man Month, by Rodney Brooks. In this book, he observes that the number of channels of communication in a group of n individuals is given by n(n-1)/2. I.e. in big Oh notation O² (O squared).

Now, taking these things together, we have a growth in the number of potential channels of communication that increases at a rate somewhere between polynomial (O squared) and exponential.

Assuming (naively) that any given (non-geographically limited) interest group will scale linearly with the total number of potential channels of communication, all forums should experience this rate of increase.

I have always believed that quantitative change inevitably drives qualitative change.

So what impact will this have on the quality of communication (in terms of properties and characteristics, not value)?

Since there are physical (bandwidth, mental capacity) limits on our individual capacities to communicate, a drive towards increasing specialization must be a consequence (this is generally acknowledged, although the rate at which our specialization must increase is probably under-appreciated).

What other effects will we see? Any comments?

Will this affect things other than just communication? What about the economy? How we divide labour? The network effects are in evidence there, also.

Thursday, 3 May 2012

Complexity: It had better be worth it!

A response to the StackExchange question: What is the Optimal Organizational Structure for IT?

Conway's Law states:

"..organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations"

The corollary of this is that the best organizational structure is the same as the best software architecture.
Not knowing the specifics of your business, I am basing the following on some sweeping generalizations:

Experience indicates that software development is both expensive and risky; and that this risk and expense can only (even with strenuous effort) be reduced to a very limited extent.

Since cost & the level of risk is (approximately) fixed, you need to increase returns to achieve an acceptable reward:risk ratio.

The first obvious consequence of this is that organisations should focus on projects with a big pay-off. This is not always achievable, as the opportunities might not exist in the marketplace. The second obvious consequence of this is that development costs should be amortized as much as possible. For example, by spreading development costs over multiple product lines & projects. (I.e. code reuse).

So, back to Conway's law: What organizational structure maximises code reuse? The obvious answer would be to align organizational units around libraries & APIs, with each developer responsible for one or more libraries AND one or more products. How libraries get re-used is then no longer a purely technical decision, but an important business decision also. It should be a management function to ensure that development costs are amortized effectively to maximise return per unit development effort.

Each developer then has responsibility for the development, testing & in-service performance of the features supplied/supported by his library.

Development Concerns with MATLAB

My response to a StackExchange question from a while ago: Who organizes your MATLAB code?

I have found myself responsible for software development best practice amongst groups of MATLAB users on more than one occasion.

MATLAB users are not normally software engineers, but rather technical specialists from some other discipline, be it finance, mathematics, science or engineering. These technical specialists are often extremely valuable to the organisation, and bring significant skill and experience within their own domain of expertise.

Since their focus is on solving problems in their own particular domain, they quite rightly neither have the time nor the natural inclination to concern themselves with software development best practices. Many may well consider "software engineer" to be a derogatory term. :-)

(In fact, even thinking of MATLAB as a programming language can be somewhat unhelpful; Taking a cue from one of my former colleagues, I consider it to be primarily a data analysis & prototyping environment, competing more against Excel+VBA rather than C and C++).

I believe that tact, diplomacy and persistence are required when introducing software engineering best practices to MATLAB users; I feel that you have to entice people into a more organised way of working rather than forcing them into it. Deploying plenty of enthusiasm and evangelism also helps, but I do not think that one can expect the level of buy-in that you would get from a professional programming team. Conflict within the team is definitely counterproductive, and can lead to people digging their heels in. I do not believe it advisable to create a "code quality police" enforcer unless the vast majority of the team buys-in to the idea. In a team of typical MATLAB users, this is unlikely.
Perhaps the most important factor in promoting cultural change is to keep the level of engagement high over an extended time period: If you give up, people will quickly revert to follow the path of least resistance.

Here are some practical ideas:

Repository: If it does not already exist, set up the source file repository and organise it so that the intent to re-use software is manifest in it's structure. Try to keep folders for cross-cutting concerns at a shallower level in the source tree than folders for specific "products". Have a top-level libraries folder, and try to discourage per-user folders. The structure of the repository needs to have a rationale, and to be documented.

I have also found it helpful to keep the use of the repository as simple as possible and to discourage the use of branching and merging. I have generally used SVN+TortoiseSVN in the past, which most people get used to fairly quickly after a little bit of hand-holding.
I have found that sufficiently useful & easy-to-understand libraries can be very effective at enticing your colleagues into using the repository on a regular basis. In particular, data-file-reading libraries can be particularly effective at this, especially if there is no other easy way to import a dataset of interest into MATLAB. Visualisation libraries can also be effective, as the presence of pretty graphics can add a "buzz" that most APIs lack.

Coding Standards: On more than one occasion I have worked with (otherwise highly intelligent and capable) engineers and mathematicians who appear to have inherited their programming style from studying "Numerical Recipes in C", and therefore believe that single-letter variables are de rigueur, and that comments and vertical whitespace are strictly optional. It can be hard to change old habits, but it can be done.

If people are modifying existing functions or classes, they will tend to copy the style that they find there. It is therefore important to make sure that source files that you commit to the repository are shining examples of neatness, full of helpful documentation, comments and meaningful variable names. This is particularly important if your colleagues will be extending or modifying your source files. Your colleagues will have a higher chance of picking up good habits from your source files if your make demo applications to illustrate how to use your libraries.

Development Methodologies: It is harder to encourage people to follow a particular development methodology than it is to get them to use a repository and to improve their coding style; Methodologies like Scrum presuppose a highly social, highly interactive way of working. Teams of MATLAB users are often teams of experts, who are used to (and expect to continue) working alone for extended periods of time on difficult problems.

Apart from daily stand-up meetings, I have had little success in encouraging the use of "Agile" methodologies in teams of MATLAB users; most people just do not "get" the ideas behind test-driven development, development automation & continuous integration. In particular, the highly structured interaction with the "business" that Scrum espouses is a difficult concept to generate interest in, even though some of the more serious problems that I have experienced in various organisations could have been mitigated with a little bit of organisation in the lines of communication.

Administration: Most of what constitutes "good programming practice" is simply a matter of good administration & organisation. It might be helpful to consider framing solutions as "administrative" and "managerial" in nature, rather than as "software engineering best practice".

Wednesday, 2 May 2012

Lowering the barriers to Entry

or ... From svn to hg and back again (and more importantly ... where next?)

I have been using Mercurial for the past 6 months or so, and I am still only partially sold on the whole DVCS movement.

I used Subversion exclusively for around 4 years, between 2007 and 2011, and a mixture of Perforce, StarTeam & SourceSafe (shudder) in the years prior to that. (I even did it manually for a while, before I knew better). These formative experiences occurred (mostly) in corporate environments, where I was frequently faced with the task of evangelizing software development best-practices in teams dominated by non-programmers (academics or other domain specialists).

Here the challenge is in working effectively with colleagues who are accustomed to working alone for long periods, and for whom sharing of work is done through network folders and email (or peer-reviewed, published articles!).

It is easy to forget that most professionals will require a significant amount of convincing before they will tolerate even minor inconveniences. Subversion, one of the easiest version control systems to use, still presents major barriers to adoption. Mercurial, for all of it's DVCS goodness, requires yet more knowledge & presents yet more friction to the non-developer user than Subversion. I am not even going to think about discussing Git.

So, how can we lower the barriers to entry and reduce everyday friction for modern development automation systems? Can we make using distributed version control easier that using Subversion? Easier than using email to share work? Easier than using network folders?

Can we find a much simpler way to solve the same essential problem that version control systems, configuration management systems & source document repositories solve

Well, what is the essential problem that they are trying to solve, anyway? It has something to do with collaboration, something to do with man-management, and something to do with asset management, organization and control.

Like most issues that appear, on the surface, to be purely technological, when you peel back the surface, it becomes possible to discern psychological, sociological and political factors at play; but by the same token, once analyzed, these (potentially confounding) influences simply become additional technical problems that can be managed by technical means.

So, we use version control & configuration management systems is to help organize our source documents, organize our development processes, and organize how work is divided up and merged back together again. They give us visibility, they keep a record of what happened in the past, and enable us to predict and plan what is going to happen in the future. They are the ally of the obsessive-compulsive side of our personality, and they give us the comforting feeling that everything is under control. As much as they are anything else, they are also an emotional crutch, and in that, they are a political ally against the local-optimum seeking risk minimizers in life.

I have a lot of hope that real-time collaborative editing (Etherpad - Realie) and online development environments (Koding - CodeNow) will find success and provide us with a rich set of options for our future development environments; they certainly offer an aggressive simplification and improvement over the current state of affairs! (Although I believe that they will need to address the above political concerns to gain widespread traction in a conservative (small c) world.)

I also hope that these environments pick up the user-interface ideas that are being promoted by LightTable et al, and provide support for the broader engineering community (Embedded and safety-critical systems in particular) as well as web & enterprise development.

Thursday, 19 April 2012

Digital Design Desires

Response to: http://jonbho.net/2012/01/24/i-want-to-fix-programming/

Developing software is fundamentally a learning process; As developers, we direct the majority of our efforts towards exploring & learning the problem domain and the space of possible solutions. In comparison to this, the act of transcribing that understanding to a formal machine-readable representation ("source code") is rather less significant.

Notwithstanding this, the learning process itself is intimately involved with the manipulation of formal representations and the interaction of these with the environment. The formal language (and associated tools) that we use helps us not only to formally specify a set of possible solutions to the problem; but also helps to guide us in our search for those solutions and to build up our knowledge and understanding of the problem domain and of our proposed solution.

Importantly, the tools that we use help us to understand the implications of the formal statements that we have already made in our exploration.

This is really about the software development process as a whole, and how the specifics of the language support that process. The output of the software that we create, test results, debugger output, static analysis etc.. all help us to understand the problem domain and the current state of our search for a solution.

The languages that we use do need to support these tools better than they currently do, (especially static analysis, IMHO) but the focus of the (many) discussions on the language itself and it's syntax is somewhat misleading, I think.

I would rather the effort be spent on extending and improving the toolchains surrounding languages that already exist and are popular. I want continuous testing, style checking, static analysis and fuzz testing to be ready "out of the box" and enabled by default; from the get-go, I want my IDE/compiler/interpreter (with default settings) to refuse to build/run my code unless the style is spot-on, the test coverage 100% and the static analysis gives the thumbs-up.

If that means I must constrain the way that I develop software, so be it.

I agree with the original post: it would be nice for a program to consist only of invariants and test-cases, with the algorithm generated automagically. I suspect that, feasibly, the developer will still need to provide some guidance in the choice of algorithm, but I see no reason why we could not have our IDEs/text-editors provide continuous feedback when the statements that we type violate some constraints that we have previously specified, or cause some test-case to fail.

This would have been unthinkable a few short years ago, but the computational power at our disposal right now is immense: enough to make us shift our concept of what is possible.

Tuesday, 17 April 2012

They tried to give me a database, but I said, NoSQL, No NoSQL, No No No-oo-oo.

I am starting (reluctantly) to come to the conclusion that SQL & NoSQL databases are the wrong tools for the tasks that I want to undertake: For a colourful but ultimately misleading metaphor: They give me a hammer when I want a screwdriver. The databases that my (limited) experience encompasses (MySQL, Oracle, MongoDB) all offer a slew of features that I neither use nor need. Furthermore, the subset of features that I *do* need is probably better packaged as a set of libraries than as a separate application. For example, I want to manipulate large quantities of data, stored in RAM, possibly spread across multiple machines, possibly operated on by processes operating concurrently or in parallel. I would like to be able to plug together the distribution/replication/sharding of data & tasks, the fault-tolerance, zeroconfig style ease-of-administration, even (maybe) some indexing capabilities, but in such a way that will let me choose when I want each feature.

Monday, 2 April 2012

Spaces over tabs: A rationale

A wider range of people will read from a document than write to it.
A document formatted with spaces (instead of tabs) will look the same to all readers.
A document formatted with tabs will look different depending on the reader's tab settings.
In a corporate environment, it is easier to enforce uniformity in the settings of the text editors of the group of people who can write to the documents than of the (larger) group who can read from them.

If, like me, you use your spatial awareness to navigate source documents, and as a consequence, really like things to be vertically aligned (Guido, why do you hate us so?), viewing a tab-indented document with the wrong tab size set is very disconcerting.

So, for these reasons you should choose spaces over tabs, and set a uniform indent size. (4 seems to be the most commonly accepted convention)