Today, the role of the software architect is widely recognised, and most large system development projects are led by someone who takes particular responsibility for system-wide technical factors, such as ensuring that the system will provide the right set of qualities (such as functionality, availability, security and performance) for its stakeholders.
In recent years we’ve also seen a welcome increase in the number of resources available to the aspiring software architect, including useful books, community web sites and interesting blogs. Ultimately, however, the success of a project is often closely related to the amount of hard-won experience the project team has – by which we really mean how many mistakes they’ve learned from in their careers so far! Experience is a hard but very effective teacher.
In this short article I hope to share some of the pitfalls that I’ve encountered in my experience as a software architect and in so doing, help you to avoid making just the same mistakes on your projects!
Many projects end up being doomed to failure before they’ve really started, simply because their scope was wrong. The most common problem is the infamous “scope creep”, where the scope of the system steadily increases as more and more people get their say. This is the sort of situation where a simple travel booking system ends up with full expense claim management facilities being built into it, with inevitable repercussions for project costs, timescales and quality. The converse problem is where scope is artificially limited (perhaps to meet timescales) and so the effectiveness of the system is compromised. Architects need to maintain a particular focus on scope problems that relate to system qualities. Does the system really need to be available 24x7x365? Or would working hours plus Saturday mornings be sufficient? Is it really true that no security is needed beyond simple login? Once logged into the system, can users really perform any system operation? Catching these mistakes early in the project will greatly increase the probability of success.
A related mistake that many of us have made is to focus on just a couple of our system stakeholders – classically the acquirer (who is paying for the system) and the end users get all of the attention. However, there are often quite a few other interested parties who need to be involved. Consider people like auditors, systems administrators and DBAs, testers, helpdesk staff, the development team and so on. You’ll probably need the cooperation of a fair number of these people in order to successfully deploy your system, and a number of them may not share your enthusiasm for it! Think about searching for stakeholders in groups like acquirers, assessors, communicators (like writers and trainers), developers, maintainers, suppliers, support staff, system administrators, testers and end users. The sooner you can make contact with these groups and understand their interests and concerns, the better.
It goes without saying that a key quality of a system is what it does. People buy or build systems in order to perform useful work and so it’s crucial that the system does what people need it to do. However, a trap that many new software architects fall into is focusing entirely on a system’s functions without considering any of its other properties.
Unfortunately, while we do need to get the system’s functional processing right, this is rarely enough; unless the system exhibits a whole range of qualities (such as performance, security, maintainability and so on) it is unlikely to be successful.
In my experience, an architect needs to focus on the type of functions that a system has to provide and must design a suitable framework that each of the individual features of the system can be slotted into during detailed design. The design of the overall framework has to take into account the kind of functions it will host (be they computationally intensive, data oriented, batch operations, interactive operations and so on) but crucially, it must also be capable of providing the right level of performance, allow the application to be secured, allow for high availability and disaster recovery, support operation and administration and so on.
In reality, if the system isn’t fast enough or is insecure, if it can’t be supported or changed to allow for new requirements, it probably won’t be used and may not even make it into production in the first place. Identifying and delivering on these non-functional requirements is a crucial part of the architect’s role.
At some point in the development of the system you will need to write something down to explain your ideas for the architecture of the system – in other words, you’ll need an architectural description.
The question is, how do you go about describing something as complex as a modern software-intensive system? The approach that we’ve all tried at some point is the single, all-inclusive, “boxes and lines” Visio diagram that tries to show everything. The diagram probably includes software modules and some interconnections, machines, network links, databases and data flow, some user interaction, perhaps some system monitoring and so on. If you’ve tried this yourself then you already know that it’s not a terribly effective approach.
There are two reasons why the huge Visio picture doesn’t work well as an architectural description: firstly, it’s trying to show too much information in a single representation, and secondly, no one is really sure quite what each of the different types of symbol you’ve drawn means.
The first problem means that the diagram doesn’t work well for anyone, as everyone has to hunt through it for the information they are interested in. Infrastructure experts have to discern the machines, network links and infrastructure software from the whole, software developers have to try to work out what the software layering is, and testers need to try to untangle dependencies from data flows. Each person is mentally creating their own subset of the diagram in order to understand it. The solution to this problem is to decompose your large diagram into a number of well-defined, non-overlapping views of the system, such that each view addresses one aspect of the system structure (such as runtime functional structure, software module structure, data structures or deployment environment). This is a well-proven technique and there are a number of standard approaches you can use to guide you in achieving this (see the Further Reading section).
The solution to the problem of ambiguity is to make sure that you use a well-defined notation for your diagrams and to provide enough supporting text somewhere to make your intentions clear. UML is an obvious starting point, given its tool support and wide use, although quite honestly it’s not a very good architectural notation (again, see the Further Reading section for some guidance on how to use it effectively). For those situations where UML isn’t effective, you will probably need to design your own notation to represent your ideas. However, remember to clearly define what your notation means so that people aren’t guessing – different people rarely guess the same way!
I’m not suggesting that we shouldn’t innovate or that all systems should use COBOL and VSAM but always bear in mind that your design needs to be built, tested and deployed in order for it to be successful. If the design or the technologies you use are too sophisticated or immature, then you may be jeopardising this.
Common things to watch out for related to building the system include designs that the developers or testers don’t really understand, technologies that they aren’t enthusiastic about or don’t have the time to learn, and new technologies that don’t yet have good tool support or perhaps impose a new and unfamiliar way of working. The presence of any of these factors suggests that you need to tread carefully and make sure that you work with the development and test teams to minimise the risks.
Also bear in mind that your system needs to be installed, monitored and controlled in production. Sophisticated architectures can make all of this significantly harder, and immature technologies often have weak monitoring and management facilities, even once their development-time support has matured. Again, it’s a case of working with the relevant parties (such as system administrators and perhaps the vendors) to make sure that no one has any nasty surprises late in the day.
Modern information systems nearly always rely on a fairly sophisticated “platform” of hardware and software to provide them with standard services like a runtime environment, data storage facilities and application security. The platform that information systems use has got a lot more complex in recent years: rather than 2 or 3 products (such as a compiler, operating system and database), today’s platforms often comprise 10 or more products (operating systems, databases, application servers, application frameworks, virtual machines, cluster services, security services and so on).
This increased complexity has inevitably resulted in a much greater chance of incompatibility between the various components, as there is more to go wrong and more chance of you using a particular combination for the first time. This situation means that it’s no longer sufficient to simply say that you “need Unix and Oracle” when specifying your platform. You need to be really precise about the specific versions and configurations of each part in order to ensure that you get what you need. This will allow you to avoid the situation where you can’t deploy your system because someone has helpfully upgraded a library for one part of the platform without realising that it means that something else will no longer work.
Work out what you need from your runtime platform as early as possible and make sure that you’re precise about the versions and configurations of its components.
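One simple way to make this precise is to record the pinned platform specification somewhere machine-readable and check it at deployment time. The sketch below illustrates the idea – every component name and version number here is invented for the example, not a recommendation:

```python
# A minimal sketch of a pinned platform specification and a check against
# what is actually installed. Component names and versions are illustrative.
required_platform = {
    "operating_system": "RHEL 4 Update 4",
    "database":         "Oracle 10.2.0.3",
    "app_server":       "WebLogic 9.2 MP1",
    "jvm":              "Sun JDK 1.5.0_11",
}

def check_platform(installed):
    """Return the components whose installed version differs from the spec,
    mapped to (required, actual) pairs."""
    return {
        component: (required, installed.get(component, "<missing>"))
        for component, required in required_platform.items()
        if installed.get(component) != required
    }

# Someone "helpfully" upgraded the JVM without telling anyone...
installed = dict(required_platform, jvm="Sun JDK 1.6.0_01")
print(check_platform(installed))  # flags only the mismatched JVM
```

Even a crude check like this, run before every deployment, catches the “helpful upgrade” problem while it is still cheap to fix.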
An experience that many software architects will relate to is that of surprises arising during system build and test. This is probably never more the case than when considering performance and scalability – if nothing else because there are just so many things that can go wrong!
Particularly when using new technology, it is often hard to get a good feel for performance and scalability characteristics without a great deal of experience, and it’s easy to assume that it’s all going to be OK, only to find out rather late in the day that performance or scalability is sensitive to some factor you’ve not considered.
The only real solution to these problems is constant vigilance and assuming nothing! A little performance and scalability paranoia will help you to keep considering these factors throughout the project and to keep challenging your assumptions. Start considering performance and scalability early, create performance models to try to predict key performance metrics and spot bottlenecks and get stuck into some practical proof-of-concept work as your design ideas are forming. This will all help to increase confidence that there aren’t any performance and scalability demons lurking in your design.
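A performance model doesn’t need to be elaborate to be useful. Even a back-of-the-envelope queueing calculation can reveal how close a component is to saturation. The sketch below uses the classic single-server (M/M/1) approximation; the arrival and service rates are illustrative assumptions, not measurements:

```python
def mm1_response_time(arrival_rate, service_rate):
    """Approximate mean response time (seconds) for a single-server queue
    using the M/M/1 formula: service_time / (1 - utilisation).

    arrival_rate: requests per second arriving at the component
    service_rate: requests per second the component can process
    """
    utilisation = arrival_rate / service_rate
    if utilisation >= 1.0:
        raise ValueError("component is saturated: it cannot keep up")
    service_time = 1.0 / service_rate
    # Response time grows sharply as utilisation approaches 1 - this is
    # often where the late "surprises" come from.
    return service_time / (1.0 - utilisation)

# Illustrative figures: 80 req/s arriving at a service that handles 100 req/s
print(f"{mm1_response_time(80, 100) * 1000:.0f} ms")  # 50 ms mean response
```

Playing with the arrival rate in a model like this shows why a system that behaves well at 80% utilisation can fall apart at 95% – exactly the kind of assumption worth challenging early.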
Another system quality that has kept many software architects awake at night is security. This quality is becoming ever more important as systems are exposing interfaces outside the organisation and simultaneously becoming more and more audited due to regulatory factors and corporate governance initiatives.
A mistake made in many systems over the years has been to try to add security into the system using “home brew” security technology. Be it custom encryption algorithms, a developer’s own auditing system or an entire DIY access control system, locally developed security solutions are rarely a good idea. While most of us think we could probably whip up a clever piece of security technology in no time, we’re usually wrong.
Like many complex things, security technology is considerably harder to build than it appears at first glance and mainstream security products are inevitably built by specialists with many years of experience in the field. Trying to create your own security mechanisms is likely to be time consuming and may well introduce subtle security vulnerabilities into your system that can be exploited by attackers if there is anything in your system that they really want.
Of all of the system qualities, security is probably the one where it’s well worth getting some expert help to assess the security you need, the vulnerabilities that you may have and the mechanisms you should use to address them.
Many systems reach production without major mishap and manage to run successfully there for years, without any significant interruption to service. Others aren’t so lucky and suffer some sort of major failure involving entire system recovery a number of times during their operational life.
The problem when implementing disaster recovery (DR) during a project is often how you get funding and attention for something that may never happen, at a time when there is already too much to do with the resources available. However, the obvious problem with not having DR is that serious infrastructure failures can happen at any time, and if your system is important to the organisation, then its loss is going to have a serious impact on the business.
The key to getting resources to implement a DR mechanism for your system is to be specific and quantify the cost of system unavailability in a number of realistic scenarios. If you can also estimate the probability of the scenarios occurring then you can use these two figures to convince people that DR is important and to justify a certain level of budget to implement it.
Finally, also remember to test your DR processes and mechanisms regularly. Experience shows that the DR exercises rarely work the first time and you don’t want to find the weaknesses in your design when you have a real failure to deal with!
With the best will in the world, things go wrong and while the previous tip is a reminder to ensure production resilience, you also need to remember to allow for disasters during deployment.
In the ideal world, deployment is always a smooth process, resulting in a system running as expected. Things do go wrong though, from configuration difficulties to unexpected environmental factors, to undiscovered faults and, of course, simple human error. Make sure that, whatever happens during the deployment of your system or an upgrade, you have a documented, reviewed and agreed backout plan to allow you to restore the environment to its state before you started deployment. At least this means that you can undo the damage and avoid total disaster if the worst happens!
In this short article, I’ve shared some of the pitfalls that have caused many software architects to come a cropper in the past. Hopefully this will allow you to steer your projects clear of these potential icebergs, which could otherwise leave them holed below the waterline and doomed to become more of the wrecks that others can only learn from.
Some useful references on describing software architectures using views and UML include:
Some useful books to help you achieve particular qualities in your systems include:
Eoin Woods is a software and enterprise architect at UBS Investment Bank. He is a member and Fellow of the International Association of Software Architects, and will be speaking at IASA’s IT Architects Regional Conference in San Diego, Oct. 15–16 2007. He is also co-author of the book Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives (published by Addison-Wesley).