Thursday, June 26, 2008

Design for correctness

We started integrating WS-CDL into our design and runtime processes a while back. This work became one of the defining (and differentiating) factors behind our governance efforts (and therefore Overlord). Some people (users and analysts) just "get it" and understand the need behind CDL (let's drop the WS part of the name, because CDL is not limited to SOAP/HTTP by any means). However, others don't, and still others ignore it entirely. At best, that is a shame. At worst, it compromises the integrity of the systems they develop.

Steve Ross-Talbot recently gave a presentation on CDL at the Cognizant Community Europe workshop, and used the analogy of a house architect to explain where CDL fits in. This is a good analogy, because CDL should be in any good Enterprise Architect's repertoire. Just as you don't throw together a straw-built house from a pencil drawing on the back of a napkin and expect it to withstand a hurricane, neither should you just cobble together components or services into a distributed system (irrespective of the scale) and expect it to be correct (and provably correct at that). In the housing example you would bring an architect into the project, and that architect would use best practices developed over centuries of collective experience to design a building that can withstand 100 mph winds. Software engineering should be no different. Some sectors of our industry have been able to get by with computing as an art rather than a science, and house designers did pretty much the same thing thousands of years ago. But we don't live in caves any more, for good reasons (although there's still something to be said for using caves in a hurricane!).

Of course this means that there are more layers between deciding what needs to be done and actually realising it in an implementation, but those layers are pretty important. The days of just throwing something together and assuming it'll work as planned are well and truly over. Asynchronous systems, which really began life several decades ago but were muzzled by layers of synchronous abstractions, are back to stay. Yes, synchronous is easier to understand and reason about, but it's an unfortunate reality that if you want scale, real-time behaviour, loose coupling, etc., you have to break through the synchronous barrier. That has a knock-on effect on how you design your systems and individual components (services), and ultimately on how they are managed (by a person or by some autonomic mechanism). "Design for testability" was a buzz-phrase from many years ago. What we need now (and what CDL integration gives us) is "design for correctness".
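To make that concrete, here's a minimal sketch (in Java, and emphatically not WS-CDL syntax) of the idea behind "design for correctness": the intended global exchange is written down once, and observed behaviour can then be checked against it rather than taken on trust. The Interaction/Choreography types and the buyer/seller exchange are illustrative assumptions, not anything from CDL or Overlord.

```java
import java.util.List;

// A toy illustration of "design for correctness": declare the global
// message exchange up front, then check that an observed trace of
// interactions actually conforms to it.
public class ChoreographyCheck {

    // One expected interaction: a message type sent from one role to another.
    record Interaction(String from, String to, String message) {}

    // The "choreography" here is simply the expected ordering of interactions.
    record Choreography(List<Interaction> expected) {
        boolean conformsTo(List<Interaction> observed) {
            return expected.equals(observed);
        }
    }

    public static void main(String[] args) {
        Choreography purchase = new Choreography(List.of(
                new Interaction("Buyer", "Seller", "order"),
                new Interaction("Seller", "Buyer", "quote"),
                new Interaction("Buyer", "Seller", "confirm")));

        // A trace captured at runtime (or generated from a design-time model).
        List<Interaction> observed = List.of(
                new Interaction("Buyer", "Seller", "order"),
                new Interaction("Seller", "Buyer", "quote"),
                new Interaction("Buyer", "Seller", "confirm"));

        System.out.println("Trace conforms to choreography: "
                + purchase.conformsTo(observed));
    }
}
```

A real choreography language obviously captures far more (roles, channels, branching, concurrency), but even this toy version shows the shift from "assume it works" to "check it against the design".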

Monday, June 23, 2008

Goodbye BAM/BI, hello SAM/SI

OK, not quite, but it's a nice title ;-)

The term Business Activity Monitoring (BAM) describes real-time access to critical business performance metrics in order to improve the efficiency and effectiveness of business processes. Real-time process/service monitoring is a common capability supported in many distributed infrastructures. However, BAM differs in that it draws information from multiple sources to enable a broader and richer view of business activities. BAM also encompasses Business Intelligence (BI) as well as network and systems management. Plus BAM is often weighted toward the business side of the enterprise.

Although BAM was popularized by BPM, the fundamental basis behind it (monitoring the activities in an environment and informing interested parties when certain events are triggered) has been around since the early days of (distributed) system management and monitoring. BAM specializes this general notion and targets the business analyst. Using BAM within an autonomic infrastructure is often difficult if not impossible (depending upon the implementation in question).

Within a distributed environment (and many local environments) services are monitored by the infrastructure for a number of reasons, including performance and fault tolerance, e.g., detecting when services fail so that new instances can be automatically started elsewhere. Over the years distributed system implementations have typically provided different solutions for specific monitoring requirements, e.g., failure detection (or suspicion) would be implemented differently from the mechanism used to detect performance bottlenecks. For some types of event monitoring this leads to overlap and possible inefficiencies. For instance, some approaches to detecting (or suspecting) failures may also be used to detect services that are simply slow, indicating problems with the network or an overloaded machine on which the service resides. But where these ad hoc approaches have differed from BAM/BI is in their intended target audience: other software components (e.g., a load balancer) rather than humans.
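As a heavily simplified illustration of that overlap, the sketch below assumes a heartbeat style of monitoring: the same timestamp bookkeeping that drives failure suspicion can equally flag a service that is merely slow, and the resulting status is aimed at other software (a load balancer, say) rather than a human. The class name, thresholds and service name are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One set of heartbeat bookkeeping feeding two audiences: slowness
// detection and failure suspicion.
public class HeartbeatMonitor {

    enum Status { OK, SLOW, SUSPECTED_FAILED }

    private final long slowAfterMillis;
    private final long suspectAfterMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    HeartbeatMonitor(long slowAfterMillis, long suspectAfterMillis) {
        this.slowAfterMillis = slowAfterMillis;
        this.suspectAfterMillis = suspectAfterMillis;
    }

    // Called whenever a heartbeat (or any message) arrives from a service.
    void heartbeat(String service) {
        lastHeartbeat.put(service, System.currentTimeMillis());
    }

    // A load balancer might route away from SLOW services, while
    // SUSPECTED_FAILED could trigger automatic recovery elsewhere.
    Status statusOf(String service) {
        Long last = lastHeartbeat.get(service);
        if (last == null) return Status.SUSPECTED_FAILED;
        long silence = System.currentTimeMillis() - last;
        if (silence > suspectAfterMillis) return Status.SUSPECTED_FAILED;
        if (silence > slowAfterMillis) return Status.SLOW;
        return Status.OK;
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatMonitor monitor = new HeartbeatMonitor(100, 500);
        monitor.heartbeat("orders-service");
        Thread.sleep(200); // no heartbeat for a while...
        System.out.println(monitor.statusOf("orders-service")); // SLOW
    }
}
```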

This separation of audience is useful from a high-level perspective: business analysts shouldn't have to be concerned with low-level infrastructural details. But in many cases this ad hoc (bolt-on) approach to BAM and BI can lead to less information being delivered to the entities that need it at the time they need it. Therefore, within the Overlord project we are working on Service Activity Monitoring (SAM) and associated Service Intelligence (SI), which will provide an architecture (and corresponding infrastructure) that brings together many different approaches to entity monitoring within distributed systems (where an entity could be a service, a machine, a network link or something else entirely), and particularly within service-oriented infrastructures (SOIs). The emergence of event processing has also had an impact on this general entity monitoring, with some implementations treating failure, slowness to respond, etc. as particular kinds of events. This uniform monitoring includes the following (a sketch of what such a uniform event might carry appears after the list):

• Message throughput (the number of messages a service can process within a unit of time). This might also include the time taken to process specific types of messages (e.g., how long to do transformations).
• Service availability (whether or not the service is active).
• Service Mean Time To Failure (MTTF) and Mean Time To Recovery (MTTR).
• Information about where messages are sent.
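As a rough sketch of what such uniform monitoring might look like (assuming nothing about the actual Overlord/SAM schema), the example below models a single stream of timestamped activity events per service, from which throughput, availability, MTTF/MTTR and message destinations could all be derived; only throughput is computed here. Field names, event kinds and the helper method are illustrative only.

```java
import java.util.List;

// One uniform activity event type carrying enough to derive the metrics
// listed above from a single stream.
public class ActivityEvents {

    enum Kind { MESSAGE_PROCESSED, SERVICE_FAILED, SERVICE_RECOVERED }

    record Event(long timestampMillis, String service, Kind kind,
                 long processingMillis, String sentTo) {}

    // Throughput: messages processed per second over the span of the stream.
    static double throughputPerSecond(List<Event> events) {
        long processed = events.stream()
                .filter(e -> e.kind() == Kind.MESSAGE_PROCESSED).count();
        long spanMillis = events.get(events.size() - 1).timestampMillis()
                - events.get(0).timestampMillis();
        return spanMillis == 0 ? processed : processed * 1000.0 / spanMillis;
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
                new Event(0, "orders", Kind.MESSAGE_PROCESSED, 12, "billing"),
                new Event(400, "orders", Kind.MESSAGE_PROCESSED, 15, "billing"),
                new Event(900, "orders", Kind.SERVICE_FAILED, 0, null),
                new Event(1500, "orders", Kind.SERVICE_RECOVERED, 0, null),
                new Event(2000, "orders", Kind.MESSAGE_PROCESSED, 11, "shipping"));

        System.out.printf("Throughput: %.2f msg/s%n", throughputPerSecond(stream));
    }
}
```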

The information is made available to the infrastructure so that it can take advantage of it for improved QoS, fault tolerance, etc. The monitoring streams may be pulled from existing infrastructure, such as the availability-probing messages typically used to detect machine or service failures, or they may be created specifically for the SAM environment. Furthermore, streams may be dynamically generated in real time (and perhaps persisted over time), or they may consist of static, pre-defined information that SAM can mine over time and in response to explicit queries.
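The "explicit queries over persisted data" case can be sketched in the same spirit: the failure/recovery events that a live stream would carry can equally be mined after the fact, here to compute Mean Time To Recovery. Again, the event shape and sample data are assumptions made purely for illustration.

```java
import java.util.List;

// Mining persisted outage events with an explicit query: compute MTTR by
// pairing each failure with the recovery that follows it.
public class MttrQuery {

    record OutageEvent(long timestampMillis, boolean failed) {}

    static double meanTimeToRecoveryMillis(List<OutageEvent> persisted) {
        long total = 0;
        int outages = 0;
        Long failedAt = null;
        for (OutageEvent e : persisted) {
            if (e.failed()) {
                failedAt = e.timestampMillis();
            } else if (failedAt != null) {
                total += e.timestampMillis() - failedAt;
                outages++;
                failedAt = null;
            }
        }
        return outages == 0 ? 0 : (double) total / outages;
    }

    public static void main(String[] args) {
        List<OutageEvent> persisted = List.of(
                new OutageEvent(1_000, true),    // failed
                new OutageEvent(4_000, false),   // recovered after 3s
                new OutageEvent(10_000, true),
                new OutageEvent(15_000, false)); // recovered after 5s

        System.out.printf("MTTR: %.0f ms%n", meanTimeToRecoveryMillis(persisted));
        // Expected output: MTTR: 4000 ms
    }
}
```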

With the advent of SAM we will see BAM implementations built on top of it, narrowing the types of events down to those of interest to the business analyst. The SAM approach offers more flexibility and power for monitoring and management than traditional BAM approaches. As BPM and SOA move steadily towards each other, this kind of infrastructure will become increasingly important for maintaining agility and flexibility.