by Rinat Abdullin, March 2012.
Hub is the platform that unifies the metered, pay-as-you-go forecasting subscription offered by Lokad. It also provides an intelligent, self-managing business backend for our internal teams.
About The Problem
Lokad works with a business model that is still rather young and unexplored: providing Software as a Service. This model is tightly linked to the even less explored domain of delivering demand forecasting and inventory optimization services that exploit cloud computing to provide lower prices and higher quality than any competitor.
In short, this is a new domain with a lot of details. We are still exploring the market while adjusting our strategy and tactics to stay real and focused. Besides, we have a small team that has to handle support for existing customers, manage the new accounts created every day, and attempt to improve the state of the art in demand forecasting (which hasn't evolved for decades).
Hence, the backbones of such a company have to be extremely flexible, focused and self-managing. They should also be able to survive cloud outages. Hub is one such backbone.
Features of Lokad.Hub
- Multi-tenant and can scale with the business.
- Supports instant data replication to multiple locations.
- Can be deployed to any cloud (useful for rapid failovers).
- Extremely low development friction: multiple deployments to production per day are a norm.
- Reliable development of complex business processes: CQRS/DDD+ES.
- Full audit logs and ability to roll system back to any point in time.
- Integration with payment, Geo-Location and CRM systems.
Lokad Hub is roughly the 5th CQRS system implemented at Lokad. It is a full rewrite of the old business backend, which had become slow, complex and expensive to maintain. This legacy prevented us from delivering features that were essential for the company to move forward.
The old architecture was based on things like:
- SQL Azure for persistence
- NHibernate for ORM Mapping
- ASP.NET Web Forms for Web UI
- Autofac IoC for infrastructure
- Log4net for logging
- Protobuf serialization
- Windows Azure runtime and persistence libraries.
The new architecture is based on the latest version of Lokad.CQRS. In fact, some of the things discovered during development were back-ported into this open source project so that the community could benefit from them. Some of the latest discoveries are still waiting their turn for back-porting.
The best improvements from the new architecture are low friction and the ability to adapt to the new needs of the company as it moves forward. Below are some of the most important concepts behind it.
1. Use DDD to drive the design
- Domain-Driven Design was vigorously applied to break a complex field into separate bounded contexts.
- The best technology was chosen for each bounded context to implement the needed functionality in a simple, testable and cloud-portable way. After multiple iterations, each of the bounded contexts ended up as a CQRS+ES implementation stemming from the DDD model defined earlier.
- Modeling by coding was selected to actually capture and refine the domain.
- The development stack matched Lokad.CQRS in its portable configuration (no cloud bindings whatsoever).
- The last iteration of domain modeling, which captured all business requirements from the previous system, became the foundation of the new one.
2. Reduce Technology Stack
Things like SQL, ORM, IoC and tight Azure coupling were discarded in this case. Simple storage abstractions are more useful and portable. In fact, the core business logic (the most complex library) depended only on message contracts and portability interfaces.
Also, to reduce development friction, the entire Lokad.CQRS portability layer (Lokad.Cqrs.Portable) was imported into the system as source code. This has since become the recommended way of using Lokad.CQRS.
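To illustrate how small such a portability surface can be, here is a minimal sketch. It is Python for brevity (Lokad.Hub itself is .NET), and all names here are invented, not the actual Lokad.CQRS interfaces:

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

class DocumentStore(ABC):
    """Hypothetical portability interface: business logic sees only this,
    never SQL, an ORM or cloud-specific storage APIs."""

    @abstractmethod
    def save(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def load(self, key: str) -> Optional[bytes]: ...

class InMemoryDocumentStore(DocumentStore):
    """In-memory implementation for tests and local development; a file-based
    or blob-based implementation would back real deployments."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def save(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def load(self, key: str) -> Optional[bytes]:
        return self._items.get(key)
```

Swapping clouds then means swapping one small storage implementation, not rewriting the business logic.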
3. Choose your Domain Toolset
Complex business interactions and flows were modeled using AR+ES entities (Aggregate Roots implemented with Event Sourcing) with stateless workflows and long-running processes that logically map to human behaviors over the UI.
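A rough sketch of the AR+ES idea, with invented event and command names (Python for brevity; the real entities are C#): state is rebuilt by replaying events, and commands validate against that state before emitting new events.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical events; the real message contracts in Lokad.Hub differ.
@dataclass
class SubscriptionOpened:
    plan: str

@dataclass
class PlanChanged:
    plan: str

class Subscription:
    """Aggregate root with event sourcing: every state change goes through an event."""

    def __init__(self, history: List[object]):
        self.plan = None
        self.changes: List[object] = []   # new, uncommitted events
        for event in history:             # replay the stream to restore state
            self._mutate(event)

    def _apply(self, event):
        self._mutate(event)
        self.changes.append(event)

    def _mutate(self, event):
        if isinstance(event, (SubscriptionOpened, PlanChanged)):
            self.plan = event.plan

    # Commands: decide based on current state, then record the decision.
    def open(self, plan: str):
        if self.plan is not None:
            raise ValueError("subscription already opened")
        self._apply(SubscriptionOpened(plan))

    def change_plan(self, plan: str):
        if plan != self.plan:
            self._apply(PlanChanged(plan))
```

After handling a command, the uncommitted `changes` are appended to the entity's event stream; loading the entity later just replays that stream.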
"Sagas" and alert services were dropped completely at the last moment for being complex and non-transparent for practical use. They also don't play well with migrating systems from legacy CRUD to event sourcing.
4. Expressive Tests and Living Documentation
The Lokad.CQRS flavor of SimpleTesting is used to define specifications for complex business behaviors. These serve both as unit tests and as living documentation. This simplifies the reliable and fast delivery of important features that can make a difference for our customers.
[Passed] Use case 'user management - detect fast brute-force attack at customer password'.

Given:
  1. Created user 1 (security 1) with threshold 00:05:00
  2. User 1 login failed at 1/1/2011 1:01:00 AM (via IP 'local')
  3. User 1 login failed at 1/1/2011 1:02:00 AM (via IP 'local')
  4. User 1 login failed at 1/1/2011 1:03:00 AM (via IP 'local')
  5. User 1 login failed at 1/1/2011 1:03:00 AM (via IP 'local')

When:
  Report login failure for user 1 at 1/1/2011 1:04:00 AM

Expectations:
  [ok] User 1 login failed at 1/1/2011 1:04:00 AM (via IP 'local')
  [ok] User 1 locked with reason 'Login failed too many times within short time interval'
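Such a specification boils down to "given past events, when a command arrives, then expect new events". A hypothetical Python rendition of the lockout rule (the real tests are C#, and the class names and threshold logic here are invented for illustration):

```python
from datetime import datetime, timedelta

class LoginFailed:
    def __init__(self, at): self.at = at
    def __eq__(self, other):
        return isinstance(other, LoginFailed) and other.at == self.at

class UserLocked:
    def __init__(self, reason): self.reason = reason
    def __eq__(self, other):
        return isinstance(other, UserLocked) and other.reason == self.reason

def report_login_failure(history, at, threshold, limit=5):
    """When a login failure is reported: record it, and lock the user
    if 'limit' or more failures fall within 'threshold' of this one."""
    failures = [e.at for e in history if isinstance(e, LoginFailed)] + [at]
    produced = [LoginFailed(at)]
    if len([t for t in failures if at - t <= threshold]) >= limit:
        produced.append(UserLocked(
            "Login failed too many times within short time interval"))
    return produced

# Given four failures within five minutes...
t = lambda minute: datetime(2011, 1, 1, 1, minute)
given = [LoginFailed(t(1)), LoginFailed(t(2)), LoginFailed(t(3)), LoginFailed(t(3))]
# ...when a fifth one is reported, then the user gets locked.
then = report_login_failure(given, t(4), threshold=timedelta(minutes=5))
```

The given/when/then triple is exactly what gets printed as the human-readable document above.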
These specifications can further be reused to verify message contract stability and also visualize domain parts of the system, showing dependencies and unit test coverage:
Each yellow block on the left maps to a separate specification that can be read as a document.
Note: these graphs were built by simply scanning the unit test assembly for available specifications, grabbing the "When-Then" messages out of them and printing the dependencies in a form understood by Graphviz.
5. Migrate from Legacy to ES with a Separate Tool
It is always a pain to migrate from state-based SQL data models to event sourcing (the reverse direction is much easier). And there never is a silver-bullet solution.
A special "ReverseEngineer" application was created for the sole task of transforming the legacy SQL database into event streams for AR+ES entities.
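The idea of that tool can be sketched as follows. This uses Python with an in-memory SQLite stand-in; the real tool read the production SQL database, and the schema and event names here are invented:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class AccountImported:
    """Synthetic event capturing the final legacy state of a row. The real
    per-row history is gone, so each stream begins with this snapshot event."""
    account_id: int
    email: str

def reverse_engineer(conn: sqlite3.Connection) -> dict:
    """Transform each legacy table row into the initial event stream
    of the corresponding AR+ES entity."""
    streams = {}
    for account_id, email in conn.execute("SELECT Id, Email FROM Accounts"):
        streams[account_id] = [AccountImported(account_id, email)]
    return streams
```

From that point on, the new system only ever appends events to those streams; the legacy database is never touched again.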
6. Simplify UI Development
The client UI was rewritten from scratch to benefit from flexible projections and the latest improvements in web development. It became simpler, nicer and faster to build UIs with ASP.NET MVC 3, Twitter Bootstrap and projections.
This saves time that can instead be spent improving the customer experience.
This Web UI happens to work on mobile devices as well, without any additional effort.
7. Introduce Smart Projections
To reduce deployment friction, the server was enriched with the ability to automatically detect changes in projections on startup and rebuild the views that need it (as well as newly added views).
This vastly speeds up development and reduces the chance of human error.
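A sketch of the mechanism (Python; the actual implementation works on .NET projection classes, and fingerprinting by compiled code is an assumption about the detection detail):

```python
import hashlib

def projection_hash(projection) -> str:
    # Fingerprint the projection by its compiled code; any edit to the
    # handler changes the hash. (The hashing scheme here is an assumption.)
    return hashlib.sha1(projection.__code__.co_code).hexdigest()

def rebuild_changed(projections, stored_hashes, replay):
    """On startup: replay all events into any projection whose code changed
    since the last run, or which is entirely new; leave the rest alone."""
    for projection in projections:
        current = projection_hash(projection)
        name = projection.__name__
        if stored_hashes.get(name) != current:
            replay(projection)          # rebuild this view from the event streams
            stored_hashes[name] = current
    return stored_hashes
```

Since events are the source of truth, throwing a view away and replaying everything into it is always safe.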
8. Cloud Failover Strategy
Using event sourcing enables an extremely simple and rather reliable failover strategy. It has the following goals:
- Replicate all data reliably to multiple clouds and datacenters within 1 second (for cases like full data loss, data corruption or loss of access to data)
- Have readonly web UI clones available at any moment (for full cloud crashes or load-balancer issues)
- Be able to enable full write functionality manually on the secondary cloud, once the primary cloud is considered down.
It is possible to push this further and even reach a zero-downtime strategy via things like change merging, higher write latencies or clever use of load balancers. However, this also brings complexity and additional costs which might not always be justified. Sometimes guaranteed recovery within a dozen minutes is more than adequate.
The most important aspect of this strategy is the ability to keep crucial customer data safe and secure even in the face of global cloud outages.
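Because the data is an append-only stream of events, the replication goal reduces to copying appends. A sketch (Python; the store interface and the best-effort error handling are assumptions, not the production design):

```python
class InMemoryStreamStore:
    """Trivial event store stand-in: stream name -> list of events."""
    def __init__(self):
        self.streams = {}
    def append(self, stream, event):
        self.streams.setdefault(stream, []).append(event)

class FailingStore:
    """Simulates a store in an unreachable cloud."""
    def append(self, stream, event):
        raise ConnectionError("secondary cloud unreachable")

class ReplicatedAppender:
    """Append every event to the primary store, then copy it to the
    secondaries; a lagging replica is repaired later by a catch-up process."""
    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries

    def append(self, stream, event):
        self.primary.append(stream, event)     # must succeed, or the write fails
        for store in self.secondaries:
            try:
                store.append(stream, event)    # best-effort, fast replication
            except Exception:
                pass                           # never blocks writes on a dead replica
```

The secondaries can then feed read-only UI clones at any moment, and one of them can be promoted to primary by hand.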
9. Aim for Personal Freedom
Hub was developed and tested entirely on a small MacBook Air 2011, in Visual Studio running inside a Parallels virtual machine. The laptop is pretty good and robust.
However, no matter how capable this laptop is, its dual-core processor can't be compared with a powerful workstation or PC running Windows natively. So running SQL + Azure Storage Emulator + Azure Compute Emulator in debugging mode would grind it to a halt, making Azure development impossible in such a setup.
Fortunately, Lokad.Hub does not need these technologies in order to be developed for the cloud.
This setup gives ultimate development freedom, since the laptop is extremely lightweight, rather performant and has an impressive battery life. There is a non-Mac equivalent out there as well: the Asus ZenBook.
This list is by no means comprehensive. It does not include things that were already discussed in previous case studies. In particular, I would recommend checking out the "Reducing Development Friction" section in the Lokad Shelfcheck case study.
The first deployment is worth a separate story. Almost every technological investment made into the Lokad.CQRS stack paid off at this stage. I was gradually preparing Hub for its first release when a Windows Azure outage hit us, making all Azure-hosted services unavailable. The only services that stayed alive were those outside Azure or those that could be failed over to a different cloud. The legacy version of Hub was too coupled with Windows Azure to be moved anywhere else (in hindsight, a really reckless decision).
So I decided to push Hub to production earlier than expected, at least for the sake of providing a read-only view of our subscription management UI (this was better than a "Server is not available" message). Since the deployment went well, the system was then switched on completely (for the first time with production data).
Then a glitch in my code was discovered: it caused the system to settle and calculate monthly invoices earlier than planned. Fortunately, event sourcing allows rolling a system back to any given point in time. So I simply went back to the point in time before the glitch occurred, fixed the code and restarted the system.
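The rollback trick relies on nothing more than truncating the replay. A sketch under invented names (the real invoicing events are far richer):

```python
from datetime import datetime

def state_at(timestamped_events, point_in_time, apply, initial):
    """Rebuild state by replaying only events recorded strictly before
    the given moment; with event sourcing, rolling back is just this."""
    state = initial
    for recorded_at, event in timestamped_events:
        if recorded_at >= point_in_time:
            break                 # drop everything from the glitch onward
        state = apply(state, event)
    return state

# Hypothetical example: charges of 10 and 20, then a glitchy invoice of 999.
events = [(datetime(2012, 3, 1), 10),
          (datetime(2012, 3, 2), 20),
          (datetime(2012, 3, 3), 999)]
balance = state_at(events, datetime(2012, 3, 3), lambda s, e: s + e, 0)
```

Rolling back, fixing the code and restarting then means: truncate the streams at that moment and let the corrected system continue from there.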
Hub went live while other systems were struggling to survive the outage. Since then it has been a pleasure to work with.
Problems and limitations
The biggest limitation during the development of the project was related to the extreme value and importance of a proper DDD process. Literally weeks of effort (fortunately, one man's effort) were wasted because the initial DDD model was done wrong.
To be precise, they were not exactly wasted, but rather invested into understanding the area where the company derives one of its competitive advantages. Still, it was rather frustrating in the beginning (the business benefits of this decision started showing up only later).
These are the issues discovered closer to the first release (or after):
- High-level pictures become extremely important when you need to integrate a dozen bounded contexts or more. There should also be infrastructure capable of supporting that in a clean, enabling way.
- Simple file-based event streams (or blob-based ones for cloud deployments) become a bit limiting once you have more than 100,000 events to keep and replay. A dedicated, simple event-streaming server would be beneficial.