Wednesday, 11 November 2009
One line of trace
Say we eventually get up to 3,000 transactions per second at peak; then each transaction will be using around 300 microseconds of CPU.
Now it turns out that we have left in one line of console trace output saying "I have written transaction X", which takes roughly 60 microseconds on a single core - about 15 microseconds of aggregate horsepower on our quad-core box.
So the point is: one line of console trace will use about 5% of available CPU when we reach our target ... is it really worth it? I sense another configuration option crawling out of the woodwork ...
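If that configuration option does appear, it will probably be nothing more than a guarded print. Here is a minimal sketch, assuming a hypothetical cloudtran.traceWrites system property - the property name and class are illustrative, not the actual CloudTran configuration:

```java
// Hypothetical sketch of a configurable trace guard - the property name and
// class are illustrative, not CloudTran's actual configuration API.
public class TxTrace {
    // e.g. -Dcloudtran.traceWrites=true on the command line
    private static final boolean TRACE_WRITES =
            Boolean.getBoolean("cloudtran.traceWrites");

    public static void traceWrite(long txId) {
        if (TRACE_WRITES) {
            // ~60 microseconds on one core when enabled; just a boolean check
            // (a few nanoseconds) when disabled.
            System.out.println("I have written transaction " + txId);
        }
    }
}
```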
Tuesday, 3 November 2009
Thousands of transactions per second
We've recently been performance testing CloudTran and we're now doing over 2,000 transactions per second. This is on a single-CPU, quad-core box running the transaction buffer and being bombarded from other machines over the network.
We spent ages stuck at around 900 transactions per second, looking at every part of the system ... until we realised we were using the wrong variant of the test. We assumed we were running the performance mode of the test; in fact, we were running the "isolation" test, which checks that transactions working on the same record/row observe isolation correctly. So we've tested isolation to death, inadvertently. It wasn't all wasted time, though - when you look at code with suspicious eyes and a particular problem in mind, you notice angles you'd missed. Performance testing is certainly interesting for tech-heads. We've moved from fairly simple synchronisation techniques to Java's concurrency library and atomic variables with non-blocking synchronisation.
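To give a flavour of the sort of change involved (this is illustrative, not CloudTran source), moving a contended synchronised counter to an atomic one looks like this:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only - the point is the shape of the change: a contended
// synchronized block becomes a lock-free atomic update.
class TxCounterBefore {
    private long committed;
    synchronized void onCommit() { committed++; }     // every thread queues on one lock
    synchronized long committed() { return committed; }
}

class TxCounterAfter {
    private final AtomicLong committed = new AtomicLong();
    void onCommit() { committed.incrementAndGet(); }  // non-blocking compare-and-swap
    long committed() { return committed.get(); }
}
```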
To do functional debugging, we use heavy tracing of anything that moves; for performance, we use 'OperationTimer' objects that time the start and end of an operation. Analysing these logs is exciting - you have to be very careful that you're measuring what you think you are. Our life has been made much easier by having coordinated nanosecond timers, so that all the bombarder machines calculate their time relative to the TxB machine. Then, instead of looking at all the logs, we take a one-second slice of the OpTimer output (about 10,000 lines) and sort it on the time of the log entry. The important thing is that we can then trace an individual transaction's path between the various boxes quite easily - the sorted trace is in global time order.
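We won't reproduce the real OperationTimer here, but a minimal sketch of the idea - nanosecond start and end stamps, adjusted by a per-machine offset so every machine logs on the TxB machine's timeline - might look like this (the offset calibration itself is assumed, not shown):

```java
// Minimal sketch only - the real OperationTimer carries more context
// (transaction id, machine id) and buffers its output off the hot path.
public class OperationTimer {
    // Offset of this machine's System.nanoTime() relative to the TxB machine,
    // established once by a clock-coordination exchange (assumed, not shown).
    private static volatile long offsetToTxbNanos;

    private final String operation;
    private final long startNanos;

    public OperationTimer(String operation) {
        this.operation = operation;
        this.startNanos = System.nanoTime() + offsetToTxbNanos;
    }

    public void end() {
        long endNanos = System.nanoTime() + offsetToTxbNanos;
        // One line per operation; sorting all machines' lines on the first
        // field gives a single trace in global time order.
        System.out.println(startNanos + " " + (endNanos - startNanos) + " " + operation);
    }
}
```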
Unfortunately, we're now bumping up against Heisenberg's uncertainty principle - taking just the OpTimer measurements reduces speed by a factor of 3, which almost certainly changes the nature of the events ... timing is everything! So I suspect we'll end up using something like Dynatrace in the near future.
Another point to watch out for is the server VM's HotSpot optimisation - this affects the performance of the GigaSpaces GSCs. We normally run 50,000-100,000 transactions through the system before taking any timings, which suggests the bar HotSpot sets before optimising hot code is very high in the server configuration.
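In practice that just means a warm-up phase in the test driver before the timed run - roughly like this sketch, where runTransaction() is a hypothetical stand-in for driving one transaction through the system:

```java
// Sketch of the warm-up convention, not our actual test driver.
public class TimedRun {
    void runTimedTest(int warmupTxs, int measuredTxs) {
        for (int i = 0; i < warmupTxs; i++) {    // e.g. 50,000-100,000 transactions
            runTransaction();                     // gives HotSpot's server compiler time to kick in
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredTxs; i++) {
            runTransaction();
        }
        long elapsedNanos = System.nanoTime() - start;
        System.out.printf("%.0f tx/sec%n", measuredTxs / (elapsedNanos / 1e9));
    }

    void runTransaction() {
        // stand-in: fire one transaction at the transaction buffer and wait for the reply
    }
}
```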
Testing continues as we use the results to continually improve performance. And for now it's back to the logs.
Tuesday, 27 October 2009
CloudTran and Google App Engine – a comparison, no offence ....
Here are my aide-memoire notes ......
What can CloudTran do that GAE-DT can't?
1. GAE-DT is specific to GAE; CloudTran is a universal plug-in for messaging systems and any other type of persistence source.
2. CloudTran supports many data sources and message transports simultaneously. It can send data to local databases or remote databases across secure links.
3. With GAE-DT, an object that participates in a distributed transaction must no longer be written by a Local Transaction. CloudTran's Distributed Transactions (DTs), by contrast, play nicely with local transactions: a programmer can operate directly on elements in a DT using a Local Transaction (LT). (See 'stored procedures' below.)
4. GAE-DT: the namespace of named objects must be partitioned into DT-flavoured names and LT-flavoured names. This is not necessary in CloudTran: you can use whatever objects you want.
5. GAE-DT queries do not honour Local Transactions, and therefore don't honour Distributed Transactions either. A distributed search in CloudTran can either (a) see committed versions of objects only or (b) additionally see objects in the process of being changed in a DT.
6. CloudTran provides a layered solution that can be used by different developers:
- Distributed Transactions, Coordination and Control;
- In-memory database, pretending to be rows in an RDBMS, with automatic load from, and persistence to, the database;
- Java-level API (ORM capability).
7. We don't have to go to the database to tell if there is or is not a conflict (simultaneous update). This is because the space IS the database so we can tell immediately if there is a problem.
8. We have simpler conflict and deadlock resolution (at least, it seems like that from reading the notes!). Application programmers can specify a wait time - if a row/object is busy, we will wait for it - and the overall transaction can have a timeout. We don't have to worry about roll-forward or deadlock resolution. (There is a sketch of this idea after the list.)
9. Ordering of transactions, aka "herding cats". GAE-DT wants user-synchronous order: it attempts to complete DTs in the order they were requested, and allows the user some control over how this is done.
In CloudTran we do things slightly differently, to provide the effect of sequential consistency (http://en.wikipedia.org/wiki/Sequential_consistency):
- We hold up any overlapping transactions. The application programmer can determine how long to wait;
- In case of deadlock, one transaction will timeout - the other may too, or it may continue normally, depending on timing. When I worked for a large bank, the number of TP transactions that had a conflict was 1 in 2-3 million; timing out is a simple way of resolving these deadlock issues without needing more data to herd the cats;
- When a DT commits, it is guaranteed to be disjoint from any other in-flight DT. Therefore, the important thing is the order of completion;
- We guarantee that the order of committing at the nodes and the order of persisting to data sources both preserve that completion order, so there is no ordering issue. The timing of these later stages is "eventual consistency".
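The flavour of points 8 and 9 - wait a bounded time for a busy row, and let the timeout break any deadlock - can be shown with plain java.util.concurrent primitives. This is an illustration of the idea only, not CloudTran's implementation:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustration of timeout-based conflict/deadlock handling, not CloudTran code.
// Each row/object guards itself with a lock; a transaction that cannot get the
// lock within its wait time simply gives up and rolls back - no deadlock
// detector or roll-forward logic is needed.
class Row {
    private final ReentrantLock lock = new ReentrantLock();

    boolean updateWithin(long waitMillis, Runnable change) throws InterruptedException {
        if (!lock.tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
            return false;            // row busy too long: the caller times the transaction out
        }
        try {
            change.run();            // apply the change while holding the row
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```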
Similarities
1. A single primary key on persisted objects. Object identity ≡ primary key. Object keys are never reused.
2. Entity Groups - a grouping of types of entity that will benefit from locality of reference. This is how we got "optimal" performance - but how much locality of reference we actually get depends on the application's patterns of use.
3. Scalable in terms of numbers of objects involved in DT and number of nodes involved.
4. Use of an underlying layer to provide a local transaction. CloudTran uses GigaSpaces; GAE-DT uses BigTable. Beware: there are differences in the concept of local transactions - but they do both mean a data transaction on one "System of Record".
5. Support for optimistic locking.
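On point 5, the classic shape of optimistic locking - check that the version you read is still current before writing, otherwise retry - is, in generic form (this is neither product's API):

```java
// Generic version-check optimistic locking, for illustration only -
// this is not the GigaSpaces or GAE API.
public class OptimisticUpdate {

    static class VersionedObject {
        final long id;
        final long version;       // bumped on every successful write
        final String payload;

        VersionedObject(long id, long version, String payload) {
            this.id = id;
            this.version = version;
            this.payload = payload;
        }
    }

    interface Store {
        VersionedObject read(long id);
        // Returns false if the stored version no longer matches expectedVersion.
        boolean writeIfVersionMatches(VersionedObject updated, long expectedVersion);
    }

    // Typical caller: read, modify, conditionally write; on a clash, retry.
    static void update(Store store, long id, String newPayload) {
        while (true) {
            VersionedObject current = store.read(id);
            VersionedObject updated =
                    new VersionedObject(id, current.version + 1, newPayload);
            if (store.writeIfVersionMatches(updated, current.version)) {
                return;
            }
            // another writer got in first - re-read and try again
        }
    }
}
```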
Differences (just different, no advantage either way).
1. GAE-DT's underlying model is an object-oriented store. For CloudTran, it is the GigaSpaces JavaSpace, which looks much like a database (objects look like rows) when the target data source is a database. However, complete documents or other complex objects (e.g. a complete XML document as an object tree) can also be handled as single entities if the target data source supports it.
2. GAE-DT serves an object from a database; CloudTran serves it from an in-memory database, which is quicker but more expensive to scale than the GAE approach.
3. Use of an object cache. For CloudTran, this 'cache' is the GigaSpaces local transaction.
The Joy of Scalability
The joy of scalability is a many faceted thing ... such as, losing confidence in the trusty tools in your software library.
I stumbled over this recently with Java's DelayQueue, part of the concurrent library. Seeing DelayQueue call the compareTo() method on our Transaction Control Blocks a lot made me suspicious that this would become a performance bottleneck.
We use DelayQueues of transactions to time out or retry internal operations, so it's crucial that the take() method observes timeout order. However, we also need a fast remove() method based on the object, because normally our transactions won't time out - and we must get them off the DelayQueue as fast as possible. Because items typically won't have the same timeout, insertions into the queue will be somewhat random. In summary, we need: take() in timeout order, a fast object-based remove(), and both to stay fast as the queue grows.
Our target for the transaction manager node is 10,000 transactions a second. If the average transaction is in the delay queue for 1 second then our DelayQueue will be 10,000 items long. And if the primary transaction manager fails over to the secondary and that takes a while, then the DelayQueue could easily have 50,000 items in it.
To figure out how scalable DelayQueue is, I tried reading the code ... but it's complicated enough to need a test.
Our tester is single-threaded: it loads up the queue with, say, 10,000 entries and then removes 10% (1,000) of the entries and adds 10% back, using remove() and put().
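A minimal version of that tester looks roughly like the following - the element class and the fixed sizes are illustrative, not our actual harness:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Sketch of the single-threaded DelayQueue exercise described above -
// illustrative only, not our actual test harness.
public class DelayQueueBench {

    // Stand-in for a Transaction Control Block: expires at a fixed nanosecond time.
    static class Entry implements Delayed {
        final long expiryNanos;

        Entry(long delayMillis) {
            this.expiryNanos = System.nanoTime()
                    + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        }

        public long getDelay(TimeUnit unit) {
            return unit.convert(expiryNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        public int compareTo(Delayed other) {
            long diff = getDelay(TimeUnit.NANOSECONDS) - other.getDelay(TimeUnit.NANOSECONDS);
            return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
        }
    }

    public static void main(String[] args) {
        int size = 10000;                         // also run with 50,000 and 100,000
        DelayQueue<Entry> queue = new DelayQueue<Entry>();
        List<Entry> entries = new ArrayList<Entry>();
        for (int i = 0; i < size; i++) {
            Entry e = new Entry(60000 + i);       // long delays, so nothing expires mid-test
            entries.add(e);
            queue.put(e);
        }

        int churn = size / 10;                    // remove and re-add 10% of the entries
        long t0 = System.nanoTime();
        for (int i = 0; i < churn; i++) {
            queue.remove(entries.get(i));         // object-based remove - the expensive call
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < churn; i++) {
            queue.put(new Entry(60000 + i));
        }
        long t2 = System.nanoTime();

        System.out.println("remove: " + (t1 - t0) / churn + " nanos/op, "
                + "add: " + (t2 - t1) / churn + " nanos/op");
    }
}
```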
The timings on my Dell Latitude E6500 (T9600 @ 2.8GHz) to do one operation were:

| Queue size | 10,000 | 50,000 | 100,000 |
|------------|--------|--------|---------|
| Add | 300 nanos (0.3 micros) | 360 nanos | 400 (or 900) nanos |
| Remove | 33.5 micros | 155 micros | 340 micros |
These numbers suggest that add is pretty close to stable as the queue size increases, but remove looks like it is doing a linear search for the exact object (it doesn't do any 'compareTo()' calls to try to zero in on the object to remove).
The bottom line for us is that this is good enough this year. At steady state, removing from the delay queue will only cost about 8.5 microseconds of aggregate horsepower per transaction - roughly 8.5% of a quad-core i7 machine at our 10,000 transactions-per-second target (each core is about the same power as the T9600 core). However, if we start getting 25 or 50 cores on a machine, then the DelayQueue will start to become a bottleneck.
So I think DelayQueue can go back in the trusted tool box for now - a tribute to the developers of the Java Concurrent library.
Monday, 19 October 2009
Two's Company
We have been working on CloudTran for over a year now, and finally managed to come up with a design that works.
In fact, we had a V1 design in 12/2008, V2 in 4-5/2009, and finally the design that worked in 6/2009. We're indebted to Dan Stone of Scalable Application Strategies (www.scapps.co.uk) for his creative destruction of our constructions, and for mapping the guts of the system into the GigaSpaces APIs.
Until a few weeks ago, we couldn't see any other work that seemed to be similar. Finally, we saw some work from Google AppEngine - http://code.google.com/events/io/2009/sessions/DesignDistributedTransactionLayerAppEngine.html, (GAE-DT), with a good introduction to the issues in the PDF for the session.
What is comforting is that another company thinks that scalable transactions in the cloud are (a) worthwhile and (b) possible.
It was uncanny reading the GAE-DT design description - the first few pages could have been our V2 design from April/May. However, there are important differences.
We will post a detailed analysis on the CloudTran (www.nte.co.uk/java/CloudTran) web site. (Or see the blog post of 27th October.)
Thursday, 15 October 2009
Herding Cats
What I couldn't understand when I started looking at this area was - why doesn't someone fix the darn problem for databases as they are now? I suspect the answer is that it's the old "herding cats" problem... Imagine a poor developer with the business brief:
There is a room containing an arbitrary number of cats with numbers on their backs. You will be given a list of lines, each with a cat number, a word to write on the cat's message pad and a door number. For each line on your list:
- catch the correct cat
- cross out the word currently on kitty's message pad, and write the requested word
- catch another cat, write a copy of the word on his message pad,
- herd the cats through the door indicated on the list, making sure they go through in the same order as on the list
- if a cat falls asleep or runs off, catch a spare cat and use it.
Distributed transaction systems, within their design parameters, have handled the problem of providing the ACID features in the face of
- failures and ...
- the impossibility of two machines knowing the same thing at the same time (... whatever that is in a distributed environment).
What has changed is the scale of the demands and the size of the design space:
- Demand for more performance - we're all so impatient now
- An increase in maximum size of system
- Scalability (the requirement that the application performance should scale linearly with the number of machines)
- The wide range of the design space - systems must be efficient from one or two machines up to 1,000s or 10,000s
- More design options - do we fix a problem at the database, in the network, in a container ... etc.
How niche is that?
I now believe that transactions in cloud-style environments – multiple machines in a high-speed grid – will be key to making clouds mainstream for application programmers using Java or .NET. The “multi-machine” approach to building systems is becoming the fashionable choice because:
- applications are getting bigger, more sophisticated and interconnected. Many applications now have millions of users online
- you can increase capacity in smaller increments
- performance – 1,000 modern machines could easily have 8TB of main memory and 8,000 CPU cores. This is becoming standard rather than supercomputer level;
- there will always come a point at which a single machine runs out of capacity.
Monday, 5 October 2009
The ACID test for Clouds
When I first looked at transactions on the in-memory data grid, I found:
- confusion over the term "transaction", which means different things to a DBMS and to GigaSpaces;
- no transaction capability that old men with an ACID database would recognise.
The more I looked for a solution to allow application programmers to easily write big, fast applications on the cloud using transactions, the more it seemed there wasn't one, and no appetite for one either. There were three stances from the cloud camp:
- "You don't understand" - you don't need to save in-memory data grid transactions to a database, because the grid is so reliable. Right - tell that to your CIO;
- "Eventual, non-consistent save is good enough". Well, that's at least got the 'D' part of ACID ... but nothing else;
- Junk the RDBMS - solve the scalability and consistency problem in a different paradigm. Shame about the databases and information feeds that drive commerce and decision-making today.
On the other hand, it seemed like a valuable approach to provide a transaction layer:
- to match the speed and scalability of the cloud;
- matching up with all the existing data and add-ons around RDBMSs;
- in a way that normal developers can use.
Now we are on the point of launching CloudTran**, which supports as-fast-as-possible transactional applications in grids and clouds, with a scalable, fast and reliable transaction mechanism. Our first performance tests showed around 300 (distributed cloud) transactions per second, and the key machine - a single Intel i7 920 - was only at 8% CPU utilization, which implies we'll get thousands of transactions per second.
This series of blogs will explain some of the assumptions that make this possible - i.e. define the space for CloudTran**, compared with native and distributed transactions - and describe its unique aspects.
**Previously known as CloudSave.