Wednesday, 11 November 2009
One line of trace
Say we eventually get up to 3,000 transactions per second at peak; then each transaction will be using around 300 microseconds of CPU.
Now it turns out that we have left in one line of console trace output saying "I have written transaction X", which takes roughly 60 microseconds on a single core - about 15 microseconds of aggregate horsepower on our quad-core box.
So the point is: one line of console trace will use about 5% of available CPU when we reach our target ... is it really worth it? I sense another configuration option crawling out of the woodwork ...
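If that configuration option does appear, it will probably be nothing more than a guarded print. Here is a minimal sketch, assuming a hypothetical cloudtran.traceWrites system property - the property name and class are illustrative, not the actual CloudTran configuration:

```java
// Hypothetical sketch of a configurable trace guard - the property name and
// class are illustrative, not CloudTran's actual configuration API.
public class TxTrace {
    // e.g. -Dcloudtran.traceWrites=true on the command line
    private static final boolean TRACE_WRITES =
            Boolean.getBoolean("cloudtran.traceWrites");

    public static void traceWrite(long txId) {
        if (TRACE_WRITES) {
            // ~60 microseconds on one core when enabled; just a boolean check
            // (a few nanoseconds) when disabled.
            System.out.println("I have written transaction " + txId);
        }
    }
}
```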
Tuesday, 3 November 2009
Thousands of transactions per second
We've recently been performance testing CloudTran and we're now doing over 2,000 transactions per second. This is on a single-CPU, quad-core box running the transaction buffer and being bombarded from other machines over the network.
We spent ages stuck at around 900 transactions per second, looking at every part of the system ... until we realised we were using the wrong variant of the test. We assumed we were running the performance mode of the test; in fact, we were running the "isolation" test, which checks that transactions working on the same record/row observe isolation correctly. So we've tested isolation to death, inadvertently. It wasn't all wasted time, though - when you look at code with suspicious eyes and a particular problem in mind, you notice angles you'd missed. Performance testing is certainly interesting for tech-heads. We've moved from fairly simple synchronisation techniques to Java's concurrency library and atomic variables with non-blocking synchronisation.
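To give a flavour of the sort of change involved (this is illustrative, not CloudTran source), moving a contended synchronised counter to an atomic one looks like this:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only - the point is the shape of the change: a contended
// synchronized block becomes a lock-free atomic update.
class TxCounterBefore {
    private long committed;
    synchronized void onCommit() { committed++; }     // every thread queues on one lock
    synchronized long committed() { return committed; }
}

class TxCounterAfter {
    private final AtomicLong committed = new AtomicLong();
    void onCommit() { committed.incrementAndGet(); }  // non-blocking compare-and-swap
    long committed() { return committed.get(); }
}
```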
To do functional debugging, we use heavy tracing of anything that moves; for performance, we use 'OperationTimer' objects that time the start and end of an operation. Analysing these logs is exciting - you have to be very careful that you're measuring what you think you are. Our life has been made much easier by having coordinated nanosecond timers, so that all the bombarder machines calculate their time relative to the TxB machine. Then, instead of looking at all the logs, we take a one-second slice of the OpTimer output (about 10,000 lines) and sort it on the time of the log entry. The important thing is that we can then trace an individual transaction's path between the various boxes quite easily - the sorted trace is in global time order.
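We won't reproduce the real OperationTimer here, but a minimal sketch of the idea - nanosecond start and end stamps, adjusted by a per-machine offset so every machine logs on the TxB machine's timeline - might look like this (the offset calibration itself is assumed, not shown):

```java
// Minimal sketch only - the real OperationTimer carries more context
// (transaction id, machine id) and buffers its output off the hot path.
public class OperationTimer {
    // Offset of this machine's System.nanoTime() relative to the TxB machine,
    // established once by a clock-coordination exchange (assumed, not shown).
    private static volatile long offsetToTxbNanos;

    private final String operation;
    private final long startNanos;

    public OperationTimer(String operation) {
        this.operation = operation;
        this.startNanos = System.nanoTime() + offsetToTxbNanos;
    }

    public void end() {
        long endNanos = System.nanoTime() + offsetToTxbNanos;
        // One line per operation; sorting all machines' lines on the first
        // field gives a single trace in global time order.
        System.out.println(startNanos + " " + (endNanos - startNanos) + " " + operation);
    }
}
```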
Unfortunately, we're now bumping up against Heisenberg's uncertainty principle - taking just the OpTimer measurements reduces speed by a factor of 3, which almost certainly changes the nature of the events ... timing is everything! So I suspect we'll end up using something like Dynatrace in the near future.
Another point to watch out for is the server VM's HotSpot optimisation - this affects the performance of the GigaSpaces GSCs. We normally run 50,000-100,000 transactions through the system before taking any timings, which suggests the bar HotSpot sets before optimising hot code is very high in the server configuration.
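In practice that just means a warm-up phase in the test driver before the timed run - roughly like this sketch, where runTransaction() is a hypothetical stand-in for driving one transaction through the system:

```java
// Sketch of the warm-up convention, not our actual test driver.
public class TimedRun {
    void runTimedTest(int warmupTxs, int measuredTxs) {
        for (int i = 0; i < warmupTxs; i++) {    // e.g. 50,000-100,000 transactions
            runTransaction();                     // gives HotSpot's server compiler time to kick in
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredTxs; i++) {
            runTransaction();
        }
        long elapsedNanos = System.nanoTime() - start;
        System.out.printf("%.0f tx/sec%n", measuredTxs / (elapsedNanos / 1e9));
    }

    void runTransaction() {
        // stand-in: fire one transaction at the transaction buffer and wait for the reply
    }
}
```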
Testing continues as we use the results to continually improve performance. And for now it's back to the logs.
Tuesday, 27 October 2009
CloudTran and Google App Engine – a comparison, no offence ....
Here are my aide-memoire notes ......
What can CloudTran do that GAE-DT can't?
1. GAE-DT is specific to GAE; CloudTran is a universal plug-in for messaging systems and any other type of persistence source.
2. CloudTran supports many data sources and message transports simultaneously. It can send data to local databases or remote databases across secure links.
3. With GAE-DT, an object that participates in a distributed transaction must no longer be written by a Local Transaction. CloudTran's Distributed Transactions (DTs), by contrast, play nicely with local transactions: a programmer can operate directly on elements in a DT using a Local Transaction (LT). (See 'stored procedures' below.)
4. GAE-DT: the namespace of named objects must be partitioned into DT-flavoured names and LT-flavoured names. This is not necessary in CloudTran: you can use whatever objects you want.
5. GAE-DT queries do not honour Local Transactions, and therefore don't honour Distributed Transactions either. A distributed search in CloudTran can either (a) see committed versions of objects only or (b) additionally see objects in the process of being changed in a DT.
6. CloudTran provides a layered solution that can be used by different developers:
- Distributed Transactions, Coordination and Control;
- In-memory database, pretending to be rows in an RDBMS, with automatic load from, and persistence to, the database;
- Java-level API (ORM capability).
7. We don't have to go to the database to tell if there is or is not a conflict (simultaneous update). This is because the space IS the database so we can tell immediately if there is a problem.
8. We have simpler conflict and deadlock resolution (at least, it seems like that from reading the notes!). Application programmers can specify a wait time - if a row/object is busy, we will wait for it - and the overall transaction can have a timeout. We don't have to worry about roll-forward or deadlock resolution. (There is a sketch of this idea after the list.)
9. Ordering of transactions, aka "herding cats". GAE-DT wants user-synchronous order: it attempts to complete DTs in the order they were requested, and allows the user some control over how this is done.
In CloudTran we do things slightly differently, to provide the effect of sequential consistency (http://en.wikipedia.org/wiki/Sequential_consistency):
- We hold up any overlapping transactions. The application programmer can determine how long to wait;
- In case of deadlock, one transaction will timeout - the other may too, or it may continue normally, depending on timing. When I worked for a large bank, the number of TP transactions that had a conflict was 1 in 2-3 million; timing out is a simple way of resolving these deadlock issues without needing more data to herd the cats;
- When a DT commits, it is guaranteed to be disjoint from any other in-flight DT. Therefore, the important thing is the order of completion;
- We guarantee that the order of committing at the nodes and the order of persisting to data sources both preserve that completion order, so there is no ordering issue. The timing of these later stages is "eventual consistency".
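The flavour of points 8 and 9 - wait a bounded time for a busy row, and let the timeout break any deadlock - can be shown with plain java.util.concurrent primitives. This is an illustration of the idea only, not CloudTran's implementation:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustration of timeout-based conflict/deadlock handling, not CloudTran code.
// Each row/object guards itself with a lock; a transaction that cannot get the
// lock within its wait time simply gives up and rolls back - no deadlock
// detector or roll-forward logic is needed.
class Row {
    private final ReentrantLock lock = new ReentrantLock();

    boolean updateWithin(long waitMillis, Runnable change) throws InterruptedException {
        if (!lock.tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
            return false;            // row busy too long: the caller times the transaction out
        }
        try {
            change.run();            // apply the change while holding the row
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```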
Similarities
1. A single primary key on persisted objects. Object identity ≡ primary key. Object keys are never reused.
2. Entity Groups - a grouping of types of entity that will benefit from locality of reference. This is how we got "optimal" performance - but how much locality of reference we actually get depends on the application's patterns of use.
3. Scalable in terms of numbers of objects involved in DT and number of nodes involved.
4. Use of an underlying layer to provide a local transaction. CloudTran uses GigaSpaces; GAE-DT uses BigTable. Beware: there are differences in the concept of local transactions - but they do both mean a data transaction on one "System of Record".
5. Support for optimistic locking.
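On point 5, the classic shape of optimistic locking - check that the version you read is still current before writing, otherwise retry - is, in generic form (this is neither product's API):

```java
// Generic version-check optimistic locking, for illustration only -
// this is not the GigaSpaces or GAE API.
public class OptimisticUpdate {

    static class VersionedObject {
        final long id;
        final long version;       // bumped on every successful write
        final String payload;

        VersionedObject(long id, long version, String payload) {
            this.id = id;
            this.version = version;
            this.payload = payload;
        }
    }

    interface Store {
        VersionedObject read(long id);
        // Returns false if the stored version no longer matches expectedVersion.
        boolean writeIfVersionMatches(VersionedObject updated, long expectedVersion);
    }

    // Typical caller: read, modify, conditionally write; on a clash, retry.
    static void update(Store store, long id, String newPayload) {
        while (true) {
            VersionedObject current = store.read(id);
            VersionedObject updated =
                    new VersionedObject(id, current.version + 1, newPayload);
            if (store.writeIfVersionMatches(updated, current.version)) {
                return;
            }
            // another writer got in first - re-read and try again
        }
    }
}
```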
Differences (just different, no advantage either way).
1. GAE-DT's underlying model is an object-oriented store. For CloudTran, it is the GigaSpaces JavaSpace, which looks much like a database (objects look like rows) when the target data source is a database. However, complete documents or other complex objects (e.g. a complete XML document as an object tree) can also be handled as single entities if the target data source supports it.
2. GAE-DT serves an object from a database; CloudTran serves it from an in-memory database, which is quicker but more expensive to scale than the GAE approach.
3. Use of an object cache. For CloudTran, this 'cache' is the GigaSpaces local transaction.
The Joy of Scalability
The joy of scalability is a many faceted thing ... such as, losing confidence in the trusty tools in your software library.
I stumbled over this recently with Java's DelayQueue, part of the concurrent library. Seeing DelayQueue call the compareTo() method on our Transaction Control Blocks a lot made me suspicious that this would become a performance bottleneck.
We use DelayQueues of transactions to time out or retry internal operations, so it's crucial that the take() method observes timeout order. However, we also need a fast remove() method based on the object, because normally our transactions won't time out - and we must get them off the DelayQueue as fast as possible. Because items typically won't have the same timeout, insertions into the queue will be somewhat random. In summary, we need: take() in timeout order, a fast object-based remove(), and both to stay fast as the queue grows.
Our target for the transaction manager node is 10,000 transactions a second. If the average transaction is in the delay queue for 1 second then our DelayQueue will be 10,000 items long. And if the primary transaction manager fails over to the secondary and that takes a while, then the DelayQueue could easily have 50,000 items in it.
To figure out how scalable DelayQueue is, I tried reading the code ... but it's complicated enough to need a test.
Our tester is single-threaded: it loads up the queue with, say, 10,000 entries and then removes 10% (1,000) of the entries and adds 10% back, using remove() and put().
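A minimal version of that tester looks roughly like the following - the element class and the fixed sizes are illustrative, not our actual harness:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Sketch of the single-threaded DelayQueue exercise described above -
// illustrative only, not our actual test harness.
public class DelayQueueBench {

    // Stand-in for a Transaction Control Block: expires at a fixed nanosecond time.
    static class Entry implements Delayed {
        final long expiryNanos;

        Entry(long delayMillis) {
            this.expiryNanos = System.nanoTime()
                    + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        }

        public long getDelay(TimeUnit unit) {
            return unit.convert(expiryNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        public int compareTo(Delayed other) {
            long diff = getDelay(TimeUnit.NANOSECONDS) - other.getDelay(TimeUnit.NANOSECONDS);
            return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
        }
    }

    public static void main(String[] args) {
        int size = 10000;                         // also run with 50,000 and 100,000
        DelayQueue<Entry> queue = new DelayQueue<Entry>();
        List<Entry> entries = new ArrayList<Entry>();
        for (int i = 0; i < size; i++) {
            Entry e = new Entry(60000 + i);       // long delays, so nothing expires mid-test
            entries.add(e);
            queue.put(e);
        }

        int churn = size / 10;                    // remove and re-add 10% of the entries
        long t0 = System.nanoTime();
        for (int i = 0; i < churn; i++) {
            queue.remove(entries.get(i));         // object-based remove - the expensive call
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < churn; i++) {
            queue.put(new Entry(60000 + i));
        }
        long t2 = System.nanoTime();

        System.out.println("remove: " + (t1 - t0) / churn + " nanos/op, "
                + "add: " + (t2 - t1) / churn + " nanos/op");
    }
}
```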
The timings on my Dell Latitude E6500 (T9600 @ 2.8GHz) to do one operation were:

| Queue size | 10,000 | 50,000 | 100,000 |
|------------|--------|--------|---------|
| Add | 300 nanos (0.3 micros) | 360 nanos | 400 (or 900) nanos |
| Remove | 33.5 micros | 155 micros | 340 micros |
These numbers suggest that add is pretty close to stable as the queue size increases, but remove looks like it is doing a linear search for the exact object (it doesn't do any 'compareTo()' calls to try to zero in on the object to remove).
The bottom line for us is that this is good enough this year. At steady state, removing from the delay queue will only cost about 8.5 microseconds of aggregate horsepower per transaction - roughly 8.5% of a quad-core i7 machine at our 10,000 transactions-per-second target (each core is about the same power as the T9600 core). However, if we start getting 25 or 50 cores on a machine, then the DelayQueue will start to become a bottleneck.
So I think DelayQueue can go back in the trusted tool box for now - a tribute to the developers of the Java Concurrent library.
Monday, 19 October 2009
Two's Company
We have been working on CloudTran for over a year now, and finally managed to come up with a design that works.
In fact, we had a V1 design in 12/2008, V2 in 4-5/2009, and finally the design that worked in 6/2009. We're indebted to Dan Stone of Scalable Application Strategies (www.scapps.co.uk) for his creative destruction of our constructions, and for mapping the guts of the system into the GigaSpaces APIs.
Until a few weeks ago, we couldn't see any other work that seemed to be similar. Finally, we saw some work from Google AppEngine - http://code.google.com/events/io/2009/sessions/DesignDistributedTransactionLayerAppEngine.html, (GAE-DT), with a good introduction to the issues in the PDF for the session.
What is comforting is that another company thinks that scalable transactions in the cloud are (a) worthwhile and (b) possible.
It was uncanny reading the GAE-DT design description - the first few pages could have been our V2 design from April/May. However, there are important differences.
We will post a detailed analysis on the CloudTran (www.nte.co.uk/java/CloudTran) web site. (Or see the blog post of 27th October.)
Thursday, 15 October 2009
Herding Cats
What I couldn't understand when I started looking at this area was - why doesn't someone fix the darn problem for databases as they are now? I suspect the answer is that it's the old "herding cats" problem... Imagine a poor developer with the business brief:
There is a room containing an arbitrary number of cats with numbers on their backs. You will be given a list of lines, each with a cat number, a word to write on the cat's message pad and a door number. For each line on your list:
- catch the correct cat
- cross out the word currently on kitty's message pad, and write the requested word
- catch another cat, write a copy of the word on his message pad,
- herd the cats through the door indicated on the list, making sure they go through in the same order as on the list
- if a cat falls asleep or runs off, catch a spare cat and use it.
Distributed transaction systems, within their design parameters, have handled the problem of providing the ACID features in the face of
- failures and ...
- the impossibility of two machines knowing the same thing at the same time (... whatever that is in a distributed environment).
What has changed is the scale of the demands and the size of the design space:
- Demand for more performance - we're all so impatient now
- An increase in maximum size of system
- Scalability (the requirement that the application performance should scale linearly with the number of machines)
- The wide range of the design space - systems must be efficient from one or two machines up to 1,000s or 10,000s
- More design options - do we fix a problem at the database, in the network, in a container ... etc.
How niche is that?
I now believe that transactions in cloud-style environments – multiple machines in a high-speed grid – will be key to making clouds mainstream for application programmers using Java or .NET. The “multi-machine” approach to building systems is becoming the fashionable choice because:
- applications are getting bigger, more sophisticated and interconnected. Many applications now have millions of users online
- you can increase capacity in smaller increments
- performance – 1,000 modern machines could easily have 8TB of main memory and 8,000 CPU cores. This is becoming standard rather than supercomputer level;
- there will always come a point at which a single machine runs out of capacity.
Monday, 5 October 2009
The ACID test for Clouds
When I first looked at transactions on the in-memory data grid, I found:
- confusion over the term "transaction", which means different things to a DBMS and to GigaSpaces;
- no transaction capability that old men with an ACID database would recognise.
The more I looked for a solution to allow application programmers to easily write big, fast applications on the cloud using transactions, the more it seemed there wasn't one, and no appetite for one either. There were three stances from the cloud camp:
- "You don't understand" - you don't need to save in-memory data grid transactions to a database, because the grid is so reliable. Right - tell that to your CIO;
- "Eventual, non-consistent save is good enough". Well, that's at least got the 'D' part of ACID ... but nothing else;
- Junk the RDBMS - solve the scalability and consistency problem in a different paradigm. Shame about the databases and information feeds that drive commerce and decision-making today.
On the other hand, it seemed like a valuable approach to provide a transaction layer:
- to match the speed and scalability of the cloud;
- matching up with all the existing data and add-ons around RDBMSs;
- in a way that normal developers can use.
Now we are on the point of launching CloudTran**, which supports as-fast-as-possible transactional applications in grids and clouds, with a scalable, fast and reliable transaction mechanism. Our first performance tests showed around 300 (distributed cloud) transactions per second, and the key machine - a single Intel i7 920 - was only at 8% CPU utilization, which implies we'll get thousands of transactions per second.
This series of blogs will explain some of the assumptions that make this possible - i.e. define the space for CloudTran**, compared with native and distributed transactions - and describe its unique aspects.
**Previously known as CloudSave.