Saturday 6 February 2010

CloudTran Reading list

I just got asked for some background reading on CloudTran. I've been meaning to give my introductory reading list for some time, so here goes.

The order here is hopefully a step-by-step guide to the issues; should be a bit easier for you than jumping in at the deep end like I did.

1. http://www.openspaces.org/display/DAE/GigaSpaces+PetClinic
This is what started it all off - see slides 13-15 in the slide show.

First, it indicates there's a fair amount of non-trivial work for your regular Java application developer. Second, this raises as many questions as it answers - in particular, if you have a lot of information, how do you distribute it across a grid and then how do you integrate a transaction with backend databases and other stores.

2. Pat Helland's Apostasy - Life Beyond Distributed Transactions - the original 'slit-your-wrists' exhortation and still the best.
Pat says, forget doing distributed transactions in a scalable application - "distributed transactions are too fragile and perform poorly".

Pat's statement of the problem is brilliant; but his solution would mean that application programmers would end up doing lots of infrastructure work, which in my experience is a no-no. Surely the better answer is to productise this infrastructure functionality, so application developers have a simple sandbox and can quickly deliver business results.

2a. This leads onto Todd Hoff's highscalability.com site, and articles like http://highscalability.com/amazon-architecture. Be afraid... it's all too complicated.

3. Andy Bechtolsheim's talk at HPTS: http://www.hpts.ws/session1/bechtolsheim.pdf.
And here is James Hamilton's one-page in-flight summary.

Andy was one of the original founders of Sun.

Bottom line for the next 10 years:

- Memory and CPU's will become cheaper
- More memory and more cores
- The bottleneck of access times to hard disks is going to get 10x worse, which will mean they are gradually phased out for live data
- Flash memory will take over mainstream applications for storage sizes > main memory. But how many writes can you get out of them...
4. Stanford's Case for RAMClouds. RAMClouds means 'all active data in memory rather than on disk'
And here's an easy-entry synposis of the same article.

By the time it was published, RAMClouds wasn't new ... but it does tie the previous paper into forward thinking about architecture, and gives a theoretical reasoning as to why RAMClouds will be one of the new architectures.

I actually saw Todd Hoff''s overview piece first - http://highscalability.com/are-cloud-based-memory-architectures-next-big-thing.
5. The requirement. Google "600 billion RFID" and go from there.
Basically, applications will continue to get larger. A million on-line users isn't worth shouting about today. This is the case for thinking about application architectures that will survive the next 10 years - there are going to be loads of customers out there wanting information now.
6. Performance matters (admittedly from Akamai marketing literature):
2006: Respond to users in 4 seconds

And we're getting more impatient:

2009: Respond to users in 2 seconds

This is the business driver: handle more customers and give a better experience (and get a competitive edge).
7. The fundamental platform: Julian Browne's Space-based architecture.
Also GigaSpaces white papers. This is based on JavaSpaces. Here is Bill Olivier's take on the big problems JavaSpaces solves - http://www.jisc.ac.uk/media/documents/programmes/jtap/jtap-055.pdf -- see section 2.3.1.

"Jini addresses the hard distributed computing problems of: network latency, memory access, partial failure, concurrency and consistency".

The big thing developers have trouble getting their head round, is that in a scalable system every failure event must be handled as part of the application. Most developers are used to letting ops worry about failure modes. It's really hard in a large-scale distributed environment to get this right.
8. How to distribute data for application programmers: partitioning and the entity group pattern. This answers the question, "how do I spread across nodes for best performance but easy management".
Billy Newport has a good overview of grids and partitioning.

Google App Engine defines Entity Groups as the limits of transactions.
In CloudTran we use this purely to define where information goes; cloud transactions can span entity group boundaries.
9. NoSQL (forget SQL) and BASE - An ACID Alternative. The database as we know it doesn't handle scalable applications and specialised requirements well.
The thrust of NoSQL (or 'not only SQL') is: if you really want to get scalable data, you can't have SQL and ACID charateristics. And there are certainly beyond SQL databases like BigTable that have highly specialised characteristics.

For a hilarious counter, see Brian Aker's talk.

In CloudTran we provide transactionality that can provide transactionality for SQL and no SQL, coordinating in-memory data with eventual consistency at the data sources. Some of SQL functionalities for joins has to be done by hand, but it's about 90% there.
_______________________________________

This should get you started Andrea.

No comments:

Post a Comment