Failover is an interesting aspect to tackle after putting applications on EC2 in a live production environment. Our first customer has been live on EC2 for over a year and has never been down a single time. We never had to reboot the Linux server, though we did have to restart our Java processes for upgrades. Touch wood. Incremental database backups go to S3 every minute, with a full daily backup at night. This is mostly an enterprise app, so the requirements on real-time failover are not very stringent. Shortly we are going live with a new customer-facing retail application that has a 24x7 uptime requirement. Scalability is not much of an issue, as our server is still operating under 65MB with response times under 50ms most of the time. Failover is the key aspect for us, passive or active, with active being the preferred option.
Scalr seems to be a decent option (and the only one I could find) for EC2 environments. It provides server-side clustering and also spawns new instances as load increases. Terracotta has an offering for the EC2 environment where the cache serves as the intermediate database for all practical purposes.
In our solution there is no HTTP session or cache holding data; every read and write operates on the database. Only a small portion of read-only configuration data is cached in the server. So the Java server operates in a stateless mode, meaning a browser client has no affinity to a server. If there are 3 servers, the client can use any of the 3 for its requests. So the client itself can take care of clustering and failover. When the client first reaches a server through a URL, the server loads the client and gives it the URLs of all the servers. The client discovers an available server from this list and sticks to it until that server is either down or too loaded. At that point the client switches to another server from its list. There is no need for a gatekeeper on the server side.
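The selection logic described above can be sketched roughly as follows. This is only an illustration and not our actual client code: the real client runs in a browser, and the class name `ServerSelector`, the method `pick` and the `isUsable` health check are all made up for the example. The health check is passed in so that "down or too loaded" can be defined however the application needs (for instance, a ping with a timeout).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: all names here are illustrative assumptions.
final class ServerSelector {
    private final List<String> serverUrls;    // list handed to the client on first load
    private final Predicate<String> isUsable; // assumed health check: false if down or too loaded
    private String current;                   // server the client is currently sticking to

    ServerSelector(List<String> serverUrls, Predicate<String> isUsable) {
        this.serverUrls = new ArrayList<>(serverUrls);
        this.isUsable = isUsable;
    }

    /** Returns the server to use, switching only when the current one is unusable. */
    String pick() {
        if (current != null && isUsable.test(current)) {
            return current; // stay on the same server while it is healthy
        }
        for (String url : serverUrls) {
            if (isUsable.test(url)) {
                current = url; // stick to the first usable server found
                return current;
            }
        }
        current = null;
        throw new IllegalStateException("no server available");
    }
}
```

A client would call `pick()` before each request. Because the server holds no session state, switching servers between two requests needs no hand-over.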
Failover is largely needed at the database layer. As long as the database is replicated in real time (master-master replication), failover from a database perspective will be seamless. We are exploring the following solutions for this:
1) MySQL clustering. There are some trials of this, but none seem to be in a production EC2 setup.
2) C-JDBC or Sequoia: JDBC-level clustering, which seems the most transparent way. We have to check whether JGroups etc. can talk to EC2 servers that are only visible over HTTP.
3) HA-JDBC: Similar to C-JDBC.
We will check these out over the next couple of weeks and post the results.
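To make the idea of failover at the JDBC level a bit more concrete, here is a minimal sketch of the passive variant: try a list of database URLs in order and use the first one that accepts a connection. This is not how Sequoia or HA-JDBC work internally (they also replicate writes across databases), and it assumes the databases are already kept in sync by something else, such as master-master replication. The class `FailoverConnections` and the `Opener` interface are hypothetical names for the example. Only `DriverManager.getConnection` is a real JDBC call.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;

// Hypothetical sketch of passive failover; assumes replication is handled elsewhere.
final class FailoverConnections {
    /** Opens a connection to one database URL; abstracted so it can be replaced in tests. */
    interface Opener {
        Connection open(String jdbcUrl) throws SQLException;
    }

    /** Tries each URL in order and returns the first connection that opens. */
    static Connection connect(List<String> jdbcUrls, Opener opener) throws SQLException {
        SQLException last = null;
        for (String url : jdbcUrls) {
            try {
                return opener.open(url);
            } catch (SQLException e) {
                last = e; // remember the failure and try the next database
            }
        }
        throw new SQLException("no database reachable", last);
    }

    /** Convenience overload that uses the standard JDBC DriverManager. */
    static Connection connect(List<String> jdbcUrls) throws SQLException {
        return connect(jdbcUrls, DriverManager::getConnection);
    }
}
```

The obvious limitation is that a connection that is already open does not move over when its database dies. The caller has to catch the error and call `connect` again, which is one reason the transparent JDBC-level clustering tools look more attractive.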