High availability on EC2
The definition of availability:
Present and ready for use; at hand; accessible
Capable of being gotten; obtainable
Qualified and willing to be of service or assistance
Unavailability is not felt until the moment users reach out for the service and are not served. For me, the counting of unavailability starts the moment that absence is felt.
My application is a web-based application. The potential causes of unavailability are:
- Absence of the desktop machines.
- Absence of network connection.
- Inability to serve due to a fault:
  - Website not available
  - Application servers are down or hung
  - Slow performance and high response time
  - Software bug, so functionality is not available
To make a highly available application, the system requires:
- Mobile presence
- Smart client applications, with provision to browse on the client side when there is no network.
- The website is up all the time.
- All machines and mobile devices are functioning properly.
- All servers are functioning properly - Apache, Tomcat, MySQL, mail
- All service agents are functioning properly - SMS, email
All of this must hold during:
- Single node failure
- High load situation
- New application version deployment
- Regular machine patch management
- Hacking attacks, intrusion attacks
- Virus attacks
- Unavailability of external applications
and the system must resume serving immediately after a fault due to:
- Database failure
- System/node unavailability
- Application bugs
OK, the principles that guide me in providing high availability in my application are:
During Requirement
For me, availability is a functional requirement, so every requirement is also analyzed from this perspective. If the system cannot be reached over the World Wide Web, can the user perform some operations from a mobile device via SMS, or send an email to carry out the same operation?
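One way to keep the same operation reachable from SMS or email is to register each operation once and let every channel call it through a plain-text command. The sketch below is only an illustration: the STATUS command, the order_status function and the command format are assumptions, not part of my application.

```python
from typing import Callable, Dict

# Registry of operations, shared by every channel (web, SMS, email).
OPERATIONS: Dict[str, Callable[[str], str]] = {}


def operation(name: str):
    """Register a function as an operation reachable from any channel."""
    def register(func):
        OPERATIONS[name.upper()] = func
        return func
    return register


@operation("STATUS")
def order_status(argument: str) -> str:
    # Hypothetical operation; a real one would query the application.
    return f"Order {argument}: shipped"


def handle_text_command(text: str) -> str:
    """Run an operation from a plain-text command such as 'STATUS 42'.

    The same function can sit behind an SMS gateway callback or an
    email subject parser, so both channels reuse the web logic.
    """
    parts = text.strip().split(maxsplit=1)
    if not parts:
        return "Empty command"
    name = parts[0].upper()
    argument = parts[1] if len(parts) > 1 else ""
    handler = OPERATIONS.get(name)
    if handler is None:
        return f"Unknown command: {name}"
    return handler(argument)
```

An SMS reading "status 42" and an email with the subject "STATUS 42" would both end up in order_status, so a web outage does not remove the operation itself.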
During Architecture & Design
- Stateless architecture on Amazon EC2 to quickly provide more servers as the number of users increases.
- A central point for information updates and numerous points for information viewing, so that reads can scale. In my application, information is viewed around 80% of the time and modified around 20% of the time.
- Each component handles its own failure situations. For example, if the email agent cannot send mail using account X, it tries another email provider.
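The email agent's behaviour can be sketched as a loop over providers that stops at the first success. This is a minimal sketch under assumptions: send_with_fallback, AllProvidersFailed and the idea that a provider is a callable raising on failure are illustrative names, not the actual agent.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when no provider could deliver the message."""


# A provider is any callable that sends the message or raises on failure.
Provider = Callable[[str, str], None]


def send_with_fallback(providers: Sequence[Provider], to: str, body: str) -> int:
    """Try each provider in order; return the index of the one that worked."""
    errors = []
    for index, provider in enumerate(providers):
        try:
            provider(to, body)
            return index
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(exc)
    # Every provider failed; hand the collected errors to the caller.
    raise AllProvidersFailed(errors)
```

The returned index is useful for monitoring: if the agent keeps succeeding only on the second provider, account X probably needs attention.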
During execution - production environment
On production - During available period:
- Preparing a standby machine to take over during failure
- Monitoring the health and providing necessary support.
- Continuous revision of server deployment according to the measured data:
Step 1) Moving the database and other single points of failure to a dedicated machine.
Step 2) Scaling up the machine by adding more memory, CPU, disk and other resources to the existing machine.
Step 3) Hot backup and replication of all changed content.
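For the health monitoring mentioned above, a very small probe could check whether each server still accepts TCP connections. This is a sketch only: the host address and the ports 80, 8080 and 3306 are the usual defaults for Apache, Tomcat and MySQL and are assumed here, and the function names are illustrative.

```python
import socket
from typing import Callable, Dict, Tuple


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(
    services: Dict[str, Tuple[str, int]],
    probe: Callable[[str, int], bool] = is_port_open,
) -> Dict[str, bool]:
    """Probe every named service and report which ones are reachable."""
    return {name: probe(host, port) for name, (host, port) in services.items()}


# Assumed default ports on a single host; adjust to the real deployment.
SERVICES = {
    "apache": ("127.0.0.1", 80),
    "tomcat": ("127.0.0.1", 8080),
    "mysql": ("127.0.0.1", 3306),
}
```

Calling check_services(SERVICES) from a scheduled job gives a map of service name to reachability. An open port does not prove the server is healthy (a hung Tomcat can still accept connections), so a real monitor would probably add an application-level request on top of this.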
On production - During unavailable period:
The Preparedness
- Failover drill process. It forces the system to shut down and waits for the failover system to take over.
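The drill can be sketched as: stop the primary, then poll the standby until it serves or a deadline passes. The function below is an illustration under assumptions; stop_primary and standby_is_serving are hypothetical hooks that a real drill would implement (for example by stopping an instance and requesting a page from the standby).

```python
import time
from typing import Callable


def run_failover_drill(
    stop_primary: Callable[[], None],
    standby_is_serving: Callable[[], bool],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
) -> float:
    """Shut down the primary and wait for the standby to take over.

    Returns the observed takeover time in seconds, or raises TimeoutError
    if the standby is not serving before the timeout expires.
    """
    stop_primary()
    started = time.monotonic()
    while True:
        if standby_is_serving():
            return time.monotonic() - started
        if time.monotonic() - started >= timeout:
            raise TimeoutError("standby did not take over in time")
        time.sleep(poll_interval)
```

Recording the returned takeover time on every drill shows whether the unavailable period is shrinking or growing between drills.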