In the last post and this one I am writing about the ideas that went into our products. If you need details on the products, please visit the drapsa site.
Any BI product needs a good data store to power queries on a large dataset. Some of the work is done ahead of time by building cubes; ad hoc queries and aggregations are done on the fly. Sybase and other database vendors, including a couple of startups, have proposed column-oriented databases as a better model for the data store, since cubes aggregate data on a couple of columns at a time. The argument is compelling, and the results are there to see in live production environments.
In our design we made use of a search engine as the underlying data store. A search engine index is organized column-wise (field by field), much like a column-oriented database. Databases have largely grown from managing a small set of data very effectively with transaction semantics and ACID rules. As datasets grew in BI scenarios, more and more hardware and software designs were employed to cope, for example physical and logical disk partitioning. Search engines, on the other hand, started from the paradigm of managing a large set of data primarily for read operations, with no burden of supporting transaction semantics like a database. In that context, a BI product is ideally suited to be served by a search engine.
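To make the "column-wise index" point concrete, here is a minimal sketch in Python of a field-oriented index answering a BI-style aggregation. It is a toy under stated assumptions, not how our product or any real search engine is implemented; the class name FieldIndex and the sample fields are made up for illustration.

```python
from collections import defaultdict


class FieldIndex:
    """Toy field-oriented index: field -> value -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, doc):
        # Each field is indexed on its own, like a column.
        for field, value in doc.items():
            self.postings[field][value].add(doc_id)

    def match(self, field, value):
        """Ids of documents whose field equals value."""
        return set(self.postings[field].get(value, set()))

    def facet_counts(self, field, within=None):
        """Count documents per value of one field, optionally restricted
        to the document ids in `within` (the result of another match)."""
        counts = {}
        for value, ids in self.postings[field].items():
            hits = ids if within is None else ids & within
            if hits:
                counts[value] = len(hits)
        return counts


if __name__ == "__main__":
    index = FieldIndex()
    index.add(1, {"region": "south", "product": "rice"})
    index.add(2, {"region": "south", "product": "milk"})
    index.add(3, {"region": "north", "product": "rice"})
    # Sales count per product, for the south region only.
    print(index.facet_counts("product", within=index.match("region", "south")))
```

The aggregation touches only the two fields involved and never reads whole records, which is the property the column-oriented argument relies on.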
Using Flex on the front end reduces the server's job to just serving search results. The heavy lifting of preparing visually rich charts is done very well by Flex on the client. Most server queries return in under 50 ms, and data transfer stays under 10 KB with compression enabled. Flex is optimized for UI work, so the end-to-end response time is pretty good. If data is to be clustered (for redundancy or scalability) across search engine instances, the integration can happen at the UI, so there is no single point of failure and no additional load on the server.
We have recently worked on a shopping cart optimized for grocery shopping. The web version is ready for demo and the mobile version is in the works. You can experience it here. It is built in Flex, so the first load may take about 30 seconds.
But why build another shopping cart from scratch when the concept has been done to death in every conceivable form: free, commercial, custom built, SaaS? Every shopping experience has its own characteristics. Mobile shopping is largely centered on comparison (unless it is an iPhone); the Nokia site illustrated this nicely. Similarly, grocery shopping has to be done quickly, as the list of items is known and is typically a sizable one. We invite you to try it and share your feedback. Thanks.
Continuing the previous post comparing the onelineWeb framework with BlazeDS and GraniteDS, here is a summary of why we developed an internal framework.
1 - Scalability: It was impossible to introduce the plug-in architecture without changing the class loading of GraniteDS and iBATIS. The end result would have been one large monolithic package instead of smaller modules.
2 - Multi-client support: Browser, Windows Mobile, BlackBerry, mashup API (XML API) and distribution (e.g. RSS feeds, Outlook).
3 - Multi-server support: The client interfaces with multiple server components directly (search engine, content engine, application), all in a unified fashion: the XML way. XML is emitted directly from the search engine or content engine instead of being proxied via the Flex server-side component.
4 - Performance: Fast response times, achieved by removing all intermediate data exchange layers and compressing the data natively from the database layer (right out of the result set object), gaining speed at the network layer with Apache's native compressors and the browser's native decompressors.
5 - Lightweight: Fewer dependencies on third-party libraries.
Let the problem define the framework rather than the framework defining the solution. Simpler is better.
I am trying to put together a comparison of onelineWeb with BlazeDS and GraniteDS. There is one more decent implementation of remoting for Flex, from Hessian, at http://hessian.caucho.com/#Flash/Flex
Hands-on experience with the above: We started with GraniteDS (when it was in beta) for our product and used it for 3-4 months before giving it up. I have gone through the BlazeDS sample code and have a fairly good idea of how to go about using it if I have to.
First things first: BlazeDS is a port out of LiveCycle DS from Adobe, so it can be expected to have a decent start. GraniteDS seems to have evolved quite a bit in the last 6 months, including server data push with Gravity. It seems to be more integrated with EJB, JBoss, Spring and the rest of the Java software stack, so it may have an advantage over BlazeDS in that area. But if my need is remoting, I will go with BlazeDS because of its history. From here on I will compare with BlazeDS only; information on GraniteDS and Hessian can be obtained from their respective sites.
Messaging / Server push: BlazeDS has good support and covers the fallback case when clients and servers are behind firewalls. If messaging is the primary need, it is the way to go. onelineWeb does not have anything in this space.
Remoting Options: Remoting can be done in a text-based way (HTTP, web services) or in a binary way using remote objects.
Text (xml) based data transfer:
onelineWeb supports only text (XML) over HTTP. It does not support object graphs but does support lists. The object graph is built on the client by the application code, not the framework. All class definitions (Java and Flex) for database tables are code generated, including basic CRUD operations in Java. These definitions are then used for marshalling and unmarshalling data between XML and objects on both the server side and the client side. XML-based data transfer support in BlazeDS is bare minimum, so it has to be provisioned for by the application.
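As an illustration of marshalling flat lists (no object graph) between XML and table classes, here is a minimal Python sketch. It is not the onelineWeb code, which is Java and Flex; the Employee class stands in for a hypothetical generated table class, and the element names are my own assumption.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET


@dataclass
class Employee:
    """Hypothetical generated class: one per database table."""
    id: int
    name: str
    dept: str


def to_xml(records, root_tag="list"):
    """Marshal a flat list of records into one XML string."""
    root = ET.Element(root_tag)
    for rec in records:
        item = ET.SubElement(root, type(rec).__name__.lower())
        for f in fields(rec):
            ET.SubElement(item, f.name).text = str(getattr(rec, f.name))
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text, cls):
    """Unmarshal the XML back into a list of cls instances.
    Each field's declared type (int, str) is used as the converter."""
    records = []
    for item in ET.fromstring(xml_text):
        kwargs = {f.name: f.type(item.findtext(f.name)) for f in fields(cls)}
        records.append(cls(**kwargs))
    return records


if __name__ == "__main__":
    staff = [Employee(1, "Asha", "Sales"), Employee(2, "Ravi", "Ops")]
    wire = to_xml(staff)
    print(wire)
    print(from_xml(wire, Employee) == staff)
```

Because the same class definition drives both directions, server and client cannot drift apart on field names, which is the reason for generating the definitions instead of writing them by hand.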
Binary (object) based data transfer:
onelineWeb does not support binary transfer. BlazeDS has first-class support for it, including deep object graph serialization. For this mode to work, the Java object model has to be replicated in Flex as ActionScript classes. This is best done with some sort of code generation; otherwise differences between the class definitions can cause issues during transfer, and maintaining them by hand is a chore that cannot be sustained. The major drawback of a binary protocol is that clients must be able to deserialize the binary data, as Flex can. Other clients, such as regular HTTP clients or mobile phones, need a wrapper on the server side to communicate in a text format.
Bandwidth requirements are lowest in binary mode. One way to reduce the bandwidth requirement of the XML approach is to compress the content in Apache. This typically reduces the payload by more than 90%, as XML tags are very repetitive in nature.
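A quick way to check the compression claim on your own payloads is a sketch like the following. It uses Python's gzip module to stand in for the compression Apache would apply; the record layout is invented, and the saving you see will depend on your data.

```python
import gzip


def build_xml(record_count):
    """Build a repetitive XML payload similar to a list of table rows."""
    rows = "".join(
        f"<employee><id>{i}</id><name>Employee {i}</name><dept>Sales</dept></employee>"
        for i in range(record_count)
    )
    return f"<employees>{rows}</employees>".encode("utf-8")


def compression_report(payload):
    """Return (original size, compressed size, fraction saved)."""
    compressed = gzip.compress(payload)
    saved = 1 - len(compressed) / len(payload)
    return len(payload), len(compressed), saved


if __name__ == "__main__":
    original, compressed, saved = compression_report(build_xml(500))
    print(f"{original} bytes -> {compressed} bytes ({saved:.0%} saved)")
```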
Large data transfer:
onelineWeb can stream XML data straight from the database result without even creating Java objects. This greatly reduces server CPU and memory requirements and improves client response times. We are able to send hundreds of records this way. In binary transfer, by contrast, the object graph has to be built first.
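The streaming idea can be sketched as a generator that turns each database row into an XML fragment as it is read, so no per-row object is ever built. This is an illustration in Python with sqlite3, not the onelineWeb implementation (which works on a Java result set); the tag names and the function name are assumptions.

```python
import sqlite3
from xml.sax.saxutils import escape


def stream_rows_as_xml(cursor, row_tag="row", root_tag="rows"):
    """Yield XML chunks one row at a time, straight off the cursor."""
    columns = [d[0] for d in cursor.description]
    yield f"<{root_tag}>"
    for row in cursor:
        cells = "".join(
            f"<{col}>{escape(str(val))}</{col}>" for col, val in zip(columns, row)
        )
        yield f"<{row_tag}>{cells}</{row_tag}>"
    yield f"</{root_tag}>"


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE item (id INTEGER, name TEXT)")
    db.executemany("INSERT INTO item VALUES (?, ?)", [(1, "rice"), (2, "milk")])
    cursor = db.execute("SELECT id, name FROM item ORDER BY id")
    for chunk in stream_rows_as_xml(cursor):
        print(chunk)  # a web server would write each chunk to the response
```

Memory use stays flat regardless of the number of rows, since only one row is held at a time.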
Time to code:
How long does it take to start coding your first business class? Since all the client and server framework code is ready in onelineWeb, it gets a developer up and running much faster. As an application grows it needs a central piece of code to handle communication between server and client, which helps streamline error handling, performance measurement and so on. With BlazeDS, this central piece has to be built on top of it.
Modular Applications - Plugin Architecture:
Flex recommends building modules as opposed to one monolithic application. This helps in multiple ways: smaller SWF downloads at any one time, code separation and so on. onelineWeb has the development tools (Ant scripts) and runtime support for modules on both the client and the server side. Each module is packaged in its own directory structure and built as a separate JAR and SWF file. Only the required JARs are loaded into memory, and only the required SWFs are loaded into the browser. BlazeDS needs all classes to be visible on the classpath and hence cannot support a full-blown plugin architecture like that of Nutch or Eclipse.
I was recently at the Salesforce event held at Hotel Ashoka, Bangalore. The food was great :) The person sitting behind us introduced himself and said, "I am anti-Salesforce." Whoa. That was my opening conversation at the event. Well, I asked him, why was he there then? His answer was that his boss had sent him. My next question: why does he hate Salesforce?
This time the answer was really interesting. Most of their sales are made over the phone. Whatever they discuss on the phone, they then have to type again into the Salesforce application. I understand it is good for the company, since the boss wants the complete picture. But for the person doing it, it is an additional nuisance.
I am not going to talk about the rest of my evening here. I want to highlight this particular point.
Long ago, when I was developing a Pocket PC application at Fidelity, I came to know that a few sales agents recorded their conversations and then used Dragon speech recognition to get the text.
Why not a mashup of phone and application that shows a form already filled in based on your conversation? Finally, no nuisance for end users ;)
A friend gave me a problem: build a painless functional testing tool. They have all kinds of testing tools [Rational, Mercury, Agitar (unit testing)]. Still, they face the following difficulties:
- After some time the recorded scripts become unmanageable
- Too much to record for good coverage
- Skill is necessary to operate the testing tool and write the scripts
- Too much setup work and regular resistance from developers
- Data changes invalidate the recordings.
OK, if you are thinking that being disciplined will make it happen, you are right. But it pains :( Let the machine be disciplined and do the heavy lifting.
I took this problem and applied a search engine to solve it.
First, Preparing the request
Goals
- Record from all kinds of clients which access a server.
- Sense and create similar requests
- Minimal human touch
Steps
I will write TE for test engine. The TE server listens for requests from the client. After accepting a request, it establishes a new connection to the application server. TE now works as a conduit between the application client and the application server, and along the way it records all the communication. Putting TE in place is easy: edit the hosts file or set the web browser proxy.
Consider that our application is a web application. The browser client makes a GET or POST request to the application server. These requests are pretty clean. However, sometimes they carry some baggage from the past, such as:
- Session information and the like.
- Details of the last processing step
- Cookie information
TE needs to understand and know all this information. TE does so by passing the request through many filters. Here I am proposing a few filters to detect what the request parameters are:
- The recorded cookie
- The last results returned; if anything matches, we know the source (example: session ID).
- Coded entries in the forms of the application. E.g. instead of a real employee name, write empName. This is an instruction to TE to replace it with a real employee by accessing other data sources such as a template database, a test database, user-provided values, regular-expression-based random values... Other examples include phone 888-8888888.
- Direct specification of tablename.fieldname
- Matching the sample data type against the provided test data search index.
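Two of the filters above (values echoed from the last response, and coded entries such as empName) can be sketched in Python as follows. This is only an illustration of the idea, not TE itself; the function names, the parameter layout and the `sources` dictionary that stands in for the template or test database are all assumptions of mine.

```python
import random


def values_from_last_response(params, last_response):
    """Names of request parameters whose value appeared in the previous
    response, e.g. a session id the server handed out earlier."""
    return {key for key, value in params.items() if value and value in last_response}


def expand_request(params, sources, rng):
    """Copy params, replacing each coded entry (a value that is a key in
    `sources`, such as "empName") with a real value from that source."""
    expanded = {}
    for key, value in params.items():
        if value in sources:
            expanded[key] = rng.choice(sources[value])
        else:
            expanded[key] = value
    return expanded


def generate_requests(params, sources, count, seed=0):
    """Produce `count` similar requests from one recorded request."""
    rng = random.Random(seed)
    return [expand_request(params, sources, rng) for _ in range(count)]


if __name__ == "__main__":
    recorded = {"sid": "ab12cd", "employee": "empName", "action": "view"}
    sources = {"empName": ["Asha", "Ravi", "Meena"]}  # stand-in for a test database
    print(values_from_last_response(recorded, "<input name='sid' value='ab12cd'>"))
    for request in generate_requests(recorded, sources, count=3):
        print(request)
```

A parameter flagged by the first filter would be refreshed from the live response at play time instead of being replayed as recorded.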
Working on the response result
Goals
- Sense and perform first phase of validation.
- Picking only gold from the result
- Minimal human touch with sampling
Steps
TE generates a large number of similar requests and plays them automatically. TE captures the server responses. Most of the content is junk to us. To remove the junk, the response goes through these filters:
- Ignore images and other unneeded content such as CSS, JavaScript and XSLT.
- Parse the content of XML, HTML and SWF files.
- Throw away the unnecessary words. What is left is gold.
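The three filters above can be sketched for HTML responses like this. It is a minimal illustration using Python's standard html.parser, not the TE implementation; the skipped content types and the stop-word list are assumptions, and XML and SWF parsing are left out.

```python
from html.parser import HTMLParser

# Assumed lists; a real setup would make both configurable.
SKIPPED_TYPES = ("image/", "text/css", "application/javascript", "text/javascript")
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for"}


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside script/style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_gold(content_type, body):
    """Return the words worth validating from one server response."""
    if content_type.startswith(SKIPPED_TYPES):
        return []  # filter 1: ignore images, css, javascript
    parser = _TextExtractor()
    parser.feed(body)  # filter 2: parse the markup
    parser.close()
    words = " ".join(parser.chunks).split()
    return [w for w in words if w.lower() not in STOP_WORDS]  # filter 3


if __name__ == "__main__":
    page = "<html><body><p>Name of the employee: John</p></body></html>"
    print(extract_gold("text/html", page))
```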
Try to sense the results. This helps the system validate in the absence of the test engineer.
All these results are then presented to the test engineer, who validates the output and the sensed fields. Where needed, the test engineer maps a result field to a database table field. On the next play of the same request, TE knows what outputs to expect and can perform the first phase of validation.
TE puts a configured sample (5%) of the record-and-play dialogues up for validation.
Are you wondering why I need a search engine? I have put it in to boost performance. Once the test data tables go into a single search engine index, all tables and their fields are searchable, which enhances the sensing mechanism.
Why do you think this "will not work"?
This is my viewpoint on the database's role in software architecture over the last decade. I have a lot of respect for database vendors; they have made a very important data resource available with minimum fuss. Over the last decade or so, databases have gone from a prominent player in software architecture to a much needed but far less desirable component. I am referring to client-server VB/Unix/Oracle applications, where application development was fast and any significantly heavy processing was offloaded to the database or to a Unix batch process.
In the mid to late 90s Java gained prominence, and with it objects and middleware. Apps started to deal with objects instead of raw database records. A natural enhancement for a database would have been to provide an additional interface via objects. There were some attempts at this, but they never really picked up. At the same time ORM packages started to gain traction, as object-record mapping is a fairly straightforward procedure. Even though databases cached data in their own layer to speed up access, that proved to be of little value for complex object models. The cost of OR mapping is so high that, once done, it made sense to cache the resulting object. So objects got cached in the middleware, and caching products came into the picture.
The fallouts:
- The majority of developers who started in the last 5 years are familiar with ORM and caching products but have never dealt with databases in a significant manner, so they are not able to leverage the database optimally. ORMs have abstracted the database, but they did it sitting outside the database. So the tuning is done neither by the ORM nor by the developers.
- An organization dealing with huge databases has to provision dual infrastructure: one for the databases and another for caching the data. Typically, caching is done by pulling entire database tables into memory.
These are my viewpoints and observations. Please feel free to comment or send me a mail.
Recently my EC2 instance hung. The website was timing out, a clear signal that the web server was not responding and that no socket communication was happening. As the sockets were unresponsive, so was the SSH server. I was not able to log in using my SSH terminal, PuTTY.
I could not just shut down that instance and start another one, as I had production data in MySQL data files that had not been backed up. I had promised my customer 99% uptime and 256MBps connectivity, but what about reliability? Now they were going to lose production data. I was literally in shock. I desperately tried to get my data back. I googled and got depressed reading other people's comments. We then put an urgent request on the forum, and Amazon replied that nothing could be done.
After I cooled down a bit, it was time to fight back. The first thing I thought of trying was restarting the instance. However, I was not sure about my data. All my MySQL data files were in /mnt. So the question was: am I going to lose data if I restart my instance? Some links I referred to confused me more; I was left with the understanding that anything in /mnt could be lost. But /mnt was exactly what I needed. I took the decision to restart my server, the last resort.
I issued a restart command from my desktop. Oops... there was no effect. Then I went to another Amazon instance of mine, where I had the EC2 tools. I used those tools to issue a restart command and tried SSH again. The server was still hung. I sat tight for 5 minutes and connected again via PuTTY. Bingo, this time the server showed me the login prompt. My curiosity doubled: was my data still in /mnt? I changed directory to /mnt. Everything was there, intact. I just relaxed. Thank God, I survived a crash.
I remember a few years back, when I was running WebSphere, it was very difficult to kill an instance. On killing one instance, a daemon spawned another. On killing that one, one more came up. I tried "kill -9" and "kill -KILL": no help. When one went down, another came up. It was not that I had 2 process instances running and killing both would bring the application down completely. Rather, there was only one, and it was reborn each time it was murdered.
Why can this not happen for a server? If I consider that possible, the application is virtually immortal.
[I have taken this section out of my regular writing as I want to highlight one particular issue. Recently I had mapped a production instance to an Amazon EC2 instance in the web DNS mapping. I issued a shutdown command on the EC2 machine by mistake. It took me just 2 minutes to bring up a snapshot of the last instance, as I had been taking snapshots. Then I pointed the web DNS mapping to the new server address. This process took a good 30 minutes, and the system remained down all that time :(]
Consider that a mechanism is available where the web DNS mapping to the deployed server is instantaneous. In that world, I have an application that transcends server instances. The possibility hinges on snapshot creation. Before death, can I take an exact snapshot so that all subsequent instances start from there? Amazon EC2 has fast networking to S3, so network latency is not my concern. Hadoop is there to rescue me in transferring the big files. The big file is the tarball of the complete /mnt. If you have been reading my previous posts, you know you can shove everything into /mnt. That means the complete portion of the hard disk that interests me goes to S3. This saves the database files, other content, application state and everything else.
During shutdown:
- Create a snapshot tarball of /mnt, the portion of the hard disk I care about (about 30 seconds).
- Copy it to S3.
- Start another instance.
- When the next instance comes up, it configures itself during startup, taking the file from S3.
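The snapshot and restore steps above can be sketched with Python's tarfile module. This is a minimal sketch under stated assumptions: copy_to_store is a stand-in that copies to a local directory where the real thing would upload to S3, the function names are mine, and the demo uses a temporary directory instead of the real /mnt.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def snapshot(data_dir, archive_path):
    """Pack data_dir into a gzip tarball. Keep archive_path outside
    data_dir so the tarball does not try to include itself."""
    data_dir = Path(data_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(data_dir.rglob("*")):
            tar.add(path, arcname=str(path.relative_to(data_dir)), recursive=False)
    return Path(archive_path)


def copy_to_store(archive_path, store_dir):
    """Stand-in for the S3 upload: copy the tarball into store_dir."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(archive_path, store_dir))


def restore(archive_path, target_dir):
    """Unpack a snapshot on the freshly started instance."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)


def demo():
    """Round-trip a tiny fake /mnt through snapshot, copy and restore."""
    with tempfile.TemporaryDirectory() as work:
        work = Path(work)
        mnt = work / "mnt"
        (mnt / "mysql").mkdir(parents=True)
        (mnt / "mysql" / "orders.dat").write_text("order-1\n")
        tarball = snapshot(mnt, work / "mnt-snapshot.tar.gz")
        stored = copy_to_store(tarball, work / "store")
        restore(stored, work / "new-instance-mnt")
        return (work / "new-instance-mnt" / "mysql" / "orders.dat").read_text()


if __name__ == "__main__":
    print(demo())
```

For a live database the files should be quiesced (or the server stopped) before the tarball is taken, otherwise the snapshot may not be consistent.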
I have taken a snapshot and created a new instance quickly, but manually. It takes just a minute. How can I do it automatically?
First I looked for Linux shutdown hooks along the lines of init. No luck. If you are aware of any, please let me know. Now I am going to explore creating a blank Java application, starting it during init and putting a shutdown hook in it. This shutdown hook will catch the Linux shutdown. I concluded this from reading the post at http://forum.java.sun.com/thread.jspa?threadID=504889 . Once I finish this experiment, I will blog about my findings :) This will open up the following possibilities in EC2:
- Immortal application on a mortal server
- No information loss when an instance is shut down.
- Creating many app instances when load increases. [What about the database??? I have a big post coming on this. I am halfway through.]
Don't forget to help me implement a Linux shutdown hook rather than going the Java way... which is a workaround, not a clean solution :(
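The same workaround can be sketched without Java. On a normal Linux shutdown, running processes are sent SIGTERM before they are killed, so a small long-running process started at boot can catch that signal and do the snapshot work. The sketch below is in Python and is still a workaround of the same kind, not the clean Linux hook I am asking for; the function names and the registered step are assumptions, and the time available before the final kill is limited.

```python
import signal

shutdown_steps = []  # callables to run when the process is told to stop


def on_shutdown(func):
    """Register func to run at shutdown; usable as a decorator."""
    shutdown_steps.append(func)
    return func


def run_shutdown_steps():
    """Run every registered step in order; return how many ran."""
    for step in shutdown_steps:
        step()
    return len(shutdown_steps)


def handle_term(signum, frame):
    """SIGTERM handler: do the shutdown work, then exit cleanly."""
    run_shutdown_steps()
    raise SystemExit(0)


def install():
    """Install the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_term)


if __name__ == "__main__":
    @on_shutdown
    def save_state():
        # Hypothetical step: this is where the /mnt tarball would be
        # created and copied to S3.
        print("snapshot taken")

    install()
    print("hook installed; send SIGTERM to this process to trigger it")
    signal.pause()  # sleep until a signal arrives (Unix only)
```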
Database replication is a really useful feature. It provides backup, load balancing and data aggregation, to start with. A typical database replication design looks like this:
- A data modification statement on the master database is written to the database log
- These entries are read by a replication agent and put on a queue for child nodes to apply to their respective databases
- A checkpoint thread scans the master node's database log and clears the entries that have been read by the replication agent.
What happens when a child node goes down?
- Entries on the queue (where the master's replication agent puts them) are not removed by the child node, so the queue starts building up
- When the queue is full, the replication agent stops reading entries off the master database log. If it did read them, where would it put them?
- The master database transaction log starts to fill up, and when it is full it stops accepting inbound transactions, which means the master database becomes unavailable
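The failure sequence above can be reproduced with a small simulation. This is a toy model in Python, not any vendor's replication code: the capacities are tiny on purpose, the class and function names are mine, and the replication agent and checkpoint thread are folded into one step that moves entries from the log to the queue.

```python
from collections import deque


class Master:
    """Toy master node with a bounded transaction log and a bounded queue."""

    def __init__(self, log_capacity, queue_capacity):
        self.log = deque()
        self.queue = deque()
        self.log_capacity = log_capacity
        self.queue_capacity = queue_capacity

    def write(self, statement):
        """Accept a transaction unless the log is full."""
        if len(self.log) >= self.log_capacity:
            return False  # master is effectively unavailable
        self.log.append(statement)
        return True

    def run_replication_agent(self):
        """Move entries from the log to the queue while the queue has room
        (agent read plus checkpoint clear, combined for brevity)."""
        while self.log and len(self.queue) < self.queue_capacity:
            self.queue.append(self.log.popleft())

    def child_consume(self):
        """The child applies everything on the queue (only when it is up)."""
        applied = list(self.queue)
        self.queue.clear()
        return applied


def write_while_child_down(master, count):
    """Issue `count` writes with no child consuming; return each outcome."""
    outcomes = []
    for i in range(1, count + 1):
        outcomes.append(master.write(f"t{i}"))
        master.run_replication_agent()
    return outcomes


if __name__ == "__main__":
    master = Master(log_capacity=3, queue_capacity=2)
    print(write_while_child_down(master, 6))  # the last write is rejected
    print(master.child_consume())             # child comes back up
    master.run_replication_agent()
    print(master.write("t7"))                 # master accepts writes again
```

With a queue of 2 and a log of 3, the master absorbs five writes and rejects the sixth; it recovers only once the child drains the queue.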
In most cases the child comes back up well before this happens, and everything it has not yet consumed is waiting on the queue. So the queue is an important piece of the design. The master keeps it up and running, filling it with data while the child node is down. In essence, the master node is responsible for the queue and hence takes on the risks associated with it. I don't think that is a very wise thing to do; you can think of any number of real-world scenarios where this model does not work.
In the replication component that Abinash is coding as I write this, we decided that each child node should take responsibility for its own upkeep. It has to figure out how to recover from a failure and should not adversely affect the master in any way. We will publish the design details in a follow-up post.