A common pattern that we see in Big Data Analytics is that we are essentially operating on huge data pipelines which consume some input files (may be web logs/application logs etc) and do some heavy processing to generate insights that we store in an output file(s)/data store. Another data pipeline may consume this and generate even further insights.
The key points to note are
1.The data volume would be huge (would be in GBs/TBs)
2.We have many input sources from which this data will be arriving / needs to be extracted (push/pull)
3.We need to maintain some kind of a repository/warehouse which could be leveraged to consistently maintain the incoming data files, store intermediate/final results of our analytic processing
4.Analytics operations should be considered like "jobs" with an ETL nature
5.We would need some kind of an orchestration mechanism to manage the inter-dependencies of jobs & trigger jobs based on the availability of data, provide scheduling capabilities & recovery mechanisms etc.
Our Big Data solution stack may have hadoop, pig, hive ,custom map reduce jobs in java or we might use hadoop streaming etc..but the major problem still exists - Who will orchestrate my process flows which generates insights?
Wait...did you say process flow? Where did the "flow thingy" come from? I thought I could write a map-reduce program to do my analytics...Is that not the case? Well ,you can definitely write map-reduce programs, but an optimal way would be to leverage the higher level abstractions that already make your job easy in which ever stages you could use them. For instance I could use hive queries to quickly aggregate some data or use a pig script to do some input transformation .All these would get converted to map-reduce under the hood. Another reason why you can perceive it as a flow is because there will be a sequence of well defined steps that will take in data from a source, do some processing & feed it into a sink - A classic data pipe line.
You will see that lot of these steps will follow a common pattern and if we could have abstractions like “actions” which define each of this steps(what they do & which abstraction they use to complete the step) & link them as an sequence of steps, our job is done. A flow could be represented using xml and the abstractions for actions could be “pig action”, “file system action”, “map-reduce java action”, ”ssh action” etc
Looking from the above perspective, Yahoo’s Oozie (which is a workflow system for hadoop ) sounds promising.It has been around for almost a year and now there is a V2 version of it which has introduced a concept of “coordinator jobs” which exactly solves the scheduling & dependency management. The system is exposed via APIs and webservices & is deployed inside a tomcat.It provides an extjs based console also to monitor jobs.It also provides a command line based client.I saw a presentation on Oozie from hadoop summit 2010 & it seems to have addressed a lot of common pain points.
I have started playing with Oozie 2.2.0 this weekend .I will be posting the findings from my experiments soon. I did face some challenges in getting Oozie 2.2.0 installed on my hadoop cluster and in configuring mysql as the Oozie persistence mechanism. Since I couldn’t get a binary distribution from github, I started off by building a copy myself. Even after injecting extjs2.2 which it uses for rendering the console,I couldn’t get the Oozie web console working....Then firebug came to rescue, the problem was that RowExpander.js was being searched in the extjs home folder,but this script was located in the examples folder. I copied it over & repackaged oozie.war & it started working. I will be publishing a post with details on getting around the issues soon.
May be the issues are addressed in Oozie 2.2.2 & they have a distro available for download & we might not have to build it from source code.I am yet to try it out…