Tuesday, August 30, 2011

Choosing your database – SQL or NoSQL?

Lots of discussions are coming up lately on management of data in the traditional SQL-RDBMS way versus the emerging NoSQL way. As application designers, developers and architects, how do we choose which way to go? This is a common dilemma these days when the scale of applications as well as data volumes are ever increasing and processing terabytes of data have become common enough.

What exactly is the difference between SQL & NoSQL?

The major differences are in the data access patterns and data organization.

NoSQL databases allow a restricted predefined data access pattern. The basic philosophy here is that ultimately my application is the best judge to decide how to manipulate my data and I don’t want some blackbox like a query optimizer sitting inside my database to decide how my application should retrieve or manipulate my data. e.g HBase can be used more as a distributed persistent key value store where in my access pattern is well defined and very simple - I can look up based on a key and retrieve all/some values. So what I get is scalability and performance which are predictable. The price I have to pay is that I won’t get flexible arbitrary data access patterns with my favorite SQL queries. So it is a trade-off, I get more power and as always, power comes with a lot of responsibilities – you need to take care of a lot of things at the application level that your RDBMS would have silently done for you. NoSQL databases also give you the flexibility to store semi-structured or un-structured data and you are free to think beyond the conventional ‘tables’ with ‘rows’ and ‘columns’.

A few examples for NoSQL databases are HBase, Cassandra, Hypertable, CouchDB ,MongoDB etc

SQL databases need you to make some assumptions on your data access patterns based on which predefined optimizations like indexes are built. But if you ask me if we know exactly the data access pattern, I would say no.But definitely we have some idea about potential access patterns (e.g I would know which would be my primary key, foreign key etc.).The executions of queries are controlled by the database engine and programmers would have very limited control over it. So this is again a trade-off, I get simplicity, data normalization, data integrity and an ‘easy to work with’ language (SQL) but when it comes to scalability beyond a limit, it would be an issue. Another advantage I have is that it is very easy to comprehend a relational table since it is nothing but a simple structure with ‘rows’ and ‘columns’.

A few examples for SQL databases are Oracle, MySQL, PostgreSQL, Sybase etc

Ok now, why don’t you just tell me which one I should pick?
Just because Google, facebook, LinkedIn and other web companies do something doesn’t mean that you need to do the same. Just because there is too much buzz about something, it doesn’t mean that it would magically solve all your problems.

The key aspect is your problem. What you choose depends up on what problem you are trying to solve. Depending on your specific situation, SQL, NoSQL or even a hybrid solution supporting both might be appropriate .All I can say is that neither SQL or NoSQL are silver bullets. Sleep over your problems and choose wisely…Or ask others who have faced similar problems (it is very likely that problems we face are faced by someone else as well)-I have seen that this technique attest works ;)

Tuesday, March 15, 2011

Harnessing Hadoop For Big Data Analytics


Harnessing hadoop for big data analytics v0.1
Posted the slides from the session on big data analytics using hadoop.The session was conducted at Travancore Hall,Technopark Trivandrum

http://www.thehindubusinessline.com/todays-paper/tp-others/tp-states/article1518273.ece

View more presentations from jobinwilson