Thursday, May 20, 2010

Relational Database v/s Non Relational Datastore

“Non Relational datastore’s have been introduced to provide indexed data storage that is much in higher performance than existing relational database products like MySQL, Oracle, DB2 and SQL Server. They seek to break down the rigidity of the relational model, in exchange for leaner model that can perform and scale at higher levels, using various models (including key/value pairs, sharded arrays, and document-oriented approaches),which can be created and read efficiently as the basic unit of data storage. So this article tried to bring relation database face-to-face with non relational datastore and can help to choose whether non-relational datastore is the right choice for your application -

A. Normalization v/s Denormalization

Codd’s Normalization rules have been used in large effect while designing the application’s database and have been studied & practiced for more than decade to avoid redundancy. The result is very cheap INSERT, UPDATE, and DELETE operations but SELECT is potentially expensive if we require more complex JOINs to put all the data back together.

Rather than isolate redundancies into individual tables and then form relationships among them, non-relational datastore nests the information hierarchically in keys/documents/column families.

Denormalization is about the tradeoff between efficiently storing data in a single location versus potentially duplicating it in the hopes of speeding up performance of the system. In a denormalized setting, we have the reverse, INSERT, UPDATE, and DELETE are potentially expensive operations but SELECTs are very cheap.



The effective give and take of the normalization dichotomy is that normalization is worse for performance because it requires joins when disparate information is required together, whereas denormalization is more complex (because it may require more physical operations be done when changes occur) and more disk-heavy (because similar information may be stored multiple times).



B. Structured v/s Non-structured data

RDBMS are designed to model very highly and statically structured data. However, many of today's interesting problems involve unpredictable behavior and inputs from extremely large populations; consider web search, social network graphs, large scale purchasing habits, etc. In these "messy" arenas, the impulse to exactly model and define all the possible structures in the data in advance is exactly the wrong approach. Relational data design tends to turn programmers into "structure first" proponents, but in many cases, the rest of the world are thinking "data first".



In RDBMS , there is only one value for any given attribute; storing multiple values in the same attribute for the same column is considered very bad practice, and is not supported by standard SQL. Generally, cases where one might be tempted to store multiple values in the same attribute indicate that the design needs further normalization. This is a design strategy that can is frequently applied to many situations in standard relational database design by defining many-to-one relationship. The trouble with this pattern, however, is that it still does not elegantly serve all

the possible use cases of such data, especially in situations with a low cardinality; either it is overkill, or it is a clumsy way to store data. But in case of some of the non-relational data store, they have a "List" type that can store exactly this type of information as an attribute & query can return the list data also when the entity is retrieved. For example, sample document in CouchDB is defined to contain single keys and arrays:

{

"my_key" : "another value",

"different_key" : "another value still!"

"my_array_key" : ["value 1", "value 2", "etc"]

}



C. Data Retrieval - SQL v/s GQL(Google) or Map-Reduce(CouchDB)

SQL is an extremely concise declarative language; it builds a consistent, useful abstraction framework on top of data storage in the relational model, and allows implementations to optimize access to the data within the bounds given by that abstraction. SQL allows an arbitrary complexity of query syntax, and relational databases management systems typically have an incredibly complex layer for processing and planning the execution of these (potentially complex) queries. Nested queries, complex table joins, aggregation and pivoting, projections—all can be described in SQL, and a good query processing system will quickly craft extremely efficient mechanisms for answering these queries.



Non-relational stores generally allow queries against only the primary key of the store, possibly with one additional layer of filtering via index to limit results to only those that match a simple set of filters (i.e., WHERE clauses). Some of the datastore’s require something different from SQL to access and view its data. Queries are constructed using an object syntax of a simplified dialect of SQL known as "GQL" for retrieving the data from Google Data Store, whereas Map-Reduce comes into picture when accessing the data from CouchDB. Map/Reduce, originally, was a pair of high-order parallel functions introduced originally in LISP to partition information and then reconstitutes it in the final processed and useable state. In CouchDB, The Map function is the beast responsible for taking in a stack of documents and then "mapping" it into key-value pairs and Reduce is responsible for reducing the values under a particular key in the hash-table down into a smaller result set, potentially a single record.



Incase, if we want to retrieve the details about number of articles submitted this year, then SQL query to retrieve the data from RDBMS will be



SELECT count(*) FROM articles



In addition, if same data needs to be retrieved from CouchDB by submitting HTTP – “GET /articles”, then its map function written in JavaScript will be



function map(doc)

{

if(doc.type == "articles")

{

emit( doc.articleName, doc );

}

}



function reduce(key, values, rereduce)

{

return values.length;

}





D. Economy of Scale

It is important to note that one of the most important benefits of non-relational data store—and in fact, the justification for their existence in the first place—is their ability to scale to larger, parallel installations than relational databases can. Rather than designing an elegant relational model and only later considering how it might reasonably be "sharded" or replicated in such a way as to provide high availability in various failure scenarios, some of the datastore’s consider mechanism for establishing the locality of transactions. So in Google Data store when multiple entities in a transaction on a distributed data store are involved, it is desirable to restrict the number of nodes who actually must participate in the transaction. It is therefore most beneficial to couple related entities tightly, and unrelated entities loosely, so that the most common entities to participate in a transaction would be those that are already tightly coupled. In a relational database, you might use foreign key relationships to indicate related entities, but the relationship carries no additional information that might indicate, "These two things are likely to participate in transactions together".



Currently no paper/discussion/presentation is over without mentioning the economy of scale with respect to cloud. An application requiring an RDBMS that has a fixed number of users and whose workload is known not to expand will not have any problems using that RDBMS in the Cloud. However, as more and more applications are launched in environments that have massive workloads, such as web services, their scalability requirements can change very quickly and, secondly, grow very large. These scenario scan be difficult to manage with relational database in general. In addition, one of the core benefits of the Cloud is the ability to quickly (or automatically as we will show) add more servers to an application as its load increases thereby scaling it to heavier workloads. However, it is very hard expand an RDBMS this way. Data must either be replicated across the new servers, or partitioned between them. A big challenge with RDBMS and partitioning or replicating is maintaining referential integrity. When database is partitioned or replicated, it becomes nearly impossible to guarantee referential integrity is maintained across all databases. Basically, it is the very useful property of RDBMS being constructed out of lots of small index tables that are referred to by values in records that becomes so unworkable when these databases have to scale to deal with huge workloads.



E. Concurrency

Relational databases traditionally use a mechanism known as locking, or "Pessimistic" concurrency control; a transaction will identify the resources it intends to change, and protect these resources with a lock (of which there may be various types, depending on the specifics of the operation). Other transactions wishing to update the same resource must wait for the lock to be released. Participants wait their turn for exclusive access to the data, and then commit.

Locking does suffer from two problems that are critical from the perspective of non-relational database management systems – overhead & much more difficult to do correctly if the participants in the transaction are distributed. For this reason, locking is not used by any of the distributed non-relational database systems. As an alternative, another form of concurrency control is typically used in non - relational databases: Optimistic Concurrency, also known as MVCC (Multi-Version Concurrency Control). This mechanism relies on timestamps, to determine the modification dates of transactions. MVCC is very good at achieving true "snapshot" isolation, because a query can carry with it a timestamp that is used to filter any entity the query touches. Using Optimistic Concurrency, however, may introduce additional layers of

complexity to the program code, which would be silently handled in relational databases. The result of this restriction is that in most non-relational database systems, explicit (multi-step) transaction either do not exist at all, or have various limits placed on what they can do. As an example, Google App Engine Data Store can do transactions, but not arbitrary transactions: entities must be declared to be part of the same "entity group" at design time.



F. Consistency v/s Eventual Consistency

Consistency is the notion (which is often taken for granted in traditional relational database systems) that logically, when a client of a data storage system makes a write to that system, any subsequent read (by that client or others) will get the latest version of that data that was written. Consistency is closely related to the concept of transactionality: concurrent systems require transactional guarantees (at least) in order to maintain consistency. Instead, some of the models of non-relational databases use a technique known as "Eventual Consistency". The concept usually applies to cases where a distributed representation of the data is kept for example, across multiple servers in a cluster. The transaction protocol does not guarantee that reads and writes of all conceivable entities in the database will always be instantaneously consistent. Instead, a weakened guarantee is made: in the case of any sort of failure or latency issues, entities may appear temporarily inconsistent, but that they will eventually be made consistent.



G. Integrity

However, even assuming a stronger consistency model, non-relational databases have a significant amount of work to do if they want to replicate the same level of integrity guarantee that is provided by a relational database. Relational database architectures provide a layer through which all queries are passed, that enforces relational integrity guarantees; this would be extremely difficult to do in a distributed environment, and would hamper the system’s throughput. In place of proper relational integrity constraints, most non-relational databases offer un-enforced references: an entity whose key is used as a reference property in another entity can still be deleted, and it is always up to the application code to check the existence of a referred-to key before proceeding. This is the strategy used, for example, by the Google App Engine Data Store.



H. Data manipulation

Data Manipulation Language (DML) is used in RDBMS for entering and manipulating the data once the structure is defined. However some of the Non Relational Datastore’s provide the RESTful interface (HTTP PUT, POST, DELETE) again with differing serialization formats (like JSON, Thrift, RDF etc) and some of them like Google Data store provides the Data Manipulation APIs. Following sample can create the document in CouchDB datastore:

curl -X POST -i -d '{"name" : "Amit Piplani"}' \ -H 'Content-Type: text/javascript; charset=utf-8' \http://localhost:5984/couchdb_sample



{"ok":true,"id":"626b345059c2a54fbe8b8009ba834s01",

"rev":"1-21754816122"}



I. Relationships

One of the biggest question asked when somebody talks about transitions from the relational databases to non-relational datastore’s – What happens to the relationships? Although the integrity part is already covered in “Economy of Scale “and “Integrity” sections, but relationships by themselves require attention and separate topic for discussion. Since the entity relationships are being handled differently by datastore’s, so will be using CouchDB for discussion. Rather than isolate redundancies into individual tables and then form relationships among them, CouchDB nests information hierarchically within documents. Therefore, points of discussion for this case will cover one-to-many and many-to-many relationships, as these are the common types of entity relationships. So CouchDB document detailing out the user and phone number relationships (User having Multiple Phone Numbers associated to them):

/*Start of Document */

{

"name" : "John Doe",

"phone" : {

"mobile" : {

"number" : "123-555-3232",

"primary" : "Yes",

"provider" : "ATT"

},

"residence" : {

"number" : "333-232-2232",

"provider" : "Comcast"

}

"Work" : {

"number" : "432-232-3232",

}



}//Phone

}



Rather than subdividing data across tables to reduce duplication, CouchDB acknowledges a small amount of duplication in favor of keeping all of our data in the same place for easy retrieval



In addition, let us take the example of twitter as many-to-many relationship to explain the user-followers and user-following association. CouchDB in this scenario will break down many-to-many relationship into multiple (two times) one-to-many relationship and in this scenario the document will be:



/* User document

{

"user_id" : "456"

"username" : "John Doe",

"followers" : [

{"user_id" : "123", "username" : "Follower1"},

{"user_id" : "333", "username" : "Follower2"}

],

"following" : [

{"user_id" : "999", "username" : "Following1"},

{"user_id" : "888", "username" : "Following2"}

]

}





J. Security

Another category of diminished functionality in the current crop of non-relational Databases, compared to most commercial relational databases, is in the area of granular access control. Database systems like Oracle, Microsoft SQL Server, MySQL, etc., all contain a robust security model that allow the creation of user accounts, as well as roles or groups to combine and manage those user accounts. It is then possible to set very detailed, granular permissions regarding which users and / or groups can select, insert, update, and delete data, as well as execute individual functions, procedures, etc. Access control is real-time, meaning that changes to users and groups' granular access can be changed at any point, and the database engine itself immediately enforces that access. Non-relational stores do not generally provide access control of this granularity.



K. Stability/Maturity

The commercial relational databases have been around for decades and are rock solid. Most NoSQL datastore’s have been conceptualized and evolved in the last couple of years and have not been widely adopted.



L. Schema Updates

To be sure, relational databases have mechanisms for handling ongoing updates to data schema; indeed, one of the strengths of the relational model is that the schema is data: databases keep system tables, which define schema metadata, which are handled by the exact same database primitives as user-space tables. However, relational database schemas exist in only one state at any given time. This means that if the specific form of an attribute changes, it must change immediately for all records, even in cases where the new form of the attribute would rightfully require processing that the database cannot do. In addition, the release of relational database schema changes typically requires precise coordination with application-layer code; the code version must exactly match the data version. In any highly available application, there is a high likelihood that this implies downtime, or at least advanced operational coordination that takes a great deal of precision and energy. Non-relational databases, by comparison, can use a very different approach for schema versioning. Because the schema (in many cases) is not enforced at the data engine level, it is up to the application to enforce (and migrate) the schema. Therefore, a schema change can be gradually introduced by code that understands how to interact with both the N-1 version and the N version. Code generally lives longer than expected, and data access code doubly so, because it reflects aspects of the system that are less likely to change as requirements shift slightly. Therefore, the future needs of an application’s data are not always clear. This approach produces code that is more complex in the short term, but the knowledge that no downtime will be required during a schema evolution is worth the additional complexity.



M. Support/ Operations

As mentioned in earlier point, that most of the relational databases like Oracle, IBM have better operational support. Especially the support around data recovery, backup and ad hoc data fixing is always a big question in the mind of enterprise decision makers, as many of the 'Non Relational Databases' don't provide a robust and easy to use mechanism towards these problems.



N. Cost

Commercial RDBMS licenses are not cheap, especially in datacenter configurations. The per-server or per-core licensing expenses for a large-scale deployment can easily reach into the millions, which makes SQL much less attractive than the NOSQL alternatives even in non-cloud world.



O. Developer eco-system
SQL is a core competency for a large chunk of the overall developer population. NOSQL knowledge is still rare, both at the individual and at the organizational levels. This means that developers with the required skill sets are hard to find, hard to qualify and hard to retain.

So would like to end the paper saying - "NoSQL Databases are absolutely something the technologists/architects should be paying attention to and following closely- they are strong contender to the RDBMS Model and will likely become the de-facto data storage choice for most next generation web applications in some time. But there is no one-size NoSQL Database that fits all solution for the application’s requirement, and hence evaluating the correct NoSQL Database for project will be the important part of data architecture."

Thanks,
Amit Piplani

Friday, May 7, 2010

Comparing Key-Value DataStores


The simplest of all the data store types. It can be said to be build upon MemCacheDB distributed in-memory cache & provides a single key-value index for all the data. These datastores stores data in memory but other functionalities(like replication, partitioning) vary across the different key-value datastores. Following are the comparison of some of the key-based datastores looked into.

Thursday, May 6, 2010

U Pick 2 Selection for NoSQL Providers


As per CAP theorem, there are three primary concerns you must balance when choosing a data management system: consistency, availability, and partition tolerance.

* Consistency means that each client always has the same view of the data.
* Availability means that all clients can always read and write.
* Partition tolerance means that the system works well across physical network partitions.


And now using Panera Model of UPick 2 for your application requiremenents, CAP theorem says you choose 2 from CAP attributes. The above diagram tells how this attributes apply across some of the NoSQL Providers