Amit Piplani

Wednesday, December 29, 2010

Day out with the MongoDB!!!!

Non Relational datastore’s have been introduced to provide indexed data storage that is much in higher performance than existing relational database products like MySQL, Oracle, DB2 and SQL Server. They seek to break down the rigidity of the relational model, in exchange for leaner model that can perform and scale at higher levels.
Non relational datastore’s are synonymously also called as NoSQL datastore’s. But NoSQL doesn’t mean that industry is ending the roads for SQL itself for next generation web applications, but they meant Not Only SQL (NoSQL). NoSQL datastore’s are both SQL-Free and Schema-free datastore’s. Non Relational datastore’s have been classified into different types based upon their data models –
1. Key-value Stores: These systems store values and an index to find them, based on a programmer-defined key.
2. Document Stores: These systems store documents (as just defined). The documents are indexed and a simple query mechanism may be provided. Well these stores are different from content management systems. These documents are a set of attribute-value pairs, where the values may be complex, and the attribute names are dynamically defined for each document at runtime.
3. Extensible Record Datastore’s: These systems store extensible records that can be partitioned across nodes.

MongoDB is an open source document-oriented datastore designed for high performance access. It is written in C++ and is developed & supported by 10gen.And it is one of the datastore’s which has taken the features from both non-relational datastore & relational databases and I call them Scala of the NoSQL world.
In this article, will give more insight about MongoDB via question & answers:

Q: What is MongoDB?
A: MongoDB is document-oriented database, which has been designed to scale.

Q. What is document oriented database?
A: A document-oriented database is, unsurprisingly, made up of a series of self-contained documents. This means that all of the data for the document in question is stored in the document itself. In fact, there are no tables, rows, columns or relationships in a document-oriented database at all. This means that they are schema-free. If a document needs to add a new field, it can simply include that field, without adversely affecting other documents in the database. This also documents do not have to store empty data values for fields they do not have a value for.
Documents actually map to the given row in a table. Documents are actually stored in collections, which are mapped to the table in relational world.

Q. Does MongoDB have a concept of primary key?
A: Almost every MongoDB document has a _id field as its first attribute. This value is usually a BSON ObjectID. Such an id must be unique for each member of a collection; this is enforced if the collection has an index on _id, which is the case by default. If user tries to insert a document without providing an id, the database will automatically generate a _object id and store it the _id field.
A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. BSON ObjectID will be unique for that server, but not globally unique id. The _id field is automatically indexed.
Q. What is BSON? Is it related to JSON?
A: BSON is a bin¬ary-en¬coded seri¬al¬iz¬a¬tion of JSON-like doc¬u¬ments. BSON is designed to be lightweight, traversable, and efficient. BSON, like JSON, supports the embedding of objects and arrays within other objects and arrays. BSON is a language independent data interchange format & contains extensions that allow representation of data types that are not part of JSON(For example, BSON has a Date data type).

MongoDB uses BSON as the data storage and network transfer format for "documents". The MongoDB client drivers perform the serialization and deserialization. For a given language, the driver performs translation from the language’s “object” (ordered associative array) data representation to BSON, and back.

Q. What all language drivers are supported?
A: MongoDB currently has client support for the following programming languages: C, C#, C++, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby and Scala (via Casbah).

Q. What are the different types of collections supported in MongoDB?
A: A MongoDB collection is a collection of BSON documents. These documents usually have the same structure, but this is not a requirement since MongoDB is a schema-free database.
Capped collections are fixed sized special collections that have a very high performance auto-FIFO age-out feature and are used for certain use cases such as logging. Unlike a standard collection, you must explicitly create a capped collection, specifying a collection size in bytes. The collection's data space is then preallocated. But there are additional points which are not applicable for capped collections such that capped collection are not shardable, index is not automatically created on _id for capped collections by default and deleting of objects is not allowed from capped collection.

Q. Are there any interactive shells to execute the MongoDB commands?
A: MongoDB provides the JavaScript interactive shell to issue commands from the command line and can be used to run the commands, create indices, create server side functions, create users and provide the correct access to the collections and other administration commands.

Q. How is single table inheritance implemented in MongoDB?
A: Being a cricket player, let’s take an example of application storing cricket player and their specialties. In relation world, there will be following columns defined for every player: Batsman, Batting Position, Bowler, Bowling Style, and Wicket-keeper. And some of the columns will be empty depending upon each player’s skills. But same can be easily implemented as each document will be different. Following will be the query results using MongoDB query interface on player’s collection:
 db.players.find()
{ _id: ”1”, name: “Player1”, Batsman: “Yes”, BattingPosition: “No3”}
{ _id: ”2”, name: “Player2”, Batsman: “Yes”, BattingPosition: “No5”, WicketKepper: “Yes”}
{ _id: ”3”, name: “Player1”, Bowler: “Yes”, BowlingStyle”Leg Spinner”}
{ _id: ”3”, name: “Player1”, Bowler: “Yes”, BowlingStyle”Off Spinner”, Batsman: “Yes”, BattingPosition: ”No 4”}

Q. Since there are no joins, so how do define relationships in MongoDB?
A: Join in relational database world maps to embedding and linking in MongoDB. As there are no joins in MongoDB, create one database collection for every top level object & try to embed child objects in that collection. With embedding data is then co-located on disk; client-server turnarounds to the database are eliminated.
So most of the relationships in relational world are described as one-to-many & they can be represented as embedded arrays or normalized collection in MongoDB world. In case of normalized collections, application will be retrieving both the collections separately as MongoDB will not be holding any relationship between them internally. The embedded arrays are the optimum solution.
In case of many-to-many relationship, the different collections have to be linked with each other via ObjectID’s in both the directions & have to query 2 collections if need to retrieve the documents for a given criteria. Instead of creating and maintaining a two-way relationship, it’s better to create one-way relationship between the collections.

Q. Does MongoDB recommends schema design using any methodologies?
A: Documents are simply designed as per the use cases/application needs and they are modified as the requirement changes in future sprints/iterations. So schema design is iterative and is highly recommended to be used using Agile/Scrum methodologies. Data with old and new structure documents design can exist in the same collection as MongoDB is schema-free datastore.

Q. Does document size matter in MongoDB?
A: Yes it does. There document size in 1.6 release of MongoDB is 4MB and is expected to grow in future releases to 16 MB.

Q. How to store bigger objects in MongoDB if there is a document size limit?
A. GridFS is a specification for storing large files in MongoDB, by providing a mechanism to transparently divide a large file among multiple documents. It works by splitting large object into small chunks, usually 256k in size. Each chunk is stored as a separate document in a chunks collection. Metadata about the file, including the filename, content type, and any optional information needed by the developer, is stored as a document in a files collection.

Q. Are indexes in relational world same in MongoDB?
A: Indexes in MongoDB are conceptually similar to those in RDBMS and should be used for enhancing query performance. Index in MongoDB is implemented as “B-Tree” indexes and collects information about the values of the specified fields in the documents of a collection. Once a collection is indexed on a key, random access on query expressions which match the specified key are fast. Without the index, MongoDB has to go through each document checking the value of specified key in the query. An index is always created on _id and that can’t be deleted. Indexed field can be of any types and multi-key "compound" indexes are also supported. Having lot of indexes for faster query performance can impact the inserts/updates/deletes in datastore, which is similar to relational world.

Each index created adds a certain amount of overhead for inserts and deletes. In addition to writing data to the base collection, keys must then be added to the B-Tree indexes. Thus, indexes are best for collections where the number of reads is much greater than the number of writes. For collections which are write-intensive, indexes, in some cases, may be counterproductive. Most collections are read-intensive, so indexes are a good thing in most situations.

MongoDB supports unique constraints via unique indexes, which guarantee that no documents are inserted whose values for the indexed keys match those of an existing document.

Q. What about the query interface in MongoDB?
A: Unlike many NoSQL databases, documents can be queried through a query interface much like Hibernate criteria queries. It’s not SQL, but critically it supports ad-hoc querying. Clients can search for documents based on fields within documents, and return specific fields of documents are part of queries. Through its command line interface, it therefore supports ad-hoc querying of data. Related to this, there is support to create indices on any field in the document to improve performance (including geospatial indices). Combining these two would appear to make MongoDB much closer to relational SQL databases than other NoSQL competitors, which obviously helps with migration.

MongoDB supports a number of query objects for fetching data. Queries are expressed as BSON documents which indicate the query pattern and also provide lot of query options like sorting, skip, limit etc. Queries to MongoDB return a cursor, which can be iterated to retrieve results.

Q. Is there anything similar to stored procedures in MongoDB?
A: Mongo supports the execution of code inside the database process using a SQL-style WHERE predicate clause, or a full JavaScript function. When using this mode of query, the database will call the server side function, or evaluate the predicate clause, for each object in the collection.

Q. What about the map-reduce support in MongoDB?
A: Map/reduce in MongoDB is useful for batch processing of data and aggregation operations. It is similar in spirit to using something like Hadoop with all input coming from a collection and output going to a collection. Map/reduce is invoked via a database command & they are the JavaScript functions and execute on the server. MapReduce jobs on a single mongod process are single threaded, so can use sharding to parallelize the map/reduce jobs.

Q. What about the security features in MongoDB?
A: Security is one area where MongoDB appears to be lacking compared to commercial databases. Users can be created and associated with specific databases, and may either be read-only or read/write. There is no lower level of granularity. However, authentication must be turned on explicitly when running MongoDB. It’s recommended to run MongoDB on servers behind enterprise firewall. Although user's password is stored as hash in MongoDB, but password is entered in plain-text when new user is add or when user is getting authenticated on shell.
Q. What about the transaction support in MongoDB?
A. MongoDB does not support traditional locking and complex transactions. As the distributed locks is going to be expensive and slow, which is not what MongoDB plans were from Day1.
MongoDB supports atomic operations on single documents and v1.8 release of MongoDB will have single server durability. MongoDB updates an object in-place when possible and doesn’t support versioning of documents. MongoDB also doesn’t support two phase commit.

Q. Does MongoDB comes with HTTP interface?
A: MongoDB provides a simple http interface listing information of interest to administrators. This interface may be accessed at the port with numeric value 1000 more than the configured mongod port.

Q. How to achieve high availability & scalability when using MongoDB?
A: Where multiple servers are available, replication is used to provide high availability. There are two options for replication: master/slave and ‘replica sets’. In any case, only one node is allowed to write to the database (called master or primary). Replication is asynchronous so reads from the slave/non-primary nodes may return slightly older data (i.e. reads are ‘eventually consistent’), but reads from the master/primary node will be the latest. In master/slave configuration, data is replicated from master to slave much the same as standard SQL database replication. Operations are logged on the master and are replayed on each of the slaves. Failover is manual. If the master is unavailable writes are not allowed until the master is restored or failover occurs. Replica sets extend master/slave replication with automatic failover and recovery of nodes.
To support scalability, sharding is available. MongoDB supports automatic sharding via an ‘order preserving partitioner’, meaning data with similar shard keys (i.e. how data is split up into shards) are likely to be together. When a shard gets too big, MongoDB automatically migrates the data round to balance out the shards. Sharding needs to be combined with replica sets in order to provide high availability and scalability. MongoDB is capable of querying across shards including sorting across shards. MongoDB uses the range based sharding on mentioned keys and it’s very important to decide which keys should be used for sharding as it is tough to repartition your tables once it is already sharded and have lot of data.

Q. Is there any monitoring support in MongoDB?
A: MongoDB provides lot of inbuilt monitoring and diagnostic features including server statistics, query profiler etc using mongostat (MongoDB’s top version) and monitoring plugins.

Q. What about the data files created on server?
A: MongoDB datafiles are created by default in /data/db folder, unless mentioned different data directory at mongod process. Each datafile is preallocated to a given size. (This is done to prevent file system fragmentation, among other reasons.) The first file for a database is .0, then .1, etc. .0 will be 64MB, .1 128MB, etc., up to 2GB. Once the files reach 2GB in size, each successive file is also 2GB. In data directory .ns files are also present for every database, containing namespace / index metadata exists.

Q. What are the key points for good performance when using MongoDB?
A. Mongo is very oriented toward performance, at the expense of features that would impede performance. Features that give MongoDB good performance are:
• BSON - native socket protocol for client/server interface
• use of memory mapped files for data storage
• objects from the same collection are stored contiguously
• update-in-place (not MVCC)

Q. What are the key points to be considered when using MongoDB keeping the high performance in mind?
A: Following points are very important for good performance:
• Schema design
• Indexes Usage
• Shard on right key.
• Use Capped collection for higher performance use cases.
• Use sharding + replication(master-slave or replica sets) for higher availability & scalability

References –
1 http://www.mongodb.org

Thursday, October 7, 2010

Doubts on Scalable Non relational datastore after Foursquare outage

In a detailed post-mortem, a Foursquare (Foursquare) engineer details the "embarrassing and disappointing" issues that brought the service to its knees and kept it down for the bulk of the day yesterday. Apparently, the service’s problems began when engineers noticed a disproportionate number of checkins being stored on one SHARD of the company’s database system. As the shard became more and more overloaded, engineers tried to introduce a new shard to the system to even out the load balance.

Unfortunately, the new shard took out the entire service for reasons unknown, from the website to the mobile apps; and every time the team tried to restart the site, the original shard kept overloading and bringing the works down again.In the end, Foursquare had to re-index the shard, which took a grueling five hours. The company is happy to report no data was lost.

This brings about two quick questions on MongoDB and Monitoring of non relational datastores -

1. Is this scenario really going against the MongoDb's auto sharding architecture - MongoDB has been designed to make horizontal scaling manageable. It does so via an auto-sharding architecture, which automatically manages the distribution of data across nodes. In addition, the sharding system handles the addition and removal of shard nodes, ensuring an even distribution of data. The architecture also provides for automatic failover; individual shards are made up of at least two redundant nodes, ensuring automatic recovery with no single point of failure.

2.Are we missing the monitoring tools on non-relational datastore?

Although Non relational datastore are being defined as scalable datastore but are they reliable is the question?? With companies like Foursquare growing much more rapidly than it has since its inception, can non relational datastore also scale with them???

Thursday, May 20, 2010

Relational Database v/s Non Relational Datastore

“Non Relational datastore’s have been introduced to provide indexed data storage that is much in higher performance than existing relational database products like MySQL, Oracle, DB2 and SQL Server. They seek to break down the rigidity of the relational model, in exchange for leaner model that can perform and scale at higher levels, using various models (including key/value pairs, sharded arrays, and document-oriented approaches),which can be created and read efficiently as the basic unit of data storage. So this article tried to bring relation database face-to-face with non relational datastore and can help to choose whether non-relational datastore is the right choice for your application -

A. Normalization v/s Denormalization

Codd’s Normalization rules have been used in large effect while designing the application’s database and have been studied & practiced for more than decade to avoid redundancy. The result is very cheap INSERT, UPDATE, and DELETE operations but SELECT is potentially expensive if we require more complex JOINs to put all the data back together.

Rather than isolate redundancies into individual tables and then form relationships among them, non-relational datastore nests the information hierarchically in keys/documents/column families.

Denormalization is about the tradeoff between efficiently storing data in a single location versus potentially duplicating it in the hopes of speeding up performance of the system. In a denormalized setting, we have the reverse, INSERT, UPDATE, and DELETE are potentially expensive operations but SELECTs are very cheap.

The effective give and take of the normalization dichotomy is that normalization is worse for performance because it requires joins when disparate information is required together, whereas denormalization is more complex (because it may require more physical operations be done when changes occur) and more disk-heavy (because similar information may be stored multiple times).

B. Structured v/s Non-structured data

RDBMS are designed to model very highly and statically structured data. However, many of today's interesting problems involve unpredictable behavior and inputs from extremely large populations; consider web search, social network graphs, large scale purchasing habits, etc. In these "messy" arenas, the impulse to exactly model and define all the possible structures in the data in advance is exactly the wrong approach. Relational data design tends to turn programmers into "structure first" proponents, but in many cases, the rest of the world are thinking "data first".

In RDBMS , there is only one value for any given attribute; storing multiple values in the same attribute for the same column is considered very bad practice, and is not supported by standard SQL. Generally, cases where one might be tempted to store multiple values in the same attribute indicate that the design needs further normalization. This is a design strategy that can is frequently applied to many situations in standard relational database design by defining many-to-one relationship. The trouble with this pattern, however, is that it still does not elegantly serve all

the possible use cases of such data, especially in situations with a low cardinality; either it is overkill, or it is a clumsy way to store data. But in case of some of the non-relational data store, they have a "List" type that can store exactly this type of information as an attribute & query can return the list data also when the entity is retrieved. For example, sample document in CouchDB is defined to contain single keys and arrays:

{

"my_key" : "another value",

"different_key" : "another value still!"

"my_array_key" : ["value 1", "value 2", "etc"]

}

C. Data Retrieval - SQL v/s GQL(Google) or Map-Reduce(CouchDB)

SQL is an extremely concise declarative language; it builds a consistent, useful abstraction framework on top of data storage in the relational model, and allows implementations to optimize access to the data within the bounds given by that abstraction. SQL allows an arbitrary complexity of query syntax, and relational databases management systems typically have an incredibly complex layer for processing and planning the execution of these (potentially complex) queries. Nested queries, complex table joins, aggregation and pivoting, projections—all can be described in SQL, and a good query processing system will quickly craft extremely efficient mechanisms for answering these queries.

Non-relational stores generally allow queries against only the primary key of the store, possibly with one additional layer of filtering via index to limit results to only those that match a simple set of filters (i.e., WHERE clauses). Some of the datastore’s require something different from SQL to access and view its data. Queries are constructed using an object syntax of a simplified dialect of SQL known as "GQL" for retrieving the data from Google Data Store, whereas Map-Reduce comes into picture when accessing the data from CouchDB. Map/Reduce, originally, was a pair of high-order parallel functions introduced originally in LISP to partition information and then reconstitutes it in the final processed and useable state. In CouchDB, The Map function is the beast responsible for taking in a stack of documents and then "mapping" it into key-value pairs and Reduce is responsible for reducing the values under a particular key in the hash-table down into a smaller result set, potentially a single record.

Incase, if we want to retrieve the details about number of articles submitted this year, then SQL query to retrieve the data from RDBMS will be

SELECT count(*) FROM articles

In addition, if same data needs to be retrieved from CouchDB by submitting HTTP – “GET /articles”, then its map function written in JavaScript will be

function map(doc)

{

if(doc.type == "articles")

{

emit( doc.articleName, doc );

}

}

function reduce(key, values, rereduce)

{

return values.length;

}

D. Economy of Scale

It is important to note that one of the most important benefits of non-relational data store—and in fact, the justification for their existence in the first place—is their ability to scale to larger, parallel installations than relational databases can. Rather than designing an elegant relational model and only later considering how it might reasonably be "sharded" or replicated in such a way as to provide high availability in various failure scenarios, some of the datastore’s consider mechanism for establishing the locality of transactions. So in Google Data store when multiple entities in a transaction on a distributed data store are involved, it is desirable to restrict the number of nodes who actually must participate in the transaction. It is therefore most beneficial to couple related entities tightly, and unrelated entities loosely, so that the most common entities to participate in a transaction would be those that are already tightly coupled. In a relational database, you might use foreign key relationships to indicate related entities, but the relationship carries no additional information that might indicate, "These two things are likely to participate in transactions together".

Currently no paper/discussion/presentation is over without mentioning the economy of scale with respect to cloud. An application requiring an RDBMS that has a fixed number of users and whose workload is known not to expand will not have any problems using that RDBMS in the Cloud. However, as more and more applications are launched in environments that have massive workloads, such as web services, their scalability requirements can change very quickly and, secondly, grow very large. These scenario scan be difficult to manage with relational database in general. In addition, one of the core benefits of the Cloud is the ability to quickly (or automatically as we will show) add more servers to an application as its load increases thereby scaling it to heavier workloads. However, it is very hard expand an RDBMS this way. Data must either be replicated across the new servers, or partitioned between them. A big challenge with RDBMS and partitioning or replicating is maintaining referential integrity. When database is partitioned or replicated, it becomes nearly impossible to guarantee referential integrity is maintained across all databases. Basically, it is the very useful property of RDBMS being constructed out of lots of small index tables that are referred to by values in records that becomes so unworkable when these databases have to scale to deal with huge workloads.

E. Concurrency

Relational databases traditionally use a mechanism known as locking, or "Pessimistic" concurrency control; a transaction will identify the resources it intends to change, and protect these resources with a lock (of which there may be various types, depending on the specifics of the operation). Other transactions wishing to update the same resource must wait for the lock to be released. Participants wait their turn for exclusive access to the data, and then commit.

Locking does suffer from two problems that are critical from the perspective of non-relational database management systems – overhead & much more difficult to do correctly if the participants in the transaction are distributed. For this reason, locking is not used by any of the distributed non-relational database systems. As an alternative, another form of concurrency control is typically used in non - relational databases: Optimistic Concurrency, also known as MVCC (Multi-Version Concurrency Control). This mechanism relies on timestamps, to determine the modification dates of transactions. MVCC is very good at achieving true "snapshot" isolation, because a query can carry with it a timestamp that is used to filter any entity the query touches. Using Optimistic Concurrency, however, may introduce additional layers of

complexity to the program code, which would be silently handled in relational databases. The result of this restriction is that in most non-relational database systems, explicit (multi-step) transaction either do not exist at all, or have various limits placed on what they can do. As an example, Google App Engine Data Store can do transactions, but not arbitrary transactions: entities must be declared to be part of the same "entity group" at design time.

F. Consistency v/s Eventual Consistency

Consistency is the notion (which is often taken for granted in traditional relational database systems) that logically, when a client of a data storage system makes a write to that system, any subsequent read (by that client or others) will get the latest version of that data that was written. Consistency is closely related to the concept of transactionality: concurrent systems require transactional guarantees (at least) in order to maintain consistency. Instead, some of the models of non-relational databases use a technique known as "Eventual Consistency". The concept usually applies to cases where a distributed representation of the data is kept for example, across multiple servers in a cluster. The transaction protocol does not guarantee that reads and writes of all conceivable entities in the database will always be instantaneously consistent. Instead, a weakened guarantee is made: in the case of any sort of failure or latency issues, entities may appear temporarily inconsistent, but that they will eventually be made consistent.

G. Integrity

However, even assuming a stronger consistency model, non-relational databases have a significant amount of work to do if they want to replicate the same level of integrity guarantee that is provided by a relational database. Relational database architectures provide a layer through which all queries are passed, that enforces relational integrity guarantees; this would be extremely difficult to do in a distributed environment, and would hamper the system’s throughput. In place of proper relational integrity constraints, most non-relational databases offer un-enforced references: an entity whose key is used as a reference property in another entity can still be deleted, and it is always up to the application code to check the existence of a referred-to key before proceeding. This is the strategy used, for example, by the Google App Engine Data Store.

H. Data manipulation

Data Manipulation Language (DML) is used in RDBMS for entering and manipulating the data once the structure is defined. However some of the Non Relational Datastore’s provide the RESTful interface (HTTP PUT, POST, DELETE) again with differing serialization formats (like JSON, Thrift, RDF etc) and some of them like Google Data store provides the Data Manipulation APIs. Following sample can create the document in CouchDB datastore:

curl -X POST -i -d '{"name" : "Amit Piplani"}' \ -H 'Content-Type: text/javascript; charset=utf-8' \http://localhost:5984/couchdb_sample

{"ok":true,"id":"626b345059c2a54fbe8b8009ba834s01",

"rev":"1-21754816122"}

I. Relationships

One of the biggest question asked when somebody talks about transitions from the relational databases to non-relational datastore’s – What happens to the relationships? Although the integrity part is already covered in “Economy of Scale “and “Integrity” sections, but relationships by themselves require attention and separate topic for discussion. Since the entity relationships are being handled differently by datastore’s, so will be using CouchDB for discussion. Rather than isolate redundancies into individual tables and then form relationships among them, CouchDB nests information hierarchically within documents. Therefore, points of discussion for this case will cover one-to-many and many-to-many relationships, as these are the common types of entity relationships. So CouchDB document detailing out the user and phone number relationships (User having Multiple Phone Numbers associated to them):

/*Start of Document */

{

"name" : "John Doe",

"phone" : {

"mobile" : {

"number" : "123-555-3232",

"primary" : "Yes",

"provider" : "ATT"

},

"residence" : {

"number" : "333-232-2232",

"provider" : "Comcast"

}

"Work" : {

"number" : "432-232-3232",

}

}//Phone

}

Rather than subdividing data across tables to reduce duplication, CouchDB acknowledges a small amount of duplication in favor of keeping all of our data in the same place for easy retrieval

In addition, let us take the example of twitter as many-to-many relationship to explain the user-followers and user-following association. CouchDB in this scenario will break down many-to-many relationship into multiple (two times) one-to-many relationship and in this scenario the document will be:

/* User document

{

"user_id" : "456"

"username" : "John Doe",

"followers" : [

{"user_id" : "123", "username" : "Follower1"},

{"user_id" : "333", "username" : "Follower2"}

],

"following" : [

{"user_id" : "999", "username" : "Following1"},

{"user_id" : "888", "username" : "Following2"}

]

}

J. Security

Another category of diminished functionality in the current crop of non-relational Databases, compared to most commercial relational databases, is in the area of granular access control. Database systems like Oracle, Microsoft SQL Server, MySQL, etc., all contain a robust security model that allow the creation of user accounts, as well as roles or groups to combine and manage those user accounts. It is then possible to set very detailed, granular permissions regarding which users and / or groups can select, insert, update, and delete data, as well as execute individual functions, procedures, etc. Access control is real-time, meaning that changes to users and groups' granular access can be changed at any point, and the database engine itself immediately enforces that access. Non-relational stores do not generally provide access control of this granularity.

K. Stability/Maturity

The commercial relational databases have been around for decades and are rock solid. Most NoSQL datastore’s have been conceptualized and evolved in the last couple of years and have not been widely adopted.

L. Schema Updates

To be sure, relational databases have mechanisms for handling ongoing updates to data schema; indeed, one of the strengths of the relational model is that the schema is data: databases keep system tables, which define schema metadata, which are handled by the exact same database primitives as user-space tables. However, relational database schemas exist in only one state at any given time. This means that if the specific form of an attribute changes, it must change immediately for all records, even in cases where the new form of the attribute would rightfully require processing that the database cannot do. In addition, the release of relational database schema changes typically requires precise coordination with application-layer code; the code version must exactly match the data version. In any highly available application, there is a high likelihood that this implies downtime, or at least advanced operational coordination that takes a great deal of precision and energy. Non-relational databases, by comparison, can use a very different approach for schema versioning. Because the schema (in many cases) is not enforced at the data engine level, it is up to the application to enforce (and migrate) the schema. Therefore, a schema change can be gradually introduced by code that understands how to interact with both the N-1 version and the N version. Code generally lives longer than expected, and data access code doubly so, because it reflects aspects of the system that are less likely to change as requirements shift slightly. Therefore, the future needs of an application’s data are not always clear. This approach produces code that is more complex in the short term, but the knowledge that no downtime will be required during a schema evolution is worth the additional complexity.

M. Support/ Operations

As mentioned in earlier point, that most of the relational databases like Oracle, IBM have better operational support. Especially the support around data recovery, backup and ad hoc data fixing is always a big question in the mind of enterprise decision makers, as many of the 'Non Relational Databases' don't provide a robust and easy to use mechanism towards these problems.

N. Cost

Commercial RDBMS licenses are not cheap, especially in datacenter configurations. The per-server or per-core licensing expenses for a large-scale deployment can easily reach into the millions, which makes SQL much less attractive than the NOSQL alternatives even in non-cloud world.

O. Developer eco-system
SQL is a core competency for a large chunk of the overall developer population. NOSQL knowledge is still rare, both at the individual and at the organizational levels. This means that developers with the required skill sets are hard to find, hard to qualify and hard to retain.

So would like to end the paper saying - "NoSQL Databases are absolutely something the technologists/architects should be paying attention to and following closely- they are strong contender to the RDBMS Model and will likely become the de-facto data storage choice for most next generation web applications in some time. But there is no one-size NoSQL Database that fits all solution for the application’s requirement, and hence evaluating the correct NoSQL Database for project will be the important part of data architecture."

Thanks,
Amit Piplani

Friday, May 7, 2010

Comparing Key-Value DataStores

The simplest of all the data store types. It can be said to be build upon MemCacheDB distributed in-memory cache & provides a single key-value index for all the data. These datastores stores data in memory but other functionalities(like replication, partitioning) vary across the different key-value datastores. Following are the comparison of some of the key-based datastores looked into.

Thursday, May 6, 2010

U Pick 2 Selection for NoSQL Providers

As per CAP theorem, there are three primary concerns you must balance when choosing a data management system: consistency, availability, and partition tolerance.

* Consistency means that each client always has the same view of the data.
* Availability means that all clients can always read and write.
* Partition tolerance means that the system works well across physical network partitions.

And now using Panera Model of UPick 2 for your application requiremenents, CAP theorem says you choose 2 from CAP attributes. The above diagram tells how this attributes apply across some of the NoSQL Providers

Monday, January 25, 2010

Features and pitfalls of Google App engine - My perspective

After migrating Pet Store(Sun's Web application) with JDBC calls to Google App engine with schemalass datastore entities and then re- developing the same as GWT application for Google App engine platform, here are the features and pitfalls from that perspective

Features
1.Components of Google App engine - Sandbox enviornment, DataStore & Services scales independently of each other.
2. All requests to Google App engine follow the same request-response model. Web requests/Task Queues/ Cron Jobs/XMPP Messages/Bulk-loader all are made to go via Google's front-end.
3.Server-less infrastucture
4.Supports Java Servlet(Including the features of filter, session (persistent by google), session listener, internationalization (using browser dependent locale).,JSP,JDO and JPA Specifications.
5.Internationalization is not a native feature, but can be added it by using a web application framework with internationalization features.
6.Easily integrates with Google infrastructure for authentication & authorization. And also sandboxing behavior of App engine runtime makes multiple applications to run in the same server without the behavior of one application affecting other
7.Logging jars and configurations are part of web development SDK and hence no time spent on building logger wrapper. And also Google App engine runtime provides logs of all web requests for a given application from administration console,which can help the administrator of tracking DOS attacks,bandwidth etc.
8.Separation of configuration files for each component eg. web application, indexes, tasks queues, cron job etc all have their configurations and can be configured/handled separately by team.
9.Application Caching (Memcache) support is provided as part of sdk and can be easily programmed to put/retrieve the objects from distributed cache.
10.Multiple versions of application can be tested at same moment of time and hence promotes the parallel development of different iterations for same application.
11.Easy separation of static content from application content by mentioning the same in app-config.xml & Google App engine automatically uploads the content into web server and app server respectively.
12.Google App engine replicates the data in multiple locations, and hence application support team don’t have the handle to backup and archive the data for redundancy.
13.Management tools to manage the resources used by application is being done by Google itself and report for the same can be accessed via Administration console.
14.Google Accounts Integration - App engine features integration with Google Accounts, the user account system used by Google applications such as Google Mail, Google Docs and Google Calendar.
15.Supports the concept of cron - to do batch processing jobs(but with request paremeters) & task queues to do processing aynchronously for any web request(outside the context of web reuqest).
16.The runtime environment also limits the amount of clock time, CPU use and memory a single request can take. App engine keeps these limits flexible & applies limits to those applications that use up more resources to protect shared resources from the runaway applications. But the response time for application can also determine the number of requests the application can handle dynamically.
17. Some of the open source applications are being written to synchronize the data between Google App engine datastore and relational database. For e.g., AppRocket is an open-source replication engine that synchronizes Google App engine datastore and MySQL database.
18.App engine includes a tool for uploading and downloading data via the remote API. The tool can create new datastore entities using data from a comma separated values (CSV) data file & can even create CSV files with data from the app’s datastore.
19.Google App engine provides two different interfaces - high-level & low-level for various services like datastore,memcache etc. High-level APIs make application portable but low-level APIs provides more features by the google platform.

Pitfalls-
1.Too many quotas imposed by Google App engine [bandwidth, datastore request/response size etc.] makes road towards quote driven/oriented architecture.
2.Bad entity designs can lead to index explosion or using lot of bandwidth & time while retrieving/updating/creating/deleting entity from datastore.
3.Java.io , JNI , multi-threading use is not supported currently in java runtime
4.Limited support of SQL by Java runtime as Google datastore is schema less. For eg. A query cannot use inequality filters on more than 1 property, Stored procedures/triggers are not being supported.
5.If using google infrastucture for authentication & authorization, right now it supports two roles - general & admin.
6.URL Fetch Process doesn’t support secure HTTPS communication, and hence app engine is not good use case when application needs to be communicate securely with other sites/ web services.
7.When the application creates new entities and updates existing ones using the datastore API, the call returns with success or failure after creating/updating entities along with updating every corresponding index. This makes queries very fast at the expense of the entity updates & hence affects the performance of applications.
8.With secure data connector, web applications can integrate with on-premise resources(only ones available on intranet).
9.Only content management systems build on top of App engine can be used by applications as most of the CMS typically uses sql databases.

Recommendations
1.With more insight into the Google Datastore’s indexes concept, would like to recommend that this architecture should be used by those application which needs faster access to data and not getting affected by how much data is in the system or how it is distributed across multiple servers.
2.GAE can serve traditional website content too (such as documents and images), but the environment is especially designed for real-time dynamic web applications.
3.Selection of Google App engine as platform is mainly driven by no-capital cost, pay-per-use model for resources used beyond free quota, scalability, manageability, server less on-premise infrastructure but limited by its sandbox capabilities and quotas, only web application support and on-going activities to add additional features in java runtime.
4.Ideal case for small web applications with less traffic for deploying their web applications on Google infrastructure.

Wednesday, January 13, 2010

Indexes – does it matter for Developer when using Google App engine???

Most web application need to store information during the handling of request for retrieval during a later request. By far the most populate kind of data storage system for web applications has been the relational database. But Google App engine’s database system most closely resembles an object database. The design of app engine datastore is an abstraction that allows App engine to handle the details of distributing and scaling the application.
An app engine application stores its data as one or more datastore entities. An entity has one or more properties (each of which has name and value). Each entity is of named kind, which categorizes the entity for the purpose of queries. Entities are different from rows defined in tables – firstly entities of a given kind don’t need to have same properties as other entities of same kind. Secondly an entity can have a property of the same name as another entity has, but with different type of value. Also entity can have multiple values for a single property.
Every datastore entity has a unique key that is either provided by the application or generated by App engine. The key is not a property, but an independent aspect of entity. An entity’s keys cannot be changed after the entity is created. Neither can the entity’s kind. App engine uses the entity’s kind & key to locate where the entity is stored in a large collection of servers.
An application always needs to perform query to retrieve one or more entities from the datastore & App engine also supports mentioning three different parts of query:
• Kind of entity
• One or more Filters provided on the property’s value
• One or more sort orders defined on the properties
When the query is executed, it fetches all entities of the given kind that meet all of the given conditions, sorted in the order described. But Google App engine have a limitation right now, it brings entire entity when query is executed & hence entity needs to be carefully designed to run the query in Google App engine world. So if entity is going to be used in lot of data retrieval, then design the entity such that required data should be part of the entity and additional data like BLOB can be part of other entity, which can be retrieved by the parent entity when required. This design is essential since this will add up to the applications’ quota & hence good design of entity is required to optimize the use of bandwidth.
To support faster retrieval of entities of database, Google App engine brings the concept of index. Indexes are also used in relational database world for better performance of queries and are often done by database administrators (and hence are abstracted from the developer’s point of view). But the App Engine datastore maintains an index for every query an application intends to make & when the application executes a query, the datastore fetches the results directly from the corresponding index. But since datastore uses the indexes for fetching the results, do Google App engine developer needs to understand
1. How entities are indexed by datastore?
2. Do the developers need to provide indexes configuration or it will be automatically provided by Google Datastore?
How entities are indexed by datastore?
Although this point many not be if interest to some of the developers as this is somehow feature provided by Google Datastore. But if this feature is not used appropriately (using many Multi-value Properties) in an entity, then it can lead to index explosion as there is limit imposed by Google on indexes configuration. Let’s try to explain the indexes by giving an example of Data Class Book with 3 properties – title, price & year.
public class Book {
private String name;
private Float price;
private int year;
}
For this example, keys for entities of kind Book will be generated by Google App engine, not by application. So Google App engine maintains indexes in following tables for
I. Single table maintaining the keys of the entities of a given kind.
Key
Book/1234
Book/2345
Book/3456

II. 2 Tables mapped to a single property for that kind in ascending and descending order. So in this case it will be 6 tables( 2(sort order) * no of properties)
Key Name
Book/1234 Applied Mathematics
Book/3456 Programming Java
Book/2345 SOA Cookbook

So when entities of a given kind are created/updated/deleted, then indexes in the above mentioned tables are re-arranged. So when following query is executed:
Select * from Book where price> 10.00 && price < 50.00
The datastore will first check the index table defined for the Book kind with price property with ascending order of price. Then, the datastore starts scanning the index at the first entity that meets all of the filter conditions using the query's filter values. And then the datastore continues to scan the index, returning each entity, until it finds the next entity that does not meet the filter conditions, until it reaches the end of the index, or until it has collected the maximum number of results requested by the query.

Key Price
Book/1234 8.00

Book/3456 12.99

Book/2345 23.00
Book/3466 28.00
Book/3479 34.00

Book/3660 53.00

Datastore doesn’t update the indexes when the property is not set for that given entity. And also lets the see the example of index table when there are multi-valued properties defined in an entity (In our Book entity we can assume the example of the reviewers as Multi-Value Properties):
Key Reviewers(Sorted by reviewers, not by keys)
Book/3456 James Keenan
Book/1234 Ken Beck
Book/3456 Laurel Bird
Book/1234 Surendra Reddy
Book/3475 Yale Bordy

Do the developers need to provide indexes configuration or it will be automatically provided by Google Datastore?
After explaining how the datastore indexes the entity when the entities are created or updated or deleted. In following section, will determine which all queries will be automatically indexed by Datastore and which queries requires specifying indexes configuration.
App engine datastore provides automatically indexing when following queries are being executed by application code:
1. Querying the all entities of a given kind
The example of this query is - select * from Book
The following query will use the single index table defined for storing the keys for a entity of a given kind (in our case Book).
2. Querying the all entities of a given kind and using filters on one property using equality operator with no sort order.
The example of this query is Select * from Book where price = 28.00
3. Querying the all entities of a given kind and using filters on one property using inequality operator(< , > , >=, <=) with no sort order
The same example used earlier while describing how query is executed (in previous section can be used here) : Select * from Book where price> 10.00 && price < 50.00
4. Querying the all entities of a given kind and using sort order on one property and can use no or filter on same sorted property
The example of this query will be – Select * from Book order by price desc
5. Querying the all entities of a given kind by filtering or sorting by entity key
This query can be used if application is aware of key names and then wants to retrieve that key entity later on from datastore.
The cases for equality and inequality operators are being defined separately as they are given different preferences, which will be covered with complex queries.
Till now, developer doesn’t need to concern itself with the indexes concept as datastore is able to do everything with respect to indexes for querying the entities & hence providing the layer of abstraction to developer.
But any application can’t have just simple queries like mentioned in above steps & hence application needs to provide custom indexes before deploying the same online. Following queries right now requires customer indexes:
1. Query involving the multiple sort orders defined on different properties
The example for this query will be
Select * from Book order by price desc, name desc
2. Query involving the equality filters defined on one property & any other filters on other properties.
The example for this query will be
Select * from Book where year=2009 and price < 34.00
These complex queries bring about three different principles of inequality filters defined by Google Datastore:
• If a query uses inequality filters on one property and equality filters on one or more other properties, the index must be first sorted by the properties used in equality filters, then by properties used in inequality filters.
• If query uses inequality filters on one property and sort orders on one or more of the other properties, the index must be first sorted by the property used in inequality filters, then by the other desired sort orders.
For eg. Select * From Book where price>20.00 order by name desc
So as mentioned in the above rule, the entities will be first sorted by price and name. So to make the results look obvious, it is always better to add inequality operator also in sort order to make it explicit. So the correct query will be
Select * from Book where price >20.00 order by price, name desc
• A query cannot use inequality filters on more than 1 property. This principle of inequality can be considered as limitation of Google Datastore till now.
Although Google datastore do support equality filters on more than 1 property, but this type of query makes the setup of index as flux between automatic & complex queries. Google datastore follows zigzag fashion of search of entities using equality filters on more than 1 property. So it is always advised as a good practice by Google itself to use automatic indexing for this type of query. In case the algorithm is taking lot of time to return back the response to web handler, then try to customize the indexes via configuration.
There are additional operators – IN and Not Equal (!), which are mostly used in any of the queries made to database. IN Operator is broken down into Multiple equality operators check on same property and ! Operator is broken down into > and < operator check with that filtered value for that property by datastore. So following examples will cover that:
Select * from Book where Year IN (2001, 2003, 2005) Select &* from Book where Year=2001 & Year=2003 & Year=2005
Select * from Book where year != 2004 -> Select * from Book where year>2004 & year < 2004
Using ! Operator leads to third point on inequality filer that, application can’t use ! operator on more than 1 property at a given time. Although the complex queries (defined by Datastore) are not really complex, when compared to the real world SQL’s. But that is again the limitation of datastore and they might also get complex as the days goes by.
After covering how the indexes are created & used by datastore, developer would always like to know whether they are responsible for writing indexes. Application developer can specify the customer indexes for a given kind of entity in datastore-indexes.xml, under the WEB-INF directory. This manual process of indexes can be tedious, error-prone & not liked by application developers, and hence development servers automatically generate indexes for complex queries whenever the queries are executed. To use automatic index configuration, add the attribute autoGenerate="true" to your datastore-indexes.xml file's element. Automatic index configuration is also used if app does not have a datastore-indexes.xml file.
So with more understanding of what index do and how they are configured, the next question still unanswered is whether they are also distributed like their datastore? Yes, both Entities and indexes are distributed across multiple machines, and each machine scans its own index in parallel with others. Each machine returns results to App engine as it scans its own index, and App engine delivers the final result to app, as if the all the results of query were in one large index.
So what developer/designer needs to learn from this article -
1. Design entities correctly so that it doesn’t lead to index explosion along with taking lot of bandwidths when retrieved/updated/created in datastore
2. Need to know which query needs the custom indexes and which not. If there was an entry created in configuration file for index, and that query was deleted from code, then developer/deployer needs to manually delete those indexes from xml file.
3. Need to know the laws on inequality filters and need to work with Google limitations for complex queries.
With more insight into the Google Datastore’s indexes concept, would like to recommend that this architecture should be used by those application which needs faster access to data and not getting affected by how much data is in the system or how it is distributed across multiple servers.