Wednesday, December 29, 2010

Day out with the MongoDB!!!!

Non-relational datastores were introduced to provide indexed data storage with much higher performance than existing relational database products such as MySQL, Oracle, DB2 and SQL Server. They relax the rigidity of the relational model in exchange for a leaner model that can perform and scale at higher levels.
Non-relational datastores are also called NoSQL datastores. NoSQL does not mean that the industry is abandoning SQL for next-generation web applications; it stands for Not Only SQL. NoSQL datastores are both SQL-free and schema-free. They are classified into different types based on their data models:
1. Key-value Stores: These systems store values and an index to find them, based on a programmer-defined key.
2. Document Stores: These systems store documents. The documents are indexed and a simple query mechanism may be provided. These stores are different from content management systems: a document here is a set of attribute-value pairs, where the values may be complex and the attribute names are defined dynamically for each document at runtime.
3. Extensible Record Datastores: These systems store extensible records that can be partitioned across nodes.

MongoDB is an open source document-oriented datastore designed for high-performance access. It is written in C++ and is developed and supported by 10gen. It is one of the datastores that borrows features from both non-relational datastores and relational databases, and I call it the Scala of the NoSQL world.
In this article, I will give more insight into MongoDB through questions and answers:

Q. What is MongoDB?
A: MongoDB is a document-oriented database that has been designed to scale.

Q. What is a document-oriented database?
A: A document-oriented database is, unsurprisingly, made up of a series of self-contained documents. This means that all of the data for the document in question is stored in the document itself. In fact, there are no tables, rows, columns or relationships in a document-oriented database at all. This means that such databases are schema-free. If a document needs a new field, it can simply include that field without adversely affecting other documents in the database. It also means that documents do not have to store empty values for fields they do not use.
A document corresponds roughly to a row in a relational table. Documents are stored in collections, which correspond to tables in the relational world.

Q. Does MongoDB have a concept of primary key?
A: Almost every MongoDB document has an _id field as its first attribute. This value is usually a BSON ObjectID. The _id must be unique for each member of a collection; this is enforced if the collection has an index on _id, which is the case by default. If the user inserts a document without providing an _id, the database automatically generates an ObjectID and stores it in the _id field.
A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. This layout makes collisions between servers very unlikely, but an ObjectID is not a guaranteed globally unique id; uniqueness is only enforced within a collection, through the index on _id.
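
The 12-byte layout described above can be sketched in plain JavaScript (Node). This is only an illustration of the byte layout, not the driver's actual generator: the function name, the sample machine id, process id and counter values, and the choice of big-endian byte order are all assumptions made for the example.

```javascript
// Illustrative sketch of the 12-byte ObjectID layout (not a real driver API).
// All input values below are made up for the example.
function buildObjectIdHex(timestampSeconds, machineId, processId, counter) {
  const bytes = Buffer.alloc(12);
  bytes.writeUInt32BE(timestampSeconds, 0);      // bytes 0-3: timestamp (seconds since epoch)
  bytes.writeUIntBE(machineId & 0xffffff, 4, 3); // bytes 4-6: machine id
  bytes.writeUInt16BE(processId & 0xffff, 7);    // bytes 7-8: process id
  bytes.writeUIntBE(counter & 0xffffff, 9, 3);   // bytes 9-11: counter
  return bytes.toString("hex");                  // 24 hex characters
}

const idHex = buildObjectIdHex(1293580800, 0xabcdef, 0x1234, 1);
console.log(idHex, idHex.length);
```

Because the timestamp comes first, ids generated later compare greater than ids generated earlier, at one-second granularity.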

Q. What is BSON? Is it related to JSON?
A: BSON is a binary-encoded serialization of JSON-like documents. BSON is designed to be lightweight, traversable, and efficient. Like JSON, BSON supports the embedding of objects and arrays within other objects and arrays. BSON is a language-independent data interchange format, and it contains extensions that allow the representation of data types that are not part of JSON (for example, BSON has a Date data type).

MongoDB uses BSON as the data storage and network transfer format for "documents". The MongoDB client drivers perform the serialization and deserialization. For a given language, the driver translates between the language’s “object” (ordered associative array) representation and BSON.

Q. Which language drivers are supported?
A: MongoDB currently has client support for the following programming languages: C, C#, C++, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby and Scala (via Casbah).

Q. What are the different types of collections supported in MongoDB?
A: A MongoDB collection is a collection of BSON documents. These documents usually have the same structure, but this is not a requirement since MongoDB is a schema-free database.
Capped collections are special fixed-size collections with a very high performance automatic FIFO age-out feature, and they suit use cases such as logging. Unlike a standard collection, a capped collection must be created explicitly, specifying its size in bytes. The collection's data space is then preallocated. Capped collections also come with some restrictions: they are not shardable, no index is created on _id by default, and objects cannot be deleted from them.
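
The FIFO age-out behaviour can be illustrated with a small in-memory sketch. This is not how MongoDB implements capped collections internally; the class name is hypothetical, and measuring a document by the length of its JSON text is an assumption made to keep the example short.

```javascript
// In-memory sketch of FIFO age-out in a fixed-size collection.
// CappedCollectionSketch is a hypothetical name, not a MongoDB API.
class CappedCollectionSketch {
  constructor(maxBytes) {
    this.maxBytes = maxBytes; // fixed size, chosen at creation time
    this.usedBytes = 0;
    this.entries = [];        // oldest entry first
  }

  insert(doc) {
    // Assumption: approximate the document size by its JSON text length.
    const size = Buffer.byteLength(JSON.stringify(doc));
    if (size > this.maxBytes) {
      throw new Error("document is larger than the whole collection");
    }
    // Age out the oldest documents until the new one fits.
    while (this.usedBytes + size > this.maxBytes) {
      const oldest = this.entries.shift();
      this.usedBytes -= oldest.size;
    }
    this.entries.push({ doc, size });
    this.usedBytes += size;
  }

  find() {
    return this.entries.map((entry) => entry.doc); // insertion order
  }
}

const log = new CappedCollectionSketch(60);
for (let i = 1; i <= 5; i++) {
  log.insert({ msg: "event " + i });
}
const remaining = log.find().map((doc) => doc.msg);
console.log(remaining);
```

Each sample document takes 17 bytes as JSON, so only three fit in 60 bytes and the two oldest events are aged out.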

Q. Are there any interactive shells to execute the MongoDB commands?
A: MongoDB provides an interactive JavaScript shell for issuing commands from the command line. It can be used to run queries, create indexes, create server-side functions, create users with the appropriate access, and run other administration commands.

Q. How is single table inheritance implemented in MongoDB?
A: As a cricket player myself, let me take the example of an application that stores cricket players and their specialties. In the relational world, the following columns would be defined for every player: Batsman, Batting Position, Bowler, Bowling Style and Wicket-keeper. Some of the columns would be empty depending on each player’s skills. In MongoDB the same can be implemented easily, because each document can have different fields. The following are the query results from the MongoDB query interface on the players collection:
db.players.find()
{ _id: "1", name: "Player1", Batsman: "Yes", BattingPosition: "No 3" }
{ _id: "2", name: "Player2", Batsman: "Yes", BattingPosition: "No 5", WicketKeeper: "Yes" }
{ _id: "3", name: "Player3", Bowler: "Yes", BowlingStyle: "Leg Spinner" }
{ _id: "4", name: "Player4", Bowler: "Yes", BowlingStyle: "Off Spinner", Batsman: "Yes", BattingPosition: "No 4" }



Q. Since there are no joins, how do you define relationships in MongoDB?
A: Joins in the relational world map to embedding and linking in MongoDB. As there are no joins in MongoDB, create one collection for every top-level object and try to embed child objects in that collection. With embedding, the data is co-located on disk and extra client-server round trips to the database are eliminated.
Most relationships in the relational world are one-to-many, and they can be represented in MongoDB either as embedded arrays or as normalized collections. With normalized collections, the application has to retrieve from both collections separately, because MongoDB does not hold any relationship between them internally. Embedded arrays are usually the better solution.
For a many-to-many relationship, the collections have to be linked to each other via ObjectIDs in both directions, and two collections have to be queried to retrieve the documents for a given criterion. Instead of creating and maintaining a two-way relationship, it is better to create a one-way relationship between the collections.
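
To make the difference between embedding and linking concrete, here is a small sketch using plain JavaScript objects in place of real collections. The team and player documents, the teamId field and the loadTeamWithPlayers helper are all hypothetical names invented for this illustration.

```javascript
// Option 1: embedding. The child objects live inside the parent document,
// so a single read returns the team together with its players.
const embeddedTeam = {
  _id: "team1",
  name: "Team A",
  players: [
    { name: "Player1", role: "Batsman" },
    { name: "Player2", role: "Wicket-keeper" },
  ],
};

// Option 2: linking with a one-way reference (player -> team).
// Plain arrays stand in for two separate collections.
const teams = [{ _id: "team1", name: "Team A" }];
const playersCollection = [
  { _id: "p1", name: "Player1", teamId: "team1" },
  { _id: "p2", name: "Player2", teamId: "team1" },
  { _id: "p3", name: "Player3", teamId: "team2" },
];

// The application performs the "join" itself with two separate lookups.
function loadTeamWithPlayers(teamId) {
  const team = teams.find((t) => t._id === teamId);                      // lookup 1
  const members = playersCollection.filter((p) => p.teamId === teamId); // lookup 2
  return Object.assign({}, team, { players: members });
}

const linkedTeam = loadTeamWithPlayers("team1");
console.log(embeddedTeam.players.length, linkedTeam.players.length);
```

In the linked version each lookup would be a separate round trip to the database, which is the cost that embedding avoids.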

Q. Does MongoDB recommend any methodology for schema design?
A: Documents are designed according to the use cases and application needs, and they are modified as requirements change in later sprints or iterations. Schema design is therefore iterative and fits well with Agile/Scrum methodologies. Documents with the old and the new structure can exist in the same collection, as MongoDB is a schema-free datastore.

Q. Does document size matter in MongoDB?
A: Yes, it does. The maximum document size in the 1.6 release of MongoDB is 4MB, and it is expected to grow to 16MB in future releases.

Q. How do you store bigger objects in MongoDB if there is a document size limit?
A: GridFS is a specification for storing large files in MongoDB. It provides a mechanism to transparently divide a large file among multiple documents. It works by splitting a large object into small chunks, usually 256k in size. Each chunk is stored as a separate document in a chunks collection. Metadata about the file, including the filename, content type, and any optional information needed by the developer, is stored as a document in a files collection.
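
The chunking idea can be sketched without a database. The example below splits a buffer into 256k chunk documents plus one metadata document and then reassembles it. It is a simplified illustration, not the GridFS implementation: the function names and the generated file id are hypothetical, and the field names are written from memory of the GridFS specification, so treat them as assumptions.

```javascript
const CHUNK_SIZE = 256 * 1024; // 256k, the usual chunk size mentioned above

// Hypothetical helper: split a file into one "files" document
// and several "chunks" documents.
function splitIntoChunkDocs(filename, contentType, data) {
  const fileId = filename + "-id"; // made-up id, for illustration only
  const fileDoc = {
    _id: fileId,
    filename: filename,
    contentType: contentType,
    length: data.length,
    chunkSize: CHUNK_SIZE,
  };
  const chunkDocs = [];
  for (let offset = 0, n = 0; offset < data.length; offset += CHUNK_SIZE, n++) {
    chunkDocs.push({
      files_id: fileId, // link back to the metadata document
      n: n,             // chunk sequence number
      data: data.subarray(offset, offset + CHUNK_SIZE),
    });
  }
  return { fileDoc, chunkDocs };
}

// Put the chunks back together in sequence order.
function reassemble(chunkDocs) {
  const ordered = chunkDocs.slice().sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map((chunk) => chunk.data));
}

const original = Buffer.alloc(600 * 1024, 7); // a 600k dummy "file"
const stored = splitIntoChunkDocs("photo.jpg", "image/jpeg", original);
const restored = reassemble(stored.chunkDocs);
console.log(stored.chunkDocs.length, restored.equals(original));
```

A 600k file yields two full 256k chunks and one final 88k chunk.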

Q. Are indexes in MongoDB the same as in the relational world?
A: Indexes in MongoDB are conceptually similar to those in an RDBMS and should be used to improve query performance. An index in MongoDB is implemented as a B-Tree and collects information about the values of the specified fields in the documents of a collection. Once a collection is indexed on a key, random access on query expressions that match the specified key is fast. Without the index, MongoDB has to go through every document, checking the value of the key specified in the query. An index is always created on _id and cannot be deleted. Indexed fields can be of any type, and multi-key "compound" indexes are also supported. Having a lot of indexes for faster queries can slow down inserts, updates and deletes in the datastore, just as in the relational world.

Each index created adds a certain amount of overhead for inserts and deletes. In addition to writing data to the base collection, keys must then be added to the B-Tree indexes. Thus, indexes are best for collections where the number of reads is much greater than the number of writes. For collections which are write-intensive, indexes, in some cases, may be counterproductive. Most collections are read-intensive, so indexes are a good thing in most situations.

MongoDB supports unique constraints via unique indexes, which guarantee that no documents are inserted whose values for the indexed keys match those of an existing document.
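
The effect of a unique index can be illustrated with a tiny in-memory sketch. It uses a JavaScript Map (a hash structure) rather than a B-Tree, so it only demonstrates the uniqueness check and the direct lookup by key, not MongoDB's actual index structure. The class name and the email field are hypothetical.

```javascript
// Sketch of a unique index on one field, backed by a Map.
class UniqueIndexSketch {
  constructor(field) {
    this.field = field;
    this.entries = new Map(); // indexed key value -> document
  }

  insert(doc) {
    const key = doc[this.field];
    if (this.entries.has(key)) {
      // Reject documents whose indexed key matches an existing document.
      throw new Error("duplicate key: " + key);
    }
    this.entries.set(key, doc);
  }

  findOne(key) {
    return this.entries.get(key); // direct lookup, no scan of all documents
  }
}

const users = new UniqueIndexSketch("email");
users.insert({ email: "a@example.com", name: "User A" });
users.insert({ email: "b@example.com", name: "User B" });

let duplicateRejected = false;
try {
  users.insert({ email: "a@example.com", name: "Another User A" });
} catch (err) {
  duplicateRejected = true;
}
console.log(duplicateRejected, users.findOne("b@example.com").name);
```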

Q. What about the query interface in MongoDB?
A: Unlike many NoSQL databases, MongoDB lets documents be queried through a query interface much like Hibernate criteria queries. It’s not SQL, but critically it supports ad-hoc querying, including from its command line interface. Clients can search for documents based on fields within documents, and can return only specific fields of documents as part of a query. Related to this, indexes can be created on any field in the document to improve performance (including geospatial indexes). Combining these two features makes MongoDB appear much closer to relational SQL databases than other NoSQL competitors, which obviously helps with migration.

MongoDB supports a number of query objects for fetching data. Queries are expressed as BSON documents that describe the query pattern, and they can carry query options such as sort, skip and limit. Queries to MongoDB return a cursor, which can be iterated to retrieve the results.
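
The idea of a query expressed as a document can be sketched in plain JavaScript. The find function below is a hypothetical in-memory stand-in, not the MongoDB API: it supports only exact-match fields plus simple sortBy, skip and limit options, and it returns an array rather than a cursor. The player documents follow the cricket example from earlier in this article.

```javascript
// Does the document match every field in the query document? (exact match only)
function matches(doc, query) {
  return Object.keys(query).every((field) => doc[field] === query[field]);
}

// Hypothetical in-memory find with sort, skip and limit options.
function find(collection, query, options = {}) {
  let results = collection.filter((doc) => matches(doc, query));
  if (options.sortBy) {
    const f = options.sortBy;
    results = results.slice().sort((a, b) => (a[f] < b[f] ? -1 : a[f] > b[f] ? 1 : 0));
  }
  const skip = options.skip || 0;
  const limit = options.limit === undefined ? results.length : options.limit;
  return results.slice(skip, skip + limit);
}

const players = [
  { _id: "1", name: "Player1", Batsman: "Yes", BattingPosition: "No 3" },
  { _id: "2", name: "Player2", Batsman: "Yes", BattingPosition: "No 5", WicketKeeper: "Yes" },
  { _id: "3", name: "Player3", Bowler: "Yes", BowlingStyle: "Leg Spinner" },
  { _id: "4", name: "Player4", Bowler: "Yes", BowlingStyle: "Off Spinner", Batsman: "Yes", BattingPosition: "No 4" },
];

// All batsmen, ordered by batting position, skipping the first and taking two.
const batsmen = find(players, { Batsman: "Yes" }, { sortBy: "BattingPosition", skip: 1, limit: 2 });
const batsmenIds = batsmen.map((doc) => doc._id);
console.log(batsmenIds);
```

Player3 has no Batsman field at all, so the query simply does not match that document.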

Q. Is there anything similar to stored procedures in MongoDB?
A: Mongo supports the execution of code inside the database process using a SQL-style WHERE predicate clause, or a full JavaScript function. When using this mode of query, the database will call the server side function, or evaluate the predicate clause, for each object in the collection.

Q. What about the map-reduce support in MongoDB?
A: Map/reduce in MongoDB is useful for batch processing of data and for aggregation operations. It is similar in spirit to using something like Hadoop, with all input coming from a collection and the output going to a collection. Map/reduce is invoked via a database command; the map and reduce functions are written in JavaScript and execute on the server. Map/reduce jobs on a single mongod process are single-threaded, so sharding can be used to parallelize them.
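
The shape of a map/reduce job can be sketched in plain JavaScript. The mapReduce helper below is a hypothetical in-memory version, not the MongoDB command: it passes the document and an emit function to map as arguments, and it calls reduce exactly once per key. The innings documents are made-up sample data.

```javascript
// Hypothetical in-memory map/reduce over an array of documents.
function mapReduce(collection, mapFn, reduceFn) {
  const grouped = new Map(); // key -> list of emitted values
  const emit = (key, value) => {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  };
  collection.forEach((doc) => mapFn(doc, emit));

  const output = []; // stands in for the output collection
  grouped.forEach((values, key) => {
    output.push({ _id: key, value: reduceFn(key, values) });
  });
  return output;
}

// Sample data: runs scored per innings (made up for the example).
const innings = [
  { player: "Player1", runs: 50 },
  { player: "Player2", runs: 30 },
  { player: "Player1", runs: 70 },
];

const totals = mapReduce(
  innings,
  (doc, emit) => emit(doc.player, doc.runs),          // map: one (key, value) per document
  (key, values) => values.reduce((a, b) => a + b, 0)  // reduce: sum the values for a key
);
console.log(totals);
```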

Q. What about the security features in MongoDB?
A: Security is one area where MongoDB appears to be lacking compared to commercial databases. Users can be created and associated with specific databases, and may be either read-only or read/write. There is no finer level of granularity. In addition, authentication must be turned on explicitly when running MongoDB. It’s recommended to run MongoDB on servers behind the enterprise firewall. Although a user's password is stored as a hash in MongoDB, the password is entered in plain text when a new user is added or when a user authenticates in the shell.

Q. What about the transaction support in MongoDB?
A: MongoDB does not support traditional locking and complex transactions. Distributed locks would be expensive and slow, which runs against MongoDB's design goals from day one.
MongoDB supports atomic operations on single documents, and the v1.8 release of MongoDB will have single-server durability. MongoDB updates an object in place when possible and doesn’t support versioning of documents. MongoDB also doesn’t support two-phase commit.

Q. Does MongoDB come with an HTTP interface?
A: MongoDB provides a simple HTTP interface listing information of interest to administrators. This interface is available on the port numbered 1000 higher than the configured mongod port (for example, 28017 for the default mongod port of 27017).

Q. How to achieve high availability & scalability when using MongoDB?
A: Where multiple servers are available, replication is used to provide high availability. There are two options for replication: master/slave and ‘replica sets’. In either case, only one node (called the master or primary) is allowed to write to the database. Replication is asynchronous, so reads from the slave/non-primary nodes may return slightly older data (i.e. reads are ‘eventually consistent’), but reads from the master/primary node always return the latest data. In a master/slave configuration, data is replicated from master to slave in much the same way as in standard SQL database replication. Operations are logged on the master and replayed on each of the slaves. Failover is manual. If the master is unavailable, writes are not allowed until the master is restored or failover occurs. Replica sets extend master/slave replication with automatic failover and recovery of nodes.
To support scalability, sharding is available. MongoDB supports automatic sharding via an ‘order preserving partitioner’, meaning that documents with similar shard keys (the keys that determine how data is split into shards) are likely to be stored together. When a shard gets too big, MongoDB automatically migrates data to balance out the shards. Sharding needs to be combined with replica sets in order to provide both high availability and scalability. MongoDB is capable of querying across shards, including sorting across shards. MongoDB uses range-based sharding on the chosen keys, and it is very important to decide carefully which keys to shard on, as it is tough to repartition a collection once it is sharded and holds a lot of data.
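
Range-based sharding on a key can be sketched as a simple routing table. The ranges, shard names and the shardFor function below are hypothetical, and the sketch assumes lowercase string shard keys compared in plain string order. MongoDB manages and rebalances its ranges automatically, which this example does not attempt to show.

```javascript
// Hypothetical routing table: each shard owns a contiguous key range [min, max).
const shardRanges = [
  { shard: "shard0", min: "", max: "h" },   // keys below "h"
  { shard: "shard1", min: "h", max: "q" },  // keys from "h" up to, not including, "q"
  { shard: "shard2", min: "q", max: null }, // keys from "q" upward (null = unbounded)
];

// Find the shard whose range contains the given shard key.
function shardFor(key) {
  const range = shardRanges.find(
    (r) => key >= r.min && (r.max === null || key < r.max)
  );
  return range.shard;
}

// Similar keys land on the same shard because the partitioning preserves order.
const routing = ["alice", "bob", "harry", "mike", "zoe"].map(
  (key) => key + "->" + shardFor(key)
);
console.log(routing);
```

Because neighbouring keys share a shard, a poorly chosen key (for example one that always increases) would send all new writes to the same shard, which is why the choice of shard key matters so much.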

Q. Is there any monitoring support in MongoDB?
A: MongoDB provides a lot of built-in monitoring and diagnostic features, including server statistics and a query profiler, along with mongostat (MongoDB’s equivalent of top) and monitoring plugins.

Q. What about the data files created on server?
A: MongoDB datafiles are created by default in the /data/db folder, unless a different data directory is specified when starting the mongod process. Each datafile is preallocated to a given size. (This is done to prevent file system fragmentation, among other reasons.) The first file for a database has the suffix .0, the next .1, and so on. The .0 file is 64MB, the .1 file 128MB, and so on, doubling up to 2GB. Once the files reach 2GB in size, each successive file is also 2GB. The data directory also contains a .ns file for every database, which holds the namespace and index metadata.
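
The preallocation sequence described above is easy to work out with a few lines of JavaScript. The function name is hypothetical; the sizes follow the doubling rule from the paragraph (64MB first, doubling each time, capped at 2GB).

```javascript
// Sizes in MB of the first fileCount datafiles (.0, .1, .2, ...) of one database.
function dataFileSizesMB(fileCount) {
  const sizes = [];
  let size = 64;                       // the .0 file is 64MB
  for (let i = 0; i < fileCount; i++) {
    sizes.push(size);
    size = Math.min(size * 2, 2048);   // double each time, capped at 2GB
  }
  return sizes;
}

const sizes = dataFileSizesMB(8);
const totalMB = sizes.reduce((a, b) => a + b, 0);
console.log(sizes, totalMB);
```

For eight files this gives 64, 128, 256, 512, 1024, 2048, 2048 and 2048 MB, or 8128 MB of preallocated space in total.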

Q. What are the key points for good performance when using MongoDB?
A: MongoDB is strongly oriented toward performance, at the expense of features that would impede it. The features that give MongoDB good performance are:
• BSON - native socket protocol for client/server interface
• use of memory mapped files for data storage
• objects from the same collection are stored contiguously
• update-in-place (not MVCC)

Q. What are the key points to consider when using MongoDB with high performance in mind?
A: The following points are very important for good performance:
• Schema design
• Index usage
• Sharding on the right key
• Capped collections for higher-performance use cases
• Sharding plus replication (master/slave or replica sets) for higher availability and scalability


References –
1 http://www.mongodb.org