Wednesday, January 13, 2010

Indexes – does it matter for Developer when using Google App engine???

Most web application need to store information during the handling of request for retrieval during a later request. By far the most populate kind of data storage system for web applications has been the relational database. But Google App engine’s database system most closely resembles an object database. The design of app engine datastore is an abstraction that allows App engine to handle the details of distributing and scaling the application.
An app engine application stores its data as one or more datastore entities. An entity has one or more properties (each of which has name and value). Each entity is of named kind, which categorizes the entity for the purpose of queries. Entities are different from rows defined in tables – firstly entities of a given kind don’t need to have same properties as other entities of same kind. Secondly an entity can have a property of the same name as another entity has, but with different type of value. Also entity can have multiple values for a single property.
Every datastore entity has a unique key that is either provided by the application or generated by App engine. The key is not a property, but an independent aspect of entity. An entity’s keys cannot be changed after the entity is created. Neither can the entity’s kind. App engine uses the entity’s kind & key to locate where the entity is stored in a large collection of servers.
An application always needs to perform query to retrieve one or more entities from the datastore & App engine also supports mentioning three different parts of query:
• Kind of entity
• One or more Filters provided on the property’s value
• One or more sort orders defined on the properties
When the query is executed, it fetches all entities of the given kind that meet all of the given conditions, sorted in the order described. But Google App engine have a limitation right now, it brings entire entity when query is executed & hence entity needs to be carefully designed to run the query in Google App engine world. So if entity is going to be used in lot of data retrieval, then design the entity such that required data should be part of the entity and additional data like BLOB can be part of other entity, which can be retrieved by the parent entity when required. This design is essential since this will add up to the applications’ quota & hence good design of entity is required to optimize the use of bandwidth.
To support faster retrieval of entities of database, Google App engine brings the concept of index. Indexes are also used in relational database world for better performance of queries and are often done by database administrators (and hence are abstracted from the developer’s point of view). But the App Engine datastore maintains an index for every query an application intends to make & when the application executes a query, the datastore fetches the results directly from the corresponding index. But since datastore uses the indexes for fetching the results, do Google App engine developer needs to understand
1. How entities are indexed by datastore?
2. Do the developers need to provide indexes configuration or it will be automatically provided by Google Datastore?
How entities are indexed by datastore?
Although this point many not be if interest to some of the developers as this is somehow feature provided by Google Datastore. But if this feature is not used appropriately (using many Multi-value Properties) in an entity, then it can lead to index explosion as there is limit imposed by Google on indexes configuration. Let’s try to explain the indexes by giving an example of Data Class Book with 3 properties – title, price & year.
public class Book {
private String name;
private Float price;
private int year;
For this example, keys for entities of kind Book will be generated by Google App engine, not by application. So Google App engine maintains indexes in following tables for
I. Single table maintaining the keys of the entities of a given kind.

II. 2 Tables mapped to a single property for that kind in ascending and descending order. So in this case it will be 6 tables( 2(sort order) * no of properties)
Key Name
Book/1234 Applied Mathematics
Book/3456 Programming Java
Book/2345 SOA Cookbook

So when entities of a given kind are created/updated/deleted, then indexes in the above mentioned tables are re-arranged. So when following query is executed:
Select * from Book where price> 10.00 && price < 50.00
The datastore will first check the index table defined for the Book kind with price property with ascending order of price. Then, the datastore starts scanning the index at the first entity that meets all of the filter conditions using the query's filter values. And then the datastore continues to scan the index, returning each entity, until it finds the next entity that does not meet the filter conditions, until it reaches the end of the index, or until it has collected the maximum number of results requested by the query.

Key Price
Book/1234 8.00

Book/3456 12.99

Book/2345 23.00
Book/3466 28.00
Book/3479 34.00

Book/3660 53.00

Datastore doesn’t update the indexes when the property is not set for that given entity. And also lets the see the example of index table when there are multi-valued properties defined in an entity (In our Book entity we can assume the example of the reviewers as Multi-Value Properties):
Key Reviewers(Sorted by reviewers, not by keys)
Book/3456 James Keenan
Book/1234 Ken Beck
Book/3456 Laurel Bird
Book/1234 Surendra Reddy
Book/3475 Yale Bordy

Do the developers need to provide indexes configuration or it will be automatically provided by Google Datastore?
After explaining how the datastore indexes the entity when the entities are created or updated or deleted. In following section, will determine which all queries will be automatically indexed by Datastore and which queries requires specifying indexes configuration.
App engine datastore provides automatically indexing when following queries are being executed by application code:
1. Querying the all entities of a given kind
The example of this query is - select * from Book
The following query will use the single index table defined for storing the keys for a entity of a given kind (in our case Book).
2. Querying the all entities of a given kind and using filters on one property using equality operator with no sort order.
The example of this query is Select * from Book where price = 28.00
3. Querying the all entities of a given kind and using filters on one property using inequality operator(< , > , >=, <=) with no sort order
The same example used earlier while describing how query is executed (in previous section can be used here) : Select * from Book where price> 10.00 && price < 50.00
4. Querying the all entities of a given kind and using sort order on one property and can use no or filter on same sorted property
The example of this query will be – Select * from Book order by price desc
5. Querying the all entities of a given kind by filtering or sorting by entity key
This query can be used if application is aware of key names and then wants to retrieve that key entity later on from datastore.
The cases for equality and inequality operators are being defined separately as they are given different preferences, which will be covered with complex queries.
Till now, developer doesn’t need to concern itself with the indexes concept as datastore is able to do everything with respect to indexes for querying the entities & hence providing the layer of abstraction to developer.
But any application can’t have just simple queries like mentioned in above steps & hence application needs to provide custom indexes before deploying the same online. Following queries right now requires customer indexes:
1. Query involving the multiple sort orders defined on different properties
The example for this query will be
Select * from Book order by price desc, name desc
2. Query involving the equality filters defined on one property & any other filters on other properties.
The example for this query will be
Select * from Book where year=2009 and price < 34.00
These complex queries bring about three different principles of inequality filters defined by Google Datastore:
• If a query uses inequality filters on one property and equality filters on one or more other properties, the index must be first sorted by the properties used in equality filters, then by properties used in inequality filters.
• If query uses inequality filters on one property and sort orders on one or more of the other properties, the index must be first sorted by the property used in inequality filters, then by the other desired sort orders.
For eg. Select * From Book where price>20.00 order by name desc
So as mentioned in the above rule, the entities will be first sorted by price and name. So to make the results look obvious, it is always better to add inequality operator also in sort order to make it explicit. So the correct query will be
Select * from Book where price >20.00 order by price, name desc
• A query cannot use inequality filters on more than 1 property. This principle of inequality can be considered as limitation of Google Datastore till now.
Although Google datastore do support equality filters on more than 1 property, but this type of query makes the setup of index as flux between automatic & complex queries. Google datastore follows zigzag fashion of search of entities using equality filters on more than 1 property. So it is always advised as a good practice by Google itself to use automatic indexing for this type of query. In case the algorithm is taking lot of time to return back the response to web handler, then try to customize the indexes via configuration.
There are additional operators – IN and Not Equal (!), which are mostly used in any of the queries made to database. IN Operator is broken down into Multiple equality operators check on same property and ! Operator is broken down into > and < operator check with that filtered value for that property by datastore. So following examples will cover that:
Select * from Book where Year IN (2001, 2003, 2005) Select &* from Book where Year=2001 & Year=2003 & Year=2005
Select * from Book where year != 2004 -> Select * from Book where year>2004 & year < 2004
Using ! Operator leads to third point on inequality filer that, application can’t use ! operator on more than 1 property at a given time. Although the complex queries (defined by Datastore) are not really complex, when compared to the real world SQL’s. But that is again the limitation of datastore and they might also get complex as the days goes by.
After covering how the indexes are created & used by datastore, developer would always like to know whether they are responsible for writing indexes. Application developer can specify the customer indexes for a given kind of entity in datastore-indexes.xml, under the WEB-INF directory. This manual process of indexes can be tedious, error-prone & not liked by application developers, and hence development servers automatically generate indexes for complex queries whenever the queries are executed. To use automatic index configuration, add the attribute autoGenerate="true" to your datastore-indexes.xml file's element. Automatic index configuration is also used if app does not have a datastore-indexes.xml file.
So with more understanding of what index do and how they are configured, the next question still unanswered is whether they are also distributed like their datastore? Yes, both Entities and indexes are distributed across multiple machines, and each machine scans its own index in parallel with others. Each machine returns results to App engine as it scans its own index, and App engine delivers the final result to app, as if the all the results of query were in one large index.
So what developer/designer needs to learn from this article -
1. Design entities correctly so that it doesn’t lead to index explosion along with taking lot of bandwidths when retrieved/updated/created in datastore
2. Need to know which query needs the custom indexes and which not. If there was an entry created in configuration file for index, and that query was deleted from code, then developer/deployer needs to manually delete those indexes from xml file.
3. Need to know the laws on inequality filters and need to work with Google limitations for complex queries.
With more insight into the Google Datastore’s indexes concept, would like to recommend that this architecture should be used by those application which needs faster access to data and not getting affected by how much data is in the system or how it is distributed across multiple servers.

No comments:

Post a Comment