Building a high volume app without an RDBMS or Domain objects

I have two concept applications that I think could work quite well without an RDBMS or an OO domain layer. The question is, would they actually benefit from this approach?

There is no doubt that relational databases are the king of data storage and that object-oriented design is the predominant methodology, both for good reasons. But how practical is it to build a high traffic application that doesn’t use a relational database or a domain object layer?

For both of the scenarios I’m thinking of, the hardware architecture would be very similar and based on Amazon’s Elastic Compute Cloud (EC2) and S3 services. The idea is that the data would reside in S3 in a text format (we’ll use XML for the sake of argument), with the actual site work running on EC2 instances. I was inspired a bit by a post made by the.codist{}.

Why Bother?

Good question. The idea behind thinking of this route is twofold. First, you would get the horizontal scalability of adding web servers — and possibly cache servers — quickly as traffic increases. Second is rapid development.

Rapid development. Isn’t that why we have Ruby on Rails, Grails and other frameworks like these? True, but how valuable are domain-driven OO frameworks when the domain objects are dumb? I will concede that with some applications, working with the data is easier using objects than, say, a tabular model, as the data is naturally hierarchical. Then again, that is where the XML/JSON data models can fit. Also, frameworks like Ruby on Rails and Grails, and ORMs like Hibernate, JPA, entity beans, etc. are most valuable when you need a full CRUD application. While both of these scenarios have CRUD operations, they aren’t data entry CRUD apps in the traditional sense.

What, no OO?

Well, the apps wouldn’t necessarily be non-object oriented; they just wouldn’t use domain objects. Rather, they would have controller and action objects, commonly referred to as service objects. The benefits of having an anemic domain layer are often debated anyway. I know the point of Martin Fowler’s comments on the subject is to put more logic into the domain model, but what if your domain model doesn’t need any logic other than data retrieval and storage? (All of which is really a different debate.)

So with those points in mind…

Scenario One: A Single User Style Application

Our first scenario is more along the lines of what we tend to think of as an ‘application’. It is not single user in the sense of a desktop app, but in the sense that the user isn’t looking at other users’ data or global data. Think of apps with simple, infrequent data entry and occasional reads. The volume and the need for scalability come from the number of users or account holders, as with, say, a 37Signals style application, Flickr or maybe even MySpace.

The data is stored in S3 as XML or perhaps even JSON format, and accessed as needed. Speed of access isn’t nearly as important in this application, but given that the amount of data retrieved at any one time would be relatively small, the speed shouldn’t be an issue anyway, especially over the (presumably big) pipes going between S3 and EC2.

Scenario Two: A High Traffic Social Networking Style Site

This application would be a high read, low write web site along the lines of say Digg, MySpace, etc. This one would be a little trickier to implement as the volume of reads would be quite high on common data. However I think it is still possible.

Data storage would be pretty much the same as in scenario one, however for this to work effectively, this application would almost certainly require caching using either a simple in-memory cache or using something like JCS or OSCache. The even trickier part would be if the site had something like voting. This portion of the application could see a high volume of writes and would have to be handled correctly. One possible solution to this would be to ‘write’ to the cache (of sorts) of the model and flush the writes to the file system at determined intervals. This flushing or notification would be done via a messaging system (which Amazon also provides incidentally).

At the moment the only way I can think of this really working would be that the data isn’t really being modified concurrently, but rather incrementally. To use Digg again, if one person gets a vote in before another person, it doesn’t really matter. The same could be said for comments. It doesn’t really matter that one person’s comments get posted before another’s. In other words the data being modified isn’t a single data item like say a product price or an account balance, but rather is being appended to.
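
To make the ‘write to the cache and flush at intervals’ idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under several assumptions: `VoteBuffer` and its methods are hypothetical names, the `store` dict stands in for the S3-backed files, and a single process is assumed to do the flushing (in the real thing that job would be triggered by a timer or a message from the queue).

```python
import json
import threading
from collections import Counter


class VoteBuffer:
    """Accumulates vote increments in memory and flushes them in batches."""

    def __init__(self, store):
        # `store` is a hypothetical dict-like stand-in for the S3-backed files.
        self._store = store
        self._pending = Counter()
        self._lock = threading.Lock()

    def vote(self, story_id, delta=1):
        # Votes are increments, so the order in which they arrive does not matter.
        with self._lock:
            self._pending[story_id] += delta

    def flush(self):
        # Swap the pending counts out under the lock, then write outside it.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        for story_id, delta in batch.items():
            record = json.loads(self._store.get(story_id, '{"votes": 0}'))
            record["votes"] += delta
            self._store[story_id] = json.dumps(record)
        return len(batch)  # number of stories written in this flush


store = {}
votes = VoteBuffer(store)
for _ in range(3):
    votes.vote("story-42")
votes.vote("story-7")
written = votes.flush()
```

Four votes turn into two writes here, which is the whole point: the write volume against the store depends on the flush interval rather than on the number of votes.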

Issues With Both Approaches

Transactions

The first issue that comes to mind is transaction support. Obviously this is where any RDBMS shines, but given that in both cases there is little to no multi-user modification of the data, one could argue that the ‘transaction’ could simply be done at the middle tier or application side.

Queries

Again, SQL is wonderful for queries, especially dynamic queries. This could be countered by using either XPath style queries or a search engine like Lucene, or even a combination of the two. Yes, yes, nothing really compares to the power of SQL for query capabilities, and I know that. The point is that it would really depend on the kind of queries needed. If you are not doing analytical queries and really only need “get object where property is like X” style queries, then other methods can provide that.

The downside to using something like XPath or the equivalent JSON/JavaScript mechanism is that the data would have to be parsed on each query. Depending on the amount of data needed for the query, that could be unacceptably CPU intensive. This could be countered by using a cache and executing those queries against the cache, something I have done in the past against object trees via JXPath with great success. This approach degrades badly when your data files are really large. I have yet to find a SAX parser that allows access via XPath, which would go a long way towards overcoming this problem. (Anybody know of one, by the way?)
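
As a rough illustration of both ideas, here is a sketch using Python’s standard `xml.etree.ElementTree` rather than JXPath. The first half parses a document once, caches the tree and runs simple XPath-style lookups against it; the second half streams through the document with `iterparse` and discards each record after checking it, which keeps memory flat for large files but is plain per-record matching, not real XPath over SAX. The sample XML, `fake_fetch` (a stand-in for an S3 GET) and the function names are all made up for the example.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<accounts>
  <account id="1"><name>Ann</name><plan>pro</plan></account>
  <account id="2"><name>Bob</name><plan>free</plan></account>
  <account id="3"><name>Cy</name><plan>pro</plan></account>
</accounts>
""".strip()

_parsed_cache = {}  # key -> parsed root element
fetch_calls = []    # records how often the (fake) store is hit


def fake_fetch(key):
    # Hypothetical stand-in for an S3 GET; always returns the sample document.
    fetch_calls.append(key)
    return SAMPLE_XML


def load_tree(key, fetch=fake_fetch):
    # Parse once, then answer every later query from the cached tree.
    if key not in _parsed_cache:
        _parsed_cache[key] = ET.fromstring(fetch(key))
    return _parsed_cache[key]


def names_on_plan(key, plan):
    root = load_tree(key)
    # ElementTree supports a limited XPath subset, including [tag='text'].
    return [a.findtext("name") for a in root.findall(f"./account[plan='{plan}']")]


def stream_names_on_plan(xml_text, plan):
    # Incremental parse: inspect each <account> as it completes, then drop it.
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "account":
            if elem.findtext("plan") == plan:
                yield elem.findtext("name")
            elem.clear()


pro_names = names_on_plan("accounts.xml", "pro")
free_names = names_on_plan("accounts.xml", "free")
streamed = list(stream_names_on_plan(SAMPLE_XML, "pro"))
```

The two cached queries cost a single fetch and a single parse. The streaming version re-reads the file on every call, so it only makes sense for files too large to keep parsed in memory.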

Using something like Lucene really shines in the performance department, but fails when you need to do cross-data queries. In an RDBMS this would be akin to joining tables and such, again back to the more analytical queries. The flip side is that if your data is pretty much hierarchical in nature and doesn’t need to be normalized, then you are good to go.

I know that RDBMSs provide many other benefits, but again we aren’t building an ERP system here, but rather high read, low write applications.

Benefits

The first benefit is that you don’t have to incur the overhead of a database, and your scaling isn’t dependent on clustering database servers or on replication strategies like a write database that replicates out to read databases. You also don’t have the disk I/O associated with an RDBMS. Yes, those files have to load off of the file system at some point, but loading simple files is much faster than going through a database any day. Look at any decent content management system and you will see that they usually implement caching by generating a static HTML page that the web server can serve directly. Yes, it is true that databases are pretty darn fast these days for the most part, but without some complicated configurations and installations, they don’t cluster as easily as other means of data access.

The applications should be fairly easy to build, and build quickly, and the scaling of those apps should come almost free. The architecture is straightforward yet effective. If the app needs to use a caching tool, then expanding that cache should be as easy as telling it the IP addresses of the newly instantiated cache servers. If you can make the web servers stateless, then you don’t have to worry about that either; otherwise, use a method like sticky sessions.

Here is a thought: what if we combined the rapid development of, say, Grails with this architecture? Instead of accessing the domain objects via Hibernate, what if they were populated from JSON or XML as needed? Or have simple coarse grained domain objects used for display purposes, again populated in controllers or services and passed on to the UI? Graeme, Guillaume or any other Grails/Rails guru out there have a comment on that? I know it is technically feasible (there is an S3 API for Rails, for example), but does it make sense?
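
For what it’s worth, the ‘coarse grained display object populated from JSON’ part is only a few lines in most languages. Here is a sketch in Python (standing in for Groovy purely for illustration); `StoryView`, `story_from_json` and the field names are invented for the example, not taken from any framework.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class StoryView:
    """A coarse-grained, logic-free object used only for display."""
    id: str
    title: str
    votes: int = 0
    comments: List[str] = field(default_factory=list)


def story_from_json(raw):
    # A controller or service would call this with the text fetched from storage.
    data = json.loads(raw)
    return StoryView(
        id=data["id"],
        title=data["title"],
        votes=data.get("votes", 0),
        comments=list(data.get("comments", [])),
    )


raw = '{"id": "42", "title": "Scaling without a database", "votes": 17}'
story = story_from_json(raw)
```

There is no mapping layer and no session to manage; missing optional fields simply fall back to defaults.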

So, Can It Be Done?

I’m sure the DBAs among you will protest and say that I’m forgetting about feature A, B, C, D…Z that an RDBMS provides. Yes, I know about relational integrity, data integrity, yada, yada, yada. Yes, they all have their place and usefulness. I am neither denying nor forgetting about those features and benefits. The point is: are they absolutely needed for these two scenarios?

For the object fanboys (and yes I love OO as much as the next developer), that need can be satisfied by caching domain objects and running queries on them, like the JXPath method I mentioned earlier. One could also make the argument that DOM is an object model, a generic one, but one nonetheless.

Has anyone built anything like this? Any other thoughts?


20 Responses to “Building a high volume app without an RDBMS or Domain objects”


  1. Luis Bruno

    You probably are already aware of this, but if you restrict your XPath to forward axes, you can use SAX. For Python, you might be interested in ElementTree (which is somewhat anemic in its XPath support); you can always build your own, which is probably boring.

  2. The Bull

    Thanks for the comment Luis. I haven’t seen any examples of using XPath against a SAX parser, but then I haven’t had to use one in quite some time. I certainly don’t want to write one either!

  3. Jon Frisby

    Discarding the database isn’t necessarily of great benefit, but discarding the idea of writing directly to the database — and all that it implies — allows you a tremendous amount of freedom.

    In other words, if you eliminate the guarantees that the user’s change will always be visible instantly, you open up an enormous amount of freedom in finding alternative solutions.

    If you queue “write” events up, either on a local-to-the-web-server database, or a log file or some such, and then process those asynchronously into your data store you can achieve the goal of not having a central database bottleneck (single point of failure) quite easily. It also makes it much easier to get around the lack of transactions — you can cheat and use a database to act as a central locking authority for locking individual files or resources — it’s not used to serve web requests, just used by the back-end for updating things.

    Don’t be afraid of a hybrid approach where entity data that is searched for only by (or primarily by) some natural key (id, email, whatever) is stored in the filesystem (or S3) but a summary is kept in a DB (and replicated out to web servers using a one-way replication scheme, like MySQL has — sadly, Postgres’ Slony is not a good choice for this as it exhibits N^2 properties) for purposes of performing queries.

    Honestly, I wouldn’t recommend going down the rabbit hole of using XPath for actually *searching* for stuff. Doing so implies either that you have many entities in one file, which means you have to parse much or all of the file just to display one entity (blech!) or that you’re going to query a bunch of different files, which implies a lot of file ops (and the filesystem doesn’t scale better than a database in terms of concurrent usage, for many of the same reasons).

    Use the filesystem or S3 for the common database pattern of find-single-entity-by-PK, or find-group-of-entities-by-FK but there are plenty of other options for search (Lucene and other search engines, databases, etc).

    -JF
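
As an illustration of the find-by-PK and find-group-by-FK pattern described in this comment, here is a minimal Python sketch. Everything in it is an assumption made for the example: `KeyValueStore` is an in-memory stand-in for S3 (which can list keys by prefix), and the `stories/<id>/comments/<id>.json` key layout is just one possible naming scheme.

```python
import json


class KeyValueStore:
    """Hypothetical in-memory stand-in for S3: string keys, prefix listing."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


def comment_key(story_id, comment_id):
    # The parent ("foreign key") is the prefix; the primary key comes last.
    return f"stories/{story_id}/comments/{comment_id}.json"


def save_comment(store, story_id, comment_id, text):
    body = json.dumps({"id": comment_id, "text": text})
    store.put(comment_key(story_id, comment_id), body)


def get_comment(store, story_id, comment_id):
    # Find a single entity by primary key: one direct lookup.
    return json.loads(store.get(comment_key(story_id, comment_id)))


def comments_for_story(store, story_id):
    # Find a group of entities by foreign key: list everything under the prefix.
    prefix = f"stories/{story_id}/comments/"
    return [json.loads(store.get(k)) for k in store.list_keys(prefix)]


store = KeyValueStore()
save_comment(store, "4", "c1", "First!")
save_comment(store, "42", "c1", "Interesting idea")
save_comment(store, "42", "c2", "What about joins?")
one = get_comment(store, "42", "c2")
group = comments_for_story(store, "42")
```

The trailing slash in the prefix keeps story 4 and story 42 apart. Anything beyond these two access paths is where a search engine or database summary would come in.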

  4. The Bull

    Thanks for the comment Jon. I’m really thinking of trying this out and possibly using a message queue for the write events, much like you mention.

    Good point on the XPath. I wouldn’t use it to ‘query’ so much as to get by a certain key. Going this route I would most likely use a Lucene based search engine (actually probably use the Solr project which is Lucene based). That would return me a list of IDs that match the query from which I could either get one at a time — probably expensive from S3 — or from larger groups of files via XPath or something similar.

    Then again, I recently got an email about a plugin for MySQL that uses S3 as the storage system which seems intriguing.

  5. Ted Stockwell

    I think that the style of application to which this technique would be applicable is very small. I would prefer to see a distributed database be developed that works on top of S3. That way there is no need to discard traditional database development but still get scalability of utility computing architecture.
    Something like the version of MySQL that stores data on S3 coupled with Hibernate Shards would be what I would prefer….

    http://fallenpegasus.com/code/mysql-awss3/
    http://www.hibernate.org/414.html

  6. The Bull

    There are plans in the works for building a distributed MySQL cluster for EC2, and there is also a plugin for MySQL that uses S3 for storage. Either of those solutions has the potential to really help applications that truly need a relational database. Where I was going with the concept was to question whether we should always use a relational database for storage when there are other options available. RDBMSs are so common that we tend to use them without any real regard as to why we are using them. It has been more of a ‘because that is what you do’.

  7. wpbarr

    There’s a far easier way to scale. First, partition your data by the major natural search paths (geography, last name, subject, etc.) and then back-end the application with an object-oriented database. If you want data distribution to be handled for you, look into something like the InterSystems Caché database.

  8. Dan Creswell

    The thing is that for a lot of truly large scale usage as is seen on successful websites, conventional RDBMS have the wrong model. Bosworth says it best:

    http://www.adambosworth.net/archives/000038.html

  9. MikeD

    There are a lot of good suggestions here. Some key points are to consider not using an RDBMS to drive website display, but rather consider storing in an RDBMS then publishing to a read-only system. The read-only system could also be a DB engine, but one optimized for mainly serving up blobs. A file system sort of looks like that, but Yahoo uses MySQL and Amazon uses Berkeley DB (now owned by Oracle).
    One suggestion was to send messages instead of writing to a DB, but many user-oriented apps would benefit from having a transactional ‘edit’ portion separated from a ‘search and serve’ caching system - the messaging would happen from database writes, either homegrown or using replication to slaves.

    The suggestion for partitioning is also valuable, but be very very careful about choosing the key to partition on - it should never change. People change names, emails, locations all the time - you want to avoid moving data between partitions.
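
A small Python sketch of the partition-key advice above: the partition is derived only from an identifier that never changes, so edits to names, emails or locations cannot move a record. The function name, the user record and the partition count of 8 are all made up for the illustration.

```python
import hashlib


def partition_for(user_id, partition_count):
    """Map an immutable identifier to a partition number."""
    # hashlib gives the same answer in every process and on every server,
    # unlike Python's built-in hash() for strings, which is randomized per run.
    digest = hashlib.sha1(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


user = {"id": 1001, "email": "old@example.com"}
before = partition_for(user["id"], 8)

# The user changes their email; the partition is unaffected because
# the email never took part in choosing it.
user["email"] = "new@example.com"
after = partition_for(user["id"], 8)
```

Note that changing the partition count would still move data with this simple modulo scheme, so that number needs the same care as the key itself.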

  10. The Bull

    Replicating to a read-only DB is rather common, and one could argue that you could accomplish this within MySQL without having to actually replicate, since MySQL’s default engine is optimized for reads anyway. Replicating would distribute the load, however.

    I’ve seen how eBay partitions their system and it is pretty slick and complicated. As for keys, this is one of the reasons I don’t like DB generated keys.

  11. Steven Wilmot

    I’m a database consultant (and make a living from advising clients on appropriate use of an RDBMS), so this article was an interesting read.

    I agree with Jon Frisby’s earlier comments that discarding the database entirely isn’t necessarily a good thing.

    For large systems like the ones mentioned, each technology has its own place.
    - File-system storage and storing multiple copies of data (stored initially by a natural key) are a good way to start.
    - Large-scale caching systems with read-only replicated copies of the database (or appropriate sections of it) also have their place.
    - However, even in these situations, RDBMS systems still offer advantages such as proper transactional locking on concurrent access (perhaps of configuration settings or financial management).

    Many articles that I’ve read today seem to start by suggesting that their particular design paradigm is (or could be made) applicable as the solution to too many problems.

    The successful systems seem to be the ones that don’t try a “one solution fits all” approach, but instead only use RDBMS systems in appropriate places, making the best use of caching, queued updates, and replication, using each technology to its own advantage.

  12. The Bull

    Thanks for the comment Steve. I completely agree that RDBMSs have lots to offer and you mention only a fraction of it. My thoughts were along the lines of those apps that don’t need the things that an RDBMS is really good at, like transactions. Databases can be good for caches (such as MySQL’s in-memory tables and such), and for other things that can also be done with other systems.

    Relational databases are just so common these days that I was curious about other alternatives.

  13. Bob Lozano

    Many good thoughts in this thread, interesting to consider in light of how Digg and other similar sites are scaling etc. I think that assuming no db is a great formalism (see where it takes you and all that), but not the whole picture. In this post I start to think down the “this is good, what else might help” path a bit.

    Thx for exploring this point - it’s definitely time to figure out how to scale both the data layer and the apps that depend on it cheaper & easier.

  14. DAR

    One other issue with this setup from a more practical perspective - the choice of S3 itself carries some potential negative consequences with it:

    * It *does* have outages from time to time. If you’re using it as the backing store of a high-volume web app, then when it goes down your app goes down. Probably best not to tie your site’s reliability to AWS’s. Most web companies that use S3 (e.g., SmugMug) use it as backup or “cold” storage for that reason.

    * There’s propagation latency inherent to S3. So if one of your EC2 instances stores something to S3, other instances might not see it for a while until it propagates through.

