Thoughts on MongoDB & co.

('NoSQL' databases)

Posted: 2012.01.28   Revised: 2012.04.10

I've talked with 3 local devs in the past week who've worked with MongoDB lately, and in contrast to my experience, it hasn't been a good fit for them. Two of them work together on large-scale search crawler stuff at TNR Global (a Java shop). The other (at a Python/MySQL shop) inherited a pile of crap code from some guys who tried Mongo for the hell of it, then bailed out.

I'm just building a CMS for artists' websites. MongoDB is touted as being flexible and scalable, but I chose it to simplify my stack... I liked the idea of keeping 200-byte concert postings and 500-meg videos in the same datastore. And the query language is Javascript, so in conjunction with Node.js, I have a pure Javascript stack... that consistency is worth something.


Here's a quick comparison of open-source NoSQL contenders, from oldest to newest:

  • MUMPS (from the 1960s-70s, still used mainly for medical records... just saying, this is not a new idea)

  • Hadoop: Java-based; mapreduce queries; Google-inspired

    • HDFS: distributed filesystem... like the Google File System (GFS)

      • HBase: column-oriented w/ Bloom filters... like BigTable
  • CouchDB: Erlang-based; Javascript mapreduce queries

  • MongoDB: C++; Javascript ad hoc queries; mmap I/O; optional journalling

    • GridFS: layer for large file storage
  • Redis: C; in-memory key-value database; optional journalling; Memcached-inspired?

  • Riak: Erlang-based; Javascript & Erlang queries; high-availability; Amazon-inspired

Here's what my friends at TNR had to say. First of all, loading their voluminous crawler logs into Mongo, they kept hitting a performance wall... I guess they just ran out of RAM, and adding servers wouldn't help; it's a master-slave replication architecture. Michael McIntosh started looking at Riak instead, which is great for dynamic scaling -- it stores 3 copies of everything, on 3 different servers, so you can be reading one copy, writing another, and migrating the third copy to another server, all at once. Chris Miles did a write-up on Hadoop/HBase a couple of weeks ago. Apparently it fits. There are mature heavy-duty reporting tools for Hadoop; they're already using Solr/Lucene and other Java stuff; and Java is tolerable when you're using Clojure.

There is a pretty prominent Use Cases page on mongodb.org, which says it's "less well suited" for "traditional non-realtime data warehousing" which is basically what TNR is doing. I guess "where the data volume is huge, MongoDB may be a fit" is kinda misleading; it may be a fit for huge realtime data, not bigger-than-huge data that can't possibly be processed in realtime!

I worked with CouchDB a bit, 3 years ago. Erlang was a bit of a pain to install, and it had crappy I/O throughput. Mapreduce queries were tedious and inflexible. And it just had too many bells and whistles... versioning, replication, etc... and most recently UNQL (pronounced "uncool"?), a throwback to the 1970s. Well, let's not diss CouchDB too much; it was a pioneer of the NoSQL renaissance.


MongoDB queries are a couple orders of magnitude quicker to write than CouchDB queries. The MongoDB query language feels like SQL, but it's an abstract "JSON" syntax, so you can build queries out of hash/array objects instead of all that error-prone nonsense of concatenating strings and escaping quotes. For performance, there are standard query operations (and, or, not, greater/less than, regular expressions) but you can also use Javascript which runs inside the Mongo server... like PL/SQL stored procedures, but a heck of a lot more elegant.
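To make the "build queries out of hash/array objects" point concrete, here's a minimal sketch. The helper name and the field names (artist, date, venue, city) are made up for illustration; only the operators ($gte, $lt, $or, and regular expressions) are standard MongoDB query operators. Note that the artist name with an apostrophe needs no quoting or escaping, because the query is data, not a string.

```javascript
// Hypothetical helper: turn search-form input into a MongoDB-style query object.
// Field names (artist, date, venue, city) are assumptions for this sketch.
function buildConcertQuery(input) {
  const query = {};

  if (input.artist) {
    // Exact match. The value is just a string property, so there is
    // nothing to quote or escape here.
    query.artist = input.artist;
  }

  if (input.from || input.to) {
    // Range condition assembled piece by piece.
    query.date = {};
    if (input.from) query.date.$gte = input.from;
    if (input.to) query.date.$lt = input.to;
  }

  if (input.text) {
    // Regex metacharacters still need escaping if the text should be
    // matched literally; that is a regex concern, not a query-syntax one.
    const escaped = input.text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    const pattern = new RegExp(escaped, 'i');
    query.$or = [{ venue: pattern }, { city: pattern }];
  }

  return query;
}

const concertQuery = buildConcertQuery({
  artist: "Sinead O'Connor",
  from: new Date('2012-02-01'),
  text: 'hall'
});
console.log(concertQuery);
```

The resulting object is what you'd hand to a driver's find call; the optional conditions are added or skipped with ordinary if statements instead of string surgery.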

Database Drivers

I've only looked at the Node.js drivers... so far I'm aware of:

The first one, node-mongodb-native (a port of the Ruby driver), is pretty stable. I'm using it. It's asynchronous, and a typical query requires 3 to 5 nested callbacks, which gets a bit tedious. The next two are convenience wrappers that make node-mongodb-native look almost like the 'mongo' command; you get to chain your lookup commands and do a single callback at the end. Mongoose adds an ORM layer in addition to some conveniences. I tried it first because I stumbled on the "Blog rolling with Node.js and MongoDB" tutorial... but the API was evolving rapidly and there were outdated docs everywhere. Also, I'm not keen on ORM. MongoSkin has some kind of ORM but it's purely optional. Mongolian DeadBeef is "built from the ground up for node.js" -- sounds promising, but it's experimental for now.

Is it ridiculous to use an ORM with MongoDB? ORMs are a kludge to make relational SQL databases look object-oriented, and to deal with that string concatenation/escaping nonsense once and for all (almost) for the popular flavors of SQL. MongoDB is already object-oriented. When you add an ORM layer, you're just adding dependencies that'll bite you in the ass later, i.e. when everyone realizes what a nightmare ORMs are and stops maintaining your MongoDB ORM. Yep, ORMs suck. Take it from Rich Hickey (creator of Clojure), who singled out ORMs for criticism in his excellent "Simple Made Easy" talk last year. He said that OO is the wrong approach, and declarative data languages (such as SQL and LINQ) are the right one. MongoDB queries are closer to the latter category than the former.

A point in favor of ORM-like layers: sometimes you want strictly enforced data types and validation. For example, I have a lot of Date/Time fields that should always be stored as Date objects, not ISO strings. And if they are accidentally stored as strings, I want an ORM-like layer to clean them up or error out so my templates don't barf (and I certainly don't want my templates littered with input-checking code!)
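The kind of thin validation step I have in mind could look something like this. It's a sketch, not any library's API: normalizeDates and the field names (startsAt, endsAt) are hypothetical. It converts parseable date strings to Date objects and errors out on anything else, so the templates never see a string where a Date should be.

```javascript
// Hypothetical validation step, run on a document before it is saved.
// dateFields lists the properties that must hold Date objects.
function normalizeDates(doc, dateFields) {
  // Shallow copy so the caller's object is left untouched.
  const clean = Object.assign({}, doc);

  dateFields.forEach(function (field) {
    const value = clean[field];

    // Treat missing fields as optional.
    if (value === undefined || value === null) return;

    if (value instanceof Date) {
      if (isNaN(value.getTime())) {
        throw new TypeError(field + ': invalid Date');
      }
      return;
    }

    if (typeof value === 'string') {
      // Clean up: convert a date string into a real Date.
      const parsed = new Date(value);
      if (isNaN(parsed.getTime())) {
        throw new TypeError(field + ': not a parseable date string: ' + value);
      }
      clean[field] = parsed;
      return;
    }

    // Anything else (numbers, objects, ...) is rejected outright.
    throw new TypeError(field + ': expected Date or date string, got ' + typeof value);
  });

  return clean;
}

const posting = normalizeDates(
  { title: 'Spring concert', startsAt: '2012-04-10T19:30:00Z' },
  ['startsAt', 'endsAt']
);
console.log(posting.startsAt instanceof Date); // true
```

That's about twenty lines with no dependencies, which is roughly the amount of "ORM" I'm comfortable with.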

I'm gonna play with MongoSkin and Mongolian and post a followup...

Issues with MongoDB

Durability was (past tense) a problem with single-server MongoDB deployments... i.e. you could lose data for no apparent reason. They solved that by adding journalling in 1.8, and as of 2.0 it's enabled by default on 64-bit systems. And with Mongo's memory-mapped disk I/O you're limited to 2GB on 32-bit, so you generally need to run 64-bit anyway. If you have a 32-bit dev box, fine: turn on --journal. Don't use Mongo if you need to deploy it on 32-bit servers for some reason.

GridFS feels bolted-on, and it is, also as a side-effect of mmap I/O. (You just can't mmap hundreds of 500-meg files at the same time, even on 64-bit.) In practice it doesn't add much complexity, though. And it's helpful to keep large files separate from your data records anyway. BTW, there's a misconception that you can't query GridFS files... sure you can: the metadata is in the 'fs.files' collection.
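As a rough illustration of querying GridFS metadata: the filter below is an ordinary query object aimed at the 'fs.files' collection. The fields length, contentType and uploadDate are part of the standard GridFS file document; the function name and the thresholds are made up for this sketch.

```javascript
// Builds a filter for the 'fs.files' collection.
// length, contentType and uploadDate are standard GridFS file-document
// fields; the function name and the example thresholds are assumptions.
function largeVideosSince(sinceDate, minBytes) {
  return {
    contentType: /^video\//,          // any video MIME type
    length: { $gte: minBytes },       // file size in bytes
    uploadDate: { $gte: sinceDate }   // stored by GridFS at upload time
  };
}

const gridFilter = largeVideosSince(new Date('2012-01-01'), 100 * 1024 * 1024);
console.log(gridFilter);

// In the mongo shell, the same filter would be passed as
// db.fs.files.find(...), just like a query on any other collection.
```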

I would not use MongoDB for, say, accounting apps, where transactional integrity is crucial. Relational databases are pretty ideal for that; SQL syntax isn't, but it'll do.


In closing, I see MongoDB as an evolutionary step. You've probably seen that triangle diagram: "Consistency, Durability, Replication: pick any two". MongoDB picks durability and replication, with a flexible query language and high performance for small to "finite-scale" datasets ("infinite-scale" meaning TNR-scale, Google/Amazon-scale, and beyond). There's ample opportunity to blur the scaling/flexibility tradeoff. Decentralization (not mere replication through a single chokepoint) is another frontier. And there's still a huge unfilled niche for a NoSQL relational database to cover the "Consistency" corner.

2012.04.10 Update: I've dropped MongoDB from my little CMS, in favor of plain old JSON files.

You've probably heard about Rich Hickey's Datomic project by now. Sounds like it might be the next evolutionary step, but it probably won't trickle down into everyday use for a few years.