Everything old is new again

It’s been interesting watching the flurry of activity in data-storage land over the last few months. CouchDB has been improving in leaps and bounds, and multi dimensioned data stores have been getting a lot more attention in general.

I started using Couch on a project a few weeks ago with the DataMapper adapter, specifically for scalability and search reasons. Migrating from my test SQLite database to Couch was a breeze; however, things started getting ugly after a short while.

The main obstacle was that DataMapper’s Couch adapter integrated pretty clunkily with DataMapper. In particular:

  • You had to mix DataMapper::CouchResource into your models instead of the bog-standard DataMapper::Resource.
  • The standard DataMapper finders didn't work, so you had to wrap all your queries in Couch views.

As much as this annoyed me, dm-couchrest-adapter’s maintainer is a smart guy, so I knew there was a good reason for it being that way, even if it wasn’t immediately apparent.

That good reason presented itself to me a few days ago when I started playing with dm-ferret-adapter to get full text search with Ferret on my models.

I was trying to work out how you do a multi-field search, but as with most things in the Ruby world, the documentation was lacking. The author of the dm-sphinx-adapter fortuitously posted on the mailing list about how his adapter handled the problem, so I went digging around inside dm-ferret-adapter’s and dm-is-searchable’s internals to work out why it wasn’t behaving.

The crux of the problem was that DM was being too smart for its own good and tried to match fields listed in the :conditions parameter to actual fields in the database, so passing a big search string in :conditions would explode before it even hit the Ferret adapter.

And thus we’ve hit a fundamental problem with DataMapper’s current implementation: at its core, it’s still an ORM for relational databases, so adapter authors are always going to be fighting an uphill battle when trying to integrate a non-relational data store.

So back in Couch land, I ended up switching to the CouchRest ORM to talk to Couch. Explaining why he wrote CouchRest as a standalone ORM, Chris Anderson stated:

(I could have written a DataMapper adapter for CouchDB, but much of DataMapper’s code is based around SQL-like problems that CouchDB just doesn’t have.)

… sounds just like the problems I was referring to.

Anyhow, why the title of this post?

Well, last year I briefly hacked on some business banking code for Suncorp, and I was introduced to the wonderful world of UniVerse BASIC. I’m guessing that almost none of the readers of this blog have ever heard of UniVerse, but some may have heard of Pick.

Pick was a pre-Unix operating system and rapid application development environment whose origins date back to 1965. Pick’s killer feature was its MultiValue database (think hash table), and it was specifically targeted at businesses and business analysts.

You’re probably thinking “Woo, a hash table - why should I care about this, Lindsay? My language already has Hashes/Dictionaries/HashMaps/filing cabinets”. Well, Pick’s hash table implementation was (and still is) pretty kick-arse for its time. There’s a query language (suspiciously similar to, but subtly different from, SQL), and it backs onto an incredibly well-tested on-disk data store.

There’s also no enforced schema, so it was particularly useful in the accounting world, where relational databases with rigidly enforced schemas aren’t a good fit. Hence, there are a lot of financial applications out there written in Pick or a Pick derivative.

If you’ve ever worked with any of the EDIFACT data formats, you’ve indirectly worked with Pick. Those pesky separators and terminators (', +, :, ?) are exactly the sort of thing MultiValue databases handle really well. If you were to represent an EDIFACT segment the way UniVerse would process it:

# EDIFACT segment
TVL+240493:1740::2030+JFK+MIA+DL+081+C'

To Python tuples and dictionaries:

( 'TVL', {'240493': ('1740', None, '2030')}, 'JFK', 'MIA', 'DL', '081', 'C')

To Ruby arrays and hashes:

[ 'TVL', {'240493' => ['1740', nil, '2030']}, 'JFK', 'MIA', 'DL', '081', 'C']
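To make the translation concrete, here’s a minimal Ruby sketch of my own (illustrative only, not UniVerse code, and it ignores the ? release character): split on + for data elements, then on : for components, with empty components becoming nil:

```ruby
# Toy parser for a single EDIFACT segment.
def parse_segment(segment)
  segment.chomp("'").split('+').map do |element|
    components = element.split(':', -1)      # -1 keeps trailing empty components
    next element if components.length == 1   # simple element, pass through
    # Composite element: first component keys the rest, '' becomes nil
    { components.first => components[1..-1].map { |c| c.empty? ? nil : c } }
  end
end

parse_segment("TVL+240493:1740::2030+JFK+MIA+DL+081+C'")
# => ["TVL", {"240493"=>["1740", nil, "2030"]}, "JFK", "MIA", "DL", "081", "C"]
```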

By now you’re thinking “oh god, now I have to iterate over a whole bunch of nested data structures”, but fortunately Pick’s BASIC implementation provided syntax that made this pretty straightforward.
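I don’t have real Pick BASIC to hand, but the flavour of its angle-bracket dynamic-array extraction (REC&lt;field, value, subvalue&gt;) can be approximated in Ruby. The delimiters below are the conventional display characters for Pick’s field, value, and subvalue marks, and extract is a hypothetical helper of mine, not actual Pick syntax:

```ruby
FM = '^'   # field mark    (conventional display character)
VM = ']'   # value mark
SM = '\\'  # subvalue mark

# Rough Ruby analogue of Pick BASIC's REC<field, value, subvalue>
# extraction. Indices are 1-based, as in Pick; 0 means "the whole thing".
def extract(record, field, value = 0, subvalue = 0)
  f = record.split(FM)[field - 1].to_s
  return f if value.zero?
  v = f.split(VM)[value - 1].to_s
  return v if subvalue.zero?
  v.split(SM)[subvalue - 1].to_s
end

invoice = 'ACME^widget]sprocket^2]5'   # made-up record, two multivalued fields
extract(invoice, 1)     # => "ACME"
extract(invoice, 2, 2)  # => "sprocket"
extract(invoice, 3, 1)  # => "2"
```

No explicit loops in sight, which is roughly the ergonomic win Pick BASIC gave you over hand-rolled nested iteration.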

Anyhow, the MultiValue technology behind Pick was licensed to roughly three dozen companies during the ’70s and ’80s, but there’s been a lot of consolidation in the Pick market since then, and the main player is now actually IBM. They purchased two implementations, UniVerse and UniData, rebranded them as U2 (sorry, no Bono here), and have been continually developing them ever since.

IBM have written .NET and Java interfaces to U2 data stores, there’s integration with RedBack (a web application development framework), and more recently work has gone into PHP, Python, and Ruby bindings.

Pick’s usage in the enterprise was, and still is, phenomenal. Last year at IBM’s U2 University in Sydney, the U2 product manager quoted a statistic: the U2 team estimates that at least 60% of IBM’s clients are directly using either UniVerse or UniData. A large majority of these systems are small back-office setups that were installed decades ago, that next to nobody touches, but that are mission-critical.

So after seeing Tokyo Cabinet do the rounds this week in the Ruby sphere, it’s pretty obvious that multi dimensioned data stores are experiencing a bit of a resurgence.

Google’s success with BigTable has kicked a lot of smart people into gear: CouchDB, HBase, and Tokyo Cabinet are shining examples of awesome work being done in the DBMS sphere.

What I think is going to make a difference this time:

  • Implementations are not walled gardens. IBM's U2 products are not open source and have a significant monetary barrier to entry. It's a problem the entire Pick marketplace suffers from, and it's why there isn't a lot of young talent in the Pick sphere anymore.
  • The multi dimensioned data paradigm maps really well onto existing (and popular!) interchange formats. Take a look at JSON - its uptake over the last few years has been impressive, to say the least. YAML is another great example. They succeed where rigid data formats don't fit. Also, they're not the new EDIFACT.
  • Developers are hitting barriers with RDBMSes. If there's one thing we can learn from the hype-fest that was "Web 2.0", it's that scalability is hard. Multi dimensioned databases aren't a magical elixir for the scalability problems of developers around the world, but they do prompt people to think of alternative ways of storing their data.
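On the JSON point: the nested structure that fell out of the EDIFACT segment earlier serializes to JSON with no impedance mismatch at all, which you could never say about flattening it into relational tables. A quick Ruby demonstration using the standard json library:

```ruby
require 'json'

# The Ruby representation of the EDIFACT segment from earlier
segment = ['TVL', { '240493' => ['1740', nil, '2030'] },
           'JFK', 'MIA', 'DL', '081', 'C']

json = JSON.generate(segment)
puts json
# => ["TVL",{"240493":["1740",null,"2030"]},"JFK","MIA","DL","081","C"]

JSON.parse(json) == segment  # => true, it round-trips cleanly
```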

It’ll be interesting to see whether the industry will start taking up multi dimensioned data stores en masse any time soon.