Netuality

Taming the big, bad, nasty websites

Archive for the ‘NoSQL’ tag

And now for something a little different: graphviz candy

No self-respecting geek can resist either of these incredible temptations:

  • Correct something wrong on “the internets”
  • Produce a little bit of graphviz candy (what a fantastic tool)

Written by Adrian

May 6th, 2010 at 4:37 pm

Posted in Presentations

Linkdump: Twitter, Twitter, CAP and … iPad

Well, not all of Twitter runs on Cassandra :) Alex Payne explains how they built Hawkwind, a distributed search system written in Scala. Take a look at slide 18, where you can clearly see that they use HBase as the backend:



Also from the great guys at Twitter: gizzard. An interesting and appropriate name for a database sharding framework. Gizzard uses range-based partitioning and a replication tree, and it can sit on top of a wide range of data stores: RDBMSes, Lucene or Redis – you name it. But I wonder about the operational overhead when you have a really large gizzard cluster.
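To make "range-based partitioning plus replication" a bit more concrete, here is a minimal Python sketch. It is not Gizzard's API (Gizzard itself is written in Scala): the class name, the store names and the key ranges are all made up for illustration, and the replication "tree" is flattened to a plain list of replicas per range.

```python
import bisect


class RangeShardMap:
    """Maps integer keys to replicas by key range.

    Hypothetical illustration only, not Gizzard's actual API.
    """

    def __init__(self, ranges):
        # ranges: list of (inclusive lower bound, [replica names]).
        self._ranges = sorted(ranges, key=lambda r: r[0])
        self._lower_bounds = [lower for lower, _ in self._ranges]

    def replicas_for(self, key):
        # Pick the last range whose lower bound is <= key.
        idx = bisect.bisect_right(self._lower_bounds, key) - 1
        if idx < 0:
            raise KeyError(f"no shard covers key {key}")
        return self._ranges[idx][1]

    def write(self, key, value, stores):
        # Fan the write out to every replica that owns the key's range.
        for name in self.replicas_for(key):
            stores[name][key] = value


# Two ranges, each replicated on two (made-up) backing stores.
shard_map = RangeShardMap([
    (0, ["mysql-a1", "mysql-a2"]),
    (1000, ["mysql-b1", "redis-b2"]),
])
# Plain dicts stand in for the real data stores.
stores = {name: {} for name in ["mysql-a1", "mysql-a2", "mysql-b1", "redis-b2"]}
shard_map.write(42, "hello", stores)
shard_map.write(1500, "world", stores)
```

The operational worry is visible even in a toy like this: every new range or replica is one more entry in the forwarding table that somebody has to keep consistent.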

Michael Stonebraker has a short essay on CAP published in the ACM blogs. He identifies a series of use cases where the CAP theorem simply does not apply and cannot be appealed to for guidance:

Obviously, one should write software that can deal with load spikes without failing; for example, by shedding load or operating in a degraded mode. Also, good monitoring software will help identify such problems early, since the real solution is to add more capacity. Lastly, self-reconfiguring software that can absorb additional resources quickly is obviously a good idea.

In summary, one should not throw out the C so quickly, since there are real error scenarios where CAP does not apply and it seems like a bad tradeoff in many of the other situations.
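The "shedding load or operating in a degraded mode" part of the quote can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `LoadShedder` class, the `handle` helper and the `"503 busy"` reply are invented names, the limit is a fixed number of in-flight requests, and the counter is not thread-safe.

```python
class LoadShedder:
    """Admits requests up to a fixed in-flight limit and rejects the rest."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.shed = 0  # how many requests were turned away

    def try_acquire(self):
        if self.in_flight >= self.max_in_flight:
            self.shed += 1
            return False
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


def handle(shedder, request, work):
    # Degraded mode: answer quickly with an error instead of queueing forever.
    if not shedder.try_acquire():
        return "503 busy"
    try:
        return work(request)
    finally:
        shedder.release()
```

The `shed` counter is the kind of number the monitoring mentioned in the quote would watch: if it keeps growing, the real fix is more capacity.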

Great nosqlEu coverage on Alex Popescu’s blog MyNoSQL. Read it to get all the presentations, tons of links and Twitter quotes.

Because every self-respecting blog should mention some info about the newly released iPad, here’s mine. According to O’Reilly Radar, the iPad is not ready for cloud integration:

I am hoping for a future where all I need to supply a device with is my identity, and everything else falls into place. This doesn’t even have to be me trusting in a third-party cloud: there’s no reason similar mechanisms couldn’t be used privately in a home network setting.

I think the iPad is an amazing piece of hardware, and the most pleasant web browsing experience available. It is still very much a 1.0 device though, and its best days certainly lie ahead of it. I hope part of that improvement is a simple story for synchronization and cloud access.

Guess I’ll be waiting for the release of iPad Pro:

Written by Adrian

April 21st, 2010 at 11:24 pm

Posted in Linkdump

Linkdump: Cassandra @Twitter, Forrester not grokking NoSQL

Seven signs you need to accept NoSQL in your life, according to the High Scalability blog. I especially like sign #6, “Maintaining a completely separate object caching system on top of an already beefy table storage system“. There are companies making serious bucks from selling exactly this type of caching system. I find that a bit ironic, don’t you?

Twitter has just decided to adopt Cassandra as their main storage. I roughly estimated the status table to have more than 9 billion rows – a good table size to start thinking about the benefits of NoSQL. I would have been interested in seeing a comparison with other existing solutions and a rationale for their choice. According to some sources, Ryan King rejected HBase because, if a region server is down, writes are blocked for the affected data until that data is redistributed – unlike Cassandra’s “write never fail” policy. According to other sources, this will be solved in a future version of HBase, but I think Twitter needed a solution sooner rather than later. I hope for two things:

  • That the Twitter dudes will blog about their migration experience
  • That I’ll be able to access and search through all my older tweets, fer’ God sake!

Forrester Research thinks that NoSQL and Elastic Caching Platforms are very similar. So similar that “NoSQL Wants To Be Elastic Caching When It Grows Up“. According to Forrester “Ultimately, the real difference between NoSQL and elastic caching now may be in-memory versus persistent storage on disk.” Yeah sure: transactions, durability, indexing, security model – who needs this crap anyway?

Oh, and let’s not forget about today’s unscheduled GAE downtime. Looking forward to the post mortem; for sure there will be a thing or two to learn…

Written by Adrian

February 24th, 2010 at 11:18 pm

Posted in Linkdump

Hadoop Map/Reduce versus DBMS, benchmarks

Here’s a recent benchmark published at SIGMOD ’09 by a team of researchers and students from Brown, M.I.T. and Wisconsin-Madison universities. The details of their setup are here, and this is the paper (PDF).

They ran a few simple tasks such as loading, “grepping” (as described in the original M/R paper), aggregation, selection and join on a total of 1TB of data. On the same 100-node Red Hat cluster they compared Vertica (a well-known MPP database), “plain” Hadoop with custom-coded Map/Reduce tasks and an unnamed DBMS-X (probably Oracle Exadata, which is mentioned in the article).
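For readers who have never written one, here is roughly what the “grepping” and aggregation tasks look like in the Map/Reduce style, as a single-process Python sketch. It is not Hadoop code and not the benchmark’s code: the `run_job` helper, the function names and the `ip,revenue` record format are assumptions made up for the example.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for a Map/Reduce job."""
    groups = defaultdict(list)
    # Map phase, plus a "shuffle" that groups emitted values by key.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


def grep_mapper(pattern):
    # Builds a mapper that emits only the records containing the pattern.
    def mapper(record):
        if pattern in record:
            yield pattern, record
    return mapper


def collect(key, values):
    return list(values)


def revenue_mapper(record):
    # Hypothetical record format: "source_ip,revenue".
    ip, revenue = record.split(",")
    yield ip, float(revenue)


def total(key, values):
    return sum(values)


log = ["10.0.0.1,2.5", "10.0.0.2,1.0", "10.0.0.1,4.0"]
matches = run_job(log, grep_mapper("10.0.0.2"), collect)
revenue = run_job(log, revenue_mapper, total)
```

A parallel DBMS would express the second job as a single GROUP BY query, which is a big part of why the two camps argue about productivity as well as speed.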

The final result shows Vertica and DBMS-X being (not astonishing at all!) 2 and 3 times faster, respectively, than the brute M/R approach. They also mention that Hadoop was surprisingly easy to install and run, while the DBMS-X installation process was relatively complex and had to be followed by tuning. The parallel databases used space more efficiently thanks to compression, while Hadoop needed at least 3 times the space due to its redundancy mechanism. A good point for Hadoop was its failure model, which allows quick recovery from faults and uninterrupted long-running jobs.

The authors recommend parallel DBMSes over “brute force” models: “[…] we are wary of devoting huge computational clusters and “brute force” approaches to computation when sophisticated software could do the same processing with far less hardware and consume far less energy, or in less time, thereby obviating the need for a sophisticated fault tolerance model. A multithousand-node cluster of the sort Google, Microsoft, and Yahoo! run uses huge amounts of energy, and as our results show, for many data processing tasks a parallel DBMS can often achieve the same performance using far fewer nodes. As such, the desirable approach is to use high-performance algorithms with modest parallelism rather than brute force approaches on much larger clusters.”

What do you think, dear reader? I would be curious to see the same benchmark replicated on other NoSQL systems. Also, I find 1TB too small for most web-scale apps today.

Written by Adrian

January 3rd, 2010 at 10:40 pm

Posted in Articles
