Netuality

Taming the big, bad, nasty websites

Archive for the ‘HBase’ tag

And now for something a little different: graphviz candy


No self-respecting geek can resist either of these incredible temptations:

  • Correct something wrong on “the internets”
  • Produce a little bit of graphviz candy (what a fantastic tool)


Written by Adrian

May 6th, 2010 at 4:37 pm

Posted in Presentations


Linkdump: using HBase, CAP visuals, Farmville and more


Two great posts from my colleagues about why Adobe is using HBase: part 1 and part 2. As I’ve experienced all these firsthand, I guarantee this is solid, relevant information. Both articles are highly recommended reads.

Speaking of HBase, rumor on the street has it that they are taking HBASE-1295 (multi-data-center replication) very seriously and that we'll be seeing a new feature announcement relatively soon. Looking forward to it!

An older but still interesting presentation on how RIPE NCC is using Hadoop and HBase to store and search through IP addresses for Europe, the Middle East and Russia can be found here.

It looks like Farmville is still in its MySQL+memcache phase, according to the High Scalability blog. And they use PHP. When will they start looking into NoSQL? Hopefully soon enough to have a good crop.

Nathan’s visual guide to NoSQL systems, while perhaps not entirely accurate, is a nice attempt to put all these projects on the same map. I would love to see a “patched” version of the visual guide that takes into account all the information left in the comments…

Oh, and Twitter is using Protocol Buffers to store information on Hadoop. And they’re going to open-source their implementation.

Written by Adrian

March 17th, 2010 at 1:20 pm

Linkdump: Cassandra lovers, blowing the circuit breaker and Oracle clouds


Good points (as always) on Alexandru’s blog, discussing the “SQL scalability isn’t for everyone” topic.

NoSQL and RDBMS are just tools for the job, and there is nothing here about the death of one or the other. But as we’ve learned over the years, every new programming language is the death of all its precursors, every new programming paradigm is the death of everything that existed before, and so on. The part some seem to be missing, or deliberately ignoring, is that in most of these cases this death has never actually happened.

For large-scale performance testing of a production environment, check out how MySpace simulated 1 million concurrent users with a huge EC2 cluster, described on the High Scalability blog. While the article is a guest post from a company selling “cloud testing” solutions and has a bit of “sales juice” in it, it’s still a very good read:

Large-scale testing using EC2

Someone is in love with Cassandra after only 4 months. Hoping Cassandra doesn’t get too fat after the wedding:

Traditional sharding and replication with databases like MySQL and PostgreSQL have been shown to work even on the largest scale websites — but come at a large operational cost. Setting up replication for MySQL can be done quickly, but there are many issues you need to be aware of, such as slave replication lag. Sharding can be done once you reach write throughput limits, but you are almost always stuck writing your own sharding layer to fit how your data is created and operationally, it takes a lot of time to set everything up correctly. We skipped that step all together and added a couple hooks to make our data aggregation service siphon to both PostgreSQL and Cassandra for the initial integration.
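The hand-rolled sharding layer the quote mentions often starts as little more than a stable hash on the key. A minimal sketch of that idea (shard names and function are mine, purely illustrative; real layers also handle resharding, replicas and failover, which is where the operational cost comes from):

```python
# Minimal hash-based sharding sketch: route each record key to one of
# N database shards. The hash must be stable across processes, so we
# use md5 rather than Python's randomized built-in hash().
import hashlib

SHARDS = ["db01", "db02", "db03", "db04"]  # hypothetical shard hosts

def shard_for(key: str) -> str:
    # Same key always lands on the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))  # deterministic: repeated calls agree
```

Note the modulo trick is exactly why resharding hurts: changing `len(SHARDS)` remaps almost every key, which is one reason these layers end up bespoke.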

Distributed data war stories from Anders @ bandwidth.com, HBase and Hadoop on commodity hardware:

As mentioned before, the commodity machines I used were very basic but I was able to insert conservatively about 500 records per second with this setup. I kept blowing the circuit breaker at the office as well forcing me to spread the machines across several power circuits but it proved that the system was at least fault tolerant!

SourceForge chooses Python, TurboGears and … MongoDB for a new version of their website. Looks like Mongo is becoming quite mainstream.

Don’t believe the rumors: Oracle is into cloud computing after all, at least according to Forrester. Well, as long as the clouds are private. And as long as you can live with “coming soon” tooling. And it’s not like they really have a clear long-term strategy for cloud computing:

I believe that cloud is a revolution for Oracle, IBM, SAP, and the other big vendors with direct sales forces (despite what they say). Cloud computing has the potential to undermine the account-management practices and pricing models these big companies are founded on. I think it will take years for each of the big vendors to adapt to cloud computing. Oracle is just beginning this journey; I think other vendors are further down the track.

The igvita blog hits NoSQL in the groin by showing a simple way to get a schema-free data store… in MySQL. It’s a sort of proxy that translates a schema-free interface into denormalized data stored in distinct tables:

Instead of defining columns on a table, each attribute has its own table (new tables are created on the fly), which means that we can add and remove attributes at will. In turn, performing a select simply means joining all of the tables on that individual key. To the client this is completely transparent, and while the proxy server does the actual work, this functionality could be easily extracted into a proper MySQL engine – I’m just surprised that no one has done so already.

While it’s an interesting idea, I’m not sure how effective this will be in practice, as joins are among the most expensive operations in the database world. I’m pretty sure that replacing a primary-key lookup on a single 10-column table with a join across 10 tables will add significant overhead.
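The attribute-per-table idea is easy to sketch. Here is a toy version with SQLite standing in for the MySQL proxy (table naming scheme and function names are mine, not from the igvita post): each attribute gets its own `(key, value)` table created on the fly, and a read reassembles the “row” by joining every attribute table on the key.

```python
# Toy "one table per attribute" store, per the quoted description.
# Writes create attr_<name> tables lazily; reads join them on the key.
# Attribute names go straight into SQL here, so this sketch is not
# injection-safe; a real proxy would whitelist them.
import sqlite3

conn = sqlite3.connect(":memory:")

def set_attr(key, attr, value):
    # Create the attribute's table on first use, then upsert the value.
    conn.execute(f"CREATE TABLE IF NOT EXISTS attr_{attr} "
                 "(k TEXT PRIMARY KEY, v TEXT)")
    conn.execute(f"INSERT OR REPLACE INTO attr_{attr} VALUES (?, ?)",
                 (key, value))

def get_record(key, attrs):
    # Rebuild the logical row: one join per requested attribute.
    first = attrs[0]
    cols = ", ".join(f"attr_{a}.v" for a in attrs)
    sql = f"SELECT {cols} FROM attr_{first}"
    for a in attrs[1:]:
        sql += f" JOIN attr_{a} ON attr_{a}.k = attr_{first}.k"
    sql += f" WHERE attr_{first}.k = ?"
    row = conn.execute(sql, (key,)).fetchone()
    return dict(zip(attrs, row)) if row else None

set_attr("user:1", "name", "Ada")
set_attr("user:1", "city", "London")
print(get_record("user:1", ["name", "city"]))
# → {'name': 'Ada', 'city': 'London'}
```

Fetching ten attributes means a nine-way join where a plain table would do a single primary-key lookup, which is exactly the overhead I’d worry about.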

Written by Adrian

March 4th, 2010 at 9:31 pm

Posted in Linkdump
