Netuality

Taming the big, bad, nasty websites

Linkdump: Cassandra lovers, blowing the circuit breaker and Oracle clouds

2 comments

Good points (as always) on Alexandru’s blog discussing the SQL scalability isn’t for everyone topic.

NoSQL as RDBMS are just tools for our job and there is nothing about the death of one of the other. But as we’ve learned over years, every new programming language is the death of all its precursors, every new programming paradigm is the death of everything that existed before and so on. The part that some seem to be missing or ignoring deliberately is that in most of these cases this death have never really happened.

For large-scale performance testing of a production environment check out how Facebook MySpace simulated 1 million concurrent users with a huge EC2 cluster, described on the High Scalability blog. While the article is a guest post from a company selling “cloud testing” solutions and has a bit of “sales juice” in it, it’s still a very good read:

Large-scale testing using EC2

Someone is in love with Cassandra after only 4 months. Hoping Cassandra doesn’t get too fat after the wedding:

Traditional sharding and replication with databases like MySQL and PostgreSQL have been shown to work even on the largest scale websites — but come at a large operational cost. Setting up replication for MySQL can be done quickly, but there are many issues you need to be aware of, such as slave replication lag. Sharding can be done once you reach write throughput limits, but you are almost always stuck writing your own sharding layer to fit how your data is created and operationally, it takes a lot of time to set everything up correctly. We skipped that step all together and added a couple hooks to make our data aggregation service siphon to both PostgreSQL and Cassandra for the initial integration.

Distributed data war stories from Anders @ bandwidth.com, HBase and Hadoop on commodity hardware:

As mentioned before, the commodity machines I used were very basic but I was able to insert conservatively about 500 records per second with this setup. I kept blowing the circuit breaker at the office as well forcing me to spread the machines across several power circuits but it proved that the system was at least fault tolerant!

SourceForge chooses Python, TurboGears and … MongoDB for a new version of their website. Looks like Mongo is becoming quite mainstream.

Don’t believe the rumors, Oracle is into cloud computing after all – at least according to Forrester. Well, as long as the clouds are private. And as long as you can live with “coming soon” tooling. And it’s not like they really have a clear long-term strategy for cloud computing:

I believe that cloud is a revolution for Oracle, IBM, SAP, and the other big vendors with direct sales forces (despite what they say). Cloud computing has the potential to undermine the account-management practices and pricing models these big companies are founded on. I think it will take years for each of the big vendors to adapt to cloud computing. Oracle is just beginning this journey; I think other vendors are further down the track.

The igvita blog hits NoSQL in the groin by showing a simple way of having a schema-free data store … in MySQL. It’s a sort of proxy that translates schemas into denormalized data placed in distinct tables:

Instead of defining columns on a table, each attribute has its own table (new tables are created on the fly), which means that we can add and remove attributes at will. In turn, performing a select simply means joining all of the tables on that individual key. To the client this is completely transparent, and while the proxy server does the actual work, this functionality could be easily extracted into a proper MySQL engine – I’m just surprised that no one has done so already.

While an interesting idea, not sure how effective this will be in practice, as joins are among the most time-consuming operations in the database world. I’m pretty sure that replacing a 10-column table get on the primary key with joins on 10 tables will add an important overhead.

Written by Adrian

March 4th, 2010 at 9:31 pm

Posted in Linkdump

Tagged with , , , ,

Linkdump: Cassandra @Twitter, Forrester not grokking NoSQL

one comment

Seven signs you need to accept NoSQL in your life according to the High Scalability blog. I especially like sign #6 “Maintaining a completely separate object caching system on top of an already beefy table storage system“. There are companies making serious bucks from selling exactly this type of caching systems. I find that a bit ironic, don’t you?

Twitter has just decided to adopt Cassandra as their main storage. I roughly estimated the status table to having  more than 9 billion rows – it’s a good table size to start thinking about the benefits of NoSQL. I would have been interested in seeing a comparison with other existing solutions and a rationale of their choice. According to some sources, Ryan King rejected HBase because if a region server is down, writes will be blocked for affected data until the data is redistributed – unlike Cassandra’s “write never fail” policy. According to other sources, this will be solved in a future version of HBase but I think Twitter needed a solution sooner rather than later. I hope for two things:

  • That the Twitter dudes will blog about their migration experience
  • That I’ll be able to access and search through all my older tweets, fer’ God sake!

Forrester Research thinks that NoSQL and Elastic Caching Platforms are very similar. So similar that “NoSQL Wants To Be Elastic Caching When It Grows Up“. According to Forrester “Ultimately, the real difference between NoSQL and elastic caching now may be in-memory versus persistent storage on disk.” Yeah sure: transactions, durability, indexing, security model – who needs this crap anyway?

Oh and let’s not forget about today’s GAE unscheduled downtime. Waiting forward for the post mortem, for sure there will be a thing or two to learn…

Written by Adrian

February 24th, 2010 at 11:18 pm

Posted in Linkdump

Tagged with , , ,

January 30 linkdump: cloud, cloud, cloud

leave a comment

Yes there is such a thing as cloud management services and Cloudkick has a business model around them:

The San Francisco company’s existing features — including a dashboard with an overview of your cloud infrastructure, email alerts, and graphs that you help you visualize data like bandwidth requirements — will always be free, said co-founder and chief executive Alex Polvi. But Cloudkick wants to charge for features on top of the basic service, such as SMS alerts when your app has problems and a change-log tool where sysadmins can communicate with each other, which Polvi described as “Twitter for servers.”

Great article on designing applications for the cloud from Godjo Adzic who spent his last two years in projects deployed on the Amazon cloud:

A very healthy way to look at this is that all your cloud applications will run on a bunch of cheap web servers. It’s healthy because planning for that in advance will help you keep your mental health when glitches occur, and it will also force you to design for machine failure upfront making the system more resilient.

Royans blog comments James Hamilton critical post about private clouds not being the future:

Though I believe in most of his comments, I’m not convinced with the generalization of the conclusions. In particular, what is the maximum number of servers one need to own, beyond which outsourcing will become a liability. I suspect this is not a very high number today, but will grow over time.

And a good detailed article about Hive used at Facebook:

Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for “Ad hoc analysis” which is free for all/most Facebook employees to use. And over time they figured out how to use it for spam detection, ad optimization and a host of other undocumented stuff.

Written by Adrian

January 30th, 2010 at 11:44 pm

Posted in Linkdump

Tagged with , , ,

January 23 linkdump: grids, BuddyPoke and the state of Internet

leave a comment

On Enterprise Storage a few experts look at grid computing and the future of cloud computing.

Can cloud computing succeed where grid failed and find widespread acceptance in enterprise data centers? And is there still room for grid computing in the brave new world of cloud computing? We asked some grid computing pioneers for their views on the issue.

[...]

And when it comes to IaaS [infrastructure as a service], I think in five years something like 80 to 90 percent of the computation we are doing could be cloud-based.

BuddyPoke cofounder Dave Westwood explains on the High Scalability blog how they achieved viral scale, Facebook viral scale to be more specific. BuddyPoke is today entirely hosted on GAE (Google AppEngine) and they some great insights and lessons learned.

On the surface BuddyPoke seems simple, but under hood there’s some intricate strategy going on. Minimizing costs while making it scale and perform is not obvious. Who does what, when, why and how takes some puzzling out. It’s certainly an approach a growing class of apps will find themselves using in the future.

Jamesh Varia from Amazon wrote a great Architecting for the Cloud: Best Practices [PDF] paper:

This paper is targeted towards cloud architects who are gearing up to move an enterprise-class application from a fixed physical environment to a virtualized cloud environment. The focus of this paper is to highlight concepts, principles and best practices in creating new cloud applications or migrating existing applications to the cloud.

The AWS cloud offers highly reliable pay-as-you-go infrastructure services. The AWS-specific tactics highlighted in the paper will help design cloud applications using these services. As a researcher, it is advised that you play with these commercial services, learn from the work of others, build on the top, enhance and further invent cloud computing.

The Pingdom guys have another fantastic post on their blog about the state of Internet in 2009:

  • 90 trillion – The number of emails sent on the Internet in 2009.
  • 92% – Peak spam levels late in the year.
  • 13.9% – The growth of Apache websites in 2009.
  • -22.1% – The growth of IIS websites in 2009.

These and more interesting statistics in their blog post.

Written by Adrian

January 23rd, 2010 at 1:20 pm

Posted in Linkdump

Tagged with , , ,

Google’s Map/Reduce patent and impact on Hadoop: none expected

leave a comment

From the GigaOm analysis:

Fortunately, for them, it seems unlikely that Google will take to the courts to enforce its new intellectual property. A big reason is that “map” and “reduce” functions have been part of parallel programming for decades, and vendors with deep pockets certainly could make arguments that Google didn’t invent MapReduce at all.

Should Hadoop come under fire, any defendants (or interveners like Yahoo and/or IBM) could have strong technical arguments over whether the open-source Hadoop even is an infringement. Then there is the question of money: Google has been making plenty of it without the patent, so why risk the legal and monetary consequences of losing any hypothetical lawsuit? Plus, Google supports Hadoop, which lets university students learn webscale programming (so they can become future Googlers) without getting access to Google’s proprietary MapReduce language.

[...]

A Google spokeswoman emailed this in response to our questions about why Google sought the patent, and whether or not Google would seek to enforce its patent rights, attributing it to Michelle Lee, Deputy General Counsel:

“Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops. While we do not comment about the use of this or any part of our portfolio, we feel that our behavior to date has been inline with our corporate values and priorities.”

From Ars Technica:

Hadoop isn’t the only open source project that uses MapReduce technology. As some readers may know, I’ve recently been experimenting with CouchDB, an open source database system that allows developers to perform queries with map and reduce functions. Another place where I’ve seen MapReduce is Nokia’s QtConcurrent framework, an extremely elegant parallel programming library for Qt desktop applications.

It’s unclear what Google’s patent will mean for all of these MapReduce adopters. Fortunately, Google does not have a history of aggressive patent enforcement. It’s certainly possible that the company obtained the patent for “defensive” purposes. Like virtually all major software companies, Google is frequently the target of patent lawsuits. Many companies in technical fields attempt to collect as many broad patents as they can so that they will have ammunition with which to retaliate when they are faced with patent infringement lawsuits.

Google’s MapReduce patent raises some troubling questions for software like Hadoop, but it looks unlikely that Google will assert the patent in the near future; Google itself uses Hadoop for its Code University program.

Even if Google takes the unlikely course of action and does decide to target Hadoop users with patent litigation, the company would face significant resistance from the open source project’s deep-pocketed backers—including IBM, which holds the industry’s largest patent arsenal.

Another dimension of this issue is the patent’s validity. On one hand, it’s unclear if taking age-old principles of functional software development and applying them to a cluster constitutes a patentable innovation.

Still nothing from the big analysts, Gartner and the gang…

Written by Adrian

January 22nd, 2010 at 7:39 pm

Posted in Articles

Tagged with , , , ,