On building large-scale data processing systems

I was reading a few blog posts about distributed, large-scale data processing, both batch and real-time. The move is definitely towards real-time now (here and here).

In this blog post I am only going to mention the things that I have come across so far. I would like to learn more.

All the buzz around large scale data processing, in some way or the other, seems to be inspired by papers published by Google or the systems they built. It is just fascinating!

Distributed file systems:

HDFS is just one of the many options available. Since it is a user-space file system, other such file systems also fit the criteria, viz. MapRFS, GlusterFS, Tahoe-LAFS and more.

Batch Processing:

Apache Hadoop is pretty much the most widely used tool here. However, many of the industry players have already moved on.
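To give a feel for the batch model behind Hadoop, here is a minimal single-process sketch of the map, shuffle and reduce steps using word count. It is plain Python, not Hadoop's API: the function names map_phase, shuffle, reduce_phase and word_count are made up for illustration, and a real cluster would spread each step across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data moves fast"]))
```

The point of the split is that the map and reduce functions only ever see one line or one key at a time, which is what lets a framework run them in parallel over a large input.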

Realtime Processing:

There are many tools to choose from here: Storm, Spark, Esper, S4, and HStreaming.
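The main difference from batch is that results are updated as each event arrives instead of after the whole input has been read. Below is a rough sketch of that idea as a sliding-window counter. It does not use the API of any of the tools above; the class SlidingWindowCounter and its methods are hypothetical, and it assumes events arrive in timestamp order, which real systems cannot always rely on.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count keys seen within the last `window_seconds` of event time."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # current count per key inside the window

    def add(self, timestamp, key):
        """Record one event and return the up-to-date counts."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)
        return dict(self.counts)

    def _expire(self, now):
        """Drop events that have fallen out of the window ending at `now`."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=10)
    for ts, key in [(0, "a"), (5, "b"), (12, "a"), (20, "c")]:
        print(ts, counter.add(ts, key))
```

Every call to add does a small, bounded amount of work, so the answer is always current; the price is that only the recent window is kept in memory.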

Machine learning, Data Mining and Analytics:

There are many free and open source tools to pick from: R, Python's SciPy package, Weka, Apache UIMA, Apache Mahout.

Hardware:

Finally, we need lots of hardware to run this infrastructure on.

Comments and suggestions welcome :)