Commenting on Software Technology Trends & Market Dynamics

Jnan Dash


Hadoop, the next ten years

I attended a meetup yesterday evening at the San Jose Convention Center on the subject "Apache Hadoop, the next 10 years," given by Doug Cutting, who created Hadoop while at Yahoo and now works at Cloudera. The venue was chosen because of the ongoing Strata+Hadoop conference there.

It's always fun listening to Doug recount how Hadoop got created in the first place. Inspired by Google's early papers on GFS (the Google File System) and the MapReduce computing model, he built distributed storage and processing into the Nutch web-crawler project; those pieces were later spun out as a new project named Hadoop (after his son's toy elephant). This all made sense as horizontal scaling on commodity hardware was coming to dominate the computing landscape. Every module in Hadoop was designed with the fundamental assumption that hardware failures are common and should be handled automatically by the framework. That was back in 2006. As an open-source project, Hadoop gained momentum with community support for the overall ecosystem. Over the next seven years we saw many additions and improvements such as YARN, HBase, Hive, Pig, and ZooKeeper. Hence Doug wanted to emphasize that there is a difference between Hadoop itself and the Hadoop ecosystem.
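For readers new to the model, here is a minimal single-process Python sketch of MapReduce's canonical word-count example (the function names and sample data are mine, for illustration). Hadoop runs these same map, shuffle, and reduce phases across a cluster, and re-runs a task on another node when hardware fails:

    # In-process sketch of the MapReduce model; Hadoop distributes
    # these phases across a cluster and handles failures automatically.
    from collections import defaultdict

    def map_phase(documents):
        # Map: emit a (word, 1) pair for every word in every record.
        for doc in documents:
            for word in doc.split():
                yield (word.lower(), 1)

    def shuffle(pairs):
        # Shuffle: group all emitted values by key, as the framework
        # does between the map and reduce phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: aggregate the grouped values per key.
        return {word: sum(counts) for word, counts in groups.items()}

    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(shuffle(map_phase(docs))))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}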

The original Hadoop, with its MapReduce computing, had its limitations, and lately Spark has been taking over the computing part. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It originated at UC Berkeley's AMPLab and is gaining momentum fast with its added libraries for machine learning, streaming, graph processing, and SQL. To a question from the audience, Doug replied that such enhancements are expected and more will come as the Apache Hadoop ecosystem grows. Cloudera has created Impala, a fast SQL query engine for Hadoop, to meet customer needs. Another key addition to the ecosystem is Kafka, which originated at LinkedIn. The Apache Kafka project is a message-broker service that aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. To another question, on whether some other general-purpose platform will replace Hadoop, Doug suggested that projects like Spark will keep appearing to handle parts of the ecosystem better, and that there will be many purpose-built tools, like Kafka, addressing specific needs. He eloquently praised the open-source philosophy, in which a community of developers makes faster progress than older companies like Oracle can in enhancing their DBMS software.
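To give a flavor of the Spark interface Doug alluded to, here is a rough PySpark sketch of the same word count (the local master and input file name are assumptions for illustration). Instead of writing separate map and reduce jobs, the whole pipeline is a chain of operations on a resilient distributed dataset, and lost partitions are recomputed from that chain's lineage when a node fails:

    # PySpark sketch; "docs.txt" and the local master are illustrative.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")        # run locally on all cores
    counts = (sc.textFile("docs.txt")                 # read input lines
                .flatMap(lambda line: line.split())   # map: emit words
                .map(lambda word: (word.lower(), 1))  # pair each word with 1
                .reduceByKey(lambda a, b: a + b))     # reduce: sum per word
    print(counts.collect())
    sc.stop()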

From the original Hadoop, meant for batch processing of large volumes of data on a distributed cluster, we are moving toward the real-time world of streaming analytics and instant insights. The popularity of Hadoop can be gauged by the growth in attendance at the San Jose Hadoop Summit: from 2,700 attendees in 2013, it more than doubled last year.
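Kafka, described above, is a typical gateway into that real-time world. Here is a hedged sketch of its publish/subscribe model using the third-party kafka-python client (the broker address, topic name, and event fields are assumptions for illustration): one process publishes events to a topic, and downstream analytics jobs consume the same feed with low latency.

    # Publish/subscribe sketch with kafka-python; broker and topic
    # names are assumed for this example.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish clickstream-style events to a topic.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("page-views", {"user": 42, "url": "/pricing"})
    producer.flush()

    # Consumer: an analytics job reads the same feed in near real time.
    consumer = KafkaConsumer(
        "page-views",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)   # e.g. {'user': 42, 'url': '/pricing'}
        break                  # stop after one message in this demo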

Doug is a good speaker, and his 40-minute talk was informative and entertaining.



About Jnan Dash

Jnan Dash is Senior Advisor at EZShield Inc., Advisor at ScaleDB, and a Board Member at Compassites Software Solutions. He has lived in Silicon Valley since 1979. Formerly he was the Chief Strategy Officer (Consulting) at Curl Inc.; before that he spent ten years at Oracle Corporation, where he was Group Vice President, Systems Architecture and Technology until 2002, responsible for setting Oracle's core database and application server product directions and working with customers worldwide to translate future needs into product plans. Before Oracle he spent 16 years at IBM. He blogs at http://jnandash.ulitzer.com.