This is the first in a two-part series about the Oracle big data strategy. It weighs the pros and cons of a big data platform built on commodity hardware vs. appliances.
Many organizations are grappling with the issue of big data as they look for ways to manage and analyze large collections of information. Two popular approaches have emerged for implementing a big data platform. The first is to build an infrastructure in-house that relies heavily on commodity hardware and open source software. The second, which is closer to the Oracle big data strategy, is to purchase an appliance that contains all the hardware and software necessary to implement big data. At the heart of either approach is Apache Hadoop, an open source software platform for storing and processing large volumes of data, much of it unstructured, across clusters of computers. Both approaches are designed around its architecture.
Hadoop and the commodity-based approach
Hadoop uses a master/slave architecture, typically implemented on clustered Linux computers. Each cluster contains a master node and multiple slave nodes. The master node coordinates the slave nodes and schedules the tasks that process the data stored on them. The slave nodes store the data files and serve read and write requests against them. Every cluster also has a name node, which manages the Hadoop Distributed File System (HDFS) namespace and regulates access to the data files. HDFS provides the structure necessary to store, manage and access the files. On smaller clusters, the name node often runs on the master node; larger clusters usually give it a dedicated machine.
An important component of Hadoop is the MapReduce framework. HDFS breaks large data sets into manageable blocks that can be distributed across thousands of computers and replicated within the cluster to ensure fault tolerance. The MapReduce framework works in conjunction with HDFS, running tasks on the slave nodes that hold those blocks and balancing the load among them to deliver efficient parallel operations, such as searching data, processing complex client requests and performing in-depth analyses.
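To make the map and reduce steps concrete, here is a minimal single-process sketch of the programming model in Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are hypothetical, chosen only to illustrate how work on separate blocks is mapped, grouped by key and then reduced.

```python
from collections import defaultdict
from itertools import chain


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(blocks):
    """Run map over every block, then shuffle, then reduce per key."""
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Each string stands in for one block stored on a different slave node.
blocks = ["big data on commodity hardware", "big data on an appliance"]
counts = word_count(blocks)
print(counts)
```

In a real cluster, each map call would run on the node that holds its block and the shuffle would move data across the network; this sketch runs everything in one process purely to show the flow.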
Hadoop, HDFS and the MapReduce framework were designed with commodity hardware in mind. This hardware is usually made up of PC-based computers whose components share a common architecture, adhere to open standards and contain compatible interfaces, making it easy to interchange one system with another. Not surprisingly, these computers cost significantly less than the high-end custom servers often touted for big data platforms, a fact that may not be in line with an Oracle big data strategy seeking to sell you big, expensive server hardware.
Implementing Hadoop on commodity hardware can seem to be an appealing option to the organization looking to save money. The commodity approach has already been proven at scale. Yahoo runs Hadoop on large commodity clusters, and Google, whose MapReduce and Google File System designs inspired Hadoop, has reported processing over 20 petabytes of data every day on commodity hardware. Building a big data platform on Hadoop, however, involves a lot more than downloading software and setting up a bunch of computers. A big data project must take into account the resources and time necessary to configure Hadoop, implement supporting software, set up the network infrastructure, connect Hadoop to other systems, develop custom code, implement analytics, and manage the system and data on an ongoing basis. For all that, you need a significant level of expertise from professionals who understand the subtleties and pitfalls of big data. You also need a fair amount of time and money.
The rise of the big data appliance
Despite the easy access to commodity hardware and open source software, implementing a big data platform is no small task. The Oracle big data strategy seeks to lure organizations into buying proprietary systems, such as the Oracle Big Data Appliance, which contains all the hardware and software necessary to implement big data. But an appliance is not just a pile of servers and shrink-wrapped software. The components are tightly bundled into a preconfigured package that's fully tested and ready to go upon delivery.
The Oracle Big Data Appliance, for example, comes with a rack of 18 Sun servers that act as compute and storage nodes, multiple InfiniBand switches and cables, redundant power distribution units, and spare storage disks. On the software side, the appliance is set up with Oracle Enterprise Linux, Cloudera's Distribution Including Apache Hadoop, Cloudera Manager, Oracle NoSQL Database, Oracle R and Oracle Big Data Connectors. Oracle's appliance provides more than 600 terabytes of raw storage capacity, preconfigured with a Hadoop cluster ready to accept unstructured data. Premium support is available for both hardware and software maintenance.
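Raw capacity is not the same as the amount of data a Hadoop cluster can hold, because HDFS stores several copies of every block. The short sketch below estimates usable capacity under two assumptions that are not from Oracle's specifications: a replication factor of 3, which is the stock HDFS default, and a round 600 terabytes of raw storage. It ignores space needed for the operating system, temporary files and headroom, so treat the result as an upper bound.

```python
def usable_capacity_tb(raw_tb, replication_factor=3):
    """Rough upper bound on unique data the cluster can hold.

    Assumes every block is stored replication_factor times and that
    nothing else consumes disk space (an optimistic simplification).
    """
    return raw_tb / replication_factor


# 600 TB is the article's "more than 600 terabytes", rounded down.
usable = usable_capacity_tb(600)
print(f"Approximate usable capacity: {usable:.0f} TB")
```

Under those assumptions, roughly a third of the raw figure is available for unique data, which is worth keeping in mind when comparing an appliance's headline capacity with the size of the data sets you expect to store.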
For organizations looking to implement big data as quickly and easily as possible, appliances can seem the perfect answer. But they don't come cheap. For instance, the full Oracle Big Data Appliance costs $450,000, plus $54,000 per year for system support and another $36,000 per year for operating system support. You can buy a lot of commodity hardware for that kind of money. But for the organization that doesn't have the in-house resources and expertise necessary to build its own system, an appliance can turn out to be quite the bargain.
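The figures above add up quickly once the annual fees are counted over several years. The following sketch totals the purchase price and the two yearly support charges quoted in this article; the function name and the 1-, 3- and 5-year horizons are illustrative choices, and the calculation deliberately leaves out staffing, power, floor space and any discounts.

```python
APPLIANCE_PRICE = 450_000          # one-time purchase price (from the article)
SYSTEM_SUPPORT_PER_YEAR = 54_000   # annual system support (from the article)
OS_SUPPORT_PER_YEAR = 36_000       # annual operating system support (from the article)


def appliance_cost(years):
    """Purchase price plus both support fees for the given number of years."""
    annual_fees = SYSTEM_SUPPORT_PER_YEAR + OS_SUPPORT_PER_YEAR
    return APPLIANCE_PRICE + years * annual_fees


for years in (1, 3, 5):
    print(f"{years} year(s): ${appliance_cost(years):,}")
```

A comparable estimate for a commodity build would need its own line items for hardware, integration work and in-house expertise, which is where the comparison usually gets difficult.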
Appliance vs. commodity hardware
Both commodity-based and appliance-based platforms offer their own advantages and disadvantages, and choosing one over the other is rarely an easy decision. You must take into account not only the costs of implementing and managing a big data platform, but also the long-range expectations of such a system. In the second article of this two-part series, we'll dig deeper into several issues you should consider when you're deciding whether to build or buy.
About the author:
Robert Sheldon is a technical consultant and the author of numerous books, articles and training material related to Microsoft Windows, various relational database management systems, and business intelligence design and implementation.