In February, Oracle released its latest big data offering, Oracle Big Data Discovery. Touted as an end-to-end solution for retrieving, transforming and analyzing Hadoop data, Big Data Discovery delivers an all-in-one package that lets users find relevant data from Hadoop clusters, explore the data to understand its potential and transform the data to cleanse and enrich it. From this foundation, users can analyze the data to gain new insights and make strategic decisions, and share their results with team members for collaboration and further analysis.
Oracle introduced Big Data Discovery to address what the company perceived as a big data mess, with organizations struggling to manage a morass of information and glean meaningful insights, often without a clear idea of how to get started, let alone move forward. Adding to the challenge were the limitations of traditional business intelligence tools geared toward well-defined relational structures, but lacking the capacity to deal effectively with big data analytics.
Getting to know Big Data Discovery
According to Oracle, Big Data Discovery addresses these limitations and opens up the discovery process to both business analysts and data scientists. Big Data Discovery promises to accelerate the analytical process so users can spend less time preparing the data and more time analyzing it. With its focus on Hadoop, Big Data Discovery aims to remove the technical barriers between raw data and analysis, providing what Oracle calls the "visual face of Hadoop."
Oracle has engineered Big Data Discovery to support visual analytic capabilities without users having to learn complex processes or rely on specialized expertise. Big Data Discovery lets users visualize attributes by data type and easily determine which ones are the most relevant to their analytics. Users can then sort these attributes according to their specific needs to prioritize information. Big Data Discovery provides an interactive catalog for finding data, viewing summaries of data sets and exploring data through easy-to-use search and navigation features.
According to Oracle, analysts will be able to question the data and get answers as easily as shopping online. The interface provides self-service wizards, supports drag-and-drop operability and includes a number of other features to help turn raw data into rich, interactive visualizations and dashboards. Big Data Discovery also fits neatly into Oracle's big data architecture, integrating with tools such as Oracle R, Oracle Exadata and Oracle Big Data SQL.
Five steps to data discovery
Oracle breaks the Big Data Discovery analytical process into five basic steps: find, explore, transform, discover and share.
To find the information they need, analysts can use an interactive catalog to access the raw data in Hadoop without having to understand its underlying structure. Instead, they need only focus on getting at the types of information they require for their analytics. The catalog organizes data into visual data sets, such as weblogs, customer snapshots or brand loyalty surveys. In this way, analysts can identify the categories of information they need and drill down into the details of that data.
The explore step is the process of drilling into the data. Each data set is broken down into attributes that can be visually sorted and combined to better understand how they might relate. Analysts can organize data by its potential, moving the most interesting attributes to the top of the heap or experimenting with different combinations of attributes. The explore step helps analysts quickly understand the data's quality and key elements to determine its overall potential.
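To make the idea of ranking attributes by potential concrete, here is a minimal Python sketch. It is not Big Data Discovery code; the sample records, the profile function and the ranking rule (fill rate first, then number of distinct values) are all assumptions chosen for illustration.

```python
# Hypothetical sample of weblog-style records; None marks a missing value.
records = [
    {"country": "US", "browser": "Chrome", "referrer": None},
    {"country": "US", "browser": "Firefox", "referrer": None},
    {"country": "FR", "browser": "Chrome", "referrer": "news"},
    {"country": None, "browser": "Safari", "referrer": None},
]


def profile(rows):
    """Summarize each attribute by fill rate and distinct-value count,
    then rank the best-populated, most varied attributes first."""
    attributes = sorted({key for row in rows for key in row})
    summaries = []
    for name in attributes:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        summaries.append({
            "attribute": name,
            "fill_rate": len(present) / len(values),
            "distinct": len(set(present)),
        })
    # Assumed ranking rule: higher fill rate wins, ties broken by variety.
    return sorted(summaries,
                  key=lambda s: (s["fill_rate"], s["distinct"]),
                  reverse=True)


ranked = profile(records)
for summary in ranked:
    print(summary)
```

In this sample, the sparsely populated referrer attribute sinks to the bottom, which is the kind of quick quality signal the explore step is meant to surface.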
The transform step lets users change data through an extensive library of transformations and enrichments. For example, users can cleanse data by normalizing or grouping values. Big Data Discovery provides a spreadsheet-like interface for defining how data should be transformed. In addition, users can enrich data by applying features such as inferring language or location or detecting topics or themes. Big Data Discovery handles all transformation operations natively.
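As a rough illustration of what cleansing by normalizing or grouping values means, consider the Python sketch below. It does not use any Big Data Discovery function; the GROUPS mapping, the helper names and the sample values are hypothetical.

```python
from collections import Counter

# Hypothetical lookup that groups variant spellings under one label.
GROUPS = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
}


def normalize(value: str) -> str:
    """Trim the ends, collapse repeated whitespace and lowercase the value."""
    return " ".join(value.split()).lower()


def group(value: str) -> str:
    """Map a normalized value to its group label, or title-case it if unknown."""
    key = normalize(value)
    return GROUPS.get(key, key.title())


raw = ["USA", " usa ", "U.S.", "United  States", "uk", "france"]
cleaned = [group(v) for v in raw]
counts = Counter(cleaned)
print(counts)
```

After the transformation, the four spellings of the same country collapse into a single value, so counts and charts built on the attribute are no longer fragmented.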
Discovery is where the user joins or blends data into dashboards and visualizations that can range from tables to detailed maps. This interface includes a search tool that allows users to find patterns in data, as well as a navigation feature that walks them through search results. At any time, analysts can add or relate more data to help expand the results, or apply additional filters to better refine data.
The final step is sharing the results with the rest of the team in order to collaborate on the project. Users can also share their bookmarks and galleries of snapshots for further analysis. In addition, they can publish transformed data back to Hadoop to be used by such products as Oracle R or Big Data SQL.
Behind the scenes with Big Data Discovery
Big Data Discovery is made up of three primary components -- Studio, Dgraph and Data Processing -- that together with Hadoop clusters provide a complete data solution. On the Hadoop side, Big Data Discovery requires Cloudera Distribution for Hadoop (CDH), which includes a number of components that support Big Data Discovery functions, such as Cloudera Manager, ZooKeeper and Spark.
The Studio component of Big Data Discovery is the front-end Web application that provides users with access to the Hadoop data. Studio includes all the features necessary for analysts to find, explore, transform, discover and share data. It's a Java-based application that can run across several nodes to support load-balancing and high availability. Big Data Discovery stores most of the Studio project and configuration data in a relational database.
Studio communicates with Dgraph, which then routes requests to the Hadoop clusters. The Dgraph component also handles caching and business logic. Like Studio, Dgraph can run on a single node or in a cluster, using CDH ZooKeeper to handle cluster services. For each data set that Big Data Discovery finds, it loads records and their schemas into Dgraph.
The final piece of the Big Data Discovery puzzle is the Data Processing component, a set of processes and jobs that do much of the heavy lifting, such as sampling, profiling and enriching data. Many of these processes run directly on the Hadoop nodes, relying on Spark to run all Data Processing jobs. One important Data Processing component is Hive Table Detector, which monitors the Hive database for new or deleted tables. If the Detector discovers a change, it launches a Data Processing workflow.
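The detect-and-launch pattern described for Hive Table Detector can be sketched in a few lines of Python. This is not Oracle's implementation: the function names, the "provision" and "cleanup" workflow labels and the table names are assumptions, and a real detector would read the table list from Hive rather than from hard-coded sets.

```python
from __future__ import annotations


def detect_changes(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Compare two snapshots of table names and report what changed."""
    return {
        "added": sorted(current - previous),
        "deleted": sorted(previous - current),
    }


def on_poll(previous: set[str], current: set[str], launch_workflow) -> set[str]:
    """Launch one (hypothetical) workflow per changed table, then
    return the current snapshot to use as the baseline for the next poll."""
    changes = detect_changes(previous, current)
    for table in changes["added"]:
        launch_workflow("provision", table)
    for table in changes["deleted"]:
        launch_workflow("cleanup", table)
    return current


# Usage with made-up table names; launched workflows are recorded in a list.
launched = []
baseline = {"weblogs", "customers"}
latest = {"weblogs", "surveys"}
baseline = on_poll(baseline, latest,
                   lambda kind, table: launched.append((kind, table)))
print(launched)
```

Comparing snapshots as sets keeps the check simple: anything in the new snapshot but not the old one is a new table, and the reverse is a deleted one.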
Big data continues to make headlines, and that data grows every day. Yet organizations have not known quite what to do with it, running into one roadblock after the next. Although Big Data Discovery promises to change that, it's still in its infancy, so it has not undergone the type of field testing that could expose its true capabilities. No doubt it will excel in some areas and need revision in others. Even so, it points to the fact that big data remains serious business, and we'll likely see other all-in-one products appear on the horizon before long.