
The problem with petabytes

What challenges will storing petabytes of data present to companies? This tip tells you what to expect.

The database sector is as guilty as any portion of the IT market at buying into its own hype. Yet there may be a legitimate cause for excitement -- and trepidation -- about the imminent arrival of petabyte-sized databases. A petabyte is a daunting number -- 1,024 terabytes, to be exact -- but with data stores growing exponentially, it will be a reality for many enterprises in the near future.

While the world waits for the petabyte era to arrive, database administrators are surely already wondering about the unique challenges posed by such huge amounts of data. Most DBAs struggle with managing and analyzing even the smaller amounts of information their organizations have today.

Luckily, we've got industry experts like Kevin Strange, vice president and research director at Stamford, Conn.-based Gartner, to help us sort through the critical issues and challenges. Strange recently got on the horn with TechTarget to create a roadmap for dealing with petabytes of data going forward.

TechTarget: What are some of the biggest challenges to adoption of petabyte-range database technology at this point?
Strange: You have to separate different areas such as online transaction processing (OLTP) and data warehouse decision support. There are different sets of challenges.

Dealing with OLTP, the whole idea of manageability is still a challenge. Vendors have worked on improving their databases to support the higher end, but when you start talking about petabytes and OLTP types of implementations, we're probably years away before we're even going to see that. The whole idea with manageability is to be able to deal with a smaller piece of the picture, and you'll end up with a huge amount of these pieces if you're looking at petabytes.

In the data warehouse arena, the real challenge there is optimizing queries that go against that size of a database. Right now, organizations are implementing terabyte size architecture, and the vendors are struggling to optimize queries against that size. So, there's a manageability and performance issue on the decision-support side of the house. I'm not sure that you're going to see petabyte-sized databases in the data warehouse arena for five years. Anyone who says that they're doing this today is pulling your leg.

TechTarget: So most people haven't even conquered the terabyte range yet?
Strange: When you start looking at hundred-terabyte-size implementations, what we're seeing today is not really 100 terabytes of data; we're seeing 100 terabytes of disk. When you look at the ratio that the database vendors require, for example, with Oracle, for every terabyte of data you have, you're looking at 4 to 5 terabytes of disk space to support that. And that's not including mirroring or any free space you have.

If you look under the covers at those organizations that are talking about 100-terabyte implementations, you're going to find that they are significantly smaller. The 100-terabyte implementations we read about today have been total disk on the server. If you mirror it, you're already down to 50 terabytes right there. I don't think that organizations really have 100 terabytes to deal with in a data warehouse-type implementation. The vendors aren't ready for it either.
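To make Strange's arithmetic concrete, here is a rough sketch of how much actual data might sit behind a quoted disk figure. It assumes simple two-way mirroring (which halves usable capacity) and applies the 4-to-5x disk-to-data ratio he cites; it ignores free space, and the function and parameter names are illustrative only.

```python
def estimated_data_tb(total_disk_tb, disk_per_data_tb, mirrored=True):
    """Estimate terabytes of actual data behind a quoted total-disk figure.

    Assumptions (not from any vendor sizing guide):
    - two-way mirroring halves the usable disk
    - disk_per_data_tb is the disk needed per terabyte of data (4 to 5 in the interview)
    - free space is ignored
    """
    usable_tb = total_disk_tb / 2 if mirrored else total_disk_tb
    return usable_tb / disk_per_data_tb


# A "100-terabyte" implementation measured as total disk on the server
for ratio in (4, 5):
    print(f"ratio {ratio}:1 -> about {estimated_data_tb(100, ratio)} TB of data")
```

Under these assumptions, a mirrored "100-terabyte" system holds roughly 10 to 12.5 terabytes of data, which is consistent with Strange's point that such implementations are significantly smaller than advertised.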

TechTarget: Of all the database vendors, who is furthest along?
Strange: It's probably Teradata. They're focused on this. Typically a data warehouse is going to be an order of magnitude larger than a common OLTP application within an organization. You might have 50-60 different sources of data that you're supporting in a transactional environment because the data warehouse by definition is pulling data from multiple sources in the historical information.

The folks at Teradata focused on these sizes for a lot longer timeframe than the other database vendors. The others spent their time focusing on transactions per second in the OLTP world.

TechTarget: Who else is even in the game?
Strange: Probably IBM with DB2. They're not far behind, and it's all relative. They're definitely closer to Teradata than Oracle or Sybase may be. It's a big challenge. The ability to put the data in is one major challenge. The other part is achieving any kind of performance and getting the data back out to do something with it. When users are expressing very complex analysis to the database to analyze data, that's a second side of the challenge for the vendors to address.

TechTarget: How will database administrators deal with issues around extent sizing and table design? Can we even define the problems there yet?
Strange: When you look at some of the multiterabyte-size architectures that are out there, you have one of two choices from a table-structure data-model perspective. You can do a heavily denormalized type of technique, whether it be a star schema or the classic big wide table. That does one thing for you; it alleviates some of the performance issues around trying to query that data. There is a down side to that, though. When you build a data model to the query that you're going to issue, you reduce your flexibility. Today the world is very dynamic, and how you analyze the data may be very different from how you'll be doing it tomorrow. Things are changing all the time, and that's one of the challenges. You can address the performance issue with these very large databases, but it's going to restrict your agility down the road.

The other choice is to do a more normalized type of data modeling technique that the relational zealots have advocated for years. That's where the additional challenge besides size comes in, being able to put it in and manage it. The additional challenge is having the performance. When you start having a more normalized technique to provide agility in how you do analysis, it gets very complex. The database software has got to have a lot more innovation and maturity to address the hundred-terabyte and petabyte databases of tomorrow.
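The trade-off between the two modeling choices can be illustrated with a toy example. The sketch below uses Python's built-in sqlite3 module and hypothetical table names (sales_wide, store, product, sale); it is not drawn from the interview and says nothing about behavior at terabyte scale. It only shows that the wide table answers a known question without joins, while the normalized model needs a join but keeps each fact in one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized: one big wide table shaped for a known query
CREATE TABLE sales_wide (
    sale_id INTEGER PRIMARY KEY,
    store_name TEXT, region TEXT, product_name TEXT, amount REAL
);
-- Normalized: each fact stored once, related by keys
CREATE TABLE store (store_id INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sale (
    sale_id INTEGER PRIMARY KEY,
    store_id INTEGER REFERENCES store(store_id),
    product_id INTEGER REFERENCES product(product_id),
    amount REAL
);
""")

# Hypothetical sample data, loaded into both models
conn.executemany("INSERT INTO store VALUES (?, ?, ?)",
                 [(1, "Downtown", "East"), (2, "Harbor", "West")])
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 10.0), (2, 1, 2, 20.0), (3, 2, 1, 5.0)])
conn.executemany("INSERT INTO sales_wide VALUES (?, ?, ?, ?, ?)",
                 [(1, "Downtown", "East", "Widget", 10.0),
                  (2, "Downtown", "East", "Gadget", 20.0),
                  (3, "Harbor", "West", "Widget", 5.0)])

# Wide table: no joins needed for the query it was designed around
wide_totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales_wide GROUP BY region ORDER BY region"
).fetchall()

# Normalized model: same answer, but the database must join tables
normalized_totals = conn.execute("""
    SELECT st.region, SUM(s.amount)
    FROM sale AS s JOIN store AS st ON st.store_id = s.store_id
    GROUP BY st.region ORDER BY st.region
""").fetchall()

print(wide_totals)
print(normalized_totals)
```

Both queries return the same regional totals. The difference shows up when the question changes: moving a store to another region is a one-row update in the normalized model, whereas the wide table repeats the region on every sale row.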

TechTarget: What's driving the increase in data to the petabyte level?
Strange: Two things are to be considered. First off, we're collecting more data than ever before in enterprises. When you think about some of the CRM strategies like sales force automation, there's a ton of customer information changing hands. There's a greater thirst for this kind of detailed information than ever before. We're always looking for more information to drive the databases.

Another aspect to consider is all the unstructured data, including imaging and text systems, coming down the road. We started hearing about this back in 1997. The relational databases were going to be able to store and manage unstructured data types like images, etc. Some steps have been made in terms of storing them, but the key remains how you analyze the text and the images right along with the structured business data, and that is where things didn't pan out. The idea of a data warehouse moving to an information warehouse never happened. But five years down the road, as some of these text-mining tools start to incorporate capabilities that go beyond the structured business data, you could find that text and unstructured types will start to be incorporated as well, and that will drive it.

TechTarget: Say that you're speaking with a database administrator working at a company where the CEO or CIO has become obsessed with the idea of petabyte databases. With the lack of knowledge out there, what would be your advice to someone in this position?
Strange: Well, the first part is always going to be planning and choosing the right technology. When it comes to data warehouses and business intelligence applications, it's not strictly a technology aspect. It's 50-50 business and technology. The problem is if you don't select the right technology, and you're not careful in your planning and design, you're doomed to failure right from the start.

TechTarget: Who are the organizations most likely to be early adopters?
Strange: Retail should be right there; they were early adopters and sometimes very successful as we moved into enterprise applications. Financial institutions are another natural fit. Conglomerates like Citigroup will want to look across all their lines of business, and that's another significant driver. With insurance, traditional banking and brokerage houses all under one roof, there's a tremendous appetite for capturing and analyzing massive amounts of customer information.

TechTarget: When do we actually see the applications themselves arriving?
Strange: We're talking 5-10 years out. The challenge you have here in the arrival of these technologies is that the vendors typically have their marketing hype 3-4 years ahead of the reality.



This was last published in June 2002
