The exact time when data is acquired is essential for building an effective data warehouse for clickstream analysis
and other purposes. To do this, your servers need to be synchronized--however, computer clocks just can't measure up. Check out these alternatives from Ralph Kimball and Richard Merz's book The Data Webhouse Toolkit (Wiley Computer Publishing, 2000):
The Web-enabled data warehouse collects timestamped event information from many different sources both within and outside the enterprise. Web responses are likely to be delivered by a loosely coupled network of servers and proxies. Analyzing user behavior depends on knowing exactly when an event occurred relative to other events. This requires maintaining a precise time standard across all data sources.
Each computer and program involved needs to know what time it is, not to the nearest minute or even to the nearest second, but to subsecond accuracy. This accuracy level is essential in order to properly merge the sequence of user events within a session. For analytical purposes we need to assume that timestamps are accurate to a tenth of a second. We should strive for 100-millisecond accuracy, relative to an absolute standard time. Without setting the goal of this precision, its likely that the various log records we collect will be at least several seconds out of synchronicity, making the required timing analysis impossible.
Maintaining such clock accuracy across diverse systems is a problem that is well understood in military and aviation systems, and even in some financial applications like securities trading systems (which require 3-second accuracy). Subsecond time precision has not, until now, been a normal concern of commercial IT departments. As the implementer of a Web-enabled data warehouse, you may well meet with skepticism when trying to convince an operations manager that such accuracy is not only necessary but is possible.
Time synchronization tools and techniques
Most computers contain a hardware clock to maintain system time. At their heart, computer clocks rely on a quartz crystal exactly like those used in quartz watches. And like quartz watches, computer clocks vary widely in accuracy and their ability to provide consistently accurate time. There aren't any standards for computer clock accuracy, so there's no way of knowing how accurate any given computer clock is likely to be.
In our experience there's little correlation between computer price and clock accuracy. Quartz clock crystals are quite sensitive to temperature variations. Very high-quality quartz clocks enclose their crystal in a constant-temperature oven to avoid temperature-induced drift. Wristwatch quartz crystals are calibrated to an "oven" temperature of 98.6 degrees. Unfortunately, you are unlikely to encounter this technology in a computer clock, which can drift several seconds or more every day. We need to come up with methods for compensating for the inevitable inaccuracies of computer clocks.
There are three steps in achieving time synchronization across the enterprise.
- Reduce drift of each individual clock, relative to standard time.
- Synchronize all of the clocks in a system to a single system master clock.
- Synchronize the system master clock to a primary reference source.
The first and second steps can be achieved with a software solution. There are both commercial and shareware packages available to synchronize the clocks of computers within a network to primary reference source. An excellent source of technical information is the documentation for the Internet Network Time Protocol (NTP). The best time synchronization software computes the drift of each dependent machine relative to a primary reference source, and then establishes a correction factor that is applied frequently to each individual computer to keep its clock "nudged" into synchronization with the system selected as the master. Frequent small nudges are greatly preferable to larger ones since they minimize cumulative drift and compensate for different drift rates among different machines.
The final task is to keep the master CPU synchronized with a primary reference source like the National Bureau of Standards. This has become quite easy in the last few years. There are several sources of accurate standard time. First, the National Bureau of Standards and other standards bodies broadcast coded time over ultra-low-frequency (ULF) radio signals. These can be picked up by special receivers designed for the purpose and can be queried by the master computer in a network. These receivers are both accurate and quite expensive.
Second, time can be obtained from the Internet itself by querying a standards organization like the U.S. Naval Observatory or the Bureau of Standards. However because the network latency time varies and is unpredictable, this method isn't recommended.
The third, and most satisfactory way to obtain accurate time is from the satellites of the Global Positioning System (GPS). GPS time signals accurate to within a microsecond are now obtainable from inexpensive GPS clocks that have standard RS-232 serial ports for direct computer connection. We recommend this method of obtaining the master, benchmark time for your computer system. It is inexpensive and precise. GPS clocks are also useful for systems that, because of location or platform peculiarities, can't be serviced by standard time synchronization software.
For More Information
- The Best Data Warehouse Design and Development Links
- The Best Web-Based Warehouse Links
- Do you have any technical questions about data warehouse design? Post them--or help out your peers by answering them--in our live discussion forums