Q
Problem solve Get help with specific problems with your technologies, process and projects.

# Hashing functions explained

What is meant by "hashing function"? How is it used in hash data clusters and hash joins?

A hashing function is typically a mathematical function which takes a key value and determines a position in an array. For instance, we might have a table which contains employee names. If we needed to find out if an employee is in the table, we would have to search the entire table. It would be nice if we had an idea of where that employee record might be. So a hash function can be used to determine, based on the employee's name, the position in the array, or table. Check the array in that position and see if the record is there.

Hash functions have a big science behind it because as you may already be guessing, two different keys can "hash" to the same position. If this occurs, then a collision has happened. How do you handle the collision? Hashing also leads to lots of wasted space where no keys can map, or hash to. So this can be quite a topic.

How does hashing relate to hash joins? In hash joins, we will be joining two tables. We read the first table and apply a hash function to each record in the table. These records are then stored in a number of buckets. We then take the records from the second table and apply the same hash function. This will tell us which bucket to look in. Is there a row in this bucket from the first table to join to the row in the second table? If so, it will be in that bucket. This is the basic concept behind hash joins.

When I went to graduate school, I wrote my master's thesis on an extension to the hash join. You can view a copy of this thesis on my old school's Web site (http://www.cs.ndsu.nodak.edu/~peasland/paper.doc). If you are interested in learning about hash joins, then read the first few chapters and they give nice diagrams on how hash joins work.

Close