Problem solve Get help with specific problems with your technologies, process and projects.

Using SQL to calculate nearby ZIP codes

This tip shows how to use SQL to calculate nearby ZIP codes.

Any type of application that involves names and addresses inevitably involves ZIP codes as well. If you've built...

the data backend for such an application, you've probably dealt with databases containing ZIP codes by city/state or even by specific street address. But have you ever needed to find ZIP codes within a certain distance of a given ZIP code, city, or street address? Maybe your company needs a "find a store near you" function for their web site. Or maybe your marketing department has asked to generate targeted mailing lists from ZIP code-based customer demographic information from the corporate database. Where do you start when you need to calculate the distance between two ZIP codes, cities, or other geographical features? If you're like me, you start with spherical coordinates and a lesson in trigonometry! Don't worry, if you want to skip to the end, I promise you'll get all the information you need without the gory details.

The Data

The very first ingredient for building a ZIP code-oriented application is the appropriate data. At a minimum, we'll need the ZIP code and a set of latitude and longitude values for that ZIP code. This information is available from a number of commercial vendors, which provide a lot of added value in the form of demographic information and other location-based information garnered from both government and private source. A short search of the web turned up over a dozen vendors with prices ranging from the low hundreds to well over a thousand. Another alternative is to choose a completely free data set. The best one I found was part of the information available from the U.S. Census Bureau. The downside is that it is based on the 1990 census, only included 5-digit ZIP codes, and has limited additional information that is completely out of date. But it gives us a functional database for building this application. The database can be downloaded in text format from this site.

One crucial factor to determine about the latitude and longitude information is whether it is in degrees/minutes/seconds format or in the decimal equivalent--the decimal equivalent is what you want. The quick way to translate is

Decimal value = degrees + (minutes/60) + (seconds/3600)

The Formula

Once you've got the ZIP code and latitude/longitude data, we've got the spherical coordinates to describe a location. The next step is to determine the distance between any two locations using the spherical coordinates. You'd think this would be a straightforward process, but in reality it is fairly complex to come up with an extremely precise distance between two points on the surface of the earth. For an extremely good discussion of the problem, and for the source I used to build my code, you can refer to this document, which discusses the details of what is basically the problem aircraft pilots face every day. For our purposes of calculating the distance between any two ZIP codes in the US, we can settle for the accuracy of the Haversine formula:

Distance = 2R *arcsin (x)
where x=(sin((lat2-lat1)/2))^2 + cos(lat1) * cos(lat2) * (sin((long2-long1)/2))^2.

It is crucial to realize that the trigonometric functions of most databases are expecting arguments to be in radians instead of degrees. The best approach is to pre-calculate the latitude and longitude values for each location in the database in terms of radians so that the SQL stored procedure will have that many less calculations to perform. The formula for the conversion is

radians = 57.295*degrees (or 180/pi * degrees)

The Fine Print

The algorithm presented here works extremely well for calculating the distance between two ZIP codes in the US (or cities or other locations). But if you read the full discussion about the various ways to calculate the distance between two points on the surface of the Earth (http://www.census.gov/cgi-bin/geo/gisfaq?Q5.1), you'll notice that the Haversine formula does not work perfectly for all possibilities. As the mathematically-inclined would say, it "blows up" under certain circumstances which results in an undefined mathematical value and thus a runtime error in the calculation. For the purposes of ZIP codes however, there's no problem since the distances involved are well shy of the danger points. It's just important to realize that you'd need to use a more sophisticated approach to calculate the distance between any two arbitrary points on the surface of the Earth. As the article shows, there are simpler formulas, but for ZIP code calculations, the Haversine formula is a good tradeoff between accuracy and performance.

It is also important to understand a little bit about how ZIP codes are distributed on the map of the US. It would be nice to think that they are all regularly-sized polygons that reproducibly blanket the country and are easily related to the latitude and longitude...but of course real life is never that easy. ZIP codes are basically ways to describe postal routes. Many of them represent squares or ovals or other regularly-shaped polygons with an easily-defined center, but some, especially in rural areas, are shaped in an arc along a road or highway. Keep in mind that each database of ZIP codes and latitude/longitude values makes different assumptions about how to assign a physical location to a ZIP code--the value for any given ZIP code will vary the equivalent of a few blocks to a few miles depending on the shape of the ZIP code. This means that for closely spaced ZIP codes, the distances may be off by a large relative percentage. For most applications, this will not matter, but it is an important limitation to consider.

The Bottom Line

The goal of this article is to create a stored procedure to determine all of the ZIP codes within a certain radius of a starting ZIP code. We'll use the Haversine formula and a database of latitude/longitude and ZIP codes to make the calculation. I implemented this in SQL Server 7.0, but any database should be fine as long as it has the arcsin, cosine, and sine trigonometric functions and the ability to do square roots and raise values to a power (Microsoft Access is out, but MySQL, Oracle, and DB2 certainly work).

``` CREATE PROCEDURE zip_nearby_locations @lat1 float, @long1 float, @radius float AS SELECT (2*3956*ASIN( (SQRT( POWER(SIN((([email protected]))/2),2) + COS(@lat1) * COS(rlat) * POWER(SIN(((abs(rlong)[email protected]))/2),2) )) ) ) AS dist, location, zipcode FROM Zipcodes WHERE (2*3956*ASIN( (SQRT( POWER(SIN((([email protected]))/2),2) + COS(@lat1) * COS(rlat) * POWER(SIN(((abs(rlong)[email protected]))/2),2) )) ) ) < @radius ORDER BY dist GO
```

In this stored procedure, R (radius of the earth) is 3956 miles so our result comes out in miles. The fields rlat and rlong are the latitude and longitude in radians. This procedure basically calculates the distance between the given ZIP and every ZIP in the database and numerically sorts the ones within the given radius.

As an example, try determining the number of ZIPs within 1 mile of the White House (ZIP 20001). Using the public domain database from the Census Bureau, I got the following 6 results:

``` ZIP Location Distance 20001 WASHINGTON, DC 0.0 20405 ZIP CODE 20405 0.52719256096716194 20060 ZIP CODE 20060 0.70030985977018034 20005 WASHINGTON, DC 0.7223858318763472 20552 ZIP CODE 20552 0.75187311639892684 20038 WASHINGTON, DC 0.840488509229948
```

If you get different results, first check whether the trigonometric functions in your database expect radians (most do), and if so ensure that your latitude/longitude data is in radians. This is the most common problem other than a typo in the equation.

Based on a very rough profiling of the stored procedure, the majority of the processing time is the ORDER BY statement. If you don't need an ordered list of nearby ZIP codes, dropping this clause should maximize performance.

I've put the code on SourceForge as well as the database I used so that you can keep up with changes and discuss any problems with the code. It is fairly easy to take this starting point and use it and a good database of ZIP codes, locations, and demographic data to explore all sorts of possible functionality.