## Geocoding 101


As Teradata customers discover and begin to use the native geospatial capabilities of the Teradata database, one of the first questions that inevitably comes up is: how do I “Geocode” my data?  Geocoding is often an important first phase of a Geospatial implementation project, and sometimes even a barrier to starting the project altogether.  The purpose of this article is to discuss what Geocoding is, how it works, the Geocoding options and their precision, and the sources available today for Geocoded information.

## What is Geocoding?

Geocoding is the process of calculating the geographical coordinates (longitude and latitude – x and y) of various business entities (customers, suppliers, stores, assets, etc.) based on each entity's location on the surface of the earth.  It is often defined as an address latitude/longitude append operation.  Once the coordinates of these entities are acquired, they can be used in geospatial proximity and location-focused analytics within the Teradata database.

For example, here is the address of the Teradata R&D facility in San Diego the way we are typically used to identifying a location:

17095 Via Del Campo

San Diego, CA 92127

These are the associated latitude and longitude after this location is Geocoded:

Latitude: 33° 01’20.90” N

Longitude: 117° 05’33.75” W

## Coordinate Formats

When we talk about Geocoding we are talking about latitudes and longitudes (remember 5th grade geography?).  There are three generally accepted formats for representing coordinates in latitude and longitude:

• Degrees, Minutes, Seconds (DMS): 10°30’00”N, 50°30’00”W
• Degrees, Decimal Minutes (DM): 10°30.0’, -50°30.0’ or 10d30.0m, -50d30.0m
• Decimal Degrees (DD): 10.5000°, -50.5000°, typically with four to six decimal places of precision

Degrees, Minutes, Seconds is the format we are most used to seeing on charts and maps.  Decimal Degrees expresses coordinates as decimal fractions, with south latitudes and west longitudes written as negative numbers, and is the most convenient format for analysis in the database.  Ultimately, all Geocoded locations, whether they are addresses, zip codes, zip+3, zip+4, city centers, etc., must be converted to this format in order to be analyzed with the in-database SQL spatial functions in Teradata.
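As an illustration of the conversion to Decimal Degrees, here is a minimal Python sketch. The function name `dms_to_decimal` and its parameters are made up for this example and are not part of any Teradata tool; the only rule it relies on is that minutes are sixtieths of a degree, seconds are sixtieths of a minute, and the S and W hemispheres are negative.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert Degrees, Minutes, Seconds plus a hemisphere letter to signed Decimal Degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    # The sign is carried by the hemisphere letter, so work with the magnitude.
    value = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    # South latitudes and west longitudes are negative in Decimal Degrees.
    return -value if hemisphere in ("S", "W") else value


# The San Diego example from earlier in this article.
latitude = dms_to_decimal(33, 1, 20.90, "N")
longitude = dms_to_decimal(117, 5, 33.75, "W")
print(round(latitude, 6), round(longitude, 6))
```

For the San Diego address this gives roughly 33.0225 and -117.0927, which is the form the in-database spatial functions expect.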

## Geocoding Granularity

There are different levels of Geocoding granularity that you should take into consideration.  You can Geocode to the roof-top, parcel centroid, street address, zip, zip+4, city center, state center, etc.  The level you Geocode to really depends on your business need, your analytics requirements, and of course the data available to you.

If the postal code is sufficient spatial information or if the business is reasonably well aligned to postal code boundaries, then it may not be necessary to do any Geocoding at a lower level of granularity (e.g. address, roof-top, parcel, etc.).

On the other hand, if the postal code information is not granular enough, does not align with the spatial pattern you are trying to analyze (e.g. insurance flood zones, cell signal coverage areas, sales regions, or other man-made boundaries, etc.), or there are no static addresses because the object is moving (rail cars, trucks, ships, parcels, RFID tags, etc.), then Geocoding to a more granular unit may add significant value to the data and the resulting analysis.

## The Geocoding Process

Once you’ve determined that Geocoding is indeed necessary for your analytical purposes, the next step is to figure out how to go about Geocoding your data.

Here are some options:

• #### Web service Geocoding

This approach requires developing or purchasing a program that sends addresses to a Geocoding service (like the ones provided by Google or Yahoo) and receives the Geocoded data back.  Some of these services are free for a limited number of rows per day, but in general you will pay a subscription fee and possibly a ‘per-transaction’ fee.  No software or data has to be installed and maintained on site (except the simple program calling the Geocoder), so no up-front license purchase is necessary.  The advantages of this method are the low up-front cost and low total cost of ownership.  The potential issues are the need to move data off site (a security concern), performance problems caused by network latency, and SLAs.
• #### Service-only Geocoding

In this scenario the enterprise sends the data to a Geocoding vendor as a file in a pre-defined format and receives the file back with the coordinates after an agreed time period.  The Geocoded file is then loaded back into the database using the usual data loading tools.  The advantages of this approach are its relative simplicity and low total cost of ownership.  The drawbacks are data latency and security concerns.
• #### Hybrid Geocoding

A hybrid approach, offered by some vendors, is a combination of the methods noted above.  Where a large number of records needs to be Geocoded (during the initial load, for example), an on-site solution is offered.  Incremental updates or near real-time Geocoding can be performed using either an on-site Geocoder or a web service (where the update volume is not very high).  Initial or occasional batch Geocoding can be handled by on-site Geocoders or a service-only vendor.

The specific Geocoding method to use will largely depend on the application requirements, data volumes, and data update frequency (large batches, small batches, real-time, etc.).  It will also depend on how stringent the data security requirements are, and on the data latency, data accuracy, and service level requirements and the budget available.

## Geocoding Precision

Not all Geocoding is equal, and it can be a very complex activity.  Many Geocoder vendors have proprietary and patented approaches and algorithms to provide very accurate, precise, and fast Geocoded results.  Depending on the technique used to collect the coordinates, the precision of the location results can differ widely from approach to approach and vendor to vendor, and the price of the Geocoding solution will vary as well.  In fact, you can send the same address to 10 different vendors using different Geocoding processes and get 10 different answers.  Here’s a summary of some of the most common Geocoding methods used by vendors today and their relative accuracy:

• #### Rooftop / Parcel Geocoding

This is usually the most precise (and most expensive) Geocoding methodology.  There are 144.3 million privately owned parcels in the US.  Similar property registers are available in other countries.  Parcel data is often available from cities, counties, regions and other local or central government sources.  As digital parcel boundaries become available they are rapidly being incorporated into public access or third-party vendor databases.  A geocoding application that uses parcel data will typically associate an address with the coordinates of the center (‘centroid’) of its parcel (property).

• #### Street Interpolation Geocoding

This method uses street data where the street network has already been coded within the geographic coordinate space by third-party vendors such as Navteq, MapInfo, etc.  Each street segment is associated with an address range (a series of house numbers).  The Geocoding algorithm takes an address, matches it to a street and a specific segment, and then interpolates the position of the address within that segment.  Since this is an approximation, results may vary, and errors ranging from 50 feet (15 meters) up to several thousand feet (1 km or more) are not uncommon, as seen in the picture below.  Because of this potential inaccuracy, interpolated Geocodes should not be used in applications that require extremely high accuracy, such as home insurance risk assessment, where a location difference of just 50-100 feet (15-30 meters) can, for example, determine whether a house falls within the boundaries of a flood zone.

• #### Centroid Interpolation Geocoding

This geocoding method is also based on approximation.  Instead of interpolating along the road network, as street-level interpolation does, it uses the center (‘centroid’) of the area the object is in.  It ranges from the precise (rooftop interpolation) to the very approximate.  Here are some examples for the Teradata Office in Rancho Bernardo using various interpolation approaches:

It is recommended to ask the Geocoding vendor about the method used for Geocoding and the precision.  As mentioned, the precision (and price) of the Geocoding solution will depend on your requirements and the application that is being designed.
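To make the difference between the street interpolation and centroid methods more concrete, here is a simplified Python sketch of both. It is only an illustration under stated assumptions: the function names are made up, coordinates are treated as flat x/y pairs (longitude, latitude), house numbers are assumed to be spread evenly along the segment, and the polygon is assumed to be simple (not self-intersecting). Commercial Geocoders use proprietary and considerably more refined algorithms.

```python
def interpolate_address(house_number, range_start, range_end, start_point, end_point):
    """Estimate an address position by linear interpolation along one street segment.

    range_start and range_end are the house numbers at start_point and end_point.
    Points are (x, y) tuples, i.e. (longitude, latitude).
    """
    low, high = sorted((range_start, range_end))
    if not low <= house_number <= high:
        raise ValueError("house number is outside the segment's address range")
    if range_start == range_end:
        fraction = 0.5  # single-number range: fall back to the segment midpoint
    else:
        fraction = (house_number - range_start) / (range_end - range_start)
    x = start_point[0] + fraction * (end_point[0] - start_point[0])
    y = start_point[1] + fraction * (end_point[1] - start_point[1])
    return (x, y)


def polygon_centroid(vertices):
    """Return the area-weighted centroid of a simple polygon (shoelace formula).

    vertices is a list of (x, y) tuples in order; the first vertex is not repeated.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    count = len(vertices)
    for i in range(count):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % count]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("polygon has zero area")
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area.
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))


# House number 150 on a segment covering 100-200 lands halfway along the segment.
print(interpolate_address(150, 100, 200, (0.0, 0.0), (1.0, 2.0)))
# A 2 x 2 square parcel has its centroid at (1.0, 1.0).
print(polygon_centroid([(0, 0), (2, 0), (2, 2), (0, 2)]))
```

The interpolation sketch also shows where the error comes from: if the real houses are not evenly spaced along the segment, the estimated point can land well away from the actual building, whereas a parcel centroid always falls within or near the property itself.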

## Country or Regional Geocoding Solutions

Country and regional coverage requirements have to be taken into consideration before starting any Geocoding process.  The most extensive coverage will typically be found in North America (US, Canada) and in some areas of Western Europe.  Coverage of other areas will vary from country to country and from area to area.  For instance, urban areas usually have better coverage than rural areas.  The type of geocoding available will also vary, with very precise geocoding available for some countries and less precise or even very approximate interpolated geocoding solutions available for others.

## On-Line Geocoding Services

There are a number of free Geocoders, which can usually be accessed through web-service APIs.  These Geocoders usually allow a certain number of records to be Geocoded per day from the same IP address.  If you want to Geocode a larger number of records, a commercial license is available.  As of March 2010, the most popular are Yahoo (up to 5,000 Geocodes per day per IP address) and Google (15,000 per day per IP address), although others are also available.  These Geocoders will usually be adequate for testing and development purposes or even for some low-volume geocoding maintenance applications.  However, if you want to use any of them for business-critical applications, you should do thorough research and testing first, because some of them may not offer acceptable service level guarantees or accuracy for your production operations.
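A quick back-of-the-envelope calculation shows why the free tiers suit testing better than an initial load. The short Python sketch below uses the daily limits quoted above; the one-million-row table is a hypothetical example, and the function name is made up for illustration.

```python
def days_to_geocode(record_count, daily_limit):
    """Return the whole number of days needed to Geocode record_count rows
    when at most daily_limit rows can be sent per day."""
    if record_count < 0 or daily_limit <= 0:
        raise ValueError("record_count must be >= 0 and daily_limit must be > 0")
    # Integer ceiling division: a partly used day still counts as a day.
    return -(-record_count // daily_limit)


customers = 1_000_000  # hypothetical table size
print(days_to_geocode(customers, 5_000))   # at 5,000 Geocodes per day
print(days_to_geocode(customers, 15_000))  # at 15,000 Geocodes per day
```

At 5,000 rows per day the hypothetical table would take 200 days to Geocode, and at 15,000 per day it would still take 67 days, which is why large initial loads usually call for a commercial license or a batch approach.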

Once you have the location coordinate data (latitudes and longitudes), it's time to load it into the Teradata ST_Geometry data type in your database tables.  My colleague, Mike Riordan, has put together another good article on how to load and convert Geocoded location data into the Teradata ST_Geometry data type to begin your Geospatial analytics.
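One detail worth getting right before loading is the coordinate order. The well-known text form of a point is written as x then y, which means longitude first and latitude second. The Python sketch below builds such a string from a Decimal Degrees pair; the function name is made up for this example, and how the resulting text is then passed to ST_Geometry is left to the loading article and the Teradata documentation.

```python
def point_wkt(latitude, longitude, decimals=6):
    """Format a Decimal Degrees coordinate pair as a well-known text POINT.

    Well-known text is written x then y, so longitude comes first.
    """
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return f"POINT({longitude:.{decimals}f} {latitude:.{decimals}f})"


# The San Diego example, already converted to Decimal Degrees.
print(point_wkt(33.022472, -117.092708))
```

The range checks catch the most common mistake, a swapped pair, for many locations, since a longitude such as -117 is not a valid latitude.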


## Re: Geocoding 101

Dear Mzenus,

That was a wonderful introduction. From a DBA point of view, is there any way we can estimate the size required to store the Geodata at a moderate precision like "Centroid interpolation geocoding", for example? Telecom implementations use this data from warehouses extensively, and sizing is a prerequisite for good planning.

Regards.

## Re: Geocoding 101

Hi Ramakrishna:

That's a good question. In Teradata, geospatial coordinates are loaded into a new data type column (ST_Geometry).

The ST_Geometry type can represent any of the following geospatial types that are defined in the ANSI SQL:1999 SQL Multimedia and Application Packages standard. Any column of type ST_Geometry can contain one of these geospatial types (documented using their well-known text formats):
- Points (x y)
- Lines or Curves (x y, x y, x y, x y)
- Polygons (x y, x y, x y, ...)
Also, Geometry Collections, GeoSequence, MultiLine String, MultiPoint, and MultiPolygon.

Because ST_Geometry is based on the BLOB type, it is defined with a maximum size, as measured by its well-known binary representation, of 16 MB (allowing for approximately 1 million vertices in the geospatial object). Therefore, the size will really depend on the types of the geometries you will be storing in each row. If the Geometry is less than or equal to 10k, then it is stored in the row. If the Geometry is greater than 10k then it is stored as a LOB.
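For a rough first estimate, you can work from the standard well-known binary layout of 2D geometries: 1 byte for the byte order, 4 bytes for the geometry type, 4-byte counts, and 16 bytes (two 8-byte doubles) per vertex. The Python sketch below is only an approximation under those assumptions; the function names are made up, and it ignores any row or storage overhead that Teradata adds on top of the geometry value itself.

```python
def wkb_point_bytes():
    """Size of a 2D point: byte order + type + one (x, y) pair."""
    return 1 + 4 + 16


def wkb_linestring_bytes(vertex_count):
    """Size of a 2D line string: byte order + type + point count + vertices."""
    return 1 + 4 + 4 + 16 * vertex_count


def wkb_polygon_bytes(ring_vertex_counts):
    """Size of a 2D polygon: byte order + type + ring count,
    then a point count plus vertices for each ring."""
    return 1 + 4 + 4 + sum(4 + 16 * count for count in ring_vertex_counts)


# One Geocoded point per row, for a hypothetical table of 10 million rows.
rows = 10_000_000
print(wkb_point_bytes())         # bytes per point
print(rows * wkb_point_bytes())  # geometry bytes for the whole table
# A polygon with a single ring of 5 vertices (first vertex repeated to close it).
print(wkb_polygon_bytes([5]))
```

So a Geocoded point, whatever interpolation method produced it, is about 21 bytes of geometry per row, far below the in-row threshold mentioned above. Only large lines and polygons, on the order of several hundred vertices or more, approach the point where the value would be stored as a LOB.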