Geocoding Accuracy: Feedback & Suggestions For Whereabouts Project

by Pedro Alvarez

Hey @ajl2718,

First off, I wanted to say a huge thank you for putting together this project! It's seriously impressive work. I’ve been tackling a similar task using the rapidfuzz package myself, and the potential of DuckDB really caught my eye. While I was gearing up to experiment with a DuckDB-based approach, I stumbled upon this repo.

Accuracy of the Geocoding Demo

Geocoding accuracy is critical when dealing with real-world addresses, so I spent some time testing the Hugging Face demo to see how it performs in different scenarios. In particular, I wanted to understand how well it handles typos and variations in address formats, and I noticed a couple of interesting behaviors that I wanted to bring to your attention.

For example, when I entered a valid address like 3333 Channel Way, San Diego, CA, the demo found a match with a 0.79 similarity score. That’s pretty good! However, when I introduced a small typo, like changing “Channel” to “Chanel” (3333 Chanel Way, San Diego, CA), the demo returned null. This raised a question for me: Is this behavior expected with the standard method being used? It seems that even a minor typo can throw off the matching process, which could be a challenge in real-world applications where users might make similar errors.
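
To make the comparison concrete, here's a minimal sketch (using rapidfuzz, not whereabouts' internal matching code) of how a plain fuzzy-similarity score degrades only slightly for that one-letter typo, whereas the demo returned nothing:

```python
# Minimal sketch with rapidfuzz: a one-letter typo only slightly lowers a plain
# fuzzy-similarity score. This is NOT the project's matching code, just an
# illustration of the behaviour I expected.
from rapidfuzz import fuzz

reference = "3333 CHANNEL WAY SAN DIEGO CA"
queries = [
    "3333 CHANNEL WAY SAN DIEGO CA",  # exact address
    "3333 CHANEL WAY SAN DIEGO CA",   # "Channel" -> "Chanel" typo
]

for query in queries:
    # token_sort_ratio returns 0-100; divide by 100 to compare with the demo's 0-1 scores
    score = fuzz.token_sort_ratio(query, reference) / 100
    print(f"{query!r}: similarity {score:.2f}")
```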

I also tried experimenting with the "trigram" option, hoping it might be more robust against typos. However, I encountered an error when using it. This led me to wonder if the "trigram" option has been tested extensively, especially on larger databases. It would be really valuable to see how it performs under more demanding conditions, as trigram-based matching is often effective at handling minor variations in text.
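
For reference, here's a rough sketch of character-trigram (Jaccard) similarity, just to illustrate why I'd expect a trigram approach to be more forgiving of single-character typos. This is my own toy implementation, not the whereabouts "trigram" option itself:

```python
# Toy character-trigram (Jaccard) similarity -- my own sketch, not the project's
# "trigram" implementation -- showing that a one-letter typo keeps most trigrams intact.
def trigrams(text: str) -> set[str]:
    padded = f"  {text.lower()} "  # pad so the leading characters also form trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

print(trigram_similarity("3333 Channel Way", "3333 Chanel Way"))   # high: most trigrams survive the typo
print(trigram_similarity("3333 Channel Way", "500 Harbor Drive"))  # low: genuinely different address
```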

Could you shed some light on why the system behaves this way with typos? Have you had the chance to test the "trigram" option on a large database with typo-prone addresses? Understanding the limitations and strengths of the current approach will be super helpful for anyone looking to use this geocoding tool.

Questions About Address Formatting

I also had some questions about address formatting within the project, aimed at clarifying best practices for preparing address data. My main question is about the ADDRESS_LABEL field and how it should be formatted.

Should the ADDRESS_LABEL field contain the full address without commas (e.g., 3333 CHANNEL WAY SAN DIEGO CA 92110) or just the street portion (3333 CHANNEL WAY)? The distinction matters because the format can affect how effectively the geocoder matches addresses: the full address provides more context, but it also introduces more opportunities for mismatches if any part of the address is slightly off, while the street portion alone simplifies matching but might miss information that helps with disambiguation.
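
Just to make the two candidate formats concrete, here's a tiny sketch of how each would be built from hypothetical address components (the component names are mine, not the project's schema):

```python
# Hypothetical address components -- field names are illustrative only.
components = {
    "number": "3333",
    "street": "CHANNEL WAY",
    "city": "SAN DIEGO",
    "state": "CA",
    "postcode": "92110",
}

# Option A: full address without commas
full_label = " ".join([components["number"], components["street"],
                       components["city"], components["state"], components["postcode"]])
# -> "3333 CHANNEL WAY SAN DIEGO CA 92110"

# Option B: street portion only
street_label = f"{components['number']} {components['street']}"
# -> "3333 CHANNEL WAY"
```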

It would be great to get some clarity on which format is preferred and why, for example whether one is favored for performance reasons or for compatibility with a particular matching algorithm. Knowing the intended format for ADDRESS_LABEL will help ensure data is prepared correctly and that geocoding results are as accurate as possible.

Suggestions for Enhancements

I have a few suggestions that could enhance the project even further, touching on different aspects of the system, from handling address line 2 information to the data types used for a couple of fields. Let's break them down:

1. Address Line 2 Support

In the US, it’s very common for addresses to include a second line, such as an apartment number (e.g., APT C). Not accounting for this can lead to incomplete or inaccurate geocoding results: a large apartment complex can have hundreds of units at the same street address, and without the apartment number it's impossible to distinguish between them. Supporting address line 2 would significantly improve the accuracy and usability of the system, making it more versatile for real-world applications.

Implementing support for address line 2 might involve adding a new field to the address data structure and adjusting the geocoding algorithms to consider this additional information. It could also require some changes to the user interface to allow users to input this information easily. While it might add some complexity, the benefits in terms of accuracy and completeness would be well worth the effort.
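
As a very rough sketch of what I mean (the field names here are hypothetical, not the project's actual schema), the record could simply carry an optional second line:

```python
# Hypothetical address record with an optional second line -- a sketch of the idea,
# not the project's actual data structure.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AddressRecord:
    address_line_1: str            # e.g. "3333 CHANNEL WAY"
    address_line_2: Optional[str]  # e.g. "APT C"; None when there is no unit
    city: str
    state: str
    zip_code: str                  # kept as a string (see the data type notes below)

record = AddressRecord("3333 CHANNEL WAY", "APT C", "SAN DIEGO", "CA", "92110")
```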

2. Data Type Considerations

When it comes to data types, consistency and compatibility are key. I noticed a couple of fields where using strings might be a better choice for long-term flexibility and interoperability:

  • ADDRESS_DETAIL_PID and POSTCODE: These appear to be stored as integers, but I’d suggest changing them to strings. Many Points of Interest (POI) databases use UUIDs (Universally Unique Identifiers) as identifiers, which are inherently strings, so keeping ADDRESS_DETAIL_PID as a string ensures consistency when integrating with those databases. Similarly, representing POSTCODE as a string aligns with the requirements of Placekey, another address-linkage method that expects strings. This consistency simplifies data integration and reduces the risk of data-type-related issues.

  • ZIP: This one should definitely be a string. Canadian postal codes contain letters, US ZIP codes can start with 0 (the leading zero is silently lost if the value is stored as an integer), and extended ZIP codes use the +4 format (e.g., 10000-1234). Storing ZIP codes as strings accommodates all of these variations without losing information, which is crucial for accurate geocoding and address matching; there's a small sketch of the leading-zero issue just after this list.
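
Here's the leading-zero issue I mentioned, as a tiny sketch (the column name is illustrative, not necessarily the project's schema):

```python
# Why ZIP/POSTCODE values are safer as strings -- a small illustration.
zips = ["02139", "92110", "10000-1234"]

# Coercing the plain five-digit values to integers silently drops the leading zero,
# and the ZIP+4 value cannot be converted at all:
print(int("02139"))      # 2139 -- the leading zero is gone
# int("10000-1234")      # would raise ValueError

# Keeping the column as strings (e.g. dtype=str in pandas, or VARCHAR in DuckDB)
# preserves every variant unchanged:
print(zips)              # ['02139', '92110', '10000-1234']
```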

Switching these fields to strings might seem like a small change, but it can have a big impact on the system’s ability to handle diverse datasets and integrate with other tools and services. It’s all about future-proofing the project and making it as versatile as possible.

By implementing these suggestions, the project could become even more powerful and user-friendly, capable of handling a wider range of address formats and data sources.


Thanks again for the fantastic work you’ve put into this—I’m really looking forward to hearing your thoughts and discussing these points further!