Street Counting Methodology Research
Problem Statement
When counting unique instances of street names in OpenStreetMap data, we face a fundamental challenge: how do we determine when multiple road segments belong to the same street instance versus different instances?
For example:
- Victoria Street appears in multiple Sydney suburbs - each is a distinct street that happens to share a name
- Pacific Highway runs for hundreds of kilometers with gaps - it’s one continuous highway despite the gaps
- Railway Terrace might be broken into segments by intersections - these segments are part of the same street
This research compares different algorithmic approaches to solving this problem.
Test Dataset
We tested on 23 streets in the Sydney metropolitan area:
Regular Streets (15 streets):
- Victoria Street, Regent Street, Short Street, Railway Terrace
- Park Street, George Street, Elizabeth Street, William Street
- King Street, Church Street, Windsor Street, Albert Street
- Parramatta Road, Liverpool Road, Victoria Road
Highways & Freeways (7 streets):
- Pacific Highway (722 segments)
- Great Western Highway (516 segments)
- Princes Highway (497 segments)
- Hume Highway (320 segments)
- Warringah Freeway (118 segments)
- Central Coast Highway (100 segments)
- Prospect Highway (97 segments)
Methods Tested
1. Point-to-Point Distance
Algorithm: For each pair of segments, check if any points on one segment are within a threshold distance of any points on the other segment. Use Union-Find to group connected segments.
Variants: 30m, 50m, 100m thresholds
Pros:
- Most accurate for regular streets
- Considers full geometry of each segment
Cons:
- Slowest method (5-6 seconds total)
- O(n²) complexity for segment comparisons
- Still fails on long highways with large gaps
Performance:
- Time: 5.9-6.3s total across 23 streets
- Accuracy: Good for regular streets, poor for highways
2. Grid-Based Flood Fill
Algorithm:
- Divide the map into a grid (cell size = threshold distance)
- Map each segment to the grid cells it touches
- Use flood fill to find connected regions of cells
- Union all segments within each region
Variants: 50m, 100m, 200m grid sizes
Pros:
- 100-200x faster than Point-to-Point
- Scales well to large datasets
- After fixing floating-point bug, accuracy matches other methods
Cons:
- Initially had a floating-point precision bug causing incorrect adjacency checks
- Slightly less intuitive than distance-based methods
Critical Bug Fix:
The flood fill was generating adjacent cell coordinates with f"{lat+dlat},{lng+dlng}", but floating point errors meant -33.94 + 0.001 = -33.939 instead of the expected -33.941.
Solution: Round adjacent coordinates to grid: round((lat + dlat) / grid_size) * grid_size
Performance:
- Time: 0.027-0.055s total (200x faster!)
- Accuracy: Matches Point-to-Point and Polygon methods after bug fix
3. Polygon Buffer Intersection
Algorithm:
- Create buffer polygons around each line segment (using Shapely library)
- Check which buffer polygons intersect
- Use Union-Find to group intersecting segments
Variants: 30m, 50m, 100m buffer distances
Pros:
- Geometric approach - theoretically sound
- Can bridge small gaps better than point sampling
Cons:
- Requires Shapely library
- Slower than grid method but faster than Point-to-Point
- Similar accuracy to other methods
Performance:
- Time: 2.2-2.3s total
- Accuracy: Nearly identical to Point-to-Point
4. Endpoint-Only Distance
Algorithm: Only check distances between segment endpoints, ignoring intermediate points.
Variants: 30m, 100m thresholds, plus “Adaptive” (2km for major roads)
Pros:
- Faster than Point-to-Point
- Simple implementation
Cons:
- Less accurate - misses connections where segments are close in the middle but far at ends
- Adaptive variant requires road classification metadata
Performance:
- Time: 0.97-1.08s total
- Accuracy: Similar to Point-to-Point for most streets
5. Highway-Aware (Hybrid)
Algorithm:
- Run any base proximity method (Grid, Polygon, or Point-to-Point)
- Post-process: If street name contains “Highway”, “Freeway”, or “Motorway”, merge ALL components into one
Variants: Grid 100m base, Polygon 100m base
Pros:
- Solves the long-highway problem: All highways correctly count as 1
- Simple heuristic based on naming conventions
- Doesn’t affect regular street counting
Cons:
- Name-based rule might miss highways without standard names
- Only tested with OSM data (naming conventions may vary)
Performance:
- Time: Same as base method (minimal overhead)
- Accuracy: Perfect for highways (7/7 count as 1), normal for streets
Key Findings
| Method |
Total Time |
Avg per Street |
Speed vs P2P |
| Point-to-Point (30m) |
6.28s |
0.393s |
1x (baseline) |
| Grid 100m + Flood Fill |
0.037s |
0.002s |
170x faster |
| Grid 200m + Flood Fill |
0.028s |
0.002s |
224x faster |
| Polygon Buffer (100m) |
2.26s |
0.141s |
2.8x faster |
| Endpoint Only (100m) |
0.97s |
0.061s |
6.5x faster |
| Highway-Aware (Grid 100m) |
0.037s |
0.002s |
170x faster |
Accuracy Comparison
Streets with Perfect Agreement (all methods agree):
- Regent Street, Railway Terrace, Windsor Street
- Parramatta Road, Liverpool Road, Victoria Road
Streets with Disagreements:
- Victoria Street: 36-41 depending on threshold
- Short Street: 53-57 depending on threshold
- George Street: 51-59 depending on threshold
The disagreements are primarily due to threshold differences, not method differences. When using the same threshold (100m), Grid, Point-to-Point, and Polygon Buffer show nearly identical results.
Highway Counting
Before Highway-Aware Method:
- Pacific Highway: 5 instances (should be 1)
- Princes Highway: 3 instances (should be 1)
- Great Western Highway: 3 instances (should be 1)
- All methods failed equally
After Highway-Aware Method:
- ALL highways: 1 instance ✓
- Regular streets: unaffected
- Zero false positives (no regular streets incorrectly merged)
Recommendations
For General Use: Grid 200m + Flood Fill
Best all-around choice for counting street instances:
- 224x faster than Point-to-Point
- Accurate: Matches other methods within threshold variance
- Scalable: O(n) complexity with grid size
- Simple: No external dependencies
For Highway-Heavy Datasets: Highway-Aware (Grid 200m)
If your dataset includes highways, freeways, or motorways:
- Use Grid 200m as base method
- Apply Highway-Aware post-processing
- Perfect highway counting with no accuracy loss on regular streets
Threshold Selection
Based on testing:
- 200m: Best for urban areas with complex road networks
- Balances connectivity (bridges small gaps) with discrimination (separates distinct streets)
- Recommended for most use cases
- 100m: More conservative
- Use when streets are very close together
- Slightly more likely to split streets that should be connected
- 30-50m: Very conservative
- Only connects segments that are immediately adjacent
- May split continuous streets at intersections
Technical Deep Dive
The Floating-Point Bug
The grid method initially overcounted streets because of a subtle floating-point precision error:
# BUGGY CODE
for dlat in [-grid_size, 0, grid_size]:
adj_cell = f"{lat+dlat},{lng+dlng}" # -33.94 + 0.001 = -33.939 (!)
The problem: -33.94 + 0.001 produces -33.939 due to floating-point representation, not -33.941 as expected. This meant adjacent cells weren’t being recognized as adjacent.
# FIXED CODE
for dlat in [-grid_size, 0, grid_size]:
adj_lat = round((lat + dlat) / grid_size) * grid_size # Snap to grid
adj_cell = f"{adj_lat},{lng+dlng}"
This fix ensured that Victoria Street (previously split into 37 parts) correctly counted as 36 instances.
Union-Find Algorithm
All methods use Union-Find (Disjoint Set Union) for grouping:
def find(x):
if parent[x] != x:
parent[x] = find(parent[x]) # Path compression
return parent[x]
def union(x, y):
px, py = find(x), find(y)
if px != py:
parent[px] = py
This provides O(α(n)) amortized time complexity, where α is the inverse Ackermann function (effectively constant).
Explore the methods yourself:
Conclusion
The Grid-based flood fill method with 200m cells provides the best balance of speed, accuracy, and simplicity. Combined with Highway-Aware post-processing, it correctly handles both regular streets and long highways with gaps.
For the Sydney Streets project, we’re using Grid 200m + Highway-Aware as our production method.
Research conducted November 2025 by analyzing 23 streets across Sydney using OpenStreetMap data.