sydney-streets

Street Counting Methodology Research

Problem Statement

When counting unique instances of street names in OpenStreetMap data, we face a fundamental challenge: how do we determine when multiple road segments belong to the same street instance versus different instances?

For example:

Victoria Street appears in multiple Sydney suburbs - each is a distinct street that happens to share a name
Pacific Highway runs for hundreds of kilometers with gaps - it’s one continuous highway despite the gaps
Railway Terrace might be broken into segments by intersections - these segments are part of the same street

This research compares different algorithmic approaches to solving this problem.

Test Dataset

We tested on 23 streets in the Sydney metropolitan area:

Regular Streets (15 streets):

Victoria Street, Regent Street, Short Street, Railway Terrace
Park Street, George Street, Elizabeth Street, William Street
King Street, Church Street, Windsor Street, Albert Street
Parramatta Road, Liverpool Road, Victoria Road

Highways & Freeways (7 streets):

Pacific Highway (722 segments)
Great Western Highway (516 segments)
Princes Highway (497 segments)
Hume Highway (320 segments)
Warringah Freeway (118 segments)
Central Coast Highway (100 segments)
Prospect Highway (97 segments)

Methods Tested

1. Point-to-Point Distance

Algorithm: For each pair of segments, check if any points on one segment are within a threshold distance of any points on the other segment. Use Union-Find to group connected segments.

Variants: 30m, 50m, 100m thresholds

Pros:

Most accurate for regular streets
Considers full geometry of each segment

Cons:

Slowest method (5-6 seconds total)
O(n²) complexity for segment comparisons
Still fails on long highways with large gaps

Performance:

Time: 5.9-6.3s total across 23 streets
Accuracy: Good for regular streets, poor for highways

2. Grid-Based Flood Fill

Algorithm:

Divide the map into a grid (cell size = threshold distance)
Map each segment to the grid cells it touches
Use flood fill to find connected regions of cells
Union all segments within each region

Variants: 50m, 100m, 200m grid sizes

Pros:

100-200x faster than Point-to-Point
Scales well to large datasets
After fixing floating-point bug, accuracy matches other methods

Cons:

Initially had a floating-point precision bug causing incorrect adjacency checks
Slightly less intuitive than distance-based methods

Critical Bug Fix: The flood fill was generating adjacent cell coordinates with f"{lat+dlat},{lng+dlng}", but floating point errors meant -33.94 + 0.001 = -33.939 instead of the expected -33.941.

Solution: Round adjacent coordinates to grid: round((lat + dlat) / grid_size) * grid_size

Performance:

Time: 0.027-0.055s total (200x faster!)
Accuracy: Matches Point-to-Point and Polygon methods after bug fix

3. Polygon Buffer Intersection

Algorithm:

Create buffer polygons around each line segment (using Shapely library)
Check which buffer polygons intersect
Use Union-Find to group intersecting segments

Variants: 30m, 50m, 100m buffer distances

Pros:

Geometric approach - theoretically sound
Can bridge small gaps better than point sampling

Cons:

Requires Shapely library
Slower than grid method but faster than Point-to-Point
Similar accuracy to other methods

Performance:

Time: 2.2-2.3s total
Accuracy: Nearly identical to Point-to-Point

4. Endpoint-Only Distance

Algorithm: Only check distances between segment endpoints, ignoring intermediate points.

Variants: 30m, 100m thresholds, plus “Adaptive” (2km for major roads)

Pros:

Faster than Point-to-Point
Simple implementation

Cons:

Less accurate - misses connections where segments are close in the middle but far at ends
Adaptive variant requires road classification metadata

Performance:

Time: 0.97-1.08s total
Accuracy: Similar to Point-to-Point for most streets

5. Highway-Aware (Hybrid)

Algorithm:

Run any base proximity method (Grid, Polygon, or Point-to-Point)
Post-process: If street name contains “Highway”, “Freeway”, or “Motorway”, merge ALL components into one

Variants: Grid 100m base, Polygon 100m base

Pros:

Solves the long-highway problem: All highways correctly count as 1
Simple heuristic based on naming conventions
Doesn’t affect regular street counting

Cons:

Name-based rule might miss highways without standard names
Only tested with OSM data (naming conventions may vary)

Performance:

Time: Same as base method (minimal overhead)
Accuracy: Perfect for highways (7/7 count as 1), normal for streets

Key Findings

Performance Comparison

Method	Total Time	Avg per Street	Speed vs P2P
Point-to-Point (30m)	6.28s	0.393s	1x (baseline)
Grid 100m + Flood Fill	0.037s	0.002s	170x faster
Grid 200m + Flood Fill	0.028s	0.002s	224x faster
Polygon Buffer (100m)	2.26s	0.141s	2.8x faster
Endpoint Only (100m)	0.97s	0.061s	6.5x faster
Highway-Aware (Grid 100m)	0.037s	0.002s	170x faster

Accuracy Comparison

Streets with Perfect Agreement (all methods agree):

Regent Street, Railway Terrace, Windsor Street
Parramatta Road, Liverpool Road, Victoria Road

Streets with Disagreements:

Victoria Street: 36-41 depending on threshold
Short Street: 53-57 depending on threshold
George Street: 51-59 depending on threshold

The disagreements are primarily due to threshold differences, not method differences. When using the same threshold (100m), Grid, Point-to-Point, and Polygon Buffer show nearly identical results.

Highway Counting

Before Highway-Aware Method:

Pacific Highway: 5 instances (should be 1)
Princes Highway: 3 instances (should be 1)
Great Western Highway: 3 instances (should be 1)
All methods failed equally

After Highway-Aware Method:

ALL highways: 1 instance ✓
Regular streets: unaffected
Zero false positives (no regular streets incorrectly merged)

Recommendations

For General Use: Grid 200m + Flood Fill

Best all-around choice for counting street instances:

224x faster than Point-to-Point
Accurate: Matches other methods within threshold variance
Scalable: O(n) complexity with grid size
Simple: No external dependencies

For Highway-Heavy Datasets: Highway-Aware (Grid 200m)

If your dataset includes highways, freeways, or motorways:

Use Grid 200m as base method
Apply Highway-Aware post-processing
Perfect highway counting with no accuracy loss on regular streets

Threshold Selection

Based on testing:

200m: Best for urban areas with complex road networks
- Balances connectivity (bridges small gaps) with discrimination (separates distinct streets)
- Recommended for most use cases
100m: More conservative
- Use when streets are very close together
- Slightly more likely to split streets that should be connected
30-50m: Very conservative
- Only connects segments that are immediately adjacent
- May split continuous streets at intersections

Technical Deep Dive

The Floating-Point Bug

The grid method initially overcounted streets because of a subtle floating-point precision error:

# BUGGY CODE
for dlat in [-grid_size, 0, grid_size]:
    adj_cell = f"{lat+dlat},{lng+dlng}"  # -33.94 + 0.001 = -33.939 (!)

The problem: -33.94 + 0.001 produces -33.939 due to floating-point representation, not -33.941 as expected. This meant adjacent cells weren’t being recognized as adjacent.

# FIXED CODE
for dlat in [-grid_size, 0, grid_size]:
    adj_lat = round((lat + dlat) / grid_size) * grid_size  # Snap to grid
    adj_cell = f"{adj_lat},{lng+dlng}"

This fix ensured that Victoria Street (previously split into 37 parts) correctly counted as 36 instances.

Union-Find Algorithm

All methods use Union-Find (Disjoint Set Union) for grouping:

def find(x):
    if parent[x] != x:
        parent[x] = find(parent[x])  # Path compression
    return parent[x]

def union(x, y):
    px, py = find(x), find(y)
    if px != py:
        parent[px] = py

This provides O(α(n)) amortized time complexity, where α is the inverse Ackermann function (effectively constant).

Interactive Tools

Explore the methods yourself:

Method Comparison Tool: Compare up to 4 methods side-by-side on any street
Summary Table: View all results, timing data, and disagreement analysis
Main Visualization: See the Grid 200m method applied to the full dataset

Conclusion

The Grid-based flood fill method with 200m cells provides the best balance of speed, accuracy, and simplicity. Combined with Highway-Aware post-processing, it correctly handles both regular streets and long highways with gaps.

For the Sydney Streets project, we’re using Grid 200m + Highway-Aware as our production method.

Research conducted November 2025 by analyzing 23 streets across Sydney using OpenStreetMap data.