sydney-streets

Street Counting Methodology Research

Problem Statement

When counting unique instances of street names in OpenStreetMap data, we face a fundamental challenge: how do we determine when multiple road segments belong to the same street instance versus different instances?

For example:

This research compares different algorithmic approaches to solving this problem.

Test Dataset

We tested on 23 streets in the Sydney metropolitan area:

Regular Streets (15 streets):

Highways & Freeways (7 streets):

Methods Tested

1. Point-to-Point Distance

Algorithm: For each pair of segments, check if any points on one segment are within a threshold distance of any points on the other segment. Use Union-Find to group connected segments.

Variants: 30m, 50m, 100m thresholds

Pros:

Cons:

Performance:

2. Grid-Based Flood Fill

Algorithm:

  1. Divide the map into a grid (cell size = threshold distance)
  2. Map each segment to the grid cells it touches
  3. Use flood fill to find connected regions of cells
  4. Union all segments within each region

Variants: 50m, 100m, 200m grid sizes

Pros:

Cons:

Critical Bug Fix: The flood fill was generating adjacent cell coordinates with f"{lat+dlat},{lng+dlng}", but floating point errors meant -33.94 + 0.001 = -33.939 instead of the expected -33.941.

Solution: Round adjacent coordinates to grid: round((lat + dlat) / grid_size) * grid_size

Performance:

3. Polygon Buffer Intersection

Algorithm:

  1. Create buffer polygons around each line segment (using Shapely library)
  2. Check which buffer polygons intersect
  3. Use Union-Find to group intersecting segments

Variants: 30m, 50m, 100m buffer distances

Pros:

Cons:

Performance:

4. Endpoint-Only Distance

Algorithm: Only check distances between segment endpoints, ignoring intermediate points.

Variants: 30m, 100m thresholds, plus “Adaptive” (2km for major roads)

Pros:

Cons:

Performance:

5. Highway-Aware (Hybrid)

Algorithm:

  1. Run any base proximity method (Grid, Polygon, or Point-to-Point)
  2. Post-process: If street name contains “Highway”, “Freeway”, or “Motorway”, merge ALL components into one

Variants: Grid 100m base, Polygon 100m base

Pros:

Cons:

Performance:

Key Findings

Performance Comparison

Method Total Time Avg per Street Speed vs P2P
Point-to-Point (30m) 6.28s 0.393s 1x (baseline)
Grid 100m + Flood Fill 0.037s 0.002s 170x faster
Grid 200m + Flood Fill 0.028s 0.002s 224x faster
Polygon Buffer (100m) 2.26s 0.141s 2.8x faster
Endpoint Only (100m) 0.97s 0.061s 6.5x faster
Highway-Aware (Grid 100m) 0.037s 0.002s 170x faster

Accuracy Comparison

Streets with Perfect Agreement (all methods agree):

Streets with Disagreements:

The disagreements are primarily due to threshold differences, not method differences. When using the same threshold (100m), Grid, Point-to-Point, and Polygon Buffer show nearly identical results.

Highway Counting

Before Highway-Aware Method:

After Highway-Aware Method:

Recommendations

For General Use: Grid 200m + Flood Fill

Best all-around choice for counting street instances:

For Highway-Heavy Datasets: Highway-Aware (Grid 200m)

If your dataset includes highways, freeways, or motorways:

Threshold Selection

Based on testing:

Technical Deep Dive

The Floating-Point Bug

The grid method initially overcounted streets because of a subtle floating-point precision error:

# BUGGY CODE
for dlat in [-grid_size, 0, grid_size]:
    adj_cell = f"{lat+dlat},{lng+dlng}"  # -33.94 + 0.001 = -33.939 (!)

The problem: -33.94 + 0.001 produces -33.939 due to floating-point representation, not -33.941 as expected. This meant adjacent cells weren’t being recognized as adjacent.

# FIXED CODE
for dlat in [-grid_size, 0, grid_size]:
    adj_lat = round((lat + dlat) / grid_size) * grid_size  # Snap to grid
    adj_cell = f"{adj_lat},{lng+dlng}"

This fix ensured that Victoria Street (previously split into 37 parts) correctly counted as 36 instances.

Union-Find Algorithm

All methods use Union-Find (Disjoint Set Union) for grouping:

def find(x):
    if parent[x] != x:
        parent[x] = find(parent[x])  # Path compression
    return parent[x]

def union(x, y):
    px, py = find(x), find(y)
    if px != py:
        parent[px] = py

This provides O(α(n)) amortized time complexity, where α is the inverse Ackermann function (effectively constant).

Interactive Tools

Explore the methods yourself:

Conclusion

The Grid-based flood fill method with 200m cells provides the best balance of speed, accuracy, and simplicity. Combined with Highway-Aware post-processing, it correctly handles both regular streets and long highways with gaps.

For the Sydney Streets project, we’re using Grid 200m + Highway-Aware as our production method.


Research conducted November 2025 by analyzing 23 streets across Sydney using OpenStreetMap data.