OSM data analysis for environment and agriculture

Published:
25 min read

Introduction

In this blog post, we are going to analyze the keys and tags from OpenStreetMap (OSM), with a particular interest in environmental and agricultural topics.

To follow along, here is the codebase associated with this blog post: https://github.com/NoeFlandre/osm-stats

Overview

The first step to perform such an analysis is to download the dataset of statistics including all keys and tags from OSM data which is available at the following link: https://taginfo.openstreetmap.org/download/taginfo-db.db.bz2

Once extracted, the reader will find a file of roughly 13GB with an SQLite database.
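The extraction step can be sketched with the Python standard library. This is a minimal sketch, assuming the archive has already been downloaded locally; the file names in the trailing comment are hypothetical.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path without loading it all in memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Copy in fixed-size chunks so a multi-gigabyte database stays manageable.
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Hypothetical local file names, assuming the archive sits in the working directory:
# decompress_bz2("taginfo-db.db.bz2", "taginfo-db.db")
```

Streaming in chunks matters here because the extracted database is far larger than typical available memory.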

First global stats

To get started, we need to understand what a key is and what a tag is. A key is typically a single word which describes a category of OSM object. For instance, “building” is a key. A tag is simply a (key, value) pair. An example of a tag would be (building, house). So the polygon of a house on OSM will carry the key “building” and most likely the tag “building=house”, which corresponds to (key, value) = (building, house).

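Now let’s have a look at some descriptive stats from our database: it contains 110,706 distinct keys and 192,821,586 distinct tags, for a total of 3,892,388,715 tag occurrences.

We can already conclude that this is a fairly large database.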

If we want to take a first glance at the data, we can look at the top 10 keys and the top 10 tags (key=value pairs).

For the keys, we can see that basic infrastructure dominates:

  1. building (~699.6 million entries)
  2. source (~310.2 million)
  3. highway (~296.3 million)
  4. addr:housenumber (~180.9 million)
  5. addr:street (~170.0 million)
  6. addr:city (~131.3 million)
  7. addr:postcode (~115.2 million)
  8. name (~114.1 million)
  9. natural (~92.2 million)
  10. surface (~78.7 million)

The tags confirm this first observation:

  1. building=yes (~556.4 million)
  2. highway=residential (~69.4 million)
  3. building=house (~66.1 million)
  4. highway=service (~64.6 million)
  5. surface=asphalt (~36.2 million)
  6. source=microsoft/BuildingFootprints (~35.8 million)
  7. natural=tree (~33.6 million)
  8. highway=footway (~31.6 million)
  9. highway=track (~29.6 million)
  10. waterway=stream (~29.0 million)
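Rankings like the two above can be obtained with plain SQL. The sketch below assumes the database exposes a `keys(key, count_all)` table and a `tags(key, value, count_all)` table; this schema is an assumption to check against the actual file. A tiny in-memory database with made-up counts stands in for the real one.

```python
import sqlite3


def top_keys(conn, n=10):
    """Return the n most frequent keys (assumed table: keys(key, count_all))."""
    query = "SELECT key, count_all FROM keys ORDER BY count_all DESC LIMIT ?"
    return conn.execute(query, (n,)).fetchall()


def top_tags(conn, n=10):
    """Return the n most frequent tags (assumed table: tags(key, value, count_all))."""
    query = "SELECT key, value, count_all FROM tags ORDER BY count_all DESC LIMIT ?"
    return conn.execute(query, (n,)).fetchall()


def demo_connection():
    """Build a tiny in-memory stand-in for the real database (toy counts)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keys (key TEXT, count_all INTEGER)")
    conn.execute("CREATE TABLE tags (key TEXT, value TEXT, count_all INTEGER)")
    conn.executemany(
        "INSERT INTO keys VALUES (?, ?)",
        [("highway", 296), ("building", 700), ("natural", 92)],
    )
    conn.executemany(
        "INSERT INTO tags VALUES (?, ?, ?)",
        [("highway", "residential", 69), ("building", "yes", 556), ("natural", "tree", 34)],
    )
    return conn


conn = demo_connection()
print(top_keys(conn, 2))
print(top_tags(conn, 1))
```

With the real file, `sqlite3.connect` would point at the extracted database instead of `":memory:"`.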

Should we just remove some obvious keys?

What we would like is to cut through the noise and only keep the keys and tags we are interested in (anything relevant to topics like agriculture, environment and so on). One could be tempted to remove the “building” key completely, but before doing that, let’s have a look at the 50 most frequent values for the “building” key:

| value | count_all |
| --- | --- |
| yes | 556,451,624 |
| house | 66,181,816 |
| residential | 16,123,805 |
| detached | 10,056,176 |
| garage | 8,122,742 |
| apartments | 7,784,487 |
| shed | 4,495,922 |
| industrial | 2,608,502 |
| roof | 2,450,761 |
| hut | 2,447,641 |
| farm_auxiliary | 2,366,983 |
| semidetached_house | 1,974,860 |
| terrace | 1,478,303 |
| commercial | 1,455,301 |
| school | 1,345,309 |
| retail | 1,274,921 |
| construction | 1,132,478 |
| outbuilding | 1,084,337 |
| garages | 1,054,157 |
| greenhouse | 821,293 |
| barn | 805,630 |
| cabin | 657,564 |
| static_caravan | 557,489 |
| service | 548,879 |
| warehouse | 473,553 |
| bungalow | 440,894 |
| church | 433,987 |
| farm | 414,913 |
| allotment_house | 349,247 |
| carport | 338,130 |
| office | 304,623 |
| ruins | 290,506 |
| public | 213,143 |
| civic | 212,355 |
| university | 176,658 |
| hospital | 170,684 |
| hotel | 160,423 |
| kindergarten | 127,484 |
| chapel | 119,655 |
| boathouse | 118,131 |
| ger | 107,876 |
| mosque | 107,555 |
| storage_tank | 105,426 |
| manufacture | 99,828 |
| hangar | 95,179 |
| bunker | 74,512 |
| dormitory | 73,721 |
| silo | 65,809 |
| train_station | 60,106 |
| college | 55,587 |

As we can see, these tags include many elements we are not interested in (e.g. bunker, college and so on). However, some tags could be of interest from an environmental or agricultural perspective (e.g. greenhouse, farm). So simply discarding every occurrence of the “building” key is not the right solution. On top of that, we saw earlier that the entire set includes 110,706 keys, and going through each of them would take a while… We therefore cannot afford to do fine-grained filtering manually.

Filtering at scale

One of the critical issues with OSM data is that tags are not standardized: each contributor is fairly free to annotate a polygon using free-form text. Moreover, we can assume that some tags contain typos. It is therefore likely that, among the 192,821,586 tags, there is a very long tail of rarely used tags that come from one user's specific notation.

Removing tags with low occurrences

In order to first clean up this database and only keep prominent tags, we can decide to only keep tags such that count_all >= 500. This simple filter brings the number of unique (key, value) pairs down from 192.8 M to 224,123. Meanwhile, the total number of occurrences goes from 3,892,388,715 to 3,350,015,993, so the long-tail filter drops roughly 14% of occurrences while dramatically reducing the number of tags (by a factor of roughly 860). In simpler terms, we significantly reduced the complexity of our dataset while preserving a satisfactory number of samples.
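As a rough illustration, the threshold and the two ratios quoted above can be reproduced as follows. The helper name `filter_long_tail` is hypothetical, and the toy dictionary only stands in for the real tag table; the totals plugged in at the end are the ones reported in this post.

```python
def filter_long_tail(tag_counts, min_count=500):
    """Keep tags whose count_all reaches min_count and report what was lost."""
    kept = {tag: count for tag, count in tag_counts.items() if count >= min_count}
    total = sum(tag_counts.values())
    kept_total = sum(kept.values())
    stats = {
        "tags_before": len(tag_counts),
        "tags_after": len(kept),
        "dropped_occurrence_share": 1 - kept_total / total if total else 0.0,
    }
    return kept, stats


# Toy counts standing in for the real (key, value) table.
toy = {"building|yes": 9000, "landuse|farmland": 700, "lanuse|farmland": 3}
kept, stats = filter_long_tail(toy)
print(kept, stats)

# Ratios recomputed from the totals reported in the post.
compression_factor = 192_821_586 / 224_123
dropped_share = 1 - 3_350_015_993 / 3_892_388_715
print(round(compression_factor), round(dropped_share * 100))
```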

Standardizing tags

Since OSM tags can be messy (for example, we could find both “Landuse” and “landuse”), we want to turn them into clean pairs. To do so, we convert each string to lowercase, strip any leading or trailing whitespace, handle missing values by mapping them to “none”, and finally join keys and values using the pipe “|” as a separator, since it never appears in our OSM values. Every tag now looks clean and resembles something like this: “landuse|farmland”.
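The normalization steps above could look like the following sketch. The function name `normalize_tag` is hypothetical, and treating an empty string like a missing value is an assumption of this example rather than something stated in the pipeline.

```python
def normalize_tag(key, value):
    """Lowercase, strip and join a (key, value) pair with a pipe separator."""

    def clean(part):
        if part is None:
            return "none"  # missing values are mapped to the literal "none"
        text = str(part).strip().lower()
        return text if text else "none"  # assumption: empty counts as missing

    return f"{clean(key)}|{clean(value)}"


print(normalize_tag(" Landuse", "Farmland "))
print(normalize_tag("natural", None))
```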

Tokenizing our strings

Now that we have clean tags, we can try to cluster them in order to gather tags which belong to a similar topic. To do so, we need to turn our strings into vectors. We have two options for that: tokenize them at the word level or at the character level. Tokenizing at the word level would require exact matches, which is too strict a condition. Suppose a user made a typo and wrote “lanuse|farmland” instead of “landuse|farmland”. Word-level tokenization would treat “landuse” and “lanuse” as two different words even though they mean the same thing. That’s why tokenizing at the character level might be a better pick here. A good solution is to use n-grams (i.e. a sliding window of n characters through the word). For instance, the 3-grams of the word “landuse” are “lan”, “and”, “ndu”, “dus” and “use”. This way, even with a typo, two words still share some of their n-grams. For example, “landuse” and “lanuse” share the 3-grams “lan” and “use”. A good tradeoff is to use a range of n-grams from 3 to 5 characters (2-grams being too noisy since they are shared across too many words, and 6-grams and above being too specific).
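The sliding window is simple enough to write by hand. The sketch below (with hypothetical helper names) reproduces the “landuse” versus “lanuse” example from the paragraph above.

```python
def char_ngrams(text, n):
    """Return every window of n consecutive characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_ngrams_in_range(text, n_min=3, n_max=5):
    """Concatenate the n-grams for every n between n_min and n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(char_ngrams(text, n))
    return grams


correct = char_ngrams("landuse", 3)
typo = char_ngrams("lanuse", 3)
shared = set(correct) & set(typo)
print(correct)
print(sorted(shared))
```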

Turning tokens to vectors

On top of this tokenization, we need a way to weight these n-grams. If we were to simply count n-grams, each one would be treated as equally important. This is not a good approach, since some n-grams, like “ing”, are very common in English and therefore uninformative, while others, like “g|y” (the boundary between “building” and “yes”), are rather rare and therefore very informative. To tackle this issue, we can use TF-IDF, which stands for Term Frequency - Inverse Document Frequency. The idea is to weight each n-gram by how rare it is across the whole dataset. The term frequency is the count of the n-gram in the current string, while the inverse document frequency down-weights n-grams that appear in many strings and up-weights n-grams that appear in only a few. The intuition is exactly what we described before: a rare n-gram is informative while a common one is not.

As an implementation detail, we drop any n-gram that appears in only one tag string, since it carries no clustering signal. This is just an optimization to keep a vocabulary that actually connects tags together. The output of this transformation is a sparse matrix of shape (224,123; 396,969): each row is a tag string and each column is a surviving n-gram. The cell at position (i, j) holds the TF-IDF weight of n-gram j in tag string i. The density of this matrix is 0.014%, which means that each string only activates a very small handful of the 396,969 possible n-grams. At this stage, we have effectively turned each tag into a vector of dimension 396,969. In this space, two tags sharing many n-grams end up close to each other, while tags sharing none are orthogonal.
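One possible way to build such a matrix is scikit-learn's `TfidfVectorizer`, which supports character n-grams and a minimum document frequency. Whether the original pipeline uses exactly these settings is an assumption; the four toy tags below only illustrate the mechanics, including a tag that ends up orthogonal to the others.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tags = [
    "landuse|farmland",
    "landuse|farmyard",
    "lanuse|farmland",  # typo on purpose
    "building|yes",
]

# Character 3- to 5-grams; min_df=2 drops n-grams seen in a single tag string.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), min_df=2)
matrix = vectorizer.fit_transform(tags)  # sparse matrix, one row per tag

n_rows, n_cols = matrix.shape
density = matrix.nnz / (n_rows * n_cols)

# Rows are L2-normalized by default, so the dot product is the cosine similarity.
similarity = (matrix @ matrix.T).toarray()

print(n_rows, n_cols, density)
print(similarity.round(2))
```

In this toy corpus “building|yes” shares no n-gram with any other tag, so its row is empty after the `min_df` pruning and its similarity to every other tag is zero.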

Dimensionality reduction

The clustering algorithm we are going to use later is HDBSCAN, which relies on pairwise distances between points. Computing a full pairwise distance matrix costs O(n^2 · d), where n is the number of points and d the number of dimensions. In our case d = 396,969 and n = 224,123, which is intractable. However, because the TF-IDF matrix is very sparse, most of its structure can be compressed into a much smaller dense matrix. To do so, we use truncated Singular Value Decomposition (SVD), a dimensionality reduction technique that approximates a matrix by keeping only its top k singular values and vectors. The top k components we keep, out of the 224,123 possible ones, give us the directions that carry most of the variance in the data. This way, we discard noise while keeping the geometric relationships needed for clustering.
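To make the truncation concrete, here is a small dense sketch with NumPy on random toy data. At the scale of this post a sparse solver would be needed instead (scikit-learn's `TruncatedSVD` is one option); the dense `np.linalg.svd` call below is only meant to show what keeping the top k components does.

```python
import numpy as np


def truncated_svd(matrix, k):
    """Project the rows of matrix onto its top-k singular directions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Singular values come back sorted in decreasing order, so the first k
    # columns of u (scaled by s) are the coordinates in the top-k subspace.
    reduced = u[:, :k] * s[:k]
    return reduced, s


rng = np.random.default_rng(0)
toy = rng.random((20, 8))  # 20 toy "tags" described by 8 toy "n-grams"
reduced, singular_values = truncated_svd(toy, k=3)

print(reduced.shape)
print(singular_values.round(2))
```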

The choice of k is a tradeoff. If we choose it too low, we collapse together things which are supposed to be separate: for example, landuse=farmland and highway=residential could end up in the same cluster because we threw away the directions that distinguished them. If we choose it too high, we are back to the cost problem. A common range is between 30 and 50 components. To be safe, let’s stick to the upper end, 50.

Clustering our tags

Now that we have 224,123 points in a 50-dimensional dense space, we are going to cluster them using HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). It groups together points that lie in dense regions while pushing isolated points into a noise bucket. This is what we want: large, dense clusters of popular landuse keys. The two main parameters are min_cluster_size and min_samples. The former, which we set to 5, means that a tag needs at least 5 near duplicates in the dataset to form a category (a cluster). The latter, which we set to 2, controls how conservative the clustering is: a higher value pushes more ambiguous points into the noise bucket, while a lower value is more permissive. We use Euclidean distances to reflect the n-gram overlap between tags.

On our 224,123 by 50 matrix, HDBSCAN takes roughly 3.5 minutes and produces 9,037 clusters along with 79,053 noise points, corresponding to 35.3% of the corpus. This substantial amount of noise is expected, since OSM tags do not follow a coherent distribution but are rather a mix of names, postcodes, street names, typos and so on. Moreover, not all tags have 5 near duplicates in the 50-dimensional space, so they end up as noise.

If we inspect the 5 largest clusters, we end up with the following:

Therefore, we can see that our clustering is quite fine-grained, since it can distinguish addresses from different countries.

From 9,037 clusters to 413 OSM key families

A list of 9,037 clusters is hard to read. One trick is to summarize each cluster by a single representative tag (called a medoid). We can then group medoids together.

For every cluster, we take the centroid (the mean of all member vectors) and select the actual member closest to it: the medoid. From each medoid string (e.g. addr:street|hauptstraße), we split on the first colon and keep the part before it (e.g. addr). For each base key, we sum the cluster count and the total count_all across all member clusters.
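The medoid selection and the grouping by base key can be sketched as follows. The function names are hypothetical, and dropping the value part after the pipe before splitting on the colon is an assumption made so that a medoid such as landuse|farmland maps to the base key landuse.

```python
import numpy as np


def medoid_index(vectors):
    """Index of the member closest to the centroid (the post's medoid)."""
    centroid = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    return int(np.argmin(distances))


def base_key(tag):
    """Keep the key part before '|', then the part before the first colon."""
    key = tag.split("|", 1)[0]
    return key.split(":", 1)[0]


def summarize_families(clusters):
    """Aggregate (medoid_tag, total_count_all) pairs per base key."""
    families = {}
    for medoid_tag, total in clusters:
        family = families.setdefault(
            base_key(medoid_tag), {"cluster_count": 0, "total_count_all": 0}
        )
        family["cluster_count"] += 1
        family["total_count_all"] += total
    return families


members = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
print(medoid_index(members))
print(base_key("addr:street|hauptstraße"))
print(summarize_families([("landuse|orchard", 18_776_722), ("landuse|farmland", 11_549_116)]))
```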

Top 20 OSM key families

| base_key | cluster_count | total_count_all | representative_medoids |
| --- | --- | --- | --- |
| addr | 4,790 | 465,963,996 | addr:country |
| source | 668 | 231,299,922 | source |
| building | 62 | 74,058,131 | building:levels |
| surface | 2 | 70,692,821 | surface |
| area | 1 | 69,449,530 | area:highway |
| removed | 4 | 65,732,436 | removed:highway |
| tiger | 327 | 43,393,943 | tiger:mtfcc |
| xmas | 1 | 36,046,635 | xmas:feature |
| razed | 4 | 31,291,789 | razed:highway |
| landuse | 2 | 30,325,838 | landuse |
| lanes | 4 | 22,576,108 | lanes |
| oneway | 2 | 22,445,510 | oneway:foot |
| boat | 2 | 22,276,483 | boat |
| driveway | 1 | 21,984,902 | driveway |
| height | 133 | 20,268,617 | height |
| generator | 8 | 18,510,670 | generator:source |
| start_date | 29 | 17,404,344 | start_date |
| maxspeed | 32 | 16,086,937 | maxspeed |
| barrier | 2 | 14,614,296 | barrier |
| roof | 27 | 14,258,235 | roof:shape |

A quick analysis shows that “addr” is the elephant in the room, with 4,790 clusters and 466M occurrences. Likewise, “building” has a huge number of occurrences even though its cluster count is small. The environmental and agricultural signal is small but real: landuse (2 clusters, 30M) and natural (not in the top 20, but 3 clusters and about 6.7M). Moreover, with 413 key families, the set to verify by hand is now manageable. We could also think of asking an LLM to keep only the labels relevant to these topics based on this list.

Selecting OSM key families relevant to environment and agriculture

Using Minimax M3, we filtered the 413 OSM key families to only keep the ones relevant to the environment and agriculture. Minimax chose 26 keys as a subset of the 413 which it deemed relevant. The following table summarizes the relevant keys selected:

| base_key | cluster_id | medoid | cluster_size | total_count_all |
| --- | --- | --- | --- | --- |
| landuse | 7819 | landuse\|orchard | 29 | 18,776,722 |
| landuse | 7797 | landuse\|farmland | 5 | 11,549,116 |
| water | 8154 | water\|ditch | 23 | 11,446,716 |
| waterway | 8261 | waterway\|ditch | 18 | 7,062,136 |
| generator | 2372 | generator:source\|diesel | 15 | 6,337,545 |
| leisure | 7934 | leisure\|dog_park | 17 | 6,327,674 |
| generator | 6260 | generator:method\|battery-storage | 7 | 6,190,190 |
| generator | 4174 | generator:output:electricity\|2.3 mw | 73 | 5,750,401 |
| natural | 8342 | natural\|tundra | 20 | 4,163,927 |
| natural | 8101 | natural\|landslide | 5 | 2,442,862 |
| crop | 8250 | crop\|cana-de-açúcar | 14 | 1,934,912 |
| wetland | 7942 | wetland\|dambo | 28 | 1,600,723 |
| genus | 8114 | genus\|celtis | 65 | 693,939 |
| species | 8021 | species:es\|falso pimiento | 138 | 518,947 |
| species | 221 | species:wikidata\|q163760 | 112 | 441,244 |
| survey_point | 4987 | survey_point\|suppl | 18 | 418,591 |
| embankment | 7043 | embankment\|left | 11 | 357,670 |
| diameter_crown | 5882 | diameter_crown\|9 | 10 | 305,345 |
| species | 7953 | species\|platanus × acerifolia | 8 | 204,191 |
| diameter_crown | 5878 | diameter_crown\|14 | 9 | 138,082 |
| crop | 7147 | crop\|native_pasture | 6 | 138,059 |
| taxon | 7880 | taxon\|sapindaceae | 19 | 129,205 |
| plant | 5850 | plant:source\|oil | 10 | 128,360 |
| trees | 7357 | trees\|pitaya_plants | 12 | 128,305 |
| species | 6850 | species:en\|pin oak | 39 | 124,857 |
| genus | 7546 | genus:en\|lime | 20 | 115,365 |
| diameter_crown | 5881 | diameter_crown\|2m | 5 | 113,180 |
| monitoring | 6394 | monitoring:water_ph\|yes | 12 | 107,842 |
| species | 7954 | species\|platanus ×hispanica | 11 | 100,887 |
| landform | 8100 | landform\|dune_system | 14 | 100,315 |
| species | 7965 | species\|prunus cerasus | 29 | 100,030 |
| generator | 6323 | generator:solar:modules\|14 | 9 | 99,247 |
| species | 6328 | species\|populus canadensis | 23 | 95,434 |
| boundary | 7483 | boundary\|legal | 17 | 94,892 |
| generator | 6322 | generator:solar:modules\|3 | 9 | 90,726 |
| genus | 7674 | genus:de\|hainbuche | 7 | 90,474 |
| species | 8014 | species:de\|götterbaum | 27 | 87,727 |
| protect_class | 8136 | protect_class\|3 | 9 | 86,979 |
| species | 7981 | species:de\|hainbuche | 19 | 84,261 |
| landcover | 8099 | landcover\|dry_swamp | 6 | 82,892 |
| natural | 7703 | natural\|valley | 8 | 81,827 |
| genus | 8019 | genus\|casuarina | 6 | 78,043 |
| genus | 7950 | genus\|malus | 6 | 73,165 |
| genus | 8105 | genus:de\|apfel | 9 | 69,296 |
| species | 5819 | species:wikipedia\|pl:klon polny | 38 | 68,613 |
| species | 6950 | species:it\|pioppo bianco | 28 | 67,997 |
| genus | 226 | genus:wikidata\|q127849 | 13 | 56,537 |
| trees | 7771 | trees\|almond_trees | 15 | 51,415 |
| species | 7932 | species:nl\|inlandse eik | 16 | 50,113 |
| species | 8017 | species\|eucalyptus melliodora | 6 | 48,963 |
| protection_title | 6106 | protection_title\|environmental use | 24 | 47,061 |
| species | 7999 | species\|prunus domestica | 9 | 40,636 |
| species | 7722 | species\|fraxinus americana | 11 | 39,711 |
| genus | 7271 | genus:it\|olivo | 11 | 39,249 |
| species | 7955 | species\|acer negundo | 5 | 34,234 |
| species | 8015 | species\|melaleuca nesophila | 11 | 34,034 |
| taxon | 8135 | taxon\|pinus nigra | 21 | 30,507 |
| water_source | 6052 | water_source\|tube_well | 5 | 28,555 |
| species | 8020 | species\|quercus phellos | 8 | 26,295 |
| survey_point | 4986 | survey_point:purpose\|vertical | 7 | 26,229 |
| iucn_level | 6926 | iucn_level\|ii | 7 | 25,663 |
| survey_point | 4976 | survey_point:structure\|pillar | 6 | 25,317 |
| genus | 8113 | genus\|corylus | 5 | 23,792 |
| species | 8016 | species\|betula utilis | 8 | 23,466 |
| species | 7833 | species\|prunus serrulata | 5 | 21,937 |
| species | 8012 | species:de\|silber-linde | 8 | 21,655 |
| species | 7956 | species:pl\|klon zwyczajny | 15 | 21,495 |
| species | 7740 | species\|pinus sylvestris | 5 | 20,562 |
| species | 7280 | species\|pyrus calleryana chanticleer | 6 | 20,044 |
| taxon | 7667 | taxon:en\|honeylocust | 11 | 19,579 |
| wood | 6434 | wood\|deciduous | 6 | 18,843 |
| generator | 6290 | generator:type\|wind_turbine | 5 | 17,409 |
| generator | 4117 | generator:orientation\|sw | 11 | 16,948 |
| genus | 8106 | genus:ru\|берёза | 11 | 12,949 |
| species | 7351 | species\|adansonia grandidieri | 6 | 10,515 |
| diameter_crown | 5883 | diameter_crown\|5.00 | 6 | 10,161 |
| tree | 4326 | tree:ref\|1008 | 13 | 9,502 |
| taxon | 8357 | taxon:cultivar\|plena | 8 | 9,173 |
| tree | 4325 | tree:ref\|107 | 12 | 9,097 |
| generator | 6321 | generator:solar:modules\|22 | 6 | 8,204 |
| monitoring | 5769 | monitoring:water_quality\|yes | 5 | 7,377 |
| tree | 4323 | tree:ref\|5 | 9 | 7,310 |
| tree | 4322 | tree:ref\|2006 | 9 | 6,592 |
| taxon | 8086 | taxon\|prunus cerasifera ‘pissardii’ | 7 | 6,259 |
| species | 6849 | species:en\|maple silver | 5 | 5,292 |
| diameter_crown | 5884 | diameter_crown\|5.5 | 5 | 5,202 |
| tree | 4321 | tree:ref\|203 | 6 | 3,911 |
| species | 7897 | species:ro\|paltin de câmp | 5 | 3,839 |
| species | 8013 | species:ru\|берёза повислая | 5 | 3,821 |
| tree | 4324 | tree:ref\|12 | 6 | 3,462 |

To put this in perspective, the 26 selected base keys span 90 clusters and account for roughly 90M total occurrences, with landuse leading the way at about 30M occurrences.

Ablations and following questions

Some decisions made in the pipeline above are worth questioning. In this section, we are going to tackle these.

When should we perform standardization?

Consider the case where the tag landuse has 286 occurrences and Landuse has 450. With the pipeline defined above, both tags would get discarded, since neither satisfies the condition count_all >= 500. We could therefore think of standardizing these tags first, essentially unifying them into a single landuse tag, whose number of occurrences would be 450 + 286 = 736. The new tag would then pass the condition count_all >= 500 and, as a result, not be discarded. Since this ordering rescues some tags which were not standardized, we can expect this new approach to produce more rows in the thresholded output.

In fact, standardizing first and then filtering for count_all >= 500 yields 225,684 tags and 3,368,341,528 occurrences. In other words, by standardizing first, we rescued 1,561 tags and 18,325,535 occurrences compared to the filter-first, standardize-later approach. Since the later steps of the pipeline are designed to narrow these tags down to those of interest for environment and agriculture, it might be worth standardizing first in order to rescue more tags and maybe recover more tags relevant to our topics of interest. We cannot cleanly measure the effect of this choice in our current setting, since the clustering is not fully deterministic: a second run would produce different clusters, which makes the comparison unclear.
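The difference between the two orderings can be reproduced on the 286 versus 450 example. This is a minimal sketch with hypothetical function names, where standardization is reduced to stripping and lowercasing, and the “building|yes” count is a toy value.

```python
from collections import Counter


def filter_then_standardize(raw_counts, min_count=500):
    """Drop rare raw tags first, then merge the surviving spellings."""
    merged = Counter()
    for tag, count in raw_counts.items():
        if count >= min_count:
            merged[tag.strip().lower()] += count
    return dict(merged)


def standardize_then_filter(raw_counts, min_count=500):
    """Merge spellings first, then drop tags that are still rare."""
    merged = Counter()
    for tag, count in raw_counts.items():
        merged[tag.strip().lower()] += count
    return {tag: count for tag, count in merged.items() if count >= min_count}


raw = {"landuse|farmland": 286, "Landuse|farmland": 450, "building|yes": 900}
print(filter_then_standardize(raw))
print(standardize_then_filter(raw))
```

Only the second ordering keeps the merged landuse tag, with 286 + 450 = 736 occurrences.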

What if we use an embedding model instead of TF-IDF?

The clustering we obtained before was mainly based on lexical similarity, since two tags would end up in the same cluster if they shared some n-grams. For example, “landuse|farmland” and “landuse|farmyard” are very likely to end up in the same cluster, while “landuse|meadow” and “landuse|grassland” are not, even though they are semantically close. Using an embedding model could help us tackle this problem.

Some prior work like GeoVectors used fastText as a word-level embedder for OSM tags. However, this is a rather heavy option. Modern smaller alternatives exist, like BGE or Nomic, but they expect sentence input instead of short strings. A practical middle ground is Model2Vec’s potion-base-8M, which is a 32M static vector table distilled from BGE-base-en-v1.5. On the MTEB benchmark it is reported as outperforming fastText.

As a sanity check, we are going to use this model to embed a handful of tags and inspect whether the cosine similarities reflect the semantic structure we expect. For example, we expect “landuse|meadow” and “landuse|farmland” to be rather close to each other, while “landuse|residential” should be further away.

We embedded seven env/agri tags with potion-base-8M and inspected the cosine similarities. The agricultural landuse values (farmland, meadow, grassland) landed at ~0.78 average similarity to each other, clearly above the urban landuse=residential (~0.66) and far from unrelated natural=water and natural=tree (~0.19). The full similarity matrix:

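| | farmland | meadow | grassland | forest | residential | natural/water | natural/tree |
| --- | --- | --- | --- | --- | --- | --- | --- |
| farmland | 1.00 | 0.75 | 0.81 | 0.73 | 0.72 | 0.20 | 0.16 |
| meadow | 0.75 | 1.00 | 0.80 | 0.69 | 0.62 | 0.17 | 0.18 |
| grassland | 0.81 | 0.80 | 1.00 | 0.72 | 0.63 | 0.21 | 0.20 |
| forest | 0.73 | 0.69 | 0.72 | 1.00 | 0.69 | 0.24 | 0.41 |
| residential | 0.72 | 0.62 | 0.63 | 0.69 | 1.00 | 0.30 | 0.21 |
| natural/water | 0.20 | 0.17 | 0.21 | 0.24 | 0.30 | 1.00 | 0.62 |
| natural/tree | 0.16 | 0.18 | 0.20 | 0.41 | 0.21 | 0.62 | 1.00 |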
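A similarity matrix of this kind can be computed from any set of embedding vectors. In the sketch below, three made-up 2-dimensional vectors stand in for the model's output (the real vectors would come from the embedding model), so the printed numbers are purely illustrative.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between the rows of vectors."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Toy vectors, NOT real embeddings: two similar directions and one orthogonal.
toy_vectors = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
]
similarities = cosine_similarity_matrix(toy_vectors)
print(similarities.round(2))
```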

Using the semantic embeddings, we can then rerun our pipeline on the 224,123 rows through the same SVD-to-50d and HDBSCAN stages. The character n-gram TF-IDF stage is simply replaced by potion-base-8M embeddings.

| metric | TF-IDF | Embeddings | delta |
| --- | --- | --- | --- |
| number of base key families | 413 | 435 | +22 |
| total clusters | 8,910 | 5,259 | -3,651 (-41%) |
| total occurrences captured (top 20 base keys) | 1.68 B | 2.48 B | +799 M (+47%) |
| noise ratio | 35.3% | 49.3% | +14 pp |

The embedding pipeline produces fewer, larger clusters and pulls significantly more occurrences.

Effect on env/agri base keys

| base_key | tfidf_clusters | embedding_clusters | tfidf_occurrences | embedding_occurrences | occurrences_delta |
| --- | --- | --- | --- | --- | --- |
| natural | 3 | 1 | 6,688,616 | 56,559,801 | +49,871,185 (+746%) |
| waterway | 1 | 1 | 7,062,136 | 32,312,984 | +25,250,848 (+358%) |
| landuse | 2 | 1 | 30,325,838 | 39,334,360 | +9,008,522 (+30%) |
| wetland | 1 | 1 | 1,600,723 | 7,097,879 | +5,497,156 (+343%) |
| boundary | 1 | 1 | 94,892 | 2,196,788 | +2,101,896 (+2215%) |
| taxon | 5 | 13 | 194,723 | 875,653 | +680,930 (+350%) |
| genus | 10 | 14 | 1,252,809 | 1,183,437 | -69,372 (-6%) |
| species | 28 | 22 | 2,320,800 | 1,370,135 | -950,665 (-41%) |
| generator | 8 | 5 | 18,510,670 | 7,260,035 | -11,250,635 (-61%) |
| water | 1 | 1 | 11,446,716 | 16,768 | -11,429,948 (-100%) |

natural jumps from 3 spelling-driven clusters to a single semantic cluster that captures 8x the volume: natural=water, natural=wetland, natural=wood, natural=tree and natural=scrub all land together because they describe environmental features. waterway and wetland follow the same pattern. landuse loses a cluster (the orchard-vs-farmland split collapses) and gains 30% more volume. The losses are also informative. water drops to 16k occurrences because the embeddings separate water=* (where water is the key) from waterway=* and natural=water (which are about water too, but live under different keys).

Therefore, semantic clustering seems to be a better pick, since it rescues more occurrences and seems to capture semantically related concepts better.

The method retained

Given the previous analysis, we are going to standardize first and then filter. Moreover, because both pipelines presented (TF-IDF and the embedding models) yielded different results that may be complementary, we are going to keep both approaches.

Once the final set of base keys has been computed for both pipelines, we will manually assess each base key for relevance to environmental and agricultural topics.

The two preprocessing paths differ as follows:

| Metric | Filter-first | Standardize-first |
| --- | --- | --- |
| Tags | 224,123 | 225,684 (+1,561) |
| Occurrences | 3,350,015,993 | 3,368,341,528 (+18,325,535) |

Two parallel clustering pipelines are then run on these sets, and both are retained because they capture complementary information:

| Pipeline | Real clusters | Noise points | Noise volume | Distinct base keys |
| --- | --- | --- | --- | --- |
| TF-IDF (character n-grams) | 8,832 | 78,270 (34.7%) | 1,122,085,693 (33.3%) | 427 |
| Embeddings (potion-base-8M) | 4,954 | 106,498 (47.2%) | 803,203,928 (23.8%) | 433 |

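The base-key families overlap on 307 keys, which leaves 120 keys found only by the TF-IDF pipeline (427 - 307) and 126 found only by the embedding pipeline (433 - 307), for a total union of 553 distinct base keys.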

Discussion

The dataset we used did not tell us when, where or on what type of object each tag is used, but only that “this tag exists N times across the planet”. A future study could build upon this small analysis to figure out how these tags are distributed geographically, temporally, etc.

