OSM data analysis for environment and agriculture

Introduction

In this blog post, we analyze the keys and tags from OpenStreetMap (OSM), with a particular interest in environmental and agricultural topics.

To follow along, here is the codebase associated with this blog post: https://github.com/NoeFlandre/osm-stats

Overview

The first step is to download the statistics dataset covering all keys and tags in OSM, which is available at the following link: https://taginfo.openstreetmap.org/download/taginfo-db.db.bz2

Once extracted, the archive yields a single SQLite database file of roughly 13GB.

First global stats

To get started, we need to understand what a key is and what a tag is. A key is typically a single word that describes a category of OSM polygon. For instance, “building” is a key. A tag is simply a (key, value) pair. An example of a tag would be (building, house). So the polygon of a house on OSM will carry the key “building” and most likely the tag “building=house”, which corresponds to (key, value) = (building, house).

Now let’s have a look at some descriptive stats from our database:

The total number of distinct keys is 110,706.
Likewise, the total number of tags is 192,821,586.
The total number of occurrences is 3,892,388,715.
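These three figures can be recomputed with a few SQL queries. The sketch below assumes the extracted database exposes a `tags` table with `key`, `value` and `count_all` columns (these names are assumptions to check against the actual schema) and that "occurrences" means the sum of `count_all` over all tags. A tiny in-memory database with made-up rows stands in for the 13GB file.

```python
import sqlite3

def global_stats(conn):
    """Return (distinct keys, distinct tags, total occurrences)."""
    # Assumed schema: tags(key, value, count_all), one row per (key, value) pair.
    n_keys = conn.execute("SELECT COUNT(DISTINCT key) FROM tags").fetchone()[0]
    n_tags = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
    n_occ = conn.execute("SELECT COALESCE(SUM(count_all), 0) FROM tags").fetchone()[0]
    return n_keys, n_tags, n_occ

# Toy stand-in; for the real file one would use sqlite3.connect("taginfo-db.db").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (key TEXT, value TEXT, count_all INTEGER)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [("building", "yes", 10), ("building", "house", 4), ("landuse", "farmland", 3)],
)
stats = global_stats(conn)
print(stats)
```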

We can already conclude that this is a fairly large database.

Should we just remove some obvious keys?

What we would like is to cut through the noise and keep only the keys and tags we are interested in (anything relevant to topics like agriculture, the environment and so on). One might be tempted to drop the “building” key entirely, but before doing that, let’s have a look at the 50 most frequent values for the “building” key:

value	count_all
yes	556,451,624
house	66,181,816
residential	16,123,805
detached	10,056,176
garage	8,122,742
apartments	7,784,487
shed	4,495,922
industrial	2,608,502
roof	2,450,761
hut	2,447,641
farm_auxiliary	2,366,983
semidetached_house	1,974,860
terrace	1,478,303
commercial	1,455,301
school	1,345,309
retail	1,274,921
construction	1,132,478
outbuilding	1,084,337
garages	1,054,157
greenhouse	821,293
barn	805,630
cabin	657,564
static_caravan	557,489
service	548,879
warehouse	473,553
bungalow	440,894
church	433,987
farm	414,913
allotment_house	349,247
carport	338,130
office	304,623
ruins	290,506
public	213,143
civic	212,355
university	176,658
hospital	170,684
hotel	160,423
kindergarten	127,484
chapel	119,655
boathouse	118,131
ger	107,876
mosque	107,555
storage_tank	105,426
manufacture	99,828
hangar	95,179
bunker	74,512
dormitory	73,721
silo	65,809
train_station	60,106
college	55,587
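A listing like the one above can be produced with a single query. This is a minimal sketch, again assuming a `tags` table with `key`, `value` and `count_all` columns (assumed names, not confirmed by the post); the helper `top_values` and the toy rows are hypothetical.

```python
import sqlite3

def top_values(conn, key, limit=50):
    """Return the most frequent (value, count_all) pairs for one key."""
    # Assumed schema: tags(key, value, count_all).
    query = (
        "SELECT value, count_all FROM tags "
        "WHERE key = ? ORDER BY count_all DESC LIMIT ?"
    )
    return conn.execute(query, (key, limit)).fetchall()

# Toy stand-in for the real database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (key TEXT, value TEXT, count_all INTEGER)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [("building", "house", 4), ("building", "yes", 10), ("landuse", "farmland", 3)],
)
rows = top_values(conn, "building", limit=50)
print(rows)
```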

As we can see, the tags include many elements we are not interested in (e.g. bunker, college and so on). However, some tags could be of interest from an environmental or agricultural perspective (e.g. greenhouse, farm). So simply discarding every occurrence of the “building” key is not the right solution. On top of that, we saw earlier that the entire set includes 110,706 keys, and going through each of them would take a while… We therefore can’t afford to do a fine-grained filtering manually.

Filtering at scale

One of the critical issues with OSM data is that tags are not standardized: each contributor is fairly free to annotate a polygon using free-form text. Moreover, we can assume that some tags contain typos. It is therefore likely that, among the 192,821,586 tags, there is a very long tail of rarely used tags that come from one user’s specific notation.

Removing tags with low occurrences

To clean up this database and keep only prominent tags, we keep only the tags with count_all >= 500. This simple filter brings the number of unique (key, value) pairs down from 192.8M to 224,123. The total number of occurrences goes from 3,892,388,715 to 3,350,015,993, so the long-tail filter dropped roughly 14% of occurrences while shrinking the number of tags by a factor of roughly 860. In simpler terms, we significantly reduced the complexity of our dataset while preserving a satisfactory number of samples.
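The filter itself and the two ratios quoted above can be checked in a few lines. This is only an illustrative sketch: the helper name `long_tail_filter` and the misspelled toy row are hypothetical, while the four totals are the figures reported in this post.

```python
def long_tail_filter(rows, min_count=500):
    """Keep (key, value, count_all) rows whose count_all reaches min_count."""
    return [row for row in rows if row[2] >= min_count]

# Toy rows: one frequent tag and one hypothetical rare misspelling.
rows = [("building", "yes", 556_451_624), ("bulding", "yes", 12)]
kept = long_tail_filter(rows)

# Figures reported in the post.
tags_before, tags_after = 192_821_586, 224_123
occ_before, occ_after = 3_892_388_715, 3_350_015_993

compression = tags_before / tags_after        # about 860x fewer tags
dropped_share = 1 - occ_after / occ_before    # about 14% of occurrences dropped
print(kept, round(compression), round(100 * dropped_share, 1))
```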

Standardizing tags

Since OSM tags can be messy (for example, both “Landuse” and “landuse” may appear), we want to turn them into clean pairs. To do so, we convert each string to lowercase, strip any surrounding whitespace, map missing values to “none” and finally join keys and values with the pipe “|”, which never appears in our OSM values. Every tag now looks clean and resembles something like this: “landuse|farmland”.
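The normalization steps described above fit in one small function. This is a minimal sketch; the function name `normalize_tag` is hypothetical, and treating empty strings like missing values is an assumption that the post does not spell out.

```python
def normalize_tag(key, value):
    """Lowercase, strip, map missing parts to 'none' and join with a pipe."""
    def clean(part):
        if part is None:
            return "none"
        part = str(part).strip().lower()
        # Assumption: an empty string is handled like a missing value.
        return part if part else "none"
    return f"{clean(key)}|{clean(value)}"

print(normalize_tag(" Landuse ", "Farmland"))  # landuse|farmland
print(normalize_tag("building", None))         # building|none
```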

Tokenizing our strings

Now that we have clean tags, we can try to cluster them in order to gather tags that belong to a similar topic. To do so, we need to turn our strings into vectors. We have two options: tokenize at the word level or at the character level. Word-level tokenization requires exact matches, which is too strict a condition. Suppose a user made a typo and wrote “lanuse|farmland” instead of “landuse|farmland”. Word-level tokenization would treat “landuse” and “lanuse” as two different words even though they mean the same thing. That’s why tokenizing at the character level is a better pick here. A good solution is to use n-grams (i.e. a sliding window of N characters through the word). For instance, the 3-grams of the word “landuse” are “lan”, “and”, “ndu”, “dus” and “use”. This way, even with a typo, two words still share a high similarity. For example, “landuse” and “lanuse” share the 3-grams “lan” and “use”. A good tradeoff is to use a range of n-grams from 3 to 5 characters (2-grams being too noisy since they are shared across too many words, and 6-grams and above being too specific).
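The sliding window is easy to reproduce. The sketch below uses two hypothetical helpers, `char_ngrams` and `shared_ngrams`, to recover the “landuse” / “lanuse” example from the paragraph above.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """All character n-grams of text for n in [n_min, n_max], in order."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]

def shared_ngrams(a, b, n=3):
    """Set of n-grams of a single size n that two strings have in common."""
    return set(char_ngrams(a, n, n)) & set(char_ngrams(b, n, n))

print(char_ngrams("landuse", 3, 3))        # ['lan', 'and', 'ndu', 'dus', 'use']
print(shared_ngrams("landuse", "lanuse"))  # the two 3-grams 'lan' and 'use'
```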

Turning tokens to vectors

On top of this tokenization, we need a way to weight these n-grams. Simply counting n-grams would treat each of them as equally important. This is not a good approach, since some n-grams, like “ing”, are very common in English and therefore uninformative, while others, like “g|y” (the boundary between “building” and “yes”), are rare and therefore very informative. To tackle this issue, we can use TF-IDF, which stands for Term Frequency - Inverse Document Frequency. The idea is to weight each n-gram by how rare it is across the whole dataset. The term frequency is the count of the n-gram in the current string, while the inverse document frequency down-weights n-grams that appear in many strings and up-weights n-grams that appear in only a few. The intuition is exactly what we described before: a rare n-gram is informative while a common one is not.

As an implementation detail, we drop any n-gram that appears in only one tag string, since it carries no clustering signal. This is just an optimization to keep a vocabulary that actually connects tags together. The output of this transformation is a sparse matrix of shape (224,123; 396,969): each row is a tag string and each column is a surviving n-gram. The cell at position (i, j) carries the TF-IDF weight of n-gram j in tag string i. The matrix has a density of 0.014%, which means that each string only activates a very small handful of the 396,969 possible n-grams. At this stage, we have effectively turned each tag into a vector of dimension 396,969. In this space, two tags sharing many n-grams end up close to each other, while those sharing none are orthogonal.
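To make the weighting concrete, here is a standard-library sketch of character n-gram TF-IDF with the "appears in at least two strings" rule. It is not the pipeline's actual code: the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 and the L2 row normalization are assumptions (they mirror common library defaults), and a real run on 224,123 strings would rely on a sparse-matrix library rather than Python dictionaries.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=3, n_max=5):
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def tfidf_rows(docs, min_df=2):
    """Return (rows, vocab): one sparse {ngram: weight} dict per document."""
    counts = [Counter(char_ngrams(d)) for d in docs]
    df = Counter(g for c in counts for g in c)          # document frequency
    vocab = {g for g, k in df.items() if k >= min_df}   # drop single-string n-grams
    n_docs = len(docs)
    rows = []
    for c in counts:
        # tf * idf, with the assumed smoothed idf.
        row = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
               for g, tf in c.items() if g in vocab}
        norm = math.sqrt(sum(w * w for w in row.values()))
        rows.append({g: w / norm for g, w in row.items()} if norm else row)
    return rows, sorted(vocab)

docs = ["landuse|farmland", "landuse|farmyard", "building|yes"]
rows, vocab = tfidf_rows(docs)
overlap = sum(w * rows[1].get(g, 0.0) for g, w in rows[0].items())
print(len(vocab), round(overlap, 3), rows[2])
```

In this toy corpus, "building|yes" shares no n-gram with the other two strings, so its row is empty: it is orthogonal to both, while the two landuse strings overlap strongly.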

Dimensionality reduction

The clustering algorithm we are going to use later is HDBSCAN, which relies on pairwise distances between points. For a full pairwise distance matrix, the cost is O(n^2 * d), where n is the number of points and d the number of dimensions. In our case, n = 224,123 and d = 396,969, which is intractable. However, since our matrix is very sparse, it lends itself well to being compressed into a much smaller dense matrix. To do so, we use Truncated Singular Value Decomposition (SVD), a dimensionality reduction technique that approximates a matrix by keeping only its top k singular values and vectors. The top k components we keep, instead of the full 224,123, give us the directions of highest variance in the data. This way, we discard the noise while keeping the geometric relationships needed for clustering.

The choice of k here is a tradeoff. If we choose it too low, we collapse together things that are supposed to be separate: for example, landuse=farmland and highway=residential could end up in the same cluster because we threw away the n-grams that distinguished them. If we choose it too high, we are back at the cost problem. A common range is 30 to 50 components. To be safe, let’s stick to the upper end: 50.
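The projection can be illustrated with NumPy on a small dense matrix. This is a sketch of the technique only: it computes a full SVD and keeps the first k components, which is fine for a toy example but not for the real sparse matrix, where a solver that computes only the top k components (scikit-learn's TruncatedSVD is one option) would be needed. The function name and the toy sizes are hypothetical.

```python
import numpy as np

def truncated_svd(X, k):
    """Project the rows of X onto its top-k singular directions."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]  # shape (n_rows, k)

rng = np.random.default_rng(0)
X = rng.random((20, 100))         # toy stand-in for the (224,123 x 396,969) matrix
Z = truncated_svd(X, k=5)         # toy stand-in for k = 50
Z_full = truncated_svd(X, k=20)   # keeping every component loses nothing

print(Z.shape)
# With all components kept, inner products between rows are preserved exactly.
print(np.allclose(Z_full @ Z_full.T, X @ X.T))
```

Lowering k trades fidelity of these inner products (and thus of distances) for a cheaper clustering step, which is the tradeoff discussed above.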

Clustering our tags

Now that we have 224,123 points in a 50-dimensional dense space, we are going to cluster them. We use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). It groups points that are close and dense together while pushing isolated points into a noise bucket. This is what we want: large, dense clusters of popular landuse keys. The two main parameters are min_cluster_size and min_samples. The former, which we set to 5, means that a tag needs at least 5 near duplicates in the dataset to form a category (a cluster). The latter, which we set to 2, controls how conservative the clustering is. A higher value pushes more ambiguous points into the noise bucket, while a lower value is more permissive. We use Euclidean distances to reflect the n-gram overlap between tags.

On our 224,123 by 50 matrix, HDBSCAN takes roughly 3.5 minutes and produces 9,037 clusters along with 79,053 noise points, corresponding to 35.3% of the corpus. This substantial amount of noise is expected, since OSM tags do not follow a coherent distribution but are rather a mix of names, postcodes, street names, typos and so on. Moreover, not all tags have 5 near duplicates in the 50-D space, so they go to noise.

If we inspect the 5 largest clusters, we end up with the following:

addr:postcode|5000, addr:postcode|5020, … -> a cluster of numeric postcode values
addr:suburb|mitte, addr:suburb|lichterfelde, … -> likely a cluster of suburb names (mostly German)
addr:street|hauptstraße, addr:street|dorfstraße, addr:street|bahnhofstraße, … -> likely a cluster of German street names (sharing the straße suffix)
addr:city:simc|0918123, addr:city:simc|0969400, … -> likely a cluster of Polish city codes from the SIMC (TERYT) administrative register
addr:street|sunset drive, addr:street|lakeview drive, … -> likely a cluster of English street names sharing drive

We can therefore see that our clustering is quite fine-grained, since it can distinguish addresses from different countries.

From 9,037 clusters to 413 OSM key families

A list of 9,037 clusters is hard to read. A trick we could think of is to summarize each cluster by a single representative tag (which is named a medoid). We could then group medoids together.

For every cluster, we take the centroid (the mean of all member vectors) and select the actual member closest to it: the medoid. From each medoid string (e.g. addr:street|hauptstraße) we split on the first colon and keep the part before it (e.g. addr). For each base key, we sum the cluster count and the total count_all across all member clusters.
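The three steps above can be sketched as follows. All function names are hypothetical. One assumption is made explicit in the code: for medoids without a colon, such as landuse|farmland, the base key is taken to be the part before the pipe, which is what the key-family table suggests.

```python
import math
from collections import defaultdict

def medoid_index(vectors):
    """Index of the member closest to the centroid (mean of all members)."""
    dim = len(vectors[0])
    centroid = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    return min(range(len(vectors)), key=lambda i: math.dist(vectors[i], centroid))

def base_key(tag):
    """'addr:street|hauptstraße' -> 'addr'; 'landuse|farmland' -> 'landuse'."""
    key = tag.split("|", 1)[0]      # assumption: ignore the value part first
    return key.split(":", 1)[0]     # then keep what precedes the first colon

def key_families(clusters):
    """clusters: iterable of (medoid_tag, total_count_all) pairs."""
    families = defaultdict(lambda: [0, 0])
    for medoid, count in clusters:
        entry = families[base_key(medoid)]
        entry[0] += 1               # cluster_count
        entry[1] += count           # total_count_all
    return {k: tuple(v) for k, v in families.items()}

print(medoid_index([[0, 0], [1, 1], [10, 10]]))
print(key_families([("landuse|orchard", 18_776_722), ("landuse|farmland", 11_549_116)]))
```

The second call uses the two landuse clusters listed later in this post and recovers the landuse row of the key-family table (2 clusters, 30,325,838 occurrences).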

Top 20 OSM key families

base_key	cluster_count	total_count_all	representative_medoids
addr	4,790	465,963,996	addr:country
source	668	231,299,922	source
building	62	74,058,131	building:levels
surface	2	70,692,821	surface
area	1	69,449,530	area:highway
removed	4	65,732,436	removed:highway
tiger	327	43,393,943	tiger:mtfcc
xmas	1	36,046,635	xmas:feature
razed	4	31,291,789	razed:highway
landuse	2	30,325,838	landuse
lanes	4	22,576,108	lanes
oneway	2	22,445,510	oneway:foot
boat	2	22,276,483	boat
driveway	1	21,984,902	driveway
height	133	20,268,617	height
generator	8	18,510,670	generator:source
start_date	29	17,404,344	start_date
maxspeed	32	16,086,937	maxspeed
barrier	2	14,614,296	barrier
roof	27	14,258,235	roof:shape

A quick analysis shows that “addr” is the elephant in the room, with 4,790 clusters and 466M occurrences. Likewise, “building” has a huge number of occurrences even though its cluster count is small. The environmental and agricultural signal is small but real: landuse (2 clusters, 30M), natural (not in the top 20, with 3 clusters and about 6.7M occurrences). Moreover, the set to verify by hand is now manageable, with 413 key families. We could also ask an LLM to keep only the labels relevant to these topics based on this list.

Selecting OSM key families relevant to environment and agriculture

Using Minimax M3, we filtered the 413 OSM key families to keep only the ones relevant to the environment and agriculture. The model selected 26 of the 413 keys. The following table lists the clusters belonging to these keys:

base_key	cluster_id	medoid	cluster_size	total_count_all
landuse	7819	landuse|orchard	29	18,776,722
landuse	7797	landuse|farmland	5	11,549,116
water	8154	water|ditch	23	11,446,716
waterway	8261	waterway|ditch	18	7,062,136
generator	2372	generator:source|diesel	15	6,337,545
leisure	7934	leisure|dog_park	17	6,327,674
generator	6260	generator:method|battery-storage	7	6,190,190
generator	4174	generator:output:electricity|2.3 mw	73	5,750,401
natural	8342	natural|tundra	20	4,163,927
natural	8101	natural|landslide	5	2,442,862
crop	8250	crop|cana-de-açúcar	14	1,934,912
wetland	7942	wetland|dambo	28	1,600,723
genus	8114	genus|celtis	65	693,939
species	8021	species:es|falso pimiento	138	518,947
species	221	species:wikidata|q163760	112	441,244
survey_point	4987	survey_point|suppl	18	418,591
embankment	7043	embankment|left	11	357,670
diameter_crown	5882	diameter_crown|9	10	305,345
species	7953	species|platanus × acerifolia	8	204,191
diameter_crown	5878	diameter_crown|14	9	138,082
crop	7147	crop|native_pasture	6	138,059
taxon	7880	taxon|sapindaceae	19	129,205
plant	5850	plant:source|oil	10	128,360
trees	7357	trees|pitaya_plants	12	128,305
species	6850	species:en|pin oak	39	124,857
genus	7546	genus:en|lime	20	115,365
diameter_crown	5881	diameter_crown|2m	5	113,180
monitoring	6394	monitoring:water_ph|yes	12	107,842
species	7954	species|platanus ×hispanica	11	100,887
landform	8100	landform|dune_system	14	100,315
species	7965	species|prunus cerasus	29	100,030
generator	6323	generator:solar:modules|14	9	99,247
species	6328	species|populus canadensis	23	95,434
boundary	7483	boundary|legal	17	94,892
generator	6322	generator:solar:modules|3	9	90,726
genus	7674	genus:de|hainbuche	7	90,474
species	8014	species:de|götterbaum	27	87,727
protect_class	8136	protect_class|3	9	86,979
species	7981	species:de|hainbuche	19	84,261
landcover	8099	landcover|dry_swamp	6	82,892
natural	7703	natural|valley	8	81,827
genus	8019	genus|casuarina	6	78,043
genus	7950	genus|malus	6	73,165
genus	8105	genus:de|apfel	9	69,296
species	5819	species:wikipedia|pl:klon polny	38	68,613
species	6950	species:it|pioppo bianco	28	67,997
genus	226	genus:wikidata|q127849	13	56,537
trees	7771	trees|almond_trees	15	51,415
species	7932	species:nl|inlandse eik	16	50,113
species	8017	species|eucalyptus melliodora	6	48,963
protection_title	6106	protection_title|environmental use	24	47,061
species	7999	species|prunus domestica	9	40,636
species	7722	species|fraxinus americana	11	39,711
genus	7271	genus:it|olivo	11	39,249
species	7955	species|acer negundo	5	34,234
species	8015	species|melaleuca nesophila	11	34,034
taxon	8135	taxon|pinus nigra	21	30,507
water_source	6052	water_source|tube_well	5	28,555
species	8020	species|quercus phellos	8	26,295
survey_point	4986	survey_point:purpose|vertical	7	26,229
iucn_level	6926	iucn_level|ii	7	25,663
survey_point	4976	survey_point:structure|pillar	6	25,317
genus	8113	genus|corylus	5	23,792
species	8016	species|betula utilis	8	23,466
species	7833	species|prunus serrulata	5	21,937
species	8012	species:de|silber-linde	8	21,655
species	7956	species:pl|klon zwyczajny	15	21,495
species	7740	species|pinus sylvestris	5	20,562
species	7280	species|pyrus calleryana chanticleer	6	20,044
taxon	7667	taxon:en|honeylocust	11	19,579
wood	6434	wood|deciduous	6	18,843
generator	6290	generator:type|wind_turbine	5	17,409
generator	4117	generator:orientation|sw	11	16,948
genus	8106	genus:ru|берёза	11	12,949
species	7351	species|adansonia grandidieri	6	10,515
diameter_crown	5883	diameter_crown|5.00	6	10,161
tree	4326	tree:ref|1008	13	9,502
taxon	8357	taxon:cultivar|plena	8	9,173
tree	4325	tree:ref|107	12	9,097
generator	6321	generator:solar:modules|22	6	8,204
monitoring	5769	monitoring:water_quality|yes	5	7,377
tree	4323	tree:ref|5	9	7,310
tree	4322	tree:ref|2006	9	6,592
taxon	8086	taxon|prunus cerasifera ‘pissardii’	7	6,259
species	6849	species:en|maple silver	5	5,292
diameter_crown	5884	diameter_crown|5.5	5	5,202
tree	4321	tree:ref|203	6	3,911
species	7897	species:ro|paltin de câmp	5	3,839
species	8013	species:ru|берёза повислая	5	3,821
tree	4324	tree:ref|12	6	3,462

To put this in perspective, the 26 selected base keys span 90 clusters and account for roughly 90M total occurrences, with landuse leading the way at about 30M occurrences.

Ablations and following questions

Some decisions made in the pipeline above are worth questioning. In this section, we are going to tackle these.

When should we perform standardization?

Consider a case where the tag landuse has 286 occurrences and Landuse has 450. With the pipeline defined above, both tags would be discarded, since neither satisfies count_all >= 500. We could instead standardize first, unifying them into a single landuse tag with 450 + 286 = 736 occurrences. The merged tag would then pass the count_all >= 500 condition and not be discarded. Since this ordering rescues some tags that were not standardized, we can expect this new approach to produce more rows in the thresholded output.

In fact, standardizing first and then filtering for count_all >= 500 yields 225,684 tags and 3,368,341,528 occurrences. In other words, standardizing first rescues 1,561 tags and 18,325,535 occurrences compared to the filter-first, standardize-later approach. Since the later steps of the pipeline are designed to narrow these tags down to those of interest for environment and agriculture, standardizing first may help recover more tags relevant to our topics. We cannot cleanly measure the downstream effect of this choice in our current setting, since the clustering is not fully deterministic: a second run would produce different clusters, which makes the comparison unclear.
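The landuse / Landuse example can be replayed in a few lines to show why the order of the two steps matters. Both function names are hypothetical, and the standardization is reduced to stripping and lowercasing for brevity.

```python
from collections import Counter

def filter_then_standardize(raw, min_count=500):
    """Drop rare raw tags first, then merge the survivors by normalized form."""
    merged = Counter()
    for tag, count in raw:
        if count >= min_count:
            merged[tag.strip().lower()] += count
    return dict(merged)

def standardize_then_filter(raw, min_count=500):
    """Merge raw tags by normalized form first, then drop the rare ones."""
    merged = Counter()
    for tag, count in raw:
        merged[tag.strip().lower()] += count
    return {tag: count for tag, count in merged.items() if count >= min_count}

raw = [("landuse", 286), ("Landuse", 450)]
print(filter_then_standardize(raw))   # {} : both variants fall below 500
print(standardize_then_filter(raw))   # {'landuse': 736} : the merged tag survives
```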

What if we use an embedding model instead of TF-IDF?

The clustering we obtained before was mainly based on lexical similarity, since two tags end up in the same cluster if they share some n-grams. For example, “landuse|farmland” and “landuse|farmyard” are very likely to end up in the same cluster, while “landuse|meadow” and “landuse|grassland” are not, even though they are semantically close. Using an embedding model could help us tackle this problem.

Some prior work, like GeoVectors, used fastText as a word-level embedder for OSM tags. However, this is a rather heavy option. Smaller modern alternatives exist, like BGE or Nomic, but they expect sentence input rather than short strings. A practical middle ground is Model2Vec’s potion-base-8M, which is a 32M static vector table distilled from BGE-base-en-v1.5. On the MTEB benchmark, it is reported as outperforming fastText.

As a sanity check, we are going to use this model to embed a handful of tags and inspect whether the cosine similarities reflect the semantic structure we expect. For example, we expect “landuse|meadow” and “landuse|farmland” to be close to each other, while “landuse|residential” should be further away.

We embedded seven env/agri tags with potion-base-8M and inspected the cosine similarities. The agricultural landuse values (farmland, meadow, grassland) landed at ~0.78 average similarity to each other, clearly above the urban landuse=residential (~0.66) and far from unrelated natural=water and natural=tree (~0.19). The full similarity matrix:

	farmland	meadow	grassland	forest	residential	natural/water	natural/tree
farmland	1.00	0.75	0.81	0.73	0.72	0.20	0.16
meadow	0.75	1.00	0.80	0.69	0.62	0.17	0.18
grassland	0.81	0.80	1.00	0.72	0.63	0.21	0.20
forest	0.73	0.69	0.72	1.00	0.69	0.24	0.41
residential	0.72	0.62	0.63	0.69	1.00	0.30	0.21
natural/water	0.20	0.17	0.21	0.24	0.30	1.00	0.62
natural/tree	0.16	0.18	0.20	0.41	0.21	0.62	1.00
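The group averages quoted above can be recomputed from the published matrix. The sketch below hard-codes the rounded values from the table, so the recomputed averages may differ slightly from the figures in the text, which were presumably computed on unrounded similarities. A small `cosine` helper is included to recall how each cell is obtained from two embedding vectors; both helper names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

names = ["farmland", "meadow", "grassland", "forest",
         "residential", "natural/water", "natural/tree"]
sim = [
    [1.00, 0.75, 0.81, 0.73, 0.72, 0.20, 0.16],
    [0.75, 1.00, 0.80, 0.69, 0.62, 0.17, 0.18],
    [0.81, 0.80, 1.00, 0.72, 0.63, 0.21, 0.20],
    [0.73, 0.69, 0.72, 1.00, 0.69, 0.24, 0.41],
    [0.72, 0.62, 0.63, 0.69, 1.00, 0.30, 0.21],
    [0.20, 0.17, 0.21, 0.24, 0.30, 1.00, 0.62],
    [0.16, 0.18, 0.20, 0.41, 0.21, 0.62, 1.00],
]
idx = {name: i for i, name in enumerate(names)}

def mean_similarity(group_a, group_b=None):
    """Mean similarity within one group, or between two groups."""
    if group_b is None:
        pairs = list(combinations(group_a, 2))
    else:
        pairs = [(a, b) for a in group_a for b in group_b]
    return sum(sim[idx[a]][idx[b]] for a, b in pairs) / len(pairs)

agri = ["farmland", "meadow", "grassland"]
within_agri = mean_similarity(agri)
agri_vs_residential = mean_similarity(agri, ["residential"])
agri_vs_natural = mean_similarity(agri, ["natural/water", "natural/tree"])
print(round(within_agri, 2), round(agri_vs_residential, 2), round(agri_vs_natural, 2))
```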

Using the semantic embeddings, we can then rerun our pipeline on the 224,123 rows through the same SVD-to-50d and HDBSCAN stages. The character n-gram TF-IDF stage is simply replaced by potion-base-8M embeddings.

metric	TF-IDF	Embeddings	delta
number of base key families	413	435	+22
total clusters	8,910	5,259	-3,651 (-41%)
total occurrences captured (top 20 base keys)	1.68 B	2.48 B	+799 M (+47%)
noise ratio	35.3%	49.3%	+14 pp

The embedding pipeline produces fewer, larger clusters and pulls significantly more occurrences.

Effect on env/agri base keys

base_key	tfidf_clusters	embedding_clusters	tfidf_occurrences	embedding_occurrences	occurrences_delta
natural	3	1	6,688,616	56,559,801	+49,871,185 (+746%)
waterway	1	1	7,062,136	32,312,984	+25,250,848 (+358%)
landuse	2	1	30,325,838	39,334,360	+9,008,522 (+30%)
wetland	1	1	1,600,723	7,097,879	+5,497,156 (+343%)
boundary	1	1	94,892	2,196,788	+2,101,896 (+2215%)
taxon	5	13	194,723	875,653	+680,930 (+350%)
genus	10	14	1,252,809	1,183,437	-69,372 (-6%)
species	28	22	2,320,800	1,370,135	-950,665 (-41%)
generator	8	5	18,510,670	7,260,035	-11,250,635 (-61%)
water	1	1	11,446,716	16,768	-11,429,948 (-100%)

The natural key jumps from 3 spelling-driven clusters to a single semantic cluster that captures roughly 8x the volume: natural=water, natural=wetland, natural=wood, natural=tree and natural=scrub all land together because they describe environmental features. waterway and wetland follow the same pattern. landuse loses a cluster (the orchard-vs-farmland split collapses) and gains 30% more volume. The losses are also informative. water drops to about 16k occurrences because the embeddings separate water=* (water used as a key) from waterway=* and natural=water (which are about water too, but live under different keys).

Therefore, semantic clustering seems to be a better pick, since it rescues more occurrences and seems to capture semantically related concepts better.

The method retained

Given the previous analysis, we are going to standardize first and then filter. Moreover, because both pipelines presented (TF-IDF and the embedding models) yielded different results that may be complementary, we are going to keep both approaches.

Once the final set of base keys has been computed for both pipelines, we will manually assess each base key for relevance to environmental and agricultural topics.

The two preprocessing paths differ as follows:

Metric	Filter-first	Standardize-first
Tags	224,123	225,684 (+1,561)
Occurrences	3,350,015,993	3,368,341,528 (+18,325,535)

Two parallel clustering pipelines are then run on these sets, and both are retained because they capture complementary information:

Pipeline	Real clusters	Noise points	Noise volume	Distinct base keys
TF-IDF (character n-grams)	8,832	78,270 (34.7%)	1,122,085,693 (33.3%)	427
Embeddings (potion-base-8M)	4,954	106,498 (47.2%)	803,203,928 (23.8%)	433

The base-key families overlap on 307 keys, with 120 keys identified only by TF-IDF and 126 keys identified only by embeddings, for a total union of 553 distinct base keys.
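This kind of overlap report is plain set arithmetic. The sketch below uses a hypothetical helper and toy key sets (not the real pipeline outputs), then checks that the counts reported above are consistent with the per-pipeline totals of 427 and 433 base keys.

```python
def compare_key_sets(tfidf_keys, embedding_keys):
    """Sizes of the overlap, the two exclusive parts and the union."""
    return {
        "shared": len(tfidf_keys & embedding_keys),
        "tfidf_only": len(tfidf_keys - embedding_keys),
        "embedding_only": len(embedding_keys - tfidf_keys),
        "union": len(tfidf_keys | embedding_keys),
    }

# Hypothetical toy key sets, for illustration only.
toy = compare_key_sets({"landuse", "natural", "tiger"},
                       {"landuse", "natural", "crop", "wood"})
print(toy)

# Consistency check on the counts reported in the post.
shared, tfidf_only, embedding_only = 307, 120, 126
tfidf_total = shared + tfidf_only                     # 427 distinct TF-IDF base keys
embedding_total = shared + embedding_only             # 433 distinct embedding base keys
union_total = shared + tfidf_only + embedding_only    # 553 base keys overall
```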

Discussion

The database we used did not tell us when, where or on what type of object each tag is used, but only that “this tag exists N times across the planet”. A future study could build on this small analysis to figure out how these tags are distributed geographically, temporally, etc.