An incomplete comparison of geospatial file formats
Diogo Souza da Silva
Posted on October 27, 2019
A few years ago I was working with geospatial analysis and got to study and compare a few alternative file formats for storing and, especially, transferring such data.
I worked mainly with vector data (polygons and points) and served it over the web. The motivation for this work was to test the gains and costs of TopoJSON: how it fared in size (important for transfer time) and in encoding time (important for server resource usage).
A more complete description of the advantages of the available formats can be found at Shapefile must die!, a good resource even if a bit strongly worded (FYI: as a developer, I really dislike working with shapefiles).
Even though I never got to finish the comparison, I guess I can share the results anyway.
The contender file formats
First, let's make some groups. Here we will compare three things:
- Data structure
- Format encoding
- Compression
For data structure, I am comparing how data is organized inside the file:
- Shape
- GeoJSON structure
- TopoJSON structure
For format encoding, I am comparing how this structure is serialized:
- Binary (shapefile)
- JSON
- MessagePack
For compression, I am comparing compression algorithms:
- None
- Deflate/GZIP
- XZ/LZMA
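To make the compression axis concrete, here is a minimal sketch of how such a measurement could be set up with Python's standard library. It is not the original benchmark (that was written in Java/Clojure); the synthetic point data and the names `make_feature_collection` and `measure` are assumptions made for illustration, and it only times the compression step, not the structure or format encoding.

```python
import gzip
import json
import lzma
import time


def make_feature_collection(n_points=2000):
    """Build a synthetic GeoJSON FeatureCollection (hypothetical test data)."""
    features = []
    for i in range(n_points):
        features.append({
            "type": "Feature",
            "properties": {"id": i, "name": f"point-{i}"},
            "geometry": {
                "type": "Point",
                "coordinates": [-46.0 + i * 0.0001, -23.0 - i * 0.0001],
            },
        })
    return {"type": "FeatureCollection", "features": features}


def measure(collection):
    """Return {compression_name: (size_in_bytes, seconds)} for one payload."""
    # Serialize once as compact JSON, then compress the same bytes each way.
    raw = json.dumps(collection, separators=(",", ":")).encode("utf-8")
    codecs = {
        "none": lambda data: data,
        "gz": gzip.compress,
        "xz": lzma.compress,  # lzma.compress writes the .xz container by default
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        payload = compress(raw)
        results[name] = (len(payload), time.perf_counter() - start)
    return results


if __name__ == "__main__":
    for name, (size, seconds) in measure(make_feature_collection()).items():
        print(f"{name:>4}: {size:>8} bytes in {seconds:.4f}s")
```

Synthetic data like this is far more regular than real geometry, so the ratios it produces say little about a real data set; the point is only the shape of the measurement.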
The choices of what to test were based on tooling availability in my programming languages (Java/Clojure) and ease of use from the web (JavaScript).
A few notes on formats not tested:
CSV was not tested because it is not as standard as it looks: there are several combinations of field separators, enclosing characters and record separators. I would also have had to compare WKT, WKB and other geometry encodings. But it would probably fare well, as it can be streamed and compresses well.
Spatialite is a bit more complex to handle, as you need not only an SQLite library but also the Spatialite extensions. I would also have had to define a table structure and such.
Given more time I would include more tests on both.
Shapefile was only tested as a baseline.
Overall results
The code and results can be found on my GitHub.
| Structure | Format | Compression | Size | Encoding time |
|-----------|---------|----------|-------|------|
| Shapefile | - | - | 5MB | - |
| Shapefile | - | zip | 3.2MB | - |
| Geo | JSON | - | 9MB | 10s |
| Geo | JSON | gz | 2.5MB | 11s |
| Geo | JSON | xz | 1.4MB | 21s |
| Geo | MsgPack | - | 5.2MB | 9s |
| Geo | MsgPack | gz | 3.5MB | 11s |
| Geo | MsgPack | xz | 1.7MB | 15s |
| Topo | JSON | - | 524KB | 22s |
| Topo | JSON | gz | 84KB | 20s |
| Topo | JSON | xz | 64KB | 22s |
| Topo | MsgPack | - | 256KB | 21s |
| Topo | MsgPack | gz | 76KB | 20s |
| Topo | MsgPack | xz | 60KB | 22s |
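To put these sizes in perspective, the sketch below computes how many times smaller each variant is than raw GeoJSON. The sizes are copied from the table; treating 1MB as 1024KB is an assumption, since the table does not say which convention was used, and `reduction_factor` is just an illustrative helper name.

```python
# Sizes in KB copied from the results table (assumption: 1 MB = 1024 KB).
MB = 1024
sizes_kb = {
    ("Geo", "JSON", "-"): 9 * MB,
    ("Geo", "JSON", "gz"): 2.5 * MB,
    ("Geo", "JSON", "xz"): 1.4 * MB,
    ("Geo", "MsgPack", "-"): 5.2 * MB,
    ("Geo", "MsgPack", "gz"): 3.5 * MB,
    ("Geo", "MsgPack", "xz"): 1.7 * MB,
    ("Topo", "JSON", "-"): 524,
    ("Topo", "JSON", "gz"): 84,
    ("Topo", "JSON", "xz"): 64,
    ("Topo", "MsgPack", "-"): 256,
    ("Topo", "MsgPack", "gz"): 76,
    ("Topo", "MsgPack", "xz"): 60,
}

# Raw, uncompressed GeoJSON is the reference point.
baseline = sizes_kb[("Geo", "JSON", "-")]


def reduction_factor(key):
    """How many times smaller a variant is than raw GeoJSON."""
    return baseline / sizes_kb[key]


if __name__ == "__main__":
    for key in sizes_kb:
        print(f"{'/'.join(key):<16} {reduction_factor(key):7.1f}x smaller")
```

Under that assumption, gzipped GeoJSON comes out 3.6 times smaller than the raw file, while TopoJSON with XZ comes out 144 times smaller.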
Shapefiles are the baseline; they do not compress very well (5MB down to 3.2MB zipped).
As expected, raw GeoJSON files are huge, but being text they compress very well and are reasonably fast to encode.
TopoJSON files are minimal in size but take a long time to encode. Not captured in this test is the fact that topology encoding also takes a lot of memory, as it has to hold the whole collection to iterate over it.
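The reason a topology can be so much smaller is that boundaries shared between neighbouring polygons are stored only once. The sketch below illustrates that idea in a deliberately simplified form: every two-point segment is treated as its own arc, while real TopoJSON merges segments into longer arcs and also quantizes and delta-encodes coordinates. The function names and the two-square example are hypothetical; only the use of the ones' complement (`~i`) for a reversed arc follows the TopoJSON convention.

```python
def build_topology(rings):
    """Replace each ring by references into a shared list of segments.

    Simplification: each two-point segment is one "arc". Real TopoJSON
    builds longer arcs and quantizes coordinates; none of that happens here.
    """
    arcs = []    # unique segments, each stored once
    index = {}   # (start, end) -> position in arcs
    encoded = []
    for ring in rings:
        refs = []
        for start, end in zip(ring, ring[1:]):
            if (start, end) in index:
                refs.append(index[(start, end)])
            elif (end, start) in index:
                # Same segment walked backwards: store the ones' complement,
                # the convention TopoJSON uses for reversed arcs.
                refs.append(~index[(end, start)])
            else:
                index[(start, end)] = len(arcs)
                arcs.append((start, end))
                refs.append(len(arcs) - 1)
        encoded.append(refs)
    return arcs, encoded


def decode_ring(arcs, refs):
    """Rebuild a closed ring from its arc references."""
    ring = []
    for ref in refs:
        start, end = arcs[ref] if ref >= 0 else arcs[~ref][::-1]
        if not ring:
            ring.append(start)
        ring.append(end)
    return ring


# Two unit squares that share the edge x = 1 (hypothetical example data).
left = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
right = [(1, 0), (2, 0), (2, 1), (1, 1), (1, 0)]

arcs, encoded = build_topology([left, right])
```

The two rings have eight segments between them, but only seven are stored, because the shared edge is referenced by both squares. Note that `index` has to cover every segment seen so far, which hints at why topology encoding needs the whole collection in memory.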
MessagePack, being a binary format, offers reasonable space efficiency and encodes faster. It adds complexity on the web side and loses most of its gains after compression. It is faster to read and write on the server but slower in the browser.
Deflate/GZIP offers the expected compression results. It is standard on the web, which makes it an easy choice: your server already supports it, and so does the browser.
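In practice this is usually a configuration switch in the web server or proxy rather than application code. Still, as a rough sketch of what happens, the function below gzips a GeoJSON body only when the client's `Accept-Encoding` header lists `gzip`. The name `encode_response` is hypothetical and q-values are ignored, which a real server would not do.

```python
import gzip


def encode_response(body, accept_encoding):
    """Return (headers, payload), gzipping when the client advertises support.

    Simplification: q-values such as "gzip;q=0" are ignored; a real server
    or proxy performs the full content negotiation.
    """
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    }
    headers = {
        "Content-Type": "application/geo+json",
        # Tell caches that the response depends on the request header.
        "Vary": "Accept-Encoding",
    }
    if "gzip" in offered:
        payload = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    else:
        payload = body
    headers["Content-Length"] = str(len(payload))
    return headers, payload
```

Browsers decompress the payload transparently when they see `Content-Encoding: gzip`, so nothing changes in the JavaScript that consumes the GeoJSON.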
LZMA/XZ is a bit harder to use in the browser, but it delivers even smaller files.
A few conclusions
This test is incomplete, and you should run your own on your own data set to get results that match your reality.
But here is my take on it:
- Shapefiles suck because they are spread over a lot of files and have several limitations
- If nothing else, at least enable DEFLATE when serving your GeoJSON
- TopoJSON is complex to deal with and expensive to encode
- TopoJSON offers insane compaction, especially on polygons with a lot of shared lines
- MsgPack offers nice compaction over text, but most of that is lost after compression
- LZMA/XZ adds a little complexity but gave good gains on bigger files
So, if you can afford to encode only once with resources to spare: TopoJSON with XZ gives the most value. If you have to encode/decode on the fly: GeoJSON with XZ. If you can spare disk, offer both: TopoJSON and GeoJSON with XZ and GZ.