Efficient Filtering and Conversion of OpenStreetMap Files with Osmium October 2011

The OpenStreetMap database currently contains over 1.2 billion points, 110 million lines and one million relations (composite objects). This data is made available in the so called planet dump, a file that contains all the objects in sequential order (first all nodes (points), then all ways (lines) followed by all relations) and is currently – depending on the format – between 9 and 18 gigabytes large when compressed.

Processing a full planet dump is a task that will let you face your computer’s hardware limitations, even if you have a reasonably powerful machine by general standards. Very simple tasks can, depending on the tools you use, take days for completion. For this reason, smaller extracts of the data are made available, constrained to individual countries or even cities, sufficient for most users.

For creating the world maps for Maps Without Borders, I am currently creating thematic extracts of the global OSM data, containing for example all motorways or train routes of the world. In other words, the task is trivial, but on a huge scale: scan through the data, keep a tiny fraction (for example, only 0.6% of all ways in the database concern railways according to OSM’s TagInfo) and throw away everything else. I want to be able to do this on my 3 year old laptop, which sports a dual-core 2 GHz processor and 2 gigabytes of RAM (at the time I bought the laptop this was a top-end machine).

After initial unsuccessful attempts using the standard, Java-based OSM processing tool Osmosis, I discovered Osmium, a C++ library for writing your own tools to work with OSM data. Osmium comes with an example that already does exactly what I want: keep all ways with a certain tag and export them as a Shapefile. This is done very efficiently: since each way needs the coordinate information of all of its nodes, which are stored earlier in the file, it uses an optimized datastructure to first store the coordinates of all nodes using 8 bytes per coordinate pair, and then writes the ways to the file using these previously stored coordinates. For storing the coordinates, you have two options: either use a google::sparsetable, which uses only 1 bit for empty records, or use mmap, a Unix system call that provides your program with a block of virtual memory which can be transparently stored in a (possibly huge) file on disk.

For the 1.2 billion points in the database, you need 9 gigabytes of memory to just keep the coordinates in memory before you come to processing the ways. So this was not an option for me. Unfortunately, my hopes of using the second method were soon shattered, too: although I have enough free diskspace on my SSD, the transparent virtual memory mapping done by mmap means that you have to be able to address that memory - and on a 32 bit system you can only address 4 gigabytes of memory, no matter how much virtual memory you may be able to provide. Of course I could have looked for or written a different disk-based storage mechanism, but integrating it into Osmium and making it perform reasonably seemed too big of an effort.

In the end I settled for a two-pass approach: In the first pass I skip all nodes, and analyze only the ways to generate a list of ids of nodes that we need to construct the ways that we are actually interested in. Using a sparse table of integer values (storing a “1” for every node id we need), we end up needing just 1 bit for all nodes (a total of ~143 MB) plus one byte for every node that should actually be stored (if we assume we end up storing less than 1% of all points this should be below 11 MB). Of course this could be further optimized by using a static bitfield, but given the sparseness of the data I decided to not bother and use Osmium’s on-board means. So our first pass looks like this:

typedef Osmium::Storage::SparseTable storage_ids_t;

class Pass1Handler : public Osmium::Handler::Base {

public:

    storage_ids_t store_node_ids;

    void way(Osmium::OSM::Way *way) {
        const char *railway = way->tags().get_tag_by_key("highway");
        if (railway && !strcmp(railway, "motorway")) {
            if (way->id() >= 0) {
                for (Osmium::OSM::WayNodeList::iterator it = way->nodes().begin(); it != way->nodes().end(); ++it) {
                    const int64_t id = it->ref();
                    store_node_ids.set( id, 1 );
                }                    
            }
            // TODO: negative ids (used for temporary data, not found in planet dumps) not implemented
        }
    }
};

After the first pass, the procedure for the second pass is more or less the original approach: for each node, look up if we actually need its coordinates (using the sparsetable generated in the first pass), and if so store them in another sparesetable. Afterwards we construct the ways a ususal and can be sure that all the required coordinates have been stored.

typedef Osmium::Storage::SparseTable storage_sparsetable_t;
typedef Osmium::Handler::CoordinatesForWays coord_handler_t;

class Pass2Handler : public Osmium::Handler::Base {

    storage_ids_t* store_node_ids;

    storage_sparsetable_t store_pos;
    storage_sparsetable_t store_neg;
    coord_handler_t* coord_handler;

    Osmium::Export::LineStringShapefile *shapefile_linestring;

public:

    // pass in stored node ids from 1st pass
    Pass2Handler(storage_ids_t* node_ids) {

        store_node_ids = node_ids;
        coord_handler = new coord_handler_t(store_pos, store_neg);

        shapefile_linestring = new Osmium::Export::LineStringShapefile("highways");
        shapefile_linestring->add_field("id", FTInteger, 10);
    }

    ~Pass2Handler() {
        delete shapefile_linestring;
    }

    void node(Osmium::OSM::Node *node) {
        // we have to make sure id is in the range of stored ids and id is in table
        if (node->id() < store_node_ids->size() && (*store_node_ids)[node->id()]) {
            coord_handler->node(node);
        }
    }

    void way(Osmium::OSM::Way *way) {
        const char *tag = way->tags().get_tag_by_key("highway");
        if (tag && !strcmp(tag, "motorway")) {
            coord_handler->way(way);
            try {
                Osmium::Geometry::LineString linestring(*way);
                shapefile_linestring->add_geometry(linestring.create_shp_object());
                shapefile_linestring->add_attribute(0, way->id());
            } catch (Osmium::Exception::IllegalGeometry) {
                std::cerr << "Ignoring illegal geometry for way " << way->id() << ".\n";
            }
        }
    }

};

Originally I combined the two handlers classes in one, using an internal flag to distinguish the first and second pass. Investigating the Osmium source code it turned out that Osmium adapts its parsing by skipping parts that are not actually processed by the handler. To make this work (and gain quite a performance benefit), the corresponding method must not be implemented in the handler, so I had to split the two passes to benefit from skipping the unprocessed nodes in the first pass.

Finally, add the glue to turn the two handlers into an executable program:

int main(int, char *argv[]) {
    Osmium::init(true);

    Osmium::OSMFile infile(argv[1]);
    Pass1Handler handler1;
    Pass2Handler handler2(&handler1.store_node_ids);

    infile.read(handler1);
    infile.read(handler2);
}

Filtering planet.osm.pbf in this way takes approximately 40 minutes on my laptop, and peaks at 900 MB memory usage. Still serious processing, but orders of magnitude faster and more efficient than the alternatives.

Obviously this could be extended much more (I actually simplified the code for this post) but is already a lifesaver for generating full planet extracts in this hardcoded form.