Sebastian Angeltvedt's website

A model to predict bus arrival times... after collecting data

For the past few weeks I’ve been thinking about bus arrival times. On a daily basis, I use the Skyss app to plan my trips. It’s very useful, but delays only show up once the bus is already running late, even though they are often very predictable. That is understandable, of course, but is it possible to do better? What if we built a smarter predictive model? I’ve seen a lot of cool applied machine learning projects, and I want to join in!

However, there is one small problem with this plan: I know very little about how to build predictive models end-to-end. Sure, I once created a model in Python by following a YouTube/book tutorial. But I fell off when trying to learn the math behind backpropagation, and subsequently never touched it again. Furthermore, bus prediction isn’t necessarily best done by a neural network, or so I’ve read.

One of the few things I know is important is data. So before learning anything, I’m going to collect data. The real-time bus APIs are open, but historical data does not seem to be available, so I have to build my own archive. The model also needs other inputs, including traffic information and weather. I’m sticking to the sound strategy of “save everything now, deal with it later”. However, I only want to deal with one bus provider. I chose Skyss, because it’s what I use. Furthermore, Skyss publishes separate datasets within Vestland county; I chose to only collect data for Hordaland.

Collecting bus data

The most important data for training a bus prediction model is, of course, going to be bus data. Thankfully, there are good APIs for this. There are two main data sources: real-time data and stops/timetable data. The real-time data consists of subscribing to a SIRI stream that publishes trip updates, vehicle positions and service alerts.
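The interesting part of ingesting the stream is flattening each SIRI message into rows that can go into columnar storage. A minimal sketch of that step, using a hand-written, simplified vehicle-monitoring (VM) message — the element names follow the SIRI schema as I understand it, but the real feed carries many more fields:

```python
# Sketch: flattening a SIRI VM message into flat records for storage.
# The XML below is a simplified, hand-written example of a VehicleActivity
# entry; the real stream is much richer.
import xml.etree.ElementTree as ET

SIRI_NS = {"siri": "http://www.siri.org.uk/siri"}

SAMPLE = """<Siri xmlns="http://www.siri.org.uk/siri">
  <VehicleActivity>
    <RecordedAtTime>2024-05-01T07:30:00+02:00</RecordedAtTime>
    <MonitoredVehicleJourney>
      <LineRef>SKY:Line:1</LineRef>
      <VehicleLocation>
        <Longitude>5.3242</Longitude>
        <Latitude>60.3913</Latitude>
      </VehicleLocation>
    </MonitoredVehicleJourney>
  </VehicleActivity>
</Siri>"""

def flatten_vm(xml_text: str) -> list[dict]:
    """Turn each VehicleActivity element into one flat record."""
    root = ET.fromstring(xml_text)
    records = []
    for act in root.findall("siri:VehicleActivity", SIRI_NS):
        mvj = act.find("siri:MonitoredVehicleJourney", SIRI_NS)
        records.append({
            "recorded_at": act.findtext("siri:RecordedAtTime", namespaces=SIRI_NS),
            "line": mvj.findtext("siri:LineRef", namespaces=SIRI_NS),
            "lon": float(mvj.findtext("siri:VehicleLocation/siri:Longitude", namespaces=SIRI_NS)),
            "lat": float(mvj.findtext("siri:VehicleLocation/siri:Latitude", namespaces=SIRI_NS)),
        })
    return records
```

Flat records like these map directly onto Parquet columns later on.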

The stops and timetable data are published as static datasets, in both NeTEx and GTFS formats. I don’t yet know either of these formats, so I collect it all indiscriminately. Conveniently, the API also reports when each dataset was last imported, so one can check that endpoint for updates. I check (and potentially download the new data) once every 24 hours.

Collecting weather data

Bus times also depend on the weather. This is a bit trickier. The Norwegian Meteorological Institute provides free meteorological APIs with very generous rate limits (20 req/s). But since I can’t just query “the weather in Hordaland”, I have to query a sample of points. I generated 500 points based on the bus data, placing them where bus routes run or have run. The sampling prioritizes areas with higher trip density, while also trying to cover a large total area. Because the weather data has a 1 km × 1 km resolution, and in the interest of spreading the points out, no two points are ever closer than 1.5 km.
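The selection idea can be sketched as a greedy pass: take candidate locations in order of trip density and reject any candidate within 1.5 km of an already-chosen point. Candidate generation from the actual bus data is out of scope here, so the candidates in the example are made up:

```python
# Sketch: greedy selection of weather sample points with a minimum
# spacing of 1.5 km. Densest candidates win; anything too close to an
# already-chosen point is skipped.
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def pick_points(candidates, n=500, min_km=1.5):
    """candidates: list of (trip_density, (lat, lon)); densest first wins."""
    chosen = []
    for _, point in sorted(candidates, reverse=True):
        if all(haversine_km(point, p) >= min_km for p in chosen):
            chosen.append(point)
        if len(chosen) == n:
            break
    return chosen
```

With 500 points and a few thousand candidates, the quadratic distance check is cheap enough to run once, offline.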

500 points in Hordaland

I will collect both forecasts and “nowcasts” (current weather), because the model may need to use both and to know the difference between them. Forecasts are collected every hour, nowcasts every 15 minutes.
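Each collection run then boils down to one request per sample point. A sketch of the per-point URLs — the product paths match my reading of api.met.no’s Locationforecast and Nowcast products, but verify them (and note that MET requires an identifying User-Agent header) before relying on this:

```python
# Sketch: request URLs for MET Norway's forecast and nowcast products,
# one request per sample point. Product paths are my assumption from the
# api.met.no docs; MET also requires a descriptive User-Agent header.
BASE = "https://api.met.no/weatherapi"

def forecast_url(lat: float, lon: float) -> str:
    # collected every hour per point
    return f"{BASE}/locationforecast/2.0/compact?lat={lat:.4f}&lon={lon:.4f}"

def nowcast_url(lat: float, lon: float) -> str:
    # collected every 15 minutes per point
    return f"{BASE}/nowcast/2.0/complete?lat={lat:.4f}&lon={lon:.4f}"
```

At 500 points every 15 minutes, the nowcast polling averages well under one request per second, comfortably inside the 20 req/s limit.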

Collecting traffic information

There is a huge amount of data available through Statens vegvesen’s (the Norwegian Public Roads Administration’s) APIs. There is so much different data that I didn’t know what to collect. Thankfully, the APIs provide historical data as well, so I don’t have to choose now. For the time being, I collect the hour-by-hour traffic volume for the previous day, once every day. I have also applied for DATEX access to get some additional information.
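The daily job asks for hourly volumes covering all of the previous day. The trafikkdata API is GraphQL; the query below is illustrative — the field names are from my reading of its documentation and should be checked against the live schema — while the date-window helper is the part that actually needs to be right:

```python
# Sketch: compute the [00:00 yesterday, 00:00 today) window and build an
# hour-by-hour volume query for one traffic registration point. The
# GraphQL field names are assumptions; check the trafikkdata schema.
from datetime import date, datetime, timedelta, timezone

OSLO_OFFSET = timezone(timedelta(hours=2))  # simplification: ignores DST

def previous_day_window(today: date) -> tuple[str, str]:
    """ISO timestamps spanning all of the previous day."""
    start = datetime.combine(today - timedelta(days=1), datetime.min.time(),
                             tzinfo=OSLO_OFFSET)
    return start.isoformat(), (start + timedelta(days=1)).isoformat()

def volume_query(point_id: str, frm: str, to: str) -> str:
    return f'''{{
      trafficData(trafficRegistrationPointId: "{point_id}") {{
        volume {{ byHour(from: "{frm}", to: "{to}") {{
          edges {{ node {{ from to total {{ volumeNumbers {{ volume }} }} }} }}
        }} }}
      }}
    }}'''
```

A real collector would use a proper timezone database instead of a fixed offset, but the window logic is the same.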

Storage and Logging

Everything gets written to Parquet files, which provide column-level compression. A new Parquet file filled with data is uploaded to B2 every 5 minutes. At midnight, the day’s files get compacted into a single Parquet file, improving query speed and space efficiency. The data is partitioned by date, so every day gets its own file.
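The buffering itself is simple: accumulate records in memory and flush a batch every five minutes. A sketch of that policy — the real collector writes Parquet via a library such as pyarrow, so the writer here is just a callback and only the timing logic is shown:

```python
# Sketch of the 5-minute flush policy. write_batch stands in for
# "write these rows as one Parquet file and upload it"; an injectable
# clock keeps the logic testable.
import time

class RecordBuffer:
    def __init__(self, write_batch, flush_every_s=300, clock=time.monotonic):
        self.records = []
        self.write_batch = write_batch
        self.flush_every_s = flush_every_s
        self.clock = clock
        self.last_flush = clock()

    def add(self, record: dict) -> None:
        self.records.append(record)
        if self.clock() - self.last_flush >= self.flush_every_s:
            self.flush()

    def flush(self) -> None:
        if self.records:
            self.write_batch(self.records)
            self.records = []
        self.last_flush = self.clock()
```

A shutdown hook should call `flush()` one last time so the tail of the buffer isn’t lost.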

I have instrumented the application with some OpenTelemetry events, so I have insight into potential errors or collection gaps. I spun up a SigNoz instance to ingest the data.

SIRI VM and ET records building up in an in-memory Parquet buffer, before getting flushed to files. Records are visibly building up more slowly as rush hour ends.

Now I just have to wait for data to come in! Maybe I should learn how to build a model in the meantime?