How Frequent is Transit in Your City?

Frequent transit maps, as described by Jarrett Walker, are very useful: they show where you can just turn up at a bus stop without wasting your time waiting for a once-hourly bus. They also show where service is concentrated, whether to serve lots of people on a busy bus corridor or to funnel riders into a particularly important destination, and can thus offer insight into how to expand a transit network.

However, making them by hand can be a huge pain, particularly for a huge operation like New York City’s MTA, or for a metropolitan area split between several transit agencies like the Bay Area. It is worse still if the agency doesn’t post something akin to the service guide found on the back of a New York City bus map. That effort is one of the reasons this kind of mapping, although more useful for customers, is not very common (other reasons include inertia and politics). Fortunately, many transit agencies already collect and publish information about trips, service patterns, travel times, and such using Google’s GTFS specification, so that people can plan trips using a trip planner like Google Transit. So with a little bit of Python and matplotlib, you can make a quick and dirty transit map that looks something like this:

[Figure: a frequent transit map for the New York City metro area. The darker the line, the more frequent the service.]

 

In fact, because GTFS is a standard, it isn’t very hard to reuse the same code for other transit agencies. For example, by reusing code I was able to generate frequent transit maps for the 20 most populous metropolitan areas in the country, all drawn to the same standard and at the same scale. The code to generate them wasn’t particularly difficult, either.

Manipulating GTFS

GTFS is nice because it’s well defined. Every route has trips. Every trip has an associated list of stops and stop times, an associated shape, and an associated weekly schedule, and all of these are linked together using various IDs. Because it’s such a well-defined standard, this code will generally work with any unzipped GTFS feed; all it does is read the data in, count the trips on each shape within the given time frame on the given days of the week, and then compare that count against a base number of trips.

def plotData(m, folderList, minsize):
    shapes = pd.DataFrame()
    trips = pd.DataFrame()
    stopTimes = pd.DataFrame()
    calendar = pd.DataFrame()
    frequencies = pd.DataFrame()

    # Read the data in as if it were a csv file.
    shapes, trips, stopTimes, calendar, frequencies = getData(folderList, shapes, trips, stopTimes, calendar, frequencies)

    # Calculate the number of trips for each route.
    numTrips = getNumTrips(trips, stopTimes, calendar, frequencies)

    # Map the routes, with transparency dependent on frequency (trips / base_frequency).
    for row in numTrips.itertuples():
        shape_id = row.shape_id
        num_trips = row.totalTrips
        currentShape = shapes[shapes['shape_id'] == shape_id]
        transp = (num_trips / base_trips) ** 2
        if (transp > 1):
            transp = 1
        m.plot(currentShape['shape_pt_lon'].values, currentShape['shape_pt_lat'].values, color='black', latlon=True, linewidth=minsize, alpha=transp)

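Calculating base_trips is easy. With the values below, the base is a route that runs every 10 minutes in both directions from 7 AM to 7 PM, seven days a week; change any of the numbers and you very quickly adjust the base being evaluated against.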

calendarData = ['service_id','monday','tuesday','wednesday','thursday','friday','saturday','sunday']

# Calculate base trips.
base_directions = 2
base_minhour = 7
base_maxhour = 19
base_mintime = '07:00:00'
base_maxtime = '19:00:00'
base_maxheadway = 10
base_days = 7
base_trips = base_days * base_directions * (base_maxhour - base_minhour) * (60 / base_maxheadway)
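
To make the arithmetic concrete, here is a small worked sketch of the base calculation together with the opacity rule used in plotData. The opacity helper is a name I'm introducing for illustration only; the numbers are the same as above.

```python
# Worked example of the base calculation and the opacity rule.
base_days = 7
base_directions = 2
base_minhour = 7
base_maxhour = 19
base_maxheadway = 10  # minutes between buses

trips_per_hour = 60 / base_maxheadway  # 6.0
base_trips = base_days * base_directions * (base_maxhour - base_minhour) * trips_per_hour


def opacity(num_trips, base):
    """Squared ratio of trips to the base, capped at 1 (hypothetical helper)."""
    return min((num_trips / base) ** 2, 1)


print(base_trips)                 # 1008.0
print(opacity(504, base_trips))   # 0.25
print(opacity(2016, base_trips))  # 1
```

Because the ratio is squared, a shape with half the base number of trips is drawn at only a quarter of the opacity, which pushes infrequent routes into the background fairly aggressively.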

Calculating the number of trips, though, is a bit tricky, since the relevant data can be stored in either stop_times.txt or frequencies.txt.

# Calculates the number of trips.
def getNumTrips(trips, stopTimes, calendar, frequencies):
    validFreq = pd.DataFrame()

    # Grab the number of trips made outside min and max hour.
    tooEarly = stopTimes['arrival_time'] < base_mintime
    tooLate = stopTimes['departure_time'] > base_maxtime
    invalidTrips = stopTimes[(tooEarly | tooLate)].groupby('trip_id').size()

    # Add frequency information.
    if not frequencies.empty:
        # Filter out frequencies that fall outside the given time.
        freq_tooEarly = frequencies['end_time'] < base_mintime
        freq_tooLate = frequencies['start_time'] > base_maxtime
        # Copy so the assignments below modify validFreq, not a view of frequencies.
        validFreq = frequencies[~(freq_tooEarly | freq_tooLate)].copy()

        # Only consider valid timeframes.
        validFreq.loc[validFreq.start_time < base_mintime, 'start_time'] = base_mintime
        validFreq.loc[validFreq.end_time > base_maxtime, 'end_time'] = base_maxtime
        freq_noTime = validFreq['start_time'] == validFreq['end_time']
        validFreq = validFreq[~freq_noTime].copy()

        # Calculate the amount of time left.
        validFreq['trips']=pd.to_timedelta(pd.to_datetime(validFreq['end_time'], format='%H:%M:%S')-pd.to_datetime(validFreq['start_time'], format='%H:%M:%S')).dt.seconds / validFreq['headway_secs']
        validFreq = validFreq[['trip_id','trips']].groupby('trip_id', as_index = False).sum()

        # Remove invalid trips that have valid entries in frequencies.txt.
        # validFreq was grouped with as_index=False, so its trip IDs live in the trip_id column.
        in_freq = invalidTrips.index.isin(validFreq['trip_id'])
        invalidTrips = invalidTrips[~in_freq]

    # Filter out the invalid trips.
    in_validTrips = ~trips.trip_id.isin(invalidTrips.index)
    # Copy so that adding a column does not write into a slice of trips.
    validTrips = trips[in_validTrips].copy()
    validTrips['trips'] = 1.0

    if not validFreq.empty:
        # Go through the items in validFreq and update validTrips.trips.
        for row in validFreq.itertuples():
            validTrips.loc[validTrips.trip_id == row.trip_id, 'trips'] = row.trips

    # Find the amount of days any given service pattern runs.
    # Assumes calendar holds only service_id and the seven weekday columns (see calendarData).
    calendar['numDays'] = calendar.drop(columns='service_id').sum(axis=1)
    calendar = calendar[['service_id', 'numDays']]

    # Add this information to each trip.
    numTrips = pd.merge(validTrips, calendar, on='service_id', how='inner')
    numTrips['totalTrips'] = numTrips['numDays'] * numTrips['trips']
    # Sum only totalTrips; the other columns are IDs and should not be added together.
    numTrips = numTrips.groupby('shape_id', as_index=False)['totalTrips'].sum()
    return numTrips
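
The frequencies.txt branch is the least obvious part, so here is a standalone sketch of the same idea for a single row: clip the row's time range to the window, then divide what is left by the headway. The function names are mine, not part of the code above, and I convert times to seconds rather than comparing strings, which is a substitution on my part. It also copes with times past 24:00:00, which GTFS uses for service that runs past midnight.

```python
def to_seconds(hhmmss):
    """Convert a GTFS 'HH:MM:SS' string to seconds; hours may exceed 23."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(':'))
    return hours * 3600 + minutes * 60 + seconds


def trips_in_window(start_time, end_time, headway_secs,
                    window_start='07:00:00', window_end='19:00:00'):
    """Estimate how many trips one frequencies.txt row adds inside the window."""
    start = max(to_seconds(start_time), to_seconds(window_start))
    end = min(to_seconds(end_time), to_seconds(window_end))
    if end <= start:
        return 0.0  # the row lies entirely outside the window
    return (end - start) / headway_secs


# A bus every 10 minutes from 6 AM to 9 AM: only 7 AM to 9 AM counts.
print(trips_in_window('06:00:00', '09:00:00', 600))  # 12.0
# Late-night service ending at 1:30 AM the next day: nothing falls in the window.
print(trips_in_window('20:00:00', '25:30:00', 600))  # 0.0
```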

But ultimately, this works with basically any GTFS data, assuming everything you need is there, and the data is completely solid.

Limitations

Of course, this being the real world, in many cases not everything you need is there, and the data is not solid. Part of this is because of the way GTFS is specified: shapes.txt is not a required file in the specification, but for our purposes it is probably the most important file, since without it we can’t plot anything on the map. The other file this code cannot do without is calendar.txt, which it uses to determine which days of the week a given trip runs. Several big agencies, like WMATA in DC, TriMet in Portland, Valley Metro in Phoenix, and NJ Transit in New Jersey, don’t provide it (the specification allows a feed to define its service days entirely in calendar_dates.txt instead, which this code does not read). So on some of these maps, there appears to be no transit at all, because the largest transportation providers are missing the data this code needs.

And that is before counting the agencies that provide bad data. For example, MTS in San Diego does have calendar.txt, but it’s empty, so you can’t do anything with it. Several agencies have mismatched shape_id fields in their shapes.txt and trips.txt files, so some trips cannot be plotted at all. In my hometown of New York, for instance, none of the trips on the 1 train have a valid shape_id, so the 1 is simply not plotted on the map!
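
These problems are easy to miss until a route silently vanishes from the map, so a quick check before plotting can help. The following is only a sketch: check_feed is a hypothetical helper, and the tiny tables are made-up stand-ins for a real feed, not data from any agency.

```python
import pandas as pd


def check_feed(trips, shapes, calendar):
    """Return a list of human-readable problems found in one feed."""
    problems = []
    if calendar.empty:
        problems.append('calendar.txt is missing or empty')
    if shapes.empty:
        problems.append('shapes.txt is missing or empty')
    elif 'shape_id' not in trips.columns:
        problems.append('trips.txt has no shape_id column')
    else:
        # Trips whose shape_id is blank or absent from shapes.txt cannot be drawn.
        unmatched = trips[~trips['shape_id'].isin(set(shapes['shape_id']))]
        if not unmatched.empty:
            routes = sorted(unmatched['route_id'].unique())
            problems.append(f'{len(unmatched)} trips have no matching shape (routes: {routes})')
    return problems


# Made-up example: two trips on route "1" lack a usable shape, and there is no calendar.
trips = pd.DataFrame({
    'route_id': ['1', '1', 'A'],
    'trip_id': ['t1', 't2', 't3'],
    'shape_id': [None, 'not-in-shapes', 'shape-a'],
})
shapes = pd.DataFrame({
    'shape_id': ['shape-a'],
    'shape_pt_lat': [40.7],
    'shape_pt_lon': [-74.0],
})
calendar = pd.DataFrame()

for problem in check_feed(trips, shapes, calendar):
    print(problem)
```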

There is also the issue of how fine-grained the data is. New York splits its GTFS data by mode and agency, which makes it easy to depict the different modes and levels of service. However, many other systems do not make this distinction, particularly those with newer light rail networks, so the light rail just ends up getting depicted like every other bus. It doesn’t really change how long you have to wait for transit though, so it ends up being a minor aesthetic issue.

Ultimately, I found this a very interesting exercise with some very interesting results. A lot of smaller agencies are better at maintaining top-quality GTFS data than some of the much bigger ones. It’s also interesting to see that past the top 5, metro area population does not seem to be correlated with transit quality at all; Minneapolis-St. Paul has a better frequent network than Dallas, for example. Since this code is basically ready to go on any suitable GTFS data, I hope it removes a significant barrier to making frequent maps, or at least the bare-bones kind.
