Reading data

Populations

We have some read methods for common input data formats - but first let's take a quick look at the core pam data structure for populations:

from pam.core import Population, Household, Person

population = Population()  # initialise an empty population

# attribute dictionaries can hold any further key-value pairs you need
household = Household('hid0', attributes = {'struct': 'A', 'dogs': 2})
population.add(household)

person = Person('pid0', attributes = {'age': 33, 'height': 'tall'})
household.add(person)

person = Person('pid1', attributes = {'age': 35, 'cats_or_dogs?': 'dogs'})
household.add(person)

population.print()

Read methods

The first step in any application is to load your data into the core pam format (pam.core.Population). We are trying to support common tabular formats ('travel diaries') using pam.read.load_travel_diary. A travel diary can be composed of three tables:

  • trips (required) - a trip diary for all people in the population, with rows representing trips
  • persons_attributes (optional) - optionally include persons attributes (eg: person income)
  • households_attributes (optional) - optionally include households attributes (eg: hh number of cars)

The input tables are expected as pandas.DataFrame, eg:

import pandas as pd
import pam

trips_df = pd.read_csv("trips.csv")
persons_df = pd.read_csv("persons.csv")

# Fix headers and wrangle as required
# ...

population = pam.read.load_travel_diary(
    trips = trips_df,
    persons_attributes = persons_df,
    hhs_attributes = None,
    )

print(population.stats)

example_person = population.random_person()
example_person.print()
example_person.plot()

PAM requires tabular inputs to follow a basic structure. Rows in the trips dataframe represent unique trips by all persons, rows in the persons_attributes dataframe represent unique persons and rows in the hhs_attributes dataframe represent unique households. Fields named pid (person ID) and hid (household ID) are used to provide unique identifiers to people and households.
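To make the role of pid and hid concrete, the sketch below groups trip rows into a household, person, trips hierarchy using plain Python. It only illustrates the structure described above and is not PAM's own loader; the function name group_trips and the sample rows are hypothetical.

```python
from collections import defaultdict

def group_trips(trips):
    """Group trip rows (dicts) by household ID, then by person ID."""
    households = defaultdict(lambda: defaultdict(list))
    for trip in trips:
        households[trip["hid"]][trip["pid"]].append(trip)
    # Convert to plain dicts so that unknown IDs raise a KeyError
    return {hid: dict(people) for hid, people in households.items()}

trips = [
    {"pid": 0, "hid": 0, "seq": 0, "purp": "work"},
    {"pid": 0, "hid": 0, "seq": 1, "purp": "home"},
    {"pid": 1, "hid": 0, "seq": 0, "purp": "work"},
    {"pid": 2, "hid": 1, "seq": 0, "purp": "shop"},
]
grouped = group_trips(trips)
```

Here grouped[0] holds the two persons of household 0, each with their own list of trips.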

Trips Input:

eg:

pid  hid  seq  hzone      ozone          dzone          purp     mode  tst   tet   freq
0    0    0    Harrow     Harrow         Camden         work     pt    444   473   4.54
0    0    1    Harrow     Camden         Harrow         home     pt    890   919   4.54
1    0    0    Harrow     Harrow         Tower Hamlets  work     car   507   528   2.2
1    0    1    Harrow     Tower Hamlets  Harrow         home     car   1065  1086  2.2
2    1    0    Islington  Islington      Hackney        shop     pt    422   425   12.33
2    1    1    Islington  Hackney        Hackney        leisure  walk  485   500   12.33
2    1    2    Islington  Croydon        Islington      home     pt    560   580   12.33

A trips table is composed of rows representing unique trips for all persons in the population. Trips must be correctly ordered according to their sequence unless a numeric seq (trip sequence) field is provided, in which case trips will be ordered accordingly for each person.
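As a rough illustration of what "ordered accordingly for each person" means, the sketch below sorts trip rows by person and then by seq. It is written in plain Python with a hypothetical helper name (order_trips) and is not PAM's implementation.

```python
def order_trips(trips):
    """Return trips sorted by person, then by numeric trip sequence."""
    return sorted(trips, key=lambda trip: (trip["pid"], trip["seq"]))

unordered = [
    {"pid": 1, "seq": 1, "purp": "home"},
    {"pid": 0, "seq": 1, "purp": "home"},
    {"pid": 1, "seq": 0, "purp": "work"},
    {"pid": 0, "seq": 0, "purp": "work"},
]
ordered = order_trips(unordered)
```

If your table has no seq column, the rows for each person need to be in the right order already.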

The trips input must include the following fields:

  • pid - person ID, used as a unique identifier to associate trips belonging to the same person and to join trips with person attributes if provided
  • ozone - trip origin zone ID
  • dzone - trip destination zone ID
  • mode - trip mode - note that lower case strings are enforced
  • tst - trip start time in minutes (integer) or a datetime string (eg: "2020-01-01 14:00:00")
  • tet - trip end time in minutes (integer) or a datetime string (eg: "2020-01-01 14:00:00")
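If your source data mixes the two time formats, it may help to normalise them before loading. The sketch below converts either form to integer minutes; it assumes minutes are counted from midnight of a reference day, which is an assumption of this example rather than something stated above, and both to_minutes and day_start are hypothetical names.

```python
from datetime import datetime

def to_minutes(value, day_start="2020-01-01 00:00:00"):
    """Return minutes elapsed since day_start for an int or a datetime string."""
    if isinstance(value, int):
        return value  # already expressed in minutes
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(value, fmt) - datetime.strptime(day_start, fmt)
    return int(delta.total_seconds() // 60)

minutes = to_minutes("2020-01-01 14:00:00")
```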

The trips input must also include either:

  • purp - trip or tour purpose, eg 'work', or
  • oact and dact - origin activity type and destination activity type, eg 'home' and 'work'

Note that lower case strings are enforced and that 'home' activities should be encoded as home.
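The field requirements above can be checked before calling the read method. The following is a minimal sketch in plain Python; the helper name missing_trip_fields is hypothetical and PAM may perform its own, different validation.

```python
REQUIRED = {"pid", "ozone", "dzone", "mode", "tst", "tet"}

def missing_trip_fields(columns):
    """Return the set of required trip fields that are absent from columns."""
    columns = set(columns)
    missing = REQUIRED - columns
    has_purpose = "purp" in columns
    has_activities = {"oact", "dact"} <= columns
    # Either purp, or both oact and dact, must be present
    if not (has_purpose or has_activities):
        missing.add("purp (or oact and dact)")
    return missing
```

With a pandas dataframe you could pass trips_df.columns to this function and stop early if the result is not empty.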

The trips input may also include the following fields:

  • hid - household ID, used as a unique identifier to associate persons belonging to the same household and to join with household attributes if provided
  • freq - trip weighting for representative population
  • seq - trip sequence number, if omitted pam will assume that trips are already ordered
  • hzone - household zone

'trip purpose' vs 'tour purpose':

We've encountered a few different ways that trip purpose can be encoded. The preferred way is to encode a trip purpose as the activity at the destination, so that a trip home would be encoded as purp = home. However, we've also seen the more complex 'tour purpose' encoding, in which a return trip from work to home is encoded as purp = work. The good news is that pam.read.load_travel_diary will deal ok with either, but it's worth checking the result.
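One quick way to check which encoding a table uses is to look for trips that end in the home zone without being labelled home. The sketch below is a heuristic only: it assumes hzone is present and that arriving in the home zone means arriving home, so trips to other activities within the home zone will give false positives. The function name is hypothetical.

```python
def looks_like_tour_purpose(trips):
    """Return True if any trip ending in the home zone is not labelled 'home'."""
    return any(
        trip["dzone"] == trip["hzone"] and trip["purp"] != "home"
        for trip in trips
    )

trip_encoded = [
    {"hzone": "Harrow", "dzone": "Camden", "purp": "work"},
    {"hzone": "Harrow", "dzone": "Harrow", "purp": "home"},
]
tour_encoded = [
    {"hzone": "Harrow", "dzone": "Camden", "purp": "work"},
    {"hzone": "Harrow", "dzone": "Harrow", "purp": "work"},
]
```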

Using persons_attributes and/or households_attributes

eg:

persons.csv

pid  hid  hzone      freq   income  age     driver  cats or dogs
0    0    Harrow     10.47  high    high    yes     dogs
1    0    Harrow     0.034  low     medium  no      dogs
2    1    Islington  8.9    medium  low     yes     dogs

households.csv

hid  hzone      freq   persons  cars
0    Harrow     10.47  2        1
1    Islington  0.034  1        1

If you are using person attributes (persons_attributes), this table must contain a pid field (person ID). If you are using household attributes (households_attributes), this table must contain a hid field (household ID). In both cases, the frequency field freq may be used. All other attributes can be included with column names to suit the attribute. Note that hzone (home zone) can optionally be provided in the attribute tables.
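Because the attribute tables are joined to trips on pid and hid, it can be useful to confirm that every ID in the trips table has a matching attributes row. This is a minimal sketch in plain Python with a hypothetical helper (unmatched_ids); how PAM itself treats unmatched IDs is not covered here.

```python
def unmatched_ids(trips, attributes, key):
    """Return IDs found in trips but missing from the attributes table."""
    trip_ids = {row[key] for row in trips}
    attribute_ids = {row[key] for row in attributes}
    return trip_ids - attribute_ids

trips = [{"pid": 0, "hid": 0}, {"pid": 1, "hid": 0}, {"pid": 2, "hid": 1}]
persons = [{"pid": 0, "hid": 0}, {"pid": 1, "hid": 0}]
households = [{"hid": 0}, {"hid": 1}]

missing_persons = unmatched_ids(trips, persons, "pid")
missing_households = unmatched_ids(trips, households, "hid")
```

In this made-up example person 2 has trips but no attributes row, so missing_persons flags it.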

A note about 'freq':

Frequencies (aka 'weights') for trips, persons or households can optionally be added to the respective input tables using columns called freq. We generally assume a frequency to represent expected occurrences in a full population. For example, if we use a person frequency (person.freq), then the sum of all these frequencies will equal the expected population size.
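As a small worked example of this interpretation, the snippet below sums the person frequencies from the persons.csv example above. It uses a plain dictionary rather than PAM objects.

```python
# pid -> freq, taken from the persons.csv example
person_freqs = {0: 10.47, 1: 0.034, 2: 8.9}

# The three sampled persons stand in for this many people in the full population
expected_population = sum(person_freqs.values())
```

Under this reading, the three sampled persons represent an expected population of about 19.4 people.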

Because it is quite common to provide a person or household freq in the trips table, there are two special options (trip_freq_as_person_freq = True and trip_freq_as_hh_freq = True) that can be used to pass the freq field from the trips table to either the people or households table instead.

Generally, when you want some weighted output, PAM will assume that it should use household frequencies. If these have not been set, then PAM will assume that the household frequency is the average frequency of persons within the household. If person frequencies are not set, then PAM will assume that the person frequency is the average frequency of legs within the person's plan. If you wish to adjust the frequencies of a population then you should use the set_freq() method, eg:
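The fallback rules described above amount to two averages. The sketch below reproduces that arithmetic in plain Python for household 0 of the trips example; the helper names are hypothetical and this is not PAM's code.

```python
from statistics import mean

def person_freq(leg_freqs):
    """Fallback person frequency: mean frequency of the person's legs."""
    return mean(leg_freqs)

def household_freq(person_freqs):
    """Fallback household frequency: mean frequency of its persons."""
    return mean(person_freqs)

# Household 0 in the trips example: persons 0 and 1, two legs each
p0 = person_freq([4.54, 4.54])
p1 = person_freq([2.2, 2.2])
hh0 = household_freq([p0, p1])
```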

factor = 1.2
household.set_freq(household.freq * factor)
for pid, person in household:
    person.set_freq(person.freq * factor)

Read/Write/Other formats

PAM can read and write tabular formats and MATSim xml. PAM can also write segmented OD matrices.

Benchmark or summary data and cross-tabulations can be extracted with the benchmarking CLI method. For more fine-grained control, pandas dataframes for specific data field(s), dimension(s) and aggregation function(s) can be generated with pam.report.benchmarks.create_benchmark. For example, pam.report.benchmarks.create_benchmark(population.trips_df(), dimensions = ['duration_category'], data_fields = ['freq'], aggfunc = [sum]) returns the frequency breakdown of trips by duration.
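To show the kind of aggregation such a frequency breakdown performs, the sketch below sums trip freq within duration bins using plain Python. The bin edges and labels are hypothetical and are not necessarily the categories PAM uses; the helper names are also made up for this example.

```python
from collections import defaultdict

def duration_category(minutes):
    """Assign a trip duration to a hypothetical category label."""
    if minutes < 15:
        return "0 to 15 min"
    if minutes < 30:
        return "15 to 30 min"
    return "30+ min"

def freq_by_duration(trips):
    """Sum trip freq within each duration category."""
    totals = defaultdict(float)
    for trip in trips:
        totals[duration_category(trip["tet"] - trip["tst"])] += trip["freq"]
    return dict(totals)

trips = [
    {"tst": 444, "tet": 473, "freq": 4.54},   # 29 minutes
    {"tst": 507, "tet": 528, "freq": 2.2},    # 21 minutes
    {"tst": 422, "tet": 425, "freq": 12.33},  # 3 minutes
]
breakdown = freq_by_duration(trips)
```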

Please get in touch if you would like additional support or feel free to add your own.