Reading data

Populations

We have some read methods for common input data formats - but first let's take a quick look at the core pam data structure for populations:

from pam.core import Population, Household, Person

population = Population()  # initialise an empty population

household = Household('hid0', attributes = {'struct': 'A', 'dogs': 2, ...})
population.add(household)

person = Person('pid0', attributes = {'age': 33, 'height': 'tall', ...})
household.add(person)

person = Person('pid1', attributes = {'age': 35, 'cats_or_dogs?': 'dogs', ...})
household.add(person)

population.print()

Read methods

The first step in any application is to load your data into the core pam format (pam.core.Population). We are trying to support common tabular formats ('travel diaries') using pam.read.load_travel_diary. A travel diary can be composed of three tables:

trips (required) - a trip diary for all people in the population, with rows representing trips
persons_attributes (optional) - person attributes (eg: person income)
households_attributes (optional) - household attributes (eg: number of cars in the household)

The input tables are expected as pandas.DataFrame, eg:

import pandas as pd
import pam

trips_df = pd.read_csv('trips.csv')
persons_df = pd.read_csv('persons.csv')

# Fix headers and wrangle as required
# ...

population = pam.read.load_travel_diary(
    trips = trips_df,
    persons_attributes = persons_df,
    hhs_attributes = None,
    )

print(population.stats)

example_person = population.random_person()
example_person.print()
example_person.plot()

PAM requires tabular inputs to follow a basic structure. Rows in the trips dataframe represent unique trips by all persons, rows in the persons_attributes dataframe represent unique persons and rows in the hhs_attributes dataframe represent unique households. Fields named pid (person ID) and hid (household ID) are used to provide unique identifiers to people and households.

Trips Input:

eg:

pid	hid	seq	hzone	ozone	dzone	purp	mode	tst	tet	freq
0	0	0	Harrow	Harrow	Camden	work	pt	444	473	4.54
0	0	1	Harrow	Camden	Harrow	home	pt	890	919	4.54
1	0	0	Harrow	Harrow	Tower Hamlets	work	car	507	528	2.2
1	0	1	Harrow	Tower Hamlets	Harrow	home	car	1065	1086	2.2
2	1	0	Islington	Islington	Hackney	shop	pt	422	425	12.33
2	1	1	Islington	Hackney	Hackney	leisure	walk	485	500	12.33
2	1	2	Islington	Croydon	Islington	home	pt	560	580	12.33

A trips table is composed of rows representing unique trips for all persons in the population. Trips must be correctly ordered according to their sequence unless a numeric seq (trip sequence) field is provided, in which case trips will be ordered accordingly for each person.
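As a rough illustration of the ordering rule above, the sketch below sorts trip rows by person and then by seq. It uses plain dictionaries rather than a DataFrame and is not PAM's own implementation; the helper name order_trips is hypothetical.

```python
def order_trips(trips):
    """Return trips sorted by person ID, then by trip sequence number."""
    return sorted(trips, key=lambda trip: (trip["pid"], trip["seq"]))


# Rows deliberately out of order (values taken from the example table above).
trips = [
    {"pid": 2, "seq": 1, "purp": "leisure"},
    {"pid": 0, "seq": 1, "purp": "home"},
    {"pid": 2, "seq": 0, "purp": "shop"},
    {"pid": 0, "seq": 0, "purp": "work"},
    {"pid": 2, "seq": 2, "purp": "home"},
]

ordered = order_trips(trips)
print([(t["pid"], t["seq"]) for t in ordered])
```

Without a seq field there is nothing to sort on, which is why the rows must then already be in the right order for each person.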

The trips input must include the following fields:

- pid - person ID, used as a unique identifier to associate trips belonging to the same person and to join trips with person attributes if provided.
- ozone - trip origin zone ID
- dzone - trip destination zone ID
- mode - trip mode - note that lower case strings are enforced
- tst - trip start time in minutes (integer) or a datetime string (eg: "2020-01-01 14:00:00")
- tet - trip end time in minutes (integer) or a datetime string (eg: "2020-01-01 14:00:00")
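To show how the two accepted forms of tst and tet relate, here is a minimal sketch that converts either form to minutes since midnight. PAM handles both forms itself; the helper to_minutes is hypothetical, and both the datetime format string and the decision to ignore the date part are assumptions made for this example.

```python
from datetime import datetime


def to_minutes(value):
    """Convert a tst/tet value to minutes since midnight.

    Integers are taken to already be minutes; strings are parsed as
    'YYYY-MM-DD HH:MM:SS' datetimes (an assumed format for this sketch).
    """
    if isinstance(value, int):
        return value
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return parsed.hour * 60 + parsed.minute


print(to_minutes(444))
print(to_minutes("2020-01-01 14:00:00"))
```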

The trips input must also include either:

- purp - trip or tour purpose, eg 'work'; or
- oact and dact - origin activity type and destination activity type, eg 'home' and 'work'

Note that lower case strings are enforced and that 'home' activities should be encoded as home.

The trips input may also include the following fields:

- hid - household ID, used as a unique identifier to associate persons belonging to the same household and to join with household attributes if provided
- freq - trip weighting for representative population
- seq - trip sequence number, if omitted pam will assume that trips are already ordered
- hzone - household zone
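Before calling the read method it can help to check the column names against the field rules above. The sketch below is one way to do that and is not part of PAM; missing_trip_fields is a hypothetical helper that accepts any iterable of column names, so it could be called with trips_df.columns.

```python
REQUIRED_FIELDS = ["pid", "ozone", "dzone", "mode", "tst", "tet"]


def missing_trip_fields(columns):
    """List what is missing from a trips table, given its column names."""
    columns = set(columns)
    missing = [field for field in REQUIRED_FIELDS if field not in columns]
    # Either purp, or both oact and dact, must be present.
    if "purp" not in columns and not {"oact", "dact"} <= columns:
        missing.append("purp (or oact and dact)")
    return missing


example_columns = ["pid", "hid", "seq", "hzone", "ozone", "dzone",
                   "purp", "mode", "tst", "tet", "freq"]
print(missing_trip_fields(example_columns))
print(missing_trip_fields(["pid", "ozone", "dzone", "oact"]))
```

Optional fields (hid, freq, seq, hzone) are deliberately not checked, since their absence is allowed.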

'trip purpose' vs 'tour purpose':

We've encountered a few different ways that trip purpose can be encoded. The preferred way is to encode a trip purpose as the activity at the destination, so that a trip home would be encoded as purp = home. However, we've also seen the more complex 'tour purpose' encoding, in which case a return trip from work to home is encoded as purp = work. The good news is that the read method will deal ok with either, but it's worth checking.
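The difference between the two encodings can be made concrete with a small sketch. It infers the activity at the end of each trip using one simple assumed rule: a trip that ends in the home zone ends in a 'home' activity. This is an illustration only and not necessarily how PAM resolves purposes; destination_activities is a hypothetical helper.

```python
def destination_activities(trips, home_zone):
    """Infer the activity at the end of each trip.

    Assumption for this sketch: any trip ending in the home zone ends in a
    'home' activity, whatever its purp says. Otherwise purp is used as is.
    """
    return [
        "home" if trip["dzone"] == home_zone else trip["purp"]
        for trip in trips
    ]


# The same return journey under the two encodings.
trip_purpose = [
    {"ozone": "Harrow", "dzone": "Camden", "purp": "work"},
    {"ozone": "Camden", "dzone": "Harrow", "purp": "home"},
]
tour_purpose = [
    {"ozone": "Harrow", "dzone": "Camden", "purp": "work"},
    {"ozone": "Camden", "dzone": "Harrow", "purp": "work"},
]

print(destination_activities(trip_purpose, "Harrow"))
print(destination_activities(tour_purpose, "Harrow"))
```

Both encodings give the same activity sequence under this rule. Its weakness is that a non-home activity inside the home zone would be mislabelled as 'home', which is one reason the parsed plans are worth checking.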

Using persons_attributes and/or households_attributes

eg:

persons.csv

pid	hid	hzone	freq	income	age	driver	cats or dogs
0	0	Harrow	10.47	high	high	yes	dogs
1	0	Harrow	0.034	low	medium	no	dogs
2	1	Islington	8.9	medium	low	yes	dogs

households.csv

hid	hzone	freq	persons	cars
0	Harrow	10.47	2	1
1	Islington	0.034	1	1

If you are using persons attributes (persons_attributes), this table must contain a pid field (person ID). If you are using households attributes (households_attributes), this table must contain a hid field (household ID). In both cases, the frequency field freq may be used. All other attributes can be included with column names to suit the attribute. Note that hzone (home zone) can optionally be provided in the attribute tables.
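Because pid and hid are the join keys between the tables, a quick consistency check before loading can catch IDs that would have no attributes. The following is a sketch with plain dictionaries standing in for table rows; unmatched_ids is a hypothetical helper and not part of PAM.

```python
def unmatched_ids(trips, attributes, key):
    """Return IDs that appear in trips but not in the attributes table."""
    known = {row[key] for row in attributes}
    return sorted({row[key] for row in trips} - known)


trips = [{"pid": 0, "hid": 0}, {"pid": 1, "hid": 0}, {"pid": 2, "hid": 1}]
persons = [{"pid": 0}, {"pid": 1}, {"pid": 2}]
households = [{"hid": 0}]  # household 1 deliberately left out

print(unmatched_ids(trips, persons, "pid"))
print(unmatched_ids(trips, households, "hid"))
```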

A note about 'freq':

Frequencies (aka 'weights') for trips, persons or households can optionally be added to the respective input tables using columns called freq. We generally assume a frequency to represent expected occurrences in a full population. For example, if we use a person frequency (person.freq) then the sum of all these frequencies will equal the expected population size.
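As a small worked example of this interpretation, summing the person frequencies from the persons.csv example above gives the population size that the three sampled persons are expected to represent. The dictionary below is just a stand-in for the freq column.

```python
# pid -> freq, copied from the persons.csv example above.
person_freqs = {0: 10.47, 1: 0.034, 2: 8.9}

expected_population = sum(person_freqs.values())
print(round(expected_population, 3))
```

Here three sampled persons stand for roughly 19.4 people in the full population.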

Because it is quite common to provide a person or household freq in the trips table, there are two special options (trip_freq_as_person_freq = True and trip_freq_as_hh_freq = True) that can be used to pass the freq field from the trips table to either the people or households table instead.

Generally, when you want some weighted output, PAM will assume that it should use household frequencies. If these have not been set then PAM will assume that the household frequency is the average frequency of persons within the household. If person frequencies are not set then PAM will assume that the person frequency is the average frequency of legs within the person's plan. If you wish to adjust the frequencies of a population then you should use the set_freq() method, eg:

factor = 1.2
household.set_freq(household.freq * factor)
for pid, person in household:
    person.set_freq(person.freq * factor)
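The fallback rules described above (household frequency as the average of its persons, person frequency as the average of their legs) can be sketched in plain Python. This is an illustration of the arithmetic only, not PAM's implementation; average_freq is a hypothetical helper, and skipping unset (None) values is an assumption of the sketch.

```python
def average_freq(freqs):
    """Mean of the frequencies that are set; None if none are set."""
    known = [freq for freq in freqs if freq is not None]
    if not known:
        return None
    return sum(known) / len(known)


# Household 0 from persons.csv holds persons 0 and 1.
household_freq = average_freq([10.47, 0.034])

# Person 0 from the trips table has two legs, each with freq 4.54.
person_freq = average_freq([4.54, 4.54])

print(household_freq, person_freq)
```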

Read/Write/Other formats

PAM can read and write tabular formats and MATSim xml. PAM can also write segmented OD matrices.

Benchmark or summary data and cross-tabulations can be extracted with the benchmarking CLI method. For more fine-grained control, pandas dataframes for specific data field(s), dimension(s) and aggregation function(s) can be generated with pam.report.benchmarks.create_benchmark. For example, pam.report.benchmarks.create_benchmark(population.trips_df(), dimensions = ['duration_category'], data_fields= ['freq'], aggfunc = [sum]) returns the frequency breakdown of trips by duration category.
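To give a feel for the kind of aggregation such a benchmark represents, the sketch below sums freq over trips grouped by one dimension. It is a plain-Python stand-in and makes no claim about how create_benchmark works internally; sum_by is a hypothetical helper, and it groups by mode rather than duration category because the example trips table above has no duration category column.

```python
from collections import defaultdict


def sum_by(rows, dimension, data_field):
    """Sum data_field over rows, grouped by the value of dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[data_field]
    return dict(totals)


# mode and freq columns from the example trips table.
trips = [
    {"mode": "pt", "freq": 4.54},
    {"mode": "pt", "freq": 4.54},
    {"mode": "car", "freq": 2.2},
    {"mode": "car", "freq": 2.2},
    {"mode": "pt", "freq": 12.33},
    {"mode": "walk", "freq": 12.33},
    {"mode": "pt", "freq": 12.33},
]

totals = sum_by(trips, "mode", "freq")
print(totals)
```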

Please get in touch if you would like additional support or feel free to add your own.