Converting the National Travel Survey into a Simple MATSim Format Population¶
This notebook demonstrates an example workflow for converting tabular diary data from the UK National Travel Survey (household attributes, person attributes and trip data) into a MATSim-formatted XML population for London households.
This includes:
- pre-processing of tabular inputs
- loading data into pam
- household sampling
- facility sampling
- preliminary investigation
- writing to xml
This example is highly simplified. Of particular note: the diary data used is spatially very aggregated (trip locations are aggregated to inner/outer London), which creates significant variance in the sampled trip lengths. Generally we would expect more precise spatial data to be used. Alternatively, the facility sampling step can be made more sophisticated to better account for known trip features such as mode and duration.
The diary data used is available from the UK Data Service (https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=5340) and is described here: http://doc.ukdataservice.ac.uk/doc/5340/mrdoc/pdf/5340_nts_user_guidance_1995-2016.pdf
import os
from copy import deepcopy
import geopandas as gp
import pandas as pd
from pam import read, write
from pam.core import Population
from pam.plot.stats import plot_activity_times, plot_leg_times
from pam.samplers.basic import freq_sample
from pam.samplers.spatial import RandomPointSampler
use_dummy_data = True
out_dir = "./outputs"  # outputs are written here
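Since everything below eventually writes into out_dir, it is worth creating the directory up front (a small addition to the original notebook):
os.makedirs(out_dir, exist_ok=True)  # no-op if the directory already exists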
# required inputs from the National Travel Survey
if use_dummy_data:
    households_csv = "./data/dummyNTS/householdeul2017.tab"
    individuals_csv = "./data/dummyNTS/individualeul2017.tab"
    trips_csv = "./data/dummyNTS/tripeul2017.tab"
else:
    households_csv = "~/Data/UKDA-5340-tab/tab/householdeul2017.tab"
    individuals_csv = "~/Data/UKDA-5340-tab/tab/individualeul2017.tab"
    trips_csv = "~/Data/UKDA-5340-tab/tab/tripeul2017.tab"
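A fail-fast guard (again an addition, not part of the original notebook) can catch missing input files before any parsing starts:
# stop early with a clear message if any NTS input file is missing
for path in (households_csv, individuals_csv, trips_csv):
    assert os.path.exists(os.path.expanduser(path)), f"missing input file: {path}"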
Load data¶
Load household data¶
- Load household data into a pandas DataFrame.
- Create some mappings of participation and weighting by household for use later. These are described in http://doc.ukdataservice.ac.uk/doc/5340/mrdoc/pdf/5340_nts_user_guidance_1995-2016.pdf
hh_in = pd.read_csv(
    households_csv,
    sep="\t",
    usecols=[
        "HouseholdID",
        "SurveyYear",
        "PSUID",
        "W2",
        "OutCom_B02ID",
        "HHIncome2002_B02ID",
        "AddressType_B01ID",
        "Ten1_B02ID",
        "Landlord_B01ID",
        "ResLength_B01ID",
        "HHoldCountry_B01ID",
        "HHoldGOR_B02ID",
        "HHoldNumAdults",
        "HHoldNumChildren",
        "HHoldNumPeople",
        "HHoldStruct_B02ID",
        "NumLicHolders",
        "HHoldEmploy_B01ID",
        "NumVehicles",
        "NumBike",
        "NumCar",
        "NumMCycle",
        "NumVanLorry",
        "NumCarVan",
        "WalkBus_B01ID",
        "Getbus_B01ID",
        "WalkRail_B01ID",
        "WalkRailAlt_B01ID",
        "HRPWorkStat_B02ID",
        "HRPSEGWorkStat_B01ID",
        "HHoldOAClass2011_B03ID",
        "Settlement2011EW_B03ID",
        "Settlement2011EW_B04ID",
    ],
)
# coerce these columns to numeric; non-numeric entries (eg blanks) become NaN
hh_in.HHIncome2002_B02ID = pd.to_numeric(hh_in.HHIncome2002_B02ID, errors="coerce")
hh_in.NumLicHolders = pd.to_numeric(hh_in.NumLicHolders, errors="coerce")
hh_in.NumVehicles = pd.to_numeric(hh_in.NumVehicles, errors="coerce")
hh_in.NumCar = pd.to_numeric(hh_in.NumCar, errors="coerce")
hh_in.NumMCycle = pd.to_numeric(hh_in.NumMCycle, errors="coerce")
hh_in.NumVanLorry = pd.to_numeric(hh_in.NumVanLorry, errors="coerce")
hh_in.NumCarVan = pd.to_numeric(hh_in.NumCarVan, errors="coerce")
hh_in.Settlement2011EW_B04ID = pd.to_numeric(hh_in.Settlement2011EW_B04ID, errors="coerce")
hh_in.head()
HouseholdID | SurveyYear | PSUID | W2 | OutCom_B02ID | HHIncome2002_B02ID | AddressType_B01ID | Ten1_B02ID | Landlord_B01ID | ResLength_B01ID | ... | NumCarVan | WalkBus_B01ID | Getbus_B01ID | WalkRail_B01ID | WalkRailAlt_B01ID | HRPWorkStat_B02ID | HRPSEGWorkStat_B01ID | HHoldOAClass2011_B03ID | Settlement2011EW_B03ID | Settlement2011EW_B04ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2002 | 1 | 1 | 1 | 2 | 3 | 1 | -10 | 3 | ... | 2 | 1 | 5 | 1 | -9 | 1 | 3 | -10 | 1 | 1 |
1 | 2 | 2002 | 1 | 1 | 1 | 1 | 3 | 1 | -10 | 8 | ... | 0 | 1 | 5 | 2 | -9 | 3 | 5 | -10 | 1 | 1 |
2 | 3 | 2002 | 1 | 1 | 1 | 3 | 3 | 1 | -10 | 8 | ... | 2 | 1 | 4 | 3 | -9 | 1 | 3 | -10 | 1 | 1 |
3 rows × 33 columns
participation_mapping = dict(zip(hh_in.HouseholdID, hh_in.OutCom_B02ID))
weight_mapping = dict(zip(hh_in.HouseholdID, hh_in.W2))
Load person data¶
Load person attributes data into a pandas DataFrame.
persons_in = pd.read_csv(
    individuals_csv,
    sep="\t",
    usecols=[
        "SurveyYear",
        "IndividualID",
        "HouseholdID",
        "PSUID",
        "VehicleID",
        "PersNo",
        "Age_B01ID",
        "OfPenAge_B01ID",
        "Sex_B01ID",
        "EdAttn1_B01ID",
        "EdAttn2_B01ID",
        "EdAttn3_B01ID",
        "DrivLic_B02ID",
        "CarAccess_B01ID",
        "DrivDisable_B01ID",
        "WkPlace_B01ID",
        "ES2000_B01ID",
        "NSSec_B03ID",
        "SC_B01ID",
        "Stat_B01ID",
        "SVise_B01ID",
        "EcoStat_B02ID",
        "PossHom_B01ID",
    ],
)
persons_in.head()
SurveyYear | IndividualID | HouseholdID | PSUID | VehicleID | PersNo | Age_B01ID | OfPenAge_B01ID | Sex_B01ID | EdAttn1_B01ID | ... | CarAccess_B01ID | DrivDisable_B01ID | WkPlace_B01ID | ES2000_B01ID | NSSec_B03ID | SC_B01ID | Stat_B01ID | SVise_B01ID | EcoStat_B02ID | PossHom_B01ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2002 | 1 | 1 | 1 | 1 | 1 | 13 | 2 | 2 | -10 | ... | 2 | -9 | 1 | 7 | 2 | 3 | 1 | 2 | 2 | 2 |
1 | 2002 | 2 | 1 | 1 | 2 | 2 | 13 | 2 | 1 | -10 | ... | 2 | -9 | 1 | 7 | 3 | 4 | 1 | 2 | 1 | 2 |
2 | 2002 | 3 | 1 | 1 | 3 | 4 | 2 | 1 | -10 | ... | 4 | -9 | -9 | -9 | -9 | -9 | -9 | -9 | -9 | -9 | |
3 | 2002 | 4 | 1 | 1 | 4 | 2 | 2 | 2 | -10 | ... | 4 | -9 | -9 | -9 | -9 | -9 | -9 | -9 | -9 | -9 | |
4 | 2002 | 5 | 2 | 1 | 1 | 18 | 1 | 2 | -10 | ... | 6 | 2 | -9 | 6 | 1 | 3 | 1 | 1 | 4 | -9 |
5 rows × 23 columns
Load trip data¶
- Load trip data into a pandas DataFrame.
- Apply some preliminary formatting.
- Replace headers so that we can use the pam read method:
- pid - person ID
- hid - household ID
- seq - trip sequence number
- hzone - household zone
- ozone - trip origin zone
- dzone - trip destination zone
- purp - trip purpose
- mode - trip mode
- tst - trip start time (minutes)
- tet - trip end time (minutes)
- freq - weighting for representative population
travel_diaries_in = pd.read_csv(
    trips_csv,
    sep="\t",
    usecols=[
        "TripID",
        "SurveyYear",
        "DayID",
        "IndividualID",
        "HouseholdID",
        "PSUID",
        "PersNo",
        "TravDay",
        "JourSeq",
        "ShortWalkTrip_B01ID",
        "NumStages",
        "MainMode_B04ID",
        "TripPurpFrom_B01ID",
        "TripPurpTo_B01ID",
        "TripPurpose_B04ID",
        "TripStart",
        "TripEnd",
        "TripOrigUA2009_B01ID",
        "TripDestUA2009_B01ID",
    ],
)
# coerce trip times to numeric; non-numeric entries become NaN
travel_diaries_in.TripStart = pd.to_numeric(travel_diaries_in.TripStart, errors="coerce")
travel_diaries_in.TripEnd = pd.to_numeric(travel_diaries_in.TripEnd, errors="coerce")
travel_diaries_in.head()
TripID | SurveyYear | DayID | IndividualID | HouseholdID | PSUID | PersNo | TravDay | JourSeq | ShortWalkTrip_B01ID | NumStages | MainMode_B04ID | TripPurpose_B04ID | TripPurpFrom_B01ID | TripPurpTo_B01ID | TripStart | TripEnd | TripOrigUA2009_B01ID | TripDestUA2009_B01ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 4 | 7 | 23 | 10 | 675 | 683 | 530 | 550 |
1 | 2 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 4 | 6 | 10 | 8 | 720 | 735 | 530 | 550 |
2 | 3 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 1 | 4 | 7 | 8 | 10 | 770 | 780 | 530 | 550 |
3 | 4 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 4 | 2 | 1 | 12 | 7 | 10 | 23 | 1110 | 1130 | 530 | 550 |
4 | 5 | 2002 | 3 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | 1 | 4 | 7 | 23 | 10 | 760 | 770 | 530 | 550 |
travel_diaries_in["participation"] = travel_diaries_in.HouseholdID.map(participation_mapping)
travel_diaries_in["hh_weight"] = travel_diaries_in.HouseholdID.map(weight_mapping)
travel_diaries = travel_diaries_in.loc[travel_diaries_in.participation.isin([1, 2])]
travel_diaries.head()
TripID | SurveyYear | DayID | IndividualID | HouseholdID | PSUID | PersNo | TravDay | JourSeq | ShortWalkTrip_B01ID | ... | MainMode_B04ID | TripPurpose_B04ID | TripPurpFrom_B01ID | TripPurpTo_B01ID | TripStart | TripEnd | TripOrigUA2009_B01ID | TripDestUA2009_B01ID | participation | hh_weight | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | ... | 4 | 7 | 23 | 10 | 675 | 683 | 530 | 550 | 1 | 1 |
1 | 2 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | ... | 4 | 6 | 10 | 8 | 720 | 735 | 530 | 550 | 1 | 1 |
2 | 3 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | ... | 4 | 7 | 8 | 10 | 770 | 780 | 530 | 550 | 1 | 1 |
3 | 4 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 4 | 2 | ... | 12 | 7 | 10 | 23 | 1110 | 1130 | 530 | 550 | 1 | 1 |
4 | 5 | 2002 | 3 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | ... | 4 | 7 | 23 | 10 | 760 | 770 | 530 | 550 | 1 | 1 |
5 rows × 21 columns
travel_diaries = travel_diaries.rename(
    columns={  # rename data
        "JourSeq": "seq",
        "TripOrigUA2009_B01ID": "ozone",
        "TripDestUA2009_B01ID": "dzone",
        "TripPurpFrom_B01ID": "oact",
        "TripPurpTo_B01ID": "dact",
        "MainMode_B04ID": "mode",
        "TripStart": "tst",
        "TripEnd": "tet",
    }
)
travel_diaries.head()
TripID | SurveyYear | DayID | IndividualID | HouseholdID | PSUID | PersNo | TravDay | seq | ShortWalkTrip_B01ID | ... | mode | TripPurpose_B04ID | oact | dact | tst | tet | ozone | dzone | participation | hh_weight | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | ... | 4 | 7 | 23 | 10 | 675 | 683 | 530 | 550 | 1 | 1 |
1 | 2 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | ... | 4 | 6 | 10 | 8 | 720 | 735 | 530 | 550 | 1 | 1 |
2 | 3 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | ... | 4 | 7 | 8 | 10 | 770 | 780 | 530 | 550 | 1 | 1 |
3 | 4 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 4 | 2 | ... | 12 | 7 | 10 | 23 | 1110 | 1130 | 530 | 550 | 1 | 1 |
4 | 5 | 2002 | 3 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | ... | 4 | 7 | 23 | 10 | 760 | 770 | 530 | 550 | 1 | 1 |
5 rows × 21 columns
travel_diaries.dtypes
TripID int64
SurveyYear int64
DayID int64
IndividualID int64
HouseholdID int64
PSUID int64
PersNo int64
TravDay int64
seq int64
ShortWalkTrip_B01ID int64
NumStages int64
mode int64
TripPurpose_B04ID int64
oact int64
dact int64
tst int64
tet int64
ozone int64
dzone int64
participation int64
hh_weight int64
dtype: object
def check_uniques(df):
    """Print each column name and, for low-cardinality columns, its unique values."""
    for c in df.columns:
        print(c)
        n = df[c].nunique()
        if n < 1000:
            print(df[c].unique())
check_uniques(travel_diaries)
TripID
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149]
SurveyYear
[2002]
DayID
[ 2 3 5 6 7 8 9 10 11 12 13 14 16 17 19 20 21 23 24 26 27 28 29 30 33 35 36 37 38 39 40 41 42 43 44 45 46 47 49]
IndividualID
[1 2 3 4 5 6 7]
HouseholdID
[1 2 3]
PSUID
[1]
PersNo
[1 2 3 4]
TravDay
[2 3 5 6 7 1 4]
seq
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
ShortWalkTrip_B01ID
[2 1]
NumStages
[1 4 3 2]
mode
[ 4 12 3 1 11 7]
TripPurpose_B04ID
[7 6 3 4 5 1 2]
oact
[23 10 8 20 5 9 22 1 12 4 3 6 14 2 18]
dact
[10 8 23 20 5 9 22 1 12 4 3 6 14 18]
tst
[ 675 720 770 1110 760 790 810 845 865 908 922 940 1015 1080 905 738 755 1050 1220 1270 525 546 745 764 1060 833 1320 530 1265 605 620 960 970 1075 360 350 875 898 1310 345 870 900 923 1311 990 585 560 840 438 1010 1105 1112 440 965 921 980 1115 910 820 853 1013 1023 1298 1306 702 975 1087 495 1070 1052]
tet
[ 683 735 780 1130 770 805 825 852 873 914 926 955 1032 1108 785 917 752 761 1076 1223 1275 540 556 760 774 1127 836 1325 535 1272 608 675 690 962 972 1078 363 1080 355 890 912 931 1316 348 887 915 933 1314 855 1020 795 730 750 885 465 1036 1110 1120 984 950 1005 471 922 830 862 906 1023 1034 1305 1312 627 710 468 986 1085 1091 550 1100 1115 559]
ozone
[530]
dzone
[550]
participation
[1]
hh_weight
[1]
Area Mapping¶
The NTS documentation refers to 'modified' 2009 Unitary Authorities. The unmodified 2017 UAs are included below for reference; the 2017 UA names are similar but not identical to the NTS mappings.
We have built our own geometry:
area_path = "./data/dummyNTS/NTSareas.geojson"
areas = gp.read_file(area_path)
areas.head()
id | name | geometry | |
---|---|---|---|
0 | 530.0 | West Midlands | POLYGON ((-1.95366 52.66273, -1.95457 52.66264... |
1 | 550.0 | West Yorkshire | POLYGON ((-1.87994 53.95590, -1.87982 53.95681... |
areas.plot(figsize=(6, 6))
<Axes: >
Clean out incomplete plans¶
def remove_broken_plans(plan):
    """Return None (ie drop the plan) if it has missing values or missing (-8) areas."""
    if plan.isnull().values.any():
        return None
    for col in ["ozone", "dzone"]:
        if -8 in list(plan[col]):  # -8 is the NTS code for a missing area
            return None
    return plan
clean_travel_diaries = (
    travel_diaries.groupby(["IndividualID", "TravDay"], group_keys=False)
    .apply(remove_broken_plans)
    .reset_index(drop=True)
)
/var/folders/6n/0h9tynqn581fxsytcc863h94tm217b/T/ipykernel_94296/1629738967.py:3: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning. .apply(remove_broken_plans)
clean_travel_diaries.head()
TripID | SurveyYear | DayID | IndividualID | HouseholdID | PSUID | PersNo | TravDay | seq | ShortWalkTrip_B01ID | ... | mode | TripPurpose_B04ID | oact | dact | tst | tet | ozone | dzone | participation | hh_weight | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | ... | 4 | 7 | 23 | 10 | 675 | 683 | 530 | 550 | 1 | 1 |
1 | 2 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | ... | 4 | 6 | 10 | 8 | 720 | 735 | 530 | 550 | 1 | 1 |
2 | 3 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | ... | 4 | 7 | 8 | 10 | 770 | 780 | 530 | 550 | 1 | 1 |
3 | 4 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 4 | 2 | ... | 12 | 7 | 10 | 23 | 1110 | 1130 | 530 | 550 | 1 | 1 |
4 | 5 | 2002 | 3 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | ... | 4 | 7 | 23 | 10 | 760 | 770 | 530 | 550 | 1 | 1 |
5 rows × 21 columns
print(len(travel_diaries))
print(len(clean_travel_diaries))
149 149
Build Mappings and apply to common fields¶
We simplify key trip variables such as mode and activity.
mode_mapping = {
    1: "walk",
    2: "bike",
    3: "car",  # 'Car/van driver'
    4: "car",  # 'Car/van passenger'
    5: "car",  # 'Motorcycle'
    6: "car",  # 'Other private transport'
    7: "pt",  # 'Bus in London'
    8: "pt",  # 'Other local bus'
    9: "pt",  # 'Non-local bus'
    10: "pt",  # 'London Underground'
    11: "pt",  # 'Surface Rail'
    12: "car",  # 'Taxi/minicab'
    13: "pt",  # 'Other public transport'
    -10: "DEAD",
    -8: "NA",
}
purp_mapping = {
    1: "work",
    2: "work",  # 'In course of work'
    3: "education",
    4: "shop",  # 'Food shopping'
    5: "shop",  # 'Non food shopping'
    6: "medical",  # 'Personal business medical'
    7: "other",  # 'Personal business eat/drink'
    8: "other",  # 'Personal business other'
    9: "other",  # 'Eat/drink with friends'
    10: "visit",  # 'Visit friends'
    11: "other",  # 'Other social'
    12: "other",  # 'Entertain/ public activity'
    13: "other",  # 'Sport: participate'
    14: "home",  # 'Holiday: base'
    15: "other",  # 'Day trip/just walk'
    16: "other",  # 'Other non-escort'
    17: "escort",  # 'Escort home'
    18: "escort",  # 'Escort work'
    19: "escort",  # 'Escort in course of work'
    20: "escort",  # 'Escort education'
    21: "escort",  # 'Escort shopping/personal business'
    22: "escort",  # 'Other escort'
    23: "home",  # 'Home'
    -10: "DEAD",
    -8: "NA",
}
clean_travel_diaries["mode"] = clean_travel_diaries["mode"].map(mode_mapping)
clean_travel_diaries["oact"] = clean_travel_diaries["oact"].map(purp_mapping)
clean_travel_diaries["dact"] = clean_travel_diaries["dact"].map(purp_mapping)
Reweight and Split Days¶
To get the most from our small sample, we treat each individual diary day as a new person. To maintain the original household weighting, we reduce each person's frequency accordingly: for example, a weight-1 household observed on five travel days becomes five split households, each with frequency 0.2.
# reweight and split ids for unique days
def reweight(group):
    """
    Reweight based on multiple diary days, ie if an agent has two diary days, we treat these
    as two unique agents, so we halve the original weighting.
    """
    group["freq"] = group.hh_weight / group.DayID.nunique()
    return group
trips = clean_travel_diaries.groupby("IndividualID", group_keys=False).apply(reweight)
trips["pid"] = [f"{p}_{d}" for p, d in zip(trips.IndividualID, trips.TravDay)]
trips["hid"] = [f"{h}_{d}" for h, d in zip(trips.HouseholdID, trips.TravDay)]
/var/folders/6n/0h9tynqn581fxsytcc863h94tm217b/T/ipykernel_94296/963563098.py:13: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning. trips = clean_travel_diaries.groupby("IndividualID", group_keys=False).apply(reweight)
trips.head()
TripID | SurveyYear | DayID | IndividualID | HouseholdID | PSUID | PersNo | TravDay | seq | ShortWalkTrip_B01ID | ... | dact | tst | tet | ozone | dzone | participation | hh_weight | freq | pid | hid | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | ... | visit | 675 | 683 | 530 | 550 | 1 | 1 | 0.2 | 1_2 | 1_2 |
1 | 2 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | ... | other | 720 | 735 | 530 | 550 | 1 | 1 | 0.2 | 1_2 | 1_2 |
2 | 3 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | ... | visit | 770 | 780 | 530 | 550 | 1 | 1 | 0.2 | 1_2 | 1_2 |
3 | 4 | 2002 | 2 | 1 | 1 | 1 | 1 | 2 | 4 | 2 | ... | home | 1110 | 1130 | 530 | 550 | 1 | 1 | 0.2 | 1_2 | 1_2 |
4 | 5 | 2002 | 3 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | ... | visit | 760 | 770 | 530 | 550 | 1 | 1 | 0.2 | 1_3 | 1_3 |
5 rows × 24 columns
def expand_days(
    trips, target, trips_on="Diary_number", target_on="Diary_number", new_id="pid", trim=True
):
    """
    Expand the target df based on a mapping between the trips target_on and new_id columns.
    Sets the index to new_id. TODO: this implementation is very slow.
    """
    print("Building mapping.")
    mapping = {}
    for i, person in trips.groupby(target_on):
        mapping[i] = list(set(person[new_id]))
    n = len(mapping)
    if trim:
        print("Trimming target.")
        selection = set(trips[trips_on])
        target = target.loc[target[target_on].isin(selection)]
    expanded = pd.DataFrame()
    for p, (i, ids) in enumerate(mapping.items()):
        if not p % 10:
            print(f"Building expanded data {p}/{n}", end="\r", flush=True)
        for idx in ids:
            split = target.loc[target[target_on] == i]
            expanded = pd.concat([expanded, split.assign(**{new_id: idx})])
    expanded = expanded.set_index(new_id)
    print("Done")
    return expanded
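The docstring above flags the row-by-row concat as slow. As a sketch of one possible speed-up (an addition, not part of the original notebook; expand_days_fast is a hypothetical helper and assumes trips_on and target_on name the same key column, as in both calls below), the expansion can be expressed as a single merge:
def expand_days_fast(trips, target, key, new_id):
    # one row per unique (key, new_id) pair, eg per household per travel day
    pairs = trips[[key, new_id]].drop_duplicates()
    # an inner merge both trims the target and duplicates rows per diary day
    return target.merge(pairs, on=key, how="inner").set_index(new_id)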
hhs = expand_days(trips, hh_in, trips_on="HouseholdID", target_on="HouseholdID", new_id="hid")
Building mapping.
Trimming target.
Building expanded data 0/3
Done
hhs.head()
HouseholdID | SurveyYear | PSUID | W2 | OutCom_B02ID | HHIncome2002_B02ID | AddressType_B01ID | Ten1_B02ID | Landlord_B01ID | ResLength_B01ID | ... | NumCarVan | WalkBus_B01ID | Getbus_B01ID | WalkRail_B01ID | WalkRailAlt_B01ID | HRPWorkStat_B02ID | HRPSEGWorkStat_B01ID | HHoldOAClass2011_B03ID | Settlement2011EW_B03ID | Settlement2011EW_B04ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hid | |||||||||||||||||||||
1_6 | 1 | 2002 | 1 | 1 | 1 | 2 | 3 | 1 | -10 | 3 | ... | 2 | 1 | 5 | 1 | -9 | 1 | 3 | -10 | 1 | 1 |
1_1 | 1 | 2002 | 1 | 1 | 1 | 2 | 3 | 1 | -10 | 3 | ... | 2 | 1 | 5 | 1 | -9 | 1 | 3 | -10 | 1 | 1 |
1_2 | 1 | 2002 | 1 | 1 | 1 | 2 | 3 | 1 | -10 | 3 | ... | 2 | 1 | 5 | 1 | -9 | 1 | 3 | -10 | 1 | 1 |
1_3 | 1 | 2002 | 1 | 1 | 1 | 2 | 3 | 1 | -10 | 3 | ... | 2 | 1 | 5 | 1 | -9 | 1 | 3 | -10 | 1 | 1 |
1_7 | 1 | 2002 | 1 | 1 | 1 | 2 | 3 | 1 | -10 | 3 | ... | 2 | 1 | 5 | 1 | -9 | 1 | 3 | -10 | 1 | 1 |
5 rows × 33 columns
people = expand_days(
    trips, persons_in, trips_on="IndividualID", target_on="IndividualID", new_id="pid"
)
Building mapping.
Trimming target.
Building expanded data 0/7
Done
people.head()
SurveyYear | IndividualID | HouseholdID | PSUID | VehicleID | PersNo | Age_B01ID | OfPenAge_B01ID | Sex_B01ID | EdAttn1_B01ID | ... | CarAccess_B01ID | DrivDisable_B01ID | WkPlace_B01ID | ES2000_B01ID | NSSec_B03ID | SC_B01ID | Stat_B01ID | SVise_B01ID | EcoStat_B02ID | PossHom_B01ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pid | |||||||||||||||||||||
1_6 | 2002 | 1 | 1 | 1 | 1 | 1 | 13 | 2 | 2 | -10 | ... | 2 | -9 | 1 | 7 | 2 | 3 | 1 | 2 | 2 | 2 |
1_2 | 2002 | 1 | 1 | 1 | 1 | 1 | 13 | 2 | 2 | -10 | ... | 2 | -9 | 1 | 7 | 2 | 3 | 1 | 2 | 2 | 2 |
1_3 | 2002 | 1 | 1 | 1 | 1 | 1 | 13 | 2 | 2 | -10 | ... | 2 | -9 | 1 | 7 | 2 | 3 | 1 | 2 | 2 | 2 |
1_7 | 2002 | 1 | 1 | 1 | 1 | 1 | 13 | 2 | 2 | -10 | ... | 2 | -9 | 1 | 7 | 2 | 3 | 1 | 2 | 2 | 2 |
1_5 | 2002 | 1 | 1 | 1 | 1 | 1 | 13 | 2 | 2 | -10 | ... | 2 | -9 | 1 | 7 | 2 | 3 | 1 | 2 | 2 | 2 |
5 rows × 23 columns
Load into PAM¶
We load the pandas-formatted data into pam using the pam.read.load_travel_diary read method, which here uses the from-to activity parser (the 'oact' and 'dact' columns). We then do some very preliminary validation of plans and assurance.
trips.tst = trips.tst.astype(int)
trips.tet = trips.tet.astype(int)
population = read.load_travel_diary(
    trips=trips, persons_attributes=people, hhs_attributes=hhs, trip_freq_as_person_freq=True
)
Using from-to activity parser using 'oact' and 'dact' columns
Adding pid->hh mapping to persons_attributes from trips.
Unable to load household area ('hzone') - not found in trips_diary or unable to build from attributes. Pam will try to infer home location from activities, but this behaviour is not recommended.
Using freq of 'None' for all trips.
Using freq of 'None' for all households.
Person pid:2_5 hid:1_5 plan does not start with 'home' activity: work
Person pid:2_6 hid:1_6 plan does not start with 'home' activity: work
Person pid:3_6 hid:1_6 plan does not start with 'home' activity: education
Person pid:2_7 hid:1_7 plan does not start with 'home' activity: work
population.fix_plans()
# this should be replaced with a more direct method
for hh in population.households.values():
    for p in hh.people.values():
        p.validate()
population.size # this also accounts for the weighting
3.285714285714285
population.stats
{'num_households': 18, 'num_people': 39, 'num_activities': 175, 'num_legs': 136}
population.activity_classes
{'education', 'escort', 'home', 'medical', 'other', 'shop', 'visit', 'work'}
population.mode_classes
{'car', 'pt', 'walk'}
plot_activity_times(population)
plot_leg_times(population)
# night shift @ 2016008863_6
hh = population.random_household()
hh.print()
hh.plot()
Household: 1_4 {'HouseholdID': 1, 'SurveyYear': 2002, 'PSUID': 1, 'W2': 1, 'OutCom_B02ID': 1, 'HHIncome2002_B02ID': 2, 'AddressType_B01ID': 3, 'Ten1_B02ID': 1, 'Landlord_B01ID': -10, 'ResLength_B01ID': 3, 'HHoldCountry_B01ID': 1, 'HHoldGOR_B02ID': 7, 'HHoldNumAdults': 2, 'HHoldNumChildren': 2, 'HHoldNumPeople': 4, 'HHoldStruct_B02ID': 5, 'NumLicHolders': 2, 'HHoldEmploy_B01ID': 5, 'NumVehicles': 2, 'NumBike': 0, 'NumCar': 2, 'NumMCycle': 0, 'NumVanLorry': 0, 'NumCarVan': 2, 'WalkBus_B01ID': 1, 'Getbus_B01ID': 5, 'WalkRail_B01ID': 1, 'WalkRailAlt_B01ID': -9, 'HRPWorkStat_B02ID': 1, 'HRPSEGWorkStat_B01ID': 3, 'HHoldOAClass2011_B03ID': -10, 'Settlement2011EW_B03ID': 1, 'Settlement2011EW_B04ID': 1}
Person: 2_4 {'SurveyYear': 2002, 'IndividualID': 2, 'HouseholdID': 1, 'PSUID': 1, 'VehicleID': '2', 'PersNo': 2, 'Age_B01ID': 13, 'OfPenAge_B01ID': 2, 'Sex_B01ID': 1, 'EdAttn1_B01ID': -10, 'EdAttn2_B01ID': -10, 'EdAttn3_B01ID': -10, 'DrivLic_B02ID': 1, 'CarAccess_B01ID': 2, 'DrivDisable_B01ID': -9, 'WkPlace_B01ID': 1, 'ES2000_B01ID': 7, 'NSSec_B03ID': 3, 'SC_B01ID': 4, 'Stat_B01ID': 1, 'SVise_B01ID': 2, 'EcoStat_B02ID': 1, 'PossHom_B01ID': 2}
0: Activity(act:home, location:530, time:00:00:00 --> 16:00:00, duration:16:00:00)
1: Leg(mode:car, area:530 --> 550, time:16:00:00 --> 16:02:00, duration:0:02:00)
2: Activity(act:shop, location:550, time:16:02:00 --> 16:10:00, duration:0:08:00)
3: Leg(mode:car, area:550 --> 550, time:16:10:00 --> 16:12:00, duration:0:02:00)
4: Activity(act:home, location:550, time:16:12:00 --> 17:55:00, duration:1:43:00)
5: Leg(mode:car, area:550 --> 550, time:17:55:00 --> 17:58:00, duration:0:03:00)
6: Activity(act:work, location:550, time:17:58:00 --> 00:00:00, duration:6:02:00)
population.activity_classes
{'education', 'escort', 'home', 'medical', 'other', 'shop', 'visit', 'work'}
Sample the Population¶
We sample a very small population based on the given NTS household weightings.
population_sample = Population()
for _hid, household in population.households.items():
    av_hh_weight = household.freq  # this is currently the average of person freqs in the hh
    freq = freq_sample(av_hh_weight, 10)
    for idx in range(freq):
        hh = deepcopy(household)
        hh.hid = f"{hh.hid}_{idx}"
        hh.people = {}
        for pid, person in household.people.items():
            p = deepcopy(person)
            p.pid = f"{pid}_{idx}"
            hh.add(p)
        population_sample.add(hh)
population_sample.size
6.214285714285718
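Note on freq_sample above (a rough illustration of its probabilistic rounding, not part of the original notebook): it scales the weight by the sample size, keeps the integer part, and adds one further clone with probability equal to the fractional remainder, so the expected total is preserved:
# eg freq_sample(0.25, 10) returns 2 or 3, with an expected value of 2.5
print([freq_sample(0.25, 10) for _ in range(5)])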
Facility Sampling¶
The facilities input is prepared using a separate project called the OSM Facility Sampler (OSMFS), which would be better named the OSM Facility Extractor. We use it to extract viable activity locations for each activity type in each zone. This project is not currently open source, but is described below:
OSMFS joins OSM data with the geographies of an area to create a mapping between zones, activities and facility locations (points). This is output as a geojson:
{"type": "FeatureCollection", "features": [{"id": "0", "type": "Feature", "properties": {"activity": "other"}, "geometry": {"type": "Point", "coordinates": [-4.5235751, 54.1698685]}},
todo: the current methodology does not support shared facilities, ie facilities with more than one activity type (schools are places of education and work, for example).
todo: the above geojson has to be rejoined with the geography to create a spatial sampler. This is a duplicated operation which could be included in the Bench output, eg:
zone_id: activity: (id, point)
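As a sketch of that suggested structure (an addition; build_facility_lookup is a hypothetical helper, and the sjoin predicate argument is named op in older geopandas versions):
def build_facility_lookup(facilities, zones):
    # spatially join facility points to the zones containing them (frames must share a CRS)
    joined = gp.sjoin(facilities, zones, how="inner", predicate="within")
    # nest as zone_id -> activity -> list of candidate points
    lookup = {}
    for (zone, act), group in joined.groupby(["index_right", "activity"]):
        lookup.setdefault(zone, {})[act] = list(group.geometry)
    return lookup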
def load_facilities(path, from_crs="EPSG:4326", to_crs="EPSG:27700"):
    facilities = gp.read_file(path)
    facilities.crs = from_crs
    facilities = facilities.to_crs(to_crs)
    return facilities
def load_zones(zones_path, from_crs="EPSG:27700", to_crs="EPSG:27700"):
    zones = gp.read_file(zones_path)
    zones = zones.set_index("id")
    if from_crs != to_crs:
        zones.crs = from_crs
        zones = zones.to_crs(to_crs)
    return zones
Random Sampler¶
In the absence of a facility sampler, we can use random sampling instead: RandomPointSampler draws random points within each zone's geometry, ignoring activity type (sampler.sample(530, None) below returns a random point within zone 530).
zones = load_zones(area_path, from_crs="EPSG:4326")
sampler = RandomPointSampler(geoms=zones)
sampler.sample(530, None)
population_sample.sample_locs(sampler)
person = population_sample.random_person()
person.plot()
person.print()
Person: 2_7_1 {'SurveyYear': 2002, 'IndividualID': 2, 'HouseholdID': 1, 'PSUID': 1, 'VehicleID': '2', 'PersNo': 2, 'Age_B01ID': 13, 'OfPenAge_B01ID': 2, 'Sex_B01ID': 1, 'EdAttn1_B01ID': -10, 'EdAttn2_B01ID': -10, 'EdAttn3_B01ID': -10, 'DrivLic_B02ID': 1, 'CarAccess_B01ID': 2, 'DrivDisable_B01ID': -9, 'WkPlace_B01ID': 1, 'ES2000_B01ID': 7, 'NSSec_B03ID': 3, 'SC_B01ID': 4, 'Stat_B01ID': 1, 'SVise_B01ID': 2, 'EcoStat_B02ID': 1, 'PossHom_B01ID': 2}
0: Activity(act:work, location:POINT (414048.93555820326 293172.88327221776), time:00:00:00 --> 05:45:00, duration:5:45:00)
1: Leg(mode:car, area:POINT (414048.93555820326 293172.88327221776) --> POINT (420610.5860599916 411636.7024491086), time:05:45:00 --> 05:48:00, duration:0:03:00)
2: Activity(act:home, location:POINT (420610.5860599916 411636.7024491086), time:05:48:00 --> 14:30:00, duration:8:42:00)
3: Leg(mode:car, area:POINT (420610.5860599916 411636.7024491086) --> POINT (408972.8550773945 448442.45591133967), time:14:30:00 --> 14:47:00, duration:0:17:00)
4: Activity(act:escort, location:POINT (408972.8550773945 448442.45591133967), time:14:47:00 --> 15:00:00, duration:0:13:00)
5: Leg(mode:car, area:POINT (408972.8550773945 448442.45591133967) --> POINT (403169.4387876878 417483.48085058795), time:15:00:00 --> 15:15:00, duration:0:15:00)
6: Activity(act:escort, location:POINT (403169.4387876878 417483.48085058795), time:15:15:00 --> 15:23:00, duration:0:08:00)
7: Leg(mode:car, area:POINT (403169.4387876878 417483.48085058795) --> POINT (420610.5860599916 411636.7024491086), time:15:23:00 --> 15:33:00, duration:0:10:00)
8: Activity(act:home, location:POINT (420610.5860599916 411636.7024491086), time:15:33:00 --> 21:51:00, duration:6:18:00)
9: Leg(mode:car, area:POINT (420610.5860599916 411636.7024491086) --> POINT (442873.50833292294 421061.950670316), time:21:51:00 --> 21:54:00, duration:0:03:00)
10: Activity(act:work, location:POINT (442873.50833292294 421061.950670316), time:21:54:00 --> 00:00:00, duration:2:06:00)
Write to Disk¶
- write MATSim formats to disk (plans and attributes)
- write csv and geojson summaries to disk
- write MATSim formatted facilities to disk
comment = "NTS london prelim 24nov2020 epsg27700"
write.write_matsim(
    population_sample, plans_path=os.path.join(out_dir, "plans.xml"), comment=comment
)
population_sample.to_csv(out_dir, crs="EPSG:27700", to_crs="EPSG:4326")
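As a final sanity check (an addition to the original notebook), list what landed in the output directory:
print(sorted(os.listdir(out_dir)))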