Introduction
This notebook demonstrates the creation of a synthetic population using an Iterative Proportional Fitting (IPF) approach.
IPF is a statistical technique that iteratively adjusts the cells of a matrix (the joint distribution) until its totals along each dimension match a set of expected one-dimensional distributions (the marginals). We use this approach to sample from a set of persons with certain demographic attributes so that the observed distributions in each zone (e.g. from a census) are met.
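As a rough illustration of the idea, and not of how `pam.planner.ipf` is implemented, the sketch below fits a two-dimensional matrix by alternately rescaling its rows and columns. The function name `ipf_2d` and its `max_iter` and `tol` parameters are hypothetical, and the sketch assumes a strictly positive seed and marginals that share the same total.

```python
import numpy as np


def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Scale a 2-D seed matrix until its row and column sums match the targets.

    Hypothetical helper for illustration only; assumes a strictly positive seed.
    """
    fitted = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row so that the row sums match the row marginal.
        fitted *= (row_targets / fitted.sum(axis=1))[:, np.newaxis]
        # Scale each column so that the column sums match the column marginal.
        fitted *= col_targets / fitted.sum(axis=0)
        # After the column step, only the row sums can still be off.
        if np.abs(fitted.sum(axis=1) - row_targets).max() < tol:
            break
    return fitted


# Zone "a" marginals from this notebook: hhincome (high, medium, low) by age (minor, adult).
seed = np.ones((3, 2))
fitted = ipf_2d(seed, row_targets=[30, 40, 30], col_targets=[40, 60])
print(fitted.round(2))
```

With a uniform seed, each fitted cell is simply the product of its two marginals divided by the total, for example 30 × 40 / 100 = 12 for high-income minors.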
import itertools
import pandas as pd
from pam.core import Person, Population
from pam.planner import ipf
Data
We start with some demographic data for each zone, as shown below. The zones dataframe uses the zone name as its index, and its columns follow a variable|class naming convention. Alternatively, the columns could be provided as a MultiIndex, with the first level being the variable and the second level the class. The controlled variables should be part of the seed population's attributes.
zone_data = pd.DataFrame(
    {
        "zone": ["a", "b"],
        "hhincome|high": [30, 80],
        "hhincome|medium": [40, 100],
        "hhincome|low": [30, 20],
        "age|minor": [40, 90],
        "age|adult": [60, 110],
        "carAvail|yes": [90, 180],
        "carAvail|no": [10, 20],
    }
).set_index("zone")
zone_data
zone | hhincome|high | hhincome|medium | hhincome|low | age|minor | age|adult | carAvail|yes | carAvail|no
---|---|---|---|---|---|---|---
a | 30 | 40 | 30 | 40 | 60 | 90 | 10
b | 80 | 100 | 20 | 90 | 110 | 180 | 20
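The MultiIndex column layout mentioned above can be sketched with plain pandas by splitting each flat label on the `|` separator. This is only an illustration on two columns; the level names `variable` and `class` are assumptions, and whether `ipf.generate_population` expects particular level names is not stated in this notebook.

```python
import pandas as pd

# Two illustrative columns in the flat "variable|class" convention.
flat = pd.DataFrame(
    {"zone": ["a", "b"], "age|minor": [40, 90], "age|adult": [60, 110]}
).set_index("zone")

# Split each label into (variable, class) and build a two-level column index.
multi = flat.copy()
multi.columns = pd.MultiIndex.from_tuples(
    [tuple(label.split("|")) for label in flat.columns],
    names=["variable", "class"],  # level names are illustrative assumptions
)

# Selecting on the first level returns all classes of one variable.
print(multi["age"])
```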
Let's create a seed population which includes every possible combination of attributes:
dims = {"hhincome": ["low", "medium", "high"], "age": ["minor", "adult"], "carAvail": ["yes", "no"]}
list(itertools.product(*dims.values()))
# %%
seed_pop = Population()
# Add one person per attribute combination; the home zone is left unset in the seed.
for n, (hhincome, age, carAvail) in enumerate(itertools.product(*dims.values())):
    person = Person(
        pid=n,
        attributes={"hhincome": hhincome, "age": age, "carAvail": carAvail, "hzone": pd.NA},
    )
    seed_pop.add(person)
seed_pop.random_person().attributes
{'hhincome': 'low', 'age': 'adult', 'carAvail': 'no', 'hzone': <NA>}
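To make the size of the seed explicit, the following sketch enumerates the same combinations as plain dictionaries, without any PAM objects. The variable name `combinations` is introduced here for illustration only.

```python
import itertools

dims = {
    "hhincome": ["low", "medium", "high"],
    "age": ["minor", "adult"],
    "carAvail": ["yes", "no"],
}

# One dict per combination, keyed by variable name.
combinations = [dict(zip(dims, values)) for values in itertools.product(*dims.values())]

print(len(combinations))  # 3 * 2 * 2 = 12 combinations
print(combinations[0])
```

The seed therefore holds 12 persons, one for each cell of the joint distribution.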
IPF
Now let's create a population that matches the demographic distribution of each zone:
pop = ipf.generate_population(seed_pop, zone_data)
The resulting population comprises 300 persons, the sum of the zone totals in the zone data (100 in zone a and 200 in zone b):
len(pop)
300
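The figure of 300 can be derived from the zone data itself: within each zone, the classes of any one variable should add up to the same number of persons. The sketch below checks this with plain pandas; it rebuilds `zone_data` so that it runs on its own, and the name `totals` is introduced here for illustration.

```python
import pandas as pd

zone_data = pd.DataFrame(
    {
        "zone": ["a", "b"],
        "hhincome|high": [30, 80],
        "hhincome|medium": [40, 100],
        "hhincome|low": [30, 20],
        "age|minor": [40, 90],
        "age|adult": [60, 110],
        "carAvail|yes": [90, 180],
        "carAvail|no": [10, 20],
    }
).set_index("zone")

# Group the columns by the variable part of "variable|class" and sum the classes.
variables = [label.split("|")[0] for label in zone_data.columns]
totals = zone_data.T.groupby(variables).sum().T
print(totals)

# Every variable should imply the same number of persons in a given zone.
assert (totals.nunique(axis=1) == 1).all()

# Total persons across zones, taken from any one variable.
print(int(totals.iloc[:, 0].sum()))
```

If the marginals of different variables disagreed on a zone's total, no joint distribution could match all of them exactly, so a check of this kind is a cheap guard before fitting.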
And each person in the population is assigned a household zone:
pop.random_person().attributes
{'hhincome': 'high', 'age': 'minor', 'carAvail': 'yes', 'hzone': 'b'}
The resulting joint demographic distributions in each zone are shown below:
summary = (
    pd.DataFrame([person.attributes for hid, pid, person in pop.people()])
    .value_counts()
    .reorder_levels([3, 0, 1, 2])
    .sort_index()
)
summary
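hzone | hhincome | age | carAvail | count
---|---|---|---|---
a | high | adult | no | 2
a | high | adult | yes | 16
a | high | minor | no | 1
a | high | minor | yes | 11
a | low | adult | no | 2
a | low | adult | yes | 16
a | low | minor | no | 1
a | low | minor | yes | 11
a | medium | adult | no | 2
a | medium | adult | yes | 22
a | medium | minor | no | 2
a | medium | minor | yes | 14
b | high | adult | no | 4
b | high | adult | yes | 40
b | high | minor | no | 4
b | high | minor | yes | 32
b | low | adult | no | 1
b | low | adult | yes | 10
b | low | minor | no | 1
b | low | minor | yes | 8
b | medium | adult | no | 6
b | medium | adult | yes | 50
b | medium | minor | no | 4
b | medium | minor | yes | 40
Name: count, dtype: int64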
The aggregate demographic distributions match the marginals in zone_data to within a tolerance of one person (for example, zone b has 89 minors against a target of 90):
summary_aggregate = []
for var in dims:
    # Sum the joint counts over the other variables to get this variable's marginals.
    df = summary.groupby(level=["hzone", var]).sum().unstack(level=var)
    df.columns = [f"{var}|{x}" for x in df.columns]
    summary_aggregate.append(df)
summary_aggregate = pd.concat(summary_aggregate, axis=1)
summary_aggregate.index.name = "zone"
summary_aggregate = summary_aggregate[zone_data.columns]  # same column order as zone_data
summary_aggregate
zone | hhincome|high | hhincome|medium | hhincome|low | age|minor | age|adult | carAvail|yes | carAvail|no
---|---|---|---|---|---|---|---
a | 30 | 40 | 30 | 40 | 60 | 90 | 10
b | 80 | 100 | 20 | 89 | 111 | 180 | 20
pd.testing.assert_frame_equal(
    summary_aggregate, zone_data, check_exact=False, atol=1
)  # test passes
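To see where the generated population departs from the targets, rather than only whether it stays within tolerance, the two tables can be subtracted. The sketch below hard-codes the values shown in this notebook so that it runs on its own; the names `target`, `generated`, `diff` and `off_target` are introduced here for illustration.

```python
import pandas as pd

columns = [
    "hhincome|high", "hhincome|medium", "hhincome|low",
    "age|minor", "age|adult", "carAvail|yes", "carAvail|no",
]
zones = pd.Index(["a", "b"], name="zone")

# Target marginals (zone_data) and the marginals of the generated population,
# copied from the tables in this notebook.
target = pd.DataFrame(
    [[30, 40, 30, 40, 60, 90, 10], [80, 100, 20, 90, 110, 180, 20]],
    index=zones, columns=columns,
)
generated = pd.DataFrame(
    [[30, 40, 30, 40, 60, 90, 10], [80, 100, 20, 89, 111, 180, 20]],
    index=zones, columns=columns,
)

# Signed difference per zone and class: positive means over-represented.
diff = generated - target

# Keep only the columns where at least one zone is off target.
off_target = diff.loc[:, (diff != 0).any()]
print(off_target)

# Largest absolute deviation, to compare against the chosen tolerance.
print(int(diff.abs().to_numpy().max()))
```

Here the only deviation is in the age split of zone b, which is one minor short and one adult over, consistent with the `atol=1` tolerance used in the assertion.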