Introduction

This notebook demonstrates the creation of a synthetic population using an Iterative Proportional Fitting (IPF) approach.

IPF is a statistical technique that iteratively adjusts the cells of a matrix (the joint distribution) so that its totals along each dimension match a set of target one-dimensional distributions (the marginals). We use it to sample from a set of persons with given demographic attributes so that the resulting population reproduces the distributions observed in each zone (for example, from a census).
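The adjustment described above can be sketched in a few lines of NumPy for the two-dimensional case. This is a minimal illustration and not the pam implementation: the function name ipf_2d, the seed matrix and the target totals are all hypothetical, and the sketch assumes a strictly positive seed and row and column targets that sum to the same total.

```python
import numpy as np


def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Hypothetical helper: fit a 2-D table to row and column totals."""
    fitted = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row so that the row sums match the row targets.
        fitted *= (row_targets / fitted.sum(axis=1))[:, None]
        # Scale each column so that the column sums match the column targets.
        fitted *= (col_targets / fitted.sum(axis=0))[None, :]
        # Column sums are exact at this point; stop once the rows agree too.
        if np.allclose(fitted.sum(axis=1), row_targets, atol=tol):
            break
    return fitted


# Illustrative inputs (not taken from the notebook's data).
seed = np.array([[1.0, 2.0], [3.0, 4.0]])
fitted = ipf_2d(seed, row_targets=[40, 60], col_targets=[90, 10])
print(fitted.sum(axis=1), fitted.sum(axis=0))
```

Each pass rescales the rows and then the columns, so the relative structure of the seed is preserved while the totals are pulled towards the targets.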

In [1]:

import itertools

import pandas as pd
from pam.core import Person, Population
from pam.planner import ipf

Data

We start with some demographic data for each zone, as shown below. The zones dataframe is indexed by zone name, and its columns follow a variable|class naming convention. Alternatively, the columns could be provided as a pandas MultiIndex, with the first level being the variable and the second level the class. Each controlled variable should also be an attribute of the persons in the seed population.

In [2]:

zone_data = pd.DataFrame(
    {
        "zone": ["a", "b"],
        "hhincome|high": [30, 80],
        "hhincome|medium": [40, 100],
        "hhincome|low": [30, 20],
        "age|minor": [40, 90],
        "age|adult": [60, 110],
        "carAvail|yes": [90, 180],
        "carAvail|no": [10, 20],
    }
).set_index("zone")
zone_data

Out[2]:

	hhincome|high	hhincome|medium	hhincome|low	age|minor	age|adult	carAvail|yes	carAvail|no
zone
a	30	40	30	40	60	90	10
b	80	100	20	90	110	180	20
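The MultiIndex layout mentioned above can be derived from the variable|class column names with plain pandas. The sketch below rebuilds the same table so that it runs on its own; the names zone_data_multi and totals are illustrative. It also uses the two-level columns to check that every variable sums to the same total in each zone, which is a precondition for the marginals to be mutually consistent.

```python
import pandas as pd

zone_data = pd.DataFrame(
    {
        "zone": ["a", "b"],
        "hhincome|high": [30, 80],
        "hhincome|medium": [40, 100],
        "hhincome|low": [30, 20],
        "age|minor": [40, 90],
        "age|adult": [60, 110],
        "carAvail|yes": [90, 180],
        "carAvail|no": [10, 20],
    }
).set_index("zone")

# Split "variable|class" into a two-level column index.
zone_data_multi = zone_data.copy()
zone_data_multi.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split("|")) for col in zone_data.columns],
    names=["variable", "class"],
)

# Total persons implied by each variable, per zone.
totals = zone_data_multi.T.groupby(level="variable").sum().T
print(totals)
```

Every variable gives 100 persons in zone a and 200 in zone b, so the marginals are consistent with each other.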

Let's create a seed population which includes every possible combination of attributes:

In [3]:

dims = {"hhincome": ["low", "medium", "high"], "age": ["minor", "adult"], "carAvail": ["yes", "no"]}
list(itertools.product(*dims.values()))

# Build one seed person per attribute combination (3 x 2 x 2 = 12 persons).
seed_pop = Population()
for n, (hhincome, age, carAvail) in enumerate(itertools.product(*dims.values())):
    person = Person(
        pid=n, attributes={"hhincome": hhincome, "age": age, "carAvail": carAvail, "hzone": pd.NA}
    )
    seed_pop.add(person)

seed_pop.random_person().attributes

Out[3]:

{'hhincome': 'low', 'age': 'adult', 'carAvail': 'no', 'hzone': <NA>}

IPF

Now let's create a population that matches the demographic distribution of each zone:

In [4]:

pop = ipf.generate_population(seed_pop, zone_data)

The resulting population comprises 300 persons (as defined in the zone data):

In [5]:

len(pop)

Out[5]:

300

And each person in the population is assigned a household zone:

In [6]:

pop.random_person().attributes

Out[6]:

{'hhincome': 'high', 'age': 'minor', 'carAvail': 'yes', 'hzone': 'b'}

The resulting joint demographic distributions in each zone are shown below:

In [7]:

summary = (
    pd.DataFrame([person.attributes for hid, pid, person in pop.people()])
    .value_counts()
    .reorder_levels([3, 0, 1, 2])
    .sort_index()
)

summary

Out[7]:

hzone  hhincome  age    carAvail
a      high      adult  no           2
                        yes         16
                 minor  no           1
                        yes         11
       low       adult  no           2
                        yes         16
                 minor  no           1
                        yes         11
       medium    adult  no           2
                        yes         22
                 minor  no           2
                        yes         14
b      high      adult  no           4
                        yes         40
                 minor  no           4
                        yes         32
       low       adult  no           1
                        yes         10
                 minor  no           1
                        yes          8
       medium    adult  no           6
                        yes         50
                 minor  no           4
                        yes         40
Name: count, dtype: int64
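One way to read the table above: the seed contains each attribute combination exactly once, so it carries no correlation between variables. Assuming the fitting starts from those uniform seed counts, IPF converges to the independence table, in which each cell equals the product of its three marginals divided by the squared zone total. The sketch below computes that table for zone a with plain NumPy. It does not call pam, and how pam turns fractional cells into whole persons is not shown in this notebook, so the rounding step here is only an illustration.

```python
import numpy as np

# Zone "a" marginals from zone_data, in the class order noted on each line.
hhincome = np.array([30, 40, 30], dtype=float)  # high, medium, low
age = np.array([40, 60], dtype=float)  # minor, adult
car_avail = np.array([90, 10], dtype=float)  # yes, no
total = hhincome.sum()  # 100 persons in zone "a"

# Independence table: cell[i, j, k] = hhincome[i] * age[j] * car_avail[k] / total**2
expected = np.einsum("i,j,k->ijk", hhincome, age, car_avail) / total**2

# Illustrative rounding to whole persons (not necessarily what pam does).
rounded = np.rint(expected).astype(int)

print(expected[0, 1, 0])  # high, adult, yes: about 16.2, against 16 in the table
print(rounded[0])  # high income block, rows: minor, adult; columns: yes, no
```

For zone a the rounded cells coincide with the counts listed above, for example 16 high-income adults with a car and 22 medium-income adults with a car.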

The aggregate demographic distributions match the marginals in zone_data to within one person per cell (in zone b, 89 minors and 111 adults against targets of 90 and 110):

In [8]:

summary_aggregate = []
for var in dims:
    df = summary.groupby(level=["hzone", var]).sum().unstack(level=var)
    df.columns = [f"{var}|{x}" for x in df.columns]
    summary_aggregate.append(df)
summary_aggregate = pd.concat(summary_aggregate, axis=1)
summary_aggregate.index.name = "zone"
summary_aggregate = summary_aggregate[zone_data.columns]
summary_aggregate

Out[8]:

	hhincome|high	hhincome|medium	hhincome|low	age|minor	age|adult	carAvail|yes	carAvail|no
zone
a	30	40	30	40	60	90	10
b	80	100	20	89	111	180	20
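To see exactly where the table above departs from the targets, the two tables can be subtracted cell by cell. The sketch below re-enters both tables as literals so that it runs without pam; the names target, fitted and deviation are illustrative.

```python
import pandas as pd

columns = [
    "hhincome|high",
    "hhincome|medium",
    "hhincome|low",
    "age|minor",
    "age|adult",
    "carAvail|yes",
    "carAvail|no",
]
index = pd.Index(["a", "b"], name="zone")

# Target marginals (zone_data) and the aggregated result shown above.
target = pd.DataFrame(
    [[30, 40, 30, 40, 60, 90, 10], [80, 100, 20, 90, 110, 180, 20]],
    index=index,
    columns=columns,
)
fitted = pd.DataFrame(
    [[30, 40, 30, 40, 60, 90, 10], [80, 100, 20, 89, 111, 180, 20]],
    index=index,
    columns=columns,
)

# Signed difference per cell, keeping only the columns that differ anywhere.
deviation = fitted - target
nonzero = deviation.loc[:, (deviation != 0).any()]
print(nonzero)
```

Only the two age columns in zone b differ, each by a single person, which is why a comparison with an absolute tolerance of 1 succeeds.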

In [9]:

pd.testing.assert_frame_equal(
    summary_aggregate, zone_data, check_exact=False, atol=1
)  # test passes