{
"cells": [
{
"cell_type": "markdown",
"id": "3c56f3a8-056d-4f8a-b46e-cd44e1441a07",
"metadata": {},
"source": [
"# Introduction"
]
},
{
"cell_type": "markdown",
"id": "1d9a8764-65dd-4442-8224-075f427edae0",
"metadata": {},
"source": [
"This notebook demonstrates the creation of a synthetic population using an Iterative Proportional Fitting (IPF) approach.\n",
"\n",
"IPF is a statistical technique that tries to adjust the values of a matrix (joint distribution) in order to match the values of expected single distributions (marginals) for each matrix dimension. We use this approach to sample from a set of persons with certain demographic attributes in a way that the observed distributions in each zone (ie from census) are met."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aa7cdd7f-3256-4dab-b42e-7bea3634eb7b",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:45.466157Z",
"iopub.status.busy": "2023-09-22T10:52:45.465946Z",
"iopub.status.idle": "2023-09-22T10:52:46.256436Z",
"shell.execute_reply": "2023-09-22T10:52:46.256128Z"
}
},
"outputs": [],
"source": [
"import itertools\n",
"\n",
"import pandas as pd\n",
"\n",
"from pam.core import Person, Population\n",
"from pam.planner import ipf"
]
},
{
"cell_type": "markdown",
"id": "30574bfc-87a4-4846-b881-9e4a439720c3",
"metadata": {},
"source": [
"# Data"
]
},
{
"cell_type": "markdown",
"id": "4bad3b0e-9361-4764-a7fd-fc40f8566e55",
"metadata": {},
"source": [
"We start with some demographic data for each zone, as shown below. The zones dataframe includes the zone name as the index, and its columns follow a `variable|class` naming convention. Alternatively, they could be provided with a mutltiIndex column, with the first level being the variable and the second level indicating the class. The controlled variables should be part of the seed population's attributes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c54ce830-3539-4616-b42f-98cd6662968e",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:46.258097Z",
"iopub.status.busy": "2023-09-22T10:52:46.257961Z",
"iopub.status.idle": "2023-09-22T10:52:46.265070Z",
"shell.execute_reply": "2023-09-22T10:52:46.264787Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" hhincome|high | \n",
" hhincome|medium | \n",
" hhincome|low | \n",
" age|minor | \n",
" age|adult | \n",
" carAvail|yes | \n",
" carAvail|no | \n",
"
\n",
" \n",
" zone | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 30 | \n",
" 40 | \n",
" 30 | \n",
" 40 | \n",
" 60 | \n",
" 90 | \n",
" 10 | \n",
"
\n",
" \n",
" b | \n",
" 80 | \n",
" 100 | \n",
" 20 | \n",
" 90 | \n",
" 110 | \n",
" 180 | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" hhincome|high hhincome|medium hhincome|low age|minor age|adult \\\n",
"zone \n",
"a 30 40 30 40 60 \n",
"b 80 100 20 90 110 \n",
"\n",
" carAvail|yes carAvail|no \n",
"zone \n",
"a 90 10 \n",
"b 180 20 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zone_data = pd.DataFrame(\n",
" {\n",
" \"zone\": [\"a\", \"b\"],\n",
" \"hhincome|high\": [30, 80],\n",
" \"hhincome|medium\": [40, 100],\n",
" \"hhincome|low\": [30, 20],\n",
" \"age|minor\": [40, 90],\n",
" \"age|adult\": [60, 110],\n",
" \"carAvail|yes\": [90, 180],\n",
" \"carAvail|no\": [10, 20],\n",
" }\n",
").set_index(\"zone\")\n",
"zone_data"
]
},
{
"cell_type": "markdown",
"id": "14e19191-2784-4c5e-a023-d7cb3bba9c66",
"metadata": {},
"source": [
"Let's create a seed population which includes every possible combination of attributes:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f546a8dd-0d1d-441b-a768-b483fd0d1382",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:46.266476Z",
"iopub.status.busy": "2023-09-22T10:52:46.266381Z",
"iopub.status.idle": "2023-09-22T10:52:46.269792Z",
"shell.execute_reply": "2023-09-22T10:52:46.269550Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'hhincome': 'high', 'age': 'adult', 'carAvail': 'yes', 'hzone': }"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dims = {\"hhincome\": [\"low\", \"medium\", \"high\"], \"age\": [\"minor\", \"adult\"], \"carAvail\": [\"yes\", \"no\"]}\n",
"list(itertools.product(*dims.values()))\n",
"\n",
"# %%\n",
"seed_pop = Population()\n",
"n = 0\n",
"for attributes in list(itertools.product(*dims.values())):\n",
" hhincome, age, carAvail = attributes\n",
" person = Person(\n",
" pid=n, attributes={\"hhincome\": hhincome, \"age\": age, \"carAvail\": carAvail, \"hzone\": pd.NA}\n",
" )\n",
" n += 1\n",
" seed_pop.add(person)\n",
"\n",
"seed_pop.random_person().attributes"
]
},
{
"cell_type": "markdown",
"id": "708c2f97-907d-4fe0-9361-f8e0e90422f8",
"metadata": {},
"source": [
"# IPF"
]
},
{
"cell_type": "markdown",
"id": "dc3527a8-0f0d-4094-9084-0a73cba351ae",
"metadata": {},
"source": [
"Now let's create a population that matches the demographic distribution of each zone:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2c3b5f5d-37c6-4d4b-bda5-bad713634a28",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:46.271111Z",
"iopub.status.busy": "2023-09-22T10:52:46.271022Z",
"iopub.status.idle": "2023-09-22T10:52:46.342084Z",
"shell.execute_reply": "2023-09-22T10:52:46.341774Z"
}
},
"outputs": [],
"source": [
"pop = ipf.generate_population(seed_pop, zone_data)"
]
},
{
"cell_type": "markdown",
"id": "1793b438-f84c-4024-abf5-1936eaadc5f0",
"metadata": {},
"source": [
"The resulting population comprises 300 persons (as defined in the zone data):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2cd65d20-56c6-46e9-b53d-b451db535081",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:46.343712Z",
"iopub.status.busy": "2023-09-22T10:52:46.343631Z",
"iopub.status.idle": "2023-09-22T10:52:46.345897Z",
"shell.execute_reply": "2023-09-22T10:52:46.345623Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"300"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(pop)"
]
},
{
"cell_type": "markdown",
"id": "88a0c55c-dc6b-4024-8411-48f8a18b02cc",
"metadata": {},
"source": [
"And each person in the population is assigned a household zone:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cd08aa2f-c2c4-4335-bcac-315a0609f63f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:46.347311Z",
"iopub.status.busy": "2023-09-22T10:52:46.347214Z",
"iopub.status.idle": "2023-09-22T10:52:46.349644Z",
"shell.execute_reply": "2023-09-22T10:52:46.349400Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'hhincome': 'medium', 'age': 'adult', 'carAvail': 'yes', 'hzone': 'b'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pop.random_person().attributes"
]
},
{
"cell_type": "markdown",
"id": "f412df08-af9e-4a78-807e-6789f06e3097",
"metadata": {},
"source": [
"The resulting joint demographic distributions in each zone are shown below:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1441fecf-523d-41b5-a936-441da8da3529",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:46.350895Z",
"iopub.status.busy": "2023-09-22T10:52:46.350801Z",
"iopub.status.idle": "2023-09-22T10:52:46.355724Z",
"shell.execute_reply": "2023-09-22T10:52:46.355502Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"hzone hhincome age carAvail\n",
"a high adult no 2\n",
" yes 16\n",
" minor no 1\n",
" yes 11\n",
" low adult no 2\n",
" yes 16\n",
" minor no 1\n",
" yes 11\n",
" medium adult no 2\n",
" yes 22\n",
" minor no 2\n",
" yes 14\n",
"b high adult no 4\n",
" yes 40\n",
" minor no 4\n",
" yes 32\n",
" low adult no 1\n",
" yes 10\n",
" minor no 1\n",
" yes 8\n",
" medium adult no 6\n",
" yes 50\n",
" minor no 4\n",
" yes 40\n",
"Name: count, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary = (\n",
" pd.DataFrame([person.attributes for hid, pid, person in pop.people()])\n",
" .value_counts()\n",
" .reorder_levels([3, 0, 1, 2])\n",
" .sort_index()\n",
")\n",
"\n",
"summary"
]
},
{
"cell_type": "markdown",
"id": "9500b6af-2f8f-492e-a2cd-284743fa362a",
"metadata": {},
"source": [
"The aggregate demographic distributions match the marginals in `zone_data`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "10ff7760-3e96-4fd2-b438-ea0114328414",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:46.356948Z",
"iopub.status.busy": "2023-09-22T10:52:46.356860Z",
"iopub.status.idle": "2023-09-22T10:52:46.364199Z",
"shell.execute_reply": "2023-09-22T10:52:46.363956Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" hhincome|high | \n",
" hhincome|medium | \n",
" hhincome|low | \n",
" age|minor | \n",
" age|adult | \n",
" carAvail|yes | \n",
" carAvail|no | \n",
"
\n",
" \n",
" zone | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 30 | \n",
" 40 | \n",
" 30 | \n",
" 40 | \n",
" 60 | \n",
" 90 | \n",
" 10 | \n",
"
\n",
" \n",
" b | \n",
" 80 | \n",
" 100 | \n",
" 20 | \n",
" 89 | \n",
" 111 | \n",
" 180 | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" hhincome|high hhincome|medium hhincome|low age|minor age|adult \\\n",
"zone \n",
"a 30 40 30 40 60 \n",
"b 80 100 20 89 111 \n",
"\n",
" carAvail|yes carAvail|no \n",
"zone \n",
"a 90 10 \n",
"b 180 20 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary_aggregate = []\n",
"for var in dims:\n",
" df = summary.groupby(level=[\"hzone\", var]).sum().unstack(level=var)\n",
" df.columns = [f\"{var}|{x}\" for x in df.columns]\n",
" summary_aggregate.append(df)\n",
"summary_aggregate = pd.concat(summary_aggregate, axis=1)\n",
"summary_aggregate.index.name = \"zone\"\n",
"summary_aggregate = summary_aggregate[zone_data.columns]\n",
"summary_aggregate"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "f6db0771-b725-4246-af5c-989e140d0835",
"metadata": {
"execution": {
"iopub.execute_input": "2023-09-22T10:52:46.365439Z",
"iopub.status.busy": "2023-09-22T10:52:46.365350Z",
"iopub.status.idle": "2023-09-22T10:52:46.367388Z",
"shell.execute_reply": "2023-09-22T10:52:46.367147Z"
}
},
"outputs": [],
"source": [
"pd.testing.assert_frame_equal(\n",
" summary_aggregate, zone_data, check_exact=False, atol=1\n",
") # test passes"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}