{
"cells": [
{
"cell_type": "markdown",
"id": "3c56f3a8-056d-4f8a-b46e-cd44e1441a07",
"metadata": {},
"source": [
"# Introduction"
]
},
{
"cell_type": "markdown",
"id": "1d9a8764-65dd-4442-8224-075f427edae0",
"metadata": {},
"source": [
"This notebook demonstrates the creation of a synthetic population using an Iterative Proportional Fitting (IPF) approach.\n",
"\n",
"IPF is a statistical technique that tries to adjust the values of a matrix (joint distribution) in order to match the values of expected single distributions (marginals) for each matrix dimension. We use this approach to sample from a set of persons with certain demographic attributes in a way that the observed distributions in each zone (ie from census) are met."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aa7cdd7f-3256-4dab-b42e-7bea3634eb7b",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:30.161208Z",
"iopub.status.busy": "2024-04-05T15:22:30.161074Z",
"iopub.status.idle": "2024-04-05T15:22:31.086419Z",
"shell.execute_reply": "2024-04-05T15:22:31.085892Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/6n/0h9tynqn581fxsytcc863h94tm217b/T/ipykernel_95267/1028812630.py:3: DeprecationWarning: \n",
"Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n",
"(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n",
"but was not found to be installed on your system.\n",
"If this would cause problems for you,\n",
"please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n",
" \n",
" import pandas as pd\n"
]
}
],
"source": [
"import itertools\n",
"\n",
"import pandas as pd\n",
"\n",
"from pam.core import Person, Population\n",
"from pam.planner import ipf"
]
},
{
"cell_type": "markdown",
"id": "30574bfc-87a4-4846-b881-9e4a439720c3",
"metadata": {},
"source": [
"# Data"
]
},
{
"cell_type": "markdown",
"id": "4bad3b0e-9361-4764-a7fd-fc40f8566e55",
"metadata": {},
"source": [
"We start with some demographic data for each zone, as shown below. The zones dataframe includes the zone name as the index, and its columns follow a `variable|class` naming convention. Alternatively, they could be provided with a mutltiIndex column, with the first level being the variable and the second level indicating the class. The controlled variables should be part of the seed population's attributes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c54ce830-3539-4616-b42f-98cd6662968e",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:31.089553Z",
"iopub.status.busy": "2024-04-05T15:22:31.089276Z",
"iopub.status.idle": "2024-04-05T15:22:31.102561Z",
"shell.execute_reply": "2024-04-05T15:22:31.102041Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" hhincome|high | \n",
" hhincome|medium | \n",
" hhincome|low | \n",
" age|minor | \n",
" age|adult | \n",
" carAvail|yes | \n",
" carAvail|no | \n",
"
\n",
" \n",
" zone | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 30 | \n",
" 40 | \n",
" 30 | \n",
" 40 | \n",
" 60 | \n",
" 90 | \n",
" 10 | \n",
"
\n",
" \n",
" b | \n",
" 80 | \n",
" 100 | \n",
" 20 | \n",
" 90 | \n",
" 110 | \n",
" 180 | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" hhincome|high hhincome|medium hhincome|low age|minor age|adult \\\n",
"zone \n",
"a 30 40 30 40 60 \n",
"b 80 100 20 90 110 \n",
"\n",
" carAvail|yes carAvail|no \n",
"zone \n",
"a 90 10 \n",
"b 180 20 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zone_data = pd.DataFrame(\n",
" {\n",
" \"zone\": [\"a\", \"b\"],\n",
" \"hhincome|high\": [30, 80],\n",
" \"hhincome|medium\": [40, 100],\n",
" \"hhincome|low\": [30, 20],\n",
" \"age|minor\": [40, 90],\n",
" \"age|adult\": [60, 110],\n",
" \"carAvail|yes\": [90, 180],\n",
" \"carAvail|no\": [10, 20],\n",
" }\n",
").set_index(\"zone\")\n",
"zone_data"
]
},
{
"cell_type": "markdown",
"id": "14e19191-2784-4c5e-a023-d7cb3bba9c66",
"metadata": {},
"source": [
"Let's create a seed population which includes every possible combination of attributes:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f546a8dd-0d1d-441b-a768-b483fd0d1382",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:31.105343Z",
"iopub.status.busy": "2024-04-05T15:22:31.105100Z",
"iopub.status.idle": "2024-04-05T15:22:31.110967Z",
"shell.execute_reply": "2024-04-05T15:22:31.110542Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'hhincome': 'low', 'age': 'adult', 'carAvail': 'no', 'hzone': }"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dims = {\"hhincome\": [\"low\", \"medium\", \"high\"], \"age\": [\"minor\", \"adult\"], \"carAvail\": [\"yes\", \"no\"]}\n",
"list(itertools.product(*dims.values()))\n",
"\n",
"# %%\n",
"seed_pop = Population()\n",
"n = 0\n",
"for attributes in list(itertools.product(*dims.values())):\n",
" hhincome, age, carAvail = attributes\n",
" person = Person(\n",
" pid=n, attributes={\"hhincome\": hhincome, \"age\": age, \"carAvail\": carAvail, \"hzone\": pd.NA}\n",
" )\n",
" n += 1\n",
" seed_pop.add(person)\n",
"\n",
"seed_pop.random_person().attributes"
]
},
{
"cell_type": "markdown",
"id": "708c2f97-907d-4fe0-9361-f8e0e90422f8",
"metadata": {},
"source": [
"# IPF"
]
},
{
"cell_type": "markdown",
"id": "dc3527a8-0f0d-4094-9084-0a73cba351ae",
"metadata": {},
"source": [
"Now let's create a population that matches the demographic distribution of each zone:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2c3b5f5d-37c6-4d4b-bda5-bad713634a28",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:31.113330Z",
"iopub.status.busy": "2024-04-05T15:22:31.113132Z",
"iopub.status.idle": "2024-04-05T15:22:31.209993Z",
"shell.execute_reply": "2024-04-05T15:22:31.209534Z"
}
},
"outputs": [],
"source": [
"pop = ipf.generate_population(seed_pop, zone_data)"
]
},
{
"cell_type": "markdown",
"id": "1793b438-f84c-4024-abf5-1936eaadc5f0",
"metadata": {},
"source": [
"The resulting population comprises 300 persons (as defined in the zone data):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2cd65d20-56c6-46e9-b53d-b451db535081",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:31.212863Z",
"iopub.status.busy": "2024-04-05T15:22:31.212596Z",
"iopub.status.idle": "2024-04-05T15:22:31.217690Z",
"shell.execute_reply": "2024-04-05T15:22:31.217272Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"300"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(pop)"
]
},
{
"cell_type": "markdown",
"id": "88a0c55c-dc6b-4024-8411-48f8a18b02cc",
"metadata": {},
"source": [
"And each person in the population is assigned a household zone:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cd08aa2f-c2c4-4335-bcac-315a0609f63f",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:31.220276Z",
"iopub.status.busy": "2024-04-05T15:22:31.220081Z",
"iopub.status.idle": "2024-04-05T15:22:31.223536Z",
"shell.execute_reply": "2024-04-05T15:22:31.223171Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'hhincome': 'high', 'age': 'minor', 'carAvail': 'yes', 'hzone': 'b'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pop.random_person().attributes"
]
},
{
"cell_type": "markdown",
"id": "f412df08-af9e-4a78-807e-6789f06e3097",
"metadata": {},
"source": [
"The resulting joint demographic distributions in each zone are shown below:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1441fecf-523d-41b5-a936-441da8da3529",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:31.225762Z",
"iopub.status.busy": "2024-04-05T15:22:31.225557Z",
"iopub.status.idle": "2024-04-05T15:22:31.233168Z",
"shell.execute_reply": "2024-04-05T15:22:31.232730Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"hzone hhincome age carAvail\n",
"a high adult no 2\n",
" yes 16\n",
" minor no 1\n",
" yes 11\n",
" low adult no 2\n",
" yes 16\n",
" minor no 1\n",
" yes 11\n",
" medium adult no 2\n",
" yes 22\n",
" minor no 2\n",
" yes 14\n",
"b high adult no 4\n",
" yes 40\n",
" minor no 4\n",
" yes 32\n",
" low adult no 1\n",
" yes 10\n",
" minor no 1\n",
" yes 8\n",
" medium adult no 6\n",
" yes 50\n",
" minor no 4\n",
" yes 40\n",
"Name: count, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary = (\n",
" pd.DataFrame([person.attributes for hid, pid, person in pop.people()])\n",
" .value_counts()\n",
" .reorder_levels([3, 0, 1, 2])\n",
" .sort_index()\n",
")\n",
"\n",
"summary"
]
},
{
"cell_type": "markdown",
"id": "9500b6af-2f8f-492e-a2cd-284743fa362a",
"metadata": {},
"source": [
"The aggregate demographic distributions match the marginals in `zone_data`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "10ff7760-3e96-4fd2-b438-ea0114328414",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:31.235334Z",
"iopub.status.busy": "2024-04-05T15:22:31.235142Z",
"iopub.status.idle": "2024-04-05T15:22:31.246323Z",
"shell.execute_reply": "2024-04-05T15:22:31.245884Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" hhincome|high | \n",
" hhincome|medium | \n",
" hhincome|low | \n",
" age|minor | \n",
" age|adult | \n",
" carAvail|yes | \n",
" carAvail|no | \n",
"
\n",
" \n",
" zone | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 30 | \n",
" 40 | \n",
" 30 | \n",
" 40 | \n",
" 60 | \n",
" 90 | \n",
" 10 | \n",
"
\n",
" \n",
" b | \n",
" 80 | \n",
" 100 | \n",
" 20 | \n",
" 89 | \n",
" 111 | \n",
" 180 | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" hhincome|high hhincome|medium hhincome|low age|minor age|adult \\\n",
"zone \n",
"a 30 40 30 40 60 \n",
"b 80 100 20 89 111 \n",
"\n",
" carAvail|yes carAvail|no \n",
"zone \n",
"a 90 10 \n",
"b 180 20 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary_aggregate = []\n",
"for var in dims:\n",
" df = summary.groupby(level=[\"hzone\", var]).sum().unstack(level=var)\n",
" df.columns = [f\"{var}|{x}\" for x in df.columns]\n",
" summary_aggregate.append(df)\n",
"summary_aggregate = pd.concat(summary_aggregate, axis=1)\n",
"summary_aggregate.index.name = \"zone\"\n",
"summary_aggregate = summary_aggregate[zone_data.columns]\n",
"summary_aggregate"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "f6db0771-b725-4246-af5c-989e140d0835",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-05T15:22:31.248461Z",
"iopub.status.busy": "2024-04-05T15:22:31.248289Z",
"iopub.status.idle": "2024-04-05T15:22:31.251794Z",
"shell.execute_reply": "2024-04-05T15:22:31.251292Z"
}
},
"outputs": [],
"source": [
"pd.testing.assert_frame_equal(\n",
" summary_aggregate, zone_data, check_exact=False, atol=1\n",
") # test passes"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}