{ "cells": [ { "cell_type": "markdown", "id": "3c56f3a8-056d-4f8a-b46e-cd44e1441a07", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "id": "1d9a8764-65dd-4442-8224-075f427edae0", "metadata": {}, "source": [ "This notebook demonstrates the creation of a synthetic population using an Iterative Proportional Fitting (IPF) approach.\n", "\n", "IPF is a statistical technique that tries to adjust the values of a matrix (joint distribution) in order to match the values of expected single distributions (marginals) for each matrix dimension. We use this approach to sample from a set of persons with certain demographic attributes in a way that the observed distributions in each zone (ie from census) are met." ] }, { "cell_type": "code", "execution_count": 1, "id": "aa7cdd7f-3256-4dab-b42e-7bea3634eb7b", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:45.466157Z", "iopub.status.busy": "2023-09-22T10:52:45.465946Z", "iopub.status.idle": "2023-09-22T10:52:46.256436Z", "shell.execute_reply": "2023-09-22T10:52:46.256128Z" } }, "outputs": [], "source": [ "import itertools\n", "\n", "import pandas as pd\n", "\n", "from pam.core import Person, Population\n", "from pam.planner import ipf" ] }, { "cell_type": "markdown", "id": "30574bfc-87a4-4846-b881-9e4a439720c3", "metadata": {}, "source": [ "# Data" ] }, { "cell_type": "markdown", "id": "4bad3b0e-9361-4764-a7fd-fc40f8566e55", "metadata": {}, "source": [ "We start with some demographic data for each zone, as shown below. The zones dataframe includes the zone name as the index, and its columns follow a `variable|class` naming convention. Alternatively, they could be provided with a mutltiIndex column, with the first level being the variable and the second level indicating the class. The controlled variables should be part of the seed population's attributes." ] }, { "cell_type": "code", "execution_count": 2, "id": "c54ce830-3539-4616-b42f-98cd6662968e", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:46.258097Z", "iopub.status.busy": "2023-09-22T10:52:46.257961Z", "iopub.status.idle": "2023-09-22T10:52:46.265070Z", "shell.execute_reply": "2023-09-22T10:52:46.264787Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hhincome|highhhincome|mediumhhincome|lowage|minorage|adultcarAvail|yescarAvail|no
zone
a30403040609010
b80100209011018020
\n", "
" ], "text/plain": [ " hhincome|high hhincome|medium hhincome|low age|minor age|adult \\\n", "zone \n", "a 30 40 30 40 60 \n", "b 80 100 20 90 110 \n", "\n", " carAvail|yes carAvail|no \n", "zone \n", "a 90 10 \n", "b 180 20 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zone_data = pd.DataFrame(\n", " {\n", " \"zone\": [\"a\", \"b\"],\n", " \"hhincome|high\": [30, 80],\n", " \"hhincome|medium\": [40, 100],\n", " \"hhincome|low\": [30, 20],\n", " \"age|minor\": [40, 90],\n", " \"age|adult\": [60, 110],\n", " \"carAvail|yes\": [90, 180],\n", " \"carAvail|no\": [10, 20],\n", " }\n", ").set_index(\"zone\")\n", "zone_data" ] }, { "cell_type": "markdown", "id": "14e19191-2784-4c5e-a023-d7cb3bba9c66", "metadata": {}, "source": [ "Let's create a seed population which includes every possible combination of attributes:" ] }, { "cell_type": "code", "execution_count": 3, "id": "f546a8dd-0d1d-441b-a768-b483fd0d1382", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:46.266476Z", "iopub.status.busy": "2023-09-22T10:52:46.266381Z", "iopub.status.idle": "2023-09-22T10:52:46.269792Z", "shell.execute_reply": "2023-09-22T10:52:46.269550Z" } }, "outputs": [ { "data": { "text/plain": [ "{'hhincome': 'high', 'age': 'adult', 'carAvail': 'yes', 'hzone': }" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dims = {\"hhincome\": [\"low\", \"medium\", \"high\"], \"age\": [\"minor\", \"adult\"], \"carAvail\": [\"yes\", \"no\"]}\n", "list(itertools.product(*dims.values()))\n", "\n", "# %%\n", "seed_pop = Population()\n", "n = 0\n", "for attributes in list(itertools.product(*dims.values())):\n", " hhincome, age, carAvail = attributes\n", " person = Person(\n", " pid=n, attributes={\"hhincome\": hhincome, \"age\": age, \"carAvail\": carAvail, \"hzone\": pd.NA}\n", " )\n", " n += 1\n", " seed_pop.add(person)\n", "\n", "seed_pop.random_person().attributes" ] }, { "cell_type": "markdown", "id": "708c2f97-907d-4fe0-9361-f8e0e90422f8", "metadata": {}, "source": [ "# IPF" ] }, { "cell_type": "markdown", "id": "dc3527a8-0f0d-4094-9084-0a73cba351ae", "metadata": {}, "source": [ "Now let's create a population that matches the demographic distribution of each zone:" ] }, { "cell_type": "code", "execution_count": 4, "id": "2c3b5f5d-37c6-4d4b-bda5-bad713634a28", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:46.271111Z", "iopub.status.busy": "2023-09-22T10:52:46.271022Z", "iopub.status.idle": "2023-09-22T10:52:46.342084Z", "shell.execute_reply": "2023-09-22T10:52:46.341774Z" } }, "outputs": [], "source": [ "pop = ipf.generate_population(seed_pop, zone_data)" ] }, { "cell_type": "markdown", "id": "1793b438-f84c-4024-abf5-1936eaadc5f0", "metadata": {}, "source": [ "The resulting population comprises 300 persons (as defined in the zone data):" ] }, { "cell_type": "code", "execution_count": 5, "id": "2cd65d20-56c6-46e9-b53d-b451db535081", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:46.343712Z", "iopub.status.busy": "2023-09-22T10:52:46.343631Z", "iopub.status.idle": "2023-09-22T10:52:46.345897Z", "shell.execute_reply": "2023-09-22T10:52:46.345623Z" } }, "outputs": [ { "data": { "text/plain": [ "300" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pop)" ] }, { "cell_type": "markdown", "id": "88a0c55c-dc6b-4024-8411-48f8a18b02cc", "metadata": {}, "source": [ "And each person in the population is assigned a household zone:" ] }, { "cell_type": "code", "execution_count": 6, "id": "cd08aa2f-c2c4-4335-bcac-315a0609f63f", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:46.347311Z", "iopub.status.busy": "2023-09-22T10:52:46.347214Z", "iopub.status.idle": "2023-09-22T10:52:46.349644Z", "shell.execute_reply": "2023-09-22T10:52:46.349400Z" } }, "outputs": [ { "data": { "text/plain": [ "{'hhincome': 'medium', 'age': 'adult', 'carAvail': 'yes', 'hzone': 'b'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pop.random_person().attributes" ] }, { "cell_type": "markdown", "id": "f412df08-af9e-4a78-807e-6789f06e3097", "metadata": {}, "source": [ "The resulting joint demographic distributions in each zone are shown below:" ] }, { "cell_type": "code", "execution_count": 7, "id": "1441fecf-523d-41b5-a936-441da8da3529", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:46.350895Z", "iopub.status.busy": "2023-09-22T10:52:46.350801Z", "iopub.status.idle": "2023-09-22T10:52:46.355724Z", "shell.execute_reply": "2023-09-22T10:52:46.355502Z" } }, "outputs": [ { "data": { "text/plain": [ "hzone hhincome age carAvail\n", "a high adult no 2\n", " yes 16\n", " minor no 1\n", " yes 11\n", " low adult no 2\n", " yes 16\n", " minor no 1\n", " yes 11\n", " medium adult no 2\n", " yes 22\n", " minor no 2\n", " yes 14\n", "b high adult no 4\n", " yes 40\n", " minor no 4\n", " yes 32\n", " low adult no 1\n", " yes 10\n", " minor no 1\n", " yes 8\n", " medium adult no 6\n", " yes 50\n", " minor no 4\n", " yes 40\n", "Name: count, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary = (\n", " pd.DataFrame([person.attributes for hid, pid, person in pop.people()])\n", " .value_counts()\n", " .reorder_levels([3, 0, 1, 2])\n", " .sort_index()\n", ")\n", "\n", "summary" ] }, { "cell_type": "markdown", "id": "9500b6af-2f8f-492e-a2cd-284743fa362a", "metadata": {}, "source": [ "The aggregate demographic distributions match the marginals in `zone_data`:" ] }, { "cell_type": "code", "execution_count": 8, "id": "10ff7760-3e96-4fd2-b438-ea0114328414", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:46.356948Z", "iopub.status.busy": "2023-09-22T10:52:46.356860Z", "iopub.status.idle": "2023-09-22T10:52:46.364199Z", "shell.execute_reply": "2023-09-22T10:52:46.363956Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hhincome|highhhincome|mediumhhincome|lowage|minorage|adultcarAvail|yescarAvail|no
zone
a30403040609010
b80100208911118020
\n", "
" ], "text/plain": [ " hhincome|high hhincome|medium hhincome|low age|minor age|adult \\\n", "zone \n", "a 30 40 30 40 60 \n", "b 80 100 20 89 111 \n", "\n", " carAvail|yes carAvail|no \n", "zone \n", "a 90 10 \n", "b 180 20 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary_aggregate = []\n", "for var in dims:\n", " df = summary.groupby(level=[\"hzone\", var]).sum().unstack(level=var)\n", " df.columns = [f\"{var}|{x}\" for x in df.columns]\n", " summary_aggregate.append(df)\n", "summary_aggregate = pd.concat(summary_aggregate, axis=1)\n", "summary_aggregate.index.name = \"zone\"\n", "summary_aggregate = summary_aggregate[zone_data.columns]\n", "summary_aggregate" ] }, { "cell_type": "code", "execution_count": 9, "id": "f6db0771-b725-4246-af5c-989e140d0835", "metadata": { "execution": { "iopub.execute_input": "2023-09-22T10:52:46.365439Z", "iopub.status.busy": "2023-09-22T10:52:46.365350Z", "iopub.status.idle": "2023-09-22T10:52:46.367388Z", "shell.execute_reply": "2023-09-22T10:52:46.367147Z" } }, "outputs": [], "source": [ "pd.testing.assert_frame_equal(\n", " summary_aggregate, zone_data, check_exact=False, atol=1\n", ") # test passes" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }