{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3c56f3a8-056d-4f8a-b46e-cd44e1441a07",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d9a8764-65dd-4442-8224-075f427edae0",
   "metadata": {},
   "source": [
    "This notebook demonstrates the creation of a synthetic population using an Iterative Proportional Fitting (IPF) approach.\n",
    "\n",
    "IPF is a statistical technique that tries to adjust the values of a matrix (joint distribution) in order to match the values of expected single distributions (marginals) for each matrix dimension. We use this approach to sample from a set of persons with certain demographic attributes in a way that the observed distributions in each zone (ie from census) are met."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "aa7cdd7f-3256-4dab-b42e-7bea3634eb7b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:45.466157Z",
     "iopub.status.busy": "2023-09-22T10:52:45.465946Z",
     "iopub.status.idle": "2023-09-22T10:52:46.256436Z",
     "shell.execute_reply": "2023-09-22T10:52:46.256128Z"
    }
   },
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from pam.core import Person, Population\n",
    "from pam.planner import ipf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30574bfc-87a4-4846-b881-9e4a439720c3",
   "metadata": {},
   "source": [
    "# Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bad3b0e-9361-4764-a7fd-fc40f8566e55",
   "metadata": {},
   "source": [
    "We start with some demographic data for each zone, as shown below. The zones dataframe includes the zone name as the index, and its columns follow a `variable|class` naming convention. Alternatively, they could be provided with a mutltiIndex column, with the first level being the variable and the second level indicating the class. The controlled variables should be part of the seed population's attributes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c54ce830-3539-4616-b42f-98cd6662968e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:46.258097Z",
     "iopub.status.busy": "2023-09-22T10:52:46.257961Z",
     "iopub.status.idle": "2023-09-22T10:52:46.265070Z",
     "shell.execute_reply": "2023-09-22T10:52:46.264787Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hhincome|high</th>\n",
       "      <th>hhincome|medium</th>\n",
       "      <th>hhincome|low</th>\n",
       "      <th>age|minor</th>\n",
       "      <th>age|adult</th>\n",
       "      <th>carAvail|yes</th>\n",
       "      <th>carAvail|no</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zone</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>30</td>\n",
       "      <td>40</td>\n",
       "      <td>30</td>\n",
       "      <td>40</td>\n",
       "      <td>60</td>\n",
       "      <td>90</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>80</td>\n",
       "      <td>100</td>\n",
       "      <td>20</td>\n",
       "      <td>90</td>\n",
       "      <td>110</td>\n",
       "      <td>180</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      hhincome|high  hhincome|medium  hhincome|low  age|minor  age|adult  \\\n",
       "zone                                                                       \n",
       "a                30               40            30         40         60   \n",
       "b                80              100            20         90        110   \n",
       "\n",
       "      carAvail|yes  carAvail|no  \n",
       "zone                             \n",
       "a               90           10  \n",
       "b              180           20  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zone_data = pd.DataFrame(\n",
    "    {\n",
    "        \"zone\": [\"a\", \"b\"],\n",
    "        \"hhincome|high\": [30, 80],\n",
    "        \"hhincome|medium\": [40, 100],\n",
    "        \"hhincome|low\": [30, 20],\n",
    "        \"age|minor\": [40, 90],\n",
    "        \"age|adult\": [60, 110],\n",
    "        \"carAvail|yes\": [90, 180],\n",
    "        \"carAvail|no\": [10, 20],\n",
    "    }\n",
    ").set_index(\"zone\")\n",
    "zone_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14e19191-2784-4c5e-a023-d7cb3bba9c66",
   "metadata": {},
   "source": [
    "Let's create a seed population which includes every possible combination of attributes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "f546a8dd-0d1d-441b-a768-b483fd0d1382",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:46.266476Z",
     "iopub.status.busy": "2023-09-22T10:52:46.266381Z",
     "iopub.status.idle": "2023-09-22T10:52:46.269792Z",
     "shell.execute_reply": "2023-09-22T10:52:46.269550Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'hhincome': 'high', 'age': 'adult', 'carAvail': 'yes', 'hzone': <NA>}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dims = {\"hhincome\": [\"low\", \"medium\", \"high\"], \"age\": [\"minor\", \"adult\"], \"carAvail\": [\"yes\", \"no\"]}\n",
    "list(itertools.product(*dims.values()))\n",
    "\n",
    "# %%\n",
    "seed_pop = Population()\n",
    "n = 0\n",
    "for attributes in list(itertools.product(*dims.values())):\n",
    "    hhincome, age, carAvail = attributes\n",
    "    person = Person(\n",
    "        pid=n, attributes={\"hhincome\": hhincome, \"age\": age, \"carAvail\": carAvail, \"hzone\": pd.NA}\n",
    "    )\n",
    "    n += 1\n",
    "    seed_pop.add(person)\n",
    "\n",
    "seed_pop.random_person().attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "708c2f97-907d-4fe0-9361-f8e0e90422f8",
   "metadata": {},
   "source": [
    "# IPF"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc3527a8-0f0d-4094-9084-0a73cba351ae",
   "metadata": {},
   "source": [
    "Now let's create a population that matches the demographic distribution of each zone:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2c3b5f5d-37c6-4d4b-bda5-bad713634a28",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:46.271111Z",
     "iopub.status.busy": "2023-09-22T10:52:46.271022Z",
     "iopub.status.idle": "2023-09-22T10:52:46.342084Z",
     "shell.execute_reply": "2023-09-22T10:52:46.341774Z"
    }
   },
   "outputs": [],
   "source": [
    "pop = ipf.generate_population(seed_pop, zone_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1793b438-f84c-4024-abf5-1936eaadc5f0",
   "metadata": {},
   "source": [
    "The resulting population comprises 300 persons (as defined in the zone data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2cd65d20-56c6-46e9-b53d-b451db535081",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:46.343712Z",
     "iopub.status.busy": "2023-09-22T10:52:46.343631Z",
     "iopub.status.idle": "2023-09-22T10:52:46.345897Z",
     "shell.execute_reply": "2023-09-22T10:52:46.345623Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "300"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(pop)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88a0c55c-dc6b-4024-8411-48f8a18b02cc",
   "metadata": {},
   "source": [
    "And each person in the population is assigned a household zone:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cd08aa2f-c2c4-4335-bcac-315a0609f63f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:46.347311Z",
     "iopub.status.busy": "2023-09-22T10:52:46.347214Z",
     "iopub.status.idle": "2023-09-22T10:52:46.349644Z",
     "shell.execute_reply": "2023-09-22T10:52:46.349400Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'hhincome': 'medium', 'age': 'adult', 'carAvail': 'yes', 'hzone': 'b'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pop.random_person().attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f412df08-af9e-4a78-807e-6789f06e3097",
   "metadata": {},
   "source": [
    "The resulting joint demographic distributions in each zone are shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "1441fecf-523d-41b5-a936-441da8da3529",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:46.350895Z",
     "iopub.status.busy": "2023-09-22T10:52:46.350801Z",
     "iopub.status.idle": "2023-09-22T10:52:46.355724Z",
     "shell.execute_reply": "2023-09-22T10:52:46.355502Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "hzone  hhincome  age    carAvail\n",
       "a      high      adult  no           2\n",
       "                        yes         16\n",
       "                 minor  no           1\n",
       "                        yes         11\n",
       "       low       adult  no           2\n",
       "                        yes         16\n",
       "                 minor  no           1\n",
       "                        yes         11\n",
       "       medium    adult  no           2\n",
       "                        yes         22\n",
       "                 minor  no           2\n",
       "                        yes         14\n",
       "b      high      adult  no           4\n",
       "                        yes         40\n",
       "                 minor  no           4\n",
       "                        yes         32\n",
       "       low       adult  no           1\n",
       "                        yes         10\n",
       "                 minor  no           1\n",
       "                        yes          8\n",
       "       medium    adult  no           6\n",
       "                        yes         50\n",
       "                 minor  no           4\n",
       "                        yes         40\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "summary = (\n",
    "    pd.DataFrame([person.attributes for hid, pid, person in pop.people()])\n",
    "    .value_counts()\n",
    "    .reorder_levels([3, 0, 1, 2])\n",
    "    .sort_index()\n",
    ")\n",
    "\n",
    "summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9500b6af-2f8f-492e-a2cd-284743fa362a",
   "metadata": {},
   "source": [
    "The aggregate demographic distributions match the marginals in `zone_data`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "10ff7760-3e96-4fd2-b438-ea0114328414",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:46.356948Z",
     "iopub.status.busy": "2023-09-22T10:52:46.356860Z",
     "iopub.status.idle": "2023-09-22T10:52:46.364199Z",
     "shell.execute_reply": "2023-09-22T10:52:46.363956Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hhincome|high</th>\n",
       "      <th>hhincome|medium</th>\n",
       "      <th>hhincome|low</th>\n",
       "      <th>age|minor</th>\n",
       "      <th>age|adult</th>\n",
       "      <th>carAvail|yes</th>\n",
       "      <th>carAvail|no</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zone</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>30</td>\n",
       "      <td>40</td>\n",
       "      <td>30</td>\n",
       "      <td>40</td>\n",
       "      <td>60</td>\n",
       "      <td>90</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>80</td>\n",
       "      <td>100</td>\n",
       "      <td>20</td>\n",
       "      <td>89</td>\n",
       "      <td>111</td>\n",
       "      <td>180</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      hhincome|high  hhincome|medium  hhincome|low  age|minor  age|adult  \\\n",
       "zone                                                                       \n",
       "a                30               40            30         40         60   \n",
       "b                80              100            20         89        111   \n",
       "\n",
       "      carAvail|yes  carAvail|no  \n",
       "zone                             \n",
       "a               90           10  \n",
       "b              180           20  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "summary_aggregate = []\n",
    "for var in dims:\n",
    "    df = summary.groupby(level=[\"hzone\", var]).sum().unstack(level=var)\n",
    "    df.columns = [f\"{var}|{x}\" for x in df.columns]\n",
    "    summary_aggregate.append(df)\n",
    "summary_aggregate = pd.concat(summary_aggregate, axis=1)\n",
    "summary_aggregate.index.name = \"zone\"\n",
    "summary_aggregate = summary_aggregate[zone_data.columns]\n",
    "summary_aggregate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f6db0771-b725-4246-af5c-989e140d0835",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-22T10:52:46.365439Z",
     "iopub.status.busy": "2023-09-22T10:52:46.365350Z",
     "iopub.status.idle": "2023-09-22T10:52:46.367388Z",
     "shell.execute_reply": "2023-09-22T10:52:46.367147Z"
    }
   },
   "outputs": [],
   "source": [
    "pd.testing.assert_frame_equal(\n",
    "    summary_aggregate, zone_data, check_exact=False, atol=1\n",
    ")  # test passes"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}