{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3c56f3a8-056d-4f8a-b46e-cd44e1441a07",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d9a8764-65dd-4442-8224-075f427edae0",
   "metadata": {},
   "source": [
    "This notebook demonstrates the creation of a synthetic population using an Iterative Proportional Fitting (IPF) approach.\n",
    "\n",
    "IPF is a statistical technique that tries to adjust the values of a matrix (joint distribution) in order to match the values of expected single distributions (marginals) for each matrix dimension. We use this approach to sample from a set of persons with certain demographic attributes in a way that the observed distributions in each zone (ie from census) are met."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "aa7cdd7f-3256-4dab-b42e-7bea3634eb7b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:30.161208Z",
     "iopub.status.busy": "2024-04-05T15:22:30.161074Z",
     "iopub.status.idle": "2024-04-05T15:22:31.086419Z",
     "shell.execute_reply": "2024-04-05T15:22:31.085892Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/6n/0h9tynqn581fxsytcc863h94tm217b/T/ipykernel_95267/1028812630.py:3: DeprecationWarning: \n",
      "Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n",
      "(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n",
      "but was not found to be installed on your system.\n",
      "If this would cause problems for you,\n",
      "please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n",
      "        \n",
      "  import pandas as pd\n"
     ]
    }
   ],
   "source": [
    "import itertools\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from pam.core import Person, Population\n",
    "from pam.planner import ipf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30574bfc-87a4-4846-b881-9e4a439720c3",
   "metadata": {},
   "source": [
    "# Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bad3b0e-9361-4764-a7fd-fc40f8566e55",
   "metadata": {},
   "source": [
    "We start with some demographic data for each zone, as shown below. The zones dataframe includes the zone name as the index, and its columns follow a `variable|class` naming convention. Alternatively, they could be provided with a mutltiIndex column, with the first level being the variable and the second level indicating the class. The controlled variables should be part of the seed population's attributes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c54ce830-3539-4616-b42f-98cd6662968e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:31.089553Z",
     "iopub.status.busy": "2024-04-05T15:22:31.089276Z",
     "iopub.status.idle": "2024-04-05T15:22:31.102561Z",
     "shell.execute_reply": "2024-04-05T15:22:31.102041Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hhincome|high</th>\n",
       "      <th>hhincome|medium</th>\n",
       "      <th>hhincome|low</th>\n",
       "      <th>age|minor</th>\n",
       "      <th>age|adult</th>\n",
       "      <th>carAvail|yes</th>\n",
       "      <th>carAvail|no</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zone</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>30</td>\n",
       "      <td>40</td>\n",
       "      <td>30</td>\n",
       "      <td>40</td>\n",
       "      <td>60</td>\n",
       "      <td>90</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>80</td>\n",
       "      <td>100</td>\n",
       "      <td>20</td>\n",
       "      <td>90</td>\n",
       "      <td>110</td>\n",
       "      <td>180</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      hhincome|high  hhincome|medium  hhincome|low  age|minor  age|adult  \\\n",
       "zone                                                                       \n",
       "a                30               40            30         40         60   \n",
       "b                80              100            20         90        110   \n",
       "\n",
       "      carAvail|yes  carAvail|no  \n",
       "zone                             \n",
       "a               90           10  \n",
       "b              180           20  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zone_data = pd.DataFrame(\n",
    "    {\n",
    "        \"zone\": [\"a\", \"b\"],\n",
    "        \"hhincome|high\": [30, 80],\n",
    "        \"hhincome|medium\": [40, 100],\n",
    "        \"hhincome|low\": [30, 20],\n",
    "        \"age|minor\": [40, 90],\n",
    "        \"age|adult\": [60, 110],\n",
    "        \"carAvail|yes\": [90, 180],\n",
    "        \"carAvail|no\": [10, 20],\n",
    "    }\n",
    ").set_index(\"zone\")\n",
    "zone_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14e19191-2784-4c5e-a023-d7cb3bba9c66",
   "metadata": {},
   "source": [
    "Let's create a seed population which includes every possible combination of attributes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "f546a8dd-0d1d-441b-a768-b483fd0d1382",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:31.105343Z",
     "iopub.status.busy": "2024-04-05T15:22:31.105100Z",
     "iopub.status.idle": "2024-04-05T15:22:31.110967Z",
     "shell.execute_reply": "2024-04-05T15:22:31.110542Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'hhincome': 'low', 'age': 'adult', 'carAvail': 'no', 'hzone': <NA>}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dims = {\"hhincome\": [\"low\", \"medium\", \"high\"], \"age\": [\"minor\", \"adult\"], \"carAvail\": [\"yes\", \"no\"]}\n",
    "list(itertools.product(*dims.values()))\n",
    "\n",
    "# %%\n",
    "seed_pop = Population()\n",
    "n = 0\n",
    "for attributes in list(itertools.product(*dims.values())):\n",
    "    hhincome, age, carAvail = attributes\n",
    "    person = Person(\n",
    "        pid=n, attributes={\"hhincome\": hhincome, \"age\": age, \"carAvail\": carAvail, \"hzone\": pd.NA}\n",
    "    )\n",
    "    n += 1\n",
    "    seed_pop.add(person)\n",
    "\n",
    "seed_pop.random_person().attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "708c2f97-907d-4fe0-9361-f8e0e90422f8",
   "metadata": {},
   "source": [
    "# IPF"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc3527a8-0f0d-4094-9084-0a73cba351ae",
   "metadata": {},
   "source": [
    "Now let's create a population that matches the demographic distribution of each zone:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2c3b5f5d-37c6-4d4b-bda5-bad713634a28",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:31.113330Z",
     "iopub.status.busy": "2024-04-05T15:22:31.113132Z",
     "iopub.status.idle": "2024-04-05T15:22:31.209993Z",
     "shell.execute_reply": "2024-04-05T15:22:31.209534Z"
    }
   },
   "outputs": [],
   "source": [
    "pop = ipf.generate_population(seed_pop, zone_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1793b438-f84c-4024-abf5-1936eaadc5f0",
   "metadata": {},
   "source": [
    "The resulting population comprises 300 persons (as defined in the zone data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2cd65d20-56c6-46e9-b53d-b451db535081",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:31.212863Z",
     "iopub.status.busy": "2024-04-05T15:22:31.212596Z",
     "iopub.status.idle": "2024-04-05T15:22:31.217690Z",
     "shell.execute_reply": "2024-04-05T15:22:31.217272Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "300"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(pop)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88a0c55c-dc6b-4024-8411-48f8a18b02cc",
   "metadata": {},
   "source": [
    "And each person in the population is assigned a household zone:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cd08aa2f-c2c4-4335-bcac-315a0609f63f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:31.220276Z",
     "iopub.status.busy": "2024-04-05T15:22:31.220081Z",
     "iopub.status.idle": "2024-04-05T15:22:31.223536Z",
     "shell.execute_reply": "2024-04-05T15:22:31.223171Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'hhincome': 'high', 'age': 'minor', 'carAvail': 'yes', 'hzone': 'b'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pop.random_person().attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f412df08-af9e-4a78-807e-6789f06e3097",
   "metadata": {},
   "source": [
    "The resulting joint demographic distributions in each zone are shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "1441fecf-523d-41b5-a936-441da8da3529",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:31.225762Z",
     "iopub.status.busy": "2024-04-05T15:22:31.225557Z",
     "iopub.status.idle": "2024-04-05T15:22:31.233168Z",
     "shell.execute_reply": "2024-04-05T15:22:31.232730Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "hzone  hhincome  age    carAvail\n",
       "a      high      adult  no           2\n",
       "                        yes         16\n",
       "                 minor  no           1\n",
       "                        yes         11\n",
       "       low       adult  no           2\n",
       "                        yes         16\n",
       "                 minor  no           1\n",
       "                        yes         11\n",
       "       medium    adult  no           2\n",
       "                        yes         22\n",
       "                 minor  no           2\n",
       "                        yes         14\n",
       "b      high      adult  no           4\n",
       "                        yes         40\n",
       "                 minor  no           4\n",
       "                        yes         32\n",
       "       low       adult  no           1\n",
       "                        yes         10\n",
       "                 minor  no           1\n",
       "                        yes          8\n",
       "       medium    adult  no           6\n",
       "                        yes         50\n",
       "                 minor  no           4\n",
       "                        yes         40\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "summary = (\n",
    "    pd.DataFrame([person.attributes for hid, pid, person in pop.people()])\n",
    "    .value_counts()\n",
    "    .reorder_levels([3, 0, 1, 2])\n",
    "    .sort_index()\n",
    ")\n",
    "\n",
    "summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9500b6af-2f8f-492e-a2cd-284743fa362a",
   "metadata": {},
   "source": [
    "The aggregate demographic distributions match the marginals in `zone_data`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "10ff7760-3e96-4fd2-b438-ea0114328414",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:31.235334Z",
     "iopub.status.busy": "2024-04-05T15:22:31.235142Z",
     "iopub.status.idle": "2024-04-05T15:22:31.246323Z",
     "shell.execute_reply": "2024-04-05T15:22:31.245884Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hhincome|high</th>\n",
       "      <th>hhincome|medium</th>\n",
       "      <th>hhincome|low</th>\n",
       "      <th>age|minor</th>\n",
       "      <th>age|adult</th>\n",
       "      <th>carAvail|yes</th>\n",
       "      <th>carAvail|no</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zone</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>30</td>\n",
       "      <td>40</td>\n",
       "      <td>30</td>\n",
       "      <td>40</td>\n",
       "      <td>60</td>\n",
       "      <td>90</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>80</td>\n",
       "      <td>100</td>\n",
       "      <td>20</td>\n",
       "      <td>89</td>\n",
       "      <td>111</td>\n",
       "      <td>180</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      hhincome|high  hhincome|medium  hhincome|low  age|minor  age|adult  \\\n",
       "zone                                                                       \n",
       "a                30               40            30         40         60   \n",
       "b                80              100            20         89        111   \n",
       "\n",
       "      carAvail|yes  carAvail|no  \n",
       "zone                             \n",
       "a               90           10  \n",
       "b              180           20  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "summary_aggregate = []\n",
    "for var in dims:\n",
    "    df = summary.groupby(level=[\"hzone\", var]).sum().unstack(level=var)\n",
    "    df.columns = [f\"{var}|{x}\" for x in df.columns]\n",
    "    summary_aggregate.append(df)\n",
    "summary_aggregate = pd.concat(summary_aggregate, axis=1)\n",
    "summary_aggregate.index.name = \"zone\"\n",
    "summary_aggregate = summary_aggregate[zone_data.columns]\n",
    "summary_aggregate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f6db0771-b725-4246-af5c-989e140d0835",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-04-05T15:22:31.248461Z",
     "iopub.status.busy": "2024-04-05T15:22:31.248289Z",
     "iopub.status.idle": "2024-04-05T15:22:31.251794Z",
     "shell.execute_reply": "2024-04-05T15:22:31.251292Z"
    }
   },
   "outputs": [],
   "source": [
    "pd.testing.assert_frame_equal(\n",
    "    summary_aggregate, zone_data, check_exact=False, atol=1\n",
    ")  # test passes"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}