Input
This section describes the JUNE-NZ component cli_input. Note that the original purpose of cli_input is to write the measles data records in a format that can be used by JUNE-NZ; it is therefore not a necessary step for all modelling requirements.
1. Background
JUNE-NZ relies on input data primarily generated by the population layer of the original JUNE model. The synthetic population in JUNE often leads to a high volume of potential interactions by individual agents. To mitigate the graph expansion in the Graph Neural Network and provide a more accurate representation of model uncertainties, we suggest randomly partitioning the interactions from the original synthetic population. This approach allows us to create multiple input datasets for JUNE-NZ, which can be run in parallel for simulations.
Input: To create the input for JUNE-NZ, cli_input needs the following datasets:
- target data: The target dataset represents the ground truth, such as the recorded number of COVID-19 cases, that the model aims to learn and predict.
- agent data: The agent dataset comprises essential attributes for modeling purposes, such as age, sex, ethnicity, geographic area, group (e.g., the unique household identifier to which the agent belongs), spec (e.g., household, city transport, etc.), and timestamp.
- sa2 - DHB data: This dataset provides a mapping between SA2 (ID) and DHB (ID and name).
Output: The output from cli_input includes:
- target data: This dataset represents the ground truth to be learned by the model.
- agent_group data: This dataset provides the agent ID and the group (e.g., the unique household identifier) to which the agent belongs. It is extracted from the agent data.
2. target data
The target dataset represents the ground truth, such as the recorded number of COVID-19 cases, that the model aims to learn and predict. For now, the target dataset must be in __parquet__ format. An example of the target dataset is shown below:
| Region | Week_11 | Week_13 | Week_15 | Week_16 | Week_17 | Week_18 | Week_19 | Week_20 | Week_21 | Week_22 |
|---|---|---|---|---|---|---|---|---|---|---|
| Waitemata | 1.0 | 2 | 3.0 | 0.0 | 4.0 | 6.0 | 6 | 6 | 5 | 4 |
| Auckland | 2.0 | 2 | 1.0 | 0.0 | 1.0 | 1.0 | 2 | 1 | 4 | 2 |
| Counties Manukau | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 2 | 2 | 1 | 3 |
cli_input combines the above dataset and produces output like the following (this is the form used in training for JUNE-NZ):
| target | |
|---|---|
| Week_2 | 0.0 |
| Week_3 | 0.0 |
| Week_4 | 3.0 |
| Week_5 | 5.0 |
| Week_6 | 11.0 |
| Week_7 | 22.0 |
The processed target data is stored as a CSV file in the working directory.
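To make the combining step concrete, here is a minimal sketch in plain Python. It assumes that "combine" means summing the weekly counts over all regions, which the text does not confirm; the function name `combine_target` and the `week` column header are hypothetical. The regional values are copied from the example target table above.

```python
import csv
import io

# Example rows copied from the regional target table above.
weeks = ["Week_11", "Week_13", "Week_15", "Week_16", "Week_17",
         "Week_18", "Week_19", "Week_20", "Week_21", "Week_22"]
regional = {
    "Waitemata": [1.0, 2, 3.0, 0.0, 4.0, 6.0, 6, 6, 5, 4],
    "Auckland": [2.0, 2, 1.0, 0.0, 1.0, 1.0, 2, 1, 4, 2],
    "Counties Manukau": [0.0, 0, 0.0, 0.0, 0.0, 1.0, 2, 2, 1, 3],
}


def combine_target(regional, weeks):
    """Sum weekly counts over regions.

    ASSUMPTION: 'combine' is taken to mean a plain sum across regions;
    the actual cli_input logic may differ.
    """
    totals = [0.0] * len(weeks)
    for counts in regional.values():
        for i, value in enumerate(counts):
            totals[i] += float(value)
    return dict(zip(weeks, totals))


target = combine_target(regional, weeks)

# Write the single-column target as CSV text (an in-memory buffer stands in
# for a file in the working directory).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["week", "target"])  # hypothetical header names
for week, total in target.items():
    writer.writerow([week, total])
csv_text = buffer.getvalue()
```

In practice the regional table would be read from the parquet file rather than typed in by hand; the literal values are used here only to keep the sketch self-contained.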
3. agent and agent group data
The agent and agent group data describe the agents and the interactions that are used in the model.
3.1 Base agent data
The agent data is produced by the original JUNE model and then processed accordingly (e.g., converting ethnicity from name to identifier, adding vaccination, etc.).
It is exported in parquet format.
An example of agent data is shown below:
| id | age | sex | ethnicity | area | group | spec | time |
|---|---|---|---|---|---|---|---|
| 0 | 0 | m | European | 110400 | Household_00692 | household | 20200302T00 |
| 276 | 6 | f | European | 110400 | Household_00692 | household | 20200302T00 |
| 1 | 0 | f | European | 110400 | Household_01228 | household | 20200302T00 |
| 2 | 0 | m | European | 110400 | Household_00371 | household | 20200302T00 |
| 386 | 8 | m | European | 110400 | Household_00371 | household | 20200302T00 |
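
The "adding vaccination" processing step mentioned above could look roughly like the following sketch. It assumes each agent is independently marked as vaccinated with a probability equal to the rate for their ethnicity, using the rates from the vaccine_ratio block in the Configuration section. The function name `add_vaccination` and the `vaccine` field are hypothetical, and the real pipeline may use a different scheme.

```python
import random

# Rates copied from the vaccine_ratio block in the Configuration section.
vaccine_ratio = {
    "European": 0.75,
    "Maori": 0.47,
    "Pacific": 0.6,
    "Asian": 0.89,
    "MELAA": 0.75,
}

# Two rows from the example agent table above (only the fields used here).
agents = [
    {"id": 0, "age": 0, "ethnicity": "European"},
    {"id": 276, "age": 6, "ethnicity": "European"},
]


def add_vaccination(agents, vaccine_ratio, seed=0):
    """Return copies of the agents with a 0/1 'vaccine' flag.

    ASSUMPTION: each agent is vaccinated independently with probability
    equal to the rate of their ethnic group. The 'vaccine' field name is
    hypothetical.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    flagged = []
    for agent in agents:
        rate = vaccine_ratio[agent["ethnicity"]]
        flagged.append({**agent, "vaccine": int(rng.random() < rate)})
    return flagged


agents_with_vaccine = add_vaccination(agents, vaccine_ratio, seed=1)
```

Changing the seed gives a different random assignment, which fits the idea of producing several distinct input datasets from the same synthetic population.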
3.2 sa2 - DHB (intermediate data)
This is a straightforward lookup table that maps each SA2 to its DHB, as shown in the following example:
| SA2 | DHB_code | DHB_name |
|---|---|---|
| 460 | 146100 | Counties Manukau |
| 463 | 146400 | Counties Manukau |
| 461 | 146800 | Counties Manukau |
| 4742 | 147500 | Counties Manukau |
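
One plausible use of this mapping is to attach a DHB to each agent so that agents can be matched to the regional target data. The sketch below assumes that the agent `area` values correspond to the `SA2` column of the mapping, which the text does not state explicitly. The helper name `agents_per_dhb` and the three sample agents are hypothetical.

```python
from collections import Counter

# Rows copied from the example mapping table above.
sa2_to_dhb = [
    {"SA2": 460, "DHB_code": 146100, "DHB_name": "Counties Manukau"},
    {"SA2": 463, "DHB_code": 146400, "DHB_name": "Counties Manukau"},
    {"SA2": 461, "DHB_code": 146800, "DHB_name": "Counties Manukau"},
    {"SA2": 4742, "DHB_code": 147500, "DHB_name": "Counties Manukau"},
]

# Hypothetical agents; ASSUMPTION: 'area' uses the same codes as 'SA2'.
sample_agents = [
    {"id": 10, "area": 460},
    {"id": 11, "area": 463},
    {"id": 12, "area": 999},  # not present in the mapping
]


def agents_per_dhb(agents, mapping_rows):
    """Count agents per DHB name, skipping areas missing from the mapping."""
    lookup = {row["SA2"]: row["DHB_name"] for row in mapping_rows}
    counts = Counter()
    for agent in agents:
        dhb_name = lookup.get(agent["area"])
        if dhb_name is not None:
            counts[dhb_name] += 1
    return counts


dhb_counts = agents_per_dhb(sample_agents, sa2_to_dhb)
```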
3.3 Agent group data
The agent group data represents the mapping between agent IDs and their corresponding group identifiers, as illustrated in the example below:
| id | group |
|---|---|
| 1014806 | Household_313093 |
| 1014807 | Household_313988 |
| 1014808 | Household_312993 |
3.4 Interaction data
Interaction data is derived from the base agent data (e.g., as produced by JUNE) by constructing pairwise records for each interaction. Each row of the dataset is structured as follows:
- id_x: Identifies one of the individuals involved in the interaction.
- id_y: Identifies the other individual in the interaction.
- spec_x: Represents the category of the venue group (e.g., school, hospital) using a unique identifier (e.g., 0, 1, etc.).
- group: Specifies the venue where the interaction takes place.
An example of the interaction data is shown below:
| id_x | id_y | spec_x | group |
|---|---|---|---|
| 25732 | 27402 | 0 | 329971 |
| 27401 | 27402 | 0 | 329971 |
| 27400 | 27402 | 0 | 329971 |
| 25733 | 27402 | 0 | 329971 |
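
As an illustration of how pairwise records can be derived from the agent data, the sketch below pairs up all agents that share the same group. It is a guess at the general idea, not the actual cli_input implementation: the function name `build_interactions` and the `SPEC_CODES` mapping from spec names to integer identifiers are hypothetical, and the group is kept as the original string label rather than a numeric identifier.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mapping from spec name to integer identifier.
SPEC_CODES = {"household": 0}

# Rows from the example agent table above (only the fields used here).
agents = [
    {"id": 0, "group": "Household_00692", "spec": "household"},
    {"id": 276, "group": "Household_00692", "spec": "household"},
    {"id": 1, "group": "Household_01228", "spec": "household"},
    {"id": 2, "group": "Household_00371", "spec": "household"},
    {"id": 386, "group": "Household_00371", "spec": "household"},
]


def build_interactions(agents, spec_codes):
    """Create one row per unordered pair of agents sharing a group."""
    members = defaultdict(list)
    for agent in agents:
        members[(agent["group"], agent["spec"])].append(agent["id"])

    rows = []
    for (group, spec), ids in members.items():
        # combinations() over sorted ids yields each pair once, id_x < id_y.
        for id_x, id_y in combinations(sorted(ids), 2):
            rows.append({"id_x": id_x, "id_y": id_y,
                         "spec_x": spec_codes[spec], "group": group})
    return rows


interactions = build_interactions(agents, SPEC_CODES)
```

Because the number of pairs grows quadratically with group size, large venues produce many rows, which is the motivation for subsampling interactions described in the Background.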
4. Configuration
The configuration for cli_input contains two parts:
- interaction_ratio: This parameter specifies the desired percentage of original interactions to be included in the dataset.
- vaccine_ratio: This parameter accounts for the vaccination rates among different ethnic groups.
An example of the configuration can be found below:
interaction_ratio:
  household: 0.1
  cinema: 0.1
  pub: 0.1
  gym: 0.1
  grocery: 0.1
  company: 0.05
  school: 0.05
  hospital: 0.03
  inter_city_transport: 0.3
  city_transport: 0.3
vaccine_ratio:
  European: 0.75
  Maori: 0.47
  Pacific: 0.6
  Asian: 0.89
  MELAA: 0.75
The dataset will be randomly generated according to the percentages specified in the configuration. This allows us to produce multiple datasets with distinct synthetic population representations, which can then be employed in the model to generate ensemble-based model outputs.
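The random generation described above can be sketched as follows. This is a minimal illustration, assuming that for each venue type a fixed fraction of the interaction rows (given by interaction_ratio) is drawn without replacement, and that different random seeds yield the different datasets of the ensemble. The function name `sample_interactions`, the `spec` field holding the venue type name, and the toy interaction rows are all hypothetical.

```python
import random
from collections import defaultdict

# A subset of the interaction_ratio values from the configuration above.
interaction_ratio = {"household": 0.1, "school": 0.05}

# Toy interactions: 100 household rows and 100 school rows.
all_interactions = (
    [{"id_x": i, "id_y": i + 1, "spec": "household"} for i in range(100)]
    + [{"id_x": i, "id_y": i + 1, "spec": "school"} for i in range(1000, 1100)]
)


def sample_interactions(interactions, ratio_by_spec, seed):
    """Keep a fraction of the interactions for each spec.

    ASSUMPTION: the ratio is applied per spec as an exact fraction of the
    rows (rounded to the nearest integer), sampled without replacement.
    """
    rng = random.Random(seed)
    by_spec = defaultdict(list)
    for row in interactions:
        by_spec[row["spec"]].append(row)

    sampled = []
    for spec, rows in by_spec.items():
        keep = round(len(rows) * ratio_by_spec[spec])
        sampled.extend(rng.sample(rows, keep))
    return sampled


# One dataset per seed; each could feed a separate JUNE-NZ run.
ensemble = [sample_interactions(all_interactions, interaction_ratio, seed)
            for seed in range(3)]
```

Each element of `ensemble` contains 10 household and 5 school interactions here, and rerunning with the same seed reproduces the same subset.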