Input

This section covers the JUNE-NZ component cli_input. Note that cli_input was originally written to convert measles data records into a format that JUNE-NZ can use, so it is not a necessary step for every modelling requirement.

1. Background

JUNE-NZ relies on input data primarily generated by the population layer of the original JUNE model. The synthetic population in JUNE often produces a very large number of potential interactions between individual agents. To limit the growth of the graph in the Graph Neural Network and to better represent model uncertainties, we suggest randomly partitioning the interactions from the original synthetic population. This approach creates multiple input datasets for JUNE-NZ, which can be simulated in parallel.

Input: To create the input for JUNE-NZ, cli_input needs the following datasets:
- target data: The target dataset represents the ground truth, such as the recorded number of COVID-19 cases, that the model aims to learn and predict.
- agent data: The agent dataset comprises essential attributes for modeling purposes, such as age, sex, ethnicity, geographic area, group (e.g., the unique household identifier to which the agent belongs), spec (e.g., household, city transport, etc.), and timestamp.
- sa2 - DHB data: This dataset provides a mapping between SA2 (ID) and DHB (ID and name).
Output: The output from cli_input includes:
- target data: This dataset represents the ground truth to be learned by the model.
- agent_group data: This dataset provides the agent ID and the group (e.g., the unique household identifier) to which the agent belongs. It is extracted from the agent data.

2. target data

The target dataset represents the ground truth, such as the recorded number of COVID-19 cases, that the model aims to learn and predict. For now, the target dataset must be in __parquet__ format. An example of the target dataset is shown below:

Region	Week_11	Week_13	Week_15	Week_17	Week_18	Week_19	Week_20	Week_21	Week_22
Waitemata	1.0	2	3.0	4.0	6.0	6	6	5	4
Auckland	2.0	2	1.0	1.0	1.0	2	1	4	2
Counties Manukau	0.0	0	0.0	0.0	1.0	2	2	1	3

cli_input combines the dataset above and produces output like the following, which is the form used for training in JUNE-NZ:

	target
Week_2	0.0
Week_3	0.0
Week_4	3.0
Week_5	5.0
Week_6	11.0
Week_7	22.0

The processed target data is stored as a CSV file in the working directory.
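The following is a minimal sketch of this step, not the actual cli_input implementation. It assumes that "combining" means summing the weekly counts over all regions, which the text above does not state explicitly. The function name `combine_target` and the file names in the comments are hypothetical.

```python
import pandas as pd


def combine_target(regional: pd.DataFrame) -> pd.DataFrame:
    """Collapse a Region x Week table into a single 'target' column.

    Assumption: regions are combined by summing each week's counts.
    """
    weekly = regional.set_index("Region")
    # Sum over regions; the result is indexed by the week labels.
    return weekly.sum(axis=0).astype(float).to_frame(name="target")


# Small in-memory example using the first three weeks of the table above.
# In practice the input would come from a parquet file, e.g.
#   regional = pd.read_parquet("target.parquet")   # hypothetical file name
regional = pd.DataFrame(
    {
        "Region": ["Waitemata", "Auckland", "Counties Manukau"],
        "Week_11": [1.0, 2.0, 0.0],
        "Week_13": [2, 2, 0],
        "Week_15": [3.0, 1.0, 0.0],
    }
)

target = combine_target(regional)
print(target)

# Writing the result as CSV into the working directory could look like:
#   target.to_csv("target.csv")                    # hypothetical file name
```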

3. agent and agent group data

The agent and agent group data describe the agents and the interactions that are used in the model.

3.1 Base agent data

The agent data is produced by the original JUNE model and then processed further (e.g., converting ethnicity from name to identifier, adding vaccination status, etc.). It is exported in parquet format.

An example of agent data is shown below:

id	age	sex	ethnicity	area	group	spec	time
0	0	m	European	110400	Household_00692	household	20200302T00
276	6	f	European	110400	Household_00692	household	20200302T00
1	0	f	European	110400	Household_01228	household	20200302T00
2	0	m	European	110400	Household_00371	household	20200302T00
386	8	m	European	110400	Household_00371	household	20200302T00

3.2 sa2 - DHB (intermediate data)

This mapping table gives the relationship between SA2 and DHB, as shown in the following example:

SA2	DHB_code	DHB_name
460	146100	Counties Manukau
463	146400	Counties Manukau
461	146800	Counties Manukau
4742	147500	Counties Manukau

3.3 Agent group data

The agent group data represents the mapping between agent IDs and their corresponding group identifiers, as illustrated in the example below:

id	group
1014806	Household_313093
1014807	Household_313988
1014808	Household_312993
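As a rough illustration of how the agent group data could be extracted from the base agent data in 3.1, the sketch below keeps only the `id` and `group` columns. The column names come from the examples above; the extraction itself is an assumption about what cli_input does, not a copy of its code.

```python
import pandas as pd

# A few rows mirroring the base agent data example in 3.1.
agents = pd.DataFrame(
    {
        "id": [0, 276, 1, 2, 386],
        "age": [0, 6, 0, 0, 8],
        "sex": ["m", "f", "f", "m", "m"],
        "ethnicity": ["European"] * 5,
        "area": [110400] * 5,
        "group": [
            "Household_00692",
            "Household_00692",
            "Household_01228",
            "Household_00371",
            "Household_00371",
        ],
        "spec": ["household"] * 5,
        "time": ["20200302T00"] * 5,
    }
)

# Keep one row per (agent, group) membership and drop all other attributes.
agent_group = agents[["id", "group"]].drop_duplicates().reset_index(drop=True)
print(agent_group)
```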

3.4 Interaction data

Interaction data is derived from the base agent data (e.g., the data produced by JUNE) by constructing one row for each pairwise interaction. Each row of the dataset is structured as follows:

- id_x: Identifies one of the individuals involved in the interaction.
- id_y: Identifies the other individual in the interaction.
- spec_x: Represents the category of the venue group (e.g., school, hospital) using a unique identifier (e.g., 0, 1, etc.).
- group: Specifies the venue where the interaction takes place.

An example of the data is shown below:

id_x	id_y	group
25732	27402	329971
27401	27402	329971
27400	27402	329971
25733	27402	329971
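One way to construct such pairwise rows is to join the agent data with itself on `group` and keep each unordered pair once. The sketch below assumes this self-join approach and keeps `group` and `spec` as they appear in the agent data. It does not reproduce any numeric encoding of `group` or `spec_x`, since that encoding is not described here. The function name `build_interactions` is hypothetical.

```python
import pandas as pd


def build_interactions(agents: pd.DataFrame) -> pd.DataFrame:
    """Return one row per pair of agents that share the same group."""
    cols = agents[["id", "group", "spec"]]
    # Self-join on group: every agent is paired with every agent in its group.
    pairs = cols.merge(cols, on="group", suffixes=("_x", "_y"))
    # Drop self-pairs and mirrored duplicates by keeping id_x < id_y only.
    pairs = pairs[pairs["id_x"] < pairs["id_y"]]
    return pairs[["id_x", "id_y", "spec_x", "group"]].reset_index(drop=True)


agents = pd.DataFrame(
    {
        "id": [0, 276, 1, 2, 386],
        "group": [
            "Household_00692",
            "Household_00692",
            "Household_01228",
            "Household_00371",
            "Household_00371",
        ],
        "spec": ["household"] * 5,
    }
)

interactions = build_interactions(agents)
print(interactions)
```

Note that the number of pairs grows quadratically with group size, which is the reason the interactions are subsampled before they are passed to the model.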

4. Configuration

The configuration for cli_input contains two parts:

- interaction_ratio: This parameter specifies, for each venue category, the fraction of the original interactions to include in the dataset (e.g., 0.1 keeps 10%).
- vaccine_ratio: This parameter accounts for the vaccination rates among different ethnic groups.

An example of the configuration can be found below:

interaction_ratio:
    household: 0.1
    cinema: 0.1
    pub: 0.1
    gym: 0.1
    grocery: 0.1
    company: 0.05
    school: 0.05
    hospital: 0.03
    inter_city_transport: 0.3
    city_transport: 0.3

vaccine_ratio:
    European: 0.75
    Maori: 0.47
    Pacific: 0.6
    Asian: 0.89
    MELAA: 0.75

The dataset is randomly generated according to the ratios specified in the configuration. Repeating the generation produces multiple datasets with distinct synthetic population representations, which can then be used in the model to generate ensemble-based outputs.
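The sketch below shows one possible reading of the two configuration parts; it is not the cli_input implementation. It assumes that each interaction is kept independently with a probability equal to its category's `interaction_ratio`, and that each agent is marked as vaccinated with a probability equal to the `vaccine_ratio` of its ethnicity. The column names `spec` (holding the category name) and `vaccinated`, as well as both function names, are hypothetical.

```python
import numpy as np
import pandas as pd

# Subset of the configuration above, written as plain dictionaries.
interaction_ratio = {"household": 0.1, "school": 0.05}
vaccine_ratio = {"European": 0.75, "Maori": 0.47}


def sample_interactions(interactions, ratios, seed):
    """Keep each row with the probability configured for its category."""
    rng = np.random.default_rng(seed)
    # Categories missing from the configuration are dropped (probability 0).
    keep_prob = interactions["spec"].map(ratios).fillna(0.0).to_numpy()
    mask = rng.random(len(interactions)) < keep_prob
    return interactions[mask].reset_index(drop=True)


def assign_vaccination(agents, ratios, seed):
    """Add a boolean 'vaccinated' column drawn from per-ethnicity rates."""
    rng = np.random.default_rng(seed)
    prob = agents["ethnicity"].map(ratios).fillna(0.0).to_numpy()
    out = agents.copy()
    out["vaccinated"] = rng.random(len(agents)) < prob
    return out


# Toy data: 1000 household and 1000 school interactions, 4 agents.
demo_interactions = pd.DataFrame(
    {
        "id_x": np.arange(2000),
        "id_y": np.arange(2000) + 1,
        "spec": ["household"] * 1000 + ["school"] * 1000,
    }
)
demo_agents = pd.DataFrame(
    {"id": [0, 1, 2, 3], "ethnicity": ["European", "Maori", "European", "Maori"]}
)

# Different seeds give different members of the ensemble.
ensemble = [
    sample_interactions(demo_interactions, interaction_ratio, seed)
    for seed in range(3)
]
vaccinated_agents = assign_vaccination(demo_agents, vaccine_ratio, seed=0)
print([len(member) for member in ensemble])
```

Because each row is sampled independently, the kept fraction only approximates the configured ratio. Sampling an exact fraction per category would be an equally valid design.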