Input

This section describes the JUNE-NZ component cli_input. Note that cli_input was originally written to convert measles data records into a format that JUNE-NZ can use, so it is not a necessary step for every modelling task.

1. Background

JUNE-NZ relies on input data primarily generated by the population layer of the original JUNE model. The synthetic population in JUNE often leads to a high volume of potential interactions by individual agents. To mitigate the graph expansion in the Graph Neural Network and provide a more accurate representation of model uncertainties, we suggest randomly partitioning the interactions from the original synthetic population. This approach allows us to create multiple input datasets for JUNE-NZ, which can be run in parallel for simulations.

  • Input: To create the input for JUNE-NZ, cli_input needs the following datasets:

    • target data: The target dataset represents the ground truth, such as the recorded number of COVID-19 cases, that the model aims to learn and predict.

    • agent data: The agent dataset comprises essential attributes for modeling purposes, such as age, sex, ethnicity, geographic area, group (e.g., the unique household identifier to which the agent belongs), spec (e.g., household, city transport, etc.), and timestamp.

    • sa2 - DHB data: This dataset provides a mapping between SA2 (ID) and DHB (ID and name).

  • Output: The output from cli_input includes:

    • target data: This dataset represents the ground truth to be learned from the model.

    • agent_group data: This dataset provides the agent ID and the group (e.g., the unique household identifier) to which the agent belongs. It is extracted from the agent data.

2. target data

The target dataset represents the ground truth, such as the recorded number of COVID-19 cases, that the model aims to learn and predict. For now, the target dataset must be in parquet format. An example of the target dataset is shown below:

Region            Week_11  Week_13  Week_15  Week_16  Week_17  Week_18  Week_19  Week_20  Week_21  Week_22
Waitemata         1.0      2        3.0      0.0      4.0      6.0      6        6        5        4
Auckland          2.0      2        1.0      0.0      1.0      1.0      2        1        4        2
Counties Manukau  0.0      0        0.0      0.0      0.0      1.0      2        2        1        3

cli_input combines the regional records in a dataset like the one above into a single target series, similar to the following (this is the form used for training JUNE-NZ):

        target
Week_2  0.0
Week_3  0.0
Week_4  3.0
Week_5  5.0
Week_6  11.0
Week_7  22.0

The processed target data is stored as a CSV file in the working directory.
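The source does not spell out how the regional records are combined. As a minimal sketch, assuming the combination is a plain sum over regions for each week, the step could look like the following. The function names combine_regions and to_csv are hypothetical and not part of cli_input, and the counts are copied from the first three week columns of the example table.

```python
import csv
import io

# Regional weekly counts, laid out like the example target table
# (first three week columns only, stored as floats).
regional_counts = {
    "Waitemata": {"Week_11": 1.0, "Week_13": 2.0, "Week_15": 3.0},
    "Auckland": {"Week_11": 2.0, "Week_13": 2.0, "Week_15": 1.0},
    "Counties Manukau": {"Week_11": 0.0, "Week_13": 0.0, "Week_15": 0.0},
}


def combine_regions(counts):
    """Sum the counts of all regions for each week (assumed aggregation)."""
    totals = {}
    for weekly in counts.values():
        for week, value in weekly.items():
            totals[week] = totals.get(week, 0.0) + value
    return totals


def to_csv(totals):
    """Serialise the weekly totals as CSV text with a single 'target' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["", "target"])  # unnamed index column, then the target
    for week, value in totals.items():
        writer.writerow([week, value])
    return buffer.getvalue()


target = combine_regions(regional_counts)
target_csv = to_csv(target)
```

Writing to an in-memory buffer keeps the sketch free of file-system side effects; in practice the text would be written to a file in the working directory.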

3. agent and agent group data

The agent and agent group data are related to the agent and interactions that we will use in the model.

3.1 Base agent data

The agent data is produced by the original JUNE model and then post-processed (e.g., converting ethnicity from a name to an identifier, adding vaccination status, etc.). It is exported in parquet format.

An example of agent data is shown below:

id   age  sex  ethnicity  area    group            spec       time
0    0    m    European   110400  Household_00692  household  20200302T00
276  6    f    European   110400  Household_00692  household  20200302T00
1    0    f    European   110400  Household_01228  household  20200302T00
2    0    m    European   110400  Household_00371  household  20200302T00
386  8    m    European   110400  Household_00371  household  20200302T00

3.2 sa2 - DHB (intermediate data)

This mapping table describes the relationship between SA2 and DHB, as shown in the following example:

SA2   DHB_code  DHB_name
460   146100    Counties Manukau
463   146400    Counties Manukau
461   146800    Counties Manukau
4742  147500    Counties Manukau

3.3 Agent group data

The agent group data represents the mapping between agent IDs and their corresponding group identifiers, as illustrated in the example below:

id       group
1014806  Household_313093
1014807  Household_313988
1014808  Household_312993
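Since the agent group data is extracted from the agent data, the extraction amounts to keeping the id and group columns. The following is a minimal sketch under that assumption. The function extract_agent_group is a hypothetical name rather than the cli_input API, and the records are taken from the base agent data example in section 3.1.

```python
# A few agent records in the layout of the base agent data example.
agents = [
    {"id": 0, "age": 0, "sex": "m", "ethnicity": "European", "area": 110400,
     "group": "Household_00692", "spec": "household", "time": "20200302T00"},
    {"id": 276, "age": 6, "sex": "f", "ethnicity": "European", "area": 110400,
     "group": "Household_00692", "spec": "household", "time": "20200302T00"},
    {"id": 1, "age": 0, "sex": "f", "ethnicity": "European", "area": 110400,
     "group": "Household_01228", "spec": "household", "time": "20200302T00"},
]


def extract_agent_group(records):
    """Keep only (id, group) per record, dropping exact duplicates."""
    seen = set()
    pairs = []
    for record in records:
        key = (record["id"], record["group"])
        if key not in seen:
            seen.add(key)
            pairs.append({"id": record["id"], "group": record["group"]})
    return pairs


agent_group = extract_agent_group(agents)
```

Duplicates are dropped because the same agent can appear in the same group at several timestamps in the agent data.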

3.4 Interaction data

Interaction data is derived from the base agent data (e.g., the output of JUNE) to construct a pairwise dataset in which each row describes one interaction. Each row is structured as follows:

  • id_x: Identifies one of the individuals involved in the interaction.

  • id_y: Identifies another individual in the interaction.

  • spec_x: Represents the category of the venue group (e.g., school, hospital) using a unique identifier (e.g., 0, 1, etc.).

  • group: Specifies the venue where the interaction takes place.

An example of the data is shown below:

id_x   id_y   spec_x  group
25732  27402  0       329971
27401  27402  0       329971
27400  27402  0       329971
25733  27402  0       329971
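One way to picture the pairwise construction is to pair every two agents that share the same venue. The sketch below assumes exactly that (all pairs within a group interact), which may differ from what cli_input actually does. The function build_interactions and the SPEC_CODES mapping are hypothetical, and the group is kept as the string identifier from the agent data rather than the numeric identifier shown in the example table.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mapping from venue category name to its identifier.
SPEC_CODES = {"household": 0}

# Reduced agent records (only the columns needed here), from the 3.1 example.
agents = [
    {"id": 0, "group": "Household_00692", "spec": "household"},
    {"id": 276, "group": "Household_00692", "spec": "household"},
    {"id": 1, "group": "Household_01228", "spec": "household"},
    {"id": 2, "group": "Household_00371", "spec": "household"},
    {"id": 386, "group": "Household_00371", "spec": "household"},
]


def build_interactions(records, spec_codes):
    """Pair up all agents sharing a (spec, group) venue, one row per pair."""
    members = defaultdict(list)
    for record in records:
        members[(record["spec"], record["group"])].append(record["id"])

    rows = []
    for (spec, group), ids in members.items():
        # combinations() yields each unordered pair once; single-member
        # groups yield no pairs and therefore no interactions.
        for id_x, id_y in combinations(sorted(set(ids)), 2):
            rows.append({"id_x": id_x, "id_y": id_y,
                         "spec_x": spec_codes[spec], "group": group})
    return rows


interactions = build_interactions(agents, SPEC_CODES)
```

A group of n agents produces n(n-1)/2 rows under this assumption, which illustrates why the number of interactions grows quickly for large venues.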

4. Configuration

The configuration for cli_input contains two parts:

  • interaction_ratio: This parameter specifies the desired percentage of original interactions to be included in the dataset.

  • vaccine_ratio: This parameter accounts for the vaccination rates among different ethnic groups.

An example of the configuration can be found below:

interaction_ratio:
    household: 0.1
    cinema: 0.1
    pub: 0.1
    gym: 0.1
    grocery: 0.1
    company: 0.05
    school: 0.05
    hospital: 0.03
    inter_city_transport: 0.3
    city_transport: 0.3

vaccine_ratio:
    European: 0.75
    Maori: 0.47
    Pacific: 0.6
    Asian: 0.89
    MELAA: 0.75

The dataset will be randomly generated according to the percentages specified in the configuration. This allows us to produce multiple datasets with distinct synthetic population representations, which can then be employed in the model to generate ensemble-based model outputs.
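The random generation can be sketched as follows, assuming that interaction_ratio is applied by sampling a fixed share of interactions per venue category and that vaccine_ratio is used as a per-agent vaccination probability by ethnicity. Both are assumptions about the behaviour, and sample_interactions and assign_vaccination are hypothetical names. The interaction rows here carry the category name in a spec field for readability.

```python
import random

# Subset of the configuration shown above.
interaction_ratio = {"household": 0.1, "school": 0.05}
vaccine_ratio = {"European": 0.75, "Maori": 0.47}


def sample_interactions(rows, ratios, rng):
    """Keep a random share of interactions for each venue category."""
    by_spec = {}
    for row in rows:
        by_spec.setdefault(row["spec"], []).append(row)

    sampled = []
    for spec, spec_rows in by_spec.items():
        keep = round(len(spec_rows) * ratios[spec])
        sampled.extend(rng.sample(spec_rows, keep))  # without replacement
    return sampled


def assign_vaccination(agents, ratios, rng):
    """Flag each agent as vaccinated with its ethnicity's probability."""
    return {a["id"]: rng.random() < ratios[a["ethnicity"]] for a in agents}


# Toy inputs: 100 household and 40 school interactions, two agents.
rows = [{"id_x": i, "id_y": i + 1, "spec": "household"} for i in range(100)]
rows += [{"id_x": i, "id_y": i + 1, "spec": "school"} for i in range(40)]
agents = [{"id": 0, "ethnicity": "European"}, {"id": 1, "ethnicity": "Maori"}]

# Each seed yields one member of the ensemble of input datasets.
sampled = sample_interactions(rows, interaction_ratio, random.Random(0))
vaccinated = assign_vaccination(agents, vaccine_ratio, random.Random(0))
```

Passing an explicit random.Random instance makes each generated dataset reproducible from its seed, so ensemble members can be regenerated independently.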