Stratification and sampling️ of population centers in Loreto, Peru 🎲✨

1 Introduction

This document describes the methodology carried out for the stratification and selection of population centers in Loreto for drone surveillance. This activity is part of the the Harmonize project: Harmonisation of spatio-temporal multi-scale data for health in climate change hotspots in the Peruvian Amazon (SIDISI 209821), in collaboration with the Barcelona Supercomputing Center.

The Harmonize project has the following objectives:

  1. Geo-location of cases collected by passive surveillance from the Peruvian Ministry of Health (MINSA) through the Center for Disease Control and Prevention (CDC-Peru) in the study area using GPS devices.

  2. Identify potential mosquito breeding sites based on new longitudinal ground survey data and using drone surveillance.

  3. Collect data from weather stations, satellite imagery, ambient acoustic sounds and air pollution to calibrate and reduce the spatial resolution of Earth observation data sets.

  4. Determine the impact of the use of these new technologies on infectious disease reduction interventions carried out by the Dirección Regional de Salud (DIRESA) of Loreto.

The methodology described here corresponds to the second objective above. Since we cannot map the whole study area of Loreto, our approach was to characterize the population centers in Loreto according to the incidence of relevant diseases in the area (Malaria, Dengue and Leptospirosis) and meteorological, environmental and human intervention variables and then use this variables to create sampling strata using clustering techniques. Finally, we randomly sampled a number of the population centers from each stratum proportional to the stratum size.

2 Dataset

For each population center, we gathered records on the following fields:

  • UBIGEOCP, which is the official identification code of the population centers used by the Instituto Nacional de Estadística e Informática (INEI).
  • Official name.
  • Latitude and longitude.
  • 2022 cumulative incidence of Malaria, Dengue and Leptospirosis.
  • Difference in the cumulative incidence of Malaria between 2010 and 2019.
  • Mean and difference of the annual mean values of the years 2010 and 2021 of the following variables:
    • Precipitation accumulation (mm),
    • Runoff (mm),
    • Soil moisture (mm),
    • Maximum temperature (°C),
    • Minimum temperature (°C),
    • Total evapotranspiration (kg/m2),
    • Specific humidity (kg/kg),
    • Deforestation area (km2).
  • Mean and difference of the annual values of the years 2010 and 2020 of the following variables:
    • Population density from WorldPop and
    • Human settlement data from the Global Human Settlement Layer (GHSL).
  • Mean and difference of the annual mean values from years 2014 and 2021 of the nighttime data from the Visible Infrared Imaging Radiometer Suite (VIIRS).

We can take a look at the data in the following table:

Code
# Load necessary packages
library(dplyr)
library(factoextra)
library(cluster)
library(leaflet)

# Read dataset
processed_path = "data/processed/"
file_name = "ccpp-10km.csv"
file_path = fs::path(processed_path, file_name)
ccpp = readr::read_csv(file_path, col_types = "ccccccddcd")

# Display dataset
ccpp %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  DT::datatable(
    options = list(display = "compact", pageLength = 5, scrollX = TRUE)
  )

3 Stratification

The stratification of the population centers consisted of using the meteorological, environmental and human intervention variables to built groups of population centers with homogeneous characteristics. For this purpose, we tested two approaches. The first one consisted of performing an Agglomerative (“bottom-up”) Hierarchical Clustering (AHC) analysis using different distances, linkage methods, and number of clusters selection methods. On the second one, we used Principal Component Analysis (PCA) to reduce the dimension of the variables to one principal dimension and used it’s scores to create groups by quantile classification.

3.1 Agglomerative hierarchical clustering analysis

The AHC analysis consisted of two steps. First, we run several configurations the agglomerative nesting algorithms and selected the one which gave us the best hierarchical clustering (HC) structure. Secondly, from the selected hierarchical clustering structure, we calculated and examined a number of indices for determining the most appropriate number of clusters.

3.1.1 Chosing the hierarchical clustering structure

For this first step, we tested combinations of the following methods:

  • Scaling method: Whether to use normalization (scaling from 0 to 1) or standardadization (subtracting the mean and dividing by the standard deviation).
  • Distance: Whether to use euclidean distance or correlation-based distances, such as Pearson or Spearman correlation.
  • Linkage method: Whether to use average, single, complete, Ward or weighted average linkage method.

For each combination, we run the agglomerative nesting algorithm and calculated the agglomerative coefficient (AC) which is a measure of the strength of the HC structure (Boehmke and Greenwell 2019, 421–22). For the 3 combinations of methods with larger AC, we also examined the resulting dendograms.

Code
# Select variables for analysis
variables = select(ccpp, dengue:malaria_diff)

# Function to scale data either by standardization or normalization
scale_data = function(data, method) {
  if (method == "standardization") {
    data_scaled = mutate(data, across(everything(), ~as.numeric(scale(.x))))
  } 
  if (method == "normalization") {
    data_scaled = mutate(data, across(everything(), scales::rescale))
  }
  data_scaled
}

# Listing the configurations for the scaling, distance and linkage methods
scaling_method = c("standardization", "normalization")
dist_method = c("euclidean", "pearson", "spearman")
linkage_method = c("average", "single", "complete", "ward", "weighted")

# Table with all the combinations
design = tidyr::expand_grid(scaling_method, dist_method, linkage_method, variables)

# Calculate the agglomerative coefficient for each combination
agglomerative_coeff = design %>% 
  tidyr::nest(data = -c(scaling_method, dist_method, linkage_method)) %>% 
  mutate(
    scaling = purrr::map2(data, scaling_method, ~scale_data(.x, method = .y)),
    dist_mat = purrr::map2(scaling, dist_method, ~get_dist(.x, method = .y)),
    tree = purrr::map2(dist_mat, linkage_method, ~agnes(.x, method = .y)),
    ac = purrr::map(tree, ~.x$ac)
  ) %>% 
  tidyr::unnest(ac) %>% 
  arrange(-ac)

The best 3 HC structures according to the AC were the following:

Code
# Display the best 3 structures 
agglomerative_coeff %>% 
  slice_head(n = 3) %>% 
  select(-c(data, scaling, dist_mat, tree)) %>% 
  mutate(across(where(is.character), stringr::str_to_title)) %>% 
  mutate(across(where(is.numeric), ~round(.x, 4))) %>% 
  DT::datatable(
    colnames = c("Scaling", "Distance", "Linkage", "AC"),
    options = list(dom = 't')
  )


The difference of the first 3 AC values are almost negligible. Consequently, we examined the dendograms:

Code
# Function to plot a dendogram
dendogram = function(row, title) {
  dendogram <- fviz_dend(agglomerative_coeff$tree[[row]], main = title) + 
    theme(text = element_text(size = 13))
  
  dendogram
}
Code
dendogram(1, "Normalization, Pearson correlation, Ward linkage")

Code
dendogram(2, "Standardization, Spearman correlation, Ward linkage")

Code
dendogram(3, "Standardization, Pearson correlation, Ward linkage")

After inspecting the dendograms, we chose the second HC structure as it seems to yield more homogeneous groups with relatively similar sizes for the different height values.

3.2 Determining the number of clusters

We calculated and plotted the Average silhouette width, Total within sum of squares and Gap statistic for a range of 1 to 10 number of clusters and examined the optimal number of cluster for each criteria. We defined a range of numbers of clusters that met or near met the optimal criteria and then for each one of them we used PCA to visualize the clusters in 2 dimensions to evaluate how well the groups were formed. In addition, the silhouette information of each observation was plotted to assess if there were observations that might have been misclustered.

3.2.1 Evaluation of the criteria of the optimal number of clusters

Code
# Function to plot optimal number of clusters plot for a certain criteria
nbclust_plot = function(data, method, ...) {
  nbclust <- fviz_nbclust(
    data, hcut, hc_func = "agnes", hc_method = "ward.D2", 
    c_metric = "spearman", method = method, ...
  ) + theme(text = element_text(size = 13))
  
  nbclust
}

# Scale variables using standardization
variables_std = mutate(variables, across(everything(), ~as.numeric(scale(.x))))
Code
set.seed(2022)
nbclust_plot(variables_std, "silhouette")

Code
nbclust_plot(variables_std, "wss")

Code
set.seed(2022)
nbclust_plot(variables_std, "gap_stat", verbose = FALSE)

According to the average silhouette width, the optimal number of clusters is 5. In the total within sum of squares plot, we see an apparent elbow at 4 clusters. Finally, the Gap statistic tells us that the optimal number of clusters is 10 or more. However, we do not wanted to work with more than 5 clusters, 4 or 5 clusters would be the closest optimal.

3.2.2 Clusters plots

Next, we plotted the clusters in a plane using PCA for 4 and 5 clusters.

Code
# Function to cut tree
cut_tree = function(data, k) {
  tree = hcut(
    data, k = k, hc_func = "agnes", hc_method = "ward.D2", 
    hc_metric = "spearman", graph = TRUE
  )
  tree
}

# Function to generate cluster plot
cluster_plot = function(tree) {
  cluster = fviz_cluster(tree, ggtheme = theme_classic()) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme(text = element_text(size = 13))
  
  cluster
}
Code
tree_4 = cut_tree(variables_std, 4)
cluster_plot(tree_4)

Code
tree_5 = cut_tree(variables_std, 5)
cluster_plot(tree_5)

We can see that the fifth group is formed from some populated centers of clusters 2 and 3 in the 4-cluster graph.

3.2.3 Individual silhouette widths

Next, we plotted the individual silhouette width for the observations grouped 4 and 5 clusters.

Code
# Function to plot individual silhouette widths
silhouette_plot = function(tree) {
  ind_silhouette <- tree %>%
    fviz_silhouette(ggtheme = theme_classic(), print.summary = FALSE) +
    theme(text = element_text(size = 13))
  
  ind_silhouette
}
Code
silhouette_plot(tree_4)

Code
silhouette_plot(tree_5)

We see that with 5 clusters we obtain a greater average silhouette width (0.44) than with 4 clusters (0.41). However, with 5 clusters with get greater negative values for some populated centers, indicating that they may be misclustered.

3.2.4 HC groups on maps

As an additional step, for each number of clusters, we plotted the population centers (polygon centroids) on a map to see their spatial distribution.

Code
# Create table with the groups
ccpp_clusters = ccpp %>% 
  mutate(
    hc_group_4 = as.factor(tree_4$cluster),
    hc_group_5 = as.factor(tree_5$cluster)
  )

# Function to plot a map with the clusters
cluster_map = function(data, palette, group_var) {
  values = pull(data, {{group_var}})
  
  map = data %>% 
    sf::st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
    leaflet() %>% 
    addTiles(group = "OpenStreetMap") %>% 
    addProviderTiles(provider = providers$CartoDB, group = "CartoDB") %>%
    addProviderTiles(
      provider = providers$Esri.WorldImagery, group = "Satelital"
    ) %>%
    addCircleMarkers(
      popup = ccpp$population_center, color = ~palette(values),
      opacity = 1, radius = 0.1, fillOpacity = 0.5
    )%>%
    addLayersControl(
      baseGroups = c("CartoDB", "OpenStreetMap", "Satelital")
    ) %>%
    addLegend(
      title = "Cluster", pal = palette, values = values, opacity = 1,
      position = "bottomright"
      )
  map
}
Code
# Palette for 4 clusters
palette_hc_4 = colorFactor(
  palette = "viridis", domain = ccpp_clusters$hc_group_4
)

# Plot cluster map
cluster_map(ccpp_clusters, palette_hc_4, hc_group_4)
Code
# Palette for 5 clusters
palette_hc_5 = colorFactor(
  palette = "viridis", domain = ccpp_clusters$hc_group_5
)

# Plot cluster map
cluster_map(ccpp_clusters, palette_hc_5, hc_group_5)

We see that the population centers seem to group, showing a strong spatial component driving the clustering.

3.3 Principal components analysis

Principal component analysis (PCA) was used as an alternative method for clustering. Our approach was to use the principal component to build an aggregate index of all the variables considered. Then, the scores of the populated centers on this principal component were grouped into different categories according to the quantiles of the distribution. In order to compare this approach with the results of hierarchical clustering, we tested with 4 (25th, 50th and 75th quantiles) and 5 (20th, 40th, 60th and 80th quantiles) groups.

Code
# Perform PCA
pca = FactoMineR::PCA(variables_std, graph = FALSE)

3.3.1 Scree plot

After having performed the PCA, we plotted the scree plot to check the percentage of explained variance each component or dimension.

Code
pca %>% 
  fviz_screeplot() +
  theme(text = element_text(size = 13))

We see that the first component explain approximately 50% of the variance in the data. This is not such a high value, meaning that the variables do not seem to be strongly linearly correlated. Still, 50% is an acceptable value in practice.

3.3.2 Contribution plot

Next, we plotted the top 10 variables according to their contribution to the first component.

Code
pca %>% 
  fviz_contrib(choice = "var", axes = 1, top = 10) +
  theme(text = element_text(size = 13))

We see that the soil moisture, precipitation and runoff variables are the ones that contribute the most to explaining the first component.

3.3.3 PCA groups on maps

Finally, we created the groups dividing the population centers according to the quantiles of the scores on the principal component. Then we plotted the population centers on maps for 2, 3 and 4 groups.

Code
# Get scores
pca_results = get_pca_ind(pca)
scores = pca_results$coord

# Create groups by quantile categorization
ccpp_strata = ccpp_clusters %>% 
  mutate(pca_score = scores[, 1]) %>% 
  mutate(
    pca_group_4 = cut(
      pca_score, breaks = quantile(pca_score, c(0, .25, .5, .75, 1)), 
      labels = as.character(rev(1:4)),
      include.lowest = TRUE
    ),
    pca_group_5 = cut(
      pca_score, breaks = quantile(pca_score, c(0, .2, .4, .6, .8, 1)), 
      labels = as.character(rev(1:5)),
      include.lowest = TRUE
    )
  ) %>% 
  mutate(across(pca_group_4:pca_group_5, forcats::fct_rev))
Code
# Palette for 4 groups
palette_pca_4 = colorFactor(
  palette = "viridis", domain = ccpp_strata$pca_group_4
)

# Plot cluster map
cluster_map(ccpp_strata, palette_pca_4, pca_group_4)
Code
# Palette for 5 groups
palette_pca_5 = colorFactor(
  palette = "viridis", domain = ccpp_strata$pca_group_5
)

# Plot cluster map
cluster_map(ccpp_strata, palette_pca_5, pca_group_5)

Geographically, the PCA groups seem to be distributed very similar to the HC groups.

3.4 Comparing HC and PCA

We assessed how strong was the agreement between the HC method and PCA method on the grouping of the population centers for 4 and 5 groups. Confusion matrices and the Kappa statistic were used for this analysis.

3.4.1 Confusion matrices

Code
ccpp_strata %>% 
  yardstick::conf_mat(hc_group_4, pca_group_4, dnn = c("PCA", "HC")) %>% 
  autoplot(type = "heatmap") +
  theme(text = element_text(size = 13))

Code
ccpp_strata %>% 
  yardstick::conf_mat(hc_group_5, pca_group_5, dnn = c("PCA", "HC")) %>% 
  autoplot(type = "heatmap") +
  theme(text = element_text(size = 13))

Code
# Get kappa statistic
kap_4 = yardstick::kap(ccpp_strata, hc_group_4, pca_group_4)
kap_5 = yardstick::kap(ccpp_strata, hc_group_5, pca_group_5)

From the confusion matrices it is evident a moderate agreement between the HC method and the PCA method. The agreement is stronger with 4 groups (kappa = 0.87) than with and 5 (kappa = 0.67).

In summary, both 4 and 5 groups give very similar results in terms of visual and statistical indicators. However, we ultimately opted to work with 4 clusters to avoid an excessive disaggreation of populated centers.

4 Selection of populated centers

Having chosen the number of clusters, we determined the number of population centers we should select from each one of them to obtain a sample of 10 population centers with the same proportion of cases in each cluster.

Code
# Sample size for each cluster
round(table(ccpp_strata$hc_group_4)/6)

1 2 3 4 
3 2 2 2 

Therefore, we needed to select roughly 3 populated centers from cluster 1 and 2 from each of the other ones.

We did not take a random sample approach since we had the following restraints:

  • The Direccion Regional de Salud de Loreto (DIRESA Loreto) selected 5 population centers that are of epidemiological interest.
  • We had previously conducted field work to collect georeferenced points to reconstruct the perimeter of 21 populated centers. This is work that would have to be done for all the communities selected for drone mapping. So, for the sake of convenience, it was better to select from these population centers whose perimeters we already had.
  • We also filtered out those communities that required more than 3 hexagons of 1km2 area for drone flights.

In the end, we were left with 24 populated centers (5 from DIRESA Loreto, 19 from InnovaLab), and we took these as our sampling frame. The 5 populated centers from DIRESA Loreto were a must and 4 of them were in cluster 1 and 1 in cluster 2, so from the InnovaLab collection we randomly sample 1 populated center from cluster 2 and 2 from clusters 3 and 4 to have a total sample size of 10.

The size and the percentage from the total of the clusters and the final sample are shown in the following table:

Cluster Size Sample size
1 18 (30%) 4 (40%)
2 14 (23.3%) 2 (20%)
3 14 (23.3%) 2 (20%)
4 14 (23.3%) 2 (20%)
Code
# Read the 26 populated centers previously selected
raw_path = "data/raw/"
ccpp_previous_path = fs::path(raw_path, "ccpp-perimeters-diresa.csv")
ccpp_previous_ccpp = readr::read_csv(ccpp_previous_path, col_types = "cccccccd")
ccpp_previous_strata = filter(
  ccpp_strata, ubigeocp %in% ccpp_previous_ccpp$UBIGEOCP
)

# Select from DIRESA Loreto selection
ccpp_sample_diresa = ccpp_previous_strata %>% 
  filter(group == "DIRESA")

# Select from InnovaLab collection
n_samples_innova = c(1, 2, 2)
set.seed(2023)
ccpp_sample_innova = ccpp_previous_strata %>% 
  filter(group == "INNOVALAB", hc_group_4 != "1") %>% 
  tidyr::nest(data = -hc_group_4) %>% 
  mutate(
    sample = purrr::map2(data, n_samples_innova, ~slice_sample(.x, n = .y))
  ) %>% 
  select(hc_group_4, sample) %>% 
  tidyr::unnest(sample) %>% 
  relocate(hc_group_4, .before =  hc_group_5)
Code
set.seed(2023)
group_3_sample = ccpp_previous_strata %>% 
  filter(hc_group_4 == "3", !(nomcp %in% c("HUAMBE", "23 DE FEBRERO"))) %>% 
  slice_sample(n = 1)

ccpp_sample_innova_new = filter(ccpp_sample_innova, nomcp != "HUAMBE")

# Final sample
ccpp_sample = bind_rows(
  ccpp_sample_innova_new, group_3_sample, ccpp_sample_diresa
)

Finally, we plot the samples in a map to see their geographical distribution.

Code
# Coordinates of all the population centers
ccpp_coord = ccpp_strata %>% 
  sf::st_as_sf(coords = c("lng", "lat"))

# Coordinates of the 10 samples
ccpp_sample_coord = ccpp_sample %>% 
  sf::st_as_sf(coords = c("lng", "lat"))

# Function to plot the samples on a map
sample_map = function(data, palette, group_var) {
  values = pull(data, {{group_var}})
  map = data %>% 
    leaflet() %>% 
    addTiles(group = "OpenStreetMap") %>% 
    addProviderTiles(provider = providers$CartoDB, group = "CartoDB") %>%
    addProviderTiles(
      provider = providers$Esri.WorldImagery,group = "Satelital"
    ) %>%
    addCircleMarkers(
      data = ccpp_coord, opacity = 1, radius = 0.1, fillOpacity = 0.5, 
      color = "lightgray"
      ) %>% 
    addCircleMarkers(
      popup = data$population_center, color = ~palette(values),
      opacity = 1, radius = 0.1, fillOpacity = 0.5
      )%>% 
    addLayersControl(baseGroups = c("CartoDB","OpenStreetMap", "Satelital")) %>% 
    addLegend(
      title = "Cluster", pal = palette, values = values, opacity = 1,
      position = "bottomright"
    )
  map
}

# Palette for the sample
sample_palette = colorFactor(
  palette = "viridis", domain = ccpp_sample$hc_group_4
)

# Map the samples
sample_map(ccpp_sample, sample_palette, hc_group_4)

References

Boehmke, Brad, and Brandon Greenwell. 2019. Hands-on Machine Learning with r. Chapman; Hall/CRC.