module3.knit

Module 3

Real-world application of SOM: Modeling the COVID-19 Pandemic in India

The COVID-19 pandemic had a devastating impact worldwide in 2020 and 2021. In an attempt to study the dynamics of the pandemic, it is required to analyze the number of rising cases and deaths. However, the sheer amount and variety of data on the topic make the task difficult. Hence, dimensionality reduction was performed as the initial approach towards analyzing the data and it was achieved using SOM. The obtained data groups or clusters were represented graphically as dynamic SOM charts and were simultaneously highlighted over the India map. The KML file associated with the India map was obtained from the GeoServer platform by National Remote Sensing Centre, Department of Space, Government of India [1]. The data used for this module ranges from 1st March 2021 to 1st June 2021 and contains figures on all the states in India as of 2021. The data was extracted from the Bangalore Centre of the Indian Statistical Institute [2] and the Ministry of Family Health and Welfare, Govt. of India [3].

Figure 1: SOM applied over India’s COVID-19 data

Part 1: Loading necessary libraries
Part 2: Creating required functions
Part 3: Loading data and necessary files
Part 4: Generating SOM plots
Part 5: Generating an animation
Part 6: Conclusion

Part 1: Loading necessary libraries

# 1.1) Loading the necessary libraries.
suppressWarnings(library(readr))
suppressWarnings(library(dplyr))
suppressWarnings(library(stringr))
suppressWarnings(library(rgdal))
suppressWarnings(library(sf))
suppressWarnings(library(ggplot2))
suppressWarnings(library(magick))
suppressWarnings(library(plotfunctions))

Part 2: Creating required functions

# 2.1) A function to obtain the sum of squared distance between "x" and "y" vectors.
euclidean_distance <- function(x, y) {
  ret <- sqrt(sum((x - y)^2))
  return(ret)
}

# 2.2) Function to create an SOM grid.
# "x*y" is the number of neurons.
# "p" is the number of columns of the original dataset or the number of input dimensions.
create_grid <- function(x,y,p) {
  ret <- matrix(data = rnorm(x * y * p), nrow = x * y, ncol = p)
  return(ret)
}

# 2.3) Function to decay the radius exponentially over time. 
# "radius" is the initial radius value.
# "current_iteration" represents the current iteration.
# "time_constant" is the time constant.
decay_radius_function <- function(radius, current_iteration, time_constant) {
  ret <- radius * exp(-current_iteration / time_constant)
  return(ret)
}

# 2.4) Function to decay the learning rate.
# "learning_rate" is the current learning rate.
# "current_iteration" is the current iteration.
# "n_iteration" is the number of iterations.
decay_learning_rate <- function(learning_rate, current_iteration, n_iteration) {
  ret <- learning_rate * exp(-current_iteration / n_iteration)
  return(ret)
}

# 2.5) A function to calculate the influence over neighbouring neurons.
# "distance" is the lateral distance.
# "radius" is the current neighbourhood radius.
influence_calculation <- function(distance, radius) {
  ret <- exp(-(distance^2) / (2 * (radius^2)))
  return(ret)
}

# 2.6) A function to obtain the winning neuron.
# "x" is a single row of data.
# "input_grid" is the grid.
BMU <- function(x, input_grid) { 
  distance <- 0
  min_distance <- 10000000 
  min_ind <- -1 
  for (e in 1:nrow(input_grid)) # Iterating through the grid.
  {
    distance <- euclidean_distance(x, input_grid[e, ]) # Calculating euclidean distance.
    if (distance < min_distance) {
      min_distance <- distance # Updating minimum distance for winning neuron.
      min_ind <- e # Updating winning neuron.
    }
  }
  return(min_ind-1) # Returns index of BMU.
}

# 2.7) Faster BMU implementation.
BMU_Vectorised <- function(x, input_grid) { 
  dist_mtrx=rowSums(sweep(input_grid,2,x)^2) # Calculating the distance of "x" row from all the neurons using matrix operation.
  min_ind=which.min(dist_mtrx) # Finding the location of the neuron associated with the minimum distance.
  return (min_ind-1) # Returning the zero-indexed value of the winning neuron.
}

# 2.8) A function to encapsulate the complete SOM algorithm.
# "x" is the input.
# "input_grid" is the SOM grid that shall get updated with each iteration.
SOM <- function(x, input_grid) {
  breaker <- 0
  n_iteration <- nrow(x) # Defining the number of iterations.
  initial_learning_rate <- 0.5 # Defining initial learning rate.
  initial_radius <- 15 # Defining initial radius.
  time_constant <- n_iteration / log(initial_radius) # Initializing time constant.
  lateral_distance_points=expand.grid(1:sqrt(nrow(input_grid)),1:sqrt(nrow(input_grid))) # Initializing physical locations of neurons to figure out lateral distance.
  rows=sqrt(nrow(input_grid)) # Taking the number of rows as square root of the number of entries in the grid.
  n_epochs=40 # Defining the number of epochs.
  new_radius <- initial_radius
  l <- c()
  for(ne in 1:n_epochs)
  {
    extra <- ((ne-1)*400)
    for (i in 1:n_iteration) # Looping through for training.
    {
      old_grid=input_grid
      curr_i <- extra + i
      sample_input_row <- as.vector(unlist(x[sample(1:nrow(x), size = 1, replace = F), ])) # Selecting random input row from the given data set.
      new_radius <- decay_radius_function(initial_radius, curr_i, time_constant) # Decaying the radius.
      new_learning_rate <- decay_learning_rate(initial_learning_rate,curr_i, n_iteration) # Decaying the learning rate.
      index_temp <- BMU_Vectorised(sample_input_row, input_grid) # Finding best matching unit for the given input row.
      index_new=c((as.integer(index_temp/rows)+1),(index_temp%%rows)+1) # Converting a 1D co-ordinate to a 2D co-ordinate to find lateral distance on the map.
      lateral_distance=sqrt(abs(rowSums(sweep(lateral_distance_points,2,index_new)^2))) # Finding euclidean distance between the given best matching unit and all units of the map.
      rn=which(lateral_distance<=new_radius) # Finding neurons that are within the radius of the winning neuron.
      inf=influence_calculation(lateral_distance[rn],new_radius) # Calculating the influence of winning neuron on neighbours.
      if(length(rn)!=1) # Updating multiple rows if neighbourhood is large.
      {
        # Again calculating the influence of winning neuron on neighbours.
        diff_grid=(sweep(input_grid[rn,],2,sample_input_row))*-1 # Creating a temporary matrix to store the difference between the given data point and the weights of the winning neuron & its neighbours.
        updated_weights=new_learning_rate*inf*diff_grid # Updating operation performed over the winning and neighbouring neurons.
        input_grid[rn,]=input_grid[rn,]+updated_weights # Updating grid entries that are either the winning neuron or its neighbours.
      }
      else # Updating only winning neuron.
      {
        diff_row=(input_grid[rn,]-sample_input_row)*-1 # Creating a temporary matrix to store the difference between the given data point and the weights of the winning neuron & its neighbours.
        updated_weights=new_learning_rate*inf*diff_row # Updating operation performed over the winning neuron & its neighbours.
        input_grid[rn,]=input_grid[rn,]+updated_weights # Updating grid entries that are either the winning neuron or its neighbours.
      }
      l <- c(l,euclidean_distance(old_grid,input_grid))
      if(isTRUE(all.equal(old_grid,input_grid)))
      {
        breaker <- 1
        break
      }
    }
    if(breaker ==1)
    {
      break
    }
  }
  return(list(input_grid,l)) # Returning the updated SOM weights.
}

# 2.9) A function to visualize the SOM results by mapping numeric values to colors.
drawGrid<- function(weight,dimension,showPlot=TRUE){
  
  # Converting to a matrix.
  weight<-as.matrix(weight, ncol = ncol(weight))
  
  norm.matrix<-NULL
  
  # Calculating the matrix norm.
  for(i in 1:length(weight[,1])){
    a<-norm(weight[i,], type = "2")
    norm.matrix<-rbind(norm.matrix,a)
  }

  # Mapping to the numeric range from 10 to 46.
  input_start<-min(norm.matrix)
  input_end<-max(norm.matrix)
  output_start<-15
  output_end<-51
  
  # Calculating wavelength on the basis of the matix norm.
  color<-NULL
  for(i in 1:length(norm.matrix)){
    input = norm.matrix[i]
    output = output_start + ((output_end - output_start) / (input_end - input_start)) * (input - input_start)
    color<-rbind(color,output)
  }
  
  # Getting colors (hex values) from wavelength values.
  color.rgb<-rainbow(51, rev = T)[color] 
  
  # Plotting the grid.
  if(showPlot){
    dim<-max(dimension)+1
    plot(1:dim, type = "n")
    
    for (i in 1:dimension[1]){
      for(j in 1:dimension[2]){
        rect(i,j,i+1,j+1, col = color.rgb[i*dimension[1]+j - dimension[1]])
      }
    }
  }
  
  return(color.rgb)
}

Part 3: Loading data and necessary files

The sf package [4] was used to plot the spatial data as it provides a standard and straightforward way to encode spatial vector data. It binds to the rgdal package [5] for reading and writing the data. The COVID-19 data for India contained the total number of confirmed cases, cured cases and deaths for each state. The original data contained cumulative values; therefore, the number of daily infections was obtained by subtracting values between consecutive dates.

# 3.1)  Creating directory for storing output.
dir.create("SOM_Maps")
dir_out <- paste0(getwd(),"/SOM_Maps/")

# 3.2) Reading required files.
isro <- st_read("Resources/cauvery-INDIA_STATE_250K.kml")

## Reading layer `cauvery:INDIA_STATE_250K' from data source 
##   `D:\FOSSEE\SOM\Resources\cauvery-INDIA_STATE_250K.kml' using driver `KML'
## Simple feature collection with 36 features and 2 fields
## Geometry type: GEOMETRYCOLLECTION
## Dimension:     XY
## Bounding box:  xmin: 68.10606 ymin: 6.679998 xmax: 97.41535 ymax: 37.07833
## Geodetic CRS:  WGS 84

isro <- isro[-1,]

# 3.3) The COVID-19 data required to be processed before analyzing.
source("Resources/Data Preprocessing.R")

# The "Data Preprocessing.R" file contains the following code.
  # # Loading libraries
  # library(readr)
  # library(tidyverse)
  # 
  # # Reading the COVID-19 dataset
  # df<-read_csv("Resources/summarymohfw1update.csv")
  # 
  # # Data processing
  # n<-(ncol(df)-1)/4
  # name<-c("TCIN","TCFN","Cured","Death")
  # name<-rep(name,n)
  # name<-append(name,"State.UnionTerritory",0)
  # colnames(df)<-name
  # df<-df[-c(1,38),]
  # date<-seq(as.Date("2020/3/10"), as.Date("2021/10/1"), "days")
  # df2<-NULL
  # for(i in seq(1,n)){
  #   df1<-df[,c(1:5)]
  #   df1$date<-rep(date[i],36)
  #   df2<-rbind(df2,df1)
  #   df<-df[,-c(2:5)]
  # }
  # df2$TCIN <- as.numeric(df2$TCIN)
  # df2$TCFN <- as.numeric(df2$TCFN)
  # df2$Confirmed<- df2$TCIN + df2$TCFN
  # df2 <- df2 %>% select(date,State.UnionTerritory,TCIN,TCFN,Cured,Death,Confirmed)
  # 
  # # Saving the final data frame as a CSV file
  # write.csv(df2,"Resources/covid_data.csv")
  # rm(df,df1,df2,date,name,n)

# 3.4) Reading the processed data.
covid <- read.csv("Resources/covid_data.csv")
head(covid)

##   X       date        State.UnionTerritory TCIN TCFN Cured Death Confirmed
## 1 1 2020-03-10              Andhra Pradesh    0    0     0     0         0
## 2 2 2020-03-10 Andaman and Nicobar Islands    0    0     0     0         0
## 3 3 2020-03-10           Arunachal Pradesh    0    0     0     0         0
## 4 4 2020-03-10                       Assam    0    0     0     0         0
## 5 5 2020-03-10                       Bihar    0    0     0     0         0
## 6 6 2020-03-10                  Chandigarh    0    0     0     0         0

# 3.5) Filtering columns.
data<-covid[,c(2,3,7,8)]

# 3.6) Formatting data for plotting.
data<-data[12205:length(data[,1]),] %>% arrange(date,State.UnionTerritory)
average.length<-7
iteration<-0 
date.range<-seq(from = as.Date("2021/3/1"), to = as.Date("2021/6/1"),by = "day")

head(data)

##         date        State.UnionTerritory Death Confirmed
## 1 2021-02-12 Andaman and Nicobar Islands    62      5007
## 2 2021-02-12              Andhra Pradesh  7161    888692
## 3 2021-02-12           Arunachal Pradesh    56     16832
## 4 2021-02-12                       Assam  1086    217277
## 5 2021-02-12                       Bihar  1521    261568
## 6 2021-02-12                  Chandigarh   344     21184

Part 4: Generating SOM plots

The COVID-19 data was fed to the SOM model after reformating it and the model was trained with a continuous seven-day moving average, starting from 1st March 2021 to 1st June 2021. Later, a feature map (2D grid) was prepared from the weights returned by the model. The color of each Indian state and union territory was determined by feeding the associated data to the BMU function that was applied over the trained model. It returned the index number of the winning neuron for that particular input state or union territory data. From the obtained index number, the position of the winning neuron in the 2D grid was determined and its color was assigned to the corresponding state or union territory in India’s map.

For this module, the visualization algorithm did not use a continuous color spectrum of wavelengths from 400 nm to 700 nm because the colors obtained from the spectrum were hard to distinguish. Therefore instead of using the photobiology package [6], thirty six colors from the rainbow color palette of base R [7] were used. The number of colors was decided based on India’s total number of states and union territories as of 2021.

The entire process was iterated for each week’s data until it reached the end. Finally, around ninety feature maps and corresponding India maps with clustered states were obtained. Later these obtained map images were stitched together to create a GIF file displaying an animated transition of the COVID-19 infection across the country.

# 4.1) Iterating over all rows of the dataset.
for(iteration in 0:(length(date.range)-1)){ 
  
    start<-36*iteration+1
    end<-start+36*average.length
    
    previous.range<-seq(start,start+35)
    later.range<-seq(end,end+35)
    
    week<-cbind.data.frame(Name = data[1:36,2],
                           Deaths = (data[later.range,3] - data[previous.range,3])/average.length,
                           Confirmed = (data[later.range,4] - data[previous.range,4])/average.length
    )
    
    # Note: Replace the below mentioned code with "data.set<-week_1[,-1]" if want execution over data of individual weeks.
    data.set<-week[,-1] # Removing the names.
    
    # Creating a 4*4 grid using the function defined above.
    set.seed(222)
    grid <- create_grid(30,30,2)
    
    # Training the model.
    y <- SOM(data.set,grid)
    
    # Weights for 900 neurons, i.e., 30*30 grid
    gridSOM <- y[1]
    
    # Saving the plot.
    img.name<-paste0(dir_out,"Day ",iteration+1,".png",sep = "")
    png(img.name, width = 1280, height = 720)
    par(mfrow =c(1,2))
      
    # Retrieving weights and plotting the map.
    gridSOM<-matrix(unlist(gridSOM),ncol=2)
    color.rgb<-drawGrid(gridSOM,c(30,30),TRUE)
      
    # Matching states with neurons.
    index<-NULL
    for(i in 1:nrow(data.set)){
        k<-as.matrix(data.set[i,],ncol = 2)
        index<-rbind(index,c(week[i,1],BMU_Vectorised(k,gridSOM)))
      }
      
    # Removing states that are not present in the map.
    index<-rbind(index[1:8,],index[8,],index[9:36,])
    index<-index[-c(19,33),]
    index.numbers<-as.numeric(index[,2])
    color.index<-color.rgb[index.numbers]

    # Plotting the grid.
    title_string <- paste0("SOM For COVID-19 Cases in India on ",date.range[iteration+1])
    plot(1:100, type ="n", xlim=c(60,110), ylim=c(5,40), xlab="Latitude", ylab="Longitude", main=title_string) 
    for(i in 1:36){
        plot(st_geometry(isro[i,]),col =color.index[i], add =TRUE)
      }
    dev.off()
}

Part 5: Generating an animation

# 5.1) Obtaining all maps in one list.
imgs <- list.files(dir_out, full.names = F)
imgs_index <- str_replace(imgs,".png","")
imgs_index <- sort(as.numeric(str_replace(imgs_index,"Day ","")))
imgs <- paste0(dir_out,"Day ",imgs_index,".png")
img_list <- lapply(imgs, image_read)
# 5.2) Joining the images together.
img_joined <- image_join(img_list)
# 5.3) Animating at the rate of 2 frames per second.
img_animated <- image_animate(img_joined, fps = 2)
# 5.4) Viewing the animated image.
img_animated

# 5.5) Saving to disk.
image_write(image = img_animated,path = "som_map_animation.gif")

Part 6: Conclusion

SOM for dimensionality reduction of India’s COVID-19 data got successfully implemented. It was a complicated task to implement SOM over the data due to the challenges faced while mapping clusters from the obtained 2D grid of neurons to the map of India. Mapping was done by first locating the coordinates of the desired state or union territory and then coloring it with the color of the associated cell in the 2D grid. To verify the usefulness of implementing SOM over the COVID-19 data, we compared the finally obtained GIF to various online graphics associated with COVID-19 outbreak intensity across India provided by local news channels and pandemic information sources. We were delighted to observe that the hotspots reported in the news matched with the clusters from the GIF.

References

[1] GeoServer by National Remote Sensing Centre, Department of Space, Government of India. https://bhuvan-vec3.nrsc.gov.in/bhuvan/web/wicket/bookmarkable/org.geoserver.web.demo.MapPreviewPage;jsessionid=E704D8D216740FC88D13EC934562D253.worker1?0
[2] COVID-19 India-Timeline an understanding across States and Union Territories. 2020. https://www.isibang.ac.in/~athreya/incovid19/
[3] Ministry of Health and Family Welfare, Government of India. https://www.mohfw.gov.in/
[4] Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
[5] Roger Bivand, Tim Keitt and Barry Rowlingson (2021). rgdal: Bindings for the ‘Geospatial’ Data Abstraction Library. R package version 1.5-27. https://CRAN.R-project.org/package=rgdal
[6] Aphalo, Pedro J. (2015) The r4photobiology suite. UV4Plants Bulletin, 2015:1, 21-29. DOI: 10.19232/uv4pb.2015.1.14
[7] R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/