Adventures with R - Cricket Analysis

On my journey for more interesting R packages, I stumbled onto CricketR. 
CricketR is a wonderful package found on the CRAN library HERE and written/maintained by tvganesh (GitHub link) . 

Big Hat Tip to him for taking the Crickinfo Stats Guru website and converting the query/output to a R package

As usual, RStudio is the programming interface that I used. For ease of use, I go to Tools --> Global Options --> Appearance. I used Sky as my RStudio theme, Lucida console, font size 11 and Idle Fingers as my Editor theme. It gives a neutral black background and the fonts (comments, code, etc) is much more clearer to see

All the files on the code are as follows,
Main Code
Player File
Venue File
Result File

A key objective I had in this exercise
Avoid manual code runs or brute force. The package was easy enough to run one player at a time but I wanted to run it in an automated fashion for multiple players

  • The R package needs player id to run but Cricinfo provides the PlayerName. I was not able to find an easy way to get the playerid from the PlayerName. All ears if someone has an easy way to mass extract player name and player id. That would allow us to add even more players to the mix. I have around 50 right now - mostly from India but a fair bit of other players also
    • Update : After tinkering around Rvest and some other functions, I was able to figure out how to extract a player database from the cricinfo website. A different blogpost found HERE would walk everyone through the approach on that
  • I have looped the code to run for ODI, T20 and test match separately. A much more efficient approach would be to call the three R functions in the same code block and loop them through. It would not reduce the time on the code but would be more efficient

Some things I want to do on top of this code
1. I have this code running on my local PC. I would love to get someone's guidance to run this on either Azure or AWS in an automated fashion so that it runs once a week and grabs all data and outputs it out as a csv file. Would show me how the entire cloud ML layer works
2. There are functions in the R package to incorporate a forecasting module. I was not able to advance the code to that point. But I believe several ML aspects can be worked on. Some that I want to do, 

  • A forecasting approach to forecast the number of runs in the next N games at a batsman level
  • A clustering approach to cluster batsmen together

## Code Run Through

# You need the package cricketr installed from the CRAN library. 
# install.packages("cricketr")

library(cricketr)
library(dplyr)

library(tidyr)

# I use a directory on my C folder as a working directory. You can use any directory
setwd('C:/Training/R/CricketAnalysis/')

#The next pieces of code loop through the three csv files and create an unique listing of players, venue types and result types

# Get listing of venue
VenueTable <- read.csv("VenueTable.csv")
venue_results <- list()


for (i in 1:nrow(VenueTable)) {
  
  venue_results[[i]] <- data.frame(VenueID = VenueTable$VenueID[i], stringsAsFactors = FALSE)
  
}

Venue <- data.table::rbindlist(venue_results, fill = TRUE)
Venue <- unique(Venue[,1])

# Players in database
PlayerTable <- read.csv("PlayerTable.csv")
player_results <- list()

for (i in 1:nrow(PlayerTable)) {
  
  player_results[[i]] <- data.frame(playerid = PlayerTable$PlayerID[i], stringsAsFactors = FALSE)

}

Player <- data.table::rbindlist(player_results, fill = TRUE)
Player <- unique(Player[,1])

# Result Type
ResultTable <- read.csv("ResultTable.csv")
result_results <- list()

for (i in 1:nrow(ResultTable)) {
  
  result_results[[i]] <- data.frame(ResultID = ResultTable$ResultID[i], stringsAsFactors = FALSE)

}

Result <- data.table::rbindlist(result_results, fill = TRUE)
Result <- unique(Result[,1])


rm(list= ls()[!(ls() %in% c('Result', 'Venue', 'Player', 'PlayerTable', 'VenueTable', 'ResultTable'))])

# Data Processing for ODI
extraction_list <- list()
m <- 0

for (i in 1:nrow(Player)) {
  
  for (j in 1:nrow(Venue)) {
    
    for (k in 1:nrow(Result)) {
      
      m <- m+1
      
      tryCatch ({
        
        post <- getPlayerDataOD(profile = Player$playerid[i], type = "batting", homeOrAway = Venue$VenueID[j], result = Result$ResultID[k]) 
        
        extraction_list[[m]] <- data.frame(Player = Player$playerid[i],
                                           Runs = post$Runs,
                                           Mins = post$Mins,
                                           BF = post$BF,
                                           Fours = post$`4s`,
                                           Sixes = post$`6s`,
                                           SR = post$SR,
                                           Pos = post$Pos,
                                           Dismissal = post$Dismissal,
                                           Inns = post$Inns,
                                           Opposition = post$Opposition,
                                           Ground = post$Ground,
                                           StartDate = post$`Start Date`,
                                           VenueType = Venue$VenueID[j], 
                                           ResultType = Result$ResultID[k], 
                                           MatchType = 'ODI', 
                                           stringsAsFactors = FALSE)
      }, error = function(e) e)
      
    }
    
  }
}


extraction1 <- data.table::rbindlist(extraction_list[!extraction_list=='NA'], fill = TRUE)
extraction1a <- left_join(extraction1, PlayerTable, by=c("Player" = "PlayerID"))
extraction1b <- left_join(extraction1a, VenueTable, by=c("VenueType" = "VenueID"))
extraction1c <- left_join(extraction1b, ResultTable, by=c("ResultType" = "ResultID"))

extraction1 <- subset(extraction1c, select = -c(VenueType, ResultType))

#Data clean up
## Removing rows where Player did not Bat
extraction1 <- extraction1[!(extraction1$Runs == 'DNB'),]
extraction1 <- extraction1[!(extraction1$Runs == 'TDNB'),]

## Convert Runs to Numeric
extraction1 <- extraction1[!(extraction1$Runs == 'absent'),]
extraction1 <- extraction1[!(extraction1$Runs == 'sub'),]
extraction1$Runs = as.numeric(gsub("\\*", "", extraction1$Runs))

# Convert character to numerc in R
extraction1$BF <- as.numeric(as.character(extraction1$BF))

extraction1$Fours[extraction1$Fours=="-"] <- 0
extraction1$Fours <- as.numeric(as.character(extraction1$Fours))

extraction1$Sixes[extraction1$Sixes=="-"] <-0
extraction1$Sixes <- as.numeric(as.character(extraction1$Sixes))

extraction1$Pos <- as.numeric(as.character(extraction1$Pos))
extraction1$Inns <- as.numeric(as.character(extraction1$Inns))

extraction1$SR[extraction1$SR=="-"] <- 0
options(digits=3)
extraction1$SR = as.double(as.character(extraction1$SR))

ODI <- extraction1

# Data Processing for T20
extraction_list <- list()
m <- 0

for (i in 1:nrow(Player)) {
  
  for (j in 1:nrow(Venue)) {
    
    for (k in 1:nrow(Result)) {
      
      m <- m+1
      
      tryCatch ({
        
        post <- getPlayerDataTT(profile = Player$playerid[i], type = "batting", homeOrAway = Venue$VenueID[j], result = Result$ResultID[k]) 
        
        extraction_list[[m]] <- data.frame(Player = Player$playerid[i],
                                           Runs = post$Runs,
                                           Mins = post$Mins,
                                           BF = post$BF,
                                           Fours = post$`4s`,
                                           Sixes = post$`6s`,
                                           SR = post$SR,
                                           Pos = post$Pos,
                                           Dismissal = post$Dismissal,
                                           Inns = post$Inns,
                                           Opposition = post$Opposition,
                                           Ground = post$Ground,
                                           StartDate = post$`Start Date`,
                                           VenueType = Venue$VenueID[j], 
                                           ResultType = Result$ResultID[k], 
                                           MatchType = 'T20', 
                                           stringsAsFactors = FALSE)
      }, error = function(e) e)
      
    }
    
  }
}

#extraction1 <- data.table::rbindlist(extraction_list[!extraction_list=='NULL'], fill = TRUE)
extraction1 <- data.table::rbindlist(extraction_list[!extraction_list=='NA'], fill = TRUE)
extraction1a <- left_join(extraction1, PlayerTable, by=c("Player" = "PlayerID"))
extraction1b <- left_join(extraction1a, VenueTable, by=c("VenueType" = "VenueID"))
extraction1c <- left_join(extraction1b, ResultTable, by=c("ResultType" = "ResultID"))

extraction1 <- subset(extraction1c, select = -c(VenueType, ResultType))

#Data clean up
## Removing rows where Player did not Bat
extraction1 <- extraction1[!(extraction1$Runs == 'DNB'),]
extraction1 <- extraction1[!(extraction1$Runs == 'TDNB'),]

## Convert Runs to Numeric
extraction1$Runs = as.numeric(gsub("\\*", "", extraction1$Runs))

# Convert character to numerc in R
extraction1$BF <- as.numeric(as.character(extraction1$BF))

extraction1$Fours[extraction1$Fours=="-"] <- 0
extraction1$Fours <- as.numeric(as.character(extraction1$Fours))

extraction1$Sixes[extraction1$Sixes=="-"] <-0
extraction1$Sixes <- as.numeric(as.character(extraction1$Sixes))

extraction1$Pos <- as.numeric(as.character(extraction1$Pos))
extraction1$Inns <- as.numeric(as.character(extraction1$Inns))

extraction1$SR[extraction1$SR=="-"] <- 0
options(digits=3)
extraction1$SR = as.double(as.character(extraction1$SR))

T20 <- extraction1

# Data Processing for Test Matches
extraction_list <- list()
m <- 0

for (i in 1:nrow(Player)) {
  
  for (j in 1:nrow(Venue)) {
    
    for (k in 1:nrow(Result)) {
      
      m <- m+1
      
      tryCatch ({
        
        post <- getPlayerData(profile = Player$playerid[i], type = "batting", homeOrAway = Venue$VenueID[j], result = Result$ResultID[k]) 
        
        extraction_list[[m]] <- data.frame(Player = Player$playerid[i],
                                           Runs = post$Runs,
                                           Mins = post$Mins,
                                           BF = post$BF,
                                           Fours = post$`4s`,
                                           Sixes = post$`6s`,
                                           SR = post$SR,
                                           Pos = post$Pos,
                                           Dismissal = post$Dismissal,
                                           Inns = post$Inns,
                                           Opposition = post$Opposition,
                                           Ground = post$Ground,
                                           StartDate = post$`Start Date`,
                                           VenueType = Venue$VenueID[j], 
                                           ResultType = Result$ResultID[k], 
                                           MatchType = 'Test', 
                                           stringsAsFactors = FALSE)
      }, error = function(e) e)
      
    }
    
  }
}

#extraction1 <- data.table::rbindlist(extraction_list[!extraction_list=='NULL'], fill = TRUE)
extraction1 <- data.table::rbindlist(extraction_list[!extraction_list=='NA'], fill = TRUE)
extraction1a <- left_join(extraction1, PlayerTable, by=c("Player" = "PlayerID"))
extraction1b <- left_join(extraction1a, VenueTable, by=c("VenueType" = "VenueID"))
extraction1c <- left_join(extraction1b, ResultTable, by=c("ResultType" = "ResultID"))

extraction1 <- subset(extraction1c, select = -c(VenueType, ResultType))

#Data clean up
## Removing rows where Player did not Bat
extraction1 <- extraction1[!(extraction1$Runs == 'DNB'),]
extraction1 <- extraction1[!(extraction1$Runs == 'TDNB'),]
extraction1 <- extraction1[!(extraction1$Runs == 'absent'),]

## Convert Runs to Numeric
extraction1$Runs = as.numeric(gsub("\\*", "", extraction1$Runs))

# Convert character to numerc in R
extraction1$BF[extraction1$BF=="-"] <- 0
extraction1$BF <- as.numeric(as.character(extraction1$BF))

extraction1$Fours[extraction1$Fours=="-"] <- 0
extraction1$Fours <- as.numeric(as.character(extraction1$Fours))

extraction1$Sixes[extraction1$Sixes=="-"] <-0
extraction1$Sixes <- as.numeric(as.character(extraction1$Sixes))

extraction1$Pos <- as.numeric(as.character(extraction1$Pos))
extraction1$Inns <- as.numeric(as.character(extraction1$Inns))

extraction1$SR[extraction1$SR=="-"] <- 0
options(digits=3)
extraction1$SR = as.double(as.character(extraction1$SR))

Test <- extraction1

write.csv(ODI, file="ODIData.csv")
write.csv(T20, file="T20.csv")
write.csv(Test, file="Test.csv")


rm(list= ls()[!(ls() %in% c('T20','ODI','Test'))])


The final stage provides three csv files. I took the ODI file and created a Tableau Public dashboard that can be accessed HERE

Let me know your thoughts !!!








Comments

Popular posts from this blog

Outsourcing - the new wave !!

The Dragon vs the Tiger

Dekh Sako to Dekh Lo, Lekin Hathoda maare bina