Adventures with R - Cricket Analysis (creating a Player Database)

One of the problems I ran into in the main series was that I could not build a comprehensive player database. That part of the code was fairly manual.

I kept tinkering to find an approach for building a comprehensive player database, one that would replace the manual approach of extracting one player link at a time.

I had initially set out to use the excellent rvest package but ran into some issues working out the XPath required to pull out the links. I believe that the player information is coded directly as HTML tags on Cricinfo, and it would have taken me a couple of XPath loops to get the player name and then the player profile ID out. Definitely doable (but I will keep rvest for another piece of code I have in mind).
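For comparison, here is a rough sketch of how the rvest route might look. The HTML snippet is made up to stand in for a results page (the real Cricinfo markup may well differ), the XPath is an assumption about what would be needed, and the names and IDs are sample values only. Under those assumptions, a single XPath query returns the anchor nodes, and both the name and the profile ID can be read from them without looping.

```r
library(rvest)

# Made-up HTML standing in for one results page; the real markup may differ
page <- minimal_html('
  <table>
    <tr><td><a href="/ci/content/player/35320.html">SR Tendulkar</a></td></tr>
    <tr><td><a href="/ci/content/team/6.html">India</a></td></tr>
    <tr><td><a href="/ci/content/player/7133.html">RT Ponting</a></td></tr>
  </table>')

# One XPath query selects every anchor whose href points at a player profile
player_nodes <- html_elements(
  page,
  xpath = "//a[contains(@href, '/ci/content/player/')]"
)

# The link text gives the name; the digits in the href give the profile id
players <- data.frame(
  PlayerName = html_text2(player_nodes),
  PlayerID   = as.numeric(gsub("[^0-9]", "", html_attr(player_nodes, "href"))),
  stringsAsFactors = FALSE
)

print(players)
```

The team link is skipped because its href does not contain the player path, so only the two player rows remain.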

I focused on using dplyr and tidyr to do my heavy lifting.

The code can be found HERE

Code walk-through: I am reproducing the first part of the code here; the main code is available for everyone to look at.

library(stringr)
library(sqldf)
library(dplyr)
library(tidyr)

batsmen_list <- list()
i <- 0

# Hard-coding the run to 50 pages. Each page has 50 players, so we will have a database of the top 2500 players of all time
# Other class values can be used to generate more players:
# Class 11 is all Test/ODI/T20 combined, Class 1 is top batsmen for Tests, Class 2 is top batsmen for ODIs, Class 3 is top batsmen for T20s


# First Run is for Overall Top Batsmen 

# Each page has 50 batsmen, focusing on the top 2500 batsmen
for(i in 1:50){
  
  print(i)
  
  main <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;page="
  page <- i
  main1 <- ";template=results;type=batting"
  
  url <- sprintf("%s%s%s", main, page, main1)
  
  lines <- readLines(url)
  
  lines <- trimws(lines)
  
  lines <- lines[lines != '']
  
  link <- grep('href="', lines, value = TRUE) %>% 
    gsub('.*?href=\"(*ci/content/player/*)\">', '\\1', .)
  
  link <- paste0('http://stats.espncricinfo.com', link)
  
  batsmen_list[[i]] <- data.frame(link = link, stringsAsFactors = FALSE)
  
}

lapply(batsmen_list, nrow)

batsmen_df <- data.table::rbindlist(batsmen_list) %>% as.data.frame


batsmen_df_clean <- sqldf("select * from batsmen_df where link like '%/ci/content/player%' ")

# Extending the table to have a numeric player id

numextract <- function(string) {
  # Pull the first number (optional sign and decimal part) out of each string
  str_extract(string, "\\-*\\d+\\.*\\d*")
}

batsmen_df_clean$PlayerID <- as.numeric(numextract(batsmen_df_clean$link))

batsmen_df_clean$PlayerName <- as.character(sub(".*> *(.*?) *profile.*", "\\1", batsmen_df_clean$link))

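# Drop rows where the name pattern did not match (the whole link was returned instead).
# grepl() is used because -grep() would drop every row if nothing matched
batsmen_df_clean <- batsmen_df_clean[!grepl("http", batsmen_df_clean$PlayerName), ]
# Drop the raw link column, then de-duplicate
batsmen_df_clean <- batsmen_df_clean[-1]
batsmen_df_clean <- unique(batsmen_df_clean)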


batsmen_Overall <- batsmen_df_clean

The final batsmen_Overall provides a listing of 2500 batsmen with their profile IDs, which can be used in the main code.
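To make the ID and name extraction steps concrete, here is a minimal sketch that runs them on a few made-up lines. The sample markup, names and IDs are assumptions for illustration (the real Cricinfo HTML may differ); the two regular expressions are the ones from the code above. Base R's grepl() and regmatches() stand in for sqldf and str_extract() so the snippet needs no packages, and perl = TRUE is added here only to make the matching behaviour explicit.

```r
# Hypothetical sample lines; the real Cricinfo markup may differ
lines <- c(
  '<a href="/ci/content/player/35320.html">SR Tendulkar profile</a>',
  '<a href="/ci/content/player/35320.html">SR Tendulkar</a>',
  '<a href="/ci/content/team/6.html">India</a>',
  '<a href="/ci/content/player/7133.html">RT Ponting profile</a>'
)

link <- paste0("http://stats.espncricinfo.com", lines)

# Keep player links only (stands in for the sqldf LIKE filter)
link <- link[grepl("/ci/content/player", link, fixed = TRUE)]

# First number in each string, as numextract() does
id_match  <- regexpr("\\-*\\d+\\.*\\d*", link, perl = TRUE)
player_id <- as.numeric(regmatches(link, id_match))

# Text between the '>' and the word "profile"; unmatched strings come back unchanged
player_name <- sub(".*> *(.*?) *profile.*", "\\1", link, perl = TRUE)

sample_players <- data.frame(
  PlayerID   = player_id,
  PlayerName = player_name,
  stringsAsFactors = FALSE
)

# Rows where the name pattern failed still hold the whole link; drop them
sample_players <- sample_players[!grepl("http", sample_players$PlayerName), ]
sample_players <- unique(sample_players)

print(sample_players)
```

On these sample lines the team link is filtered out first, the player link without the word "profile" is dropped because its name still contains the full URL, and two rows remain. Note that regmatches() silently drops strings with no match, whereas str_extract() returns NA, so the two are only interchangeable when every link contains a number.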
