Adventures with R - Cricket Analysis (creating a Player Database)

One problem I ran into in the main series was that I could not build a comprehensive player database. That part of the code was fairly manual.

I kept noodling and tinkering around to get an approach for creating a comprehensive player database. This would replace the manual approach of extracting one player link at a time.

I had initially set out to use the excellent rvest package but ran into some issues deciphering the XPath required to make the links work. I believe the player information is coded directly as HTML tags on Cricinfo, and it would have taken me a couple of XPath loops to get the player name and then the player profile id out. Definitely doable, but I will keep rvest for another piece of code I have in mind.

Instead, I focused on using dplyr and tidyr, along with stringr and sqldf, to do my heavy lifting.

The code can be found HERE

Code walkthrough: I am reproducing the first part of the code here; the main code is available for everyone to look at.

library(stringr)
library(sqldf)
library(dplyr)
library(tidyr)

batsmen_list <- list()
i <- 0

# Hard coding the run to 50 pages. Each page has 50 batsmen, so we will have a database of the top 2500 batsmen, all time.
# The class parameter in the URL can be changed to generate more players:
# Class 11 is all Test/ODI/T20 combined, Class 1 is top batsmen for Tests, Class 2 is top batsmen for ODI, Class 3 is top batsmen for T20

# First run is for overall top batsmen
for(i in 1:50){
  
  print(i)
  
  main <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;page="
  page <- i
  main1 <- ";template=results;type=batting"
  
  url <- sprintf("%s%s%s", main, page, main1)
  
  lines <- readLines(url)
  
  lines <- trimws(lines)
  
  lines <- lines[lines != '']
  
  link <- grep('href="', lines, value = TRUE) %>% 
    gsub('.*?href=\"(*ci/content/player/*)\">', '\\1', .)
  
  link <- paste0('http://stats.espncricinfo.com', link)
  
  batsmen_list[[i]] <- data.frame(link = link, stringsAsFactors = FALSE)
  
}

lapply(batsmen_list, nrow)

batsmen_df <- data.table::rbindlist(batsmen_list) %>% as.data.frame


batsmen_df_clean <- sqldf("select * from batsmen_df where link like '%/ci/content/player%' ")

# Extending the table to have a numeric player id

# Returns the first number found in each string
numextract <- function(string) {
  str_extract(string, "\\-*\\d+\\.*\\d*")
}


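batsmen_df_clean$PlayerID <- numextract(batsmen_df_clean$link)
batsmen_df_clean$PlayerID <- as.numeric(as.character(batsmen_df_clean$PlayerID))

batsmen_df_clean$PlayerName <- as.character(sub(".*> *(.*?) *profile.*", "\\1", batsmen_df_clean$link))

# Drop rows where the name pattern did not match (the value is still the full link).
# grepl() is used instead of -grep() so that no rows are lost when nothing matches.
batsmen_df_clean <- batsmen_df_clean[!grepl("http", batsmen_df_clean$PlayerName), ]
# Drop the first column (link) and remove duplicate players
batsmen_df_clean <- batsmen_df_clean[-c(1)]
batsmen_df_clean <- unique(batsmen_df_clean)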


batsmen_Overall <- batsmen_df_clean
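
To make the id and name extraction step easier to follow, here is a minimal sketch that runs on a single made-up line instead of the live site. The sample markup, the player name and the id are all hypothetical, and the real page may be laid out differently. It also swaps in stringr's str_match with capture groups in place of the numextract and sub calls used above, so treat it as an illustration of the idea and not the code from the main script.

```r
library(stringr)

# Made-up sample of one scraped line (an assumption; the real markup may differ)
sample_line <- '<a href="/ci/content/player/12345.html" class="data-link">AB Example</a>'

# Group 1 captures the digits after /ci/content/player/, group 2 the link text
parts <- str_match(
  sample_line,
  '/ci/content/player/(\\d+)\\.html"[^>]*>([^<]+)</a>'
)

# Column 1 of the result is the full match; columns 2 and 3 are the groups
player <- data.frame(
  PlayerID   = as.numeric(parts[, 2]),
  PlayerName = str_trim(parts[, 3]),
  stringsAsFactors = FALSE
)

print(player)
```

Lines that do not match the pattern come back as NA in both columns, so they can be dropped with a simple !is.na() filter.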

The final batsmen_Overall data frame provides a listing of up to 2500 batsmen with their profile ids, which can be used in the main code.
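
The same run can be repeated for the other classes mentioned in the code comments (Tests, ODI, T20) by changing the class value in the URL. A minimal sketch of how the URLs could be generated is below. The helper name build_stats_url and the classes vector are my own illustrative names, not part of the main code, and the sketch only builds the URLs without fetching anything.

```r
# Hypothetical helper: build the stats URL for a given class and page number
build_stats_url <- function(class_id, page) {
  sprintf(
    "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=%d;page=%d;template=results;type=batting",
    class_id, page
  )
}

# Class codes as described in the comments of the code above
classes <- c(Overall = 11, Tests = 1, ODI = 2, T20 = 3)

# One URL per class/page combination (first two pages only, to keep it short)
urls <- unlist(lapply(classes, function(cl) build_stats_url(cl, 1:2)))

print(urls)
```

Each of these URLs could then be passed through the same readLines and cleaning steps shown above, with the results stored in one list per class.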
