Adventures with R - Cricket Analysis (creating a Player Database)

December 05, 2018

One of the problems I ran into in the main series was that I could not create a comprehensive player database. That part of the code was fairly manual.

I kept noodling and tinkering to find an approach for creating a comprehensive player database. This would replace the manual approach of extracting one player link at a time.

I had initially set out to use the excellent rvest package but ran into some issues trying to work out the XPath required to pull out the links. I believe that the player information is coded directly as HTML tags on Cricinfo, and it would have taken me a couple of XPath loops to get the player name and then the player profile ID out. Definitely doable, but I will keep rvest for another piece of code I have in mind.
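For reference, here is a minimal sketch of what the rvest route might look like. It assumes each player appears as an ordinary anchor tag whose href contains /ci/content/player/ followed by the ID; the HTML snippet is invented for illustration and is not copied from the site, so the real markup may well differ.

```r
library(rvest)

# Hypothetical markup, invented for illustration; the real page may differ.
sample_html <- '
<table>
  <tr><td><a href="/ci/content/player/11111.html">AB Example</a></td></tr>
  <tr><td><a href="/ci/content/player/22222.html">CD Sample</a></td></tr>
  <tr><td><a href="/ci/content/team/3.html">Some Team</a></td></tr>
</table>'

page <- read_html(sample_html)

# One XPath selects every anchor whose href points at a player profile.
player_nodes <- html_elements(
  page,
  xpath = "//a[contains(@href, '/ci/content/player/')]"
)

# The href gives the profile ID, the anchor text gives the name.
hrefs <- html_attr(player_nodes, "href")
players <- data.frame(
  PlayerID = as.numeric(sub(".*/player/(\\d+)\\.html.*", "\\1", hrefs)),
  PlayerName = html_text2(player_nodes),
  stringsAsFactors = FALSE
)

print(players)
```

If the real pages do follow this shape, the same two calls (html_attr and html_text2) would work on a page read directly from its URL.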

I focused on using dplyr and tidyr to do my heavy lifting.

The code can be found HERE

Code walkthrough: I am reproducing the first part of the code here; the main code is available for everyone to look at.

library(stringr)
library(sqldf)
library(dplyr)
library(tidyr)

batsmen_list <- list()
i <- 0

# Pages are hard-coded to 50. Each page has 50 players, so we will have a database of the top 2500 players of all time.
# The class value in the URL can be changed to generate more players:
# Class 11 is all Test/ODI/T20 combined, Class 1 is top batsmen for Tests, Class 2 is top batsmen for ODI, Class 3 is top batsmen for T20

# First run is for the overall top batsmen (class 11)
for(i in 1:50){

  # Progress indicator
  print(i)

  # Build the URL for results page i
  main <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;page="
  page <- i
  main1 <- ";template=results;type=batting"

  url <- sprintf("%s%s%s", main, page, main1)

  # Read the raw HTML, one element per line, and drop blank lines
  lines <- readLines(url)

  lines <- trimws(lines)

  lines <- lines[lines != '']

  # Keep only the lines that contain a link
  link <- grep('href="', lines, value = TRUE) %>%
    gsub('.*?href=\"(*ci/content/player/*)\">', '\\1', .)

  # Prefix the site root; non-player links are filtered out after the loop
  link <- paste0('http://stats.espncricinfo.com', link)

  batsmen_list[[i]] <- data.frame(link = link, stringsAsFactors = FALSE)

}

lapply(batsmen_list, nrow)

batsmen_df <- data.table::rbindlist(batsmen_list) %>% as.data.frame

batsmen_df_clean <- sqldf("select * from batsmen_df where link like '%/ci/content/player%' ")

# Extend the table with a numeric player ID

# Extract the first number found in a string
numextract <- function(string){
  str_extract(string, "\\-*\\d+\\.*\\d*")
}

batsmen_df_clean$PlayerID <- as.numeric(numextract(batsmen_df_clean$link))

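# Pull the player name out of the text that precedes "profile" in each link
batsmen_df_clean$PlayerName <- sub(".*> *(.*?) *profile.*", "\\1", batsmen_df_clean$link)

# Drop rows where no name was extracted (the value still contains the raw link);
# a logical index keeps all rows when there is nothing to drop
batsmen_df_clean <- batsmen_df_clean[!grepl("http", batsmen_df_clean$PlayerName), ]
# Drop the raw link column and remove duplicate players
batsmen_df_clean <- batsmen_df_clean[-1]
batsmen_df_clean <- unique(batsmen_df_clean)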

batsmen_Overall <- batsmen_df_clean
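
To see what the two extraction steps do, they can be tried on a single string. This is only a sketch: the input line is hypothetical (I made it up to have the same general shape, a site prefix followed by an anchor whose text ends in "profile"), and it uses base R's regmatches in place of stringr's str_extract, with the same pattern as numextract. perl = TRUE is set so both patterns run on the PCRE engine.

```r
# Hypothetical input line, invented for illustration; real scraped lines may differ.
sample_link <- paste0(
  "http://stats.espncricinfo.com",
  '<a href="/ci/content/player/11111.html">AB Example profile</a>'
)

# First number in the string (same pattern as numextract).
id_match <- regmatches(
  sample_link,
  regexpr("\\-*\\d+\\.*\\d*", sample_link, perl = TRUE)
)
player_id <- as.numeric(id_match)

# Text between the last ">" that precedes "profile" and the word "profile" itself.
player_name <- sub(".*> *(.*?) *profile.*", "\\1", sample_link, perl = TRUE)

print(player_id)
print(player_name)
```

Because the number pattern takes the first run of digits it finds, it relies on the player ID being the first number in the line.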

The final batsmen_Overall provides a listing of up to 2500 batsmen with their profile IDs, which can be used in the main code.
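
The comments in the code mention that classes 1, 2 and 3 select the top Test, ODI and T20 batsmen. A small helper makes it easier to rerun the same loop for those classes. This is a sketch only: build_stats_url is a hypothetical name of my own, and it simply reuses the URL pattern from the loop without fetching anything.

```r
# Hypothetical helper: rebuilds the results URL used in the loop.
# class_id: 11 = all formats combined, 1 = Tests, 2 = ODI, 3 = T20
# (as described in the code comments).
build_stats_url <- function(class_id, page) {
  sprintf(
    "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=%d;page=%d;template=results;type=batting",
    as.integer(class_id),
    as.integer(page)
  )
}

# URLs for the first three pages of Test batsmen; nothing is downloaded here.
test_urls <- build_stats_url(1, 1:3)
print(test_urls)
```

Each URL could then be passed to readLines in place of the hard-coded one, giving one player table per format.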
