Adventures with R - Cricket Analysis (creating a Player Database)
One problem I ran into in the main series was that I could not create a comprehensive player database. That part of the code was fairly manual.
I kept noodling and tinkering around to get an approach for creating a comprehensive player database. This would replace the manual approach of extracting one player link at a time.
I had initially set out to use the excellent rvest package but ran into some issues trying to decipher the XPath that is required to make the link work. I believe that the player information is coded directly as HTML tags on Cricinfo, and it would have taken me a couple of XPath loops to get the player name and then the player profile id out. Definitely doable, but I will keep rvest for another piece of code I have in mind.
I focused on using dplyr and tidyr to do my heavy lifting.
The code can be found HERE
Code walkthrough. I am reproducing the first part of the code; the main code is available for everyone to look at.
library(stringr)
library(sqldf)
library(dplyr)
library(tidyr)
batsmen_list <- list()
# Hard-coding to 50 pages. Each page has 50 players, so we will have a database of the top 2500 players of all time
# The class parameter selects the list, and other values can be used to generate more players:
# Class 11 is all Test/ODI/T20 combined, Class 1 is top batsmen for Tests, Class 2 is top batsmen for ODI, Class 3 is top batsmen for T20
# First run is for overall top batsmen
for (i in 1:50) {
  print(i)
  # Build the URL for page i of the overall batting list
  main <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;page="
  page <- i
  main1 <- ";template=results;type=batting"
  url <- sprintf("%s%s%s", main, page, main1)
  # Read the raw HTML, then drop surrounding whitespace and empty lines
  lines <- readLines(url)
  lines <- trimws(lines)
  lines <- lines[lines != '']
  # Keep only the lines that contain a link
  link <- grep('href="', lines, value = TRUE) %>%
    gsub('.*?href=\"(*ci/content/player/*)\">', '\\1', .)
  link <- paste0('http://stats.espncricinfo.com', link)
  batsmen_list[[i]] <- data.frame(link = link, stringsAsFactors = FALSE)
}
# Check how many links were collected per page
lapply(batsmen_list, nrow)
# rbindlist() is called through data.table::, so the data.table package must be installed
batsmen_df <- data.table::rbindlist(batsmen_list) %>% as.data.frame
# Keep only the rows that point to a player page
batsmen_df_clean <- sqldf("select * from batsmen_df where link like '%/ci/content/player%' ")
# Extending the table to have a numeric player id
numextract <- function(string) {
  str_extract(string, "\\-*\\d+\\.*\\d*")
}
batsmen_df_clean$PlayerID <- as.numeric(numextract(batsmen_df_clean$link))
batsmen_df_clean$PlayerName <- sub(".*> *(.*?) *profile.*", "\\1", batsmen_df_clean$link)
# Drop rows where no name was extracted (the value still contains the URL)
# grepl() is used because -grep() would drop every row when nothing matches
batsmen_df_clean <- batsmen_df_clean[!grepl("http", batsmen_df_clean$PlayerName), ]
# Drop the raw link column, keeping PlayerID and PlayerName
batsmen_df_clean <- batsmen_df_clean[-c(1)]
batsmen_df_clean <- unique(batsmen_df_clean)
batsmen_Overall <- batsmen_df_clean
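The id and name extraction can be tried offline before hitting the site. Below is a minimal sketch in base R that pulls both values out in one pass with capture groups. The sample lines, the names and ids in them, and the exact shape of the anchor tag are my assumptions for illustration; the real page markup may differ, so the pattern would need adjusting.

```r
# Hypothetical sample lines (made-up names and ids); real markup may differ
sample_lines <- c(
  '<a href="/ci/content/player/12345.html" class="data-link">A Batsman</a>',
  '<a href="/ci/content/team/6.html">Some Team</a>',
  '<a href="/ci/content/player/67890.html" class="data-link">B Bowler</a>'
)

# Group 1 captures the numeric id, group 2 the link text, for player links only
pattern <- 'href="/ci/content/player/([0-9]+)\\.html"[^>]*>([^<]+)</a>'
matches <- regmatches(sample_lines, regexec(pattern, sample_lines))

# A successful match has 3 elements: the full match plus the two groups
matched <- Filter(function(m) length(m) == 3, matches)

players <- data.frame(
  PlayerID   = as.numeric(vapply(matched, `[`, character(1), 2)),
  PlayerName = trimws(vapply(matched, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)
print(players)
```

Lines that are not player links simply produce no match, so no separate filtering step is needed in this version.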
The final batsmen_Overall provides a listing of 2500 batsmen with their profile ids, which can be used in the main code.
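The same loop can be pointed at the other lists mentioned in the code comments (class 1 for Tests, 2 for ODI, 3 for T20) by turning the class into a parameter. Here is a rough sketch of how the URLs could be built; build_stats_url is a hypothetical helper of mine, not part of the main code, and it only builds the strings without fetching anything.

```r
# Hypothetical helper: builds the stats URL for a given class and page,
# following the same pattern as the hard-coded URL in the loop
build_stats_url <- function(class_id, page) {
  sprintf(
    "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=%d;page=%d;template=results;type=batting",
    as.integer(class_id), as.integer(page)
  )
}

# Class values as described in the code comments
classes <- c(Overall = 11, Tests = 1, ODI = 2, T20 = 3)

# First two pages for each class, as a named list of character vectors
urls <- lapply(classes, function(cl) build_stats_url(cl, 1:2))
print(urls$Tests)
```

Each vector in urls could then be fed through the same readLines and cleaning steps to produce one table per format.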