Bind Rows For Loop in R Randomly Stops Working? - r

I am scraping NBA play-by-play data using the play_by_play function in the nbastatR package. The problem is that this function only collects data for one game ID at a time, and there are 1230 game IDs in a complete season. When I enter more than 15 game IDs in the play_by_play function, R just keeps loading and showing the wheel of death forever.
I tried to get around this by making a for loop which binds each game ID's data to one cumulative data frame. However, I run into the same problem: R loads endlessly around the 16th game, which is very peculiar. I could clean the data in the loop and try that out (I do not need all play-by-play data, just every shot from the season), but does anyone know why this is happening and how/if I could get around it?
Thanks
full<- play_by_play(game_ids = 21400001, nest_data = F, return_message = T)
for(i in 21400002:21400040){
data <- play_by_play(game_ids = c(i), nest_data = F, return_message = F)
full <- bind_rows(full,data)
cat(i)
}
This code stops working at around the 16th game ID. I tried using bind_rows from dplyr, but that did not help at all.

Try this [untested code as I don't have your data]:
full <- lapply(
21400001:21400040,
function(i) {
play_by_play(game_ids = c(i), nest_data = F, return_message = F)
}
) %>%
bind_rows()
You can get more information on lazy evaluation here.
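A minimal sketch of the same collect-then-bind pattern with a guard around each request, so that one failing game ID does not abort the whole run. `fetch_game` and `safe_fetch` are hypothetical names; `fetch_game` is only a stand-in for `play_by_play` (it fabricates a tiny data frame and fails for one ID), so the example runs without nbastatR or network access:

```r
# Hypothetical stand-in for play_by_play(); replace with the real call.
fetch_game <- function(id) {
  if (id == 21400003) stop("simulated request failure")
  data.frame(game_id = id, event = 1:2)
}

# Wrap each request so an error yields NULL instead of stopping the run.
safe_fetch <- function(id) {
  tryCatch(fetch_game(id), error = function(e) {
    message("Skipping game ", id, ": ", conditionMessage(e))
    NULL
  })
}

ids <- 21400001:21400005
pieces <- lapply(ids, safe_fetch)

# Drop failed requests, then bind once at the end.
pieces <- Filter(Negate(is.null), pieces)
full <- do.call(rbind, pieces)
```

With dplyr loaded, `bind_rows(pieces)` can replace the `do.call(rbind, pieces)` line. If the hang comes from the server throttling rapid requests (an assumption, not something confirmed in the question), a `Sys.sleep()` between calls may also be worth trying.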

Related

efficient data collection from API using R

I am trying to get data from the UN Stats API for a list of indicators (https://unstats.un.org/SDGAPI/swagger/).
I have constructed a loop that can be used to get the data for a single indicator (code is below). The loop can be applied to multiple indicators as needed. However, this is likely to cause problems relating to large numbers of requests, potentially being perceived as a DDoS attack and taking far too long.
Is there an alternative way to get data for an indicator for all years and countries without making a ridiculous number of requests or in a more efficient manner than below? I suppose this question likely applies more generally to other similar APIs as well. Any help would be most welcome.
Please note: I have seen the post here (Faster download for paginated nested JSON data from API in R?) but it is not quite what I am looking for.
Minimal working example
# libraries
library(jsonlite)
library(dplyr)
library(purrr)
# get the meta data
page = ("https://unstats.un.org/SDGAPI//v1/sdg/Series/List")
sdg_meta = fromJSON(page) %>% as.data.frame()
# parameters
PAGE_SIZE =100000
N_PAGES = 5
FULL_DF = NULL
my_code = "SI_COV_SOCINS"
# loop to go over pages
for(i in seq(1,N_PAGES,1)){
ind = which(sdg_meta$code == my_code)
cat(paste0("Processing : ", my_code, " ", i, " of ",N_PAGES, " \n"))
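# "&" separates query parameters; format() keeps 100000 from being pasted as "1e+05"
my_data_page <- paste0("https://unstats.un.org/SDGAPI/v1/sdg/Series/Data?seriesCode=",my_code,"&page=",i,"&pageSize=",format(PAGE_SIZE, scientific = FALSE))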
df <- fromJSON(my_data_page) #depending on the data you are calling, you will get a list
df= df$data %>% as.data.frame() %>% distinct()
# break the loop when no more to add
if(is_empty(df)){
break
}
FULL_DF = rbind(FULL_DF,df)
Sys.sleep(5) # sleep to avoid any issues
}
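The URL construction and the accumulation step in the loop above can be sketched separately. This is only an illustration under stated assumptions: `build_page_url` is a hypothetical helper, and the `pages` list holds toy data frames standing in for the `df$data` that `fromJSON()` would return, so nothing here contacts the API:

```r
# Hypothetical helper: build the URL for one page of results.
# Query parameters are joined with "&"; format() avoids "1e+05" for 100000.
build_page_url <- function(series_code, page, page_size) {
  base <- "https://unstats.un.org/SDGAPI/v1/sdg/Series/Data"
  paste0(base,
         "?seriesCode=", series_code,
         "&page=", page,
         "&pageSize=", format(page_size, scientific = FALSE))
}

urls <- vapply(1:3, build_page_url, character(1),
               series_code = "SI_COV_SOCINS", page_size = 100000)

# Collect pages in a list and bind once, instead of growing a data frame
# with rbind() on every iteration. Toy pages stand in for real responses.
pages <- list(
  data.frame(geoAreaCode = c(4, 8), value = c(1.5, 2.5)),
  data.frame(geoAreaCode = 12, value = 3.5)
)
full_df <- unique(do.call(rbind, pages))
```

Binding once at the end avoids copying the accumulated data frame on every iteration, which matters more as the number of pages grows.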

R create a loop for a list of data frames

I currently have a list made up of 80+ data frames. I would like to loop a chunk of code over each data frame within the list, without naming each one individually or splitting them into individual data frames to work on.
Currently I split the list into each individual data frame using the below code:
dat5split <- setNames(split(dat5, dat5$CODE), paste0("df", unique(dat5$CODE)))
list2env(dat5split, globalenv())
I then work through each data frame individually:
# call in SPC function and write to 'results10000'
results10000<-SPC_XBAR(df10000,vol_n,seasonality)
results10000 = results10000 %>%
cbind(Spec = df10000$CODE) %>%
subset(`table_n` == 1)
results10000 <- results10000[order(results10000$tpd),]
results10000$Date <- as.Date(cbind(Date = df10000$CENSUS_DATE))
# call in SPC function and write to 'results10001'
results10001<-SPC_XBAR(df10001,vol_n,seasonality)
results10001 = results10001 %>%
cbind(Spec = df10001$CODE) %>%
subset(`table_n` == 1)
results10001 <- results10001[order(results10001$tpd),]
results10001$Date <- as.Date(cbind(Date = df10001$CENSUS_DATE))
The script calls the function SPC_XBAR, with vol_n and seasonality set earlier in the code, and assigns the results to results10000, results10001, and so on. I then do a small amount of data wrangling on each newly created data frame before writing the results back to SQL Server at the end.
As you can see each one is being individually hard coded which is not efficient.
What I would like to do is to loop a chunk of code for each individual data frame within the list, without naming each one individually.
I believe a loop would solve this issue but I am a little inexperienced when it comes to the ability to create a loop around it. Any advice would be much appreciated.
Cheers
Have you considered using lapply instead of a loop throughout the list? Check it here...
EDIT: I try to elaborate a bit more... What happens if you do this:
myFunction <- function(x) {
results<-SPC_XBAR(x,vol_n,seasonality)
results = results %>%
cbind(Spec = x$CODE) %>%
subset(`table_n` == 1)
results <- results[order(results$tpd),]
results$Date <- as.Date(cbind(Date = x$CENSUS_DATE))
results
}
lapply(dat5split, myFunction)
I would expect it to return a list of the resulting data frames, one per element of dat5split.
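A small self-contained sketch of the split-then-lapply pattern, using toy data. The `dat5` contents and the `process_one` function are made up for illustration; `process_one` merely stands in for the SPC_XBAR call plus the wrangling steps:

```r
# Toy data standing in for dat5; CODE identifies each group.
dat5 <- data.frame(
  CODE  = c(10000, 10000, 10001, 10001),
  value = c(3, 1, 4, 2)
)

# Split into a named list of data frames, one per CODE.
dat5split <- split(dat5, dat5$CODE)

# Hypothetical per-group step standing in for SPC_XBAR plus the wrangling.
process_one <- function(df) {
  df <- df[order(df$value), ]
  df$rank <- seq_len(nrow(df))
  df
}

# Apply the same function to every data frame; names are preserved.
results <- lapply(dat5split, process_one)

# Optionally stack the per-group results into one data frame.
combined <- do.call(rbind, results)
```

Because the list stays named by CODE, an individual result is still reachable as `results[["10000"]]` without creating separate objects in the global environment.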

Not sure which way of combining my loop results I should be using

To make a long story short, I'm trying to gather information on 6,500 users, so I wrote a loop. Below you can find an example with 10 artists. In this loop I'm trying to use a call to gather information on all tracks of a user.
test <- fromJSON(getURL('http://api.soundcloud.com/users/52880864/tracks?client_id=0ab2657a7e5b63b6dbc778e13c834e3d&limit=200&offset=1&linked_partitioning=1'))
This short example returns a data frame with all the tracks uploaded by a user. When I use my loop, I'd like to combine all the data frames so that I can process them with tapply. This way I can, for instance, see what the sum of all track likes is. However, two things are going wrong. First, when I run the loop, each user only shows one uploaded track. Second, I think I'm not combining the data frames properly. Could somebody please explain what I'm doing wrong?
id <- c(20376298, 63320169, 3806325, 12231483, 18838035, 117385796, 52880864, 32704993, 63975320, 95667573)
Partition1 <- paste0("'http://api.soundcloud.com/users/", id, "/tracks?client_id=0ab2657a7e5b63b6dbc778e13c834e3d&limit=200&offset=1&linked_partitioning=1'")
results <- vector(mode = "list", length = length(Partition1))
for (i in seq_along(Partition1)){
message(paste0('Query #',i))
tryCatch({
result_i <- fromJSON((getURL(str_replace_all(Partition1[i],"'",""))))
clean_i <- function(x)ifelse(is.null(x),NA,ifelse(length(x)==0,NA,x))
results[[i]] <- plyr::llply(result_i, clean_i) %>% as_data_frame
if( i == 4 ) {
stop('stop')
}
}, error = function(e){
beepr::beep(1)
}
)
Sys.sleep(0.5)
}
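For the combining step specifically, here is a minimal sketch of stacking a list of per-user data frames and summarising with tapply. The `results` list below is toy data with made-up column names (`user_id`, `track`, `likes`), not real API output:

```r
# Toy per-user results standing in for the parsed API responses.
results <- list(
  data.frame(user_id = 1, track = c("a", "b"), likes = c(10, 5)),
  data.frame(user_id = 2, track = "c", likes = 7),
  NULL  # a failed request leaves an empty slot
)

# Stack all data frames row-wise; NULL entries are ignored by rbind().
all_tracks <- do.call(rbind, results)

# Total likes per user.
likes_per_user <- tapply(all_tracks$likes, all_tracks$user_id, sum)
```

This only works when every data frame has the same columns, so each element of the list needs one row per track before binding.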

R: looping through a list of links

I have some code that scrapes data off this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.
What I want to do is cycle through every team and collect and run the manipulations on every team. I have a dataframe with every team link, like the one above.
Pseudocode:
for (link in teamlist)
{scrape, manipulate, put into a table}
However, I can't figure out how to loop through the links.
I've tried doing URL = teamlist$link[i], but I get an error when using readHTMLTable(). I have no trouble manually pasting each team's individual URL into the script; the error only occurs when I pull the URL from a table.
Current code:
library(XML)
library(gsubfn)
URL= 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
Thanks.
I agree with @ialm that you should check out the rvest package, which makes it very fun and straightforward to loop through links. I will create some example code here using similar subject matter for you to check out.
Here I am generating a list of links that I will iterate through
rm(list=ls())
library(rvest)
mainweb="http://www.basketball-reference.com/"
urls=read_html("http://www.basketball-reference.com/teams") %>% # read_html() replaces the older html()
html_nodes("#active a") %>%
html_attrs()
Now that the list of links is complete I iterate through each link and pull a table from each
teamdata=c()
j=1
for(i in urls){
bball <- read_html(paste(mainweb, i, sep="")) # read_html() replaces the older html()
teamdata[j]= bball %>%
html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$","\\1", urls[j], perl=TRUE))) %>%
html_table()
j=j+1
}
Please see the code below, which basically builds off your code and loops through two different team pages as identified by the vector team_codes. The tables are returned in a list where each list element corresponds to a team's table. However, the tables look like they will need more cleaning.
library(XML)
library(gsubfn)
Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)
for(code in team_codes) {
URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats[[j]] = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
j <- j + 1
}

Reading a CSV file in R in a Function

I have a question that I can't seem to find the answer anywhere online. I apologize if it's already been answered, but here goes. I've written a script in R that will go through the process of forecasting for me, and returning the best point forecast based on cross validation and other criteria. I'm wanting to save this script as a function, that way I don't have to use the full script every time I go to forecast. The basic set up of my script is the following:
output <- read.csv("C:/Users/data.csv", header = T)
colnames(output)
month_count = length(output[,1]) ##used in calculations throughout code
current_year = output[1,1]
current_month = output[1,2]
months = 5 #months to forecast out
m = 0
data <- ts(output[,3][c(1:(month_count-m))],
frequency = 12, start = c(current_year,current_month))
#runs all the other steps from here on
The function that I'm writing will look like this: it takes various inputs, runs the script, and prints back my forecasts.
forecastMe = function(sourcefile,months,m)
{
#runs the data prints out the result
}
The problem I'm having is that I want to be able to pass a directory and file name such as C:/Users/documents/data1.csv into the function (as the sourcefile argument) and have it picked up at this step of my R script:
output <- read.csv("C:/Users/sourcefile.csv", header = T)
I can't seem to find a way to get it to do it right. Any ideas or suggestions?
So...
function(sourcefile, etc) {
output <- read.csv(sourcefile, header = T)
etc
}
...that? I don't really see what you're asking exactly.
You were almost there. All you have to do is replace your constants with the variable names you want to pass to the function and delete your declarations you don't need anymore.
forecastMe = function(sourcefile,months,m) {
output <- read.csv(sourcefile, header = T)
colnames(output)
month_count = length(output[,1]) ##used in calculations throughout code
current_year = output[1,1]
current_month = output[1,2]
data <- ts(output[,3][c(1:(month_count-m))],
frequency = 12, start = c(current_year,current_month))
#runs all the other steps from here on
}
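A runnable sketch of the same idea, assuming a made-up three-column CSV (year, month, sales) written to a temporary file so the example is self-contained. `read_series` is a hypothetical name and only covers the file-reading and ts() steps, not the forecasting itself:

```r
# Write a small CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(year = 2020, month = 1:6, sales = c(5, 7, 6, 8, 9, 10)),
  csv_path, row.names = FALSE
)

# The path arrives as an argument and is passed straight to read.csv().
read_series <- function(sourcefile, m = 0) {
  output <- read.csv(sourcefile, header = TRUE)
  month_count <- nrow(output)
  # Keep all but the last m rows of the third column as a monthly series.
  ts(output[seq_len(month_count - m), 3],
     frequency = 12, start = c(output[1, 1], output[1, 2]))
}

series <- read_series(csv_path, m = 1)
```

The key point is that `sourcefile` is used unquoted inside the function; writing `"sourcefile.csv"` in quotes would look for a file literally named that.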
