Using a loop to cycle through API calls in R

I have created a simple script to collect a YouTube channel's statistics. How could I loop through a list of channel IDs instead of manually changing the channel ID each time and re-running the script? I struggle to understand how to write loops in R.
key <- 'MyKey'
channel_id1 <- 'UCLSWNf28X3mVTxTT3_nLCcw'
url <- 'https://www.googleapis.com/youtube/v3/channels?part=statistics'
y <- paste0(url,'&id=',channel_id1,'&key=',key)
yt_channel1 <- fromJSON(txt=y)
yt_d_channel1 <- as.data.frame(do.call(c, unlist(yt_channel1, recursive=FALSE)))
Is there any way to store all channel IDs of interest in a list or vector and then loop through them, storing the results in a new data frame or the same one?
i.e.
channels <- c('UCLSWNf28X3mVTxTT3_nLCcw', 'UCLSW467236VTxTT3_nLCcw', 'UHJKHS328787_ndncp')
for (i in 1:3) {
  # channels[i] ...
  # do stuff
}
Any help is greatly appreciated.

Yes, store the channel IDs in a column in a data frame. Assuming you have a data frame called my_data_frame with a column ID that contains the IDs, you can loop through the IDs like this:
library(jsonlite) # provides fromJSON()
key <- 'MyKey'
url <- 'https://www.googleapis.com/youtube/v3/channels?part=statistics'
for (i in seq_len(nrow(my_data_frame))) {
  # Build the request URL for the i-th channel ID
  y <- paste0(url, '&id=', my_data_frame$ID[i], '&key=', key)
  yt_channel1 <- fromJSON(txt = y)
  yt_d_channel1 <- as.data.frame(do.call(c, unlist(yt_channel1, recursive = FALSE)))
}
Notice how the ID is referenced using an index i which will count from 1 until the number of rows in your data frame.
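Note that this code is incomplete as written: yt_d_channel1 is overwritten on every iteration, so only the last channel's result survives. You will still need a way to store each result, for example in a list.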
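One way to store the results is to pre-allocate a list and fill one element per channel. The following is a minimal sketch under stated assumptions, not a tested YouTube client: fetch_stats() is a hypothetical stand-in for the fromJSON() call so that the example runs without a network connection or an API key.

```r
# Channel IDs taken from the question
channels <- c("UCLSWNf28X3mVTxTT3_nLCcw", "UCLSW467236VTxTT3_nLCcw")
key <- "MyKey"
url <- "https://www.googleapis.com/youtube/v3/channels?part=statistics"

# Hypothetical stand-in for fromJSON(txt = y) plus the flattening step:
# it only echoes the request URL back as a one-row data frame.
fetch_stats <- function(request_url) {
  data.frame(request_url = request_url, stringsAsFactors = FALSE)
}

# Pre-allocate one slot per channel and name the slots by channel ID
results <- vector("list", length(channels))
names(results) <- channels

for (i in seq_along(channels)) {
  y <- paste0(url, "&id=", channels[i], "&key=", key)
  results[[i]] <- fetch_stats(y)  # store instead of overwriting
}

# Bind all per-channel rows into a single data frame
all_stats <- do.call(rbind, results)
```

With a real API call in place of fetch_stats(), results[["UCLSWNf28X3mVTxTT3_nLCcw"]] would hold that channel's data frame, and all_stats would hold one row per channel, provided every response flattens to the same columns.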

Related

Optimizing data scrape from NHL API using R

I am a novice with R and a total newbie with the NHL API. I wrote an R program to extract all of the goals recorded in the NHL's data repository accessed through the NHL API using the R "nhlapi" package. I have code that works, but it's ugly and slow, and I wanted to see if anyone has suggestions for improving it. I am using the nhl_games_feed function provided by nhlapi to pull all events, from which I select the goals. This function returns a JSON blob (list of lists of lists of lists ...) in R, which I want to convert into a proper data.table.
I pasted a stripped-down version of my code below. I understand that normal practice here would be to include a sample data blob with the code so that other users can recreate my problems, but the data blob is the problem.
When I ran the full version of my code last night, the "Loop through games" portion took about 11 hours, and the "Convert players list to columns" took about 2 hours. Unless I can find a way to push the column or row filtering into the NHL's system, I don't think I am likely to find a way to speed up the "Loop through games" portion. So my first question: Does anyone have any thoughts about how to extract a subset of columns or rows using the NHL API, or do I need to pull everything and process it on my end?
My other question relates to the second chunk of code ("Convert players..."), which converts the resulting event data into a single row of scalar elements per event. The event data shows up in lblob_feed[[1]]$liveData$plays$allPlays, which contains one row of scalar elements per event, except that one of the elements is ..$allPlays$players, which is itself a 4x5 dataframe. As a result, the only way that I could find to extract that data into scalar elements is the "Convert players..." loop. Is there a better way to convert this into a simple data.table?
Finally, any tips on other ways to end up with a comprehensive database of NHL events?
require("nhlapi")
require("data.table")
require("tidyverse")
require("hms")
assign("last.warning", NULL, envir = baseenv())
# create small list of selected games, using NHL API game code format
cSelGames <- c(2021020001, 2021020002, 2021020003, 2021020004)
liNumGames <- length(cSelGames)
print(liNumGames)
# 34370 games in the full database
# =============================================================================
# Loop through games
# Pull data for one game per call
Sys.time()
dtGoals <- data.table()
for (liGameNum in 1:liNumGames) {
  # Pull the NHL feed blob for one selected game
  # 11 hours in the full version
  lblob_feed <- nhl_games_feed(gameId = cSelGames[liGameNum])
  # Select only the play (event) portion of the feed blob
  ldtFeed <- as.data.table(c(lblob_feed[[1]]$gamePk, lblob_feed[[1]]$liveData$plays$allPlays))
  setnames(ldtFeed, 1, "gamePk")
  # Check for games with no play data - 1995020006 has none and would kill execution
  if ('result.eventCode' %in% colnames(ldtFeed)) {
    # Check for missing elements in allPlays list
    # team.triCode is missing for at least one game, probably for all-star games
    if (!('team.triCode' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (team.triCode = NA)]}
    if (!('result.strength.code' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.strength.code = NA)]}
    if (!('result.emptyNet' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.emptyNet = NA)]}
    # Select the events and columns for the output data table
    ldtGoals_new <- ldtFeed[(result.eventTypeId == 'GOAL')
                            , list(gamePk, result.eventCode, players, result.description
                                   , team.triCode, about.period, about.periodTime
                                   , about.goals.away, about.goals.home, result.strength.code, result.emptyNet)]
    # Append the incremental data table to the aggregate data table
    dtGoals <- rbindlist(list(dtGoals, ldtGoals_new), use.names=TRUE, fill=TRUE)
  }
}
# =============================================================================
# Convert players list to columns
# 2 hours in the full version
# 190686 goals in full table
# For each goal, the player dataframe is 4x5
Sys.time()
dtGoal_player <- data.table()
for (i in 1:dtGoals[,.N]) {
  # convert rows with embedded dataframes into multiple rows with scalar elements
  s_result.eventCode <- dtGoals[i,result.eventCode]
  dtGoal_player_new <- as.data.table(dtGoals[i,players[[1]]])
  dtGoal_player_new[, ':=' (result.eventCode=s_result.eventCode)]
  dtGoal_player <- rbindlist(list(dtGoal_player, dtGoal_player_new), use.names=TRUE, fill=TRUE)
}
# drop players element
dtGoals[, players:=NULL]
# clean up problem with duplicated rows with playerType=Assist
dtGoal_player[, lag.playerType:=c('nomatch', playerType[-.N]), by=result.eventCode]
dtGoal_player[, playerType2:=ifelse((playerType==lag.playerType),'Assist2',playerType)]
# transpose multiple rows per event into single row with multiple columns for playerType
dtGoal_player_t <- dcast.data.table(dtGoal_player, result.eventCode ~ playerType2
, value.var='player.id', fun.aggregate=max)
# =============================================================================
# Merge players data into dtGoals
Sys.time()
dtGoals <- merge(dtGoals, dtGoal_player_t, by="result.eventCode")
Sys.time()
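Both loops above grow a table by binding one small piece onto it per iteration, which copies the accumulated table every time. A common alternative is to collect the pieces in a pre-allocated list and bind them once at the end. The sketch below shows that pattern in base R on made-up data; the event codes and player values are hypothetical, and with data.table the final step would be a single rbindlist() call on the list instead of do.call(rbind, ...).

```r
# Hypothetical stand-ins for dtGoals: one event code per goal and one
# small per-goal data frame of players (values are invented).
event_codes <- c("TOR1", "TOR2")
players <- list(
  data.frame(player.id = c(101L, 102L),
             playerType = c("Scorer", "Assist"),
             stringsAsFactors = FALSE),
  data.frame(player.id = c(103L, 104L, 105L),
             playerType = c("Scorer", "Assist", "Goalie"),
             stringsAsFactors = FALSE)
)

# Pre-allocate one slot per goal instead of growing a table in the loop
pieces <- vector("list", length(players))

for (i in seq_along(players)) {
  piece <- players[[i]]
  piece$result.eventCode <- event_codes[i]  # tag rows with their event
  pieces[[i]] <- piece
}

# Bind all pieces in one step after the loop
goal_players <- do.call(rbind, pieces)
print(goal_players)
```

This does not reduce the time spent waiting on the API itself; it only addresses the cost of the repeated appends.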

Inputting df frame value into GET function web query

I'm trying to input a list of values from a data frame into my get function for the web query and then cycle through each iteration as I go. If somebody would be able to link me some further resources to read and learn from this, it would be appreciated.
The following is the code which draws the data names from the API server. I plan on using purrr iteration functions to go over it. The input from the list would be inserted in the variable name RFG_SELECT.
library(httr)
library(purrr)
## Call up Query Development Script
## Calls up every single rainfall data gauge across the entirety of QLD
wmip_callup <- GET('https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_site_list","version":"1","params":{"site_list":"MERGE(GROUP(MGR_OFFICE_ALL,AYR),GROUP(MGR_OFFICE_ALL,BRISBANE),GROUP(MGR_OFFICE_ALL,BUNDABERG),GROUP(MGR_OFFICE_ALL,MACKAY),GROUP(MGR_OFFICE_ALL,MAREEBA),GROUP(MGR_OFFICE_ALL,ROCKHAMPTON),GROUP(MGR_OFFICE_ALL,SOUTH_JOHNSTONE),GROUP(MGR_OFFICE_ALL,TOOWOOMBA))"}}')
# Turns API server data into JSON data.
wmip_dataf <- content(wmip_callup, type = 'application/json')
# Returns the values of the rainfall gauge site names and is the directory function.
list_var <- wmip_dataf[["_return"]][["sites"]]
# Combines all of the rainfall gauge data together in a list (could be used for giving file names / looping the data).
rfg_bind <- do.call(rbind.data.frame, list_var)
# Sets the column name of the combination data frame.
rfg_bind <- setNames(rfg_bind, "Rainfall Gauge Name")
rfg_select <- rfg_bind$`Rainfall Gauge Name`
# Attempts to filter list into query:
wmip_input <- GET('https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":**rfg_select**,"datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}')
Hey there,
After some work I've found a solution using string substitution.
I set up a dummy placeholder (varinput) in the URL, which is then replaced with the selected site value.
# Dummy Variable string:
wmip_url <- 'https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":"varinput","datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}'
# Dummy string, grabs one value from the list.
rfg_individual <- rfg_select[2]
# Replaces the specified input
rfg_replace <- gsub("varinput", rfg_individual, wmip_url)
# Result
"https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{\"function\":\"get_ts_traces\",\"version\":\"1\",\"params\":{\"site_list\":\"001203A\",\"datasource\":\"AT\",\"varfrom\":\"10\",\"varto\":\"10\",\"start_time\":\"0\",\"end_time\":\"0\",\"data_type\":\"mean\",\"interval\":\"day\",\"multiplier\":\"1\"}}"
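The same substitution can be applied to every site at once rather than to a single element. The sketch below assumes rfg_select is a character vector of site IDs; "001203A" comes from the result above, while "001204B" is a made-up second ID for illustration. It only builds the URLs and does not contact the server.

```r
# Template URL with the "varinput" placeholder, as in the answer above
wmip_url <- 'https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":"varinput","datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}'

# Assumed input: "001203A" is from the original post, "001204B" is hypothetical
rfg_select <- c("001203A", "001204B")

# Build one URL per site; fixed = TRUE treats "varinput" as a literal string
urls <- vapply(
  rfg_select,
  function(site) gsub("varinput", site, wmip_url, fixed = TRUE),
  character(1),
  USE.NAMES = FALSE
)

print(length(urls))

# Each URL could then be requested in turn, for example:
# responses <- lapply(urls, httr::GET)
```

Since purrr is already loaded in the question, purrr::map(urls, GET) would be an equivalent way to run the requests.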

R get the original index of data frame after subsetting

Is it possible to get the original index of a data frame after subsetting? It is being stored somewhere, but I am not sure where or how to access it. I understand that there is a better solution if this is part of the algorithm design. I am just curious whether anyone knows if it is possible.
Example Scenario:
df = data.frame(atr1=integer(),atr2=integer())
for(i in 1:10) {
df <- rbind(df,data.frame(atr1=as.integer(i),atr2=as.integer(i)))
}
View(df)
Note that the far left side of the output of the View function in RStudio will show the index (I am not sure how to post an image that only exists on my local machine).
Create a data frame by taking subset of that original data frame:
df_subset <- df[which(df$atr1 > 4),]
View(df_subset)
The output of the View function doesn't index df_subset 1 to 6 as you would access them. The original indices 5 to 10 are maintained. I am curious if it is possible to access those indices in some fashion similar to:
df_subset[index,]$<some hidden attribute>
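The index shown by View is the data frame's row names, which subsetting carries over from the original. A minimal sketch, assuming the original data frame has the default row names (1, 2, 3, ...); if custom row names were set, they would no longer correspond to positions.

```r
# Same shape as the example above, built directly instead of in a loop
df <- data.frame(atr1 = 1:10, atr2 = 1:10)
df_subset <- df[which(df$atr1 > 4), ]

# Row names are kept as character strings, so convert them back to integers
original_index <- as.integer(rownames(df_subset))
print(original_index)

# Original index of one particular row of the subset, here the second row
second_row_index <- as.integer(rownames(df_subset)[2])
print(second_row_index)
```

If the original position matters to the algorithm, storing it in an explicit column before subsetting is more robust than relying on row names.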

R – Using a loop on a list of Twitter handles to extract tweets and create multiple data frames

I’ve got a df that consists of Twitter handles that I wish to scrape on a regular basis.
df=data.frame(twitter_handles=c("#katyperry","#justinbieber","#Cristiano","#BarackObama"))
My Methodology
I would like to run a for loop that loops over each of the handles in my df and creates multiple dataframes:
1) By using the rtweet library, I would like to gather tweets using the search_tweets function.
2) Then I would like to merge the new tweets to existing tweets for each dataframe, and then use the unique function to remove any duplicate tweets.
3) For each dataframe, I'd like to add a column with the name of the Twitter handle used to obtain the data. For example: For the database of tweets obtained using the handle #BarackObama, I'd like an additional column called Source with the handle #BarackObama.
4) In the event that the API returns 0 tweets, I would like Step 2) to be ignored. Very often, when the API returns 0 tweets, I get an error as it attempts to merge an empty dataframe with an existing one.
5) Finally, I would like to save the results of each scrape to the different dataframe objects. The name of each dataframe object would be its Twitter handle, in lower case and without the #
My Desired Output
My desired output would be 4 dataframes, katyperry, justinbieber, cristiano & barackobama.
My Attempt
library(rtweet)
library(ROAuth)
#Accessing Twitter API using my Twitter credentials
key <-"yKxxxxxxxxxxxxxxxxxxxxxxx"
secret <-"78EUxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
setup_twitter_oauth(key,secret)
#Dataframe of Twitter handles
df=data.frame(twitter_handles=c("#katyperry","#justinbieber","#Cristiano","#BarackObama"))
# Setting up the query
query <- as.character(df$twitter_handles)
query <- unlist(strsplit(query,","))
tweets.dataframe = list()
# Loop through the twitter handles & store the results as individual dataframes
for(i in 1:length(query)){
result<-search_tweets(query[i],n=10000,include_rts = FALSE)
#Strip tweets that contain RTs
tweets.dataframe <- c(tweets.dataframe,result)
tweets.dataframe <- unique(tweets.dataframe)
}
However I have not been able to figure out how to include in my for loop the part which ignores the concatenation step if the API returns 0 tweets for a given handle.
Also, my for loop does not return 4 dataframes in my environment, but stores the results as a Large list
I identified a post that addresses a problem very similar to the one I face, but I find it difficult to adapt to my question.
Your inputs would be greatly appreciated.
Edit: I have added Step 3) in My Methodology, in case you are able to help with that too.
tweets.dataframe = list()
# Loop through the twitter handles & store the results as individual dataframes
for(i in 1:length(query)){
  result<-search_tweets(query[i],n=10,include_rts = FALSE)
  if (nrow(result) > 0) { # only if result has data
    tweets.dataframe <- c(tweets.dataframe, list(result))
  }
}
# tweets.dataframe is now a list where each element is a data frame containing
# the results from an individual query; for example...
tweets.dataframe[[1]]
# to combine them into one data frame
do.call(rbind, tweets.dataframe)
in response to a reply...
twitter_handles <- c("#katyperry","#justinbieber","#Cristiano","#BarackObama")
# Loop through the twitter handles & store the results as individual dataframes
for(handle in twitter_handles) {
  result <- search_tweets(handle, n = 15 , include_rts = FALSE)
  # Step 4: skip handles for which the API returned no tweets
  if (nrow(result) == 0) next
  result$Source <- handle
  # Step 5: name is the handle without the leading # and in lower case
  df_name <- tolower(substring(handle, 2))
  if(exists(df_name)) {
    assign(df_name, unique(rbind(get(df_name), result)))
  } else {
    assign(df_name, result)
  }
}
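An alternative to creating separate objects with assign() is to keep the results in one named list, which avoids the exists()/get() bookkeeping. The sketch below follows the steps in the question but is only an illustration: fake_search() is a hypothetical stand-in for search_tweets() so that the code runs without Twitter credentials, and it deliberately returns zero rows for one handle.

```r
twitter_handles <- c("#katyperry", "#Cristiano", "#BarackObama")

# Hypothetical stand-in for search_tweets(): returns an empty data frame
# for "#Cristiano" and one invented row for every other handle.
fake_search <- function(handle) {
  if (handle == "#Cristiano") {
    return(data.frame(text = character(0), stringsAsFactors = FALSE))
  }
  data.frame(text = paste("tweet mentioning", handle), stringsAsFactors = FALSE)
}

tweets_by_handle <- list()

for (handle in twitter_handles) {
  result <- fake_search(handle)
  if (nrow(result) == 0) next                  # step 4: skip empty results
  result$Source <- handle                      # step 3: record the handle
  df_name <- tolower(substring(handle, 2))     # step 5: lower case, no "#"
  previous <- tweets_by_handle[[df_name]]      # NULL on the first run
  tweets_by_handle[[df_name]] <- unique(rbind(previous, result))  # step 2
}

print(names(tweets_by_handle))
```

Individual data frames are then available as tweets_by_handle$katyperry and so on, and the whole list can be saved or reloaded as one object between scrapes.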

Using a loop to create multiple data frames in R

I have this function that returns a data frame of JSON data from the NBA stats website. The function takes in the game ID of a certain game and returns a data frame of the halftime box score for that game.
getstats<- function(game=x){
for(i in game){
url<- paste("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&
EndRange=14400&GameID=",i,"&RangeType=2&Season=2015-16&SeasonType=
Regular+Season&StartPeriod=1&StartRange=0000",sep = "")
json_data<- fromJSON(paste(readLines(url), collapse=""))
df<- data.frame(json_data$resultSets[1, "rowSet"])
names(df)<-unlist(json_data$resultSets[1,"headers"])
}
return(df)
}
So what I would like to do with this function is take a vector of several game ID's and create a separate data frame for each one. For example:
gameids<- as.character(c(0021500580:0021500593))
I would want to take the vector "gameids", and create fourteen data frames. If anyone knew how I would go about doing this it would be greatly appreciated! Thanks!
You can save your data.frames into a list by setting up the function as follows:
getstats <- function(games) {
  listofdfs <- list() # Create a list in which you intend to save your df's.
  for (i in seq_along(games)) { # Loop through the positions of the IDs instead of the IDs
    # You are going to use games[i] instead of i to get the ID
    # paste0() with separate pieces keeps line breaks out of the URL
    url <- paste0("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10",
                  "&EndRange=14400&GameID=", games[i],
                  "&RangeType=2&Season=2015-16&SeasonType=Regular+Season",
                  "&StartPeriod=1&StartRange=0000")
    json_data <- fromJSON(paste(readLines(url), collapse = ""))
    df <- data.frame(json_data$resultSets[1, "rowSet"])
    names(df) <- unlist(json_data$resultSets[1, "headers"])
    listofdfs[[i]] <- df # save your dataframes into the list
  }
  return(listofdfs) # Return the list of dataframes.
}
# sprintf() keeps the leading zeros that as.character(0021500580:0021500593) would drop
gameids <- sprintf("%010d", 21500580:21500593)
getstats(games = gameids)
Please note that I could not test this because the URLs do not seem to be working properly. I get the connection error below:
Error in file(con, "r") : cannot open the connection
Adding to Abdou's answer, you could create dynamic data frames to hold results from each gameID using the assign() function
for(i in 1:length(games)){ #Loop through the numbers of ID's instead of the ID's
#You are going to use games[i] instead of i to get the ID
url<- paste("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&
EndRange=14400&GameID=",games[i],"&RangeType=2&Season=2015-16&SeasonType=
Regular+Season&StartPeriod=1&StartRange=0000",sep = "")
json_data<- fromJSON(paste(readLines(url), collapse=""))
df<- data.frame(json_data$resultSets[1, "rowSet"])
names(df)<-unlist(json_data$resultSets[1,"headers"])
# create a data frame to hold results
assign(paste('X',i,sep=''),df)
}
The assign function will create as many data frames as there are game IDs. They will be labelled X1, X2, X3, ..., Xn. Hope this helps.
Use lapply (or sapply) to apply a function to each element of a list or vector and get the results back as a list. So if you have a vector of several game IDs and a function that does what you want for one ID, you can use lapply to get a list of data frames (since your function returns df).
I haven't been able to test your code (I got an error with the function you provided), but something like this should work :
library(RJSONIO)
# Define the function first so that lapply() can find it
getstats <- function(game) {
  url <- paste0("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&EndRange=14400&GameID=",
                game,
                "&RangeType=2&Season=2015-16&SeasonType=Regular+Season&StartPeriod=1&StartRange=0000")
  json_data <- fromJSON(paste(readLines(url), collapse = ""))
  df <- data.frame(json_data$resultSets[1, "rowSet"])
  names(df) <- unlist(json_data$resultSets[1, "headers"])
  return(df)
}
# sprintf() keeps the leading zeros that as.character(0021500580:0021500593) would drop
gameids <- sprintf("%010d", 21500580:21500593)
df_list <- lapply(gameids, getstats)
df_list will contain one data frame per ID you provided in gameids.
Just use lapply again for additional data processing, including saving the data frames to disk.
data.table is a nice package if you have to deal with a ton of data. In particular, rbindlist lets you rbind all the data tables (or data frames) contained in a list into a single one if needed (split will do the reverse).
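The lapply, combine, and split steps described above can be sketched end to end in base R. This is an illustration under stated assumptions: fake_getstats() is a hypothetical stand-in for getstats() that returns two placeholder rows per game instead of calling the NBA site, and do.call(rbind, ...) is used in place of rbindlist so that no extra package is needed.

```r
# Game IDs as zero-padded strings (three of the IDs from the question)
gameids <- sprintf("%010d", 21500580:21500582)

# Hypothetical stand-in for getstats(): two placeholder rows per game
fake_getstats <- function(game) {
  data.frame(GAME_ID = game, row_in_game = 1:2, stringsAsFactors = FALSE)
}

# One data frame per game, named by game ID for easy lookup
df_list <- lapply(gameids, fake_getstats)
names(df_list) <- gameids

# Combine everything into a single data frame ...
combined <- do.call(rbind, df_list)

# ... and split it back into one data frame per game
by_game <- split(combined, combined$GAME_ID)

print(names(by_game))
```

Naming the list elements means a single game can be retrieved with df_list[["0021500580"]] rather than by position.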

Resources