I'm trying to feed a list of values from a data frame into my GET function for the web query and then cycle through each value in turn. If somebody could also link me some further resources to read and learn from, it would be appreciated.
The following code pulls the site names from the API server. I plan on using purrr's iteration functions to loop over them; each value from the list would be inserted where the variable rfg_select appears.
library(httr)
library(purrr)
## Call up Query Development Script
## Calls up every single rainfall data gauge across the entirety of QLD
wmip_callup <- GET('https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_site_list","version":"1","params":{"site_list":"MERGE(GROUP(MGR_OFFICE_ALL,AYR),GROUP(MGR_OFFICE_ALL,BRISBANE),GROUP(MGR_OFFICE_ALL,BUNDABERG),GROUP(MGR_OFFICE_ALL,MACKAY),GROUP(MGR_OFFICE_ALL,MAREEBA),GROUP(MGR_OFFICE_ALL,ROCKHAMPTON),GROUP(MGR_OFFICE_ALL,SOUTH_JOHNSTONE),GROUP(MGR_OFFICE_ALL,TOOWOOMBA))"}}')
# Turns API server data into JSON data.
wmip_dataf <- content(wmip_callup, type = 'application/json')
# Extracts the rainfall gauge site names from the parsed response.
list_var <- wmip_dataf[["_return"]][["sites"]]
# Combines all of the rainfall gauge names into a single data frame (could be used for file names / looping over the data).
rfg_bind <- do.call(rbind.data.frame, list_var)
# Sets the column name of the combination data frame.
rfg_bind <- setNames(rfg_bind, "Rainfall Gauge Name")
rfg_select <- rfg_bind$`Rainfall Gauge Name`
# Attempt to insert a value from the list into the query (the **rfg_select** marker shows where it needs to go):
wmip_input <- GET('https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":**rfg_select**,"datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}')
Hey there,
After some work I've found a solution using string substitution into a template URL.
I set up a dummy placeholder in the URL that lets me drop in a single data value.
# Dummy Variable string:
wmip_url <- 'https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":"varinput","datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}'
# Dummy string: grabs one value from the list.
rfg_individual <- rfg_select[2]
# Replaces the specified input
rfg_replace <- gsub("varinput", rfg_individual, wmip_url)
# Result
"https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{\"function\":\"get_ts_traces\",\"version\":\"1\",\"params\":{\"site_list\":\"001203A\",\"datasource\":\"AT\",\"varfrom\":\"10\",\"varto\":\"10\",\"start_time\":\"0\",\"end_time\":\"0\",\"data_type\":\"mean\",\"interval\":\"day\",\"multiplier\":\"1\"}}"
I am a novice with R and a total newbie with the NHL API. I wrote an R program to extract all of the goals recorded in the NHL's data repository accessed through the NHL API using the R "nhlapi" package. I have code that works, but it's ugly and slow, and I wanted to see if anyone has suggestions for improving it. I am using the nhl_games_feed function provided by nhlapi to pull all events, from which I select the goals. This function returns a JSON blob (list of lists of lists of lists ...) in R, which I want to convert into a proper data.table.
I pasted a stripped-down version of my code below. I understand that normal practice here would be to include a sample data blob with the code so that other users can recreate my problems, but the data blob is the problem.
When I ran the full version of my code last night, the "Loop through games" portion took about 11 hours, and the "Convert players list to columns" took about 2 hours. Unless I can find a way to push the column or row filtering into the NHL's system, I don't think I am likely to find a way to speed up the "Loop through games" portion. So my first question: Does anyone have any thoughts about how to extract a subset of columns or rows using the NHL API, or do I need to pull everything and process it on my end?
My other question relates to the second chunk of code ("Convert players..."), which converts the resulting event data into a single row of scalar elements per event. The event data shows up in lblob_feed[[1]]$liveData$plays$allPlays, which contains one row of scalar elements per event, except that one of the elements is ..$allPlays$players, which is itself a 4x5 dataframe. As a result, the only way that I could find to extract that data into scalar elements is the "Convert players..." loop. Is there a better way to convert this into a simple data.table?
Finally, any tips on other ways to end up with a comprehensive database of NHL events?
require("nhlapi")
require("data.table")
require("tidyverse")
require("hms")
assign("last.warning", NULL, envir = baseenv())
# create small list of selected games, using NHL API game code format
cSelGames <- c(2021020001, 2021020002, 2021020003, 2021020004)
liNumGames <- length(cSelGames)
print(liNumGames)
# 34370 games in the full database
# =============================================================================
# Loop through games
# Pull data for one game per call
Sys.time()
dtGoals <- data.table()
for (liGameNum in 1:liNumGames) {
  # Pull the NHL feed blob for one selected game
  # 11 hours in the full version
  lblob_feed <- nhl_games_feed(gameId = cSelGames[liGameNum])
  # Select only the play (event) portion of the feed blob
  ldtFeed <- as.data.table(c(lblob_feed[[1]]$gamePk, lblob_feed[[1]]$liveData$plays$allPlays))
  setnames(ldtFeed, 1, "gamePk")
  # Check for games with no play data - 1995020006 has none and would kill execution
  if ('result.eventCode' %in% colnames(ldtFeed)) {
    # Check for missing elements in allPlays list
    # team.triCode is missing for at least one game, probably for all-star games
    if (!('team.triCode' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (team.triCode = NA)]}
    if (!('result.strength.code' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.strength.code = NA)]}
    if (!('result.emptyNet' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.emptyNet = NA)]}
    # Select the events and columns for the output data table
    ldtGoals_new <- ldtFeed[(result.eventTypeId == 'GOAL')
                            , list(gamePk, result.eventCode, players, result.description
                                   , team.triCode, about.period, about.periodTime
                                   , about.goals.away, about.goals.home, result.strength.code, result.emptyNet)]
    # Append the incremental data table to the aggregate data table
    dtGoals <- rbindlist(list(dtGoals, ldtGoals_new), use.names=TRUE, fill=TRUE)
  }
}
# =============================================================================
# Convert players list to columns
# 2 hours in the full version
# 190686 goals in full table
# For each goal, the player dataframe is 4x5
Sys.time()
dtGoal_player <- data.table()
for (i in 1:dtGoals[,.N]) {
  # convert rows with embedded dataframes into multiple rows with scalar elements
  s_result.eventCode <- dtGoals[i, result.eventCode]
  dtGoal_player_new <- as.data.table(dtGoals[i, players[[1]]])
  dtGoal_player_new[, ':=' (result.eventCode = s_result.eventCode)]
  dtGoal_player <- rbindlist(list(dtGoal_player, dtGoal_player_new), use.names=TRUE, fill=TRUE)
}
# drop players element
dtGoals[, players:=NULL]
# clean up problem with duplicated rows with playerType=Assist
dtGoal_player[, lag.playerType:=c('nomatch', playerType[-.N]), by=result.eventCode]
dtGoal_player[, playerType2:=ifelse((playerType==lag.playerType),'Assist2',playerType)]
# transpose multiple rows per event into single row with multiple columns for playerType
dtGoal_player_t <- dcast.data.table(dtGoal_player, result.eventCode ~ playerType2
, value.var='player.id', fun.aggregate=max)
# =============================================================================
# Merge players data into dtGoals
Sys.time()
dtGoals <- merge(dtGoals, dtGoal_player_t, by="result.eventCode")
Sys.time()
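For what it's worth, a vectorised alternative to the "Convert players list to columns" loop might look like the sketch below (untested against real feed data; it assumes dtGoals still carries the players list-column and that each element is a plain data frame):
library(data.table) # already attached above
# Stack every per-goal players data frame in one call; idcol records the
# originating row in dtGoals so the event code can be copied across
dtGoal_player <- rbindlist(dtGoals$players, use.names = TRUE, fill = TRUE, idcol = "goal_row")
dtGoal_player[, result.eventCode := dtGoals$result.eventCode[goal_row]]
dtGoal_player[, goal_row := NULL]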
I have created a simple script to collect a YouTube channel's statistics. I'm just wondering how I could loop through a list of channel IDs instead of having to manually change the channel ID and re-run the script each time? I struggle to understand how to write loops in R.
library(jsonlite) # for fromJSON()
key <- 'MyKey'
channel_id1 <- 'UCLSWNf28X3mVTxTT3_nLCcw'
url <- 'https://www.googleapis.com/youtube/v3/channels?part=statistics'
y <- paste0(url,'&id=',channel_id1,'&key=',key)
yt_channel1 <- fromJSON(txt=y)
yt_d_channel1 <- as.data.frame(do.call(c, unlist(yt_channel1, recursive=FALSE)))
Is there any way to store all the channel IDs of interest in a list or vector, then loop through them, storing the results in a new (or the same) dataframe?
i.e.
channels <- c('UCLSWNf28X3mVTxTT3_nLCcw', 'UCLSW467236VTxTT3_nLCcw', 'UHJKHS328787_ndncp')
for (i in 1:3) {
  # channels[i]
  # do stuff
}
Any help is greatly appreciated.
Yes, store the channel IDs in a column in a data frame. Assuming you have a data frame called my_data_frame with a column ID that contains the IDs, you can loop through the IDs like this:
key <- 'MyKey'
url <- 'https://www.googleapis.com/youtube/v3/channels?part=statistics'
for (i in 1:nrow(my_data_frame)) {
  y <- paste0(url, '&id=', my_data_frame$ID[i], '&key=', key)
  yt_channel1 <- fromJSON(txt = y)
  yt_d_channel1 <- as.data.frame(do.call(c, unlist(yt_channel1, recursive = FALSE)))
}
Notice how the ID is referenced using an index i which will count from 1 until the number of rows in your data frame.
Note that, as written, this loop overwrites yt_channel1 and yt_d_channel1 on every pass, so you still need to decide how to store the results.
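For example, here is a minimal sketch of one way to keep everything, assuming url, key and my_data_frame are defined as above and that each flattened response is a one-row data frame as in your original code:
library(jsonlite) # for fromJSON()
results <- vector("list", nrow(my_data_frame))
for (i in 1:nrow(my_data_frame)) {
  y <- paste0(url, '&id=', my_data_frame$ID[i], '&key=', key)
  yt_channel <- fromJSON(txt = y)
  results[[i]] <- as.data.frame(do.call(c, unlist(yt_channel, recursive = FALSE)))
}
# Stack the per-channel rows into one data frame (assumes the columns match across channels)
yt_all_channels <- do.call(rbind, results)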
I’ve got a df that consists of Twitter handles that I wish to scrape on a regular basis.
df=data.frame(twitter_handles=c("#katyperry","#justinbieber","#Cristiano","#BarackObama"))
My Methodology
I would like to run a for loop that loops over each of the handles in my df and creates multiple dataframes:
1) By using the rtweet library, I would like to gather tweets using the search_tweets function.
2) Then I would like to merge the new tweets to existing tweets for each dataframe, and then use the unique function to remove any duplicate tweets.
3) For each dataframe, I'd like to add a column with the name of the Twitter handle used to obtain the data. For example: For the database of tweets obtained using the handle #BarackObama, I'd like an additional column called Source with the handle #BarackObama.
4) In the event that the API returns 0 tweets, I would like Step 2) to be ignored. Very often, when the API returns 0 tweets, I get an error as it attempts to merge an empty dataframe with an existing one.
5) Finally, I would like to save the results of each scrape to the different dataframe objects. The name of each dataframe object would be its Twitter handle, in lower case and without the #
My Desired Output
My desired output would be 4 dataframes, katyperry, justinbieber, cristiano & barackobama.
My Attempt
library(rtweet)
library(ROAuth)
#Accessing Twitter API using my Twitter credentials
key <-"yKxxxxxxxxxxxxxxxxxxxxxxx"
secret <-"78EUxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
setup_twitter_oauth(key,secret)
#Dataframe of Twitter handles
df=data.frame(twitter_handles=c("#katyperry","#justinbieber","#Cristiano","#BarackObama"))
# Setting up the query
query <- as.character(df$twitter_handles)
query <- unlist(strsplit(query,","))
tweets.dataframe = list()
# Loop through the twitter handles & store the results as individual dataframes
for (i in 1:length(query)) {
  result <- search_tweets(query[i], n = 10000, include_rts = FALSE)
  # Strip tweets that contain RTs
  tweets.dataframe <- c(tweets.dataframe, result)
  tweets.dataframe <- unique(tweets.dataframe)
}
However, I have not been able to figure out how to make my for loop skip the concatenation step when the API returns 0 tweets for a given handle.
Also, my for loop does not create 4 dataframes in my environment; it stores the results as one large list.
I identified a post that addresses a problem very similar to the one I face, but I find it difficult to adapt to my question.
Your inputs would be greatly appreciated.
Edit: I have added Step 3) in My Methodology, in case you are able to help with that too.
tweets.dataframe = list()
# Loop through the twitter handles & store the results as individual dataframes
for (i in 1:length(query)) {
  result <- search_tweets(query[i], n = 10, include_rts = FALSE)
  if (nrow(result) > 0) { # only if result has data
    tweets.dataframe <- c(tweets.dataframe, list(result))
  }
}
# tweets.dataframe is now a list where each element is a data frame containing
# the results from an individual query; for example...
tweets.dataframe[[1]]
# to combine them into one data frame
do.call(rbind, tweets.dataframe)
in response to a reply...
twitter_handles <- c("#katyperry","#justinbieber","#Cristiano","#BarackObama")
# Loop through the twitter handles & store the results as individual dataframes
for (handle in twitter_handles) {
  result <- search_tweets(handle, n = 15, include_rts = FALSE)
  result$Source <- handle
  df_name <- substring(handle, 2)
  if (exists(df_name)) {
    assign(df_name, unique(rbind(get(df_name), result)))
  } else {
    assign(df_name, result)
  }
}
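If you would rather avoid exists()/get()/assign(), the same results can be kept in one named list; a sketch (assuming twitter_handles as above, and lower-casing the names per step 5 of the question):
results <- list()
for (handle in twitter_handles) {
  result <- search_tweets(handle, n = 15, include_rts = FALSE)
  if (nrow(result) > 0) {          # skip handles that return no tweets
    result$Source <- handle
    name <- tolower(substring(handle, 2))
    # rbind() against NULL on the first pass simply keeps `result`
    results[[name]] <- unique(rbind(results[[name]], result))
  }
}
# e.g. results$barackobama now holds everything gathered for #BarackObama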
I wrote a script in R which reads a csv file into a dataframe, then manipulates that dataframe to create a new one. The current script is as follows:
#read in the csv file for a node
node00D = read.csv("00D.csv")
#create a new data frame with the specific measurement we want to decrease file size
node00D_smaller = node00D[node00D$sensor=="BMP180",]
#remove the columns that we don't need
keeps = c("node_id","timestamp","parameter","value")
node00D = node00D_smaller[keeps]
#fix row names
rownames(node00D) = 1:nrow(node00D)
#convert the timestamp column from a factor to a date and then truncate to the hour
node00D$timestamp = as.POSIXlt(node00D$timestamp)
node00D$timestamp = trunc(node00D$timestamp,"hour")
rm(keeps,node00D_smaller)
#get average temperature for each hour
library(plyr)
node00D$timestamp = as.character(node00D$timestamp)
node00D_means = ddply(node00D, .(timestamp), summarize,
mean=round(mean(value),2))
node00D$timestamp = as.POSIXlt(node00D$timestamp)
node00D_means$timestamp = as.POSIXlt(node00D_means$timestamp)
write.csv(node00D_means,"00D_Edit.csv")
#load lat long data
latlong = read.csv("Node.Lat.Lon.csv")
node00D_means$Node = "00D"
node00D_means = merge(node00D_means,latlong,by="Node")
I have to do this for up to 100 nodes, and so I tried writing a function with argument 'node' which would perform this. In this example, I would input getNodeData(00D). However, when I do this there are issues actually creating the data frames. The function runs but does not create any new objects. Is there a way to turn this script into a function so that I can more easily perform it 100 times?
You could do something like
fun1 <- function(node.num){
  ### (1). Load the data
  dat <- read.csv(paste0(node.num, ".csv"))
  dat_smaller <- dat[dat$sensor == "BMP180", ]
  ### (2). Here proceed with your code and substitute node00D(_smaller) by dat(_smaller) ###
  # ------------------------------------------------------------------- #
  ### (3). Then define dat_means and save the .csv
  library(plyr)
  dat$timestamp <- as.character(dat$timestamp)
  dat_means <- ddply(dat, .(timestamp), summarize, mean = round(mean(value), 2))
  dat$timestamp <- as.POSIXlt(dat$timestamp)
  dat_means$timestamp <- as.POSIXlt(dat_means$timestamp)
  write.csv(dat_means, paste0(node.num, "_Edit.csv"))
  ### (4). And similarly for lat long
  latlong <- read.csv("Node.Lat.Lon.csv")
  dat_means$Node <- node.num
  dat_means <- merge(dat_means, latlong, by = "Node")
}
As written, the function's last expression is an assignment, so it only returns dat_means invisibly; it does save the .csv files, though. If you want it to return something explicitly, e.g. dat_means, add the line return(dat_means) before the function ends.
Appendix
Now, to perform the above operation dynamically, you can for instance use a loop:
### (1.) First, create an object containing all your nodes, e.g.
nodes.vector <- c("00D", ...)
### (2.) Run a loop, or use one of the apply functions
for (k in seq_along(nodes.vector)) {
  fun1(nodes.vector[k])
}
# Or
sapply(nodes.vector, fun1)
Now I don't know your data, but if the nodes are contained in latlong$Node, then you can use that column to build nodes.vector.
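For example, if fun1() ends with return(dat_means), a sketch of collecting every node's result into a named list (assuming the node names live in latlong$Node) could be:
latlong <- read.csv("Node.Lat.Lon.csv")
nodes.vector <- as.character(unique(latlong$Node))
all_node_means <- lapply(nodes.vector, fun1)
names(all_node_means) <- nodes.vector
# e.g. all_node_means[["00D"]] is the hourly-mean data frame for node 00D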
I'm pulling large sets of data for multiple sites and views from Google Analytics for processing in R. To streamline the process, I've added my most common queries to a function (so I only have to pass the profile ID and date range). Each query is stored as a local variable in the function, and is assigned to a dynamically-named global variable:
R version 3.1.1
library(rga)
library(stargazer)
# I would add a dataset, but my data is locked down by client agreements and I don't currently have any test sites configured.
profiles <- ga$getProfiles()
website1 <- profiles[1,]
start <- "2013-01-01"
end <- "2013-12-31"
# profiles are objects containing all the ID's, accounts #, etc.; start and end specify date range as strings (e.g. "2014-01-01")
reporting <- function(profile, start, end){
  id <- profile[, 1] # sets profile number from profile object
  # rga function for building and submitting query to API
  general <- ga$getData(id,
                        start.date = start,
                        end.date = end,
                        metrics = "ga:sessions")
  ... # additional queries, structured similarly to the example above (e.g. countries, cities, etc.)
  # transforms name of profile object to string
  profileName <- deparse(substitute(profile))
  # appends "Data" to profile object name
  temp <- paste(profileName, "Data", sep = "")
  # stores query results as list
  temp2 <- list(general, countries, cities, devices, sources, keywords, pages, events)
  # assigns list of query results and stores it globally
  assign(temp, temp2, envir = .GlobalEnv)
}
#call reporting function at head of report or relevant section
reporting(website1,start,end)
#returns list of data frames returned by the ga$getData(...), but within the list they are named "data.frame" instead of their original query name.
#generate simple summary table with stargazer package for display within the report
stargazer(website1[1])
I'm able to access these results through *website1*Data[1], but I'm handing the data off to collaborators. Ideally, they should be able to access the data by name (e.g. *website1*Data$countries).
Is there an easier/better way to store these results, and to make accessing them easier from within an .Rmd report?
There's no real reason to do the deparse() inside the function just to assign a variable in the parent environment. Since you have to call the reporting() function anyway, just have it return a value and assign the result:
reporting <- function(profile, start, end){
  # ... all the other code
  # return results
  list(general = general, countries = countries, cities = cities,
       devices = devices, sources = sources, keywords = keywords,
       pages = pages, events = events)
}
#store results
websiteResults <- reporting(website1,start,end)
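The named elements can then be pulled out directly inside the .Rmd report, which is what your collaborators need, for example:
# Access the query results by name rather than by position
head(websiteResults$countries)
stargazer(websiteResults$general)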