I'm pulling large sets of data for multiple sites and views from Google Analytics for processing in R. To streamline the process, I've added my most common queries to a function (so I only have to pass the profile ID and date range). Each query is stored as a local variable in the function, and is assigned to a dynamically-named global variable:
R version 3.1.1
library(rga)
library(stargazer)
# I would add a dataset, but my data is locked down by client agreements and I don't currently have any test sites configured.
profiles <- ga$getProfiles()
website1 <- profiles[1,]
start <- "2013-01-01"
end <- "2013-12-31"
# profiles are objects containing all the IDs, account numbers, etc.; start and end specify the date range as strings (e.g. "2014-01-01")
reporting <- function(profile, start, end){
id <- profile[,1] #sets profile number from profile object
#rga function for building and submitting query to API
general <- ga$getData(id,
start.date = start,
end.date = end,
metrics = "ga:sessions")
... #additional queries, structured similarly to the example above (e.g. countries, cities, etc.)
#transforms name of profile object to string
profileName <- deparse(substitute(profile))
#appends "Data" to profile object name
temp <- paste(profileName, "Data", sep="")
#stores query results as list
temp2 <- list(general,countries,cities,devices,sources,keywords,pages,events)
#assigns list of query results and stores it globally
assign(temp, temp2, envir=.GlobalEnv)
}
#call reporting function at head of report or relevant section
reporting(website1,start,end)
#returns a list of the data frames produced by ga$getData(...), but within the list they are named "data.frame" instead of their original query names.
#generate simple summary table with stargazer package for display within the report
stargazer(website1Data[1])
I'm able to access these results through *website1*Data[1], but I'm handing the data off to collaborators. Ideally, they should be able to access the data by name (e.g. *website1*Data$countries).
Is there an easier/better way to store these results, and to make accessing them easier from within an .Rmd report?
There's no real reason to do the deparse inside the function just to assign a variable in the parent environment. Since you have to call reporting() anyway, just have the function return a value and assign the result yourself:
reporting <- function(profile, start, end){
#... all the other code
#return results
list(general=general,countries=countries,cities=cities,
devices=devices,sources=sources,keywords=keywords,
pages=pages,events=events)
}
#store results
websiteResults <- reporting(website1,start,end)
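Because the returned list is named, you and your collaborators can pull individual query results by name; a quick usage sketch using the element names from the function above:
#access individual query results by name
head(websiteResults$general)
summary(websiteResults$countries)
The same websiteResults$countries style works directly inside .Rmd chunks, which addresses the hand-off concern.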
I have a data frame on members of the US Congress and I'm collecting additional data from Google Trends using the gtrendsR package. Given that Google Trends won't allow me to search all members of Congress in a single query, I've decided to create a for loop that collects the data for one politician at a time.
for (i in df$name_google){
user <- i
res <- gtrends(c(obama, user), geo = c("US"), time = "all")
}
However, the Google Trends output object (res) is itself a list containing a number of data frames.
I would like to use the for loop to save this list to a new column in df, with each iteration of the loop adding the new res object to the row of the user it just searched. I don't know how to do this, but I imagine something like the commented line I added to the loop below. Let me know if I'm failing to include necessary info.
for (i in df$name_google){
user <- i
res <- gtrends(c(obama, user), geo = c("US"), time = "all")
#df$newlistvariable <- res if df$name == i
}
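One way to do this (a sketch, not from the original post; it assumes df has one row per politician, that name_google holds the search term, and that "obama" is meant as a literal search term) is to fill a list-column by row index instead of assigning by value:
library(gtrendsR)
#create an empty list-column to hold one gtrends result per row
df$gtrends_res <- vector("list", nrow(df))
for (i in seq_len(nrow(df))){
  user <- df$name_google[i]
  res <- gtrends(c("obama", user), geo = c("US"), time = "all")
  #store the whole result (itself a list of data frames) in the matching row
  df$gtrends_res[[i]] <- res
}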
I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
dbGetQuery(con, paste0(
"EXEC dbo.", proc_name, " ",
"#URL = N'", url, "', ",
"#Startdate = '", start_date, "', ",
"#enddate = '", end_date, "' "
))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
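If the odbc driver supports parameter binding, a variation on the helper above avoids pasting values directly into the SQL string (a sketch only; the connection object con, the parameter names, and the proc signature are assumed from the snippet above):
library(DBI)
get_proc_data <- function(proc_name, url, start_date, end_date){
  #bind the values as query parameters instead of interpolating them
  dbGetQuery(
    con,
    paste0("EXEC dbo.", proc_name, " @URL = ?, @Startdate = ?, @enddate = ?"),
    params = list(url, as.character(start_date), as.character(end_date))
  )
}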
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr to create a list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use dplyr::pull to create a vector of table names
table_list <- pull(db_tables, TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
write.csv(
tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/tables/{table}.csv")
)
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1","TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs currently power a metrics plug-in, written in C++, that displays metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me, and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs also use different logic from one another, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date either lists total page views for that day or is already aggregated at the weekly or monthly scale and then repeated, so joining and grouping causes drastic errors in the actual page-view counts.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed by which direction to take. I researched the R equivalent of a Python list comprehension but have not been able to get a function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
The as.sql error above occurs because a data frame, rather than a character vector of table names, is being piped into the function. Consider storing all table data in a named list (the counterpart to a Python dictionary) using lapply (the counterpart to Python's list/dict comprehension). If you use its sibling, sapply, with simplify = FALSE, the character vector you pass in becomes the names of the result's elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the anonymous function to multiple steps (e.g., write.csv) or use a separately defined function; just be sure to return a data frame on the last line. The version below uses the native pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
head_df <- dbReadTable(con, table) |> head()
write.csv(
head_df,
file.path(
"00_exports", "bibliometrics_tables", paste0(table, ".csv")
)
)
return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose none of the data frame's functionality by storing it in a list, and you can extract elements by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
I'm trying to input a list of values from a data frame into my GET call for the web query and then cycle through each iteration as I go. If somebody could point me to further resources to read and learn from, it would be appreciated.
The following is the code that pulls the gauge names from the API server. I plan on using purrr iteration functions to loop over it; the values from the list would be substituted in place of the variable rfg_select.
library(httr)
library(purrr)
## Call up Query Development Script
## Calls up every single rainfall data gauge across the entirety of QLD
wmip_callup <- GET('https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_site_list","version":"1","params":{"site_list":"MERGE(GROUP(MGR_OFFICE_ALL,AYR),GROUP(MGR_OFFICE_ALL,BRISBANE),GROUP(MGR_OFFICE_ALL,BUNDABERG),GROUP(MGR_OFFICE_ALL,MACKAY),GROUP(MGR_OFFICE_ALL,MAREEBA),GROUP(MGR_OFFICE_ALL,ROCKHAMPTON),GROUP(MGR_OFFICE_ALL,SOUTH_JOHNSTONE),GROUP(MGR_OFFICE_ALL,TOOWOOMBA))"}}')
# Parses the API response into an R list from the returned JSON.
wmip_dataf <- content(wmip_callup, type = 'application/json')
# Extracts the rainfall gauge site entries from the response.
list_var <- wmip_dataf[["_return"]][["sites"]]
# Combines all of the rainfall gauge records into a single data frame (could be used for file names / looping over the data).
rfg_bind <- do.call(rbind.data.frame, list_var)
# Sets the column name of the combined data frame.
rfg_bind <- setNames(rfg_bind, "Rainfall Gauge Name")
rfg_select <- rfg_bind$`Rainfall Gauge Name`
# Attempt to insert the list values into the query:
wmip_input <- GET('https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":**rfg_select**,"datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}')
Hey there,
After some work I've found a solution using a template string and gsub() substitution.
I set up a dummy variable that helped me select a single data value.
# Template URL string with a "varinput" placeholder:
wmip_url <- 'https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":"varinput","datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}'
# Grabs one value from the list.
rfg_individual <- rfg_select[2:2]
# Replaces the placeholder with the selected gauge name.
rfg_replace <- gsub("varinput", rfg_individual, wmip_url)
# Result
"https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{\"function\":\"get_ts_traces\",\"version\":\"1\",\"params\":{\"site_list\":\"001203A\",\"datasource\":\"AT\",\"varfrom\":\"10\",\"varto\":\"10\",\"start_time\":\"0\",\"end_time\":\"0\",\"data_type\":\"mean\",\"interval\":\"day\",\"multiplier\":\"1\"}}"
I am in the process of downloading and processing a substantial amount of air quality data from an API through the jsonlite library in R. Because of the amount of data, I thought it prudent to write a script that would automate the entire process. These data encompass all 50 States throughout the US, as well as four different air pollutants. The timeline for this is January 1st, 2015 through December 31st, 2019. The state, air pollutant code, begin date, and end date are the four parameters that are submitted at each query iteration to the API.
Since I know this will take quite a bit of time to download and process all the data, I am wondering if there is a way that I can run the script for a while, stop it when I need to do something else, and then restart the script at its last iteration, or rather, at its last index in the vectors containing the values passed into the API URL. I did look up other similar questions but could not find something that accurately fit my situation.
Because this script is querying a web API, I thought that the best approach in this case would be to automate the API calls and data processing through nested for loops. For the "for loop" portion, it would begin on the first state, first pollutant parameter, and then iterate through each beginning date and end date (the API only accepts timelines of up to one year). Then it would go to the next pollutant and iterate through each beginning date and ending date, so on and so forth. Is there a way I could keep track of the iterations and pass that back into the for loops once I restart the script? I am struggling to come up with the logic for this.
Also, I am still familiarizing myself with R after doing much of my data processing in Python and SQL, so please bear with me if my code is not very efficient or more complicated than what is necessary. I did not include the actual data processing portion of the code, only the vectors and the for loops for iterating through those vectors.
#set working directory
setwd('Path_to_directory_here')
#import jsonlite library
library(jsonlite)
#create state FIPS list
fips_list <- c("01","02","04","05",
"06","08","09","10",
"12","13","15","16",
"17","18","19","20",
"21","22","23","24",
"25","26","27","28",
"29","30","31","32",
"33","34","35","36",
"37","38","39","40",
"41","42","44","45",
"46","47","48","49",
"50","51","53","54",
"55","56")
#create list of state names corresponding with state FIPS codes
#this was created to specify which state the data is from in the file name when writing to a .CSV file
fips_names <- c('AL','AK','AZ','AR',
'CA','CO','CT','DE',
'FL','GA','HI','ID',
'IL','IN','IA','KS',
'KY','LA','ME','MD',
'MA','MI','MN','MS',
'MO','MT','NE','NV',
'NH','NJ','NM','NY',
'NC','ND','OH','OK',
'OR','PA','RI','SC',
'SD','TN','TX','UT',
'VT','VA','WA','WV',
'WI','WY')
#create a key/value pair for FIPS codes and state names
fips_states <- setNames(as.list(fips_names), fips_list)
fips_key <- names(fips_states)
fips_val <- fips_states[fips_list]
#same procedure as FIPS codes and state names
param_list <- c('88101','88501', '42602','44201')
param_names <- c('PM25', 'PM25_LC', 'NO2', 'O3') #specifies which pollutant was pulled in the file name
params <- setNames(as.list(param_names), param_list)
param_key <- names(params)
param_val <- params[param_list]
#same as above
begin_yr <- c('20150101','20160101','20170101','20180101','20190101')
end_yr <- c('20151231','20161231','20171231','20181231','20191231')
yr_list <- setNames(as.list(end_yr), begin_yr)
key <- names(yr_list)
val <- yr_list[begin_yr]
#keep track of files processed
file_tracker <- 0
for (x in 1:length(fips_states)) {
for (y in 1:length(params)) {
for (z in 1:length(yr_list)) {
file_tracker <- file_tracker + 1
tracker_msg <- sprintf('Reading in State: %s, parameter: %s, timeframe: %s', fips_val[x], param_val[y], key[z])
print(tracker_msg)
#call to API
url <- sprintf("https://aqs.epa.gov/data/api/sampleData/byState?email=MY_EMAIL&key=MY_KEY¶m=%s&bdate=%s&edate=%s&state=%s",
param_key[y],key[z],val[z],fips_key[x])
data <- fromJSON(txt = url)
#rest of the data formatting and processing here
}
}
}
I can include more code if necessary. Thanks for any help provided.
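One way to make the loop resumable (a sketch, not part of the original script) is to flatten the three loop indices into the existing file_tracker counter, save it to disk after each completed iteration, and skip already-finished combinations on restart; the checkpoint file name is hypothetical:
#hypothetical checkpoint file in the working directory
checkpoint_file <- "checkpoint.rds"
#load the last completed iteration count, if a previous run saved one
last_done <- if (file.exists(checkpoint_file)) readRDS(checkpoint_file) else 0
file_tracker <- 0
for (x in 1:length(fips_states)) {
  for (y in 1:length(params)) {
    for (z in 1:length(yr_list)) {
      file_tracker <- file_tracker + 1
      #skip combinations finished in a previous run
      if (file_tracker <= last_done) next
      url <- sprintf("https://aqs.epa.gov/data/api/sampleData/byState?email=MY_EMAIL&key=MY_KEY&param=%s&bdate=%s&edate=%s&state=%s",
                     param_key[y], key[z], val[z], fips_key[x])
      data <- fromJSON(txt = url)
      #rest of the data formatting and processing here
      #record progress once the iteration finishes
      saveRDS(file_tracker, checkpoint_file)
    }
  }
}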
I have set up an API access key with a data provider of stock market data. With this key I am able to extract stock market data based on ticker code (e.g. APPL: Apple, FB: Facebook, etc.).
I am able to extract stock data on an individual ticker basis using R, but I want to write a piece of code that extracts data for multiple stock tickers and puts it all in one data frame (the structure is the same for all stocks). I'm not sure how to create a loop that updates the data frame each time stock data is extracted. I get the message 'No encoding supplied: defaulting to UTF-8', which does not tell me much. A point in the right direction would be helpful.
I have the following code:
if (!require("httr")) {
install.packages("httr")
library(httr)
}
if (!require("jsonlite")) {
install.packages("jsonlite")
library(jsonlite)
}
stocks <- c("FB","APPL") #Example stocks actual stocks removed
len <- length(stocks)
url <- "URL" #Actual url removed
access_key <- "MY ACCESS KEY" #Actual access key removed
extraction <- lapply(stocks[1:len],function(i){
call1 <- paste(url,"?access_key=",access_key,"&","symbols","=",stocks[i],sep="")
get_prices <- GET(call1)
get_prices_text <- content(get_prices, "text")
get_prices_json <- fromJSON(get_prices_text, flatten = TRUE)
get_prices_df <- as.data.frame(get_prices_json)
return(get_prices_df)
}
)
file <- do.call(rbind,extraction)
I realised that this is not the most efficient way of doing this. A better way is to update the URL to include multiple stocks rather than using an lapply function. I am therefore closing the question.
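For reference, a sketch of that single-request approach (the exact URL format is hypothetical and depends on the provider; many quote APIs accept a comma-separated symbols parameter):
#build one call containing every ticker, comma-separated
call_all <- paste0(url, "?access_key=", access_key,
                   "&symbols=", paste(stocks, collapse = ","))
get_prices <- GET(call_all)
#supplying the encoding also silences the "No encoding supplied" message
get_prices_text <- content(get_prices, "text", encoding = "UTF-8")
get_prices_json <- fromJSON(get_prices_text, flatten = TRUE)
file <- as.data.frame(get_prices_json)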