Macro with Iteration through URL in R

I'm having a problem creating a macro variable within an API call in R. I am trying to loop through a vector of zip codes and make an API call for each one. I'm pretty unfamiliar with iterating through an R list that needs to be macro'd out.
Here is my code:
# creating a dataframe of 10 sample California zip codes to iterate through from database
zip_iterations <- sqlQuery(ch, "select distinct zip from zip_codes where state='CA' limit 10", believeNRows = FALSE)
# Calling the api to retrieve the JSON
json_file <- "http://api.openweathermap.org/data/2.5/weather?zip=**MACRO VECTOR TO ITERATE**"
My goal is to go through the list of 10 zip codes in the dataframe by using a macro.

R doesn't use macros per se, but there are lots of alternative ways of doing what it sounds like you want to do. This version will return a character vector json_file, with each entry containing the body of the HTTP response for that zip code:
library("httr")
json_file <- character(0)
urls <- paste0("http://api.openweathermap.org/data/2.5/weather?zip=", zip_iterations$zip)
for (i in seq_along(urls)) {
  json_file[i] <- content(GET(urls[i]), as = "text")
}
You can then parse the resulting vector into a set of R lists using, for example, fromJSON() from the jsonlite package:
library("jsonlite")
lapply(json_file, fromJSON)
The result of that will be a list of lists.
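As a hedged follow-up sketch (the name field is assumed from OpenWeatherMap's documented response layout, and real requests also need an appid API-key parameter), you could then pull one value out of each parsed response:
parsed <- lapply(json_file, fromJSON)
# Extract the reported city name from each response (assumed field name).
sapply(parsed, function(x) x$name)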

Related

creating a loop for "load" and "save" processes

I have a data.frame (dim: 100 x 1) containing a list of URL links; each URL looks something like this: https:blah-blah-blah.com/item/123/index.do.
The list (a data.frame called my_list with 100 rows and a single character column named col, i.e. $ col: chr) looks like this:
1 "https:blah-blah-blah.com/item/123/index.do"
2 "https:blah-blah-blah.com/item/124/index.do"
3 "https:blah-blah-blah.com/item/125/index.do"
etc.
I am trying to import each of these URLs into R and save the results collectively as an object that is compatible with text-mining procedures.
I know how to convert each of these URLs manually:
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tm)
#1st document
url <- "https:blah-blah-blah.com/item/123/index.do"
article <- pdf_text(url)
Once this "article" file has been successfully created, I can inspect it:
str(article)
chr [1:13]
It looks like this:
[1] "abc ....."
[2] "def ..."
...
[13] "ghi ..."
From here, I can successfully save this as an RDS file:
saveRDS(article, file = "article_1.rds")
Is there a way to do this for all 100 articles at the same time? Maybe with a loop?
Something like :
for (i in 1:100) {
  url_i <- my_list[i, 1]
  article_i <- pdf_text(url_i)
  saveRDS(article_i, file = "article_i.rds")
}
If this was written correctly, it would save each article as an RDS file (e.g. article_1.rds, article_2.rds, ... article_100.rds).
Would it then be possible to save all these articles into a single rds file?
Please note that list is not a good name for an object, as it will mask the list() function. It is usually good to name your variables according to their content; maybe url_df would be a good name.
library(pdftools)
#> Using poppler version 20.09.0
library(tidyverse)
url_df <- data.frame(
  url = c(
    "https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/19-mh-8084-autismspecdisordr_152236.pdf",
    "https://www.nimh.nih.gov/health/publications/my-mental-health-do-i-need-help/20-mh-8134-mymentalhealth-508_161032.pdf"
  )
)
Since the URLs are already in a data.frame, we can store the text data in an additional column. That way the data will be easily available for later steps.
text_df <- url_df %>%
  mutate(text = map(url, pdf_text))
Instead of saving each text in a separate file we can now store all of the data
in a single file:
saveRDS(text_df, "text_df.rds")
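If you later want one row per PDF page for text mining, here is a minimal sketch (tidyr's unnest() comes along with the tidyverse; text_long is an illustrative name):
text_df <- readRDS("text_df.rds")
# Expand the list column so each page of each PDF becomes its own row.
text_long <- text_df %>% unnest(text)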
For historical reasons, for loops are not very popular in the R community. Base R has the *apply() function family, which provides a functional approach to iteration. The tidyverse has the purrr package, whose map*() functions improve upon the *apply() family. I recommend taking a look at https://purrr.tidyverse.org/ to learn more.
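As a minimal illustration of the two styles side by side (a toy example, not specific to this problem), both calls below produce the same list:
# base R: apply an anonymous function to each element
squares_apply <- lapply(1:3, function(x) x^2)
# purrr: the same iteration written with formula shorthand
squares_map <- purrr::map(1:3, ~ .x^2)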
It seems that there are certain URLs in your data which are not valid PDF files. You can wrap the call in tryCatch to handle the errors. If your dataframe is called df with a url column in it, you can do:
library(pdftools)
lapply(seq_along(df$url), function(x) {
  tryCatch({
    saveRDS(pdf_text(df$url[x]), file = sprintf('article_%d.rds', x))
  }, error = function(e) {})
})
So say you have a data.frame called my_df with a column that contains your URLs of PDF locations. Judging by your comments, it seems that some URLs lead to broken PDFs. You can use tryCatch in these cases to report back which links were broken and check manually what's wrong with them.
You can do this in a for loop like this:
my_df <- data.frame(url = c(
  "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # working pdf
  "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pfd"  # broken pdf
))
# make some useful new columns
my_df$id <- seq_along(my_df$url)
my_df$status <- NA
for (i in my_df$id) {
  my_df$status[i] <- tryCatch({
    message("downloading ", i) # put a status message on screen
    article_i <- suppressMessages(pdftools::pdf_text(my_df$url[i]))
    saveRDS(article_i, file = paste0("article_", i, ".rds"))
    "OK"
  }, error = function(e) {return("FAILED")}) # return the string FAILED if something goes wrong
}
my_df$status
#> [1] "OK" "FAILED"
I included a broken link in the example data on purpose to showcase how this would look.
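Since the point of the status column is to find the broken links, a quick follow-up lists the rows that need manual checking:
# Show the id and url of every download that failed.
my_df[my_df$status == "FAILED", c("id", "url")]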
Alternatively, you can use a loop from the apply family. The difference is that instead of iterating through a vector and applying the same code until the end of the vector, *apply takes a function, applies it to each element of a list (or of an object that can be coerced to a list), and returns the results of all iterations in one go. Many people find *apply functions confusing at first because they usually define and apply the function in a single line. Let's make the function more explicit:
s_download_pdf <- function(link, id) {
  tryCatch({
    message("downloading ", id) # put a status message on screen
    article_i <- suppressMessages(pdftools::pdf_text(link))
    saveRDS(article_i, file = paste0("article_", id, ".rds"))
    "OK"
  }, error = function(e) {return("FAILED")})
}
Now that we have this function, let's use it to download all files. I'm using mapply, which iterates through two vectors at once, in this case the id and url columns:
my_df$status <- mapply(s_download_pdf, link = my_df$url, id = my_df$id)
my_df$status
#> [1] "OK" "FAILED"
I don't think it makes much of a difference which approach you choose, as the speed will be bottlenecked by your internet connection rather than by R. I just thought you might appreciate the comparison.

Inputting data frame values into a GET function web query

I'm trying to input a list of values from a data frame into my GET function for the web query and then cycle through each iteration as I go. If somebody could link me some further resources to read and learn from, it would be appreciated.
The following is the code which draws the data names from the API server. I plan on using purrr iteration functions to go over it. The input from the list would be inserted at the rfg_select placeholder.
library(httr)
library(purrr)
## Call up Query Development Script
## Calls up every single rainfall data gauge across the entirety of QLD
wmip_callup <- GET('https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_site_list","version":"1","params":{"site_list":"MERGE(GROUP(MGR_OFFICE_ALL,AYR),GROUP(MGR_OFFICE_ALL,BRISBANE),GROUP(MGR_OFFICE_ALL,BUNDABERG),GROUP(MGR_OFFICE_ALL,MACKAY),GROUP(MGR_OFFICE_ALL,MAREEBA),GROUP(MGR_OFFICE_ALL,ROCKHAMPTON),GROUP(MGR_OFFICE_ALL,SOUTH_JOHNSTONE),GROUP(MGR_OFFICE_ALL,TOOWOOMBA))"}}')
# Turns API server data into JSON data.
wmip_dataf <- content(wmip_callup, type = 'application/json')
# Returns the values of the rainfall gauge site names and is the directory function.
list_var <- wmip_dataf[["_return"]][["sites"]]
# Combines all of the rainfall gauge data together in a list (could be used for giving file names / looping the data).
rfg_bind <- do.call(rbind.data.frame, list_var)
# Sets the column name of the combination data frame.
rfg_bind <- setNames(rfg_bind, "Rainfall Gauge Name")
rfg_select <- rfg_bind$`Rainfall Gauge Name`
# Attempts to filter list into query:
wmip_input <- GET('https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":**rfg_select**,"datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}')
Hey there,
After some work I've found a solution using a concatenated string.
I set up a dummy variable that helped me select a data value.
# Dummy Variable string:
wmip_url <- 'https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{"function":"get_ts_traces","version":"1","params":{"site_list":"varinput","datasource":"AT","varfrom":"10","varto":"10","start_time":"0","end_time":"0","data_type":"mean","interval":"day","multiplier":"1"}}'
# Grabs one value from the list.
rfg_individual <- rfg_select[2]
# Replaces the specified input
rfg_replace <- gsub("varinput", rfg_individual, wmip_url)
# Result
"https://water-monitoring.information.qld.gov.au/cgi/webservice.pl?{\"function\":\"get_ts_traces\",\"version\":\"1\",\"params\":{\"site_list\":\"001203A\",\"datasource\":\"AT\",\"varfrom\":\"10\",\"varto\":\"10\",\"start_time\":\"0\",\"end_time\":\"0\",\"data_type\":\"mean\",\"interval\":\"day\",\"multiplier\":\"1\"}}"

R: Doing the same steps on many data frames with their names stored in a vector

I have several .RData files, each of which has letters and numbers in its name, eg. m22.RData. Each of these contains a single data.frame object, with the same name as the file, eg. m22.RData contains a data.frame object named "m22".
I can generate the file names easily enough with something like datanames <- paste0(c("m","n"),seq(1,100)) and then use load() on those, which will leave me with a few hundred data.frame objects named m1, m2, etc. What I am not sure of is how to do the next step -- prepare and merge each of these dataframes without having to type out all their names.
I can make a function that accepts a data frame as input and does all the processing. But if I pass it datanames[22] as input, I am passing it the string "m22", not the data frame object named m22.
My end goal is to repeatedly do the same steps on a bunch of different data frames without manually typing out "prepdata(m1) prepdata(m2) ... prepdata(n100)". I can think of two ways to do it, but I don't know how to implement either of them:
1. Get from a vector of the names of the data frames to a list containing the actual data frames.
2. Modify my "prepdata" function so that it can accept the name of the data frame but still operate on the data frame itself (possibly by way of assign()? But the last step of the function is to merge the prepared data into a bigger data frame, and I'm not sure there's an assign()-based method that can do that...)
Can anybody advise on how to implement either of the above methods, or another way to make this work?
See this answer and the corresponding R FAQ
Basically:
temp1 <- c(1,2,3)
save(temp1, file = "temp1.RData")
x <- c()
x[1] <- load("temp1.RData")
get(x[1])
#> [1] 1 2 3
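For the asker's first option (going from a vector of names to a list of the actual data frames), a minimal sketch, assuming the .RData files have already been load()-ed into the global environment:
# mget() fetches every object named in datanames and returns a named list.
df_list <- mget(datanames, envir = .GlobalEnv)
# The asker's prepdata() can then be applied to each data frame in turn.
results <- lapply(df_list, prepdata)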
Assuming all your data exists in the same folder, you can create an R object with all the paths. Then you can create a function that takes a path to an .RData file, reads it, and calls prepdata. Finally, using the purrr package, you can apply the same function over the input vector.
Something like this should work:
library(purrr)
rdata_paths <- list.files(path = "path/to/your/files", full.names = TRUE)
read_rdata <- function(path) {
  # load() returns the name of the loaded object, so fetch the object itself with get()
  data <- get(load(path))
  return(data)
}
prepdata <- function(data) {
  ### your prepdata implementation
}
master_function <- function(path) {
  data <- read_rdata(path)
  result <- prepdata(data)
  return(result)
}
merged_rdatas <- map_df(rdata_paths, master_function) # this creates one dataset, merging all results together

Looping through dataframe names in R and saving out corresponding dataframe as Rds file

I have about 30 separate dataframes loaded in my R session, each with various names. I also have a character vector called mydfs containing the names of those dataframes. I am trying to loop over mydfs and save each dataframe listed in it as an .Rds file, but for some reason I'm only able to save the character string of the dataframe's name (not the dataframe itself). Here is a simulated, reproducible example of what I have:
# Create a vector of dataframe names that exist in base R for a reproducible example
mydfs<-c("cars","iris","iris3","mtcars")
# My code creates files, but they don't contain my dataframe data for some reason
for (i in 1:length(mydfs)){
  savefile <- paste0(paste0("D:/Data/", mydfs[i]), ".Rds")
  saveRDS(mydfs[i], file=savefile)
  print(paste("Dataframe Saved:", mydfs[i]))
}
This results in the following log output:
[1] "Dataframe Saved: cars"
[1] "Dataframe Saved: iris"
[1] "Dataframe Saved: iris3"
[1] "Dataframe Saved: mtcars"
Then I try to read back in any of the files I created:
# But when read back in, the file contains only a single character string with the dataframe's name
a<-readRDS("D:/Data/iris3.Rds")
str(a)
chr "iris3"
Note that when I read iris3.Rds back into a new R session using readRDS, I don't get the dataframe I was expecting, but a single character vector containing the name of the dataframe rather than the data.
I haven't been programming in R for a while, since my current client prefers SAS, so I think I am confusing SAS macro-variable looping with R: when I call saveRDS, I'm passing in a single character vector instead of the actual dataframe. How can I get the dataframe passed into saveRDS instead of the character string?
Thanks for helping me untangle my SAS thinking with my somewhat rusty R thinking.
You're currently just saving the names of the dataframes. You can use the get function as follows:
mydfs <- c("cars","iris","iris3","mtcars")
for (i in 1:length(mydfs)){
  savefile <- paste0(paste0("D:/Data/", mydfs[i]), ".Rds")
  saveRDS(get(mydfs[i]), file=savefile)
  print(paste("Dataframe Saved:", mydfs[i]))
}
readRDS('D:/Data/iris3.Rds')
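Reading one of the files back in now shows the data itself rather than its name (a quick check using mtcars):
a <- readRDS("D:/Data/mtcars.Rds")
str(a) # 'data.frame': 32 obs. of 11 variables ...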

Dynamically initializing variables in R using functional programming

Hi, I'm writing a function to read a file and return a time series, which I then need to assign to a variable. I'm trying to do this in a terse way that utilises the functional programming features of R.
# reads a file and returns an xts time series (requires the xts and zoo packages)
readFile <- function(fileName, filePath){
  fullPath <- paste(filePath, fileName, sep='');
  f <- as.xts(read.zoo(fullPath, format='%d/%m/%Y',
                       FUN=as.Date, header=TRUE, sep='\t'));
  return(na.locf(f));
}
filePath <- 'C://data/'
# real list of files is a lot longer
fnames <- c('d1.csv', 'd2.csv','d3.csv');
varnames <- c('data1', 'data2', 'data3');
In the above piece of code I would like to initialise variables named data1, data2, and data3 by applying the readFile function to fnames and filePath (which is always constant).
Something like :
lapply( fnames, readFile, filePath);
The above doesn't work, of course, and neither does it do the dynamic variable assignment that I'm trying to achieve. Are there any R functional programming gurus out there who could guide me?
The working version of this would look something like :
data1 <- readFile('d1.csv', filePath);
data2 <- readFile('d2.csv', filePath);
YIKES
Constructing many variables with specified names is a somewhat common request on SO and can certainly be managed with the assign function, but you'll probably find it easier to handle your data if you build a list instead of multiple variables. For instance, you could read in all the results and obtain a named list with:
lst <- setNames(lapply(fnames, readFile, filePath), varnames)
Then you could access your results from each csv file with lst[["data1"]], lst[["data2"]], and lst[["data3"]].
The benefit of this approach is that you can now perform an operation over all your time series variables using lapply(lst, ...) instead of looping through all your variables or looping through variable names and using the get function.
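For example (a sketch, assuming each file parsed into an xts series as above):
# Apply the same operation to every series without naming each variable.
lapply(lst, summary)
# xts series can also be merged into one wide object in a single call.
merged <- do.call(merge, lst)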
