I have a set of targets, let's say data_a, data_b, ...
I want to iterate over all datasets and load the data.
This can be achieved using tar_read(data_a) or tar_read("data_a"). As I want to load the targets programmatically, I would like to use something like this inside some kind of lapply:
target_name <- "data_a"
data <- tar_read(target_name)
But then I get the error that the target target_name was not found.
I know this is related to R's non-standard evaluation (NSE), as tar_read internally calls substitute, but I wasn't able to figure out how to work around it so that tar_read accepts target_name. I have tried eval(parse()), the different options presented in Advanced R, and rlang tools such as !! and {{ }}, all to no avail.
Any idea how to achieve this?
If you look at the code for tar_read, you see that it uses NSE to convert the name parameter into a character string, then calls the function tar_read_raw on the resulting string:
tar_read
#> function (name, branches = NULL, meta = tar_meta(store = store),
#>     store = targets::tar_config_get("store"))
#> {
#> force(meta)
#> name <- tar_deparse_language(substitute(name))
#> tar_read_raw(name = name, branches = branches, meta = meta,
#> store = store)
#> }
However, you can also use tar_read_raw directly. The manual for tar_read_raw says:
Like tar_read() except name is a character string.
So you should just be able to do:
data <- tar_read_raw(target_name)
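To read several targets programmatically, you can pass a character vector of names to tar_read_raw. A minimal sketch, assuming targets named data_a and data_b exist in the current store:
library(targets)

# character vector of target names; adjust to match your pipeline
target_names <- c("data_a", "data_b")

# read each target by its character name and keep the results in a named list
all_data <- lapply(target_names, tar_read_raw)
names(all_data) <- target_names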
I'm trying to use a for loop to create a set of dynamic objects in R. These will contain a list of organisations and values against a certain metric--each output will be the values of an individual metric.
In practice, this will be used to create chart objects using ggplot2, which I'll then use in RMarkdown. For the example below, it's just a sample using a head() function for each metric.
I tried using the paste function to create this name, but it gives the following error:
Error in paste("organisation_short", "_", MetricIDs[x]) <- head(organisationdata_Jan2021) :
  target of assignment expands to non-language object
I understand that the assign function might help, but I'm not sure how to use it (my attempts also produced errors). I found a similar question, linked below, but it's set up in a way that pipes data directly into assign, and I'm also not clear what "value = ." is doing:
dynamically name objects in R
I believe the "value = ." refers to the data being piped into the assign function. I created an alternative version, shown in the code further down, but it produces this error:
Error in assign(x = organisationdata_Jan2021, value = paste0("sampledata", :
invalid first argument
The idea is to create output objects along the lines of: organisation_short_ABC123, organisation_short_ABC323, organisation_short_KJM088
I would be grateful for any guidance you might have!
MetricIDs <- c('ABC123','ABC323','KJM088')
# Attempt using paste
for (x in 1:3)
{
organisationdata_Jan2021 <- organisationdata_CM0040_Jan2021 %>% filter(Metric_ID==MetricIDs[x]) # Filter data to specific Metric ID
paste("organisation_short","_", MetricIDs[x]) <- head(organisationdata_Jan2021) # Goal: Create object that includes the Metric ID.
}
# Attempt using assign
for (x in 1:3)
{
organisationdata_Jan2021 <- organisationdata_CM0040_Jan2021 %>% filter(Metric_ID==MetricIDs[x]) # Filter data to specific Metric ID
assign(x=organisationdata_Jan2021, value=paste0("sampledata",MetricIDs[x]))
}
# Expected object names: organisation_short_ABC123, organisation_short_ABC323, organisation_short_KJM088
# This will be used to create chart objects using ggplot2, and those objects will be used in an R MarkDown document.
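For reference, a minimal sketch of the assign-based approach mentioned above: assign() takes the name as a character string and the object as the value. This reuses MetricIDs and organisationdata_CM0040_Jan2021 from the question and is only illustrative:
library(dplyr)

for (x in seq_along(MetricIDs)) {
  # filter the data to a specific Metric ID
  filtered <- organisationdata_CM0040_Jan2021 %>% filter(Metric_ID == MetricIDs[x])
  # the name goes in as a character string, the filtered data as the value
  assign(paste0("organisation_short_", MetricIDs[x]), head(filtered))
}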
I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
  dbGetQuery(con, paste0(
    "EXEC dbo.", proc_name, " ",
    "@URL = N'", url, "', ",
    "@Startdate = '", start_date, "', ",
    "@enddate = '", end_date, "' "
  ))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr to create a data frame of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use pull to create a vector of table names
table_list <- pull(db_tables , TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
write.csv(
tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/tables/{table}.csv")
)
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1","TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed with which direction to go. I researched the R equivalent of Python list comprehensions but have not been able to get a function in R to perform similarly.
db_table_head_to_csv <- function(table) {
  write.csv(
    tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
  )
}

bibliometrics_tables %>% db_table_head_to_csv()
#> Error in UseMethod("as.sql") :
#>   no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all the table data in a named list (the counterpart of a Python dictionary) using lapply (the counterpart of Python's list/dict comprehensions). If you use its sibling sapply with simplify = FALSE, the character vector you pass in becomes the names of the returned elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the anonymous function with more steps such as write.csv, or use a separately defined function. Just be sure to return the data frame on the last line. The version below uses the native pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
head_df <- dbReadTable(con, table) |> head()
write.csv(
head_df,
file.path(
"00_exports", "bibliometrics_tables", paste0(table, ".csv")
)
)
return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no data frame functionality when the objects are stored in a list, and you can extract them by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
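From there, the dplyr wrangling mentioned in the question can work on the list elements directly. A hypothetical sketch (the table and key names below are placeholders, not taken from the actual database):
library(dplyr)

# join two of the cached tables; replace the names and key with real ones
joined <- inner_join(
  df_list[["TableName1"]],
  df_list[["TableName2"]],
  by = "SomeKeyColumn"
)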
I have a data.frame (dim: 100 x 1) containing a list of URL links; each URL looks something like this: https:blah-blah-blah.com/item/123/index.do .
The list (a data.frame called my_list with 100 rows and a single character column named col, i.e. $ col: chr) looks like this:
1 "https:blah-blah-blah.com/item/123/index.do"
2 "https:blah-blah-blah.com/item/124/index.do"
3 "https:blah-blah-blah.com/item/125/index.do"
etc.
I am trying to import each of these URLs into R and save the results collectively in an object that is compatible with text mining procedures.
I know how to convert each of these URLs (from the list) manually:
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tm)
#1st document
url <- "https:blah-blah-blah.com/item/123/index.do"
article <- pdf_text(url)
Once this "article" object has been successfully created, I can inspect it:
str(article)
chr [1:13]
It looks like this:
[1] "abc ....."
[2] "def ..."
etc etc
[13] "ghi ..."
From here, I can successfully save this as an RDS file:
saveRDS(article, file = "article_1.rds")
Is there a way to do this for all 100 articles at the same time? Maybe with a loop?
Something like :
for (i in 1:100) {
url_i <- my_list[i,1]
article_i <- pdf_text(url_i)
saveRDS(article_i, file = "article_i.rds")
}
If this was written correctly, it would save each article as an RDS file (e.g. article_1.rds, article_2.rds, ... article_100.rds).
Would it then be possible to save all these articles into a single rds file?
Please note that list is not a good name for an object, as it will mask the built-in list() function. I think it is usually good to name your variables according to their content. Maybe url_df would be a good name.
library(pdftools)
#> Using poppler version 20.09.0
library(tidyverse)
url_df <-
data.frame(
url = c(
"https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/19-mh-8084-autismspecdisordr_152236.pdf",
"https://www.nimh.nih.gov/health/publications/my-mental-health-do-i-need-help/20-mh-8134-mymentalhealth-508_161032.pdf"
)
)
Since the URLs are already in a data.frame, we can store the text data in an additional column. That way the data will be easily available for later steps.
text_df <-
url_df %>%
mutate(text = map(url, pdf_text))
Instead of saving each text in a separate file we can now store all of the data
in a single file:
saveRDS(text_df, "text_df.rds")
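The text for each document can later be pulled straight back out of the list column, for example:
# each element of the text column is the character vector of pages for one PDF
text_df$text[[1]]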
For historical reasons, for loops are not very popular in the R community. Base R has the *apply() function family, which provides a functional approach to iteration. The tidyverse has the purrr package with its map*() functions, which improve upon the *apply() functions. I recommend taking a look at https://purrr.tidyverse.org/ to learn more.
It seems that certain URLs in your data are not valid PDF files. You can wrap the call in tryCatch to handle the errors. If your data frame is called df with a url column in it, you can do:
library(pdftools)
lapply(seq_along(df$url), function(x) {
  tryCatch({
    saveRDS(pdf_text(df$url[x]), file = sprintf('article_%d.rds', x))
  }, error = function(e) {})
})
Say you have a data.frame called my_df with a column that contains the URLs of the PDF locations. Judging by your comments, it seems that some URLs lead to broken PDFs. You can use tryCatch in these cases to report back which links were broken and check manually what's wrong with those links.
You can do this in a for loop like this:
my_df <- data.frame(url = c(
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # working pdf
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pfd" # broken pdf
))
# make some useful new columns
my_df$id <- seq_along(my_df$url)
my_df$status <- NA
for (i in my_df$id) {
my_df$status[i] <- tryCatch({
message("downloading ", i) # put a status message on screen
article_i <- suppressMessages(pdftools::pdf_text(my_df$url[i]))
saveRDS(article_i, file = paste0("article_", i, ".rds"))
"OK"
}, error = function(e) {return("FAILED")}) # return the string FAILED if something goes wrong
}
my_df$status
#> [1] "OK" "FAILED"
I included a broken link in the example data on purpose to showcase how this would look.
Alternatively, you can use a loop from the apply family. The difference is that instead of iterating through a vector and applying the same code until the end of the vector, *apply takes a function, applies it to each element of a list (or of objects that can be coerced to lists), and returns the results from all iterations in one go. Many people find *apply functions confusing at first because the function is usually defined and applied in a single line. Let's make the function more explicit:
s_download_pdf <- function(link, id) {
tryCatch({
message("downloading ", id) # put a status message on screen
article_i <- suppressMessages(pdftools::pdf_text(link))
saveRDS(article_i, file = paste0("article_", id, ".rds"))
"OK"
}, error = function(e) {return("FAILED")})
}
Now that we have this function, let's use it to download all files. I'm using mapply which iterates through two vectors at once, in this case the id and url columns:
my_df$status <- mapply(s_download_pdf, link = my_df$url, id = my_df$id)
my_df$status
#> [1] "OK" "FAILED"
I don't think it makes much of a difference which approach you choose as the speed will be bottlenecked by your internet connection instead of R. Just thought you might appreciate the comparison.
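To also address the follow-up about a single .rds file: once the individual files exist, they can be read back into a named list and saved together. A minimal sketch, assuming the loop or mapply call above has already run:
ok_ids <- my_df$id[my_df$status == "OK"]
articles <- lapply(ok_ids, function(i) readRDS(paste0("article_", i, ".rds")))
names(articles) <- paste0("article_", ok_ids)
saveRDS(articles, "all_articles.rds")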
I've created a variable and want to use that variable to select my desired function in a package (e.g. package::function); however, the variable name is interpreted literally instead of being evaluated.
Here's the approach:
library(GSEABase)
library(tidyverse)
### SET ONTOLOGY GROUP (e.g. Biological Process = BP, Molecular Function = MF, Cellular Component = CC)
ontology <- "BP"
### Set GOOFFSPRING database, based on ontology group set above
go_offspring <- paste("GO", ontology, "OFFSPRING", sep = "")
## Need to know the 'offspring' of each term in the ontology, and this is given by the data in:
GO.db::go_offspring
## Create function to parse out GO terms assigned to each GOslim
## Courtesy Bioconductor Support: https://support.bioconductor.org/p/128407/
mappedIds <-
function(df, collection, OFFSPRING)
{
map <- as.list(OFFSPRING[rownames(df)])
mapped <- lapply(map, intersect, ids(collection))
df[["go_terms"]] <- vapply(unname(mapped), paste, collapse = ";", character(1L))
df
}
## Run the function
slimsdf <- mappedIds(slimsdf, myCollection, go_offspring)
This spits out the error:
Error: 'go_offspring' is not an exported object from 'namespace:GO.db'
When playing around in the RStudio console, I also notice that when I type
GO.db::
the autocomplete feature does not list my go_offspring variable as an option; it only lists the available functions within the GO.db package.
This behavior suggests a scoping issue, in that the package namespace cannot see variables defined outside of the package.
Is there any way around this?
I've looked at this http://adv-r.had.co.nz/Environments.html, but I'm not entirely sure I follow all of it, nor do I see how to manipulate my environments to allow passing go_offspring to GO.db::.
You can use getFromNamespace to get the function via its character name from the namespace.
slimsdf <- mappedIds(slimsdf, myCollection, getFromNamespace(go_offspring, "GO.db"))
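The same pattern can be tried out with a base package before applying it to GO.db, for example:
# look up an object in a package namespace by its character name
fun_name <- "head"
f <- getFromNamespace(fun_name, "utils")
f(iris)  # f is now utils::head, so this returns the first six rows of iris
Using get(go_offspring, envir = asNamespace("GO.db")) should work as well.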
Looking into adding data to a table with dplyr, I saw https://stackoverflow.com/a/26784801/1653571, but the documentation says db_insert_into() is deprecated.
?db_insert_into()
...
db_create_table() and db_insert_into() have been deprecated in favour of db_write_table().
...
I tried to use the non-deprecated db_write_table() instead, but it fails both with and without the append= option:
require(dplyr)
my_db <- src_sqlite( "my_db.sqlite3", create = TRUE) # create src
copy_to( my_db, iris, "my_table", temporary = FALSE) # create table
newdf = iris # create new data
db_write_table( con = my_db$con, table = "my_table", values = newdf) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
db_write_table( con = my_db$con, table = "my_table", values = newdf, append = TRUE) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
Should one be able to append data with db_write_table()?
See also https://github.com/tidyverse/dplyr/issues/3120
No, you shouldn't use db_write_table() instead of db_insert_into(), since it can't be generalized across backends.
And you shouldn't use the tidyverse versions rather than the corresponding DBI:: functions, since the tidyverse helper functions are for internal use and are not designed to be robust enough for end users. See the discussion at https://github.com/tidyverse/dplyr/issues/3120#issuecomment-339034612 :
Actually, I don't think you should be using these functions at all. Despite that SO post, these are not user facing functions. You should be calling the DBI functions directly.
-- Hadley Wickham, package author.
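For appending rows, calling DBI directly would look roughly like this (a minimal sketch reusing the connection and data from the question):
library(DBI)

# append the new rows to the existing table via DBI
dbWriteTable(my_db$con, "my_table", newdf, append = TRUE)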