I'm regularly connected to our database, whose data changes continuously. Is there a way to run a specific set of lines of code at a regular time interval? For example:
list_table <- as_tibble(mydatabaseconnection %>%
  tbl("mytable"))
View(list_table, title = "list_table")
I run this manually every now and then, but it would be nice to have the table on one of my screens and have it update periodically on its own. So far, all I've found are tools that automatically run an entire script and only let you view the output as images or Excel files. Is there any way to do what I outlined above within RStudio?
You could use something like Sys.sleep() inside a for loop, e.g.:
for (i in 1:24) {
  print(i)
  Sys.sleep(5)
}
In this case, it would look like:
for (i in 1:24) {
  list_table <- as_tibble(mydatabaseconnection %>%
    tbl("mytable"))
  View(list_table, title = "list_table")
  Sys.sleep(5)
}
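If you want the refresh to keep going until you interrupt it, rather than for a fixed number of iterations, a repeat loop works the same way (a sketch reusing the placeholder connection and table names from the question):
repeat {
  list_table <- as_tibble(mydatabaseconnection %>%
    tbl("mytable"))
  View(list_table, title = "list_table")
  Sys.sleep(60)   # wait 60 seconds between refreshes; press Esc/Ctrl+C to stop
}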
Thanks in advance for any feedback.
As part of my dissertation I'm trying to scrape data from the web (I've been working on this for months). I have a couple of issues:
- Each document I want to scrape has a document number. However, the numbers don't always go up in order. For example, one document number is 2022, but the next one is not necessarily 2023; it could be 2038, 2040, etc. I don't want to go through by hand to get each document number. I have tried to wrap download.file in purrr::safely(), but once it hits a document that does not exist it stops.
- Second, I'm still fairly new to R and am having a hard time setting up destfile for multiple documents. Indexing the path where the downloaded data should be stored ends up with the first document stored in the named place and the next document as NA.
Here's the code I've been working on:
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333)
for (i in 1:length(document.numbers)) {
  temp.doc.name <- paste0(base.url,
                          document.name.1,
                          document.numbers[i],
                          document.extension)
  print(temp.doc.name)

  # download and save data
  safely <- purrr::safely(download.file(temp.doc.name,
                                        destfile = "/Users/...[i]"))
}
Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.
Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the most recent 25 documents. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am, however, open to more efficient ways to scrape these written questions.
Again, thanks for any feedback!
Kari
After quickly checking out the site, I agree that I can't see any easier way to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL, see whether it returns a "good" status (usually 200), and skip the download when it returns a "bad" status (like 404). The following code block does that.
Note that purrr::safely() doesn't run a function; it creates another, "safe" function which you then call. The created function returns a list with two slots: result and error.
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
document.numbers <- c(2330:2333, 2552, 2321)

# safely() wraps a function; the wrapped version returns list(result, error)
# instead of throwing, so one bad document number won't stop the loop
sHEAD <- purrr::safely(httr::HEAD)
sdownload <- purrr::safely(download.file)

for (i in seq_along(document.numbers)) {
  file_name <- paste0(document.name.1, document.numbers[i], document.extension)
  temp.doc.name <- paste0(base.url, file_name)
  print(temp.doc.name)

  # poll the URL once and only download when the status is in the 2xx range
  head_result <- sHEAD(temp.doc.name)$result
  print(head_result$status_code)   # NULL if the HEAD request itself failed
  if (!is.null(head_result) && head_result$status_code %in% 200:299) {
    sdownload(temp.doc.name, destfile = file_name, mode = "wb")
  }
}
It might not be as simple as every valid URL returning a 200 status; in general, anything in the 200:299 range should be fine, which is why the code above checks the whole range.
I used parts of this answer in my answer.
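As a small aside on what those safely() wrappers return (my own toy example, not taken from the answers above): the wrapped function never throws an error; it always returns a two-element list, and you check which slot is filled.
library(purrr)

safe_log <- safely(log)

ok  <- safe_log(10)    # ok$result is 2.302585..., ok$error is NULL
bad <- safe_log("a")   # bad$result is NULL, bad$error holds the condition object

# typical pattern: act on whichever slot is non-NULL
if (is.null(bad$error)) bad$result else conditionMessage(bad$error)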
If the file does not exist, tryCatch simply skips it:
library(tidyverse)

get_data <- function(index) {
  url <- paste0(
    "https://www.europarl.europa.eu/doceo/document/",
    "P-9-2022-00",
    index,
    "_EN.docx"
  )
  # if the document does not exist, download.file errors and the handler runs
  tryCatch(
    download.file(url,
                  destfile = paste0(index, ".docx"),
                  mode = "wb",
                  quiet = TRUE),
    error = function(e) print(paste(index, "does not exist - SKIPPED"))
  )
}

map(2000:5000, get_data)
I am new to R and trying to write some code that will loop over all the files in a folder and pull in all the data associated with a certain tab. However, this tab may not be present in all of the files stored in this folder. To troubleshoot this, I am using a tryCatch function, but am still running into issues.
What else do I need to do so that the loop simply skips a file when the tab is not present, instead of trying to load it?
This is what I have tried:
for (i in 1:nrow(filesinfolderfull_list)) {
  print(filesinfolder_list[i])
  i_ddolv_temp <- tryCatch(
    { read_excel(filesinfolderfull_list$datafiles[i],
                 sheet = "Display-OLV Reporting",
                 col_names = TRUE, skip = 4) },
    error = function(e) { print("skip") }
  )
  templateDDOLV_df <- bind_rows(templateDDOLV_df, i_ddolv_temp)
}
I gather from your explanation that some of the Excel files do not have the "Display-OLV Reporting" sheet. Why not check first with excel_sheets() and only read the sheet when it is present? Something like:
for (i in 1:nrow(filesinfolderfull_list)) {
  print(filesinfolder_list[i])
  if ("Display-OLV Reporting" %in% excel_sheets(filesinfolderfull_list$datafiles[i])) {
    i_ddolv_temp <- read_excel(filesinfolderfull_list$datafiles[i],
                               sheet = "Display-OLV Reporting",
                               col_names = TRUE, skip = 4)
    templateDDOLV_df <- bind_rows(templateDDOLV_df, i_ddolv_temp)
  }
}
You could also store the file path and the sheet listing in variables so you don't repeat those lookups for every file.
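For example, a variant of the loop above with those local variables (a sketch, assuming the same filesinfolderfull_list and templateDDOLV_df objects as in the question):
library(readxl)
library(dplyr)

for (i in 1:nrow(filesinfolderfull_list)) {
  path   <- filesinfolderfull_list$datafiles[i]  # look the path up once
  sheets <- excel_sheets(path)                   # list the sheet names once
  if ("Display-OLV Reporting" %in% sheets) {
    i_ddolv_temp <- read_excel(path, sheet = "Display-OLV Reporting",
                               col_names = TRUE, skip = 4)
    templateDDOLV_df <- bind_rows(templateDDOLV_df, i_ddolv_temp)
  }
}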
Is there any way, when using the str() function in R, to jump to the beginning of the output rather than landing at the end after the function has run?
Basically I'd like a faster way to get to the beginning of the output than scrolling manually back up through it. This would be especially useful when looking at the structure of larger objects like spatial data.
A variation of the answer linked to by Dason (https://stackoverflow.com/a/3837885/1017276) would be to redirect the output to a browser.
view_str <- function(x) {
  tmp <- tempfile(fileext = ".html")
  x <- capture.output(str(x))
  write(paste0(x, collapse = "<br/>"), file = tmp)

  viewer <- getOption("viewer")
  if (!is.null(viewer)) {
    # If in RStudio, use RStudio's browser
    viewer(tmp)
  } else {
    # Otherwise use the system's default browser
    utils::browseURL(tmp)
  }
}
view_str(mtcars)
How can I find out WHERE the error occurs?
I've got a double loop like this:
companies <- # vector with all companies in my data.frame
dates <- # vector with all event dates in my data.frame

for (i in 1:length(companies)) {
  events_k <- # some code that gives me events of one company at a time
  for{j in 1:nrow(events_k)) {
    # some code that gives me one event date at a time
    results <- # some code that calculates stuff for that event date
  }
  mylist[[i]] <- results # store the results in a list
}
In this code I got an error (it was something like error in max(i)...).
The inner loop works perfectly. So by leaving out the outer loop and manually entering the company IDs until the error appeared, I found out which company had something wrong: my data.frame had letters in the vector of daily returns for that specific company.
For next time: is there a way in R to find out WHERE (or here, FOR WHICH COMPANY) the error appears? It could save a lot of time!
What I like to use is:
options(error = recover)
You only need to run it once at the beginning of your session (or add it to your .Rprofile file).
After that, each time an error is thrown you will be shown the stack of function calls that led to the error. You can select any of these calls and it will be as if you had run that command in browser() mode: you will be able to look at the variables in the calling environment and walk through the code.
More info and examples at ?recover.
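For instance (my own toy example, not from the question), with the option set, an error deep inside a call chain drops you into an interactive frame selector instead of just printing the message:
options(error = recover)

daily_return <- function(x) log(x[-1] / x[-length(x)])     # hypothetical helper
summarise_company <- function(prices) max(daily_return(prices))

summarise_company(c("100", "101", "99"))   # character prices trigger an error
# R lists the call stack (summarise_company -> daily_return) and prompts
# "Enter a frame number, or 0 to exit"; pick a frame to inspect its variables.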
It is difficult to know without explicit code that we can run, but my guess is that changing your code to
for (i in companies) {
  for (j in dates) {
or alternatively
for (i in 1:length(companies)) {
  for (j in 1:length(dates)) {
may solve the issue. Note the ( in the second loop. If not, it might be a good idea to edit your example to have some code / data that produces the same error.
To figure out where it occurs, you can always add print(i) or something like that at a suitable place in the code.
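Another option, since the real loop body isn't shown, is to wrap it in tryCatch() and report the offending company yourself. A rough sketch, with a hypothetical compute_results() standing in for your inner loop:
for (i in seq_along(companies)) {
  mylist[[i]] <- tryCatch(
    compute_results(companies[i]),   # hypothetical stand-in for your inner loop
    error = function(e) {
      message("Error for company ", companies[i], ": ", conditionMessage(e))
      NULL                           # store NULL so the loop keeps going
    }
  )
}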
I do a lot of data exploration in R and I would like to keep every plot I generate (from the interactive R console). I am thinking of a directory where everything I plot is automatically saved as a time-stamped PDF. I also do not want this to interfere with the normal display of plots.
Is there something that I can add to my ~/.Rprofile that will do this?
The general idea is to write a script that generates the plot, so that you can regenerate it at any time. The ESS documentation (in a README) puts it well under 'Philosophies for using ESS':
The source code is real. The objects are realizations of the
source code. Source for EVERY user modified object is placed in a
particular directory or directories, for later editing and
retrieval.
With any editor that allows stepwise (or region-wise) execution of commands, you can keep track of your work this way.
The best approach is to use a script file (or a Sweave or knitr file) so that you can recreate all the graphs whenever you need them (into a PDF file or another format).
But here is the start of an approach that does the basics of what you asked:
savegraphs <- local({
  i <- 1
  function() {
    if (dev.cur() > 1) {
      filename <- sprintf('graphs/SavedPlot%03d.pdf', i)
      dev.copy2pdf(file = filename)
      i <<- i + 1
    }
  }
})
setHook('before.plot.new', savegraphs )
setHook('before.grid.newpage', savegraphs )
Now, just before you create a new graph, the current one will be saved into the graphs folder of the current working directory (make sure that it exists). This means that if you add to a plot (lines, points, abline, etc.), the annotations will be included. However, you will need to run plot.new() for the last plot to be saved (and if you close the current graphics device without running another plot.new(), that last plot will not be saved).
This version will overwrite plots saved from a previous R session in the same working directory. It will also fail if you use something other than base or grid graphics (and maybe even with some complicated plots). I would not be surprised if some extra plots show up on occasion (when a plot is created internally to get some parameters and then immediately replaced with the one of interest). There are probably other things that I have overlooked as well, but this might get you started.
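If the overwriting between sessions is a problem, one small variation (an untested sketch) is to bake a session time stamp into the file name so that each session writes its own set of files:
savegraphs <- local({
  i <- 1
  stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")   # fixed when the hook is created
  function() {
    if (dev.cur() > 1) {
      filename <- sprintf('graphs/SavedPlot_%s_%03d.pdf', stamp, i)
      dev.copy2pdf(file = filename)
      i <<- i + 1
    }
  }
})

setHook('before.plot.new', savegraphs)      # use action = "replace" here if the
setHook('before.grid.newpage', savegraphs)  # earlier version is already registered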
You could write your own wrapper functions for your commonly used plot functions. The wrapper would produce both the on-screen display and a time-stamped PDF version. You could source() it from your ~/.Rprofile so that it's available every time you run R.
For lattice's xyplot(), using the windows device for the on-screen display:
library(lattice)

my.xyplot <- function(...) {
  dir.create(file.path("~", "RPlots"), showWarnings = FALSE)
  my.chart <- xyplot(...)

  # on-screen copy
  trellis.device(device = "windows", height = 8, width = 8)
  print(my.chart)

  # time-stamped PDF copy
  trellis.device(device = "pdf",
                 file = file.path("~", "RPlots",
                                  paste("xyplot",
                                        format(Sys.time(), "_%Y%m%d_%H-%M-%S"),
                                        ".pdf", sep = "")),
                 paper = "letter", width = 8, height = 8)
  print(my.chart)
  dev.off()
}
my.data <- data.frame(x=-100:100)
my.data$y <- my.data$x^2
my.xyplot(y~x,data=my.data)
As others have said, you should probably get in the habit of working from an R script, rather than working exclusively from the interactive terminal. If you save your scripts, everything is reproducible and modifiable in the future. Nonetheless, a "log of plots" is an interesting idea.