Error when trying to read_html from a website with delayed loading - r

I'm trying to scrape the public consultations from the European Commission's website with R for my master thesis. However, the loading function in the first second when opening the website seems to prevent this.
By running
test <- read_html("https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives")
I get the following error:
Fehler in readBin(3L, "raw", 65536L) :
Failure when receiving data from the peer
("Fehler in" translates to "error in")
I couldn't find a solution yet and also struggle to identify the specific problem because I would actually expect the function the return the HTML of the loading screen that appears in the first second.
Does anyone have an idea how to get beyond the loading part with R?

Related

R tryCatch RODBC function issue

We have a number of MS Access databases on a server which are copies from remote locations which are updated overnight. We collate some of the data from these machines for reporting purposes on a daily basis. Sometimes the overnight update fails, meaning we don’t have access to all of the databases, so I am attempting to write an R script which will test if we can connect (using a list of the database paths), and output an updated version of the list including only those which we can connect to. This will then be used to run a further script which will only update the data related to the available databases.
This is what I have so far (I am new to R but reasonably proficient in SAS and SQL – attempting to use R both as a learning exercise and for potential cost savings);
{
# Create Store data locations listing
A=matrix(c(1000,1,"One","//Server/Comms1/Access.mdb"
,2000,2,"Two","//Server/Comms2/Access.mdb"
,3000,3,"Three","//Server/Comms3/Access.mdb"
)
,nrow=3,ncol=4,byrow=TRUE)
# Add column names
colnames(A)<-c("Ref1","Ref2","Ref3","Location")
#Create summary for testing connections (Ref1 and Location)
B<-A[,c(1,4)]
ConnectionTest<-function(Ref1,Location)
{
out<-tryCatch({ch<-odbcDriverConnect(paste("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=",Location))
sqlQuery(ch,paste("select ",Ref1," as Ref1,COUNT(variable) as Count from table"))}
,error=matrix(c(Ref1,0),nrow=1,ncol=2,byrow=TRUE)
)
return(out)
}
#Run function, using 'B' to provide arguments
C<-apply(B,1,function(x)do.call(ConnectionTest,as.list(x)))
#Convert to matrix and add column names
D<-matrix(unlist(C),ncol=2,byrow=T)
colnames(D)<-c("Ref1","Count")
}
When I run the script I get the following error message;
Error in value[3L] : attempt to apply non-function
I am guessing this is because I am using TryCatch incorrectly inside the UDF?
Does anyone have any advice on what I am doing incorrectly, or even if this is the best way to do what I am attempting?
Thanks
(apologies if this is formatted incorrectly, having to post on my phone due to Stackoverflow posting being blocked)
Edit - I think I fixed the 'Error in value[3L]' issue by adding function(e) {} around the matrix function in the error part of the tryCatch.
The issue now is that the script just fails if it can't reach one of the databases, rather than doing the matrix function. Do I need to add something else to make it ignore the error?
Edit 2 - it seems tryCatch does now work - it processes the
alternate function upon error but also shows warnings about the error, which makes sense.
As mentioned in the edit above, using 'function(e) {}' to wrap the Matrix function in the error section of the tryCatch fixed the 'Error in value[3L]' issue, so the script now works, but displays error messages if it can't access a particular channel. I am guessing the 'warning' section of the tryCatch can be used to adjust these as necessary.

Error in curl: Curl Fetch Memory - RStudio Curl Memory - Failed Writing Body

For Context, I have been using RStudio to pull data from our reporting platform (Domo) then I manipulate the tables in R, and send the df back into Domo at the end. The start of the code worked perfectly fine for a few months but is now returning a "Error in Curl" message. Here is the exact code/error:
library (DomoR)
library (dplyr)
DomoR::init('Company-Name', 'TOKEN-NUMBER')
Total_Vendor_Revenue <- DomoR::fetch('DATA-FILE-NUMBER')
Error in curl::curl_fetch_memory(url, handle = handle) :
Failed writing body (0 != 16384)
Do I need to clear a certain memory within my computer? Is there a way to clear it after the code completely runs? Any information helps- thanks!
I think this means the table you are trying to fetch is too big. You can probably filter the table down in size in an ETL then pull the ETL's data set into R?

Reading in XML data from URL using parLapply R

Beginner here, so please alert me to any formatting errors, or general etiquette mistakes.
I am attempting to scrape data from ~800 websites, to increase speed I was looking into running in parallel.
I have simplified my code to produce the error:
url=c("https://apps.hydroshare.org/apps/nwm-forecasts/api/GetWaterML/?config=long_range&geom=channel_rt&variable=streamflow&COMID=6251152&startDate=2018-03-12&lag=t18z&member=4",
"https://apps.hydroshare.org/apps/nwm-forecasts/api/GetWaterML/?config=long_range&geom=channel_rt&variable=streamflow&COMID=6244518&startDate=2018-03-12&lag=t18z&member=4"
cores.number=2
cluster1=makeCluster(cores.number)
clusterExport(cluster1,"url")
clusterEvalQ(cluster1,library("xml2"))
clusterEvalQ(cluster1,url)
temp=parLapply(cluster1,
url,
function(x)
read_xml(x,fill=TRUE,row.names=NULL))
stopCluster(cluster1)
I get the following error:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: HTTP error 500.
When doing the same functions with lapply instead of parLapply I have no issues and when I adjust the cores.number to 1 I have no issues.
Thanks

quantmod - getQuote() - '403 Forbidden'

I found an answer to my own question (see below). Still need help.
In the same package, quantmod, there is an option called getSymbol.google.
Nevertheless,
If I use it to get Microsoft value, for example, it works all right
getSymbols.google('MSFT', environment() , src="google", from = (Sys.Date() - 1))
[1] "MSFT"
But, I can´t make it work on a currency pair;
getSymbols.google("GBPUSD", environment() , src="google", from = (Sys.Date() - 1))
Error in download.file(paste(google.URL, "q=", Symbols.name, "&startdate=", :
cannot open URL 'http://finance.google.com/finance/historical?q=GBPUSD&startdate=Nov+02,+2017&enddate=Nov+03,+2017&output=csv'
In addition: Warning message:
In download.file(paste(google.URL, "q=", Symbols.name, "&startdate=", :
cannot open URL 'http://finance.google.com/finance/historical?q=GBPUSD&startdate=Nov+02,+2017&enddate=Nov+03,+2017&output=csv': HTTP status was '400 Bad Request'
Any ideas?
Good morning,
Since the 1ts of November i´m having trouble with the function getQuote from Yahoo. Is a function inside the package "quantmod", which uses yahoo API to request the information.
The description of the function is as follows; Fetch current stock quote(s) from specified source. At present this only handles sourcing quotes from Yahoo Finance, but it will be extended to additional sources over time.
In r, i´m getting the following error; "HTTP status was '403 Forbidden'"
I´ve look on my browser and the error comes from the following error in Yahoo web page "Fetch current stock quote(s) from specified source. At present this only handles sourcing quotes from Yahoo Finance, but it will be extended to additional sources over time."
Does anybody know how to solve ir, or, any alternatives to the function getQuote()
Here is an example from RStudio
getQuote("AAPL")
Error in download.file(paste("https://finance.yahoo.com/d/quotes.csv?s=", :
cannot open URL 'https://finance.yahoo.com/d/quotes.csv?s=AAPL&f=d1t1l1c1p2ohgv'
In addition: Warning message:
In download.file(paste("https://finance.yahoo.com/d/quotes.csv?s=", :
cannot open URL 'https://finance.yahoo.com/d/quotes.csv?s=AAPL&f=d1t1l1c1p2ohgv': HTTP status was '403 Forbidden'
Thanks
seems that yahoo has discontinued this service. Anyone aware of a alternative for yahoo (I'd rather not have to webscrape yahoo for this)
rob
I ran into the same problem... it's kludgey but as a workaround to get the end-of-day value, I have found this to work for now:
Instead of getQuote() to get the Last price (which doesn't seem to work from Yahoo anymore):
underlying<-"AAPL"
quote.last <-getQuote(underlying)$Last
I use "getSymbols" which still works-- throws it into a new data frame, and I pull out the value I want from that:
Hx<-getSymbols(underlying,from=Sys.Date()-1) # allows me to not have to retain the ticker name if I do this across many tickers
quote.last<-as.double(tail(Cl(get(Hx)),1)) # Closing price value from last row of data
rm(list=Hx) # throw away the temporary data frame with quote history
I'm sure the's a more elegant way to do it, but this is what fell out of my brain as a quick workaround that got it done... sadly that doesn't get things like the Bid and Ask that getQuote does.

How do I close unused connections after read_html in R

I am quite new to R and am trying to access some information on the internet, but am having problems with connections that don't seem to be closing. I would really appreciate it if someone here could give me some advice...
Originally I wanted to use the WebChem package, which theoretically delivers everything I want, but when some of the output data is missing from the webpage, WebChem doesn't return any data from that page. To get around this, I have taken most of the code from the package but altered it slightly to fit my needs. This worked fine, for about the first 150 usages, but now, although I have changed nothing, when I use the command read_html, I get the warning message " closing unused connection 4 (http:....." Although this is only a warning message, read_html doesn't return anything after this warning is generated.
I have written a simplified code, given below. This has the same problem
Closing R completely (or even rebooting my PC) doesn't seem to make a difference - the warning message now appears the second time I use the code. I can run the querys one at a time, outside of the loop with no problems, but as soon as I try to use the loop, the error occurs again on the 2nd iteration.
I have tried to vectorise the code, and again it returned the same error message.
I tried showConnections(all=TRUE), but only got connections 0-2 for stdin, stdout, stderr.
I have tried searching for ways to close the html connection, but I can't define the url as a con, and close(qurl) and close(ttt) also don't work. (Return errors of no applicable method for 'close' applied to an object of class "character and no applicable method for 'close' applied to an object of class "c('xml_document', 'xml_node')", repectively)
Does anybody know a way to close these connections so that they don't break my routine? Any suggestions would be very welcome. Thanks!
PS: I am using R version 3.3.0 with RStudio Version 0.99.902.
CasNrs <- c("630-08-0","463-49-0","194-59-2","86-74-8","148-79-8")
tit = character()
for (i in 1:length(CasNrs)){
CurrCasNr <- as.character(CasNrs[i])
baseurl <- 'http://chem.sis.nlm.nih.gov/chemidplus/rn/'
qurl <- paste0(baseurl, CurrCasNr, '?DT_START_ROW=0&DT_ROWS_PER_PAGE=50')
ttt <- try(read_html(qurl), silent = TRUE)
tit[i] <- xml_text(xml_find_all(ttt, "//head/title"))
}
After researching the topic I came up with the following solution:
url <- "https://website_example.com"
url = url(url, "rb")
html <- read_html(url)
close(url)
# + Whatever you wanna do with the html since it's already saved!
I haven't found a good answer for this problem. The best work-around that I came up with is to include the function below, with Secs = 3 or 4. I still don't know why the problem occurs or how to stop it without building in a large delay.
CatchupPause <- function(Secs){
Sys.sleep(Secs) #pause to let connection work
closeAllConnections()
gc()
}
I found this post as I was running into the same problems when I tried to scrape multiple datasets in the same script. The script would get progressively slower and I feel it was due to the connections. Here is a simple loop that closes out all of the connections.
for (i in seq_along(df$URLs)){function(i)
closeAllConnections(i)
}

Resources