How to speed up read_html runtime in R?

I have a character vector of 400 URLs called URLs.
I have a loop that worked for a while but now takes far too long. It used to report a failing URL as an error, which I would then omit, but now it gets hung up.
dput(URLs)
c("http://www.chinadaily.com.cn/a/202102/04/WS601b5bd7a31024ad0baa736d.html",
"http://www.xinhuanet.com/english/2021-02/02/c_139716479.htm",
"http://www.china.org.cn/world/Off_the_Wire/2021-02/02/content_77181645.htm",
"http://english.sina.com/world/af/2021-02-02/detail-ikftssap2511288.shtml",
"https://www.beijingnews.net/news/267750643/fox-takes-clubhouse-lead-as-johnson-makes-move-in-saudi-arabia",
"https://www.beijingnews.net/news/267768819/johnson-excited-for-season-after-second-saudi-title",
"https://en.wtcf.org.cn/GlobalNews/2021020320227.html", "https://www.ladepeche.fr/2021/02/08/golf-un-top-4-royal-pour-victor-perez-9360378.php",
"https://sport24.lefigaro.fr/golf/tour-europeen/actualites/victor-perez-dans-les-pas-de-dustin-johnson-en-arabie-saoudite-1032163",
"https://sport24.lefigaro.fr/golf/tour-europeen/actualites/european-tour-victor-perez-a-longtemps-tenu-tete-a-dustin-johnson-en-arabie-saoudite-1032273",
"https://www.france24.com/en/live-news/20210206-johnson-seizes-two-shot-lead-in-saudi-international",
"https://www.france24.com/en/live-news/20210205-fox-takes-clubhouse-lead-as-johnson-makes-move-in-saudi-arabia",
"https://www.france24.com/en/live-news/20210203-big-hitting-dechambeau-happy-to-take-longer-clubs-out-of-rivals-hands",
"https://www.france24.com/en/live-news/20210203-as-bubble-life-drags-on-psychologists-say-cricketers-need-more-support",
"https://www.sports.fr/golf/circuit-europeen/golf-perez-gratin-arabie-saoudite-426859.html",
"https://www.sport.fr/golf/lopen-de-france-est-sauve-758291.shtm",
"https://www.ffgolf.org/Actus/Pro/European-Tour/Saudi-International-ET-Perez-n-est-pas-passe-loin",
"https://www.ffgolf.org/Actus/Pro/European-Tour/Saudi-International-ET-Perez-a-rendez-vous-avec-DJ-dimanche",
"https://www.ffgolf.org/Actus/Pro/European-Tour/Saudi-International-ET-Rozner-au-sec-a-6-Perez-a-7",
"https://www.ffgolf.org/Actus/Pro/European-Tour/Saudi-International-ET-Rozner-et-Perez-demarrent-bien",
"https://www.ffgolf.org/Actus/Pro/LPGA-Tour/Franck-Riboud-On-va-pouvoir-continuer-a-travailler-sereinement",
"https://www.ffgolf.org/Actus/Pro/Feuilletons/Paroles-de-coach/Paroles-de-coach-6-Gwladys-Nocera",
"https://franceracing.fr/other/porsche-et-tag-heuer-scellent-un-partenariat-strategique/",
"https://www.rfi.fr/en/sports/20210206-johnson-seizes-two-shot-lead-in-saudi-international",
"https://www.rfi.fr/en/sports/20210205-fox-takes-clubhouse-lead-as-johnson-makes-move-in-saudi-arabia",
"https://www.rfi.fr/en/sports/20210203-big-hitting-dechambeau-happy-to-take-longer-clubs-out-of-rivals-hands",
"https://www.rfi.fr/en/sports/20210203-as-bubble-life-drags-on-psychologists-say-cricketers-need-more-support",
"https://www.jeudegolf.org/EasyBlog/Agathe-sauzon.html", "http://topactu.net/2021/02/viktor-hovland-vaults-into-farmers-lead-at-wet-torrey-pines/",
"https://www.sueddeutsche.de/sport/golf-kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt-dpa.urn-newsml-dpa-com-20090101-210207-99-337940",
"https://www1.wdr.de/sport/golf-martin-kaymer-saudi-arabien-100.html",
"https://www.augsburger-allgemeine.de/sport/sonstige-sportarten/Kaymer-18-bei-Golf-Turnier-in-Saudi-Arabien-Johnson-siegt-id59059886.html",
"https://www.schwaebische.de/sport/ueberregionaler-sport_artikel,-kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt-_arid,11325827.html",
"https://www.sport.de/news/ne4341625/golf--kaymer-beendet-turnier-in-saudi-arabien-als-18/",
"https://www.mz-web.de/sport/golf/kaymer-18--bei-golf-turnier-in-saudi-arabien---johnson-siegt-38027428",
"https://www.nwzonline.de/sport-meldungen/european-tour-kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_a_50,12,475833623.html",
"https://www.volksstimme.de/golf/news/kaymer-18.-bei-golf-turnier-in-saudi-arabien---johnson-siegt/1612702615000",
"https://www.wn.de/Sport/Weltsport/Golf/4360897-European-Tour-Kaymer-18.-bei-Golf-Turnier-in-Saudi-Arabien-Johnson-siegt",
"https://www.mainpost.de/sport/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt-art-10562664",
"https://www.moz.de/nachrichten/sport/news/european-tour-kaymer-18.-bei-golf-turnier-in-saudi-arabien-johnson-siegt-54931493.html",
"https://www.svz.de/sport/weitere-sportarten/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt-id31187247.html?nojs=true",
"https://www.rhein-zeitung.de/sport/aus-aller-welt/aus-aller-welt-golf_artikel,-kaymer-18-bei-golfturnier-in-saudiarabien-johnson-siegt-_arid,2220135.html",
"https://www.rhein-zeitung.de/sport/aus-aller-welt/aus-aller-welt-golf_artikel,-martin-kaymer-sagt-olympiastart-in-tokio-ab-_arid,2274019.html",
"https://www.allgemeine-zeitung.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.echo-online.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.mittelhessen.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.muensterschezeitung.de/Sport/Sportarten/Golf/4360897-European-Tour-Kaymer-18.-bei-Golf-Turnier-in-Saudi-Arabien-Johnson-siegt",
"https://www.wiesbadener-kurier.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.giessener-anzeiger.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://newsroom.porsche.com/de/2021/unternehmen/porsche-sportwagenhersteller-tag-heuer-luxusuhren-schmiede-zusammenarbeit-videostream-23558.html",
"https://www.azonline.de/Sport/Weitere-Sportarten/Golf/4360897-European-Tour-Kaymer-18.-bei-Golf-Turnier-in-Saudi-Arabien-Johnson-siegt",
"https://www.borkenerzeitung.de/welt/sport/Kaymer-18-bei-Golf-Turnier-in-Saudi-Arabien-Johnson-siegt-327224.html",
"https://www.golfpost.de/european-tour-saudi-international-2021-ergebnisse-runde-2-7777396527/",
"https://www.golfpost.de/396354-7777396354/", "https://www.golfpost.de/german-challenge-powerd-by-vcg-golf-challenge-tour-kehrt-nach-deutschland-zurueck-7777396396/",
"https://www.golfpost.de/die-macht-der-moneten-saudi-arabien-auf-dem-weg-zum-big-player-im-golf-7777396387/",
"https://www.kreis-anzeiger.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.wormser-zeitung.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://m.azonline.de/Sport/Weitere-Sportarten/Golf/4361712-PGA-Turnier-US-Golfstar-Koepka-triumphiert-bei-Phoenix-Open",
"https://www.mv-online.de/sport/sportmix/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt-409658.html",
"https://www.golf.de/publish/dgv-sport/golf-team-germany/news/60228375/sophia-popov-nach-major-sieg-in-elite-team-germany",
"https://www.golf.de/publish/tournews/nachrichten-tour/60228372/einmal-saudi-einmal-etwas-gaudi",
"https://www.golf.de/publish/tournews/nachrichten-tour/60228387/koepka-comeback-und-eine-wuestenbilanz",
"https://www.ev-online.de/sport/sportmix/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt-409655.html",
"https://www.nach-welt.com/dustin-johnson-setzt-masstabe-aber-jordan-spieth-justin-rose-und-brooks-koepka-kehren-zur-form-zuruck/",
"https://www.nach-welt.com/ryan-fox-wird-sechster-wahrend-dustin-johnson-saudi-international-gewinnt/",
"https://www.usinger-anzeiger.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.gaeubote.de/Nachrichten/Golf-Turnier-in-Muenchen-Kaymer-faellt-zurueck-86604.html",
"https://www.gaeubote.de/Nachrichten/Kaymer-nach-Traumrunde-Zweiter-bei-Golf-Turnier-in-Muenchen-86664.html",
"https://www.main-spitze.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.lauterbacher-anzeiger.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.oberhessische-zeitung.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://de.advfn.com/p.php?pid=nmona&article=84265497", "https://www.buerstaedter-zeitung.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.golftime.de/golf-nachrichten/challenge-tour-in-deutschland-neues-profi-turnier/",
"https://www.golftime.de/golf-nachrichten/martin-kaymer-saudi-international-tour-news/",
"https://www.golftime.de/magazin/distanz-usga-ra-elite-spieler-regel-anpassung/",
"https://www.dmm.travel/nc/news/porsche-und-tag-heuer-arbeiten-zusammen/",
"https://www.lampertheimer-zeitung.de/sport/golf/kaymer-18-bei-golf-turnier-in-saudi-arabien-johnson-siegt_23109750",
"https://www.hongkongherald.com/news/267768819/johnson-excited-for-season-after-second-saudi-title",
"https://www.hongkongherald.com/news/267750643/fox-takes-clubhouse-lead-as-johnson-makes-move-in-saudi-arabia",
"http://hongkongcityportal.com/saudi-international-englands-david-horsey-leads-from-scotlands-stephen-gallacher/",
"http://hongkongcityportal.com/bryson-dechambeau-flattered-and-welcomes-proposed-rule-changes/",
"http://hongkongcityportal.com/paul-casey-englishman-defends-saudi-international-u-turn/",
"https://as.com/masdeporte/2021/02/03/golf/1612378989_020231.html",
"https://www.marca.com/golf/2021/02/07/601fd7c122601d860c8b45dc.html",
"https://www.marca.com/golf/2021/05/02/608ece1b22601d9d5d8b45f0.html",
"https://www.marca.com/golf/2021/02/03/601ad5d7268e3ef01e8b4670.html",
"https://www.republicworld.com/sports-news/other-sports/johnson-eases-to-another-victory-at-saudi-international.html",
"https://www.republicworld.com/sports-news/other-sports/dustin-johnson-within-1-shot-of-lead-at-saudi-international.html",
"https://timesofindia.indiatimes.com/sports/golf/top-stories/dustin-johnson-excited-for-season-after-second-saudi-title/articleshow/80737390.cms",
"https://timesofindia.indiatimes.com/sports/golf/top-stories/johnson-eases-to-another-victory-at-saudi-international/articleshow/80736264.cms",
"https://timesofindia.indiatimes.com/sports/golf/top-stories/ryan-fox-takes-surprise-lead-at-saudi-international/articleshow/80711869.cms",
"https://timesofindia.indiatimes.com/sports/golf/top-stories/horsey-goes-on-birdie-blitz-for-saudi-international-lead/articleshow/80691513.cms",
"https://timesofindia.indiatimes.com/sports/golf/top-stories/shubhankar-shoots-69-in-opening-round-at-saudi-international/articleshow/80691501.cms",
"https://timesofindia.indiatimes.com/sports/golf/top-stories/big-hitting-dechambeau-happy-to-take-longer-clubs-out-of-rivals-hands/articleshow/80672723.cms",
"https://timesofindia.indiatimes.com/sports/cricket/news/as-bubble-life-drags-on-psychologists-say-cricketers-need-more-support/articleshow/80662353.cms",
"https://www.abc.es/deportes/abci-sergio-garcia-apunta-ryder-202102070038_noticia.html",
"https://www.abc.es/deportes/abci-golfistas-golpe-gimnasio-202102050031_noticia.html",
"https://www.investing.com/news/general/golf-johnson-holds-on-to-clinch-second-saudi-international-title-2411514"
)
I have tried this:
html_reader<- function(x){return( tryCatch(xml2::read_html(URLs[k]), error = function(e) NULL))}
for (k in seq_along(URLs)) parsed_pages[k] <-lapply(as.list(URLs), html_reader)
For some reason I haven't run into runtime issues until now. The function will not complete even with the tryCatch() error handler.
My current working code is the following:
pp <- replicate(list(), n = length(ESPN))
for (k in seq_along(ESPN)) pp[[k]] <- try(xml2::read_html(ESPN[k]), silent = TRUE)
It used to just take a while but now it never finishes.

I think the issue is caused by open connections. The script got progressively slower, and I suspect connections left open by earlier requests were to blame. Here is a simple loop that closes all open connections after each request. I will know whether this is the solution when I next run this particular report, but it has seemed to help so far.
for (i in seq_along(df$URLs)) {
pp[[i]] <- try(xml2::read_html(df$URLs[i]), silent = TRUE) # read step as in the question's loop
closeAllConnections() # takes no arguments; closes every user-opened connection
}
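Another way to keep one unresponsive server from stalling the whole run is a per-request timeout. The sketch below is not from the original post: it assumes the httr package is installed alongside xml2, and read_html_timeout is a hypothetical helper name. It also reads each URL exactly once, whereas the first attempt in the question nests lapply() inside the for loop, which appears to request every URL once per element of URLs.

```r
# Hypothetical helper (not from the original post): fetch one page with a
# hard time limit and return NULL on any failure, including a timeout.
# Assumes the httr and xml2 packages are installed.
read_html_timeout <- function(url, seconds = 10) {
  tryCatch({
    resp <- httr::GET(url, httr::timeout(seconds))  # abort after `seconds`
    httr::stop_for_status(resp)                     # treat HTTP 4xx/5xx as errors
    xml2::read_html(httr::content(resp, as = "raw"))
  }, error = function(e) NULL)
}

# Failure path, shown with a local address that is assumed to refuse connections.
unreachable <- read_html_timeout("http://127.0.0.1:1/", seconds = 2)

# Intended use on the question's vector (commented out: needs network access):
# parsed_pages <- lapply(URLs, read_html_timeout)
# failed <- URLs[vapply(parsed_pages, is.null, logical(1))]
```

Because lapply() keeps NULL elements in its result, the positions of failed URLs stay aligned with URLs and can be filtered afterwards.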

Related

R error: dims do not match the length of an object

I am currently trying to run some code (if you need to know the purpose to help me, ask me, but I'm trying to keep this question short). This is the code:
par<-c(a=.5,b=rep(1.3,4))
est<-rep(TRUE,length(par))
ncat<-5
Theta<-matrix(c(-6,-5.8,-5.6,-5.4,-5.2,-5,-4.8,-4.6,-4.4,-4.2,-4,-3.8,-3.6,-3.4,-3.2,-3,-2.8,-2.6,-2.4,-2.2,-2,-1.8,-1.6,-1.4,-1.2,-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1,1.2,1.4,1.6,1.8,2,2.2,2.4,2.6,2.8,3,3.2,3.4,3.6,3.8,4,4.2,4.4,4.6,4.8,5,5.2,5.4,5.6,5.8,6))
p.grm<-function(par,Theta,ncat){
a<-par[1]
b<-par[2:length(par)]
z<-matrix(0,nrow(Theta),ncat)
y<-matrix(0,nrow(Theta),ncat)
y[,1]<-1
for(i in 1:ncat-1){
y[,i+1]<-(exp(a*(Theta-b[i])))/(1+exp(a*(Theta-b[i])))
}
for(i in 1:ncat-1){
z[,i]<-y[,i]-y[,i+1]
}
z[,ncat]<-y[,ncat]
z
}
However, when I try to run the code:
p.grm(par=par,Theta=Theta,ncat=ncat)
I get the following error:
Error: dims [product 61] do not match the length of object [0]
Traceback tells me that the error is occurring in the first for loop in the line:
y[,i+1]<-(exp(a*(Theta-b[i])))/(1+exp(a*(Theta-b[i])))
Could someone point me to what I'm doing wrong? When I try to run this code step by step outside of the custom p.grm function, everything seems to work fine.
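This is a common mistake. The colon operator binds more tightly than subtraction, so 1:ncat-1 is evaluated as (1:ncat)-1, which gives 0:4 rather than 1:4. On the first iteration i is 0, b[0] has length zero, and the assignment fails. To loop from 1 to ncat-1, write for (i in 1:(ncat-1)) instead of for (i in 1:ncat-1); the two are completely different.
You may also want to end the function with an explicit return(z). Here is the corrected code: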
par<-c(a=.5,b=rep(1.3,4))
est<-rep(TRUE,length(par))
ncat<-5
Theta<-matrix(c(-6,-5.8,-5.6,-5.4,-5.2,-5,-4.8,-4.6,-4.4,-4.2,-4,-3.8,-3.6,-3.4,-3.2,-3,-2.8,-2.6,-2.4,-2.2,-2,-1.8,-1.6,-1.4,-1.2,-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1,1.2,1.4,1.6,1.8,2,2.2,2.4,2.6,2.8,3,3.2,3.4,3.6,3.8,4,4.2,4.4,4.6,4.8,5,5.2,5.4,5.6,5.8,6))
p.grm<-function(par,Theta,ncat){
a<-par[1]
b<-par[2:length(par)]
z<-matrix(0,nrow(Theta),ncat)
y<-matrix(0,nrow(Theta),ncat)
y[,1]<-1
for(i in 1:(ncat-1)){
y[,i+1]<-(exp(a*(Theta-b[i])))/(1+exp(a*(Theta-b[i])))
}
for(i in 1:(ncat-1)){
z[,i]<-y[,i]-y[,i+1]
}
z[,ncat]<-y[,ncat]
return(z)
}
p.grm(par=par,Theta=Theta,ncat=ncat)
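The precedence issue can be checked directly at the console. This is a minimal sketch using the same ncat and b values as the question; the variable names wrong, right, and safe are illustrative only.

```r
ncat <- 5
b <- rep(1.3, 4)

wrong <- 1:ncat - 1    # parsed as (1:ncat) - 1, so it starts at 0
right <- 1:(ncat - 1)  # the intended index sequence

print(wrong)           # 0 1 2 3 4
print(right)           # 1 2 3 4
print(length(b[0]))    # 0: indexing with 0 returns an empty vector

# seq_len() sidesteps the precedence trap and also copes with a length of zero
safe <- seq_len(ncat - 1)
print(identical(safe, right))
```

Using seq_len(ncat - 1) in the loop header is a common defensive habit, since 1:0 would otherwise count downwards instead of producing an empty sequence.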

Use of variable in Unix command line

I'm trying to make life a little bit easier for myself but it is not working yet. What I'm trying to do is the following:
NOTE: I'm running R on a Unix server, since the rest of my script is in R. That's why the commands are wrapped in system(" ").
system("TRAIT=some_trait")
system("grep var.resid.anim rep_model_$TRAIT.out > res_var_anim_$TRAIT'.xout'",wait=T)
When I run exactly the same commands in PuTTY (without system(" "), of course), the right file is read and the right output is created. The script also works when I remove the variable I created. However, I need to do this many times, so a variable would be very convenient, but I can't get it to work.
This code prints nothing on the console.
system("xxx=foo")
system("echo $xxx")
But the following does.
system("xxx=foo; echo $xxx")
Each call to system() starts a new shell, so a variable defined in one call is gone by the time the next call runs.
In your case, how about trying:
system("TRAIT=some_trait; grep var.resid.anim rep_model_$TRAIT.out > res_var_anim_$TRAIT'.xout'",wait=T)
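Two other ways to get the value into the shell command are sketched here, assuming the trait name lives in an R variable (trait is a hypothetical name; the file-name pattern is copied from the question). One builds the command string in R, the other exports the value with Sys.setenv() so that shells started by system() inherit it.

```r
trait <- "some_trait"  # hypothetical R variable holding the trait name

# Option 1: build the whole command in R, so no shell variable is needed.
cmd <- sprintf("grep var.resid.anim rep_model_%s.out > res_var_anim_%s.xout",
               trait, trait)
print(cmd)
# system(cmd, wait = TRUE)  # commented out: needs the .out file to exist

# Option 2: export the value as an environment variable of the R process;
# every later system() call inherits it (assumes a Unix-like shell).
Sys.setenv(TRAIT = trait)
seen <- system("echo $TRAIT", intern = TRUE)
print(seen)
```

Option 1 is convenient inside a loop over many traits, since the same sprintf() template can be reused for each name.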
You can keep this all in R:
grep_trait <- function(search_for, in_trait, out_trait=in_trait) {
l <- readLines(sprintf("rep_model_%s.out", in_trait))
l <- grep(search_for, l, value=TRUE) # keep only the matching lines
writeLines(l, sprintf("res_var_anim_%s.xout", out_trait))
}
grep_trait("var.resid.anim", "haptoglobin")
If there's a concern that the files are read into memory first (i.e. if they are huge files), then:
grep_trait <- function(search_for, in_trait, out_trait=in_trait) {
fin <- file(sprintf("rep_model_%s.out", in_trait), "r")
fout <- file(sprintf("res_var_anim_%s.xout", out_trait), "w")
repeat {
l <- readLines(fin, 1)
if (length(l) == 0) break;
if (grepl(search_for, l)[1]) writeLines(l, fout)
}
close(fin)
close(fout)
}

Getting a strange bug in jxBrowser

So this is a strange one. My code does a bunch of things that are hard to explain (I'll try if necessary), but the following works:
var res = data.delete_if (function(key, value) { return key == "a"; })
but the following crashes:
data.delete_if (function(key, value) { return key == "a"; })
So, the fact that I do not save the result of the delete_if function crashes the browser with the following stack trace:
Error: test: B environment should proxy a Ruby hash. (MDArraySolTest): Java::JavaLang::IllegalStateException: Channel stream was closed before response has been received.
java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498) org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:453)
Any ideas of why this happens? Any solutions? I can provide more information if needed.
EDIT1:
After some more tests, I found that the error occurs only if the call to data.delete_if is the last statement in the script. If I add, for example, console.log(""); after the call, everything works fine.
Thanks

Unable to resolve an "argument is of length zero" error

I get an "argument is of length zero" error when I run the code below.
The code is from this blog: http://giventhedata.blogspot.in/2012/08/r-and-web-for-beginners-part-iii.html
library(XML)
url<- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm"
first<-"Abbott, Ms Diane"
url.tab <- readHTMLTable(url)
for (i in 1:length(url.tab)){
if (as.character(url.tab[[i]][1,1]) == first ) {print(first)}
}
I know that url.tab[[5]][1,1] does contain the string "Abbott, Ms Diane", and when I run the if statement in isolation, replacing i with 5, it runs fine. Any help would be appreciated. I also tried declaring i<-1 up front, but it didn't change anything.
Some of your tables are in fact NULL.
So you have to test for is.null before trying to subset the table:
for (i in 1:length(url.tab)){
this.tab <- url.tab[[i]]
if(!is.null(this.tab)) if(as.character(this.tab[1,1]) == first ) {print(first)}
}
[1] "Abbott, Ms Diane"
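Why a NULL table triggers this error can be reproduced without the web page. In this sketch, tabs is a made-up stand-in for url.tab (one NULL entry and one small data frame), and the vapply() variant is one possible alternative to the explicit loop, not the blog's code.

```r
first <- "Abbott, Ms Diane"

# Made-up stand-in for url.tab: a NULL entry followed by a real table
tabs <- list(NULL,
             data.frame(name = "Abbott, Ms Diane", stringsAsFactors = FALSE))

# Subsetting NULL gives NULL, so the comparison has length zero
cmp <- as.character(tabs[[1]][1, 1]) == first
print(length(cmp))

# if() needs exactly one TRUE/FALSE, so a zero-length condition is an error
err <- tryCatch(if (cmp) "match", error = function(e) conditionMessage(e))
print(err)

# Alternative to the loop: test every table, skipping the NULL ones
hits <- vapply(tabs, function(tab) {
  !is.null(tab) && identical(as.character(tab[1, 1]), first)
}, logical(1))
print(which(hits))
```

The && operator short-circuits, so tab[1, 1] is never evaluated for a NULL entry.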

getURL (from RCurl package) doesn't work in a loop

I have a list of URLs named URLlist, and I loop over it to get the source code for each URL:
for (k in 1:length(URLlist)){
temp = getURL(URLlist[k])
}
The problem is that for some random URLs the code gets stuck and I get the error message:
Error in function (type, msg, asError = TRUE) :
transfer closed with outstanding read data remaining
But when I call getURL outside the loop on the URL that caused the problem, it works perfectly.
Any help, please? Thank you very much.
Hard to tell for sure without more information, but it could just be the requests getting sent too quickly, in which case just pausing between requests could help :
for (k in 1:length (URLlist)) {
temp = getURL (URLlist[k])
Sys.sleep (0.2)
}
I'm assuming that your actual code does something with 'temp' before writing over it in every iteration of the loop, and whatever it does is very fast.
You could also try building in some error handling so that one problem doesn't kill the whole thing. Here's a crude example that tries twice on each URL before giving up:
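for (url in URLlist) {
temp = try (getURL (url))
# inherits() is safer than class (temp) == "try-error", which breaks if class() returns several values
if (inherits (temp, "try-error")) {
temp = try (getURL (url)) # second attempt
if (inherits (temp, "try-error"))
temp = paste ("error accessing", url)
}
Sys.sleep(0.2)
}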
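The retry idea can be generalised into a small helper. This is a sketch under stated assumptions: fetch_with_retry and flaky_fetch are hypothetical names, and the fetch argument stands in for getURL so that the example runs without network access or RCurl.

```r
# Hypothetical helper: call fetch(url) up to max_tries times, pausing between
# attempts, and return NULL if every attempt raises an error.
fetch_with_retry <- function(url, fetch, max_tries = 3, pause = 0.2) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_tries) Sys.sleep(pause)  # wait before trying again
  }
  NULL
}

# Stand-in for getURL that fails on its first call and succeeds afterwards
calls <- 0
flaky_fetch <- function(url) {
  calls <<- calls + 1
  if (calls < 2) stop("transfer closed with outstanding read data remaining")
  paste("<html>", url, "</html>")
}

page <- fetch_with_retry("http://example.com", flaky_fetch)
print(page)
print(calls)

# A fetcher that always fails yields NULL instead of stopping the loop
gave_up <- fetch_with_retry("http://example.com", function(u) stop("down"),
                            max_tries = 2, pause = 0)
```

With RCurl loaded, the same helper could be called as fetch_with_retry(url, getURL) inside the loop over URLlist.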
