I am having an issue with my code. I have a for loop that identifies all "strong" tags in an HTML document and then finds the row numbers of a given word in the HTML. Wherever the two row numbers match, I want it to record that row number. This works so far, but it fails if the word also occurs on a row that has no strong tag.
url <- readLines("http://afip.gob.ar/contacto")
tagname=NULL
identifier=NULL
IDintag=NULL
rowst=NULL
rowend=NULL
data=NULL
tag <- as.matrix(grep("<strong>",url))
if(length(tag) > 0)
{ID <- grep("Telef|Numero",url)
for(i in 1:length(ID))
{IDintag[i] <- grep(ID[i],tag)
}
for(i in 1:length(IDintag))
{tagname[i] <- tag[IDintag[i]]
}
for(i in 1:length(tagname))
{rowst[i] <- which(grepl(tagname[i],tag))
rowend[i] <- tag[rowst[i] + 1,]-1
data[i] <- toString(url[tagname[i]:rowend[i]])
}
}
This works like a dream, but if I change the URL to one where the ID terms also occur on rows without the tag, it fails. For example:
url <- readLines("http://www2.le.ac.uk/contact")
tagname=NULL
identifier=NULL
IDintag=NULL
rowst=NULL
rowend=NULL
data=NULL
tag <- as.matrix(grep("<h2>",url))
if(length(tag) > 0)
{ID <- grep("Telef|Numero|phone",url)
for(i in 1:length(ID))
{IDintag[i] <- grep(ID[i],tag)
}
for(i in 1:length(IDintag))
{tagname[i] <- tag[IDintag[i]]
}
for(i in 1:length(tagname))
{rowst[i] <- which(grepl(tagname[i],tag))
rowend[i] <- tag[rowst[i] + 1,]-1
data[i] <- toString(url[tagname[i]:rowend[i]])
}
}
Thanks in advance
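When an ID term sits on a row that has no tag, grep(ID[i], tag) returns a zero-length result, and assigning that to IDintag[i] raises an error. A minimal sketch of one way around this, assuming the goal is simply the row numbers present in both vectors, is to replace the first two loops with intersect(). The page vector below is a made-up stand-in for readLines(url), not real page content.

```r
# Made-up stand-in for readLines(url); the contents are illustrative only
page <- c(
  "<html>",
  "<strong>Telefono</strong>",
  "555-0100",
  "<strong>Email</strong>",
  "info@example.org",
  "<p>Telefono general</p>",
  "</html>"
)

tag <- grep("<strong>", page, fixed = TRUE)  # rows containing the tag: 2, 4
ID  <- grep("Telef|Numero", page)            # rows containing an ID term: 2, 6

# Rows that contain both the tag and an ID term; row 6 is dropped silently
starts <- intersect(ID, tag)

# Each block ends just before the next tagged row, or at the last line
ends <- vapply(starts, function(s) {
  later <- tag[tag > s]
  if (length(later) > 0) later[1] - 1L else length(page)
}, integer(1))

data <- mapply(function(s, e) toString(page[s:e]), starts, ends)
print(data)
```

Comparing the row numbers directly also avoids treating a row number as a regular expression, where 12 would also match 112.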
I am trying to scrape a lot of NCAA men's basketball data from a website called RealGM. My code is below:
library(htmltab)
tables <- list()
for (i in 0:1548) {
for (j in 0:16) {
for (k in 0:4) {
a <- i+1
b <- 2003+j
c <- k+1
url <- paste("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/",a,"/individual-games/",b,"/minutes/Season/desc/",c,sep = "")
tables[[paste(i,j,k,sep = "")]] <- htmltab(url,rm_nodata_cols = F,which = 1)
}
}
}
I've used similar methods in the past to pull data off of sites like Sports Reference which keep player data in tables.
In this loop, the variable a controls the team, b controls the year, and c controls the page number for the game log set.
My issue here is that some of the referenced URLs contain no tables, e.g. there is no 4th page of game logs for Michigan's 2003 team, but there are 5 pages for their 2018 team.
Unfortunately, htmltab returns an error when no table is found, which aborts my loop. Is there a workaround so that it just skips those URLs and continues through the rest of the process?
I was able to figure out how to do this by first checking whether a table exists and, if not, moving on to the next iteration of the loop:
library(htmltab)
library(rvest) # provides read_html() and html_nodes() used below
tables <- list()
for (i in 0:1548) {
for (j in 0:16) {
for (k in 0:4) {
a <- i+1
b <- 2003+j
c <- k+1
url <- paste("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/",a,"/individual-games/",b,"/minutes/Season/desc/",c,sep = "")
test <- html_nodes(read_html(url),"table")
if (length(test) == 0){
next
}
# a separator keeps keys unique: without one, i=1,j=10,k=0 and i=11,j=0,k=0 both give "1100"
tables[[paste(i,j,k,sep = "_")]] <- htmltab(url,rm_nodata_cols = F,which = 1)
}
}
}
One option is to use tryCatch and skip the URLs that give an error.
library(htmltab)
tables <- list()
for (i in 1:1549) {
for (j in 2003:2019) {
for (k in 1:5) {
url <- paste0("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/",i,"/individual-games/",j,"/minutes/Season/desc/",k)
tables[[paste0(i,j,k)]] <- tryCatch({
htmltab(url,rm_nodata_cols = F,which = 1)
}, error = function(e) {
cat("Wrong URL : ", url, " skipping\n")
})
}
}
}
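The same skip-on-error pattern can be tried without any network access. This is only a sketch: fetch_table() is a hypothetical stand-in for htmltab() that fails for pages above 3, and the handler returns NULL so the loop can move on.

```r
# Hypothetical stand-in for htmltab(): errors when the "page" has no table
fetch_table <- function(page) {
  if (page > 3) stop("no table found")
  data.frame(page = page)
}

tables <- list()
for (k in 1:5) {
  # On error the handler returns NULL instead of aborting the loop
  result <- tryCatch(fetch_table(k), error = function(e) NULL)
  if (is.null(result)) next
  tables[[as.character(k)]] <- result
}

print(names(tables))
```

Only the pages that produced a table end up in the list; the failing ones are skipped.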
I'm trying to save each iteration of this for loop in a vector.
for (i in 1:177) {
a <- geomean(er1$CW[1:i])
}
Basically, I have a list of 177 values and I'd like the script to compute the cumulative geometric mean, one value at a time. Right now it only gives me the final value; it doesn't save each loop iteration as a separate value in a list or vector.
The reason your code does not work is that the object a is overwritten in each iteration. The following code, for instance, does precisely what you want:
a <- c()
for(i in 1:177){
a[i] <- geomean(er1$CW[1:i])
}
Alternatively, this would work as well:
for(i in 1:177){
if(i != 1){
a <- rbind(a, geomean(er1$CW[1:i]))
}
if(i == 1){
a <- geomean(er1$CW[1:i])
}
}
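As a self-contained sketch of the same idea: geomean() is not part of base R, so the version below is an assumed stand-in, and cw is a made-up replacement for er1$CW. The result vector is preallocated rather than grown, and a vectorised form is shown for comparison.

```r
# Assumed stand-in for geomean(); valid for strictly positive values
geomean <- function(x) exp(mean(log(x)))

cw <- c(1, 2, 4, 8)  # made-up data in place of er1$CW

# Preallocate the result, then fill one slot per iteration
a <- numeric(length(cw))
for (i in seq_along(cw)) {
  a[i] <- geomean(cw[1:i])
}

# Vectorised equivalent: running mean of the logs, then exponentiate
b <- exp(cumsum(log(cw)) / seq_along(cw))

print(all.equal(a, b))
```

Both forms give the cumulative geometric mean; the vectorised one avoids recomputing the mean from scratch at every step.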
I started down a similar path with rbind as @nate_edwinton did, but couldn't figure it out. I did, however, come up with something effective: fill a data frame one row at a time, then coerce it back to a list.
MyNums <- data.frame(x=(1:177))
a <- data.frame(x=integer())
for(i in 1:177){
a[i,1] <- geomean(MyNums$x[1:i])
}
a<-as.list(a)
You can define a variable to hold the results first, then append to it in each iteration:
b <- c()
for (i in 1:177) {
a <- geomean(er1$CW[1:i])
b <- c(b,a)
}
I managed to fill my doi_list, but it does not work once I wrap the code in a function. From a tutorial I've seen, I assumed this should be possible, but doi_list is empty after get_doi_from_category() finishes.
library(aRxiv)
get_doi_from_category <- function(category, doi_list) {
arxiv_rec <- arxiv_search(category)
arxiv_doi_list <- arxiv_rec[13]
by(arxiv_doi_list, 1:nrow(arxiv_doi_list),
function(row) {
if(nchar(row) > 0) {
doi_list <<- c(doi_list, row)
}
})
}
doi_list <- list()
get_doi_from_category('cat:stat.ML', doi_list)
for(doi in doi_list)
{
print(doi)
}
get_doi_from_category('cat:stat.CO', doi_list)
get_doi_from_category('cat:stat.ME', doi_list)
get_doi_from_category('cat:stat.TH', doi_list)
PS: First day with R.
Here's a better way to do what you want in R:
categ <- c(CO = "cat:stat.CO", #I'm naming these elements so
ME = "cat:stat.ME", # that the corresponding elements
TH = "cat:stat.TH", # in the list are named as well.
ML = "cat:stat.ML") # Could also just set 'names(doi_list)' to 'categ'.
doi_list <-
lapply(categ, function(ctg)
(doi <- arxiv_search(ctg)$doi)[nchar(doi) > 0])
I sort of threw you in the deep end on the last line with in-line assignment of doi; a more step-by-step approach would be:
lapply(categ, function(ctg){
arxiv.df <- arxiv_search(ctg)
doi <- arxiv.df$doi
doi[nchar(doi) > 0]})
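On why the original function leaves doi_list empty: R passes arguments by value, so changes made to doi_list inside the function do not reach the caller's list. The usual pattern is to return the result and assign it. A minimal sketch, where fake_search() is a hypothetical stand-in for arxiv_search(ctg)$doi so that it runs offline:

```r
# Hypothetical stand-in for arxiv_search(ctg)$doi; the values are made up
fake_search <- function(ctg) c("10.1/a", "", "10.1/b")

# Return the filtered DOIs instead of trying to modify an argument
get_dois <- function(ctg, search = fake_search) {
  doi <- search(ctg)
  doi[nchar(doi) > 0]
}

doi_list <- list()
doi_list[["ML"]] <- get_dois("cat:stat.ML")  # the caller does the assignment
print(doi_list)
```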
I'm a novice R user and have created a small script that is doing some trigonometry with movement data. I need to add a final column that deletes repeated values from the column before it.
I've tried adding an if/else statement that seems to work in isolation, but I keep getting errors when I put it into the for loop. I'd appreciate any advice.
# trig loop
list.df <- vector("list", max(Sp_test$ID))
names1 <- c(1:max(Sp_test$ID))
for(i in 1:max(Sp_test$ID)) {
if(i %in% unique(Sp_test$ID)) {
idata <- subset(Sp_test, ID == i)
idata$originx <- idata[1,3]
idata$originy <- idata[1,4]
idata$deltax <- idata[,"UTME"]-idata[,"originx"]
idata$deltay <- idata[,"UTMN"]-idata[,"originy"]
idata$length <- sqrt((idata[,"deltax"])^2+(idata[,"deltay"]^2))
idata$arad <- atan2(idata[,"deltay"],idata[,"deltax"])
idata$xnorm <- idata[,"deltax"]/idata[,"length"]
idata$ynorm <- idata[,"deltay"]/idata[,"length"]
sumy <- sum(idata$ynorm, na.rm=TRUE)
sumx <- sum(idata$xnorm, na.rm=TRUE)
idata$vecsum <- atan2(sumy,sumx)
idata$width <- idata$length*sin(idata$arad-idata$vecsum)
# need if else statement excluding a repeat from the position just before it
list.df[[i]] <- idata
names1[i] <- i
} }
# this works alone, I think the problem is when it gets to the first of the dataset and there is not one before it
if (idata$width[j]==idata$width[j-1]) {
print("NA")
} else {
print(idata$width[j])
}
I think you want the diff function for this: diff(idata$width) gives the differences between successive values of idata$width. Then
idata$width[c(FALSE, diff(idata$width) == 0)] <- NA
should do what you want. The leading FALSE is needed because diff returns one fewer element than its input: as you rightly noted, the first element has nothing before it to compare with.
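A small sketch of the diff approach on a made-up vector (the values are illustrative, not from the movement data):

```r
width <- c(3.1, 3.1, 4.0, 4.0, 4.0, 2.5, 3.1)

# TRUE wherever a value equals the one immediately before it;
# the leading FALSE covers the first element, which has no predecessor
repeated <- c(FALSE, diff(width) == 0)

width[repeated] <- NA
print(width)
```

Only consecutive repeats are blanked: the final 3.1 is kept because the value before it is 2.5.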
I have a vector appended within a list. The entry successively grows.
li <- list()
for(i in 1:10)
{
v <- runif(2)
if(i==1)
{
li[[1]] <- v
} else {
li[[1]] <- append(li[[1]],v)
}
}
It's ugly that I need different code for the two cases: 1) li[[1]] does not exist and 2) li[[1]] exists. Any solutions?
Background:
You can't initialize a list element the way you initialize a vector:
v <- NULL
v <- append(v,c(1,2,3))
works
but
li <- list()
li[[1]] <- NULL
li[[1]] <- append(li[[1]],c(1,2,3))
throws an error, since li[[1]] can't be initialized by li[[1]] <- NULL .
Update: I learned that this works with named lists (which also adds some grace), but there may be dynamic cases where naming is not a good option.
I am not sure that you need a loop here, but you can pre-allocate your list to avoid dealing with empty lists. Allocate it with vector like this:
n <- 10
li <- vector('list',n)
Then you just assign each element :
for(i in 1:10) {
v <- runif(sample(n,1)) ## I chose a dynamic length here;
## otherwise the example doesn't make sense
li[[i]] <- v
}
One way to initialize a list element so that it can be appended to is to create the list with a NULL element already in place:
li <- list(NULL)
for(i in 1:10)
{
v <- runif(2)
li[[1]] <- append(li[[1]],v)
}
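The difference between the two initializations can be checked directly. A short sketch (the variable names empty and li are just for illustration): assigning NULL to a list element removes it rather than creating it, whereas list(NULL) builds a list that already has one element.

```r
empty <- list()
empty[[1]] <- NULL     # assigning NULL deletes, so no element is created
print(length(empty))   # 0

li <- list(NULL)       # one element whose value is NULL
print(length(li))      # 1

# append(NULL, v) is just v, so the first pass needs no special case
for (i in 1:3) {
  li[[1]] <- append(li[[1]], c(i, i))
}
print(li[[1]])
```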
Thanks