I am trying to scrape a lot of NCAA men's basketball data from a website called RealGM. My code is below:
library(htmltab)
tables <- list()
for (i in 0:1548) {
  for (j in 0:16) {
    for (k in 0:4) {
      a <- i + 1
      b <- 2003 + j
      c <- k + 1
      url <- paste("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/", a, "/individual-games/", b, "/minutes/Season/desc/", c, sep = "")
      tables[[paste(i, j, k, sep = "")]] <- htmltab(url, rm_nodata_cols = F, which = 1)
    }
  }
}
I've used similar methods in the past to pull data from sites like Sports Reference, which keep player data in tables.
In this loop, the variable a controls the team, b controls the year, and c controls the page number for the game log set.
My issue here is that some of the referenced URLs contain no tables; e.g., there is no 4th page of game logs for Michigan's 2003 team, but there are 5 pages for their 2018 team.
Unfortunately, htmltab returns an error when no table is found, which aborts my loop. Is there a workaround so that it will just skip those URLs and continue through the rest of the process?
I was able to figure out how to do this by first checking whether a table existed and, if not, going to the next iteration of the loop:
library(htmltab)
library(rvest)  # provides read_html() and html_nodes()
tables <- list()
for (i in 0:1548) {
  for (j in 0:16) {
    for (k in 0:4) {
      a <- i + 1
      b <- 2003 + j
      c <- k + 1
      url <- paste("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/", a, "/individual-games/", b, "/minutes/Season/desc/", c, sep = "")
      # skip this URL if the page contains no table at all
      test <- html_nodes(read_html(url), "table")
      if (length(test) == 0) {
        next
      }
      tables[[paste(i, j, k, sep = "")]] <- htmltab(url, rm_nodata_cols = F, which = 1)
    }
  }
}
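Note that this check-first approach fetches every page twice: once for read_html() and once again inside htmltab(). The tryCatch() approach below makes only one request per URL.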
One option is to use tryCatch and skip the URLs that give an error.
library(htmltab)
tables <- list()
for (i in 1:1549) {
  for (j in 2003:2019) {
    for (k in 1:5) {
      url <- paste0("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/", i, "/individual-games/", j, "/minutes/Season/desc/", k)
      tables[[paste0(i, j, k)]] <- tryCatch({
        htmltab(url, rm_nodata_cols = F, which = 1)
      }, error = function(e) {
        cat("Wrong URL: ", url, " skipping\n")
      })
    }
  }
}
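Note that when htmltab() fails here, the error handler's cat() returns NULL, and assigning NULL to a list element is a no-op, so the failed URLs simply never appear in tables. The same effect can be sketched with purrr::possibly(), which wraps a function so that failures return a default value instead of throwing an error (this assumes the purrr package and reuses url and the loop indices from above):
library(purrr)
# possibly() yields a wrapped htmltab that returns NULL on any error
safe_htmltab <- possibly(htmltab, otherwise = NULL)
tbl <- safe_htmltab(url, rm_nodata_cols = F, which = 1)
if (!is.null(tbl)) tables[[paste0(i, j, k)]] <- tbl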
I have a series of functions that go to a website and collect data. Sometimes the website returns a 404 error and my code breaks. It could take 10 minutes of processing until I get a 404 error, or the code (more often than not) runs without the 404 error.
I have the following code:
linkToStopAt = as.character(unique(currentData$linkURL)[1])
myLinksToSearchOver = as.character(unique(currentData$page))
tmp = NULL
i <- 1
out_lst = list()
while (i <= length(myLinksToSearchOver)) {
  print(paste("Processing page: ", i))
  tmp <- possible_collectPageData(myLinksToSearchOver[i]) %>%
    add_column(page = myLinksToSearchOver[i])
  if (linkToStopAt %in% tmp$linkURL) {
    print(paste("We stopped at: ", i))
    break
  }
  out_lst[[i]] <- tmp
  i <- i + 1
}
Broken down as:
linkToStopAt = as.character(unique(currentData$linkURL)[1]) gives me a single URL; the while loop will break once it sees this URL.
myLinksToSearchOver = as.character(unique(currentData$page)) gives me multiple links for the while loop to search over; once it finds linkToStopAt on one of these links, the while loop breaks.
tmp <- possible_collectPageData(myLinksToSearchOver[i]) %>% add_column(page = myLinksToSearchOver[i]) This is a big function, which relies on many other functions...
So, the while loop runs until it finds the link linkToStopAt on one of the pages from myLinksToSearchOver. The function possible_collectPageData just does all my scraping/data processing etc. The result for each page from myLinksToSearchOver is stored via out_lst[[i]] <- tmp.
I sometimes receive a specific error in the console: "Error in if (nrow(df) != nrow(.data)) { : argumento tiene longitud cero" (i.e., the argument has length zero).
What I want to do is something like:
repeat {
  tmpCollectData <- try(while ("ALL-MY-WHILE-LOOP-HERE??"))  # try(execute(f))
  if (!(inherits(tmpCollectData, "Error in if (nrow(df) != nrow(.data)) { : argumento tiene longitud cero")))
    break
}
Here, if the while loop breaks with that error, it would just run it all again, setting tmp = NULL, i = 1, out_lst = list(), etc. (basically starting again; I can do this manually by just re-executing the code).
You could create a function that does your work, and then wrap the call to that function in try() with silent = TRUE. Then place that in a while(TRUE) loop, breaking out if get_data() does NOT return an error:
A function that does the work:
get_data <- function(links, stoplink) {
  i <- 1
  out_lst <- list()
  while (i <= length(links)) {
    print(paste("Processing page: ", i))
    tmp <- possible_collectPageData(links[i]) %>% add_column(page = links[i])
    if (stoplink %in% tmp$linkURL) {
      print(paste("We stopped at: ", i))
      break
    }
    out_lst[[i]] <- tmp
    i <- i + 1
  }
  return(out_lst)
}
An infinite loop that is broken out of once the result no longer carries an error:
while (TRUE) {
  result = try(get_data(myLinksToSearchOver, linkToStopAt), silent = TRUE)
  if (!"try-error" %in% class(result)) break
}
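If the failure can persist for a while (say, the site is down), a bare while(TRUE) will retry forever. A minimal variant with a retry cap, assuming the same get_data() as above and a hypothetical max_tries limit:
max_tries <- 5  # hypothetical cap on the number of retries
result <- NULL
for (attempt in seq_len(max_tries)) {
  result <- try(get_data(myLinksToSearchOver, linkToStopAt), silent = TRUE)
  if (!inherits(result, "try-error")) break  # success, stop retrying
  message("Attempt ", attempt, " failed, retrying")
}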
Okay, so I have combed the internet for an answer to my problem and I can only put it down to me being a little naive in how R works.
Below is my code for a function that generates public and private keys from the system clock and uses them to attempt to decrypt an encrypted message. This bit works fine, but obviously, as it goes through different random generations, it comes back with a lot of garbage and NULL data.
I wanted to filter this out by using grep and testing whether the result of that grep was 1; if so, the decoded message would be put into a list.
The problem is that, no matter how I write the if statement, my list gets cluttered with both the nonsense entries and the NULL entries.
I've tried !is.null, is.character, test == 1, etc., but nothing seems to work. Either the list doesn't get populated at all, or it gets populated by every entry that runs through the if statement.
Any advice would be appreciated. Thanks :)
Edit: Okay, forgive me, for these are copy-and-paste jobs to provide clarity. The first piece of code is what I'm using to encrypt the message.
require(gmp)
source("convert.R")

p <- nextprime(urand.bigz(size = 51, seed = as.bigz(Sys.time())))
q <- nextprime(urand.bigz(size = 50))
n <- p * q

finde <- function(phi) {
  r <- floor(log(phi, base = 2))
  y <- 0  # initialise
  while (y != 1) {
    e <- urand.bigz(nb = 1, size = r)
    y <- gcd.bigz(e, phi)
  }
  return(e)
}

phi <- (p - 1) * (q - 1)
e <- finde(phi)
d <- inv.bigz(e, phi)

text1 <- c("I want to eat a baby panda with my bare teeth and hands. Just so I know there's something else in this world suffering more than myself, right now.")
m <- blocks(text1, n)  # arguments are text1 (message) and n (public key)
u <- as.bigz((as.bigz(m, n)^e))
dput(u, file = "codedmessage.R")
The second is the code contained in the "convert.R" source file:
blocks <- function(txt, n) {
  x <- strtoi(charToRaw(txt), 16L)
  ll <- length(x)
  bl <- floor(log(n, base = 256))  # block length (how large the blocks must be)
  nb <- floor(ll / bl)
  wp <- bl * nb
  rem <- ll - wp
  s <- as.bigz(vector(mode = "numeric", length = 0))
  u <- 0
  while (u < wp) {
    total <- as.bigz(0)
    for (i in 1:bl) {
      total <- 256 * total + x[i + u]
    }
    u <- u + bl
    s <- c(s, total)
  }
  if (rem > 0) {
    total <- as.bigz(0)
    for (i in 1:rem) {
      total <- 256 * total + x[i + wp]
    }
    s <- c(s, total)
  }
  return(s)
}
words <- function(blocknum) {
  w <- vector(mode = "numeric", length = 0)
  wl <- blocknum
  while (as.bigz(wl) > 0) {
    rem <- as.bigz(wl) %% 256
    w <- c(rem, w)
    wl <- (as.bigz(wl) - as.bigz(rem)) / 256
  }
  return(w)
}

dectext <- function(listnum) {
  len <- length(listnum)
  newls <- as.integer(vector(mode = "numeric", length = 0))
  for (i in 1:len) {
    temp <- as.integer(words(listnum[i]))
    newls <- c(newls, temp)
  }
  return(rawToChar(as.raw(newls)))
}
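As a quick sanity check of these helpers, a toy round trip (the modulus 2^40 here is made up purely for illustration; in the real code n is the RSA modulus p*q):
require(gmp)
source("convert.R")
s <- blocks("Hi there", 2^40)  # packs the text into base-256 integer blocks
dectext(s)                     # unpacks the blocks back to "Hi there"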
And finally, the last piece of code is the decrypt-and-compile-list function that I'm having issues with.
finde <- function(phi) {
  r <- floor(log(phi, base = 2))
  y <- 0  # initialise
  while (y != 1) {
    e <- urand.bigz(nb = 1, size = r)
    y <- gcd.bigz(e, phi)
  }
  return(e)
}

FindKey <- function(a, y) {
  x <<- 1  # initialisation
  decodedlist <<- list()  # initialisation
  while (x < 7200) {
    print(x)
    print(a)
    p <- nextprime(urand.bigz(size = 51, seed = as.bigz(a)))
    q <- nextprime(urand.bigz(size = 50))
    n <- p * q
    phi <- (p - 1) * (q - 1)
    phi
    e <- finde(phi)
    d <- inv.bigz(e, phi)
    recieved <- dget(file = y)
    v <- as.bigz(as.bigz(recieved, n)^d)
    tryCatch({
      decodetext <- dectext(v)
      Decrypt <- capture.output(cat(decodetext))
      print(Decrypt)
      test <- grep("and", Decrypt)
      if (!is.null(Decrypt)) {
        if (is.character(Decrypt)) {
          decodedlist[[x]] <<- Decrypt
        } else {
          return
        }
      } else {
        return
      }
    }, warning = function(war) {
      return()
    }, error = function(err) {
      return()
    }, finally = {
      x = x + 1
      a = a - 1
    })
  }
}
Sorry it's long.. But I really don't know what to do :(
I found a "sort of" solution to my problem, albeit within a different code I've written.
I'm not very knowledgeable in the reasoning behind why this works but I believe the problem lay in the fact that the list was storing something with a NULL reference (Reps to Acccumulation for the hint ;D) and therefore was not technically NULL itself.
My workaround for this was to avoid using an if statement altogether, instead I found a more efficient method of filtering out NULL list entries in a program I had written for generating large prime numbers.
Extra points for anyone who can figure out what I'm currently studying ;)
#Combine two lists and remove NULL entries therein.
Prime_List2 <<- PrimeList[-which(sapply(PrimeList, is.null))]
Prime_List1 <<- PrimeList[-which(sapply(PrimeList, is.null))]
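One caveat with the -which(...) idiom: if the list happens to contain no NULL entries at all, which() returns integer(0), and subsetting with an empty negative index drops every element instead of none. A safer sketch that sidesteps that edge case:
# Logical indexing has no empty-index pitfall
Prime_List1 <<- PrimeList[!vapply(PrimeList, is.null, logical(1))]
# Equivalent one-liner using Filter()
Prime_List1 <<- Filter(Negate(is.null), PrimeList)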
Say I have 10 model configurations of n timesteps for 3 different sites, producing a total of 30 netCDF files I want to open and manipulate. I can open the 30 files like so:
require(ncdf4)
allfiles = list()
nmod = 10
nsites = 3
for (i in 1:nmod) {
  allfiles[[i]] = list(nc_open(paste0('Model', i, 'siteA.nc')),
                       nc_open(paste0('Model', i, 'siteB.nc')),
                       nc_open(paste0('Model', i, 'siteC.nc')))
}
When I query the class of what was opened, I get
class(allfiles)
[1] "list"
class(allfiles[[1]][[1]])
[1] "ncdf4"
as expected.
Now what I would like to do is extract the values from a variable in the files such that
var = list()
for (i in 1:nmod) {
  for (j in 1:nsites) {
    var[[i]][[j]] <- ncvar_get(allfiles[[i]][[j]], "var1")
    nc_close(allfiles[[i]][[j]])
  }
}
but I get this error message:
`Error in *tmp*[[i]] : subscript out of bounds`
If I try
var[[i]] <- ncvar_get(allfiles[[i]][[j]],"var1")
it (understandably) only produces a list of 10 model configurations at one site, i.e. var[[1]][[1]][1] prints out the value of the variable at model configuration 1, site A, timestep 1, but var[[1]][[2]] doesn't exist.
How can I declare var in the above loop so that it contains all the values for all models, all sites and all timesteps (e.g. for var[[1]][[2]][1] to exist)?
In your original version where the error occurs, the first pass of the inner loop tries var[[1]][[1]] <- something, but var[[1]] doesn't exist yet, so R doesn't know what to do. The following should therefore work: set var[[i]] <- list() before you do var[[i]][[j]] <- something:
var = list()
for (i in 1:nmod) {
  var[[i]] <- list()
  for (j in 1:nsites) {
    var[[i]][[j]] <- ncvar_get(allfiles[[i]][[j]], "var1")
    nc_close(allfiles[[i]][[j]])
  }
}
For example, if you do:
var <- list()
for (i in 1:10) {
  for (j in 1:10) {
    var[[i]][[j]] <- 1
  }
}
Then the same error happens. But if you set var[[i]] <- list() before carrying out the inner loop like this:
var <- list()
for (i in 1:10) {
  var[[i]] <- list()
  for (j in 1:10) {
    var[[i]][[j]] <- 1
  }
}
Then the problem will be solved.
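Another option, sketched below under the same nmod, nsites, and allfiles as the question, is to preallocate both levels up front with vector("list", n), which sidesteps the initialisation question entirely:
var <- vector("list", nmod)            # outer list, one slot per model
for (i in 1:nmod) {
  var[[i]] <- vector("list", nsites)   # inner list, one slot per site
  for (j in 1:nsites) {
    var[[i]][[j]] <- ncvar_get(allfiles[[i]][[j]], "var1")
    nc_close(allfiles[[i]][[j]])
  }
}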
I am having an issue with my code. I have a for loop that identifies all "strong" tags in an HTML document and then identifies the row number of a given word in the HTML. For any instances where the row numbers match, I want it to make note of that row number. I have it so far, but it fails if there is an instance of the word outside of the rows where the strong tag is.
url <- readLines("http://afip.gob.ar/contacto")
tagname <- NULL
identifier <- NULL
IDintag <- NULL
rowst <- NULL
rowend <- NULL
data <- NULL
tag <- as.matrix(grep("<strong>", url))
if (length(tag) > 0) {
  ID <- grep("Telef|Numero", url)
  for (i in 1:length(ID)) {
    IDintag[i] <- grep(ID[i], tag)
  }
  for (i in 1:length(IDintag)) {
    tagname[i] <- tag[IDintag[i]]
  }
  for (i in 1:length(tagname)) {
    rowst[i] <- which(grepl(tagname[i], tag))
    rowend[i] <- tag[rowst[i] + 1, ] - 1
    data[i] <- toString(url[tagname[i]:rowend[i]])
  }
}
This works like a dream, but if I change the URL to one where the ID terms occur where the strong tag doesn't, it fails. For example:
url <- readLines("http://www2.le.ac.uk/contact")
tagname <- NULL
identifier <- NULL
IDintag <- NULL
rowst <- NULL
rowend <- NULL
data <- NULL
tag <- as.matrix(grep("<h2>", url))
if (length(tag) > 0) {
  ID <- grep("Telef|Numero|phone", url)
  for (i in 1:length(ID)) {
    IDintag[i] <- grep(ID[i], tag)
  }
  for (i in 1:length(IDintag)) {
    tagname[i] <- tag[IDintag[i]]
  }
  for (i in 1:length(tagname)) {
    rowst[i] <- which(grepl(tagname[i], tag))
    rowend[i] <- tag[rowst[i] + 1, ] - 1
    data[i] <- toString(url[tagname[i]:rowend[i]])
  }
}
Thanks in advance
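For what it's worth, the failure is most likely the line IDintag[i] <- grep(ID[i], tag): when an ID term occurs on a row where no tag does, grep() returns integer(0), and assigning a zero-length value into IDintag[i] raises "replacement has length zero". A minimal guard, assuming that diagnosis is right:
for (i in 1:length(ID)) {
  hit <- grep(ID[i], tag)
  if (length(hit) == 0) next  # term occurs outside any tagged row: skip it
  IDintag[i] <- hit[1]        # keep only the first match if there are several
}
The later loops would then still need to drop the NA entries this leaves behind for the skipped terms.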
I have the following code (a nested for loop) in R which is extremely slow. The loop matches values from two columns, then picks up a corresponding file and iterates through it to find a match, then picks up that row from the file. The iterations could go to more than 100,000. Could someone please provide insight on how to speed up the process?
for (i in 1:length(Jaspar_ids_in_Network)) {
  m <- Jaspar_ids_in_Network[i]
  gene_ids <- as.character(GeneTFS$GeneIds[i])
  gene_names <- as.character(GeneTFS$Genes[i])
  print("i")
  print(i)
  for (j in 1:length(Jaspar_ids_in_Exp)) {
    l <- Jaspar_ids_in_Exp[j]
    print("j")
    print(j)
    if (m == l) {
      check <- as.matrix(read.csv(file = paste0(dirpath, listoffiles[j]), sep = ",", header = FALSE))
      data_check <- data.frame(check)
      for (k in 1:nrow(data_check)) {
        gene_ids_JF <- as.character(data_check[k, 3])
        genenames_JF <- as.character(data_check[k, 4])
        if (gene_ids_JF == gene_ids) {
          GeneTFS$Source[i] <- as.character(data_check[k, 3])
          data1 <- rbind(data1, cbind(as.character(data_check[k, 3]),
                                      as.character(data_check[k, 8]),
                                      as.character(data_check[k, 9]),
                                      as.character(data_check[k, 6]),
                                      as.character(data_check[k, 7]),
                                      as.character(data_check[k, 5])))
        } else if (toupper(genenames_JF) == toupper(gene_names)) {
          GeneTFS$Source[i] <- as.character(data_check[k, 4])
          data1 <- rbind(data1, cbind(as.character(data_check[k, 4]),
                                      as.character(data_check[k, 5]),
                                      as.character(data_check[k, 6]),
                                      as.character(data_check[k, 7]),
                                      as.character(data_check[k, 8]),
                                      as.character(data_check[k, 2])))
        } else {
          # GeneTFS[i, 4] <- "No Evidence"
        }
      }
    } else {
      # GeneTFS[i, 4] <- "Record Not Found"
    }
  }
}
If you pull out the logic for processing one pair into a function, f(m,l), then you could replace the double loop with:
outer(Jaspar_ids_in_Network, Jaspar_ids_in_Exp, Vectorize(f))
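To make the pattern concrete, here's a toy sketch; the real f(m, l) would contain the body of your inner loops (the CSV read and the row matching), and the IDs below are made up:
# Toy pairing function standing in for the real per-pair logic
f <- function(m, l) {
  if (m == l) paste("match:", m) else NA_character_
}

Jaspar_ids_in_Network <- c("MA0001", "MA0002", "MA0003")  # made-up IDs
Jaspar_ids_in_Exp <- c("MA0002", "MA0004")

# One call of f per (network, experiment) pair, returned as a 3 x 2 matrix
outer(Jaspar_ids_in_Network, Jaspar_ids_in_Exp, Vectorize(f))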