I've been working on scraping data from the web and manipulating it in R, but I'm having some trouble when I begin to include the dreaded for loop. I am working with some arbitrary sport statistics, with the idea being that I can calculate the per game average for various stats for each player.
Each of these pieces works outside of the loop, but falls apart inside. Ideally, my code will do three things:
1) Scrape the data. I have a list of player names "names" (by row), with the unique piece of the URL for each player in column 2. The website has a table named "stats" on each page, which is nice of them.
library(XML)
statMean <- matrix(ncol = 8, nrow = 20)
for(h in 1:20){
  webname <- names[h, 2]
  vurl <- paste("http://www.pro-football-reference.com/players/",
                webname, "/gamelog/2015")
  tables <- readHTMLTable(vurl)
  t1 <- tables$stats
2) Pick the columns I want and convert the values to numeric.
  temp <- t1[, c(9, 10, 12, 13, 14, 15, 17)]
  temp <- sapply(temp, function(x) as.numeric(as.character(x)))
3) Calculate the mean of each column, bind the unique player name to the column means, and rbind that vector to a full table.
  statMean <- rbind(statMean, c(webname, colMeans(temp)))
}
When I run through these steps outside of the loop, it seems to work ... when I run the loop I inevitably get:
Error in colMeans(temp) : 'x' must be an array of at least two dimensions
I've looked at a number of for-loop questions on this site, and my code looks a lot different from how it started. Unfortunately, each time I try something new, I end up with a version of the above error.
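For reference, here is the whole loop assembled in one piece. The paste0() call (so the URL has no stray spaces), the table check, and starting statMean empty rather than as a pre-filled matrix are just things I've been experimenting with, not confirmed fixes:

library(XML)

statMean <- NULL   # start empty so rbind() does not stack results under 20 NA rows

for (h in 1:20) {
  webname <- names[h, 2]
  # paste0() avoids the spaces that paste() inserts between its arguments
  vurl <- paste0("http://www.pro-football-reference.com/players/",
                 webname, "/gamelog/2015")
  tables <- readHTMLTable(vurl)
  t1 <- tables$stats

  # skip players whose "stats" table is missing or has fewer than two rows
  if (is.null(t1) || nrow(t1) < 2) next

  temp <- t1[, c(9, 10, 12, 13, 14, 15, 17)]
  temp <- sapply(temp, function(x) as.numeric(as.character(x)))

  statMean <- rbind(statMean, c(webname, colMeans(temp, na.rm = TRUE)))
}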
Any help would be fantastic. Thanks.
Related
I am trying to run a function to download data from the USGS website using the dataRetrieval package in R and a function I have created called getstreamflow. The code is the following:
siteNumber <- c("094985005","09498501","09489500","09489499","09498502")
Streamflow <- sapply(siteNumber, function(siteNumber)
  tryCatch(getstreamflow(siteNumber),
           error = function(e) message(paste("Error in station ", siteNumber))))
Streamflow <- Filter(NROW, Streamflow)  # to delete empty data frames
I got the output I want, which is the one shown in the image below.
However, when I run the same code but increase the number of stations in the input siteNumber, the output changes: instead of producing several data frames inside a list, it generates a list for each data frame.
Does someone know why this happens? It is the same function; the only change is the number of stations in siteNumber.
Based on the image shown for the new data, each element in the list is nested as a list. We can extract the list element (of length 1) with [[1]] and then apply the Filter:
out <- Filter(NROW, lapply(Streamflow, function(x) x[[1]]))
As we used NROW, lists pass the test as well: NROW returns the length of a list (here 1), so all the elements meet the condition. Also, in the previous step, the OP uses sapply, and sapply is a function that can sometimes simplify its output. Instead of sapply, use lapply (or specify simplify = FALSE).
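A minimal sketch of that change, keeping the tryCatch wrapper from the question (getstreamflow is the OP's own function and is assumed to be available):

Streamflow <- lapply(siteNumber, function(s)
  tryCatch(getstreamflow(s),
           error = function(e) message(paste("Error in station ", s))))
# the same call via sapply with simplify = FALSE gives an identical result
Streamflow <- Filter(NROW, Streamflow)  # drop the empty results as before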
I have a dataset like this:
contingency_table <- tibble::tibble(
  x1_not_happy = c(1, 4),
  x1_happy = c(19, 31),
  x2_not_happy = c(1, 4),
  x2_happy = c(19, 28),
  x3_not_happy = c(14, 21),
  X3_happy = c(0, 9),
  x4_not_happy = c(3, 13),
  X4_happy = c(17, 22)
)
In fact, there are many other variables that come from a poll applied in two different years.
Then, I apply a Fisher test to each 2x2 contingency matrix, using this code:
matrix1_prueba <- contingency_table[1:2, 1:2]
matrix2_prueba <- contingency_table[1:2, 3:4]
fisher1 <- fisher.test(matrix1_prueba, alternative = "two.sided", conf.level = 0.9)
fisher2 <- fisher.test(matrix2_prueba, alternative = "two.sided", conf.level = 0.9)
I would like to run this task with shorter code, by means of a function or a loop. The output must be a vector with the p-values for each question.
Thanks,
Frederick
So this was a bit of fun to do. The main thing that you need to recognize is that you want combinations of your data. There are a number of functions in R that can do that for you. The main workhorse is combn().
So in the language of the problem, we want all combinations of your tibble's columns taken 2 at a time.
From there, you just need to do some looping structure to get your tests to work, and extract the p-values from the object.
list_tables <- lapply(combn(contingency_table, 2, simplify = FALSE), fisher.test)
unlist(lapply(list_tables, `[`, 'p.value'))
This should produce your answer.
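If a plain, unnamed numeric vector of the p-values is preferred, extracting with [[ instead of [ is a small variation on the same idea:

p_values <- sapply(list_tables, `[[`, "p.value")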
EDIT
Given the updated requirement for just adjacent data.frame columns, the following modifications should work.
full_list <- combn(contingency_table, 2, simplify = FALSE)
full_list <- full_list[sapply(full_list, function(x)
  all(startsWith(names(x), substr(names(x)[1], 1, 2))))]
full_list <- lapply(full_list, fisher.test)
unlist(lapply(full_list, `[`, 'p.value'))
This is approximately the same code as before, but now we have to find the subsets of the data that share the same question prefix. This only works if the prefixes are exactly the same (X3 != x3). I think this is a better solution than trying to work with column indexes, which carry no guarantee that paired columns are always next to one another. The sapply code does just that filtering. The final output should be what you need for the problem.
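If, instead, the not_happy/happy columns are guaranteed to come in adjacent pairs by construction, an index-based loop over column pairs is another option. This is only a sketch and relies entirely on that ordering assumption; the alternative and conf.level settings are carried over from the question:

p_values <- sapply(seq(1, ncol(contingency_table) - 1, by = 2), function(j)
  fisher.test(as.matrix(contingency_table[, j:(j + 1)]),
              alternative = "two.sided", conf.level = 0.9)$p.value)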
I'm relatively new to programming, and have found the following link incredibly helpful in trying to create new variables from a large dataset using a loop:
Change variable name in for loop using R
i.e.
for (i in 1:13){
assign(paste("conD",i, sep=""), (subset(con,day==i)))
}
This produces a set of 13 variables from a dataset of control biological samples, with each subset containing data from one day in the time series day1-day13; i.e., "conD1" contains data from controls analysed on day 1 (the dimensions of each df are generally 3x4000). This worked perfectly.
However, I now wish to create 13 more variables, "medconD1" to "medconD13", each 1x4000, holding the median of each column, by recalling the created variables conD1-conD13 within a second loop.
I have tried the following code, but this doesn't work:
for (i in 1:13){
assign(paste("medconD",i, sep=""), (apply(conD[i][8:n], 2, FUN = median)))
}
Can anyone help or point out where I am going wrong here? How do I recall the conD[i] variables in this second part of code? This will save me weeks of work with large datasets!
Thank you
All sorted out now (thanks, helpers!). The answer was to add them to a list as I made them, then recall them from the list:
conlist <- c()
for (i in 1:13){
  x <- assign(paste("conD", i, sep = ""), subset(con, day == i))
  conlist <- c(conlist, list(x))
}

medconlist <- c()
for (i in 1:13){
  x <- assign(paste("medconD", i, sep = ""), apply(conlist[[i]][, 8:n], 2, FUN = median))
  medconlist <- c(medconlist, list(x))
}
I have a large data.frame called rain with information on many species measured in different plots at different times (censuses), from which I want to extract information. This data frame has many columns, and in dataF2 I want to keep the same structure; however, I want to extract from rain the information of the penultimate census (Census.No is one of the columns of rain) in each plot (Plot.Code is another one). In idx3 I have the number of the penultimate census for each plot.
It's easy to do for one plot:
data1<- rain[Plot.Code==idx3[1,1] & Census.No==idx3[1,2],]
I've been trying to write for loops in R, but I keep overwriting my data frame and ending up with just the last iteration.
dataF2<- data.frame(nrow= nrow (rain), ncol = ncol (rain))
summary (dataF2)
for (i in 1:length(idx3[,1])){
  dataF2 <- rain[Plot.Code==idx3[i,1] & Census.No==idx3[i,2],]
}
Here I want to extract from a data frame the information of the penultimate census in each plot (idx3 contains, for each plot, the number of its penultimate census).
I've tried many things, like:
dataF2<- data.frame(nrow= nrow (rain), ncol = ncol (rain))
for (i in 1:length(idx3[,1])){
  data1 <- rainfor[Plot.Code==idx3[i,1] & Census.No==idx3[i,2],]
  dataF2 <- rbind(data1[i])
}
But nothing worked... my problem is that it keeps overwriting dataF2!
Cheers!!!
Your clarifications in the comments helped somewhat, but reproducible examples are always better. Let's start at the beginning:
dataF2<- data.frame(nrow= nrow (rain), ncol = ncol (rain))
This is wrong. I think that you're trying to create an empty data frame with the same dimensions as your data frame rain. If you examine dataF2 you'll see that this line produces something quite different. If you read the documentation for the function (?data.frame) it will become clear that there are no arguments called nrow and ncol. What you probably intended was something like this:
dataF2 <- rain
dataF2[] <- NA
Inside your for loop you are overwriting your entire data frame because... you are overwriting your entire data frame.
dataF2<- rain[Plot.Code==idx3[i,1] & Census.No==idx3[i,2],]
This assigns something to dataF2, replacing it completely. If you want to assign to just a single row of dataF2 you need to assign to that specific row:
dataF2[i,] <- rain[Plot.Code==idx3[i,1] & Census.No==idx3[i,2],]
I can't absolutely assure that this will work correctly, since you haven't provided a sufficiently detailed example, so I'm not sure that all the dimensions will coincide properly when you index on i. But this is the basic idea.
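Putting those pieces together, the whole loop might look roughly like this (still only a sketch, with the same caveat about dimensions; the rain$ prefixes assume Plot.Code and Census.No are columns of rain rather than free-standing vectors):

dataF2 <- rain    # copy the structure of rain...
dataF2[] <- NA    # ...then blank out the contents

for (i in seq_len(nrow(idx3))) {
  # keep the row of rain matching the penultimate census for plot i
  dataF2[i, ] <- rain[rain$Plot.Code == idx3[i, 1] &
                        rain$Census.No == idx3[i, 2], ]
}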
I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernel.area(df3[, j:(j+1)], id = id, kern = "bivnorm", unin = c("m"), unout = c("km2"))
  print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200), and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now get the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has higher precedence than + (see ?Syntax for the order in which mathematical operations are evaluated in R). To get the desired two columns, use j:(j+1) instead.
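A quick illustration of that precedence point:

j <- 3
j:j+1      # evaluated as (j:j) + 1, which is the single value 4
j:(j+1)    # the two column indices you want: 3 4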
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  kernAr <- kernel.area(kud, unin = c("m"), unout = c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something; note that kernelUD() is called before kernel.area().
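To collect those results instead of just printing them (the stated end goal was to combine them in a data frame), the loop could be extended along these lines. This is only a sketch: the final do.call(cbind, ...) step assumes the objects returned by kernel.area() can be column-bound, which may need adjusting.

results <- list()

for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  kernAr <- kernel.area(kud, unin = c("m"), unout = c("km2"))
  # store the result for this resampling event
  results[[length(results) + 1]] <- kernAr
}

# one column (or block of columns) per resampling event
combined <- do.call(cbind, results)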