I have a CSV dataset (call it data) as follows:
CLASS CoverageT1 CoverageT2 CoverageT3
Gamma 90 80 75
Gamma 89 72 79
Gamma 92 86 75
Alpha 50 80 67
Alpha 53 78 60
Alpha 58 81 75
I would like to retrieve the unique classes and calculate the average for each coverage column.
What I've done so far is the following:
classes <- subset(data, select = c(CLASS))
unique_classes <- unique(classes)
for(x in unique_classes){
cove <- subset(data, CLASS == x , select=c(CoverageT1:CoverageT3))
average <- colMeans(cove)
print(cove)
}
As a result, I got the following results:
CoverageT1 CoverageT2 CoverageT3
1 90 80 75
3 92 86 75
4 50 80 67
6 58 81 75
I want to retrieve the coverage values for each class and then calculate the average. When I print the retrieved coverage values, I get only some of the rows; the others are missing!
Can someone help me solve this issue?
Thanks
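Your code isn't working because, amongst other things, unique_classes is a one-column data frame, so the for loop runs only once, with x set to the whole column (Gamma, Alpha). CLASS == x then recycles that two-element vector down the rows, which is why only rows 1, 3, 4 and 6 come back. You are also assigning to average on each iteration, so the previous value is lost.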
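If you would rather keep the loop, here is a minimal sketch of one way to repair it. It assumes the data frame from the question (rebuilt here with read.table), iterates over the class labels as a plain character vector, and collects each result in a list; the name averages is my own, not something from the question.

```r
# Rebuild the example data from the question
data <- read.table(text = "CLASS CoverageT1 CoverageT2 CoverageT3
Gamma 90 80 75
Gamma 89 72 79
Gamma 92 86 75
Alpha 50 80 67
Alpha 53 78 60
Alpha 58 81 75", header = TRUE)

# unique() on the column itself gives a plain vector,
# so the loop sees one class label per iteration
unique_classes <- unique(as.character(data$CLASS))

averages <- list()  # hypothetical name: one entry per class
for (x in unique_classes) {
  cove <- data[data$CLASS == x, c("CoverageT1", "CoverageT2", "CoverageT3")]
  averages[[x]] <- colMeans(cove)  # stored by class, so nothing is overwritten
}

# Stack the per-class means into a matrix with one row per class
averages <- do.call(rbind, averages)
averages
```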
There are several ways to do what you are trying to do. This would be my approach:
library(dplyr)
data %>% group_by(CLASS) %>% summarise_all(mean)
Another option using aggregate
aggregate(. ~ CLASS, data = data, FUN = mean)
Taking your idea and wrapping it in by.
xy <- read.table(text = "CLASS CoverageT1 CoverageT2 CoverageT3
Gamma 90 80 75
Gamma 89 72 79
Gamma 92 86 75
Alpha 50 80 67
Alpha 53 78 60
Alpha 58 81 75", header = TRUE)
out <- by(data = xy[, -1], INDICES = list(xy$CLASS), FUN = colMeans)
out <- do.call(rbind, out)
out
CoverageT1 CoverageT2 CoverageT3
Alpha 53.66667 79.66667 67.33333
Gamma 90.33333 79.33333 76.33333
This is how I solved it:
coverage_all <- aggregate(coverage , list(class=data$CLASS), mean)
I have a data frame with 214 columns and many rows, and I want to perform a Fisher's exact test for each row using values from 4 columns.
An example subset of relevant information from my dataframe looks like:
Variant DB.count.1 DB.count.2 pop.count.1 pop.count.2
A 23 62 35 70
B 81 4 39 22
C 51 42 49 52
D NA NA 65 8
E 73 21 50 33
F 72 13 81 10
G 61 32 75 21
H NA NA 42 22
I NA NA 60 20
J 80 12 72 24
I am trying to use a for loop to:
create a contingency table for each row for the Fisher's exact test to compare DB.counts to pop.counts
run a Fisher's exact test using this contingency table to determine if there is a difference between DB.counts and pop.counts
output the p-value result to a new column on my dataframe
As you can see, there are "NA" values in some positions and therefore in some contingency tables. These will cause an error, which is OK, but when that happens I would like the code to write a placeholder such as "." or "error" to the column and skip to the next row/contingency table.
i.e. I would like an output which looks like this:
Variant DB.count.1 DB.count.2 pop.count.1 pop.count.2 fishers
A 23 62 35 70 0.4286
B 81 4 39 22 <0.0001
C 51 42 49 52 0.3921
D NA NA 65 8 error
E 73 21 50 33 0.0143
F 72 13 81 10 0.5032
G 61 32 75 21 0.0744
H NA NA 42 22 error
I NA NA 60 20 error
J 80 12 72 24 0.0425
The code I currently have (based on R loop over Fisher test - Error message) is:
df$fishers" <- for (i in 1:nrow(df))
{
table <- matrix(c(df[i,4], df[i,5], df[i,2], df[i,3]), ncol = 2, byrow = TRUE)
fisher.test(table, alternative="greater")
}
This seems to create the contingency tables the way I want, but the problem of bypassing the errors and writing the p-value to the new column remains. I have tried to use try and tryCatch but have been unsuccessful in doing so.
I am an R beginner, so I would really appreciate any advice on how to improve my question or how to solve my problem! Thank you!
Edit 1: I have now tried using the data.table package as below and have got what I need from data sets with no "NA" values but how do I skip the errors and make the code continue? Thanks!!!
library(data.table)
dt <- data.table(df)
dt[, p.val := fisher.test(matrix(c(pop.count.1, pop.count.2, DB.count.1, DB.count.2), ncol=2), workspace=1e9)$p.value, by=Variant]
df <- as.data.frame(dt)
You can include an if-else statement in your loop like this:
res <- NULL
for (i in 1:nrow(df)){
table <- matrix(c(df[i,4], df[i,5], df[i,2], df[i,3]), ncol = 2, byrow = TRUE)
# if any NA occurs in your table save an error in p else run the fisher test
if(any(is.na(table))) p <- "error" else p <- fisher.test(table, alternative="greater")$p.value
# save all p values in a vector
res <- c(res,p)
}
df$fishers <- res
Or put the code in a function and use apply instead of a loop:
foo <- function(y){
# include here as.numeric to be sure that your values are numeric:
table <- matrix(as.numeric(c(y[4], y[5], y[2], y[3])), ncol = 2, byrow = TRUE)
if(any(is.na(table))) p <- "error" else p <- fisher.test(table, alternative="greater")$p.value
p
}
df$fishers <- apply(df, 1, foo)
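Since the question mentions try and tryCatch, here is a hedged sketch of that route as well. It assumes a small data frame with the column names from the question, and it stores NA_real_ rather than the string "error" so that the column stays numeric; the helper name safe_fisher is made up for this example.

```r
# A few rows from the question, including one with missing counts
df <- data.frame(
  Variant     = c("A", "B", "D"),
  DB.count.1  = c(23, 81, NA),
  DB.count.2  = c(62, 4, NA),
  pop.count.1 = c(35, 39, 65),
  pop.count.2 = c(70, 22, 8)
)

# Hypothetical helper: run the test, return NA instead of stopping on an error
safe_fisher <- function(counts) {
  tab <- matrix(counts, ncol = 2, byrow = TRUE)
  tryCatch(
    fisher.test(tab, alternative = "greater")$p.value,
    error = function(e) NA_real_  # fisher.test() rejects tables containing NA
  )
}

cols <- c("pop.count.1", "pop.count.2", "DB.count.1", "DB.count.2")
df$fishers <- vapply(
  seq_len(nrow(df)),
  function(i) safe_fisher(unlist(df[i, cols])),
  numeric(1)
)
df
```

Keeping the column numeric means you can still sort or filter on the p-values; is.na(df$fishers) marks the rows that could not be tested.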
Hi, I have never edited a question of mine before, but I'll give it a try. What the code actually does is not that important. For me, the only relevant part is saving the "liste" vectors in a new list :D
test <- list()
test <- replicate(5, sample(1:100, 50), simplify = FALSE) # Creates a list of 5 vectors
> test[[1]]
[1] 90 96 20 86 32 77 83 33 64 29 88 97 78 81 40 60 89 19 31 59 26 38 34 71 5 80 85
[28] 3 70 87 41 50 6 18 37 58 9 76 91 62 12 30 42 94 72 95 100 10 68 82
S <- test[[1]]
x <- diff(S) # the following algorithm creates "liste" (a vector) for test[[1]]
trendtest <- list()
k <- NULL
d <- NULL
t <- vector("list",length(x))
A <- vector("list",length(x))
z <- vector("list",length(x)-2)
za <- vector("list",length(x)-2)
liste <- NULL
dreisum <- sapply(1:(length(x)-2), function(i) sum(x[c(i,(i+1))]))
dreisumi <- lapply(1:(length(x)-2), function(i) dreisum[i:(length(x)-2)])
zdreisumi<- lapply(1:(length(x)-4), function(i) dreisumi[[i]] [3:length(dreisumi[[i]])]<0)
zadreisumi<- lapply(1:(length(S)-4), function(i) dreisumi[[i]][3:length(dreisumi[[i]])]>0)
Si <- lapply(1:(length(x)-2), function(i) S[i:(length(x))])
i <- 1
h <- 1
while(i<(length(x)-3) & h!=Inf){
k <- c(k,k <- (S[i]-S[i+2])/(-2))
d <- c(d,d <- (S[i+2]*i-S[i]*(i+2))/(-2))
t[[i]] <- i:(length(x))
A[[i]] <- k[length(liste)+1]*t[[i]]+d[length(liste)+1]
A[[i]][3] <- S[i+2]
z[[i]] <- Si[[i]][3:length(Si[[i]])]<A[[i]][3:length(A[[i]])]
za[[i]] <- Si[[i]][3:length(Si[[i]])]>A[[i]][3:length(A[[i]])]
if(k[length(liste)+1]>0 & S[i+3]>A[[i]][4] & is.element(TRUE,z[[i]])){h <- (min(which(z[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]>0 & S[i+3]<A[[i]][4] & is.element(TRUE,za[[i]])){h <- (min(which(za[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]<0 & S[i+3]>A[[i]][4] & is.element(TRUE,z[[i]])){h <- (min(which(z[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]<0 & S[i+3]<A[[i]][4] & is.element(TRUE,za[[i]])){h <- (min(which(za[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]>0 & S[i+3]>A[[i]][4] & (all(z[[i]]==FALSE))){h <- (min(which(zdreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]>0 & S[i+3]<A[[i]][4] & (all(za[[i]]==FALSE))){h <- (min(which(zdreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]<0 & S[i+3]>A[[i]][4] & (all(z[[i]]==FALSE))){h <- (min(which(zadreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]<0 & S[i+3]<A[[i]][4] & (all(za[[i]]==FALSE))){h <- (min(which(zadreisumi[[i]]!=FALSE))+2)}}}}}}}}
liste <- c(liste,i)
i <- i+h-1
if((length(x)-3)<=i & i<=length(x)){liste <- c(liste,i)}}
> liste
[1] 1 3 7 10 12 16 18 20 24 27 30 33 36 39 41 46
The code as a whole is not that interesting for my problem, because it works! The example above is for test[[1]] only. What I want is a for loop (or anything else) that processes ALL vectors in "test" and saves ALL 5 resulting "liste" vectors in a new list (let's call it "trendtest" ... whatever :D)
The following will do what you ask for:
Delete the line trendtest <- list().
Take the code from x <- diff(S) to the last line (except the very last line, which only prints liste) and insert it at the position indicated by the placeholder __CODE_HERE__.
trendtest <- lapply(test, FUN = function(S) {
__CODE_HERE__
return(liste)
})
This is the "R way" of doing what you want. Alternatively, you could do the following (which is closer to your initial approach, but less of the "R way"):
trendtest <- vector("list", length(test))
for (u in 1:length(test)) { # better: u in seq_along(test)
S <- test[[u]]
__CODE_HERE__
trendtest[[u]] <- liste
}
Note that there will be an error message which is due to the sample data (which doesn't fit the algorithm provided) and is unrelated to saving liste in trendtest.
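To see the pattern run end to end without the error, here is a toy sketch in which the algorithm is swapped for a trivial stand-in. The body of the function (the positions where the series rises) is purely hypothetical and only illustrates how each liste ends up in trendtest.

```r
set.seed(42)
test <- replicate(5, sample(1:100, 50), simplify = FALSE)  # list of 5 vectors

trendtest <- lapply(test, function(S) {
  x <- diff(S)
  # Stand-in for __CODE_HERE__: NOT the original algorithm,
  # just the positions where the series increases
  liste <- which(x > 0)
  liste
})

length(trendtest)  # one "liste" per vector in test
str(trendtest[[1]])
```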
I am having difficulty applying my function to a data frame in R. I have a data.frame with three columns: the ID of a point, its location on the x axis, and its location on the y axis. All I need to do is find, for a given point, the IDs of the points that lie in its neighborhood. I've written a function that checks whether a point lies within a circle centered on the observed point and returns its ID if so.
Here is my code:
point_id <- locationdata$point_id
x_loc <- locationdata$x_loc
y_loc <- locationdata$y_loc
locdata <- data.frame(point_id, x_loc, y_loc)
# radius set to 1 km
incircle3 <- function(x_loc, y_loc, center_x, center_y, pointid, r = 1000000){
dx = (x_loc-center_x)
dy = (y_loc-center_y)
if (b <- dx^2 + dy^2 <= r^2){
print(shopid)} ##else {print('')}
}
Unfortunately I don't know how to apply this function to the whole data frame, so that once I enter the location of the observed point it returns the IDs of all points that lie in its neighborhood. Ideally I would find this relation for all the points automatically, so that it returns the points that lie in the neighborhood of each point in the dataset. Previously I have been entering center_x and center_y manually.
Thank you very much in advance for your advice!
You can tackle this with R's dist function:
# set the random seed and create some dummy data
set.seed(101)
dummy <- data.frame(id=1:100, x=runif(100), y=runif(100))
> head(dummy)
id x y
1 1 0.37219838 0.12501937
2 2 0.04382482 0.02332669
3 3 0.70968402 0.39186128
4 4 0.65769040 0.85959857
5 5 0.24985572 0.71833452
6 6 0.30005483 0.33939503
Call the dist function which returns a dist object. The default distance metric is Euclidean which is what you have coded in your question.
dists <- dist(dummy[,2:3])
Loop over the distance matrix and return the indices for each id that are within some constant distance:
neighbors <- apply(as.matrix(dists), 1, function(x) which(x < 0.33))
> neighbors[[1]]
1 6 7 8 19 23 30 32 33 34 42 44 46 51 55 87 88 91 94 99
Here's a modification to handle arbitrary ids (ones that are not simply the row numbers):
set.seed(101)
dummy <- data.frame(id=sample(1:100, 100), x=runif(100), y=runif(100))
> head(dummy)
id x y
1 38 0.12501937 0.60567568
2 5 0.02332669 0.56259740
3 70 0.39186128 0.27685556
4 64 0.85959857 0.22614243
5 24 0.71833452 0.98355758
6 29 0.33939503 0.09838715
dists <- dist(dummy[,2:3])
neighbors <- apply(as.matrix(dists), 1, function(x) {
dummy$id[which(x < 0.33)]
})
names(neighbors) <- dummy$id
> neighbors[['38']]
[1] 38 5 55 80 63 76 17 71 47 11 88 13 41 21 36 31 73 61 99 59 39 89 94 12 18 3
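If you would rather stay close to the circle test from the question, a vectorised version is one option. The sketch below assumes the column names point_id, x_loc and y_loc from the question; the toy coordinates, the radius of 5 and the function name points_in_circle are invented for illustration.

```r
# Toy data with the column names used in the question
locdata <- data.frame(
  point_id = c(11, 12, 13, 14),
  x_loc    = c(0, 3, 0, 10),
  y_loc    = c(0, 3, 1, 10)
)

# Hypothetical helper: IDs of all points within radius r of a given centre
points_in_circle <- function(locdata, center_x, center_y, r) {
  dx <- locdata$x_loc - center_x
  dy <- locdata$y_loc - center_y
  locdata$point_id[dx^2 + dy^2 <= r^2]  # comparison is vectorised over all rows
}

# Neighbours of a single centre entered by hand
points_in_circle(locdata, center_x = 0, center_y = 0, r = 5)

# Neighbours of every point in the data set, keyed by point ID
neighbours <- lapply(seq_len(nrow(locdata)), function(i) {
  points_in_circle(locdata, locdata$x_loc[i], locdata$y_loc[i], r = 5)
})
names(neighbours) <- locdata$point_id
neighbours
```

Each point appears in its own neighbourhood because its distance to itself is zero; drop it with setdiff() if that is not wanted.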
I'm interested in developing a modified bootstrap that samples some vector of length x, with replacement, but must meet a number of criteria before the sampling stops. I'm attempting to calculate confidence intervals for lambda of a population's growth rate, using 10000 iterations, but in some groupings of individuals, say vector 13, there are very few individuals growing out of the group. Typical bootstrapping would lead to a fair number of instances where growth in this vector does not occur, and hence the model falls apart. Each vector consists of a certain number of 1's, 2's, and 3's, where 1 represents staying within a group, 2 growing out of a group, and 3 death. Here is what I have so far without the modification; it is likely not the best approach time-wise, but I am new to R.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagestay <- vector()
stagemoved <- vector()
stagedead <- vector()
for(i in 1:n){
index <- sample(stage, replace=T)
stay <- ((length(index[index==1]))/(length(index)))
moved <- ((length(index[index==2]))/(length(index)))
stagestay <- rbind(stagestay,stay)
stagemoved <- rbind(stagemoved,moved)
}
Currently, this samples
My question is then: In what way can I modify the sample function to continue sampling these numbers until the length of "index" is at least the same as st13 AND until at least 1 instance of a 2 is present in "index"?
Thanks very much,
Kristopher Hennig
Masters Student
University of Mississippi
Oxford, MS, 38677
Update:
The answer from @lselzer reminded me that the requirement was for the length of the sample to be at least as long as st13. My original code just keeps sampling until it finds a bootstrap sample that contains a 2. The code of @lselzer grows the sample, one new element at a time, until the sample contains a 2. This is quite inefficient, as you might have to call sample() many times until you get a 2. My code, in turn, might repeat for a long time before a sample containing a 2 is returned. So can we do any better?
One way would be to draw a large sample with replacement using a single call to sample(). Check which entries are 2s and see if there is a 2 within the first length(st13) entries. If there is, return those entries; if not, find the first 2 in the large sample and return all entries up to and including that one. If there are no 2s, add on another large sample and repeat. Here is some code:
#runs
n <- 100 #00
stage <- st13
stagedead <- stagemoved <- stagestay <- Size <- vector()
sampSize <- 100 * (len <- length(stage)) ## sample size to try
for(i in seq_len(n)){
## take a large sample
samp <- sample(stage, size = sampSize, replace = TRUE)
## check if there are any `2`s and which they are
## and if no 2s expand the sample
while(length((twos <- which(samp == 2))) < 1) {
samp <- c(samp, sample(stage, size = sampSize, replace = TRUE))
}
## now we have a sample containing at least one 2
## so set index to the required set of elements
if((min.two <- min(twos)) <= len) {
index <- samp[seq_len(len)]
} else {
index <- samp[seq_len(min.two)]
}
stay <- length(index[index==1]) / length(index)
moved <- length(index[index==2]) / length(index)
stagestay[i] <- stay
stagemoved[i] <- moved
Size[i] <- length(index)
}
Here is a really degenerate vector with only a single 2 in 46 entries:
R> st14 <- sample(c(rep(1, 45), 2))
R> st14
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1
If I use the above loop on it rather than st13, I get the following for the minimum sample size required to get a 2 on each of the 100 runs:
R> Size
[1] 65 46 46 46 75 46 46 57 46 106 46 46 46 66 46 46 46 46
[19] 46 46 46 46 46 279 52 46 63 70 46 46 90 107 46 46 46 87
[37] 130 46 46 46 46 46 46 60 46 167 46 46 46 71 77 46 46 84
[55] 58 90 112 52 46 53 85 46 59 302 108 46 46 46 46 46 174 46
[73] 165 103 46 110 46 80 46 166 46 46 46 65 46 46 46 286 71 46
[91] 131 61 46 46 141 46 46 53 47 83
So it would suggest that the sampSize I chose (100 * length(stage)) is a bit of overkill here, but as all the operations we are using are vectorised we probably don't incur too much of a penalty for the overly long initial sample, and we certainly don't incur any extra sample() calls.
Original:
If I understand you correctly, the problem is that sample() might not return any 2s at all. If so, we can continue sampling until it does, using the repeat control flow construct.
I've altered your code accordingly, and optimised it a bit, because you should never grow objects in a loop with rbind() the way you were doing. There are other ways this could be improved, but I'll stick with the loop for now. Explanation comes below.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagedead <- stagemoved <- stagestay <- vector()
for(i in seq_len(n)){
repeat {
index <- sample(stage, replace = TRUE)
if(any(index == 2)) {
break
}
}
stay <- length(index[index==1]) / length(index)
moved <- length(index[index==2]) / length(index)
stagestay[i] <- stay
stagemoved[i] <- moved
}
This is the main change related to your Q:
repeat {
index <- sample(stage, replace = TRUE)
if(any(index == 2)) {
break
}
}
What this does is repeat the code contained in the braces until a break is triggered to jump us out of the repeat loop. We take a bootstrap sample, then check whether the sample contains any 2s. If it does, we break out and carry on with the rest of the current for loop iteration. If the sample doesn't contain any 2s, the break is not triggered and we go round again, taking another sample. This will happen until we do get a sample with a 2 in it.
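On the point about not growing objects, here is a small sketch of the same loop with the result vectors allocated up front. It assumes a vector with the same composition as st13 (41 ones, 3 twos, 2 threes, by my count), uses a smaller n so that it runs quickly, and replaces length(index[index==1]) / length(index) with the equivalent mean(index == 1).

```r
set.seed(1)
stage <- c(rep(1, 41), rep(2, 3), rep(3, 2))  # assumed to mirror st13 (46 entries)
n <- 200                                       # smaller than 10000, for illustration

# Allocate the result vectors once instead of growing them
stagestay  <- numeric(n)
stagemoved <- numeric(n)

for (i in seq_len(n)) {
  # Resample until the bootstrap sample contains at least one 2
  repeat {
    index <- sample(stage, replace = TRUE)
    if (any(index == 2)) break
  }
  stagestay[i]  <- mean(index == 1)  # proportion staying
  stagemoved[i] <- mean(index == 2)  # proportion growing out
}

summary(stagemoved)
```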
For starters, sample has a size argument which you could use to match the length of st13. The second part of your question could be solved using a while loop.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagestay <- vector()
stagemoved <- vector()
stagedead <- vector()
for(i in 1:n){
index <- sample(stage, length(stage), replace=T)
while(!any(index == 2)) {
index <- c(index, sample(stage, 1, replace = T))
}
stay <- ((length(index[index==1]))/(length(index)))
moved <- ((length(index[index==2]))/(length(index)))
stagestay[i] <- stay
stagemoved[i] <- moved
}
While I was writing this, Gavin posted his answer, which is similar to mine, but I added the size argument to be sure index has at least the length of st13.