I have a problem that I have so far solved for only one scenario. In scenario 1, I have a data frame df and I need to plot the mean of each numeric variable.
df
A B C D E F
1 asd 29 sf 36 sf 44
2 fsd 24 gfd 56 gfd 34
3 gs 46 asd 39 asd 37
4 asd 50 gfg 26 gfg 23
5 sf 43 fg 56 fg 37
6 dfg 29 er 35 er 51
7 sdfg 32 tr 27 tr 28
8 fgdsgd 24 qw 31 qw 36
I have code to plot the mean of the numeric variables. The code is shown below:
p2 <- list()
cs <- names(Filter(is.numeric, df))
for(i in cs)
{
p2[i] <- mean(df[,i])
do.call(rbind,p2) %>% as.data.frame()
}
p2 <- as.data.frame(p2)
p2 <- unlist(p2)
p2 <- stack(p2)
ggplot(data=p2,aes(x=ind,y=values))+geom_bar(stat =
"identity")+ylab("Mean")
But I need another loop. The scenario above covers only the mean; now I also need the median, sd and many more. So I have put the function names in a vector:
gh <- list()
mea <- c("mean","median","sd")
p2 <- list()
cs <- names(Filter(is.numeric, df))
for(i in cs)
{
for(j in mea)
{
p2[i] <- gh[[j]](df[i])
do.call(rbind,p2) %>% as.data.frame()
}
}
p2 <- as.data.frame(p2)
p2 <- unlist(p2)
p2 <- stack(p2)
ggplot(data=p2,aes(x=ind,y=values))+geom_bar(stat =
"identity")+ylab("Mean")
Could you help me build scenario 2, i.e. the code above?
Ideally it should plot the mean, median and sd for all the numeric variables.
Here is how I would do it without loops, using the mtcars data set:
apply(mtcars,2,function(x){
if (is.numeric(x)){
return(c("mean"=mean(x),"median"=median(x),"sd"=sd(x)))
}
})
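Note that apply() first converts the data frame to a matrix, so on a data frame that mixes text and numeric columns (like df in the question) every column becomes character and is.numeric() is never TRUE. A minimal sketch that avoids this, assuming a small stand-in for df built from the values in the question; the names stats, num, res and long are only illustrative:

```r
# Stand-in for the question's df (one text column, numeric columns B, D, F)
df <- data.frame(
  A = c("asd", "fsd", "gs", "asd", "sf", "dfg", "sdfg", "fgdsgd"),
  B = c(29, 24, 46, 50, 43, 29, 32, 24),
  D = c(36, 56, 39, 26, 56, 35, 27, 31),
  F = c(44, 34, 37, 23, 37, 51, 28, 36),
  stringsAsFactors = FALSE
)

stats <- c("mean", "median", "sd")   # function names, as in the question
num   <- Filter(is.numeric, df)      # keep only the numeric columns

# Rows = statistics, columns = variables
res <- sapply(num, function(col) sapply(stats, function(f) match.fun(f)(col)))

# Long format (one row per statistic/variable pair), convenient for plotting
long <- as.data.frame(as.table(res), responseName = "value")
names(long)[1:2] <- c("stat", "variable")
```

The long table can then be passed to ggplot2, for example ggplot(long, aes(x = variable, y = value, fill = stat)) + geom_col(position = "dodge").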
Here's my for loop version of resampling and refitting the model:
B <- 999
n <- nrow(butterfly)
estMat <- matrix(NA, B+1, 2)
estMat[B+1,] <- model$coef
for (i in 1:B) {
resample <- butterfly[sample(1:n, n, replace = TRUE),]
re.model <- lm(Hk ~ inv.alt, resample)
estMat[i,] <- re.model$coef
}
I tried to avoid the for loop:
B <- 999
n <- nrow(butterfly)
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
re.model.coef <- sapply(re.model,coef)
estMat <- cbind(re.model.coef, model$coef)
It worked but didn't improve efficiency. Is there any way to vectorize this?
Sorry, not quite familiar with StackOverflow. Here's the dataset butterfly.
colony alt precip max.temp min.temp Hk
pd+ss 0.5 58 97 16 98
sb 0.8 20 92 32 36
wsb 0.57 28 98 26 72
jrc+jrh 0.55 28 98 26 67
sj 0.38 15 99 28 82
cr 0.93 21 99 28 72
mi 0.48 24 101 27 65
uo+lo 0.63 10 101 27 1
dp 1.5 19 99 23 40
pz 1.75 22 101 27 39
mc 2 58 100 18 9
hh 4.2 36 95 13 19
if 2.5 34 102 16 42
af 2 21 105 20 37
sl 6.5 40 83 0 16
gh 7.85 42 84 5 4
ep 8.95 57 79 -7 1
gl 10.5 50 81 -12 4
(Assuming butterfly$inv.alt <- 1/butterfly$alt)
You get the error because resample is not a list of resampled data.frames, which you can obtain with:
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
Then the following should work:
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
To extract coefficients from a list of models, re.model$coef does not work. The correct paths to the coefficients are re.model[[1]]$coef, re.model[[2]]$coef, and so on. You can get all of them with the following code:
re.model.coef <- sapply(re.model, coef)
Then you can combine them with the observed coefficients:
estMat <- cbind(re.model.coef, model$coef)
In fact, you can put all of them into replicate:
re.model.coef <- replicate(B, {
bf.rs <- butterfly[sample(1:n, replace = TRUE),]
coef(lm(formula = Hk ~ inv.alt, data = bf.rs))
})
estMat <- cbind(re.model.coef, model$coef)
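On the efficiency question: replicate() and lapply() are still loops internally, so most of the time goes into the repeated lm() calls. One possible way to cut that overhead, sketched here under the assumption that only the coefficients are needed, is to build the design matrix once and call lm.fit() on resampled rows. Only the alt and Hk columns of the posted data are used, and the names X, y and boot are illustrative.

```r
# alt and Hk copied from the butterfly table in the question
butterfly <- data.frame(
  alt = c(0.5, 0.8, 0.57, 0.55, 0.38, 0.93, 0.48, 0.63, 1.5,
          1.75, 2, 4.2, 2.5, 2, 6.5, 7.85, 8.95, 10.5),
  Hk  = c(98, 36, 72, 67, 82, 72, 65, 1, 40, 39, 9, 19, 42, 37, 16, 4, 1, 4)
)
butterfly$inv.alt <- 1 / butterfly$alt

# Design matrix (intercept + inv.alt) and response, built once
X <- cbind("(Intercept)" = 1, inv.alt = butterfly$inv.alt)
y <- butterfly$Hk

set.seed(1)
B <- 999
n <- nrow(butterfly)

# Each replicate resamples row indices and fits with the low-level lm.fit()
boot <- replicate(B, {
  idx <- sample.int(n, replace = TRUE)
  lm.fit(X[idx, , drop = FALSE], y[idx])$coefficients
})

# 2 x (B + 1) matrix: bootstrap coefficients, then the observed ones
estMat <- cbind(boot, lm.fit(X, y)$coefficients)
```

This skips the formula parsing and model-frame construction that lm() repeats on every call; whether it is noticeably faster on a data set this small would need to be measured.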
Here is code to generate a data.frame:
ref_variables=LETTERS[1:10]
row=100
d0=seq(1:100)
for (i in seq_along(ref_variables)){
dtemp=sample(seq(1:row),row,TRUE)
d0=data.frame(d0,dtemp)
}
d0[,1]=NULL
names(d0)=ref_variables
I have a dataset, data.frame or data.table, whatever.
Let's say I want to modify the columns 2 to 4 by dividing each of them by the first one. Of course, I can write a loop like this:
columns_name_to_divide=c("B","C","H")
column_divisor="A"
for (i in seq_along(columns_name_to_divide)){
d0[columns_name_to_divide[i]] = d0[columns_name_to_divide[i]] / d0[column_divisor]
}
But is there a more elegant way to do it?
> d0[2:4] <- d0[,2:4]/d0[,1]
This replaces the original values with the result of dividing columns 2, 3 and 4 by column 1. The rest remain the same.
If you instead want to create 3 new columns in d0 holding the values of columns 2, 3 and 4 divided by column 1, assign to new column positions. This does not replace the original values in columns 2, 3 and 4; the calculated values end up in columns 11, 12 and 13 respectively.
> dim(d0)
# [1] 100 10
> d0[11:13] <- d0[,2:4]/d0[,1]
> dim(d0)
# [1] 100 13
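The question selects columns by name (B, C and H) rather than by position, so a name-based base R variant may be closer to what was asked. This is a minimal sketch, assuming a d0 generated like the one in the question; cols, divisor and orig are illustrative names.

```r
set.seed(42)
# 100 x 10 data frame with columns A..J, similar to the question's d0
d0 <- as.data.frame(matrix(sample(1:100, 100 * 10, replace = TRUE),
                           ncol = 10, dimnames = list(NULL, LETTERS[1:10])))

cols    <- c("B", "C", "H")  # columns to divide
divisor <- "A"               # column to divide by

orig <- d0                   # untouched copy, kept only for comparison

# lapply() divides each selected column by the divisor vector;
# assigning back into d0[cols] replaces those columns in place
d0[cols] <- lapply(d0[cols], `/`, d0[[divisor]])
```

Columns not listed in cols are left unchanged.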
To round the new values to 2 decimal places, wrap the expression in round():
> d0[2:4] <- round(d0[,2:4]/d0[,1],2) # Original values substituted at 2,3,4
# OR
> d0[11:13] <- round(d0[,2:4]/d0[,1],2) # New columns added, original columns are untouched.
We can use set from data.table, which makes this more efficient because the overhead of [.data.table is avoided when it is called multiple times (though that hardly matters in this case).
library(data.table)
setDT(d0)
for(j in columns_name_to_divide){
set(d0, i = NULL, j = j, value = d0[[j]]/d0[[column_divisor]])
}
Or using lapply
setDT(d0)[, (columns_name_to_divide) := lapply(.SD, `/`,
d0[[column_divisor]]), .SDcols = columns_name_to_divide]
Or an elegant option using dplyr
library(dplyr)
library(magrittr)
d0 %<>%
mutate_each_(funs(./d0[[column_divisor]]), columns_name_to_divide)
head(d0)
# A B C D E F G H I J
#1 60 0.4000000 1.1500000 6 86 27 19 0.150000 94 97
#2 11 0.6363636 0.3636364 25 52 44 82 8.818182 84 68
#3 80 0.8750000 1.1375000 72 34 56 69 0.125000 34 17
#4 77 0.3116883 1.0259740 9 44 87 61 1.064935 79 40
#5 18 0.3333333 5.0555556 60 69 62 89 2.166667 21 34
#6 42 1.3333333 2.3095238 61 20 87 95 1.428571 78 63
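mutate_each_() and funs() were deprecated in later dplyr releases. A roughly equivalent sketch with mutate() and across() is shown below, assuming dplyr 1.0.0 or newer and a d0 generated like the one in the question; divisor and d0_scaled are illustrative names.

```r
library(dplyr)

set.seed(42)
# 100 x 10 data frame with columns A..J, similar to the question's d0
d0 <- as.data.frame(matrix(sample(1:100, 100 * 10, replace = TRUE),
                           ncol = 10, dimnames = list(NULL, LETTERS[1:10])))

columns_name_to_divide <- c("B", "C", "H")
column_divisor <- "A"

# Pull the divisor out once so the lambda refers to a plain vector
divisor <- d0[[column_divisor]]

# all_of() selects columns from a character vector of names
d0_scaled <- d0 %>%
  mutate(across(all_of(columns_name_to_divide), ~ .x / divisor))
```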
Benchmarks
set.seed(42)
d1 <- as.data.frame(matrix(sample(1:9, 1e7*7, replace=TRUE), ncol=7))
d2 <- copy(d1)
d3 <- copy(d1)
system.time({
d2 %<>%
mutate_each(funs(./d2[["V2"]]), V4:V7)
})
# user system elapsed
# 0.52 0.39 0.91
system.time({
d1[,4:7] <- d1[,4:7]/d1[,2]
})
# user system elapsed
# 1.72 0.72 2.44
system.time({
setDT(d3)
for(j in 4:7){
set(d3, i = NULL, j = j, value = d3[[j]]/d3[["V2"]])
}
})
# user system elapsed
# 0.32 0.16 0.47
You can do this:
library(data.table)
cols <- names(df)[2:4]
col1 <- names(df)[1]
setDT(df)[, (cols) := lapply (cols, function(x) get(x) / get(col1) )]
# sample data for reproducible example:
df <- data.frame(V1=rep(10,5),
V2=rep(20,5),
V3=rep(30,5),
V4=rep(40,5),
V5=rep(50,5))
I have a data frame with 214 columns and many rows, and I want to perform a Fisher's exact test for each row using values from 4 columns.
An example subset of relevant information from my dataframe looks like:
Variant DB.count.1 DB.count.2 pop.count.1 pop.count.2
A 23 62 35 70
B 81 4 39 22
C 51 42 49 52
D NA NA 65 8
E 73 21 50 33
F 72 13 81 10
G 61 32 75 21
H NA NA 42 22
I NA NA 60 20
J 80 12 72 24
I am trying to use a for loop to:
1. create a contingency table for each row for the Fisher's exact test to compare DB.counts to pop.counts
2. run a Fisher's exact test using this contingency table to determine if there is a difference between DB.counts and pop.counts
3. output the p-value result to a new column on my dataframe
As you can see, there are "NA" values in some positions and therefore in some contingency tables. These will cause an error, which is fine, but when that happens I would like the code to write a value such as "." or "error" to the column and skip to the next row/contingency table.
i.e. I would like an output which looks like this:
Variant DB.count.1 DB.count.2 pop.count.1 pop.count.2 fishers
A 23 62 35 70 0.4286
B 81 4 39 22 <0.0001
C 51 42 49 52 0.3921
D NA NA 65 8 error
E 73 21 50 33 0.0143
F 72 13 81 10 0.5032
G 61 32 75 21 0.0744
H NA NA 42 22 error
I NA NA 60 20 error
J 80 12 72 24 0.0425
The code I currently have (based on R loop over Fisher test - Error message) is:
df$fishers <- for (i in 1:nrow(df))
{
table <- matrix(c(df[i,4], df[i,5], df[i,2], df[i,3]), ncol = 2, byrow = TRUE)
fisher.test(table, alternative="greater")
}
This seems to create the contingency tables the way I want, but the problem of bypassing the errors and writing the p-value to the new column remains. I have tried to use try and tryCatch but have been unsuccessful in doing so.
I am an R beginner so really appreciate any advice on how to improve my questions or any advice for my problem! Thank you!
Edit 1: I have now tried using the data.table package as below and have got what I need from data sets with no "NA" values but how do I skip the errors and make the code continue? Thanks!!!
library(data.table)
dt <- data.table(df)
dt[, p.val := fisher.test(matrix(c(pop.count.1, pop.count.2, DB.count.1, DB.count.2), ncol=2), workspace=1e9)$p.value, by=Variant]
df <- as.data.frame(dt)
You can include an if-else statement in your loop like this:
res <- NULL
for (i in 1:nrow(df)){
table <- matrix(c(df[i,4], df[i,5], df[i,2], df[i,3]), ncol = 2, byrow = TRUE)
# if any NA occurs in your table save an error in p else run the fisher test
if(any(is.na(table))) p <- "error" else p <- fisher.test(table, alternative="greater")$p.value
# save all p values in a vector
res <- c(res,p)
}
df$fishers <- res
Or put the code in a function and use apply instead of a loop:
foo <- function(y){
# include here as.numeric to be sure that your values are numeric:
table <- matrix(as.numeric(c(y[4], y[5], y[2], y[3])), ncol = 2, byrow = TRUE)
if(any(is.na(table))) p <- "error" else p <- fisher.test(table, alternative="greater")$p.value
p
}
df$fishers <- apply(df, 1, foo)
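Since the question mentions try and tryCatch, here is a minimal sketch of that route. It assumes a three-row subset of the example data and returns NA instead of the string "error", so that the new column stays numeric; safe_fisher is an illustrative name, not part of any package.

```r
# Rows A, B and D from the example table (D has missing DB counts)
df <- data.frame(
  Variant     = c("A", "B", "D"),
  DB.count.1  = c(23, 81, NA),
  DB.count.2  = c(62, 4, NA),
  pop.count.1 = c(35, 39, 65),
  pop.count.2 = c(70, 22, 8)
)

# Run the test for one row; if fisher.test() raises an error
# (as it does for NA counts), return NA instead of stopping
safe_fisher <- function(p1, p2, d1, d2) {
  tryCatch(
    fisher.test(matrix(c(p1, p2, d1, d2), ncol = 2, byrow = TRUE),
                alternative = "greater")$p.value,
    error = function(e) NA_real_
  )
}

# mapply() walks the four count columns in parallel, one row at a time
df$fishers <- mapply(safe_fisher,
                     df$pop.count.1, df$pop.count.2,
                     df$DB.count.1, df$DB.count.2)
```

Rows that failed can be found afterwards with is.na(df$fishers), and the column can be converted to text if a literal "error" label is preferred.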
Hi, I have never edited one of my questions before, but I'll give it a try. What the code actually does is not very important. For me, the only relevant part is saving the "liste" vectors in a new list :D
test <- list()
test <- replicate(5, sample(1:100, 50), simplify = FALSE) # Creates a list of 5 vectors
> test[[1]]
[1] 90 96 20 86 32 77 83 33 64 29 88 97 78 81 40 60 89 19 31 59 26 38 34 71 5 80 85
[28] 3 70 87 41 50 6 18 37 58 9 76 91 62 12 30 42 94 72 95 100 10 68 82
S <- test[[1]]
x <- diff(S) # following algorithm creates "liste" (vector) for test[[1]]
trendtest <- list()
k <- NULL
d <- NULL
t <- vector("list",length(x))
A <- vector("list",length(x))
z <- vector("list",length(x)-2)
za <- vector("list",length(x)-2)
liste <- NULL
dreisum <- sapply(1:(length(x)-2), function(i) sum(x[c(i,(i+1))]))
dreisumi <- lapply(1:(length(x)-2), function(i) dreisum[i:(length(x)-2)])
zdreisumi<- lapply(1:(length(x)-4), function(i) dreisumi[[i]] [3:length(dreisumi[[i]])]<0)
zadreisumi<- lapply(1:(length(S)-4), function(i) dreisumi[[i]][3:length(dreisumi[[i]])]>0)
Si <- lapply(1:(length(x)-2), function(i) S[i:(length(x))])
i <- 1
h <- 1
while(i<(length(x)-3) & h!=Inf){
k <- c(k,k <- (S[i]-S[i+2])/(-2))
d <- c(d,d <- (S[i+2]*i-S[i]*(i+2))/(-2))
t[[i]] <- i:(length(x))
A[[i]] <- k[length(liste)+1]*t[[i]]+d[length(liste)+1]
A[[i]][3] <- S[i+2]
z[[i]] <- Si[[i]][3:length(Si[[i]])]<A[[i]][3:length(A[[i]])]
za[[i]] <- Si[[i]][3:length(Si[[i]])]>A[[i]][3:length(A[[i]])]
if(k[length(liste)+1]>0 & S[i+3]>A[[i]][4] & is.element(TRUE,z[[i]])){h <- (min(which(z[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]>0 & S[i+3]<A[[i]][4] & is.element(TRUE,za[[i]])){h <- (min(which(za[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]<0 & S[i+3]>A[[i]][4] & is.element(TRUE,z[[i]])){h <- (min(which(z[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]<0 & S[i+3]<A[[i]][4] & is.element(TRUE,za[[i]])){h <- (min(which(za[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]>0 & S[i+3]>A[[i]][4] & (all(z[[i]]==FALSE))){h <- (min(which(zdreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]>0 & S[i+3]<A[[i]][4] & (all(za[[i]]==FALSE))){h <- (min(which(zdreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]<0 & S[i+3]>A[[i]][4] & (all(z[[i]]==FALSE))){h <- (min(which(zadreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]<0 & S[i+3]<A[[i]][4] & (all(za[[i]]==FALSE))){h <- (min(which(zadreisumi[[i]]!=FALSE))+2)}}}}}}}}
liste <- c(liste,i)
i <- i+h-1
if((length(x)-3)<=i & i<=length(x)){liste <- c(liste,i)}}
> liste
[1] 1 3 7 10 12 16 18 20 24 27 30 33 36 39 41 46
Actually, the whole code is not that interesting for my problem because it works! I made the example for test[[1]]. BUT I want a for loop (or whatever) to take ALL vectors in "test" and save ALL 5 "liste" vectors in a new list (let's call it "trendtest" ... whatever :D)
The following will do what you ask for:
Delete the line trendtest <- list().
Take the code from x <- diff(S) to last line (except the very last line that only prints liste) and insert it at the position indicated by the placeholder __CODE_HERE__.
trendtest <- lapply(test, FUN = function(S) {
__CODE_HERE__
return(liste)
})
This is the "R way" of doing what you want. Alternatively, you could do the following (which is closer to your initial approach, but less the "R way"):
trendtest <- vector("list", length(test))
for (u in 1:length(test)) { # better: u in seq_along(test)
S <- test[[u]]
__CODE_HERE__
trendtest[[u]] <- liste
}
Note that there will be an error message which is due to the sample data (which doesn't fit the algorithm provided) and is unrelated to saving liste in trendtest.
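The same pattern can be shown end to end with a much simpler per-vector computation. In this sketch find_turns() is a hypothetical stand-in for the trend algorithm in the question (it just returns the positions where the direction of the series changes); only the lapply() wrapping is the point.

```r
set.seed(1)
test <- replicate(5, sample(1:100, 50), simplify = FALSE)  # list of 5 vectors

# Hypothetical stand-in for the question's algorithm:
# indices at which the series switches between rising and falling
find_turns <- function(S) {
  x <- diff(S)
  which(diff(sign(x)) != 0) + 1
}

# One result vector per element of test, collected in a list
trendtest <- lapply(test, find_turns)
```

trendtest[[u]] then holds the result for test[[u]], just as with the explicit for loop.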
Is there a way to use apply functions on 'windows' or 'ranges'? This example should serve to illustrate:
a <- 11:20
Now I want to calculate the sums of consecutive elements. i.e.
[11+12, 12+13, 13+14, ...]
The ways I can think of handling this are:
a <- 11:20
b <- NULL
for(i in 1:(length(a)-1))
{
b <- c(b, a[i] + a[i+1])
}
# b is 23 25 27 29 31 33 35 37 39
or alternatively,
d <- sapply( 1:(length(a)-1) , function(i) a[i] + a[i+1] )
# d is 23 25 27 29 31 33 35 37 39
Is there a better way to do this?
I'm hoping there's something like:
e <- windowapply( a, window=2, function(x) sum(x) ) # fictional function
# e should be 23 25 27 29 31 33 35 37 39
Here's an alternative using rollapply from the zoo package:
> library(zoo)
> rollapply(a, width=2, FUN=sum )
[1] 23 25 27 29 31 33 35 37 39
The zoo package also offers a rollsum function:
> rollsum(a, 2)
[1] 23 25 27 29 31 33 35 37 39
We can define a general moving() function:
moving <- function(f){
g <- function(i , x, n , f, ...) f(x[(i-n+1):i], ...)
function(x, n, ...) {
N <- length(x)
vapply(n:N, g, x , n , f, FUN.VALUE = numeric(1), ...)
}
}
The function moving() returns a function that, in turn, can be used to generate any moving_f() function:
moving_sum <- moving(sum)
moving_sum(x = 11:20, n = 2)
Similarly, extra arguments can be passed through to moving_f():
moving_mean <- moving(mean)
moving_mean(x = rpois(22, 6), n = 5, trim = 0.1)
You can achieve your windowapply function by first creating a list of indices and then *applying over them such that they are used as extraction indices:
j <- lapply(seq_len(length(a) - 1), function(i) c(i, i + 1))
sapply(j, function(j) sum(a[j]))
## [1] 23 25 27 29 31 33 35 37 39
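For plain sums, base R can also do this without any apply call. The sketch below shows two options: adding the vector to a shifted copy of itself, and a small helper built on embed() for an arbitrary window width. window_sums is an illustrative name, not an existing function.

```r
a <- 11:20

# Width-2 window: add each element to its successor
pair_sums <- a[-length(a)] + a[-1]

# General width n: embed() lays the windows out as rows of a matrix,
# so rowSums() gives one sum per window
window_sums <- function(x, n) rowSums(embed(x, n))

window_sums(a, 2)
window_sums(a, 3)
```

The embed() version only covers sums (or other row-wise matrix functions such as rowMeans); for arbitrary functions rollapply or the moving() helper above is more flexible.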