I hope that, this question will be nice tutorial for beginners in R (such as me).
I was used to programming languages where loops are necessary to manipulate the data, algorithms, etc.
Nevertheless, loops in R are slow, what can be seen in case of large data.
Fortunately R provides bulit-in functions which allow to iterating through elements and do some calculation in very efficient way.
Now I'd like to avoid loops when I'm analysying data in R. So I've read about lapply, apply and other useful functions.
I'd like to make correlation between first and each other column of my data and print: sample name, sample estimate and p-value in nice table - everything without for loop.
My idea - create fake data from stratch:
surv <- c(7.1,8,4,2,0.5,5,6)
geneA_expr <- runif(n = 7, min = 1, max = 30)
geneB_expr <- runif(n = 7, min = 1, max = 30)
geneC_expr <- runif(n = 7, min = 1, max = 30)
my_data <- data.frame(surv, geneA_expr, geneB_expr, geneC_expr)
Correlation test with apply - found it here in Stack Overflow and in manual:
md_stat <- apply(my_data[,2:4], 2, cor.test, my_data$surv, method="pearson")
md_stat is a list, now I'd like to visualize it nicely, but I have no idea how to do it, it's too complicated for me, so I used for loop
for(i in names(md_stat)){
cat(i ,md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
}
geneA_expr 0.2517658 0.5860052
geneB_expr 0.2438112 0.5982849
geneC_expr 0.8026801 0.02977544
How to replace above for loop by other bulit-in function?
unlist every list within md_stat. then bind the outputs into a matrix.
do.call(rbind, lapply(md_stat, unlist))
Try this
temp <- lapply(seq_along(md_stat), function(i) {
cat(names(md_stat)[[i]], md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
})
I can think of 4 ways for you to do this, 1 of which depends on the purrr package.
You could use a loop, walk from the purrr package, lapply and a recursive function.
library(microbenchmark)
library(purrr)
surv <- c(7.1,8,4,2,0.5,5,6)
geneA_expr <- runif(n = 7, min = 1, max = 30)
geneB_expr <- runif(n = 7, min = 1, max = 30)
geneC_expr <- runif(n = 7, min = 1, max = 30)
my_data <- data.frame(surv, geneA_expr, geneB_expr, geneC_expr)
md_stat <- apply(my_data[,2:4], 2, cor.test, my_data$surv, method="pearson")
md_loop <- function(md_stat) {
for(i in names(md_stat)){
cat(i ,md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
}
}
md_walk <- function(md_stat) {
walk(names(md_stat), function(i) {
cat(i ,md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
})
}
md_apply <- function(md_stat) {
lapply(names(md_stat), function(i) {
cat(i[[1]],md_stat[[i[[1]]]]$estimate, md_stat[[i[[1]]]]$p.value, '\n')
})
}
md_recursive <- function(md_stat) {
i <- names(md_stat)
if(length(i) < 1) {
NULL
} else {
cat(i[[1]],md_stat[[i[[1]]]]$estimate, md_stat[[i[[1]]]]$p.value, '\n')
md_recursive(tail(md_stat, -1))
}
}
md_speed <- microbenchmark(
md_loop(md_stat),
md_walk(md_stat),
md_apply(md_stat),
md_recursive(md_stat)
)
Speed comparisons
Related
Long time reader, first time poster. I have not found any previous questions about my current problem. I would like to create multiple linear functions, which I can later apply to variables. I have a data frame of slopes: df_slopes and a data frame of constants: df_constants.
Dummy data:
df_slope <- data.frame(var1 = c(1, 2, 3,4,5), var2 = c(2,3,4,5,6), var3 = c(-1, 1, 0, -10, 1))
df_constant<- data.frame(var1 = c(3, 4, 6,7,9), var2 = c(2,3,4,5,6), var3 = c(-1, 7, 8, 0, -1))
I would like to construct functions such as
myfunc <- function(slope, constant, trvalue){
result <- trvalue*slope+constant
return(result)}
where the slope and constant values are
slope<- df_slope[i,j]
constant<- df_constant[i,j]
I have tried many ways, for example like this, creating a dataframe of functions with for loop
myfunc_all<-data.frame()
for(i in 1:5){
for(j in 1:3){
myfunc_all[i,j]<-function (x){ x*df_slope[i,j]+df_constant[i,j] }
full_func[[i]][j]<- func_full
}
}
without success. The slope-constant values are paired up, such as df_slope[i,j] is paired with df_constant[i,j]. The desired end result would be some kind of data frame, from where I can call a function by giving it the coordinates, for example like this:
myfunc_all[i,j}
but any form would be great. For example
myfunc_all[2,1]
in our case would be
function (x){ x*2+4]
which I can apply to different x values. I hope my problem is clear.
So you have a slight problem with lazy evaluation and variable scopes when you are using a for loop to build functions (see here for more info). It's a bit safer to use something like mapply which will create closures for you. Try
myfunc_all <- with(expand.grid(1:5, 1:3), mapply(function(i, j) {
function(x) {
x*df_slope[i,j]+df_constant[i,j]
}
},Var1, Var2))
dim(myfunc_all) <- c(5,3)
This will create an array like object. The only difference is that you need to use double brackets to extract the function. For example
myfunc_all[[2,1]](0)
# [1] 4
myfunc_all[[5,3]](0)
# [1] -1
Alternative you can choose to write a function that returns a function. That would look like
myfunc_all <- (function(slopes, constants) {
function(i, j)
function(x) x*slopes[i,j]+constants[i,j]
})(df_slope, df_constant)
then rather than using brackets, you call the function with parenthesis.
myfunc_all(2,1)(0)
# [1] 4
myfunc_all(5,3)(0)
# [1] -1
df_slope <- data.frame(var1 = c(1, 2, 3,4,5), var2 = c(2,3,4,5,6), var3 = c(-1, 1, 0, -10, 1))
df_constant<- data.frame(var1 = c(3, 4, 6,7,9), var2 = c(2,3,4,5,6), var3 = c(-1, 7, 8, 0, -1))
functions = vector(mode = "list", length = nrow(df_slope))
for (i in 1:nrow(df_slope)) {
functions[[i]] = function(i,x) { df_slope[i]*x + df_constant[i]}
}
f = function(i, x) {
functions[[i]](i, x)
}
f(1, 1:10)
f(3, 5:10)
I am trying to simulate 5000 samples of size 5 from a normal distribution with mean 5 and standard deviation 3. I want to then compute the mean of each sample and make a histogram of the sample means
My current code is not giving me an error but I don't think it's right:
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
sample_means[i] <- mean(e[[i]])
}
Any idea on how to tackle this? I am very very new to R!
You don't need a list in this case. It is a common mistake of new R users to use lists excessively.
observations <- matrix(rnorm(25000, mean=5, sd=3), 5000, 5)
means <- rowMeans(observations)
Now means is a vector of 5000 elements.
You can actually do this without for loops. replicate can be used to create the 5000 samples. Then use sapply to return the mean of each sample. Wrap the sapply call in hist() to get the histogram of means.
dat = replicate(5000, rnorm(5,5,3), simplify=FALSE)
hist(sapply(dat, mean))
Or, if you want to save the means:
sample.means = sapply(dat,mean)
hist(sample.means)
I think your code is giving valid results. list(mode="vector",length=nrSamples) isn't doing what I think you intended (run it in the console and see what happens), but it works out because the first two list elements get overwritten in the loop.
Although there's no need to use loops here, just for illustration here are two modified versions of your code using loops:
# 1. Store random samples in a list
e <- vector("list", nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
sample_means[i] <- mean(e[[i]])
}
# 2. Store random samples in a matrix
e <- matrix(rep(NA, 5000*5), nrow=5)
for (i in 1:nrSamples) {
e[,i] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
sample_means[i] <- mean(e[, i])
}
Your code is fine (see below), but I would suggest you try the following:
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
Here, for each element of the sequence 1, 2, 3, ... nrSamples that I supply as the first argument, lapply executes an function with the given element of the sequence as argument (i.e. x). The function that I have supplied does not depend on x, however, so it is just replicated 5000 times, and the output is stored in a list (this is what lapply does). It is an easy way to avoid loops in situations like these. Needless to say, you could also just run
yourmeans <- sapply(1:nrSamples, function(x) mean(rnorm(n=5, mean = 5, sd = 3)))
Apart from the means, the latter does not store your results though, which may not be what you want. Also note that I call sapply to return a vector, which you can then use to plot your histogram, using e.g. hist(yourmeans).
To show that your code is fine, consider the following:
set.seed(42)
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
sample_means[i] <- mean(e[[i]])
}
set.seed(42)
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
all.equal(as.vector(sample_means), yourmeans)
[1] TRUE
Here, I set the seed to the random number generator to make sure that the random numbers are the same. As you see, your code works fine, though as others have pointed out, loops can easily be avoided.
Beginning R programmer here. I'm trying to run a function with the argument being the number of samples (user-defined) and the output being a vector of means of those samples.
Here is what I have so far, however, I only get one mean value returned. How do I alter the formula so I get a vector of the means that is variable on the number the user inputs?
Pop1 <- rnorm(500, mean = 0.5, sd = 0.2)
My_Func <- function(Samples) {
A <- sample(Pop1, size = 25, replace = TRUE)
for (i in 1:Samples) {
Means <- mean(A)
}
return(Means)
}
Using a for loop it can be like this. As #MrFlick mentioned, avoid assingning the loop to the same variable. Include it into the loop.
Pop1 <- rnorm(500, mean = 0.5, sd = 0.2)
My_Func <- function(Samples) {
Means = numeric(Samples)
for (i in 1:Samples) {
A <- sample(Pop1, size = 25, replace = TRUE)
Means[i] <- mean(A)
}
return(Means)
}
Example:
require(data.table)
example = matrix(c(rnorm(15, 5, 1), rep(1:3, each=5)), ncol = 2, nrow = 15)
example = data.table(example)
setnames(example, old=c("V1","V2"), new=c("target", "index"))
example
threshold = 100
accumulating_cost = function(x,y) { x-cumsum(y) }
whats_left = accumulating_cost(threshold, example$target)
whats_left
I want whats_left to consist of the difference between threshold and the cumulative sum of values in example$target for which example$index = 1, and 2, and 3. So I used the following for loop:
rm(whats_left)
whats_left = vector("list")
for(i in 1:max(example$index)) {
whats_left[[i]] = accumulating_cost(threshold, example$target[example$index==i])
}
whats_left = unlist(whats_left)
whats_left
plot(whats_left~c(1:15))
I know for loops aren't the devil in R, but I'm habituating myself to use vectorization when possible (including getting away from apply, being a for loop wrapper). I'm pretty sure it's possible here, but I can't figure out how to do it. Any help would be much appreciated.
All you trying to do is accumulate the cost by index. Thus, you might want to use the by argument as in
example[, accumulating_cost(threshold, target), by = index]
I know similar questions have been asked in this site here, here, and here, but none of them tackles my problem.
I've a data frame which I want to apply the rdirichlet function (from gtools) to each line. So, each line shall be consider as aplha.
data = NULL
data <- data.frame(rbind(
oct = c(60, 32, 8),
sep = c(53, 35, 12),
ago = c(54, 40, 6)
))
data <- data/100*1000
library(gtools) # contains the function
sim <- 10000 # simulation
My first attenpt was to use apply, it does work, but the output is not that clear for conducting further analysis; each row computation becomes a vector:
p = apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
I also try in a loop without success:
p = NULL
for(i in 1:length(data)) {
p[i] <- rdirichlet(sim, alpha = data[i] + 1)
}
Any tip how can I solve this?
Well firstly you might want to change the data in your anonymous function in the apply to x to match the x in function(x)
apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
This works for me, as in it provides an output with three columns and 30000 rows.
Two important things here. First, vectorizing is the best way to go:
ans <- apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
By doing this, you'll receive each row computations as vector, essentially k vs sim like.
Then you'll need to subsample things like:
margin <- ans[1:100000,1] - ans[100001:200000,1]