How can I efficiently generate a data frame of simulated values in R?

I'm trying to generate a data frame of simulated values based on existing distribution parameters. My main data frame contains the mean and standard deviation for each observation, like so:
example.data <- data.frame(country=c("a", "b", "c"),
                           score_mean=c(0.5, 0.4, 0.6),
                           score_sd=c(0.1, 0.1, 0.2))
#   country score_mean score_sd
# 1       a        0.5      0.1
# 2       b        0.4      0.1
# 3       c        0.6      0.2
I can use sapply() and a custom function to use the score_mean and score_sd parameters to randomly draw from a normal distribution:
score.simulate <- function(score.mean, score.sd) {
  return(mean(rnorm(100, mean=score.mean, sd=score.sd)))
}
simulated.scores <- sapply(example.data$score_mean,
                           FUN=score.simulate,
                           score.sd=example.data$score_sd)
# [1] 0.4936432 0.3753853 0.6267956
This will generate one round (or column) of simulated values. However, I'd like to generate a lot of columns (like 100 or 1,000). The only way I've found to do this is to wrap my sapply() function inside a generic function inside lapply() and then convert the resulting list into a data frame with ldply() in plyr:
results.list <- lapply(1:5, FUN=function(x)
  sapply(example.data$score_mean, FUN=score.simulate,
         score.sd=example.data$score_sd))
library(plyr)
simulated.scores <- as.data.frame(t(ldply(results.list)))
#           V1        V2        V3        V4        V5
# V1 0.5047807 0.4902808 0.4857900 0.5008957 0.4993375
# V2 0.3996402 0.4128029 0.3875678 0.4044486 0.3982045
# V3 0.6017469 0.6055446 0.6058766 0.5894703 0.5960403
This works, but (1) it seems really convoluted, especially with the as.data.frame(t(ldply(lapply(... FUN=function(x) sapply ...)))) approach, (2) it is really slow when using large numbers of iterations or bigger data—my actual dataset has 3,000 rows, and running 1,000 iterations takes 1–2 minutes.
Is there a more efficient way to create a data frame of simulated values like this?

The quickest way I can think of is to take advantage of the vectorisation built into rnorm. Both the mean and sd arguments are vectorised; however, you can only supply a single integer for the number of draws. If you supply a vector to the mean and sd arguments, R will cycle through them until it has completed the required number of draws. Therefore, just make the argument n to rnorm a multiple of the length of your mean vector. The multiplier will be the number of replicates for each row of your data.frame. In the function below this is n.
I can't think of a faster way than using base::rnorm on its own.
Worked example
# example data
df <- data.frame(country=c("a", "b", "c"),
                 mean=c(1, 10, 100),
                 sd=c(1, 2, 10))

# function which returns a matrix, and takes column vectors as arguments for mean and sd
normv <- function(n, mean, sd) {
  out <- rnorm(n*length(mean), mean = mean, sd = sd)
  return(matrix(out, ncol = n, byrow = FALSE))
}
#reproducible result (note order of magnitude of rows and input sample data)
set.seed(1)
normv( 5 , df$mean , df$sd )
#            [,1]      [,2]       [,3]        [,4]        [,5]
# [1,]  0.3735462  2.595281   1.487429   0.6946116   0.3787594
# [2,] 10.3672866 10.659016  11.476649  13.0235623   5.5706002
# [3,] 91.6437139 91.795316 105.757814 103.8984324 111.2493092
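To connect this back to the original question (my illustration, not part of this answer): columns of simulated means, each summarising 100 draws per row, can be built by wrapping normv in replicate(). For example, using the question's example.data:
# 5 columns of simulated means, each the mean of 100 draws per row
simulated.scores <- replicate(5, rowMeans(normv(100, example.data$score_mean,
                                                example.data$score_sd)))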

This can be done very quickly if you remember that rnorm(1, mean, sd) is the same as rnorm(1)*sd + mean. So, using a data frame df with mean and sd columns (as above), you can generate sim simulations of your obs observations like:
obs = nrow(df)
sim = 1000
mat = data.frame(matrix(rnorm(obs*sim), obs, sim) * df$sd + df$mean)
You can check that this has the desired means using rowMeans(mat); to check the standard deviation of, say, row 1, note that mat is a data frame, so the row must be coerced to a numeric vector first: sd(as.numeric(mat[1, ])).
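For example, a quick sanity check over all rows (a sketch; apply() works row-wise here because mat can be coerced to a numeric matrix) might be:
rowMeans(mat)      # should be close to df$mean
apply(mat, 1, sd)  # should be close to df$sd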

Related

Looping over 10 possible N samples and calculating sums of columns

I'm generating n samples, each of dimension m, and populating an m-by-n matrix. Then I use the apply function to go over every column of the matrix (every sample generated) and return the sum of the elements of each column. At the end I calculate the mean of all of those sums.
data = replicate(n, rnorm(m, mean = mu, sd = variance))
sum_of_column <- function(col) {
  s <- sum(col)
}
sums <- apply(data, 2, sum_of_column)
me <- mean(sums)
sums is the vector where each element is the sum of the respective column. me is the mean of that vector.
But n is a single value, and I want it to be a sequence of numbers (like 1:10), meaning I want to run this algorithm for every n = 1, n = 2, ..., n = 10, storing sums and calculating their mean each time. I may end up with a two-dimensional array (as a data frame) where one column holds the n's and the other column the corresponding mean of sums for that n.
In other words, I need to loop this algorithm I coded and store the value for each n-iteration. Like
n mean(sums)
1 123
2 13
...
10 94
I thought of doing this with a for loop, but would there be a smarter way to do this without explicitly looping? Maybe using apply for 3 dimensions?
You could put the logic into a function FUN. In its arguments, predefine m, mu, and sigma. n will be defined dynamically in the loop.
FUN <- \(n, m=1e5, mu=0, sigma=1) {
  mxn <- replicate(n, rnorm(m, mean=mu, sd=sigma))
  return(c(n=n, mean_of_sums=mean(colSums(mxn))))
}
FUN(1)
#            n mean_of_sums
#            1    -226.6016
To loop over the n, you could use vapply, which is similar to sapply, but predefines FUN.VALUE in the third argument which saves work for R and, thus, is faster. To get the n into rows, you want to transpose the result.
n <- 1:100
set.seed(42)
r <- t(vapply(n, \(n) FUN(n), c(0, 0)))
r <- as.data.frame(r) ## if wanted
head(r)
#   n mean_of_sums
# 1 1    -412.6182
# 2 2    -114.6650
# 3 3     304.1592
# 4 4      75.8026
# 5 5    -208.2705
# 6 6     126.6526
plot(r, type='l', col=4)
abline(h=0, col=8)

Selecting rows in R based on threshold

In R, I have a matrix with N columns of all numbers. (Each row has a name, but that's irrelevant.) I'd like to return the rows in which at least one column has a value greater than some threshold. Right now, I'm doing something like this:
THRESHOLD <- 10
# my_matrix[,1] can be ignored
my_matrix <- subset(my_matrix, my_matrix[,1] > THRESHOLD | my_matrix[,2] > THRESHOLD | ... )
It seems odd to have to manually list each column. Also, if the number of input columns changes, I have to rewrite this.
There has to be a better way, but I can't figure out what I should be looking for.
I can convert my matrix to a data frame, if that is easier... Any suggestions would be appreciated!
Find the rows in which any value exceeds the threshold using apply(), and use the resulting logical vector to extract those rows from the mat data (mat2, the numeric part of mat, is constructed in the breakdown below):
mat[apply( mat2, 1, function( x ) any( x > threshold ) ), ]
EDIT:
Breakdown of the above single line.
# create sample data: a character matrix whose numeric part is simulated from a standard normal
set.seed(1L)  # set the random number generator for reproducible simulation
mat <- matrix(data = c(letters[1:3], as.character(rnorm(9, mean = 0, sd = 1))),
              byrow = FALSE,
              nrow = 3,
              ncol = 4)  # simulated data matrix
threshold <- 0  # set threshold
mat2 <- apply(mat[, 2:ncol(mat)], 2, as.numeric)  # extract columns 2 to end and convert to numeric
# get the logical indices (TRUE or FALSE): does any value in the row exceed the threshold?
row_indices <- apply(mat2, 1, function(x) any(x > threshold))
mat[row_indices, ]  # extract the matrix rows that have TRUE in row_indices
# [,1] [,2] [,3] [,4]
# [1,] "a" "-0.626453810742332" "1.59528080213779" "0.487429052428485"
# [2,] "b" "0.183643324222082" "0.329507771815361" "0.738324705129217"
# [3,] "c" "-0.835628612410047" "-0.820468384118015" "0.575781351653492"
Note:
In your question, you mentioned that the first column is character and the rest are numbers. By rule, a matrix can hold only one data type, so given this information I assume your data matrix is of character type; you can check with typeof(mat). If it is a character matrix, extract columns 2 to the end and convert them to numeric, then use that in the apply loop to check for any values greater than the threshold.
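As an aside (my addition, not part of the original answer), the same row filter can be written without the anonymous function by counting threshold exceedances per row with rowSums(); a minimal sketch, assuming mat and mat2 from the breakdown above:
# rows in which at least one numeric value exceeds the threshold
keep <- rowSums(mat2 > threshold) > 0
mat[keep, ]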

How to apply a function to array margin and create pairwise combination matrix

I am using R to apply a self-written function, which takes two numeric vectors plus a numeric parameter as input, over the column margins of a data frame. Each column in the data frame is a numeric vector, and I want to perform pairwise computations and create a matrix that holds all possible combinations of the columns with the result of the computation for each pair. Essentially I want to reproduce the behaviour of the cor() function.
# Data
> head(d)
1 2 3 4
1 -1.01035342 1.2490665 0.7202516 0.101467379
2 -0.50700743 1.4356733 0.9032172 -0.001583743
3 -0.09055243 0.4695046 2.4487632 -1.082570048
4 1.11230416 0.2885735 0.3534247 -0.728574628
5 -1.96115691 0.4831158 1.5650052 0.648675605
6 1.20434218 1.7668086 0.2170858 -0.161570792
> cor(d)
1 2 3 4
1 1.00000000 0.08320968 -0.06432155 0.04909430
2 0.08320968 1.00000000 -0.04557743 -0.01092765
3 -0.06432155 -0.04557743 1.00000000 -0.01654762
4 0.04909430 -0.01092765 -0.01654762 1.00000000
I found this useful answer: Perform pairwise comparison of matrix
Based on this I wrote this function which makes use of another self-written function compareFunctions()
createProbOfNonEqMatrix <- function(df, threshold) {
  combinations <- combn(ncol(df), 2)
  predDF <- matrix(nrow = length(density(df[,1])$y))  # matrix for predicted values from the density function
  for (i in 1:ncol(df)) {
    predCol <- density(df[,i])$y  # convert original values to predicted values from the density function
    predDF <- cbind(predDF, predCol)
  }
  predDF <- predDF[, 2:ncol(predDF)]
  colnames(predDF) <- colnames(df)  # give the predicted-value columns the same names as the original df
  predDF <- as.matrix(predDF)
  out.mx <- apply(X = combinations, MARGIN = 2, FUN = "compareFunctions",
                  predicted_by_first = predDF[, combinations[1]],
                  predicted_by_second = predDF[, combinations[2]],
                  threshold = threshold)
  return(out.mx)
}
The predicted_by_first, predicted_by_second and threshold are inputs for compareFunctions. However I get the following error:
Error in FUN(newX[, i], ...) : unused argument (newX[, i])
In desperation I tried this:
createProbOfNonEqMatrix <- function(df, threshold) {
  combinations <- combn(ncol(df), 2)
  predDF <- matrix(nrow = length(density(df[,1])$y))
  for (i in 1:ncol(df)) {
    predCol <- density(df[,i])$y
    predDF <- cbind(predDF, predCol)
  }
  predDF <- predDF[, 2:ncol(predDF)]
  colnames(predDF) <- colnames(df)
  predDF <- as.matrix(predDF)
  out.mx <- apply(
    X = combinations, MARGIN = 2, FUN = function(x) {
      diff <- abs(predDF[, x[1]] - predDF[, x[2]])
      boolean <- diff < threshold
      acceptCount <- length(boolean[boolean == TRUE])
      probability <- acceptCount / length(diff)
      return(probability)
    }
  )
  return(out.mx)
}
It does seem to be working but instead of returning the pairwise matrix it gives me a vector:
> createProbOfNonEqMatrix(d,0.001)
[1] 0.10351562 0.08203125 0.13476562 0.13085938 0.14843750 0.10937500
Could you guide me on how to produce the desired pairwise matrix, even if it means rewriting the function code inside apply()? Also, if you could give me an idea of how to keep track of which pairwise comparisons are performed, it would be greatly appreciated.
Thank you,
Alex
Your output gives you the result of the calculation in the order of the pairs in combinations: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). If you want to organise this into a symmetric square matrix you can do a basic manipulation on the result, e.g. as follows:
out.mx <- c(0.10351562, 0.08203125, 0.13476562, 0.13085938, 0.14843750, 0.10937500)
out.mtx <- matrix(nrow = ncol(d), ncol = ncol(d))
out.mtx[,] <- 1
for (i in 1:length(combinations[1,])) {
  a <- combinations[1, i]
  b <- combinations[2, i]
  out.mtx[a, b] <- out.mtx[b, a] <- out.mx[i]
}
out.mtx
which gives you
[,1] [,2] [,3] [,4]
[1,] 1.00000000 0.1035156 0.08203125 0.1347656
[2,] 0.10351562 1.0000000 0.13085938 0.1484375
[3,] 0.08203125 0.1308594 1.00000000 0.1093750
[4,] 0.13476562 0.1484375 0.10937500 1.0000000
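As a side note (my addition, not from the original answer), base R's matrix indexing accepts a two-column matrix of (row, column) pairs, so the same symmetric matrix can be filled without an explicit loop; a minimal sketch, assuming combinations and out.mx as above:
out.mtx <- matrix(1, nrow = ncol(d), ncol = ncol(d))  # start with 1s, including the diagonal
out.mtx[t(combinations)] <- out.mx                    # fill the upper-triangle cells (i, j), i < j
out.mtx[t(combinations[2:1, ])] <- out.mx             # mirror into the lower triangle (j, i)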

Divide each cell of a large matrix by the sum of its row

I have a site by species matrix. The dimensions are 375 x 360. Each value represents the frequency of a species in samples of that site.
I am trying to convert this matrix from frequencies to relative abundances at each site.
I've tried a few ways to achieve this and the only one that has worked is using a for loop. However, this takes an incredibly long time or simply never finishes.
Is there a function or a vectorised method of achieving this? I've included my for-loop as an example of what I am trying to do.
relative_abundance <- matrix(0, nrow = nrow(data_wide),
                             ncol = ncol(data), dimnames = dimnames(data))
for (i in 1:nrow(relative_abundance)) {
  for (j in 1:ncol(relative_abundance)) {
    species_freq <- data[i, j]
    row_sum <- sum(data[i, ])
    relative_abundance[i, j] <- species_freq / row_sum
  }
}
You could do this using apply, but scale in this case makes things even simpler. Assuming you want to divide columns by their sums:
set.seed(0)
relative_abundance <- matrix(sample(1:10, 360*375, TRUE), nrow= 375)
freqs <- scale(relative_abundance, center = FALSE,
               scale = colSums(relative_abundance))
The matrix is too big to output here, but here's what it should look like:
> head(freqs[, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] 0.004409603 0.0014231499 0.003439803 0.004052685 0.0024026910
[2,] 0.001469868 0.0023719165 0.002457002 0.005065856 0.0004805382
[3,] 0.001959824 0.0018975332 0.004914005 0.001519757 0.0043248438
[4,] 0.002939735 0.0042694497 0.002948403 0.002532928 0.0009610764
[5,] 0.004899559 0.0009487666 0.000982801 0.001519757 0.0028832292
[6,] 0.001469868 0.0023719165 0.002457002 0.002026342 0.0009610764
And a sanity check:
> head(colSums(freqs))
[1] 1 1 1 1 1 1
Using apply:
freqs2 <- apply(relative_abundance, 2, function(i) i/sum(i))
This has the advantage of being easily changed to run by rows, but the results will be joined as columns anyway, so you'd have to transpose them.
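For instance, the row-wise version of that apply() call, with the transpose just mentioned, might look like:
# divide each row by its sum; apply() returns the rows as columns, so transpose back
freqs_rows <- t(apply(relative_abundance, 1, function(i) i / sum(i)))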
Firstly, you could just do
relative_abundance[i,j] <- data[i,j]/sum(data[i,])
so you don't create the intermediate variables...
But to vectorise it, I suggest: compute the row sums with the rowSums function (fast), then divide each column by them using apply:
relative_freq <- apply(data, 2, function(x) x/rowSums(data))
Using some simple linear algebra we can produce faster results. Simply multiply on the left by a diagonal matrix with the scaling factors you need, like this:
library(Matrix)
set.seed(0)
relative_abundance <- matrix(sample(1:10, 360*375, TRUE), nrow= 375)
Diagonal_Matrix <- diag(1/rowSums(relative_abundance))
And then we multiply from the left:
row_normalized_matrix <- Diagonal_Matrix %*% relative_abundance
If you want to normalize columnwise simply make:
Diagonal_Matrix <- diag(1/colSums(relative_abundance))
and multiply from the right.
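Since library(Matrix) is already loaded here, one refinement (my aside, not part of the original answer) is Matrix::Diagonal(), which builds a sparse diagonal matrix and avoids materialising a dense 375 x 375 matrix that is mostly zeros:
# sparse diagonal: stores only the scaling factors, one per row
Diagonal_Matrix <- Diagonal(x = 1/rowSums(relative_abundance))
row_normalized_matrix <- Diagonal_Matrix %*% relative_abundance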
You can do something like this
relative_abundance <- matrix(sample(1:10, 360*375, TRUE), nrow= 375)
datnorm <- relative_abundance/rowSums(relative_abundance)
this will be faster if relative_abundance is a matrix rather than a data.frame

Selecting a range of rows from R data frame

I have a data frame with 1000 rows and I want to perform some operation on it with 100 rows at a time.
So I am trying to find out how to use a counter on the row number and select 100 rows at a time (1 to 100, then 101 to 200, and so on up to 1000), performing an operation on each subset with a for loop. Can anyone suggest how this can be done? I could not find a good method.
An easy way would be to create a grouping variable, then use split() and lapply() to do whatever operations you need to.
Your grouping can be easily created using rep().
Here is an example:
set.seed(1)
demo = data.frame(A = sample(300, 50, replace=TRUE),
                  B = rnorm(50))
demo$groups = rep(1:5, each=10)
demo.split = split(demo, demo$groups)
lapply(demo.split, colMeans)
# $`1`
# A B groups
# 165.9000000 -0.1530186 1.0000000
#
# $`2`
# A B groups
# 168.2000000 0.1141589 2.0000000
#
# $`3`
# A B groups
# 126.0000000 0.1625241 3.0000000
#
# $`4`
# A B groups
# 159.4000000 0.3340555 4.0000000
#
# $`5`
# A B groups
# 181.8000000 0.0363812 5.0000000
If you prefer to not add the groups to your source data.frame, you can achieve the same effect by doing the following:
groups = rep(1:5, each=10)
lapply(split(demo, groups), colMeans)
Of course, replace colMeans with whatever function you want.
Using your example of a data.frame with 1000 rows, your rep() statement should be something like:
rep(1:10, each=100)
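If the number of rows were not an exact multiple of the block size, a variant (my addition, not from the original answer; df stands for your 1000-row data frame) is to let rep() truncate the grouping via length.out:
# grouping vector that also works when nrow(df) is not a multiple of 100
groups = rep(1:ceiling(nrow(df)/100), each=100, length.out=nrow(df))
lapply(split(df, groups), colMeans)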
The answer from @mrdwab is great and shows how to avoid a for loop. But if you really must use a for loop (the biglm package would be one example where you might want to), here is one approach:
for (i in seq(1, 1000, by=100)) {
  myfun(df[i:(i+99), ])
}
If the total number of rows is not a multiple of the block size then you might want something more like:
tmp <- seq(1, nrow(df), by=100)
tmp2 <- c(tail(tmp, -1) - 1, nrow(df))
n <- length(tmp)
out <- numeric(n)
for (i in seq_along(tmp)) {
  out[i] <- myfun(df[tmp[i]:tmp2[i], ])
}
