How to add data to an R data frame

I can't imagine it should be that difficult, but probably, coming from Python, my mindset is biased.
I know I'm going to carry out 50 calculations and the result of each calculation, together with two parameters characterizing the calculation, should build up a data frame.
So my approach is to instantiate the data frame and then I want to add the results whenever they become available. Please see the indicated row below:
# Number of simulations
nsim = 50
# The data frame which should carry the calculation (parameters and solutions).
sol <- data.frame(col.names=c("ni", "Xbar", "n"))
# Fifty values for n.
n <- seq.int(5, 5000, length.out=nsim)
for(ni in n)
{
# A random sample containing possible duplicates.
X <- sample(seq(-ni, ni, length=ni+1), replace=T)
Xbar <- round(mean(X), 3)
sol <- rbind(sol, c(ni, Xbar, n)) # <<-- How to do this correctly??
}
This doesn't work.

There are two ways to do this correctly. One is to pre-define your data.frame (its size) and then populate it iteratively in a for-loop:
nsim <- 10 # reduce to 10 to simplify output
n <- seq.int(5, 5000, length.out=nsim)
sol <- setNames(data.frame(matrix(nrow=nsim, ncol=3)), c("ni", "Xbar", "n"))
set.seed(1) # for reproducibility
for(ni in seq_along(n)) {
  Xbar <- round(mean(sample(seq(-n[ni], n[ni], length=n[ni]+1), replace=T)), 3)
  sol[ni,] <- c(ni, Xbar, n[ni])
}
Alternatively, you can use sapply on your n vector to create a vector of results and then cbind everything back together:
set.seed(1) # for reproducibility
sol <- data.frame(
  ni = seq_along(n),
  Xbar = sapply(n, function(ni) {
    round(mean(sample(seq(-ni, ni, length=ni+1), replace=T)), 3)
  }),
  n = n
)
Either way, you'll end up with a nice dataframe:
> str(sol)
'data.frame': 10 obs. of 3 variables:
$ ni : num 1 2 3 4 5 6 7 8 9 10
$ Xbar: num 0.667 -0.232 -14.599 -26.026 36.51 ...
$ n : num 5 560 1115 1670 2225 ...

1) Check what your initial sol contains.
> sol <- data.frame(col.names=c("ni", "Xbar", "n"))
> sol
col.names
1 ni
2 Xbar
3 n
Not what you want. See this question.
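For comparison, an empty data frame with the intended (numeric) columns could be created like this (just a sketch; the first answer above pre-allocates the final size instead, which is preferable):
sol <- data.frame(ni = numeric(0), Xbar = numeric(0), n = numeric(0))
str(sol)
# 'data.frame': 0 obs. of 3 variables:
#  $ ni  : num
#  $ Xbar: num
#  $ n   : num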
2) Make sure seq.int does what you expect - check the documentation of seq.int (or just its output), e.g. look at what n contains:
> n
[1] 5.0000 106.9388 208.8776 310.8163 412.7551 514.6939 616.6327
[8] 718.5714 820.5102 922.4490 1024.3878 1126.3265 1228.2653 1330.2041
[15] 1432.1429 1534.0816 1636.0204 1737.9592 1839.8980 1941.8367 2043.7755
[22] 2145.7143 2247.6531 2349.5918 2451.5306 2553.4694 2655.4082 2757.3469
[29] 2859.2857 2961.2245 3063.1633 3165.1020 3267.0408 3368.9796 3470.9184
[36] 3572.8571 3674.7959 3776.7347 3878.6735 3980.6122 4082.5510 4184.4898
[43] 4286.4286 4388.3673 4490.3061 4592.2449 4694.1837 4796.1224 4898.0612
[50] 5000.0000
Is that what you meant?
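If whole-number values of n were intended (an assumption; the question does not say), either of these would give them:
n <- round(seq(5, 5000, length.out = nsim)) # 50 evenly spaced values, rounded to integers
n <- seq(100, 5000, by = 100)               # or 50 exact multiples of 100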
3) Given (1), the problems are not surprising, but in any case, just step through the first pass of the loop a line at a time and see what happens:
nsim = 50
sol <- data.frame(col.names=c("ni", "Xbar", "n"))
ni=5
X <- sample(seq(-ni, ni, length=ni+1), replace=T)
Xbar <- round(mean(X), 3)
sol <- rbind(sol, c(ni, Xbar, n))
print(sol)
Gives:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 5) :
invalid factor level, NA generated
> print(sol)
col.names
1 ni
2 Xbar
3 n
4 <NA>
Now the behavior is unsurprising; we can't add three columns to something with one column.
4) You don't want to do it this way anyway. It's better to initialize sol to be its final size and then fill it in.
See, for example, this answer
However, the more common R idiom would be to avoid loops where possible; there are a number of functions that will let you create the whole thing at once.
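For instance, a loop-free sketch along the lines of the first answer above (the seq()-based resampling is carried over from the question; vapply is just one of several options):
nsim <- 50
n <- seq.int(5, 5000, length.out = nsim)
set.seed(1) # for reproducibility
Xbar <- vapply(n, function(ni) round(mean(sample(seq(-ni, ni, length = ni + 1), replace = TRUE)), 3), numeric(1))
sol <- data.frame(ni = seq_along(n), Xbar = Xbar, n = n)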

First of all, can you clarify the output format that you expect?
As of now, modifying the code to generate a data frame gives output like the following (let me know if this is what you expect; if so, it's not difficult to generate):
ni Xbar n
10.000 2.182 12.000
If this is what you expect, then one way to do this would be as follows:
Step 1: Create Vectors
Step 2: Create Data frame from the above vectors
Step 3: Run your operations in a loop & fill in row by row.
nsim <- 50
n <- seq.int(5, 5000, length.out=nsim)
ni <- vector(mode='numeric', length=nsim)
Xbar <- vector(mode='numeric', length=nsim)
out <- data.frame(ni=ni, Xbar=Xbar, n=n)
for (i in 1:length(n)) {
  X <- sample(seq(-n[i], n[i], length=n[i]+1), replace=T)
  out[i, 'Xbar'] <- round(mean(X), 3)
  out[i, 'ni'] <- n[i]
}
The output is the filled-in out data frame, with one row per simulation and columns ni, Xbar and n.

Related

Looping 10 possible N samples and calculate sums of columns

I'm generating n samples, each of dimension m, and populating an m x n matrix. Then I use the apply function to go over every column of the matrix (every sample generated) and return the sum of the elements of each column. At the end I calculate the mean of all of those sums.
data = replicate(n, rnorm(m, mean = mu, sd = variance))
sum_of_column <- function(col) {
  s <- sum(col)
}
sums <- apply(data, 2, sum_of_column)
me <- mean(sums)
sums is the vector where each element is the sum of the respective column. me is the mean of that vector.
But n is a single value and I want it to be a sequence of numbers (like 1:10), meaning I want to run this algorithm for every n = 1, n = 2, ..., n = 10, storing the sums and calculating their mean for each one. I may end up with a two-dimensional structure (a data frame) where one column holds the n's and the other column the corresponding mean of sums for that n.
In other words, I need to loop this algorithm I coded and store the value for each n-iteration. Like
n mean(sums)
1 123
2 13
...
10 94
I thought of doing this with a for loop, but would there be a smarter way to do this without explicitly looping? Maybe using apply for 3 dimensions?
You could put the logic into a function FUN. In its arguments, predefine m, mu, and sigma. n will be defined dynamically in the loop.
FUN <- \(n, m=1e5, mu=0, sigma=1) {
  mxn <- replicate(n, rnorm(m, mean=mu, sd=sigma))
  return(c(n=n, mean_of_sums=mean(colSums(mxn))))
}
FUN(1)
# n mean_of_sums
# 1 1 -226.6016
To loop over the n, you could use vapply, which is similar to sapply, but predefines FUN.VALUE in the third argument which saves work for R and, thus, is faster. To get the n into rows, you want to transpose the result.
n <- 1:100
set.seed(42)
r <- t(vapply(n, \(n) FUN(n), c(0, 0)))
r <- as.data.frame(r) ## if wanted
head(r)
# n mean_of_sums
# 1 1 -412.6182
# 2 2 -114.6650
# 3 3 304.1592
# 4 4 75.8026
# 5 5 -208.2705
# 6 6 126.6526
plot(r, type='l', col=4)
abline(h=0, col=8)

Minimum absolute difference between vector pairs (greedy algorithm)

Given a numeric vector, I'd like to find the smallest absolute difference in combinations of size 2. However, the point of friction comes with the use of combn to create the matrix holding the pairs. How would one handle issues when a matrix/vector is too large?
When the number of resulting pairs (number of columns) using combn is too large, I get the following error:
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
This post states that the size limit of a matrix is roughly one billion rows and two columns.
Here is the code I've used. Apologies for the use of cat in my function output -- I'm solving the Minimum Absolute Difference in an Array Greedy Algorithm problem in HackerRank and R outputs are only counted as correct if they're given using cat:
minimumAbsoluteDifference <- function(arr) {
  combos <- combn(arr, 2)
  cat(min(abs(combos[1,] - combos[2,])))
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
# This fails
inputFail <- rpois(10e4, 1)
minimumAbsoluteDifference(inputFail)
#Error in matrix(r, nrow = len.r, ncol = count) :
# invalid 'ncol' value (too large or NA)
TL;DR
No need for combn or the like, simply:
min(abs(diff(sort(v))))
The Nitty Gritty
Finding the difference between every possible combination is O(n^2). So when we get to vectors of length 1e5, the task is burdensome both computationally and memory-wise.
We need a different approach.
How about sorting and taking the difference only with its neighbor?
By first sorting, for any element vj, the smaller of |vj - vj-1| and |vj - vj+1| will be the smallest such difference involving vj. For example, given the sorted vector v:
v = -9 -8 -6 -4 -2 3 8
The smallest distance from -2 is given by:
|-2 - 3| = 5
|-4 - -2| = 2
There is no need in checking any other elements.
This is easily implemented in base R as follows:
getAbsMin <- function(v) min(abs(diff(sort(v))))
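As a quick sanity check on the question's own small example (stated above to return 3):
getAbsMin(c(3, -7, 0))
# sort() gives -7 0 3, diff() gives 7 3, so the minimum absolute difference is 3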
I'm not going to use rpois, as with any reasonably sized vector duplicates will be produced, which will trivially give 0 as an answer. A more sensible test would be with runif or sample (minimumAbsoluteDifference2 is from the answer provided by @RuiBarradas):
set.seed(1729)
randUnif100 <- lapply(1:100, function(x) {
  runif(1e3, -100, 100)
})
randInts100 <- lapply(1:100, function(x) {
  sample(-(1e9):(1e9), 1e3)
})
head(sapply(randInts100, getAbsMin))
[1] 586 3860 2243 2511 5186 3047
identical(sapply(randInts100, minimumAbsoluteDifference2),
sapply(randInts100, getAbsMin))
[1] TRUE
options(scipen = 99)
head(sapply(randUnif100, getAbsMin))
[1] 0.00018277206 0.00020549633 0.00009834766 0.00008395873 0.00005299225 0.00009313226
identical(sapply(randUnif100, minimumAbsoluteDifference2),
sapply(randUnif100, getAbsMin))
[1] TRUE
It's very fast as well:
library(microbenchmark)
microbenchmark(a = getAbsMin(randInts100[[50]]),
b = minimumAbsoluteDifference2(randInts100[[50]]),
times = 25, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
a 1.0000 1.0000 1.0000 1.0000 1.00000 1.00000 25
b 117.9799 113.2221 105.5144 107.6901 98.55391 81.05468 25
Even for very large vectors, the result is instantaneous:
set.seed(321)
largeTest <- sample(-(1e12):(1e12), 1e6)
system.time(print(getAbsMin(largeTest)))
[1] 3
user system elapsed
0.083 0.003 0.087
Something like this?
minimumAbsoluteDifference2 <- function(x){
  stopifnot(length(x) >= 2)
  n <- length(x)
  inx <- rep(TRUE, n)
  m <- NULL
  for(i in seq_along(x)[-n]){
    inx[i] <- FALSE
    curr <- abs(x[i] - x[which(inx)])
    m <- min(c(m, curr))
  }
  m
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
minimumAbsoluteDifference2(input0) #returns 3
set.seed(2020)
input1 <- rpois(1e3, 1)
minimumAbsoluteDifference(input1) #returns 0
minimumAbsoluteDifference2(input1) #returns 0
inputFail <- rpois(1e5, 1)
minimumAbsoluteDifference(inputFail) # This fails
minimumAbsoluteDifference2(inputFail) # This does not fail

how to find the most similar columns in a matrix?

I have a matrix in which I would like to find those columns that are very similar (I am not looking to find identical columns)
# to generate a matrix
Mat<- matrix(rexp(200, rate=.1), ncol=1000, nrow=400)
I personally thought of cor or all.equal and tried the following, but it did not work.
indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))
what I need as output is show which columns are highly similar and the degrees of their similarity (it can be correlation coefficient)
Similar means their values are close within some threshold (for example, over 75% of the element-wise differences, e.g. column1 - column2, are less than 0.5 in absolute value).
I would also love to see how this differs from using correlation. Do the two approaches give identical results?
Using correlation you could try (with a simpler matrix for demonstration)
set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)
corr <- cor(Mat)
res <- which(lower.triangle(corr) > .3, arr.ind = TRUE)
data.frame(res[res[,1] != res[,2],], correlation = corr[res[res[,1] != res[,2],]])
row col correlation
1 8 1 0.3387738
2 6 2 0.3350891
Both row and col actually refer to the columns in your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738
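To double-check a reported pair, the correlation can be computed directly; it should reproduce the value from the table above:
cor(Mat[, 8], Mat[, 1]) # should print 0.3387738, matching the (row=8, col=1) entry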
I'd take a linear regression approach:
Mat<- matrix(rexp(200, rate=.1), ncol=100, nrow=400)
combinations <- combn(1:ncol(Mat), m = 2)
sigma <- NULL
for(i in 1:ncol(combinations)){
  # residual standard error for the i-th column pair
  sigma <- c(sigma, summary(lm(Mat[,combinations[1,i]] ~ Mat[,combinations[2,i]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))
And the residual standard error serves as an optional similarity criterion.
You can further order data frame by sigma and get best/worst combinations.
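For instance, a small sketch of that ordering step (continuing the code above, with the column names defined there):
sigma_sorted <- sigma[order(sigma$sigma), ]
head(sigma_sorted)                      # combinations with the smallest residual standard error
combinations[, sigma_sorted$comb_nr[1]] # column indices of the most similar pair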
If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:
set.seed(1)
Mat <- matrix(runif(40000), ncol=100, nrow=400)
col.combs <- t(combn(1:ncol(Mat), 2))
similar <- data.frame(Col1=NULL, Col2=NULL, Corr=NULL, Pct.Diff=NULL)
# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
  i <- col.combs[k, 1]
  j <- col.combs[k, 2]
  # Difference within threshold?
  diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)
  pair.corr <- cor(Mat[, i], Mat[, j])
  if (mean(diff.thresh) > 0.75)
    similar <- rbind(similar, c(i, j, pair.corr, 100*mean(diff.thresh)))
}
In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.
> head(similar)
Col1 Col2 Corr Pct.Diff
1 1 2 -0.003187894 76.75
2 1 3 0.074061019 76.75
3 1 4 0.082668387 78.00
4 1 5 0.001713751 75.50
5 1 8 0.052228907 75.75
6 1 12 -0.017921978 78.00
Perhaps it's not the best solution, but gets the job done.
Also, if you're unsure why I used mean(diff.thresh), it's because the sum of a logical vector is the number of TRUE elements. The mean is the sum divided by the length, which means that in this case it's the fraction of values within the threshold.
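A tiny illustration of that point:
x <- c(TRUE, FALSE, TRUE, TRUE)
sum(x)  # 3: number of TRUE elements
mean(x) # 0.75: fraction of TRUE elements, i.e. the fraction of values within the threshold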

R: Rolling window function with adjustable window and step-size for irregularly spaced observations

Say there is a 2-column data frame with a time (or distance) column which sequentially increases, and an observation column which may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), then slide the window over by Y seconds (e.g. 2.5 seconds), and repeat? The number of observations in the window is based on the time column, thus both the number of observations per window and the number of observations to slide the window by may vary. The function should accept any window size up to the number of observations, and a step size.
Here is sample data (see "Edit:" for a larger sample set)
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
time measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140 NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924
Desired Output for the specific case of a 5 second window, 2.5 second step, first window from -2.5 to 2.5, na.rm=FALSE:
[1] 1.0222694
[2] NA
[3] NA
[4] 1.0126639
[5] 0.9965048
[6] 0.9514456
[7] 1.0518228
[8] NA
[9] NA
[10] NA
Explanation: In the desired output the very first window looks for times between -2.5 and 2.5. One observation of measure is in this window, and it is not an NA, thus we get that observation: 1.0222694. The next window is from 0 to 5, and there is an NA in the window, so we get NA. Same for the window from 2.5 to 7.5. The next window is from 5 to 10. There are 5 observations in the window, none are NA. So, we get the average of those 5 observations (i.e. mean(dat[dat$time >5 & dat$time <10,'measure']) )
What I tried: Here is what I tried for the specific case of a window where the step size is 1/2 the window duration:
windo <- 5 # duration in seconds of window
# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5)
l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)
as.vector(rbind(l2,l1))
Not flexible, not elegant, not efficient. If step size isn't 1/2 window size, the approach will not work, as is.
Any thoughts on a general solution to this kind of problem? Any solution is acceptable. The faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set, there are several million observations contained in a list of data frames (the largest data frame has ~400,000 observations).
Below is some extra info: a larger sample set.
Edit: As per request, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above, as well as ones like the following and larger:
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
Here is an attempt with Rcpp. The function assumes that data is sorted according to time. More testing would be advisable and adjustments could be made.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector rollAverage(const NumericVector & times,
                          NumericVector & vals,
                          double start,
                          const double winlen,
                          const double winshift) {
  int n = ceil((max(times) - start) / winshift);
  NumericVector winvals;
  NumericVector means(n);
  int ind1(0), ind2(0);
  for(int i=0; i < n; i++) {
    if (times[0] < (start+winlen)) {
      while((times[ind1] <= start) &
            (times[ind1+1] <= (start+winlen)) &
            (ind1 < (times.size() - 1))) {
        ind1++;
      }
      while((times[ind2+1] <= (start+winlen)) & (ind2 < (times.size() - 1))) {
        ind2++;
      }
      if (times[ind1] >= start) {
        winvals = vals[seq(ind1, ind2)];
        means[i] = mean(winvals);
      } else {
        means[i] = NA_REAL;
      }
    } else {
      means[i] = NA_REAL;
    }
    start += winshift;
  }
  return means;
}
Testing it:
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
rollAverage(dat$time, dat$measure, -2.5, 5.0, 2.5)
#[1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA NA
With your list of data.frames (using data.table):
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
library(data.table)
dat <- lapply(dat, setDT)
for (ind in seq_along(dat)) dat[[ind]][, i := ind]
#possibly there is a way to avoid these copies?
dat <- rbindlist(dat)
system.time(res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by=i])
#user system elapsed
#1.51 0.02 1.54
print(res)
# i V1
# 1: 1 1.0217126
# 2: 1 0.9334415
# 3: 1 0.9609050
# 4: 1 1.0123473
# 5: 1 0.9965922
# ---
#6000596: 300 1.1121296
#6000597: 300 0.9984581
#6000598: 300 1.0093060
#6000599: 300 NA
#6000600: 300 NA
Here is a function that gives the same result for your small data frame. It's not particularly quick: it takes several seconds to run on one of the larger datasets in your second dat example.
rolling_summary <- function(DF, time_col, fun, window_size, step_size, min_window=min(DF[, time_col])) {
  # time_col is name of time column
  # fun is function to apply to the subsetted data frames
  # min_window is the start time of the earliest window
  times <- DF[, time_col]
  # window_starts is a vector of the windows' minimum times
  window_starts <- seq(from=min_window, to=max(times), by=step_size)
  # The i-th element of window_rows is a vector that tells us the row numbers of
  # the data-frame rows that are present in window i
  window_rows <- lapply(window_starts, function(x) { which(times>=x & times<x+window_size) })
  window_summaries <- sapply(window_rows, function(w_r) fun(DF[w_r, ]))
  data.frame(start_time=window_starts, end_time=window_starts+window_size, summary=window_summaries)
}
rolling_summary(DF=dat,
                time_col="time",
                fun=function(DF) mean(DF$measure),
                window_size=5,
                step_size=2.5,
                min_window=-2.5)
Here are some functions that will give the same output on your first example:
partition <- function(x, window, step = 0){
  a = x[x < step]
  b = x[x >= step]
  ia = rep(0, length(a))
  ib = cut(b, seq(step, max(b) + window, by = window))
  c(ia, ib)
}
roll <- function(df, window, step = 0, fun, ...){
  tapply(df$measure, partition(df$time, window, step), fun, ...)
}
roll_steps <- function(df, window, steps, fun, ...){
  X = lapply(steps, roll, df = df, window = window, fun = fun, ...)
  names(X) = steps
  X
}
Output for your first example:
> roll_steps(dat, 5, c(0, 2.5), mean)
$`0`
1 2 3 4 5
NA 1.0126639 0.9514456 NA NA
$`2.5`
0 1 2 3 4
1.0222694 NA 0.9965048 1.0518228 NA
You can also ignore missing values this way easily:
> roll_steps(dat, 5, c(0, 2.5), mean, na.rm = TRUE)
$`0`
1 2 3 4 5
0.7275438 1.0126639 0.9514456 0.9351326 NaN
$`2.5`
0 1 2 3 4
1.0222694 0.8138012 0.9965048 1.0518228 0.6122983
This can also be used for a list of data.frames:
> x = lapply(dat2, roll_steps, 5, c(0, 2.5), mean)
Ok, how about this.
library(data.table)
dat <- data.table(dat)
setkey(dat, time)
# function to compute a given stat over a time window on a given data.table
window_summary <- function(start_tm, window_len, stat_fn, my_dt) {
  pos_vec <- my_dt[, which(time>=start_tm & time<=start_tm+window_len)]
  return(stat_fn(my_dt$measure[pos_vec]))
}
# a vector of window start times
start_vec <- seq(from=-2.5, to=dat$time[nrow(dat)], by=2.5)
# sapply'ing the function above over vector of start times
# (in this case, getting mean over 5 second windows)
result <- sapply(start_vec, window_summary,
window_len=5, stat_fn=mean, my_dt=dat)
On my machine, it processes the first 20,000 rows of your large dataset in 13.06781 secs and all rows in 51.58614 secs.
Here's another attempt using a pure data.table approach and its between function.
I've compared it (via Rprof) against the above answers (except @Roland's answer) and it seems the most optimized one.
I haven't tested for bugs, but if you like it, I'll expand the answer.
Using your dat from above
library(data.table)
Rollfunc <- function(dat, time, measure, wind = 5, slide = 2.5, FUN = mean, ...){
  temp <- seq.int(-slide, max(dat$time), by = slide)
  temp <- cbind(temp, temp + wind)
  setDT(dat)[, apply(temp, 1, function(x) FUN(measure[between(time, x[1], x[2])], ...))]
}
Rollfunc(dat, time, measure, 5, 2.5)
## [1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA
## [10] NA
You can also specify the function and its arguments, for example:
Rollfunc(dat, time, measure, 5, 2.5, max, na.rm = TRUE)
will also work
Edit: I did some benchmarks against @Roland's answer and his method clearly wins (by far), so I would go with the Rcpp approach.

using lists for simulation

I set myself a little challenge on my way to learning R. The question was: given a sample of 500 numbers from a normal distribution with mean 20, how many numbers under 20 would I get for standard deviations from 6 to 10? Just to learn more, I decided to get 4 samples for each sd. So by the end I should have:
sd6samp1:...
sd6samp2:...
....
sd10samp4:...
My first approach, which worked was:
ddss<-c(6:10) # sd's
sam<-c(1:4) # 4 samples for each
k=0 # counter in 0
for (i in ddss) {                         # for each sd
  for (j in sam) {                        # for each sample
    nam <- paste("sam",i,".",j, sep="")   # building a name
    n <- assign(nam, rnorm(500, 20, i))   # the great assign function
    k <- k + sum(n<=0)
  }
  print(assign(paste("ds",i,sep=""), k))  # ohh assign you're great
  k=0                                     # reset counter
}
While looking for how to create variable names with the looping 'i', I found that 'assign' does the work, but it also said:
Note though that if you are planning some simulations,
many guRus would say that you should use a list.
So I thought it would be good to learn lists...
In the meanwhile I also discovered another great option...
ddss <- c(6:10)
for (i in ddss) {
  print(paste('prob. (x<=0), with sd=', i))
  print(pnorm(0, mean=20, sd=i)*500)
}
This worked to answer the question, but the lists were still to be done... and a lot of R had yet to be learned. The main idea wasn't to get the actual probability or number of negatives... but to learn R, and specifically some looping.
So, I've been trying to go with the mentioned lists.
My closest approach has been:
ddss<-c(6:10) # sd's to be calculated.
sam<-c(1:4)   # 4 samples for each sd
liss<-list()  # initializing the list
for (i in ddss) {        # for each sd
  liss[[i]] <- list()
  for (j in sam) {       # for each sample
    liss[[i]][[j]] <- rnorm(500, 20, i)
    print(paste('ds',i,'samp',j,'=',sum(liss[[i]][[j]]<0)))
  }
}
With this one I get the information but I'm wondering about two issues (1 & 2) and some other questions (3 & 4):
1) I get a list of 10 elements: 5 empty ones and then 5 with sublists. I can't seem to find out how to work with elements 1:5 of the list (the sd's) via the names 6:10 (the very sd values).
2) Even though I tried, I couldn't get to name the list's elements through the 'for' loops. Any insight on these two issues would be great.
3) Since this is in the context of simulations, what do you think is better: nested lists (lists with sublists) or simple (longer) lists?
4) I wondered whether the 'apply' functions would be of any help here. I tried to do something like:
vbv<-matrix(c(6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9))
lsl<-apply(vbv, 2, function(x) rnorm(500,20,x))
But it looks like I'm not getting even close...
Thanks for your time if you've read this far!
You may as well take some more time to reply ;-).
The problem is in your indexes: you are running over indexer i from ddss, which runs from 6 to 10. So in the first tour of duty in your outer loop, your first statement really says: liss[[6]]<-list(), implying that the first 5 ones are NULL.
So if you insist on working with loops, this is what you should do (check ?seq_along):
ddss<-c(6:10) # sd's to be calculated.
sam<-c(1:4)   # 4 samples for each sd
liss<-list()  # initializing the list
for (i in seq_along(ddss)) { # now, i runs from 1 to 5
  liss[[i]] <- list()
  for (j in sam) { # for each sample
    liss[[i]][[j]] <- rnorm(500, 20, ddss[i]) # note: ddss[i] is the sd, i is just the index
    print(paste('ds',ddss[i],'samp',j,'=',sum(liss[[i]][[j]]<0)))
  }
  names(liss[[i]])<-as.character(sam) # this should solve your naming issue (1/2)
}
names(liss)<-as.character(ddss) # this should solve your naming issue (2/2)
Note that, as always, it is a good idea to name your variables something more useful than i or j: if you'd named it curds, maybe you wouldn't have used it immediately as an indexer in a list?
Now, if you are really aiming for improvement (but want to stick to lists), you indeed want to go with the apply style functions:
liss<-lapply(ddss, function(curds){ # apply the inline function to each ds and store results in a list
  return(lapply(sam, function(cursam){ # apply inline function to each sam and store results in a list
    rv<-rnorm(500, 20, curds)
    cat('ds',curds,'samp',cursam,'=',sum(rv<0), "\n") # maybe better for your purposes.
    return(rv)
  }))
})
Finally, for your case, there is not a lot of reason to actually use lists (nor do you even need to keep the sampled data for each ds/sam): you can store everything as a three-dimensional array, but since you specify it as a learning exercise (hey, maybe the array thing can be your next exercise :-)), I'll leave it at that.
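For what it's worth, here is a minimal sketch of that array idea (the dimension names and the "< 20" count are assumptions based on the exercise as stated):
ddss <- 6:10
sam <- 1:4
# 500 draws x 4 samples x 5 sd values; the sd varies along the third dimension
sims <- array(
  rnorm(500 * length(sam) * length(ddss), mean = 20,
        sd = rep(ddss, each = 500 * length(sam))),
  dim = c(500, length(sam), length(ddss)),
  dimnames = list(NULL, paste0("samp", sam), paste0("sd", ddss))
)
# count the values under 20 for every sample/sd combination
apply(sims, c(2, 3), function(x) sum(x < 20))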
lapply() is helpful here, where we can just apply over the set of values for the SD. It helps to write a custom wrapper around the rnorm() function so we can pass in different values for the various arguments of rnorm(), and handle the k replicates (k = 4 in your example) in a nice fashion also. That wrapper is foo() below:
foo <- function(sd, n, mean, reps = 1) {
  rands <- rnorm(n * reps, mean = mean, sd = sd)
  if(reps > 1)
    rands <- matrix(rands, ncol = reps)
  rands
}
We use it in an lapply() call like so:
sims <- lapply(6:10, FUN = foo, mean = 20, n = 500, reps = 4)
Which gives:
R> str(sims)
List of 5
$ : num [1:500, 1:4] 30.3 22 15.6 20 19.4 ...
$ : num [1:500, 1:4] 20.9 21.7 17.7 35 30 ...
$ : num [1:500, 1:4] 17.88 26.48 5.19 19.25 15.59 ...
$ : num [1:500, 1:4] 27.41 12.72 9.38 35.09 11.08 ...
$ : num [1:500, 1:4] 16.2 11.6 20.5 35.4 27.3 ...
We can then compute the number of observations < 20 per SD
names(sims) <- paste("SD", 6:10, sep = "")
out <- lapply(sims, function(x) colSums(x < 20))
Which gives:
R> out
$SD6
[1] 218 251 253 227
$SD7
[1] 250 242 233 232
$SD8
[1] 258 241 246 274
$SD9
[1] 252 245 249 258
$SD10
[1] 253 259 241 242
@Joris suggests I show how to access elements of the list. For example, if you want the results of the simulations for SD = 10, we could do out[[5]] because 10 was the 5th value in the vector of SDs we applied over, or, because I named the elements of the output list out, we can ask for the results of that simulation using out[["SD10"]].
To answer some of the specific points about your loops etc.:
To add names to a list use names(), e.g. names(mylist) <- c("foo","bar"). You'd be better off calling names() once to set up all the names in a single shot rather than once per iteration inside your loop - you probably wouldn't want to fill the names in as you go along, as that would be inefficient.
I don't think it makes too much difference whether you use a nested list or a list containing a matrix as per my example. To alter foo() to return a list so the output of lapply() is a list of lists, we could do:
Code:
bar <- function(sd, n, mean, reps = 1) {
  rands <- rnorm(n * reps, mean = mean, sd = sd)
  if(reps > 1)
    rands <- split(rands, rep(seq_len(reps), each = n))
  rands
}
sims2 <- lapply(6:10, FUN = bar, mean = 20, n = 500, reps = 4)
names(sims2) <- paste("SD", 6:10, sep = "")
out2 <- lapply(sims2, function(x) sapply(x, function(y) sum(y < 20)))
which gives the same output as before.
I am going to throw in another solution using the plyr package, which I think is tailor-made for such exercises.
library(plyr)
# generate a data frame of parameters, repeating some as required
parameters = data.frame(mean = 20, sd = rep(6:10, each = 4))
# generate sample data for each combination of parameters
sample_data = mdply(parameters, rnorm, n = 500)
# generate answer by counting number of observations less than 20
answer = data.frame(
  parameters,
  obs_less_20 = rowSums(sample_data[, -c(1, 2)] < 20)
)
head(answer)
mean sd obs_less_20
1 20 6 247
2 20 6 250
3 20 6 242
4 20 6 259
5 20 7 240
6 20 7 237
