Generate multivariate normal data with unequal sample sizes

I would like to generate multivariate random data, manipulating the sample size and variance, using MASS::mvrnorm (or, as the case may prove to be, rnorm). This is fairly straightforward; the trick is that I intend to use the generated data to simulate a statistic that compares two samples of different lengths. Essentially, this creates a 3×3 design: three sample-size pairs (e.g. [450,150], [300,300], [150,450]) crossed with three variance pairs (e.g. [1,1], [1,3], [1,10]).
n <- c(450, 150, 300, 300, 150, 450)                     # sample sizes (three pairs)
sig <- matrix(c(1, 1, 1, 3, 1, 10), nrow=2, byrow=F)     # variances (three pairs)
mu <- c(5, 5, 5)                                         # mean is constant across all conditions
mvrnorm(n, mu, sig)
I'm sure that I have to iterate through my vector of sample sizes, just as I would if I were only generating one sample size per condition. But since I'm generating two sample sizes for each condition, I am unsure how to do this.

If I understand you correctly, you want six samples: 450 draws from N(5, 1), 150 draws from N(5, 1), 300 draws from N(5, 1), 300 draws from N(5, 3), 150 draws from N(5, 1), and 450 draws from N(5, 10). You can get that via
samples <- mapply(rnorm, n = c(450,150,300,300,150,450), sd = c(1,1,1,3,1,10))
Obviously I'll omit the output due to size, but you can see what I mean:
str(samples)
List of 6
$ : num [1:450] 0.785 -0.21 0.192 -0.265 -0.501 ...
$ : num [1:150] 1.224 -0.315 -0.131 -0.923 0.407 ...
$ : num [1:300] -0.413 -1.081 0.469 1.332 0.244 ...
$ : num [1:300] -0.748 -0.628 0.753 1.4 3.883 ...
$ : num [1:150] 0.376 -1.193 1.133 1.839 1.528 ...
$ : num [1:450] 2.19 -3.17 2.45 0.75 -8.4 ...
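Note that rnorm's sd argument is a standard deviation and its mean defaults to 0, which is what the draws above reflect. If you want the samples to actually follow N(5, variance) as stated in your design, pass those arguments explicitly; the structure of the result is the same:
samples <- mapply(rnorm, n = c(450, 150, 300, 300, 150, 450),
                  mean = 5, sd = sqrt(c(1, 1, 1, 3, 1, 10)))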
Then you want to run some sort of test between samples[[1]] and samples[[2]], then between samples[[3]] and samples[[4]], and finally between samples[[5]] and samples[[6]]. I don't know what test you intend to run, but that should be straightforward if you have a function for the test: Just feed in the proper list elements.
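For example, if the statistic were a two-sample Welch t-test (purely a placeholder for whatever you actually intend to compute), the pairing would look like this:
# hypothetical example: compare the two samples within each condition
tests <- lapply(c(1, 3, 5), function(i) t.test(samples[[i]], samples[[i + 1]]))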
Update
Based on the comment, what you need to get all the sample combinations you want is
f <- function(sample_size_pairs, sd_pairs) {
  sapply(1:nrow(sample_size_pairs), function(i) {
    mapply(rnorm, n = sample_size_pairs[i, ], sd = sd_pairs[i, ])
  })
}
sample_sizes <- matrix(c(rep(c(450, 150), 3), rep(c(150, 450), 3),
                         rep(c(300, 300), 3)), ncol = 2, byrow = TRUE)
sds <- matrix(rep(c(1, 1, 1, 3, 1, 10), 3), ncol = 2, byrow = TRUE)
g <- f(sample_sizes, sds)
str(g)
List of 9
$ :List of 2
..$ : num [1:450] 1.4243 1.733 0.5004 -0.8036 -0.0101 ...
..$ : num [1:150] -0.0607 0.1797 0.3787 -0.6676 -1.4352 ...
$ :List of 2
..$ : num [1:450] -0.0766 -0.1407 -0.4893 0.2251 1.0174 ...
..$ : num [1:150] -1.8814 -1.3532 -1.2888 -0.0542 0.2637 ...
$ :List of 2
..$ : num [1:450] 1.945 -1.375 -1.258 0.292 -0.208 ...
..$ : num [1:150] -1.291 -0.557 -1.199 1.385 -2.062 ...
$ :List of 2
..$ : num [1:150] -2.461 -0.345 -1.454 -0.286 0.942 ...
..$ : num [1:450] -0.75 -0.636 -0.488 1.818 -0.585 ...
$ :List of 2
..$ : num [1:150] -1.238 -0.765 -1.447 -1.153 -1.466 ...
..$ : num [1:450] 2.5461 0.9368 -0.0503 -0.9727 -1.4101 ...
$ :List of 2
..$ : num [1:150] 0.7209 2.4342 -0.7617 0.0285 -1.3297 ...
..$ : num [1:450] -0.6882 0.0927 -0.8981 -0.4088 1.3421 ...
$ : num [1:300, 1:2] 2.217 -0.161 -0.976 0.26 -0.362 ...
$ : num [1:300, 1:2] 0.456 -0.112 -0.541 3.759 0.32 ...
$ : num [1:300, 1:2] 0.165 0.247 -0.187 -0.624 -1.335 ...
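One wrinkle in the output above: in the three conditions where both samples have 300 draws, mapply simplifies the result to a 300 x 2 matrix instead of a list of two vectors. If you want the same list-of-two structure in every condition, a small variation on f (just a sketch of the same idea, not a different method) is to turn that simplification off:
f <- function(sample_size_pairs, sd_pairs) {
  lapply(1:nrow(sample_size_pairs), function(i) {
    # SIMPLIFY = FALSE keeps each condition as a list of two vectors,
    # even when the two sample sizes happen to be equal
    mapply(rnorm, n = sample_size_pairs[i, ], sd = sd_pairs[i, ],
           SIMPLIFY = FALSE)
  })
}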

Related

Using rollapply to output to lists of lists

I would like to use rollapply or rollapplyr to apply the modwt function to my time series data.
I'm familiar with how rollapply/r works but I need some help setting up the output so that I can correctly store my results when using rollapply.
The modwt function in the waveslim package takes a time series and decomposes it into J levels; for my particular problem J = 4, which means I will have 4 sets of coefficients from my single time series, stored in a list of 5. Of this list I am only concerned with d1, d2, d3 & d4.
The output of the modwt function looks as follows:
> str(ar1.modwt)
List of 5
$ d1: num [1:200] -0.223 -0.12 0.438 -0.275 0.21 ...
$ d2: num [1:200] 0.1848 -0.4699 -1.183 -0.9698 -0.0937 ...
$ d3: num [1:200] 0.5912 0.6997 0.5416 0.0742 -0.4989 ...
$ d4: num [1:200] 1.78 1.86 1.85 1.78 1.65 ...
$ s4: num [1:200] 4.64 4.42 4.19 3.94 3.71 ...
- attr(*, "class")= chr "modwt"
- attr(*, "wavelet")= chr "la8"
- attr(*, "boundary")= chr "periodic"
In the example above I have applied the modwt function to the full length time series of length 200 but I wish to apply it to a small rolling window of 30 using rollapply.
I have already tried the following, but the output is a large matrix and I cannot easily identify which values belong to d1, d2, d3 or d4:
roller <- rollapplyr(ar1, 30,FUN=modwt,wf="la8",n.levels=4,boundary="periodic")
The output of this is a large matrix with the following structure:
> str(roller)
List of 855
$ : num [1:30] 0.117 -0.138 0.199 -1.267 1.872 ...
$ : num [1:30] -0.171 0.453 -0.504 -0.189 0.849 ...
$ : num [1:30] 0.438 -0.3868 0.1618 -0.0973 -0.0247 ...
$ : num [1:30] -0.418 0.407 0.639 -2.013 1.349 ...
...lots of rows omitted...
$ : num [1:30] 0.307 -0.658 -0.105 1.128 -0.978 ...
[list output truncated]
- attr(*, "dim")= int [1:2] 171 5
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "d1" "d2" "d3" "d4" ...
How can I set up a variable that stores the (200 - 30) + 1 = 171 windows as a list, with a list inside each one holding the coefficients for each of the scales d1, d2, d3 and d4?
For a reproducible example please use the following:
library(waveslim)
data(ar1)
ar1.modwt <- modwt(ar1, "la8", 4)
1) Define modwt2, which invokes modwt, takes the first 4 components and strings them out into a numeric vector. Then use rollapplyr with that, giving rollr, where each row of rollr is the result of one call to modwt2. Finally, reshape each row of rollr into a separate matrix and create a list, L, of those matrices:
modwt2 <- function(...) unlist(head(modwt(...), 4))
rollr <- rollapplyr(ar1, 30, FUN = modwt2, wf = "la8", n.levels = 4, boundary = "periodic")
L <- lapply(1:nrow(rollr), function(i) matrix(rollr[i,], , 4))
If a 30 x 4 x 171 array is desired then the following will simplify it into a 3d array:
simplify2array(L)
or as a list of lists:
lapply(L, function(x) as.list(as.data.frame(x)))
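If you want the d1-d4 labels back on each window (unlist() and matrix() drop them), one way, assuming the column order d1, d2, d3, d4 that modwt2 produces, is:
L <- lapply(L, function(m) {
  colnames(m) <- c("d1", "d2", "d3", "d4")  # restore the scale names
  m
})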
2) This is an alternate solution that just uses lapply directly and returns a list each of whose components is the list consisting of d1, d2, d3 and d4.
lapply(1:(200-30+1), function(i, ...) head(modwt(ar1[seq(i, length = 30)], ...), 4),
       wf = "la8", n.levels = 4, boundary = "periodic")
Updates: Code improvements, expand (1) and add (2).

Subsetting and replacing in a list variable nested in a dataframe

Here is my dataframe example. It includes a column named "dta", a list column holding the n values I want to keep for each of my scenarios:
set.seed(777)
df <- data.frame(theo = numeric(),
                 size = numeric(),
                 dta  = I(list()))
df[ 1: 5,"theo"] <- qlnorm(0.1, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 6:10,"theo"] <- qlnorm(0.2, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 1: 5,"size"] <- 10
df[ 6:10,"size"] <- 20
for (i in 1:10) {
  df$dta[i] <- list(rlnorm(df$size[i], meanlog = 0, sdlog = 1))
}
df
str(df)
This should give a df like:
theo size dta
1 0.2776062 10 1.631967....
2 0.2776062 10 0.737667....
3 0.2776062 10 0.131252....
4 0.2776062 10 1.937334....
5 0.2776062 10 0.739868....
6 0.4310112 20 4.631176....
7 0.4310112 20 2.610180....
8 0.4310112 20 0.175918....
9 0.4310112 20 3.501670....
10 0.4310112 20 0.588178....
or:
'data.frame': 10 obs. of 3 variables:
$ theo: num 0.278 0.278 0.278 0.278 0.278 ...
$ size: num 10 10 10 10 10 20 20 20 20 20
$ dta :List of 10
..$ : num 1.632 0.671 1.667 0.671 5.148 ...
..$ : num 0.738 1.056 0.152 0.967 10.089 ...
..$ : num 0.131 1.256 0.457 3.574 4.211 ...
..$ : num 1.937 2.359 3.496 0.297 4.587 ...
..$ : num 0.74 0.66 0.481 0.434 1.874 ...
..$ : num 4.631 0.298 10.28 0.933 1.286 ...
..$ : num 2.61 0.472 0.251 1.61 0.303 ...
..$ : num 0.176 0.566 2.156 0.407 3.52 ...
..$ : num 3.502 1.748 1.283 0.648 1.359 ...
..$ : num 0.588 0.392 2.447 1.926 0.86 ...
..- attr(*, "class")= chr "AsIs"
Now, I want to process that list column so that, for each list, every value is compared with the fixed value "theo" stored in the dataframe, and any value less than or equal to "theo" is recoded as NA.
Here is working code that gives me exactly what I want:
df$dta2 <- df$dta
for (i in 1:10) {
  df$dta2[[i]][df$dta2[[i]] <= df$theo[i]] <- NA
}
However, I was wondering: is there a way to get the same result with a single line of code, and no for loop, for a conditional replacement of values contained in a list nested in a dataframe?
We can use Map
df$dta3 <- Map(function(x,y) replace(x, x<=y, NA), df$dta, df$theo)
all.equal(df$dta2, df$dta3, check.attributes=FALSE)
#[1] TRUE
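If you prefer the tidyverse, the same replacement can be written with purrr's map2 (a sketch that does exactly what the Map call above does; it assumes purrr is installed):
library(purrr)
df$dta4 <- map2(df$dta, df$theo, ~ replace(.x, .x <= .y, NA))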

Quickly sum a big list of lists?

I have 10000 lists (results of a simulation), each containing 22500 lists (one per pixel of an image), each of which contains a vector of length 55.
# Simple Example
m <- replicate(4, list(runif(55)))
m2 <- replicate(3, list(m))
str(m2,list.len = 3)
List of 3
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
# my function
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
# by hand
identical(m2[[1]][[1]] + m2[[2]][[1]] + m2[[3]][[1]], m3[[1]] )
I wrote a nested lapply with Reduce to sum the lists. On a small example like the one above it's fast, but on my real data it's really slow.
#slow code
m <- replicate(22500, list(runif(55)))
m2 <- replicate(10000, list(m))
str(m2,list.len = 3)
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
How can I speed this up, or should I change data structures?
Thanks.
This gives some improvement (>2x):
split(Reduce(`+`, lapply(m2, unlist)), rep(seq_along(m2[[1]]), lengths(m2[[1]])))
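On the small example you can check that this matches the nested lapply/Reduce result (a quick sanity check, not part of the speed-up itself):
m4 <- split(Reduce(`+`, lapply(m2, unlist)),
            rep(seq_along(m2[[1]]), lengths(m2[[1]])))
all.equal(m3, unname(m4), check.attributes = FALSE)  # should be TRUE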
Since your data is essentially rectangular, had you stored it in this shape:
library(data.table)
d = rbindlist(lapply(m2, function(x) transpose(as.data.table(x))),
              idcol = TRUE)[, id.in := 1:.N, by = .id]
# .id V1 V2 ... V55 id.in
#1: 1 0.4605065 0.09744975 ... 0.8620728 1
#2: 1 0.6666742 0.10435471 ... 0.3991940 2
#3: 2 0.4605065 0.09744975 ... 0.8620728 1
#4: 2 0.6666742 0.10435471 ... 0.3991940 2
#5: 3 0.4605065 0.09744975 ... 0.8620728 1
#6: 3 0.6666742 0.10435471 ... 0.3991940 2
You could do the aggregation even faster by doing:
d[, lapply(.SD, sum), by = id.in]
But if the list is your starting point, the conversion would take up the majority of the time.

Using apply over two lists of different lengths

This question is related to my earlier question found here: https://stackoverflow.com/questions/33089532/r-accounting-for-a-factor-with-this-logistic-regression-function-replace-lappl
I realize that I didn't do a good job of asking the first question, so here is a simpler analog with actual data:
My data looks something like this:
# data look like this, but with a variable number of "y" columns
wk <- rep(1:50, 2)
X <- rnorm(1:100, 1)
y1 <- rnorm(1:100, 1)
y2 <- rnorm(1:100, 1)
df1 <- as.data.frame(cbind(wk, X, y1, y2))
df1$hyst <- ifelse(df1$wk >= 5 & df1$wk < 32, "R", "F")
Y <- df1[, -which(colnames(df1) %in% c("wk"))] # this step makes more sense with my actual data since I have a bunch of columns to remove
l1 <- length(Y) - 1
lst1 <- lapply(2:l1, function(x) colnames(Y[x]))
dflst <- c("Y", 'Y[Y$hyst=="R",]', 'Y[Y$hyst=="F",]')
I want to run a model over all Y columns for the full data set (all data) and for two subsets, when the factor hyst=="R" and when hyst=="F".
To do this, I have nested two lapply functions, which sort of works, but I think it essentially doubles my results and is causing me all sorts of list headaches.
Here is the nested lapply code:
lms <- lapply(dflst, function(z) {
  lapply(lst1, function(y) {
    form <- paste0(y, " ~ X")
    lm(form, data = eval(parse(text = z)))
  })
})
How can I replace or modify the nested lapply function to obtain a model run for each Y column for each data set (all, "R", and "F")?
Construct your DF list like this:
DFlst <- c(list(full=Y), split(Y, Y$hyst))
str(DFlst)
List of 3
$ full:'data.frame': 100 obs. of 4 variables:
..$ X : num [1:100] 1.792 3.192 0.367 1.632 1.388 ...
..$ y1 : num [1:100] 3.354 1.189 1.99 0.639 0.1 ...
..$ y2 : num [1:100] 0.864 2.415 0.437 1.069 1.368 ...
..$ hyst: chr [1:100] "F" "F" "F" "F" ...
$ F :'data.frame': 46 obs. of 4 variables:
..$ X : num [1:46] 1.792 3.192 0.367 1.632 0.707 ...
..$ y1 : num [1:46] 3.354 1.189 1.99 0.639 0.894 ...
..$ y2 : num [1:46] 0.864 2.415 0.437 1.069 1.213 ...
..$ hyst: chr [1:46] "F" "F" "F" "F" ...
$ R :'data.frame': 54 obs. of 4 variables:
..$ X : num [1:54] 1.388 2.296 0.409 1.494 0.943 ...
..$ y1 : num [1:54] 0.1002 0.6425 -0.0918 1.199 0.8767 ...
..$ y2 : num [1:54] 1.368 1.122 0.402 -0.237 1.518 ...
..$ hyst: chr [1:54] "R" "R" "R" "R" ...
Do some regressions:
res <- lapply(DFlst, function(DF) {
  cols = grep("^y[0-9]+$", names(DF), value = TRUE)
  lapply(setNames(cols, cols),
         function(y) lm(paste(y, "~X"), data = DF))
})
str(res, list.len=2, give.attr=FALSE)
List of 3
$ full:List of 2
..$ y1:List of 12
.. ..$ coefficients : Named num [1:2] 0.903 0.111
.. ..$ residuals : Named num [1:100] 2.2509 -0.0698 1.046 -0.4464 -0.9578 ...
.. .. [list output truncated]
..$ y2:List of 12
.. ..$ coefficients : Named num [1:2] 1.423 -0.166
.. ..$ residuals : Named num [1:100] -0.2623 1.5213 -0.9253 -0.0837 0.1751 ...
.. .. [list output truncated]
$ F :List of 2
..$ y1:List of 12
.. ..$ coefficients : Named num [1:2] 0.9289 0.0769
.. ..$ residuals : Named num [1:46] 2.2871 0.0146 1.0332 -0.4157 -0.0889 ...
.. .. [list output truncated]
..$ y2:List of 12
.. ..$ coefficients : Named num [1:2] 1.4177 -0.0789
.. ..$ residuals : Named num [1:46] -0.413 1.25 -0.952 -0.22 -0.149 ...
.. .. [list output truncated]
[list output truncated]
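From there you can index the nested list however you like; for example (purely an illustration, using the usual lm accessors):
coef(res$R$y1)  # coefficients for y1 in the "R" subset
sapply(res, function(fits) sapply(fits, function(m) coef(m)[["X"]]))  # slope of X for every combination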

Adding principal components as variables to a data frame

I am working with a dataset of 10000 data points and 100 variables in R. Unfortunately, the variables I have do not describe the data in a good way. I carried out a PCA using prcomp() and the first 3 PCs seem to account for most of the variability of the data. As far as I understand, a principal component is a combination of the original variables; it therefore has a value for each data point and can be treated as a new variable. Would I be able to add these principal components as 3 new variables to my data? I would need them for further analysis.
A reproducible dataset:
set.seed(144)
x <- data.frame(matrix(rnorm(2^10*12), ncol=12))
y <- prcomp(formula = ~., data=x, center = TRUE, scale = TRUE, na.action = na.omit)
PC scores are stored in the element x of the prcomp() result:
str(y)
List of 6
$ sdev : num [1:12] 1.08 1.06 1.05 1.04 1.03 ...
$ rotation: num [1:12, 1:12] -0.0175 -0.1312 0.3284 -0.4134 0.2341 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:12] "X1" "X2" "X3" "X4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:12] 0.02741 -0.01692 -0.03228 -0.03303 0.00122 ...
..- attr(*, "names")= chr [1:12] "X1" "X2" "X3" "X4" ...
$ scale : Named num [1:12] 0.998 1.057 1.019 1.007 0.993 ...
..- attr(*, "names")= chr [1:12] "X1" "X2" "X3" "X4" ...
$ x : num [1:1024, 1:12] 1.023 -1.213 0.167 -0.118 -0.186 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:1024] "1" "2" "3" "4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ call : language prcomp(formula = ~., data = x, na.action = na.omit, center = TRUE, scale = TRUE)
- attr(*, "class")= chr "prcomp"
You can get them with y$x and then choose the columns you need.
x.new <- cbind(x, y$x[, 1:3])
str(x.new)
'data.frame': 1024 obs. of 15 variables:
$ X1 : num 1.14 2.38 0.684 1.785 0.313 ...
$ X2 : num -0.689 0.446 -0.72 -3.511 0.36 ...
$ X3 : num 0.722 0.816 0.295 -0.48 0.566 ...
$ X4 : num 1.629 0.738 0.85 1.057 0.116 ...
$ X5 : num -0.737 -0.827 0.65 -0.496 -1.045 ...
$ X6 : num 0.347 0.056 -0.606 1.077 0.257 ...
$ X7 : num -0.773 1.042 2.149 -0.599 0.516 ...
$ X8 : num 2.05511 0.4772 0.18614 0.02585 0.00619 ...
$ X9 : num -0.0462 1.3784 -0.2489 0.1625 0.6137 ...
$ X10: num -0.709 0.755 0.463 -0.594 -1.228 ...
$ X11: num -1.233 -0.376 -2.646 1.094 0.207 ...
$ X12: num -0.44 -2.049 0.315 0.157 2.245 ...
$ PC1: num 1.023 -1.213 0.167 -0.118 -0.186 ...
$ PC2: num 1.2408 0.6077 1.1885 3.0789 0.0797 ...
$ PC3: num -0.776 -1.41 0.977 -1.343 0.987 ...
Didzis Elferts's answer above only works if your data, x, has no NAs: with na.action = na.omit, prcomp() drops the incomplete rows, so y$x has fewer rows than x and cbind() fails. Here's how you can add the components if your data does have NAs.
library(tidyverse)
components <- as.data.frame(y$x) %>% rownames_to_column("id")
x <- x %>% rownames_to_column("id") %>% left_join(components, by = "id")
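A quick way to see the difference (a hypothetical check: rebuild x, poke a couple of NAs into it, and redo the PCA):
set.seed(144)
x <- data.frame(matrix(rnorm(2^10*12), ncol = 12))
x$X1[c(3, 10)] <- NA  # introduce some missing values
y <- prcomp(formula = ~., data = x, center = TRUE, scale = TRUE, na.action = na.omit)
components <- as.data.frame(y$x) %>% rownames_to_column("id")
x.new <- x %>% rownames_to_column("id") %>% left_join(components, by = "id")
# rows 3 and 10 keep their original columns but get NA for every PC score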
