Summing N normal distributions - r

I am trying to determine the distribution of the sum of N univariate distributions.
Can you suggest a function that allows me to dynamically input any N number of distributions?
This works:
library(distr)
var1 <- Norm(mean=14, sd=1)
var2 <- Norm(mean=10, sd=1)
var3 <- Norm(mean=9, sd=1)
conv <- convpow(var1+var2+var3,1)
This (obviously) doesn't work since pasting the list together creates a messy character string, however this is the framework for my ideal function:
convolution_multi <- function(mean_list = c(14,10,9,10,50)){
distribution_list <- lapply(X = mean_list, Norm, sd=1)
conv_out <- convpow(paste(distribution_list,collapse="+"),1)
return(conv_out)
}
Thanks for your help!

You can use Reduce to add the random variables to one another, one at a time. After that you can call convpow:
new_var <- Reduce("+", distribution_list)
convpow(new_var, 1)
That said, the call to convpow does nothing here: convpow(x, 1) is the 1-fold convolution power of x, i.e. x itself.
> identical(convpow(new_var, 1), new_var)
[1] TRUE
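Putting the pieces together, the function from the question could be sketched as follows. This is a minimal sketch that assumes the distr package is installed and that the variables are independent; the sd argument is an addition that is not in the original function.

```r
library(distr)

# Sum of independent normal random variables with the given means.
# `sd` is shared by all summands (an assumption carried over from the question).
convolution_multi <- function(mean_list = c(14, 10, 9, 10, 50), sd = 1) {
  distribution_list <- lapply(mean_list, function(m) Norm(mean = m, sd = sd))
  # Reduce applies "+" pairwise: ((d1 + d2) + d3) + ...
  Reduce(`+`, distribution_list)
}

conv <- convolution_multi()
conv
```

For normal summands the result should again be a normal distribution whose mean is the sum of the means and whose variance is the sum of the variances.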

Related

When do I have to set a seed?

I think this is a very basic question.
I am doing simulations, so I write functions to recreate, for example, a random walk, which mathematically takes this form:
$$y_t = y_{t-1} + e_t$$
So to simulate it I wrote this function:
ar_1 <- function(iter, y0, sigma_e){
e <- rnorm(iter, sd = sigma_e)
y <- numeric(iter)
y[1] <- y0
for(t in 2:iter){
y[t] = y[t-1]+e[t]
}
result <- data.frame(iteration = seq(1,iter), y = y)
print(plot(result$iteration, result$y, type="l"))
return(result)
}
try1 <- ar_1(iter = 100, y0 = 2, sigma_e = 0.0003)
So the thing is the e vector takes random numbers.
I want to replicate the same graph and values wherever, so I know I gotta use a seed.
So my question is: does the seed go inside the function or at the very start of the script?
Furthermore, I would want to know why.
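If you call set.seed once at the top of the script, it fixes the state of the random number generator at that point. Every later call to rnorm (or any other function that draws random numbers) advances that state, so a second call does not start from the initial seed again. The script as a whole is still reproducible when run from top to bottom, but two calls to your function will give two different walks.
So really the answer is: do you intend to call the function more than once and need the same result on every call? If so, set the seed inside the function (or immediately before each call).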
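The difference can be seen in a small base R experiment. This is only an illustrative sketch; the helper name draw and its seed argument are made up for the example.

```r
# Hypothetical helper: reseeds only when a seed is supplied.
draw <- function(n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  rnorm(n)
}

set.seed(42)
a  <- draw(3)
b  <- draw(3)   # continues the random stream, so it differs from a

set.seed(42)
a2 <- draw(3)   # same seed and same position in the stream as a

c1 <- draw(3, seed = 42)
c2 <- draw(3, seed = 42)   # seeded inside the function on every call

identical(a, a2)   # seed at the top: reproducible across runs
identical(a, b)    # but repeated calls within one run differ
identical(c1, c2)  # seed inside: every call returns the same draws
```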
Note that you do not need a for loop in your function. Because R is vectorized, loops can usually be avoided. Random walk values can be calculated using the base R cumsum function. For example:
set.seed(7)
y1 <- pi
rand_vals <- rnorm(10, 0, 5)
path <- c(y1, rand_vals)
walk <- cumsum(path)
rand_vals
[1] 11.4362358 -5.9838584 -3.4714626 -2.0614648 -4.8533667 -4.7363997 3.7406967 -0.5847761 0.7632881 10.9498905
path
[1] 3.1415927 11.4362358 -5.9838584 -3.4714626 -2.0614648 -4.8533667 -4.7363997 3.7406967 -0.5847761 0.7632881
[11] 10.9498905
walk
[1] 3.141593 14.577828 8.593970 5.122507 3.061043 -1.792324 -6.528724 -2.788027 -3.372803 -2.609515 8.340376
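Applied to the function from the question, the same idea might look like the sketch below. The name random_walk and the optional seed argument are assumptions for illustration; the plotting call is left out.

```r
random_walk <- function(iter, y0, sigma_e, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # optional: same walk on every call
  e <- rnorm(iter, sd = sigma_e)
  # y[1] = y0 and y[t] = y[t-1] + e[t] for t >= 2, so e[1] is drawn but
  # unused, exactly as in the loop version from the question.
  y <- y0 + c(0, cumsum(e[-1]))
  data.frame(iteration = seq_len(iter), y = y)
}

walk1 <- random_walk(iter = 100, y0 = 2, sigma_e = 0.0003, seed = 1)
walk2 <- random_walk(iter = 100, y0 = 2, sigma_e = 0.0003, seed = 1)
identical(walk1, walk2)
```

Using seq_len(iter) rather than 1:iter also keeps the function safe for iter = 1.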

Looking to conduct six different t-tests based on a generated dataset

lapply(1:5000, function(x) rnorm(n=20, mean=0, sd=1)) is the call I used to generate the data
t.test(x, mu=mu0, alt="two.sided", lev=0.95) is the t-test call I wrote
I need to conduct six tests: µ0 = 0 and µ0 = 2, each with the three alternatives, at alpha = 0.05
Are you looking for something like this?
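library(purrr)  # provides map()
# 20 x 50 numeric matrix: each column is one sample of size 20
x <- sapply(1:50, function(i) rnorm(n = 20, mean = 0, sd = 1))
my_mu <- c(0.1, 0.2, 0.3, 0.4, 0.6)
# One test per value of mu; t.test() pools all values of x into one vector
map(my_mu, function(a) {
  t.test(x, mu = a, alternative = "two.sided", conf.level = 0.95)
})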
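The x and y arguments of t.test must be numeric vectors. sapply simplifies its result to a numeric matrix here (20 rows, 50 columns), which t.test accepts and treats as a single pooled vector of 1,000 values; lapply returns a list, which t.test cannot take. Also note that lev in your call is not a prefix of conf.level, so t.test silently ignores it; spell out conf.level instead.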
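If the goal is to run each of the six tests on every one of the 5,000 samples separately, one possible sketch in base R is shown below. It assumes that "six tests" means the two values of µ0 crossed with the three alternatives, and that the quantity of interest is the share of samples in which the null hypothesis is rejected at alpha = 0.05; the object names are illustrative.

```r
set.seed(1)
samples <- lapply(1:5000, function(i) rnorm(n = 20, mean = 0, sd = 1))

# The six test settings: 2 values of mu0 x 3 alternatives.
grid <- expand.grid(mu0 = c(0, 2),
                    alternative = c("two.sided", "less", "greater"),
                    stringsAsFactors = FALSE)

# For each setting, test every sample and record how often p < 0.05.
grid$reject_rate <- mapply(function(mu0, alternative) {
  p <- vapply(samples, function(x) {
    t.test(x, mu = mu0, alternative = alternative, conf.level = 0.95)$p.value
  }, numeric(1))
  mean(p < 0.05)
}, grid$mu0, grid$alternative)

grid
```

Because the data are generated with a true mean of 0, the rejection rates for µ0 = 0 should sit near 0.05, which gives a quick sanity check on the setup.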

automatization of lm tests with all possible var combinations and getting values for: shapiro.test(), bptest(),vif() in R

I've spent days searching for the optimal models which would fulfill all of the standard OLS assumptions (normal distribution, homoscedasticity, no multicollinearity) in R, but with 12 variables it's impossible to find the optimal variable combination by hand. So I was trying to create a script which would automate this process.
Here the sample code for calculations:
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- as.data.frame(cbind(x1,x2,x3,x4,x5))
library(lmtest)
library(car)
model <- lm(x1~x2+x3+x4+x5, data = df)
# check for normal distribution (Shapiro-Wilk-Test)
rs_sd <- rstandard(model)
shapiro.test(rs_sd)
# check for heteroskedasticity (Breusch-Pagan-Test)
bptest(model)
# check for multicollinearity
vif(model)
#-------------------------------------------------------------------------------
# models without outliers
# identify outliers (calculating the Cook's distance; if x > 4/(n-k-1) --> outlier)
cooks <- round(cooks.distance(model), digits = 4)
df_no_out <- cbind(df, cooks)
df_no_out <- subset(df_no_out, cooks < 4/(100-4-1))
model_no_out <- lm(x1~x2+x3+x4+x5, data = df_no_out)
# check for normal distribution
rs_sd_no_out<- rstandard(model_no_out)
shapiro.test(rs_sd_no_out)
# check for heteroskedasticity
bptest(model_no_out)
# check for multicollinearity
vif(model_no_out)
What I have in mind is to loop through all of the variable combinations and get the p-values from shapiro.test() and bptest(), or the VIF values, for every model created, so I can compare the significance values or the multicollinearity respectively. In my dataset multicollinearity shouldn't be a problem, and since vif() produces one value per variable it will probably be harder to implement, so the p-values from shapiro.test() and bptest() would suffice.
I've tried to write several scripts which would automate the process, but without success (unfortunately I'm not a programmer).
I know there are already some threads dealing with this problem
How to run lm models using all possible combinations of several variables and a factor
Finding the best combination of variables for high R-squared values
but I haven't found a script which would also calculate JUST the p-values.
The tests for models without outliers are especially important, because after removing the outliers the OLS assumptions are fulfilled in many cases.
I would really appreciate any suggestions or help with this.
You are scratching the surface of what is now referred to as statistical learning. The introductory text is "An Introduction to Statistical Learning with Applications in R" and the graduate-level text is "The Elements of Statistical Learning".
To do what you need, use the regsubsets() function from the "leaps" package. However, if you read at least chapter 6 of the introductory book, you will learn about cross-validation and bootstrapping, which are the modern way of doing model selection.
The following automates the model fitting and the tests you ran afterwards.
There is one function that fits all possible models. Then a series of calls to the *apply functions will get the values you want.
library(lmtest)
library(car)
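fitAllModels <- function(data, resp, regr){
  # Fit one lm per column of M; each column is one combination of regressors
  f <- function(M){
    apply(M, 2, function(x){
      fmla <- paste(resp, paste(x, collapse = "+"), sep = "~")
      fmla <- as.formula(fmla)
      lm(fmla, data = data)
    })
  }
  # Keep only the regressors that exist in data
  regr <- names(data)[names(data) %in% regr]
  # All combinations of 1, 2, ..., length(regr) regressors
  regr_list <- lapply(seq_along(regr), function(n) combn(regr, n))
  models_list <- lapply(regr_list, f)
  # Flatten to a single list of lm objects
  unlist(models_list, recursive = FALSE)
}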
Now the data.
# Make up a data.frame to test the function above.
# Don't forget to set the RNG seed to make the
# results reproducible
set.seed(7646)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- data.frame(x1, x2, x3, x4, x5)
First fit all models with "x1" as response and the other variables as possible regressors. The function can be called with one response and any number of possible regressors you want.
fit_list <- fitAllModels(df, "x1", names(df)[-1])
And now the sequence of tests.
# Normality test, standardized residuals
rs_sd_list <- lapply(fit_list, rstandard)
sw_list <- lapply(rs_sd_list, shapiro.test)
sw_pvalues <- sapply(sw_list, '[[', 'p.value')
# check for heteroskedasticity (Breusch-Pagan-Test)
bp_list <- lapply(fit_list, bptest)
bp_pvalues <- sapply(bp_list, '[[', 'p.value')
# check for multicollinearity,
# only models with 2 or more regressors
vif_values <- lapply(fit_list, function(fit){
  regr <- attr(terms(fit), "term.labels")
  if(length(regr) < 2) NA else vif(fit)
})
A note on the Cook's distance. In your code, you are subsetting the original data.frame, producing a new one without outliers. This will duplicate data. I have opted for a list of indices of the df's rows only. If you prefer the duplicated data.frames, uncomment the line in the anonymous function below and comment out the last one.
# models without outliers
# identify outliers (calculating the
# Cooks distance, if x > 4/(n - k - 1) --> outlier
df_no_out_list <- lapply(fit_list, function(fit){
  cooks <- cooks.distance(fit)
  regr <- attr(terms(fit), "term.labels")
  k <- length(regr)
  inx <- cooks < 4/(nrow(df) - k - 1)
  #df[inx, ]
  which(inx)
})
# This tells how many rows each data.frame has once the outliers are removed
sapply(df_no_out_list, NROW)
# A data.frame without outliers. This one is the one
# for model number 8.
# The two code lines could become a one-liner.
i <- df_no_out_list[[8]]
df[i, ]
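The question also asked for the p-values of the models refitted without outliers. One way this could be done is sketched below; it rebuilds the toy data and the formulas so that it runs on its own, and it assumes the lmtest package is installed. The helper refit_no_outliers and the layout of the results table are illustrative choices, not part of the code above.

```r
library(lmtest)  # for bptest()

# Toy data as in the answer above.
set.seed(7646)
df <- data.frame(x1 = runif(100, 0, 10), x2 = runif(100, 0, 10),
                 x3 = runif(100, 0, 10), x4 = runif(100, 0, 10),
                 x5 = runif(100, 0, 10))

# All non-empty subsets of the regressors, written as formula strings.
regr <- c("x2", "x3", "x4", "x5")
subsets <- unlist(lapply(seq_along(regr),
                         function(n) combn(regr, n, simplify = FALSE)),
                  recursive = FALSE)
formulas <- vapply(subsets,
                   function(v) paste("x1 ~", paste(v, collapse = " + ")),
                   character(1))

# Hypothetical helper: fit on all rows, drop rows whose Cook's distance
# is >= 4/(n - k - 1), then refit on the remaining rows.
refit_no_outliers <- function(fmla, data) {
  fit <- lm(as.formula(fmla), data = data)
  k <- length(attr(terms(fit), "term.labels"))
  keep <- cooks.distance(fit) < 4 / (nrow(data) - k - 1)
  lm(as.formula(fmla), data = data[keep, , drop = FALSE])
}
fits_no_out <- lapply(formulas, refit_no_outliers, data = df)

# One row per model: rows used and the two p-values of interest.
results <- data.frame(
  model     = formulas,
  n_used    = vapply(fits_no_out, nobs, numeric(1)),
  shapiro_p = vapply(fits_no_out,
                     function(f) shapiro.test(rstandard(f))$p.value,
                     numeric(1)),
  bp_p      = vapply(fits_no_out,
                     function(f) unname(bptest(f)$p.value),
                     numeric(1))
)
results
```

Collecting everything in one data frame makes it easy to sort or filter the models, for example by keeping only those where both p-values exceed 0.05.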

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5 with a mean of 0. I would like the data to be much more spread out, so I could simply transform it by adding 50 and have a range quite similar to that of my first variable.
Does anyone know a way to generate this kind of more or less meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert the values in A to have the new desired mean of 50,000 and an inverse relationship (i.e. subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This only results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. If you use a different seed or generate more numbers, you might need to filter out values outside that range.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
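The noise level does not have to be found by search. If independent noise with standard deviation s is added to a signal that is perfectly correlated with A, the expected squared correlation is var(signal) / (var(signal) + s^2), so s = sd(signal) * sqrt(1/r^2 - 1). The sketch below uses that relation; the names signal and noise_sd are illustrative, and the resulting sample correlation will only be approximately equal to the target.

```r
set.seed(1)
rho <- -0.78
A <- rnorm(10000, 50, 10)

# Perfectly negatively correlated with A (r = -1) before noise is added.
signal <- 1e5 - A * 1e3

# Noise standard deviation that gives |cor| = |rho| in expectation.
# The negative sign of the correlation comes from the slope of `signal`.
noise_sd <- sd(signal) * sqrt(1 / rho^2 - 1)
B <- signal + rnorm(length(A), 0, noise_sd)

noise_sd
cor(A, B)
```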
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by #gregor-de-cillia.
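A linear transformation with a positive scale factor leaves the correlation unchanged, which is why this works. The sketch below illustrates it with a simplified construction of var2 (regression residuals instead of the QR projection from the question) followed by a min-max rescaling; the helper rescale_to_range and the seed are assumptions made for the example.

```r
set.seed(42)
var1 <- rnorm(100, 50, 10)
rho  <- -0.78

# Part of some random noise that is uncorrelated with var1 in this sample.
e <- residuals(lm(rnorm(100) ~ var1))

# Combine var1 and e so that the sample correlation with var1 equals rho.
var2 <- rho * sd(e) * var1 + sqrt(1 - rho^2) * sd(var1) * e

# Hypothetical helper: linear map of x onto the interval [lo, hi].
rescale_to_range <- function(x, lo, hi) {
  lo + (x - min(x)) / (max(x) - min(x)) * (hi - lo)
}
var2_scaled <- rescale_to_range(var2, 0, 1e5)

cor(var1, var2)         # correlation before rescaling
cor(var1, var2_scaled)  # unchanged by the linear map
range(var2_scaled)
```

Min-max rescaling pins the smallest and largest values exactly to the ends of the range; multiplying by a constant and adding an offset, as described above, gives more control over the mean and spread instead.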

Generating multiple datasets and applying function and output multiple dataset

Here is my problem, just hard for me...
I want to generate multiple datasets, then apply a function to these datasets and output corresponding output in single or multiple dataset (whatever possible)...
My example, although I need to generate a large number of variables and datasets
seed <- round(runif(10)*1000000)
datagen <- function(x){
set.seed(x)
var <- rep(1:3, c(rep(3, 3)))
yvar <- rnorm(length(var), 50, 10)
matrix <- matrix(sample(1:10, c(10*length(var)), replace = TRUE), ncol = 10)
mydata <- data.frame(var, yvar, matrix)
}
gdt <- lapply (seed, datagen)
# resulting list (I believe is correct term) has 10 dataframes:
# gdt[1] .......to gdt[10]
# my function, this will perform anova in every component data frames and
#output probability coefficients...
anovp <- function(x){
ind <- 3:ncol(x)
out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
pval <- out$coefficients[,4][2]
pval <- do.call(rbind,pval)
}
plist <- lapply (gdt, anovp)
Error in gdt[x] : invalid subscript type 'list'
This is not working, I tried different options. But could not figure out...finally decided to bother experts, sorry for that...
My questions are:
(1) Is it possible to handle this situation in this way, or are there better alternatives for handling multiple generated datasets?
(2) If this is the right way, how can I do it?
Thank you for attention and I will appreciate your help...
You have the basic idea right, in that you should create a list of data frames and then use lapply to apply the function to each element of the list. Unfortunately, there are several oddities in your code.
There is no point in randomly generating a seed, then setting it. You only need to use set.seed in order to make random numbers reproducible. Cut the lines
seed <- round(runif(10)*1000000)
and maybe
set.seed(x)
rep(1:3, c(rep(3, 3))) is the same as rep(1:3, each = 3).
Don't call your variables var or matrix, since they will mask the names of those functions, which is confusing.
3:ncol(x) is dangerous. If x has fewer than 3 columns, the sequence counts down (for example, 3:2 is c(3, 2)) instead of being empty, so it doesn't do what you think it does.
... and now, the problem you actually wanted solving.
The problem is in the line out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]]).
lapply passes data frames into anovp, not indices, so x is a data frame in gdt[x], which throws an error.
One more thing. While you are rewriting that line, note that lm takes a data argument, so you don't need to do things like gdt$some_column; you can just reference some_column directly.
EDIT: Further advice.
You appear to always use the formula yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10. Since it's the same each time, create it before your call to lapply.
independent_vars <- paste(colnames(gdt[[1]])[-1:-2], collapse = " + ")
model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))
I probably wouldn't bother with the anovp function. Just do
models <- lapply(gdt, function(data) lm(model_formula, data))
Then include a further call to lapply to play with the coefficients if necessary. The next line replicates your anovp code, but won't work because model$coefficients is a vector (so the dimensions aren't right). Adjust it to retrieve the bit you actually want.
coeffs <- lapply(models, function(model) do.call(rbind, model$coefficients[,4][2]))
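If the bit you want is a p-value, it lives in the coefficient table of summary(model) rather than in model$coefficients. The sketch below shows one way to pull it out. Note the assumptions: it uses 30 rows per data frame instead of 9, because with 9 rows and 10 predictors there are no residual degrees of freedom and no p-values can be computed; the names group, preds and n_per_group are illustrative.

```r
# Generate one data frame; 3 groups of n_per_group rows each.
datagen <- function(i, n_per_group = 10) {
  group <- rep(1:3, each = n_per_group)
  yvar  <- rnorm(length(group), 50, 10)
  preds <- matrix(sample(1:10, 10 * length(group), replace = TRUE), ncol = 10)
  data.frame(group, yvar, preds)   # matrix columns are named X1, ..., X10
}

set.seed(123)                      # one seed up front is enough
gdt <- lapply(1:10, datagen)

# Same formula for every data frame, built once.
independent_vars <- paste(colnames(gdt[[1]])[-(1:2)], collapse = " + ")
model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))
models <- lapply(gdt, function(data) lm(model_formula, data = data))

# p-value of the first predictor (row 2; row 1 is the intercept).
pvals <- vapply(models,
                function(m) summary(m)$coefficients[2, "Pr(>|t|)"],
                numeric(1))
pvals
```

Dropping the row index, as in summary(m)$coefficients[, "Pr(>|t|)"], would return the p-values of all coefficients of a model at once.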
