Looking to conduct six different t-tests based on a generated dataset (R)

lapply(1:5000, function(x) rnorm(n = 20, mean = 0, sd = 1)) is the call I used to generate the data.
t.test(x, mu = mu0, alt = "two.sided", lev = 0.95) is the t-test call I put together.
I need to conduct six tests: µ0 = 0 and µ0 = 2, each against the three alternative hypotheses (two-sided, less, greater), at alpha = 0.05.

Are you looking for something like this?
library(purrr)
x <- sapply(1:50, function(i) rnorm(n = 20, mean = 0, sd = 1))
my_mu <- c(0.1, 0.2, 0.3, 0.4, 0.6)
map(my_mu, function(a) {
    # note: the argument is conf.level; "lev" is not a t.test argument and is silently ignored
    t.test(x, mu = a, alternative = "two.sided", conf.level = 0.95)
})
x and y in t.test() must be numeric vectors. sapply() simplifies its result to a numeric vector (or matrix), whereas lapply() returns a list, which t.test() cannot take directly.
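If the goal is literally the six tests described above (µ0 in {0, 2} crossed with the three alternatives at alpha = 0.05), a minimal sketch on a single simulated sample could look like this:
x <- rnorm(20, mean = 0, sd = 1)
grid <- expand.grid(mu0 = c(0, 2),
                    alt = c("two.sided", "less", "greater"),
                    stringsAsFactors = FALSE)
# one t-test per (mu0, alternative) pair; six in total
tests <- Map(function(m, a) t.test(x, mu = m, alternative = a, conf.level = 0.95),
             grid$mu0, grid$alt)
sapply(tests, function(tt) tt$p.value)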


Getting AUCs for several predictors and outcomes in a dataframe

I want to compute many AUCs at once using the pROC package. Here is a simple data frame with two predictors and a binary outcome, and my attempt to use sapply() along with auc() and roc() from the pROC library. What am I doing wrong?
library(pROC)
df <- data.frame(z = rnorm(100,0,1), x=rnorm(100,0,1), y = as.factor(sample(0:1, 100, replace=TRUE)))
#One AUC at a time
auc(roc(df$y, df$x))
auc(roc(df$y, df$z))
#Trying to get multiple
predictors <- c("z","x")
results <- lapply(df, function(x){auc(roc(y, predictors))})
This solution works using a for loop; is that the most elegant method, or can sapply/lapply be used instead?
Calculating multiple ROC curves in R using a for loop and pROC package. What variable to use in the predictor field?
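For reference, the working for-loop version being referred to might look like this (a sketch):
predictors <- c("z", "x")
results <- vector("list", length(predictors))
for (i in seq_along(predictors)) {
    results[[i]] <- auc(roc(df$y, df[[predictors[i]]]))
}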
You can use lapply in the following way:
predictors <- c("z","x")
results <- lapply(predictors, function(x) auc(roc(df$y, df[[x]])))
results
#[[1]]
#Area under the curve: 0.6214
#[[2]]
#Area under the curve: 0.6238
sapply would return a numeric vector.
sapply(predictors, function(x) auc(roc(df$y, df[[x]])))
# z x
#0.6213942 0.6237981
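If there are also several outcome columns, the same pattern nests naturally. A sketch, where outcomes is a hypothetical vector of binary outcome column names:
outcomes <- c("y")  # hypothetical: names of the binary outcome columns
sapply(outcomes, function(o)
    sapply(predictors, function(p) auc(roc(df[[o]], df[[p]]))))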

Summing N normal distributions

I am trying to determine the distribution of the sum of N univariate distributions.
Can you suggest a function that allows me to dynamically input any N number of distributions?
This works:
library(distr)
var1 <- Norm(mean=14, sd=1)
var2 <- Norm(mean=10, sd=1)
var3 <- Norm(mean=9, sd=1)
conv <- convpow(var1+var2+var3,1)
This (obviously) doesn't work, since pasting the list together creates a messy character string; however, this is the framework for my ideal function:
convolution_multi <- function(mean_list = c(14, 10, 9, 10, 50)) {
    distribution_list <- lapply(X = mean_list, Norm, sd = 1)
    conv_out <- convpow(paste(distribution_list, collapse = "+"), 1)
    return(conv_out)
}
Thanks for your help!
You can use Reduce to repeatedly add each RV to the next. After that you can use convpow:
new_var <- Reduce("+", distribution_list)
convpow(new_var, 1)
With that being said, the call to convpow does absolutely nothing here:
> identical(convpow(new_var, 1), new_var)
[1] TRUE
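Putting this together, a corrected version of the question's convolution_multi() could be as simple as the following sketch (Reduce already performs the convolutions, so convpow is dropped):
library(distr)
convolution_multi <- function(mean_list = c(14, 10, 9, 10, 50)) {
    distribution_list <- lapply(mean_list, Norm, sd = 1)
    # "+" on distr objects already convolves the (independent) RVs
    Reduce("+", distribution_list)
}
convolution_multi(c(14, 10, 9))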

Perform lm() for each row of a data.frame, where predictor lengths are different

I am trying to create a loop that performs lm() on each row of a data frame (the responses) against a predictor vector that sits outside the data frame. The values and length of the predictor vary depending on which category (df$Group) the row belongs to.
I would appreciate help creating a loop that performs lm for each row and also saves the coefficients to a single vector/data frame. Also, what changes should I make to the code to run the loop over several data frames with the same structure and save the coefficients to a single data frame?
Below is what I tried; it does not save the coefficients.
dfList <- list(df, df1, df2)
df <- data.frame(ID = c(1:10), Group = c("A","A","A","A","B","B","B","B","B","B"),
                 T1 = rnorm(10, mean = 1, sd = 1), T2 = rnorm(10, mean = 2, sd = 1),
                 T3 = seq(40, 58, by = 2), T4 = seq(10, 28, by = 2))
A <- df$ID
B <- df$Group
C <- numeric(length = length(A))
x1 <- c(1:4)
x2 <- c(2:4)
for (i in length(A)) {  # note: this iterates over i = 10 only; 1:length(A) was probably intended
    if (B[i] == "A") {C[i] <- apply(df[, c(3:6)], 1, function(y) lm(y ~ x1)$coefficients[2])[i]}
    if (B[i] == "B") {C[i] <- apply(df[, c(4:6)], 1, function(y) lm(y ~ x2)$coefficients[2])[i]}
}
Appreciate any help!
Thank you!
Are you trying to do something like this?
list_df <- split(df, df$Group)
A_coeff <- apply(list_df[[1]][, 3:6], 1, function(y) lm(y~x1)$coefficients[2])
B_coeff <- apply(list_df[[2]][, 4:6], 1, function(y) lm(y~x2)$coefficients[2])
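To also collect the coefficients into a single vector and repeat the procedure over several data frames, a sketch along these lines might work (df1 and df2 are assumed to share df's structure, and x1/x2 are the predictor vectors defined above):
slopes_one_df <- function(d) {
    list_d <- split(d, d$Group)
    c(apply(list_d[["A"]][, 3:6], 1, function(y) lm(y ~ x1)$coefficients[2]),
      apply(list_d[["B"]][, 4:6], 1, function(y) lm(y ~ x2)$coefficients[2]))
}
# one coefficient vector per data frame:
# all_coeffs <- lapply(list(df, df1, df2), slopes_one_df)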

Find slope by linear regression of 2 matrices (R)

I have 2 matrices. One contains the quantities of each product a client bought (quantitymatrix); the other contains the unit prices at which the client bought the products (pricematrix).
How can I run a linear regression with the matrices so that I obtain the slope for each product?
Your data:
quantity <- matrix(c(4,2,6, 9,4,3, 1,1,2, 3,1,5), 3, 4)
price <- matrix(c(1,0.5,8, 4.2,1.2,2, 2,5,2, 1,2.5,1), 3, 4)
First, you have to transform your two matrices into a single data frame (you can avoid this if you want, but I think it makes things much more straightforward):
df <- data.frame(quantity = as.numeric(quantity),
                 price = as.numeric(price),
                 product = rep(1:4, each = 3),
                 ID = 1:3)
Then, run the linear models by groups:
lms <- by(df, df$product, FUN = function(x) lm(price~quantity, data = x))
And get the slopes:
slopes <- sapply(lms, coef)[2,]
If, however, you want to keep the original matrices as they are, you can run a simple loop:
slopes <- numeric(dim(price)[2])
for (i in 1:dim(price)[2]) {
    model <- lm(price[, i] ~ quantity[, i])
    slopes[i] <- coef(model)[2]
}
NB: this solution assumes that the two matrices have identical dimensions.
And if you want to avoid loops, the following solution may be faster:
f <- function(x,y) coef(lm(y~x))[2]
l <- function(m) lapply(seq_len(ncol(m)), function(i) m[,i])
mapply(f, l(quantity), l(price))
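As a quick sanity check, the loop-free version should agree with the slopes computed above:
all.equal(unname(mapply(f, l(quantity), l(price))), unname(slopes))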

Automating lm tests with all possible variable combinations and getting values for shapiro.test(), bptest(), vif() in R

I've spent days searching for optimal models that would fulfill all of the standard OLS assumptions (normal distribution, homoscedasticity, no multicollinearity) in R, but with 12 variables it's impossible to find the optimal variable combination by hand. So I was trying to create a script that would automate this process.
Here is the sample code for the calculations:
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- as.data.frame(cbind(x1,x2,x3,x4,x5))
library(lmtest)
library(car)
model <- lm(x1~x2+x3+x4+x5, data = df)
# check for normal distribution (Shapiro-Wilk-Test)
rs_sd <- rstandard(model)
shapiro.test(rs_sd)
# check for heteroskedasticity (Breusch-Pagan-Test)
bptest(model)
# check for multicollinearity
vif(model)
#-------------------------------------------------------------------------------
# models without outliers
# identify outliers (Cook's distance: if d > 4/(n-k-1) --> outlier)
cooks <- round(cooks.distance(model), digits = 4)
df_no_out <- cbind(df, cooks)
df_no_out <- subset(df_no_out, cooks < 4/(100-4-1))
model_no_out <- lm(x1~x2+x3+x4+x5, data = df_no_out)
# check for normal distribution
rs_sd_no_out<- rstandard(model_no_out)
shapiro.test(rs_sd_no_out)
# check for heteroskedasticity
bptest(model_no_out)
# check for multicollinearity
vif(model_no_out)
What I have in mind is to loop through all of the variable combinations and get the p-values from shapiro.test() and bptest(), or the VIF values, for all models created, so I can compare the significance values and the multicollinearity, respectively. (In my dataset multicollinearity shouldn't be a problem, and since the VIF check produces one value per variable, which will probably be more challenging to implement in the code, the p-values for shapiro.test() + bptest() would suffice.)
I've tried to write several scripts to automate the process, but without success (unfortunately, I'm not a programmer).
I know there are already some threads dealing with this problem:
How to run lm models using all possible combinations of several variables and a factor
Finding the best combination of variables for high R-squared values
but I haven't found a script that would also calculate JUST the p-values.
The tests for the models without outliers are especially important, because after removing the outliers the OLS assumptions are fulfilled in many cases.
I would really appreciate any suggestions or help with this.
You are scratching the surface of what is now referred to as statistical learning. The intro text is "An Introduction to Statistical Learning: with Applications in R" and the grad-level text is "The Elements of Statistical Learning".
To do what you need, use the regsubsets() function from the "leaps" package. However, if you read at least chapter 6 of the intro book, you will learn about cross-validation and bootstrapping, which are the modern way of doing model selection.
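A minimal regsubsets() sketch on the question's simulated data (assuming the leaps package is installed) might look like:
library(leaps)
subsets <- regsubsets(x1 ~ ., data = df, nvmax = 4)
summary(subsets)  # best subset of each size by RSS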
The following automates the model fitting and the tests you ran afterwards. There is one function that fits all possible models; then a series of calls to the *apply functions gets the values you want.
library(lmtest)
library(car)
fitAllModels <- function(data, resp, regr){
    f <- function(M){
        apply(M, 2, function(x){
            fmla <- paste(resp, paste(x, collapse = "+"), sep = "~")
            fmla <- as.formula(fmla)
            lm(fmla, data = data)
        })
    }
    regr <- names(data)[names(data) %in% regr]
    regr_list <- lapply(seq_along(regr), function(n) combn(regr, n))
    models_list <- lapply(regr_list, f)
    unlist(models_list, recursive = FALSE)
}
Now the data.
# Make up a data.frame to test the function above.
# Don't forget to set the RNG seed to make the
# results reproducible
set.seed(7646)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- data.frame(x1, x2, x3, x4, x5)
First, fit all models with "x1" as the response and the other variables as possible regressors. The function can be called with one response and any number of possible regressors.
fit_list <- fitAllModels(df, "x1", names(df)[-1])
And now the sequence of tests.
# Normality test, standardized residuals
rs_sd_list <- lapply(fit_list, rstandard)
sw_list <- lapply(rs_sd_list, shapiro.test)
sw_pvalues <- sapply(sw_list, '[[', 'p.value')
# check for heteroskedasticity (Breusch-Pagan-Test)
bp_list <- lapply(fit_list, bptest)
bp_pvalues <- sapply(bp_list, '[[', 'p.value')
# check for multicollinearity,
# only models with 2 or more regressors
vif_values <- lapply(fit_list, function(fit){
    regr <- attr(terms(fit), "term.labels")
    if (length(regr) < 2) NA else vif(fit)
})
A note on the Cook's distance. In your code you subset the original data.frame, producing a new one without outliers; this duplicates data. I have opted for a list of indices of the df's rows only. If you prefer the duplicated data.frames, uncomment the line in the anonymous function below and comment out the last one.
# models without outliers
# identify outliers (Cook's distance:
# if d > 4/(n - k - 1) --> outlier)
df_no_out_list <- lapply(fit_list, function(fit){
    cooks <- cooks.distance(fit)
    regr <- attr(terms(fit), "term.labels")
    k <- length(regr)
    inx <- cooks < 4/(nrow(df) - k - 1)
    #df[inx, ]
    which(inx)
})
# This tells how many rows each df without outliers would have
sapply(df_no_out_list, NROW)
# A data.frame without outliers. This one is the one
# for model number 8.
# The two code lines could become a one-liner.
i <- df_no_out_list[[8]]
df[i, ]
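From here, re-fitting each model on its outlier-free rows and re-running the tests follows the same pattern. A sketch, reusing the stored fits and row indices:
fit_no_out_list <- Map(function(fit, inx) lm(formula(fit), data = df[inx, ]),
                       fit_list, df_no_out_list)
sw_pvalues_no_out <- sapply(fit_no_out_list,
                            function(fit) shapiro.test(rstandard(fit))$p.value)
bp_pvalues_no_out <- sapply(fit_no_out_list,
                            function(fit) bptest(fit)$p.value)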
