Looping statistical tests in R

I would like to apply a t-test to each column of the following data, within a loop in R:
Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525
I tried this code with a loop, but it is not working:
for (i in 2:4){
t.test(table[,c(i)] ~ table$Groups, conf.level = 0.95)
}
Can anyone help me with this?
Thanks!

Your code computes three t-tests (one each for columns 2 to 4), but the results are lost because you don't do anything with them. Try the following:
info <- read.table(header=TRUE, text="Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525")
results <- list()
for (i in 2:4){
  # store each test under its column name, so the list has no empty first slot
  results[[names(info)[i]]] <- t.test(info[,i] ~ info$Groups, conf.level = 0.95)
}
print(results)
When interacting with the REPL/console, calling the t.test function computes the results and returns them, and the console prints whatever is returned at top level. Inside a for loop, or in a script that you source, t.test still returns its results but they will not be printed. This is why I put them into a list and printed the list later on.
By the way, I stored your data as info, not as table. R copes fine with variable names that are also function names, but every now and then you will have a hard time reading error messages, so avoid naming variables table, matrix, c or df.
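A minimal sketch of the same idea with the results kept under their column names, assuming that all four numeric columns should be tested (the question's loop stops at column 4). The names measure_cols, tests and p_values are illustrative only.

```r
info <- read.table(header = TRUE, text = "Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525")

# every column except the grouping variable
measure_cols <- setdiff(names(info), "Groups")

# one Welch t-test per column; lapply keeps the column names on the list
tests <- lapply(info[measure_cols], function(x) {
  t.test(x ~ info$Groups, conf.level = 0.95)
})

# pull a single component out of every stored test
p_values <- sapply(tests, function(tt) tt$p.value)
print(p_values)
```

Because the tests are stored rather than printed, any component (statistic, conf.int, estimate) can be extracted afterwards in the same way.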

Using apply functions you could also do:
res<- cbind(
do.call(rbind,apply(info[,-1],2,function(cv)t.test(cv ~ info$Groups, conf.level = 0.95)
[c("statistic","parameter","p.value")]))
,
t(apply(info[,-1],2,function(cv)unlist(t.test(cv ~ info$Groups, conf.level = 0.95)
[c("conf.int","estimate")])))
)
res
> res
       statistic parameter p.value   conf.int1  conf.int2 estimate.mean in group A estimate.mean in group B
Length 0.7327329 3.991849  0.5044236 -1.13263   1.943907  2.175542                 1.769904
Size   0.2339013 3.467515  0.8282072 -0.47739   0.5595231 0.714917                 0.6738504
Diet   0.9336103 3.823748  0.4056203 -0.7396173 1.468761  1.460625                 1.096053
place  0.9748978 3.162223  0.398155  -159.4359  306.2686  337.4955                 264.0791

Related

Is there a way to store factors selected by a (BE) Stepwise Regression run on N datasets via lapply(full_model, FUN(i) {step(i[["Coeffs"]])})?

I have already written the following code, all of which works OK:
directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
head(file_list, n = 2)
> head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
# Create another list with just the "n-n-n-n" part of the name of each dataset
DS_name_list = stringi::stri_sub(file_list, 49, 55)
head(DS_name_list, n = 3)
> head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"
# This command reads all the data in each of the N csv files via their names
# stored in the 'file_list' list of characters.
csvs <- lapply(file_list, read.csv)
### Run a Backward Elimination Stepwise Regression on each of the N csvs.
# Assign the full model (meaning the one with all 30 candidate regressors
# included as the initial model in step 1).
# This is crucial because if the initial model has less than the number of
# total candidate factors for Stepwise to select from in the datasets,
# then it could miss 1 or more of the true factors.
full_model <- lapply(csvs, function(i) {
lm(formula = Y ~ ., data = i) })
# my failed attempt at figuring it out myself
set.seed(50) # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {step(object = i[["coefficients"]],
direction = 'backward', scope = formula(full_model), trace = 0)})
When I hit run on the above 2 lines of code after setting the seed, I get
the following error message in the Console:
Error in terms(object) : object 'i' not found
To briefly elaborate on why it is absolutely essential that the initial model
include every candidate regressor when running a Backward Elimination
version of Stepwise Regression, consider the following example:
Let us say that we start out with an initial model of 25 regressors, so X1:X25 instead of
X1:X30. In that case, the Stepwise Regression run on dataset j could never
select any of the IVs/factors X26 through X30,
even if 1 or more of those really are included in the true underlying
population model that characterizes dataset j.
Instead of two lapply loops, one to fit the models and a second to run the stepwise regressions, use a for loop that does both operations one after the other. This is an environment issue: it seems that step cannot find the data when it is run inside the function passed to lapply. Note also that step expects the fitted model itself as its object argument, not i[["coefficients"]].
I have also changed the code that creates DS_name_list. Below, it processes the full file names without string-position-dependent code.
DS_name_list <- basename(file_list)
DS_name_list <- tools::file_path_sans_ext(DS_name_list)
head(DS_name_list, n = 2)
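As a quick check of what these two calls return, here is a small sketch applied to the two paths shown in the question's output. The names example_paths and ds_names are illustrative only.

```r
# the two example paths printed by head(file_list, n = 2) in the question
example_paths <- c(
  "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv",
  "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
)

# basename() drops the directory part, file_path_sans_ext() drops ".csv"
ds_names <- tools::file_path_sans_ext(basename(example_paths))
print(ds_names)
```

Unlike stri_sub(file_list, 49, 55), this does not depend on how long the directory path is.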
And here is the regressions code.
csvs <- lapply(file_list, read.csv)
names(csvs) <- DS_name_list
set.seed(50) # for reproducibility
full_model <- vector("list", length = length(csvs))
BE_fits3 <- vector("list", length = length(csvs))
for(i in seq_along(csvs)) {
full_model[[i]] <- lm(formula = Y ~ ., data = csvs[[i]])
BE_fits3[[i]] <- step(object = full_model[[i]],
scope = formula(full_model[[i]]),
direction = 'backward',
trace = 0)
}
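To then store the factors each stepwise run selected, one option is to take the coefficient names of every final model. The sketch below follows the same for loop pattern on simulated data, since the original csv files are not available. make_dataset, sim_csvs, BE_fits and selected_factors are hypothetical names, and the simulated true model (X1 and X2 only) is an assumption made for illustration.

```r
set.seed(50)

# Hypothetical stand-in for one csv: 5 candidate regressors, only X1 and X2 matter
make_dataset <- function(n = 100) {
  d <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
  names(d) <- paste0("X", 1:5)
  d$Y <- 2 * d$X1 - 3 * d$X2 + rnorm(n)
  d
}
sim_csvs <- list("0-5-1-1" = make_dataset(),
                 "0-5-1-2" = make_dataset(),
                 "0-5-1-3" = make_dataset())

BE_fits <- vector("list", length = length(sim_csvs))
names(BE_fits) <- names(sim_csvs)
for (i in seq_along(sim_csvs)) {
  full <- lm(Y ~ ., data = sim_csvs[[i]])
  BE_fits[[i]] <- step(full, direction = "backward", trace = 0)
}

# regressors kept by backward elimination, per dataset (intercept dropped)
selected_factors <- lapply(BE_fits, function(fit) names(coef(fit))[-1])
print(selected_factors)
```

The result is a list named after the datasets, each element a character vector of the retained regressors.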

Looping through function arguments (series of contrasts with multcomp::glht)

I wish to write a function that runs contrasts over a regression model and bootstraps those results to get confidence intervals, looping that function over a list of contrasts.
I have tried for loops nested within functions, lapply, map ... none seem to get me what I want (returns results for either only the first contrast in the list or the last).
For a single contrast from the list of contrasts, the code looks like this:
df <- data.frame(
H0013301_new_data = c(0,2,3,6,0,4,2,4,8,1),
drink_stat94_KEYES_2 = c("Heavy","Abstainer","Occasional","Moderate","Abstainer","Occasional","Heavy","Moderate","Moderate","Abstainer"),
drink_stat02_KEYES_2 = c("Heavy","Abstainer","Occasional","Abstainer","Abstainer","Heavy","Heavy","Moderate","Moderate","Abstainer"),
drink_stat06_KEYES_2 = c("Occasional","Abstainer","Occasional","Abstainer","Occasional","Heavy","Heavy","Moderate","Moderate","Heavy"),
FIN_weight_survPS_trimmed=
c(.5,2.4,.6,4.8,1.2,.08,.34,.56,1.6,.27)
)
#reordering factors (fct_relevel() comes from the forcats package)
df$drink_stat94_KEYES_2<-fct_relevel(df$drink_stat94_KEYES_2, "Abstainer", "Occasional", "Moderate", "Heavy")
contrasts(df$drink_stat94_KEYES_2)<-contr.treatment(4,base=1)
df$drink_stat02_KEYES_2<-fct_relevel(df$drink_stat02_KEYES_2, "Abstainer", "Occasional", "Moderate", "Heavy")
contrasts(df$drink_stat02_KEYES_2)<-contr.treatment(4,base=1)
df$drink_stat06_KEYES_2<-fct_relevel(df$drink_stat06_KEYES_2, "Abstainer", "Occasional", "Moderate", "Heavy")
contrasts(df$drink_stat06_KEYES_2)<-contr.treatment(4,base=1)
#defining contrast
c1 <- rbind("A,A,A"=c(1,0,0,0,0,0,0,0,0,0)
)
#defining function to feed to bootstrap
fc_2<-function(d,i){
TrialOutcomeModel_M<-lm(H0013301_new_data ~ drink_stat94_KEYES_2 + drink_stat02_KEYES_2 + drink_stat06_KEYES_2, weights=FIN_weight_survPS_trimmed, data = d[i,])
test <- multcomp::glht(TrialOutcomeModel_M, linfct=c1)
return(coef(test))
}
library(boot) # provides boot() and boot.ci()
boot_out<-boot(data=df, fc_2, R=500)
boot.ci(boot_out, type="perc")
But let's assume that instead of just c1, I want to run my function (and bootstrap the results) over the following list of contrasts:
c1 <- rbind("A,A,A"=c(1,0,0,0,0,0,0,0,0,0)
)
c2 <- rbind("A,A,O"=c(1,0,0,0,0,0,0,1,0,0)
)
c3 <- rbind("A,A,M"=c(1,0,0,0,0,0,0,0,1,0)
)
c_vector<-list(c1,c2,c3)
Any suggested code for how I would go about this?
(P.S. I know that the linfct argument can take a matrix of contrasts, but I'm specifically looking for a loop/lapply solution).
(In the following I'll reference the objects you create in the example code.)
The plan has 2 steps:
1) preparing a function fun_boot() that takes a contrast object (like c1) and returns a boot object based on it and the df data;
2) applying that function to the list c_vector of contrasts.
Consequently, the implementation has 2 elements:
# [!] Assume all required libraries loaded
# [!] Assume all necessary data exists
# Step 1
fun_boot <- function(contrast)
{
# Make statistic function
fun_statistic <- function(d, i)
{
TrialOutcomeModel_M <- lm(
formula = H0013301_new_data ~ drink_stat94_KEYES_2 + drink_stat02_KEYES_2 + drink_stat06_KEYES_2,
data = d[i,],
weights = FIN_weight_survPS_trimmed
)
test <- multcomp::glht(
TrialOutcomeModel_M,
linfct = contrast
)
return(coef(test))
}
# Make boot call (hehe)
return (boot(
data = df,
statistic = fun_statistic,
R = 500
))
}
# Step 2
boot_out_vector <- lapply(
X = c_vector,
FUN = fun_boot
)
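The same closure pattern can be carried one step further to collect a percentile interval per contrast. The sketch below is self-contained and makes some assumptions: it uses mtcars and a made-up contrast_list instead of the question's df and c_vector, and it computes the linear combination directly as sum(contrast * coef(fit)) in place of multcomp::glht, so that only the boot package is needed.

```r
library(boot)

set.seed(1)

# Hypothetical contrasts over (Intercept, wt, hp); not the question's c_vector
contrast_list <- list(
  intercept    = c(1, 0, 0),
  wt_slope     = c(0, 1, 0),
  at_wt3_hp100 = c(1, 3, 100)
)

fun_boot <- function(contrast) {
  # statistic function closes over 'contrast'
  fun_statistic <- function(d, i) {
    fit <- lm(mpg ~ wt + hp, data = d[i, ])
    sum(contrast * coef(fit))  # linear combination of the coefficients
  }
  boot(data = mtcars, statistic = fun_statistic, R = 200)
}

boot_list <- lapply(contrast_list, fun_boot)

# one percentile interval per contrast, gathered into a small matrix
ci_list  <- lapply(boot_list, boot.ci, type = "perc")
ci_table <- t(sapply(ci_list, function(ci) ci$percent[4:5]))
colnames(ci_table) <- c("lower", "upper")
print(ci_table)
```

Each call to fun_boot creates its own fun_statistic with its own contrast, which is why every element of the list is bootstrapped rather than only the first or the last.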

R order lapply output from a function with multiple outputs by variable (column) rather than by function

I have a function in R which includes multiple other functions, including a custom one. I then use lapply to run the combined function across multiple variables. However, when the output is produced it is in the order of
function1: variable a, variable b, variable c
function2: variable a, variable b, variable c
When what I would like is for it to be the other way around:
variable a: function 1, function 2...
variable b: function 1, function 2...
I have recreated an example below using the mtcars dataset, with number of cylinders as a predictor variable, and vs and am as outcome variables.
library(datasets)
library(tidyverse)
library(skimr)
library(car)
data(mtcars)
mtcars_binary <- mtcars %>%
dplyr::select(cyl, vs, am)
# logistic regression function
logistic.regression <- function(logmodel) {
dev <- logmodel$deviance
null.dev <- logmodel$null.deviance
modelN <- length(logmodel$fitted.values)
R.lemeshow <- 1 - dev / null.dev
R.coxsnell <- 1 - exp ( -(null.dev - dev) / modelN)
R.nagelkerke <- R.coxsnell / ( 1 - ( exp (-(null.dev / modelN))))
cat("Logistic Regression\n")
cat("Hosmer and Lemeshow R^2 ", round(R.lemeshow, 3), "\n")
cat("Cox and Snell R^2 ", round(R.coxsnell, 3), "\n")
cat("Nagelkerke R^2" , round(R.nagelkerke, 3), "\n")
}
# all logistic regression results
log_regression_tests1 <- function(df_vars, df_data) {
glm_summary <- glm(df_data[,df_vars] ~ df_data[,1], data = df_data, family = binomial, na.action = "na.omit")
glm_print <- print(glm_summary)
log_results <- logistic.regression(glm_summary)
blr_coefficients <- exp(glm_summary$coefficients)
blr_confint <- exp(confint(glm_summary))
list(glm_summary = glm_summary, glm_print = glm_print, log_results = log_results, blr_coefficients = blr_coefficients, blr_confint = blr_confint)
}
log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE)
log_regression_results1
When I do this, the output is being produced as:
glm_summary: vs, am
log_results: vs, am
etc. etc.
When what I would like for the output to be ordered is:
vs: all function outputs
am: all function outputs
In addition, when I run this line of code, log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE) I get only the results of the logistic regression function, but when I print the overall results log_regression_results1 I get the remaining output, could anyone explain why?
Finally, the glm_summary function is not producing all of the output which it should. When I run the functions independently on a single variable, like so
glm_vs <- glm(vs ~ cyl, data = mtcars_binary, family = binomial, na.action = "na.omit")
summary(glm_vs)
logistic.regression(glm_vs)
exp(glm_vs$coefficients)
exp(confint(glm_vs))
it also produces the standard error, z value, and p value for summary(glm_vs), which it does not do when embedded in the function, even though I have glm_print <- print(glm_summary) included. Is there a way to get the output for the full summary function within the log_regression_tests1 function?
When I run your code up to log_regression_results1, I get exactly what you ask for:
summary(log_regression_results1)
   Length Class  Mode
vs 5      -none- list
am 5      -none- list
Maybe you meant to ask the other way round?
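On the remaining two points, a hedged sketch: print() on a glm object shows only the call and the coefficients, while the standard errors, z values and p values come from summary(), so one option is to store summary(fit) in the returned list. Anything written with cat() or print() inside the function appears at the moment the function runs, whereas the returned list is only shown when it is printed later. The helper fit_one and the element names below are hypothetical, and confint() is left out to keep the example short.

```r
# Hypothetical helper: one logistic regression per outcome, cyl as predictor
fit_one <- function(outcome, data) {
  fit <- glm(reformulate("cyl", response = outcome),
             data = data, family = binomial)
  list(
    model       = fit,
    coef_table  = summary(fit)$coefficients,  # estimate, SE, z value, p value
    odds_ratios = exp(coef(fit))
  )
}

# simplify = FALSE keeps one list element per outcome variable
results <- sapply(c("vs", "am"), fit_one, data = mtcars, simplify = FALSE)

print(names(results))          # grouped by variable
print(results$vs$coef_table)   # full coefficient table for vs
```

Nothing is printed while the models are fitted; all output is produced by the explicit print() calls at the end.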

Getting more precision in pvalue in survdiff

I am running survdiff from the survival package and the p-value is 0.02. I would like to see it with more precision (i.e. 0.02xxxx). Is there an argument that I can pass to specify the number of digits of the p-value? I read the documentation for the survival package and did not find any mention of how to specify it.
survdiff(surv_object~access_sam2$Area_mTLSHL)
Credits.
The computation of the p-value for objects of class "survdiff" is not completely obvious. I had to see what is going on in the print method for objects of that class to understand the way the degrees of freedom are computed.
The code below is a simplification of the code of print.survdiff and therefore the credits go to
citation("survival")
#
#Therneau T (2015). _A Package for Survival Analysis
#in S_. version 2.38, <URL:
#https://CRAN.R-project.org/package=survival>.
#
#Terry M. Therneau, Patricia M. Grambsch (2000).
#_Modeling Survival Data: Extending the Cox Model_.
#Springer, New York. ISBN 0-387-98784-3.
#
#To see these entries in BibTeX format, use
#'print(<citation>, bibtex=TRUE)', 'toBibtex(.)', or
#set 'options(citation.bibtex.max=999)'.
The code itself can be seen in the sources or by running
getAnywhere("print.survdiff")
Now for the question's problem.
I have written a generic pvalue function to make it easier to call a method for objects of the class returned by function survdiff. The example is taken from the help page of that function.
The return value is a named list with 3 members, the names are self explanatory. One of them, chisq is a repetition of a value returned by survdiff. I have included it for the sake of completeness.
pvalue <- function(x, ...) UseMethod("pvalue")
pvalue.survdiff <- function (x, ...)
{
if (length(x$n) == 1) {
df <- 1
pval <- pchisq(x$chisq, 1, lower.tail = FALSE)
} else {
if (is.matrix(x$obs)) {
otmp <- rowSums(x$obs)
etmp <- rowSums(x$exp)
} else {
otmp <- x$obs
etmp <- x$exp
}
df <- sum(etmp > 0) - 1
pval <- pchisq(x$chisq, df, lower.tail = FALSE)
}
list(chisq = x$chisq, p.value = pval, df = df)
}
srv <- survdiff(Surv(futime, fustat) ~ rx, data = ovarian)
pvalue(srv)
#$chisq
#[1] 1.06274
#
#$p.value
#[1] 0.3025911
#
#$df
#[1] 1
I am not sure about the survival package, and you did not provide reproducible code (please do so next time). But in general, if you want to see more digits, what you need to do is
print(value, digits = n)
# n is the number of significant digits you want to see
In your case it is
print(survdiff(surv_object~access_sam2$Area_mTLSHL), digits = 6)
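Once the p-value is available as an ordinary number, it can be displayed with as many digits as needed. A small base R sketch, assuming the rounded chi-square statistic 1.06274 and df = 1 from the ovarian example output above:

```r
chisq <- 1.06274   # rounded statistic copied from the pvalue(srv) output
df    <- 1

# upper-tail chi-square probability, as in pvalue.survdiff
p <- pchisq(chisq, df, lower.tail = FALSE)

p_text <- format(p, digits = 10)                 # up to 10 significant digits
p_sci  <- formatC(p, format = "e", digits = 4)   # scientific notation
print(p_text)
print(p_sci)
```

Because the statistic was copied in rounded form, only the leading digits of the result are meaningful here.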

Bootstrap LASSO Lambda

I am trying to bootstrap the lambda1 parameter in LASSO regression (using the penalized library), NOT the coefficient estimates, as I know that it does not make sense to calculate e.g. 95% CIs for them; this question is about lambda1 ONLY.
This is where I am so far:
df <- read.table(header=T, text="group class v1 v2
1 Ala 1 3.98 23.2
2 Ala 2 5.37 18.5
3 C 1 4.73 22.1
4 B 1 4.17 22.3
5 C 2 4.47 22.4
")
Tried this:
X<-df[,c(3,4)] # data, variables in columns, cases in rows
Y<-df[,2] # dichotomous response
for (i 1:100) {
opt1<-optL1(Y,X)
opt1$lambda
}
But got Error: unexpected "}" in "}"
Tried this:
f<-function(X,Y,i){
opt1<-optL1(Y,X,[i])
}
boot(X,f,100)
But got Error in boot (X,f,100): incorrect number of subscripts on matrix... Can somebody help?
Here is what is wrong with the for loop:
1) It needs the syntax for (i in 1:100) {} in order to work;
2) It needs to save opt1$lambda in a proper object;
3) It most likely needs the values (Y,X) to change from one iteration of the loop to another.
The R code which addresses items 1) and 2) above could be written as follows:
lambda <- NULL
for (i in 1:100) {
opt1 <- optL1(Y,X) # opt1 will NOT change
# since Y and X are the SAME
# over each iteration of the for loop
lambda <- c(lambda, opt1$lambda)
}
lambda
In this code, the object lambda, which will store the value opt1$lambda produced at each iteration, is declared before the for loop with the command lambda <- NULL and is then augmented after each iteration with the command
lambda <- c(lambda, opt1$lambda).
In general, using the NULL trick is not recommended for a large number of iterations, because the vector is copied each time it is extended. A better alternative would be this:
lambda <- vector('list', 100)
for (i in 1:100) {
opt1 <- optL1(Y,X) # opt1 will NOT change
# since Y and X are the SAME
# over each iteration of the for loop
lambda[i] <- opt1$lambda
}
lambda <- unlist(lambda)
lambda
With this second alternative, we pre-allocate lambda before the for loop as a list with 100 components, such that the i-th component will store the value opt1$lambda produced during the i-th iteration. Inside the for loop, we save the value of opt1$lambda in the list named lambda with the command:
lambda[i] <- opt1$lambda
At the end of the loop, we unlist lambda so that it becomes a regular numeric vector.
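Item 3) is not covered by the loops above, since Y and X never change. A minimal sketch of how the rows could be resampled on each iteration follows, assuming simulated data and a hypothetical stand-in function tuning_stat in place of optL1(Y, X)$lambda, so that it runs without the penalized package.

```r
set.seed(42)

# Simulated stand-in data (assumption; not the question's df)
X <- data.frame(v1 = rnorm(50), v2 = rnorm(50))
Y <- rbinom(50, 1, 0.5)

# Hypothetical stand-in for optL1(Y, X)$lambda
tuning_stat <- function(y, x) max(abs(cor(x, y)))

n_boot <- 100
lambda <- numeric(n_boot)          # pre-allocated numeric vector
for (b in seq_len(n_boot)) {
  idx <- sample(nrow(X), replace = TRUE)   # resample rows with replacement
  lambda[b] <- tuning_stat(Y[idx], X[idx, , drop = FALSE])
}

# spread of the statistic across bootstrap samples
print(quantile(lambda, c(0.025, 0.975)))
```

Replacing tuning_stat with a call that returns the optimized lambda1 would give the bootstrap distribution the question asks about; the boot() approach does the same resampling internally.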
You can alter the function to take in a data.frame, and specify the columns to use for the response and covariates inside optL1:
library(boot)
library(penalized)
f<-function(data,ind){
fit = optL1(data[ind,"class"],data[ind,c("v1","v2")])
fit$lambda
}
df = data.frame(group=sample(c("A","B","C"),100,replace=TRUE),
class=sample(2,100,replace=TRUE),
v1 = rnorm(100),
v2 = rnorm(100)
)
bo = boot(df,f,100)
bo
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = f, R = 100)
Bootstrap Statistics :
original bias std. error
t1* 2.887399 0.2768409 1.85466
