Use lapply and show variable names - r

I am fairly new to R, though I've programmed a lot in Python and Java. I have searched questions about looping over a list of variables, and everyone keeps recommending lapply. I have done that, and my code works in the sense that it gives me the answers, but the output hides an important detail. Here's my code and some of the output.
> bat <- read.csv(file="mlbTeam2016-B.csv", header=TRUE)
> varlist <- names(bat)[6:32]
> varlist
[1] "AB.B" "R.B" "H.B" "X2B.B" "X3B.B" "HR.B" "RBI.B"
[8] "BB.B" "SO.B" "SB.B" "CS.B" "AVG.B" "OBP.B" "SLG.B"
[15] "OPS.B" "IBB.B" "HBP.B" "SAC.B" "SF.B" "TB.B" "XBH.B"
[22] "GDP.B" "GO.B" "AO.B" "GO_AO.B" "NP.B" "PA.B"
> lapply(varlist, function(i){
+ var <- eval(parse(text=paste("bat$",i)))
+ cor.test(bat$W, var, alternative="two.sided", method="pearson")
+ })
[[1]]
Pearson's product-moment correlation
data: bat$W and var
t = 0.35067, df = 28, p-value = 0.7285
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3013221 0.4164731
sample estimates:
cor
0.06612551
etc
The problem is that each output says data: bat$W and var without telling me which variable is being tested in that step, so I have to count back through varlist to see which variable each result corresponds to. That is still better than typing this code in dozens of times, but not ideal. I also know that using eval(parse()) is bad practice, but I can't figure out another way to handle that line.
This is my desired output:
[[1]]
Pearson's product-moment correlation
data: bat$W and bat$AB.B
t = 0.35067, df = 28, p-value = 0.7285
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3013221 0.4164731
sample estimates:
cor
0.06612551

I would suggest creating a correlation matrix rather than doing this using lapply.
This link will walk you through how to do that:
http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
You can select the variables you want using dplyr:
select(bat, one_of(varlist))
This should be a bit easier than the approach you are using.
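For example, here is a minimal sketch of that approach, assuming bat and varlist are defined as in the question (batvars is just an illustrative name):
library(dplyr)
batvars <- select(bat, one_of(varlist))                # the candidate batting variables
round(cor(bat$W, batvars, use = "complete.obs"), 3)    # correlations of wins with each variable
# For p-values as well, the linked guide uses Hmisc::rcorr() on the full matrix:
# library(Hmisc)
# rcorr(as.matrix(cbind(W = bat$W, batvars)))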

Related

Change recursive vector to atomic vector for t-test

I'm new to R and am trying to run a t-test for two means. I keep getting the error is.atomic(x) is not TRUE. I know I need to make my data atomic, but I haven't found a way to do that online.
I've run code to check that the data is recursive, and I tried as.data.frame(mydata).
titanic_summary <- data.frame(Outcome = c("Survived", "Died"),
                              Mean_Age = c(28.34369, 30.62618),
                              N = c(342, 549),
                              Total_Missing = c(52, 125))
titanic_summary
# Run a stats test (two-sample t-test)
str(titanic_summary)
as.data.frame(titanic_summary)
is.atomic(titanic_summary)
is.recursive(titanic_summary)
titanic_test <- titanic_summary %>%
  t.test(Outcome ~ Mean_Age)
Error in var(x) : is.atomic(x) is not TRUE
t.test does not work the way you seem to think. To avoid that particular error you could instead use something like titanic_test <- t.test(Mean_Age ~ Outcome, data = titanic_summary), but that would just give you different errors, which brings us to the real question:
You presumably want to see whether there may be a relationship between age and survival, i.e. whether the difference in average ages of 2.28249 is significant, but you will need the individual ages, or at least some other information about dispersion, to test this.
If you do use the detailed dataset, then I suspect that what you really want is something like this:
library(titanic)
titanic_test <- t.test(Age ~ Survived, data = titanic_train)
which would give (for the Kaggle training set used in the titanic package):
> titanic_test
Welch Two Sample t-test
data: Age by Survived
t = 2.046, df = 598.84, p-value = 0.04119
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.09158472 4.47339446
sample estimates:
mean in group 0 mean in group 1
30.62618 28.34369

Is it possible to change Type I error threshold using t.test() function?

I am asked to compute a test statistic using the t.test() function, but I need to reduce the type I error. My prof showed us how to change a confidence level for this function, but not the acceptable type I error for null hypothesis testing. The goal is for the argument to automatically compute a p-value based on a .01 error rate rather than the normal .05.
The R code below involves a data set that I have downloaded.
t.test(mid$log_radius_area, mu=8.456)
I feel like I've answered this somewhere, but can't seem to find it on SO or CrossValidated.
Similarly to this question, the answer is that t.test() doesn't specify any threshold for rejecting/failing to reject the null hypothesis; it reports a p-value, and you get to decide whether to reject or not. (The conf.level argument is for adjusting which confidence interval the output reports.)
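If what you actually need is for the reported interval to line up with a 0.01 error rate, you can request a 99% confidence interval; a minimal sketch using the question's mid data (the reported p-value itself is unchanged):
t.test(mid$log_radius_area, mu = 8.456, conf.level = 0.99)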
From ?t.test:
t.test(1:10, y = c(7:20))
Welch Two Sample t-test
data: 1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.052802 -4.947198
sample estimates:
mean of x mean of y
5.5 13.5
Here the p-value is reported as 1.855e-05, so the null hypothesis would be rejected for any (pre-specified) alpha level >1.855e-05. Note that the output doesn't say anywhere "the null hypothesis is rejected at alpha=0.05" or anything like that. You could write your own function to do that, using the $p.value element that is saved as part of the test results:
report_test <- function(tt, alpha = 0.05) {
  cat("the null hypothesis is ")
  if (tt$p.value > alpha) {
    cat("**NOT** ")
  }
  cat("rejected at alpha=", alpha, "\n")
}
tt <- t.test(1:10, y = c(7:20))
report_test(tt)
## the null hypothesis is rejected at alpha= 0.05
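To check against the stricter threshold from the question, just pass a different alpha:
report_test(tt, alpha = 0.01)
## the null hypothesis is rejected at alpha= 0.01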
Most R package/function writers don't bother to do this, because they figure that it should be simple enough for users to do for themselves.

R - Error T-test For loop command between variables

I'm currently writing a for loop that will calculate and print t-test results: I'm testing for the difference in means of each variable (faminc, fatheduc, motheduc, white, cigtax, cigprice) between smokers and non-smokers ("smoke"; 0 = non-smoker, 1 = smoker).
Current code:
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
count <- 1
for(name in type){
  temp <- subset(data, data[name]==1)
  cat("For", name, "between smokers and non, the difference in means is: \n")
  print(t.test(temp$smoke))
  count <- count + 1
}
However, I feel that 'temp' doesn't belong here, and when running the code I get:
For faminc between smokers and non, the difference in means is:
Error in t.test.default(temp$smoke) : not enough 'x' observations
The simple code of
t.test(faminc~smoke,data=data)
does what I need, but I'd like to get some practice/better understanding of for loops.
Here is a solution that generates the output requested in the OP, using lapply() with the mtcars data set.
data(mtcars)
varList <- c("wt","disp","mpg")
results <- lapply(varList,function(x){
  t.test(mtcars[[x]] ~ mtcars$am)
})
names(results) <- varList
for(i in 1:length(results)){
  message(paste("for variable:",names(results[i]),"difference between manual and automatic transmissions is:"))
  print(results[[i]])
}
...and the output:
> for(i in 1:length(results)){
+ message(paste("for variable:",names(results[i]),"difference between manual and automatic transmissions is:"))
+ print(results[[i]])
+ }
for variable: wt difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 5.4939, df = 29.234, p-value = 6.272e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8525632 1.8632262
sample estimates:
mean in group 0 mean in group 1
3.768895 2.411000
for variable: disp difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 4.1977, df = 29.258, p-value = 0.00023
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
75.32779 218.36857
sample estimates:
mean in group 0 mean in group 1
290.3789 143.5308
for variable: mpg difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
>
Compare your code that works...
t.test(faminc~smoke,data=data)
You are specifying a relationship between variables (faminc~smoke), which means you think the mean of faminc differs between the values of smoke, and you wish to use the data dataset.
The equivalent line in your loop...
print(t.test(temp$smoke))
...only gives the single column temp$smoke after having selected those rows with the value 1 for each of faminc, fatheduc, etc. So even if you wrote...
print(t.test(faminc~smoke, data=data))
...inside the loop, it would simply repeat the identical test on every pass, because the loop variable name is never used. Furthermore, your count is doing nothing.
If you want to perform a range of tests in this manner, you need to build the formula from the loop variable (a bare name~smoke would treat name as a literal variable rather than substituting its value), for example:
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
for(name in type){
  cat("For", name, "between smokers and non, the difference in means is: \n")
  print(t.test(as.formula(paste(name, "~ smoke")), data = data))
}
Whether this is what you want to do, though, isn't clear to me: your variables suggest family income (faminc), father's education (fatheduc), mother's education (motheduc), ethnicity (white), cigarette tax (cigtax) and cigarette price (cigprice).
I can't think why you would want to compare the mean cigarette price or tax between smokers and non-smokers, because the latter are not going to have any value for these since they don't smoke!
Your code suggests these are perhaps binary variables though (since you are filtering on each value being 1), which to me suggests this isn't even what you want to do.
If you wish to look at subsets of your data, a tidier approach than explicit loops (for t-tests or regression) is to use purrr.
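For instance, here is a hedged sketch of the purrr approach, assuming data contains the columns listed above plus the binary smoke indicator:
library(purrr)
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
results <- map(set_names(type), ~ t.test(reformulate("smoke", .x), data = data))
results$faminc   # Welch two-sample test of faminc by smoking status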
In future, when asking, consider providing a sample of data along with the full copy-and-pasted output, as advised in How to create a Minimal, Complete, and Verifiable example - Help Center - Stack Overflow. This allows people to see in greater detail what you are doing (e.g. I've only guessed about your variables). With statistics it's also useful to state what your hypothesis is, to help people understand what you are trying to achieve overall.

Grabbing certain results out of multiple t.test outputs to create a table

I have run 48 t-tests (coded by hand instead of writing a loop) and would like to splice out certain results of those t.tests to create a table of the things I'm most interested in.
Specifically, I would like to keep only the p-value, confidence interval, and the mean of x and mean of y for each of these 48 tests and then build a table of the results.
Is there an elegant, quick way to do this beyond the top answer detailed here, wherein I would go in for each of the 48 tests and grab all three desired outputs with something along the lines of ttest$p.value? Perhaps a loop?
Below is a sample of the coded input for one t-test, followed by the output delivered by R.
# t.test comparing means of Change_Unemp for 2005 government employment (ix)
lowgov6 <- met_res[met_res$Gov_Emp_2005 <= 93310, "Change_Unemp"]
highgov6 <- met_res[met_res$Gov_Emp_2005 > 93310, "Change_Unemp"]
t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
Welch Two Sample t-test
data: lowgov6 and highgov6
t = 1.5896, df = 78.978, p-value = 0.1159
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1813909 1.6198399
sample estimates:
mean of x mean of y
4.761224 4.042000
Save all of your t-tests into a list:
tests <- list()
tests[[1]] <- t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
# repeat for all tests
# there are probably faster ways than doing all of that by hand (one sketch appears after the output below)
# extract your values using `sapply`
sapply(tests, function(x) {
  c(x$estimate[1],
    x$estimate[2],
    ci.lower = x$conf.int[1],
    ci.upper = x$conf.int[2],
    p.value = x$p.value)
})
The output is something like the following:
[,1] [,2]
mean of x 0.12095949 0.03029474
mean of y -0.05337072 0.07226999
ci.lower -0.11448679 -0.31771191
ci.upper 0.46314721 0.23376141
p.value 0.23534905 0.76434012
Yours will have 48 columns, though. You can t() the result if you'd like it transposed.
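Rather than filling the list by hand, here is one hedged sketch of generating all of the tests programmatically; cutoffs is a hypothetical named vector pairing each grouping variable in met_res with the threshold you used:
cutoffs <- c(Gov_Emp_2005 = 93310)   # extend with the other variables and their cutpoints
tests <- lapply(names(cutoffs), function(v) {
  lo <- met_res[met_res[[v]] <= cutoffs[[v]], "Change_Unemp"]
  hi <- met_res[met_res[[v]] >  cutoffs[[v]], "Change_Unemp"]
  t.test(lo, hi)
})
names(tests) <- names(cutoffs)
# then run the sapply() extraction above on tests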

What is the ggplot2/plyr way to calculate statistical tests between two subgroups?

I am a rather novice user of R and have come to appreciate the elegance of ggplot2 and plyr. Right now, I am trying to analyze a large dataset that I can not share here, but I have reconstructed my problem with the diamonds dataset (shortened for convenience).
Without further ado:
diam <- diamonds[diamonds$cut=="Fair"|diamonds$cut=="Ideal",]
boxplots <- ggplot(diam, aes(x=cut, price)) + geom_boxplot(aes(fill=cut)) + facet_wrap(~ color)
print(boxplots)
What the plot produces is a set of boxplots, comparing the price of the two cuts "Fair" and "Ideal".
I would now very much like to proceed by statistically comparing the two cuts for each color subgroup (D,E,F,..,J) using either t.test or wilcox.test.
How would I implement this in a way that is as elegant as the ggplot2 syntax? I assume I would use ddply from the plyr package, but I couldn't figure out how to feed two subgroups into a function that calculates the appropriate statistics.
I think you're looking for:
library(plyr)
ddply(diam, "color",
      function(x) {
        w <- wilcox.test(price~cut, data=x)
        with(w, data.frame(statistic, p.value))
      })
(Substituting t.test for wilcox.test seems to work fine too.)
results:
color statistic p.value
1 D 339753.5 4.232833e-24
2 E 591104.5 6.789386e-19
3 F 731767.5 2.955504e-11
4 G 950008.0 1.176953e-12
5 H 611157.5 2.055857e-17
6 I 213019.0 3.299365e-04
7 J 56870.0 2.364026e-01
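For the t.test substitution mentioned above, a similar sketch (same diam subset assumed) that also keeps the group means:
ddply(diam, "color",
      function(x) {
        tt <- t.test(price~cut, data=x)
        data.frame(statistic = unname(tt$statistic),
                   p.value = tt$p.value,
                   mean_Fair = unname(tt$estimate[1]),
                   mean_Ideal = unname(tt$estimate[2]))
      })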
ddply returns a data frame as output and, assuming that I am reading your question properly, that isn't what you are looking for. I believe you would like to conduct a series of t-tests using a series of subsets of data so the only real task is compiling a list of those subsets. Once you have them you can use a function like lapply() to run a t-test for each subset in your list. I am sure this isn't the most elegant solution, but one approach would be to create a list of unique pairs of your colors using a function like this:
get.pairs <- function(v){
  l <- length(v)
  n <- sum(1:l-1)
  a <- vector("list", n)
  j <- 1
  k <- 2
  for(i in 1:n){
    a[[i]] <- c(v[j], v[k])
    if(k < l){
      k <- k + 1
    } else {
      j <- j + 1
      k <- j + 1
    }
  }
  return(a)
}
Now you can use that function to get your list of unique pairs of colors:
> (color.pairs <- get.pairs(levels(diam$color)))
[[1]]
[1] "D" "E"
[[2]]
[1] "D" "F"
...
[[21]]
[1] "I" "J"
Now you can use each of these pairs to run a t.test (or whatever you would like) on the corresponding subset of your data frame, like so:
> t.test(price~cut,data=diam[diam$color %in% color.pairs[[1]],])
Welch Two Sample t-test
data: price by cut
t = 8.1594, df = 427.272, p-value = 3.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1008.014 1647.768
sample estimates:
mean in group Fair mean in group Ideal
3938.711 2610.820
Now use lapply() to run your test for each subset in your list of color pairs:
> lapply(color.pairs,function(x) t.test(price~cut,data=diam[diam$color %in% x,]))
[[1]]
Welch Two Sample t-test
data: price by cut
t = 8.1594, df = 427.272, p-value = 3.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1008.014 1647.768
sample estimates:
mean in group Fair mean in group Ideal
3938.711 2610.820
...
[[21]]
Welch Two Sample t-test
data: price by cut
t = 0.8813, df = 375.996, p-value = 0.3787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-260.0170 682.3882
sample estimates:
mean in group Fair mean in group Ideal
4802.912 4591.726
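If you would rather have a compact summary than 21 full printouts, here is a hedged sketch that names the results by their color pair and pulls out the p-values (pair.tests is just an illustrative name):
pair.tests <- lapply(color.pairs, function(x) t.test(price~cut, data=diam[diam$color %in% x,]))
names(pair.tests) <- sapply(color.pairs, paste, collapse="-")
sapply(pair.tests, function(tt) tt$p.value)   # named vector of p-values: "D-E", "D-F", ...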
