Change recursive vector to atomic vector for t-test - r

I'm new to R and am trying to run a t-test for two means. I keep getting the error is.atomic(x) is not TRUE. I know I need to make my data atomic, but I haven't found a way online.
I've run code to check that the data is recursive and called as.data.frame(mydata).
titanic_summary <- data.frame(Outcome = c("Survived", "Died"),
Mean_Age = c(28.34369, 30.62618),
N = c(342, 549),
Total_Missing = c(52, 125))
titanic_summary
# Run a stats test (two sample T-test)
str(titanic_summary)
as.data.frame(titanic_summary)
is.atomic(titanic_summary)
is.recursive(titanic_summary)
titanic_test <- titanic_summary %>%
t.test(Outcome~Mean_Age)
Error in var(x) : is.atomic(x) is not TRUE

t.test does not work the way you seem to think. To avoid that particular error, you could instead use something like titanic_test <- t.test(Mean_Age ~ Outcome, data = titanic_summary), but that would just give you a different error, which brings us to the real question.
You presumably want to see whether there may be a relationship between age and survival, i.e. whether the difference in average ages of 2.28249 is significant. For that you need the individual ages, or some other information about dispersion (such as the standard deviation of age in each group); two means and two counts are not enough.
If you do use the detailed dataset then I suspect that what you really want is something like this:
library(titanic)
titanic_test <- t.test(Age ~ Survived, data = titanic_train)
which would give (for the Kaggle selected training set used in the titanic package)
> titanic_test
Welch Two Sample t-test
data: Age by Survived
t = 2.046, df = 598.84, p-value = 0.04119
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.09158472 4.47339446
sample estimates:
mean in group 0 mean in group 1
30.62618 28.34369
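If only summary statistics are available, a Welch test can still be computed by hand once a standard deviation per group is known. The sketch below is illustrative only: welch_from_summary is a hypothetical helper, and the two standard deviations (15 and 14) are made-up placeholders, since the summary table in the question does not contain them. The group sizes are N minus Total_Missing from that table.

```r
# Hypothetical helper: Welch two-sample t-test from summary statistics.
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  t_stat <- (m1 - m2) / sqrt(v1 + v2)
  # Welch-Satterthwaite approximation for the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p <- 2 * pt(-abs(t_stat), df)        # two-sided p-value
  c(t = t_stat, df = df, p.value = p)
}

# Means from the question; n = N - Total_Missing; the SDs are ASSUMED values.
welch_from_summary(m1 = 30.62618, s1 = 15, n1 = 549 - 125,
                   m2 = 28.34369, s2 = 14, n2 = 342 - 52)
```

With the real per-group standard deviations in place of the placeholders, this should agree with t.test() run on the individual ages.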

Related

Paired sample t-test in R: a question of direction

I have a question about the sign of t in a paired-sample t-test using different data structures, but the same data. I know that the sign doesn't make a difference in terms of significance, but, it does generally tell the user if there have been decreases over time or increases over time. So, I need to make sure that the code I provide produces the same results OR, is explained correctly.
I have to explain the results (and code) as an example we're giving users of our software, which uses R (Rdotnet within a C# program) for statistics. I'm unclear as to the proper order of variables in both methods in R.
Method 1 uses two matrices
## Sets seed for repetitive number generation
set.seed(2820)
## Creates the matrices
preTest <- c(rnorm(100, mean = 145, sd = 9))
postTest <- c(rnorm(100, mean = 138, sd = 8))
## Runs paired-sample T-Test just on two original matrices
t.test(preTest,postTest, paired = TRUE)
The result is significant, and the positive t tells me that the mean has decreased from preTest to postTest.
Paired t-test
data: preTest and postTest
t = 7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.340533 11.185513
sample estimates:
mean of the differences
8.763023
However, most people are going to get their data not from two matrices, but, from a file with values for BEFORE and AFTER. I will have these data in a csv and import them during a demo. So, to mimic this, I need to create data frame in the structure that users of our software are used to seeing. 'pstt' should look like the dataframe I have after I import a csv.
Method 2: uses a flat-file structure
## Converts the matrices into a dataframe that looks like the way these
## data are normally stored in a csv or Excel
ID <- c(1:100)
pstt <- data.frame(ID,preTest,postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
group = rep(c("preTest","postTest"),each = 100),
weight = c(preTest, postTest)
)
## Runs paired-sample T-Test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
The result of this t-test has a negative t, which may suggest to the user that the variable under study has increased over time.
Paired t-test
data: weight by group
t = -7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.185513 -6.340533
sample estimates:
mean of the differences
-8.763023
Is there a way to define explicitly which group is the BEFORE and which is the AFTER? Or do you have to put the AFTER group first in Method 2?
Thanks for any help/explanation.
Here is the full R program that I used:
## sets working dir
# setwd("C:\\Temp\\")
## runs file from command line
# source("paired_ttest.r",echo=TRUE)
## Sets seed for repetitive number generation
set.seed(2820)
## Creates the matrices
preTest <- c(rnorm(100, mean = 145, sd = 9))
postTest <- c(rnorm(100, mean = 138, sd = 8))
ID <- c(1:100)
## Converts the matrices into a dataframe that looks like the way these
## data are normally stored
pstt <- data.frame(ID,preTest,postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
group = rep(c("preTest","postTest"),each = 100),
weight = c(preTest, postTest)
)
print(pstt2)
## Runs paired-sample T-Test just on two original matrices
# t.test(preTest,postTest, paired = TRUE)
## Runs paired-sample T-Test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
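t.test treats group as a factor and computes the difference as first level minus second level. By default factor levels are sorted alphabetically, so "AFTER" would come before "BEFORE" and "postTest" would come before "preTest"; that is why Method 2 reports postTest minus preTest. You can explicitly set the first (reference) level of a factor with relevel(). In R 4.0.0 and later, data.frame() keeps strings as character instead of converting them to factors, so wrap group in factor() first:
t.test(weight ~ relevel(factor(group), "preTest"), data = pstt2, paired = TRUE)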
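As a rough illustration of how level order drives the sign, the sketch below sets the levels explicitly and then passes the two groups to t.test() as plain vectors, so the direction (first argument minus second) is visible in the call. It reuses the simulated data from the question; the names lv, first, second and the two result objects are introduced here only for the example.

```r
set.seed(2820)
preTest  <- rnorm(100, mean = 145, sd = 9)
postTest <- rnorm(100, mean = 138, sd = 8)

pstt2 <- data.frame(
  group  = rep(c("preTest", "postTest"), each = 100),
  weight = c(preTest, postTest)
)

# Declare the level order instead of relying on alphabetical sorting.
pstt2$group <- factor(pstt2$group, levels = c("preTest", "postTest"))
lv <- levels(pstt2$group)

first  <- pstt2$weight[pstt2$group == lv[1]]   # preTest values
second <- pstt2$weight[pstt2$group == lv[2]]   # postTest values

# The difference is always "first argument minus second argument".
res_before_after <- t.test(first, second, paired = TRUE)
res_after_before <- t.test(second, first, paired = TRUE)

res_before_after$statistic   # positive here: values dropped from pre to post
res_after_before$statistic   # same magnitude, opposite sign
```

Whichever form is shown to users, stating the subtraction order alongside the output removes the ambiguity about what a positive or negative t means.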

R - Error T-test For loop command between variables

Currently in the process of writing a For Loop that'll calculate and print t-test results, I'm testing for the difference in means of all variables (faminc, fatheduc, motheduc, white, cigtax, cigprice) between smokers and non-smokers ("smoke"; 0=non, 1=smoker)
Current code:
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
count <- 1
for(name in type){
temp <- subset(data, data[name]==1)
cat("For", name, "between smokers and non, the difference in means is: \n")
print(t.test(temp$smoke))
count <- count + 1
}
However, I feel that 'temp' doesn't belong here and when running the code I get:
For faminc between smokers and non, the difference in means is:
Error in t.test.default(temp$smoke) : not enough 'x' observations
The simple code of
t.test(faminc~smoke,data=data)
does what I need, but I'd like to get some practice/better understanding of for loops.
Here is a solution that generates the output requested in the OP, using lapply() with the mtcars data set.
data(mtcars)
varList <- c("wt","disp","mpg")
results <- lapply(varList,function(x){
t.test(mtcars[[x]] ~ mtcars$am)
})
names(results) <- varList
for(i in 1:length(results)){
message(paste("for variable:",names(results[i]),"difference between manual and automatic transmissions is:"))
print(results[[i]])
}
...and the output:
> for(i in 1:length(results)){
+ message(paste("for variable:",names(results[i]),"difference between manual and automatic transmissions is:"))
+ print(results[[i]])
+ }
for variable: wt difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 5.4939, df = 29.234, p-value = 6.272e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8525632 1.8632262
sample estimates:
mean in group 0 mean in group 1
3.768895 2.411000
for variable: disp difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 4.1977, df = 29.258, p-value = 0.00023
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
75.32779 218.36857
sample estimates:
mean in group 0 mean in group 1
290.3789 143.5308
for variable: mpg difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
>
Compare your code that works...
t.test(faminc~smoke,data=data)
You are specifying a relationship between variables (faminc~smoke) which means that you think the mean of faminc is different between the values of smoke and you wish to use the data dataset.
The equivalent line in your loop...
print(t.test(temp$smoke))
...only gives the single column of temp$smoke after having selected those who have the value 1 for each of faminc, fatheduc etc. So even if you wrote...
print(t.test(faminc~smoke, data=data))
Further, your count variable is doing nothing.
If you want to perform a range of tests in this manner, you could build the formula from the variable name (writing name~smoke directly would look for a column literally called name):
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
for(name in type){
cat("For", name, "between smokers and non, the difference in means is: \n")
print(t.test(as.formula(paste(name, "~ smoke")), data=data))
}
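Since the original data set was not posted, here is a small runnable version of the same loop pattern on the built-in mtcars data, with am standing in for smoke. The variable list vars and the p_values vector are names chosen for this sketch.

```r
data(mtcars)
vars <- c("wt", "disp", "mpg")

p_values <- numeric(0)
for (name in vars) {
  # Build the formula "wt ~ am", "disp ~ am", ... from the string.
  f <- as.formula(paste(name, "~ am"))
  res <- t.test(f, data = mtcars)
  cat("For", name, "between am = 0 and am = 1, p-value:",
      format.pval(res$p.value), "\n")
  p_values[name] <- res$p.value   # keep the result for later use
}
p_values
```

Storing the p-values in a named vector, rather than only printing, makes it easy to adjust them for multiple testing afterwards.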
Whether this is what you want to do though isn't clear to me, your variables suggest family (faminc), father (fatheduc), mother (motheduc), ethnicity (white), tax (cigtax) and price (cigprice).
I can't think why you would want to compare the mean cigarette price or tax between smokers and non-smokers, because the latter are not going to have any value for this since they don't smoke!
Your code suggests these are perhaps binary variables though (since you are filtering on each value being 1), which to me suggests this isn't even what you want to do.
If you wish to look at subsets of the data, a tidier approach than loops is to use purrr.
In future, when asking, consider providing a sample of data along with the full copied and pasted output, as advised in How to create a Minimal, Complete, and Verifiable example - Help Center - Stack Overflow, because this allows people to see in greater detail what you are doing (e.g. I've only guessed about your variables). With statistics it's also useful to state what your hypothesis is, to help people understand what you are trying to achieve overall.

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit)
However, I'm trying now to analyze the effects of covariates in my model. I find the help file in mlogit.effects() to be not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1) ).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Officials and Responsive). Would I just model whatever they picked as their first rank and ignore what they chose in other ranks? I can get that information.
Reproducible code is below:
#Loadpackages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling
$$\log\left[ \frac{\Pr(\text{Item}_k \text{ is 1st})}{\Pr(\text{Item}_K \text{ is 1st})} \right] = \alpha_k + x^T \beta_k$$
for k = 1,...,K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for reference levels for the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
# (factor() is needed in R >= 4.0.0, where read.csv() keeps strings as character)
dat3$gender <- relevel(factor(dat3$gender), ref="Female")
dat3$Job <- relevel(factor(dat3$Job), ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
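For readers who cannot download the survey file, the same count-matrix setup can be tried on simulated data. Everything below is made up for illustration: three hypothetical items A, B and C, one continuous predictor x, and ten respondents per row. It is a sketch of the response format and of the reference-level point, not a reanalysis of the survey.

```r
library(nnet)  # recommended package, normally installed with R

set.seed(1)
n <- 60
d <- data.frame(x = rnorm(n))  # one made-up continuous predictor

# Simulated counts of first-place rankings for items A, B, C (10 per row).
counts <- t(sapply(d$x, function(xi) {
  rmultinom(1, size = 10, prob = exp(c(0, 0.5 * xi, -0.5 * xi)))[, 1]
}))
colnames(counts) <- c("A", "B", "C")

# The first column of the response matrix acts as the reference item.
fit_ref_A <- multinom(counts ~ x, data = d, trace = FALSE)

# Same data with C moved to the front, so C becomes the reference.
counts_C_first <- counts[, c("C", "A", "B")]
fit_ref_C <- multinom(counts_C_first ~ x, data = d, trace = FALSE)

coef(fit_ref_A)   # log-odds of B and C relative to A
coef(fit_ref_C)   # log-odds of A and B relative to C

# Predicted probabilities for item A under both fits (column 1 vs column 2);
# the gap should be small, limited by optimizer tolerance.
max(abs(fitted(fit_ref_A)[, 1] - fitted(fit_ref_C)[, 2]))
```

The coefficients differ between the two fits because they answer different "relative to what" questions, while the fitted probabilities should be essentially the same.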
You are working with ranked data, not just multinomial choice data. In mlogit's structure for ranked data, the first set of records for a person contains all options, the second contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, which produces a bunch of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

performing a chi square test across multiple variables and extracting the relevant p value in R

Ok, straight to the question. I have a database with lots and lots of categorical variables.
Sample database with a few variables as below
gender <- as.factor(sample( letters[6:7], 100, replace=TRUE, prob=c(0.2, 0.8) ))
smoking <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.6,0.4)))
alcohol <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.3,0.7)))
htn <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.2,0.8)))
tertile <- as.factor(sample(c(1,2,3),size=100,replace=T,prob=c(0.3,0.3,0.4)))
df <- as.data.frame(cbind(gender,smoking,alcohol,htn,tertile))
I want to test the hypothesis, using a chi square test, that there is a difference in the proportion of smokers, alcohol use, hypertension (htn) etc. by tertile (3 levels). I then want to extract the p values for each variable.
Now I know I can test each individual variable using a 2 by 3 cross tabulation, but is there more efficient code to derive the test statistic and p-value across all variables in one go and extract the p value for each variable?
Thanks in advance
Anoop
If you want to do all the comparisons in one statement, you can do
mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs=list(df[,5]))
# gender smoking alcohol htn
# 0.4967724 0.8251178 0.5008898 0.3775083
Of course, doing tests this way is statistically problematic: you are running multiple tests, so some correction is required to maintain an appropriate type 1 error rate.
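One common way to apply such a correction is p.adjust() from base R. The sketch below regenerates data similar to the question's sample (the seed and the exact construction are choices made here, not the asker's) and applies the Holm method; other methods such as "bonferroni" or "BH" can be passed the same way.

```r
set.seed(42)
n <- 100
df <- data.frame(
  gender  = sample(c("f", "g"), n, replace = TRUE, prob = c(0.2, 0.8)),
  smoking = sample(c(0, 1), n, replace = TRUE, prob = c(0.6, 0.4)),
  alcohol = sample(c(0, 1), n, replace = TRUE, prob = c(0.3, 0.7)),
  htn     = sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8)),
  tertile = sample(1:3, n, replace = TRUE, prob = c(0.3, 0.3, 0.4))
)

predictors <- setdiff(names(df), "tertile")

# Raw p-value of each variable against tertile.
p_raw <- sapply(df[predictors], function(x) {
  chisq.test(table(x, df$tertile))$p.value
})

# Holm adjustment controls the family-wise error rate across the tests.
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(p_raw, p_holm)
```

Adjusted p-values are never smaller than the raw ones, so a variable that was borderline on its own may no longer be significant after correction.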
You can run the following code chunk if you want to get the test result in details:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE))
You can get just p-values:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value)
This is to get the p-values in the data frame:
data.frame(lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value))
Thanks to RPub for inspiring.
http://www.rpubs.com/kaz_yos/1204

T-test with grouping variable

I've got a data frame with 36 variables and 74 observations. I'd like to run a two-sample paired t-test on 35 variables by 1 grouping variable (with two levels).
For example: the data frame contains "age" "body weight" and "group" variables.
Now I suppose I can do the t-test for each variable with this code:
t.test(age~group)
But is there a way to test all 35 variables with one piece of code, and not one by one?
Sven has provided you with a great way of implementing what you wanted to have implemented. I, however, want to warn you about the statistical aspect of what you are doing.
Recall that if you are using the standard significance level of 0.05, then for each t-test performed you have a 5% chance of committing a Type 1 error (incorrectly rejecting a true null hypothesis). Running 35 individual t-tests compounds that risk. If the tests are independent and every null hypothesis is true, the probability of at least one Type 1 error is:
Pr(at least one Type 1 Error) = 1 - (0.95)^35 = 0.834
Meaning that you have about an 83.4% chance of falsely rejecting at least one null hypothesis. In other words, by running so many t-tests, there is a very high probability that at least one of them will give a false positive.
Just FYI.
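The arithmetic above is easy to reproduce, and the same formula shows what a Bonferroni-style per-test threshold would do to the overall error rate. This is only a back-of-the-envelope sketch under the same independence assumption; the variable names are introduced here.

```r
alpha <- 0.05   # per-test significance level
m     <- 35     # number of t-tests

# Chance of at least one false positive across m independent tests.
fwer <- 1 - (1 - alpha)^m

# Bonferroni: test each hypothesis at alpha / m instead.
alpha_bonferroni <- alpha / m
fwer_bonferroni  <- 1 - (1 - alpha_bonferroni)^m

round(c(fwer = fwer,
        alpha_bonferroni = alpha_bonferroni,
        fwer_bonferroni = fwer_bonferroni), 4)
```

Under the stricter per-test threshold the overall error rate stays below the nominal 0.05, at the cost of less power for each individual test.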
An example data frame:
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
weight = rnorm(10, 30), group = gl(2,5))
You can use lapply:
lapply(dat[1:3], function(x)
t.test(x ~ dat$group, paired = TRUE, na.action = na.pass))
In the command above, 1:3 gives the positions of the columns that hold the variables to test. The argument paired = TRUE is necessary to perform a paired t-test.
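Newer versions of R may refuse paired = TRUE in the formula interface, so it can be safer to split each variable by group and call t.test() on the two vectors. The sketch below does that for the example data frame; it assumes the rows within each group are stored in matching order (the i-th row of group 1 pairs with the i-th row of group 2), which is also what the formula version relies on. The names g1, g2, results and p_values are introduced here.

```r
set.seed(1)
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
                  weight = rnorm(10, 30), group = gl(2, 5))

# Logical row selectors for the two levels of the grouping variable.
g1 <- dat$group == levels(dat$group)[1]
g2 <- dat$group == levels(dat$group)[2]

# Paired t-test of level 1 minus level 2 for every measurement column.
results <- lapply(dat[1:3], function(x) {
  t.test(x[g1], x[g2], paired = TRUE)
})

# Collect the p-values into one named vector.
p_values <- sapply(results, function(r) r$p.value)
p_values
```

Having the p-values in a single vector also makes it straightforward to pass them to p.adjust() if a multiple-testing correction is wanted.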
