Unpaired but not paired ttest loop in R working - r

I have a loop that goes through a dataframe, runs ttests and stores the resulting p-value of each ttest in another dataframe.
Here is the loop where 'mydata' is the dataframe that the ttests are being run on. 'mydata' is a dataframe with 4 columns:
df <- mydata
mydf <- data.frame(c(1:4))
# this is the new dataframe being initialized to store my p-values
row.names(mydf) <- names(df)
for(i in names(df)){
if(sd(df[[i]]) == 0) {
# this prevents the loop from terminating and returning an error when ttests
# are run on columns with binary values
} else {
ttest <- t.test(df[df$Pre==1,][[i]], df[df$Pre==2,][[i]], paired=FALSE)
# 'Pre' is the column that groups my data into
# distinct cohorts. I am comparing the Pre cohort versus the Post cohort
# in these ttests.
mydf[i,1] <- ttest$p.value
}
}
mydf
Here is my output of mydf for an unpaired (paired=FALSE) ttest:
c.1.4.
density 0.3569670
clust 0.9715987
Pre 3.0000000
HC 4.0000000
However, when I change paired=FALSE to paired=TRUE (to run a paired ttest), here is mydf:
c.1.4.
density 1
clust 2
Pre 3
HC 4
I checked this line of my loop in isolation with paired=TRUE, using the first column of my dataframe (index 1 in double brackets), and it does appear to output a p-value:
ttest <- t.test(df[df$Pre==1,][[1]], df[df$Pre==2,][[1]], paired=TRUE)
ttest$p.value
[1] 0.356967
Below is a sample dataset that you can use to reproduce the error:
density clust Pre HC
RDHC008A_13 0.47991 0.676825 1 1
RDHC009A_13 0.49955 0.696441 1 1
RDHC010A_16 0.491454 0.706507 1 1
RDHC013A_13 0.442879 0.689118 1 1
RDHC014A_13 0.453823 0.691603 1 1
RDHC016A_16 0.481259 0.706978 1 1
RDHC019A_06 0.515442 0.699514 1 1
RDHC021A_15 0.449925 0.685202 1 1
RDHC022A_12 0.461319 0.705446 1 1
RDHC023A_11 0.468816 0.667698 1 1
RDHC024A_12 0.515142 0.719474 1 1
RDHC025A_13 0.496702 0.710877 1 1
RDHC026A_12 0.477061 0.695061 1 1
RDHC027A_12 0.515442 0.722269 1 1
RDHC029A_12 0.406747 0.669998 1 1
RDHC030A_12 0.476162 0.69219 1 1
RDHC032B_13 0.50075 0.685474 1 1
RDHC034B_07 0.525487 0.725558 1 1
RDHC036B_07 0.468816 0.698904 1 1
RDHC038B_07 0.470015 0.706668 1 1
RDHC039B_07 0.511544 0.712818 1 1
RDHC041A_14 0.551574 0.732983 1 1
RDHC004C_12 0.486207 0.695121 2 1
RDHC005C_12 0.505997 0.695598 2 1
RDHC006C_13 0.487406 0.697044 2 1
RDHC013C_12 0.41979 0.685518 2 1
RDHC015C_13 0.297751 0.69632 2 1
RDHC016C_16 0.463718 0.700011 2 1
RDHC019C_14 0.508096 0.690071 2 1
RDHC021C_12 0.448426 0.688265 2 1
RDHC022C_12 0.468816 0.700968 2 1
RDHC024C_12 0.515292 0.70664 2 1
RDHC025C_13 0.473163 0.704231 2 1
RDHC027C_12 0.518741 0.732939 2 1
RDHC030C_11 0.489205 0.708174 2 1
You can import it by doing the following:
copy the data and paste it within the quotation marks of the code below into R:
zz <- ""
now, assign the data to a data.frame:
mydata <- read.table(text=zz, header=TRUE)
I have no idea why changing the 'paired' parameter to TRUE would cause this to happen. Any help/advice would be much appreciated. Thanks - Paul

You initialize the mydf data.frame with the values 1:4 here
mydf <- data.frame(c(1:4))
The loop never overwrites those values: with paired=TRUE, t.test throws an error on the first column because your two sets of values aren't the same length, which a paired t-test requires. The error stops the loop, so mydf keeps its initial values 1:4. You have 22 values where Pre==1 and 13 values where Pre==2. You can't do a paired test with an imbalance like that.
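As a minimal sketch of the point above (the vectors `pre` and `post` are made-up toy values, not the asker's data), initializing the results with NA and wrapping the call in tryCatch makes the failure visible instead of leaving placeholder numbers behind:

```r
# Hypothetical toy data: 5 "pre" values but only 3 "post" values
pre  <- c(0.48, 0.50, 0.49, 0.44, 0.45)
post <- c(0.49, 0.51, 0.42)

# Start from NA so an untested entry cannot be mistaken for a p-value
pvals <- c(paired = NA_real_, unpaired = NA_real_)

# A paired test needs equal-length inputs; tryCatch converts the
# resulting error into NA instead of stopping the script
pvals["paired"] <- tryCatch(
  t.test(pre, post, paired = TRUE)$p.value,
  error = function(e) NA_real_
)

# The unpaired test accepts groups of different sizes
pvals["unpaired"] <- t.test(pre, post, paired = FALSE)$p.value

pvals
```

With this pattern, an NA in the output signals that the test could not be run for that column.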

Related

Filtering of dataframe columns displaying a counter intuitive behavior (R)

Take as an example the dataframe below. I need to change the dataframe by keeping only the columns that are in the filter objects.
test <- data.frame(A = c(1,6,1,2,3) , B = c(1,2,1,1,2), C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter <- c("A", "B", "C", "D")
filter2 <- c("A","B","D")
To do that I'm using this piece of code:
`%ni%` <- Negate(`%in%`)
test <- test[,-which(names(test) %ni% filter2)]
If I use the filter2 object I get what is expected:
A B D
1 1 1 1
2 6 2 1
3 1 1 1
4 2 1 1
5 3 2 1
However, if I use the filter object, I get a dataframe with zero columns:
data frame with 0 columns and 5 rows
I expected to get an untouched dataframe, since filter contains all of the test columns. Why does this happen, and how can I write more reliable code that does not return an empty dataframe in these situations?
Use ! instead of -
test[,!(names(test) %ni% filter2)]
test[,!(names(test) %ni% filter)]
Wrapping with which and negating with - works only when which returns at least one index:
> which(names(test) %ni% filter2)
[1] 3
> which(names(test) %ni% filter)
integer(0)
Applying - to integer(0) still gives integer(0), so nothing is excluded and an empty selection is made instead:
> -which(names(test) %ni% filter)
integer(0)
> -which(names(test) %ni% filter2)
[1] -3
thus,
> test[integer(0)]
data frame with 0 columns and 5 rows
I think you can simplify the column selection process by subsetting the dataframe with character vector of column names.
test[filter]
# A B C D
#1 1 1 1 1
#2 6 2 7 1
#3 1 1 6 1
#4 2 1 4 1
#5 3 2 1 1
test[filter2]
# A B D
#1 1 1 1
#2 6 2 1
#3 1 1 1
#4 2 1 1
#5 3 2 1
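A short sketch combining the two answers above: selecting with a logical `%in%` vector avoids the integer(0) problem, and `drop = FALSE` (an addition not mentioned in the answers) keeps the result a data frame even when a single column matches. The variable names `keep_all`, `keep_some` and `keep_one` are only illustrative.

```r
test <- data.frame(A = c(1, 6, 1, 2, 3), B = c(1, 2, 1, 1, 2),
                   C = c(1, 7, 6, 4, 1), D = c(1, 1, 1, 1, 1))
filter  <- c("A", "B", "C", "D")
filter2 <- c("A", "B", "D")

# Logical indexing: TRUE for every column whose name is in the filter
keep_all  <- test[, names(test) %in% filter,  drop = FALSE]
keep_some <- test[, names(test) %in% filter2, drop = FALSE]

# With a single match, drop = FALSE prevents collapsing to a plain vector
keep_one  <- test[, names(test) %in% "A", drop = FALSE]

names(keep_all)
names(keep_some)
```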

Testing/Training data sets stratified on two crossed variables

I have a data set which is crossed with respect to two categorical variables, and only 1 rep per combination:
> examp <- data.frame(group=rep(LETTERS[1:4], each=6), class=rep(LETTERS[16:21], times=4))
> table(examp$group, examp$class)
P Q R S T U
A 1 1 1 1 1 1
B 1 1 1 1 1 1
C 1 1 1 1 1 1
D 1 1 1 1 1 1
I need to create a testing/training data set (50/50 split) which balances both group and class.
I know I can use createDataPartition from the caret package to balance it in one of the two factors, but this leaves an imbalance in the other factor:
> library(caret)
> examp$valid <- "test"
> examp$valid[createDataPartition(examp$group, p=0.5, list=FALSE)] <- "train"
> table(examp$group, examp$valid)
test train
A 3 3
B 3 3
C 3 3
D 3 3
> table(examp$class, examp$valid)
test train
P 1 3
Q 2 2
R 2 2
S 2 2
T 2 2
U 3 1
>
>
> examp$valid <- "test"
> examp$valid[createDataPartition(examp$class, p=0.5, list=FALSE)] <- "train"
> table(examp$group, examp$valid)
test train
A 3 3
B 3 3
C 5 1
D 1 5
> table(examp$class, examp$valid)
test train
P 2 2
Q 2 2
R 2 2
S 2 2
T 2 2
U 2 2
How can I create a partition which is balanced in both factors? If I had multiple reps per group/class combination, I would stratify by interaction(group,class), but I cannot in this case since there is only one observation in each combo.
I propose this algorithm:
1. Randomly sort the unique group values (e.g., DBAC).
2. Iterate over adjacent pairs of the randomly sorted group values (e.g., first DB, then AC). For each pair:
   a. Randomly pick half of the class values.
   b. Assign the rows with the first group and in the selected half of class to TRAIN.
   c. Assign the rows with the second group and not in the selected half of class to TRAIN.
   d. Assign the remaining rows of both groups to TEST.
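One way to make the pairing idea produce balance in both factors is to give the two groups in each pair complementary halves of the class values for training and leave everything else in the test set. That complementary assignment is an assumption of this sketch, as are the even numbers of groups and classes and the arbitrary seed:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

examp <- data.frame(group = rep(LETTERS[1:4], each = 6),
                    class = rep(LETTERS[16:21], times = 4))
examp$valid <- "test"  # default; training rows are overwritten below

groups  <- sample(as.character(unique(examp$group)))  # random group order
classes <- as.character(unique(examp$class))

# Walk over adjacent pairs of groups: (1, 2), (3, 4), ...
for (k in seq(1, length(groups), by = 2)) {
  half <- sample(classes, length(classes) / 2)  # random half of the classes
  first  <- examp$group == groups[k]     &  (examp$class %in% half)
  second <- examp$group == groups[k + 1] & !(examp$class %in% half)
  examp$valid[first | second] <- "train"
}

table(examp$group, examp$valid)
table(examp$class, examp$valid)
```

Within each pair every class contributes one training and one test row, and each group gets half of its rows in each set, so both tables come out balanced.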

loop ordinal regression statistical analysis and save the data R

could you, please, help me with a loop? I am relatively new to R.
The short version of the data looks like this:
sNumber blockNo running TrialNo wordTar wordTar1 Freq Len code code2
1 1 1 5 spouse violent 5011 6 1 2
1 1 1 5 violent spouse 17873 7 2 1
1 1 1 5 spouse aviator 5011 6 1 1
1 1 1 5 aviator wife 515 7 1 1
1 1 1 5 wife aviator 87205 4 1 1
1 1 1 5 aviator spouse 515 7 1 1
1 1 1 9 stability usually 12642 9 1 3
1 1 1 9 usually requires 60074 7 3 4
1 1 1 9 requires client 25949 8 4 1
1 1 1 9 client requires 16964 6 1 4
2 2 1 5 grimy cloth 757 5 2 1
2 2 1 5 cloth eats 8693 5 1 4
2 2 1 5 eats whitens 3494 4 4 4
2 2 1 5 whitens woman 18 7 4 1
2 2 1 5 woman penguin 162541 5 1 1
2 2 1 9 pie customer 8909 3 1 1
2 2 1 9 customer sometimes 13399 8 1 3
2 2 1 9 sometimes reimburses 96341 9 3 4
2 2 1 9 reimburses sometimes 65 10 4 3
2 2 1 9 sometimes gangster 96341 9 3 1
I have a code for ordinal regression analysis for one participant for one trial (eye-tracking data - eyeData) that looks like this:
#------------set the path and import the library-----------------
setwd("/AscTask-3/Data")
library(ordinal)
#-------------read the data----------------
read.delim(file.choose(), header=TRUE) -> eyeData
#-------------extract 1 trial from one participant---------------
ss <- subset(eyeData, sNumber == 6 & runningTrialNo == 21)
#-------------delete duplicates = refixations-----------------
ss.s <- ss[!duplicated(ss$wordTar), ]
#-------------change the raw frequencies to log freq--------------
ss.s$lFreq <- log(ss.s$Freq)
#-------------add a new column with sequential numbers as a factor ------------------
ss.s$rankF <- as.factor(seq(nrow(ss.s)))
#------------ estimate an ordered logistic regression model - fit ordered logit model----------
m <- clm(rankF~lFreq*Len, data=ss.s, link='probit')
summary(m)
#---------------get confidence intervals (CI)------------------
(ci <- confint(m))
#----------odd ratios (OR)--------------
exp(coef(m))
The eyeData file is a very large data set consisting of 91832 observations of 11 variables. In total there are 41 participants with 78 trials each. In my code I extract the data from one trial of one participant to run the analysis. However, it takes a long time to run the analysis manually for all trials of all participants. Could you, please, help me create a loop that will read in all 78 trials from all 41 participants and save the statistics output (I want to save summary(m), ci, and coef(m)) in one file?
Thank you in advance!
You could generate a unique identifier for every trial of every participant. Then you could loop over all unique values of this identifier and subset the data accordingly. Then you run the regressions and save the output as an R object:
eyeData$uniqueIdent <- paste(eyeData$sNumber, eyeData$runningTrialNo, sep = "-")
uniqueID <- unique(eyeData$uniqueIdent)
for (un in uniqueID) {
ss <- eyeData[eyeData$uniqueIdent == un,] # the column is uniqueIdent; uniqueID is the vector of its unique values
ss <- ss[!duplicated(ss$wordTar), ] #maybe do this outside the loop
ss$lFreq <- log(ss$Freq) #you could do this outside the loop too
#create DV
ss$rankF <- as.factor(seq(nrow(ss)))
m <- clm(rankF~lFreq*Len, data=ss, link='probit')
seeSumm <- summary(m)
ci <- confint(m)
oddsR <- exp(coef(m))
save(seeSumm, ci, oddsR, file = paste("toSave_", un, ".Rdata", sep = ""))
# add -un- to the output file to be able identify where it came from
}
A variation on this is to collect the output of every iteration in a single list: create an empty list "gatherRes" before the loop, then, after running the estimation and the postestimation commands, store the results of each iteration as one element of that list:
gatherRes <- vector(mode = "list", length = length(unique(eyeData$uniqueIdent))) ##before the loop
names(gatherRes) <- unique(eyeData$uniqueIdent) ##before the loop, so gatherRes[[un]] fills the matching slot
gatherRes[[un]] <- list(seeSumm, ci, oddsR) ##last line inside the loop
If you're concerned with speed, you could consider writing a function that does all this and use lapply (or mclapply).
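A rough base-R sketch of that function-plus-lapply idea, using the built-in mtcars data and lm as stand-ins for eyeData and clm (both substitutions, and the name `fit_one`, are assumptions made only so the example runs without extra packages):

```r
# Hypothetical per-subset function: fit a model and return what should be kept
fit_one <- function(d) {
  m <- lm(mpg ~ wt, data = d)
  list(n = nrow(d), coef = coef(m))
}

# split() on two keys plays the role of the participant-by-trial identifier;
# drop = TRUE discards combinations that do not occur in the data
pieces <- split(mtcars, list(mtcars$cyl, mtcars$am), drop = TRUE)

# One list element per subset, named after the key combination
results <- lapply(pieces, fit_one)

names(results)
results[[1]]$coef
```

The whole `results` list can then be written to a single file with `save()` or `saveRDS()`, and `parallel::mclapply` can be swapped in for `lapply` on systems that support forking.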
Here is a solution using the plyr package (it is more concise than a for loop, though not necessarily faster).
Since you don't provide a reproducible example, I'll use the iris data as an example.
First make a function to calculate your statistics of interest and return them as a list. For example:
# Function to return summary, confidence intervals and coefficients from lm
lm_stats = function(x){
m = lm(Sepal.Width ~ Sepal.Length, data = x)
return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
Then use the dlply function, using your variables of interest as grouping
data(iris)
library(plyr) #if not installed do install.packages("plyr")
#Using "Species" as grouping variable
results = dlply(iris, c("Species"), lm_stats)
This will return a list of lists, containing output of summary, confint and coef for each species.
For your specific case, the function could look like (not tested):
ordFit_stats = function(x){
#Remove duplicates
x = x[!duplicated(x$wordTar), ]
# Make log frequencies
x$lFreq <- log(x$Freq)
# Make ranks
x$rankF <- as.factor(seq(nrow(x)))
# Fit model
m <- clm(rankF~lFreq*Len, data=x, link='probit')
# Return list of statistics
return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
And then:
results = dlply(eyeData, c("sNumber", "TrialNo"), ordFit_stats)

Removing rows after a certain value in R

I have a data frame in R,
df <- data.frame(a=c(1,1,1,2,2,5,5,5,5,5,6,6), b=c(0,1,0,0,0,0,0,1,0,0,0,1))
Within each run of duplicated values of the variable a, I want to remove the rows where b equals 0 and which occur after a row where b equals 1.
So the output I am looking for is,
df.out <- data.frame(a=c(1,1,2,2,5,5,5,6,6), b=c(0,1,0,0,0,0,1,0,1))
Is there a way to do this in R?
This should do the trick?
ind = intersect(which(df$b==0), which(df$b==1)+1)
df.out = df[-ind,]
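which(df$b==1) returns the row indices of df where b==1. Add one to these and intersect with the indices where b==0. Note that this only catches a 0 in the row immediately after a 1 and does not take a into account, so with the example data the second 0 after the 1 in the a==5 block (row 10) is kept.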
How about
df[ ave(df$b, df$a, FUN=function(x) x>=cummax(x))==1, ]
# a b
# 1 1 0
# 2 1 1
# 4 2 0
# 5 2 0
# 6 5 0
# 7 5 0
# 8 5 1
# 11 6 0
# 12 6 1
Here we use ave to look within each level of a and we test to see if we've seen a 1 yet with cummax.
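The same idea can be unpacked into intermediate steps. This is only a restatement of the one-liner above; the names `seen_one` and `keep` are introduced for illustration:

```r
df <- data.frame(a = c(1, 1, 1, 2, 2, 5, 5, 5, 5, 5, 6, 6),
                 b = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1))

# Running maximum of b within each value of a:
# it switches to 1 at the first 1 and stays there
seen_one <- ave(df$b, df$a, FUN = cummax)

# A row is kept unless b is 0 while a 1 has already been seen in its group
keep <- df$b >= seen_one

df.out <- df[keep, ]
df.out
```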

How do I select distinct from a dataframe in R?

I have a dataframe in R. I want to see what groups are in the dataframe. If this were a SQL database, I would do Select distinct group from dataframe. Is there a way to perform a similar operation in R?
> head(orl.df)
long lat order hole piece group id
1 3710959 565672.3 1 FALSE 1 0.1 0
2 3710579 566171.1 2 FALSE 1 0.1 0
The unique() function should do the trick:
> dat <- data.frame(x=c(1,1,2),y=c(1,1,3))
> dat
x y
1 1 1
2 1 1
3 2 3
> unique(dat)
x y
1 1 1
3 2 3
Edit: For your example (didn't see the group part)
unique(orl.df$group)
I think the table() function is also a good choice.
table(orl.df$group)
It also tells you the number of items in each group.
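A small side-by-side sketch of the two suggestions, using a made-up `group` vector in place of orl.df$group:

```r
# Hypothetical group column with repeated values
group <- c(0.1, 0.1, 0.2, 0.3, 0.3, 0.3)

# Distinct values, like SELECT DISTINCT
distinct_groups <- unique(group)

# Distinct values together with how often each one occurs,
# like SELECT group, COUNT(*) ... GROUP BY group
group_counts <- table(group)

distinct_groups
group_counts
```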
