Test to compare proportions / paired (small) samples / 7-levels categorical variables - r

I'm working on data from a pre-post survey: the same participants have been asked the same questions at 2 different times (so the sample are not independant). I have 19 categorical variables (Likert scale with 7 levels).
For each question, I want to know if there is a significant difference between the "pre" and "post" answer. To do this, I want to compare proportions in each of the 7 categories between pre and post results.
I have two data bases (one 'pre' and one 'post') which I have merged as in the following example (I've made sure that the categorical variables have the same levels for PRE and POST):
prepost <- data.frame(ID = c(1:7),
Quest1_PRE = c('5_SomeA','1_StronglyD','3_SomeD','4_Neither','6_Agree','2_Disagree','7_StronglyA'),
Quest1_POST = c('1_StronglyD','7_StronglyA','6_Agree','7_StronglyA','3_SomeD','5_SomeA','7_StronglyA'))
I tried to perform a McNemar test:
temp <- table(prepost_S1$Quest1_PRE,prepost_S1$Quest1_POST)
mcnemar.test(temp)
> McNemar's Chi-squared test
data: temp
McNemar's chi-squared = NaN, df = 21, p-value = NA
But whatever the question, the test always return NA values. I think it is because the pivot table (temp) has very low frequencies (I only have 24 participants).
One exemple of a pivot table (I have 22 participants):
> temp
1_StronglyD 2_Disagree 3_SomeD 4_Neither 5_SomeA 6_Agree 7_StronglyA
1_StronglyD 0 0 0 0 0 1 0
2_Disagree 0 0 0 0 1 0 0
3_SomeD 0 0 0 0 0 1 1
4_Neither 0 0 1 1 2 2 2
5_SomeA 0 0 0 0 1 1 2
6_Agree 0 0 0 0 0 3 2
7_StronglyA 0 0 0 0 0 1 2
I've tried aggregating the variables' levels into 5 instead of 7 ("1_Disagree", "2_SomeD", "3_Neither", "4_SomeA", "5_Agree") but it still doesn't work.
Is there an equivalent of Fisher's exact test for paired sample? I've done research but I couldn't find anything helpful.
If not, could you think of any other test that could answer my question (= Do the answers differ significantly between the pre and post survey)?
Thanks!

Related

Regression with before and after

I have a dataset with four variables (df)
household
group
income
post
1
0
20'000
0
1
0
22'000
1
2
1
10'000
0
2
1
20'000
1
3
0
20'000
0
3
0
21'000
1
4
1
9'000
0
4
1
16'000
1
5
1
8'000
0
5
1
18'000
1
6
0
22'000
0
6
0
26'000
1
7
1
12'000
0
7
1
24'000
1
8
0
24'000
0
8
0
27'000
1
Group is a binary variable and is 1, when household got support from state. and post variable is also binary and is 1, when it is after some household got support from state.
Now I would like to run a before vs after regression that estimates the group effect by comparing post-period and before period for the supported group. I would like to put the dependent variable in logs, to have the effect in percentage, so the impact of state support on income.
I used that code, but I don't know if it is right to get the answer?
library("fixest")
feols(log(income) ~ group + post,data=df) %>% etable()
Is there another way?
If you are looking for the classic 2x2 design your code was almost correct. Change '+' with '*'. This tell us that the supported group increased the income with 7 250 more than the group which not received support.
comparing = feols(income ~ group * post,data)
comparing_log = feols(log(income) ~ group * post,data)
etable(comparing,comparing_log)
PS: The interpretation of the coefficient as percentage change is a good approximation for small numbers. The correct formula for % change is: exp(beta)-1. In this case it is exp(0.5829)-1 = 0.7912.
So the change here is 79,12%.

How do I make a selected table confined to a matrix, rather than a running list?

For my previous lines of code for making tables from column names, they successfully made short and dense matrices for me to readily process data from two questions (from survey results): (2nd example).
However, when I try using the same line of code (above), I don't get that sleek matrix. I end up getting a list of un-linked tables, which I do not want. Perhaps it's due to the new column only having 0's and 1's as numeric characters, vs. the others that have more than 2: (1st example).
[Please forgive my formatting issues (StackOverflow Status: Newbie). Also, many thanks in advance to those checking in on and answering my question!]
>table(select(data_final, `Relationship 2Affected Individual`, Satisfied_Treatments))
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, , 1 = 1, Response = 10679308122
0
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, ,
...
> table(select(data_final, `Relationship 2Affected Individual`, Indirect_Benefits))
Indirect_Benefits
Relationship 2Affected Individual 0 1 2 3
1 4 1 0 0
2 42 17 9 3
3 12 1 1 0
6 5 2 2 0
Other (please specify) 1 0 0 0
>#rstudioapi::versionInfo()
>#packageVersion("dplyr")
table(data_final$Relationship 2Affected Individual, data_final$Satisfied_Treatments)
Problem Solved^

confusion matrix of bstTree predictions, Error: 'The data must contain some levels that overlap the reference.'

I am trying to train a model using bstTree method and print out the confusion matrix. adverse_effects is my class attribute.
set.seed(1234)
splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
trainSplit <- attended_num_new_bstTree[ splitIndex,]
testSplit <- attended_num_new_bstTree[-splitIndex,]
ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])
plot.roc(auc_bstTree)
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
But I get the error 'Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) :
The data must contain some levels that overlap the reference.'
max(pred_bstTree)
[1] 1.03385
min(pred_bstTree)
[1] 1.011738
> unique(trainSplit$adverse_effects)
[1] 0 1
Levels: 0 1
How can I fix this issue?
> head(trainSplit)
type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
5 2 1 14 13 2 0 0 0 0
7 2 0 14 13 2 0 0 0 0
8 2 0 14 13 2 0 0 0 0
9 2 0 14 13 2 1 0 0 0
11 2 1 14 13 2 0 0 0 0
12 2 0 14 13 2 0 0 0 0
uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
5 5 1 1 1 22 3.000 0
7 5 0 1 1 22 4.320 0
8 5 0 1 1 22 4.752 0
9 5 0 1 1 22 5.000 0
11 5 1 1 1 22 5.000 0
12 5 0 1 1 22 5.000 0
I had similar problem, which refers to this error. I used function confusionMatrix:
confusionMatrix(actual, predicted, cutoff = 0.5)
An I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.
I checked couple of things like:
class(actual) -> numeric
class(predicted) -> integer
unique(actual) -> plenty values, since it is probability
unique(predicted) -> 2 levels: 0 and 1
I concluded, that there is problem with applying cutoff part of the function, so I did it before by:
predicted<-ifelse(predicted> 0.5,1,0)
and run the confusionMatrix function, which works now just fine:
cm<- confusionMatrix(actual, predicted)
cm$table
which generated correct outcome.
One takeaway for your case, which might improve interpretation once you make code working:
you mixed input values for your confusion matrix(as per confusionMatrix package documetation), instead of:
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
you should have written:
conf_bstTree= confusionMatrix(testSplit$adverse_effects,pred_bstTree)
As said it will most likely help you interpret confusion matrix, once you figure out way to make it work.
Hope it helps.
max(pred_bstTree) [1] 1.03385
min(pred_bstTree) [1] 1.011738
and errors tells it all. Plotting ROC is simply checking the effect of different threshold points. Based on threshold rounding happens e.g. 0.7 will be converted to 1 (TRUE class) and 0.3 will be go 0 (FALSE class); in case threshold is 0.5. Threshold values are in range of (0,1)
In your case regardless of threshold you will always get all observations into TRUE class as even minimum prediction is greater than 1. (Thats why #phiver was wondering if you are doing regression instead of classification) . Without any zero in prediction there is no level in 'prediction' which coincide with zero level in adverse_effects and hence this error.
PS: It will be difficult to tell root cause of error without you posting your data

Combining matrix of daily rows into weekly rows

I have a matrix with dates as row names and TAG#'s as column names. The matrix is populated with 0's and 1's for presence/absence.
eg
29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0
I have the following script for calculating site fidelity (% days present):
##Presence/absence data setup
##import file
read.csv('pn.csv')->'pn'
##strip out desired columns
pn[,c(5,7:9)]->pn
##create table of dates and tags
table(pn$Date,pn$Tag)->T
##convert to a matrix
as.matrix(T)->U
##convert to binary for presence/absence
1*(U>2)->U
##insert missing rows
library(micEcon)
insertRow(U,395,0)->U
rownames(U)[395]<-'2011-08-16'
insertRow(U,253,0)->U
rownames(U)[253]<-'2011-03-26'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-22'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-21'
##for presence/absence
##define i(tag or column)
1->i
##define place to store results
cbind(colnames(U),rep(NA,length(colnames(U))))->sfresult
##loop instructions
for(i in 1:ncol(U)){
##identify first detection day
grep(1,U[,i])[1]->tagrow
##count total days since first detection
nrow(U)-tagrow+1->days
##count days present
length(grep(1,U[,i]))->present
##calculate site fidelity
present/days->sfresult[i,2]
}
##change class of results column
as.numeric(sfresult[,2])->sfresult[,2]
##histogram
bins<-c(0,.3,.6,1)
xlab<-c('Low','Med','High')
hist(as.numeric(sfresult[,2]), breaks=bins,xaxt='n', col=heat.colors(3), xlab='Percent Days Present',ylab='Frequency (# of individuals)',main='Site Fidelity',freq=TRUE,labels=xlab)
axis(1,at=bins)
I'd like to calculate site fidelity on a weekly basis. I believe it would be easiest to simply collapse the matrix by combining every seven rows into a weekly matrix that simply sums the 0's and 1's from the daily matrix. Then the same script for site fidelity would calculate it on a weekly basis. Problem is I'm a newbie and I've had trouble finding an answer on how to collapse the daily matrix to a weekly matrix. Thanks for any suggestions.
Something like this should work:
x <- matrix(rbinom(1000,1,.2), nrow=50, ncol=20)
rownames(x) <- 1:50
colnames(x) <- paste0("id", 1:20)
require(data.table)
xdt <- as.data.table(x)
##assuming rows are sorted by date, that there are no missing days, and that the first row is the start of the week
###xdt[, week:=sort(rep(1:7, length.out=nrow(xdt)))] ##wrong
xdt[, week:=rep(1:ceiling(nrow(xdt)/7), each=7)] ##fixed
xdt[, lapply(.SD,sum), by="week",.SDcols=setdiff(names(xdt),"week")]
I can help you better preserve rownames if you provide a reproducible example How to make a great R reproducible example?
Edit:
Also, it's very atypical to use the right assignment -> as you do do above.
R's cut function will trim Dates to their week (see ?cut.Date for more details). After that, it's a simple call to aggregate to get the result you need. Note that cut.Date takes a start.on.monday option.
Data
sites <- read.table(text="29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0",
header=TRUE, check.names=FALSE, row.names=1)
Answer
weeks.factor <- cut(as.Date(row.names(sites)),
breaks='weeks', start.on.monday=FALSE)
aggregate(sites, by=list(weeks.factor), FUN=function(col) sum(col)/length(col))
# Group.1 29735 29736 29737 29738 29739 29740
# 1 2010-07-11 1 0.6666667 0 0 0 0
# 2 2010-07-18 1 1.0000000 0 0 0 0

T test to find differentially expressed genes in R

I have a matrix which contains the genes and the mrna.
ID_REF GSM362168 GSM362169 GSM362170 GSM362171 GSM362172 GSM362173 GSM362174
244901_at 5.171072 5.207896 5.191145 5.067809 5.010239 5.556884 4.879528
244902_at 5.296012 5.460796 5.419633 5.440318 5.234789 7.567894 6.908795
I wanted to find the differentially expressed genes from the matrix using t test and i carried out the following.
stat=mt.teststat(control,classlabel,test="t",na=.mt.naNUM,nonpara="n")
and I get the following error
Error in is.factor(classlabel) : object 'classlabel' not found.
I am not sure how I have to assign the classlabels.Is it the right way to find the differentially expressed genes.
The classlabel should be a vector of integers corresponding to observation (column) class labels. I do not understand what that is.
If you open the documentation for mt.teststat:
?mt.teststat
and scroll down to the end, you'll see an example using the "Golub data":
data(golub)
teststat <- mt.teststat(golub, golub.cl)
If you look at golub.cl,it will become clear what the classlabel vector should look like:
golub.cl
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
In this case, 0 or 1 are labels for two classes of sample. There should be as many values in the vector as you have samples, in the same order that the samples appear in the data matrix. You can also look at:
?golub
golub.cl: numeric vector indicating the tumor class, 27 acute
lymphoblastic leukemia (ALL) cases (code 0) and 11 acute
myeloid leukemia (AML) cases (code 1).
So you need to create a similar vector, with labels (0, 1, ...) for however many classes you have for your own data.

Resources