I try to treat ordinal data as a function of time in order to analyze if there is a significant difference between two different modalities.
Here is my data set
num
Note
Semaine
Companion
mummies
mycosis
1
2
13
0
1
2
2
1
13
0
1
1
3
4
13
0
1
1
4
2
13
0
1
1
31
5
13
1
2
1
1
3
14
0
3
2
num is the number of th plant, there are 30 plant in each modality so a total of 60 plants.
Note is the grade given according to the amount of aphid found on the plant :
1 "absence"
2 "founder"
3 "founder + larvae"
4 "colony"
5 "winged colony"
Semaine is the weak in which the statments were done
Companion is the modality
0 "without companion plant"
1 "with companion plant"
mummies is the grade given according to the amount of mummies found on the plant :
1 "absence"
2 "few"
3 "a lot"
I performed a linear regression using the polr function but time is taken into account here as a fixed factor.
head(aphid, 2)
aphid$Note = factor(aphid$Note)
e = polr(Note ~ num + Semaine + Companion, data = aphid, Hess = TRUE)
n4 = data.frame(num = 30, Semaine = 14, Companion = 0)
predict(e, n4, type = "probs")
exp(cbind(OR = coef(e), confint(e)))
I want to find a way to see if there is a statistical difference between the number of aphids, mummies, mycosis and weeks of the two modalities. Is there a model that can do that ?
Related
Dfcensus is the original data frame. I am trying to use Sex, EducYears and Age to predict whether a person's Income is "<=50K" or ">50K".
There are 20,000 rows in x_train_auto (training set) and 12,561 in x_test_auto (test set).
My classification variable (training set) has 15,124 <=50k and 4876 >50k.
Here is my code:
predictions = knn(train = x_train_auto, # response
test = x_test_auto, # response
cl = Df_census$Income[in_train_census], # prediction
k = 25)
table(predictions)
#<=50K
#12561
As you can see, all 12,561 test samples were predicted to have an Income of ">=50K".
This doesn't make sense. I am not sure where I am going wrong.
P.S.: I have sex one-hot encodes as 0 for male and 1 for female. And I have scaled Educ_years and Age and added sex to the data frame. I then added the one-hot encoded sex variable back into the scaled test and train data.
identifying the problem
Your provided x_test-auto.csv data suggests that you passed logical vectors with TRUEs and FALSEs (which define the indices of training and test samples rather than the actual data) to the train and test arguments of class::knn.
the solution
Rather, use the logical vector in x_train_auto (which I believe corresponds to in_train_census in your example) to define two separate data.frames, each containing all your desired predictors. These are then the training and the test set.
p <- c("Age","EducYears","Sex")
Df_train <- Df_census[in_train_census,p]
Df_test <- Df_census[!in_train_census,p]
In the knn function, pass the training set to the train argument, and the test set to the test argument, and further pass the outcome / target variable of the training set (as a factor) to cl.
The output (see ?class::knn) will be the predicted outcome for the test set.
Here is a complete and reproducible workflow using your data.
the data
library(class)
# read data from Dropbox
x_train_auto <- read.csv("https://dropbox.com/s/6kupkp4u4qyizy7/x_test_auto.csv?dl=1", row.names = 1)
Df_census <- read.csv("https://dropbox.com/s/ccvck8ajnatmpv0/Df_census.csv?dl=1", row.names = 1, stringsAsFactors = TRUE)
table(x_train_auto) # TRUE are training, FALSE are test set
#> x_train_auto
#> FALSE TRUE
#> 12561 20000
str(Df_census) # Income as factor, Sex is binary, Age and EducYears are numeric
#> 'data.frame': 32561 obs. of 15 variables:
#> $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
#> $ Work : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#> $ Fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#> $ Education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#> $ EducYears : int 13 13 9 7 13 14 5 9 14 13 ...
#> $ MaritalStatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#> $ Occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
#> $ Relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#> $ Race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#> $ Sex : int 1 1 1 1 0 0 0 1 0 1 ...
#> $ CapitalGain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
#> $ CapitalLoss : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ HoursPerWeek : int 40 13 40 40 40 40 16 45 50 40 ...
#> $ NativeCountry: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#> $ Income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# predictors and response
p <- c("Age","EducYears","Sex")
y <- "Income"
# create data partition
in_train_census <- x_train_auto$x
Df_train <- Df_census[in_train_census,]
Df_test <- Df_census[!in_train_census,]
# check
dim(Df_train)
#> [1] 20000 15
dim(Df_test)
#> [1] 12561 15
table(Df_train$Income)
#>
#> <=50K >50K
#> 15124 4876
using class::knn
The knn (k-nearest-neighbors) algorithm can perform better or worse depending on the choice of the hyperparameter k. It's often difficult to know which k value is best for the classification of a particular dataset. In a machine learning setting, you'd want to try out different values of k to find a value that gives the highest performance on your test dataset (i.e., data which was not used for model fitting).
It's always important to strike a good balance between overfitting (model is too complex, and will give good results on the training data, but less accurate or even rubbish results on new test data) and underfitting (model is too trivial to explain the actual patterns in the data). In the case of knn, using a larger k value would probably better safeguard against overfitting, according to the explanations here.
# apply knn for various k using the given training / test set
r <- data.frame(array(NA, dim = c(0, 2), dimnames = list(NULL, c("k","accuracy"))))
for (k in 1:30) {
#cat("k =", k, "\n")
# fit model on training set, predict test set data
set.seed(60402) # to be reproducible
predictions <- knn(train = Df_train[,p],
test = Df_test[,p],
cl = Df_train[,y],
k = k)
# confusion matrix on test set
t <- table(pred = predictions, ref = Df_test[,y])
# accuracy
a <- sum(diag(t)) / sum(t)
# bind
r <- rbind(r, data.frame(k = k, accuracy = a))
}
visualize model assessment
# find best k
r[which.max(r$accuracy),]
#> k accuracy
#> 17 17 0.8007324
(k.best <- r[which.max(r$accuracy),"k"])
#> [1] 17
# plot
with(r, plot(k, accuracy, type = "l"))
abline(v = k.best, lty = 2)
Created on 2021-09-23 by the reprex package (v2.0.1)
interpretation
The loop results suggest that your optimal value of k for this particular training and test set is between 12 and 17 (see plot above), but the accuracy gain is very small compared to using k = 1 (it's at around 80% regardless of k).
additional thoughts
Given that high income is rarer compared to lower income, accuracy might not be the desired performance metric. Sensitivity might be equally or more important, and you could modify the example code to calculate and assess other performance metrics instead.
In addition to pure prediction, you might want to explore whether other variables could be informative predictors of the Income class, by adding them to the p vector and comparing the resulting accuracies.
Here, we base our conclusions on a particular realization of training and test data. Better machine learning practice would be to split your data into 2 (as here), but then repeatedly split the training set again to fit and assess many more models, using e.g. (repeated) k-fold cross validation. A good package to do this in R is e.g. caret or tidymodels.
To gain a better understanding regarding which variables are the best predictors of Income class, I would also carry out a logistic regression on various uncorrelated predictors.
I have produced a logistic regression model in R using the logistf function from the logistf package due to quasi-complete separation. I get the error message:
Error in solve.default(object$var[2:(object$df + 1), 2:(object$df + 1)]) :
system is computationally singular: reciprocal condition number = 3.39158e-17
The data is structured as shown below, though a lot of the data has been cut here. Numbers represent levels (i.e 1 = very low, 5 = very high) not count data. Variables OrdA to OrdH are ordered factors. The variable Binary is a factor.
OrdA OrdB OrdC OrdE OrdF OrdG OrdH Binary
1 3 4 1 1 2 1 1
2 3 4 5 1 3 1 1
1 3 2 5 2 4 1 0
1 1 1 1 3 1 2 0
3 2 2 2 1 1 1 0
I have read here that this can be caused by multicollinearity, but have tested this and it is not the problem.
VIFModel <- lm(Binary ~ OrdA + OrdB + OrdC + OrdD + OrdE +
OrdF + OrdG + OrdH, data = VIFdata)
vif(VIFModel)
GVIF Df GVIF^(1/(2*Df))
OrdA 6.09 3 1.35
OrdB 3.50 2 1.37
OrdC 7.09 3 1.38
OrdD 6.07 2 1.57
OrdE 5.48 4 1.23
OrdF 3.05 2 1.32
OrdG 5.41 4 1.23
OrdH 3.03 2 1.31
The post also indicates that the problem can be caused by having "more variables than observations." However, I have 8 independent variables and 82 observations.
For context each independent variable is ordinal with 5 levels, and the binary dependent variable has 30% of the observations with "successes." I'm not sure if this could be associated with the issue. How do I fix this issue?
X <- model.matrix(Binary ~ OrdA+OrdB+OrdC+OrdD+OrdE+OrdF+OrdG+OrdH,
Data3, family = "binomial"); dim(X); Matrix::rankMatrix(X)
[1] 82 24
[1] 23
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.820766e-14
Short answer: your ordinal input variables are transformed to 24 predictor variables (number of columns of the model matrix), but the rank of your model matrix is only 23, so you do indeed have multicollinearity in your predictor variables. I don't know what vif is doing ...
You can use svd(X) to help figure out which components are collinear ...
I'm analyzing data from an experiment, replicated in time, where I measured plant emergence at the soil surface. I had 3 experimental runs, represented by the term trialnum, and would like to include trialnum as a random effect.
Here is a summary of variables involved:
data.frame: 768 obs. of 9 variables:
$ trialnum : Factor w/ 2 levels "2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Flood : Factor w/ 4 levels "0","5","10","15": 2 2 2 2 2 2 1 1 1 1 ...
$ Burial : Factor w/ 4 levels "1.3","2.5","5",..: 3 3 3 3 3 3 4 4 4 4 ...
$ biotype : Factor w/ 6 levels "0","1","2","3",..: 1 2 3 4 5 6 1 2 3 4 ...
$ soil : int 0 0 0 0 0 0 0 0 0 0 ...
$ n : num 15 15 15 15 15 15 15 15 15 15 ...
Where trialnum is the experimental run, Flood, Burial, and biotype are input/independent variables, and soil is the response/dependent variable.
I previously created this model with all input variables:
glmfitALL <-glm(cbind(soil,n)~trialnum*Flood*Burial*biotype,family = binomial(logit),total)`
From this model I found that by running
anova(glmfitALL, test = "Chisq")
trialnum is significant. There were 3 experimental runs, I'm only including 2 of those in my analysis. I have been advised to incorporate trialnum as a random effect so that I do not have to report the experimental runs separately.
To do this, I created the following model:
glmerfitALL <-glmer(cbind(soil,n)~Flood*Burial*biotype + (1|trialnum),
data = total,
family = binomial(logit),
control = glmerControl(optimizer = "bobyqa"))
From this I get the following error message:
maxfun < 10 * length(par)^2 is not recommended. Unable to evaluate scaled gradientModel failed to converge: degenerate Hessian with 9 negative eigenvalues
I have tried running this model in a variety of ways including:
glmerfitALL <-glmer(cbind(soil,n)~Flood*Burial*biotype*(1|trialnum),
data = total,
family = binomial(logit),
control = glmerControl(optimizer = "bobyqa"))
as well as incorporating REML=FALSE and used optimx in place of bobyqa, but all reiterations resulted in a similar error message.
Because this is an "eigenvalue" error, does that mean there is a problem with my source file/original data?
I also found previous threads regarding the lmer4 error messages (sorry I did not save the link), and saw some comments raising issue with the lack of replicates of the random effect. Because I only have 2 replicates trialnum2 and trialnum3, am I able to even run trialnum as a random effect?
Regarding the eigenvalue, the chief recommendation for this is centring and/or scaling predictors.
Regarding the RE groups, around five are an approximate minimum.
How to include the interaction between a covariate and and time for a non-proportional hazards model?
I often find that the proportional hazards assumption for the Cox regressions doesn’t hold.
Take the following data as an example.
> head(data2)
no np_p age_dx1 race1 mr_dx er_1 pr_1 sct_1 surv_mo km_stts1
1 20 1 2 4 1 2 2 4 52 1
2 33 1 3 1 2 1 2 1 11 1
3 67 1 2 4 4 1 1 3 20 1
4 90 1 3 1 3 3 3 2 11 1
5 143 1 2 4 3 1 1 2 123 0
6 180 1 3 1 3 1 1 2 9 1
First, I fitted a Cox regression model.
> fit2 <- coxph(Surv(surv_mo, km_stts1) ~ np_p + age_dx1 + race1 + mr_dx + er_1 + pr_1 + sct_1, data = data)
Second, I assessed the proportional hazards assumption.
> check_PH2 <- cox.zph(fit2, transform = "km")
> check_PH2
rho chisq p
np_p 0.00946 0.0748 7.84e-01
age_dx1 -0.00889 0.0640 8.00e-01
race1 -0.03148 0.7827 3.76e-01
mr_dx -0.03120 0.7607 3.83e-01
er_1 -0.14741 18.5972 1.61e-05
pr_1 0.05906 2.9330 8.68e-02
sct_1 0.17651 23.8030 1.07e-06
GLOBAL NA 53.2844 3.26e-09
So, this means that the hazard function of er_1 and sct_1 were nonproportional over time (Right?).
In my opinion, I can include the interaction between these two covariates and time seperately in the model. But I don't know how to perform it using R.
Thank you.
I have to two matrices (d1, d2) that represent data from two different experiments where rows are different measurements and columns are different timepoints (3 replicates per timepoint).
I want for every measurement on each timepoint to run Anova between the two experiments and record their p-values. All measurements (rows) for my first timepoint equal 1, across the 3 replicates for both experiments. I meant to exclude this timepoint from my code, but I forgot to do so and I found out that Anova returns different results despite their constant values (Output[1,]). So that got me questioning my method. Why is this happening? Am I doing something wrong with my code?
Dummy Sample of my data and code follow below:
expA <- read.table(header = TRUE, text =
'expA.1 expA.1 expA.1 expA.2 expA.2 expA.2 expA.3 expA.3 expA.3
1 1 1 1 1.0590361 6.0840000 0.3126238 0.9512048 0.5080000 0.1923839
2 1 1 1 1.0250958 2.2046729 1.0730671 1.1411717 1.8937081 0.8498161
3 1 1 1 0.5253539 0.7999517 0.5635622 0.5484518 0.5883386 0.8645538
4 1 1 1 1.2129670 1.4956726 1.5498014 1.7048222 1.1656466 1.6938889
5 1 1 1 0.1903024 3.6452976 0.8402939 0.7885819 2.6368440 0.9033027
6 1 1 1 0.9538349 1.9105057 1.3856549 1.0811355 1.5250233 1.3113107')
expB <- read.table(header = TRUE, text =
'expB.1 expB.1 expB.1 expB.2 expB.2 expB.2 expB.3 expB.3 expB.3
1 1 1 1 0.5850559 0.5858956 0.4498688 0.7030726 0.5139633 0.4788714
2 1 1 1 0.8810560 0.9518831 1.2129251 0.9662685 1.1690522 1.1529898
3 1 1 1 0.8331984 0.7367421 0.7101967 0.9298349 1.0121873 0.8389997
4 1 1 1 0.9780624 0.7264760 1.1186710 0.7796886 0.6307443 1.0288995
5 1 1 1 0.6441718 1.2217535 0.7014686 1.1087473 1.3889726 0.7439220
6 1 1 1 1.1069592 1.3709256 1.3999807 1.3019256 1.3422111 1.5893162')
#Anova loop code
Output <- array(numeric(0), c(6, 3))
for (i in 1:nrow(expA)) {
Output[i,] <- c(unlist(summary(aov(c(t(expA[i,1:3]), t(expB[i,1:3])) ~ c(rep("expA", 3), rep("expB", 3)))))[9],
unlist(summary(aov(c(t(expA[i,4:6]), t(expB[i,4:6])) ~ c(rep("expA", 3), rep("expB", 3)))))[9],
unlist(summary(aov(c(t(expA[i,7:9]), t(expB[i,7:9])) ~ c(rep("expA", 3), rep("expB", 3)))))[9]
)
#unlist(summary(aov(c(t(expA[i,10:12])
}
#Output table of p-values
t.1 t.2 t.3
1 0.4158234414 0.34373556 0.95204240
2 0.0328117398 0.35232074 0.56545725
3 0.0006251536 0.23615864 0.07974182
4 0.6730575455 0.03679283 0.02900738
5 0.5410373240 0.54923102 0.59392961
6 0.0050755052 0.69271985 0.53786311