I am working with the lda command to analyze a 2-column, 234-row dataset (x): column X1 contains the predictor variable (metric) and column X2 the response variable (categorical, 4 categories). I would like to build a linear discriminant model using 150 observations and then use the other 84 observations for validation. After randomly partitioning the data I get x.build and x.validation with 150 and 84 observations, respectively. I run the following:
fit = lda(x.build$X2~x.build$X1, data=x.build, na.action="na.omit")
Then I run the predict command like this:
pred = predict(fit, newdata=x.validation)
From reading the command's description I thought that pred$class would contain the classification of the validation data according to the fitted model, but instead I get the classification of the 150 training observations rather than the 84 I intended to use for validation. I don't really know what is happening. Can someone please give me an example of how I should be conducting this analysis?
Thank you very much in advance.
Try this instead:
fit = lda(X2~X1, data=x.build, na.action="na.omit")
pred = predict(fit, newdata=x.validation)
If you use the formula x.build$X2~x.build$X1 when you build the model, predict expects a column named x.build$X1 in the validation data. Since there is no such column, predict falls back to returning predictions for the training data.
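For completeness, a minimal end-to-end sketch of the intended workflow (the sample() split is an assumption; x, X1, and X2 are as described in the question):

library(MASS)

# Random 150/84 partition of the 234 rows
set.seed(1)
idx <- sample(nrow(x), 150)
x.build <- x[idx, ]
x.validation <- x[-idx, ]

# Fit on the training set, using bare column names in the formula
fit <- lda(X2 ~ X1, data = x.build, na.action = "na.omit")

# Predict the 84 held-out observations; pred$class now has length 84
pred <- predict(fit, newdata = x.validation)
table(pred$class, x.validation$X2)  # cross-tabulate predicted vs. actual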
This is my first time posting on here, so I apologize if this isn't the correct format/info. I'm attempting to run a model in R with the zeroinfl function (pscl package). My data consist of insect counts and 5 variables, which are 5 different habitat types.
Zero-inflated Poisson model
library(pscl)
summary(m1 <- zeroinfl(count ~ Hab_1 + Hab_2 + Hab_3 + Hab_4 + Hab_5, data = insect_data))
I'm able to run the model when I only use 4 variables in the equation, but when I add the fifth variable it gives me this error:
Error in optim(fn = loglikfun, gr = gradfun, par = c(start$count, start$zero, :
non-finite value supplied by optim
Is there a way to run a zero inflated model using all 5 of these variables or am I missing something? Any input would be greatly appreciated, thank you!
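One common trigger for this optim error is exact collinearity among the predictors, for example five habitat columns that are proportions summing to 1, making the fifth column redundant given the intercept. That cause is an assumption here, but it is cheap to check (insect_data and the Hab_ columns are from the question):

# Rank of the habitat predictors plus an intercept column; a rank below 6
# means the five habitat variables are linearly dependent with the intercept
habs <- as.matrix(insect_data[, c("Hab_1", "Hab_2", "Hab_3", "Hab_4", "Hab_5")])
qr(cbind(1, habs))$rank

If the rank is deficient, dropping one habitat column (which is effectively what the four-variable model does) or fitting without an intercept (count ~ 0 + Hab_1 + Hab_2 + Hab_3 + Hab_4 + Hab_5) should let optim find finite values.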
First time asking in the forums; this time I couldn't find a solution in other answers.
I'm just starting to learn to use R, so I can't help but think this has a simple solution I'm failing to see.
I'm analyzing the relationship between different insect species (SP) and temperature (T), the explanatory variables, and the femur area of the resulting adult (Femur.area), the response variable.
This is my linear model:
ModeloP <- lm(Femur.area ~ T * SP, data=Datos)
That runs without error, but when I try to model the variance with gls,
modelo_varPower <- gls(Femur.area ~ T * SP,
                       weights = varPower(),
                       data = Datos)
I get the following error:
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 19
The linear model barely passes the Shapiro-Wilk test of normality; could this be the issue?
Shapiro-Wilk normality test
data: re
W = 0.98269, p-value = 0.05936
Strangely, I've run this model using another explanatory variable and had no errors. Everything I can find in the forums has to do with repeated sampling over time, and that's not my case.
Since the only difference is the response variable, I'm uploading an image of what the table looks like in case it helps.
You have some missing cells in your SP:T interaction. lm() tolerates these (if you look at coef(lm(Femur.area ~ SP * T, data = Datos)) you'll see some NA values for the missing interactions); gls() does not. One way to deal with this is to create an interaction variable and drop the missing levels, then fit the model as (effectively) a one-way rather than a two-way ANOVA. (I called the data dd rather than Datos.)
dd3 <- transform(na.omit(dd), SPT = droplevels(interaction(SP, T)))
library(nlme)
gls(Femur.area ~ SPT, weights = varPower(form = ~fitted(.)), data = dd3)
If you want the main effects, the interaction term, and the power-law variance, that's possible, but it's harder.
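To see the empty cells that make the fit singular, a simple cross-tabulation of the two factors is enough (again using dd for the data; any zero in the table is a missing SP:T combination):

with(dd, table(SP, T))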
I have a database with 136 species and 6 variables. For 4 of the variables there are data for all species; however, for the other 2 variables there are data for only 88 species. When all 6 variables are considered together, only 78 species have data for every variable.
So, I ran models using these variables.
Note that the models therefore have different species sample sizes, varying according to the data available in the database.
I need to know if AICc is a good way to compare these models.
The model.avg function in the MuMIn package returns an error when I try to run it on a list of all my models:
mods <- list(mod1, mod2, ..., mod14)
aicc <- summary(model.avg(mods))
Error in model.avg.default(mods) :
models are not all fitted to the same data
This error makes me think that it is not possible to rank models fitted to different sample sizes using AICc. I need help resolving this question!
Thanks in advance!
Basically, all information criteria (AIC among them) are based on the likelihood function of the model, and the likelihood is influenced by sample size: the log-likelihood sums over observations, so a larger sample typically gives a lower log-likelihood and hence a larger information criterion.
This means that you cannot compare models fitted to different sample sizes using AIC or any other information criterion.
That's also why your model.avg call is failing.
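If the models really have to be ranked against each other, one standard workaround (not part of the answer above) is to refit all of them on the common complete-case subset, here the 78 species with data for all six variables, so that every model is fitted to the same data. A sketch, where full_data and the formulas are placeholders for your actual database and models:

# Keep only the species with complete data for all six variables (78 here)
dat78 <- na.omit(full_data)

# Refit each candidate model on this common subset
mod1 <- lm(y ~ x1, data = dat78)       # placeholder formula
mod2 <- lm(y ~ x1 + x2, data = dat78)  # placeholder formula
# ... and so on for the remaining models ...

# All models now share the same data, so AICc ranking and averaging are valid
library(MuMIn)
mods <- list(mod1, mod2)
summary(model.avg(mods))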
I have a dataset with 283 observations of 60 variables. My outcome variable is dichotomous (Diagnosis) and can be either of two diseases. I am comparing two diseases that often show considerable overlap, and I am trying to find the features that can help differentiate them. I understand that LASSO logistic regression is well suited to this problem; however, it cannot be run on an incomplete dataset.
So I imputed my missing data with the mice package in R and found that approximately 40 imputations is appropriate for the amount of missing data I have.
Now I want to perform LASSO logistic regression on all 40 imputed datasets, and I am stuck at the part where I need to pool the results of these 40 datasets.
The with() function from mice does not work with cv.glmnet.
# Impute the database with missing values using the mice package:
library(mice)
imp <- mice(WMT1, m = 40)

# Fit a regular logistic regression on the imputed data
imp.fit <- glm.mids(Diagnosis ~ ., data = imp, family = binomial)

# Pool the results of all 40 imputed datasets:
summary(pool(imp.fit), 2)
The above seems to work fine for logistic regression using glm(), but when I try the exact same approach for LASSO regression:
# First perform cross-validation to find the optimal lambda value:
CV <- cv.glmnet(Diagnosis ~ ., data = imp,
                family = "binomial", alpha = 1, nlambda = 100)
When I try to perform the cross-validation I get this error message:
Error in as.data.frame.default(data) :
cannot coerce class ‘"mids"’ to a data.frame
Can somebody help me with this problem?
A thought: consider running the analysis on each of the 40 completed datasets separately, storing which variables are selected in each in a matrix, and then setting some threshold (e.g., keep variables selected in >50% of the datasets), as sketched below.
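A rough sketch of that idea, assuming the mids object imp and the outcome Diagnosis from the question (note that cv.glmnet takes a predictor matrix and a response vector rather than a formula and a data frame, which is also why the cv.glmnet call above fails):

library(mice)
library(glmnet)

sel <- NULL
for (i in seq_len(imp$m)) {                 # loop over the 40 completed datasets
  dat <- complete(imp, i)
  X <- model.matrix(Diagnosis ~ ., data = dat)[, -1]  # predictors, intercept dropped
  y <- dat$Diagnosis
  cv <- cv.glmnet(X, y, family = "binomial", alpha = 1)
  beta <- coef(cv, s = "lambda.min")[-1, 1]           # coefficients at the optimal lambda
  sel <- rbind(sel, beta != 0)              # TRUE where the variable was kept
}

# Fraction of imputed datasets in which each variable was selected
sel_freq <- colMeans(sel)
sort(sel_freq, decreasing = TRUE)

# Apply the threshold, e.g. keep variables selected in more than half the datasets
names(sel_freq)[sel_freq > 0.5]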
I am trying to use a random forest model on my data set, which has 4679 observations and 13 variables.
I am using the random forest model to predict whether a part will fail or not.
Of the 4679 observations in total, 66 have the target variable as NA. I want to predict whether these 66 parts will fail or not.
So, I decided to split the data: the first 4613 rows as my training data and the remaining 66 rows as my test data.
train <- Imputed_data[1:4613, ]
test <- Imputed_data[4614:4679, ]
I then used the code below for my random forest:
library(randomForest)
fit <- randomForest(claim.Qty.Accepted ~ ., data = train, na.action = na.exclude)
The training confusion matrix I received was clear.
I tried the same approach to predict my test set with the following piece of code:
#Prediction for test set
p2 <- predict(fit, test)
head(p2)
head(test$claim.Qty.Accepted)
caret::confusionMatrix(p2, test$claim.Qty.Accepted)
The confusion matrix came back with 0 for both classes, Yes and No.
I later saved the predicted values p2 as a data frame, as below; in that table I could see that all 66 entries were assigned Yes or No classes.
t2<- data.frame(p2)
I am confused: why didn't the confusion matrix show me the results of the prediction? Also, is the approach I am following the right one for predicting my test results? Any lead would be helpful, since I am new to the field.
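For what it's worth, the setup described above contains the likely explanation: the 66 test rows are exactly the rows where claim.Qty.Accepted is NA, so confusionMatrix() has no reference labels to compare the predictions against and every cell stays empty. A quick check, using the objects from the question:

# All 66 reference labels are NA, so no confusion-matrix cell can be filled
table(test$claim.Qty.Accepted, useNA = "ifany")

The predictions in p2 are still perfectly usable; they just cannot be scored against a ground truth that does not exist for those rows.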