Error: "data and reference should be factors with the same levels" - r

I have the code below and need to use caret to split the data set so that 40% of the data goes into the training set and the rest into the test set, with the payment variable distributed equally across the split. However, the confusionMatrix line gives this error:
"Error: data and reference should be factors with the same levels."
EDIT: the payment variable is binary: 0 (no) or 1 (yes). gdp is just a number.
Sample dataset (I don't know how to make a table here yet):
payment gdp
0 838493
1 9303032
0 72738
1 38300022
1 283283
How can I fix this?
My code:
`index <- createDataPartition(y = dataset$payment, p = 0.40, list = F)
trainset <- dataset[index, ]
testset <- dataset[-index, ]
payment_knn <- train(payment ~ gdp, method = "knn", data = trainset,
trControl = trainControl(method = 'cv', number = 5))
predicted_outcomes <- predict(payment_knn, testset)
conMX_pay <- confusionMatrix(predicted_outcomes, testset$payment)
conMX_pay `

This is purely for illustration purposes. Make sure the outcome is a factor, with the same levels in the test data as in the training data.
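library(caret) # createDataPartition(), train(), confusionMatrix()
library(dplyr) # %>% and mutate()
# The outcome must be a factor so that caret fits a classifier
df <- df %>%
mutate(payment = as.factor(payment), gdp = as.numeric(gdp))
metric <- "Accuracy"
control <- trainControl(method = "cv", number = 10)
# Set the seed before any random step so the split is reproducible too
set.seed(233)
train_set <- createDataPartition(df$payment, p = 0.8, list = FALSE)
valid_me <- df[-train_set, ]
train_me <- df[train_set, ]
# Training
fit.knn <- train(payment ~ ., method = "knn", data = train_me, metric = metric, trControl = control)
# Predictions come back as a factor with the same levels as valid_me$payment
validated <- predict(fit.knn, valid_me)
confusionMatrix(validated, valid_me$payment)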
This works fine given the data in your question. It produces warnings because the data set is too small, so treat it purely as an illustration.
Data Used:
payment gdp
1 0 838493
2 1 9303032
3 0 72738
4 1 38300022
5 1 283283
Cheers!
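A minimal base R sketch of why the error appears, assuming the cause is that payment was left numeric: with a numeric outcome the model is fit as a regression, so the predictions are numbers rather than a factor, and confusionMatrix() refuses to compare them. The vectors below are made up for illustration; only the 0/1 coding comes from the question.

```r
# Toy values standing in for testset$payment (hypothetical data)
payment <- c(0, 1, 0, 1, 1)

# Reference as a factor with explicit levels
reference <- factor(payment, levels = c(0, 1))

# Hypothetical predictions; forcing the same levels keeps both classes
# present even if one of them is never predicted
predicted <- factor(c(1, 1, 1, 1, 1), levels = levels(reference))

# Both are factors with identical levels, which is what confusionMatrix() needs
same_levels <- identical(levels(predicted), levels(reference))
tab <- table(predicted, reference)
print(tab)
```

Converting payment to a factor before splitting, as in the answer above, avoids having to patch the levels afterwards.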

Related

GAM Error: Error in model.frame.default- variable lengths differ (found for 'day_num')

I am trying to generate predictions for covid cases using a GAM model. The following code works and produces a projection of US cases.
US_data = covid_cases %>% select(United.States, day_num)
head(US_data)
United.States day_num
1 0 1
2 0 2
3 0 3
4 0 4
5 0 5
6 0 6
end_date = nrow(US_data)+28
new_data = data.frame(seq(1:end_date))
colnames(new_data) = "day_num"
US_gam <- gam(United.States~s(day_num,k=45), data=US_data)
#generate predictions
US_predictions = data.frame(predict(US_gam, new_data))
US_predictions$day_num <- as.numeric(new_data$day_num)
names(US_predictions)[1] <- "United.States"
I want to apply the same code to any country I choose and therefore thought a simple function would be easiest. This function basically takes all the code above and wraps it up in a function.
get_df <- function(df,location, day_num){
data = df %>% select(location, day_num)
return(data)
}
projection <- function(df,location,day_num){
data = get_df(df,location, day_num)
end_date = nrow(data)+28
new_data = as.data.frame(seq(1:end_date))
colnames(new_data) = "day_num"
country_gam <- gam(location~ s(day_num,k=45), data=data)
country_predictions = data.frame(predict(country_gam, new_data))
country_predictions$day_num <- as.numeric(new_data$day_num)
names(country_predictions)[1] <- location
return (country_predictions)
}
However, it initially failed at the data subsetting line, so I moved that step into a helper function, get_df. Now it fails at the gam analysis:
US_data <- projection(covid_cases, "United.States", "day_num")
Error in model.frame.default(formula = location ~ 1 + day_num, data = data, :
variable lengths differ (found for 'day_num')
6. model.frame.default(formula = location ~ 1 + day_num, data = data,
drop.unused.levels = TRUE)
5. stats::model.frame(formula = location ~ 1 + day_num, data = data,
drop.unused.levels = TRUE)
4. eval(mf, parent.frame())
3. eval(mf, parent.frame())
2. gam(location ~ s(day_num, k = 45), data = data)
1. projection(covid_cases, "United.States")
The error seems to suggest that data$day_num and data$location have different lengths, but I can't work out why it would say that, because they are the same length.
I've read the various responses on Stack Overflow and can't find any answers, and hunting around the internet hasn't turned up anything either. I'd greatly appreciate any help!
To get the covid data for a fully reproducible example:
covid_cases <- read.csv(url("https://covid.ourworldindata.org/data/ecdc/total_cases.csv"))
covid_cases[is.na(covid_cases)] = 0
covid_cases$day_num = as.numeric(covid_cases[,1])
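One likely reading of the error, sketched here under assumptions rather than as a confirmed fix: inside projection(), the formula location ~ s(day_num, k = 45) refers to a variable literally named location. There is no such column in data, so R finds the function argument instead, which is a character string of length 1, while day_num has one value per row. Building the formula from the string avoids this. The toy data frame and the helper make_formula() are hypothetical, and lm() is used only as a stand-in so that the example runs in base R.

```r
# Toy data standing in for covid_cases (hypothetical values)
toy <- data.frame(day_num = 1:10, United.States = (1:10)^2)

location <- "United.States"
length(location)  # 1, which is why the variable lengths "differ"

# Hypothetical helper: build `<location> ~ <rhs>` from strings
make_formula <- function(location, rhs = "day_num") {
  reformulate(rhs, response = location)
}

f <- make_formula(location)
fit <- lm(f, data = toy)  # stand-in for gam()

# The same idea with the smooth term from the question; this only
# constructs the formula and could then be passed to mgcv::gam()
f_gam <- as.formula(paste(location, "~ s(day_num, k = 45)"))
print(f_gam)
```

Inside projection(), the gam call would then take f_gam in place of the hard-coded formula.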

dimension of predicted results is lower than given matrix

I have a dataset of 17 columns and 500000 rows. I want to predict one of these columns for 250000 of the rows, so my training dataset has 250000 rows. After dividing the data into testing and training sets, I ran "gbm" and "lm" models on the training set:
modellm <- train(DARAMAD ~ ., data = trainig, method = "lm", na.action = na.pass)
modelgbm <- train(DARAMAD ~., data = trainig, method = "gbm", na.action = na.omit)
The problem is that when I predict, I receive a vector of only 9976 elements, while I am trying to predict 250000 elements.
z <- predict(modelgbm, newdata = forPredict)
z <- predict(modellm, newdata = forPredict)
The forPredict and training datasets both have 250000 rows.
Your code didn't work for me, but I counted NAs as follows:
naCountFunc <- function(x) sum(is.na(x))
naCount <- sapply(trainData, naCountFunc)
as.data.frame(table(naCount))
naCount Freq
1 0 12
2 1 1
3 100 2
4 187722 1
5 188664 1
The two columns with high NA counts are not the one I want to predict. The "daramad" column has no NAs.
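One possible explanation, offered as a sketch rather than a diagnosis: if rows containing any NA are dropped before prediction (as far as I know, caret's predict method for train objects takes an na.action argument that defaults to na.omit), then only the complete rows get a prediction. Counting complete rows in the prediction data would confirm or rule this out. The toy data frame below is made up; in the question the check would be run on forPredict.

```r
# Toy prediction data with NAs in the predictors (hypothetical values)
toy_new <- data.frame(
  x1 = c(1, 2, NA, 4, 5),
  x2 = c(NA, 1, 1, 1, NA)
)

# Rows with no NA in any column
n_complete <- sum(complete.cases(toy_new))

# na.omit() keeps exactly those rows
n_after_omit <- nrow(na.omit(toy_new))

cat("rows:", nrow(toy_new), "complete rows:", n_complete, "\n")
```

If sum(complete.cases(forPredict)) comes out at 9976, the missing values in the predictor columns would account for the short prediction vector.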

I want to use AUPRC as the performance measure, in a GBM run using caret package. How can I use a customized metric such as auprc?

I am trying to use AUPRC as my custom metric for a gbm model fit because I have imbalanced classes. However, when I try to incorporate the custom metric, I get the error shown after the code. I am not sure what I am doing wrong.
Also, auprcSummary() works on its own when I run it inline. It only gives me an error when I try to incorporate it in train().
library(dplyr) # for data manipulation
library(caret) # for model-building
library(pROC) # for AUC calculations
library(PRROC) # for Precision-Recall curve calculations
auprcSummary <- function(data, lev = NULL, model = NULL){
index_class2 <- data$Class == "Class2"
index_class1 <- data$Class == "Class1"
the_curve <- pr.curve(data$Class[index_class2],
data$Class[index_class1],
curve = FALSE)
out <- the_curve$auc.integral
names(out) <- "AUPRC"
out
}
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
summaryFunction = auprcSummary,
classProbs = TRUE)
set.seed(5627)
orig_fit <- train(Class ~ .,
data = toanalyze.train,
method = "gbm",
verbose = FALSE,
metric = "AUPRC",
trControl = ctrl)
This is the error I am getting:
Error in order(scores.class0) : argument 1 is not a vector
Is it because pr.curve() takes only numeric vectors as inputs (scores/probabilities?)
caret has a built-in function called prSummary that computes that for you. You don't have to write your own.
I think this approach yields an appropriate custom summary function:
library(caret)
library(pROC)
library(PRROC)
library(mlbench) #for the data set
data(Ionosphere)
In the pr.curve function, the classification scores may be provided in one of two ways: separately for the data points of each class, i.e., as scores.class0 for the positive/foreground class and as scores.class1 for the negative/background class; or for all data points as scores.class0, with the labels provided as numerical values (1 for the positive class, 0 for the negative class) in weights.class0. (I copied this from the function's help page; I apologize if it is unclear.)
I opted for the latter: probabilities for all data points in scores.class0 and class assignments in weights.class0.
caret states that if the classProbs argument of the trainControl object is set to TRUE, additional columns containing the class probabilities will be present in data. So for the Ionosphere data, columns good and bad should be present:
levels(Ionosphere$Class)
#output
[1] "bad" "good"
to convert to 0/1 labeling one can just do:
as.numeric(Ionosphere$Class) - 1
good will become 1
bad will become 0
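A small base R check of that conversion, using a made-up factor with the same levels as Ionosphere$Class. Note that subtracting 1 from the integer codes relies on the positive class being the second level; comparing against the class name is a variant that does not depend on level order (that variant is my suggestion, not part of the original answer).

```r
# Toy factor with the same level order as Ionosphere$Class
obs <- factor(c("good", "bad", "good", "bad"), levels = c("bad", "good"))

# Conversion used above: integer codes 1/2 shifted to 0/1
labels_by_order <- as.numeric(obs) - 1

# Order-independent variant: 1 where the observation is the positive class
labels_by_name <- as.numeric(obs == "good")

print(labels_by_order)
```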
now we have all the data for the custom function
auprcSummary <- function(data, lev = NULL, model = NULL){
prob_good <- data$good #take the probability of good class
the_curve <- pr.curve(scores.class0 = prob_good,
weights.class0 = as.numeric(data$obs)-1, #provide the class labels as 0/1
curve = FALSE)
out <- the_curve$auc.integral
names(out) <- "AUPRC"
out
}
Instead of using data$good, which works for this data set only, one can extract the class levels and use them to select the desired column:
lvls <- levels(data$obs)
prob_good <- data[,lvls[2]]
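To illustrate the lookup, here is a toy data frame shaped like the data argument that a summary function receives when classProbs = TRUE: an obs column plus one probability column per class. The values are made up, and the name toy_data is hypothetical.

```r
# Hypothetical stand-in for the `data` argument of a summary function
toy_data <- data.frame(
  obs  = factor(c("good", "bad", "good"), levels = c("bad", "good")),
  bad  = c(0.2, 0.7, 0.4),
  good = c(0.8, 0.3, 0.6)
)

# The second level is treated as the positive class
lvls <- levels(toy_data$obs)
prob_pos <- toy_data[, lvls[2]]

print(prob_pos)
```

The same two lines work unchanged for any two-class outcome, whatever the classes are called.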
It is important to note that each time you change the summary function, you need to re-create the trainControl object so that it picks up the new version.
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
summaryFunction = auprcSummary,
classProbs = TRUE)
orig_fit <- train(y = Ionosphere$Class, x = Ionosphere[,c(1,3:34)], #omit column 2 to avoid a bunch of warnings related to the data set
method = "gbm",
verbose = FALSE,
metric = "AUPRC",
trControl = ctrl)
orig_fit$results
#output
shrinkage interaction.depth n.minobsinnode n.trees AUPRC AUPRCSD
1 0.1 1 10 50 0.9722775 0.03524882
4 0.1 2 10 50 0.9758017 0.03143379
7 0.1 3 10 50 0.9739880 0.03316923
2 0.1 1 10 100 0.9786706 0.02502183
5 0.1 2 10 100 0.9817447 0.02276883
8 0.1 3 10 100 0.9772322 0.03301064
3 0.1 1 10 150 0.9809693 0.02078601
6 0.1 2 10 150 0.9824430 0.02284361
9 0.1 3 10 150 0.9818318 0.02287886
Seems reasonable

Applying logistic regression to simple dataset

I have been trying to apply logistic regression, or any other ML algorithm, to this simple data set, but I have failed miserably and got many errors. I am tr
dim(data)
[1] 11580 12
head(data)
ReturnJan ReturnFeb ReturnMar ReturnApr ReturnMay ReturnJune
1 0.08067797 0.06625000 0.03294118 0.18309859 0.130333952 -0.01764234
2 -0.01067989 0.10211539 0.14549595 -0.08442804 -0.327300392 -0.35926605
3 0.04774193 0.03598972 0.03970223 -0.16235294 -0.147426982 0.04858934
4 -0.07404022 -0.04816956 0.01821862 -0.02467917 -0.006036217 -0.02530364
5 -0.03104575 -0.21267723 0.09147609 0.18933823 -0.153846154 -0.10611511
6 0.57980016 0.33225225 -0.40546095 -0.06000000 0.060732113 -0.21536106
And the 12th column, the one I am trying to predict, looks like this:
PositiveDec
0
0
0
1
1
1
Here is my attempt
new.data <- data[,-12] #Remove labels' column
index <- sample(1:nrow(new.data), size = 0.8*nrow(new.data))#Split data
train.data <- new.data[index,]
test.data <- new.data[-index,]
fit.glm <- glm(data[,12]~.,data = data, family = "binomial")
You are getting there, but have several syntactic errors and, as pointed out in comments, need to leave your outcome variable in. This should work:
index <- sample(1:nrow(data), size = 0.8 * nrow(data))
train.data <- data[index, ]
fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")
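As a follow-up sketch, the fitted model can then be checked on the held-out rows. The example below uses simulated data with two of the column names from the question, since the real data set is not available; the 0.5 cut-off is an arbitrary assumption.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical values)
toy <- data.frame(ReturnJan = rnorm(200), ReturnFeb = rnorm(200))
toy$PositiveDec <- rbinom(200, 1, plogis(0.8 * toy$ReturnJan))

# 80/20 split, keeping the outcome column in both parts
index <- sample(seq_len(nrow(toy)), size = 0.8 * nrow(toy))
train.data <- toy[index, ]
test.data <- toy[-index, ]

fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")

# Predicted probabilities on the test set, then a 0/1 class at 0.5
prob <- predict(fit.glm, newdata = test.data, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

accuracy <- mean(pred == test.data$PositiveDec)
print(accuracy)
```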

MCMCglmm multinomial model in R

I'm trying to create a model using the MCMCglmm package in R.
The data are structured as follows, where dyad, focal, other are all random effects, predict1-2 are predictor variables, and response 1-5 are outcome variables that capture # of observed behaviors of different subtypes:
dyad focal other r present village resp1 resp2 resp3 resp4 resp5
1 10101 14302 0.5 3 1 0 0 4 0 5
2 10405 11301 0.0 5 0 0 0 1 0 1
…
So a model with only one outcome (teaching) is as follows:
prior_overdisp_i <- list(R=list(V=diag(2),nu=0.08,fix=2),
G=list(G1=list(V=1,nu=0.08), G2=list(V=1,nu=0.08), G3=list(V=1,nu=0.08), G4=list(V=1,nu=0.08)))
m1 <- MCMCglmm(teaching ~ trait-1 + at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units, family = "zipoisson", prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
Hadfield's course notes (Ch 5) give an example of a multinomial model that uses only a single outcome variable with 3 levels (sheep horns of 3 types). Similar treatment can be found here: http://hlplab.wordpress.com/2009/05/07/multinomial-random-effects-models-in-r/ This is not quite right for what I'm doing, but contains helpful background info.
Another reference (Hadfield 2010) gives an example of a multi-response MCMCglmm that follows the same format but uses cbind() to predict a vector of responses, rather than a single outcome. The same model with multiple responses would look like this:
m1 <- MCMCglmm(cbind(resp1, resp2, resp3, resp4, resp5) ~ trait-1 +
at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units,
family = cbind("zipoisson","zipoisson","zipoisson","zipoisson","zipoisson"),
prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
I have two programming questions here:
How do I specify a prior for this model? I've looked at the materials mentioned in this post but just can't figure it out.
I've run a similar version with only two response variables, but I only get one slope, where I thought I should get a different slope for each resp variable. Where am I going wrong, or have I misunderstood the model?
Answer to my first question, based on the HLP post and some help from a colleague/stats consultant:
# values for prior
k <- 5 # originally: length(levels(dative$SemanticClass)), so k = # of outcomes for SemanticClass aka categorical outcomes
I <- diag(k-1) #should make matrix of 0's with diagonal of 1's, dimensions k-1 rows and k-1 columns
J <- matrix(rep(1, (k-1)^2), c(k-1, k-1)) # should make k-1 x k-1 matrix of 1's
And for my model, using the multinomial5 family and 5 outcome variables, the prior is:
prior = list(
R = list(fix=1, V=0.5 * (I + J), n = 4),
G = list(
G1 = list(V = diag(4), n = 4)))
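For reference, the residual covariance matrix 0.5 * (I + J) used in that prior can be computed and inspected in base R. This is only a worked check of the arithmetic for k = 5, with J built through named nrow/ncol arguments; it does not call MCMCglmm.

```r
k <- 5                                      # number of outcome categories
I <- diag(k - 1)                            # 4 x 4 identity matrix
J <- matrix(1, nrow = k - 1, ncol = k - 1)  # 4 x 4 matrix of ones

# 1 on the diagonal, 0.5 everywhere else
V <- 0.5 * (I + J)
print(V)
```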
For my second question, I need to add an interaction term to the fixed effects in this model:
m <- MCMCglmm(cbind(Resp1, Resp2...) ~ -1 + trait*predictorvariable,
...
The result gives both main effects for the Response variables and posterior estimates for the Response/Predictor interaction (the effect of the predictor variable on each response variable).
