dimension of predicted results is lower than given matrix - r

I have a dataset with 17 columns and 500000 rows. I want to predict one of these columns for 250000 of the rows, so my training dataset has 250000 rows. After dividing the data into testing and training sets, I ran "gbm" and "lm" models on the training set:
modellm <- train(DARAMAD ~ ., data = trainig, method = "lm", na.action = na.pass)
modelgbm <- train(DARAMAD ~., data = trainig, method = "gbm", na.action = na.omit)
The problem is that when I predict, I only receive a vector of 9976 elements, while I am trying to predict 250000 elements.
z <- predict(modelgbm, newdata = forPredict)
z <- predict(modellm, newdata = forPredict)
forPredict and the training dataset both have 250000 rows.

your code didn't work for me, but I counted NAs as follows:
naCountFunc <- function(x) sum(is.na(x))
naCount <- sapply(trainData, naCountFunc)
as.data.frame(table(naCount))
naCount Freq
1 0 12
2 1 1
3 100 2
4 187722 1
5 188664 1
The two columns with many NAs are not the one I want to predict. The "DARAMAD" column has no NAs.
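One possible explanation, consistent with the NA counts above, is that rows with a missing predictor are dropped at prediction time, so the result is shorter than newdata. The sketch below uses base R's lm() on made-up data (train_toy and new_toy are hypothetical stand-ins, not the real trainig and forPredict) to show how the na.action argument of predict() changes the length of the result. Whether the caret models above behave the same way is an assumption worth checking with sum(complete.cases(forPredict)).

```r
# Hypothetical toy data standing in for the real training set
set.seed(1)
train_toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
train_toy$y <- 1 + 2 * train_toy$x1 - train_toy$x2 + rnorm(20, sd = 0.1)

# Hypothetical new data: rows 2 and 3 each have a missing predictor
new_toy <- data.frame(x1 = c(0.5, NA, 1.5, 2.0),
                      x2 = c(1.0, 2.0, NA, 4.0))

fit_toy <- lm(y ~ x1 + x2, data = train_toy)

# Number of rows with no missing predictor
n_complete <- sum(complete.cases(new_toy))

# na.omit drops incomplete rows, so the result is shorter than new_toy
pred_omit <- predict(fit_toy, newdata = new_toy, na.action = na.omit)

# na.pass keeps every row and returns NA where a predictor is missing
pred_pass <- predict(fit_toy, newdata = new_toy, na.action = na.pass)

length(pred_omit)
length(pred_pass)
```

If the length of the short prediction vector matches the number of complete rows in forPredict, the two high-NA columns are the likely cause, and either imputing them or leaving them out of the formula should restore the full length.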

Related

should be factors with the same levels, error and reference

I have this code (below) and need to use caret to split the dataset so that 40% of the data is in the training set and the rest is in the test set. The payment variable should be distributed equally across the split. However, the confusionMatrix line gives an error which says:
"Error: data and reference should be factors with the same levels."
EDIT: the payment variable is a binary variable, so 0 (no) or 1 (yes). gdp is just numbers.
Sample dataset: (I don't know how to make a table here yet)
payment gdp
0 838493
1 9303032
0 72738
1 38300022
1 283283
How can I fix this?
My code:
index <- createDataPartition(y = dataset$payment, p = 0.40, list = F)
trainset <- dataset[index, ]
testset <- dataset[-index, ]
payment_knn <- train(payment ~ gdp, method = "knn", data = trainset,
                     trControl = trainControl(method = 'cv', number = 5))
predicted_outcomes <- predict(payment_knn, testset)
conMX_pay <- confusionMatrix(predicted_outcomes, testset$payment)
conMX_pay
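This is purely for illustration. The error occurs because payment is numeric, so predict() returns numbers rather than a factor. Convert payment to a factor before training, and make sure the test data has the same structure as the training data.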
library(caret)
library(dplyr)
# The outcome must be a factor so that train() fits a classifier
df <- df %>%
  mutate(payment = as.factor(payment), gdp = as.numeric(gdp))
metric <- "Accuracy"
control <- trainControl(method = "cv", number = 10)
train_set <- createDataPartition(df$payment, p = 0.8, list = FALSE)
valid_me <- df[-train_set, ]
train_me <- df[train_set, ]
# Training
set.seed(233)
fit.knn <- train(payment ~ ., method = "knn", data = train_me, metric = metric, trControl = control)
# Predictions and reference are now factors with the same levels
validated <- predict(fit.knn, valid_me)
confusionMatrix(validated, valid_me$payment)
This works fine given the data in your question. You will see warnings because the data set is too small; it is purely for illustration.
Data Used:
payment gdp
1 0 838493
2 1 9303032
3 0 72738
4 1 38300022
5 1 283283
Cheers!
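For readers who hit the same error even after converting to factors, here is a minimal base R sketch of what "factors with the same levels" means. The vectors truth_num and pred_num are made up for illustration (they are not the question's data), and table() is used here only as a stand-in for the cross-tabulation that confusionMatrix() builds.

```r
# Hypothetical numeric labels and predictions (not from the question's data)
truth_num <- c(0, 1, 0, 1, 1)
pred_num  <- c(1, 1, 1, 1, 1)   # the model happened to predict only class 1

# Naive conversion: the prediction factor ends up with a single level
truth <- factor(truth_num, levels = c(0, 1))
pred_naive <- factor(pred_num)
same_before <- identical(levels(pred_naive), levels(truth))

# Reuse the reference levels so that both factors agree
pred <- factor(pred_num, levels = levels(truth))
same_after <- identical(levels(pred), levels(truth))

# Cross-tabulation of predictions against the reference
tab <- table(Predicted = pred, Actual = truth)
accuracy <- sum(diag(tab)) / sum(tab)
tab
```

Building both factors from one shared levels vector keeps the table square even when a class never appears in the predictions.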

Applying logistic regression to simple dataset

I have been trying to apply logistic regression or any other ML algorithm to this simple data set, but I have failed miserably and got many errors. I am tr
dim(data)
[1] 11580 12
head(data)
ReturnJan ReturnFeb ReturnMar ReturnApr ReturnMay ReturnJune
1 0.08067797 0.06625000 0.03294118 0.18309859 0.130333952 -0.01764234
2 -0.01067989 0.10211539 0.14549595 -0.08442804 -0.327300392 -0.35926605
3 0.04774193 0.03598972 0.03970223 -0.16235294 -0.147426982 0.04858934
4 -0.07404022 -0.04816956 0.01821862 -0.02467917 -0.006036217 -0.02530364
5 -0.03104575 -0.21267723 0.09147609 0.18933823 -0.153846154 -0.10611511
6 0.57980016 0.33225225 -0.40546095 -0.06000000 0.060732113 -0.21536106
And the 12th column, the one I am trying to predict, looks like this:
PositiveDec
0
0
0
1
1
1
Here is my attempt
new.data <- data[,-12] #Remove labels' column
index <- sample(1:nrow(new.data), size = 0.8*nrow(new.data))#Split data
train.data <- new.data[index,]
test.data <- new.data[-index,]
fit.glm <- glm(data[,12]~.,data = data, family = "binomial")
You are getting there, but have several syntactic errors and, as pointed out in comments, need to leave your outcome variable in. This should work:
# Split the full data set, keeping the outcome column PositiveDec
index <- sample(1:nrow(data), size = 0.8 * nrow(data))
train.data <- data[index, ]
test.data <- data[-index, ]
fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")
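As a rough end-to-end sketch of the same workflow, the code below fits the model on a training split and then scores the held-out rows. The data frame sim is simulated (its values and the 0.5 cutoff are assumptions for illustration, not the question's data); only the column names ReturnJan, ReturnFeb and PositiveDec are borrowed from the question.

```r
# Simulated stand-in for the question's data (values are made up)
set.seed(42)
n <- 200
sim <- data.frame(ReturnJan = rnorm(n, sd = 0.1),
                  ReturnFeb = rnorm(n, sd = 0.1))
sim$PositiveDec <- rbinom(n, 1, plogis(0.5 + 5 * sim$ReturnJan))

# Split rows, keeping the outcome column in both parts
index <- sample(seq_len(nrow(sim)), size = 0.8 * nrow(sim))
train.sim <- sim[index, ]
test.sim <- sim[-index, ]

# Fit on the training rows only, referring to the outcome by name
fit.sim <- glm(PositiveDec ~ ., data = train.sim, family = "binomial")

# Predicted probabilities for the held-out rows
probs <- predict(fit.sim, newdata = test.sim, type = "response")

# Convert probabilities to 0/1 labels with an assumed cutoff of 0.5
pred <- ifelse(probs > 0.5, 1, 0)
accuracy <- mean(pred == test.sim$PositiveDec)
accuracy
```

Referring to the outcome by name in the formula, instead of data[,12], is what lets predict() work on a different data frame later.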

R MICE impute new observations

When I use the mice package to impute data I have the following issue:
I can't seem to find a way to replace NA values of new observations, given that I already have imputed the missing data in the training set.
Example 1
I have trained an algorithm with data from data frame with 10 features and 1000 observations.
How can I predict a new observation using this algorithm (with missing data)?
Example 2
Suppose we have a data frame with NA values:
V1 V2 V3 R1
1 2 NA 1
1.4 -1 0 0
1.2 NA 0 1
1.6 NA 1 1
1.2 3 1 0
I impute the missing values using the mice package:
imp <- mice(df, m = 2, maxit = 100, meth = 'pmm', seed = 12345)
The object imp now holds 2 imputed data frames.
(dfImp1)
V1 V2 V3 R1
1 2 0.5 1
1.4 -1 0 0
1.2 1.5 0 1
1.6 1.5 1 1
1.2 3 1 0
Now with this data frame, I can train an algorithm:
modl <- glm(R1~., (dfImp1), family = binomial)
I want to predict the response of a new observation, e.g:
obs1 <- data.frame(V1 = 1, V2 = 1.4, V3 = NA)
How do I impute the missing data of a new individual observation?
It seems that the mice package does not have a built-in solution, but we can write one.
The idea is to:
(1) use the same mice algorithm to fill NA in dataset used to train GLM and the new observation(s);
(2) predict only the new observation without NA.
I'm going to use iris as a data example.
library(R6)
library(mice)
library(dplyr)  # provides %>%, filter() and bind_rows() used below
# Binary output to use Binomial
df <- iris %>% filter(Species != "virginica")
# The new observation
new_data <- tail(df, 1)
# the dataset used to train the model
df <- head(df,-1)
# Now, let's insert some NAs
insert_nas <- function(x) {
  set.seed(123)
  len <- length(x)
  n <- sample(1:floor(0.2*len), 1)
  i <- sample(1:len, n)
  x[i] <- NA
  x
}
df$Sepal.Length <- insert_nas(df$Sepal.Length)
df$Petal.Width <- insert_nas(df$Petal.Width)
new_data$Sepal.Width = NA
summary(df)
In the fit method we apply mice to fill the NAs, fit a GLM model and store it to be used in the predict method.
In the predict method we (1) add the new observation(s) to the dataset (with NAs), (2) replace the NAs again using mice, (3) get back the row(s) of the new observation(s) without NAs and then (4) apply the GLM to predict the new observation(s).
# R6 Class Generator
GLMWithMice <- R6Class("GLMWithMice", list(
  model = NULL,
  df = NULL,
  fitted = FALSE,
  initialize = function(df) {
    self$df <- df
  },
  fit = function(formula = "Species~.", family = binomial) {
    # Impute the training data, then fit the GLM on the first completed set
    imp <- mice(self$df, m = 2, maxit = 100, method = 'pmm', seed = 12345, printFlag = FALSE)
    df_cleaned <- complete(imp, 1)
    self$model <- glm(formula, df_cleaned, family = family, maxit = 100)
    self$fitted <- TRUE
    return(cat("\n model fitted!"))
  },
  predict = function(new_data, type = "response") {
    n_rows <- nrow(self$df)
    # Append the new rows so they are imputed together with the training data
    df_new <- bind_rows(self$df, new_data)
    imp <- mice(df_new, m = 2, maxit = 100, method = 'pmm', seed = 12345, printFlag = FALSE)
    df_cleaned <- complete(imp, 1)
    # Keep only the appended rows, now without NAs
    new_data_cleaned <- tail(df_cleaned, nrow(df_new) - n_rows)
    return(predict(self$model, new_data_cleaned, type = type))
  }
)
)
#Let's create a new instance of "GLMWithMice" class
model <- GLMWithMice$new(df = df)
class(model)
model$fit(formula = Species~., family = binomial)
model$predict(new_data = new_data)
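The row bookkeeping in the predict method (append, impute, take the tail) can be sketched without mice. In the code below, impute_mean is a hypothetical stand-in that fills NAs with column means; it is not how mice imputes and is used only so the example runs with base R. The toy values are the ones from Example 2 in the question.

```r
# Stand-in imputer (assumption): replaces NA with the column mean.
# This is NOT the mice algorithm, only a placeholder for it.
impute_mean <- function(d) {
  d[] <- lapply(d, function(col) {
    if (is.numeric(col)) col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  d
}

# Predictors from Example 2 in the question
train_toy <- data.frame(V1 = c(1, 1.4, 1.2, 1.6, 1.2),
                        V2 = c(2, -1, NA, NA, 3),
                        V3 = c(NA, 0, 0, 1, 1))

# New observation with a missing value
obs1 <- data.frame(V1 = 1, V2 = 1.4, V3 = NA)

# (1) append the new row, (2) impute everything together,
# (3) take back only the appended row(s)
combined <- rbind(train_toy, obs1)
completed <- impute_mean(combined)
obs1_completed <- tail(completed, nrow(obs1))
obs1_completed
```

Swapping impute_mean for a function that calls mice() and complete() gives the same structure as the class above.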

Rolling regression and prediction with lm() and predict()

I need to apply lm() to an expanding subset of my data frame dat, while making a prediction for the next observation. For example, I am doing:
fit model            predict
-----------------    ----------------
dat[1:3, ]           dat[4, ]
dat[1:4, ]           dat[5, ]
.                    .
.                    .
dat[-nrow(dat), ]    dat[nrow(dat), ]
I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
From fit, I'd need the summary(fit)$adj.r.squared;
From predict.fit I'd need predict.fit$fit value.
Thanks.
(Efficient) solution
This is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1
## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
  c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}
## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note I have done several things inside the bundle function:
I have used the subset argument to select the subset to fit;
I have used model = FALSE so the model frame is not stored, which saves memory.
Overall, there is no explicit loop, but sapply is used.
Fitting starts from p, the minimum number of data points required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as we need at least the final row for prediction.
Test
Example data (with 30 "observations")
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
v12 = runif(30, 1, 100))
Applying code above gives results (27 rows in total, truncated output for 5 rows)
adj.r2 prediction se
[1,] NaN 3.881068 NaN
[2,] 0.106592619 3.676821 0.7517040
[3,] 0.545993989 3.892931 0.2758347
[4,] 0.622612495 3.766101 0.1508270
[5,] 0.180462206 3.996344 0.2059014
The first column is the adjusted R-squared value for the fitted model, while the second column is the prediction. The first value of adj.r2 is NaN because the first model we fit has 3 coefficients for 3 data points, hence no sensible statistics are available. The same happens to se: the fit passes through all 3 points exactly, so all residuals are 0 and the standard error cannot be estimated.
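Since the question asks for a table, a variant that returns a data frame with labelled rows may be convenient. This is only a sketch on made-up data: roll_one, n_fit and row_predicted are hypothetical names, and starting at 4 rows (to skip the saturated first fit) is an assumption, not part of the answer above.

```r
# Made-up example data with the same column names as in the answer
set.seed(1)
dat <- data.frame(clicks = runif(30, 1, 100),
                  v1 = runif(30, 1, 100),
                  v12 = runif(30, 1, 100))

# Fit on rows 1..x and predict row x + 1, returning a one-row data frame
roll_one <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = seq_len(x))
  pred <- predict(fit, newdata = dat[x + 1, ], se.fit = TRUE)
  data.frame(n_fit = x,
             row_predicted = x + 1,
             adj.r2 = summary(fit)$adj.r.squared,
             prediction = unname(pred$fit),
             se = pred$se.fit)
}

# Start at 4 rows so every fit has at least one residual degree of freedom
result_df <- do.call(rbind, lapply(4:(nrow(dat) - 1), roll_one))
head(result_df)
```

Each row records how many observations were used and which row was predicted, which makes it easier to join the predictions back to dat.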
I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).
(Efficient) Solution
data <- data.frame(v1=rnorm(100),v2=rnorm(100),clicks=rnorm(100))
data1 = data[1:(nrow(data)-1), ]
data2 = data[nrow(data), ]
for(i in 3:nrow(data)){
  nam <- paste("predict", i, sep = "")
  nam1 <- paste("fit", i, sep = "")
  nam2 <- paste("summary_fit", i, sep = "")
  fit = lm(clicks ~ v1 + v2, data=data[1:i,])
  tmp <- predict(fit, newdata=data2, se.fit=TRUE)
  tmp1 <- fit
  tmp2 <- summary(fit)
  assign(nam, tmp)
  assign(nam1, tmp1)
  assign(nam2, tmp2)
}
All of the results you want will be stored in the data objects this creates.
For example:
> summary_fit10$r.squared
[1] 0.3087432
You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the 3 types of output objects like this:
rm(data,data1,data2,i,nam,nam1,nam2,fit,tmp,tmp1,tmp2)
frames <- ls()
frames.fit <- frames[1:98] #change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum <- frames[197:294]
fit.table <- data.frame(intercept=NA,v1=NA,v2=NA,sourcedf=NA)
for(i in 1:length(frames.fit)){
  tmp <- get(frames.fit[i])
  fit.table <- rbind(fit.table,c(tmp$coefficients[[1]],tmp$coefficients[[2]],tmp$coefficients[[3]],frames.fit[i]))
}
fit.table
> fit.table
intercept v1 v2 sourcedf
2 -0.0647017971121678 1.34929652763687 -0.300502017324518 fit10
3 -0.0401617893034109 -0.034750571912636 -0.0843076273486442 fit100
4 0.0132968863522573 1.31283604433593 -0.388846211083564 fit11
5 0.0315113918953643 1.31099122173898 -0.371130010135382 fit12
6 0.149582794027583 0.958692838785998 -0.299479715938493 fit13
7 0.00759688947362175 0.703525856001948 -0.297223988673322 fit14
8 0.219756240025917 0.631961979610744 -0.347851129205841 fit15
9 0.13389223748979 0.560583832333355 -0.276076134872669 fit16
10 0.147258022154645 0.581865844000838 -0.278212722024832 fit17
11 0.0592160359650468 0.469842498721747 -0.163187274356457 fit18
12 0.120640756525163 0.430051839741539 -0.201725012088506 fit19
13 0.101443924785995 0.34966728554219 -0.231560038360121 fit20
14 0.0416637001406594 0.472156988919337 -0.247684504074867 fit21
15 -0.0158319749710781 0.451944113682333 -0.171367482879835 fit22
16 -0.0337969739950376 0.423851304105399 -0.157905431162024 fit23
17 -0.109460218252207 0.32206642419212 -0.055331391802687 fit24
18 -0.100560410735971 0.335862465403716 -0.0609509815266072 fit25
19 -0.138175283219818 0.390418411384468 -0.0873106257144312 fit26
20 -0.106984355317733 0.391270279253722 -0.0560299858019556 fit27
21 -0.0740684978271464 0.385267011513678 -0.0548056844433894 fit28
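A possible alternative to assign() and get(), sketched here on the same kind of made-up data, is to keep the fitted models in a named list and build the coefficient table from it. The names sizes, fits and coef.table are hypothetical. One side effect is that the coefficients stay numeric, whereas rbind() with a vector that mixes numbers and a name coerces everything to character.

```r
# Made-up data with the same structure as in the answer above
set.seed(1)
data <- data.frame(v1 = rnorm(100), v2 = rnorm(100), clicks = rnorm(100))

# One model per expanding window, stored in a named list
sizes <- 3:nrow(data)
fits <- lapply(sizes, function(i) lm(clicks ~ v1 + v2, data = data[seq_len(i), ]))
names(fits) <- paste0("fit", sizes)

# One row of coefficients per model
coef.table <- as.data.frame(t(sapply(fits, coef)))
names(coef.table) <- c("intercept", "v1", "v2")
coef.table$sourcedf <- names(fits)
head(coef.table)
```

Because the list keeps insertion order, the rows come out as fit3, fit4, ... instead of the alphabetical fit10, fit100, fit11 order that ls() produces.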

Type parameter of the predict() function

What is the difference between type="class" and type="response" in the predict function?
For instance between:
predict(modelName, newdata=testData, type = "class")
and
predict(modelName, newdata=testData, type = "response")
type = "response" gives you the numerical result, while type = "class" gives you the label assigned to that value.
The response lets you choose your own threshold. For instance,
glm.fit = glm(Direction~., data=data, family = binomial, subset = train)
glm.probs = predict(glm.fit, test, type = "response")
In glm.probs we have some numerical values between 0 and 1. Now we can determine the threshold value, let's say 0.6. Direction has two possible outcomes, up or down.
glm.pred = rep("Down", length(glm.probs))  # one label per predicted probability
glm.pred[glm.probs > .6] = "Up"
type = "response" is used with glm models and type = "class" is used with rpart models (CART).
See:
predict.glm
predict.rpart
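The thresholding steps above can be tried on simulated data. In this sketch the data frame toy, the predictor Lag1 and the 70/30 split are assumptions made for illustration; only the Direction labels and the 0.6 cutoff come from the answer.

```r
# Simulated data (assumption): one predictor and an Up/Down outcome
set.seed(7)
n <- 100
toy <- data.frame(Lag1 = rnorm(n))
toy$Direction <- factor(ifelse(toy$Lag1 + rnorm(n) > 0, "Up", "Down"),
                        levels = c("Down", "Up"))

train <- 1:70            # row indices used for fitting
test <- toy[-train, ]    # held-out rows

glm.fit <- glm(Direction ~ Lag1, data = toy, family = binomial, subset = train)

# type = "response" returns P(Direction == "Up"), the second factor level
glm.probs <- predict(glm.fit, test, type = "response")

# Apply a custom threshold of 0.6 to turn probabilities into labels
glm.pred <- rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.6] <- "Up"

table(Predicted = glm.pred, Actual = test$Direction)
```

Raising the threshold above 0.5 makes the model more reluctant to predict "Up", which trades some missed "Up" cases for fewer false ones.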
see ?predict.lm:
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. For type = "terms" this is a matrix with a column per term and may have an attribute "constant".
> d <- data.frame(x1=1:10,x2=rep(1:5,each=2),y=1:10+rnorm(10)+rep(1:5,each=2))
> l <- lm(y~x1+x2,d)
> predict(l)
1 2 3 4 5 6 7 8 9 10
2.254772 3.811761 4.959634 6.516623 7.664497 9.221486 10.369359 11.926348 13.074222 14.631211
> predict(l,type="terms")
x1 x2
1 -7.0064511 0.8182315
2 -5.4494620 0.8182315
3 -3.8924728 0.4091157
4 -2.3354837 0.4091157
5 -0.7784946 0.0000000
6 0.7784946 0.0000000
7 2.3354837 -0.4091157
8 3.8924728 -0.4091157
9 5.4494620 -0.8182315
10 7.0064511 -0.8182315
attr(,"constant")
[1] 8.442991
i.e. predict(l) is the row sums of predict(l,type="terms") + the constant
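This identity is easy to check numerically. The sketch below rebuilds a model of the same form on freshly simulated data (the seed is an arbitrary choice, so the numbers will differ from the output above) and compares the two results.

```r
# Same model form as above, on freshly simulated data
set.seed(1)
d <- data.frame(x1 = 1:10, x2 = rep(1:5, each = 2))
d$y <- d$x1 + rnorm(10) + d$x2
l <- lm(y ~ x1 + x2, data = d)

# Per-term contributions plus the "constant" attribute
terms_pred <- predict(l, type = "terms")
rebuilt <- rowSums(terms_pred) + attr(terms_pred, "constant")

# Should agree with the ordinary predictions up to floating-point error
check <- all.equal(unname(rebuilt), unname(predict(l)))
check
```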

Resources