When I use the mice package to impute data I have the following issue:
I can't seem to find a way to replace NA values of new observations, given that I already have imputed the missing data in the training set.
Example 1
I have trained an algorithm with data from a data frame with 10 features and 1000 observations.
How can I predict a new observation using this algorithm (with missing data)?
Example 2
Suppose we have a data frame with NA values:
V1 V2 V3 R1
1 2 NA 1
1.4 -1 0 0
1.2 NA 0 1
1.6 NA 1 1
1.2 3 1 0
I impute the missing values using the mice package:
imp <- mice(df, m = 2, maxit = 100, meth = 'pmm', seed = 12345)
The call returns a mids object, imp, which contains 2 completed data frames with imputed values.
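Each completed data set is extracted with mice::complete(); presumably the dfImp1 shown below was obtained like this:
dfImp1 <- complete(imp, 1)  # the first of the m = 2 completed data sets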
(dfImp1)
V1 V2 V3 R1
1 2 0.5 1
1.4 -1 0 0
1.2 1.5 0 1
1.6 1.5 1 1
1.2 3 1 0
Now with this data frame, I can train an algorithm:
modl <- glm(R1 ~ ., data = dfImp1, family = binomial)
I want to predict the response of a new observation, e.g:
obs1 <- data.frame(V1 = 1, V2 = 1.4, V3 = NA)
How do I impute the missing data of a new individual observation?
It seems that the mice package does not have a built-in solution for this, but we can write one.
The idea is to:
(1) use the same mice algorithm to fill the NAs in the data set used to train the GLM together with the new observation(s);
(2) predict only the new observation(s), which no longer contain NAs.
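Before wrapping this into a class, here is a minimal sketch of the same idea applied directly to the question's df / obs1 / modl objects (with so few rows mice may warn, but the mechanics are the same; the value imputed for R1 is never used):
obs1$R1 <- NA                                 # the response of the new observation is unknown
df_new <- rbind(df, obs1)                     # append it to the (un-imputed) training data
imp_new <- mice(df_new, m = 2, maxit = 100, meth = 'pmm', seed = 12345, print = FALSE)
obs1_filled <- tail(complete(imp_new, 1), 1)  # the new observation, now without NA in V3
predict(modl, newdata = obs1_filled, type = "response")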
I'm going to use iris as a data example.
library(R6)
library(mice)
library(dplyr)  # needed for %>%, filter() and bind_rows() below
# Binary output to use Binomial
df <- iris %>% filter(Species != "virginica")
# The new observation
new_data <- tail(df, 1)
# the dataset used to train the model
df <- head(df,-1)
# Now, let's insert some NAs
insert_nas <- function(x) {
  set.seed(123)
  len <- length(x)
  n <- sample(1:floor(0.2 * len), 1)
  i <- sample(1:len, n)
  x[i] <- NA
  x
}
df$Sepal.Length <- insert_nas(df$Sepal.Length)
df$Petal.Width <- insert_nas(df$Petal.Width)
new_data$Sepal.Width = NA
summary(df)
In the fit method we apply mice to fill the NAs, fit a GLM model, and store it to be used in the predict method.
In the predict method we (1) append the new observation(s) (with NAs) to the data set, (2) replace the NAs again using mice, (3) get back the row(s) of the new observation(s), now without NAs, and (4) apply the GLM to predict these new observation(s).
# R6 Class Generator
GLMWithMice <- R6Class("GLMWithMice", list(
  model = NULL,
  df = NULL,
  fitted = FALSE,
  initialize = function(df) {
    self$df <- df
  },
  fit = function(formula = "Species~.", family = binomial) {
    imp <- mice(self$df, m = 2, maxit = 100, meth = 'pmm', seed = 12345, print = FALSE)
    df_cleaned <- complete(imp, 1)
    self$model <- glm(formula, data = df_cleaned, family = family, maxit = 100)
    self$fitted <- TRUE
    return(cat("\n model fitted!"))
  },
  predict = function(new_data, type = "response") {
    n_rows <- nrow(self$df)
    df_new <- bind_rows(self$df, new_data)
    imp <- mice(df_new, m = 2, maxit = 100, meth = 'pmm', seed = 12345, print = FALSE)
    df_cleaned <- complete(imp, 1)
    new_data_cleaned <- tail(df_cleaned, nrow(df_new) - n_rows)
    return(predict(self$model, new_data_cleaned, type = type))
  }
))
#Let's create a new instance of "GLMWithMice" class
model <- GLMWithMice$new(df = df)
class(model)
model$fit(formula = Species~., family = binomial)
model$predict(new_data = new_data)
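model$predict() returns response probabilities. As a small follow-up sketch (not part of the original answer), they can be mapped to class labels; the 0.5 cut-off and the level names are assumptions based on the filtered iris data:
prob <- model$predict(new_data = new_data)
pred_class <- ifelse(prob > 0.5, "versicolor", "setosa")  # "setosa" is the reference level here
pred_class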
I am using xgboost similarly to the following example, where I "recode" a numeric value to a (numeric) value in 0, 1, 2 indicating class labels. Note that I did not convert this to a factor variable.
Then I use xgboost to fit a model and produce predictions.
library(xgboost)
iris$Species <- runif(nrow(iris))
recode <- function(x) {
  if (x >= 0 & x <= 0.33) {
    x <- 0
  } else if (x > 0.33 & x <= 0.66) {
    x <- 1
  } else if (x > 0.66) {
    x <- 2
  }
  x  # return the recoded value explicitly
}
train <- xgb.DMatrix(data = as.matrix(iris[, -5]),
                     label = sapply(iris$Species, FUN = recode))
bst <- xgboost(data = train,
               max_depth = 4, eta = 0.5, nrounds = 10,
               objective = "multi:softprob",
               num_class = 3)
pred <- predict(bst, as.matrix(iris[, -5]), reshape = TRUE)
str(pred)
Is there a way to obtain the column labels of the matrix of predictions? Or can I be sure that they are ordered according to the numeric values to which I recoded the input?
The columns follow the same order as your labels, so they are 0, 1, and 2. To be sure, you can compute a confusion matrix to check whether you are predicting correctly:
library(xgboost)
set.seed(100)
iris$Species <- runif(nrow(iris))
train <- xgb.DMatrix(data = as.matrix(iris[, -5]),
                     label = sapply(iris$Species, FUN = recode))
bst <- xgboost(data = train,
               max_depth = 4, eta = 0.5, nrounds = 10,
               objective = "multi:softprob",
               num_class = 3)
pred <- predict(bst, as.matrix(iris[, -5]), reshape = TRUE)
# which.max tells you which column is most probable
# we convert them back to 0-2, assuming column 1 corresponds to 0
predicted = apply(pred,1,which.max)-1
actual = sapply(iris$Species,recode)
table(predicted,actual)
The results are:
actual
predicted 0 1 2
0 36 2 2
1 4 48 4
2 6 3 45
So for most observations the predicted class (0, 1, or 2) matches the actual class, i.e. the most probable column corresponds to the correct label.
Or if you use caret:
caret::confusionMatrix(factor(predicted, levels = 0:2), factor(actual, levels = 0:2))
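If you only need the predicted class rather than per-class probabilities, an alternative sketch (not part of the original answer) is to train with objective = "multi:softmax", in which case predict() returns the class indices 0, 1 and 2 directly:
bst2 <- xgboost(data = train,
                max_depth = 4, eta = 0.5, nrounds = 10,
                objective = "multi:softmax",
                num_class = 3)
pred_class <- predict(bst2, as.matrix(iris[, -5]))  # already 0, 1 or 2
table(pred_class, actual)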
I was trying to use an autoencoder for anomaly detection. I used the H2O R package to generate the reconstruction MSE for sample data using the h2o.anomaly function. However, I have also tried to calculate it manually according to the MSE formula from the documentation link below:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#mse-mean-squared-error
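For reference, the quantity reported per row is the usual mean squared error over the $p = 3$ features,
$$\mathrm{MSE}_i = \frac{1}{p}\sum_{j=1}^{p}\left(x_{ij} - \hat{x}_{ij}\right)^2,$$
where $x_{ij}$ is the input value of feature $j$ in row $i$ and $\hat{x}_{ij}$ is its reconstruction.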
The training data that I used to build the model consists of three features and 5 rows:
head(train_dat)
Feature1 Feature2 Feature3
1 68.18 0.1806535 3.871201
2 71.51 0.3987761 2.484907
3 67.77 0.4285304 3.332205
4 69.58 0.1823216 2.890372
5 70.98 0.4134333 1.791759
The test data that I used for prediction consists of three features and 5 rows:
head(test_dat)
Feature1 Feature2 Feature3
1 68.33000 0.4350239 2.708050
2 73.98000 0.5550339 3.044522
3 67.11000 0.7323679 2.639057
4 69.90395 0.9999787 4.499810
5 71.28867 0.4882539 3.091042
After running training and prediction, the reconstructed features are as below:
head(mod.out)
reconstr_Feature1 reconstr_Feature2 reconstr_Feature3
1 69.66297 0.4239244 2.346250
2 69.88329 0.3963843 2.381598
3 69.46544 0.4610502 2.233164
4 68.96117 0.4229165 2.676295
5 69.63208 0.3895452 2.530025
When I used the h2o.anomaly function for MSE calculation, I received MSE output as below:
head(mse.list)
Reconstruction.MSE
1 0.05310159
2 0.57037600
3 0.54427385
4 2.08407248
5 0.14251951
However, when I tried to calculate the MSE by applying the function below, I obtained different MSE output:
mod.anon.validate <- apply((test_dat - mod.out)^2, 1, mean)
mse.list.validate <- as.data.frame(mod.anon.validate)
head(mse.list.validate)
mod.anon.validate
1 0.6359438
2 5.7492281
3 1.9288268
4 1.5156829
5 1.0229217
What have I done wrong in my manual MSE calculation? Is the "reconstruction MSE" different from the general MSE? The full R script is below:
### H2O Autoencoder test run ###
#Load test and training data.
test_dat <- read.table("sample.test.dat", header=TRUE)
train_dat <- read.table("sample.train.dat", header=TRUE)
#Start H2O
library(h2o)
localH2O <- h2o.init(port =54321)
#Training and deep learning
feature_names <- names(train_dat[1:3])
unmod.hex <- as.h2o(train_dat, destination_frame = "train.hex")
mod.hex <- as.h2o(test_dat, destination_frame = "test.hex")
unmod.dl <- h2o.deeplearning(x = feature_names,
                             training_frame = unmod.hex,
                             autoencoder = TRUE,
                             reproducible = TRUE,
                             hidden = c(3, 2, 3), epochs = 50,
                             activation = "Tanh")
#Output result
mod.out <- as.data.frame(h2o.predict(unmod.dl,mod.hex,type=response))
mod.anon <- h2o.anomaly(unmod.dl, mod.hex, per_feature=FALSE)
mse.list <- as.data.frame(mod.anon)
mod.anon.validate <- apply((test_dat - mod.out)^2, 1, mean)
mse.list.validate <- as.data.frame(mod.anon.validate)
Thanks for your help.
The calculations don't match because the MSE is calculated in the normalised space. If you set the standardize = FALSE parameter in h2o.deeplearning(), it will match:
unmod.dl <- h2o.deeplearning(x = feature_names, standardize = FALSE,
                             training_frame = unmod.hex,
                             autoencoder = TRUE,
                             reproducible = TRUE,
                             hidden = c(3, 2, 3), epochs = 50,
                             activation = "Tanh")
mod.out <- as.data.frame(h2o.predict(unmod.dl, mod.hex, type=response))
mod.anon <- h2o.anomaly(unmod.dl, mod.hex, per_feature=FALSE)
mse.list <- as.data.frame(mod.anon)
mse.list
> mse.list
Reconstruction.MSE
1 1512.740
2 1777.491
3 1458.438
4 1587.593
5 1648.999
> mod.anon.validate <- apply((test_dat - mod.out)^2, 1, mean)
> mse.list.validate <- as.data.frame(mod.anon.validate)
> mse.list.validate
mod.anon.validate
1 1512.740
2 1777.491
3 1458.438
4 1587.593
5 1648.999
Here's an example of how to normalize:
#Load test and training data.
test_dat <- sample.test
train_dat <- sample.train
#Start H2O
library(h2o)
localH2O <- h2o.init(port =54321, strict_version_check = FALSE)
#Training and deep learning
feature_names <- names(train_dat[1:3])
unmod.hex <- as.h2o(train_dat, destination_frame="train.hex")
mod.hex <- as.h2o(test_dat, destination_frame="test.hex")
unmod.dl <- h2o.deeplearning(x = feature_names,
                             training_frame = unmod.hex,
                             autoencoder = TRUE,
                             reproducible = TRUE,
                             hidden = c(3, 2, 3), epochs = 50,
                             activation = "Tanh")
# Anomaly Detection
mod.anon <- h2o.anomaly(unmod.dl, mod.hex, per_feature=FALSE)
mse.list <- as.data.frame(mod.anon)
# Manual MSE
mod.out <- as.data.frame(h2o.predict(unmod.dl, mod.hex, type=response))
# Scale the original and reconstructed values the same way: per-feature
# range (max - min) and mean are taken from the training data
s <- apply(train_dat, 2, max) - apply(train_dat, 2, min)
m <- apply(train_dat, 2, mean)
original_scaled <- t(apply(test_dat, 1, function(x) (x - m) / s))
recreate_scaled <- t(apply(mod.out, 1, function(x) (x - m) / s))
mod.anon.validate <- apply((original_scaled - recreate_scaled)^2, 1, mean)
mse.list.validate <- as.data.frame(mod.anon.validate)
# Compare Outputs
print(mse.list)
print(mse.list.validate)
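As a small follow-up (not part of the original answer), the manual calculation can be wrapped in a helper so it is easy to reuse; train_dat, test_dat and mod.out are the objects defined above:
# sketch: per-row reconstruction MSE in the training data's normalised space
manual_reconstruction_mse <- function(train, original, reconstructed) {
  s <- apply(train, 2, max) - apply(train, 2, min)  # per-feature range from the training data
  m <- apply(train, 2, mean)                        # per-feature mean from the training data
  orig_scaled <- t(apply(original, 1, function(x) (x - m) / s))
  rec_scaled <- t(apply(reconstructed, 1, function(x) (x - m) / s))
  rowMeans((orig_scaled - rec_scaled)^2)
}
manual_reconstruction_mse(train_dat, test_dat, mod.out)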
I have been trying to apply logistic regression (or any other ML algorithm) to this simple data set, but I have failed miserably and got many errors.
dim(data)
[1] 11580 12
head(data)
ReturnJan ReturnFeb ReturnMar ReturnApr ReturnMay ReturnJune
1 0.08067797 0.06625000 0.03294118 0.18309859 0.130333952 -0.01764234
2 -0.01067989 0.10211539 0.14549595 -0.08442804 -0.327300392 -0.35926605
3 0.04774193 0.03598972 0.03970223 -0.16235294 -0.147426982 0.04858934
4 -0.07404022 -0.04816956 0.01821862 -0.02467917 -0.006036217 -0.02530364
5 -0.03104575 -0.21267723 0.09147609 0.18933823 -0.153846154 -0.10611511
6 0.57980016 0.33225225 -0.40546095 -0.06000000 0.060732113 -0.21536106
And the 12th column, the one I am trying to predict, looks like this:
PositiveDec
0
0
0
1
1
1
Here is my attempt
new.data <- data[,-12] #Remove labels' column
index <- sample(1:nrow(new.data), size = 0.8*nrow(new.data))#Split data
train.data <- new.data[index,]
test.data <- new.data[-index,]
fit.glm <- glm(data[,12]~.,data = data, family = "binomial")
You are getting there, but have several syntactic errors and, as pointed out in comments, need to leave your outcome variable in. This should work:
index <- sample(1:nrow(data), size = 0.8 * nrow(data))
train.data <- data[index, ]
fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")
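A quick follow-up sketch (not in the original answer) to check out-of-sample performance on the held-out rows; the 0.5 cut-off is an assumption:
test.data <- data[-index, ]
pred.prob <- predict(fit.glm, newdata = test.data, type = "response")
pred.class <- ifelse(pred.prob > 0.5, 1, 0)
mean(pred.class == test.data$PositiveDec)  # out-of-sample accuracy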
I need to apply lm() to an enlarging subset of my data frame dat, while making a prediction for the next observation. For example, I am doing:
fit model                 predict
----------------------    ----------------
dat[1:3, ]                dat[4, ]
dat[1:4, ]                dat[5, ]
.                         .
.                         .
dat[1:(nrow(dat)-1), ]    dat[nrow(dat), ]
I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
From fit, I'd need the summary(fit)$adj.r.squared;
From predict.fit I'd need predict.fit$fit value.
Thanks.
(Efficient) solution
This is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1
## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x + 1, ], se.fit = TRUE)
  c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}
## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note I have done several things inside the bundle function:
I have used the subset argument to select the subset of rows to fit;
I have used model = FALSE so that the model frame is not saved, which saves memory.
Overall, there is no explicit loop; sapply does the iteration.
Fitting starts from p, the minimum number of data points required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as the final row is needed only for the last prediction.
Test
Example data (with 30 "observations")
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
v12 = runif(30, 1, 100))
Applying the code above gives the results below (27 rows in total; output truncated to 5 rows):
adj.r2 prediction se
[1,] NaN 3.881068 NaN
[2,] 0.106592619 3.676821 0.7517040
[3,] 0.545993989 3.892931 0.2758347
[4,] 0.622612495 3.766101 0.1508270
[5,] 0.180462206 3.996344 0.2059014
The first column is the adjusted R-squared value of the fitted model, while the second column is the prediction. The first value of adj.r2 is NaN because the first model fits 3 coefficients to 3 data points, so no sensible statistic is available. The same happens to se: with zero residual degrees of freedom the residual standard error is undefined, so the prediction standard error cannot be computed either.
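A quick sketch to see why, using the same model formula on just 3 made-up rows:
d3 <- data.frame(clicks = runif(3, 1, 100), v1 = runif(3, 1, 100), v12 = runif(3, 1, 100))
f3 <- lm(log(clicks) ~ log(v1) + log(v12), data = d3)
df.residual(f3)  # 0: a saturated fit, so adj. R^2 and the standard errors are not meaningful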
I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).
data <- data.frame(v1=rnorm(100),v2=rnorm(100),clicks=rnorm(100))
data1 = data[1:(nrow(data)-1), ]
data2 = data[nrow(data), ]
for (i in 3:nrow(data)) {
  nam <- paste("predict", i, sep = "")
  nam1 <- paste("fit", i, sep = "")
  nam2 <- paste("summary_fit", i, sep = "")
  fit <- lm(clicks ~ v1 + v2, data = data[1:i, ])
  tmp <- predict(fit, newdata = data2, se.fit = TRUE)
  tmp1 <- fit
  tmp2 <- summary(fit)
  assign(nam, tmp)
  assign(nam1, tmp1)
  assign(nam2, tmp2)
}
All of the results you want will be stored in the data objects this creates.
For example:
> summary_fit10$r.squared
[1] 0.3087432
You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the 3 types of output objects like this:
rm(data,data1,data2,i,nam,nam1,nam2,fit,tmp,tmp1,tmp2)
frames <- ls()
frames.fit <- frames[1:98] #change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum <- frames[197:294]
fit.table <- data.frame(intercept=NA,v1=NA,v2=NA,sourcedf=NA)
for (i in 1:length(frames.fit)) {
  tmp <- get(frames.fit[i])
  fit.table <- rbind(fit.table,
                     c(tmp$coefficients[[1]], tmp$coefficients[[2]], tmp$coefficients[[3]], frames.fit[i]))
}
fit.table
> fit.table
intercept v1 v2 sourcedf
2 -0.0647017971121678 1.34929652763687 -0.300502017324518 fit10
3 -0.0401617893034109 -0.034750571912636 -0.0843076273486442 fit100
4 0.0132968863522573 1.31283604433593 -0.388846211083564 fit11
5 0.0315113918953643 1.31099122173898 -0.371130010135382 fit12
6 0.149582794027583 0.958692838785998 -0.299479715938493 fit13
7 0.00759688947362175 0.703525856001948 -0.297223988673322 fit14
8 0.219756240025917 0.631961979610744 -0.347851129205841 fit15
9 0.13389223748979 0.560583832333355 -0.276076134872669 fit16
10 0.147258022154645 0.581865844000838 -0.278212722024832 fit17
11 0.0592160359650468 0.469842498721747 -0.163187274356457 fit18
12 0.120640756525163 0.430051839741539 -0.201725012088506 fit19
13 0.101443924785995 0.34966728554219 -0.231560038360121 fit20
14 0.0416637001406594 0.472156988919337 -0.247684504074867 fit21
15 -0.0158319749710781 0.451944113682333 -0.171367482879835 fit22
16 -0.0337969739950376 0.423851304105399 -0.157905431162024 fit23
17 -0.109460218252207 0.32206642419212 -0.055331391802687 fit24
18 -0.100560410735971 0.335862465403716 -0.0609509815266072 fit25
19 -0.138175283219818 0.390418411384468 -0.0873106257144312 fit26
20 -0.106984355317733 0.391270279253722 -0.0560299858019556 fit27
21 -0.0740684978271464 0.385267011513678 -0.0548056844433894 fit28
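As a side note (not part of the original answer), the same coefficient table can be built without assign()/get() by keeping the fits in a list; a minimal sketch, assuming data is recreated as above (the rm() call earlier removed it):
fits <- lapply(3:nrow(data), function(i) lm(clicks ~ v1 + v2, data = data[1:i, ]))
fit.table2 <- data.frame(t(sapply(fits, coef)),
                         sourcedf = paste0("fit", 3:nrow(data)))
head(fit.table2)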
Unfortunately, I have problems using predict() in the following simple example:
library(e1071)
x <- c(1:10)
y <- c(0,0,0,0,1,0,1,1,1,1)
test <- c(11:15)
mod <- svm(y ~ x, kernel = "linear", gamma = 1, cost = 2, type="C-classification")
predict(mod, newdata = test)
The result is as follows:
> predict(mod, newdata = test)
1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA>
0 0 0 0 0 1 1 1 1 1
Can anybody explain why predict() only gives the fitted values of the training sample (x,y) and does not care about the test-data?
Thank you very much for your help!
Richard
It looks like this is because you misuse the formula interface to svm(). Normally, one supplies a data frame or similar object within which the variables in the formula are searched for. It usually doesn't matter if you don't do this, even if it is not best practice, but when you want to predict, not putting variables in a data frame gets you in a right mess. The reason it returns the training data is because you don't provide newdata an object with a component named x in it. Hence it can't find the new data x so returns the fitted values. This is common for most R predict methods I know.
The solution then is to i) put your training data in a data frame and pass svm this as the data argument, and ii) supply a new data frame containing x (from test) to predict(). E.g.:
> DF <- data.frame(x = x, y = y)
> mod <- svm(y ~ x, data = DF, kernel = "linear", gamma = 1, cost = 2,
+ type="C-classification")
> predict(mod, newdata = data.frame(x = test))
1 2 3 4 5
1 1 1 1 1
Levels: 0 1
You need newdata to be of the same form, i.e. using a data.frame helps:
R> library(e1071)
Loading required package: class
R> df <- data.frame(x=1:10, y=sample(c(0,1), 10, rep=TRUE))
R> mod <- svm(y ~ x, kernel = "linear", gamma = 1,
+ cost = 2, type="C-classification", data=df)
R> newdf <- data.frame(x=11:15)
R> predict(mod, newdata=newdf)
1 2 3 4 5
0 0 0 0 0
Levels: 0 1
R>
By the way, this is also shown in the help page for svm():
## density-estimation
# create 2-dim. normal with rho=0:
X <- data.frame(a = rnorm(1000), b = rnorm(1000))
attach(X)
# traditional way:
m <- svm(X, gamma = 0.1)
# formula interface:
m <- svm(~., data = X, gamma = 0.1)
# or:
m <- svm(~ a + b, gamma = 0.1)
# test:
newdata <- data.frame(a = c(0, 4), b = c(0, 4))
predict (m, newdata)
So in sum, use the formula interface and supply a data.frame --- that is how essentially all modeling functions in R work.
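To illustrate that general pattern with another modelling function (a trivial sketch using lm(), not from the original answers):
d <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
m_lm <- lm(y ~ x, data = d)                       # fit: formula + data frame
predict(m_lm, newdata = data.frame(x = 11:15))    # predict: new data frame with the same column names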