Unfortunately, I have problems using predict() in the following simple example:
library(e1071)
x <- c(1:10)
y <- c(0,0,0,0,1,0,1,1,1,1)
test <- c(11:15)
mod <- svm(y ~ x, kernel = "linear", gamma = 1, cost = 2, type="C-classification")
predict(mod, newdata = test)
The result is as follows:
> predict(mod, newdata = test)
1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA>
0 0 0 0 0 1 1 1 1 1
Can anybody explain why predict() only gives the fitted values of the training sample (x, y) and ignores the test data?
Thank you very much for your help!
Richard
It looks like this is because you are misusing the formula interface to svm(). Normally, one supplies a data frame or similar object in which the variables in the formula are looked up. It usually doesn't matter if you don't do this, even if it is not best practice, but when you want to predict, not putting the variables in a data frame gets you in a right mess. You get results for the training data because the object you pass as newdata has no component named x. predict() therefore cannot find new values for x and returns the fitted values instead. This is common to most R predict methods I know.
The solution then is to i) put your training data in a data frame and pass svm this as the data argument, and ii) supply a new data frame containing x (from test) to predict(). E.g.:
> DF <- data.frame(x = x, y = y)
> mod <- svm(y ~ x, data = DF, kernel = "linear", gamma = 1, cost = 2,
+ type="C-classification")
> predict(mod, newdata = data.frame(x = test))
1 2 3 4 5
1 1 1 1 1
Levels: 0 1
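To see the lookup behaviour described above without e1071, here is a minimal sketch using lm() from base R. The data values and the names fit, wrong and right are made up for illustration. When newdata has no column called x, the x from the workspace is found instead, so you get one value per training row. When newdata does contain x, you get one value per new row.

```r
x <- 1:10
y <- c(1.1, 2.3, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.8, 10.3)
fit <- lm(y ~ x)   # no data argument, as in the question

# newdata without a column named x: the training x is found instead
# (R warns that 'newdata' had 5 rows but the variables found have 10 rows)
wrong <- suppressWarnings(predict(fit, newdata = data.frame(z = 11:15)))
length(wrong)   # 10, i.e. the fitted values of the training data

# newdata with a column named x: predictions for the new values
right <- predict(fit, newdata = data.frame(x = 11:15))
length(right)   # 5
```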
You need newdata to be of the same form, i.e. using a data.frame helps:
R> library(e1071)
Loading required package: class
R> df <- data.frame(x=1:10, y=sample(c(0,1), 10, rep=TRUE))
R> mod <- svm(y ~ x, kernel = "linear", gamma = 1,
+ cost = 2, type="C-classification", data=df)
R> newdf <- data.frame(x=11:15)
R> predict(mod, newdata=newdf)
1 2 3 4 5
0 0 0 0 0
Levels: 0 1
R>
By the way, this is also shown on the help page for svm():
## density-estimation
# create 2-dim. normal with rho=0:
X <- data.frame(a = rnorm(1000), b = rnorm(1000))
attach(X)
# traditional way:
m <- svm(X, gamma = 0.1)
# formula interface:
m <- svm(~., data = X, gamma = 0.1)
# or:
m <- svm(~ a + b, gamma = 0.1)
# test:
newdata <- data.frame(a = c(0, 4), b = c(0, 4))
predict (m, newdata)
So in sum, use the formula interface and supply a data.frame --- that is how essentially all modeling functions in R work.
library(MASS) # provides mvrnorm()
sample_size <- 200
sample_meanvector <- c(3, 4)
sample_covariance_matrix <- matrix(c(2, 1, 1, 2),
ncol = 2)
# create bivariate normal distribution
sample_distribution <- mvrnorm(n = sample_size,
mu = sample_meanvector,
Sigma = sample_covariance_matrix)
#Convert the datatype
df_sample_distribution <- as.data.frame(sample_distribution)
df_sample_distribution$Y <- (1 + df_sample_distribution$V1*2 + df_sample_distribution$V2 + rnorm(200,0,1))
colnames(df_sample_distribution)[1] <- "X1"
colnames(df_sample_distribution)[2] <- "X2"
The code above generates vectors from a bivariate normal distribution, and the code below runs a regression on the generated data.
Test2 <- lm( Y ~ X1, data = df_sample_distribution)
#to extract only specific coefficients
summary(Test2)$coefficients[2,1]
My question is whether there is a way to regenerate the data and run the regression on it 200 times, saving all the outputs in a list. Here is the pseudocode in my head.
for (){
#generate data
for ()
{
#extract coefficients and insert them in a list
}
}
In simple terms,
step 1: create data
step 2: run regression over it
step 3: get the coefficient (hopefully save them in a list)
I am looking for code that can loop through steps 1 to 3 200 times and save all the results. Any ideas or inspiration are welcome. Thank you in advance.
Just wrap your code into a for-loop like your pseudo code:
library(MASS)
iterations <- 10 # In your example this should be 200
sample_size <- 200
sample_meanvector <- c(3, 4)
sample_covariance_matrix <- matrix(c(2, 1, 1, 2),
ncol = 2)
# create output data.frame
df_output <- data.frame(iteration = integer(0), coeff = double(0))
# loop over data generation and regression
for (i in seq_len(iterations)) {
  # step 1: create data
  sample_distribution <- mvrnorm(n = sample_size,
                                 mu = sample_meanvector,
                                 Sigma = sample_covariance_matrix)
  # convert the matrix to a data.frame (columns are named V1 and V2)
  df_sample_distribution <- as.data.frame(sample_distribution)
  df_sample_distribution$Y <- (1 + df_sample_distribution$V1*2 + df_sample_distribution$V2 + rnorm(sample_size,0,1))
  colnames(df_sample_distribution)[1] <- "X1"
  colnames(df_sample_distribution)[2] <- "X2"
  # steps 2 and 3: run the regression and store the slope of X1
  df_output[i, 1] <- i
  df_output[i, 2] <- summary(lm( Y ~ X1, data = df_sample_distribution))$coefficients[2,1]
}
This returns df_output containing coefficients for each iteration:
iteration coeff
1 1 2.647886
2 2 2.274654
3 3 2.447453
4 4 2.451471
5 5 2.568877
6 6 2.428295
7 7 2.440396
8 8 2.478357
9 9 2.477211
10 10 2.367012
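As an alternative to growing a data.frame inside a for-loop, steps 1 to 3 can be wrapped in a function and repeated with replicate(). This is only a sketch: the helper name simulate_coef is hypothetical, and the numbers are the ones from the question.

```r
library(MASS)  # provides mvrnorm()

# Hypothetical helper: one pass through steps 1-3, returning the slope of X1
simulate_coef <- function(n = 200) {
  xy <- mvrnorm(n = n, mu = c(3, 4), Sigma = matrix(c(2, 1, 1, 2), ncol = 2))
  d <- data.frame(X1 = xy[, 1], X2 = xy[, 2])
  d$Y <- 1 + 2 * d$X1 + d$X2 + rnorm(n)
  coef(lm(Y ~ X1, data = d))[["X1"]]
}

set.seed(1)
coefs <- replicate(200, simulate_coef())  # numeric vector of 200 slopes
summary(coefs)
```

The slopes cluster around 2.5 rather than 2 because X2 is left out of the regression and is correlated with X1: the expected slope is 2 + cov(X1, X2)/var(X1) = 2 + 1/2. If you want every coefficient rather than just one, lapply() over seq_len(200) and return coef(...) to get a list.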
I am fitting the following model
fit<- lmer(y ~ a + b + (1|c) + (1|a:d) , data=inputdata)
to real observations collected in "inputdata".
Now I want to generate various (1000) modelled datasets for a simulation based on the model parameters and the determined errors. I can use
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
but this always returns the same value (the most likely one, the mean). Is there a way to get a distribution of values, meaning to draw from a predicted distribution?
As asked by @Adam Quek, a reproducible example:
library(lme4) # provides lmer() and bootMer()
#creating dataset
a <- as.factor(sort(rep(1:4,5 )))
b <- rep(1:2,10)+0.5
c <- as.factor(c( sort(rep(1:2,5)),sort(rep(1:2,5)) ))
d <- as.factor(rep(1:5,4 ))
a <- c(a,a,a)
b <- c(b,b,b)
c <- c(c,c,c)
d <- c(d,d,d)
y <- rnorm(60)
inputdata = data.frame(y,a,b,c,d)
# fitting the model
fit<- lmer(y ~ a + b + (1|c) + (1|a:d) , data=inputdata)
# making specific predictions for a parameter set
val_a1 = 1
val_b1 = 2
val_c1 = 1
val_d1 = 4
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
pred
what I obtain is:
0.2394255
If I do it again
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
pred
I get of course:
0.2394255
but what I am searching for is an R function or routine that easily provides a suite of predictions that follow the distribution of my input values. Something like
for (i in 1:1000){
pred[i] <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d =
val_d1),allow.new.levels = TRUE)
}
and mean(pred) = 0.2394255 but sd(pred) != 0
Thanks to @Alex W! bootMer does the job. For those who are interested, here is the solution for the example:
m1 <- function(.) {
predict(., newdata=inputdata, re.form=NULL)
}
boot1 <- lme4::bootMer(fit, m1, nsim=1000, use.u=FALSE, type="parametric")
boot1$t[,1]
where boot1$t[,1] now contains the 1000 predictions for the parameter values defined in inputdata[1,].
https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html
was a helpful link.
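For readers who want to see what a parametric bootstrap of a prediction does, here is a rough sketch of the same idea using lm() and base R only, so it is not the lme4 implementation. The data and the names d, new_point and boot_pred are invented for illustration. Each replicate simulates a new response from the fitted model, refits, and predicts at the point of interest.

```r
set.seed(42)
d <- data.frame(x = 1:20)
d$y <- 1 + 0.5 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)
new_point <- data.frame(x = 10)

# simulate() draws new response vectors from the fitted model,
# one column per simulation
y_sim <- simulate(fit, nsim = 1000)

# refit on each simulated response and predict at new_point
boot_pred <- vapply(y_sim, function(y_star) {
  d_star <- d
  d_star$y <- y_star
  predict(lm(y ~ x, data = d_star), newdata = new_point)[[1]]
}, numeric(1))

mean(boot_pred)  # close to predict(fit, new_point)
sd(boot_pred)    # greater than zero
```

Note that the spread here reflects uncertainty in the fitted mean. If you also want the residual noise of a single new observation, you would have to add a draw from the estimated error distribution to each prediction.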
When I use the mice package to impute data I have the following issue:
I can't seem to find a way to replace NA values of new observations, given that I already have imputed the missing data in the training set.
Example 1
I have trained an algorithm with data from data frame with 10 features and 1000 observations.
How can I predict a new observation using this algorithm (with missing data)?
Example 2
Suppose we have a data frame with NA values:
V1 V2 V3 R1
1 2 NA 1
1.4 -1 0 0
1.2 NA 0 1
1.6 NA 1 1
1.2 3 1 0
I impute the missing values using the mice package:
imp <- mice(df, m = 2, maxit = 100, meth = 'pmm', seed = 12345)
The object imp now holds 2 data frames with imputed values.
(dfImp1)
V1 V2 V3 R1
1 2 0.5 1
1.4 -1 0 0
1.2 1.5 0 1
1.6 1.5 1 1
1.2 3 1 0
Now with this data frame, I can train an algorithm:
modl <- glm(R1~., (dfImp1), family = binomial)
I want to predict the response of a new observation, e.g:
obs1 <- data.frame(V1 = 1, V2 = 1.4, V3 = NA)
How do I impute the missing data of a new individual observation?
It seems that the mice package does not have a built-in solution, but we can write one.
The idea is to:
(1) use the same mice algorithm to fill NA in dataset used to train GLM and the new observation(s);
(2) predict only the new observation without NA.
I'm going to use iris as a data example.
library(R6)
library(mice)
library(dplyr) # for %>%, filter() and bind_rows()
# Binary output to use Binomial
df <- iris %>% filter(Species != "virginica")
# The new observation
new_data <- tail(df, 1)
# the dataset used to train the model
df <- head(df,-1)
# Now, let's insert some NAs
insert_nas <- function(x) {
set.seed(123)
len <- length(x)
n <- sample(1:floor(0.2*len), 1)
i <- sample(1:len, n)
x[i] <- NA
x
}
df$Sepal.Length <- insert_nas(df$Sepal.Length)
df$Petal.Width <- insert_nas(df$Petal.Width)
new_data$Sepal.Width = NA
summary(df)
In the fit method we apply mice to fill the NAs, fit a GLM and store it for use in the predict method.
In the predict method we (1) append the new observation(s), NAs included, to the dataset, (2) impute again with mice, (3) take back the now-complete row(s) of the new observation(s), and (4) use the GLM to predict for them.
# R6 Class Generator
GLMWithMice <- R6Class("GLMWithMice", list(
  model = NULL,
  df = NULL,
  fitted = FALSE,
  initialize = function(df) {
    self$df <- df
  },
  fit = function(formula = "Species~.", family = binomial) {
    imp <- mice(self$df, m = 2, maxit = 100, meth = 'pmm', seed = 12345, print=FALSE)
    df_cleaned <- complete(imp,1)
    self$model <- glm(formula, df_cleaned, family = family, maxit = 100)
    self$fitted <- TRUE
    return(cat("\n model fitted!"))
  },
  predict = function(new_data, type = "response"){
    n_rows <- nrow(self$df)
    df_new <- bind_rows(self$df, new_data)
    imp <- mice(df_new, m = 2, maxit = 100, meth = 'pmm', seed = 12345, print=FALSE)
    df_cleaned <- complete(imp,1)
    new_data_cleaned <- tail(df_cleaned, nrow(df_new) - n_rows)
    return(predict(self$model,new_data_cleaned, type = type))
  }
)
)
#Let's create a new instance of "GLMWithMice" class
model <- GLMWithMice$new(df = df)
class(model)
model$fit(formula = Species~., family = binomial)
model$predict(new_data = new_data)
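The row bookkeeping in steps (1) to (4) can be seen in isolation in the sketch below. To keep it runnable with base R only, it swaps mice for a deliberately crude stand-in, impute_means(), which fills NAs with column means. That helper, the choice of iris columns, and the 0/1 response is_virginica are all assumptions for illustration, not part of the answer above.

```r
# Stand-in imputer (assumption): replace NAs in numeric columns by column means
impute_means <- function(d) {
  for (col in names(d)) {
    if (is.numeric(d[[col]])) {
      d[[col]][is.na(d[[col]])] <- mean(d[[col]], na.rm = TRUE)
    }
  }
  d
}

predictors <- c("Sepal.Length", "Sepal.Width")
d <- subset(iris, Species != "setosa",
            select = c(Sepal.Length, Sepal.Width, Species))
d$is_virginica <- as.integer(d$Species == "virginica")
d$Species <- NULL

# Hold out the last row as the new observation and add some NAs
new_obs <- d[nrow(d), predictors]
train <- d[-nrow(d), ]
train$Sepal.Length[c(5, 20, 60)] <- NA
new_obs$Sepal.Width <- NA

# Fit on the imputed training data
model <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
             data = impute_means(train), family = binomial)

# (1) append the new row, (2) impute the combined data,
# (3) take back the new row, (4) predict
combined <- rbind(train[predictors], new_obs)
combined_filled <- impute_means(combined)
new_filled <- tail(combined_filled, nrow(new_obs))
p <- predict(model, newdata = new_filled, type = "response")
```

With mice in place of impute_means(), the structure stays the same; only the imputation call changes.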
Say I have a data frame like this:
X <- data_frame(
x = rep(seq(from = 1, to = 10, by = 1), 3),
y = 2*x + rnorm(length(x), sd = 0.5),
g = rep(LETTERS[1:3], each = length(x)/3))
How can I fit a regression y~x grouped by variable g and add the values from the fitted and resid generic methods to the data frame?
I know I can do:
A <- X[X$g == "A",]
mA <- with(A, lm(y ~ x))
A$fit <- fitted(mA)
A$res <- resid(mA)
B <- X[X$g == "B",]
mB <- with(B, lm(y ~ x))
B$fit <- fitted(mB)
B$res <- resid(mB)
C <- X[X$g == "C",]
mC <- with(C, lm(y ~ x))
C$fit <- fitted(mC)
C$res <- resid(mC)
And then rbind(A, B, C). However, in real life I am not using lm (I'm using rqss in the quantreg package). The method occasionally fails, so I need error handling that puts NA in all the rows of a group whose fit failed. Also, there are far more than 3 groups, so I don't want to keep copying and pasting code for each group.
I tried using dplyr with do but didn't make any progress. I was thinking it might be something like:
make_qfits <- function(data) {
data %>%
group_by(g) %>%
do(failwith(NULL, rqss), formula = y ~ qss(x, lambda = 3))
}
Would this be easy to do by that approach? Is there another way in base R?
You can use do on grouped data for this task, fitting the model in each group in do and putting the model residuals and fitted values into a data.frame. To add these to the original data, just include the . that represents the data going into do in the output data.frame.
In your simple case, this would look like this:
X %>%
  group_by(g) %>%
  do({
    model = rqss(y ~ qss(x, lambda = 3), data = .)
    data.frame(., residuals = resid.rqss(model), fitted = fitted(model))
  })
Source: local data frame [30 x 5]
Groups: g
x y g residuals fitted
1 1 1.509760 A -1.368963e-08 1.509760
2 2 3.576973 A -8.915993e-02 3.666133
3 3 6.239950 A 4.174453e-01 5.822505
4 4 7.978878 A 4.130033e-09 7.978878
5 5 10.588367 A 4.833475e-01 10.105020
6 6 11.786445 A -3.807876e-01 12.167232
7 7 14.646221 A 4.167763e-01 14.229445
8 8 15.938253 A -3.534045e-01 16.291658
9 9 19.114927 A 7.610560e-01 18.353871
10 10 19.574449 A -8.416343e-01 20.416083
.. .. ... . ... ...
Things get more complicated if you need to catch errors. Here is what it would look like using try, filling the residuals and fitted columns with NA if the fit attempt for a group results in an error.
X[9:30,] %>%
  group_by(g) %>%
  do({
    # try() returns an object of class "try-error" if the fit fails
    model = try(rqss(y ~ qss(x, lambda = 3), data = .))
    if (inherits(model, "try-error")) {
      data.frame(., residuals = NA, fitted = NA)
    } else {
      # the fit succeeded, so reuse it instead of fitting a second time
      data.frame(., residuals = resid.rqss(model), fitted = fitted(model))
    }
  })
Source: local data frame [22 x 5]
Groups: g
x y g residuals fitted
1 9 19.114927 A NA NA
2 10 19.574449 A NA NA
3 1 2.026199 B -4.618675e-01 2.488066
4 2 4.399768 B 1.520739e-11 4.399768
5 3 6.167690 B -1.437800e-01 6.311470
6 4 8.642481 B 4.193089e-01 8.223172
7 5 10.255790 B 1.209160e-01 10.134874
8 6 12.875674 B 8.290981e-01 12.046576
9 7 13.958278 B -4.803891e-10 13.958278
10 8 15.691032 B -1.789479e-01 15.869980
.. .. ... . ... ...
For the lm models you could try
library(nlme) # lmList to do lm by group
library(ggplot2) # fortify to get out the fitted/resid data
do.call(rbind, lapply(lmList(y ~ x | g, data=X), fortify))
This gives you the residual and fitted data in ".resid" and ".fitted" columns as well as a bunch of other fit data. By default the rownames will be prefixed with the letters from g.
With the rqss models that might fail
do.call(rbind, lapply(split(X, X$g), function(z) {
  fit <- tryCatch({
    rqss(y ~ x, data=z)
  }, error=function(e) NULL)
  if (is.null(fit)) data.frame(resid=numeric(0), fitted=numeric(0))
  else data.frame(resid=fit$resid, fitted=fitted(fit))
}))
Here's a version that works with base R:
modelit <- function(df) {
  mB <- with(df, lm(y ~ x, na.action = na.exclude))
  df$fit <- fitted(mB)
  df$res <- resid(mB)
  return(df)
}
dfs.with.preds <- lapply(split(X, as.factor(X$g)), modelit)
output <- Reduce(function(x, y) { rbind(x, y) }, dfs.with.preds)
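The base R version above does not yet cover the case where a fit fails. A possible sketch with tryCatch() that keeps every row and fills NA for failed groups follows. Here fragile_fit() is a hypothetical stand-in for rqss that fails on purpose for group B, so the example runs without quantreg.

```r
set.seed(1)
X <- data.frame(x = rep(1:10, 3), g = rep(LETTERS[1:3], each = 10))
X$y <- 2 * X$x + rnorm(nrow(X), sd = 0.5)

# Hypothetical fitter that fails for one group (stands in for rqss)
fragile_fit <- function(d) {
  if (d$g[1] == "B") stop("fit failed")
  lm(y ~ x, data = d)
}

# Fit one group; on error keep the rows and fill fit/res with NA
fit_group <- function(d) {
  model <- tryCatch(fragile_fit(d), error = function(e) NULL)
  if (is.null(model)) {
    d$fit <- NA_real_
    d$res <- NA_real_
  } else {
    d$fit <- fitted(model)
    d$res <- resid(model)
  }
  d
}

out <- do.call(rbind, lapply(split(X, X$g), fit_group))
```

Because every group returns a data.frame with the same columns, rbind works no matter how many groups fail.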
Is there a single function, similar to "runif", "rnorm" and the like which will produce simulated predictions for a linear model? I can code it on my own, but the code is ugly and I assume that this is something someone has done before.
slope = 1.5
intercept = 0
x = as.numeric(1:10)
e = rnorm(10, mean=0, sd = 1)
y = slope * x + intercept + e
df = data.frame(x = x, y = y)
fit = lm(y ~ x, data = df)
newX = data.frame(x = as.numeric(11:15))
What I'm interested in is a function that looks like the line below:
sims = rlm(1000, fit, newX)
That function would return 1000 simulations of y values, based on the new x variables.
The following shows that Gavin Simpson's suggestion of modifying stats:::simulate.lm is a viable one.
## Modify stats:::simulate.lm by inserting some tracing code immediately
## following the line that reads "ftd <- fitted(object)"
trace(what = stats:::simulate.lm,
tracer = quote(ftd <- list(...)[["XX"]]),
at = list(6))
## Prepare the data and 'fit' object
df <- data.frame(x =x<-1:10, y = 1.5*x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
## Define new covariate values and compute their predicted/fitted values
newX <- 8:1
newFitted <- predict(fit, newdata = data.frame(x = newX))
## Pass in fitted via the argument 'XX'
simulate(fit, nsim = 4, XX = newFitted)
# sim_1 sim_2 sim_3 sim_4
# 1 11.0910257 11.018211 10.95988582 13.398902
# 2 12.3802903 10.589807 10.54324607 11.728212
# 3 8.0546746 9.925670 8.14115433 9.039556
# 4 6.4511230 8.136040 7.59675948 7.892622
# 5 6.2333459 3.131931 5.63671024 7.645412
# 6 3.7449859 4.686575 3.45079655 5.324567
# 7 2.9204519 3.417646 2.05988078 4.453807
# 8 -0.5781599 -1.799643 -0.06848592 0.926204
That works, but this is a cleaner (and likely better) approach:
## A function for simulating at new x-values
simulateX <- function(object, nsim = 1, seed = NULL, X, ...) {
  object$fitted.values <- predict(object, X)
  simulate(object = object, nsim = nsim, seed = seed, ...)
}
## Prepare example data and a fit object
df <- data.frame(x =x<-1:10, y = 1.5*x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
## Supply new x-values in a data.frame of the form expected by
## the newdata= argument of predict.lm()
newX <- data.frame(x = 8:1)
## Try it out
simulateX(fit, nsim = 4, X = newX)
# sim_1 sim_2 sim_3 sim_4
# 1 11.485024 11.901787 10.483908 10.818793
# 2 10.990132 11.053870 9.181760 10.599413
# 3 7.899568 9.495389 10.097445 8.544523
# 4 8.259909 7.195572 6.882878 7.580064
# 5 5.542428 6.574177 4.986223 6.289376
# 6 5.622131 6.341748 4.929637 4.545572
# 7 3.277023 2.868446 4.119017 2.609147
# 8 1.296182 1.607852 1.999305 2.598428
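For comparison, a function with the call shape asked for in the question can also be written by hand. The sketch below uses the hypothetical name rlm_sim (rlm itself is already taken by MASS) and assumes an ordinary unweighted lm fit: it adds normal noise with the estimated residual standard deviation to the predicted means.

```r
# Hypothetical helper: nsim simulated responses at each row of newdata
rlm_sim <- function(nsim, object, newdata) {
  mu <- predict(object, newdata = newdata)
  sigma <- summary(object)$sigma  # estimated residual standard deviation
  # rnorm() recycles mu, so row r of every column is centred on mu[r]
  matrix(rnorm(length(mu) * nsim, mean = mu, sd = sigma),
         nrow = length(mu), ncol = nsim)
}

set.seed(1)
x <- as.numeric(1:10)
y <- 1.5 * x + rnorm(10)
fit <- lm(y ~ x, data = data.frame(x = x, y = y))
newX <- data.frame(x = as.numeric(11:15))

sims <- rlm_sim(1000, fit, newX)
dim(sims)  # 5 rows (new x values) by 1000 columns (simulations)
```

Like simulate(), this treats the estimated coefficients as fixed, so the draws do not reflect uncertainty in the slope and intercept.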