I'm working with a database where I can't change the variable names by company decision.
One of the variables is named as follows:
%VARIABLE
I fit a model (lm object) and this variable is included.
Now I want to use the predict() function and I need to create a dataframe using this same name in one of the columns to be able to predict values. I'm doing the following:
new_x <- data.frame(X1 = 1, X2 = 0, X3 = 0, X4= 1, X5 = 0.765, `%VARIABLE` = 16.1)
predict(object = model4, newdata = new_x, level = 0.95, interval = 'confidence')
However, the new_x dataframe has the last column named as X.VARIABLE and not as %VARIABLE.
How can I fix this?
Renaming the column afterwards with names() works:
names(new_x)[6] <- "%VARIABLE"
new_x
X1 X2 X3 X4 X5 %VARIABLE
1 1 0 0 1 0.765 16.1
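The renaming happens because data.frame() runs column names through make.names() by default. A minimal sketch of avoiding it at construction time with check.names = FALSE is below. The training data and the fit object are made up for illustration and only stand in for the real model4.

```r
# Hypothetical training data; only the column name `%VARIABLE` mirrors the question.
train <- data.frame(
  X5          = c(0.1, 0.4, 0.5, 0.8, 0.9),
  `%VARIABLE` = c(10, 12, 15, 11, 18),
  y           = c(1.0, 2.1, 2.9, 3.2, 4.8),
  check.names = FALSE   # keep the non-syntactic name as given
)

# Non-syntactic names must be wrapped in backticks inside the formula.
fit <- lm(y ~ X5 + `%VARIABLE`, data = train)

# Build the new data with the same column names, again without name checking.
new_x <- data.frame(X5 = 0.765, `%VARIABLE` = 16.1, check.names = FALSE)
names(new_x)

pred <- predict(fit, newdata = new_x, level = 0.95, interval = "confidence")
pred
```

With check.names = FALSE there is no need to patch the names afterwards, and predict() finds the column by its original name.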
First, let's say I have a data frame df with variables y, x1, and x2, where x1 is a continuous variable and x2 is a factor.
Let's say I have a model:
model <- glm(y ~ x1 + x2, data = df, family = binomial)
This will result in an object where I can extract the coefficients using the command model$coefficients.
However, for use in another program I would like to export the data frame df, but I'd also like to be able to display the results of the model beyond simply adding the fitted values to the data frame.
Therefore I would like to have coeff1*x1 and coeff2*x2 also in the same dataframe, so that I could use these and the original data together to display their effects. The problem arises from the fact that one of the variables is a multi-level factor and therefore it's not preferable to simply use a for-loop to extract the coefficients and multiply the variables with them.
Is there another way to add two new variables to the dataframe df such that they've been derived from combining the original variables x1, x2 and their respective coefficients?
Try:
set.seed(123)
N <- 10
df <- data.frame(x1 = rnorm(N, 10, 1),
x2 = sample(1:3, N, TRUE),
y = as.integer(50 - x2* 0.4 + x1 * 1.2 + rnorm(N, 0, 0.5) > 52))
model <- glm(y ~ x1 + x2, data = df, family = binomial)
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
df$intercept <- df$x0 * model$coefficients["(Intercept)"]
df[["coeff1*x1"]] <- df$x1 * model$coefficients["x1"]
df[["coeff2*x2"]] <- df$x2 * model$coefficients["x2"]
# x0 x1 x2 y intercept coeff1*x1 coeff2*x2
# 1 1 9.439524 1 1 24.56607 -3.361333e-06 -4.281056e-07
# 2 1 9.769823 1 1 24.56607 -3.478949e-06 -4.281056e-07
# 3 1 11.558708 1 1 24.56607 -4.115956e-06 -4.281056e-07
Alternatively:
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
tmp <- as.data.frame(Map(function(x, y) x * y, subset(df, select = -y), model$coefficients))
names(tmp) <- paste0("coeff*", names(model$coefficients))
cbind(df, tmp)
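Both versions above treat x2 as numeric. If x2 really is a multi-level factor, as in the question, one option is to multiply the model's design matrix by the coefficients column by column, so each dummy variable gets its own contribution. This is only a sketch on simulated data; the data-generating step and the "coef*" column prefix are assumptions made for illustration.

```r
set.seed(123)
N <- 50
df <- data.frame(
  x1 = rnorm(N, 10, 1),
  x2 = factor(sample(c("a", "b", "c"), N, replace = TRUE))
)
# Simulated binary outcome, purely for illustration.
df$y <- rbinom(N, 1, plogis(-10 + df$x1 + 0.5 * (df$x2 == "b")))

model <- glm(y ~ x1 + x2, data = df, family = binomial)

# Design matrix used by the fit: intercept, x1, and one dummy column per
# non-reference level of x2.
X <- model.matrix(model)

# Multiply every column of X by its matching coefficient.
contrib <- sweep(X, 2, coef(model), `*`)
colnames(contrib) <- paste0("coef*", colnames(X))

# cbind() on a data frame keeps the non-syntactic column names.
df_out <- cbind(df, contrib)
head(df_out, 3)
```

The row sums of contrib equal the linear predictor, so predict(model, type = "link") is a quick sanity check on the new columns.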
library(rqPen)
n <- 60
p <- 7
rho <- .5
beta <- c(3,1.5,0,2,0,0,0)
R <- matrix(0,p,p)
for(i in 1:p){
for(j in 1:p){
R[i,j] <- rho^abs(i-j)
}
}
set.seed(1234)
x <- matrix(rnorm(n*p),n,p) %*% t(chol(R))
y <- x %*% beta + rnorm(n)
q.lasso_scad = cv.rq.pen(x, y, tau = 0.5, lambda = NULL, penalty = "SCAD", intercept = FALSE, criteria = "CV", cvFunc = "check", nfolds = 10,
foldid = NULL, nlambda = 100, eps = 1e-04, init.lambda = 1,alg="QICD")
q.lasso_scad
coef1 = q.lasso_scad$models[[which.min(q.lasso_scad$cv[,2])]]
coef1
I get the following output:
Coefficients:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
0.0000000 0.3226967 1.8131688 -0.1971847 0.1981571 0.7715635 -0.2289284 -0.1087028 0.9713283 -0.1079333
I want to extract the coefficients only. How can I do that?
Thank you in advance.
It's a bit backwards but you can do:
as.data.frame(as.list.data.frame(coef1)$coefficients)
Result:
as.list.data.frame(coef1)$coefficients
x1 3.17487201
x2 1.15712559
x3 0.05078333
x4 2.27113756
x5 0.24893740
x6 0.00000000
x7 -0.07542964
If I understand the issue correctly, the output from rqPen is some sort of a fancy list with additional attributes. as.list.data.frame basically forces coef1 to be a "normal" list, which allows me to use $coefficients to extract the coefficients values. Lastly, I use as.data.frame to convert it into a more usable object.
If you just want the values, you can replace as.data.frame with as.vector:
as.vector(as.list.data.frame(coef1)$coefficients)
Result:
[1] 3.17487201 1.15712559 0.05078333 2.27113756 0.24893740 0.00000000
[7] -0.07542964
I don't have access to R right now, so I can't verify that this works, but try:
names(coef1) <- NULL
coef1
Some modeling functions, e.g. glmnet(), require (or just allow for) the data to be passed in as a predictor matrix and a response matrix (or vector) as opposed to using a formula. In these cases, the predict() method, e.g. predict.glmnet(), typically requires that the newdata argument provide a predictor matrix with the same features as were used to train the model.
A convenient way to create a predictor matrix when your dataframe has factors (R's categorical data type) is to use the model.matrix() function, which automatically creates dummy features for your categorical variables:
# this is the dataframe and matrix I want to use to train the model
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], replace = T, 20)),
x2 = rnorm(20, 100, 5),
x3 = factor(sample(c("U","L"), replace = T, 20)),
y = rnorm(20, 10, 2))
mm <- model.matrix(y~., data = df)
But when I introduce a dataframe with new observations that contain only a subset of the levels of the factors from the original dataframe, model.matrix() (predictably) returns a matrix with different dummy features. This new matrix cannot be used in predict.glmnet() because it doesn't have the same features that the model is expecting:
# this is the dataframe and matrix I want to predict on
set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
x2 = rnorm(2, 100, 5),
x3 = c("L","U"))
mm_new <- model.matrix(~., data = df_new)
Is there a way to save the transformation (creating all necessary dummy features) from a dataframe to a model matrix so that I can re-apply this transformation to future observations? In my above example, this would ideally result in mm_new having identical feature names as mm so that predict() can accept mm_new.
I want to add that I'm aware of this approach, which essentially suggests including the observations from df_new in df before calling model.matrix(). This works fine if I have all the observations to begin with, and I'm just training and testing models. However, the new observations will only be accessible in the future (in a production prediction pipeline), and I want to avoid the overhead of re-loading the entire training dataframe for new predictions.
I found exactly what I needed available in the documentation for model.matrix and model.frame, and wanted to share. There is an argument in model.matrix called xlev which is "to be used as argument of model.frame if data is such that model.frame is called."
If model.matrix calls model.frame, xlev expects a list of character vectors for each factor in the dataframe (with the list element name being the factor name); each character vector contains the full set of factor levels needed to build the new model.matrix with the same dummy features as the original model.matrix.
Here's a working example:
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], replace = T, 20)),
x2 = rnorm(20, 100, 5),
x3 = factor(sample(c("U","L"), replace = T, 20)),
y = rnorm(20, 10, 2))
mm <- model.matrix(y~., data = df)
# this is a list of levels for each factor in the original df
xlevs <- lapply(df[,sapply(df, is.factor), drop = F], function(j){
levels(j)
})
# this is a new df with only a subset of the levels of the original factors
df_new <- data.frame(x1 = c("B", "C"),
x2 = rnorm(2, 100, 5),
x3 = c("U","U"))
# calling "xlev = " builds out a model.matrix with identical levels as the original df
mm_new <- model.matrix(~., data = df_new[1,], xlev = xlevs)
Note that this solution only handles factor levels that are a subset of the original factor levels. It isn't intended to handle new factor levels.
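To reuse the transformation later without keeping the training data around, it may be enough to save two small objects: the terms of the formula (with the response removed) and the factor levels. The sketch below uses deterministic toy data. The names tt, xlevs, and make_mm are hypothetical helpers, and .getXlevels() is the stats helper that model-fitting functions use to record levels.

```r
# Toy training data (made up for illustration).
df <- data.frame(
  x1 = factor(rep(LETTERS[1:5], times = 4)),
  x2 = seq(90, 109, length.out = 20),
  x3 = factor(rep(c("U", "L"), times = 10)),
  y  = seq(8, 12, length.out = 20)
)
mm <- model.matrix(y ~ ., data = df)

# Things worth saving (e.g. with saveRDS): predictor terms and factor levels.
tt    <- delete.response(terms(y ~ ., data = df))
xlevs <- .getXlevels(tt, model.frame(tt, df))

# Hypothetical helper: rebuild a matrix with the training columns for new rows.
make_mm <- function(newdata, tt, xlevs) {
  mf <- model.frame(tt, newdata, xlev = xlevs)
  model.matrix(tt, mf)
}

df_new <- data.frame(x1 = c("B", "C"), x2 = c(101, 99), x3 = c("U", "U"))
mm_new <- make_mm(df_new, tt, xlevs)
mm_new
```

As with the example above, a level that never appeared in training still raises an error rather than being handled silently.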
The problem with model.matrix() is that it does not save any of the transformation parameters. I wrote a package called ModelMatrixModel. Its ModelMatrixModel() function returns an object that stores the transformed matrix together with the transformation parameters, including factor levels and orthogonal polynomial coefficients, which can then be applied to new data. It also offers several options, such as handling invalid levels, keeping the first dummy variable, returning a sparse matrix, and scaling the output matrix.
#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], replace = T, 20)),
x2 = rnorm(20, 100, 5),
x3 = factor(sample(c("U","L"), replace = T, 20)),
y = rnorm(20, 10, 2))
df_new <- data.frame(x1 = c("B", "C"),
x2 = rnorm(2, 100, 5),
x3 = c("U","U"))
m <- ModelMatrixModel(y~1+x1+x2+x3, data = df,remove_1st_dummy = T,sparse=F)
head(m$x,2)
## _Intercept_ x1B x1C x1D x1E x2 x3U
## 1 1 0 0 0 0 93.64492 0
## 2 1 1 0 0 0 101.08855 1
m_new=predict(m,df_new)
head(m_new$x,2)
## _Intercept_ x1B x1C x1D x1E x2 x3U
## 1 1 1 0 0 0 106.63825 1
## 2 1 0 1 0 0 99.00571 1
Let's say we have a data frame with a set of 3 dependent variables and 6 independent variables tagged by a grouping variable. An example of this format is generated with the sample code below:
library(tidyverse)
library(broom)
n <- 15
df <- data.frame(groupingvar= sample(letters[1:2], size = n, replace = TRUE),
y1 = rnorm(n,10,1), y2=rnorm(n,100,10), y3=rnorm(n,1000,100),
x1= rnorm(n,10,1), x2=rnorm(n,10,1), x3=rnorm(n,10,1),
x4=rnorm(n,10,1), x5=rnorm(n,10,1), x6=rnorm(n,10,1))
df <- arrange(df,groupingvar)
If I wanted to regress each of the y1, y2, y3 on the set of x1 through x6 I could use something along the lines of:
y <- as.matrix(select(df,y1:y3))
x <- as.matrix(select(df,x1:x6))
regs <-lm(y~x)
coeffs <- tidy(regs)
coeffs <- arrange(coeffs,response, term)
(by making use of the following line from the lm() help: "If response is a matrix, a linear model is fitted separately by least-squares to each column of the matrix.")
However, if I need to first group by the grouping variable and then apply the lm function then I'm not quite sure how to do it. I have tried the following, but it produces the same set of coefficients for both groups.
regs2 <- df %>% group_by(groupingvar) %>%
do(fit2 = lm(as.matrix(select(df,y1:y3)) ~ as.matrix(select(df,x1:x6))))
coeffs2 <- tidy(regs2,fit2)
coeffs2 <- arrange(coeffs2,groupingvar, response)
In data.table, you could melt (reshape long -- stack the outcome variables in one column instead of stored in three columns) & lm by both groupingvar and the outcome variable:
library(data.table)
setDT(df)
#alternatively, set id.vars = c('groupingvar', paste0('x', 1:6)), etc.
longDT = melt(df, id.vars = grep('y', names(df), invert = TRUE))
#this helper function basically splits a named vector into
# its two components
coefsplit = function(reg) {
beta = coef(reg)
list(var = names(beta), coef = beta)
}
#I personally wouldn't assign longDT, I'd just chain this onto
# the output of melt;
longDT[ , coefsplit(lm(value ~ ., data = .SD)), by = .(groupingvar, variable)]
# groupingvar variable var coef
# 1: a y1 (Intercept) -3.595564e+03
# 2: a y1 x1 -3.796627e+01
# 3: a y1 x2 -1.557268e+02
# 4: a y1 x3 2.862738e+02
# 5: a y1 x4 1.579548e+02
# ...
# 38: b y3 x2 2.136253e+01
# 39: b y3 x3 -3.810176e+01
# 40: b y3 x4 4.187719e+01
# 41: b y3 x5 -2.586184e+02
# 42: b y3 x6 1.181879e+02
# groupingvar variable var coef
I also found a way to achieve this using cbind() as follows:
library(tidyverse)
library(broom)
n <- 20
df4 <- data.frame(groupingvar= sample(1:2, size = n, replace = TRUE),
y1 = rnorm(n,10,1), y2=rnorm(n,100,10), y3=rnorm(n,1000,100),
x1= rnorm(n,10,1), x2=rnorm(n,10,1), x3=rnorm(n,10,1),
x4=rnorm(n,10,1), x5=rnorm(n,10,1), x6=rnorm(n,10,1))
df4 <- arrange(df4,groupingvar)
regs <- df4 %>% group_by(groupingvar) %>%
do(fit = lm(cbind(y1,y2,y3) ~ . -groupingvar, data = .))
coeffs <- tidy(regs, fit)
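For what it's worth, the attempt in the question returns the same coefficients for both groups because select(df, ...) inside do() refers to the whole data frame rather than the current group (the `.` placeholder). If the broom version at hand no longer accepts tidy(regs, fit), a base R sketch of the same idea is to split the data by group and fit one multi-response lm per piece. The simulated data and the fit_group helper below are assumptions for illustration.

```r
set.seed(42)
n <- 40
df <- data.frame(
  groupingvar = rep(c("a", "b"), each = n / 2),
  y1 = rnorm(n, 10, 1), y2 = rnorm(n, 100, 10), y3 = rnorm(n, 1000, 100),
  x1 = rnorm(n, 10, 1), x2 = rnorm(n, 10, 1), x3 = rnorm(n, 10, 1),
  x4 = rnorm(n, 10, 1), x5 = rnorm(n, 10, 1), x6 = rnorm(n, 10, 1)
)

# Hypothetical helper: one multi-response regression on a single group's rows.
fit_group <- function(d) {
  lm(cbind(y1, y2, y3) ~ x1 + x2 + x3 + x4 + x5 + x6, data = d)
}

# split() yields one data frame per group; lapply() fits each separately.
fits  <- lapply(split(df, df$groupingvar), fit_group)

# Each element is a 7 x 3 matrix: rows are terms, columns are y1, y2, y3.
coefs <- lapply(fits, coef)
coefs$a
```

Each column of a group's coefficient matrix matches what a separate single-response lm() on that group's rows would give.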
I have 6 classes of outcome variable and 14 predictor variables. I built the model below:
fit <- multinom(y ~ X1 + X2 + as.factor(X3) + ... + X14, data= Original)
And I want to predict probabilities of each class of outcome for a given new data point.
X1 <- 1.6
X2 <- 4
X3 <- 15
.
.
.
X14 <- 8
dfin <- data.frame( ses = c(100, 200, 300), X1, X2, X3, ..., X14)
Then I run predict:
predict(fit, todaydata = dfin, type = "probs")
The outcome looks like:
#class1 #class2 #class3 #class4 #class5 #class6
#5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
#5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
#5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
#5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
#5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
#5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
#5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
#5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
#5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
#5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
#5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
#...
Then I change values of new data point by running the lines below:
X1 <- 2.7
X2 <- 5.1
X3 <- 28
.
.
.
X14 <- 2
dfin2 <- data.frame( ses = c(100, 200, 300), X1, X2, X3, ..., X14)
predict(fit, todaydata = dfin2, type = "probs")
Again, I get exactly the same probabilities:
#class1 #class2 #class3 #class4 #class5 #class6
#5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
#5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
#5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
#5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
#5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
#5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
#5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
#5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
#5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
#5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
#5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
#...
What am I doing wrong that causes the same outcome for two different data frames, dfin and dfin2?
My second question: why do I get so many rows of output for a single data point?
Thanks a lot for your time!
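One likely cause is the argument name. predict() methods take the new observations through newdata, and an unrecognised argument such as todaydata is absorbed by `...` without any warning. The method then behaves as if no new data had been supplied and returns the fitted probabilities for every row of the training data, which would explain both the identical results and the large number of rows. Below is a small sketch of the effect using nnet::multinom on the built-in iris data; the data set and variable names are stand-ins, not the original model.

```r
library(nnet)

# Stand-in model on built-in data; trace = FALSE silences the iteration log.
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

new_obs <- data.frame(Sepal.Length = c(5.0, 6.5), Sepal.Width = c(3.4, 2.8))

# Misspelled argument: silently ignored, so the fitted values for all
# training rows come back.
p_wrong <- predict(fit, todaydata = new_obs, type = "probs")

# Correct argument: one row of class probabilities per row of new_obs.
p_right <- predict(fit, newdata = new_obs, type = "probs")

dim(p_wrong)
dim(p_right)
```

Note also that R is case-sensitive, so every column in the new data frame has to match the name used in the model formula exactly.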