The function stats::model.matrix() can be used to convert an input data frame into a design matrix suitable for direct algebraic manipulation when solving regression equations and the like. For example:
set.seed(0)
df <- data.frame(a = rnorm(5), n = rnorm(5))
model.matrix(~., data = df)
produces:
  (Intercept)          a            n
1           1  1.2629543 -1.539950042
2           1 -0.3262334 -0.928567035
3           1  1.3297993 -0.294720447
4           1  1.2724293 -0.005767173
5           1  0.4146414  2.404653389
One useful feature is that the formula controls whether an intercept column is added to the design matrix. Compare model.matrix(~ 0 + ., data = df) with model.matrix(~ 1 + ., data = df):
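For the data above, the 0 term (equivalently, - 1) suppresses the intercept column, so
model.matrix(~ 0 + ., data = df)
produces just the two predictor columns:
           a            n
1  1.2629543 -1.539950042
2 -0.3262334 -0.928567035
3  1.3297993 -0.294720447
4  1.2724293 -0.005767173
5  0.4146414  2.404653389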
Now, the . in the formula specifies that all variables in df should be included. This leads to a problem when I wish to pass an empty data frame and thus create just the intercept column, e.g.:
df <- data.frame(matrix(, nrow=5, ncol=0))
model.matrix(~ ., data = df)
leads to:
Error in terms.formula(object, data = data) :
'.' in formula and no 'data' argument
Does anyone know how to get around this? To be specific, I want to produce the same result as model.matrix(~ 1, data = df), but using the . notation.
You could do:
model.matrix(as.formula(paste0("~ ", paste(c("1", colnames(df)), collapse = " + "))), data = df)
However, I think you are better off using an if statement; formulas in R are more a convenience than anything fundamental.
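For instance, a minimal sketch of the if-statement approach, falling back to an explicit intercept-only formula when df has no columns:
f <- if (ncol(df) > 0) ~ . else ~ 1  ## ~ 1 reproduces model.matrix(~ 1, data = df)
model.matrix(f, data = df)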
Related
I was wondering why lm() says 5 coefficients are not defined because of singularities and then gives NA for those 5 coefficients in the summary output.
Note that all my predictors are categorical.
Is there anything wrong with my data or code regarding these 5 coefficients? How can I fix this?
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
summary(vv)
First 6 lines of output:
Coefficients: (5 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.17835    0.63573   0.281 0.779330
Age1         -0.04576    0.86803  -0.053 0.958010
Age2          0.46431    0.87686   0.530 0.596990
Age99        -1.64099    1.04830  -1.565 0.118949
genre2        1.57015    0.55699   2.819 0.005263 **
genre4             NA         NA      NA       NA  ## for example, this row is all NA; there are 4 more like it!
As others have noted, one problem is that you appear to have multicollinearity. Another is that there are missing values in your dataset. The missing values should probably just be removed. As for the correlated variables, you should inspect your data to identify the collinearity and remove some of the offending variables. Deciding which variables to remove and which to retain is a very domain-specific question. However, if you wish, you could instead use regularisation and fit a model while retaining all variables. This also allows you to fit a model when n (the number of samples) is less than p (the number of predictors).
Below I show code that demonstrates how to examine the correlation structure within your data and identify which variables are most correlated (thanks to this answer). I've also included an example of fitting such a regularised model, using L2 regularisation (commonly known as ridge regression).
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
df <- d
df[] <- lapply(df, as.numeric)
cor_mat <- cor(as.matrix(df), use = "complete.obs")
library("gplots")
heatmap.2(cor_mat, trace = "none")
## https://stackoverflow.com/questions/22282531/how-to-compute-correlations-between-all-columns-in-r-and-detect-highly-correlate
library("tibble")
library("dplyr")
library("tidyr")
d2 <- df %>%
as.matrix() %>%
cor(use = "complete.obs") %>%
## Set diag (a vs a) to NA, then remove
(function(x) {
diag(x) <- NA
x
}) %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
filter(!is.na(value)) %>%
## Sort by decreasing absolute correlation
arrange(-abs(value))
## 2 pairs of variables are almost exactly correlated!
head(d2)
#> var1 var2 value
#> 1 id study.name 0.9999430
#> 2 study.name id 0.9999430
#> 3 Location timed 0.9994082
#> 4 timed Location 0.9994082
#> 5 Age ed.level 0.7425026
#> 6 ed.level Age 0.7425026
## Remove some variables here, or maybe try regularized regression (see below)
library("glmnet")
## glmnet requires matrix input
X <- d[, c("Age", "genre", "Length", "cf.training", "error.type", "cf.scope", "cf.type", "cf.revision")]
X[] <- lapply(X, as.numeric)
X <- as.matrix(X)
ind_na <- apply(X, 1, function(row) any(is.na(row)))
X <- X[!ind_na, ]
y <- d[!ind_na, "dint"]
## avoid shadowing the glmnet() function with an object of the same name
fit_ridge <- glmnet(
x = X,
y = y,
## alpha = 0 is ridge regression
alpha = 0)
plot(fit_ridge)
Created on 2019-11-08 by the reprex package (v0.3.0)
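If you do go down the regularisation route, you will also need to choose the penalty strength lambda. A hedged sketch using cv.glmnet from the already-loaded glmnet package, with the same X and y as above:
cv <- cv.glmnet(x = X, y = y, alpha = 0)  ## cross-validated ridge regression
cv$lambda.min                             ## penalty minimising the CV error
coef(cv, s = "lambda.min")                ## coefficients at that penalty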
In such a situation you can use the "olsrr" package in R for stepwise regression analysis. Here is sample code to do stepwise regression in R:
library("olsrr")
#Load the data
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T)
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make the predictors factors, as in the question
# stepwise regression
vv <- lm(dint ~ Age + genre + Length + cf.training + error.type + cf.scope + cf.type + cf.revision, data = d)
summary(vv)
k <- ols_step_both_p(vv, pent = 0.05, prem = 0.1)
# stepwise regression plot
plot(k)
# final model
k$model
This gives you the same kind of stepwise output that SPSS produces.
I created a linear model in R using county data from the ACS. There are 3140 entries in my data set, and they all have their corresponding fips codes. I'm trying to make a map of the residuals from my linear model, but I only have 3139 residuals. Does anyone know if there's something R does when creating a linear model that is responsible for this, and how I can fix it so that I can create this map? Thanks!
In response to the suggestion of checking for NAs, I ran this:
which(completedata$fipscode == NA)
integer(0)
R code if it helps:
sectorcodes <- read.csv("sectorcodes1.csv") #ruralubrancode, median hh income
sectorcodesdf <- data.frame(sectorcodes)
religion <- read.csv("Religion2.csv")
religiondf <- data.frame(religion)
merge1 <- merge(sectorcodesdf,religiondf, by = c('fipscode'))
merge1df <- data.frame(merge1)
family <- read.csv("censusdataavgfamsize.csv") #avgfamilysize
familydf <- data.frame(family)
merge2 <- merge(merge1df, familydf, by = c('fipscode'))
merge2df <- data.frame(merge2)
gradrate <- read.csv("censusdatahsgrad.csv")
gradratedf <- data.frame(gradrate)
evenmoredata2 <- merge(gradrate,merge2df, by=c("fipscode"))
#write.csv(evenmoredata2, file = "completedataset.csv")
completedata <- read.csv("completedataset.csv")
completedatadf <- data.frame(completedata)
lm8 <- lm(completedatadf$hsgrad ~ completedatadf$averagefamilysize*completedatadf$Rural_urban_continuum_code_2013*completedatadf$TOTADH*completedatadf$Median_Household_Income_2016)
summary(lm8)
library(blscrapeR)
library(RgoogleMaps)
library(choroplethr)
library(acs)
attach(acs)
require(choroplethr)
dataframe1 <- data.frame(completedatadf$fipscode,completedatadf$averagefamilysize)
names(dataframe1) <- c("region","value")
dataframe2 <- data.frame(completedata$fipscode,completedata$hsgrad)
names(dataframe2) <- c("region","value")
residdf <- data.frame(lm8$residuals)
dataframe3 <- data.frame(completedata$fipscode,lm8$residuals)
names(dataframe3) <- c("region","value")
county_choropleth(dataframe1)
county_choropleth(dataframe2)
county_choropleth(dataframe3)
When I try to run dataframe3, the error message is:
dataframe3 <- data.frame(completedata$fipscode,lm8$residuals)
Error in data.frame(completedata$fipscode, lm8$residuals) :
arguments imply differing number of rows: 3140, 3139
This can be caused by an NA in the response. For example, using the built-in BOD data frame, note that there are only 5 residuals here even though b has 6 rows:
b <- BOD
b[3, 2] <- NA
nrow(b)
## [1] 6
fm <- lm(demand ~ Time, b)
resid(fm)
## 1 2 4 5 6
## -0.3578947 -0.2657895 1.6184211 -0.6894737 -0.3052632
We can handle that by specifying na.action = na.exclude when running lm. Note that now there are 6 residuals with the extra one being NA.
fm <- lm(demand ~ Time, b, na.action = na.exclude)
resid(fm)
## 1 2 3 4 5 6
## -0.3578947 -0.2657895 NA 1.6184211 -0.6894737 -0.3052632
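Applied to the model in your question, a sketch (variable names taken from your code):
lm8 <- lm(hsgrad ~ averagefamilysize * Rural_urban_continuum_code_2013 *
            TOTADH * Median_Household_Income_2016,
          data = completedatadf, na.action = na.exclude)
## resid(lm8) now has nrow(completedatadf) entries, with NA for any dropped
## rows, so it lines up with completedata$fipscode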
Try
data.frame(na.omit(completedata$fipscode), lm8$residuals)
Probably your data has NA values. Note that the check you ran, which(completedata$fipscode == NA), is uninformative: comparing with == NA always yields NA, so it returns integer(0) regardless. Use is.na() instead, e.g. which(is.na(completedata$fipscode)).
I have an example dataset below.
train<-data.frame(x1 = c(4,5,6,4,3,5), x2 = c(4,2,4,0,5,4), x3 = c(1,1,1,0,0,1),
x4 = c(1,0,1,1,0,0), x5 = c(0,0,0,1,1,1))
Suppose I want to create separate models for column x3, x4, x5 based on column x1 and x2. For example
lm1 <- lm(x3 ~ x1 + x2, data = train)
lm2 <- lm(x4 ~ x1 + x2, data = train)
lm3 <- lm(x5 ~ x1 + x2, data = train)
I want to then take these models and apply them to a testing set using predict, and then create a matrix that has each model outcome as a column.
test <- data.frame(x1 = c(4,3,2,1,5,6), x2 = c(4,2,1,6,8,5))
p1 <- predict(lm1, newdata = test)
p2 <- predict(lm2, newdata = test)
p3 <- predict(lm3, newdata = test)
final <- cbind(p1, p2, p3)
This is a simplified version where you can do it step by step; the actual data set is far too large for that. Is there a way to write a function or a for loop that combines this into one or two steps?
I was inclined to close your question as a duplicate of Fitting a linear model with multiple LHS, but sadly the prediction issue is not addressed over there. On the other hand, Prediction of 'mlm' linear model object from lm() talks about prediction, but is a bit far from your situation, as you work with the formula interface rather than the matrix interface.
I did not manage to locate a perfect duplicate target in the "mlm" tag, so I think it is a good idea to contribute another answer here. As I said in the linked questions, predict.mlm does not support se.fit, and at the moment this is also a missing piece in the "mlm" tag, so I will take this chance to fill that gap.
Here is a function to compute the standard error of prediction:
f <- function (mlmObject, newdata) {
## model formula
form <- formula(mlmObject)
## drop response (LHS)
form[[2]] <- NULL
## prediction matrix
X <- model.matrix(form, newdata)
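## Q = R^{-T} X', using the R factor of the fit's QR decomposition;
## colSums(Q ^ 2) then equals diag(X (X'X)^{-1} X')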
Q <- forwardsolve(t(qr.R(mlmObject$qr)), t(X))
## unscaled prediction standard error
unscaled.se <- sqrt(colSums(Q ^ 2))
## residual standard error
sigma <- sqrt(colSums(residuals(mlmObject) ^ 2) / mlmObject$df.residual)
## scaled prediction standard error
tcrossprod(unscaled.se, sigma)
}
For your given example, you can do
## fit an `mlm`
fit <- lm(cbind(x3, x4, x5) ~ x1 + x2, data = train)
## prediction (mean only)
pred <- predict(fit, newdata = test)
# x3 x4 x5
#1 0.555956679 0.38628159 0.60649819
#2 0.003610108 0.47653430 0.95848375
#3 -0.458483755 0.48014440 1.27256318
#4 -0.379061372 -0.03610108 1.35920578
#5 1.288808664 0.12274368 0.17870036
#6 1.389891697 0.46570397 0.01624549
## prediction error
pred.se <- f(fit, newdata = test)
# [,1] [,2] [,3]
#[1,] 0.1974039 0.3321300 0.2976205
#[2,] 0.3254108 0.5475000 0.4906129
#[3,] 0.5071956 0.8533510 0.7646849
#[4,] 0.6583707 1.1077014 0.9926075
#[5,] 0.5049637 0.8495959 0.7613200
#[6,] 0.3552794 0.5977537 0.5356451
We can verify that f is correct:
## `lm1`, `lm2` and `lm3` are defined in your question
predict(lm1, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.1974039 0.3254108 0.5071956 0.6583707 0.5049637 0.3552794
predict(lm2, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.3321300 0.5475000 0.8533510 1.1077014 0.8495959 0.5977537
predict(lm3, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.2976205 0.4906129 0.7646849 0.9926075 0.7613200 0.5356451
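With the mean predictions and their standard errors in hand you can, for example, form confidence intervals for the mean response (a sketch; the t quantile uses the model's residual degrees of freedom):
tq <- qt(0.975, fit$df.residual)  ## 95% level
lwr <- pred - tq * pred.se
upr <- pred + tq * pred.se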
I need to apply lm() to an ever-enlarging subset of my dataframe dat, while making a prediction for the next observation. For example, I am doing:
fit model                predict
----------------------   ----------------
dat[1:3, ]               dat[4, ]
dat[1:4, ]               dat[5, ]
.                        .
.                        .
dat[1:(nrow(dat)-1), ]   dat[nrow(dat), ]
I know what to do for a particular subset (related to this question: predict() and newdata - How does this work?). For example, to predict the last row, I do
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
From fit, I'd need summary(fit)$adj.r.squared;
From predict.fit, I'd need the predict.fit$fit value.
Thanks.
(Efficient) solution
This is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1
## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}
## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note that I have done several things inside the bundle function:
I have used the subset argument to select the subset of rows to fit on;
I have used model = FALSE to avoid saving the model frame, which saves memory.
Overall, there is no explicit loop; sapply does the iteration.
Fitting starts from p, the minimum number of observations required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as we need to reserve at least the final row for prediction.
Test
Example data (with 30 "observations")
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
v12 = runif(30, 1, 100))
Applying the code above gives the result below (27 rows in total; output truncated to the first 5 rows):
adj.r2 prediction se
[1,] NaN 3.881068 NaN
[2,] 0.106592619 3.676821 0.7517040
[3,] 0.545993989 3.892931 0.2758347
[4,] 0.622612495 3.766101 0.1508270
[5,] 0.180462206 3.996344 0.2059014
The first column is the adjusted R-squared of the fitted model, while the second column is the prediction. The first value of adj.r2 is NaN because the first model fits 3 coefficients to only 3 data points, so no sensible statistic is available. The same happens to se: that first fit is exact (all residuals are zero and there are no residual degrees of freedom), so no prediction uncertainty can be estimated.
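As a sanity check, the last row of result should agree with the manual computation from the question (a quick sketch):
fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat[1:(nrow(dat) - 1), ])
pred <- predict(fit, newdata = dat[nrow(dat), ], se.fit = TRUE)
c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)  ## equals result[nrow(result), ]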
I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).
data <- data.frame(v1=rnorm(100),v2=rnorm(100),clicks=rnorm(100))
data1 = data[1:(nrow(data)-1), ]
data2 = data[nrow(data), ]
for(i in 3:nrow(data)){
nam <- paste("predict", i, sep = "")
nam1 <- paste("fit", i, sep = "")
nam2 <- paste("summary_fit", i, sep = "")
fit = lm(clicks ~ v1 + v2, data=data[1:i,])
tmp <- predict(fit, newdata=data2, se.fit=TRUE)
tmp1 <- fit
tmp2 <- summary(fit)
assign(nam, tmp)
assign(nam1, tmp1)
assign(nam2, tmp2)
}
All of the results you want will be stored in the data objects this creates.
For example:
> summary_fit10$r.squared
[1] 0.3087432
You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the three types of output objects like this:
rm(data,data1,data2,i,nam,nam1,nam2,fit,tmp,tmp1,tmp2)
frames <- ls()
frames.fit <- frames[1:98] #change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum <- frames[197:294]
fit.table <- data.frame(intercept=NA,v1=NA,v2=NA,sourcedf=NA)
for(i in 1:length(frames.fit)){
tmp <- get(frames.fit[i])
fit.table <- rbind(fit.table,c(tmp$coefficients[[1]],tmp$coefficients[[2]],tmp$coefficients[[3]],frames.fit[i]))
}
fit.table
> fit.table
intercept v1 v2 sourcedf
2 -0.0647017971121678 1.34929652763687 -0.300502017324518 fit10
3 -0.0401617893034109 -0.034750571912636 -0.0843076273486442 fit100
4 0.0132968863522573 1.31283604433593 -0.388846211083564 fit11
5 0.0315113918953643 1.31099122173898 -0.371130010135382 fit12
6 0.149582794027583 0.958692838785998 -0.299479715938493 fit13
7 0.00759688947362175 0.703525856001948 -0.297223988673322 fit14
8 0.219756240025917 0.631961979610744 -0.347851129205841 fit15
9 0.13389223748979 0.560583832333355 -0.276076134872669 fit16
10 0.147258022154645 0.581865844000838 -0.278212722024832 fit17
11 0.0592160359650468 0.469842498721747 -0.163187274356457 fit18
12 0.120640756525163 0.430051839741539 -0.201725012088506 fit19
13 0.101443924785995 0.34966728554219 -0.231560038360121 fit20
14 0.0416637001406594 0.472156988919337 -0.247684504074867 fit21
15 -0.0158319749710781 0.451944113682333 -0.171367482879835 fit22
16 -0.0337969739950376 0.423851304105399 -0.157905431162024 fit23
17 -0.109460218252207 0.32206642419212 -0.055331391802687 fit24
18 -0.100560410735971 0.335862465403716 -0.0609509815266072 fit25
19 -0.138175283219818 0.390418411384468 -0.0873106257144312 fit26
20 -0.106984355317733 0.391270279253722 -0.0560299858019556 fit27
21 -0.0740684978271464 0.385267011513678 -0.0548056844433894 fit28
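For what it's worth, here is a tidier sketch of the same bookkeeping that avoids assign()/get() by keeping everything in a list (object names here are mine):
## note: the cleanup step above rm()'d data, so re-create it first
fits <- lapply(3:nrow(data), function(i) lm(clicks ~ v1 + v2, data = data[1:i, ]))
coef.table <- do.call(rbind, lapply(fits, coef))       ## one row of coefficients per fit
r2 <- sapply(fits, function(f) summary(f)$r.squared)   ## R-squared per fit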
I have fitted a lm as follows:
data <- data.frame(x=rnorm(50), x2=runif(50), y=rnorm(50), g=rep(1:3,length.out=50))
model <- lm(y ~ x + x2 + factor(g), data=data)
I want to extract the coefficients for each level of the factor variable by referring to them by name, the way I would for a continuous variable like 'x':
model$coefficients["x"]
I have tried using:
> model$coefficients["g"]
<NA>
NA
But it fails since the levels are renamed as can be observed below:
> model$coefficients
(Intercept) x x2 factor(g)2 factor(g)3
0.60058881 0.01232678 -0.65508242 -0.25919674 -0.04841089
I have also tried using the displayed names using:
model$coefficients["factor(g)2"]
but it doesn't work. How can I get this right?
Many thanks.
In cases like this I always use the coef() function together with grep(). I would do something like this:
data <- data.frame(x=rnorm(50), x2=runif(50), y=rnorm(50), g=rep(1:3,length.out=50))
model <- lm(y ~ x + x2 + factor(g), data=data)
estimates <- coef(model)
# Just get the g:2
estimates[grep("^factor\\(g\\)2", names(estimates))]
# If you want to get both factors you just skip the 2
estimates[grep("^factor\\(g\\)", names(estimates))]
# This case does not really require fancy
# regular expressions so you could write
estimates[grep("factor(g)", names(estimates), fixed=TRUE)]
# This comes in much more handy when you have a more complex situation where
# coefficients have similar names
data <- data.frame(x=rnorm(50), great_g_var=runif(50), y=rnorm(50),
g_var=factor(rep(1:3,length.out=50)),
g_var2=factor(sample(1:3,size=50, replace=TRUE)))
model <- lm(y ~ x + great_g_var + g_var + g_var2, data=data)
estimates <- coef(model)
# Now if you want to do a simple fixed grep you could end up
# with unexpected estimates
estimates[grep("g_var", names(estimates), fixed=TRUE)]
# Returns:
# great_g_var g_var2 g_var3 g_var22 g_var23
# -0.361707955 -0.058988495 0.010967326 -0.008952616 -0.297461520
# Therefore you may want to use regular expressions, here's how you select g_var
estimates[grep("^g_var[0-9]$", names(estimates))]
# Returns:
# g_var2 g_var3
# -0.05898849 0.01096733
# And if you want to have g_var2 you write:
estimates[grep("^g_var2[0-9]$", names(estimates))]
# Returns:
# g_var22 g_var23
# -0.008952616 -0.297461520
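As a footnote: for the first model in this answer (y ~ x + x2 + factor(g)), indexing by the full displayed name normally does work, so if it returns NA, check the exact spelling first:
names(coef(model))          ## shows the exact coefficient names
coef(model)["factor(g)2"]   ## works when the name matches exactly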