Why is lm() not showing some output in R?

I was wondering why lm() reports 5 coefficients not defined because of singularities and then gives all NA for those 5 coefficients in the summary output.
Note that all my predictors are categorical.
Is there anything wrong with my data or code for these 5 coefficients? How can I fix this?
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
summary(vv)
First 6 lines of output:
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.17835 0.63573 0.281 0.779330
Age1 -0.04576 0.86803 -0.053 0.958010
Age2 0.46431 0.87686 0.530 0.596990
Age99 -1.64099 1.04830 -1.565 0.118949
genre2 1.57015 0.55699 2.819 0.005263 **
genre4 NA NA NA NA ## for example, all NA here; there are 4 more rows like this!

As others noted, one problem is that you seem to have multicollinearity. Another is that there are missing values in your dataset. The missing values should probably just be removed. As for correlated variables, you should inspect your data to identify this collinearity and remove it. Deciding which variables to remove and which to retain is a very domain-specific topic. However, if you wish, you could use regularisation and fit a model while retaining all variables. This also allows you to fit a model when n (number of samples) is less than p (number of predictors).
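Before removing anything, it is worth confirming which terms lm() actually dropped. A minimal check on the model from the question, using base R's alias() (which lists linearly dependent coefficients):
## list the aliased (linearly dependent) terms that lm() set to NA
alias(vv)
## the NA coefficients can also be spotted directly:
which(is.na(coef(vv)))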
I've shown code below that demonstrates how to examine the correlation structure within your data and identify which variables are most correlated (thanks to this answer). I've also included an example of fitting such a model using L2 regularisation (commonly known as ridge regression).
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
df <- d
df[] <- lapply(df, as.numeric)
cor_mat <- cor(as.matrix(df), use = "complete.obs")
library("gplots")
heatmap.2(cor_mat, trace = "none")
## https://stackoverflow.com/questions/22282531/how-to-compute-correlations-between-all-columns-in-r-and-detect-highly-correlate
library("tibble")
library("dplyr")
library("tidyr")
d2 <- df %>%
  as.matrix() %>%
  cor(use = "complete.obs") %>%
  ## Set diag (a vs a) to NA, then remove
  (function(x) {
    diag(x) <- NA
    x
  }) %>%
  as.data.frame() %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(!is.na(value)) %>%
  ## Sort by decreasing absolute correlation
  arrange(-abs(value))
## 2 pairs of variables are almost exactly correlated!
head(d2)
#> var1 var2 value
#> 1 id study.name 0.9999430
#> 2 study.name id 0.9999430
#> 3 Location timed 0.9994082
#> 4 timed Location 0.9994082
#> 5 Age ed.level 0.7425026
#> 6 ed.level Age 0.7425026
## Remove some variables here, or maybe try regularized regression (see below)
library("glmnet")
## glmnet requires matrix input
X <- d[, c("Age", "genre", "Length", "cf.training", "error.type", "cf.scope", "cf.type", "cf.revision")]
X[] <- lapply(X, as.numeric)
X <- as.matrix(X)
ind_na <- apply(X, 1, function(row) any(is.na(row)))
X <- X[!ind_na, ]
y <- d[!ind_na, "dint"]
fit_ridge <- glmnet(
  x = X,
  y = y,
  ## alpha = 0 is ridge regression
  alpha = 0)
plot(fit_ridge)
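If you go this route, the ridge penalty would normally be chosen by cross-validation. A minimal sketch with cv.glmnet from the same package, reusing the X and y built above:
## pick lambda by 10-fold cross-validation; alpha = 0 keeps it ridge
cvfit <- cv.glmnet(x = X, y = y, alpha = 0)
cvfit$lambda.min              ## penalty with lowest CV error
coef(cvfit, s = "lambda.min") ## coefficients at that penalty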
Created on 2019-11-08 by the reprex package (v0.3.0)

In such a situation you can use the "olsrr" package in R for stepwise regression analysis. Here is sample code to do stepwise regression in R:
library("olsrr")
#Load the data
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T)
# stepwise regression
vv <- lm(dint ~ Age + genre + Length + cf.training + error.type + cf.scope + cf.type + cf.revision, data = d)
summary(vv)
k <- ols_step_both_p(vv, pent = 0.05, prem = 0.1)
# stepwise regression plot
plot(k)
# final model
k$model
It will give you essentially the same output as SPSS's stepwise procedure.


How to bootstrap correlation using vectorised function applied to large matrix?

I understand how to bootstrap using the "boot" package in R, through the PDF for the package and also from these two examples on Stack Overflow: Bootstrapped correlation with more than 2 variables in R and Bootstrapped p-value for a correlation coefficient on R.
However, these are for small datasets (2 variables, or a matrix with 5 variables). I have a very large matrix (1000+ columns), and the code I use to compute the correlation between every metabolite pair (removing duplicates and correlations of a metabolite with itself) is:
x <- colnames(dat)
GetCor = function(x,y) cor(dat[,x], dat[,y], method="spearman")
GetCor = Vectorize(GetCor)
out <- data.frame(t(combn(x, 2)), stringsAsFactors = F) %>%
  mutate(v = GetCor(X1, X2))
I'm not sure how to alter this into the function I'd pass to statistic in boot, i.e.
boot_res <- boot(dat, ?, R=1000)
Or would I just need to obtain a matrix of the bootstrapped p-values or estimates, depending on the function code (colMeans(boot_res$t)), and get rid of the upper or lower triangle?
I'm curious to know the most efficient way of going about this problem.
Something like this? It follows more or less the same lines as my answer to the 2nd question you link to in your question.
Note that I have simplified the correlation code: cor accepts a data.frame or a matrix, so pass a two-column one and keep one of the off-diagonal correlation matrix elements.
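To illustrate that indexing: on a two-column input, cor() returns a 2x2 matrix whose off-diagonal element is the pairwise correlation (shown on iris; the value matches t1* in the bootstrap output below):
cor(iris[c("Sepal.Length", "Sepal.Width")], method = "spearman")[1, 2]
#> [1] -0.1667777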
library(boot)
bootPairwiseCor <- function(data, i) {
  d <- data[i, ]
  combn(d, 2, \(x) cor(x, method = "spearman")[1, 2])
}
dat <- iris[-5]
nms <- combn(colnames(dat), 2, paste, collapse = "_")
R <- 100L
b <- boot(dat, bootPairwiseCor, R)
b
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = dat, statistic = bootPairwiseCor, R = R)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* -0.1667777 0.0037142908 0.070552718
#> t2* 0.8818981 -0.0002851683 0.017783297
#> t3* 0.8342888 0.0006306610 0.021509280
#> t4* -0.3096351 0.0047809612 0.075976067
#> t5* -0.2890317 0.0045689001 0.069929108
#> t6* 0.9376668 -0.0014838117 0.009632318
data.frame(variables = nms, correlations = colMeans(b$t))
#> variables correlations
#> 1 Sepal.Length_Sepal.Width -0.1630634
#> 2 Sepal.Length_Petal.Length 0.8816130
#> 3 Sepal.Length_Petal.Width 0.8349194
#> 4 Sepal.Width_Petal.Length -0.3048541
#> 5 Sepal.Width_Petal.Width -0.2844628
#> 6 Petal.Length_Petal.Width 0.9361830
Created on 2023-01-28 with reprex v2.0.2
You may want to use cor.test to get theoretical t-values. We will use them for comparison with the B bootstrap t-values. (Recall: The p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.)
Here is a similar function to yours, but applying cor.test and extracting statistics.
corr_cmb <- \(X, boot = FALSE) {
  stts <- c('estimate', 'statistic', 'p.value')
  cmbn <- combn(colnames(X), 2, simplify = FALSE)
  a <- lapply(cmbn, \(x) as.data.frame(cor.test(X[, x[1]], X[, x[2]])[stts])) |>
    do.call(what = rbind) |>
    `rownames<-`(sapply(cmbn, paste, collapse = ':'))
  if (boot) {
    a <- a[, 'statistic']
  }
  a
}
We run it once on the data to get the theoretical solution.
rhat <- corr_cmb(dat)
head(rhat, 3)
# estimate statistic p.value
# V1:V2 0.06780426 2.1469547 0.03203729
# V1:V3 0.03471587 1.0973752 0.27274212
# V1:V4 0.05301563 1.6771828 0.09381987
Bootstrap
We can assume from the start that the bootstrap with 1000 columns will run for a while (choose(1000, 2) returns 499500 combinations). That's why we think about a multithreaded solution right away.
To bootstrap, we simply apply corr_cmb repeatedly on samples of the data drawn with replacement.
We will measure the runtime so we can extrapolate to the time needed for 1000 variables.
## setup clusters
library(parallel)
CL <- makeCluster(detectCores() - 1)
clusterExport(CL, c('corr_cmb', 'dat'))
t0 <- Sys.time() ## timestamp before run
B <- 1099L
clusterSetRNGStream(CL, 42)
boot_res <- parSapply(CL, 1:B, \(i) corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
t1 <- Sys.time() ## timestamp after run
stopCluster(CL)
After the bootstrap, we calculate the proportions of times the absolute (centered) bootstrap test statistics exceeded the theoretical ones (Ref.),
boot_p <- rowMeans(abs(boot_res - rowMeans(boot_res)) > abs(rhat$statistic))
and cbind the bootstrap p-values to the theoretical result.
cbind(rhat, boot_p)
# estimate statistic p.value boot_p
# V1:V2 0.06780426 2.1469547 0.03203729 0.03003003
# V1:V3 0.03471587 1.0973752 0.27274212 0.28028028
# V1:V4 0.05301563 1.6771828 0.09381987 0.08208208
# V1:V5 -0.01018682 -0.3218300 0.74764890 0.73473473
# V2:V3 0.03730133 1.1792122 0.23859474 0.23323323
# V2:V4 0.07203911 2.2817257 0.02271539 0.01201201
# V2:V5 0.03098230 0.9792363 0.32770055 0.30530531
# V3:V4 0.02364486 0.7471768 0.45513283 0.47547548
# V3:V5 -0.02864165 -0.9051937 0.36558126 0.38938939
# V4:V5 0.03415689 1.0796851 0.28054328 0.29329329
Note that the data used is fairly normally distributed. If the data is not normally distributed, the bootstrap p-values will diverge more from the theoretical ones.
To conclude, an estimate of the time needed for your 1000 variables.
d <- as.numeric(difftime(t1, t0, units='mins'))
n_est <- 1000
## m (number of variables) and dat are defined in the Data block below
t_est <- d/(choose(m, 2))*choose(n_est, 2)
cat(sprintf('est. runtime for %s variables: %s mins\n', n_est, round(t_est, 1)))
# est. runtime for 1000 variables: 1485.8 mins
(For the sake of completeness, here is a single-threaded version for smaller problems:)
## singlethreaded version
# set.seed(42)
# B <- 1099L
# boot_res <- replicate(B, corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
Data:
library(MASS)
n <- 1e3; m <- 5
Sigma <- matrix(.5, m, m)
diag(Sigma) <- 1
set.seed(42)
M <- mvrnorm(n, runif(m), Sigma)
M <- M + rnorm(length(M), sd=6)
dat <- as.data.frame(M)

How to nest tables in a column of a dataframe?

I read that it is possible to store dataframes in a column of a dataframe with nest:
https://tidyr.tidyverse.org/reference/nest.html
Is it also possible to store tables in a column of a dataframe?
The reason is that I would like to calculate the Kappa for every subgroup of a dataframe with caret. However, caret::confusionMatrix(t) expects a table as input.
In the example-code below this works fine if I calculate the Kappa for the complete dataframe at once:
library(tidyverse)
library(caret)
# generate some sample data:
n <- 100L
x1 <- rnorm(n, 1.0, 2.0)
x2 <- rnorm(n, -1.0, 0.5)
y <- rbinom(n, 1L, plogis(1 * x1 + 1 * x2))
my_factor <- rep( c('A','B','C','D'), 25 )
df <- cbind(x1, x2, y, my_factor)
# fit a model and make predictions:
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
# confusion matrix
probs_round <- round(probs)
t <- table(factor(probs_round, c(1,0)), factor(y, c(1,0)))
ccm <- caret::confusionMatrix(t)
# extract Kappa:
ccm$overall[2]
> Kappa
> 0.5232
However, if I try group_by to generate the Kappa for every factor subgroup (see code below), it does not succeed. I suppose I need to nest t in a certain way in df, though I don't know how:
# extract Kappa for every subgroup with same factor (NOT WORKING CODE):
df <- cbind(df, probs_round)
df <- as.data.frame(df)
output <- df %>%
  dplyr::group_by(my_factor) %>%
  dplyr::mutate(t = table(factor(probs_round, c(1,0)), factor(y, c(1,0)))) %>%
  summarise(caret::confusionMatrix(t))
Expected output:
>my_factor Kappa
>1 A 0.51
>2 B 0.52
>3 C 0.53
>4 D 0.54
Is this correct and is this possible?
(the exact values for Kappa will be different due to the randomness in the sample data)
Thanks a lot!
You could skip the intermediate mutate() that's giving you trouble and do:
library(dplyr)
library(caret)
df %>%
  group_by(my_factor) %>%
  summarize(t = confusionMatrix(table(factor(probs_round, c(1,0)),
                                      factor(y, c(1,0))))$overall[2])
Returns:
# A tibble: 4 x 2
my_factor t
<chr> <dbl>
1 A 0.270
2 B 0.513
3 C 0.839
4 D 0.555
The above approach is the easiest way to get the desired results. But just to show what's possible, we can use your approach with dplyr::nest_by(), which groups the data set rowwise.
In the approach below we calculate a separate glm for each subgroup. I'm not sure if that's what you want to do.
library(tidyverse)
library(caret)
# generate some sample data:
n <- 1000L
df <- tibble(x1 = rnorm(n, 1.0, 2.0),
             x2 = rnorm(n, -1.0, 0.5),
             y = rbinom(n, 1L, plogis(x1 + 1 * x1 + 1 * x2)),
             my_factor = rep(c('A','B','C','D'), 250))
output <- df %>%
  nest_by(my_factor) %>%
  mutate(y = list(data$y),
         mod = list(glm(y ~ x1 + x2,
                        family = "binomial",
                        data = data)),
         probs = list(predict(mod, type = "response")),
         probs_round = list(round(probs)),
         t = list(table(factor(probs_round, c(1, 0)),
                        factor(y, c(1, 0)))),
         ccm = caret::confusionMatrix(t)$overall[2])
output %>%
  pull(ccm)
#> Kappa Kappa Kappa Kappa
#> 0.7743682 0.7078112 0.7157761 0.7549340
Created on 2021-06-23 by the reprex package (v0.3.0)

Record linear regression results repeatedly

As shown in the following example, what I want to achieve is to run the regression many times and have R record the estimate of did in one data.frame each time.
Each time, I change the year condition in ifelse, e.g. ifelse(mydata$year >= 1993, 1, 0), so each time I run a different regression:
mydata$time = ifelse(mydata$year >= 1994, 1, 0)
Can anyone help? My basic code is as below (the data can be downloaded through a browser if R returns errors):
library(foreign)
mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")
mydata$time = ifelse(mydata$year >= 1994, 1, 0)
mydata$treated = ifelse(mydata$country == "E" | mydata$country == "F" | mydata$country == "G", 1, 0)
mydata$did = mydata$time * mydata$treated
didreg = lm(y ~ treated + time + did, data = mydata)
summary(didreg)
Generally, if you want to repeat a process many times with a different input each time, you need a function. The following function takes a scalar value year_value as its input, creates local variables for the regression, and returns the estimate for the model term did.
foo <- function (year_value) {
  ## create local variables from `mydata`
  y <- mydata$y
  treated <- as.numeric(mydata$country %in% c("E", "F", "G"))  ## use `%in%`
  time <- as.numeric(mydata$year >= year_value)  ## use `year_value`
  did <- time * treated
  ## run regression using local variables
  didreg <- lm(y ~ treated + time + did)
  ## return estimate for model term `did`
  coef(summary(didreg))["did", ]
}
foo(1993)
# Estimate Std. Error t value Pr(>|t|)
#-2.784222e+09 1.504349e+09 -1.850782e+00 6.867661e-02
Note there are several places where your original code can be improved: say, using %in% instead of multiple |, and using as.numeric instead of ifelse to coerce boolean to numeric.
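For example, these two lines compute the same treated indicator:
## identical results; `%in%` plus as.numeric() is the tidier form
treated <- ifelse(mydata$country == "E" | mydata$country == "F" | mydata$country == "G", 1, 0)
treated <- as.numeric(mydata$country %in% c("E", "F", "G"))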
Now you need something like a loop to iterate this function over several different year_value inputs. I would use lapply.
## raw list of result from `lapply`
year_of_choice <- 1993:1994 ## taken for example
result <- lapply(year_of_choice, foo)
## rbind them into a matrix
data.frame(year = year_of_choice, do.call("rbind", result), check.names = FALSE)
# year Estimate Std. Error t value Pr(>|t|)
#1 1993 -2784221881 1504348732 -1.850782 0.06867661
#2 1994 -2519511630 1455676087 -1.730819 0.08815711
Note, don't include year 1990 (the minimum of variable year) as a choice, otherwise time will be a vector of 1s, the same as the intercept. The resulting model is rank-deficient and you will get a "subscript out of bounds" error. R versions since 3.5.0 have a new complete argument to the generic function coef. So for stability we may use
coef(summary(didreg), complete = TRUE)["did", ]
But you should see all NA or NaN for year 1990.
Here is another option: we create a matrix for all the years, join it to mydata, gather to long format, nest by group, then run a regression for each group and extract the estimates. Note that "gt_et_*" stands for "greater than or equal to".
library(foreign)
library(dplyr)
library(tidyr)
library(purrr)
mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")
mtrx <- matrix(0, length(min(mydata$year):max(mydata$year)), length(min(mydata$year):max(mydata$year)))
mtrx[lower.tri(mtrx, diag = TRUE)] <- 1
df <- mtrx %>% as.data.frame() %>% mutate(year = min(mydata$year):max(mydata$year))
colnames(df) <- c(paste0("gt_et_", df$year), "year")
models <- df %>%
  full_join(., mydata, by = "year") %>%
  gather(mod, time, gt_et_1990:gt_et_1999) %>%
  nest(-mod) %>%
  mutate(data = map(data, ~mutate(.x, treated = ifelse(country == "E"|country == "F"|country == "G", 1, 0),
                                  did = time * treated)),
         mods = map(data, ~lm(y ~ treated + time + did, data = .x) %>% summary() %>% coef())) %>%
  unnest(mods %>% map(broom::tidy)) %>%
  filter(.rownames == "did") %>%
  select(-.rownames)
models
#> mod Estimate Std..Error t.value Pr...t..
#> 1 gt_et_1991 -2309823993 2410140350 -0.95837738 0.34137018
#> 2 gt_et_1992 -2036098728 1780081308 -1.14382344 0.25682856
#> 3 gt_et_1993 -2784221881 1504348732 -1.85078222 0.06867661
#> 4 gt_et_1994 -2519511630 1455676087 -1.73081886 0.08815711
#> 5 gt_et_1995 -2357323806 1455203186 -1.61992760 0.11001662
#> 6 gt_et_1996 250180589 1511322882 0.16553749 0.86902697
#> 7 gt_et_1997 405842197 1619653548 0.25057346 0.80292231
#> 8 gt_et_1998 -75683039 1852314277 -0.04085864 0.96753194
#> 9 gt_et_1999 2951694230 2452126428 1.20372840 0.23299421
Created on 2018-09-01 by the reprex package (v0.2.0).

predict.lm gives wrong number of predicted values when I fit and predict a model using a matrix variable

In the past I've used the lm function with both matrix-type and data.frame-type data. But I think this is the first time I've tried to use predict with a model fitted without a data.frame, and I can't figure out how to make it work.
I read some other questions (such as Getting Warning: " 'newdata' had 1 row but variables found have 32 rows" on predict.lm) and I'm pretty sure my problem is related to the coefficient names I'm getting after fitting the model. For some reason the coefficient names are a paste of the matrix name with the column name... and I haven't been able to find out how to fix that...
library(tidyverse)
library(MASS)
set.seed(1)
label <- sample(c(T,F), nrow(Boston), replace = T, prob = c(.6,.4))
x.train <- Boston %>% dplyr::filter(., label) %>%
  dplyr::select(-medv) %>% as.matrix()
y.train <- Boston %>% dplyr::filter(., label) %>%
  dplyr::select(medv) %>% as.matrix()
x.test <- Boston %>% dplyr::filter(., !label) %>%
  dplyr::select(-medv) %>% as.matrix()
y.test <- Boston %>% dplyr::filter(., !label) %>%
  dplyr::select(medv) %>% as.matrix()
fit_lm <- lm(y.train ~ x.train)
fit_lm2 <- lm(medv ~ ., data = Boston, subset = label)
predict(object = fit_lm, newdata = x.test %>% as.data.frame()) %>% length()
predict(object = fit_lm2, newdata = x.test %>% as.data.frame()) %>% length()
# they give different numbers of predicted values
# the first one gives a number of results consistent with x.train
Any help will be welcome.
I can't fix your tidyverse code because I don't work with those packages, but I can explain why predict fails in the first case.
Let me just use the built-in dataset trees for a demonstration:
head(trees, 2)
# Girth Height Volume
#1 8.3 70 10.3
#2 8.6 65 10.3
The normal way to use lm is
fit <- lm(Girth ~ ., trees)
The variable names (on the RHS of ~) are
attr(terms(fit), "term.labels")
#[1] "Height" "Volume"
You need to provide these variables in the newdata when using predict.
predict(fit, newdata = data.frame(Height = 1, Volume = 2))
# 1
#11.16125
Now if you fit a model using a matrix:
X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)
attr(terms(fit2), "term.labels")
#[1] "X"
The variable you need to provide in newdata for predict is now X, not Height or Girth. Note that since X is a matrix variable, you need to protect it with I() when feeding it to a data frame.
newdat <- data.frame(X = I(cbind(1, 2)))
str(newdat)
#'data.frame': 1 obs. of 1 variable:
# $ X: AsIs [1, 1:2] 1 2
predict(fit2, newdat)
# 1
#11.16125
It does not matter that cbind(1, 2) has no column names. What is important is that this matrix is named X in newdat.
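Applying this to the Boston example in the question (a sketch, untested): the single term of fit_lm is named x.train, so newdata needs a matrix column of exactly that name:
## hypothetical fix: supply the test matrix under the term name "x.train"
length(predict(fit_lm, newdata = data.frame(x.train = I(x.test))))
## now consistent with nrow(x.test)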

Why would my residuals for a linear model in R have one less value than the data set I used to create it?

I created a linear model in R using county data from the ACS. There are 3140 entries in my data set, and they all have their corresponding fips codes. I'm trying to make a map of the residuals from my linear model, but I only have 3139 residuals. Does anyone know if there's something R does when creating a linear model that is responsible for this, and how I can fix it so that I can create this map? Thanks!
In response to the suggestion of checking for NAs, I ran this:
which(completedata$fipscode == NA)
integer(0)
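(Note: x == NA always evaluates to NA, so this check can never find missing values. A check along these lines, using base R's is.na() and complete.cases(), would work instead:)
## is.na(), not == NA, detects missing values
which(is.na(completedata$fipscode))
sum(!complete.cases(completedata))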
R code if it helps:
sectorcodes <- read.csv("sectorcodes1.csv") # ruralurbancode, median hh income
sectorcodesdf <- data.frame(sectorcodes)
religion <- read.csv("Religion2.csv")
religiondf <- data.frame(religion)
merge1 <- merge(sectorcodesdf,religiondf, by = c('fipscode'))
merge1df <- data.frame(merge1)
family <- read.csv("censusdataavgfamsize.csv") #avgfamilysize
familydf <- data.frame(family)
merge2 <- merge(merge1df, familydf, by = c('fipscode'))
merge2df <- data.frame(merge2)
gradrate <- read.csv("censusdatahsgrad.csv")
gradratedf <- data.frame(gradrate)
evenmoredata2 <- merge(gradrate,merge2df, by=c("fipscode"))
#write.csv(evenmoredata2, file = "completedataset.csv")
completedata <- read.csv("completedataset.csv")
completedatadf <- data.frame(completedata)
lm8 <- lm(completedatadf$hsgrad ~ completedatadf$averagefamilysize*completedatadf$Rural_urban_continuum_code_2013*completedatadf$TOTADH*completedatadf$Median_Household_Income_2016)
summary(lm8)
library(blscrapeR)
library(RgoogleMaps)
library(choroplethr)
library(acs)
attach(acs)
require(choroplethr)
dataframe1 <- data.frame(completedatadf$fipscode,completedatadf$averagefamilysize)
names(dataframe1) <- c("region","value")
dataframe2 <- data.frame(completedata$fipscode,completedata$hsgrad)
names(dataframe2) <- c("region","value")
residdf <- data.frame(lm8$residuals)
dataframe3 <- data.frame(completedata$fipscode,lm8$residuals)
names(dataframe3) <- c("region","value")
county_choropleth(dataframe1)
county_choropleth(dataframe2)
county_choropleth(dataframe3)
When I try to run dataframe3, the error message is:
dataframe3 <- data.frame(completedata$fipscode,lm8$residuals)
Error in data.frame(completedata$fipscode, lm8$residuals) :
arguments imply differing number of rows: 3140, 3139
This can be caused by an NA in the response. For example, using the builtin BOD data frame note that there are 5 residuals in this example but 6 rows in b:
b <- BOD
b[3, 2] <- NA
nrow(b)
## [1] 6
fm <- lm(demand ~ Time, b)
resid(fm)
## 1 2 4 5 6
## -0.3578947 -0.2657895 1.6184211 -0.6894737 -0.3052632
We can handle that by specifying na.action = na.exclude when running lm. Note that now there are 6 residuals with the extra one being NA.
fm <- lm(demand ~ Time, b, na.action = na.exclude)
resid(fm)
## 1 2 3 4 5 6
## -0.3578947 -0.2657895 NA 1.6184211 -0.6894737 -0.3052632
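Applied to the model in the question (a sketch, assuming the same variables): refitting with na.action = na.exclude keeps the NA slot, so the residuals line up with all 3140 fips codes again.
lm8 <- lm(hsgrad ~ averagefamilysize * Rural_urban_continuum_code_2013 *
            TOTADH * Median_Household_Income_2016,
          data = completedatadf, na.action = na.exclude)
dataframe3 <- data.frame(region = completedatadf$fipscode,
                         value = resid(lm8))  ## 3140 rows, one value NA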
Try
data.frame(na.omit(completedata$fipscode), lm8$residuals)
Probably, your data has NA values.
