As the title suggests, I have seen some users mention that .lm.fit() has a speed advantage over a regular lm() call. But when I look deeper at .lm.fit() in the help, it is a bare fitter function: it returns a plain list instead of a model object. That makes me wonder whether it is still possible to extract components like the R-squared and adjusted R-squared, and, lastly, to do a predict() with it.
Below are sample data and timings:
library(tictoc)  # for the tic()/toc() timers used below

test_dat <- data.frame(y = rnorm(780, 20, 10))
for (b in 1:300) {
  name_var <- paste0("x", b)
  test_dat[[name_var]] <- rnorm(780, 0.01 * b, 5)
}
tic()
obj_lm <- lm(y ~ ., data = test_dat)
print(class(obj_lm))
print(summary(obj_lm)$r.squared)
print(summary(obj_lm)$adj.r.squared)
predict(obj_lm)
toc() #approximately 0.4 seconds
tic()
datm <- as.matrix(test_dat)
obj_lm_fit <- .lm.fit(cbind(1,datm[,-1]), datm[,1])
print(class(obj_lm_fit))
toc() #approximately 0.2 seconds
The functions predict and resid are generic, and since .lm.fit returns an object of (implicit) class "list", all you have to do is write methods implementing the definitions of what you want. Below are methods to compute fitted values, residuals, and R^2.
set.seed(2023) # make the results reproducible
test_dat <- data.frame(y = rnorm(780, 20, 10))
for (b in 1:300) {
  name_var <- paste0("x", b)
  test_dat[[name_var]] <- rnorm(780, 0.01 * b, 5)
}
obj_lm <- lm(y ~ ., data = test_dat)
datm <- as.matrix(test_dat)
obj_lm_fit <- .lm.fit(cbind(1,datm[,-1]), datm[,1])
#------------------------------------------------------------------------
# the methods for objects of class "list"
#
fitted.list <- function(object, X) {
  X %*% coef(object)
}
resid.list <- residuals.list <- function(object, X, y) {
  y_fitted <- fitted(object, X)
  y - y_fitted
}
rsquared <- function(x, ...) UseMethod("rsquared")
rsquared.default <- function(x, ...) {
  summary(x)$r.squared
}
rsquared.list <- function(object, X, y) {
  e <- resid.list(object, X, y)
  1 - sum(e^2)/sum((y - mean(y))^2)
}
rsquared(obj_lm_fit, cbind(1,datm[,-1]), datm[,1])
#> [1] 0.3948863
rsquared(obj_lm)
#> [1] 0.3948863
Created on 2023-01-03 with reprex v2.0.2
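Since predict is generic as well, the same pattern covers the asker's predict() question. Below is a minimal sketch; newX is a hypothetical new design matrix that must include the intercept column, and the pivot handling accounts for the column pivoting that .lm.fit() can perform (not needed in this full-rank example, where the pivot is the identity):
predict.list <- function(object, newX, ...) {
  b <- coef(object)
  # .lm.fit() may return coefficients in pivoted column order;
  # restore the original order before multiplying
  if (!is.null(object$pivot)) b[object$pivot] <- b
  drop(newX %*% b)
}
# in-sample predictions agree with the lm() fit
head(predict(obj_lm_fit, cbind(1, datm[, -1])))
head(predict(obj_lm))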
Edit 1
Added a method to also calculate the adjusted R^2.
adj_rsquared_list <- function(object, X, y) {
  r2 <- rsquared.list(object, X, y)
  k <- ncol(X) - 1
  n <- nrow(X)
  rate_of_error <- (1 - r2) * (n - 1) / (n - k - 1)
  adj_r2 <- 1 - rate_of_error
  return(adj_r2)
}
adj_rsquared_list(obj_lm_fit, cbind(1,datm[,-1]), datm[,1])
#> [1] 0.01590061
Edit 2
After the edit by Jovan, I have changed fitted.list above to use coef(), a function that extracts the first argument's list member "coefficients", if it exists, and rewrote the default and list methods of rsquared to accept an adj argument. The code to compute the adjusted R^2 is a copy and paste of Jovan's code.
rsquared <- function(x, ...) UseMethod("rsquared")
rsquared.default <- function(x, adj = FALSE, ...) {
  if (adj) {
    summary(x)$adj.r.squared
  } else summary(x)$r.squared
}
rsquared.list <- function(object, X, y, adj = FALSE) {
  e <- resid.list(object, X, y)
  r2 <- 1 - sum(e^2)/sum((y - mean(y))^2)
  if (adj) {
    k <- ncol(X) - 1
    n <- nrow(X)
    rate_of_error <- (1 - r2) * (n - 1) / (n - k - 1)
    adj_r2 <- 1 - rate_of_error
    adj_r2
  } else r2
}
# same as above
rsquared(obj_lm_fit, cbind(1,datm[,-1]), datm[,1])
#> [1] 0.3948863
rsquared(obj_lm)
#> [1] 0.3948863
# new, `adj = TRUE`
rsquared(obj_lm_fit, cbind(1,datm[,-1]), datm[,1], adj = TRUE)
#> [1] 0.01590061
rsquared(obj_lm, adj = TRUE)
#> [1] 0.01590061
Created on 2023-01-03 with reprex v2.0.2
Related
I'm trying to write an R function that takes a felm object as an argument. The function doesn't seem to recognize the felm object inside the function, however, if I run the code outside of a function, it works fine. Can someone help me understand why this doesn't work? Thanks! Reproducible example below.
library(sandwich)
library(lfe)
set.seed(2020)
x <- rnorm(1000)
x2 <- rnorm(length(x))
y <- x + 0.5*x2
adj_test <- diag(1, 1000, 1000)
test_felm <- felm(y ~ x + x2 | 0 | 0 | 0)
vcov.adjacency.robust <- function(felm_object, adjacency.matrix, estfunc = sandwich::estfun) {
  eef <- estfunc(felm_object)
  N <- nrow(adjacency.matrix)
  m <- crossprod(eef, adjacency.matrix %*% eef)
  sandwich(felm_object, meat = as.matrix(m) / N)
}
vcov.adjacency.robust(test_felm, adj_test)
Error in model.matrix(x) : object 'felm_object' not found
But running the same steps outside of a function works fine:
eef <- sandwich::estfun(test_felm)
N <- nrow(adj_test)
m <- crossprod(eef, adj_test %*% eef)
sandwich(test_felm, meat = as.matrix(m) / N)
(Intercept) x x2
(Intercept) 2.772862e-33 2.615412e-34 2.335601e-35
x 2.615412e-34 7.750617e-33 7.657461e-34
x2 2.335601e-35 7.657461e-34 2.947959e-33
sandwich::estfun is calling a method,
sandwich::estfun
# function (x, ...)
# {
# UseMethod("estfun")
# }
# <bytecode: 0x00000000142bb260>
# <environment: namespace:sandwich>
and the method doesn't seem to be accessible within the function. So calling the right function directly, which is sandwich:::estfun.lm, fixes the problem.
vcov.adjacency.robust <- function(felm_object, adjacency.matrix,
                                  estfunc = sandwich:::estfun.lm) {
  eef <- estfunc(felm_object)
  N <- nrow(adjacency.matrix)
  m <- crossprod(eef, adjacency.matrix %*% eef)
  sandwich(felm_object, meat = as.matrix(m) / N)
}
(res <- vcov.adjacency.robust(test_felm, adj_test))
# (Intercept) x x2
# (Intercept) 2.772862e-33 2.615412e-34 2.335601e-35
# x 2.615412e-34 7.750617e-33 7.657461e-34
# x2 2.335601e-35 7.657461e-34 2.947959e-33
eef <- estfun(test_felm)
N <- nrow(adj_test)
m <- crossprod(eef, adj_test %*% eef)
check <- sandwich(test_felm, meat = as.matrix(m) / N)
stopifnot(all.equal(res, check))
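As an aside, reaching into the namespace with ::: can break across sandwich versions. A sketch of an alternative (assuming sandwich is attached, as above) is to look up the registered S3 method explicitly:
# look up the registered S3 method instead of using ":::"
estfun_lm <- utils::getS3method("estfun", "lm")
res2 <- vcov.adjacency.robust(test_felm, adj_test, estfunc = estfun_lm)
stopifnot(all.equal(res, res2))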
I am trying to implement logistic regression. The function works when I run its steps manually, but for some reason I get the error "Error in nrow(X) : object 'X' not found", even though X is defined before the nrow command. I use the UCI "Adult" data set to test it.
If I run the steps manually, there is no error. Can anyone explain that?
# Sigmoid function
sigmoid <- function(z){
  g <- 1/(1 + exp(-z))
  return(g)
}
# Cost function
cost <- function(theta){
  n <- nrow(X)
  g <- sigmoid(X %*% theta)
  J <- (1/n)*sum((-Y*log(g)) - ((1-Y)*log(1-g)))
  return(J)
}
log_reg <- function(datafr, m){
  # train/test data split
  sample <- sample(1:nrow(datafr), m)
  df_train <- datafr[sample, ]
  df_test <- datafr[-sample, ]
  num_features <- ncol(datafr) - 1
  num_label <- ncol(datafr)
  label_levels <- levels(datafr[, num_label])
  datafr[, num_features + 1] <- ifelse(datafr[, num_label] == names(table(datafr[, num_label]))[1], 0, 1)
  # Predictor variables
  X <- as.matrix(df_train[, 1:num_features])
  X_test <- as.matrix(df_test[, 1:num_features])
  # Add a column of ones (intercept) to X
  X <- cbind(rep(1, nrow(X)), X)
  X_test <- cbind(rep(1, nrow(X_test)), X_test)
  # Response variable
  Y <- as.matrix(df_train[, num_label])
  Y <- ifelse(Y == names(table(Y))[1], 0, 1)
  Y_test <- as.matrix(df_test[, num_label])
  Y_test <- ifelse(Y_test == names(table(Y_test))[1], 0, 1)
  # Initial theta
  initial_theta <- rep(0, ncol(X))
  # Derive theta using gradient descent via the optim function
  theta_optim <- optim(par = initial_theta, fn = cost)
  predictions <- ifelse(sigmoid(X_test %*% theta_optim$par) >= 0.5, 1, 0)
  # Generalization error
  error_rate <- sum(predictions != Y_test)/length(Y_test)
  return(error_rate)
}
### Adult Data
data <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
sep = ',', fill = F, strip.white = T)
colnames(data) <- c('age', 'workclass', 'fnlwgt', 'education',
'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')
# Feature selection
datafr <- data[, c("age", "education_num", "hours_per_week", "income")]
log_reg(datafr = datafr, m = 20)
You are calling cost(), which refers to X, but X is not defined inside cost(); and because cost() was defined in the global environment, R does not see the X created inside log_reg() either. Either define cost() within log_reg() after you have defined X, or, better, make X a parameter of cost().
cost <- function(theta, X, Y){
  n <- nrow(X)
  g <- sigmoid(X %*% theta)
  J <- (1/n)*sum((-Y*log(g)) - ((1-Y)*log(1-g)))
  return(J)
}
And later
theta_optim <- optim(par=initial_theta, fn=cost, X=X, Y=Y)
In general, try to avoid using variables inside a function that are not passed explicitly as arguments to it. Otherwise you will always end up with problems like this one.
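A minimal illustration of the scoping rule at play, with made-up names: R looks up free variables in the environment where a function was defined, not where it is called.
h <- function() X                  # defined in the global environment
f <- function() { X <- 42; h() }   # this X is local to f
# f() fails with "object 'X' not found": h() searches the global
# environment (its definition environment), not f()'s local frame.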
Also, how did I find this out? I used traceback():
> traceback()
5: nrow(X) at #2
4: fn(par, ...)
3: (function (par)
fn(par, ...))(c(0, 0, 0, 0))
2: optim(par = initial_theta, fn = cost) at #33
1: log_reg(datafr = datafr, m = 20)
I have a problem when I try to run the dffits() function on an object from my own logistic regression implementation.
When I run dffits(log) I get the error message:
Error in if (model$rank == 0) { : argument is of length zero
However, when I use the built-in glm function (family = binomial), dffits(glm) works just fine.
Here is my function for the logistic regression and a short example of my problem:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mydata$admit <- factor(mydata$admit)
logRegEst <- function(x, y, threshold = 1e-10, maxIter = 100)
{
  calcPi <- function(x, beta)
  {
    beta <- as.vector(beta)
    return(exp(x %*% beta) / (1 + exp(x %*% beta)))
  }
  beta <- rep(0, ncol(x)) # initial guess for beta
  diff <- 1000
  # initial value bigger than threshold so that we can enter our while loop
  iterCount <- 0
  # counter to ensure we're not stuck in an infinite loop
  while (diff > threshold) # tests for convergence
  {
    pi <- as.vector(calcPi(x, beta))
    # calculate pi by using the current estimate of beta
    W <- diag(pi * (1 - pi)) # calculate matrix of weights W
    beta_change <- solve(t(x) %*% W %*% x) %*% t(x) %*% (y - pi)
    # calculate the change in beta
    beta <- beta + beta_change # new beta
    diff <- sum(beta_change^2)
    # calculate how much we changed beta by in this iteration;
    # if this is less than threshold, we'll break the while loop
    iterCount <- iterCount + 1
    # stop if we have hit the maximum number of iterations
    if (iterCount > maxIter) {
      stop("This isn't converging.")
    }
  }
  df <- length(y) - ncol(x)
  # degrees of freedom: the length of y minus the number of x columns
  vcov <- solve(t(x) %*% W %*% x)
  list(coefficients = beta, vcov = vcov, df = df)
  # returning results
}
logReg <- function(formula, data)
{
  mf <- model.frame(formula = formula, data = data)
  # model.frame() returns a data.frame with the variables needed to use the
  # formula.
  x <- model.matrix(attr(mf, "terms"), data = mf)
  # model.matrix() creates a design matrix; for example, a factor variable
  # is expanded into dummy variables of ones and zeros.
  y <- as.numeric(model.response(mf)) - 1
  # model.response() gives us the response variable.
  est <- logRegEst(x, y)
  # Now we have the starting position to apply our function from above.
  est$formula <- formula
  est$call <- match.call()
  est$data <- data
  # We add the formula and the call to the list.
  est$x <- x
  est$y <- y
  # We add x and y to the list.
  class(est) <- "logReg"
  # defining the class
  est
}
log <- logReg(admit ~ gre + gpa, data= mydata)
glm <- glm(admit ~ gre + gpa, data= mydata, family = binomial)
dffits(glm)
dffits(log)
log$data
glm$data
I don't understand why model$rank should have length 0, because when I look at log$data I see that rank is defined just as it is in glm$data.
I really appreciate your help!
I am trying to run a multivariate regression with different layers in a RasterStack using focal {raster} or localFun {raster}. With the help of a similar post and the raster reference manual, my code works fine with single RasterLayers as input (see the reproducible, albeit probably 'clunky', example below). However, I would like to do this using different layers of a RasterStack, as attempted in SECTION2 of the code below.
I would very much appreciate any advice.
Thank you
CODE:
library(raster)
#%%%%%%%%%%%%%%%%%%%%%
## SECTION1
#%%%%%%%%%%%%%%%%%%%%%
# create test data
set.seed(0)
resp = expl = raster(nrow=10, ncol=10)
# response variable
resp = setValues(resp,runif(100,min=15,max=45))
# explanatory variable
expl = setValues(expl,runif(100,min=2,max=6))
expl = expl * resp
resp[1:5] = NA; expl[1:5] = NA # add some NA values
par(mfrow=c(1,2))
plot(resp); plot(expl)
#..............................................................
# check global lm() results
data1.df = na.omit(as.data.frame(stack(list(resp=resp,expl=expl))))
head(data1.df)
data1.lm = lm(resp ~ expl, data=data1.df)
(data1.lmSum = summary(data1.lm))
data1.lmSum$coefficients[1];data1.lmSum$coefficients[2];data1.lmSum$coefficients[8]
data1.lmSum$r.squared
data1.lmSum$sigma
# pf(data1.lmSum$fstatistic[1], data1.lmSum$fstatistic[2], data1.lmSum$fstatistic[3],lower.tail = FALSE)
#..............................................................
# lm function for focal {raster} with RasterLayers
# output coefficients, r-squared, residual standard error and p-value(F stat)
# Calculate focal ("moving window") weight
fw = focalWeight(resp, 2, "Gauss")
# focal regression:
lm.focal = function(x, y, ...) {
  if (all(is.na(x) & is.na(y))) {
    NA
  } else {
    m = lm(y ~ x)
    summary(m)$r.squared # r-squared
    # summary(m)$coefficients # intercept and slope together
    # ---> Error in setValues(x, value) : cannot use a matrix with these dimensions
    # summary(m)$coefficients[1] # intercept
    # summary(m)$coefficients[2] # slope
    # summary(m)$coefficients[8] # p-value
    # summary(m)$sigma # residual standard error
  }
}
#---> How to output all at once?
lm.focal.out1 = localFun(resp, expl, w=fw, fun=lm.focal, na.rm=TRUE)
plot(lm.focal.out1)
#%%%%%%%%%%%%%%%%%%%%%
## SECTION2
#%%%%%%%%%%%%%%%%%%%%%
# create test data
set.seed(1)
resp = expl1 = expl2 = expl3 = expl4 = raster(nrow=10, ncol=10)
# x1 response variable
resp = setValues(resp,runif(100,min=15,max=45))
# x3 explanatory variables
expl1 = setValues(expl,runif(100,min=2,max=6))
expl1 = expl1 * resp
expl2 = expl1 * resp/runif(100,min=1,max=4)
expl3 = ((expl1 * resp) / 1.5 )/10
expl4 = ((expl1 * resp) / runif(100,min=0.5,max=2))/100
# add some NA values
resp[1:5] = NA; expl1[1:5] = NA; expl2[1:5] = NA; expl3[1:5] = NA; expl4[1:5] = NA
#stack RasterLayers
stack1 = stack(list(resp=resp,expl1=expl1,expl2=expl2,expl3=expl3,expl4=expl4))
# par(mfrow=c(1,1))
plot(stack1)
#..............................................................
# check global lm() results
stack1.df = na.omit(as.data.frame(stack1))
head(stack1.df)
stack1.lm = lm(resp ~ expl1+expl2+expl3+expl4, data=stack1.df)
(stack1.lmSum = summary(stack1.lm))
stack1.lmSum$coefficients[1]
stack1.lmSum$coefficients[2];stack1.lmSum$coefficients[3];stack1.lmSum$coefficients[4];stack1.lmSum$coefficients[5]
stack1.lmSum$r.squared
stack1.lmSum$sigma
pf(stack1.lmSum$fstatistic[1], stack1.lmSum$fstatistic[2], stack1.lmSum$fstatistic[3],lower.tail = FALSE)
#..............................................................
# lm function for focal {raster} with RasterStack
# output coefficients, r-squared, residual standard error and p-value(F stat)
# Calculate focal ("moving window") weight
fw.s = focalWeight(stack1, 2, "Gauss")
# focal regression with raster stack:
lm.focal.stack = function(x, ...) {
  if (all(is.na(x))) {
    NA
  } else {
    m = lm(x[1] ~ x[2] + x[3] + x[4] + x[5])
    summary(m)$r.squared # r-squared
    # summary(m)$coefficients # intercept and slope together
    # ---> Error in setValues(x, value) : cannot use a matrix with these dimensions
    # summary(m)$coefficients[1] # intercept
    # summary(m)$coefficients[2] # slope
    # pf(summary(m)$fstatistic[1], summary(m)$fstatistic[2], summary(m)$fstatistic[3], lower.tail = FALSE) # p-value
    # summary(m)$sigma # residual standard error
  }
}
#---> How to output all at once?
lm.focal.stack.out1 <- focal(stack1, w=fw.s, fun=lm.focal.stack, na.rm=TRUE)
#---> unable to find an inherited method for function ‘focal’ for signature ‘"RasterStack"’
#plot(lm.focal.stack.out1)
#-----------------------------------------------------------
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] raster_2.5-8 sp_1.2-3
loaded via a namespace (and not attached):
[1] rgdal_1.1-10 tools_3.3.1 Rcpp_0.12.5 grid_3.3.1 lattice_0.20-33
Not sure if you still need this answered, but I had the same issue and wrote a function called localFunStack that returns the vector output of the local function as a RasterStack object, with a little hack to get the right layer names:
# localFun modified to write out a layer stack
localFunStack <- function(x, y, ngb = 5, fun, ...) {
  compareRaster(x, y)
  rasterList <- list()
  nc1 <- 1:(ngb*ngb)
  nc2 <- ((ngb*ngb) + 1):(2*(ngb*ngb))
  if (canProcessInMemory(x, n = 2*ngb)) {
    vx <- getValuesFocal(x, 1, nrow(x), ngb = ngb)
    vy <- getValuesFocal(y, 1, nrow(y), ngb = ngb)
    v <- apply(cbind(vx, vy), 1, function(x, ...) fun(x[nc1], x[nc2], ...))
    for (j in 1:nrow(v)) {
      if (length(rasterList) < j) {
        rasterList[[j]] <- raster(x)
      }
      values(rasterList[[j]]) <- v[j, ]
    }
  }
  else {
    tr <- blockSize(x)  # was blockSize(out); 'out' is never defined here
    pb <- pbCreate(tr$n, label = 'localFun', ...)
    for (i in 1:tr$n) {
      vx <- getValuesFocal(x, tr$row[i], tr$nrows[i], ngb = ngb)
      vy <- getValuesFocal(y, tr$row[i], tr$nrows[i], ngb = ngb)
      v <- apply(cbind(vx, vy), 1, function(x, ...) fun(x[nc1], x[nc2], ...))
      for (j in 1:nrow(v)) {
        if (length(rasterList) < j) {
          rasterList[[j]] <- raster(x)
        }
        rasterList[[j]] <- writeValues(rasterList[[j]], v[j, ], tr$row[i])
      }
    }
  }
  return(stack(rasterList))
}
# local regression function
lm.focal <- function(x, y, ...) {
  if (all(is.na(x) & is.na(y)) || all(is.na(x)) || all(is.na(y))) {
    rep(NA, 8)
  } else {
    m <- lm(y ~ x)
    coef <- summary(m)$coef
    if (nrow(coef) == 1) { # add NAs for cases where the response is constant
      coef <- rbind(coef, rep(NA, 4))
      rownames(coef) <- rownames(nm)
    }
    coef <- as.vector(coef)
    names(coef) <- c(outer(rownames(nm), colnames(nm), FUN = paste, sep = " "))
    coef
    # summary(m)$r.squared # r-squared
    # summary(m)$sigma # residual standard error
  }
}
# template coefficient table for the layer names; lm.focal() refers to nm,
# so define it before calling localFunStack() (here fitted on the global
# data frame data1.df from the question)
m <- lm(resp ~ expl, data = data1.df)
nm <- summary(m)$coefficients
lm.focal.out = localFunStack(expl, resp, ngb = 5, fun = lm.focal, na.rm = TRUE)
names(lm.focal.out) <- c(outer(rownames(nm), colnames(nm), FUN = paste, sep = " "))
plot(lm.focal.out)
I am not able to apply the ucminf() function to minimise my cost function in R.
Here is my cost function:
costfunction <- function(X, y, theta){
  m <- length(y)
  J <- 1/m * ((-t(y) %*% log(sigmoid(as.matrix(X) %*% as.matrix(theta)))) -
                ((1 - t(y)) %*% log(1 - sigmoid(as.matrix(X) %*% as.matrix(theta)))))
}
Here is my sigmoid function:
sigmoid <- function(t){
  g <- 1/(1 + exp(-t))
}
Here is my gradient function:
gradfunction <- function(X, y, theta){
  # note: the closing parenthesis belongs after sigmoid(...), so that y is
  # subtracted from the sigmoid output rather than inside it; m is read
  # from the global environment
  grad <- 1/m * t(X) %*% (sigmoid(as.matrix(X) %*% as.matrix(theta)) - y)
}
I am trying to do the following:
library("ucminf")
data <- read.csv("ex2data1.txt",header=FALSE)
X <<- data[,c(1,2)]
y <<- data[,3]
qplot(X[,1],X[,2],colour=factor(y))
m <- dim(X)[1]
n <- dim(X)[2]
X <- cbind(1,X)
initial_theta <<- matrix(0,nrow=n+1,ncol=1)
cost <- costfunction(X,y,initial_theta)
grad <- gradfunction(X,y,initial_theta)
This is where I want to call ucminf to find the minimum cost and values of theta. I am not sure how to do this.
It looks like you are trying to do the week 2 exercise of Coursera's Machine Learning course.
There is no need for the ucminf package here; you can simply use base R's optim(), and it works.
We will define the sigmoid and cost functions first.
sigmoid <- function(z)
  1 / (1 + exp(-z))

costFunction <- function(theta, X, y) {
  m <- length(y)
  J <- -(1 / m) * crossprod(c(y, 1 - y),
                            c(log(sigmoid(X %*% theta)), log(1 - sigmoid(X %*% theta))))
  grad <- (1 / m) * crossprod(X, sigmoid(X %*% theta) - y)
  list(J = J, grad = grad)
}
Let's load the data now. To make this code reproducible, I put the data in my Dropbox.
download.file("https://dl.dropboxusercontent.com/u/8750577/ex2data1.txt",
              method = "curl", destfile = "/tmp/ex2data1.txt")
data <- matrix(scan('/tmp/ex2data1.txt', what = double(), sep = ","),
               ncol = 3, byrow = TRUE)
X <- data[, 1:2]
y <- data[, 3, drop = FALSE]
m <- nrow(X)
n <- ncol(X)
X <- cbind(1, X)
initial_theta = matrix(0, nrow = n + 1)
We can then compute the result of the cost function at the initial theta like this
cost <- costFunction(initial_theta, X, y)
(grad <- cost$grad)
## [,1]
## [1,] -0.100
## [2,] -12.009
## [3,] -11.263
(cost <- cost$J)
## [,1]
## [1,] 0.69315
Finally, we can use optim to get the optimal theta:
res <- optim(par = initial_theta,
             fn = function(t) costFunction(t, X, y)$J,
             gr = function(t) costFunction(t, X, y)$grad,
             method = "BFGS", control = list(maxit = 400))
(theta <- res$par)
## [,1]
## [1,] -25.08949
## [2,] 0.20566
## [3,] 0.20089
(cost <- res$value)
## [1] 0.2035
If you have a problem with the download.file call, the data can also be downloaded here.
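As a quick sanity check (an addition, assuming X with its intercept column and y are still in the workspace), base R's glm() with family = binomial fits the same model and should give nearly identical coefficients:
# cross-check against glm(); as.vector() because y is a one-column matrix,
# and X[, -1] drops the intercept column that glm() adds itself
fit_glm <- glm(as.vector(y) ~ X[, -1], family = binomial)
coef(fit_glm)  # should be close to theta above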
As you did not provide a reproducible example, it is hard to give you exactly the code you need, but the general idea is to hand the functions over to ucminf:
ucminf(start, costfunction, gradfunction, y = y, theta = initial_theta)
Note that start needs to be a vector of initial starting values which, when handed over as X to the two functions, produces a result. Usually you use random starting values (e.g., from runif).
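For instance, here is a sketch of a concrete call, assuming the X (with intercept column), y, m, costfunction, gradfunction and sigmoid from the question are defined; it optimises over theta by wrapping the question's functions so that the parameter vector comes first, the way ucminf expects:
library(ucminf)
# wrappers: ucminf passes the parameter vector first and forwards extra
# named arguments (X, y) via `...` to fn and gr; drop() turns the 1x1
# matrix / column vector results into a scalar and a plain vector
cost_theta <- function(theta, X, y) drop(costfunction(X, y, theta))
grad_theta <- function(theta, X, y) drop(gradfunction(X, y, theta))  # gradfunction() still reads m globally
fit <- ucminf(par = runif(ncol(X)),  # random starting values
              fn = cost_theta,
              gr = grad_theta,
              X = as.matrix(X), y = y)
fit$par    # estimated theta
fit$value  # minimised cost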