I've been trying to build some custom code for logistic regression (i.e. I cannot use the built-in glm function for this purpose; happy to explain why if needed).
Below is the initial R code to provide the data set I'm working with:
## Load the datasets
library(titanic)
library(dplyr)
data("titanic_train")
data("titanic_test")
## The test set has no Survived column, so add a placeholder before combining
## the training and testing datasets
titanic_test$Survived <- 2
complete_data <- rbind(titanic_train, titanic_test)
## Impute missing Embarked and Age values
complete_data$Embarked[complete_data$Embarked == ""] <- "S"
complete_data$Age[is.na(complete_data$Age)] <- median(complete_data$Age, na.rm = TRUE)
complete_data <- as.data.frame(complete_data)
## Drop columns that are not used as predictors
titanic_data <- select(complete_data, -c(Cabin, PassengerId, Ticket, Name))
## Keep only the labelled training rows (drop the placeholder test rows)
titanic_data <- titanic_data[titanic_data$Survived != 2, ]
titanic_model <- glm(Survived ~ ., family = binomial(link = 'logit'), data = titanic_data)
y <- titanic_data$Survived
x <- as.data.frame(cbind(rep(1, dim(titanic_data)[1]), titanic_data[,-2]))
x <- as.matrix(as.numeric(x))
beta <- as.numeric(rep(0, dim(x)[2]))
beta <- as.matrix(beta)
The issue I'm having is that I would like to compute the matrix product of x (an n x p matrix) and beta (a p x 1 matrix).
I have tried the following -
beta * x
x %*% beta
However, the above attempts throw the following errors -
Error in FUN(left, right) : non-numeric argument to binary operator
Error in x %*% beta : requires numeric/complex matrix/vector arguments
I'd imagine this is due to the fact I've got non-numeric fields in the data matrix x.
As a bit of a background, calculating the linear predictor will allow me to progress with my custom code for fitting a Logistic regression model.
I would appreciate some help - thank you!
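One way the linear predictor can be computed despite the non-numeric columns (a sketch of one common approach, assuming titanic_data as built above, not necessarily the intended solution) is to let model.matrix() build the numeric design matrix: it dummy-codes the factor columns and adds the intercept column, so the product with beta is defined.
## Sketch: numeric design matrix via model.matrix(); character/factor columns
## (e.g. Sex, Embarked) are dummy-coded and an intercept column is added
X <- model.matrix(Survived ~ ., data = titanic_data)
y <- titanic_data$Survived
beta <- matrix(0, nrow = ncol(X), ncol = 1)
eta <- X %*% beta   # the n x 1 linear predictor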
I am trying to do a proportional odds logistic regression model of the form:
dsnac <- polr(formula=DS1~AC1, data = pddat1, method=c("logistic"))
summary(dsnac)
The regression ran fine; however, when I call the summary function I get an error:
svd(X) : infinite or missing values in 'x'
I checked whether there are any missing values in the "AC1" column (assuming AC1 is the 'x' mentioned in the error), but it does not have any missing values. The range of AC1 is 1.3 to 170000. DS1 is a factor with levels 0, 1 and 2.
It would be a great help if someone could help me with this. Thanks
A reproducible example is:
library(MASS)  # polr() lives in MASS
pddat1 <- data.frame(cbind(DS1 = c(rep(0, 400), rep(1, 60), rep(2, 40)),
                           AC1 = runif(500, 1, 170000)))
pddat1$DS1 <- as.factor(pddat1$DS1)
dsnac <- polr(formula = DS1 ~ AC1, data = pddat1, method = "logistic")
summary(dsnac)
A simple transformation solved the issue. svd(X) refers to the singular value decomposition of the covariate matrix.
dsnac <- polr(DS1~scale(AC1) , data = pddat1, method=c("logistic"))
summary(dsnac)
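For clarity, scale() here just centres the covariate and divides it by its standard deviation; a manual equivalent of the same idea (a sketch) is:
## Rescale AC1 to mean 0 and unit standard deviation (what scale() does by
## default), then refit
pddat1$AC1s <- (pddat1$AC1 - mean(pddat1$AC1)) / sd(pddat1$AC1)
dsnac <- polr(DS1 ~ AC1s, data = pddat1, method = "logistic")
summary(dsnac)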
However, it is something to do with your data. Calling the clm function from the ordinal package leads to the same conclusion, with a warning such as "Model is nearly unidentifiable: very large eigenvalue - Rescale variables?"
library(ordinal)
dsnac <- clm(as.factor(DS1) ~ AC1, data=pddat1)
summary(dsnac)
If you reduce the maximum value in the runif call, everything works fine:
pddat1 <- data.frame(cbind(DS1=factor(c(rep(0,400),rep(1,60),rep(2,40))),
AC1=runif(500,1,15)))
str(pddat1)
pddat1$DS1 <- as.factor(pddat1$DS1)
dsnac <- polr(DS1 ~ AC1, data = pddat1, method=c("logistic"))
summary(dsnac)
I need to estimate parameters by a non-linear fitting procedure. In particular, I'm trying to fit the following equation:
v = theta1 - (theta1 - vin) * exp(-(theta2 / 10000) * (t - theta3))
I thought that nlm could be a good solution, using:
# example data
library(lubridate)  # for dseconds()
df <- data.frame(var = cumsum(sort(rnorm(100, mean = 20, sd = 4))),
                 time = seq(strptime("2018-1-1 0:0:0", "%Y-%m-%d %H:%M:%S"),
                            by = dseconds(200), length.out = 100))
#write function
my_fun <- function(v, vin, t,theta){
fy <- v ~ (theta[1]-(theta[1]- vin)*exp((-theta[2]/10000)*(t-theta[3])))
ssq<-sum((v-fy)^2)
return(ssq)
}
#run nlm
th.start <- c(7000, 1000, 10)
my_fit <- nlm(f=my_fun, vin=400, v = df$var,
t=df$time,p=th.start)
However, I get the error: Error in v - fy : non-numeric argument to binary operator. I'm sure it's something basic, but I'm struggling to understand the problem.
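For reference, one plausible source of the error is that the ~ turns fy into a formula object rather than a numeric vector, so v - fy has nothing numeric to work with. A sketch of a purely numeric objective (my rearrangement, with time converted to elapsed seconds, since POSIXct arithmetic with theta[3] is otherwise awkward):
# Sketch: compute fitted values directly (no formula) and return the residual
# sum of squares; t is expected as a plain numeric vector of elapsed seconds
my_fun <- function(theta, v, vin, t) {
  fy <- theta[1] - (theta[1] - vin) * exp((-theta[2] / 10000) * (t - theta[3]))
  sum((v - fy)^2)
}
t_num <- as.numeric(difftime(df$time, df$time[1], units = "secs"))
my_fit <- nlm(f = my_fun, p = c(7000, 1000, 10), v = df$var, vin = 400, t = t_num)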
I need to estimate a fractional model (response taking values between 0 and 1) with R. I also want to cluster the standard errors. I have found several examples on SO and elsewhere, and I built this function based on my findings:
require(sandwich)
require(lmtest)
clrobustse <- function(fit, cl){
  M <- length(unique(cl))                 # number of clusters
  N <- length(cl)                         # number of observations
  K <- fit$rank                           # number of estimated coefficients
  dfc <- (M/(M - 1))*((N - 1)/(N - K))    # small-sample correction factor
  ## sum the estimating functions (scores) within each cluster
  uj <- apply(estfun(fit), 2, function(x) tapply(x, cl, sum))
  ## sandwich estimator with the clustered meat
  vcovCL <- dfc*sandwich(fit, meat = crossprod(uj)/N)
  coeftest(fit, vcovCL)
}
I estimate my model like this:
fit <- glm(dep ~ exp1 + exp2 + exp3, data = df, fam = quasibinomial("probit"))
clrobustse(fit, df$cluster)
Everything works fine and I get results. However, I suspect that something is not right, because the non-clustered version:
coeftest(fit)
gives the exact same standard errors. I checked what Stata reports, and it displays different clustered errors. I suspect that I have misspecified the function clrobustse, but I just don't know how. Any ideas about what could be going wrong here?
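One sanity check (a sketch only, since the question's df and df$cluster are not shared) is to compare the hand-rolled function against the built-in clustered covariance estimator in the sandwich package:
library(sandwich)
library(lmtest)
## Built-in clustered covariance as a cross-check against clrobustse()
vc <- vcovCL(fit, cluster = df$cluster)
coeftest(fit, vcov. = vc)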
I am using RStudio with Kaggle's forest cover data, and I keep getting an error when trying to use the knn3 function in caret. Here is my code:
library(caret)
train <- read.csv("C:/data/forest_cover/train.csv", header=T)
trainingRows <- createDataPartition(train$Cover_Type, p=0.8, list=F)
head(trainingRows)
train_train <- train[trainingRows,]
train_test <- train[-trainingRows,]
knnfit <- knn3(train_train[,-56], train_train$Cover_Type)
This last line gives me this in the console:
Error in knn3.matrix(x, y = y, k = k, ...) : y must be a factor
As the error message states, y must be a factor (here, y is the name of the second parameter to the function). In R, a factor variable is used to represent categorical data. You can turn y into a factor with factor(y) but it will just have the levels 1:7 for your data. If you want to give more meaningful values to your factor, try
train$Cover_Type <- factor(train$Cover_Type, levels=1:7,
labels=c("Spruce/Fir","Lodgepole Pine","Ponderosa Pine",
"Cottonwood/Willow","Aspen",
"Douglas-fir","Krummholz"))
That will make your function happier and give you more useful labels in the results.
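A quick follow-up sketch of the refit, assuming the factor conversion above is applied to train before the data are re-split (column 56 is Cover_Type, following the question's own indexing):
## Re-split after converting Cover_Type to a factor, then refit
train_train <- train[trainingRows,]
knnfit <- knn3(train_train[,-56], train_train$Cover_Type, k = 5)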
I'm trying to learn a penalized logistic regression method with glmnet. The goal is to predict whether a car from the mtcars example data has an automatic or manual transmission. I think my code is pretty straightforward, but I keep getting an error.
This first block simply splits mtcars into an 80% train set and a 20% test set:
library(glmnet)
attach(mtcars)
smp_size <- floor(0.8 * nrow(mtcars))
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size=smp_size)
train <- mtcars[train_ind,]
test <- mtcars[-train_ind,]
I know the x data is supposed to be in a matrix form without the response, so I separate the two training sets into a non-response matrix (train_x) and a response vector (train_y)
train_x <- train[,!(names(train) %in% c("am"))]
train_y <- train$am
But when trying to train the model,
p1 <- glmnet(train_x, train_y)
I get the error:
Error in elnet(x, is.sparse, ix, jx, y, weights, offset, type.gaussian,  :
  (list) object cannot be coerced to type 'double'
Am I missing something?
Coercing the first argument to a matrix solves it for me:
p1 <- glmnet(as.matrix(train_x), train_y)
In fact, from ?glmnet it looks like the first argument should be a matrix/sparse matrix:
x: input matrix, of dimension nobs x nvars; each row is an observation vector. Can be in sparse matrix format (inherit from class "sparseMatrix" as in package Matrix; not yet available for family="cox")
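As a follow-up sketch (my addition, not part of the answer): since the question is after logistic regression, the binomial family can be requested explicitly; mtcars happens to be all numeric, so as.matrix() suffices, while data frames with factor columns usually go through model.matrix() instead.
## Sketch: coerce the predictors to a matrix and fit a penalized logistic
## (binomial) model
p1 <- glmnet(as.matrix(train_x), train_y, family = "binomial")
## For data with factor columns, build a numeric design matrix instead, e.g.:
## x <- model.matrix(am ~ . - 1, data = train)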