R won't recognize column names as an object

I'm trying to build a histogram of residual values; the first step I'm taking to do that is to run a linear model, but R will not recognize the column name as an object.
The first three lines of code run fine. The last two give me an error saying the object area_ha can't be found; however, it is one of the eight column titles in my data. Any advice on creating a linear model and a histogram to graph the residuals would also be very helpful.
dat<-read.csv("/Users/sara/Desktop/birdsinforest.csv", header=TRUE)
linearmodel=lm(abundance ~ area_ha, data = dat)
summary(linearmodel)
area_ha$abundance_predicted = predict(linearmodel)
area_ha$residual = area_ha$abundance - area_ha$abundance_predicted
This is the error I get after running the last two lines of code:
Error in area_ha$abundance_predicted = predict(linearmodel) :
object 'area_ha' not found

Your code:
dat<-read.csv("/Users/sara/Desktop/birdsinforest.csv", header=TRUE)
linearmodel=lm(abundance ~ area_ha, data = dat)
summary(linearmodel)
area_ha$abundance_predicted = predict(linearmodel)
area_ha$residual = area_ha$abundance - area_ha$abundance_predicted
In the above code, area_ha is a variable (a column name inside dat), not a data.frame in its own right; that's why it works inside the lm() formula but can't be found as a standalone object. You should try the last two lines of code as below:
dat$abundance_predicted <- predict(linearmodel)
dat$residual <- dat$abundance - dat$abundance_predicted
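Since you also asked about graphing the residuals, here is a minimal sketch continuing from the corrected lines above; hist() and residuals() are base R:
hist(dat$residual,
     main = "Histogram of residuals",
     xlab = "Residual (observed - predicted abundance)")

# Equivalent shortcut: the lm object already stores its residuals
hist(residuals(linearmodel))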

Related

Fastshap summary plot - Error: can't combine <double> and <factor<919a3>>

I'm trying to get a summary plot using the fastshap explain() function, as in the code below.
p_function_G <- function(object, newdata) {
  caret::predict.train(object,
                       newdata = newdata,
                       type = "prob")[, "AntiSocial"]  # select G class
}
# Calculate the Shapley values
#
# boostFit: a caret model using the catboost algorithm
# trainset: the dataset used for building the caret model.
# The dataset contains 4 categories W, G, R, GM
# corresponding to 4 different animal behaviors
library(caret)
shap_values_G <- fastshap::explain(xgb_fit,
                                   X = game_train,
                                   pred_wrapper = p_function_G,
                                   nsim = 50,
                                   newdata = game_train[which(game_test == "AntiSocial"), ])
However, I'm getting this error:
Error in `stop_vctrs()`:
! Can't combine `latitude` <double> and `gender` <factor<919a3>>
What's the way out?
I see that you are adapting code from Julia Silge's Predict ratings for board games tutorial. The original code used SHAPforxgboost to generate the SHAP values, but you're using the fastshap package.
Because Shapley explanations are only recently starting to gain traction, there aren't many standard data formats yet. fastshap does not like tidyverse tibbles; it only takes matrices or matrix-likes.
The error occurs because, by default, fastshap attempts to convert the tibble to a matrix. This fails because a matrix can hold only one type (e.g. either double or factor, not both).
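To see the type clash concretely, here is a tiny illustration (the column names latitude and gender are borrowed from your error message; the data are invented). Base R coerces silently, while vctrs, which backs tibbles, refuses:
df <- data.frame(latitude = c(51.5, 48.9), gender = factor(c("F", "M")))
as.matrix(df)   # base R silently promotes everything to character
# vctrs raises an error instead of coercing:
# vctrs::vec_c(df$latitude, df$gender)   # Error: Can't combine <double> and <factor<...>>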
I also ran into a similar issue and found that you can solve it by passing a plain data.frame rather than a tibble. I don't have access to your full code, but you could try replacing the shap_values_G code block as follows:
shap_values_G <- fastshap::explain(xgb_fit,
                                   X = game_train,
                                   pred_wrapper = p_function_G,
                                   nsim = 50,
                                   newdata = as.data.frame(game_train[which(game_test == "AntiSocial"), ]))
Wrap newdata with as.data.frame(). This converts the tibble to a data.frame and so shouldn't upset fastshap.

Correct use of R naive_bayes() and predict()

I am trying to run a simple naive Bayes model (trying to redo what I have seen in the DataCamp course).
I am using the R naivebayes package.
The training dataset is where9am, which looks like this:
My first problem is the following... when I have several observations to predict in a data.frame thursday9am...
... and I use the following code:
locmodel <- naive_bayes(location ~ daytype, data = where9am)
my_pred <- predict(locmodel, thursday9am)
I get a series of <NA>, while it works well (with the correct prediction) if the thursday9am data.frame contains only a single observation.
The second problem is the following: when I use the following code to get the associated probabilities...
locmodel <- naive_bayes(location ~ daytype, data = where9am, type = c("class", "prob"))
predict(locmodel, thursday9am , type = "prob")
... even if I have only one observation in thursday9am, I get a series of <NaN>.
I am not sure what I am doing wrong.
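For reference, a minimal self-contained sketch of the intended usage, with invented data standing in for where9am and thursday9am. Note that type is an argument to predict(), not to naive_bayes(), and that the predictor in the new data should be a factor with the same levels used in training; mismatched types or levels are a common source of <NA>/<NaN> predictions:
library(naivebayes)

# Invented stand-in for the where9am training data
where9am <- data.frame(
  daytype  = factor(c("weekday", "weekday", "weekend", "weekend")),
  location = factor(c("office", "office", "home", "home"))
)

locmodel <- naive_bayes(location ~ daytype, data = where9am)

# New data with the same column name and the same factor levels as training
thursday9am <- data.frame(
  daytype = factor(c("weekday", "weekday"), levels = levels(where9am$daytype))
)

predict(locmodel, thursday9am)                 # class labels
predict(locmodel, thursday9am, type = "prob")  # class probabilities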

R newbie having issues with lm function

I have the following code to get the famafrench regression of a set of data:
#Regression
ff_reg = lm(e25 ~ rmrf+smb+hml, data=dat);
However, I keep getting the error "invalid type (list) for variable e25".
e25 was defined earlier in the program as a set of data obtained by subtracting 'rf' from a matrix made up of 25 columns:
e25 = (dat[,7:31]) - dat$rf;
(where dat is a CSV file read into R and rf is one of the columns within that file)
Why is this error coming up and how can I resolve it?
On advice, here is the full code that I am running...
dat = read.csv("ff2014.csv", as.is=TRUE);
##excess portfolio returns
e25 = (dat[,7:31]) - dat$rf;
#print(e25);
#Regression
ff_reg = lm(e25 ~ rmrf+smb+hml, data=dat);
print(summary(ff_reg));
From help("lm"):
If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.
So, if that's what you intend to do, you need to make your data.frame a matrix before you call lm:
e25 <- as.matrix(e25)
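Putting that together with the code from the question (the column names rmrf, smb, hml, rf and the file path are the asker's), a sketch of the full script:
dat <- read.csv("ff2014.csv", as.is = TRUE)

# Excess returns for the 25 portfolios; as.matrix() makes lm() fit a
# separate least-squares regression for each column
e25 <- as.matrix(dat[, 7:31] - dat$rf)

ff_reg <- lm(e25 ~ rmrf + smb + hml, data = dat)
summary(ff_reg)   # one coefficient table per response column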

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a generalized linear model on some data and, when I run it through confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double- and triple-checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two inputs to confusionMatrix disagree. I've run almost exactly the same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the confusionMatrix predict step. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure, as I don't know your data structure, but I wonder if this is due to the way you set up modelFit, using the formula method. In that case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method = "glm")
which specifies y = training2$hold1yes0no and x = trainPC2.
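Separately, a hedged check worth running (object names taken from the question): caret's predict.train uses na.action = na.omit by default, so rows containing NAs are silently dropped from the predictions, which would explain a shorter vector:
length(predict(modelFit, trainTR2))   # predictions actually returned
nrow(trainTR2)                        # rows supplied
sum(!complete.cases(trainTR2))        # rows with NAs that na.omit removes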

lm function throws an error in terms.formula() in R

I am trying to run a linear model on the training data frame, but it does not produce any output.
It gives me an error saying
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
Code
n <- ncol(training)
input <- as.data.frame(training[, -n])
fit <- lm(training[, n] ~ ., data = training[, -n])
There's no need to remove the column from the data to perform this operation, and it's best to use names.
Say your last column is called response. Then run this:
lm(response ~ ., data = training)
It's hard to say whether this is the formula you need; if you provide a reproducible example, that will become clear.
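If the response column really is only known by position, a hedged alternative (names here follow the question) is to build the formula from the column's name rather than subsetting the data:
n <- ncol(training)
response_name <- names(training)[n]   # the last column's name

# Build "response_name ~ ." as a formula and keep the whole data frame
fit <- lm(as.formula(paste(response_name, "~ .")), data = training)
summary(fit)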
