Compare predicted and actual values with caret - R

I use caret to get a prediction for electricity usage per day. I am wondering if there is a way to plot both the predicted data and the actual data to inspect the difference.
I checked this post, but it did not have the solution I want:
how to plot actual and predicted values?
Assume two lists of numbers:
actual <- c(1, 2, 3, 4)
predicted <- c(1.3, 2, 3.1, 4)
Is there a simple way to plot the two lines to observe the difference?
Thanks!

You mentioned using caret, so there must be some model you're trying to fit. If so, you have a model object and can use the plotObsVsPred function. Here is a simplified example from the help page:
# regression example
library(caret)
library(mlbench)
data(BostonHousing)

# fit a partial least squares model on the first 100 rows
plsFit <- train(BostonHousing[1:100, -c(4, 14)],
                BostonHousing$medv[1:100],
                "pls")

# get predictions for a held-out set of rows
predVals <- extractPrediction(list(plsFit),
                              testX = BostonHousing[101:200, -c(4, 14)],
                              testY = BostonHousing$medv[101:200])

# plot observed vs. predicted values
plotObsVsPred(predVals)
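If you just want to overlay the two vectors from the question directly, here is a minimal base-R sketch (the axis labels are placeholders):
actual <- c(1, 2, 3, 4)
predicted <- c(1.3, 2, 3.1, 4)

# draw the actual series as a line, then overlay the predictions
plot(actual, type = "l", col = "black", xlab = "day", ylab = "electricity usage")
lines(predicted, col = "red")
legend("topleft", legend = c("actual", "predicted"),
       col = c("black", "red"), lty = 1)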

Related

How to graph an increasing exponential decay in R?

My prof decided that our first experience with coding would be fitting the function z(t) = A(1 - e^(-t/T)) to a given data set from class using R. I'm completely lost. I keep using the lm and nls functions without quite knowing how they work. So far I have the data graphed, but I have no clue how to get any sort of line more complicated than
mod3 <- lm(y ~ I(x^(1/5)))
pre3 <- predict(mod3)
lines(pre3)
To sum up: how do I find the A and T parameters? Do I use nls for the formula? Anything helps. (The original post included a picture of the graph and the data set; please ignore the random lines on the plot.)
You could attempt to transform your expression into a linear relationship, but sometimes it is easier to just let the computer do the work. As mentioned in the comments, R has the nls function to perform nonlinear regression.
Here is an example using some dummy data. Supply the nls function with your equation, the data frame containing the data, and initial estimates of the parameters.
See the comments for additional details.
# create dummy data
A <- 0.8
T1 <- 13
t <- seq(2, 50, 3)
z <- A*(1 - exp(-t/T1))
z <- z + rnorm(length(z), 0, 0.005)  # add noise

# starting data frame
df <- data.frame(t, z)

# solve the nonlinear model; start holds the initial parameter guesses
model <- nls(z ~ A*(1 - exp(-t/Tc)), data = df, start = list(A = 1, Tc = 1))
print(summary(model))

# predict
pred_y <- predict(model, data.frame(t))

# plot the data and overlay the fitted curve
plot(x = t, y = z)
lines(y = pred_y, x = t, col = "blue")
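To answer the original question directly, the fitted A and T come straight out of the model object with coef (continuing from the code above; the parameter is named Tc in the formula):
# extract the estimated parameters from the nls fit
est <- coef(model)
est["A"]   # estimate of A
est["Tc"]  # estimate of T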

QQ plot in R to check normality of a distribution?

I have been reading a tutorial from https://www.datanovia.com/en/lessons/anova-in-r/ on how to perform an ANOVA test in R. However, my question is about checking the normality of a distribution in general.
There is an option to make a QQ plot with the ggqqplot function, but I do not know how to call it correctly. From what I can see in the tutorial on datanovia, they use the residuals from the linear model:
# Build the linear model
model <- lm(weight ~ group, data = PlantGrowth)
# Create a QQ plot of residuals
ggqqplot(residuals(model))
Then I performed the same test this way:
ggqqplot(PlantGrowth, "weight")
I expected to see the same result; however, the results differ.
From the documentation of ggqqplot it is not clear to me which way of calling it is correct. Does someone have an explanation?
Thanks :D
You would just do ggqqplot(PlantGrowth$weight), as long as that variable is a vector of numeric values. The function takes a vector of numeric values and gives you a QQ plot, for example: ggqqplot(iris$Sepal.Length)
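The reason the two plots differ: residuals(model) has each group's mean removed, while the raw weight column still contains the between-group differences. A minimal sketch of the two calls side by side, assuming ggpubr is loaded:
library(ggpubr)

# QQ plot of the residuals: weights with each group's mean subtracted
model <- lm(weight ~ group, data = PlantGrowth)
ggqqplot(residuals(model))

# QQ plot of the raw variable: group shifts are still in the data,
# so points can deviate even when the residuals are normal
ggqqplot(PlantGrowth$weight)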

Boosting classification tree in R

I'm trying to boost a classification tree using the gbm package in R and I'm a little bit confused about the kind of predictions I obtain from the predict function.
Here is my code:
# load packages, set random seed
library(gbm)
set.seed(1)

# generate random data
N <- 1000
x <- rnorm(N)
y <- 0.6^2*x + sqrt(1 - 0.6^2)*rnorm(N)
z <- rep(0, N)
for (i in 1:N) {
  if (x[i] - y[i] + 0.2*rnorm(1) > 1.0) {
    z[i] <- 1
  }
}

# create data frame
myData <- data.frame(x, y, z)

# split data set into train and test
train <- sample(N, 800, replace = FALSE)
test <- (-train)

# boosting
boost.myData <- gbm(z ~ ., data = myData[train, ], distribution = "bernoulli",
                    n.trees = 5000, interaction.depth = 4)
pred.boost <- predict(boost.myData, newdata = myData[test, ],
                      n.trees = 5000, type = "response")
pred.boost
pred.boost is a vector with elements from the interval (0,1).
I would have expected the predicted values to be either 0 or 1, as my response variable z also consists of dichotomous values - either 0 or 1 - and I'm using distribution="bernoulli".
How should I proceed with my prediction to obtain a real classification of my test data set? Should I simply round the pred.boost values or is there anything I'm doing wrong with the predict function?
The behavior you observe is correct. From the documentation:
If type="response" then gbm converts back to the same scale as the
outcome. Currently the only effect this will have is returning
probabilities for bernoulli.
So you should be getting probabilities when using type="response", which is correct. Also, distribution="bernoulli" merely says that the labels follow a Bernoulli (0/1) pattern; you can omit it and the model will still run fine.
To proceed, do predict_class <- pred.boost > 0.5 (cutoff = 0.5), or plot an ROC curve to decide on a cutoff yourself.
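A minimal sketch of that thresholding step, continuing from the code above (the 0.5 cutoff is an assumption; pick it from an ROC curve if the costs of the two error types differ):
# convert probabilities to hard 0/1 class labels with a 0.5 cutoff
predict_class <- as.numeric(pred.boost > 0.5)

# quick confusion table against the true test labels
table(predicted = predict_class, actual = myData[test, "z"])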
Try using adabag. Class labels, probabilities, votes, and error are built into adabag, which makes the output easy to interpret, and of course takes fewer lines of code.

Random predictions from linear model in R

I have some data with missing values for one variable, and I want to be able to create (random) predictions for what these could be. Here's my first thought:
# miss indexes the observations with a missing response
library(MASS)
model <- glm.nb(data[-miss, 4] ~ ., data = data[-miss, -4])
predict(model, newdata = data[miss, -4])
However, if I repeat the last line, it gives the same answers over and over: it appears to give the predicted mean response given the data and the model. I want a random prediction that incorporates variance, i.e. a random draw from the distribution of the response of an observation with such predictors under the given model.
It could have something to do with the pred.var argument, but I'm unsure how to use that.
Suppose we have data like this:
set.seed(101)
dd <- data.frame(x = (1:20)*0.1)
dd$y <- rnbinom(20, mu = exp(dd$x), size = 1)
## make some missing values
miss <- c(2, 3, 5)
dd$y[miss] <- NA
Now fit a model:
m1 <- MASS::glm.nb(y ~ x, dd, na.action = na.exclude)
Now use predictions from that model to get the expected mean value and rnbinom to generate the random values ...
p <- predict(m1, newdata = dd, type = "response")
randvals <- rnbinom(length(p), mu = p, size = m1$theta)
(This gives random values for every element, not just the missing ones, but obviously you can pick out just the ones you want ...) It would be nice if the simulate method did this, but it's not quite flexible enough ...
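Picking out just the missing entries, continuing from the objects above:
# overwrite only the originally missing responses with random draws
dd$y[miss] <- randvals[miss]
dd[miss, ]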

R: Prediction using glm() gamma family

I am using the glm() function in R with link = "log" to fit my model. I read on various websites that fitted() returns values that we can compare with the original data, as opposed to predict().
I am facing some problems while fitting the model.
data <- read.csv("training.csv")
data$X2 <- as.Date(data$X2, format = "%m/%d/%Y")
data$X3 <- as.Date(data$X3, format = "%m/%d/%Y")
data_subset <- subset(...)
attach(data_subset)

# define variables
Y <- cbind(Y)
X <- cbind(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X14)

# correlation among variables
cor(Y, X)

model <- glm(Y ~ X, data_subset, family = Gamma(link = "log"))
summary(model)
detach(data_subset)

validation_data <- read.csv("validation.csv")
validation_data$X2 <- as.Date(validation_data$X2, format = "%m/%d/%Y")
validation_data$X3 <- as.Date(validation_data$X3, format = "%m/%d/%Y")
attach(validation_data)
predicted_valid <- predict(model, newdata = validation_data)
I am not sure how predict works with a Gamma log link. I want to transform the predicted values so they can be compared with the original data. Can someone please help me?
Add type="response" to your predict call to get predictions on the response scale. See ?predict.glm.
predict(model, newdata=*, type="response")
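To see the difference between the two scales on a toy Gamma fit (a sketch with made-up data, not the asker's CSV files):
set.seed(1)
toy <- data.frame(x = runif(50, 1, 10))
toy$y <- rgamma(50, shape = 2, rate = 2/exp(0.3*toy$x))  # mean = exp(0.3*x)
fit <- glm(y ~ x, data = toy, family = Gamma(link = "log"))

# default: predictions on the link (log) scale
head(predict(fit, newdata = toy))

# type = "response" back-transforms, so these two agree
head(predict(fit, newdata = toy, type = "response"))
head(exp(predict(fit, newdata = toy)))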
It looks to me like fitted doesn't work the way you seem to think it does.
You probably want to use predict there, since you seem to want to pass it new data.
See ?fitted vs. ?predict.
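On the training data the two agree once you are on the response scale; continuing the toy fit above:
# fitted() returns response-scale values for the training data
all.equal(fitted(fit), predict(fit, type = "response"))  # TRUE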
