QQ plot in R to check normality of the distribution?

I have been reading a tutorial from https://www.datanovia.com/en/lessons/anova-in-r/ on how to perform ANOVA test in R. However, my question is regarding checking normality of the distribution in general.
There is an option to do a QQ plot with the ggqqplot function. However, I do not know how to call it correctly. From what I can see in the Datanovia tutorial, they use the residuals from the linear model:
# Build the linear model
model <- lm(weight ~ group, data = PlantGrowth)
# Create a QQ plot of residuals
ggqqplot(residuals(model))
Then I performed the same test this way:
ggqqplot(PlantGrowth, "weight")
I expected to see the same result; however, the results differ.
From the documentation of ggqqplot it is not clear to me which way of calling it is correct. Does someone have an explanation?
Thanks :D

You would just do ggqqplot(PlantGrowth) as long as that variable is a vector of numeric values. The function only takes a vector of numeric values and gives you something like this, e.g. ggqqplot(iris$Sepal.Length).
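To see why the two calls differ, here is a minimal sketch (assuming the ggpubr package, which provides ggqqplot, and the built-in PlantGrowth data): the first call assesses the pooled raw weights, the second the residuals after each group's mean has been removed, so the two plots need not look the same.
library(ggpubr)
# QQ plot of the raw weights, pooled across all three groups
ggqqplot(PlantGrowth, "weight")
# QQ plot of the residuals, i.e. the weights after each group's mean is removed
model <- lm(weight ~ group, data = PlantGrowth)
ggqqplot(residuals(model))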

Related

Generalized Linear Model (GLM) in R

I have a response variable (A), which I transformed (logA), and a predictor (B) from data (X), both of which are continuous. How do I check the linearity between the two variables using a Generalized Additive Model (GAM) in R? I use the following code
model <- gamlss(logA ~ pb(B) , data = X, trace = F)
but I am not sure about it. Can I add "family=Poisson" to the code when logA is continuous, as in a GLM? Any thoughts on this?
Thanks in advance
If your dependent variable is a count variable, you can use family=PO() without the log transformation. With family=PO() a log link is already applied (to the mean), so the counts are modelled directly. See the help page for gamlss.family and also section 2.1 of the vignette on count regression.
So it will go like:
library(gamlss)
fit = gamlss(gear ~ pb(mpg), data = mtcars, family = PO())
You can see that the predictions are log transformed and you need to take the exponential:
with(mtcars, plot(mpg, gear))
points(mtcars$mpg, exp(predict(fit, what = "mu")), col = "blue", pch = 20)
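Equivalently, and assuming your version of predict.gamlss supports type = "response" (worth checking on your installation), you can ask for predictions on the original count scale directly instead of exponentiating the link-scale predictions:
# same points as above, but letting predict() do the back-transformation
points(mtcars$mpg, predict(fit, what = "mu", type = "response"), col = "red", pch = 1)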

Residual Plot for multivariate regression in Time Series, with time on X axis in R

I have a dataframe which is a time series. I am using the function lm to build a multivariate regression model.
linearmodel <- lm(Y~X1+X2+X3, data = data)
I want to plot the residuals of this linearmodel on the y-axis and time on the x-axis using a simple function, with the lm() object as the input.
Standard residual-plotting functions like the one in the car package (car::residualPlot) give residuals on the Y-axis and fitted values on the X-axis.
Ideally, I need the residuals on the Y-axis and the timescale on the X-axis. But I understand that the function lm() is time-agnostic, so I can live with it if the residuals are on the Y-axis in the same order as the data input and nothing on the X-axis.
Is there a plotting function which I can use by passing the lm object into it (not something where I extract the residuals and use ggplot2)? So for example, plot <- plotresidualsinorder(linearmodel) should give me the residuals on the Y-axis in the same order as the data input.
I want to use this plot in R-shiny ultimately.
My research led me to the car package, which is wonderful in its own right, but it doesn't have a function that solves my problem.
Many thanks in advance for the help.
You can use the residual plot information. For the proposed solution, we apply the lm function to a formula that describes your Y variable in terms of X1 + X2 + X3, save the linear regression model in a new linearmodel variable, and finally compute the residuals with the resid function. In your case, the following solution should be representative of your problem.
Proposed solution:
linearmodel <- lm(Y ~ X1 + X2 + X3, data = data)
lm_resid <- resid(linearmodel)
# x-axis here is the sum of the predictors (the data have no explicit time column)
plot(data$X1 + data$X2 + data$X3, lm_resid,
     ylab = "Residuals", xlab = "Time",
     main = "Data")
abline(0, 0)
For help on how the resid function works, you can try:
help(resid)
Calisto's solution will work, but there is a simpler and more straightforward solution. The lm object already gives you the regression residuals, so you can simply call:
plot(XTime, linearmodel$residuals, main = "Residuals")
XTime is the date variable of your dataset; you may need to format it with the POSIX functions: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/as.POSIX*
Add parameters as needed to use it in R Shiny.
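Putting both suggestions together, a minimal sketch, assuming your data frame has a Date/POSIXct column holding the time index (here called date, a hypothetical name not taken from the question):
linearmodel <- lm(Y ~ X1 + X2 + X3, data = data)
# 'date' is a hypothetical Date/POSIXct column in the data frame
plot(data$date, linearmodel$residuals,
     xlab = "Time", ylab = "Residuals", main = "Residuals over time")
abline(h = 0)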

How to replicate the predict function from R in Excel given I have access to "summary" output from R

I have run a 3rd order polynomial regression in R and have run the "summary" function, but I need to be able to replicate the "predict" function in Excel. I have my current working code below. Thank you for your help!
#Have access to this output:
AICFit <- lm(R60 ~ poly(M20, 3) + poly(M40, 3), data = mydata)
summary(AICFit)
# do not have access to this output:
predict(AICFit, data.frame(M20 = 0.972375241, M40 = 0.989086129), interval = "prediction")
Basically, I don't have access to R when I have access to these numbers: 0.972375241,0.989086129.
I believe this is the equation that is the basis for the predict function, but I don't know how to compute this in Excel incorporating order 1, 2 and 3:
You do not have enough information from summary to calculate the prediction interval in Excel.
So the simple answer is: it is not possible without access to the variance-covariance matrix (although for the orthogonal polynomials in your model it is diagonal) and the raw data. Moreover, you need to extract the orthogonal polynomial coefficients themselves, which are generated recursively and are unique to the dataset you are fitting.
The formula you are referencing is for univariate linear regression, and it is not applicable to your case, where you are doing multivariate polynomial regression with two variables: M20 and M40.
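If the goal is only the point prediction, one workaround (a sketch that assumes you can refit the model in R before moving to Excel; it is not part of the answer above) is to refit with raw rather than orthogonal polynomials, so the coefficients map directly onto powers of M20 and M40 that can be typed into an Excel formula:
# raw = TRUE gives coefficients for M20, M20^2, M20^3, M40, M40^2, M40^3
raw_fit <- lm(R60 ~ poly(M20, 3, raw = TRUE) + poly(M40, 3, raw = TRUE), data = mydata)
coef(raw_fit)
# point prediction in Excel: b0 + b1*M20 + b2*M20^2 + b3*M20^3 + b4*M40 + b5*M40^2 + b6*M40^3
# the prediction interval still requires the variance-covariance matrix, e.g. vcov(raw_fit)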

Command for finding the best linear model in R

Is there a way to get R to run all possible models (with all combinations of variables in a dataset) to produce the best/most accurate linear model and then output that model?
I feel like there is a way to do this, but I am having a hard time finding the information.
There are numerous ways this could be achieved, but for a simple way of doing this I would suggest that you have a look at the glmulti package, which is described in detail in this paper:
glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models
Alternatively, here is a very simple example of model selection, as shown on the Quick-R website:
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Or to simplify even more, you can do more manual model comparison:
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
This should get you started, although you should read my comment from above. The code below builds a model for every combination of the variables in your dataset and then compares all of the models with AIC and BIC.
# create a NULL object called model so we have something to add our fitted models to
model = NULL
# create a vector of the dataframe column names used to build the formula
# (the response is assumed to be in the first column and named y)
vars = names(data)
# remove variable names you don't want to use (at least
# the response variable, if it's in the first column)
vars = vars[-1]
# the combn function will run every different combination of variables and then run the glm
for(i in 1:length(vars)){
  xx = combn(vars, i)
  if(is.null(dim(xx))){
    fla = paste("y ~", paste(xx, collapse = "+"))
    model[[length(model)+1]] = glm(as.formula(fla), data = data)
  } else {
    for(j in 1:dim(xx)[2]){
      fla = paste("y ~", paste(xx[1:dim(xx)[1], j], collapse = "+"))
      model[[length(model)+1]] = glm(as.formula(fla), data = data)
    }
  }
}
# see how many models were built using the loop above
length(model)
# create a vector to extract AIC and BIC values from the model variable
AICs = NULL
BICs = NULL
for(i in 1:length(model)){
  AICs[i] = AIC(model[[i]])
  BICs[i] = BIC(model[[i]])
}
# see which models were chosen as best by both methods
which(AICs==min(AICs))
which(BICs==min(BICs))
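To pull out the winning model itself, a short follow-up sketch (not part of the original answer):
# the model with the lowest AIC, and its formula
best <- model[[which.min(AICs)]]
formula(best)
summary(best)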
I ended up running forward, backward, and stepwise procedures on the data to select models and then comparing them based on AIC, BIC, and adjusted R-squared. This method seemed the most efficient. However, when I received the actual data to be used (the program I was writing was for business purposes), I was told to only model each explanatory variable against the response, so I was able to just call lm(response ~ explanatory) for each variable in question, since the analysis we ended up using it for wasn't concerned with how the variables interacted with each other.
This is a very old question, but for those who are still encountering this discussion: the olsrr package, and specifically its ols_step_all_possible function, exhaustively fits an OLS model for every possible subset of variables based on an lm object (so by feeding it the full model you get all possible combinations) and returns a data frame with R-squared, adjusted R-squared, AIC, BIC, etc. for all the models. This is very helpful for finding the best predictors, but it is also very time-consuming.
see https://olsrr.rsquaredacademy.com/reference/ols_step_all_possible.html
I do not recommend just "cherry picking" the best-performing model; rather, I would actually look at the output and choose carefully for the most reasonable outcome. If you want to immediately get the best-performing model (by some criterion, say number of predictors and R-squared), you may write a function that saves the data frame, arranges it by number of predictors, orders it by descending R-squared, and spits out the top result.
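A minimal sketch of that usage, assuming the olsrr package and the built-in mtcars data:
library(olsrr)
full <- lm(mpg ~ disp + hp + wt, data = mtcars)
all_subsets <- ols_step_all_possible(full)
all_subsets   # one row per subset, with R-squared, adjusted R-squared, AIC, etc.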
The dredge() function from the MuMIn package also accomplishes this.
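A minimal dredge() sketch, again using the built-in mtcars data; note that dredge() expects the global model to be fitted with na.action = na.fail:
library(MuMIn)
full <- lm(mpg ~ disp + hp + wt, data = mtcars, na.action = na.fail)
all_models <- dredge(full)   # every subset of predictors, ranked by AICc
head(all_models)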

How to get Loess function for my data in R?

I have some data that I draw on a plot using R.
After that, I draw the loess fit of that data.
Here is the code:
data <- read.table("D:/data.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
ur <- subset(data, select = c(users,responseTime))
ur <- ur[with(ur, order(users, responseTime)), ]
plot(ur, xlab="Users", ylab="Response Time (ms)")
lines(ur)
loess_fit <- loess(responseTime ~ users, ur)
lines(ur$users, predict(loess_fit), col = "blue")
Here's my plot's image:
How can I get the function of this regression?
For example: responseTime = 68 + 45 * users.
Thanks.
You can use the loess_fit object from your code to predict the response time. If you want to estimate the average response time for 230 users, you could do:
predict(loess_fit, newdata=data.frame(users=230))
Here is an interesting blog post on this subject.
EDIT: If you want to make predictions for values outside your data, you need a theory or further assumptions. The simplest assumption would be a linear fit:
lm_fit <- lm(responseTime ~ users, data=ur)
predict(lm_fit, newdata=data.frame(users=400))
However, your data may show heteroscedasticity (non-constant variance) and may show non-normal residuals. You might want to check whether that is the case. If it is, then a robust linear fitting procedure such as rlm from the MASS package, or a generalized linear model glm, might be worth a try. I am not an expert on that; maybe someone else here or at Cross Validated can provide better help.
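For example, a robust fit could look like this (a sketch, assuming the MASS package and the ur data frame from the question):
library(MASS)
rlm_fit <- rlm(responseTime ~ users, data = ur)   # robust linear regression
predict(rlm_fit, newdata = data.frame(users = 400))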
The loess.demo function in the TeachingDemos package shows the logic underlying the loess fit. This can help you understand what is going on and why there is not a simple prediction function. However, for predicting, there is a predict function that works with loess fits to create the prediction. You can also find the linear equation that will predict for a specific value of x (but it will be different for each value of x you may want to predict for).
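A quick way to try it (a sketch, assuming the TeachingDemos package and the ur data from the question):
library(TeachingDemos)
# interactive demo: click on the plot to see the local weighting and local fit at that x
loess.demo(ur$users, ur$responseTime)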
