I have used the package lsmeans in R to get the average estimate for all observations for my treatment factor (across the levels of a block factor in the experimental design that has been included with systematic effect because it only had 3 levels). I have used a sqrt transformation for my response variable.
Thus I have used the following commands in R.
First defining model
model<-sqrt(response)~treatment+block
Then applying lsmeans
model_lsmeans<-lsmeans(model,~treatment)
Then plotting this
plot(model_lsmeans,ylab="treatment", xlab="response(with 95% CI)")
This gives a very nice graph with estimates and 95% confidense intervals for the different treatment.
The problems is just that this graph is for the transformed response.
How do I get this same plot with the backtransformed response (so the squared response)?
I have tried to create a new data frame and extract the lsmean, lower.CL, and upper.CL:
a<-summary(model_lsmeans)
New_dataframe<-as.data.frame(a[c("treatment","lsmean","lower.CL","upper.CL")])
And then make these squared
New_dataframe$lsmean<-New_dataframe$lsmean^2
New_dataframe$lower.CL<-New_dataframe$lower.CL^2
New_dataframe$upper.CL<-New_dataframe$upper.CL^2
New_dataframe
This gives me the estimates and CI boundaries squared that I need.
The problem is that I cannot make the same graph for thise estimates and CI as the one that I did in LS means above.
How can I do this? The reason that I ask is that I want to have graphs that are all of a similar style for my article. Since I very much like this LSmeans plot, and it is very convenient for me to use on the non-transformed response variables, I would like to have all my graphs in this style.
Thank you very much for your help! Hope everything is clear!
Kind regards
Ditlev
Related
Im using the book Applied Survival Analysis Using R by Moore to try and model some time-to-event data. The issue I'm running into is plotting the estimated survival curves from the cox model. Because of this I'm wondering if my understanding of the model is wrong or not. My data is simple: a time column t, an event indicator column (1 for event 0 for censor) i, and a predictor column with 6 factor levels p.
I believe I can plot estimated surival curves for a cox model as follows below. But I don't understand how to use survfit and baseplot, nor functions from survminer to achieve the same end. Here is some generic code for clarifying my question. I'll use the pharmcoSmoking data set to demonstrate my issue.
library(survival)
library(asaur)
t<-pharmacoSmoking$longestNoSmoke
i<-pharmacoSmoking$relapse
p<-pharmacoSmoking$levelSmoking
data<-as.data.frame(cbind(t,i,p))
model <- coxph(Surv(data$t, data$i) ~ p, data=data)
As I understand it, with the following code snippets, modeled after book examples, a baseline (cumulative) hazard at my reference factor level for p may be given from
base<-basehaz(model, centered=F)
An estimate of the survival curve is given by
s<-exp(-base$hazard)
t<-base$time
plot(s~t, typ = "l")
The survival curve associated with a different factor level may then be given by
beta_n<-model$coefficients #only one coef in this case
s_n <- s^(exp(beta_n))
lines(s_n~t)
where beta_n is the coefficient for the nth factor level from the cox model. The code above gives what I think are estimated survival curves for heavy vs light smokers in the pharmcoSmokers dataset.
Since thats a bit of code I was looking to packages for a one-liner solution, I had a hard time with the documentation for Survival ( there weren't many examples in the docs) and also tried survminer. For the latter I've tried:
library(survminer)
ggadjustedcurves(model, variable ="p" , data=data)
This gives me something different than my prior code, although it is similar. Is the method I used earlier incorrect? Or is there a different methodology that accounts for the difference? The survminer code doesn't work from my data (I get a 'can't allocated vector of size yada yada error, and my data is ~1m rows) which seems weird considering I can make plots using what I did before no problem. This is the primary reason I am wondering if I am understanding how to plot survival curves for my model.
I am looking at the relationship between agricultural intensity and functional diversity of birds.
In my GLM model I have included a number of other variables including forest, semi-natural habitat, temperature, pesticides etc.
When looking to see whether my variables are normally distributed or not, I used a QQplot to identify the normality and there appears to be these 3 outliers.
I wondered how I would remove these outliers to make my data more normally distributed?
I tried to use the outliers package but all the examples I found failed to work, or I failed to understand how they worked!
Any help would be appreciated. This is my QQ plot for my functional dispersion model and a scatter of functional dispersion x agricultural intensity.
QQ plot
functional dispersion x agriculture scatter
You could remove the observations that appear out of place. Given the amount of observations, this is unlikely to change estimates, but please make sure this is indeed the case. Also, when reporting your work, make sure you justify why you removed those points based on your domain knowledge about the variable.
You can remove the observation using
model.data.scaled <- model.data.scaled[model.data.scaled$agri > -5, ]
Is Plot residuals vs predicted response equivalent to Plot residuals vs fitted ?
If so, then would be plotted by plot(lm) and plot(predict(lm)), where lm is the linear model ?
Am I correct?
Maybe little off-topic, but as an addition: package named ggfortify might come handy. Super easy to use, like this:
library(ggfortify)
autoplot(mod3)
Yields an output with the most important things you need to know, if your model violates the lm assumptions or not. An example output here:
Yes, the fitted values are the predicted responses on the training data, i.e. the data used to fit the model, so plotting residuals vs. predicted response is equivalent to plotting residuals vs. fitted.
As for your second question, the plot would be obtained by plot(lm), but before that you have to run par(mfrow = c(2, 2)). This is because plot(lm) outputs 4 plots, one of which is the one you want, i.e the residuals vs fitted plot. The command above divides the output screen into four facets, so each plot will be shown in one. The plot you are looking for will appear in the top left.
My PCA result using prcomp() function is summarised and plot as followings. How to interpret the plot results? It shows in some online article that the points present the amount of variance attributed to the different principal components. However, the value seems not matching with any of the statistics, e.g., standard deviation, the proportion of variance, or cumulative proportion.
> summary(data_pca)
> plot(data_pca,type="lines")
I got the hint from #Roland and #Maurits. Here, the variance is exactly the square of standard deviation.
I'm studying the effect of different predictors (dummy, categorical and continuos variables) on presence of birds, obtained from bird counts at-sea. To do that I used a glmmadmb function and binomial family.
I've plotted the relationship between response variable and predictors in order to asses the model fit and the marginal effect of each predictor. To draw the graphs I used visreg function, specifying the transformation of the vertical axis:
visreg(modelo.bn7, type="conditional", scale="response", ylab= "Bird Presence")
The output graphs showed a confident bands very wide when I used the original scale of the response variable (covering the whole vertical axis). In case of graphs without transformation, confident bands were shorter but they had the same extension in the different levels of dummy variables. Does anyone know how the confidents bands are calculated in binomial distributions? Could it reflect that I have a problem in the estimated coefficients or in the model fit?
The confidence bands are calculated using p-values for binomial distribution... For detailed explanation you can ask on stats.stackexchange.com. If the bands are very wide (and the interpretation of 'wide' is subjective and mostly based on what is your goal) then it shows that your estimates may not be very accurate. High p-values usually are due to small or insufficient number of observations used for building the model. If the number of observations are large, then it does indicate a poor fit.