Difference between 3D plots using fitted and predicted values - r

I have two 3D plots. The first (left side) shows fitted values and was made with the p3d library. For the second (right side) I used predict(), then the interp command from the akima package, and plotted the result with persp. The two images are not shown from the same angle, but this was the best way I found to show that one surface is flat and the other is curved.
I would like to know why the graph showing fitted values has a curve and the other one with predicted values doesn't have it.
First graph code:
library(p3d)
Init3d(family="serif", cex = 1)
Plot3d( TCL ~ reLDM+yr, nest6)
Axes3d()
fit = lm( TCL ~ reLDM+yr+I(yr^2)+I(reLDM*yr)+I(reLDM*yr^2), nest6)
Fit3d( fit )
Second graph code:
library(akima)
x <- nest6$reLDM
y <- nest6$yr
y2 <- y^2
z <- nest6$TCL
m <- lm(z ~ x*y+y2+x:y2)
i <- 25
xtemp <- seq(min(x),max(x),length.out=i)
xrange <- rep(xtemp,times=i)
ytemp <- seq(min(y),max(y),length.out=i)
yrange <- rep(ytemp,each=i)
y2temp <- seq(min(y2),max(y2),length.out=i)
y2range <- rep(y2temp,each=i)
newdata <- data.frame(x=xrange,y=yrange,y2=y2range)
zhat <- predict(m,newdata=newdata)
xyz <- interp(xrange,yrange,zhat)
jet.colors <- colorRampPalette( c("yellow", "red", "blue") )
nbcol <- 500
color <- jet.colors(nbcol)
nrz <- length(xyz[[1]])
ncz <- length(xyz[[2]])
z<-xyz[[3]]
zfacet <- z[-1, -1] + z[-1, -ncz] + z[-nrz, -1] + z[-nrz, -ncz]
facetcol <- cut(zfacet, nbcol)
quartz()
persp(xyz,xlab="x",ylab="y",zlab="z", cex.lab = 1,cex.axis = 1,
theta = 35, phi = 50,col=color[facetcol], border="grey40", ticktype = "detailed", zlim=c(1,7))
The dataset is available at this link; it is too big to post here: https://www.dropbox.com/s/czdascoq02alm46/TCL16_26.csv?dl=0
My model was fitted with lm(). I read in a post that there is no difference between the fitted and predict functions in a simple linear regression model. However, with akima I am using the interp command, which as I understand it estimates values between known data points (basically filling in the missing data gaps).
Another difference I found is that the plot of predicted values uses new data spanning the minimum to the maximum of the original data set, whereas for fitted values there is one value per observation.
I have trouble explaining this to my supervisor, who thinks this is not reason enough. What would be a better explanation for why the curve is missing in the second graph?

Replace
y2temp <- seq(min(y2),max(y2),length.out=i)
with
y2temp <- ytemp^2
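You will get a similar curve. In the original code y2temp is an evenly spaced sequence, so on the prediction grid y2 is a linear function of y rather than y^2. The quadratic term then contributes only a straight line, which is why the predicted surface comes out flat.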
Using simulated data:
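A minimal sketch of the effect with made-up data (the sample size, coefficients and noise level below are assumptions for illustration, not the nest6 data). It takes one slice through the surface at a fixed x and compares the curvature along y for the two ways of building y2 in newdata. It also checks that predict() without newdata returns the fitted values.

```r
# Simulated stand-in for nest6 (all values here are made up for illustration)
set.seed(1)
n  <- 200
x  <- runif(n, 0, 10)
y  <- runif(n, 0, 5)
y2 <- y^2
z  <- 1 + 0.5 * x + 2 * y - 0.6 * y2 + rnorm(n, sd = 0.1)
m  <- lm(z ~ x * y + y2 + x:y2)

# With no newdata, predict() gives the same numbers as fitted()
same_as_fitted <- isTRUE(all.equal(fitted(m), predict(m)))

# One slice through the surface: x held fixed, y varying
i     <- 25
ytemp <- seq(min(y), max(y), length.out = i)
x0    <- mean(x)

# Grid as in the question: y2 is evenly spaced, hence linear in y
wrong <- data.frame(x = x0, y = ytemp,
                    y2 = seq(min(y2), max(y2), length.out = i))
# Corrected grid: y2 is the square of the y values actually used
right <- data.frame(x = x0, y = ytemp, y2 = ytemp^2)

z_wrong <- predict(m, newdata = wrong)
z_right <- predict(m, newdata = right)

# Second differences along y measure curvature; near zero means a flat slice
curv_wrong <- max(abs(diff(z_wrong, differences = 2)))
curv_right <- max(abs(diff(z_right, differences = 2)))
print(c(wrong = curv_wrong, right = curv_right))
```

The first number should be zero up to rounding error and the second clearly positive, so the model itself is curved and only the prediction grid flattens it.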


Export results from LOESS plot

I am trying to export the underlying data from a LOESS plot (the blue line).
I found this post on the subject and was able to get the export working as it describes:
Can I export the result from a loess regression out of R?
However, as the last comment on that post notes, the exported values do not match my LOESS line. Does anyone have any insights on how to get it to export properly?
Thanks!
Code for my export is here:
#loess object
CL111_loess <- loess(dur_cleaned~TS_LightOn, data = CL111)
#get SE
CL111_predict <- predict(CL111_loess, se=T)
CL111_ouput <- data.frame("fitted" = CL111_predict$fit, "SE"=CL111_predict$se.fit)
write.csv(CL111_ouput, "CL111_output.csv")
Data for the original plot is here:
Code for my original plot is here:
#individual plot
ggplot(data = CL111) + geom_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "lm", se = FALSE, colour = "Green") +
labs(x = "TS Light On (Seconds)", y = "TS Response Time (Seconds)", title = "Layout 1, Condition AO, INS High") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "loess", se = TRUE) + xlim(0, 400) + ylim (0, 1.0)
#find coefficients for best fit line
lm(CL111_LM$dur_cleaned ~ CL111_LM$TS_LightOn)
You can get this information via ggplot_build().
If your plot is saved as gg1, run ggplot_build(gg1); then you have to examine the data object (which is a list of data for different layers) and try to figure out which layer you need (in this case, I looked for which data layer included a colour column that matched the smooth line ...):
bb <- ggplot_build(gg1)
## extract the right component, just the x/y coordinates
out <- bb$data[[2]][,c("x","y")]
## check
plot(y~x, data = out)
You can do whatever you want with this output now (write.csv(), save(), saveRDS() ...)
I agree that there is something weird/that I don't understand about the way that ggplot2 is setting up the loess fit. You do have to do predict() with the right newdata (e.g. a data frame with a single column TS_LightOn that ranges from 0 to 400) - otherwise you get predictions of the points in your data set, which may not be properly spaced/in the right order - but that doesn't resolve the difference for me.
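As a minimal sketch of the predict()-with-newdata step, assuming simulated data in place of CL111 (the data frame CL111_sim and its values are made up; only the column names follow the question). ggplot2's smoother evaluates the fit on 80 evenly spaced x values by default, so the grid below uses the same count. One possible contributor to the mismatch, stated as an assumption about this particular plot: xlim() and ylim() set scale limits, which drop out-of-range observations before the smoother is fitted, so the plotted curve may be based on fewer rows than loess() sees.

```r
# Simulated stand-in for CL111 (made-up values; column names follow the question)
set.seed(42)
CL111_sim <- data.frame(TS_LightOn = sort(runif(150, 0, 400)))
CL111_sim$dur_cleaned <- 0.5 + 0.2 * sin(CL111_sim$TS_LightOn / 60) +
  rnorm(150, sd = 0.05)

fit <- loess(dur_cleaned ~ TS_LightOn, data = CL111_sim)

# Evenly spaced grid inside the observed range; with the default settings
# loess() returns NA for x values outside the range of the data
grid <- data.frame(
  TS_LightOn = seq(min(CL111_sim$TS_LightOn), max(CL111_sim$TS_LightOn),
                   length.out = 80)
)
pred <- predict(fit, newdata = grid, se = TRUE)

# One row per grid point, ordered by x, ready for export
loess_out <- data.frame(TS_LightOn = grid$TS_LightOn,
                        fitted     = pred$fit,
                        SE         = pred$se.fit)
head(loess_out)
# write.csv(loess_out, "CL111_loess_grid.csv", row.names = FALSE)
```

If the plot restricts the data with xlim()/ylim(), applying the same filter to the data frame before calling loess() is one way to test whether that explains the difference.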
To complement @ben-bolker's answer, I have just written a small function that may be useful for retrieving the internal dataset created by ggplot for a geom_smooth call. It takes the resultant ggplot as input and returns the smoothed data. The problem it solves is that, as Ben observed, internally ggplot creates a smoothed fit with predicted data on random intervals, different from the interval used for the input data. This function will get you back the ggplot fit data with an interval based on integer and equally spaced values. That function uses a loess fit on the already smoothed data, using a small value of span (0.1), that is adjusted upward on-the-fly to cope with small numbers of values.
This is useful if you used geom_smooth with a method that is not 'loess' or using 'NULL' and you cannot easily build a model that replicates what geom_smooth is doing internally.
The function separates different series on the same plot as well as series located on different facets. It also returns the 'ymin' and 'ymax' values.
Note that this function uses an interval based on integer values of x. You can modify this if you need an interval based on equally-spaced values of x, but not integral. In that case, pass your x interval of choice in the xInterval parameter, or tweak the line:
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x)))).
get_geom_smooth_dataFromPlot <- function (a_ggplot, xInterval=NULL) {
  #internal ggplot values read in ggTable
  ggTable <- ggplot_build(a_ggplot)$data[[1]]
  #facet panels
  panels <- as.numeric(names(table(ggTable$PANEL)))
  nPanel <- length(panels)
  onePanel <- (nPanel==1)
  #number of series in each plot
  groups <- as.numeric(names(table(ggTable$group)))
  nGroup <- length(groups)
  oneGroup <- (nGroup==1)
  out <- data.frame()
  #are there 'ymin' and 'ymax' values?
  SE_data <- "ymin" %in% colnames(ggTable)
  for (pan in (1:nPanel)) {
    for (grp in (1:nGroup)) {
      sub <- subset(ggTable, (PANEL==panels[pan])&(group==groups[grp]))
      #no group series for this facet panel?
      if (dim(sub)[1] == 0) next
      if (is.null(xInterval)) {
        outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
      } else {
        outOne <- data.frame(x=xInterval)
      }
      nObs <- dim(outOne)[1]
      #hack to avoid problems with a small range for the x interval
      # when there are more than 90 x values
      # we use a span of 0.1, but
      # we adjust on-the-fly up to a span of 0.5
      # for 10 values of the x interval
      cSpan <- max (0.1, 0.5 * 10 / (nObs-(nObs-10)/2))
      if (!onePanel) outOne$panel <- pan
      if (!oneGroup) outOne$group <- grp
      mod <- loess(y~x, data=sub, span=cSpan)
      outOne$y <- predict(mod, outOne$x, se=FALSE)
      if (SE_data) {
        mod <- loess(ymin~x, data=sub, span=cSpan)
        outOne$ymin <- predict(mod, outOne$x, se=FALSE)
        mod <- loess(ymax~x, data=sub, span=cSpan)
        outOne$ymax <- predict(mod, outOne$x, se=FALSE)
      }
      out <- rbind(out, outOne)
    }
  }
  return (out)
}

Place different QQ plot (with different datasets) in the same coordinate system

I can only get the QQ plots one at a time, one per dataset.
library(fitdistrplus)
x1<-c(1300,541,441,35,278,167,276,159,126,60.8,160,5000,264.6,379,170,251.3,155.84,187.01,850)
x2<-c(25,500,42,100,10,8.2,76,2.2,7.86,50)
y1<-log10(x1)
y2<-log10(x2)
x1.logis <- fitdist(y1, "logis", method="mle")
x2.logis <- fitdist(y2, "logis", method="mle")
ppcomp(x1.logis, addlegend=FALSE)
ppcomp(x2.logis, addlegend=FALSE)
How can I place the two QQ plots in the same coordinate system?
Use ggplot2. You need to extract your fitted values from the fitdist objects and make a new data frame. Use ggplot2 layers to add the fitted values from the two data sets and then add a reference line with geom_abline.
library(ggplot2)
fittedx1 <- data.frame(x = sort(plogis(x1.logis$data,
location = x1.logis$estimate[1],
scale = x1.logis$estimate[2])),
p = (1:length(x1.logis$data))/length(x1.logis$data))
fittedx2 <- data.frame(x = sort(plogis(x2.logis$data,
location = x2.logis$estimate[1],
scale = x2.logis$estimate[2])),
p = (1:length(x2.logis$data))/length(x2.logis$data))
fitted <- rbind(fittedx1,fittedx2) #You need to combine the two datasets
#Add a variable that identifies which dataset the values belong to
#Then you can use the col option in ggplot to give each data set its own color!
fitted$set <- c(rep("1", nrow(fittedx1)), rep("2", nrow(fittedx2)))
#Now plot
ggplot(fitted) +
geom_point(aes(p, x, col=set), shape=1, size=3) +
geom_abline(intercept=0, slope=1)
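Note that plogis() applied to the data gives theoretical probabilities, so the coordinates above are those of a P-P plot (as drawn by ppcomp). A minimal sketch of the quantile-quantile counterpart follows, using qlogis() instead. The helpers qq_logis and plot_qq are hypothetical names, and moment estimates stand in for the MLEs from fitdist() purely to keep the sketch free of extra packages; with fitdistrplus loaded, the estimate element of each fit can be used instead.

```r
# Log10-transformed data from the question
y1 <- log10(c(1300, 541, 441, 35, 278, 167, 276, 159, 126, 60.8, 160, 5000,
              264.6, 379, 170, 251.3, 155.84, 187.01, 850))
y2 <- log10(c(25, 500, 42, 100, 10, 8.2, 76, 2.2, 7.86, 50))

# Hypothetical helper: Q-Q coordinates against a logistic distribution.
# Assumption: moment estimates replace the MLEs returned by fitdist().
qq_logis <- function(y, label) {
  location <- mean(y)
  scale    <- sd(y) * sqrt(3) / pi  # logistic variance = scale^2 * pi^2 / 3
  data.frame(theoretical = qlogis(ppoints(length(y)), location, scale),
             empirical   = sort(y),
             set         = label)
}

# Stack both data sets, keeping a column that says which set a row came from
qq <- rbind(qq_logis(y1, "1"), qq_logis(y2, "2"))

# Hypothetical helper: draw both sets in one coordinate system (base graphics)
plot_qq <- function(qq) {
  plot(qq$theoretical, qq$empirical, col = as.integer(factor(qq$set)),
       xlab = "Theoretical quantiles", ylab = "Empirical quantiles")
  abline(0, 1)
}
```

Calling plot_qq(qq) draws the figure. The same data frame can be passed to ggplot with aes(theoretical, empirical, col = set) and the geom_point and geom_abline layers shown above.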

Plot decision boundaries with ggplot2?

How do I plot the equivalent of contour (base R) with ggplot2? Below is an example with linear discriminant function analysis:
require(MASS)
iris.lda<-lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
datPred<-data.frame(Species=predict(iris.lda)$class,predict(iris.lda)$x) #create data.frame
#Base R plot
eqscplot(datPred[,2],datPred[,3],pch=as.double(datPred[,1]),col=as.double(datPred[,1])+1)
#Create decision boundaries
iris.lda2 <- lda(datPred[,2:3], datPred[,1])
x <- seq(min(datPred[,2]), max(datPred[,2]), length.out=30)
y <- seq(min(datPred[,3]), max(datPred[,3]), length.out=30)
Xcon <- matrix(c(rep(x,length(y)),
rep(y, rep(length(x), length(y)))),,2) #Set all possible pairs of x and y on a grid
iris.pr1 <- predict(iris.lda2, Xcon)$post[, c("setosa","versicolor")] %*% c(1,1) #posterior probabilities of a point belonging to each class
contour(x, y, matrix(iris.pr1, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple") #Plot contour lines in the base R plot
iris.pr2 <- predict(iris.lda2, Xcon)$post[, c("virginica","setosa")] %*% c(1,1)
contour(x, y, matrix(iris.pr2, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple")
#Eqivalent plot with ggplot2 but without decision boundaries
ggplot(datPred, aes(x=LD1, y=LD2, col=Species) ) +
geom_point(size = 3, aes(pch = Species))
It is not possible to use a matrix when plotting contour lines with ggplot. The matrix can be rearranged to a data-frame using melt. In the data-frame below the probability values from iris.pr1 are displayed in the first column along with the x and y coordinates in the following two columns. The x and y coordinates form a grid of 30 x 30 points.
df <- transform(melt(matrix(iris.pr1, length(x), length(y))), x=x[X1], y=y[X2])[,-c(1,2)]
I would like to plot the coordinates (preferably connected by a smoothed curve) where the posterior probabilities are 0.5 (i.e. the decision boundaries).
You can use geom_contour in ggplot to achieve a similar effect. As you correctly assumed, you do have to transform your data. I ended up just doing
pr<-data.frame(x=rep(x, length(y)), y=rep(y, each=length(x)),
z1=as.vector(iris.pr1), z2=as.vector(iris.pr2))
And then you can pass that data.frame to the geom_contour and specify you want the breaks at 0.5 with
ggplot(datPred, aes(x=LD1, y=LD2) ) +
geom_point(size = 3, aes(pch = Species, col=Species)) +
geom_contour(data=pr, aes(x=x, y=y, z=z1), breaks=c(0,.5)) +
geom_contour(data=pr, aes(x=x, y=y, z=z2), breaks=c(0,.5))
and that gives the scatterplot with both decision boundaries drawn over the points.
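The rep() calls work because R stores matrices column by column, so as.vector() on a length(x) by length(y) matrix runs through x fastest. A minimal sketch that checks this ordering on a small made-up surface (the function f is a stand-in for the posterior probabilities, not part of the LDA example):

```r
# Small grid and a stand-in surface (made up for illustration)
x <- seq(-2, 2, length.out = 5)
y <- seq(0, 3, length.out = 4)
f <- function(x, y) x^2 + 10 * y

# z[i, j] = f(x[i], y[j]), the layout contour() expects
z <- outer(x, y, f)

# Long format as in the answer: x varies fastest, y slowest
pr <- data.frame(x = rep(x, length(y)),
                 y = rep(y, each = length(x)),
                 z = as.vector(z))

# expand.grid() builds the same x/y columns in one call
grid <- expand.grid(x = x, y = y)

# Each row's z value matches the surface at that row's coordinates
rows_match <- isTRUE(all.equal(pr$z, f(pr$x, pr$y)))
print(rows_match)
```

If the two rep() calls were swapped, the z values would be paired with the wrong coordinates and the contour lines would be scrambled.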
The partimat function in the klaR library does what you want for observed predictors, but if you want the same for the LDA projections, you can build a data frame augmenting the original with the LD1...LDk projections, then call partimat with formula Group~LD1+...+LDk, method='lda' - then you see the "LD-plane" that you intended to see, nicely partitioned for you. This seemed easier to me, at least to explain to students newer to R, since I'm just reusing a function already provided in a way in which it wasn't quite intended.

Nonparametric quantile regression curves to scatterplot

I created a scatterplot (multiple groups GRP) with IV=time, DV=concentration. I wanted to add the quantile regression curves (0.025,0.05,0.5,0.95,0.975) to my plot.
And by the way, this is what I did to create the scatter-plot:
attach(E) ## E is the name I gave to my data
## Change Group to factor so that may work with levels in the legend
Group<-as.character(Group)
Group<-as.factor(Group)
## Make the colored scatter-plot
mycolors = c('red','orange','green','cornflowerblue')
plot(Time,Concentration,main="Template",xlab="Time",ylab="Concentration",pch=18,col=mycolors[Group])
## This also works identically
## with(E,plot(Time,Concentration,col=mycolors[Group],main="Template",xlab="Time",ylab="Concentration",pch=18))
## Use identify to identify each point by group number (to check)
## identify(Time,Concentration,col=mycolors[Group],labels=Group)
## Press Esc or press Stop to stop identify function
## Create legend
## Use locator(n=1,type="o") to find the point to align top left of legend box
legend('topright',legend=levels(Group),col=mycolors,pch=18,title='Group')
Because the data that I created here is a small subset of my larger data, it may look like it can be approximated by a rectangular hyperbola. But I don't want to assume a mathematical relationship between my independent and dependent variables yet.
I think nlrq from the package quantreg may be the answer, but I don't understand how to use the function when I don't know the relationship between my variables.
I found this graph in a science article, and I want to make precisely the same kind of graph:
Again, thanks for your help!
Update
Test.csv
It was pointed out to me that my sample data is not reproducible. Here is a sample of my data.
library(evd)
qcbvnonpar(p=c(0.025,0.05,0.5,0.95,0.975),cbind(TAD,DV),epmar=T,plot=F,add=T)
I also tried evd::qcbvnonpar, but the curve doesn't seem very smooth.
Maybe have a look at quantreg::rqss, which fits quantile regression with smoothing splines.
Sorry for the not so nice example data:
set.seed(1234)
period <- 100
x <- 1:100
y <- sin(2*pi*x/period) + runif(length(x),-1,1)
require(quantreg)
mod <- rqss(y ~ qss(x))
mod2 <- rqss(y ~ qss(x), tau=0.75)
mod3 <- rqss(y ~ qss(x), tau=0.25)
plot(x, y)
lines(x[-1], mod$coef[1] + mod$coef[-1], col = 'red')
lines(x[-1], mod2$coef[1] + mod2$coef[-1], col = 'green')
lines(x[-1], mod3$coef[1] + mod3$coef[-1], col = 'green')
I have in the past frequently struggled with rqss and my issues have almost always been related to the ordering of the points.
You have multiple measurements at various time points, which is why you're getting different lengths. This works for me:
dat <- read.csv("~/Downloads/Test.csv")
library(quantreg)
dat <- plyr::arrange(dat,Time)
fit<-rqss(Concentration~qss(Time,constraint="N"),tau=0.5,data = dat)
with(dat,plot(Time,Concentration))
lines(unique(dat$Time)[-1],fit$coef[1] + fit$coef[-1])
Sorting the data frame prior to fitting the model appears necessary.
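Because there are repeated measurements at each time point, a crude cross-check is also possible without rqss: compute the empirical quantiles at every unique time and join them. This is a different and much rougher technique than the smoothing-spline fit above, sketched here on made-up data (the data frame, the decay curve and the helper plot_quantile_lines are assumptions for illustration).

```r
# Made-up data with 30 measurements at each of 20 time points
set.seed(1234)
dat <- data.frame(Time = rep(1:20, each = 30))
dat$Concentration <- 5 * exp(-dat$Time / 8) + runif(nrow(dat), -1, 1)
dat <- dat[sample(nrow(dat)), ]  # shuffle rows: input order does not matter here

taus <- c(0.025, 0.5, 0.975)

# One row per unique time (in increasing order), one column per quantile
q_by_time <- do.call(rbind,
                     tapply(dat$Concentration, dat$Time, quantile, probs = taus))
times <- sort(unique(dat$Time))

# Hypothetical helper: scatterplot with the empirical quantile lines on top
plot_quantile_lines <- function() {
  plot(Concentration ~ Time, data = dat, pch = 18)
  matlines(times, q_by_time, lty = 1, col = c("blue", "red", "blue"))
}
```

Calling plot_quantile_lines() draws the figure. The lines are not smoothed, so they will look jagged next to an rqss fit, but large disagreements between the two can point to an ordering or length problem in the rqss plotting code.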
In case you want ggplot2 graphic...
I based this example on that of @EDi. I increased the x and y so that the quantile lines would be less wiggly. Because of this increase, I need to use unique(x) in place of x in some of the calls.
Here's the modified set-up:
set.seed(1234)
period <- 100
x <- rep(1:100,each=100)
y <- 1*sin(2*pi*x/period) + runif(length(x),-1,1)
require(quantreg)
mod <- rqss(y ~ qss(x))
mod2 <- rqss(y ~ qss(x), tau=0.75)
mod3 <- rqss(y ~ qss(x), tau=0.25)
Here are the two plots:
# @EDi's base graphics example
plot(x, y)
lines(unique(x)[-1], mod$coef[1] + mod$coef[-1], col = 'red')
lines(unique(x)[-1], mod2$coef[1] + mod2$coef[-1], col = 'green')
lines(unique(x)[-1], mod3$coef[1] + mod3$coef[-1], col = 'green')
# @swihart's ggplot2 example:
library(ggplot2)
library(data.table)
## get into dataset so that ggplot2 can have some fun:
qrdf <- data.table(x = unique(x)[-1],
median = mod$coef[1] + mod$coef[-1],
qupp = mod2$coef[1] + mod2$coef[-1],
qlow = mod3$coef[1] + mod3$coef[-1]
)
line_size = 2
ggplot() +
geom_point(aes(x=x, y=y),
color="black", alpha=0.5) +
## quantiles:
geom_line(data=qrdf,aes(x=x, y=median),
color="red", alpha=0.7, size=line_size) +
geom_line(data=qrdf,aes(x=x, y=qupp),
color="blue", alpha=0.7, size=line_size, lty=1) +
geom_line(data=qrdf,aes(x=x, y=qlow),
color="blue", alpha=0.7, size=line_size, lty=1)

Restrict fitted regression line (abline) to range of data used in model

Is it possible to draw an abline of a fit only in a certain range of x-values?
I have a dataset with a linear fit of a subset of that dataset:
# The dataset:
daten <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
# make a linear fit for the datapoints 3, 4, 5
daten_fit <- lm(formula = y~x, data = daten, subset = 3:5)
When I plot the data and draw a regression line:
plot (y ~ x, data = daten)
abline(reg = daten_fit)
The line is drawn for the full range of x-values in the original data. But, I want to draw the regression line only for the subset of data that was used for curve fitting. There were 2 ideas that came to my mind:
1. Draw a second line that is thicker, but is only shown in the range 3:5. I checked the parameters for abline, lines and segments but I could not find anything.
2. Add small ticks at the respective positions, perpendicular to the abline. I have no idea how I could do this. This would be the nicer way, of course.
Do you have any idea for a solution?
The answer is no: it is not possible to get abline() to draw the fitted line over only the part of the plot region where the model was fitted. This is because it uses only the model coefficients to draw the line, not predictions from the model. If you look closely, you'll see that the line drawn actually extends outside the plot region, covering the plot frame where it exits the region.
The simplest solution to such problems is to predict from the model for the regions you want.
# The dataset:
daten <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
# make a linear fit for the datapoints 3, 4, 5
mod <- lm(y~x, data = daten, subset = 3:5)
First, we get the range of x values we want to differentiate:
xr <- with(daten, range(x[3:5]))
then we predict for a set of evenly spaced points on this range using the model:
pred <- data.frame(x = seq(from = xr[1], to = xr[2], length = 50))
pred <- transform(pred, yhat = predict(mod, newdata = pred))
Now plot the data and the model using abline():
plot(y ~ x, data = daten)
abline(mod)
then add in the region you want to emphasise:
lines(yhat ~ x, data = pred, col = "red", lwd = 2)
Which gives us this plot:
If you have a model that is more complex than that which can be handled by abline(), then we take a slightly different strategy, predicting over the range of the available, plotted data to draw the line, and then pick out the interval we want to highlight. The following code does that:
## range of all `x` data
xr2 <- with(daten, range(x))
## same as before
pred <- data.frame(x = seq(from = xr2[1], to = xr2[2], length = 100))
pred <- transform(pred, yhat = predict(mod, newdata = pred))
## plot the data and the fitted model line
plot(y ~ x, data = daten)
lines(yhat ~ x, data = pred)
## add emphasis to the interval used in fitting
with(pred, lines(yhat ~ x, data = pred, subset = x >= xr[1] & x <= xr[2],
lwd = 2, col = "red"))
What we do here is use the subset argument to pick out the values from the predictions that are in the interval used in fitting. The vector we pass to subset is a logical vector of TRUE and FALSE values indicating which data are in the region of interest, and lines() only plots a line along those data.
R> head(with(pred, x >= xr[1] & x <= xr[2]))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
One might wonder why I have done predictions over 50 or 100 evenly spaced values of the predictor variable when we could, in this case, just have done a prediction for the start and the end of the data or region of interest and joined the two points? Well, not all modelling exercises are that simple - your double log model from a previous question is a case in point - and the generic solution I outline above will work in all cases whereas simply joining two predictions won't.
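For completeness, the two-point shortcut mentioned above can be sketched as follows. It is only valid when the fitted relationship is a straight line on the plotted scale; the data frame ends and the helper draw_fit_segment are names introduced here for illustration.

```r
# The dataset and the fit on data points 3, 4, 5, as above
daten <- data.frame(x = 0:6, y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
mod   <- lm(y ~ x, data = daten, subset = 3:5)

# Predict only at the two ends of the x range used in fitting
xr   <- range(daten$x[3:5])
ends <- data.frame(x = xr)
ends$yhat <- predict(mod, newdata = ends)
print(ends)

# Hypothetical helper: plot the data and join the two predictions
draw_fit_segment <- function() {
  plot(y ~ x, data = daten)
  segments(ends$x[1], ends$yhat[1], ends$x[2], ends$yhat[2],
           col = "red", lwd = 2)
}
```

Calling draw_fit_segment() draws the red segment over x = 2 to x = 4 only. For a curved fit the segment would cut across the curve, which is why the grid of predictions is the safer general approach.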
@Andrie has furnished you with a solution to Idea 2.
One way would be to use colours to distinguish between points that are fitted and those that aren't:
daten_fit <- lm(formula = y~x, data = daten[3:5, ])
plot(y ~ x, data = daten)
points(y ~ x, data = daten[3:5, ], col="red")
abline(reg=daten_fit, col="red")
The second way is to plot the tick marks on the x-axis. These ticks are called rugs, and can be drawn using the rug function. But first you have to calculate the range:
#points(y ~ x, data = daten[3:5, ], col="red")
abline(reg=daten_fit, col="red")
rug(range(daten[3:5, 1]), lwd=3, col="red")
This is a somewhat basic plotting question -- use the ylim=c(low, high) option with suitable values for low and high.
You may want to read the An Introduction to R manual that came with your R version, and the other fine contributed documentation on the CRAN site.
