I created a scatterplot (multiple groups GRP) with IV=time, DV=concentration. I wanted to add the quantile regression curves (0.025,0.05,0.5,0.95,0.975) to my plot.
And by the way, this is what I did to create the scatter-plot:
attach(E) ## E is the name I gave to my data
## Change Group to factor so that may work with levels in the legend
Group<-as.character(Group)
Group<-as.factor(Group)
## Make the colored scatter-plot
mycolors = c('red','orange','green','cornflowerblue')
plot(Time,Concentration,main="Template",xlab="Time",ylab="Concentration",pch=18,col=mycolors[Group])
## This also works identically
## with(E,plot(Time,Concentration,col=mycolors[Group],main="Template",xlab="Time",ylab="Concentration",pch=18))
## Use identify to identify each point by group number (to check)
## identify(Time,Concentration,col=mycolors[Group],labels=Group)
## Press Esc or press Stop to stop identify function
## Create legend
## Use locator(n=1,type="o") to find the point to align top left of legend box
legend('topright',legend=levels(Group),col=mycolors,pch=18,title='Group')
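A small aside on why col=mycolors[Group] works: indexing a vector with a factor uses the factor's integer codes, so each point gets the color at the position of its group level. The sketch below uses made-up group labels, not the data in E.

```r
# Hypothetical group labels, converted the same way as in the question
Group <- as.factor(as.character(c(3, 1, 2, 1, 4)))
mycolors <- c('red', 'orange', 'green', 'cornflowerblue')

levels(Group)       # sorted level labels: "1" "2" "3" "4"
as.integer(Group)   # integer codes behind the factor: 3 1 2 1 4

# Indexing by a factor uses those integer codes
point_cols <- mycolors[Group]
point_cols
```

This is also why legend(..., legend=levels(Group), col=mycolors) lines up: the i-th level is drawn in the i-th color.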
Because the data shown here is a small subset of my larger data set, it may look as if it could be approximated by a rectangular hyperbola. But I don't want to assume a particular mathematical relationship between my independent and dependent variables yet.
I think nlrq from the package quantreg may be the answer, but I don't understand how to use the function when I don't know the relationship between my variables.
I found this graph in a scientific article, and I want to produce exactly the same kind of graph:
Again, thanks for your help!
Update
It was pointed out to me that my example was not reproducible. Here is a sample of my data: Test.csv
library(evd)
qcbvnonpar(p=c(0.025,0.05,0.5,0.95,0.975),cbind(TAD,DV),epmar=T,plot=F,add=T)
I also tried evd::qcbvnonpar, but the curves don't look very smooth.
Maybe have a look at quantreg::rqss, which combines smoothing splines with quantile regression.
Sorry for the not so nice example data:
set.seed(1234)
period <- 100
x <- 1:100
y <- sin(2*pi*x/period) + runif(length(x),-1,1)
require(quantreg)
mod <- rqss(y ~ qss(x))
mod2 <- rqss(y ~ qss(x), tau=0.75)
mod3 <- rqss(y ~ qss(x), tau=0.25)
plot(x, y)
lines(x[-1], mod$coef[1] + mod$coef[-1], col = 'red')
lines(x[-1], mod2$coef[1] + mod2$coef[-1], col = 'green')
lines(x[-1], mod3$coef[1] + mod3$coef[-1], col = 'green')
I have in the past frequently struggled with rqss and my issues have almost always been related to the ordering of the points.
You have multiple measurements at various time points, which is why you're getting different lengths. This works for me:
dat <- read.csv("~/Downloads/Test.csv")
library(quantreg)
dat <- plyr::arrange(dat,Time)
fit<-rqss(Concentration~qss(Time,constraint="N"),tau=0.5,data = dat)
with(dat,plot(Time,Concentration))
lines(unique(dat$Time)[-1],fit$coef[1] + fit$coef[-1])
Sorting the data frame prior to fitting the model appears necessary.
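To illustrate why the lengths differ, here is a minimal base R sketch with made-up measurements (the values are hypothetical, not taken from Test.csv). It sorts by time with order(), which should do the same job as plyr::arrange here, and compares the number of rows with the number of distinct time points.

```r
# Hypothetical data: several concentration measurements per time point
dat <- data.frame(
  Time          = c(2, 0, 1, 1, 2, 0, 4),
  Concentration = c(5.1, 9.8, 7.2, 6.9, 4.7, 10.3, 2.0)
)

# Sort by time before fitting (base R alternative to plyr::arrange)
dat <- dat[order(dat$Time), ]

n_rows  <- nrow(dat)                  # one entry per measurement
n_times <- length(unique(dat$Time))   # one entry per distinct time point
c(rows = n_rows, distinct_times = n_times)
```

Because n_times is smaller than n_rows, anything drawn once per distinct time point needs unique(dat$Time) rather than dat$Time on the x axis.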
In case you want ggplot2 graphic...
I based this example on @EDi's. I increased the number of x and y values so that the quantile lines would be less wiggly. Because x now contains repeated values, I need to use unique(x) in place of x in some of the calls.
Here's the modified set-up:
set.seed(1234)
period <- 100
x <- rep(1:100,each=100)
y <- 1*sin(2*pi*x/period) + runif(length(x),-1,1)
require(quantreg)
mod <- rqss(y ~ qss(x))
mod2 <- rqss(y ~ qss(x), tau=0.75)
mod3 <- rqss(y ~ qss(x), tau=0.25)
Here are the two plots:
# @EDi's base graphics example
plot(x, y)
lines(unique(x)[-1], mod$coef[1] + mod$coef[-1], col = 'red')
lines(unique(x)[-1], mod2$coef[1] + mod2$coef[-1], col = 'green')
lines(unique(x)[-1], mod3$coef[1] + mod3$coef[-1], col = 'green')
# @swihart's ggplot2 example:
library(data.table)
library(ggplot2)
## get into dataset so that ggplot2 can have some fun:
qrdf <- data.table(x = unique(x)[-1],
median = mod$coef[1] + mod$coef[-1],
qupp = mod2$coef[1] + mod2$coef[-1],
qlow = mod3$coef[1] + mod3$coef[-1]
)
line_size = 2
ggplot() +
geom_point(aes(x=x, y=y),
color="black", alpha=0.5) +
## quantiles:
geom_line(data=qrdf,aes(x=x, y=median),
color="red", alpha=0.7, size=line_size) +
geom_line(data=qrdf,aes(x=x, y=qupp),
color="blue", alpha=0.7, size=line_size, lty=1) +
geom_line(data=qrdf,aes(x=x, y=qlow),
color="blue", alpha=0.7, size=line_size, lty=1)
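Since this set-up has 100 y values at every x, a quick cross-check on the fitted curves is to compute plain empirical quantiles at each x. Note that this is not quantile regression, just a per-x summary in base R, and it only makes sense when there are many observations per x value. The sketch regenerates the same simulated data so it can run on its own.

```r
set.seed(1234)
period <- 100
x <- rep(1:100, each = 100)
y <- sin(2 * pi * x / period) + runif(length(x), -1, 1)

# Empirical quantiles of y within each distinct x (rows = x values)
probs <- c(0.25, 0.5, 0.75)
q <- t(sapply(split(y, x), quantile, probs = probs))

# Scatter plot with one line per quantile
plot(x, y, col = "grey70", pch = 16, cex = 0.4)
matlines(sort(unique(x)), q, lty = 1, lwd = 2,
         col = c("blue", "red", "blue"))
```

The unsmoothed lines will be more jagged than the rqss fits, but they should follow the same overall shape.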
Related
I have a vector of length 10k for each of the variables x and z. For each of the 10k, I have also estimated propensity scores using logit and other methods. So I have another vector that contains the predicted propensity scores.
I want to plot predicted propensity vector as the height of the 3d graph and as a function of the x and z vectors (I want something like a surface). What is the best way to go about doing this? I tried using scatter3d() from the plot3d library and it looks very bad.
Sample data: https://www.dropbox.com/s/1lf36dpxvebd7kw/mydata2.csv?dl=0
Updated Answer
Using the data you provided, we can bin the data, get the average propensity score by bin and plot using geom_tile. I provide code for that below. A better option would be to fit the propensity score model using the x and z vectors (and the binary treatment variable that you're predicting). Then, create a new data frame of predicted pz_p values on a complete grid of x and z values and plot that. I don't have your binary treatment variable with which to fit the model, so I haven't produced an actual plot, but the code would look something like this:
library(ggplot2)
# Propensity score model
m1 = glm(treat ~ x + z, data=dat, family=binomial)
# Get propensity scores on full grid of x and z values
n = 100 # Number of grid points. Adjust as needed.
pred.dat = expand.grid(x=seq(min(dat$x),max(dat$x),length=n),
z=seq(min(dat$z),max(dat$z),length=n))
pred.dat$pz_p = predict(m1, newdata=pred.dat, type="response")
ggplot(pred.dat, aes(x, z, fill=pz_p)) +
geom_tile() +
scale_fill_gradient2(low="red", mid="white", high="blue", midpoint=0.5, limits=c(0,1))
Code for tile plot with binned data:
library(tidyverse)
theme_set(theme_classic())
dat = read_csv("mydata2.csv")
# Bin by x and z
dat = dat %>%
mutate(xbin = cut(x,breaks=seq(round(min(x),1)-0.05,round(max(x),1)+0.05,0.1),
labels=seq(round(min(x),1), round(max(x),1),0.1)),
xbin=as.numeric(as.character(xbin)),
zbin = cut(z,breaks=seq(round(min(z),1)-0.1,round(max(z),1)+0.1,0.2),
labels=seq(round(min(z),1), round(max(z),1),0.2)),
zbin=as.numeric(as.character(zbin)))
# Calculate average pz_p by bin and then plot
ggplot(dat %>% group_by(xbin, zbin) %>%
summarise(pz_p=mean(pz_p)),
aes(xbin, zbin, fill=pz_p)) +
geom_tile() +
scale_fill_gradient2(low="red", mid="white", high="blue", midpoint=0.5, limits=c(0,1))
Original Answer
A heat map might work well here. For example:
library(ggplot2)
# Fake data
set.seed(2)
dat = expand.grid(x=seq(0,10,length=100),
z=seq(0,10,length=100))
dat$ps = 1/(1 + exp(0.3 + 0.2*dat$x - 0.5*dat$z))
ggplot(dat, aes(x, z, fill=ps)) +
geom_tile() +
scale_fill_gradient2(low="red", mid="white", high="blue", midpoint=0.5, limits=c(0,1)) +
coord_equal()
Or in 3D with rgl::persp3d:
library(rgl)
library(tidyverse)
x=unique(sort(dat$x))
z=unique(sort(dat$z))
ps=dat %>% spread(z, ps) %>% select(-1) %>% as.matrix
persp3d(x, z, ps, col="lightblue")
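If rgl is not available, a static surface can be sketched with persp() from base graphics instead. This is a swapped-in alternative, not the rgl approach above; it reuses the same fake propensity formula on a coarser grid, and the viewing angles are arbitrary choices.

```r
# Same fake propensity surface as above, built directly as a matrix
x <- seq(0, 10, length.out = 50)
z <- seq(0, 10, length.out = 50)
ps <- outer(x, z, function(x, z) 1 / (1 + exp(0.3 + 0.2 * x - 0.5 * z)))

# Static 3D surface; theta and phi set the (arbitrary) viewing angles
persp(x, z, ps, theta = 30, phi = 25, col = "lightblue",
      zlim = c(0, 1), xlab = "x", ylab = "z", zlab = "ps")
```

outer() puts x along the rows and z along the columns, which is the layout persp() expects for its third argument.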
I am trying to plot a vector, y which has 604800 points, against a sequence:
x=seq(from=1, to=604800). This is not a problem, but I do need to add a loess curve to the plots.
I have tried this using ggplot2 but this takes forever, and is notoriously bad at plotting large datasets. See R code:
vf <- ggplot(single.prop, aes(x,y)) + geom_line(linetype=1, size=1)
vf <- vf + stat_smooth(method="loess",fullrange=TRUE,aes(outfit=fit1<<-..y..))
vf
I have now tried to use the base package, but this is also taking forever:
lw <- loess(y ~ x,data=single.prop)
plot(y ~ x, data=single.prop,pch=19,cex=0.1)
k <- order(single.prop$x)
lines(single.prop$x[k],lw$fitted[k],col="red",lwd=3)
Does anyone else have any suggestions about what I can do to make this run quicker? I have to do this multiple times, and I have so far been waiting about 15 minutes for one plot, which still has not completed.
With this many data points it can indeed take a long time for the plot to render. It depends on the data, of course, but a plot with this many points often does not give a very interpretable picture. For both speed and interpretability it can be useful to calculate summary statistics first and then plot those. In your situation I can imagine that binning on x and calculating one or more statistics of y for every bin would be useful. I did a small example with the mean, but you can of course use any statistic you like. Hope this helps.
library(ggplot2)
x <- 1:10^6
y <- x/10^5 + rnorm(10^6)
plot_dat <- data.frame(x, y)
p <- ggplot(plot_dat, aes(x,y)) + geom_point()
bin_plot_dat <- function(bin_size){
nr_bins <- nrow(plot_dat) / bin_size
x2 <- rep(1:nr_bins * bin_size, each = bin_size)
y2 <- tapply(plot_dat$y, x2, mean)
data.frame(x = unique(x2), y= y2)
}
plot_dat2 <- bin_plot_dat(50)
p2 <- ggplot(plot_dat2, aes(x,y)) +
geom_point()
p2 + geom_smooth()
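Since the question asked specifically for a loess curve, the same binning idea can be combined with base R's loess(): fit the smoother to the bin means instead of the raw points. The sketch below is an approximation under stated assumptions. The data are simulated, the bin size of 600 is an arbitrary choice, and a loess fit to bin means is not identical to a loess fit to all 604800 points.

```r
set.seed(1)
n <- 604800
x <- seq_len(n)
y <- sin(x / 1e5) + rnorm(n)      # simulated stand-in for the real y

# Average x and y within consecutive bins (bin size is an assumption)
bin_size <- 600
bin <- ceiling(x / bin_size)
binned <- data.frame(
  x = as.vector(tapply(x, bin, mean)),
  y = as.vector(tapply(y, bin, mean))
)

# loess on about a thousand bin means instead of 604800 raw points
fit <- loess(y ~ x, data = binned)

plot(binned$x, binned$y, pch = 19, cex = 0.3)
lines(binned$x, fitted(fit), col = "red", lwd = 3)
```

Smaller bins keep more detail at the cost of a slower fit, so bin_size is worth tuning to the data.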
I have a scatter plot, and I want to know how I can find the genes above and below the confidence interval lines.
EDIT: Reproducible example:
library(ggplot2)
#dummy data
df <- mtcars[,c("mpg","cyl")]
#plot
ggplot(df,aes(mpg,cyl)) +
geom_point() +
geom_smooth()
I had to take a deep dive into the github repo but I finally got it. In order to do this you need to know how stat_smooth works. In this specific case the loess function is called to do the smoothing (the different smoothing functions can be constructed using the same process as below):
So, using loess on this occasion we would do:
#data
df <- mtcars[,c("mpg","cyl")]
#run loess model
cars.lo <- loess(cyl ~ mpg, df)
Then I had to read this in order to see how the predictions are made internally in stat_smooth. Apparently Hadley uses the predictdf function (which is not exported to the namespace) as follows for our case:
predictdf.loess <- function(model, xseq, se, level) {
pred <- stats::predict(model, newdata = data.frame(x = xseq), se = se)
if (se) {
y = pred$fit
ci <- pred$se.fit * stats::qt(level / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
} else {
data.frame(x = xseq, y = as.vector(pred))
}
}
After reading the above I was able to create my own data.frame of the predictions using:
#get the predictions i.e. the fit and se.fit vectors
pred <- predict(cars.lo, se=TRUE)
#create a data.frame from those
df2 <- data.frame(mpg=df$mpg, fit=pred$fit, se.fit=pred$se.fit * qt(0.95 / 2 + .5, pred$df))
Looking at predictdf.loess we can see that the upper boundary of the confidence interval is created as pred$fit + pred$se.fit * qt(0.95 / 2 + .5, pred$df) and the lower boundary as pred$fit - pred$se.fit * qt(0.95 / 2 + .5, pred$df).
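The multiplier in these expressions is just a t quantile: for a 95% level, level / 2 + .5 equals 0.975, so qt() returns the two-sided critical value. A tiny sketch, using a hypothetical 20 degrees of freedom in place of pred$df:

```r
level <- 0.95
dof <- 20                              # hypothetical stand-in for pred$df

mult_t <- qt(level / 2 + 0.5, dof)     # same as qt(0.975, dof)
mult_z <- qnorm(level / 2 + 0.5)       # normal-theory value, for comparison

c(t = mult_t, normal = mult_z)
```

The t multiplier is a little larger than the familiar normal value of about 1.96, which makes the band slightly wider when the degrees of freedom are small.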
Using those we can create a flag for the points over or below those boundaries:
#make the flag
outerpoints <- +(df$cyl > df2$fit + df2$se.fit | df$cyl < df2$fit - df2$se.fit)
#add flag to original data frame
df$outer <- outerpoints
The df$outer column is probably what the OP is looking for (it takes the value of 1 if it is outside the boundaries or 0 otherwise) but just for the sake of it I am plotting it below.
Notice the + function above is only used here to convert the logical flag into a numeric.
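As a quick illustration of that conversion (with a made-up logical vector), unary + gives the same result as as.integer():

```r
flag <- c(TRUE, FALSE, TRUE)   # made-up logical flag

num_flag <- +flag              # unary plus coerces logical to integer
num_flag
identical(num_flag, as.integer(flag))
```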
Now if we plot as this:
ggplot(df,aes(mpg,cyl)) +
geom_point(aes(colour=factor(outer))) +
geom_smooth()
We can actually see the points inside and outside the confidence interval.
Output:
P.S. For anyone who is interested in the upper and lower boundaries, they are created like this (speculation: although the shaded areas are probably created with geom_ribbon - or something similar - which makes them more round and pretty):
#upper boundary
ggplot(df,aes(mpg,cyl)) +
geom_point(aes(colour=factor(outer))) +
geom_smooth() +
geom_line(data=df2, aes(mpg , fit + se.fit , group=1), colour='red')
#lower boundary
ggplot(df,aes(mpg,cyl)) +
geom_point(aes(colour=factor(outer))) +
geom_smooth() +
geom_line(data=df2, aes(mpg , fit - se.fit , group=1), colour='red')
This solution takes advantage of the hard work ggplot2 does for you:
library(sp)
# we have to build the plot first so ggplot can do the calculations
ggplot(df,aes(mpg,cyl)) +
geom_point() +
geom_smooth() -> gg
# do the calculations
gb <- ggplot_build(gg)
# get the CI data
p <- gb$data[[2]]
# make a polygon out of it
poly <- data.frame(
x=c(p$x[1], p$x, p$x[length(p$x)], rev(p$x)),
y=c(p$ymax[1], p$ymin, p$ymax[length(p$x)], rev(p$ymax))
)
# test for original values in said polygon and add that to orig data
# so we can color by it
df$in_ci <- point.in.polygon(df$mpg, df$cyl, poly$x, poly$y)
# re-do the plot with the new data
ggplot(df,aes(mpg,cyl)) +
geom_point(aes(color=factor(in_ci))) +
geom_smooth()
It needs a bit of tweaking (i.e that last point getting a 2 value) but I'm limited on time. NOTE that the point.in.polygon return values are:
0: point is strictly exterior to pol
1: point is strictly interior to pol
2: point lies on the relative interior of an edge of pol
3: point is a vertex of pol
so it should be easy to just change the code to TRUE/FALSE whether value is 0 or not.
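A minimal sketch of that conversion, using a hypothetical vector of return codes rather than a real point.in.polygon call:

```r
# Hypothetical point.in.polygon return codes for six points
codes <- c(0, 1, 2, 3, 1, 0)

# TRUE for anything that is not strictly exterior (codes 1, 2 and 3)
inside <- codes != 0
inside
```

In the code above this would amount to wrapping the call as point.in.polygon(...) != 0 before storing it in df$in_ci.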
Using ggplot_build, as in @hrbrmstr's nice solution, you can actually do this by simply passing a sequence of x values to geom_smooth that specifies where the error bounds should be calculated, and making it equal to the x-values of your points. Then, you just check whether the y-values are within the range.
library(ggplot2)
## dummy data
df <- mtcars[,c("mpg","cyl")]
ggplot(df, aes(mpg, cyl)) +
## Recent ggplot2 versions take xseq directly (older ones used params=list(xseq=...))
geom_smooth(xseq=df$mpg) -> gg
## Find the points within bounds
bounds <- ggplot_build(gg)$data[[1]]
df$inside <- with(df, bounds$ymin < cyl & bounds$ymax > cyl)
## Add the points
gg + geom_point(data=df, aes(color=inside)) + theme_bw()
I would like to change the color of coefficient lines based on whether the point estimate is negative or positive in a ggplot2 coefficient plot in R. For example:
require(coefplot)
set.seed(123)
dat <- data.frame(x = rnorm(100), z = rnorm(100), y1 = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
coefplot.lm(mod1)
Which produces the following plot:
In this plot, I would like to change the "x" variable to red when plotted. Any ideas? Thanks.
I don't think you can do this with a plot produced by coefplot.lm. The coefplot package uses ggplot2 as its plotting system, which is good in itself, but it does not let you adjust the colors as easily as you would like. To get the desired colors, you need a variable in your data set that color-codes the values, and you need to map that variable to color inside aes() in the layer that draws the dots with their intervals. Apparently, this is impossible to do with the output of the coefplot.lm function. You might be able to change the colors using ggplot2's ggplot_build() function, but I would say it's easier to write your own function for this task.
I've done this once to plot odds. If you want, you may use my code. Feel free to change it. The idea is the same as in coefplot. First, we extract coefficients from a model object and prepare the data set for plotting; second, actually plot.
The code for extracting coefficients and data set preparation
df_plot_odds <- function(x){
tmp<-data.frame(cbind(exp(coef(x)), exp(confint.default(x))))
odds<-tmp[-1,]
names(odds)<-c('OR', 'lower', 'upper')
odds$vars<-row.names(odds)
odds$col<-odds$OR>1
odds$col[odds$col==TRUE] <-'blue'
odds$col[odds$col==FALSE] <-'red'
odds$pvalue <- summary(x)$coef[-1, "Pr(>|t|)"]
return(odds)
}
Plot the output of the extract function
plot_odds <- function(df_plot_odds, xlab="Odds Ratio", ylab="", asp=1){
require(ggplot2)
p <- ggplot(df_plot_odds, aes(x=vars, y=OR, ymin=lower, ymax=upper),asp=asp) +
geom_errorbar(aes(color=col),width=0.1) +
geom_point(aes(color=col),size=3)+
geom_hline(yintercept = 1, linetype=2) +
scale_color_manual('Effect', labels=c('Positive','Negative'),
values=c('blue','red'))+
coord_flip() +
theme_bw() +
theme(legend.position="none",aspect.ratio = asp)+
ylab(xlab) +
xlab(ylab) #switch because of the coord_flip() above
return(p)
}
Plotting your example
set.seed(123)
dat <- data.frame(x = rnorm(100),y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)
df <- df_plot_odds(mod1)
plot <- plot_odds(df)
plot
Which yields
Note that I chose theme_bw() as the default. The output is a ggplot2 object, so you can modify it quite a lot.
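For a plain lm fit it may be more natural to stay on the raw coefficient scale and color by the sign of the estimate, with the reference line at 0 instead of 1. The following is only a base R sketch of that idea under the same simulated data; the est data frame and its column names are made up for illustration and are not part of coefplot.

```r
set.seed(123)
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
mod <- lm(y ~ x + z, data = dat)

# Estimates and 95% confidence intervals, intercept dropped
ci <- confint(mod)
est <- data.frame(term = names(coef(mod)), estimate = unname(coef(mod)),
                  lower = ci[, 1], upper = ci[, 2])
est <- est[-1, ]

# Color-code by the sign of the point estimate
est$sign <- ifelse(est$estimate < 0, "negative", "positive")
cols <- c(negative = "red", positive = "blue")[est$sign]

# One row per coefficient: interval as a segment, estimate as a dot
rows <- seq_len(nrow(est))
plot(est$estimate, rows, xlim = range(est$lower, est$upper), pch = 19,
     col = cols, yaxt = "n", xlab = "Estimate", ylab = "")
segments(est$lower, rows, est$upper, rows, col = cols)
axis(2, at = rows, labels = est$term, las = 1)
abline(v = 0, lty = 2)
```

The same est data frame could be handed to ggplot2 with the sign column mapped to color.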
How do I plot the equivalent of contour (base R) with ggplot2? Below is an example with linear discriminant function analysis:
require(MASS)
iris.lda<-lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
datPred<-data.frame(Species=predict(iris.lda)$class,predict(iris.lda)$x) #create data.frame
#Base R plot
eqscplot(datPred[,2],datPred[,3],pch=as.double(datPred[,1]),col=as.double(datPred[,1])+1)
#Create decision boundaries
iris.lda2 <- lda(datPred[,2:3], datPred[,1])
x <- seq(min(datPred[,2]), max(datPred[,2]), length.out=30)
y <- seq(min(datPred[,3]), max(datPred[,3]), length.out=30)
Xcon <- matrix(c(rep(x,length(y)),
rep(y, rep(length(x), length(y)))),,2) #Set all possible pairs of x and y on a grid
iris.pr1 <- predict(iris.lda2, Xcon)$post[, c("setosa","versicolor")] %*% c(1,1) #posterior probabilities of a point belonging to each class
contour(x, y, matrix(iris.pr1, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple") #Plot contour lines in the base R plot
iris.pr2 <- predict(iris.lda2, Xcon)$post[, c("virginica","setosa")] %*% c(1,1)
contour(x, y, matrix(iris.pr2, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple")
#Equivalent plot with ggplot2 but without decision boundaries
library(ggplot2)
ggplot(datPred, aes(x=LD1, y=LD2, col=Species) ) +
geom_point(size = 3, aes(pch = Species))
It is not possible to use a matrix directly when plotting contour lines with ggplot. The matrix can be rearranged into a data frame using melt (the X1 and X2 column names used below are the ones produced by melt from the reshape package). In the data frame below, the probability values from iris.pr1 are in the first column, followed by the x and y coordinates in the next two columns. The x and y coordinates form a grid of 30 x 30 points.
df <- transform(melt(matrix(iris.pr1, length(x), length(y))), x=x[X1], y=y[X2])[,-c(1,2)]
I would like to plot the coordinates (preferably connected by a smoothed curve) where the posterior probabilities are 0.5 (i.e. the decision boundaries).
You can use geom_contour in ggplot to achieve a similar effect. As you correctly assumed, you do have to transform your data. I ended up just doing
pr<-data.frame(x=rep(x, length(y)), y=rep(y, each=length(x)),
z1=as.vector(iris.pr1), z2=as.vector(iris.pr2))
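The rep() layout used here is the same ordering that expand.grid() produces, and it matches how as.vector() unrolls a matrix with length(x) rows (column by column). A toy check with made-up coordinates, not the LDA grid:

```r
x <- c(1, 2, 3)
y <- c(10, 20)

# Manual layout as in the answer: x varies fastest, y is repeated in blocks
manual <- data.frame(x = rep(x, length(y)), y = rep(y, each = length(x)))

# expand.grid gives the same ordering
grid <- expand.grid(x = x, y = y)
same_order <- all(manual$x == grid$x) && all(manual$y == grid$y)

# A matrix indexed [x, y] unrolls column by column, matching that layout
m <- outer(x, y, "+")
manual$z <- as.vector(m)
aligned <- all(manual$z == manual$x + manual$y)

c(same_order = same_order, aligned = aligned)
```

This is why as.vector(iris.pr1) can be placed next to the x and y columns without any reshaping.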
And then you can pass that data.frame to the geom_contour and specify you want the breaks at 0.5 with
ggplot(datPred, aes(x=LD1, y=LD2) ) +
geom_point(size = 3, aes(pch = Species, col=Species)) +
geom_contour(data=pr, aes(x=x, y=y, z=z1), breaks=c(0,.5)) +
geom_contour(data=pr, aes(x=x, y=y, z=z2), breaks=c(0,.5))
and that gives
The partimat function in the klaR library does what you want for observed predictors, but if you want the same for the LDA projections, you can build a data frame augmenting the original with the LD1...LDk projections, then call partimat with formula Group~LD1+...+LDk, method='lda' - then you see the "LD-plane" that you intended to see, nicely partitioned for you. This seemed easier to me, at least to explain to students newer to R, since I'm just reusing a function already provided in a way in which it wasn't quite intended.