Quick way to add loess curve to large data set graph - r

I am trying to plot a vector y, which has 604800 points, against a sequence:
x=seq(from=1, to=604800). That part is not a problem, but I also need to add a loess curve to the plot.
I have tried this with ggplot2, but it takes forever; ggplot2 is notoriously slow at plotting large datasets. See R code:
vf <- ggplot(single.prop, aes(x,y)) + geom_line(linetype=1, size=1)
vf <- vf + stat_smooth(method="loess",fullrange=TRUE,aes(outfit=fit1<<-..y..))
vf
I have now tried base graphics instead, but this is also taking forever:
lw <- loess(y ~ x,data=single.prop)
plot(y ~ x, data=single.prop,pch=19,cex=0.1)
k <- order(single.prop$x)
lines(single.prop$x[k],lw$fitted[k],col="red",lwd=3)
Does anyone have suggestions for making this run quicker? I have to do this multiple times, and so far I have been waiting about 15 minutes for one plot, which has still not completed.

With this many data points the plot can indeed take a long time to render. It depends on the data, of course, but a plot with this many points often does not give a very interpretable picture. For both speed and interpretability it can be useful to calculate summary statistics first and then plot those. In your situation I would bin on x and calculate one or more statistics of y for every bin. I did a small example with the mean, but you can use whichever statistic you like. Hope this helps.
library(ggplot2)
x <- 1:10^6
y <- x/10^5 + rnorm(10^6)
plot_dat <- data.frame(x, y)
p <- ggplot(plot_dat, aes(x,y)) + geom_point()
bin_plot_dat <- function(bin_size){
nr_bins <- nrow(plot_dat) / bin_size
x2 <- rep(1:nr_bins * bin_size, each = bin_size)
y2 <- tapply(plot_dat$y, x2, mean)
data.frame(x = unique(x2), y= y2)
}
plot_dat2 <- bin_plot_dat(50)
p2 <- ggplot(plot_dat2, aes(x,y)) +
geom_point()
p2 + geom_smooth()
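The same binning idea can be applied to the loess fit itself, using base R only. This is a minimal sketch under stated assumptions: the data are simulated to match the size in the question, and bin_size = 100 and span = 0.3 are illustrative values, not recommendations. Fitting loess to the bin means is an approximation of fitting it to the raw points, not an identical result.

```r
# Simulated stand-in for the 604800-point series in the question (assumption)
set.seed(1)
n <- 604800
x <- seq_len(n)
y <- x / 1e5 + rnorm(n)

# Collapse every 100 consecutive points into one bin and take the bin means
bin_size <- 100                      # illustrative choice
bin_id <- (x - 1) %/% bin_size
binned <- data.frame(
  x = as.vector(tapply(x, bin_id, mean)),
  y = as.vector(tapply(y, bin_id, mean))
)

# loess on ~6000 bin means is fast; span is an illustrative choice
fit <- loess(y ~ x, data = binned, span = 0.3)
binned$smooth <- predict(fit)

# Plot the bin means and overlay the smoothed curve
plot(binned$x, binned$y, pch = 19, cex = 0.1, xlab = "x", ylab = "binned mean of y")
lines(binned$x, binned$smooth, col = "red", lwd = 3)
```

If the raw points must stay visible, the curve fitted on the binned data can still be drawn over a plot of the full data with lines().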

Related

Plotting with ggplot 2 two time series with different number of rows and different measurement

I am unfamiliar with using ggplot2 to plot two time series datasets that have a different number of rows and different measurement scales. Previous answers show how to solve the first problem, and I'm fairly sure I can solve the second one too, but I don't know how to solve both together. Here is code that simulates what I need to do; in the plot it produces, the two series have very different ranges. How can I set two different y-axes?
x <- rnorm(100)
y <- rnorm(100)
data1 <- data.frame(x,y)
x1 <- rnorm(50)
y1 <-rnorm(50) + 500
data2 <- data.frame(x1,y1)
names(data2)[1]<-paste("x")
names(data2)[2]<-paste("y")
data <- rbind(data1,data2)
data$dataset = c(rep("A", 100), rep("B", 50))
ggplot(data, aes(x = x, y = y, col=dataset)) + geom_line()
Thanks for your attention and excuse me if the question is not clear enough, this is my first question here and I know I have to learn a lot.
I would use facet_grid() with a free y axis, so each dataset gets its own panel and its own y scale:
ggplot(data, aes(x = x, y = y, col=dataset)) + geom_line() +
facet_grid(dataset ~., scales = "free_y")

Plot with one line for each column and time-series on the x-axis R

You can find my dataset here.
From this data, I wish to plot (one line for each):
x$y[,1]
x$y[,5]
x$y[,1]+x$y[,5]
Therefore, more clearly, in the end, each of the following will be represented by one line:
y0,
z0,
y0+z0
My x-axis (time-series) will be from x$t.
I have tried the following, but the time-series variable is problematic and I cannot figure out how I can exactly plot it. My code is:
Time <- x$t
X0 <- x$y[,1]
Z0 <- x$y[,5]
X0.plus.Z0 <- X0 + Z0
xdf0 <- cbind(Time,X0,Z0,X0.plus.Z0)
xdf0.melt <- melt(xdf0, id.vars="Time")
ggplot(data = xdf0.melt, aes(x=Time, y=value)) + geom_line(aes(colour=Var2))
The error in your code comes from the use of melt applied to an object that is not a data.frame. You should modify like this:
xdf0 <- cbind.data.frame(Time,X0,Z0,X0.plus.Z0)
xdf0.melt <- reshape2::melt(xdf0, id.vars="Time")
ggplot(data = xdf0.melt, aes(x=Time, y=value)) + geom_line(aes(colour=variable))
You don't have to go through the melt process, since you have just 3 lines to plot; it's fine to plot them separately:
ggplot(data=xdf0) + aes(x=Time) +
geom_line(aes(y=X0), col="red") +
geom_line(aes(y=Z0), col="blue") +
geom_line(aes(y=X0.plus.Z0))
However, you don't get the legend.
A remark about your example: you try to plot values of really different order of magnitude, so you can't really see anything.
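How about matplot? Pass Time as the x argument so that it is not drawn as a fourth series:
matplot(xdf0[, "Time"], xdf0[, -1], type = 'l')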
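The two remarks above (no legend, and series of very different magnitude) can both be handled in base graphics. The sketch below is only an illustration: the asker's dataset is not available here, so Time, X0 and Z0 are made-up stand-ins, and the log y axis assumes all values are positive.

```r
# Hypothetical stand-in data; the real x$t and x$y are not available here
Time <- seq(0, 10, length.out = 200)
X0 <- 1000 * exp(-Time) + 1
Z0 <- 1 + Time / 10
xdf0 <- data.frame(Time, X0, Z0, X0.plus.Z0 = X0 + Z0)

# Every column except Time is a series to draw
series <- setdiff(names(xdf0), "Time")
cols <- c("red", "blue", "black")

# One line per column; log = "y" keeps the small series visible
matplot(xdf0$Time, as.matrix(xdf0[, series]), type = "l", lty = 1,
        col = cols, log = "y", xlab = "Time", ylab = "value")
legend("topright", legend = series, col = cols, lty = 1)
```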

Plot decision boundaries with ggplot2?

How do I plot the equivalent of contour (base R) with ggplot2? Below is an example with linear discriminant function analysis:
require(MASS)
iris.lda<-lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
datPred<-data.frame(Species=predict(iris.lda)$class,predict(iris.lda)$x) #create data.frame
#Base R plot
eqscplot(datPred[,2],datPred[,3],pch=as.double(datPred[,1]),col=as.double(datPred[,1])+1)
#Create decision boundaries
iris.lda2 <- lda(datPred[,2:3], datPred[,1])
x <- seq(min(datPred[,2]), max(datPred[,2]), length.out=30)
y <- seq(min(datPred[,3]), max(datPred[,3]), length.out=30)
Xcon <- matrix(c(rep(x,length(y)),
rep(y, rep(length(x), length(y)))),,2) #Set all possible pairs of x and y on a grid
iris.pr1 <- predict(iris.lda2, Xcon)$post[, c("setosa","versicolor")] %*% c(1,1) #posterior probabilities of a point belonging to each class
contour(x, y, matrix(iris.pr1, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple") #Plot contour lines in the base R plot
iris.pr2 <- predict(iris.lda2, Xcon)$post[, c("virginica","setosa")] %*% c(1,1)
contour(x, y, matrix(iris.pr2, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple")
#Eqivalent plot with ggplot2 but without decision boundaries
ggplot(datPred, aes(x=LD1, y=LD2, col=Species) ) +
geom_point(size = 3, aes(pch = Species))
It is not possible to use a matrix when plotting contour lines with ggplot. The matrix can be rearranged to a data-frame using melt. In the data-frame below the probability values from iris.pr1 are displayed in the first column along with the x and y coordinates in the following two columns. The x and y coordinates form a grid of 30 x 30 points.
df <- transform(melt(matrix(iris.pr1, length(x), length(y))), x=x[X1], y=y[X2])[,-c(1,2)]
I would like to plot the coordinates (preferably connected by a smoothed curve) where the posterior probabilities are 0.5 (i.e. the decision boundaries).
You can use geom_contour in ggplot to achieve a similar effect. As you correctly assumed, you do have to transform your data. I ended up just doing
pr<-data.frame(x=rep(x, length(y)), y=rep(y, each=length(x)),
z1=as.vector(iris.pr1), z2=as.vector(iris.pr2))
And then you can pass that data.frame to the geom_contour and specify you want the breaks at 0.5 with
ggplot(datPred, aes(x=LD1, y=LD2) ) +
geom_point(size = 3, aes(pch = Species, col=Species)) +
geom_contour(data=pr, aes(x=x, y=y, z=z1), breaks=c(0,.5)) +
geom_contour(data=pr, aes(x=x, y=y, z=z2), breaks=c(0,.5))
That draws both decision boundaries on top of the scatterplot.
The partimat function in the klaR library does what you want for observed predictors, but if you want the same for the LDA projections, you can build a data frame augmenting the original with the LD1...LDk projections, then call partimat with formula Group~LD1+...+LDk, method='lda' - then you see the "LD-plane" that you intended to see, nicely partitioned for you. This seemed easier to me, at least to explain to students newer to R, since I'm just reusing a function already provided in a way in which it wasn't quite intended.
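If the goal is the boundary coordinates themselves rather than a drawn contour, base R's contourLines() returns them directly. The following is a sketch under assumptions: it refits the two-dimensional LDA as in the question, uses expand.grid() in place of the manual grid construction, and the boundary data frame with its piece column is a name introduced here for illustration.

```r
library(MASS)

# Project iris onto the discriminant axes, as in the question
iris.lda <- lda(Species ~ ., data = iris)
ld <- predict(iris.lda)
datPred <- data.frame(Species = ld$class, ld$x)

# Refit on the two projections and evaluate posteriors on a 30 x 30 grid
iris.lda2 <- lda(datPred[, c("LD1", "LD2")], datPred$Species)
x <- seq(min(datPred$LD1), max(datPred$LD1), length.out = 30)
y <- seq(min(datPred$LD2), max(datPred$LD2), length.out = 30)
grid <- expand.grid(LD1 = x, LD2 = y)   # LD1 varies fastest
post <- predict(iris.lda2, grid)$posterior

# P(setosa) + P(versicolor), the same quantity as iris.pr1 in the question
z <- matrix(post[, "setosa"] + post[, "versicolor"], length(x), length(y))

# Coordinates where that probability equals 0.5
cl <- contourLines(x, y, z, levels = 0.5)
boundary <- do.call(rbind, lapply(seq_along(cl), function(i) {
  data.frame(piece = i, x = cl[[i]]$x, y = cl[[i]]$y)
}))
head(boundary)
```

The resulting data frame can be passed to lines() in base graphics or to geom_path() in ggplot2, grouping by piece if the contour comes back in several segments.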

Nonparametric quantile regression curves to scatterplot

I created a scatterplot (multiple groups GRP) with IV=time, DV=concentration. I wanted to add the quantile regression curves (0.025,0.05,0.5,0.95,0.975) to my plot.
And by the way, this is what I did to create the scatter-plot:
attach(E) ## E is the name I gave to my data
## Change Group to factor so that may work with levels in the legend
Group<-as.character(Group)
Group<-as.factor(Group)
## Make the colored scatter-plot
mycolors = c('red','orange','green','cornflowerblue')
plot(Time,Concentration,main="Template",xlab="Time",ylab="Concentration",pch=18,col=mycolors[Group])
## This also works identically
## with(E,plot(Time,Concentration,col=mycolors[Group],main="Template",xlab="Time",ylab="Concentration",pch=18))
## Use identify to identify each point by group number (to check)
## identify(Time,Concentration,col=mycolors[Group],labels=Group)
## Press Esc or press Stop to stop identify function
## Create legend
## Use locator(n=1,type="o") to find the point to align top left of legend box
legend('topright',legend=levels(Group),col=mycolors,pch=18,title='Group')
Because the data shown here is a small subset of my larger data, it may look as if it can be approximated by a rectangular hyperbola. But I don't want to assume a mathematical relationship between my independent and dependent variables yet.
I think nlrq from the package quantreg may be the answer, but I don't understand how to use the function when I don't know the relationship between my variables.
I found this graph in a scientific article, and I want to produce precisely the same kind of graph:
Again, thanks for your help!
Update
Test.csv
It was pointed out to me that my sample data was not reproducible. Here is a sample of my data.
library(evd)
qcbvnonpar(p=c(0.025,0.05,0.5,0.95,0.975),cbind(TAD,DV),epmar=T,plot=F,add=T)
I also tried evd::qcbvnonpar, but the curve doesn't seem very smooth.
Maybe have a look at quantreg::rqss for smoothing splines and quantile regression.
Sorry for the not so nice example data:
set.seed(1234)
period <- 100
x <- 1:100
y <- sin(2*pi*x/period) + runif(length(x),-1,1)
require(quantreg)
mod <- rqss(y ~ qss(x))
mod2 <- rqss(y ~ qss(x), tau=0.75)
mod3 <- rqss(y ~ qss(x), tau=0.25)
plot(x, y)
lines(x[-1], mod$coef[1] + mod$coef[-1], col = 'red')
lines(x[-1], mod2$coef[1] + mod2$coef[-1], col = 'green')
lines(x[-1], mod3$coef[1] + mod3$coef[-1], col = 'green')
I have in the past frequently struggled with rqss and my issues have almost always been related to the ordering of the points.
You have multiple measurements at various time points, which is why you're getting different lengths. This works for me:
dat <- read.csv("~/Downloads/Test.csv")
library(quantreg)
dat <- plyr::arrange(dat,Time)
fit<-rqss(Concentration~qss(Time,constraint="N"),tau=0.5,data = dat)
with(dat,plot(Time,Concentration))
lines(unique(dat$Time)[-1],fit$coef[1] + fit$coef[-1])
Sorting the data frame prior to fitting the model appears necessary.
In case you want ggplot2 graphic...
I based this example on that of @EDi. I increased the number of x and y values so that the quantile lines would be less wiggly. Because x now contains repeated values, I need to use unique(x) in place of x in some of the calls.
Here's the modified set-up:
set.seed(1234)
period <- 100
x <- rep(1:100,each=100)
y <- 1*sin(2*pi*x/period) + runif(length(x),-1,1)
require(quantreg)
mod <- rqss(y ~ qss(x))
mod2 <- rqss(y ~ qss(x), tau=0.75)
mod3 <- rqss(y ~ qss(x), tau=0.25)
Here are the two plots:
# @EDi's base graphics example
plot(x, y)
lines(unique(x)[-1], mod$coef[1] + mod$coef[-1], col = 'red')
lines(unique(x)[-1], mod2$coef[1] + mod2$coef[-1], col = 'green')
lines(unique(x)[-1], mod3$coef[1] + mod3$coef[-1], col = 'green')
# @swihart's ggplot2 example:
library(ggplot2)
library(data.table)
## get into dataset so that ggplot2 can have some fun:
qrdf <- data.table(x = unique(x)[-1],
median = mod$coef[1] + mod$coef[-1],
qupp = mod2$coef[1] + mod2$coef[-1],
qlow = mod3$coef[1] + mod3$coef[-1]
)
line_size = 2
ggplot() +
geom_point(aes(x=x, y=y),
color="black", alpha=0.5) +
## quantiles:
geom_line(data=qrdf,aes(x=x, y=median),
color="red", alpha=0.7, size=line_size) +
geom_line(data=qrdf,aes(x=x, y=qupp),
color="blue", alpha=0.7, size=line_size, lty=1) +
geom_line(data=qrdf,aes(x=x, y=qlow),
color="blue", alpha=0.7, size=line_size, lty=1)
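Because the modified set-up has 100 observations at every x value, the fitted curves can be sanity-checked against plain empirical quantiles computed per x. This is a different and cruder technique than rqss (no smoothing, and it only works when there are repeated measurements at each x), shown here as a base R sketch with the same simulated data.

```r
# Same simulated data as the modified set-up above
set.seed(1234)
period <- 100
x <- rep(1:100, each = 100)
y <- sin(2 * pi * x / period) + runif(length(x), -1, 1)

# Empirical quantiles of y within each distinct x value
probs <- c(0.25, 0.5, 0.75)
q <- t(sapply(split(y, x), quantile, probs = probs))
xs <- as.numeric(rownames(q))

# Scatterplot with the three unsmoothed quantile curves
plot(x, y, pch = 19, cex = 0.2, col = "grey60")
matlines(xs, q, lty = 1, lwd = 2, col = c("green", "red", "green"))
```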

Plotting CCDF of walking durations

I plotted the CCDF as described in the question part of the "maximum plot points in R?" post and got a plot (image1) with this code:
ccdf <- function(duration)
{
# frequency of each distinct duration value
freqs <- table(duration)
# largest values first, so the cumulative sum counts observations >= x
X <- rev(as.numeric(names(freqs)))
Y <- cumsum(rev(as.vector(freqs)))
data.frame(x=X,count=Y)
}
qplot(x,count,data=ccdf(duration),log='xy')
Then, based on the answer by teucer to "Howto Plot “Reverse” Cumulative Frequency Graph With ECDF", I tried to plot a CCDF using the commands below:
f <- ecdf(duration)
plot(1-f(duration),duration)
I got a plot like image2.
I also read, in the comments on one of the answers to "Plotting CDF of a dataset in R?", that the CCDF is nothing but 1-ECDF.
I am totally confused about how to get the CCDF of my data.
Image1
Image2
Generate some data and find the ecdf function.
x <- rlnorm(1e5, 5)
ecdf_x <- ecdf(x)
Generate a vector at regular intervals over the range of x. (EDIT: in this case you want the points evenly spaced on a log scale; if you have negative values, sample over a linear scale instead.)
xx <- seq(min(x), max(x), length.out = 1e4)
#or
log_x <- log(x)
xx <- exp(seq(min(log_x), max(log_x), length.out = 1e3))
Create data with x and y coordinates for plot.
dfr <- data.frame(
x = xx,
ecdf = ecdf_x(xx),
ccdf = 1 - ecdf_x(xx)
)
Draw plot.
p_ccdf <- ggplot(dfr, aes(x, ccdf)) +
geom_line() +
scale_x_log10()
p_ccdf
(Also take a look at aes(x, ecdf).)
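For comparison, the same CCDF can be drawn in base graphics by evaluating 1 - ECDF at the sorted data values. This is a minimal sketch with simulated data; the names xs and ccdf_vals are introduced here for illustration. The last point is dropped because the CCDF is exactly 0 at the maximum, which cannot be shown on a log axis.

```r
# Simulated durations (assumption: positive, heavy-tailed data)
set.seed(42)
x <- rlnorm(1e4, 5)

# CCDF evaluated at the sorted observations: P(X > x)
xs <- sort(x)
ccdf_vals <- 1 - ecdf(x)(xs)

# Drop the final point (CCDF = 0) so both axes can be logarithmic
keep <- seq_len(length(xs) - 1)
plot(xs[keep], ccdf_vals[keep], log = "xy", type = "s",
     xlab = "x", ylab = "P(X > x)")
```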
I used ggplot to get the desired CCDF plot of my data, as shown below:
ecdf_x <- ecdf(x)
dfr <- data.frame( x = x,
ecdf = ecdf_x(x),
ccdf = 1 - ecdf_x(x) )
p_ccdf <- ggplot(dfr, aes(x, ccdf)) + geom_line() + scale_x_log10()
p_ccdf
Sorry for posting it so late.
Thank you all!
