I am trying to make a flexible log10 grid in ggplot2. The idea is that between 0.1 and 1 the gridline breaks are 0.1 apart, between 1 and 10 they are 1 apart, between 10 and 100 they are 10 apart, etc.
This way the grid lines repeat the same pattern as many times as required by a data vector (CAfails) that I supply. This is what I came up with after a lot of tweaking:
CAfails<-data.frame(c(2.5,5.8,10.7,16.2,23,36.2,45.3,49.5,70.1,80.3,83.6,90))
LOG.as<-c(t((10^((floor(log10(min(CAfails)))-1):ceiling(log10(max(CAfails)))))%o%c(1:10)))
LOG.as<-LOG.as[-10*((floor(log10(min(CAfails))):ceiling(log10(max(CAfails))))+1)]
After which I pass it to ggplot2:
scale_x_log10(limits=c(1,10^(ceiling(log10(max(CAfails))))),breaks=LOG.as)
scale_y_log10(limits=c(0.1,10^(ceiling(log10(max(CAfails))))),breaks=LOG.as)
It works OK, but I was wondering whether there is a simpler and easier way to do this.
Here is a complete example:
CAfails<-data.frame(x=c(2.5,5.8,10.7,16.2,23,36.2,45.3,49.5,70.1,80.3,83.6,90))
LOG.as<-c(t((10^((floor(log10(min(CAfails)))-1):ceiling(log10(max(CAfails)))))%o%c(1:10)))
LOG.as<-LOG.as[-10*((floor(log10(min(CAfails))):ceiling(log10(max(CAfails))))+1)]
pdf <- data.frame(x = 1:nrow(CAfails), y = CAfails$x)
ggplot(data=pdf,aes(x,y)) + geom_point() +
scale_x_log10(limits=c(1,10^(ceiling(log10(max(CAfails))))),breaks=LOG.as) +
scale_y_log10(limits=c(0.1,10^(ceiling(log10(max(CAfails))))),breaks=LOG.as)
Which yields this (plot image omitted):
I think this works a bit better:
CAfails<-data.frame(x=c(2.5,5.8,10.7,16.2,23,36.2,45.3,49.5,70.1,80.3,83.6,90))
pdf <- data.frame(x = 1:nrow(CAfails), y = CAfails$x)
genbreaks <- function(x){
  # only works on positive vals
  minx <- min(x)
  maxx <- max(x)
  flminx <- floor(log10(minx))
  clmaxx <- ceiling(log10(maxx))   # base R has ceiling(), not ceil()
  rv <- c()
  xlo <- 10^flminx
  for (i in flminx:clmaxx) {
    rv <- c(rv, seq(xlo, xlo*10, xlo))
    xlo <- xlo*10
  }
  # drop the decade endpoints (10, 100, ...) that appear twice
  return(unique(rv))
}
ggplot(data=pdf,aes(x,y)) + geom_point(color="blue") +
scale_x_log10(breaks=genbreaks(pdf$x)) +
scale_y_log10(breaks=genbreaks(pdf$y))
Yielding (plot image omitted):
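For a quick check of what the function produces with the example data (values follow from the definition above; y runs from 2.5 to 90, so the decades 1, 10 and 100 are covered):
genbreaks(pdf$y)
## 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100 200 300 400 500 600 700 800 900 1000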
But there could be a standard way of doing it.
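For reference, one fairly standard idiom (a sketch, assuming the scales package; not part of the answer above) is to let scales supply the decade breaks and express the repeating 1-9 pattern as minor breaks:
library(ggplot2)
library(scales)
# the 1-9 pattern repeated across decades, used for minor gridlines
log_minor <- rep(1:9, 21) * 10^rep(-10:10, each = 9)
ggplot(data=pdf, aes(x, y)) + geom_point(color="blue") +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                minor_breaks = log_minor) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                minor_breaks = log_minor) +
  annotation_logticks()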
I would like to smartly round my results so that they sum up to the same total after rounding.
Can someone explain to me why this is different when I do it with map or lapply?
v <- c(
0.9472164,
71.5330771,
27.5197066)
smart.round <- function(x, digits = 0) {
  up <- 10 ^ digits
  x <- x * up
  y <- floor(x)
  # bump up the entries with the largest fractional parts until the
  # rounded values sum to the rounded total
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y / up
}
### works correctly
smart.round(v)
### lapply and map are wrong
lapply(v,smart.round)
map(v,smart.round)
(I think this is merely a comment, but I have not yet earned the right to add comments.)
lapply and purrr::map process your input sequentially. In your example, lapply takes the first value of v, calls smart.round on it, then moves on to the second value of v, and so on.
In total, smart.round is called three times, each time without any knowledge of the other two values in v.
I'm not entirely sure why you are trying to use lapply here; if this is part of a more complex situation, you might want to expand your question.
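To illustrate (a sketch using the names from the question), the lapply call is equivalent to three independent scalar calls, so the sum-preserving correction never sees the whole vector:
# what lapply(v, smart.round) effectively computes:
list(smart.round(v[1]), smart.round(v[2]), smart.round(v[3]))
# each call degenerates to ordinary rounding of a single number, whereas
# smart.round(v) can distribute the rounding error across all of v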
I have written my own solution. It is definitely a bit cumbersome, but it works :) My initial goal was just to input a data frame and output the rounded data frame.
The whole example is here:
v <- data.frame(a = c(0.9472164,
71.5330771,
27.5197066),
b = c(4.6472164,
5.6330771,
27.1197066))
smart.round <- function(x, digits = 0) {
  up <- 10 ^ digits
  x <- x * up
  y <- floor(x)
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y / up
}
rounding_function <- function(input_df) {
  output_df <- data.frame(matrix(ncol = ncol(input_df), nrow = nrow(input_df)))
  for (i in 1:nrow(input_df)) {
    # round each row so that the row total is preserved
    output_df[i, ] <- smart.round(as.numeric(input_df[i, ]))
  }
  colnames(output_df) <- colnames(input_df)
  return(output_df)
}
v_rounded <- rounding_function(v)
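For what it's worth, a more compact row-wise version (a sketch, not the author's code) can be written with apply(); for a purely numeric data frame it should give the same result:
# round each row with smart.round and rebuild the data frame;
# t() is needed because apply() returns one column per input row
v_rounded2 <- as.data.frame(t(apply(v, 1, smart.round)))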
Plotting graphs within a loop gives different results than plotting them directly after the loop
# Initialize
Input <- list(c(3,3,3,3),c(1,1,1,1))
y <- c()
x <- c()
plotlist <- c()
Answer <- c()
# create helper grid
x.grid = c(1:4)
y.grid = c(1:4)
helpergrid <- expand.grid(xgrid=x.grid, ygrid=y.grid )
#- Loop Lists -
for (m in c(1,2))
{
  # # Loop within each list
  # for(j in 1:4)
  # {
  #   y[j] <- Input[[m]][j]
  #   x[j] <- j
  # }
  y[1] <- Input[[m]][1]
  x[1] <- 1
  y[2] <- Input[[m]][2]
  x[2] <- 2
  y[3] <- Input[[m]][3]
  x[3] <- 3
  y[4] <- Input[[m]][4]
  x[4] <- 4
  Points <- data.frame(x, y)
  # Example Plot
  plot = ggplot() +
    labs(title = paste("Loop m = ", m)) +
    labs(subtitle = paste("y-values = ", Points$y)) +
    geom_tile(data = helpergrid, aes(x = xgrid, y = ygrid, fill = 1), colour = "grey20") +
    geom_point(data = Points, aes(x = Points$x, y = Points$y),
               stroke = 3, size = 5, shape = 1, color = "white") +
    theme_minimal()
  # Plot to plotlist
  plotlist[[m]] <- plot
  # --- Plot plotlist within loop ---
  plot(plotlist[[m]])
}
# --- Plot plotlist outside of loop ---
plot(plotlist[[1]])
plot(plotlist[[2]])
Here is an image of the results (plot image omitted).
As aaumai points out, there is a nested loop that might cause the issue for ggplot using static values; however, the resulting plot does show the correct y-values (y=3) explicitly in the subtitle, while the geom_point layer uses the wrong values (y=1)...
It makes absolutely (!) no sense to me. I am relatively new to R and have been trying to debug this for hours, so I hope someone can help me with this!!
EDIT: I manually removed the nested loop and updated the example code, but the problem still persists :(
The problem arises due to your use of Points$x within aes. The tl;dr is that you should basically never use $, [, or [[ within aes. See the answer here from baptiste.
library(ggplot2)
# Initialize
Input <- list(c(3,3,3,3),c(1,1,1,1))
y <- c()
x <- c()
plotlist <- c()
Answer <- c()
# create helper grid
x.grid = c(1:4)
y.grid = c(1:4)
helpergrid <- expand.grid(xgrid=x.grid, ygrid=y.grid )
#- Loop Lists -
for (m in c(1,2)) {
  y[1] <- Input[[m]][1]
  x[1] <- 1
  y[2] <- Input[[m]][2]
  x[2] <- 2
  y[3] <- Input[[m]][3]
  x[3] <- 3
  y[4] <- Input[[m]][4]
  x[4] <- 4
  Points <- data.frame(x, y)
  # Example Plot
  plot = ggplot() +
    labs(title = paste("Loop m = ", m)) +
    labs(subtitle = paste("y-values = ", force(Points$y))) +
    geom_tile(data = helpergrid, aes(x = xgrid, y = ygrid, fill = 1), colour = "grey20") +
    geom_point(data = Points, aes(x = x, y = y),
               stroke = 3, size = 5, shape = 1, color = "white") +
    theme_minimal()
  # Plot to plotlist
  plotlist[[m]] <- plot
  # --- Plot plotlist within loop ---
  print(plotlist[[m]])
}
# --- Plot plotlist outside of loop ---
print(plotlist[[1]])
print(plotlist[[2]])
I believe the reason this happens is lazy evaluation. The data passed into geom_tile/geom_point gets stored, but when the plot is printed, it grabs Points$x from the current environment. During the loop, this points to the current state of the Points data frame, which is the desired state. After the loop has finished, only the second version of Points exists, so when the reference from aes is evaluated, it grabs the x values from Points$x as it exists after the second iteration of the loop. Hope this is clear; feel free to ask further if not.
To clarify: if you remove Points$ and just refer to x within aes, the values are taken from the data frame as it was passed into the data argument of the geom calls.
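A minimal standalone demo of the gotcha (a sketch, not from the original question): build two plots, mutate the data frame, then print both.
library(ggplot2)
Pts <- data.frame(x = 1:4, y = c(3, 3, 3, 3))
p_bad  <- ggplot(Pts, aes(x = x, y = Pts$y)) + geom_point()  # $ inside aes
p_good <- ggplot(Pts, aes(x = x, y = y)) + geom_point()      # bare column name
Pts$y <- c(1, 1, 1, 1)  # mutate after the plots are built
print(p_bad)   # draws y = 1: Pts$y is looked up in the environment at print time
print(p_good)  # draws y = 3: the data frame was stored in the plot object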
If I'm not mistaken, this is because you have a loop within the loop.
The plot within the loop returns plots for the changing y values in the Points data (from 1 to 4), whereas the plot outside only plots the static values.
I'm having some issues when using persp3d() function in R. Here is my code:
library(rgl)
vero = function(mu, sigma, datos)
{
  n = length(datos)
  media = mean(datos)
  S2 = sd(datos)^2
  # log-likelihood of a normal sample (sigma plays the role of the variance),
  # returned on the likelihood scale
  lvero = (-n/2)*log(2*pi*sigma) - (n/(2*sigma))*S2 - (n/(2*sigma))*(media-mu)^2
  return(exp(lvero))
}
nbebes=rnorm(20, mean=75, sd=2.5)
mu.seq <- seq(60,100,length=2000)
sigma.seq <- seq(1,20,length=2000)
f <- Vectorize(vero,vectorize.args=c("mu","sigma"))
z <- outer(mu.seq,sigma.seq,f,datos=nbebes)
zlim <- range(z[!is.na(z)])
palette <- rev(rainbow(20))
colors <- palette[19*(z-zlim[1])/diff(zlim) + 1]
persp3d(mu.seq,sigma.seq,z,col=colors)
And the output of the code is this graph (image omitted):
What am I doing wrong? Sometimes I get another result, which is a rainbow-colored surface like this one:
But as you can see, it is not completely colored, and I don't know what to change or rewrite to get the proper result.
I can't say exactly why this happens, but it seems to have to do with the limits of your z-axis. When I rescale z to z2 <- z / max(z), it ranges between 0 and 1 and plots well. This might be an issue with rgl. Here's an example:
nbebes=rnorm(20, mean=75, sd=2.5)
mu.seq <- seq(60,100,length=500)
sigma.seq <- seq(1,20,length=500)
f <- Vectorize(vero,vectorize.args=c("mu","sigma"))
z <- outer(mu.seq,sigma.seq,f,datos=nbebes)
z2 <- z/max(z)
colors <- rev(rainbow(20))
zlim <- range(z2, na.rm=TRUE)  # define the limits for the rescaled z
breaks <- seq(zlim[1], zlim[2], length.out=(length(colors)+1))
CUT <- cut(z2, breaks=breaks, include.lowest = TRUE)
colorlevels <- colors[match(CUT, levels(CUT))] # assign colors to heights for each point
persp3d(mu.seq,sigma.seq,z2,color=colorlevels)
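A slightly shorter equivalent for the color lookup (a minor variation, relying on cut() returning integer codes when labels=FALSE):
colorlevels <- colors[cut(z2, breaks=breaks, include.lowest=TRUE, labels=FALSE)]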
I want to colour the area under a curve. The area with y > 0 should be red, the area with y < 0 should be green.
x <- c(1:4)
y <- c(0,1,-1,2,rep(0,4))
plot(y[1:4],type="l")
abline(h=0)
Using ifelse() does not work:
polygon(c(x,rev(x)),y,col=ifelse(y>0,"red","green"))
What I achieved so far is the following:
polygon(c(x,rev(x)),y,col="green")
polygon(c(x,rev(x)),ifelse(y>0,y,0),col="red")
But then the red area is too large. Do you have any ideas on how to get the desired result?
If you want two different colors, you need two different polygons. You can either call polygon multiple times, or you can add NA values in your x and y vectors to indicate a new polygon. R will not automatically calculate the intersection for you. You must do that yourself. Here's how you could draw that with different colors.
x <- c(1,2,2.5,NA,2.5,3,4)
y <- c(0,1,0,NA,0,-1,0)
#calculate color based on most extreme y value
g <- cumsum(is.na(x))
gc <- ifelse(tapply(y, g,
function(x) x[which.max(abs(x))])>0,
"red","green")
plot(c(1, 4),c(-1,1), type = "n")
polygon(x, y, col = gc)
abline(h=0)
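For computing such intersection points in general (a sketch, assuming straight line segments between points), linear interpolation gives the crossing directly:
# x-coordinate where the segment (x1,y1)-(x2,y2) crosses y = 0;
# assumes y1 and y2 have opposite signs
zero_cross <- function(x1, y1, x2, y2) x1 + (0 - y1) * (x2 - x1) / (y2 - y1)
zero_cross(2, 1, 3, -1)  # 2.5, the crossing used by hand above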
In the more general case, it might not be as easy to split a polygon into different regions. There seems to be some support for this type of operation in GIS packages, where this type of thing is more common. However, I've put together a somewhat general case that may work for simple polygons.
First, I define a closure that will define a cutting line. The function will take a slope and y-intercept for a line and will return the functions we need to cut a polygon.
getSplitLine <- function(m=1, b=0) {
  force(m); force(b)
  classify <- function(x, y) {
    y >= m*x + b
  }
  intercepts <- function(x, y, class=classify(x, y)) {
    w <- which(diff(class) != 0)
    m2 <- (y[w+1]-y[w])/(x[w+1]-x[w])
    b2 <- y[w] - m2*x[w]
    ix <- (b2-b)/(m-m2)
    iy <- ix*m + b
    data.frame(x=ix, y=iy, idx=w+.5, dir=((rank(ix, ties="first")+1) %/% 2) %% 2 + 1)
  }
  plot <- function(...) {
    abline(b, m, ...)
  }
  list(
    intercepts=intercepts,
    classify=classify,
    plot=plot
  )
}
Now we will define a function to actually split a polygon using the splitter we've just defined.
splitPolygon <- function(x, y, splitter) {
  addnullrow <- function(x) if (!all(is.na(x[nrow(x),]))) rbind(x, NA) else x
  rollup <- function(x, i=1) rbind(x[(i+1):nrow(x),], x[1:i,])
  idx <- cumsum(is.na(x) | is.na(y))
  polys <- split(data.frame(x=x, y=y)[!is.na(x),], idx[!is.na(x)])
  r <- lapply(polys, function(P) {
    x <- P$x; y <- P$y
    side <- splitter$classify(x, y)
    if (side[1] != side[length(side)]) {
      ints <- splitter$intercepts(c(x, x[1]), c(y, y[1]), c(side, side[1]))
    } else {
      ints <- splitter$intercepts(x, y, side)
    }
    sideps <- lapply(unique(side), function(ss) {
      pts <- data.frame(x=x[side==ss], y=y[side==ss],
                        idx=seq_along(x)[side==ss], dir=0)
      mm <- rbind(pts, ints)
      mm <- mm[order(mm$idx), ]
      br <- cumsum(mm$dir!=0 & c(0,head(mm$dir,-1))!=0 &
                   c(0,diff(mm$idx))>1)
      if (length(unique(br))>1) {
        mm <- rollup(mm, sum(br==br[1]))
      }
      br <- cumsum(c(FALSE, abs(diff(mm$dir*mm$dir))==3))
      do.call(rbind, lapply(split(mm, br), addnullrow))
    })
    pss <- rep(unique(side), sapply(sideps, nrow))
    ps <- do.call(rbind, lapply(sideps, addnullrow))[,c("x","y")]
    attr(ps, "side") <- pss
    ps
  })
  pss <- unname(unlist(lapply(r, attr, "side")))
  src <- rep(seq_along(r), sapply(r, nrow))
  r <- do.call(rbind, r)
  attr(r, "source") <- src
  attr(r, "side") <- pss
  r
}
The input is just the x and y values as you would pass them to polygon, along with the cutter. It will return a data.frame with x and y values that can be used with polygon.
For example
x <- c(1,2,2.5,NA,2.5,3,4)
y <- c(1,-2,2,NA,-1,2,-2)
sl<-getSplitLine(0,0)
plot(range(x, na.rm=T),range(y, na.rm=T), type = "n")
p <- splitPolygon(x,y,sl)
g <- cumsum(c(F, is.na(head(p$y,-1))))
gc <- ifelse(attr(p,"side")[is.na(p$y)],
"red","green")
polygon(p, col=gc)
sl$plot(lty=2, col="grey")
This should work for simple concave polygons as well, with sloped lines. Here's another example:
x <- c(1,2,3,4,5,4,3,2)
y <- c(-2,2,1,2,-2,.5,-.5,.5)
sl<-getSplitLine(.5,-1.25)
plot(range(x, na.rm=T),range(y, na.rm=T), type = "n")
p <- splitPolygon(x,y,sl)
g <- cumsum(c(F, is.na(head(p$y,-1))))
gc <- ifelse(attr(p,"side")[is.na(p$y)],
"red","green")
polygon(p, col=gc)
sl$plot(lty=2, col="grey")
Right now things can get a bit messy when a vertex of the polygon falls directly on the splitting line. I may try to correct that in the future.
A faster, but less exact, solution is to split the data frame into a list according to a grouping variable (e.g. above = red and below = blue). This is a pretty nice workaround for rather big datasets (I would say > 100 elements). For smaller chunks some discontinuity may be visible:
x <- 1:100
y1 <- sin(1:100/10)*0.8
y2 <- sin(1:100/10)*1.2
plot(x, y2, type='l')
lines(x, y1, col='red')
df <- data.frame(x=x, y1=y1, y2=y2)
df$pos_neg <- ifelse(df$y2-df$y1>0,1,-1) # above (1) or below (-1) average
# create the number for chunks to be split into lists:
df$chunk <- c(1,cumsum(abs(diff(df$pos_neg)))/2+1) # first element needs to be added
df$colors <- ifelse(df$pos_neg>0, "red","blue") # colors to be used for filling the polygons
# create lists to be plotted:
l <- split(df, df$chunk) # we should get 4 sub-lists
lapply(l, function(x) polygon(c(x$x,rev(x$x)),c(x$y2,rev(x$y1)),col=x$colors))
As I said, for smaller datasets some discontinuity may be visible if sharp changes occur between positive and negative areas, but if a horizontal line distinguishes between the two, or more elements are plotted, this effect is negligible:
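If the discontinuities are a problem, one mitigation (an assumption on my part, not part of the answer above) is to densify the curves by interpolation before computing the chunks, so the color changes land closer to the true crossings:
# interpolate both curves onto a fine common grid before splitting
xi <- seq(min(df$x), max(df$x), length.out = 1000)
df_dense <- data.frame(x  = xi,
                       y1 = approx(df$x, df$y1, xi)$y,
                       y2 = approx(df$x, df$y2, xi)$y)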
A common display of spectroscopic data (intensity vs wavelength) is used below to compare the position of peaks in the data across multiple spectra. Assuming they all share a baseline at 0, it is convenient to offset the multiple lines vertically by a constant spacing, to avoid the distraction of overlapping lines.
Thus the overlapping spectra become a vertically offset stack (before/after images omitted).
I'm looking for a better strategy to perform this vertical shift automatically, starting from data in long format. Here is a minimal example.
# fake data (5 similar-looking spectra)
spec <- function(){
  x <- runif(100, 0, 100)
  data.frame(x=x, y=jitter(dnorm(x, mean=jitter(50), sd=jitter(5)), amount=0.01))
}
require(plyr)
all <- ldply(1:5, function(ii) data.frame(spec(), id=ii))
My current strategy is as follows:
1. Convert the spectra from long format to wide format. This involves interpolation, as the spectra do not necessarily have identical x-axis values.
2. Find the minimum offset between spectra that avoids overlap between neighbours.
3. Shift the spectra by multiples of this distance.
4. Melt back to long format.
I implemented this using plyr:
# function that evenly spaces the spectra to avoid overlap
# d is in long format, s is a scaling factor for the vertical shift
require(plyr); require(ggplot2)
spread_plot <- function(d, s=1){
  ranges <- ddply(d, "id", with, each(min,max,length)(x))
  common_x <- seq(max(ranges$min), min(ranges$max), length=max(ranges$length))
  new_y <- dlply(d, "id", function(x) approx(x$x, x$y, common_x)$y)
  mat <- do.call(cbind, new_y)
  test <- apply(mat, 1, diff)
  shift <- max(-test[test < 0])
  origins <- s*seq(0, by=shift, length=ncol(mat))
  for(ii in seq_along(origins)){
    current <- unique(d[["id"]])[ii]
    d[d[["id"]] == current, "y"] <-
      d[d[["id"]] == current, "y"] + origins[ii]
  }
  d
}
test <- spread_plot(all)
ggplot(test, aes(x, y, colour=id, group=id))+
geom_line() + guides(colour=guide_legend())
This strategy suffers from a few shortcomings:
- it is slow;
- the offset is not a pretty number; I do not know how to automatically round it well so that spectra are offset by e.g. 0.02, or 50, etc., depending on the range of the intensities. pretty(origins) is problematic in that it can return a different number of values.
I feel I'm missing a simpler solution, perhaps working directly with the original data in long format.
Interesting question.
Here's a possibility, offered without detailed comment, except to point out that it:
- should be very fast, due to a combination of avoiding plyr, using data.table, and operating on the data in its original long format;
- uses pretty() to pick a pretty offset;
- like your code, is not guaranteed to produce no intersections of lines, since overlap can happen between the lattice of points formed by common_x.
Here's the code:
## Setup
library(data.table)
library(plyr)
library(ggplot2)
spec <- function(){
  x <- runif(100, 0, 100)
  data.frame(x=x, y=jitter(dnorm(x, mean=jitter(50), sd=jitter(5)), amount=0.01))
}
all <- ldply(1:5, function(ii) data.frame(spec(), id=ii))
## Function that uses data.table rather than plyr to compute and add offsets
spread_plot <- function(d, s=1){
  d <- data.table(d, key="id")
  ranges <- d[, list(min=min(x), max=max(x), length=length(x)), by="id"]
  common_x <- seq(max(ranges$min), min(ranges$max), length=max(ranges$length))
  new_y <- d[, list(y=approx(x, y, common_x)$y, N=seq_along(common_x)),
             by="id"]
  shift <- max(new_y[, max(abs(diff(y))), by="N"][[2]])
  shift <- pretty(c(0, shift), n=0)[2]
  origins <- s*seq(0, by=shift, length=length(unique(d$id)))
  d[, y := (y + origins[.GRP]), by="id"]
  d
}
## Try it out
test <- spread_plot(all)
ggplot(test, aes(x, y, colour=id, group=id))+
geom_line() + guides(colour=guide_legend())
I still think you could rely on some assumptions about typical data from spectroscopy. Usually, the x values are sorted, their number is equal for all spectra, and they are quite similar:
# new fake data (5 similar-looking spectra)
spec <- function(){
  x <- jitter(seq(0,100,1),0.1)
  data.frame(x=x, y=jitter(dnorm(x, mean=jitter(50), sd=jitter(5)), amount=0.01))
}
require(plyr)
all <- ldply(1:5, function(ii) data.frame(spec(), id=ii))
If these assumptions are valid, you could treat the spectra as having identical x values:
library(ggplot2)
spread_plot <- function(d, s=0.05) {
  # add some checks here, e.g., for equal length
  d <- d[order(d$x),]
  d$id <- factor(d$id)
  l <- levels(d$id)
  pretty_offset <- pretty(s*min(tapply(d$y, d$id, function(x) abs(diff(range(x))))))[2]
  for (i in seq_len(length(l)-1)+1) {
    mean_delta_y <- mean(d[d$id == l[i], "y"] - d[d$id == l[i-1], "y"])
    d[d$id == l[i], "y"] <- d[d$id == l[i], "y"] - mean_delta_y
    min_delta_y <- abs(1.05 * min(d[d$id == l[i], "y"] - d[d$id == l[i-1], "y"]))
    pretty_delta_y <- max(min_delta_y, pretty_offset)
    d[d$id == l[i], "y"] <- d[d$id == l[i], "y"] + pretty_delta_y
  }
  p <- ggplot(d, aes(x=x, y=y, col=id)) + geom_line()
  print(p)
}
spread_plot(all, s=0)
spread_plot(all, s=0.5)
As suggested by hadley, the for loop can be avoided very simply (this works because id indexes directly into origins, whether it is an integer or a factor):
d$y <- d$y + origins[d$id]
Full code:
spread_plot <- function(d, s=1){
  ranges <- ddply(d, "id", with, each(min,max,length)(x))
  common_x <- seq(max(ranges$min), min(ranges$max), length=max(ranges$length))
  new_y <- dlply(d, "id", function(x) approx(x$x, x$y, common_x)$y)
  mat <- do.call(cbind, new_y)
  test <- apply(mat, 1, diff)
  shift <- max(-test[test < 0])
  origins <- s*seq(0, by=shift, length=ncol(mat))
  d$y <- d$y + origins[d$id]
  d
}
test <- spread_plot(all)
ggplot(test, aes(x, y, colour=id, group=id))+
geom_line() + guides(colour=guide_legend())