I have searched hard for a solution to this but haven't been successful. I have a zoo plot with a standard time-series x-axis, and I simply want to highlight regions of the chart where the value of a series falls below a certain threshold. Specifically, I want to highlight where the p-value of the intercept is significant (the intercept itself is what is plotted). This occurs at various intervals across the time series, rather than over a single range x <= y.
I have tried help(xblocks) and example(xblocks), but I haven't been able to get the highlighted region to show for the dates where I know it should apply.
Does this solve your problem?
library(zoo)

# three semi-transparent highlight colours
# (named cols rather than rgb, to avoid masking base::rgb)
cols <- hcl(c(0, 0, 260), c = c(100, 0, 100), l = c(50, 90, 50), alpha = 0.3)

set.seed(1234)
x.Date <- as.Date("2015-02-01") + c(1, 3, 6, 7, 9, 10, 12, 14, 18, 20) - 1
y <- zoo(rnorm(length(x.Date)), x.Date)
pval <- zoo(runif(length(x.Date), 0, 0.2), x.Date)

plot(y, col = 4)
# shade every interval where the p-value series is at or below 0.05
xblocks(pval <= 0.05, col = cols[1])
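The question also mentions highlighting where the series itself drops below a threshold; the same call works directly on y. A minimal variation (the threshold of 0 is an arbitrary choice for illustration):
xblocks(y <= 0, col = cols[3])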
In an experiment, blood pressure is measured at several time points. Blood pressure rises and declines during the experiment. I need to plot the blood pressure reaction (the easy part) and find the time points (x values) at which blood pressure has increased by a given factor (the tricky part); in the example below the factor is ten. Can this information be retrieved from the ggplot?
Here is an example:
library(ggplot2)

# Generate data
time <- c(10, 60, 90, 200, 260, 300, 700)
value <- c(1, 6, 8, 40, 50, 60, 70)
df <- data.frame(time, value)

# The first element of "value" is the baseline observation.
# When that baseline has increased ten-fold, the value equals 10.
# At what time point does the curve cross that level?
ggplot(data = df, aes(x = time, y = value)) +
  geom_line() +
  geom_hline(yintercept = 10, colour = "red") +
  annotate("text", hjust = 0, x = 170, y = 15,
           label = "I need to find the x value at the intersection")
Any solutions?
No, this can't be done with ggplot2. However, it's easy to do:
v0 <- 10
f1 <- approxfun(df$time, df$value)
# We use numeric optimization here, but an analytical solution is of course
# possible (though a bit more work). This finds only one intersection; more
# work is required if there is more than one.
optimize(function(t0) abs(f1(t0) - v0), interval = range(df$time))
#$minimum
#[1] 96.87501
#
#$objective
#[1] 3.080161e-06
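As the comment says, an analytical solution is also possible. A minimal sketch, reusing df and v0 from above and assuming the data rise through v0 exactly once: find the segment that brackets the threshold and interpolate linearly.
# index of the last observation at or below the threshold
i <- max(which(df$value <= v0))
# linear interpolation within the bracketing segment
df$time[i] + (v0 - df$value[i]) * (df$time[i + 1] - df$time[i]) / (df$value[i + 1] - df$value[i])
#[1] 96.875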
If your data are monotonic, so that value determines time uniquely, it gets even simpler: interpolate in the other direction.
f2 <- approxfun(df$value, df$time)
f2(v0)
#[1] 96.875
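If there can be several crossings, one hedged sketch (again reusing df, f1 and v0 from above) is to scan consecutive segments for a sign change and root-find within each:
g <- function(t) f1(t) - v0
# every segment whose endpoints straddle the threshold contains a crossing
crossings <- unlist(lapply(seq_len(nrow(df) - 1), function(i) {
  if (g(df$time[i]) * g(df$time[i + 1]) <= 0)
    uniroot(g, c(df$time[i], df$time[i + 1]))$root
}))
crossings # approximately 96.875, up to uniroot's default tolerance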
The Problem
I have data that I would like to plot as a line graph with a log scale on the y-axis using ggplot2. Unfortunately, some of my values go all the way down to zero. The data represent relative occurrences of a feature as a function of some parameters. A value of zero occurs when that feature is not observed in a sample, which means that it occurs very rarely, or indeed never. These zero values cause a problem in the log plot.
The following code illustrates the problem on a simplified data set. In reality the data set consists of more points, so the curves look smoother, and of more values for the parameter p.
library(ggplot2)

dat <- data.frame(x = rep(c(0, 1, 2, 3), 2),
                  y = c(1e0, 1e-1, 1e-4, 0,
                        1e-1, 1e-3, 0, 0),
                  p = c(rep('a', 4), rep('b', 4)))

qplot(data = dat, x = x, y = y, colour = p, log = "y", geom = c("line", "point"))
Given the data above, we would expect two lines, the first one should have three finite points on a log plot, the second one should have only two finite points on a log plot.
However, as you can see, this produces a very misleading plot: it looks like the blue and the red lines both converge to a value between 1e-4 and 1e-3. The reason is that log(0) gives -Inf, which ggplot simply places at the bottom of the axis.
My Question
What's the best way to deal with this in R with ggplot2? By best I mean in terms of efficiency and of being idiomatic R (I'm fairly new to R).
The plot should indicate that these curves go down to "very small" after x=2 (red), or x=1 (blue), respectively. Ideally, with a vertical line downwards from the last finite point. What I mean by that is demonstrated in the following.
My Attempt
Here I'll describe what I've come up with. However, given that I'm fairly new to R, I suspect that there might be a much better way.
library(ggplot2)
library(scales)

dat <- data.frame(x = rep(c(0, 1, 2, 3), 2),
                  y = c(1e0, 1e-1, 1e-4, 0,
                        1e-1, 1e-3, 0, 0),
                  p = c(rep('a', 4), rep('b', 4)))
Same data as above.
Now I go through each unique parameter p, find the x coordinate of the last finite point, and assign it to the x coordinate of every point where y is zero. This produces the vertical line.
for (p in unique(dat$p)) {
  # move each zero point to the x position of the last finite point
  dat$x[dat$p == p & dat$y == 0] <- dat$x[head(which(dat$p == p & dat$y == 0), 1) - 1]
}
At this point the plot looks as follows.
The vertical lines are there. However, there are also points. These are misleading as they indicate that there was an actual data point there, which is not true.
To remove the points I duplicate the y data (which seems wasteful), call the copy yp, and replace zero by NA. Then I use that new yp as the y aesthetic for geom_point.
dat$yp <- dat$y
dat$yp[dat$y == 0] <- NA

ggplot(dat, aes(x = x, y = y, colour = p)) +
  geom_line() +
  geom_point(aes(y = yp)) +
  scale_y_continuous(trans = log10_trans(),
                     breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x)))
Where I've used ggplot instead of qplot so that I can give different aesthetics to geom_line and geom_point.
Finally, the plot looks like this.
What is the right way to do this?
For me, I use
+ scale_y_continuous(trans=scales::pseudo_log_trans(base = 10))
If you're using ggplot, you can use scales::pseudo_log_trans() as your transformation object. It maps 0 to 0 instead of -Inf.
From the docs (https://scales.r-lib.org/reference/pseudo_log_trans.html):
A transformation mapping numbers to a signed logarithmic scale with a smooth transition to linear scale around 0.
pseudo_log_trans(sigma = 1, base = exp(1))
For example, my scale expression looks like this:
+ scale_fill_gradient(name = "n occurrences", trans="pseudo_log")
Unconfirmed, but you probably need to include the scales library:
require("scales")
The simplest way would be to add a small value to each of the numbers. Example:
library(ggplot2)
library(dplyr)

df <- mutate(df, log_var = log(var + 0.01))
ggplot(df, aes(x = log_var)) + geom_histogram()
I am a new R user and am having trouble graphing some data in a bar plot. Sorry in advance if this is really easy to do, and I just can’t figure it out.
I have six sets of data: 3 data sets for car #1 at 1, 5, and 10 years, and 3 data sets for car #2 at 1, 5, and 10 years. The measurements for each car at each age are (1) the total number of dents on the car's exterior and (2) the number of dents that remove paint. I want to make a bar plot with 6 bars, one per car and age, where the bar height is the number of dents that remove paint, with standard deviation bars.
Here’s what I’ve been trying so far (only 2 data sets included):
car1yr1 = c(rep(0, 101), rep(1, 9)) #car has 9 dents that remove paint
car1yr5 = c(rep(0, 131), rep(1, 19)) #car has 19 dents that remove paint
sd1 = sd(car1yr1)
sd2 = sd(car1yr5)
stdv = c(sd1, sd2)
car1yr1 = car1yr1[1:150]
dentsCar1 = data.frame("Car1Yr1" = car1yr1, "Car1Yr5" = car1yr5)
barplot(as.matrix(dentsCar1, ylim = c(0, 50), beside = TRUE))
I’ve found an example of error bars, arrows(bar, x, bar, x + -(stdv), length = 0.15, angle = 90), but I can’t get it to work with my numbers. Also, in this example the y-axis stops at 15, but the bar for Car1Yr5 goes up to 19. How can I extend the y-axis to 20 or 30?
Again, I’m new at R and any help would be greatly appreciated. I’ve been trying to solve this on my own off and on for about 2 weeks. Thanks.
I am a little confused by your data... I am assuming from your example that car 1 at year 1 has 101 dents that did not remove paint and 9 that did, and car 1 at year 5 has 131 that did not and 19 that did.
Now, calculating the standard deviation of the number of dents does not make much sense to me: you are plotting count data, so there is no standard deviation to show, unless you have, say, many cars of the same model and you want to see the variability between cars.
The best thing to do would be to calculate the % of dents that removed paint by doing:
car1yr1 <- c(rep(0, 101), rep(1, 9))  # car has 9 dents that remove paint
car1yr5 <- c(rep(0, 131), rep(1, 19)) # car has 19 dents that remove paint

# The total number of observations is the total number of dents
total.dents.1 <- length(car1yr1)
total.dents.5 <- length(car1yr5)

# The dents that remove paint are marked with 1, the others with 0,
# so we can just sum the data to get the number of paint-removing dents
dents.paint.1 <- sum(car1yr1)
dents.paint.5 <- sum(car1yr5)
# Alternatively you can use
# dents.paint.1 <- length(which(car1yr1 == 1))

# Calculate the proportions
dents.paint.perc.1 <- dents.paint.1 / total.dents.1
dents.paint.perc.5 <- dents.paint.5 / total.dents.5

df <- data.frame(dents.paint.perc.1, dents.paint.perc.5)

# Plot the data.
# ylim specifies the limits of the y axis
# ylab defines the axis title
# las = 1 puts the labels on the y axis horizontally
# names.arg defines the labels on the x axis
barplot(as.matrix(df) * 100, ylim = c(0, 20),
        ylab = "% dents removing paint", las = 1,
        names.arg = c("Car 1 year 1", "Car 1 year 5"))
In general it would be much better to put all your data in a single list, so that you can use the *apply family of functions to perform repetitive operations on the whole dataset (see the sketch below). This gives you cleaner and more manageable code, and if you add more data it will automagically appear in the plot.
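A minimal sketch of that list-based approach, under the same reading of the data as above:
# all samples in one named list; the names become the bar labels
dents <- list(car1yr1 = c(rep(0, 101), rep(1, 9)),
              car1yr5 = c(rep(0, 131), rep(1, 19)))

# percentage of paint-removing dents per sample
perc <- sapply(dents, function(d) 100 * sum(d) / length(d))

barplot(perc, ylim = c(0, 20),
        ylab = "% dents removing paint", las = 1)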
I have made box-plots for the onset values of three different groups using the box-plot function in R like so:
boxplot(onset ~ group, data = pulse.dat, range = 0, col = "lightblue")
However, I want to see how the data look without the range, so I want to create a box plot without the whiskers. Any kind of graph would do, as long as it displays the median and the 25th and 75th percentiles for each of the 3 groups.
Does anyone know how I can do this in R?
These are covered under the boxplot pars (see ?bxp):
d <- rnorm(100, mean = 100, sd = 10)
boxplot(d, whisklty = 0, staplelty = 0)
whisklty = 0 gets rid of the whisker lines; staplelty = 0 gets rid of the staples at their ends.
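A hedged sketch applying this to grouped data like the question's (pulse.dat is not shown, so a stand-in is simulated here):
# simulated stand-in for the question's pulse.dat
set.seed(1)
pulse.dat <- data.frame(onset = rnorm(90, mean = 100, sd = 10),
                        group = rep(c("A", "B", "C"), each = 30))

# boxes show the median and quartiles only; whiskers and staples are suppressed
boxplot(onset ~ group, data = pulse.dat, range = 0,
        whisklty = 0, staplelty = 0, col = "lightblue")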
This is related to another question: Plot weighted frequency matrix.
I have this graphic (produced by the code below in R):
#Set the number of bets and number of trials
numbet <- 36
numtri <- 1000

#Fill a matrix where the rows are the cumulative bets and the columns are the trials
xcum <- matrix(NA, nrow = numbet, ncol = numtri)
for (i in 1:numtri) {
  x <- sample(c(0, 1), numbet, prob = c(5/6, 1/6), replace = TRUE)
  xcum[, i] <- cumsum(x) / (1:numbet)
}

#Plot the trials as transparent lines so you can see the build-up
matplot(xcum, type = "l", xlab = "Number of Trials", ylab = "Relative Frequency",
        main = "", col = rgb(0.01, 0.01, 0.01, 0.02), las = 1)
I very much like the way this plot builds up and shows the more frequent paths as darker than the rarer ones, but it is not clear enough for a print presentation. What I would like is some kind of hexbin or heatmap of the numbers. Thinking about it, the plot will have to incorporate different-sized bins (see my back-of-the-envelope sketch):
My question then: If I simulate a million runs using the code above, how can I present it as a heatmap or hexbin, with the different sized bins as shown in the sketch?
To clarify: I do not want to rely on transparency to show the rarity of a trial passing through a part of the plot. Instead I would like to denote rarity with heat, showing a common pathway as hot (red) and a rare pathway as cold (blue). Also, I do not think the bins should all be the same size, because the first trial has only two places where the path can be while the last has many more; hence the changing bin scale. Essentially I am counting the number of times a path passes through each cell (2 cells in column 1, 3 in column 2, etc.) and then colouring the cell by that count.
UPDATE: I already had a plot similar to @Andrie's, but I am not sure it is much clearer than the top plot. It is the discontinuous nature of that graph that I do not like (and why I want some kind of heatmap). Because the first column has only two possible values, there should not be a huge visual gap between them, and so on. Hence why I envisaged different-sized bins. I still feel that a binned version would show a large number of samples better.
Update: This website outlines a procedure to plot a heatmap:
To create a density (heatmap) plot version of this we have to effectively enumerate the occurrence of these points at each discrete location in the image. This is done by setting a up a grid and counting the number of times a point coordinate "falls" into each of the individual pixel "bins" at every location in that grid.
Perhaps some of the information on that website can be combined with what we have already?
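A minimal sketch of that grid-count idea, applied to the xcum matrix from the first code block; the choice of 20 y-bins is arbitrary, and this uses fixed-size bins rather than the variable-width ones from the sketch:
# one point per cell of xcum; the row index of xcum is the bet number
bet <- factor(as.vector(row(xcum)), levels = 1:numbet)

# bin the relative frequencies into 20 equal-width intervals on [0, 1]
ybin <- cut(as.vector(xcum), breaks = seq(0, 1, length.out = 21),
            include.lowest = TRUE)

# count the paths in each (bet, bin) cell and draw the counts as a heatmap
counts <- unclass(table(bet, ybin))
image(1:numbet, seq(0.025, 0.975, by = 0.05), counts,
      col = rev(heat.colors(12)),
      xlab = "Number of Trials", ylab = "Relative Frequency")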
Update: I took some of what Andrie wrote, combined it with parts of this question, and arrived at the following, which is quite close to what I was conceiving:
numbet <- 20
numtri <- 100
prob <- 1/6

#Fill a matrix
xcum <- matrix(NA, nrow = numtri, ncol = numbet + 1)
for (i in 1:numtri) {
  x <- sample(c(0, 1), numbet, prob = c(prob, 1 - prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x) / cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep = ""))

mxcum <- reshape(data.frame(xcum), varying = 1 + 1:numbet,
                 idvar = "trial", v.names = "outcome",
                 direction = "long", timevar = "bet")
#from the other question
require(MASS)
dens <- kde2d(mxcum$bet, mxcum$outcome)
filled.contour(dens)
I don't quite understand what's going on, but this seems to be more like what I wanted to produce (obviously without the different sized bins).
Update: This is similar to the other plots here, but it is not quite right:
library(hexbin)
plot(hexbin(x = mxcum$bet, y = mxcum$outcome))
Last try, drawing the kde2d density from above directly:
image(dens)
This is pretty good. I would just like it to look like my hand-drawn sketch.
Edit
I think the following solution does what you ask for.
(Note that this is slow, especially the reshape step)
numbet <- 32
numtri <- 1e5
prob <- 5/6

#Fill a matrix
xcum <- matrix(NA, nrow = numtri, ncol = numbet + 1)
for (i in 1:numtri) {
  x <- sample(c(0, 1), numbet, prob = c(prob, 1 - prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x) / cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep = ""))

mxcum <- reshape(data.frame(xcum), varying = 1 + 1:numbet,
                 idvar = "trial", v.names = "outcome",
                 direction = "long", timevar = "bet")

library(plyr)

#Count how many paths pass through each (bet, outcome) point
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)

#Turn the counts into stacked rectangles, one column of rectangles per bet
mxcum3 <- ddply(mxcum2, .(bet), summarize,
                ymin = c(0, head(seq_along(V1) / length(V1), -1)),
                ymax = seq_along(V1) / length(V1),
                fill = (V1 / sum(V1)))
head(mxcum3)

library(ggplot2)
library(scales)

p <- ggplot(mxcum3, aes(xmin = bet - 0.5, xmax = bet + 0.5, ymin = ymin, ymax = ymax)) +
  geom_rect(aes(fill = fill), colour = "grey80") +
  scale_fill_gradient("Outcome", labels = percent, low = "red", high = "blue") +
  scale_y_continuous(labels = percent) +
  xlab("Bet")
print(p)
FYI: This is more of an extended comment than an answer.
To me, this new plot looks like a stacked bar where each bar's height is equal to the intersection points of the upper and lower line at the next trial.
The way that I would approach this is to treat "Trials" as a categorical variable. Then we can search each row of xcum for elements that are equal. If they are, then we can consider this to be a point of intersection whose minima also represents the multiple defining the height of our bars.
x <- t(xcum)
x <- x[duplicated(x),]
x[x==0] <- NA
Now that we have the multiples of the actual points, we need to figure out how to bin the information. That means making a decision about how many points will represent each grouping. Let's write some points out for posterity.
Trial 1 (2) = 1, 0.5 # multiple = 0.5
Trial 2 (3) = 1, 0.66, 0.33 # multiple = 0.33
Trial 3 (4) = 1, 0.75, 0.5, 0.25 # multiple = 0.25
Trial 4 (5) = 1, 0.8, 0.6, 0.4, 0.2 # multiple = 0.2
Trial 5 (6) = 1, 0.8333335, 0.6666668, 0.5000001, 0.3333334, 0.1666667
...
Trial 36 (35) = 1, 0.9722223, ..., 0.02777778 # multiple = 0.05555556 / 2
In other words, for each Trial there are n-1 points to plot. In your drawing you have 7 bins. So we need to figure out the multiples for each bin.
Let's cheat and divide the last two columns by two; we know from visual inspection that the minimum is lower than 0.05.
x[,35:36] <- x[,35:36] / 2
Then find the minimum of each column:
x <- apply(x, 2, function(x) min(x, na.rm=T))[-1] # Drop the 1
x <- x[c(1,2,3,4,8,17,35)] # I'm just guessing here by the "look" of your drawing.
The clearest way to do this is to create each bin separately. Obviously, this could be done automatically later (a sketch follows the plot below). Remembering that each bin repeats its frequency once for every point it contains:
bin1 <- data.frame(bin = rep("bin1",2), Frequency = rep(x[1],2))
bin2 <- data.frame(bin = rep("bin2",3), Frequency = rep(x[2],3))
bin3 <- data.frame(bin = rep("bin3",4), Frequency = rep(x[3],4))
bin4 <- data.frame(bin = rep("bin4",5), Frequency = rep(x[4],5))
bin5 <- data.frame(bin = rep("bin5",9), Frequency = rep(x[5],9))
bin6 <- data.frame(bin = rep("bin6",18), Frequency = rep(x[6],18))
bin7 <- data.frame(bin = rep("bin7",36), Frequency = rep(x[7],36))
df <- rbind(bin1,bin2,bin3,bin4,bin5,bin6,bin7)
ggplot(df, aes(bin, Frequency, color=Frequency)) + geom_bar(stat="identity", position="stack")
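As noted above, the bin construction can be automated. A minimal sketch, reusing x and the bin sizes guessed above:
#points per bin, matching the manual construction above
sizes <- c(2, 3, 4, 5, 9, 18, 36)
df <- data.frame(bin = rep(paste0("bin", seq_along(sizes)), sizes),
                 Frequency = rep(x, sizes))
ggplot(df, aes(bin, Frequency, color = Frequency)) +
  geom_bar(stat = "identity", position = "stack")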