Let's say I have some data in a tibble activity with a column activity$time that records the time of day of some events. Suppose this data consists of two different sampling periods, one from 5:00 to 9:00, and one from 7:00 to 11:00. Because these periods overlap, events between 7:00 and 9:00 are over-represented by a factor of 2 compared to the rest. If I were to make a density plot like this:
ggplot(activity) + geom_density(mapping = aes(x = time))
then the center would be skewed upwards compared to what would be a true reflection of reality. How can I tell geom_density() something like "weight this interval by a factor of 0.5", or better yet, provide an arbitrary weighting function?
Here is some code demonstrating the overlap effect. runif() should produce a uniform distribution, but because I have two overlapping sections, there is a higher plateau in the middle:
set.seed(27036459)
activity <- tibble(time = c(runif(10000, 5, 9), runif(10000, 7, 11)))
ggplot(activity) + geom_density(mapping = aes(x = time))
What I want is a way to take activity, and using my knowledge of the sampling intervals, somehow adjust the graph to represent the actual distribution of the phenomenon, independent of sampling bias (in this case, the uniformity of runif()).
We can produce a set-up similar to your own by taking 50 samples from the period 5am to 9am and another 50 samples from 7am to 11am like so:
set.seed(1)
activity <- data.frame(time = as.POSIXct("2022-08-05 05:00:00") +
                         c(runif(50, 0, 14400), runif(50, 7200, 21600)))
And we can see this produces the unwanted peak between 7am and 9am:
library(tidyverse)
ggplot(activity) +
geom_density(mapping = aes(x = time))
There is no weights argument in geom_density (though it does accept a weight aesthetic), but since the area under the curve is normalized to one, it doesn't matter whether we halve the weight of values between 7 and 9 or double the weights outside this period: either gives the same result. The latter is much easier to do, however: we just create a copy of the data frame that keeps only the values outside 7 to 9, then bind this to the original data frame:
library(lubridate)
activity %>%
  # keep only events outside the overlap; hour 9 (9:00 onwards) is outside it
  filter(hour(time) < 7 | hour(time) >= 9) %>%
  bind_rows(activity) %>%
  ggplot() +
  geom_density(mapping = aes(x = time))
Created on 2022-08-05 by the reprex package (v2.0.1)
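For the "arbitrary weighting function" part of the question, one possible sketch is to weight each observation by the inverse of the number of sampling periods that cover it, and to pass those weights to base R's density(). The coverage() helper below is hypothetical and hard-codes the two periods from the question (times in decimal hours); the bandwidth is computed up front from the unweighted times.

```r
# Hypothetical helper: number of sampling periods covering time t (decimal hours)
coverage <- function(t) (t >= 5 & t <= 9) + (t >= 7 & t <= 11)

set.seed(27036459)
time <- c(runif(10000, 5, 9), runif(10000, 7, 11))

# Inverse-coverage weights, normalised to sum to 1 as density() expects
w <- 1 / coverage(time)
w <- w / sum(w)

# Use the same explicit bandwidth for both curves so they are comparable
bw <- bw.nrd0(time)
raw      <- density(time, bw = bw, from = 5, to = 11)
weighted <- density(time, bw = bw, weights = w, from = 5, to = 11)

plot(weighted, main = "Weighted (solid) vs. raw (dashed) density")
lines(raw, lty = 2)
```

geom_density() also understands a weight aesthetic, so in ggplot2 the same weights could probably be mapped with aes(x = time, weight = w); how the weights are normalised there may depend on the ggplot2 version, so that is worth checking rather than assuming.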
I have been trying to wrap my head around this question. The chart should look similar to this:
So I am basically trying to plot returns, but "standardizing" them first. Is there any quick way to do this? I thought about dividing each entry by the value in the first row, e.g. if the stock starts trading at 200, data point 1 will be 200/200 = 1, data point 2, say, 210/200 = 1.05, etc. I could then also multiply that value by 100, so that the first point would be 100, the second 105, etc.
Does this make sense or is there a smarter way to do this?
Many thanks!
You may want seq_along(). I don't have your data, so here is an example with some dummy data:
set.seed(12345)
df <- data.frame(company = c(rep("A", 100), rep("B", 100), rep("C",100), rep("D",100)),
value = c(rnorm(100, 150, 25), rnorm(100, 250, 25), rnorm(100, 50, 25), rnorm(100, 300, 25)),
time = c(151:250, 100:199, 200:299, 251:350))
Add a new column in your data after grouping by the group/color variable. Use seq_along() to populate that column with a sequence of integers starting at 1 for each set. If you need to, transform that new column to whatever scale you need. Note, this only works if your horizontal axis data is evenly spaced. If the intervals are not the same, this will cause trouble.
library(dplyr)
library(ggplot2)
df %>%
group_by(company) %>%
mutate(time2 = seq_along(time)) %>%
ggplot(aes(x = time2, y = value, color = company)) +
geom_line(size = 2) +
xlab("relative time")
If your data is unevenly spaced, consider subtracting the per-group minimum from each value in that group. This has the bonus of preserving the interval widths. If you divide by the minimum value instead, the time intervals will be compressed differently in each group. Again, you could manipulate the new variable in other ways, like adding 100 so that all values start at 100.
df %>%
group_by(company) %>%
mutate(time2 = time - min(time)) %>%
ggplot(aes(x = time2, y = value, color = company)) +
geom_line(size = 2) +
xlab("relative time")
To come back to the solution I proposed before, I've prepared the following dataset:
close <- aapl %>%
select(AAPL.Close) #selecting only the closing prices for apple
The first closing price for Apple is 32.1875. To make it comparable with another firm regardless of the share price, I want to "standardize" this value. Thus I divide each row in the dataset by 32.1875 and multiply the result by 100. This produces a new column, which I call Relative, that begins with 100 (the base value).
close$Relative <- (close$AAPL.Close/32.1875)*100
Now I do the same with AMZN; I'll spare you the code, as the concept is the same. When done, I bind both data frames together:
close <- cbind(close,amzn)
And plot the data:
# refer to columns by bare name inside aes() rather than with close$...
ggplot(close, aes(x = dates)) +
  geom_line(aes(y = Relative), color = "Red") +
  geom_line(aes(y = Relative1), color = "Blue")
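The rebasing described above can also be done for several companies in one step instead of hard-coding each first price. This is a minimal base R sketch on made-up numbers; the prices data frame and its column names are assumptions for illustration, not the AAPL/AMZN data.

```r
# Made-up closing prices for two companies over five days
prices <- data.frame(
  company = rep(c("A", "B"), each = 5),
  day     = rep(1:5, times = 2),
  close   = c(200, 210, 190, 205, 220, 50, 55, 45, 60, 52)
)

# Make sure rows are in time order within each company
prices <- prices[order(prices$company, prices$day), ]

# Divide every price by the first price of its own company and scale to 100
prices$relative <- ave(prices$close, prices$company,
                       FUN = function(v) v / v[1] * 100)

prices
```

With the data in this long format (one row per company and day), a single geom_line() with color = company would draw all rebased series, which avoids adding one layer per stock.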
I have searched very hard for a solution to this question, but I haven't been successful. I have a zoo plot with your standard time series X axis and simply want to highlight regions of the chart when the value of the series is less than a certain threshold. Specifically, I want to highlight when the p-value of the intercept is significant (and the intercept is plotted). This will occur at various intervals across the time series, rather than simply some range x <= y.
I have tried help(xblocks) and example(xblocks), but I haven't been able to get the highlighted region to show for the dates to which I know it should apply.
Does this solve your problem?
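library(zoo)
# semi-transparent red, grey and blue (note: this name masks grDevices::rgb)
rgb <- hcl(c(0, 0, 260), c = c(100, 0, 100), l = c(50, 90, 50), alpha = 0.3)
set.seed(1234)
x.Date <- as.Date("2015-02-01") + c(1, 3, 6, 7, 9, 10, 12, 14, 18, 20) - 1
y <- zoo(rnorm(length(x.Date)), x.Date)
pval <- zoo(runif(length(x.Date), 0, .2), x.Date)
plot(y, col = 4)
# shade every stretch of the series where the p-value is at or below 0.05
xblocks(pval <= 0.05, col = rgb[1])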
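If the highlighted stretches are also needed as data (for example to list them or to draw them by hand with rect()), one possible sketch is to run-length encode the threshold condition. The data and the names sig, runs and blocks below are made up for illustration and are not part of zoo.

```r
set.seed(1234)
dates <- as.Date("2015-02-01") + c(1, 3, 6, 7, 9, 10, 12, 14, 18, 20) - 1
pval  <- runif(length(dates), 0, 0.2)

# TRUE wherever the p-value is at or below the threshold
sig <- pval <= 0.05

# Collapse consecutive equal values into runs and locate each run's ends
runs   <- rle(sig)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only the runs where the condition holds
blocks <- data.frame(from = dates[starts[runs$values]],
                     to   = dates[ends[runs$values]])
blocks
```

Each row of blocks is one contiguous significant stretch, bounded by its first and last observation date.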
In an experiment, blood pressure is measured at several time points. Blood pressure rises and declines during the experiment. I need to plot the blood pressure reaction (the easy part) and find the time points (x values) where blood pressure has doubled (the tricky part). I was wondering whether this information could be retrieved in the ggplot?
Here is an example:
# Generate data
time <- c(10, 60, 90, 200, 260, 300, 700)
value <- c(1, 6, 8, 40, 50, 60, 70)
df <- data.frame(time, value)
# The first value of "value" is the first observation.
# When the first "value" increased ten times, it is equal to 10
# Question is at what time point did the value increase ten times according to the graph?
library(ggplot2)
ggplot(data = df, aes(x = time, y = value)) +
  geom_line() +
  geom_hline(yintercept = 10, colour = "red") +
  annotate("text", hjust = 0, x = 170, y = 15, label = "I need to find the x value at the intersection")
Any solutions?
No, this can't be done with ggplot2. However, it's easy to do:
v0 <- 10
f1 <- approxfun(df$time, df$value)
# we use numeric optimization here, but an analytical solution is of course possible (though a bit more work)
# this finds only one intersection; more work is required if there is more than one
optimize(function(t0) abs(f1(t0) - v0), interval = range(df$time))
#$minimum
#[1] 96.87501
#
#$objective
#[1] 3.080161e-06
If your data is bijective (value strictly increasing or decreasing in time, so the mapping can be inverted), it gets even simpler:
f2 <- approxfun(df$value, df$time)
f2(v0)
#[1] 96.875
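For the analytical route and the case of several intersections mentioned in the comments above, a minimal sketch is to check every line segment for a sign change and interpolate linearly within it. The crossings() helper is hypothetical (not from any package), and the second data set, which rises and then falls, is made up to show two intersections.

```r
# Hypothetical helper: all x positions where the piecewise-linear curve
# through (x, y) crosses the level v0
crossings <- function(x, y, v0) {
  d <- y - v0
  n <- length(x)
  # segments whose endpoints lie strictly on opposite sides of v0
  i <- which(d[-n] * d[-1] < 0)
  interp <- x[i] + (v0 - y[i]) * (x[i + 1] - x[i]) / (y[i + 1] - y[i])
  # add observations that hit v0 exactly
  sort(c(x[d == 0], interp))
}

time  <- c(10, 60, 90, 200, 260, 300, 700)
value <- c(1, 6, 8, 40, 50, 60, 70)
crossings(time, value, 10)
#[1] 96.875

# Made-up series that rises above 10 and later falls below it again
value2 <- c(1, 6, 8, 40, 50, 6, 2)
crossings(time, value2, 10)
```

Because it works segment by segment, this does not need the data to be monotonic, unlike the inverted approxfun() call.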
This is related to another question: Plot weighted frequency matrix.
I have this graphic (produced by the code below in R):
#Set the number of bets and number of trials and % lines
numbet <- 36
numtri <- 1000
#Fill a matrix where the rows are the cumulative bets and the columns are the trials
xcum <- matrix(NA, nrow=numbet, ncol=numtri)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(5/6,1/6), replace = TRUE)
xcum[,i] <- cumsum(x)/(1:numbet)
}
#Plot the trials as transparent lines so you can see the build up
matplot(xcum, type="l", xlab="Number of Trials", ylab="Relative Frequency", main="", col=rgb(0.01, 0.01, 0.01, 0.02), las=1)
I very much like the way that this plot is built up and shows the more frequent paths as darker than the rarer paths (but it is not clear enough for a print presentation). What I would like to do is to produce some kind of hexbin or heatmap for the numbers. On thinking about it, it seems that the plot will have to incorporate different sized bins (see my back of the envelope sketch):
My question then: If I simulate a million runs using the code above, how can I present it as a heatmap or hexbin, with the different sized bins as shown in the sketch?
To clarify: I do not want to rely on transparency to show the rarity of a trial passing through a part of the plot. Instead I would like to denote rarity with heat and show a common pathway as hot (red) and a rare pathway as cold (blue). Also, I do not think the bins should be the same size because the first trial has only two places where the path can be, but the last has many more. Hence the fact I chose a changing bin scale, based on that fact. Essentially I am counting the number of times a path passes through the cell (2 in col 1, 3 in col 2 etc) and then colouring the cell based on how many times it has been passed through.
UPDATE: I already had a plot similar to @Andrie's, but I am not sure it is much clearer than the top plot. It is the discontinuous nature of this graph that I do not like (and why I want some kind of heatmap). Because the first column has only two possible values, I think there should not be a huge visual gap between them, and so on. That is why I envisaged the different-sized bins. I still feel that the binned version would show a large number of samples better.
Update: This website outlines a procedure to plot a heatmap:
To create a density (heatmap) plot version of this we have to effectively enumerate the occurrence of these points at each discrete location in the image. This is done by setting a up a grid and counting the number of times a point coordinate "falls" into each of the individual pixel "bins" at every location in that grid.
Perhaps some of the information on that website can be combined with what we have already?
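The grid-and-count procedure quoted above might be sketched in base R as follows. This is only an illustration under my own assumptions: it bins by the cumulative number of wins rather than by relative frequency, so every cell has the same size and it does not yet give the different-sized bins from the sketch. The names wins and counts are made up.

```r
set.seed(1)
numbet <- 36
numtri <- 1000

# wins[n, i] = cumulative number of wins after n bets in trial i
wins <- replicate(numtri, cumsum(rbinom(numbet, 1, 1/6)))

# counts[k + 1, n] = number of trials with exactly k wins after n bets
counts <- sapply(seq_len(numbet),
                 function(n) tabulate(wins[n, ] + 1, nbins = numbet + 1))

# image() expects z with one row per x value, hence the transpose
image(x = seq_len(numbet), y = 0:numbet, z = t(counts),
      col = rev(heat.colors(50)),
      xlab = "Number of bets", ylab = "Cumulative wins")
```

Counting exactly rather than smoothing keeps the picture discrete, so a million trials only changes the numbers in counts, not the size of the grid.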
Update: I took some of what Andrie wrote with some of this question, to arrive at this, which is quite close to what I was conceiving:
numbet <- 20
numtri <- 100
prob=1/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
idvar="trial", v.names="outcome", direction="long", timevar="bet")
#from the other question
require(MASS)
dens <- kde2d(mxcum$bet, mxcum$outcome)
filled.contour(dens)
I don't quite understand what's going on, but this seems to be more like what I wanted to produce (obviously without the different sized bins).
Update: This is similar to the other plots here. It is not quite right:
library(hexbin)
plot(hexbin(x=mxcum$bet, y=mxcum$outcome))
Last try. As above:
image(mxcum$bet, mxcum$outcome)
This is pretty good. I would just like it to look like my hand-drawn sketch.
Edit
I think the following solution does what you ask for.
(Note that this is slow, especially the reshape step)
numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
ymin=c(0, head(seq_along(V1)/length(V1), -1)),
ymax=seq_along(V1)/length(V1),
fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
# formatter= was removed from ggplot2 scales; labels= with scales::percent replaces it
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", labels=scales::percent, low="red", high="blue") +
  scale_y_continuous(labels=scales::percent) +
  xlab("Bet")
print(p)
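As a variant of the same idea without plyr or ggplot2, the rectangles can be built directly from the number of wins: after n bets there are n + 1 possible outcomes, so each column is split into n + 1 equal cells. This is a rough sketch under that assumption; unlike the code above it also draws cells that were never observed, and the names cells, share and pal are made up.

```r
set.seed(1)
numbet <- 10
numtri <- 2000

# wins[n, i] = cumulative number of wins after n bets in trial i
wins <- replicate(numtri, cumsum(rbinom(numbet, 1, 1/6)))

# One row per possible cell: bet n has n + 1 cells of height 1 / (n + 1)
cells <- do.call(rbind, lapply(seq_len(numbet), function(n) {
  k <- 0:n
  data.frame(bet = n, wins = k,
             ymin = k / (n + 1), ymax = (k + 1) / (n + 1),
             share = tabulate(wins[n, ] + 1, nbins = n + 1) / numtri)
}))

# Map the share of trials in each cell onto a blue (rare) to red (common) ramp
pal <- colorRampPalette(c("blue", "red"))(100)
plot(NA, xlim = c(0.5, numbet + 0.5), ylim = c(0, 1),
     xlab = "Bet", ylab = "Cell (0 wins at the bottom)")
rect(cells$bet - 0.5, cells$ymin, cells$bet + 0.5, cells$ymax,
     col = pal[1 + floor(99 * cells$share / max(cells$share))],
     border = "grey80")
```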
FYI: This is more of an extended comment than an answer.
To me, this new plot looks like a stacked bar where each bar's height is equal to the intersection points of the upper and lower line at the next trial.
The way that I would approach this is to treat "Trials" as a categorical variable. Then we can search each row of xcum for elements that are equal. If any are, we can consider this a point of intersection, whose minimum also gives the multiple that defines the height of our bars.
x <- t(xcum)
x <- x[duplicated(x),]
x[x==0] <- NA
Now we have the multiples of the actual points, we need to figure out how to take it to the next step and find a way of binning the information. That means we need to make a decision about how many points will represent each grouping. Let's write some points out for posterity.
Trial 1 (2) = 1, 0.5 # multiple = 0.5
Trial 2 (3) = 1, 0.66, 0.33 # multiple = 0.33
Trial 3 (4) = 1, 0.75, 0.5, 0.25 # multiple = 0.25
Trial 4 (5) = 1, 0.8, 0.6, 0.4, 0.2 # multiple = 0.2
Trial 5 (6) = 1, 0.8333335, 0.6666668, 0.5000001, 0.3333334, 0.1666667
...
Trial 36 (35) = 1, 0.9722223, ..., 0.02777778 # multiple = 0.05555556 / 2
In other words, for each Trial there are n-1 points to plot. In your drawing you have 7 bins. So we need to figure out the multiples for each bin.
Let's cheat and divide the last two columns by two; we know from visual inspection that the minimum is lower than 0.05.
x[,35:36] <- x[,35:36] / 2
Then find the minimum of each column:
x <- apply(x, 2, function(x) min(x, na.rm=TRUE))[-1] # Drop the 1
x <- x[c(1,2,3,4,8,17,35)] # I'm just guessing here by the "look" of your drawing.
The clearest way to do this is to create each bin separately. Obviously, this could be done automatically later. Remembering that each point is
bin1 <- data.frame(bin = rep("bin1",2), Frequency = rep(x[1],2))
bin2 <- data.frame(bin = rep("bin2",3), Frequency = rep(x[2],3))
bin3 <- data.frame(bin = rep("bin3",4), Frequency = rep(x[3],4))
bin4 <- data.frame(bin = rep("bin4",5), Frequency = rep(x[4],5))
bin5 <- data.frame(bin = rep("bin5",9), Frequency = rep(x[5],9))
bin6 <- data.frame(bin = rep("bin6",18), Frequency = rep(x[6],18))
bin7 <- data.frame(bin = rep("bin7",36), Frequency = rep(x[7],36))
df <- rbind(bin1,bin2,bin3,bin4,bin5,bin6,bin7)
ggplot(df, aes(bin, Frequency, color=Frequency)) + geom_bar(stat="identity", position="stack")