ggplot2: add stat_function for a particular domain?

I'd like to add a curve to a plot I'm making with ggplot, but I only want the curve to appear for a particular domain.
I've tried various approaches using stat_function:
data <- data.frame(Date = ..., cases = ...)
end_date <- ... ## calculated from a date (e.g., Sys.Date()) minus an offset
start_date <- ... ## end_date - some offset
p1 <- ggplot(data) + aes(x=Date, y=cases) + ... ## data has Date, cases columns
p1 + stat_function(...something..., fun=function(t) ...)
where, in place of something, I've tried passing a new, subsetted chunk of data:
data = data[(start_date <= data$Date) & (data$Date <= end_date),] ## no change
and a new aes
aes = aes(xmin = start_date, xmax = end_date)
## error - thinks start_date / end_date don't exist,
## though they are declared earlier
Any suggestions? I've also fiddled around with annotate("path", ...) but nothing concrete there. I feel like this should be something easy, I just don't have my head around the "ggplot way" to make it happen.
It may also be relevant that I'm making these plots in a Shiny application, though aside from some quirks with data.table, I haven't noticed that affecting anything.

The following seems to work, though it still feels very hacky to me:
data$fit <- ... # evaluate function on Date
relrows <- (start_date <= data$Date) & (data$Date <= end_date)
p1 <- p1 + annotate("line", y=data$fit[relrows], x=data$Date[relrows])
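If pre-computing the fitted values feels too manual, stat_function() also takes an xlim argument that restricts the domain over which the function is evaluated, which is likely why passing a subsetted data frame to the layer had no effect: stat_function takes its x range from the scale rather than from the layer data. A rough sketch only; the curve f() is a placeholder (the real function isn't shown above), and on a Date axis the function receives numeric values:
f <- function(t) 50 + 0.5 * t  ## placeholder curve, not the one from the question
p1 + stat_function(fun = function(x) f(x - as.numeric(start_date)),  ## x arrives as numeric days on a Date axis
                   xlim = as.numeric(c(start_date, end_date)),       ## evaluate only on this domain
                   inherit.aes = FALSE)                              ## don't inherit y = cases from the plot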

Try adding another label as a new column in your data frame:
data$newlabel <- NA
data$newlabel[(start_date <= data$Date) & (data$Date <= end_date)] <- "a"
then add groups to your ggplot
p1 <- ggplot(data) +
  aes(x=Date, y=cases, group=newlabel, colour=newlabel) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x,2), size=1)
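An alternative to the extra label column is to hand the smoothing layer its own data, since every ggplot2 layer accepts a data argument. A sketch reusing the column names from the question:
p1 <- ggplot(data, aes(x = Date, y = cases)) +
  geom_point() +
  stat_smooth(data = subset(data, Date >= start_date & Date <= end_date),
              method = "lm", formula = y ~ poly(x, 2), size = 1)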

Related

Facet_grid w/ Dates Oct thru Sept

I'm working with a dataset that is typically analyzed using start and end dates that differ from a typical calendar year (water years run October 1 to Sept 30). I have no problem plotting the data in ggplot but as soon as I try to use facet_grid to break up the years, the data get misrepresented.
#Create data set
library(ggplot2)
library(lubridate)  ## for year() and year<-()
WY18to21 <- as.data.frame(seq(as.Date("2017-10-01"), as.Date("2021-09-30"), "days"))
names(WY18to21) <- c("Date")
WY18to21$Year <- year(WY18to21$Date)
WY18to21$Temp <- c(rep(seq(1,25, by=0.1), times=6), 1:15)
#Create "yearless" date for facet plot
WY18to21$Date_md <- WY18to21$Date
year(WY18to21$Date_md) <- 2000
#Set Water Year (WY)
WY18to21$WY <- "NA"
WY18to21$WY <- ifelse(WY18to21$Date >= as.Date("2017-10-01") &
WY18to21$Date <= as.Date("2018-09-30"), "WY18",WY18to21$WY)
WY18to21$WY <- ifelse(WY18to21$Date >= as.Date("2018-10-01") &
WY18to21$Date <= as.Date("2019-09-30"), "WY19",WY18to21$WY)
WY18to21$WY <- ifelse(WY18to21$Date >= as.Date("2019-10-01") &
WY18to21$Date <= as.Date("2020-09-30"), "WY20",WY18to21$WY)
WY18to21$WY <- ifelse(WY18to21$Date >= as.Date("2020-10-01") &
WY18to21$Date <= as.Date("2021-09-30"), "WY21",WY18to21$WY)
#Plot - regular ggplot
x <- ggplot(WY18to21) +
geom_line(aes(Date, Temp)) +
xlab("") + ylab("[°C]") +
ylim(0,26) +
theme_bw() +
scale_x_date(date_labels = "%b")
x
This looks okay, but I find the aesthetics of facet plots easier to interpret:
x <- ggplot(WY18to21) +
geom_line(aes(Date_md, Temp)) +
xlab("") + ylab("[°C]") +
ylim(0,26) +
theme_bw() +
scale_x_date(date_labels = "%b")
x_facet <- x + facet_grid(. ~ WY)
x_facet
It looks like the facet_grid function first splits the data by water year (WY) and then orders the data by the "yearless" date. In practice this means the beginning of the next water year gets moved to the beginning of the current water year.
What I need to do is figure out either:
(a) how do I use facet_grid with dates without removing the year OR
(b) how do I redefine the order dates are plotted so that the year begins on October 1.
Attempts at problem solving: I think the solution has something to do with the mutate function, but I have yet to figure that out. I also tried searching under "fiscal year", since fiscal years also often differ from calendar years, but no luck. Thanks for the help!
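One way to get option (b) is to anchor the "yearless" date so that October through December fall in one dummy year and January through September in the following one; the continuous date axis inside each facet then runs Oct to Sep. A sketch along those lines (the Date_wy column name is invented here, and lubridate's month() and year<-() are assumed to be loaded as above):
WY18to21$Date_wy <- WY18to21$Date
year(WY18to21$Date_wy) <- ifelse(month(WY18to21$Date) >= 10, 2000, 2001)  ## Oct-Dec -> 2000, Jan-Sep -> 2001
ggplot(WY18to21) +
  geom_line(aes(Date_wy, Temp)) +
  xlab("") + ylab("[°C]") +
  ylim(0, 26) +
  theme_bw() +
  scale_x_date(date_labels = "%b") +
  facet_grid(. ~ WY)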

How to use ggplot with prop.table(table(x))?

First, I have a data frame with two categorical variables, like this:
nombre <- c("A","B","C","A","D","F","F","H","I","J")
sexo <- c(rep("man",4),rep("woman",6))
edad <- c(25,14,25,76,12,90,65,45,56,43)
pais <- c(rep("spain",3),rep("italy",4),rep("portugal",3))
data <- data.frame(nombre=nombre,sexo=sexo,edad=edad,pais=pais)
If I use:
prop.table(table(data$sexo,data$pais), margin=1)
I can see the relative frequency of the levels, for example for Italy (Man=0.25 Woman=0.5)
but the problem is that when I try to plot the prop.table(table(x)) I get something different
ggplot(as.data.frame(prop.table(table(data),margin=1)), aes(x=pais ,y =Freq, fill=sexo))+geom_bar(stat="identity")
The y axis runs from 0 to 3, and in the Italy bar, for example, Woman = 2 and Man = 2.5.
I don't want that (and I don't know what it is showing); I want the same values I had in the prop.table(table(x)) output.
I think the problem is something related to the margin=1.
Thank you!
You need to make the same table
tab = prop.table(table(data$sexo,data$pais), margin=1)
tab = as.data.frame(tab)
Then plot:
ggplot(tab,aes(x=Var2,y=Freq,fill=Var1)) + geom_col()
Or simply:
barplot(prop.table(table(data$sexo,data$pais), margin=1))
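If the base-graphics route is acceptable and you want the bars side by side rather than stacked, barplot() can do that too (beside and legend.text are standard barplot arguments):
barplot(prop.table(table(data$sexo, data$pais), margin = 1),
        beside = TRUE, legend.text = TRUE)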
You're probably looking for something like position = "dodge"
If I run the following on your data:
P <- prop.table(table(data$sexo,data$pais), margin=1)
ggplot(as.data.frame(P), aes(x = Var2, y = Freq, fill = Var1)) +
geom_bar(stat="identity", position = "dodge")
I get the following graph:

Plotting a data.table with multiple columns (all 1:7 rows) via ggplot with a single geom_point(), using aesthetics to color them differently

I intend to compare timings between two algorithm-based functions f1 and f2 via microbenchmark. They work on an rpois-simulated dataset with sizes given by the length-7 vector 10^seq(1,4,by=0.5), i.e.:
[1] 10.00000 31.62278 100.00000 316.22777 1000.00000 3162.27766 10000.00000
I am working on plotting them as well, with all of the information from microbenchmark (i.e. min, lq, mean, median, uq and max; yes, all of them are required, just not expr and neval). I want this via ggplot on a log-log scale, with a single geom_point() and aesthetics giving each measure a different colour. Here is my code for that:
library(ggplot2)
library(microbenchmark)
require(dplyr)
library(data.table)
datasetsizes<-c(10^seq(1,4,by=0.5))
f1_min<-integer(length(datasetsizes))
f1_lq<-integer(length(datasetsizes))
f1_mean<-integer(length(datasetsizes))
f1_median<-integer(length(datasetsizes))
f1_uq<-integer(length(datasetsizes))
f1_max<-integer(length(datasetsizes))
f2_min<-integer(length(datasetsizes))
f2_lq<-integer(length(datasetsizes))
f2_mean<-integer(length(datasetsizes))
f2_median<-integer(length(datasetsizes))
f2_uq<-integer(length(datasetsizes))
f2_max<-integer(length(datasetsizes))
for(loopvar in 1:(length(datasetsizes))) {
  s <- summary(microbenchmark(
    f1(rpois(datasetsizes[loopvar], 10), max.segments = 3L),
    f2(rpois(datasetsizes[loopvar], 10), maxSegments = 3)))
  f1_min[loopvar] <- s$min[1]
  f2_min[loopvar] <- s$min[2]
  f1_lq[loopvar] <- s$lq[1]
  f2_lq[loopvar] <- s$lq[2]
  f1_mean[loopvar] <- s$mean[1]
  f2_mean[loopvar] <- s$mean[2]
  f1_median[loopvar] <- s$median[1]
  f2_median[loopvar] <- s$median[2]
  f1_uq[loopvar] <- s$uq[1]
  f2_uq[loopvar] <- s$uq[2]
  f1_max[loopvar] <- s$max[1]
  f2_max[loopvar] <- s$max[2]
}
algorithm <- data.table(f1_min, f2_min,
                        f1_lq, f2_lq,
                        f1_mean, f2_mean,
                        f1_median, f2_median,
                        f1_uq, f2_uq,
                        f1_max, f2_max, datasetsizes)
ggplot(algorithm, aes(x=algorithm, y=datasetsizes)) +
  geom_point(aes(color=algorithm)) +
  labs(x="N", y="Runtime") +
  scale_x_continuous(trans = 'log10') +
  scale_y_continuous(trans = 'log10')
I debugged my code at each step, and up until the assignment of the computed values to a data.table named 'algorithm' it works fine.
Here are the computed runs, which are passed into the data.table as length-7 vectors, along with datasetsizes (also length 7) at the end:
> algorithm
f1_min f2_min f1_lq f2_lq f1_mean f2_mean f1_median f2_median f1_uq f2_uq f1_max f2_max datasetsizes
1: 86.745000 21.863000 105.080000 23.978000 113.645630 24.898840 113.543500 24.683000 120.243000 25.565500 185.477000 39.141000 10.00000
2: 387.879000 52.893000 451.880000 58.359000 495.963480 66.070390 484.672000 62.061000 518.876500 66.116500 734.149000 110.370000 31.62278
3: 1608.287000 341.335000 1845.951500 382.062000 1963.411800 412.584590 1943.802500 412.739500 2065.103500 443.593500 2611.131000 545.853000 100.00000
4: 5.964166 3.014524 6.863869 3.508541 7.502123 3.847917 7.343956 3.851285 7.849432 4.163704 9.890556 5.096024 316.22777
5: 23.128505 29.687534 25.348581 33.654475 26.860166 37.576444 26.455269 37.080149 28.034113 41.343289 35.305429 51.347386 1000.00000
6: 79.785949 301.548202 88.112824 335.135149 94.248141 370.902821 91.577462 373.456685 98.486816 406.472393 135.355570 463.908240 3162.27766
7: 274.367776 2980.122627 311.613125 3437.044111 337.287131 3829.503738 333.544669 3820.517762 354.347487 4205.737045 546.996092 4746.143252 10000.00000
microbenchmark computed the values fine, as expected, but the ggplot call throws this error:
Don't know how to automatically pick scale for object of type data.table/data.frame. Defaulting to continuous.
Error: Aesthetics must be either length 1 or the same as the data (7): colour, x
I am not able to resolve this. Can anyone let me know what is wrong and correct the plotting procedure?
Also, on a side note, I had to extract all the values (min, lq, mean, median, uq, max) separately from the computed benchmark, since I can't take the summary itself as a data.table because it contains the expr (expression) and neval columns. I was able to eliminate one of the columns using
algorithm[,!"expr"] or algorithm[,!"neval"]
but I can't eliminate both of them together, i.e.
algorithm[,!"expr",!"neval"] or algorithm[,!("expr","neval")] or algorithm[,!"expr","neval"]
- all such combinations fail (they throw an 'invalid argument type' error).
Any possible workaround or solution to this and the plotting (main thing) would be highly appreciated!
Your problem lies mainly with the fact that you're referring to an algorithm column in the ggplot formula that does not exist in your object.
From what you gave, I could do the following :
algorithm$algorithm <- 1:nrow(algorithm)
ggplot(algorithm, aes(x=algorithm,y=datasetsizes)) + geom_point(aes(color=algorithm)) + labs(x="N", y="Runtime") +
scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
and plot this fine:
EDIT: let's clean this up a bit...
As per OP's request, I've cleaned up his code a bit.
There are a lot of things you can work on to improve on your code's readability, but I'm focusing more on the practical aspect here.
Basically, join your variables together in a table if you know they'll end up as such.
There are a bunch of tricks you can use to assign the values to the correct spots, a few of which you'll see in the code below.
library(ggplot2)
library(microbenchmark)
require(dplyr)
library(data.table)
datasetsizes<-c(10^seq(1,4,by=0.5))
l <- length(datasetsizes)
# make a vector with your different conditions
conds <- c('f1', 'f2')
# initalizing a table from the getgo is much cleaner
# than doing everything in separate variables
dat <- data.frame(
  datasetsizes = rep(datasetsizes, each = length(conds)), # make replicates for each condition
  cond = rep(NA, l*length(conds))
)
dat[, c("min", "lq", "mean", "median", "uq", "max")] <- 0
dat$cond <- factor(dat$cond, levels = conds)
head(dat)
for(i in 1:l){ # for the love of god, don't use something as long as 'loopvar' as an iterator
  # I don't have f1 & f2 so I did what I could...
  s <- summary(microbenchmark(
    "f1" = rpois(datasetsizes[i],10),
    "f2" = {length(rpois(datasetsizes[i],10))}))
  dat[which(dat$datasetsizes == datasetsizes[i]), # select rows of current ds size
      c("cond", "min", "lq", "mean", "median", "uq", "max")] <- s[, !colnames(s) %in% c("neval")]
}
dat <- data.table(dat)
ggplot(dat, aes(x=datasetsizes,y=mean)) +
geom_point(aes(color = cond)) +
geom_line(aes(color = cond)) + # added to see a clear difference btw conds
labs(x="N", y="Runtime") + scale_x_continuous(trans = 'log10') +
scale_y_continuous(trans = 'log10')
This gives the following plot.
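As for the side note about dropping expr and neval together: data.table accepts a negated character vector in j, so both columns can be removed in one call. A quick sketch (the benchmarked expressions here are placeholders, since f1 and f2 aren't available):
s <- summary(microbenchmark(a = rpois(100, 10), b = sqrt(rpois(100, 10))))
s_dt <- as.data.table(s)
s_dt[, !c("expr", "neval")]                 ## drop both columns at once
## or, staying with the plain data.frame returned by summary():
s[, !(names(s) %in% c("expr", "neval"))]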

How to increase the font size of a legend?

Or even the size of the words in the plot itself? Any hints on that are welcome.
library(openair)  ## provides selectByDate(), the mydata example data and TaylorDiagram()
dat <- selectByDate(mydata, year = 2003)
dat <- data.frame(date = mydata$date, obs = mydata$nox, mod = mydata$nox)
dat <- transform(dat, month = as.numeric(format(date, "%m")))
mod1 <- transform(dat, mod = mod + 10 * month + 10 * month * rnorm(nrow(dat)),model = "model 1")
mod1 <- transform(mod1, mod = c(mod[5:length(mod)], mod[(length(mod) - 3) :
length(mod)]))
mod2 <- transform(dat, mod = mod + 7 * month + 7 * month * rnorm(nrow(dat)),
model = "model 2")
mod.dat <- rbind(mod1, mod2)
Much of this appears to have been hard coded, so I don't think modifying this plot will be easy in general. In the specific case of the legend text, you can modify some arguments in the plot object after creating it:
out <- TaylorDiagram(mod.dat, obs = "obs", mod = "mod", group = "model")
out$plot$legend$right$args$key$text$cex <- 1.5
out$plot$legend$right$args$key$cex.title <- 1.5
I don't see anything similar that only applies to the text in the plot itself. To modify that you'd likely have to dig further into the code itself and modify it to get the specific results you want.
Indeed, digging further, much of the details of the plot are taking place in custom panel functions panel.taylor.setup and panel.taylor in which almost all of the specific sizes of things are hard coded.
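Since TaylorDiagram() is built on lattice, another thing worth trying for the text that is not hard coded (axis annotation and the like) is lattice's global font setting before drawing; whether it reaches a given label depends on whether the panel functions override it:
library(lattice)
trellis.par.set(fontsize = list(text = 14, points = 10))  ## global lattice text/point sizes
TaylorDiagram(mod.dat, obs = "obs", mod = "mod", group = "model")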

Position dodge with geom_point(), x=continuous, y=factor

I have made a function that can plot the loadings from many factor analyses at once, even when their variables do not overlap perfectly (or at all). It works fine, but sometimes factor loadings are identical across analyses, which means the points get plotted on top of each other.
library(pacman)
p_load(devtools, psych, stringr, plotflow, ggplot2)
source_url("https://raw.githubusercontent.com/Deleetdk/psych2/master/psych2.R")
loadings.plot2 = function(fa.objects, fa.names=NA) {
fa.num = length(fa.objects) #number of fas
#check names are correct or set automatically
if (length(fa.names)==1 & is.na(fa.names)) {
fa.names = str_c("fa.", 1:fa.num)
}
if (length(fa.names) != fa.num) {
stop("Names vector does not match the number of factor analyses.")
}
#merge into df
d = data.frame() #to merge into
for (fa.idx in 1:fa.num) { #loop over fa objects
loads = fa.objects[[fa.idx]]$loadings
rnames = rownames(loads)
loads = as.data.frame(as.vector(loads))
rownames(loads) = rnames
colnames(loads) = fa.names[fa.idx]
d = merge.datasets(d, loads, 1)
}
#reshape to long form
d2 = reshape(d,
varying = 1:fa.num,
direction="long",
ids = rownames(d))
d2$time = as.factor(d2$time)
d2$id = as.factor(d2$id)
colnames(d2)[2] = "fa"
print(d2)
#plot
g = ggplot(reorder_by(id, ~ fa, d2), aes(x=fa, y=id, color=time, group=time)) +
geom_point(position=position_dodge()) +
xlab("Loading") + ylab("Indicator") +
scale_color_discrete(name="Analysis",
labels=fa.names)
return(g)
}
#Some example plots
fa1 = fa(iris[-5])
fa2 = fa(iris[-c(1:50),-5])
fa3 = fa(ability)
fa4 = fa(ability[1:50,])
loadings.plot2(list(fa1,fa1,fa2))
Here I've plotted the same object twice just to show the effect. The plot has no red points because the green ones from fa.2 are on top. Instead, I want them to be dodged on the y-axis. However, position="dodge" with various settings does not appear to make a difference.
However, position="jitter" works, but it is random, so sometimes it does not work well and it makes the plot chaotic to look at.
How do I make the points dodged on the y-axis?
Apparently, you can only dodge sideways, but there is a workaround. The trick is to flip your x and y, do the position_dodge, and then do a coord_flip().
g = ggplot(data = reorder_by(id, ~ fa, d2), aes(x=id, y=fa, color=time, group=time)) +
geom_point(position=position_dodge(width = .5)) +
xlab("Loading") + ylab("Indicator") +
scale_color_discrete(name="Analysis",
labels=fa.names) +
coord_flip()
Possible duplicate
In the linked post, the right answer states that one must use position_jitter() instead of position_dodge(). It has worked for me.
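Another route, if an extra dependency is acceptable, is the ggstance package, which provides position_dodgev() for dodging along the y-axis, so no coord_flip() is needed. A sketch against the same d2 and fa.names used inside the function above:
library(ggstance)
g <- ggplot(reorder_by(id, ~ fa, d2), aes(x = fa, y = id, color = time, group = time)) +
  geom_point(position = position_dodgev(height = 0.5)) +  ## dodge vertically
  xlab("Loading") + ylab("Indicator") +
  scale_color_discrete(name = "Analysis", labels = fa.names)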
