I have a data set with information on where individuals work over time. More specifically, I have the interval during which each individual works at a given workplace.
library('tidyverse')
library('lubridate')
# individual A
a_id <- c(rep('A',1))
a_start <- c(201201)
a_end <- c(201212)
a_workplace <-c(1)
# individual B
b_id <- c(rep('B',2))
b_start <- c(201201, 201207)
b_end <- c(201206, 201211)
b_workplace <-c(1, 2)
# individual C
c_id <- c(rep('C',2))
c_start <- c(201201, 201202)
c_end <- c(201204, 201206)
c_workplace <-c(1, 2)
# individual D
d_id <- c(rep('D',1))
d_start <- c(201201)
d_end <- c(201201)
d_workplace <-c(1)
# final data frame
id <- c(a_id, b_id, c_id, d_id)
start <- c(a_start, b_start, c_start, d_start)
end <- c(a_end, b_end, c_end, d_end)
workplace <- as.factor(c(a_workplace, b_workplace, c_workplace, d_workplace))
mydata <- data.frame(id, start, end, workplace)
mydata_ym <- mydata %>%
mutate(ymd_start = as.Date(paste0(start, "01"), format = "%Y%m%d"),
ymd_end0 = as.Date(paste0(end, "01"), format = "%Y%m%d"),
day_end = as.numeric(format(ymd_end0 + months(1) - days(1), format = "%d")),
ymd_end = as.Date(paste0(end, day_end), format = "%Y%m%d")) %>%
select(-ymd_end0, -day_end)
I would like a plot that shows how long each individual works at each workplace and how they move between workplaces. I tried plotting with geom_segment, since I have the start and end dates of each spell at a workplace. Because the same individual may work in more than one place during the same month, I would also like to use position_dodge so that overlapping workplaces for the same id and time remain visible. This was suggested in this post here: Ggplot (geom_line) with overlaps
ggplot(mydata_ym) +
geom_segment(aes(x = id, xend = id, y = ymd_start, yend = ymd_end),
position = position_dodge(width = 0.1), size = 2) +
scale_x_discrete(limits = rev) +
coord_flip() +
theme(panel.background = element_rect(fill = "grey97")) +
labs(y = "time", title = "Work affiliation")
The problem I am having is that: (i) the position_dodge doesn't seem to be working, (ii) I don't know why all the segments are being colored in black. I would expect each workplace to have a different color and a legend to show up.
If you include colour = workplace in the aes() mapping for geom_segment, you get colours, a legend, and some dodging, but it doesn't work quite right: it looks like position_dodge only applies to x and not xend. This seems like a bug, or at least an "infelicity", in position_dodge.
However, replacing geom_segment with an appropriate use of geom_linerange does seem to work:
ggplot(mydata_ym) +
geom_linerange(aes(x = id, ymin = ymd_start, ymax = ymd_end, colour = workplace),
position = position_dodge(width = 0.1), size = 2) +
scale_x_discrete(limits = rev) +
coord_flip()
(some tangential components omitted).
A similar approach is previously documented here — a near-duplicate of your question once the colour= mapping is taken care of ...
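As far as I understand it, position_dodge offsets rows by group, and in the original call the only discrete variable in the layer was id, so each id formed a single group with nothing to dodge against. Here is a minimal sketch that makes the grouping explicit with group = workplace instead of colour. The toy data frame `jobs` and the width of 0.3 are my own assumptions, not from the question.

```r
library(ggplot2)

# Toy intervals (hypothetical): B and C each hold two workplaces,
# and C's two spells overlap in time.
jobs <- data.frame(
  id        = c("B", "B", "C", "C"),
  start     = as.Date(c("2012-01-01", "2012-07-01", "2012-01-01", "2012-02-01")),
  end       = as.Date(c("2012-06-30", "2012-11-30", "2012-04-30", "2012-06-30")),
  workplace = factor(c(1, 2, 1, 2))
)

p <- ggplot(jobs) +
  geom_linerange(
    aes(x = id, ymin = start, ymax = end, group = workplace),
    position = position_dodge(width = 0.3)
  ) +
  coord_flip()

# layer_data() returns the positions ggplot2 actually computed;
# after dodging, every row should sit at its own x value.
dodged_x <- as.numeric(layer_data(p, 1)$x)
print(dodged_x)
```

Mapping colour = workplace has the same grouping effect and adds the legend, which is why the geom_linerange version above works.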
I am attempting to plot the blood test results for a patient as a time series. I have managed to do this and included a reference range as a shaded band between two y-values. My problem is that the annotate() or geom_segment() calls want me to specify the coordinates in the units of my independent variable, which is, unhelpfully, a date (YYYY-MM-DD).
Is it possible to get R to ignore the units of the x- and y-axis and specify the arrow co-ordinates as if they were on a grid?
result <- runif(25, min = 2.0, max = 3.5)
start_date <- ymd("2021-08-16")
end_date <- ymd("2022-10-29")
date <- sample(seq(start_date, end_date, by = "days"), 25, replace = TRUE)
q <- data.table(result, date)
ggplot(q, aes(x = date, y = result)) +
geom_line() +
geom_point(aes(x = date, y = result), shape = 21, size = 3) +
scale_x_date(limits = c(min(q$date), max(q$date)),
breaks = date_breaks("1 month"),
labels = date_format("%b %Y")) +
ylab("Corrected calcium (mmol/L)")+
xlab("Date of blood test") +
ylim(1,4)+
geom_ribbon(aes(ymin=2.1, ymax=2.6), fill="grey", alpha=0.2, colour="grey")+
geom_vline(xintercept=as.numeric(q$date[c(3, 2)]),
linetype=4, colour="black") +
theme(axis.text.x = element_text(angle = 45)) + theme_prism(base_size = 10) +
annotate("segment", x = 1, y = 2, xend = 3, yend = 4, arrow = arrow(length = unit(0.15, "cm")))
The error produced is Error: Invalid input: date_trans works with objects of class Date only.
I can confirm that:
> class(q$date)
[1] "Date"
I've just gone with test co-ordinates (1,2,3,4) for the annotate("segment"...), ideally I want to be able to get the arrow to point to a specific data point on the plot to indicate when the patient went on treatment.
Many thanks,
Sandro
You don't need to convert to points or coordinates. Just use the actual values from your data frame. I am just subsetting within annotate using a hard coded index (you can also automate this of course), but you will need to "remind" R that you are dealing with dates - thus the added lubridate::as_date call.
library(ggplot2)
library(lubridate)
result <- runif(25, min = 2.0, max = 3.5)
start_date <- ymd("2021-08-16")
end_date <- ymd("2022-10-29")
date <- sample(seq(start_date, end_date, by = "days"), 25, replace = TRUE)
q <- data.frame(result, date)
## I am arranging the data frame by date
q <- dplyr::arrange(q, date)
ggplot(q, aes(x = date, y = result)) +
geom_line() +
## for start use a random x and y so it starts wherever you want it to start
## for end, use the same row from your data frame, in this case row 20
annotate(geom = "segment",
x = as_date(q$date[2]), xend = as_date(q$date[20]),
y = min(q$result), yend = q$result[20],
arrow = arrow(),
size = 2, color = "red")
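To automate the hard-coded index, one option is to look up the observation closest to a known date. This is only a sketch: the regular weekly dates, the `treatment_start` value, and the 30-day offset for the arrow tail are all made up for illustration.

```r
library(ggplot2)

set.seed(42)
# Hypothetical weekly test results, already sorted by date
q <- data.frame(
  date   = seq(as.Date("2021-08-16"), by = "week", length.out = 25),
  result = runif(25, min = 2.0, max = 3.5)
)

# Assumed date on which treatment started
treatment_start <- as.Date("2021-12-01")

# Row whose test date is closest to the treatment date
i <- which.min(abs(as.numeric(q$date - treatment_start)))

p <- ggplot(q, aes(x = date, y = result)) +
  geom_line() +
  annotate(
    "segment",
    x = q$date[i] - 30, xend = q$date[i],  # Date arithmetic keeps the Date class
    y = 3.9, yend = q$result[i],
    arrow = arrow(length = unit(0.15, "cm")),
    colour = "red"
  )
```

Because q$date[i] - 30 is still a Date, the date scale accepts it without any extra conversion.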
My problem seems simple: I am using ggplot2 with geom_jitter() to plot a variable. (Take my picture as an example.)
Jitter now adds some random noise to the variable (the variable is just called "1" in this example) to prevent overplotting. So I have now random noise in the y-direction and clearly what otherwise would be completely overplotted is now better visible.
But here is my question:
As you can see, there are still some points that overplot each other. In my example, this could easily be prevented if the offsets in the y-direction were placed more strategically instead of being random noise.
Can I somehow alter the geom_jitter() behavior or is there a similar function in ggplot2 that does exactly this?
Not really a minimal example, but also not too long:
library("imputeTS")
library("ggplot2")
data <- tsAirgap
# 2.1 Create required data
# Get all indices of the data that comes directly before and after an NA
na_indx_after <- which(is.na(data[1:(length(data) - 1)])) + 1
# starting from index 2 moves all indexes one in front, so no -1 needed for before
na_indx_before <- which(is.na(data[2:length(data)]))
# Get the actual values to the indices and put them in a data frame with a label
before <- data.frame(id = "1", type = "before", input = na_remove(data[na_indx_before]))
after <- data.frame(id = "1", type = "after", input = na_remove(data[na_indx_after]))
all <- data.frame(id = "1", type = "source", input = na_remove(data))
# Get n values for the plot labels
n_before <- length(before$input)
n_all <- length(all$input)
n_after <- length(after$input)
# 2.4 Create dataframe for ggplot2
# join the data together in one dataframe
df <- rbind(before, after, all)
# Create the plot
gg <- ggplot(data = df) +
geom_jitter(mapping = aes(x = id, y = input, color = type, alpha = type), width = 0.5 , height = 0.5)
gg <- gg + ggplot2::scale_color_manual(
values = c("before" = "skyblue1", "after" = "yellowgreen","source" = "gray66"),
)
gg <- gg + ggplot2::scale_alpha_manual(
values = c("before" = 1, "after" = 1,"source" = 0.3),
)
gg + ggplot2::theme_linedraw() + theme(aspect.ratio = 0.5) + ggplot2::coord_flip()
So many good suggestions... here is what Ben's suggestion would look like for my example:
I changed parts of my code to:
gg <- ggplot(data = df, aes(x = input, color = type, fill = type, alpha = type)) +
geom_dotplot(binwidth = 15)
That would basically also work as intended for me. ggbeeswarm, as suggested by Jon, also worked great for my purpose.
I thought of a hack I really like, using ggrepel. It's normally used for labels, but nothing prevents you from making the label into a point.
df <- data.frame(x = rnorm(200),
col = sample(LETTERS[1:3], 200, replace = TRUE),
y = 1)
ggplot(df, aes(x, y, label = "●", color = col)) + # using unicode black circle
ggrepel::geom_text_repel(segment.color = NA,
box.padding = 0.01, key_glyph = "point")
A downside of this method is that ggrepel can take a lot of time for a large number of points, and will recalculate differently each time you change the plot size. A faster alternative would be to use ggbeeswarm::geom_quasirandom, which uses a deterministic process to define jitter that looks random.
ggplot(df, aes(x,y, color = col)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE)
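To illustrate what "strategically placed offsets" can mean, here is a rough base R sketch of the stacking idea behind dot plots: bin the values, then spread the points in each bin symmetrically around zero. `stack_offsets` is a made-up helper, not a function from ggplot2 or ggbeeswarm, and the real packages use more refined placement rules.

```r
# Hypothetical helper: deterministic offsets for points that share a bin.
stack_offsets <- function(x, binwidth) {
  bin <- floor(x / binwidth)
  # position of each point within its bin: 1, 2, 3, ...
  pos <- ave(seq_along(x), bin, FUN = seq_along)
  # number of points in the same bin
  n <- ave(seq_along(x), bin, FUN = length)
  # centre the stack on zero, e.g. -1, 0, 1 for three points
  pos - (n + 1) / 2
}

x <- c(1.0, 1.2, 1.4, 5.0, 9.1, 9.3)
offsets <- stack_offsets(x, binwidth = 1)
print(data.frame(x = x, offset = offsets))
```

The offsets can then be scaled and used as the second coordinate in geom_point(), which gives the same layout on every redraw.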
Using the following data:
Category <- c("Bankpass", "Bankpass", "Bankpass", "Moving", "Moving")
Subcategory <- c("Stolen", "Lost", "Login", "Address", "New contract")
Weight <- c(10,20,13,40,20)
Duration <- as.character(c(0.2,0.4,0.5,0.44,0.66))
Silence <- as.character(c(0.1,0.3,0.25,0.74,0.26))
df <- data.frame(Category, Subcategory, Weight, Duration, Silence)
Which I use to create the following mosaic plot:
library (ggplot2)
library (ggmosaic)
g <- ggplot(data = df) +
geom_mosaic(aes(weight = Weight, x = product(Category), fill = Duration),
offset = 0, na.rm = TRUE) +
theme(axis.text.x = element_text(angle = -25, hjust = .1)) +
theme(axis.title.x = element_blank()) +
scale_fill_manual(values = c("#e8f5e9", "#c8e6c9", "#a5d6a7", "#81c784", "#66bb6a"))
This works; however, I would like to include text labels on the elements of the graph (showing e.g. Stolen, Lost, etc.).
However, when I do:
g + geom_text(x = Category, y = Subcategory, label = Weight)
I get the following error:
Error in UseMethod("rescale") : no applicable method for 'rescale' applied to an object of class "character"
Any thoughts on what goes wrong here?
Here is my attempt. The x-axis is a discrete variable (i.e., Category), so you cannot use it directly in geom_text(). You need numeric positions on the x-axis, and likewise numeric positions on the y-axis for the labels. To get numeric values for both dimensions, I accessed the data frame that sits behind your graphic. With the ggmosaic package there is one such data frame in this case, and you can get it using ggplot_build(). You can calculate x and y values from its columns (e.g., xmin and xmax). That is the good news. The bad news is that this data frame contains no information about Subcategory, which you need for the labels.
We can overcome this challenge joining the data frame above with the original data. When I joined the data, I calculated proportion for both the original data and the other data. The values are purposely converted to character. temp is the data set you need in order to add labels.
library(dplyr)
library(ggplot2)
library(ggmosaic)
# Add proportion for each and convert to character for join
df <- group_by(df, Category) %>%
mutate(prop = as.character(round(Weight / sum(Weight),3)))
# Add proportion for each and convert to character.
# Get x and y values for positions
# Use prop for join
temp <- ggplot_build(g)$data %>%
as.data.frame %>%
transmute(prop = as.character(round(ymax - ymin, 3)),
x.position = (xmax + xmin) / 2,
y.position = (ymax + ymin) / 2) %>%
right_join(df)
g + geom_text(x = temp$x.position, y = temp$y.position, label = temp$Subcategory)
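The join step can be tried in isolation with base R. In this sketch, `rects` is a hypothetical stand-in for the rectangle table you get from ggplot_build(g)$data (I simply stack the tiles from 0 to 1 within each category and shuffle the rows), so it shows the matching logic only, not ggmosaic's actual output.

```r
df <- data.frame(
  Category    = c("Bankpass", "Bankpass", "Bankpass", "Moving", "Moving"),
  Subcategory = c("Stolen", "Lost", "Login", "Address", "New contract"),
  Weight      = c(10, 20, 13, 40, 20)
)

# Share of each subcategory within its category, rounded and stored as text
share <- df$Weight / ave(df$Weight, df$Category, FUN = sum)
df$prop <- as.character(round(share, 3))

# Hypothetical rectangle table: tiles stacked within each category,
# rows deliberately shuffled so the join has to do the matching.
ymax  <- ave(share, df$Category, FUN = cumsum)
rects <- data.frame(ymin = ymax - share, ymax = ymax)[c(4, 1, 5, 3, 2), ]
rects$prop       <- as.character(round(rects$ymax - rects$ymin, 3))
rects$y.position <- (rects$ymax + rects$ymin) / 2

# The join key must be unique, otherwise labels land on the wrong tiles
stopifnot(anyDuplicated(df$prop) == 0)

labels <- merge(rects, df, by = "prop")
print(labels[, c("Subcategory", "prop", "y.position")])
```

Note the weak spot of this approach: if two tiles happen to have the same rounded proportion, the key is no longer unique and the join becomes ambiguous.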
I think you are looking for something like this
library(ggplot2)
library(ggmosaic)
Your data:
Category <- c("Bankpass", "Bankpass", "Bankpass", "Moving", "Moving")
Subcategory <- c("Stolen", "Lost", "Login", "Address", "New contract")
Weight <- c(10,20,13,40,20)
Duration <- as.character(c(0.2,0.4,0.5,0.44,0.66))
Silence <- as.character(c(0.1,0.3,0.25,0.74,0.26))
mydf <- data.frame(Category, Subcategory, Weight, Duration, Silence)
ggplot(data = mydf) +
geom_mosaic(aes( x = product(Duration, Subcategory), fill=factor(Duration)), na.rm=TRUE) +
theme(axis.text.x=element_text(angle=-25, hjust= .1)) +
labs(x="Subcategory", title='f(Duration, Subcategory | Category)') +
facet_grid(Category~.) +
guides(fill=guide_legend(title = "Duration", reverse = TRUE))
The output is:
It is almost the best you can do on ggmosaic package. You should try other packages.
Good luck for your project work ;-)
I've been able to successfully create a dot plot in ggplot for percentages across gender. But I want to highlight the significant differences. I thought I could do this with a combination of subsetting and the use of last_plot().
Here’s my data:
require(ggplot2)
require(reshape2)
prog <- c("Honors", "Academic", "Social", "Media")
m <- c(30,35,40,23)
f <- c(25,40,45,15)
s <- c(0.7, 0.4, 0.1, 0.03)
temp <- as.data.frame(cbind(prog, m, f, s), stringsAsFactors=FALSE)
first <- temp[,1:3]
first.melt <- melt(first, id.vars = 'prog', variable.name = 'Gender', value.name = 'Percent')
first.melt <- as.data.frame(cbind(first.melt,temp[,4]), stringsAsFactors=FALSE)
names(first.melt) <- c("program", "Gender", "Percent", "sig")
first.melt$program <- as.factor(first.melt$program)
Here’s where I reverse the order of my Program variable, so that when graphed it will be alphabetical from top to bottom.
first.melt[,1] = with(first.melt, factor(first.melt[,1], levels = rev(levels(first.melt[,1]))))
first.melt$sig <- as.numeric(as.character(first.melt$sig))
first.melt$Percent <- as.numeric(as.character(first.melt$Percent))
Now, I subset...
first.melt.ns <- subset(first.melt,sig > 0.05)
first.melt.sig <- subset(first.melt,sig <= 0.05)
ggplot(first.melt.ns, aes(program, y=Percent, shape=Gender)) +
geom_point(size=3) +
coord_flip() +
scale_shape_manual(values=c("m"=1, "f"=5))
The first run at ggplot gets me my non-significant Program pairs, and it’s in the right order, so I add the two new points for male and female (making them solid, to draw attention to them as a significant pair):
last_plot() +
geom_point(data=first.melt.sig, aes(program[Gender=="m"], y=Percent[Gender=="m"]), size=3, shape=19) +
geom_point(data=first.melt.sig, aes(program[Gender=="f"], y=Percent[Gender=="f"]),size=4, shape=18)
The points get added just fine – ggplot works. But notice my Program axis – it’s correct, but reversed now.
First, you really should avoid as.data.frame(cbind(...)). It is dramatically increasing the amount of work necessary to prepare your data. The function for creating data frames is (naturally) data.frame. Use it!
What you're doing here is basically trying to get around the limitation of only having one shape scale. It's probably easiest to just do this:
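# stringsAsFactors = TRUE so that levels() below has factor levels to reverse
# (the default became FALSE in R 4.0)
temp <- data.frame(prog, m, f, s, stringsAsFactors = TRUE)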
first <- temp[,1:3]
first.melt <- melt(first, id.vars = 'prog', variable.name = 'Gender', value.name = 'Percent')
first.melt$sig <- rep(temp$s,times = 2)
first.melt[,1] = with(first.melt, factor(first.melt[,1], levels = rev(levels(first.melt[,1]))))
first.melt.sig <- subset(first.melt,sig < 0.05)
first.melt$Percent[first.melt$sig < 0.05] <- NA
ggplot() +
geom_point(data = first.melt,aes(x = prog,y = Percent,shape = Gender),size = 3) +
geom_point(data = first.melt.sig[1,],aes(x = prog,y = Percent),shape = 19) +
geom_point(data = first.melt.sig[2,],aes(x = prog,y = Percent),shape = 18) +
coord_flip() +
scale_shape_manual(values=c("m"=1, "f"=5))
In general, work to structure your ggplot code so that you're subsetting data frames, not variables inside of aes. That gets both tricky and dangerous, because ggplot is assuming certain things about what you pass inside of aes in order for the evaluation to work properly.
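Another way to stay within a single shape scale is to encode gender and significance together in one column and map that to shape. This is just a sketch of the idea: the column name `marker` and the rebuilt data frame `d` are my own, using the numbers from the question.

```r
library(ggplot2)

d <- data.frame(
  prog    = rep(c("Honors", "Academic", "Social", "Media"), times = 2),
  Gender  = rep(c("m", "f"), each = 4),
  Percent = c(30, 35, 40, 23, 25, 40, 45, 15),
  sig     = rep(c(0.7, 0.4, 0.1, 0.03), times = 2)
)

# Reverse alphabetical levels so the flipped axis reads A-Z from top to bottom
d$prog <- factor(d$prog, levels = rev(sort(unique(d$prog))))

# One combined key per point, e.g. "m.ns" or "f.sig"
d$marker <- paste(d$Gender, ifelse(d$sig <= 0.05, "sig", "ns"), sep = ".")

p <- ggplot(d, aes(x = prog, y = Percent, shape = marker)) +
  geom_point(size = 3) +
  coord_flip() +
  scale_shape_manual(values = c(m.ns = 1, f.ns = 5, m.sig = 19, f.sig = 18))
```

Since everything lives in one data frame with one factor for the axis, the axis order cannot change between layers.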
I have data in the following format:
        Date Year Month Day Flow
1 1953-10-01 1953    10   1  530
2 1953-10-02 1953    10   2  530
3 1953-10-03 1953    10   3  530
I would like to create a graph like this:
Here is my current image and code:
library(ggplot2)
library(plyr)
library(reshape2)
library(scales)
library(lubridate)  # provides month() and day(), which are used below
## Read Data
df <- read.csv("Salt River Flow.csv")
## Convert Date column to R-recognized dates
df$Date <- as.Date(df$Date, "%m/%d/%Y")
## Finds Water Years (Oct - Sept)
df$WY <- as.POSIXlt(as.POSIXlt(df$Date)+7948800)$year+1900
## Normalizes Water Years so stats can be applied to just months and days
df$w <- ifelse(month(df$Date) %in% c(10,11,12), 1903, 1904)
##Creates New Date (dat) Column
df$dat <- as.Date(paste(df$w,month(df$Date),day(df$Date), sep = "-"))
## Creates new data frame with summarised data by MonthDay
PlotData <- ddply(df, .(dat), summarise,
  Min = min(Flow),
  Tenth = quantile(Flow, probs = 0.10),  # was 0.05, which is not the tenth percentile
  TwentyFifth = quantile(Flow, probs = 0.25),
  Median = quantile(Flow, probs = 0.50),
  Mean = mean(Flow),
  SeventyFifth = quantile(Flow, probs = 0.75),
  Ninetieth = quantile(Flow, probs = 0.90),
  Max = max(Flow))
## Melts data so it can be plotted with ggplot
m <- melt(PlotData, id="dat")
## Plots
p <- ggplot(m, aes(x = dat)) +
geom_ribbon(aes(min = TwentyFifth, max = Median), data = PlotData, fill = alpha("black", 0.1), color = NA) +
geom_ribbon(aes(min = Median, max = SeventyFifth), data = PlotData, fill = alpha("black", 0.5), color = NA) +
scale_x_date(labels = date_format("%b"), breaks = date_breaks("month"), expand = c(0,0)) +
geom_line(data = subset(m, variable == "Mean"), aes(y = value), size = 1.2) +
theme_bw() +
geom_line(data = subset(m, variable %in% c("Min","Max")), aes(y = value, group = variable)) +
geom_line(data = subset(m, variable %in% c("Ninetieth","Tenth")), aes(y = value, group = variable), linetype = 2) +
labs(x = "Water Year", y = "Flow (cfs)")
p
I am very close but there are some issues I'm having. First, if you can see a way to improve my code, please let me know. The main problem I ran into was that I needed two dataframes to make this graph: one melted, and one not. The unmelted dataframe was necessary (I think) to create the ribbons. I tried many ways to use the melted dataframe for the ribbons, but there was always a problem with the aesthetic length.
Second, I know that to get a legend (and I want one) I need to put something in the aesthetics of each line/ribbon, but I am having trouble getting that to work. I think it would involve scale_fill_manual.
Third, and I don't know if this is possible, I would like to have each month label in between the tick marks, not on them (like in the above image).
Any help is greatly appreciated (especially with creating more efficient code).
Thank you.
Something along these lines might get you close with base:
library(lubridate)
library(reshape2)
# simulating data...
Date <- seq(as.Date("1953-10-01"),as.Date("2010-10-01"),by="day")
Year <- year(Date)
Month <- month(Date)
Day <- day(Date)
set.seed(1)
Flow <- rpois(length(Date), 2000)
Data <- data.frame(Date=Date,Year=Year,Month=Month,Day=Day,Flow=Flow)
# use acast to get it in a convenient shape:
PlotData <- acast(Data,Year~Month+Day,value.var="Flow")
# apply for quantiles
Quantiles <- apply(PlotData,2,function(x){
quantile(x,probs=c(1,.9,.75,.5,.25,.1,0),na.rm=TRUE)
})
Mean <- colMeans(PlotData, na.rm=TRUE)
# ugly way to get month tick separators
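# convert the month part to integer so table() counts months in 1..12 order
# rather than the character order "1", "10", "11", "12", "2", ...
MonthTicks <- cumsum(table(as.integer(unlist(lapply(strsplit(names(Mean),split="_"),"[[",1)))))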
# and finally your question:
plot(1:366,seq(0,max(Flow),length=366),type="n",xlab = "Water Year",ylab="Discharge",axes=FALSE)
polygon(c(1:366,366:1),c(Quantiles["50%",],rev(Quantiles["75%",])),border=NA,col=gray(.6))
polygon(c(1:366,366:1),c(Quantiles["50%",],rev(Quantiles["25%",])),border=NA,col=gray(.4))
lines(1:366,Quantiles["90%",], col = gray(.5), lty=4)
lines(1:366,Quantiles["10%",], col = gray(.5))
lines(1:366,Quantiles["100%",], col = gray(.7))
lines(1:366,Quantiles["0%",], col = gray(.7), lty=4)
lines(1:366,Mean,lwd=3)
axis(1,at=MonthTicks, labels=NA)
text(MonthTicks-15,-100,1:12,pos=1,xpd=TRUE)
axis(2)
The plotting code really isn't that tricky. You'll need to clean up the aesthetics, but polygon() is usually my strategy for shaded regions in plots (confidence bands, whatever).
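In case the polygon() trick is unfamiliar: the band is drawn by walking along one curve from left to right and back along the other from right to left, so the vertices form a closed outline. A minimal sketch with made-up curves (`lower` and `upper` are arbitrary, not flow data):

```r
x     <- 1:50
lower <- sin(x / 8) - 0.5
upper <- sin(x / 8) + 0.5

# Empty plot with axes wide enough for both curves
plot(x, upper, type = "n", ylim = range(lower, upper),
     xlab = "x", ylab = "value")

# Outline: lower curve left to right, then upper curve right to left
band_x <- c(x, rev(x))
band_y <- c(lower, rev(upper))
polygon(band_x, band_y, border = NA, col = gray(0.7))

# Centre line drawn on top of the band
lines(x, (lower + upper) / 2, lwd = 2)
```

The x and y vectors passed to polygon() must have the same length, which is why both are built as one forward pass plus one reversed pass.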
Perhaps this will get you closer to what you're looking for, using ggplot2 and plyr:
library(ggplot2)
library(plyr)
library(lubridate)
library(scales)
df$MonthDay <- df$Date - years( year(df$Date) + 100 ) #Normalize points to same year
df <- ddply(df, .(Month, Day), mutate, MaxDayFlow = max(Flow) ) #Max flow on day
df <- ddply(df, .(Month, Day), mutate, MinDayFlow = min(Flow) ) #Min flow on day
p <- ggplot(df, aes(x=MonthDay) ) +
geom_smooth(size=2,level=.8,color="black",aes(y=Flow)) + #80% conf. interval
geom_smooth(size=2,level=.5,color="black",aes(y=Flow)) + #50% conf. interval
geom_line( linetype="longdash", aes(y=MaxDayFlow) ) +
geom_line( linetype="longdash", aes(y=MinDayFlow) ) +
labs(x="Month",y="Flow") +
scale_x_date( labels = date_format("%b") ) +
theme_bw()
Edit: Fixed X scale and X scale label
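For the "normalise to one year" step, here is a base R sketch that avoids the seconds offset and works from the month directly. It assumes water years run October to September, as in the question; `water_year` and `ref_date` are hypothetical helper names, and 1903/1904 are the reference years the question already uses (1904 being a leap year, so 29 February survives).

```r
# Water year: October to December belong to the following calendar year.
water_year <- function(d) {
  lt <- as.POSIXlt(d)
  # lt$mon is zero-based, so 9 means October
  lt$year + 1900L + (lt$mon >= 9L)
}

# Map every date onto a single reference water year (Oct 1903 to Sep 1904).
ref_date <- function(d) {
  lt <- as.POSIXlt(d)
  yr <- ifelse(lt$mon >= 9L, 1903L, 1904L)
  as.Date(sprintf("%d-%02d-%02d", yr, lt$mon + 1L, lt$mday))
}

dates <- as.Date(c("1953-10-01", "1954-09-30", "1960-02-29"))
print(data.frame(date = dates, WY = water_year(dates), ref = ref_date(dates)))
```

Grouping by the reference date then puts all years on one October to September axis, which is what the summary statistics need.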
(Partial answer with base plotting functions, not including the min, max, or mean.) I suspect you will need to construct a dataset before passing it to ggplot, since that is typical for that function. I already do something similar and then pass the resulting matrix to matplot. (It doesn't do that kewl highlighting, but maybe ggplot can do it.)
HDL.mon.mat <- aggregate(dfrm$Flow,
list( dfrm$Year + dfrm$Month/12),
quantile, prob=c(0.1,0.25,0.5,0.75, 0.9), na.rm=TRUE)
matplot(HDL.mon.mat[,1], HDL.mon.mat$x, type="pl")