Label line in qplot - r

I have a qplot that is showing 5 different groupings (denoted with colour = type) with two dependent variables each. The command looks like this:
qplot(data = data, x = day, y = var1, geom = "line", colour = type) +
geom_line(aes(y = var2, colour = value))
I'd like to label the two different lines so that I can tell which five represent var1 and which five represent var2.
How do I do this?

You can convert the data to a "tall" format, with melt, and use another aesthetic, such as the line type, to distinguish the variables.
# Sample data
n <- 100
k <- 5
d <- data.frame(
day = rep(1:n,k),
type = factor(rep(1:k, each=n)),
var1 = as.vector( replicate(k, cumsum(rnorm(n))) ),
var2 = as.vector( replicate(k, cumsum(rnorm(n))) )
)
# Reshape the data to long ("tall") format
library(reshape2)
d <- melt(d, id.vars=c("day","type"))
# Plot
library(ggplot2)
ggplot(d) + geom_line(aes(x=day, y=value, colour=type, linetype=variable))
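To see what the melt step does to the data, here is a minimal base R sketch of the same wide-to-tall reshaping. The tiny data frame and the name `tall` are made up for illustration; they are not part of the answer above.

```r
# Tiny made-up example: 2 types, 3 days, two measured variables.
d <- data.frame(
  day  = rep(1:3, times = 2),
  type = factor(rep(1:2, each = 3)),
  var1 = c(1, 2, 3, 4, 5, 6),
  var2 = c(10, 20, 30, 40, 50, 60)
)

# Stack var1 and var2 on top of each other and record which
# original column each row came from.
tall <- rbind(
  data.frame(d[c("day", "type")], variable = "var1", value = d$var1),
  data.frame(d[c("day", "type")], variable = "var2", value = d$var2)
)
tall$variable <- factor(tall$variable)

str(tall)
```

Each original row now appears twice, once per variable, which is what lets a single aesthetic such as linetype tell var1 and var2 apart.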

Related

matching of shape, color and legend in bubble plot with subset of variable

I have some data
library(data.table)
wide <- data.table(id=c("A","C","B"), var1=c(1,6,1), var2=c(2,6,5), size1=c(11,12,13), size2=c(10,12,10), flag=c(FALSE,TRUE,FALSE))
> wide
id var1 var2 size1 size2 flag
1: A 1 2 11 10 FALSE
2: C 6 6 12 12 TRUE
3: B 1 5 13 10 FALSE
which I would like to plot as bubble plots where id is ordered by var2, and bubbles are as follows:
ID A and B: var1 is plotted in size1 and "empty bubbles" and var2 is plotted in size2 with "filled" bubbles.
ID C is flagged because there is only one value (this is why var1=var2) and it should have a "filled bubble" of a different color.
I have tried this as follows:
cols <- c("v1"="blue", "v2"="red", "flags"="green")
shapes <- c("v1"=16, "v2"=21, "flags"=16)
p1 <- ggplot(data = wide, aes(x = reorder(id,var2))) + scale_size_continuous(range=c(5,15))
p1 <- p1 + geom_point(aes(size=size1, y = var1, color = "v1", shape = "v1"))
p1 <- p1 + geom_point(aes(size=size2, y = var2, color = "v2", shape = "v2", stroke=1.5))
p1 <- p1 + geom_point(data=subset(wide,flag), aes(size=size2[flag], y=var2[flag], color= "flags", shape="flags"))
p1 <- p1 + scale_color_manual(name = "test",
values = cols,
labels = c("v1", "v2", "flags"))
p1 <- p1 + scale_shape_manual(name = "test",
values = shapes,
labels = c("v1", "v2", "flags"))
which gives (in my theme)
but two questions remain:
What happened to the order in the legend? I have followed the recipe of the bottom solution in Two geom_points add a legend but somehow the order does not match.
How to get rid of the stroke around the green bubble and why is it there?
Overall, something appears to go wrong in matching shape and color.
I admit, it took me a while to understand your slightly convoluted plot. Forgive me, but I have allowed myself to change the way to plot, and make (better?) use of ggplot.
The data shape is less than ideal. ggplot works extremely well with long data.
It took a bit of guesswork to reshape your data, so I went the quick and dirty way and simply bound the rows from selected columns.
Now you can see, that you can achieve the new plot with a single call to geom_point. The rest is "scale_aesthetic" magic...
In order to combine the shape and color legend, safest is to use override.aes. But beware! It does not take named vectors, so the order of the values needs to be in the exact order given by your legend keys - which is usually alphabetic, if you don't have the factor levels defined.
update re: request to order x labels
This hugely depends on the actual data structure. If it is originally as you have presented, I'd first make id a factor with the levels ordered based on your var2. Then, do the data shaping.
library(tidyverse)
# data reshape
wide <- data.frame(id=c("A","C","B"), var1=c(1,6,1), var2=c(2,6,5), size1=c(11,12,13), size2=c(10,12,10), flag=c(FALSE,TRUE,FALSE))
wide <- wide %>% mutate(id = reorder(id, var2))
wide1 <- wide %>% filter(!flag) %>% select(id, var = var1, size = size1)
wide2 <- wide %>% filter(!flag) %>% select(id, var = var2, size = size2)
# flagged rows have var1 == var2, so either column gives the value to plot
wide3 <- wide %>% filter(flag) %>% select(id, var = var2, size = size2)
long <- bind_rows(list(v1 = wide1, v2 = wide2, flag = wide3), .id = "var_id")
# rearrange the vectors for scales aesthetic
cols <- c(flag="green", v1 ="blue", v2="red" )
shapes <- c(flag=16, v1=16, v2 =21 )
ggplot(data = long, aes(x = id, y = var)) +
geom_point(aes(size=size, shape = var_id, color = var_id), stroke=1.5) +
scale_size_continuous(limits = c(5,15),breaks = seq(5,15,5)) +
scale_shape_manual(name = "test", values = shapes) +
scale_color_manual(values = cols, guide = "none") +
guides(shape = guide_legend(override.aes = list(color = cols)))
P.S. the reason for the red stroke around the green bubble in your plot is that you also plotted the 'var2' behind your flag.
Created on 2020-04-08 by the reprex package (v0.3.0)
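The caveat about override.aes ignoring names can be checked without drawing anything. The sketch below, in base R only, shows the order in which the legend keys end up when no factor levels are set, and how to put a named colour vector into that order. The variable names `keys`, `legend_order` and `cols_for_override` are illustrative assumptions.

```r
# Keys as they appear in the var_id column (plain character, no levels set).
keys <- c("v1", "v2", "flag")

# Converting to a factor sorts the unique values; discrete legends
# follow this order by default.
legend_order <- levels(factor(keys))

# Named colour vector in whatever order it was typed.
cols <- c(v1 = "blue", v2 = "red", flag = "green")

# Reorder by name, then drop the names: this is the positional
# vector that override.aes needs.
cols_for_override <- unname(cols[legend_order])

print(legend_order)
print(cols_for_override)
```

If you would rather control the order yourself, set the factor levels of the key column explicitly and build the vector from those levels in the same way.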

ggplot2: multiple variables on x-axis at multiple times

I have a data frame for observation numbers (3 observations for same id), height, weight and fev that looks like this (just for example):
id obs height weight fev
1 1 160 80 90
1 2 150 70 85
1 3 155 76 87
2 1 140 67 91
2 2 189 78 71
2 3 178 86 89
I need to plot this data using ggplot2 such that on the x-axis there are 3 variables (height, weight, fev), and the observation numbers are displayed as 3 vertical lines for each variable (color coded), where each line shows a median as a solid circle, and the 25th and 75th percentiles as caps at the upper and lower ends of the line (no minimum or maximum needed). I have tried many variations of box plots so far, but I am not even getting close. Any suggestions on how to approach or solve this?
Thanks
OK, what I did instead below was make three graphs and then piece them together with gridExtra. Read more about the package here: http://www.sthda.com/english/wiki/wiki.php?id_contents=7930
I took the common legend code from this site to produce the following, starting with our existing longdf2. By piecing together the graphs, the information about corresponding observation is within the title of the graph
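library(reshape2) # melt
library(dplyr)    # %>% and filter
library(ggplot2)  # median_hilow additionally needs the Hmisc package installed
id <- rep(1:12, each = 3)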
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly)
newvars <- melt(df[-2],id.vars = 'id')
longdf2 <- cbind(obsonly,newvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
#Make graph 1 of observation 1
g1 <- longdf2 %>%
dplyr::filter(obsnum == 1) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 1") +
theme(plot.title = element_text(hjust = 0.5)) #has a legend
g2 <- longdf2 %>%
dplyr::filter(obsnum == 2) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 2") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
#specified as none to make common legend at end
g3 <- longdf2 %>%
dplyr::filter(obsnum == 3) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 3") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
library(gridExtra)
get_legend<-function(myggplot){
tmp <- ggplot_gtable(ggplot_build(myggplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
# Save legend
legend <- get_legend(g1)
# Remove legend from 1st graph
g1 <- g1 + theme(legend.position = 'none')
# Combine graphs
grid.arrange(g1, g2, g3, legend, ncol=4, widths=c(2.3, 2.3, 2.3, 0.8))
Plenty of other little tweaks you could make along the way
Try putting the data into long format prior to graphing. I generated some more data, 12 subjects, each with 3 observations.
id <- rep(1:12, each = 3)
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
library(reshape2) #use to melt data from wide to long format
longdf <- melt(df,id.vars = c('id', 'obs'))
You don't need to define measure variables here: since the id.vars are defined, the remaining non-id columns automatically default to measure variables. If you have more variables in your data set, you'll want to define the measure variables in that same line as: measure.vars = c("height", "weight", "fev")
longdf <- melt(df,id.vars = c('id', 'obs'), measure.vars = c("height", "weight", "fev"))
Apologies, haven't earned enough votes to put figures into my responses
ggplot(data = longdf, aes(x = variable, y = value, fill = factor(obs))) +
geom_boxplot(notch = TRUE, notchwidth = .25, width = .25, position = position_dodge(.5))
This does not produce the exact graph you described, which sounded like geom_linerange or something similar; those geoms require an x, ymin, and ymax to draw. Otherwise a regular ol' boxplot has your 1st and 3rd quartiles and median marked. I adjusted the boxplot parameters to make it thinner, with notches and narrower widths, and separated the boxes slightly with position_dodge(.5)
after reading your response, I edited my original answer
You could try facet_wrap, and watch the switch from "fill" to "color" in ggplot. A geom with no interior to fill (unlike a boxplot or a density), such as the point range drawn by stat_summary, has to be "colored" instead. Use color instead of fill in the original aes()
ggplot(data = longdf, aes(x = variable, y = value, color = factor(obs))) +
stat_summary(fun.data=median_hilow) + facet_wrap(.~obs)
This gives you observation 1 - height, weight, fev side by side, observation 2- height, ....
If that still isn't what you want perhaps more like height observation 1,2,3; weight observation 1,2,3...then you'll need to modify your melting to have two variable and two value columns. Essentially make two melted dataframes, then cbind. Annnnd because each observation has three variables, you'll need to rbind to make sure both data frames have the same number of rows:
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly) #making rows equal
longvars <- melt(df[-2],id.vars = 'id') #dropping obs from melt
longdf2 <- cbind(obsonly,longvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
ggplot(data = longdf2, aes(x = obsnum, y = value,
color = factor(variable))) +
stat_summary(fun.data=median_hilow) +
facet_wrap(.~variable)
From here you can play around with the x axis marks (probably isn't useful to have a 1.5 observation marked) and the spacing of the lines from each other
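One detail worth checking against the original request: median_hilow defaults to conf.int = 0.95, so its line ends sit at the 2.5th and 97.5th percentiles rather than the 25th and 75th. The sketch below is one possible way to get quartile caps in a single panel, with the three observations dodged side by side within each variable. The helper `median_iqr` is a hypothetical name introduced here, and the data is the same synthetic sequence as above, built directly in long form.

```r
library(ggplot2)

# Synthetic data as in the answer above, built directly in long form.
n <- 36
longdf <- data.frame(
  obs      = factor(rep(rep(1:3, 12), times = 3)),
  variable = factor(rep(c("height", "weight", "fev"), each = n),
                    levels = c("height", "weight", "fev")),
  value    = c(seq(140, 189, length.out = n),
               seq(67, 86, length.out = n),
               seq(71, 91, length.out = n))
)

# Hypothetical helper: median plus 25th/75th percentiles, returned in
# the y / ymin / ymax layout that stat_summary(fun.data = ...) expects.
median_iqr <- function(y) {
  q <- unname(quantile(y, probs = c(0.25, 0.5, 0.75)))
  data.frame(ymin = q[1], y = q[2], ymax = q[3])
}

dodge <- position_dodge(width = 0.5)

p <- ggplot(longdf, aes(x = variable, y = value, colour = obs)) +
  # vertical line with caps at the quartiles
  stat_summary(fun.data = median_iqr, geom = "errorbar",
               width = 0.2, position = dodge) +
  # solid circle at the median
  stat_summary(fun.data = median_iqr, geom = "point",
               size = 3, position = dodge)
p
```

If you prefer to keep median_hilow (which relies on the Hmisc package), passing fun.args = list(conf.int = 0.5) to stat_summary should give the same quartile limits.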

creating histogram bins in r

I have this code.
a = c("a", 1)
b = c("b",2)
c = c('c',3)
d = c('d',4)
e = c('e',5)
z = data.frame(a,b,c,d,e)
hist = hist(as.numeric(z[2,]))
I am trying to have a histogram such that the bins would be a,b,c,d,e
and the freq values would be 1,2,3,4,5.
However, it gives me an empty screen(no bins at all for histogram model)
You are plotting the factor levels of each column for row 2, which is in this case always 1.
When creating the data frame, add stringsAsFactors=FALSE to avoid converting the strings to factors (this is the default since R 4.0.0). This should work:
z = data.frame(a,b,c,d,e,stringsAsFactors=FALSE)
hist(as.numeric(z[2,]))
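The factor pitfall described above is easy to reproduce in isolation. This is a small illustrative sketch with made-up values: as.numeric() on a factor returns the internal level codes, not the numbers the labels spell out.

```r
# A factor whose labels look like numbers.
x <- factor(c("10", "20", "20"))

# Level codes: positions in levels(x), not the values themselves.
codes <- as.numeric(x)

# Going through character first recovers the actual numbers.
values <- as.numeric(as.character(x))

print(codes)
print(values)
```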
Perhaps this would work for you: it creates a data frame with the x elements being the letters a through 'e', and the y elements being the numbers 1 through 5. It then renders a histogram and tells ggplot not to perform any binning.
library(ggplot2)
tmp <- data.frame(x = letters[1:5], y = 1:5)
ggplot(tmp, aes(x = x, y = y)) + geom_histogram(stat = "identity")
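Since the heights are already known and nothing needs to be binned, a bar chart is arguably the closer match to what was asked. A minimal base R sketch, assuming the counts are stored in a named vector (the name `counts` is illustrative):

```r
# Named vector: names become the bar labels, values the bar heights.
counts <- c(a = 1, b = 2, c = 3, d = 4, e = 5)

# barplot() draws one bar per element and returns the bar midpoints.
mids <- barplot(counts, xlab = "Category", ylab = "Frequency")
```

hist() is meant for raw observations that still have to be counted; here the counting has already been done.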

Add shape at the start and end of lines, and at some interval along the lines, defined by a grouping variable

that's my df (almost 100,000 rows and 10 ID values)
Date.time P ID
1 2013-07-03 12:10:00 1114.3 J9335
2 2013-07-03 12:20:00 1114.5 K0904
3 2013-07-03 12:30:00 1114.3 K0904
4 2013-07-03 12:40:00 1114.1 K1136
5 2013-07-03 12:50:00 1114.1 K1148
............
With ggplot I create this graph:
ggplot(df) + geom_line(aes(Date.time, P, group=ID, colour=ID))
No problem with this graph. But since I also have to print it in b/w, separating the IDs by colour is not a smart choice.
I tried to distinguish the IDs by line type, but the result is not so exciting.
So my idea is to add a different symbol at the beginning and at the end of every line, so that the different IDs can also be identified on a b/w print.
I add the lines:
geom_point(data=df, aes(x=min(Date.time), y=P, shape=ID))+
geom_point(data=df, aes(x=max(Date.time), y=P, shape=ID))
But an error occurs.
Any suggestions?
Given that every line is composed by around 5000 or 10000 values it's impossible to plot the values as different characters. A solution could be to plot the lines and then plot the point as different symbol for every ID divided into breaks (for example one character every 500 values). Is it possible to do that?
What about adding the geom_points using a subset of your data with only the min-max time values?
# some data
df <- data.frame(
ID = rep(c("a", "b"), each = 4),
Date.time = rep(seq(Sys.time(), by = "hour", length.out = 4), 2),
P = sample(1:10, 8))
df
# create a subset with min and max time values
# if min(x) and max(x) is the same for each ID:
df_minmax <- subset(x= df, subset = Date.time == min(Date.time) | Date.time == max(Date.time))
# if min(x) and max(x) may differ between ID,
# calculate min and max values *per* ID
# Here I use ddply, but several other aggregating functions in base R will do as well.
library(plyr)
df_minmax <- ddply(.data = df, .variables = .(ID), subset,
Date.time == min(Date.time) | Date.time == max(Date.time))
gg <- ggplot(data = df, aes(x = Date.time, y = P)) +
geom_line(aes(group = ID, colour = ID)) +
geom_point(data = df_minmax, aes(shape = ID))
gg
If you wish to have some control over your shapes, have a look at ?scale_shape_discrete.
Edit following updated question
For each ID, add a shape to the line at some interval.
# create a slightly larger data set
df <- data.frame(
ID = rep(c("a", "b"), each = 100),
Date.time = rep(seq(Sys.time(), by = "day", length.out = 100), 2),
P = c(sample(1:10, 100, replace = TRUE), sample(11:20, 100, replace = TRUE)))
# for each ID:
# create a time sequence from min(time) to max(time), by some time step
# e.g. a week
df_gap <- ddply(.data = df, .variables = .(ID), summarize,
Date.time =
seq(from = min(Date.time), to = max(Date.time), by = "week"))
# add P from df to df_gap
df_gap <- merge(x = df_gap, y = df)
gg <- ggplot(data = df, aes(x = Date.time, y = P)) +
geom_line(aes(group = ID, colour = ID)) +
geom_point(data = df_gap, aes(shape = ID)) +
# if your gaps are not a multiple of the length of the data
# you may wish to add the max points as well
# (df_minmax must be recomputed from this larger df first, as shown above)
geom_point(data = df_minmax, aes(shape = ID))
gg
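The question also asks for a marker every fixed number of values (for example one every 500) rather than every fixed time step. A minimal base R sketch of that row-based variant follows; it assumes the rows are already sorted by time within each ID, and the names `every_n`, `row_in_group`, `group_size` and `df_marks` are made up for illustration.

```r
# Small stand-in data: 2 IDs with 10 time-ordered rows each.
df <- data.frame(
  ID = rep(c("a", "b"), each = 10),
  Date.time = rep(seq(as.POSIXct("2013-07-03 12:10:00", tz = "UTC"),
                      by = "10 min", length.out = 10), 2),
  P = c(1:10, 11:20)
)

every_n <- 4  # use 500 for the real data

# Position of each row within its ID, and the size of each ID group.
row_in_group <- ave(seq_len(nrow(df)), df$ID, FUN = seq_along)
group_size   <- ave(seq_len(nrow(df)), df$ID, FUN = length)

# Keep the 1st, (n+1)th, (2n+1)th ... row of each ID, plus the last row.
keep <- (row_in_group - 1) %% every_n == 0 | row_in_group == group_size
df_marks <- df[keep, ]

df_marks
```

df_marks can then be handed to the point layer in the same way as df_gap, e.g. geom_point(data = df_marks, aes(shape = ID)).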
The error stems from the fact that the single numeric value min(Date.time) doesn't match up in length with the vectors P or ID. Another problem might be that you're re-declaring your data variable even though you already have ggplot(df).
The solution that immediately comes to mind is to figure out what the row indexes are for your minimum and maximum dates. If they all share the same minimum and maximum time stamps then it's easy. Use the which() function to come up with an array of the row numbers you'll need.
min.index <- which(df$Date.time == min(df$Date.time))
max.index <- which(df$Date.time == max(df$Date.time))
Then use those arrays to subset the rows passed to each point layer (indexing inside aes() would leave vectors shorter than the data).
geom_point(data = df[min.index, ], aes(x=Date.time, y=P, shape=ID))+
geom_point(data = df[max.index, ], aes(x=Date.time, y=P, shape=ID))

multiple lines each based on a different dataframe in ggplot2 - automatic coloring and legend

Suppose I have the following data frames:
df1 = data.frame(c11 = c(1:5), c12 = c(1:5))
df2 = data.frame(c21 = c(1:5), c22 = (c(1:5))^0.5)
df3 = data.frame(c31 = c(1:5), c32 = (c(1:5))^2)
I want to plot these as lines in the same plot/panel. I can do this by
p <- ggplot() + geom_line(data=df1, aes(x=c11, y = c12)) +
geom_line(data=df2, aes(x=c21,y=c22)) +
geom_line(data=df3, aes(x=c31, c32))
All these will be black. If I want them in a different color, I can specify the color explicitly as an argument to geom_line(). My question is can I specify a list of a few colors, say 5 colors, such as, red, blue, green, orange, gray, and use that list so that I do not have to explicitly specify the colors as an argument to geom_line() in case of each line. If the plot p contains 2 geom_line() statements then it will color them red and blue respectively. If it contains 3 geom_line statements, it will color them red, blue and green. Finally, how can I specify the legend for these plots. Even if I can give the colors as a vector at the end of p that would be great. Please let me know if the question is not clear.
Thanks.
ggplot2 works best if you work with a melted data.frame that contains a different column to specify the different aesthetics. Melting is easier with common column names, so I'd start there. Here are the steps I'd take:
rename the columns
melt the data, which adds a new variable that we'll map to the colour aesthetic
define your colour vector
Specify the appropriate scale with scale_colour_manual
library(reshape2)
library(ggplot2)
names(df1) <- c("x", "y")
names(df2) <- c("x", "y")
names(df3) <- c("x", "y")
newData <- melt(list(df1 = df1, df2 = df2, df3 = df3), id.vars = "x")
#Specify your colour vector
cols <- c("red", "blue", "green", "orange", "gray")
#Plot data and specify the manual scale
ggplot(newData, aes(x, value, colour = L1)) +
geom_line() +
scale_colour_manual(values = cols)
Edited for clarity
The structure of newData:
'data.frame': 15 obs. of 4 variables:
$ x : int 1 2 3 4 5 1 2 3 4 5 ...
$ variable: Factor w/ 1 level "y": 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 1 2 3 4 5 ...
$ L1 : chr "df1" "df1" "df1" "df1" ...
And the plot itself:
You don't have to melt, group or gather. It's pretty simple. Just add the color to each geom_line
library(tidyverse)
df1 = data.frame(c11 = c(1:5), c12 = c(1:5))
df2 = data.frame(c21 = c(1:5), c22 = (c(1:5))^0.5)
df3 = data.frame(c31 = c(1:5), c32 = (c(1:5))^2)
p <- ggplot() + geom_line(data=df1, aes(x=c11, y = c12), color= "red") +
geom_line(data=df2, aes(x=c21,y=c22), color = "blue") +
geom_line(data=df3, aes(x=c31, c32), color = "green")
p
These sorts of questions become much easier to solve if you adjust your thinking to the way that ggplot2 approaches graphics. ggplot2 is organized around the idea that everything that appears in your graph should (in principle) exist as a column in your data frame. (There are exceptions, of course, but this is the general idea.)
So your attempt to build this graph piece by piece, one line at a time, each coming from different data frames and then assigning colors to them is very un-ggplot2ish. If you want to label things in your graph with different colors, your first thought should always be:
How can I encode this color labeling information as a variable?
In this case, the solution is fairly simple. Simply rbind your three data frames together (you'll need to make sure the colnames match up first) and create a new column, say grp that has three levels corresponding to your three data frames:
dat <- rbind(df1,df2,df3)
dat$grp <- rep(factor(1:3),times = c(nrow(df1),nrow(df2),nrow(df3)))
and then map the variable grp to the aesthetic color in the ggplot call:
ggplot(data = dat, aes(x=...,y=...,colour = grp)) +
geom_line()
Finally, if you don't like the default colors, you can specify your own using scale_colour_manual:
+ scale_colour_manual(values = c('green','blue','grey'))
or you can use some nice 'pre-chosen' palettes from scale_colour_brewer.
EDIT: I fixed a typo above to ensure that grp is a factor. Here's my final version:
df1 = data.frame(c1 = c(1:5), c2 = c(1:5))
df2 = data.frame(c1 = c(1:5), c2 = (c(1:5))^0.5)
df3 = data.frame(c1 = c(1:5), c2 = (c(1:5))^2)
dat <- rbind(df1,df2,df3)
dat$grp <- rep(factor(1:3),times=c(nrow(df1),nrow(df2),nrow(df3)))
ggplot(data = dat, aes(x = c1, y = c2, colour = grp)) +
geom_line()
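To get closer to the "give a list of colours once and let the plot use as many as it needs" part of the question, the rbind approach can be generalised to a named list of data frames. This is only a sketch under a few assumptions: each data frame has its x values in the first column and y values in the second, there are no more data frames than colours, and the names `dfs`, `palette5` and `cols` are made up for illustration.

```r
library(ggplot2)

# The three data frames from the question, collected in a named list.
dfs <- list(
  df1 = data.frame(c11 = 1:5, c12 = 1:5),
  df2 = data.frame(c21 = 1:5, c22 = (1:5)^0.5),
  df3 = data.frame(c31 = 1:5, c32 = (1:5)^2)
)

# Give every data frame the same column names and tag each row
# with the name of the data frame it came from.
dat <- do.call(rbind, Map(
  function(d, name) data.frame(x = d[[1]], y = d[[2]], grp = name),
  dfs, names(dfs)
))
dat$grp <- factor(dat$grp, levels = names(dfs))

# Take as many colours from the palette as there are data frames,
# and name them so each group is pinned to its colour.
palette5 <- c("red", "blue", "green", "orange", "gray")
cols <- setNames(palette5[seq_along(dfs)], names(dfs))

p <- ggplot(dat, aes(x = x, y = y, colour = grp)) +
  geom_line() +
  scale_colour_manual(name = "Data frame", values = cols)
p
```

Adding a fourth data frame to the list would pick up "orange" without touching the plotting code, and the legend comes for free because the colour is mapped from a column.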

Resources