R, ggplot, separate mean by range of x value - r

I have a data set that looks like this:
  CHROM   POS GT DIFF
1 chr01 14653 CT  254
2 chr01 14907 AG  254
3 chr01 14930 AG   23
4 chr01 15190 GA  260
5 chr01 15211 TG   21
6 chr01 16378 TC 1167
POS ranges from 1xxxx to 1xxxxxxx.
CHROM is a categorical variable that takes the values "chr01" to "chr22" and "chrX".
I want to plot a scatterplot:
y (DIFF) vs. x (POS),
with panels separated by CHROM,
grouped by GT (a different colour for each GT).
I'm creating a ggplot with a running average (though this is not time-series data).
What I want is the average of DIFF for every 1,000,000-wide range of POS, by GT.
For example,
for x in the range 1 to 1,000,000, DIFF average = _____
for x in the range 1,000,001 to 2,000,000, DIFF average = _____
and I want to plot these averages as horizontal lines on the ggplot (coloured by GT).
What I have so far, before applying your function: [plot not shown]
After applying your function: [plot not shown]
I tried to apply your solution to what I already have, and ran into two problems:
The plot has several panels, so the mean values should differ from panel to panel, but when I apply your code the horizontal mean lines in every panel are identical to those in the first panel.
The panels have different x-axis ranges, so when I apply your function it fills the extra range with the previous horizontal mean line.
Here is my code before:
library(ggplot2)
library(scales) # trans_breaks() and pretty_breaks()
ggplot(data1, aes(x=POS,y=DIFF,colour=GT)) +
geom_point() +
facet_grid(~ CHROM,scales="free_x",space="free_x") +
theme(strip.text.x = element_text(size=40),
strip.background = element_rect(color='lightblue',fill='lightblue'),
legend.position="top",
legend.title = element_text(size=40,colour="darkblue"),
legend.text = element_text(size=40),
legend.key.size = unit(2.5, "cm")) +
guides(fill = guide_legend(title.position="top",
title = "Legend:GT='REF'+'ALT'"),
shape = guide_legend(override.aes=list(size=10))) +
scale_y_log10(breaks=trans_breaks("log10", function(x) 10^x, n=10)) +
scale_x_continuous(breaks = pretty_breaks(n=3))

This should get you started:
# It saves a lot of headaches to just make factors as you need them
options(stringsAsFactors = FALSE)
library(ggplot2)
library(plyr)
# Here's some made-up data - it always helps if you can post a subset of
# your real data, though. The dput() function is really useful for that.
dat <- data.frame(POS = seq(1, 1e7, by = 1e4))
# Add random GT value
dat$GT <- sample(x = c("CT", "AG", "GA", "TG", "TC"),
size = nrow(dat),
replace = TRUE)
# Group by millions - there are several ways to do this that I can
# never remember, but here's a simple way to split by millions
dat$POSgroup <- floor(dat$POS / 1e6)
# Add an arbitrary DIFF value
dat$DIFF <- rnorm(n = nrow(dat),
mean = 200 * dat$POSgroup,
sd = 300)
# Aggregate the data by GT and POS-group
# Ideally, you'd do this inside of the plot using stat_summary,
# but I couldn't get that to work. Using two datasets in a plot
# is okay, though.
datsum <- ddply(dat, .var = "POSgroup", .fun = function(x) {
# Calculate the mean DIFF value for each GT group in this POSgroup
meandiff <- ddply(x, .var = "GT", .fun = summarise, ymean = mean(DIFF))
# Add the center of the POSgroup range as the x position
meandiff$center <- (x$POSgroup[1] * 1e6) + 0.5e6
# Return the results
meandiff
})
# On the plot, these results will be grouped by both POS and GT - but
# ggplot will only accept one vector for grouping. So make a combination.
datsum$combogroup <- paste(datsum$GT, datsum$POSgroup)
# Plot it
ggplot() +
# First, a layer for the points themselves
# Large numbers of points can get pretty slow - you might try getting
# the plot to work with a subsample (~1000) and then add in the rest of
# your data
geom_point(data = dat,
aes(x = POS, y = DIFF, color = as.factor(GT))) +
# Then another layer for the means. There are a variety of geoms you could
# use here, but crossbar with ymin and ymax set to the group mean
# is a simple one
geom_crossbar(data = datsum, aes(x = center,
y = ymean,
# ymin and ymax both equal the mean, so the bar collapses to a line
ymin = ymean,
ymax = ymean,
color = as.factor(GT),
group = combogroup),
size = 1) +
# Some other niceties
scale_x_continuous(breaks = seq(0, 1e7, by = 1e6)) +
labs(x = "POS", y = "DIFF", color = "GT") +
theme_bw()
Which results in this:
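The two problems raised in the question's follow-up (identical mean lines in every panel, and lines running past a panel's data) both go away if the means are computed per CHROM as well as per GT and per 1,000,000-wide bin. Below is a minimal sketch of that idea, not a drop-in replacement for the answer above. The data is made up, and bin_width, binmeans, chrom_max, xstart and xend are names introduced here for illustration; it assumes a data frame with CHROM, POS, GT and DIFF columns.

```r
library(ggplot2)

# Made-up data with two chromosomes of different lengths (an assumption;
# substitute your own data frame with CHROM, POS, GT and DIFF columns)
set.seed(1)
dat <- data.frame(
  CHROM = rep(c("chr01", "chr02"), times = c(600, 400)),
  POS   = c(sort(sample(1e4:6e6, 600)), sort(sample(1e4:4e6, 400))),
  GT    = sample(c("CT", "AG", "GA"), 1000, replace = TRUE)
)
dat$DIFF <- rexp(nrow(dat), rate = 1 / 300) + 1  # positive, so log10 is safe

# Bin index: 0 for POS in [0, 1e6), 1 for [1e6, 2e6), ...
bin_width <- 1e6
dat$bin <- floor(dat$POS / bin_width)

# Mean DIFF for every CHROM x GT x bin combination that occurs in the data
binmeans <- aggregate(DIFF ~ CHROM + GT + bin, data = dat, FUN = mean)

# Each mean is drawn over its own bin, clipped to the largest POS seen on
# that chromosome so a line never runs past the panel's data
chrom_max <- tapply(dat$POS, dat$CHROM, max)
binmeans$xstart <- binmeans$bin * bin_width
binmeans$xend   <- pmin(binmeans$xstart + bin_width,
                        chrom_max[as.character(binmeans$CHROM)])

p <- ggplot(dat, aes(x = POS, y = DIFF, colour = GT)) +
  geom_point(alpha = 0.4) +
  geom_segment(data = binmeans,
               aes(x = xstart, xend = xend, y = DIFF, yend = DIFF),
               linewidth = 1) +
  facet_grid(~ CHROM, scales = "free_x", space = "free_x") +
  scale_y_log10()
print(p)
```

Because binmeans carries a CHROM column, facet_grid() sends each segment to its own panel. The linewidth argument assumes ggplot2 3.4 or later; older versions use size for the same purpose.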

Related

How can I selectively colour histogram bars in R?

I did 10,000 stochastic implementations of a model, and want to plot some outputs. Long story short, the plot shows how long two species co-occurred within the system on a given run. Here's an example:
Now, I want to colour the histogram bars to specify which species lasted longer within the system. Specifically, I want to be able to specify if Species A outlasted Species B, if Species B outlasted Species A, or if both species went "extinct" at the same time, within a given simulation.
So, I made three vectors containing the run numbers (out of 10,000) for each of these three outcomes. For example:
# Pi wins
PiWins_1 <- which(Winner_1[1:10000, 2] %in% TRUE)
head(PiWins_1)
# [1] 1 2 6 7 9 12
# Pj wins
PjWins_1 <- which(Winner_1[1:10000, 3] %in% TRUE)
head(PjWins_1)
# [1] 3 4 5 8 10 11
# Ties
Ties_1 <- which(Winner_1[1:10000, 4] %in% TRUE)
head(Ties_1)
# [1] 20 24 29 40 110 132
I am now trying to figure out how to use these (or some other method) to colour the histogram bars according to the outcome of each of the 10,000 runs.
This is the ggplot script so far:
Histogram_1 <- ggplot(temp_df_1, aes(x=persistance_vec_1))+
geom_histogram(binwidth = 1, fill = "darkseagreen2")+
geom_vline(aes(xintercept = mean_pv_1, colour = "Mean # of Overlapping Time Pts in Stoch Runs"))+
geom_vline(aes(xintercept = cmean_pv_1, colour = "Conditional Mean # of Overlapping Time Pts in Stoch Runs (Dashed)"), linetype=2)+
geom_vline(aes(xintercept = median_pv_1, colour = "Median # of Overlapping Time Pts in Stoch Runs"))+
geom_vline(aes(xintercept = cmedian_pv_1, colour = "Conditional Median # of Overlapping Time Pts in Stoch Runs (Dashed)"), linetype=2)+
geom_vline(aes(xintercept = DO_1, colour = "Overlap in Det Sim"))+
labs(title="Det vs Stoch Overlap; Intro # 1, 10000 sims",
x="# of Time Pts Pi and Pj Co-occur",
y="# of Sims")+
scale_colour_manual(name="Legend",
values=c("Mean # of Overlapping Time Pts in Stoch Runs" = "navyblue",
"Conditional Mean # of Overlapping Time Pts in Stoch Runs (Dashed)" = "navyblue",
"Median # of Overlapping Time Pts in Stoch Runs" = "red2",
"Conditional Median # of Overlapping Time Pts in Stoch Runs (Dashed)" = "red2",
"Overlap in Det Sim" = "orange"))+
theme_minimal()
Histogram_1
Please let me know if I should provide any more information! Thanks so much in advance for any suggestions :)
Okay so I think what you're asking for is just how to colour bars by different colours. Here's a reprex:
x = c(1:10)
y = c(11:15, 15:11)
z = sample(c("A", "B", "C"), 10, replace = TRUE)
df = data.frame(x = x, y = y, z = z)
And then the plot - you can just use the fill argument in geom_col() if you want to use ggplot.
library(ggplot2)
df$z = as.factor(df$z)
ggplot(data = df) +
geom_col(aes(x = x, y = y, fill = z), colour = "black")
Does this do what you're looking to do?
Or if you really want to use geom_histogram(), here's an option with a different reprex:
# make the numeric data
x = sample(c(1:10), 50, replace = TRUE)
# assign each value to one of three arbitrary groups:
# odd numbers -> "A"; 2, 4, 6 -> "B"; everything else -> "C"
z = ifelse(x %% 2 == 1, "A",
ifelse(x %in% c(2, 4, 6), "B", "C"))
#turn into dataframe
df = data.frame(x = x, z = z)
and then the plot using the same idea:
df$z = as.factor(df$z)
ggplot(data = df) +
geom_histogram(aes(x = x, fill = z), colour = "black")
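To connect this back to the question's own objects, the three index vectors can be turned into a single outcome column, which is then mapped to fill. This is only a sketch: the simulated values below are made up to stand in for persistance_vec_1, PiWins_1, PjWins_1 and Ties_1, and winner and outcome_id are names introduced here for illustration.

```r
library(ggplot2)

# Made-up stand-ins for the question's objects (assumptions): a vector of
# co-occurrence times and three index vectors that partition the runs
set.seed(42)
n_runs <- 1000
persistance_vec_1 <- rpois(n_runs, lambda = 20)
outcome_id <- sample(1:3, n_runs, replace = TRUE, prob = c(0.45, 0.45, 0.10))
PiWins_1 <- which(outcome_id == 1)
PjWins_1 <- which(outcome_id == 2)
Ties_1   <- which(outcome_id == 3)

# One label per run, filled in from the index vectors
winner <- rep(NA_character_, n_runs)
winner[PiWins_1] <- "Pi wins"
winner[PjWins_1] <- "Pj wins"
winner[Ties_1]   <- "Tie"

temp_df_1 <- data.frame(persistance_vec_1 = persistance_vec_1,
                        winner = factor(winner))

# Each bar is split (stacked) by outcome
Histogram_1 <- ggplot(temp_df_1, aes(x = persistance_vec_1, fill = winner)) +
  geom_histogram(binwidth = 1, colour = "black") +
  labs(x = "# of Time Pts Pi and Pj Co-occur", y = "# of Sims", fill = "Outcome")
print(Histogram_1)
```

Since fill is now an aesthetic, the fixed fill = "darkseagreen2" from the original script has to be dropped, and the colours can be chosen with scale_fill_manual() in the same way the script already uses scale_colour_manual() for the vertical lines.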

Loop printing lots of graphs in order (PDF) using ggplot2 in R

I have a large dataset as a result of a bayesian logistic regression. The dataset contains parameter estimates, confidence intervals, etc (see below for head).
mean sd confint_2.5 confint_97.5 Rhat median spec Errorup Errordown
1 -0.7897597 0.18668304 -1.1759960 -0.4517294 1.002211 -0.7811156 Marvulg -0.3293862 -1.957112
2 -0.7891327 0.08145761 -0.9570086 -0.6380287 1.000155 -0.7861764 Viotric -0.1481477 -1.743185
3 -0.6619662 0.26049168 -1.2203315 -0.2059030 1.045208 -0.6440501 Antdioi -0.4381470 -1.864382
4 -0.6571516 0.17940842 -1.0417642 -0.3364415 1.008100 -0.6470382 Eleacic -0.3105968 -1.688802
5 -0.6526717 0.20005184 -1.0816375 -0.2968111 1.005126 -0.6394952 Antcotu -0.3426842 -1.721133
6 -0.6497648 0.16620699 -1.0081607 -0.3555847 1.003738 -0.6384035 Triflav -0.2828188 -1.646564
I have a total of 714 rows of data, sorted by mean from low to high. I use this code to plot 50 rows at a time, where a3_sort is a subset of 50 rows (so I manually do a3_sort <- a3[n:(n + 49), ], print the subset, and proceed to the next 50):
ggplot2::ggplot(data = a3_sort, mapping = aes(x = reorder(spec, mean), y = mean, ymin = confint_2.5, ymax = confint_97.5))+
geom_pointrange()+
geom_hline(yintercept = 0, lty = 2)+
coord_flip()+
xlab ("species") +ylab ("mean (credibility interval)")+
theme_bw()
This works, and I get what I want, but there must be a less manual labour way to do this?
My question: Is there a way to loop this procedure, automatically saving the PDF in the working directory?
Below an example of what one plot looks like:
You can try this solution. I tested it with a dummy data frame DF with 714 rows and the same columns as yours; in your case, DF is your sorted data frame. The code is set up so that you can change the window width if you need more than 50 rows per plot.
library(ggplot2)
library(zoo) # na.locf()
#Create keys; change 50 if you want a larger window
keys <- seq(1, nrow(DF), 50)
vals=1:length(keys)
#Flag to allocate the position and values
#na.locf is used to complete NA so that we have same index
DF$Flag <- NA
DF$Flag[keys]<-vals
DF$Flag <- na.locf(DF$Flag)
#Then split by flag
ListData <- split(DF,DF$Flag)
#Function to create plot
myplot <- function(x)
{
tplot <- ggplot2::ggplot(data = x, mapping = aes(x = reorder(spec, mean), y = mean, ymin = confint_2.5, ymax = confint_97.5))+
geom_pointrange()+
geom_hline(yintercept = 0, lty = 2)+
coord_flip()+
xlab ("species") +ylab ("mean (credibility interval)")+
theme_bw()
return(tplot)
}
#Replicate plots
LPlots <- lapply(ListData,myplot)
#Export to pdf
pdf('Myplots.pdf',width = 14)
for(i in c(1:length(LPlots)))
{
plot(LPlots[[i]])
}
dev.off()
In the end, you will have your plots in pdf. I hope this helps. Let me know if you have any doubt.
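As a side note, the chunk index used for the split can also be computed directly from the row number, without the flag column and na.locf(). A minimal sketch, with a made-up one-column DF standing in for the real data; chunk_size and chunk are names introduced here for illustration:

```r
# Made-up stand-in for the sorted 714-row data frame (an assumption)
DF <- data.frame(id = seq_len(714))

# Rows 1-50 get chunk 1, rows 51-100 get chunk 2, and so on
chunk_size <- 50
chunk <- ceiling(seq_len(nrow(DF)) / chunk_size)
ListData <- split(DF, chunk)

length(ListData)                     # 15 chunks
nrow(ListData[[length(ListData)]])   # the last chunk holds the remaining 14 rows
```

The resulting ListData can be passed to lapply(ListData, myplot) exactly as above.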
This approach could be adapted to your case:
# Some dummy data:
df <- data.frame(g = letters[1:24],
min = sample(0:10, 24, replace = TRUE),
mid = sample(11:20, 24, replace = TRUE),
max = sample(21:30, 24, replace = TRUE))
library(ggplot2)
library(purrr)
# list of the rows you want printing, this could be automated
plot_range <- list(p1_6 = 1:6, p7_12 = 7:12, p13_18 = 13:18, p19_24 = 19:24)
# plotting function which also sets a title and plot name
gg_plot <- function(df, plot_rows){
title <- paste("Automatic plot rows: ", min(plot_rows), "to", max(plot_rows))
plot_nm <- paste("plots", min(plot_rows), max(plot_rows), sep = "_")
p <- ggplot(df[plot_rows, ])+
geom_segment(aes(x = min , xend = max, y = g, yend = g))+
geom_point(aes(x = mid, y = g))+
ggtitle(title)
print(ggsave(plot_nm, p, device = "pdf"))
}
# purrr function which acts as a loop to print each graph and allows a different data frame to be used.
walk(plot_range, ~gg_plot(df = df, plot_rows = .x))
#> Saving 7 x 5 in image
#> NULL
#> Saving 7 x 5 in image
#> NULL
#> Saving 7 x 5 in image
#> NULL
#> Saving 7 x 5 in image
#> NULL
Created on 2020-07-11 by the reprex package (v0.3.0)

ggplot2: multiple variables on x-axis at multiple times

I have a data frame for observation numbers (3 observations for same id), height, weight and fev that looks like this (just for example):
id obs height weight fev
1 1 160 80 90
1 2 150 70 85
1 3 155 76 87
2 1 140 67 91
2 2 189 78 71
2 3 178 86 89
I need to plot this data using ggplot2 such that on x-axis there are 3 variables height, weight, fev; and the observation numbers are displayed as 3 vertical lines for each variable (color coded), where each lines show a median as a solid circle, and 25th and 75th percentiles as caps at the upper and lower extremes of the line (no minimum or maximum needed). I have so far tried many variations of box plots but I am not even getting close. Any suggestion(s) how to approach or solve this?
Thanks
OK, what I did below instead was make three graphs and then piece them together with gridExtra. Read more about the package here: http://www.sthda.com/english/wiki/wiki.php?id_contents=7930
I took the common-legend code from that site to produce the following, starting with our existing longdf2. Because the graphs are pieced together, the observation each one corresponds to is given in its title.
library(reshape2) # melt()
library(dplyr) # %>% and filter()
library(ggplot2) # stat_summary(fun.data = median_hilow) also needs the Hmisc package installed
id <- rep(1:12, each = 3)
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly)
newvars <- melt(df[-2],id.vars = 'id')
longdf2 <- cbind(obsonly,newvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
#Make graph 1 of observation 1
g1 <- longdf2 %>%
dplyr::filter(obsnum == 1) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 1") +
theme(plot.title = element_text(hjust = 0.5)) #has a legend
g2 <- longdf2 %>%
dplyr::filter(obsnum == 2) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 2") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
#specified as none to make common legend at end
g3 <- longdf2 %>%
dplyr::filter(obsnum == 3) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 3") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
library(gridExtra)
get_legend<-function(myggplot){
tmp <- ggplot_gtable(ggplot_build(myggplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
# Save legend
legend <- get_legend(g1)
# Remove legend from 1st graph
g1 <- g1 + theme(legend.position = 'none')
# Combine graphs
grid.arrange(g1, g2, g3, legend, ncol=4, widths=c(2.3, 2.3, 2.3, 0.8))
Plenty of other little tweaks you could make along the way
Try putting the data into long format prior to graphing. I generated some more data, 12 subjects, each with 3 observations.
id <- rep(1:12, each = 3)
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
library(reshape2) #use to melt data from wide to long format
longdf <- melt(df,id.vars = c('id', 'obs'))
You don't need to define measure variables here: since the id.vars are defined, the remaining variables automatically default to measure variables. If you have more variables in your data set, you'll want to define the measure variables in that same line with measure.vars = c("height", "weight", "fev"):
longdf <- melt(df,id.vars = c('id', 'obs'), measure.vars = c("height", "weight", "fev"))
Apologies, haven't earned enough votes to put figures into my responses
ggplot(data = longdf, aes(x = variable, y = value, fill = factor(obs))) +
geom_boxplot(notch = T, notchwidth = .25, width = .25, position = position_dodge(.5))
This does not produce the exact graph you described, which sounded like geom_linerange or something similar; those geoms require an x, ymin, and ymax to draw. Otherwise, a regular ol' boxplot has the 1st and 3rd quartiles and the median marked. I made the boxplots thinner with the notch and width parameters, and separated them slightly with position_dodge(.5).
after reading your response, I edited my original answer
You could try facet_wrap, and watch the swap of "fill" for "color" in ggplot. If an object can't be "filled" with a color the way a boxplot or density can, then it has to be "colored" instead, so use color in the original aes():
ggplot(data = longdf, aes(x = variable, y = value, color = factor(obs))) +
stat_summary(fun.data=median_hilow) + facet_wrap(.~obs)
This gives you observation 1 - height, weight, fev side by side, observation 2- height, ....
If that still isn't what you want perhaps more like height observation 1,2,3; weight observation 1,2,3...then you'll need to modify your melting to have two variable and two value columns. Essentially make two melted dataframes, then cbind. Annnnd because each observation has three variables, you'll need to rbind to make sure both data frames have the same number of rows:
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly) #making rows equal
longvars <- melt(df[-2],id.vars = 'id') #dropping obs from melt
longdf2 <- cbind(obsonly,longvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
ggplot(data = longdf2, aes(x = obsnum, y = value,
color = factor(variable))) +
stat_summary(fun.data=median_hilow) +
facet_wrap(.~variable)
From here you can play around with the x axis marks (probably isn't useful to have a 1.5 observation marked) and the spacing of the lines from each other
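One caveat about the snippets above: median_hilow (which relies on the Hmisc package) by default spans the central 95% of the data, not the quartiles the question asks for. A sketch of one way to get a median point with caps at the 25th and 75th percentiles is to pass explicit summary functions to stat_summary(). The helpers q25 and q75 and the object dodge are names introduced here for illustration, the data is the same made-up data as above (reshaped with base R so the block stands alone), and the fun, fun.min and fun.max arguments assume ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Same made-up data as above, reshaped to long format with base R
df <- data.frame(id = rep(1:12, each = 3), obs = rep(1:3, 12),
                 height = seq(140, 189, length.out = 36),
                 weight = seq(67, 86, length.out = 36),
                 fev = seq(71, 91, length.out = 36))
longdf <- data.frame(
  id = rep(df$id, 3), obs = rep(df$obs, 3),
  variable = factor(rep(c("height", "weight", "fev"), each = nrow(df)),
                    levels = c("height", "weight", "fev")),
  value = c(df$height, df$weight, df$fev))

# Hypothetical helpers returning the 25th and 75th percentiles
q25 <- function(v) unname(quantile(v, 0.25))
q75 <- function(v) unname(quantile(v, 0.75))
dodge <- position_dodge(width = 0.5)

p <- ggplot(longdf, aes(x = variable, y = value, colour = factor(obs))) +
  # vertical line with caps at the 25th and 75th percentiles
  stat_summary(fun = median, fun.min = q25, fun.max = q75,
               geom = "errorbar", width = 0.2, position = dodge) +
  # solid circle at the median
  stat_summary(fun = median, geom = "point", size = 3, position = dodge) +
  labs(colour = "obs")
print(p)
```

This puts height, weight and fev on the x-axis with three colour-coded lines per variable, one per observation number.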

How to Create a Graph of Statistical Time Series

I have data in the following format:
Date Year Month Day Flow
1 1953-10-01 1953 10 1 530
2 1953-10-02 1953 10 2 530
3 1953-10-03 1953 10 3 530
I would like to create a graph like this:
Here is my current image and code:
library(ggplot2)
library(plyr)
library(reshape2)
library(scales)
library(lubridate) # month() and day()
## Read Data
df <- read.csv("Salt River Flow.csv")
## Convert Date column to R-recognized dates
df$Date <- as.Date(df$Date, "%m/%d/%Y")
## Finds Water Years (Oct - Sept)
df$WY <- as.POSIXlt(as.POSIXlt(df$Date)+7948800)$year+1900
## Normalizes Water Years so stats can be applied to just months and days
df$w <- ifelse(month(df$Date) %in% c(10,11,12), 1903, 1904)
##Creates New Date (dat) Column
df$dat <- as.Date(paste(df$w,month(df$Date),day(df$Date), sep = "-"))
## Creates new data frame with summarised data by MonthDay
PlotData <- ddply(df, .(dat), summarise,
Min = min(Flow),
Tenth = quantile(Flow, probs = 0.10),
TwentyFifth = quantile(Flow, probs = 0.25),
Median = quantile(Flow, probs = 0.50),
Mean = mean(Flow),
SeventyFifth = quantile(Flow, probs = 0.75),
Ninetieth = quantile(Flow, probs = 0.90),
Max = max(Flow))
## Melts data so it can be plotted with ggplot
m <- melt(PlotData, id="dat")
## Plots
p <- ggplot(m, aes(x = dat)) +
geom_ribbon(aes(min = TwentyFifth, max = Median), data = PlotData, fill = alpha("black", 0.1), color = NA) +
geom_ribbon(aes(min = Median, max = SeventyFifth), data = PlotData, fill = alpha("black", 0.5), color = NA) +
scale_x_date(labels = date_format("%b"), breaks = date_breaks("month"), expand = c(0,0)) +
geom_line(data = subset(m, variable == "Mean"), aes(y = value), size = 1.2) +
theme_bw() +
geom_line(data = subset(m, variable %in% c("Min","Max")), aes(y = value, group = variable)) +
geom_line(data = subset(m, variable %in% c("Ninetieth","Tenth")), aes(y = value, group = variable), linetype = 2) +
labs(x = "Water Year", y = "Flow (cfs)")
p
I am very close but there are some issues I'm having. First, if you can see a way to improve my code, please let me know. The main problem I ran into was that I needed two dataframes to make this graph: one melted, and one not. The unmelted dataframe was necessary (I think) to create the ribbons. I tried many ways to use the melted dataframe for the ribbons, but there was always a problem with the aesthetic length.
Second, I know that to get a legend (and I want one) I need to map something in the aesthetics of each line/ribbon, but I am having trouble getting that to work. I think it would involve scale_fill_manual.
Third, and I don't know if this is possible, I would like to have each month label in between the tick marks, not on them (like in the above image).
Any help is greatly appreciated (especially with creating more efficient code).
Thank you.
Something along these lines might get you close with base:
library(lubridate)
library(reshape2)
# simulating data...
Date <- seq(as.Date("1953-10-01"),as.Date("2010-10-01"),by="day")
Year <- year(Date)
Month <- month(Date)
Day <- day(Date)
set.seed(1)
Flow <- rpois(length(Date), 2000)
Data <- data.frame(Date=Date,Year=Year,Month=Month,Day=Day,Flow=Flow)
# use acast to get it in a convenient shape:
PlotData <- acast(Data,Year~Month+Day,value.var="Flow")
# apply for quantiles
Quantiles <- apply(PlotData,2,function(x){
quantile(x,probs=c(1,.9,.75,.5,.25,.1,0),na.rm=TRUE)
})
Mean <- colMeans(PlotData, na.rm=TRUE)
# ugly way to get month tick separators
MonthTicks <- cumsum(table(unlist(lapply(strsplit(names(Mean),split="_"),"[[",1))))
# and finally your question:
plot(1:366,seq(0,max(Flow),length=366),type="n",xlab = "Water Year",ylab="Discharge",axes=FALSE)
polygon(c(1:366,366:1),c(Quantiles["50%",],rev(Quantiles["75%",])),border=NA,col=gray(.6))
polygon(c(1:366,366:1),c(Quantiles["50%",],rev(Quantiles["25%",])),border=NA,col=gray(.4))
lines(1:366,Quantiles["90%",], col = gray(.5), lty=4)
lines(1:366,Quantiles["10%",], col = gray(.5))
lines(1:366,Quantiles["100%",], col = gray(.7))
lines(1:366,Quantiles["0%",], col = gray(.7), lty=4)
lines(1:366,Mean,lwd=3)
axis(1,at=MonthTicks, labels=NA)
text(MonthTicks-15,-100,1:12,pos=1,xpd=TRUE)
axis(2)
The plotting code really isn't that tricky. You'll need to clean up the aesthetics, but polygon() is usually my strategy for shaded regions in plots (confidence bands, whatever).
Perhaps this will get you closer to what you're looking for, using ggplot2 and plyr:
library(ggplot2)
library(plyr)
library(lubridate)
library(scales)
df$MonthDay <- df$Date - years( year(df$Date) + 100 ) #Normalize points to same year
df <- ddply(df, .(Month, Day), mutate, MaxDayFlow = max(Flow) ) #Max flow on day
df <- ddply(df, .(Month, Day), mutate, MinDayFlow = min(Flow) ) #Min flow on day
p <- ggplot(df, aes(x=MonthDay) ) +
geom_smooth(size=2,level=.8,color="black",aes(y=Flow)) + #80% conf. interval
geom_smooth(size=2,level=.5,color="black",aes(y=Flow)) + #50% conf. interval
geom_line( linetype="longdash", aes(y=MaxDayFlow) ) +
geom_line( linetype="longdash", aes(y=MinDayFlow) ) +
labs(x="Month",y="Flow") +
scale_x_date( labels = date_format("%b") ) +
theme_bw()
Edit: Fixed X scale and X scale label
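On the legend part of the question, one common pattern is to map a constant label inside aes() for each ribbon or line and then assign the colours with a manual scale. The sketch below uses a made-up daily summary in place of the real PlotData (the column names follow the question; the values and the legend labels are assumptions), so treat it as an illustration of the pattern rather than a finished figure.

```r
library(ggplot2)

# Made-up daily summary (an assumption) standing in for PlotData
set.seed(1)
dat <- seq(as.Date("1903-10-01"), as.Date("1904-09-30"), by = "day")
Median <- 2000 + 500 * sin(seq(0, 2 * pi, length.out = length(dat)))
PlotData <- data.frame(dat = dat,
                       TwentyFifth = Median - 200,
                       Median = Median,
                       SeventyFifth = Median + 200,
                       Mean = Median + rnorm(length(dat), sd = 30))

p <- ggplot(PlotData, aes(x = dat)) +
  # a constant string inside aes() creates a legend entry for each layer
  geom_ribbon(aes(ymin = TwentyFifth, ymax = Median,
                  fill = "25th to 50th percentile")) +
  geom_ribbon(aes(ymin = Median, ymax = SeventyFifth,
                  fill = "50th to 75th percentile")) +
  geom_line(aes(y = Mean, linetype = "Mean")) +
  scale_fill_manual(name = NULL,
                    values = c("25th to 50th percentile" = "grey85",
                               "50th to 75th percentile" = "grey55")) +
  scale_linetype_manual(name = NULL, values = c("Mean" = "solid")) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b", expand = c(0, 0)) +
  labs(x = "Water Year", y = "Flow (cfs)") +
  theme_bw()
print(p)
```

Everything here is drawn from the unmelted summary frame, so the melted copy is not needed for the ribbons or the mean line.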
(Partial answer with base plotting functions, not including the min, max, or mean.) I suspect you will need to construct a dataset before passing it to ggplot, since that is typical for that function. I already do something similar and then pass the resulting matrix to matplot. (It doesn't do that cool highlighting, but maybe ggplot can.)
HDL.mon.mat <- aggregate(dfrm$Flow,
list( dfrm$Year + dfrm$Month/12),
quantile, prob=c(0.1,0.25,0.5,0.75, 0.9), na.rm=TRUE)
matplot(HDL.mon.mat[,1], HDL.mon.mat$x, type="pl")

Add shape at the start and end of lines, and at some interval along the lines, defined by a grouping variable

This is my df (almost 100,000 rows and 10 ID values):
Date.time P ID
1 2013-07-03 12:10:00 1114.3 J9335
2 2013-07-03 12:20:00 1114.5 K0904
3 2013-07-03 12:30:00 1114.3 K0904
4 2013-07-03 12:40:00 1114.1 K1136
5 2013-07-03 12:50:00 1114.1 K1148
............
With ggplot I create this graph:
ggplot(df) + geom_line(aes(Date.time, P, group=ID, colour=ID))
No problem with this graph. But since I also have to print it in black and white, separating the lines by colour is not a smart choice.
I tried distinguishing the IDs by line type, but the result is not very good.
So my idea is to add a different symbol at the beginning and at the end of every line, so that the different IDs can also be identified on a black-and-white printout.
I add the lines:
geom_point(data=df, aes(x=min(Date.time), y=P, shape=ID))+
geom_point(data=df, aes(x=max(Date.time), y=P, shape=ID))
But an error occurs.
Any suggestions?
Given that every line is made up of around 5,000 to 10,000 values, it's impossible to plot every value as a separate symbol. A solution could be to plot the lines and then plot points with a different symbol for every ID at regular breaks (for example, one symbol every 500 values). Is it possible to do that?
What about adding the geom_points using a subset of your data with only the min and max time values?
# some data
df <- data.frame(
ID = rep(c("a", "b"), each = 4),
Date.time = rep(seq(Sys.time(), by = "hour", length.out = 4), 2),
P = sample(1:10, 8))
df
# create a subset with min and max time values
# if min(x) and max(x) is the same for each ID:
df_minmax <- subset(x= df, subset = Date.time == min(Date.time) | Date.time == max(Date.time))
# if min(x) and max(x) may differ between ID,
# calculate min and max values *per* ID
# Here I use ddply, but several other aggregating functions in base R will do as well.
library(plyr)
df_minmax <- ddply(.data = df, .variables = .(ID), subset,
Date.time == min(Date.time) | Date.time == max(Date.time))
gg <- ggplot(data = df, aes(x = Date.time, y = P)) +
geom_line(aes(group = ID, colour = ID)) +
geom_point(data = df_minmax, aes(shape = ID))
gg
If you wish to have some control over your shapes, you may have a look at ?scale_shape_discrete (with examples here).
Edit following updated question
For each ID, add a shape to the line at some interval.
# create a slightly larger data set
df <- data.frame(
ID = rep(c("a", "b"), each = 100),
Date.time = rep(seq(Sys.time(), by = "day", length.out = 100), 2),
P = c(sample(1:10, 100, replace = TRUE), sample(11:20, 100, replace = TRUE)))
# for each ID:
# create a time sequence from min(time) to max(time), by some time step
# e.g. a week
df_gap <- ddply(.data = df, .variables = .(ID), summarize,
Date.time =
seq(from = min(Date.time), to = max(Date.time), by = "week"))
# add P from df to df_gap
df_gap <- merge(x = df_gap, y = df)
gg <- ggplot(data = df, aes(x = Date.time, y = P)) +
geom_line(aes(group = ID, colour = ID)) +
geom_point(data = df_gap, aes(shape = ID)) +
# if your gaps are not a multiple of the length of the data
# you may wish to add the max points as well
geom_point(data = df_minmax, aes(shape = ID))
gg
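If the markers should be spaced by number of observations rather than by time (the question mentions one symbol every 500 values), a variation on the same idea is to number the rows within each ID and keep every n-th one. This is a sketch with made-up data and a small step of 25 so the result is easy to inspect; row_in_id, n_in_id, step and df_marks are names introduced here for illustration.

```r
library(ggplot2)

# Made-up data: two IDs with 100 readings each, 10 minutes apart
set.seed(1)
df <- data.frame(
  ID = rep(c("a", "b"), each = 100),
  Date.time = rep(seq(as.POSIXct("2013-07-03 12:10:00", tz = "UTC"),
                      by = "10 min", length.out = 100), 2),
  P = c(sample(1:10, 100, replace = TRUE), sample(11:20, 100, replace = TRUE)))

# Sort, then number the rows within each ID (1, 2, 3, ...)
df <- df[order(df$ID, df$Date.time), ]
df$row_in_id <- ave(seq_len(nrow(df)), df$ID, FUN = seq_along)
df$n_in_id   <- ave(seq_len(nrow(df)), df$ID, FUN = length)

# Keep the first row, every 25th row after it, and the last row of each ID
step <- 25
df_marks <- df[(df$row_in_id - 1) %% step == 0 | df$row_in_id == df$n_in_id, ]

gg <- ggplot(df, aes(x = Date.time, y = P)) +
  geom_line(aes(group = ID)) +
  geom_point(data = df_marks, aes(shape = ID), size = 3)
print(gg)
```

With the real data, step would be set to 500. Note that ggplot2's default shape palette only covers six levels, so with 10 IDs the shapes would need to be supplied through scale_shape_manual().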
The error stems from the fact that the single value min(Date.time) doesn't match the length of the vectors P and ID. Another problem might be that you're re-declaring your data argument even though you already have ggplot(df).
The solution that immediately comes to mind is to find the row indexes of your minimum and maximum dates. If all IDs share the same minimum and maximum time stamps, then it's easy. Use the which() function to get a vector of the row numbers you'll need.
min.index <- which(df$Date.time == min(df$Date.time))
max.index <- which(df$Date.time == max(df$Date.time))
Then use those arrays as your indexes.
geom_point(aes(x=Date.time[min.index], y=P[min.index], shape=ID[min.index]))+
geom_point(aes(x=Date.time[max.index], y=P[max.index], shape=ID[max.index]))
