Defining T0 in my program - r

Here's a small program I'm writing to eventually produce a final graph. I have 2 separate data sets. One is called T0 and the second one contains all the data I have. I want the program to take the T0 values from the first data frame and then search for the maximum price in each of the 3 years before and the 3 years after the T0 year.
In essence, the program assigns T0 values that I chose arbitrarily, then automatically searches my database for the maximum price in each year except the T0 year.
The problem I'm facing is with the implementation of the T0 values in the schedule. It just does not come out right when I run my code.
The problem apparently has to do with the way I'm defining T0. Should I use a for loop, or is there a small tweak I'm missing?
Final result wanted:
Data Base Example:
structure(list(Company = structure(1:3, .Label = c("Amazon",
"Cisco", "McDonald's"), class = "factor"), Year = c(2011L, 2008L,
2013L), Price = c(182, 21.82, 95.15)), .Names = c("Company",
"Year", "Price"), row.names = c(NA, 3L), class = "data.frame")
All Data:
structure(list(Company = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Amazon", "Cisco", "McDonald's"), class = "factor"),
Year = c(2008L, 2008L, 2008L, 2008L, 2009L, 2009L, 2010L,
2010L, 2010L, 2011L, 2011L, 2012L, 2012L, 2013L, 2013L, 2014L,
2014L, 2014L, 2008L, 2010L, 2010L, 2010L, 2011L, 2011L, 2012L,
2012L, 2013L, 2013L, 2014L, 2014L, 2014L, 2015L, 2015L, 2016L,
2016L, 2016L, 2005L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L,
2007L, 2008L, 2008L, 2009L, 2009L, 2009L, 2010L, 2010L, 2011L,
2011L, 2011L), Price = c(91L, 77L, 81L, 87L, 63L, 88L, 110L,
75L, 117L, 170L, 190L, 215L, 245L, 316L, 275L, 330L, 378L,
390L, 55L, 62L, 66L, 65L, 72L, 98L, 93L, 88L, 99L, 101L,
94L, 103L, 96L, 99L, 116L, 112L, 123L, 113L, 19L, 17L, 18L,
20L, 19L, 26L, 31L, 27L, 24L, 21L, 14L, 22L, 18L, 26L, 22L,
14L, 16L, 15L)), .Names = c("Company", "Year", "Price"), class = "data.frame", row.names = c(NA,
My code:
T0data<- read.csv(file = "C:/Users/My first file.csv", header = TRUE )
Alldata<- read.csv(file = "C:/Users/My second file.csv", header = TRUE )
year_zero <- T0data$Year
# Filter to include year_zero +/- 3 years and get Best result per company per year
d <- d[Year >= year_zero - 3 & Year <= year_zero + 3,
       .(Best_Result = max(Price, na.rm = TRUE)), by = .(Company, Year)]
# Add T as interval to year_zero (and convert to factor in order to get all
# values from -3 to 3)
d[, "T" := factor(Year - year_zero, levels = seq(-3, 3), ordered = TRUE)]
# Cast to wide format (fill missing values with NA)
dcast(d, Company ~T, value.var = "Best_Result", drop = FALSE)
# Cast to wide format (fill missing values with "")
dcast(d, Company~T, value.var = "Best_Result", drop = FALSE, fun.aggregate = paste0,
fill = "")

Here's a solution that uses dplyr / tidyr packages from the tidyverse, rather than data.table, but it should do the job:
library(dplyr); library(tidyr)
T0.modified <- T0data %>%
# create year range based on each company's T0 year
mutate(Year.M1 = Year - 1,
Year.M2 = Year - 2,
Year.M3 = Year - 3,
Year.P1 = Year + 1,
Year.P2 = Year + 2,
Year.P3 = Year + 3) %>%
# convert to long format, match with Alldata based on both company & year
gather(reference.year, actual.year, -Company, -Price) %>%
left_join(Alldata, by = c("Company" = "Company", "actual.year" = "Year")) %>%
# keep T0 price for year T0, & use matched prices for all other years
mutate(Price = ifelse(reference.year == "Year", Price.x, Price.y)) %>%
# take maximum of all matched prices for each company each year
group_by(Company, reference.year) %>%
summarise(Price = max(Price)) %>%
ungroup() %>%
# order reference.year for correct sequence in ggplot's x-axis
mutate(reference.year = factor(reference.year,
levels = c("Year.M3", "Year.M2", "Year.M1", "Year",
"Year.P1", "Year.P2", "Year.P3"),
labels = c("T-3", "T-2", "T-1", "T0", "T+1", "T+2", "T+3")))
Resulting plot:
library(ggplot2)
ggplot(T0.modified,
       aes(x = reference.year, y = Price, group = Company, color = Company)) +
  geom_line() +
  xlab("Year") + theme_bw()
Edit adding average for each year using stat_summary:
ggplot(T0.modified,
       aes(x = reference.year, y = Price, group = Company, color = Company)) +
  geom_line() +
  xlab("Year") + theme_bw() +
  # in ggplot2 >= 3.3.0, fun.y is deprecated in favour of fun
  stat_summary(fun.y = mean, geom = "line", group = 1,
               linetype = 2, size = 1.5, colour = "grey") +
  annotate("label", x = 7, y = 200, label = "Average",
           fill = "grey", alpha = 0.5, hjust = 1)
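The central idea in the answer is that each company is compared against its own T0 year rather than against the whole vector of T0 years at once. A minimal base-R sketch of that idea follows; the data frames `t0` and `all_prices` and their values are made up for illustration and are not the asker's files.

```r
# Toy versions of the two data frames (values are invented for illustration)
t0 <- data.frame(Company = c("A", "B"), Year = c(2011L, 2008L),
                 stringsAsFactors = FALSE)
all_prices <- data.frame(
  Company = c("A", "A", "A", "A", "B", "B", "B"),
  Year    = c(2009L, 2010L, 2010L, 2012L, 2007L, 2009L, 2009L),
  Price   = c(63, 110, 117, 245, 31, 14, 22),
  stringsAsFactors = FALSE
)

# Attach each company's own T0 year to every one of its price rows
m <- merge(all_prices, t0, by = "Company", suffixes = c("", ".T0"))
m$offset <- m$Year - m$Year.T0

# Keep the +/- 3 year window and drop the T0 year itself
m <- m[abs(m$offset) <= 3 & m$offset != 0, ]

# Maximum price per company and per offset from T0
best <- aggregate(Price ~ Company + offset, data = m, FUN = max)
best <- best[order(best$Company, best$offset), ]
print(best)
```

Because the offset is computed row by row after the merge, no loop over companies is needed.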


Plotting a generalized linear model (glm) produced with the function monthglm() from the season package in R

I have fitted a generalized linear model (glm) with month as a categorical variable using the function monthglm(), with Month and Season as covariates. The function was written by Barnett, A.G., Dobson, A.J. (2010) Analysing Seasonal Health Data. Springer.
The covariates 'Month' and 'Season' appear to confuse the model. The model summary (see below) contains the warning “Coefficients: (3 not defined because of singularities)”, so three months have not been properly estimated (e.g. March, September, and December) and the model output shows NAs for them instead. Essentially, the model can't distinguish between the covariates Month and Season because they are so similar.
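A small sketch of why this kind of singularity can arise, assuming (as a simplification) that Season is completely determined by Month. The month-to-season lookup below is illustrative only and is not taken from the Sightings data. Note also that the output lists both Seasonwinter and SeasonWinter, which suggests the Season column mixes two spellings of the same level.

```r
# Illustrative lookup: every month belongs to exactly one season (assumption)
season_lookup <- c(Jan = "Winter", Feb = "Winter", Mar = "Spring",
                   Apr = "Spring", May = "Spring", Jun = "Summer",
                   Jul = "Summer", Aug = "Summer", Sep = "Autumn",
                   Oct = "Autumn", Nov = "Autumn", Dec = "Winter")

# Three years of monthly rows, as in a 36-observation design
month  <- factor(rep(month.abb, times = 3), levels = month.abb)
season <- factor(unname(season_lookup[as.character(month)]))

# Design matrix with both factors: 1 intercept + 3 season + 11 month columns
X <- model.matrix(~ season + month)
n_columns <- ncol(X)
n_rank    <- qr(X)$rank

# Columns beyond the rank cannot be estimated and would show up as NA in glm()
n_aliased <- n_columns - n_rank
print(c(columns = n_columns, rank = n_rank, aliased = n_aliased))
```

Under this assumption the season dummies are sums of month dummies, so they carry no information that the month factor does not already contain.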
I was wondering if anyone could help me manipulate the data or the model itself so that monthglm() can calculate the mean values and the upper and lower confidence limits for blue whale sightings over all months while including the covariates 'Month' and 'Season' in the model. As it stands, the plotted model (see below) has three missing confidence bars for March, September, and December.
The aim is to plot the results of the model for all months between January and December, showing mean blue whale sightings with both upper and lower confidence limits, using the covariates 'Month' and 'Season'.
Thank you if anyone is able to help!
Function: monthglm():
## Load packages
library(MASS) # for mvrnorm
library(survival) # for coxph
##R-code for the function monthglm2()
## checks
if (refmonth<1|refmonth>12){stop("Reference month must be between 1 and 12")}
## original call with defaults (see amer package)
ans <- as.list(
frmls <- formals(deparse(ans[[1]]))
add <- which(!(names(frmls) %in% names(ans)))
call<, frmls[add]))
## If month is a character, create the numbers
data$month=months # add to data for flagleap
months<-relevel(months,[refmonth]) # set reference month ### TYPO HERE, changed from months.u
## Transform month numbers to names
nums<-as.numeric(nochars(levels(months.u))) # Month numbers
months<-relevel(months.u,[refmonth]) # set reference month
## prepare data/formula
dep<-parts[2] # dependent variable
days<-flagleap(data=data,report=FALSE,matchin=T) # get the number of days in each month
if(is.null(offsetpop)==FALSE){poff=with(data,eval(offsetpop))} else{poff=rep(1,l)} #
if(offsetmonth==TRUE){moff=days$ndaysmonth/(365.25/12)} else{moff=rep(1,l)} # days per month divided by average month length
### data$off<-log(poff*moff)
off<-log(poff*moff) #
## return
Sightings$year <- Sightings$Year
model<-monthglm2(formula=Frequency_Blue_Whales_Year_Month~Year+Season, family=poisson(),
offsetmonth=TRUE, monthvar='Month', refmonth=1, data=Sightings)
Model Output
Call: glm(formula = f, family = family, data = data, offset = off)
(Intercept) Year SeasonSpring SeasonSummer Seasonwinter SeasonWinter monthsFeb monthsMar monthsApr monthsMay
-323.25725 0.16196 0.43926 -0.03365 0.76373 0.91534 -0.06261 NA -0.23382 0.27876
monthsJun monthsJul monthsAug monthsSep monthsOct monthsNov monthsDec
-1.97313 -19.55938 0.25231 -1.94416 0.00643 0.77171 NA
Degrees of Freedom: 35 Total (i.e. Null); 21 Residual
Null Deviance: 940.7
Residual Deviance: 195.4 AIC: 386.7
Number of observations = 36
Rate ratios
mean lower upper zvalue pvalue
monthsFeb 9.393137e-01 0.67978181 1.2979315 -0.37944839 7.043549e-01
monthsApr 7.915059e-01 0.54509500 1.1493073 -1.22869325 2.191868e-01
monthsMay 1.321488e+00 0.83554494 2.0900500 1.19180025 2.333396e-01
monthsJun 1.390209e-01 0.03860611 0.5006151 -3.01844013 2.540796e-03
monthsJul 3.202360e-09 0.00000000 Inf -0.01615812 9.871082e-01
monthsAug 1.286991e+00 1.01676543 1.6290337 2.09823277 3.588459e-02
monthsSep 1.431068e-01 0.05831898 0.3511647 -4.24489759 2.186933e-05
monthsOct 1.006450e+00 0.73231254 1.3832102 0.03963081 9.683875e-01
monthsNov 2.163470e+00 1.64625758 2.8431777 5.53616916 3.091590e-08
plot(model, ylim=c(0,1.4))
Error message when adding the y-label and x-label
## I am also unable to add the x-label and the y-label
> plot(model,
+ ylim=c(0,1.4),
+ ylab="Mean Blue Whale Sightings",
+ xlab="Month")
Error in plot.default(order, toplot$mean, xaxt = "n", xlab = "", ylab = "", :
formal argument "xlab" matched by multiple actual arguments
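The call shown in the error suggests that the plot method already passes xlab = "" and ylab = "" to plot.default() itself, so supplying them again through `...` makes R see the same argument twice. A minimal base-R sketch of that mechanism, using the hypothetical functions `inner_plot()` and `wrapper_plot()`, which are stand-ins and not part of the season package:

```r
# Hypothetical stand-ins: wrapper_plot() hardcodes xlab and forwards `...`,
# mimicking the shape of the call shown in the error message.
inner_plot <- function(x, y, xlab = "x", ...) invisible(xlab)
wrapper_plot <- function(x, ...) inner_plot(x, x, xlab = "", ...)

# ylim is not hardcoded in the wrapper, so it passes through without conflict
ok <- wrapper_plot(1:3, ylim = c(0, 1.4))

# xlab collides with the hardcoded xlab = "" inside wrapper_plot()
msg <- tryCatch(wrapper_plot(1:3, xlab = "Month"),
                error = function(e) conditionMessage(e))
print(msg)
```

If that is what happens here, one possible workaround is to draw the plot without labels and add them afterwards with base graphics, for example title(xlab = "Month", ylab = "Mean Blue Whale Sightings"). This has not been tested against the season package.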
Plotted Figure
Dataframe (called Sightings)
structure(list(Year = c(2015L, 2016L, 2017L, 2015L, 2016L, 2017L,
2015L, 2016L, 2017L, 2015L, 2016L, 2017L, 2015L, 2016L, 2017L,
2015L, 2016L, 2017L, 2015L, 2016L, 2017L, 2015L, 2016L, 2017L,
2015L, 2016L, 2017L, 2015L, 2016L, 2017L, 2015L, 2016L, 2017L,
2015L, 2016L, 2017L), Month = structure(c(5L, 5L, 5L, 4L, 4L,
4L, 8L, 8L, 8L, 1L, 1L, 1L, 9L, 9L, 9L, 7L, 7L, 7L, 6L, 6L, 6L,
2L, 2L, 2L, 12L, 12L, 12L, 11L, 11L, 11L, 10L, 10L, 10L, 3L,
3L, 3L), .Label = c("April", "August", "December", "Feb", "Jan",
"July", "June", "Mar", "May", "November", "October", "September"
), class = "factor"), Frequency_Blue_Whales_Year_Month = c(76L,
78L, 66L, 28L, 54L, 37L, 39L, 31L, 88L, 46L, 23L, 54L, 5L, 8L,
0L, 0L, 0L, 0L, 0L, 4L, 7L, 22L, 6L, 44L, 10L, 30L, 35L, 88L,
41L, 35L, 4L, 30L, 43L, 65L, 43L, 90L), Season = structure(c(4L,
4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
5L, 5L, 5L), .Label = c("Autumn", "Spring", "Summer", "winter",
"Winter"), class = "factor")), class = "data.frame", row.names = c(NA,

How to remove zero frequency for frequency plot and fix time?

When I produce a frequency plot:
Data <- structure(list(Venue = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("Conference", "Journal"), class = "factor"), Year = c(2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 2019L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L), Frequency = c(0L, 0L, 0L, 0L, 1L,
1L, 2L, 1L, 4L, 4L, 11L, 3L, 2L, 1L, 0L, 0L, 3L, 5L, 3L, 7L,
8L, 19L, 10L)), class = "data.frame", row.names = c(NA, -23L))
ggplot(Data, aes(x = Year, y = Frequency, fill = Venue, label = Frequency)) +
geom_bar(stat = "identity") +
geom_text(size = 3, position = position_stack(vjust = 0.5))
The plot shows labels for the zero values, and the years on the x axis do not match the years in the data frame.
How can I remove the zero frequencies from the plot (while keeping the nonzero record for a year such as 2012) and show every year on the x axis, one per bar?
Is this what you want?
The code to get it is:
ggplot(Data, aes(x = as.character(Year), y = Frequency, fill = Venue,
                 # NA labels are not drawn, so zero frequencies get no text
                 label = ifelse(Frequency > 0, Frequency, NA))) +
  geom_bar(stat = "identity") +
  geom_text(size = 3, position = position_stack(vjust = 0.5)) +
  scale_x_discrete(name = "Year")
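The two ingredients of this answer can be checked without drawing anything: zero counts are turned into missing labels, and the numeric years are turned into discrete values so that every year gets its own tick. A short base-R sketch using the Year and Frequency columns from the question:

```r
# Year and Frequency columns copied from the question's Data
Frequency <- c(0L, 0L, 0L, 0L, 1L, 1L, 2L, 1L, 4L, 4L, 11L, 3L,
               2L, 1L, 0L, 0L, 3L, 5L, 3L, 7L, 8L, 19L, 10L)
Year <- c(2008:2019, 2008:2018)

# Zero counts become NA, which leaves geom_text() with nothing to print there
bar_labels <- ifelse(Frequency > 0, Frequency, NA)

# Character (or factor) years are treated as discrete, one tick per year
year_axis <- as.character(Year)

print(bar_labels)
print(unique(year_axis))
```

ggplot2 usually warns that rows with missing values were removed from the text layer; the bars themselves are unaffected.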

ggplot2 graphing and plotting average and minimum

Here is my code:
library(dplyr); library(tidyr)
T0.modified <- T0data %>%
# create year range based on each company's T0 year
mutate(Year.M1 = Year - 1,
Year.M2 = Year - 2,
Year.M3 = Year - 3,
Year.P1 = Year + 1,
Year.P2 = Year + 2,
Year.P3 = Year + 3) %>%
# convert to long format, match with Alldata based on both company & year
gather(reference.year, actual.year, -Company, -Price) %>%
left_join(Alldata, by = c("Company" = "Company", "actual.year" = "Year")) %>%
# keep T0 price for year T0, & use matched prices for all other years
mutate(Price = ifelse(reference.year == "Year", Price.x, Price.y)) %>%
# take maximum of all matched prices for each company each year
group_by(Company, reference.year) %>%
summarise(Price = max(Price)) %>%
ungroup() %>%
# order reference.year for correct sequence in ggplot's x-axis
mutate(reference.year = factor(reference.year,
levels = c("Year.M3", "Year.M2", "Year.M1", "Year",
"Year.P1", "Year.P2", "Year.P3"),
labels = c("T-3", "T-2", "T-1", "T0", "T+1", "T+2", "T+3")))
ggplot(T0.modified,
       aes(x = reference.year, y = Price, group = Company, color = Company)) +
  geom_line() +
  xlab("Year") + theme_bw() +
  stat_summary(fun.y = mean, geom = "line", group = 1,
               linetype = 2, size = 1.5, colour = "grey") +
  annotate("label", x = 7, y = 200, label = "Average",
           fill = "grey", alpha = 0.5, hjust = 1)
And here is my data:
structure(list(Company = structure(1:3, .Label = c("Amazon",
"Cisco", "McDonald's"), class = "factor"), Year = c(2011L, 2008L,
2013L), Price = c(182, 21.82, 95.15)), .Names = c("Company",
"Year", "Price"), row.names = c(NA, 3L), class = "data.frame")
All Data:
structure(list(Company = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Amazon", "Cisco", "McDonald's"), class = "factor"),
Year = c(2008L, 2008L, 2008L, 2008L, 2009L, 2009L, 2010L,
2010L, 2010L, 2011L, 2011L, 2012L, 2012L, 2013L, 2013L, 2014L,
2014L, 2014L, 2008L, 2010L, 2010L, 2010L, 2011L, 2011L, 2012L,
2012L, 2013L, 2013L, 2014L, 2014L, 2014L, 2015L, 2015L, 2016L,
2016L, 2016L, 2005L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L,
2007L, 2008L, 2008L, 2009L, 2009L, 2009L, 2010L, 2010L, 2011L,
2011L, 2011L), Price = c(91L, 77L, 81L, 87L, 63L, 88L, 110L,
75L, 117L, 170L, 190L, 215L, 245L, 316L, 275L, 330L, 378L,
390L, 55L, 62L, 66L, 65L, 72L, 98L, 93L, 88L, 99L, 101L,
94L, 103L, 96L, 99L, 116L, 112L, 123L, 113L, 19L, 17L, 18L,
20L, 19L, 26L, 31L, 27L, 24L, 21L, 14L, 22L, 18L, 26L, 22L,
14L, 16L, 15L)), .Names = c("Company", "Year", "Price"), class = "data.frame", row.names = c(NA,
Here's my question:
How can I make the line graph show only 2 lines, the average and the minimum over all companies?
And how can I also plot a randomly chosen company as a third line, to compare it with the minimum and the average?
Something like this? It plots the average, the minimum and a random company (see subset).
p = ggplot(T0.modified) + xlab("Year") + theme_bw() +
  stat_summary(aes(x = reference.year, y = Price), fun.y = mean, geom = "line",
               group = 1, linetype = 2, size = 1.5, colour = "grey") +
  stat_summary(aes(x = reference.year, y = Price), fun.y = min, geom = "line",
               group = 1, linetype = 2, size = 1.5, colour = "red") +
  annotate("label", x = 7, y = 200, label = "Average", fill = "grey", alpha = 0.5, hjust = 1) +
  annotate("label", x = 7, y = 30, label = "Min", fill = "grey", alpha = 0.5, hjust = 1) +
  geom_line(data = subset(T0.modified, Company == "Amazon"),
            aes(x = reference.year, y = Price, group = Company), color = "blue")
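The subset above fixes the company by name. If the third line should really be a random company, one option is to draw the name with sample() first and subset on the result. A small sketch on a toy data frame; `toy`, its prices, and the seed are invented for illustration:

```r
# Toy stand-in for T0.modified (prices are invented)
companies <- c("Amazon", "Cisco", "McDonald's")
toy <- data.frame(
  Company        = rep(companies, each = 2),
  reference.year = rep(c("T-1", "T+1"), times = 3),
  Price          = c(190, 245, 31, 24, 93, 103),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, only so the draw is repeatable
random_company <- sample(unique(toy$Company), 1)

# Rows for the randomly chosen company; this is what the extra geom_line() would use
picked <- toy[toy$Company == random_company, ]
print(picked)
```

In the plot, `picked` (or the equivalent subset of T0.modified) would take the place of the hardcoded "Amazon" subset.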

ggplot2 legends: Adding items not in plot code to the legend & changing shapes in legend

I have a complex plot that shows dots of different colors for patient grades in different years, and lines of different colors connecting repeated measures (the same patients) measured in the two years whose grading has changed. As you can see the legend simply lists the dots with lines for the two colors. What I need, however, is a legend that has a red dot for 2009 and a blue dot for 2016 (without lines!), as well as a red line that I can label "Upgraded" and a blue line I can label "Downgraded". So I need four legend items: 2 dots, 2 lines, 4 labels. I've done extensive searching on this, and cannot find the answer.
Here's the plot code I used to build all the aesthetics I wanted:
sampledata <- structure(list(patient = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L,
5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 12L,
12L, 13L, 13L, 14L, 14L, 15L, 15L, 16L, 16L, 17L, 17L, 18L, 18L,
19L, 19L), grade = structure(c(5L, 5L, 5L, 5L, 5L, 1L, 5L, 1L,
1L, 5L, 1L, 5L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 5L, 2L, 5L, 3L, 3L,
3L, 3L, 2L, 4L, 2L, 4L, 4L, 1L, 4L, 1L, 4L, 2L, 4L, 2L), .Label = c("grade_I",
"grade_II", "grade_III", "indeterminate", "normal"), class = "factor"),
year = c(2009L, 2016L, 2009L, 2016L, 2009L, 2016L, 2009L,
2016L, 2009L, 2016L, 2009L, 2016L, 2009L, 2016L, 2009L, 2016L,
2009L, 2016L, 2009L, 2016L, 2009L, 2016L, 2009L, 2016L, 2009L,
2016L, 2009L, 2016L, 2009L, 2016L, 2009L, 2016L, 2009L, 2016L,
2009L, 2016L, 2009L, 2016L)), .Names = c("patient", "grade",
"year"), class = "data.frame", row.names = c(NA, -38L))
sampledata$yearf <- factor(sampledata$year)
sampledata$gradef <- factor(sampledata$grade, levels=c("normal", "grade_I", "grade_II", "grade_III", "indeterminate"))
p <- ggplot(data=sampledata, aes(x=gradef, y=patient, group=patient, color=yearf)) +
  geom_point() + geom_line()
p + scale_colour_brewer(palette = "Set1") +
labs(x = "ASE Grade", y = "Patient", color = "ASE Guidelines")
pticks = p + scale_x_discrete(labels=c("grade_I" = "Grade I", "grade_II" = "Grade II",
"grade_III" = "Grade III", "indeterminate" = "Indeterminate", "normal" = "Normal"))
ptheme = pticks + theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.background = element_blank(), legend.key = element_rect(fill = "white"), axis.line = element_line(), axis.title.x=element_text(vjust=0.0001))
paxes = ptheme + scale_colour_brewer(palette = "Set1") +
labs(x = "ASE Grade", y = "Patient", color = "ASE Guidelines")

Error (ymax not found) while adding a geom_line (different data) to a multi-layered ggplot

I'm trying to generate a plot that summarizes a dataset by first plotting the median & quantiles in an area / black line, after which I want to outline a specific 'firm' with a red line.
I'd also like to do so while facetting on a variable, thus plotting multiple variables at once.
An example code of what I'd plot is as follows:
dt <- structure(list(Firm = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("a", "b", "c", "d"), class = "factor"), Year = c(2008L,
2009L, 2008L, 2009L, 2008L, 2009L, 2008L, 2009L, 2008L, 2009L,
2008L, 2009L, 2008L, 2009L, 2008L, 2009L, 2008L, 2009L, 2008L,
2009L, 2008L, 2009L, 2008L, 2009L), variable = structure(c(1L,
1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L,
3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("var1", "var2", "var3"
), class = "factor"), value = c(0.991894223, 2.216322113, 3.189415462,
0.663732077, 0.444826423, 2.674568191, 1.272077011, 7.691464914,
4.263339855, 0.214415839, 3.995328653, 6.028747322, 8.191459456,
0.16205906, 4.056495056, 5.17994524, 0.42435417, 0.678655669,
6.246411921, 0.505532339, 4.65045746, 8.85141854, 5.850616048,
2.028583225)), .Names = c("Firm", "Year", "variable", "value"
), class = "data.frame", row.names = c(NA, -24L))
Firm Year variable value
1 a 2008 var1 0.9918942
2 a 2009 var1 2.2163221
3 a 2008 var2 3.1894155
4 a 2009 var2 0.6637321
5 a 2008 var3 0.4448264
6 a 2009 var3 2.6745682
I now manually calculate the ymin, ymax, and y for the ribbon / line plots. They're plotting just fine.
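dt_aggregates <- dt %>%
  group_by(variable, Year) %>%
  arrange(variable, Year) %>%
  # the summarising step is missing from the post; a median is assumed here
  # from the description above
  summarise(y = median(value))
dt_aggregates$ymin <- 0.9*dt_aggregates$y
dt_aggregates$ymax <- 1.1*dt_aggregates$y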
This will be the firm I want to highlight:
dt_focus <- filter(dt, Firm=="a")
The following plots just fine, and is almost what I want.
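g <- ggplot(data=dt_aggregates,
            # aes() arguments other than group are assumed; the original line is missing
            aes(x=Year, y=y, ymin=ymin, ymax=ymax,
                group=variable)) +
  facet_grid(variable~., scales="free") + geom_line() + geom_ribbon(alpha=0.3)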
However, I want to add another line (for the one firm) onto this (in red).
Once I try to add a new line with a separate dataframe, I get the following error. Any help on getting this to work is greatly appreciated
# Error in eval(expr, envir, enclos) : object 'ymax' not found
g + geom_line(data=dt_focus,
aes(x=Year, y=value, group=variable),
col="red", size=2)
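One likely reading of the error is that the new layer inherits the plot-level aesthetics, including ymin and ymax, and then cannot find those columns in the second data frame. A minimal sketch of that behaviour and of one possible fix (inherit.aes = FALSE), using small made-up data frames `agg` and `focus` rather than the data above:

```r
library(ggplot2)

# Made-up aggregate data with a ribbon band, and a separate "focus" series
agg <- data.frame(Year = c(2008, 2009), y = c(2, 3))
agg$ymin <- 0.9 * agg$y
agg$ymax <- 1.1 * agg$y
focus <- data.frame(Year = c(2008, 2009), value = c(1, 2.2))

# ymin and ymax are mapped at the plot level, so every layer inherits them
g <- ggplot(agg, aes(x = Year, y = y, ymin = ymin, ymax = ymax)) +
  geom_ribbon(alpha = 0.3) +
  geom_line()

# The extra layer inherits ymin/ymax, which do not exist in `focus`
err <- tryCatch(
  ggplot_build(g + geom_line(data = focus, aes(y = value), colour = "red")),
  error = function(e) e
)

# Possible fix: stop the layer from inheriting the plot-level mapping
g_fixed <- g + geom_line(data = focus, aes(x = Year, y = value),
                         colour = "red", inherit.aes = FALSE)
focus_layer <- layer_data(g_fixed, 3)
```

An alternative with the same effect is to map ymin and ymax only inside geom_ribbon() instead of in the top-level ggplot() call.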
