This question already has answers here:
Why does summarize or mutate not work with group_by when I load `plyr` after `dplyr`?
I'm trying to calculate the difference between the minimum and maximum date by group in R. I found code to achieve this here. However, replicating the example does not produce the expected result. This is the example dataset that was used:
HS_Hatch <- structure(list(ClutchID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L
), DateVisit = c("3/15/2012", "3/18/2012", "3/20/2012", "4/1/2012",
"4/3/2012", "3/18/2012", "3/20/2012", "3/22/2012", "4/3/2012",
"4/4/2012", "3/22/2012", "4/3/2012", "4/4/2012", "3/18/2012",
"3/20/2012", "3/22/2012", "4/2/2012", "4/3/2012", "4/4/2012",
"3/20/2012", "3/22/2012", "3/25/2012", "3/27/2012", "4/4/2012",
"4/5/2012"), Year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L), Survive = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("ClutchID",
"DateVisit", "Year", "Survive"), spec = structure(list(cols = structure(list(
ClutchID = structure(list(), class = c("collector_integer",
"collector")), DateVisit = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_integer",
"collector")), Survive = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("ClutchID", "DateVisit", "Year",
"Survive")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
This was the proposed solution using dplyr:
library(dplyr)
HS_Hatch <- HS_Hatch %>%
mutate(date_visit = as.Date(DateVisit, "%m/%d/%Y"))
exposure <- HS_Hatch %>%
group_by(ClutchID) %>%
summarize(first_visit = min(date_visit),
last_visit = max(date_visit),
exposure = last_visit - first_visit)
This is the expected result:
ClutchID first_visit last_visit exposure
<int> <date> <date> <dbl>
1 1 2012-03-15 2012-04-03 19
2 2 2012-03-18 2012-04-04 17
3 3 2012-03-22 2012-04-04 13
4 4 2012-03-18 2012-04-04 17
5 5 2012-03-20 2012-04-05 16
This is the actual result:
first_visit last_visit exposure
1 2012-03-15 2012-04-05 21 days
It seems that the grouping factor gets ignored. How do I make it calculate the date difference per ClutchID?
It works with just dplyr loaded.
Change summarize to dplyr::summarize to make it unambiguous: when plyr is loaded after dplyr, plyr's summarize masks dplyr's and ignores the grouping. I would suggest not using plyr, as you can do everything with dplyr and the tidyverse.
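As a minimal sketch of the namespaced call suggested above: the four-row `visits` data frame is a hypothetical stand-in for HS_Hatch, and `as.numeric()` is added here only to get a plain number of days.

```r
library(dplyr)

# Hypothetical stand-in for HS_Hatch: two clutches, two visits each
visits <- data.frame(
  ClutchID  = c(1L, 1L, 2L, 2L),
  DateVisit = c("3/15/2012", "4/3/2012", "3/18/2012", "4/4/2012")
)

exposure <- visits %>%
  mutate(date_visit = as.Date(DateVisit, "%m/%d/%Y")) %>%
  group_by(ClutchID) %>%
  # The dplyr:: prefix keeps the call unambiguous even if plyr is attached later
  dplyr::summarize(first_visit = min(date_visit),
                   last_visit  = max(date_visit),
                   exposure    = as.numeric(last_visit - first_visit))
```

With the prefix in place, the result has one row per ClutchID regardless of the order in which the packages were loaded.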
After the import of the dataframe, try this
# Parse the text dates into a new date_visit column (tz = "UTC" is an assumption; use your own time zone)
HS_Hatch$date_visit <- as.POSIXct(HS_Hatch$DateVisit, format = "%m/%d/%Y", tz = "UTC")
Then change your dplyr pipe to:
HS_Hatch <- HS_Hatch %>%
group_by(ClutchID) %>%
summarize(first_visit = min(date_visit),
last_visit = max(date_visit),
exposure = last_visit - first_visit)
This gave the expected result. It works because POSIXct stores time as the number of seconds since "the origin" (1970-01-01 UTC), so differences can be calculated directly.
Related
The x-axis labels aren't showing in my ggplot and I can't figure out what the issue is. I tried changing scale_x_continuous to scale_x_discrete, but that wasn't the issue. Here are the data and the code:
dput(df)
structure(list(variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "X..i..", class = "factor"),
value = c(0.86535786015671, 0.270518550067837, 0.942648772841964,
3.99444934081099, 1.11759146288817, 1.54510976425154, 2.44547105239855,
2.2564822479637, 0.806268193902794, 0.334684787222841, 0.279275582280181,
0.506202944652795, 0.00974858004556866, 0.274742461635902,
0.22071873199716, 0.289511637643534, 0.352185038116792, 0.834072418861261,
1.34338149120735, 1.74931508000265, 1.49348843361896, 4.07991249877895,
1.37225152308336, 0.812438174787708, 0.870119514197706, 1.12552827647611,
0.981401242191818, 0.811544940639505, 0.270314252804909,
0.00129424269740973, 0.138397649461267, 0.320412520877311,
0.200638317328505, 0.311317976283425, 2.27515845904203, 0.701130150695764,
1.19347381779438, 1.74260582346705, 2.04812451743241, 3.30525861365071,
1.09525257544341, 2.6941909849432, 1.24879308689346, 2.32559594481724,
0.489685734592222, 0.401412018111572, 0.209957274618462,
0.715330877881211, 0.844512982038313, 0.220417574806829,
0.440151738500053, 1.32486291268667, 0.771676730656983, 1.295145890213,
2.410181199299, 2.41520949303317, 2.07420663366187, 1.45105393420989,
1.94026424903487, 1.06019651909079, 1.21389399141063, 0.526835419170636,
0.392643071856425, 0.07366669912048, 0.376156996326127, 0.461881411637594,
0.236855843259622, 0.367884917633423), year = c(2005L, 2006L,
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L,
2016L, 2017L, 2018L, 2019L, 2020L, 2021L, 2005L, 2006L, 2007L,
2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L,
2017L, 2018L, 2019L, 2020L, 2021L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 2019L, 2020L, 2021L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L,
2019L, 2020L, 2021L), tenor = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L), .Label = c("1", "5", "10", "average"), class = "factor")), row.names = c(NA,
-68L), class = "data.frame")
ggplot(df, aes(year, value, color = tenor)) +
geom_line(size=0.5) + scale_x_continuous(breaks = seq(1:17),labels = seq(2005,2021)) +
geom_point() +
xlab("year")
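If you want to force ggplot to plot every x-axis label, you can use scale_x_continuous(breaks = 2005:2021) or breaks = df$year. Your original breaks = seq(1:17) places the breaks at 1 to 17, which is outside the 2005 to 2021 range of the data, so no labels appear.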
ggplot(df, aes(year, value, color = tenor)) +
geom_line(size=0.5) +
scale_x_continuous(breaks = df$year) +
geom_point() +
xlab("year")
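A small variation on the same idea, sketched on hypothetical toy data (the `toy` data frame and `year_breaks` are illustration names, not from the question): deriving the breaks from the distinct years avoids passing duplicated values.

```r
library(ggplot2)

# Hypothetical toy data: two tenors observed over five years
toy <- data.frame(
  year  = rep(2005:2009, times = 2),
  value = c(0.9, 0.3, 0.9, 4.0, 1.1, 0.8, 1.3, 1.7, 1.5, 4.1),
  tenor = rep(c("1", "5"), each = 5)
)

# One break per distinct year, taken from the data itself
year_breaks <- sort(unique(toy$year))

p <- ggplot(toy, aes(year, value, color = tenor)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = year_breaks) +
  xlab("year")
```

Because the breaks lie on the same scale as the year column, no separate labels argument is needed.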
When I produce a frequency plot:
Data <- structure(list(Venue = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("Conference", "Journal"), class = "factor"), Year = c(2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 2019L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L), Frequency = c(0L, 0L, 0L, 0L, 1L,
1L, 2L, 1L, 4L, 4L, 11L, 3L, 2L, 1L, 0L, 0L, 3L, 5L, 3L, 7L,
8L, 19L, 10L)), class = "data.frame", row.names = c(NA, -23L))
library(ggplot2)
ggplot(Data, aes(x = Year, y = Frequency, fill = Venue, label = Frequency)) +
geom_bar(stat = "identity") +
geom_text(size = 3, position = position_stack(vjust = 0.5))
The plot shows labels for the zero values, and the years on the x-axis do not match the years in the data frame.
How can I remove the zero-frequency labels from the plot (while keeping the record for each year, e.g. 2012, in the plot) and show every year on the x-axis, one per bar?
Is this what you want?
The code to get it is:
ggplot(Data, aes(x = as.character(Year), y = Frequency, fill = Venue,
label = ifelse(Frequency > 0, Frequency, NA))) +
geom_bar(stat = "identity") +
geom_text(size = 3, position = position_stack(vjust = 0.5)) +
scale_x_discrete(name ="Year")
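An alternative sketch, on hypothetical toy data, that drops the zero rows from the label layer only instead of turning them into missing values. The names `toy` and `labelled` are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data with some zero frequencies
toy <- data.frame(
  Venue     = rep(c("Journal", "Conference"), each = 3),
  Year      = rep(2011:2013, times = 2),
  Frequency = c(0L, 1L, 1L, 0L, 3L, 5L)
)

# Only rows with a positive frequency get a text label
labelled <- subset(toy, Frequency > 0)

p <- ggplot(toy, aes(x = factor(Year), y = Frequency, fill = Venue)) +
  geom_col() +
  geom_text(data = labelled, aes(label = Frequency),
            size = 3, position = position_stack(vjust = 0.5)) +
  scale_x_discrete(name = "Year")
```

The bars are still drawn from the full data, so every year keeps its place on the discrete x-axis even when all its frequencies are zero.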
First of all, I would like to apologise if I do not use the correct jargon.
I have the dataset below, which contains a wide range of categories.
Here is an excerpt from dput (using droplevels):
structure(list(
x = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L), # ME: there are more years than 2010...
y = c(7.85986, 185.81068, 107.24097, 7094.74649,
1.4982, 185.77319, 5090.79354, 167.58584, 4189.64609, 157.08277,
3927.06932, 2.86732, 71.683, 4.70123, 117.53085, 2.93452, 73.36292,
1.4982, 18.18734, 901.14744, 0.90268, 13.77532, 613.38298, 0.01845,
0.0681, 7.19925, 3.75315, 0.14333, 136.54008, 0.04766, 0.59077,
28.97255, 0.38608, 115.05258, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
x1 = structure(c(4L, 2L, 3L, 1L, 4L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 4L, 2L, 1L, 4L, 2L, 1L, 4L, 2L,
1L, 2L, 4L, 1L, 4L, 2L, 1L, 4L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L), .Label = c("All greenhouse gases - (CO2 equivalent)",
"CH4", "CO2", "N2O"), class = "factor"),
x2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Austria",
class = "factor"),
x4 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 10L,
10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 13L, 13L, 14L, 14L,
15L, 15L, 16L, 16L, 17L, 17L, 18L, 18L), .Label = c("3",
"3.1", "3.A", "3.A.1", "3.A.2", "3.A.3", "3.A.4", "3.B",
"3.B.1", "3.B.2", "3.B.3", "3.B.4", "3.B.5", "3.C", "3.C.1",
"3.C.2", "3.C.3", "3.C.4"), class = "factor")), class = "data.frame",
row.names = c(NA,
-44L))
I want to know whether the sum of the subcategories in x4 (e.g. 3.B.1 + 3.B.2 + ... + 3.B.n) equals the figure stated for the parent category (e.g. 3.B), i.e. the sum stated in the CSV, for a given year and country. I want to verify the sums.
To get the sum of the subcategories I have this:
sum(df$y[df$x4 %in% c("3.A.1", "3.A.2", "3.A.3", "3.A.4") & df$x ==
2010 & df$x2 == "Austria"])
To get the sum of the parent category I have this:
sum(df$y[df$x4 %in% c("3.A") & df$x == 2010 & df$x2 == "Austria"])
Next I would need an operation which checks whether the results of both are equal (TRUE/FALSE). However, I have more than 20 countries, 20 years, and dozens of categories to check. With my newbie approach I would be writing code for ages...
Is there any way to automate this? Basically, I am looking for code that is able to do the following:
1) Run for one category, go to next one
2) once done with categories change year and start again with categories
3) ... same for countries....
Any sort of help would be appreciated, even a suggestion on how to use the right jargon in the title. Thanks in any case.
Here's a potential solution using dplyr (might require some tweaking based on the full dataset):
require(dplyr)
# Create two columns - one that shows only the parent category number, and one that tells you if it's a parent or child; note that the regex here makes some assumptions on the format of your data.
mutate(df,parent=gsub("(.?\\..?)\\..*", "\\1", df$x4),
type=ifelse(parent==x4,"Parent","Child")) %>%
# Sum the children y's by category, year and country
group_by(parent, type, x, x2) %>%
summarize(sum(y)) %>%
# See if the sum of the children is equal to the parent y
tidyr::spread(type,`sum(y)`) %>%
mutate(equals=isTRUE(all.equal(Child,Parent)))
Result using your (new) data:
parent x x2 Child Parent equals
<chr> <int> <fct> <dbl> <dbl> <lgl>
1 3 2010 Austria NA 7396. FALSE
2 3.1 2010 Austria NA 5278. FALSE
3 3.A 2010 Austria 4357. 4357. TRUE
4 3.B 2010 Austria 921. 921. TRUE
5 3.C 2010 Austria 0 0 TRUE
I can see from your new data that you have two levels of parents. My solution will only work for the second level (e.g. 3.1 and its children), but can be easily tweaked to also work for the top level.
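For comparison, here is a minimal base R sketch of the same parent/child check on hypothetical toy data. The `toy` values, the slightly stricter regular expression, and the 1e-8 tolerance are all assumptions for illustration. It uses a vectorised tolerance comparison, so each row is tested on its own.

```r
# Hypothetical toy data: two second-level parents with two children each
toy <- data.frame(
  x  = 2010L,
  x2 = "Austria",
  x4 = c("3.A", "3.A.1", "3.A.2", "3.B", "3.B.1", "3.B.2"),
  y  = c(10, 4, 6, 5, 2, 2)
)

# Keep the first two dot-separated parts as the parent code
toy$parent <- sub("^([^.]+\\.[^.]+)\\..*$", "\\1", toy$x4)
toy$type   <- ifelse(toy$parent == toy$x4, "Parent", "Child")

# Sum y by parent, type, year and country
sums <- aggregate(y ~ parent + type + x + x2, data = toy, FUN = sum)

# One row per parent/year/country with y.Child and y.Parent side by side
wide <- reshape(sums, idvar = c("parent", "x", "x2"),
                timevar = "type", direction = "wide")

# Row-wise comparison with a small numeric tolerance
wide$equals <- abs(wide$y.Child - wide$y.Parent) < 1e-8
```

In this toy data the children of 3.A add up to the parent figure, while those of 3.B do not, so the two rows get different results.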
I have a dataset described by the following:
> dput(droplevels(head(sample,10)))
structure(list(Team = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "Air-Force", class = "factor"), Year = c(2003L,
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2011L, 2012L, 2013L
), Grouped_Position_3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "Skill", class = "factor"), Avg_Rating = c(0.7667,
0, 0.7444, 0.7222, 0, 0.7556, 0.76224, 0.596322222222222, 0.706584615384615,
0.767509090909091), n = c(1L, 1L, 3L, 6L, 1L, 1L, 5L, 9L, 13L,
11L)), .Names = c("Team", "Year", "Grouped_Position_3", "Avg_Rating",
"n"), row.names = c(NA, 10L), class = "data.frame")
In the full dataset there are multiple schools, grouped positions and years. What I'm trying to do is figure out how to generate a rolling average using the current year and four years in the past for each unique group of school, year and position. For example for 2013, Air Force and Skill position I would like the following calculation to take place (Note that 2010 is missing in the data):
(.767+.70+.59+0+.762)/5
The 0 comes from the missing year. I have looked at the zoo library in combination with dplyr, but I haven't been able to control for missing values like this. Am I looking at having to write a loop, or is there some package in R that has this capability?
Create a function Avg which, given a vector of row numbers ix, takes the required average over the most recent 5 years, and then apply it with rollapplyr within each group of Team and Grouped_Position_3:
library(zoo)
Avg <- function(ix) with(sample[ix, ], sum(Avg_Rating[Year >= max(Year) - 4]) / 5)
transform(sample, Avg = ave(1:nrow(sample), Team, Grouped_Position_3, FUN =
function(ix) rollapplyr(ix, 5, Avg, partial = TRUE)))
giving:
Team Year Grouped_Position_3 Avg_Rating n Avg
1 Air-Force 2003 Skill 0.7667000 1 0.1533400
2 Air-Force 2004 Skill 0.0000000 1 0.1533400
3 Air-Force 2005 Skill 0.7444000 3 0.3022200
4 Air-Force 2006 Skill 0.7222000 6 0.4466600
5 Air-Force 2007 Skill 0.0000000 1 0.4466600
6 Air-Force 2008 Skill 0.7556000 1 0.4444400
7 Air-Force 2009 Skill 0.7622400 5 0.5968880
8 Air-Force 2011 Skill 0.5963222 9 0.4228324
9 Air-Force 2012 Skill 0.7065846 13 0.5641494
10 Air-Force 2013 Skill 0.7675091 11 0.5665312
Note
The input used is:
sample <- structure(list(Team = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "Air-Force", class = "factor"), Year = c(2003L,
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2011L, 2012L, 2013L
), Grouped_Position_3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "Skill", class = "factor"), Avg_Rating = c(0.7667,
0, 0.7444, 0.7222, 0, 0.7556, 0.76224, 0.596322222222222, 0.706584615384615,
0.767509090909091), n = c(1L, 1L, 3L, 6L, 1L, 1L, 5L, 9L, 13L,
11L)), .Names = c("Team", "Year", "Grouped_Position_3", "Avg_Rating",
"n"), row.names = c(NA, 10L), class = "data.frame")
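The same windowed average can also be sketched without zoo. The version below is an illustration under assumptions, not the method above: `ratings` is hypothetical rounded data and `window_avg` is a helper name introduced here. For each row it sums the ratings whose Year falls in the five-year window ending at that row's Year and divides by 5, so missing years count as 0.

```r
# Hypothetical rounded data; note that 2010 is missing
ratings <- data.frame(
  Team               = "Air-Force",
  Grouped_Position_3 = "Skill",
  Year               = c(2009L, 2011L, 2012L, 2013L),
  Avg_Rating         = c(0.76, 0.60, 0.70, 0.77)
)

# Hypothetical helper: sum of values in (year - width, year], divided by width
window_avg <- function(years, values, width = 5) {
  vapply(seq_along(years), function(i) {
    in_window <- years > years[i] - width & years <= years[i]
    sum(values[in_window]) / width
  }, numeric(1))
}

# Apply the helper within each Team / Grouped_Position_3 group
ratings$Avg <- ave(
  seq_len(nrow(ratings)), ratings$Team, ratings$Grouped_Position_3,
  FUN = function(ix) window_avg(ratings$Year[ix], ratings$Avg_Rating[ix])
)
```

Selecting by Year rather than by row position is what lets the gap at 2010 contribute a zero to the 2013 average.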
I'm trying to generate a plot that summarizes a dataset by first plotting the median & quantiles in an area / black line, after which I want to outline a specific 'firm' with a red line.
I'd also like to do so while facetting on a variable, thus plotting multiple variables at once.
An example code of what I'd plot is as follows:
require(dplyr)
require(ggplot2)
dt <- structure(list(Firm = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("a", "b", "c", "d"), class = "factor"), Year = c(2008L,
2009L, 2008L, 2009L, 2008L, 2009L, 2008L, 2009L, 2008L, 2009L,
2008L, 2009L, 2008L, 2009L, 2008L, 2009L, 2008L, 2009L, 2008L,
2009L, 2008L, 2009L, 2008L, 2009L), variable = structure(c(1L,
1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L,
3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("var1", "var2", "var3"
), class = "factor"), value = c(0.991894223, 2.216322113, 3.189415462,
0.663732077, 0.444826423, 2.674568191, 1.272077011, 7.691464914,
4.263339855, 0.214415839, 3.995328653, 6.028747322, 8.191459456,
0.16205906, 4.056495056, 5.17994524, 0.42435417, 0.678655669,
6.246411921, 0.505532339, 4.65045746, 8.85141854, 5.850616048,
2.028583225)), .Names = c("Firm", "Year", "variable", "value"
), class = "data.frame", row.names = c(NA, -24L))
head(dt)
Firm Year variable value
1 a 2008 var1 0.9918942
2 a 2009 var1 2.2163221
3 a 2008 var2 3.1894155
4 a 2009 var2 0.6637321
5 a 2008 var3 0.4448264
6 a 2009 var3 2.6745682
I now manually calculate the ymin, ymax, and y for the ribbon / line plots. They're plotting just fine.
dt_aggregates <- dt %>%
group_by(variable, Year) %>%
arrange(variable, Year) %>%
summarize(y=median(value))
dt_aggregates$ymin <- 0.9*dt_aggregates$y
dt_aggregates$ymax <- 1.1*dt_aggregates$y
This will be the firm I want to highlight:
dt_focus <- filter(dt, Firm=="a")
The following plots just fine, and is almost what I want.
g <- ggplot(data=dt_aggregates,
aes(x=Year,
y=y,
ymax=ymax,
ymin=ymin,
group=variable)) +
facet_grid(variable~., scales="free") + geom_line() + geom_ribbon(alpha=0.3) +
theme_bw()
However, I want to add another line (for the one firm) onto this (in red).
Once I try to add a new line from a separate data frame, I get the following error. Any help on getting this to work is greatly appreciated.
# Error in eval(expr, envir, enclos) : object 'ymax' not found
g + geom_line(data=dt_focus,
aes(x=Year, y=value, group=variable),
col="red", size=2)
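One likely cause is that the new layer inherits ymin and ymax from the aes() in the ggplot() call, and those columns do not exist in dt_focus. A possible fix, sketched here on hypothetical toy data (`agg` and `focus` stand in for dt_aggregates and dt_focus), is to set inherit.aes = FALSE on the extra layer so that it only uses its own mapping.

```r
library(ggplot2)

# Hypothetical stand-in for dt_aggregates
agg <- data.frame(
  Year     = rep(c(2008, 2009), times = 2),
  variable = rep(c("var1", "var2"), each = 2),
  y        = c(1, 2, 3, 4)
)
agg$ymin <- 0.9 * agg$y
agg$ymax <- 1.1 * agg$y

# Hypothetical stand-in for dt_focus: no ymin/ymax columns
focus <- data.frame(
  Year     = rep(c(2008, 2009), times = 2),
  variable = rep(c("var1", "var2"), each = 2),
  value    = c(1.5, 2.5, 2, 5)
)

g <- ggplot(agg, aes(x = Year, y = y, ymin = ymin, ymax = ymax,
                     group = variable)) +
  facet_grid(variable ~ ., scales = "free") +
  geom_line() +
  geom_ribbon(alpha = 0.3) +
  theme_bw()

# inherit.aes = FALSE stops this layer from looking for ymin/ymax in focus
g2 <- g + geom_line(data = focus,
                    aes(x = Year, y = value, group = variable),
                    colour = "red", inherit.aes = FALSE)

built <- ggplot_build(g2)
```

An alternative with the same effect is to move the ymin and ymax mappings out of ggplot() and into geom_ribbon(), so that only the ribbon layer asks for them.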