How to add a year to the existing date list without erasing any of the existing ones - r

I have a data frame with Date and Velocity columns, shown below. My issue is that some years, such as 1945 and 1951, are missing.
I would like to add each missing year to Date exactly once, in its proper position (1945 between 1944 and 1946, for example). I know some years are repeated. The day and month are not very important, since they are only placeholders. I plan to set Velocity to 0 for all the added years (e.g. mm-dd-1945).
What I have
Date Velocity
2/23/1944 1
12/26/1944 2
1/7/1946 5
3/25/1947 8
4/14/1948 10
6/18/1949 12
1/31/1950 13
12/7/1950 14
1/27/1952 15
I tried doing the following
NewYear <- complete(Data,Date = seq.Date(min(Data$Date),
max(Data$Date), by="year"))
but all of the existing dates get overwritten and I end up with this
Date Velocity
2/23/1944 NA
2/23/1945 NA
2/23/1946 NA
2/23/1947 NA
2/23/1948 NA
2/23/1949 NA
2/23/1950 NA
2/23/1951 NA
2/23/1952 NA
Desired Output
Date Velocity
2/23/1944 1
12/26/1944 2
1/01/1945 0
1/7/1946 5
3/25/1947 8
4/14/1948 10
6/18/1949 12
1/31/1950 13
12/7/1950 14
1/1/1951 0
1/27/1952 15

First extract the year from each date, then use complete to add rows for the missing years, and finally replace each missing Date with the first day of that year.
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y"),
Year = as.integer(format(Date, "%Y"))) %>%
tidyr::complete(Year = seq(min(Year), max(Year)), fill = list(Velocity = 0)) %>%
mutate(Date = if_else(is.na(Date), as.Date(paste0(Year, "-01-01")), Date))
# Year Date Velocity
# <int> <date> <dbl>
# 1 1944 1944-02-23 1
# 2 1944 1944-12-26 2
# 3 1945 1945-01-01 0
# 4 1946 1946-01-07 5
# 5 1947 1947-03-25 8
# 6 1948 1948-04-14 10
# 7 1949 1949-06-18 12
# 8 1950 1950-01-31 13
# 9 1950 1950-12-07 14
#10 1951 1951-01-01 0
#11 1952 1952-01-27 15
Add select(-Year) if you don't want the Year column in your final output.
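For readers who would rather avoid the tidyr dependency, the same idea can be sketched in base R. This is only an illustration on a copy of the question's data; the names missing_years, filler and result are made up here, and January 1 is assumed as the placeholder day, as in the answer above.

```r
# Copy of the question's data (dates as month/day/year text).
df <- data.frame(
  Date = c("2/23/1944", "12/26/1944", "1/7/1946", "3/25/1947", "4/14/1948",
           "6/18/1949", "1/31/1950", "12/7/1950", "1/27/1952"),
  Velocity = c(1, 2, 5, 8, 10, 12, 13, 14, 15)
)
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

# Years between the first and last observation that never occur.
years <- as.integer(format(df$Date, "%Y"))
missing_years <- setdiff(seq(min(years), max(years)), years)

# One placeholder row per missing year, with Velocity set to 0.
filler <- data.frame(
  Date = as.Date(sprintf("%d-01-01", missing_years)),
  Velocity = rep(0, length(missing_years))
)

# Append the placeholders and restore chronological order.
result <- rbind(df, filler)
result <- result[order(result$Date), ]
rownames(result) <- NULL
result
```

Existing rows are never modified here; the placeholder rows are only appended and then sorted into position.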

Related

r - Fill in missing years in Data frame [duplicate]

I have some data in R that looks like this.
year freq
<int> <int>
1902 2
1903 2
1905 1
1906 4
1907 1
1908 1
1909 1
1912 1
1914 1
1915 1
The data was read in using the following code.
data = read.csv("earthquakes.csv")
my_var <- c('year')
new_data <- data[my_var]
counts <- count(data, 'year')
This is 1 page of a 7 page table. I need to fill in the missing years with a count of 0 from 1900-1999. How would I go about this? I haven't been able to find an example online where year is the primary column.
We may use complete on the 'counts' data
library(tidyr)
complete(counts, year = 1900:1999, fill = list(freq = 0))
1) Convert the input, shown in the Note, to zoo class and then to ts class. The latter will fill in the missing years with NA. Replace the NAs with 0, convert back to a data frame and set the names to the original names.
If a ts series is ok as output then omit the last two lines. If in addition it is ok to use NA rather than 0 then omit the last three lines.
library(zoo)
DF |>
read.zoo() |>
as.ts() |>
na.fill(0) |>
fortify.zoo() |>
setNames(names(DF))
giving:
year freq
1 1902 2
2 1903 2
3 1904 0
4 1905 1
5 1906 4
6 1907 1
7 1908 1
8 1909 1
9 1910 0
10 1911 0
11 1912 1
12 1913 0
13 1914 1
14 1915 1
2) For a base solution use merge. Omit the last line if NA is ok instead of 0.
m <- merge(DF, data.frame(year = min(DF$year):max(DF$year)), all = TRUE)
transform(m, freq = replace(freq, is.na(freq), 0))
Note
Lines <- "year freq
1902 2
1903 2
1905 1
1906 4
1907 1
1908 1
1909 1
1912 1
1914 1
1915 1"
DF <- read.table(text = Lines, header = TRUE)
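Both solutions above fill only the span between the first and last observed year. A minimal base R sketch for the fixed range 1900 to 1999 that the question asks for could use match(); all_years and filled are illustrative names, and DF is rebuilt here so the snippet runs on its own.

```r
# Same data as in the Note, rebuilt so this snippet is self-contained.
DF <- data.frame(
  year = c(1902, 1903, 1905, 1906, 1907, 1908, 1909, 1912, 1914, 1915),
  freq = c(2, 2, 1, 4, 1, 1, 1, 1, 1, 1)
)

# Every year we want in the output, observed or not.
all_years <- 1900:1999

# match() returns NA for years absent from DF, so freq is NA there.
filled <- data.frame(
  year = all_years,
  freq = DF$freq[match(all_years, DF$year)]
)

# Replace the NAs with a count of 0.
filled$freq[is.na(filled$freq)] <- 0
head(filled, 8)
```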

How to subtract each Country's value by year

I have data for each Country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other e.g. how did happiness rank change from 2015 to 2017/2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
I don't have a clear path to accomplish what I want, but the result would look like:
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
group_by(Country) %>%
mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
select(Year, Country, Happiness.Rank, Rank.Diff) %>%
arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
Y2015.2016 = df_wide$Happiness.Rank.y -
df_wide$Happiness.Rank.x,
Y2016.2017 = df_wide$Happiness.Rank -
df_wide$Happiness.Rank.y
)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
Assuming the three datasets are present in your environment with the name data2015, data2016 and data2017, we can add a year column with the respective year and keep the columns which are present in keepcols vector. arrange the data by Country and Year, group_by Country, keep only those countries which are present in all 3 years and then subtract the values from previous rows using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
filter(n() == 3) %>%
mutate_at(-1, ~. - lag(.)) #OR
#mutate_at(-1, ~c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The first row for each Country will always be NA; every other value is the difference from that country's previous row.
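The same steps (sort, keep countries with all three years, difference within each country) can be sketched in base R with ave(). The toy data frame below is invented for illustration and is not taken from the happiness dataset.

```r
# Invented toy data: only country "A" has all three years.
toy <- data.frame(
  Country = c("A", "B", "A", "B", "A", "C"),
  Year = c(2015, 2015, 2016, 2016, 2017, 2017),
  Happiness.Rank = c(10L, 20L, 12L, 18L, 9L, 5L)
)

# Sort so that consecutive rows within a country are consecutive years.
toy <- toy[order(toy$Country, toy$Year), ]

# Count rows per country and keep only countries present in all 3 years.
n_years <- ave(toy$Year, toy$Country, FUN = length)
complete3 <- toy[n_years == 3, ]

# Difference from the previous year within each country (first row is NA).
complete3$Rank.Diff <- ave(
  complete3$Happiness.Rank, complete3$Country,
  FUN = function(x) c(NA, diff(x))
)
complete3
```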

Calculating age per animal by subtracting years in R

I am looking to calculate relative age of animals. I need to subtract sequentially each year from the next for each animal in my dataset. Because an animal can have multiple reproductive events in a year, I need the age for the remaining events in that year (i.e. all events after the first) to be the same as the initial calculation.
Update:
The dataset more resembles this:
Year ID Age
1 1975 6 -1
2 1975 6 -1
3 1976 6 -1
4 1977 6 -1
6 1975 9 -1
8 1978 9 -1
And I need it to look like this
Year ID Age
1 1975 6 0
2 1975 6 0
3 1976 6 1
4 1977 6 2
6 1975 9 0
8 1978 9 3
Apologies for the initial confusion, if I wasn't clear on what I needed to accomplish.
Any help would be greatly appreciated.
Things done "by group" are usually easiest to do using dplyr or data.table
library(dplyr)
your_data %>%
group_by(ID) %>% # group by ID
mutate(Age = Year - min(Year)) # add new column
or
library(data.table)
setDT(your_data) # convert to data table
# add new column by group
your_data[, Age := Year - min(Year), by = ID]
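In base R, ave is probably the easiest way to add a groupwise column to existing data (note that the function must be passed by name as FUN, otherwise ave treats it as another grouping variable):
your_data$Age = with(your_data, ave(Year, ID, FUN = function(x) x - min(x)))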
but the syntax isn't as nice as the options above.
You can test on this data:
your_data = read.table(text = " Year ID Age
1 1975 6 -1
2 1975 6 -1
3 1976 6 -1
4 1977 6 -1
6 1975 9 -1
8 1978 9 -1 ", header = T)
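As a quick check of the groupwise logic, here is a self-contained base R run on the same six rows (the data frame is rebuilt by hand, without the placeholder Age column):

```r
# The six rows from the question.
your_data <- data.frame(
  Year = c(1975, 1975, 1976, 1977, 1975, 1978),
  ID = c(6, 6, 6, 6, 9, 9)
)

# Age relative to each animal's first recorded year; repeated years
# within an animal automatically get the same age.
your_data$Age <- ave(your_data$Year, your_data$ID,
                     FUN = function(x) x - min(x))
your_data$Age
# 0 0 1 2 0 3
```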
If you're trying to figure out the relative age based on one initial birth year, 1975 (which it seems like you are), then you can just make a new column called "RelativeAge" and set it equal to the year minus 1975:
data$RelativeAge = data$Year - 1975
Then just get rid of the original "Age" column, or rename it as necessary.

computing onset date of snowmelt in R [closed]

I have daily temperature in this format starting from 1950 to 2017
Data
I need to compute the snowmelt onset date, which is defined as the first day when the daily temperature is above 0 C, following the last five-day period between March and May when the daily temperature is below 0 C. My code so far:
df1<-read.csv("temp.csv")
require(dplyr)
# applying the condition to check each temperature value
df1$boolean<- ifelse(df1$temp<0.0 , 1, 0)
#computing the total sum < 0 and the start and end date
snow<-df1 %>%
mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
filter(boolean == 1 & n() > 1) %>%
summarize("Start Date"=min(as.character(date)),
"End Date"=max(as.character(date)),
"Length of Run"=n()) %>%
ungroup() %>%
select(-matches("group"))
colnames(snow)[3] <- 'length'
# keep only runs with length >= 5
obs<-subset(snow,length >=5)
The code above gives me a partial solution (with further manual editing I would get the ideal solution that matches my definition). I am only interested in one onset date for each year. I need some further guidance on how to edit this code to compute the onset date based on the definition above.
I have a number of locations, so manually editing this would not be an ideal solution.
Your help would be appreciated.
We have assumed in (1) that the melt day must occur in Mar, Apr or May and in (2) that only the 5 subzero days occur in Mar, Apr, May but the melt day could occur in June, say.
1) Define df2 which is df1 plus additional columns: month, year and code where code is 0 if the date is not in Mar, Apr, May and is otherwise 1 if temp < 0 and 2 if temp >= 0.
Now using df2 run rollapplyr on code returning TRUE if the most recent 6 dates have codes 1, 1, 1, 1, 1, 2 and otherwise FALSE. Take the TRUE rows and only keep the last in each year. Right join that to a data frame of all years in order to generate NAs in the output for any missing years.
library(dplyr)
library(zoo)
df2 <- df1 %>%
mutate(Date = as.Date(Date), month = as.numeric(format(Date, "%m")),
year = as.numeric(format(Date, "%Y")),
code = (month %in% 3:5) * ((temp < 0) + 2 * (temp >= 0)),
OK = rollapplyr(code, 6, identical, c(1, 1, 1, 1, 1, 2), fill = FALSE))
df2 %>%
filter(OK) %>%
filter(!duplicated(year, fromLast = TRUE)) %>%
right_join(unique(df2["year"]), by = "year") %>%
select(year, Date)
giving:
year Date
1 1950 1950-05-24
2 1951 1951-05-21
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1959 <NA>
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-05-13
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
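The sliding-window test used in (1), five codes of 1 followed by a code of 2, can be sketched in base R with a plain loop instead of rollapplyr. The temperature vector is invented, and for brevity the month condition is left out, so every day is coded 1 (below zero) or 2 (zero or above).

```r
# Invented daily temperatures.
temp <- c(-3, -2, -1, -4, -2, 1, 2, -1, -1, -1, -1, -1, -1, 3, -2, 4)

# 1 = below zero, 2 = zero or above (month restriction omitted here).
code <- ifelse(temp < 0, 1, 2)
target <- c(1, 1, 1, 1, 1, 2)

# ok[i] is TRUE when the six days ending at i match the target pattern.
n <- length(code)
ok <- rep(FALSE, n)
if (n >= 6) {
  for (i in 6:n) {
    ok[i] <- all(code[(i - 5):i] == target)
  }
}

which(ok)            # candidate onset days
tail(which(ok), 1)   # the last candidate, as kept per year in the answer
```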
2) In (1) we assumed that the melt onset day must be in Mar, Apr or May but here we assume that only the subzero days lie in that range and the melt onset day may extend further out.
Calculations are the same as in (1) except that the codes are now such that 1 indicates a subzero temperature in Mar, Apr or May, 2 indicates any temp above zero any time (not just in Mar, Apr and May) and 0 is anything else. We collapse the codes into a character string (one character per date) and use a regular expression on it to look for a substring of 5 ones followed by anything until we get to the next 2. We process the rest as in (1) except now we don't need the join since there will always be a melt onset day. Without the join we can represent this now as a single pipeline.
df1 %>%
mutate(Date = as.Date(Date), month = as.numeric(format(Date, "%m")),
year = as.numeric(format(Date, "%Y")),
code = (month %in% 3:5) * (temp < 0) + 2 * (temp >= 0),
OK = { g <- gregexpr("1{5}.*?2", paste(code, collapse = ""))[[1]]
seq_along(code) %in% (g + attr(g, "match.length") - 1) }) %>%
filter(OK) %>%
filter(!duplicated(year, fromLast = TRUE)) %>%
select(year, Date)
giving:
year Date
1 1950 1950-05-24
2 1951 1951-06-01
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1959 1959-06-02
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-06-01
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
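To see what the regular expression in (2) picks out, here is a small sketch on an invented code vector. It passes perl = TRUE to gregexpr, which is a choice made for this sketch rather than something the answer does, and ends is an illustrative name.

```r
# Invented codes: 1 = subzero day in Mar-May, 2 = day at or above zero,
# 0 = anything else.
code <- c(1, 1, 1, 1, 1, 0, 0, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2)
code_string <- paste(code, collapse = "")

# Five 1s, then as little as possible, up to the next 2.
g <- gregexpr("1{5}.*?2", code_string, perl = TRUE)[[1]]

# Position of the closing 2 of each match = start + length - 1.
ends <- as.integer(g) + as.integer(attr(g, "match.length")) - 1L
ends

# Logical marker per day, as used for the OK column in the answer.
ok <- seq_along(code) %in% ends
```

The first match spans positions 1 to 8 because the two 0 codes are absorbed by the lazy `.*?`, which is what lets the melt day fall outside the March to May window.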
A straightforward solution in tidyverse.
library(tidyverse)
library(lubridate)
readxl::read_excel("temp.xlsx") -> df1
df1 %>%
mutate(year = year(Date),
month = month(Date)) %>%
group_by(year) %>%
mutate(
below_0 = as.numeric(temp < 0),
streak5 = cumsum(below_0) - cumsum(lag(below_0, 5, 0)),
onset = month %in% c(3, 4, 5) & lag(streak5) == 5 & below_0 == 0) %>%
filter(onset) %>%
summarise(Date = last(Date))
Gives
# A tibble: 20 x 2
year Date
<dbl> <dttm>
1 1950 1950-05-24
2 1951 1951-05-21
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1960 1960-05-26
11 1961 1961-05-16
12 1962 1962-05-19
13 1963 1963-05-13
14 1964 1964-05-27
15 1965 1965-05-20
16 1966 1966-05-26
17 1967 1967-05-26
18 1968 1968-05-27
19 1969 1969-05-30
20 1970 1970-05-21
I hope the code more or less explains itself: streak5 is the number of sub-zero days among the most recent five days, onset implements the criteria given in the question, and summarise picks the last such date in each year.
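The cumsum trick behind streak5 can be checked on a toy vector in base R. Since lag() comes from dplyr, the sketch emulates it by padding with zeros; below_0 is invented, and lagged5 and prev_streak are illustrative names.

```r
# Invented indicator: 1 = day below zero, 0 = day at or above zero.
below_0 <- c(1, 1, 1, 1, 1, 1, 0, 1, 0)

# Emulate lag(below_0, 5, default = 0): shift right by 5, pad with zeros.
lagged5 <- c(rep(0, 5), head(below_0, -5))

# Sub-zero days among the most recent five days (current day included).
streak5 <- cumsum(below_0) - cumsum(lagged5)

# Onset: the previous day closed a full five-day sub-zero streak
# and the current day is not below zero.
prev_streak <- c(NA, head(streak5, -1))
onset <- !is.na(prev_streak) & prev_streak == 5 & below_0 == 0
which(onset)
```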
rle() to the rescue!
library(broom)
library(tidyverse)
temp <- read_csv("temp.csv")
Best read the pipe below first before reading this helper function.
For each year we:
take a run-length encoding of above/below 0
the first one that's TRUE (<0) and has 5+ consecutive days is our candidate
take the next index
if that's too much (no days that fit the criteria) return NA
else return that date
thus:
mk_runs <- function(xdf) {
  r <- rle(xdf$below_0) # take the T/F RLE
  pos <- which(r$values & r$lengths >= 5)[1] # find the first run meeting the criteria
  idx <- sum(r$lengths[1:pos]) + 1 # sum the lengths up to this point and add 1 to get to the first >= 0 day
  if (idx > nrow(xdf)) { # if past our date range return NA
    tibble(year = xdf$year[1], date = as.Date(NA))
  } else {
    xdf[idx, c("year", "date")]
  }
}
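The run-length steps can be traced on a small invented vector using only base R:

```r
# Invented indicator: TRUE = day below zero.
below_0 <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)

# Runs: lengths 2, 1, 5, 2 with values TRUE, FALSE, TRUE, FALSE.
r <- rle(below_0)

# First sub-zero run lasting at least five days.
pos <- which(r$values & r$lengths >= 5)[1]

# Index of the first day after that run.
idx <- sum(r$lengths[seq_len(pos)]) + 1
idx
below_0[idx]
```

Here the qualifying run is the third one, so idx is 2 + 1 + 5 + 1 = 9, which is the first day at or above zero after the five-day freeze.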
We need to get the data into shape:
separate(temp, Date, c("month", "day", "year")) %>%
mutate_all(as.numeric) %>%
mutate(year = ifelse(year >=50, 1900+year, 2000+year)) %>%
mutate(date = as.Date(sprintf("%04d-%02d-%02d", year, month, day))) %>%
mutate(month = lubridate::month(date)) %>%
mutate(below_0 = temp < 0) %>%
filter(month >= 3 & month <=5) %>%
group_by(year) %>% # year groups
arrange(date) %>% # in order
do(mk_runs(.)) %>% # see above function
print(n=21)
## # A tibble: 21 x 2
## # Groups: year [21]
## year date
## <dbl> <date>
## 1 1950 1950-04-30
## 2 1951 1951-05-21
## 3 1952 1952-05-28
## 4 1953 1953-05-15
## 5 1954 1954-05-28
## 6 1955 1955-05-14
## 7 1956 1956-05-02
## 8 1957 1957-05-07
## 9 1958 1958-04-27
## 10 1959 NA
## 11 1960 1960-04-24
## 12 1961 1961-05-16
## 13 1962 1962-05-19
## 14 1963 1963-05-13
## 15 1964 1964-05-20
## 16 1965 1965-05-20
## 17 1966 1966-05-07
## 18 1967 1967-04-27
## 19 1968 1968-05-10
## 20 1969 1969-05-22
## 21 1970 1970-05-21
Here is another attempt. In my first step, I created two new columns (year and month). Then I filtered the data to keep March through May. Then I created index numbers for rows with a temperature above 0 Celsius. This is done per year. Since five consecutive days must precede a day with a temperature above zero, index numbers of 5 or less need to be ignored. This is done by the inner if_else(), which sits in the true branch of the outer if_else().
In my second step, I chose to use a package called SOfun, which is developed by the author of splitstackshape. You can download this package from GitHub. getMyRows() does three things: 1) it identifies which rows should be considered, via pattern; 2) it gets a certain range of rows around each row marked in 1); and 3) it creates a list. Here range = -5:0 means that I am choosing the five rows preceding a target row, plus the target row itself.
In my third step, I subsetted mylist with two logical conditions. !is.na(x$ind[6]) checks that the 6th element of ind is not NA, and all(x$temp[1:5] < 0) checks that the 1st to 5th elements of temp (temperature) are all below zero. Filter() keeps the list elements that satisfy both conditions. Then I extracted the 6th row from each data frame, since that is the target row. I bound the list, grouped the data by year and chose the first observation for each year using slice().
library(devtools)
install_github("mrdwab/overflow-mrdwab")
install_github("mrdwab/SOfun")
library(overflow)
library(SOfun)
library(readxl)
library(dplyr)
# Part 1
mydf <- read_excel("temp.xlsx") %>%
mutate(year = as.numeric(format(Date, "%Y")),
month = as.numeric(format(Date, "%m"))) %>%
filter(between(month, 3, 5)) %>%
group_by(year) %>%
mutate(ind = if_else(temp > 0,
{ind <- row_number()
if_else(ind <= 5, NA_integer_, ind)},
NA_integer_)) %>%
ungroup
# Part 2
mylist <- getMyRows(mydf,
pattern = which(complete.cases(mydf$ind)),
range = -5:0, isNumeric = TRUE)
# Part 3
Filter(function(x) !is.na(x$ind[6]) & all(x$temp[1:5] < 0), mylist) %>%
lapply(function(x) x[6, ]) %>%
bind_rows %>%
group_by(year) %>%
slice(1) %>%
select(Date)
year Date
<dbl> <dttm>
1 1950 1950-04-30 00:00:00
2 1951 1951-05-21 00:00:00
3 1952 1952-05-28 00:00:00
4 1953 1953-05-15 00:00:00
5 1954 1954-05-28 00:00:00
6 1955 1955-05-14 00:00:00
7 1956 1956-05-02 00:00:00
8 1957 1957-05-07 00:00:00
9 1958 1958-04-27 00:00:00
10 1960 1960-04-24 00:00:00
11 1961 1961-05-16 00:00:00
12 1962 1962-05-19 00:00:00
13 1963 1963-05-13 00:00:00
14 1964 1964-05-20 00:00:00
15 1965 1965-05-20 00:00:00
16 1966 1966-05-07 00:00:00
17 1967 1967-04-27 00:00:00
18 1968 1968-05-10 00:00:00
19 1969 1969-05-22 00:00:00
20 1970 1970-05-21 00:00:00

Clean way to calculate both group and overall statistics

I would like to calculate the median not only for different groups of my data, but also the median over all groups, and store the result in a single data.frame. While accomplishing each of these tasks separately is easy, I have not found a clean way to do both at the same time.
Right now, I calculate both statistics separately, then join the results, then tidy the data if necessary. Here's an example of what this may look like if I wanted to know the delay per day and per month (the code uses the mean as the summary statistic):
library(dplyr)
library(hflights)
data(hflights)
# Calculate both statistics separately
per_day <- hflights %>%
group_by(Year, Month, DayofMonth) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Daily")
per_month <- hflights %>%
group_by(Year, Month) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Monthly", DayofMonth = NA)
# Join into a single data.frame
my_summary <- full_join(per_day, per_month,
by = c("Year", "Month", "DayofMonth", "Interval", "Delay"))
my_summary
# Source: local data frame [377 x 5]
# Groups: Year, Month
#
# Year Month DayofMonth Delay Interval
# 1 2011 1 1 10.067642 Daily
# 2 2011 1 2 10.509745 Daily
# 3 2011 1 3 6.038627 Daily
# 4 2011 1 4 7.970740 Daily
# 5 2011 1 5 4.172650 Daily
# 6 2011 1 6 6.069909 Daily
# 7 2011 1 7 3.907295 Daily
# 8 2011 1 8 3.070140 Daily
# 9 2011 1 9 17.254325 Daily
# 10 2011 1 10 11.040388 Daily
# .. ... ... ... ... ...
Are there better ways to do this?
(Note that in many cases one could easily progressively roll up summaries as pointed out in the Introduction to dplyr. However, this doesn't work for statistics like median, mean etc.)
As a one-off table. This is fairly straightforward in data.table:
require(data.table)
setDT(hflights)[,{
mo_del <- mean(ArrDelay,na.rm=TRUE)
.SD[,.(DailyDelay = mean(ArrDelay,na.rm=TRUE),MonthlyDelay = mo_del),by=DayofMonth]
},by=.(Year,Month)]
# Year Month DayofMonth DailyDelay MonthlyDelay
# 1: 2011 1 1 10.0676417 4.926065
# 2: 2011 1 2 10.5097451 4.926065
# 3: 2011 1 3 6.0386266 4.926065
# 4: 2011 1 4 7.9707401 4.926065
# 5: 2011 1 5 4.1726496 4.926065
# ---
# 361: 2011 12 14 1.0293610 5.013244
# 362: 2011 12 17 -0.1049822 5.013244
# 363: 2011 12 24 -4.1457490 5.013244
# 364: 2011 12 25 -2.2976827 5.013244
# 365: 2011 12 31 46.4846491 5.013244
How it works. The basic syntax is DT[i,j,by].
With by=.(Year,Month), all operations in j are done per "by group."
We can nest another "by group" using the data.table of the current Subset of Data, .SD.
To return columns in j we use .(colname1=col1,colname2=col2,...).
Creating new variables. Alternately, we could create new variables in hflights using := in j.
hflights[,DailyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month,DayofMonth)]
hflights[,MonthlyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month)]
Then we can view the summary table:
hflights[,.GRP,.(Year,Month,DayofMonth,DailyDelay,MonthlyDelay)]
# Year Month DayofMonth DailyDelay MonthlyDelay .GRP
# 1: 2011 1 1 10.0676417 4.926065 1
# 2: 2011 1 2 10.5097451 4.926065 2
# 3: 2011 1 3 6.0386266 4.926065 3
# 4: 2011 1 4 7.9707401 4.926065 4
# 5: 2011 1 5 4.1726496 4.926065 5
# ---
# 361: 2011 12 14 1.0293610 5.013244 361
# 362: 2011 12 17 -0.1049822 5.013244 362
# 363: 2011 12 24 -4.1457490 5.013244 363
# 364: 2011 12 25 -2.2976827 5.013244 364
# 365: 2011 12 31 46.4846491 5.013244 365
(Something needed to be put in j here, so I used the "by group" code, .GRP.)
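For comparison, a table with each daily value next to its monthly value can also be sketched in base R with aggregate() and merge(). This is a different technique from the data.table code above, shown on an invented toy data frame rather than hflights; flights, mean_na, daily, monthly and summary_tbl are illustrative names.

```r
# Invented toy data with one missing delay.
flights <- data.frame(
  Month = c(1, 1, 1, 1, 2, 2),
  Day   = c(1, 1, 2, 2, 1, 1),
  Delay = c(10, 20, 0, NA, 4, 6)
)

# Mean that ignores missing values.
mean_na <- function(x) mean(x, na.rm = TRUE)

# Statistic per (Month, Day) group.
daily <- aggregate(Delay ~ Month + Day, data = flights, FUN = mean_na)
names(daily)[names(daily) == "Delay"] <- "DailyDelay"

# Statistic per Month, computed from the raw rows (not from the daily means).
monthly <- aggregate(Delay ~ Month, data = flights, FUN = mean_na)
names(monthly)[names(monthly) == "Delay"] <- "MonthlyDelay"

# Attach the monthly value to every daily row.
summary_tbl <- merge(daily, monthly, by = "Month")
summary_tbl <- summary_tbl[order(summary_tbl$Month, summary_tbl$Day), ]
rownames(summary_tbl) <- NULL
summary_tbl
```

Because the monthly statistic is computed from the raw rows, swapping mean_na for a median function should work the same way, which is the case the question says cannot be rolled up from daily summaries.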
