Sum rows by date range, for a given identifier - r

I looked at many posts with similar, but I believe less complex, questions and just can't seem to work out an answer for this.
I have more than 1,000,000 lines of data, for example in this form:
date<-c("9/30/2012","10/31/2012","11/30/2012","12/31/2012","1/31/2013","2/28/2013","3/31/2013","10/31/2012","11/30/2012","12/31/2012","1/31/2013","2/28/2013","3/31/2013")
name<-c("a","a","a","a","a","a","a","b","b","b","b","b","b")
amount<-c(100,200,300,400,500,600,700,800,900,800,700,600,500)
data<-data.frame(name,date,amount)
View(data)
What I need is, for entries with the same name, to sum the amount over dates that fall in the same quarter (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec) of the same year.
This is my ideal output:
date2<-c("9/30/2012","12/31/2012","3/31/2013","12/31/2012","3/31/2013")
name2<-c("a","a","a","b","b")
amount2<-c(100,900,1800,2500,1800)
data2<-data.frame(name2,date2,amount2)
View(data2)
I would appreciate any input that points me in the right direction.
Thank you very much!

1. Using dplyr/zoo
We can convert the 'date' column from 'character' to 'Date' class, then get the sum of 'amount' and the last value of 'date', grouped by 'name' and 'Qtr' (the year-quarter obtained by converting 'date' with as.yearqtr).
library(dplyr)
library(zoo)
data %>%
mutate(date=as.Date(date, format='%m/%d/%Y')) %>%
group_by(name, Qtr=as.character(as.yearqtr(date))) %>%
summarise(amount= sum(amount), date=last(date))
# name Qtr amount date
#1 a 2012 Q3 100 2012-09-30
#2 a 2012 Q4 900 2012-12-31
#3 a 2013 Q1 1800 2013-03-31
#4 b 2012 Q4 2500 2012-12-31
#5 b 2013 Q1 1800 2013-03-31
NOTE: I also adopted @docendo discimus's suggestion to use last and to change the class of the 'date' column. The 'Qtr' column is 'character' because dplyr did not support the 'yearqtr' class (it raised errors). 'Qtr' is not in the expected dataset 'data2', so its class should not matter, and the extra column can be deleted. If we leave the 'date' column as 'character' and do the conversion only inside the group_by step, the result keeps the original date format, as in 'data2'.
2. Without using zoo
data %>%
mutate(date1 = as.Date(date, format = '%m/%d/%Y')) %>%
group_by(name, Qtr= sprintf('%s %s', format(date1, '%Y'),
quarters(date1))) %>%
summarise(amount = sum(amount), date=last(date)) %>%
ungroup() %>%
select(-Qtr) %>%
as.data.frame()
# name amount date
#1 a 100 9/30/2012
#2 a 900 12/31/2012
#3 a 1800 3/31/2013
#4 b 2500 12/31/2012
#5 b 1800 3/31/2013
NOTE2: This solution does not use as.yearqtr and keeps the same format for 'date' as in the expected output 'data2'.

Here are a few approaches:
1) aggregate & zoo
library(zoo)
aggregate(amount ~ name + yearqtr,
transform(data, yearqtr = as.yearqtr(date, "%m/%d/%Y")),
sum)
2) data.table & zoo
library(data.table)
library(zoo)
dt <- data.table(data, key = "name,date")
dt[, date := as.yearqtr(date, "%m/%d/%Y")][, list(sum = sum(amount)), by = "name,date"]
Note that both of these solutions convert the date to a real "yearqtr" object and not just to a character string. I haven't benchmarked them, but data.table is typically very fast. You could create the data.table from data by reference using setDT for even greater performance, but you might prefer to keep the two objects separate, so we left them separate here.
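For comparison, the same quarterly sums can be sketched in base R alone. This is a minimal sketch, assuming the sample data from the question; the column name qtr and the result name res are introduced here for illustration only.

```r
# Sample data from the question
date <- c("9/30/2012", "10/31/2012", "11/30/2012", "12/31/2012",
          "1/31/2013", "2/28/2013", "3/31/2013", "10/31/2012",
          "11/30/2012", "12/31/2012", "1/31/2013", "2/28/2013", "3/31/2013")
name <- c(rep("a", 7), rep("b", 6))
amount <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 800, 700, 600, 500)
data <- data.frame(name, date, amount)

# Parse the dates and build a "year quarter" label such as "2012 Q4"
d <- as.Date(data$date, format = "%m/%d/%Y")
data$qtr <- paste(format(d, "%Y"), quarters(d))

# Sum 'amount' within each name/quarter combination
res <- aggregate(amount ~ name + qtr, data = data, FUN = sum)
res <- res[order(res$name, res$qtr), ]
res
```

Unlike the zoo solutions, qtr here is a plain character label, so it sorts correctly only because the year comes first.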

Related

How to take average of column in specific date ranges using R?

I know there might be similar questions like this, but I can't find any specific answers for my situation. I have a dataframe like this:
date CVDadmissions
2001.10.01 48
2001.10.02 12
2002.10.01 24
2002.10.02 22
What I want is:
the average of CVDadmissions for 2001 and then for 2002.
Can someone please show me how to do this in R?
aggregate(df$CVDadmissions,list(substr(df$date,1,4)),mean)
Convert to Date object, extract year and then take mean
aggregate(CVDadmissions~year,
transform(df, year = format(as.Date(date, "%Y.%m.%d"), "%Y")), mean)
# year CVDadmissions
#1 2001 30
#2 2002 23
With dplyr and lubridate, we can do
library(dplyr)
library(lubridate)
df %>%
mutate(date = ymd(date)) %>%
group_by(year = year(date)) %>%
summarise(CVDadmissions = mean(CVDadmissions))
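To check the numbers on the sample rows, the yearly means can also be computed with tapply. This is a minimal sketch, assuming the four rows shown in the question; year and yearly_mean are names introduced here for illustration.

```r
# Sample rows from the question
df <- data.frame(
  date = c("2001.10.01", "2001.10.02", "2002.10.01", "2002.10.02"),
  CVDadmissions = c(48, 12, 24, 22)
)

# Extract the year from the parsed date, then average within each year
year <- format(as.Date(df$date, format = "%Y.%m.%d"), "%Y")
yearly_mean <- tapply(df$CVDadmissions, year, mean)
yearly_mean
```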

How to filter by dates and grouping months together in R using dplyr

I have a dataframe (let's call it df1) that looks something like this...
Date Price
2014-08-06 22
2014-08-06 89
2014-09-15 56
2014-06-04 41
2015-01-19 11
2015-05-23 5
2014-07-21 108
There are other variables in the dataframe but we will ignore them for now, as I do not require them.
I have previously ordered it using
df2 <- df1[order(as.Date(df1$Date, format="%Y/%m/%d")),]
And then created a dataframe containing the values for just one month, for example, just September 2015 dates...
september2015 <- df2[df2$Date >= "2015-09-01" & df2$Date <= "2015-09-30",]
I have done this for all the months in 2015 and 2014.
Then I need to create an average of prices within each given month. I have done this by...
mean(september2015$Price, na.rm = TRUE)
Obviously, this is very long and tedious and involves many lines of code. I am trying to make my code more efficient through using the dplyr package.
So far I have...
datesandprices <- select(df2, Date, Price)
datesandprices <- arrange(datesandprices, Date)
summarise(datesandprices, avg = mean(Price, na.rm = TRUE))
Or in a simpler form...
df1 %>%
select(Date, Price) %>%
arrange(Date) %>%
filter(Date >= 2014-08-06 & Date =< 2014-08-30)
summarise(mean(Price, na.rm = TRUE))
The filter line is not working for me and I can't figure out how to filter by dates using this method. I would like to get the mean for each month without having to calculate it one by one - and ideally extract the monthly means into a new dataframe or column that looks like...
Month Average
Jan 2014 x
Feb 2014 y
...
Nov 2015 z
Dec 2015 a
I hope this makes sense. I can't find anything on stackoverflow that works with dates, attempting to do something similar to this (unless I am searching for the wrong functions). Many thanks!
I made a separate column in your data set that contains only year and month. Then, I did a group_by on that column to get the means for each month.
Date <- c("2014-08-06", "2014-08-06", "2014-09-15", "2014-06-04", "2015-01-19", "2015-05-23", "2014-07-21")
Price <- c(22,89,56,41,11,5,108)
Date <- as.Date(Date, format="%Y-%m-%d")
df <- data.frame(Date, Price)
df$Month_Year <- substr(df$Date, 1,7)
library(dplyr)
df %>%
#select(Date, Price) %>%
group_by(Month_Year) %>%
summarise(mean(Price, na.rm = TRUE))
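On the filter line in the question: as far as I can tell it fails for three reasons. `=<` is not an R operator (the comparison is `<=`), the unquoted 2014-08-06 is evaluated as arithmetic (2014 minus 8 minus 6), and the pipe after filter(...) is missing. A minimal base R sketch of the intended range filter, assuming Date is of class Date; the names unquoted, in_range and aug_mean are hypothetical.

```r
# Sample data from the question, with Date stored as class Date
df1 <- data.frame(
  Date = as.Date(c("2014-08-06", "2014-08-06", "2014-09-15", "2014-06-04",
                   "2015-01-19", "2015-05-23", "2014-07-21")),
  Price = c(22, 89, 56, 41, 11, 5, 108)
)

# Without quotes, the "date" is plain subtraction
unquoted <- 2014-08-06

# Compare against real Date values instead, using <= (not =<)
in_range <- df1$Date >= as.Date("2014-08-06") & df1$Date <= as.Date("2014-08-30")
aug_mean <- mean(df1$Price[in_range], na.rm = TRUE)
aug_mean
```

In dplyr the same condition would go inside filter(), followed by %>% before summarise().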
For the sake of completeness, here is also a data.table solution:
library(data.table)
# in case Date is of type character
setDT(df1)[, .(Average = mean(Price, na.rm = TRUE)), keyby = .(Yr.Mon = substr(Date, 1,7))]
# in case Date is of class Date or POSIXct
setDT(df2)[, .(Average = mean(Price, na.rm = TRUE)), keyby = .(Yr.Mon = format(Date, "%Y-%m"))]
Yr.Mon Average
1: 2014-06 41.0
2: 2014-07 108.0
3: 2014-08 55.5
4: 2014-09 56.0
5: 2015-01 11.0
6: 2015-05 5.0
Note that the grouping variable Yr.Mon is created "on-the-fly" in the keyby clause.
Data
library(data.table)
df1 <- fread(
"Date Price
2014-08-06 22
2014-08-06 89
2014-09-15 56
2014-06-04 41
2015-01-19 11
2015-05-23 5
2014-07-21 108")
df2 <- df1[, Date := as.Date(Date)]
I managed to do it using only dplyr functions, with help from @user108636:
df %>%
select(Date, Price) %>%
arrange(Date) %>%
mutate(Month_Year = substr(Date, 1,7)) %>%
group_by(Month_Year) %>%
summarise(mean(Price, na.rm = TRUE))
The select function selects the date and price columns.
The arrange function arranges my dataframe according to the date - with the earliest date first. The mutate function adds another column which excludes the day and leaves us with, for example...
Month_Year
2015-10
2015-10
2015-11
2015-12
2015-12
The group_by function groups all the rows of each month together, and the summarise function calculates the mean price for each month.
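The same mutate, group_by and summarise steps can be sketched in base R, which may help when checking results. This is a minimal sketch, assuming the sample data from the question; Month_Year follows the naming above, while monthly and Average are names chosen here for illustration.

```r
# Sample data from the question
df <- data.frame(
  Date = as.Date(c("2014-08-06", "2014-08-06", "2014-09-15", "2014-06-04",
                   "2015-01-19", "2015-05-23", "2014-07-21")),
  Price = c(22, 89, 56, 41, 11, 5, 108)
)

# Equivalent of mutate(): a year-month key such as "2014-08"
df$Month_Year <- format(df$Date, "%Y-%m")

# Equivalent of group_by() + summarise(): mean price per year-month
monthly <- aggregate(Price ~ Month_Year, data = df, FUN = mean, na.rm = TRUE)
names(monthly)[2] <- "Average"
monthly
```

Because the key starts with the year, the rows come out in chronological order.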
This should average your price data by month-year.
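library(zoo)
library(xts) # apply.monthly() is provided by xts, not zoo
#Pull out columns as vectors (df1["Price"] would return a data frame)
Price<-df1$Price
Date<-as.Date(df1$Date)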
#Put in Zoo
zooPrice <- zoo(Price,Date)
#Monthly mean with year (vector)
monthly.avg <- apply.monthly(zooPrice, mean)
#function to change back to DF
zooToDf <- function(z) {
df <- as.data.frame(z)
df$Date <- time(z) #create a Date column
rownames(df) <- NULL #so row names not filled with dates
df <- df[,c(ncol(df), 1:(ncol(df)-1))] #reorder columns so Date first
return(df)
}
#Apply function to create new Df with data!
MonthYearAvg<-zooToDf(monthly.avg)
Convert your column to a Date object and use format
df <- data.frame(
Date = c("2014-08-06", "2014-08-06", "2014-09-15", "2014-06-04", "2015-01-19", "2015-05-23", "2014-07-21"),
Price = c(22, 89, 56, 41, 11, 5, 108))
library(dplyr)
df %>%
group_by(Month_Year = as.Date(Date) %>% format("%b %Y")) %>%
summarise(avg = mean(Price, na.rm = TRUE))
# A tibble: 6 x 2
Month_Year avg
<chr> <dbl>
1 août 2014 55.5
2 janv. 2015 11
3 juil. 2014 108
4 juin 2014 41
5 mai 2015 5
6 sept. 2014 56

Earliest Date for each id in R

I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per each id showing his earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and finally subset the original dataset based on the one containing the minimums, using the new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$id,new$e_date)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids appear more than once with different e_date values. Plus, the code gives me different results when I convert the date with as.Date instead of keeping its original format (integer). I think the answer is simple, but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data_full)), grouped by 'id', we get the 1st row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
group_by(id) %>%
arrange(e_date) %>%
slice(1L)
If we need a base R option, ave can be used
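# compare the within-id rank to 1 outside ave() so the row index is logical
data_full[with(data_full, ave(as.numeric(e_date), id, FUN = function(x) rank(x, ties.method = "first")) == 1),]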
Another answer that uses dplyr's filter command:
dta %>%
group_by(id) %>%
filter(date == min(date))
You may use library(sqldf) to get the minimum date as follows:
data1<-data.frame(id=c("789","123","456","123","123","456","789"),
e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))
library(sqldf)
data2 = sqldf("SELECT id,
min(e_date) as 'earliest_date'
FROM data1 GROUP BY 1", method = "name__class")
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11
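For a base R route that keeps every column of the winning row, one option is to sort and then drop repeated ids. This is a minimal sketch using the same sample values as data1 above, with e_date converted to Date so that ordering is chronological; sorted and earliest are hypothetical names.

```r
# Same sample values as data1, with e_date stored as class Date
data1 <- data.frame(
  id = c("789", "123", "456", "123", "123", "456", "789"),
  e_date = as.Date(c("2016-05-01", "2016-07-02", "2016-08-25", "2015-12-11",
                     "2014-03-01", "2015-07-08", "2015-12-11"))
)

# Sort by id, then by date, so the earliest row of each id comes first
sorted <- data1[order(data1$id, data1$e_date), ]

# Keep only the first row seen for each id
earliest <- sorted[!duplicated(sorted$id), ]
rownames(earliest) <- NULL
earliest
```

If several rows share the earliest date for an id, this keeps only one of them, whereas filter(date == min(date)) keeps all of them.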
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
which_quarter sort(rand_weeks)[1]
<dbl> <time>
1 1 2017-01-05 05:46:32
2 2 2017-04-06 05:46:32
3 3 2016-08-18 05:46:32
4 4 2016-10-06 05:46:32

R extract Date and Time Info

I have a data.frame that looks like this:
> df1
Date Name Surname Amount
2015-07-24 John Smith 200
I want to extract all the information from the Date into new columns, so I can get to this:
> df2
Date Year Month Day Day_w Name Surname Amount
2015-07-24 2015 7 24 Friday John Smith 200
So now I'd like to have Year, Month, Day and Day of the Week. How can I do that? When I try to first make the variable a date using as.Date the data.frame gets messed up and the Date all become NA (and no new columns). Thanks for your help!
Here's a simple and efficient solution using the devel version of data.table and its new tstrsplit function which will perform the splitting operation only once and also update your data set in place.
library(data.table)
# wrap wday() in list() so it stays a single column when there is more than one row
setDT(df1)[, c("Year", "Month", "Day", "Day_w") :=
c(tstrsplit(Date, "-", type.convert = TRUE), list(wday(Date)))]
df1
# Date Name Surname Amount Year Month Day Day_w
# 1: 2015-07-24 John Smith 200 2015 7 24 6
Note that I've used a numeric representation of the week days because there is an efficient built in wday function for that in the data.table package, but you can easily tweak it if you really need to using format(as.Date(Date), format = "%A") instead.
In order to install the devel version use the following
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
Maybe this helps:
df2 <- df1
dates <- strptime(as.character(df1$Date),format="%Y-%m-%d")
df2$Year <- format(dates, "%Y")
df2$Month <- format(dates, "%m")
df2$Day <- format(dates, "%d")
df2$Day_w <- format(dates, "%a")
Afterwards you can rearrange the order of the columns in df2 as you desire.
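If numeric columns are wanted, as in the desired df2, the components can also be read from a POSIXlt object. This is a minimal sketch, assuming Date is stored as character or factor in year-month-day form; lt is a name introduced here, and weekdays() returns names in the current locale.

```r
# One-row sample from the question
df1 <- data.frame(Date = "2015-07-24", Name = "John",
                  Surname = "Smith", Amount = 200)

# Convert via character first, so a factor column is also parsed correctly
lt <- as.POSIXlt(as.Date(as.character(df1$Date), format = "%Y-%m-%d"))

df2 <- data.frame(
  Date = df1$Date,
  Year = lt$year + 1900L,   # POSIXlt counts years from 1900
  Month = lt$mon + 1L,      # POSIXlt months are 0-based
  Day = lt$mday,
  Day_w = weekdays(lt),     # locale-dependent weekday name
  df1[c("Name", "Surname", "Amount")]
)
df2
```

Passing an explicit format to as.Date, as above, is one way to rule out a format mismatch as the cause of NA values.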

Why mutate applied only to first row and repeats its result to the rest

I have a data frame which consists of several columns. One of them is date_created column in the unified format. I want to split it into year, month, day and add these columns to the same data frame.
input:
id date_created
1 02-20-2014
2 01-15-2015
result:
id date_created year month day
1 02-20-2014 2014 2 20
2 01-15-2015 2015 1 15
I have sample code which works incorrectly:
displays <- displays %>%
mutate(month = as.integer(unlist(strsplit(date, '-')))[1],
day = as.integer(unlist(strsplit(date, '-')))[2],
year = as.integer(unlist(strsplit(date, '-')))[3]
)
it produces the following:
id date_created year month day
1 02-20-2014 2014 2 20
2 01-15-2015 2014 2 20
I guess that the function is not called for each row, but cannot understand why. Explain, please, how it works and provide the sample code to achieve desired result. Thanks
You can use separate or extract from tidyr
library(tidyr)
separate(d1, date_created, c('month', 'day', 'year'), remove=FALSE)
Or
extract(d1, date_created, c('month', 'day', 'year'),
'([^-]+)-([^-]+)-([^-]+)', remove=FALSE)
Or cSplit from splitstackshape
library(splitstackshape)
cSplit(d1, 'date_created', sep="-", drop=FALSE)
Or using tstrsplit from the devel version of data.table
library(data.table)#v1.9.5
setDT(d1)[, c('month', 'day', 'year') := tstrsplit(date_created, '-')]
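Regarding the problem in your code: unlist(strsplit(...)) flattens the pieces of every row into one vector, so [1], [2] and [3] just select the 1st, 2nd and 3rd elements of the entire 'date_created' column, and mutate recycles those single values to every row. Just use rowwise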
library(dplyr)
d1 %>%
rowwise() %>%
mutate(month= as.integer(unlist(strsplit(date_created, '-')))[1],
day= as.integer(unlist(strsplit(date_created, '-')))[2],
year=as.integer(unlist(strsplit(date_created, '-')))[3])
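The flattening behaviour is easy to see outside of mutate. This is a minimal base R sketch, assuming the two sample values from the question; parts and flat are names introduced here for illustration.

```r
date_created <- c("02-20-2014", "01-15-2015")

# strsplit() returns a list with one character vector per row
parts <- strsplit(date_created, "-")

# Flattening loses the row boundaries: six elements in a single vector,
# so [1], [2], [3] only ever see the first row
flat <- as.integer(unlist(parts))

# Picking the i-th piece from each list element keeps the result row-aligned
month <- as.integer(sapply(parts, `[`, 1))
day   <- as.integer(sapply(parts, `[`, 2))
year  <- as.integer(sapply(parts, `[`, 3))
```

The sapply form is vectorised over rows, so it can be used inside mutate without rowwise.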
Or another option would be to convert to date class and then extract 'day', 'month' and 'year'
library(lubridate)
d1 %>%
mutate(date=mdy(date_created), year=year(date),
month=month(date), day=day(date)) %>%
select(-date)
