How to take average of column in specific date ranges using R?


I know there might be similar questions, but I can't find an answer specific to my situation. I have a dataframe like this:
date CVDadmissions
2001.10.01 48
2001.10.02 12
2002.10.01 24
2002.10.02 22
What I want is:
the average of CVDadmissions for 2001 and then for 2002.
Can someone please guide me on how to do this in R?

aggregate(df$CVDadmissions,list(substr(df$date,1,4)),mean)

Convert to Date object, extract year and then take mean
aggregate(CVDadmissions~year,
transform(df, year = format(as.Date(date, "%Y.%m.%d"), "%Y")), mean)
# year CVDadmissions
#1 2001 30
#2 2002 23
With dplyr and lubridate, we can do
library(dplyr)
library(lubridate)
df %>%
mutate(date = ymd(date)) %>%
group_by(year = year(date)) %>%
summarise(CVDadmissions = mean(CVDadmissions))
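If the ranges are not whole calendar years, one possible base R sketch is to bin the dates with cut() and average per bin. The breaks vector and the period labels below are assumptions, chosen only to reproduce the 2001/2002 split from the question. They are not part of the answers above.

```r
# Sample data copied from the question
df <- data.frame(
  date = c("2001.10.01", "2001.10.02", "2002.10.01", "2002.10.02"),
  CVDadmissions = c(48, 12, 24, 22)
)
df$date <- as.Date(df$date, format = "%Y.%m.%d")

# Hypothetical range boundaries: each bin includes its start date
# and excludes the next boundary (right = FALSE)
breaks <- as.Date(c("2001-01-01", "2002-01-01", "2003-01-01"))
df$period <- cut(df$date, breaks = breaks,
                 labels = c("2001", "2002"), right = FALSE)

# Mean of CVDadmissions within each bin
avg_by_period <- tapply(df$CVDadmissions, df$period, mean)
avg_by_period
```

Changing the breaks vector (for example to quarter or season boundaries) should be enough to get averages for other date ranges.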

Related

R - Filter data by month

I apologize for my bad English, but I really need your help.
I have a .csv dataset with two columns, year and value. It contains monthly precipitation heights from 1900 to 2019.
It looks like this:
year value
190001 100
190002 39
190003 78
190004 45
...
201912 25
I need to create two new datasets: the first one with the data for every year from July (07) to September (09) and the second one from January (01) to March (03).
I also need to summarize this data for every year (meaning I need only one value per year).
In the end I should have data for summer 1900-2019 and winter 1900-2019.

You can use the dplyr and stringr packages to achieve what you need. I created a mock data set first:
library(dplyr)
library(stringr)
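# build only valid YYYYMM values (a plain 190001:201912 would also contain e.g. 190013)
ym <- as.vector(outer(1:12, 1900:2019, function(m, y) y * 100 + m))
df <- data.frame(time = ym, value = runif(length(ym), 0, 100))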
After that, we create two separate columns for month and year:
df$year <- as.numeric(str_extract(df$time, "^...."))
df$month <- as.numeric(str_extract(df$time, "..$"))
At this point, we can filter:
df_1 <- df %>% filter(between(month,7,9))
df_2 <- df %>% filter(between(month,1,3))
... and summarize each filtered subset:
df_1 <- df_1 %>% group_by(year) %>% summarise(value = sum(value))
df_2 <- df_2 %>% group_by(year) %>% summarise(value = sum(value))

library(tidyverse)
dat <- tribble(
~year, ~value,
190001, 100,
190002, 39,
190003, 78,
190004, 45)
Splitting the year variable into a month and year variable:
dat_prep <- dat %>%
mutate(month = str_remove(year, "^\\d{4}"), # Remove the first 4 digits
year = str_remove(year, "\\d{2}$"), # Remove the last 2 digits
across(everything(), as.numeric))
dat_prep %>%
filter(month %in% 7:9) %>% # For months Jul-Sep. Repeat with 1:3 for Jan-Mar
group_by(year) %>%
summarize(value = sum(value))
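Since the year column is numeric in YYYYMM form, a possible alternative without string handling is integer division and modulo. This is only a sketch in base R: the first three rows come from the question, the remaining values and the helper name season_sum are made up for illustration.

```r
# First three rows are from the question; the rest are hypothetical
dat <- data.frame(
  ym    = c(190001, 190002, 190003, 190007, 190008, 190009, 190101, 190107),
  value = c(100, 39, 78, 10, 20, 30, 5, 7)
)

# Split YYYYMM arithmetically
dat$year  <- dat$ym %/% 100   # 190001 -> 1900
dat$month <- dat$ym %% 100    # 190001 -> 1

# Hypothetical helper: keep the given months, then sum value per year
season_sum <- function(d, months) {
  aggregate(value ~ year, data = d[d$month %in% months, ], FUN = sum)
}

winter <- season_sum(dat, 1:3)  # January to March
summer <- season_sum(dat, 7:9)  # July to September
```

Each result has one row per year, which matches the "one value per year" requirement in the question.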

R -- Always grab the last day of the previous year in R

I am an aspiring data scientist, and this will be my first ever question on StackOF.
I have this line of code to help wrangle my data. My date filter is static. I would prefer not to have to go in and change this hardcoded value every year. What is the best alternative to make my date filter more dynamic? The date column is also difficult to work with because it is not a "date", it is a "dbl".
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
Tried so far:
df %>%
filter(DATE >= 20191231)

# load packages (lubridate for dates)
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
This looks like this:
DATE
1 20191230
2 20191231
3 20200122
# and now...
df %>% # take the dataframe
mutate(DATE = ymd(DATE)) %>% # turn the DATE column actually into a date
filter(DATE >= floor_date(Sys.Date(), "year") - days(1))
...and filter rows where DATE is on or after the day before the first day of the current year (floor_date(Sys.Date(), "year") - days(1)). Run in 2020, this gives:
DATE
1 2019-12-31
2 2020-01-22
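The same cutoff can probably be computed in base R as well. The sketch below wraps it in a hypothetical helper, last_day_prev_year, that takes the reference date as an argument so the result can be checked for a fixed date instead of depending on when the code is run.

```r
# Hypothetical helper: last day of the year before `today`
last_day_prev_year <- function(today = Sys.Date()) {
  # January 1st of the current year, minus one day
  as.Date(paste0(format(today, "%Y"), "-01-01")) - 1
}

# Sample data from the question (DATE stored as a number)
df <- data.frame(DATE = c(20191230, 20191231, 20200122))
df$DATE <- as.Date(as.character(df$DATE), format = "%Y%m%d")

# Fixed reference date for reproducibility; use the default in real code
cutoff <- last_day_prev_year(as.Date("2020-06-15"))
result <- df[df$DATE >= cutoff, , drop = FALSE]
```

Calling last_day_prev_year() with no argument uses Sys.Date(), so the filter moves forward every year without editing the code.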

How to filter by dates and grouping months together in R using dplyr

I have a dataframe (let's call it df1) that looks something like this...
Date Price
2014-08-06 22
2014-08-06 89
2014-09-15 56
2014-06-04 41
2015-01-19 11
2015-05-23 5
2014-07-21 108
There are other variables in the dataframe but we will ignore them for now, as I do not require them.
I have previously ordered it using
df2 <- df1[order(as.Date(df1$Date, format="%Y/%m/%d")),]
And then created a dataframe containing the values for just one month, for example, just September 2015 dates...
september2015 <- df2[df2$Date >= "2015-09-01" & df2$Date <= "2015-09-30",]
I have done this for all the months in 2015 and 2014.
Then I need to create an average of prices within each given month. I have done this by...
mean(september2015$Price, na.rm = TRUE)
Obviously, this is very long and tedious and involves many lines of code. I am trying to make my code more efficient through using the dplyr package.
So far I have...
datesandprices <- select(df2, Date, Price)
datesandprices <- arrange(datesandprices, Date)
summarise(datesandprices, avg = mean(Price, na.rm = TRUE))
Or in a simpler form...
df1 %>%
select(Date, Price) %>%
arrange(Date) %>%
filter(Date >= 2014-08-06 & Date =< 2014-08-30)
summarise(mean(Price, na.rm = TRUE))
The filter line is not working for me and I can't figure out how to filter by dates using this method. I would like to get the mean for each month without having to calculate it one by one - and ideally extract the monthly means into a new dataframe or column that looks like...
Month Average
Jan 2014 x
Feb 2014 y
...
Nov 2015 z
Dec 2015 a
I hope this makes sense. I can't find anything on stackoverflow that works with dates, attempting to do something similar to this (unless I am searching for the wrong functions). Many thanks!

I made a separate column in your data set that contains only year and month. Then, I did a group_by on that column to get the means for each month.
Date <- c("2014-08-06", "2014-08-06", "2014-09-15", "2014-06-04", "2015-01-19", "2015-05-23", "2014-07-21")
Price <- c(22,89,56,41,11,5,108)
Date <- as.Date(Date, format="%Y-%m-%d")
df <- data.frame(Date, Price)
df$Month_Year <- substr(df$Date, 1,7)
library(dplyr)
df %>%
#select(Date, Price) %>%
group_by(Month_Year) %>%
summarise(mean(Price, na.rm = TRUE))
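As for why the filter line in the question fails: an unquoted 2014-08-06 is evaluated as arithmetic, and =< is not an R operator (it should be <=). A minimal base R sketch of the comparison, assuming the Date column is of class Date:

```r
# Sample data from the question
df1 <- data.frame(
  Date  = as.Date(c("2014-08-06", "2014-08-06", "2014-09-15", "2014-06-04",
                    "2015-01-19", "2015-05-23", "2014-07-21")),
  Price = c(22, 89, 56, 41, 11, 5, 108)
)

# Without quotes this is a subtraction: 2014 - 8 - 6
unquoted <- 2014-08-06

# Compare against real Date values and use <= rather than =<
in_range <- df1$Date >= as.Date("2014-08-06") &
            df1$Date <= as.Date("2014-08-30")
aug_mean <- mean(df1$Price[in_range], na.rm = TRUE)
```

Inside a dplyr pipeline the same condition would go into filter(), followed by %>% before summarise, but grouping by month as shown above avoids filtering month by month altogether.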

For the sake of completeness, here is also a data.table solution:
library(data.table)
# in case Date is of type character
setDT(df1)[, .(Average = mean(Price, na.rm = TRUE)), keyby = .(Yr.Mon = substr(Date, 1,7))]
# in case Date is of class Date or POSIXct
setDT(df2)[, .(Average = mean(Price, na.rm = TRUE)), keyby = .(Yr.Mon = format(Date, "%Y-%m"))]
Yr.Mon Average
1: 2014-06 41.0
2: 2014-07 108.0
3: 2014-08 55.5
4: 2014-09 56.0
5: 2015-01 11.0
6: 2015-05 5.0
Note that the grouping variable Yr.Mon is created "on-the-fly" in the keyby clause.
Data
library(data.table)
df1 <- fread(
"Date Price
2014-08-06 22
2014-08-06 89
2014-09-15 56
2014-06-04 41
2015-01-19 11
2015-05-23 5
2014-07-21 108")
df2 <- df1[, Date := as.Date(Date)]

I managed to do it using all dplyr functions, with help from @user108636
df %>%
select(Date, Price) %>%
arrange(Date) %>%
mutate(Month_Year = substr(Date, 1,7)) %>%
group_by(Month_Year) %>%
summarise(mean(Price, na.rm = TRUE))
The select function selects the date and price columns.
The arrange function arranges my dataframe according to the date - with the earliest date first. The mutate function adds another column which excludes the day and leaves us with, for example...
Month_Year
2015-10
2015-10
2015-11
2015-12
2015-12
The group by function groups all the months together and the summarise function calculates the mean of the price of each month.

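This should average your price data by month and year. Note that apply.monthly comes from the xts package, not from zoo.
library(zoo)
library(xts)
#Pull out columns as vectors (df1["Price"] would return a data frame)
Price<-df1$Price
Date<-as.Date(df1$Date)
#Put in Zoo
zooPrice <- zoo(Price,Date)
#Monthly mean with year (vector)
monthly.avg <- apply.monthly(zooPrice, mean)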
#function to change back to DF
zooToDf <- function(z) {
df <- as.data.frame(z)
df$Date <- time(z) #create a Date column
rownames(df) <- NULL #so row names not filled with dates
df <- df[,c(ncol(df), 1:(ncol(df)-1))] #reorder columns so Date first
return(df)
}
#Apply function to create new Df with data!
MonthYearAvg<-zooToDf(monthly.avg)

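Convert your column to a Date object and use format. The month names follow the session locale (French in the output below), and because Month_Year is a character column the rows are sorted alphabetically rather than chronologically.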
df <- data.frame(
Date = c("2014-08-06", "2014-08-06", "2014-09-15", "2014-06-04", "2015-01-19", "2015-05-23", "2014-07-21"),
Price = c(22, 89, 56, 41, 11, 5, 108))
library(dplyr)
df %>%
group_by(Month_Year = as.Date(Date) %>% format("%b %Y")) %>%
summarise(avg = mean(Price, na.rm = TRUE))
# A tibble: 6 x 2
Month_Year avg
<chr> <dbl>
1 août 2014 55.5
2 janv. 2015 11
3 juil. 2014 108
4 juin 2014 41
5 mai 2015 5
6 sept. 2014 56
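If chronological order matters, one option is to group on the first day of each month (a real Date) and only format the label afterwards. This is a sketch in base R under that assumption; the column names month_start, key and Month are made up for illustration.

```r
# Sample data from the question
df <- data.frame(
  Date  = as.Date(c("2014-08-06", "2014-08-06", "2014-09-15", "2014-06-04",
                    "2015-01-19", "2015-05-23", "2014-07-21")),
  Price = c(22, 89, 56, 41, 11, 5, 108)
)

# Collapse every date to the first day of its month
df$month_start <- as.Date(format(df$Date, "%Y-%m-01"))

# aggregate() sorts by the grouping column, here in date order
monthly <- aggregate(Price ~ month_start, data = df, FUN = mean)

# Locale-independent key and a locale-dependent display label
monthly$key   <- format(monthly$month_start, "%Y-%m")
monthly$Month <- format(monthly$month_start, "%b %Y")
```

The Month label still depends on the locale, but the row order no longer does.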

Why mutate applied only to first row and repeats its result to the rest

I have a data frame which consists of several columns. One of them is date_created column in the unified format. I want to split it into year, month, day and add these columns to the same data frame.
input:
id date_created
1 02-20-2014
2 01-15-2015
result:
id date_created year month day
1 02-20-2014 2014 2 20
2 01-15-2015 2015 1 15
I have sample code that works incorrectly:
displays <- displays %>%
mutate(month = as.integer(unlist(strsplit(date, '-')))[1],
day = as.integer(unlist(strsplit(date, '-')))[2],
year = as.integer(unlist(strsplit(date, '-')))[3]
)
it produces the following:
id date_created year month day
1 02-20-2014 2014 2 20
2 01-15-2015 2014 2 20
I guess that the function is not called for each row, but cannot understand why. Explain, please, how it works and provide the sample code to achieve desired result. Thanks

You can use separate or extract from tidyr
library(tidyr)
separate(d1, date_created, c('month', 'day', 'year'), remove=FALSE)
Or
extract(d1, date_created, c('month', 'day', 'year'),
'([^-]+)-([^-]+)-([^-]+)', remove=FALSE)
Or cSplit from splitstackshape
library(splitstackshape)
cSplit(d1, 'date_created', sep="-", drop=FALSE)
Or using tstrsplit from the devel version of data.table
library(data.table)#v1.9.5
setDT(d1)[, c('month', 'day', 'year') := tstrsplit(date_created, '-')]
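Regarding the problem in your code: mutate works on the whole column at once, so unlist(strsplit(date_created, '-')) flattens the pieces of every row into a single vector, and [1], [2] and [3] pick the first three elements of that vector (the month, day and year of the first row), which are then recycled to all rows. Just use rowwise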
library(dplyr)
d1 %>%
rowwise() %>%
mutate(month= as.integer(unlist(strsplit(date_created, '-')))[1],
day= as.integer(unlist(strsplit(date_created, '-')))[2],
year=as.integer(unlist(strsplit(date_created, '-')))[3])
Or another option would be to convert to date class and then extract 'day', 'month' and 'year'
library(lubridate)
d1 %>%
mutate(date=mdy(date_created), year=year(date),
month=month(date), day=day(date)) %>%
select(-date)
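To see the recycling problem in isolation, here is a small base R sketch. It also shows one possible vectorised fix that keeps the pieces of each row together by binding them into a matrix. The object names flat, first_only and parts are only illustrative.

```r
date_created <- c("02-20-2014", "01-15-2015")

# unlist() flattens the per-row pieces into one vector of length 6
flat <- unlist(strsplit(date_created, "-"))

# [1] is a single value, which mutate would recycle to every row
first_only <- as.integer(flat)[1]

# One row per input string, one column per piece
parts <- do.call(rbind, strsplit(date_created, "-"))

d1 <- data.frame(
  id           = 1:2,
  date_created = date_created,
  year         = as.integer(parts[, 3]),
  month        = as.integer(parts[, 1]),
  day          = as.integer(parts[, 2])
)
```

This assumes every string has exactly three pieces; malformed dates would need separate handling.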

Sum rows by date range, for a given identifier

I looked at many posts with similar, but I believe less complex, questions, and just can't seem to work out an answer for this.
I have more than 1,000,000 lines of data, for example in this form:
date<-c("9/30/2012","10/31/2012","11/30/2012","12/31/2012","1/31/2013","2/28/2013","3/31/2013","10/31/2012","11/30/2012","12/31/2012","1/31/2013","2/28/2013","3/31/2013")
name<-c("a","a","a","a","a","a","a","b","b","b","b","b","b")
amount<-c(100,200,300,400,500,600,700,800,900,800,700,600,500)
data<-data.frame(name,date,amount)
View(data)
What I need is, for entries of the same name, sum the amount for dates that are in jan-mar, apr-jun, jul-sep, oct-dec in the same year.
This is my ideal output:
date2<-c("9/30/2012","12/31/2012","3/31/2013","12/31/2012","3/31/2013")
name2<-c("a","a","a","b","b")
amount2<-c(100,900,1800,2500,1800)
data2<-data.frame(name2,date2,amount2)
View(data2)
Will appreciate any input at all, to lead me towards the correct direction.
Thank you very much!

1. Using dplyr/zoo
We can convert the 'date' column from 'character' to 'Date' class, then get the sum of 'amount' and the last value of 'date', grouped by the columns 'name' and 'Qtr' (created by converting 'date' to a year-quarter with as.yearqtr).
library(dplyr)
library(zoo)
data %>%
mutate(date=as.Date(date, format='%m/%d/%Y')) %>%
group_by(name, Qtr=as.character(as.yearqtr(date))) %>%
summarise(amount= sum(amount), date=last(date))
# name Qtr amount date
#1 a 2012 Q3 100 2012-09-30
#2 a 2012 Q4 900 2012-12-31
#3 a 2013 Q1 1800 2013-03-31
#4 b 2012 Q4 2500 2012-12-31
#5 b 2013 Q1 1800 2013-03-31
NOTE: Also added @docendo discimus's suggestion to use last and to change the class of the 'date' column. The Qtr column is 'character' because the yearqtr class is not supported by dplyr (judging from the errors). The 'Qtr' column was not in the expected dataset 'data2', so I guess it doesn't matter whether it is 'character' or 'yearqtr'. If we don't change the 'date' column to 'Date' class, and do the conversion in the group_by step instead, this will give the same result as 'data2'. The extra 'Qtr' column can be deleted.
2. Without using zoo
data %>%
mutate(date1 = as.Date(date, format = '%m/%d/%Y')) %>%
group_by(name, Qtr= sprintf('%s %s', format(date1, '%Y'),
quarters(date1))) %>%
summarise(amount = sum(amount), date=last(date)) %>%
ungroup() %>%
select(-Qtr) %>%
as.data.frame()
# name amount date
#1 a 100 9/30/2012
#2 a 900 12/31/2012
#3 a 1800 3/31/2013
#4 b 2500 12/31/2012
#5 b 1800 3/31/2013
NOTE2: Added a solution without using as.yearqtr, kept the same format for 'date' as in the expected output 'data2'

Here are a few approaches:
1) aggregate & zoo
library(zoo)
aggregate(amount ~ name + yearqtr,
transform(data, yearqtr = as.yearqtr(date, "%m/%d/%Y")),
sum)
2) data.table & zoo
library(data.table)
library(zoo)
dt <- data.table(data, key = "name,date")
dt[, date := as.yearqtr(date, "%m/%d/%Y")][, list(sum = sum(amount)), by = "name,date"]
Note that both of these solutions convert the date to a real "yearqtr" object and not just to a character string. I haven't benchmarked them, but data.table is typically very fast. You could create the data.table from data by reference using setDT for even greater performance, but you might prefer to keep the two objects separate, so we left them separate here.
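For comparison, a sketch that needs neither zoo nor dplyr: base R's quarters() gives the quarter of a Date, which can be pasted together with the year to form the grouping key. The names d, qtr and res are only illustrative, and the key here is a plain character string rather than a "yearqtr" object.

```r
# Sample data from the question
date <- c("9/30/2012", "10/31/2012", "11/30/2012", "12/31/2012", "1/31/2013",
          "2/28/2013", "3/31/2013", "10/31/2012", "11/30/2012", "12/31/2012",
          "1/31/2013", "2/28/2013", "3/31/2013")
name <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b")
amount <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 800, 700, 600, 500)
data <- data.frame(name, date, amount)

# Build a "YYYY Qn" key from the parsed dates
d <- as.Date(data$date, format = "%m/%d/%Y")
data$qtr <- paste(format(d, "%Y"), quarters(d))

# Sum amount per name and quarter, then order by name and quarter
res <- aggregate(amount ~ name + qtr, data = data, FUN = sum)
res <- res[order(res$name, res$qtr), ]
```

Because the key starts with the four-digit year, sorting it as text also sorts it chronologically.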