I am trying to calculate the mean date independent of year for each level of a factor.
DF <- data.frame(Date = seq(as.Date("2013-2-15"), by = "day", length.out = 730))
DF$ID = rep(c("AAA", "BBB", "CCC"), length.out = 730)
head(DF)
Date ID
1 2013-02-15 AAA
2 2013-02-16 BBB
3 2013-02-17 CCC
4 2013-02-18 AAA
5 2013-02-19 BBB
6 2013-02-20 CCC
With the data above and the code below, I can calculate the mean date for each factor, but this includes the year.
I want a mean month and day across years. The preferred result would be a POSIXct time class formatted as month-day (e.g. 12-31 for Dec 31st) representing the mean month and day across multiple years.
library(dplyr)
DF2 <- DF %>% group_by(ID) %>% mutate(
Col = mean(Date, na.rm = T))
DF2
Addition
I am looking for the mean day of the year with a month and day component, for each factor level. If the date represents, for example, the date an animal reproduced, I am not interested in the differences between years, but instead want a single mean day.
The end result would look like DF2 but with the new value calculated as previously described (mean day of the year with a month and day component).
Sorry this was not more clear.
If I understand your question correctly, here's how to get a mean date column. I first extract the day of the year with yday from POSIXlt. I then calculate the mean. To get a date back, I have to add those days to an actual year, hence the creation of the Year object. As requested, I put the results in the same format as DF2 in your example.
library(dplyr)
DF2 <- DF %>%
mutate(Year=format(Date,"%Y"),
Date_day=as.POSIXlt(Date, origin = "1960-01-01")$yday)%>%
group_by(ID) %>%
mutate(Col = mean(Date_day, na.rm = T),Mean_date=format(as.Date(paste0(Year,"-01-01"))+Col,"%m-%d"))%>%
select(Date,ID,Mean_date)
DF2
> DF2
Source: local data frame [730 x 3]
Groups: ID [3]
Date ID Mean_date
(date) (chr) (chr)
1 2013-02-15 AAA 07-02
2 2013-02-16 BBB 07-02
3 2013-02-17 CCC 07-01
4 2013-02-18 AAA 07-02
5 2013-02-19 BBB 07-02
6 2013-02-20 CCC 07-01
7 2013-02-21 AAA 07-02
8 2013-02-22 BBB 07-02
9 2013-02-23 CCC 07-01
10 2013-02-24 AAA 07-02
.. ... ... ...
You can take the mean of dates with the mean function. Note, however, that the result depends on the data type. For POSIXct, the mean is itself a date-time, so it can fall at any time of day (much as the mean of a set of integers is usually not an integer). For Date, the mean is printed as a whole date; any fractional day is kept internally but not shown.
For example, I recently took a mean of dates. Look at the output when different data types are used.
> mean(as.Date(stationPointDf$knockInDate))
[1] "2018-06-04"
> mean(as.POSIXct(stationPointDf$knockInDate))
[1] "2018-06-03 21:19:21 CDT"
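A small illustration of that difference, using two made-up dates rather than the stationPointDf data (which is not shown here). It is only a sketch; the variable names are placeholders.

```r
# Two made-up dates (assumption: not taken from stationPointDf)
d <- as.Date(c("2018-06-03", "2018-06-04"))

# Mean of Date values: stored as a fractional number of days,
# but printed as a whole date
m_date <- mean(d)
as.numeric(m_date) %% 1          # fractional part kept internally: 0.5
format(m_date, "%Y-%m-%d")       # "2018-06-03"

# Mean of POSIXct values: the time of day is part of the result
p <- as.POSIXct(c("2018-06-03", "2018-06-04"), tz = "UTC")
m_time <- mean(p)
format(m_time, "%Y-%m-%d %H:%M") # "2018-06-03 12:00"
```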
If I am looking for a mean Month and Day across years, I convert all the dates to have the current year using lubridate package.
library(lubridate)
year(myVectorOfDates) <- 2018
Then, I compute the mean and drop the year.
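The steps just described might look like the sketch below. The dates are made up for illustration, and 2018 is just the reference year used above. One thing to keep in mind: 29 February has no counterpart in a non-leap year, so a leap year may be a safer reference if such dates can occur.

```r
library(lubridate)

# Made-up dates from three different years (assumption for illustration)
dates <- as.Date(c("2016-03-10", "2017-03-20", "2018-03-30"))

# Put every date in the same reference year
year(dates) <- 2018

# Take the mean, then drop the year by formatting as month-day
mean_md <- format(mean(dates), "%m-%d")
mean_md  # "03-20"
```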
I have a dataframe with information on date of birth by individual id.
mydf <- data.frame(id=c(1,2),
dtbirth=as.Date(c("2012-01-01","2013-04-01")))
I would like to compute the age of the individuals as of today. The code below seems to work, but the new variable age is printed with "days" as its unit:
mydf %>%
mutate(age=floor((today()-dtbirth)/365))
We can wrap with as.integer/as.numeric to remove the class attribute difftime
mydf %>%
mutate(age= as.integer(floor((today()-dtbirth)/365)))
-output
# id dtbirth age
#1 1 2012-01-01 9
#2 2 2013-04-01 8
By default, when we subtract with -, difftime chooses the units automatically (units = "auto")
mydf %>%
mutate(age = today() - dtbirth)
# id dtbirth age
#1 1 2012-01-01 3430 days
#2 2 2013-04-01 2974 days
If we need more fine control, use difftime itself and specify the units
mydf %>%
mutate(age = difftime(today(), dtbirth, units = 'weeks'))
# id dtbirth age
#1 1 2012-01-01 490.0000 weeks
#2 2 2013-04-01 424.8571 weeks
We cannot have units greater than 'weeks' as the available options are
difftime(time1, time2, tz,
units = c("auto", "secs", "mins", "hours",
"days", "weeks"))
and it is mentioned as
Units such as "months" are not possible as they are not of constant length. To create intervals of months, quarters or years use seq.Date or seq.POSIXt.
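Since dividing by 365 ignores leap years, one alternative is to count whole years with a lubridate interval. This is only a sketch: the fixed reference date ref is an assumption used in place of today() so that the result is reproducible.

```r
library(lubridate)

dtbirth <- as.Date(c("2012-01-01", "2013-04-01"))

# Fixed reference date instead of today() (assumption, for reproducibility)
ref <- as.Date("2021-05-23")

# Number of complete calendar years between birth date and reference date
age <- interval(dtbirth, ref) %/% years(1)
age  # 9 8
```

In a pipeline the same expression could go inside mutate(), with today() in place of ref.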
I have the below dataset which is a dataframe. But I would like to convert it into time series so that I can do ARIMA forecasting.
Have searched various topics in SO but could not find anything similar which is at YEARMONTH grain. Everyone talked about date field. But here I don't have date.
I am using the below code but this gives error
dataset <- data.frame(year =c(2017), YearMonth = c(201701,201702,201703,201704), sales = c(100,200,300,400))
library(zoo)
newdataset <- as.ts(read.zoo(dataset, FUN = as.yearmon))
# Error:
#
# In zoo(coredata(x), tt) :
# some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I know it gives error because I have year column as 1st column which does not have unique values but not really sure how to fix it.
Any help would be really appreciated.
Regards,
Akash
One option is to convert YearMonth to 1st date of a month and generate ts.
library(zoo)
dataset$YearMonth = as.Date(as.yearmon(as.character(dataset$YearMonth),"%Y%m"), frac = 0)
dataset
# year YearMonth sales
# 1 2017 2017-01-01 100
# 2 2017 2017-02-01 200
# 3 2017 2017-03-01 300
# 4 2017 2017-04-01 400
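If only a ts object is needed, another option is the following (note that the result has Frequency = 1, with YearMonth kept as a data column):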
dataset$YearMonth = as.yearmon(as.character(dataset$YearMonth),"%Y%m")
as.ts(dataset[-1])
# Time Series:
# Start = 1
# End = 4
# Frequency = 1
# YearMonth sales
# 1 2017.000 100
# 2 2017.083 200
# 3 2017.167 300
# 4 2017.250 400
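For ARIMA forecasting a monthly series with frequency 12 is usually what is wanted. A minimal base R sketch is shown below; it assumes the months are consecutive with no gaps, and the names first and sales_ts are placeholders.

```r
dataset <- data.frame(year = 2017,
                      YearMonth = c(201701, 201702, 201703, 201704),
                      sales = c(100, 200, 300, 400))

# Sort by YearMonth and take the first period as the start of the series
dataset <- dataset[order(dataset$YearMonth), ]
first <- dataset$YearMonth[1]

# Split 201701 into year (2017) and month (1); assumes no missing months
sales_ts <- ts(dataset$sales,
               start = c(first %/% 100, first %% 100),
               frequency = 12)
sales_ts
```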
Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
My data has dates from 07/03/2003 to 31/12/2016.
I need to split/filter the data into multiple time periods, as below.
Dates required in the new data frame:
07/03/2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012.
Let's take the following data.frame with dates:
library(dplyr)      # filter(), between()
library(lubridate)  # ymd()
df <- data.frame(date = ymd(c("2017-02-02", "2016-02-02", "2014-02-01", "2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd, dplyr::filter and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
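To keep two separate ranges at once, as the question asks, the two between() conditions can be combined with |. The sketch below uses a few made-up dates around the range boundaries; the data frame and the name kept are placeholders.

```r
library(dplyr)
library(lubridate)

# Made-up dates around the boundaries of the two ranges (assumption)
df <- data.frame(date = dmy(c("07.03.2003", "01.06.2004", "07.03.2005",
                              "31.12.2012", "01.01.2013", "31.12.2016")))

# Keep rows that fall in either range; between() includes both endpoints
kept <- df %>%
  filter(between(date, dmy("07.03.2003"), dmy("06.03.2005")) |
         between(date, dmy("01.01.2013"), dmy("31.12.2016")))
kept
```

Here the rows dated 07.03.2005 and 31.12.2012 are dropped and the other four are kept.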
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)  # in order to be reproducible
N <- 1000      # number of pseudo-random values to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N), numeric_sample = sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut is to plot the table before and after applying it.
Before:
> plot(my_dt)
my_dt <- my_dt[!(date_sample %within% forbidden_dates)]  # applying the temporal cut
After:
> plot(my_dt)
The end goal is to visualize the amount of a medication taken per day across a large sample of individuals. I'm trying to reshape my data to make a stacked area chart (or something similar).
In a more general term; I have my data structured as below:
id med start_date end_date
1 drug_a 2010-08-24 2011-03-03
2 drug_a 2011-06-07 2011-08-12
3 drug_b 2010-03-26 2010-10-31
4 drug_b 2012-08-14 2013-01-31
5 drug_c 2012-03-01 2012-06-20
5 drug_a 2012-04-01 2012-06-14
I think I'm trying to create a data frame with one row per date, and a column summing the total of patients (id) that are taking that drug on that day. For example, if someone is taking drug_a from 2010-01-01 to 2010-01-20, each of those drug-days should count.
Something like:
date drug_a drug_b drug_c
2010-01-01 5 0 10
2010-01-02 10 2 8
I'm functional with dplyr and tidyr, but unsure how to use spread with dates and durations.
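I'd expand the data out to one row per date using do():
library(dplyr)
library(tidyr)
library(zoo)  # supplies an as.Date() method for numbers with a default origin

df %>%
  group_by(id, med) %>%
  # one row per day between start_date and end_date
  do(with(., tibble(date = as.Date(start_date:end_date)))) %>%
  # count patients per drug per day
  group_by(date, med) %>%
  summarize(frequency = n()) %>%
  # one column per drug, 0 where nobody is taking it
  spread(med, frequency, fill = 0)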
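The same idea can be sketched in base R, which may help to see what the expansion does. The data frame below retypes the sample rows from the question; expanded and counts are placeholder names. Only dates on which at least one drug is taken appear in the result.

```r
df <- data.frame(
  id  = c(1, 2, 3, 4, 5, 5),
  med = c("drug_a", "drug_a", "drug_b", "drug_b", "drug_c", "drug_a"),
  start_date = as.Date(c("2010-08-24", "2011-06-07", "2010-03-26",
                         "2012-08-14", "2012-03-01", "2012-04-01")),
  end_date   = as.Date(c("2011-03-03", "2011-08-12", "2010-10-31",
                         "2013-01-31", "2012-06-20", "2012-06-14"))
)

# Expand every prescription into one row per day it covers
expanded <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(date = seq(df$start_date[i], df$end_date[i], by = "day"),
             med  = df$med[i])
}))

# Cross-tabulate: rows are dates, columns are drugs, cells are patient counts
counts <- as.data.frame.matrix(table(expanded$date, expanded$med))
counts["2012-04-01", ]  # drug_a 1, drug_b 0, drug_c 1
```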
I'm working with some meteorology data in R, and conceptually, I'm trying to find out how much a certain day is above/below average. To do this, I want to separate by day of year, find the average for all DOY (e.g. what is the average January 1 Temperature?), and then compare every date (e.g was January 1, 2014 anomalously warm, by how much?)
I can find a 'mean' table for every day of the year using aggregate:
head(data)
x date
1 5.072241 1970-01-01
2 6.517069 1970-01-02
3 4.413654 1970-01-03
4 11.129351 1970-01-04
5 9.331630 1970-01-05
library(lubridate)
temp = aggregate(data$x, list(yday(data$date)), mean)
but I'm stuck then how to use the aggregated table to compare with my original data.frame, to see how x at 1970 Jan 1 relates to average Jan 1 x.
We can strip the year from the date with sub to create a 'Monthday' variable, then use ave to add a Mean column grouped by 'Monthday'.
data$Monthday <- sub('\\d+-', '', data$date)
data$Mean <- with(data, ave(x, Monthday))
Then, we can compare with 'x' variable, for example
data$rel_temp <- with(data, x/Mean)
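If the goal is "how much above or below average", a difference may be easier to read than a ratio. The sketch below uses four made-up observations and builds the month-day key with format() instead of sub(); the anomaly column name is a placeholder. Note that a month-day key and yday() can group days differently after 29 February in leap years.

```r
# Made-up data: two years, two calendar days (assumption for illustration)
data <- data.frame(
  date = as.Date(c("1970-01-01", "1971-01-01", "1970-01-02", "1971-01-02")),
  x    = c(4, 6, 10, 14)
)

# Month-day key, e.g. "01-01"
data$Monthday <- format(data$date, "%m-%d")

# Mean of x for each calendar day, repeated on every matching row
data$Mean <- ave(data$x, data$Monthday)

# Anomaly: positive means above the average for that calendar day
data$anomaly <- data$x - data$Mean
data
```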
You should use dplyr as well.
library(dplyr); library(lubridate)
data %>% mutate(year_day = paste0(month(date), "_",mday(date))) %>%
group_by(year_day) %>% mutate(relev_temp = x/mean(x)) %>% ungroup
The logic is the following:
Create a new variable year_day, which is just the month and day of every date: mutate(year_day = ...
Then take the temperature x and divide it by the average temperature of that year_day: group_by(year_day) %>% mutate(relev_temp = x/mean(x))
Thanks for the feedback. @akrun's answer works well for me.
As an alternative, I also hacked this together, which produces the same output as @akrun's answer (and is 1/10th of a second slower for 40 yrs of daily data):
averages = aggregate(x, list(DOY = yday(date)), mean)
temp = merge(data.frame(x,date, DOY = yday(date)), averages, by = 'DOY')
head(temp[order(temp$date),])
DOY x.x date x.y
1 1 -12.0 1970-01-01 -8.306667
70 2 -14.2 1970-01-02 -8.695556
113 3 -16.7 1970-01-03 -8.060000
157 4 -13.6 1970-01-04 -8.233333
200 5 -19.2 1970-01-05 -8.633333
243 6 -15.0 1970-01-06 -8.922222