I am a beginner in R and I am trying to convert a set of calendar dates to Julian dates in a data frame. I know similar questions have been answered, but I have not been able to get what I want.
df <- data.frame(Date = c('2010-06-20','2005-10-19','2000-05-01','2003-04-04','2010-11-20','2009-09-14'), No = c(1, 4, 6, 11, 7, 9))
df$jDate <- as.POSIXct(as.numeric(df$Date), origin = '1970-01-01')
gives me
df
Date No jDate
1 2010-06-20 1 1969-12-31 19:00:05
2 2005-10-19 4 1969-12-31 19:00:03
3 2000-05-01 6 1969-12-31 19:00:01
4 2003-04-04 11 1969-12-31 19:00:02
5 2010-11-20 7 1969-12-31 19:00:06
6 2009-09-14 9 1969-12-31 19:00:04
How could I get a column with Julian days in the column 'jDate'?
Thank you for your help.
You can do
df$Date <- as.Date(df$Date)
to get the date, and then
df$jDate <- format(df$Date, "%j")
to get the julian days or
df$jDateYr <- format(df$Date, "%Y-%j")
to prepend the year (if you want). This returns
df
Date No jDate jDateYr
1 2010-06-20 1 171 2010-171
2 2005-10-19 4 292 2005-292
3 2000-05-01 6 122 2000-122
4 2003-04-04 11 094 2003-094
5 2010-11-20 7 324 2010-324
6 2009-09-14 9 257 2009-257
To read more about the possible date-time formats, see ?strptime.
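As a side note on why the attempt in the question returned timestamps a few seconds after the epoch, here is a minimal sketch. It assumes the Date column was stored as a factor, which was the default for character input to data.frame() before R 4.0.0. In that case as.numeric() returns the integer level codes rather than anything related to the dates.

```r
# Assumption: Date was created as a factor, as older R versions did by default
Date <- factor(c('2010-06-20', '2005-10-19', '2000-05-01',
                 '2003-04-04', '2010-11-20', '2009-09-14'))

# as.numeric() on a factor gives level codes (positions in the sorted levels)
codes <- as.numeric(Date)
print(codes)

# Converting via character first recovers the real dates
real_dates <- as.Date(as.character(Date))
print(real_dates)
```

The codes (5, 3, 1, 2, 6, 4) line up with the seconds shown in the question's output, which is why as.Date() on the character values is the first step in the answer.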
Based on aosmith's comments, I did this and got what I wanted.
df$jDate <- julian(as.Date(df$Date), origin = as.Date('1970-01-01'))
df
Date No jDate
1 2010-06-20 1 14780
2 2005-10-19 4 13075
3 2000-05-01 6 11078
4 2003-04-04 11 12146
5 2010-11-20 7 14933
6 2009-09-14 9 14501
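The two answers return different quantities: format(..., "%j") gives the day of the year, while julian() gives the number of days since an origin. A small base R sketch of the difference, using two of the dates from the question:

```r
dates <- as.Date(c("2010-06-20", "2003-04-04"))

# Day of the year (1-366); format() returns character, so convert to integer
day_of_year <- as.integer(format(dates, "%j"))

# Days elapsed since the chosen origin; as.numeric() drops the "origin" attribute
days_since_origin <- as.numeric(julian(dates, origin = as.Date("1970-01-01")))

print(day_of_year)        # 171  94
print(days_since_origin)  # 14780 12146
```

Which one counts as the "Julian date" depends on the convention your data source uses, so it is worth checking before picking either.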
I am trying to get a count of active clients per month, using data that has a start and end date for each client's episode. With the code I am using, I can't work out how to count per month rather than per n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which currently counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, Start.Date, End.Date, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
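Here's a solution that seems to work. First, I load the packages used below and set the seed so that the example is reproducible.
library(dplyr) # provides %>% and mutate()
library(lubridate) # provides day<-
# Set seed for reproducible example
set.seed(33550336)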
Next, I create a dummy data frame.
# Test data
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply, which gives the months between each start and end date. Then I combine all these months into one vector and format it so that the dates contain only the year and month. Finally, I pipe (using dplyr's %>%) this into table, which counts each occurrence of year-month, and cast the result as a data frame. Passing SIMPLIFY = FALSE keeps mapply from collapsing the result into a matrix when every sequence happens to have the same length.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date, SIMPLIFY = FALSE)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
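The same month-counting idea can be sketched in base R without dplyr or lubridate. The data and the helper name first_of_month are hypothetical and kept small enough to check by hand. Snapping each date to the first of its month mirrors mon_seq above and avoids surprises when a start date falls on a day that later months do not have.

```r
# Hypothetical episodes, small enough to check by hand
start <- as.Date(c("2014-01-15", "2014-02-10", "2014-03-31"))
end   <- as.Date(c("2014-03-02", "2014-02-20", "2014-05-01"))

# Snap a date to the first day of its month
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# For each episode, list every year-month it touches
months_active <- lapply(seq_along(start), function(i) {
  format(seq(first_of_month(start[i]), first_of_month(end[i]), by = "month"), "%Y-%m")
})

# Count how many episodes are active in each month
counts <- table(unlist(months_active))
print(counts)
```

Here the first episode covers January to March, the second only February, and the third March to May, so February and March each have two active episodes and the other months one.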
I have some data with an ID and a date, and I'm trying to create a new field for the semester.
df:
id date
1 20160822
2 20170109
3 20170828
4 20170925
5 20180108
6 20180402
7 20160711
8 20150831
9 20160111
10 20160502
11 20160829
12 20170109
13 20170501
I also have a semester table:
start end season_year
20120801 20121222 Fall-2012
20121223 20130123 Winter-2013
20130124 20130523 Spring-2013
20130524 20130805 Summer-2013
20130806 20131228 Fall-2013
20131229 20140122 Winter-2014
20140123 20140522 Spring-2014
20140523 20140804 Summer-2014
20140805 20141227 Fall-2014
20141228 20150128 Winter-2015
20150129 20150528 Spring-2015
20150529 20150803 Summer-2015
20150804 20151226 Fall-2015
20151227 20160127 Winter-2016
20160128 20160526 Spring-2016
20160527 20160801 Summer-2016
20160802 20161224 Fall-2016
20161225 20170125 Winter-2017
20170126 20170525 Spring-2017
20170526 20170807 Summer-2017
20170808 20171230 Fall-2017
20171231 20180124 Winter-2018
20180125 20180524 Spring-2018
20180525 20180806 Summer-2018
20180807 20181222 Fall-2018
20181223 20190123 Winter-2019
20190124 20190523 Spring-2019
20190524 20180804 Summer-2019
I'd like to create a new field in df: if df$date is between semester$start and semester$end, place the respective value of semester$season_year in df.
I tried to see if the lubridate package could help, but that seems to be more for calculations.
I saw this question and it seems to be the closest to what I want but, to make things more complicated, not all of our semesters are six months long.
Does this work?
library(lubridate)
semester$start <- ymd(semester$start)
semester$end <- ymd(semester$end)
df$date <- ymd(df$date)
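# Expand each semester into its individual days (as day numbers since 1970-01-01)
LU <- Map(`:`, semester$start, semester$end)
LU <- data.frame(value = unlist(LU),
index = rep(seq_along(LU), lengths(LU)))
# Compare on the numeric day number so that Date and numeric values match reliably
df$semester <- semester$season_year[LU$index[match(as.numeric(df$date), LU$value)]]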
A solution using a non-equi update join from the data.table package could look like this:
library(data.table)
setDT(df)
setDT(semester)
df[,date:=as.IDate(as.character(date), format = "%Y%m%d")]
semester[,':='(start = as.IDate(as.character(start), format = "%Y%m%d"),
end=as.IDate(as.character(end), format = "%Y%m%d"))]
df[semester, on=.(date >= start, date <= end), season_year := i.season_year]
df
# id date season_year
# 1: 1 2016-08-22 Fall-2016
# 2: 2 2017-01-09 Winter-2017
# 3: 3 2017-08-28 Fall-2017
# 4: 4 2017-09-25 Fall-2017
# 5: 5 2018-01-08 Winter-2018
# 6: 6 2018-04-02 Spring-2018
# 7: 7 2016-07-11 Summer-2016
# 8: 8 2015-08-31 Fall-2015
# 9: 9 2016-01-11 Winter-2016
# 10: 10 2016-05-02 Spring-2016
# 11: 11 2016-08-29 Fall-2016
# 12: 12 2017-01-09 Winter-2017
# 13: 13 2017-05-01 Spring-2017
Data:
df <- read.table(text="
id date
1 20160822
2 20170109
3 20170828
4 20170925
5 20180108
6 20180402
7 20160711
8 20150831
9 20160111
10 20160502
11 20160829
12 20170109
13 20170501",
header = TRUE, stringsAsFactors = FALSE)
semester <- read.table(text="
start end season_year
20120801 20121222 Fall-2012
20121223 20130123 Winter-2013
20130124 20130523 Spring-2013
20130524 20130805 Summer-2013
20130806 20131228 Fall-2013
20131229 20140122 Winter-2014
20140123 20140522 Spring-2014
20140523 20140804 Summer-2014
20140805 20141227 Fall-2014
20141228 20150128 Winter-2015
20150129 20150528 Spring-2015
20150529 20150803 Summer-2015
20150804 20151226 Fall-2015
20151227 20160127 Winter-2016
20160128 20160526 Spring-2016
20160527 20160801 Summer-2016
20160802 20161224 Fall-2016
20161225 20170125 Winter-2017
20170126 20170525 Spring-2017
20170526 20170807 Summer-2017
20170808 20171230 Fall-2017
20171231 20180124 Winter-2018
20180125 20180524 Spring-2018
20180525 20180806 Summer-2018
20180807 20181222 Fall-2018
20181223 20190123 Winter-2019
20190124 20190523 Spring-2019
20190524 20180804 Summer-2019",
header = TRUE, stringsAsFactors = FALSE)
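For comparison, here is a base R sketch of the same interval lookup using findInterval(). It assumes the semester table is sorted by start date and that the intervals do not overlap. The three semester rows are copied from the question; the last test date (20170301) is a hypothetical value added only to show what happens when no semester matches.

```r
# Three rows copied from the semester table in the question
semester <- data.frame(
  start = as.Date(c("20160527", "20160802", "20161225"), format = "%Y%m%d"),
  end   = as.Date(c("20160801", "20161224", "20170125"), format = "%Y%m%d"),
  season_year = c("Summer-2016", "Fall-2016", "Winter-2017"),
  stringsAsFactors = FALSE
)
dates <- as.Date(c("20160822", "20170109", "20160711", "20170301"), format = "%Y%m%d")

# Index of the last semester whose start is on or before each date (0 if none)
idx <- findInterval(as.numeric(dates), as.numeric(semester$start))
idx[idx == 0] <- NA

# Keep the match only if the date is also on or before that semester's end
inside <- !is.na(idx) & dates <= semester$end[idx]
season <- ifelse(inside, semester$season_year[idx], NA_character_)
print(season)
```

The first three dates resolve to Fall-2016, Winter-2017 and Summer-2016, in line with the data.table output above, and the date outside every interval becomes NA.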
I have a vector of dates like this:
1 2014-03-10 22:54:24
2 2014-03-10 22:53:16
3 2014-03-10 22:53:01
4 2014-03-10 22:52:38
5 2014-03-10 22:52:00
6 2014-03-01 01:13:08
7 2014-03-01 01:11:30
8 2014-03-01 01:07:41
9 2014-03-01 01:05:28
10 2014-03-01 00:58:40
11 2014-03-27 18:11:57
How can I group by month, day, morning, afternoon or week? For instance:
month sum
2014-3 11
==============
week sum
2014-3-1 5
2014-3-9 5
==============
2014-3-1
morning sum
2014-3-1 5
Use the data.table package and get to know the POSIXlt class.
# x is assumed to be your vector of time objects (POSIXct or POSIXlt).
# The following lines only illustrate the POSIXlt components. You do not need to run them.
Secs <- as.POSIXlt(x)$sec
Mins <- as.POSIXlt(x)$min
# ...
Month <- as.POSIXlt(x)$mon + 1 # months start at 0 instead of 1
Year <- as.POSIXlt(x)$year - 100 # years are counted from 1900, so 2016 is stored as 116 and this gives 16
DayOfYear <- as.POSIXlt(x)$yday + 1 # starts at 0
You can calculate more complicated values similarly. Use data.table now.
require(data.table)
X <- as.data.table(x) # creates a data.table object
setnames(X, "Time") # names the 1 column 'Time'
X[ , month := as.POSIXlt(Time)$mon + 1] # adds a column month
X[ , doy := as.POSIXlt(Time)$yday + 1] # adds a column day of year
#....
Now you can group your data.table with:
X[ , .N, by = doy]
X[ , .N, by = month]
# ...
.N returns the number of items in each group. You could also combine the grouping:
X[ , .N, by = list(doy, month)]
There are many good tutorials on data.table, and its grouping and evaluation syntax is similar to SQL (which is also covered in many tutorials).
A good place to start is the developer's FAQ:
http://datatable.r-forge.r-project.org/datatable-faq.pdf
EDIT:
Of course you could also make more complicated columns for afternoon and morning like this:
X[ , afternoon := as.POSIXlt(Time)$hour >= 12] # TRUE from 12:00 onwards
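The POSIXlt components used above can be checked on a few values in base R. This is a sketch only: tz = "UTC" is an assumption made so the hours are unambiguous, and the third timestamp is a hypothetical value added to show how noon is classified.

```r
# Example timestamps; tz = "UTC" is an assumption so the hours are unambiguous
x <- as.POSIXct(c("2014-03-10 22:54:24", "2014-03-01 01:13:08",
                  "2014-03-27 12:00:00"), tz = "UTC")
lt <- as.POSIXlt(x)

month  <- lt$mon + 1     # mon is 0-11
doy    <- lt$yday + 1    # yday is 0-365
period <- ifelse(lt$hour >= 12, "pm", "am")  # noon counts as afternoon

print(data.frame(x, month, doy, period))
print(table(period))
```

Accessing components by name ($mon, $yday, $hour) keeps the code readable and does not depend on the position of each component inside the POSIXlt object.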
Assuming you have a data frame like this where time is in POSIXct format:
df
time
1 2014-03-10 22:54:24
2 2014-03-10 22:53:16
3 2014-03-10 22:53:01
4 2014-03-10 22:52:38
5 2014-03-10 22:52:00
6 2014-03-01 01:13:08
7 2014-03-01 01:11:30
8 2014-03-01 01:07:41
9 2014-03-01 01:05:28
10 2014-03-01 00:58:40
11 2014-03-27 18:11:57
You can get month, week and am/pm as follows:
df$month <- format(df$time, '%Y-%m')
df$week <- format(df$time, '%Y-%U')
df$ampm <- ifelse(as.numeric(format(df$time, '%H')) >= 12, 'pm', 'am')
df
time month week ampm
1 2014-03-10 22:54:24 2014-03 2014-10 pm
2 2014-03-10 22:53:16 2014-03 2014-10 pm
3 2014-03-10 22:53:01 2014-03 2014-10 pm
4 2014-03-10 22:52:38 2014-03 2014-10 pm
5 2014-03-10 22:52:00 2014-03 2014-10 pm
6 2014-03-01 01:13:08 2014-03 2014-08 am
7 2014-03-01 01:11:30 2014-03 2014-08 am
8 2014-03-01 01:07:41 2014-03 2014-08 am
9 2014-03-01 01:05:28 2014-03 2014-08 am
10 2014-03-01 00:58:40 2014-03 2014-08 am
11 2014-03-27 18:11:57 2014-03 2014-12 pm
Then, you can get your summaries using library dplyr like this:
library(dplyr)
count(df, month)
Source: local data frame [1 x 2]
month n
(chr) (int)
1 2014-03 11
count(df, week)
Source: local data frame [3 x 2]
week n
(chr) (int)
1 2014-08 5
2 2014-10 5
3 2014-12 1
count(df, ampm)
Source: local data frame [2 x 2]
ampm n
(chr) (int)
1 am 5
2 pm 6
I have a column of dates in the format:
16Jun10
and I would like to extract the Julian day.
I have various years.
I have tried the functions julian and mdy.date and it doesn't seem to work.
Try the following to convert from class character (i.e. text) to class POSIXlt, and then extract the Julian day (yday):
tmp <- as.POSIXlt("16Jun10", format = "%d%b%y")
tmp$yday
# [1] 166
For more details on function settings:
?POSIXlt
?DateTimeClasses
Another option is to use the Date class and then use format to extract the Julian day (note that format counts days from 1 to 366, while POSIXlt's yday counts from 0 to 365):
tmp <- as.Date("16Jun10", format = "%d%b%y")
format(tmp, "%j")
# [1] "167"
Similarly:
require(lubridate)
x = as.Date('2010-06-10')
yday(x)
[1] 161
Also note, using lubridate:
> dmy('16Jun10')
[1] "2010-06-16 UTC"
You can use R's insol package which has a JD(x, inverse=FALSE) function which converts POSIXct to Julian Day Number (JDN).
insol package also has JDymd(year,month,day,hour=12,minute=0,sec=0) for custom dates.
To display the whole Julian Date (JD) you possibly have to set options(digits=16).
my.data = read.table(text = "
OBS MONTH1 DAY1 YEAR1
1 3 1 2012
2 3 31 2012
3 4 1 2012
4 4 30 2012
5 5 1 2012
6 5 31 2012
7 6 1 2012
8 6 30 2012
9 7 1 2012
10 7 31 2012
", header = TRUE, stringsAsFactors = FALSE)
my.data$MY.DATE1 <- do.call(paste, list(my.data$MONTH1, my.data$DAY1, my.data$YEAR1))
my.data$MY.DATE1 <- as.Date(my.data$MY.DATE1, format=c("%m %d %Y"))
my.data$my.julian.date <- as.numeric(format(my.data$MY.DATE1, "%j"))
my.data
This returns the following, which is technically incorrect, since Julian dates do not reset to 1 on the first day of each January:
http://en.wikipedia.org/wiki/Julian_day
The dates below are ordinal dates:
OBS MONTH1 DAY1 YEAR1 MY.DATE1 my.julian.date
1 1 3 1 2012 2012-03-01 61
2 2 3 31 2012 2012-03-31 91
3 3 4 1 2012 2012-04-01 92
4 4 4 30 2012 2012-04-30 121
5 5 5 1 2012 2012-05-01 122
6 6 5 31 2012 2012-05-31 152
7 7 6 1 2012 2012-06-01 153
8 8 6 30 2012 2012-06-30 182
9 9 7 1 2012 2012-07-01 183
10 10 7 31 2012 2012-07-31 213
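To make the distinction concrete, here is a small base R sketch that computes both the ordinal date and a Julian Day Number for two of the dates above. The helper name jdn_from_date is hypothetical. It relies on the fact that R stores a Date as days since 1970-01-01, and that 1970-01-01 corresponds to Julian Day Number 2440588.

```r
# Hypothetical helper: Julian Day Number for a Date
# (Date is stored as days since 1970-01-01, whose Julian Day Number is 2440588)
jdn_from_date <- function(d) as.numeric(d) + 2440588

d <- as.Date(c("2012-03-01", "2012-07-31"))
ordinal <- as.integer(format(d, "%j"))  # resets every 1 January
jdn     <- jdn_from_date(d)             # keeps counting across years

print(ordinal)  # 61 213
print(jdn)
```

The ordinal values match the my.julian.date column above, while the Julian Day Numbers are large integers that never reset.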
Here are my R versions of code originally written in APL and converted to J. We call this pseudo-Julian because it is only intended for dates after October 15, 1582 which is when calendar reform, in some parts of the Western world, arbitrarily changed the date.
#* toJulian: convert 3-element c(Y,M,D) timestamp into pseudo-Julian day number.
toJulian<- function(TS3)
{ mm<- TS3[2]
xx<- 0
if( mm<=2) {xx<- 1}
mm<- (12*xx)+mm
yy<- TS3[1]-xx
nc<- floor(0.01*yy)
jd<- floor(365.25*yy)+floor(30.6001*(1+mm))+TS3[3]+1720995+(2-(nc-floor(0.25*nc)))
return(jd)
#EG toJulian(c(1959,5,24)) -> 2436713
#EG toJulian(c(1992,12,16)) -> 2448973
}
Here's the inverse function:
#* toGregorian: convert pseudo-Julian day number to timestamp in form c(Y,M,D)
# (>15 Oct 1582). Adapted from "Numerical Recipes in C" by Press,
# Teukolsky, et al.
toGregorian<- function(jdn)
{ igreg<- 2299161 # Gregorian calendar conversion day c(1582,10,15).
ja<- floor(jdn)
xx<- 0
if(igreg<=ja){xx<- 1}
jalpha<- floor((floor((xx*ja)-1867216)-0.25)/36524.25)
ja<- ((1-xx)*ja) + ((xx*ja)+1+jalpha-floor(0.25*jalpha))
jb<- ja+1524
jc<- floor(6680+((jb-2439870)-122.1)/365.25)
jd<- floor(365.25*jc)
je<- floor((jb-jd)/30.6001)
id<- floor((jb-jd)-floor(30.6001*je))
mm<- floor(je-1)
if(12<mm){mm<- mm-12}
iyyy<- floor(jc-4715)
if(mm>2){iyyy<- iyyy-1}
if(0>iyyy){iyyy<- iyyy-1}
gd<- c(iyyy, mm, id)
return(gd)
#EG toGregorian(2436713) -> c(1959,5,24)
#EG toGregorian(2448973) -> c(1992,12,16)
}
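Because toJulian uses if(), it handles one date at a time. A vectorised variant might look like the sketch below. This is an adaptation for illustration, not the author's code: the name to_julian_vec and the three-argument interface are assumptions, and like the original it is only meant for Gregorian dates.

```r
# Vectorised sketch of toJulian: year, month, day are numeric vectors of equal length
to_julian_vec <- function(year, month, day) {
  shift <- ifelse(month <= 2, 1, 0)  # treat Jan/Feb as months 13/14 of the previous year
  mm <- month + 12 * shift
  yy <- year - shift
  nc <- floor(0.01 * yy)             # century number, used for the Gregorian correction
  floor(365.25 * yy) + floor(30.6001 * (1 + mm)) + day + 1720995 +
    (2 - (nc - floor(0.25 * nc)))
}

res <- to_julian_vec(c(1959, 1992), c(5, 12), c(24, 16))
print(res)  # 2436713 2448973
```

The two results agree with the examples documented in the toJulian comments, which gives a quick sanity check that the vectorised arithmetic follows the same formula.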
Let's say I have several years' worth of data, which looks like the following:
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
wday = wday(date),
wday.name = wday(date, label = TRUE, abbr = TRUE),
income = round(runif(21, 0, 100)),
week = format(date, format="%Y-%U"),
stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, in particular the part that defines %U as the week number.
EDIT:
Given the modified data and requirement, one way is to divide the date by 7 to get a number indicating the week. (More precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default. This assumes df$date is a POSIXct; for a Date, as.numeric() returns days rather than seconds, so you would divide by 7 instead.)
In code:
df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(df$date)/(3600*24*7))
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
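One way to get weeks that run Sunday to Saturday without working out an epoch offset is to subtract each date's weekday number. The sketch below uses base R only and copies the dates and income values from the table in the question; the variable names are illustrative.

```r
# Dates and income copied from the question's table
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46, 72, 93, 26,
            46, 94, 98, 12, 47, 56, 90)

# POSIXlt's wday is 0 for Sunday, so this moves every date back to its Sunday
week_start  <- date - as.POSIXlt(date)$wday
week_ending <- week_start + 6  # the Saturday that closes the week

# Sum income per week, labelled by the week-ending date
weekly <- tapply(income, format(week_ending), sum)
print(weekly)
```

This gives 487, 387 and 443 for the weeks ending 2011-01-01, 2011-01-08 and 2011-01-15, matching the manual sums in the question.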
This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.
library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
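Note that cut() on dates starts weeks on Monday by default, whereas the question asks for Sunday to Saturday. A base R sketch of the difference, using the dates and income values from the question and the start.on.monday argument of cut.Date:

```r
# Dates and income copied from the question's table
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46, 72, 93, 26,
            46, 94, 98, 12, 47, 56, 90)

wk_monday <- cut(date, breaks = "week")                           # default: weeks start on Monday
wk_sunday <- cut(date, breaks = "week", start.on.monday = FALSE)  # weeks start on Sunday

# Weekly sums with Sunday-based weeks
sunday_sums <- tapply(income, wk_sunday, sum)
print(sunday_sums)
```

With Sunday-based weeks the sums line up with the figures in the question; with the default, the first Sunday is grouped with the preceding Monday's week and the totals shift.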
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep track of both the week number and the year, so I modified the answers there like so: format(date, format = "%y%U")
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by=df$date),
which(df$wday %in% 7)),
week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
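The 2011-00 label comes from how %U numbers weeks: days before the first Sunday of a year fall in week 00. A short base R check on three consecutive days from the question's data:

```r
# Friday, Saturday and Sunday around New Year 2011
d <- as.Date(c("2010-12-31", "2011-01-01", "2011-01-02"))
week_label <- format(d, "%Y-%U")
print(week_label)  # "2010-52" "2011-00" "2011-01"
```

So a Sunday-to-Saturday week that straddles 1 January is split into two labels, which is why grouping on "%Y-%U" alone does not reproduce the 487 total for the first week.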
This solution is influenced by @Andrie and @Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
df.index = df['week'] # use the dt variable as the index (this is pandas, i.e. Python, not R)
df.resample('W').sum() # sum using resample
With dplyr:
df %>%
arrange(date) %>%
mutate(week = as.numeric(date - date[1])%/%7) %>%
group_by(week) %>%
summarise(weekincome= sum(income))
Instead of date[1], you can use any date from which you want to start counting weeks.