Format dates to Month-Year while keeping class Date - r

I feel like there's a pretty simple way to do this, but I'm not finding it easily...
I am working with R to extract data from a dataset and then summarize it by a number of different characteristics. One of them is the month in which an event is scheduled / has occurred. We have the exact date of the event in the database, something like this:
person_id date_visit
1 2012-05-03
2 2012-08-13
3 2012-12-12
...
I would like to use the table() function to generate a summary table that would look something like this:
Month Freq
Jan 12 1
Feb 12 2
Mar 12 1
Apr 12 3
...
My issue is this. I've read the data in and used as.Date() to convert character strings to dates. I can use format.Date() to get the dates formatted as Jan 12, Mar 12, etc. But when you use format.Date(), you end up with character strings again. This means when you apply table() to them, they come out in alphabetical order (my current set is Aug 12, Jul 12, Jun 12, Mar 12, and so forth).
I know that in SAS, you could use a format to change the appearance of a date, while preserving it as a date (so you could still do date operators on it). Can the same thing be done using R?
My plan is to build a nice data frame through a number of steps, and then (after making sure that all the dates are converted to strings, for compatibility reasons) use xtable() to make a nice LaTeX output.
Here's my code at present.
load("temp.RData")
ds$date_visit <- as.Date(ds$date_visit,format="%Y-%m-%d")
table(format.Date(ds$date_visit,format="%b %Y"))
ETA: I'd prefer to just do it in Base R if I can, but if I have to I can always use an additional package.

You could use the yearmon class from the zoo package
require("zoo")
ds <- data.frame(person_id=1:3, date_visit=c("2012-05-03", "2012-08-13", "2012-12-12"))
ds$date_visit <- as.yearmon(ds$date_visit)
ds
person_id date_visit
1 1 May 2012
2 2 Aug 2012
3 3 Dec 2012

month.abb is a constant vector in R and can be used to sort on the first three letters of the names of the table.
ds <- data.frame(person_id=1:3, date_visit=as.Date(c("2012-05-03", "2012-08-13", "2012-12-12")))
table(format( ds$date_visit, format="%b %Y"))
tbl <- table(format( ds$date_visit, format="%b %Y"))
tbl[order( match(substr(names(tbl), 1,3), month.abb) )]
May 2012 Aug 2012 Dec 2012
1 1 1
With additional years you would see the "May"s all together so this would be needed:
tbl[order( substr(names(tbl), 5,8), match(substr(names(tbl), 1,3), month.abb) )]
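If you would rather avoid string surgery, one base R option is to turn the formatted labels into a factor whose levels are already in chronological order; table() reports counts in level order. This is only a minimal sketch, not the method from the answer above; the extra 2013 row and the names month_start, lvls and month_f are made up for illustration.

```r
# Example data, with one visit in a second year added for illustration
ds <- data.frame(
  person_id  = 1:4,
  date_visit = as.Date(c("2012-05-03", "2012-08-13", "2012-12-12", "2013-05-20"))
)

# First day of each visit's month; still class Date, so it sorts chronologically
month_start <- as.Date(format(ds$date_visit, "%Y-%m-01"))

# Month labels in chronological order (month names depend on the locale)
lvls <- format(sort(unique(month_start)), "%b %Y")

# Factor with chronologically ordered levels; table() follows the level order
month_f <- factor(format(ds$date_visit, "%b %Y"), levels = lvls)
tbl <- table(month_f)
tbl
```

The Date column itself is untouched, so date arithmetic still works on ds$date_visit; only the label used for tabulating is a factor.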

Related

How could I generate a numerical value for Time?

I have time data for mixed linear analysis.
I hope to use R to center on time to get a numerical value.
Below is an example:
TIME = 0 at Wave 1 (0 month, September 2006),
TIME = 0.67 at Wave 2 (8 months, May 2007),
TIME = 1 at Wave 3 (12 months, September 2007),
TIME = 1.67 at Wave 4 (20 months, May 2008),
TIME = 2 at Wave 5 (24 months, September 2008),
TIME = 2.67 at Wave 6 (32 months, May 2009).
Expected format:
Time = ? Wave 1 is April 2020
Time = ? Wave 2 is July 2020
Time = ? Wave 3 is Jan 2021
Time = ? Wave 4 is April 2021
I hope to calculate the numerical value Time.
How could I use R to generate a Time Value like the example shows?
Perhaps I'm unfamiliar with this approach, but it doesn't look like you are "centering." It looks like you are calculating durations. Specifically, each of the values in the example that you give are just the time (in years) since wave 1 (i.e., May 2009 is 2.67 years from Sep 2006). There's nothing wrong with this, I just want to make sure we are working on the same problem.
Assuming you are just looking for the amount of time between two dates, you have two options.
Option 1: Lubridate
The lubridate package is generally the easiest way to work with dates. If you don't use it yet, I think you'll really appreciate how easy it makes handling dates and times in R (but it does need to be installed with install.packages('lubridate')).
library(lubridate)
wave_dates <- c('April 1, 2020', 'July 1, 2020', 'Jan 1, 2021', 'April 1, 2021')
wave_dates <- mdy(wave_dates) # lubridate converts from string to date objects
# get times in years
(wave_dates - min(wave_dates))/dyears(1)
# > [1] 0.0000000 0.2491444 0.7529090 0.9993155
Option 2: Base R
If you want to use base R, you'll need to make sure your dates are converted into a format R can understand with strptime(). Make sure to consult ?strptime()'s documentation for all of the different formatting instructions you can give it (there are a lot). In this case, we need...
wave_dates <- c('April 1, 2020', 'July 1, 2020', 'January 1, 2021', 'April 1, 2021')
wave_dates <- strptime(wave_dates, '%B %d, %Y') # base R converts from string to date objects
difftime(wave_dates, min(wave_dates), units = 'days') / 365
#> [1] 0.0000000 0.2493151 0.7535388 1.0000000
Note that when using difftime() we need to divide our answer by 365 because it doesn't have a units = 'years' option: calendar years differ in length (leap years), so base R offers no fixed 'year' unit. In contrast, lubridate's dyears(1) stands for a fixed average year of 365.25 days, which is why the two sets of results differ slightly.
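The example in the question (8 months giving 0.67, 20 months giving 1.67) looks like whole months divided by 12, so a third possibility is to count calendar months rather than days. The following is a rough base R sketch under that assumption; it presumes each wave is identified by the first day of its month, and the names lt, months_since_first and time_years are invented here. Because it works on Date objects, which carry no time zone, daylight-saving shifts cannot affect the result.

```r
# Wave dates given as the first day of each wave's month (ISO format, locale-independent)
wave_dates <- as.Date(c("2020-04-01", "2020-07-01", "2021-01-01", "2021-04-01"))

# POSIXlt exposes the year (counted from 1900) and the month (0-11) as plain numbers
lt <- as.POSIXlt(wave_dates)

# Whole calendar months elapsed since the first wave
months_since_first <- (lt$year - lt$year[1]) * 12 + (lt$mon - lt$mon[1])

# Time in years, in the same style as the 0 / 0.67 / 1 / 1.67 example
time_years <- months_since_first / 12
time_years
# [1] 0.00 0.25 0.75 1.00
```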

Character 2 digit year conversion to year only

Using R
Got large clinical health data set to play with, but dates are weird
Most problematic is 2digityear/halfyear, as in 98/2, meaning at some point in 1998 after July 1
I have split the column up into 2 character columns, e.g. 98 and 2 but now need to convert the 2 digit year character string into an actual year.
I tried as.Date(data$variable,format="%Y") but not only did I get a conversion to 0098 as the year rather than 1998, I also got today's month and day arbitrarily added (the actual data has no month or day).
as in 0098-06-11
How do I get just 1998 instead?
Not elegant, but using a combination of lubridate and as.Date you can get that.
library(lubridate)
data <- data.frame(variable = c(95, 96, 97,98,99), date=c(1,2,3,4,5))
data$variableUpdated <- year(as.Date(as.character(data$variable), format="%y"))
and only with base R
data$variableUpdated <- format(as.Date(as.character(data$variable), format="%y"),"%Y")
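Since the data has no month or day at all, another option is to skip Date parsing and do the century arithmetic directly, which also lets you choose the cutoff between 19xx and 20xx yourself. This is a sketch only: the sample values and the cutoff of 30 (the name pivot is made up here) are assumptions you would adapt to your own data.

```r
# Two-digit years as character strings, as left over after splitting "98/2"
yy <- c("95", "96", "97", "98", "99", "02")
yy_num <- as.integer(yy)

# Hypothetical cutoff: values above 30 are read as 19xx, the rest as 20xx
pivot <- 30L
year4 <- ifelse(yy_num > pivot, 1900L + yy_num, 2000L + yy_num)
year4
# [1] 1995 1996 1997 1998 1999 2002
```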

Importing Time Series In r

I am trying to implement time series analysis on a data set which has two attributes (Year & Sales). The years are 2016, 2017 & 2018, and for each there are average sales values for all 12 months. My data looks like below:
JAN FEB MAR APR MAY JUNE
2016 4457. 4,105 4,276 4712. 5,116 4,512
2017 4,222 5,432 4,816 5,018 4,497 4,603
2018 4,355 4,972 4,868 4,665 4,735 4,926
This is just part of my data set, to give an idea of what it looks like. The months run from JAN to DEC. Firstly, how do I import this data set into R? I obviously cannot import it as it is, because R treats all the columns as X1, X2 etc. and these become too many variables. Secondly, R takes this data set as a "data.frame". How can I convert it into just "ts"? I have tried
data.ts<- as.ts(myData)
but it converts it into
"mts" "ts" "matrix"
and moreover, it shows my frequency as 1 while it should be 12. Please help me. I am stuck at the start.
First you want to restructure your data to be in long format which can be done with the gather function from tidyr.
library(tidyr)
myData <- myData %>% tidyr::gather(timeperiod, sales, JAN:DEC)
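Then, once the rows are in chronological order (gather stacks the values month by month, so sort them by year and then by calendar month first) and sales is stored as numbers rather than strings with commas, you can build the series with ts(), which takes start and frequency arguments:
sales_ts <- ts(myData$sales, start=c(2016,1), frequency=12)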
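A base R route that needs no reshaping is to read the table row by row: transposing the year-by-month table and flattening it yields the values in chronological order, ready for ts(). The sketch below uses invented numbers and hypothetical names (wide, num, sales_ts), and assumes the years are row names and every cell is a character string with a thousands separator.

```r
# Made-up wide table: one row per year, one column per month, values as text with commas
wide <- as.data.frame(
  matrix(format(4000 + 1:36, big.mark = ","), nrow = 3, byrow = TRUE,
         dimnames = list(2016:2018, toupper(month.abb))),
  stringsAsFactors = FALSE
)

# Strip the thousands separators and convert every column to numeric
num <- sapply(wide, function(x) as.numeric(gsub(",", "", x)))

# t() puts months in rows, so as.vector() reads Jan 2016, Feb 2016, ..., Dec 2018
sales_ts <- ts(as.vector(t(num)), start = c(2016, 1), frequency = 12)

class(sales_ts)      # "ts"
frequency(sales_ts)  # 12
```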

Extract year from date

How can I remove the first elements from a variable, especially if this variable has special characters? For example, I have the following column:
Date
01/01/2009
01/01/2010
01/01/2011
01/01/2012
I need to have a new column like the following:
Date
2009
2010
2011
2012
As discussed in the comments, this can be achieved by converting the entry into Date format and extracting the year, for instance like this:
format(as.Date(df1$Date, format="%d/%m/%Y"),"%Y")
library(lubridate)
a <- mdy(b) # b is the character vector of dates, assumed to be in month/day/year order
year(a)
https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
http://vita.had.co.nz/papers/lubridate.pdf
When you convert your variable to Date:
date <- as.Date('10/30/2018','%m/%d/%Y')
you can then cut out the elements you want and make new variables, like year:
year <- as.numeric(format(date,'%Y'))
or month:
month <- as.numeric(format(date,'%m'))
If all your dates are the same width, you can put the dates in a vector and use substring:
Date
a <- c("01/01/2009", "01/01/2010" , "01/01/2011")
substring(a,7,10) #This takes string and only keeps the characters beginning in position 7 to position 10
output
[1] "2009" "2010" "2011"
This is more advice than a specific answer, but my suggestion is to convert dates to date variables immediately, rather than keeping them as strings. This way you can use date (and time) functions on them, rather than trying to use very troublesome workarounds.
As pointed out, the lubridate package has nice extraction functions.
For some projects, I have found that piecing dates out from the start is helpful:
create year, month, day (of month) and day (of week) variables to start with.
This can simplify summaries, tables and graphs, because the extraction code is separate from the summary/table/graph code, and because if you need to change it, you don't have to roll out those changes in multiple spots.
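As a rough illustration of piecing the date out up front, the following base R sketch converts the column once and then derives the parts. The column names year, month, day and weekday are just suggestions, and the day/month/year order is an assumption about the data.

```r
df1 <- data.frame(Date = c("01/01/2009", "01/01/2010", "01/01/2011", "01/01/2012"),
                  stringsAsFactors = FALSE)

# Convert once, right after import (assuming day/month/year order)
df1$Date <- as.Date(df1$Date, format = "%d/%m/%Y")

# Derive the pieces a single time; later summaries and plots reuse these columns
df1$year    <- as.integer(format(df1$Date, "%Y"))
df1$month   <- as.integer(format(df1$Date, "%m"))
df1$day     <- as.integer(format(df1$Date, "%d"))
df1$weekday <- as.integer(format(df1$Date, "%u"))  # 1 = Monday, ..., 7 = Sunday

df1$year
# [1] 2009 2010 2011 2012
```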
If you are using the date package, this can be done fairly easily.
library(date)
Date <- c("01/01/2009", "01/01/2010", "01/01/2011", "01/01/2012")
Date <- as.date(Date)
Date
# [1] 1Jan2009 1Jan2010 1Jan2011 1Jan2012
date.mdy(Date)$year
# [1] 2009 2010 2011 2012
## be aware that these are now integers and thus different methods may be invoked:
str(date.mdy(Date)$year)
# int [1:4] 2009 2010 2011 2012
summary(Date)
# First Last
# "1Jan2009" "1Jan2012"
summary(date.mdy(Date)$year)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 2009 2010 2010 2010 2011 2012
For some time now, you can also rely solely on the data.table package and its IDate class plus associated functions (check ?as.IDate).
require(data.table)
a <- c("01/01/2009", "01/01/2010" , "01/01/2011")
year(as.IDate(a, '%d/%m/%Y')) # all data.table functions

Calculating days per month between interval of two dates

I have a set of events that each have a start and end date, but they take place over the scope of a number of months. I would like to create a table that shows the number of days in each month for this event.
I have the following example.
event_start_date <- as.Date("23/10/2012", "%d/%m/%Y")
event_end_date <- as.Date("07/02/2013", "%d/%m/%Y")
I would expect to get a table out as the following:
Oct-12 8
Nov-12 30
Dec-12 31
Jan-13 31
Feb-13 7
Does anybody know about a smart and elegant way of doing this or is creating a system of loops the only viable method?
Jochem
This is not necessarily efficient because it creates a sequence of days, but it does the job:
> library(zoo)
> table(as.yearmon(seq(event_start_date, event_end_date, "day")))
Oct 2012 Nov 2012 Dec 2012 Jan 2013 Feb 2013
9 30 31 31 7
If your time span is so large that this method is slow, you'll have to create a sequence of firsts of the months between your two (truncated) dates, take the diff, and do a little extra work for the end points.
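One way the month-boundary idea could look in base R is sketched below. It is an illustration rather than the answerer's code: the function name days_per_month is made up, and it counts both the start and the end day, like the day-sequence version above.

```r
days_per_month <- function(start, end) {
  # First day of the start month and of the end month
  first_of_start <- as.Date(format(start, "%Y-%m-01"))
  first_of_end   <- as.Date(format(end, "%Y-%m-01"))
  firsts <- seq(first_of_start, first_of_end, by = "month")

  # Month boundaries: every first, plus the first of the month after the end month
  bounds <- c(firsts, seq(first_of_end, by = "month", length.out = 2)[2])

  # Clip each month to the interval [start, end + 1) and measure what is left
  lo <- pmax(as.numeric(bounds[-length(bounds)]), as.numeric(start))
  hi <- pmin(as.numeric(bounds[-1]), as.numeric(end) + 1)
  days <- as.integer(hi - lo)

  names(days) <- format(firsts, "%b-%y")  # month names depend on the locale
  days
}

event_start_date <- as.Date("23/10/2012", "%d/%m/%Y")
event_end_date   <- as.Date("07/02/2013", "%d/%m/%Y")
days_per_month(event_start_date, event_end_date)
# values: 9 30 31 31 7
```

Only one element per month is created, so the cost grows with the number of months rather than the number of days.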
As DjSol already pointed out in his comment, you can just subtract two dates to get the number of days:
event_start_date <- as.Date("23/10/2012", "%d/%m/%Y")
event_end_date <- as.Date("07/02/2013", "%d/%m/%Y")
as.numeric(event_end_date - event_start_date)
Is that what you want? I have the feeling that you might have more of a problem to get the start and end date in such a format so you can easily subtract them because you mention a loop. If so, however, I guess we need more details on how your actual data looks.
