Using lubridate with multiple date formats - r

I have a column of dates that was stored in the format 8/7/2001, 10/21/1990, etc. Two values are just four-digit years. I converted the entire column to class Date using the following code.
lubridate::parse_date_time(eventDate, orders = c('mdy', 'Y'))
It works great, except the values that were just years are converted to yyyy-01-01 and I want them to just be yyyy. Is there a way to keep lubridate from adding on any information that wasn't already there?
Edit: Code to create data frame
id = (1:5)
eventDate = c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate)

I don't think it is possible to convert your values to dates and keep the "yyyy" values intact. By turning "yyyy" into "yyyy-01-01", lubridate is doing the right thing: dates are ordered, and if other values in the column have a day and a month defined, every value needs those components too.
For example, take the data frame below. If I ask R to sort the table by the dates column, does the value in the first row ("2020") come before or after the value in the second row ("2020-02-28")? "2020" could mean any day in that year, so how should R treat it? By filling in the first day of the year, lubridate defines the missing components so that R can order the values unambiguously.
dates <- c("2020", "2020-02-28", "2020-02-20", "2020-01-10", "2020-05-12")
id <- 1:5
df <- data.frame(
id,
dates
)
id dates
1 1 2020
2 2 2020-02-28
3 3 2020-02-20
4 4 2020-01-10
5 5 2020-05-12
So if you want to keep the "yyyy" values intact, they very likely should not sit in your eventDate column alongside values with a different structure ("mm/dd/yyyy"). If you really need to keep them as they are, I think it is best to leave the eventDate column as character and store the parsed dates in another column, like this:
df$as_dates <- lubridate::parse_date_time(df$eventDate, orders = c('mdy', 'Y'))
id eventDate as_dates
1 1 10/7/2001 2001-10-07
2 2 1989 1989-01-01
3 3 <NA> <NA>
4 4 5/5/2016 2016-05-05
5 5 9/18/2011 2011-09-18
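If you also want to remember which rows were year-only, one option is to record the precision next to the parsed date. The sketch below uses base R only; the `precision` and `label` column names are made up for illustration, and it assumes the full dates are always month/day/year.

```r
id <- 1:5
eventDate <- c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate, stringsAsFactors = FALSE)

# Flag entries that are a bare four-digit year (NA counts as "not year-only").
year_only <- !is.na(df$eventDate) & grepl("^[0-9]{4}$", df$eventDate)

# Parse the full dates; year-only strings do not match the format and become NA.
df$as_dates <- as.Date(df$eventDate, format = "%m/%d/%Y")

# Pin year-only entries to January 1 so the column can still be sorted.
df$as_dates[year_only] <- as.Date(paste0(df$eventDate[year_only], "-01-01"))

# Hypothetical helper columns: how precise the original value was,
# and a display label that shows no more than was originally given.
df$precision <- ifelse(is.na(df$eventDate), NA_character_,
                       ifelse(year_only, "year", "day"))
df$label <- ifelse(year_only,
                   format(df$as_dates, "%Y"),
                   format(df$as_dates, "%Y-%m-%d"))
```

Sorting and date arithmetic then use `as_dates`, while reports can print `label`.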

Related

Character 2 digit year conversion to year only

Using R
Got large clinical health data set to play with, but dates are weird
Most problematic is 2digityear/halfyear, as in 98/2, meaning at some point in 1998 after July 1
I have split the column up into 2 character columns, e.g. 98 and 2 but now need to convert the 2 digit year character string into an actual year.
I tried as.Date(data$variable,format="%Y") but not only did I get a conversion to 0098 as the year rather than 1998, I also got today's month and day arbitrarily added (the actual data has no month or day).
as in 0098-06-11
How do I get just 1998 instead?
Not elegant, but a combination of lubridate and as.Date gets you there.
library(lubridate)
data <- data.frame(variable = c(95, 96, 97,98,99), date=c(1,2,3,4,5))
data$variableUpdated <- year(as.Date(as.character(data$variable), format="%y"))
Or with base R only (this returns the year as a character string):
data$variableUpdated <- format(as.Date(as.character(data$variable), format="%y"),"%Y")
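Since the data has no month or day, the year can also be expanded with plain arithmetic, without going through a Date at all. This is only a sketch: the cutoff of 69 is an assumption that mirrors how `%y` is documented to behave (00 to 68 become 20xx, 69 to 99 become 19xx), and the `pivot` name is made up here. Pick whatever cutoff fits your data.

```r
# Two-digit years as they might come out of the split character column.
yy_chr <- c("95", "96", "97", "98", "99", "03")
yy <- as.integer(yy_chr)

# Assumed cutoff: values at or above the pivot are 19xx, below it 20xx.
pivot <- 69L
year4 <- ifelse(yy >= pivot, 1900L + yy, 2000L + yy)

year4
```

The result is an integer vector, so it can be compared and sorted directly.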

Filtering Data based on another dataframe based on two rows

I have two Datasets.
The first dataset includes Companies, the Quarter and the corresponding value from the whole timespan.
Quarter Date Company value
2012.1 2012-12-28 x 1
2013.1 2013-01-02 y 2
2013.1 2013-01-03 z 3
Companies again are in the dataset over the whole time and show up multiple times.
The other dataset is an index which includes a company identifier and the quarter in which it existed in the index (Companies can be in the index in multiple quarters).
Quarter Date Company value
2012.1 2012-12-28 x 1
2014.1 2013-01-02 y 2
2013.1 2013-01-03 x 3
Now I need to only select the companies which are in the index at the same time (quarter) as I have data from the first dataset.
In the example above I would need the data from company x in both quarters, but company y needs to get kicked out because the data is available in the wrong quarter.
I tried multiple functions including filter, subset and match but never got the desired result. It always filters either too much or too little.
data %>% filter(Company == index$Company & Quarter == index$Quarter)
or
data[Company == index$Company & Quarter = index$Quarter,]
Something with my conditions doesn't seem right. Any help is appreciated!
Have a look at dplyr's join functions. Here inner_join might help you: it keeps only the rows whose Company and Quarter appear in both data frames.
dplyr::inner_join(df1, df2, by=c("Company", "Quarter"))
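If you only want to filter the first dataset and not bring the index's other columns along, restrict the index to its key columns before joining. Below is a base R sketch with made-up data shaped like the question's tables (Quarter is stored as character here so the keys compare exactly). With dplyr loaded, `semi_join(data, index, by = c("Company", "Quarter"))` expresses the same filter.

```r
# Hypothetical data in the shape described in the question.
data <- data.frame(
  Quarter = c("2012.1", "2013.1", "2013.1", "2013.1"),
  Company = c("x", "x", "y", "z"),
  value   = c(1, 4, 2, 3),
  stringsAsFactors = FALSE
)
index <- data.frame(
  Quarter = c("2012.1", "2014.1", "2013.1"),
  Company = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Keep only the key columns of the index, de-duplicated, so the merge
# acts as a filter and cannot multiply or widen the rows of `data`.
keys <- unique(index[c("Company", "Quarter")])
kept <- merge(data, keys, by = c("Company", "Quarter"))

kept
```

Company y is dropped because it is in the index only for 2014.1, and z is dropped because it is not in the index at all.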

Subset dataframe in r for a specific month and date

I have a dataframe that looks like this:
V1 V2 V3 Month_nr Date
1 2 3 1 2017-01-01
3 5 6 1 2017-01-02
6 8 9 2 2017-02-01
6 8 9 8 2017-08-01
and I want to take all variables from the data set that have Month=1 (January) and date from 2017-01-01 til 2017-01-31 (so end of January), which means that I want to take the dates as well. I would create a column with days but I have multiple observations for one day and this would be even more confusing. I tried it with this:
df<- filter(df,df$Month_nr == 1, df$Date > 2017-01-01 && df$Date < 2017-01-31)
but it did not work. I would appreciate so much your help! I am desperate at this point. My dataset has measurements for an entire year (from 1 to 12) and hence I filter for months.
The problem is that you didn't put quotation marks around 2017-01-01. Writing 2017-01-01 directly computes the subtraction (2017 - 1 - 1 = 2015), so you end up comparing the Date column to a number. You can compare string to string: with strings, "2" is still greater than "1", so comparing dates as strings works as long as they are all in yyyy-mm-dd format. BTW, you don't need to write df$ when using filter; in the tidyverse you can write the column names directly, without quoting.
Why do you need the month as well as the dates in the filter? Filtering on the dates alone would work fine. However, you will have to convert the Date column into a Date object first. You can do that as follows:
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")
df_new <- subset(df, Date >= as.Date("2017-01-01") & Date <= as.Date("2017-01-31"))
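To see why the call in the question failed, it helps to evaluate the pieces separately. The sketch below rebuilds the sample data in base R and shows two separate issues: the unquoted date is arithmetic, and `&&` is not the element-wise operator (`&` is). With dplyr, the equivalent would be `filter(df, Date >= as.Date("2017-01-01"), Date <= as.Date("2017-01-31"))`.

```r
df <- data.frame(
  V1 = c(1, 3, 6, 6),
  V2 = c(2, 5, 8, 8),
  V3 = c(3, 6, 9, 9),
  Month_nr = c(1, 1, 2, 8),
  Date = c("2017-01-01", "2017-01-02", "2017-02-01", "2017-08-01"),
  stringsAsFactors = FALSE
)

# Unquoted, this is plain subtraction, not a date.
not_a_date <- 2017-01-01   # evaluates to 2015

# Convert once, then compare Date to Date.
df$Date <- as.Date(df$Date)
from <- as.Date("2017-01-01")
to   <- as.Date("2017-01-31")

# `&` combines the two conditions row by row; >= and <= keep both endpoints.
in_january <- df$Date >= from & df$Date <= to
jan <- df[in_january, ]

jan
```

Note that the question's `>` and `<` would also have excluded January 1 and January 31 themselves.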

Calculate Running Difference in Dates as New Dataframe Column

I've searched for several days and am still stumped.
Given a dataset defined by the following:
ids = c("a","b","c")
dates = c(as.Date("2015-01-01"), as.Date("2015-02-01"), as.Date("2015-02-15"))
test = data.frame(ids, dates)
I am trying to dynamically add new columns to the data frame whose values will be the difference between the column date (2015-03-01) and the value in the date column. I would expect the result would look like the following, but with a better column name:
d20150301 = c(59, 28, 14)
result = data.frame(ids, dates, d20150301)
Many thanks in advance.
You can subtract a vector of dates from a single date, so
test$d2015_03_01 <- as.Date('2015-03-01')-test$dates
makes test look like
> test
ids dates d2015_03_01
1 a 2015-01-01 59 days
2 b 2015-02-01 28 days
3 c 2015-02-15 14 days
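To add such columns dynamically for several reference dates, the column name can be built from each date with format(). A minimal sketch, assuming a hypothetical `targets` vector of reference dates; as.numeric() drops the "days" unit so the values match the plain numbers in the question.

```r
ids <- c("a", "b", "c")
dates <- as.Date(c("2015-01-01", "2015-02-01", "2015-02-15"))
test <- data.frame(ids, dates)

# Hypothetical reference dates; one new column is added per entry.
targets <- as.Date(c("2015-03-01", "2015-04-01"))

# Loop over positions rather than values, because for() strips the Date class.
for (i in seq_along(targets)) {
  col <- paste0("d", format(targets[i], "%Y%m%d"))
  # Date minus Date is a difference in days; as.numeric() keeps just the number.
  test[[col]] <- as.numeric(targets[i] - test$dates)
}

test
```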

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8])
and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
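library(dplyr)
# Note: .Machine$double.eps (about 2.2e-16) is far smaller than the spacing
# between doubles near a Date's numeric value, so `end - eps` equals `end`.
# between() is inclusive on both sides, so the sums below include the end date.
# For an end-exclusive sum, use daily$date >= start & daily$date < end instead.
eps <- .Machine$double.eps
intervals %>%
  rowwise() %>%
  mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps)]))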
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
If you find the dplyr solution a bit slow (mainly due to rowwise), you might want to use data.table for speed:
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
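The question asks for end-exclusive sums, and with 11 million intervals it may be worth avoiding per-row work altogether. One way is a running total: the sum over [start, end) is the difference of two cumulative sums. This is only a sketch in base R, and it assumes `daily` is sorted by date, has one row per day, and contains every start and end date that appears in `intervals`.

```r
set.seed(1)
date.range <- seq(as.Date("2008-03-01"), as.Date("2008-03-31"), by = "day")
daily <- data.frame(date = date.range, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end   = daily$date[c(6, 9, 15, 24, 31)])

# cs[k + 1] holds the sum of the first k daily values (cs[1] is 0).
cs <- c(0, cumsum(daily$value))

# Row positions of each interval's start and end date in `daily`.
i_start <- match(intervals$start, daily$date)
i_end   <- match(intervals$end, daily$date)

# Sum of rows i_start .. i_end - 1, i.e. the end date itself is excluded.
intervals$nhdd <- cs[i_end] - cs[i_start]

intervals
```

Everything here is vectorised, so the cost is one cumsum over `daily` plus two lookups per interval.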
