I'd like to know if it is possible to achieve the following using dplyr, or some tidyverse package...
Context: I am having trouble getting my data into a structure that will allow the use of geom_rect. See this SO question for the motivation.
library(tis)
library(dplyr)   # for tibble() and the pipe used below

# Prepare NBER recession start/end dates.
recessions <- data.frame(
  start = as.Date(as.character(nberDates()[, "Start"]), "%Y%m%d"),
  end   = as.Date(as.character(nberDates()[, "End"]), "%Y%m%d")
)
dt <- tibble(date = as.Date(c("1983-01-01", "1990-10-15", "1993-01-01")))
Desired output:
date start end
1983-01-01 NA NA
1990-10-15 1990-08-01 1991-03-31
1993-01-01 NA NA
Appreciate any suggestions.
Note: Previous questions indicate that sqldf is one approach to take. However, the data here involves dates, and my understanding is that date is not a data type in SQLite.
In the spirit of 'write the code you wish you had':
df <- dt %>%
left_join(x=., y=recessions, date >= start & date <= end)
"Date" class objects in R are stored internally as the number of days since the epoch (January 1, 1970). That number is what is sent to SQLite, so the ordering is maintained even though the class is not; therefore, we can do this using the SQLite back end:
sqldf("select * from dt left join recessions on date between start and end")
giving:
date start end
1 1983-01-01 <NA> <NA>
2 1990-10-15 1990-08-01 1991-03-31
3 1993-01-01 <NA> <NA>
Also note that sqldf works with several other back ends that do fully support dates so you are not restricted to SQLite. Suggest you review the FAQ and Examples at https://github.com/ggrothendieck/sqldf .
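To see this storage convention directly, a quick sketch:

```r
# A Date is just a number of days since 1970-01-01, with a class attribute
as.numeric(as.Date("1970-01-02"))  # 1: one day after the epoch
as.numeric(as.Date("1983-01-01"))  # 4748 days since the epoch
```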
The following uses only dplyr and produces the desired data frame result.
Note: On larger datasets you will likely run into memory issues, and the sqldf approach proposed by G. Grothendieck will scale better.
Hat-tip:
@nick-criswell for directing me to @ian-gow for this partial solution
# Build data frame of dates that fall within a recession interval [start, end]
df1 <- dt %>%
  mutate(dummy = TRUE) %>%
  left_join(recessions %>% mutate(dummy = TRUE), by = "dummy") %>%
  filter(date >= start & date <= end) %>%
  select(-dummy)
# Build data frame of all dates with start = NA and end = NA
df2 <- dt %>%
  mutate(dummy = TRUE) %>%
  left_join(recessions %>% mutate(dummy = TRUE), by = "dummy") %>%
  mutate(start = NA, end = NA) %>%
  unique() %>%
  select(-dummy)
# Now merge the two, overwriting the NA values with start and end dates
df <- df2 %>%
  left_join(df1, by = "date") %>%
  mutate(start = ifelse(is.na(start.y), as.character(start.x), as.character(start.y)),
         end   = ifelse(is.na(end.y), as.character(end.x), as.character(end.y))) %>%
  mutate(start = as.Date(start), end = as.Date(end)) %>%
  select(date, start, end)
> df
# A tibble: 3 x 3
date start end
<date> <date> <date>
1 1983-01-01 NA NA
2 1990-10-15 1990-08-01 1991-03-31
3 1993-01-01 NA NA
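As a postscript: on dplyr >= 1.1.0 the 'code you wish you had' essentially exists as a non-equi join via join_by(). A minimal sketch, using a hard-coded stand-in for the recessions table (the real one comes from nberDates()):

```r
library(dplyr)  # >= 1.1.0 for join_by()

recessions <- tibble(start = as.Date("1990-08-01"), end = as.Date("1991-03-31"))
dt <- tibble(date = as.Date(c("1983-01-01", "1990-10-15", "1993-01-01")))

# Keep every row of dt; attach the interval only where date falls inside it
df <- left_join(dt, recessions, by = join_by(between(date, start, end)))
# dates outside every recession get NA for start/end
```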
Related
I have a large database with a date column that contains three kinds of values: date numbers coming from Excel, incomplete dates that are missing the year (the year is in another column), and empty cells. I found out how to change the format of the dates, but the problem is how to filter the three types of cells in the date variable. I managed to do it by filtering on a created column (value) that I DON'T have in the real database.
This is my original database:
This is my required end result:
What I managed to do was to filter the dataset with the fictitious value column and convert the date to the required format. This is what I did:
library(dplyr)
data_a <- read.csv(text = "
year,date,value
2018,43238,1
2017,43267,2
2020,7/25,3
2018,,4
2013,,5
2000,8/23,6
2000,9/21,7")
data_b <- data_a %>%
filter(value %in% c(1,2)) %>%
mutate(data_formatted = as.Date(as.numeric(date), origin = "1899-12-30"))
data_c <- data_a %>%
filter(value %in% c(3, 6, 7)) %>%
mutate(data_formatted = as.Date(paste0(year, "/", date)))
data_d <- data_a %>%
filter(value %in% c(4, 5)) %>%
mutate(data_formatted = NA)
data_final <- rbind(data_b, data_c, data_d)
I need to do the same all at once WITHOUT using the value column.
You can use a conditional (case_when) to handle the scenarios and apply a different conversion function for each date format.
Code
library(dplyr)
library(stringr)
library(lubridate)
data_a %>%
mutate(
data_formatted = case_when(
!str_detect(date,"/") ~ as.Date(as.numeric(date), origin = "1899-12-30"),
TRUE ~ ymd(paste0(year, "/", date))
)
)
Output
year date value data_formatted
1 2018 43238 1 2018-05-18
2 2017 43267 2 2018-06-16
3 2020 7/25 3 2020-07-25
4 2018 4 <NA>
5 2013 5 <NA>
6 2000 8/23 6 2000-08-23
7 2000 9/21 7 2000-09-21
Please try
data_a2 <- data_a %>%
  mutate(# Excel serial numbers need the Excel origin, 1899-12-30
         date2  = as.numeric(as.Date(as.numeric(ifelse(str_detect(date, '/'), NA, date)),
                                     origin = "1899-12-30")),
         # "m/d" fragments get the year prepended, then parsed
         date2_ = as.numeric(as.Date(ifelse(str_detect(date, '/'), paste0(year, '/', date), NA),
                                     format = '%Y/%m/%d')),
         date_formatted = as.Date(coalesce(date2, date2_), origin = "1970-01-01")) %>%
  dplyr::select(-date2, -date2_)
I am an aspiring data scientist, and this will be my first ever question on StackOF.
I have this code to help wrangle my data, but my date filter is static, and I would prefer not to have to go in and change the hardcoded value every year. What is the best alternative to make my date filter dynamic? The date column is also difficult to work with because it is not a
"date", it is a "dbl".
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
Tried so far:
df %>%
filter(DATE >= 20191231)
# load packages (lubridate for dates)
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
This looks like this:
DATE
1 20191230
2 20191231
3 20200122
# and now...
df %>% # take the dataframe
mutate(DATE = ymd(DATE)) %>% # turn the DATE column actually into a date
filter(DATE >= floor_date(Sys.Date(), "year") - days(1))
...and then filter rows where DATE is on or after one day before the first day of this year (floor_date(Sys.Date(), "year") - days(1))
DATE
1 2019-12-31
2 2020-01-22
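The two helpers, in isolation, behave like this (a quick sketch):

```r
library(lubridate)

ymd(20191231)                                        # parses the number into a Date: 2019-12-31
floor_date(as.Date("2020-01-22"), "year")            # first day of that year: 2020-01-01
floor_date(as.Date("2020-01-22"), "year") - days(1)  # one day earlier: 2019-12-31
```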
Good morning all, this is my first time posting on stack overflow. Thank you for any help!
I have 2 dataframes that I am using to analyze stock data. One data frame has dates among other information, we can call it df:
df1 <- tibble(Key = c('a','b','c'), i =11:13, date= ymd(20110101:20110103))
The second dataframe also has dates and other important information.
df2 <-tibble(Answer = c('a','d','e','b','f','c'), j =14:19, date= ymd(20150304:20150309))
Here is what I want to do. For each row in df1, I need to:
-Find the date in df2 that is closest to that row's date in df1, among the rows where df2$Answer is the same as df1$Key.
-Then extract another column's value from that row in df2 and put it in a new column in df1.
The code I tried:
df1 %>%
  group_by(Key, i) %>%
  mutate(`New Column` = df2$j[which.min(subset(df2$date, df2$Answer == Key) - date)])
This has the result:
Key i date `New Column`
1 a 11 2011-01-01 14
2 b 12 2011-01-02 14
3 c 13 2011-01-03 14
This is correct for the first row, a. In df2, the closest date is 2015-03-04, for which the value of j is in fact 14.
However, for the second row, Key=b, I want df2 to subset to only look at dates for rows where df2$Answer = b. Therefore, the date should be 2015-03-07, for which j =17.
Thank you for your help!
Jesse
This should work:
library(dplyr)
df1 %>%
left_join(df2, by = c("Key" = "Answer")) %>%
mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
group_by(Key) %>%
arrange(date_diff) %>%
slice(1) %>%
ungroup()
We are first joining the two data frames with left_join. Yes, I'm aware there are possibly multiple dates for each Key, bear with me.
Next, we calculate (with mutate) the absolute value (abs) of the difference between the two dates date.x and date.y.
Now that we have this, we will group the data by Key using group_by. This will make sure that each distinct Key will be treated separately in subsequent calculations.
Since we've calculated the date_diff, we can now re-order (arrange) the data for each Key, with the smallest date_diff as first for each Key.
Finally, we are only interested in that first, smallest date_diff for each Key, so we can discard the rest using slice(1).
This pipeline gives us the following:
Key i date.x j date.y date_diff
<chr> <int> <date> <int> <date> <time>
1 a 11 2011-01-01 14 2015-03-04 131587200
2 b 12 2011-01-02 17 2015-03-07 131760000
3 c 13 2011-01-03 19 2015-03-09 131846400
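On dplyr >= 1.0.0 the arrange + slice(1) step can also be written with slice_min(), which keeps only the row with the smallest date_diff per group. A sketch using the question's df1/df2:

```r
library(dplyr)
library(lubridate)

df1 <- tibble(Key = c('a','b','c'), i = 11:13, date = ymd(20110101:20110103))
df2 <- tibble(Answer = c('a','d','e','b','f','c'), j = 14:19, date = ymd(20150304:20150309))

result <- df1 %>%
  left_join(df2, by = c("Key" = "Answer")) %>%
  mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
  group_by(Key) %>%
  slice_min(date_diff, n = 1, with_ties = FALSE) %>%  # keep the closest match per Key
  ungroup()
# result$j is 14, 17, 19 for Keys a, b, c
```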
I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per each id showing his earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and, last, subset the original dataset based on the one containing the minimums, using the new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids still appear more than once with different e_date values. Plus, the code gives me different results when I use the as.Date format instead of the date's original format (integer). I think the answer is simple, but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data_full)), order by 'e_date', and then, grouped by 'id', take the first row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
group_by(id) %>%
arrange(e_date) %>%
slice(1L)
If we need a base R option, ave can be used, comparing each e_date to its group minimum:
data_full[with(data_full, e_date == ave(e_date, id, FUN = min)), ]
Another answer that uses dplyr's filter command:
dta %>%
group_by(id) %>%
filter(date == min(date))
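Equivalently, on dplyr >= 1.0.0, slice_min() gets the earliest row per id in one step; a sketch on made-up data:

```r
library(dplyr)

dta <- tibble(id = c(1, 1, 2, 2),
              e_date = as.Date(c("2016-05-01", "2014-03-01", "2015-07-08", "2016-08-25")))

earliest <- dta %>%
  group_by(id) %>%
  slice_min(e_date, n = 1, with_ties = FALSE) %>%  # one row per id, smallest e_date
  ungroup()
# id 1 -> 2014-03-01, id 2 -> 2015-07-08
```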
You may use library(sqldf) to get the minimum date as follows:
data1<-data.frame(id=c("789","123","456","123","123","456","789"),
e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))
library(sqldf)
data2 = sqldf("SELECT id,
min(e_date) as 'earliest_date'
FROM data1 GROUP BY 1", method = "name__class")
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
which_quarter sort(rand_weeks)[1]
<dbl> <time>
1 1 2017-01-05 05:46:32
2 2 2017-04-06 05:46:32
3 3 2016-08-18 05:46:32
4 4 2016-10-06 05:46:32
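A small note on the summarise step: min() says the same thing as sort(...)[1] and lets you name the result column; a sketch with a seed for reproducibility:

```r
library(lubridate)
library(dplyr)

set.seed(42)
rand_weeks <- now() + weeks(sample(100))
df <- data.frame(rand_weeks, which_quarter = quarter(rand_weeks))

res <- df %>%
  group_by(which_quarter) %>%
  summarise(earliest = min(rand_weeks))  # earliest timestamp in each quarter
```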
I know this has been asked before, but the answers I have found seem to rely on POSIXct, whereas I don't see why I can't do this with Date.
I have data like
Person Event VisitDate
1 RFA 2004-06-04
1 EMR 2016-06-03
1 Nil 2016-06-05
I want to get the difference between the dates in a separate column (eventually to average the difference of the dates over all Person ids).
Expected output:
Person Event VisitDate Date Difference in days
1 RFA 2004-06-04
1 EMR 2016-06-03 4383
1 Nil 2016-06-05 2
So far I have used:
EndoSubsetOnSurveil %>%
arrange(Person, as.Date(EndoSubsetOnSurveil$VisitDate, '%d/%m/%y')) %>%
difftime(VisitDate[1:(length(VisitDate)-1)] , VisitDate[2:length(VisitDate)])
but I get the error
Error in as.POSIXct.default(time1, tz = tz) :
do not know how to convert 'time1' to class “POSIXct”
Explanation:
(i) The format supplied to as.Date should be %Y-%m-%d. (ii) Your variable should be converted with as.Date if you want it to be recognized as a date; in your code it is only used to arrange the database and is not recognized as a date later. (iii) Using lag makes this easier.
Code:
I think that the last chunk output is what you want in comparison with the second chunk.
# SAMPLE DATA -------------------------------------------------------------
EndoSubsetOnSurveil <-
data.frame(Person = c(1,1,2,2),
VisitDate = c("2004-06-04", "2016-06-03", "2016-07-01",
"2016-08-01"))
EndoSubsetOnSurveil$VisitDate <-
as.Date(EndoSubsetOnSurveil$VisitDate, '%Y-%m-%d')
# DIFFERENCE BETWEEN VISIT WITHOUT GROUPING -------------------------------
library(dplyr)
EndoSubsetOnSurveil %>% arrange(Person, VisitDate) %>%
mutate(diffDate = difftime(VisitDate, lag(VisitDate,1)))
# DIFFERENCE BETWEEN VISIT BY PATIENT -------------------------------------
EndoSubsetOnSurveil %>% arrange(Person, VisitDate) %>% group_by(Person) %>%
mutate(diffDate = difftime(VisitDate, lag(VisitDate,1))) %>% ungroup()