Earliest date for each id in R

I have a dataset where each individual (id) has an e_date, and since each individual can have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like a dataset with one row per id showing that id's earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and finally subset the original dataset against the one containing the minimums using that new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids appear more than once with different e_date values. Plus, the code gives me different results when I use the as.Date format instead of the original format of the date (integer). I think the answer is simple but I'm stuck on this one.

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data_full)), order by 'e_date' and, grouped by 'id', get the first row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
group_by(id) %>%
arrange(e_date) %>%
slice(1L)
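If we need a base R option, ave can be used. Because ave returns a vector of the same type as its first argument, the 0/1 result has to be turned back into a logical index before subsetting
data_full[with(data_full, ave(as.numeric(e_date), id, FUN = function(x) rank(x, ties.method = "first") == 1)) == 1, ]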
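Another base R route is to sort and then drop repeated ids. This is only a minimal sketch: the sample values and the extra 'value' column are made up for illustration, and 'e_date' is assumed to be of Date class.

```r
# Hypothetical sample data; 'value' is an illustrative extra column
data_full <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  e_date = as.Date(c("2016-05-01", "2015-12-11", "2016-08-25",
                     "2015-07-08", "2014-03-01")),
  value  = c(10, 20, 30, 40, 50)
)

# Sort by id, then by date, so the earliest date comes first within each id
sorted <- data_full[order(data_full$id, data_full$e_date), ]

# Keep only the first occurrence of each id, i.e. its earliest e_date
earliest <- sorted[!duplicated(sorted$id), ]
earliest
```

Since duplicated flags every occurrence after the first, exactly one row per 'id' survives, even when the earliest date is repeated.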

Another answer that uses dplyr's filter command:
dta %>%
group_by(id) %>%
filter(date == min(date))
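Note that a filter on the group minimum keeps every row that ties for the earliest date. The same behaviour can be sketched in base R; the data below are invented for illustration, with id 1 having two rows on its earliest date.

```r
# Hypothetical data: id 1 has two rows sharing its earliest date
dta <- data.frame(
  id   = c(1, 1, 1, 2),
  date = as.Date(c("2016-01-01", "2016-01-01", "2016-03-01", "2015-05-05"))
)

# Compare each date (as a number) with the minimum of its own id group
num    <- as.numeric(dta$date)
is_min <- num == ave(num, dta$id, FUN = min)

# Both tied rows for id 1 are kept, plus the single row for id 2
dta_min <- dta[is_min, ]
dta_min
```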

You may use library(sqldf) to get the minimum date as follows:
data1<-data.frame(id=c("789","123","456","123","123","456","789"),
e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))
library(sqldf)
data2 = sqldf("SELECT id,
min(e_date) as 'earliest_date'
FROM data1 GROUP BY 1", method = "name__class")
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11

I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
which_quarter sort(rand_weeks)[1]
<dbl> <time>
1 1 2017-01-05 05:46:32
2 2 2017-04-06 05:46:32
3 3 2016-08-18 05:46:32
4 4 2016-10-06 05:46:32

Related

How to subtract using max(date) and second latest (month) date

I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] -var1[my_dates==max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates)-lag(max(my_dates), n=1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
mutate(value_1month = var1[my_dates==max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
mutate(month1 = max(my_dates) %m-%months(1))
#This gives me the second to latest value
df2 <- df1 %>%
mutate(var1_1month = var1[my_dates==max(my_dates) %m-%months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
mutate(diff_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-%months(1)])
mutate requires the output to have either length 1 or the same length as the number of rows of the original data. When we do the subsetting, the length can be different. We may need ifelse or case_when
library(dplyr)
library(lubridate)
df1 %>%
mutate(diff_1month = case_when(my_dates==max(my_dates) ~
my_dates %m-% months(1)))
NOTE: Without a reproducible example, it is not clear about the column types and values
Based on the OP's update, we may do an arrange first, grab the last two 'var1' values and get the difference
df1 %>%
arrange(my_dates) %>%
mutate(dif_1month = diff(tail(var1, 2)))
. my_dates var1 dif_1month
1 XYZ 2021-08-01 3 -1
2 XYZ 2021-09-01 2 -1
3 XYZ 2021-10-01 1 -1
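For a gap of more than one month (e.g. 3 months prior), the same idea can be sketched in base R. The helper name diff_months is hypothetical, and the sketch assumes one row per month with dates on the first of the month, as in the example data.

```r
# Same values as the example data, built directly as a data.frame
df1 <- data.frame(
  country  = "XYZ",
  my_dates = as.Date(c("2021-10-01", "2021-09-01", "2021-08-01")),
  var1     = c(1, 2, 3)
)

# Hypothetical helper: latest value minus the value k months earlier
diff_months <- function(df, k) {
  latest <- max(df$my_dates)
  # seq() on Dates accepts a negative month step, e.g. "-1 month"
  prior <- seq(latest, by = paste0("-", k, " month"), length.out = 2)[2]
  df$var1[df$my_dates == latest] - df$var1[df$my_dates == prior]
}

d1 <- diff_months(df1, 1)  # latest minus previous month
d2 <- diff_months(df1, 2)  # latest minus two months prior
```

If no row exists for the earlier month, the subset is empty and the result has length 0, so that case would need separate handling.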

R -- Always grab the last day of the previous year in R

I am an aspiring data scientist, and this will be my first ever question on Stack Overflow.
I have this line of code to help wrangle my data. My date filter is static. I would prefer not to have to go in and change this hardcoded value every year. What is the best alternative for my date filter to make it more dynamic? The date column is also difficult to work with because it is not a "date", it is a "dbl".
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
Tried so far:
df %>%
filter(DATE >= 20191231)
# load packages (lubridate for dates)
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
This looks like this:
DATE
1 20191230
2 20191231
3 20200122
# and now...
df %>% # take the dataframe
mutate(DATE = ymd(DATE)) %>% # turn the DATE column actually into a date
filter(DATE >= floor_date(Sys.Date(), "year") - days(1))
...and filter rows where DATE is on or after the day before the first day of the current year (floor_date(Sys.Date(), "year") - days(1)). Run in 2020, this gives:
DATE
1 2019-12-31
2 2020-01-22
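The same cutoff can be sketched without lubridate. The helper name last_day_prev_year is hypothetical; it takes the reference date as an argument (defaulting to Sys.Date()) so the result can be checked for a fixed year.

```r
# Hypothetical helper: last day of the year before 'today'
last_day_prev_year <- function(today = Sys.Date()) {
  # 1 January of the current year, minus one day
  as.Date(paste0(format(today, "%Y"), "-01-01")) - 1
}

df <- data.frame(DATE = c(20191230, 20191231, 20200122))

# Turn the numeric yyyymmdd values into real dates
df$DATE <- as.Date(as.character(df$DATE), format = "%Y%m%d")

# Fixed reference date in 2020 so the example is reproducible
cutoff <- last_day_prev_year(as.Date("2020-06-15"))
kept   <- df[df$DATE >= cutoff, , drop = FALSE]
kept
```

In real use the argument would be left at its default so that the cutoff moves forward each year.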

Using mutate and summarize to find elements in a vector

I'm trying to replace VBA code with R code. Currently in VBA I use sumif in a range to find the total value of an ID depending on some dates. In R I'm using mutate and summarize but there's always an error. I don't know how to fix it.
If I want to find the total value for ID=1 within the last 2 days:
#Sys.Date() = 2016-01-06
df
DATES ID VALUE
2016/01/01 1 10
2016/01/02 2 15
2016/01/05 1 13
the result must be:
ID Value
1 13
Currently, the code is:
df%>%
group_by(ID) %>%
mutate(Total_op = if (Sys.Date()-as.Date(Dates,format="%YYYY-%mm-%dd")>=1) Value else 0)))%>%
summarize(SumTotal = sum(Total_op))%>%
collect
But the error showed is:
Error: Column 'sumTotal' must be length X (the group size) or one, not Y
With lubridate we can convert the DATES string to a datetime object and filter accordingly:
library(lubridate)
library(tidyverse)
Dat <- ymd("2016-01-06") #Set a date. Can be done by Sys.Date()
df %>%
mutate_at("DATES",ymd) %>% #convert to datetime
filter(DATES %within% interval(Dat-2,Dat)) %>% #filter entries in the last 2 days
group_by(ID) %>% #group by ID
summarise(SumTotal = sum(VALUE)) #summarise value as Sum
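A base R sketch of the same filter-then-sum logic follows, using the data from the question and a fixed reference date. Note that as.Date expects a format such as "%Y/%m/%d" for these strings, not "%YYYY-%mm-%dd".

```r
df <- data.frame(
  DATES = c("2016/01/01", "2016/01/02", "2016/01/05"),
  ID    = c(1, 2, 1),
  VALUE = c(10, 15, 13)
)

ref <- as.Date("2016-01-06")  # stands in for Sys.Date()

# Parse the slash-separated strings into Date values
df$DATES <- as.Date(df$DATES, format = "%Y/%m/%d")

# Keep rows from the last 2 days, then sum VALUE per ID
recent <- df[df$DATES >= ref - 2 & df$DATES <= ref, ]
totals <- aggregate(VALUE ~ ID, data = recent, FUN = sum)
totals
```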

Select only rows where the last date is present

Let's say that I have the following data.
df = data.frame(name = c("A","A","A","B","B","B","B"),
date = c("2011-01-01","2011-03-01","2011-05-01",
"2011-01-01","2011-05-01","2011-06-01",
"2011-07-01"))
df
I know the last date in the data set and only want to pick those names where data is available for the last date. So in the above example, the last date is only available for name B. Thus, I want to select only the rows for name B.
I can do simple hacks like this to get the desired result.
last_date = "2011-07-01"
#unique(df$name[df$date %in% last_date])
df[df$name %in% unique(df$name[df$date %in% last_date]),]
However, I was wondering if there was a dplyr/tidyverse or data.table solution for this task.
There are multiple ways you can do this, with dplyr we can filter only those groups which have the last_date
library(dplyr)
df %>%
group_by(name) %>%
filter(last_date %in% date)
# name date
# <fct> <fct>
#1 B 2011-01-01
#2 B 2011-05-01
#3 B 2011-06-01
#4 B 2011-07-01
Or similarly in base R :
df[ave(df$date == last_date, df$name, FUN = any), ]
Also, we can get all the name where you find last_date and filter those names from the original dataframe.
df[with(df, name %in% name[date %in% last_date]), ]
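Since a data.table solution was also asked for, here is a minimal sketch. It assumes the data.table package is installed and that 'date' is stored as character; as.data.table is used so the original data frame is left untouched.

```r
library(data.table)

df <- data.frame(
  name = c("A", "A", "A", "B", "B", "B", "B"),
  date = c("2011-01-01", "2011-03-01", "2011-05-01",
           "2011-01-01", "2011-05-01", "2011-06-01", "2011-07-01"),
  stringsAsFactors = FALSE
)
last_date <- "2011-07-01"

dt <- as.data.table(df)

# Return a group's rows (.SD) only if the group contains last_date;
# groups where the condition is FALSE return NULL and are dropped
res <- dt[, if (last_date %in% date) .SD, by = name]
res
```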

Sum rows by date range, for a given identifier

I looked at many posts with similar, but I believe less complex, questions and just can't seem to work out an answer for this.
I have >1,000,000 lines of data, for example in this form:
date<-c("9/30/2012","10/31/2012","11/30/2012","12/31/2012","1/31/2013","2/28/2013","3/31/2013","10/31/2012","11/30/2012","12/31/2012","1/31/2013","2/28/2013","3/31/2013")
name<-c("a","a","a","a","a","a","a","b","b","b","b","b","b")
amount<-c(100,200,300,400,500,600,700,800,900,800,700,600,500)
data<-data.frame(name,date,amount)
View(data)
What I need is, for entries of the same name, sum the amount for dates that are in jan-mar, apr-jun, jul-sep, oct-dec in the same year.
This is my ideal output:
date2<-c("9/30/2012","12/31/2012","3/31/2013","12/31/2012","3/31/2013")
name2<-c("a","a","a","b","b")
amount2<-c(100,900,1800,2500,1800)
data2<-data.frame(name2,date2,amount2)
View(data2)
Will appreciate any input at all, to lead me towards the correct direction.
Thank you very much!
1. Using dplyr/zoo
We can convert the 'date' column from 'character' to 'Date' class, then get the sum of 'amount' and the last value of 'date', grouped by the columns 'name' and 'Qtr' (obtained by converting 'date' to a year-quarter with as.yearqtr).
library(dplyr)
library(zoo)
data %>%
mutate(date=as.Date(date, format='%m/%d/%Y')) %>%
group_by(name, Qtr=as.character(as.yearqtr(date))) %>%
summarise(amount= sum(amount), date=last(date))
# name Qtr amount date
#1 a 2012 Q3 100 2012-09-30
#2 a 2012 Q4 900 2012-12-31
#3 a 2013 Q1 1800 2013-03-31
#4 b 2012 Q4 2500 2012-12-31
#5 b 2013 Q1 1800 2013-03-31
NOTE: Also added @docendo discimus's suggestion to use last and to change the class of the 'date' column. The 'Qtr' column is 'character' because the as.yearqtr class is unsupported by dplyr (judging from the errors). The 'Qtr' column was not in the expected dataset 'data2', so I guess it doesn't matter whether it is 'character' or 'yearqtr'. If we don't change the 'date' column to 'Date' class, and instead do the conversion in the group_by step, this gives the same result as 'data2'. The extra 'Qtr' column can be deleted.
2. Without using zoo
data %>%
mutate(date1 = as.Date(date, format = '%m/%d/%Y')) %>%
group_by(name, Qtr= sprintf('%s %s', format(date1, '%Y'),
quarters(date1))) %>%
summarise(amount = sum(amount), date=last(date)) %>%
ungroup() %>%
select(-Qtr) %>%
as.data.frame()
# name amount date
#1 a 100 9/30/2012
#2 a 900 12/31/2012
#3 a 1800 3/31/2013
#4 b 2500 12/31/2012
#5 b 1800 3/31/2013
NOTE2: Added a solution without using as.yearqtr, kept the same format for 'date' as in the expected output 'data2'
Here are a few approaches:
1) aggregate & zoo
library(zoo)
aggregate(amount ~ name + yearqtr,
transform(data, yearqtr = as.yearqtr(date, "%m/%d/%Y")),
sum)
2) data.table & zoo
library(data.table)
library(zoo)
dt <- data.table(data, key = "name,date")
dt[, date := as.yearqtr(date, "%m/%d/%Y")][, list(sum = sum(amount)), by = "name,date"]
Note that both these solutions convert the date to a real "yearqtr" object and not just to a character string. I haven't benchmarked these but typically data.table is very fast. You could create the data.table from data by reference using setDT for even greater performance, but you might prefer to keep them separate as well, so we left them separate here.
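If neither zoo nor data.table is available, the quarter grouping can also be sketched in base R alone. Here the quarter label is a plain character string rather than a "yearqtr" object, and the column name 'qtr' is chosen only for illustration.

```r
date <- c("9/30/2012", "10/31/2012", "11/30/2012", "12/31/2012",
          "1/31/2013", "2/28/2013", "3/31/2013", "10/31/2012",
          "11/30/2012", "12/31/2012", "1/31/2013", "2/28/2013", "3/31/2013")
name <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b")
amount <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 800, 700, 600, 500)
data <- data.frame(name, date, amount)

# Build a "YYYY Qn" label from the parsed date
d <- as.Date(data$date, format = "%m/%d/%Y")
data$qtr <- paste(format(d, "%Y"), quarters(d))

# Sum 'amount' for each name and year-quarter
res <- aggregate(amount ~ name + qtr, data = data, FUN = sum)
res
```

The totals match the expected output in the question; the end-of-quarter date could be added afterwards if it is needed.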
