Select only rows where the last date is present

Let's say that I have the following data.
df = data.frame(name = c("A", "A", "A", "B", "B", "B", "B"),
                date = c("2011-01-01", "2011-03-01", "2011-05-01",
                         "2011-01-01", "2011-05-01", "2011-06-01",
                         "2011-07-01"))
df
I know the last date in the data set and only want to pick those names where data is available for the last date. So in the above example, the last date is only available for name B. Thus, I want to select only the rows for name B.
I can do simple hacks like this to get the desired result.
last_date = "2011-07-01"
#unique(df$name[df$date %in% last_date])
df[df$name %in% unique(df$name[df$date %in% last_date]),]
However, I was wondering if there was a dplyr/tidyverse or data.table solution for this task.

There are multiple ways you can do this. With dplyr, we can filter only those groups which contain last_date:
library(dplyr)
df %>%
  group_by(name) %>%
  filter(last_date %in% date)
# name date
# <fct> <fct>
#1 B 2011-01-01
#2 B 2011-05-01
#3 B 2011-06-01
#4 B 2011-07-01
Or similarly in base R (ave() returns a vector of the same class as its first argument, so the logical result of FUN is coerced, which is why the result is compared against TRUE):
df[ave(df$date, df$name, FUN = function(x) last_date %in% x) == TRUE,]
Also, we can get all the name where you find last_date and filter those names from the original dataframe.
df[with(df, name %in% name[date %in% last_date]), ]
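And since the question also asks about data.table, a minimal sketch of the same group-level filter (assuming df and last_date as defined above):
library(data.table)
setDT(df)[, if (last_date %in% date) .SD, by = name]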

How to subtract using max(date) and second latest (month) date

I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] - var1[my_dates == max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] - var1[my_dates == max(my_dates) - lag(max(my_dates), n = 1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
  mutate(value_1month = var1[my_dates == max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
  mutate(month1 = max(my_dates) %m-% months(1))
#This gives me the second to latest value
df2 <- df1 %>%
  mutate(var1_1month = var1[my_dates == max(my_dates) %m-% months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
  mutate(diff_1month = var1[my_dates == max(my_dates)] - var1[my_dates == max(my_dates) %m-% months(1)])
mutate requires the output to be the same length as the number of rows of the original data. When we do the subsetting, the length is different, so we may need ifelse or case_when:
library(dplyr)
library(lubridate)
df1 %>%
  mutate(diff_1month = case_when(my_dates == max(my_dates) ~
                                   my_dates %m-% months(1)))
NOTE: Without a reproducible example, it is not clear about the column types and values
Based on the OP's update, we may do an arrange first, grab the last two 'var1' values, and get the difference:
df1 %>%
  arrange(my_dates) %>%
  mutate(dif_1month = diff(tail(var1, 2)))
  country   my_dates var1 dif_1month
1 XYZ 2021-08-01 3 -1
2 XYZ 2021-09-01 2 -1
3 XYZ 2021-10-01 1 -1
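The same difference can also be written with positional indexing instead of diff/tail; a minimal sketch, assuming the data has at least two rows:
library(dplyr)
df1 %>%
  arrange(my_dates) %>%
  mutate(dif_1month = var1[n()] - var1[n() - 1]) # latest minus second latest, recycled to every row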

Keeping IDs conditional on repeating variable

I have data with an ID column and numeric columns y and x (the answers below construct an example df).
Is there a way I can very efficiently (without much R code) retain only 'ID' cases where all instances of 'x' are equal to zero? For example, in this case only ID number 3 should be retained in my data set.
Edit: this issue is closed - there are multiple strong answers below.
Using the data.table package, I was able to quickly pull this together:
library(data.table)
df <- data.table(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
df <- df[, .(ident = all(x == 0), y, x), by = ID][ident == TRUE] # flag each ID where all x are zero, keep flagged groups
df[, ident := NULL] # get rid of the redundant identifier column
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
subset(df, !ID %in% subset(df, x!=0)$ID)
That is, first find the IDs where x is not zero (subset(df, x != 0)$ID), and then exclude cases with those IDs (!ID %in% subset(df, x != 0)$ID).
Try this: first get all IDs for which any row has a non-zero value, then use that to subset:
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
exclude <- subset(df, x!=0)$ID
new_df <- subset(df, ! ID %in% exclude)
A base R option using ave, where we select the ID if all values (x) for the ID are 0.
df[ave(df$x == 0, df$ID, FUN = all), ]
# ID y x
#7 3 9 0
#8 3 5 0
#9 3 5 0
An equivalent dplyr solution would be
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(all(x == 0)) %>%
  ungroup()
# A tibble: 3 x 3
# ID y x
# <dbl> <dbl> <dbl>
#1 3. 9. 0.
#2 3. 5. 0.
#3 3. 5. 0.
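A data.table one-liner in the same spirit (a sketch, assuming the df defined above):
library(data.table)
setDT(df)[, if (all(x == 0)) .SD, by = ID]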

dplyr lag of different group

I am trying to use dplyr to mutate both a column containing the same-group lag of a variable as well as the lag of (one of) the other group(s).
Edit: Sorry, in the first edition, I messed up the order a bit by rearranging by date at the last second.
This is what my desired result would look like (see the other_lag column added at the end of the example):
Here is a minimal code example:
library(tidyverse)
set.seed(2)
df <-
  data.frame(
    x = sample(seq(as.Date('2000/01/01'), as.Date('2015/01/01'), by = "day"), 10),
    group = sample(c("A", "B"), 10, replace = T),
    value = sample(1:10, size = 10)
  ) %>% arrange(x)
df <- df %>%
  group_by(group) %>%
  mutate(own_lag = lag(value))
df %>% data.frame(other_lag = c(NA,1,2,7,7,9,10,10,8,6))
Thank you very much!
A solution with data.table:
library(data.table)
# to create own lag:
setDT(df)[, own_lag:=c(NA, head(value, -1)), by=group]
# to create other group lag (the function actually works outside of data.table too, in base R; see N.B. below):
df[, other_lag := sapply(1:.N,
                         function(ind) {
                           gp_cur <- group[ind]
                           if (any(group[1:ind] != gp_cur)) tail(value[1:ind][group[1:ind] != gp_cur], 1) else NA
                         })]
df
# x group value own_lag other_lag
#1: 2001-12-08 B 1 NA NA
#2: 2002-07-09 A 2 NA 1
#3: 2002-10-10 B 7 1 2
#4: 2007-01-04 A 5 2 7
#5: 2008-03-27 A 9 5 7
#6: 2008-08-06 B 10 7 9
#7: 2010-07-15 A 4 9 10
#8: 2012-06-27 A 8 4 10
#9: 2014-02-21 B 6 10 8
#10: 2014-02-24 A 3 8 6
Explanation of the other_lag determination: for each observation, look at its group value; if any earlier observation belongs to a different group, take the last such value, otherwise put NA.
N.B.: other_lag can be created without needing data.table:
df$other_lag <- with(df, sapply(1:nrow(df),
                                function(ind) {
                                  gp_cur <- group[ind]
                                  if (any(group[1:ind] != gp_cur)) tail(value[1:ind][group[1:ind] != gp_cur], 1) else NA
                                }))
Another data.table approach similar to #Cath's:
library(data.table)
DT = data.table(df)
DT[, vlag := shift(value), by=group]
DT[, volag := .SD[.(chartr("AB", "BA", group), x - 1), on=.(group, x), roll=TRUE, x.value]]
This assumes that A and B are the only groups. If there are more...
DT[, volag := DT[!.BY, on=.(group)][.(.SD$x - 1), on=.(x), roll=TRUE, x.value], by=group]
How it works:
:= creates a new column
DT[, col := ..., by=] does each assignment separately per by= group, essentially as a loop.
The grouping values for the current iteration of the loop are in the named list .BY.
The subset of data used by the current iteration of the loop is the data.table .SD.
x[!i, on=] is an anti-join, looking up rows of i in x and returning x with the matched rows dropped.
x[i, on=, roll=TRUE, x.v] ...
looks up each row of i in x using the on= condition
when no exact on= match is found, it "rolls" to the nearest previous value of the final on= column
it returns v from the x table
For more details and intuition, review the startup messages shown when you type library(data.table).
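To build intuition for the rolling lookup, here is a tiny self-contained sketch (with hypothetical toy data, not from the question) of what roll = TRUE does:
library(data.table)
prices <- data.table(day = as.IDate(c("2021-01-01", "2021-01-10")), p = c(100, 110))
query <- data.table(day = as.IDate("2021-01-05"))
prices[query, on = .(day), roll = TRUE, x.p] # no exact match for Jan 5, so the join rolls back to Jan 1 and returns 100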
I am not entirely sure whether I got your question correctly, but if "own" and "other" refer to groups A and B, then this might do the trick. I strongly assume there are more elegant ways to do this:
df.x <- df %>%
  dplyr::group_by(group) %>%
  mutate(value.lag = lag(value)) %>%
  mutate(index = seq_along(group)) %>%
  arrange(group)
df.a <- df.x %>%
  filter(group == "A") %>%
  rename(value.lag.a = value.lag)
df.b <- df.x %>%
  filter(group == "B") %>%
  rename(value.lag.b = value.lag)
df.a.b <- left_join(df.a, df.b[, c("index", "value.lag.b")], by = c("index"))
df.b.a <- left_join(df.b, df.a[, c("index", "value.lag.a")], by = c("index"))
df.x <- bind_rows(df.a.b, df.b.a)
Try this (pipe-only approach):
library(zoo)
df %>%
  mutate(groupLag = lag(group),
         dupLag = group == groupLag) %>%
  group_by(dupLag) %>%
  mutate(valueLagHelp = lag(value)) %>%
  ungroup() %>%
  mutate(helper = ifelse(dupLag == T, NA, valueLagHelp)) %>%
  mutate(helper = case_when(is.na(helper) ~ na.locf(helper, na.rm = F),
                            TRUE ~ helper)) %>%
  mutate(valAfterLag = lag(dupLag)) %>%
  mutate(otherLag = ifelse(is.na(lag(valueLagHelp)), lag(value), helper)) %>%
  mutate(otherLag = ifelse((valAfterLag | is.na(valAfterLag)) & !dupLag,
                           lag(value), otherLag)) %>%
  select(c(x, group, value, own_lag, otherLag)) # own_lag comes from the question's code above
Sorry for the mess.
What it does is that it first creates a group lag and a helper variable for the case when the group is equal to its lag (i.e. when two "A"s are subsequent). Then it groups by this helper variable and assigns the correct value to all rows where dupLag == F. Now we need to take care of the ones with dupLag == T.
So, ungroup. We need a new lagged-value helper that assigns an NA to all dupLag == T rows, because they are not correctly assigned yet.
Next, we assign to all NAs in our helper the last non-NA value.
This is not all, because we still need to take care of some dupLag == F data points (you see that when you look at the complete tibble). First, we basically just fix the second data point with the first mutate(otherLag = ...) operation. The next operation finalizes everything, and then we select the variables which we'd like to have in the end.

Replicate each row of data.frame when occurrence

I am facing a tricky question and would be glad to have some help.
I have a data frame with an ID name taking different structures, something like the following:
ID
bbb-5p/mi-98/6134
abb-4p
bbb-5p/mi-98
Every time I have this "/" I would like to duplicate the row; each row should be duplicated the number of times the "/" occurs. The name of each duplicated row should be the root plus the characters right after the "/".
For example, this:
ID
bbb-5p/mi-98/6134
should give:
ID
bbb-5p
bbb-5p-mi-98
bbb-5p-6134
Also, my initial data frame has 5 variables:
[ID, varA, varB, varC, varD]
And every time I have this "/" I would like to duplicate the entire row. Then I am expecting to have a new data frame with something like
newID newvarA newvarB newvarC newvarD
bbb-5p varA(1) varB(1) varC(1) varD(1)
bbb-5p-mi-98 varA(1) varB(1) varC(1) varD(1)
bbb-5p-6134 varA(1) varB(1) varC(1) varD(1)
abb-4p varA(2) varB(2) varC(2) varD(2)
bbb-5p varA(3) varB(3) varC(3) varD(3)
bbb-5p-mi-98 varA(3) varB(3) varC(3) varD(3)
Any idea?
Thank you in advance
Peter
You can accomplish this in base R, using lapply() with a custom function. First, you split your character column on "/", resulting in a list of vectors:
l <- strsplit(df$ID,"/")
Then you apply a user defined function to each element of l using lapply():
l_stacked <- lapply(l, function(x)
  if (length(x) > 1) {
    c(x[1], paste0(x[1], "-", x[-1]))
  } else {
    x
  })
The function first checks whether the vector has length > 1. If so, it keeps the first element (the root) and pastes it onto each of the remaining elements, separated by "-". If the length is <= 1, the string didn't contain "/", so it is returned as is. Finally, we flatten the output using unlist() to be able to convert it to a data.frame.
data.frame(ID = unlist(l_stacked))
# ID
#1 bbb-5p
#2 bbb-5p-mi-98
#3 bbb-5p-6134
#4 abb-4p
#5 bbb-5p
#6 bbb-5p-mi-98
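The question also asks to carry varA through varD along; a small sketch (assuming df contains those columns, as in the question) repeats each original row once per split piece before overwriting ID:
n_rep <- lengths(l_stacked) # number of pieces each ID was split into
out <- df[rep(seq_len(nrow(df)), n_rep), , drop = FALSE] # duplicate the whole rows
out$ID <- unlist(l_stacked) # replace ID with the expanded names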
One way to achieve this is the following:
library(dplyr)
library(tidyr)
res <- df %>% mutate(i = row_number(),
                     ID = strsplit(ID, split = '/')) %>%
  unnest() %>%
  group_by(i) %>%
  mutate(ID = ifelse(ID == first(ID), first(ID), paste(first(ID), ID, sep = '-'))) %>%
  ungroup() %>% select(-i)
### A tibble: 6 x 1
## ID
## <chr>
##1 bbb-5p
##2 bbb-5p-mi-98
##3 bbb-5p-6134
##4 abb-4p
##5 bbb-5p
##6 bbb-5p-mi-98
Notes:
First, create an indexing column i to group by later so that we can group each "root".
Use strsplit to split each row by "/".
tidyr::unnest the result to separate rows.
group_by the created index i and then if the row is the first row, just return the root; otherwise, paste to prepend the root to the row with separator "-".
Finally, ungroup and remove the created index column i.
Data
df <- structure(list(ID = c("bbb-5p/mi-98/6134", "abb-4p", "bbb-5p/mi-98")),
                .Names = "ID", row.names = c(NA, -3L), class = "data.frame")
ID
1 bbb-5p/mi-98/6134
2 abb-4p
3 bbb-5p/mi-98
Here is one option using data.table. Convert the 'data.frame' to a 'data.table' while keeping the rownames as a column (setDT(df, keep.rownames = TRUE)); grouped by 'rn', split the 'ID' by "/", then loop through the sequence of rows and paste the split elements based on the index.
library(data.table)
setDT(df, keep.rownames = TRUE)[, unlist(strsplit(ID, "/")),
  by = rn][, .(ID = sapply(seq_len(.N), function(i)
    paste(V1[unique(c(1, i))], collapse = "-"))), rn]
Or an option with dplyr/tidyr/tibble: create the rownames column with tibble::rownames_to_column, expand the rows into long format with separate_rows, then, grouped by 'rn', mutate the 'ID' by pasting the elements based on a length condition, and finally remove the 'rn' column.
library(dplyr)
library(tidyr)
library(tibble)
rownames_to_column(df, var = "rn") %>%
  separate_rows(ID, sep = "/") %>%
  group_by(rn) %>%
  mutate(ID = if (n() > 1) c(ID[1], paste(ID[1], ID[-1], sep = "-")) else ID) %>%
  ungroup() %>%
  select(-rn)
# ID
# <chr>
#1 bbb-5p
#2 bbb-5p-mi-98
#3 bbb-5p-6134
#4 abb-4p
#5 bbb-5p
#6 bbb-5p-mi-98
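On recent tidyr (>= 1.3.0), separate_longer_delim() can replace the separate_rows() step; a sketch of the same pipeline under that assumption:
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number()) %>%
  separate_longer_delim(ID, delim = "/") %>%
  group_by(rn) %>%
  mutate(ID = if (n() > 1) c(ID[1], paste(ID[1], ID[-1], sep = "-")) else ID) %>%
  ungroup() %>%
  select(-rn)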

Earliest Date for each id in R

I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per each id showing his earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and finally subset the original dataset based on the one containing the minimums, using the new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids appear more than once with different e_date. Plus, the code gives me different results when I use the as.Date format instead of its original format for the date (integer). I think the answer is simple but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data_full)); then, after ordering by 'e_date' and grouping by 'id', we get the 1st row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
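Or, without sorting the whole table, a sketch using which.min per group (one row per id, the first in case of ties):
library(data.table)
setDT(data_full)[, .SD[which.min(e_date)], by = id]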
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
  group_by(id) %>%
  arrange(e_date) %>%
  slice(1L)
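On dplyr >= 1.0.0, slice_min() expresses the same idea directly; a minimal sketch:
library(dplyr)
data_full %>%
  group_by(id) %>%
  slice_min(e_date, n = 1, with_ties = FALSE) %>%
  ungroup()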
If we need a base R option, ave can be used: compare each date against its group minimum to get a logical index.
data_full[with(data_full, e_date == ave(e_date, id, FUN = min)), ]
Another answer that uses dplyr's filter command:
data_full %>%
  group_by(id) %>%
  filter(e_date == min(e_date))
You may use library(sqldf) to get the minimum date as follows:
data1 <- data.frame(id = c("789", "123", "456", "123", "123", "456", "789"),
                    e_date = c("2016-05-01", "2016-07-02", "2016-08-25", "2015-12-11",
                               "2014-03-01", "2015-07-08", "2015-12-11"))
library(sqldf)
data2 <- sqldf("SELECT id,
                min(e_date) as 'earliest_date'
                FROM data1 GROUP BY 1", method = "name__class")
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
  group_by(which_quarter) %>%
  summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
which_quarter sort(rand_weeks)[1]
<dbl> <time>
1 1 2017-01-05 05:46:32
2 2 2017-04-06 05:46:32
3 3 2016-08-18 05:46:32
4 4 2016-10-06 05:46:32
