I have two data frames, as shown below:
DF_1
ID Date1
1 12/01/2017
2 15/02/2017
3 18/03/2017
DF_2
ID Date1
1 05/01/2017
1 15/01/2017
1 18/01/2017
2 10/02/2017
2 13/02/2017
2 15/02/2017
3 22/03/2017
For each ID, I want to calculate the difference between the date in DF_1 and the most recent earlier date for that ID in DF_2 (i.e. the latest DF_2 date that does not fall after the DF_1 date).
For example: for ID=1 the DF_1 date is 12-01-2017, and the most recent earlier date in DF_2 is 05-01-2017, because the 15th and 18th are both later than the DF_1 date.
Required Output:
ID Date1 Count
1 12/01/2017 7
2 15/02/2017 0
3 18/03/2017 -4
The following reproduces your expected output:
library(tidyverse);
df1 <- read.table(text =
"ID Date1
1 12/01/2017
2 15/02/2017
3 18/03/2017", header = T) %>%
mutate(Date1 = as.Date(Date1, format = "%d/%m/%Y"));
df2 <- read.table(text =
"ID Date1
1 05/01/2017
1 15/01/2017
1 18/01/2017
2 10/02/2017
2 13/02/2017
2 15/02/2017
3 22/03/2017", header = T) %>%
mutate(Date1 = as.Date(Date1, format = "%d/%m/%Y"));
left_join(df1, df2, by = "ID") %>%
mutate(Count = Date1.x - Date1.y) %>%
group_by(ID) %>%
slice(ifelse(
all(Count < 0),
which.min(abs(Count)),
which.min(replace(Count, Count < 0, NA)))) %>%
select(ID, Date1.x, Count)
## A tibble: 3 x 3
## Groups: ID [3]
# ID Date1.x Count
# <int> <date> <time>
#1 1 2017-01-12 7
#2 2 2017-02-15 0
#3 3 2017-03-18 -4
Explanation: calculate the time difference between df1$Date1 and df2$Date1, group the rows by ID, and keep only the row with the smallest non-negative time difference; if all time differences are negative, keep the one with the smallest absolute time difference.
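For comparison, here is a rough base-R sketch of the same selection rule (an illustration only, assuming df1 and df2 have already been built with parsed Date columns as above):
# Merge on ID, compute the signed day difference, then keep one row per ID:
# the smallest non-negative difference, or the closest date if all are negative.
m <- merge(df1, df2, by = "ID", suffixes = c(".x", ".y"))
m$Count <- as.numeric(m$Date1.x - m$Date1.y)
pick <- function(d) {
  if (all(d$Count < 0)) return(d[which.min(abs(d$Count)), ])
  d_pos <- d[d$Count >= 0, ]
  d_pos[which.min(d_pos$Count), ]
}
do.call(rbind, lapply(split(m, m$ID), pick))[, c("ID", "Date1.x", "Count")]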
I think your last row is wrong, since for ID=3 the DF_2 date is not before the DF_1 date. Assuming that reading is correct, you can do this...
df3 <- df2 %>% rename(Date2=Date1) %>%
left_join(df1) %>%
mutate(datediff=as.Date(Date1,format="%d/%m/%Y")-as.Date(Date2,format="%d/%m/%Y")) %>%
filter(datediff>=0) %>%
group_by(ID) %>%
summarise(Date1=first(Date1),Count=min(datediff))
df3
ID Date1 Count
1 1 12/01/2017 7
2 2 15/02/2017 0
I was working on the following problem. I have monthly data from a survey; let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
1 1200 3
2 31000 5
So the first row was reported in March, but there is no way to know whether it reports March or February values, and it may also be an approximation of the real value. I also have a table with the actual values for each ID; let's call it df2:
df2 = tibble( ID = c('1', '2') %>% rep(4) %>% sort,
real_value = c(1200,1230,11000,10,25000,3100,100,31030),
month = c(1,2,3,4,2,3,4,5))
ID real_value month
1 1200 1
1 1230 2
1 11000 3
1 10 4
2 25000 2
2 3100 3
2 100 4
2 31030 5
So there are two challenges: first, for each ID I only care about the anchor month OR the month immediately before it; second, I want to match to the closest value (sounds like a fuzzy join). So my first step was to filter the second table so it only has the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month- 1)))
df2 = df2 %>%
inner_join(filter_aux , by=c('ID', 'month' = 'anchor_month')) %>% distinct(ID, real_value, month)
Reducing df2 to:
ID real_value month
1 1230 2
1 11000 3
2 100 4
2 31030 5
Now I tried to do a difference_inner_join by ID and reported_value = real_value (df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value'))), but it throws a "non-numeric argument to binary operator" error, I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1 1200 3 1230 2
2 31000 5 31030 5
Thanks!
It was easier without fuzzy_join:
df3 = df1 %>% left_join(df2 , by='ID') %>%
mutate(dif = abs(real_value - reported_value)) %>%
group_by(ID) %>% filter(dif == min(dif))
Output:
ID reported_value anchor_month real_value month dif
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1200 3 1230 2 30
2 2 31000 5 31030 5 30
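On the error itself: as far as I can tell, difference_inner_join() tries to compute a numeric distance for every column pair listed in by, so including the character ID is what triggers "non-numeric argument to binary operator". If you do want a fuzzyjoin-based version, one possible sketch (untested against the real data) is fuzzy_inner_join(), which accepts one match function per column pair; the tolerance of 50 below is an arbitrary choice for this sample, not something from the question:
library(fuzzyjoin)
library(dplyr)
# Exact match on ID, numeric tolerance on the value columns
# (the 50-unit tolerance is a made-up threshold for illustration):
df1 %>%
  fuzzy_inner_join(df2,
                   by = c('ID' = 'ID', 'reported_value' = 'real_value'),
                   match_fun = list(`==`, function(x, y) abs(x - y) <= 50))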
I would ultimately like df2 to contain certain dates along with the cumulative sum of the df1 values that fall within each date range.
df1 = data.frame("date"=c("10/01/2020","10/02/2020","10/03/2020","10/04/2020","10/05/2020",
"10/06/2020","10/07/2020","10/08/2020","10/09/2020","10/10/2020"),
"value"=c(1:10))
df1
> df1
date value
1 10/01/2020 1
2 10/02/2020 2
3 10/03/2020 3
4 10/04/2020 4
5 10/05/2020 5
6 10/06/2020 6
7 10/07/2020 7
8 10/08/2020 8
9 10/09/2020 9
10 10/10/2020 10
df2 = data.frame("date"=c("10/05/2020","10/10/2020"))
df2
> df2
date
1 10/05/2020
2 10/10/2020
I realize this is incorrect, but I am not sure how to define df2$value as the sums of certain df1$value rows:
df2$value = filter(df1, c(sum(1:5),sum(6:10)))
df2
I would like the output to look like this:
> df2
date value
1 10/05/2020 15
2 10/10/2020 40
Here is another approach using dplyr and lubridate:
library(lubridate)
library(dplyr)
library(tidyr)  # needed for fill()
df1 %>%
mutate(date = dmy(date)) %>%
mutate(date = if_else(date == "2020-05-10" |
date == "2020-10-10", date, NA_Date_)) %>%
fill(date, .direction = "up") %>%
group_by(date) %>%
summarise(value = sum(value))
date value
<date> <int>
1 2020-05-10 15
2 2020-10-10 40
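To see what the if_else()/fill() trick is doing, this is roughly the intermediate date column on its own (a quick check, not part of the answer):
df1 %>%
  mutate(date = dmy(date)) %>%
  mutate(date = if_else(date == "2020-05-10" | date == "2020-10-10", date, NA_Date_)) %>%
  fill(date, .direction = "up") %>%
  pull(date)
# [1] "2020-05-10" "2020-05-10" "2020-05-10" "2020-05-10" "2020-05-10"
# [6] "2020-10-10" "2020-10-10" "2020-10-10" "2020-10-10" "2020-10-10"
Every row is labelled with the next cutoff date, so grouping by date and summing value gives the desired ranges.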
We may use a non-equi join after converting the 'date' columns to Date class
library(lubridate)
library(data.table)
setDT(df1)[, date := mdy(date)]
setDT(df2)[, date := mdy(date)]
df2[, start_date := fcoalesce(shift(date) + days(1), floor_date(date, 'month'))]
df1[df2,.(value = sum(value)), on = .( date >= start_date,
date <= date), by = .EACHI][, -1, with = FALSE]
date value
<Date> <int>
1: 2020-10-05 15
2: 2020-10-10 40
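For reference, after the fcoalesce()/shift() step df2 should hold the two join ranges, roughly:
df2
#          date start_date
#        <Date>     <Date>
# 1: 2020-10-05 2020-10-01
# 2: 2020-10-10 2020-10-06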
Or another option is to create a group with findInterval and then do the grouped sum
library(dplyr)
df1 %>%
group_by(grp = findInterval(date, df2$date, left.open = TRUE)) %>%
summarise(date = last(date), value = sum(value)) %>%
select(-grp)
-output
# A tibble: 2 × 2
date value
<date> <int>
1 2020-10-05 15
2 2020-10-10 40
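To see how the grouping works, findInterval() assigns each df1 date to the interval that ends at the next df2 cutoff (again assuming the dates were converted with mdy() as above):
findInterval(df1$date, df2$date, left.open = TRUE)
# [1] 0 0 0 0 0 1 1 1 1 1
# rows up to 2020-10-05 fall in group 0, the remaining rows in group 1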
I have a dataframe with two start and stop dates that looks like this:
ID G1_START G1_END G2_START G2_END LOCATION
1 1/1/2021 5/31/2021 2/1/2021 5/31/2021 A
2 12/1/2020 3/31/2021 10/1/2020 5/31/2021 B
What I would like to do is create one row per month per patient for the months where the G1 and G2 date ranges overlap. For example:
ID MONTH ACTIVE LOCATION
1 2/1/2021 1 A
1 3/1/2021 1 A
1 4/1/2021 1 A
1 5/1/2021 1 A
2 12/1/2020 1 B
2 1/1/2021 1 B
2 2/1/2021 1 B
2 3/1/2021 1 B
Where active means the ID was on both G1 and G2 during these months.
Here is a method in tidyverse
Reshape data from wide to long format - pivot_longer
Convert the date columns 'START', 'END' to Date class (mdy)
Loop over the 'START', 'END' with map2, get the sequence by '1 month'
Floor the date by month - floor_date
Grouped by 'ID', 'LOCATION', 'MONTH', filter the groups where the number of distinct 'Categ' elements is 2
Create 'ACTIVE' column of 1 after returning the distinct rows
library(dplyr)
library(tidyr)
library(lubridate)
library(purrr)
pivot_longer(df1, cols = contains("_"),
names_to = c("Categ", ".value"), names_sep= "_") %>%
transmute(ID, LOCATION, Categ, MONTH = map2(mdy(START), mdy(END), ~
floor_date(seq(.x, .y, by = '1 month'), 'month'))) %>%
unnest(MONTH) %>%
group_by(ID, LOCATION, MONTH) %>%
filter(n_distinct(Categ) == 2) %>%
ungroup %>%
distinct(ID, LOCATION, MONTH) %>%
mutate(ACTIVE = 1) %>%
select(ID, MONTH, ACTIVE, LOCATION)
-output
# A tibble: 8 x 4
ID MONTH ACTIVE LOCATION
<int> <date> <dbl> <chr>
1 1 2021-02-01 1 A
2 1 2021-03-01 1 A
3 1 2021-04-01 1 A
4 1 2021-05-01 1 A
5 2 2020-12-01 1 B
6 2 2021-01-01 1 B
7 2 2021-02-01 1 B
8 2 2021-03-01 1 B
data
df1 <- structure(list(ID = 1:2, G1_START = c("1/1/2021", "12/1/2020"
), G1_END = c("5/31/2021", "3/31/2021"), G2_START = c("2/1/2021",
"10/1/2020"), G2_END = c("5/31/2021", "5/31/2021"), LOCATION = c("A",
"B")), class = "data.frame", row.names = c(NA, -2L))
It may be a stupid question, but I cannot figure out how to filter df so as to keep only the rows whose id is present in all the levels of factor_A:
df = data.frame(id = c(1,1,1,2,2,3,3),
factor_A = c(1,2,3,1,2,1,3))
The desired df1 would keep only the rows containing id=1, since it is present in factor_A=1,2 and 3:
id factor_A
1 1 1
2 1 2
3 1 3
This should do it:
library(dplyr)
df = data.frame(id = c(1,1,1,2,2,3,3),
factor_A = c(1,2,3,1,2,1,3))
df %>% group_by(id) %>%
filter(length(unique(factor_A)) == length(unique(df$factor_A)))
I would suggest a dplyr approach. You can count the number of levels for each id and then filter. As your factor variable has 3 levels, you will keep the rows where Flag equals 3:
library(dplyr)
#Data
df = data.frame(id = c(1,1,1,2,2,3,3),
factor_A = c(1,2,3,1,2,1,3))
#Create flag
df %>% group_by(id) %>%
#Count levels
mutate(Flag=n_distinct(factor_A)) %>%
#Filter only rows with 3
filter(Flag==3) %>% select(-Flag)
Output:
# A tibble: 3 x 2
# Groups: id [1]
id factor_A
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
We can use base R
subset(df, id %in% names(which(!rowSums(!table(df) > 0))))
# id factor_A
#1 1 1
#2 1 2
#3 1 3
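To unpack the one-liner: table(df) cross-tabulates id against factor_A, a zero cell marks a missing level, and an id qualifies only when its row contains no zeros. Step by step (output shown approximately):
table(df)
#    factor_A
# id  1 2 3
#   1 1 1 1
#   2 1 1 0
#   3 1 0 1
rowSums(!table(df) > 0)                  # number of missing levels per id
# 1 2 3
# 0 1 1
names(which(!rowSums(!table(df) > 0)))   # ids present in every level
# [1] "1"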
I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a max of 3 different types of medications administered within 365 days of each other. I am not sure whether it is better to use days of life or the date to get to this expected outcome. Any help is appreciated.
An option would be to convert the 'date' to Date class, group by 'ID', get the absolute difference between 'date' and its lag, check whether it is greater than 365, create a grouping index with cumsum, and get the number of distinct 'meds' elements in summarise
library(dplyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
units = 'days')))) %>%
group_by(grp = cumsum(diffd > 365), add = TRUE) %>%
summarise(N = n_distinct(meds)) %>%
group_by(ID) %>%
summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1