Mutate based on two conditions in an R dataframe

I have an R dataframe which can be generated from the code below:
DF <- data.frame("Person_id" = c(1,1,1,1,2,2,2,2,3,3),
                 "Type" = c("IN","OUT","IN","ANC","IN","OUT","IN","ANC","EM","ANC"),
                 "Name" = c("Nara","Nara","Nara","Nara","Dora","Dora","Dora","Dora","Sara","Sara"),
                 "day_1" = c("21/1/2002","21/4/2002","21/6/2002","21/9/2002","28/1/2012","28/4/2012","28/6/2012","28/9/2012","30/06/2004","30/06/2005"),
                 "day_2" = c("23/1/2002","21/4/2002","","","30/1/2012","28/4/2012","","28/9/2012","",""))
What I would like to do is create two new columns, admit_start_date and admit_end_date, based on a few conditions which are given below.
Rule 1
admit_start_date = day_1
admit_end_date = day_2 (sometimes day_2 can be NA; in that case see Rule 2 below)
Rule 2
if day_2 is (null or blank or na) and Type is (Out or ANC or EM) then
admit_end_date = day_1
else (if Type is IN)
admit_end_date = day_1 + 5 (days)
This is what I am trying, but it doesn't seem to work:
transform_dates = function(DF){ # this function is to create 'date' columns
  DF %>%
    mutate(admit_start_date = day_1) %>%
    mutate(admit_end_date = day_2) %>%
    admit_end_date = if_else(((Type == 'Out' & admit_end_date.isna() == True |
                               Type == 'ANC' & admit_end_date.isna() == True |
                               Type == 'EM' & admit_end_date.isna() == True),
                              day_1, day_1 + 5)
    )
}
As you can see, I am not sure how to check for NA in a newly created column and replace those NAs with day_1 or day_1 + 5 days based on the Type column.
Can you please help?
I expect my output to look like the one shown below.

We can use case_when to specify each condition separately after converting "day" columns to actual date objects.
library(dplyr)
DF %>%
  mutate_at(vars(starts_with('day')), as.Date, "%d/%m/%Y") %>%
  mutate(admit_start_date = day_1,
         admit_end_date = case_when(
           !is.na(day_2) ~ day_2,
           is.na(day_2) & Type %in% c('OUT', 'ANC', 'EM') ~ day_1,
           Type == 'IN' ~ day_1 + 5))
# Person_id Type Name day_1 day_2 admit_start_date admit_end_date
#1 1 IN Nara 2002-01-21 2002-01-23 2002-01-21 2002-01-23
#2 1 OUT Nara 2002-04-21 2002-04-21 2002-04-21 2002-04-21
#3 1 IN Nara 2002-06-21 <NA> 2002-06-21 2002-06-26
#4 1 ANC Nara 2002-09-21 <NA> 2002-09-21 2002-09-21
#5 2 IN Dora 2012-01-28 2012-01-30 2012-01-28 2012-01-30
#6 2 OUT Dora 2012-04-28 2012-04-28 2012-04-28 2012-04-28
#7 2 IN Dora 2012-06-28 <NA> 2012-06-28 2012-07-03
#8 2 ANC Dora 2012-09-28 2012-09-28 2012-09-28 2012-09-28
#9 3 EM Sara 2004-06-30 <NA> 2004-06-30 2004-06-30
#10 3 ANC Sara 2005-06-30 <NA> 2005-06-30 2005-06-30
The "day" columns in the dataframe are not of class "Date" (check class(DF$day_1)). Using mutate_at we convert them to class "Date" so we can perform date arithmetic on them. starts_with('day') means that any column whose name starts with "day" is converted to the "Date" class; we use mutate_at when we want to apply the same function to multiple columns.
case_when is an alternative to nested ifelse statements. The conditions are evaluated in order: the first condition is checked and, if it is satisfied, the remaining ones are ignored; if not, the second condition is checked, and so on, so no explicit else is required. If none of the conditions are satisfied it returns NA. See ?case_when.
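As a side note (not from the original answer), in more recent dplyr (>= 1.0) the conversion step is usually written with across() instead of mutate_at(), and a TRUE ~ ... branch can act as an explicit catch-all in case_when(); a minimal sketch under those assumptions:
library(dplyr)
DF %>%
  mutate(across(starts_with("day"), ~ as.Date(.x, "%d/%m/%Y"))) %>%
  mutate(admit_start_date = day_1,
         admit_end_date = case_when(
           !is.na(day_2)                                  ~ day_2,
           is.na(day_2) & Type %in% c("OUT", "ANC", "EM") ~ day_1,
           Type == "IN"                                   ~ day_1 + 5,
           TRUE                                           ~ as.Date(NA)  # explicit fallback instead of an implicit NA
         ))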

Related

How to evaluate zodiac sign based on date of birth in R?

So I have a date of birth vector in a data.frame. I want to determine, based on this date, which zodiac sign the respondent is.
I've seen this solution:
Checking if Date is Between two Dates in R
But this approach would mean that I have to create 2 vectors for each of the 12 zodiac signs (a starting date and a finishing date) to check whether my date of birth falls between the two. Is there a more efficient way to do this?
So this is my data.frame:
data.frame(respondent = c(1,2,3,4,5), date_of_birth = seq(as.Date("2011-12-30"), as.Date("2012-04-30"), by="months") )
respondent date_of_birth
1 1 2011-12-30
2 2 2012-01-30
3 3 2012-03-01
4 4 2012-03-30
5 5 2012-04-30
and I want to get this:
respondent date_of_birth zodiac
1 1 2011-12-30 Capricorn
2 2 2012-01-30 Aquarius
3 3 2012-03-01 Pisces
4 4 2012-03-30 Aries
5 5 2012-04-30 Taurus
I think the *apply functions are just made for this kind of work. You could use lapply on your first data frame (more precisely, on its date_of_birth column) together with a data frame indexing the zodiac signs by date, to produce a vector zodiac whose length equals the number of rows of your data frame.
That would work, and with a fully populated zodiac database it should be pretty easy. What I mean by this is that you need a database with the date ranges for every year, because otherwise it is difficult to compare dates across New Year. Also, please make sure the conditions are correct; I don't know anything about zodiac signs.
library(fuzzyjoin)
birth.days <- data.frame(respondent = c(1,2,3,4,5),
                         date_of_birth = seq(as.Date("2011-12-30"), as.Date("2012-04-30"), by = "months"))
zodiacs <- data.frame(Zodiac = c("Capricorn"),
                      Start.Date = as.Date("2011-12-22"),
                      End.Date = as.Date("2012-01-20"))
fuzzy_left_join(birth.days, zodiacs,
                by = c("date_of_birth" = "Start.Date", "date_of_birth" = "End.Date"),
                match_fun = list(`>=`, `<`))
respondent date_of_birth Zodiac Start.Date End.Date
1 1 2011-12-30 Capricorn 2011-12-22 2012-01-20
2 2 2012-01-30 <NA> <NA> <NA>
3 3 2012-03-01 <NA> <NA> <NA>
4 4 2012-03-30 <NA> <NA> <NA>
5 5 2012-04-30 <NA> <NA> <NA>
Just as an example on how to populate a database with the dates:
Capricorn <- data.frame(Start.Date = seq.Date(from = as.Date("1900-12-22"), to = as.Date("2100-01-01"), by = "year"),
                        End.Date   = seq.Date(from = as.Date("1901-01-20"), to = as.Date("2100-01-20"), by = "year"),
                        Zodiac     = rep("Capricorn", 200))
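A year-agnostic alternative (a sketch, not from the answers above; the sign boundary dates below are approximate and worth double-checking) is to encode each birthday as month*100 + day and cut it against fixed boundaries, so no per-year lookup table is needed:
birth.days <- data.frame(respondent = c(1,2,3,4,5),
                         date_of_birth = seq(as.Date("2011-12-30"), as.Date("2012-04-30"), by = "months"))
# month*100 + day, e.g. 2012-03-30 becomes 330, so ranges can be compared within a single year
md <- as.integer(format(birth.days$date_of_birth, "%m")) * 100 +
      as.integer(format(birth.days$date_of_birth, "%d"))
# approximate sign boundaries; Capricorn wraps around New Year, hence the two Capricorn intervals
breaks <- c(0, 119, 218, 320, 419, 520, 620, 722, 822, 922, 1022, 1121, 1221, 1231)
labels <- c("Capricorn", "Aquarius", "Pisces", "Aries", "Taurus", "Gemini", "Cancer",
            "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn")
birth.days$zodiac <- cut(md, breaks = breaks, labels = labels)
birth.days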

Use apply on a dataframe to fill in missing values from another dataframe

First I want to say I am new to R. This problem is frustrating beyond belief; I have tried apply, lapply, and mapply, all with errors. I am lost.
What I want to do is take the time from "Results" and place it in the time column of "Records" IF "Records" does not have a time (i.e. where it is NA).
I have already done this with a traditional for-loop, but it makes the code hard to read. I have read that the apply functions can make this easier.
Data Frame "Results"
ID Time(sec)
1 1.7169811
2 1.9999999
3 2.3555445
4 3.4444444
Data Frame "Records"
ID Time(sec) Date
1 NA 1/1/2018
2 1.9999999 1/1/2018
3 NA 1/1/2018
4 3.1111111 1/1/2018
Data Frame 'New' Records
ID Time(sec) Date
1 1.7169811 1/1/2018
2 1.9999999 1/1/2018
3 2.3555445 1/1/2018
4 3.1111111 1/1/2018
No need to use apply in this situation. The pattern of conditionally choosing between two values based on a predicate is handled by ifelse():
ifelse(predicate, value_if_true, value_if_false)
In this case you said you also have to make sure the values are matched by ID between the two dataframes. The function that achieves this in R is appropriately named match():
match(values, table)
match() returns, for each element of values, the position of its first match in table, so that table_column[match(values, table)] is aligned with values.
Combining this together:
inds <- match(records$ID, results$ID)
records$time <- ifelse(is.na(records$time), results$time[inds], records$time)
is.na() here is a predicate that checks if the value is NA for every value in the vector.
Inspired by this answer.
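Put together with the example data, a runnable sketch looks like this (read.table turns the header "Time(sec)" into the syntactic column name Time.sec., which is assumed below):
results <- read.table(header = TRUE, text = "
ID Time(sec)
1 1.7169811
2 1.9999999
3 2.3555445
4 3.4444444")
records <- read.table(header = TRUE, text = "
ID Time(sec) Date
1 NA 1/1/2018
2 1.9999999 1/1/2018
3 NA 1/1/2018
4 3.1111111 1/1/2018")
inds <- match(records$ID, results$ID)                  # position of each record's ID in results
records$Time.sec. <- ifelse(is.na(records$Time.sec.),  # where the record has no time...
                            results$Time.sec.[inds],   # ...take the matched result's time
                            records$Time.sec.)         # otherwise keep the existing value
records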
From the help: "Given a set of vectors, coalesce() finds the first non-missing value at each position. This is inspired by the SQL COALESCE function which does the same thing for NULLs."
library(tidyverse)
txt1 <- "ID Time(sec)
1 1.7169811
2 1.9999999
3 2.3555445
4 3.4444444"
txt2 <- "ID Time(sec) Date
1 NA 1/1/2018
2 1.9999999 1/1/2018
3 NA 1/1/2018
4 3.1111111 1/1/2018"
df1 <- read.table(text = txt1, header = TRUE)
df2 <- read.table(text = txt2, header = TRUE)
df1 %>%
  left_join(df2, by = "ID") %>%
  mutate(Time.sec. = coalesce(Time.sec..x, Time.sec..y)) %>%
  select(-Time.sec..x, -Time.sec..y)
#> ID Date Time.sec.
#> 1 1 1/1/2018 1.716981
#> 2 2 1/1/2018 2.000000
#> 3 3 1/1/2018 2.355545
#> 4 4 1/1/2018 3.444444
Created on 2018-03-10 by the reprex package (v0.2.0).
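For what it's worth (not part of the original answers), newer dplyr versions (>= 1.0) also ship rows_patch(), which fills NA values in one data frame from matching rows of another; a sketch using the df1/df2 objects built above:
library(dplyr)
# returns df2 with the two missing Time.sec. values filled in from df1 (matched by ID);
# existing non-NA values in df2 are kept as they are
rows_patch(df2, df1, by = "ID")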

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1 %>%
  group_by(id) %>%
  arrange(id, test_date) %>%
  filter(n() >= 2) %>% # to only get info on students who have taken the exam more than once, and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
  melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous row
data1mc <- data1m %>%
  arrange(id, event_date) %>%
  group_by(id) %>%
  mutate(multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
  filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
  left_join(data1mc %>% select(id, event_date, multi_test),
            by = c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If a student re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm not sure whether students can re-test multiple times.
student <- data1 %>%
  group_by(id) %>%
  summarise(retest = (test_date[length(test_date)] < result_date[1]) == TRUE)
Some re-test values were NA. These were individuals that only took the test once. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9),
                    test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22","2013-09-16","2012-06-21","2013-10-18","2013-04-21","2012-02-16","2012-03-15"),
                    result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31","2013-05-17","2012-03-17","2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
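For reference, a grouped dplyr version of the same lag comparison (a sketch combining the ideas above, not one of the original answers):
library(dplyr)
data1 %>%
  mutate(test_date = as.Date(test_date),
         result_date = as.Date(result_date)) %>%
  arrange(id, test_date) %>%
  group_by(id) %>%
  mutate(re_test = as.integer(test_date < lag(result_date)),  # 1 = retook before the previous result was out
         re_test = coalesce(re_test, 0L)) %>%                 # first attempts get 0 instead of NA
  ungroup()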

Count number of rows meeting criteria in another table - R Programming

I have two tables, one with property listings and another one with contacts made for a property (i.e. if someone is interested in the property they will "contact" the owner).
Sample "listings" table below:
listings <- data.frame(id = c("6174", "2175", "9176", "4176", "9177"), city = c("A", "B", "B", "B" ,"A"), listing_date = c("01/03/2015", "14/03/2015", "30/03/2015", "07/04/2015", "18/04/2015"))
listings$listing_date <- as.Date(listings$listing_date, "%d/%m/%Y")
listings
# id city listing_date
#1 6174 A 01/03/2015
#2 2175 B 14/03/2015
#3 9176 B 30/03/2015
#4 4176 B 07/04/2015
#5 9177 A 18/04/2015
Sample "contacts" table below:
contacts <- data.frame(id = c("6174", "6174", "6174", "6174", "2175", "2175", "2175", "9176", "9176", "4176", "4176", "9177"),
                       contact_date = c("13/03/2015", "14/04/2015", "27/03/2015", "13/04/2015", "15/03/2015", "16/03/2015", "17/03/2015", "30/03/2015", "01/06/2015", "08/05/2015", "09/05/2015", "23/04/2015"))
contacts$contact_date <- as.Date(contacts$contact_date, "%d/%m/%Y")
contacts
# id contact_date
#1 6174 2015-03-13
#2 6174 2015-04-14
#3 6174 2015-03-27
#4 6174 2015-04-13
#5 2175 2015-03-15
#6 2175 2015-03-16
#7 2175 2015-03-17
#8 9176 2015-03-30
#9 9176 2015-06-01
#10 4176 2015-05-08
#11 4176 2015-05-09
#12 9177 2015-04-23
Problem
1. I need to count the number of contacts made for a property within 'x' days of listing. The output should be a new column added to "listings" with # contacts:
Sample ('x' = 30 days)
listings
# id city listing_date ngs
#1 6174 A 2015-03-01 2
#2 2175 B 2015-03-14 3
#3 9176 B 2015-03-30 1
#4 4176 B 2015-04-07 0
#5 9177 A 2015-04-18 1
I have done this with a for loop; it is horribly slow on live data:
n <- nrow(listings)
mat <- vector("integer", n)
for (i in 1:n) {
  mat[i] <- nrow(contacts[contacts$id == listings[i, "id"] &
                          as.numeric(contacts$contact_date - listings[i, "listing_date"]) <= 30, ])
}
listings$ngs <- mat
2. I also need to prepare a histogram of # contacts vs. days, with 'x' as a variable, through the manipulate function. I can't figure out a way to do all this inside manipulate.
Here's a possible solution using data.table rolling joins
library(data.table)
# key `listings` by the proper columns in order to perform the binary join
setkey(setDT(listings), id, listing_date)
# perform a binary rolling join while extracting the matched indices and counting them
indx <- data.table(listings[contacts, roll = 30, which = TRUE])[, .N, by = V1]
# Joining back to `listings` by proper rows while assigning the counts by reference
listings[indx$V1, ngs := indx$N]
# id city listing_date ngs
# 1: 2175 B 2015-03-14 3
# 2: 4176 B 2015-04-07 NA
# 3: 6174 A 2015-03-01 2
# 4: 9176 B 2015-03-30 1
# 5: 9177 A 2015-04-18 1
I'm not sure if your actual id values are factor, but I'll start by making those numeric. Using them as factors will cause you problems:
listings$id <- as.numeric(as.character(listings$id))
contacts$id <- as.numeric(as.character(contacts$id))
Then the strategy is to calculate a "days since listing" value for each contact and add it to your contacts data.frame, aggregate this new data.frame (in your example, the number of contacts within 30 days), and finally merge the resulting count back into your original data.
contacts$ngs <- contacts$contact_date - listings$listing_date[match(contacts$id, listings$id)]
a <- aggregate(ngs ~ id, data = contacts, FUN = function(x) sum(x <= 30))
merge(listings, a)
# id city listing_date ngs
# 1 2175 B 2015-03-14 3
# 2 4176 B 2015-04-07 0
# 3 6174 A 2015-03-01 2
# 4 9176 B 2015-03-30 1
# 5 9177 A 2015-04-18 1
Or:
indx <- match(contacts$id, listings$id)
days_since <- contacts$contact_date - listings$listing_date[indx]
n <- with(contacts[days_since <= 30, ], tapply(id, id, length))
n[is.na(n)] <- 0
listings$n <- n[match(listings$id, names(n))]
It's similar to Thomas' answer but utilizes tapply and match instead of aggregate and merge.
You could use the dplyr package. First merge the data:
all.data <- merge(contacts,listings,by = "id")
Set a target number of days:
number.of.days <- 30
Then group the data by ID (group_by), exclude the results that are not within the time frame (filter), and count the number of occurrences/rows (summarise).
result <- all.data %>%
  group_by(id) %>%
  filter(contact_date > listing_date + number.of.days) %>%
  summarise(count.of.contacts = length(id))
I think there are a number of ways this could be potentially solved but I have found dplyr to be very helpful in a lot circumstances.
EDIT:
Sorry, I should have thought about that a little more. Does this work?
result <- all.data %>%
  group_by(id, city, listing_date) %>%
  summarise(ngs = length(id[which(contact_date < listing_date + number.of.days)]))
I don't think zero results can be passed sensibly through the filter stage (understandably, since the goal is usually the opposite). I'm not too sure what impact the 'which' component has on processing time; it is probably slower than using filter, but that might not matter.
Using dplyr for your first problem:
left_join(contacts, listings, by = c("id" = "id")) %>%
  filter(abs(listing_date - contact_date) < 30) %>%
  group_by(id) %>%
  summarise(cnt = n()) %>%
  right_join(listings)
And the output is:
id cnt city listing_date
1 6174 2 A 2015-03-01
2 2175 3 B 2015-03-14
3 9176 1 B 2015-03-30
4 4176 NA B 2015-04-07
5 9177 1 A 2015-04-18
I'm not sure I understand your second question well enough to answer it.
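As a footnote (a sketch, not from the original answers): joining from listings and summing a logical condition keeps the zero counts, and wrapping the whole thing in a function of x makes it easy to drive interactively, e.g. from manipulate::manipulate() with a slider():
library(dplyr)
count_within <- function(x) {
  listings %>%
    left_join(contacts, by = "id") %>%
    group_by(id, city, listing_date) %>%
    summarise(ngs = sum(between(as.numeric(contact_date - listing_date), 0, x), na.rm = TRUE),
              .groups = "drop")
}
count_within(30)
# e.g. histogram of the contact counts for a chosen window:
# manipulate::manipulate(hist(count_within(x)$ngs), x = manipulate::slider(1, 90))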

R: Using values from data frame A from a date prior to populate a row in data frame B

This may be very complicated and I suspect it requires advanced knowledge. I now have two different types of data.frames I need to combine:
The data:
Dataframe A:
lists all transfusion dates by patient ID. Every transfusion is represented by a separate row; patients can have multiple transfusions, and different patients can have transfusions on the same date.
Patient ID Transfusion.Date
1 01/01/2000
1 01/30/2000
2 04/01/2003
3 04/01/2003
Dataframes of Type B contain test results at other dates, also by patient ID:
Patient ID Test.Date Test.Value
1 11/30/1999 negative
1 01/15/2000 700 copies/uL
1 01/27/2000 900 copies/uL
2 03/30/2003 negative
What I would like to have is Dataframe A with the same number of rows (1 for each transfusion) and with the most recent Test.Value as a separate column. Each transfusion date should have the test result from the test performed most recently before the transfusion.
desired output:
-->
Patient ID Transfusion.Date Pre.Transfusion.Test
1 01/01/2000 negative
1 01/30/2000 900 copies/ul
2 04/01/2003 negative
3 04/01/2003 NA
I think the general strategy would be to subset the data.frames by patient ID, then take all transfusion dates for patient 1, check which test date is closest to (and before) each transfusion date, and return that test's value.
How can I tell R to do that?
Edit 1: Here is the R-code for these examples
df_A <- data.frame(MRN = c(1,1,2,3),
Transfusion.Date = as.Date(c('01/01/2000', '01/30/2000',
'04/01/2003','04/01/2003'),'%m/%d/%Y'))
df_B <- data.frame(MRN = c(1,1,1,2),
Test.Date = as.Date(c('11/30/1999', '01/15/2000', '01/27/2000',
'03/30/2003'),'%m/%d/%Y'), Test.Result = c('negative',
'700 copies/ul','900 copies/ul','negative'))
Edit 2:
To clarify, the resulting data should read: patient 1 received transfusions on day X and day Y (the rows of df_A). Prior to the transfusion on day X, his most recent test result is the one with the test date closest to (and before) day X in df_B; likewise for day Y. df_B also contains a bunch of other test dates, which are not needed for the final output.
Here's using data.table's rolling joins:
require(data.table)
setkey(setDT(df_A), MRN, Transfusion.Date)
setkey(setDT(df_B), MRN, Test.Date)
df_B[df_A, roll=TRUE]
# MRN Test.Date Test.Result
# 1: 1 2000-01-01 negative
# 2: 1 2000-01-30 900 copies/ul
# 3: 2 2003-04-01 negative
# 4: 3 2003-04-01 NA
setDT converts data.frame to data.table by reference (without any additional copying). That'll result in df_A and df_B now being data.tables.
setkey sorts the data.table by the columns we provided, and marks those columns as key columns, which allows us to use binary search based joins.
We perform a join of the form x[i] on the key columns, where for each row of i, the matching rows of x (if any, else NA) along with i's rows are returned. This is what we call an equi-join. By adding roll = TRUE, in the event of a mismatch, the last observation is carried forward (LOCF). This is what we call a rolling join. The sorting in increasing order (due to setkey()) ensures that the last observation is the most recent date.
HTH
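As an aside (not part of the original answer), the same "most recent test on or before each transfusion" lookup can be written as a non-equi join in dplyr >= 1.1.0 using join_by() with the closest() helper; a sketch:
library(dplyr)
# for each row of df_A, keep the df_B row with the same MRN whose Test.Date is
# closest to, and not after, the Transfusion.Date; MRN 3 has no test, so it gets NA
left_join(df_A, df_B,
          by = join_by(MRN, closest(Transfusion.Date >= Test.Date)))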
dfLast <- df_B[ df_B$Test.Date %in%
as.Date( tapply(df_B$Test.Date, df_B$MRN, tail,1),"1970-01-01"), ]
merge(df_A, dfLast, by=c(1:2,1:2) ,all.y=TRUE)
MRN Transfusion.Date Test.Result
1 1 2000-01-27 900 copies/ul
2 2 2003-03-30 negative
Edited: I had some logical errors and some syntactic errors. tapply returned the integer values of the Dates, and as you pointed out, I was using the wrong column name in the data reduction step.
OK thanks for everyone's help. It took me a lot of toil, blood, sweat, and tears, but this is the solution I came up with:
Merge both data frames:
df_AB <- merge(df_A, df_B, all.x = T)
df_AB:
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-01 2000-01-15 700 copies/ul
3 1 2000-01-01 2000-01-27 900 copies/ul
4 1 2000-01-30 1999-11-30 negative
5 1 2000-01-30 2000-01-15 700 copies/ul
6 1 2000-01-30 2000-01-27 900 copies/ul
7 2 2003-04-01 2003-03-30 negative
8 3 2003-04-01 <NA> <NA>
Using dplyr
df_tests <- df_AB %>%
  group_by(MRN, Transfusion.Date) %>%
  mutate(Time.Difference = Transfusion.Date - Test.Date) %>%
  filter(Time.Difference > 0) %>%
  arrange(Time.Difference) %>%
  summarize(Test.Date = Test.Date[1], Test.Result = Test.Result[1])
df_tests:
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-30 1999-11-30 negative
3 2 2003-04-01 2003-03-30 negative
Using merge again to bring MRN 3 back in:
df_desired <- merge(df_A, df_tests, all.x = T)
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-30 2000-01-27 900 copies/ul
3 2 2003-04-01 2003-03-30 negative
4 3 2003-04-01 <NA> <NA>
