Merging overlapping dataframes in R

Okay, so I have two data frames (df1 and df2) which, to simplify, have an ID, a date, and the score on a test. In each data frame a person (ID) has taken the test on multiple dates. Some people are listed in df1 but not in df2, and vice versa, but some are listed in both, and their dates can overlap in different ways.
I want to combine all the data into one frame. The tricky part: if a row from df1 and a row from df2 share an ID and their dates are within 7 days of each other (I can compute this with a subtracted-dates column), I want to merge those two rows into one.
In essence, for every ID there will be one row holding both scores when the tests were taken within 7 days of each other; otherwise there will be two separate rows, one with the score from df1 and one with the score from df2, along with all the other scores that are not listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1 20140512 50
1 20140501 30
1 20140703 50 20140702 10
1 20140805 20
1 20140530 40
1 20140622 20
1 20140820 60
3 20140522 70
3 20140530 10
2 20140522 30
2 20140530 80
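For reference, here is the example data as R objects (a reconstruction of the tables above, keeping the dates as yyyymmdd integers):
df1 <- data.frame(ID = c(1, 1, 1, 1, 3, 3),
                  Date1 = c(20140512, 20140501, 20140703, 20140805, 20140522, 20140530),
                  Score1 = c(50, 30, 50, 20, 70, 10))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                  Date2 = c(20140530, 20140622, 20140702, 20140820, 20140522, 20140530),
                  Score2 = c(40, 20, 10, 60, 30, 80))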

Alright. I feel bad about the bogus outer join answer (which may be possible in a library I don't know about, but there are advantages to using an RDBMS sometimes...), so here is a hacky workaround. It assumes that all the joins will be at most one-to-one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3, which starts as everything from df1 and NA from df2
df3 <- cbind(df1,Date2=NA, Score2=NA)
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)) {
  # see if there are any rows of df1 this df2 row could join to:
  # same ID, dates less than 7 days apart
  join_rows <- which(df3[, "ID"] == df2[j, "ID"] &
                     abs(df3[, "Date1"] - df2[j, "Date2"]) < 7)
  # if so, join it to the first one (see discussion)
  if (length(join_rows) > 0) {
    df3[min(join_rows), "Date2"] <- df2[j, "Date2"]
    df3[min(join_rows), "Score2"] <- df2[j, "Score2"]
  } else {
    # if not, add a new row with just the df2 values
    df3 <- rbind.fill(df3, df2[j, ])
  }
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows into the same sort order as yours, but they look the same.
Short explanation: for each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, repeatedly rbinding onto df3 will consume more and more memory. The loop is not optimal, and its search does not exploit the fact that the tables are sorted. If by some chance a test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc.).
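For larger data, a vectorized alternative is a non-equi join. Here is a rough sketch with data.table (my addition, not part of the answer above; it assumes data.table >= 1.9.8 and, like the loop, at most one match per row - a df2 row falling within 7 days of several df1 rows would produce duplicates):
library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
# a +/- 7 day window around each df2 date
dt2[, `:=`(lo = Date2 - 7, hi = Date2 + 7)]
# keep every df2 row, matched to any df1 row whose Date1 falls in the window
matched <- dt1[dt2, .(ID, Date1 = x.Date1, Score1, Date2, Score2),
               on = .(ID, Date1 >= lo, Date1 <= hi)]
# df1 rows that found no partner are appended with NA in the df2 columns
leftover <- dt1[!matched, on = .(ID, Date1)]
result <- rbind(matched, leftover, fill = TRUE)[order(ID, Date1, Date2)]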

Use a full outer join with a limit on the absolute date difference. (A full outer join of A and B keeps all rows of both A and B.) For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <=7")
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d"), etc.
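One caveat (my note): SQLite only gained FULL OUTER JOIN in version 3.39 (2022), so on older backends the query above will not parse. A hedged workaround emulates the full outer join as a union of two left joins; sqldf ships Date columns to SQLite as numeric day counts, so plain subtraction gives a difference in days:
library(sqldf)
sqldf("select a.*, b.*
         from df1 a
         left join df2 b
           on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7
       union all
       select a.*, b.*
         from df2 b
         left join df1 a
           on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7
        where a.ID is null")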

Related

Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on Stack Overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and got output that I didn't expect.
This time, I want to do something slightly different from my previous question:
Merge two data with respect to date and week using R
I have two data frames. One has a year_month_week column and the other has a date column.
df1 <- data.frame(id = c(1,1,1,2,2,2,2),
                  year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
                  points = c(65,58,47,21,25,27,43))
df2 <- data.frame(id = c(1,1,1,2,2,2),
                  date = c(20220503,20220506,20220512,20220401,20220408,20220409),
                  temperature = c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022. Likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 are both in the 1st week of May 2022; if more than one date falls in a year_month_week, I just include the first of them. Now, here's the different part: even if there is no date inside a year_month_week, just leave it NA. So my expected output has the same number of rows as df1, which includes the column year_month_week. My expected output is as follows:
df <- data.frame(id = c(1,1,1,2,2,2,2),
                 year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
                 points = c(65,58,47,21,25,27,43),
                 temperature = c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can convert the dates in df2 into year_month_week keys, then join the two tables:
library(dplyr)
library(lubridate)
df2$dt <- ymd(df2$date)
# week of the month: days 1-6 -> 1, days 7-13 -> 2, and so on
df2$wk <- day(df2$dt) %/% 7 + 1
df2$year_month_week <- as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
  left_join(df2 %>%
              group_by(year_month_week) %>%
              slice(1) %>%
              select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
You can build off of a previous answer here by taking the function that counts the week of the month, then generating a join key in df2.
df1 <- data.frame(
id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2 <- data.frame(
id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the function from the previous StackOverflow question
monthweeks.Date <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
df2 %>%
mutate(
date = lubridate::parse_date_time(
x = date,
orders = "%Y%m%d"),
year_month_week = paste0(
lubridate::year(date),
0,
lubridate::month(date),
monthweeks.Date(date)),
year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <-
df2 %>%
arrange(year_month_week) %>%
distinct(year_month_week, .keep_all = T)
# Join dataframes
df1 <-
left_join(
df1,
df2,
by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)
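One caveat worth noting (my addition, not from the original answer): paste0(year, 0, month, week) hard-codes the leading zero, so the key would be garbled for October through December. A sprintf-based key pads only when needed:
# zero-pads months 1-9 without corrupting months 10-12
df2$year_month_week <- as.double(sprintf(
  "%d%02d%d",
  as.integer(lubridate::year(df2$date)),
  as.integer(lubridate::month(df2$date)),
  as.integer(monthweeks.Date(df2$date))))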

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
library(tibble)  # provides as_data_frame()
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1 %>%
  group_by(id) %>%
  arrange(id, test_date) %>%
  filter(n() >= 2)  # keep only students who took the exam more than once,
                    # then merge back into the original data set with a join
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
  melt(id.vars = "id", measure.vars = c("test_date", "result_date"),
       value.name = "event_date")
# two test_date events in a row flag a retake - use dplyr::lag to compare
# each row's event type with the previous one
data1mc <- data1m %>%
  arrange(id, event_date) %>%
  group_by(id) %>%
  mutate(multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
  filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
  left_join(data1mc %>% select(id, event_date, multi_test),
            by = c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
summarise(retest=(test_date[length(test_date)] < result_date[1]) == TRUE)
Some re-test values were NA. These were individuals that only took the test once. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did it by taking advantage of the structure of your data (sorted by id) and the lag function, which can refer to the previous record while dealing with the current one.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
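For completeness, the lead() idea from the question also works once the comparison is grouped by id. A hedged sketch (my addition; assumes dplyr and the ISO-formatted date strings above) that flags the earlier sitting, matching the desired data2:
library(dplyr)
data1 %>%
  group_by(id) %>%
  arrange(test_date, .by_group = TRUE) %>%
  mutate(re_test = as.integer(
    coalesce(lead(as.Date(test_date)) < as.Date(result_date), FALSE))) %>%
  ungroup()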

How to run a loop in R to find a unique combination of numbers within a range of 7?

I have a dataset which looks something like this:
Key Days
A 1
A 2
A 3
A 8
A 9
A 36
A 37
B 14
B 15
B 44
B 45
I would like to split each key's rows into groups of 7 days, e.g.:
Key Days
A 1
A 2
A 3
Key Days
A 8
A 9
Key Days
A 36
A 37
Key Days
B 14
B 15
Key Days
B 44
B 45
I could use ifelse and specify buckets of 1-7, 8-14, etc. up to 63-70 (the maximum possible value of Days). However, the issue lies with the Days column: there are lots of cases where days straddle the bucket boundaries. Take days 14-15 as an example: they would land in two different brackets under the ifelse logic even though they are only a day apart.
The ideal method would be to take a day, add 7 to it, and check how many rows actually fall within that window. I think we need a loop for this. I could do it in Excel, but I have 20,000 rows of data for 2,000 keys, hence R. I need a loop that checks each key and, for each key, buckets the day values in groups of 7, anchored on the first day value of each range.
We create a grouping variable by applying %/% on the 'Days' column and then split the dataset into a list based on that 'grp'.
grp <- df$Days %/% 7
split(df, factor(grp, levels = unique(grp)))
#$`0`
# Key Days
#1 A 1
#2 A 2
#3 A 3
#$`1`
# Key Days
#4 A 8
#5 A 9
#$`5`
# Key Days
#6 A 36
#7 A 37
#$`2`
# Key Days
#8 B 14
#9 B 15
#$`6`
# Key Days
#10 B 44
#11 B 45
Update
If we need to split by 'Key' also
lst <- split(df, list(factor(grp, levels = unique(grp)), df$Key), drop=TRUE)
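Note that %/% 7 gives fixed buckets (days 0-6, 7-13, ...), which is close to but not exactly the anchored windows the question describes (take the first day, add 7, re-anchor at the next day outside the window). A small sketch of that anchored logic (my addition; assumes Days is sorted within each Key):
# within one key, start a new group at the first day that is 7 or more
# days past the current anchor
bucket_by_7 <- function(days) {
  grp <- integer(length(days))
  anchor <- days[1]
  g <- 1L
  for (i in seq_along(days)) {
    if (days[i] - anchor >= 7) {
      anchor <- days[i]
      g <- g + 1L
    }
    grp[i] <- g
  }
  grp
}
df$grp <- ave(df$Days, df$Key, FUN = bucket_by_7)
split(df, interaction(df$Key, df$grp, drop = TRUE))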

Assign rows to a group based on spatial neighborhood and temporal criteria in R

I have an issue that I just cannot seem to sort out. I have a dataset that was derived from a raster in ArcGIS. The dataset represents every fire occurrence during a 10-year period. Some raster cells had multiple fires within that period (and thus have multiple rows in my dataset) and some raster cells had no fire (and thus are not represented). So each row in the dataset has a column number (sequential integer) and a row number that correspond to the column and row IDs from the raster, plus the date of the fire.
I would like to assign a unique ID (fire_ID) to all of the fires that are within 4 days of each other and in adjacent pixels from one another (within the 8-cell neighborhood) and put this into a new column.
To clarify, if there were an observation from row 3, col 3, Jan 1, 2000 and another from row 2, col 4, Jan 4, 2000, those observations would be assigned the same fire_ID.
Below is a sample dataset with "rows", which are the row IDs of the raster, "cols", which are the column IDs of the raster, and "dates" which are the dates the fire was detected.
rows<-sample(seq(1,50,1),600, replace=TRUE)
cols<-sample(seq(1,50,1),600, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),600, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
I've tried sorting the data by "row", then "column", then "date" and looping through to create a new fire_ID when the row and column IDs were within one cell and the dates within 4 days, but this obviously doesn't work: fires that should share a fire_ID get different fire_IDs whenever observations belonging to a different fire sit between them in the sorted list.
fire_df2 <- fire_df[order(fire_df$rows, fire_df$cols, fire_df$dates),]
fire_ID <- numeric(length = nrow(fire_df2))
fire_ID[1] <- 1
for (i in 2:nrow(fire_df2)) {
  fire_ID[i] <- ifelse(
    abs(fire_df2$rows[i] - fire_df2$rows[i-1]) <= 1 &
      abs(fire_df2$cols[i] - fire_df2$cols[i-1]) <= 1 &
      abs(fire_df2$dates[i] - fire_df2$dates[i-1]) <= 4,
    fire_ID[i-1],
    i)
}
length(unique(fire_ID))
fire_df2$fire_ID <- fire_ID
Please let me know if you have any suggestions.
I think this task requires something along the lines of hierarchical clustering.
Note, however, that there will be necessarily some degree of arbitrariness in the ids. This is because it is entirely possible that the cluster of fires itself is longer than 4 days yet every fire is less than 4 days away from some other fire in that cluster (and thus should have the same id).
library(dplyr)
# Create the distances
fire_dist <- fire_df %>%
# Normalize dates
mutate( norm_dates = as.numeric(dates)/4) %>%
# Only keep the three variables of interest
select( rows, cols, norm_dates ) %>%
# Compute distance using the L-infinity norm (maximum)
dist( method="maximum" )
# Do hierarchical clustering with "single" aggl method
fire_clust <- hclust(fire_dist, method="single")
# Cut the tree at height 1 and obtain groups
group_id <- cutree(fire_clust, h=1)
# First attach the group ids back to the data frame
fire_df2 <- cbind( fire_df, group_id ) %>%
# Then sort the data
arrange( group_id, dates, rows, cols )
# Print the first 10 records
fire_df2[1:10,]
(Make sure you have the dplyr library installed. You can run install.packages("dplyr", dep=TRUE) if it is not. It is a really good and very popular library for data manipulation.)
A couple of simple tests:
Test #1. The same forest fire moving.
rows<-1:6
cols<-1:6
dates<-seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
fire_df<-data.frame(rows, cols, dates)
gives me this:
rows cols dates group_id
1 1 1 2000-01-01 1
2 2 2 2000-01-02 1
3 3 3 2000-01-03 1
4 4 4 2000-01-04 1
5 5 5 2000-01-05 1
6 6 6 2000-01-06 1
Test #2. 6 different random forest fires.
set.seed(1234)
rows<-sample(seq(1,50,1),6, replace=TRUE)
cols<-sample(seq(1,50,1),6, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),6, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
output:
rows cols dates group_id
1 6 1 2000-01-10 1
2 32 12 2000-01-30 2
3 31 34 2000-01-10 3
4 32 26 2000-01-27 4
5 44 35 2000-01-10 5
6 33 28 2000-01-09 6
Test #3: one expanding forest fire
dates <- seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
rows_start <- 50
cols_start <- 50
fire_df <- data.frame(dates = dates) %>%
rowwise() %>%
do({
diff = as.numeric(.$dates - as.Date("2000/01/01"))
expand.grid(rows=seq(rows_start-diff,rows_start+diff),
cols=seq(cols_start-diff,cols_start+diff),
dates=.$dates)
})
gives me:
rows cols dates group_id
1 50 50 2000-01-01 1
2 49 49 2000-01-02 1
3 49 50 2000-01-02 1
4 49 51 2000-01-02 1
5 50 49 2000-01-02 1
6 50 50 2000-01-02 1
7 50 51 2000-01-02 1
8 51 49 2000-01-02 1
9 51 50 2000-01-02 1
10 51 51 2000-01-02 1
and so on. (All records identified correctly to belong to the same forest fire.)

How to create dataframe subset of the one patient observation with the lowest score on a variable

Hello I have a dataset with multiple patients, each with multiple observations.
I want to select the earliest observation for each patient.
Example:
Patient ID Tender Swollen pt_visit
101 1 10 6
101 6 12 12
101 4 3 18
102 9 5 18
102 3 6 24
103 5 2 12
103 2 1 18
103 8 0 24
The pt_visit variable is the number of months the patient had been in the study at the time of the observation. What I need is the earliest observation for each patient ID, i.e. the row with the lowest pt_visit value per patient.
My desired results:
Patient ID Tender Swollen pt_visit
101 1 10 6
102 9 5 18
103 5 2 12
Assuming your data frame is called df, use the ddply function in the plyr package:
require(plyr)
firstObs <- ddply(df, "PatientID", function(x) x[x$pt_visit == min(x$pt_visit), ])
I would use the data.table package:
library(data.table)
Data <- data.table(Data)
setkey(Data, Patient_ID, pt_visit)
Data[, .SD[1], by = Patient_ID]
Assuming that the Patient ID column is actually named Patient_ID, here are a few approaches. DF is assumed to be the name of the input data frame:
sqldf
library(sqldf)
sqldf("select Patient_ID, Tender, Swollen, min(pt_visit) pt_visit
from DF
group by Patient_ID")
or
sqldf("select *, min(pt_visit) pt_visit from DF group by Patient_ID")[-ncol(DF)]
Note: The above two alternatives use an extension to SQL found only in SQLite, so be sure you are using the SQLite backend. (SQLite is the default backend for sqldf unless RH2, RPostgreSQL, or RMySQL is loaded.)
subset/ave
subset(DF, ave(pt_visit, Patient_ID, FUN = rank) == 1)
Note: This makes use of the fact that there are no duplicate pt_visit values within the same Patient_ID. If there were we would need to specify the ties= argument to rank.
I almost think there should be a subset parameter named "by" that would do the same as it does in data.table. This is a base solution:
do.call(rbind, lapply( split(dfr, dfr$PatientID),
function(x) x[which.min(x$pt_visit),] ) )
PatientID Tender Swollen pt_visit
101 101 1 10 6
102 102 9 5 18
103 103 5 2 12
I guess you can see why #hadley built 'plyr'.
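For what it's worth, a modern dplyr equivalent of the grouped minimum (my addition; assumes dplyr >= 1.0 and the Patient_ID spelling used above):
library(dplyr)
DF %>%
  group_by(Patient_ID) %>%
  slice_min(pt_visit, n = 1, with_ties = FALSE) %>%
  ungroup()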
