Hey guys, I have two DBs:
BaseDB
id  Main_event_date
1   01/01/2017
2   01/07/2018
3   01/11/2017
DB_2
id  event_date
1   01/02/2017
1   19/12/2017
1   19/01/2018
2   01/10/2018
2   10/01/2019
DB_2 can have duplicated values in "id": every time an "id" has an event, there is a new row with a new "event_date".
I reshaped DB_2 from long to wide using reshape(), so that now I have:
DB_2_wide
"id", "event_date0", "event_date1"...."event_date100".
I would like to join BaseDB with DB_2_wide, considering only the first 30 events after Main_event_date for every "id", to get a final table of this type:
Final_Table
id  t1  t2  t3  ... t30
1   x   NA  NA
2   NA  NA  x
3   NA  NA  NA
Note also that t1 for id 1 is February 2017, t1 for id 2 is August 2018, and t1 for id 3 is December 2017 (i.e. t1 is the first calendar month after each id's Main_event_date).
I'm not sure how to do this, because simply using left_join() (or merge()) I would get a table like:
"id", "event_date0", "event_date1", ..., "event_date100"
so for every id I would also get columns for events before Main_event_date, which would be NA.
I'm not sure this is clear, so let me know if you have any questions.
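For what it's worth, here is a minimal sketch of one way to build Final_Table straight from the long DB_2, skipping the wide reshape. It assumes the dates are dd/mm/yyyy strings and that tk means "at least one event in the k-th calendar month after Main_event_date", which is how the examples above read:
library(dplyr)
library(tidyr)

# whole calendar months from b to a (helper for this sketch)
mdiff <- function(a, b)
  12 * (as.integer(format(a, "%Y")) - as.integer(format(b, "%Y"))) +
       (as.integer(format(a, "%m")) - as.integer(format(b, "%m")))

events <- DB_2 %>%
  inner_join(BaseDB, by = "id") %>%
  mutate(k = mdiff(as.Date(event_date, "%d/%m/%Y"),
                   as.Date(Main_event_date, "%d/%m/%Y"))) %>%
  filter(k >= 1, k <= 30) %>%                     # only the first 30 months
  distinct(id, k) %>%
  mutate(t = factor(paste0("t", k), levels = paste0("t", 1:30)),
         flag = "x")

Final_Table <- BaseDB %>%
  select(id) %>%
  left_join(pivot_wider(events, id_cols = id, names_from = t,
                        values_from = flag, names_expand = TRUE),
            by = "id")
ids with no qualifying events (like id 3) come out as a row of NAs, and names_expand = TRUE (tidyr >= 1.2.0) forces all 30 t-columns even when a month is empty for every id.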
I have two unequal data frames with different patient information.
DfA has details on X-rays patients have had after a procedure. The table has their ID, date of their X-ray and 15 other values as columns.
DfA
PatientID  X-rayDate   X-ray_value1  X-ray_value2
123456     01/01/2017  1.5           2.5
123456     03/01/2018  2.0           4.2
654321     02/03/2017  4.4           1.1
654321     18/03/2018  7.2           1.2
112233     05/04/2017  0.4           7.6
112233     20/03/2018  4.2           5.5
DfB has their procedure date and test results at 3 different time points, one row per time point: pre-procedure, early post-procedure and 1-year post-procedure. One or more of the time points may have missing data (NA). There are 36 extra columns with more test results and their associated dates, which aren't always the same as each other (Test1 may be on 24/01/2017 but Test15 might have been done on 01/02/2017).
DfB
PatientID  ProcDate    Test1  Test1Date   Test2  Test2Date
123456     25/12/2016  400    25/12/2016  NA     NA
123456     25/12/2016  55     24/01/2017  22     26/01/2017
123456     25/12/2016  20     10/01/2018  14     12/01/2018
I need to merge the 2 data frames together into a single data frame that looks something like this:
DfC
PatientID  ProcDate    Test1  Test1Date   Test2  Test2Date   X-rayDate   X-ray_value1  X-ray_value2
123456     25/12/2016  400    25/12/2016  NA     NA          NA          NA            NA
123456     25/12/2016  55     24/01/2017  22     26/01/2017  01/01/2017  1.5           2.5
123456     25/12/2016  20     10/01/2018  14     12/01/2018  03/01/2018  2.0           4.2
DfA is larger than DfB, as it is a table of values for all patients who have had X-rays. DfB is specific to patients who have had the procedure, and any patients who haven't had the procedure can be ignored for DfC. The first row for each patient should have the X-ray columns filled with NA, as there is no X-ray data for the pre-procedure time point, and the two X-ray measurements should match to the closest dates early post-procedure and 1-year post-procedure.
So far I have tried to merge using:
library(dplyr)
DfC <- left_join(DfB, DfA, by = "PatientID")
This duplicates the DfB values for each row, and some PatientIDs lose an entire row of data for a time point, and I can't figure out why. (I apologise for not sharing data for a reproducible example, as it is actual patient data and I can't share it; HIPAA and all.) I've tried to filter the duplicated data row by row into a new DfCv2, but I can't figure it out.
How do I merge the two dataframes together to get DfC?
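One direction worth sketching (not a verified solution): data.table's roll = "nearest" join can attach, for each DfB time point, the single closest X-ray by date. Treating Test1Date as the date to measure against, and dd/mm/yyyy as the format, are assumptions here:
library(data.table)
setDT(DfA); setDT(DfB)
# temporary join keys as real Dates (assumed dd/mm/yyyy)
DfA[, JoinDate := as.Date(`X-rayDate`, format = "%d/%m/%Y")]
DfB[, JoinDate := as.Date(Test1Date, format = "%d/%m/%Y")]
# for each DfB row, pull the DfA row with the same PatientID and the
# nearest JoinDate; patients with no X-rays get NA in the X-ray columns
DfC <- DfA[DfB, on = .(PatientID, JoinDate), roll = "nearest"]
Note the pre-procedure row would still grab its nearest X-ray this way, so it would need to be blanked back to NA afterwards, as in the desired DfC above.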
First I want to say I am new to R. This problem is frustrating beyond belief. I have tried apply, lapply, and mapply. All with errors. I am lost.
What I want to do is take the time from "Results" and place it in the time column of "Records" IF Records does not have a time (i.e. where it is NA).
I have already done this in a traditional for-loop but it makes the code hard to read. I have read the apply functions can make this easier.
Data Frame "Results"
ID Time(sec)
1 1.7169811
2 1.9999999
3 2.3555445
4 3.4444444
Data Frame "Records"
ID Time(sec) Date
1 NA 1/1/2018
2 1.9999999 1/1/2018
3 NA 1/1/2018
4 3.1111111 1/1/2018
Data Frame 'New' Records
ID Time(sec) Date
1 1.7169811 1/1/2018
2 1.9999999 1/1/2018
3 2.3555445 1/1/2018
4 3.1111111 1/1/2018
No need to use apply in this situation. The pattern of conditionally choosing between two values based on some predicate is exactly what ifelse() does:
ifelse(predicate, value_a, value_b)
In this case you said you also have to make sure the values are matched by ID between the two data frames. A function that achieves this in R is appropriately named match():
match(values_to_look_up, table)
match() returns, for each element of values_to_look_up, the position of its first match in table, so that table_values[indices] realigns the table's values to the lookup order.
Combining this together:
inds <- match(Records$ID, Results$ID)            # where each Records ID sits in Results
Records$`Time(sec)` <- ifelse(is.na(Records$`Time(sec)`),
                              Results$`Time(sec)`[inds],  # fill missing times from Results
                              Records$`Time(sec)`)        # keep existing times
is.na() here is the predicate: it checks, for every value in the vector, whether that value is NA.
Inspired by this answer.
From the help: "Given a set of vectors, coalesce() finds the first non-missing value at each position. This is inspired by the SQL COALESCE function which does the same thing for NULLs."
library(tidyverse)
txt1 <- "ID Time(sec)
1 1.7169811
2 1.9999999
3 2.3555445
4 3.4444444"
txt2 <- "ID Time(sec) Date
1 NA 1/1/2018
2 1.9999999 1/1/2018
3 NA 1/1/2018
4 3.1111111 1/1/2018"
df1 <- read.table(text = txt1, header = TRUE)
df2 <- read.table(text = txt2, header = TRUE)
df1 %>%
  left_join(df2, by = "ID") %>%
  # prefer the existing Records time (.y); fall back to the Results time (.x)
  mutate(Time.sec. = coalesce(Time.sec..y, Time.sec..x)) %>%
  select(-Time.sec..x, -Time.sec..y)
#>   ID     Date Time.sec.
#> 1  1 1/1/2018  1.716981
#> 2  2 1/1/2018  2.000000
#> 3  3 1/1/2018  2.355545
#> 4  4 1/1/2018  3.111111
Created on 2018-03-10 by the reprex package (v0.2.0).
library(data.table)

family_id <- c(1, 2, 3)
age_mother <- c(30, 27, 29)
dob_child1 <- c("1998-11-12", "1999-12-12", "1996-04-12")  ## child one's birthday
dob_child2 <- c(NA, "1997-09-09", NA)                      ## if no child, NA
dob_child3 <- c(NA, "1999-09-01", "1996-09-09")
DT <- data.table(family_id, age_mother, dob_child1, dob_child2, dob_child3)
Now that I have DT, how can I use this table to find out how many children each family has, using syntax like this:
DT[, apply.., keyby = family_id] ## this code is wrong
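For reference, a data.table-native version of the counting (a sketch; it assumes the children columns are exactly dob_child1 to dob_child3):
# count non-NA dob columns per row, assigned by reference
DT[, total_child := rowSums(!is.na(.SD)),
   .SDcols = c("dob_child1", "dob_child2", "dob_child3")]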
This may also work:
> DT$total_child <- as.vector(rowSums(!is.na(DT[, c("dob_child1",
"dob_child2", "dob_child3")])))
> DT
family_id age_mother dob_child1 dob_child2 dob_child3 total_child
1 1 30 1998-11-12 <NA> <NA> 1
2 2 27 1999-12-12 1997-09-09 1999-09-01 3
3 3 29 1996-04-12 <NA> 1996-09-09 2
You can use the sqldf package to run a SQL query in R.
I duplicated your DT:
family_id <- c(1, 2, 3)
age_mother <- c(30, 27, 29)
dob_child1 <- c("1998-11-12", "1999-12-12", "1996-04-12")  ## child one's birthday
dob_child2 <- c(NA, "1997-09-09", NA)                      ## if no child, NA
dob_child3 <- c(NA, "1999-09-01", "1996-09-09")
DT <- data.table(family_id, age_mother, dob_child1, dob_child2, dob_child3)
library(sqldf)
sqldf('select count(dob_child1) + count(dob_child2) + count(dob_child3) as total_child,
              family_id
       from DT group by family_id')
SQL's count() skips NULLs (which R's NAs become in sqldf), so each count contributes 1 only when the child exists.
The result is the following:
total_child family_id
1 1 1
2 3 2
3 2 3
Is this correct for you?
Okay, so I have two different data frames (df1 and df2) which, to simplify it, have an ID, a date, and the score on a test. In each data frame each person (ID) has taken the test on multiple dates. Between the two data frames, some people are listed in df1 but not in df2, and vice versa, but some are listed in both, and those can overlap in different ways.
I want to combine all the data into one frame, but the tricky part is that if a score from df1 and a score from df2 for the same ID fall within 7 days of each other (I can compute this with a subtracted-dates column), I want to combine them into one row.
In essence, for every ID there will be one row with both scores, written separately, if the tests were taken within 7 days of each other; if not, there will be two separate rows, one with the score from df1 and one from df2, along with all the other scores that might not be listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID  Date1(yyyymmdd)  Score1  Date2(yyyymmdd)  Score2
1   20140512         50      NA               NA
1   20140501         30      NA               NA
1   20140703         50      20140702         10
1   20140805         20      NA               NA
1   NA               NA      20140530         40
1   NA               NA      20140622         20
1   NA               NA      20140820         60
3   20140522         70      NA               NA
3   20140530         10      NA               NA
2   NA               NA      20140522         30
2   NA               NA      20140530         80
Alright. I feel bad about the bogus outer join answer (which may be possible in a library I don't know about, but there are advantages to using an RDBMS sometimes...), so here is a hacky workaround. It assumes that all the joins will be at most one-to-one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3: everything from df1 plus empty Date2/Score2 columns
# (typed NAs so the Date class survives the assignments below)
df3 <- cbind(df1, Date2 = as.Date(NA), Score2 = NA_real_)
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)){
# see if there are any rows of test1 you could join test2 to
join_rows <- which(df3[,"ID"]==df2[j,"ID"] & abs(df3[,"Date1"]-df2[j,"Date2"])<7 )
# if so, join it to the first one (see discussion)
if(length(join_rows)>0){
df3[min(join_rows),"Date2"] <- df2[j,"Date2"]
df3[min(join_rows),"Score2"] <- df2[j,"Score2"]
} # if not, add a new row of just the test2
else df3 <- rbind.fill(df3,df2[j,])
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the rbinding of df3 to itself will consume more and more memory. The loop is definitely not optimal and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc).
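For larger data, one loop-free direction is a data.table non-equi join (a sketch under the same at-most-one-to-one assumption; the window columns win_lo/win_hi are my own):
library(data.table)
setDT(df1); setDT(df2)
# explicit +/- 7 day window around each df2 date
# (use +/- 6 to mirror the strict < 7 in the loop above)
df2[, `:=`(win_lo = Date2 - 7, win_hi = Date2 + 7)]
# every df2 row keeps its match or gets NA; the x./i. prefixes recover the
# original columns that a non-equi join would otherwise rename
matched <- df1[df2, on = .(ID, Date1 >= win_lo, Date1 <= win_hi),
               .(ID, Date1 = x.Date1, Score1 = x.Score1,
                 Date2 = i.Date2, Score2 = i.Score2)]
# append the df1 rows that never found a partner
leftover <- df1[!matched, on = .(ID, Date1)]
out <- rbind(matched,
             leftover[, .(ID, Date1, Score1,
                          Date2 = as.Date(NA), Score2 = NA_real_)])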
Use an outer join with an absolute-value limit on the date difference. (A full outer join of A and B keeps all rows of both.) For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <=7")
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format = "%Y%m%d"), etc.
I have a data.table which contains multiple columns, which is well represented by the following:
library(data.table)
DT <- data.table(date = as.IDate(rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each = 10)),
                 session = c(1, 2, 3), price = c(10, 11, 12, 13, 14),
                 volume = runif(30, min = 10, max = 1000))
I would like to extract a multiple column table which shows the volume traded at each price in a particular type of session -- with each column representing a date.
At present, I extract this data one date at a time using the following:
DT[session==1,][date=="2012-10-17", sum(volume), by=price]
and then bind the columns.
Is there a way of obtaining the end product (a table with each column referring to a particular date) without sticking all the single queries together, as I'm currently doing?
Thanks
Does the following do what you want? It's a combination of reshape2 and data.table:
library(reshape2)
.DT <- DT[, sum(volume), by = list(price, date, session)][, DATE := as.character(date)]
# reshape2 for casting to wide -- it doesn't seem to like IDate columns, hence
# the character DATE column
dcast(.DT, session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
6 2 10 NA 755.2650 998.7646
7 2 11 251.3691 695.0153 NA
8 2 12 791.6882 NA 275.4777
9 2 13 NA 111.7700 240.3329
10 2 14 230.6461 817.9438 NA
11 3 10 902.9220 NA 870.3641
12 3 11 NA 719.8441 963.1768
13 3 12 361.8612 563.9518 NA
14 3 13 393.6963 NA 718.7878
15 3 14 NA 871.4986 582.6158
If you just wanted session 1:
dcast(.DT[session == 1L], session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
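For what it's worth, newer versions of data.table ship their own dcast, so the reshape2 detour (and the character DATE workaround) isn't needed. A sketch of the equivalent call; fill = NA keeps empty cells as NA rather than sum's default of 0:
library(data.table)  # dcast for data.tables, data.table >= 1.9.6
dcast(DT[session == 1L], session + price ~ date,
      value.var = "volume", fun.aggregate = sum, fill = NA)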