Extracting rows based on ID and date. R-base - r

I have two data frames: one with an ID and a date for each of 700 persons, and another with 400,000 rows containing a date and several other variables for over 1,000 persons.
example df1:
ID date
1010 2014-05-31
1011 2015-08-27
1015 2011-04-15
...
example df2:
ID Date Operationcode
1010 2008-01-03 456
1010 2016-06-09 1234
1010 1999-10-04 123186
1010 2017-02-30 71181
1010 2005-05-05 201
1011 2008-04-02 46
1011 2009-09-09 1231
1515 2017-xx-xx 156
1015 2013-xx-xx 123
1615 1998-xx-xx 123
1015 2005-xx-xx 4156
1015 2007-xx-xx 123
1015 2016-xx-xx 213
Now I want to create df3, keeping only the rows from df2 whose date is before the date in df1 for the same ID.
So I get:
ID Date Operationcode
1010 2008-01-03 456
1010 1999-10-04 123186
1010 2005-05-05 201
1015 2005-xx-xx 4156
1015 2007-xx-xx 123
I've tried:
df3 <- subset(df1, ID %in% df2$ID & df2$date < df1$date)
but it keeps giving me an error saying that the lengths in the last part, df2$date < df1$date, don't match. When I test on a sample (looking up the operation codes for one ID), I can see that I am missing a lot of rows from before the date in df1. Any ideas or solutions?
Also, I only have base R, as this is the hospital's computer, which doesn't allow any downloads.

In base R you could do something like this...
df3 <- merge(df2, df1, by = "ID", all.x = TRUE) # merge in the df1 date column
# note that 'Date' is the df2 column, 'date' is the df1 version
df3 <- df3[as.Date(df3$Date) < as.Date(df3$date), ] # keep rows dated before the df1 date
df3 <- df3[!is.na(df3$ID), ] # remove the all-NA rows left where the comparison was NA
df3$date <- NULL # remove the df1 date column
df3
ID Date Operationcode
1 1010 2008-01-03 456
2 1010 1999-10-04 123186
3 1010 2005-05-05 201
6 1011 2009-09-09 1231
7 1011 2008-04-02 46
I'm not sure what is supposed to happen with the dates with xx in your data. Are they real? If they appear in the actual data, they will need special handling as otherwise they will not be converted to proper date format, so the calculation fails.
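As a rough sketch of what that special handling could look like (an assumption, not something the question specifies): parse with an explicit format so that strings that are not real dates, such as 2017-02-30 or 2005-xx-xx, become NA, then decide what to do with them. Here partial dates fall back to 1 January of the stated year, which is purely an illustrative choice. The helper column Date_clean is made up for the example, and only a few rows of the sample data are used.

```r
# A few rows of the sample data; 'Date' is df2's column, 'date' is df1's
df1 <- data.frame(ID = c(1010, 1011, 1015),
                  date = c("2014-05-31", "2015-08-27", "2011-04-15"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c(1010, 1010, 1010, 1011, 1015, 1515),
                  Date = c("2008-01-03", "2016-06-09", "2017-02-30",
                           "2008-04-02", "2005-xx-xx", "2017-xx-xx"),
                  Operationcode = c(456, 1234, 71181, 46, 4156, 156),
                  stringsAsFactors = FALSE)

# With an explicit format, strings that are not real dates become NA
parsed <- as.Date(df2$Date, format = "%Y-%m-%d")
unparsed <- df2[is.na(parsed), ]   # inspect these before deciding how to treat them

# Assumption: treat "YYYY-xx-xx" as 1 January of that year
df2$Date_clean <- parsed
partial <- grepl("^[0-9]{4}-xx-xx$", df2$Date)
df2$Date_clean[partial] <- as.Date(paste0(substr(df2$Date[partial], 1, 4), "-01-01"))

# Inner join drops IDs that are not in df1; rows still NA are dropped as well
merged <- merge(df2, df1, by = "ID")
keep <- !is.na(merged$Date_clean) & merged$Date_clean < as.Date(merged$date)
df3 <- merged[keep, c("ID", "Date", "Operationcode")]
```

Whether 1 January, 31 December, or dropping the row entirely is the right fallback depends on what the partial dates mean in the real data.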

Related

How to subtract a column of date values by sys.Date() using mutate - tidyverse/dplyr? R

I have this dataframe I am working with.
data <- data.frame(id = c(123,124,125,126,127,128,129,130),
date = c("10/7/2021","10/6/2021","9/13/2021","10/18/2021","8/12/2021","9/6/2021","10/29/2021","9/6/2021"))
My goal is create a new column that tells me how many days have passed since that recorded date for each row. I'm trying to use this code but I keep getting NA days in my new column.
data %>%
select(id,date) %>%
mutate("days_since" = as.Date(Sys.Date()) - as.Date(date,format="%Y-%m-%d"))
id date days_since
1 123 10/7/2021 NA days
2 124 10/6/2021 NA days
3 125 9/13/2021 NA days
4 126 10/18/2021 NA days
5 127 8/12/2021 NA days
6 128 9/6/2021 NA days
7 129 10/29/2021 NA days
8 130 9/6/2021 NA days
What am I doing wrong? Thank you for any feedback.
We can use the lubridate package. It makes type conversion and operations with dates much easier.
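In your code, the as.Date(date, format="%Y-%m-%d") step was the problem: that format string describes year-month-day, but your dates are written as month/day/year, so every value was parsed as NA.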
library(dplyr)
library(lubridate)
data %>% mutate("days_since" = Sys.Date() - mdy(date))
id date days_since
1 123 10/7/2021 28
2 124 10/6/2021 29
3 125 9/13/2021 22
4 126 10/18/2021 17
5 127 8/12/2021 23
6 128 9/6/2021 29
7 129 10/29/2021 6
8 130 9/6/2021 29
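Thanks, @Karthik S, for the simplification.
This is also easily done in base R with a simple "-", which gives back the difference in days. Note that the dates below are written in year-month-day form, which as.Date() parses without a format argument: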
data <- data.frame(id = c(123,124,125,126,127,128,129,130),
date = c("2021-10-10","2021-10-06","2021-09-13","2021-10-18","2021-08-12","2021-09-06","2021-10-29","2021-09-06"))
data$date <- as.Date(data$date)
data$sys_date <- Sys.Date()
data$sysDate_to_date <- data$sys_date -data$date
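The same base R subtraction should also work on the original month/day/year strings, as long as the format argument describes them. A minimal sketch, using three rows of the question's data and a fixed reference date in place of Sys.Date() (an assumption made only so the result is reproducible):

```r
data <- data.frame(id = c(123, 124, 125),
                   date = c("10/7/2021", "10/6/2021", "9/13/2021"))

# A format that does not match the input yields NA rather than an error
bad <- as.Date(data$date, format = "%Y-%m-%d")

# "%m/%d/%Y" matches month/day/four-digit year
parsed <- as.Date(data$date, format = "%m/%d/%Y")

# Fixed reference date (assumption); use Sys.Date() for "today"
ref <- as.Date("2021-11-04")
data$days_since <- as.numeric(ref - parsed)
```

Wrapping the difference in as.numeric() drops the "days" unit, which is convenient if the column is used in further arithmetic.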

How to calculate the sequential date diff in a dataframe and make it as another column for further analysis?

Please read my question carefully before marking it as a duplicate!
I am new to R and am trying to figure out how to calculate the date difference, in weeks, between one row and the next, and store it in a new column so that I can plot it.
There are a couple of answers here (Q1, Q2, Q3), but none specifically covers taking differences sequentially between rows within one column, say from top to bottom.
Below is an example and the expected result:
Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234
Expected
Date Var1 week
2/6/2017 493 0
2/20/2017 558 2
3/6/2017 595 4
3/20/2017 636 6
4/6/2017 697 8
4/20/2017 566 10
5/5/2017 234 12
You can use a similar approach to that in your first linked answer by saving the difftime result as a new column in your data frame.
# Set up data
df <- read.table(text = "Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234", header = TRUE)
df$Date <- as.Date(as.character(df$Date), format = "%m/%d/%Y")
# Create exact week variable (df$Date[1] is the first date; base R has no first())
df$week <- difftime(df$Date, df$Date[1], units = "weeks")
# Create rounded week variable
df$week2 <- floor(difftime(df$Date, df$Date[1], units = "weeks"))
df
# Date Var1 week week2
# 2017-02-06 493 0.000000 weeks 0 weeks
# 2017-02-20 558 2.000000 weeks 2 weeks
# 2017-03-06 595 4.000000 weeks 4 weeks
# 2017-03-20 636 6.000000 weeks 6 weeks
# 2017-04-06 697 8.428571 weeks 8 weeks
# 2017-04-20 566 10.428571 weeks 10 weeks
# 2017-05-05 234 12.571429 weeks 12 weeks
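If what is wanted is the gap between each row and the one before it, rather than the time elapsed since the first row, diff() is one way to get it. A small sketch on the first four rows of the example; the column name days_gap is made up for illustration:

```r
df <- data.frame(Date = c("2/6/2017", "2/20/2017", "3/6/2017", "3/20/2017"),
                 Var1 = c(493, 558, 595, 636))
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

# Gap to the previous row in days; the first row has no predecessor, so use 0
df$days_gap <- c(0, as.numeric(diff(df$Date)))

# Whole weeks elapsed since the first row, as a plain number for plotting
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks")))
```

Converting to plain numbers with as.numeric() avoids carrying the difftime unit into the plot axis.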

Combination of merge and aggregate in R

I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the ids in data are included in ns, and vice versa.
For each value of data$id, I am trying to find out whether there is an ns$date within the 14 days before the data$date, where data$id==ns$id, and to report the difference in days.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE depending on whether ns$date[ns$id==data$id] is less than 14 days before the respective data$date, and a similar vector with the actual number of days where "received" is TRUE. I hope this makes sense now.
This is what I have so far:
# convert dates (ymd() comes from the lubridate package)
library(lubridate)
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE that this will automatically rename data$date as date.x and ns$date as date.y
# create variable to say when there is a date difference less than 14 days
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
This link Find if date is within 14 days if id matches between datasets in R provides an idea but the result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id for those cases where tmp$received was TRUE.
Hope this makes more sense now? If not please let me know what needs further clarification.
M
PS: as requested I added what the desired output should look like (same number of rows of data = 10 - no rows for data in ns not in data). Should have thought this might help earlier.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
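Merging (the column coming from data appears as date.1 in the output below; recent data.table versions name it i.date instead, so adjust the code to match)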
ns <- ns[data]
Converting to Date class
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
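For comparison, the same merge-then-aggregate idea can be sketched in base R. This is only an illustration on a handful of the question's rows, not a drop-in replacement for the data.table code; the helper min_in_window is a made-up name, and the window is taken to be 0 to 13 days, matching the expected output (which includes a difference of 0).

```r
data <- data.frame(id = c(8, 8, 50, 87, 141),
                   date = as.Date(c("2011-03-03", "2011-12-12", "2010-08-18",
                                    "2009-04-28", "2010-11-29")))
ns <- data.frame(id = c(8, 50, 50, 87, 87),
                 date = as.Date(c("2011-02-25", "2010-08-17", "2010-08-15",
                                  "2009-04-28", "2008-02-19")))

# Left join: every row of data is kept; date.x is data$date, date.y is ns$date
tmp <- merge(data, ns, by = "id", all.x = TRUE)
tmp$dif <- as.numeric(tmp$date.x - tmp$date.y)

# Smallest gap inside the 0-13 day window for one group, NA if there is none
min_in_window <- function(d) {
  ok <- d[!is.na(d) & d >= 0 & d < 14]
  if (length(ok) > 0) min(ok) else NA_real_
}

# One result row per original data row (id + date)
res <- aggregate(tmp["dif"], by = tmp[c("id", "date.x")], FUN = min_in_window)
res <- res[order(res$id, res$date.x), ]

# TRUE/FALSE when the id exists in ns, NA when it does not
res$received <- ifelse(res$id %in% ns$id, !is.na(res$dif), NA)
```

Grouping on both id and date.x keeps the two separate rows for id 8 apart, which is what makes the result line up row for row with data.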

how to reorder my R dataframe by date and filled with NAs?

I want to convert dataframe df1 to df2 like the following:
df1 <- read.table(textConnection("
id date ret
1101 19900104 6.5867
1102 19900105 6.5383
1103 19900106 6.6043
1101 19900105 3.6943
1102 19900106 3.6368
1103 19900107 1.2740
1104 19900107 3.8572
1101 19900106 2.2525
1102 19900107 1.1253
1101 19900107 2.2331
"),header=T)
df2 <- read.table(textConnection("
date 1101 1102 1103 1104
19900104 6.5867 NA NA NA
19900105 3.6943 6.5383 NA NA
19900106 2.2525 3.6368 6.6043 NA
19900107 2.2331 1.1253 1.2740 3.8572
"),header=T)
I tried using a loop, but I don't think that is a good solution for very large data covering daily periods from 1990 to 2012. Many thanks to anyone who can help me...
This is going from long to wide format. reshape2 is a great package for working with these types of problems. To go from long to wide, you want to use dcast(). You give it the object to work with (df1), then a formula, which basically indicates what the rows are indexed by on the left of the ~ and what the columns are indexed by on the right.
library(reshape2)
df2 <- dcast(df1, date ~ id, value.var = "ret") # value.var names the column that fills the cells
df2
# date 1101 1102 1103 1104
# 1 19900104 6.5867 NA NA NA
# 2 19900105 3.6943 6.5383 NA NA
# 3 19900106 2.2525 3.6368 6.6043 NA
# 4 19900107 2.2331 1.1253 1.2740 3.8572
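If installing reshape2 is not an option, base R's reshape() can do the same long-to-wide conversion. A minimal sketch on four rows of the example data; note that the resulting columns are named ret.1101, ret.1102, and so on, rather than just the id:

```r
df1 <- data.frame(id   = c(1101, 1102, 1101, 1103),
                  date = c(19900104, 19900105, 19900105, 19900106),
                  ret  = c(6.5867, 6.5383, 3.6943, 6.6043))

# idvar identifies the rows of the wide table, timevar becomes the columns
wide <- reshape(df1, idvar = "date", timevar = "id", direction = "wide")
wide <- wide[order(wide$date), ]
```

Combinations of date and id that do not occur in the long data come out as NA, just as in the dcast() result.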

Merge data frames whilst summing common columns in R

My problem is very similar to the one posted here.
The difference is that they knew which columns would conflict, whereas I need a generic method that won't know in advance which columns conflict.
example:
TABLE1
Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
01/01/2013 08:30 15 25
01/01/2013 09:00 20 20
02/01/2013 08:00 25 15
02/01/2013 08:30 30 10
02/01/2013 09:00 35 5
TABLE2
Date ColumnA ColumnB ColumnC
01/01/2013 100 300 1
02/01/2013 200 400 2
Table 2 only has dates, so it applies to all rows in Table 1 that match the date, regardless of time.
I would like the merge to sum the conflicting columns into 1. The result should look like this:
TABLE3
Date Time ColumnA ColumnB ColumnC
01/01/2013 08:00 110 330 1
01/01/2013 08:30 115 325 1
01/01/2013 09:00 120 320 1
02/01/2013 08:00 225 415 2
02/01/2013 08:30 230 410 2
02/01/2013 09:00 235 405 2
At the moment my standard merge just creates duplicate columns of "ColumnA.x", "ColumnA.y", "ColumnB.x", "ColumnB.y".
Any help is much appreciated
If I understand correctly, you want a flexible method that does not require knowing which columns exist in each table aside from the columns you want to merge by and the columns you want to preserve. This may not be the most elegant solution, but here is an example function to suit your exact needs:
merge_Sum <- function(.df1, .df2, .id_Columns, .match_Columns){
  merged_Columns <- unique(c(names(.df1), names(.df2)))
  merged_df1 <- data.frame(matrix(nrow=nrow(.df1), ncol=length(merged_Columns)))
  names(merged_df1) <- merged_Columns
  for (column in merged_Columns){
    if(column %in% .id_Columns | !column %in% names(.df2)){
      merged_df1[, column] <- .df1[, column]
    } else if (!column %in% names(.df1)){
      merged_df1[, column] <- .df2[match(.df1[, .match_Columns], .df2[, .match_Columns]), column]
    } else {
      df1_Values <- .df1[, column]
      df2_Values <- .df2[match(.df1[, .match_Columns], .df2[, .match_Columns]), column]
      df2_Values[is.na(df2_Values)] <- 0
      merged_df1[, column] <- df1_Values + df2_Values
    }
  }
  return(merged_df1)
}
This function assumes you have a table '.df1' that is a master of sorts, and you want to merge data from a second table '.df2' that has rows that match one or more of the rows in '.df1'. The columns to preserve from the master table '.df1' are accepted as an array '.id_Columns', and the columns that provide the match for merging the two tables are accepted as an array '.match_Columns'
For your example, it would work like this:
merge_Sum(table1, table2, c("Date","Time"), "Date")
# Date Time ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00 110 330 1
# 2 01/01/2013 08:30 115 325 1
# 3 01/01/2013 09:00 120 320 1
# 4 02/01/2013 08:00 225 415 2
# 5 02/01/2013 08:30 230 410 2
# 6 02/01/2013 09:00 235 405 2
In plain language, this function first finds the total number of unique columns and makes an empty data frame in the shape of the master table '.df1' to later hold the merged data. Then, for the '.id_Columns', the data is copied from '.df1' into the new merged data frame. For the other columns, any data that exists in '.df1' is added to any existing data in '.df2', where the rows in '.df2' are matched based on the '.match_Columns'
There is probably some package out there that does something similar, but most of them require knowledge of all the existing columns and how to treat them. As I said before, this is not the most elegant solution, but it is flexible and accurate.
Update: The original function assumed a many-to-one relationship between table1 and table2, and the OP requested the allowance of a many-to-none relationship, also. The code has been updated with a slightly less efficient but 100% more flexible logic.
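Another way to get the same effect, sketched here as an alternative rather than as the function above, is to let merge() create the .x/.y columns and then sum each conflicting pair generically. The conflicting columns are found by intersecting the column names, so nothing has to be known in advance; the variable names by_cols and conflicts are made up for this example, which uses three rows of Table 1.

```r
table1 <- data.frame(Date = c("01/01/2013", "01/01/2013", "02/01/2013"),
                     Time = c("08:00", "08:30", "08:00"),
                     ColumnA = c(10, 15, 25), ColumnB = c(30, 25, 15))
table2 <- data.frame(Date = c("01/01/2013", "02/01/2013"),
                     ColumnA = c(100, 200), ColumnB = c(300, 400),
                     ColumnC = c(1, 2))

by_cols <- "Date"
# Columns present in both tables, other than the join key, are the conflicts
conflicts <- setdiff(intersect(names(table1), names(table2)), by_cols)

# Left join keeps every row of table1; conflicts get .x and .y suffixes
merged <- merge(table1, table2, by = by_cols, all.x = TRUE,
                suffixes = c(".x", ".y"))

for (col in conflicts) {
  x <- merged[[paste0(col, ".x")]]
  y <- merged[[paste0(col, ".y")]]
  y[is.na(y)] <- 0                  # rows without a match contribute nothing
  merged[[col]] <- x + y
  merged[[paste0(col, ".x")]] <- NULL
  merged[[paste0(col, ".y")]] <- NULL
}
merged <- merged[order(merged$Date, merged$Time), ]
```

The summed columns end up at the right-hand side of the result, so reorder them afterwards if the column order matters.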
A data.table solution:
library(data.table)
dt1 <- data.table(read.table(header=T, text="Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
01/01/2013 08:30 15 25
01/01/2013 09:00 20 20
02/01/2013 08:00 25 15
02/01/2013 08:30 30 10
02/01/2013 09:00 35 5"))
dt2 <- data.table(read.table(header=T, text="Date ColumnA ColumnB ColumnC
01/01/2013 100 300 1
02/01/2013 200 400 2"))
setkey(dt1, "Date")
setkey(dt2, "Date")
# Note: The ColumnC assignment has to come before the summing operations,
# otherwise it throws an error (see below)
dt1[dt2, `:=`(ColumnC = i.ColumnC, ColumnA = ColumnA + i.ColumnA,
ColumnB = ColumnB + i.ColumnB)]
# Date Time ColumnA ColumnB ColumnC
# 1: 01/01/2013 08:00 110 330 1
# 2: 01/01/2013 08:30 115 325 1
# 3: 01/01/2013 09:00 120 320 1
# 4: 02/01/2013 08:00 225 415 2
# 5: 02/01/2013 08:30 230 410 2
# 6: 02/01/2013 09:00 235 405 2
I'm not sure why placing ColumnC assignment on the right end throws this error. Perhaps MatthewDowle could explain the cause for this error.
dt1[dt2, `:=`(ColumnA = ColumnA + i.ColumnA, ColumnB = ColumnB + i.ColumnB,
ColumnC = i.ColumnC)]
Error in `[.data.table`(dt1, dt2, `:=`(ColumnA = ColumnA + i.ColumnA, :
Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'NULL'
Update from v1.8.9:
o Mixing adding new with updating existing columns into one `:=`() by group; i.e.,
DT[, `:=`(existingCol=..., newCol=...), by=...] now works without error or
segfault, #2778 and #2528. Many thanks to Arun for reporting both with reproducible examples. Tests added.
I wrote the package safejoin, which solves this very succinctly:
#devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
safe_full_join(df1,df2, by = "Date", conflict = `+`)
# Date Time ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00 110 330 1
# 2 01/01/2013 08:30 115 325 1
# 3 01/01/2013 09:00 120 320 1
# 4 02/01/2013 08:00 225 415 2
# 5 02/01/2013 08:30 230 410 2
# 6 02/01/2013 09:00 235 405 2
In case of conflict, the function + is used on pairs of conflicting columns
data
df1 <- read.table(header=T, text="Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
01/01/2013 08:30 15 25
01/01/2013 09:00 20 20
02/01/2013 08:00 25 15
02/01/2013 08:30 30 10
02/01/2013 09:00 35 5")
df2 <- read.table(header=T, text="Date ColumnA ColumnB ColumnC
01/01/2013 100 300 1
02/01/2013 200 400 2")

Resources