There are two sample datasets:
> aDT
col1 col2 ExtractDate
1: 1 A 2017-01-01
2: 1 A 2016-01-01
3: 2 B 2015-01-01
4: 2 B 2014-01-01
> bDT
col1 col2 date_pol Value
1: 1 A 2017-05-20 1
2: 1 A 2016-05-20 2
3: 1 A 2015-05-20 3
4: 2 B 2014-05-20 4
And I need:
> cDT
col1 col2 ExtractDate date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 NA NA
Basically, left join bDT onto aDT on col1, col2, and ExtractDate >= date_pol, keeping only the first match (i.e. the highest date_pol). A Cartesian join is not allowed due to memory limits.
Note:
To generate the sample datasets:
library(data.table)
library(lubridate)  # for ymd()
aDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"), ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01"))
bDT <- data.table(col1 = c(1,1,1,2), col2 = c("A","A","A","B"), date_pol = c("2017-05-20","2016-05-20","2015-05-20","2014-05-20"), Value = c(1,2,3,4))
cDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"), ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01"),
                  date_pol = c("2016-05-20","2015-05-20","2014-05-20",NA), Value = c(2,3,4,NA))
aDT[,ExtractDate := ymd(ExtractDate)]
bDT[,date_pol := ymd(date_pol)]
setorder(aDT, -ExtractDate)  # sort in place, descending
setorder(bDT, -date_pol)
I have tried:
aDT[, c("date_pol", "Value") :=
      bDT[aDT,
          .(date_pol, Value),
          on = .(date_pol <= ExtractDate,
                 col1 = col1,
                 col2 = col2),
          mult = "first"]]
But results are a bit weird:
> aDT
col1 col2 ExtractDate date_pol Value ##date_pol values not right
1: 1 A 2017-01-01 2017-01-01 2
2: 1 A 2016-01-01 2016-01-01 3
3: 2 B 2015-01-01 2015-01-01 4
4: 2 B 2014-01-01 2014-01-01 NA
From the data.table documentation: when i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val Y's. Columns of x can now be referred to using the prefix x. and is particularly useful during joining to refer to x's join columns as they are otherwise masked by i's. For example, X[Y, .(x.a-i.a, b), on="a"].
bDT[aDT, .(col1, col2, i.ExtractDate, x.date_pol, Value),
    on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2),
    mult = "first"]
output
col1 col2 i.ExtractDate x.date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 <NA> NA
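If you want the column names to match cDT exactly, you can rename them afterwards. A sketch, assuming the join result above is assigned to res (the name res is not from the original answer):
res <- bDT[aDT, .(col1, col2, i.ExtractDate, x.date_pol, Value),
           on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2),
           mult = "first"]
setnames(res, c("i.ExtractDate", "x.date_pol"), c("ExtractDate", "date_pol"))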
I like the approach you tried yourself: not explicitly mentioning the columns in your left join. This can be very helpful if you have a lot of columns on the left side of your join, so you don't have to specify them all.
The only thing you need to do is use the x. prefix:
aDT[, c("date_pol", "Value") := bDT[aDT, on = .(date_pol <= ExtractDate, col1, col2),
                                    mult = "first", .(x.date_pol, x.Value)]]
Output:
col1 col2 ExtractDate date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 <NA> NA
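As a quick check against the cDT from the question (assuming you first convert cDT's date columns, which the question's setup code does not do):
cDT[, ExtractDate := as.Date(ExtractDate)]
cDT[, date_pol := as.Date(date_pol)]
all.equal(aDT, cDT, check.attributes = FALSE)  # should be TRUE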
Related
I have two data.tables; their v2 columns are complementary:
set.seed(1234)
v1 <- sample(1:20, 5)
v2a <- c(1:2,NA,NA,NA)
v2b <- c(NA,NA,3:5)
id <- c(letters[1:5])
library(data.table)
dt1 <- data.table(id = id, v1=v1,v2=v2a)
dt2 <- data.table(id = id, v2=v2b)
dt1
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 NA
4: d 15 NA
5: e 9 NA
dt2
id v2
1: a NA
2: b NA
3: c 3
4: d 4
5: e 5
The goal is to merge the two data.tables and obtain column v2 with the proper values, without NAs.
I got it done correctly either by:
dt <- rbindlist(list(dt1,dt2), use.names = T, fill = T)
dt <- dt[,v2:= sum(v2, na.rm = T), by = id]
dt <- dt[!is.na(v1)]
or:
dt <- merge(dt1, dt2, by = "id", all = T)
dt[, v2:=sum(v2.x, v2.y, na.rm = T), by = id][, v2.x := NULL][,v2.y := NULL]
both giving the correct desired result:
dt
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
Is there an easier/one go way to do it?
The code below updates the values of dt1$v2 where is.na(dt1$v2) is TRUE with the values of dt2$v2, based on id.
dt1[is.na(v2), v2 := dt2[ dt1[is.na(v2),], v2, on = .(id)] ][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
There is another, less convoluted approach, using the fcoalesce() function, which was introduced with data.table v1.12.4 (on CRAN 3 Oct 2019):
dt1[dt2, on = .(id), v2 := fcoalesce(x.v2, i.v2)][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
dt1[dt2, on = .(id), v2 := fcoalesce(v2, i.v2)][]
works as well because
dt1[dt2, on = .(id)]
returns
id v1 v2 i.v2
1: a 16 1 NA
2: b 5 2 NA
3: c 12 NA 3
4: d 15 NA 4
5: e 9 NA 5
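For reference, fcoalesce() picks the first non-NA value position-wise across its arguments, e.g.:
library(data.table)
fcoalesce(c(1, NA, NA), c(NA, 2, NA), c(9, 9, 3))
# [1] 1 2 3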
I am searching for an efficient and fast approach to fill missing data in a table with missing dates.
library(data.table)
dt <- as.data.table(read.csv(textConnection('"date","gr1","gr2","x"
"2017-01-01","A","a",1
"2017-02-01","A","b",2
"2017-02-01","B","a",4
"2017-04-01","B","a",5
"2017-05-01","A","b",3')))
dt[,date := as.Date(date)]
Suppose that this table has all the information for x by date and groups gr1 and gr2. I want to fill the missing dates and expand this table by repeating the last known values of x by gr1 and gr2. My approach is as follows:
# define the period to expand
date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
dates <- setDT(list(ddate = seq.Date(date_min, date_max,by = 'month')))
# cast the data
dt.c <- dcast(dt, date~gr1+gr2, value.var = "x")
# fill missing dates
dt.c <- dt.c[dates, roll=Inf]
# melt the data to return to original table format
dt.m <- melt(dt.c, id.vars = "date", value.name = "x")
# split column - the slowest part of my code
dt.m[,c("gr1","gr2") := tstrsplit(variable,'_')][,variable:=NULL]
# remove unnecessary NAs
dt.m <- dt.m[complete.cases(dt.m[,x])][,.(date,gr1,gr2,x)]
setkey(dt.m)
This is the output that I expect to see:
> dt.m
date gr1 gr2 x
1: 2017-01-01 A a 1
2: 2017-02-01 A b 2
3: 2017-02-01 B a 4
4: 2017-03-01 A b 2
5: 2017-03-01 B a 4
6: 2017-04-01 B a 5
7: 2017-05-01 A b 3
8: 2017-06-01 A b 3
Now the problem is that tstrsplit is very slow on large data sets with a lot of groups.
This approach is very close to what I need, but following it I could not get the desired output, because it fills not only the missing dates but the NAs as well. This is my modification of the example:
# the desired dates by group
date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
indx <- dt[,.(date=seq(date_min,date_max,"months")),.(gr1,gr2)]
# key the tables and join them using a rolling join
setkey(dt,gr1,gr2,date)
setkey(indx,gr1,gr2,date)
dt0 <- dt[indx,roll=TRUE][,.(date,gr1,gr2,x)]
setkey(dt0,date)
And this is not the output that I expect to see:
> dt0
date gr1 gr2 x
1: 2017-01-01 A a 1
2: 2017-01-01 A b NA
3: 2017-01-01 B a NA
4: 2017-02-01 A a 1
5: 2017-02-01 A b 2
6: 2017-02-01 B a 4
7: 2017-03-01 A a 1
8: 2017-03-01 A b 2
9: 2017-03-01 B a 4
10: 2017-04-01 A a 1
11: 2017-04-01 A b 2
12: 2017-04-01 B a 5
13: 2017-05-01 A a 1
14: 2017-05-01 A b 3
15: 2017-05-01 B a 5
16: 2017-06-01 A a 1
17: 2017-06-01 A b 3
18: 2017-06-01 B a 5
What is the best (fastest) way to reproduce my output above (dt.m)?
One rolling join, one 'normal' join and some column switching, aaaand you're done :)
temp <- dates[, near.date := dt[dates, x.date, on = .(date=ddate), roll = TRUE, mult = "first"]][]
dt[temp, on = .(date = near.date)][, date := ddate][,ddate := NULL][]
# date gr1 gr2 x
# 1: 2017-01-01 A a 1
# 2: 2017-02-01 A b 2
# 3: 2017-02-01 B a 4
# 4: 2017-03-01 A b 2
# 5: 2017-03-01 B a 4
# 6: 2017-04-01 B a 5
# 7: 2017-05-01 A b 3
# 8: 2017-06-01 A b 3
You can (of course) make it a one-liner by integrating the first row into the last.
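A sketch of that one-liner (the same logic as above, with temp inlined):
dt[dates[, near.date := dt[dates, x.date, on = .(date = ddate), roll = TRUE, mult = "first"]][],
   on = .(date = near.date)][, date := ddate][, ddate := NULL][]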
I'd use IDate and an integer counter for the sequence of dates:
dt[, date := as.IDate(date)]
dates = seq(as.IDate("2017-01-01"), as.IDate("2017-06-01"), by="month")
dDT = data.table(date = dates)[, dseq := .I][]
dt[dDT, on=.(date), dseq := i.dseq]
Then enumerate all desired combos (gr1, gr2, dseq) and do a couple update joins:
cDT = CJ(dseq = dDT$dseq, gr1 = unique(dt$gr1), gr2 = unique(dt$gr2))
cDT[, x := dt[cDT, on=.(gr1, gr2, dseq), x.x]]
cDT[is.na(x), x := dt[copy(.SD), on=.(gr1, gr2, dseq), roll=1L, x.x]]
res = cDT[!is.na(x)]
res[dDT, on=.(dseq), date := i.date]
dseq gr1 gr2 x date
1: 1 A a 1 2017-01-01
2: 2 A a 1 2017-02-01
3: 2 A b 2 2017-02-01
4: 2 B a 4 2017-02-01
5: 3 A b 2 2017-03-01
6: 3 B a 4 2017-03-01
7: 4 B a 5 2017-04-01
8: 5 A b 3 2017-05-01
9: 5 B a 5 2017-05-01
10: 6 A b 3 2017-06-01
There are two extra rows here compared with what the OP expected
res[!dt.m, on=.(date, gr1, gr2)]
dseq gr1 gr2 x date
1: 2 A a 1 2017-02-01
2: 5 B a 5 2017-05-01
since I am treating each missing gr1 x gr2 value independently, rather than filling it iff the date is not in dt at all (as in the OP). To apply that rule...
drop_rows = res[!dt, on=.(gr1,gr2,date)][date %in% dt$date, .(gr1,gr2,date)]
res[!drop_rows, on=names(drop_rows)]
(The copy(.SD) is needed because of a likely bug.)
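As a quick sanity check (a sketch, assuming the dt.m built in the question is still in scope), the filtered result should contain exactly the rows of dt.m:
res2 <- res[!drop_rows, on = names(drop_rows)][, .(date = as.Date(date), gr1, gr2, x)]
fsetequal(res2, dt.m[, .(date, gr1, gr2, x)])  # should be TRUE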
dt should have an NA row for every unique date for each combination of the gr* columns, but those rows are not present. Hence, we use CJ and a join to fill those missing dates with NA for x.
After that, expand the dataset for all required ddates.
Finally, filter away the rows where x is NA and order by date to give the output the same characteristics as the original dt.
dt[, g := .GRP, .(gr1, gr2)][
  CJ(date = date, g = g, unique = TRUE), on = .(date, g)][,
  .SD[.(date = ddate), on = .(date), roll = Inf], .(g)][
  !is.na(x)][order(date)]
output:
g date gr1 gr2 x
1: 1 2017-01-01 A a 1
2: 2 2017-02-01 A b 2
3: 3 2017-02-01 B a 4
4: 2 2017-03-01 A b 2
5: 3 2017-03-01 B a 4
6: 3 2017-04-01 B a 5
7: 2 2017-05-01 A b 3
8: 2 2017-06-01 A b 3
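If the result should match the question's dt.m exactly, drop the helper column at the end. A minimal sketch, assuming the chained result above was assigned to out (the answer itself does not assign it):
out[, g := NULL][]  # drop the helper grouping column from the result
dt[, g := NULL]     # the first step also added g to dt by reference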
data:
library(data.table)
dt <- fread('date,gr1,gr2,x
2017-01-01,A,a,1
2017-02-01,A,b,2
2017-02-01,B,a,4
2017-04-01,B,a,5
2017-05-01,A,b,3')
dt[,date := as.Date(date)]
date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
ddate = seq.Date(date_min, date_max,by = 'month')
Please try it on your actual dataset.
This is a bit similar to another question, though not precisely a duplicate. The approach is similar, but with data.tables and with multiple columns. See also: Fill in missing date and fill with the data above
Here, it's unclear whether you're seeking to fill in columns gr2 and x, or what gr2 is doing. I'm assuming you're seeking to fill in gaps with dates in 1-month increments. Also, since the input data's max month is 5 (May) while the example desired output runs up to 6 (June), it's unclear how June is reached if the goal is to fill in between input dates; but if there's an external maximum, this can be set instead of the max of input dates.
library(data.table)
library(tidyr)
dt <- as.data.table(read.csv(textConnection('"date","gr1","gr2","x"
"2017-01-01","A","a",1
"2017-02-01","A","b",2
"2017-02-01","B","a",4
"2017-04-01","B","a",5
"2017-05-01","A","b",3')))
dt[,date := as.Date(date)]
setkeyv(dt,"date")
all_date_groups <- dt[,list(date=seq.Date(from=min(.SD$date),to=max(.SD$date),by="1 month")),by="gr1"]
setkeyv(all_date_groups,"date")
all_dates_dt <- dt[all_date_groups,on=c("date","gr1")]
setorderv(all_dates_dt,c("gr1","date"))
all_dates_dt <- fill(all_dates_dt,c("gr2","x"))
setorderv(all_dates_dt,c("date","gr1"))
all_dates_dt
Results:
> all_dates_dt
date gr1 gr2 x
1: 2017-01-01 A a 1
2: 2017-02-01 A b 2
3: 2017-02-01 B a 4
4: 2017-03-01 A b 2
5: 2017-03-01 B a 4
6: 2017-04-01 A b 2
7: 2017-04-01 B a 5
8: 2017-05-01 A b 3
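If there is an external maximum (June in the question's desired output), the per-group date sequence can run to that date instead of the group's own max. A sketch modifying the all_date_groups line above:
date_max <- as.Date("2017-06-01")  # external maximum, taken from the question
all_date_groups <- dt[,list(date=seq.Date(from=min(.SD$date),to=date_max,by="1 month")),by="gr1"]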
This post is related to the previous post here: match rows of two data.tables to fill subset of a data.table
Not sure how I can integrate them together.
I have a situation where, in addition to the NA condition on one column of DT1, a couple more conditions should apply for the merge, but that doesn't work.
> DT1 <- data.table(colA = c(1,1, 2,2,2,3,3), colB = c('A', NA, 'AA', 'B', NA, 'A', 'C'), timeA = c(2,4,3,4,6,1,4))
> DT1
colA colB timeA
1: 1 A 2
2: 1 <NA> 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
> DT2 <- data.table(colC = c(1,1,1,2,2,3), timeB1 = c(1,3,6, 2,4, 1), timeB2 = c(2,5,7,3,5,4), colD = c('Z', 'YY', 'AB', 'JJ', 'F', 'RR'))
> DT2
colC timeB1 timeB2 colD
1: 1 1 2 Z
2: 1 3 5 YY
3: 1 6 7 AB
4: 2 2 3 JJ
5: 2 4 5 F
6: 3 1 4 RR
Using the same guideline as mentioned above, I'd like to merge colD of DT2 into colB of DT1, only for the NA values of colB in DT1, AND use the values of colD for which timeA in DT1 is between timeB1 and timeB2 in DT2. I tried the following, but the merge doesn't happen:
> output <- DT1[DT2, on = .(colA = colC), colB := ifelse(is.na(x.colB) & i.timeB1 <= x.timeA & x.timeA <= i.timeB2, i.colD, x.colB)]
> output
colA colB timeA
1: 1 A 2
2: 1 <NA> 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
Nothing changes in output.
This is my desired output:
> desired_output
colA colB timeA
1: 1 A 2
2: 1 YY 4 --> should find a match
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6 --> shouldn't find a match
6: 3 A 1
7: 3 C 4
Why doesn't this work?
I'd like to use data.table operations only without using additional packages.
An in-place update of colB in DT1 would work as follows:
DT1[is.na(colB), colB := DT2[DT1[is.na(colB)],
on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]]
print(DT1)
colA colB timeA
1: 1 A 2
2: 1 YY 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
This indexes the rows where colB is NA and, after a join on the condition defined in on = ..., replaces the missing values with the matching values found in colD.
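To see what the inner lookup returns, run it against a fresh DT1 (i.e. before the update above); the expected result is shown in the comment:
DT2[DT1[is.na(colB)],
    on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]
# [1] "YY" NA    -- one value per NA row of DT1, in row order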
Possibly not the shortest answer, but it gets the job done. I'm no data.table expert, so I welcome improvements/suggestions.
DT1[is.na(colB), colB := DT1[is.na(colB), ][
  DT2, colB := i.colD,
  on = c("colA == colC", "timeA >= timeB1", "timeA <= timeB2")]$colB]
What it does:
first, subset DT1 for all rows where is.na(colB) is TRUE
then, update the value of colB in those rows with the colB vector from the result of a non-equi join of that same subset of rows on DT2
A bonus is that DT1 is changed by reference, so it's pretty fast and memory-efficient on large data (I think).
colA colB timeA
1: 1 A 2
2: 1 YY 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
I wish to have the factor that happened earlier as a new column.
This is my data:
df <- data.frame (id =c(1,1,2,2,1), date= c(20161002,20151019, 20160913, 20161117, 20160822), factor = c("A" , "B" ,"C" ,"D" ,"H"))
and I want to have an additional column that shows the immediately preceding factor. So, my ideal output is:
id date factor col2
1 1 20161002 A H
2 1 20151019 B NA
3 2 20160913 C NA
4 2 20161117 D C
5 1 20160822 H B
For instance, for id 1 in the first row, the previous factor happened on 20160822 and its value was H.
What I tried does not take the date order into account:
library(dplyr)
library(zoo)
df %>% mutate(col2 = na.locf(factor))
Do this:
library(data.table)
df$date = as.Date(as.character(df$date),"%Y%m%d")
setDT(df)
setorder(df,id,date)
df[, "col2" := shift(factor), by = .(id)]
id date factor col2
1: 1 2015-10-19 B NA
2: 1 2016-08-22 H B
3: 1 2016-10-02 A H
4: 2 2016-09-13 C NA
5: 2 2016-11-17 D C
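If you also want the rows back in their original order (as in the desired output in the question), one option is to remember the row index before sorting. A sketch, starting again from the original df:
library(data.table)
df <- data.frame(id = c(1, 1, 2, 2, 1),
                 date = c(20161002, 20151019, 20160913, 20161117, 20160822),
                 factor = c("A", "B", "C", "D", "H"),
                 stringsAsFactors = FALSE)
df$date <- as.Date(as.character(df$date), "%Y%m%d")
setDT(df)
df[, orig := .I]                      # remember the original row order
setorder(df, id, date)                # sort so shift() sees chronological order
df[, col2 := shift(factor), by = id]  # previous factor within each id
setorder(df, orig)[, orig := NULL][]  # restore the original order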
We can use dplyr. Convert the character date to Date format. Then sort by date using arrange and take the previous factor within each group (id) using lag.
df$date <- as.Date(as.character(df$date), "%Y%m%d")
library(dplyr)
df %>%
group_by(id) %>%
arrange(date) %>%
mutate(col2 = lag(factor))
# id date factor col2
# <dbl> <date> <fctr> <fctr>
#1 1 2015-10-19 B NA
#2 1 2016-08-22 H B
#3 2 2016-09-13 C NA
#4 1 2016-10-02 A H
#5 2 2016-11-17 D C
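A small follow-up: the pipeline returns a grouped tibble, so if you keep working with the result you may want to drop the grouping at the end (optional, not required for the lag itself):
df %>%
  group_by(id) %>%
  arrange(date) %>%
  mutate(col2 = lag(factor)) %>%
  ungroup()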
I have done a bunch of searching for a solution to this and either can't find one or don't know it when I see it. I've seen some topics that are close to this but deal with matching between two different dataframes, whereas this is dealing with a single dataframe.
I have a dataframe with two groups (factors, col1) and a sampling date (date, col2), and then the measurement (numeric, col3). I would like to eventually run a statistical test on a paired sample between group A and B, so in order to create the paired sample, I want to only keep the records that have a measurement taken on the same day for both groups. In other words, remove the records in group A that do not have a corresponding measurement taken on the same day in group B, and vice versa. In the sample data below, that would result in rows 4 and 8 being removed. Another way of thinking of it is, how do I search for and remove records with only one occurrence of each date?
Sample data:
my.df <- data.frame(col1 = as.factor(c(rep("A", 4), rep("B", 4))),
col2 = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
"2001-01-04", "2001-01-01", "2001-01-02", "2001-01-03",
"2001-02-03")),
col3 = sample(8))
Here are a few alternatives:
1) ave
> subset(my.df, ave(col3, col2, FUN = length) > 1)
col1 col2 col3
1 A 2001-01-01 3
2 A 2001-01-02 2
3 A 2001-01-03 6
5 B 2001-01-01 7
6 B 2001-01-02 4
7 B 2001-01-03 1
2) split / Filter / do.call
> do.call("rbind", Filter(function(x) nrow(x) > 1, split(my.df, my.df$col2)))
col1 col2 col3
2001-01-01.1 A 2001-01-01 3
2001-01-01.5 B 2001-01-01 7
2001-01-02.2 A 2001-01-02 2
2001-01-02.6 B 2001-01-02 4
2001-01-03.3 A 2001-01-03 6
2001-01-03.7 B 2001-01-03 1
3) dplyr (2) translates nearly directly into a dplyr solution:
> library(dplyr)
> my.df %>% group_by(col2) %>% filter(n() > 1)
Source: local data frame [6 x 3]
Groups: col2
col1 col2 col3
1 A 2001-01-01 5
2 A 2001-01-02 1
3 A 2001-01-03 7
4 B 2001-01-01 2
5 B 2001-01-02 4
6 B 2001-01-03 6
4) data.table The two preceding solutions can also be translated to data.table:
> data.table(my.df)[, if (.N > 1) .SD, by = col2]
col2 col1 col3
1: 2001-01-01 A 5
2: 2001-01-01 B 2
3: 2001-01-02 A 1
4: 2001-01-02 B 4
5: 2001-01-03 A 7
6: 2001-01-03 B 6
5) tapply
> na.omit(tapply(my.df$col3, my.df[c('col2', 'col1')], identity))
col1
col2 A B
2001-01-01 3 7
2001-01-02 2 4
2001-01-03 6 1
attr(,"na.action")
2001-02-03 2001-01-04
5 4
6) merge
> merge(subset(my.df, col1 == 'A'), subset(my.df, col1 == 'B'), by = 2)
col2 col1.x col3.x col1.y col3.y
1 2001-01-01 A 3 B 7
2 2001-01-02 A 2 B 4
3 2001-01-03 A 6 B 1
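Since the stated goal is a paired test, the wide result of (6) feeds such a test directly. A sketch (t.test is shown as one possibility; the question does not prescribe a specific test):
paired <- merge(subset(my.df, col1 == 'A'), subset(my.df, col1 == 'B'), by = 2)
t.test(paired$col3.x, paired$col3.y, paired = TRUE)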
7) sqldf (6) is similar to the following sqldf solution:
> sqldf("select * from `my.df` A join `my.df` B
+ on A.col2 = B.col2 and A.col1 = 'A' and B.col1 = 'B'")
col1 col2 col3 col1 col2 col3
1 A 2001-01-01 5 B 2001-01-01 2
2 A 2001-01-02 1 B 2001-01-02 4
3 A 2001-01-03 7 B 2001-01-03 6