merge two dataframes by nearest preceding date while aggregating - r

I am trying to match two datasets by nearest preceding date, by group.
So within a group, I would like to add the variables of a second dataset (d2) to that of the first (d1) when the date of the first is the nearest date on or before the date in the second. If two rows in the second dataset are matched with one row in the first, I would like to add the larger of the values. (There will always be at least one date in d1 less than the date in d2, by group.)
Here is an example, which hopefully makes it clearer
d1 = data.frame(id=c(1,1,1,2,2),
ref=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07" )))
d1
# id ref
# 1 1 2013-12-07
# 2 1 2014-12-07
# 3 1 2015-12-07
# 4 2 2013-11-07
# 5 2 2014-11-07
d2 = data.frame(id=c(1,1,2),
date=as.Date(c("2014-05-07","2014-12-05", "2015-11-05")),
x1 = factor(c(1,2,2), ordered = TRUE),
x2 = factor(c(2, NA ,2), ordered=TRUE))
d2
# id date x1 x2
# 1 1 2014-05-07 1 2
# 2 1 2014-12-05 2 <NA>
# 3 2 2015-11-05 2 2
With the expected outcome
output = data.frame(id=c(1,1,1,2,2),
ref=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07" )),
x1 = c(2, NA, NA, NA, 2),
x2 = c(2, NA, NA, NA, 2))
output
# id ref x1 x2
# 1 1 2013-12-07 2 2
# 2 1 2014-12-07 NA NA
# 3 1 2015-12-07 NA NA
# 4 2 2013-11-07 NA NA
# 5 2 2014-11-07 2 2
So for example, the first two observations of d2, id=1, with dates "2014-05-07","2014-12-05", are matched to the earlier date "2013-12-07" in d1. As two rows are matched to one row in d1, the highest level is selected.
I could do this in base R by looping the following calculations through
each group but I was hoping for something more efficient.
I would love to see a data.table approach (but I am limited to R v3.1 and data.table v1.9.4). Thanks
real dataset:
d1: rows 1M / 100K groups
d2: rows 11K / 4K groups
# for one group
x = d1[d1$id==1, ]
y = d2[d2$id==1, ]
id = apply(outer(x$ref, y$date, "-"), 2, which.min)
temp = cbind(y, ref=x$ref[id])
# aggregate variables by ref
temp = merge(aggregate(x1 ~ ref, data=temp, max),
aggregate(x2 ~ ref, data=temp, max)
)
merge(x, temp, all=T)
PS: I had looked at "How to match by nearest date from two data frames?" and "Join data.table on exact date or if not the case on the nearest less than date", with no success.

You can do this using dplyr:
d2$ind <- 0
library(dplyr)
out <- d1 %>% full_join(d2,by=c("id","ref"="date")) %>%
arrange(id,ref) %>%
mutate(ind=cumsum(ifelse(is.na(ind),1,ind))) %>%
group_by(ind) %>%
summarise(ref=min(ref),x1=max(x1,na.rm=TRUE),x2=max(x2,na.rm=TRUE))
### A tibble: 5 x 4
## ind ref x1 x2
## <dbl> <date> <fctr> <fctr>
##1 1 2013-12-07 2 2
##2 2 2014-12-07 NA NA
##3 3 2015-12-07 NA NA
##4 4 2013-11-07 NA NA
##5 5 2014-11-07 2 2
We first add a column of indicators to d2 and set those to zero. Then, we perform a full outer join between d1 and d2. Those rows in d1 will have ind of NA. We sort by id and ref (i.e., the date), and we replace the NA entries of ind with 1 and perform a cumsum. This results in:
id ref x1 x2 ind
1 1 2013-12-07 <NA> <NA> 1
2 1 2014-05-07 1 2 1
3 1 2014-12-05 2 <NA> 1
4 1 2014-12-07 <NA> <NA> 2
5 1 2015-12-07 <NA> <NA> 3
6 2 2013-11-07 <NA> <NA> 4
7 2 2014-11-07 <NA> <NA> 5
8 2 2015-11-05 2 2 5
From this we can easily see that we can group by ind and summarise appropriately to get your result.
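The same indicator-and-cumsum idea can be sketched in base R without dplyr. This is only an illustration under some assumptions: x1 and x2 are treated as plain numerics rather than ordered factors, every group starts with a d1 row (as the question states), and the names stacked, from_d1, block and max_or_na are made up for this sketch.

```r
d1 <- data.frame(id = c(1, 1, 1, 2, 2),
                 ref = as.Date(c("2013-12-07", "2014-12-07", "2015-12-07",
                                 "2013-11-07", "2014-11-07")))
d2 <- data.frame(id = c(1, 1, 2),
                 date = as.Date(c("2014-05-07", "2014-12-05", "2015-11-05")),
                 x1 = c(1, 2, 2),      # assumption: numeric instead of ordered factor
                 x2 = c(2, NA, 2))

# Put both tables on one timeline; from_d1 = 1 marks a row that opens a new block
stacked <- rbind(
  data.frame(id = d1$id, ref = d1$ref, x1 = NA_real_, x2 = NA_real_, from_d1 = 1),
  data.frame(id = d2$id, ref = d2$date, x1 = d2$x1, x2 = d2$x2, from_d1 = 0)
)

# Sort by group and date; on equal dates the d1 row comes first ("on or before")
stacked <- stacked[order(stacked$id, stacked$ref, -stacked$from_d1), ]

# Each d1 row increments the counter, each d2 row inherits the preceding d1 block
stacked$block <- cumsum(stacked$from_d1)

# Maximum that returns NA (not -Inf) when a block has no d2 values
max_or_na <- function(v) if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)

out <- data.frame(
  stacked[stacked$from_d1 == 1, c("id", "ref")],
  x1 = as.vector(tapply(stacked$x1, stacked$block, max_or_na)),
  x2 = as.vector(tapply(stacked$x2, stacked$block, max_or_na))
)
out
```

Because the block counter runs across the whole sorted table, this relies on each id having a d1 row before its first d2 row; otherwise d2 rows would leak into the previous group's last block.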

Related

Is there a way to subset a group if ONLY the first observation meets a criteria?

I have a data frame like so:
ID = c(1,1,1,2,2,2,3,3,3,4,4,4,4)
VAR_1 = c(2,4,6,1,7,9,4,4,3,1,7,4,0)
VAR_2 = c(NA,NA,NA,NA,NA,20190101,20190101,20190101,NA,20190101,NA,NA,NA)
df2 = data.frame(ID,VAR_1,VAR_2)
And I would like to subset from this data frame all the rows for every group (ID) ONLY if the first observation by group in VAR_2 has a value. In this simple case, the new subset should be all the rows from IDs 3 and 4.
To represent this better:
df df_subset
ID VAR_1 VAR_2 ID VAR_1 VAR_2
1 2 NA 3 4 20190101
1 4 NA 3 4 20190101
1 6 NA 3 3 NA
2 1 NA 4 1 20190101
2 7 NA 4 7 NA
2 9 20190101 4 4 NA
3 4 20190101 4 0 NA
3 4 20190101
3 3 NA
4 1 20190101
4 7 NA
4 4 NA
4 0 NA
I managed to do this in several steps (I subset the original taking only the first observation by group, assign VAR_1 a special value, re-merge and then finally filter by the special value), but I would like to know if there's a simpler, more elegant and (probably) more efficient way. I don't need VAR_1, so that can be changed if needed to provide a faster solution.
Any help would be appreciated.
Using dplyr, we can group_by ID and select groups only if first value in each group is non-NA.
library(dplyr)
df2 %>%
group_by(ID) %>%
filter(!is.na(VAR_2[1L]))
# ID VAR_1 VAR_2
# <dbl> <dbl> <dbl>
#1 3 4 20190101
#2 3 4 20190101
#3 3 3 NA
#4 4 1 20190101
#5 4 7 NA
#6 4 4 NA
#7 4 0 NA
Some variations to extract the first value could be (thanks to @tmfmnk)
df2 %>% group_by(ID) %>% filter(!is.na(first(VAR_2)))
OR
df2 %>% group_by(ID) %>% filter(!is.na(nth(VAR_2, 1)))
Same using base R ave
df2[with(df2, ave(!is.na(VAR_2), ID, FUN = function(x) x[1L])), ]
or a slightly more complicated one with split and subset
subset(df2, ID %in% names(na.omit(sapply(split(df2$VAR_2, df2$ID), head, 1))))
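Another base R route, sketched here as an alternative rather than as any of the answerers' methods, is to pick the first row of each ID with duplicated() and keep the IDs whose first VAR_2 is not NA. The names first_rows, keep_ids and df_subset are only illustrative.

```r
# Sample data from the question
df2 <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  VAR_1 = c(2, 4, 6, 1, 7, 9, 4, 4, 3, 1, 7, 4, 0),
  VAR_2 = c(NA, NA, NA, NA, NA, 20190101, 20190101, 20190101, NA,
            20190101, NA, NA, NA)
)

# First row of every ID (duplicated() is FALSE at the first occurrence)
first_rows <- df2[!duplicated(df2$ID), ]

# IDs whose first observation has a non-missing VAR_2
keep_ids <- first_rows$ID[!is.na(first_rows$VAR_2)]

# All rows belonging to those IDs
df_subset <- df2[df2$ID %in% keep_ids, ]
df_subset
```

This assumes the rows are already in the order that defines "first" within each ID; sort beforehand if that is not the case.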

Expand dataframe in R by columns having different ID values

I have the following data frame in R
df1 <- data.frame(
"ID" = c("A", "B", "A", "B"),
"Value" = c(1, 2, 5, 5),
"freq" = c(1, 3, 5, 3)
)
I wish to obtain the following data frame
Value freq ID
1 1 A
2 NA A
3 NA A
4 NA A
5 1 A
1 NA B
2 2 B
3 NA B
4 NA B
5 5 B
I have tried the following code
library(tidyverse)
df_new <- bind_cols(df1 %>%
select(Value, freq, ID) %>%
complete(., expand(.,
Value = min(df1$Value):max(df1$Value))),)
I am getting the following output
Value freq ID
<dbl> <dbl> <fct>
1 1 A
2 3 B
3 NA NA
4 NA NA
5 5 A
5 3 B
I request someone to help me.
Using tidyr::full_seq we can find the full sequence of Value, but nesting(full_seq(Value,1)) will return an error:
Error: by can't contain join column full_seq(Value, 1) which is missing from RHS
so we need to give the expression a name, hence nesting(Value=full_seq(Value,1))
library(tidyr)
df1 %>% complete(ID, nesting(Value=full_seq(Value,1)))
# A tibble: 10 x 3
ID Value freq
<fct> <dbl> <dbl>
1 A 1. 1.
2 A 2. NA
3 A 3. NA
4 A 4. NA
5 A 5. 5.
6 B 1. NA
7 B 2. 3.
8 B 3. NA
9 B 4. NA
10 B 5. 3.
Using data.table:
library(data.table)
setDT(df1)
setkey(df1, ID, Value)
df1[CJ(ID = c("A", "B"), Value = 1:5)]
ID Value freq
1: A 1 1
2: A 2 NA
3: A 3 NA
4: A 4 NA
5: A 5 5
6: B 1 NA
7: B 2 3
8: B 3 NA
9: B 4 NA
10: B 5 3
Would the following approach work for you?
library(dplyr) # needed for %>%, left_join() and arrange()
with(data = df1,
expr = {
data.frame(Value = rep(wrapr::seqi(min(Value), max(Value)), length(unique(ID))),
ID = unique(ID))
}) %>%
left_join(y = df1,
by = c("ID" = "ID", "Value" = "Value")) %>%
arrange(ID, Value)
Results
Value ID freq
1 1 A 1
2 2 A NA
3 3 A NA
4 4 A NA
5 5 A 5
6 1 B NA
7 2 B 3
8 3 B NA
9 4 B NA
10 5 B 3
Comments
If I'm following your example correctly, Value takes values from 1 to 5 within each ID group. If this is the case, my approach would be to generate every ID-Value combination from the unique values in the original data frame.
The only variable that is carried over from the original data frame is freq, which may or may not be available for a given ID-Value pair. I would join that variable via left_join (as you seem to like tidyverse).
In your example, you have freq variable with values 1,3,5 but then in the example you list 1,2,5? In my example, I took original freq and left join it. You can modify it further using normal dplyr pipeline, if this is something you intended to do.
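The grid-then-join idea shared by the answers above can also be sketched in base R with expand.grid() and merge(). This is a minimal sketch, assuming ID is kept as character and that Value should run over every integer between its minimum and maximum; grid and out are illustrative names.

```r
df1 <- data.frame(
  ID    = c("A", "B", "A", "B"),
  Value = c(1, 2, 5, 5),
  freq  = c(1, 3, 5, 3),
  stringsAsFactors = FALSE
)

# Every ID paired with every Value between the observed minimum and maximum
grid <- expand.grid(
  Value = seq(min(df1$Value), max(df1$Value)),
  ID    = unique(df1$ID),
  stringsAsFactors = FALSE
)

# Left join: keep all grid rows, fill freq with NA where df1 has no match
out <- merge(grid, df1, by = c("ID", "Value"), all.x = TRUE)
out <- out[order(out$ID, out$Value), ]
out
```

Unlike recycling unique(ID) against the repeated Value vector, expand.grid() builds the full cross product explicitly, so it does not depend on the number of IDs and the number of Values happening to interleave.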

Element of vector to different columns of data frame

I have a df:
group number id
1 A abcd 1
2 A abcd 2
3 A abcd 3
4 A efgh 4
5 A efgh 5
6 B abcd 1
7 B abcd 2
8 B abcd 3
9 B abcd 9
10 B ijkl 10
I want to make it like this:
group number data1 data2 data3 data4 Length
1 A abcd 1 2 3 3
2 A efgh 4 5 2
3 B abcd 1 2 3 9 4
4 B ijkl 10 1
I am sorry I can only make it to df2 like this:
group number data Length
1 A abcd c(1,2,3) 3
2 A efgh c(4,5) 2
3 B abcd c(1,2,3,9) 4
4 B ijkl 10 1
My code is here:
library(tidyverse)
df <- data.frame (group = c(rep('A',5),rep("B",5)),
number = c(rep('abcd',3),rep('efgh',2),rep('abcd',4),rep('ijkl',1)),
id = c(1,2,3,4,5,1,2,3,9,10))
df2 <- df %>%
group_by(group,number) %>%
nest() %>%
mutate(data=map(data,~unlist(.x, recursive = TRUE, use.names = FALSE)),
Length= map(data, ~length(.x)))
Please feel free to start with df or df2, with(out) any package is fine.
You can rename count to Length. (I also prefer to fill the empty cells with NA; if you want blanks instead, use df2[is.na(df2)]=''.)
Option 1
df <- data.frame (group = c(rep('A',5),rep("B",5)),
number = c(rep('abcd',3),rep('efgh',2),rep('abcd',4),rep('ijkl',1)),
id = c(1,2,3,4,5,1,2,3,9,10))
df2 <- df %>%
group_by(group,number) %>%
summarise(data=toString(id),count=n()) # summarise (not mutate) gives one row per group, so column 3 is data
library(splitstackshape)
cSplit(df2, 3, drop = TRUE,sep=',')
group number count data_1 data_2 data_3 data_4
1: A abcd 3 1 2 3 NA
2: A efgh 2 4 5 NA NA
3: B abcd 4 1 2 3 9
4: B ijkl 1 10 NA NA NA
Option 2
library(dplyr)
library(tidyr)
df2 <- df %>%
group_by(group,number) %>%
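summarise(data=toString(id),count=n()) %>%
separate_rows(data) %>%
# note: the data are grouped by `group` only at this point, so 1:n() counts rows within each group (hence data5 below)
mutate(Col = paste0("data", 1:n())) %>%
spread(Col, data)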
df2
# A tibble: 4 x 8
# Groups: group [2]
group number count data1 data2 data3 data4 data5
* <fctr> <fctr> <int> <chr> <chr> <chr> <chr> <chr>
1 A abcd 3 1 2 3 <NA> <NA>
2 A efgh 2 <NA> <NA> <NA> 4 5
3 B abcd 4 1 2 3 9 <NA>
4 B ijkl 1 <NA> <NA> <NA> <NA> 10
I haven't been able to test this, but it should work or be close:
library(tidyverse)
df %>%
group_by(group,number) %>%
mutate(key = paste0("data",row_number()),length = n()) %>%
ungroup %>%
spread(key,id,"")
To make it work from your nested data, I think you would have to turn these vectors into one-row data frames with the same number of columns and the same column names, then use unnest, which is much more complicated! :)
In base R
temp = split(df, paste(df$group, df$number))
columns = max(sapply(temp, NROW))
do.call(rbind, lapply(temp, function(a)
cbind(group = a$group[1],
number = a$number[1],
setNames(data.frame(t(a$id[1:columns])), paste0("data", 1:columns)),
length = length(a$id))
))
# group number data1 data2 data3 data4 length
#A abcd A abcd 1 2 3 NA 3
#A efgh A efgh 4 5 NA NA 2
#B abcd B abcd 1 2 3 9 4
#B ijkl B ijkl 10 NA NA NA 1
Here is an option using data.table
library(data.table)
dcast(setDT(df), group + number~ paste0("data", rowid(group, number)),
value.var = 'id', fill = 0)[,
length := Reduce(`+`, lapply(.SD, `>`, 0)), .SDcols = data1:data4][]
# group number data1 data2 data3 data4 length
#1: A abcd 1 2 3 0 3
#2: A efgh 4 5 0 0 2
#3: B abcd 1 2 3 9 4
#4: B ijkl 10 0 0 0 1
This is a variation of akrun's data.table answer which does compute Length before reshaping from long to wide format and uses the prefix parameter in the call to rowid():
library(data.table)
data.table(df)[, Length := .N, by = .(group, number)][
, dcast(.SD, group + number + Length ~ rowid(group, number, prefix = "data"),
value.var = "id")]
group number Length data1 data2 data3 data4
1: A abcd 3 1 2 3 NA
2: A efgh 2 4 5 NA NA
3: B abcd 4 1 2 3 9
4: B ijkl 1 10 NA NA NA
For pretty printing, the NA values can be converted into white space:
data.table(df)[, Length := .N, by = .(group, number)][
, dcast(.SD, group + number + Length ~ rowid(group, number, prefix = "data"),
as.character, value.var = "id", fill = "")]
group number Length data1 data2 data3 data4
1: A abcd 3 1 2 3
2: A efgh 2 4 5
3: B abcd 4 1 2 3 9
4: B ijkl 1 10
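For comparison, the number-the-rows-then-widen pattern used by the answers above can be sketched with base R's ave() and reshape(). This is an illustrative sketch, not one of the posted answers; the helper columns key and k are made-up names, and the wide columns come out as id1 to id4 rather than data1 to data4.

```r
df <- data.frame(
  group  = c(rep("A", 5), rep("B", 5)),
  number = c(rep("abcd", 3), rep("efgh", 2), rep("abcd", 4), "ijkl"),
  id     = c(1, 2, 3, 4, 5, 1, 2, 3, 9, 10),
  stringsAsFactors = FALSE
)

# One key per (group, number) pair
key <- paste(df$group, df$number)

# Position of each row within its pair, and the size of the pair
df$k      <- ave(df$id, key, FUN = seq_along)
df$Length <- ave(df$id, key, FUN = length)

# Long to wide: one column per position, NA where a pair has fewer rows
wide <- reshape(df,
                idvar     = c("group", "number", "Length"),
                timevar   = "k",
                v.names   = "id",
                direction = "wide",
                sep       = "")
wide
```

Computing Length before reshaping, as in the last data.table answer, avoids having to count non-missing cells afterwards.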

Changing dcast to show multiple columns

I have the following situation. Consider the following df:
mymatrix <- as.data.frame(matrix(data = 0, nrow = 7, ncol = 4))
colnames(mymatrix) <- c("Patient", "marker", "Number", "Visit")
mymatrix[,1] <- c("B1","B1","C1","C1","D1","D1","D1")
mymatrix[,2] <- c("A","A","A","A","A","A","A")
mymatrix[,3] <- c(1,0,0,15,1,2,13)
mymatrix[,4] <- c("baseline","followup","baseline","followup","baseline","followup","followup")
> mymatrix
Patient marker Number Visit
1 B1 A 1 baseline
2 B1 A 0 followup
3 C1 A 0 baseline
4 C1 A 15 followup
5 D1 A 1 baseline
6 D1 A 2 followup
7 D1 A 13 followup
If I do dcast on the first 6 rows I get:
> dcast(mymatrix[1:6,], Patient +marker~Visit, value.var = "Number")
Patient marker baseline followup
1 B1 A 1 0
2 C1 A 0 15
3 D1 A 1 2
If I do dcast on all the rows I get:
> dcast(mymatrix, Patient +marker~Visit, value.var = "Number")
Aggregation function missing: defaulting to length
Patient marker baseline followup
1 B1 A 1 1
2 C1 A 1 1
3 D1 A 1 2
Is there a way instead of defaulting to length it would add a second followup column? So the data would show as follows:
Patient marker baseline followup.1 followup.2
1 B1 A 1 0 NA
2 C1 A 0 15 NA
3 D1 A 1 2 13
Thanks!
It's not clear what you are asking, because it seems like you want to combine two different functions in dcast at the same time. It seems to me that you want to improve your first output instead of the second. If so, a simple solution would be to add an automatic index to the values in the Visit column and then dcast. Here's a simple approach using the data.table package (though the output is not exactly what you want because I've also added an index to baseline, but it can get you started):
library(data.table)
setDT(mymatrix)[, Visit := paste(Visit, seq_len(.N), sep = "."), list(Patient, Visit)]
dcast.data.table(mymatrix, Patient + marker ~ Visit, value.var = "Number")
# Patient marker baseline.1 followup.1 followup.2
# 1: B1 A 1 0 NA
# 2: C1 A 0 15 NA
# 3: D1 A 1 2 13
You could also use base R
d1 <- transform(mymatrix, Visit=paste0(Visit,ave(seq_along(Number),
Patient, Visit, FUN=seq_along)) )
reshape(d1, idvar=c('Patient', 'marker'), timevar='Visit', direction='wide')
# Patient marker Number.baseline1 Number.followup1 Number.followup2
#1 B1 A 1 0 NA
#3 C1 A 0 15 NA
#5 D1 A 1 2 13
Or dplyr/tidyr
library(dplyr)
library(tidyr)
mymatrix %>%
group_by(Patient, Visit) %>%
mutate(indx=row_number()) %>%
ungroup() %>%
unite(Visit1, Visit, indx) %>%
spread(Visit1, Number)
# Patient marker baseline_1 followup_1 followup_2
#1 B1 A 1 0 NA
#2 C1 A 0 15 NA
#3 D1 A 1 2 13
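All three answers rely on the same step: numbering repeated visits within each patient so that every (Patient, Visit) combination becomes unique before reshaping. A minimal base R sketch of just that step, using the question's data (idx and visit_label are illustrative names):

```r
Patient <- c("B1", "B1", "C1", "C1", "D1", "D1", "D1")
Visit   <- c("baseline", "followup", "baseline", "followup",
             "baseline", "followup", "followup")

# Running counter within each Patient/Visit combination
idx <- ave(seq_along(Visit), Patient, Visit, FUN = seq_along)

# Unique column labels for the wide format
visit_label <- paste(Visit, idx, sep = ".")
visit_label
```

Only D1's second followup gets the suffix 2, which is what lets the reshape produce a separate followup.2 column instead of falling back to length().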

Merging two data frames according to row values

I have two data frames, each with the same two columns: county codes and frequencies. They aren't identical, but some of the county code values show up in both data frames. Like this:
"county_code","freq"
"01011",2
"01051",1
"01073",9
"01077",1
"county_code","freq"
"01011",4
"01056",2
"01073",1
"01088",6
I want to merge them into a new data frame, such that if a county code appears in both data frames, their respective frequencies are added together. If the county code just appears in one or the other of the data frames, I want to add it (and its frequency) to the new data frame unchanged. The result should look like this:
"county_code","freq"
"01011",6
"01051",1
"01056",2
"01073",10
"01077",1
"01088",6
The result doesn't have to be ordered. I tried to use reshape for this, but I wasn't sure that was the right approach. Thoughts?
Combine the two data frames with rbind, then use aggregate to collapse multiple rows with the same county_code:
aggregate(freq~county_code, rbind(d1, d2) , FUN=sum)
## county_code freq
## 1 1011 6
## 2 1051 1
## 3 1073 10
## 4 1077 1
## 5 1056 2
## 6 1088 6
(Using the definitions in MrFlick's answer.)
Using base functions, you can do a merge() then transform(). Here are your sample input data.frames:
d1 <- data.frame(
county_code = c("1011", "1051", "1073", "1077"),
freq = c(2L, 1L, 9L, 1L)
)
d2 <- data.frame(
county_code = c("1011", "1056", "1073", "1088"),
freq = c(4L, 2L, 1L, 6L)
)
then you would just do
transform(merge(d1, d2, by="county_code", all=TRUE),
freq = rowSums(cbind(freq.x, freq.y), na.rm=TRUE),
freq.x = NULL, freq.y = NULL
)
to get
county_code freq
1 1011 6
2 1051 1
3 1056 2
4 1073 10
5 1077 1
6 1088 6
Here is one way. I used rbind(), merge() and dplyr.
# sample data
country <- c("01011", "01051", "01073", "01077")
value <- c(2,1,9,1)
foo <- data.frame(country, value, stringsAsFactors=F)
country <- c("01011","01056","01073","01088")
value <- c(4,2,1,6)
foo2 <- data.frame(country, value, stringsAsFactors=F)
library(dplyr)
# rbind_list() has been removed from current dplyr; bind_rows() is its replacement
ana <- bind_rows(foo, foo2) %>%
group_by(country) %>%
summarize(count = sum(value))
ana
country count
1 01011 6
2 01051 1
3 01056 2
4 01073 10
5 01077 1
6 01088 6
The other idea I had was the following.
ana2 <- merge(foo, foo2, all = TRUE, by = "country")
country value.x value.y
1 01011 2 4
2 01051 1 NA
3 01056 NA 2
4 01073 9 1
5 01077 1 NA
6 01088 NA 6
bob2 <- ana2 %>%
rowwise() %>%
mutate(count = sum(value.x,value.y, na.rm = TRUE))
country value.x value.y count
1 01011 2 4 6
2 01051 1 NA 1
3 01056 NA 2 2
4 01073 9 1 10
5 01077 1 NA 1
6 01088 NA 6 6
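The rbind-then-aggregate approach can be run on the question's original codes if county_code is kept as character, which preserves the leading zeros that the "1011"-style sample data dropped. A minimal sketch under that assumption (combined and freq_by are illustrative names):

```r
d1 <- data.frame(county_code = c("01011", "01051", "01073", "01077"),
                 freq = c(2L, 1L, 9L, 1L),
                 stringsAsFactors = FALSE)
d2 <- data.frame(county_code = c("01011", "01056", "01073", "01088"),
                 freq = c(4L, 2L, 1L, 6L),
                 stringsAsFactors = FALSE)

# Stack the two tables, then sum freq within each county code
combined <- aggregate(freq ~ county_code, data = rbind(d1, d2), FUN = sum)
combined

# Named lookup of totals by code, convenient for spot checks
freq_by <- setNames(combined$freq, as.character(combined$county_code))
freq_by[["01073"]]
```

Codes present in only one table pass through unchanged because their group contains a single row.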
