Lookup value based on multiple columns in data frame - r

I am trying to figure out how to lookup a value using multiple columns. Just can't seem to get it to work correctly. Here is an example:
df1 <- data.frame(g1 = c("a", "b", "c", "c"), g2 = c(1, 2, 3, 4))
df2 <- data.frame(g.1 = c("a", "b", "c"), g.2 = c(1, 2, 4), val = c(100, 200, 300))
So I tried to do:
df1$value <- df2[match(df1$g1, df2$g.1) & match(df1$g2, df2$g.2),]$val
But this doesn't work for the last value, and I am guessing it only works for the first two by accident. I would like df1 to look like:
g1 g2 value
1 a 1 100
2 b 2 200
3 c 3 NA
4 c 4 300
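Why the attempt misbehaves: match() returns integer positions, and & coerces those positions to logical values, so the row index ends up as TRUE/NA rather than row numbers. A small sketch with the question's data (the names m1, m2 and idx are only for illustration):

```r
df1 <- data.frame(g1 = c("a", "b", "c", "c"), g2 = c(1, 2, 3, 4))
df2 <- data.frame(g.1 = c("a", "b", "c"), g.2 = c(1, 2, 4), val = c(100, 200, 300))

m1 <- match(df1$g1, df2$g.1)  # positions of g1 in g.1: 1 2 3 3
m2 <- match(df1$g2, df2$g.2)  # positions of g2 in g.2: 1 2 NA 3

# `&` turns every non-zero position into TRUE, so the positions are lost
idx <- m1 & m2
idx
# [1]  TRUE  TRUE    NA  TRUE
```

Used as a row index, a logical vector selects rows by its own position, not by the matched position, which is why the first two rows only appear to work.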

Try a left join using merge:
merge(df1, df2, by = 1:2, all.x = TRUE)
giving:
g1 g2 val
1 a 1 100
2 b 2 200
3 c 3 NA
4 c 4 300
Some alternatives are:
transform(df1, val = df2$val[match(paste(g1, g2), paste(df2$g.1, df2$g.2))])
library(sqldf)
sqldf("select df1.*, df2.val
from df1 left join df2 on g1 = [g.1] and g2 = [g.2]")
library(dplyr)
df1 %>% left_join(df2, by = c(g1 = "g.1", g2 = "g.2"))

A join would be better, and with data.table it becomes more efficient because we are updating by reference:
library(data.table)
setDT(df1)[df2, value := val, on = .(g1 = g.1, g2 = g.2)]
df1
# g1 g2 value
#1: a 1 100
#2: b 2 200
#3: c 3 NA
#4: c 4 300
With match, one approach is to paste the key columns together and then use a single index to assign the values:
p1 <- paste(df1$g1, df1$g2) # key columns only, so an existing value column is ignored
p2 <- do.call(paste, df2[1:2])
i1 <- match(p1, p2, nomatch = 0)
i2 <- match(p2, p1, nomatch = 0)
df1$value[i2] <- df2$val[i1]
df1
# g1 g2 value
#1 a 1 100
#2 b 2 200
#3 c 3 NA
#4 c 4 300

I figured out what I was doing wrong based on @G. Grothendieck's answer. All I had to do was:
df1$value <- df2[match(paste0(df1$g1,df1$g2), paste0(df2$g.1,df2$g.2)),]$val
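One caveat with paste0(): without a separator, different key pairs can produce the same string (for example "a1" with 1 and "a" with 11 both give "a11"). A minimal sketch of a safer composite key, assuming a separator that never occurs in the data (the "\r" separator and the names key1 and key2 are illustrative choices, not part of the answers above):

```r
df1 <- data.frame(g1 = c("a", "b", "c", "c"), g2 = c(1, 2, 3, 4))
df2 <- data.frame(g.1 = c("a", "b", "c"), g.2 = c(1, 2, 4), val = c(100, 200, 300))

# Without a separator, two different key pairs can collide
paste0("a1", 1) == paste0("a", 11)
# [1] TRUE

# One composite key per row; the separator keeps the two parts distinct
key1 <- paste(df1$g1, df1$g2, sep = "\r")
key2 <- paste(df2$g.1, df2$g.2, sep = "\r")

# match() gives the row of df2 for each row of df1 (NA when there is none)
df1$value <- df2$val[match(key1, key2)]
df1
```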

Related

How to join data to only the first matching row with {data.table} in R

I have a look-up table of "firsts" in column d. For example, the first time the patient was admitted because of a specific disease. I would like to join this back into the main data frame via data.table on multiple other conditions.
My problem is that, unfortunately, the main data.table could have multiple records with identical joining criteria that results in multiple "firsts" per patient after the join. Real world data is messy, people!
Is it possible to do a {data.table} join on only the first matching record?
This is similar to this question, but the multiple matches are in the main data table. I think that mult only works when there are several entries in the table being joined in.
reprex:
library(data.table)
set.seed(1724)
d1 <- data.table(a = c(1, 1, 1),
b = c(1, 1, 2),
c = sample(1:10, 3))
d2 <- data.table(a = 1, b = 1, d = TRUE)
d2[d1, on = c("a", "b")]
a b d c
1: 1 1 TRUE 4
2: 1 1 TRUE 8
3: 1 2 NA 2
desired output
a b d c
1: 1 1 TRUE 4
2: 1 1 NA 8
3: 1 2 NA 2
library(data.table)
set.seed(1724)
d1 = data.table(a = c(1, 1, 1), b = c(1, 1, 2), c = sample(1:10, 3))
d2 = data.table(a = 1, b = 1, d = TRUE)
d1[, i1:=seq_len(.N), by=c("a","b")]
d2[, i2:=seq_len(.N), by=c("a","b")]
d2[d1, on = c("a","b","i2==i1")][, "i2":=NULL][]
# a b d c
# <num> <num> <lgcl> <int>
#1: 1 1 TRUE 4
#2: 1 1 NA 8
#3: 1 2 NA 2
One way would be to turn the values to NA after join.
library(data.table)
d3 <- d2[d1, on = c("a", "b")]
d3[, d:= replace(d, seq_len(.N) != 1, NA), .(a, b)]
d3
# a b d c
#1: 1 1 TRUE 4
#2: 1 1 NA 8
#3: 1 2 NA 2
The easy solution would be to index every row and join on this also (the d2 is a filtered version of d1):
library(data.table)
set.seed(1724)
d1 <- data.table(a = c(1, 1, 1),
b = c(1, 1, 2),
c = sample(1:10, 3))
d1[, rid := seq_len(.N)]
d2 <- d1[, .SD[1], by = c("a"), .SDcols = c("b", "rid")][, d := TRUE] # UPDATE
d2[d1, on = c("a", "b", "rid")]
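For comparison, the "only the first matching row" rule can also be sketched in base R. This assumes plain data frames instead of data.tables, hard-codes the c values from the output shown in the question, and uses illustrative names (is_first, idx); it is not a data.table idiom.

```r
# Plain data frames standing in for the question's d1 and d2
d1 <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2), c = c(4, 8, 2))
d2 <- data.frame(a = 1, b = 1, d = TRUE)

# TRUE only for the first row of each (a, b) combination in d1
is_first <- !duplicated(d1[c("a", "b")])

# Position of each d1 row's key in d2 (NA when there is no match)
idx <- match(paste(d1$a, d1$b, sep = "\r"), paste(d2$a, d2$b, sep = "\r"))

# Take d from the lookup table only on first occurrences
d1$d <- ifelse(is_first, d2$d[idx], NA)
d1
```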

Create a null column with given name, if missing from one dataset

I have 5 data sets, each containing some columns. The data sets share some column names, but not every column is present in every data set. Whenever a column name that appears in at least one data set is missing from another, I want to add a column of all zeros with that name to the data set that lacks it, so that all the data sets end up with the same number of columns (and the same column names).
Put the dataframes in a list, collect the unique column names across all of them, and add each missing column to each dataframe, filled with 0.
all_names <- unique(unlist(sapply(list_df, names)))
lst1 <- lapply(list_df, function(x) {x[setdiff(all_names, names(x))] <- 0;x})
lst1
#[[1]]
# a b c
#1 1 6 0
#2 2 7 0
#3 3 8 0
#4 4 9 0
#5 5 10 0
#[[2]]
# a c b
#1 1 6 0
#2 2 7 0
#3 3 8 0
#4 4 9 0
#5 5 10 0
#[[3]]
# a c b
#1 1 6 11
#2 2 7 12
#3 3 8 13
#4 4 9 14
#5 5 10 15
If you need separate dataframes you can use lst1[[1]], lst1[[2]] individually again.
data
df1 <- data.frame(a = 1:5, b = 6:10)
df2 <- data.frame(a = 1:5, c = 6:10)
df3 <- data.frame(a = 1:5, c = 6:10, b = 11:15)
list_df <- list(df1, df2, df3)
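If the columns should also come out in the same order in every dataframe, the same idea can be wrapped in a small helper. This is a sketch only: fill_missing_cols and its fill argument are hypothetical names, not part of the answer above.

```r
df1 <- data.frame(a = 1:5, b = 6:10)
df2 <- data.frame(a = 1:5, c = 6:10)
df3 <- data.frame(a = 1:5, c = 6:10, b = 11:15)

# Hypothetical helper: add absent columns filled with `fill`,
# then reorder every dataframe to the same column order
fill_missing_cols <- function(dfs, fill = 0) {
  all_names <- Reduce(union, lapply(dfs, names))
  lapply(dfs, function(x) {
    x[setdiff(all_names, names(x))] <- fill
    x[all_names]
  })
}

out <- fill_missing_cols(list(df1, df2, df3))
lapply(out, names)
```

Because every element now has identical column names in identical order, the result can be passed straight to do.call(rbind, out) if a single dataframe is wanted.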
We can use a for loop to do this
un1 <- Reduce(union, lapply(lst1, names))
for(i in seq_along(lst1)) lst1[[i]][setdiff(un1, names(lst1[[i]]))] <- 0
data
lst1 <- list(
  structure(list(a = 1:5, b = 6:10, c = c(0, 0, 0, 0, 0)),
            row.names = c(NA, -5L), class = "data.frame"),
  structure(list(a = 1:5, c = 6:10, b = c(0, 0, 0, 0, 0)),
            row.names = c(NA, -5L), class = "data.frame"),
  structure(list(a = 1:5, c = 6:10, b = 11:15),
            class = "data.frame", row.names = c(NA, -5L)))
I would use dplyr's bind_rows, which automatically fills missing values with NA. If you include .id = "df_id" a column will be added connecting each row to the original dataframe:
library(dplyr)
bind_rows(df1, df2, df3, .id = "df_id")
#### OUTPUT ####
df_id x y z
1 1 1 2 NA
2 2 3 NA 4
3 3 NA 5 6
If you want 0s instead of NAs, just run df[is.na(df)] <- 0. If you want a more informative df_id column, you can pass in a named list:
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = "df_id")
#### OUTPUT ####
df_id x y z
1 df1 1 2 NA
2 df2 3 NA 4
3 df3 NA 5 6
If you want your dataframes separate then simply split by df_id, which generates a list of dataframes:
df <- bind_rows(df1, df2, df3, .id = "df_id")
split(df, df$df_id)
#### OUTPUT ####
$`1`
df_id x y z
1 1 1 2 NA
$`2`
df_id x y z
2 2 3 NA 4
$`3`
df_id x y z
3 3 NA 5 6
Data:
df1 <- data.frame(x = 1, y = 2)
df2 <- data.frame(x = 3, z = 4)
df3 <- data.frame(y = 5, z = 6)
In addition to the previous answers, you can use the bind_rows function in order to quickly combine all your data frames, which will take care of differences in column names:
library(dplyr)
x <- data.frame(
a = 1:3,
b = 4:6
)
y <- data.frame(
a = 4:7
)
z <- data.frame(
c = 8:10
)
xyz <- bind_rows(x, y, z)
xyz %>% replace(., is.na(.), 0)

Replace values of multiple columns from one dataframe using another dataframe with conditions

Hi, I have two data frames as follows:
df1:
ID x y z
1 a b c
2 a b c
3 a b c
4 a b c
and df2:
ID x y
2 d NA
3 NA e
and I am after a result like this:
df1:
ID x y z
1 a b c
2 d b c
3 a e c
4 a b c
I have been trying to use the match function as suggested by some other posts, but I keep running into the issue of my df1 values being replaced with NA values from df2.
This is the code I have been using without luck
for (i in names(df2)[2:length(names(df2))]) {
df1[i] <- df2[match(df1$ID, df2$ID)]
}
Thanks
Your code didn't work for me, so I changed it a little and now it works. If you are reading data from an external file, use stringsAsFactors = FALSE when you read it so you don't run into problems.
df1 = data.frame("ID" = 1:4,"x" = rep("a",4), "y" =rep("b",4),"z" = rep("c",4),
stringsAsFactors=FALSE)
df2 = data.frame("ID" = 2:3,"x" = c("d",NA), "y" = c(NA,"e"),stringsAsFactors=FALSE)
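for(i in 1:nrow(df2)){
# keep only the non-NA cells of row i (also safe when the row has no NA at all)
keep = !vapply(df2[i,], is.na, logical(1))
new_data = df2[i, keep, drop = FALSE]
# locate the target row by ID rather than assuming ID equals the row number
pos = match(new_data$ID, df1$ID)
col_replace = intersect(colnames(new_data),colnames(df1))
df1[pos,col_replace] = new_data[col_replace]
}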
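The same update can also be sketched in base R by looping over the shared columns instead of over rows. This assumes ID is unique in df1 and that NA in df2 means "leave the value alone"; the names rows and ok are illustrative.

```r
df1 <- data.frame(ID = 1:4, x = "a", y = "b", z = "c", stringsAsFactors = FALSE)
df2 <- data.frame(ID = 2:3, x = c("d", NA), y = c(NA, "e"), stringsAsFactors = FALSE)

# Row of df1 that each df2 row refers to (NA if the ID is absent from df1)
rows <- match(df2$ID, df1$ID)

# For every shared non-key column, overwrite only where df2 has a real value
for (col in setdiff(intersect(names(df1), names(df2)), "ID")) {
  ok <- !is.na(df2[[col]]) & !is.na(rows)
  df1[[col]][rows[ok]] <- df2[[col]][ok]
}
df1
```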
A solution using dplyr and tidyr. The idea is to convert both data frames to long format, join them and replace the values, and then convert back to wide format. df5 is the final output.
library(dplyr)
library(tidyr)
df3 <- df1 %>% gather(Col, Value, -ID)
df4 <- df2 %>% gather(Col, Value, -ID, na.rm = TRUE)
df5 <- df3 %>%
left_join(df4, by = c("ID", "Col")) %>%
mutate(Value.x = ifelse(!is.na(Value.y), Value.y, Value.x)) %>%
select(ID, Col, Value.x) %>%
spread(Col, Value.x)
df5
# ID x y z
# 1 1 a b c
# 2 2 d b c
# 3 3 a e c
# 4 4 a b c
DATA
df1 <- read.table(text = "ID x y z
1 a b c
2 a b c
3 a b c
4 a b c",
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "ID x y
2 d NA
3 NA e",
header = TRUE, stringsAsFactors = FALSE)
As mentioned by alistaire this is an update join. It is available with the data.table package:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on = "ID", x := ifelse(is.na(i.x), x, i.x)]
df1[df2, on = "ID", y := ifelse(is.na(i.y), y, i.y)]
df1
ID x y z
1: 1 a b c
2: 2 d b c
3: 3 a e c
4: 4 a b c
If there are many columns with replacement values, it might be worthwhile to follow www's suggestion to do the replacement after reshaping to long format where column names are treated as data:
library(data.table)
melt(setDT(df1), "ID")[
melt(setDT(df2), "ID", na.rm = TRUE), on = .(ID, variable), value := i.value][
, dcast(.SD, ID ~ variable)]
ID x y z
1: 1 a b c
2: 2 d b c
3: 3 a e c
4: 4 a b c
Data
df1 <- fread(
"ID x y z
1 a b c
2 a b c
3 a b c
4 a b c")
df2 <- fread(
"ID x y
2 d NA
3 NA e")

Merging 3 dataframes Left join

I have 3 dataframes with unequal rows
df1-
T1 T2 T3
1 Joe TTT
2 PP YYY
3 JJ QQQ
5 UU OOO
6 OO GGG
df2
X1 X2
1 09/20/2017
2 08/02/2015
3 05/02/2000
8 06/03/1999
df3
L1 L2
1 New
6 Notsure
9 Also
The final dataframe should be a left join of all 3, retaining only the rows of df1. The matching columns are T1, X1 and L1, which have different header names. The number of rows differs between the dataframes. I couldn't find a solution for this situation; what I found on SO was for 2 dataframes, or for 3 dataframes with equal rows or the same column name.
T1 T2 T3 X2 L2
1 Joe TTT 09/20/2017 New
2 PP YYY 08/02/2015 NA
3 JJ QQQ 05/02/2000 NA
5 UU OOO NA NA
6 OO GGG NA Notsure
I am comparatively new in R, and couldn't find a R code for this
The idea is to put your data frames in a list, change the name of the first column, and use Reduce to merge, i.e.
Reduce(function(...) merge(..., by = 'Var1', all.x = TRUE),
lapply( mget(ls(pattern = 'df[0-9]+')), function(i) {names(i)[1] <- 'Var1'; i}))
which gives,
Var1 T2 T3 X2 L2
1 1 Joe TTT 09/20/2017 New
2 2 PP YYY 08/02/2015 <NA>
3 3 JJ QQQ 05/02/2000 <NA>
4 5 UU OOO <NA> <NA>
5 6 OO GGG <NA> Notsure
Using tidyverse functions, you can try:
library(dplyr)
df1 %>%
left_join(df2, by = c("T1" = "X1")) %>%
left_join(df3, by = c("T1" = "L1"))
which gives:
T1 T2 T3 X2 L2
1 1 Joe TTT 09/20/2017 New
2 2 PP YYY 08/02/2015 <NA>
3 3 JJ QQQ 05/02/2000 <NA>
4 5 UU OOO <NA> <NA>
5 6 OO GGG <NA> Notsure
1) sqldf
library(sqldf)
sqldf("select df1.*, X2, L2
from df1
left join df2 on T1 = X1
left join df3 on T1 = L1")
1a) Although slightly longer this variation can make it easier later when reviewing the code by making it explicit as to which source each column came from. If the data frame names were long you might want to use aliases, e.g. from df1 as a, but here we don't bother since they are short.
sqldf("select df1.*, df2.X2, df3.L2
from df1
left join df2 on df1.T1 = df2.X1
left join df3 on df1.T1 = df3.L1")
2) merge Using repeated merge. No packages used.
Merge <- function(x, y) merge(x, y, by = 1, all.x = TRUE)
Merge(Merge(df1, df2), df3)
2a) This could also be written using a magrittr pipeline like this:
library(magrittr)
df1 %>% Merge(df2) %>% Merge(df3)
2b) Using Reduce we can do the repeated merges like this:
Reduce(Merge, list(df1, df2, df3))
Note: The inputs in reproducible form are:
Lines1 <- "
T1 T2 T3
1 Joe TTT
2 PP YYY
3 JJ QQQ
5 UU OOO
6 OO GGG"
Lines2 <- "
X1 X2
1 09/20/2017
2 08/02/2015
3 05/02/2000
8 06/03/1999"
Lines3 <- "
L1 L2
1 New
6 Notsure
9 Also"
df1 <- read.table(text = Lines1, header = TRUE)
df2 <- read.table(text = Lines2, header = TRUE)
df3 <- read.table(text = Lines3, header = TRUE)
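When each key occurs at most once in df2 and in df3 (an assumption that holds for the data shown here), the two left joins can also be sketched as plain lookups with match(), which keeps df1's row order untouched. The data is rebuilt inline so the snippet runs on its own; out is an illustrative name.

```r
df1 <- data.frame(T1 = c(1, 2, 3, 5, 6),
                  T2 = c("Joe", "PP", "JJ", "UU", "OO"),
                  T3 = c("TTT", "YYY", "QQQ", "OOO", "GGG"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X1 = c(1, 2, 3, 8),
                  X2 = c("09/20/2017", "08/02/2015", "05/02/2000", "06/03/1999"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(L1 = c(1, 6, 9),
                  L2 = c("New", "Notsure", "Also"),
                  stringsAsFactors = FALSE)

out <- df1
# Look up each T1 in the other tables; unmatched keys give NA
out$X2 <- df2$X2[match(out$T1, df2$X1)]
out$L2 <- df3$L2[match(out$T1, df3$L1)]
out
```

If a key could appear more than once in a lookup table, match() would silently use the first occurrence, so merge or left_join is the safer choice in that case.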
With left_join() it would be something like this:
library(dplyr)
df1 = data.frame(X = c("a", "b", "c"), var1 = c(1,2, 3))
df2 = data.frame(V = c("a", "b", "c"), var2 =c(5,NA, NA) )
df3 = data.frame(Y = c("a", "b", "c"), var3 =c("name", NA, "age") )
# rename
df2 = df2 %>% rename(X = V)
df3 = df3 %>% rename(X = Y)
df = left_join(df1, df2, by = "X") %>%
left_join(., df3, by = "X")
> df
X var1 var2 var3
1 a 1 5 name
2 b 2 NA <NA>
3 c 3 NA age

Aggregated sum of another column over a vector of names in a data.frame

I have the following data.frame:
> DF <- data.frame(names = I(list(c("a", "b", "c"), c("a"), c("c", "d"))),
counts = c(1, 2, 3))
> DF
names counts
1 a, b, c 1
2 a 2
3 c, d 3
How do I get a result that sums up the total counts of each name?
Something like:
name sum
a 3
b 1
c 4
d 3
Try
DF1 <- data.frame(name=unlist(DF$names),
val=rep(DF$counts,sapply(DF$names, length)))
Or
DF1 <- do.call(rbind,Map(data.frame, name=DF$names, val=DF$counts))
aggregate(val~name, DF1, sum)
# name val
#1 a 3
#2 b 1
#3 c 4
#4 d 3
Or
DF2 <- transform(stack(setNames(DF$names, DF$counts)),
ind=as.numeric(as.character(ind)))
aggregate(ind~values, DF2, sum)
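All of the variants above follow the same two steps: expand the list column so that each name gets its own row with the count repeated, then sum within each name. A minimal base R sketch of those steps using tapply instead of aggregate (the names long and sums are illustrative):

```r
DF <- data.frame(names = I(list(c("a", "b", "c"), c("a"), c("c", "d"))),
                 counts = c(1, 2, 3))

# One row per (name, count) pair: each count is repeated once per name in its list
long <- data.frame(name = unlist(DF$names),
                   val  = rep(DF$counts, lengths(DF$names)))

# Sum the counts within each name; the result is a named vector
sums <- tapply(long$val, long$name, sum)
sums
```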
