R: Binding columns by key variable

I want to combine two data frames, df1 and df2, by the groups of a key variable x1. It is basically a join operation, but I do not want rows to be duplicated, and I do not care how the added columns relate to each other within a group.
Assume:
df1:
x1 x2
A 1
A 2
A 3
B 4
B 5
C 6
C 7
df2:
x1 x3
A a
A b
A c
A d
A e
A f
B g
C h
The result should look like this.
df1 + df2:
x1 x2 x3
A 1 a
A 2 b
A 3 c
A NA d
A NA e
A NA f
B 4 g
B 5 NA
C 6 h
C 7 NA
Does anyone have an idea? I would most appreciate your help!

The full_join in dplyr works well for this too. See below:
# recreate your data
library(data.table)
library(dplyr)
df1 <- data.table(x1 = c("A","A","A","B","B","C","C"), x2 = seq(from = 1, to = 7))
df2 <- data.table(x1 = c("A","A","A","A","A","A","B","C"), x3 = c("a","b","c","d","e","f","g","h"))
# number the rows within each x1 group
df1[, rowid := rowid(x1)]
df2[, rowid := rowid(x1)]
# full join on the key and the within-group row number, then drop the helper column
df3 <- full_join(df1, df2, by = c("x1", "rowid"))
df3$rowid <- NULL
setorder(df3, x1)

To replicate your resulting data.frame you can create row ids by x1 and then merge on those row ids and x1 (but I don't really know if that is what you are trying to accomplish)
library(data.table)
df1 = read.table(text = "x1 x2
A 1
A 2
A 3
B 4
B 5
C 6
C 7", header = T)
df2 = read.table(text = "x1 x3
A a
A b
A c
A d
A e
A f
B g
C h", header = T)
setDT(df1)
setDT(df2)
df1[, rowid := seq(.N), by = x1] # create rowid
df2[, rowid := seq(.N), by = x1] # create rowid
merge(df1, df2, by = c("x1", "rowid"), all = TRUE)[, rowid := NULL][]
x1 x2 x3
1: A 1 a
2: A 2 b
3: A 3 c
4: A NA d
5: A NA e
6: A NA f
7: B 4 g
8: B 5 NA
9: C 6 h
10: C 7 NA
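For readers without data.table, the same idea (number the rows within each x1 group, then do a full outer join on x1 plus that number) can be sketched in base R. This is only a minimal sketch using ave() and merge(); the helper column name rowid is simply carried over from the answers above.

```r
# Base R sketch: within-group row ids, then a full outer join on key + row id
df1 <- data.frame(x1 = c("A","A","A","B","B","C","C"), x2 = 1:7,
                  stringsAsFactors = FALSE)
df2 <- data.frame(x1 = c(rep("A", 6), "B", "C"), x3 = letters[1:8],
                  stringsAsFactors = FALSE)

# ave() applies seq_along() separately within each x1 group
df1$rowid <- ave(seq_along(df1$x1), df1$x1, FUN = seq_along)
df2$rowid <- ave(seq_along(df2$x1), df2$x1, FUN = seq_along)

# all = TRUE keeps unmatched rows from both sides (full outer join)
res <- merge(df1, df2, by = c("x1", "rowid"), all = TRUE)
res <- res[order(res$x1, res$rowid), ]   # make the row order explicit
res$rowid <- NULL                        # drop the helper column
rownames(res) <- NULL
res
```

Because each (x1, rowid) pair occurs at most once per table, the join cannot duplicate rows, which is the property the question asks for.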

Related

Create a new column based on the values and heading of another dataset

Say I have an original dataset, df1, whose first column holds the letters a to e:
a x1
b x2
c x3
d x4
e x5
and then I have another dataset, df2, which has multiple columns and whose entries reference the first column of the aforementioned dataset:
---------
A | B | C
---------
a b c
d e
I would like to use an R function that takes the unique values in df2 (a, b, c, d and e above) and creates a new column in df1 holding the name of the df2 column in which each value appears, i.e. df3:
a x1 A
b x2 B
c x3 C
d x4 B
e x5 C
Working example:
> # data frame with numbers and characters
> df1 = data.frame(unique_values=letters[1:5], other_col=paste(rep("x",5), 1:5, sep=""))
> print(df1)
unique_values other_col
1 a x1
2 b x2
3 c x3
4 d x4
5 e x5
> # Create dataset that is then used to create new column
> df2 = data.frame(A = c("a",NA), B=c("b","d"), C=c("c","e") )
> df2
A B C
1 a b c
2 <NA> d e
# Using df1 and the columns in df2 that reference df1, create df3
library(dplyr)
#df3?
A base R option using merge + stack
merge(df1, setNames(na.omit(stack(df2)), c("unique_values", "names")))
gives
unique_values other_col names
1 a x1 A
2 b x2 B
3 c x3 C
4 d x4 B
5 e x5 C
Reshape the second data to 'long' format and then do a join
library(dplyr)
library(tidyr)
pivot_longer(df2, everything(), values_to = 'unique_values',
values_drop_na = TRUE) %>%
left_join(df1)
-output
# A tibble: 5 x 3
# name unique_values other_col
# <chr> <chr> <chr>
#1 A a x1
#2 B b x2
#3 C c x3
#4 B d x4
#5 C e x5
data.table version :
library(data.table)
merge(setDT(df1), melt(setDT(df2), measure.vars = names(df2)),
by.x = 'unique_values', by.y = 'value')
# unique_values other_col variable
#1: a x1 A
#2: b x2 B
#3: c x3 C
#4: d x4 B
#5: e x5 C
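One way to picture what the three answers above have in common: df2 is flattened into a lookup from value to column name, and df1 is then matched against it. A minimal base R sketch of that lookup follows; the object names values, col_name and lookup are made up for illustration, and the sketch assumes each value occurs in only one column of df2.

```r
df1 <- data.frame(unique_values = letters[1:5],
                  other_col = paste0("x", 1:5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(A = c("a", NA), B = c("b", "d"), C = c("c", "e"),
                  stringsAsFactors = FALSE)

# Flatten df2 column by column and remember which column each value came from
values   <- unlist(df2, use.names = FALSE)
col_name <- rep(names(df2), each = nrow(df2))

# Drop the NA padding, then build a named vector: value -> column name
keep   <- !is.na(values)
lookup <- setNames(col_name[keep], values[keep])

# Look up each value of df1 by name
df3 <- df1
df3$names <- unname(lookup[df1$unique_values])
df3
```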

A merge indicator for R data.table?

My question is related to this question, but that one asked for a dplyr solution.
What I'd like to do is perform an outer join and create an indicator variable that explains the merge result, like pandas or STATA would do.
To be specific, I would like to have _merge column after full outer join operation that indicates the merge result with left_only or right_only or both as below example.
UPDATE : I've updated example
key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
> df1
key v1
1: a 1
2: b 2
3: c 3
4: d NA
5: e 5
> df2
key v2
1: a 4
2: b 5
3: d 6
4: f 7
# merge result I'd like to have
key v1 v2 _merge
1: a 1 4 both
2: b 2 5 both
3: c 3 NA left_only
4: d NA 6 both # <- not right_only, both
5: e 5 NA left_only
6: f NA 7 right_only
I'm wondering if I'm missing an existing data.table feature, or is there a simple way to do this task?
You can use merge.data.table with all=TRUE for a full outer join:
library(data.table)
setDT(df1)
setDT(df2)
DT <- merge(df1[, r1 := .I], df2[, r2 := .I], by="key", all=TRUE)
DT[, merge_ := "both"][
is.na(r1), merge_ := "right_only"][
is.na(r2), merge_ := "left_only"]
output:
key v1 r1 v2 r2 merge_
1: a 1 1 4 1 both
2: b 2 2 5 2 both
3: c 3 3 NA NA left_only
4: d NA NA 6 3 right_only
data used for the output above:
key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
As mentioned by Michael Chirico, with data.table_1.13.0 released on Jul 24, 2020, one can also use fcase as follows:
DT[, merge_ := fcase(
is.na(r1), "right_only",
is.na(r2), "left_only",
default = "both"
)]
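The same marker-column idea can be sketched in base R on the updated example from the question, where key d has v1 = NA in df1 but should still count as "both". This is only a sketch: the marker column names in_left and in_right are hypothetical, and they play the role of r1 and r2 above, so the indicator does not depend on NA values in the data columns.

```r
df1 <- data.frame(key = c("a", "b", "c", "d", "e"), v1 = c(1, 2, 3, NA, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(key = c("a", "b", "d", "f"), v2 = c(4, 5, 6, 7),
                  stringsAsFactors = FALSE)

# Marker columns: after the join, NA in a marker means "row absent on that side"
df1$in_left  <- TRUE
df2$in_right <- TRUE

res <- merge(df1, df2, by = "key", all = TRUE)   # full outer join
res$merge_ <- ifelse(is.na(res$in_left), "right_only",
              ifelse(is.na(res$in_right), "left_only", "both"))

# Drop the helper columns again
res$in_left  <- NULL
res$in_right <- NULL
res
```

Testing the markers rather than v1 or v2 is what keeps key d classified as "both" even though its v1 is NA.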

Group values in rows into matching columns

I had a column with multiple values inside it..
Like...
ColumnX1
A,D,C,B,F,E,G
F,A,B,E,G,C
C,D,G,F,A,T
I split the data with
Species_Data2 <- data.frame(str_split_fixed(Species_Data$Other.Anopheline.species, ",", 21))
But I got the values as below, i.e. a data frame like:
X1 X2 X3 X4 X5 X6 X7
A D C B F E G
F A B E G NA C
C D G F A T NA
I wanted to make a dataframe like:
X1 X2 X3 X4 X5 X6 X7 X8
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
and then....
I want to make the columns names as row values:-
Colnames
'A' 'B' 'C' 'D' 'E' 'F' 'G' 'T'
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
Tried to create sorting...but does not work that great... :(..
Comes up with O values though....
If I understand correctly, the OP wants to rearrange the data so that there is a separate column for each letter. If a letter is present in a row, then the letter appears in the appropriate column/row of the reshaped data. NA indicates that a letter is missing in a row. In addition, the letter columns should be arranged in alphabetical order.
1. dplyr/tidyr approach
If we start with the data.frame resulting from OP's call to stringr::str_split_fixed(), we need to reshape the split data from wide to long format, remove empty entries, order the rows so that the columns appear in letter order, and reshape to wide format again. For reshaping, a row id is required. To achieve the desired output, pivot_wider() has to be called with the names_from = value parameter:
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF$ColumnX1, ",", 21)) %>%
mutate(rn = row_number()) %>%
pivot_longer(-rn) %>%
filter(value != "") %>%
arrange(as.character(value)) %>%
pivot_wider(id_cols = rn, names_from = value)
rn A B C D E F G T
<int> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 A B C D E F G NA
2 2 A B C NA E F G NA
3 3 A NA C D NA F G T
2. data.table approach
If we start from the unsplit original data, there is a much more concise variant which uses data.table's dcast() for reshaping:
library(data.table)
setDT(DF)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
If required, the additional row id column can be removed in both approaches.
Data
DF <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G",
"F,A,B,E,G,C",
"C,D,G,F,A,T")
)
EDIT: Duplicate values
In a comment, the OP has disclosed that the production dataset contains duplicate values.
In case of duplicate values, dcast() uses the length() function by default to aggregate the data.
With a modified dataset DF2 which contains duplicate values in rows 1 and 2, the original data.table approach returns:
library(data.table)
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 1 1 2 1 1 1 1 0
2: 2 1 1 1 0 1 2 1 0
3: 3 1 0 1 1 0 1 1 1
Here, the number of duplicate letters is shown.
The expected behaviour can be restored by removing the duplicate values before reshaping by using unique():
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][
, dcast(unique(.SD), nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
Also the dplyr/tidyr approach needs to be modified by specifying an appropriate aggregation function in the call to pivot_wider():
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF2$ColumnX1, ",", 21)) %>%
mutate(rn = row_number()) %>%
pivot_longer(-rn) %>%
filter(value != "") %>%
arrange(as.character(value)) %>%
pivot_wider(id_cols = rn, names_from = value, values_fn = list(value = unique))
Data with duplicate values
DF2 <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G,C",
"F,A,B,E,G,C,F",
"C,D,G,F,A,T")
)
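For comparison, the reshaping described above (one column per letter, NA where a letter is missing from a row) can also be sketched in base R without any reshaping step. This is a rough sketch rather than a replacement for the answers above; the names parts, lv and wide are made up for illustration. Because it only tests whether a letter is present, duplicate values within a row need no special handling.

```r
DF2 <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G,C",
                               "F,A,B,E,G,C,F",
                               "C,D,G,F,A,T"),
                  stringsAsFactors = FALSE)

# Split every row into its letters
parts <- strsplit(DF2$ColumnX1, ",", fixed = TRUE)

# All letters that occur anywhere, in alphabetical order
lv <- sort(unique(unlist(parts)))

# For each row: keep the letter where it is present, NA otherwise
wide <- t(sapply(parts, function(p) ifelse(lv %in% p, lv, NA_character_)))
colnames(wide) <- lv
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
wide
```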

R - Merge and Replace Column If ID Found on Another Data Frame

I have two data frames as below and am trying to improve my code so that the Letters column in df1 is replaced with the Letters2 column in df2 where the IDs match.
df1 <- data.frame(ID = c(1,3,2,4,5), Letters = LETTERS[1:5], stringsAsFactors = F)
df2 <- data.frame(ID = c(1,3,4), Letters2 = "F", stringsAsFactors = F)
desired:
ID Letters
1 F
2 C
3 F
4 F
5 E
It would be like doing the following in one line:
desired <- merge(df1, df2, by = "ID", all.x = T)
desired$Letters <- ifelse(is.na(desired$Letters2), desired$Letters, desired$Letters2)
desired$Letters2 <- NULL
Try this:
library(tidyverse)
df1%>%
left_join(df2)%>%
mutate(Letters=coalesce(Letters2,Letters),Letters2=NULL)
Joining, by = "ID"
ID Letters
1 1 F
2 3 F
3 2 C
4 4 F
5 5 E
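We could match the 'ID' values to find the rows in 'Letters' to change to the values of 'Letters2' (which are all 'F's). Using 'ID' directly as a row index would only be safe if every ID equalled its row number, which is not the case for this df1.
df1$Letters[match(df2$ID, df1$ID)] <- df2$Letters2
df1
# ID Letters
#1 1 F
#2 3 F
#3 2 C
#4 4 F
#5 5 E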
Or using data.table
library(data.table)
setDT(df1)[df2, Letters := Letters2, on = .(ID)]
df1
# ID Letters
#1: 1 F
#2: 3 F
#3: 2 C
#4: 4 F
#5: 5 E

how to subset in r for this particular condition?

df1 and df2 have columns a,b. I want to subset data from df1 such that each entry in df1$a along with df1$b is in df2$a along with df2$b.
df1
a b c
1 m df1
2 f df1
3 f df1
4 m df1
5 f df1
6 m df1
df2
a b c
1 m df2
3 f df2
4 f df2
5 m df2
6 f df2
7 m df2
desired output
df
a b c
1 m df1
3 f df1
i am using :
df <- subset(df1,(df1$a%in%df2$a & df1$b%in%df2$b))
but this is giving results similar to
df <-subset(df1,df1$a%in%df2$a)
You can use package dplyr:
library(dplyr)
intersect(df1,df2)
# a b
#1 1 m
#2 3 f
Edit for the new data.frames with c column:
you can use function semi_join (also from dplyr):
semi_join(df1,df2,by=c("a","b"))
# a b c
#1 1 m df1
#2 3 f df1
Other option, in base R:
you can paste your a and b variables to subset your data.frame:
df1[paste(df1$a,df1$b) %in% paste(df2$a,df2$b), ]
# a b
#1 1 m
#3 3 f
and with the new data.frames:
# a b c
# 1 1 m df1
# 3 3 f df1
Or you could do
Res <- rbind(df1, df2)
Res[duplicated(Res), ]
# a b
# 7 1 m
# 8 3 f
Edit1: Per the edit, here's a similar data.table solution
library(data.table)
Res <- rbind(df1, df2)
setDT(Res)[duplicated(Res, by = c("a", "b"), fromLast = TRUE)]
# a b c
# 1: 1 m df1
# 2: 3 f df1
Edit2: I see that @CathG opened a join battlefront, so here's how we do it with data.table
setkey(setDT(df1), a, b) ; setkey(setDT(df2), a, b)
df1[df2, nomatch = 0]
# a b c i.c
# 1: 1 m df1 df2
# 2: 3 f df1 df2
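To see why the attempt in the question behaves almost like the test on a alone, it helps to run both versions side by side. The sketch below uses base R only; loose and paired are made-up names. Two separate %in% tests check each column independently, so any b value that occurs anywhere in df2 passes, whereas a join on both columns checks the (a, b) pairs.

```r
df1 <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                  b = c("m", "f", "f", "m", "f", "m"),
                  c = "df1", stringsAsFactors = FALSE)
df2 <- data.frame(a = c(1, 3, 4, 5, 6, 7),
                  b = c("m", "f", "f", "m", "f", "m"),
                  c = "df2", stringsAsFactors = FALSE)

# Independent tests: b is always "m" or "f", so only the test on a filters
loose <- subset(df1, a %in% df2$a & b %in% df2$b)

# Pairwise test: inner join on both columns, keeping only df1's columns
paired <- merge(df1, unique(df2[c("a", "b")]), by = c("a", "b"))

nrow(loose)   # rows with a in df2$a, regardless of b
paired        # rows whose (a, b) pair also occurs in df2
```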
