Creating new variable in dataframe based on matching values from other dataframe - r

I have two dataframes, df1 and df2, of which two columns have partly matching values, however in completely different order; also, the values are unique in df2 but may be repeated in df1.
What I'd like to do is transfer into df1, not the matching values, but values associated with them in another variable in df2; for one value in df1, "G", I do not want the associated value to be transferred but rather just NA.
Consider df1 and df2:
df1 <- data.frame(
x = c("A", NA, "L", "G", "C", "F", NA, "J", "G", "K")
)
df2 <- data.frame(
a = LETTERS[1:10],
b = 1:10 # these are the values to be transferred into df1$z
)
df1$z <- ifelse(df1$x=="G", NA, ifelse(df1$x %in% df2$a, df2$b[df2$a %in% df1$x], NA))
The values to be transferred from df2 into df1 are in df2$b. I've tried the above ifelse() string but the resulting values in df1$z are only partly correct. Where's the mistake?

I think this does what you want:
df1$z <- df2$b[match(df1$x,df2$a)]
df1$z[df1$x %in% "G"] <- NA  # %in% never yields NA, unlike == on a column with NAs
Output:
> df1
      x  z
1     A  1
2  <NA> NA
3     L NA
4     G NA
5     C  3
6     F  6
7  <NA> NA
8     J 10
9     G NA
10    K NA
Hope this helps!
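To see where the original ifelse() went wrong, here is a minimal sketch using the question's data. The names wrong, pos and z are only illustrative. Indexing df2$b with df2$a %in% df1$x returns the matching values in df2's order (and only five of them), so ifelse() recycles them against df1's ten rows and they end up misaligned. match() instead returns one position per element of df1$x.

```r
df1 <- data.frame(x = c("A", NA, "L", "G", "C", "F", NA, "J", "G", "K"))
df2 <- data.frame(a = LETTERS[1:10], b = 1:10)

# Values picked in df2's order, not aligned with df1's rows
wrong <- df2$b[df2$a %in% df1$x]

# For each element of df1$x, the row of df2$a it matches (NA if none)
pos <- match(df1$x, df2$a)

# Look up b by position, then blank out the "G" rows
z <- df2$b[pos]
z[df1$x %in% "G"] <- NA
z
```

Because pos has the same length and order as df1$x, the lookup stays aligned no matter how df2 is sorted.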

library(dplyr)  # needed for %>% and mutate()
left_join(df1, df2, by = c("x" = "a")) %>% mutate(b = ifelse(x == "G", NA, b))
#       x  b
# 1     A  1
# 2  <NA> NA
# 3     L NA
# 4     G NA
# 5     C  3
# 6     F  6
# 7  <NA> NA
# 8     J 10
# 9     G NA
# 10    K NA

Related

Reshaping in R: grouping vectors after merge

I have a pretty basic problem that I can't seem to solve:
Take 3 vectors:
V1 V2 V3
 1  4  7
 2  5  8
 3  6  9
I want to merge the 3 columns into a single column and assign a group.
Desired result:
Group Value
A     1
A     2
A     3
B     4
B     5
B     6
C     7
C     8
C     9
My code:
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
data <- data.frame(group = c("A", "B", "C"),
values = c(V1, V2, V3))
Actual result:
Group Value
A     1
B     2
C     3
A     4
B     5
C     6
A     7
B     8
C     9
How can I reshape the data to get the desired result?
Thank you!
We can stack on a named list of vectors
stack(setNames(mget(paste0("V", 1:3)), LETTERS[1:3]))[2:1]
-output
  ind values
1   A      1
2   A      2
3   A      3
4   B      4
5   B      5
6   B      6
7   C      7
8   C      8
9   C      9
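Regarding the issue in the OP's data creation: the group vector has only 3 elements while values has 9, so data.frame() recycles it as A, B, C, A, B, C, and so on. Use rep() with a times vector so that each label is repeated once per element of its vector: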
data <- data.frame(group = rep(c("A", "B", "C"), c(length(V1),
length(V2), length(V3))),
values = c(V1, V2, V3))
-output
> data
  group values
1     A      1
2     A      2
3     A      3
4     B      4
5     B      5
6     B      6
7     C      7
8     C      8
9     C      9
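A small sketch of the difference, with illustrative names (labels, recycled, grouped) that are not in the original post: rep_len() mimics the recycling that data.frame() applied to the short group vector, while rep() with times repeats each label as a block.

```r
V1 <- c(1, 2, 3)
V2 <- c(4, 5, 6)
V3 <- c(7, 8, 9)
labels <- c("A", "B", "C")

# What recycling produces: the labels cycle through the rows
recycled <- rep_len(labels, length(c(V1, V2, V3)))

# What is wanted: each label repeated once per element of its vector
grouped <- rep(labels, times = lengths(list(V1, V2, V3)))

data.frame(group = grouped, values = c(V1, V2, V3))
```

Using lengths() on a list of the vectors keeps the labels correct even when the vectors differ in length.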
Here is another option using pivot_longer(), optionally stripping the "V" prefix from the column names and converting the remaining 1, 2, 3 into a factor labeled A, B, C:
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
df <- data.frame(V1, V2, V3)
library(dplyr)
library(tidyr)
#basic answer:
answer<-pivot_longer(df, cols=starts_with("V"), names_to = "Group")
# OR: strip the "V" prefix and relabel the groups as A, B, C
answer<-pivot_longer(df, cols=starts_with("V"), names_prefix = "V", names_to = "Group")
answer$Group <- factor(answer$Group, labels = c("A", "B", "C"))
answer %>% arrange(Group, value)

R - subset rows by rows in another data frame

Let's say I have a data frame df containing only factors/categorical variables. I have another data frame conditions where each row contains a different combination of the different factor levels of some subset of variables in df (made using expand.grid and levels etc.). I'm trying to figure out a way of subsetting df based on each row of conditions. So for example, if the column names of conditions are c("A", "B", "C") and the first row is c('a1', 'b1', 'c1'), then I want df[df$A == 'a1' & df$B == 'b1' & df$C == 'c1',], and so on.
I'd think this is a great time to use merge (or dplyr::*_join or ...):
df1 <- expand.grid(A = letters[1:4], B = LETTERS[1:4], stringsAsFactors = FALSE)
df1$rn <- seq_len(nrow(df1))
# 'df2' contains the conditions we want to filter (retain)
df2 <- data.frame(
a1 = c('a', 'a', 'c'),
b1 = c('B', 'C', 'C'),
stringsAsFactors = FALSE
)
df1
# A B rn
# 1 a A 1
# 2 b A 2
# 3 c A 3
# 4 d A 4
# 5 a B 5
# 6 b B 6
# 7 c B 7
# 8 d B 8
# 9 a C 9
# 10 b C 10
# 11 c C 11
# 12 d C 12
# 13 a D 13
# 14 b D 14
# 15 c D 15
# 16 d D 16
df2
# a1 b1
# 1 a B
# 2 a C
# 3 c C
Using df2 to define which combinations we need to keep,
merge(df1, df2, by.x=c('A','B'), by.y=c('a1','b1'))
# A B rn
# 1 a B 5
# 2 a C 9
# 3 c C 11
# or
dplyr::inner_join(df1, df2, by=c(A='a1', B='b1'))
(I defined df2 with different column names just to show how it works, but in reality since its purpose is "solely" to be declarative on which combinations to filter, it would make sense to me to have the same column names, in which case the by= argument just gets simpler.)
One option is to create the condition with Reduce
df[Reduce(`&`, Map(`==`, df[c("A", "B", "C")], df[1, c("A", "B", "C")])),]
Or another option is rowSums
df[rowSums(df[c("A", "B", "C")] ==
df[1, c("A", "B", "C")][col(df[c("A", "B", "C")])]) == 3,]
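To get one subset per row of conditions, as the question asks, one possible sketch is to merge df against each single-row slice of conditions. The data below is hypothetical (the question gives none), and the id column is added only to make the matched rows visible.

```r
# Hypothetical data: every combination of three two-level variables
df <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2"),
                  stringsAsFactors = FALSE)
df$id <- seq_len(nrow(df))

# Hypothetical conditions: each row is one combination to select
conditions <- data.frame(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c1"),
                         stringsAsFactors = FALSE)

# One data frame per condition row; drop = FALSE keeps the slice a data frame
subsets <- lapply(seq_len(nrow(conditions)), function(i) {
  merge(df, conditions[i, , drop = FALSE], by = names(conditions))
})
subsets
```

Each element of subsets is equivalent to the hand-written df[df$A == ... & df$B == ... & df$C == ..., ] for that row of conditions.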

Collecting the value which have the same name in column and row from data frame

This is a small example:
a <- c("a", "b", "f", "c", "e")
b <- c("a", "c", "e", "d", "b")
p <- matrix(1:25, nrow = 5, dimnames = list(a, b))
p <- as.data.frame(p)
# the data frame looks like this:
   a  c  e  d  b
a  1  6 11 16 21
b  2  7 12 17 22
f  3  8 13 18 23
c  4  9 14 19 24
e  5 10 15 20 25
The output I want:
  score
a     1
b    22
c     9
e    15
This is the code I wrote:
L <- rownames(p)
output <- NULL
t <- 1
for (i in L) {
tar_column <- p[i]
score <- tar_column[t, ]
tar_score <- matrix(score, nrow = 1, dimnames = list(i, "score"))
output <- rbind(output, tar_score)
t <- t+1
}
The output I got:
score
a 1
b 22
Error in `[.data.frame`(p, i) : undefined columns selected
The problem is that the column names and row names do not match perfectly. I think an if statement could skip a row name when it has no matching column name. Could someone help me fix this problem?
Just loop through each column/rowname (using sapply) and use square bracket notation to subset p on both that row and column:
sapply(c('a','b','c','e'), function(x) p[x,x])
a b c e
1 22 9 15
If you don't want to specify the variable names beforehand, you can just use either colnames or rownames:
sapply(colnames(p), function(x) p[x,x])
a c e d b
1 9 15 NA 22
If there isn't a matching rowname, this will return NA for that value. If desired, you can drop the NA values by subsetting the result:
result <- sapply(colnames(p), function(x) p[x,x])
result[!is.na(result)]
a c e b
1 9 15 22
Here is another option:
library(tidyverse)
p %>%
rownames_to_column("row") %>%
gather(col, score, -row) %>%
filter(row == col) %>%
select(-row)
#> col score
#> 1 a 1
#> 2 c 9
#> 3 e 15
#> 4 b 22
First we make the row name into a variable, then we gather from wide to long format, lastly we filter only matching pairs of row and col.
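A loop-free base R alternative, offered as a sketch: restrict to the names that appear as both a row and a column, then index the matrix with a two-column character matrix of (row, column) pairs. The names common, score and result are only illustrative.

```r
a <- c("a", "b", "f", "c", "e")
b <- c("a", "c", "e", "d", "b")
p <- as.data.frame(matrix(1:25, nrow = 5, dimnames = list(a, b)))

# Names present as both a row name and a column name
common <- intersect(rownames(p), colnames(p))

# A two-column character matrix selects one (row, column) cell per pair
score <- as.matrix(p)[cbind(common, common)]

result <- data.frame(score = score, row.names = common)
result
```

Because only shared names are used, unmatched names such as "f" and "d" are skipped and no NA cleanup is needed.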

Replacing NAs in a column with the values of other column

How can I replace the NAs in one column with the values of another column in R using dplyr? A minimal working example is below.
Letters <- LETTERS[1:5]
Char <- c("a", "b", NA, "d", NA)
df1 <- data.frame(Letters, Char)
df1
library(dplyr)
df1 %>%
mutate(Char1 = ifelse(Char != NA, Char, Letters))
Letters Char Char1
1 A a NA
2 B b NA
3 C <NA> NA
4 D d NA
5 E <NA> NA
You can use coalesce:
library(dplyr)
df1 <- data.frame(Letters, Char, stringsAsFactors = F)
df1 %>%
mutate(Char1 = coalesce(Char, Letters))
Letters Char Char1
1 A a a
2 B b b
3 C <NA> C
4 D d d
5 E <NA> E
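As for why the original attempt returned NA everywhere: any comparison with NA, including Char != NA, evaluates to NA, so ifelse() has nothing to choose on. A minimal base R sketch of the same replacement, testing with is.na() instead (the name all_na is only illustrative):

```r
Letters <- LETTERS[1:5]
Char <- c("a", "b", NA, "d", NA)
df1 <- data.frame(Letters, Char, stringsAsFactors = FALSE)

# Comparing with NA yields NA for every row, never TRUE or FALSE
all_na <- all(is.na(df1$Char != NA))

# Test for missingness with is.na() and fall back to Letters
df1$Char1 <- ifelse(is.na(df1$Char), df1$Letters, df1$Char)
df1
```

coalesce() expresses the same fallback more compactly and extends to more than two columns.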

Matching columns with other columns in data frames and adding certain columns of matching values

I have searched but cannot find what I need. I found similar threads, but they don't quite do what I want. I know there should be an easy way to do this without writing a loop. Here it goes:
I have two data frames, df1 and df2:
df1 <- data.frame(ID = c("a", "b", "c", "d", "e", "f"), y = 1:6 )
df2 <- data.frame(x = c("a", "c", "g", "f"), f=c("M","T","T","M"), obj=c("F70", "F60", "F71", "F82"))
df2$f <- as.factor(df2$f)
Now I want to match the "ID" column of df1 against the "x" column of df2, and add new columns to df1 holding the df2 values for the matching rows. The final df1 should look like this:
ID y obj f1 f2
a 1 F70 M NA
b 2 NA NA NA
c 3 F60 NA T
d 4 NA NA NA
e 5 NA NA NA
f 6 F82 M NA
We can do this with tidyverse by joining the two datasets and then spreading the 'f' column:
library(tidyverse)
left_join(df1, df2, by = c(ID = "x")) %>%
group_by(f) %>%
spread(f, f) %>%
select(-6) %>%
rename(f1 = M, f2 = T)
# A tibble: 6 × 5
# ID y obj f1 f2
#* <chr> <int> <fctr> <fctr> <fctr>
#1 a 1 F70 M NA
#2 b 2 NA NA NA
#3 c 3 F60 NA T
#4 d 4 NA NA NA
#5 e 5 NA NA NA
#6 f 6 F82 M NA
Or a similar approach with data.table
library(data.table)
dcast(setDT(df2)[df1, on = .(x = ID)], x+obj + y ~ f, value.var = 'f')[, -6, with = FALSE]
Here is a base R process.
# combine the data.frames
dfNew <- merge(df1, df2, by.x="ID", by.y="x", all.x=TRUE)
# add f1 and f2 variables
dfNew[c("f1", "f2")] <- lapply(c("M", "T"),
function(i) factor(ifelse(as.character(dfNew$f) == i, i, NA)))
# remove original factor variable
dfNew <- dfNew[-3]
ID y obj f1 f2
1 a 1 F70 M <NA>
2 b 2 <NA> <NA> <NA>
3 c 3 F60 <NA> T
4 d 4 <NA> <NA> <NA>
5 e 5 <NA> <NA> <NA>
6 f 6 F82 M <NA>
