Comparing two data frames in R and extracting the values from one data frame


I have two data frames with different numbers of rows and columns. One data frame has two columns and the other has multiple columns.
I need to replace the letters in the second data frame (A, B, C, etc.) with the corresponding values from the second column of the first data frame.
Please help me solve this problem. Both data frames are given below as dput output.
dput:
df
structure(list(col1 = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L"), col2 = c(10, 1, 2, 3, 4, 3, 1, 8, 19, 200,
12, 112)), row.names = c(NA, -12L), class = c("tbl_df", "tbl",
"data.frame"))
df2
structure(list(col1 = c("A", "F", "W", "E", "F", "G"), col2 = c(NA,
NA, "J", "K", "L", NA), col3 = c(NA, "H", "I", NA, "A", "B")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

A one-liner (as_tibble comes from the tibble package; the lookup table is named df in the dput above):
as_tibble(`colnames<-`(matrix(df$col2[match(as.matrix(df2), df$col1)], ncol = 3), names(df2)))
#> # A tibble: 6 x 3
#> col1 col2 col3
#> <dbl> <dbl> <dbl>
#> 1 10 NA NA
#> 2 3 NA 8
#> 3 NA 200 19
#> 4 4 12 NA
#> 5 3 112 10
#> 6 1 NA 1

You can accomplish this with a little data manipulation. Make the data in df2 long, then join to df, then make the data wide again.
The rowid_to_column is necessary to make the transition from long to wide work. You can easily remove that column by adding select(-rowid) at the end of the chain.
library(tidyverse)
df2 %>%
rowid_to_column() %>%
pivot_longer(cols = -rowid) %>%
left_join(df, by = c("value" = "col1")) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = col2)
# rowid col1 col2 col3
# <int> <dbl> <dbl> <dbl>
# 1 1 10 NA NA
# 2 2 3 NA 8
# 3 3 NA 200 19
# 4 4 4 12 NA
# 5 5 3 112 10
# 6 6 1 NA 1

one-liner in base R:
df2 <- as.data.frame(lapply(df2, function(x) ifelse(!is.na(x), setNames(df$col2, df$col1)[x], NA)))
Output
> df2
col1 col2 col3
1 10 NA NA
2 3 NA 8
3 NA 200 19
4 4 12 NA
5 3 112 10
6 1 NA 1

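Another short option in base R. You can use match and assign the result to df2[]. If df and df2 are tibbles, as in the dput above, convert df2 to a plain data frame first and extract the columns with $ so that match receives vectors:
df2 <- as.data.frame(df2)
df2[] <- df$col2[match(unlist(df2), df$col1)]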
df2
# col1 col2 col3
#1 10 NA NA
#2 3 NA 8
#3 NA 200 19
#4 4 12 NA
#5 3 112 10
#6 1 NA 1
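All of the answers above rest on the same lookup step, which can be seen in isolation on a tiny example. This is only a sketch: the names lookup, keys, idx and vals are made up for illustration and do not come from the question.

```r
# Hypothetical lookup table with the same shape as df in the question
lookup <- data.frame(col1 = c("A", "B", "C"),
                     col2 = c(10, 1, 2),
                     stringsAsFactors = FALSE)

# Keys to translate; NA and the unknown key "Z" have no match
keys <- c("B", NA, "Z", "A")

# match() returns the position of each key in lookup$col1, or NA if absent
idx <- match(keys, lookup$col1)

# Indexing with NA positions yields NA, so unmatched keys stay NA
vals <- lookup$col2[idx]
vals
# [1]  1 NA NA 10
```

Because match works on a plain vector, a whole data frame can be translated at once by flattening it with unlist or as.matrix and then restoring the shape, which is what the one-liners do.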

Related

How can you convert duplicates across multiple columns to be NA in R?

I have a dataset that I want to convert any duplicates across columns to be NA. I've found answers to help with just looking for duplicates in one column, and I've found ways to remove duplicates entirely (e.g., distinct()). Instead, I have this data:
library(dplyr)
test <- tibble(job = c(1:6),
name = c("j", "j", "j", "c", "c", "c"),
id = c(1, 1, 2, 1, 5, 1))
And want this result:
library(dplyr)
answer <- tibble(job = c(1:6),
name = c("j", NA, "j", "c", "c", NA),
id = c(1, NA, 2, 1, 5, NA))
And I've tried a solution like this using duplicated(), but it fails:
#Attempted solution
library(dplyr)
test %>%
mutate_at(vars(id, name), ~case_when(
duplicated(id, name) ~ NA,
TRUE ~ .
))
I'd prefer to use tidy solutions, but I can be flexible as long as the answer can be piped.

We could create a helper and then identify duplicates and replace them with NA in an ifelse statement using across:
library(dplyr)
test %>%
mutate(helper = paste(id, name)) %>%
mutate(across(c(name, id), ~ifelse(duplicated(helper), NA, .)), .keep="unused")
job name id
<int> <chr> <dbl>
1 1 j 1
2 2 NA NA
3 3 j 2
4 4 c 1
5 5 c 5
6 6 NA NA

To convert the duplicates to NA, create a column that combines all the relevant columns with paste or unite, then use mutate with across:
library(dplyr)
library(tidyr)
test %>%
unite(full_nm, -job, remove = FALSE) %>%
mutate(across(-c(job, full_nm), ~ replace(.x, duplicated(full_nm), NA))) %>%
select(-full_nm)
Output:
# A tibble: 6 × 3
job name id
<int> <chr> <dbl>
1 1 j 1
2 2 <NA> NA
3 3 j 2
4 4 c 1
5 5 c 5
6 6 <NA> NA
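For comparison, here is a minimal base R sketch of the same idea, not taken from the answers above. It relies on the fact that duplicated() accepts a data frame and then compares whole rows. The attempt in the question, duplicated(id, name), does not do this, because the second positional argument of duplicated is incomparables rather than a second column. The variable dup is a made-up name.

```r
test <- data.frame(job = 1:6,
                   name = c("j", "j", "j", "c", "c", "c"),
                   id = c(1, 1, 2, 1, 5, 1),
                   stringsAsFactors = FALSE)

# TRUE for every row whose (name, id) pair has already appeared above it
dup <- duplicated(test[c("name", "id")])

# Blank out both columns in the duplicated rows only
test[dup, c("name", "id")] <- NA
test
```

Rows 2 and 6 repeat the pairs (j, 1) and (c, 1), so those are the rows set to NA, in line with the outputs shown above.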

How to find the highest value in a row which is not a distinct variable

I have this dataframe
mydf <- structure(list(POS = c("1", "2", "3", "4"), A = c("10", "10",
"6", "1"), C = c("1", "8", "2", "7"), T = c("6", "2", "10", "8"
), G = c("0", "0", "2", "11"), Ref = c("A", "A", "T", "C")), class = "data.frame", row.names = c(NA,
-4L))
which looks like this
POS A C T G Ref
1 10 1 6 0 A
2 10 8 2 0 A
3 6 2 10 2 T
4 1 7 8 11 C
My aim is to extract the maximum value of each row that is NOT in the column named in Ref. In the first row I want the value of T, since it is the highest value outside the Ref column A. In the second row I want the value of C, and so on.
The POS column does not count here; only A, T, G and C matter.
Unfortunately, I have to do this on quite a number of rows, so I need an automated solution.
I would be happy with a dplyr solution, since I am trying to focus on dplyr :)
Thanks a lot!
THANK YOU for all the answers. There are multiple correct solutions; I just accepted the one I am currently using. The other answers work as well!

You can try max in apply:
apply(sapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), 1, max, na.rm=TRUE)
#[1] 6 8 6 11
Or using pmax:
do.call(pmax, c(lapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), na.rm=TRUE))
#[1] 6 8 6 11
Benchmark:
library(dplyr)
bench::mark(check = FALSE
, apply = apply(sapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), 1, max, na.rm=TRUE)
, do.call = do.call(pmax, c(lapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), na.rm=TRUE))
, mapply = mapply(function(x, i) max(as.numeric(unlist(x))[-i]),
x = split(mydf[, 2:5], seq(nrow(mydf))),
i = match(mydf$Ref, names(mydf)[-1]))
, sapply = sapply(split(mydf, seq(nrow(mydf))),
function(x) max(as.numeric(x[, setdiff(c("A", "C", "T", "G"), x$Ref)])))
, dplyr = {mydf %>%
rowwise() %>%
mutate(Res = Reduce(pmax, across(A:G, ~ as.numeric(.) * (. != get(Ref)))))}
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 apply 103.7µs 111.06µs 8861. 4.13KB 14.5 4291 7
#2 do.call 63.3µs 68.56µs 14072. 4.13KB 14.4 6825 7
#3 mapply 323.3µs 355.44µs 2747. 14.55KB 12.4 1329 6
#4 sapply 469.4µs 516.12µs 1855. 16.5KB 12.5 892 6
#5 dplyr 7.6ms 8.26ms 120. 23.35KB 11.1 54 5
Calling pmax via do.call appears to be the fastest, and it allocates as little memory as the apply version (4.13KB), which is less than the other approaches.

You can turn the values in Ref columns to be NA and use pmax to get rowwise maximum ignoring NA values.
mydf <- type.convert(mydf, as.is = TRUE)
tmp <- mydf
tmp[cbind(1:nrow(tmp), match(tmp$Ref, names(tmp)))] <- NA
mydf$max_value <- do.call(pmax, c(tmp[2:5], na.rm = TRUE))
mydf
# POS A C T G Ref max_value
#1 1 10 1 6 0 A 6
#2 2 10 8 2 0 A 8
#3 3 6 2 10 2 T 6
#4 4 1 7 8 11 C 11
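The key step in this answer is indexing with a two-column matrix of (row, column) pairs. The following sketch shows that step on a small integer matrix; m, skip_col, idx, picked and row_max are illustrative names, not part of the answer.

```r
m <- matrix(1:6, nrow = 2)
# m is:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Column to blank out in each row (row 1 -> column 3, row 2 -> column 1)
skip_col <- c(3, 1)

# Each row of idx is one (row, column) pair
idx <- cbind(seq_len(nrow(m)), skip_col)

picked <- m[idx]   # the cells addressed by the pairs: 5 and 2
m[idx] <- NA       # overwrite exactly those cells

# Row-wise maximum over the remaining cells, ignoring the NAs
row_max <- do.call(pmax, c(as.data.frame(m), na.rm = TRUE))
row_max
# [1] 3 6
```

In the answer, match(tmp$Ref, names(tmp)) plays the role of skip_col: it turns each Ref letter into the position of the column to blank out.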

A base R solution is
sapply(split(mydf, seq(nrow(mydf))),
function(x) max(x[, setdiff(c("A", "C", "T", "G"), x$Ref)]))
#R> 1 2 3 4
#R> 6 8 6 11
Or
mapply(function(x, i) max(x[-i]),
x = split(as.matrix(mydf[, 2:5]), seq(nrow(mydf))),
i = match(mydf$Ref, names(mydf)[-1]))
#R> 1 2 3 4
#R> 6 8 6 11
Or like GKi's answer
x <- as.matrix(mydf[, c("A", "C", "T", "G")])
x[rep(c("A", "C", "T", "G"), each = NROW(mydf)) == mydf$Ref] <- NA_real_
apply(x, 1, max, na.rm = TRUE)
#R> [1] 6 8 6 11
# in R 4.1.0 or greater
as.matrix(mydf[, c("A", "C", "T", "G")]) |>
(\(x){
x[rep(c("A", "C", "T", "G"), each = NROW(mydf)) == mydf$Ref] <- NA_real_
x
})() |>
apply(1, max, na.rm = TRUE)
#R> [1] 6 8 6 11
I first converted the columns to numeric as follows, since I assume this is what you intended:
mydf[, c("A", "C", "T", "G")] <-
lapply(mydf[, c("A", "C", "T", "G")], as.numeric)

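One dplyr option, assuming the columns have already been converted to numeric, could be the following. Note that it compares values rather than column names, so a non-Ref column that ties with the Ref value is zeroed as well: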
mydf %>%
rowwise() %>%
mutate(Res = Reduce(pmax, across(A:G, ~ . * (. != get(Ref)))))
POS A C T G Ref Res
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 10 1 6 0 A 6
2 2 10 8 2 0 A 8
3 3 6 2 10 2 T 6
4 4 1 7 8 11 C 11

Count number of element for each row in a matrix [duplicate]

Hello, I have a matrix such as:
COL1 COL2 COL3
A "A" "B" NA
B "B" "B" "C"
C NA NA NA
D "B" "B" "B"
E NA NA "C"
F "A" "A" "C"
and I would like to get, for each row (A, B, C, D, etc.), the number of letters that are A or B.
Example:
Nb
A 2
B 2
C 0
D 3
E 0
F 2
Does anyone have an idea?

Another way is to use sapply:
df$n <- sapply(1:nrow(df), function(i) sum((df[i,] %in% c('A', 'B'))))
# COL1 COL2 COL3 n
# A A B <NA> 2
# B B B C 2
# C <NA> <NA> <NA> 0
# D B B B 3
# E <NA> <NA> C 0
# F A A C 2
You can achieve the same output by using purrr::map_dbl as well. Just replace sapply with map_dbl.
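A vectorised variant of the same count, offered only as a sketch and not taken from the answers, is to compare the whole matrix at once and sum each row. The names m and n are illustrative.

```r
df <- data.frame(COL1 = c("A", "B", NA, "B", NA, "A"),
                 COL2 = c("B", "B", NA, "B", NA, "A"),
                 COL3 = c(NA, "C", NA, "B", "C", "C"),
                 row.names = c("A", "B", "C", "D", "E", "F"),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)

# The comparison yields TRUE, FALSE or NA per cell; na.rm = TRUE drops the
# NA cells, so rowSums counts the TRUE cells in each row
n <- rowSums(m == "A" | m == "B", na.rm = TRUE)
n
# A B C D E F
# 2 2 0 3 0 2
```

This avoids an explicit loop over rows, which may matter on large matrices.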

You can try a base R solution with apply():
#Base R
df$Var <- apply(df,1,function(x) length(which(!is.na(x) & x %in% c('A','B'))))
Output:
COL1 COL2 COL3 Var
A A B <NA> 2
B B B C 2
C <NA> <NA> <NA> 0
D B B B 3
E <NA> <NA> C 0
F A A C 2
Some data used:
#Data
df <- structure(list(COL1 = c("A", "B", NA, "B", NA, "A"), COL2 = c("B",
"B", NA, "B", NA, "A"), COL3 = c(NA, "C", NA, "B", "C", "C")), row.names = c("A",
"B", "C", "D", "E", "F"), class = "data.frame")
Or if you feel curious about tidyverse:
library(tidyverse)
#Code
df %>% mutate(id=1:n()) %>%
left_join(df %>% mutate(id=1:n()) %>%
pivot_longer(cols = -id) %>%
filter(value %in% c('A','B')) %>%
group_by(id) %>%
summarise(Var=n())) %>% ungroup() %>%
replace(is.na(.),0) %>% select(-id)
Output:
COL1 COL2 COL3 Var
1 A B 0 2
2 B B C 2
3 0 0 0 0
4 B B B 3
5 0 0 C 0
6 A A C 2

library(dplyr)
df <- structure(list(COL1 = c("A", "B", NA, "B", NA, "A"), COL2 = c("B",
"B", NA, "B", NA, "A"), COL3 = c(NA, "C", NA, "B", "C", "C")), row.names = c("A",
"B", "C", "D", "E", "F"), class = "data.frame")
df %>%
rowwise() %>%
mutate(sumVar = across(c(COL1:COL3),~ifelse(. %in% c("A", "B"),1,0)) %>% sum)
# A tibble: 6 x 4
# Rowwise:
COL1 COL2 COL3 sumVar
<chr> <chr> <chr> <dbl>
1 A B NA 2
2 B B C 2
3 NA NA NA 0
4 B B B 3
5 NA NA C 0
6 A A C 2

Wide to long, combining columns in pairs but keeping ID column - R

I have a dataframe of the following type
ID case1 case2 case3 case4
1 A B C D
2 B A
3 E F
4 G C A
5 T
I need to change its format to a long shape, similar to the one below:
ID col1 col2
1 A B
1 A C
1 A D
1 B C
1 B D
1 C D
2 B A
3 E F
4 G C
4 G A
4 C A
5 T
As you can see, I need to maintain the ID and ignore empty columns. There are some cases like T that need to remain in the dataset, but without a col2.
I am honestly not sure how to approach this, so that is why there are no examples of what I have tried.

You can get the data in long format and create all combination of values for each ID if the number of rows is greater than 1 in that ID.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ID, values_drop_na = TRUE) %>%
group_by(ID) %>%
summarise(value = if(n() > 1) list(setNames(as.data.frame(t(combn(value, 2))),
c('col1', 'col2')))
else list(data.frame(col1 = value[1], col2 = NA_character_))) %>%
unnest(value)
# A tibble: 12 x 3
# ID col1 col2
# <int> <chr> <chr>
# 1 1 A B
# 2 1 A C
# 3 1 A D
# 4 1 B C
# 5 1 B D
# 6 1 C D
# 7 2 B A
# 8 3 E F
# 9 4 G C
#10 4 G A
#11 4 C A
#12 5 T NA
data
df <- structure(list(ID = 1:5, case1 = c("A", "B", "E", "G", "T"),
case2 = c("B", "A", "F", "C", NA), case3 = c("C", NA, NA,
"A", NA), case4 = c("D", NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -5L))
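The per-ID step in the answer above can be tried on its own. The sketch below wraps it in a helper called pairs_or_single, a made-up name, and uses only base R. The guard on the length is needed because combn(x, 2) stops with an error when x has fewer than two elements.

```r
# Hypothetical helper: all pairs of the non-missing values, or a single
# row with an NA partner when only one value is present
pairs_or_single <- function(values) {
  values <- values[!is.na(values)]
  if (length(values) > 1) {
    # combn() returns one pair per column; transpose to one pair per row
    m <- t(combn(values, 2))
    data.frame(col1 = m[, 1], col2 = m[, 2], stringsAsFactors = FALSE)
  } else {
    data.frame(col1 = values[1], col2 = NA_character_,
               stringsAsFactors = FALSE)
  }
}

p <- pairs_or_single(c("G", "C", "A", NA))   # ID 4 in the example data
s <- pairs_or_single(c("T", NA, NA, NA))     # ID 5 in the example data
p
#   col1 col2
# 1    G    C
# 2    G    A
# 3    C    A
```

The pairs keep the original order of the values, which is why ID 4 yields G-C, G-A and C-A rather than an alphabetical ordering.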

Forward and backward difference between rows with missing values

Here is the sample dataframe:
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B", "B"),
num = c(1, NA, 6, 3, 7, NA , NA, 2))
How do I get the forward and backward differences between rows within each id category? There should be two new columns: one with the difference between the current row and the previous row, and the other with the difference between the current row and the next row. If the previous row is NA, the difference should be calculated between the current row and the closest previous row that contains a real number. The same holds for the forward difference.
Many thanks!!

require(magrittr)
df$backdiff <- c(NA, sapply(2:nrow(df),
function(i){
df$num[i] - df$num[(i-1):1] %>% .[!is.na(.)][1]
}))
df$forward.diff <- c(sapply(2:nrow(df) - 1,
function(i){
df$num[i] - df$num[(i+1):nrow(df)] %>% .[!is.na(.)][1]
}), NA)

One solution could be achieved by using fill function from tidyr to create two columns (one each for prev and next calculation) where NA values are removed.
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B", "B"),
num = c(1, NA, 6, 3, 7, NA , NA, 2))
library("tidyverse")
df %>% mutate(dup_num_prv = num, dup_num_nxt = num) %>%
group_by(id) %>%
fill(dup_num_prv, .direction = "down") %>%
fill(dup_num_nxt, .direction = "up") %>%
mutate(prev_diff = ifelse(is.na(num), NA, num - lag(dup_num_prv))) %>%
mutate(next_diff = ifelse(is.na(num), NA, num - lead(dup_num_nxt))) %>%
as.data.frame()
# Result is shown in columns 'prev_diff' and 'next_diff'
# id num dup_num_prv dup_num_nxt prev_diff next_diff
#1 A 1 1 1 NA -5
#2 A NA 1 6 NA NA
#3 A 6 6 6 5 3
#4 A 3 3 3 -3 NA
#5 B 7 7 7 NA 5
#6 B NA 7 2 NA NA
#7 B NA 7 2 NA NA
#8 B 2 2 2 -5 NA
Note: There are a few points the OP needs to clarify, after which the solution can be fine-tuned. dup_num_prv and dup_num_nxt are kept only to make the steps easier to follow; these columns can be removed.
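The same fill-then-subtract idea can also be sketched in base R with ave, which applies a function within each id group. This is an illustration under the same reading of the question as the answer above; prev_non_na is a made-up helper name.

```r
df <- data.frame(
  id = c("A", "A", "A", "A", "B", "B", "B", "B"),
  num = c(1, NA, 6, 3, 7, NA, NA, 2))

# Hypothetical helper: for each position, the last non-NA value strictly
# before it (NA when there is none)
prev_non_na <- function(x) {
  out <- rep(NA_real_, length(x))
  last <- NA_real_
  for (i in seq_along(x)) {
    out[i] <- last
    if (!is.na(x[i])) last <- x[i]
  }
  out
}

# Backward difference: current value minus closest earlier value in the group
df$prev_diff <- df$num - ave(df$num, df$id, FUN = prev_non_na)

# Forward difference: reverse, reuse the helper, and reverse back
df$next_diff <- df$num -
  ave(df$num, df$id, FUN = function(x) rev(prev_non_na(rev(x))))
df
```

Rows where num itself is NA stay NA in both new columns, because subtracting from NA gives NA.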
