How to summarize nested groups in R

In a data frame like data below:
library(tidyverse)
ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y","Z", "a","b","c","d")
State <- rep(c("FL", "GA", "SC", "NC", "VA", "GA"), each = 5)
Location <- rep(c("alpha", "beta", "gamma"), each = 10)
Var3 <- rep(c("Bravo", "Charlie", "Delta", "Echo"), times = c(7,8,10,5))
Sex <- rep(c("M","F","M"), times = 10)
data <- data.frame(ID, State, Location, Var3, Sex)
I want to return a data frame, or a list of several data frames, that summarizes each way the data can be grouped. I want to see how many individual IDs are in each State, Location, and Var3; how many M and F are in each State, Location, and Var3; how many Locations are in each State; etc. What is the best way to achieve this?

We can use count
library(dplyr)
data %>%
count(State, Location, Var3, Sex)
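That gives the counts for the full combination. For the individual groupings mentioned in the question, the same verb works directly; a small sketch (using the data above) for IDs per State and distinct Locations per State:
data %>% count(State)
data %>% distinct(State, Location) %>% count(State)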
Also, to get rollup/cube-style hierarchical counts,
library(data.table)
rollup(as.data.table(data), j = .N, by = c("State","Location","Var3", "Sex"))
# State Location Var3 Sex N
# 1: FL alpha Bravo M 3
# 2: FL alpha Bravo F 2
# 3: GA alpha Bravo M 2
# 4: GA alpha Charlie F 1
# 5: GA alpha Charlie M 2
# 6: SC beta Charlie F 2
# 7: SC beta Charlie M 3
# 8: NC beta Delta M 3
# 9: NC beta Delta F 2
#10: VA gamma Delta M 4
#11: VA gamma Delta F 1
#12: GA gamma Echo F 2
#13: GA gamma Echo M 3
#14: FL alpha Bravo <NA> 5
#15: GA alpha Bravo <NA> 2
#16: GA alpha Charlie <NA> 3
#17: SC beta Charlie <NA> 5
#18: NC beta Delta <NA> 5
#19: VA gamma Delta <NA> 5
#20: GA gamma Echo <NA> 5
#21: FL alpha <NA> <NA> 5
#22: GA alpha <NA> <NA> 5
#23: SC beta <NA> <NA> 5
#24: NC beta <NA> <NA> 5
#25: VA gamma <NA> <NA> 5
#26: GA gamma <NA> <NA> 5
#27: FL <NA> <NA> <NA> 5
#28: GA <NA> <NA> <NA> 10
#29: SC <NA> <NA> <NA> 5
#30: NC <NA> <NA> <NA> 5
#31: VA <NA> <NA> <NA> 5
#32: <NA> <NA> <NA> <NA> 30
# State Location Var3 Sex N
Or use cube
cube(as.data.table(data), j = .N, by = c("State","Location","Var3", "Sex"))
# State Location Var3 Sex N
# 1: FL alpha Bravo M 3
# 2: FL alpha Bravo F 2
# 3: GA alpha Bravo M 2
# 4: GA alpha Charlie F 1
# 5: GA alpha Charlie M 2
# ---
#111: <NA> <NA> Delta <NA> 10
#112: <NA> <NA> Echo <NA> 5
#113: <NA> <NA> <NA> M 20
#114: <NA> <NA> <NA> F 10
#115: <NA> <NA> <NA> <NA> 30
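If you only want some specific grouping combinations rather than the full rollup/cube hierarchy, data.table also provides groupingsets(). A sketch (assuming the data from above; the sets chosen here are just examples):
library(data.table)
DT <- as.data.table(data)
groupingsets(DT, j = .N,
             by = c("State", "Location", "Var3", "Sex"),
             sets = list(c("State", "Location"), c("State", "Sex"), character(0)))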

One dplyr and purrr solution to group by all possible combinations of column names could be:
map2(list(colnames(data)),
     1:ncol(data),
     combn, simplify = FALSE) %>%
  flatten() %>%
  map(~ data %>%
        group_by_at(.x) %>%
        tally())
In this case there are 31 possible combinations of column names, so it returns a list of 31 tibbles. The first three:
[[1]]
# A tibble: 30 x 2
ID n
<fct> <int>
1 a 1
2 A 1
3 b 1
4 B 1
5 c 1
6 C 1
7 d 1
8 D 1
9 E 1
10 F 1
# … with 20 more rows
[[2]]
# A tibble: 5 x 2
State n
<fct> <int>
1 FL 5
2 GA 10
3 NC 5
4 SC 5
5 VA 5
[[3]]
# A tibble: 3 x 2
Location n
<fct> <int>
1 alpha 10
2 beta 10
3 gamma 10
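A small extension of that idea (my own sketch, not part of the original answer): naming each list element after the grouping variables it uses makes the 31 summaries easier to navigate.
library(dplyr)
library(purrr)
combos <- map2(list(colnames(data)), 1:ncol(data), combn, simplify = FALSE) %>%
  flatten()
summaries <- combos %>%
  set_names(map_chr(combos, paste, collapse = "_")) %>%
  map(~ data %>% group_by(across(all_of(.x))) %>% tally())
summaries[["State_Sex"]]   # e.g. the M/F counts within each State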

Related

Anti_join between df1 and df2, but how to change all mismatches in df2 to NA

Below are my two dataframes, df1 and df2
df1 <- data.frame(id=c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420"),text=c("asdf","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten"),response=c("y","y","y","n","n","y","y","n","n","y"))
id text response
1 632592651 asdf y
2 633322173 cat y
3 634703802 dog y
4 634927873 mouse n
5 635812953 elephant n
6 636004739 goose y
7 636101211 rat y
8 636157799 mice n
9 636263106 kitty n
10 636752420 kitten y
df2 <- data.frame(id=c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420","636809222","2004722036","2004894388","2005045755","2005535472","2005630542","2005788781","2005809679","2005838317","2005866692"),
text=c("asdf_xyz","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten","tiger_xyz","lion","leopard","ostrich","kangaroo","platypus","fish","reptile","mammals","amphibians_xyz"),
volume=c("1234","432","324","333","2223","412346","7456","3456","2345","2345","6","345","23","2","4778","234","8675","3459","8","9"))
id text volume
1 632592651 asdf_xyz 1234
2 633322173 cat 432
3 634703802 dog 324
4 634927873 mouse 333
5 635812953 elephant 2223
6 636004739 goose 412346
7 636101211 rat 7456
8 636157799 mice 3456
9 636263106 kitty 2345
10 636752420 kitten 2345
11 636809222 tiger_xyz 6
12 2004722036 lion 345
13 2004894388 leopard 23
14 2005045755 ostrich 2
15 2005535472 kangaroo 4778
16 2005630542 platypus 234
17 2005788781 fish 8675
18 2005809679 reptile 3459
19 2005838317 mammals 8
20 2005866692 amphibians_xyz 9
How do I change the non-matching rows of df2 (rows 11 to 20, which have no match in df1) to NA, and also set the 'text' value of the first id (i.e. asdf_xyz, which does not match df1's text) to NA?
I have tried
library(dplyr)
df3 <- df2 %>%
anti_join(df1, by=c("id"))
id text volume
1 636809222 tiger_xyz 6
2 2004722036 lion 345
3 2004894388 leopard 23
4 2005045755 ostrich 2
5 2005535472 kangaroo 4778
6 2005630542 platypus 234
7 2005788781 fish 8675
8 2005809679 reptile 3459
9 2005838317 mammals 8
10 2005866692 amphibians_xyz 9
df3$id[df3$id != 0] <- NA
df3$text[df3$text != 0] <- NA
df3$volume[df3$volume != 0] <- NA
(Doing this one column at a time because I couldn't find a solution for changing all values of the data frame to NA)
id text volume
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>
7 <NA> <NA> <NA>
8 <NA> <NA> <NA>
9 <NA> <NA> <NA>
10 <NA> <NA> <NA>
and df4 (solution from How to return row values that match column 'id' in both df1 and df2 but not column 'text' and return NA to the mismatch in column 'text'?)
inner_join(x = df1,
y = df2,
by = "id") %>%
mutate_if(is.factor, as.character) %>%
mutate(text = ifelse(test = text.x != text.y,
yes = NA,
no = text.x)) %>%
select(id, text, response, volume)
id text response volume
1 632592651 <NA> y 1234
2 633322173 cat y 432
3 634703802 dog y 324
4 634927873 mouse n 333
5 635812953 elephant n 2223
6 636004739 goose y 412346
7 636101211 rat y 7456
8 636157799 mice n 3456
9 636263106 kitty n 2345
10 636752420 kitten y 2345
but I'm not sure how to combine df3 and df4 back into df2. The desired output is shown below:
id text volume
1 632592651 NA 1234
2 633322173 cat 432
3 634703802 dog 324
4 634927873 mouse 333
5 635812953 elephant 2223
6 636004739 goose 412346
7 636101211 rat 7456
8 636157799 mice 3456
9 636263106 kitty 2345
10 636752420 kitten 2345
11 NA NA NA
12 NA NA NA
13 NA NA NA
14 NA NA NA
15 NA NA NA
16 NA NA NA
17 NA NA NA
18 NA NA NA
19 NA NA NA
20 NA NA NA
Can someone help, please?
If possible, I'd also like to know whether there's a manual approach to select the subset of df2 based on df3$id and change all of its values to NA.
Part 2:
For the second part of my request, I would like to create another data frame from joined_df containing only the rows that appear in df1 (call it found_in_df1). Example output:
found_in_df1:
# id text volume
# 1: 632592651 <NA> 1234
# 2: 633322173 cat 432
# 3: 634703802 dog 324
# 4: 634927873 mouse 333
# 5: 635812953 elephant 2223
# 6: 636004739 goose 412346
# 7: 636101211 rat 7456
# 8: 636157799 mice 3456
# 9: 636263106 kitty 2345
#10: 636752420 kitten 2345
A solution is given in How to return row values that match column 'id' in both df1 and df2 but not column 'text' and return NA to the mismatch in column 'text'?, but I'm looking for an alternative approach: since we already have df1 and joined_df, is it possible to write a script that retrieves rows from joined_df using df1 to give found_in_df1?
One potential solution for dealing with conflicts is to use the powerjoin package, e.g.
library(dplyr)
df1 <- data.frame(id=c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420"),
text=c("asdf","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten"),
response=c("y","y","y","n","n","y","y","n","n","y"))
df2 <- data.frame(id=c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420","636809222","2004722036","2004894388","2005045755","2005535472","2005630542","2005788781","2005809679","2005838317","2005866692"),
text=c("asdf_xyz","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten","tiger_xyz","lion","leopard","ostrich","kangaroo","platypus","fish","reptile","mammals","amphibians_xyz"),
volume=c(1234,432,324,333,2223,412346,7456,3456,2345,2345,6,345,23,2,4778,234,8675,3459,8,9))
expected_outcome <- data.frame(id = c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
text = c(NA, "cat", "dog", "mouse", "elephant", "goose",
"rat", "mice", "kitty", "kitten",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
volume = c(1234, 432, 324, 333, 2223, 412346, 7456,
3456, 2345, 2345, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))
library(powerjoin)
joined_df <- power_full_join(df1, df2, by = c("id"),
conflict = rw ~ ifelse(.x != .y,
NA_integer_,
.x))
final_df <- joined_df %>%
mutate(across(everything(), ~ifelse(is.na(response), NA, .x))) %>%
select(id, text, volume)
final_df
#> id text volume
#> 1 632592651 <NA> 1234
#> 2 633322173 cat 432
#> 3 634703802 dog 324
#> 4 634927873 mouse 333
#> 5 635812953 elephant 2223
#> 6 636004739 goose 412346
#> 7 636101211 rat 7456
#> 8 636157799 mice 3456
#> 9 636263106 kitty 2345
#> 10 636752420 kitten 2345
#> 11 <NA> <NA> NA
#> 12 <NA> <NA> NA
#> 13 <NA> <NA> NA
#> 14 <NA> <NA> NA
#> 15 <NA> <NA> NA
#> 16 <NA> <NA> NA
#> 17 <NA> <NA> NA
#> 18 <NA> <NA> NA
#> 19 <NA> <NA> NA
#> 20 <NA> <NA> NA
all_equal(final_df, expected_outcome)
#> [1] TRUE
# Part 2
found_in_df1 <- power_left_join(df1, df2, by = c("id"),
conflict = rw ~ ifelse(.x != .y,
NA_integer_,
.x)) %>%
select(id, text, volume)
found_in_df1
#> id text volume
#> 1 632592651 <NA> 1234
#> 2 633322173 cat 432
#> 3 634703802 dog 324
#> 4 634927873 mouse 333
#> 5 635812953 elephant 2223
#> 6 636004739 goose 412346
#> 7 636101211 rat 7456
#> 8 636157799 mice 3456
#> 9 636263106 kitty 2345
#> 10 636752420 kitten 2345
Created on 2022-07-02 by the reprex package (v2.0.1)
Edit
Per the comment below from the creator of the powerjoin package (Mr. Mudskipper): these operations are vectorised, so you don't need to run them rowwise, i.e. you can remove "rw" to simplify the call and gain performance. With df1 and df2 there is no practical difference between including and excluding "rw", but with larger data frames the speed-up is clear, e.g.
library(dplyr)
library(powerjoin)
# define functions
power_full_join_func_rowwise <- function(df1, df2) {
joined_df <- power_full_join(df1, df2, by = c("id"),
conflict = rw ~ ifelse(.x != .y,
NA_integer_,
.x))
final_df <- joined_df %>%
mutate(across(everything(), ~ifelse(is.na(response), NA, .x))) %>%
select(id, text, volume)
return(final_df)
}
power_full_join_func_not_rowwise <- function(df1, df2) {
joined_df <- power_full_join(df1, df2, by = c("id"),
conflict = ~ifelse(.x != .y,
NA_integer_,
.x))
final_df <- joined_df %>%
mutate(across(everything(), ~ifelse(is.na(response), NA, .x))) %>%
select(id, text, volume)
return(final_df)
}
library(microbenchmark)
library(purrr)
library(ggplot2)
# make larger dfs (copy df1 and df2 X100)
df3 <- map_dfr(seq_len(100), ~ df1)
df4 <- map_dfr(seq_len(100), ~ df2)
# benchmark performance on the larger dataframes
res <- microbenchmark(power_full_join_func_rowwise(df3, df4),
power_full_join_func_not_rowwise(df3, df4))
res
#> Unit: milliseconds
#> expr min lq mean
#> power_full_join_func_rowwise(df3, df4) 397.32661 426.08117 449.88787
#> power_full_join_func_not_rowwise(df3, df4) 71.85757 77.25344 90.36191
#> median uq max neval cld
#> 446.41715 472.47817 587.3301 100 b
#> 81.18239 93.95103 191.1248 100 a
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
# Is the result the same?
all_equal(power_full_join_func_rowwise(df3, df4),
power_full_join_func_not_rowwise(df3, df4))
#> [1] TRUE
Created on 2022-11-24 by the reprex package (v2.0.1)
A data.table version: use an anti-join (!df1) and overwrite with := all columns of the df2 rows it returns with NA (the recycled list .(NA) applies to every column).
Then loop over the common variables and overwrite any values that don't match by id:
library(data.table)
setDT(df1)
setDT(df2)
df2[!df1, on=.(id), names(df2) := .(NA)]
idvars <- "id"
compvars <- setdiff(intersect(names(df1), names(df2)), idvars)
for (i in compvars) {
df2[!df1, on=c(idvars,i), (i) := NA]
}
# id text volume
# 1: 632592651 <NA> 1234
# 2: 633322173 cat 432
# 3: 634703802 dog 324
# 4: 634927873 mouse 333
# 5: 635812953 elephant 2223
# 6: 636004739 goose 412346
# 7: 636101211 rat 7456
# 8: 636157799 mice 3456
# 9: 636263106 kitty 2345
#10: 636752420 kitten 2345
#11: <NA> <NA> <NA>
#12: <NA> <NA> <NA>
#13: <NA> <NA> <NA>
#14: <NA> <NA> <NA>
#15: <NA> <NA> <NA>
#16: <NA> <NA> <NA>
#17: <NA> <NA> <NA>
#18: <NA> <NA> <NA>
#19: <NA> <NA> <NA>
#20: <NA> <NA> <NA>
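A plain dplyr sketch of the same idea (my own alternative, not taken from the answers above): join df1 onto df2, NA out any text that disagrees for a matched id, then blank the rows whose id has no match in df1.
library(dplyr)
df2 %>%
  left_join(df1, by = "id", suffix = c("", ".df1")) %>%
  mutate(text = ifelse(!is.na(text.df1) & text != text.df1, NA, text)) %>%
  mutate(across(c(id, text, volume), ~ ifelse(is.na(response), NA, .x))) %>%
  select(id, text, volume)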

Adding values to columns based on multiple conditions

I have 1 df as below
df <- data.frame(n1 = c(1,2,1,2,5,6,8,9,8,8),
n2 = c(100,1000,500,1,NA,NA,2,8,10,15),
n3 = c("a", "a", "a", NA, "b", "c",NA,NA,NA,NA),
n4 = c("red", "red", NA, NA, NA, NA,NA,NA,NA,NA))
df
n1 n2 n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a <NA>
4 2 1 <NA> <NA>
5 5 NA b <NA>
6 6 NA c <NA>
7 8 2 <NA> <NA>
8 9 8 <NA> <NA>
9 8 10 <NA> <NA>
10 8 15 <NA> <NA>
First, please see my desired output
df
n1 n2 n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a red
4 2 1 <NA> red
5 5 NA b <NA>
6 6 NA c <NA>
7 8 2 <NA> red
8 9 8 <NA> red
9 8 10 <NA> red
10 8 15 <NA> red
I made this post before (Adding values to one columns based on conditions). However, I realized that I need to take one more column to solve my problem.
So, I would like to update/add "red" in n4 based on conditions coming from n1, n2, and n3. Where n3 == "a" (and n4 == "red"), the associated n1 values are 1 and 2, so rows whose n1 is one of those values should also get "red" in n4 (i.e. rows 3 and 4). At the same time, if one of those n1 values also appears in n2 (i.e. 2, in row 7), that row's n4 should also become "red". Further, 8 in column n1 is now connected to the same chain, so any further rows whose n1 or n2 equals 8 should be filled the same way. I hope this is clear; if not, I'm happy to explain more. (It works like a zig-zag through the columns.)
Note: tidyverse and base R solutions are also welcome here.
Any suggestions for me please?
You can try the code below if you are using igraph:
library(igraph)
res <- do.call(
  rbind,
  lapply(
    # split the graph built from the (n1, n2) edges into connected components
    decompose(
      graph_from_data_frame(replace(df, is.na(df), "NA"))
    ),
    function(x) {
      # within a component, fill n4 from any non-"NA" edge value
      n4 <- E(x)$n4
      if (!all(n4 == "NA")) {
        E(x)$n4 <- unique(n4[n4 != "NA"])
      }
      get.data.frame(x)
    }
  )
)
dfout <- type.convert(
  # restore df's original row order by matching on the first two columns
  res[match(do.call(paste, df[1:2]), do.call(paste, res[1:2])), ],
  as.is = TRUE
)
which gives
> dfout
from to n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a red
4 2 1 <NA> red
9 5 NA b <NA>
10 6 NA c <NA>
5 8 2 <NA> red
6 9 8 <NA> red
7 8 10 <NA> red
8 8 15 <NA> red
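An alternative sketch built on igraph::components() (my own variant of the same idea): rows whose n1/n2 values are linked fall into one connected component, and n4 is filled from any non-NA value found in that component. This keeps df's original columns and row order.
library(igraph)
library(dplyr)
g <- graph_from_data_frame(replace(df, is.na(df), "NA"))
comp_id <- components(g)$membership[as.character(df$n1)]
df %>%
  mutate(grp = comp_id) %>%
  group_by(grp) %>%
  mutate(n4 = if (all(is.na(n4))) n4 else first(na.omit(n4))) %>%
  ungroup() %>%
  select(-grp)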

How to combine two data.tables based on multiple criteria in R?

I have two data.tables, which I want to combine based on if a date in one table is in the given time range in the other table. In dt1 I have exit dates and I want to check in dt2 which values were valid at the exit date for each ID.
dt1 <- data.table (ID = 1:10,
exit = c("31/12/2010", "01/01/2021", "30/09/2010", "31/12/2015", "30/09/2010","31/10/2018", "01/02/2016", "01/05/2015", "01/09/2013", "01/01/2016"))
dt2 <- data.table (ID = c(1,2,2,2,3,5,6,6,7,8,8,9,10),
valid_from = c("01/01/2010", "01/01/2012", "01/01/2013", "01/12/2017", "01/05/2010", "01/04/2010", "01/05/2014", "01/11/2016", "01/01/2016", "15/04/2013", "01/01/2015", "15/02/2010", "01/04/2012"),
valid_until = c("01/01/2021", "31/12/2012", "30/11/2017", "01/01/2021", "01/01/2021", "01/01/2021", "31/10/2016", "01/01/2021", "01/01/2021", "31/12/2014", "01/05/2015", "01/01/2013", "01/01/2021"),
text1 = c("a", "a", "b", "c", "b", "b", "c", "a", "a", "b", "a", "c", "a"),
text2 = c("I", "I", "II", "I", "III", "I", "II", "III", "I", "II", "II", "I", "III" ))
ID exit
1: 1 31/12/2010
2: 2 01/01/2021
3: 3 30/09/2010
4: 4 31/12/2015
5: 5 30/09/2010
6: 6 31/10/2018
7: 7 01/02/2016
8: 8 01/05/2015
9: 9 01/09/2013
10: 10 01/01/2016
ID valid_from valid_until text1 text2
1: 1 01/01/2010 01/01/2021 a I
2: 2 01/01/2012 31/12/2012 a I
3: 2 01/01/2013 30/11/2017 b II
4: 2 01/12/2017 01/01/2021 c I
5: 3 01/05/2010 01/01/2021 b III
6: 5 01/04/2010 01/01/2021 b I
7: 6 01/05/2014 31/10/2016 c II
8: 6 01/11/2016 01/01/2021 a III
9: 7 01/01/2016 01/01/2021 a I
10: 8 15/04/2013 31/12/2014 b II
11: 8 01/01/2015 01/05/2015 a II
12: 9 15/02/2010 01/01/2013 c I
13: 10 01/04/2012 01/01/2021 a III
As a result I would like to return in dt1 the valid values to the exit dates.
If an ID is not found in dt2 (would be the case for ID 4 in the sample data), it should return NA.
ID exit text1 text2
1: 1 31/12/2010 a I
2: 2 01/01/2021 c I
3: 3 30/09/2010 b III
4: 4 31/12/2015 <NA> <NA>
5: 5 30/09/2010 b I
6: 6 31/10/2018 a III
7: 7 01/02/2016 a I
8: 8 01/05/2015 a II
9: 9 01/09/2013 c I
10: 10 01/01/2016 a III
Could anyone help me solve this?
As the input is a data.table, consider using data.table methods which are fast
library(data.table)
# // convert the date columns to `Date` class
dt1[, exit := as.IDate(exit, '%d/%m/%Y')]
dt2[, c('valid_from', 'valid_until') := .(as.IDate(valid_from, '%d/%m/%Y'),
as.IDate(valid_until, '%d/%m/%Y'))]
# // do a non-equi join
dt1[dt2, c('text1', 'text2') := .(i.text1, i.text2),
on = .(ID, exit >= valid_from, exit <= valid_until)]
-output
> dt1
ID exit text1 text2
1: 1 2010-12-31 a I
2: 2 2021-01-01 c I
3: 3 2010-09-30 b III
4: 4 2015-12-31 <NA> <NA>
5: 5 2010-09-30 b I
6: 6 2018-10-31 a III
7: 7 2016-02-01 a I
8: 8 2015-05-01 a II
9: 9 2013-09-01 <NA> <NA>
10: 10 2016-01-01 a III
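The same non-equi join can also be written as a lookup that returns a new table instead of updating dt1 by reference (a sketch, assuming the Date conversion above has already been done):
dt2[dt1,
    .(ID = i.ID, exit = i.exit, text1 = x.text1, text2 = x.text2),
    on = .(ID, valid_from <= exit, valid_until >= exit)]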
Here is a dplyr solution, created with the help of @akrun; see: dates: Not yet implemented NAbounds=TRUE for this non-numeric and non-character type
library(dplyr)
library(lubridate)
df1 <- left_join(dt1, dt2, by="ID") %>%
mutate(across(c(exit, valid_from, valid_until), dmy)) %>%
rowwise() %>%
mutate(match= +(dplyr::between(exit, valid_from, valid_until))) %>%
group_by(ID) %>%
filter(match==max(match) | is.na(match)) %>%
select(ID, exit, text1, text2) %>%
ungroup()
output:
ID exit text1 text2
<dbl> <date> <chr> <chr>
1 1 2010-12-31 a I
2 2 2021-01-01 c I
3 3 2010-09-30 b III
4 4 2015-12-31 NA NA
5 5 2010-09-30 b I
6 6 2018-10-31 a III
7 7 2016-02-01 a I
8 8 2015-05-01 a II
9 9 2013-09-01 c I
10 10 2016-01-01 a III
You may use fuzzyjoin after changing the dates to Date class.
library(fuzzyjoin)
library(dplyr)
dt1 %>%
mutate(exit = as.Date(exit, '%d/%m/%Y')) %>%
fuzzy_left_join(dt2 %>%
mutate(across(starts_with('valid'), as.Date, '%d/%m/%Y')),
by = c('ID', 'exit' = 'valid_from', 'exit' = 'valid_until'),
match_fun = c(`==`, `>=`, `<=`)) %>%
select(ID = ID.x, exit, text1, text2)
# ID exit text1 text2
#1 1 2010-12-31 a I
#2 2 2021-01-01 c I
#3 3 2010-09-30 b III
#4 4 2015-12-31 <NA> <NA>
#5 5 2010-09-30 b I
#6 6 2018-10-31 a III
#7 7 2016-02-01 a I
#8 8 2015-05-01 a II
#9 9 2013-09-01 <NA> <NA>
#10 10 2016-01-01 a III

Correspondence between values in two dfs in R

I have two data frames to compare. My first df is "sum":
> head(sum)
File_pdb Res1 Chain1 Res2 Chain2
1: 7LD1_CM GLN 81 M ASN 501 C
2: 7LD1_CM TYR 128 M PHE 377 C
3: 7LD1_CM ILE 78 M SER 375 C
4: 7LD1_CM ASN 76 M ALA 372 C
5: 7LD1_CM THR 20 M TYR 369 C
6: 7LD1_CM ARG 408 C LEU 131 M
The second one is "mut"
> head(mut)
RefAA MutAA LineagesCount
1 VAL 3 GLY 3 1
2 LEU 5 PHE 5 2
3 LEU 8 VAL 8 1
4 SER 13 ILE 13 2
5 LEU 18 PHE 18 5
6 THR 20 ILE 20 1
I have to check whether sum$Res1 or sum$Res2 contains values equal to mut$RefAA. If so, I need to append the whole matching row of mut next to the row of sum where Res1 or Res2 matched.
Here is an example:
File_pdb Res1 Chain1 Res2 Chain2 RefAA MutAA LineagesCount
1: 7LD1_CM GLN 81 M ASN 501 C
2: 7LD1_CM TYR 128 M PHE 377 C
3: 7LD1_CM ILE 78 M SER 375 C
4: 7LD1_CM ASN 76 M ALA 372 C
5: 7LD1_CM THR 20 M TYR 369 C THR 20 ILE 20 1
6: 7LD1_CM ARG 408 C LEU 131 M
How can I do this? I tried using merge and join functions, but I'm not very experienced and need to practice more. Can someone help me? Thank you!
I had to fix the data a bit to import it easily. Then you can try a tidyverse approach:
library(tidyverse)
SUM %>%
mutate(index = 1:n()) %>%
pivot_longer(c(Res1, Res2)) %>%
left_join(mutate(MUT, value=RefAA), by = "value") %>%
group_by(index) %>%
fill(MutAA, RefAA, LineagesCount, .direction = "downup") %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value, values_fn = toString) %>%
mutate(which_Res = ifelse(RefAA == Res1, "Res1", "Res2"))
# A tibble: 6 x 10
File_pdb Chain1 Chain2 index RefAA MutAA LineagesCount Res1 Res2 which_Res
<chr> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr>
1 7LD1_CM M C 1 NA NA NA GLN81 ASN501 NA
2 7LD1_CM M C 2 NA NA NA TYR128 PHE377 NA
3 7LD1_CM M C 3 NA NA NA ILE78 SER375 NA
4 7LD1_CM M C 4 NA NA NA ASN76 ALA372 NA
5 7LD1_CM M C 5 THR20 ILE20 1 THR20 TYR369 Res1
6 7LD1_CM C M 6 NA NA NA ARG408 LEU131 NA
The data
SUM <- read.table(text = " File_pdb Res1 Chain1 Res2 Chain2
1: 7LD1_CM GLN81 M ASN501 C
2: 7LD1_CM TYR128 M PHE377 C
3: 7LD1_CM ILE78 M SER375 C
4: 7LD1_CM ASN76 M ALA372 C
5: 7LD1_CM THR20 M TYR369 C
6: 7LD1_CM ARG408 C LEU131 M")
SUM
MUT <- read.table(text = " RefAA MutAA LineagesCount
1 VAL3 GLY3 1
2 LEU5 PHE5 2
3 LEU8 VAL8 1
4 SER13 ILE13 2
5 LEU18 PHE18 5
6 THR20 ILE20 1")
Hope this helps:
do.call(
dplyr::coalesce,
lapply(
c("Res1", "Res2"),
function(x) merge(SUM, MUT, by.x = x, by.y = "RefAA", all.x = TRUE)
)
)
which gives
Res1 File_pdb Chain1 Res2 Chain2 MutAA LineagesCount
1 ARG408 7LD1_CM C LEU131 M <NA> NA
2 ASN76 7LD1_CM M ALA372 C <NA> NA
3 GLN81 7LD1_CM M ASN501 C <NA> NA
4 ILE78 7LD1_CM M SER375 C <NA> NA
5 THR20 7LD1_CM M TYR369 C ILE20 1
6 TYR128 7LD1_CM M PHE377 C <NA> NA
Data
> dput(SUM)
structure(list(File_pdb = c("7LD1_CM", "7LD1_CM", "7LD1_CM",
"7LD1_CM", "7LD1_CM", "7LD1_CM"), Res1 = c("GLN81", "TYR128",
"ILE78", "ASN76", "THR20", "ARG408"), Chain1 = c("M", "M", "M",
"M", "M", "C"), Res2 = c("ASN501", "PHE377", "SER375", "ALA372",
"TYR369", "LEU131"), Chain2 = c("C", "C", "C", "C", "C", "M")), class = "data.frame", row.names = c("1:",
"2:", "3:", "4:", "5:", "6:"))
> dput(MUT)
structure(list(RefAA = c("VAL3", "LEU5", "LEU8", "SER13", "LEU18",
"THR20"), MutAA = c("GLY3", "PHE5", "VAL8", "ILE13", "PHE18",
"ILE20"), LineagesCount = c(1L, 2L, 1L, 2L, 5L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Combining columns in a dataframe each with partial information

I have a large data set which used different coding schemes for the same variables over different time periods. The coding in each time period is represented as a column with values during the year it was active and NA everywhere else.
I was able to "combine" them by using nested ifelse commands together with dplyr's mutate [see edit below], but I am running into a problem using ifelse to do something slightly different. I want to code a new variable based on whether ANY of the previous variables meets a condition. But for some reason, the ifelse construct below does not work.
MWE:
library("dplyr")
library("magrittr")
df <- data.frame(id = 1:12, year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)), varA = c("A","C","A","C","B",rep(NA,7)), varB = c(rep(NA,5),"B","A","C","A","B",rep(NA,2)))
df %>% mutate(varC = ifelse(varA == "C" | varB == "C", "C", "D"))
Output:
> df
id year varA varB varC
1 1 1995 A <NA> <NA>
2 2 1995 C <NA> C
3 3 1995 A <NA> <NA>
4 4 1995 C <NA> C
5 5 1995 B <NA> <NA>
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C C
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
If I don't use the | operator and test against only varA, it comes out as expected, but the result only applies to the years where varA is not NA.
Output:
> df %<>% mutate(varC = ifelse(varA == "C", "C", "D"))
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C <NA>
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
Desired output:
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B D
7 7 1996 <NA> A D
8 8 1996 <NA> C C
9 9 1996 <NA> A D
10 10 1996 <NA> B D
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
How do I get what I'm looking for?
To make this question more applicable to a wider audience, and to learn from this situation, it would be great to have an explanation of what is happening with the | comparison that causes it not to work as expected. Thanks in advance!
EDIT: This is what I meant by successfully combining them with nested ifelses
> df %>% mutate(varC = ifelse(year == 1995, as.character(varA),
+ ifelse(year == 1996, as.character(varB), NA)))
id year varA varB varC
1 1 1995 A <NA> A
2 2 1995 C <NA> C
3 3 1995 A <NA> A
4 4 1995 C <NA> C
5 5 1995 B <NA> B
6 6 1996 <NA> B B
7 7 1996 <NA> A A
8 8 1996 <NA> C C
9 9 1996 <NA> A A
10 10 1996 <NA> B B
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
R has this annoying tendency where the logical value of a condition that involves NA is just NA, rather than TRUE or FALSE.
i.e. NA > 0 gives NA rather than FALSE.
In logical operations, NA behaves like an unknown value: the result is NA unless the other operand already decides it. So TRUE | NA = TRUE and FALSE & NA = FALSE (the NA can't change the outcome), but TRUE & NA = NA and FALSE | NA = NA (the outcome depends on the unknown value).
In that sense NA sits between TRUE and FALSE in the hierarchy of logicals, e.g. NA | TRUE | FALSE = TRUE.
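To see this with the comparison from the question, and why %in% sidesteps it (a quick console illustration, nothing beyond base R):
NA == "C"                    # NA    -- the comparison itself is unknown
NA == "C" | "A" == "C"       # NA    -- so the whole | condition stays unknown
NA %in% "C"                  # FALSE -- %in% treats NA as simply "not found"
NA %in% "C" | "A" %in% "C"   # FALSE -- the condition now resolves, which the %in% answer further below relies on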
So here's a way to hack this:
ifelse((varA=='C' & !is.na(varA)) | (varB=='C' & !is.na(varB)), 'C', 'D')
How do we interpret this? On the left side of the OR: if varA is NA, we get NA & FALSE, and since FALSE & anything is FALSE, the whole term is FALSE. Otherwise, if varA is not NA but is not 'C', we get FALSE & TRUE, which is FALSE as you want. Otherwise, if it is 'C', both are TRUE. The same goes for the term on the right of the OR.
When using a condition that involves x, but x can be NA, I like to use
((condition for x)&!is.na(x)) to completely rule out the NA output and force the TRUE or FALSE values in the situations I want.
EDIT: I just remembered that you want an NA output if both are NA. The expression above doesn't do that (it returns 'D' in that case), so that's my bad, unless you're okay with a 'D' output when they're both NA.
EDIT2: This should output the NAs as you want:
ifelse(is.na(varA)&is.na(varB), NA, ifelse((varA=='C'&!is.na(varA))|(varB=='C'&!is.na(varB)), 'C','D'))
Per @Khashaa's comment, this should do the trick and get you to the desired output:
df %>%
mutate(varC = ifelse(is.na(varA) & is.na(varB), NA,
ifelse(varA %in% "C" | varB %in% "C", "C", "D")))
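For completeness, a shorter sketch of the same logic with dplyr::case_when() (my own addition), which returns NA automatically when no condition matches:
library(dplyr)
df %>%
  mutate(varC = case_when(
    varA %in% "C" | varB %in% "C" ~ "C",
    !is.na(varA) | !is.na(varB)   ~ "D"
  ))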
