R: replace values in a column based on a match between columns

I have two data frames with the same columns. Some columns have the same values in the same order in both data frames (X1, X2 below). Other columns have the same values, but in a different order (Y1). This is only a problem for some levels of the first variable (here, the order of rows in Y1 differs for X1 == "a", but not for X1 == "b"). Example:
df1 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
"X2" = c("1", "2", "3", "1", "2", "3"),
"Y1" = c("d", "d", "f", "g", "h", "i"))
df2 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
"X2" = c("1", "2", "3", "1", "2", "3"),
"Y1" = c("f", "d", "d", "g", "h", "i"))
I would like to change the values of df2$X1 and df2$X2 so that the two data frames are matched on the values of Y1.
I would like to change X1 and X2 rather than Y1 because there are many Y variables. I would like to do this only for rows where X1 == "a".
The output should look like this:
df2 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
"X2" = c("3", "1", "2", "1", "2", "3"),
"Y1" = c("f", "d", "d", "g", "h", "i"))

What is a little tricky in your situation is that you have duplicates in the Y1 columns which correspond to different values in the X2 columns. So you will have to make these unique.
First, make sure that your Y1 columns are character vectors and not factors:
df1 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
                  "X2" = c("1", "2", "3", "1", "2", "3"),
                  "Y1" = c("d", "d", "f", "g", "h", "i"),
                  stringsAsFactors = FALSE)
df2 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
                  "X2" = c("1", "2", "3", "1", "2", "3"),
                  "Y1" = c("f", "d", "d", "g", "h", "i"),
                  stringsAsFactors = FALSE)
Give unique names to your Y1 duplicates:
df1$Y1uniq <- make.unique(df1$Y1)
df2$Y1uniq <- make.unique(df2$Y1)
Then you can use match() on those unique values (and remove that column once you no longer need it):
df1[match(df2$Y1uniq, df1$Y1uniq), ][ , 1:3]
Output:
X1 X2 Y1
3 a 3 f
1 a 1 d
2 a 2 d
4 b 1 g
5 b 2 h
6 b 3 i
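If you would rather keep df2 and only overwrite its X1 and X2 columns, the same make.unique() + match() idea can be written as an assignment. This is a minimal sketch, not part of the answer above; the names key1, key2 and idx are made up for illustration, and it assumes duplicates of Y1 appear in the same relative order in both data frames.

```r
df1 <- data.frame(X1 = c("a", "a", "a", "b", "b", "b"),
                  X2 = c("1", "2", "3", "1", "2", "3"),
                  Y1 = c("d", "d", "f", "g", "h", "i"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X1 = c("a", "a", "a", "b", "b", "b"),
                  X2 = c("1", "2", "3", "1", "2", "3"),
                  Y1 = c("f", "d", "d", "g", "h", "i"),
                  stringsAsFactors = FALSE)

# Disambiguate repeated Y1 values: "d", "d" becomes "d", "d.1"
key1 <- make.unique(df1$Y1)
key2 <- make.unique(df2$Y1)

# For each row of df2, the position of the matching row in df1
idx <- match(key2, key1)

# Copy X1 and X2 across; Y1 (and any other Y columns) stay untouched
df2[c("X1", "X2")] <- df1[idx, c("X1", "X2")]
df2
```

Here df2$X2 ends up as "3", "1", "2", "1", "2", "3", which is the output asked for in the question. Rows with X1 == "b" are unchanged because their Y1 values already line up.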

Related

Finding indirect nodes for every edge (in R)

I have information on groups of physicians working together in given hospitals. A physician can work in more than one hospital at the same time. I would like to write code that outputs all indirect colleagues of a given physician working in a given hospital. For instance, if I work in a given hospital with another physician who also works in another hospital, I would like to know which physicians my colleague works with in that other hospital.
Consider a simple example of three hospitals (1, 2, 3) and five physicians (A, B, C, D, E). Physicians A, B and C work together in hospital 1. Physicians A, B and D work together in hospital 2. Physicians B and E work together in hospital 3.
For each physician working in a given hospital I would like information on their indirect colleagues through each of their direct colleagues. For example, physician A has one indirect colleague through physician B in hospital 1: this is physician E in hospital 3. On the other hand, physician B does not have any indirect colleague through physician A in hospital 1. Physician C has two indirect colleagues through physician B in hospital 1: they are physician D in hospital 2 and physician E in hospital 3. And so on.
Below is the object that describes the networks of physicians in all hospitals:
edges <- tibble(hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
to = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B")) %>% arrange(hosp, from, to)
I would like code that produces the following output:
output <- tibble(hosp = c("1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3"),
from = c("A", "A", "B", "B", "C", "C", "C", "A", "A", "B", "B", "D", "D", "D", "B", "E", "E", "E", "E"),
to = c("C", "B", "C", "A", "B", "A", "B", "D", "B", "A", "D", "A", "B", "B", "E", "B", "B", "B", "B"),
hosp_ind = c("" , "3", "" , "" , "2", "2", "3", "" , "3", "" , "" , "1", "1", "3", "" , "1", "1", "2", "2"),
to_ind = c("" , "E", "" , "" , "D", "D", "E", "" , "E", "" , "" , "C", "C", "E", "" , "A", "C", "A", "D")) %>% arrange(hosp, from, to)
Here is one option using igraph + data.table
library(igraph)
library(data.table)
g <- simplify(graph_from_data_frame(edges, directed = FALSE))
res <- setDT(edges)[
  ,
  c(.SD, {
    to_ind <- setdiff(
      do.call(
        setdiff,
        Map(names, ego(g, 2, c(to, from), mindist = 2))
      ), from
    )
    if (!length(to_ind)) {
      hosp_ind <- to_ind <- NA_character_
    } else {
      hosp_ind <- lapply(to_ind, function(v) names(neighbors(g, v)))
    }
    data.table(
      hosp_ind = unlist(hosp_ind),
      to_ind = rep(to_ind, lengths(hosp_ind))
    )
  }),
  .(id = seq(nrow(edges)))
][, id := NULL][]
and you will obtain
> res
hosp from to hosp_ind to_ind
1: 1 A B 3 E
2: 1 A C <NA> <NA>
3: 1 B A <NA> <NA>
4: 1 B C <NA> <NA>
5: 1 C A 2 D
6: 1 C B 2 D
7: 1 C B 3 E
8: 2 A B 3 E
9: 2 A D <NA> <NA>
10: 2 B A <NA> <NA>
11: 2 B D <NA> <NA>
12: 2 D A 1 C
13: 2 D B 1 C
14: 2 D B 3 E
15: 3 B E <NA> <NA>
16: 3 E B 1 A
17: 3 E B 2 A
18: 3 E B 1 C
19: 3 E B 2 D
Also, running plot(g) draws the graph, with hospitals and physicians as nodes.
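For comparison, the rule described in the question can also be sketched in base R as a self-join. This is a different technique from the igraph answer above, and it rests on my reading of the question: an indirect colleague of `from` through `to` is someone who works with `to` in another hospital and is not already a direct colleague of `from`. The names hop2 and direct are made up for illustration. Unlike the output above, this sketch returns only the edges that have at least one indirect colleague.

```r
edges <- data.frame(
  hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
  from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
  to   = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B"),
  stringsAsFactors = FALSE
)

# Second hop: pair every edge from -> to with every edge to -> to_ind
hop2 <- merge(edges,
              setNames(edges, c("hosp_ind", "to", "to_ind")),
              by = "to")

# Keep hops into a different hospital that do not lead back to `from`
hop2 <- hop2[hop2$hosp_ind != hop2$hosp & hop2$to_ind != hop2$from, ]

# Drop candidates who already work directly with `from` somewhere
direct <- unique(paste(edges$from, edges$to))
hop2 <- hop2[!paste(hop2$from, hop2$to_ind) %in% direct, ]

hop2 <- hop2[order(hop2$hosp, hop2$from, hop2$to, hop2$hosp_ind),
             c("hosp", "from", "to", "hosp_ind", "to_ind")]
rownames(hop2) <- NULL
hop2
```

On the example data this gives 12 rows, for instance hosp 1, A through B reaching E in hospital 3. To get one row per original edge, a left join back onto edges with merge(..., all.x = TRUE) would fill the remaining rows with NA.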

How to match columns both forward and reverse direction in a data_frame using r

I have two data frames.
df1:
P_1 P_2
1 Anb Bmn
2 Cvd Dbn
3 Elf Fish
4 Goat Hen
5 Ink Jelly
6 Kin Lion
7 ACAN HSPG
8 HSPG2 COL6A2
df2:
P_1 P_2 Value
1 Anb Bmn 12
2 Dbn Cvd 31
3 Elf Fish 15
4 Goat Hen 98
5 Jelly Ink 78
6 Kin Lion 56
7 HSPG ACAN 89
I tried to merge these two data frames on P_1 and P_2 using the following command:
e<-merge(df1,df2, by=c("P_1","P_2"),all.x=TRUE)
But for rows 2, 5 and 7 I got NA, because the order of the two columns is swapped. I need the value in the output even when the order is swapped. How do I achieve this?
Data
df1 <- structure(list(P_1 = c("Anb", "Cvd", "Elf", "Goat", "Ink", "Kin","ACAN","HSPG2"), P_2 = c("Bmn", "Dbn", "Fish", "Hen", "Jelly", "Lion","HSPG","COL6A2")), class = "data.frame",row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
df2 <- structure(list(P_1 = c("Anb", "Dbn", "Elf", "Goat", "Jelly", "Kin","HSPG"), P_2 = c("Bmn","Cvd", "Fish", "Hen", "Ink", "Lion","ACAN"), Value = c(12L, 31L, 15L, 98L, 78L,56L,89L)), class = "data.frame", row.names = c("1", "2", "3","4","5", "6","7"))
Any help would be appreciated..
If the order within each pair should not matter, sort the values within each row of both datasets
df1new <- df1
df1new[] <- t(apply(df1, 1, sort))
df2new <- df2
df2new[1:2] <- t(apply(df2new[1:2], 1, sort))
and now do the merge
merge(df1new, df2new, all.x = TRUE)
data
df1 <- structure(list(P_1 = c("A", "C", "E", "G", "I", "K", "z", "w"), P_2 = c("B", "D", "F", "H", "J", "L", "b", "c")), class = "data.frame",row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
df2 <- structure(list(P_1 = c("A", "D", "E", "H", "J", "K"), P_2 = c("B", "C", "F", "G", "I", "L"), Value = c(12L, 31L, 15L, 98L, 78L, 56L)), class = "data.frame", row.names = c("1", "2", "3", "4","5", "6"))
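An alternative sketch that leaves the original P_1 and P_2 columns as they are: build an order-independent key with pmin() and pmax(), which work element-wise on character vectors, and merge on that key. This is not the method from the answer above; the helper name add_key and the key columns k1 and k2 are made up for illustration, and the data is a shortened version of the question's.

```r
df1 <- data.frame(P_1 = c("Anb", "Cvd", "Ink"),
                  P_2 = c("Bmn", "Dbn", "Jelly"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(P_1 = c("Anb", "Dbn", "Jelly"),
                  P_2 = c("Bmn", "Cvd", "Ink"),
                  Value = c(12L, 31L, 78L),
                  stringsAsFactors = FALSE)

# Hypothetical helper: k1 is the smaller and k2 the larger of each pair,
# so ("Dbn", "Cvd") and ("Cvd", "Dbn") get the same key
add_key <- function(d) transform(d, k1 = pmin(P_1, P_2), k2 = pmax(P_1, P_2))

merged <- merge(add_key(df1),
                add_key(df2)[c("k1", "k2", "Value")],
                by = c("k1", "k2"), all.x = TRUE)
merged[c("P_1", "P_2", "Value")]
```

The result keeps df1's original column order within each row, and the swapped pairs (Cvd/Dbn and Ink/Jelly) now pick up their values, 31 and 78.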

How to pull the column indices when matching the rows of a dataframe and a vector

Say I have a dataframe of letters like so:
X1 X2 X3
1 G A C
2 G T C
3 G T C
4 A T G
5 A C G
And a vector like so:
ref <- c("A", "C", "C", "A", "G")
Going row-wise, how do I pull the column indices of the dataframe which match the vector?
So the answer should be a vector of numbers like so:
2, 3, 3, 1, 3
We can use
max.col(df1 == ref)
#[1] 2 3 3 1 3
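One caveat: max.col() breaks ties at random by default, so a row with several matches, or with none at all, gives an arbitrary column. A slightly more defensive sketch, assuming you want the first match and NA when nothing matches (hits and idx are illustrative names):

```r
df1 <- data.frame(X1 = c("G", "G", "G", "A", "A"),
                  X2 = c("A", "T", "T", "T", "C"),
                  X3 = c("C", "C", "C", "G", "G"),
                  stringsAsFactors = FALSE)
ref <- c("A", "C", "C", "A", "G")

# ref is recycled down each column, so row i is compared with ref[i]
hits <- as.matrix(df1) == ref

# First matching column per row
idx <- max.col(hits, ties.method = "first")

# Rows without any match would otherwise report column 1
idx[rowSums(hits) == 0] <- NA_integer_
idx
#[1] 2 3 3 1 3
```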
data
df1 <- structure(list(X1 = c("G", "G", "G", "A", "A"), X2 = c("A", "T",
"T", "T", "C"), X3 = c("C", "C", "C", "G", "G")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5"))

How to apply separate_rows() to all columns, passing a `sep` parameter?

Pretty straightforward: I have a data frame where the values in many columns need to be split into their own rows, using ; as the delimiter.
After reading a bit,
df %>%
Reduce(separate_rows_, x = colnames)
works, except that I can't pass the sep parameter (so it also separates by white spaces, commas, and other non-alphanumeric chars).
One answer proposed writing a modified version of the function that includes the parameter, but I couldn't get that working:
Reduce(f = function(y) separate_rows_(sep = ";"), x = colnames)
What am I doing wrong?
Having said that, my ideal solution would be a tidyverse solution, if it's cleaner (maybe map_dfr?); but obviously any solution is better than none :).
Here's sample data:
structure(list(q1 = c("1,2,3,4", "2,4"), q2 = c("a,b", "e,f"),
q3 = c("c,d", "g,h,z")), row.names = 1:2, class = "data.frame")
Expected output:
structure(list(q1 = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "2", "2", "2", "2", "2",
"2", "4", "4", "4", "4", "4", "4"), q2 = c("a", "a", "b", "b",
"a", "a", "b", "b", "a", "a", "b", "b", "a", "a", "b", "b", "e",
"e", "e", "f", "f", "f", "e", "e", "e", "f", "f", "f"), q3 = c("c",
"d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d",
"c", "d", "g", "h", "z", "g", "h", "z", "g", "h", "z", "g", "h",
"z")), row.names = c(NA, -28L), class = "data.frame")
The process I want to streamline is not having to pass every column name like so:
output <- test %>%
separate_rows(q1, sep = ",") %>%
separate_rows(q2, sep = ",") %>%
separate_rows(q3, sep = ",")
You can use purrr::reduce, which applies the given function .f to .init and the first element of .x, then applies the function to the output of that and the second element of .x, etc. until all elements of .x have been used.
Within the .f argument formula, .x is the previous output (or .init for the first run) and .y is the given element of the .x argument to reduce.
library(tidyverse)
reduce(.init = df, .x = names(df), .f = ~separate_rows(.x, .y, sep = ','))
# equiv to: reduce(.init = df, .x = names(df), .f = separate_rows, sep = ',')
As akrun notes in the comments, the same thing can be done with base R's Reduce in place of purrr::reduce (same output):
Reduce(function(x, y) separate_rows(x, y, sep=","), names(df), init = df)
# q1 q2 q3
# 1 1 a c
# 2 1 a d
# 3 1 b c
# 4 1 b d
# 5 2 a c
# 6 2 a d
# 7 2 b c
# 8 2 b d
# 9 3 a c
# 10 3 a d
# 11 3 b c
# 12 3 b d
# 13 4 a c
# 14 4 a d
# 15 4 b c
# 16 4 b d
# 17 2 e g
# 18 2 e h
# 19 2 e z
# 20 2 f g
# 21 2 f h
# 22 2 f z
# 23 4 e g
# 24 4 e h
# 25 4 e z
# 26 4 f g
# 27 4 f h
# 28 4 f z
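To make the accumulator mechanics concrete without any packages, here is a base R sketch of the same pattern. The helper split_col is hypothetical (it is not a tidyr function); it splits one column on a literal delimiter using strsplit(..., fixed = TRUE), and Reduce() threads the data frame through it once per column name, starting from init.

```r
df <- data.frame(q1 = c("1,2,3,4", "2,4"),
                 q2 = c("a,b", "e,f"),
                 q3 = c("c,d", "g,h,z"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: one output row per piece of column `col`
split_col <- function(d, col, sep = ",") {
  pieces <- strsplit(d[[col]], sep, fixed = TRUE)
  # Repeat each row once per piece, then overwrite the split column
  out <- d[rep(seq_len(nrow(d)), lengths(pieces)), , drop = FALSE]
  out[[col]] <- unlist(pieces)
  rownames(out) <- NULL
  out
}

# acc is the data frame so far, col is the next column name
res <- Reduce(function(acc, col) split_col(acc, col, sep = ","),
              names(df), init = df)
nrow(res)
# 28
```

Because sep is matched literally here, changing it to ";" splits on semicolons only, which is the behaviour the question was after.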

comparing columns of two dataframes and get the deviation point in R

I have 2 dataframes:
> dput(DF1)
structure(c("a", "b", "c", "d", "e", "f", "g"), .Dim = c(1L,
7L), .Dimnames = list("1", c("seq1", "seq2", "seq3", "seq4",
"seq5", "seq6", "seq7")))
> dput(DF2)
structure(list(seq1 = c("a", "a", "a", "a", "a"), seq2 = c("b",
"d", "d", "d", "b"), seq3 = c("c", "c", "c", "c", "c"), seq4 = c("e",
"e", "d", "d", "d"), seq5 = c("f", "f", "f", "g", "e"), seq6 = c("g",
"g", "g", "g", "g"), seq7 = c("g", "g", "g", "g", "g"), UserId = c("1",
"2", "3", "4", "5")), .Names = c("seq1", "seq2", "seq3", "seq4",
"seq5", "seq6", "seq7", "UserId"), row.names = c(NA, -5L), class = "data.frame")
I want to compare the two datasets above. For example, User 1 in DF2 deviated to e (instead of going to d, he went to e). DF1 is the correct, defined sequence.
So in the end I need to build a data frame that meets the requirements below:
> dput(required_dataframe)
structure(list(UserID = c("1", "2", "3", "4", "5"), Deviation = c("e",
"d", "d", "d", "g"), Actual_sequence = c("d", "b", "b", "b",
"f")), .Names = c("UserID", "Deviation", "Actual_sequence"), row.names = c(NA,
-5L), class = "data.frame")
For instance, user 1 deviated to point e (it should have gone to d). So for every user I need the deviation point along with the value from the actual sequence.
[Images of DF1, DF2 and the required data frame were attached.]
Once the two matrices line up, you can compare them cell by cell and find where they differ. You can then take the first mismatch in each row and use it as a selection:
sel <- cbind(
seq_len(nrow(DF2)),
max.col(t(t(DF2[seq_along(DF1)]) != c(DF1)), "first")
)
cbind(DF2["UserId"], Deviation=DF2[sel], Actual=DF1[sel[,2]])
# UserId Deviation Actual
#1 1 e d
#2 2 d b
#3 3 d b
#4 4 d b
#5 5 g f
The core of the comparison is this part, where you can see each cell being lined up:
t(DF2[seq_along(DF1)]) != c(DF1)
# [,1] [,2] [,3] [,4] [,5]
#seq1 FALSE FALSE FALSE FALSE FALSE
#seq2 FALSE TRUE TRUE TRUE FALSE
#seq3 FALSE FALSE FALSE FALSE FALSE
#seq4 TRUE TRUE FALSE FALSE FALSE
#seq5 TRUE TRUE TRUE TRUE FALSE
#seq6 TRUE TRUE TRUE TRUE TRUE
#seq7 FALSE FALSE FALSE FALSE FALSE
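If the double transpose is hard to follow, the same first-mismatch logic can be spelled out row by row. This is a sketch of an equivalent approach, not the answer's code; expected, actual and first_diff are illustrative names. A user with no deviation at all would get NA here.

```r
DF1 <- matrix(c("a", "b", "c", "d", "e", "f", "g"), nrow = 1,
              dimnames = list("1", paste0("seq", 1:7)))
DF2 <- data.frame(seq1 = c("a", "a", "a", "a", "a"),
                  seq2 = c("b", "d", "d", "d", "b"),
                  seq3 = c("c", "c", "c", "c", "c"),
                  seq4 = c("e", "e", "d", "d", "d"),
                  seq5 = c("f", "f", "f", "g", "e"),
                  seq6 = c("g", "g", "g", "g", "g"),
                  seq7 = c("g", "g", "g", "g", "g"),
                  UserId = c("1", "2", "3", "4", "5"),
                  stringsAsFactors = FALSE)

expected <- c(DF1)                       # the reference sequence as a vector
actual <- as.matrix(DF2[colnames(DF1)])  # one row per user, same column order

# Position of the first step where each user leaves the reference sequence
first_diff <- unname(apply(actual, 1, function(r) which(r != expected)[1]))

result <- data.frame(
  UserId = DF2$UserId,
  Deviation = actual[cbind(seq_len(nrow(actual)), first_diff)],
  Actual = expected[first_diff],
  stringsAsFactors = FALSE
)
result
```

For the example data first_diff is 4, 2, 2, 2, 6, which reproduces the Deviation and Actual columns shown above.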
