This question already has an answer here:
Create a group index for values connected directly and indirectly
(1 answer)
Closed 4 years ago.
I'm trying to link together pairs of unique IDs using R. Given the example below, I have two IDs (here ID1 and ID2) that indicate linkage. I'm trying to create groups of rows that are linked. In this example A is linked to B which is linked to D which is linked to E. Because these are all connected, I want to group them together. Next, there is also X which is linked to both Y and Z. Because these two are also connected, I want to assign them to a single group as well. How can I tackle this using R?
Thanks!
Example data:
ID1 ID2
A B
B D
D E
X Y
X Z
DPUT R representation
structure(list(id1 = structure(c(1L, 2L, 3L, 4L, 4L), .Label = c("A", "B", "D", "X"), class = "factor"), id2 = structure(1:5,.Label = c("B", "D", "E", "Y", "Z"), class = "factor")), .Names = c("id1", "id2"), row.names = c(NA, -5L), class = "data.frame")
Output needed:
ID1 ID2 GROUP
A B 1
B D 1
D E 1
X Y 2
X Z 2
As per mentionned by #Frank in the comments, you can use igraph:
library(igraph)
idf <- graph.data.frame(df)
clusters(idf)$membership
Which gives:
A B D X E Y Z
1 1 1 2 1 2 2
Should you want to assign the result back to rows of df:
merge(df, stack(clusters(idf)$membership), by.x = "id1", by.y = "ind", all.x = TRUE)
Related
Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename columns of df1 using df2, except column G since it not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA, as you see column G become NA. I hope it keep the original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
By using na.omit(it's little bit messy..)
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I have success to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is location of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7, which index does not exist in df1 that has length 5.
For df1, we should change order of terms in match, na.omit(match(df2$Col1,names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This will works.
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = df3$Col1, .fn = function(x) df3$Col2[df3$Col1 %in% x])
df4
# E Q R Z G
# 1 1 2 3 4 5
This question already has an answer here:
Create a group index for values connected directly and indirectly
(1 answer)
Closed 2 years ago.
I have a dataset with two variables. As a simple example:
df <- rbind(c("A",1),c("B",2),c("C",2),c("C",3),c("D",4),c("D",5),c("E",1))
I would like to group them by the first component or the second, the desired output would be a third column with the following values:
c(1,2,2,2,3,3,1)
If I use dplyr and group_by and cur_group_id(), I would get groups by the first and second component, obtaining therefore
c(1,2,3,4,5,6,7)
Can anyone help me in an easy way, it could be either base R, dplyr or data.table, to obtain the desired group?
Thank you
Perhaps igraph could be a helpful tool for you
library(igraph)
df$grp <- membership(components(graph_from_data_frame(df, directed = FALSE)))[df$X1]
which gives
> df
X1 X2 grp
1 A 1 1
2 B 2 2
3 C 2 2
4 C 3 2
5 D 4 3
6 D 5 3
7 E 1 1
Data
> dput(df)
structure(list(X1 = c("A", "B", "C", "C", "D", "D", "E"), X2 = c(1L,
2L, 2L, 3L, 4L, 5L, 1L)), row.names = c(NA, -7L), class = "data.frame")
Good evening,
I have a two columns tab separated .txt file, as the following:
number letter
1 a
1 b
2 a
2 b
3 b
I would like to collapse rows where the column "number" has identical value, by creating a comma separated value in the corresponding column "letter".
In other words, this should be the output:
number letter
1 a,b
2 a,b
3 b
I have looked up the web but I did not find an actual solution.
Thank you in advance,
Giuseppe
We can use aggregate in base R
aggregate(letter ~ number, df1, FUN = paste, collapse=",")
-output
# number letter
#1 1 a,b
#2 2 a,b
#3 3 b
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
group_by(number) %>%
summarise(letter = str_c(letter, collapse=","))
data
df1 <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
We can also combine aggregate() with toString:
#Code
newdf <- aggregate(letter~.,df,toString)
Output:
number letter
1 1 a, b
2 2 a, b
3 3 b
Some data:
#Data
df <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
If two data frames are
symbol wgt
1 A 2
2 C 4
3 D 6
symbol wgt
1 A 20
2 D 10
how can I add them so that missing observations for a "symbol" in either data frame are treated as zero, giving
symbol wgt
1 A 22
2 C 4
3 D 16
You can join the two dataframes by symbol , replace NA with 0 and add the two weights.
library(dplyr)
df1 %>%
left_join(df2, by = 'symbol') %>%
mutate(wgt.y = replace(wgt.y, is.na(wgt.y), 0),
wgt = wgt.x + wgt.y) %>%
select(-wgt.x, -wgt.y)
# symbol wgt
#1 A 22
#2 C 4
#3 D 16
data
df1 <- structure(list(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L)),
class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(symbol = c("A", "D"), wgt = c(20L, 10L)),
class = "data.frame", row.names = c(NA, -2L))
Try this one line solution by pipes:
#Data
library(dplyr)
df1 <- structure(list(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L)), class = "data.frame", row.names = c("1",
"2", "3"))
df2 <- structure(list(symbol = c("A", "D"), wgt = c(20L, 10L)), class = "data.frame", row.names = c("1",
"2"))
#Code
df1 %>% left_join(df2,by = 'symbol') %>% mutate(wgt = rowSums(.[-1],na.rm=T)) %>% select(c(1,4))
symbol wgt
1 A 22
2 C 4
3 D 16
With data.table and the data provided in the answer of #RonakShah and #Duck the solution could be a simple aggregation:
# Convert data.frame to data.table (very fast since inplace)
setDT(df1)
setDT(df2)
# combine both data.frames into one data.frame, group by symbol, apply the sum (NAs are ignored = counted as zero)
rbind(df1,df2)[, sum(wgt, na.rm = TRUE), by = symbol]
# Output
symbol V1
1: A 22
2: C 4
3: D 16
Note: If you want to use base R only (without data.table) you could use aggregate instead:
aggregate(wgt ~ symbol, rbind(df1,df2), sum)
This question already has answers here:
Create unique identifier from the interchangeable combination of two variables
(2 answers)
Closed 6 years ago.
I have a dataframe of 3 columns
A B 1
A B 1
A C 1
B A 1
I want to aggregate it such that it considers combinations A-B and B-A to be the same, resulting in
A B 3
A C 1
How do I go about this?
Use pmin and pmax on the first two columns and then do the group-by-count:
library(dplyr);
df %>% group_by(G1 = pmin(V1, V2), G2 = pmax(V1, V2)) %>% summarise(Count = sum(V3))
Source: local data frame [2 x 3]
Groups: G1 [?]
G1 G2 Count
(chr) (chr) (int)
1 A B 3
2 A C 1
Corresponding data.table solution would be:
library(data.table)
setDT(df)
df[, .(Count = sum(V3)), .(G1 = pmin(V1, V2), G2 = pmax(V1, V2))]
G1 G2 Count
1: A B 3
2: A C 1
Data:
structure(list(V1 = c("A", "A", "A", "B"), V2 = c("B", "B", "C",
"A"), V3 = c(1L, 1L, 1L, 1L)), .Names = c("V1", "V2", "V3"), row.names = c(NA,
-4L), class = "data.frame")