Creating an identifier using pairs of row indices [duplicate] - r

I would like to generate indices to group observations based on two columns, where a group is made of observations that share at least one value in common.
In the data below, I want to check if values in 'G1' and 'G2' are connected directly (appear on the same row), or indirectly via other intermediate values. The desired grouping variable is shown in 'g'.
For example, A is directly linked to Z (row 1) and X (row 2). A is indirectly linked to 'B' via X (A -> X -> B), and further linked to Y via X and B (A -> X -> B -> Y).
dt <- data.frame(id = 1:10,
G1 = c("A","A","B","B","C","C","C","D","E","F"),
G2 = c("Z","X","X","Y","W","V","U","s","T","T"),
g = c(1,1,1,1,2,2,2,3,4,4))
dt
# id G1 G2 g
# 1 1 A Z 1
# 2 2 A X 1
# 3 3 B X 1
# 4 4 B Y 1
# 5 5 C W 2
# 6 6 C V 2
# 7 7 C U 2
# 8 8 D s 3
# 9 9 E T 4
# 10 10 F T 4
I tried group_indices() from dplyr, but haven't managed to make it work.

Using igraph: get the component membership, then map it back onto the names.
library(igraph)
# the question's data without the expected 'g' column
df1 <- dt[, c("id", "G1", "G2")]
# convert to graph, and get cluster membership ids
g <- graph_from_data_frame(df1[, c(2, 3, 1)])
myGroups <- components(g)$membership
myGroups
# A B C D E F Z X Y W V U s T
# 1 1 2 3 4 4 1 1 1 2 2 2 3 4
# then map on names
df1$group <- myGroups[df1$G1]
df1
# id G1 G2 group
# 1 1 A Z 1
# 2 2 A X 1
# 3 3 B X 1
# 4 4 B Y 1
# 5 5 C W 2
# 6 6 C V 2
# 7 7 C U 2
# 8 8 D s 3
# 9 9 E T 4
# 10 10 F T 4
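A quick sanity check (my own addition, not part of the original answer): since the two values on each row are joined by an edge, mapping the membership vector on G2 should give the same grouping as mapping it on G1.
# assumes myGroups and df1 from the code above
identical(unname(myGroups[as.character(df1$G1)]),
          unname(myGroups[as.character(df1$G2)]))
# [1] TRUE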

Related

Compare values in a grouped data frame with corresponding value in a vector

Let's say I have a data.frame like the following:
u <- as.numeric(rep(1:5, 3))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now, for each group [i], I want to find the first row in that group where the value in column "u" is bigger than the value [i] from vector "v".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function(m) {
  first(which(m[, 2] > v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector by creating a data.frame with 'w' as the unique values from the 'w' column of 'q', then do a group_by on 'w' and get the first row index where 'u' is greater than the corresponding joined value.
library(dplyr)
q %>%
  left_join(data.frame(w = unique(q$w), new = v)) %>%
  group_by(w) %>%
  summarise(n = which(u > new)[1])
# or use findInterval:
# summarise(n = findInterval(new[1], u) + 1)
Output:
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
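To illustrate the commented-out findInterval() alternative (my own note, using group "a", where new = 2 and u = 1:5): findInterval() counts how many values of 'u' are less than or equal to 'new', so adding 1 gives the first index where 'u' exceeds it. This relies on 'u' being sorted within each group.
findInterval(2, 1:5)      # how many u values are <= 2
# [1] 2
findInterval(2, 1:5) + 1  # first index where u > 2
# [1] 3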
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
The OP was concerned that the comparison starts from the beginning of the whole data. That is not the case because of the group_by operation: if we create a column of sequence numbers, it resets at each group.
q %>%
  left_join(data.frame(w = unique(q$w), new = v)) %>%
  group_by(w) %>%
  mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2
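One caveat worth noting (my own addition, not from the original answer): which.max() returns 1 when no element is TRUE, so a group where the condition never holds would silently report row 1, whereas the which(u > new)[1] form above returns NA.
which.max(c(FALSE, FALSE, FALSE))  # no TRUE, still returns 1
# [1] 1
which(c(FALSE, FALSE, FALSE))[1]   # returns NA instead
# [1] NA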

Generate pairwise data.frame of all combinations of two data.frames with different numbers of rows

I have two data frames, a and b, that I want to combine into a final data frame c:
a <- data.frame(city=c("a","b","c"),detail=c(1,2,3))
b <- data.frame(city=c("x","y"),detail=c(5,6))
The data frame c should look like this:
city.a detail.a city.b detail.b
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
I think I could use crossing from tidyr but for crossing(a,b) I get:
error: Column names `city`, `detail` must not be duplicated.
Use .name_repair to specify repair.
Yes, crossing is the right function, but as the error message says, the column names must not be duplicated, so rename the columns first:
names(a) <- paste0(names(a), ".a")
names(b) <- paste0(names(b), ".b")
tidyr::crossing(a, b)
# city.a detail.a city.b detail.b
# <fct> <dbl> <fct> <dbl>
#1 a 1 x 5
#2 a 1 y 6
#3 b 2 x 5
#4 b 2 y 6
#5 c 3 x 5
#6 c 3 y 6
crossing is a wrapper around expand_grid, so after correcting the names you can also use that directly:
tidyr::expand_grid(a, b)
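A small note on the difference (my own addition): crossing() de-duplicates and sorts its inputs, while expand_grid() keeps them exactly as given, so they can differ when an input has duplicates or is unsorted. For the unique, already-ordered rows of a and b above, both give the same six rows.
tidyr::crossing(x = c(2, 1, 1))     # 2 rows: 1, 2
tidyr::expand_grid(x = c(2, 1, 1))  # 3 rows: 2, 1, 1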
Here is a base R solution by using rep() + cbind(), which gives duplicated column names:
C <- `row.names<-`(cbind(a[rep(seq(nrow(a)), each = nrow(b)), ], b), NULL)
such that
> C
city detail city detail
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
Or get a data frame having different column names by using data.frame():
C <- data.frame(a[rep(seq(nrow(a)), each = nrow(b)), ], b, row.names = NULL)
such that
> C
city detail city.1 detail.1
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
With base R, we can use merge; after renaming one set of columns there are no common column names left, so merge returns the Cartesian product:
merge(setNames(a, paste0(names(a), ".a")), b)
# city.a detail.a city detail
#1 a 1 x 5
#2 b 2 x 5
#3 c 3 x 5
#4 a 1 y 6
#5 b 2 y 6
#6 c 3 y 6
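Another base R sketch (my own variant, assuming the original a and b before any renaming): build all pairs of row indices with expand.grid() and bind the matching rows.
idx <- expand.grid(j = seq_len(nrow(b)), i = seq_len(nrow(a)))  # j varies fastest
C <- data.frame(a[idx$i, ], b[idx$j, ], row.names = NULL)
names(C) <- c("city.a", "detail.a", "city.b", "detail.b")
C  # same six rows as shown above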

R: delete father's rows based on sons in hierarchical data

I'm working with some data like these:
id <- c(1,1,1,2,2,2,3,3,3,4,4) # fathers
name <- c('a','b','k','b','e','g','e','f','k','f','u') # sons
data <- data.frame(id,name)
data
> data
id name
1 1 a
2 1 b
3 1 k
4 2 b
5 2 e
6 2 g
7 3 e
8 3 f
9 3 k
10 4 f
11 4 u
My goal is this: if even one son is one I do not want, remove all the rows with the same father as the disliked son. For example, if I don't like the son e, the result should be:
> data_e
id name
1 1 a
2 1 b
3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
10 4 f
11 4 u
because the groups with id 2 and 3 have e in their name column.
This could also be a task like "I do not like e and f together":
> data_eandf
id name
1 1 a
2 1 b
3 1 k
4 2 b
5 2 e
6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
10 4 f
11 4 u
Or, "I don't want you if you have e or f":
> data_eorf
id name
1 1 a
2 1 b
3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
# 10 4 f
# 11 4 u
To be clearer, I've "commented out" the rows that must be deleted.
I've searched, but most of the questions I found work on a single column, like data[which(data$name=='e'),], which removes rows only at the sons' level, not all the rows of the corresponding father. I've also thought about reshaping to wide format, pasting all the names of an id into a single cell, and checking for e with grepl(), but that seems problematic with a large dataset (these data are just an example).
Do you have any idea how to handle this?
Thanks in advance
Here's a function to handle the different cases
dislike1 <- c('e')
dislike2 <- c('e', 'f')
myfun <- function(df, dislike, ops = NULL) {
  require(dplyr)
  if (is.null(ops) || ops == 'OR') {
    df %>%
      group_by(id) %>%
      filter(!any(name %in% dislike)) %>%
      ungroup
  } else if (ops == 'AND') {
    df %>%
      group_by(id) %>%
      filter(!all(dislike %in% name)) %>%
      ungroup
  }
}
myfun(data, dislike1)
# A tibble: 5 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
# 4 4 f
# 5 4 u
myfun(data, dislike2, 'AND')
# A tibble: 8 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 4 f
# 8 4 u
myfun(data, dislike2, 'OR')
# A tibble: 3 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
data[!(data$id %in% unique(data[data$name == 'e', 'id'])),]
unique(data[data$name == 'e', 'id']) gets the unique ids that have 'e' in the name field. Then you can use the %in% operator to find all the rows with those ids; the ! is a negation operator.
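The same %in% idea can be extended to the other two cases in base R (my own sketch, assuming the data from the question):
# ids having e OR f
bad_or  <- unique(data$id[data$name %in% c("e", "f")])
# ids having both e AND f
bad_and <- Reduce(intersect, lapply(c("e", "f"),
                                    function(s) data$id[data$name == s]))
data[!(data$id %in% bad_or), ]   # drops ids 2, 3 and 4
data[!(data$id %in% bad_and), ]  # drops only id 3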
I have a data.table solution
require(data.table)
id <- c(1,1,1,2,2,2,3,3,3,4,4) # fathers
name <- c('a','b','k','b','e','g','e','f','k','f','u') # sons
data <- data.table(id,name)
# names to be deleted
to_del <- c("e","f")
# returns only id's without any of the names to be deleted
data[, .SD[!any(name %in% to_del), name], by = "id"]
id V1
1: 1 a
2: 1 b
3: 1 k
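A small variant (my own note): returning .SD instead of a single column keeps the proper column name rather than V1, and would also carry along any additional columns.
data[, if (!any(name %in% to_del)) .SD, by = id]
#    id name
# 1:  1    a
# 2:  1    b
# 3:  1    k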

R: fill a new column in a data frame with a value by matching variables in reverse

I apologize for the title of this question; I can't figure out a good way to briefly describe what I want to do.
I have something like this, with >8000 rows:
x y value_xy
A B 7
A C 2
B A 3
B C 6
C A 2
C B 1
I want to create a new column, value_yx, that looks like this:
x y value_xy value_yx
A B 7 3
A C 2 2
B A 3 7
B C 6 1
C A 2 2
C B 1 6
For each (x, y) pair, I want the new column to hold the value_xy of the reversed pair (y, x), i.e., the value from the row where x and y are swapped. Sometimes these values are equal, other times they aren't.
I have explored using for loops, ave(), and several other functions, but I haven't been able to make it work.
Try merge. The by.x and by.y arguments specify columns to be matched, and here the order of matching columns is reversed in by.y:
merge(x = df, y = df, by.x = c("x", "y"), by.y = c("y", "x"))
# x y value_xy.x value_xy.y
# 1 A B 7 3
# 2 A C 2 2
# 3 B A 3 7
# 4 B C 6 1
# 5 C A 2 2
# 6 C B 1 6
Looks like I was beaten to it, but here is an alternative solution with mapply:
df$value_yx <- mapply(function(x_flip, y_flip) {
  df[df$x == y_flip & df$y == x_flip, ]$value_xy
}, df$x, df$y)
# x y value_xy value_yx
#1 A B 7 3
#2 A C 2 2
#3 B A 3 7
#4 B C 6 1
#5 C A 2 2
#6 C B 1 6
xtabs will return a value matrix that can be indexed by a two-column, character-valued matrix formed from the first two columns, which are probably factors (hence the need for the as.character() conversion):
> dfrm$value_yx <- xtabs(value_xy ~ x + y, dfrm)[sapply(dfrm[2:1], as.character)]
> dfrm
x y value_xy value_yx
1 A B 7 3
2 A C 2 2
3 B A 3 7
4 B C 6 1
5 C A 2 2
6 C B 1 6
See what is being indexed:
> xtabs(value_xy~x+y, dfrm)
y
x A B C
A 0 7 2
B 3 0 6
C 2 1 0
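Another base R option (my own sketch, assuming the data frame df from the mapply answer above): build a text key for each (x, y) pair and look up the reversed key with match().
df$value_yx <- df$value_xy[match(paste(df$y, df$x), paste(df$x, df$y))]
df
#   x y value_xy value_yx
# 1 A B        7        3
# 2 A C        2        2
# 3 B A        3        7
# 4 B C        6        1
# 5 C A        2        2
# 6 C B        1        6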

How to get the remaining data frame in R after randomly subsetting the data

I took a random sample from a data frame. But I don't know how to get the remaining data frame.
df <- data.frame(x=rep(1:3,each=2),y=6:1,z=letters[1:6])
#select 3 random rows
df[sample(nrow(df),3)]
What I want is to get the remaining data frame with the other 3 rows.
sample produces a different result each time you run it, thus if you want to reproduce its results you will either need to use set.seed or save its output in a variable.
Addressing your question, you simply need to add - before your index in order to get the rest of the data set.
Also, don't forget to add a comma after indx if you want to select rows (unlike in your question):
set.seed(1)
indx <- sample(nrow(df), 3)
Your subset
df[indx, ]
# x y z
# 2 1 5 b
# 6 3 1 f
# 3 2 4 c
Remaining data set
df[-indx, ]
# x y z
# 1 1 6 a
# 4 2 3 d
# 5 3 2 e
Try:
> df
x y z
1 1 6 a
2 1 5 b
3 2 4 c
4 2 3 d
5 3 2 e
6 3 1 f
>
> df2 = df[sample(nrow(df),3),]
> df2
x y z
5 3 2 e
3 2 4 c
1 1 6 a
> df[!rownames(df) %in% rownames(df2),]
x y z
2 1 5 b
4 2 3 d
6 3 1 f
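A dplyr sketch of the same split (my own variant, not part of the original answers): add a hypothetical row-id helper column, draw the sample with slice_sample(), and take the remainder with anti_join().
library(dplyr)
df$rowid <- seq_len(nrow(df))            # hypothetical helper column
sampled   <- slice_sample(df, n = 3)     # the random subset
remaining <- anti_join(df, sampled, by = "rowid")  # the other 3 rows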
