I have a directed network dataset of adolescent friendships. I'd like to make an edgelist that includes the number of friends ego has in common with alter (someone ego and alter both nominated as a friend). Below is some sample data:
HAVE DATA:
id alter
1 3
1 5
1 9
2 3
2 5
3 2
3 5
3 9
3 6
WANT DATA:
id alter num_common
1 3 2
1 5 0
1 9 0
2 3 1
2 5 0
3 2 1
3 5 0
3 9 0
3 6 0
A solution could be to transform the edgelist into an adjacency matrix (using the igraph package) and multiply it by its transpose to count the number of shared neighbors:
el <- read.table(text= " id alter
1 3
1 5
1 9
2 3
2 5
3 2
3 5
3 9
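3 6", header = TRUE)
library(igraph)
# vertices are created for every id from 1 up to the largest id in the edgelist
g <- graph_from_edgelist(as.matrix(el), directed = TRUE)
# as_adjacency_matrix() is the newer name for get.adjacency()
m <- as_adjacency_matrix(g, sparse = FALSE)
# entry [i, j] counts the people nominated by both i and j
m2 <- m %*% t(m)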
Afterwards, transform the resulting matrix back into an edgelist and merge it with the original data set:
el2 <- reshape2::melt(m2)
dplyr::left_join(el, el2, by = c("id" = "Var1", "alter" = "Var2"))
id alter value
1 1 3 2
2 1 5 0
3 1 9 0
4 2 3 1
5 2 5 0
6 3 2 1
7 3 5 0
8 3 9 0
9 3 6 0
To count how often ego and alter were both nominated by the same person, reverse the direction of the relation by using t(m) %*% m instead of m %*% t(m). To ignore direction, set the directed argument to FALSE in the graph_from_edgelist function.
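As a rough illustration of the two matrix products, here is a minimal base R sketch that builds the adjacency matrix by hand instead of going through igraph. The names nodes, idx, num_common_out and num_common_in are made up for this example.

```r
# Sample edgelist from the question
el <- data.frame(id    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 alter = c(3, 5, 9, 3, 5, 2, 5, 9, 6))

# Adjacency matrix over the ids that actually occur (rows = ego, cols = alter)
nodes <- sort(unique(c(el$id, el$alter)))
m <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
idx <- cbind(match(el$id, nodes), match(el$alter, nodes))
m[idx] <- 1

# m %*% t(m): entry [i, j] = number of people nominated by both i and j
el$num_common_out <- (m %*% t(m))[idx]
# t(m) %*% m: entry [i, j] = number of people who nominated both i and j
el$num_common_in <- (t(m) %*% m)[idx]

el
```

Indexing the product with the same two-column idx matrix pulls out one value per original edge, so no melt or join step is needed in this variant.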
This is a possible, though not very simple, solution:
# your dummy data
df <- data.table::fread("id alter
1 3
1 5
1 9
2 3
2 5
3 2
3 5
3 9
3 6")
library(dplyr)
library(tidyr)
# all pairs vertically with pair ID
pairs_v <- combn(unique(c(df$id, df$alter)), 2) %>%
dplyr::as_tibble() %>%
tidyr::pivot_longer(cols = everything()) %>%
dplyr::arrange(name)
# number of common friends per pair ID
pairs_comp <- pairs_v %>%
dplyr::left_join(df, by = c("value" = "id")) %>%
dplyr::count(name, alter) %>%
dplyr::filter(n > 1 & !is.na(alter)) %>%
dplyr::count(name)
# all pairs horizontally with pair ID
pairs_h <- pairs_v %>%
dplyr::group_by(name) %>%
dplyr::mutate(G_ID = dplyr::row_number()) %>%
tidyr::pivot_wider(names_from = G_ID, values_from = "value")
# multiple left joins to get the number of common friends for each direction of a pair
df %>%
dplyr::left_join(pairs_h, by = c("id" = "1", "alter" = "2")) %>%
dplyr::left_join(pairs_comp) %>%
dplyr::left_join(pairs_h, by = c("id" = "2", "alter" = "1")) %>%
dplyr::left_join(pairs_comp, by = c("name.y" = "name")) %>%
dplyr::mutate(num_common = case_when(!is.na(n.x) ~ as.numeric(n.x),
!is.na(n.y) ~ as.numeric(n.y),
TRUE ~ 0)) %>%
dplyr::select(id, alter, num_common)
id alter num_common
1: 1 3 2
2: 1 5 0
3: 1 9 0
4: 2 3 1
5: 2 5 0
6: 3 2 1
7: 3 5 0
8: 3 9 0
9: 3 6 0
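For comparison, the same counts can probably be reached with a shorter self-join. The following is only a sketch in base R, not the answerer's method; the intermediate names pairs, counts and out are invented for the example.

```r
el <- data.frame(id    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 alter = c(3, 5, 9, 3, 5, 2, 5, 9, 6))

# Self-join on alter: one row per (id.x, id.y) pair that nominated the same person
pairs <- merge(el, el, by = "alter")
pairs <- pairs[pairs$id.x != pairs$id.y, ]

# Count the shared nominations for every ordered pair of egos
counts <- aggregate(alter ~ id.x + id.y, data = pairs, FUN = length)
names(counts) <- c("id", "alter", "num_common")

# Attach the counts to the original edges; pairs without a shared friend get 0
out <- merge(el, counts, by = c("id", "alter"), all.x = TRUE)
out$num_common[is.na(out$num_common)] <- 0
out
```

Note that merge() sorts the result by id and alter, so the row order can differ from the input edgelist.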
I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would love to find those individuals that have at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column "index" that is 1 if two or more conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.
You can create a vector of conditions, and then use %in% and sum to count the number of occurrences in each group. Use + (or ifelse) to convert logical into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
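You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)). Note that, as written, it counts every distinct code in the group, so it only matches the requirement when all codes are among the conditions; otherwise filter first, e.g. n_distinct(code[code %in% conditions]).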
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# # A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1
You can check the conditions with the intersect() function and test whether the resulting vector has at least the required length (e.g. 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))
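To see where the approaches can differ, here is a small sketch on made-up data (the toy data frame and the code "Z" are not from the question). It compares counting all distinct codes with counting only the codes that appear in conditions.

```r
library(dplyr)

conditions <- c("A", "B", "C")

# Hypothetical data: id 1 has one listed code plus an unlisted one ("Z")
toy <- data.frame(id = c(1, 1, 2, 2), code = c("A", "Z", "A", "B"))

res <- toy %>%
  group_by(id) %>%
  mutate(
    any_codes = as.integer(n_distinct(code) >= 2),                       # counts "Z" too
    cond_only = as.integer(length(intersect(code, conditions)) >= 2)     # listed codes only
  ) %>%
  ungroup()

res
```

With the sample data in the question every code is one of the conditions, so both columns would agree there.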
I am trying to expand on the answer to a solved question, "Take Sum of a Variable if Combination of Values in Two Other Columns are Unique",
but because I am new to Stack Overflow I can't comment directly on that post, so here is my problem:
I have a dataset like the following, but with about 100 columns of binary data like the "ani1" and "bni2" columns shown here.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum every column from column 3 onwards, so that I get one total per column for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
You can use dplyr::summarise and across after group_by.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
# everything() sums all non-grouping columns (starts_with("ani") would skip bni2)
summarise(across(everything(), ~sum(.x, na.rm = TRUE))) %>%
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 bni2
<chr> <chr> <dbl> <dbl>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
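Since the question's attempt used aggregate(), a base R sketch may also be useful. In the loop, the formula i ~ . refers to the loop counter i (a single number) rather than to a column, which is why the variable lengths differ. Using . on the left-hand side sums every remaining column at once; the name totals is just for this example.

```r
df <- data.frame(
  Locations = c("A", "A", "A", "A", "B", "B", "C", "C", "D", "D", "D"),
  seasons   = c("2", "2", "3", "4", "2", "3", "1", "2", "2", "4", "4"),
  ani1      = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0),
  bni2      = c(0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1)
)

# "." on the left stands for every column not named on the right
totals <- aggregate(. ~ Locations + seasons, data = df, FUN = sum)

# aggregate() orders by seasons first; reorder to match the wanted layout
totals <- totals[order(totals$Locations, totals$seasons), ]
totals
```

One caveat: the formula interface drops rows containing NA by default, which differs from summing with na.rm = TRUE.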
How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(). But I'm struggling with this.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x,y,lex.order=T)))
But if you have a better solution, please share.
data.table
data.table has you covered with frank().
library(data.table)
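frank(df, x, y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can use df$r <- frank(df, x, y, ties.method = 'dense') to add it as a new column. (With ties.method = 'min' the ranks after a tie skip ahead, giving 1 2 3 4 5 5 5 8 9 10.)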
tidyr/dplyr
Another option (though clunkier) is to use tidyr::unite to collapse your columns to one plus dplyr::dense_rank.
library(tidyverse)
df %>%
# add a single helper column with all the info, keeping x and y
unite(xy, x, y, remove = FALSE) %>%
# dense rank on that (xy is a string, so values are ordered as text, not as numbers)
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)
You can use cur_group_id:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8
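The cumsum idea from the question can be generalised so that a change in either column starts a new rank. The following base R sketch assumes only the two columns x and y; the helper names o, s and new_group are invented here.

```r
df <- data.frame(x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 y = c(1, 2, 3, 4, 2, 2, 2, 1, 2, 3))

o <- order(df$x, df$y)   # row order after sorting by x, then y
s <- df[o, ]
n <- nrow(s)

# A row starts a new group when x or y differs from the previous sorted row
new_group <- c(TRUE, s$x[-1] != s$x[-n] | s$y[-1] != s$y[-n])

# Write the running group count back in the original row order
df$r[o] <- cumsum(new_group)
df
```

Comparing both columns avoids the weakness of the original hack, which would miss a new group when x changes while y stays the same.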
The title says it all! I have grouped data where I'd like to remove rows up until the first non-zero value by id group.
Example code:
problem <- data.frame(
id = c(1,1,1,1,2,2,2,2,3,3,3,3),
value = c(0,0,2,0,0,8,4,2,1,7,6,5)
)
solution <- data.frame(
id = c(1,1,2,2,2,3,3,3,3),
value = c(2,0,8,4,2,1,7,6,5)
)
Here is a dplyr solution:
library(dplyr)
problem %>%
group_by(id) %>%
mutate(first_match = min(row_number()[value != 0])) %>%
filter(row_number() >= first_match) %>%
select(-first_match) %>%
ungroup()
# A tibble: 9 x 2
id value
<dbl> <dbl>
1 1 2
2 1 0
3 2 8
4 2 4
5 2 2
6 3 1
7 3 7
8 3 6
9 3 5
Or more succinctly per Tjebo's comment:
problem %>%
group_by(id) %>%
filter(row_number() >= min(row_number()[value != 0])) %>%
ungroup()
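Another way to express "keep everything from the first non-zero value onwards" is dplyr's cumany(), which stays TRUE once the condition has been met. This is a sketch on the question's data; the name trimmed is made up.

```r
library(dplyr)

problem <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  value = c(0, 0, 2, 0, 0, 8, 4, 2, 1, 7, 6, 5)
)

trimmed <- problem %>%
  group_by(id) %>%
  filter(cumany(value != 0)) %>%   # FALSE until the first non-zero row, TRUE afterwards
  ungroup()

trimmed
```

A group that contains only zeros would be dropped entirely by this filter.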
You can do this in base R:
subset(problem,ave(value,id,FUN=cumsum)>0)
# id value
# 3 1 2
# 4 1 0
# 6 2 8
# 7 2 4
# 8 2 2
# 9 3 1
# 10 3 7
# 11 3 6
# 12 3 5
Use abs(value) if you have negative values in your real case.
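To illustrate why abs() matters, here is a small sketch with made-up data containing a negative value: the plain cumulative sum can fall back to zero or below, so rows after the first non-zero value are wrongly dropped.

```r
# Hypothetical data: group 1 starts with 0, then -3, then 3
neg <- data.frame(id = c(1, 1, 1, 2, 2, 2), value = c(0, -3, 3, 0, 0, 4))

# Plain cumsum: group 1 gives 0, -3, 0, so nothing in it is > 0
keep_plain <- ave(neg$value, neg$id, FUN = cumsum) > 0

# cumsum of absolute values: group 1 gives 0, 3, 6
keep_abs <- ave(abs(neg$value), neg$id, FUN = cumsum) > 0

subset(neg, keep_abs)
```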
I have a data frame with several groups and, for each group, real and predicted values. I want to extract the results of tests on these values:
library(dplyr)
d = data.frame(group = rep(c("a", "b"), each = 5), real = rep(1:5, times = 2), pred = c(2,1,3,4,5,1,2,4,3,5))
group real pred
1 a 1 2
2 a 2 1
3 a 3 3
4 a 4 4
5 a 5 5
6 b 1 1
7 b 2 2
8 b 3 4
9 b 4 3
10 b 5 5
d <- d %>% group_by(group) %>% mutate( sg = ifelse(real == 1 & real == pred, 1, 0))
d <- d %>% group_by(group) %>% mutate( sp = ifelse(real <= 3 & pred <= 3, 1, 0))
d %>% distinct(sg, sp)
sg sp group
1 0 1 a
2 0 0 a
3 1 1 b
4 0 1 b
5 0 0 b
But I want something like this (only 1 result per group)
sg sp group
1 0 1 a
3 1 1 b
I am pretty sure dplyr, data.table or tidyr can do something but I cannot find how.
If it is always the first row of each group that you want to extract, you could use the do function:
d %>% do(.[1,])
Another option is to use the filter function like this:
d %>% filter(seq_along(sp) == 1)
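If the first row per group is what is needed, dplyr's slice() is another option. The sketch below rebuilds the example data, assumes the grouping by group is kept, and uses the made-up name first_rows.

```r
library(dplyr)

d <- data.frame(
  group = rep(c("a", "b"), each = 5),
  real  = rep(1:5, times = 2),
  pred  = c(2, 1, 3, 4, 5, 1, 2, 4, 3, 5)
)

first_rows <- d %>%
  group_by(group) %>%
  mutate(
    sg = ifelse(real == 1 & real == pred, 1, 0),
    sp = ifelse(real <= 3 & pred <= 3, 1, 0)
  ) %>%
  slice(1) %>%               # first row within each group
  ungroup() %>%
  select(sg, sp, group)

first_rows
```

This relies on row order within each group, so the data should be sorted the way you want before slicing.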