I have a data frame with two columns, id_1 and id_2. For each value of id_1, I want to count the number of matches it has with each value of id_2.
I imagine the result being a data frame with columns id_1, id_2 and number_of_id_2_found_for_id_1.
Here's what I'm trying:
set.seed(1)
df <- data.frame(
id_1 = sample(1:10, size = 30, replace = TRUE),
id_2 = sample(1:10, size = 30, replace = TRUE)
)
df %>% group_by(id_1, id_2) %>%
# id_1 should be unique
summarise(~n(.x)) # I want this to be the number of id_2 it has found for each of the elements of id_1
My expected output would be:
1 1 0
1 2 0
1 3 0
1 4 1
1 5 0
....
1 9 0
2 1 0
...
2 7 1
2 8 0
2 9 1
And so on: basically, for each id_1, the number of matches it has found for each id_2. In the example above the count is mostly 0 or 1, but in a much bigger data frame the counts would increase. This is like a bipartite graph where the edge weight is the number of left-to-right matches between the two components, id_1 and id_2.
Thanks in advance!
Based on the updated post, maybe we need to use crossing to return all the combinations, count the rows of the original dataset by both columns, and join the counts to the full set of combinations:
library(dplyr)
library(tidyr)
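crossing(id_1 = 1:10, id_2 = 1:10) %>%
  # join the observed counts onto every possible pair; naming the keys avoids the join message
  left_join(count(df, id_1, id_2), by = c("id_1", "id_2")) %>%
  # pairs that never occur have no count, so set them to 0
  mutate(n = replace_na(n, 0))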
-output
# A tibble: 100 x 3
# id_1 id_2 n
# <int> <int> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 1 4 1
# 5 1 5 0
# 6 1 6 0
# 7 1 7 0
# 8 1 8 0
# 9 1 9 1
#10 1 10 0
# … with 90 more rows
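For comparison, the same zero-filled count can be sketched in base R, assuming the ids are known to range over 1:10. Converting both columns to factors with explicit levels makes table() report every combination, including those that never occur. The names counts and n are illustrative choices, and the id columns in the result are factors rather than integers.

```r
# Recreate the example data from the question.
set.seed(1)
df <- data.frame(
  id_1 = sample(1:10, size = 30, replace = TRUE),
  id_2 = sample(1:10, size = 30, replace = TRUE)
)

# Explicit factor levels force table() to include pairs with a count of zero.
counts <- as.data.frame(
  table(
    id_1 = factor(df$id_1, levels = 1:10),
    id_2 = factor(df$id_2, levels = 1:10)
  ),
  responseName = "n"
)

# Sort so that id_2 varies fastest within each id_1, as in the expected output.
counts <- counts[order(counts$id_1, counts$id_2), ]
head(counts)
```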
Related
I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would love to find those individuals that have at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column, "index", that is 1 if two or more conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.
You can create a vector of conditions, and then use %in% and sum to count the number of occurrences in each group. Use + (or ifelse) to convert logical into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
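The same counting logic can be sketched without dplyr, which may help to see what each step does. This is only an illustration using base R's tapply(); the name n_met is made up for the example.

```r
df <- read.table(header = TRUE, text = "id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
conditions <- c("A", "B", "C")

# For each id, count how many distinct codes are among the conditions.
n_met <- tapply(df$code, df$id, function(x) sum(unique(x) %in% conditions))

# Look up each row's group count by id, then turn the logical into 0/1.
df$index <- as.integer(n_met[as.character(df$id)] >= 2)
df
```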
You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)).
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# # A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1
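One caveat, sketched here with a hypothetical group that is not part of the question's data: counting all distinct codes only agrees with a conditions-based count when every code is itself one of the conditions. The base R lines below use length(unique(x)), which the answer names as the equivalent of n_distinct(x).

```r
conditions <- c("A", "B", "C")
code <- c("A", "D", "A")  # hypothetical group; "D" is not a condition

# Counting all distinct codes: "A" and "D" give 2, so the threshold is met.
all_distinct <- length(unique(code))

# Counting only distinct codes that are conditions: just "A", so it is not.
met <- length(unique(code[code %in% conditions]))

all_distinct >= 2  # TRUE
met >= 2           # FALSE
```

If codes outside the conditions can occur, filtering first, as in the second count, keeps the two approaches consistent.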
You can check the conditions with intersect() and test whether the resulting vector has at least the minimum length (e.g. 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))
I have a data frame like the one below and I want to find the value that is duplicated within each row. Please see the input and output example below. 0 is repeated 2 times in the first row, so rep should be 0 (data_output[1, "rep"] = 0); 2 is repeated 2 times in the second row, so rep should be 2; there are no repeated values in the 3rd row, so rep can be 4 (or any value other than 0, 1, 2); and 1 is repeated 3 times in the 4th row, so rep should be 1.
data_input=data.frame(X1=c(0,1,2,1), X2=c(0,2,1,1),
X3=c(1,2,0,1))
data_output=data.frame(X1=c(0,1,2,1),
X2=c(0,2,1,1), X3=c(1,2,0,1), rep=c(0,2,4,1))
Here is an option with rowwise: create the rowwise attribute, then find the duplicated element in each row; if there is none, replace the resulting NA with 4.
library(dplyr)
library(tidyr)
data_input %>%
rowwise %>%
mutate(rep = {tmp <- c_across(everything())
replace_na(tmp[duplicated(tmp)][1], 4)
}) %>%
ungroup
-output
# A tibble: 4 × 4
X1 X2 X3 rep
<dbl> <dbl> <dbl> <dbl>
1 0 0 1 0
2 1 2 2 2
3 2 1 0 4
4 1 1 1 1
The solution above doesn't consider the case where a row has more than one duplicated value. If that can happen, either create a list column or paste the unique duplicated elements together into a single string:
data_input %>%
rowwise %>%
mutate(rep = {tmp <- c_across(everything())
tmp <- toString(sort(unique(tmp[duplicated(tmp)])))
replace(tmp, tmp == "", "4")
}) %>%
ungroup
-output
# A tibble: 4 × 4
X1 X2 X3 rep
<dbl> <dbl> <dbl> <chr>
1 0 0 1 0
2 1 2 2 2
3 2 1 0 4
4 1 1 1 1
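The example data has only three columns, so no row can contain more than one duplicated value. To see what the string version does when that happens, here is a small sketch on a hypothetical four-value row, using the same expression as inside the mutate() call.

```r
x <- c(1, 1, 2, 2)  # hypothetical row in which both 1 and 2 repeat

# duplicated() flags the second and later occurrences of each value.
dups <- x[duplicated(x)]

# Collapse the distinct duplicated values into one comma-separated string.
rep_value <- toString(sort(unique(dups)))
rep_value
```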
Or using base R:
data_input$rep <- apply(data_input, 1, FUN = \(x) x[anyDuplicated(x)][1])
data_input$rep[is.na(data_input$rep)] <- 4
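The base R one-liner relies on two details that are easy to miss: anyDuplicated() returns the position of the first duplicated element, or 0 if there is none, and indexing with 0 yields an empty vector whose first element is NA. A short sketch with two of the rows from the example (the variable names are only for illustration):

```r
with_dup <- c(1, 2, 2)  # second row of data_input
no_dup   <- c(2, 1, 0)  # third row of data_input

pos_dup  <- anyDuplicated(with_dup)  # position of the second 2
pos_none <- anyDuplicated(no_dup)    # 0, because nothing is repeated

# Indexing by the position gives the duplicated value itself.
first_dup <- with_dup[pos_dup][1]

# Indexing by 0 gives an empty vector, and [1] of that is NA,
# which the second line of the answer then replaces with 4.
none <- no_dup[pos_none][1]
```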
Another solution, based on base R:
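nCols <- ncol(data_input)
data_output <- cbind(
  data_input,
  rep = apply(data_input, 1, function(x) {
    counts <- table(x)
    # which.max() indexes the table, not x, so read the value from the table's names
    if (length(counts) != nCols) as.numeric(names(which.max(counts))) else nCols + 1
  }))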
data_output
#> X1 X2 X3 rep
#> 1 0 0 1 0
#> 2 1 2 2 2
#> 3 2 1 0 4
#> 4 1 1 1 1
I know how I can subset a data frame by sampling certain rows. However, I'm struggling with finding an easy (preferably tidyverse) way to just ADD the sampling information as a new column to my data set, i.e. I simply want to populate a new column with "1" if it is sampled and "0" if not.
I currently have this one, but it feels overly complicated. Note, in the example I want to sample 3 rows per group.
df <- data.frame(group = c(1,2,1,2,1,1,1,1,2,2,2,2,2,1,1),
var = 1:15)
library(tidyverse)
df <- df %>%
group_by(group) %>%
mutate(sampling_info = sample.int(n(), size = n(), replace = FALSE),
sampling_info = if_else(sampling_info <= 3, 1, 0))
You can try:
library(dplyr)
set.seed(123)
df %>%
arrange(group) %>%
group_by(group) %>%
mutate(sampling_info = as.integer(row_number() %in% sample(n(), size = 3))) %>%
ungroup
# group var sampling_info
# <dbl> <int> <int>
# 1 1 1 0
# 2 1 3 0
# 3 1 5 1
# 4 1 6 0
# 5 1 7 0
# 6 1 8 0
# 7 1 14 1
# 8 1 15 1
# 9 2 2 0
#10 2 4 1
#11 2 9 1
#12 2 10 0
#13 2 11 0
#14 2 12 1
#15 2 13 0
sample(n(), size = 3) will generate 3 random row numbers for each group and we assign 1 for those row numbers.
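The same idea can be sketched in base R with ave(), which applies a function per group and returns the results in the original row order, so the data do not need to be sorted by group first. This assumes every group has at least 3 rows; the seed is only there for reproducibility.

```r
set.seed(123)
df <- data.frame(group = c(1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1),
                 var = 1:15)

# Within each group, draw 3 positions and mark them with 1, the rest with 0.
df$sampling_info <- ave(
  seq_len(nrow(df)), df$group,
  FUN = function(i) as.integer(seq_along(i) %in% sample(length(i), size = 3))
)

# Check: each group should contain exactly three sampled rows.
tapply(df$sampling_info, df$group, sum)
```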
I am trying to expand on the answer to a problem that has already been solved, "Take Sum of a Variable if Combination of Values in Two Other Columns are Unique",
but because I am new to Stack Overflow I can't comment directly on that post, so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns by location and season, so that I get one total per column (from column 3 onwards) for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
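You can use dplyr::summarise and across after group_by. Because the data columns have different names (ani1, bni2, ...), select every non-grouping column with everything() rather than by prefix.
library(dplyr)
df %>%
  group_by(Locations, seasons) %>%
  # everything() excludes the grouping columns, so all data columns are summed
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
  ungroup()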
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 bni2
<chr> <chr> <dbl> <dbl>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
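Since the original attempt used aggregate(), here is a base R sketch that gets the same totals without a loop. The dot on the left of the formula stands for every column that is not used for grouping, so the differing column names do not matter. Note that the formula interface drops rows containing NA by default, which is not quite the same as na.rm = TRUE; the example data has no missing values.

```r
Locations <- c("A", "A", "A", "A", "B", "B", "C", "C", "D", "D", "D")
seasons <- c("2", "2", "3", "4", "2", "3", "1", "2", "2", "4", "4")
ani1 <- c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0)
bni2 <- c(0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1)
df <- data.frame(Locations, seasons, ani1, bni2)

# Sum every non-grouping column for each Locations/seasons combination.
totals <- aggregate(. ~ Locations + seasons, data = df, FUN = sum)

# Sort by location, then season, to match the layout of the desired output.
totals <- totals[order(totals$Locations, totals$seasons), ]
rownames(totals) <- NULL
totals
```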
I have a data frame with a group, a condition that differs by group, and an index within each group:
df <- data.frame(group = c(rep(c("A", "B", "C"), each = 3)),
condition = rep(c(0,1,1), each = 3),
index = c(1:3,1:3,2:4))
> df
group condition index
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 B 1 2
6 B 1 3
7 C 1 2
8 C 1 3
9 C 1 4
I would like to slice the data within each group, filtering out all but the row with the lowest index. However, this filter should only be applied when the condition applies, i.e., condition == 1. My solution was to compute a ranking on the index within each group and filter on the combination of condition and rank:
df %>%
group_by(group) %>%
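mutate(rank = min_rank(index)) %>% # min_rank() gives ranks; order() gives a sort permutation and only matches when index is already sorted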
filter(case_when(condition == 0 ~ TRUE,
condition == 1 & rank == 1 ~ TRUE))
# A tibble: 5 x 4
# Groups: group [3]
group condition index rank
<chr> <dbl> <int> <int>
1 A 0 1 1
2 A 0 2 2
3 A 0 3 3
4 B 1 1 1
5 C 1 2 1
This left me wondering whether there is a faster solution that does not require a separate ranking variable, and potentially uses slice_min() instead.
You can use filter() to keep all cases where the condition is zero or the index equals the minimum index.
library(dplyr)
df %>%
group_by(group) %>%
filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups: group [3]
group condition index
<chr> <dbl> <int>
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 C 1 2
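To make the grouped min(index) step explicit, here is a base R sketch of the same filter. ave() returns, for every row, the minimum index of that row's group, which can then be compared row by row. The names group_min, keep and result are illustrative.

```r
df <- data.frame(group = rep(c("A", "B", "C"), each = 3),
                 condition = rep(c(0, 1, 1), each = 3),
                 index = c(1:3, 1:3, 2:4))

# Per-row copy of the group's minimum index.
group_min <- ave(df$index, df$group, FUN = min)

# Keep a row if its condition is 0, or if it holds the group's minimum index.
keep <- df$condition == 0 | df$index == group_min
result <- df[keep, ]
result
```

Like the filter() version, this keeps every tied row if a group's minimum index occurs more than once.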
An option with slice:
library(dplyr)
df %>%
group_by(group) %>%
slice(unique(c(which(condition == 0), which.min(index))))