I have the data frame below and I want to find the duplicated values in each row. Please see the input and output example below: 0 appears twice in the first row, so rep should be 0 (data_input[1, "rep"] = 0); 2 appears twice in the second row, so rep should be 2; there are no repeated values in the 3rd row, so rep can be 4 (or any value other than 0, 1, 2); and 1 appears three times in the 4th row, so rep should be 1.
data_input <- data.frame(X1 = c(0, 1, 2, 1),
                         X2 = c(0, 2, 1, 1),
                         X3 = c(1, 2, 0, 1))
data_output <- data.frame(X1 = c(0, 1, 2, 1),
                          X2 = c(0, 2, 1, 1),
                          X3 = c(1, 2, 0, 1),
                          rep = c(0, 2, 4, 1))
Here is an option with rowwise(): create the rowwise attribute, then find the first duplicated element in the row; if there is none, replace the resulting NA with 4.
library(dplyr)
library(tidyr)
data_input %>%
  rowwise() %>%
  mutate(rep = {
    tmp <- c_across(everything())
    replace_na(tmp[duplicated(tmp)][1], 4)
  }) %>%
  ungroup()
-output
# A tibble: 4 × 4
     X1    X2    X3   rep
  <dbl> <dbl> <dbl> <dbl>
1     0     0     1     0
2     1     2     2     2
3     2     1     0     4
4     1     1     1     1
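Note that c_across() collects the current row into a plain vector, which is why the rowwise() grouping is needed before the mutate().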
The solution above doesn't handle rows with more than one distinct duplicated value. If that can happen, either create a list column or paste the unique duplicated values together into a single string:
data_input %>%
  rowwise() %>%
  mutate(rep = {
    tmp <- c_across(everything())
    tmp <- toString(sort(unique(tmp[duplicated(tmp)])))
    replace(tmp, tmp == "", "4")
  }) %>%
  ungroup()
-output
# A tibble: 4 × 4
     X1    X2    X3 rep
  <dbl> <dbl> <dbl> <chr>
1     0     0     1 0
2     1     2     2 2
3     2     1     0 4
4     1     1     1 1
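Note that rep is now a character column, since toString() returns a string; convert with as.numeric() if a numeric column is needed and no row has multiple distinct duplicated values.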
Or using base R:
# anyDuplicated() returns the position of the first duplicated element
# (0 if there is none), so x[0][1] yields NA, which the second line replaces with 4
data_input$rep <- apply(data_input, 1, FUN = \(x) x[anyDuplicated(x)][1])
data_input$rep[is.na(data_input$rep)] <- 4
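For the example data this reproduces the same rep column: 0, 2, 4, 1.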
Another solution, based on base R:
nCols <- ncol(data_input)
data_output <- cbind(
  data_input,
  rep = apply(data_input, 1, function(x) {
    tab <- table(x)
    # most frequent value, or nCols + 1 when all values are unique
    if (length(tab) != nCols) as.numeric(names(tab)[which.max(tab)]) else nCols + 1
  })
)
data_output
#> X1 X2 X3 rep
#> 1 0 0 1 0
#> 2 1 2 2 2
#> 3 2 1 0 4
#> 4 1 1 1 1
Related
I have two data frames, df1 and df2, which both have a column 'ID'. For each row in df1, I would like to count how many times its ID appears in df2 and add that count to the row. If there are no matches, the count should be 0.
df1:
# A tibble: 4 x 3
#   ID        a     b
#   <chr> <dbl> <dbl>
# 1 1_234     1     1
# 2 1_235     1     2
# 3 2_222     1     1
# 4 2_654     1     2
df2:
# A tibble: 4 x 3
#   ID        a     b
#   <chr> <dbl> <dbl>
# 1 1_234     1     1
# 2 1_235     1     2
# 3 1_234     1     1
# 4 3_234     1     2
Using dplyr:
Your data:
df1 <- data.frame(ID = c("1_234", "1_235", "2_222", "2_654"),
                  a = c(1, 1, 1, 1),
                  b = c(1, 2, 1, 2))
df2 <- data.frame(ID = c("1_234", "1_235", "1_234", "3_234"),
                  a = c(1, 1, 1, 1),
                  b = c(1, 2, 1, 2))
Edit: considering only the IDs:
output <- left_join(df1,
                    as.data.frame(table(df2$ID)),
                    by = c("ID" = "Var1")) %>%
  mutate(Freq = ifelse(is.na(Freq), 0, Freq))
Output:
     ID a b Freq
1 1_234 1 1    2
2 1_235 1 2    1
3 2_222 1 1    0
4 2_654 1 2    0
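As a side note, as.data.frame(table(...)) stores the IDs in a factor column (Var1), so the join has to coerce types. A small all-dplyr sketch of the same idea, using count() and coalesce():

library(dplyr)

df1 %>%
  left_join(count(df2, ID, name = "Freq"), by = "ID") %>%
  mutate(Freq = coalesce(Freq, 0L))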
A base R option using subset + aggregate
subset(
  aggregate(
    n ~ .,
    rbind(
      cbind(df1, n = 1),
      cbind(df2, n = 1)
    ),
    function(x) length(x) - 1
  ),
  ID %in% df1$ID
)
gives
     ID a b n
1 1_234 1 1 2
2 2_222 1 1 0
3 1_235 1 2 1
4 2_654 1 2 0
I think you can do it with a simple sapply() and base R (no extra packages):
df1$count <- sapply(df1$ID, function(x) sum(df2$ID == x))
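For the example data this gives counts 2, 1, 0, 0, matching the outer() approach below.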
We may also use outer():
df1$count <- rowSums(outer(df1$ID, df2$ID, FUN = `==`))
df1$count
[1] 2 1 0 0
We could use semi_join() and n() to count how many rows of df1 have a matching ID in df2:
library(dplyr)
df1 %>%
  semi_join(df2, by = "ID") %>%
  summarise(duplicates_df1_df2 = n())
Output:
  duplicates_df1_df2
1                  2
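Note this returns a single overall total, not a per-row count like the approaches above.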
I have a data frame with two columns, id_1 and id_2. For each value of id_1, I want to count the number of matches it has with all the elements of id_2.
I imagine the result being a data frame with columns id_1, id_2 and number_of_id_2_found_for_id_1.
Here's what I'm trying:
set.seed(1)
df <- data.frame(
id_1 = sample(1:10, size = 30, replace = TRUE),
id_2 = sample(1:10, size = 30, replace = TRUE)
)
df %>%
  group_by(id_1, id_2) %>%
  # id_1 should be unique
  summarise(~n(.x)) # I want this to be the number of id_2 it has found for each of the elements of id_1
My expected output would be:
1 1 0
1 2 0
1 3 0
1 4 1
1 5 0
....
1 9 0
2 1 0
...
2 7 1
2 8 0
2 9 1
And so on: basically, for each id_1, the number of matches it has for each id_2. In the example above the counts are mostly 0 or 1, but in a much bigger data frame they would grow. This is like a bipartite graph where the edge weight would be the number of left-to-right matches between id_1 and id_2.
Thanks in advance!
Based on the updated post, we may need to do a crossing() to generate all the combinations, count both columns in the original dataset, and join that with the full combination set:
library(dplyr)
library(tidyr)
crossing(id_1 = 1:10, id_2 = 1:10) %>%
  left_join(df %>%
              count(id_1, id_2)) %>%
  mutate(n = replace_na(n, 0))
-output
# A tibble: 100 x 3
#     id_1  id_2     n
#    <int> <int> <dbl>
#  1     1     1     0
#  2     1     2     0
#  3     1     3     1
#  4     1     4     1
#  5     1     5     0
#  6     1     6     0
#  7     1     7     0
#  8     1     8     0
#  9     1     9     1
# 10     1    10     0
# … with 90 more rows
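A base R sketch of the same idea: table() already counts every combination of its factor levels, including zeros (factor() pins the levels to 1:10 in case some value never appears in one of the columns):

out <- as.data.frame(table(id_1 = factor(df$id_1, levels = 1:10),
                           id_2 = factor(df$id_2, levels = 1:10)))
head(out)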
So I have a data set containing 4 individuals, each measured over a different time period. In R:
df <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
                 t  = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2),
                 x1 = c(0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0))
and I want to create a variable x2 indicating whether there has already been a 1 in variable x1 for the given individual, i.e. it will look like this:
"x2" = c(0,1,1,1,1,0,1,1,1,0,0)
... ideally with the dplyr package. So far I have gotten to here:
new_df <- df %>% dplyr::group_by(id) %>% dplyr::arrange(t)
but cannot move on from this point... The desired result is x2 as shown above.
Here is one approach using dplyr:
df %>%
  arrange(id, t) %>%
  group_by(id) %>%
  mutate(x2 = ifelse(row_number() >= min(row_number()[x1 == 1]), 1, 0))
This will add a 1 if the row number is greater than or equal to the first row number where x1 is 1; otherwise, it will add a 0.
Note, you will get warnings, as at least one group has no row where x1 equals 1: min() of an empty set returns Inf with a warning, which still correctly yields 0 after the comparison.
Also, another alternative, which returns NA where an id never has an x1 value of 1 (e.g., id 4):
df %>%
  arrange(id, t) %>%
  group_by(id) %>%
  mutate(x2 = +(row_number() >= which(x1 == 1)[1]))
Output (from the first approach; the second returns NA instead of 0 for id 4):
      id     t    x1    x2
   <dbl> <dbl> <dbl> <dbl>
 1     1     1     0     0
 2     1     2     1     1
 3     1     3     0     1
 4     2     1     1     1
 5     2     2     0     1
 6     3     1     0     0
 7     3     2     1     1
 8     3     3     0     1
 9     3     4     1     1
10     4     1     0     0
11     4     2     0     0
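As a compact alternative (a sketch, assuming x1 only ever holds 0 or 1), cummax() carries the first 1 forward within each group:

library(dplyr)

df %>%
  arrange(id, t) %>%
  group_by(id) %>%
  mutate(x2 = cummax(x1)) %>%
  ungroup()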
I have a data frame
df <- tibble(row1 = c(1, 2, 3, 4, 5),
             row2 = c(2, 3, 4, 5, 6))
How do I subtract the two columns using their index (not their names)? I would like this to work:
df %>% mutate(diff = select(1) - select(2))
But the universe is not on my side....
select() needs a data argument, as its usage is
select(.data, ...)
Also, as select() returns a data.frame/tibble as output, we can extract the underlying vector with [[:
library(dplyr)
df %>%
  mutate(diff = select(., 1)[[1]] - select(., 2)[[1]])
-output
# A tibble: 5 x 3
#    row1  row2  diff
#   <dbl> <dbl> <dbl>
# 1     1     2    -1
# 2     2     3    -1
# 3     3     4    -1
# 4     4     5    -1
# 5     5     6    -1
Or instead use pull() to return the vector:
df %>%
  mutate(diff = pull(., 1) - pull(., 2))
What about using select() like below?
> df %>% mutate(diff = do.call(`-`,select(.,1:2)))
# A tibble: 5 x 3
   row1  row2  diff
  <dbl> <dbl> <dbl>
1     1     2    -1
2     2     3    -1
3     3     4    -1
4     4     5    -1
5     5     6    -1
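Here do.call(`-`, select(., 1:2)) passes the two selected columns as the two arguments of `-`, i.e. row1 - row2, so the subtraction stays vectorized and the column positions are given only once.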
I have a data frame with a group, a condition that differs by group, and an index within each group:
df <- data.frame(group = rep(c("A", "B", "C"), each = 3),
                 condition = rep(c(0, 1, 1), each = 3),
                 index = c(1:3, 1:3, 2:4))
> df
  group condition index
1     A         0     1
2     A         0     2
3     A         0     3
4     B         1     1
5     B         1     2
6     B         1     3
7     C         1     2
8     C         1     3
9     C         1     4
I would like to slice the data within each group, filtering out all but the row with the lowest index. However, this filter should only be applied when the condition applies, i.e., condition == 1. My solution was to compute a ranking on the index within each group and filter on the combination of condition and rank:
df %>%
  group_by(group) %>%
  mutate(rank = order(index)) %>%
  filter(case_when(condition == 0 ~ TRUE,
                   condition == 1 & rank == 1 ~ TRUE))
# A tibble: 5 x 4
# Groups:   group [3]
  group condition index  rank
  <chr>     <dbl> <int> <int>
1 A             0     1     1
2 A             0     2     2
3 A             0     3     3
4 B             1     1     1
5 C             1     2     1
This left me wondering whether there is a faster solution that does not require a separate ranking variable, and potentially uses slice_min() instead.
You can use filter() to keep all cases where the condition is zero or the index equals the minimum index.
library(dplyr)
df %>%
  group_by(group) %>%
  filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups:   group [3]
  group condition index
  <chr>     <dbl> <int>
1 A             0     1
2 A             0     2
3 A             0     3
4 B             1     1
5 C             1     2
An option with slice():
library(dplyr)
df %>%
  group_by(group) %>%
  slice(unique(c(which(condition == 0), which.min(index))))
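To answer the slice_min() part of the question, one sketch splits the rows on the condition first, keeps every condition == 0 row, and applies slice_min() only to the rest:

library(dplyr)

bind_rows(
  df %>% filter(condition == 0),
  df %>%
    filter(condition == 1) %>%
    group_by(group) %>%
    slice_min(index, n = 1) %>%
    ungroup()
) %>%
  arrange(group, index)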