How to group interconnected elements in R dplyr

I have a data frame that looks like this.
Elements in col1 are connected, possibly indirectly, with elements in col2.
For example, 1 is connected with 2 and 3,
and 2 is connected with 3, so 1, 2 and 3 all belong together.
library(tidyverse)
df1 <- tibble(col1=c(1,1,2,5,5,6),
col2=c(2,3,3,6,7,7))
df1
#> # A tibble: 6 × 2
#> col1 col2
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 3
#> 3 2 3
#> 4 5 6
#> 5 5 7
#> 6 6 7
Created on 2022-03-15 by the reprex package (v2.0.1)
I want my data to look like this:
#> col1 col2 col3
#> <dbl> <dbl> <chr>
#> 1 1 2 group1
#> 2 1 3 group1
#> 3 2 3 group1
#> 4 5 6 group2
#> 5 5 7 group2
#> 6 6 7 group2
I would appreciate any possible help to solve this riddle.
Thank you for your time

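We may use igraph to label the connected components of the graph defined by the two columns.
library(igraph)
library(dplyr)
library(stringr)
# build a graph whose edges are the rows of df1
g <- graph.data.frame(df1, directed = TRUE)
# look up each row's component id by the vertex name taken from col1
df1 %>%
mutate(col3 = str_c("group", clusters(g)$membership[as.character(col1)]))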
-output
# A tibble: 6 × 3
col1 col2 col3
<dbl> <dbl> <chr>
1 1 2 group1
2 1 3 group1
3 2 3 group1
4 5 6 group2
5 5 7 group2
6 6 7 group2

Another igraph option
df1 %>%
mutate(
col3 =
paste0("group", {
graph_from_data_frame(.) %>%
components() %>%
membership()
}[as.character(col1)])
)
gives
# A tibble: 6 x 3
col1 col2 col3
<dbl> <dbl> <chr>
1 1 2 group1
2 1 3 group1
3 2 3 group1
4 5 6 group2
5 5 7 group2
6 6 7 group2
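If igraph is not available, the same grouping can be sketched in base R with a small union-find. This is only a minimal sketch, not what the answers above use: label_components is a hypothetical helper name, and it assumes every row is an undirected edge between col1 and col2.

```r
# Hypothetical helper: label connected components with a simple union-find.
label_components <- function(from, to) {
  nodes <- unique(c(from, to))
  parent <- seq_along(nodes)            # every node starts as its own root
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (k in seq_along(from)) {
    a <- find_root(match(from[k], nodes))
    b <- find_root(match(to[k], nodes))
    if (a != b) parent[max(a, b)] <- min(a, b)   # merge the two components
  }
  roots <- vapply(seq_along(nodes), find_root, integer(1))
  # renumber roots as 1, 2, ... in order of first appearance, named by node
  setNames(match(roots, unique(roots)), nodes)
}

df1 <- data.frame(col1 = c(1, 1, 2, 5, 5, 6),
                  col2 = c(2, 3, 3, 6, 7, 7))
comp <- label_components(df1$col1, df1$col2)
df1$col3 <- paste0("group", comp[as.character(df1$col1)])
df1$col3
# "group1" "group1" "group1" "group2" "group2" "group2"
```

The plain loop in find_root is fine for small inputs; for a very large edge list igraph is likely the better choice.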

Related

Fill NA with a series of characters in R dplyr

I have a large data frame that looks like this. Each player is assigned to a group.
library(tidyverse)
df <- tibble(player=c(1,2,3,4,5),groups=c("group1","group2","group2",NA,NA))
df
#> # A tibble: 5 × 2
#> player groups
#> <dbl> <chr>
#> 1 1 group1
#> 2 2 group2
#> 3 3 group2
#> 4 4 <NA>
#> 5 5 <NA>
Created on 2022-04-12 by the reprex package (v2.0.1)
Some players are not assigned to a group, and I want to fill those in serially, like this:
#> # A tibble: 5 × 2
#> player groups
#> <dbl> <chr>
#> 1 1 group1
#> 2 2 group2
#> 3 3 group2
#> 4 4 group3
#> 5 5 group4
dplyr
library(dplyr)
df %>%
mutate(
maxgrp = max(as.integer(gsub("[^0-9]", "", groups)), na.rm = TRUE),
groups = if_else(is.na(groups), paste0("group", maxgrp + cumsum(is.na(groups))), groups)
) %>%
select(-maxgrp)
# # A tibble: 5 x 2
# player groups
# <dbl> <chr>
# 1 1 group1
# 2 2 group2
# 3 3 group2
# 4 4 group3
# 5 5 group4
data.table
library(data.table)
DT <- as.data.table(df)
DT[, groups := fifelse(
is.na(groups),
paste0("group", cumsum(is.na(groups)) + max(as.integer(gsub("[^0-9]", "", groups)), na.rm = TRUE)),
groups) ]
This was tricky. I think we could do it this way, although the offset is hard-coded for this example (the highest existing group is group2):
library(dplyr)
df %>%
mutate(x = cumsum(groups %in% NA)+1) %>%
mutate(groups = ifelse(is.na(groups), paste0("group", x+1), groups), .keep="unused")
player groups
<dbl> <chr>
1 1 group1
2 2 group2
3 3 group2
4 4 group3
5 5 group4
You could do:
df |>
mutate(new_group = max(parse_number(groups), na.rm = TRUE) + cumsum(is.na(groups)),
groups = if_else(is.na(groups), paste0("group", new_group), groups)) |>
select(-new_group)
Using a slightly different example, where another group appears after the missing values, this would give you:
Input:
library(tidyverse)
df <- tibble(player=c(1,2,3,4,5,6),groups=c("group1","group2","group2",NA,NA, "group3"))
# A tibble: 6 x 2
player groups
<dbl> <chr>
1 1 group1
2 2 group2
3 3 group2
4 4 NA
5 5 NA
6 6 group3
Output:
# A tibble: 6 x 2
player groups
<dbl> <chr>
1 1 group1
2 2 group2
3 3 group2
4 4 group4
5 5 group5
6 6 group3
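The logic shared by the answers above (take the largest existing group number, then count upward over the missing entries) can also be sketched as a small base R function. fill_groups is a hypothetical name, and the sketch assumes at least one non-missing label that contains a number.

```r
# Hypothetical helper: fill NA labels with prefix<max+1>, prefix<max+2>, ...
fill_groups <- function(groups, prefix = "group") {
  missing <- is.na(groups)
  # strip everything that is not a digit, then take the largest number
  max_id <- max(as.integer(gsub("[^0-9]", "", groups)), na.rm = TRUE)
  groups[missing] <- paste0(prefix, max_id + seq_len(sum(missing)))
  groups
}

fill_groups(c("group1", "group2", "group2", NA, NA))
# "group1" "group2" "group2" "group3" "group4"
fill_groups(c("group1", "group2", "group2", NA, NA, "group3"))
# "group1" "group2" "group2" "group4" "group5" "group3"
```

Because the maximum is taken over the whole column, a group that appears after the missing values is never reused.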

iterative functions in R

I’m trying to create multiple new score columns based on other columns. I’d like to use a function to minimize copy pasting large blocks of code.
I’m trying to do something like:
Myfunction <- function(column){
Column_df <- old_df %>%
mutate(column.score = if_else(column == 1, "yes", "no"))
}
Score_df <- Myfunction(c(math, reading, science))
But I'm getting an error saying object 'math' not found.
Starting with an example data frame as below
df <- purrr::map_dfc(c('math', 'reading', 'science', 'history'),
~ rlang::list2(!!.x := sample(1:3, 10, TRUE)))
df
#> # A tibble: 10 × 4
#> math reading science history
#> <int> <int> <int> <int>
#> 1 2 1 3 1
#> 2 3 2 3 1
#> 3 2 2 2 2
#> 4 2 3 1 2
#> 5 3 3 1 2
#> 6 1 2 3 2
#> 7 3 3 2 1
#> 8 3 3 3 2
#> 9 1 2 2 1
#> 10 2 2 2 3
You can create new "score" columns with a function by passing your columns argument to across inside {{ }}, and using the .names argument to append ".score" to each name.
If you want only the "score" columns in the output, rather than adding them to the existing columns, use transmute instead of mutate.
library(dplyr, warn.conflicts = FALSE)
Myfunction <- function(df, columns){
df %>%
mutate(across({{ columns }}, ~ if_else(. == 1, 'yes', 'no'),
.names = '{.col}.score'))
}
df %>%
Myfunction(c(math, reading, science))
#> # A tibble: 10 × 7
#> math reading science history math.score reading.score science.score
#> <int> <int> <int> <int> <chr> <chr> <chr>
#> 1 2 1 3 1 no yes no
#> 2 3 2 3 1 no no no
#> 3 2 2 2 2 no no no
#> 4 2 3 1 2 no no yes
#> 5 3 3 1 2 no no yes
#> 6 1 2 3 2 yes no no
#> 7 3 3 2 1 no no no
#> 8 3 3 3 2 no no no
#> 9 1 2 2 1 yes no no
#> 10 2 2 2 3 no no no
Created on 2022-01-18 by the reprex package (v2.0.1)
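If the column names arrive as strings rather than bare names, the same idea can be sketched without tidy evaluation. This is a base R sketch under that assumption; add_scores and the small data frame are made up for illustration.

```r
# Hypothetical helper: add a "<col>.score" column for each named column.
add_scores <- function(df, columns) {
  for (col in columns) {
    df[[paste0(col, ".score")]] <- ifelse(df[[col]] == 1, "yes", "no")
  }
  df
}

scores <- data.frame(math = c(1, 2), reading = c(3, 1), history = c(1, 1))
out <- add_scores(scores, c("math", "reading"))
names(out)
# "math" "reading" "history" "math.score" "reading.score"
```

Only the listed columns get a score column; history is left untouched.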

Turn list of lists into dataframe in most efficient way in R

I want to turn a list of lists into a dataframe where each name(listoflists) is a different column name. It is a VERY big list so I want to do this in the most efficient way possible.
listoflists=list(c(1,2,3,4),c(5,6,7,8))
names(listoflists) <- c("col1", "col2")
What could I do to get the following results:
print(df)
col1 col2
1 5
2 6
3 7
4 8
We can wrap with data.frame
data.frame(listoflists)
-output
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
In case the OP meant a nested list, unlist each inner list element by looping over the list with lapply, then wrap with data.frame
data.frame(lapply(listoflists, unlist))
If it should be the most efficient way, setDT can be applied as well; it converts the list to a data.table by reference
library(data.table)
setDT(listoflists)
-output
> listoflists
col1 col2
1: 1 5
2: 2 6
3: 3 7
4: 4 8
We can use !!! operator with tibble. Some examples:
library(tidyverse)
#first case
listoflists1=list(c(1,2,3,4),c(5,6,7,8)) %>%
set_names(c("col1", "col2"))
tibble(!!!listoflists1)
#> # A tibble: 4 × 2
#> col1 col2
#> <dbl> <dbl>
#> 1 1 5
#> 2 2 6
#> 3 3 7
#> 4 4 8
#second case
listoflists2=list(list(1:4),list(5:8)) %>%
set_names(c("col1", "col2"))
tibble(!!!listoflists2) %>%
unnest(names(.))
#> # A tibble: 4 × 2
#> col1 col2
#> <int> <int>
#> 1 1 5
#> 2 2 6
#> 3 3 7
#> 4 4 8
#third case
listoflists3=list(list(1,2,3,4),list(5,6,7,8)) %>% set_names(c("col1", "col2"))
tibble(!!!listoflists3) %>%
unnest(names(.))
#> # A tibble: 4 × 2
#> col1 col2
#> <dbl> <dbl>
#> 1 1 5
#> 2 2 6
#> 3 3 7
#> 4 4 8
#it is possible to use unnest without cols argument but it will throw a warning message.
Created on 2021-09-22 by the reprex package (v2.0.0)
You can use list2DF.
list2DF(listoflists)
# col1 col2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
This also works with a list of lists, not only with a list of numeric vectors.
x <- list(list(1,2,3,4),list(5,6,7,8))
names(x) <- c("col1", "col2")
list2DF(x)
# col1 col2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
If all list elements have the same length, an efficient but not recommended way might be:
class(listoflists) <- "data.frame"
attr(listoflists, "row.names") <- .set_row_names(length(listoflists[[1]]))
listoflists
# col1 col2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
library(dplyr)
bind_rows(listoflists)
# A tibble: 4 x 2
col1 col2
<dbl> <dbl>
1 1 5
2 2 6
3 3 7
4 4 8
If you cannot use external libraries for some reason you can create a data.frame as follows:
out_df <- do.call(data.frame, listoflists)
If you want a faster implementation of data.frame you can use the data.table package instead.
library(data.table)
out_dt <- do.call(data.table, listoflists)
However, the setDT approach proposed by #arkun should be faster in this case, as it doesn't make a copy of the data.
N.B. My solutions work correctly only if your input is actually a list of vectors, not if it is a list of lists.
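The difference between a list of vectors and a list of lists is easy to check before converting. The sketch below assumes R 4.0.0 or later (for list2DF) and reuses the nested example from above: list2DF keeps the inner lists as list-columns, while unlisting first gives ordinary numeric columns.

```r
x <- list(col1 = list(1, 2, 3, 4), col2 = list(5, 6, 7, 8))

nested <- list2DF(x)                    # columns stay as lists
flat   <- data.frame(lapply(x, unlist)) # columns become numeric vectors

sapply(nested, is.list)
# col1 col2
# TRUE TRUE
sapply(flat, is.numeric)
# col1 col2
# TRUE TRUE
```

Both print the same way, so it is worth checking the column types when later code expects plain vectors.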

Calculating means for multiple groups in R

Hello fellow overflowers,
currently I'm trying to calculate means for multiple groups.
My df looks like this (~600 rows):
col1 col2 col3 col4 col5
<type> <gender> <var1> <var2> <var3>
1 A 1 3 2 3
2 A 2 NA 5 NA
3 A 1 3 3 5
4 B 1 4 NA 1
5 B 2 3 4 5
Now the result should look like this:
col1 col2 col3 col4 col5
<type> <gender> <mean-var1> <mean-var2> <mean-var3>
1 A 1 3.6 4.1 4.6
2 A 2 4.1 3.8 4.2
3 B 1 3.9 4.2 3.7
4 B 2 4.3 3.2 2.7
5 C 1 3.5 4.5 3.6
6 C 2 4 3.7 4.2
...
So far, I've tried to use the group_by function:
avg_values<-data%>%
group_by(type, gender) %>%
summarize_all (mean())
So far, it didn't work out. Could you help me figure out a good way to handle this?
Does this work:
library(dplyr)
df %>% group_by(type, gender) %>% summarise(across(var1:var3, ~ mean(., na.rm = TRUE)))
`summarise()` regrouping output by 'type' (override with `.groups` argument)
# A tibble: 4 x 5
# Groups: type [2]
type gender var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 2.5 4
2 A 2 NaN 5 NaN
3 B 1 4 NaN 1
4 B 2 3 4 5
Data used:
df
# A tibble: 5 x 5
type gender var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 2 3
2 A 2 NA 5 NA
3 A 1 3 3 5
4 B 1 4 NA 1
5 B 2 3 4 5
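For comparison, the same grouped means can be sketched in base R with aggregate. This is a sketch using the sample data above, rebuilt as a plain data.frame; it uses the by interface rather than a formula because, as far as I know, the formula interface drops rows containing NA by default, which would change the results.

```r
df <- data.frame(type   = c("A", "A", "A", "B", "B"),
                 gender = c(1, 2, 1, 1, 2),
                 var1   = c(3, NA, 3, 4, 3),
                 var2   = c(2, 5, 3, NA, 4),
                 var3   = c(3, NA, 5, 1, 5))

# mean of each variable per type/gender combination, ignoring NA values
res <- aggregate(df[c("var1", "var2", "var3")],
                 by = df[c("type", "gender")],
                 FUN = mean, na.rm = TRUE)
res
```

Groups whose values are all NA come out as NaN, just as in the dplyr output above.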

bootstrap by group in tibble

Suppose I have a tibble tbl_
tbl_ <- tibble(id = c(1,1,2,2,3,3), dta = 1:6)
tbl_
# A tibble: 6 x 2
id dta
<dbl> <int>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 3 6
There are 3 id groups. I want to resample entire id groups 3 times with replacement. For example the resulting tibble can be:
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 1 2
5 3 5
6 3 6
but not
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 2 4
5 3 5
6 3 6
or
id dta
<dbl> <int>
1 1 1
2 1 1
3 2 3
4 2 4
5 3 5
6 3 6
Here is one option with sample_n and distinct
library(tidyverse)
distinct(tbl_, id) %>%
sample_n(nrow(.), replace = TRUE) %>%
pull(id) %>%
map_df( ~ tbl_ %>%
filter(id == .x)) %>%
arrange(id)
# A tibble: 6 x 2
# id dta
# <dbl> <int>
#1 1.00 1
#2 1.00 2
#3 1.00 1
#4 1.00 2
#5 3.00 5
#6 3.00 6
One option is to get the minimum row number for each id. Those row numbers are then sampled with replace = TRUE.
library(dplyr)
tbl_ %>% mutate(rn = row_number()) %>%
group_by(id) %>%
summarise(minrow = min(rn)) ->min_row
indx <- rep(sample(min_row$minrow, nrow(min_row), replace = TRUE), each = 2) +
rep(c(0,1), 3)
tbl_[indx,]
# # A tibble: 6 x 2
# id dta
# <dbl> <int>
# 1 1.00 1
# 2 1.00 2
# 3 3.00 5
# 4 3.00 6
# 5 2.00 3
# 6 2.00 4
Note: the answer above assumes 2 rows for each id, but it can handle any number of ids. The hard-coded each = 2 and c(0,1) need to be modified to handle more than 2 rows per id.
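A version that does not depend on the number of rows per id can be sketched in base R: draw the ids with replacement, then stack the full block of rows for each drawn id. resample_groups is a hypothetical helper name, and the result depends on the random seed.

```r
# Hypothetical helper: resample whole id groups with replacement.
resample_groups <- function(data, id_col) {
  ids <- unique(data[[id_col]])
  # sample.int avoids the surprise sample() has with a length-one vector
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  pieces <- lapply(drawn, function(g) data[data[[id_col]] == g, , drop = FALSE])
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

tbl <- data.frame(id = c(1, 1, 2, 2, 3, 3), dta = 1:6)
set.seed(1)
boot <- resample_groups(tbl, "id")
boot
```

Each drawn id contributes all of its rows, so groups are never split, and groups of different sizes are handled without changes.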
