I wrote code that applies several filters ('a', 'b', and 'c') which may result in an empty data frame. If the result is empty, the filters must be dropped one by one (from 'c' back to 'a') until a non-empty data frame is found. Can someone write the code below more elegantly? The code is:
library(dplyr)
library(stringr)
df <- data.frame(a = 1:10, b = letters[1:10]) %>% mutate(c = str_c(a, b))
a.selected <- 1:5
b.selected <- letters[2:5]
c.selected <- c('10j')
filtered <- df %>%
  filter(a %in% a.selected &
         b %in% b.selected &
         c %in% c.selected)
if (nrow(filtered) == 0) {
  filtered1 <- df %>%
    filter(a %in% a.selected &
           b %in% b.selected)
  filtered <- filtered1
  if (nrow(filtered1) == 0) {
    filtered2 <- df %>%
      filter(a %in% a.selected)
    filtered <- filtered2
    if (nrow(filtered2) == 0) {
      filtered3 <- df
      filtered <- filtered3
    }
  }
}
filtered
a b c
1 2 b 2b
2 3 c 3c
3 4 d 4d
4 5 e 5e
I'm not sure it's elegant, but this is considerably shorter:
selections <- list(a.selected, b.selected, c.selected)
# cumulative AND of the per-column membership tests, one element per filter depth
combos <- Reduce(`&`, Map(`%in%`, df, selections), accumulate = TRUE)
# keep the deepest filter that still matches at least one row
df %>%
  filter(combos[[max(which(sapply(combos, any)))]])
#> a b c
#> 1 2 b 2b
#> 2 3 c 3c
#> 3 4 d 4d
#> 4 5 e 5e
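If this pattern comes up repeatedly, the same idea generalizes to a small helper. A minimal sketch, assuming the selections are supplied as a named list ordered from the first filter to the last (the helper name filter_fallback is mine):
# a sketch: keep the deepest non-empty cumulative filter, falling back to df
filter_fallback <- function(df, selections) {
  combos <- Reduce(`&`, Map(`%in%`, df[names(selections)], selections),
                   accumulate = TRUE)
  hits <- which(sapply(combos, any))
  if (length(hits) == 0) return(df)  # even the first filter matched nothing
  df[combos[[max(hits)]], , drop = FALSE]
}
filter_fallback(df, list(a = a.selected, b = b.selected, c = c.selected))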
An option with tidyverse would be to let across() build one logical membership test per column; accumulate() then ANDs them cumulatively, keep(any) drops the depths that match no rows, and the name of the last surviving depth picks the filter that is finally applied:
library(dplyr)
library(purrr)
library(stringr)
df %>%
  mutate(ind = across(everything(),
                      ~ .x %in% get(str_c(cur_column(), ".selected"))) %>%
           accumulate(`&`) %>%
           keep(any) %>%
           tail(1) %>%
           names) %>%
  filter(cur_data()[[first(ind)]] %in% get(str_c(first(ind), ".selected"))) %>%
  select(-ind)
a b c
1 2 b 2b
2 3 c 3c
3 4 d 4d
4 5 e 5e
Edited in response to @akrun's insight:
This works:
require("magrittr")
requireNamespace("dplyr")
df <- data.frame(a = 1:5)
b_column <- c_column <- "a"
df %>% dplyr::mutate(
  b = !!dplyr::sym(b_column),
  c = !!dplyr::sym(c_column))
But when any one of the *_columns is NULL it doesn't:
c_column <- NULL
df %>% dplyr::mutate(
  b = !!dplyr::sym(b_column),
  c = !!dplyr::sym(c_column))
The resulting error is:
Error: Only strings can be converted to symbols
Run `rlang::last_error()` to see where the error occurred.
How would I make the call to ANY of the ensymboled *_column variables resilient to it being NULL?
If we need to check for the NULL case, use an if condition:
df1 <- if (!is.null(c_column)) {
  df %>%
    dplyr::mutate(c = !!dplyr::sym(c_column))
} else df
With multiple columns, an option is map
library(dplyr)
library(purrr)
b_column <- c_column <- "a"
map2_dfc(list(b_column, c_column), c("b", "c"),
         ~ if (!is.null(.x)) df %>% transmute(!!.y := !!sym(.x))) %>%
  bind_cols(df, .)
Output:
# a b c
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
If one of them is NULL
c_column <- NULL
map2_dfc(list(b_column, c_column), c("b", "c"),
         ~ if (!is.null(.x)) df %>% transmute(!!.y := !!sym(.x))) %>%
  bind_cols(df, .)
# a b
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
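Another way to make the call resilient, instead of guarding each column separately: collect the mappings in a named list, drop the NULL entries with purrr::compact(), and splice what survives into mutate() with !!!. A minimal sketch (the cols list is my own scaffolding):
library(dplyr)
library(purrr)
# compact() silently drops NULL entries, so only real mappings remain
cols <- compact(list(b = b_column, c = c_column))
# convert the surviving strings to symbols and splice them into mutate()
df %>% mutate(!!!map(cols, sym))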
Another option is across, making sure to rename only the columns whose source is not NULL (this continues from the chunk above, where c_column is NULL):
nm1 <- c("b", "c")
i1 <- !map_lgl(list(b_column, c_column), is.null)  # which sources are non-NULL
nm2 <- nm1[i1]                                     # keep only their target names
df %>%
  transmute(across(all_of(c(b_column, c_column)), ~ .)) %>%
  rename_at(vars(everything()), ~ nm2) %>%
  bind_cols(df, .)
For example, here is my df:
GP_A <- c(rep("a",3),rep("b",2),rep("c",2))
GP_B <- c(rep("d",2),rep("e",4),rep("f",1))
GENDER <- c(rep("M",4),rep("F",3))
LOC <- c(rep("HK",2),rep("UK",3),rep("JP",2))
SCORE <- c(50,70,80,20,30,80,90)
df <- data.frame(GP_A,GP_B,GENDER,LOC,SCORE)
> df
GP_A GP_B GENDER LOC SCORE
1 a d M HK 50
2 a d M HK 70
3 a e M UK 80
4 b e M UK 20
5 b e F UK 30
6 c e F JP 80
7 c f F JP 90
What I want is:
result[["GP_A"]] <- df %>% group_by(GP_A, GENDER, LOC) %>% summarize(SCORE = mean(SCORE))
result[["GP_B"]] <- df %>% group_by(GP_B, GENDER, LOC) %>% summarize(SCORE = mean(SCORE))
...
I have tried:
result <- list()
for (i in c("GP_A","GP_B")){
  result[[i]] <- df %>% group_by(i, GENDER, LOC) %>% summarize(SCORE = mean(SCORE))
}
Here is the error:
Error: Column `i` is unknown
I also have tried to use setNames, i.e.
... %>% group_by(setNames(nm=i),GENDER,LOC) %>% ...
But it also doesn't work...
The group_by_at() function lets you group by string inputs and is probably the best fit here.
GP_A <- c(rep("a",3),rep("b",2),rep("c",2))
GP_B <- c(rep("d",2),rep("e",4),rep("f",1))
GENDER <- c(rep("M",4),rep("F",3))
LOC <- c(rep("HK",2),rep("UK",3),rep("JP",2))
SCORE <- c(50,70,80,20,30,80,90)
df <- data.frame(GP_A,GP_B,GENDER,LOC,SCORE)
result <- list()
for (i in c("GP_A", "GP_B")) {
  result[[i]] <-
    df %>%
    group_by_at(c(i, "GENDER", "LOC")) %>%
    summarise(SCORE = mean(SCORE)) %>%
    ungroup()
}
Remember that it's always best practice to ungroup() once you finish, so you don't carry unwanted grouping levels into later operations.
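As a side note: group_by_at() is superseded in dplyr 1.0+, so the same loop can also be written with across(all_of()); a sketch of the equivalent:
library(dplyr)
result <- list()
for (i in c("GP_A", "GP_B")) {
  result[[i]] <- df %>%
    group_by(across(all_of(c(i, "GENDER", "LOC")))) %>%
    summarise(SCORE = mean(SCORE), .groups = "drop")  # .groups = "drop" replaces ungroup()
}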
I have a data frame that's of this structure:
df <- data.frame(var1 = c(1,1,1,2,2,3,3,3,3),
                 cat1 = c("A","B","D","B","C","D","E","B","A"))
> df
var1 cat1
1 1 A
2 1 B
3 1 D
4 2 B
5 2 C
6 3 D
7 3 E
8 3 B
9 3 A
And I am looking to create both nodes and edges data frames from it, so that I can draw a network graph using visNetwork. This network will show the number/strength of connections between the different cat1 values, as grouped by the var1 value.
I have the nodes data frame sorted:
nodes <- data.frame(id = unique(df$cat1))
> nodes
id
1 A
2 B
3 D
4 C
5 E
What I'd like help with is how to process df in the following manner:
for each distinct value of var1 in df, tally up the group of nodes that are common to that value of var1 to give an edges dataframe that ultimately looks like the one below. Note that I'm not bothered about the direction of flow along the edges. Just that they are connected is all I need.
> edges
from to value
1 A B 2
2 A D 2
3 A E 1
4 B C 1
5 B D 2
6 B E 1
7 D E 1
With thanks in anticipation,
Nevil
Update: I found a similar problem here, and have adapted that code to give the following, which is getting close to what I want, but not quite there...
df %>%
  group_by(var1) %>%
  filter(n() >= 2) %>%
  do(data.frame(t(combn(.$cat1, 2, function(x) sort(x))),
                stringsAsFactors = FALSE))
# A tibble: 10 x 3
# Groups: var1 [3]
var1 X1 X2
<dbl> <chr> <chr>
1 1. A B
2 1. A D
3 1. B D
4 2. B C
5 3. D E
6 3. B D
7 3. A D
8 3. B E
9 3. A E
10 3. A B
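For what it's worth, one way to finish this attempt is to tally the unordered pairs it produces; a sketch (the column renames are mine):
library(dplyr)
df %>%
  group_by(var1) %>%
  filter(n() >= 2) %>%
  do(data.frame(t(combn(as.character(.$cat1), 2, sort)),
                stringsAsFactors = FALSE)) %>%
  ungroup() %>%
  count(from = X1, to = X2, name = "value")  # count each pair across groups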
I don't know if there is already a suitable function for this task, so here is a detailed procedure. With this, you should be able to define your own function. Hope it helps!
# create an adjacency matrix
mat <- table(df)
mat <- t(mat) %*% mat
as.table(mat) # look at your adjacency matrix
# since the network is not directed, we can consider only the (strictly) upper triangular matrix
mat[lower.tri(mat, diag = TRUE)] <- 0
as.table(mat) # look at the new adjacency matrix
library(dplyr)
edges <- as.data.frame(as.table(mat))
edges <- filter(edges, Freq != 0)
colnames(edges) <- c("from", "to", "value")
edges <- arrange(edges, from)
edges # output
# from to value
#1 A B 2
#2 A D 2
#3 A E 1
#4 B C 1
#5 B D 2
#6 B E 1
#7 D E 1
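Wrapped up as the kind of reusable function mentioned above, it could look like this sketch (the name cooccurrence_edges is mine; it expects a two-column data frame of group and node):
library(dplyr)
cooccurrence_edges <- function(df) {
  mat <- table(df)                        # group x node incidence counts
  mat <- t(mat) %*% mat                   # node x node co-occurrence counts
  mat[lower.tri(mat, diag = TRUE)] <- 0   # undirected: keep the upper triangle
  edges <- as.data.frame(as.table(mat))
  colnames(edges) <- c("from", "to", "value")
  edges %>%
    filter(value != 0) %>%
    arrange(from)
}
cooccurrence_edges(df)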
here are a couple of other ways...
in base R...
# var1 values that occur at least twice (i.e. groups that can form pairs)
values <- unique(df$var1[duplicated(df$var1)])
do.call(rbind,
        lapply(values, function(i) {
          nodes <- as.character(df$cat1[df$var1 == i])
          edges <- combn(nodes, 2)  # all unordered pairs within the group
          data.frame(from = edges[1, ],
                     to = edges[2, ],
                     value = i,
                     stringsAsFactors = F)
        })
)
in tidyverse...
library(dplyr)
library(tidyr)
df %>%
  group_by(var1) %>%
  filter(n() >= 2) %>%
  mutate(cat1 = as.character(cat1)) %>%
  summarise(edges = list(data.frame(t(combn(cat1, 2)), stringsAsFactors = F))) %>%
  unnest(edges) %>%
  select(from = X1, to = X2, value = var1)
in tidyverse using tidyr::complete...
library(dplyr)
library(tidyr)
df %>%
  group_by(var1) %>%
  mutate(cat1 = as.character(cat1)) %>%
  mutate(i.cat1 = cat1) %>%
  complete(cat1, i.cat1) %>%
  filter(cat1 < i.cat1) %>%
  select(from = cat1, to = i.cat1, value = var1)
in tidyverse using tidyr::expand...
library(dplyr)
library(tidyr)
df %>%
  group_by(var1) %>%
  mutate(cat1 = as.character(cat1)) %>%
  expand(cat1, to = cat1) %>%
  filter(cat1 < to) %>%
  select(from = cat1, to, value = var1)
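Note that these variants emit one row per co-occurring pair per group, with value holding the group id rather than the tally the question asks for; to get the tallied edge weights, count the pairs afterwards, e.g. (assuming edges holds the from/to pairs from any variant above):
edges %>%
  ungroup() %>%
  count(from, to, name = "value")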
Consider the following two data.frames:
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])
I would like to remove the exact rows of a1 that are in a2 so that the result should be:
A B
4 d
5 e
4 d
2 b
Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?
The idea is to add a counter for duplicates to each table, so you get a unique match for each occurrence of a row. data.table is nice here because it makes counting the duplicates easy (with .N) and provides the necessary set operation (fsetdiff).
library(data.table)
a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])
# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]
# setdiff gets the exception
# "all = T" allows duplicate rows to be returned
fsetdiff(a1, a2, all = T)
# A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
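If you don't want the helper column in the result, drop it afterwards:
fsetdiff(a1, a2, all = TRUE)[, i := NULL][]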
You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.
library(dplyr)
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)
## Make temp variables to join on, then delete later.
# Create a row number within each (A, B) combination
a1_tmp <-
  a1 %>%
  group_by(A, B) %>%
  mutate(tmp_id = row_number()) %>%
  ungroup()
# Create a count of each (A, B) combination in a2
a2_tmp <-
  a2 %>%
  group_by(A, B) %>%
  summarise(count = n()) %>%
  ungroup()
## Keep all rows that have no entry in a2, or whose row number exceeds the count
## (i.e. the matching a2 entries are used up).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
  filter(is.na(count) | tmp_id > count) %>%
  select(-tmp_id, -count)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
EDIT
Here is a similar solution that is a little shorter. It does the following: (1) adds a row-number column to join the two data.frame objects on; (2) adds a temporary column to a2 (the 2nd data.frame) that will show up as NA after the join, indicating rows unique to a1.
library(dplyr)
left_join(a1 %>% group_by(A, B) %>% mutate(rn = row_number()) %>% ungroup(),
          a2 %>% group_by(A, B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
          by = c('A', 'B', 'rn')) %>%
  filter(is.na(tmpcol)) %>%
  select(-tmpcol, -rn)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
I think this solution is a little simpler (perhaps very little) than the first.
I guess this is similar to DWal's solution but in base R
# paste all columns together into a single row key
a1_temp = Reduce(paste, a1)
# append an occurrence counter so repeated rows get distinct keys
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))
a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))
# keep rows of a1 whose keyed occurrence has no match in a2
a1[!a1_temp %in% a2_temp,]
# A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
Here's another solution with dplyr:
library(dplyr)
a1 %>%
  arrange(A) %>%
  group_by(A) %>%
  filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))
Result:
# A tibble: 4 x 2
# Groups: A [3]
A B
<dbl> <fctr>
1 2 b
2 4 d
3 4 d
4 5 e
This way of filtering avoids creating extra unwanted columns that you have to later remove in the final output. This method also sorts the output. Not sure if it's what you want.
I'm working on a data.table with a column like this:
A <- c("a;b;c","a;a;b","d;a;b","f;f;f")
df <- data.frame(A)
I would like to separate this column into 3 columns like this:
seg1 seg2 seg3
1 a b c
2 a b <NA>
3 d a b
4 f <NA> <NA>
The thing here is that when I split each row by ";", I need to keep only the unique values within the row.
Here's a tidyverse approach. We split the character in A, keep only the unique values, paste the result back together and separate into three columns:
library(tidyverse)
df %>%
  mutate(A = map_chr(strsplit(as.character(A), ";"),
                     .f = ~ paste(unique(.x), collapse = ";"))) %>%
  separate(A, into = c("seg1", "seg2", "seg3"), fill = "right")
Which gives:
# seg1 seg2 seg3
#1 a b c
#2 a b <NA>
#3 d a b
#4 f <NA> <NA>
library(stringr)
A <- c("a;b;c","a;a;b","d;a;b","f;f;f")
df <- data.frame(A)
# split into a 3-column character matrix
segs <- str_split_fixed(df$A, ";", 3)
# per row, drop duplicated values and pad with NA back to 3 entries
segs <- apply(X = segs,
              MARGIN = 1,
              FUN = function(x) x[!duplicated(x)][1:3])
# apply() returns the rows as columns, so transpose back
df <- as.data.frame(t(segs))
names(df) <- c("seg1", "seg2", "seg3")
df
# seg1 seg2 seg3
# 1 a b c
# 2 a b <NA>
# 3 d a b
# 4 f <NA> <NA>
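Since the question mentions data.table, here is a minimal sketch of the same idea in data.table (the dt object and column names mirror the example above):
library(data.table)
dt <- data.table(A = c("a;b;c", "a;a;b", "d;a;b", "f;f;f"))
# split each row, keep unique values, pad with NA to length 3,
# then transpose the list of row results into three columns
dt[, c("seg1", "seg2", "seg3") :=
       transpose(lapply(strsplit(A, ";"), function(x) unique(x)[1:3]))]
dt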