I am looking to find common cases across groups in R, based on a tidy data set.
I could split the data set and then join the pieces, or use Reduce, but that seems laborious and I'm sure there must be a way to do this easily for tidy data, likely using dplyr and group_by().
Here is an example:
data <- data.frame(case = c('A', 'B', 'C', 'D', 'B', 'C', 'D', 'E'),
                   var = c(rep(1, 4), rep(2, 4)))
  case var
1    A   1
2    B   1
3    C   1
4    D   1
5    B   2
6    C   2
7    D   2
8    E   2
What I want is the cases common across variables: 'B', 'C', 'D'. I am thinking this should be easy but can't find an answer.
Group by case, then grab the first row of each case that appears in every distinct value of var.
library(dplyr)
data %>%
  group_by(case) %>%
  # which() returns 1 when the case covers every distinct var and
  # integer(0) otherwise; [1] then gives NA, so slice() drops that group
  slice(which(n_distinct(var) == n_distinct(.$var))[1])
After grouping by 'case', filter the groups whose number of distinct elements in 'var' equals the total number of distinct elements in 'var', then ungroup and get the distinct 'case':
library(dplyr)
data %>%
  group_by(case) %>%
  filter(n_distinct(var) == n_distinct(.$var)) %>%
  ungroup() %>%
  distinct(case)
# A tibble: 3 x 1
# case
# <fct>
#1 B
#2 C
#3 D
Or using data.table
library(data.table)
setDT(data)[, .GRP[uniqueN(var) == uniqueN(data$var)], case]$case
#[1] B C D
Or using base R
with(data, names(Filter(function(x) all(unique(var) %in% x), split(var, case))))
#[1] "B" "C" "D"
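Since the question mentions Reduce(), that route is also compact in base R. A minimal sketch: split the cases by group, then intersect across the groups.
Reduce(intersect, split(data$case, data$var))
#[1] "B" "C" "D"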
Here is my data
data <- data.frame(a= c(1, NA, 3), b = c(2, 4, NA), c=c(NA, 1, 2))
I wish to select only the rows with no missing data in column a AND b. For my example, only the first row will be selected.
I could use
data %>% filter(!is.na(a) & !is.na(b))
to achieve my purpose. But I wish to do it using if_any/if_all, if possible. I tried data %>% filter(if_all(c(a, b), !is.na)) but this returns an error. My question is how to do it in dplyr through if_any/if_all.
if_all() expects a function or a purrr-style lambda rather than a bare expression, so wrap !is.na() in ~:
data %>%
  filter(if_all(c(a, b), ~ !is.na(.)))
a b c
1 1 2 NA
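An anonymous function works as well; on R 4.1+ the backslash lambda shorthand is a drop-in replacement for the ~ formula (a sketch):
data %>%
  filter(if_all(c(a, b), \(x) !is.na(x)))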
We could use filter with if_all
library(dplyr)
data %>%
  filter(if_all(c(a, b), complete.cases))
Output:
a b c
1 1 2 NA
This could do the trick: use filter_at() with all_vars() in dplyr (note that filter_at() has since been superseded by if_all()):
data %>%
  filter_at(vars(a, b), all_vars(!is.na(.)))
Output:
# a b c
#1 1 2 NA
Example data frame:
> df <- data.frame(A = c('a', 'b', 'c'), B = c('c','d','e'))
> df
A B
1 a c
2 b d
3 c e
The following returns all rows in which any value is "c"
> df %>% filter_all(any_vars(. == "c"))
A B
1 a c
2 c e
How do I return the inverse of this, all rows in which no value is ever "c"? In this example, that would be row 2 only. Tidyverse solutions preferred, thanks.
EDIT: To be clear, I am asking about exact matching, I don't care if a value contains a "c", just if the value is exactly "c".
Do you have to use dplyr?
df[rowSums(df == 'c') == 0, ]
# A B
#2 b d
Adding the OP's comments into the answer:
This works for me, thank you. My original issue was that any row with a "c" somewhere also had an NA somewhere else, so the adapted solution is
df[rowSums(df == 'c', na.rm = TRUE) == 0, ]
Honestly this is more readable than dplyr syntax. But as I asked for a dplyr solution, I accepted another answer.
dplyr
FYI, filter_all has been superseded by the use of if_any or if_all.
df %>%
filter(if_all(everything(), ~ . != "c"))
# A B
# 1 b d
library(dplyr)
df <- data.frame(A = c('a', 'b', 'c', NA, 'c'), B = c('c','d','e', 'g', NA))
A B
1 a c
2 b d
3 c e
4 <NA> g
5 c <NA>
df %>% filter_all(all_vars(. != "c" | is.na(.)))
A B
1 b d
2 <NA> g
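Since filter_all() is superseded, the same NA-tolerant filter can be written with if_all(); a sketch of the equivalent:
df %>%
  filter(if_all(everything(), ~ .x != "c" | is.na(.x)))
#     A B
#1    b d
#2 <NA> g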
I have a list of dataframes to which I am trying to apply a script that works for a single data frame.
Part of the script uses both piping and group_by:
df2 <- df1 %>%
  group_by(col1) %>%
  summarise(newcol = sum(col2))
I've tried various loops or variations with lapply but haven't been able to find a way for it to work with a list of dataframes, where it would be something along the lines of:
mylist2 <- mylist1 %>%
  group_by(col1) %>%
  summarise(newcol = sum(col2))
But obviously changed around to work with loops or lapply. I'm probably missing something simple here but would appreciate some help. Thanks
PS - I looked at providing the data from the lists but wasn't able to provide reproducible samples.
Here is a tidyverse way.
# generate some data
mylist1 <- replicate(2,
                     data.frame(col1 = rep(letters[1:2], 2),
                                col2 = 1:4),
                     simplify = FALSE)
library(purrr)
library(dplyr)
mylist1 %>%
  map(~ group_by(.x, col1) %>%
        summarise(new_col = sum(col2)))
#[[1]]
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
#[[2]]
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
In base R you might try lapply and tapply
lapply(mylist1, function(x)
  tapply(X = x[["col2"]], INDEX = x[["col1"]], FUN = sum))
#[[1]]
#a b
#4 6
#[[2]]
#a b
#4 6
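If an explicit loop is preferred, the same summarise step can be wrapped in a for loop; a sketch where mylist2 is just a name for the result:
library(dplyr)
mylist2 <- vector("list", length(mylist1))
for (i in seq_along(mylist1)) {
  mylist2[[i]] <- mylist1[[i]] %>%
    group_by(col1) %>%
    summarise(newcol = sum(col2))
}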
I want to rename many columns. Right now I rewrite the statement for each column:
df <- data.frame(col1 = 1:4, col2 = c('a', 'b', 'c', 'd'), col3 = rep(1,4))
df %>%
  rename(col1_new = col1) %>%
  rename(col2_new = col2) %>%
  rename(col3_new = col3)
How do I avoid the duplication of the rename statement? Is there a solution using functional programming with R?
It is easier to use setNames() here than rename():
df %>%
  setNames(., paste0(names(.), "_new"))
# col1_new col2_new col3_new
#1 1 a 1
#2 2 b 1
#3 3 c 1
#4 4 d 1
If there is no constraint that all the steps must be done within the %>% chain, an easier and more general approach is
colnames(df) <- paste0(colnames(df), "_new")
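If staying within dplyr is preferred and you are on dplyr 1.0 or later, rename_with() applies a renaming function to the selected columns; a minimal sketch using the same suffix:
library(dplyr)
df %>%
  rename_with(~ paste0(.x, "_new"))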
What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2 was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.
With data.table, you can do
library(data.table)
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI].
.EACHI refers to each row of i=a.
j=.N uses a special variable for the number of rows.
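For contrast, joining first and then counting reproduces the misleading count from the question, because the NA row the join produces for the unreferenced id still counts as one row; by=.EACHI avoids this:
b[a, on = "id"][, .N, by = id]
#   id N
#1:  1 2
#2:  2 1
#3:  3 3
#4:  4 1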
There are already some good answers, but since the question asks to avoid extra packages, here is one in base R. We perform a left join of a and b and append a refs column which is TRUE if ref_id is not NA, then use aggregate to sum over the refs column:
m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)
giving:
id refs
1 1 2
2 2 0
3 3 3
4 4 1
It does require another package, but I'd feel remiss for not mentioning tidylog, which provides reports for a wide range of tidyverse verbs. In your case, it would produce a report like:
library(tidylog)
a <- data.frame(id = c(1, 2, 3, 4 ))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4), ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))
a %>% left_join(b, by='id')
left_join: added one column (ref_id)
> rows only in x     1
> rows only in y    (0)
> matched rows       6  (includes duplicates)
>                  ===
> rows total         7
id ref_id
1 1 a
2 1 b
3 2 <NA>
4 3 c
5 3 d
6 3 e
7 4 f
See the tidylog documentation for more examples and info.
I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 0
3 3 3
4 4 1
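A variant that avoids testing ref_id at all is to count the references in b first and then join the counts back onto a, filling unreferenced ids with zero. A sketch (count()'s name argument needs dplyr >= 0.8.1, and replace_na() comes from tidyr):
library(dplyr)
library(tidyr)
b %>%
  count(id, name = "refs") %>%          # reference counts per id
  right_join(a, by = "id") %>%          # keep every id in a
  mutate(refs = replace_na(refs, 0L)) %>%
  arrange(id)
#  id refs
#1  1    2
#2  2    0
#3  3    3
#4  4    1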