I am looking to find common cases across groups in R, based on a tidy data set.
I could split the data sets and then join them, or use Reduce, but that seems laborious, and I am sure there must be an easy way to do this for tidy data, likely using dplyr and group_by().
Here is an example:
data <- data.frame(case = c('A', 'B', 'C', 'D', 'B', 'C', 'D', 'E'),
var = c(rep(1,4), rep(2, 4)))
case var
1 A 1
2 B 1
3 C 1
4 D 1
5 B 2
6 C 2
7 D 2
8 E 2
What I want is the cases common across variables: 'B', 'C', 'D'. I am thinking this should be easy but can't find an answer.
Group by case, then grab the first row for those cases that have the correct number of occurrences.
data %>%
group_by(case) %>%
slice(which(n_distinct(var) == n_distinct(.$var))[1])
After grouping by 'case', filter the groups having the number of distinct elements in 'var' equal to all the distinct elements in 'var', ungroup and get the distinct 'case'
data %>%
group_by(case) %>%
filter(n_distinct(var) == n_distinct(.$var)) %>%
ungroup %>%
distinct(case)
# A tibble: 3 x 1
# case
# <fct>
#1 B
#2 C
#3 D
Or using data.table
setDT(data)[, .GRP[uniqueN(var) == uniqueN(data$var)], case]$case
#[1] B C D
Or using base R
with(data, names(Filter(function(x) all(unique(var) %in% x), split(var, case))))
#[1] "B" "C" "D"
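For comparison, the split-and-Reduce route mentioned in the question is shorter than it may sound. This is only a sketch under the assumption that 'case' can be treated as character (the as.character() call guards against factors in older R versions); the name common is made up for illustration.

```r
# Example data from the question
data <- data.frame(case = c('A', 'B', 'C', 'D', 'B', 'C', 'D', 'E'),
                   var = c(rep(1, 4), rep(2, 4)))

# Split the cases by 'var', then intersect the resulting vectors pairwise
cases_by_var <- split(as.character(data$case), data$var)
common <- Reduce(intersect, cases_by_var)
common
```

Because Reduce() folds intersect() over every element of the list, the same two lines should carry over unchanged when 'var' has more than two levels.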
Here is my data
data <- data.frame(a= c(1, NA, 3), b = c(2, 4, NA), c=c(NA, 1, 2))
I wish to select only the rows with no missing data in columns a AND b. For my example, only the first row will be selected.
I could use
data %>% filter(!is.na(a) & !is.na(b))
to achieve my purpose. But I wish to do it using if_any/if_all, if possible. I tried data %>% filter(if_all(c(a, b), !is.na)) but this returns an error. My question is how to do it in dplyr through if_any/if_all.
data %>%
filter(if_all(c(a,b), ~!is.na(.)))
a b c
1 1 2 NA
We could use filter with if_all
data %>%
filter(if_all(c(a,b), complete.cases))
a b c
1 1 2 NA
This could do the trick - use filter_at and all_vars in dplyr:
data %>%
filter_at(vars(a, b), all_vars(!is.na(.)))
# a b c
#1 1 2 NA
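As for why the attempt in the question fails: !is.na is not a function, it is `!` applied to the function object is.na, so if_all() needs a lambda (or a real function) instead. The sketch below contrasts if_all() and if_any() on the same data; it assumes dplyr 1.0.4 or later, and the names all_present and any_present are only illustrative.

```r
library(dplyr)

data <- data.frame(a = c(1, NA, 3), b = c(2, 4, NA), c = c(NA, 1, 2))

# Keep rows where BOTH a and b are non-missing
all_present <- data %>%
  filter(if_all(c(a, b), ~ !is.na(.x)))

# Keep rows where AT LEAST ONE of a and b is non-missing
any_present <- data %>%
  filter(if_any(c(a, b), ~ !is.na(.x)))
```

Here all_present holds only the first row, while any_present keeps all three rows, since every row has at least one of a or b filled in.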
Example data frame:
> df <- data.frame(A = c('a', 'b', 'c'), B = c('c','d','e'))
> df
1 a c
2 b d
3 c e
The following returns all rows in which any value is "c"
> df %>% filter_all(any_vars(. == "c"))
1 a c
2 c e
How do I return the inverse of this, all rows in which no value is ever "c"? In this example, that would be row 2 only. Tidyverse solutions preferred, thanks.
EDIT: To be clear, I am asking about exact matching, I don't care if a value contains a "c", just if the value is exactly "c".
Do you have to use dplyr?
df[rowSums(df == 'c') == 0, ]
# A B
#2 b d
Adding OP's comments into answer
This works for me, thank you. My original issue was that any row with a "c" somewhere also had an NA somewhere else, so the adapted solution is
df[rowSums(df == 'c', na.rm = TRUE) == 0, ]
Honestly this is more readable than dplyr syntax. But as I asked for a dplyr solution, I accepted another answer.
FYI, filter_all has been superseded by the use of if_any or if_all.
df %>%
filter(if_all(everything(), ~ . != "c"))
# A B
# 1 b d
df <- data.frame(A = c('a', 'b', 'c', NA, 'c'), B = c('c','d','e', 'g', NA))
1 a c
2 b d
3 c e
4 <NA> g
5 c <NA>
df %>% filter_all(all_vars(. != "c" | is.na(.)))
1 b d
2 <NA> g
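The same NA-tolerant filter can probably be written with the newer if_any() as well. A sketch, assuming dplyr 1.0.4 or later: %in% returns FALSE for NA rather than NA, so negating if_any() does not silently drop rows that contain missing values. The name no_c is made up for illustration.

```r
library(dplyr)

df <- data.frame(A = c('a', 'b', 'c', NA, 'c'),
                 B = c('c', 'd', 'e', 'g', NA),
                 stringsAsFactors = FALSE)

# Drop every row in which any column is exactly "c";
# %in% treats NA as "not a match", so rows with NA survive
no_c <- df %>%
  filter(!if_any(everything(), ~ .x %in% "c"))
no_c
```

Using .x == "c" instead of %in% would give NA for the rows containing missing values, and filter() discards rows where the condition is NA.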
I have a list of data frames to which I am trying to apply a script that works for a single data frame.
Part of the script uses both piping and group_by:
df2 <- df1 %>%
group_by (col1) %>%
summarise(newcol = sum(col2))
I've tried various loops and variations with lapply, but haven't been able to find a way to make it work with a list of data frames, where it would be something along the lines of:
mylist2 <- mylist1 %>%
group_by (col1) %>%
summarise(newcol = sum(col2))
But obviously changed around to work with loops or lapply. I'm probably missing something simple here but would appreciate some help. Thanks
PS - I looked at providing the data from the lists but wasn't able to provide reproducible samples.
Here is a tidyverse way.
# generate some data
mylist1 <- replicate(2, data.frame(col1 = rep(letters[1:2], 2),
col2 = 1:4),
simplify = FALSE)
mylist1 %>%
map(., ~ group_by(., col1) %>%
summarise(new_col = sum(col2)))
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
In base R you might try lapply and tapply
lapply(mylist1, function(x)
tapply(X = x[["col2"]], INDEX = x[["col1"]], FUN = 'sum'))
#a b
#4 6
#a b
#4 6
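If you would rather keep the original dplyr pipeline exactly as it was written for one data frame, one option is to wrap it in a function and hand that to lapply. This is a sketch only; summarise_one is a made-up helper name, and mylist1 is the toy data generated above.

```r
library(dplyr)

# Toy list of two identical data frames
mylist1 <- replicate(2, data.frame(col1 = rep(letters[1:2], 2),
                                   col2 = 1:4),
                     simplify = FALSE)

# The single-data-frame script, wrapped in a function
summarise_one <- function(df) {
  df %>%
    group_by(col1) %>%
    summarise(newcol = sum(col2))
}

# Apply it to every element; the result is again a list
mylist2 <- lapply(mylist1, summarise_one)
```

Each element of mylist2 is then the summarised version of the corresponding element of mylist1, so any later steps of the script can be added inside summarise_one without touching the loop.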
I want to rename many columns. Right now I rewrite the statement for each column:
df <- data.frame(col1 = 1:4, col2 = c('a', 'b', 'c', 'd'), col3 = rep(1,4))
df %>%
rename(col1_new = col1) %>%
rename(col2_new = col2) %>%
rename(col3_new = col3)
How do I avoid the duplication of the rename statement? Is there a solution using functional programming with R?
It is easier to use setNames than rename
df %>%
setNames(., paste0(names(.), "_new"))
# col1_new col2_new col3_new
#1 1 a 1
#2 2 b 1
#3 3 c 1
#4 4 d 1
If there is no requirement that all the steps be done within the %>% chain, an easier and more general approach is
colnames(df) <- paste0(colnames(df), "_new")
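If you want to stay inside the pipe and keep a dplyr verb, rename_with() takes a function that is applied to the column names. A sketch, assuming dplyr 1.0.0 or later; the name renamed is only illustrative.

```r
library(dplyr)

df <- data.frame(col1 = 1:4, col2 = c('a', 'b', 'c', 'd'), col3 = rep(1, 4))

# Apply one function to all column names instead of repeating rename()
renamed <- df %>%
  rename_with(~ paste0(.x, "_new"))
names(renamed)
```

rename_with() also accepts a selection as its .cols argument, so the same call can be restricted to a subset of columns, for example rename_with(~ paste0(.x, "_new"), c(col1, col2)).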
What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2 was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.
With data.table, you can do
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI].
.EACHI refers to each row of i=a.
j=.N uses a special variable for the number of rows.
There are already some good answers, but since the question asks to avoid extra packages, here is a base R one. We perform a left join on a and b and append a refs column which is TRUE if ref_id is not NA. Then we use aggregate to sum over the refs column:
m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)
id refs
1 1 2
2 2 0
3 3 3
4 4 1
It does require another package, but I'd feel remiss for not mentioning tidylog, which provides reports for a wide range of tidyverse verbs. In your case, it would produce a report like:
library(dplyr)
library(tidylog)  # load after dplyr so its logging wrappers are used
a <- data.frame(id = c(1, 2, 3, 4 ))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4), ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))
a %>% left_join(b, by='id')
left_join: added one column (ref_id)
> rows only in x 1
> rows only in y (0)
> matched rows 6 (includes duplicates)
> ===
> rows total 7
id ref_id
1 1 a
2 1 b
3 2 <NA>
4 3 c
5 3 d
6 3 e
7 4 f
See here and here for more examples/info
I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 0
3 3 3
4 4 1
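Another way to get the same numbers, as a sketch rather than the one proper way: count the references in b first, join the counts onto a, and turn the NA left behind for unmatched IDs into 0. The names ref_counts and result are made up for illustration; count(name = ...) and coalesce() are assumed to be available from a reasonably recent dplyr.

```r
library(dplyr)

a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))

# Number of references per id that actually occurs in b
ref_counts <- b %>%
  count(id, name = "refs")

# IDs without any reference get NA from the join; replace that with 0
result <- a %>%
  left_join(ref_counts, by = "id") %>%
  mutate(refs = coalesce(refs, 0L))
result
```

Counting before the join keeps the intermediate table small, and the zero for ID 2 is produced explicitly instead of relying on how NA rows happen to be counted.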