dplyr n_distinct with condition - r

Using dplyr to summarise a dataset, I want to call n_distinct to count the number of unique occurrences in a column. However, I also want to do another summarise() for all unique occurrences in a column where a condition in another column is satisfied.
Example dataframe named "a":
A B
1 Y
2 N
3 Y
1 Y
a %>% summarise(count = n_distinct(A))
This returns:
count
3
I also want a count of n_distinct(A) restricted to the rows where B == "Y". With that condition the result should be:
count
2
The end result I am trying to achieve is both statements merged into one call that gives me a result like
count_all count_BisY
3 2
What is the appropriate way to go about this with dplyr?

The following counts the distinct values of A for each value of B using dplyr:
library(dplyr)
a %>%
group_by(B) %>%
summarise(count = n_distinct(A))
This produces the result:
Source: local data frame [2 x 2]
B count
(fctr) (int)
1 N 1
2 Y 2
To produce the desired output added above using dplyr, you can do the following:
a %>% summarise(count_all = n_distinct(A), count_BisY = length(unique(A[B == 'Y'])))
This produces the result:
count_all count_BisY
1 3 2
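The same idea can be written with n_distinct() for both columns, since the subsetting A[B == "Y"] happens inside summarise(). This is only a minimal sketch that rebuilds the example data frame "a" from the question:

```r
library(dplyr)

# Example data from the question
a <- data.frame(A = c(1, 2, 3, 1),
                B = c("Y", "N", "Y", "Y"))

res <- a %>%
  summarise(
    count_all  = n_distinct(A),            # distinct values of A overall
    count_BisY = n_distinct(A[B == "Y"])   # distinct values of A where B is "Y"
  )
res
```

Subsetting inside the summary function keeps both counts in a single summarise() call, which is what the question asks for.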

An alternative is to use the uniqueN function from data.table inside dplyr:
library(dplyr)
library(data.table)
a %>% summarise(count_all = n_distinct(A), count_BisY = uniqueN(A[B == 'Y']))
which gives:
count_all count_BisY
1 3 2
You can also do everything with data.table:
library(data.table)
setDT(a)[, .(count_all = uniqueN(A), count_BisY = uniqueN(A[B == 'Y']))]
which gives the same result.

Filtering the data frame before calling summarise() also works, although it returns only the conditional count:
a %>%
filter(B=="Y") %>%
summarise(count = n_distinct(A))

We can also use aggregate from base R
aggregate(cbind(count=A)~B, a, FUN=function(x) length(unique(x)))
# B count
#1 N 1
#2 Y 2
Based on the OP's expected output
data.frame(count=length(unique(a$A)),
count_BisY = length(unique(a$A[a$B=="Y"])))

Related

Obtain a Count of all the combinations created in a column when grouping by another column in df with different length combinations in R

Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but not getting close. Closest was setting up a crosstab but didn't know how to get counts from that. Long/wide got me nowhere. etc. Too new still to think out of the box I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to those expected:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
arrange(Guest,State) %>%
group_by(Guest) %>%
summarise(Chain=paste0(State,collapse = '-')) %>%
group_by(Chain,.drop = T) %>%
summarise(N=n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
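The final group_by() and summarise(N = n()) pair can probably be shortened with dplyr's count(). A sketch under that assumption, reusing the sample data from the question (the name "chains" is just illustrative):

```r
library(dplyr)

Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob",
           "fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest, State)

chains <- df %>%
  arrange(Guest, State) %>%                          # sort states within each guest
  group_by(Guest) %>%
  summarise(Chain = paste(State, collapse = "-")) %>% # one combination per guest
  count(Chain, name = "N")                            # how many guests share each combination
chains
```

Sorting before pasting matters: without it, "TX-AL" and "AL-TX" would be counted as different combinations.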
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
Output:
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1

Filtering with multiple criteria on tidy data

I'm struggling with the filter (dplyr) function on a tidy dataframe:
data1<-data.frame("Time"=c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
"Variable"=rep(c("a","b","c","d"),6),
"Value"=c(0,1,0,0,1,1,1,1,1,3,2,3,10,1,3,7,2,1,4,2,3,1,5,13))
What I want to do is keep the rows for the Time at which variable "a" is equal to 2, and, separately, the rows for the Time at which variable "a" is at its maximum.
For the first case my code is:
data1<-data1%>%
group_by(Time)%>%
filter(any(Variable=="a" & Value==2))
and works fine and gives me:
Time Variable Value
4 a 2
4 b 1
4 c 4
4 d 2
I don't know how to do the same for a == max(a). I tried:
data1<-data1%>%
group_by(Time)%>%
filter(any(Variable=="a" & Value==max(Value)))
but it doesn't work (because max(Value) is calculated over all the variables, not only over "a"). I think I need something like
Value=max(Value)[Variable$a].
The filter should return:
Time Variable Value
3 a 10
3 b 1
3 c 3
3 d 7
I prefer a solution with dplyr. Can anyone give me a general rule for filtering on tidy df with multiple criteria?
Here's a dplyr way:
library(dplyr)
data1%>%
filter(Time %in% Time[Variable == "a" & Value == max(Value[Variable == "a"])])
And a data.table way
library(data.table)
setDT(data1)
data1[Time %in% Time[Variable == "a" & Value == max(Value[Variable == "a"])]]
An additional option:
data1 %>%
filter(Variable == "a") %>%
filter(Value == max(Value, na.rm = TRUE)) %>%
select(Time) %>%
left_join(., data1, by = "Time")
Based on the edited criteria this should provide the desired results.
data1 <- data1 %>%
group_by(Time) %>%
filter(any(Variable=="a" &
Value==max(data1$Value[data1$Variable == 'a'])))
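As for a general rule, one pattern that seems to cover both cases is to split the task in two: first find the Time values that satisfy the condition on variable "a", then keep every row with those Time values. A rough sketch (the names a_rows and keep_times are only illustrative, and Time is built with rep() instead of being typed out):

```r
library(dplyr)

data1 <- data.frame(
  Time     = rep(0:5, each = 4),
  Variable = rep(c("a", "b", "c", "d"), 6),
  Value    = c(0,1,0,0, 1,1,1,1, 1,3,2,3, 10,1,3,7, 2,1,4,2, 3,1,5,13)
)

# Step 1: Time values at which "a" meets the criterion (here: its overall maximum)
a_rows     <- filter(data1, Variable == "a")
keep_times <- a_rows$Time[a_rows$Value == max(a_rows$Value)]

# Step 2: keep all variables for those Time values
result <- filter(data1, Time %in% keep_times)
result
```

Swapping the condition in step 1 (for example a_rows$Value == 2) gives the first case without touching step 2. Using %in% rather than == also keeps the filter correct if more than one Time qualifies.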

Count and Assign Consecutive Occurrences of Variable

I wish to count consecutive occurrence of any value and assign that count to that value in next column. Below is the example of input and desired output:
dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"))
dataset$count <- c(1,2,2,2,2,1,4,4,4,4,1,1)
dataset
input count
a 1
b 2
b 2
a 2
a 2
c 1
a 4
a 4
a 4
a 4
b 1
c 1
With rle(dataset$input) I can just get number of occurrences of each value. But I want resulting output in above format.
My question is similar to:
R: count consecutive occurrences of values in a single column
But here output is in sequence and I want to assign the count itself to that value.
rle() returns the length of each run, so you can repeat each run length as many times as the run is long:
with(rle(dataset$input), rep(lengths, lengths))
#[1] 1 2 2 2 2 1 4 4 4 4 1 1
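To see why this works, it may help to look at the two components rle() returns. A small base R sketch using the same input as a plain character vector (the names runs and counts are only illustrative):

```r
input <- c("a","b","b","a","a","c","a","a","a","a","b","c")

runs <- rle(input)
runs$values    # the value of each consecutive run
runs$lengths   # how long each run is

# Repeat every run length once per element of that run,
# so each original position receives the length of the run it belongs to
counts <- rep(runs$lengths, times = runs$lengths)
counts
```

Because the run lengths sum to the length of the input, counts lines up element by element with the original column.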
Using dplyr, we can use lag to create groups and then count the number of rows in each group.
library(dplyr)
dataset %>%
group_by(gr = cumsum(input != lag(input, default = first(input)))) %>%
mutate(count = n())
and with data.table
library(data.table)
setDT(dataset)[, count:= .N, rleid(input)]
Data:
Make sure the input column is character and not factor.
dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"),
stringsAsFactors = FALSE)
We can also use rleid (from data.table) with dplyr
library(dplyr)
library(data.table) # provides rleid()
dataset %>%
group_by(grp = rleid(input)) %>%
mutate(count = n())

finding the minimum value of multiple variables by group

I would like to find the minimum value of a variable (time) that several other variables are equal to 1 (or any other value). Basically my application is finding the first year that x ==1, for several x. I know how to find this for one x but would like to avoid generating multiple reduced data frames of minima, then merging these together. Is there an efficient way to do this? Here is my example data and solution for one variable.
d <- data.frame(cat = c(rep("A",10), rep("B",10)),
time = c(1:10),
var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
library(plyr) # provides ddply()
ddply(d[d$var1==1,], .(cat), summarise,
start= min(time))
How about this using dplyr (funs() is deprecated in current dplyr, so this uses across() with a lambda instead):
library(dplyr)
d %>%
group_by(cat) %>%
summarise(across(contains("var"), ~ time[which(.x == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
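If a long result is acceptable, another possible route is to reshape the var columns first and then take the minimum time per group. This is only a sketch, assuming tidyr is available; the column names variable, value and start are made up for the example:

```r
library(dplyr)
library(tidyr)

d <- data.frame(cat  = rep(c("A", "B"), each = 10),
                time = rep(1:10, 2),
                var1 = c(0,0,0,1,1,1,1,1,1,1, 0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1, 0,0,0,0,0,0,0,1,1,1))

starts <- d %>%
  pivot_longer(starts_with("var"),
               names_to = "variable", values_to = "value") %>%
  filter(value == 1) %>%                          # keep only rows where the variable equals 1
  group_by(cat, variable) %>%
  summarise(start = min(time), .groups = "drop")  # first time per cat and variable
starts
```

This avoids building one reduced data frame per variable, and changing the target value only requires editing the filter() line.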
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
sapply(split(d[-1], d$cat), function(x)
x$time[min(which(x[-1] ==1, arr.ind = TRUE)[, 1])])
#A B
#4 7
Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say var1 == 1 or cat == "A")

Keeping IDs conditional on repeating variable

I have data that looks like this:
Is there a way I can very efficiently (without much R code) retain only 'ID' cases where instances of 'X' are equal to zero? For example, in this case only ID number 3 should be retained in my data set.
THIS ISSUE IS CLOSED - THERE ARE MULTIPLE STRONG ANSWERS IN THE COMMENTS BELOW
Using the data.table package, I was able to quickly pull this together:
library(data.table)
df <- data.table(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
df <- df[, .(ident = all(x ==0), y, x), by = ID][ident== TRUE] #aggregate, x, y and identifier by each ID
df[, ident := NULL] # get rid of redundant identifier column
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
subset(df, !ID %in% subset(df, x!=0)$ID)
That is, first find the ID's where x is not zero (subset(df, x!=0)$ID), and then exclude cases with those ID's (!ID %in% subset(df, x!=0)$ID)
Try this: first get all IDs for which any row has a non-zero value of x, then use that to exclude those IDs:
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
exclude <- subset(df, x!=0)$ID
new_df <- subset(df, ! ID %in% exclude)
A base R option using ave, where we select the ID if all values (x) for the ID are 0.
df[ave(df$x == 0, df$ID, FUN = all), ]
# ID y x
#7 3 9 0
#8 3 5 0
#9 3 5 0
An equivalent dplyr solution would be
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(x == 0)) %>%
ungroup()
# A tibble: 3 x 3
# ID y x
# <dbl> <dbl> <dbl>
#1 3. 9. 0.
#2 3. 5. 0.
#3 3. 5. 0.
