finding the minimum value of multiple variables by group - r

I would like to find the minimum value of a variable (time) at which each of several other variables equals 1 (or any other value). Basically, my application is finding the first year in which x == 1, for several x. I know how to do this for one x, but I would like to avoid generating multiple reduced data frames of minima and then merging them. Is there an efficient way to do this? Here is my example data and my solution for one variable.
library(plyr)  # provides ddply() and .()
d <- data.frame(cat = c(rep("A",10), rep("B",10)),
                time = c(1:10),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
# first time at which var1 == 1, per category
ddply(d[d$var1==1,], .(cat), summarise,
      start = min(time))

How about this, using dplyr:
library(dplyr)
d %>%
  group_by(cat) %>%
  # for each var column, take the time at the first row where it equals 1
  summarise(across(contains("var"), ~ time[which(.x == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
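One detail worth checking is what happens when a group never reaches 1. The following is a minimal base R sketch (the helper name first_time_or_na and the toy vectors are made up for illustration): indexing with which(...)[1] yields NA for such a group, whereas min() over an empty selection returns Inf with a warning.

```r
# Hypothetical helper: time at the first position where x equals `value`,
# or NA when x never equals `value`.
first_time_or_na <- function(time, x, value = 1) {
  time[which(x == value)[1]]
}

time        <- 1:5
reaches_one <- c(0, 0, 1, 1, 1)
never_one   <- c(0, 0, 0, 0, 0)

first_time_or_na(time, reaches_one)  # 3
first_time_or_na(time, never_one)    # NA

# min() on an empty selection gives Inf (and a warning) rather than NA
suppressWarnings(min(time[never_one == 1]))  # Inf
```

Which behaviour is preferable depends on how missing start years should be reported downstream.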

We can use base R to get the earliest 'time' at which any of the 'var' columns equals 1, grouped by 'cat':
sapply(split(d[-1], d$cat), function(x)
  x$time[min(which(x[-1] == 1, arr.ind = TRUE)[, 1])])
#A B
#4 7
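If one start time per variable is wanted instead of a single value per group, a base R sketch along the following lines may work. It assumes every group has at least one row equal to 1 for each variable; otherwise the pieces differ in length and sapply() returns a list instead of a matrix. The names var_cols and first_time are illustrative.

```r
d <- data.frame(cat  = rep(c("A", "B"), each = 10),
                time = rep(1:10, 2),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1),
                stringsAsFactors = FALSE)

# columns to scan (assumption: they all start with "var")
var_cols <- grep("^var", names(d), value = TRUE)

# for each variable, keep the rows where it equals 1 and take min(time) per cat
first_time <- sapply(var_cols, function(v) {
  hits <- d[d[[v]] == 1, ]
  tapply(hits$time, hits$cat, min)
})

first_time  # rows are categories, columns are variables
```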

Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say var1 == 1 or cat == "A")

Related

Loop through specific columns of dataframe keeping some columns as fixed

I have a large dataset with the two first columns that serve as ID (one is an ID and the other one is a year variable). I would like to compute a count by group and to loop over each variable that is not an ID one. This code below shows what I want to achieve for one variable:
library(tidyverse)
df <- tibble(
ID1 = c(rep("a", 10), rep("b", 10)),
year = c(2001:2020),
var1 = rnorm(20),
var2 = rnorm(20))
df %>%
select(ID1, year, var1) %>%
filter(if_any(starts_with("var"), ~!is.na(.))) %>%
group_by(year) %>%
count() %>%
print(n = Inf)
I cannot use a loop that starts with for(i in names(df)) since I want to keep the variables "ID1" and "year". How can I run this piece of code for all the columns that start with "var"? I tried using quosures, but it did not work: I receive the error select() doesn't handle lists. I also tried to work with select(starts_with("var")), but with no success.
Many thanks!
Another possible solution:
library(tidyverse)
df %>%
group_by(ID1) %>%
summarise(across(starts_with("var"), ~ length(na.omit(.x))))
#> # A tibble: 2 × 3
#> ID1 var1 var2
#> <chr> <int> <int>
#> 1 a 10 10
#> 2 b 10 10
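You can also restrict the loop to the columns whose names contain 'var', so that ID1 and year are never iterated over:
for (i in names(df)[grepl('var', names(df))])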
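The loop header above could be filled in roughly as follows. This is only a sketch: the list name counts, the seed, and the NA values injected into var2 are assumptions added so that the filter has a visible effect. The .data[[i]] pronoun lets dplyr look up a column from a string.

```r
library(dplyr)

set.seed(1)
df <- data.frame(ID1  = rep(c("a", "b"), each = 10),
                 year = 2001:2020,
                 var1 = rnorm(20),
                 var2 = rnorm(20))
df$var2[1:3] <- NA  # hypothetical missing values, for illustration only

var_cols <- names(df)[grepl("var", names(df))]

counts <- list()
for (i in var_cols) {
  counts[[i]] <- df %>%
    filter(!is.na(.data[[i]])) %>%  # drop rows where this variable is missing
    count(year)                     # non-missing rows per year
}

counts$var2
```

Storing each result in a named list keeps the per-variable tables available after the loop instead of only printing them.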

How to subset dataframe based on conditions between columns across rows depending on values

I have a dataframe with information on individual id, period and code of work place. I would like to know who are the individuals who have worked alone for the entire time span of the dataset.
Consider the very simple example below. Individual A worked alone at two work places (x,y) in period 1. Individual B and C worked together at work place z in period 1. Individual B worked alone at work place w in period 2. Individual D worked alone at place k in period 2.
mydf <- data.frame(id=c('A','A','B','C','B','D'),
period=c(1,1,1,1,2,2),
work_place=c('x','y','z','z','w','k'))
I would like to identify the rows concerning those who have worked alone for the entire period, which in this case are those referring to individuals A and D.
ids_alone <- data.frame(id=c('A','A','D'),
period=c(1,1,2),
work_place=c('x','y','k'))
Grouped by 'period' and 'work_place', create a column 'n' with the number of distinct 'id's; then, grouped by 'id', keep only those 'id's for which every element of 'n' is 1:
library(dplyr)
mydf %>%
group_by(period, work_place) %>%
mutate(n = n_distinct(id)) %>%
group_by(id) %>%
filter(all(n ==1)) %>%
ungroup %>%
select(-n)
-output
# A tibble: 3 x 3
# id period work_place
# <chr> <dbl> <chr>
#1 A 1 x
#2 A 1 y
#3 D 2 k
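The same two-step logic can be sketched in base R without extra packages. The intermediate names key, n_ids and alone are illustrative, and the sketch assumes that pasting period and work_place together gives an unambiguous key.

```r
mydf <- data.frame(id = c('A','A','B','C','B','D'),
                   period = c(1,1,1,1,2,2),
                   work_place = c('x','y','z','z','w','k'),
                   stringsAsFactors = FALSE)

# step 1: number of distinct ids per (period, work_place), repeated for each row
key   <- paste(mydf$period, mydf$work_place)
n_ids <- as.vector(tapply(mydf$id, key, function(x) length(unique(x)))[key])

# step 2: an id qualifies only if it was alone in every one of its rows
alone  <- tapply(n_ids == 1, mydf$id, all)
result <- mydf[mydf$id %in% names(alone)[alone], ]
result
```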
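A data.table option (following the same idea as @akrun):
library(data.table)
setDT(mydf)[
  , n := uniqueN(id), .(period, work_place)  # distinct ids per period and work place
][
  , .SD[mean(n) == 1], id                    # n >= 1, so mean(n) == 1 only if every n is 1
][
  , n := NULL
][]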
which gives
id period work_place
1: A 1 x
2: A 1 y
3: D 2 k

How to group multiple rows based on some criteria and sum values in R?

Hi All,
Example :- The above is the data I have. I want to group age 1-2 and count the values. In this data value is 4 for age group 1-2. Similarly I want to group age 3-4 and count the values. Here the value for age group 3-4 is 6.
How can I group age and aggregate the values correspond to it?
I know this way: code-
data.frame(df %>% group_by(df$Age) %>% tally())
But this aggregates the values for each individual Age.
I want the values aggregated over groups of several ages, as in the example above.
Any help on this will be greatly helpful.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
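To see how the grouping expression above works: each %in% test returns a logical vector, and arithmetic turns TRUE/FALSE into 1/0, so ages 1-2 map to 1, ages 3-4 map to 2, and anything else maps to 0. A small sketch with a made-up age vector (the value 5 is added only to show the 0 case):

```r
age <- c(1, 2, 3, 4, 5)  # illustrative values; 5 falls outside both ranges

in_first  <- age %in% 1:2  # TRUE TRUE FALSE FALSE FALSE
in_second <- age %in% 3:4  # FALSE FALSE TRUE TRUE FALSE

# logicals are coerced to 0/1, so the sum encodes the group number
grp <- in_first + 2 * in_second
grp  # 1 1 2 2 0
```

Rows that land in group 0 would show up as a separate group in aggregate(), so they may need to be filtered out first.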
Second a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
mutate(grp = case_when(
age %in% 1:2 ~ "1:2",
age %in% 3:4 ~ "3:4",
TRUE ~ NA_character_
)) %>%
group_by(grp) %>%
tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and cut() from base R:
library(dplyr)
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
                 Name = letters[1:10],
                 stringsAsFactors = FALSE)
df %>%
  count(grp = cut(age, breaks = c(0,2,4)))  # intervals (0,2] and (2,4]
# A tibble: 2 x 2
grp n
<fct> <int>
1 (0,2] 4
2 (2,4] 6
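If the interval notation is not wanted in the output, cut() also accepts a labels argument. A base R sketch, with label strings chosen here only for illustration:

```r
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4))

# breaks define (0,2] and (2,4]; labels replace the default interval names
df$grp <- cut(df$age, breaks = c(0, 2, 4), labels = c("1-2", "3-4"))

counts <- table(df$grp)
counts
```

Ages outside the breaks become NA and are dropped by table() by default.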

Keeping IDs conditional on repeating variable

I have data that looks like this:
Is there a way I can very efficiently (without much R code) retain only 'ID' cases where all instances of 'X' are equal to zero? For example, in this case only ID number 3 should be retained in my data set.
THIS ISSUE IS CLOSED - THERE ARE MULTIPLE STRONG ANSWERS IN THE COMMENTS BELOW
using the data.table package, I was able to quickly pull this together
library(data.table)
df <- data.table(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
df <- df[, .(ident = all(x ==0), y, x), by = ID][ident== TRUE] #aggregate, x, y and identifier by each ID
df[, ident := NULL] # get rid of redundant identifier column
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
subset(df, !ID %in% subset(df, x!=0)$ID)
That is, first find the ID's where x is not zero (subset(df, x!=0)$ID), and then exclude cases with those ID's (!ID %in% subset(df, x!=0)$ID)
Try this: first get all IDs for which any row has a non-zero value, then use that to subset.
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
exclude <- subset(df, x!=0)$ID
new_df <- subset(df, ! ID %in% exclude)
A base R option using ave, where we select the ID if all values (x) for the ID are 0.
df[ave(df$x == 0, df$ID, FUN = all), ]
# ID y x
#7 3 9 0
#8 3 5 0
#9 3 5 0
An equivalent dplyr solution would be
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(x == 0)) %>%
ungroup()
# A tibble: 3 x 3
# ID y x
# <dbl> <dbl> <dbl>
#1 3. 9. 0.
#2 3. 5. 0.
#3 3. 5. 0.

dplyr n_distinct with condition

Using dplyr to summarise a dataset, I want to call n_distinct to count the number of unique occurrences in a column. However, I also want to do another summarise() for all unique occurrences in a column where a condition in another column is satisfied.
Example dataframe named "a":
A B
1 Y
2 N
3 Y
1 Y
a %>% summarise(count = n_distinct(A))
However I also want to add a count of n_distinct(A) where B == "Y"
The result should be:
count
3
when you add the condition the result should be:
count
2
The end result I am trying to achieve is both statements merged into one call that gives me a result like
count_all count_BisY
3 2
What is the appropriate way to go about this with dplyr?
This produces the distinct A counts by each value of B using dplyr.
library(dplyr)
a %>%
group_by(B) %>%
summarise(count = n_distinct(A))
This produces the result:
Source: local data frame [2 x 2]
B count
(fctr) (int)
1 N 1
2 Y 2
To produce the desired output added above using dplyr, you can do the following:
a %>% summarise(count_all = n_distinct(A), count_BisY = length(unique(A[B == 'Y'])))
This produces the result:
count_all count_BisY
1 3 2
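Since the question asks specifically about n_distinct(), it may be worth noting that the same subsetting works inside it. A minimal sketch, assuming the example data frame is built as below (the name res is illustrative):

```r
library(dplyr)

a <- data.frame(A = c(1, 2, 3, 1),
                B = c("Y", "N", "Y", "Y"),
                stringsAsFactors = FALSE)

res <- a %>%
  summarise(count_all  = n_distinct(A),            # distinct values of A overall
            count_BisY = n_distinct(A[B == "Y"]))  # distinct values of A where B is "Y"
res
```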
An alternative is to use the uniqueN function from data.table inside dplyr:
library(dplyr)
library(data.table)
a %>% summarise(count_all = n_distinct(A), count_BisY = uniqueN(A[B == 'Y']))
which gives:
count_all count_BisY
1 3 2
You can also do everything with data.table:
library(data.table)
setDT(a)[, .(count_all = uniqueN(A), count_BisY = uniqueN(A[B == 'Y']))]
which gives the same result.
Filtering the dataframe before performing the summarise works
a %>%
filter(B=="Y") %>%
summarise(count = n_distinct(A))
We can also use aggregate from base R
aggregate(cbind(count=A)~B, a, FUN=function(x) length(unique(x)))
# B count
#1 N 1
#2 Y 2
Based on the OP's expected output
data.frame(count=length(unique(a$A)),
count_BisY = length(unique(a$A[a$B=="Y"])))
