Drop rows conditional on values in other rows using dplyr in R

Using the example data provided below: for each institution type ("a" and "b") I want to drop rows with fac == "no" if there exists a row with fac == "yes" for the same year. I then want to sum the values by year. I am, however, not able to figure out how to drop the correct "no" rows. Below are a couple of my attempts based on answers given here.
library(dplyr)

set.seed(123)
ext <- tibble(
  institution = c(rep("a", 7), rep("b", 7)),
  year = rep(c("2005", "2005", "2006", "2007", "2008", "2009", "2009"), 2),
  fac = rep(c("yes", "no", "no", "no", "no", "yes", "no"), 2),
  value = sample(1:100, 14, replace = TRUE)
)
ext %>%
  group_by(institution, year) %>%
  filter(if (fac == "yes") fac != "no")

ext %>%
  group_by(institution, year) %>%
  case_when(fac == "yes" ~ filter(., fac != "no"))

ext %>%
  group_by(institution, year) %>%
  {if (fac == "yes") filter(., fac != "no")}

Another way would be:
library(dplyr)

ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | n() < 2)
# institution year fac value
# 1 a 2005 yes 31
# 2 a 2006 no 51
# 3 a 2007 no 14
# 4 a 2008 no 67
# 5 a 2009 yes 42
# 6 b 2005 yes 43
# 7 b 2006 no 25
# 8 b 2007 no 90
# 9 b 2008 no 91
# 10 b 2009 yes 69
In case you want the overall amounts by year, add these two lines, which will yield the following output:
  group_by(year) %>%
  summarise(value = sum(value))
# year value
# <chr> <int>
# 1 2005 74
# 2 2006 76
# 3 2007 104
# 4 2008 158
# 5 2009 111
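A caveat on the n() < 2 condition: it only works because, in this data, every group with more than one row also contains a "yes"; a group holding several "no" rows and no "yes" would be dropped entirely. A more general filter (a sketch of the same idea as the all(fac == 'no') answer below, not part of the original answer) keeps "no" rows whenever the group has no "yes" at all:

library(dplyr)
ext %>%
  group_by(institution, year) %>%
  filter(fac == "yes" | !any(fac == "yes")) %>%
  ungroup()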

Does this work? By summarise, I assume you want to sum by year after applying the filter.
library(dplyr)
ext %>% group_by(institution, year) %>% filter(fac == 'yes'|all(fac == 'no'))
# A tibble: 10 x 4
# Groups: institution, year [10]
institution year fac value
<chr> <chr> <chr> <int>
1 a 2005 yes 31
2 a 2006 no 51
3 a 2007 no 14
4 a 2008 no 67
5 a 2009 yes 42
6 b 2005 yes 43
7 b 2006 no 25
8 b 2007 no 90
9 b 2008 no 91
10 b 2009 yes 69
ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | all(fac == 'no')) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(value = sum(value))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 2
year value
<chr> <int>
1 2005 74
2 2006 76
3 2007 104
4 2008 158
5 2009 111

Try creating a flag to identify the yes occurrence and after that filter only the desired values. You would need to group by institution and year. Then, compute the number of yes values, which will be greater than or equal to one whenever the group contains a yes. With that you can flag the no values if there is some yes value inside the group. Finally, keep only the zero values of Flag and you will drop the rows as you expected. Here is the code:
library(dplyr)
#Code
newdf <- ext %>%
  group_by(institution, year) %>%
  mutate(NYes = length(fac[fac == 'yes']),
         Flag = ifelse(fac == 'no' & NYes >= 1, 1, 0)) %>%
  filter(Flag == 0) %>%
  select(-c(NYes, Flag))
Output:
# A tibble: 10 x 4
# Groups: institution, year [10]
institution year fac value
<chr> <chr> <chr> <int>
1 a 2005 yes 31
2 a 2006 no 51
3 a 2007 no 14
4 a 2008 no 67
5 a 2009 yes 42
6 b 2005 yes 43
7 b 2006 no 25
8 b 2007 no 90
9 b 2008 no 91
10 b 2009 yes 69
And the full code to summarise by year:
#Code 2
newdf <- ext %>%
  group_by(institution, year) %>%
  mutate(NYes = length(fac[fac == 'yes']),
         Flag = ifelse(fac == 'no' & NYes >= 1, 1, 0)) %>%
  filter(Flag == 0) %>%
  select(-c(NYes, Flag)) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(value = sum(value))
Output:
# A tibble: 5 x 2
year value
<chr> <int>
1 2005 74
2 2006 76
3 2007 104
4 2008 158
5 2009 111

An option with data.table
library(data.table)
setDT(ext)[ext[, .I[fac == 'yes'|all(fac == 'no')], .(institution, year)]$V1]
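If the yearly totals are wanted as well, a possible data.table continuation (a sketch following the same logic as the dplyr answers above):

library(data.table)
setDT(ext)
kept <- ext[ext[, .I[fac == 'yes' | all(fac == 'no')], .(institution, year)]$V1]
kept[, .(value = sum(value)), by = year]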

Related

R: Finding Out Which Students are Improving Their Grade

I am working with the R programming language.
Suppose I have the following dataset of student grades:
my_data = data.frame(id = c(1,1,1,1,1,2,2,2,3,3,3,3),
                     year = c(2010, 2011, 2012, 2013, 2014, 2008, 2009, 2010, 2018, 2019, 2020, 2021),
                     grade = c(55, 56, 61, 61, 62, 90, 89, 89, 67, 87, 51, 65))
> my_data
id year grade
1 1 2010 55
2 1 2011 56
3 1 2012 61
4 1 2013 61
5 1 2014 62
6 2 2008 90
7 2 2009 89
8 2 2010 89
9 3 2018 67
10 3 2019 87
11 3 2020 51
12 3 2021 65
My Question: I want to find out which students improved their grades (or kept the same grade) from year to year, and which students got worse grades from year to year.
Using the idea of "grouped window functions", I tried to write the following functions:
check_grades_improvement <- function(grades){
  for(i in 2:length(grades)){
    if(grades[i] < grades[i-1]){
      return(FALSE)
    }
  }
  return(TRUE)
}

check_grades_decline <- function(grades){
  for(i in 2:length(grades)){
    if(grades[i] > grades[i-1]){
      return(FALSE)
    }
  }
  return(TRUE)
}
Then, I tried to apply these functions to my dataset:
improving_students <- my_data %>%
  group_by(id) %>%
  filter(check_grades_improvement(grade)) %>%
  select(id) %>%
  unique()

worse_students <- my_data %>%
  group_by(id) %>%
  filter(check_grades_decline(grade)) %>%
  select(id) %>%
  unique()
But I am getting empty results
Can someone please show me what I am doing wrong and how I can fix this?
Thanks!
Something like this:
library(dplyr)
my_data %>%
  group_by(id) %>%
  mutate(x = grade - lag(grade, default = grade[1])) %>%
  mutate(performance = case_when(x == 0 ~ "kept_same",
                                 x > 0 ~ "improved",
                                 x < 0 ~ "got_worse",
                                 TRUE ~ NA_character_),
         .keep = "unused")
id year grade performance
<dbl> <dbl> <dbl> <chr>
1 1 2010 55 kept_same
2 1 2011 56 improved
3 1 2012 61 improved
4 1 2013 61 kept_same
5 1 2014 62 improved
6 2 2008 90 kept_same
7 2 2009 89 got_worse
8 2 2010 89 kept_same
9 3 2018 67 kept_same
10 3 2019 87 improved
11 3 2020 51 got_worse
12 3 2021 65 improved
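To see how often each student improved, kept the same grade, or got worse, one option (a sketch reusing the lag() difference from the answer above) is to count the labels per student:

library(dplyr)
my_data %>%
  group_by(id) %>%
  mutate(x = grade - lag(grade, default = grade[1])) %>%
  mutate(performance = case_when(x == 0 ~ "kept_same",
                                 x > 0 ~ "improved",
                                 x < 0 ~ "got_worse")) %>%
  count(performance)   # counts of each label within each id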
If we want to break out of the loop at the first instance:
check_grades_improvement <- function(grades){
  for(i in 2:length(grades)){
    if(grades[i] < grades[i-1]){
      return(FALSE)
      break
    }
  }
  return(TRUE)
}

check_grades_decline <- function(grades){
  for(i in 2:length(grades)){
    if(grades[i] > grades[i-1]){
      return(FALSE)
      break
    }
  }
  return(TRUE)
}
Testing:
my_data %>%
  group_by(id) %>%
  filter(check_grades_improvement(grade)) %>%
  ungroup() %>%
  select(id) %>%
  unique()
# A tibble: 1 × 1
id
<dbl>
1 1
my_data %>%
  group_by(id) %>%
  filter(check_grades_decline(grade)) %>%
  ungroup() %>%
  select(id) %>%
  unique()
# A tibble: 1 × 1
id
<dbl>
1 2
Or, if we want the students with at least one year-over-year increase (or decrease):
my_data %>%
  arrange(id, year) %>%
  group_by(id) %>%
  filter(c(FALSE, diff(grade) > 0)) %>%
  ungroup() %>%
  select(id) %>%
  unique()
# A tibble: 2 × 1
id
<dbl>
1 1
2 3
my_data %>%
  arrange(id, year) %>%
  group_by(id) %>%
  filter(c(FALSE, diff(grade) < 0)) %>%
  ungroup() %>%
  select(id) %>%
  unique()
# A tibble: 2 × 1
id
<dbl>
1 2
2 3
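The helper functions can also be replaced with a single vectorised condition (a sketch; all(diff(grade) >= 0) is TRUE exactly when the loop in check_grades_improvement never returns FALSE):

library(dplyr)
my_data %>%
  arrange(id, year) %>%
  group_by(id) %>%
  filter(all(diff(grade) >= 0)) %>%   # use all(diff(grade) <= 0) for decline
  ungroup() %>%
  distinct(id)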

How to compute cumulative and one specific column in R?

I have data about sales by year and by product, let's say like this:
Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale)
First I need to calculate a "Share" column, which represents the share of each product within each year. After that I compute the cumulative share (the original question showed the intermediate result as an image).
In the third step I need to identify the products that accumulate total sales up to 70% in the last year (2012 in this case), keep only these products in the whole dataframe, add a ranking column (based on the last year), and summarise all the remaining products as a category "other". The final dataframe should look like this (also shown as an image in the original question):
This is a fairly complex data wrangling task, but can be achieved using dplyr:
library(dplyr)
df %>%
  mutate(Sale = as.numeric(Sale)) %>%
  group_by(Year) %>%
  mutate(Share = 100 * Sale / sum(Sale),
         Year_order = order(order(-Share))) %>%
  arrange(Year, Year_order, .by_group = TRUE) %>%
  mutate(Cumm.Share = cumsum(Share)) %>%
  ungroup() %>%
  mutate(below_70 = Model %in% Model[Year == max(Year) & Cumm.Share < 70]) %>%
  mutate(Model = ifelse(below_70, Model, 'Other')) %>%
  group_by(Year, Model) %>%
  summarize(Sale = sum(Sale), Share = sum(Share), .groups = 'keep') %>%
  group_by(Year) %>%
  mutate(pseudoShare = ifelse(Model == 'Other', 0, Share)) %>%
  arrange(Year, -pseudoShare, .by_group = TRUE) %>%
  ungroup() %>%
  mutate(Rank = match(Model, Model[Year == max(Year)])) %>%
  select(-pseudoShare)
#> # A tibble: 9 x 5
#> Year Model Sale Share Rank
#> <dbl> <chr> <dbl> <dbl> <int>
#> 1 2010 a 30 19.4 2
#> 2 2010 c 23 14.8 1
#> 3 2010 Other 102 65.8 3
#> 4 2011 c 19 10.2 1
#> 5 2011 a 11 5.88 2
#> 6 2011 Other 157 84.0 3
#> 7 2012 c 89 44.7 1
#> 8 2012 a 33 16.6 2
#> 9 2012 Other 77 38.7 3
Note that in the output this code has kept groups a and c, rather than c and d, as in your expected output. This is because a and d have the same value in the final year (16.6), and therefore either could be chosen.
Created on 2022-04-21 by the reprex package (v2.0.1)
Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale, stringsAsFactors=F)
years <- unique(df$Year)
shares <- c()
cumshares <- c()
for (year in years){
  extract <- df[df$Year == year, ]
  sale <- as.numeric(extract$Sale)
  share <- 100 * sale / sum(sale)
  shares <- append(shares, share)
  cumshare <- rev(cumsum(rev(share)))
  cumshares <- append(cumshares, cumshare)
}
df$Share <- shares
df$Cumm.Share <- cumshares
df
gives
> df
Year Model Sale Share Cumm.Share
1 2010 a 30 19.354839 100.000000
2 2010 b 45 29.032258 80.645161
3 2010 c 23 14.838710 51.612903
4 2010 d 33 21.290323 36.774194
5 2010 e 24 15.483871 15.483871
6 2011 a 11 5.882353 100.000000
7 2011 b 56 29.946524 94.117647
8 2011 c 19 10.160428 64.171123
9 2011 d 45 24.064171 54.010695
10 2011 e 56 29.946524 29.946524
11 2012 a 33 16.582915 100.000000
12 2012 b 32 16.080402 83.417085
13 2012 c 89 44.723618 67.336683
14 2012 d 33 16.582915 22.613065
15 2012 e 12 6.030151 6.030151
I don't understand what you mean by step 3, how do you decide which products to keep?
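For what it is worth, the first dplyr answer above reads "step 3" as: keep the models whose cumulative share in the last year stays under 70%, accumulating from the largest share downwards. Pulled out on its own, that check looks roughly like this (a sketch):

library(dplyr)
last_year <- df %>%
  mutate(Sale = as.numeric(Sale)) %>%
  filter(Year == max(Year)) %>%
  mutate(Share = 100 * Sale / sum(Sale)) %>%
  arrange(desc(Share)) %>%
  mutate(Cumm.Share = cumsum(Share))
keep_models <- last_year$Model[last_year$Cumm.Share < 70]
keep_models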

How do I create a new factor level that summarizes total values of other factor levels?

I have a dataset with at least three columns:
year sex value
1 2019 M 10
2 2019 F 20
3 2020 M 50
4 2020 F 20
I would like to group by the first column, year, and then add another level to sex that corresponds to the total value in column 3, that is, I would like something like this:
year sex value
<int> <chr> <dbl>
1 2019 M 10
2 2019 F 20
3 2019 Total 30
4 2020 M 50
5 2020 F 20
6 2020 Total 70
Any help is appreciated, especially in dplyr.
Here is just another way of doing this:
library(dplyr)
library(purrr)
df %>%
  group_split(year) %>%
  map_dfr(~ add_row(.x, year = first(.x$year), sex = "Total", value = sum(.x$value)))
# A tibble: 6 x 3
year sex value
<int> <chr> <dbl>
1 2019 M 10
2 2019 F 20
3 2019 Total 30
4 2020 M 50
5 2020 F 20
6 2020 Total 70
You can summarise the data for each year and bind it to the original dataset.
library(dplyr)
df %>%
  group_by(year) %>%
  summarise(sex = 'Total',
            value = sum(value)) %>%
  bind_rows(df) %>%
  arrange(year, sex)
# year sex value
# <int> <chr> <dbl>
#1 2019 F 20
#2 2019 M 10
#3 2019 Total 30
#4 2020 F 20
#5 2020 M 50
#6 2020 Total 70
Or in base R -
aggregate(value~year, df, sum) |>
transform(sex = 'Total') |>
rbind(df)
data
df <- data.frame(year = rep(2019:2020, each = 2),
sex = c('M', 'F'), value = c(10, 20, 50, 20))
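Since the question asks for a new factor level: in the answers above sex stays a character vector, and "Total" only ends up last because it happens to sort after "F" and "M" alphabetically. If you need it to sort last regardless of the other labels, one sketch using a plain base factor():

library(dplyr)
df %>%
  group_by(year) %>%
  summarise(sex = 'Total', value = sum(value)) %>%
  bind_rows(df) %>%
  mutate(sex = factor(sex, levels = c(setdiff(sort(unique(sex)), 'Total'), 'Total'))) %>%
  arrange(year, sex)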

Select sample from a grouping variable depending on another grouping in R

I have the following data frame with 1,000 rows: 10 Cities, each having 100 rows. I would like to randomly select 10 names per City, by Year, and the 10 sampled names should come from more than one of the years in the City, i.e. the 10 names for City 1 should not come from only 1996, for instance.
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
...
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
Desired Final Sample Data
City Year name
1 1 1996 b
2 1 1998 c
3 1 2001 d
...
11 2 1997 g
12 2 1999 h
13 2 2005 b
...
21 3 1998 a
22 3 2010 c
23 3 2005 d
Sample Data
df1 <- data.frame(City = rep(1:10, each = 100),
                  Year = rep(1996:2015, each = 5),
                  name = rep(letters[1:25], 40))
I am failing to randomly select the 10 sample names by Year (without repeating years, unless the number of Years in a city is less than 10) for all 10 Cities. How can I get around this?
The final sample should have 10 names for each city, and years should not repeat unless there are fewer than 10 years in that city.
Thank you.
First group by City and use sample_n to sample a sub-dataframe.
Then group by City and Year, and sample one element of name per group. Don't forget to set the RNG seed in order to make the result reproducible.
library(dplyr)
set.seed(2020)
df1 %>%
  group_by(City) %>%
  sample_n(min(n(), 10)) %>%
  ungroup() %>%
  group_by(City, Year) %>%
  summarise(name = sample(name, 1))
#`summarise()` regrouping output by 'City' (override with `.groups` argument)
## A tibble: 4 x 3
## Groups: City [2]
# City Year name
# <int> <int> <chr>
#1 1 1996 b
#2 1 1997 e
#3 2 1996 f
#4 2 1997 h
Data
df1 <- read.table(text = "
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
", header = TRUE)
Edit
Instead of reinventing the wheel, use the sampling package's strata() function to get an index into the data set and then filter the corresponding rows.
library(dplyr)
library(sampling)
set.seed(2020)
df1 %>%
  mutate(row = row_number()) %>%
  filter(row %in% strata(df1, stratanames = c('City', 'Year'),
                         size = rep(1, 1000), method = 'srswor')$ID_unit) %>%
  select(-row) %>%
  group_by(City) %>%
  sample_n(10) %>%
  arrange(City, Year)
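A dplyr-only alternative (my own sketch, not from the answers above): sample one row per City/Year first, then keep at most 10 of those years per City. slice_sample() returns the whole group when it has fewer rows than asked for, which covers cities with fewer than 10 years.

library(dplyr)
set.seed(2020)
df1 %>%
  group_by(City, Year) %>%
  slice_sample(n = 1) %>%      # one random name per year within each city
  group_by(City) %>%
  slice_sample(n = 10) %>%     # then at most 10 distinct years per city
  arrange(City, Year) %>%
  ungroup()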

R: How to run a custom function after group_by using group_map?

I have a dataframe as below:
dt <- data.frame(year = sample(c(2000:2019), 100, replace = T),
                 month = sample(c(1:12), 100, replace = T),
                 paitent_ID = sample(c(1:50), 100, replace = T),
                 state = sample(c(1:10), 100, replace = T))
and I need to apply the function below to this dataset after grouping and sorting:
newState <- function(dt){
  dt["new"] = dt[0, "state"] * 3
  dt
}
So, this function is supposed to add a new column called new to each group.
Here is the group_by:
library(dplyr)
dt %>%
  group_by(paitent_ID) %>%
  group_map(~ .x %>%
              arrange(year, month)) %>%
  group_map(~ .x %>%
              newState())
When I run the code, it complains with:
Error in UseMethod("group_split") :
no applicable method for 'group_split' applied to an object of class "list"
As @André Oliveira mentions in the comments, it is recommended to use mutate for adding a column. However, it is possible to do so with group_modify after making some small changes to your function (note that R indexing starts at 1, so dt[0, "state"] selects nothing, which is why the version below uses dt[1, "state"]).
newState <- function(dt, groupvars){
  dt["new"] = dt[1, "state"] * 3
  dt
}
dt %>%
  group_by(paitent_ID) %>%
  arrange(year, month) %>%
  group_modify(newState) %>%
  ungroup()
# # A tibble: 100 x 5
# paitent_ID year month state new
# <int> <int> <int> <int> <dbl>
# 1 1 2006 5 3 9
# 2 2 2012 12 3 9
# 3 3 2013 11 8 24
# 4 3 2014 10 1 24
# 5 3 2019 5 6 24
# 6 4 2006 7 5 15
# 7 4 2006 7 2 15
# 8 5 2003 8 8 24
# 9 7 2015 12 2 6
# 10 7 2017 8 10 6
And a more conventional approach
dt %>%
  group_by(paitent_ID) %>%
  arrange(year, month) %>%
  mutate(new = state[1] * 3)
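As for why the original chain errors: group_map() returns a plain list of data frames, so the second group_map() has no grouped data frame left to split. If you want to stay with the list-based style, one sketch is to do the arranging and the new column inside a single group_map() call and bind the pieces back together:

library(dplyr)
dt %>%
  group_by(paitent_ID) %>%
  group_map(~ .x %>%
              arrange(year, month) %>%
              mutate(new = state[1] * 3),
            .keep = TRUE) %>%    # .keep = TRUE retains the paitent_ID column
  bind_rows()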
