User function with optional grouping argument and if/else using piping in R

I recently started writing my own functions to speed up standard, repetitive tasks when analyzing data with R.
At the moment I'm working on a function with three arguments and have run into a problem I have not been able to solve. I would like to have an optional grouping argument. The function should check whether a grouping argument was supplied and then continue with either subfunction 1 or subfunction 2.
However, I always get the error "object not found" if the grouping argument is not NA. How can I do this?
Edit: In my case the filter is usually used to select certain valid or invalid years. If there is a grouping argument, more steps follow in the pipe than if there is none.
require(tidyverse)
Data <- mpg
userfunction <- function(DF,Filter,Group) {
  without_group <- function(DF) {
    DF %>%
      count(year)
  }
  with_group <- function(DF) {
    DF %>%
      group_by({{Group}}) %>%
      count(year) %>%
      pivot_wider(names_from=year, values_from=n) %>%
      ungroup() %>%
      mutate(across(.cols=2:ncol(.),.fns=~replace_na(.x, 0))) %>%
      mutate(Mittelwert=round(rowMeans(.[,2:ncol(.)],na.rm=TRUE),2))
  }
  Obj <- DF %>%
    ungroup() %>%
    {if(Filter!=FALSE) filter(.,eval(rlang::parse_expr(Filter))) else filter(.,.$year==.$year)} %>%
    {if(is.na(Group)) without_group(.) else with_group(.)}
  return(Obj)
}
For NA it already works:
> Data %>%
+ userfunction(FALSE,NA)
# A tibble: 2 x 2
year n
<int> <int>
1 1999 117
2 2008 117
With argument it does not work:
> Data %>%
+ userfunction(FALSE,manufacturer)
Error in DF %>% ungroup() %>% { : object 'manufacturer' not found
Edit:
What I would expect from the above function would be the following output:
> Data %>% userfunction_exp(FALSE,manufacturer)
# A tibble: 15 x 4
manufacturer `1999` `2008` Mittelwert
<chr> <dbl> <dbl> <dbl>
1 audi 9 9 9
2 chevrolet 7 12 9.5
3 dodge 16 21 18.5
4 ford 15 10 12.5
5 honda 5 4 4.5
6 hyundai 6 8 7
7 jeep 2 6 4
8 land rover 2 2 2
9 lincoln 2 1 1.5
10 mercury 2 2 2
11 nissan 6 7 6.5
12 pontiac 3 2 2.5
13 subaru 6 8 7
14 toyota 20 14 17
15 volkswagen 16 11 13.5
Data %>% userfunction_exp("cyl==4",manufacturer)
# A tibble: 9 x 4
manufacturer `1999` `2008` mean
<chr> <dbl> <dbl> <dbl>
1 audi 4 4 4
2 chevrolet 1 1 1
3 dodge 1 0 0.5
4 honda 5 4 4.5
5 hyundai 4 4 4
6 nissan 2 2 2
7 subaru 6 8 7
8 toyota 11 7 9
9 volkswagen 11 6 8.5
2021-04-01 14:55: edited to add some information and add some steps to the pipe for function with_group.

Hi, this is a good question!
There are multiple ways to achieve this, as the previous answers pointed out. One way to do it in the tidyverse is tidy evaluation.
I have omitted your filter step (which you could explain in more detail):
my_summary <- function(df, grouping_var) {
  grp_var <- enquo(grouping_var) #capture group variable
  df %>% my_group_by(grp_var)
}
my_group_by <- function(df, grouping_var){
  # Check if group is supplied
  if(rlang::quo_is_missing(grouping_var)) {
    df %>% without_group()
  } else {
    df %>% with_group(grouping_var)
  }
}
without_group <- function(df) {
  # do whatever without group
  df %>%
    count(year)
}
with_group <- function(df, grouping_var) {
  # do whatever with group
  df %>%
    group_by(!!grouping_var) %>% #Note the !!
    count(year) %>%
    pivot_wider(names_from=year, values_from=n)
}
Without a grouping argument, this gives:
> mpg %>% my_summary()
# A tibble: 2 x 2
year n
<int> <int>
1 1999 117
2 2008 117
With a grouping column passed in the pipe:
> mpg %>% my_summary(model)
# A tibble: 38 x 3
# Groups: model [38]
model `1999` `2008`
<chr> <int> <int>
1 4runner 4wd 4 2
2 a4 4 3
3 a4 quattro 4 4
4 a6 quattro 1 2
5 altima 2 4
6 c1500 suburban 2wd 1 4
7 camry 4 3
8 camry solara 4 3
9 caravan 2wd 6 5
10 civic 5 4
# ... with 28 more rows
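As a side note on why the attempt in the question fails: `is.na(Group)` evaluates the argument, so R looks for an object called `manufacturer` outside the data frame, whereas `enquo()` only captures what was typed. A minimal sketch of the difference follows. The helper names `check_forced` and `check_quoted` are made up for illustration, and it is assumed that no object named `manufacturer` exists in your workspace.

```r
library(rlang)

# Hypothetical helper: is.na() forces evaluation of the argument
check_forced <- function(Group) is.na(Group)

# Hypothetical helper: enquo() captures the argument without evaluating it
check_quoted <- function(Group) quo_is_missing(enquo(Group))

no_arg   <- check_quoted()              # nothing supplied
with_arg <- check_quoted(manufacturer)  # bare column name is only captured

# Evaluating the bare name fails because `manufacturer` is not an object
msg <- tryCatch(
  check_forced(manufacturer),
  error = function(e) conditionMessage(e)
)
print(list(no_arg = no_arg, with_arg = with_arg, error = msg))
```

Here `no_arg` is TRUE and `with_arg` is FALSE, while the forced version stops with the same kind of "object not found" error shown in the question.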

I'm not sure what the Filter argument is for, so I'll leave it as it is for now.
group_by(A) %>% count(B) is the same as count(A, B), so you can change your function to:
library(tidyverse)
userfunction <- function(DF,Filter,Group = NULL) {
  DF %>%
    count(year, {{Group}}) %>%
    pivot_wider(names_from=year, values_from=n)
}
Data %>% userfunction(FALSE)
# `1999` `2008`
# <int> <int>
#1 117 117
Data %>% userfunction(FALSE,manufacturer)
# A tibble: 15 x 3
# manufacturer `1999` `2008`
# <chr> <int> <int>
# 1 audi 9 9
# 2 chevrolet 7 12
# 3 dodge 16 21
# 4 ford 15 10
# 5 honda 5 4
# 6 hyundai 6 8
# 7 jeep 2 6
# 8 land rover 2 2
# 9 lincoln 2 1
#10 mercury 2 2
#11 nissan 6 7
#12 pontiac 3 2
#13 subaru 6 8
#14 toyota 20 14
#15 volkswagen 16 11
Note that I have set the default value of Group to NULL, so when you don't supply it the argument is simply ignored.
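The two answers leave out the filter step and the row mean from the question. One possible way to put the pieces together is sketched below. The name `userfunction_sketch` is hypothetical, and it assumes that `Filter` is either FALSE (no filtering) or a string holding a filter expression such as "cyl==4", as in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical combined version; not the code from either answer
userfunction_sketch <- function(DF, Filter = FALSE, Group = NULL) {
  grp <- rlang::enquo(Group)   # capture the column name without evaluating it
  out <- ungroup(DF)

  # Assumption: a character Filter is an expression to parse and apply
  if (is.character(Filter)) {
    out <- filter(out, !!rlang::parse_expr(Filter))
  }

  # No grouping column supplied: plain count per year
  if (rlang::quo_is_null(grp)) {
    return(count(out, year))
  }

  # Grouping column supplied: one row per group, one column per year
  wide <- out %>%
    count(!!grp, year) %>%
    pivot_wider(names_from = year, values_from = n, values_fill = 0)

  # Row mean over the year columns (everything except the group column)
  wide$Mittelwert <- round(rowMeans(wide[-1]), 2)
  wide
}

# Usage with the mpg data from ggplot2
userfunction_sketch(ggplot2::mpg)
userfunction_sketch(ggplot2::mpg, FALSE, manufacturer)
userfunction_sketch(ggplot2::mpg, "cyl==4", manufacturer)
```

Using `values_fill = 0` replaces the separate `replace_na()` step, at the cost of keeping the year columns as integers rather than doubles.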

Related

R: Turning row data from one dataframe into column data by group in another

I have data in the following format:
ID  Age  Sex
1   29   M
2   32   F
3   18   F
4   89   M
5   45   M
and:
ID  subID  Type        Status  Year
1   3      Car         Y
1   11     Toyota      NULL    2011
1   23     Kia         NULL    2009
2   5      Car         N
3   2      Car         Y
3   4      Honda       NULL    2019
3   7      Fiat        NULL    2006
3   8      Mitsubishi  NULL    2020
4   1      Car         N
5   7      Car         Y
Each ID in the second table has a row specifying if they have a car, and additional rows stating the brand of car/s they own. Each person has a maximum of 3 cars. I want to simplify this data into a single table as so.
ID  Age  Sex  Car?  Car.1   Car1.year  Car.2  Car2.year  Car.3       Car3.year
1   29   M    Y     Toyota  2011       Kia    2009       NULL        NULL
2   32   F    N     NULL    NULL       NULL   NULL       NULL        NULL
3   18   F    Y     Honda   2019       Fiat   2006       Mitsubishi  2020
4   89   M    N     NULL    NULL       NULL   NULL       NULL        NULL
5   45   M    Y     NULL    NULL       NULL   NULL       NULL        NULL
I've tried using the mutate function in dplyr with the case_when function, but I can't check conditions in another dataframe. If I try to join the tables together, I would have multiple rows for each ID which I want to avoid. The non-standard set up of the second table makes things complicated. My only remaining idea is to switch to Python/Pandas and create a for loop that slowly loops through each ID, searches the second dataframe if the person has a car and the car brands, then mutates a column in the first dataframe. But given the size of my dataset, this would be inefficient and take a long time.
What is the best way to do this?
You can try the following code:
library(tidyverse)
df1
# A tibble: 5 x 3
ID Age Sex
<dbl> <dbl> <chr>
1 1 29 M
2 2 32 F
3 3 18 F
4 4 89 M
5 5 45 M
df2
# A tibble: 10 x 5
ID subID Type Status Year
<dbl> <dbl> <chr> <chr> <dbl>
1 1 3 Car Y NA
2 1 11 Toyota Y 2011
3 1 23 Kia Y 2009
4 2 5 Car N NA
5 3 2 Car Y NA
6 3 4 Honda Y 2019
7 3 7 Fiat Y 2006
8 3 8 Mitsubishi Y 2020
9 4 1 Clothed N NA
10 5 7 Clothed Y NA
df2 <- df2 %>% mutate(Status = if_else(Status == "NULL", "Y", Status))
df3 <- df2 %>% filter(!is.na(Year)) %>% group_by(ID) %>% mutate(index = row_number())
df4 <- df3 %>% pivot_wider(id_cols = c(ID), values_from = c(Type, Year), names_from = index )
So your desired output will be produced:
df1 %>% left_join(df2 %>% select(ID, Status) %>% distinct()) %>% left_join(df4)
# A tibble: 5 x 10
ID Age Sex Status Type_1 Type_2 Type_3 Year_1 Year_2 Year_3
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 29 M Y Toyota Kia NA 2011 2009 NA
2 2 32 F N NA NA NA NA NA NA
3 3 18 F Y Honda Fiat Mitsubishi 2019 2006 2020
4 4 89 M N NA NA NA NA NA NA
5 5 45 M Y NA NA NA NA NA NA
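For a self-contained run, the two inputs can be rebuilt from the tables in the question. The sketch below is a variation on the approach above, not the same code: it assumes the NULL entries are stored as NA, uses the made-up names `people` and `cars` for the two tables, and separates the "has a car" row from the brand rows by the Type column instead of by missing years.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical reconstruction of the two tables; NULL is assumed to be NA
people <- tribble(
  ~ID, ~Age, ~Sex,
  1, 29, "M",
  2, 32, "F",
  3, 18, "F",
  4, 89, "M",
  5, 45, "M"
)
cars <- tribble(
  ~ID, ~subID, ~Type,        ~Status, ~Year,
  1,   3,      "Car",        "Y",     NA,
  1,   11,     "Toyota",     NA,      2011,
  1,   23,     "Kia",        NA,      2009,
  2,   5,      "Car",        "N",     NA,
  3,   2,      "Car",        "Y",     NA,
  3,   4,      "Honda",      NA,      2019,
  3,   7,      "Fiat",       NA,      2006,
  3,   8,      "Mitsubishi", NA,      2020,
  4,   1,      "Car",        "N",     NA,
  5,   7,      "Car",        "Y",     NA
)

# One row per person saying whether they own a car
has_car <- cars %>%
  filter(Type == "Car") %>%
  select(ID, `Car?` = Status)

# Brand rows, numbered within each person and spread into columns
brands <- cars %>%
  filter(Type != "Car") %>%
  group_by(ID) %>%
  mutate(idx = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = ID, names_from = idx, values_from = c(Type, Year))

result <- people %>%
  left_join(has_car, by = "ID") %>%
  left_join(brands, by = "ID")
print(result)
```

Because both joins are on a table with at most one row per ID, the result keeps exactly one row per person.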

Sum up with the next line into a new column

I'm having some trouble figuring out how to create a new column with the sum of 2 subsequent cells.
I have :
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now I want a new column where the first line is the sum 1+2, the second line is the sum 1+2+3, the third line is the sum 1+2+3+4, and so on.
The values 1, 2, 3, 4... are hypothetical. I need to measure the absolute growth from one decade to the next so that I can later create a new variable for the percentage change from one decade to the next.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
This assumes the last row is to be repeated; otherwise the fill step can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
> library(data.table)
> setDT(df1)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea:
transform(
  df1,
  newCol = c(cumsum(Values)[-1],sum(Values))
)
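The question also mentions a later step: the absolute growth and the percentage change from one decade to the next. A minimal sketch of that step with `lag()` is shown below. The column names `abs_change` and `pct_change` are made up, and it is an assumption that the change should be computed on `Values` itself rather than on the running total.

```r
library(tibble)
library(dplyr, warn.conflicts = FALSE)

df1 <- tibble(Years = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
              Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

growth <- df1 %>%
  mutate(
    abs_change = Values - lag(Values),             # difference to the previous row
    pct_change = 100 * abs_change / lag(Values)    # relative to the previous row
  )
print(growth)
```

The first row has no predecessor, so both new columns are NA there.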

dplyr: keep empty levels of factor but not empty levels of a combination of factors that don't appear in data

When grouping and summarising with dplyr, what is the correct way to keep empty levels of each grouping factor but not keep empty combinations from multiple grouping factors?
As an example, consider data recorded at different times at multiple sites. I might filter and then calculate something for each year in each site. I'd like to have the default value of the summary on an empty vector if the filter removes a year completely. So site "a" has 10 years and site "b" has 1 year so I'd always like 11 rows in the summary.
If I use .drop = TRUE in group_by I lose years:
library(dplyr)
library(zoo)
library(lubridate)
set.seed(1)
df <- data.frame(site = factor(c(rep("a", 120), rep("b", 12))),
date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
value = rnorm(132, 50, 10))
df$year <- factor(lubridate::year(df$date))
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = TRUE) %>%
summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups: site [1]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2004 2004-08-01
#> 3 a 2005 2005-01-01
#> 4 a 2007 2007-11-01
#> 5 a 2008 2008-10-01
#> 6 a 2009 2009-02-01
and with .drop = FALSE I gain all the extra years for site "b" which were not in the original data:
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = FALSE) %>%
summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
#> 12 b 2001 NA
#> 13 b 2002 NA
#> 14 b 2003 NA
#> 15 b 2004 NA
#> 16 b 2005 NA
#> 17 b 2006 NA
#> 18 b 2007 NA
#> 19 b 2008 NA
#> 20 b 2009 NA
The best way I could think of was to calculate counts, then merge, then filter, and then drop the count variable, but that's pretty messy.
I know the .drop argument was only recently added to dplyr, and it is very useful for one factor, but is there a clean way yet to do this for multiple factors?
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = FALSE) %>%
summarise(f = first(date)) %>%
left_join(df %>% count(site, year, .drop = FALSE), by = c("site", "year")) %>%
filter(n > 0) %>%
select(-n)
#> # A tibble: 11 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
Not sure if this is what you want.
If you replace the dates where value < 65 with NA instead of filtering those rows out, you can proceed as usual.
df %>%
mutate(date = replace(date, value < 65, NA)) %>%
group_by(site, year) %>%
summarise(f = first(date[!is.na(date)]))
# A tibble: 11 x 3
# Groups: site [2]
site year f
<fct> <fct> <date>
1 a 2000 NA
2 a 2001 NA
3 a 2002 2002-03-01
4 a 2003 NA
5 a 2004 NA
6 a 2005 NA
7 a 2006 2006-02-01
8 a 2007 NA
9 a 2008 2008-07-01
10 a 2009 2009-02-01
11 b 2000 2000-08-01
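Another option, sketched here as an assumption rather than an established idiom, is to take the site/year combinations that occur in the unfiltered data and join the summary onto them. The name `observed` is made up, and `format(date, "%Y")` is used in place of `lubridate::year()` only to keep the example free of extra packages.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)
df <- data.frame(
  site  = factor(c(rep("a", 120), rep("b", 12))),
  date  = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120),
            seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
  value = rnorm(132, 50, 10)
)
df$year <- factor(format(df$date, "%Y"))

# Combinations of site and year that actually occur before filtering
observed <- distinct(df, site, year)

res <- df %>%
  filter(value > 65) %>%
  group_by(site, year) %>%
  summarise(f = first(date), .groups = "drop") %>%
  right_join(observed, by = c("site", "year")) %>%  # restore dropped combinations
  arrange(site, year)
print(res)
```

Combinations removed by the filter come back with NA in `f`, and combinations that never existed in the data (site "b" after 2000) are not created.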

data frame selecting top by grouping

I have a data frame such as:
set.seed(1)
df <- data.frame(
sample = 1:50,
value = runif(50),
group = c(rep(NA, 20), gl(3, 10)))
I want to select the top 10 samples based on value. However, if there is a group corresponding to the sample, I only want to include one sample from that group. If group == NA, I want to include all of them. Arranging df by value looks like:
df_top <- df %>%
arrange(-value) %>%
top_n(10, value)
sample value group
1 46 0.7973088 3
2 49 0.8108702 3
3 22 0.8394404 1
4 2 0.8612095 NA
5 27 0.8643395 1
6 20 0.8753213 NA
7 44 0.8762692 3
8 26 0.8921983 1
9 11 0.9128759 NA
10 30 0.9606180 1
I would want to include samples 36, 22, 2, 20, 11, and the next five highest values in my data frame that continue to fit the pattern. How do I accomplish this?
I think I figured this out. Would this be the best way:
df_top <- df %>%
arrange(-value) %>%
group_by(group) %>%
filter(ifelse(!is.na(group), value == max(value), value == value)) %>%
ungroup() %>%
top_n(10, value)
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 18 0.992 NA
2 7 0.945 NA
3 21 0.935 1
4 4 0.908 NA
5 6 0.898 NA
6 35 0.827 2
7 41 0.821 3
8 20 0.777 NA
9 15 0.770 NA
10 17 0.718 NA
Similar method that uses slice instead of filter:
library(dplyr)
df_top <- df %>%
arrange(-value) %>%
group_by(group) %>%
slice(if(any(!is.na(group))) 1 else 1:n()) %>%
ungroup() %>%
top_n(10, value)
Result:
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 21 0.9347052 1
2 35 0.8273733 2
3 41 0.8209463 3
4 18 0.9919061 NA
5 7 0.9446753 NA
6 4 0.9082078 NA
7 6 0.8983897 NA
8 20 0.7774452 NA
9 15 0.7698414 NA
10 17 0.7176185 NA
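Since `top_n()` has been superseded in dplyr 1.0.0 and later, the same idea can be written with `slice_max()`. This is a sketch under the assumption that ties in `value` do not matter, which holds for the `runif()` data here.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)
df <- data.frame(
  sample = 1:50,
  value  = runif(50),
  group  = c(rep(NA, 20), gl(3, 10))
)

df_top <- df %>%
  group_by(group) %>%
  # keep every ungrouped sample, but only the best sample of each real group
  filter(is.na(group) | value == max(value)) %>%
  ungroup() %>%
  slice_max(value, n = 10)
print(df_top)
```

The rows come back sorted by `value` in descending order, so no separate `arrange()` is needed.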

how to replace missing values with previous year's binned mean

I have a data frame as shown below.
Bin_p1 and Bin_f1 were calculated with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how can I calculate the mean of the previous two years within a bin and use it to replace the NA values in that bin?
For example, in row 5 the value of p1 is NA and f1 is 30, with corresponding bin 7. The NA should be replaced with the mean of the previous two years for the same bin (7). Below, df is the input and df1 is the desired result:
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
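A possibly more readable variant of the `mean(rev(lag(...))[1:2])` step is a small helper that averages the last two known values in the group. The helper name `fill_with_last_two` is made up, and the sketch differs slightly from the code above: it uses the last two non-missing values of the group wherever they sit, and returns NaN for a group with no known values. The data frame is typed in from the table in the question.

```r
library(dplyr, warn.conflicts = FALSE)

# Example data copied from the question, including the bin columns as given
df <- data.frame(
  ID     = 1:6,
  year   = c(2013, 2013, 2014, 2014, 2015, 2016),
  p1     = c(20, 24, 10, 11, NA, 10),
  f1     = c(30, 29, 16, 17, 30, NA),
  Bin_p1 = c(5, 5, 2, 2, NA, 2),
  Bin_f1 = c(7, 7, 3, 3, 7, NA)
)

# Hypothetical helper: replace NA with the mean of the last two known values
fill_with_last_two <- function(x) {
  known <- x[!is.na(x)]
  ifelse(is.na(x), mean(tail(known, 2)), x)
}

filled <- df %>%
  arrange(year) %>%
  group_by(Bin_p1) %>%
  mutate(f1 = fill_with_last_two(f1)) %>%   # fill f1 within the bins of p1
  group_by(Bin_f1) %>%
  mutate(p1 = fill_with_last_two(p1)) %>%   # fill p1 within the bins of f1
  ungroup()
print(filled)
```

For the example this gives 22 for p1 in row 5 and 16.5 for f1 in row 6, matching the desired table.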
