Calculate mode for each column in dataframe using lapply dplyr

I'm trying to create a function that gets me the mode, or mode-X (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I'm missing and I'm looking for some assistance. I believe it has to do with passing a variable into a dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))

There are some problems with your function:
Your function call is not doing what you think. Check with mtcars %>% lapply(. %>% (function(x) print(x))) that x is actually the whole column of mtcars, not its name. To get the column names, apply the function to names(mtcars). But then you also have to specify the data frame you're working on.
To evaluate the symbol returned by rlang::sym(x), you need to put !! in front of it.
rank is not a variable name, so there is no need for rlang::sym here.
table should be mytable in the second-to-last statement of your function.
Here is one way it could work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
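Since lapply() over a data frame already hands each column to the function as a vector, another option is to count the values of that vector directly. The following is only a sketch under that assumption: get_mode_from_vector is a hypothetical name, it uses base R's table() rather than dplyr::count(), it returns the value as a character string, and it does nothing special about ties.

```r
# Hypothetical helper: x is a column vector, not a column name
get_mode_from_vector <- function(x, rank = 1) {
  # Tabulate the values and put the most frequent first
  counts <- sort(table(x), decreasing = TRUE)
  # Keep only the value at the requested rank
  # (1 = mode, 2 = second most common, ...)
  data.frame(
    variable = names(counts)[rank],
    counts = as.integer(counts[rank])
  )
}

# One small data frame per column of mtcars
second_most_common <- lapply(mtcars, get_mode_from_vector, rank = 2)
second_most_common$cyl
```

If rank is larger than the number of distinct values in a column, this sketch returns NA for both fields rather than raising an error.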

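If we need this in a list, we can use purrr::imap, which passes each column's name to the formula as .y. Note that slice(seq_len(rank)) keeps the top rank rows rather than only the rank-th one.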
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)

Related

Passing argument into function for group_by in dplyr [duplicate]

This question already has answers here:
How to pass column name as argument to function for dplyr verbs?
I am trying to use group_by within a function call in dplyr (R) and I am getting unexpected results. Here is an example of what I am trying to do:
df = data.frame(a = c(0,0,1,1), b = c(0,1,0,1), c = c(1,2,3,4))
result1 = df %>%
group_by(a,b) %>%
mutate(d = sum(c))
result1$d
myFunc <- function(df, var) {
output = df %>%
group_by(a,!!var) %>%
mutate(d = sum(c))
return(output)
}
result2 = myFunc(df,"b")
result2$d
result1$d yields [1,2,3,4] which is what I expected. result2$d yields [3,3,7,7] which I do not want, and I am not sure what is going on.
It works to have b (without quotes) as the function argument, and {{var}} in place of !!var. Unfortunately, in my case, my column names are in string format (but maybe there is a way to transform the string beforehand so that it will work with the {{}} notation?)

If you want to pass a character object that can refer to a certain column of a data frame, you should use !!sym(var):
myFunc <- function(df, var) {
output = df %>%
group_by(a, !!sym(var)) %>%
mutate(d = sum(c))
return(output)
}
myFunc(df, "b")
If you want to pass a data-masked argument, you should use {{ var }} or equivalently !!enquo(var):
myFunc <- function(df, var) {
output = df %>%
group_by(a, {{ var }}) %>%
mutate(d = sum(c))
return(output)
}
myFunc(df, b)
Note that I pass "b" and b respectively into the function in the two different cases.
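As for what is going on in the question's version: with var holding the string "b", !!var inserts the string itself, so group_by() groups by a constant instead of by column b, which leaves a as the only effective grouping. A small sketch of the difference, reusing df from the question (by_string and by_symbol are names made up for this illustration):

```r
library(dplyr)

df <- data.frame(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1), c = c(1, 2, 3, 4))
var <- "b"

# !!var splices in the string "b", which is a constant for every row,
# so the rows are effectively grouped by a alone
by_string <- df %>%
  group_by(a, !!var) %>%
  mutate(d = sum(c))

# !!sym(var) splices in the symbol b, which refers to the column
by_symbol <- df %>%
  group_by(a, !!rlang::sym(var)) %>%
  mutate(d = sum(c))

by_string$d  # sums within a only
by_symbol$d  # sums within each (a, b) pair
```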

If we want to use quoting and unquoting instead of curly-curly {{ }}, then we should follow this basic procedure: https://tidyeval.tidyverse.org/dplyr.html
Creating a function around dplyr pipelines involves three steps: abstraction, quoting, and unquoting.
1. Abstraction step:
Here we identify the parts that vary. In our case that is var in group_by().
2. Quoting step:
Identify all the arguments where the user is allowed to refer to data frame columns directly.
The function can’t evaluate these arguments right away.
Instead they should be automatically quoted. Apply enquo() to these arguments.
3. Unquoting step:
Identify where these variables are passed to other quoting functions and unquote with !!.
In this case we pass var to group_by():
myFunc <- function(df, var) {
var <- enquo(var)
output = df %>%
group_by(a,!!var) %>%
mutate(d = sum(c))
return(output)
}
result2 = myFunc(df,b)
result2$d
output:
[1] 1 2 3 4

Just as I post a question, I come across something that works...
myFunc <- function(df, var) {
output = df %>%
group_by_at(.vars = c("a",var)) %>%
mutate(d = sum(c))
return(output)
}
result2 = myFunc(df,"b")
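group_by_at() has since been superseded in dplyr. Assuming dplyr 1.0.0 or later, the same string-based grouping can be sketched with across() and all_of(). myFunc2, group_cols and result3 are hypothetical names, chosen to avoid clashing with the versions above.

```r
library(dplyr)

df <- data.frame(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1), c = c(1, 2, 3, 4))

myFunc2 <- function(df, var) {
  # Collect the grouping columns as plain strings first
  group_cols <- c("a", var)
  df %>%
    # all_of() selects exactly the columns named in the character vector
    group_by(across(all_of(group_cols))) %>%
    mutate(d = sum(c))
}

result3 <- myFunc2(df, "b")
result3$d
```

all_of() raises an error if one of the named columns is missing, which tends to surface a mistyped column name early.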

how to use a variable name within dplyr::lead/lag function

I have a tibble in which I want to lag/lead various columns and check their correlations.
Currently, for every column name, I have to have a separate function to do the lead/lag and correlation function.
Is there a way in which I could pass the column name as a variable and then use that variable with lag/lead?
#This is what I have tried unsuccessfully so far
library(janitor)
library(tidyverse)
(x <- mtcars %>%
as_tibble())
var_to_lag <- "carb"
# Tried without success
x %>% mutate(lag_var = lag(!!var_to_lag, 1))
x %>% mutate(lag_var = lag(contains(var_to_lag), 1))
x %>% mutate(lag_var = lag(vars(contains(var_to_lag)), 1))
x %>% mutate(lag_var = lag(vars(!!var_to_lag), 1))
Any ideas?

We could use mutate_at which accepts string input
library(dplyr)
x %>% mutate_at(vars(var_to_lag), list(lag_var = ~lag(.)))
We can also use get
x %>% mutate(lag_var = lag(get(var_to_lag)))
Or first convert var_to_lag to symbol (sym) and then evaluate (!!)
x %>% mutate(lag_var = lag(!!sym(var_to_lag)))
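One more option, sketched here on the assumption of a reasonably recent dplyr, is the .data pronoun, which looks a column up by a string without any explicit unquoting. The name lagged is made up for this example.

```r
library(dplyr)

x <- as_tibble(mtcars)
var_to_lag <- "carb"

# .data[[...]] fetches the column whose name is stored in var_to_lag
lagged <- x %>%
  mutate(lag_var = lag(.data[[var_to_lag]], 1))

head(lagged$lag_var)
```

Unlike get(), .data[[var_to_lag]] only looks inside the data frame, so a variable of the same name in the calling environment cannot be picked up by accident.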

Iterating over values in character vector with dplyr functions

I have several variables (id.type and id.subtype in this example) I would like to check for distinct values in a tibble all.snags using the dplyr package. I would like them sorted and all values printed out in the console (a tibble typically prints only the first 10). The output would be equivalent to the following code:
distinct(all.snags,id.type) %>% arrange(id.type) %>% print(n = Inf)
distinct(all.snags,id.subtype) %>% arrange(id.subtype) %>% print(n = Inf)
I think this is better done by looping over the values in a vector, but I can't get it to work.
distinct.vars <- c("id.type","id.subtype")
for (i in distinct.vars) {
distinct(all.snags,distinct.vars[i]) %>%
arrange(distinct.vars[i]) %>%
print(n = Inf)
}

I think this function is what you want:
library(dplyr)
df = iris
print_distinct = function(df, columns) {
for (c in columns) {
print(df %>% distinct_(c) %>% arrange_(c))
}
}
print_distinct(df, c("Sepal.Length", "Sepal.Width"))
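distinct_() and arrange_() are deprecated in current dplyr. Below is a sketch of the same loop with !!sym(), as used elsewhere on this page, assuming the rlang package is installed. print_distinct2 is a hypothetical name. As for the loop in the question, i is already the column name there, so distinct.vars[i] looks for an element named "id.type" in an unnamed vector and returns NA.

```r
library(dplyr)

print_distinct2 <- function(df, columns) {
  for (col in columns) {
    # Turn the column name (a string) into a symbol, then unquote it
    col_sym <- rlang::sym(col)
    df %>%
      as_tibble() %>%       # so that print(n = Inf) is understood
      distinct(!!col_sym) %>%
      arrange(!!col_sym) %>%
      print(n = Inf)        # print every row, not just the first 10
  }
  invisible(df)
}

print_distinct2(iris, c("Sepal.Length", "Sepal.Width"))
```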

Error All select() inputs must resolve to integer column positions. The following do not:

I am trying to use dplyr computation as below and then call this in a function where I can change the column name and dataset name. The code is as below:
sample_table <- function(byvar = TRUE, dataset = TRUE) {
tcount <-
df2 %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(tcount = n) %>%
left_join(
select(
dataset %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(scount = n), byvar, scount
), by = c("byvar")
) %>%
mutate_each(funs(replace(., is.na(.), 0)),-byvar) %>% mutate(
tperc = round(tcount / rcount, digits = 2), sperc = round(scount / samplesize, digits = 2),
absdiff = abs(sperc - tperc)
) %>%
select(byvar, tcount, tperc, scount, sperc, absdiff)
return(tcount)
}
category_Sample1 <- sample_table(byvar = "category", dataset = Sample1)
My function name is sample_table.
The error message is as below:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* byvar
I know this is a repeat question and I have gone through the below links:
Function writing passing column reference to group_by
Error when combining dplyr inside a function
I am not sure where I am going wrong. rcount is the number of rows in df2 and samplesize is the number of rows in "dataset" dataframe. I have to compute the same thing for another variable with three different "dataset" names.

You are mixing column references given as strings (byvar), which is standard evaluation, with bare column references (tcount, tperc, etc.), which is non-standard evaluation.
Make sure you use one or the other, together with the appropriate function: select() or select_(). You can fix your issue by using
select(one_of(c(byvar,'tcount')))
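one_of() has been superseded by all_of() and any_of() in newer tidyselect releases. Here is a minimal sketch of mixing a column name held in a string with bare column names, assuming dplyr 1.0.0 or later. It uses mtcars as stand-in data because df2 and Sample1 from the question are not available, and counts and selected are made-up names.

```r
library(dplyr)

byvar <- "cyl"

# Group by the column named in byvar and count rows per group
counts <- mtcars %>%
  group_by(across(all_of(byvar))) %>%
  tally() %>%
  rename(tcount = n)

# all_of() handles the string; tcount can stay a bare name
selected <- counts %>%
  select(all_of(byvar), tcount)

names(selected)
```

In the same spirit, a join would take by = byvar rather than by = c("byvar"), since the latter looks for a column literally called byvar.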

dplyr: How to use select and filter inside functions; (...) not working for arguments

I'm trying to build some functions for creating standard tables from a questionnaire, using dplyr for the data manipulation. This question was very helpful for the group_by function, passing arguments (in this case, the name of the variable I want to use to make the table) to (...), but that seems to break down when trying to pass the same arguments to other dplyr commands, specifically 'select' and 'filter'. The error message I get is "'...' used in an incorrect context".
Does anyone have any ideas on this? Thank you
For the sake of completeness (and any other hints - I'm very new to writing functions), here is the code I would like to use:
myTable <- function(x, ...) {
df <-
x %>%
group_by(Var1, ...) %>%
filter(!is.na(...) & ... != '') %>% # To remove missing values: Not working!
summarise(value = n()) %>%
group_by(Var1) %>%
mutate(Tot = sum(value)) %>%
group_by(Var1, ...) %>%
summarise(num = sum(value), total = sum(Tot), proportion = num/total*100) %>%
select(Var1, ..., proportion) # To select desired columns: Not working!
tab <- dcast(df, Var1 ~ ..., value.var = 'proportion')
tab[is.na(tab)] <- 0
print(tab)
}