Select columns and group by columns in function arguments - r

I am trying to write a function so that I can input any columns to be described both at the overall level and by a grouping variable.
However, I am having trouble with getting output for grouped results.
My data:
df <- data.frame(gender=c("m", "f", "m","m"), age=c("18-22","23-32","23-32","50-60"), income=c("low", "low", "medium", "high"), group=c("A", "A", "B", "B"))
> df
gender age income group
1 m 18-22 low A
2 f 23-32 low A
3 m 23-32 medium B
4 m 50-60 high B
Function:
library(dplyr)
make_sum <- function(data=df, cols, group_var) {
data %>% dplyr::select(cols) %>%
# print tables with frequency and proportions
apply(2, function(x) {
n <- table(x, useNA = "no")
prop=round(n/length(x[!is.na(x)])*100,2)
print(cbind(n, prop))
})
# print tables by group
data %>% dplyr::select(cols, vars(group_var)) %>%
apply(2, function(x) {
n <- table(x, vars(group_var),useNA = "no")
print(n)
})
}
cols <- df %>% dplyr::select(gender,age, income) %>% names()
make_sum(data=df, cols=cols, group_var="group")
I get the proper output for the overall tables but not the grouped, with this error showing:
Error: `vars(group_var)` must evaluate to column positions or names, not a list
Desired output (example) for grouped gender variable:
A B
f 1 0
m 1 2

Instead of using apply with MARGIN = 2, summarise_all can be called here. Also, vars() is a wrapper meant to be used inside tidyverse functions (such as the scoped verbs), so it does not work inside table() or as a plain column name. Here, in order to get the frequency, one option is to extract the grouping column with [[, which is more direct. Also, as summarise returns only a single row (one per group, if there is a grouping variable), we can wrap the output in a list:
make_sum <- function(data=df, cols, group_var) {
data %>%
dplyr::select(cols) %>%
summarise_all(~ {
n <- table(., data[[group_var]], useNA = "no")
#list(round(n/length(.[!is.na(.)])*100,2))
list(n)
})
}
cols <- df %>%
dplyr::select(gender,age, income) %>%
names()
out <- make_sum(data=df, cols=cols, group_var="group")
out$gender
#[[1]]
#. A B
# f 1 0
# m 1 2
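For comparison, here is a minimal base R sketch that returns both the overall table and the grouped table for each column. summarise_cols is a hypothetical helper name (it is not part of dplyr), and the toy data is copied from the question; treat it as one possible approach rather than the definitive one.

```r
# Toy data copied from the question.
df <- data.frame(
  gender = c("m", "f", "m", "m"),
  age    = c("18-22", "23-32", "23-32", "50-60"),
  income = c("low", "low", "medium", "high"),
  group  = c("A", "A", "B", "B")
)

# summarise_cols is a hypothetical helper: for each requested column it
# returns the overall counts/percentages and the counts by group_var.
summarise_cols <- function(data, cols, group_var) {
  lapply(setNames(cols, cols), function(col) {
    x <- data[[col]]
    n <- table(x, useNA = "no")          # counts, ignoring NA
    list(
      overall  = cbind(n = n, prop = round(100 * n / sum(n), 2)),
      by_group = table(x, data[[group_var]], useNA = "no")
    )
  })
}

res <- summarise_cols(df, c("gender", "age", "income"), "group")
res$gender$by_group   # cross-tabulation of gender by group
```

Because the column names are passed as plain strings and looked up with [[, no tidy evaluation is involved, which avoids the vars() error entirely.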

Related

Filter specific values of optional argument with dplyr

I have a dataframe that looks like this :
df <- data.frame(ID = rep(1:10, each = 6),
Site = rep(c("A","B","C","D"), each = 6, times = 10),
Department = rep(c("E","F","G","H"), each = 6, times = 10),
Occupation = rep(c("I","J","K","L"), each = 6, times = 10),
Construct = rep(paste0("X",1:6), times = 10),
Score = sample(c("Green","Orange","Red"), size = 60, replace = TRUE))
head(df)
Basically, each ID belongs to a site, a department and has an occupation, and is evaluated on six constructs.
I have adapted a previous function of mine to compute the N and the rate of each Score category for a given Construct, by any combination of Site, Department and Occupation :
my_function <- function(..., dimension = NULL){
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
This works perfectly, as I simply have to indicate which Construct, and add any of the three factors (Site, Department, Occupation) as optional arguments to obtain a summary. For example, a summary of X1 by Site and Department would be:
my_function(dimension = "X1", Site, Department)
However, I would like to filter out some of the values of the Occupation variable, but only when looking at a summary including this variable. I tried to do so by checking whether Occupation was passed as an optional argument, and exclude the specific values when it was the case. Something like :
my_function <- function(..., dimension = NULL){
if(hasArg(Occupation)){
df %>%
filter(Construct == dimension, Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
} else {
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
}
But it does not seem to work, as it consistently includes the values I'd like to filter out, even when I specify Occupation as an optional argument. I tried to fiddle with things like curly-curly {{}} but I can't seem to get this function to filter the specific values.
hasArg seems to expect all of the arguments to be named, whereas in
my_function(dimension="X1", Site, Department, Occupation)
this is not the case.
Perhaps:
my_function <- function(..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
if (hasOcc) {
df %>%
filter(Construct == dimension, Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
} else {
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
}
my_function(Site, Department, Occupation, dimension = "X1")
# # A tibble: 7 x 6
# # Groups: Site, Department, Occupation [3]
# Site Department Occupation Score n rate
# <chr> <chr> <chr> <chr> <int> <dbl>
# 1 B F J Green 6 0.6
# 2 B F J Orange 4 0.4
# 3 C G K Green 2 0.2
# 4 C G K Orange 2 0.2
# 5 C G K Red 6 0.6
# 6 D H L Green 6 0.6
# 7 D H L Orange 4 0.4
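To see why hasArg misses the unnamed argument while match.call finds it, here is a small base R sketch. inspect_call is a hypothetical name used only for illustration, and the third check (inspecting only the symbols captured in ...) is an alternative I am suggesting, not something from the code above.

```r
# inspect_call is a hypothetical function: it only reports how each
# detection approach sees the call and never evaluates the arguments.
inspect_call <- function(..., dimension = NULL) {
  # hasArg() looks at argument *names*, so a bare Occupation is not seen.
  named <- methods::hasArg(Occupation)
  # Approach from the answer: search every element of the matched call.
  in_call <- "Occupation" %in% as.character(match.call())
  # Alternative (assumption): look only at the unevaluated ... expressions.
  dots <- as.list(substitute(list(...)))[-1]
  in_dots <- "Occupation" %in% vapply(dots, deparse, character(1))
  list(named = named, in_call = in_call, in_dots = in_dots)
}

r1 <- inspect_call(Site, Occupation, dimension = "X1")
# Edge case: the string "Occupation" is passed as the dimension value.
r2 <- inspect_call(Site, dimension = "Occupation")
```

In the second call, searching the whole matched call reports a match because the value of dimension is also part of the call, whereas checking only ... does not. That edge case is unlikely with constructs named X1 to X6, but it is worth knowing about.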
Some other thoughts on the function:
Reaching outside the function's scope to get df is not good practice: it is not really reproducible, and it can be difficult to troubleshoot. For instance, if you forget to assign your data to df, you'll see
my_function(Site, Department, Occupation, dimension = "X1")
# Error in UseMethod("filter") :
# no applicable method for 'filter' applied to an object of class "function"
(This error is because it is finding stats::df.)
Further, if you want to use it against a different non-df-named dataset, you're out of luck.
Recommendation: explicitly pass the data. A tidyverse commonality is to pass it as the first argument. One side-benefit of this is that you can (generally) use this in the middle of a %>%-pipe directly.
my_function <- function(.data, ..., dimension = NULL) { .data %>% ... }
You can reduce the number of pipelines in there by including the Occupation conditional directly in the filter(..). This is not just code-golf: in more complex code examples, it's not hard to imagine updating one of the %>%-pipes and either forgetting the other or updating it differently. Since the only difference here is a component of filter, we can add it there:
my_function <- function(..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
df %>%
filter(Construct == dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
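If dimension is required, don't give it a default of NULL: without a default, omitting it fails with a clear 'argument "dimension" is missing, with no default' error instead of a confusing result from comparing against NULL.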
my_function <- function(.data, ..., dimension) { ... }
If it is instead optional and you don't want to filter on it if not provided, then you need to check for that in your filter:
filter(if (is.null(dimension)) TRUE else Construct == dimension, ...)
If you can imagine wanting dimension to be either NA (matching an explicit NA value in the data) or you might want "one or more", then you may want to use %in% instead of ==:
NA == NA
# [1] NA
NA %in% NA
# [1] TRUE
So your function could use
filter(if (is.null(dimension)) TRUE else Construct %in% dimension, ...)
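As a quick illustration of that condition outside of dplyr, here is a base R sketch. keep_rows is a hypothetical helper that returns the logical vector the filter() condition would produce, and the values vector is made up for the example.

```r
# keep_rows is a hypothetical helper mirroring the filter() condition:
# NULL means "keep everything", otherwise match with %in%.
keep_rows <- function(values, dimension = NULL) {
  if (is.null(dimension)) rep(TRUE, length(values)) else values %in% dimension
}

constructs <- c("X1", NA, "X2", "X1")

keep_all <- keep_rows(constructs)                  # no filtering
keep_x1  <- keep_rows(constructs, "X1")            # a single value
keep_two <- keep_rows(constructs, c("X1", "X2"))   # "one or more"
keep_na  <- keep_rows(constructs, NA)              # matches the explicit NA

# With == the NA comparison yields NA instead of TRUE/FALSE.
eq_na <- constructs == NA
```

Note that filter() recycles a length-1 TRUE, so the rep() is only needed here because the helper returns a plain logical vector.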
These points would result in your function being either
my_function <- function(.data, ..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
.data %>%
filter(if (is.null(dimension)) TRUE else Construct %in% dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
if dimension is optional, or
my_function <- function(.data, ..., dimension) {
hasOcc <- "Occupation" %in% as.character(match.call())
.data %>%
filter(Construct %in% dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
otherwise.

Access column name inside function used inside summarize_all dplyr structure

I'm building a dplyr structure to run some custom functions over the columns of a dataframe in one block of code.
Currently my function looks like this:
funx <- function(x) {
logchoice <- if(max(x) < 400) {'T' } else { 'F' }
logtest <- suppressWarnings(log10(x))
remaining <- length(logtest[which(!is.na(logtest) & is.finite(logtest))])
x <- if(remaining > 0.75*length(x)) {suppressWarnings(log10(x)) } else { x }
x <- x[which(!is.na(x) & is.finite(x))]
y <- diptest::dip.test(x)
z <- tibble(pvalue = y$p.value, Transform = logchoice)
return(z)
}
and the dplyr structure looks like this:
mtcars %>%
sample_n(30) %>%
select(colnames(mtcars)[2:5]) %>%
summarise_all(list(~ list(funx(.)))) %>%
gather %>%
unnest %>%
arrange(pvalue) %>%
rename(Parameter = key)
which gives me:
Parameter pvalue Transform
1 cyl 0.00000000 T
2 drat 0.03026093 T
3 hp 0.04252001 T
4 disp 0.06050505 F
I would like to know how I can access the column name inside my function, mainly because I would like to change the name in the result table to look like the output of paste('log10', original_column_name, sep = '_') if the function applies the log transformation, but leave the original name as is when it decides not to.
so the expected output is:
Parameter pvalue Transform
1 log10_cyl 0.00000000 T
2 log10_drat 0.03026093 T
3 log10_hp 0.04252001 T
4 disp 0.06050505 F
You were quite close. You can just add a mutate() to the end
mtcars %>%
sample_n(30) %>%
select(colnames(mtcars)[2:5]) %>%
summarise_all(list(~ list(funx(.)))) %>%
gather() %>%
unnest() %>%
arrange(pvalue) %>%
rename(Parameter = key) %>%
mutate(Parameter = ifelse(Transform == "T", paste0("log10_", Parameter), Parameter)) %>%
select(Parameter, pvalue)
# Parameter pvalue
# log10_cyl 0.00000000
# log10_drat 0.01389723
# disp 0.02771770
# log10_hp 0.08493466
Answering in a separate post as the solution is different. To get the column names in a print(), I would pass them into the function and use purrr::map_dfr to build a dataframe of the result. The small changes I made are to grab the column name, col_name, and to pass the dataframe explicitly. I tried a few approaches to grab the column name using your original function but was unsuccessful.
logtest_pval <- function(col, df) {
col_name <- col
x <- df %>% pull(!!col)
logchoice <- ifelse(max(x) < 400, TRUE, FALSE)
logtest <- log10(x)
remaining <- length(logtest[which(!is.na(logtest) & is.finite(logtest))])
x <- if(remaining > 0.75*length(x)) {suppressWarnings(log10(x)) } else { x }
x <- x[which(!is.na(x) & is.finite(x))]
y <- diptest::dip.test(x)
z <-
tibble(
transform = logchoice,
column = ifelse(logchoice, paste0("log10_", col_name), col_name),
pvalue = y$p.value
)
print(paste0(z, collapse = " | "))
return(z)
}
Then you can build your dataframe:
purrr::map_dfr(
.x = names(mtcars), # the columns to use
.f = logtest_pval, # the function to use
df = mtcars # additional arguments needed
)
Here's another example
df <-
mtcars %>%
select_if(is.numeric)
pvalues <-
map_dfr(names(df), logtest_pval, df)
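The underlying idea (iterate over the column names rather than the columns, so the function sees both) can be sketched in base R without purrr or diptest. label_col and summarise_named are hypothetical names, and max() stands in for the dip test p-value purely so the example runs without extra packages.

```r
# label_col is a hypothetical stand-in for funx(): it receives the column
# values *and* the column name, so it can build the label itself.
label_col <- function(x, name) {
  logged <- max(x) < 400   # same threshold the question uses for Transform
  data.frame(
    Parameter = if (logged) paste0("log10_", name) else name,
    max       = max(x),    # placeholder statistic, not a p-value
    stringsAsFactors = FALSE
  )
}

# summarise_named is a hypothetical helper: call f(column, name) for every
# column and stack the one-row results.
summarise_named <- function(df, f) {
  rows <- lapply(names(df), function(nm) f(df[[nm]], nm))
  do.call(rbind, rows)
}

res <- summarise_named(mtcars[2:5], label_col)
res
```

If you prefer to stay in the tidyverse, purrr::imap_dfr passes the element and its name to the function in the same way.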

How to use apply functions correctly when there are NA values

I'd like to calculate a function on multiple columns of a dataframe with random NA values. I have two questions:
How to deal with NAs? The code runs when I try it on non-NA columns, but returns NA when there are NAs even though I remove them.
How to print the results in a dataframe format instead of multiple arrays? I used mapply but it doesn't seem to do the calculations correctly.
Here is my code:
#create a data frame with random NAs
df<-data.frame(category1 = sample(c(1:10),100,replace=TRUE),
category2 = sample(c(1:10),100,replace=TRUE)
)
insert_nas <- function(x) {
len <- length(x)
n <- sample(1:floor(0.2*len), 1)
i <- sample(1:len, n)
x[i] <- NA
x
}
df <- sapply(df, insert_nas) %>% as.data.frame()
df$type <- sample(c("A", "B", "C"),100,replace=TRUE)
#using apply:
library(NPS)
apply(df[,c('category1', 'category2')], 2,
function(x) df %>% filter(!is.na(x)) %>% group_by(type) %>%
transmute(nps(x)) %>% unique()
)
#results:
$category1
# A tibble: 3 x 2
# Groups: type [3]
type `nps(x)`
<chr> <dbl>
1 B NA
2 A NA
3 C NA
...
#using mapply
mapply(function(x) df %>% filter(!is.na(x)) %>% group_by(type) %>%
transmute(nps(x)) %>% unique(), df[,c('category1', 'category2')])
#results:
category1 category2
type Character,3 Character,3
nps(x) Numeric,3 Numeric,3
Regarding the function I use, it doesn't have a built in way to deal with NAs, so I remove NAs prior to calling it.
I still used the !is.na part of your code because it seems that nps can't deal with NA, even though the documentation said it should (possible bug). I changed your apply to lapply and passed the variable names as a character vector. Then I used get() to look up each quoted name as a column of your df.
df<-data.frame(category1 = sample(c(1:10),100,replace=TRUE),
category2 = sample(c(1:10),100,replace=TRUE)
)
insert_nas <- function(x) {
len <- length(x)
n <- sample(1:floor(0.2*len), 1)
i <- sample(1:len, n)
x[i] <- NA
x
}
df <- sapply(df, insert_nas) %>% as.data.frame()
df$type <- sample(c("A", "B", "C"),100,replace=TRUE)
#using lapply:
library(NPS)
df2 <- as.data.frame(lapply(c('category1', 'category2'),
function(x) df %>% filter(!is.na(get(x))) %>% group_by(type) %>%
transmute(nps(get(x))) %>% unique()
),stringsAsFactors = FALSE)
colnames(df2) <- c("type", "nps_cat1","type2","nps_cat2")
#type2 is redundant
df2 <- select(df2, -type2)
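A likely reason the original apply() call returned NA is that x inside the anonymous function is the full column vector, so nps(x) still sees the NAs even after filter() has dropped rows from df. Here is a base R sketch that removes NAs inside each group and returns one data frame. nps_simple is a hypothetical stand-in for NPS::nps (share of scores 9 and above minus share of scores 6 and below), used only so the example runs without the package, and nps_by_group and the toy data are also made up for illustration.

```r
# nps_simple is a hypothetical stand-in for NPS::nps():
# proportion of promoters (>= 9) minus proportion of detractors (<= 6).
nps_simple <- function(x) {
  x <- x[!is.na(x)]            # drop NAs within the group
  mean(x >= 9) - mean(x <= 6)
}

# nps_by_group is a hypothetical helper: one row per group, one column
# per score column.
nps_by_group <- function(df, cols, group) {
  stats <- lapply(cols, function(col) {
    vapply(split(df[[col]], df[[group]]), nps_simple, numeric(1))
  })
  names(stats) <- paste0("nps_", cols)
  data.frame(type = names(stats[[1]]), as.data.frame(stats), row.names = NULL)
}

# Small deterministic data with NAs, made up for this example.
toy <- data.frame(
  category1 = c(10, 9, NA, 3, 7, 10),
  category2 = c(1, NA, 10, 10, 9, 2),
  type      = c("A", "A", "A", "B", "B", "B")
)

res <- nps_by_group(toy, c("category1", "category2"), "type")
res
```

Because every column is split by the same grouping vector, the groups come out in the same order and no redundant type column needs to be dropped afterwards.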

Apply function over data frame rows

I'm trying to apply a function over the rows of a data frame and return a value based on the value of each element in a column. I'd prefer to pass the whole dataframe instead of naming each variable as the actual code has many variables - this is a simple example.
I've tried purrr map_dbl and rowwise but can't get either to work. Any suggestions please?
#sample df
df <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31))
#required result
Res <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31),
NewVal=c(10,500,800,230,3100)
)
#use mutate and map or rowwise etc
Res <- df %>%
mutate(NewVal=map_dbl(.x=.,.f=FnAdd(.)))
Res <- df %>%
rowwise() %>%
mutate(NewVal=FnAdd(.))
#sample fn
FnAdd <- function(Data){
if(Data$Y=="A"){
X=Data$X*10
}
if(Data$Y=="B"){
X=Data$X*100
}
return(X)
}
If there are multiple values, it is better to have a key/value dataset, join it, and then do the multiplication
keyVal <- data.frame(Y = c("A", "B"), NewVal = c(10, 100))
df %>%
left_join(keyVal, by = "Y") %>% # name the join column explicitly
mutate(NewVal = X*NewVal)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
It is not clear how many unique values there are in the 'Y' column of the actual dataset. If there are only a few, then case_when can be used
FnAdd <- function(Data){
Data %>%
mutate(NewVal = case_when(Y == "A" ~ X * 10,
Y == "B" ~ X *100,
TRUE ~ X))
}
FnAdd(df)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
You were originally looking for a solution using dplyr's rowwise() function, so here is that solution. The nice thing about this approach is that you don't need to create a separate function.
Here's the version using ifelse()
df %>%
rowwise() %>%
mutate(NewVal = ifelse(Y == "A", X * 10,
ifelse(Y == "B", X * 100, NA_real_))) # NA for any other value of Y
and here's the version using case_when:
df %>%
rowwise() %>%
mutate(NewVal = case_when(Y == "A" ~ X * 10,
Y == "B" ~ X * 100))
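Since the multiplier depends only on Y, the same result can also be obtained without rowwise() or a join by indexing a named vector. This is a minimal base R sketch; multiplier is a hypothetical lookup vector, and the data is copied from the question.

```r
# Sample data copied from the question.
df <- data.frame(Y = c("A", "B", "B", "A", "B"),
                 X = c(1, 5, 8, 23, 31))

# multiplier is a hypothetical lookup: one factor per value of Y.
multiplier <- c(A = 10, B = 100)

# as.character() guards against Y being a factor, in which case indexing
# would use the integer codes instead of the labels.
Res <- df
Res$NewVal <- Res$X * unname(multiplier[as.character(Res$Y)])
Res
```

Any value of Y that is missing from the lookup yields NA, which makes unexpected categories easy to spot.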

Calling recursive functions in R

Assuming I have a dataframe, df with this info
group wk source revenue
1 1 C 100
1 1 D 200
1 1 A 300
1 1 B 400
1 2 C 500
1 2 D 600
I'm trying to programmatically filter df down to rows of unique combinations of group, wk and source, and then perform some operations on them, before combining them back into another dataframe. I want to write a function that can scale to any number of segments (and not just the example scenario here) and filter down rows. All I need to pass would be the column names by which I want to segment
eg.
seg <- c("group", "wk", "source")
One unique combination to filter rows in df would be
df %>% filter(group == 1 & wk == 1 & source == "A")
I wrote a recursive function (get_rows) to do so, but it doesn't seem to do what I want. Could anyone provide inputs on where I'm going wrong ?
library(dplyr)
filter_row <- function(df,x)
{
df %>% filter(group == x$group & wk == x$wk & source == x$source)
}
seg <- c("group", "wk", "source")
get_rows <- function(df,seg,pos = 1, l = list())
{
while(pos <= (length(seg) + 1))
{
if(pos <= length(seg))
for(j in 1:length(unique(df[,seg[pos]])))
{
k <- unique(df[,seg[pos]])
l[seg[pos]] <- k[j]
get_rows(df,seg,pos+1,l)
return()
}
if(pos > length(seg))
{
tmp <- df %>% filter_row(l)
<call some function on tmp>
return()
}
}
}
get_rows(df,seg)
EDIT: I understand there are prebuilt methods I can use to get what I need, but I'm curious about where I'm going wrong in the recursive function I wrote.
There might be a data.table/dplyr solution out there, but this one is pretty simple.
# Just paste together the values of the column you want to aggregate over.
# This creates a character vector of group labels (by() treats it as a factor)
f <- function(data, v) {apply(data[, v, drop = FALSE], 1, paste, collapse = ".")}
# Aggregate, tapply, ave, and a few more functions can do the same thing
by(data = df, # Your data here
INDICES = f(df, c("group", "wk", "source")), # Your data and columns here
FUN = identity, simplify = FALSE) # Your function here
You can also use library(dplyr) and library(data.table):
# your_function is a placeholder; inside do(), `.` is the data frame for the current group
df %>% data.table() %>% group_by(group, wk, source) %>% do(your_function(.))
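As for the recursion itself: in the posted get_rows, the return() inside the for loop exits after the first value at each level, the while loop never advances pos, and the results of the recursive calls are discarded. Below is one possible rewrite in base R, offered as a sketch rather than the only fix. It narrows the data at each level and collects the leaf subsets in a list; the data frame is reconstructed from the table in the question.

```r
# Data reconstructed from the table in the question.
df <- data.frame(
  group   = c(1, 1, 1, 1, 1, 1),
  wk      = c(1, 1, 1, 1, 2, 2),
  source  = c("C", "D", "A", "B", "C", "D"),
  revenue = c(100, 200, 300, 400, 500, 600)
)

# Recursive sketch: at depth `pos`, split on seg[pos] and recurse into each
# subset; once every segment column is fixed, return the subset as a leaf.
get_rows <- function(df, seg, pos = 1) {
  if (pos > length(seg)) {
    return(list(df))            # one unique combination reached
  }
  out <- list()
  for (val in unique(df[[seg[pos]]])) {
    sub <- df[df[[seg[pos]]] == val, , drop = FALSE]
    out <- c(out, get_rows(sub, seg, pos + 1))   # keep the results
  }
  out
}

pieces <- get_rows(df, c("group", "wk", "source"))
length(pieces)   # number of unique combinations found
```

Each element of pieces can then be passed to your own function and the results combined with do.call(rbind, ...). Because the data is narrowed before recursing, only combinations that actually occur are visited.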
