User defined function in R with dplyr - r

I have a dataframe and am trying to write a function that calculates the number of records by TRT01AN and by another variable chosen by the user (to keep it simple, I am only showing a reduced DF with one extra variable):
dataframe <- as.data.frame(cbind(ID = c(1,2,3,4,5,6),TRT01AN = c(1, 1, 3, 2, 2, 2),
AGEGR1 =c("Adult","Child","Adolescent","Adolescent","Adolescent","Child")))
sub1 <- function(SUB1) {
# Calculate number of subjects in each treatment arm
bigN1 <- dataframe %>%
group_by_(SUB1,TRT01AN) %>%
summarise(N = n_distinct(ID))
return(bigN1)
}
bigN1<-sub1(SUB1="AGEGR1")
If I do that with group_by_, I get an error that TRT01AN doesn't exist, and if I use group_by, SUB1 can't be found. Any idea how I can use both variables: a "permanent" one and one defined as the argument of the function?
Thank you!

Try the curly-curly operator {{ }} and pass the column name unquoted in the function call (a quoted name would be treated as a constant string rather than as a column):
library(dplyr)
dataframe <-
as.data.frame(cbind(
ID = c(1, 2, 3, 4, 5, 6),
TRT01AN = c(1, 1, 3, 2, 2, 2),
AGEGR1 = c(
"Adult",
"Child",
"Adolescent",
"Adolescent",
"Adolescent",
"Child"
)
))
sub1 <- function(SUB1) {
# Calculate number of subjects in each treatment arm
bigN1 <- dataframe %>%
group_by({{SUB1}}, TRT01AN) %>%
summarise(N = n_distinct(ID))
return(bigN1)
}
bigN1 <- sub1(AGEGR1)
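If the column name has to arrive as a character string, as in the original call sub1(SUB1 = "AGEGR1"), one possible variant selects the grouping columns with across(all_of(...)). This is only a sketch: the function name sub1_chr and its data argument are made up for illustration, and the data frame is rebuilt with data.frame() so that ID and TRT01AN stay numeric.

```r
library(dplyr)

# Same toy data as above, built with data.frame() so numeric columns stay numeric
dataframe <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  TRT01AN = c(1, 1, 3, 2, 2, 2),
  AGEGR1 = c("Adult", "Child", "Adolescent", "Adolescent", "Adolescent", "Child")
)

# Hypothetical string-based variant: sub_var is a column name given as text
sub1_chr <- function(data, sub_var) {
  data %>%
    # all_of() turns the character vector into a column selection
    group_by(across(all_of(c(sub_var, "TRT01AN")))) %>%
    summarise(N = n_distinct(ID), .groups = "drop")
}

res <- sub1_chr(dataframe, "AGEGR1")
res
```

The fixed column is passed as a string alongside the user-supplied one, so both go through the same selection step.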

Related

Why does map_df produce many missing values? How can I concatenate across rows to remove NAs?

I'm trying to count how many students received 1s, 2s, 3s, 4s, and 5s across their subjects, and I want a column for each subject and the possible grade (math_1, science_2, etc.).
I originally wrote a for loop, but my actual dataset has so many cases that I need to use map. I can get it to work, but it produces many NAs and only one chunk per column has actual data. I'm curious to know either:
Why is map_df() doing this and how can I avoid it? OR
How can I tighten this up so I only have this information on one row per the original rows in the first dataset (18 rows)? In other words, I'd concatenate up and down the column, so all the NAs are filled in (unless there truly was missing data).
Here's my code
library(tidyverse)
#Set up - generate sample dataset and get all combinations of grades and subjects
student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
grade = as.character(c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4)))
all_subject_combos <- c("english", "history", "math", "biology")
all_grades <- c("1", "2", "3",
"4", "5")
subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)
all_combos <- subjects_and_letter_grades %>%
unite("names", c(Var1, Var2)) %>%
mutate(names = str_replace_all(names, "\\|", "_")) %>%
pull(names)
# iterate over each combination using map_df()
student_map <- map_df(all_combos,
~student_grades %>%
mutate("{.x}" := paste(i)) %>%
group_by(student_id) %>%
mutate("{.x}" := sum(case_when(str_detect(.x, subject) &
str_detect(.x, grade) ~ 1,
TRUE ~ 0), na.rm = T)))
EDIT
For the record, my almost identical for loop does not pad in many missing values. I assume it must have something to do with how it is building the dataset, but I don't know how I can override what map_df is doing under the hood.
student_map <- student_grades
for(i in all_combos) {
student_map <- student_map %>%
mutate("{i}" := paste(i)) %>%
group_by(student_id) %>%
mutate("{i}" := sum(case_when(str_detect(i, subject) &
str_detect(i, grade) ~ 1,
TRUE ~ 0), na.rm = T))
}
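There is no i inside map_df(): in a purrr lambda the current element is .x, so paste(i) only picks up whatever i happens to exist in the calling environment. In addition, map_df() row-binds the result of every iteration, so each iteration contributes its own block of rows and the columns created in the other iterations are filled with NA. It is better to use transmute() instead of mutate(), so that each iteration returns only the column it adds, to column-bind those results with map_dfc(), and to bind them to the original data at the end: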
library(dplyr)
library(purrr)
library(stringr)
student_map2 <- map_dfc(all_combos,
~ student_grades %>%
transmute(subject, grade, student_id, "{.x}" := .x) %>%
group_by(student_id) %>%
transmute("{.x}" := sum(case_when(str_detect( .x, subject) &
str_detect(.x, grade)~ 1, TRUE ~ 0), na.rm = TRUE)) %>%
ungroup %>%
select(-student_id)) %>%
bind_cols(student_grades, .)
Checking against the OP's for loop output:
> all.equal(student_map, student_map2, check.attributes = FALSE)
[1] TRUE
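The difference between row-binding and column-binding can be seen on a much smaller case. The sketch below is not the OP's data; the names cols, by_row and by_col are invented for illustration, and it assumes a purrr version that still provides map_dfr() and map_dfc().

```r
library(purrr)
library(tibble)

cols <- c("a", "b")

# Row-binding: each iteration adds two new rows holding only its own column,
# so the column from the other iteration is padded with NA
by_row <- map_dfr(cols, ~ tibble(!!.x := 1:2))

# Column-binding: each iteration adds one new column next to the others
by_col <- map_dfc(cols, ~ tibble(!!.x := 1:2))

by_row
by_col
```

Here by_row has four rows with NA padding, while by_col keeps two rows with no padding, which is the shape the for loop produces.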
Though I can't figure out why map_df() is performing in this undesirable way, I did find a solution, inspired heavily by the answer to this post.
solution <- student_map %>%
group_by(student_id, subject, grade) %>%
summarise_all(~ last(na.omit(.)))
solution
Basically, this code drops the NAs within each group and keeps an NA only when a group has nothing but missing values. Because those columns in my dataset will never have missing values, this solution works in my case.

How can column names be used in a complex dplyr function that uses base R elements?

First, this question is likely a repeat, however the solutions I've tried (eg using enquo() and !!, or get(), or {{}}) have not yielded results.
The Problem
I have a function that needs to take column names passed to it in a pipeline, perform a series of dplyr-based functions with some base R components, and return the new dataframe. The problem is that the function will not take the column names passed to it as variables in the referenced dataframe, treating them as strings instead.
The Data
This code will create a usable dataframe for this problem:
df_ext <- tibble(ID = c(rep(1, 5), rep(2, 5)),
TIME = rep(c(1, 2, 3, 4, 5), 2),
VAL = c(0, 1, 2, 2, 3, 1, 5, 0, 1, 4))
The Current Function
Here's a version of the function that I can share. It's designed to create a series of categories for the data I pass to it, but this is a simplified version that just calculates some basic groupings (ie, it doesn't do much of anything).
my_fun <- function(.data, id, time, val){
require("dplyr")
df <- df |>
group_by(id) |>
mutate(val_lag = if_else(val > 0, time - lag(time, default = 0), 0)) |>
mutate(first_time = min(time),
last_time = max(time),
first_val_pos = ifelse(any(val), min(time[val > 0]), NA),
last_val_pos = ifelse(any(val > 0), max(time[val > 0]), NA)) |>
group_by(grp = cumsum(val_lag == 0)) |>
mutate(val_pos_run = cumsum(val_lag)) |>
ungroup() |>
group_by(id) |>
mutate(ada_bl = ifelse(first_val_pos <= 0, val[time == first_val_pos], 0)) |>
ungroup()
df
}
df_ext |>
my_fun(id = ID, time = TIME, val = VAL)
If anyone can get the columns from the dataframe to pass into the function and get treated as columns in the pipe-referenced dataframe, you'd be ending a very frustrating headache.
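One way this could be approached is sketched below, under two assumptions: that the pipeline should start from the .data argument (the posted function pipes from df, which is never defined inside it), and that embracing each column argument with {{ }} is acceptable. The name my_fun_sketch is made up, and only a few of the original steps are reproduced.

```r
library(dplyr)

df_ext <- tibble(ID = c(rep(1, 5), rep(2, 5)),
                 TIME = rep(c(1, 2, 3, 4, 5), 2),
                 VAL = c(0, 1, 2, 2, 3, 1, 5, 0, 1, 4))

# Hypothetical, cut-down version of my_fun(): starts from .data and
# embraces every column argument, including inside base R subsetting
my_fun_sketch <- function(.data, id, time, val) {
  .data %>%
    group_by({{ id }}) %>%
    mutate(
      first_time = min({{ time }}),
      last_time = max({{ time }}),
      # first time point with a positive value, NA if the group has none
      first_val_pos = if (any({{ val }} > 0)) {
        min({{ time }}[{{ val }} > 0])
      } else {
        NA_real_
      }
    ) %>%
    ungroup()
}

out <- df_ext %>%
  my_fun_sketch(id = ID, time = TIME, val = VAL)
out
```

The same embracing pattern should carry over to the remaining mutate() steps of the original function.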

Concatenate columns if they don't contain a zero

I'm trying to concatenate 4 columns into a single column named "tags" for later use in multilabel classification. I would like to concatenate the columns so that only the values that are not zero are pasted, separated by commas.
For example, the cell in the last row would be {1,2} instead of {1,2,0,0}.
I currently have no code that works as needed and haven't been able to find something on the internet. Do you guys have a tip to do this?
Current code:
df$TV[df$TV==1] = '1'
df$Internet[df$Internet ==1] = '2'
df$Mobil[df$Mobil==1] = '3'
df$Fastnet[df$Fastnet==1] = '4'
df$tags = paste(df$TV,df$Internet,df$Mobil,df$Fastnet, sep=",")
Base R option using apply -
cols <- c('TV', 'Internet', 'Mobil', 'Fastnet')
#df <- read.csv('stack.csv')
df$tags <- apply(df[cols], 1, function(x) toString(x[x!= 0]))
df
In dplyr we can use rowwise -
library(dplyr)
df <- df %>%
rowwise() %>%
mutate(tags = {
tmp <- c_across(all_of(cols))
toString(tmp[tmp != 0])
}) %>%
ungroup
df
We may use dapply from collapse
library(collapse)
cols <- c('TV', 'Internet', 'Mobil', 'Fastnet')
df$tags <- dapply(slt(df, cols), MARGIN = 1, FUN = function(x) toString(x[x != 0]))
data
df <- data.frame(TV = c(1, 3, 2, 0), Internet = c(1, 0, 1, 4), Mobil = c(0, 1, 3, 2), Fastnet = c(1, 5, 3, 2))

Sum of a list in R data frame

I have a column of type "list" in my data frame, I want to create a column with the sum.
I guess there is no visual difference, but my column consists of list(1,2,3)s and not c(1,2,3)s :
tibble(
MY_DATA = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
),
NOT_MY_DATA = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
Unfortunately when I try
mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
the result is that every cell in the new column contains the sum of the entire source column (so a value in the millions)
I tried reduce and it did work, but it was slow, so I am looking for a better solution.
You could use the purrr::map_dbl, which should return a vector of type double:
library(tibble)
library(dplyr)
library(purrr)
df = tibble(
MY_LIST_COL_D = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
df %>%
mutate(NEW_COL= map_dbl(MY_LIST_COL_D, sum), .keep = 'unused')
# NEW_COL
# <dbl>
# 1 17
# 2 24
# 3 14
Is this what you were looking for? If you don't want to remove the list column just disregard the .keep argument.
Update
With the underlying structure being lists, you can still apply the same logic, but one way to solve the issue is to unlist:
df = tibble(
MY_LIST_COL_D = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
)
)
df %>%
mutate(NEW_COL = map_dbl(MY_LIST_COL_D, ~ sum(unlist(.x))), .keep = 'unused')
# NEW_COL
# <dbl>
# 1 17
# 2 24
# 3 14
You can use rowwise in dplyr
library(dplyr)
df %>% rowwise() %>% mutate(NEW_COL = sum(MY_LIST_COL_D))
rowwise will also make your attempt work :
df %>% rowwise() %>% mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
You can also use sapply in base R (this works when each element is a numeric vector; for nested lists, unlist each element first):
df$NEW_COL <- sapply(df$MY_LIST_COL_D, sum)
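For the nested-list case from the question, a base R sketch could unlist each element before summing. The data frame name df_nested is invented for illustration, and vapply() is used in place of sapply() only so that the result type is fixed as numeric.

```r
# Toy data: a list-column whose elements are lists, as in the question
df_nested <- data.frame(id = 1:3)
df_nested$MY_LIST_COL_D <- list(
  list(2, 7, 8),
  list(3, 10, 11),
  list(4, 2, 8)
)

# Flatten each inner list, then sum it; numeric(1) declares one number per row
df_nested$NEW_COL <- vapply(
  df_nested$MY_LIST_COL_D,
  function(x) sum(unlist(x)),
  numeric(1)
)

df_nested$NEW_COL
```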

How to loop through vector of column names, mutate each column with assignment back to column and function referencing loop index

If you can forgive my interest in loops, I'd like to know how to loop through a vector of variable names (must be strings in my use case) and mutate the original columns. In this toy example, I want to calculate the mean of the column i plus z.
df_have <- data.frame(x=c(1, 1, 2, 3, 3),
y=c(2, 2, 3, 4, 4),
z=c(0, 1, 2, 3, 4))
for (i in c("x", "y")) {
df_test <-
df_have %>%
mutate(!!i := mean(i)+z)
}
df_want <- data.frame(x=c(2, 3, 4, 5, 6), # mean 2 + z
y=c(3, 4, 5, 6, 7), # mean 3 + z
z=c(0, 1, 2, 3, 4))
Well, if you want to do a loop, then
df_test <- df_have
for (i in c("x", "y")) {
df_test <-
df_test %>%
mutate(!!i := mean((!!as.name(i)))+z)
}
Note that you need to turn those strings into symbols in order to use them in the expression for mutate. An easier trick in this case would be
df_have %>% mutate_at(c("x","y"), funs(mean(.)+z))
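In more recent dplyr versions, funs() is deprecated and mutate_at() has been superseded by across(). A sketch of the same idea, assuming dplyr 1.0 or later and keeping the column names as strings (the vector name vars is made up):

```r
library(dplyr)

df_have <- data.frame(x = c(1, 1, 2, 3, 3),
                      y = c(2, 2, 3, 4, 4),
                      z = c(0, 1, 2, 3, 4))

vars <- c("x", "y")  # column names supplied as strings

# For each selected column: its mean plus z, written back to the same column
df_test <- df_have %>%
  mutate(across(all_of(vars), ~ mean(.x) + z))

df_test
```

all_of() accepts the character vector directly, so no conversion to symbols is needed here.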

Resources