R: summarise verb with varying length

[This question has been thoroughly rewritten. I hope this clarifies things.]
I have a dataset describing several tests with multiple-answer questions. Each line contains the raw answers of one participant, and the score that participant was awarded for each question. Each test has a different answer key:
df <- data.frame(id =c(1, 2, 3, 4, 5, 6, 7, 8, 9,10), # participant's id
test =c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), # id of question set
ans01=c(1, 2, 3, 4, 1, 1, 2, 3, 3, 4), # raw answers
ans02=c(2, 2,NA, 3, 4, 4, 3, 1, 1, 2),
bin01=c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0), # item scores
bin02=c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0))
My problem is that the answer key is missing, and I need to recreate it from the dataset.
Currently, my solution is simple and creates a separate answer key for each test:
library(dplyr)
key_data <- df %>%
group_by(test) %>%
summarise(key01 = mean(ans01[bin01 == 1], na.rm=TRUE),
key02 = mean(ans02[bin02 == 1], na.rm=TRUE))
However, while this is fine for short tests, it does not scale to longer tests containing dozens of questions.
I also want to be able to do this for future sets of tests, so the solution needs to be flexible in the number of items.
The question, therefore, is whether there is a way to do this without writing a line for each item key.
Maybe by looping through all variables, or by passing string vectors as variable names?
[I answered my own question with a not-very-elegant solution. I'm sure this can be achieved in a much better way]

Hi, you could use summarise_at and define the scope in the first argument.
See this example using the mtcars dataset:
library(dplyr)
selection_a<-c("mpg","cyl","vs")
selection_b<-c("mpg","cyl","vs","qsec","carb")
# Use first selection (A)
mtcars %>%
summarise_at(selection_a , ~ mean(.x, na.rm = TRUE))
# Use second Selection (B)
mtcars %>%
summarise_at(selection_b , ~ mean(.x, na.rm = TRUE))
# combine selections (A+C)
selection_c<-c("gear","carb")
mtcars %>%
summarise_at( c(selection_a,selection_c), ~ mean(.x, na.rm = TRUE))
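In dplyr 1.0.0 and later, the scoped verbs such as summarise_at are superseded by across(). A minimal sketch of the first selection written that way, assuming a recent dplyr; the names selection_a and means_a are only illustrative:

```r
library(dplyr)

# Same character vector of column names as selection (A) above
selection_a <- c("mpg", "cyl", "vs")

# across() + all_of() selects columns from a character vector,
# then applies the lambda to each selected column
means_a <- mtcars %>%
  summarise(across(all_of(selection_a), ~ mean(.x, na.rm = TRUE)))

means_a
```

Because the selection is an ordinary character vector, it can be built programmatically (for example with paste0) for any number of items.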

As a side note, I'm answering my own question with the not-very-elegant solution that I'm currently using. I'm sure there are shorter and much better solutions.
The required key table can be built with a loop, as follows:
MaxItems <- 2
pad0 <- function(x, n = 2) {
n0_pad <- n - nchar(x)
return(paste0(strrep("0",n0_pad), x))
}
library(dplyr)
## create the structure for the key table
keys <- df %>%
group_by(test) %>%
summarise(a01 = mean(ans01[bin01 == 1], na.rm=TRUE))
## add items to key table
for (i in 2:MaxItems) {
keysTemp <- df %>%
rename("tmpAns" = paste0("ans",pad0(i)),
"tmpBin" = paste0("bin",pad0(i))) %>%
group_by(test) %>%
summarise(tmpKey = mean(tmpAns[tmpBin == 1], na.rm=TRUE))
colnames(keysTemp)[2] <- paste0("a",pad0(i))
keys <- keys %>%
left_join(keysTemp, by = c("test"))
}
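One possible alternative to the loop is to reshape the data so that every item becomes a row, which makes the number of items irrelevant. This is only a sketch, assuming tidyr is available and that the columns follow the ans##/bin## naming used in the question; it is not part of the original answers:

```r
library(dplyr)
library(tidyr)

df <- data.frame(id    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 test  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 ans01 = c(1, 2, 3, 4, 1, 1, 2, 3, 3, 4),
                 ans02 = c(2, 2, NA, 3, 4, 4, 3, 1, 1, 2),
                 bin01 = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0),
                 bin02 = c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0))

key_data <- df %>%
  # "ans01"/"bin01" become the columns ans and bin plus item = "01"
  pivot_longer(-c(id, test),
               names_to = c(".value", "item"),
               names_pattern = "^(ans|bin)(\\d+)$") %>%
  group_by(test, item) %>%
  # the key is the answer given by participants who scored 1 on that item
  summarise(key = mean(ans[bin == 1], na.rm = TRUE), .groups = "drop") %>%
  # back to one row per test and one key column per item
  pivot_wider(names_from = item, values_from = key, names_prefix = "key")

key_data
```

As in the original summarise call, an item that nobody in a test answered correctly yields NaN rather than a key.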

Related

Why does map_df produce many missing values? How can I concatenate across rows to remove NAs?

I'm trying to count how many students received 1s, 2s, 3s, 4s, and 5s across their subjects, and I want a column for each subject and the possible grade (math_1, science_2, etc.).
I originally wrote a for loop, but my actual dataset has so many cases that I need to use map. I can get it to work, but it produces many NAs and only one chunk per column has actual data. I'm curious to know either:
Why is map_df() doing this and how can I avoid it? OR
How can I tighten this up so I only have this information on one row per the original rows in the first dataset (18 rows)? In other words, I'd concatenate up and down the column, so all the NAs are filled in (unless there truly was missing data).
Here's my code
library(tidyverse)
#Set up - generate sample dataset and get all combinations of grades and subjects
student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
grade = as.character(c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4)))
all_subject_combos <- c("english", "history", "math", "biology")
all_grades <- c("1", "2", "3",
"4", "5")
subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)
all_combos <- subjects_and_letter_grades %>%
unite("names", c(Var1, Var2)) %>%
mutate(names = str_replace_all(names, "\\|", "_")) %>%
pull(names)
# iterate over each combination using map_df()
student_map <- map_df(all_combos,
~student_grades %>%
mutate("{.x}" := paste(i)) %>%
group_by(student_id) %>%
mutate("{.x}" := sum(case_when(str_detect(.x, subject) &
str_detect(.x, grade) ~ 1,
TRUE ~ 0), na.rm = T)))
EDIT
For the record, my almost identical for loop does not pad in many missing values. I assume it must have something to do with how it is building the dataset, but I don't know how I can override what map_df is doing under the hood.
student_map <- student_grades
for(i in all_combos) {
student_map <- student_map %>%
mutate("{i}" := paste(i)) %>%
group_by(student_id) %>%
mutate("{i}" := sum(case_when(str_detect(i, subject) &
str_detect(i, grade) ~ 1,
TRUE ~ 0), na.rm = T))
}
There is no i inside the map call; in a purrr lambda the current element is referred to as .x. Also, it is better to use transmute instead of mutate, because each iteration should return only the column it adds; the new columns are then bound to the original data at the end.
library(dplyr)
library(purrr)
library(stringr)
student_map2 <- map_dfc(all_combos,
~ student_grades %>%
transmute(subject, grade, student_id, "{.x}" := .x) %>%
group_by(student_id) %>%
transmute("{.x}" := sum(case_when(str_detect( .x, subject) &
str_detect(.x, grade)~ 1, TRUE ~ 0), na.rm = TRUE)) %>%
ungroup %>%
select(-student_id)) %>%
bind_cols(student_grades, .)
Checking against the OP's for loop output:
> all.equal(student_map, student_map2, check.attributes = FALSE)
[1] TRUE
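The padding with NAs in the original map_df() call comes from row-binding: each iteration returns a full data frame with one new column, and the results are stacked on top of each other. A small sketch of the difference, using a hypothetical helper make_col and two made-up column names (map_dfr() and map_dfc() need dplyr installed):

```r
library(purrr)

cols <- c("english_1", "english_2")

# Hypothetical helper: each call returns a two-row data frame with one named column
make_col <- function(nm) setNames(data.frame(x = c(0, 1)), nm)

# Row-binding (what map_df() does): results are stacked, and columns missing
# from an iteration are filled with NA
stacked <- map_dfr(cols, make_col)

# Column-binding: results are placed side by side, so no padding is needed
side_by_side <- map_dfc(cols, make_col)

dim(stacked)
dim(side_by_side)
```

With 20 combinations and 18 input rows, row-binding therefore gives 360 rows in which each new column is filled only in its own block of 18.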
Though I can't figure out why map_df() is performing in this undesirable way, I did find a solution, inspired heavily by the answer to this post.
solution <- student_map %>%
group_by(student_id, subject, grade) %>%
summarise_all(~ last(na.omit(.)))
solution
Basically, this code keeps the last non-missing value of each column within each group, and returns NA only when every value in the group is missing. Because those columns in my dataset will never have missing values, this solution works in my case.

User defined function in R with dplyr

I have a dataframe and am trying to create a function that calculates the number of records by TRT01AN and another variable chosen by the user (I am only including a reduced data frame with one extra variable to keep it simple):
dataframe <- as.data.frame(cbind(ID = c(1,2,3,4,5,6),TRT01AN = c(1, 1, 3, 2, 2, 2),
AGEGR1 =c("Adult","Child","Adolescent","Adolescent","Adolescent","Child")))
sub1 <- function(SUB1) {
# Calculate number of subjects in each treatment arm
bigN1 <- dataframe %>%
group_by_(SUB1,TRT01AN) %>%
summarise(N = n_distinct(ID))
return(bigN1)
}
bigN1<-sub1(SUB1="AGEGR1")
If I do that with group_by_, I get an error that TRT01AN doesn't exist, and if I use group_by, SUB1 can't be found... Any idea how I can use both variables, a "permanent" one and one defined as the argument of the function?
Thank you!
Try using the curly-curly operator ({{ }}) and passing the column name unquoted in the function call:
library(dplyr)
dataframe <-
as.data.frame(cbind(
ID = c(1, 2, 3, 4, 5, 6),
TRT01AN = c(1, 1, 3, 2, 2, 2),
AGEGR1 = c(
"Adult",
"Child",
"Adolescent",
"Adolescent",
"Adolescent",
"Child"
)
))
sub1 <- function(SUB1) {
# Calculate number of subjects in each treatment arm
bigN1 <- dataframe %>%
group_by({{SUB1}}, TRT01AN) %>%
summarise(N = n_distinct(ID))
return(bigN1)
}
bigN1 <- sub1(AGEGR1)
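If the column name has to arrive as a string, as in the original call sub1(SUB1 = "AGEGR1"), one option is across(all_of()) inside group_by. This is a sketch under that assumption; the function name sub1_chr and its data argument are hypothetical, and the data frame is built with data.frame() directly so the numeric columns stay numeric:

```r
library(dplyr)

dataframe <- data.frame(
  ID      = c(1, 2, 3, 4, 5, 6),
  TRT01AN = c(1, 1, 3, 2, 2, 2),
  AGEGR1  = c("Adult", "Child", "Adolescent", "Adolescent", "Adolescent", "Child")
)

# Hypothetical variant of sub1() that takes the grouping column as a string
sub1_chr <- function(data, SUB1) {
  data %>%
    # all_of() turns the string into a column selection; TRT01AN stays fixed
    group_by(across(all_of(SUB1)), TRT01AN) %>%
    summarise(N = n_distinct(ID), .groups = "drop")
}

bigN1 <- sub1_chr(dataframe, "AGEGR1")
bigN1
```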

How can column names be used in a complex dplyr function that uses base R elements?

First, this question is likely a repeat; however, the solutions I've tried (e.g. using enquo() and !!, or get(), or {{}}) have not yielded results.
The Problem
I have a function that needs to take column names passed to it in a pipeline, perform a series of dplyr-based functions with some base R components, and return the new dataframe. The problem is that the function will not take the column names passed to it as variables in the referenced dataframe, treating them as strings instead.
The Data
This code will create a usable dataframe for this problem:
df_ext <- tibble(ID = c(rep(1, 5), rep(2, 5)),
TIME = rep(c(1, 2, 3, 4, 5), 2),
VAL = c(0, 1, 2, 2, 3, 1, 5, 0, 1, 4))
The Current Function
Here's a version of the function that I can share. It's designed to create a series of categories for the data I pass to it, but this is a simplified version that just calculates some basic groupings (ie, it doesn't do much of anything).
my_fun <- function(.data, id, time, val){
require("dplyr")
df <- df |>
group_by(id) |>
mutate(val_lag = if_else(val > 0, time - lag(time, default = 0), 0)) |>
mutate(first_time = min(time),
last_time = max(time),
first_val_pos = ifelse(any(val), min(time[val > 0]), NA),
last_val_pos = ifelse(any(val > 0), max(time[val > 0]), NA)) |>
group_by(grp = cumsum(val_lag == 0)) |>
mutate(val_pos_run = cumsum(val_lag)) |>
ungroup() |>
group_by(id) |>
mutate(ada_bl = ifelse(first_val_pos <= 0, val[time == first_val_pos], 0)) |>
ungroup()
df
}
df_ext |>
my_fun(id = ID, time = TIME, val = VAL)
If anyone can get the columns from the dataframe to pass into the function and get treated as columns in the pipe-referenced dataframe, you'd be ending a very frustrating headache.
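A minimal sketch of how the column arguments could be forwarded, assuming the embrace operator {{ }} and a pipeline that starts from the .data argument rather than from a global df. The function my_fun_sketch is hypothetical and covers only the first/last summaries, not the full logic above:

```r
library(dplyr)

df_ext <- tibble(ID = c(rep(1, 5), rep(2, 5)),
                 TIME = rep(c(1, 2, 3, 4, 5), 2),
                 VAL = c(0, 1, 2, 2, 3, 1, 5, 0, 1, 4))

# Hypothetical reduced version of my_fun
my_fun_sketch <- function(.data, id, time, val) {
  .data |>                          # start from the argument, not a global `df`
    group_by({{ id }}) |>           # {{ }} forwards the unquoted column name
    mutate(first_time = min({{ time }}),
           last_time = max({{ time }}),
           # base R subsetting works on embraced columns as well
           first_val_pos = ifelse(any({{ val }} > 0),
                                  min({{ time }}[{{ val }} > 0]), NA)) |>
    ungroup()
}

out <- df_ext |>
  my_fun_sketch(id = ID, time = TIME, val = VAL)
out
```

Every use of id, time, and val inside the dplyr verbs has to be embraced; a bare val would be looked up as a column literally named val.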

Concatenate columns if they don't contain a zero

I'm trying to concatenate 4 columns into a single column named "tags" for later use in multilabel classification. I would like to concatenate the columns so that only the values that are not zero are pasted, separated by commas.
For example, the cell in the last row would be {1,2} instead of {1,2,0,0}.
I currently have no code that works as needed and haven't been able to find anything on the internet. Do you have any tips for doing this?
Current code:
df$TV[df$TV==1] = '1'
df$Internet[df$Internet ==1] = '2'
df$Mobil[df$Mobil==1] = '3'
df$Fastnet[df$Fastnet==1] = '4'
df$tags = paste(df$TV,df$Internet,df$Mobil,df$Fastnet, sep=",")
Base R option using apply -
cols <- c('TV', 'Internet', 'Mobil', 'Fastnet')
#df <- read.csv('stack.csv')
df$tags <- apply(df[cols], 1, function(x) toString(x[x!= 0]))
df
In dplyr we can use rowwise -
library(dplyr)
df <- df %>%
rowwise() %>%
mutate(tags = {
tmp <- c_across(all_of(cols))
toString(tmp[tmp != 0])
}) %>%
ungroup
df
We may use dapply from collapse
library(collapse)
cols <- c('TV', 'Internet', 'Mobil', 'Fastnet')
df$tags <- dapply(slt(df, cols), MARGIN = 1, FUN = function(x) toString(x[x != 0]))
data
df <- data.frame(TV = c(1, 3, 2, 0), Internet = c(1, 0, 1, 4), Mobil = c(0, 1, 3, 2), Fastnet = c(1, 5, 3, 2))
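If the four columns are plain 0/1 indicators, as the recoding in the question suggests, the recoding and the concatenation can be combined by pasting the positions of the columns equal to 1. This is a sketch on made-up indicator data (df01 is hypothetical, not the asker's data):

```r
# Hypothetical 0/1 indicator data in the shape the question describes
df01 <- data.frame(TV = c(1, 0, 1), Internet = c(0, 0, 1),
                   Mobil = c(1, 0, 0), Fastnet = c(0, 1, 0))
cols <- c("TV", "Internet", "Mobil", "Fastnet")

# For each row, keep the positions (1 = TV, ..., 4 = Fastnet) of the
# columns equal to 1 and join them with commas
df01$tags <- apply(df01[cols] == 1, 1,
                   function(r) paste(which(r), collapse = ","))
df01
```

A row with no services gets an empty string rather than NA.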

R ave with multiple arguments / rank by group with weighting

I am using ave for ranking values within groups in a dataset in R. In the example 'data' is a data.frame with the cols raw, group and others, for example
data <- data.frame(raw = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), weight = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2))
The ranking works fine with
data$rank <- ave(data$raw, data$group, FUN = function(x) {rank(x)})
I would like to generalize this approach by applying weights. The weights are available as another col in the data.frame. The weighted ranking is a self defined function that needs both the raw scores and the weights vector. It is available via the cNORM package, code: https://github.com/WLenhard/cNORM/blob/master/R/utilities.R
Is it possible to use ave with multiple input variables, e. g.
data$rank <- ave(x = data$raw, data$group, y = data$weights, FUN = function(x, y) {weighted.rank(x, weights = y)})
so that both x and y are both the according subsets based on the grouping variable? I guess packages like dplyr have functions for that. Is there a way to do that with base R as well and without changing the order of the rows in the original data frame?
Many thanks!
Edit: The solution from Ronak Shah perfectly solves the problem. Thanks!
You can use by for a base R option.
library(cNORM)
data$rank <- unlist(by(data, data$group, function(x) weighted.rank(x$raw, x$weight)))
In dplyr you could do:
library(dplyr)
data %>% group_by(group) %>% mutate(rank = weighted.rank(raw, weight))
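Note that unlist(by(...)) returns the results group by group, so it lines up with the rows only when the data are already sorted by group. A base R sketch that keeps the original row order is to let ave() split the row indices and subset both vectors inside FUN. The function toy_weighted_rank is a hypothetical stand-in for cNORM's weighted.rank (it is not that function's algorithm), and the data are made up with interleaved groups:

```r
# Made-up data with the groups interleaved, to show that row order is kept
data <- data.frame(raw    = c(5, 1, 4, 2, 3, 6),
                   group  = c(1, 2, 1, 2, 1, 2),
                   weight = c(1, 2, 2, 1, 1, 2))

# Hypothetical stand-in for weighted.rank(): cumulative weight in sort order
toy_weighted_rank <- function(x, weights) {
  o <- order(x)
  r <- numeric(length(x))
  r[o] <- cumsum(weights[o])
  r
}

# ave() splits the row indices by group; FUN uses them to subset raw and weight
data$rank <- ave(seq_len(nrow(data)), data$group,
                 FUN = function(i) toy_weighted_rank(data$raw[i], data$weight[i]))
data
```

Replacing toy_weighted_rank with weighted.rank from cNORM should give the grouped weighted ranks in the original row order.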
