The following bit of code does a nice job of calculating and outputting the proportions of row totals represented by each element in the dataframe named data:
Output:
# A tibble: 4 x 5
# Rowwise:
To A B C Sum
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 0.111 0.333 0.556 1
2 B 0.167 0.333 0.5 1
3 C 0.25 0.312 0.438 1
4 Sum 0.189 0.324 0.486 1
Code generating the above:
library(dplyr)
library(tidyverse)
data <-
data.frame(
To = c("A","B","C"),
A = c(1,2,4),
B = c(3,4,5),
C = c(5,6,7)
)
data <- data %>%
replace(is.na(.), 0) %>%
bind_rows(summarise_all(., ~(if(is.numeric(.)) sum(.) else "Sum")))
data <- cbind(data, Sum = rowSums(data[,-1]))
data %>%
rowwise() %>%
mutate(across(A:Sum, ~ sum(.) / Sum))
The mutate(across(...)...) above references the starting column A as the calculation starting point, which is correct. However, in the larger App this is intended for, the column names are dynamic. So I'd simply like to start with the first numeric column in the dataframe instead of using its name. Below is my attempt to do this:
data %>%
rowwise() %>%
mutate(across(-1:Sum, ~ sum(.) / Sum))
It calculates correctly but gives me the warning shown below. Is there a better way to do this than resorting to suppressWarnings(), which is tempting?
Warning message:
Problem with mutate() input ..1.
i ..1 = across(-1:Sum, ~sum(.)/Sum).
i numerical expression has 4 elements: only the first used
i The warning occurred in row 1.
You can use a predicate function, wrapped in where()
data %>%
rowwise() %>%
mutate(across(where(is.numeric), ~ sum(.) / Sum))
You could also just de-select the first column, using either its name or its index, like these examples:
data %>%
rowwise() %>%
mutate(across(-1, ~ sum(.) / Sum))
data %>%
rowwise() %>%
mutate(across(-To, ~ sum(.) / Sum))
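If you also need the numeric column names themselves elsewhere in the app, a small variation (just a sketch along the same lines, not from the answer above) is to look them up first and select with all_of():
# Names of all numeric columns; here: "A", "B", "C", "Sum"
num_cols <- names(data)[sapply(data, is.numeric)]
data %>%
  rowwise() %>%
  mutate(across(all_of(num_cols), ~ sum(.) / Sum))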
I have a data frame containing scores of several sub-scales of the same test (columns: participant, session, group, total score, one column per sub-scale). I am trying to run assumption checks for a two-way mixed ANOVA for each sub-scale. For convenience, I would like to write one loop per assumption check, that gives me the output for all sub-scales. This worked well for checking outliers, running Box's M test and for generating the actual ANOVA output. However, I get an error when trying the same thing with Levene's test. See code and errors below:
subscales <- c("awareness", "clarity", "impulse", "goals", "nonacceptance",
"strategies") # these correspond to the column names in the df
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(scale ~ group) %>%
kable(caption = scale) %>% print()
}
Error in mutate(., data = map(.data$data, .f, ...)) :
Caused by error in model.frame.default():
! variable lengths differ (found for 'group')
How can I run Levene's test for all columns in my df without just repeating the same code over and over? I'm new to R, so maybe I'm approaching this in too Pythonic a way and should use something like lapply() instead?
Create the formula with reformulate: as scale will be a quoted string, the formula needs to be constructed with either reformulate or paste:
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(reformulate('group', response = scale)) %>%
kable(caption = scale) %>% print()
}
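For reference, reformulate() builds the formula from the strings, so on the first iteration the call above is equivalent to:
reformulate('group', response = 'awareness')
# awareness ~ group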
This may also be done with across:
library(dplyr)
library(stringr)
library(tidyr)
library(rstatix)
data(mtcars)
mtcars %>%
mutate(carb = factor(carb)) %>%
group_by(cyl) %>%
summarise(across(c(mpg, disp),
~ levene_test(cur_data(),
reformulate('carb', response = cur_column())) %>%
rename_with(~ str_c(cur_column(), .x), everything()) )) %>%
unpack(where(is.tibble))
Output:
# A tibble: 3 × 9
cyl mpgdf1 mpgdf2 mpgstatistic mpgp dispdf1 dispdf2 dispstatistic dispp
<dbl> <int> <int> <dbl> <dbl> <int> <int> <dbl> <dbl>
1 4 1 9 0.975 0.349 1 9 1.32e- 1 7.24e- 1
2 6 2 4 2.52 0.196 2 4 7.44e+29 7.23e-60
3 8 3 10 1.60 0.251 3 10 1.18e+ 1 1.27e- 3
Apologies if this is not a very good title for the question. I want to sum across certain columns in a data frame for each group, excluding one column for each of my groups. A simple example would be as follows:
df <- tibble(group_name = c("A", "B","C"), mean_A = c(1,2,3), mean_B = c(2,3,4), mean_C=c(3,4,5))
df %>% group_by(group_name) %>% mutate(m1 = sum(across(contains("mean"))))
This creates column m1, which is the sum across mean_A, mean_B, mean_C for each group. What I want to do is exclude mean_A for group A, mean_B for B and mean_C for C. The following does not work though (not surprisingly).
df %>% group_by(group_name) %>% mutate(m1 = sum(across(c(contains("mean") & !contains(group_name)))))
Do you have an idea how I could do this? My original data contains many more groups, so would be hard to do by hand.
Edit: I have tried the following, which solves it in a rudimentary fashion, but something (maybe grepl?) doesn't seem to work quite right here and I get the wrong result.
df %>% pivot_longer(!group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
Edit 2: Found out what's wrong with the above, and the version below works, but it still throws a lot of warnings, so I recommend following TarJae's response below.
df %>% pivot_longer(!group_name) %>% group_by(group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
Here is another option where you can just use group_name directly with the tidyselect helpers:
df %>%
rowwise() %>%
mutate(m1 = rowSums(select(across(starts_with("mean")), -ends_with(group_name)))) %>%
ungroup()
Output
group_name mean_A mean_B mean_C m1
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 2 3 5
2 B 2 3 4 6
3 C 3 4 5 7
How it works
The row-wise output of across is a 1-row tibble containing only the variables that start with "mean".
select then drops, from the across output, the variables that end with the value of group_name.
At this point you are left with a 1 x 2 tibble, which is then summed using rowSums.
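For instance, for the first row (group_name == "A") the intermediate objects look like this (a standalone illustration, not part of the answer's pipeline):
library(dplyr)
# What across(starts_with("mean")) returns for the row where group_name == "A"
row_A <- tibble(mean_A = 1, mean_B = 2, mean_C = 3)
select(row_A, -ends_with("A"))              # 1 x 2 tibble: mean_B = 2, mean_C = 3
rowSums(select(row_A, -ends_with("A")))     # 5, which becomes m1 for that row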
Here is one way we could do it:
We create a helper column to match column names.
We set the value of a mean column to zero if its column name matches the helper name.
Then we use transmute with select to calculate the row sums.
Finally we cbind column m1 to df:
library(dplyr)
df %>%
mutate(helper = paste0("mean_", group_name)) %>%
mutate(across(starts_with("mean"), ~ifelse(cur_column()==helper, 0, .))) %>%
transmute(m1 = select(., contains("mean")) %>%
rowSums()) %>%
cbind(df)
  m1 group_name mean_A mean_B mean_C
1  5          A      1      2      3
2  6          B      2      3      4
3  7          C      3      4      5
Background
I've got this dataset d:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
It's got 2 people (IDs) in it, and they each have some events.
The problem
I'm trying to get an average number (count) of events per person, along with a standard deviation for that average, all in one result (it can be a dataframe or not, doesn't matter).
In other words I'm looking for something like this:
| Mean | SD |
|------|------|
| 4.00 | 2.83 |
What I've tried
I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event))
# A tibble: 1 x 1
ratio
<dbl>
1 4
And here's the SD:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(sd = sd(event))
# A tibble: 1 x 1
sd
<dbl>
1 2.83
But when I try to pipe them together like so...
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event)) %>%
summarise(sd = sd(event))
... I get an error:
Error in `h()`:
! Problem with `summarise()` column `sd`.
i `sd = sd(event)`.
x object 'event' not found
Any insight?
You have to put the last two calls to summarise() in the same call. The only remaining columns after summarise() will be those you named and the grouping columns, so after your second summarise, the event column no longer exists.
library(dplyr)
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
d %>%
group_by(ID) %>%
# the next summarise will be within ID
summarise(event = length(event)) %>%
# this summarise is overall
summarise(sd = sd(event),
ratio = mean(event))
#> # A tibble: 1 × 2
#> sd ratio
#> <dbl> <dbl>
#> 1 2.83 4
The code is a bit confusing because you are renaming the event variable, and doing the first summarise() within groups and the second without grouping. This code would be a little easier to read and get the same result:
d %>%
count(ID) %>%
summarise(sd = sd(n),
ratio = mean(n))
Created on 2022-05-25 by the reprex package (v2.0.1)
I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
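For context, this is what rle() returns for the state column of ID L0 (run lengths and values, taken from DF above):
rle(DF$state[DF$ID == "L0"])
# Run Length Encoding
#   lengths: int [1:6] 3 3 2 1 1 1
#   values : int [1:6] 0 1 0 2 1 0
The two runs equal to 1 have lengths 3 and 1, so for L0 the expected result is count = 2 and avg_length = 2.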
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID    count avg_length
1 L0        2          2
2 L1        2          2
L0 is correct, but L1 has no instances of state == 1, so its average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent of summarize_at() for do()? Or is there another fix? Thanks
As it is a data.frame column, we may need to unnest afterwards
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do code, the column should be extracted not from the whole data, but from the data coming from the lhs, i.e. `.`. (Note that do is more or less deprecated, so it may be better to make use of summarise with unnest/unpack.)
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
I have larger data sets that for a 'first run' require basic frequencies by group (groups are flagged with 1 or 0 in columns). The issue is that some of the basic frequencies are needed for a very large number of variables (180 or so) that aren't named with a specific prefix and aren't positioned, say, in columns 2:100. They could be in columns 2:80, and then say 90:117, etc.
I get that the basic gist of doing this would be something like this:
mtcars %>% filter(gear == 4) %>% group_by(am) %>% summarise(n=n()) %>% mutate(perc = n / sum(n)*100)
One issue is that my 3-5 groups are all flagged in separate columns, so I would need to use filter(pop1 == 1), filter(pop2 == 1), etc. But is there a way for the group_by to go through 180 variables with an output for each variable? It's just a simple frequency of each variable's values or missing. So for the mtcars, it would be a frequency of just am, and then just vs. It would be fine for the output to be one long one with two columns.
Like this:
Variable Value n Perc
am 0 4 33.3
am 1 8 66.7
vs 0 2 16.7
vs 1 10 83.3
I recognize this might involve gather or pivot_longer, but I could not figure out a way to pivot to long only the 180 variables (out of, say, 200) that I need frequencies on.
Edit:
I ended up using this to select for many columns:
positions <- c(4:176,198)
And using select(positions) to circumvent typing in all of the variables.
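As a minimal sketch of that position-based selection (using mtcars instead of the real data, so the positions here are made up):
library(dplyr)
positions <- c(8, 9)                  # e.g. vs and am are the 8th and 9th columns of mtcars
cols <- names(mtcars)[positions]      # convert positions to column names
mtcars %>%
  filter(gear == 4) %>%
  select(all_of(cols))                # only the flagged columns remain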
We can use map from purrr to do the grouping separately for each variable and apply the same code as in the OP's post:
library(dplyr)
library(purrr)
map_dfr(c('am', 'vs'), ~
mtcars %>%
filter(gear == 4) %>%
group_by(Variable = .x, Value = !!rlang::sym(.x)) %>%
summarise(n = n()) %>%
mutate(perc = n/sum(n) * 100))
# A tibble: 4 x 4
# Groups: Variable [2]
# Variable Value n perc
# <chr> <dbl> <int> <dbl>
#1 am 0 4 33.3
#2 am 1 8 66.7
#3 vs 0 2 16.7
#4 vs 1 10 83.3
Or another option is to convert to 'long' format with pivot_longer:
library(tidyr)
mtcars %>%
  filter(gear == 4) %>%
  select(vs, am) %>%
  pivot_longer(everything()) %>%
  count(name, value) %>%
  group_by(name) %>%      # so the percentages are computed within each variable
  mutate(perc = n/sum(n) * 100)