mutate columns after subsetting by value - r

I have a large data frame and want to standardise multiple columns, where the mean and standard deviation are computed only from rows meeting a condition. Say I have the following example data:
set.seed(123)
df = data.frame("sample" = c(rep(1:2, each = 5)),
"status" = c(0,1),
"s1" = runif(10, -1, 1),
"s2" = runif(10, -5, 5),
"s3" = runif(10, -25, 25))
and want to standardise each of s1-s3 while conditioning the mean and standard deviation on status == 0. If I were to do this for, say, s1 only, I could do the following:
df = df %>%
  group_by(sample) %>%
  mutate(sd_s1 = (s1 - mean(s1[status == 0])) / sd(s1[status == 0]))
But my problem arises when I have to perform this operation on multiple columns. I tried writing a function to include with mutate_at:
standardize <- function(x) {
  return((x - mean(x[status == 0])) / sd(x[status == 0]))
}
df = df %>%
  group_by(sample) %>%
  mutate_at(vars(s1:s3), standardize)
This just creates NA values for s1-s3.
I have tried to use the answer provided in:
R - dplyr - mutate - use dynamic variable names, but cannot figure out how to do the subsetting.
Any help is greatly appreciated. Thanks!

We could just use the code below. Inside mutate_at() the expression is evaluated against the grouped data frame, so status resolves to the column; a standalone function like standardize() instead looks up status in its own (global) environment rather than in the data, which is why the earlier attempt fails:
df %>%
  group_by(sample) %>%
  mutate_at(vars(s1:s3), funs((. - mean(.[status == 0])) / sd(.[status == 0])))
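Note that funs() is deprecated as of dplyr 0.8.0; with dplyr 1.0.0 or later the same transformation can be written with across(), a minimal sketch:
df %>%
  group_by(sample) %>%
  mutate(across(s1:s3, ~ (.x - mean(.x[status == 0])) / sd(.x[status == 0])))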

Related

Median Split of one variable to create another variable

I am currently struggling with a median split in RStudio. I wish to create a new column in my data frame which is a median split of another; however, I do not know how this can be accomplished. Any and all help will be appreciated. This is the code I have previously run:
medianpcr <- median(honourswork$PCR.x)
highmedian <- filter(honourswork, PCR.x <= medianpcr)
lowmedian <- filter(honourswork, PCR.x > medianpcr)
When you post a question on SO, it's always a good idea to include an example dataframe so that the answerer doesn't have to create one themselves.
On to your question: if I understand you correctly, you can use mutate() and case_when() from the dplyr package:
# Load the dplyr library
library(dplyr)
# Create an example dataframe (seeded for reproducibility)
set.seed(1)
data <- data.frame(
  rowID = c(1:20),
  value = runif(20, 0, 50)
)
# Use case_when to mutate a new column 'category' with values based on
# the 'value' column
data2 <- data %>%
  dplyr::mutate(category =
    dplyr::case_when(
      value > median(value) ~ "Highmedian",
      value < median(value) ~ "Lowmedian",
      value == median(value) ~ "Median"
    )
  )
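A quick check of the result; note that with 20 rows the median is the mean of the two middle values, so the value == median(value) branch will usually never match:
table(data2$category)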
More about case_when() in the dplyr documentation.
Hope this helps!
Let's first create some data:
set.seed(123)
honourswork <- data.frame(PCR.x = rnorm(100))
In dplyr, you might do:
library(tidyverse)
honourswork %>%
  mutate(medianpcr = median(PCR.x)) %>%
  mutate(highmedian = ifelse(PCR.x > medianpcr, 1, 0)) -> honourswork
honourswork %>%
  mutate(medianpcr = median(PCR.x)) %>%
  mutate(lowmedian = ifelse(PCR.x <= medianpcr, 1, 0)) -> honourswork
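The two pipelines can also be collapsed into a single mutate() that computes the median once:
honourswork <- honourswork %>%
  mutate(medianpcr = median(PCR.x),
         highmedian = ifelse(PCR.x > medianpcr, 1, 0),
         lowmedian = ifelse(PCR.x <= medianpcr, 1, 0))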
Equivalently in base R:
honourswork$highmedian <- 0
honourswork$highmedian[honourswork$PCR.x > median(honourswork$PCR.x)] <- 1
honourswork$lowmedian <- 0
honourswork$lowmedian[honourswork$PCR.x <= median(honourswork$PCR.x)] <- 1
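Since > and <= partition the data, every row should land in exactly one bucket; a quick sanity check:
stopifnot(all(honourswork$highmedian + honourswork$lowmedian == 1))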

Is there a quicker way to rowwise sum across selected columns in a data frame?

I'm using dplyr to sum rowwise across selected columns in a data frame. Because I'm using a character vector to specify the columns, it seems I need to use rowwise(), which is very calculation-heavy and takes ages on my big data frame (>15 min). Does anyone know of a quicker way?
x <- data.frame("channel_1" = seq(1, 10),
"channel_2" = seq(1, 10),
"channel_3" = seq(1, 10),
"channel_4" = seq(1, 10),
"channel_5" = seq(1, 10))
ladder.channel <- "channel_4"
bleed.channels <- setdiff(c("channel_1", "channel_2", "channel_3", "channel_4", "channel_5"), ladder.channel)
y <- x %>%
  mutate(correction = -pmax(!!!syms(bleed.channels))) %>%
  rowwise() %>%
  mutate(channel.corr = sum(across(all_of(c(ladder.channel, "correction")))))
Does this work?
x %>%
  mutate(
    correction = -pmax(!!!syms(bleed.channels)),
    channel.corr = !!sym(ladder.channel) + correction
  )
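If more than two columns had to be summed, rowSums() over a column selection stays fully vectorized and avoids rowwise() entirely; a sketch on the same data:
y <- x %>%
  mutate(correction = -pmax(!!!syms(bleed.channels)),
         channel.corr = rowSums(across(all_of(c(ladder.channel, "correction")))))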

Apply a function within list-column to another column (compare to reference ecdf by group)

I have a dataset that is organized by groups (site) and has baseline observations (trt == 0) and observations collected from a modified environment (trt == 1; it's not experimental data, which is why I'm doing this). For the trt == 1 observations, I would like to calculate the quantile of each observation within the baseline ecdf for that group (i.e. site). My instinct was to use map2_dbl(), but the ecdf to compare against is within the list-column itself, not external to the data. I'm struggling to get the correct syntax (in the R tidyverse).
library(tidyverse)
df <- tibble(site = rep(letters[1:4], length.out = 2000),
             trt = rep(c(0, 1), each = 1000),
             value = c(rnorm(n = 1000), rnorm(n = 1000, mean = 0.1)))
# calculate ecdf for baseline:
baseline <- df %>%
  filter(trt == 0) %>%
  group_by(site) %>%
  summarize(ecdf0 = list(ecdf(value)))
# compare each trt = 1 observation to ecdf for that site:
trtQuantile <- df %>%
  filter(trt == 1) %>%
  inner_join(baseline)
# what would be next line is where I'm struggling to get the correct map syntax
head(trtQuantile)
# for the first row I am aiming for the result given by:
trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])
Any advice from the purrr masters is appreciated! Thanks.
You can use map2_dbl():
library(dplyr)
library(purrr)
trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~.x(.y)))
Or mapply() in base R:
trtQuantile$out <- mapply(function(x, y) x(y), trtQuantile$ecdf0, trtQuantile$value)
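Either way, the first element should reproduce the manual call from the question; a quick check:
out <- trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~ .x(.y)))
out$out[1] == trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])  # should be TRUE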

compute residuals within groups in dplyr

I am trying to compute within-group residuals in ANOVA using R. My data frame is:
df <- data.frame(V1 = c(rep("group1", 5), rep("group2", 7)),
                 value = c(6.6, 4.6, 8.5, 6.1, 8.4,
                           10.7, 10.1, 10.9, 10.7, 15.6, 13.8, 15.9))
I want a simple way, using dplyr or otherwise, to combine the following two lines of code:
M <- df %>% group_by(V1) %>% summarise(avg = mean(value))
df$res <- ifelse(test = df$V1 == "group1", yes = (df$value - M$avg[1])^2,
                 no = (df$value - M$avg[2])^2)
I tried to use do() in dplyr but had no success. I was wondering if there is a neat way of doing this.
If you need to keep using the original value column along with avg, then use mutate rather than summarize so that the means are just placed in a new column next to the original values:
df %>%
  group_by(V1) %>%
  mutate(avg = mean(value),
         res = (value - avg)^2)
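For reference, a base R one-liner with ave() gives the same residuals without the intermediate summary table:
df$res <- (df$value - ave(df$value, df$V1, FUN = mean))^2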

Cut a variable differently based on another grouping variable

Example: I have a dataset of heights by gender.
I'd like to split the heights into low and high, where the cut points are defined as the mean minus 2 SD within each gender.
example dataset:
set.seed(8)
df = data.frame(sex = c(rep("M", 100), rep("F", 100)),
                ht = c(rnorm(100, mean = 1.7, sd = .17),
                       rnorm(100, mean = 1.6, sd = .16)))
I'd like to do this in a single line of vectorized code; I'm fairly sure that is possible, but I do not know how to write it. I imagine there may be a way to use cut(), apply(), and/or dplyr to achieve this.
How about this using cut from base R:
sapply(c("F", "M"), function(s){
dfF <- df[df$sex==s,] # filter out per gender
cut(dfF$ht, breaks = c(0, mean(dfF$ht)-2*sd(dfF$ht), Inf), labels = c("low", "high"))
})
# dfF$ht heights per gender
# mean(dfF$ht)-2*sd(dfF$ht) cut point
In the code below, I created two new data frames. Both were created by grouping on the sex variable and filtering the different ranges of ht.
library(dplyr)
df_low <- df %>% group_by(sex) %>% filter(ht<(mean(ht)-2*sd(ht)))
df_high<- df %>% group_by(sex) %>% filter(ht>(mean(ht)+2*sd(ht)))
Just discovered the following solution using base R:
df$ht_grp <- ave(x = df$ht, df$sex,
                 FUN = function(x)
                   cut(x, breaks = c(0, (mean(x, na.rm = TRUE) - 2*sd(x, na.rm = TRUE)), Inf)))
This works because I know that 0 and Inf are reasonable bounds, but I could also use min(x) and max(x) as the lower and upper bounds. Note that because ave() assigns the result back into the numeric ht vector, ht_grp holds the numeric interval codes (1 for low, 2 for high, NA outside the bounds) rather than a labelled factor; wrap it in factor() with labels if an actual factor is needed.
My prior solution:
I came up with the following two-step process which is not so bad:
df = merge(df,
           setNames(aggregate(ht ~ sex, df, FUN = function(x) mean(x) - 2*sd(x)),
                    c("sex", "ht_cutoff")),
           by = "sex")
df$ht_is_low = ifelse(df$ht <= df$ht_cutoff, 1, 0)
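For completeness, the same grouped cut can also be written with dplyr's grouped mutate(); a sketch assuming the df from the question:
library(dplyr)
df <- df %>%
  group_by(sex) %>%
  mutate(ht_grp = cut(ht, breaks = c(0, mean(ht) - 2*sd(ht), Inf),
                      labels = c("low", "high"))) %>%
  ungroup()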
