Cut a variable differently based on another grouping variable - r

Example: I have a dataset of heights by gender.
I'd like to split the heights into low and high where the cut points are defined as the mean - 2sd within each gender.
example dataset:
set.seed(8)
df = data.frame(sex = c(rep("M",100), rep("F",100)),
ht = c(rnorm(100, mean=1.7, sd=.17), rnorm(100, mean=1.6, sd=.16)))
I'd like to do this in a single line of vectorized code because I'm fairly sure it is possible; however, I do not know how to write it. I imagine that there may be a way to use cut(), apply(), and/or dplyr to achieve this.

How about this using cut from base R:
sapply(c("F", "M"), function(s){
dfS <- df[df$sex==s,] # keep only the rows for this sex
cut(dfS$ht, breaks = c(0, mean(dfS$ht)-2*sd(dfS$ht), Inf), labels = c("low", "high"))
})
# dfS$ht: heights for one sex
# mean(dfS$ht)-2*sd(dfS$ht): cut point for that sex

In the code below, I create two new data frames. Both are built by grouping on sex and filtering on a range of ht: df_low keeps heights below mean - 2sd and df_high keeps heights above mean + 2sd within each sex.
library(dplyr)
df_low <- df %>% group_by(sex) %>% filter(ht<(mean(ht)-2*sd(ht)))
df_high<- df %>% group_by(sex) %>% filter(ht>(mean(ht)+2*sd(ht)))
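If a single labelled column is preferred over two filtered data frames, a grouped mutate() with cut() is one option. This is only a sketch: the column name ht_grp and the use of -Inf as the lower bound are assumptions, not part of the answer above.

```r
library(dplyr)

# Example data from the question
set.seed(8)
df <- data.frame(sex = c(rep("M", 100), rep("F", 100)),
                 ht  = c(rnorm(100, mean = 1.7, sd = 0.17),
                         rnorm(100, mean = 1.6, sd = 0.16)))

# Inside a grouped mutate(), mean(ht) and sd(ht) are computed per sex,
# so each group gets its own cut point at mean - 2 * sd.
df_grp <- df %>%
  group_by(sex) %>%
  mutate(ht_grp = cut(ht,
                      breaks = c(-Inf, mean(ht) - 2 * sd(ht), Inf),
                      labels = c("low", "high"))) %>%
  ungroup()

# Count of low/high heights within each sex
table(df_grp$sex, df_grp$ht_grp)
```

With the default right = TRUE, a height exactly equal to the cut point is labelled "low".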

Just discovered the following solution using base R:
df$ht_grp <- ave(x = df$ht, df$sex,
FUN = function(x)
cut(x, breaks = c(0, (mean(x, na.rm=TRUE) - 2*sd(x, na.rm=TRUE)), Inf)))
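This works because I know that 0 and Inf are reasonable bounds, but I could also use min(x) and max(x) as my lower and upper bounds (with include.lowest = TRUE so that the minimum is not dropped). Because ave() writes the result back into the numeric vector ht, ht_grp holds the integer codes of the two intervals (1 = low, 2 = high), or NA for values outside the breaks, rather than a labelled factor.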
My prior solution:
I came up with the following two-step process which is not so bad:
df = merge(df,
setNames( aggregate(ht ~ sex, df, FUN = function(x) mean(x)-2*sd(x)),
c("sex", "ht_cutoff")),
by = "sex")
df$ht_is_low = ifelse(df$ht <= df$ht_cutoff, 1, 0)
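The merge step can probably be avoided, since ave() already returns one value per row when FUN returns a single number per group. A minimal sketch, assuming the same example data and reusing the ht_is_low name from above (the intermediate cutoff vector is an illustrative name):

```r
# Example data from the question
set.seed(8)
df <- data.frame(sex = c(rep("M", 100), rep("F", 100)),
                 ht  = c(rnorm(100, mean = 1.7, sd = 0.17),
                         rnorm(100, mean = 1.6, sd = 0.16)))

# ave() recycles the per-group cutoff to every row of that group
cutoff <- ave(df$ht, df$sex, FUN = function(x) mean(x) - 2 * sd(x))

# 1 if the height is at or below its group's cutoff, 0 otherwise
df$ht_is_low <- as.integer(df$ht <= cutoff)
```

This keeps the original row order, which merge() does not guarantee.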

Related

group data by year and filter for all year

I have a df with trade data from 2006 to 2018 ('all_trade', 345344 rows).
I want to filter out the first quantile of trade value of each year (group by year, then apply a filter).
Tried
library(dplyr)
trade_3q <- all_trade %>% group_by(all_trade$Year) %>% filter(all_trade$log_trade > quantile(all_trade$log_trade, 0.25)
but get "Error: Problem with filter() input ..1.
x Input ..1 must be of size 25874 or 1, not size 345344."
What am I getting wrong?
Thank you.
To make it easier for others to answer, it helps a lot if you provide some sample data. Based on your description, I can only guess what your data looks like. You can try this:
library(dplyr)
# simulate a data set
set.seed(1234)
tb <- tibble(year = sample(2006:2018, 50, replace = TRUE),
trade = runif(50, 50000, 1000000)) %>%
mutate(log_trade = log(trade))
# compute the quantile
tb2 <- tb %>%
group_by(year) %>%
summarise(q25 = quantile(log_trade, 0.25))
# join back and filter
tb %>% left_join(tb2, by = "year") %>%
filter(log_trade > q25)
quantile() returns a named vector, so the result also carries the label "25%".
You could try removing the name by wrapping quantile() in as.numeric() or unname().
It also looks like you need another closing parenthesis at the end.
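The size mismatch in the error message comes from writing all_trade$log_trade inside filter(): that expression always refers to the full 345344-row column, while the grouped filter() expects something the size of the current year group. Using bare column names should avoid this. A sketch on simulated data, since the real all_trade is not available (the data below is a hypothetical stand-in):

```r
library(dplyr)

# Hypothetical stand-in for all_trade
set.seed(1234)
all_trade <- tibble(Year = sample(2006:2018, 500, replace = TRUE),
                    log_trade = log(runif(500, 50000, 1000000)))

# Bare column names are evaluated within each Year group, so
# quantile() sees only that year's values.
trade_3q <- all_trade %>%
  group_by(Year) %>%
  filter(log_trade > quantile(log_trade, 0.25)) %>%
  ungroup()
```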

Apply a function within list-column to another column (compare to reference ecdf by group)

I have a dataset that is organized by groups (site) and has baseline observations (trt == 0) and observations collected from a modified environment (trt == 1, although it's not experimental data which is why I'm doing this). For the trt == 1 observations, I would like to calculate the quantile of each observation within the baseline ecdf for that group (i.e. site). My instinct was to use map2_dbl() but the ecdf to compare to is within the list-column itself, not external to the data. I'm struggling to get the correct syntax (in the R tidyverse).
df <- tibble(site = rep(letters[1:4], length.out = 2000),
trt = rep(c(0, 1), each = 1000),
value = c(rnorm(n = 1000), rnorm(.1, n = 1000)))
# calculate ecdf for baseline:
baseline <- df %>%
filter(trt == 0) %>%
group_by(site) %>%
summarize(ecdf0 = list(ecdf(value)))
# compare each trt = 1 observation to ecdf for that site:
trtQuantile <- df %>%
filter(trt == 1) %>%
inner_join(baseline)
# what would be next line is where I'm struggling to get the correct map syntax
head(trtQuantile)
# for the first row I am aiming for the result given by:
trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])
Any advice from the purrr masters is appreciated! Thanks.
You can use map2_dbl:
library(dplyr)
library(purrr)
trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~.x(.y)))
Or mapply in base R:
trtQuantile$out <- mapply(function(x, y) x(y),trtQuantile$ecdf0,trtQuantile$value)
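If the list-column is not needed for anything else, the same quantiles can likely be computed in one grouped step, building the baseline ecdf inside mutate(). This is a sketch under the assumption that baseline and treated rows sit in the same data frame, as in the question; the column name q and the seed are illustrative.

```r
library(dplyr)

set.seed(1)
df <- tibble(site = rep(letters[1:4], length.out = 2000),
             trt = rep(c(0, 1), each = 1000),
             value = c(rnorm(n = 1000), rnorm(n = 1000, mean = 0.1)))

# Within each site, build the ecdf from the baseline rows only and
# evaluate it at every value; then keep the treated rows.
out <- df %>%
  group_by(site) %>%
  mutate(q = ecdf(value[trt == 0])(value)) %>%
  ungroup() %>%
  filter(trt == 1)
```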

How to create bins in R

I have a data frame named cst with columns country, ID, and age. I want to make bins for age (divide all IDs into deciles or quartiles) for each country separately. I tried this:
cut(cst[!is.na(cst$age), "age"], quantile(cst["age"], probs = seq(0,1,0.1), na.rm = T))
However, this bins the whole data frame at once, and I need bins for each country separately.
Could you help me?
I'd try a dplyr solution, which would look something like this:
library(dplyr)
cst2 <- cst %>%
group_by(country) %>%
mutate(
bin = cut(age, quantile(age, probs=seq(0,1,0.1), na.rm=TRUE))
) %>%
ungroup()
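One detail worth checking: when the breaks come from quantile(), the smallest age in each country equals the lowest break, and cut() leaves it as NA unless include.lowest = TRUE is set. A sketch on made-up data (this cst is a hypothetical stand-in, and labels = FALSE is an assumption chosen so that bin is simply the decile number 1 to 10):

```r
library(dplyr)

# Hypothetical stand-in for cst
set.seed(42)
cst <- data.frame(country = rep(c("A", "B"), each = 50),
                  ID = 1:100,
                  age = runif(100, 18, 90))

cst2 <- cst %>%
  group_by(country) %>%
  mutate(
    # include.lowest = TRUE keeps each country's minimum age in bin 1
    bin = cut(age,
              quantile(age, probs = seq(0, 1, 0.1), na.rm = TRUE),
              include.lowest = TRUE, labels = FALSE)
  ) %>%
  ungroup()
```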
All you need to do is take a subset before using cut. This version does not need dplyr, and it stores the bins for each country in a list, because a for loop does not print or keep the value of cut() on its own.
bins <- list()
for (ctry in unique(as.character(cst$country))) {
sub <- subset(cst, country == ctry & !is.na(age))
bins[[ctry]] <- cut(sub$age, quantile(sub$age, probs = seq(0,1,0.1)))
}

mutate columns after subsetting by value

I have a large dataframe and want to standardise multiple columns while conditioning the mean and the standard deviation on values. Say I have the following example data:
set.seed(123)
df = data.frame("sample" = c(rep(1:2, each = 5)),
"status" = c(0,1),
"s1" = runif(10, -1, 1),
"s2" = runif(10, -5, 5),
"s3" = runif(10, -25, 25))
and want to standardise each of s1-s3 using the mean and standard deviation of the rows where status==0. If I were to do this for s1 only, I could do the following:
df = df %>% group_by(sample) %>%
mutate(sd_s1 = (s1 - mean(s1[status==0])) / sd(s1[status==0]))
But my problem arises when I have to perform this operation on multiple columns. I tried writing a function to include with mutate_at:
standardize <- function(x) {
return((x - mean(x[status==0]))/sd(x[status==0]))
}
df = df %>% group_by(sample) %>%
mutate_at(vars(s1:s3), standardize)
This just creates NA values for s1-s3.
I have tried to use the answer provided in:
R - dplyr - mutate - use dynamic variable names, but cannot figure out how to do the subsetting.
Any help is greatly appreciated. Thanks!
We could just use
df %>%
group_by(sample) %>%
mutate_at(vars(s1:s3), funs((.- mean(.[status == 0]))/sd(.[status == 0])))
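funs() is deprecated in recent dplyr versions, so on dplyr 1.0.0 or later the same idea can be written with across(). A sketch using the example data from the question (df_std is an illustrative name):

```r
library(dplyr)

set.seed(123)
df <- data.frame(sample = rep(1:2, each = 5),
                 status = c(0, 1),
                 s1 = runif(10, -1, 1),
                 s2 = runif(10, -5, 5),
                 s3 = runif(10, -25, 25))

# .x is the current column within the current sample group;
# status is looked up in the same group, so the subsetting lines up.
df_std <- df %>%
  group_by(sample) %>%
  mutate(across(s1:s3,
                ~ (.x - mean(.x[status == 0])) / sd(.x[status == 0]))) %>%
  ungroup()
```

After this, the status == 0 rows of each column have mean 0 and standard deviation 1 within each sample.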

Find the variance over a sliding window in dplyr

I want to find the variance of the previous three values in a group.
# make some data with categories a and b
library(dplyr)
df = expand.grid(
a = LETTERS[1:3],
index = 1:10
)
# add a variable that changes within each group
set.seed(9999)
df$x = runif(nrow(df))
# get the variance of a subset of x
varSubset = function(x, index, subsetSize) {
subset = (index-subsetSize+1):index
ifelse(subset[1]<1, -1, var(x[subset]))
}
df %>%
# group the data
group_by(a) %>%
# get the variance of the 3 most recent values
mutate(var3 = varSubset(x, index, 3))
It's calling the varSubset with both x and index as vectors.
I can't figure out how to treat x as a vector (of only the group) and index as a single value. I've tried rowwise(), but then I effectively lose grouping.
Why not use rollapply from zoo?
library(dplyr)
library(zoo)
df %>% group_by(a) %>%
mutate(var = rollapply(x, 3, var, fill = NA, align = "right"))
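If adding zoo is not an option, the original helper can be reworked so that it takes the whole group vector and loops over positions itself, which is what a grouped mutate() expects. A sketch (roll_var is a hypothetical helper name; like the rollapply call above, each window includes the current value):

```r
library(dplyr)

df <- expand.grid(a = LETTERS[1:3], index = 1:10)
set.seed(9999)
df$x <- runif(nrow(df))

# Variance of the `size` most recent values at each position of x;
# NA where fewer than `size` values are available.
roll_var <- function(x, size) {
  vapply(seq_along(x), function(i) {
    if (i < size) NA_real_ else var(x[(i - size + 1):i])
  }, numeric(1))
}

out <- df %>%
  group_by(a) %>%
  mutate(var3 = roll_var(x, 3)) %>%
  ungroup()
```

This relies on the rows of each group already being ordered by index, which holds for the expand.grid() data above.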
