Quantiles by factor levels in R

I have a data frame and I'm trying to create a new variable in the data frame that has the quantiles of a continuous variable var1, for each level of a factor strata.
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
# function to get quantiles
qfun <- function(x, q = 5) {
quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
include.lowest = TRUE, labels = 1:q)
quantile
}
I tried two methods, neither of which produces a usable result. First, I tried using aggregate to apply qfun to each level of strata:
qdat <- with(dat, aggregate(var1, list(strata), FUN = qfun))
This returns the quantiles by factor level, but the output is hard to coerce back into a data frame (e.g., using unlist does not line the new variable values up with the correct rows in the data frame).
A second approach was to do this in steps:
tmp1 <- with(dat, split(var1, strata))
tmp2 <- lapply(tmp1, qfun)
tmp3 <- unlist(tmp2)
dat$quintiles <- tmp3
Again, this calculates the quantiles correctly for each factor level, but obviously, as with aggregate they aren't in the correct order in the data frame. We can check this by putting the quantile "bins" into the data frame.
# get quantile bins
qfun2 <- function(x, q = 5) {
quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
include.lowest = TRUE)
quantile
}
tmp11 <- with(dat, split(var1, strata))
tmp22 <- lapply(tmp11, qfun2)
tmp33 <- unlist(tmp22)
dat$quintiles2 <- tmp33
Many of the values of var1 fall outside the bins in quintiles2. I feel like I'm missing something simple. Any suggestions would be greatly appreciated.

I think your issue is that you don't really want to aggregate; you want ave (or data.table or plyr):
qdat <- transform(dat, qq = ave(var1, strata, FUN = qfun))
#using plyr
library(plyr)
qdat <- ddply(dat, .(strata), mutate, qq = qfun(var1))
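#using data.table (my preference)
library(data.table)
dat <- as.data.table(dat) # := only works on a data.table, not a plain data.frame
dat[, qq := qfun(var1), by = strata]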
Aggregate usually implies returning an object that is smaller than the original. (In this case you were getting a data.frame where x was a list with one element for each stratum.)
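For what it's worth, the split/lapply/unlist attempt from the question can also be rescued in base R with unsplit(), which puts each group's results back in the original row order. This is only a minimal sketch: it assumes a balanced toy data set (so no stratum is too small for cut()), and qfun_int is a hypothetical variant of qfun that returns integer bin codes instead of a factor.

```r
set.seed(472)
# Toy data: 10 rows per stratum, shuffled (an assumption to keep the breaks unique)
dat <- data.frame(
  var1   = rnorm(50, 10, 3)^2,
  strata = factor(sample(rep(LETTERS[1:5], each = 10)))
)

# Hypothetical helper: same binning as qfun, but returns plain integers
qfun_int <- function(x, q = 5) {
  as.integer(cut(x, breaks = quantile(x, probs = 0:q / q),
                 include.lowest = TRUE, labels = 1:q))
}

# split -> lapply -> unsplit restores the original row order,
# which unlist() does not
pieces <- lapply(split(dat$var1, dat$strata), qfun_int)
dat$q_unsplit <- unsplit(pieces, dat$strata)

# ave() does the same thing in one step
dat$q_ave <- ave(dat$var1, dat$strata, FUN = qfun_int)
```

Both columns should agree row by row; the difference from unlist() is that unlist() returns the values grouped by stratum rather than in data frame order.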

Use ave on your dat data frame. Full example with your simulated data and qfun function:
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
# function to get quantiles
qfun <- function(x, q = 5) {
quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
include.lowest = TRUE, labels = 1:q)
quantile
}
And my addition...
dat$q <- ave(dat$var1,dat$strata,FUN=qfun)

Related

Call the same function by varying parameters with tidyverse and apply family functions

This is my first question here on Stack Overflow, so I apologize in advance if I won't be clear enough.
I searched for similar questions, but I didn't find anything (I probably didn't search enough!)
Given a data.frame (or a data.table or a tibble) consisting of four sets of points divided into two groups:
df_points <- tibble(
x = c(rnorm(10000, mean = 0), rnorm(10000, mean = 1),
rnorm(10000, mean = 0), rnorm(10000, mean = 4)),
dist = c(rep("d1", 10000), rep("d2", 10000),
rep("d1", 10000), rep("d2", 10000)),
overlap = c(rep("o1", 20000), rep("o2", 20000))
)
my goal is to apply the density function, using different values of bw, from and to for the "o1" and "o2" groups.
I would like to solve this problem in an elegant way with both a tidyverse and a R-base-data.table approach (apply family functions).
For now I have managed to do this via tidyverse:
I define a common_dens function which applies density and returns a tibble of the x and y of the distribution
common_dens <- function(df, Bw, lower, upper) {
d <- density(df, n = 2048, bw = Bw, from = lower, to = upper)
df_d <- tibble(x = d$x, y = d$y)
return(df_d)
}
assuming that the values of upper, lower and bws are the following:
lower <- c(-5.050, -4.705)
upper <- c(6.445, 9.070)
bws <- c(0.1427, 0.1417)
I get the desired dataframe through the following for loop:
df_dens <- NULL
for (i in 1:2) {
df_t <- df_points %>%
filter(overlap == unique(df_points$overlap)[[i]]) %>%
group_by(dist, overlap) %>%
summarise(common_dens(x, bws[i], lower[i], upper[i]))
df_dens <- rbind(df_dens, df_t)
}
Is there any way to remove the for loop?
Is there a way to do the same with apply family functions and data.table?
Thanks for your help!
The purrr::pmap function lets you call a function once per set of parameters, for an arbitrary number of parameters. The pmap_dfr variant returns a data.frame bound by row.
Consider your parameters provided as a data.frame:
params <- data.frame(group = c("o1","o2"), bws, lower, upper)
group bws lower upper
1 o1 0.1427 -5.050 6.445
2 o2 0.1417 -4.705 9.070
The parameters are automatically assigned to the special symbols ..1, ..2, and so on:
library(purrr)
pmap_dfr(params, ~ df_points %>%
filter(overlap == ..1) %>%
group_by(dist, overlap) %>%
summarise(common_dens(x,Bw = ..2, lower = ..3, upper = ..4)))
It can get confusing which ..# is which, so a trick is to use with(list(...), ):
pmap_dfr(params, ~ with(list(...), df_points %>%
filter(overlap == group) %>%
group_by(dist, overlap) %>%
summarise(common_dens(x,Bw = bws, lower = lower, upper = upper))))
You could of course do the same with base R apply:
apply(params, 1, function(y){ df_points %>%
filter(overlap == y[1]) %>%
group_by(dist, overlap) %>%
summarise(common_dens(x, Bw = as.numeric(y[2]), lower = as.numeric(y[3]),
upper = as.numeric(y[4])))}) %>%
bind_rows()
However, because apply converts the types, you'll need to use as.numeric.
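Since the question also asked for an apply-family version without the tidyverse, here is a rough base R sketch using Map, which iterates over the parameter columns in parallel and keeps their types (so no as.numeric is needed). The smaller sample size and the helper name dens_one are my own assumptions for illustration; the bw, from and to values are the ones from the question.

```r
set.seed(1)
# Smaller stand-in for df_points (200 points per set instead of 10000)
df_points <- data.frame(
  x       = c(rnorm(200, 0), rnorm(200, 1), rnorm(200, 0), rnorm(200, 4)),
  dist    = rep(c("d1", "d2", "d1", "d2"), each = 200),
  overlap = rep(c("o1", "o2"), each = 400),
  stringsAsFactors = FALSE
)
params <- data.frame(
  group = c("o1", "o2"),
  bws   = c(0.1427, 0.1417),
  lower = c(-5.050, -4.705),
  upper = c(6.445, 9.070),
  stringsAsFactors = FALSE
)

# Hypothetical helper: densities for every dist within one overlap group
dens_one <- function(group, bw, lower, upper) {
  sub <- df_points[df_points$overlap == group, ]
  per_dist <- lapply(split(sub$x, sub$dist), function(x) {
    density(x, n = 2048, bw = bw, from = lower, to = upper)
  })
  do.call(rbind, lapply(names(per_dist), function(nm) {
    data.frame(dist = nm, overlap = group,
               x = per_dist[[nm]]$x, y = per_dist[[nm]]$y)
  }))
}

# One call per parameter row, then stack the pieces
df_dens <- do.call(rbind, Map(dens_one, params$group, params$bws,
                              params$lower, params$upper))
rownames(df_dens) <- NULL
```

The result has one row per grid point, per dist, per overlap group, which matches the shape of the for-loop output in the question.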

tsfknn:: extracting autoplot predictions into a table

In the tsfknn package, there is the autoplot function. It plots a prediction and the nearest neighbors used in the prediction. The predicted values are in red, and the values from data are in black. Is there a way to extract the red points into a dataframe?
Example:
# prepping data
data <- as.data.frame(seq(as.Date('2017/04/01'), as.Date('2019/11/01'), by="day"))
data <- rlang::set_names(data, "Date")
data[, "Year"] <- format(data[,"Date"], "%Y")
data[, "Month"] <- format(data[,"Date"], "%m")
data[, "Quantity"] <- sample(100, size = nrow(data), replace = TRUE)
monthly <- dplyr::group_by(data, Year, Month)
monthly <- dplyr::summarise(monthly, sum(Quantity))
monthly <- rlang::set_names(monthly, c("Year", "Month", "Quantity"))
# ts the data
ts.example <- ts(data = monthly$Quantity, start = c(2017,4), frequency = 12)
# quarterly ts
quarterly <- aggregate(ts.example, nfrequency = 4)/3
# knn part
library(tsfknn)
quarterly.knn <- knn_forecasting(quarterly, h = 4, lags = 1:4, k = 3, msas = "MIMO")
# plotting the predictions
autoplot(quarterly.knn, highlight = "none", faceting = TRUE)
I would like to know what exactly are those prediction values.
Thank you for your time!
library(tsfknn)
f <- knn_forecasting(AirPassengers, h = 10) # the function returns a list
print(f$prediction) # the prediction component contains the prediction
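To get those values into a data frame, something along these lines should work, assuming the prediction component is an ordinary ts object (I believe it is when the input is a ts, but check with class(f$prediction)). The sketch below uses a stand-in series with made-up numbers rather than real tsfknn output, so it runs without the package:

```r
# Stand-in for f$prediction: a quarterly ts with invented values (assumption)
pred <- ts(c(4510, 4620, 4480, 4705), start = c(2019, 4), frequency = 4)

# time() gives the fractional year, cycle() the position within the year
pred_df <- data.frame(
  time       = as.numeric(time(pred)),
  period     = as.integer(cycle(pred)),
  prediction = as.numeric(pred)
)
pred_df
```

With a real forecast you would replace pred with f$prediction (or quarterly.knn$prediction in the question's code).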

Stacking lapply results

I am using the following code to generate data, and I am estimating regression models across a list of variables (covar1 and covar2). I have also created confidence intervals for the coefficients and merged them together.
I have been examining all sorts of examples here and on other sites, but I can't seem to accomplish what I want. I want to stack the results for each covar into a single data frame, labeling each cluster of results by the covar it is attributable to (i.e., "covar1" and "covar2"). Here is the code for generating data and results using lapply:
##creating a fake dataset (N=1000, 500 at treated, 500 at control group)
#outcome variable
outcome <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 70, sd = 10))
#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = TRUE)
##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]
#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))
#create covariates
set.seed(123)
covar1 <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 50, sd = 20))
covar2 <- c(rnorm(500, mean = 10, sd = 20), rnorm(500, mean = 10, sd = 30))
data <- data.frame(cbind(outcome, running.var, treat.ind, covar1, covar2))
data$treat.ind <- as.factor(data$treat.ind)
#Bundle the covariates names together
covars <- c("covar1", "covar2")
#loop over them using a convenient feature of the "as.formula" function
models <- lapply(covars, function(x){
regres <- lm(as.formula(paste(x," ~ running.var + treat.ind",sep = "")), data = data)
ci <-confint(regres, level=0.95)
regres_ci <- cbind(summary(regres)$coefficient, ci)
})
names(models) <- covars
print(models)
Any nudge in the right direction, or link to a post I just haven't come across, is greatly appreciated.
You can use do.call, where the second argument is a list:
do.call(rbind, models)
I made a (possible) improvement to your lapply function. This way you can save the estimated parameters and the variable names in a data.frame:
models <- lapply(covars, function(x){
regres <- lm(as.formula(paste(x," ~ running.var + treat.ind",sep = "")), data = data)
ci <-confint(regres, level=0.95)
regres_ci <- data.frame(covar=x,param=rownames(summary(regres)$coefficient),
summary(regres)$coefficient, ci)
})
do.call(rbind,models)
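If you would rather keep the original lapply (which returns a named list of coefficient matrices), the labels can also be added after stacking. A small sketch, assuming a reduced fake data set of my own (n = 200) and the object names coefs and stacked, which are not from the question:

```r
set.seed(123)
n <- 200
data <- data.frame(running.var = runif(n, -1, 1))
data$treat.ind <- factor(as.integer(data$running.var > 0))
data$covar1 <- rnorm(n, mean = 50, sd = 10)
data$covar2 <- rnorm(n, mean = 10, sd = 20)

covars <- c("covar1", "covar2")
models <- lapply(covars, function(x) {
  regres <- lm(as.formula(paste(x, "~ running.var + treat.ind")), data = data)
  cbind(summary(regres)$coefficients, confint(regres, level = 0.95))
})
names(models) <- covars

# Stack the matrices, then label each block with the name of its list element
coefs <- do.call(rbind, models)
param <- rownames(coefs)
rownames(coefs) <- NULL
stacked <- data.frame(
  covar = rep(names(models), vapply(models, nrow, integer(1))),
  param = param,
  coefs,
  check.names = FALSE  # keep column names such as "Std. Error" as they are
)
```

Here rep() repeats each covariate name once per coefficient row, so the labels line up with the stacked blocks.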

R: How to create a Quartile Column within Groups

I have managed to create the column "quartile" with the following code, but I'd also like to create a column called "quartile_team" that shows the quartiles within each team. I can't figure out how to do this.
Help is appreciated,
Paul
# generate dataset
teams <- c(rep("East", 6), rep("West", 8), rep("North", 7), rep("South", 9))
time_spent <- rnorm(30)
# data.frame() keeps time_spent numeric; cbind() would coerce both columns to character
dataset <- data.frame(teams, time_spent)
# create quartile column
dataset <- within(dataset,
quartile <- cut(x = time_spent,
breaks = quantile(time_spent, probs = seq(0, 1, 0.25)),
labels = FALSE,
include.lowest = TRUE))
There are far better ways to do this, but a quick and dirty solution would probably use plyr. I'll reuse your quartile calculation inside the function:
library(plyr)
ddply(dataset, "teams", function(team){
team_quartile <- cut(x = team$time_spent, breaks = quantile(team$time_spent, probs = seq(0, 1, 0.25)),
labels = FALSE,
include.lowest = TRUE)
data.frame(team, team_quartile)
})
Basically, you want to split the data frame up by the team and then perform the calculation on each subset of the data frame. You could use tapply for this as well.
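The same split-and-recombine idea can be written in base R with ave, which returns a vector in the original row order so it can be assigned straight to a new column. A minimal sketch, assuming the data set is built with data.frame() so time_spent stays numeric; quartile_of is a hypothetical helper name:

```r
set.seed(42)
teams <- c(rep("East", 6), rep("West", 8), rep("North", 7), rep("South", 9))
dataset <- data.frame(teams, time_spent = rnorm(30))

# Hypothetical helper: quartile number (1 to 4) of each value within x
quartile_of <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
      labels = FALSE, include.lowest = TRUE)
}

# Overall quartile, then quartile within each team
dataset$quartile      <- quartile_of(dataset$time_spent)
dataset$quartile_team <- ave(dataset$time_spent, dataset$teams, FUN = quartile_of)
```

ave applies quartile_of to each team's values separately and writes the results back to the matching rows.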

help me improve my bootstrap

Consider the following code:
require(Hmisc)
num.boots <- 10
data <- rchisq(500, df = 5) #generate fake data
#create bins
binx <- cut(data, breaks = 10)
binx <- levels(binx)
binx <- sub("^.*\\,", "", binx)
binx <- as.numeric(substr(binx, 1, nchar(binx) - 1))
#pre-allocate a matrix to be filled with samples
output <- matrix(NA, nrow = num.boots, ncol = length(binx))
#do random sampling from the vector and calculate percent
# of values equal or smaller to the bin number (i)
for (i in 1:num.boots) {
walk.pair.sample <- sample(data, size = length(data), replace = TRUE)
data.cut <- cut2(x = walk.pair.sample, cuts = binx)
data.cut <- table(data.cut)/sum(table(data.cut))
output[i, ] <- data.cut
}
#do some plotting
# one x position per bin (column of output), not per bootstrap replicate
plot(1:ncol(output), seq(0, max(output), length.out = ncol(output)), type = "n", xlab = "", ylab = "")
for (i in 1:nrow(output)) {
lines(1:ncol(output), output[i, ])
}
#mean values by columns
output.mean <- apply(output, 2, mean)
lines(output.mean, col="red", lwd = 3)
legend(x = 8, y = 0.25, legend = "mean", col = "red", lty = "solid", lwd = 3)
I was wondering if I can supply boot::boot() with a function whose output is a vector of length n > 1. Is that at all possible?
Here are my feeble attempts, but I must be doing something wrong.
require(boot)
bootstrapDistances <- function(data, binx) {
data.cut <- cut2(x = data, cuts = binx)
data.cut <- table(data.cut)/sum(table(data.cut))
return(data.cut)
}
> x <- boot(data = data, statistic = bootstrapDistances, R = 100)
Error in cut.default(x, k2) : 'breaks' are not unique
I don't really understand why Hmisc::cut2() isn't working properly in the boot() call, but works when I call it in a for() loop (see code above). Is the logic of my bootstrapDistances() function feasible with boot()? Any pointers much appreciated.
.:EDIT:.
Aniko suggested I modify my function in such a way, to include an index. While reading the documentation for boot(), this wasn't clear to me how it works, which explains why the function may not be working. Here's the new function Aniko suggested:
bootstrapDistances2 <- function(data, idx, binx) {
data.cut <- cut2(x = data[idx], cuts = binx)
data.cut <- table(data.cut)/sum(table(data.cut))
return(data.cut)
}
However, I managed to produce an error and I'm still working out how to remove it.
> x <- boot(data = data, statistic = bootstrapDistances2, R = 100, binx = binx)
Error in t.star[r, ] <- statistic(data, i[r, ], ...) :
number of items to replace is not a multiple of replacement length
After I restarted my R session (also tried another version, 2.10.1), it seems to be working fine.
From the help-file for the boot function:
In all other cases statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample.
So you need to add a second parameter to your bootstrapDistances function that will tell it which elements of the data are selected:
bootstrapDistances2 <- function(data, idx, binx) {
data.cut <- cut2(x = data[idx], cuts = binx)
data.cut <- table(data.cut)/sum(table(data.cut))
return(data.cut)
}
And the results:
x <- boot(data = data, statistic = bootstrapDistances2, R = 100, binx=binx)
x
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = data, statistic = bootstrapDistances2, R = 100, binx = binx)
Bootstrap Statistics :
      original     bias   std. error
t1*      0.208  0.00134  0.017342783
t2*      0.322  0.00062  0.021700803
t3*      0.190 -0.00034  0.018873433
t4*      0.136 -0.00116  0.016206197
t5*      0.078 -0.00120  0.011413265
t6*      0.036  0.00070  0.008510837
t7*      0.016  0.00074  0.005816417
t8*      0.006  0.00024  0.003654581
t9*      0.000  0.00000  0.000000000
t10*     0.008 -0.00094  0.003368961
Good answer, Aniko.
Also, the help page for "boot" states that the bootstrap statistic function may return a vector, not merely a scalar.
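To illustrate the vector-valued case, here is a small sketch of a statistic that returns one proportion per bin. It deliberately uses base cut() with fixed breaks instead of Hmisc::cut2(), so that every resample yields a vector of the same length; that is my substitution, and the thread does not confirm whether varying lengths caused the error above. The names bin_props and breaks are assumptions for this example.

```r
library(boot)

set.seed(1)
data <- rchisq(500, df = 5)
# Fixed bin edges covering the whole sample, so empty bins are kept as zeros
breaks <- seq(0, max(data), length.out = 11)

# statistic(data, idx, ...): idx holds the indices of the bootstrap sample
bin_props <- function(data, idx, breaks) {
  counts <- table(cut(data[idx], breaks = breaks, include.lowest = TRUE))
  as.numeric(counts) / length(idx)
}

x <- boot(data = data, statistic = bin_props, R = 100, breaks = breaks)
dim(x$t)   # one row per replicate, one column per bin
```

x$t0 holds the proportions for the original data and x$t the replicates, so column means or quantiles of x$t give the per-bin summaries.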

Resources