R: How to create a Quartile Column within Groups - r

I have managed to create the column "quartile" with the following code, but I'd also like to create a column called "quartile_team" that shows the quartiles within each team. I can't figure out how to do this.
Help is appreciated,
Paul
# generate dataset
teams <- c(rep("East", 6), rep("West", 8), rep("North", 7), rep("South", 9))
time_spent <- rnorm(30)
dataset <- data.frame(teams, time_spent)  # build directly; cbind() would coerce time_spent to character
# create quartile column
dataset <- within(dataset,
quartile <- cut(x = time_spent,
breaks = quantile(time_spent, probs = seq(0, 1, 0.25)),
labels = FALSE,
include.lowest = TRUE))

There are far better ways to do this, but a quick and dirty solution could use plyr. I'll reuse your quartile calculation inside the function:
library(plyr)
ddply(dataset, "teams", function(team){
team_quartile <- cut(x = team$time_spent, breaks = quantile(team$time_spent, probs = seq(0, 1, 0.25)),
labels = FALSE,
include.lowest = TRUE)
data.frame(team, team_quartile)
})
Basically, you want to split the data frame up by the team and then perform the calculation on each subset of the data frame. You could use tapply for this as well.
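As a sketch of a base R route for the same split-and-apply idea: ave() splits a vector by group, applies a function to each piece, and returns the results in the original row order, so the new column can be assigned directly. The helper name quartile_of and the seed are my own additions for illustration, not part of the original post.

```r
# Illustrative data in the same shape as the question (seed added for reproducibility)
set.seed(1)
teams <- c(rep("East", 6), rep("West", 8), rep("North", 7), rep("South", 9))
dataset <- data.frame(teams, time_spent = rnorm(30))

# Hypothetical helper: quartile number (1-4) of each value within x
quartile_of <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
      labels = FALSE, include.lowest = TRUE)
}

# Overall quartile, then quartile within each team
dataset$quartile <- quartile_of(dataset$time_spent)
dataset$quartile_team <- ave(dataset$time_spent, dataset$teams, FUN = quartile_of)

head(dataset)
```

Because ave() keeps the row order, no merge or re-sorting step is needed afterwards.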

Related

Mahalanobis difference by group with dplyr

I want to get a Mahalanobis difference for each set of two scores, after being grouped by another variable. In this case, it would be a Mahalanobis difference for each Attribute (across each set of 2 scores). The output should be 3 Mahalanobis distances (one for A, B and C).
Currently I am working with (in my original dataframe, there are some NAs, hence I include one in the reprex):
library(tidyverse)
library(purrr)
df <- tibble(Attribute = unlist(map(LETTERS[1:3], rep, 5)),
Score1 = c(runif(7), NA, runif(7)),
Score2 = runif(15))
mah_db <- df %>%
dplyr::group_by(Attribute) %>%
dplyr::summarise(MAH = mahalanobis(Score1:Score2,
center = base::colMeans(Score1:Score2),
cov(Score1:Score2, use = "pairwise.complete.obs")))
This raises the error:
Caused by error in base::colMeans():
! 'x' must be an array of at least two dimensions
But as far as I can tell, I am giving colMeans two columns.
So what's going wrong here? And I wonder if even fixing this gives a complete solution?
It seems your question is more about the statistics than about dplyr, so I'll just give a small example based on your data and adapted from the example in ?mahalanobis. Perhaps also have a look here or here.
df0 <- df  # df0 is the original data frame from the question
df <- subset(x = df0, Attribute == "A", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
plot(density(df$mahalanobis, bw = 0.3), ylim = c(0, 0.8),
main = "Squared Mahalanobis distances")
grid()
rug(df$mahalanobis)
df <- subset(x = df0, Attribute == "B", select = c("Score1", "Score2"))
df <- df[complete.cases(df), ]
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "red")  # lines() takes no main argument
rug(df$mahalanobis, col = "red")
df <- subset(x = df0, Attribute == "C", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "green")  # lines() takes no main argument
rug(df$mahalanobis, col = "green")
Hope that helps (this was too long for a comment).
(Of course you can make the code much shorter, but this way it shows what happens at each step.)
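The three near-identical blocks can also be collapsed into one split/lapply pass. This is only a minimal sketch, assuming that rows with a missing score are simply dropped (as is done for B above). The seed and the names mah_one_group, by_attr and result are illustrative and not from the original post.

```r
# Illustrative data in the same shape as the question (one NA in group B)
set.seed(1)
df0 <- data.frame(Attribute = rep(LETTERS[1:3], each = 5),
                  Score1 = c(runif(7), NA, runif(7)),
                  Score2 = runif(15))

# Hypothetical helper: squared Mahalanobis distance of each row from its group centre
mah_one_group <- function(g) {
  g <- g[complete.cases(g), ]                      # drop rows with a missing score
  m <- as.matrix(g[, c("Score1", "Score2")])       # mahalanobis() needs a matrix-like x
  g$mahalanobis <- mahalanobis(m, center = colMeans(m), cov = cov(m))
  g$p <- pchisq(g$mahalanobis, df = 2, lower.tail = FALSE)
  g
}

# Split by Attribute, compute per group, and stack the pieces again
by_attr <- lapply(split(df0, df0$Attribute), mah_one_group)
result <- do.call(rbind, by_attr)
result
```

Passing an explicit two-column matrix is also what avoids the colMeans error from the question: Score1:Score2 is evaluated as a numeric sequence, not as a pair of columns.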

Stacking lapply results

I am using the following code to generate data, and I am estimating regression models across a list of variables (covar1 and covar2). I have also created confidence intervals for the coefficients and merged them together.
I have been examining all sorts of examples here and on other sites, but I can't seem to accomplish what I want. I want to stack the results for each covar into a single data frame, labeling each cluster of results by the covar it belongs to (i.e., "covar1" and "covar2"). Here is the code for generating data and results using lapply:
##creating a fake dataset (N=1000, 500 at treated, 500 at control group)
#outcome variable
outcome <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 70, sd = 10))
#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = TRUE)
##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]
#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))
#create covariates
set.seed(123)
covar1 <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 50, sd = 20))
covar2 <- c(rnorm(500, mean = 10, sd = 20), rnorm(500, mean = 10, sd = 30))
data <- data.frame(cbind(outcome, running.var, treat.ind, covar1, covar2))
data$treat.ind <- as.factor(data$treat.ind)
#Bundle the covariates names together
covars <- c("covar1", "covar2")
#loop over them using a convenient feature of the "as.formula" function
models <- lapply(covars, function(x){
regres <- lm(as.formula(paste(x," ~ running.var + treat.ind",sep = "")), data = data)
ci <-confint(regres, level=0.95)
regres_ci <- cbind(summary(regres)$coefficient, ci)
})
names(models) <- covars
print(models)
Any nudge in the right direction, or link to a post I just haven't come across, is greatly appreciated.
You can use do.call, where the second argument is a list:
do.call(rbind, models)
I made a (possible) improvement to your lapply function. This way you can save the estimated parameters and the variable names in a data.frame:
models <- lapply(covars, function(x){
regres <- lm(as.formula(paste(x," ~ running.var + treat.ind",sep = "")), data = data)
ci <-confint(regres, level=0.95)
regres_ci <- data.frame(covar=x,param=rownames(summary(regres)$coefficient),
summary(regres)$coefficient, ci)
})
do.call(rbind,models)
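If the list of coefficient matrices already exists and is named, the labels can also be attached while stacking. A minimal sketch, using the built-in mtcars data as a stand-in for the simulated data (the outcome names mpg and hp and the object name stacked are illustrative assumptions, not from the original post):

```r
# Stand-in for covars: two outcome variables from the built-in mtcars data
covars <- c("mpg", "hp")

# Named list of coefficient tables with confidence intervals, one per outcome
models <- lapply(setNames(covars, covars), function(x) {
  fit <- lm(reformulate(c("wt", "am"), response = x), data = mtcars)
  cbind(summary(fit)$coefficients, confint(fit, level = 0.95))
})

# Turn each matrix into a data frame labeled with its outcome, then stack
stacked <- do.call(rbind, Map(function(m, nm) {
  data.frame(covar = nm, param = rownames(m), m,
             row.names = NULL, check.names = FALSE)
}, models, names(models)))
rownames(stacked) <- NULL

stacked
```

Keeping the parameter names in their own column avoids relying on row names, which rbind would otherwise have to make unique.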

How to get z-score distribution for 3 dataframes and plot all in one graph

I'm trying to create a single graph that contains boxplots of gene expression for 3 different variant types (synonymous, missense, and nonsense). Currently, these variant types are separated into 3 different data frames, each of which contain a Gene, SampleID, and Expression column.
In order to plot all 3 boxplots on a single graph, I need to normalize all the expression data for each variant type, which means I need to get the z-scores. My question is, how do I do that and then how do I plot all 3 variant types on one graph?
I've come across the solution:
missense$Zscore <- ave(m$expr, m$Gene, FUN = scale)
nonsense$Zscore <- ave(n$expr, n$Gene, FUN = scale)
synonymous$Zscore <- ave(s$expr, s$Gene, FUN = scale)
Is this the right approach? If so, where do I go from here?
Example dataframe (missense):
SampleID Expression Gene
HSB100 5.239237 ENSG00000188976
HSB105 4.443808 ENSG00000188976
HSB104 4.425764 ENSG00000188976
HSB121 4.063259 ENSG00000188976
Use the scale function to get z-scores.
missense <- data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(5.239237, 4.443808, 4.425764, 4.063259),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
missense$Zscore <- scale(missense$Expression)
missense
mean(missense$Zscore)
sd(missense$Zscore)
# Create fake data here
nonsense <-
data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(1, 2, 3, 4),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
nonsense$Zscore <- scale(nonsense$Expression)
synonymous <-
data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(3, 4, 5, 6),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
synonymous$Zscore <- scale(synonymous$Expression)
The trick is to bind all three data frames together and then plot using ggplot. I'm not familiar with base plotting, but this is what I would do:
# Add identifier
missense$Type <- 'missense'
nonsense$Type <- 'nonsense'
synonymous$Type <- 'synonymous'
# Bind three together
data_all <- rbind(missense, nonsense, synonymous)
# Use ggplot to draw the boxplots
library(ggplot2)
ggplot(data = data_all, aes(x = Type, y = Zscore)) + geom_boxplot()
If all the genes are the same within each data frame, then ave is not needed, since there is only one group. Hence, you can run a simple calculation: m$Zscore <- scale(m$expr). From there, as @emilliman5 comments, plot all three vectors as a list; a named list also labels the x-axis:
# WITH SEABORN COLORS
boxplot(list(missense=m$Zscore, nonsense=n$Zscore, synonymous=s$Zscore),
col = c("#4c72b0","#55a868","#c44e52"))
You could also row-bind all the data frames, adding a new variant_type column as an indicator. Then use ave, since genes now differ within the combined data frame. You can also use the formula interface instead of list() for boxplot:
all_gene_df <- rbind(transform(m, variant_type='missense'),
transform(n, variant_type='nonsense'),
transform(s, variant_type='synonymous'))
all_gene_df$Zscore <- with(all_gene_df, ave(expr, variant_type, FUN = scale))
# WITH SEABORN COLORS
boxplot(Zscore ~ variant_type, data = all_gene_df,
col = c("#4c72b0","#55a868","#c44e52"),
main = "ZScore Boxplots by Variant Type",
xlab = "Variant Type",
ylab = "ZScore")
Data
set.seed(103018)
m <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'MISSENSE0001')
n <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'NONSENSE0001')
s <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'SYNONYMOUS0001')
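When a data frame holds more than one gene, ave can take several grouping variables at once, so each gene is standardized within its own variant type. A small sketch on made-up data (the data frame expr_df, the gene labels and the helper zscore are hypothetical names for illustration):

```r
# Made-up data: 3 variant types x 2 genes x 3 samples each
set.seed(42)
expr_df <- data.frame(
  variant_type = rep(c("missense", "nonsense", "synonymous"), each = 6),
  gene = rep(rep(c("GENE_A", "GENE_B"), each = 3), times = 3),
  expr = runif(18) * 10
)

# Plain z-score; returns a vector (scale() would return a one-column matrix)
zscore <- function(x) (x - mean(x)) / sd(x)

# Standardize within every variant_type / gene combination
expr_df$Zscore <- with(expr_df, ave(expr, variant_type, gene, FUN = zscore))

head(expr_df)
```

From here, boxplot(Zscore ~ variant_type, data = expr_df) draws one box per variant type, as in the answer above.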

Define function that takes variable from data.frame as input and creates new variables in R

Sorry if this has been asked before; I have looked thoroughly but couldn't find anything.
My problem is the following. I have survey data and want to perform the same 3 steps with different variables. I always want the output to be separated by gender. So I want to create a function that automates this and returns three new variables. It should look something like this:
myfunction <- function(x) {
x.mean.by.gender <- svyby(~x, ~gender, svymean, design = s.design)
x.boxplot <- svyboxplot(x~gender, varwidth=FALSE, design = s.design)
x.ttest <- svyttest(x~gender, design = s.design)}
myfunction(data$income)
This is a working example of what I want to do (boxplot doesn't split by gender but not important):
require("survey")
income <- runif(50, 1000, 2500)
wealth <- runif(50, 10000, 100000)
weight <- runif(50, 1.0, 1.99)
id <- seq(from = 1, to = 50, 1)
gender <- sample(0:1, 50, replace = TRUE)
data <- data.frame(income, wealth, weight, id, gender)
data.w <- svydesign(ids = ~id, data = data, weights = ~weight)
data.w <- update(data.w, count=1)
svytotal(~count, data.w)
# Income difference
income.mean.table <- svyby(~income, ~gender, svymean, design = data.w)
income.mean.table
income.boxplot <- svyboxplot(income~gender, varwidth = FALSE, design = data.w)
income.ttest <- svyttest(income~gender, design = data.w)
income.ttest
# Wealth difference
wealth.mean.table <- svyby(~wealth, ~gender, svymean, design = data.w)
wealth.mean.table
wealth.boxplot <- svyboxplot(wealth~gender, varwidth = FALSE, design = data.w)
wealth.ttest <- svyttest(wealth~gender, design = data.w)
wealth.ttest
So the function should perform one of those iterations of svyby, svyboxplot and svyttest and create variables named after the variable passed to the function (e.g. income.ttest). I hope this clears things up. Sorry for the confusion.
If the objects are of different types, just put them in a list and return it.
UPDATE: a version where you can use named elements of the returned list:
fl <- function(x) {
a <- x*12.0
b <- rep(1L, 12)
c <- "qqqqq"
list(A = a, B = b, C = c)
}
q <- fl(c(1, 2, 3, 4, 5))
print(q$A)
print(q$B)
print(q$C)
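To connect this back to the question, one way to sketch it is to pass the variable name as a string, build the formulas with reformulate, and return the results in a named list. The sketch below uses base R's aggregate and t.test as stand-ins so that it runs without the survey package; I would expect the same formulas to work with svyby and svyttest given a design object, but that part is an untested assumption. The names compare_by_gender, survey_df and results are hypothetical.

```r
# Hypothetical helper: var is a column name given as a string
compare_by_gender <- function(var, data) {
  f_group <- reformulate("gender", response = var)   # e.g. income ~ gender
  list(
    means = aggregate(f_group, data = data, FUN = mean),  # stand-in for svyby(..., svymean)
    ttest = t.test(f_group, data = data)                  # stand-in for svyttest()
  )
}

# Made-up data in the same shape as the question
set.seed(1)
survey_df <- data.frame(income = runif(50, 1000, 2500),
                        wealth = runif(50, 10000, 100000),
                        gender = rep(0:1, 25))

# One named list entry per variable, instead of income.ttest, wealth.ttest, ...
results <- lapply(c(income = "income", wealth = "wealth"),
                  compare_by_gender, data = survey_df)

results$income$means
results$wealth$ttest
```

Collecting everything in one named list (results$income$ttest rather than income.ttest) avoids having the function create variables in the calling environment.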

Quantiles by factor levels in R

I have a data frame and I'm trying to create a new variable in the data frame that has the quantiles of a continuous variable var1, for each level of a factor strata.
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
# function to get quantiles
qfun <- function(x, q = 5) {
quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
include.lowest = TRUE, labels = 1:q)
quantile
}
I tried using two methods, neither of which produce a usable result. Firstly, I tried using aggregate to apply qfun to each level of strata:
qdat <- with(dat, aggregate(var1, list(strata), FUN = qfun))
This returns the quantiles by factor level, but the output is hard to coerce back into a data frame (e.g., using unlist does not line the new variable values up with the correct rows in the data frame).
A second approach was to do this in steps:
tmp1 <- with(dat, split(var1, strata))
tmp2 <- lapply(tmp1, qfun)
tmp3 <- unlist(tmp2)
dat$quintiles <- tmp3
Again, this calculates the quantiles correctly for each factor level, but obviously, as with aggregate they aren't in the correct order in the data frame. We can check this by putting the quantile "bins" into the data frame.
# get quantile bins
qfun2 <- function(x, q = 5) {
quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
include.lowest = TRUE)
quantile
}
tmp11 <- with(dat, split(var1, strata))
tmp22 <- lapply(tmp11, qfun2)
tmp33 <- unlist(tmp22)
dat$quintiles2 <- tmp33
Many of the values of var1 are outside of the bins in quintiles2. I feel like I'm missing something simple. Any suggestions would be greatly appreciated.
I think your issue is that you don't really want aggregate; use ave instead (or data.table or plyr):
qdat <- transform(dat, qq = ave(var1, strata, FUN = qfun))
#using plyr
library(plyr)
qdat <- ddply(dat, .(strata), mutate, qq = qfun(var1))
#using data.table (my preference)
library(data.table)
setDT(dat)  # convert dat to a data.table by reference
dat[, qq := qfun(var1), by = strata]
Aggregate usually implies returning an object that is smaller than the original. (In this case you were getting a data.frame where x was a list with one element for each stratum.)
Use ave on your dat data frame. Full example with your simulated data and qfun function:
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
# function to get quantiles
qfun <- function(x, q = 5) {
quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
include.lowest = TRUE, labels = 1:q)
quantile
}
And my addition...
dat$q <- ave(dat$var1,dat$strata,FUN=qfun)
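For the split/lapply route from the question, the missing piece is the step that puts the values back in row order: unlist concatenates the groups one after another, whereas unsplit reverses split. A minimal sketch comparing it with ave, using balanced made-up strata so every group has enough points for five bins (the seed and the names q_ave and q_unsplit are illustrative):

```r
# Made-up data: 5 strata with 10 rows each, in shuffled order
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(rep(LETTERS[1:5], each = 10))))

# Same quantile function as in the question
qfun <- function(x, q = 5) {
  cut(x, breaks = quantile(x, probs = 0:q/q),
      include.lowest = TRUE, labels = 1:q)
}

# ave() returns the bin numbers in the original row order
q_ave <- ave(dat$var1, dat$strata, FUN = qfun)

# split + lapply, then unsplit() (not unlist()) to restore the row order
q_unsplit <- unsplit(lapply(split(dat$var1, dat$strata), qfun), dat$strata)

dat$quintiles <- q_unsplit
table(dat$strata, dat$quintiles)
```

Note that unsplit keeps the result as a factor, while ave returns the underlying numeric codes.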
