Mutate new column over a large list of tibbles - r

So I have used the following code to split the below dataframe (df1) into multiple dataframes/tibbles based on the filters so that I can work out the percentile rank of each metric.
df1:
name  group  metric    value
A     A      distance  10569
B     A      distance  12939
C     A      distance  11532
A     A      psv-99    29.30
B     A      psv-99    30.89
C     A      psv-99    28.90
split <- lapply(unique(df1$metric), function(x) {
  df1 %>% filter(group == "A" & metric == x)
})
This then gives me a large list of tibbles. I want to now mutate a new column for each tibble to work out the percentile rank of the value column which I can do using the following code:
df2 <- split[[1]] %>% mutate(percentile = percent_rank(value))
I could do this for each metric and then combine the results with bind_rows, but that seems very messy. Could anyone suggest a better way of doing this?

No need to split the data here. You can use group_by to do the calculation for each metric separately.
library(dplyr)
df %>%
filter(group == "A") %>%
group_by(metric) %>%
mutate(percentile = percent_rank(value))
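As a minimal, self-contained sketch of the grouped approach, the snippet below re-types the question's sample data by hand (the data frame construction is an assumption based on the table in the question) and computes the percentile rank within each metric.

```r
library(dplyr)

# Sample data re-typed from the question's table (assumed layout)
df1 <- data.frame(
  name   = c("A", "B", "C", "A", "B", "C"),
  group  = "A",
  metric = rep(c("distance", "psv-99"), each = 3),
  value  = c(10569, 12939, 11532, 29.30, 30.89, 28.90)
)

result <- df1 %>%
  filter(group == "A") %>%
  group_by(metric) %>%                        # one ranking per metric
  mutate(percentile = percent_rank(value)) %>%
  ungroup()

result$percentile
# 0.0 1.0 0.5 0.5 1.0 0.0
```

Because mutate keeps every row, the result has the same shape as the input, so nothing needs to be split and bound back together.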

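We can use base R's subset and ave for the grouping (percent_rank itself still comes from dplyr)
library(dplyr)  # only needed for percent_rank
df1 <- subset(df, group == 'A')
df1$percentile <- with(df1, ave(value, metric, FUN = percent_rank))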
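If no packages are wanted at all, the ranking can be sketched in base R too. The helper pct_rank below is a hypothetical name introduced here; it follows the same formula as dplyr's percent_rank, (minimum rank - 1) / (n - 1), and assumes the values contain no NAs. The sample data is re-typed from the question's table.

```r
# Hypothetical helper: percent rank for a vector without NAs
pct_rank <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)

# Sample data re-typed from the question's table (assumed layout)
df <- data.frame(
  name   = c("A", "B", "C", "A", "B", "C"),
  group  = "A",
  metric = rep(c("distance", "psv-99"), each = 3),
  value  = c(10569, 12939, 11532, 29.30, 30.89, 28.90)
)

df1 <- subset(df, group == "A")
# ave applies FUN within each metric and returns a vector aligned with the rows
df1$percentile <- ave(df1$value, df1$metric, FUN = pct_rank)
df1$percentile
# 0.0 1.0 0.5 0.5 1.0 0.0
```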

library(dplyr)
library(purrr)  # map
library(tidyr)  # unnest
df %>%
  group_nest(group, metric) %>%
  mutate(percentile = map(data, ~percent_rank(.x$value))) %>%
  unnest(cols = c("data", "percentile"))

Related

Mutate percentile rank based on two columns

I've previously asked the following question: Mutate new column over a large list of tibbles, and the solutions given were perfect. I now have a follow-up question.
I now have the following dataset:
df1:
name  group  competition  metric    value
A     A      comp A       distance  10569
B     A      comp B       distance  12939
C     A      comp C       distance  11532
A     A      comp B       psv-99    29.30
B     A      comp A       psv-99    30.89
C     A      comp C       psv-99    32.00
I now want to find out the percentile rank of all the values in df1, but only based on the group & one of the competitions - competition A.
We could slice the rows where competition is 'comp A', then group by the 'group' column and create a new percentile column with percent_rank
library(dplyr)
df <- df %>%
slice(which(competition %in% "comp A")) %>%
group_by(group) %>%
mutate(percentile = percent_rank(value))
Maybe just change metric to competition in the previous code? It would give you the percentile rank for all competitions, including A.
df1 %>%
group_nest(group, competition) %>%
mutate(percentile = map(data, ~percent_rank(.$value))) %>%
unnest(c(data, percentile))
You can filter the competition and group_by group.
library(dplyr)
df %>%
filter(competition == "comp A") %>%
group_by(group) %>%
mutate(percentile = percent_rank(value))

Applying functions in dplyr pipes

Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == b using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error "Column out.stats must be length 1 (a summary value), not 4". Why does this not work, and how do you apply functions like this inside a pipe?
The following answers both the question and the OP's last comment on it, which asks for the row numbers of the outliers:
what if we want to return the row numbers that go with
boxplot.stats()$out from the pipe? so if we did
b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)
This is done by left-joining the outlier values back onto the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
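An alternative sketch that stays inside the pipe and avoids joining on floating-point values: record the row number before filtering, then keep the rows whose value appears in boxplot.stats()$out. The column name row is an assumption introduced here for illustration.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

outliers <- data %>%
  mutate(row = row_number()) %>%                  # index in the original data
  filter(group == "b") %>%
  filter(value %in% boxplot.stats(value)$out)     # evaluated on group b only

outliers
```

The second filter is kept separate so that boxplot.stats only sees the values of group b.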
With dplyr 1.0.0 or later, summarise can also return more than one row per group:
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
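The original error arises because summarise expects one value per group, while boxplot.stats returns a list of four components (stats, n, conf, out). One sketch that does not depend on the multi-row behaviour is to wrap the result in list() so that it occupies a single list-column cell, then extract the component of interest. Note that in dplyr 1.1.0 and later, returning several rows from summarise is deprecated in favour of reframe, so the multi-row form may emit a warning there.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

res <- data %>%
  filter(group == "b") %>%
  summarise(out.stats = list(boxplot.stats(value)))  # one row, one list cell

# Pull the outliers out of the stored boxplot.stats result
res$out.stats[[1]]$out
```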

R run T-test/anova for each row with 2 groups with 3 samples

My dataset looks something like this:
df <- matrix(rnorm(12*4), ncol = 12)
colnames(df) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3", "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
# data.frame() converts the "-" in the column names to "." (AC-1 becomes AC.1)
df <- data.frame(compound = c("alanine", "arginine", "asparagine", "aspartate"), df)
df
compound AC.1 AC.2 AC.3 AM.1 AM.2 AM.3 SC.1 SC.2 SC.3 SM.1
1 alanine 1.18362683 -2.03779314 -0.7217692 -1.7569264 -0.8381042 0.06866567 0.2327702 -1.1558879 1.2077454 0.437707310
2 arginine -0.19610110 0.05361113 0.6478384 -0.1768597 0.5905398 -0.67945600 -0.2221109 1.4032349 0.2387620 0.598236199
3 asparagine 0.02540509 0.47880021 -0.1395198 0.8394257 1.9046667 0.31175358 -0.5626059 0.3596091 -1.0963363 -1.004673116
4 aspartate -1.36397906 0.91380826 2.0630076 -0.6817453 -0.2713498 -2.01074098 1.4619707 -0.7257269 0.2851122 -0.007027878
I want to perform a t-test for each row (compound) on the columns [2:4] as one, and [5:7] as one, and store all the p-values. Basically see if there is a difference between the AC group and AM group for each compound.
I am aware there is another topic with this however I couldn't find a viable solution for my problem.
PS. my real dataset has about 35000 rows (maybe it needs a different solution than only 4 rows)
After selecting the columns of interest, use pmap_dbl to run t.test on each row, passing the first three values (AC) and the next three (AM) as the two samples, and bind the extracted p-values to the original data as a new column
library(tidyverse)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{t.test(.[1:3], .[4:6])$p.value}) %>%
bind_cols(df, pval_AC_AM = .)
Or, after selecting the columns, gather them into 'long' format, separate the group label from the replicate number, spread the groups into columns, apply t.test within summarise, and join the result with the original data
df %>%
select(compound, AC.1:AM.3) %>%
gather(key, val, -compound) %>%
separate(key, into = c('key1', 'key2')) %>%
spread(key1, val) %>%
group_by(compound) %>%
summarise(pval_AC_AM = t.test(AC, AM)$p.value) %>%
right_join(df)
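gather and spread are superseded in current tidyr, so the same idea can be sketched with pivot_longer. The column names grp and replicate, and the seeded sample data, are assumptions introduced here for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)
mat <- matrix(rnorm(12 * 4), ncol = 12)
colnames(mat) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3",
                   "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine", "arginine", "asparagine", "aspartate"), mat)

pvals <- df %>%
  select(compound, AC.1:AM.3) %>%
  # split names such as "AC.1" into group ("AC") and replicate ("1")
  pivot_longer(-compound, names_to = c("grp", "replicate"), names_sep = "\\.") %>%
  group_by(compound) %>%
  summarise(pval_AC_AM = t.test(value[grp == "AC"], value[grp == "AM"])$p.value)

left_join(df, pvals, by = "compound")
```

This skips the intermediate wide step because the two samples are selected directly by group label within each compound.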
Update
If the values in a row are constant (for example, identical within each group), t.test throws an error. One option is to wrap the call in purrr's possibly so that those cases return NA instead
posttest <- possibly(function(x, y) t.test(x, y)$p.value, otherwise = NA_real_)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{posttest(.[1:3], .[4:6])}) %>%
bind_cols(df, pval_AC_AM = .)
posttest(rep(3,5), rep(1, 5))
#[1] NA
If you can use an external library:
library(matrixTests)
row_t_welch(df[,2:4], df[,5:7])$pvalue
#[1] 0.67667626 0.39501003 0.26678161 0.01237438
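For comparison, a dependency-free sketch with base R's apply is shown below. The seeded sample data is an assumption; on the 35000-row dataset mentioned in the question this loop is likely to be slower than a vectorised package such as matrixTests.

```r
set.seed(1)
mat <- matrix(rnorm(12 * 4), ncol = 12)
colnames(mat) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3",
                   "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine", "arginine", "asparagine", "aspartate"), mat)

# Columns 2:4 are the AC replicates, 5:7 the AM replicates
pvals <- apply(as.matrix(df[, 2:7]), 1, function(r) {
  t.test(r[1:3], r[4:6])$p.value   # Welch t-test, the t.test default
})

df$pval_AC_AM <- pvals
df[, c("compound", "pval_AC_AM")]
```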

compute residuals within groups in dplyr

I am trying to compute within group residuals in anova using R. My data frame is
df <- data.frame(V1 = c(rep("group1", 5), rep("group2", 7)),
value = c(6.6,4.6,8.5,6.1,8.4,
10.7,10.1,10.9,10.7,15.6,13.8,15.9))
I am looking for a simple way, using dplyr or otherwise, to combine the following two lines of code
M <- df %>% group_by(V1) %>% summarise(avg = mean(value))
df$res <- ifelse(test = df$V1 == "group1", yes = (df$value - M$avg[1])^2,
no = (df$value - M$avg[2])^2)
I tried to use do() in dplyr but without success. Is there a neat way of doing this?
If you need to keep using the original value column along with avg, then use mutate rather than summarize so that the means are just placed in a new column next to the original values:
df %>%
group_by(V1) %>%
mutate(avg = mean(value),
res = (value - avg)^2)
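The same calculation can be sketched without dplyr using base R's ave, which returns the group mean repeated for every row (mean is its default function). The data frame is the one from the question.

```r
df <- data.frame(V1 = c(rep("group1", 5), rep("group2", 7)),
                 value = c(6.6, 4.6, 8.5, 6.1, 8.4,
                           10.7, 10.1, 10.9, 10.7, 15.6, 13.8, 15.9))

df$avg <- ave(df$value, df$V1)        # per-group mean, aligned with the rows
df$res <- (df$value - df$avg)^2       # squared within-group deviation
head(df)
```

As in the question's own code, res holds the squared deviations from the group mean; drop the ^2 if the raw residuals are wanted.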

How to perform chisq.test within dplyr nest in R

I am having trouble figuring out how to perform a chisq.test within a nested list column of a data frame. If I need to turn the data list-column into a matrix, how do I do that, and then how do I properly refer to the variables for the chisq.test? Take the example below. Thank you!
Here is an example:
a <- rep(c('A', 'B'), 10)
b <- rep(c('a', 'b'), each = 10)
c <- as.numeric(rep(c(1:10), each = 2))
# data.frame() keeps c numeric; cbind() would coerce every column to character
df <- data.frame(a, b, c)
Is the distribution of 'c' counts the same between the levels of factor 'b' ('a' and 'b') within each subgroup of factor 'a' ('A' and 'B')?
dfnest <- df %>%
nest(-a) %>%
mutate(chisq_p = map_dbl(data, ~chisq.test(.$b~.$c)$p.value))
The last line is what I want to accomplish, but the above is incorrect - how do I use the chisq.test within the list-column data, and insert the p.value into a new column?
Passing the two columns as separate arguments to chisq.test, rather than as a formula, returns the expected result. (nest(data = -a) is the current tidyr spelling of nest(-a).)
df %>%
  nest(data = -a) %>%
  mutate(chisq_p = map_dbl(data, ~chisq.test(.$b, .$c)$p.value))
You can also use an anonymous function.
df %>%
  nest(data = -a) %>%
  mutate(chisq_p = map_dbl(data, function(f) { chisq.test(f$b, f$c)$p.value }))
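As a cross-check without nesting, the per-subgroup test can be sketched in base R with split and sapply. When given two vectors, chisq.test cross-tabulates them itself. The name p_by_a is an assumption introduced here, and suppressWarnings is used only because the example's sparse counts trigger the usual warning that the chi-squared approximation may be incorrect.

```r
a <- rep(c("A", "B"), 10)
b <- rep(c("a", "b"), each = 10)
c <- as.numeric(rep(1:10, each = 2))
df <- data.frame(a, b, c)

# One p-value per level of a
p_by_a <- sapply(split(df, df$a), function(d) {
  suppressWarnings(chisq.test(d$b, d$c)$p.value)
})

p_by_a
```

The result is a named vector with one entry per subgroup, which can be compared against the chisq_p column from the nested version.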
