Automate z-score calculation by group in R

I have the following data frame:
df<- splitstackshape::stratified(iris, group="Species", size=1)
I want to compute a z-score for each species across all of the variables. I can do this manually by finding the mean and SD for each row and applying the formula, but I need to do this several times over and would like to find a more efficient way.
I tried using scale(), but can't figure out how to get it to do the row-wise calculation that includes several variables and a grouping variable.
Using dplyr::group_by returns a "'x' must be numeric variable" error.

Are you sure the question is asking for a z-score for each group? A z-score is normally computed for each value.
Let's say the function to compute the z-scores could be:
scale(x, center = TRUE, scale = TRUE)
or
function_zscore <- function(x){ x <- x[!is.na(x)]; (x - mean(x)) / sd(x) }
Both functions take a vector x and return a vector of z-scores of the same length.
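A quick way to convince yourself of that (a minimal check on a small made-up vector; both forms should agree):
x <- c(4.9, 3.1, 1.5, 0.2)
as.numeric(scale(x))     # z-scores as a plain numeric vector
(x - mean(x)) / sd(x)    # the same values computed by hand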
df <- splitstackshape::stratified(iris, group = "Species", size = 1)
df <- tidyr::pivot_longer(df, cols = 1:4, names_to = "var.name", values_to = "value")

library(dplyr)
df %>%
  group_by(Species) %>%
  mutate(zscore = scale(value, center = TRUE, scale = TRUE)[, 1])
## A tibble: 12 x 4
## Groups: Species [3]
# Species var.name value zscore
# <fct> <chr> <dbl> <dbl>
# 1 setosa Sepal.Length 4.9 1.22
# 2 setosa Sepal.Width 3.1 0.332
# 3 setosa Petal.Length 1.5 -0.455
# 4 setosa Petal.Width 0.2 -1.09
# 5 versicolor Sepal.Length 5.9 1.10
# 6 versicolor Sepal.Width 3.2 -0.403
# 7 versicolor Petal.Length 4.8 0.486
# 8 versicolor Petal.Width 1.8 -1.18
# 9 virginica Sepal.Length 6.5 1.14
#10 virginica Sepal.Width 3 -0.574
#11 virginica Petal.Length 5.2 0.501
#12 virginica Petal.Width 2 -1.06
If we still want a single score per group that describes how the values spread around their mean, one option is the coefficient of variation:
df %>%
  group_by(Species) %>%
  summarise(coef.var = 100 * sd(value) / mean(value))
## A tibble: 3 x 2
# Species coef.var
# <fct> <dbl>
#1 setosa 83.8
#2 versicolor 45.8
#3 virginica 49.0
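If you want the coefficient of variation per species on the full iris data, rather than on the one-row-per-species sample, a sketch using across() (assuming dplyr >= 1.0):
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ 100 * sd(.x) / mean(.x),
                   .names = "cv_{.col}"))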

Related

In R, how to write a function that runs t-tests on a list of data frames

Using the iris dataset as an example, I want to write a user-defined function that:
runs pairwise t-tests on all 4 numeric columns (excluding the Species column) for each data split
exports the results as 3 worksheets of an Excel file
See below for my attempt:
library(tidyr)
library(reshape)   # for melting/stacking the data
library(multcomp)  # for pairwise tests
library(xlsx)      # to export an Excel file with worksheets
options(scipen = 100)
# dataset
iris
data_stats <- function(data){
  # melt the data frame
  df <- melt(data, id.vars = c('Species'), var = 'group')
  # split the data frame into a list of three data frames
  dfsplit <- split(df, df$column)
  # pairwise t-test
  results <- pairwise.t.test(dfsplit$value, dfsplit$group, p.adjust.method = "BH")
  # export each result as a worksheet of an Excel file
  write.xlsx(results, file = "Results.xlsx", sheetName = "versicolor_stats", row.names = FALSE)
  write.xlsx(results, file = "Results.xlsx", sheetName = "virginica_stats", append = TRUE, row.names = FALSE)
  write.xlsx(results, file = "Results.xlsx", sheetName = "setosa_stats", append = TRUE, row.names = FALSE)
}
# testing the code on the iris data
data_stats(iris)
Please comment and share your code. Thanks
Here is an option with tidyverse: reshape to 'long' format with pivot_longer, then use group_modify to run the pairwise.t.test within each species, tidy the output, and unnest the list column.
library(dplyr)
library(tidyr)
library(broom)
ttest_out <- iris %>%
  pivot_longer(cols = -Species) %>%
  group_by(Species) %>%
  group_modify(~ .x %>%
                 summarise(out = list(pairwise.t.test(value, name) %>%
                                        tidy))) %>%
  ungroup %>%
  unnest(out)
Output:
ttest_out
# A tibble: 18 × 4
Species group1 group2 p.value
<fct> <chr> <chr> <dbl>
1 setosa Petal.Width Petal.Length 1.77e- 54
2 setosa Sepal.Length Petal.Length 2.77e-132
3 setosa Sepal.Length Petal.Width 1.95e-156
4 setosa Sepal.Width Petal.Length 1.61e- 86
5 setosa Sepal.Width Petal.Width 1.13e-123
6 setosa Sepal.Width Sepal.Length 4.88e- 71
7 versicolor Petal.Width Petal.Length 5.35e- 90
8 versicolor Sepal.Length Petal.Length 3.78e- 52
9 versicolor Sepal.Length Petal.Width 5.02e-125
10 versicolor Sepal.Width Petal.Length 1.36e- 45
11 versicolor Sepal.Width Petal.Width 3.46e- 44
12 versicolor Sepal.Width Sepal.Length 1.25e- 95
13 virginica Petal.Width Petal.Length 1.39e- 90
14 virginica Sepal.Length Petal.Length 6.67e- 22
15 virginica Sepal.Length Petal.Width 3.47e-110
16 virginica Sepal.Width Petal.Length 2.35e- 68
17 virginica Sepal.Width Petal.Width 1.87e- 19
18 virginica Sepal.Width Sepal.Length 2.47e- 92
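Since the original question asked for one worksheet per species, a possible follow-up (a sketch, assuming the openxlsx package is available; write.xlsx() from that package accepts a named list of data frames and writes one worksheet per element):
library(openxlsx)
ttest_out %>%
  split(.$Species) %>%               # a named list with one tibble per species
  write.xlsx(file = "Results.xlsx")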
Update: The statistical part of this question (applying pairwise.t.test to the iris dataset) has been answered previously on SO. Here is another solution.
The accepted solution runs a series of pairwise t-tests and produces a column of p-values (exactly as the question asks), but it's not a very meaningful set of t-tests. You might suspect that from the fact that we see p-values like 2.77e-132, and that group1 and group2 are continuous variables rather than the levels of a factor.
The hypotheses these tests evaluate are whether, for each species separately, sepal is the same as petal and length is the same as width. The pairwise t-test procedure is designed to compare a single continuous variable (say, sepal width) across all the levels of a factor (say, species).
To begin with, let's apply pairwise.t.test to the column Sepal.Width, so that we can check later on that we get the right p-values.
library("broom")
library("tidyverse")
pairwise.t.test(iris$Sepal.Width, iris$Species)
#>
#> Pairwise comparisons using t tests with pooled SD
#>
#> data: iris$Sepal.Width and iris$Species
#>
#> setosa versicolor
#> versicolor < 2e-16 -
#> virginica 9.1e-10 0.0031
#>
#> P value adjustment method: holm
If you've ever seen the iris dataset, you know these p-values "make sense": Virginica & Versicolor are more similar to each other than to Setosa.
So now let's apply the tests in a tidy way to the four numeric columns.
t_pvals <- iris %>%
  pivot_longer(
    -Species,
    names_to = "Variable",
    values_to = "x"
  ) %>%
  # The trick to performing the right tests is to group the tibble by Variable,
  # not by Species, because Species is the grouping variable for the t-tests.
  group_by(
    Variable
  ) %>%
  group_modify(
    ~ tidy(pairwise.t.test(.x$x, .x$Species))
  ) %>%
  ungroup()
t_pvals
#> # A tibble: 12 × 4
#> Variable group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 Petal.Length versicolor setosa 1.05e-68
#> 2 Petal.Length virginica setosa 1.23e-90
#> 3 Petal.Length virginica versicolor 1.81e-31
#> 4 Petal.Width versicolor setosa 2.51e-57
#> 5 Petal.Width virginica setosa 2.39e-85
#> 6 Petal.Width virginica versicolor 8.82e-37
#> 7 Sepal.Length versicolor setosa 1.75e-15
#> 8 Sepal.Length virginica setosa 6.64e-32
#> 9 Sepal.Length virginica versicolor 2.77e- 9
#> 10 Sepal.Width versicolor setosa 5.50e-17
#> 11 Sepal.Width virginica setosa 9.08e-10
#> 12 Sepal.Width virginica versicolor 3.15e- 3
The p-values for the Sepal.Width comparisons are at the bottom. We got the p-values we expected!
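As an optional sanity check, pull out just the Sepal.Width rows and compare them with the direct pairwise.t.test() call above:
t_pvals %>%
  filter(Variable == "Sepal.Width")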
Next we format the p-values so that they are easier on the eyes.
t_pvals <- t_pvals %>%
  mutate(
    across(
      p.value, rstatix::p_format,
      accuracy = 0.05
    )
  )
t_pvals
#> # A tibble: 12 × 4
#> Variable group1 group2 p.value
#> <chr> <chr> <chr> <chr>
#> 1 Petal.Length versicolor setosa <0.05
#> 2 Petal.Length virginica setosa <0.05
#> 3 Petal.Length virginica versicolor <0.05
#> 4 Petal.Width versicolor setosa <0.05
#> 5 Petal.Width virginica setosa <0.05
#> 6 Petal.Width virginica versicolor <0.05
#> 7 Sepal.Length versicolor setosa <0.05
#> 8 Sepal.Length virginica setosa <0.05
#> 9 Sepal.Length virginica versicolor <0.05
#> 10 Sepal.Width versicolor setosa <0.05
#> 11 Sepal.Width virginica setosa <0.05
#> 12 Sepal.Width virginica versicolor <0.05
And finally we save the results to a file.
t_pvals %>%
  write_csv("pairwise-t-tests-on-iris-data.csv")

How to add p value column in a data frame comparing row wise?

For the example dataset iris, I would like to compute a table that gives me the p-values for t-tests comparing the species setosa and versicolor to virginica (i.e. virginica would be the reference group/control).
Currently, I've computed the average values for the columns (sepal length, sepal width, petal length, petal width) and am trying to do a t-test grouped by species against the control.
As an example, the output would have these columns:
Sepal Width p value, Sepal length p value, Petal length p value, Petal width p value
Thanks in advance for all your help!
Edit 1:
Here is what I wrote, applied to iris (which doesn't exactly fit). I basically cleaned up my data to only include certain independent variables, which is why I have so many %>%.
iris %>%
  group_by(species) %>%
  addcol = function(iris) %>%
  Sepal.length.p.value = mutate(iris, function(t.test(vars(3), ~./[species == 'Sentosa'])))
and basically I did that for each of the independent variables.
You can try the following:
library(dplyr)
library(tidyr)
library(broom)
pivot_longer(iris, -Species) %>% group_by(name)
# A tibble: 600 x 3
# Groups: name [4]
Species name value
<fct> <chr> <dbl>
1 setosa Sepal.Length 5.1
2 setosa Sepal.Width 3.5
3 setosa Petal.Length 1.4
4 setosa Petal.Width 0.2
5 setosa Sepal.Length 4.9
6 setosa Sepal.Width 3
At this step, we have converted the data into long format and grouped it by variable. Now it is a matter of applying a pairwise t.test within each group and filtering out the comparisons you don't need. We can use broom for this:
res <- pivot_longer(iris, -Species) %>%
  group_by(name) %>%
  do(tidy(pairwise.t.test(.$value, .$Species, pool.sd = FALSE))) %>%
  filter(group1 == "virginica" | group2 == "virginica")
res
# A tibble: 8 x 4
# Groups: name [4]
name group1 group2 p.value
<chr> <chr> <chr> <dbl>
1 Petal.Length virginica setosa 2.78e-49
2 Petal.Length virginica versicolor 4.90e-22
3 Petal.Width virginica setosa 7.31e-48
4 Petal.Width virginica versicolor 2.11e-25
5 Sepal.Length virginica setosa 1.19e-24
6 Sepal.Length virginica versicolor 1.87e- 7
7 Sepal.Width virginica setosa 9.14e- 9
8 Sepal.Width virginica versicolor 1.82e- 3
Note that I set pool.sd = FALSE in pairwise.t.test so that each comparison is equivalent to a plain t.test; ideally, if you have many groups and their variances are similar, it pays to use a pooled SD.
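To see that equivalence (a quick check, not part of the original answer), compare a row of res with a plain t.test() on the same two groups. The Holm adjustment applied by pairwise.t.test by default leaves the largest p-value in each group unchanged, so this particular row matches the raw test:
t.test(iris$Sepal.Width[iris$Species == "virginica"],
       iris$Sepal.Width[iris$Species == "versicolor"])$p.value
# ~0.00182, matching the Sepal.Width virginica/versicolor row above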
You can put this in wide format again:
pivot_wider(res, values_from = p.value, names_from = name)
# A tibble: 2 x 6
group1 group2 Petal.Length Petal.Width Sepal.Length Sepal.Width
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 virginica setosa 2.78e-49 7.31e-48 1.19e-24 0.00000000914
2 virginica versicolor 4.90e-22 2.11e-25 1.87e- 7 0.00182
This is a possible solution: cycle through the variable names in iris with purrr::map_dfc(), and within that map_dfc() cycle through the treatment groups (versicolor and setosa) with purrr::map_dfr(). That way the results of the inner cycle are combined row-wise and the results of the outer cycle are combined column-wise.
library(purrr)
library(dplyr)
var_names <- names(iris)
var_names <- var_names[-length(var_names)]  # the last variable is the group/Species variable; we don't want to include it
treat_group <- c(versicolor = "versicolor", setosa = "setosa")  # a named vector helps map_dfr() give useful row names; otherwise they would just be 1 and 2
map_dfc(var_names, function(x) {
  map_dfr(treat_group, function(y) {
    res <- tibble(t.test(iris[x][iris$Species == "virginica", ],
                         iris[x][iris$Species == y, ])$p.value)
    names(res) <- x
    res
  }, .id = "species")
}) %>%
  select(-matches("[1-3]"))  # drop columns with digits in the name, to get rid of the repeated species columns
#> # A tibble: 2 x 5
#> species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor 1.87e- 7 0.00182 4.90e-22 2.11e-25
#> 2 setosa 3.97e-25 0.00000000457 9.27e-50 2.44e-48
You could just split your data into control and treatment groups and use dplyr::summarise within your groups to create a column that gives you the p-value of a t-test.
library(dplyr)
control <- iris %>%
  filter(Species == "virginica")
dat <- iris %>%
  group_by(Species) %>%
  filter(Species != "virginica") %>%
  summarise("Sepal Width p value" = t.test(Sepal.Width, control$Sepal.Width)$p.value,
            "Sepal length p value" = t.test(Sepal.Length, control$Sepal.Length)$p.value,
            "Petal length p value" = t.test(Petal.Length, control$Petal.Length)$p.value,
            "Petal width p value" = t.test(Petal.Width, control$Petal.Width)$p.value)
With the output being:
# A tibble: 2 x 5
Species `Sepal Width p value` `Sepal length p value` `Petal length p value` `Petal width p value`
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 0.00000000457 3.97e-25 9.27e-50 2.44e-48
2 versicolor 0.00182 1.87e- 7 4.90e-22 2.11e-25
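A more compact variant (a sketch assuming dplyr >= 1.0, using across() and cur_column()) that avoids writing out one t.test per column:
iris %>%
  filter(Species != "virginica") %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric),
                   ~ t.test(.x, control[[cur_column()]])$p.value,
                   .names = "{.col} p value"))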

Is there a way to use a lookup value from a table in a mutate column?

library(tidyverse)
df <- iris %>%
  group_by(Species) %>%
  mutate(Petal.Dim = Petal.Length * Petal.Width,
         rank = rank(desc(Petal.Dim))) %>%
  mutate(new_col = rank == 4, Sepal.Width)
table <- df %>%
  filter(rank == 4) %>%
  select(Species, new_col = Sepal.Width)
correct_df <- left_join(df, table, by = "Species")
df
#> # A tibble: 150 x 8
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Dim
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0.280
#> 2 4.9 3 1.4 0.2 setosa 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 0.3
#> 5 5 3.6 1.4 0.2 setosa 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 0.42
#> 8 5 3.4 1.5 0.2 setosa 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 0.15
#> # ... with 140 more rows, and 2 more variables: rank <dbl>, new_col <lgl>
I'm basically looking for new_col to show the value that corresponds with rank = 4 from the Sepal.Width column. In this case, those values would be 3.9, 3.3, and 3.8. I'm envisioning this similar to a VLookup, or Index/Match in Excel.
Whenever I think "now I need to use VLOOKUP like I did in the past in Excel", I find the left_join() function helpful. It's also part of the dplyr package. Instead of "looking up" values from one table in another, it's easier for R to just make one bigger table, where one table remains unchanged (here the "left" one, the first argument of the function) and the other is added to it using one or more columns they have in common as an index.
In your specific example, I can't entirely tell what you want new_col to contain. But if you want to do Excel-style VLOOKUP in R, then left_join() is the best starting point.
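For illustration, a minimal left_join() lookup sketch with a made-up lookup table (the common_name values are placeholders, not real data):
lookup <- tibble::tibble(Species = c("setosa", "versicolor", "virginica"),
                         common_name = c("A", "B", "C"))
iris %>%
  left_join(lookup, by = "Species") %>%   # adds common_name by matching Species
  head()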
The question is not entirely clear, since it does not state the purpose of the VLOOKUP- or INDEX/MATCH-like operation from Excel.
Also, you don't mention what value "new_col" should have when rank is not equal to 4.
Assuming that value is NA, the solution below with a simple ifelse would work:
df <- iris %>%
  group_by(Species) %>%
  mutate(Petal.Dim = Petal.Length * Petal.Width,
         rank = rank(desc(Petal.Dim))) %>%
  ungroup() %>%
  mutate(new_col = ifelse(rank == 4, Sepal.Width, NA))
df
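If what you actually want is for every row of a species to carry that species' rank-4 Sepal.Width value (the 3.9, 3.3, and 3.8 mentioned in the question), a grouped-mutate sketch (assuming, as in the question, that each species has a row with rank exactly 4; the [1] guards against ties leaving no exact match):
df <- iris %>%
  group_by(Species) %>%
  mutate(Petal.Dim = Petal.Length * Petal.Width,
         rank = rank(desc(Petal.Dim)),
         new_col = Sepal.Width[rank == 4][1]) %>%
  ungroup()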

Predicting values with dplyr and augment

I'd like to fit models to a grouped data frame and then predict one new value per model (i.e. group).
library(dplyr)
library(broom)
data(iris)
dat <- rbind(iris, iris)
dat$Group <- rep(c("A", "B"), each = 150)
new.dat <- data.frame(Group = rep(c("A", "B"), each = 3),
                      Species = rep(c("setosa", "versicolor", "virginica"), times = 2),
                      Sepal.Width = 1:6)
> new.dat
  Group    Species Sepal.Width
1     A     setosa           1
2     A versicolor           2
3     A  virginica           3
4     B     setosa           4
5     B versicolor           5
6     B  virginica           6
However, augment returns 36 rows, as if each new value is fit with each model. How can I preserve the grouping here and get one fitted value per group?
dat %>%
  group_by(Species, Group) %>%
  do(augment(lm(Sepal.Length ~ Sepal.Width, data = .), newdata = new.dat))
# A tibble: 36 x 5
# Groups: Species, Group [6]
Group Species Sepal.Width .fitted .se.fit
<fct> <fct> <int> <dbl> <dbl>
1 A setosa 1 3.33 0.221
2 A versicolor 2 4.02 0.133
3 A virginica 3 4.71 0.0512
4 B setosa 4 5.40 0.0615
5 B versicolor 5 6.09 0.145
6 B virginica 6 6.78 0.234
7 A setosa 1 3.33 0.221
8 A versicolor 2 4.02 0.133
9 A virginica 3 4.71 0.0512
10 B setosa 4 5.40 0.0615
# ... with 26 more rows
(Note that due to the example data the rows are actually duplicates, which is however not the case with my original data).
You need to make the Species and Group of new.dat match those of the group currently being processed in do. You can do this like so:
group.cols <- c("Species", "Group")
dat %>%
  group_by(!!!syms(group.cols)) %>%   # syms() converts the column names to symbols before splicing with !!!
  do(augment(lm(Sepal.Length ~ Sepal.Width, data = .),
             newdata = semi_join(new.dat, ., by = group.cols)))
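An alternative sketch (not from the original answer) using tidyr::nest() and purrr::map2(): fit one model per group, then predict only on that group's rows of new.dat:
library(tidyr)
library(purrr)
dat %>%
  group_by(Species, Group) %>%
  nest() %>%
  left_join(new.dat %>% group_by(Species, Group) %>% nest() %>% rename(newdata = data),
            by = c("Species", "Group")) %>%
  mutate(pred = map2(data, newdata,
                     ~ augment(lm(Sepal.Length ~ Sepal.Width, data = .x), newdata = .y))) %>%
  select(Species, Group, pred) %>%
  unnest(pred)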

How do I calculate a grouped z score in R using dplyr?

Using the iris dataset I'm trying to calculate a z score for each of the variables. I have the data in tidy format, by performing the following:
library(reshape2)
library(dplyr)
test <- iris
test <- melt(iris, id.vars = 'Species')
That gives me the following:
Species variable value
1 setosa Sepal.Length 5.1
2 setosa Sepal.Length 4.9
3 setosa Sepal.Length 4.7
4 setosa Sepal.Length 4.6
5 setosa Sepal.Length 5.0
6 setosa Sepal.Length 5.4
But when I try to create a z-score column for each group (e.g. the z-score for Sepal.Length will not be comparable to that of Sepal.Width) using the following:
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = (value - mean(value)) / sd(value))
The resulting z-scores have not been grouped, and are based on all of the data.
What's the best way to return the z-scores by group using dpylr?
Many thanks!
I believe you were overcomplicating things by computing the z-scores with mean/sd by hand. Just use the function scale.
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = scale(value))
test
## A tibble: 600 x 4
## Groups: Species, variable [12]
# Species variable value z_score
# <fctr> <fctr> <dbl> <dbl>
# 1 setosa Sepal.Length 5.1 0.26667447
# 2 setosa Sepal.Length 4.9 -0.30071802
# 3 setosa Sepal.Length 4.7 -0.86811050
# 4 setosa Sepal.Length 4.6 -1.15180675
# 5 setosa Sepal.Length 5.0 -0.01702177
# 6 setosa Sepal.Length 5.4 1.11776320
# 7 setosa Sepal.Length 4.6 -1.15180675
# 8 setosa Sepal.Length 5.0 -0.01702177
# 9 setosa Sepal.Length 4.4 -1.71919923
#10 setosa Sepal.Length 4.9 -0.30071802
## ... with 590 more rows
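One small caveat worth adding (an aside, not from the original answer): scale() returns a one-column matrix, so the z_score created above is technically a matrix column. Wrapping it in as.numeric() (or taking [, 1], as in the first answer on this page) yields a plain numeric vector:
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = as.numeric(scale(value)))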
Edit: Following a comment by the OP, here is some code to get the rows where Petal.Width has a positive z_score.
i1 <- which(test$variable == "Petal.Width" & test$z_score > 0)
test[i1, ]
## A tibble: 61 x 4
## Groups: Species, variable [3]
# Species variable value z_score
# <fctr> <fctr> <dbl> <dbl>
# 1 setosa Petal.Width 0.4 1.461300
# 2 setosa Petal.Width 0.3 0.512404
# 3 setosa Petal.Width 0.4 1.461300
# 4 setosa Petal.Width 0.4 1.461300
# 5 setosa Petal.Width 0.3 0.512404
# 6 setosa Petal.Width 0.3 0.512404
# 7 setosa Petal.Width 0.3 0.512404
# 8 setosa Petal.Width 0.4 1.461300
# 9 setosa Petal.Width 0.5 2.410197
#10 setosa Petal.Width 0.4 1.461300
## ... with 51 more rows
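An equivalent dplyr-style filter (a stylistic alternative to which() plus row indexing):
test %>%
  filter(variable == "Petal.Width", z_score > 0)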
Your code is giving you z-scores by group. It seems to me these z-scores should be comparable exactly because you've individually scaled each group to mean=0 and sd=1, rather than scaling each value based on the mean and sd of the full data frame. For example:
library(tidyverse)
First, set up the melted data frame:
dat <- iris %>%
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%  # you can also use scale(value), as pointed out by @RuiBarradas
  ungroup %>%
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value))
Now look at the first three rows and compare with direct calculation:
head(dat, 3)
# Species variable value z_score_group z_score_ungrouped
# 1 setosa Sepal.Length 5.1 0.2666745 0.8278959
# 2 setosa Sepal.Length 4.9 -0.3007180 0.7266552
# 3 setosa Sepal.Length 4.7 -0.8681105 0.6254145
# z-scores by group
with(dat, (value[1:3] - mean(value[Species=="setosa" & variable=="Sepal.Length"])) / sd(value[Species=="setosa" & variable=="Sepal.Length"]))
# [1] 0.2666745 -0.3007180 -0.8681105
# ungrouped z-scores
with(dat, (value[1:3] - mean(value)) / sd(value))
# [1] 0.8278959 0.7266552 0.6254145
Now visualize the z-scores: The first graph below is the raw data. The second is the ungrouped z-scores--we've just rescaled the data to an overall mean=0 and SD=1. The third graph is what your code produces. Each group has been individually scaled to mean=0 and SD=1.
gridExtra::grid.arrange(
  grobs = setNames(names(dat)[c(3, 5, 4)], names(dat)[c(3, 5, 4)]) %>%
    map(~ ggplot(dat %>% mutate(group = paste(Species, variable, sep = "_")),
                 aes_string(.x, colour = "group")) + geom_density()),
  ncol = 1)
