Why does `mutate(across(...))` with `scale()` add [,1] to the column header?

This seems too basic not to be found in a search, but maybe I didn't use the correct search terms on Google.
I want to normalize a numeric column. When I modify that column with mutate(across(..., scale)), I get [,1] added to the header. Why is that?
library(dplyr, warn.conflicts = FALSE)
mtcars_mpg_only <-
  mtcars %>%
  as_tibble() %>%
  select(mpg)

mtcars_mpg_only %>%
  as_tibble() %>%
  mutate(across(mpg, scale))
#> # A tibble: 32 x 1
#> mpg[,1]
#> <dbl>
#> 1 0.151
#> 2 0.151
#> 3 0.450
#> 4 0.217
#> 5 -0.231
#> 6 -0.330
#> 7 -0.961
#> 8 0.715
#> 9 0.450
#> 10 -0.148
#> # ... with 22 more rows
But if I use a different function than scale() (e.g., log()), the column header remains as-is:
mtcars_mpg_only %>%
  as_tibble() %>%
  mutate(across(mpg, log))
#> # A tibble: 32 x 1
#> mpg
#> <dbl>
#> 1 3.04
#> 2 3.04
#> 3 3.13
#> 4 3.06
#> 5 2.93
#> 6 2.90
#> 7 2.66
#> 8 3.19
#> 9 3.13
#> 10 2.95
#> # ... with 22 more rows
I know how to remove/rename [,1] after the fact, but my question is why it's created in the first place.

It is because scale() returns a matrix, whereas log() returns a plain vector. The mpg[,1] is actually a one-column matrix stored inside the data.frame. See ?scale for the definition of its return value.
class(scale(mtcars$mpg))
## [1] "matrix" "array"
class(log(mtcars$mpg))
## [1] "numeric"
Convert the matrix to a plain vector to avoid this, e.g.
mtcars_mpg_only %>%
  mutate(across(mpg, ~ c(scale(.))))

# or extracting the first column
mtcars_mpg_only %>%
  mutate(across(mpg, ~ scale(.)[, 1]))

# or normalizing using mean and sd
mtcars_mpg_only %>%
  mutate(across(mpg, ~ (. - mean(.)) / sd(.)))

# or without across
mtcars_mpg_only %>%
  mutate(mpg = c(scale(mpg)))

# or using base R
mtcars_mpg_only |>
  transform(mpg = c(scale(mpg)))

Related

R calculate most abundant taxa using phyloseq object

I would like to know if my approach to calculating the average relative abundance of any taxon is correct.
I want to calculate the relative abundance (percent) of each family (or any taxon) in a phyloseq object (GlobalPatterns). Is the following correct?
data("GlobalPatterns")
T <- GlobalPatterns %>%
  tax_glom(., "Family") %>%
  transform_sample_counts(function(x) 100 * x / sum(x)) %>%
  psmelt() %>%
  arrange(OTU) %>%
  rename(OTUsID = OTU) %>%
  select(OTUsID, Family, Sample, Abundance) %>%
  spread(Sample, Abundance)
T$Mean <- rowMeans(T[, c(3:ncol(T))])
FAM <- T[, c("Family", "Mean")]
# order data frame
FAM <- FAM[order(dplyr::desc(FAM$Mean)), ]
rownames(FAM) <- NULL
head(FAM)
Family Mean
1 Bacteroidaceae 7.490944
2 Ruminococcaceae 6.038956
3 Lachnospiraceae 5.758200
4 Flavobacteriaceae 5.016402
5 Desulfobulbaceae 3.341026
6 ACK-M1 3.242808
In this case Bacteroidaceae was the most abundant family across all samples of GlobalPatterns (26 samples and 19216 OTUs), with an average relative abundance of 7.49% over the 26 samples.
Is it correct to use T$Mean <- rowMeans(T[, c(3:ncol(T))]) to calculate the average for any given taxon?
Bacteroidaceae has the highest abundance when all samples are pooled together.
However, it has the highest abundance in only 2 individual samples.
Nevertheless, no other taxon has a higher abundance in an average sample.
Let's use dplyr verbs for all the steps to get more descriptive and consistent code:
library(tidyverse)
library(phyloseq)
#> Creating a generic function for 'nrow' from package 'base' in package 'biomformat'
#> Creating a generic function for 'ncol' from package 'base' in package 'biomformat'
#> Creating a generic function for 'rownames' from package 'base' in package 'biomformat'
#> Creating a generic function for 'colnames' from package 'base' in package 'biomformat'
data(GlobalPatterns)

data <-
  GlobalPatterns %>%
  tax_glom("Family") %>%
  transform_sample_counts(function(x) 100 * x / sum(x)) %>%
  psmelt() %>%
  as_tibble()

# highest abundance: all samples pooled together
data %>%
  group_by(Family) %>%
  summarise(Abundance = mean(Abundance)) %>%
  arrange(-Abundance)
#> # A tibble: 334 × 2
#> Family Abundance
#> <chr> <dbl>
#> 1 Bacteroidaceae 7.49
#> 2 Ruminococcaceae 6.04
#> 3 Lachnospiraceae 5.76
#> 4 Flavobacteriaceae 5.02
#> 5 Desulfobulbaceae 3.34
#> 6 ACK-M1 3.24
#> 7 Streptococcaceae 2.77
#> 8 Nostocaceae 2.62
#> 9 Enterobacteriaceae 2.55
#> 10 Spartobacteriaceae 2.45
#> # … with 324 more rows
# sanity check: is the total abundance of each sample 100%?
data %>%
  group_by(Sample) %>%
  summarise(Abundance = sum(Abundance)) %>%
  pull(Abundance) %>%
  `==`(100) %>%
  all()
#> [1] TRUE
# get the most abundant family for each sample individually
data %>%
  group_by(Sample) %>%
  arrange(-Abundance) %>%
  slice(1) %>%
  select(Family) %>%
  ungroup() %>%
  count(Family, name = "n_samples") %>%
  arrange(-n_samples)
#> Adding missing grouping variables: `Sample`
#> # A tibble: 18 × 2
#> Family n_samples
#> <chr> <int>
#> 1 Desulfobulbaceae 3
#> 2 Bacteroidaceae 2
#> 3 Crenotrichaceae 2
#> 4 Flavobacteriaceae 2
#> 5 Lachnospiraceae 2
#> 6 Ruminococcaceae 2
#> 7 Streptococcaceae 2
#> 8 ACK-M1 1
#> 9 Enterobacteriaceae 1
#> 10 Moraxellaceae 1
#> 11 Neisseriaceae 1
#> 12 Nostocaceae 1
#> 13 Solibacteraceae 1
#> 14 Spartobacteriaceae 1
#> 15 Sphingomonadaceae 1
#> 16 Synechococcaceae 1
#> 17 Veillonellaceae 1
#> 18 Verrucomicrobiaceae 1
Created on 2022-06-10 by the reprex package (v2.0.0)
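As a side note, dplyr (>= 1.0.0) has slice_max(), which makes the "top family per sample" step more direct. A sketch of that variant, not run in the session above (with_ties = FALSE keeps exactly one row per sample even if abundances tie):
# alternative: slice_max() instead of arrange() + slice()
data %>%
  group_by(Sample) %>%
  slice_max(Abundance, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  count(Family, name = "n_samples", sort = TRUE)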

Is there a way to summarise if column value is x?

I am trying to make a data.frame that displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups, but the output runs many rows down. See an example using the iris dataset...
data(iris)
x <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length == 0.1) %>%
  summarise(mean(Sepal.Length))

b <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length == 0.2) %>%
  summarise(mean(Sepal.Length))

left_join(a, b)
However, doing this twelve or more times is tedious, and I am sure there must be an easier way to get mean(Sepal.Length) for Petal.Length 0.1, 0.2, 0.3 (etc.) in one output.
n.b. in my data, Petal.Length would actually be characters that represent behaviours, and Sepal.Length would be the duration of time.
Some ideas:
library(tidyverse)
data(iris)
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
  group_by(Species, Petal.Length_discrete) %>%
  summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
  group_split(Species, Petal.Length) %>%
  map(~ summarise(.x, mean(Sepal.Length))) %>%
  head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#>
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
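To get exactly the shape asked for (one row per Species, one column per Petal.Length), the grouped summary can be reshaped with tidyr::pivot_wider(). A sketch, assuming tidyr >= 1.0.0 (mean_sl is just an illustrative column name):
iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean_sl = mean(Sepal.Length), .groups = "drop") %>%
  # one column per observed Petal.Length value; NA where a
  # Species/Petal.Length combination does not occur
  pivot_wider(names_from = Petal.Length, values_from = mean_sl)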

Dot notation in magrittr pipe not behaving as expected

I thought I understood that, in conjunction with the magrittr pipe, the dot notation indicates where the piped dataset should go for evaluation. When I started working with purrr/broom to generate nested dataframes holding the linear models I was fitting by group, I ran into a problem: with the dot notation, my prior group_by command seemed to be ignored. It took me a while to figure out that simply omitting the dot notation makes it work as expected, but I would like to understand why it doesn't work.
Here is sample code that I expected to generate identical data. Only the first example generates linear models by group; the second fits the model on the whole dataset but still stores it at the group level.
#// library and data prep
library(tidyverse)
library(broom)
data <- as_tibble(mtcars)
#// generates lm fit for the model by group
data %>%
  #// group by factor
  group_by(carb) %>%
  #// summary for the grouped dataset
  summarize(new = list(tidy(lm(formula = drat ~ mpg)))) %>%
  #// unnest
  unnest(cols = new)
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 6
#> carb term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 1.72e+ 0 5.85e- 1 2.94e+ 0 3.24e- 2
#> 2 1 mpg 7.75e- 2 2.26e- 2 3.44e+ 0 1.85e- 2
#> 3 2 (Intercept) 1.44e+ 0 5.87e- 1 2.46e+ 0 3.95e- 2
#> 4 2 mpg 1.01e- 1 2.55e- 2 3.95e+ 0 4.26e- 3
#> 5 3 (Intercept) 3.07e+ 0 6.86e-15 4.48e+14 1.42e-15
#> 6 3 mpg 3.46e-17 4.20e-16 8.25e- 2 9.48e- 1
#> 7 4 (Intercept) 2.18e+ 0 4.29e- 1 5.07e+ 0 9.65e- 4
#> 8 4 mpg 8.99e- 2 2.65e- 2 3.39e+ 0 9.43e- 3
#> 9 6 (Intercept) 3.62e+ 0 NaN NaN NaN
#> 10 6 mpg NA NA NA NA
#> 11 8 (Intercept) 3.54e+ 0 NaN NaN NaN
#> 12 8 mpg NA NA NA NA
#// generates lm fit for the whole model
data %>%
  #// group by factor
  group_by(carb) %>%
  #// summary for the whole dataset
  summarize(new = list(tidy(lm(formula = drat ~ mpg, data = .)))) %>%
  #// unnest
  unnest(cols = new)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 6
#> carb term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 2 1 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 3 2 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 4 2 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 5 3 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 6 3 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 7 4 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 8 4 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 9 6 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 10 6 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 11 8 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 12 8 mpg 0.0604 0.0119 5.10 1.78e- 5
Created on 2021-01-04 by the reprex package (v0.3.0)
. in this case refers to the data coming out of the previous step, i.e. data %>% group_by(carb). Although that data is grouped, it is still the complete dataset, which is why every group gets the same whole-data fit. If you are on dplyr >= 1.0.0 you can use cur_data() to refer to the data of the current group.
library(dplyr)
library(broom)
library(tidyr)
data %>%
  group_by(carb) %>%
  summarize(new = list(tidy(lm(formula = drat ~ mpg, data = cur_data())))) %>%
  unnest(cols = new)
This gives the same output as your first example.
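Note that in dplyr >= 1.1.0, cur_data() is superseded by pick(). A sketch of the same idea with pick(), not run here, relying on pick(everything()) returning the current group's non-grouping columns just like cur_data():
data %>%
  group_by(carb) %>%
  # pick(everything()) hands lm() the current group's data
  summarize(new = list(tidy(lm(formula = drat ~ mpg, data = pick(everything()))))) %>%
  unnest(cols = new)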
Note that you can use . to refer to the grouped data with group_modify instead of summarise:
data %>%
  group_by(carb) %>%
  group_modify(~ lm(formula = drat ~ mpg, data = .) %>% tidy())
* Just an alternative: I think list-columns + the unnest variants are considered the better approach now; a sketch of that style follows.
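A sketch of the list-column style with nest_by(), assuming dplyr >= 1.0.0 (newer dplyr versions suggest reframe() for multi-row summaries):
data %>%
  # nest_by() creates a rowwise frame with a list-column named `data`
  # (which shadows the outer `data` object inside summarise)
  nest_by(carb) %>%
  summarise(tidy(lm(drat ~ mpg, data = data)), .groups = "drop")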

RStudio: simulate my code 1000 times and pick the runs with p-value < 0.05

Here is my original code:
x <- rbinom(1000, 1, 0.5)
z <- log(1.3) * x
pr <- 1 / (1 + exp(-z))
y <- rbinom(1000, 1, pr)
k <- glm(y ~ x, family = "binomial")$coef
t <- exp(k)
How can I simulate this 1000 times and pick the runs with a p-value < 0.05?
This is a perfect application for the tidyverse and its list columns. Please see the explanation in the inline comments.
library(tidyverse)
library(broom)
# create a tibble with an id column for each simulation and x wrapped in list()
# (note: the same x vector is recycled across all 1000 simulations)
sim <- tibble(id = 1:1000,
              x = list(rbinom(1000, 1, 0.5))) %>%
  # to generate z, pr, y, k use map and map2 from the purrr package
  # to loop over the list column x
  # `~ ...` is similar to `function(.x) {...}`
  # `.x` represents the variable you are mapping over
  mutate(z = map(x, ~ log(1.3) * .x),
         pr = map(z, ~ 1 / (1 + exp(-.x))),
         y = map(pr, ~ rbinom(1000, 1, .x)),
         k = map2(x, y, ~ glm(.y ~ .x, family = "binomial")),
         # use broom::tidy to get the model summary in form of a tibble
         sum = map(k, broom::tidy)) %>%
  # select id and sum and unnest the tibbles
  select(id, sum) %>%
  unnest(cols = c(sum)) %>%
  # drop the intercepts and keep every .x with p < 0.05
  filter(term != "(Intercept)",
         p.value < 0.05)
sim
#> # A tibble: 545 x 6
#> id term estimate std.error statistic p.value
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 3 .x 0.301 0.127 2.37 0.0176
#> 2 7 .x 0.263 0.127 2.06 0.0392
#> 3 8 .x 0.293 0.127 2.31 0.0211
#> 4 11 .x 0.377 0.128 2.96 0.00312
#> 5 12 .x 0.265 0.127 2.08 0.0373
#> 6 13 .x 0.366 0.127 2.88 0.00403
#> 7 14 .x 0.461 0.128 3.61 0.000305
#> 8 17 .x 0.274 0.127 2.16 0.0309
#> 9 18 .x 0.394 0.127 3.09 0.00200
#> 10 19 .x 0.371 0.127 2.92 0.00354
#> # … with 535 more rows
Created on 2020-05-18 by the reprex package (v0.3.0)
I won't do this for you, but these are the steps you'll probably want to go through (a sketch follows below):
1. Write your code as a function that returns the value you're interested in (presumably t).
2. Use something like replicate to run this function many times and record all the answers.
3. Use something like quantile to extract the percentile you're interested in.
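For example, a minimal sketch of those steps; sim_once is a hypothetical helper name, and it keeps the odds ratio and p-value for x from each run:
# step 1: wrap the simulation in a function
sim_once <- function(n = 1000) {
  x <- rbinom(n, 1, 0.5)
  pr <- 1 / (1 + exp(-log(1.3) * x))
  y <- rbinom(n, 1, pr)
  fit <- summary(glm(y ~ x, family = "binomial"))
  c(estimate = exp(coef(fit)["x", "Estimate"]),
    p.value = coef(fit)["x", "Pr(>|z|)"])
}
# step 2: run it many times; t() gives one row per run
results <- t(replicate(1000, sim_once()))
# step 3: keep the runs with p < 0.05
significant <- results[results[, "p.value"] < 0.05, , drop = FALSE]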

How to write a function that conducts paired t-tests on all group/variable combinations in a data frame

I have a data frame similar to the data created below:
ID <- data.frame(ID = rep(c(12, 122, 242, 329, 595, 130, 145, 245, 654, 878), each = 5))
Var <- data.frame(Variable = c("Copper", "Iron", "Lead", "Zinc", "CaCO"))
n <- 10
Variable <- do.call("rbind", replicate(n, Var, simplify = F))
Location <- rep(c("Alpha", "Beta", "Gamma"), times = c(20, 20, 10))
Location <- data.frame(Location)
set.seed(1)
FirstPt <- data.frame(FirstPt = sample(1:100, 50, replace = T))
LastPt <- data.frame(LastPt = sample(1:100, 50, replace = T))
First3 <- data.frame(First3 = sample(1:100, 50, replace = T))
First5 <- data.frame(First5 = sample(1:100, 50, replace = T))
First7 <- data.frame(First7 = sample(1:100, 50, replace = T))
First10 <- data.frame(First10 = sample(1:100, 50, replace = T))
Last3 <- data.frame(Last3 = sample(1:100, 50, replace = T))
Last5 <- data.frame(Last5 = sample(1:100, 50, replace = T))
Last7 <- data.frame(Last7 = sample(1:100, 50, replace = T))
Last10 <- data.frame(Last10 = sample(1:100, 50, replace = T))
data <- cbind(ID, Location, Variable, FirstPt, LastPt, First3, First5, First7,
              First10, Last3, Last5, Last7, Last10)
This may be a two-part question, but I want to write a function that groups all observations of the same Variable (for instance, all the observations that are Copper) and conducts a paired t-test between all possible combinations of the numeric columns (FirstPt:Last10). I want it to return the p-values in a data frame like this:
Test               P-Value
FirstPt.vs.LastPt  ...
FirstPt.vs.First3  ...
etc...             ...
This will likely be a second function, but I also want to do this after the observations are grouped by Location, so that the output data frame will look like this:
Test                       P-Value
FirstPt.vs.LastPt.InAlpha  ...
FirstPt.vs.LastPt.InBeta   ...
etc...                     ...
You can do both of these with one function:
library(tidyverse)
t.test.by.group.combos <- function(.data, groups){
  by <- gsub(x = rlang::quo_get_expr(enquo(groups)),
             pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
  .data %>%
    group_by(!!!groups) %>%
    select_if(is.integer) %>%
    group_split() %>%
    map(.,
        ~ pivot_longer(., cols = (FirstPt:Last10), names_to = "name", values_to = "val") %>%
          nest(data = val) %>%
          full_join(., ., by = by) %>%
          filter(name.x != name.y) %>%
          mutate(test = paste(name.x, "vs", name.y, !!!groups, sep = "."),
                 p.value = map2_dbl(data.x, data.y,
                                    ~ t.test(unlist(.x), unlist(.y))$p.value)) %>%
          select(test, p.value) %>%
          filter(!duplicated(p.value))
    ) %>%
    bind_rows()
}
t.test.by.group.combos(data, vars(Variable))
#> # A tibble: 225 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO 0.511
#> 2 FirstPt.vs.First3.CaCO 0.184
#> 3 FirstPt.vs.First5.CaCO 0.494
#> 4 FirstPt.vs.First7.CaCO 0.354
#> 5 FirstPt.vs.First10.CaCO 0.893
#> 6 FirstPt.vs.Last3.CaCO 0.496
#> 7 FirstPt.vs.Last5.CaCO 0.909
#> 8 FirstPt.vs.Last7.CaCO 0.439
#> 9 FirstPt.vs.Last10.CaCO 0.146
#> 10 LastPt.vs.First3.CaCO 0.578
#> # … with 215 more rows
t.test.by.group.combos(data, vars(Variable, Location))
#> # A tibble: 674 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO.Alpha 0.850
#> 2 FirstPt.vs.First3.CaCO.Alpha 0.822
#> 3 FirstPt.vs.First5.CaCO.Alpha 0.895
#> 4 FirstPt.vs.First7.CaCO.Alpha 0.810
#> 5 FirstPt.vs.First10.CaCO.Alpha 0.645
#> 6 FirstPt.vs.Last3.CaCO.Alpha 0.870
#> 7 FirstPt.vs.Last5.CaCO.Alpha 0.465
#> 8 FirstPt.vs.Last7.CaCO.Alpha 0.115
#> 9 FirstPt.vs.Last10.CaCO.Alpha 0.474
#> 10 LastPt.vs.First3.CaCO.Alpha 0.991
#> # … with 664 more rows
This is kind of a lengthy function, but in general we group by the groups argument, select the grouping columns plus any integer columns, and split the dataframe by group. Then we map over all combinations of variables and perform a t.test for each combo. Lastly, we bind all the groups back into one dataframe.
I think this is what you want. The key was to use group_by and do from the tidyverse.
df <- NULL
for (i in (4:(ncol(data) - 1))) {
  for (j in ((i + 1):ncol(data))) {
    df <- rbind(df,
                data %>%
                  group_by(Location) %>%
                  do(data.frame(pval = t.test(.[[i]], .[[j]], data = .)$p.value)) %>%
                  ungroup() %>%
                  mutate(Test = paste0(colnames(data)[i], '.vs.', colnames(data)[j]))
    )
  }
}
df$Test <- paste0(df$Test, '.In', df$Location)
Probably, you can achieve what you want using the code below:
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(cols = FirstPt:Last10) %>%
  group_by(Variable) %>%
  summarise(p_value = list(combn(name, 2, function(x)
              t.test(value[name == x[1]], value[name == x[2]])$p.value)),
            test = list(combn(name, 2, paste, collapse = "_"))) %>%
  unnest(cols = c(test, p_value))
# Variable p_value test
# <fct> <dbl> <chr>
# 1 CaCO 0.915 FirstPt_LastPt
# 2 CaCO 0.529 FirstPt_First3
# 3 CaCO 0.337 FirstPt_First5
# 4 CaCO 0.350 FirstPt_First7
# 5 CaCO 0.395 FirstPt_First10
# 6 CaCO 0.765 FirstPt_Last3
# 7 CaCO 0.204 FirstPt_Last5
# 8 CaCO 0.873 FirstPt_Last7
# 9 CaCO 0.479 FirstPt_Last10
#10 CaCO 1 FirstPt_FirstPt
# … with 24,740 more rows
To do it grouped by Location as well, add Location to the group_by() call and keep the rest of the code as it is; see the sketch below.
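A sketch of that grouped-by-Location variant (same code, with Location added to group_by(); .groups = "drop" just silences the regrouping message):
data %>%
  pivot_longer(cols = FirstPt:Last10) %>%
  group_by(Variable, Location) %>%
  summarise(p_value = list(combn(name, 2, function(x)
              t.test(value[name == x[1]], value[name == x[2]])$p.value)),
            test = list(combn(name, 2, paste, collapse = "_")),
            .groups = "drop") %>%
  unnest(cols = c(test, p_value))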
