I am running a Shapiro-Wilk test on multiple variables.
I do this as follows:
list <- lapply(mtcars, shapiro.test)
I want to save the output of list as a .txt file.
I have tried doing this:
write.table(paste(list), "SW List.txt")
That produces this:
When what I want is a .txt file with the variable names, as shown in the console when I run list:
What if, instead, you collect the statistics and p-values in a data frame and then save that data frame as text?
library(tidyverse)
imap_dfr(mtcars,
~ shapiro.test(.x) |>
(\(st) tibble(var = .y,
W = st$statistic,
p.value = st$p.value))())
#> # A tibble: 11 x 3
#> var W p.value
#> <chr> <dbl> <dbl>
#> 1 mpg 0.948 0.123
#> 2 cyl 0.753 0.00000606
#> 3 disp 0.920 0.0208
#> 4 hp 0.933 0.0488
#> 5 drat 0.946 0.110
#> 6 wt 0.943 0.0927
#> 7 qsec 0.973 0.594
#> 8 vs 0.632 0.0000000974
#> 9 am 0.625 0.0000000784
#> 10 gear 0.773 0.0000131
#> 11 carb 0.851 0.000438
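To actually get the results onto disk, as the question asks, something along these lines should work. This is only a sketch in base R: the object names (sw_tests, sw_table) and the file names are made up for illustration, and tempdir() is a placeholder for whatever folder you want to write to.

```r
# Run the Shapiro-Wilk test on every column of mtcars
sw_tests <- lapply(mtcars, shapiro.test)

# Collect the statistic and p-value of each test in a plain data frame
sw_table <- data.frame(
  variable = names(sw_tests),
  W        = vapply(sw_tests, function(st) unname(st$statistic),
                    numeric(1), USE.NAMES = FALSE),
  p.value  = vapply(sw_tests, function(st) st$p.value,
                    numeric(1), USE.NAMES = FALSE)
)

# Placeholder paths (assumption): swap tempdir() for your own folder
table_file   <- file.path(tempdir(), "SW_table.txt")
console_file <- file.path(tempdir(), "SW_console.txt")

# Option 1: a tab-separated table, one row per variable
write.table(sw_table, table_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Option 2: the console printout itself, including the $mpg, $cyl, ... headers
capture.output(print(sw_tests), file = console_file)
```

Option 1 is easier to read back into R later; option 2 is closest to what the console shows when you print the list.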
For the purposes of this question, let's create the following setup:
mtcars %>%
group_split(carb) %>%
map(select, mpg) -> criterion
mtcars %>%
group_split(carb) %>%
map(select, qsec) -> predictor
This code creates two lists of length 6. What I want to do is fit one linear regression within each of these 6 groups (6 regressions in total). I read about the map2 function and thought the code should look like this:
map2(criterion, predictor, lm(criterion ~ predictor))
But that doesn't seem to work. So in which way could this be done?
Convert both lists with simplify2array() (you need a list of vectors, not a list of data frames) and pass a lambda function written with ~:
map2(simplify2array(criterion), simplify2array(predictor), ~ lm(.x ~ .y))
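The reason the original call fails is that the third argument of map2() has to be a function (or a ~ formula that purrr turns into one), whereas lm(criterion ~ predictor) is evaluated immediately. The same idea can be sketched in base R with Map(); the names groups, models and slopes are only illustrative.

```r
# Split mtcars by carb; each element is the data frame of one group
groups <- split(mtcars, mtcars$carb)

# Lists of plain numeric vectors, one per group
criterion <- lapply(groups, function(d) d$mpg)
predictor <- lapply(groups, function(d) d$qsec)

# Map() walks both lists in parallel and calls the function on each pair
models <- Map(function(y, x) lm(y ~ x), criterion, predictor)

# Slope of each regression (NA where a group has only a single row)
slopes <- vapply(models, function(m) unname(coef(m)[2]), numeric(1))
```

Passing an anonymous function here plays the same role as ~ lm(.x ~ .y) in the purrr version.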
While the direct answer to your question is already given, note that we can also use dplyr::nest_by() and then proceed automatically rowwise.
Now your models are stored in the mod column and we can use broom::tidy etc. to work with the models.
library(dplyr)
library(tidyr)
mtcars %>%
nest_by(carb) %>%
mutate(mod = list(lm(mpg ~ qsec, data = data)),
res = list(broom::tidy(mod))) %>%
unnest(res) %>%
filter(term != "(Intercept)")
#> # A tibble: 6 x 8
#> # Groups: carb [6]
#> carb data mod term estimate std.error statistic p.value
#> <dbl> <list<tibble[,10]>> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 [7 x 10] <lm> qsec -1.26 4.51 -0.279 0.791
#> 2 2 [10 x 10] <lm> qsec 0.446 0.971 0.460 0.658
#> 3 3 [3 x 10] <lm> qsec -2.46 2.41 -1.02 0.493
#> 4 4 [10 x 10] <lm> qsec 0.0597 0.991 0.0602 0.953
#> 5 6 [1 x 10] <lm> qsec NA NA NA NA
#> 6 8 [1 x 10] <lm> qsec NA NA NA NA
Created on 2022-09-30 by the reprex package (v2.0.1)
I have the following numeric data frame dataset:
x1 x2 x3 ...
1 2 3
...
I applied the Shapiro-Wilk test to all columns as follows:
lshap <- lapply(dataset, shapiro.test)
lres <- t(sapply(lshap, `[`, c("statistic","p.value")))
The output of lres looks like this:
statistic p.value
Strong 0.8855107 6.884855e-14
Hardworking 0.9360735 8.031421e-10
Focused 0.9350827 6.421583e-10
Now, when I do:
class(lres)
It gives me "matrix" "array"
My question is: how do I convert lres to a data frame?
I want this output as a data frame:
variable statistic p.value
Strong 0.8855107 6.884855e-14
Hardworking 0.9360735 8.031421e-10
Focused 0.9350827 6.421583e-10
...
When I do to_df <- as.data.frame(lres) I get the following weird output:
statistic p.value
Strong <dbl [1]> <dbl [1]>
Hardworking <dbl [1]> <dbl [1]>
Focused <dbl [1]> <dbl [1]>
Gritty <dbl [1]> <dbl [1]>
Adaptable <dbl [1]> <dbl [1]>
...
What is wrong with this?
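In base R, the issue with the OP's 'lres' is that sapply() with `[` returns a matrix in which every cell is a list element, so as.data.frame() produces list columns. Instead of doing that, we could build a small data frame per variable and bind them together: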
out <- do.call(rbind, lapply(mtcars, function(x)
as.data.frame(shapiro.test(x)[c('statistic', 'p.value')])))
out <- cbind(variable = row.names(out), out)
row.names(out) <- NULL
-output
out
# variable statistic p.value
#1 mpg 0.9475647 1.228814e-01
#2 cyl 0.7533100 6.058338e-06
#3 disp 0.9200127 2.080657e-02
#4 hp 0.9334193 4.880824e-02
#5 drat 0.9458839 1.100608e-01
#6 wt 0.9432577 9.265499e-02
#7 qsec 0.9732509 5.935176e-01
#8 vs 0.6322635 9.737376e-08
#9 am 0.6250744 7.836354e-08
#10 gear 0.7727856 1.306844e-05
#11 carb 0.8510972 4.382405e-04
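To see the list-matrix problem directly, and one way around it that stays close to the OP's code, here is a small sketch (lres_num and to_df are illustrative names, and mtcars stands in for the OP's dataset):

```r
lshap <- lapply(mtcars, shapiro.test)

# `[` keeps each result as a list, so sapply() builds a matrix whose cells are lists
lres <- t(sapply(lshap, `[`, c("statistic", "p.value")))
typeof(lres)   # "list": a list-matrix, not a numeric matrix

# Pulling the values out with `$` inside vapply() yields plain numbers instead
lres_num <- t(vapply(lshap,
                     function(st) c(statistic = unname(st$statistic),
                                    p.value   = st$p.value),
                     c(statistic = 0, p.value = 0)))

# A numeric matrix converts to an ordinary data frame
to_df <- as.data.frame(lres_num)
to_df <- cbind(variable = rownames(to_df), to_df)
rownames(to_df) <- NULL
```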
Or we can use as_tibble
library(dplyr)
library(tidyr)
as_tibble(lres, rownames = 'variable') %>%
unnest(-variable)
-output
# A tibble: 11 x 3
# variable statistic p.value
# <chr> <dbl> <dbl>
# 1 mpg 0.948 0.123
# 2 cyl 0.753 0.00000606
# 3 disp 0.920 0.0208
# 4 hp 0.933 0.0488
# 5 drat 0.946 0.110
# 6 wt 0.943 0.0927
# 7 qsec 0.973 0.594
# 8 vs 0.632 0.0000000974
# 9 am 0.625 0.0000000784
#10 gear 0.773 0.0000131
#11 carb 0.851 0.000438
Or can be done in a single step
library(purrr)
library(broom)
imap_dfr(mtcars, ~ shapiro.test(.x) %>%
tidy %>%
select(-method), .id = 'variable')
-output
# A tibble: 11 x 3
# variable statistic p.value
# <chr> <dbl> <dbl>
# 1 mpg 0.948 0.123
# 2 cyl 0.753 0.00000606
# 3 disp 0.920 0.0208
# 4 hp 0.933 0.0488
# 5 drat 0.946 0.110
# 6 wt 0.943 0.0927
# 7 qsec 0.973 0.594
# 8 vs 0.632 0.0000000974
# 9 am 0.625 0.0000000784
#10 gear 0.773 0.0000131
#11 carb 0.851 0.000438
data
lshap <- lapply(mtcars, shapiro.test)
lres <- t(sapply(lshap, `[`, c("statistic","p.value")))
I thought I understood that, in conjunction with the magrittr pipe, the dot notation indicates where the dataset that is piped into a function should go for evaluation. When I started working with purrr/broom to build nested data frames holding linear models fitted by group, I ran into a problem: when using the dot notation, my prior group_by command seemed to be ignored. It took me a while to figure out that I should simply omit the dot notation, after which it works as expected, but I would like to understand why it does not work with the dot.
Here is the sample code that I expected to generate identical data, but only the first example is generating linear models by group, while the second generates the model for the whole dataset, but then still stores it at the group level.
#// library and data prep
library(tidyverse)
library(broom)
data <- as_tibble(mtcars)
#// generates lm fit for the model by group
data %>%
#// group by factor
group_by(carb) %>%
#// summary for the grouped dataset
summarize(new = list( tidy( lm(formula = drat ~ mpg)))) %>%
#// unnest
unnest(cols = new)
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 6
#> carb term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 1.72e+ 0 5.85e- 1 2.94e+ 0 3.24e- 2
#> 2 1 mpg 7.75e- 2 2.26e- 2 3.44e+ 0 1.85e- 2
#> 3 2 (Intercept) 1.44e+ 0 5.87e- 1 2.46e+ 0 3.95e- 2
#> 4 2 mpg 1.01e- 1 2.55e- 2 3.95e+ 0 4.26e- 3
#> 5 3 (Intercept) 3.07e+ 0 6.86e-15 4.48e+14 1.42e-15
#> 6 3 mpg 3.46e-17 4.20e-16 8.25e- 2 9.48e- 1
#> 7 4 (Intercept) 2.18e+ 0 4.29e- 1 5.07e+ 0 9.65e- 4
#> 8 4 mpg 8.99e- 2 2.65e- 2 3.39e+ 0 9.43e- 3
#> 9 6 (Intercept) 3.62e+ 0 NaN NaN NaN
#> 10 6 mpg NA NA NA NA
#> 11 8 (Intercept) 3.54e+ 0 NaN NaN NaN
#> 12 8 mpg NA NA NA NA
#// generates lm fit for the whole model
data %>%
#// group by factor
group_by(carb) %>%
#// summary for the whole dataset
summarize(new = list( tidy( lm(formula = drat ~ mpg, data = .)))) %>%
#// unnest
unnest(cols = new)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 6
#> carb term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 2 1 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 3 2 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 4 2 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 5 3 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 6 3 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 7 4 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 8 4 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 9 6 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 10 6 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 11 8 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 12 8 mpg 0.0604 0.0119 5.10 1.78e- 5
Created on 2021-01-04 by the reprex package (v0.3.0)
. in this case refers to the data from the previous step, i.e. (data %>% group_by(carb)). Although that data is grouped, it is still the complete data set, so lm() is fitted to all rows for every group. If you are on dplyr >= 1.0.0 you could use cur_data() to refer to the data in the current group (cur_data() was deprecated in dplyr 1.1.0 in favour of pick()).
library(dplyr)
library(broom)
library(tidyr)
data %>%
group_by(carb) %>%
summarize(new = list(tidy(lm(formula = drat ~ mpg, data = cur_data())))) %>%
unnest(cols = new)
This gives the same output as your first example.
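A quick way to convince yourself of what . holds inside summarize() is to count rows. This is just an illustrative sketch; the column names rows_in_dot and rows_in_group are made up.

```r
library(dplyr)

counts <- mtcars %>%
  group_by(carb) %>%
  summarize(rows_in_dot   = nrow(.),  # `.` is the whole grouped tibble
            rows_in_group = n())      # n() is evaluated once per group

counts
```

rows_in_dot is the same for every group because the pipe substitutes the full piped-in object for ., while n() is evaluated by dplyr within each group.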
Note that you can use . to refer to the grouped data with group_modify instead of summarise:
data %>%
group_by(carb) %>%
group_modify(~lm(formula = drat ~ mpg, data = .) %>% tidy)
* Just an alternative - I think list-columns + unnest-variants are considered the better approach now.
I want to calculate the pair-wise correlations between "mpg" and all other numeric variables of interest for each cyl in the mtcars dataset. I would like to adopt the tidy data principle.
It's rather easy with corrr::correlate().
library(dplyr)
library(tidyr)
library(purrr)
library(corrr)
data(mtcars)
mtcars2 <- mtcars[,1:7] %>%
group_nest(cyl) %>%
mutate(cors = map(data, corrr::correlate),
stretch = map(cors, corrr::stretch)) %>%
unnest(stretch)
mtcars2 %>%
filter(x == "mpg")
By using corrr::correlate(), all available pair-wise correlations have been calculated. I could use dplyr::filter() to select the correlations of interest.
However, when datasets are large, a lot of calculations go to the unwanted correlations, making this approach very time-consuming. So I tried to calculate only mpg vs. others. I'm not very familiar with purrr, and the following code doesn't work.
mtcars2 <- mtcars[,1:7] %>%
group_nest(cyl) %>%
mutate(comp = map(data, ~colnames),
corr = map(comp, ~cor.test(data[["mpg"]], data[[.]])))
If you need to use cor.test, below is an option using broom:
library(broom)
library(tidyr)
library(dplyr)
mtcars[,1:7] %>%
pivot_longer(-c(mpg,cyl)) %>%
group_by(cyl,name) %>%
do(tidy(cor.test(.$mpg,.$value)))
# A tibble: 15 x 10
# Groups: cyl, name [15]
cyl name estimate statistic p.value parameter conf.low conf.high method
<dbl> <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
1 4 disp -0.805 -4.07 0.00278 9 -0.947 -0.397 Pears…
2 4 drat 0.424 1.41 0.193 9 -0.236 0.816 Pears…
3 4 hp -0.524 -1.84 0.0984 9 -0.855 0.111 Pears…
4 4 qsec -0.236 -0.728 0.485 9 -0.732 0.424 Pears…
5 4 wt -0.713 -3.05 0.0137 9 -0.920 -0.198 Pears…
6 6 disp 0.103 0.232 0.826 5 -0.705 0.794 Pears…
7 6 drat 0.115 0.258 0.807 5 -0.699 0.799 Pears…
If you just need the correlation, for big datasets, the nesting etc might be costly and unnecessary because you can simply do cor(,) and melt that:
library(purrr)  # for map_dfr()
# define columns to correlate
cor_vars = setdiff(colnames(mtcars)[1:7],"cyl")
split(mtcars[,1:7],mtcars$cyl) %>%
map_dfr(~data.frame(x="mpg",y=cor_vars,
cyl=unique(.x$cyl),rho=as.numeric(cor(.x$mpg,.x[,cor_vars]))))
x y cyl rho
1 mpg mpg 4 1.00000000
2 mpg disp 4 -0.80523608
3 mpg hp 4 -0.52350342
4 mpg drat 4 0.42423947
5 mpg wt 4 -0.71318483
6 mpg qsec 4 -0.23595389
7 mpg mpg 6 1.00000000
8 mpg disp 6 0.10308269
9 mpg hp 6 -0.12706785
10 mpg drat 6 0.11471598
11 mpg wt 6 -0.68154982
12 mpg qsec 6 -0.41871779
13 mpg mpg 8 1.00000000
14 mpg disp 8 -0.51976704
15 mpg hp 8 -0.28363567
16 mpg drat 8 0.04793248
17 mpg wt 8 -0.65035801
18 mpg qsec 8 -0.10433602
Would this work for you? I have done this in the past, but only on smallish datasets, and I have not benchmarked it, so I am not sure about performance. I use pivot_longer to reshape the data prior to nesting. The variables you pass to pivot_longer more or less act as the filtering step:
mtcars2 <- mtcars[,1:7] %>%
pivot_longer(c(-mpg, -cyl), names_to = "y.var", values_to = "value" ) %>%
group_nest(cyl, y.var) %>%
mutate(x.var = "mpg", #just so you can see this in the output
cor = map_dbl(data, ~ {cor <- cor.test(.x$mpg, .x$value)
cor$estimate})) %>%
select(data, cyl, x.var , y.var, cor) %>%
arrange(cyl, y.var)
I have a list of data.frame and I'd like to run cor.test through each data.frame.
The data.frame has 8 columns, I would like to run cor.test for each of the first 7 columns against the 8th column.
I first set up the lists for storing the data
estimates = list()
pvalues = list()
Then here's the loop combining with lapply
for (i in 1:7){
corr <- lapply(datalist, function(x) {cor.test(x[,i], x[,8], alternative="two.sided", method="spearman", exact=FALSE, continuity=TRUE)})
estimates= corr$estimate
pvalues= corr$p.value
}
It ran without any errors, but estimates is NULL.
Which part of this went wrong? I have run a for loop over cor.test before, and I have run it with lapply, but I have never put the two together. I wonder if there's a solution to this or an alternative. Thank you.
We can use sapply. Here is an example on mtcars in which cor.test is run for every other column against the 8th column.
lst <- list(mtcars, mtcars)
lapply(lst, function(x) t(sapply(x[-8], function(y) {
val <- cor.test(y, x[[8]], alternative ="two.sided",
method="spearman", exact=FALSE, continuity=TRUE)
c(val$estimate, pval = val$p.value)
})))
[[1]]
# rho pval
#mpg 0.7065968 6.176953e-06
#cyl -0.8137890 1.520674e-08
#disp -0.7236643 2.906504e-06
#hp -0.7515934 7.247490e-07
#drat 0.4474575 1.021422e-02
#wt -0.5870162 4.163577e-04
#qsec 0.7915715 6.843882e-08
#am 0.1683451 3.566025e-01
#gear 0.2826617 1.168159e-01
#carb -0.6336948 9.977275e-05
#[[2]]
# rho pval
#mpg 0.7065968 6.176953e-06
#cyl -0.8137890 1.520674e-08
#.....
This returns a list of two-column matrices, with the estimate and the p-value in the first and second column respectively.
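As for why estimates came out NULL in the question: lapply() returns a list of test results, so corr$estimate looks for an element named "estimate" in that outer list and finds none; on top of that, the loop overwrites estimates on every iteration. A minimal sketch of both points, where datalist is a hypothetical stand-in built from mtcars and run_tests is an illustrative helper, not part of the original code:

```r
# Hypothetical stand-in for the question's list of 8-column data frames
datalist <- list(a = mtcars[, 1:8], b = mtcars[, 1:8])

# lapply() returns a list of test results, one per data frame ...
corr <- lapply(datalist, function(x)
  cor.test(x[, 1], x[, 8], method = "spearman", exact = FALSE))
is.null(corr$estimate)   # TRUE: the outer list has no element "estimate"
corr$a$estimate          # the estimate lives inside each element

# Collect one value per column (columns 1-7 against column 8)
run_tests <- function(x, what) {
  vapply(x[1:7], function(col)
    unname(cor.test(col, x[[8]], alternative = "two.sided",
                    method = "spearman", exact = FALSE,
                    continuity = TRUE)[[what]]),
    numeric(1))
}

# One column per data frame, one row per tested variable
estimates <- sapply(datalist, run_tests, what = "estimate")
pvalues   <- sapply(datalist, run_tests, what = "p.value")
```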
Disclaimer: This answer uses the developer version of manymodelr that I also wrote.
EDIT: You can map it to your list of data frames with Map or lapply for instance:
lst <- list(mtcars, mtcars) #Line copied and pasted from #Ronak Shah's answer
Map(function(x) manymodelr::get_var_corr(x, "mpg",get_all = TRUE,
alternative="two.sided",
method="spearman",
continuity=TRUE,exact=FALSE),lst)
For a single data.frame object, we can use get_var_corr:
manymodelr::get_var_corr(mtcars, "mpg",get_all = TRUE,
alternative="two.sided",
method="spearman",
continuity=TRUE,exact=FALSE)
# Comparison_Var Other_Var p.value Correlation
# 1 mpg cyl 4.962301e-13 -0.9108013
# 2 mpg disp 6.731078e-13 -0.9088824
# 3 mpg hp 5.330559e-12 -0.8946646
# 4 mpg drat 5.369227e-05 0.6514555
# 5 mpg wt 1.553261e-11 -0.8864220
# 6 mpg qsec 7.042244e-03 0.4669358
# 7 mpg vs 6.176953e-06 0.7065968
# 8 mpg am 8.139885e-04 0.5620057
# 9 mpg gear 1.325942e-03 0.5427816
# 10 mpg carb 4.385340e-05 -0.6574976
purrr has some convenience functions that could make this operation a little simpler (although it's debatable whether this is actually simpler than the Map/lapply way). Using Ronak's example list lst:
library(purrr)
library(magrittr)  # provides extract(), an alias for `[`
lst <- list(mtcars, mtcars)
map2(map(lst, ~.[-8]), map(lst, 8), ~
map(.x, cor.test, y = .y,
alternative = "two.sided",
method = "spearman",
exact = FALSE,
continuity = TRUE) %>%
map_dfr(extract, c('estimate', 'p.value'), .id = 'var'))
# [[1]]
# # A tibble: 10 x 3
# var estimate p.value
# <chr> <dbl> <dbl>
# 1 mpg 0.707 0.00000618
# 2 cyl -0.814 0.0000000152
# 3 disp -0.724 0.00000291
# 4 hp -0.752 0.000000725
# 5 drat 0.447 0.0102
# 6 wt -0.587 0.000416
# 7 qsec 0.792 0.0000000684
# 8 am 0.168 0.357
# 9 gear 0.283 0.117
# 10 carb -0.634 0.0000998
#
# [[2]]
# # A tibble: 10 x 3
# var estimate p.value
# <chr> <dbl> <dbl>
# 1 mpg 0.707 0.00000618
# 2 cyl -0.814 0.0000000152
# 3 disp -0.724 0.00000291
# 4 hp -0.752 0.000000725
# 5 drat 0.447 0.0102
# 6 wt -0.587 0.000416
# 7 qsec 0.792 0.0000000684
# 8 am 0.168 0.357
# 9 gear 0.283 0.117
# 10 carb -0.634 0.0000998