I thought I understood that, in conjunction with the magrittr pipe, the dot notation indicates where the piped-in dataset should go for evaluation. When I started working with purrr/broom to build nested data frames of linear models fitted by group, I ran into a problem: when I use the dot notation, my prior group_by call seems to be ignored. It took me a while to figure out that simply omitting the dot notation makes it work as expected, but I would like to understand why it does not work with the dot.
Here is the sample code that I expected to generate identical output. Only the first example fits a linear model per group; the second fits the model on the whole dataset and then stores that same result for every group.
#// library and data prep
library(tidyverse)
library(broom)
data <- as_tibble(mtcars)
#// generates lm fit for the model by group
data %>%
#// group by factor
group_by(carb) %>%
#// summary for the grouped dataset
summarize(new = list( tidy( lm(formula = drat ~ mpg)))) %>%
#// unnest
unnest(cols = new)
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 6
#> carb term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 1.72e+ 0 5.85e- 1 2.94e+ 0 3.24e- 2
#> 2 1 mpg 7.75e- 2 2.26e- 2 3.44e+ 0 1.85e- 2
#> 3 2 (Intercept) 1.44e+ 0 5.87e- 1 2.46e+ 0 3.95e- 2
#> 4 2 mpg 1.01e- 1 2.55e- 2 3.95e+ 0 4.26e- 3
#> 5 3 (Intercept) 3.07e+ 0 6.86e-15 4.48e+14 1.42e-15
#> 6 3 mpg 3.46e-17 4.20e-16 8.25e- 2 9.48e- 1
#> 7 4 (Intercept) 2.18e+ 0 4.29e- 1 5.07e+ 0 9.65e- 4
#> 8 4 mpg 8.99e- 2 2.65e- 2 3.39e+ 0 9.43e- 3
#> 9 6 (Intercept) 3.62e+ 0 NaN NaN NaN
#> 10 6 mpg NA NA NA NA
#> 11 8 (Intercept) 3.54e+ 0 NaN NaN NaN
#> 12 8 mpg NA NA NA NA
#// generates lm fit for the whole model
data %>%
#// group by factor
group_by(carb) %>%
#// summary for the whole dataset
summarize(new = list( tidy( lm(formula = drat ~ mpg, data = .)))) %>%
#// unnest
unnest(cols = new)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 6
#> carb term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 2 1 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 3 2 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 4 2 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 5 3 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 6 3 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 7 4 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 8 4 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 9 6 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 10 6 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 11 8 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 12 8 mpg 0.0604 0.0119 5.10 1.78e- 5
Created on 2021-01-04 by the reprex package (v0.3.0)
In this case . refers to the data from the previous step, i.e. the result of data %>% group_by(carb). Although that data is grouped, it is still the complete dataset, so lm() sees all rows every time. If you are on dplyr >= 1.0.0, you can use cur_data() to refer to the data of the current group (note that dplyr 1.1.0 later deprecated cur_data() in favour of pick()).
library(dplyr)
library(broom)
library(tidyr)
data %>%
group_by(carb) %>%
summarize(new = list(tidy(lm(formula = drat ~ mpg, data = cur_data())))) %>%
unnest(cols = new)
This gives the same output as your first example.
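A quick way to see what . is bound to inside summarize is to compare the row count it reports with the per-group count. This is only a sketch; the column names n_dot and n_group are made up for illustration.

```r
library(dplyr)

data <- as_tibble(mtcars)

# `.` is the whole grouped tibble handed over by the pipe, so nrow(.) is
# identical for every group; n() is evaluated per group by summarize().
check <- data %>%
  group_by(carb) %>%
  summarize(n_dot = nrow(.), n_group = n())

check
```

n_dot repeats the total number of rows of data for every carb value, while n_group varies by group, which is why lm(..., data = .) fits the same model each time.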
Note that you can use . to refer to the grouped data with group_modify instead of summarise:
data %>%
group_by(carb) %>%
group_modify(~lm(formula = drat ~ mpg, data = .) %>% tidy)
* Just an alternative - I think list-columns + unnest-variants are considered the better approach now.
I am doing a Shapiro Wilks test for multiple variables.
I do this as follows:
list= lapply(mtcars, shapiro.test)
I want to save the output of list as a .txt file.
I have tried doing this:
write.table(paste(list), "SW List.txt")
That produces this:
When what I want is a .txt file with the variable names, as shown in the console when I run list:
What if, instead, you map all the statistics and p-values into a data frame and then save that data frame to a text file?
library(tidyverse)
imap_dfr(mtcars,
~ shapiro.test(.x) |>
(\(st) tibble(var = .y,
W = st$statistic,
p.value = st$p.value))())
#> # A tibble: 11 x 3
#> var W p.value
#> <chr> <dbl> <dbl>
#> 1 mpg 0.948 0.123
#> 2 cyl 0.753 0.00000606
#> 3 disp 0.920 0.0208
#> 4 hp 0.933 0.0488
#> 5 drat 0.946 0.110
#> 6 wt 0.943 0.0927
#> 7 qsec 0.973 0.594
#> 8 vs 0.632 0.0000000974
#> 9 am 0.625 0.0000000784
#> 10 gear 0.773 0.0000131
#> 11 carb 0.851 0.000438
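The saving step itself could look roughly like the following base R sketch. The file names and the temporary directory are placeholders chosen for illustration, and sw_table is a hypothetical name; swap in your own path.

```r
# Run the Shapiro-Wilk test on every column of mtcars
sw <- lapply(mtcars, shapiro.test)

# Collect the statistic and p-value per variable in one data frame
sw_table <- data.frame(
  var = names(sw),
  W = vapply(sw, function(st) unname(st$statistic), numeric(1)),
  p.value = vapply(sw, function(st) st$p.value, numeric(1)),
  row.names = NULL
)

# Placeholder output path; replace with e.g. "SW List.txt"
out_file <- file.path(tempdir(), "SW_List.txt")
write.table(sw_table, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Alternative: keep the console-style printout, variable names included
console_file <- file.path(tempdir(), "SW_console.txt")
capture.output(print(sw), file = console_file)
```

The first file is a tab-separated table that is easy to read back in; the second is closer to what the console shows when the list is printed.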
This seems too basic to not be found in a search, but maybe I didn't use the correct search terms on Google.
I want to normalize a numeric column. When I modify that column with mutate(across(.., scale)) I get [,1] added to the header. Why is that?
library(dplyr, warn.conflicts = FALSE)
mtcars_mpg_only <-
mtcars %>%
as_tibble() %>%
select(mpg)
mtcars_mpg_only %>%
as_tibble() %>%
mutate(across(mpg, scale))
#> # A tibble: 32 x 1
#> mpg[,1]
#> <dbl>
#> 1 0.151
#> 2 0.151
#> 3 0.450
#> 4 0.217
#> 5 -0.231
#> 6 -0.330
#> 7 -0.961
#> 8 0.715
#> 9 0.450
#> 10 -0.148
#> # ... with 22 more rows
But if I use a different function rather than scale() (e.g., log()), then the column header remains as-is:
mtcars_mpg_only %>%
as_tibble() %>%
mutate(across(mpg, log))
#> # A tibble: 32 x 1
#> mpg
#> <dbl>
#> 1 3.04
#> 2 3.04
#> 3 3.13
#> 4 3.06
#> 5 2.93
#> 6 2.90
#> 7 2.66
#> 8 3.19
#> 9 3.13
#> 10 2.95
#> # ... with 22 more rows
I know how to remove/rename [,1] after the fact, but my question is why it's created to begin with?
It is because scale returns a matrix whereas log returns a plain vector. The mpg[, 1] is actually a matrix within a data.frame. See ?scale for the definition of its value.
class(scale(mtcars$mpg))
## [1] "matrix" "array"
class(log(mtcars$mpg))
## [1] "numeric"
Convert the matrix to a plain vector to avoid this, e.g.
mtcars_mpg_only %>%
mutate(across(mpg, ~ c(scale(.))))
# or extracting first column
mtcars_mpg_only %>%
mutate(across(mpg, ~ scale(.)[, 1]))
# or normalizing using mean and sd
mtcars_mpg_only %>%
mutate(across(mpg, ~ (. - mean(.)) / sd(.)))
# or without across
mtcars_mpg_only %>%
mutate(mpg = c(scale(mpg)))
# or using base R
mtcars_mpg_only |>
transform(mpg = c(scale(mpg)))
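To check that these variants agree, a small base R sketch (the object names are made up for illustration) can compare the matrix returned by scale with the flattened and the manually computed versions:

```r
z_matrix <- scale(mtcars$mpg)     # one-column matrix carrying scaling attributes
z_vector <- as.numeric(z_matrix)  # plain numeric vector, dim and attributes dropped
z_manual <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)

dim(z_matrix)                     # the matrix shape is what shows up as [,1]
is.null(dim(z_vector))            # the flattened version has no dim
all.equal(z_vector, z_manual)     # same values either way
```

Only the container differs: the numbers are the same, but a matrix column inside a tibble is printed with the [,1] suffix.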
I am working with the dataset uscrime, but this question applies to any well-known dataset like cars.
After some googling, I found that it is extremely useful to standardize my data, considering that PCA finds new directions based on the covariance matrix of the original variables, and the covariance matrix is sensitive to the standardization of the variables.
Nevertheless, I found "It is not necessary to standardize the variables, if all the variables are in same scale."
To standardize the variable I am using the function:
z_uscrime <- (uscrime - mean(uscrime)) / sd(uscrime)
Before standardizing my data, how can I check whether all the variables are on the same scale?
Proving my point that you can standardize your data however many times you want
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
simple_recipe <- recipe(mpg ~ .,data = mtcars) %>%
step_center(everything()) %>%
step_scale(everything())
mtcars2 <- simple_recipe %>%
prep() %>%
juice()
simple_recipe2 <- recipe(mpg ~ .,data = mtcars2) %>%
step_center(everything()) %>%
step_scale(everything())
mtcars3 <- simple_recipe2 %>%
prep() %>%
juice()
all.equal(mtcars2,mtcars3)
#> [1] TRUE
mtcars2 %>%
summarise(across(everything(),.fns = list(mean = ~ mean(.x),sd = ~sd(.x)))) %>%
pivot_longer(everything(),names_pattern = "(.*)_(.*)",names_to = c("stat", ".value"))
#> # A tibble: 11 x 3
#> stat mean sd
#> <chr> <dbl> <dbl>
#> 1 cyl -1.47e-17 1
#> 2 disp -9.08e-17 1
#> 3 hp 1.04e-17 1
#> 4 drat -2.92e-16 1
#> 5 wt 4.68e-17 1.00
#> 6 qsec 5.30e-16 1
#> 7 vs 6.94e-18 1.00
#> 8 am 4.51e-17 1
#> 9 gear -3.47e-18 1.00
#> 10 carb 3.17e-17 1.00
#> 11 mpg 7.11e-17 1
mtcars3 %>%
summarise(across(everything(),.fns = list(mean = ~ mean(.x),sd = ~sd(.x)))) %>%
pivot_longer(everything(),names_pattern = "(.*)_(.*)",names_to = c("stat", ".value"))
#> # A tibble: 11 x 3
#> stat mean sd
#> <chr> <dbl> <dbl>
#> 1 cyl -1.17e-17 1
#> 2 disp -1.95e-17 1
#> 3 hp 9.54e-18 1
#> 4 drat 1.17e-17 1
#> 5 wt 3.26e-17 1
#> 6 qsec 1.37e-17 1
#> 7 vs 4.16e-17 1
#> 8 am 4.51e-17 1
#> 9 gear 0. 1
#> 10 carb 2.60e-18 1
#> 11 mpg 4.77e-18 1
Created on 2020-06-07 by the reprex package (v0.3.0)
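For the original question of checking whether the variables are on the same scale, one option is to compare per-column summaries before standardizing. The sketch below uses mtcars as a stand-in, because uscrime is not available here, and the object names are made up for illustration. It also uses scale(), which standardizes column by column; mean() and sd() are not applied per column when they are handed a whole data frame.

```r
# Per-column location and spread; very different sd or ranges suggest
# that the variables are not on a common scale.
spread <- data.frame(
  variable = names(mtcars),
  mean = sapply(mtcars, mean),
  sd = sapply(mtcars, sd),
  min = sapply(mtcars, min),
  max = sapply(mtcars, max),
  row.names = NULL
)
spread

# A crude single-number check: ratio of largest to smallest sd
sd_ratio <- max(spread$sd) / min(spread$sd)
sd_ratio

# Column-wise standardization: every column ends up with mean 0 and sd 1
z_mtcars <- as.data.frame(scale(mtcars))
```

There is no fixed threshold for sd_ratio; it is just a way to see at a glance whether one variable would dominate the covariance matrix.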
I am using the code from this question (below) to save the columns of a nested tibble into a new list of tibbles (each column being a tibble in the list). However, when I use select on the nested tibble, the nesting variable is lost. I'd like to retain it, so that the grouping variable stays with the results.
e.g., results %>% unnest(tidied) keeps carb, but results %>% select(tidied) %>% map(~bind_rows(.)) does not.
How can I keep the nested variable with the selected columns?
library(tidyverse)
library(broom)
data(mtcars)
df <- mtcars
nest.df <- df %>% nest(-carb)
results <- nest.df %>%
mutate(fit = map(data, ~ lm(mpg ~ wt, data=.x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment))
final <- results %>% select(glanced, tidied, augmented ) %>%
map(~bind_rows(.))
We can do a mutate_at before the select step (the expected output is not entirely clear, though). Here mutate_at loops through each of the selected columns. Each of these columns is itself a list of tibbles, so inside the function (list(~ ...)) we use map2 to pass the column together with the carb column, and add carb as a new column to each tibble in the list by mutating.
results %>%
mutate_at(vars(glanced, tidied, augmented),
list(~ map2(.,carb, ~ .x %>% mutate(carb = .y)))) %>%
select(glanced, tidied, augmented) %>%
map(~ bind_rows(.x))
#$glanced
# A tibble: 6 x 12
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#1 0.696 0.658 2.29 18.3 0.00270 2 -21.4 48.7 49.6 41.9 8 4
#2 0.654 0.585 3.87 9.44 0.0277 2 -18.2 42.4 42.3 74.8 5 1
#3 0.802 0.777 2.59 32.3 0.000462 2 -22.6 51.1 52.1 53.5 8 2
#4 0.00295 -0.994 1.49 0.00296 0.965 2 -3.80 13.6 10.9 2.21 1 3
#5 0 0 NaN NA NA 1 Inf -Inf -Inf 0 0 6
#6 0 0 NaN NA NA 1 Inf -Inf -Inf 0 0 8
#$tidied
# A tibble: 10 x 6
# term estimate std.error statistic p.value carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 27.9 2.91 9.56 0.0000118 4
# 2 wt -3.10 0.724 -4.28 0.00270 4
#...
#...
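An alternative sketch, not taken from the answer above, keeps carb by selecting it together with each list column and unnesting one column at a time. It assumes tidyr >= 1.0.0 for the nest(data = -carb) syntax and leaves out augmented for brevity.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Same nested fits as in the question, restricted to two result columns
results <- mtcars %>%
  nest(data = -carb) %>%
  mutate(fit = map(data, ~ lm(mpg ~ wt, data = .x)),
         tidied = map(fit, tidy),
         glanced = map(fit, glance))

# Keep carb next to each list column, then unnest that column only
final <- list(
  glanced = results %>% select(carb, glanced) %>% unnest(glanced),
  tidied  = results %>% select(carb, tidied) %>% unnest(tidied)
)

names(final$tidied)
```

Each element of final is a regular tibble whose first column is carb, so no extra mutate step is needed to carry the grouping variable along.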
I have this simple dataframe. The sum column represents the sum of the row. I would like to use prop.test to determine the P-value for each column, and present that data as an additional row labeled p-value. I can use prop.test in the following way to determine a p value for any individual column, but cannot work out how to apply that to multiple columns with a single function.
      Other Island N_Shelf N_Shore S_Shore Sum
Type1    10      4       1       0       3  18
Type2    19     45       1       9      11  85
This will output a p-value for the island column
ResI2<- prop.test(x=TableAvE_Island$Island, n=TableAvE_Island$Sum)
output:
data: TableAvE_Island$Island out of TableAvE_Island$Sum
X-squared = 4.456, df = 1, p-value = 0.03478
alternative hypothesis: two.sided
95 percent confidence interval:
-0.56027107 -0.05410802
sample estimates:
prop 1 prop 2
0.2222222 0.5294118
I've tried to use the apply command but cannot work out its usage, and the examples I've been able to find don't seem similar enough. Any pointers would be appreciated.
Here's a look with broom's function tidy, which takes output from tests and other operations and formats them as "tidy" data frames.
For the first prop.test that you posted, the tidy output looks like this:
library(tidyverse)
broom::tidy(prop.test(TableAvE_Island$Island, TableAvE_Island$Sum))
#> estimate1 estimate2 statistic p.value parameter conf.low
#> 1 0.2222222 0.5294118 4.456017 0.03477849 1 -0.5602711
#> conf.high
#> 1 -0.05410802
#> method
#> 1 2-sample test for equality of proportions with continuity correction
#> alternative
#> 1 two.sided
To do this for all the variables in your data frame vs Sum, I gathered it into a long shape
table_long <- gather(TableAvE_Island, key = variable, value = val, -Sum)
head(table_long)
#> # A tibble: 6 x 3
#> Sum variable val
#> <int> <chr> <int>
#> 1 18 Other 10
#> 2 85 Other 19
#> 3 18 Island 4
#> 4 85 Island 45
#> 5 18 N_Shelf 1
#> 6 85 N_Shelf 1
Then I grouped the long-shaped data by variable and piped it into do, which allows you to call a function on each of the groups in a data frame, using . as a stand-in for the subset of the data. Then I called tidy on the column containing the nested results of prop.test. This gives you a data frame of all the relevant results of the test, with one row for each of "Island", "N_Shelf", etc.
table_long %>%
group_by(variable) %>%
do(test = prop.test(x = .$val, n = .$Sum)) %>%
broom::tidy(test)
#> # A tibble: 5 x 10
#> # Groups: variable [5]
#> variable estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Island 0.222 0.529 4.46 0.0348 1 -0.560
#> 2 N_Shelf 0.0556 0.0118 0.0801 0.777 1 -0.0981
#> 3 N_Shore 0 0.106 0.972 0.324 1 -0.205
#> 4 Other 0.556 0.224 6.54 0.0106 1 0.0523
#> 5 S_Shore 0.167 0.129 0.00163 0.968 1 -0.183
#> # ... with 3 more variables: conf.high <dbl>, method <fct>,
#> # alternative <fct>
Created on 2018-05-10 by the reprex package (v0.2.0).
We could gather into 'long' format and then store the test results as a list column:
library(tidyverse)
res <- gather(TableAvE_Island, key, val, -Sum) %>%
group_by(key) %>%
nest() %>%
mutate(out = map(data, ~prop.test(.x$val, .x$Sum)))
res$out
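To get from the list column to one row of p-values per variable, the list-column approach can be extended with tidy and unnest. The sketch below rebuilds the data from the table in the question so that it is self-contained, and uses pivot_longer, which supersedes gather in current tidyr. The name p_values is made up for illustration. As far as I know, newer broom versions no longer support calling tidy directly on the result of do, so this may be the more robust route.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Data re-typed from the table in the question (rows Type1, Type2)
TableAvE_Island <- tibble(
  Other = c(10, 19), Island = c(4, 45), N_Shelf = c(1, 1),
  N_Shore = c(0, 9), S_Shore = c(3, 11), Sum = c(18, 85)
)

p_values <- TableAvE_Island %>%
  pivot_longer(-Sum, names_to = "key", values_to = "val") %>%
  nest(data = -key) %>%
  # one prop.test per variable, then one tidy row per test
  mutate(out = map(data, ~ prop.test(.x$val, .x$Sum)),
         tidied = map(out, tidy)) %>%
  select(key, tidied) %>%
  unnest(tidied)

p_values %>% select(key, estimate1, estimate2, p.value)
```

prop.test may warn that the chi-squared approximation is unreliable for the columns with very small counts; the Island row should match the p-value shown in the question.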