step_BoxCox() with negative data


My understanding is that step_BoxCox() requires a strictly positive variable. However, when I applied the step to data that contains some negative values, I got neither an error nor a warning, and the output had no NA values.
I don't know what is wrong: is my understanding flawed, or am I using the wrong syntax?
library(recipes)
library(skimr)
# create dummy data
set.seed(123)
n <- 2e3
x1 <- rpois(n, lambda = 5) # has some zero vals
x2 <- rnorm(n) # has some -ve vals
x3 <- x1 + 10 # is strictly positive
y <- x1 + x2
data <- tibble(x1, x2, x3, y)
# a Box-Cox recipe
rec <- recipe(y ~ ., data = data) %>%
  step_BoxCox(all_predictors())
rec
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 3
#>
#> Operations:
#>
#> Box-Cox transformation on all_predictors()
# bake
processed <- rec %>%
  prep() %>%
  bake(new_data = NULL)
# check output
summary(data)
#> x1 x2 x3 y
#> Min. : 0.000 Min. :-3.047861 Min. :10.00 Min. :-2.048
#> 1st Qu.: 3.000 1st Qu.:-0.654767 1st Qu.:13.00 1st Qu.: 3.349
#> Median : 5.000 Median :-0.007895 Median :15.00 Median : 4.843
#> Mean : 4.981 Mean : 0.011176 Mean :14.98 Mean : 4.993
#> 3rd Qu.: 6.000 3rd Qu.: 0.688699 3rd Qu.:16.00 3rd Qu.: 6.486
#> Max. :14.000 Max. : 3.421095 Max. :24.00 Max. :15.225
summary(processed)
#> x1 x2 x3 y
#> Min. : 0.000 Min. :-3.047861 Min. :2.076 Min. :-2.048
#> 1st Qu.: 3.000 1st Qu.:-0.654767 1st Qu.:2.285 1st Qu.: 3.349
#> Median : 5.000 Median :-0.007895 Median :2.398 Median : 4.843
#> Mean : 4.981 Mean : 0.011176 Mean :2.388 Mean : 4.993
#> 3rd Qu.: 6.000 3rd Qu.: 0.688699 3rd Qu.:2.448 3rd Qu.: 6.486
#> Max. :14.000 Max. : 3.421095 Max. :2.756 Max. :15.225
sum(is.na(processed$x2))
#> [1] 0
skim(processed)
Created on 2021-04-29 by the reprex package (v0.3.0)
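In the summaries above, x1 and x2 are identical before and after baking and only x3 changes, which is consistent with the step leaving the non-positive columns untransformed. Whether recipes skips them by design is not confirmed here. The following is a minimal base R sketch of the textbook Box-Cox formula for a fixed lambda, showing why non-positive input is a problem. The helper box_cox() and the sample vectors are hypothetical and are not the recipes implementation.

```r
# Hand-rolled Box-Cox transform for a fixed lambda (illustration only;
# this is NOT how recipes estimates or applies the transformation).
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x_pos <- c(10, 13, 15)         # strictly positive, like x3
x_neg <- c(-3.05, -0.65, 0.69) # contains negative values, like x2

pos_out  <- box_cox(x_pos, 0.5) # finite values
neg_out  <- box_cox(x_neg, 0.5) # NaN for the negative entries
zero_out <- box_cox(0, 0)       # log(0) is -Inf

pos_out
neg_out
zero_out
```

If recipes is installed, tidy(prep(rec), number = 1) should list the lambda estimated for each column that was actually transformed, which is one way to check which predictors the step touched.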

Related

Summarize the same variables from multiple dataframes in one table

I have voter and party-data from several datasets that I further separated into different dataframes and lists to make it comparable. I could just use the summary command on each of them individually then compare manually, but I was wondering whether there was a way to get them all together and into one table?
Here's a sample of what I have:
> summary(eco$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3 4 4 4 4 5
> summary(ecovoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.000 4.000 3.744 5.000 10.000 26
> summary(lef$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 3.000 3.692 4.000 7.000
> summary(lefvoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 2.000 3.000 3.612 5.000 10.000 332
> summary(soc$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 4.000 4.000 4.143 5.000 6.000
> summary(socvoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.000 4.000 3.674 5.000 10.000 346
Is there a way I can summarize these lists (ecovoters, lefvoters, socvoters etc) and the dataframe variables (eco$rilenew, lef$rilenew, soc$rilenew etc) together and have them in one table?

You could put everything into a list and summarize with a small custom function.
L <- list(eco$rilenew, ecovoters, lef$rilenew,
          lefvoters, soc$rilenew, socvoters)
t(sapply(L, function(x) {
  s <- summary(x)
  length(s) <- 7
  names(s)[7] <- "NA's"
  s[7] <- ifelse(!any(is.na(x)), 0, s[7])
  return(s)
}))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 0.9820673 3.3320662 3.958665 3.949512 4.625109 7.229069 0
[2,] -4.8259384 0.5028293 3.220546 3.301452 6.229384 9.585749 26
[3,] -0.3717391 2.3280366 3.009360 3.013908 3.702156 6.584659 0
[4,] -2.6569493 1.6674330 3.069440 3.015325 4.281100 8.808432 332
[5,] -2.3625651 2.4964361 3.886673 3.912009 5.327401 10.349040 0
[6,] -2.4719404 1.3635785 2.790523 2.854812 4.154936 8.491347 346
Data
set.seed(42)
eco <- data.frame(rilenew=rnorm(800, 4, 1))
ecovoters <- rnorm(75, 4, 4)
ecovoters[sample(length(ecovoters), 26)] <- NA
lef <- data.frame(rilenew=rnorm(900, 3, 1))
lefvoters <- rnorm(700, 3, 2)
lefvoters[sample(length(lefvoters), 332)] <- NA
soc <- data.frame(rilenew=rnorm(900, 4, 2))
socvoters <- rnorm(700, 3, 2)
socvoters[sample(length(socvoters), 346)] <- NA
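The rows in the output above are labelled only [1,], [2,], and so on. A possible variation, sketched here under the assumption that readable row labels are wanted, is to name the list elements so that sapply() carries the names through. The helper summary_with_na() and the element names are made up for this illustration, and only two of the six inputs are used to keep it short.

```r
set.seed(42)
eco <- data.frame(rilenew = rnorm(800, 4, 1))
ecovoters <- rnorm(75, 4, 4)
ecovoters[sample(length(ecovoters), 26)] <- NA

# A named list: the names become the row names of the result.
L <- list(eco_rilenew = eco$rilenew, ecovoters = ecovoters)

# Hypothetical helper: the six usual statistics plus an explicit NA count.
summary_with_na <- function(x) {
  s <- unclass(summary(x))[1:6]
  c(s, "NA's" = sum(is.na(x)))
}

res <- t(sapply(L, summary_with_na))
res
```

Counting the missing values with sum(is.na(x)) avoids having to patch the seventh element after the fact.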

You can use map() from purrr (part of the tidyverse) to get a list of summaries. If you want the result as a data frame, plyr::ldply() can convert the list:
ll = map(L, summary)
ll
plyr::ldply(ll, rbind)
> ll = map(L, summary)
> ll
[[1]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9821 3.3321 3.9587 3.9495 4.6251 7.2291
[[2]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-4.331 1.347 3.726 3.793 6.653 16.845 26
[[3]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3717 2.3360 3.0125 3.0174 3.7022 6.5847
[[4]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-2.657 1.795 3.039 3.013 4.395 9.942 332
[[5]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.363 2.503 3.909 3.920 5.327 10.349
[[6]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-3.278 1.449 2.732 2.761 4.062 8.171 346
> plyr::ldply(ll, rbind)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 0.9820673 3.332066 3.958665 3.949512 4.625109 7.229069 NA
2 -4.3312551 1.346532 3.725708 3.793431 6.652917 16.844796 26
3 -0.3717391 2.335959 3.012507 3.017438 3.702156 6.584659 NA
4 -2.6569493 1.795307 3.038905 3.012928 4.395338 9.941819 332
5 -2.3625651 2.503324 3.908727 3.920050 5.327401 10.349040 NA
6 -3.2779863 1.448814 2.732515 2.760569 4.061854 8.170793 346

What is the tidyverse method for splitting a df by multiple columns?

I would like to split a dataframe by multiple columns so that I can see the summary() output for each subset of the data.
Here's a way to do that using split() from base:
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag(): dplyr, stats
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
split(list(.$GRP_A, .$GRP_B)) %>%
map(summary)
#> $A.1
#> mpg cyl disp GRP_A
#> Min. :10.40 Min. :4.0 Min. :108.0 Length:10
#> 1st Qu.:14.97 1st Qu.:4.5 1st Qu.:151.9 Class :character
#> Median :18.50 Median :7.0 Median :259.3 Mode :character
#> Mean :17.61 Mean :6.4 Mean :283.4
#> 3rd Qu.:20.85 3rd Qu.:8.0 3rd Qu.:430.0
#> Max. :24.40 Max. :8.0 Max. :472.0
#> GRP_B
#> Min. :1
#> 1st Qu.:1
#> Median :1
#> Mean :1
#> 3rd Qu.:1
#> Max. :1
#>
#> $B.1
#> mpg cyl disp GRP_A
#> Min. :15.00 Min. :4.0 Min. : 75.7 Length:5
#> 1st Qu.:21.00 1st Qu.:4.0 1st Qu.: 78.7 Class :character
#> Median :21.50 Median :4.0 Median :120.1 Mode :character
#> Mean :24.06 Mean :5.2 Mean :147.1
#> 3rd Qu.:30.40 3rd Qu.:6.0 3rd Qu.:160.0
#> Max. :32.40 Max. :8.0 Max. :301.0
#> GRP_B
#> Min. :1
#> 1st Qu.:1
#> Median :1
#> Mean :1
#> 3rd Qu.:1
#> Max. :1
#>
#> $A.2
#> mpg cyl disp GRP_A
#> Min. :15.20 Min. :4.000 Min. : 95.1 Length:9
#> 1st Qu.:16.40 1st Qu.:6.000 1st Qu.:160.0 Class :character
#> Median :18.10 Median :8.000 Median :275.8 Mode :character
#> Mean :19.84 Mean :6.667 Mean :234.0
#> 3rd Qu.:21.00 3rd Qu.:8.000 3rd Qu.:275.8
#> Max. :30.40 Max. :8.000 Max. :360.0
#> GRP_B
#> Min. :2
#> 1st Qu.:2
#> Median :2
#> Mean :2
#> 3rd Qu.:2
#> Max. :2
#>
#> $B.2
#> mpg cyl disp GRP_A
#> Min. :13.30 Min. :4 Min. : 71.1 Length:8
#> 1st Qu.:14.97 1st Qu.:4 1st Qu.:125.3 Class :character
#> Median :20.55 Median :6 Median :201.5 Mode :character
#> Mean :20.99 Mean :6 Mean :213.5
#> 3rd Qu.:23.93 3rd Qu.:8 3rd Qu.:315.5
#> Max. :33.90 Max. :8 Max. :360.0
#> GRP_B
#> Min. :2
#> 1st Qu.:2
#> Median :2
#> Mean :2
#> 3rd Qu.:2
#> Max. :2
How can I achieve this same result using a tidyverse verb? My initial thought was to use purrr::by_slice(), but apparently that has been deprecated.

dplyr 0.8.0 has introduced the verb that you were looking for: group_split()
From the documentation:
group_split() works like base::split() but
it uses the grouping structure from group_by() and therefore is subject to the data mask
it does not name the elements of the list based on the grouping as this typically loses information and is confusing.
group_keys() explains the grouping structure, by returning a data
frame that has one row per group and one column per grouping variable.
For your example:
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
group_split(GRP_A, GRP_B) %>%
map(summary)
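Because group_split() does not name the list elements, the summaries printed by the code above are not labelled by group. A minimal sketch of one way to restore labels, assuming dplyr 0.8.0 or later, is to build names from group_keys(), which returns the groups in the same order as group_split(). The "A.1"-style naming scheme is a choice made for this example, and lapply() is used in place of map() only to avoid loading purrr.

```r
library(dplyr)

set.seed(1)
df <- mtcars %>%
  select(1:3) %>%
  mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
         GRP_B = sample(1:2, n(), replace = TRUE)) %>%
  group_by(GRP_A, GRP_B)

keys   <- group_keys(df)   # one row per group
pieces <- group_split(df)  # one data frame per group, same order as keys

# Label each piece with its grouping values, e.g. "A.1".
names(pieces) <- paste(keys$GRP_A, keys$GRP_B, sep = ".")

summaries <- lapply(pieces, summary)
names(summaries)
```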

EDIT: this answer is now outdated. See @MartijnVanAttekum's solution above.
The "tidy" solution seems to be a combination of "mutate + list-cols + purrr" according to Hadley.
library(tidyverse)
library(magrittr)
# group, nest, create a new col leveraging purrr::map()
mt_summary <-
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
group_by(GRP_A, GRP_B) %>%
nest() %>%
mutate(SUMMARY = map(data, .f = summary))
# check the structure
mt_summary
#> # A tibble: 4 × 4
#> GRP_A GRP_B data SUMMARY
#> <chr> <int> <list> <list>
#> 1 A 1 <tibble [11 × 3]> <S3: table>
#> 2 B 2 <tibble [9 × 3]> <S3: table>
#> 3 A 2 <tibble [7 × 3]> <S3: table>
#> 4 B 1 <tibble [5 × 3]> <S3: table>
# extract the summaries
extract2(mt_summary, "SUMMARY") %>%
set_names(paste0(extract2(mt_summary, "GRP_A"),
extract2(mt_summary, "GRP_B")))
#> $A1
#> mpg cyl disp
#> Min. :10.40 Min. :4.000 Min. : 75.7
#> 1st Qu.:15.25 1st Qu.:4.000 1st Qu.:120.9
#> Median :19.20 Median :6.000 Median :167.6
#> Mean :20.43 Mean :6.182 Mean :229.0
#> 3rd Qu.:25.85 3rd Qu.:8.000 3rd Qu.:309.5
#> Max. :30.40 Max. :8.000 Max. :460.0
#>
#> $B2
#> mpg cyl disp
#> Min. :15.20 Min. :4.000 Min. : 78.7
#> 1st Qu.:17.80 1st Qu.:4.000 1st Qu.:120.3
#> Median :19.20 Median :6.000 Median :167.6
#> Mean :20.84 Mean :6.222 Mean :225.9
#> 3rd Qu.:21.50 3rd Qu.:8.000 3rd Qu.:351.0
#> Max. :32.40 Max. :8.000 Max. :400.0
#>
#> $A2
#> mpg cyl disp
#> Min. :15.20 Min. :4.000 Min. : 71.1
#> 1st Qu.:18.90 1st Qu.:4.000 1st Qu.:114.5
#> Median :21.40 Median :6.000 Median :145.0
#> Mean :21.79 Mean :5.429 Mean :176.0
#> 3rd Qu.:22.10 3rd Qu.:6.000 3rd Qu.:241.5
#> Max. :33.90 Max. :8.000 Max. :304.0
#>
#> $B1
#> mpg cyl disp
#> Min. :10.40 Min. :4.0 Min. :140.8
#> 1st Qu.:13.30 1st Qu.:8.0 1st Qu.:275.8
#> Median :14.30 Median :8.0 Median :350.0
#> Mean :15.62 Mean :7.2 Mean :319.7
#> 3rd Qu.:17.30 3rd Qu.:8.0 3rd Qu.:360.0
#> Max. :22.80 Max. :8.0 Max. :472.0

force summary() to report the number of NA's even if none

I have many numeric vectors, some have NA's, some don't. Here is an example with two vectors:
x1 <- c(1,2,3,2,2,4)
summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
x2 <- c(1,2,3,2,2,4,NA)
summary(x2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 2.000 2.000 2.333 2.750 4.000 1
In the end, I want to rbind all the summaries:
rbind(summary(x1), summary(x2))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 1 2 2 2.333 2.75 4 1
[2,] 1 2 2 2.333 2.75 4 1
Warning message:
In rbind(summary(x1), summary(x2)) :
number of columns of result is not a multiple of vector length (arg 1)
Is there a way to force summary() to count NA's without an error or a warning?
All my attempts failed:
summary(x1, na.rm=FALSE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
summary(x1, useNA="always")
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.333 2.750 4.000
summary(addNA(x1))
1 2 3 4 <NA>
1 3 1 1 0
I also tried the following, but it is a bit of a hack:
tmp <- rbind(summary(x1[complete.cases(x1)]), summary(x2[complete.cases(x2)]))
tmp <- cbind(tmp, c(sum(is.na(x1)), sum(is.na(x2))))
colnames(tmp)[ncol(tmp)] <- "NA's"
tmp
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 1 2 2 2.333 2.75 4 0
[2,] 1 2 2 2.333 2.75 4 1

I have not found a way to force summary to display NA's. However, you could write a custom function that returns what you want:
my_summary <- function(v) {
  if (!any(is.na(v))) {
    res <- c(summary(v), "NA's" = 0)
  } else {
    res <- summary(v)
  }
  return(res)
}
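As a usage sketch, the custom function can be applied to each vector and the results bound together. The function definition is repeated so that the snippet runs on its own, and x1 and x2 are the vectors from the question.

```r
my_summary <- function(v) {
  if (!any(is.na(v))) {
    res <- c(summary(v), "NA's" = 0)
  } else {
    res <- summary(v)
  }
  return(res)
}

x1 <- c(1, 2, 3, 2, 2, 4)
x2 <- c(1, 2, 3, 2, 2, 4, NA)

# Every summary now has seven entries, so rbind() no longer recycles or warns.
out <- do.call(rbind, lapply(list(x1 = x1, x2 = x2), my_summary))
out
```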

The problem is that you are combining vectors of different lengths, so you can set the length of the shorter summary to that of the longer one. This pads the shorter summary with NA, and after combining them the NA can easily be replaced with zero.
s1 <- summary(x1)
s2 <- summary(x2)
length(s1) <- length(s2)
s <- rbind(s2,s1)
s[is.na(s)] <- 0
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
s2 1 2 2 2.333 2.75 4 1
s1 1 2 2 2.333 2.75 4 0

The earlier solutions ignore the fact that summary() also works for data frames and matrices. I usually handle this with a recursive function definition, although the result is not exactly the same as that of the original summary() function.
summaryna <- function(x, ...) {
  # Recursive function definition in case of matrix or data.frame.
  if (is.matrix(x)) {
    return(apply(x, 2, function(col) summaryna(col, ...)))
  } else if (is.data.frame(x)) {
    return(sapply(x, function(col) summaryna(col, ...)))
  }
  # This is the actual function.
  # Named `s` rather than `sum` to avoid masking base::sum().
  s <- summary(x, ...)
  # A numeric summary has 6 entries when there are no NAs; pad with a zero count.
  if (length(s) < 7) s <- c(s, "NA's" = 0)
  return(s)
}
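To illustrate the recursive version on a data frame, here is a small usage sketch. The data frame df is made up for the example, all of its columns are assumed to be numeric, and the function is repeated so that the snippet is self-contained.

```r
summaryna <- function(x, ...) {
  if (is.matrix(x)) {
    return(apply(x, 2, function(col) summaryna(col, ...)))
  } else if (is.data.frame(x)) {
    return(sapply(x, function(col) summaryna(col, ...)))
  }
  s <- summary(x, ...)
  if (length(s) < 7) s <- c(s, "NA's" = 0)
  return(s)
}

# Hypothetical data: column a has one missing value, column b has none.
df <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

# One column of statistics per variable, with an "NA's" row in every case.
res <- summaryna(df)
res
```

Unlike summary(df), which returns a formatted character table, this returns a numeric matrix, which is the difference from the original function mentioned above.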

Restructure output of R summary function

Is there an easy way to change the output format for R's summary function so that the results print in a column instead of row? R does this automatically when you pass summary a data frame. I'd like to print summary statistics in a column when I pass it a single vector. So instead of this:
>summary(vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 6.699 6.000 559.000
It would look something like this:
>summary(vector)
Min. 1.000
1st Qu. 1.000
Median 2.000
Mean 6.699
3rd Qu. 6.000
Max. 559.000

Sure. Treat it as a data.frame:
set.seed(1)
x <- sample(30, 100, TRUE)
summary(x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 10.00 15.00 16.03 23.25 30.00
summary(data.frame(x))
# x
# Min. : 1.00
# 1st Qu.:10.00
# Median :15.00
# Mean :16.03
# 3rd Qu.:23.25
# Max. :30.00
For slightly more usable output, you can use data.frame(unclass(.)):
data.frame(val = unclass(summary(x)))
# val
# Min. 1.00
# 1st Qu. 10.00
# Median 15.00
# Mean 16.03
# 3rd Qu. 23.25
# Max. 30.00
Or you can use stack:
stack(summary(x))
# values ind
# 1 1.00 Min.
# 2 10.00 1st Qu.
# 3 15.00 Median
# 4 16.03 Mean
# 5 23.25 3rd Qu.
# 6 30.00 Max.

summary to a data frame

Using summary(var) gives me the following output:
PAY_BACK_ORG
Min. : -16.40
1st Qu.: 0.00
Median : 26.40
Mean : 34.37
3rd Qu.: 53.60
Max. :4033.40
I want it as a dataframe which will look like this:
Min -16.40
1st Qu 0.00
Median 26.40
Mean 34.37
3rd Qu 53.60
Max 4033.40
How can I get it into that form?

Like this?
var <- rnorm(100)
x <- summary(var)
data.frame(x=matrix(x),row.names=names(x))
## x
## Min. -2.68300
## 1st Qu. -0.70930
## Median -0.09732
## Mean -0.00809
## 3rd Qu. 0.71550
## Max. 2.58100
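If the statistic names are wanted as an ordinary column rather than as row names, a variation along the following lines may work. The column names stat and value and the sample values are assumptions made for this sketch.

```r
# Made-up values standing in for the PAY_BACK_ORG column from the question.
pay_back <- c(-16.4, 0, 26.4, 53.6, 4033.4)

s <- summary(pay_back)

# Names go into one column, the unformatted numeric values into another.
out <- data.frame(stat = names(s), value = as.numeric(s))
out
```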
