dplyr summarise improperly excluding NA

We can group mtcars by cylinder and summarize miles per gallon with some simple code.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg))
This provides the correct output shown below.
cyl avg
1 4 26.66364
2 6 19.74286
3 8 15.10000
If I kindly ask dplyr to exclude NA, I get some weird results.
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(!is.na(mpg)))
Since there are no NAs in this data set, the results should be the same as above. Instead, every group's average comes out as exactly 1. Is this a problem with my code or a bug in dplyr?
cyl avg
1 4 1
2 6 1
3 8 1
My actual data set does have some NAs that I need to exclude only for this summarisation, and it exhibits the same behavior.

You want this:
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg, na.rm = TRUE))
# A tibble: 3 x 2
cyl avg
<dbl> <dbl>
1 4 26.66364
2 6 19.74286
3 8 15.10000
Right now, !is.na(mpg) returns a logical vector. When you take the mean() of a logical vector, TRUE and FALSE are coerced to 1 and 0; since mtcars has no NAs, every element is TRUE, so each group's mean is exactly 1 rather than the numeric value you want.

The way you have coded it, the input to the mean() function is a vector of TRUE and FALSE values. Use mean(mpg[!is.na(mpg)]) instead.
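To see the coercion concretely, here is a tiny hand-made vector (a sketch; the values are arbitrary, not from mtcars):
x <- c(21, 22.8, NA)
mean(!is.na(x))        # 0.6666667 -> the proportion of non-NA values
mean(x, na.rm = TRUE)  # 21.9      -> the mean you actually want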
Consider data.table, which I use here for illustration. The following all produce the same result.
library(data.table)
MT <- as.data.table(mtcars)  # the data.table syntax below needs a data.table, not a data.frame
MT[, mean(mpg), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
MT[, mean(mpg, na.rm=TRUE), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
MT[, mean(mpg[!is.na(mpg)]), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
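Back in dplyr, since your real data does contain NAs, it can also help to report how many values were dropped per group alongside the NA-aware mean (a sketch; with mtcars the n_missing column is all zeros):
mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg, na.rm = TRUE),
            n_missing = sum(is.na(mpg)))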

Related

Using mtcars data to make a summarised table of cylinders versus centered(mpg)

Bear with me... I am using R/RStudio with the mtcars data and the dplyr mutate and summarise commands. I have also tried group_by.
I want to center the values of mtcars$mpg, then use that to display a summary of the number of cylinders versus the centered mtcars$mpg.
So far...
mtcars %>% mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>% summarise(centered_mpg, cyl)
The above produces:
  centered_mpg cyl
1     0.909375   6
2     0.909375   6
3     2.709375   4
4     1.309375   6
...
INSTEAD, I WANT:
centered_mpg cyl
x1             4
x2             6
x3             8
Are you looking for this?
with(mtcars, aggregate(list(centered_mpg=scale(mpg, scale=FALSE)), list(cyl=cyl), mean))
# cyl centered_mpg
# 1 4 6.5730114
# 2 6 -0.3477679
# 3 8 -4.9906250
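For reference, scale(mpg, scale = FALSE) only subtracts the column mean, so the call above is equivalent to centering by hand (a sketch that reproduces the same numbers):
with(mtcars, aggregate(list(centered_mpg = mpg - mean(mpg)), list(cyl = cyl), mean))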
It looks like you want to center each individual car's mpg by subtracting the global mean(mpg). This gives a centered_mpg for every car - and the code you have looks fine for this.
Then you want to calculate some sort of "summary" of the centered mpg values by cylinder group, so we need to group_by(cyl) and then define whatever summary function you want - here I use mean() but you can use median, sum, or whatever else you'd like.
mtcars %>%
mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>%
group_by(cyl) %>%
summarise(mean_centered_mpg = mean(centered_mpg))
# # A tibble: 3 x 2
# cyl mean_centered_mpg
# <dbl> <dbl>
# 1 4 6.57
# 2 6 -0.348
# 3 8 -4.99

group_by() grouping disappears after filter()/mutate()/count() without using ungroup()

This problem has bothered me the entire day and I don't know why it happens.
The issue is that the group_by() grouping disappears after a single line of code such as filter(), mutate(), or count(), and in order to keep the grouping I need to add group_by() again after each of these calls.
Below I attach an example.
As you can see, if I add group_by after filter, it works fine.
data("mtcars")
> mtcars %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
However, if I use group_by() before filter() and then count the values, the grouping is lost:
data("mtcars")
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
mpg n
1 21.0 2
2 21.4 1
In order to make it work, I need to change the code to:
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
This method also doesn't work:
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
mpg n
1 21.0 2
2 21.4 1
On another PC the same code works fine:
data("mtcars")
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
# A tibble: 2 x 3
# Groups: cyl [1]
cyl mpg n
<dbl> <dbl> <int>
1 6 21 2
2 6 21.4 1
I have reinstalled the dplyr package many times and this keeps happening. I am using dplyr version 1.0.2.
I would really appreciate it if someone could help me with this issue!
Edit:
The problem was solved after I updated my R version to 4.0.2 (my previous version was 3.6.3). I'm not sure why dplyr doesn't work properly under 3.6.3, but at least the problem is solved for now.
Try this:
data("mtcars")
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
There can be a masking problem: a function named filter exists in both the dplyr and stats packages. The same issue was discussed here. A similar problem occurs with the select function.
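If you suspect masking, you can check which package a bare filter() call resolves to in your current session (a sketch; the output depends on what you have attached):
find("filter")        # e.g. "package:dplyr" "package:stats" -- the first match wins
environment(filter)   # should report <environment: namespace:dplyr>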
Also note in that context the difference between:
data("mtcars")
mtcars %>%
group_by(cyl,gear) %>%
summarize(
n=n()
) %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl [3]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 11
2 4 4 8 11
3 4 5 2 11
4 6 3 2 7
5 6 4 4 7
6 6 5 1 7
mtcars %>%
group_by(cyl,gear) %>%
count() %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl, gear [8]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 1
2 4 4 8 8
3 4 5 2 2
4 6 3 2 2
5 6 4 4 4
summarise() defaults to dropping the last grouping variable (.groups = "drop_last"). And for a funny reason :)
https://twitter.com/hadleywickham/status/1254802700589555715
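Since dplyr 1.0.0 you can also state the grouping of the result explicitly via the .groups argument, which avoids surprises either way (a sketch; .groups = "keep" reproduces the count() behaviour above, while .groups = "drop" returns an ungrouped tibble):
mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "keep") %>%
  mutate(mysum = sum(n))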

How to count() each variable automatically

I am cleaning some data and would like to use the count() function in dplyr to look at the unique values of every variable.
Is there a way to do this automatically? Right now I am using this method:
df %>% count(variable1)
df %>% count(variable2)
df %>% count(variable3)
...
I would like something that returns all of them without having to repeat the line of code and type in each variable. I thought about having R recognize all the column names and fill them in automatically, but I'm not sure where to start. If I just add variables together, say
df %>% count(variable1, variable2)
I get counts by both of those variables when I want individual tables for each variable.
Assume that you want to count am, gear, and carb from mtcars. You can apply table() to each variable with map(), which returns a list.
library(dplyr)
library(purrr)
mtcars %>%
select(am, gear, carb) %>%
map(table)
# $am
# 0 1
# 19 13
#
# $gear
# 3 4 5
# 15 12 5
#
# $carb
# 1 2 3 4 6 8
# 7 10 3 10 1 1
Base R version:
lapply(mtcars[c("am", "gear", "carb")], table)
In addition, you can use summary(), which counts factor variables.
mtcars %>%
select(am, gear, carb) %>%
mutate(across(everything(), as.factor)) %>%
summary()
# am gear carb
# 0:19 3:15 1: 7
# 1:13 4:12 2:10
# 5: 5 3: 3
# 4:10
# 6: 1
# 8: 1
It looks like you can use a tidyverse approach to solve your issue. You want the counts for each variable in your dataset (please next time add a sample of df). You can get something close to what you want by putting the data in long format. I will show an example with the mtcars data, choosing some variables that behave like categories so they can be summarised with counts. Here is the code:
library(tidyverse)
#Data
data("mtcars")
I select some categorical variables with the next bit of code, then reshape to long format. Finally, I use group_by() with summarise() and n() (used for counting) to determine the counts:
#Code
mtcars %>% select(cyl,vs,am,gear,carb) %>%
#Format to long
pivot_longer(cols = everything()) %>%
#Group and summarise
group_by(name,value) %>%
summarise(N=n())
Output:
# A tibble: 16 x 3
# Groups: name [5]
name value N
<chr> <dbl> <int>
1 am 0 19
2 am 1 13
3 carb 1 7
4 carb 2 10
5 carb 3 3
6 carb 4 10
7 carb 6 1
8 carb 8 1
9 cyl 4 11
10 cyl 6 7
11 cyl 8 14
12 gear 3 15
13 gear 4 12
14 gear 5 5
15 vs 0 18
16 vs 1 14
As you can see, all the variables are shown with their respective groups and counts.
A simple solution would be to use sapply or lapply with table:
sapply(df, table)
This will return a list of count tables, one for each column of df. You can always pass in a subset of the data frame to get the counts for just your variables of interest.
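If you specifically want count()-style tibbles rather than table() output, one option is to loop over the column names with purrr::map() (a sketch; it returns a named list with one tibble per variable):
library(dplyr)
library(purrr)
c("am", "gear", "carb") %>%
  set_names() %>%
  map(~ count(mtcars, .data[[.x]]))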

Mean with condition for multiple columns in r

Let's use mtcars to explain the situation.
What I want to do is the same as below, but for multiple columns: take the mean of a column (qsec in the example) conditional on another column having a specific value (4 and 6 in the example below). I'll compare the results later, so I would probably store them in a vector.
table(mtcars$cyl)
4 6 8
11 7 14
mean(mtcars$qsec[mtcars$cyl == 4], na.rm = T)
mean(mtcars$qsec[mtcars$gear == 4], na.rm = T)
I would like to check the means of qsec by cyl, and also by, say, gear and carb, with the same "pattern" for the mean, i.e. the mean of observations with value 4 and the mean of observations with value 6. In the real dataset there are several columns that share the same set of values (2, 0 and 1), and I'll compare the means of one column (qsec in the example) for observations with 2 and with 0.
I've looked at functions like tapply, apply, and sapply, but I'm stuck on applying the condition in the mean to every column at once.
Hope I made myself clear.
Thank you!
The function you are looking for is aggregate:
aggregate(. ~ cyl, FUN=mean, data=mtcars[,c("cyl", "qsec", "gear", "carb")],
subset=cyl %in% c(4, 6)
)
cyl qsec gear carb
1 4 19.13727 4.090909 1.545455
2 6 17.97714 3.857143 3.428571
In the call above, data= is the data frame; here we selected only the wanted columns. The subset= argument specifies which rows of the data to keep (in this case only cyl 4 and 6).
The formula . ~ cyl says to summarise all remaining columns according to the cyl column.
A data.table solution:
require(data.table)
mtcars_dt <- as.data.table(mtcars)  # this syntax needs a data.table, not a data.frame
mtcars_dt[cyl %in% c(4, 6), .(mn_qsec = mean(qsec),
mn_gear = mean(gear),
mn_carb = mean(carb)),
by = cyl]
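If there are many columns, data.table's .SD with .SDcols avoids typing each one (a sketch giving the same means, just with the original column names):
mtcars_dt[cyl %in% c(4, 6), lapply(.SD, mean), by = cyl,
          .SDcols = c("qsec", "gear", "carb")]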
What I understand you're looking for is the mean of qsec for each level of cyl, gear, and carb separately, not in combination. This code gets you that, but doesn't directly let you select specific levels of those factors. If you need to be able to do that second part, I think you should be able to tweak this to get there, but I'm not sure how...
apply(mtcars[,c("cyl","gear","carb")], 2, function(x) {
aggregate(mtcars[,"qsec"],list(x),mean)
})
Output:
$cyl
Group.1 x
1 4 19.13727
2 6 17.97714
3 8 16.77214
$gear
Group.1 x
1 3 17.692
2 4 18.965
3 5 15.640
$carb
Group.1 x
1 1 19.50714
2 2 18.18600
3 3 17.66667
4 4 16.96500
5 6 15.50000
6 8 14.60000
One option is to use dplyr::summarise_at, since the OP wants to apply the same function to multiple columns. The solution would be:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(qsec, gear, carb), mean, na.rm = TRUE) %>%
filter(cyl!=8)
# # A tibble: 2 x 4
# cyl qsec gear carb
# <dbl> <dbl> <dbl> <dbl>
# 1 4.00 19.1 4.09 1.55
# 2 6.00 18.0 3.86 3.43
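In current dplyr (>= 1.0.0), summarise_at() is superseded and the same idea is usually written with across() (a sketch of the equivalent call):
mtcars %>%
  filter(cyl %in% c(4, 6)) %>%
  group_by(cyl) %>%
  summarise(across(c(qsec, gear, carb), ~ mean(.x, na.rm = TRUE)))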

Using dplyr summarise_at with column index

I noticed that when supplying column indices to dplyr::summarise_at, the columns to be summarised are determined after excluding the grouping column(s). I wonder whether this is how it's supposed to be, since by this design the correct column index depends on whether the columns to summarise are positioned before or after the grouping columns.
Here's an example:
library(dplyr)
data("mtcars")
# grouping column after summarise columns
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
# gear disp hp
# <dbl> <dbl> <dbl>
#1 3 326.3000 176.1333
#2 4 123.0167 89.5000
#3 5 202.4800 195.6000
# grouping columns before summarise columns
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
# cyl hp drat
# <dbl> <dbl> <dbl>
#1 4 82.63636 4.070909
#2 6 122.28571 3.585714
#3 8 209.21429 3.229286
# no grouping columns
mtcars %>% summarise_at(3:4, mean)
# disp hp
#1 230.7219 146.6875
# actual third & fourth columns
names(mtcars)[3:4]
#[1] "disp" "hp"
packageVersion("dplyr")
#[1] ‘0.7.2’
Notice how the summarised columns change depending on grouping and position of the grouping column.
Is this the same on other platforms? Is it a bug or a feature?
With version 0.7.5 this behaviour can no longer be reproduced:
library(dplyr)
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
# # A tibble: 3 x 3
# gear disp hp
# <dbl> <dbl> <dbl>
# 1 3 326. 176.
# 2 4 123. 89.5
# 3 5 202. 196.
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
# # A tibble: 3 x 3
# cyl disp hp
# <dbl> <dbl> <dbl>
# 1 4 105. 82.6
# 2 6 183. 122.
# 3 8 353. 209.
@docendodiscimus, thanks for pointing this out: even if this behaviour was intentional, the documentation doesn't explicitly explain it, and in my case it could be a source of errors. Actually, this problem was solved before the other question was answered, and my comment above handles it properly with the same logic.
At the moment, a possible solution is to provide names instead of indices. But you can still use indices just by adding a few characters, .vars = names(.)[3:4], as below:
mtcars %>%
group_by(cyl) %>%
summarise_at( .vars = colnames(.)[3:4] , mean)
mtcars %>%
group_by(cyl) %>%
summarise_at( .vars = names(.)[3:4] , mean)
## A tibble: 3 x 3
# cyl disp hp
# <dbl> <dbl> <dbl>
#1 4 105.1364 82.63636
#2 6 183.3143 122.28571
#3 8 353.1000 209.21429
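In newer dplyr versions the _at() verbs are superseded; the same name-based workaround can be written with across() and all_of(), selecting columns by their position in the full data frame (a sketch):
mtcars %>%
  group_by(cyl) %>%
  summarise(across(all_of(names(mtcars)[3:4]), mean))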
