Why does as_tibble() round floats to the nearest integer? - r

When using as_tibble in dplyr 0.7.4 and R 3.4.1 I get the following outputs
mtcars %>% aggregate(disp ~ cyl, data=., mean) %>% as_tibble()
which outputs
# A tibble: 3 x 2
cyl disp
<dbl> <dbl>
1 4.00 105
2 6.00 183
3 8.00 353
while
mtcars %>% aggregate(disp ~ cyl, data=., mean)
outputs
cyl disp
1 4 105.1364
2 6 183.3143
3 8 353.1000
Not really surprisingly, the following
mtcars %>% group_by(cyl) %>% summarise(disp=mean(disp))
gives again
# A tibble: 3 x 2
cyl disp
<dbl> <dbl>
1 4.00 105
2 6.00 183
3 8.00 353
Why is this rounding happening and how can I avoid it?

This is not rounding; it is only how {tibble} prints the data. The stored values keep their full precision:
> mtcars %>%
+ aggregate(disp ~ cyl, data=., mean) %>%
+ as_tibble() %>%
+ pull(disp)
[1] 105.1364 183.3143 353.1000
If you want to see more digits, one option is to print a data.frame instead:
> mtcars %>%
+ aggregate(disp ~ cyl, data=., mean) %>%
+ as_tibble() %>%
+ as.data.frame()
cyl disp
1 4 105.1364
2 6 183.3143
3 8 353.1000
(and yes, the last two lines cancel each other out, so they are redundant)
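As a minimal sketch of an alternative to converting back to a data.frame: tibble columns are formatted by the pillar package, which reads the `pillar.sigfig` option (default 3 significant digits). The snippet below assumes a tibble version that uses pillar for printing; the variable names `tb` and `old` are only illustrative.

```r
library(dplyr)
library(tibble)

tb <- mtcars %>%
  aggregate(disp ~ cyl, data = ., mean) %>%
  as_tibble()

# The stored values are untouched; only the printed form is shortened.
tb$disp

# Ask pillar for more significant digits, print, then restore the old setting.
old <- options(pillar.sigfig = 7)
print(tb)
options(old)
```

Setting the option changes every tibble printed in the session, which is why the sketch restores the previous value afterwards.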

Related

Add Another Column Info to Results of groupby r

Can someone help me please?
I have columns A, B and C. I want the top value of column C, grouped by A, but I also want to keep the value of B that belongs to each of those top values.
Max <-X %>% select(A,B,C) %>% group_by(A) %>% summarise(top = max(C))
But this code only shows me the top value for each unique A, so I don't know which B value goes with it. (Important: using group_by(A, B) doesn't work, because it doesn't give the top value for each unique A; it returns the same rows as the original data X.)
This can be achieved via dplyr::top_n or dplyr::slice_max (which supersedes top_n as of dplyr 1.0.0), like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)
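One caveat worth illustrating: by default slice_max() keeps all tied rows, so a group can return more than one row. A minimal sketch, assuming dplyr 1.0.0 or later; the data frame `X` below is a made-up stand-in for the asker's data, not the real thing.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data frame X
X <- tibble(
  A = c("a", "a", "b", "b", "b"),
  B = c("p", "q", "r", "s", "t"),
  C = c(1, 5, 3, 3, 2)
)

# Exactly one row per value of A, even though group "b" has a tie on C
Max <- X %>%
  group_by(A) %>%
  slice_max(C, n = 1, with_ties = FALSE) %>%
  ungroup()

Max
```

Dropping `with_ties = FALSE` would instead return both tied rows for group "b", each with its own B value.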

group_by() level disappear after filter()/mutate()/count() without using ungroup

This problem bothers me for the entire day and I don't know why it happens.
The issue is that the group_by level disappears after a single call such as filter(), mutate() or count(). To keep that level, I have to add group_by() again after each of these calls.
Below I attach an example.
As you can see, if I add group_by after filter, it works fine.
data("mtcars")
> mtcars %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
However, if I use group_by before filter and then count the values, the grouping level is lost:
data("mtcars")
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
mpg n
1 21.0 2
2 21.4 1
To make it work, I need to change the code to
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
This method also doesn't work:
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
mpg n
1 21.0 2
2 21.4 1
When I run the same code on another PC, it works as expected:
data("mtcars")
mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
# A tibble: 2 x 3
# Groups: cyl [1]
cyl mpg n
<dbl> <dbl> <int>
1 6 21 2
2 6 21.4 1
I have reinstalled the dplyr package many times and this keeps happening. I am using dplyr version 1.0.2.
I would really appreciate it if someone could help me with this issue!
Edit:
The problem was solved after I updated R to version 4.0.2 (my previous version was 3.6.3). I am not sure why dplyr doesn't work properly under 3.6.3, but at least the problem is solved for now.
Try this:
data("mtcars")
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
There may be a masking problem: filter() exists in both the dplyr and stats packages, so the bare name can resolve to the wrong function. The same issue has been discussed before, and a similar problem occurs with select().
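A quick way to check for masking is to ask R which package the bare name currently resolves to. This is only a diagnostic sketch; the variable names `filter_owner` and `res` are illustrative.

```r
library(dplyr)

# Which package does the bare name `filter` resolve to in this session?
# After library(dplyr) it is normally "dplyr"; "stats" would indicate masking.
filter_owner <- environmentName(environment(filter))
filter_owner

# Explicit namespaces sidestep the question of attach order entirely.
res <- mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::filter(hp == 110) %>%
  dplyr::count(mpg)

# The grouping variable should still be there.
dplyr::group_vars(res)
```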
Also note in that context the difference between:
data("mtcars")
mtcars %>%
group_by(cyl,gear) %>%
summarize(
n=n()
) %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl [3]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 11
2 4 4 8 11
3 4 5 2 11
4 6 3 2 7
5 6 4 4 7
6 6 5 1 7
7 8 3 12 14
8 8 5 2 14
mtcars %>%
group_by(cyl,gear) %>%
count() %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl, gear [8]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 1
2 4 4 8 8
3 4 5 2 2
4 6 3 2 2
5 6 4 4 4
6 6 5 1 1
7 8 3 12 12
8 8 5 2 2
summarise() defaults to dropping only the last grouping variable (.groups = "drop_last"), so its result is still grouped by cyl, whereas count() keeps the full cyl, gear grouping. And for a funny reason :)
https://twitter.com/hadleywickham/status/1254802700589555715
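To make the grouping after summarise() explicit rather than relying on the default, the `.groups` argument can be set directly. A minimal sketch, assuming dplyr 1.0.0 or later; `by_cyl` and `ungrouped` are illustrative names.

```r
library(dplyr)

# Spell out the default: the result is still grouped by cyl
by_cyl <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "drop_last")

# Drop all grouping, so later mutate() calls work on the whole table
ungrouped <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "drop")

group_vars(by_cyl)     # "cyl"
group_vars(ungrouped)  # character(0)

# With no groups left, sum(n) is the grand total of 32 rows in mtcars
ungrouped %>% mutate(mysum = sum(n))
```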

how to calculate proportion by another variable (not by frequency) in dplyr in R

Using mtcars data, I want to calculate proportion of mpg for each group of cyl and am. How to calc it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg/(sum(mpg))
Thanks in advance!
If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, then I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n()) %>%
ungroup() %>%
mutate(mpg.gr = mpg / sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that thanks to ungroup(), the proportions are calculated using the counts of all records, not just those within the cyl group, as before.
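The difference between the two denominators can be sketched with count(), which returns an ungrouped result when called on an ungrouped data frame. This is an illustration of the same idea, not the answer's original code; `overall`, `within_cyl` and `prop` are illustrative names.

```r
library(dplyr)

# Share of all records for each (cyl, am) combination
overall <- mtcars %>%
  count(cyl, am) %>%
  mutate(prop = n / sum(n))

# Share within each cyl group instead
within_cyl <- mtcars %>%
  count(cyl, am) %>%
  group_by(cyl) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

sum(overall$prop)     # proportions over the whole table add up to 1
sum(within_cyl$prop)  # one total of 1 per cyl group, so 3 overall
```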

Losing group_by information when using dplyr::do for the second time

I am running multiple models on multiple sections of my data set, similar to (but with many more models)
library(tidyverse)
d1 <- mtcars %>%
group_by(cyl) %>%
do(mod_linear = lm(mpg ~ disp + hp, data = ., x = TRUE))
d1
# Source: local data frame [3 x 3]
# Groups: <by row>
#
# # A tibble: 3 x 3
# cyl mod_linear
# * <dbl> <list>
# 1 4. <S3: lm>
# 2 6. <S3: lm>
# 3 8. <S3: lm>
I then tidy this tibble and save my parameter estimates using tidy() in the broom package.
I also want to calculate the standard deviation of the predictors (stored in models above as I set x = TRUE) to create and then compare re-scaled parameters. I can do the former of these using
d1 %>%
# group_by(cyl) %>%
do(term = colnames(.$mod$x),
pred_sd = apply(X = .$mod$x, MARGIN = 2, FUN = sd)) %>%
unnest()
# # A tibble: 9 x 2
# term pred_sd
# <chr> <dbl>
# 1 (Intercept) 0.00000
# 2 disp 26.87159
# 3 hp 20.93453
# 4 (Intercept) 0.00000
# 5 disp 41.56246
# 6 hp 24.26049
# 7 (Intercept) 0.00000
# 8 disp 67.77132
# 9 hp 50.97689
However, the result is not a grouped tibble, so I end up losing the cyl column that tells me which terms belong to which model. How can I avoid this loss? Adding group_by again seems to throw an error.
n.b. I want to avoid using purrr, at least for the first part (fitting the models), as I run different types of models and then need to reshape the results (d1), and I like the progress bar with do.
n.b. I want to work with the $x component of the models rather than the raw data as they have the data on correct scale (I am experimenting with different transformations of the predictors)
We can do this by nesting the data by cyl first, fitting the models inside the nested tibble, and then unnesting the results:
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_linear = map(data, ~ lm(mpg ~ disp + hp, data = .x, x = TRUE)),
term = map(mod_linear, ~ names(coef(.x))),
pred = map(mod_linear, ~ .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist )) %>%
select(-data, -mod_linear) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
Or, instead of calling map several times, this can be made more compact with a single call:
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_contents = map(data, ~ {
mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
term <- names(coef(mod))
pred <- mod$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
tibble(term, pred)
}
)) %>%
select(-data) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
If we start from 'd1' (based on the OP's code):
d1 %>%
ungroup %>%
mutate(mod_contents = map(mod_linear, ~ {
pred <- .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
term <- .x %>%
coef %>%
names
tibble(term, pred)
})) %>%
select(-mod_linear) %>%
unnest
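The snippets above use the older nest(-cyl) and bare unnest() forms. As a hedged sketch of the same approach with explicit column arguments, assuming tidyr 1.0.0 or later; the names `pred_sd` and `mod_contents` are illustrative, and the standard deviations are taken from the model's `$x` matrix as in the question.

```r
library(dplyr)
library(tidyr)
library(purrr)

pred_sd <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                      # one row per cyl, data in a list column
  mutate(mod_contents = map(data, function(df) {
    mod <- lm(mpg ~ disp + hp, data = df, x = TRUE)
    tibble(
      term    = colnames(mod$x),          # "(Intercept)", "disp", "hp"
      pred_sd = apply(mod$x, 2, sd)       # sd of each model-matrix column
    )
  })) %>%
  select(-data) %>%
  unnest(cols = mod_contents) %>% # cyl is carried along with every term
  ungroup()

pred_sd
```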

Using dplyr summarise_at with column index

I noticed that when supplying column indices to dplyr::summarize_at the column to be summarized is determined excluding the grouping column(s). I wonder if that is how it's supposed to be since by this design, using the correct column index depends on whether the summarising column(s) are positioned before or after the grouping columns.
Here's an example:
library(dplyr)
data("mtcars")
# grouping column after summarise columns
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
# gear disp hp
# <dbl> <dbl> <dbl>
#1 3 326.3000 176.1333
#2 4 123.0167 89.5000
#3 5 202.4800 195.6000
# grouping columns before summarise columns
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
# cyl hp drat
# <dbl> <dbl> <dbl>
#1 4 82.63636 4.070909
#2 6 122.28571 3.585714
#3 8 209.21429 3.229286
# no grouping columns
mtcars %>% summarise_at(3:4, mean)
# disp hp
#1 230.7219 146.6875
# actual third & fourth columns
names(mtcars)[3:4]
#[1] "disp" "hp"
packageVersion("dplyr")
#[1] ‘0.7.2’
Notice how the summarised columns change depending on grouping and position of the grouping column.
Is this the same on other platforms? Is it a bug or a feature?
With version 0.7.5, this behavior can no longer be reproduced:
library(dplyr)
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
# # A tibble: 3 x 3
# gear disp hp
# <dbl> <dbl> <dbl>
# 1 3 326. 176.
# 2 4 123. 89.5
# 3 5 202. 196.
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
# # A tibble: 3 x 3
# cyl disp hp
# <dbl> <dbl> <dbl>
# 1 4 105. 82.6
# 2 6 183. 122.
# 3 8 353. 209.
@docendodiscimus, thanks for pointing this out: even if this behavior was intentional, the documentation doesn't explain it explicitly, and in my case it could be a source of errors. Actually, this problem was solved before I answered the other question, and my comment above handles it properly with the same logic.
For now, one possible solution is to pass column names instead of indices. You can still work from indices by indexing into the column names, e.g. .vars = names(.)[3:4], as below:
mtcars %>%
group_by(cyl) %>%
summarise_at( .vars = colnames(.)[3:4] , mean)
mtcars %>%
group_by(cyl) %>%
summarise_at( .vars = names(.)[3:4] , mean)
## A tibble: 3 x 3
# cyl disp hp
# <dbl> <dbl> <dbl>
#1 4 105.1364 82.63636
#2 6 183.3143 122.28571
#3 8 353.1000 209.21429
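The same idea of resolving positions to names up front carries over to across(), which supersedes summarise_at() in newer dplyr. A minimal sketch, assuming dplyr 1.0.0 or later; `vars` and `res` are illustrative names.

```r
library(dplyr)

# Resolve the positions against the full data frame first,
# so the grouping column cannot shift the selection.
vars <- names(mtcars)[3:4]   # "disp" "hp"

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(all_of(vars), mean))

res
```

Because the selection is by name, the result does not depend on where the grouping column sits in the data frame.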

Resources