I want to apply parse_factor and then fct_recode to several variables in a data frame. The levels (and their recode values) are stored in named vectors.
How can I use those to implement what I want?
Note that in my case, I cannot simply use mutate, because I have several variables to which I want to apply the recoding.
Below is an example of what I thought would work (but does not).
library(tidyverse)
#> ── Attaching packages ────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.4
#> ✔ tidyr 0.8.0 ✔ stringr 1.3.0
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ───────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
gear_levels <- c("tri" = 3, "quad" = 4, "six" = 6, `NA` = 8)
mtcars %>%
mutate_at("gear", parse_factor, levels = gear_levels) %>%
mutate_at("gear", fct_recode, !!! gear_levels)
#> Warning: 5 parsing failures.
#> row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 27 NA value in level set 5 row 2 28 NA value in level set 5 col 3 29 NA value in level set 5 expected 4 30 NA value in level set 5 actual 5 31 NA value in level set 5
#> Error: Can't use `!!!` on atomic vectors in non-quoting functions
As per lionel's comment, this is what coercing to a list looks like. Note that fct_recode needs character values, and that you have to restore the names after as.character, which drops them. I'm not sure exactly how your desired levels are stored.
Also your supplied levels don't match those in mtcars$gear, in case you didn't realise.
library(tidyverse)
gear_levels <- c("tri" = 3, "quad" = 4, "six" = 6, `NA` = 8)
gear_recode <- as.list(as.character(gear_levels))
names(gear_recode) <- names(gear_levels)
mtcars %>%
mutate_at(vars(gear), parse_factor, levels = gear_levels) %>%
mutate_at(vars(gear), fct_recode, !!! gear_recode)
#> Warning: 5 parsing failures.
#> row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 27 NA value in level set 5 row 2 28 NA value in level set 5 col 3 29 NA value in level set 5 expected 4 30 NA value in level set 5 actual 5 31 NA value in level set 5
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 quad 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 quad 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 quad 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 tri 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 tri 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 tri 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 tri 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 quad 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 quad 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 quad 4
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 quad 4
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 tri 3
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 tri 3
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 tri 3
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 tri 4
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 tri 4
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 tri 4
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 quad 1
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 quad 2
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 quad 1
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 tri 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 tri 2
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 tri 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 tri 4
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 tri 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 quad 1
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 <NA> 2
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 <NA> 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 <NA> 4
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 <NA> 6
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 <NA> 8
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 quad 2
Created on 2018-03-16 by the reprex package (v0.2.0).
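For readers on current package versions, the recode step can also be sketched without mutate_at. This is only a sketch under assumptions: the lookup below is a hypothetical one whose values match mtcars$gear (unlike the one in the question), `cols` and `lookup` are illustrative names, and it relies on recent forcats/rlang releases accepting a named character vector after `!!!`.

```r
library(forcats)

# Hypothetical lookup (new label = old value), chosen to match mtcars$gear
gear_levels <- c(tri = 3, quad = 4, five = 5)

# fct_recode() expects character values; as.character() drops the names,
# so setNames() puts them back
lookup <- setNames(as.character(gear_levels), names(gear_levels))

# Columns to recode; extend this vector for more variables sharing the lookup
cols <- c("gear")

recoded <- mtcars
recoded[cols] <- lapply(recoded[cols], function(x) fct_recode(factor(x), !!!lookup))

levels(recoded$gear)
```

Values that are missing from the lookup keep their original level, and lookup entries that match no level only trigger a warning.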
I am comparing two alternative strategies to estimate linear regression models on subsets of data using the data.table package for R. The two strategies produce the same coefficients, so they appear equivalent. This appearance is deceiving. My question is:
Why is the data stored inside the lm models different?
library(data.table)
dat = data.table(mtcars)
# strategy 1
mod1 = dat[, .(models = .(lm(hp ~ mpg, data = .SD))), by = vs]
# strategy 2
mod2 = dat[, .(data = .(.SD)), by = vs][
, models := lapply(data, function(x) lm(hp ~ mpg, x))]
At first glance, the two approaches seem to produce identical results:
# strategy 1
coef(mod1$models[[1]])
#> (Intercept) mpg
#> 357.97866 -10.12576
# strategy 2
coef(mod2$models[[1]])
#> (Intercept) mpg
#> 357.97866 -10.12576
However, if I try to extract data from the (expanded) model.frame, I get different results:
# strategy 1
expanded_frame1 = expand.model.frame(mod1$models[[1]], "am")
table(expanded_frame1$am)
#>
#> 0 1
#> 7 11
# strategy 2
expanded_frame2 = expand.model.frame(mod2$models[[1]], "am")
table(expanded_frame2$am)
#>
#> 0 1
#> 12 6
This is a trivial minimal working example. My real use case is that I obtained radically different results when applying sandwich::vcovCL to compute clustered standard errors for my models.
Edit:
I'm accepting the answer by @TimTeaFan (excellent detective work!) but adding a bit of useful info here for future readers.
As @achim-zeileis pointed out elsewhere, we can replicate a similar behavior in the global environment:
d <- subset(mtcars, subset = vs == 0)
m0 <- lm(hp ~ mpg, data = d)
d <- mtcars[0, ]
expand.model.frame(m0, "am")
[1] hp mpg am
<0 rows> (or 0-length row.names)
This does not appear to be a data.table-specific issue. And in general, we have to be careful when re-evaluating the data from a model.
I don't have a complete answer, but I was able to pinpoint the problem to some extent.
When we compare the output of the two models, we can see that the result is equal except for the calls, which are different (which makes sense, since they actually are different):
# compare models
purrr::map2(mod1$models[[1]], mod2$models[[1]], all.equal)
#> $coefficients
#> [1] TRUE
#>
#> $residuals
#> [1] TRUE
#>
#> $effects
#> [1] TRUE
#>
#> $rank
#> [1] TRUE
#>
#> $fitted.values
#> [1] TRUE
#>
#> $assign
#> [1] TRUE
#>
#> $qr
#> [1] TRUE
#>
#> $df.residual
#> [1] TRUE
#>
#> $xlevels
#> [1] TRUE
#>
#> $call
#> [1] "target, current do not match when deparsed"
#>
#> $terms
#> [1] TRUE
#>
#> $model
#> [1] TRUE
So it seems that the initial call works correctly with both approaches. The problem arises once we try to access the underlying data.
If we have a look at how expand.model.frame gets its data, we can see that it calls eval(model$call$data, envir), where envir is defined as environment(formula(model)), the environment associated with the formula of the lm object.
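The lookup described above can be reproduced by hand. The following is a sketch in base R, not the actual source of expand.model.frame; the object names are illustrative.

```r
d <- subset(mtcars, vs == 0)
m <- lm(hp ~ mpg, data = d)

# The call stores only the *name* of the data, not the data itself
data_expr <- m$call$data          # the symbol `d`
env <- environment(formula(m))    # where that symbol will be looked up

before <- eval(data_expr, env)    # the rows with vs == 0

# Rebinding `d` afterwards changes what the same lookup returns
d <- mtcars[0, ]
after <- eval(data_expr, env)

c(nrow(before), nrow(after))
```

The fitted coefficients are unaffected by the rebinding, because they were computed when lm ran; only later re-evaluations of the data see the new value.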
If we have a look at the data in the associated environment of each model and compare it with the data we expect it to hold, we can see that the second approach yields the data we expect, while the first approach using .SD in the call yields some different data.
It is still not clear to me what exactly is happening and why, but we now know the problem lies in the use of .SD in the call. I first thought it might be caused by naming a data.table .SD, but after playing around with models where the data is a data.table called .SD, this does not seem to be the issue.
# data of model 2 (identical to subsetted mtcars)
environment(formula(mod2$models[[1]]))$x[order(mpg),]
#> mpg cyl disp hp drat wt qsec am gear carb
#> 1: 10.4 8 472.0 205 2.93 5.250 17.98 0 3 4
#> 2: 10.4 8 460.0 215 3.00 5.424 17.82 0 3 4
#> 3: 13.3 8 350.0 245 3.73 3.840 15.41 0 3 4
#> 4: 14.3 8 360.0 245 3.21 3.570 15.84 0 3 4
#> 5: 14.7 8 440.0 230 3.23 5.345 17.42 0 3 4
#> 6: 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
#> 7: 15.2 8 275.8 180 3.07 3.780 18.00 0 3 3
#> 8: 15.2 8 304.0 150 3.15 3.435 17.30 0 3 2
#> 9: 15.5 8 318.0 150 2.76 3.520 16.87 0 3 2
#> 10: 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
#> 11: 16.4 8 275.8 180 3.07 4.070 17.40 0 3 3
#> 12: 17.3 8 275.8 180 3.07 3.730 17.60 0 3 3
#> 13: 18.7 8 360.0 175 3.15 3.440 17.02 0 3 2
#> 14: 19.2 8 400.0 175 3.08 3.845 17.05 0 3 2
#> 15: 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
#> 16: 21.0 6 160.0 110 3.90 2.620 16.46 1 4 4
#> 17: 21.0 6 160.0 110 3.90 2.875 17.02 1 4 4
#> 18: 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
# subset and order mtcars data
mtcars_vs0 <- subset(mtcars, vs == 0)
mtcars_vs0[order(mtcars_vs0$mpg), ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# data of model 1 (not identical to mtcars)
environment(formula(mod1$models[[1]]))$.SD[order(mpg),]
#> mpg cyl disp hp drat wt qsec am gear carb
#> 1: 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
#> 2: 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
#> 3: 17.8 6 167.6 123 3.92 3.440 18.90 0 4 4
#> 4: 18.1 6 225.0 105 2.76 3.460 20.22 0 3 1
#> 5: 19.2 6 167.6 123 3.92 3.440 18.30 0 4 4
#> 6: 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
#> 7: 21.4 6 258.0 110 3.08 3.215 19.44 0 3 1
#> 8: 21.4 4 121.0 109 4.11 2.780 18.60 1 4 2
#> 9: 21.5 4 120.1 97 3.70 2.465 20.01 0 3 1
#> 10: 22.8 4 108.0 93 3.85 2.320 18.61 1 4 1
#> 11: 22.8 4 140.8 95 3.92 3.150 22.90 0 4 2
#> 12: 24.4 4 146.7 62 3.69 3.190 20.00 0 4 2
#> 13: 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
#> 14: 27.3 4 79.0 66 4.08 1.935 18.90 1 4 1
#> 15: 30.4 4 75.7 52 4.93 1.615 18.52 1 4 2
#> 16: 30.4 4 95.1 113 3.77 1.513 16.90 1 5 2
#> 17: 32.4 4 78.7 66 4.08 2.200 19.47 1 4 1
#> 18: 33.9 4 71.1 65 4.22 1.835 19.90 1 4 1
Add on
I tried digging a little deeper to see what's going on. First I called debug(as.formula) and then looked at the following objects in each iteration:
object
ls(environment(object))
We can see that in "strategy 2" each formula is associated with a different environment, and when looking at the environment we see it contains one object x, which when inspected (environment(object)$x) contains the expected mtcars data.
In "strategy 1", however, each call to as.formula associates the same environment with the formula being created. Further, when inspecting that environment we can see that it is populated with the individual column vectors of the subsetted mtcars data (e.g. am, carb, cyl etc.) as well as some functions (e.g. .POSIXt, Cfastmean, strptime etc.). This is probably where things go awry. I suspect that when the same environment is associated with two different formulas (models), the first model's underlying data gets "updated" when the second model is calculated. This would also explain why the model output itself is correct: at the time the first model is calculated, the data is still correct. It is then overwritten by the second model's data, so the second model is correct, too. But when the underlying data is accessed afterwards, things get messy.
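The per-model environment of "strategy 2" can be sketched in base R, without data.table. Here fit_subset is a hypothetical helper: each call gets a fresh function environment that keeps its own x, which mirrors what the lapply in strategy 2 does.

```r
# Fit one model per subset inside a helper, so that each formula's
# environment holds its own subset under the name `x`
fit_subset <- function(x) lm(hp ~ mpg, data = x)

pieces <- split(mtcars, mtcars$vs)   # list with elements "0" and "1"
models <- lapply(pieces, fit_subset)

# expand.model.frame() looks up `x` in environment(formula(model)),
# i.e. the function environment that still holds the matching subset
frame0 <- expand.model.frame(models[["0"]], "am")
table(frame0$am)
```

Because the two models do not share an environment, fitting the second one cannot overwrite the data that the first one refers to.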
Side note
I was curious if we can observe similar problems and differences in the tidyverse when using expand.model.frame and the answer is "yes". Here, the new rowwise notation throws an error, while the group_map as well as the map approach work:
# dplyr approaches:
# group_map: works
mod3 <- mtcars %>%
group_by(vs) %>%
group_map(~ lm(hp ~ mpg, data = .x))
expand.model.frame(mod3[[1]], "am")
# mutate / rowwise: does not work
mod4 <- mtcars %>%
nest_by(vs) %>%
mutate(models = list(lm(hp ~ mpg, data = data)))
expand.model.frame(mod4$models[[1]], "am")
# mutate / map: works
mod5 <- mtcars %>%
tidyr::nest(data = !vs) %>%
mutate(models = purrr::map(data, ~ lm(hp ~ mpg, data = .x)))
expand.model.frame(mod5$models[[1]], "am")
How to recode some dataframe values to NA if they don't appear in a separate vector?
More specifically, how to approach such task when:
each data column to clean has its specific set of "valid" values to keep, independent of other columns
column-specific values are given in a separate table (as vectors nested in a list-column in a tibble)
Example
My data to clean up is my_mtcars
I want to clean up certain columns (cars, gear, and carb)
In each of those columns, I want to keep only certain values as they are specified in a separate table table_valid_values under valid_values. Otherwise, values not specified as "valid" should turn to NA.
For any column of my_mtcars that does not appear in table_valid_values, no cleanup is needed.
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
my_mtcars <- rownames_to_column(mtcars, "cars")
as_tibble(my_mtcars)
#> # A tibble: 32 x 12
#> cars mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 ~ 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D~ 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Spo~ 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
table_valid_values <-
structure(
list(
var_name = c("cars", "gear", "carb"),
valid_values = list(
c("Valiant", "AMC Javelin", "Ferrari Dino"),
c(3, 5),
c(1, 4, 6)
)
),
row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")
)
table_valid_values
#> # A tibble: 3 x 2
#> var_name valid_values
#> <chr> <list>
#> 1 cars <chr [3]>
#> 2 gear <dbl [2]>
#> 3 carb <dbl [3]>
table_valid_values %>%
pull(valid_values)
#> [[1]]
#> [1] "Valiant" "AMC Javelin" "Ferrari Dino"
#>
#> [[2]]
#> [1] 3 5
#>
#> [[3]]
#> [1] 1 4 6
Created on 2021-01-27 by the reprex package (v0.3.0)
Desired Output
Provided with only table_valid_values, how can I clean up my_mtcars to get the following:
## cars mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 21 6 160 110 3.9 2.62 16.5 0 1 NA 4
## 2 NA 21 6 160 110 3.9 2.88 17.0 0 1 NA 4
## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 1 NA 1
## 4 NA 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 NA 18.7 8 360 175 3.15 3.44 17.0 0 0 3 NA
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 NA 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 NA 24.4 4 147. 62 3.69 3.19 20 1 0 NA NA
## 9 NA 22.8 4 141. 95 3.92 3.15 22.9 1 0 NA NA
## 10 NA 19.2 6 168. 123 3.92 3.44 18.3 1 0 NA 4
## 11 NA 17.8 6 168. 123 3.92 3.44 18.9 1 0 NA 4
## 12 NA 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 NA
## 13 NA 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 NA
## 14 NA 15.2 8 276. 180 3.07 3.78 18 0 0 3 NA
## 15 NA 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 16 NA 10.4 8 460 215 3 5.42 17.8 0 0 3 4
## 17 NA 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## 18 NA 32.4 4 78.7 66 4.08 2.2 19.5 1 1 NA 1
## 19 NA 30.4 4 75.7 52 4.93 1.62 18.5 1 1 NA NA
## 20 NA 33.9 4 71.1 65 4.22 1.84 19.9 1 1 NA 1
## 21 NA 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 22 NA 15.5 8 318 150 2.76 3.52 16.9 0 0 3 NA
## 23 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 NA
## 24 NA 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
## 25 NA 19.2 8 400 175 3.08 3.84 17.0 0 0 3 NA
## 26 NA 27.3 4 79 66 4.08 1.94 18.9 1 1 NA 1
## 27 NA 26 4 120. 91 4.43 2.14 16.7 0 1 5 NA
## 28 NA 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 NA
## 29 NA 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## 30 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 31 NA 15 8 301 335 3.54 3.57 14.6 0 1 5 NA
## 32 NA 21.4 4 121 109 4.11 2.78 18.6 1 1 NA NA
I also wonder, what if we wanted to replace invalid values with a string of choice (say, invalid) rather than NA?
You could use dplyr as follows:
library(dplyr)
my_mtcars %>%
mutate(across(all_of(table_valid_values$var_name), ~{
replace(.x, !.x %in%
table_valid_values$valid_values[match(cur_column(),
table_valid_values$var_name)][[1]], NA)
}))
Similarly, in base R:
my_mtcars[table_valid_values$var_name] <- lapply(table_valid_values$var_name,
function(x) {
replace(my_mtcars[[x]],
!my_mtcars[[x]] %in% table_valid_values$valid_values[
match(x, table_valid_values$var_name)][[1]], NA)
})
my_mtcars
# cars mpg cyl disp hp drat wt qsec vs am gear carb
#1 <NA> 21.0 6 160.0 110 3.90 2.620 16.46 0 1 NA 4
#2 <NA> 21.0 6 160.0 110 3.90 2.875 17.02 0 1 NA 4
#3 <NA> 22.8 4 108.0 93 3.85 2.320 18.61 1 1 NA 1
#4 <NA> 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#5 <NA> 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 NA
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#7 <NA> 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#8 <NA> 24.4 4 146.7 62 3.69 3.190 20.00 1 0 NA NA
#9 <NA> 22.8 4 140.8 95 3.92 3.150 22.90 1 0 NA NA
#10 <NA> 19.2 6 167.6 123 3.92 3.440 18.30 1 0 NA 4
#11 <NA> 17.8 6 167.6 123 3.92 3.440 18.90 1 0 NA 4
#12 <NA> 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 NA
#13 <NA> 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 NA
#14 <NA> 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 NA
#15 <NA> 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#16 <NA> 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#17 <NA> 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#18 <NA> 32.4 4 78.7 66 4.08 2.200 19.47 1 1 NA 1
#19 <NA> 30.4 4 75.7 52 4.93 1.615 18.52 1 1 NA NA
#20 <NA> 33.9 4 71.1 65 4.22 1.835 19.90 1 1 NA 1
#21 <NA> 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#22 <NA> 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 NA
#23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 NA
#24 <NA> 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#25 <NA> 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 NA
#26 <NA> 27.3 4 79.0 66 4.08 1.935 18.90 1 1 NA 1
#27 <NA> 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 NA
#28 <NA> 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 NA
#29 <NA> 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#31 <NA> 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 NA
#32 <NA> 21.4 4 121.0 109 4.11 2.780 18.60 1 1 NA NA
To use a different replacement (for example "invalid"), substitute it for NA in the replace call.
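On the follow-up about using a string such as "invalid": a numeric column cannot hold a string, so the column has to become character first. A base R sketch, assuming for brevity that the valid values sit in a plain named list (valid, a stand-in for table_valid_values) and that my_cars is built from mtcars as in the question:

```r
my_cars <- data.frame(cars = rownames(mtcars), mtcars,
                      row.names = NULL, stringsAsFactors = FALSE)

# Stand-in for table_valid_values: column name -> values to keep
valid <- list(
  cars = c("Valiant", "AMC Javelin", "Ferrari Dino"),
  gear = c(3, 5),
  carb = c(1, 4, 6)
)

for (col in names(valid)) {
  keep <- my_cars[[col]] %in% valid[[col]]
  values <- as.character(my_cars[[col]])  # numeric columns become character
  values[!keep] <- "invalid"
  my_cars[[col]] <- values
}

table(my_cars$gear)
```

The cost of this choice is that gear and carb are no longer numeric afterwards, which is one reason to prefer NA when the columns are used in calculations later.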
I've used curly-curly with group_by and summarise as described in the rlang announcement. But I can't get it to work when mutating a variable in place. What's the best way to do this currently with dplyr?
Say I want to supply an unquoted column name and have it mutated, here's a toy example function that doesn't work:
my_fun <- function(dat, var_name){
dat %>%
mutate({{var_name}} = 1)
}
my_fun(mtcars, cyl)
What should that mutate line be to change any column in mtcars to be a constant?
You need to use the := operator if you want to use curly-curly to specify a name on the left-hand side of an assignment in mutate:
my_fun <- function(dat, var_name){
dat %>%
mutate({{var_name}} := 1)
}
Which allows:
my_fun(mtcars, cyl)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 1 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 21.0 1 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 1 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 1 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 1 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 18.1 1 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 14.3 1 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 24.4 1 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 22.8 1 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 19.2 1 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 17.8 1 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 16.4 1 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 17.3 1 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 15.2 1 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 10.4 1 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 10.4 1 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 14.7 1 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 32.4 1 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 30.4 1 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 33.9 1 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 21.5 1 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 15.5 1 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 15.2 1 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 13.3 1 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 19.2 1 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 27.3 1 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 26.0 1 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 30.4 1 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 15.8 1 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 19.7 1 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 15.0 1 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 21.4 1 121.0 109 4.11 2.780 18.60 1 1 4 2
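A related sketch: with recent dplyr and rlang versions, := also accepts a glue-style string on the left-hand side, which is handy when the new column name should be derived from the supplied one. add_const and the _const suffix are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical helper: adds a constant column named "<var_name>_const"
add_const <- function(dat, var_name) {
  dat %>%
    mutate("{{ var_name }}_const" := 1)
}

out <- add_const(mtcars, cyl)
names(out)[ncol(out)]
```

Unlike my_fun, this variant leaves the original column untouched and appends a new one.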
Does the dplyr function is_grouped_df() actually require the input to be a data frame (vs a data table, tibble, etc.)? If it does not require a data frame, why isn't it named is_grouped instead of is_grouped_df?
#1 - mtcars with multiple classes
mtcars %>% group_by(cyl) %>% is_grouped_df()
#> [1] TRUE
mtcars %>% group_by(cyl) %>% class()
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
I can group a multiple class mtcars data set, and confirm with the is_grouped_df() function that the data set is grouped.
#2 - mtcars as a tibble
mtcars %>% group_by(cyl) %>% as_tibble() %>% is_grouped_df()
#> [1] FALSE
mtcars %>% group_by(cyl) %>% as_tibble() %>% class()
#> [1] "tbl_df" "tbl" "data.frame"
I can try to force mtcars to be a tibble and notice that when I check if it is a is_grouped_df I get FALSE as the answer. Even though that doesn't seem to be the case. I never called the ungroup() function in my pipe after grouping. Why FALSE?
#3 - mtcars as a data frame (an attempt to return to #1)
mtcars %>% group_by(cyl) %>% as_tibble() %>% as.data.frame() %>% is_grouped_df()
#> [1] FALSE
mtcars %>% group_by(cyl) %>% as_tibble() %>% as.data.frame() %>% class()
#> [1] "data.frame"
I can try to force mtcars to be a data frame and notice that when I check if it is a is_grouped_df I get FALSE as the answer. Even though that doesn't seem to be the case. I never called the ungroup() function in my pipe after grouping. Why FALSE?
And now I circle back to the original question, "Does the dplyr function is_grouped_df() actually require the input to be a data frame (vs a data table, tibble, etc.)?". And why all the inconsistencies in my three examples above?
As soon as you add the as_tibble or as.data.frame functions, the groups that were created by the group_by function are deleted.
A plain data frame cannot carry groups. As soon as you use group_by, the data frame is converted to a grouped tibble (class grouped_df):
class(mtcars)
[1] "data.frame"
mtcars %>%
group_by(cyl) %>%
class()
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
You can see how the data frame gets converted into a tibble by using group_by
mtcars %>%
group_by(cyl)
# A tibble: 32 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
But as soon as you call as_tibble again, the groups disappear.
mtcars %>%
group_by(cyl) %>%
as_tibble()
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
The same thing happens if you call as.data.frame after using group_by:
mtcars %>%
group_by(cyl) %>%
as.data.frame()
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
So basically, is_grouped_df works as intended, only detecting tibbles that have groups. In this case, it's important to note that as_tibble will effectively reset the groups that have already been created, so it ends up acting as ungroup if called on a tibble.
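The behaviour described above can be checked directly. A small sketch, assuming a recent dplyr; group_vars is used here only to show which grouping variables remain.

```r
library(dplyr)

grouped <- group_by(mtcars, cyl)

# The grouping lives in the grouped_df class and its attributes
checks <- c(
  grouped       = is_grouped_df(grouped),
  as_tibble     = is_grouped_df(as_tibble(grouped)),
  as_data_frame = is_grouped_df(as.data.frame(grouped)),
  ungrouped     = is_grouped_df(ungroup(grouped))
)
checks

group_vars(grouped)             # "cyl"
group_vars(as_tibble(grouped))  # character(0): the grouping is gone
```

So the _df suffix refers to the grouped_df class that group_by creates, not to a requirement that the input be a base data frame.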
I get an error when trying to call my function where dplyr is used inside the function. Does dplyr not work inside R functions?
all_df_yoy <- function(all_df, units) {
all_df_yoy <- all_df %>% mutate(
players_units_yoy = units)
}
us_players_all_df_yoy <- all_df_yoy(us_players_all_df, players_units_us)
I get the following error.
Error in compat_lazy_dots(.dots, caller_env(), ..., .named = TRUE) :
object 'players_units_us' not found
However, players_units_us does indeed exist inside the data frame.
Without a minimal reproducible example it's impossible to answer this question exactly, but you need to use tidy evaluation to write functions that behave the way dplyr's own functions do. Here is a brief example of what you have to do:
library(tidyverse)
create_new_col <- function(df, units) {
units <- enquo(units)
df %>%
mutate(players_units_yoy = !!units)
}
mtcars %>%
create_new_col(cyl)
#> mpg cyl disp hp drat wt qsec vs am gear carb players_units_yoy
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 6
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 6
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 6
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 8
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 6
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 6
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 8
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 8
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 8
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 8
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 8
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 8
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 4
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 8
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 8
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 8
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 8
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 4
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 4
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 8
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 6
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 8
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4
Created on 2019-05-02 by the reprex package (v0.2.1)
You can read more on this here: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
If you are new to programming in R, realize that this is a hurdle most users go through when beginning to develop their own packages. So don't worry if it doesn't click at first. Become more familiar with R (try writing your functions using base R) and then come back to this topic.
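With rlang 0.4.0 or later, the enquo()/!! pair can be written with the curly-curly operator instead. A minimal sketch of the same function, assuming a recent dplyr:

```r
library(dplyr)

# Same idea as create_new_col above, with {{ }} replacing enquo() and !!
create_new_col <- function(df, units) {
  df %>%
    mutate(players_units_yoy = {{ units }})
}

res <- create_new_col(mtcars, cyl)
head(res$players_units_yoy)
```

The column is still passed unquoted, as in create_new_col(mtcars, cyl); the braces tell mutate to look the argument up in the data rather than in the calling environment.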