Recode dataframe values to NA per column - r

How to recode some dataframe values to NA if they don't appear in a separate vector?
More specifically, how to approach such a task when:
each data column to clean has its specific set of "valid" values to keep, independent of other columns
column-specific values are given in a separate table (as vectors nested in a list-column in a tibble)
Example
My data to clean up is my_mtcars
I want to clean up certain columns (cars, gear, and carb)
In each of those columns, I want to keep only certain values as they are specified in a separate table table_valid_values under valid_values. Otherwise, values not specified as "valid" should turn to NA.
For any column of my_mtcars that does not appear in table_valid_values, no cleanup is needed.
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
my_mtcars <- rownames_to_column(mtcars, "cars")
as_tibble(my_mtcars)
#> # A tibble: 32 x 12
#> cars mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 ~ 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D~ 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Spo~ 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
table_valid_values <-
structure(
list(
var_name = c("cars", "gear", "carb"),
valid_values = list(
c("Valiant", "AMC Javelin", "Ferrari Dino"),
c(3, 5),
c(1, 4, 6)
)
),
row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")
)
table_valid_values
#> # A tibble: 3 x 2
#> var_name valid_values
#> <chr> <list>
#> 1 cars <chr [3]>
#> 2 gear <dbl [2]>
#> 3 carb <dbl [3]>
table_valid_values %>%
pull(valid_values)
#> [[1]]
#> [1] "Valiant" "AMC Javelin" "Ferrari Dino"
#>
#> [[2]]
#> [1] 3 5
#>
#> [[3]]
#> [1] 1 4 6
Created on 2021-01-27 by the reprex package (v0.3.0)
Desired Output
Provided with only table_valid_values, how can I clean up my_mtcars to get the following:
## cars mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 21 6 160 110 3.9 2.62 16.5 0 1 NA 4
## 2 NA 21 6 160 110 3.9 2.88 17.0 0 1 NA 4
## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 1 NA 1
## 4 NA 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 NA 18.7 8 360 175 3.15 3.44 17.0 0 0 3 NA
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 NA 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 NA 24.4 4 147. 62 3.69 3.19 20 1 0 NA NA
## 9 NA 22.8 4 141. 95 3.92 3.15 22.9 1 0 NA NA
## 10 NA 19.2 6 168. 123 3.92 3.44 18.3 1 0 NA 4
## 11 NA 17.8 6 168. 123 3.92 3.44 18.9 1 0 NA 4
## 12 NA 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 NA
## 13 NA 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 NA
## 14 NA 15.2 8 276. 180 3.07 3.78 18 0 0 3 NA
## 15 NA 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 16 NA 10.4 8 460 215 3 5.42 17.8 0 0 3 4
## 17 NA 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## 18 NA 32.4 4 78.7 66 4.08 2.2 19.5 1 1 NA 1
## 19 NA 30.4 4 75.7 52 4.93 1.62 18.5 1 1 NA NA
## 20 NA 33.9 4 71.1 65 4.22 1.84 19.9 1 1 NA 1
## 21 NA 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 22 NA 15.5 8 318 150 2.76 3.52 16.9 0 0 3 NA
## 23 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 NA
## 24 NA 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
## 25 NA 19.2 8 400 175 3.08 3.84 17.0 0 0 3 NA
## 26 NA 27.3 4 79 66 4.08 1.94 18.9 1 1 NA 1
## 27 NA 26 4 120. 91 4.43 2.14 16.7 0 1 5 NA
## 28 NA 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 NA
## 29 NA 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## 30 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 31 NA 15 8 301 335 3.54 3.57 14.6 0 1 5 NA
## 32 NA 21.4 4 121 109 4.11 2.78 18.6 1 1 NA NA
I also wonder, what if we wanted to replace invalid values with a string of choice (say, invalid) rather than NA?

You could use dplyr:
library(dplyr)
my_mtcars %>%
  mutate(across(all_of(table_valid_values$var_name), ~ {
    # Look up the valid values for the column currently being processed
    valid <- table_valid_values$valid_values[[match(cur_column(), table_valid_values$var_name)]]
    replace(.x, !.x %in% valid, NA)
  }))
Similarly, in base R:
my_mtcars[table_valid_values$var_name] <- lapply(table_valid_values$var_name, function(x) {
  # Look up the valid values for column x
  valid <- table_valid_values$valid_values[[match(x, table_valid_values$var_name)]]
  replace(my_mtcars[[x]], !my_mtcars[[x]] %in% valid, NA)
})
my_mtcars
# cars mpg cyl disp hp drat wt qsec vs am gear carb
#1 <NA> 21.0 6 160.0 110 3.90 2.620 16.46 0 1 NA 4
#2 <NA> 21.0 6 160.0 110 3.90 2.875 17.02 0 1 NA 4
#3 <NA> 22.8 4 108.0 93 3.85 2.320 18.61 1 1 NA 1
#4 <NA> 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#5 <NA> 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 NA
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#7 <NA> 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#8 <NA> 24.4 4 146.7 62 3.69 3.190 20.00 1 0 NA NA
#9 <NA> 22.8 4 140.8 95 3.92 3.150 22.90 1 0 NA NA
#10 <NA> 19.2 6 167.6 123 3.92 3.440 18.30 1 0 NA 4
#11 <NA> 17.8 6 167.6 123 3.92 3.440 18.90 1 0 NA 4
#12 <NA> 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 NA
#13 <NA> 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 NA
#14 <NA> 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 NA
#15 <NA> 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#16 <NA> 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#17 <NA> 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#18 <NA> 32.4 4 78.7 66 4.08 2.200 19.47 1 1 NA 1
#19 <NA> 30.4 4 75.7 52 4.93 1.615 18.52 1 1 NA NA
#20 <NA> 33.9 4 71.1 65 4.22 1.835 19.90 1 1 NA 1
#21 <NA> 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#22 <NA> 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 NA
#23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 NA
#24 <NA> 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#25 <NA> 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 NA
#26 <NA> 27.3 4 79.0 66 4.08 1.935 18.90 1 1 NA 1
#27 <NA> 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 NA
#28 <NA> 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 NA
#29 <NA> 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#31 <NA> 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 NA
#32 <NA> 21.4 4 121.0 109 4.11 2.780 18.60 1 1 NA NA
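To use a different replacement, put it in place of NA in the replace() call. Note that a string such as "invalid" will coerce the numeric columns (gear and carb) to character.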
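As a minimal sketch of the string-replacement variant, assuming the same inputs as in the question (rebuilt here in base R so the block runs on its own): the names valid, fill, and cleaned are illustrative and not from the original answer. The columns are converted to character explicitly, so the type change is visible rather than a side effect.

```r
# Rebuild the question's inputs with base R only
my_mtcars <- data.frame(cars = rownames(mtcars), mtcars,
                        row.names = NULL, stringsAsFactors = FALSE)
table_valid_values <- data.frame(var_name = c("cars", "gear", "carb"),
                                 stringsAsFactors = FALSE)
table_valid_values$valid_values <- list(
  c("Valiant", "AMC Javelin", "Ferrari Dino"),
  c(3, 5),
  c(1, 4, 6)
)

# Named list: column name -> vector of valid values (illustrative helper)
valid <- setNames(table_valid_values$valid_values, table_valid_values$var_name)

fill <- "invalid"  # hypothetical replacement string
cleaned <- my_mtcars
for (col in names(valid)) {
  x <- cleaned[[col]]
  bad <- !x %in% valid[[col]]  # TRUE where the value is not in the valid set
  x <- as.character(x)         # make the numeric -> character change explicit
  x[bad] <- fill
  cleaned[[col]] <- x
}
head(cleaned[c("cars", "gear", "carb")])
```

Columns that do not appear in table_valid_values (mpg, cyl, and so on) are left untouched and keep their numeric type.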

Related

Is there a way to combine mutate with another function to add a row of the numeric column means in R?

Thanks so much for any help. I have read tons of answers but I can't seem to figure this out for my specific case. I'm trying to use mutate() with another function to create a new row with the means of each column, should that column contain numeric variables. So far, I've only been able to add a column, which is not what I want. I tried the following:
x <- y %>%
mutate(Total = colMeans(select_if(., is.numeric), na.rm = TRUE)) %>%
head
This only added a column with the means, instead of a row.
How can I add a row called "Means" with the mean of each column? Thank you so much.
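One way is to summarize and then bind_rows with the original data. I'll use mtcars with the row names moved into a column. Note that this example computes the means per cyl group, so it adds one "Means" row for each group.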
mt <- rownames_to_column(mtcars)
mt %>%
group_by(cyl) %>%
summarize(across(-rowname, mean)) %>%
mutate(rowname = "Means") %>%
bind_rows(mt) %>%
arrange(cyl, rowname != "Means") %>%
print(n=99)
# # A tibble: 35 x 12
# cyl mpg disp hp drat wt qsec vs am gear carb rowname
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55 Means
# 2 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1 Datsun 710
# 3 4 24.4 147. 62 3.69 3.19 20 1 0 4 2 Merc 240D
# 4 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2 Merc 230
# 5 4 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1 Fiat 128
# 6 4 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2 Honda Civic
# 7 4 33.9 71.1 65 4.22 1.84 19.9 1 1 4 1 Toyota Corolla
# 8 4 21.5 120. 97 3.7 2.46 20.0 1 0 3 1 Toyota Corona
# 9 4 27.3 79 66 4.08 1.94 18.9 1 1 4 1 Fiat X1-9
# 10 4 26 120. 91 4.43 2.14 16.7 0 1 5 2 Porsche 914-2
# 11 4 30.4 95.1 113 3.77 1.51 16.9 1 1 5 2 Lotus Europa
# 12 4 21.4 121 109 4.11 2.78 18.6 1 1 4 2 Volvo 142E
# 13 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43 Means
# 14 6 21 160 110 3.9 2.62 16.5 0 1 4 4 Mazda RX4
# 15 6 21 160 110 3.9 2.88 17.0 0 1 4 4 Mazda RX4 Wag
# 16 6 21.4 258 110 3.08 3.22 19.4 1 0 3 1 Hornet 4 Drive
# 17 6 18.1 225 105 2.76 3.46 20.2 1 0 3 1 Valiant
# 18 6 19.2 168. 123 3.92 3.44 18.3 1 0 4 4 Merc 280
# 19 6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4 Merc 280C
# 20 6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 Ferrari Dino
# 21 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5 Means
# 22 8 18.7 360 175 3.15 3.44 17.0 0 0 3 2 Hornet Sportabout
# 23 8 14.3 360 245 3.21 3.57 15.8 0 0 3 4 Duster 360
# 24 8 16.4 276. 180 3.07 4.07 17.4 0 0 3 3 Merc 450SE
# 25 8 17.3 276. 180 3.07 3.73 17.6 0 0 3 3 Merc 450SL
# 26 8 15.2 276. 180 3.07 3.78 18 0 0 3 3 Merc 450SLC
# 27 8 10.4 472 205 2.93 5.25 18.0 0 0 3 4 Cadillac Fleetwood
# 28 8 10.4 460 215 3 5.42 17.8 0 0 3 4 Lincoln Continental
# 29 8 14.7 440 230 3.23 5.34 17.4 0 0 3 4 Chrysler Imperial
# 30 8 15.5 318 150 2.76 3.52 16.9 0 0 3 2 Dodge Challenger
# 31 8 15.2 304 150 3.15 3.44 17.3 0 0 3 2 AMC Javelin
# 32 8 13.3 350 245 3.73 3.84 15.4 0 0 3 4 Camaro Z28
# 33 8 19.2 400 175 3.08 3.84 17.0 0 0 3 2 Pontiac Firebird
# 34 8 15.8 351 264 4.22 3.17 14.5 0 1 5 4 Ford Pantera L
# 35 8 15 301 335 3.54 3.57 14.6 0 1 5 8 Maserati Bora
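The answer above adds one row of means per cyl group. If a single overall "Means" row is wanted, as the question describes, the same summarize-then-bind idea might look like the sketch below. It assumes dplyr 1.0.0 or later (for across()); the names means_row and with_means are illustrative.

```r
library(dplyr)

# Move row names into a column called "rowname"
mt <- tibble::rownames_to_column(mtcars)

# One-row summary holding the mean of every numeric column
means_row <- mt %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
  mutate(rowname = "Means")

# Append the summary row below the original data
with_means <- bind_rows(mt, means_row)
tail(with_means, 3)
```

bind_rows() matches columns by name, so the different column order of means_row does not matter.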

Transfer unique values per column into rows - a maximum of 10 values per row

my problem in brief:
For this I would simply use the standard mtcars data frame.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Now I want to transfer all properties/columns into rows and transfer all of the unique values into (maximum) 10 value columns. If there are more than 10 unique values, these should be included in a further line.
The expected dataframe looks like:
Prop Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8 Value9 Value10
mpg 21.0 22.8 21.4 18.7 18.1 14.3 24.4 19.2 17.8 16.4
mpg 17.3 15.2 10.4 14.7 32.4 30.4 33.9 21.5 15.5 13.3
mpg 27.3 26.0 15.8 19.7 15.0 NA NA NA NA NA
cyl ...
...
thank you very much for your help.
How about this method using a for loop?
# Start with an empty 0-row matrix: 1 name column + 10 value columns
df <- matrix(nrow = 0, ncol = 11)
for (i in seq_along(mtcars)) {
  vals <- unique(mtcars[[i]])
  # Number of NAs needed to pad the length up to a multiple of 10
  pad <- (10 - length(vals) %% 10) %% 10
  m <- matrix(c(vals, rep(NA, pad)), ncol = 10, byrow = TRUE)
  # Prepend the column name to every row of this chunk
  m <- cbind(rep(colnames(mtcars)[i], nrow(m)), m)
  df <- rbind(df, m)
}
df <- as.data.frame(df)
The output looks like this:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 mpg 21 22.8 21.4 18.7 18.1 14.3 24.4 19.2 17.8 16.4
2 mpg 17.3 15.2 10.4 14.7 32.4 30.4 33.9 21.5 15.5 13.3
3 mpg 27.3 26 15.8 19.7 15 <NA> <NA> <NA> <NA> <NA>
4 cyl 6 4 8 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 disp 160 108 258 360 225 146.7 140.8 167.6 275.8 472
6 disp 460 440 78.7 75.7 71.1 120.1 318 304 350 400
7 disp 79 120.3 95.1 351 145 301 121 <NA> <NA> <NA>
8 hp 110 93 175 105 245 62 95 123 180 205
9 hp 215 230 66 52 65 97 150 91 113 264
10 hp 335 109 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
11 drat 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.07 2.93
12 drat 3 3.23 4.08 4.93 4.22 3.7 3.73 4.43 3.77 3.62
13 drat 3.54 4.11 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
14 wt 2.62 2.875 2.32 3.215 3.44 3.46 3.57 3.19 3.15 4.07
15 wt 3.73 3.78 5.25 5.424 5.345 2.2 1.615 1.835 2.465 3.52
16 wt 3.435 3.84 3.845 1.935 2.14 1.513 3.17 2.77 2.78 <NA>
17 qsec 16.46 17.02 18.61 19.44 20.22 15.84 20 22.9 18.3 18.9
18 qsec 17.4 17.6 18 17.98 17.82 17.42 19.47 18.52 19.9 20.01
19 qsec 16.87 17.3 15.41 17.05 16.7 16.9 14.5 15.5 14.6 18.6
20 vs 0 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
21 am 1 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
22 gear 4 3 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
23 carb 4 1 2 3 6 8 <NA> <NA> <NA> <NA>
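Because cbind() mixes the column name with the values, the loop above turns every value into character. A sketch of the same chunking idea that keeps the values numeric is shown below; chunk_unique, chunks, and result are hypothetical names, not part of the original answer.

```r
# Split the unique values of one vector into rows of `width` values,
# padding the last row with NA
chunk_unique <- function(x, width = 10) {
  vals <- unique(x)
  pad <- (width - length(vals) %% width) %% width
  matrix(c(vals, rep(NA, pad)), ncol = width, byrow = TRUE,
         dimnames = list(NULL, paste0("Value", seq_len(width))))
}

# One numeric matrix per column of mtcars
chunks <- lapply(mtcars, chunk_unique)

# Stack the matrices and label each row with its source column
result <- data.frame(
  Prop = rep(names(chunks), vapply(chunks, nrow, integer(1))),
  do.call(rbind, unname(chunks)),
  stringsAsFactors = FALSE
)
head(result, 4)
```

Keeping Prop in the data frame rather than in the matrix is what lets Value1 to Value10 stay numeric.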
library(tidyverse)
n <- max(map_int(mtcars, ~ length(unique(.x))))
n
#> [1] 30
map_dfc(mtcars, ~ c(unique(.x), rep(NA, n - length(unique(.x))))) %>%
  mutate(val = paste0('value', 1 + (row_number() - 1) %% 10),
         row_seq = 1 + (row_number() - 1) %/% 10) %>%
  pivot_longer(!c(val, row_seq), values_drop_na = TRUE) %>%
  pivot_wider(id_cols = c(name, row_seq), names_from = val, values_from = value) %>%
  arrange(name, row_seq)
#> # A tibble: 23 x 12
#> name row_seq value1 value2 value3 value4 value5 value6 value7 value8 value9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 am 1 1 0 NA NA NA NA NA NA NA
#> 2 carb 1 4 1 2 3 6 8 NA NA NA
#> 3 cyl 1 6 4 8 NA NA NA NA NA NA
#> 4 disp 1 160 108 258 360 225 147. 141. 168. 276.
#> 5 disp 2 460 440 78.7 75.7 71.1 120. 318 304 350
#> 6 disp 3 79 120. 95.1 351 145 301 121 NA NA
#> 7 drat 1 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.07
#> 8 drat 2 3 3.23 4.08 4.93 4.22 3.7 3.73 4.43 3.77
#> 9 drat 3 3.54 4.11 NA NA NA NA NA NA NA
#> 10 gear 1 4 3 5 NA NA NA NA NA NA
#> # ... with 13 more rows, and 1 more variable: value10 <dbl>
Created on 2021-05-20 by the reprex package (v2.0.0)
You could do it like this:
library(tidyverse)
N_COLS <- 10
imap_dfr(mtcars,
~enframe(.x, name = NULL, value = "value") %>%
distinct() %>%
#arrange(value) %>%
mutate(Prop = .y,
colid = rep(seq(N_COLS), length.out = n()),
rowid = cumsum(colid - lag(colid, default = 0) < 0)) ) %>%
pivot_wider(values_from = value, names_from = colid, names_prefix = "Value") %>%
select(-rowid)
## A tibble: 23 x 11
#Prop Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8 Value9 Value10
#<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 mpg 21 22.8 21.4 18.7 18.1 14.3 24.4 19.2 17.8 16.4
#2 mpg 17.3 15.2 10.4 14.7 32.4 30.4 33.9 21.5 15.5 13.3
#3 mpg 27.3 26 15.8 19.7 15 NA NA NA NA NA
#4 cyl 6 4 8 NA NA NA NA NA NA NA
#5 disp 160 108 258 360 225 147. 141. 168. 276. 472
#6 disp 460 440 78.7 75.7 71.1 120. 318 304 350 400
#7 disp 79 120. 95.1 351 145 301 121 NA NA NA
#8 hp 110 93 175 105 245 62 95 123 180 205
#9 hp 215 230 66 52 65 97 150 91 113 264
#10 hp 335 109 NA NA NA NA NA NA NA NA
## ... with 13 more rows

How to get function to work where dplyr is being used?

I get an error when trying to call my function where dplyr is used inside the function. Does dplyr not work inside R functions?
all_df_yoy <- function(all_df, units) {
all_df_yoy <- all_df %>% mutate(
players_units_yoy = units)
}
us_players_all_df_yoy <- all_df_yoy(us_players_all_df, players_units_us)
I get the following error.
Error in compat_lazy_dots(.dots, caller_env(), ..., .named = TRUE) :
object 'players_units_us' not found
However, players_units_us does indeed exist inside the data frame.
Without a minimal reproducible example it's impossible to answer this question exactly, but you need tidy evaluation to write functions that accept bare column names the way dplyr's own verbs do. Here is a brief example of what you have to do:
library(tidyverse)
create_new_col <- function(df, units) {
units <- enquo(units)
df %>%
mutate(players_units_yoy = !!units)
}
mtcars %>%
create_new_col(cyl)
#> mpg cyl disp hp drat wt qsec vs am gear carb players_units_yoy
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 6
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 6
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 6
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 8
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 6
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 6
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 8
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 8
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 8
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 8
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 8
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 8
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 4
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 8
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 8
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 8
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 8
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 4
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 4
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 8
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 6
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 8
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4
Created on 2019-05-02 by the reprex package (v0.2.1)
You can read more on this here: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
If you are new to programming in R, realize that this is a hurdle most users hit when they begin to develop their own functions and packages. Don't worry if it doesn't click at first; become more familiar with R (try writing your functions using base R) and then come back to this topic.
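The enquo() and !! pair shown above can also be written with the embrace operator {{ }}, which does both steps at once. This is a sketch that assumes rlang 0.4.0 or later and a correspondingly recent dplyr; the function name is reused from the answer only for comparison.

```r
library(dplyr)

# {{ units }} captures the bare column name passed by the caller
# and injects it into mutate()
create_new_col <- function(df, units) {
  df %>%
    mutate(players_units_yoy = {{ units }})
}

out <- create_new_col(mtcars, cyl)
head(out)
```

Unlike the function in the question, this version returns the result of the pipe visibly, because the pipe is the last expression in the body and is not assigned to a local variable.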

Dplyr group_by and arrange functions together doesn't group same values together

I am using the built-in mtcars dataset. My code is as follows:
data("mtcars")
a <- mtcars %>%
group_by(cyl) %>%
arrange(hp)
The output that I get:
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
7 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
8 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
9 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
10 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
11 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
12 21 6 160 110 3.9 2.62 16.5 0 1 4 4
13 21 6 160 110 3.9 2.88 17.0 0 1 4 4
14 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
15 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
16 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
17 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
18 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
19 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
20 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
21 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
22 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
23 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
24 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
25 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
26 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
27 10.4 8 460 215 3 5.42 17.8 0 0 3 4
28 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
29 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
30 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
31 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
32 15 8 301 335 3.54 3.57 14.6 0 1 5 8
As you can see group_by is redundant in this output. I only get my data arranged by "hp" column. I don't understand what I am doing wrong. I want to see everything grouped by "cyl" column and then arranged by "hp".
Grouping isn't really related to sorting. Also, group_by isn't redundant (in the sense of being absolutely ignored) as the second line of the output is
# Groups: cyl [3]
To see that group_by doesn't do sorting, just try
mtcars %>% group_by(cyl) %>% print(n = Inf)
Hence, what you want is first to arrange by cyl and then by hp:
mtcars %>% arrange(cyl, hp)
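If the grouping is needed for later steps, arrange() can also be told to sort by the grouping variables first through its .by_group argument. The sketch below compares the two approaches; sorted_plain and sorted_grouped are illustrative names.

```r
library(dplyr)

# Sort by cyl, then by hp within each cyl value
sorted_plain <- mtcars %>%
  arrange(cyl, hp)

# Same ordering, but keeping the grouping while sorting
sorted_grouped <- mtcars %>%
  group_by(cyl) %>%
  arrange(hp, .by_group = TRUE) %>%
  ungroup()

head(sorted_plain[c("cyl", "hp")])
```

Without .by_group = TRUE, arrange() ignores the groups, which is why the output in the question is ordered by hp only.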

How can I unquote-splice in mutate_at?

I want to parse_factor then fct_recode several variables in a dataframe. The levels (and their recode values) are stored in named strings.
How can I use those to implement what I want?
Note that in my case, I cannot simply use mutate, because I have several variables to which I want to apply the recoding.
Below is an example of what I thought would work (but does not).
library(tidyverse)
#> ── Attaching packages ────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.4
#> ✔ tidyr 0.8.0 ✔ stringr 1.3.0
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ───────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
gear_levels <- c("tri" = 3, "quad" = 4, "six" = 6, `NA` = 8)
mtcars %>%
mutate_at("gear", parse_factor, levels = gear_levels) %>%
mutate_at("gear", fct_recode, !!! gear_levels)
#> Warning: 5 parsing failures.
#> row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 27 NA value in level set 5 row 2 28 NA value in level set 5 col 3 29 NA value in level set 5 expected 4 30 NA value in level set 5 actual 5 31 NA value in level set 5
#> Error: Can't use `!!!` on atomic vectors in non-quoting functions
As per lionel's comment, this is what coercing to list looks like. Note that you need to supply a character vector to fct_recode and that you have to replace the names after as.character. I'm not sure exactly how your desired levels are stored.
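Also, your supplied levels don't match those in mtcars$gear (which takes the values 3, 4 and 5), in case you didn't realise; that is why the five rows with gear 5 fail to parse and end up as NA.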
library(tidyverse)
gear_levels <- c("tri" = 3, "quad" = 4, "six" = 6, `NA` = 8)
gear_recode <- as.list(as.character(gear_levels))
names(gear_recode) <- names(gear_levels)
mtcars %>%
mutate_at(vars(gear), parse_factor, levels = gear_levels) %>%
mutate_at(vars(gear), fct_recode, !!! gear_recode)
#> Warning: 5 parsing failures.
#> row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 27 NA value in level set 5 row 2 28 NA value in level set 5 col 3 29 NA value in level set 5 expected 4 30 NA value in level set 5 actual 5 31 NA value in level set 5
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 quad 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 quad 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 quad 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 tri 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 tri 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 tri 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 tri 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 quad 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 quad 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 quad 4
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 quad 4
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 tri 3
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 tri 3
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 tri 3
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 tri 4
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 tri 4
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 tri 4
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 quad 1
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 quad 2
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 quad 1
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 tri 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 tri 2
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 tri 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 tri 4
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 tri 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 quad 1
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 <NA> 2
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 <NA> 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 <NA> 4
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 <NA> 6
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 <NA> 8
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 quad 2
Created on 2018-03-16 by the reprex package (v0.2.0).
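With more recent versions of rlang and forcats, a named character vector can be spliced into fct_recode() directly, so the as.list() step may no longer be needed. The sketch below makes several assumptions that are not in the original: the levels are changed to 3, 4 and 5 so they match mtcars$gear, the label "five" and the helper recode_gear are invented for illustration, and across() (dplyr 1.0.0 or later) stands in for mutate_at().

```r
library(dplyr)
library(forcats)

# Assumed levels that match the values actually present in mtcars$gear
gear_levels <- c(tri = 3, quad = 4, five = 5)

# fct_recode() expects new_name = "old_level" pairs as character
gear_recode <- setNames(as.character(gear_levels), names(gear_levels))

# Hypothetical helper: build the factor, then rename its levels
recode_gear <- function(x) {
  f <- factor(x, levels = gear_levels)
  fct_recode(f, !!!gear_recode)
}

out <- mtcars %>%
  mutate(across(all_of("gear"), recode_gear))
levels(out$gear)
```

For this particular case, factor(x, levels = gear_levels, labels = names(gear_levels)) gives the same result without any splicing.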
