tidyverse function to `mutate_sample`?

tidyverse function to `mutate_sample`? - r

I'm looking to mutate a column for a random sample, e.g., mutate_sample. Does anyone know whether there is a dplyr/other tidyverse verb for this? Below is a reprex for the behavior I am looking for and an attempt to functionalize (which isn't running because I'm struggling with quasiquotation in if_else).
library(dplyr)
library(tibble)
library(rlang)
# Setup -------------------------------------------------------------------
group_size <- 10
group_n <- 1
my_cars <-
mtcars %>%
rownames_to_column(var = "model") %>%
mutate(group = NA_real_, .after = model)
# Code to create mutated sample -------------------------------------------
group_sample <-
my_cars %>%
filter(is.na(group)) %>%
slice_sample(n = group_size) %>%
pull(model)
my_cars %>%
mutate(group = if_else(model %in% group_sample, group_n, group)) %>%
head()
#> model group mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 NA 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive NA 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout NA 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant NA 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Function to create mutated sample ---------------------------------------
#
# Note: doesn't run because of var in if_else
# mutate_sample <- function(data, var, id, n, value) {
# # browser()
# sample <-
# data %>%
# filter(is.na({{var}})) %>%
# slice_sample(n = n) %>%
# pull({{id}})
#
# data %>%
# mutate(var = if_else({{id}} %in% sample, value, {{var}}))
# }
#
# mutate_sample(my_cars, group, model, group_size, group_n)
Created on 2020-10-21 by the reprex package (v0.3.0)
Looking through SO, I found this related post:
Mutate column as input to sample

I think you could achieve your goal with this two options.
With dplyr:
mtcars %>% mutate(group = sample(`length<-`(rep(group_n, group_size), n())))
or with base R:
mtcars[sample(nrow(mtcars), group_size), "group"] <- group_n
If you need an external function to handle it, you could go with:
mutate_sample <- function(.data, .var, .size, .value) {
mutate(.data, {{.var}} := sample(`length<-`(rep(.value, .size), n())))
}
mtcars %>% mutate_sample(group, group_size, group_n)
or
mutate_sample_rbase <- function(.data, .var, .size, .value) {
.data[sample(nrow(.data), size = min(.size, nrow(.data))),
deparse(substitute(.var))] <- .value
.data
}
mtcars %>% mutate_sample(group, group_size, group_n)
Note that if .size is bigger than the number of rows of .data, .var will be a constant equal to .value.
EDIT
If you're interested in keeping the old group, I suggest you another way to handle the problem:
library(dplyr)
# to understand this check out ?sample
resample <- function(x, ...){
x[sample.int(length(x), ...)]
}
# this is to avoid any error in case you choose a size bigger than the available rows to select in one group
resample_max <- function (x, size) {
resample(x, size = min(size, length(x)))
}
mutate_sample <- function(.data, .var, .size, .value) {
# creare column if it doesnt exist
if(! deparse(substitute(.var)) %in% names(.data)) .data <- mutate(.data, {{.var}} := NA)
# replace missing values randomly keeping existing non-missing values
mutate(.data, {{.var}} := replace({{.var}}, resample_max(which(is.na({{.var}})), .size), .value))
}
group_size <- 10
mtcars %>%
mutate_sample(group, group_size, 1) %>%
mutate_sample(group, group_size, 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb group
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 2
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 2
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 NA
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 2
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 1
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 1
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 1
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 2
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 1
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
Notice that this solution works even with grouped_df class (what you get after a dplyr::group_by): from each group [made by dplyr::group_by] a sample of .size units will be selected.
mtcars %>%
group_by(am) %>%
mutate_sample(group, 10, 1) %>%
ungroup() %>%
count(group)
#> # A tibble: 2 x 2
#> group n
#> <dbl> <int>
#> 1 1 20 # two groups, each with 10!
#> 2 NA 12

Related

apply function to dataframe's chosen column whilst grouping by another chosen column

I would like to apply the below function (cut.at.n.tile) to a data frame (some_data_frame) whilst grouping by a chosen column (e.g. SomeGroupingColumn) and choosing the target column (e.g. ChosenColumn). I tried using sapply() without success - see code below. Any input very much appreciated. Apologies for this not being fully replicable/self contained ...
cut.at.n.tile <- function(X, n = 7) {
cut(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE),
labels = 1:n, include.lowest = TRUE)
}
some_data_frame$SeasonTypeNumber = sapply(split(some_data_frame['ChosenColumn'], SomeGroupingColumn), cut.at.n.tile)

There are a few problems here.
some_data_frame['ChosenColumn'] always returns a single-column data.frame, not a vector which your function requires. I suggest switching to some_data_frame[['ChosenColumn']].
SomeGroupingColumn looks like it should be a column (hence the name) in the data, but it is not referenced within a frame. Perhaps some_data_frame[['SomeGroupingColumn']].
You need to ensure that the breaks= used are unique. For example,
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# Error in cut.default(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE), :
# 'breaks' are not unique
If we debug that function, we see
X
# [1] 360.0 360.0 275.8 275.8 275.8 472.0 460.0 440.0 318.0 304.0 350.0 400.0 351.0 301.0
quantile(X, probs = (0:n)/n, na.rm = TRUE)
# 0% 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 100%
# 275.8000 275.8000 303.1429 336.2857 354.8571 371.4286 442.8571 472.0000
where 275.8 is repeated. This can happen based on nuances in the raw data, and you can't really predict when it will occur.
Since we'll likely have multiple groups, all of the subvectors' levels= (since cut returns a factor) must be the same length, though admittedly 1 in one group is unlikely to be the same as 1 in another group.
Since in this case we can never be certain which n-tile a number strictly applies (in 275.8 in the first or second n-tile?), we can only adjust one of the dupes and accept the imperfection. I suggest a cumsum(duplicated(.)*1e-9): the premise is that it adds an iota to each value that is a dupe, rendering it no-longer a dupe. It is possible that adding 1e-9 to one value will make it a dupe of the next ... so we can be a little OCD by repeatedly doing this until we have no duplicates.
sapply is unlikely to return a vector, much (almost "certainly") more likely to return a list (if the groups are not perfectly balanced) or a matrix (perfectly balanced). We cannot simply unlist, since the order of the unlisted vectors will likely not be the order of the source data.
We can use `split<-`, or we can use a few other techniques (dplyr and/or data.table)
Updated function, and demonstration with mtcars:
cut.at.n.tile <- function(X, n = 7) {
brks <- quantile(X, probs = (0:n)/n, na.rm = TRUE)
while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9*dupes)
cut(X, breaks = brks, labels = 1:n, include.lowest = TRUE)
}
base R
ret <- lapply(split(mtcars[['disp']], mtcars[['cyl']]), cut.at.n.tile)
mtcars[["newcol"]] <- NA # create an empty column
split(mtcars[['newcol']], mtcars[['cyl']]) <- ret
mtcars
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 4
# Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 1
# Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
# Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 1
# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 7
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 7
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 6
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 5
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 3
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
Validation:
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
# Levels: 1 2 3 4 5 6 7
subset(mtcars, cyl == 8)$newcol
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(newcol = cut.at.n.tile(disp)) %>%
ungroup()
# # A tibble: 32 × 12
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 5
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 6
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 5
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 7
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 7
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 4
# # … with 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
data.table
library(data.table)
as.data.table(mtcars)[, newcol := cut.at.n.tile(disp), by = .(cyl)][]
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <fctr>
# 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# 5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# 7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# 8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# 9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# 10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# ---
# 23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# 24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# 25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# 26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# 27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# 28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# 29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# 30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# 31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# 32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6

Correct way of selecting the jth row while using summarise after group_by

My query is: May I use summarise after group_by like this:
mydataset %>%
arrange(grouping_variable,ordering_variable) %>%
group_by(grouping_variable) %>% summarise(answer = another_variable[j])
I expect to see the jth ranked row in each group when I do the above.
I think the above is correct but not mentioned in the documentation.
I ran the following experiment to determine this.
Here is the whole data set:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% as.data.frame
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
2 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
5 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
6 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
7 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
8 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
9 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
10 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
11 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
12 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
13 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
15 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
16 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
17 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
18 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
19 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
20 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
21 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
22 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
23 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
24 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
25 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
26 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
27 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
28 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
30 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
31 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
32 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Here is the first row (j=1) in each group:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = mpg[1])
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.4
2 6 17.8
3 8 10.4
Here is the second row in each group (j=2):
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = mpg[2])
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.5
2 6 18.1
3 8 10.4
>
On the help page, it says that way to do this is as follows:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = first(mpg))
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.4
2 6 17.8
3 8 10.4
>
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = nth(mpg,2))
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.5
2 6 18.1
3 8 10.4
>
I don't remember which website I read this on. Since it was not mentioned on help page that I why I am asking here.

Does dplyr `is_grouped_df()` actually require a date frame (vs a data table, tibble, etc.)?

Does the dplyr function is_grouped_df() actually require the input to be a date frame (vs a data table, tibble, etc.)? If it does not require a data frame why isn't it named is_grouped instead of is_grouped_df?
#1 - mtcars with multiple classes
mtcars %>% group_by(cyl) %>% is_grouped_df()
#> [1] TRUE
mtcars %>% group_by(cyl) %>% class()
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
I can group a multiple class mtcars data set, and confirm with the is_grouped_df() function that the data set is grouped.
#2 - mtcars as a tibble
mtcars %>% group_by(cyl) %>% as_tibble() %>% is_grouped_df()
#> [1] FALSE
mtcars %>% group_by(cyl) %>% as_tibble() %>% class()
#> [1] "tbl_df" "tbl" "data.frame"
I can try to force mtcars to be a tibble and notice that when I check if it is a is_grouped_df I get FALSE as the answer. Even though that doesn't seem to be the case. I never called the ungroup() function in my pipe after grouping. Why FALSE?
#3 - mtcars as a data frame (an attempt to return to #1)
mtcars %>% group_by(cyl) %>% as_tibble() %>% as.data.frame() %>% is_grouped_df()
#> [1] FALSE
mtcars %>% group_by(cyl) %>% as_tibble() %>% as.data.frame() %>% class()
#> [1] "data.frame"
I can try to force mtcars to be a data frame and notice that when I check if it is a is_grouped_df I get FALSE as the answer. Even though that doesn't seem to be the case. I never called the ungroup() function in my pipe after grouping. Why FALSE?
And now I circle back to the original question, "Does the dplyr function is_grouped_df() actually require the input to be a date frame (vs a data table, tibble, etc.)?". And why all the inconsistencies in my three examples above?

As soon as you add the as_tibble or as.data.frame functions, the groups that were created by the group_by function are deleted.
You can't create groups on a data frame. The data frame is converted to a tibble as soon as you use group_by
class(mtcars)
[1] "data.frame"
mtcars %>%
group_by(cyl) %>%
class()
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
You can see how the data frame gets converted into a tibble by using group_by
mtcars %>%
group_by(cyl)
# A tibble: 32 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
But as soon as you call as_tibble again, the groups dissapear.
mtcars %>%
group_by(cyl) %>%
as_tibble()
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
Same thing happens if you call as.data.frame after using group_by
mtcars %>%
group_by(cyl) %>%
as.data.frame()
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
So basically, is_grouped_df works as intended, only detecting tibbles that have groups. In this case, it's important to note that as_tibble will effectively reset the groups that have already been created, so it ends up acting as ungroup if called on a tibble.

Why does the base R `print()` function require a tibble?

Why does the base R print() function require a tibble when using the n = X argument? It appears this is true from the examples below.
This does NOT work
library(tidyverse)
mtcars %>% print(n = 20)
#> Error in print.default(m, ..., quote = quote, right = right, max = max) :
#> invalid 'na.print' specification
This does work
mtcars %>% as_tibble() %>% print(n = 20)
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 11 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 12 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
#> 13 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
#> 14 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
#> 15 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
#> 16 10.4 8 460 215 3 5.42 17.8 0 0 3 4
#> 17 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
#> 18 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 19 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 20 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> # ... with 12 more rows

Your first example is equivalent to print(mtcars, n=20) -- which also fails.
Because mtcars is a data.frame your call dispatches on print.data.frame. And as args(print.data.frame) will tell you, there is no n= argument in it.
In short, you got confused between a specific dispatch (I presume print.tbl) with a more generic approach.
So a better title for the question might be 'Why does only the print method for tibbles have a n argument' -- for general use we commonly just invoke head as in
R> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
R>
which of course also works in a pipelined expression.

Using the pdf_data function from the pdftools package efficiently

The end goal is to use the pdftools package to efficiently move through a thousand pages of pdf documents to consistently, and safely, produce a useable dataframe/tibble. I have attempted to use the tabulizer package, and pdf_text functions, but the results were inconsistent. Therefore, started working through the pdf_data() function, which I prefer.
For those unfamiliar with the pdf_data function, it converts a pdf page into a coordinate grid, with the 0,0 coordinate being in the upper-left corner of the page. Therefore, by arranging the x,y coordinates, then pivoting the document into a wide format, all of the information is displayed as it would on the page, only with NAs for whitespaces
Here is a simple example using the familiar mtcars dataset.
library(pdftools)
library(tidyverse)
library(janitor)
pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
mtcars_pdf_df <- pdf_data(pdf_file)[[1]]
mtcars_pdf_df%>%
arrange(x, y)%>%
pivot_wider(id_cols = y, names_from = x, values_from = text)%>%
unite(col = Car_type, `154`:`215`, sep = " ", remove = TRUE, na.rm = TRUE)%>%
arrange(y)%>%
rename("Page Number" = `303`)%>%
unite(col = mpg, `253`:`254`, sep = "", remove = TRUE, na.rm = TRUE)%>%
unite(col = cyl, `283` : `291` , sep = "", remove = TRUE, na.rm = TRUE)%>%
unite(col = disp, `308` : `313`, sep = "", remove = TRUE, na.rm = TRUE)
It would be nice to not use a dozen or so unite functions in order to rename the various columns. I used the janitor package row_to_names() function at one point to convert row 1 to column names, which worked well but maybe someone has a better thought?
The central problem; removing the NAs from the dataset through uniting multiple columns, or shifting columns over so that NAs are filled by adjacent columns.
I'm trying to make this efficient. Possible using the purrr package? any help with making this process more efficient would be very appreciated.
The only information I had on the pdf_data() function going into this is from here...
https://ropensci.org/technotes/2018/12/14/pdftools-20/
Any additional resources would also be greatly appreciated (apart from the pdftools package help documentation/literature).
Thanks everyone! I hope this also helps others use the pdf_data() too :)

Here is one approach that could perhaps be generalised if you know the PDF is a reasonably neat table...
library(pdftools)
library(tidyverse)
pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
df <- pdf_data(pdf_file)[[1]]
df <- df %>% mutate(x = round(x/3), #reduce resolution to minimise inconsistent coordinates
y = round(y/3)) %>%
arrange(y, x) %>% #sort in reading order
mutate(group = cumsum(!lag(space, default = 0))) %>% #identify text with spaces and paste
group_by(group) %>%
summarise(x = first(x),
y = first(y),
text = paste(text, collapse = " ")) %>%
group_by(y) %>%
mutate(colno = row_number()) %>% #add column numbers for table data
ungroup() %>%
select(text, colno, y) %>%
pivot_wider(names_from = colno, values_from = text) %>% #pivot into table format
select(-y) %>%
set_names(c("car", .[1,-ncol(.)])) %>% #shift names from first row
slice(-1, -nrow(.)) %>% #remove names row and page number row
mutate_at(-1, as.numeric)
df
# A tibble: 32 x 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows

I'll present a partial solution here, but please allow me to give you some background information first.
I am currently writing a pdf text / table extraction package from scratch in C++ with R bindings, which has required many months and many thousands of lines of code. I started writing it pretty much to do what you are looking to do: reliably extract tabular data from pdfs. I have got it to the point where it can quickly and reliably extract the text from a pdf document, with the associated positions and font of each text element (similar to pdftools).
I assumed that the technical part of reading the xrefs, handling encryption, writing a deflate decompressor, parsing the dictionaries, tokenizing and reading the page description programs would be the real challenges, and that figuring out a general algorithm to extract tabular data was just a detail I would figure out at the end.
Let me tell you, I'm stuck. I can assure you there is no simple, generalizable parsing function that you can write in a few lines of R to reliably extract tabular data from a pdf.
You have three options, as far as I can tell:
Stick to documents where you know the exact layout
Write a function with filter parameters that you can twiddle and check the results
Use a very complex / AI solution to get very good (though never perfect) reliability
For the pdf example you provided, something like the following works fairly well. It falls into the "twiddling parameters" category, and works by cutting the text into columns and rows based on the density function of the x and y co-ordinates of the text elements.
It could be refined a great deal to generalize it, but that would add a lot of complexity and would have to be tested on lots of documents
tabulize <- function(pdf_df, filter = 0.01)
{
xd <- density(pdf_df$x, filter)
yd <- density(pdf_df$y, filter)
pdf_df$col <- as.numeric(cut(pdf_df$x, c(xd$x[xd$y > .5] - 2, max(xd$x) + 3)))
pdf_df$row <- as.numeric(cut(pdf_df$y, c(yd$x[yd$y > .5] - 2, max(yd$x) + 3)))
pdf_df %<>% group_by(row, col) %>% summarise(label = paste(text, collapse = " "))
res <- matrix(rep("", max(pdf_df$col) * max(pdf_df$row)), nrow = max(pdf_df$row))
for(i in 1:nrow(pdf_df)) res[pdf_df$row[i], pdf_df$col[i]] <- pdf_df$label[i]
res <- res[which(apply(r, 1, paste, collapse = "") != ""), ]
res <- res[,which(apply(r, 2, paste, collapse = "") != "")]
as.data.frame(res[-1,])
}
which gives the following result:
tabulize(mtcars_pdf_df)
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 Volvo 142E 21.4 4 1 121.0 109 4.11 2.780 18.60 1 1 4 2

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

tidyverse function to `mutate_sample`? - r

Related

apply function to dataframe's chosen column whilst grouping by another chosen column

Correct way of selecting the jth row while using summarise after group_by

Does dplyr `is_grouped_df()` actually require a date frame (vs a data table, tibble, etc.)?

Why does the base R `print()` function require a tibble?

Using the pdf_data function from the pdftools package efficiently

Categories

Resources