I'm having trouble replacing values in a column of an R dataframe based on conditions involving other variables.
I've created a new dataframe called VAED1 from a left join between the original dataframe VAED (which has over 20 variables) and another dataframe called new_map (which has only 3 variables, one of which is called Category).
Here is the code I wrote, which works fine:
#join the left side table (VAED) with the right side table (new_map) with the left join function
VAED1 <- VAED %>%
  left_join(new_map, by = c("ID1" = "ID2"), suffix = c("_VAED", "_MAP"))
I then added three extra columns (nnate, NICU, enone) to the dataframe VAED1 using the mutate function to create a new dataframe VAED2:
VAED2 <- VAED1 %>%
  mutate(nnate = if_else((substr(W25VIC, 1, 1) == "P") & (CARE != "U") & (AGE < 1), "Y", "N")) %>%
  mutate(NICU = if_else((nnate == "Y") & (ICUH > 0), "Y", "N")) %>%
  mutate(enone = if_else(EMNL == "E", "Emerg", "Non-emerg"))
Everything works fine to this point.
Finally, I wanted to replace the values in one column called Category (a character variable that came from the joined dataset new_map) based on certain conditions of other variables in the dataframe: only change values in the Category column when the W25VIC and CARE variables equal certain values, and otherwise leave the original value.
I used the code:
Category <- if_else((W25VIC == "R03A") & (SAMEDAY == "Y"), "08 Other multiday", Category)
This always shows an error: objects 'W25VIC' and 'SAMEDAY' not found. It seems straightforward, but the last line of code doesn't work no matter what I do. I check the dataframe with head() at each step to make sure the data columns are there. They exist, but the code doesn't seem to recognise them.
Grateful for any ideas on what I am doing wrong.
I also used the command
Category[(W25VIC == "R03A") & (SAMEDAY == "Y")] <- "08 Other multiday"
Still same error message.
I think it is worth reading up on how the magrittr pipe works. The pipe takes an object from the left-hand side of an expression and moves it as the first argument into a function on the right.
So x %>% f() becomes f(x) and x %>% f(y) becomes f(x, y). In your last statement
Category <- if_else((W25VIC == "R03A") & (SAMEDAY == "Y"), "08 Other multiday", Category)
the x and the function that should receive the result of the if_else statement are missing. Here is an example of how to use the pipe operator together with an if_else statement to generate a new column:
library(tidyverse)
data <- mtcars
new_data <- data %>% mutate( evaluation = if_else(hp > 150, "awesome", "lame"))
head(new_data, 20)
#> mpg cyl disp hp drat wt qsec vs am gear carb evaluation
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 lame
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 lame
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 lame
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 lame
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 awesome
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 lame
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 awesome
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 lame
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 lame
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 lame
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 lame
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 awesome
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 awesome
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 awesome
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 awesome
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 awesome
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 awesome
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 lame
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 lame
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 lame
Created on 2021-01-07 by the reprex package (v0.3.0)
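Applied to your data, the last step would then look something like this (a sketch, using the column names from your post):
VAED2 <- VAED2 %>%
  mutate(Category = if_else((W25VIC == "R03A") & (SAMEDAY == "Y"),
                            "08 Other multiday",
                            Category))
Inside mutate() the column names are evaluated within the dataframe, which is why the standalone if_else() call could not find W25VIC and SAMEDAY.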
I found somewhat similar examples here and here, but I didn't follow the examples for the problem I am trying to solve.
What I would like to do is use mutate and case_when to create a new column. The new column would hold a category classification (e.g., "category_1") depending on the values from a different column. Since the number of values may change, I want to make the case_when dynamic.
The problem is that each iteration of the loop works fine on its own, but when the loop advances it overwrites the values assigned in previous iterations. So I am wondering how to use case_when in a loop without each new iteration overwriting the results of the earlier ones.
Here is a reproducible example:
library(tidyverse)
# Use built-in data frame for reproducible example
my_df <- mtcars
# Create sequence to reference beginning and end ranges within mpg values
mpg_vals <- sort(mtcars$mpg)
beg_seq <- seq(1, 31, 4)
end_seq <- seq(4, 32, 4)
# Create loop to fill in mpg category
for (i in 1:8) {
  my_df <- my_df %>%
    mutate(mpg_class = case_when(
      mpg %in% mpg_vals[beg_seq[i]:end_seq[i]] ~ paste0("category", i)
    ))
  # Observe loop values
  print(mpg_vals[beg_seq[i]:end_seq[i]])
  print(paste0("category_", i))
}
Edit:
If I understand the question right, you want every fourth ranking of mpg to get a new category. You might use:
my_df %>%
mutate(mpg_class = paste("category", 1 + min_rank(mpg) %/% 4))
That produces:
mpg cyl disp hp drat wt qsec vs am gear carb mpg_class
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 category 5
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 category 5
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 category 7
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 category 6
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 category 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 category 4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 category 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 category 7
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 category 7
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 category 5
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 category 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 category 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 category 4
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 category 2
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 category 1
...
Original answer: A looped case_when seems complicated when you could do:
lengths <- end_seq - beg_seq + 1
my_df$mpg_class <- rep(paste0("category", 1:length(lengths)), lengths)
This finds the length of each category. Then we make a vector that repeats each category name as many times as the length of the category and assign that to an mpg_class column.
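If you do want to keep the loop from your question, one sketch (untested, using the objects from your reprex) is to wrap the case_when() in coalesce(), so values assigned in earlier iterations are kept instead of being overwritten with NA:
my_df <- mtcars %>% mutate(mpg_class = NA_character_)
for (i in 1:8) {
  my_df <- my_df %>%
    mutate(mpg_class = coalesce(
      mpg_class,   # keep values assigned in earlier iterations
      case_when(mpg %in% mpg_vals[beg_seq[i]:end_seq[i]] ~ paste0("category", i))
    ))
}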
I am comparing two alternative strategies to estimate linear regression models on subsets of data using the data.table package for R. The two strategies produce the same coefficients, so they appear equivalent. This appearance is deceiving. My question is:
Why is the data stored inside the lm models different?
library(data.table)
dat = data.table(mtcars)
# strategy 1
mod1 = dat[, .(models = .(lm(hp ~ mpg, data = .SD))), by = vs]
# strategy 2
mod2 = dat[, .(data = .(.SD)), by = vs][
, models := lapply(data, function(x) lm(hp ~ mpg, x))]
At first glance, the two approaches seem to produce identical results:
# strategy 1
coef(mod1$models[[1]])
#> (Intercept) mpg
#> 357.97866 -10.12576
# strategy 2
coef(mod2$models[[1]])
#> (Intercept) mpg
#> 357.97866 -10.12576
However, if I try to extract data from the (expanded) model.frame, I get different results:
# strategy 1
expanded_frame1 = expand.model.frame(mod1$models[[1]], "am")
table(expanded_frame1$am)
#>
#> 0 1
#> 7 11
# strategy 2
expanded_frame2 = expand.model.frame(mod2$models[[1]], "am")
table(expanded_frame2$am)
#>
#> 0 1
#> 12 6
This is a trivial minimal working example. My real use case is that I obtained radically different results when applying sandwich::vcovCL to compute clustered standard errors for my models.
Edit:
I'm accepting the answer by @TimTeaFan (excellent detective work!) but adding a bit of useful info here for future readers.
As @achim-zeileis pointed out elsewhere, we can replicate similar behavior in the global environment:
d <- subset(mtcars, subset = vs == 0)
m0 <- lm(hp ~ mpg, data = d)
d <- mtcars[0, ]
expand.model.frame(m0, "am")
[1] hp mpg am
<0 rows> (or 0-length row.names)
This does not appear to be a data.table-specific issue. And in general, we have to be careful when re-evaluating the data from a model.
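If I remember the sandwich internals correctly, a cluster formula is resolved through the same expand.model.frame mechanism, which would explain why the two strategies gave such different clustered standard errors. A sketch, using am as a stand-in clustering variable:
library(sandwich)
# The cluster formula ~ am is looked up against the data stored with the model,
# so it silently picks up whatever the formula environment happens to contain.
vcovCL(mod1$models[[1]], cluster = ~ am)
vcovCL(mod2$models[[1]], cluster = ~ am)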
I don't have a complete answer, but I was able to pinpoint the problem to some extent.
When we compare the output of the two models, we can see that the result is equal except for the calls, which are different (which makes sense, since they actually are different):
# compare models
purrr::map2(mod1$models[[1]], mod2$models[[1]], all.equal)
#> $coefficients
#> [1] TRUE
#>
#> $residuals
#> [1] TRUE
#>
#> $effects
#> [1] TRUE
#>
#> $rank
#> [1] TRUE
#>
#> $fitted.values
#> [1] TRUE
#>
#> $assign
#> [1] TRUE
#>
#> $qr
#> [1] TRUE
#>
#> $df.residual
#> [1] TRUE
#>
#> $xlevels
#> [1] TRUE
#>
#> $call
#> [1] "target, current do not match when deparsed"
#>
#> $terms
#> [1] TRUE
#>
#> $model
#> [1] TRUE
So it seems that the initial call works correctly with both approaches; the problem arises once we try to access the underlying data.
If we have a look at how expand.model.frame gets its data, we can see that it calls eval(model$call$data, envir), where envir is defined as environment(formula(model)), the environment associated with the formula of the lm object.
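Both pieces can be inspected directly (a short sketch):
# The data argument stored in each call: the symbol .SD for strategy 1,
# the symbol x for strategy 2
mod1$models[[1]]$call$data
mod2$models[[1]]$call$data
# The environment in which that symbol is later re-evaluated
environment(formula(mod1$models[[1]]))
environment(formula(mod2$models[[1]]))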
If we have a look at the data in the associated environment of each model and compare it with the data we expect it to hold, we can see that the second approach yields the data we expect, while the first approach using .SD in the call yields some different data.
It is still not clear to me why this happens, but we now know the problem lies in the call that uses .SD. I first thought it might be caused by naming a data.table .SD, but after playing around with models where the data is a data.table called .SD, this does not seem to be the issue.
# data of model 2 (identical to subsetted mtcars)
environment(formula(mod2$models[[1]]))$x[order(mpg),]
#> mpg cyl disp hp drat wt qsec am gear carb
#> 1: 10.4 8 472.0 205 2.93 5.250 17.98 0 3 4
#> 2: 10.4 8 460.0 215 3.00 5.424 17.82 0 3 4
#> 3: 13.3 8 350.0 245 3.73 3.840 15.41 0 3 4
#> 4: 14.3 8 360.0 245 3.21 3.570 15.84 0 3 4
#> 5: 14.7 8 440.0 230 3.23 5.345 17.42 0 3 4
#> 6: 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
#> 7: 15.2 8 275.8 180 3.07 3.780 18.00 0 3 3
#> 8: 15.2 8 304.0 150 3.15 3.435 17.30 0 3 2
#> 9: 15.5 8 318.0 150 2.76 3.520 16.87 0 3 2
#> 10: 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
#> 11: 16.4 8 275.8 180 3.07 4.070 17.40 0 3 3
#> 12: 17.3 8 275.8 180 3.07 3.730 17.60 0 3 3
#> 13: 18.7 8 360.0 175 3.15 3.440 17.02 0 3 2
#> 14: 19.2 8 400.0 175 3.08 3.845 17.05 0 3 2
#> 15: 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
#> 16: 21.0 6 160.0 110 3.90 2.620 16.46 1 4 4
#> 17: 21.0 6 160.0 110 3.90 2.875 17.02 1 4 4
#> 18: 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
# subset and order mtcars data
mtcars_vs0 <- subset(mtcars, vs == 0)
mtcars_vs0[order(mtcars_vs0$mpg), ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# data of model 1 (not identical to mtcars)
environment(formula(mod1$models[[1]]))$.SD[order(mpg),]
#> mpg cyl disp hp drat wt qsec am gear carb
#> 1: 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
#> 2: 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
#> 3: 17.8 6 167.6 123 3.92 3.440 18.90 0 4 4
#> 4: 18.1 6 225.0 105 2.76 3.460 20.22 0 3 1
#> 5: 19.2 6 167.6 123 3.92 3.440 18.30 0 4 4
#> 6: 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
#> 7: 21.4 6 258.0 110 3.08 3.215 19.44 0 3 1
#> 8: 21.4 4 121.0 109 4.11 2.780 18.60 1 4 2
#> 9: 21.5 4 120.1 97 3.70 2.465 20.01 0 3 1
#> 10: 22.8 4 108.0 93 3.85 2.320 18.61 1 4 1
#> 11: 22.8 4 140.8 95 3.92 3.150 22.90 0 4 2
#> 12: 24.4 4 146.7 62 3.69 3.190 20.00 0 4 2
#> 13: 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
#> 14: 27.3 4 79.0 66 4.08 1.935 18.90 1 4 1
#> 15: 30.4 4 75.7 52 4.93 1.615 18.52 1 4 2
#> 16: 30.4 4 95.1 113 3.77 1.513 16.90 1 5 2
#> 17: 32.4 4 78.7 66 4.08 2.200 19.47 1 4 1
#> 18: 33.9 4 71.1 65 4.22 1.835 19.90 1 4 1
Add on
I tried digging a little deeper to see what's going on. First I called debug(as.formula) and then looked at the following objects in each iteration:
object
ls(environment(object))
We can see that in "strategy 2" each formula is associated with a different environment, and when looking at that environment we see it contains one object x, which, when inspected (environment(object)$x), contains the expected mtcars data.
In "strategy 1", however, we can observe that each call to as.formula associates the same environment with the formula being created. Further, when inspecting this environment we can see that it is populated with the individual vectors of the subsetted mtcars data (e.g. am, carb, cyl, etc.) as well as some functions (e.g. .POSIXt, Cfastmean, strptime, etc.). This is probably where things go awry. I would suspect that when the same environment is associated with two different formulas (models), the first model's underlying data gets "updated" when the second model is calculated. This would also explain why the model output itself is correct: at the time the first model is calculated, the data is still correct; it is only overwritten when the second model is fitted, which is therefore correct too. But when accessing the underlying data afterwards, things get messy.
Side note
I was curious whether we can observe similar problems and differences in the tidyverse when using expand.model.frame, and the answer is "yes". Here, the new rowwise notation throws an error, while the group_map and map approaches work:
# dplyr approaches:
# group_map: works
mod3 <- mtcars %>%
group_by(vs) %>%
group_map(~ lm(hp ~ mpg, data = .x))
expand.model.frame(mod3[[1]], "am")
# mutate / rowwise: does not work
mod4 <- mtcars %>%
nest_by(vs) %>%
mutate(models = list(lm(hp ~ mpg, data = data)))
expand.model.frame(mod4$models[[1]], "am")
# mutate / map: works
mod5 <- mtcars %>%
tidyr::nest(data = !vs) %>%
mutate(models = purrr::map(data, ~ lm(hp ~ mpg, data = .x)))
expand.model.frame(mod5$models[[1]], "am")
I'm looking to mutate a column for a random sample, e.g., mutate_sample. Does anyone know whether there is a dplyr or other tidyverse verb for this? Below is a reprex for the behavior I am looking for and an attempt to functionalize it (which doesn't run because I'm struggling with quasiquotation in if_else).
library(dplyr)
library(tibble)
library(rlang)
# Setup -------------------------------------------------------------------
group_size <- 10
group_n <- 1
my_cars <-
mtcars %>%
rownames_to_column(var = "model") %>%
mutate(group = NA_real_, .after = model)
# Code to create mutated sample -------------------------------------------
group_sample <-
my_cars %>%
filter(is.na(group)) %>%
slice_sample(n = group_size) %>%
pull(model)
my_cars %>%
mutate(group = if_else(model %in% group_sample, group_n, group)) %>%
head()
#> model group mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 NA 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive NA 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout NA 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant NA 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Function to create mutated sample ---------------------------------------
#
# Note: doesn't run because of var in if_else
# mutate_sample <- function(data, var, id, n, value) {
# # browser()
# sample <-
# data %>%
# filter(is.na({{var}})) %>%
# slice_sample(n = n) %>%
# pull({{id}})
#
# data %>%
# mutate(var = if_else({{id}} %in% sample, value, {{var}}))
# }
#
# mutate_sample(my_cars, group, model, group_size, group_n)
Created on 2020-10-21 by the reprex package (v0.3.0)
Looking through SO, I found this related post:
Mutate column as input to sample
I think you could achieve your goal with these two options.
With dplyr:
mtcars %>% mutate(group = sample(`length<-`(rep(group_n, group_size), n())))
or with base R:
mtcars[sample(nrow(mtcars), group_size), "group"] <- group_n
If you need an external function to handle it, you could go with:
mutate_sample <- function(.data, .var, .size, .value) {
mutate(.data, {{.var}} := sample(`length<-`(rep(.value, .size), n())))
}
mtcars %>% mutate_sample(group, group_size, group_n)
or
mutate_sample_rbase <- function(.data, .var, .size, .value) {
  .data[sample(nrow(.data), size = min(.size, nrow(.data))),
        deparse(substitute(.var))] <- .value
  .data
}
mtcars %>% mutate_sample_rbase(group, group_size, group_n)
Note that if .size is bigger than the number of rows of .data, .var will be a constant equal to .value.
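In case the `length<-` call looks cryptic, here is the same idea written out step by step (a sketch, using nrow(mtcars) in place of n()):
x <- rep(group_n, group_size)   # the value repeated group_size times
length(x) <- nrow(mtcars)       # `length<-` pads x with NA up to the number of rows
mtcars$group <- sample(x)       # shuffling places the non-NA values on random rows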
EDIT
If you're interested in keeping the old group, I suggest another way to handle the problem:
library(dplyr)
# to understand this check out ?sample
resample <- function(x, ...) {
  x[sample.int(length(x), ...)]
}
# this is to avoid any error in case you choose a size bigger than the available rows to select in one group
resample_max <- function(x, size) {
  resample(x, size = min(size, length(x)))
}
mutate_sample <- function(.data, .var, .size, .value) {
  # create the column if it doesn't exist
  if (!deparse(substitute(.var)) %in% names(.data)) .data <- mutate(.data, {{.var}} := NA)
  # replace missing values randomly, keeping existing non-missing values
  mutate(.data, {{.var}} := replace({{.var}}, resample_max(which(is.na({{.var}})), .size), .value))
}
group_size <- 10
mtcars %>%
mutate_sample(group, group_size, 1) %>%
mutate_sample(group, group_size, 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb group
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 2
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 2
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 NA
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 2
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 1
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 1
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 1
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 2
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 1
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
Notice that this solution also works with the grouped_df class (what you get after dplyr::group_by()): a sample of .size units will be selected from each group.
mtcars %>%
group_by(am) %>%
mutate_sample(group, 10, 1) %>%
ungroup() %>%
count(group)
#> # A tibble: 2 x 2
#> group n
#> <dbl> <int>
#> 1 1 20 # two groups, each with 10!
#> 2 NA 12
I am trying to add a new column with character strings based on another column, via an ifelse statement in dplyr. When the condition is met, I also want the following two rows to show the same value.
I show an example from the mtcars dataset
mtcars %>%
  mutate(type = ifelse(mpg > 20, "Event", "No event")) %>%
  mutate(type = ifelse(type == "Event", lead(type), type))
What I am trying to do here is produce a new column called type: if mpg > 20, I want the row to state "Event", and if not, "No event". However, I also want the two rows following an mpg > 20 row to show "Event", even if they don't meet the criterion themselves.
Hope this makes sense
I am not sure I understand the problem correctly.
However, you can try to modify the logical expression inside if_else:
mtcars %>%
mutate(type = if_else(mpg > 20 | lag(mpg) > 20 | lag(mpg, n = 2) > 20, "Event", "No event"))
mpg type
1 21.0 Event
2 21.0 Event
3 22.8 Event
4 21.4 Event
5 18.7 Event
6 18.1 Event
7 14.3 No event
8 24.4 Event
9 22.8 Event
10 19.2 Event
11 17.8 Event
12 16.4 No event
13 17.3 No event
14 15.2 No event
15 10.4 No event
16 10.4 No event
17 14.7 No event
18 32.4 Event
For a general solution, you can use zoo's rolling functions. You can adjust the window size based on how far back you want to look.
library(dplyr)
library(zoo)
mtcars %>% mutate(type = rollapplyr(mpg > 20, 3, any, partial = TRUE))
# mpg cyl disp hp drat wt qsec vs am gear carb type
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 TRUE
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 TRUE
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 TRUE
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 TRUE
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 TRUE
#6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 TRUE
#7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 FALSE
#8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 TRUE
#9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 TRUE
#10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 TRUE
#11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 TRUE
#12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 FALSE
#13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 FALSE
#...
#...
You can then change it to "Event" / "No event" using ifelse:
mtcars %>%
mutate(type = ifelse(rollapplyr(mpg > 20, 3, any, partial = TRUE),
'Event', 'No event'))
Or without ifelse:
mtcars %>%
  mutate(type = c('No event', 'Event')[rollapplyr(mpg > 20, 3, any, partial = TRUE) + 1])
The end goal is to use the pdftools package to efficiently move through a thousand pages of pdf documents and consistently, and safely, produce a usable dataframe/tibble. I have attempted to use the tabulizer package and the pdf_text function, but the results were inconsistent, so I started working through the pdf_data() function, which I prefer.
For those unfamiliar with the pdf_data function, it converts a pdf page into a coordinate grid, with the 0,0 coordinate in the upper-left corner of the page. Therefore, by arranging the x,y coordinates and then pivoting the document into a wide format, all of the information is displayed as it would be on the page, only with NAs for whitespace.
Here is a simple example using the familiar mtcars dataset.
library(pdftools)
library(tidyverse)
library(janitor)
pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
mtcars_pdf_df <- pdf_data(pdf_file)[[1]]
mtcars_pdf_df %>%
  arrange(x, y) %>%
  pivot_wider(id_cols = y, names_from = x, values_from = text) %>%
  unite(col = Car_type, `154`:`215`, sep = " ", remove = TRUE, na.rm = TRUE) %>%
  arrange(y) %>%
  rename("Page Number" = `303`) %>%
  unite(col = mpg, `253`:`254`, sep = "", remove = TRUE, na.rm = TRUE) %>%
  unite(col = cyl, `283`:`291`, sep = "", remove = TRUE, na.rm = TRUE) %>%
  unite(col = disp, `308`:`313`, sep = "", remove = TRUE, na.rm = TRUE)
It would be nice not to use a dozen or so unite functions in order to rename the various columns. I used the janitor package's row_to_names() function at one point to convert row 1 to column names, which worked well, but maybe someone has a better idea?
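For reference, the janitor step mentioned above would look roughly like this (a sketch; pdf_wide is just a placeholder for the pivoted table, not a name from the code above):
pdf_wide %>%
  janitor::row_to_names(row_number = 1)  # promote the first row (the header text from the page) to column names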
The central problem is removing the NAs from the dataset, either by uniting multiple columns or by shifting columns over so that NAs are filled by adjacent columns.
I'm trying to make this efficient, possibly using the purrr package? Any help with making this process more efficient would be much appreciated.
The only information I had on the pdf_data() function going into this is from here...
https://ropensci.org/technotes/2018/12/14/pdftools-20/
Any additional resources would also be greatly appreciated (apart from the pdftools package help documentation/literature).
Thanks everyone! I hope this also helps others use the pdf_data() too :)
Here is one approach that could perhaps be generalised if you know the PDF is a reasonably neat table...
library(pdftools)
library(tidyverse)
pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
df <- pdf_data(pdf_file)[[1]]
df <- df %>% mutate(x = round(x/3), #reduce resolution to minimise inconsistent coordinates
y = round(y/3)) %>%
arrange(y, x) %>% #sort in reading order
mutate(group = cumsum(!lag(space, default = 0))) %>% #identify text with spaces and paste
group_by(group) %>%
summarise(x = first(x),
y = first(y),
text = paste(text, collapse = " ")) %>%
group_by(y) %>%
mutate(colno = row_number()) %>% #add column numbers for table data
ungroup() %>%
select(text, colno, y) %>%
pivot_wider(names_from = colno, values_from = text) %>% #pivot into table format
select(-y) %>%
set_names(c("car", .[1,-ncol(.)])) %>% #shift names from first row
slice(-1, -nrow(.)) %>% #remove names row and page number row
mutate_at(-1, as.numeric)
df
# A tibble: 32 x 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
I'll present a partial solution here, but please allow me to give you some background information first.
I am currently writing a pdf text / table extraction package from scratch in C++ with R bindings, which has required many months and many thousands of lines of code. I started writing it pretty much to do what you are looking to do: reliably extract tabular data from pdfs. I have got it to the point where it can quickly and reliably extract the text from a pdf document, with the associated positions and font of each text element (similar to pdftools).
I assumed that the technical part of reading the xrefs, handling encryption, writing a deflate decompressor, parsing the dictionaries, tokenizing and reading the page description programs would be the real challenges, and that figuring out a general algorithm to extract tabular data was just a detail I would figure out at the end.
Let me tell you, I'm stuck. I can assure you there is no simple, generalizable parsing function that you can write in a few lines of R to reliably extract tabular data from a pdf.
You have three options, as far as I can tell:
1. Stick to documents where you know the exact layout
2. Write a function with filter parameters that you can twiddle and check the results
3. Use a very complex / AI solution to get very good (though never perfect) reliability
For the pdf example you provided, something like the following works fairly well. It falls into the "twiddling parameters" category, and works by cutting the text into columns and rows based on the density function of the x and y co-ordinates of the text elements.
It could be refined a great deal to generalize it, but that would add a lot of complexity and would have to be tested on lots of documents.
tabulize <- function(pdf_df, filter = 0.01)
{
  # density of the x and y coordinates; gaps in the density define the column/row cuts
  xd <- density(pdf_df$x, filter)
  yd <- density(pdf_df$y, filter)
  pdf_df$col <- as.numeric(cut(pdf_df$x, c(xd$x[xd$y > .5] - 2, max(xd$x) + 3)))
  pdf_df$row <- as.numeric(cut(pdf_df$y, c(yd$x[yd$y > .5] - 2, max(yd$x) + 3)))
  pdf_df <- pdf_df %>% group_by(row, col) %>% summarise(label = paste(text, collapse = " "))
  res <- matrix(rep("", max(pdf_df$col) * max(pdf_df$row)), nrow = max(pdf_df$row))
  for (i in 1:nrow(pdf_df)) res[pdf_df$row[i], pdf_df$col[i]] <- pdf_df$label[i]
  res <- res[which(apply(res, 1, paste, collapse = "") != ""), ]  # drop empty rows
  res <- res[, which(apply(res, 2, paste, collapse = "") != "")]  # drop empty columns
  as.data.frame(res[-1, ])                                        # drop the header row
}
which gives the following result:
tabulize(mtcars_pdf_df)
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 Volvo 142E 21.4 4 1 121.0 109 4.11 2.780 18.60 1 1 4 2