Find duplicated elements with dplyr

I tried using the code presented here to find ALL duplicated elements with dplyr like this:
library(dplyr)
mtcars %>%
mutate(cyl.dup = cyl[duplicated(cyl) | duplicated(cyl, from.last = TRUE)])
How can I adapt the code presented there to find ALL duplicated elements with dplyr? My code above just throws an error. Or, even better, is there another function that achieves this more succinctly than the convoluted x[duplicated(x) | duplicated(x, from.last = TRUE)] approach?

I guess you could use filter for this purpose:
mtcars %>%
group_by(carb) %>%
filter(n()>1)
Small example (note that I added summarize() to show that the resulting data set keeps only rows whose 'carb' value occurs more than once. I used 'carb' instead of 'cyl' because 'carb' has some values that occur only once, whereas 'cyl' does not):
mtcars %>% group_by(carb) %>% summarize(n=n())
#Source: local data frame [6 x 2]
#
# carb n
#1 1 7
#2 2 10
#3 3 3
#4 4 10
#5 6 1
#6 8 1
mtcars %>% group_by(carb) %>% filter(n()>1) %>% summarize(n=n())
#Source: local data frame [4 x 2]
#
# carb n
#1 1 7
#2 2 10
#3 3 3
#4 4 10
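As a cross-check, the grouped filter should select the same rows as a duplicated() mask applied in both directions. This is a minimal sketch, assuming dplyr is installed; the names by_group and by_mask are only illustrative:

```r
library(dplyr)

# Keep every row whose carb value occurs more than once
by_group <- mtcars %>%
  group_by(carb) %>%
  filter(n() > 1) %>%
  ungroup()

# Same selection using a logical mask built with duplicated()
by_mask <- mtcars %>%
  filter(duplicated(carb) | duplicated(carb, fromLast = TRUE))

nrow(by_group)                         # 30: the two single-occurrence rows are dropped
all.equal(by_group$mpg, by_mask$mpg)   # TRUE: same rows in the same order
```

The grouped version is usually easier to extend to several key columns, while the mask version avoids grouping altogether.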

Another solution is to use the janitor package:
library(dplyr)
library(janitor)
mtcars %>% get_dupes(wt)

We can find duplicated elements with dplyr as follows.
library(dplyr)
# Only the repeat occurrences (the first occurrence of each value is not flagged)
mtcars %>%
filter(duplicated(.[["carb"]]))
# All rows whose value occurs more than once
mtcars %>%
filter(carb %in% unique(.[["carb"]][duplicated(.[["carb"]])]))

The original post misapplies the solution from the related answer. Used inside mutate, that expression subsets the cyl vector, and the subset will not have the same length as the number of rows in mtcars, so it cannot be stored as a column.
Instead, you can use filter to return all duplicated elements, or mutate with ifelse to create a dummy variable that can be filtered on later:
library(dplyr)
# Return all duplicated elements
mtcars %>%
filter(duplicated(cyl) | duplicated(cyl, fromLast = TRUE))
# Or create a dummy variable that flags all duplicated rows
mtcars %>%
mutate(cyl.dup = ifelse(duplicated(cyl) | duplicated(cyl, fromLast = TRUE), 1, 0))
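To see why the two duplicated() calls are combined, and why subsetting inside mutate fails, it may help to look at a small vector. This is only an illustrative sketch in base R; the vector x is made up for the example:

```r
x <- c("a", "b", "a", "c", "b", "a")

# Flags only the second and later occurrences of each value
later_only <- duplicated(x)
later_only
# FALSE FALSE  TRUE FALSE  TRUE  TRUE

# Flags every element whose value appears more than once
all_dups <- duplicated(x) | duplicated(x, fromLast = TRUE)
all_dups
# TRUE  TRUE  TRUE FALSE  TRUE  TRUE

# Subsetting shortens the vector, so it no longer lines up with the rows
length(x[all_dups])  # 5, not 6
```

The logical mask always has one entry per element, which is what filter() and mutate() expect; the subsetted vector does not.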

# Adding a shortcut to the answer above
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars %>% count(carb)
#> # A tibble: 6 x 2
#> carb n
#> <dbl> <int>
#> 1 1. 7
#> 2 2. 10
#> 3 3. 3
#> 4 4. 10
#> 5 6. 1
#> 6 8. 1
mtcars %>% count(carb) %>% filter(n > 1)
#> # A tibble: 4 x 2
#> carb n
#> <dbl> <int>
#> 1 1. 7
#> 2 2. 10
#> 3 3. 3
#> 4 4. 10
# Showing an alternative that follows the apparent intention of the asker
duplicated_carb <- mtcars %>%
mutate(dup_carb = duplicated(carb)) %>%
filter(dup_carb)
duplicated_carb
#> mpg cyl disp hp drat wt qsec vs am gear carb dup_carb
#> 1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 TRUE
#> 2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 TRUE
#> 3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 TRUE
#> 4 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 TRUE
#> 5 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 TRUE
#> 6 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 TRUE
#> 7 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 TRUE
#> 8 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 TRUE
#> 9 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 TRUE
#> 10 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 TRUE
#> 11 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 TRUE
#> 12 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 TRUE
#> 13 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 TRUE
#> 14 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 TRUE
#> 15 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 TRUE
#> 16 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 TRUE
#> 17 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 TRUE
#> 18 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 TRUE
#> 19 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 TRUE
#> 20 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 TRUE
#> 21 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 TRUE
#> 22 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 TRUE
#> 23 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 TRUE
#> 24 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 TRUE
#> 25 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 TRUE
#> 26 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 TRUE

You can create a Boolean mask with duplicated():
iris %>% duplicated()
[1] FALSE FALSE FALSE .... TRUE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE
Then pass the mask to square-bracket indexing:
iris[iris %>% duplicated(),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
143 5.8 2.7 5.1 1.9 virginica
Note: This approach is the closest thing to Pandas that could be done with R and dplyr:
iris[iris %>% duplicated(), c("Petal.Length","Petal.Width","Species")]
Petal.Length Petal.Width Species
143 5.1 1.9 virginica

A more general solution if you want to group duplicates using many columns:
df %>%
select(ID, COL1, COL2, all_of(vector_of_columns)) %>%
distinct() %>%
ungroup() %>% rowwise() %>%
mutate(ID_GROUPS = paste0(ID, "_", cur_group_rows())) %>%
ungroup() %>%
full_join(., df, by = c("ID", "COL1", "COL2", vector_of_columns)) -> chk
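A simpler way to reach a similar result may be to group by all key columns at once. The following is a sketch under assumptions, not the approach above: the data frame toy, the vector key_cols, and the column dup_group are hypothetical names, and across() inside group_by() assumes dplyr 1.0 or later:

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 share the same COL1/COL2 combination
toy <- data.frame(
  ID   = 1:5,
  COL1 = c("a", "a", "b", "b", "c"),
  COL2 = c(1, 1, 2, 3, 3)
)
key_cols <- c("COL1", "COL2")

dups <- toy %>%
  group_by(across(all_of(key_cols))) %>%   # one group per key combination
  filter(n() > 1) %>%                      # keep combinations seen more than once
  mutate(dup_group = cur_group_id()) %>%   # label each set of duplicates
  ungroup()

dups$ID  # 1 2
```

Because the keys are passed as a character vector, the same pipeline can be reused for any number of columns.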

Find duplicate values in a data frame by column:
df <- dataset[duplicated(dataset$columnname), ]

Related

How to eliminate schools with less than 20 students?

I have a dataset, espana2015, of a country with schools, students, and so on. I want to eliminate schools with fewer than 20 students.
The school identifier variable is CNTSCHID.
dim(espana2015)
[1] 6736 106
The only way I have found is long, manual, and not very efficient: listing the schools one by one.
Here there are only 13 schools with fewer than 20 students, but what if there were many more, e.g. more than 100 schools?
espana2015 %>% group_by(CNTSCHID) %>% summarise(students=n())%>%
filter(students < 20) %>% select (CNTSCHID) ->removeSch
removeSch
# A tibble: 13 x 1
CNTSCHID
<dbl>
1 72400046
2 72400113
3 72400261
4 72400314
5 72400396
6 72400472
7 72400641
8 72400700
9 72400711
10 72400736
11 72400909
12 72400927
13 72400979
espana2015 %>% subset(!CNTSCHID %in% c(72400046,72400113,72400261,
72400314,72400396,72400472,
72400641,72400700,72400711,
72400736,72400909,72400927,
72400979)) -> new_espana2015
Please help me to do it better
Walter

Lacking sample data, I'll demonstrate on mtcars, where my cyl is your CNTSCHID.
library(dplyr)
table(mtcars$cyl)
# 4 6 8
# 11 7 14
mtcars %>%
group_by(cyl) %>%
filter(n() > 10) %>%
ungroup()
# # A tibble: 25 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 2 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 3 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 4 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 5 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
# 6 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
# 7 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
# 8 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
# 9 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
# 10 10.4 8 460 215 3 5.42 17.8 0 0 3 4
# # ... with 15 more rows
This works because the conditional in filter resolves to a single logical, and that length-1 true/false is then recycled for all rows in that group. That is, for cyl == 4, (n() > 10) --> (11 > 10) --> TRUE, so the filter is %>% filter(TRUE); the dplyr::filter function does "safe recycling" in a sense, where the conditional must be the same length as the number of rows, or length 1. When it is length 1, it is essentially saying "all or nothing".
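Applied to the school problem, the same pattern might look like the sketch below. The data frame students is invented for illustration (the real espana2015 data is not available here); only the column name CNTSCHID comes from the question:

```r
library(dplyr)

# Hypothetical stand-in for espana2015: three schools with 25, 5 and 20 students
students <- data.frame(
  CNTSCHID = rep(c(101, 102, 103), times = c(25, 5, 20)),
  score    = seq_len(50)
)

# n() is evaluated once per school; the single TRUE/FALSE is recycled to all its rows
kept <- students %>%
  group_by(CNTSCHID) %>%
  filter(n() >= 20) %>%
  ungroup()

nrow(kept)             # 45: the 5 students of school 102 are dropped
unique(kept$CNTSCHID)  # 101 103
```

No list of school IDs has to be written out by hand, so the approach scales to any number of small schools.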

R data.table: Difference between nested regressions results

I am comparing two alternative strategies to estimate linear regression models on subsets of data using the data.table package for R. The two strategies produce the same coefficients, so they appear equivalent. This appearance is deceiving. My question is:
Why is the data stored inside the lm models different?
library(data.table)
dat = data.table(mtcars)
# strategy 1
mod1 = dat[, .(models = .(lm(hp ~ mpg, data = .SD))), by = vs]
# strategy 2
mod2 = dat[, .(data = .(.SD)), by = vs][
, models := lapply(data, function(x) lm(hp ~ mpg, x))]
At first glance, the two approaches seem to produce identical results:
# strategy 1
coef(mod1$models[[1]])
#> (Intercept) mpg
#> 357.97866 -10.12576
# strategy 2
coef(mod2$models[[1]])
#> (Intercept) mpg
#> 357.97866 -10.12576
However, if I try to extract data from the (expanded) model.frame, I get different results:
# strategy 1
expanded_frame1 = expand.model.frame(mod1$models[[1]], "am")
table(expanded_frame1$am)
#>
#> 0 1
#> 7 11
# strategy 2
expanded_frame2 = expand.model.frame(mod2$models[[1]], "am")
table(expanded_frame2$am)
#>
#> 0 1
#> 12 6
This is a trivial minimal working example. My real use-case is that I obtained radically different results when applying sandwich::vcovCL to compute clustered standard errors for my models.
Edit:
I'm accepting the answer by @TimTeaFan (excellent detective work!) but adding a bit of useful info here for future readers.
As @achim-zeileis pointed out elsewhere, we can replicate a similar behavior in the global environment:
d <- subset(mtcars, subset = vs == 0)
m0 <- lm(hp ~ mpg, data = d)
d <- mtcars[0, ]
expand.model.frame(m0, "am")
[1] hp mpg am
<0 rows> (or 0-length row.names)
This does not appear to be a data.table-specific issue. And in general, we have to be careful when re-evaluating the data from a model.

I don't have a complete answer, but I was able to pinpoint the problem to some extent.
When we compare the output of the two models, we can see that the result is equal except for the calls, which are different (which makes sense, since they actually are different):
# compare models
purrr::map2(mod1$models[[1]], mod2$models[[1]], all.equal)
#> $coefficients
#> [1] TRUE
#>
#> $residuals
#> [1] TRUE
#>
#> $effects
#> [1] TRUE
#>
#> $rank
#> [1] TRUE
#>
#> $fitted.values
#> [1] TRUE
#>
#> $assign
#> [1] TRUE
#>
#> $qr
#> [1] TRUE
#>
#> $df.residual
#> [1] TRUE
#>
#> $xlevels
#> [1] TRUE
#>
#> $call
#> [1] "target, current do not match when deparsed"
#>
#> $terms
#> [1] TRUE
#>
#> $model
#> [1] TRUE
So it seems that the initial call works correctly with both approaches; the problem arises once we try to access the underlying data.
If we have a look at how expand.model.frame gets its data, we can see that it calls eval(model$call$data, envir), where envir is defined as environment(formula(model)), the environment associated with the formula of the lm object.
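The lookup described above can be reproduced by hand. The following is a minimal base R sketch (no data.table involved) that repeats the same eval() step on an ordinary model; the variable names are illustrative only:

```r
d <- subset(mtcars, vs == 0)
m <- lm(hp ~ mpg, data = d)

m$call$data                      # the unevaluated symbol `d`, not a copy of the data
env <- environment(formula(m))   # where that symbol will be looked up later

rows_before <- nrow(eval(m$call$data, env))   # 18

# Rebinding `d` changes what the stored call resolves to ...
d <- mtcars[0, ]
rows_after <- nrow(eval(m$call$data, env))    # 0

# ... while the model frame saved inside the fit is unaffected
rows_in_fit <- nrow(m$model)                  # 18
```

This suggests that anything which re-evaluates the data argument depends on what that name points to at the time of the later call, not at the time of fitting.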
If we look at the data in the environment associated with each model and compare it with the data we expect it to hold, we can see that the second approach yields the data we expect, while the first approach, which uses .SD in the call, yields different data.
It is still not clear to me what is happening or why, but we now know the problem lies in the call that uses .SD. I first thought it might be caused by naming a data.table .SD, but after experimenting with models whose data is a data.table called .SD, this does not seem to be the issue.
# data of model 2 (identical to subsetted mtcars)
environment(formula(mod2$models[[1]]))$x[order(mpg),]
#> mpg cyl disp hp drat wt qsec am gear carb
#> 1: 10.4 8 472.0 205 2.93 5.250 17.98 0 3 4
#> 2: 10.4 8 460.0 215 3.00 5.424 17.82 0 3 4
#> 3: 13.3 8 350.0 245 3.73 3.840 15.41 0 3 4
#> 4: 14.3 8 360.0 245 3.21 3.570 15.84 0 3 4
#> 5: 14.7 8 440.0 230 3.23 5.345 17.42 0 3 4
#> 6: 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
#> 7: 15.2 8 275.8 180 3.07 3.780 18.00 0 3 3
#> 8: 15.2 8 304.0 150 3.15 3.435 17.30 0 3 2
#> 9: 15.5 8 318.0 150 2.76 3.520 16.87 0 3 2
#> 10: 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
#> 11: 16.4 8 275.8 180 3.07 4.070 17.40 0 3 3
#> 12: 17.3 8 275.8 180 3.07 3.730 17.60 0 3 3
#> 13: 18.7 8 360.0 175 3.15 3.440 17.02 0 3 2
#> 14: 19.2 8 400.0 175 3.08 3.845 17.05 0 3 2
#> 15: 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
#> 16: 21.0 6 160.0 110 3.90 2.620 16.46 1 4 4
#> 17: 21.0 6 160.0 110 3.90 2.875 17.02 1 4 4
#> 18: 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
# subset and order mtcars data
mtcars_vs0 <- subset(mtcars, vs == 0)
mtcars_vs0[order(mtcars_vs0$mpg), ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# data of model 1 (not identical to mtcars)
environment(formula(mod1$models[[1]]))$.SD[order(mpg),]
#> mpg cyl disp hp drat wt qsec am gear carb
#> 1: 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
#> 2: 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
#> 3: 17.8 6 167.6 123 3.92 3.440 18.90 0 4 4
#> 4: 18.1 6 225.0 105 2.76 3.460 20.22 0 3 1
#> 5: 19.2 6 167.6 123 3.92 3.440 18.30 0 4 4
#> 6: 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
#> 7: 21.4 6 258.0 110 3.08 3.215 19.44 0 3 1
#> 8: 21.4 4 121.0 109 4.11 2.780 18.60 1 4 2
#> 9: 21.5 4 120.1 97 3.70 2.465 20.01 0 3 1
#> 10: 22.8 4 108.0 93 3.85 2.320 18.61 1 4 1
#> 11: 22.8 4 140.8 95 3.92 3.150 22.90 0 4 2
#> 12: 24.4 4 146.7 62 3.69 3.190 20.00 0 4 2
#> 13: 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
#> 14: 27.3 4 79.0 66 4.08 1.935 18.90 1 4 1
#> 15: 30.4 4 75.7 52 4.93 1.615 18.52 1 4 2
#> 16: 30.4 4 95.1 113 3.77 1.513 16.90 1 5 2
#> 17: 32.4 4 78.7 66 4.08 2.200 19.47 1 4 1
#> 18: 33.9 4 71.1 65 4.22 1.835 19.90 1 4 1
Add on
I tried digging a little deeper to see what's going on. First I called debug(as.formula) and then looked at the following objects in each iteration:
object
ls(environment(object))
We can see that in "strategy 2" each formula is associated with a different environment, and when looking at the environment we see it contains one object x, which when inspected (environment(object)$x) contains the expected mtcars data.
In "strategy 1", however, we can observe that each call to as.formula associates the same environment with the formula being created. Further, when inspecting the environment, we can see that it is populated with the individual column vectors of the subsetted mtcars data (e.g. am, carb, cyl etc.) as well as some functions (e.g. .POSIXt, Cfastmean, strptime etc.). This is probably where things go awry. I suspect that when the same environment is associated with two different formulas (models), the first model's underlying data gets "updated" when the second model is calculated. This would also explain why the model output itself is correct: at the time the first model is calculated, the data is still correct. It is then overwritten for the second model, which is therefore correct too. But when the underlying data is accessed afterwards, things get messy.
Side note
I was curious if we can observe similar problems and differences in the tidyverse when using expand.model.frame and the answer is "yes". Here, the new rowwise notation throws an error, while the group_map as well as the map approach work:
# dplyr approaches:
# group_map: works
mod3 <- mtcars %>%
group_by(vs) %>%
group_map(~ lm(hp ~ mpg, data = .x))
expand.model.frame(mod3[[1]], "am")
# mutate / rowwise: does not work
mod4 <- mtcars %>%
nest_by(vs) %>%
mutate(models = list(lm(hp ~ mpg, data = data)))
expand.model.frame(mod4$models[[1]], "am")
# mutate / map: works
mod5 <- mtcars %>%
tidyr::nest(data = !vs) %>%
mutate(models = purrr::map(data, ~ lm(hp ~ mpg, data = .x)))
expand.model.frame(mod5$models[[1]], "am")

tidyverse function to `mutate_sample`?

I'm looking to mutate a column for a random sample, e.g., mutate_sample. Does anyone know whether there is a dplyr/other tidyverse verb for this? Below is a reprex for the behavior I am looking for and an attempt to functionalize (which isn't running because I'm struggling with quasiquotation in if_else).
library(dplyr)
library(tibble)
library(rlang)
# Setup -------------------------------------------------------------------
group_size <- 10
group_n <- 1
my_cars <-
mtcars %>%
rownames_to_column(var = "model") %>%
mutate(group = NA_real_, .after = model)
# Code to create mutated sample -------------------------------------------
group_sample <-
my_cars %>%
filter(is.na(group)) %>%
slice_sample(n = group_size) %>%
pull(model)
my_cars %>%
mutate(group = if_else(model %in% group_sample, group_n, group)) %>%
head()
#> model group mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 NA 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive NA 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout NA 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant NA 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Function to create mutated sample ---------------------------------------
#
# Note: doesn't run because of var in if_else
# mutate_sample <- function(data, var, id, n, value) {
# # browser()
# sample <-
# data %>%
# filter(is.na({{var}})) %>%
# slice_sample(n = n) %>%
# pull({{id}})
#
# data %>%
# mutate(var = if_else({{id}} %in% sample, value, {{var}}))
# }
#
# mutate_sample(my_cars, group, model, group_size, group_n)
Created on 2020-10-21 by the reprex package (v0.3.0)
Looking through SO, I found this related post:
Mutate column as input to sample

I think you could achieve your goal with these two options.
With dplyr:
mtcars %>% mutate(group = sample(`length<-`(rep(group_n, group_size), n())))
or with base R:
mtcars[sample(nrow(mtcars), group_size), "group"] <- group_n
If you need an external function to handle it, you could go with:
mutate_sample <- function(.data, .var, .size, .value) {
mutate(.data, {{.var}} := sample(`length<-`(rep(.value, .size), n())))
}
mtcars %>% mutate_sample(group, group_size, group_n)
or
mutate_sample_rbase <- function(.data, .var, .size, .value) {
.data[sample(nrow(.data), size = min(.size, nrow(.data))),
deparse(substitute(.var))] <- .value
.data
}
mtcars %>% mutate_sample_rbase(group, group_size, group_n)
Note that if .size is bigger than the number of rows of .data, .var will be a constant equal to .value.
EDIT
If you're interested in keeping the old group, I suggest another way to handle the problem:
library(dplyr)
# to understand this check out ?sample
resample <- function(x, ...){
x[sample.int(length(x), ...)]
}
# avoids an error when the requested size exceeds the number of rows available in a group
resample_max <- function (x, size) {
resample(x, size = min(size, length(x)))
}
mutate_sample <- function(.data, .var, .size, .value) {
# create the column if it doesn't exist
if(! deparse(substitute(.var)) %in% names(.data)) .data <- mutate(.data, {{.var}} := NA)
# replace missing values randomly keeping existing non-missing values
mutate(.data, {{.var}} := replace({{.var}}, resample_max(which(is.na({{.var}})), .size), .value))
}
group_size <- 10
mtcars %>%
mutate_sample(group, group_size, 1) %>%
mutate_sample(group, group_size, 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb group
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 2
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 2
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 NA
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 2
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 1
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 1
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 1
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 2
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 1
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
Notice that this solution also works with a grouped_df (what you get after dplyr::group_by): a sample of .size units is selected from each group.
mtcars %>%
group_by(am) %>%
mutate_sample(group, 10, 1) %>%
ungroup() %>%
count(group)
#> # A tibble: 2 x 2
#> group n
#> <dbl> <int>
#> 1 1 20 # two groups, each with 10!
#> 2 NA 12
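For comparison, the commented-out function in the question appears to fail only because mutate(var = ...) creates a column literally named var. The sketch below is one possible repair, not the answer's method: it keeps the asker's structure and switches to `{{ var }} :=`. The name mutate_sample_ids is made up here, and slice_sample() and the .after argument assume dplyr 1.0 or later:

```r
library(dplyr)
library(tibble)

# Hypothetical repaired version of the function from the question
mutate_sample_ids <- function(data, var, id, n, value) {
  # Draw n ids at random among rows that are still unassigned
  picked <- data %>%
    filter(is.na({{ var }})) %>%
    slice_sample(n = n) %>%
    pull({{ id }})

  # `{{ var }} :=` writes to the column passed by the caller
  data %>%
    mutate({{ var }} := if_else({{ id }} %in% picked, value, {{ var }}))
}

my_cars <- mtcars %>%
  rownames_to_column(var = "model") %>%
  mutate(group = NA_real_, .after = model)

set.seed(1)
out <- my_cars %>%
  mutate_sample_ids(group, model, 10, 1) %>%
  mutate_sample_ids(group, model, 10, 2)

table(out$group, useNA = "ifany")  # 10 rows in group 1, 10 in group 2, 12 still NA
```

Note that if_else() is strict about types, so value should have the same type as the existing column (here a double).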

Does dplyr `is_grouped_df()` actually require a data frame (vs a data table, tibble, etc.)?

Does the dplyr function is_grouped_df() actually require the input to be a data frame (vs a data table, tibble, etc.)? If it does not require a data frame, why isn't it named is_grouped instead of is_grouped_df?
#1 - mtcars with multiple classes
mtcars %>% group_by(cyl) %>% is_grouped_df()
#> [1] TRUE
mtcars %>% group_by(cyl) %>% class()
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
I can group a multiple class mtcars data set, and confirm with the is_grouped_df() function that the data set is grouped.
#2 - mtcars as a tibble
mtcars %>% group_by(cyl) %>% as_tibble() %>% is_grouped_df()
#> [1] FALSE
mtcars %>% group_by(cyl) %>% as_tibble() %>% class()
#> [1] "tbl_df" "tbl" "data.frame"
When I force mtcars to be a tibble and check it with is_grouped_df(), I get FALSE, even though I never called the ungroup() function in my pipe after grouping. Why FALSE?
#3 - mtcars as a data frame (an attempt to return to #1)
mtcars %>% group_by(cyl) %>% as_tibble() %>% as.data.frame() %>% is_grouped_df()
#> [1] FALSE
mtcars %>% group_by(cyl) %>% as_tibble() %>% as.data.frame() %>% class()
#> [1] "data.frame"
When I force mtcars to be a data frame and check it with is_grouped_df(), I again get FALSE, even though I never called the ungroup() function in my pipe after grouping. Why FALSE?
And now I circle back to the original question: "Does the dplyr function is_grouped_df() actually require the input to be a data frame (vs a data table, tibble, etc.)?" And why all the inconsistencies in my three examples above?

As soon as you add the as_tibble or as.data.frame functions, the groups that were created by the group_by function are deleted.
You can't create groups on a plain data frame. The data frame is converted to a tibble as soon as you use group_by:
class(mtcars)
[1] "data.frame"
mtcars %>%
group_by(cyl) %>%
class()
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
You can see how the data frame gets converted into a tibble by using group_by
mtcars %>%
group_by(cyl)
# A tibble: 32 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
But as soon as you call as_tibble again, the groups disappear.
mtcars %>%
group_by(cyl) %>%
as_tibble()
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
Same thing happens if you call as.data.frame after using group_by
mtcars %>%
group_by(cyl) %>%
as.data.frame()
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
So basically, is_grouped_df works as intended, only detecting tibbles that have groups. In this case, it's important to note that as_tibble will effectively reset the groups that have already been created, so it ends up acting as ungroup if called on a tibble.
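The behaviour can be checked directly. The sketch below assumes, as the class() output above suggests, that is_grouped_df() amounts to a test for the "grouped_df" class; group_vars() is used as an independent way to see which grouping columns remain:

```r
library(dplyr)

g <- mtcars %>% group_by(cyl)

is_grouped_df(g)                    # TRUE
inherits(g, "grouped_df")           # TRUE: the same information via the class
group_vars(g)                       # "cyl"

# Converting drops the grouped_df class and with it the grouping
is_grouped_df(as_tibble(g))         # FALSE
length(group_vars(as_tibble(g)))    # 0
is_grouped_df(as.data.frame(g))     # FALSE

# ungroup() is the explicit way to get the same effect
is_grouped_df(ungroup(g))           # FALSE
```

So the "_df" suffix refers to the grouped_df class that is being tested for, not to a requirement that the input be a base data.frame.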

How can I unquote-splice in mutate_at?

I want to parse_factor then fct_recode several variables in a dataframe. The levels (and their recode values) are stored in named strings.
How can I use those to implement what I want?
Note that in my case, I cannot simply use mutate, because I have several variables to which I want to apply the recoding.
Below is an example of what I thought would work (but does not).
library(tidyverse)
#> ── Attaching packages ────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.4
#> ✔ tidyr 0.8.0 ✔ stringr 1.3.0
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ───────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
gear_levels <- c("tri" = 3, "quad" = 4, "six" = 6, `NA` = 8)
mtcars %>%
mutate_at("gear", parse_factor, levels = gear_levels) %>%
mutate_at("gear", fct_recode, !!! gear_levels)
#> Warning: 5 parsing failures.
#> row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 27 NA value in level set 5 row 2 28 NA value in level set 5 col 3 29 NA value in level set 5 expected 4 30 NA value in level set 5 actual 5 31 NA value in level set 5
#> Error: Can't use `!!!` on atomic vectors in non-quoting functions

As per lionel's comment, this is what coercing to list looks like. Note that you need to supply a character vector to fct_recode and that you have to replace the names after as.character. I'm not sure exactly how your desired levels are stored.
Also your supplied levels don't match those in mtcars$gear, in case you didn't realise.
library(tidyverse)
gear_levels <- c("tri" = 3, "quad" = 4, "six" = 6, `NA` = 8)
gear_recode <- as.list(as.character(gear_levels))
names(gear_recode) <- names(gear_levels)
mtcars %>%
mutate_at(vars(gear), parse_factor, levels = gear_levels) %>%
mutate_at(vars(gear), fct_recode, !!! gear_recode)
#> Warning: 5 parsing failures.
#> row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 27 NA value in level set 5 row 2 28 NA value in level set 5 col 3 29 NA value in level set 5 expected 4 30 NA value in level set 5 actual 5 31 NA value in level set 5
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 quad 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 quad 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 quad 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 tri 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 tri 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 tri 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 tri 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 quad 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 quad 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 quad 4
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 quad 4
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 tri 3
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 tri 3
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 tri 3
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 tri 4
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 tri 4
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 tri 4
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 quad 1
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 quad 2
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 quad 1
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 tri 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 tri 2
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 tri 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 tri 4
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 tri 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 quad 1
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 <NA> 2
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 <NA> 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 <NA> 4
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 <NA> 6
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 <NA> 8
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 quad 2
Created on 2018-03-16 by the reprex package (v0.2.0).
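Outside of mutate_at, the splicing step can be tried on its own. This is a minimal sketch that assumes a recent forcats and rlang, where a named character vector can be spliced with !!! (on older versions, converting it with as.list() as in the answer above may be needed). The level name five is added here only so that the recode covers the values actually present in mtcars$gear:

```r
library(forcats)

# New level names on the left, existing levels (as strings) on the right
gear_recode <- c(tri = "3", quad = "4", five = "5")

gear_fct <- factor(mtcars$gear)     # levels "3", "4", "5"
recoded  <- fct_recode(gear_fct, !!!gear_recode)

levels(recoded)   # "tri" "quad" "five"
table(recoded)    # same counts as table(mtcars$gear), with the new labels
```

Starting from a factor with the real levels avoids the parsing failures seen above, which come from levels that do not occur in the data.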
