I have a dataset, espana2015, containing a country's schools and students. I want to eliminate the schools with fewer than 20 students.
The school identifier variable is CNTSCHID.
dim(espana2015)
[1] 6736 106
The only way I have found is long, manual, and inefficient: writing out the school IDs one by one.
Here there are only 13 schools with fewer than 20 students, but what if there were many more, e.g. more than 100 schools?
espana2015 %>%
  group_by(CNTSCHID) %>%
  summarise(students = n()) %>%
  filter(students < 20) %>%
  select(CNTSCHID) -> removeSch
removeSch
# A tibble: 13 x 1
CNTSCHID
<dbl>
1 72400046
2 72400113
3 72400261
4 72400314
5 72400396
6 72400472
7 72400641
8 72400700
9 72400711
10 72400736
11 72400909
12 72400927
13 72400979
espana2015 %>% subset(!CNTSCHID %in% c(72400046,72400113,72400261,
72400314,72400396,72400472,
72400641,72400700,72400711,
72400736,72400909,72400927,
72400979)) -> new_espana2015
Please help me do this in a better way.
Walter
Lacking sample data, I'll demonstrate on mtcars, where my cyl plays the role of your CNTSCHID.
library(dplyr)
table(mtcars$cyl)
# 4 6 8
# 11 7 14
mtcars %>%
group_by(cyl) %>%
filter(n() > 10) %>%
ungroup()
# # A tibble: 25 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 2 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 3 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 4 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 5 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
# 6 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
# 7 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
# 8 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
# 9 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
# 10 10.4 8 460 215 3 5.42 17.8 0 0 3 4
# # ... with 15 more rows
This works because the condition in filter() resolves to a single logical value, and that length-1 TRUE/FALSE is recycled across all rows in the group. For cyl == 4, (n() > 10) becomes (11 > 10), which is TRUE, so the filter is effectively filter(TRUE) for that group. dplyr::filter() does "safe recycling" in a sense: the condition must be either the same length as the number of rows or length 1, and when it is length 1 it means "all or nothing".
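Applied to your data, the whole thing is one pipeline. A minimal sketch, assuming espana2015 has one row per student so that n() counts the students in each school:

library(dplyr)

# Keep only schools with at least 20 students;
# n() is the number of rows (students) in each CNTSCHID group
new_espana2015 <- espana2015 %>%
  group_by(CNTSCHID) %>%
  filter(n() >= 20) %>%
  ungroup()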
I'm working with a data frame that indexes values by three variables: date, campaign, and country. Every other value is indexed according to these three, as follows:
# Groups: date, campaign [1,325]
date campaign country cost clicks
<date> <dbl> <chr> <dbl> <dbl>
1 2021-03-01 10127671839 0 0.45 7
2 2021-03-01 10127671839 AD 0.47 10
3 2021-03-01 10127671839 AE 0.39 11
4 2021-03-01 10127671839 AF 0.27 2
5 2021-03-01 10127671839 AG 0 0
6 2021-03-01 10127671839 AI 1.28 2
7 2021-03-01 10127671839 AL 0.66 6
8 2021-03-01 10127671839 AM 0.33 2
9 2021-03-01 10127671839 AO 0 0
10 2021-03-01 10127671839 AR 0 0
# … with 335,215 more rows
What I'm trying to do is create a moving average of those values (in the table above, cost and clicks) that is still indexed by country, campaign, and date.
Edit: I found a good function that works when there are only two index variables (here: Rolling mean (moving average) by group/id with dplyr), but I am not skilled enough to tweak that code to work with three or more variables.
I think zoo::rollmean works well here, and dplyr::group_by can handle as many index variables as you need:
library(dplyr)
mtcars %>%
  group_by(cyl, am, vs) %>%
  mutate(across(c(mpg, disp), list(rm = ~ zoo::rollmeanr(., 2, fill = NA))))
# # A tibble: 32 x 13
# # Groups: cyl, am, vs [7]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_rm disp_rm
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 NA NA
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21 160
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 NA NA
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 NA NA
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 NA NA
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 19.8 242.
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 16.5 360
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 NA NA
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 23.6 144.
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 18.6 196.
# # ... with 22 more rows
The fill = NA argument means the first value in each series has no history to average over, so it is NA. If you prefer the first value in a series to be an average of whatever is available so far (here, just itself), you can instead use rollapplyr with partial = TRUE:
mtcars %>%
  group_by(cyl, am, vs) %>%
  mutate(across(c(mpg, disp), list(rm = ~ zoo::rollapplyr(., 2, FUN = mean, partial = TRUE))))
# # A tibble: 32 x 13
# # Groups: cyl, am, vs [7]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_rm disp_rm
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 21 160
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21 160
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 22.8 108
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.4 258
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 18.7 360
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 19.8 242.
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 16.5 360
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 24.4 147.
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 23.6 144.
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 18.6 196.
# # ... with 22 more rows
I've used the align = "right" variants of zoo's functions (rollmeanr, rollapplyr), assuming that your moving average is historical and that time increases down the rows. If those assumptions don't hold, make sure you intentionally choose among the align variants.
I used dplyr::across here to handle an arbitrary number of columns in one step: because I passed a named list of "tilde-functions", across() appended each function's name to each column name (giving mpg_rm and disp_rm). You can break this out into individual mutate() assignments if you prefer, for readability, maintainability, or if each column needs different arguments.
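Translated to your data, a sketch might look like the following. Assumptions on my part: your frame is named df, each campaign/country pair has one row per date, and a 7-row (e.g. 7-day) window is wanted; adjust to taste.

library(dplyr)

df %>%
  group_by(campaign, country) %>%      # one series per campaign/country pair
  arrange(date, .by_group = TRUE) %>%  # rolling means assume rows are in time order
  mutate(across(c(cost, clicks),
                list(rm = ~ zoo::rollapplyr(., 7, FUN = mean, partial = TRUE)))) %>%
  ungroup()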
I'm dealing with a big dataframe that has a number of columns I want to group by. I'd like to do something like this:
output <- df %>%
group_by(starts_with("GEN", ignore.case=TRUE),x,y) %>%
summarize(total=n()) %>%
arrange(desc(total))
Is there a way to do this? Maybe with group_by_at or some other similar function?
To use starts_with() in group_by(), you need to wrap it in across(). Here is an example using built-in data.
library(dplyr)
mtcars %>%
group_by(across(starts_with("c"))) %>%
summarize(total = n()) %>%
arrange(-total)
# A tibble: 9 x 3
# Groups: cyl [3]
cyl carb total
<dbl> <dbl> <int>
1 4 2 6
2 8 4 6
3 4 1 5
4 6 4 4
5 8 2 4
6 8 3 3
7 6 1 2
8 6 6 1
9 8 8 1
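For your case, that would be something like this (a sketch, using the df, x, and y names from your post):

library(dplyr)

output <- df %>%
  group_by(across(starts_with("GEN", ignore.case = TRUE)), x, y) %>%
  summarize(total = n(), .groups = "drop") %>%
  arrange(desc(total))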
Yes, there is. You could use the group_by_at function:
mtcars %>% group_by_at(vars(starts_with("c"), gear))
This groups by all columns whose names start with "c", plus the column gear. Output:
# A tibble: 32 x 11
# Groups: cyl, carb, gear [12]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
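Note that the scoped verbs such as group_by_at() have been superseded since dplyr 1.0.0; the equivalent with across() would be:

library(dplyr)

mtcars %>%
  group_by(across(c(starts_with("c"), gear)))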
I have a dataset with 82,147 observations and 36 variables, and I need to find the top 20 levels of the Description column by frequency. However, the dataset also has a QTY column, so simply counting the most frequent Description levels does not give a true picture of the most frequently occurring items, because each record carries a QTY that is not necessarily 1.
Top20InvDesc <- names(sort(summary(as.factor(Inventory$Description)),
decreasing=T)[1:20])
Top20InvDesc
I have tried the above and continue to scour the internet for how to do this, but I also do not know how to ask this question properly, so I am finding a lot of similar material but nothing that is quite what I need.
I have also tried:
library(dplyr)
Inventory %>%
group_by(Description) %>%
top_n(5, Qty)
Say that a "syringe" is one of the levels in the "Description" column and it is the most repeated level, but each record has a QTY of 5. There is also a level of "gloves" in the "Description" column and it is the 5th most repeated level, but the QTY is 1000 for each. I know that the "gloves" should be the first item in the new dataframe I am trying to make, but I cannot figure out how to get my code to do this. The easiest way I can think of to solve my problem is to create a new dataframe where each item is listed as QTY 1 and only use the top 20 items.
What I am getting:

Description                                                                             Qty
<fctr>                                                                                 <int>
ARMBOARD INTRAVENOUS NEONATAL 4X1.5IN FOAM SEMIFLEXIBLE DISPOSABLE LATEX FREE-BG/24EA    32

What I want (the Armboard row would then appear 32 times, each with Qty 1):

Description                                                                             Qty
<fctr>                                                                                 <int>
ARMBOARD INTRAVENOUS NEONATAL 4X1.5IN FOAM SEMIFLEXIBLE DISPOSABLE LATEX FREE-BG/24EA     1
My laptop has 32 GB of memory and a 180 W power supply, so I figured I can live with the longer processing time; this will also make the data much easier to work with.
library(dplyr)

details_from_top20 <- Inventory %>%
  group_by(Description) %>%
  summarise(n = sum(QTY)) %>%   # total quantity per description
  top_n(20) %>%                 # with no `wt` given, top_n() uses the last column, `n`
  left_join(Inventory)          # joins back on "Description" (prints a message)
For a reproducible example, we could use mtcars and get all the data for the cars whose gear value has the highest total weight; here that is gear == 3. (It's a little contrived, but structurally the same problem.)
cars_with_top_gear_weight <- mtcars %>%
  group_by(gear) %>%
  summarise(total_wt = sum(wt * 1000)) %>%
  top_n(1) %>%
  left_join(mtcars)
# A tibble: 15 x 12
gear total_wt mpg cyl disp hp drat wt qsec vs am carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 58389 21.4 6 258 110 3.08 3.22 19.4 1 0 1
2 3 58389 18.7 8 360 175 3.15 3.44 17.0 0 0 2
3 3 58389 18.1 6 225 105 2.76 3.46 20.2 1 0 1
4 3 58389 14.3 8 360 245 3.21 3.57 15.8 0 0 4
5 3 58389 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
6 3 58389 17.3 8 276. 180 3.07 3.73 17.6 0 0 3
7 3 58389 15.2 8 276. 180 3.07 3.78 18 0 0 3
8 3 58389 10.4 8 472 205 2.93 5.25 18.0 0 0 4
9 3 58389 10.4 8 460 215 3 5.42 17.8 0 0 4
10 3 58389 14.7 8 440 230 3.23 5.34 17.4 0 0 4
11 3 58389 21.5 4 120. 97 3.7 2.46 20.0 1 0 1
12 3 58389 15.5 8 318 150 2.76 3.52 16.9 0 0 2
13 3 58389 15.2 8 304 150 3.15 3.44 17.3 0 0 2
14 3 58389 13.3 8 350 245 3.73 3.84 15.4 0 0 4
15 3 58389 19.2 8 400 175 3.08 3.84 17.0 0 0 2
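If you also want the expanded form you described, with one row per physical unit and QTY equal to 1, tidyr::uncount() can build it. A sketch (note that with large quantities this multiplies the row count accordingly):

library(dplyr)
library(tidyr)

# Repeat each record QTY times, keep the column, then set it to 1
Inventory_units <- Inventory %>%
  uncount(QTY, .remove = FALSE) %>%
  mutate(QTY = 1L)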
I am trying to apply a sampling function to a data frame in a grouped fashion: it should sample n rows from each group, or all of the group's rows if the group size is smaller than n.
Using dplyr, I first tried
library(dplyr)
mtcars %>% group_by(cyl) %>% sample_n(2)
This works when n is smaller than all the group sizes, but it does not take the full group when I choose n larger than a group size (note that one of the cyl groups has only 7 cars):
mtcars %>% group_by(cyl) %>% sample_n(8)
Error: `size` must be less or equal than 7 (size of data),
set `replace` = TRUE to use sampling with replacement
I tried to solve this by writing an adapted sample_n function like so:
sample_n_or_all <- function(tbl, n) {
  if (nrow(tbl) < n) return(tbl)
  sample_n(tbl, n)
}
but using my custom function (mtcars %>% group_by(cyl) %>% sample_n_or_all(8)) generates the same error, presumably because the function receives the whole grouped tibble (32 rows), so nrow(tbl) < n is FALSE and the inner sample_n() still runs per group.
Any suggestions for how to adapt my function so it applies to each of the groups? Or another solution to the problem?
We could check the number of rows in each group and pass that value to sample_n accordingly:
library(dplyr)
n <- 8
temp <- mtcars %>%
  group_by(cyl) %>%
  sample_n(if (n() < n) n() else n)
temp
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 4 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 7 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 8 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 9 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#10 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
# … with 13 more rows
We can then check the number of rows in each group:
table(temp$cyl)
#4 6 8
#8 7 8
table(mtcars$cyl)
# 4 6 8
#11 7 14
We can do this without an explicit conditional by using pmin:
library(dplyr)
tmp <- mtcars %>%
group_by(cyl) %>%
sample_n(pmin(n(), n))
# A tibble: 23 x 11
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 5 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# 6 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
# 7 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 8 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 9 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#10 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# … with 13 more rows
Checking:
table(tmp$cyl)
# 4 6 8
# 8 7 8
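As an aside: if you are on dplyr 1.0.0 or later, slice_sample() (which supersedes sample_n()) handles this case directly, since with replace = FALSE it silently truncates n to the group size:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 8)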
Basically, I want to turn a numeric income variable into an ordinal income variable, where the cut-off points for the categories are chosen so that each category ends up with the same N (or one category has one fewer case if the total N is odd).
Does anyone know how I can do this in R?
I'd suggest you use the ntile function, which splits your variable into groups with (nearly) the same number of cases. Here's an example using mtcars.
Assume that the variable of interest is disp:
library(dplyr)
mtcars %>%
group_by(g = ntile(disp, 3)) %>% # split variable into 3 groups
mutate(g_range = paste0(min(disp), "-", max(disp))) %>% # create the ranges
ungroup() -> df
Your updated data (df) will look like this:
# # A tibble: 32 x 13
# mpg cyl disp hp drat wt qsec vs am gear carb g g_range
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 146.7-301
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2 146.7-301
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1 71.1-145
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 2 146.7-301
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 3 304-472
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 2 146.7-301
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 3 304-472
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 2 146.7-301
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 1 71.1-145
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 2 146.7-301
# # ... with 22 more rows
You can check the number of cases within each group:
df %>% count(g, g_range)
# # A tibble: 3 x 3
# g g_range n
# <int> <chr> <int>
# 1 1 71.1-145 11
# 2 2 146.7-301 11
# 3 3 304-472 10
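If you need a proper ordinal variable rather than integer group codes, you can wrap ntile in factor(..., ordered = TRUE). A sketch, assuming a hypothetical income column and three categories:

library(dplyr)

# `income` and the labels are placeholders for your own data
df <- df %>%
  mutate(income_cat = factor(ntile(income, 3),
                             labels = c("low", "mid", "high"),
                             ordered = TRUE))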