Group all rows that meet a certain condition - r

I have the following dataframe df1:
company_location count
<chr> <int>
1 DE 28
2 JP 6
3 GB 47
4 HN 1
5 US 355
6 HU 1
I want to get to df2:
company_location count
<chr> <int>
1 DE 28
2 GB 47
3 US 355
4 OTHER 8
df2 is the same as df1 but sums together all the rows with count < 10 and aggregates them into a row called OTHER.
Does something like this exist: a group_by() function that groups all the rows matching a particular condition into one group and leaves every other row in its own single-row group?

This is what fct_lump_min is for - it's a function from forcats, which is part of the tidyverse.
library(tidyverse)
df1 %>%
  group_by(company_location = fct_lump_min(company_location, 10, count)) %>%
  summarise(count = sum(count))
#> # A tibble: 4 x 2
#> company_location count
#> <fct> <int>
#> 1 DE 28
#> 2 GB 47
#> 3 US 355
#> 4 Other 8
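Note that fct_lump_min() returns a factor and names the lumped level "Other" by default. If you want the upper-case OTHER from your expected output, and a character column back, a small variation on the code above (using forcats' other_level argument) would be:
library(tidyverse)
df1 %>%
  group_by(company_location = fct_lump_min(company_location, min = 10, w = count,
                                           other_level = "OTHER")) %>%
  summarise(count = sum(count)) %>%
  mutate(company_location = as.character(company_location))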

Make a temporary variable regrouping company_location based on count, then summarise:
library(dplyr)
df1 %>%
  group_by(company_location = replace(company_location, count < 10, 'OTHER')) %>%
  summarise(count = sum(count))
# company_location count
# <chr> <int>
#1 DE 28
#2 GB 47
#3 OTHER 8
#4 US 355


R output BOTH maximum and minimum value by group in dataframe

Let's say I have a dataframe of Name and Value; is there any way to extract BOTH the minimum and maximum values within each Name in a single function?
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
Name Value
<chr> <int>
1 A 27
2 A 37
3 A 57
4 B 89
5 B 20
6 B 86
7 C 97
8 C 62
9 C 58
The output should contain TWO columns only (Name and Value).
Thanks in advance!
You can use range to get the max and min values and use it in summarise to get two rows for each Name.
library(dplyr)
df %>%
  group_by(Name) %>%
  summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
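A side note on the dplyr solution above: from dplyr 1.1.0 onwards, summarise() calls that return more than one row per group are deprecated in favour of reframe(). If you are on that version, a sketch of the equivalent would be:
library(dplyr)
df %>%
  group_by(Name) %>%
  reframe(Value = range(Value))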
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
df %>%
  group_by(Name) %>%
  summarise(
    maximum = max(Value),
    minimum = min(Value)
  )
This outputs:
# A tibble: 3 × 3
Name maximum minimum
<chr> <int> <int>
1 A 68 1
2 B 87 34
3 C 82 14
What's a little odd is that my original df object looks a little different than yours, in spite of the seed:
# A tibble: 9 × 2
Name Value
<chr> <int>
1 A 68
2 A 39
3 A 1
4 B 34
5 B 87
6 B 43
7 C 14
8 C 82
9 C 59
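The difference is most likely the sampling algorithm: R 3.6.0 changed the default behaviour of sample(), so set.seed(1) produces different draws on older and newer R versions. If that is indeed the cause, the older draws can be reproduced as in this sketch (R will warn that the old method is biased and is kept only for reproducibility):
RNGkind(sample.kind = "Rounding")  # switch to the pre-R-3.6.0 sampling method
set.seed(1)
sample(1:100, 9)
RNGkind(sample.kind = "default")   # restore the current default afterwards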
I'm currently using rbind() together with slice_min() and slice_max(), but I think it may not be the best or most efficient way when the dataframe contains millions of rows.
library(tidyverse)
rbind(df %>% group_by(Name) %>% slice_max(Value),
      df %>% group_by(Name) %>% slice_min(Value)) %>%
  arrange(Name)
# A tibble: 6 x 2
# Groups: Name [3]
Name Value
<chr> <int>
1 A 57
2 A 27
3 B 89
4 B 20
5 C 97
6 C 58
In base R, the output format can be created with tapply/stack: do a grouped tapply to get the output as a named list of ranges, stack it into a two-column data.frame, and change the column names if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
Name Value
1 A 27
2 A 57
3 B 20
4 B 89
5 C 58
6 C 97
Using aggregate.
aggregate(Value ~ Name, df, range)
# Name Value.1 Value.2
# 1 A 1 68
# 2 B 34 87
# 3 C 14 82
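Note that aggregate() stores the two range values in a matrix column named Value (printed as Value.1/Value.2), and this result is wide (one row per Name) rather than the two-column long format asked for. If you need plain columns, a common follow-up (a sketch) is:
res <- aggregate(Value ~ Name, df, range)
res <- do.call(data.frame, res)  # flatten the matrix column into Value.1 / Value.2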

Define groups of columns and sum the i-th columns of each group with dplyr

I have two groups of columns, each with 36 columns, and I want to sum the i-th column of group 1 with the i-th column of group 2, getting 36 columns. The number of columns in each group is not fixed in my code, although each group always has the same number of them.
Example. What I have:
library(dplyr)
teste <- tibble(a1 = c(1, 2, 3), a2 = c(7, 8, 9), b1 = c(4, 5, 6), b2 = c(10, 20, 30))
a1 a2 b1 b2
<dbl> <dbl> <dbl> <dbl>
1 1 7 4 10
2 2 8 5 20
3 3 9 6 30
What I want:
resultado <- teste %>%
  summarise(
    a_b1 = a1 + b1,
    a_b2 = a2 + b2
  )
a_b1 a_b2
<dbl> <dbl>
1 5 17
2 7 28
3 9 39
It would be nice to perform this operation with dplyr.
Any help would be appreciated.
You will struggle to find a dplyr solution as simple and elegant as the base R one:
teste[1:2] + teste[3:4]
#> a1 a2
#> 1 5 17
#> 2 7 28
#> 3 9 39
Though I guess in dplyr you get the same result with:
teste %>% select(starts_with("a")) + teste %>% select(starts_with("b"))
teste %>%
  summarise(across(starts_with("a")) + across(starts_with("b")))
# A tibble: 3 x 2
a1 a2
<dbl> <dbl>
1 5 17
2 7 28
3 9 39
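Since the question says the number of columns per group is not fixed, here is a hedged sketch that avoids hard-coding the column names; it assumes the two groups can be identified by their prefixes ("a" and "b") and are in matching order:
library(dplyr)
grp_a <- teste %>% select(starts_with("a"))
grp_b <- teste %>% select(starts_with("b"))
resultado <- as_tibble(grp_a + grp_b) %>%        # element-wise sum of the two blocks
  rename_with(~ paste0("a_b", seq_along(.x)))    # a_b1, a_b2, ... regardless of width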
This might also help in base R:
as.data.frame(do.call(cbind, lapply(split.default(teste, sub("\\D(\\d+)", "\\1", names(teste))), rowSums, na.rm = TRUE)))
1 2
1 5 17
2 7 28
3 9 39
Another dplyr solution: we can use rowwise and c_across together to sum the values per row. Notice that we can pass na.rm = TRUE to sum in this case.
library(dplyr)
teste2 <- teste %>%
  rowwise() %>%
  transmute(a_b1 = sum(c_across(ends_with("1")), na.rm = TRUE),
            a_b2 = sum(c_across(ends_with("2")), na.rm = TRUE)) %>%
  ungroup()
teste2
# # A tibble: 3 x 2
# a_b1 a_b2
# <dbl> <dbl>
# 1 5 17
# 2 7 28
# 3 9 39

Merge rows containing similar strings using dplyr

I have a table containing the following data:
library(dplyr)
df <- tibble(
  dose = seq(10, 50, 10),
  date = c("2007-12-15", "2007-10-13", "2007-10-13", "2007-09-30", "2007-09-30"),
  response = c(45, 67, 66, 54, 55),
  name = c("Peter,Martin", "Gale,Rebecca", "Rebecca,Gale", "Jonathan,Smith", "Smith,Jonathan")
)
The table:
# A tibble: 5 x 4
dose date response name
<dbl> <chr> <dbl> <chr>
1 10 2007-12-15 45 Peter,Martin
2 20 2007-10-13 67 Gale,Rebecca
3 30 2007-10-13 66 Rebecca,Gale
4 40 2007-09-30 54 Jonathan,Smith
5 50 2007-09-30 55 Smith,Jonathan
One of the columns called name either has a string "FirstName,LastName" or "LastName,FirstName". I wish to merge the rows that contain the same names if they are ordered either way. For example, the rows containing Rebecca,Gale and Gale,Rebecca should merge.
While merging, I wish to get the sums of the columns dose and response and want to keep the first of the date and name entries.
Expected outcome:
# A tibble: 3 x 4
dose date response name
<dbl> <chr> <dbl> <chr>
1 10 2007-12-15 45 Peter,Martin
2 50 2007-10-13 133 Gale,Rebecca
3 90 2007-09-30 109 Jonathan,Smith
Please note that I always want to merge using the name column and not the date column because even if the example contains the same dates, my bigger table has different dates for the same name.
Here is one idea.
library(tidyverse)
df2 <- df %>%
  mutate(date = as.Date(date)) %>%
  mutate(name = map_chr(name, ~ toString(sort(str_split(.x, ",")[[1]])))) %>%
  group_by(name) %>%
  summarize(dose = sum(dose),
            response = sum(response),
            date = first(date)) %>%
  select(names(df)) %>%
  ungroup()
df2
# # A tibble: 3 x 4
# dose date response name
# <dbl> <date> <dbl> <chr>
# 1 50 2007-10-13 133 Gale, Rebecca
# 2 90 2007-09-30 109 Jonathan, Smith
# 3 10 2007-12-15 45 Martin, Peter
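One detail: the question asks to keep the first of the name entries, while the code above replaces name with its sorted form (e.g. Gale, Rebecca with a space). A hedged variation that groups on a sorted key but keeps the first original spelling:
library(tidyverse)
df %>%
  mutate(key = map_chr(name, ~ toString(sort(str_split(.x, ",")[[1]])))) %>%
  group_by(key) %>%
  summarise(dose = sum(dose),
            response = sum(response),
            date = first(date),
            name = first(name),        # keep the first spelling, e.g. "Gale,Rebecca"
            .groups = "drop") %>%
  select(all_of(names(df)))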

R dplyr::Filter dataframe by group and numeric vector?

I have a dataframe df1 containing data and groups, and df2, which stores the same groups and one value per group.
I want to filter rows of df1 by df2, keeping the rows where the lag by group is higher than the value indicated for that group.
Dummy example:
# identify the first year of disturbance by lag by group
df1 <- data.frame(year = c(1:4, 1:4),
                  mort = c(5, 16, 40, 4, 5, 6, 10, 108),
                  distance = rep(c("a", "b"), each = 4))
df2 <- data.frame(distance = c("a", "b"),
                  my.median = c(12, 1))
Now calculate the lag between values (creating a new column) and filter df1 based on the column values of df2:
# calculate lag between years
df1 %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  filter(yearLag > df2$my.median)
This, however, does not produce the expected results:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 2 16 a 11
2 3 40 a 24
3 4 108 b 98
Instead, I expect to get:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 3 40 a 24
2 1 5 b 5
3 3 10 b 4
The filter works fine when applied to a single value, but how can I adapt it to a vector, and especially to a vector of groups (as the order of elements can potentially change)?
Is this what you're trying to do?
df1 %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  left_join(df2) %>%
  filter(yearLag > my.median)
Result:
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance yearLag my.median
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 24 12
2 1 5 b 5 1
3 3 10 b 4 1
4 4 108 b 98 1
Here is a data.table approach:
library(data.table)
# create data.tables
setDT(df1); setDT(df2)
# create the yearLag variable
df1[, yearLag := mort - shift(mort, type = "lag", fill = 0), by = .(distance)]
# update join and filter the wanted rows
df1[df2, median.value := i.my.median, on = .(distance)][yearLag > median.value, ][]
# year mort distance yearLag median.value
# 1: 3 40 a 24 12
# 2: 1 5 b 5 1
# 3: 3 10 b 4 1
# 4: 4 108 b 98 1
I came to the same conclusion: you should left_join the data frames.
df1 %>%
  left_join(df2, by = "distance") %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  filter(yearLag > my.median)
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance my.median yearLag
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 12 24
2 1 5 b 1 5
3 3 10 b 1 4
4 4 108 b 1 98

Group data hierarchically on two levels, then compute relative frequencies in R using dplyr [duplicate]

This question already has answers here:
Relative frequencies / proportions with dplyr
(10 answers)
Closed 3 years ago.
I want to do something which appears simple, but I don't have a good feel for R yet; it is a maze of twisty passages, all different.
I have a table with several variables, and I want to group on two variables ... I want a two-level hierarchical grouping, also known as a tree. This can evidently be done using the group_by function of dplyr.
And then I want to compute marginal statistics (in this case, relative frequencies) based on group counts for level 1 and level 2.
In pictures (the tables appeared as images in the original question): given a table of 18 rows, I want a table of 6 rows.
Is there a simple way to do this in dplyr? (I can do it in SQL, but ...)
Edited for example
For example, based on the nycflights13 package:
library(dplyr)
install.packages("nycflights13")
require(nycflights13)
data(flights) # contains information about flights, one flight per row
ff <- flights %>%
  mutate(approx_dist = floor((distance + 999) / 1000) * 1000) %>%
  select(carrier, approx_dist) %>%
  group_by(carrier, approx_dist) %>%
  summarise(n = n()) %>%
  arrange(carrier, approx_dist)
This creates a tbl ff with the number of flights for each pair of (carrier, inter-airport-distance-rounded-to-1000s):
# A tibble: 33 x 3
# Groups: carrier [16]
carrier approx_dist n
<chr> <dbl> <int>
1 9E 1000 15740
2 9E 2000 2720
3 AA 1000 9146
4 AA 2000 17210
5 AA 3000 6373
And now I would like to compute the relative frequencies for the "approx_dist" values in each "carrier" group, for example, I would like to get:
carrier approx_dist n rel_freq
<chr> <dbl> <int>
1 9E 1000 15740 15740/(15740+2720)
2 9E 2000 2720 2720/(15740+2720)
If I understood your problem correctly, here is what you can do. This will not solve your problem exactly (we don't have your data), but it should give you some hints:
library(dplyr)
d <- data.frame(col1 = rep(c("a", "a", "a", "b", "b", "b"), 2),
                col2 = rep(c("a1", "a2", "a3", "b1", "b2", "b3"), 2),
                stringsAsFactors = FALSE)
d %>%
  group_by(col1) %>%
  mutate(count_g1 = n()) %>%
  ungroup() %>%
  group_by(col1, col2) %>%
  summarise(rel_freq = n() / unique(count_g1)) %>%
  ungroup()
# # A tibble: 6 x 3
# col1 col2 rel_freq
# <chr> <chr> <dbl>
# 1 a a1 0.333
# 2 a a2 0.333
# 3 a a3 0.333
# 4 b b1 0.333
# 5 b b2 0.333
# 6 b b3 0.333
Update: #TimTeaFan suggested re-writing the code above using prop.table:
d %>% group_by(col1, col2) %>% summarise(n = n()) %>% mutate(freq = prop.table(n))
Update: Running this trick on the ff table given in the question's example, which has everything set up except the last mutate:
ff %>% mutate(rel_freq = prop.table(n))
# A tibble: 33 x 4
# Groups: carrier [16]
carrier approx_dist n rel_freq
<chr> <dbl> <int> <dbl>
1 9E 1000 15740 0.853
2 9E 2000 2720 0.147
3 AA 1000 9146 0.279
4 AA 2000 17210 0.526
5 AA 3000 6373 0.195
6 AS 3000 714 1
7 B6 1000 24613 0.450
8 B6 2000 22159 0.406
9 B6 3000 7863 0.144
10 DL 1000 20014 0.416
# … with 23 more rows
...or, since ff is still grouped by carrier after the summarise() (so sum(n) is the per-carrier total):
ff %>% mutate(rel_freq = n / sum(n))
Fake data for demonstration:
library(dplyr)
df <- data.frame(stringsAsFactors = FALSE,
                 col1 = rep(c("A", "B"), each = 9),
                 col2 = rep(1:3),
                 value = 1:18)
#> df
# col1 col2 value
#1 A 1 1
#2 A 2 2
#3 A 3 3
#4 A 1 4
#5 A 2 5
#6 A 3 6
#7 A 1 7
#8 A 2 8
#9 A 3 9
#10 B 1 10
#11 B 2 11
#12 B 3 12
#13 B 1 13
#14 B 2 14
#15 B 3 15
#16 B 1 16
#17 B 2 17
#18 B 3 18
Solution
df %>%
  group_by(col1, col2) %>%
  summarise(col2_ttl = sum(value)) %>%               # count is boring for this data, but you
  mutate(share_of_col1 = col2_ttl / sum(col2_ttl))   # ... could use n() for that
## A tibble: 6 x 4
## Groups: col1 [2]
# col1 col2 col2_ttl share_of_col1
# <chr> <int> <int> <dbl>
#1 A 1 12 0.267
#2 A 2 15 0.333
#3 A 3 18 0.4
#4 B 1 39 0.310
#5 B 2 42 0.333
#6 B 3 45 0.357
First we group by both columns. In this case the ordering makes a difference, because the groups are created hierarchically, and each summary we run summarises the last layer of grouping. So the summarise line (or summarize; dplyr was written with UK spelling but provides US spelling aliases) sums up the values in each col1-col2 combination, leaving a residual grouping by col1 which we can use in the next line. (Try putting a # right after sum(value)) in the code above, so the pipe stops there, to see what is produced at that stage.)
In the last line, col2_ttl is divided by the sum of all the col2_ttl values in its group, i.e. the total across each col1.
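To see the residual grouping described above, stop the pipe after the summarise() step; group_vars() reports the grouping that remains (a quick sketch):
library(dplyr)
df %>%
  group_by(col1, col2) %>%
  summarise(col2_ttl = sum(value), .groups = "drop_last") %>%  # the default, spelled out
  group_vars()
#> [1] "col1"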
