I have this data:
df <- data.frame(
node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz")
)
I want to count grouped frequencies and proportions and slice/filter out a sequence of grouped rows. Say, given the small dataset, I want to have the rows with the two highest Freq_left values per group. How can that be done? I can only extract the rows with the maximum Freq_left values but not the desired sequence of rows:
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(Freq_left)
# A tibble: 2 × 4
# Groups: node [2]
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
1 A ab 4 30.8
2 B xx 3 23.1
Expected output:
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
A ab 4 30.8
A cc 2 15.4
B xx 3 23.1
B zz 2 15.4
You could use dplyr::slice_max (thanks to @PaulSmith for pointing out that dplyr::top_n is superseded in favor of dplyr::slice_max):
library(dplyr)
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(order_by = Prop_left, n = 2)
#> `summarise()` has grouped output by 'node'. You can override using the `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups: node [2]
#> node left Freq_left Prop_left
#> <chr> <chr> <int> <dbl>
#> 1 A ab 4 30.8
#> 2 A cc 2 15.4
#> 3 B xx 3 23.1
#> 4 B zz 2 15.4
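By default slice_max keeps ties (with_ties = TRUE), so a group can return more than n rows when several rows share a value. The following is a minimal sketch, assuming exactly two rows per node are wanted. It swaps group_by() plus summarise() for count(), which is a shortcut chosen for illustration and not the code from the answer above; the name top2 is hypothetical.

```r
library(dplyr)

df <- data.frame(
  node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
  left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz")
)

top2 <- df %>%
  # count() returns one ungrouped row per node/left combination
  count(node, left, name = "Freq_left") %>%
  # share of all rows in df, in percent
  mutate(Prop_left = round(Freq_left / sum(Freq_left) * 100, 4)) %>%
  group_by(node) %>%
  # with_ties = FALSE caps the result at n rows per group
  slice_max(order_by = Freq_left, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```

Because count() returns an ungrouped result, sum(Freq_left) is the total number of rows, so no reference to nrow(.) is needed.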
I have a dataframe like this:
library(tibble)
df <- tribble(~First, ~Last, ~Reviewer, ~Assessment, ~Amount,
"a", "b", "c", "Yes", 10,
"a", "b", "d", "No", 8,
"e", "f", "c", "No", 7,
"e", "f", "e", "Yes", 6)
df
#> # A tibble: 4 × 5
#> First Last Reviewer Assessment Amount
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 a b c Yes 10
#> 2 a b d No 8
#> 3 e f c No 7
#> 4 e f e Yes 6
I want to use pivot_wider to convert df to a dataframe like this:
tribble(~First, ~Last, ~Reviewer_1, ~Assessment_1, ~Amount_1, ~Reviewer_2, ~Assessment_2, ~Amount_2,
"a", "b", "c", "Yes", 10, "d", "No", 8,
"e", "f", "c", "No", 7, "e", "Yes", 6)
#> # A tibble: 2 × 8
#> First Last Reviewer_1 Assessment_1 Amount_1 Reviewer_2 Assessment_2 Amount_2
#> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 a b c Yes 10 d No 8
#> 2 e f c No 7 e Yes 6
Is there a way to do this with the pivot_wider function? Note that the reviewer ID numbers in the second table are not included in the first table.
library(dplyr)
library(tidyr)
df %>%
group_by(First, Last) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider(
id_cols = c(First, Last), names_from = rn,
values_from = c(Reviewer, Assessment, Amount))
# # A tibble: 2 × 8
# First Last Reviewer_1 Reviewer_2 Assessment_1 Assessment_2 Amount_1 Amount_2
# <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 a b c d Yes No 10 8
# 2 e f c e No Yes 7 6
(order of columns notwithstanding)
Here is how you can do it:
df %>%
group_by(First, Last) %>%
mutate(Review_no = rank(Reviewer)) %>%
pivot_wider(names_from = Review_no,
values_from = c(Reviewer, Assessment, Amount))
output:
# A tibble: 2 x 8
# Groups: First, Last [2]
First Last Reviewer_1 Reviewer_2 Assessment_1 Assessment_2 Amount_1 Amount_2
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 a b c d Yes No 10 8
2 e f c e No Yes 7 6
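If the column order from the question matters (all columns for review 1, then all columns for review 2), pivot_wider has a names_vary argument for this. A minimal sketch, assuming tidyr 1.2.0 or later, where names_vary is available; the names rn and wide are only illustrative:

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(~First, ~Last, ~Reviewer, ~Assessment, ~Amount,
              "a", "b", "c", "Yes", 10,
              "a", "b", "d", "No", 8,
              "e", "f", "c", "No", 7,
              "e", "f", "e", "Yes", 6)

wide <- df %>%
  group_by(First, Last) %>%
  # number the reviews within each First/Last pair
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = c(First, Last),
    names_from = rn,
    values_from = c(Reviewer, Assessment, Amount),
    # "slowest" keeps the columns of each rn value next to each other
    names_vary = "slowest"
  )

names(wide)
```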
What I would like to do is a bit difficult to explain, but the code would look something like this:
df_merged <- merge(df1, df2,
by.x = c("City", "District"),
by.y = c("City", "District" | "Area"),
all.x = TRUE)
Here "|", in the code above, would mean "OR".
The basic point is that I would like to merge the two frames by two columns. "City" matches for both data frames. However, I also need there to be a match based on "District".
The problem is that, due to human error while the dataset was originally made, in df2 some values for "District" were put in the "Area" column. Hence, ideally, if we have "District" being "A" in df1, then the merge occurs if "A" is found in either the "District" or "Area" column from df2.
Here is an example:
df1 <- data.frame(City = c("A", "B"), District = c("cc", "dd"))
df2 <- data.frame(City = c("A", "A", "B", "B"), Code = c("1a","2a","3a","4a"), District = c("cc", "Apple", "Pear", "Orange"), Area = c("e", "a", "dd", "f"))
df3 <- data.frame(City = c("A", "B"), District = c("cc","dd"), Code = c("1a", "3a"))
Here df3 is what I am aiming for! As you can see in df2, there is something messed up and the values for district got into the wrong column. In my original dataset, it is difficult to clean up this error.
> df1
City District
1 A cc
2 B dd
> df2
City Code District Area
1 A 1a cc e
2 A 2a Apple a
3 B 3a Pear dd
4 B 4a Orange f
> df3
City District Code
1 A cc 1a
2 B dd 3a
Here are a couple of options. One idea is that district might be NA if it is actually in Area. In this case you could coalesce the NA and join on District. Alternatively, you could map out rows in df2 that match rows in df1 and then expand df1 to accommodate those rows.
library(tidyverse)
df1 <- tibble(City = c(rep("Chicago", 4), rep("Tucson", 3)),
District = c("A", "A", "A", "B", "A", "B", "C"))
df2 <- tibble(City = c("Chicago", "Chicago", "Tucson", "Tucson"),
District = c("A", "B", NA, "C"),
Area = c("10", "30", "A", "20"),
value = c(1:4))
#option 1
left_join(df1,
df2 |>
mutate(District = coalesce(District, Area)),
by = c("City", "District" ))
#> # A tibble: 7 x 4
#> City District Area value
#> <chr> <chr> <chr> <int>
#> 1 Chicago A 10 1
#> 2 Chicago A 10 1
#> 3 Chicago A 10 1
#> 4 Chicago B 30 2
#> 5 Tucson A A 3
#> 6 Tucson B <NA> NA
#> 7 Tucson C 20 4
#option 2
df1 |>
mutate(matches = map2(City, District,
~filter(df2, City == .x & (District == .y | Area == .y))|>
select(-City, -District))) |>
unnest_wider(matches)
#> # A tibble: 7 x 4
#> City District Area value
#> <chr> <chr> <chr> <int>
#> 1 Chicago A 10 1
#> 2 Chicago A 10 1
#> 3 Chicago A 10 1
#> 4 Chicago B 30 2
#> 5 Tucson A A 3
#> 6 Tucson B <NA> NA
#> 7 Tucson C 20 4
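For the data in the question, where the misplaced values are not NA, another possibility is to stack District and Area from df2 into a single key column and join on that. This is a minimal sketch under that assumption, not one of the two options above; the names lookup, source and key are hypothetical.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(City = c("A", "B"), District = c("cc", "dd"))
df2 <- data.frame(
  City = c("A", "A", "B", "B"),
  Code = c("1a", "2a", "3a", "4a"),
  District = c("cc", "Apple", "Pear", "Orange"),
  Area = c("e", "a", "dd", "f")
)

lookup <- df2 %>%
  # one row per (City, Code, candidate key), whether the key came
  # from District or from Area
  pivot_longer(c(District, Area), names_to = "source", values_to = "key") %>%
  # drop duplicates in case both columns hold the same value
  distinct(City, key, Code)

# match df1$District against the stacked key column
df_merged <- left_join(df1, lookup, by = c("City", "District" = "key"))
df_merged
```

If a District value in df1 matches more than one row of df2, the join returns one row per match, so the result may need deduplicating.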
I have panel data and I would like to get the percentage of observations in a column (Size) that are below 1 million.
My data is the following:
structure(list(Product = c("A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"), Date = c("02.05.2018",
"04.05.2018", "05.05.2018", "06.05.2018", "07.05.2018", "08.05.2018",
"02.05.2018", "04.05.2018", "05.05.2018", "06.05.2018", "07.05.2018",
"08.05.2018", "02.05.2018", "04.05.2018", "05.05.2018", "06.05.2018",
"07.05.2018", "08.05.2018"), Size = c(100023423, 1920, 2434324342,
2342353566, 345345345, 432, 1.35135e+11, 312332, 23434, 4622436246,
3252243, 234525, 57457457, 56848648, 36363546, 36535636, 2345,
2.52646e+11)), class = "data.frame", row.names = c(NA, -18L))
So for instance, for Product A it would be 33.33% since two out of 6 observations are below one million.
I have tried the following in R
df <- df %>%
group_by(Product) %>%
dplyr:: summarise(CountDate = n(), SmallSize = count(Size<1000000))
However, I get the error "no applicable method for 'count' applied to an object of class "logical"", even though the column Size is of type double.
After the code above I would then calculate SmallSize/CountDate to get the percentage.
What do I need to adjust to not get the error message?
Instead of count, which requires a data.frame/tibble, use sum on a logical vector to get the count: TRUE values are counted as 1 and FALSE as 0.
library(dplyr)
df %>%
group_by(Product) %>%
dplyr::summarise(CountDate = n(),
SmallSize = sum(Size<1000000, na.rm = TRUE), .groups = "drop") %>%
dplyr::mutate(Percent = SmallSize/CountDate)
# A tibble: 3 × 4
Product CountDate SmallSize Percent
<chr> <int> <int> <dbl>
1 A 6 2 0.333
2 B 6 3 0.5
3 C 6 1 0.167
Also, we don't need to create both columns; the proportion can be calculated directly with mean:
df %>%
group_by(Product) %>%
dplyr::summarise(Percent = mean(Size < 1000000, na.rm = TRUE))
# A tibble: 3 × 2
Product Percent
<chr> <dbl>
1 A 0.333
2 B 0.5
3 C 0.167
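The results above are proportions between 0 and 1. To get the percentage form used in the question (33.33 for Product A), the mean can be scaled by 100. A minimal sketch using the Size values from the question; the Date column is left out because it is not needed here, and the name pct is hypothetical:

```r
library(dplyr)

df <- data.frame(
  Product = rep(c("A", "B", "C"), each = 6),
  Size = c(100023423, 1920, 2434324342, 2342353566, 345345345, 432,
           1.35135e+11, 312332, 23434, 4622436246, 3252243, 234525,
           57457457, 56848648, 36363546, 36535636, 2345, 2.52646e+11)
)

pct <- df %>%
  group_by(Product) %>%
  # mean() of a logical vector is the share of TRUE values
  summarise(Percent = round(100 * mean(Size < 1e6), 2), .groups = "drop")

pct
```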
I have a tibble which resembles the following:
data<-tibble(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
ref type
1 ABC A
2 ABC B
3 XYZ A
4 XYZ A
5 FGH A
6 FGH A
7 FGH B
I need to group by ref and if--within a group--type B is present, return that row, else default to return any row (but only 1 row) of type A.
Expected output:
ref type
1 ABC B
2 XYZ A
3 FGH B
With large amounts of data, it is better to sort before grouping.
tidyverse
library(tidyverse)
df<-tibble(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
distinct(df) %>%
arrange(ref, desc(type)) %>%
group_by(ref) %>%
slice_head(n = 1) %>%
ungroup()
#> # A tibble: 3 × 2
#> ref type
#> <chr> <chr>
#> 1 ABC B
#> 2 FGH B
#> 3 XYZ A
Created on 2022-04-27 by the reprex package (v2.0.1)
data.table
df<-data.frame(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
library(data.table)
setDT(df)[order(ref, -type), .SD[1], by = ref]
#> ref type
#> 1: ABC B
#> 2: FGH B
#> 3: XYZ A
Created on 2022-04-27 by the reprex package (v2.0.1)
If you only have A and B, then you can arrange and simply get the first row, i.e.
library(dplyr)
data %>%
group_by(ref) %>%
filter(type %in% c('A', 'B')) %>% #If other types exist
arrange(desc(type)) %>%
slice(1L)
# A tibble: 3 x 2
# Groups: ref [3]
ref type
<chr> <chr>
1 ABC B
2 FGH B
3 XYZ A
We can use which.max over boolean to extract the desired rows
data %>%
group_by(ref) %>%
slice(which.max(type == "B")) %>%
ungroup()
which gives
# A tibble: 3 x 2
ref type
<chr> <chr>
1 ABC B
2 FGH B
3 XYZ A
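The reason this works: which.max returns the position of the first maximum, and in a logical vector TRUE counts as 1 and FALSE as 0. So it points at the first "B" if there is one, and falls back to position 1 when every value is FALSE. A small base R sketch (the variable names are only illustrative):

```r
# group that contains a "B": the first TRUE is at position 2
has_b <- which.max(c("A", "B") == "B")

# group without a "B": all values are FALSE, so the first position wins
no_b <- which.max(c("A", "A") == "B")

c(has_b, no_b)
```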
I would like to reassign records to a single group if they are duplicated. In the dataset below, I would like all 12-4 records to be assigned to either group A or group B, but not both. Is there a way to go about it?
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c("12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8")
)
# Attempts to tease out records for each group
dat %>% pivot_wider(names_from = group, values_from = assigned)
You can group by record and reassign all to the same group, chosen at random from the available groups:
dat %>%
group_by(assigned) %>%
mutate(group = nth(group, sample(n())[1])) %>%
ungroup()
#> # A tibble: 9 x 2
#> group assigned
#> <chr> <chr>
#> 1 A 12-1
#> 2 A 12-2
#> 3 A 12-3
#> 4 A 12-4
#> 5 A 12-4
#> 6 B 12-5
#> 7 B 12-6
#> 8 B 12-7
#> 9 B 12-8
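If the choice does not have to be random, a deterministic variant is to give every duplicate the group of its first occurrence. This is a minimal sketch under that assumption, not a replacement for the random version above; the name fixed is hypothetical.

```r
library(dplyr)

dat <- tibble(
  group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  assigned = c("12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
               "12-7", "12-8")
)

fixed <- dat %>%
  group_by(assigned) %>%
  # every row of a record takes the group of its first occurrence
  mutate(group = first(group)) %>%
  ungroup()

fixed
```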
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c(
"12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8"
)
)
dat %>%
select(-group) %>%
left_join(
dat %>%
left_join(dat %>% count(group)) %>%
# reassign to the smallest group
arrange(n) %>%
select(-n) %>%
distinct(assigned, .keep_all = TRUE)
)
#> Joining, by = "group"
#> Joining, by = "assigned"
#> # A tibble: 9 × 2
#> assigned group
#> <chr> <chr>
#> 1 12-1 A
#> 2 12-2 A
#> 3 12-3 A
#> 4 12-4 A
#> 5 12-4 A
#> 6 12-5 B
#> 7 12-6 B
#> 8 12-7 B
#> 9 12-8 B
Created on 2022-04-04 by the reprex package (v2.0.0)