Get top rows of a data.frame weighted by cumulated column values - r

Let's say we have a data.frame containing a column with numbers. Now I want to keep only the rows that make up the top 75% with respect to the numeric column.
Here's an example and a clumsy solution:
library(tidyverse)
d <- tribble(
  ~name, ~value,
  "A", 40,
  "B", 20,
  "C", 10,
  "D", 10,
  "E", 5,
  "F", 5,
  "G", 3,
  "H", 3,
  "I", 3,
  "J", 1,
)
d %>%
  arrange(desc(value)) %>%
  mutate(
    relative_value = value / sum(value),
    cum_relative_value = cumsum(relative_value)
  ) %>%
  filter(lag(cum_relative_value) <= 0.75 | is.na(lag(cum_relative_value)))
#> # A tibble: 4 x 4
#> name value relative_value cum_relative_value
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 40 0.4 0.4
#> 2 B 20 0.2 0.6
#> 3 C 10 0.1 0.7
#> 4 D 10 0.1 0.8
Created on 2021-04-30 by the reprex package (v1.0.0)
As you can see, I calculate the cumulative share of the values and filter on it. I have to use lag() so that the row which first surpasses the 0.75 bound is still included, and is.na() to keep the first row (where lag() is NA).
This really feels clumsy. I thought there should be a solution with slice_* or fct_lump_prop() but I can't figure it out.
So, is there a nice dplyr way to do this?

What about this?
d %>%
  arrange(-value) %>%
  filter(
    lag(cumsum(prop.table(value)), default = 0) <= 0.75
  )
which gives
# A tibble: 4 x 2
name value
<chr> <dbl>
1 A 40
2 B 20
3 C 10
4 D 10
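If this pattern comes up repeatedly, the same idea can be wrapped in a small helper. This is only a sketch; top_frac_rows() is a made-up name, not an existing dplyr verb:
library(dplyr)
# Hypothetical helper, not part of dplyr: keep the rows (sorted by
# descending col) up to and including the one whose cumulative share
# of col first crosses prop.
top_frac_rows <- function(df, col, prop = 0.75) {
  df %>%
    arrange(desc({{ col }})) %>%
    filter(lag(cumsum(prop.table({{ col }})), default = 0) <= prop)
}
top_frac_rows(d, value) # same four rows: A, B, C, D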

There seems to be no ready-made function for this, but the code can be simplified as follows. The inner cumsum(value)/sum(value) >= 0.75 flags every row at or beyond the 0.75 bound, and the outer cumsum() counts how many flagged rows have appeared so far, so keeping counts <= 1 retains everything up to and including the first row that crosses the bound:
d %>%
  arrange(desc(value)) %>%
  filter(cumsum(cumsum(value)/sum(value) >= 0.75) <= 1)
# # A tibble: 4 x 2
# name value
# <chr> <dbl>
# 1 A 40
# 2 B 20
# 3 C 10
# 4 D 10

While I think the proposed solutions are quite reasonable and sensible, I tried to figure out a way of doing this with fct_lump. However, I can't really justify setting the prop argument to 0.05 when you asked for the top 0.75 of cumulative frequencies, except that I added up the cumulative shares and realized that lumping all shares below 5 percent leads to the desired output:
library(dplyr)
library(forcats)
d %>%
  mutate(name = fct_lump(name, prop = 0.05, w = value)) %>%
  filter(name != "Other")
# A tibble: 4 x 2
name value
<fct> <dbl>
1 A 40
2 B 20
3 C 10
4 D 10
I again admit this is not a principled approach to the problem and I would happily delete this solution. I just wanted to show how it can be done with forcats package functions.
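For what it's worth, looking at the shares makes the choice of 0.05 less mysterious. Judging by the output above, fct_lump() keeps only levels whose weighted share strictly exceeds prop, so any prop from E's share of 0.05 up to (but not including) D's share of 0.10 keeps exactly A through D. A quick check:
d %>%
  mutate(share = value / sum(value)) %>%
  arrange(desc(share))
# shares: A 0.40, B 0.20, C 0.10, D 0.10, E 0.05, F 0.05, G/H/I 0.03, J 0.01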

Related

How to calculate difference of measures between groups considering the date

I'm struggling with how to calculate the difference between the first and last value, arranged by date, by group. Here is a toy example:
test1 = data.frame(
  my_groups = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A", "A", "A"),
  measure = c(10, 20, 5, 64, 2, 62, 2, 5, 4, 6, 7, 105),
  # distance = c(),
  time = as.Date(c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020",
                   "20-05-2020", "20-06-2021", "11-01-2021", "13-01-2021",
                   "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"),
                 format = "%d-%m-%Y")
)
# test1 %>% arrange(my_groups, time)
# my_groups measure time
# 1 A 5 2020-09-19
# 2 A 10 2020-09-20
# 3 A 20 2020-09-25
# 4 A 6 2021-01-15
# 5 A 7 2021-01-15
# 6 A 105 2021-01-19
# 7 B 64 2020-05-20
# 8 B 2 2020-05-20
# 9 B 62 2021-06-20
# 10 C 2 2021-01-11
# 11 C 5 2021-01-13
# 12 C 1 2021-01-13
#desired result
# my_groups diff
# 1 A 100 (105 - 5)
# 2 B 2 (64 - 62)
# 3 C 1 (1 - 2)
The expression in parentheses in the desired result is just to show where each diff came from.
Any hint on how I can do that?
Your sample data in data.frame does not match your console output, so results will be different.
Two methods, depending on a few factors.
Assuming that order is externally controlled,
test1 %>%
  group_by(my_groups) %>%
  slice(c(1, n())) %>%
  summarize(diff = diff(measure))
# # A tibble: 3 x 2
# my_groups diff
# <chr> <dbl>
# 1 A 95
# 2 B -2
# 3 C 2
or just
test1 %>%
  group_by(my_groups) %>%
  summarize(diff = measure[n()] - measure[1])
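dplyr's first() and last() helpers express the same thing slightly more readably (same assumption about row order):
test1 %>%
  group_by(my_groups) %>%
  summarize(diff = last(measure) - first(measure))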
The advantage of this is that it counters an issue with approach 2 below (ties in which.max): if you control the ordering yourself, you are guaranteed to use the first/last values you need.
NOTE that for this portion, I assume the row order in your sample data is meaningful, and that there is some way to guarantee that ordering. Given your latest comment, we can arrange before the summarization and get closer to your desired results with
test1 %>%
  arrange(time, -measure) %>% # the "external" sorting mentioned above, so we don't need which.min/.max
  group_by(my_groups) %>%
  summarize(diff = measure[n()] - measure[1])
# # A tibble: 3 x 2
# my_groups diff
# <chr> <dbl>
# 1 A 100
# 2 B -2
# 3 C 2
Without pre-sorting, we can use which.min and which.max. The problem with this is that when ties occur, it may not choose the one that you want.
test1 %>%
  group_by(my_groups) %>%
  summarize(diff = measure[which.max(time)] - measure[which.min(time)])
# # A tibble: 3 x 2
# my_groups diff
# <chr> <dbl>
# 1 A 100
# 2 B -2
# 3 C 3
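If you want deterministic tie-breaking without pre-sorting the whole frame, you can order inside summarize. This is a sketch assuming the same tie-break rule as the arrange(time, -measure) call above (larger measure first on equal dates); it reproduces that result (A 100, B -2, C 2):
test1 %>%
  group_by(my_groups) %>%
  summarize(diff = {
    m <- measure[order(time, -measure)] # earliest to latest; ties broken by larger measure first
    m[length(m)] - m[1]
  })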
test1 %>%
  dplyr::group_by(my_groups) %>%
  dplyr::mutate(
    first = measure[which.min(time)], # measure on the earliest date
    last = measure[which.max(time)]   # measure on the latest date
  ) %>%
  dplyr::select(-time, -measure) %>%
  dplyr::distinct() %>%
  dplyr::mutate(diff = last - first) %>%
  dplyr::select(-first, -last)
# gives A 100, B -2, C 3, with the same tie caveat as which.min/which.max above

How to merge by two columns aggregating one of them

I'm struggling with how to do a merge using two columns. I have one dataframe containing measures of how much palette was used on certain dates, and another dataframe containing the distance travelled by each car. I need to merge the two, and the join condition is the car together with the sum of that car's distances up to the date on which the palette measure was taken.
Here is a toy example:
#palette measure dataframe
measure = data.frame(
  car = c("A", "A", "A", "B"),
  data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021"),
  palette = c(5, 4, 3, 5)
)
#> measure
# car data1 palette
#1 A 20-09-2020 5
#2 A 15-10-2020 4
#3 A 13-05-2021 3
#4 B 20-10-2021 5
#the distance dataframe
dist_ = data.frame(
  car = c("A", "C", "B", "A", "A", "A"),
  data2 = c("20-09-2020", "14-05-2020", "20-10-2021", "10-01-2021", "11-01-2021", "13-01-2021"),
  distance = c(10, 20, 10, 5, 3, 8)
)
#> dist_
# car data2 distance
#1 A 20-09-2020 10
#2 C 14-05-2020 20
#3 B 20-10-2021 10
#4 A 10-01-2021 5
#5 A 11-01-2021 3
#6 A 13-01-2021 8
#for result I'd like something like
# car data1 palette distance
#1 A 20-09-2020 5 10
#2 A 15-10-2020 4 0
#3 A 13-05-2021 3 16
#4 B 20-10-2021 5 10
Note that the distances are summed until a date on which the palette is measured, so I can say the car has covered a distance of 16 km and its palette is 3 cm.
I thought I could use something like merge(x = measure, y = dist_, by.x = c("car", "data1"), by.y = c("car", "data2"), all.x = TRUE), but I don't know how to sum the distance values up to the date of the palette measure for a specific car.
Any hint on how I could do that?
Something like this would work:
library(tidyverse)
library(lubridate)
result <- left_join(measure, dist_, by = c("car")) %>%
  mutate(across(c("data1", "data2"), dmy)) %>%
  filter(data1 >= data2) %>%
  group_by(car, data2) %>%
  mutate(threshold = min(data1)) %>%
  ungroup() %>%
  filter(data1 == threshold) %>%
  group_by(car, data1, palette) %>%
  summarise(distance = sum(distance))
result
result
# A tibble: 3 x 4
# Groups: car, data1 [3]
car data1 palette distance
<chr> <date> <dbl> <dbl>
1 A 2020-09-20 5 10
2 A 2021-05-13 3 16
3 B 2021-10-20 5 10
If you want to keep the non-matches you could then rejoin with measure like so:
result.final <- measure %>%
  mutate(data1 = dmy(data1)) %>%
  left_join(result, by = c("car", "data1", "palette"))
result.final
car data1 palette distance
1 A 2020-09-20 5 10
2 A 2020-10-15 4 NA
3 A 2021-05-13 3 16
4 B 2021-10-20 5 10

How to keep only rows that have the highest value in a certain column in R

I have a dataframe that looks like this:
library(tidyverse)
df <- tribble(
  ~Species, ~North, ~South, ~East, ~West,
  "a", 4, 3, 2, 3,
  "b", 2, 3, 4, 5,
  "C", 2, 3, 3, 3,
  "D", 3, 2, 2, 2
)
I want to filter for species where the highest value is in a given column, e.g. North.
In this case, species a and D would be selected. The expected output would be a df with only species a and D in it.
I used a workaround like this:
df %>%
  group_by(Species) %>%
  mutate(rowmean = mean(c(North, South, East, West))) %>%
  filter(North > rowmean) %>%
  ungroup() %>%
  select(!rowmean)
which seems like a lot of code for such a simple task!
I can't, however, find a more code-friendly way to do this. Is there a (preferably tidyverse) way to perform this task in a cleaner way?
Kind regards
An easier approach is with max.col in base R. Select the columns that are numeric, get the column index of the max value in each row, check whether that equals 1, i.e. the first column (as we selected only from the 2nd column onwards), and subset the rows:
subset(df, max.col(df[-1], 'first') == 1)
# A tibble: 2 x 5
# Species North South East West
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 a 4 3 2 3
#2 D 3 2 2 2
If it is based on the rowwise mean
subset(df, North > rowMeans(df[-1]))
Or if we prefer to use dplyr
library(dplyr)
df %>%
  filter(max.col(cur_data()[-1], 'first') == 1)
Similarly, if it is based on the rowwise mean
df %>%
  filter(North > rowMeans(cur_data()[-1]))
# base
df[df$North > rowMeans(df[-1]), ]
# A tibble: 2 x 5
Species North South East West
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 4 3 2 3
2 D 3 2 2 2
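For completeness, a more explicit (if wordier) dplyr sketch uses rowwise() with c_across() (available since dplyr 1.0) to compare North against the row maximum directly:
df %>%
  rowwise() %>%
  filter(North == max(c_across(North:West))) %>% # keep rows where North is the row max
  ungroup()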

R dplyr: filter common values by group

I need to find common values between different groups, ideally using dplyr in R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
the expected output is
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not seem to work:
# Filter the data
dd %>%
  group_by(group) %>%
  filter(all(val)) # does not work
The example here solves a similar issue but has a predefined vector of shared values. What if I do not know which values are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
          "b", "b", "b",
          "c", "c")
val = c(1, 2, 3,
        3, 4, 5,
        1, 3)
dd <- data.frame(group, val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values have all the groups:
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb - n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed.) Whereas n_distinct(group) is using the grouped data piped in to filter, thus it gives the number of distinct groups for each value (because we group_by(val)).
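Pre-computing that count looks like this (a sketch of the variant just mentioned):
n_groups <- n_distinct(dd$group) # total number of groups, computed once
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups) %>%
  ungroup()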
A base R approach can be:
#Code
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
group val
3 a 3
4 b 3
8 c 3
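Unpacking that one-liner may help: split() yields one vector of values per group, and Reduce() folds intersect() over them:
split(dd$val, dd$group)
# $a: 1 2 3   $b: 3 4 5   $c: 1 3
common <- Reduce(intersect, split(dd$val, dd$group))
common
# [1] 3
dd[dd$val %in% common, ]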
A similar option in data.table as that of #GregorThomas solution is
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]

How to calculate overlap between different categories in R

I have read around the forum but I have not found my desired answer.
I have the following dataset:
(The dataset was shared as a screenshot in the original post.)
The important columns are TGEClass and peptide. I would like to calculate the overlap between the different TGEClass values. I used calculate.overlap(TGE) from VennDiagram, but that does not give me the desired result.
The R code with a dummy dataset:
# A simple single-set diagram
C1 <- as.data.frame(letters[1:10])
C2 <- as.data.frame(letters[1:10])
data = cbind(C1, C2)
overlap <- calculate.overlap(data)
overlap = as.data.frame(overlap)
The result:
a1 a2 a3
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
The desired result would look like this (shown as a table in the original post):
10 genes are expressed in both TGE classes
50 genes in only alternative
60 genes in only short
It is basically a Venn diagram, but in a table format.
Please note that each gene has a different number of TGE class categories.
I am very new to R so any help will be greatly appreciated.
Thanks very much,
Ishack
The output of VennDiagram::calculate.overlap() is not very convenient for later use (here, using as.data.frame you just got lucky, as both vectors are the same size).
You can actually use tidyverse to compute it yourself, and return the summary:
library(tidyverse)
list(
  "Cardiome" = letters[1:10],
  "SuperSet" = letters[8:24]
) %>%
  map2_dfr(., names(.), ~ tibble::enframe(.x) %>% mutate(group = .y)) %>%
  add_count(value) %>%
  group_by(value) %>%
  summarise(group2 = ifelse(n() == 2, "both", group)) %>%
  count(group2)
#> # A tibble: 3 x 2
#> group2 n
#> <chr> <int>
#> 1 both 3
#> 2 Cardiome 7
#> 3 SuperSet 14
If you want to stick with the output of VennDiagram::calculate.overlap(), you can use something like:
library(tidyverse)
overlap <- VennDiagram::calculate.overlap(
  x = list(
    "Cardiome" = letters[1:10],
    "SuperSet" = letters[8:24]
  )
)
# For two sets, calculate.overlap() returns the two full sets (a1, a2)
# plus their intersection (a3), so align rows on the values themselves
# (dropping enframe's per-vector index) before spreading:
map2_dfr(overlap, names(overlap), ~ tibble::enframe(.x) %>% mutate(group = .y)) %>%
  select(-name) %>%
  mutate(flag = group) %>%
  spread(group, flag) %>%
  mutate(a1_only = !is.na(a1) & is.na(a3),
         a2_only = !is.na(a2) & is.na(a3),
         both = !is.na(a3)) %>%
  summarise_at(c("a1_only", "a2_only", "both"), sum) %>%
  gather(group, number, everything())
#> # A tibble: 3 x 2
#>   group   number
#>   <chr>    <int>
#> 1 a1_only      7
#> 2 a2_only     14
#> 3 both         3
