How to merge by two columns aggregating one of them


I'm struggling to merge two data frames using two columns. One data frame contains palette measurements taken on certain dates. The other contains the distance travelled by each car. I need to merge them so that each palette measurement is matched, by car, with the sum of the distance that car travelled up to the date of the measurement.
Here is a toy example:
#palette measure dataframe
measure = data.frame(car = c("A", "A", "A", "B"), data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021"), palette = c(5,4,3,5))
#> measure
# car data1 palette
#1 A 20-09-2020 5
#2 A 15-10-2020 4
#3 A 13-05-2021 3
#4 B 20-10-2021 5
#the distance dataframe
dist_ = data.frame(car = c("A", "C", "B", "A", "A", "A"), data2 = c("20-09-2020", "14-05-2020", "20-10-2021", "10-01-2021", "11-01-2021", "13-01-2021"), distance = c(10, 20, 10, 5, 3,8))
#> dist_
# car data2 distance
#1 A 20-09-2020 10
#2 C 14-05-2020 20
#3 B 20-10-2021 10
#4 A 10-01-2021 5
#5 A 11-01-2021 3
#6 A 13-01-2021 8
#for result I'd like something like
# car data1 palette distance
#1 A 20-09-2020 5 10
#2 A 15-10-2020 4 0
#3 A 13-05-2021 3 16
#4 B 20-10-2021 5 10
Note that the distances are summed up to each date on which the palette is measured. So I can say that a car had covered a distance of 16 km when its palette measured 3 cm.
I thought I could use something like merge(x = measure, y = dist_, by.x = c("car", "data1"), by.y = c("car", "data2"), all.x = TRUE), but I don't know how to sum the distance values up to the date of the palette measurement for a specific car.
Any hint on how I could do that?

Something like this would work:
library(tidyverse)
library(lubridate)
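# pair every measurement with every distance record of the same car
result <- left_join(measure, dist_, by = c("car")) %>%
  mutate(across(c("data1", "data2"), dmy)) %>%  # parse dd-mm-yyyy strings as dates
  filter(data1 >= data2) %>%                    # keep distances recorded on or before the measurement
  group_by(car, data2) %>%
  mutate(threshold = min(data1)) %>%            # earliest measurement that covers each distance record
  ungroup() %>%
  filter(data1 == threshold) %>%
  group_by(car, data1, palette) %>%
  summarise(distance = sum(distance))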
result
# A tibble: 3 x 4
# Groups: car, data1 [3]
car data1 palette distance
<chr> <date> <dbl> <dbl>
1 A 2020-09-20 5 10
2 A 2021-05-13 3 16
3 B 2021-10-20 5 10
If you want to keep the non-matches you could then rejoin with measure like so:
result.final <- measure %>%
mutate(data1 = dmy(data1))%>%
left_join(result, by = c("car", "data1", "palette"))
result.final
car data1 palette distance
1 A 2020-09-20 5 10
2 A 2020-10-15 4 NA
3 A 2021-05-13 3 16
4 B 2021-10-20 5 10
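The desired output in the question has 0, not NA, for the measurement with no new distance. As a minimal base R sketch (not the dplyr approach above), assuming each distance record should count toward the first measurement of the same car taken on or after it, the sum can be computed per measurement row directly. Because sum() over an empty selection returns 0, the non-matching row gets 0 without a second join:

```r
measure <- data.frame(
  car = c("A", "A", "A", "B"),
  data1 = as.Date(c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021"),
                  format = "%d-%m-%Y"),
  palette = c(5, 4, 3, 5),
  stringsAsFactors = FALSE
)
dist_ <- data.frame(
  car = c("A", "C", "B", "A", "A", "A"),
  data2 = as.Date(c("20-09-2020", "14-05-2020", "20-10-2021",
                    "10-01-2021", "11-01-2021", "13-01-2021"),
                  format = "%d-%m-%Y"),
  distance = c(10, 20, 10, 5, 3, 8),
  stringsAsFactors = FALSE
)

measure$distance <- sapply(seq_len(nrow(measure)), function(i) {
  same_car <- dist_$car == measure$car[i]
  # dates of earlier measurements of the same car, if any
  earlier <- measure$data1[measure$car == measure$car[i] &
                             measure$data1 < measure$data1[i]]
  # only count distances recorded after the previous measurement
  after_prev <- if (length(earlier) == 0) {
    rep(TRUE, nrow(dist_))
  } else {
    dist_$data2 > max(earlier)
  }
  up_to_now <- dist_$data2 <= measure$data1[i]
  sum(dist_$distance[same_car & after_prev & up_to_now])
})

measure
```

This loops over the measurement rows, so it is meant for small tables like the toy example; the join-based pipeline is likely the better fit for larger data.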

Related

How to calculate difference of measures between groups considering the date

I'm struggling to calculate the difference between the first and last values, ordered by date, within each group. Here is a toy example:
test1 = data.frame(my_groups = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A", "A", "A"),
measure = c(10, 20, 5, 64, 2, 62 ,2, 5, 4, 6, 7, 105),
#distance = c(),
time= as.Date(c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-05-2020", "20-06-2021",
"11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"), format = "%d-%m-%Y"))
# test1 %>% arrange(my_groups, time)
# my_groups measure time
# 1 A 5 2020-09-19
# 2 A 10 2020-09-20
# 3 A 20 2020-09-25
# 4 A 6 2021-01-15
# 5 A 7 2021-01-15
# 6 A 105 2021-01-19
# 7 B 64 2020-05-20
# 8 B 2 2020-05-20
# 9 B 62 2021-06-20
# 10 C 2 2021-01-11
# 11 C 5 2021-01-13
# 12 C 1 2021-01-13
#desired result
# my_groups diff
# 1 A 100 (105 - 5)
# 2 B 2 (64 - 62)
# 3 C 1 (1 - 2)
The equation inside the brackets in desired result is just to show where the diff came from.
Any hint on how I can do that?

Your sample data in data.frame does not match your console output, so results will be different.
Two methods, depending on a few factors.
Assuming that order is externally controlled,
test1 %>%
group_by(my_groups) %>%
slice(c(1, n())) %>%
summarize(diff = diff(measure))
# # A tibble: 3 x 2
# my_groups diff
# <chr> <dbl>
# 1 A 95
# 2 B -2
# 3 C 2
or just
test1 %>%
group_by(my_groups) %>%
summarize(diff = measure[n()] - measure[1])
The advantage of this is that it counters an issue with approach 2 below (ties in which.max): if you control the ordering yourself, you are guaranteed to use the first/last values you need.
NOTE that for this portion, I assume that the order of data you gave us in your sample data is relevant. I'm assuming that there is some way to guarantee that your results are found. With your latest comment, we can arrange before the summarization and get closer to your desired results with
test1 %>%
arrange(time, -measure) %>% # this is the "external" sorting I mentioned, so we don't need which.min/.max
group_by(my_groups) %>%
summarize(diff = measure[n()] - measure[1])
# # A tibble: 3 x 2
# my_groups diff
# <chr> <dbl>
# 1 A 100
# 2 B -2
# 3 C 2
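For comparison, here is a minimal base R sketch of the same "sort first, then take last minus first" idea, using the question's data.frame definition (not its console printout). The tie-break on descending measure mirrors arrange(time, -measure) above and is an assumption about which tied row should count as first or last:

```r
test1 <- data.frame(
  my_groups = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A", "A", "A"),
  measure = c(10, 20, 5, 64, 2, 62, 2, 5, 4, 6, 7, 105),
  time = as.Date(c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020",
                   "20-05-2020", "20-06-2021", "11-01-2021", "13-01-2021",
                   "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"),
                 format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# sort by date, breaking ties by descending measure
sorted <- test1[order(test1$time, -test1$measure), ]

# last measure minus first measure within each group
diffs <- tapply(sorted$measure, sorted$my_groups,
                function(m) m[length(m)] - m[1])
diffs
```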
Without pre-sorting, we can use which.min and which.max. The problem with this is that when ties occur, it may not choose the one that you want.
test1 %>%
group_by(my_groups) %>%
summarize(diff = measure[which.max(time)] - measure[which.min(time)])
# # A tibble: 3 x 2
# my_groups diff
# <chr> <dbl>
# 1 A 100
# 2 B -2
# 3 C 3

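# Note: as written, this returns the difference between the first and last
# dates in each group (a difftime), not the difference between the measures.
test1 %>%
  dplyr::group_by(my_groups) %>%
  dplyr::mutate(first = min(time), last = max(time)) %>%
  dplyr::select(-time, -measure) %>%
  dplyr::distinct() %>%
  dplyr::mutate(diff = first - last) %>%
  dplyr::select(-first, -last)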

Get top rows of a data.frame weighted by cumulated column values

Let's say we have a data.frame containing a column with numbers. I want to keep only those rows which make up the top 75% of the total of that numeric column.
Here's an example and a clumsy solution:
library(tidyverse)
d <- tribble(
~name, ~value,
"A", 40,
"B", 20,
"C", 10,
"D", 10,
"E", 5,
"F", 5,
"G", 3,
"H", 3,
"I", 3,
"J", 1,
)
d %>%
arrange(desc(value)) %>%
mutate(
relative_value = value / sum(value),
cum_relative_value = cumsum(relative_value)
) %>%
filter(lag(cum_relative_value) <= 0.75 | is.na(lag(cum_relative_value)))
#> # A tibble: 4 x 4
#> name value relative_value cum_relative_value
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 40 0.4 0.4
#> 2 B 20 0.2 0.6
#> 3 C 10 0.1 0.7
#> 4 D 10 0.1 0.8
Created on 2021-04-30 by the reprex package (v1.0.0)
As you can see I calculate the percentage of the cumulated values and filter with respect to this value. I have to use lag() to get the row which surpasses the 0.75 bound and is.na() to get the first row.
This really feels clumsy. I thought there should be a solution with slice_* or fct_lump_prop() but I can't figure it out.
So is there any nice "dplyr"-way?

What about this?
d %>%
arrange(-value) %>%
filter(
lag(cumsum(prop.table(value)), default = 0) <= 0.75
)
which gives
# A tibble: 4 x 2
name value
<chr> <dbl>
1 A 40
2 B 20
3 C 10
4 D 10
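The same lag-and-filter logic can be sketched in base R, which makes the two steps explicit: compute the cumulative share, then keep a row while the share accumulated before it is still at or below the bound. This is only an illustration of the idea, not a dedicated function for it:

```r
d <- data.frame(
  name = LETTERS[1:10],
  value = c(40, 20, 10, 10, 5, 5, 3, 3, 3, 1),
  stringsAsFactors = FALSE
)

# sort by descending value
d <- d[order(-d$value), ]

# cumulative share of the total after each row
cum_share <- cumsum(d$value) / sum(d$value)

# share accumulated before each row (0 for the first row)
prev_share <- c(0, head(cum_share, -1))

# keep rows up to and including the one that crosses 75%
top <- d[prev_share <= 0.75, ]
top
```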

There seems to be no such function. But the code could be simplified as follows:
d %>%
arrange(desc(value)) %>%
filter(cumsum(cumsum(value)/sum(value) >= 0.75) <= 1)
# # A tibble: 4 x 2
# name value
# <chr> <dbl>
# 1 A 40
# 2 B 20
# 3 C 10
# 4 D 10

While I think the proposed solutions are quite reasonable and sensible, I've tried to figure out a way of doing this with fct_lump. However, I can't justify why I set the prop argument to 0.05 considering you wanted the top 0.75 cumulative frequencies, except for the fact that I added up the cumulative sums of values and realized all frequencies fewer than 5 percent will lead to the desired output:
library(dplyr)
library(forcats)
d %>%
mutate(name = fct_lump(name, prop = 0.05, w = value)) %>%
filter(name != "Other")
# A tibble: 4 x 2
name value
<fct> <dbl>
1 A 40
2 B 20
3 C 10
4 D 10
I again admit this is not a sensible approach to the problem and I would happily delete this solution. I just wanted to show how it can be done with forcats package functions.

Find minimum and maximum value for each group based on another variable in R [duplicate]

I would like to know how I can find the minimum and maximum day of year (DoY) based on water temperature (Wtemp) for each site (siteID).
Example Dataset:
df1 <- data.frame(matrix(ncol = 4, nrow = 20))
x <- c("siteID", "Date", "DoY", "Wtemp")
colnames(df1) <- x
df1$siteID <- c(101,101,101,101,101,
102,102,102,102,102,
103,103,103,103,103,
104,104,104,104,104)
df1$Date <- rep(seq(from = as.Date("2020-01-01"), to = as.Date("2020-01-05"), by = 1),4)
df1$DoY <- rep(seq(from = 1, to = 5, by = 1),4)
df1$Wtemp <- c(10,2,6,12,15,
20,15,5,10,16,
2,4,6,8,10,
12,14,16,18,20)
The output should look like this:
siteID DoY_MaxWtemp DoY_MinWtemp
1 101 5 2
2 102 1 3
3 103 5 1
4 104 5 1

We can group by 'siteID', get the indices of the max and min values of 'Wtemp' with which.max and which.min respectively, and use them to extract the corresponding values of 'DoY' in summarise:
library(dplyr)
df1 %>%
group_by(siteID) %>%
summarise(Doy_MaxWtemp = DoY[which.max(Wtemp)],
Doy_MinWtemp = DoY[which.min(Wtemp)], .groups = 'drop')
Output:
# A tibble: 4 x 3
# siteID Doy_MaxWtemp Doy_MinWtemp
#* <dbl> <dbl> <dbl>
#1 101 5 2
#2 102 1 3
#3 103 5 1
#4 104 5 1
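If dplyr is not available, the same which.max / which.min lookup can be sketched in base R by splitting the data per site. The data frame below is a compact rebuild of the question's df1 without the Date column, which is not needed for the lookup:

```r
df1 <- data.frame(
  siteID = rep(101:104, each = 5),
  DoY = rep(1:5, times = 4),
  Wtemp = c(10, 2, 6, 12, 15,
            20, 15, 5, 10, 16,
            2, 4, 6, 8, 10,
            12, 14, 16, 18, 20)
)

# one small data frame per site, then bind them together
res <- do.call(rbind, lapply(split(df1, df1$siteID), function(g) {
  data.frame(
    siteID = g$siteID[1],
    DoY_MaxWtemp = g$DoY[which.max(g$Wtemp)],  # day of the warmest reading
    DoY_MinWtemp = g$DoY[which.min(g$Wtemp)]   # day of the coldest reading
  )
}))
res
```

As with the dplyr version, which.max and which.min return the first position when there are ties, so a site with two equally warm days would report the earlier one.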

use assign / create new object with value (dplyr)

I want to create a new variable holding the number (index) of a column in my data frame, looked up by the column's name.
I have managed to do this:
firstCol <- which(colnames(Mydf) == "Cars")
It finds the index of the column named "Cars" and assigns it to the object firstCol. This works well in base R.
Lately I've been using dplyr and pipes, and I'm trying to do the same thing with pipes (%>%), but I haven't managed to.
Can you help me?
thanks,
Ido

The dplyr way to do this is select.
Here is an example using some made up data:
library(dplyr)
df <- data.frame(cars = sample(LETTERS, 100, replace = TRUE),
                 mpg = runif(100, 15, 45),
                 color = sample(c("green", "red", "blue", "silver"),
                                100, replace = TRUE)) %>% as_tibble()
df %>% select(cars)
# A tibble: 100 x 1
cars
<chr>
1 R
2 V
3 I
4 Q
5 P
6 D
7 J
8 Q
9 R
10 A
# ... with 90 more rows
You can also remove columns with select(-col_name)
df %>% select(-mpg)
# A tibble: 100 x 2
cars color
<chr> <chr>
1 R blue
2 V silver
3 I red
4 Q green
5 P silver
6 D silver
7 J green
8 Q blue
9 R red
10 A silver
# ... with 90 more rows
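select() returns the column itself rather than its position. If the column number is what is needed, as in the question, a minimal sketch is to look the name up among the column names. Mydf below is a hypothetical stand-in for the asker's data, and the piped variant assumes R 4.1 or later for the native pipe:

```r
# Hypothetical data frame; only the column names matter here
Mydf <- data.frame(
  Id = 1:3,
  Cars = c("Audi", "Fiat", "Kia"),
  Color = c("red", "blue", "green")
)

# Base R without a pipe: position of "Cars" among the column names
firstCol <- match("Cars", names(Mydf))

# Same lookup with the native pipe; the piped names fill match()'s
# second argument (table) because x is given by name
firstColPiped <- Mydf |> names() |> match(x = "Cars")

firstCol
firstColPiped
```

With magrittr's %>% the same pattern should work unchanged (Mydf %>% names() %>% match(x = "Cars")), since it also places the left-hand side in the first free argument position.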

How to calculate overlap between different categories in R

I have read around the forum but I have not found my desired answer.
I have the following dataset: [image: Dataset]
The important columns are TGEClass and peptide:
I would like to calculate the overlap between the different TGEclasses
I used calculate.overlap(TGE) from VennDiagram but that does not give me the desired result;
The R code with a dummy dataset:
library(VennDiagram)
# Dummy data: two identical sets of letters
C1 <- as.data.frame(letters[1:10])
C2 <- as.data.frame(letters[1:10])
data <- cbind(C1, C2)
overlap <- calculate.overlap(data)
overlap <- as.data.frame(overlap)
The result:
a1 a2 a3
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
The desired result would look like this:
TGEClass
10 genes are expressed in both TGE classes
50 genes in only alternative
60 genes in only short
It is basically a Venn diagram, but in table format.
Please note that each gene has a different number of TGE class categories.
I am very new to R, so any help will be greatly appreciated.
Thanks very much,
Ishack

The output of VennDiagram::calculate.overlap() is not very convenient for later use (here using as.data.frame you just got lucky as both vectors are of same size).
You can actually use tidyverse to compute it yourself, and return the summary:
library(tidyverse)
list(
  "Cardiome" = letters[1:10],
  "SuperSet" = letters[8:24]
) %>%
  map2_dfr(., names(.), ~tibble::enframe(.x) %>% mutate(group = .y)) %>%
  add_count(value) %>%
  group_by(value) %>%
  summarise(group2 = ifelse(n() == 2, "both", group)) %>%
  count(group2)
#> # A tibble: 3 x 2
#> group2 n
#> <chr> <int>
#> 1 both 3
#> 2 Cardiome 7
#> 3 SuperSet 14
If you want to stick with the output of VennDiagram::calculate.overlap(), you can use something like:
library(tidyverse)
overlap <- VennDiagram::calculate.overlap(
  x = list(
    "Cardiome" = letters[1:10],
    "SuperSet" = letters[8:24]
  )
)
map2_dfr(overlap, names(overlap), ~tibble::enframe(.x) %>% mutate(group = .y)) %>%
  spread(group, group) %>%
  mutate(a1_only = !is.na(a1) & is.na(a2),
         a2_only = !is.na(a2) & is.na(a1),
         both = !is.na(a2) & !is.na(a1)) %>%
  summarise_at(c("a1_only", "a2_only", "both"), sum) %>%
  gather(group, number, everything())
#> # A tibble: 3 x 2
#> group number
#> <chr> <int>
#> 1 a1_only 10
#> 2 a2_only 17
#> 3 both 0
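As a cross-check that needs no packages, the overlap table can also be sketched with base R set operations on a long-format data frame with TGEClass and peptide columns, which is the shape described in the question. The data below is hypothetical: it reuses the two letter sets from this answer under the class names "alternative" and "short", and it assumes exactly two classes:

```r
# Hypothetical long-format data: one row per (class, peptide) pair
tge <- data.frame(
  TGEClass = c(rep("alternative", 10), rep("short", 17)),
  peptide = c(letters[1:10], letters[8:24]),
  stringsAsFactors = FALSE
)

# unique peptides per class
sets <- lapply(split(tge$peptide, tge$TGEClass), unique)
a <- sets[["alternative"]]
b <- sets[["short"]]

# Venn-diagram regions as a table
overlap_table <- data.frame(
  category = c("both", "alternative only", "short only"),
  n = c(length(intersect(a, b)),
        length(setdiff(a, b)),
        length(setdiff(b, a)))
)
overlap_table
```

On these sets the counts are 3, 7 and 14, which agrees with the first tidyverse result above.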
