I'm trying to calculate the total length covered by overlapping line segments along a single line. With line A, the segments are disjoint, so it's pretty simple to calculate. However, with lines B and C, the segments overlap each other, so it's more complicated. I would need to somehow exclude the parts of earlier segments that are already part of the total sum.
data = read.table(text="
line left_line right_line small_line left_small_line right_small_line
A 100 120 101 91 111
A 100 120 129 119 139
B 70 90 63 53 73
B 70 90 70 60 80
B 70 90 75 65 85
C 20 40 11 1 21
C 20 40 34 24 44
C 20 40 45 35 55", header=TRUE)
This should be the expected result.
result = read.table(text="
total_overlapping
A 0.6
B 0.75
C 0.85", header=TRUE)
EDIT: Added a picture to better illustrate what I'm trying to figure out. There are 3 different pictures of lines (solid red line), with line segments (the dashed lines) overlapping them. The goal is to figure out how much of each solid line is covered by the dashed lines.
[Figures: Line A, Line B, Line C]
If I understand correctly, the small_line variable is irrelevant here. The rest of the columns can be used to get the sum of overlapping segments:
Step 1. Get the start & end point for each segment's overlap with the corresponding line:
library(dplyr)
data1 <- data %>%
rowwise() %>%
mutate(overlap.start = max(left_line, left_small_line),
overlap.end = min(right_line, right_small_line)) %>%
ungroup() %>%
select(line, overlap.start, overlap.end)
> data1
# A tibble: 8 x 3
line overlap.start overlap.end
<fct> <int> <int>
1 A 100 111
2 A 119 120
3 B 70 73
4 B 70 80
5 B 70 85
6 C 20 21
7 C 24 40
8 C 35 40
Step 2. Within the rows corresponding to each line, sort the overlaps in order. Consider an overlap a new section if it is the first overlap, OR if the previous overlap ends before it starts. Label each new overlapping section:
data2 <- data1 %>%
arrange(line, overlap.start, overlap.end) %>%
group_by(line) %>%
mutate(new.section = is.na(lag(overlap.end)) |
lag(overlap.end) <= overlap.start) %>%
mutate(section.number = cumsum(new.section)) %>%
ungroup()
> data2
# A tibble: 8 x 5
line overlap.start overlap.end new.section section.number
<fct> <int> <int> <lgl> <int>
1 A 100 111 TRUE 1
2 A 119 120 TRUE 2
3 B 70 73 TRUE 1
4 B 70 80 FALSE 1
5 B 70 85 FALSE 1
6 C 20 21 TRUE 1
7 C 24 40 TRUE 2
8 C 35 40 FALSE 2
Step 3. Within each overlapping section, take the earliest starting point & the latest ending point. Calculate the length of each overlap:
data3 <- data2 %>%
group_by(line, section.number) %>%
summarise(overlap.start = min(overlap.start),
overlap.end = max(overlap.end)) %>%
ungroup() %>%
mutate(overlap = overlap.end - overlap.start)
> data3
# A tibble: 5 x 5
line section.number overlap.start overlap.end overlap
<fct> <int> <dbl> <dbl> <dbl>
1 A 1 100 111 11
2 A 2 119 120 1
3 B 1 70 85 15
4 C 1 20 21 1
5 C 2 24 40 16
Step 4. Sum the length of overlaps for each line:
data4 <- data3 %>%
group_by(line) %>%
summarise(overlap = sum(overlap)) %>%
ungroup()
> data4
# A tibble: 3 x 2
line overlap
<fct> <dbl>
1 A 12
2 B 15
3 C 17
Now, your expected result shows the expected percentage of overlap on each line, rather than the sum. If that's what you are looking for, you can add the length for each line to data4, & calculate accordingly:
data5 <- data4 %>%
left_join(data %>%
select(line, left_line, right_line) %>%
unique() %>%
mutate(length = right_line - left_line) %>%
select(line, length),
by = "line") %>%
mutate(overlap.percentage = overlap / length)
> data5
# A tibble: 3 x 4
line overlap length overlap.percentage
<fct> <dbl> <int> <dbl>
1 A 12 20 0.6
2 B 15 20 0.75
3 C 17 20 0.85
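For comparison, the same calculation can be sketched in base R as one helper. This is only a sketch under stated assumptions: covered_length() is a hypothetical name that does not appear in the steps above, and the inputs are typed in by hand from the question's data. It clips every segment to the line, sorts the clipped segments, and merges them while tracking the furthest end seen so far. That last detail differs from Step 2, which compares each overlap only with the row immediately before it, so a short segment nested inside a longer earlier one could start a new section there.

```r
# Hypothetical helper (not from the answer above): length of the part of
# [lo, hi] that is covered by at least one segment [starts[i], ends[i]].
covered_length <- function(starts, ends, lo, hi) {
  # Clip every segment to the line, then drop segments that miss it entirely.
  s <- pmax(starts, lo)
  e <- pmin(ends, hi)
  keep <- e > s
  s <- s[keep]
  e <- e[keep]
  if (length(s) == 0) return(0)

  # Sort by start so overlapping segments become neighbours.
  ord <- order(s, e)
  s <- s[ord]
  e <- e[ord]

  total <- 0
  cur_start <- s[1]
  cur_end <- e[1]
  for (i in seq_along(s)[-1]) {
    if (s[i] <= cur_end) {
      # Overlaps (or touches) the running section: extend it if needed.
      cur_end <- max(cur_end, e[i])
    } else {
      # Gap found: close the running section and start a new one.
      total <- total + (cur_end - cur_start)
      cur_start <- s[i]
      cur_end <- e[i]
    }
  }
  total + (cur_end - cur_start)
}

# Line A from the question: line spans 100..120, two segments.
frac_A <- covered_length(c(91, 119), c(111, 139), 100, 120) / (120 - 100)
# Line B: line spans 70..90, three segments.
frac_B <- covered_length(c(53, 60, 65), c(73, 80, 85), 70, 90) / (90 - 70)
# Line C: line spans 20..40, three segments.
frac_C <- covered_length(c(1, 24, 35), c(21, 44, 55), 20, 40) / (40 - 20)
```

With the question's numbers, frac_A, frac_B and frac_C come out as 0.6, 0.75 and 0.85, matching the expected result.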
Let's say I have a dataframe of Name and Value. Is there any way to extract BOTH the minimum and maximum values within each Name in a single function?
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
Name Value
<chr> <int>
1 A 27
2 A 37
3 A 57
4 B 89
5 B 20
6 B 86
7 C 97
8 C 62
9 C 58
The output should contain TWO columns only (Name and Value).
Thanks in advance!
You can use range to get max and min value and use it in summarise to get different rows for each Name.
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
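One caveat: in dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe(). A minimal sketch, assuming dplyr >= 1.1.0; the Value column is typed in by hand from the question's printout so the result does not depend on the seed, and min_max is just an illustrative name:

```r
library(dplyr)

# Values copied from the question's printed data frame.
df <- tibble(
  Name  = rep(LETTERS[1:3], each = 3),
  Value = c(27L, 37L, 57L, 89L, 20L, 86L, 97L, 62L, 58L)
)

# reframe() allows any number of rows per group; range() returns c(min, max),
# so each Name contributes two rows: its minimum, then its maximum.
min_max <- df %>%
  reframe(Value = range(Value), .by = Name)
```

The result has the same two columns (Name and Value) and the same six rows as the summarise() version.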
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
df %>%
group_by(Name) %>%
summarise(
maximum = max(Value),
minimum = min(Value)
)
This outputs:
# A tibble: 3 × 3
Name maximum minimum
<chr> <int> <int>
1 A 68 1
2 B 87 34
3 C 82 14
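What's a little odd is that my original df object looks a little different from yours, in spite of the seed (possibly because R 3.6.0 changed the default sampling method used by sample(), so older and newer versions give different draws for the same seed):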
# A tibble: 9 × 2
Name Value
<chr> <int>
1 A 68
2 A 39
3 A 1
4 B 34
5 B 87
6 B 43
7 C 14
8 C 82
9 C 59
I'm currently using rbind() together with slice_min() and slice_max(), but I think it may not be the best way or the most efficient way when the dataframe contains millions of rows.
library(tidyverse)
rbind(df %>% group_by(Name) %>% slice_max(Value),
df %>% group_by(Name) %>% slice_min(Value)) %>%
arrange(Name)
# A tibble: 6 x 2
# Groups: Name [3]
Name Value
<chr> <int>
1 A 57
2 A 27
3 B 89
4 B 20
5 C 97
6 C 58
In base R, the output format can be created with tapply and stack: group with tapply to get a named list of ranges, stack it into a two-column data.frame, and change the column names if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
Name Value
1 A 27
2 A 57
3 B 20
4 B 89
5 C 58
6 C 97
Using aggregate.
aggregate(Value ~ Name, df, range)
# Name Value.1 Value.2
# 1 A 1 68
# 2 B 34 87
# 3 C 14 82
I'm currently reviewing R for Data Science, where I encountered this chunk of code.
My question about this code is as follows: I don't understand why the arrange function is needed here. Doesn't arrange just reorder the rows?
library(tidyverse)
library(nycflights13)
flights %>%
arrange(tailnum, year, month, day) %>%
group_by(tailnum) %>%
mutate(delay_gt1hr = dep_delay > 60) %>%
mutate(before_delay = cumsum(delay_gt1hr)) %>%
filter(before_delay < 1) %>%
count(sort = TRUE)
However, it does output differently with or without the arrange function, as shown below:
#with the arrange function
tailnum n
<chr> <int>
1 N954UW 206
2 N952UW 163
3 N957UW 142
4 N5FAAA 117
5 N38727 99
6 N3742C 98
7 N5EWAA 98
8 N705TW 97
9 N765US 97
10 N635JB 94
# ... with 3,745 more rows
and
#Without the arrange function
tailnum n
<chr> <int>
1 N952UW 215
2 N315NB 161
3 N705TW 160
4 N961UW 139
5 N713TW 128
6 N765US 122
7 N721TW 120
8 N5FAAA 117
9 N945UW 104
10 N19130 101
# ... with 3,774 more rows
I'd appreciate it if you can help me understand this. Why is it necessary to include the arrange function here?
Yes, arrange just orders the rows, but you are filtering on a cumulative sum after that, and a cumulative sum depends on row order, so the result changes.
Here is a simplified example to demonstrate how the output differs with and without arrange.
library(dplyr)
df <- data.frame(a = 1:5, b = c(7, 8, 9, 1, 2))
df %>% filter(cumsum(b) < 20)
# a b
#1 1 7
#2 2 8
df %>% arrange(b) %>% filter(cumsum(b) < 20)
# a b
#1 4 1
#2 5 2
#3 1 7
#4 2 8
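The same effect can be shown on a grouped example shaped like the flights pipeline. This is a sketch on made-up data: toy and count_before_first_long_delay() are hypothetical names, and the four flights are invented so that their stored order differs from their date order.

```r
library(dplyr)

# Hypothetical toy data: one plane, four flights stored out of date order.
toy <- data.frame(
  tailnum   = "N1",
  day       = c(2, 4, 1, 3),
  dep_delay = c(10, 0, 5, 90)
)

# Count the rows that come before the first delay of more than 60 minutes,
# in whatever order the rows currently have.
count_before_first_long_delay <- function(d) {
  d %>%
    group_by(tailnum) %>%
    mutate(before_delay = cumsum(dep_delay > 60)) %>%
    filter(before_delay < 1) %>%
    count()
}

# Stored order: the 90-minute delay is the last row, so 3 rows precede it.
unsorted <- count_before_first_long_delay(toy)

# Date order: the 90-minute delay happens on day 3, so only 2 flights precede it.
sorted <- count_before_first_long_delay(arrange(toy, tailnum, day))
```

Here unsorted$n is 3 and sorted$n is 2. Only the sorted version answers "how many flights before the first long delay", which is why the pipeline in the question arranges by date first.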
In my attempt to learn dplyr, I want to divide each row by another row, representing the corresponding group's total.
I generated test data with
library(dplyr)
# building test data
data("OrchardSprays")
totals <- OrchardSprays %>% group_by(treatment) %>%
summarise(decrease = sum(decrease))
totals$decrease <- totals$decrease + seq(10, 80, 10)
totals$rowpos = totals$colpos <- "total"
df <- rbind(OrchardSprays, totals)
Note the line totals$decrease <- totals$decrease + seq(10, 80, 10): for the sake of the question, I assumed there was an additional decrease for each treatment, which was not observed in the single lines of the data frame but only in the "total" lines for each group.
What I now want to do is add another column decrease_share to the data frame, where each line's decrease value is divided by the corresponding treatment group's total decrease value.
So, for head(df) I would expect an output like this
> head(df)
decrease rowpos colpos treatment treatment_decrease
1 57 1 1 D 0.178125
2 95 2 1 E 0.1711712
3 8 3 1 B 0.09876543
4 69 4 1 H 0.08603491
5 92 5 1 G 0.1488673
6 90 6 1 F 0.1470588
My real world example is a bit more complex (more group variables and also more levels), therefore I am looking for a suitable solution in dplyr.
Here's a total dplyr approach:
library(dplyr) #version >= 1.0.0
OrchardSprays %>%
group_by(treatment) %>%
summarise(decrease = sum(decrease)) %>%
mutate(decrease = decrease + seq(10, 80, 10),
rowpos = "total",
colpos = "total") %>%
bind_rows(mutate(OrchardSprays, across(rowpos:colpos, as.character))) %>%
group_by(treatment) %>%
mutate(treatment_decrease = decrease / decrease[rowpos == "total"])
# A tibble: 72 x 5
# Groups: treatment [8]
treatment decrease rowpos colpos treatment_decrease
<fct> <dbl> <chr> <chr> <dbl>
1 A 47 total total 1
2 B 81 total total 1
3 C 232 total total 1
4 D 320 total total 1
5 E 555 total total 1
6 F 612 total total 1
7 G 618 total total 1
8 H 802 total total 1
9 D 57 1 1 0.178
10 E 95 2 1 0.171
# … with 62 more rows
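If the "total" rows are not needed in the final data frame, an alternative is to keep the totals in a separate lookup table and join them on. A minimal sketch of that variation, not the approach above; totals, total_decrease and shares are illustrative names:

```r
library(dplyr)

data("OrchardSprays")

# One row per treatment: observed total plus the extra decrease from the question.
# summarise() returns treatments in factor-level order (A..H), so seq() lines up.
totals <- OrchardSprays %>%
  group_by(treatment) %>%
  summarise(total_decrease = sum(decrease)) %>%
  mutate(total_decrease = total_decrease + seq(10, 80, 10))

# Attach each row's group total, then divide.
shares <- OrchardSprays %>%
  left_join(totals, by = "treatment") %>%
  mutate(treatment_decrease = decrease / total_decrease)
```

This keeps the original 64 rows and leaves rowpos and colpos numeric, since no character "total" rows are bound in. With more grouping variables, the same pattern should work by listing them in both group_by() and by.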
I have a dataset like here:
library(data.table)
customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
time <- c(as.Date("2017-01-01","%Y-%m-%d"), as.Date("2017-02-01","%Y-%m-%d"), as.Date("2017-03-01","%Y-%m-%d"),
as.Date("2017-12-01","%Y-%m-%d"), as.Date("2018-01-01","%Y-%m-%d"), as.Date("2018-02-01","%Y-%m-%d"),
as.Date("2018-03-01","%Y-%m-%d"), as.Date("2018-04-01","%Y-%m-%d"), as.Date("2018-05-01","%Y-%m-%d"),
as.Date("2018-06-01","%Y-%m-%d"))
tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)
my_data <- data.table(customer_id,account_id,time,tenor,variable_x)
Now, I would like to create new variables "PD_Q1" up to "PD_Q20", each equal to the value of "variable_x" where "tenor" equals 1 up to 20, i.e., PD_Q1 equals variable_x's value where tenor = 1, PD_Q2 equals variable_x's value where tenor = 2, etc., and I would like to do that by customer_id and account_id. I have code for that, but only for PD_Q1. I would like to write a loop over i = 1:20 in which I change tenor == i (this part is easy) and refer to the columns PD_Qi, which is the part I can't work out. The code for one value of i is here:
my_data[tenor == 1, PD_Q1_temp := variable_x, by = c("customer_id", "account_id")]
list_accs <- my_data[tenor == 1, c("customer_id", "account_id", "PD_Q1_temp")]
list_accs <- unique(list_accs, by = c("customer_id", "account_id"))
names(list_accs) = c("customer_id", "account_id", "PD_Q1")
my_data = merge(x = my_data, y = list_accs, by = c("customer_id", "account_id"), all.x = TRUE)
my_data$PD_Q1_temp <- NULL
Now, can you please advise how to make a loop from 1 to 20, in which tenor, PD_Q1_temp and PD_Q1 would change? Specifically, I don't know how to refer to column names or variables using this i index within a loop.
The expected output for i = 1 and i = 2 (creating variables PD_Q1 and PD_Q2) is here:
> my_data
customer_id account_id time tenor variable_x PD_Q1 PD_Q2
1: 1 11 2017-01-01 1 87 87 90
2: 1 11 2017-02-01 2 90 87 90
3: 1 11 2017-03-01 3 100 87 90
4: 2 55 2017-12-01 1 120 120 130
5: 2 55 2018-01-01 2 130 120 130
6: 2 55 2018-02-01 3 150 120 130
7: 2 55 2018-03-01 4 12 120 130
8: 3 38 2018-04-01 1 13 13 15
9: 3 38 2018-05-01 2 15 13 15
10: 3 38 2018-06-01 3 14 13 15
Now I want to create PD_Q3, PD_Q4, etc. in a loop, using my code above that creates one such variable.
Can you show your expected output?
I think you can do what you want with tidyr::spread():
library(dplyr)
library(tidyr)
my_data %>%
as_tibble() %>%
select(-time) %>%
mutate(tenor = paste0("PD_Q", tenor)) %>%
spread(tenor, variable_x)
# # A tibble: 3 x 6
# customer_id account_id PD_Q1 PD_Q2 PD_Q3 PD_Q4
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 11 87 90 100 NA
# 2 2 55 120 130 150 12
# 3 3 38 13 15 14 NA
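If the new columns should instead be added to the original rows, as in the expected output in the question, the loop can be sketched with data.table by building each column name with paste0() and wrapping it in parentheses on the left of :=. This is a sketch, not a tested replacement for the merge-based code: new_col and the loop variable k are illustrative names, and the loop stops at 4 because that is the largest tenor in the sample data.

```r
library(data.table)

# Sample data from the question.
my_data <- data.table(
  customer_id = c("1", "1", "1", "2", "2", "2", "2", "3", "3", "3"),
  account_id  = as.character(c(11, 11, 11, 55, 55, 55, 55, 38, 38, 38)),
  time        = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01",
                          "2017-12-01", "2018-01-01", "2018-02-01",
                          "2018-03-01", "2018-04-01", "2018-05-01",
                          "2018-06-01")),
  tenor       = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3),
  variable_x  = c(87, 90, 100, 120, 130, 150, 12, 13, 15, 14)
)

for (k in 1:4) {
  new_col <- paste0("PD_Q", k)
  # (new_col) tells data.table to use the string stored in new_col as the
  # column name. match(k, tenor) is the position of tenor == k within the
  # group, or NA if the group has no such tenor, which yields an NA value.
  my_data[, (new_col) := variable_x[match(k, tenor)],
          by = .(customer_id, account_id)]
}
```

Running the loop over 1:20 instead would also work, but PD_Q5 onward would be entirely NA for this sample.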
I have data that looks the following way:
Participant Round Total
1 100 5
1 101 8
1 102 12
1 200 42
2 100 14
2 101 71
40 100 32
40 101 27
40 200 18
I want to get a table with the Total of the last Round (200) minus the Total of the first Round (100).
For example - for Participant 1 - it is 42 - 5 = 37.
The final output should look like:
Participant Total
1 37
2
40 -14
With base R
aggregate(Total ~ Participant, df[df$Round %in% c(100, 200), ], diff)
# Participant Total
# 1 1 37
# 2 2
# 3 40 -14
Or similarly combined with subset
aggregate(Total ~ Participant, df, subset = Round %in% c(100, 200), diff)
Or with data.table
library(data.table)
setDT(df)[Round %in% c(100, 200), diff(Total), by = Participant]
# Participant V1
# 1: 1 37
# 2: 40 -14
Or using binary join
setkey(setDT(df), Round)
df[.(c(100, 200)), diff(Total), by = Participant]
# Participant V1
# 1: 1 37
# 2: 40 -14
Or with dplyr
library(dplyr)
df %>%
group_by(Participant) %>%
filter(Round %in% c(100, 200)) %>%
summarise(Total = diff(Total))
# Source: local data table [2 x 2]
#
# Participant Total
# 1 1 37
# 2 40 -14
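Note that the data.table and dplyr versions drop Participant 2, who has no Round 200. If that participant should stay in the result with a missing value, one option is a left join of the two rounds in base R. A minimal sketch using the question's data; first_round, last_round, merged and res are illustrative names:

```r
df <- read.table(header = TRUE, text = "
Participant Round Total
1 100 5
1 101 8
1 102 12
1 200 42
2 100 14
2 101 71
40 100 32
40 101 27
40 200 18")

# One table per round of interest.
first_round <- df[df$Round == 100, c("Participant", "Total")]
last_round  <- df[df$Round == 200, c("Participant", "Total")]

# all.x = TRUE keeps participants that have a Round 100 but no Round 200;
# their Total_200 is NA, so the difference is NA as well.
merged <- merge(first_round, last_round, by = "Participant",
                all.x = TRUE, suffixes = c("_100", "_200"))
merged$Total <- merged$Total_200 - merged$Total_100

res <- merged[, c("Participant", "Total")]
```

Here res has one row per participant: 37 for Participant 1, NA for Participant 2, and -14 for Participant 40.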
You can try this:
library(dplyr)
group_by(df, Participant) %>%
filter(row_number()==1 | row_number()==max(row_number())) %>%
mutate(df = diff(Total)) %>%
select(Participant, df) %>%
unique()
Source: local data frame [3 x 2]
Groups: Participant
Participant df
1 1 37
2 2 57
3 40 -14
Try this:
df <- read.table(header = TRUE, text = "
Participant Round Total
1 100 5
1 101 8
1 102 12
1 200 42
2 100 14
2 101 71
2 200 80
40 100 32
40 101 27
40 200 18")
library(data.table)
setDT(df)[ , .(Total = Total[Round == 200] - Total[Round == 100]), by = Participant]
Everyone loves a bit of sqldf, so if your requirement isn't to use apply then try this:
Firstly some test data:
df <- read.table(header = TRUE, text = "
Participant Round Total
1 100 5
1 101 8
1 102 12
1 200 42
2 100 14
2 101 71
2 200 80
40 100 32
40 101 27
40 200 18")
Next use SQL to create 2 columns - one for the 100 round and one for the 200 round and subtract them
library(sqldf)
rolled <- sqldf("
SELECT tab_a.Participant AS Participant
,tab_b.Total_200 - tab_a.Total_100 AS Difference
FROM (
SELECT Participant
,Total AS Total_100
FROM df
WHERE Round = 100
) tab_a
INNER JOIN (
SELECT Participant
,Total AS Total_200
FROM df
WHERE Round = 200
) tab_b ON (tab_a.Participant = tab_b.Participant)
")