Amount of overlap of two ranges in R [DescTools?]

I need to know by how many integers two numeric ranges overlap. I tried using DescTools::Overlap, but the output is not what I expected.
library(DescTools)
library(dplyr) # provides filter(), used below
library(tidyr)
df1 <- data.frame(ID = c('a', 'b', 'c', 'd', 'e'),
var1 = c(1, 2, 3, 4, 5),
var2 = c(9, 3, 5, 7, 11))
df1 %>% setNames(paste0(names(.), '_2')) %>% tidyr::crossing(df1) %>% filter(ID != ID_2) -> pairwise
pairwise$overlap <- DescTools::Overlap(c(pairwise$var1,pairwise$var2),c(pairwise$var1_2,pairwise$var2_2))
The output is 10 in every row of the overlap column for the test dataset created above. I want the row-specific overlap instead, so the first 3 rows would be 2, 3, 4, respectively.

I find the easiest way to do it is with rowwise(). This operation used to be discouraged, but its performance has improved since the dplyr 1.0.0 release.
pairwise %>%
rowwise() %>%
mutate(overlap = Overlap(c(var1, var2), c(var1_2, var2_2))) %>%
ungroup()
#> # A tibble: 20 x 7
#> ID_2 var1_2 var2_2 ID var1 var2 overlap
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 a 1 9 b 2 3 1
#> 2 a 1 9 c 3 5 2
#> 3 a 1 9 d 4 7 3
#> 4 a 1 9 e 5 11 4
#> 5 b 2 3 a 1 9 1
#> 6 b 2 3 c 3 5 0
#> 7 b 2 3 d 4 7 0
#> 8 b 2 3 e 5 11 0
#> 9 c 3 5 a 1 9 2
#> 10 c 3 5 b 2 3 0
#> 11 c 3 5 d 4 7 1
#> 12 c 3 5 e 5 11 0
#> 13 d 4 7 a 1 9 3
#> 14 d 4 7 b 2 3 0
#> 15 d 4 7 c 3 5 1
#> 16 d 4 7 e 5 11 2
#> 17 e 5 11 a 1 9 4
#> 18 e 5 11 b 2 3 0
#> 19 e 5 11 c 3 5 0
#> 20 e 5 11 d 4 7 2

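My version with the apply function. apply() converts each row to a character vector because the data frame mixes character and numeric columns, hence the as.numeric() calls: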
pairwise$overlap <- apply(pairwise, 1,
function(x) DescTools::Overlap(as.numeric(c(x[5], x[6])),
as.numeric(c(x[2],x[3]))))
pairwise
# A tibble: 20 x 7
ID_2 var1_2 var2_2 ID var1 var2 overlap
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 a 1 9 b 2 3 1
2 a 1 9 c 3 5 2
3 a 1 9 d 4 7 3
4 a 1 9 e 5 11 4
5 b 2 3 a 1 9 1
6 b 2 3 c 3 5 0
7 b 2 3 d 4 7 0
8 b 2 3 e 5 11 0
9 c 3 5 a 1 9 2
10 c 3 5 b 2 3 0
11 c 3 5 d 4 7 1
12 c 3 5 e 5 11 0
13 d 4 7 a 1 9 3
14 d 4 7 b 2 3 0
15 d 4 7 c 3 5 1
16 d 4 7 e 5 11 2
17 e 5 11 a 1 9 4
18 e 5 11 b 2 3 0
19 e 5 11 c 3 5 0
20 e 5 11 d 4 7 2
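
Both answers above call Overlap() once per row. As a minimal sketch in base R that avoids the row-by-row call, the overlap can also be computed for all pairs at once with pmin() and pmax(). The vector names lo1, hi1, lo2 and hi2 are illustrative and stand in for var1, var2, var1_2 and var2_2. The second expression counts shared integers under the assumption that the ranges are closed, which seems to be what the question's expected values of 2, 3, 4 refer to.

```r
# Lower and upper bounds of the two ranges, one pair per position
# (illustrative names; values taken from rows of the question's data)
lo1 <- c(2, 3, 4, 3)
hi1 <- c(3, 5, 7, 5)
lo2 <- c(1, 1, 1, 2)
hi2 <- c(9, 9, 9, 3)

# Length of the overlap: smaller upper bound minus larger lower bound,
# floored at 0 for ranges that do not intersect
overlap_len <- pmax(0, pmin(hi1, hi2) - pmax(lo1, lo2))
overlap_len
#> [1] 1 2 3 0

# Number of integers contained in both ranges, assuming closed ranges
n_shared <- pmax(0, floor(pmin(hi1, hi2)) - ceiling(pmax(lo1, lo2)) + 1)
n_shared
#> [1] 2 3 4 1
```

The last pair shows the difference between the two measures: the ranges 3 to 5 and 2 to 3 touch only at 3, so the overlap has length 0 but one shared integer.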

Related

Sum values incrementally for panel data

I have a very basic question as I am relatively new to R. I was wondering how to add a value in a particular column to the previous one for each cross-sectional unit in my data separately. My data looks like this:
firm date value
A 1 10
A 2 15
A 3 20
A 4 0
B 1 0
B 2 1
B 3 5
B 4 10
C 1 3
C 2 2
C 3 10
C 4 1
D 1 7
D 2 3
D 3 6
D 4 9
And I want to achieve the data below. So I want to sum values for each cross-sectional unit incrementally.
firm date value cumulative value
A 1 10 10
A 2 15 25
A 3 20 45
A 4 0 45
B 1 0 0
B 2 1 1
B 3 5 6
B 4 10 16
C 1 3 3
C 2 2 5
C 3 10 15
C 4 1 16
D 1 7 7
D 2 3 10
D 3 6 16
D 4 9 25
Below is a reproducible example code. I tried lag() but couldn't figure out how to repeat it for each firm.
firm <- c("A","A","A","A","B","B","B","B","C","C","C", "C","D","D","D","D")
date <- c("1","2","3","4","1","2","3","4","1","2","3","4", "1", "2", "3", "4")
value <- c(10, 15, 20, 0, 0, 1, 5, 10, 3, 2, 10, 1, 7, 3, 6, 9)
data <- data.frame(firm = firm, date = date, value = value)
Does this work?
library(dplyr)
data %>% group_by(firm) %>% mutate(cumulative_value = cumsum(value))
# A tibble: 16 x 4
# Groups: firm [4]
firm date value cumulative_value
<chr> <int> <int> <int>
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25
Using base R with ave
data$cumulative_value <- with(data, ave(value, firm, FUN = cumsum))
-output
> data
firm date value cumulative_value
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25
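
Both answers rely on the rows already being sorted by date within each firm, because cumsum() simply follows row order. The following is a small sketch, with made-up unsorted rows, of sorting first when that is not guaranteed. It assumes date is stored as character, as in the question's example, and can be converted with as.integer().

```r
# Deliberately unsorted toy data (illustrative values, not the question's full data)
data <- data.frame(
  firm  = c("B", "A", "A", "B"),
  date  = c("2", "2", "1", "1"),  # dates stored as character, as in the question
  value = c(1, 15, 10, 0)
)

# Sort by firm, then by date converted to a number
data <- data[order(data$firm, as.integer(data$date)), ]

# Running total within each firm
data$cumulative_value <- ave(data$value, data$firm, FUN = cumsum)
data
```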

Replicate clusters of repeated observations and generate a new ID that uniquely identifies each copy, based on a count value

I am trying to duplicate clusters of observations (ID) and generate a new variable that identifies the clusters uniquely (new_ID). For instance, consider the data frame df1
df1 <- data.frame(ID=c("1", "1", "1", "2", "2", "3"), sex=c("M", "M", "M", "F", "F", "M"),count=c(4,4,4,3,3,2))
df1
#> ID sex count
#> 1 1 M 4
#> 2 1 M 4
#> 3 1 M 4
#> 4 2 F 3
#> 5 2 F 3
#> 6 3 M 2
df2 <- data.frame(
ID=c("1","1","1","1","1","1","1","1","1","1","1","1","2","2","2","2","2","2","3","3"),
new_ID = c("1","1","1","2","2","2","3","3","3","4","4","4","5","5","6","6","7","7", "8","9"),
sex=c("M","M","M","M","M","M","M","M","M","M","M","M", "F", "F", "F", "F","F", "F","M","M"),
count=c(4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,2,2))
df2
#> ID new_ID sex count
#> 1 1 1 M 4
#> 2 1 1 M 4
#> 3 1 1 M 4
#> 4 1 2 M 4
#> 5 1 2 M 4
#> 6 1 2 M 4
#> 7 1 3 M 4
#> 8 1 3 M 4
#> 9 1 3 M 4
#> 10 1 4 M 4
#> 11 1 4 M 4
#> 12 1 4 M 4
#> 13 2 5 F 3
#> 14 2 5 F 3
#> 15 2 6 F 3
#> 16 2 6 F 3
#> 17 2 7 F 3
#> 18 2 7 F 3
#> 19 3 8 M 2
#> 20 3 9 M 2
Thank you for helping in advance.
If I have understood correctly,
library(dplyr)
df1 %>%
tidyr::uncount(count, .remove = FALSE) %>%
group_by(ID) %>%
mutate(new_ID = rep(seq_len(first(count)), each = n()/first(count))) %>%
ungroup() %>%
mutate(new_ID = data.table::rleid(new_ID))
# A tibble: 20 x 4
# ID sex count new_ID
# <chr> <chr> <dbl> <int>
# 1 1 M 4 1
# 2 1 M 4 1
# 3 1 M 4 1
# 4 1 M 4 2
# 5 1 M 4 2
# 6 1 M 4 2
# 7 1 M 4 3
# 8 1 M 4 3
# 9 1 M 4 3
#10 1 M 4 4
#11 1 M 4 4
#12 1 M 4 4
#13 2 F 3 5
#14 2 F 3 5
#15 2 F 3 6
#16 2 F 3 6
#17 2 F 3 7
#18 2 F 3 7
#19 3 M 2 8
#20 3 M 2 9
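
Note that rleid() is applied to new_ID alone in the answer above, so a cluster with a count of 1 would be merged with the first copy of the cluster that follows it. A base R sketch that avoids this by numbering each (ID, copy) pair is shown below. The helper column copy is introduced here for illustration, and the sketch assumes count is constant within each ID, as in the example.

```r
df1 <- data.frame(ID    = c("1", "1", "1", "2", "2", "3"),
                  sex   = c("M", "M", "M", "F", "F", "M"),
                  count = c(4, 4, 4, 3, 3, 2))

# Replicate each cluster `count` times and remember which copy a row belongs to
parts <- lapply(split(df1, df1$ID), function(g) {
  k   <- g$count[1]                            # number of copies for this cluster
  out <- g[rep(seq_len(nrow(g)), times = k), ] # whole cluster repeated k times
  out$copy <- rep(seq_len(k), each = nrow(g))  # hypothetical helper column
  out
})
df2 <- do.call(rbind, parts)
rownames(df2) <- NULL

# One new_ID per (ID, copy) pair, numbered in order of appearance
key <- paste(df2$ID, df2$copy)
df2$new_ID <- match(key, unique(key))
df2$copy <- NULL
df2
```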

How to change the shape of a data frame in R? (stacking columns with the same names together)

I'm trying to reshape a data frame in R:
Gene_ID Value Gene_ID.1 Value.1 Gene_ID.2 Value.2
1 A 0 A 3 A 1
2 B 5 B 6 B 5
3 C 7 C 2 C 7
4 D 8 D 9 D 2
5 E 5 E 8 E 4
6 F 6 F 4 F 5
I want to make it look like this:
Gene_ID Value
1 A 0
2 B 5
3 C 7
4 D 8
5 E 5
6 F 6
7 A 1
8 B 5
9 C 7
10 D 2
11 E 4
12 F 5
13 A 3
14 B 6
15 C 2
16 D 9
17 E 8
18 F 4
So simply stack the columns with the same names together. Is there a way to do so?
Thanks!
You can use either the combination of gather()/spread() or pivot_longer() from the tidyr package.
To learn more about the new pivot_xxx() functions, check out these links:
A Graphical Introduction to tidyr's pivot_*()
Pivoting data from columns to rows (and back!) in the tidyverse
library(dplyr)
library(tidyr)
txt <- " Gene_ID.0 Value.0 Gene_ID.1 Value.1 Gene_ID.2 Value.2
1 A 0 A 3 A 1
2 B 5 B 6 B 5
3 C 7 C 2 C 7
4 D 8 D 9 D 2
5 E 5 E 8 E 4
6 F 6 F 4 F 5"
dat <- read.table(text = txt, header = TRUE)
Combine gather(), separate() and spread() functions
dat %>%
mutate(Row_Nr = row_number()) %>%
gather(key, value, -Row_Nr) %>%
separate(key, into = c("key", "Gene_Nr"), sep = "\\.") %>%
spread(key, value) %>%
select(-Row_Nr)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> Gene_Nr Gene_ID Value
#> 1 0 A 0
#> 2 1 A 3
#> 3 2 A 1
#> 4 0 B 5
#> 5 1 B 6
#> 6 2 B 5
#> 7 0 C 7
#> 8 1 C 2
#> 9 2 C 7
#> 10 0 D 8
#> 11 1 D 9
#> 12 2 D 2
#> 13 0 E 5
#> 14 1 E 8
#> 15 2 E 4
#> 16 0 F 6
#> 17 1 F 4
#> 18 2 F 5
Use pivot_longer()
### gather all values columns
### separate original column names by the period "."
### into Gene_ID/Value and Gene_Nr
dat %>%
pivot_longer(everything(),
names_to = c(".value", "Gene_Nr"),
names_pattern = "(.*)\\.(.*)")
#> Gene_Nr Gene_ID Value
#> 1 0 A 0
#> 2 1 A 3
#> 3 2 A 1
#> 4 0 B 5
#> 5 1 B 6
#> 6 2 B 5
#> 7 0 C 7
#> 8 1 C 2
#> 9 2 C 7
#> 10 0 D 8
#> 11 1 D 9
#> 12 2 D 2
#> 13 0 E 5
#> 14 1 E 8
#> 15 2 E 4
#> 16 0 F 6
#> 17 1 F 4
#> 18 2 F 5
Created on 2019-12-08 by the reprex package (v0.3.0)
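
For comparison, here is a base R sketch of the same stacking without tidyr. It assumes the columns come in adjacent Gene_ID/Value pairs, as in the question, and uses a shortened three-row version of the data. The blocks are stacked in column order, which differs from the block order in the question's desired output.

```r
dat <- read.table(text = "Gene_ID Value Gene_ID.1 Value.1 Gene_ID.2 Value.2
A 0 A 3 A 1
B 5 B 6 B 5
C 7 C 2 C 7", header = TRUE)

# Position of the first column of each (Gene_ID, Value) pair
starts <- seq(1, ncol(dat), by = 2)

# Give every pair the same column names so that rbind() can stack them
blocks <- lapply(starts, function(i) {
  setNames(dat[, c(i, i + 1)], c("Gene_ID", "Value"))
})
stacked <- do.call(rbind, blocks)
rownames(stacked) <- NULL
stacked
```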

How to fix the first row on repeated measures

I'm trying to translate some SAS code into R, but I don't know how to translate the following part:
data df; by id area;
if first.area and area = 'A' then do;
var1_sum = 0;
var2_sum = 0;
end;
if indicator = 'A' then do;
var1_sum + var1;
var2_sum + var2;
end;
From the dataset before:
id area var1 var2
1 A 9 9
1 A 4 8
1 A 5 2
1 B 1 4
1 B 8 5
1 B 0 6
1 C 3 7
1 C 2 8
We get the following result when the SAS code above is used:
id area var1 var2 var1_sum var2_sum
1 A 9 9 9 9
1 A 4 8 13 17
1 A 5 2 18 19
1 B 1 4 1 4
1 B 8 5 9 9
1 B 0 6 9 15
1 C 3 7 3 7
1 C 2 8 5 15
I am using dplyr in R. I have started on part of the translation, but I don't know how to code the "if" condition from the SAS code:
df <- df %>%
group_by(id, area) %>%
.....
I am looking for help with how to include the "if" condition in this case.
Thank you for your help.
Kind regards,
Rungo.
You can do this in base R with ave
## Your data
df = read.table(text="id area var1 var2
1 A 9 9
1 A 4 8
1 A 5 2
1 B 1 4
1 B 8 5
1 B 0 6
1 C 3 7
1 C 2 8",
header=TRUE)
df$var1_sum = ave(df$var1, df$id, df$area, FUN=cumsum)
df$var2_sum = ave(df$var2, df$id, df$area, FUN=cumsum)
df
id area var1 var2 var1_sum var2_sum
1 1 A 9 9 9 9
2 1 A 4 8 13 17
3 1 A 5 2 18 19
4 1 B 1 4 1 4
5 1 B 8 5 9 9
6 1 B 0 6 9 15
7 1 C 3 7 3 7
8 1 C 2 8 5 15
With a tidyverse approach, you can use the following code:
mydf <- read.table(text="id area var1 var2
1 A 9 9
1 A 4 8
1 A 5 2
1 B 1 4
1 B 8 5
1 B 0 6
1 C 3 7
1 C 2 8",
header=TRUE)
library(tidyverse)
mydf %>%
group_by(id,area) %>%
mutate(var1sum=cumsum(var1),
var2sum=cumsum(var2))
The result is:
id area var1 var2 var1sum var2sum
<int> <fctr> <int> <int> <int> <int>
1 1 A 9 9 9 9
2 1 A 4 8 13 17
3 1 A 5 2 18 19
4 1 B 1 4 1 4
5 1 B 8 5 9 9
6 1 B 0 6 9 15
7 1 C 3 7 3 7
8 1 C 2 8 5 15
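
Neither answer translates the second SAS condition (if indicator = 'A'), and the sample data has no indicator column. As a sketch only, assuming such a column existed, a conditional running sum can be written by replacing the values of non-matching rows with 0 before calling cumsum(). The indicator values below are invented for illustration.

```r
df <- data.frame(
  id   = 1,
  area = c("A", "A", "A", "B", "B", "B", "C", "C"),
  var1 = c(9, 4, 5, 1, 8, 0, 3, 2),
  # hypothetical flag column, not part of the original data
  indicator = c("A", "A", "B", "A", "B", "A", "A", "A")
)

# Rows that fail the condition contribute 0 to the running total,
# which restarts for every id/area combination
df$var1_sum <- ave(ifelse(df$indicator == "A", df$var1, 0),
                   df$id, df$area, FUN = cumsum)
df
```

With these made-up flags, the third row of area A does not add to the total, so var1_sum stays at 13 there.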

numbering duplicated rows in dplyr [duplicate]

I ran into an issue with numbering the duplicated rows in a data.frame and could not find a similar post.
Let's say we have data like this
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
> df
gr x
1 1 a
2 1 a
3 2 b
4 2 b
5 3 c
6 3 c
7 4 a
8 4 a
9 5 c
10 5 c
11 6 d
12 6 d
13 7 a
14 7 a
and want to add a new column called x_dupl in which the first occurrence of each x value is numbered 1, the second 2, the third 3, and so on.
Thanks in advance!
The expected output
> df
gr x x_dupl
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
Your example data (plus rows where gr = 7 as in your output), and named df1, not df:
df1 <- data.frame(gr = gl(7,2),
x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
library(dplyr)
df1 %>%
group_by(x) %>%
mutate(x_dupl = dense_rank(gr)) %>%
ungroup()
# A tibble: 14 x 3
gr x x_dupl
<fctr> <fctr> <int>
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
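A base R solution:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
# factor() keeps this working whether x is stored as character or as factor
x <- rle(as.numeric(factor(df$x)))
# number the runs of each value in order of appearance
x$values <- ave(x$values, x$values, FUN = seq_along)
df$x_dupl <- inverse.rle(x)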
# gr x x_dupl
# 1 1 a 1
# 2 1 a 1
# 3 2 b 1
# 4 2 b 1
# 5 3 c 1
# 6 3 c 1
# 7 4 a 2
# 8 4 a 2
# 9 5 c 2
# 10 5 c 2
# 11 6 d 1
# 12 6 d 1
# 13 7 a 3
# 14 7 a 3
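
The rle() approach counts runs, so it assumes that rows of the same group are adjacent. A base R sketch of the dense-rank idea from the dplyr answer, which does not depend on row order, could look like this. Converting gr with as.integer() is an assumption that fits the gl() factor used in the example.

```r
df <- data.frame(gr = gl(7, 2),
                 x  = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))

# Group numbers as integers (gl() creates a factor with levels 1..7)
gr_num <- as.integer(df$gr)

# Within each x value, rank the distinct gr values without gaps
df$x_dupl <- ave(gr_num, df$x,
                 FUN = function(g) match(g, sort(unique(g))))
df
```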
