Find discontinuities in observational data with R

Data
id<-c("a","a","a","a","a","a","b","b","b","b","b","b")
d<-c(1,2,3,90,98,100000,4,6,7,8,23,45)
df<-data.frame(id,d)
I want to detect discontinuities in the observations for each "id".
I am looking for a way to detect them without using means or medians as a reference.

You can check, within each group, whether the difference between a row and the previous one is anything other than 1:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(dis = +(c(FALSE, diff(d) != 1)))
# A tibble: 12 × 3
# Groups: id [2]
id d dis
<chr> <dbl> <int>
1 a 1 0
2 a 2 0
3 a 3 0
4 a 90 1
5 a 98 1
6 a 100000 1
7 b 4 0
8 b 6 1
9 b 7 0
10 b 8 0
11 b 23 1
12 b 45 1
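The same check can be sketched in base R with ave(), which avoids the dplyr dependency. The segment column is an extra, hypothetical addition that numbers the continuous runs within each id; it is not part of the answer above.

```r
# Same data as in the question
id <- c("a","a","a","a","a","a","b","b","b","b","b","b")
d  <- c(1,2,3,90,98,100000,4,6,7,8,23,45)
df <- data.frame(id, d)

# Flag a row when it does not follow its predecessor by exactly 1
# (the first row of each id is never flagged)
df$dis <- ave(df$d, df$id, FUN = function(x) c(0, diff(x) != 1))

# Hypothetical extra column: running segment number within each id,
# increased by one at every flagged row
df$segment <- ave(df$dis, df$id, FUN = cumsum) + 1
df
```

Replacing the `!= 1` test with a tolerance such as `> 1` would be the place to adapt this sketch if the expected step size differs.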

Related

How to count groupings of elements in base R or dplyr using multiple conditions?

I am trying to number the instances of each Element, where rows that share a grouping code ("Group") greater than 0 count as a single instance. Suppose we start with the output below, generated by the code immediately beneath it:
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 1
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 1
7 X 0 1
8 X 0 1
9 B 0 1
10 R 0 1
11 R 2 2
12 R 2 2
13 X 3 3
14 X 3 3
15 X 3 3
library(dplyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>% group_by(Element) %>% mutate(reSeq = match(Group, unique(Group)))
Instead, I would like the reSeq column to calculate and output as shown below with explanations to the right:
Element Group reSeq reSeq explanation
<chr> <dbl> <int>
1 R 0 1 1st instance of R (ungrouped)(Group = 0 means not grouped)
2 R 0 2 2nd instance of R (ungrouped)(Group = 0 means not grouped)
3 X 0 1 1st instance of X (ungrouped)(Group = 0 means not grouped)
4 X 1 2 2nd instance of X (grouped by Group = 1)
5 X 1 2 2nd instance of X (grouped by Group = 1)
6 X 0 3 3rd instance of X (ungrouped)
7 X 0 4 4th instance of X (ungrouped)
8 X 0 5 5th instance of X (ungrouped)
9 B 0 1 1st instance of B (ungrouped)
10 R 0 3 3rd instance of R (ungrouped)
11 R 2 4 4th instance of R (grouped by Group = 2)
12 R 2 4 4th instance of R (grouped by Group = 2)
13 X 3 6 6th instance of X (grouped by Group = 3)
14 X 3 6 6th instance of X (grouped by Group = 3)
15 X 3 6 6th instance of X (grouped by Group = 3)
Any recommendations for doing this? If possible, starting with the dplyr code I use above because I am fairly familiar with it.
If we use rowid() from data.table, we can skip a couple of steps:
library(dplyr)
library(data.table)
library(tidyr)
myDF %>%
  mutate(reSeq = rowid(Element) * NA^!(Group == 0 | !duplicated(Group))) %>%
  group_by(Element) %>%
  fill(reSeq) %>%
  mutate(reSeq = match(reSeq, unique(reSeq))) %>%
  ungroup()
Output:
# A tibble: 15 × 3
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 2
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 3
7 X 0 4
8 X 0 5
9 B 0 1
10 R 0 3
11 R 2 4
12 R 2 4
13 X 3 6
14 X 3 6
15 X 3 6
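The NA^!(...) factor in the answer above acts as a mask: in R, NA^0 is 1 and NA^1 is NA, so multiplying by it keeps a value where the condition is TRUE and blanks it where the condition is FALSE. A minimal base R sketch of just that trick (the vectors cond and x are made up for illustration):

```r
# Made-up inputs, for illustration only
cond <- c(TRUE, FALSE, TRUE, FALSE)
x    <- c(10, 20, 30, 40)

# !cond is coerced to 0/1, so NA^(!cond) is 1 where cond is TRUE
# and NA where cond is FALSE
mask <- NA^(!cond)

# Multiplying keeps x where cond is TRUE and sets it to NA elsewhere
masked <- x * mask
masked
```

In the answer, fill() then carries the last non-missing value forward within each Element, so repeated rows of a group inherit the number of the group's first row.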
Below is what I managed to cobble together. Maybe there's a cleaner solution? Here's the code:
library(dplyr)
library(tidyr)
myDF %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  ungroup() %>%
  mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt, 0)) %>%
  mutate(reSeq = na_if(reSeq, 0)) %>%
  group_by(Element) %>%
  fill(reSeq) %>%
  mutate(reSeq = match(reSeq, unique(reSeq))) %>%
  ungroup()
And here's the output:
# A tibble: 15 x 4
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 3
7 X 0 5 4
8 X 0 6 5
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 6
14 X 3 8 6
15 X 3 9 6
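One possibly simpler alternative, sketched here in base R rather than dplyr: give every ungrouped row its own key and every grouped row a key derived from its Group code, then number the keys by order of first appearance within each Element. This assumes that a given non-zero Group code always denotes the same grouping within an Element; the key helper is an assumption of this sketch, not part of the code above.

```r
myDF <- data.frame(
  Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
  Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)

# Helper key (an assumption of this sketch): unique per ungrouped row,
# shared by all rows that carry the same non-zero Group code
key <- ifelse(myDF$Group == 0,
              paste0("row", seq_len(nrow(myDF))),
              paste0("grp", myDF$Group))

# Within each Element, number the keys by order of first appearance
myDF$reSeq <- ave(seq_along(key), myDF$Element,
                  FUN = function(i) match(key[i], unique(key[i])))
myDF
```

Because the key already separates grouped from ungrouped rows, no fill step is needed.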

Code values in new column based on whether values in another column are unique

Given the following data I would like to create a new column new_sequence based on the condition:
If an id occurs only once, the new value should be 0. If an id occurs several times, the new value should be numbered according to the values in sequence.
dat <- tibble(id = c(1,2,3,3,3,4,4),
sequence = c(1,1,1,2,3,1,2))
# A tibble: 7 x 2
id sequence
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 3 2
5 3 3
6 4 1
7 4 2
So, for the example data I am looking to produce the following output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
I have tried the code below, which does not work because the first occurrence of every id is coded as 0:
dat %>% mutate(new_sequence = ifelse(!duplicated(id), 0, sequence))
Use dplyr::add_count() rather than !duplicated():
library(dplyr)
dat %>%
  add_count(id) %>%
  mutate(new_sequence = ifelse(n == 1, 0, sequence)) %>%
  select(!n)
Output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
You can also try the following. After grouping by id, check whether the number of rows in the group, n(), is 1. Use if ... else rather than ifelse() here, because the test n() == 1 has length 1 while sequence has one value per row of the group.
dat %>%
  group_by(id) %>%
  mutate(new_sequence = if (n() == 1) 0 else sequence)
Output:
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
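To see why the grouped version needs if ... else: ifelse() returns a result shaped like its test, so a length-1 test yields a length-1 result. A small base R sketch, using ave() as a stand-in for the grouped mutate (that substitution is an assumption of this sketch, not part of the answers above):

```r
# ifelse() follows the length of the test, not of the yes/no values,
# so only the first element of `no` comes back here
short <- ifelse(FALSE, 0, c(1, 2, 3))
short

dat <- data.frame(id = c(1,2,3,3,3,4,4),
                  sequence = c(1,1,1,2,3,1,2))

# if/else returns the whole vector, so each group keeps all of its rows
dat$new_sequence <- ave(dat$sequence, dat$id,
                        FUN = function(s) if (length(s) == 1) 0 else s)
dat
```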

New column which counts the number of times a value in a specific row of one column appears in another column

I have tried searching for an answer to this question but it continues to elude me! I am working with crime data where each row refers to a specific crime incident. There is a variable for suspect ID, and a variable for victim ID. These ID numbers are consistent across the two columns (in other words, if a row contains the ID 424 in the victim ID column, and a separate row contains the ID 424 in the suspect column, I know that the same person was listed as a victim in the first crime and as a suspect in the second crime).
I want to create two new variables: one which counts the number of times the victim (in a particular crime incident) has been recorded as a suspect (in the dataset as a whole), and one which counts the number of times the suspect (in a particular crime incident) has been recorded as a victim (in the dataset as a whole).
Here's a simplified version of my data:
s.uid v.uid
1 1 9
2 2 8
3 3 2
4 4 2
5 5 2
6 NA 7
7 5 6
8 9 5
And here is what I want to create:
s.uid v.uid s.in.v v.in.s
1 1 9 0 1
2 2 8 3 0
3 3 2 0 1
4 4 2 0 1
5 5 2 1 1
6 NA 7 NA 0
7 5 6 1 0
8 9 5 1 2
Note that, where there is an NA, I would like the NA to be preserved. I'm currently trying to work in tidyverse and piping where possible, so I would prefer answers in that kind of format, but I'm open to any solution!
Using dplyr:
dat %>%
  group_by(s.uid) %>%
  mutate(s.in.v = sum(dat$v.uid %in% s.uid)) %>%
  group_by(v.uid) %>%
  mutate(v.in.s = sum(dat$s.uid %in% v.uid))
# A tibble: 8 × 4
# Groups: v.uid [6]
s.uid v.uid s.in.v v.in.s
<int> <int> <int> <int>
1 1 9 0 1
2 2 8 3 0
3 3 2 0 1
4 4 2 0 1
5 5 2 1 1
6 NA 7 0 0
7 5 6 1 0
8 9 5 1 2
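Note that the output above gives 0, not NA, for the row whose s.uid is missing, because %in% always returns TRUE or FALSE and never NA. A hedged base R sketch that keeps the NA instead; count_in() is a hypothetical helper written for this example, not a dplyr function and not part of the answer above:

```r
dat <- data.frame(s.uid = c(1:5, NA, 5, 9),
                  v.uid = c(9, 8, 2, 2, 2, 7, 6, 5))

# Hypothetical helper: for each element of x, count its occurrences in pool,
# returning NA where x itself is missing
count_in <- function(x, pool) {
  n <- vapply(x, function(v) sum(pool == v, na.rm = TRUE), integer(1))
  n[is.na(x)] <- NA_integer_
  n
}

dat$s.in.v <- count_in(dat$s.uid, dat$v.uid)
dat$v.in.s <- count_in(dat$v.uid, dat$s.uid)
dat
```

The same helper could be called inside mutate() if a piped version is preferred.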
First, a reprex of your data:
library(tidyverse)
# Replica of your data:
s.uid <- c(1:5, NA, 5, 9)
v.uid <- c(9, 8, 2, 2, 2, 7, 6, 5)
DF <- tibble(s.uid, v.uid)
Custom function to use:
# function to check how many times "a" (a length-1 atomic vector) occurs in "b":
f <- function(a, b) {
  a <- as.character(a)
  # make a lookup table (a.k.a. dictionary) of the values in b:
  b_freq <- table(b, useNA = "always")
  # if a is in b, return its frequency:
  if (a %in% names(b_freq)) {
    return(b_freq[a])
  }
  # else (i.e. a is not in b) return 0:
  return(0)
}
# vectorise that, so "a" can have any length:
ff <- function(a, b) {
  purrr::map_dbl(.x = a, .f = f, b = b)
}
Finally:
DF |>
  mutate(
    s_in_v = ff(s.uid, v.uid),
    v_in_s = ff(v.uid, s.uid)
  )
Results in:
#> # A tibble: 8 × 4
#> s.uid v.uid s_in_v v_in_s
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 9 0 1
#> 2 2 8 3 0
#> 3 3 2 0 1
#> 4 4 2 0 1
#> 5 5 2 1 1
#> 6 NA 7 NA 0
#> 7 5 6 1 0
#> 8 9 5 1 2

If 1 appears, all subsequent elements of the variable must be 1, grouped by subject

I want to start from:
test <- data.frame(subject=c(rep(1,10),rep(2,10)),x=1:10,y=0:1)
As I wrote in the title, once the first 1 appears, all subsequent values of "y" for a given "subject" must change to 1, and likewise for the next "subject".
I tried something like this:
test <- test%>%
group_nest(subject) %>%
mutate(XD = map(data,function(x){
ifelse(x$y[which(grepl(1, x$y))[1]:nrow(x)]==TRUE , 1,0)})) %>% unnest(cols = c(data,XD))
It didn't work :(
Try this:
library(dplyr)
# Code
new <- test %>%
  group_by(subject) %>%
  mutate(y = ifelse(row_number() < min(which(y == 1)), y, 1))
Output:
# A tibble: 20 x 3
# Groups: subject [2]
subject x y
<dbl> <int> <dbl>
1 1 1 0
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 1 7 1
8 1 8 1
9 1 9 1
10 1 10 1
11 2 1 0
12 2 2 1
13 2 3 1
14 2 4 1
15 2 5 1
16 2 6 1
17 2 7 1
18 2 8 1
19 2 9 1
20 2 10 1
Since you appear to just have 0's and 1's, a straightforward approach would be to take a cumulative maximum via the cummax function:
library(dplyr)
test %>%
  group_by(subject) %>%
  mutate(y = cummax(y))
Duck's answer is considerably more robust if a range of values may appear before or after the first 1.
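The difference between the two answers shows up once y holds values other than 0 and 1. A small base R sketch on a made-up vector for a single subject (the values are not from the question):

```r
# Made-up values for a single subject, for illustration only
y <- c(0, 3, 1, 0, 2)

# Cumulative maximum: every value is raised to the largest seen so far
by_cummax <- cummax(y)

# "First 1" rule: keep values before the first 1, set the rest to 1
first_one <- match(1, y)   # index of the first 1 (NA if there is none)
by_first_one <- if (is.na(first_one)) {
  y
} else {
  ifelse(seq_along(y) < first_one, y, 1)
}

by_cummax
by_first_one
```

With only 0s and 1s the two rules agree. The is.na() guard covers a subject with no 1 at all, a case in which min(which(y == 1)) returns Inf with a warning.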

Filter on groups where, at the max value of one variable, another variable equals a particular value

I want to filter on groups where at the max value of one variable, another variable equals a particular value.
I have data like so:
library(tidyverse)
df1 <- data.frame(grp = rep(letters[1:2], each = 5),
                  day = 1:5,
                  value = c(0,5,7,1,1,5,8,5,3,0)) %>%
  group_by(grp)
grp day value
1 a 1 0
2 a 2 5
3 a 3 7
4 a 4 1
5 a 5 1
6 b 1 5
7 b 2 8
8 b 3 5
9 b 4 3
10 b 5 0
And I want to filter on groups where at the max(day), value equals 1.
So the output would look like this:
grp day value
1 a 1 0
2 a 2 5
3 a 3 7
4 a 4 1
5 a 5 1
Data.table or dplyr solutions are welcome. Thanks!
As the data is already grouped, simply filter by checking whether the 'value' on the row with the maximum day (which.max(day)) is 1:
library(dplyr)
df1 %>%
  filter(value[which.max(day)] == 1)
# A tibble: 5 x 3
# Groups: grp [1]
# grp day value
# <fct> <int> <dbl>
#1 a 1 0
#2 a 2 5
#3 a 3 7
#4 a 4 1
#5 a 5 1
Or combine the two conditions and wrap them in any():
df1 %>%
  filter(any(value == 1 & day == max(day)))
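For comparison, a base R sketch of the same group filter without dplyr (the names keep and res are introduced here for illustration). It computes one TRUE/FALSE flag per group and then keeps the rows of the flagged groups:

```r
df1 <- data.frame(grp = rep(letters[1:2], each = 5),
                  day = 1:5,
                  value = c(0,5,7,1,1,5,8,5,3,0))

# One flag per group: is `value` equal to 1 on the row with the largest day?
keep <- sapply(split(df1, df1$grp),
               function(g) g$value[which.max(g$day)] == 1)

# Keep all rows of the groups whose flag is TRUE
res <- df1[df1$grp %in% names(keep)[keep], ]
res
```

Like which.max() in the first answer, this looks only at the first row holding the maximum day; the any() variant also covers ties at the maximum.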

Resources