Most efficient way of determining which ID does not have a pair? - r

Say that I have a dataframe that looks like the one below. In the dataframe we have the following pairs of IDs: (4330, 4331), (2333, 2334), (3336, 3337), which are +/- 1 of each other. However, 3349 does not have a pair. What would be the most efficient way of filtering out unpaired IDs?
    ID sex zyg race SES
1 4330   2   2    2   1
2 4331   2   2    2   1
3 2333   2   2    1  78
4 2334   2   2    1  78
5 3336   2   2    1  18
6 3337   2   2    1  18
7 3349   2   2    1  18

This will return only pairs/twins (no unpaired individuals, and no triplets, quadruplets, etc.). In base R:
# IDs 1:3 are included as a consecutive run of three, to show that triplets are excluded
df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)
df <- df[order(df$ID),]
df[
  rep(
    with(
      rle(diff(df$ID)),
      cumsum(lengths)[lengths == 1L & values == 1]
    ), each = 2
  ) + 0:1,
]
#>     ID sex
#> 6 2333   2
#> 7 2334   2
#> 8 3336   2
#> 9 3337   2
#> 4 4330   2
#> 5 4331   2
Explanation:
After sorting the data, only individuals in a group (a twin, triplet, etc.) will have an ID difference of 1 from the individual in the next row. diff(df$ID) returns the difference in ID value from one row to the next along the whole data.frame. To identify twins, we want to find where diff(df$ID) has a 1 that is by itself (i.e., neither the previous value nor the next value is also 1). We use rle to find those lone 1s:
rle(diff(df$ID))
#> Run Length Encoding
#> lengths: int [1:8] 2 1 1 1 1 1 1 1
#> values : num [1:8] 1 2330 1 1002 1 12 981 1
Lone 1s occur where both the value of diff(df$ID) (values) and the length of the run (lengths) are 1. This happens at the third, fifth, and eighth runs. cumsum(lengths) gives the position within diff(df$ID) at which each run ends; since diff(df$ID)[i] spans rows i and i+1 of df, the end position of a lone 1 is also the first row of a twin pair in df. Subsetting cumsum(lengths) at runs 3, 5, and 8 therefore gives the starting index of each twin pair in df. We repeat each of those indices twice with rep(..., each = 2), then add 0:1 (taking advantage of recycling in R) to get the row indices of every individual who is a twin.
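For reference, here are the intermediates on the sorted df above (a step-by-step illustration of the same code):
d <- diff(df$ID)                                             # 1 1 2330 1 1002 1 12 981 1
r <- rle(d)
starts <- cumsum(r$lengths)[r$lengths == 1L & r$values == 1] # 4 6 9
rep(starts, each = 2) + 0:1                                  # 4 5 6 7 9 10: rows of the three twin pairs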

Using dplyr::lag() and lead(), you can filter() to rows where the previous ID is ID - 1 or the next ID is ID + 1 (this assumes the data are sorted by ID):
library(dplyr)
df %>%
  filter(lag(ID) == ID - 1 | lead(ID) == ID + 1)
# A tibble: 6 × 5
     ID   sex   zyg  race   SES
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  4330     2     2     2     1
2  4331     2     2     2     1
3  2333     2     2     1    78
4  2334     2     2     1    78
5  3336     2     2     1    18
6  3337     2     2     1    18
*Edit: this will not filter out "triplets," "quadruplets," etc., contrary to the additional requirements mentioned in the comments.
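If triplets and larger groups must be excluded as well, one dplyr sketch (my own variant, not from the original answer) labels runs of consecutive IDs with cumsum() and keeps only runs of exactly two:
library(dplyr)
df %>%
  arrange(ID) %>%
  mutate(grp = cumsum(c(TRUE, diff(ID) != 1))) %>% # new label wherever the ID gap exceeds 1
  group_by(grp) %>%
  filter(n() == 2) %>%                             # keep exact pairs only
  ungroup() %>%
  select(-grp)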

Related

Select Random Consecutive Rows Per Group

I have data which is grouped by 'student_id':
my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
                     exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                     result = rnorm(15,60,10))
my_data
student_id exam_no result
1 1 1 56.60374
2 1 2 55.76655
3 1 3 53.81728
4 1 4 74.82202
5 1 5 34.91834
6 2 1 58.32422
7 2 2 60.38213
8 2 3 49.40390
9 2 4 63.85426
10 2 5 40.32912
11 3 1 69.54969
12 3 2 43.36639
13 3 3 37.97265
14 3 4 52.36436
15 3 5 61.62080
My Question:
For each student, I want to select a set of consecutive rows, with random start and end rows.
For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.
I thought of the following way to do this:
Create a data frame that contains the number of exams each student takes (in my problem, each student takes the same number of exams, but in the future this could differ)
library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())
# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)
From here, I was then going to write a loop that would select rows between "min" and "max" index for each student (e.g. my_data[min:max]), but the results from the previous code are giving me warnings and illogical results:
Warning message:
In 1:counts$counts :
numerical expression has 3 elements: only the first used
Warning messages:
1: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
# A tibble: 3 x 4
student_id counts min max
<dbl> <int> <int> <int>
1 1 5 4 5
2 2 5 4 5
3 3 5 4 5
I am not sure how to continue from here - can someone please show me how?
Thanks!
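For what it's worth, the warnings occur because the : operator expects length-1 endpoints: counts$counts has one element per student, so only its first element is used, which is also why every student ends up with the same min and max. One way to finish this approach (a sketch of my own, assuming exam_no runs 1 to n within each student, as in the example) is to draw the bounds per student with sample.int() and join them back:
library(dplyr)
counts <- my_data %>% group_by(student_id) %>% summarise(counts = n())
# one random start per student, then a random end between min and counts (inclusive)
counts$min <- sapply(counts$counts, function(n) sample.int(n, 1))
counts$max <- mapply(function(mn, n) mn + sample.int(n - mn + 1, 1) - 1,
                     counts$min, counts$counts)
my_data %>%
  inner_join(counts, by = "student_id") %>%
  filter(exam_no >= min, exam_no <= max) %>%
  select(student_id, exam_no, result)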
A base R option uses cumsum to label the rows lying between two randomly sampled positions within each group:
subset(
  my_data,
  ave(
    exam_no,
    student_id,
    FUN = function(x) cumsum(seq_along(x) %in% sample.int(length(x), 2))
  ) == 1
)
which gives, for example
student_id exam_no result
2 1 2 61.83643
3 1 3 51.64371
4 1 4 75.95281
6 2 1 51.79532
7 2 2 64.87429
8 2 3 67.38325
11 3 1 75.11781
12 3 2 63.89843
13 3 3 53.78759
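For intuition, here is the indicator logic on one group of five rows, supposing the two sampled positions happened to be 2 and 5; note that the second sampled row itself is excluded, because cumsum reaches 2 there:
cumsum(seq_along(1:5) %in% c(2, 5))        # 0 1 1 1 2
cumsum(seq_along(1:5) %in% c(2, 5)) == 1   # FALSE TRUE TRUE TRUE FALSE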
A more compact data.table version using the same idea is
library(data.table)
setDT(my_data)[, .SD[cumsum((1:.N) %in% sample.int(.N, 2)) == 1], student_id]
Using data.table, within each group, sample two values from .I (without replacement), and create a sequence of indices.
library(data.table)
setDT(my_data)
set.seed(3)
my_data[my_data[ , {ix = sample(.I, 2); ix[1]:ix[2]}, by = student_id]$V1]
# student_id exam_no result
# <num> <num> <num>
# 1: 1 5 74.05672
# 2: 1 4 49.37525
# 3: 1 3 67.41662
# 4: 1 2 67.64935
# 5: 2 4 55.15337
# 6: 2 3 58.95694
# 7: 3 4 50.79859
# 8: 3 3 53.66886
# 9: 3 2 47.01089

R: splitting dataframe into distinct subgroups containing sequence of groups

This question is similar to one already answered: R: Splitting dataframe into subgroups consisting of every consecutive 2 groups
However, rather than splitting into subgroups that have a type in common, I need to split into subgroups that contain two consecutive types and are distinct. The groups in my actual data have differing numbers of rows as well.
df <- data.frame(ID = c('1','1','1','1','1','1','1'),
                 Type = c('a','a','b','c','c','d','d'),
                 value = c(10,2,5,3,7,3,9))
ID Type value
1 1 a 10
2 1 a 2
3 1 b 5
4 1 c 3
5 1 c 7
6 1 d 3
7 1 d 9
So subgroup 1 would be Type a and b:
ID Type value
1 1 a 10
2 1 a 2
3 1 b 5
And subgroup 2 would be Type c and d:
ID Type value
4 1 c 3
5 1 c 7
6 1 d 3
7 1 d 9
I have tried manipulating the code from this previous example, but I can't figure out how to make this happen without having overlapping Types in each group. Any help would be greatly appreciated - thanks!
EDIT: thanks for pointing out I didn't actually include the correct link.
We can do a little manipulation of a dense_rank of the Type variable to make an appropriate grouping variable:
library(dplyr)
df %>%
  group_by(g = (dense_rank(match(Type, Type)) - 1) %/% 2) %>%
  group_split()
# [[1]]
# # A tibble: 3 × 4
# ID Type value g
# <chr> <chr> <dbl> <dbl>
# 1 1 a 10 0
# 2 1 a 2 0
# 3 1 b 5 0
#
# [[2]]
# # A tibble: 4 × 4
# ID Type value g
# <chr> <chr> <dbl> <dbl>
# 1 1 c 3 1
# 2 1 c 7 1
# 3 1 d 3 1
# 4 1 d 9 1
Explanation: match(Type, Type) converts Type into integers ordered by first appearance - but not dense (there can be gaps). dense_rank() makes that ranking dense (no gaps). We then subtract 1 to make it start at 0 and use %/% 2 to see how many 2s go into it, effectively grouping the Types into consecutive pairs.
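To see the intermediate steps on the example df (an illustration, not part of the original answer):
match(df$Type, df$Type)                                # 1 1 3 4 4 6 6 (first-appearance positions)
dplyr::dense_rank(match(df$Type, df$Type))             # 1 1 2 3 3 4 4
(dplyr::dense_rank(match(df$Type, df$Type)) - 1) %/% 2 # 0 0 0 1 1 1 1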
Here is a rle way, written as a function. Pass the data.frame and the split column name as a character string.
df <- data.frame(ID = c('1','1','1','1','1','1','1'),
                 Type = c('a','a','b','c','c','d','d'),
                 value = c(10,2,5,3,7,3,9))
split_two <- function(x, col) {
  # Run-length encode the column, then overwrite every even-numbered run's
  # value with the value of the preceding odd-numbered run, so consecutive
  # runs pair up two-by-two under a shared label.
  r <- rle(x[[col]])
  r$values[c(FALSE, TRUE)] <- r$values[c(TRUE, FALSE)]
  # inverse.rle() expands the paired labels back to one label per row.
  split(x, inverse.rle(r))
}
split_two(df, "Type")
#> $a
#> ID Type value
#> 1 1 a 10
#> 2 1 a 2
#> 3 1 b 5
#>
#> $c
#> ID Type value
#> 4 1 c 3
#> 5 1 c 7
#> 6 1 d 3
#> 7 1 d 9
Created on 2023-02-09 with reprex v2.0.2

Subset specific row and last row from data frame

I have a data frame which contains data relating to a score of different events. There can be a number of scoring events for one game. What I would like to do, is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need anymore information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
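As an aside, the last row per ID can also be grabbed without rle, using duplicated() with fromLast = TRUE (my own variant, not part of the answer above):
Data[!duplicated(Data$ID, fromLast = TRUE), ]
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11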
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
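For illustration, the inner grouped query alone returns the row numbers that the outer subset then uses (shown for the example data above):
df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]
#    ID V1
# 1:  1  5
# 2:  1  6
# 3:  2 11
# 4:  2 12
# 5:  2 13
# 6:  3 16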
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
                  FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
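Note that top_n() has since been superseded in dplyr; slice_max() is the current idiom for the first step:
lastrows <- Data %>% group_by(ID) %>% slice_max(Time, n = 1)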

How to replace the NAs in a data frame with the average value for each id

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
And I can know the average "nums" of each "id" by using:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
What I would like to do is to replace each NA with the average "nums" value for the corresponding id. For example, suppose the average "nums" for ids 1, 2, and 3 are 1000, 2000, and 3000, respectively. Then the NA with id == 3 would be replaced by 3000, and the last NA, with id == 1, would be replaced by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums),]$id
dat[is.na(dat$nums),]$nums <- id_avg[id_avg[,"id"] ==temp,]$nums
However, the second part
id_avg[id_avg[,"id"] ==temp,]$nums
is always NA, which means I always pass NA to the NAs I want to replace.
I don't know where I was wrong, or do you have better method to do this?
Thank you
You can fix it by indexing id_avg$nums with temp directly (this works here because the ids are 1, 2, and 3 and id_avg is sorted by id, so position coincides with id):
dat[is.na(dat$nums), ]$nums <- id_avg$nums[temp]
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
What you want is contained in the zoo package.
library(zoo)
na.aggregate.default(dat, by = dat$id)
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
Here is a dplyr way:
df %>%
  group_by(id) %>%
  mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = TRUE))))
# Source: local data frame [9 x 2]
# Groups: id [3]
# nums id
# <int> <int>
# 1 1233 1
# 2 3232 2
# 3 2334 3
# 4 3330 1
# 5 1445 3
# 6 3455 3
# 7 7632 2
# 8 2411 3
# 9 2281 1
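If nums were a double column rather than an integer one, a variant without the truncating as.integer() coercion could use coalesce() (a sketch, not from the original answer):
df %>%
  group_by(id) %>%
  mutate(nums = coalesce(nums, mean(nums, na.rm = TRUE)))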
You essentially want to merge the id_avg back to the original data frame by the id column, so you can also use match to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
# nums id
# 1: 1233.000 1
# 2: 3232.000 2
# 3: 2334.000 3
# 4: 3330.000 1
# 5: 1445.000 3
# 6: 3455.000 3
# 7: 7632.000 2
# 8: 2411.333 3
# 9: 2281.500 1

Percolation clustering

Consider the following groupings:
> data.frame(x = c(3:5,7:9,12:14), grp = c(1,1,1,2,2,2,3,3,3))
x grp
1 3 1
2 4 1
3 5 1
4 7 2
5 8 2
6 9 2
7 12 3
8 13 3
9 14 3
Let's say I don't know the grp values but only have a vector x. What is the easiest way to generate grp values, essentially an id field for groups of values that lie within a threshold of each other? Is this a percolation algorithm?
One option would be to check whether the difference between each value and the previous one is greater than 1, and take the cumulative sum of that indicator.
df1$grp <- cumsum(c(TRUE, diff(df1$x) > 1))
df1$grp
#[1] 1 1 1 2 2 2 3 3 3
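To generalise to an arbitrary threshold, the same idea can be wrapped in a small helper (the name group_by_gap and the tol argument are my own, for illustration; x is assumed sorted ascending):
group_by_gap <- function(x, tol = 1) {
  # start a new group whenever the gap to the previous value exceeds tol
  cumsum(c(TRUE, diff(x) > tol))
}
group_by_gap(c(3:5, 7:9, 12:14))
#> [1] 1 1 1 2 2 2 3 3 3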
