Unique ID for interconnected cases - r

I have the following data frame, that shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
library(data.table)
example <- data.table(
DebtorId = c(1,1,1,5,5,5,6,7),
DupDebtorId = c(2,3,4,1,2,3,7,6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
a <- newdt[i]$DebtorId
b <- min(which(groups[,-1] == a, arr.ind=TRUE)[,1])
newdt[i]$group <- b
}
Output:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are 2 problems in my approach:
1. From the output, you can see that it fails to recognize that case 5 belongs to group 1.
2. The final loop is agonizingly slow, which would render it useless for my use case of 1M rows in my original data, and going the traditional := way does not work with which().
I'm not sure whether my approach could be optimized, or there is a better way of doing this altogether.

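This functionality already exists in igraph, so if you don't need to implement it yourself, you can build a graph from your data frame and then extract the cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
# `example` is the data.table from the question; graph_from_data_frame() and
# components() are the current names for graph.data.frame() and clusters()
g <- graph_from_data_frame(example)
df_membership <- components(g)$membership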
stack(df_membership)
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
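If adding a dependency is not an option, the same grouping can be sketched in base R with a simple union-find over the pairs. This is only an illustration of the idea, not what igraph does internally; the names pairs, parent and find_root are made up for this example, and without path compression it will likely be much slower than igraph on 1M rows.

```r
# Pairs from the question; each row links two cases.
pairs <- data.frame(
  DebtorId    = c(1, 1, 1, 5, 5, 5, 6, 7),
  DupDebtorId = c(2, 3, 4, 1, 2, 3, 7, 6)
)

# Map every distinct case to an index 1..n.
ids  <- sort(unique(c(pairs$DebtorId, pairs$DupDebtorId)))
from <- match(pairs$DebtorId, ids)
to   <- match(pairs$DupDebtorId, ids)

# parent[i] points towards the representative (root) of i's component.
parent <- seq_along(ids)

# Follow parent pointers until a root is reached (hypothetical helper).
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Merge the components of the two endpoints of every pair.
for (k in seq_along(from)) {
  ra <- find_root(from[k])
  rb <- find_root(to[k])
  if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
}

# Number the components in order of first appearance.
roots  <- vapply(seq_along(ids), find_root, integer(1))
result <- data.frame(DebtorId = ids, group = match(roots, unique(roots)))
result
```

Here result has one row per case, sorted by DebtorId, in the same shape as the desired output in the question.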

Related

Select Random Consecutive Rows Per Group

I have data which is grouped by 'student_id':
my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
result = rnorm(15,60,10))
my_data
student_id exam_no result
1 1 1 56.60374
2 1 2 55.76655
3 1 3 53.81728
4 1 4 74.82202
5 1 5 34.91834
6 2 1 58.32422
7 2 2 60.38213
8 2 3 49.40390
9 2 4 63.85426
10 2 5 40.32912
11 3 1 69.54969
12 3 2 43.36639
13 3 3 37.97265
14 3 4 52.36436
15 3 5 61.62080
My Question:
For each student, I want to select a set of consecutive rows, with random start and end rows.
For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.
I thought of the following way to do this:
Create a data frame that contains the max number of exams each student takes (in my problem, each student takes the same number of exams, but in the future this could be different)
library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())
# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)
From here, I was then going to write a loop that would select rows between "min" and "max" index for each student (e.g. my_data[min:max]), but the results from the previous code are giving me warnings and illogical results:
Warning message:
In 1:counts$counts :
numerical expression has 3 elements: only the first used
Warning messages:
1: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
# A tibble: 3 x 4
student_id counts min max
<dbl> <int> <int> <int>
1 1 5 4 5
2 2 5 4 5
3 3 5 4 5
I am not sure how to continue this - can someone please show me how to continue?
Thanks!
A base R option using cumsum to label the in-between consecutive rows
subset(
my_data,
ave(
exam_no,
student_id,
FUN = function(x) cumsum(seq_along(x) %in% sample.int(length(x), 2))
) == 1
)
which gives, for example
student_id exam_no result
2 1 2 61.83643
3 1 3 51.64371
4 1 4 75.95281
6 2 1 51.79532
7 2 2 64.87429
8 2 3 67.38325
11 3 1 75.11781
12 3 2 63.89843
13 3 3 53.78759
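To see why the cumsum labelling picks out a consecutive block, here is a small worked sketch for a single student, with the two sampled positions fixed by hand (the values 2 and 4 are an assumption standing in for the result of sample.int()).

```r
x      <- 1:5                        # five exams for one student
picked <- c(2, 4)                    # stand-in for sample.int(length(x), 2)
marks  <- seq_along(x) %in% picked   # FALSE TRUE FALSE TRUE FALSE
labels <- cumsum(marks)              # 0 1 1 2 2
kept   <- x[labels == 1]             # rows labelled 1 form the block
kept
```

With this labelling the row at the second sampled position is not included, so the block runs from the first sampled position up to the row just before the second one.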
A more compact version by data.table with a similar idea as above is
library(data.table)
setDT(my_data)[, .SD[cumsum((1:.N) %in% sample.int(.N, 2)) == 1], student_id]
Using data.table, within each group, sample two values from .I (without replacement), and create a sequence of indices.
library(data.table)
setDT(my_data)
set.seed(3)
my_data[my_data[ , {ix = sample(.I, 2); ix[1]:ix[2]}, by = student_id]$V1]
# student_id exam_no result
# <num> <num> <num>
# 1: 1 5 74.05672
# 2: 1 4 49.37525
# 3: 1 3 67.41662
# 4: 1 2 67.64935
# 5: 2 4 55.15337
# 6: 2 3 58.95694
# 7: 3 4 50.79859
# 8: 3 3 53.66886
# 9: 3 2 47.01089
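The same idea can be sketched in base R by splitting the data per student and sampling inside each piece, which also avoids the vectorisation warning from the question. This is only an illustration: pick_window is a hypothetical helper, the bounds are sorted so the rows come back in ascending order, and it assumes every student has at least two rows.

```r
set.seed(42)
my_data <- data.frame(
  student_id = rep(1:3, each = 5),
  exam_no    = rep(1:5, times = 3),
  result     = rnorm(15, 60, 10)
)

# For one student's rows, draw two distinct positions and keep everything
# between them (inclusive).
pick_window <- function(d) {
  bounds <- sort(sample.int(nrow(d), 2))
  d[bounds[1]:bounds[2], , drop = FALSE]
}

# Apply the helper per student and stack the pieces again.
selected <- do.call(rbind, lapply(split(my_data, my_data$student_id), pick_window))
selected
```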

create list from columns of data table expression

Consider the following dt:
dt <- data.table(a=c(1,1,2,3),b=c(4,5,6,4))
It looks like this:
> dt
a b
1: 1 4
2: 1 5
3: 2 6
4: 3 4
Here I'm aggregating each column by its unique values and then counting how many times each unique value appears:
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 2
2: 2 1 5 1
3: 3 1 6 1
So 1 appears twice in dt and thus a.N is 2, the same logic goes on for the other values.
But the problem is that if these transformations of the original data.table have different dimensions at the end, things will get recycled.
For example this dt:
dt <- data.table(a=c(1,1,2,3,7),b=c(4,5,6,4,4))
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 3
2: 2 1 5 1
3: 3 1 6 1
4: 7 1 4 3
Warning message:
In as.data.table.list(jval, .named = NULL) :
Item 2 has 3 rows but longest item has 4; recycled with remainder.
That is no longer the right answer, because b.N should now have only 3 rows, and the vector got recycled.
This is why I would like to transform the expression dt[,lapply(.SD,function(agg) dt[,.N,by=agg])] into a list with different dimensions, with the names of the items in the list being the names of the columns in the new transformed dt.
A sketch of what I mean is:
newlist
$a.agg
1 2 3 7
$a.N
2 1 1 1
$b.agg
4 5 6 4
$b.N
3 1 1
An even better solution would be a data.table that keeps track of the source column in another column:
dt_final
agg N column
1 2 a
2 1 a
3 1 a
7 1 a
4 3 b
5 1 b
6 1 b
Get the data in long format and then aggregate by group.
library(data.table)
dt_long <- melt(dt, measure.vars = c('a', 'b'))
dt_long[, .N, .(variable, value)]
# variable value N
#1: a 1 2
#2: a 2 1
#3: a 3 1
#4: a 7 1
#5: b 4 3
#6: b 5 1
#7: b 6 1
In tidyverse -
library(dplyr)
library(tidyr)
dt %>%
pivot_longer(cols = everything()) %>%
count(name, value)
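For comparison, a base R sketch that first builds the list with different dimensions per column and then stacks it into the long table. The names counts and dt_final are chosen for illustration, and it assumes the columns are numeric so that the table names can be converted back with as.numeric().

```r
dt <- data.frame(a = c(1, 1, 2, 3, 7), b = c(4, 5, 6, 4, 4))

# One small data frame of counts per column; the pieces may differ in length.
counts <- lapply(dt, function(col) {
  tab <- table(col)
  data.frame(agg = as.numeric(names(tab)), N = as.integer(tab))
})

# Tag each piece with the column it came from and stack them.
dt_final <- do.call(rbind, Map(cbind, counts, column = names(counts)))
dt_final
```

Here counts is the list asked for in the question (one element per column), and dt_final is the long version with a column tracking the source.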

Select value from previous group based on condition

I have the following df
df<-data.frame(value = c(1,1,1,2,1,1,2,2,1,2),
group = c(5,5,5,6,7,7,8,8,9,10),
no_rows = c(3,3,3,1,2,2,2,2,1,1))
where identical consecutive values form a group, i.e., values in rows 1:3 fall under group 5. Column "no_rows" tells us how many rows/entries each group has, i.e., group 5 has 3 rows/entries.
I am trying to substitute all values, where no_rows < 2, with the value from a previous group. I expect my end df to look like this:
df_end<-data.frame(value = c(1,1,1,1,1,1,2,2,2,2),
group = c(5,5,5,6,7,7,8,8,9,10),
no_rows = c(3,3,3,1,2,2,2,2,1,1))
I came up with this combination of if...else in a for loop, which gives me the desired output, however it is very slow and I am looking for a way to optimise it.
for (i in 2:length(df$group)){
if (df$no_rows[i] < 2){
df$value[i] <- df$value[i-1]
}
}
I have also tried with dplyr::mutate and lag() but it does not give me the desired output (it only removes the first value per group instead of taking the value of a previous group).
df<-df%>%
group_by(group) %>%
mutate(value = ifelse(no_rows < 2, lag(value), value))
I have been looking for a solution for a few days now, but I could not find anything that fits my problem completely. Any ideas?
A data.table approach:
First, get the values of groups with no_rows >= 2, then fill in the missing values (NA) by last observation carried forward.
library(data.table)
# make it a data.table
setDT(df, key = "group")
# get values for groups of no_rows >= 2
df[no_rows >= 2, new_value := value][]
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 NA
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 NA
#10: 2 10 1 NA
# fill down missing values in new_value
setnafill(df, "locf", cols = c("new_value"))
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 1
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 2
#10: 2 10 1 2
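The carry-forward step can also be sketched in base R without a loop, using cummax() over row indices. This is an alternative illustration of the same last-observation-carried-forward idea, not part of data.table; keep and last_kept are names made up for this example.

```r
df <- data.frame(
  value   = c(1, 1, 1, 2, 1, 1, 2, 2, 1, 2),
  group   = c(5, 5, 5, 6, 7, 7, 8, 8, 9, 10),
  no_rows = c(3, 3, 3, 1, 2, 2, 2, 2, 1, 1)
)

# Rows whose own value should be kept.
keep <- df$no_rows >= 2

# For every row, the index of the most recent row that is kept.
last_kept <- cummax(ifelse(keep, seq_along(keep), 0L))

# Leading rows with nothing to carry forward become NA instead of being dropped.
last_kept[last_kept == 0] <- NA

df$new_value <- df$value[last_kept]
df
```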

create id variable from table of duplicates

I have a dataframe where each row has a unique identifier, but some rows are actually duplicates.
fdf <- data.frame(name = c("fred", "ferd", "frad", 'eric', "eirc", "george"),
id = 1:6)
fdf
#> name id
#> 1 fred 1
#> 2 ferd 2
#> 3 frad 3
#> 4 eric 4
#> 5 eirc 5
#> 6 george 6
I have determined which rows are duplicated and this information is stored in a second dataframe as pairs of the unique id's. So the key tells me row 1 is the same individual as rows 2 and 3, etc.
key <- data.frame(id1 = c(1,1,2,4), id2 = c(2,3,3,5))
key
#> id1 id2
#> 1 1 2
#> 2 1 3
#> 3 2 3
#> 4 4 5
I'm struggling to think up a straightforward way to use the key to create an id variable in my original dataframe. Desired output would be:
fdf$realid <- c(1,1,1,2,2,3)
fdf
#> name id realid
#> 1 fred 1 1
#> 2 ferd 2 1
#> 3 frad 3 1
#> 4 eric 4 2
#> 5 eirc 5 2
#> 6 george 6 3
Edit for clarity
Keys here are the set of true connections between rows in the data.frame fdf. Thus you can imagine starting with the set of all feasible connections:
# id1 id2
# 1 2
# 1 3
# 1 4
# ...
# 6 4
# 6 5
determining which are true connections (based on the other variables in each observation).
# id1 id2 match
# 1 2 match
# 1 3 no match
# 1 4 match
# ...
# 6 4 no match
# 6 5 no match
and sub-setting to the cases that are matches.
The easiest way would be to recreate the key data frame in the following format (i.e. which id belongs to which realid):
key_table <- data.frame(id = c(1, 2, 3, 4, 5, 6),
realid = c(1, 1, 1, 2, 2, 3))
Then it is just a matter of merging fdf and key_table together with merge:
fdf <- merge(fdf, key_table, by = "id")
fdf
id name realid
1 1 fred 1
2 2 ferd 1
3 3 frad 1
4 4 eric 2
5 5 eirc 2
6 6 george 3
I didn't find a 'straight forward way', but it seems to work well.
First you check which IDs are together in a group, by checking whether there's 'overlap', i.e. whether the intersection between two rows in key is non-empty:
check_overlap <- function(pair1, pair2){
newset <- intersect(pair1, pair2)
length(newset) != 0
}
Then we can apply this function to the rows in key against the other rows. If a row has been matched already, it is automatically removed from key, like this:
check_overlaps <- function(key){
cont <- data.frame()
i <- 1
while(nrow(key) > 0){
ids <- apply(key, 1, check_overlap, key[1, ])
vals <- unique(unlist(key[ids, ]))
key <- key[!ids, ]
cont <- rbind(cont, cbind(vals, rep(i, length(vals))))
i <- i+1
}
return(cont)
}
new_ids <- check_overlaps(key)
# vals V2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
The problem with merging fdf and new_ids, however, is that some old IDs may not occur in key, but they should be mapped to a new ID according to the new order. You can manipulate key a bit a priori and do:
for(val in unique(fdf$id)){
if(!(val %in% unlist(key))){
key <- rbind(key, c(val, val))
}
}
new_ids2 <- check_overlaps(key)
# vals V2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 3
Which is easy to merge with fdf like:
merge(fdf, new_ids2, by.x = "id", by.y = "vals")
# id name V2
# 1 1 fred 1
# 2 2 ferd 1
# 3 3 frad 1
# 4 4 eric 2
# 5 5 eirc 2
# 6 6 george 3
If I understand your question correctly it can be solved by creating groups of matching ids and creating a new (real) id out of these groups:
# determine the groups of ids
id_groups <- list()
i = 1
for (id in unique(key$id1)) {
if (!(id %in% unlist(id_groups))) {
id_groups[[i]] <- c(id, key$id2[key$id1 == id])
i = i + 1
}
}
# add ids without match
id_groups <- c(id_groups, setdiff(fdf$id, unlist(id_groups)))
# for every id in fdf, set real_id to index in id_groups to which id belongs
fdf$real_id <- sapply(fdf$id, function(id) {
which(sapply(id_groups, function(group) id %in% group))
})
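The loop above only collects ids that are directly listed next to an id1. If the key can contain chains (for example 1-2 and 2-3 without 1-3), one possible sketch is to propagate the smallest label along every pair until nothing changes. This is an assumption-laden illustration in base R: label is a made-up name, and the repeated passes can be slow on large keys.

```r
fdf <- data.frame(name = c("fred", "ferd", "frad", "eric", "eirc", "george"),
                  id = 1:6)
key <- data.frame(id1 = c(1, 1, 2, 4), id2 = c(2, 3, 3, 5))

# Start with every row as its own group, labelled by its id.
label <- fdf$id

repeat {
  old <- label
  for (k in seq_len(nrow(key))) {
    i <- match(key$id1[k], fdf$id)
    j <- match(key$id2[k], fdf$id)
    # Both ends of a pair take the smaller of their two labels.
    m <- min(label[i], label[j])
    label[i] <- m
    label[j] <- m
  }
  # Stop once a full pass changes nothing.
  if (identical(label, old)) break
}

# Renumber the labels 1, 2, 3, ... in order of first appearance.
fdf$realid <- match(label, unique(label))
fdf
```

Ids that never appear in key, such as 6 here, simply keep their own label and end up in a group of one.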

Summing the number of times a value appears in either of 2 columns

I have a large data set - around 32mil rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)){
index <- (dt$Origin == i | dt$Destination == i)
dt[dt$Tel ==i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Here N shows that Tel = 1 and Tel = 2 each appear once, while Tel = 3, 4 and 5 each appear twice.
We can do a melt and match
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
Or another option is to loop through the columns 2 and 3, use %in% to check whether the values in 'Tel' are present, then with Reduce and + get the sum of logical elements for each 'Tel', assign (:=) the values to 'N'
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
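The counting idea behind the melt-and-match answer can be shown in plain base R as a small sketch: stack both columns into one vector, look up each value's position in Tel, and tabulate those positions. It assumes Tel has no duplicates; all_calls is a name made up for this example.

```r
dt <- data.frame(Tel = 1:5, Origin = 1:5, Destination = 3:7)

# Every origin and destination in one long vector.
all_calls <- c(dt$Origin, dt$Destination)

# match() gives the row of Tel for each value (NA if it is not a known Tel);
# tabulate() counts the positions and ignores the NAs.
dt$N <- tabulate(match(all_calls, dt$Tel), nbins = nrow(dt))
dt
```

Setting nbins to the number of rows keeps N the right length even when the last Tel values never appear in either column.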
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than @akrun's, but can be useful to see.
# get temporary data.table as the sum of origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))),
c("Tel", "N"))
# turn the variables into integers (Tel is the name of the table above, and thus character)
temp <- temp[, lapply(temp, as.integer)]
Now join it to the original table on Tel:
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))
