Reformatting input data - r

I have a two-column edge file; the columns hold vertex ids and describe their connections.
I need to reformat it so that the vertex ids are consecutive and start at one.
Can anyone suggest how to do this automatically?
I would also need a conversion table with both the original and the new ids.
Your support is appreciated.

Here is another approach which uses factor() for renumbering:
# reshape from wide to long format using row numbers
tmp <- melt(setDT(DT)[, rn := .I], "rn", value.name = "old")[
# create new ids from factor levels
, new := as.integer(factor(old))][]
# reshape back to wide format again
dcast(tmp, rn ~ variable, value.var = "new")[, -"rn"]
v1 v2
1: 1 2
2: 1 4
3: 1 6
4: 2 3
5: 2 4
6: 2 6
7: 4 5
8: 4 6
9: 5 6
10: 6 7
The translation table can be created by
tmp[, unique(.SD), .SDcols = c("old", "new")]
old new
1: 23732 1
2: 23778 2
3: 23871 4
4: 58009 5
5: 58098 6
6: 23824 3
7: 58256 7
In order to reproduce exactly OP's new id numbering we need to rearrange factor levels using the fct_inorder() function from the forcats package:
tmp <- melt(DT[, rn := .I], "rn", value.name = "old")[
order(rn, variable), new := as.integer(forcats::fct_inorder(factor(old)))][]
dcast(tmp, rn ~ variable, value.var = "new")[, -"rn"]
v1 v2
1: 1 2
2: 1 3
3: 1 4
4: 2 5
5: 2 3
6: 2 4
7: 3 6
8: 3 4
9: 6 4
10: 4 7
Then, the translation table becomes
old new
1: 23732 1
2: 23778 2
3: 23871 3
4: 58009 6
5: 58098 4
6: 23824 5
7: 58256 7
DT <- fread(

This isn't quite what you asked for, as I sorted the node names before assigning IDs.
What I chose to do is get all of the unique node IDs, sort them, and assign them each to an integer.
df <- structure(list(v1 = c(23732L, 23732L, 23732L, 23778L, 23778L,
23778L, 23871L, 23871L, 58009L, 58098L), v2 = c(23778L, 23871L,
58098L, 23824L, 23871L, 58098L, 58009L, 58098L, 58098L, 58256L
)), .Names = c("v1", "v2"), class = "data.frame", row.names = c(NA,
-10L))
# Put nodes in ascending order
df <- df[order(df$v1, df$v2), ]
# create a mapping of node number to node ID (as a vector)
# All unique nodes between the two columns, sorted
node_names <- sort(unique(c(df$v1, df$v2)))
# a vector of integers from 1 to length(node_names)
node_id <- seq_along(node_names)
# assign (map) the node names to the integer values
names(node_id) <- node_names
# Add the node IDs to df
df$v1_id <- node_id[as.character(df$v1)]
df$v2_id <- node_id[as.character(df$v2)]
v1 v2 v1_id v2_id
1 23732 23778 1 2
2 23732 23871 1 4
3 23732 58098 1 6
4 23778 23824 2 3
5 23778 23871 2 4
6 23778 58098 2 6
7 23871 58009 4 5
8 23871 58098 4 6
9 58009 58098 5 6
10 58098 58256 6 7
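For comparison, here is a minimal base-R sketch that reproduces the first-appearance numbering from the question without any reshaping. It assumes the edge list is the data frame df with columns v1 and v2 shown above; the names ids, renumbered and lookup are made up for this illustration.

```r
# Edge list as given in the answer above
df <- data.frame(
  v1 = c(23732L, 23732L, 23732L, 23778L, 23778L, 23778L, 23871L, 23871L, 58009L, 58098L),
  v2 = c(23778L, 23871L, 58098L, 23824L, 23871L, 58098L, 58009L, 58098L, 58098L, 58256L)
)

# Read the ids row by row (v1, v2, v1, v2, ...) and keep the first occurrence of each
ids <- unique(c(t(as.matrix(df))))

# match() returns the position of each old id in ids, which serves as its new id
renumbered <- data.frame(v1 = match(df$v1, ids), v2 = match(df$v2, ids))

# Conversion table with both the original and the new ids
lookup <- data.frame(old = ids, new = seq_along(ids))
```

Replacing the ids line with sort(unique(c(df$v1, df$v2))) should give the sorted numbering used in the answers above instead.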


Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
example <- data.table(
  DebtorId = c(1,1,1,5,5,5,6,7),
  DupDebtorId = c(2,3,4,1,2,3,7,6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
  a <- newdt[i]$DebtorId
  b <- min(which(groups[,-1] == a, arr.ind=TRUE)[,1])
  newdt[i]$group <- b
}
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are 2 problems with my approach:
1. From the output, you can see that it fails to recognize that case 5 belongs to group 1.
2. The final loop is agonizingly slow, which would render it useless for my use case of 1M rows in my original data, and going the traditional := way does not work with which().
I'm not sure whether my approach could be optimized or whether there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, we can build a graph from your data frame and then extract cluster membership. stack() is just an easy way to convert a named vector to data frame.
g <-
df_membership <- clusters(g)$membership
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
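The graph-building step described above could look roughly like the following sketch. It assumes the igraph package is installed and uses graph_from_data_frame() and components(), which recent igraph versions provide alongside clusters(); the names example, membership and result are only for illustration and are not taken from the answer.

```r
library(igraph)

# Pairs of connected cases from the question
example <- data.frame(
  DebtorId    = c(1, 1, 1, 5, 5, 5, 6, 7),
  DupDebtorId = c(2, 3, 4, 1, 2, 3, 7, 6)
)

# Treat each row as an undirected edge between two cases
g <- graph_from_data_frame(example, directed = FALSE)

# components() labels every vertex with the id of its connected component
membership <- components(g)$membership

# Vertex names come back as character, so convert them to numbers again
result <- data.frame(
  DebtorId = as.numeric(names(membership)),
  group    = as.integer(membership)
)
result <- result[order(result$DebtorId), ]
```

Because membership is computed on the whole graph in one call, there is no row-by-row loop, which is the part that was slow in the question.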

Merge multiple numeric column as list typed column in data.table [R]

I'm trying to find a way to merge multiple numeric columns into a new list-type column.
Data Table
dt <- data.table(
Expected Result
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Attempt 1
I have tried appending with a list via dt[,d:=list(c(a,b,c))], but it just concatenates everything and gives an incorrect result:
a b c d
1: 1 4 7 1,2,3,4,5,6,...
2: 2 5 8 1,2,3,4,5,6,...
3: 3 6 9 1,2,3,4,5,6,...
Group by row and place the elements in a list:
dt[, d := .(list(unlist(.SD, recursive = FALSE))), 1:nrow(dt)]
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Or another option is paste and strsplit
dt[, d := strsplit(do.call(paste, c(.SD, sep=",")), ",")]
Or may use transpose
dt[, d := lapply(data.table::transpose(unname(.SD)), unlist)]
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Or use pmap from the purrr package
dt[, d := purrr::pmap(.SD, ~c(...))]
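As one more variant, here is a small self-contained sketch using base Map(). The input table is an assumption reconstructed from the printed output in the question, since the original definition is not shown.

```r
library(data.table)

# Assumed input, reconstructed from the printed tables in the question
dt <- data.table(a = 1:3, b = 4:6, c = 7:9)

# Map() walks the three columns in parallel and builds one vector per row;
# wrapping the result in .() stores it as a single list column
dt[, d := .(Map(function(...) c(...), a, b, c))]
```

Unlike the paste and strsplit variant, which produces character vectors, this keeps the list elements as integers.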

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
x y v
1: b 1 1
2: b 3 2
Somehow it displays just 2 rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works fine.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
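The sketch below illustrates why the attempt in the question kept only two rows, assuming the call was DT[order(keycol)] (the call itself is not shown in the question), and how the order argument of setorderv can set a direction per column. The variable idx is introduced only for this illustration.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = rep(c(1, 3, 6), 3), v = 1:9)
keycol <- c("x", "y")

# order() sorts the two strings "x" and "y" themselves, not the columns they name,
# so this returns c(1, 2); using it in i would keep only rows 1 and 2
idx <- order(keycol)

# setorderv() takes the names as a character vector; order = sets the direction
# per column (1 ascending, -1 descending). DT is reordered by reference.
setorderv(DT, keycol, order = c(1L, -1L))
```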

How to replace a certain value in one data.table with values of another data.table of same dimension

Given two data.tables:
dt1 <- data.table(id = c(1,-99,2,2,-99), a = c(2,1,-99,-99,3), b = c(5,3,3,2,5), c = c(-99,-99,-99,2,5))
dt2 <- data.table(id = c(2,3,1,4,3),a = c(6,4,3,2,6), b = c(3,7,8,8,3), c = c(2,2,4,3,2))
> dt1
id a b c
1: 1 2 5 -99
2: -99 1 3 -99
3: 2 -99 3 -99
4: 2 -99 2 2
5: -99 3 5 5
> dt2
id a b c
1: 2 6 3 2
2: 3 4 7 2
3: 1 3 8 4
4: 4 2 8 3
5: 3 6 3 2
How can one replace the -99 of dt1 with the values of dt2?
Wanted results should be dt3:
> dt3
id a b c
1: 1 2 5 2
2: 3 1 3 2
3: 2 3 3 4
4: 2 2 2 2
5: 3 3 5 5
You can do the following:
dt3 <-
dt2 <-
dt3[dt3 == -99] <- dt2[dt3 == -99]
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5
If your data is all of the same type (as in your example), then transforming them to matrices is a lot faster and more transparent:
dt1a <- as.matrix(dt1) ## convert to matrix
dt2a <- as.matrix(dt2)
# make a matrix of the same shape to access the right entries
missing_idx <- dt1a == -99
dt1a[missing_idx] <- dt2a[missing_idx] ## overwrite the flagged entries with the values from dt2a
This is a vectorized operation, so it should be fast.
Note: If you do this make sure the two data sources match exactly in shape and order of rows/columns. If they don't then you need to join by the relevant keys and pick the correct columns.
EDIT: The conversion to matrix may be unnecessary. See kath's answer for a more terse solution.
A simple way could be to use the setDF function to convert to data.frame and use data frame subsetting methods. Restore to data.table at the end.
# Change to data.frame
# Perform assignment
dt1[dt1==-99] = dt2[dt1==-99]
# Restore back to data.table
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5
This simple trick would work efficiently.
index.replace = dt1==-99
dt1[index.replace] = dt2[index.replace]
This should work, a simple approach:
for (i in 1:nrow(dt1)){
  for (j in 1:ncol(dt1)){
    # [[j]][i] reads a single cell for both data.frame and data.table
    if (dt1[[j]][i] == -99) dt1[[j]][i] = dt2[[j]][i]
  }
}
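A column-wise alternative that stays within data.table is sketched below. It assumes, like the answers above, that dt1 and dt2 have the same shape and row order; working on a copy named dt3 is a choice made for this illustration, and col and idx are illustrative names.

```r
library(data.table)

dt1 <- data.table(id = c(1, -99, 2, 2, -99), a = c(2, 1, -99, -99, 3),
                  b = c(5, 3, 3, 2, 5), c = c(-99, -99, -99, 2, 5))
dt2 <- data.table(id = c(2, 3, 1, 4, 3), a = c(6, 4, 3, 2, 6),
                  b = c(3, 7, 8, 8, 3), c = c(2, 2, 4, 3, 2))

# Work on a copy so dt1 itself is left untouched
dt3 <- copy(dt1)

for (col in names(dt3)) {
  # Rows where this column holds the placeholder value
  idx <- which(dt3[[col]] == -99)
  # set() overwrites just those cells by reference
  if (length(idx) > 0L) {
    set(dt3, i = idx, j = col, value = dt2[[col]][idx])
  }
}
```

This loops over columns rather than individual cells, so the work inside each iteration is vectorized.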

How do I create a panel dataset out of transition date data in R?

I have a dataset that is structured as follows:
ID origin destination time
1 a b 2
2 b a 1
2 a c 4
3 c b 1
3 b c 3
I would like to turn this into a ID-time panel dataset like:
ID location time
1 a 1
1 b 2
1 b 3
1 b 4
2 a 1
2 a 2
2 a 3
2 c 4
3 b 1
3 b 2
3 c 3
3 c 4
So basically, I need to create the panel rows for when a subject doesn't change location, and fill in the location they are supposed to be at based on the info on origin and destinations. Is there any function in R that can do this smoothly? I'd prefer solutions using data.table or dplyr.
You could make a table with every time for which you want to know the location of each ID:
newDT = DT[, CJ(ID = unique(ID), time = 1:4)]
Then put the original data in long format, inferring that
origin holds for time-1
destination holds for time
mDT = melt(DT, id = c("ID", "time"), value.name = "loc", variable.name = "loc_role")
mDT[loc_role == "origin", time := time - 1L]
mDT[, loc_role := NULL]
setorder(mDT, ID, time)
ID time loc
1: 1 1 a
2: 1 2 b
3: 2 0 b
4: 2 1 a
5: 2 3 a
6: 2 4 c
7: 3 0 c
8: 3 1 b
9: 3 2 b
10: 3 3 c
...and fill in the new table with a rolling update join:
newDT[, location := mDT[.SD, on=.(ID, time), roll=TRUE, x.loc]]
ID time location
1: 1 1 a
2: 1 2 b
3: 1 3 b
4: 1 4 b
5: 2 1 a
6: 2 2 a
7: 2 3 a
8: 2 4 c
9: 3 1 b
10: 3 2 b
11: 3 3 c
12: 3 4 c
(dplyr doesn't have rolling or update joins yet, so I guess there's no analogue.)
How it works
CJ takes the Cartesian product of some vectors, similar to expand.grid
melt transforms to long form, keeping variables passed as id =
x[i, v := expr] edits column v of table x on rows selected by i
setorder sorts in place
.SD in j of x[i,j] refers to the subset of data (x) selected by i
x[i, on=, roll=, expr] is a rolling join, with rows selected by table i, on= and roll=
The expression x.v inside a join selects column v from x
Regarding the last bullet, the prefix i.* would do the same thing for columns from i.
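To see the rolling join in isolation, here is a small sketch restricted to IDs 1 and 2. The table names known and grid are made up for this illustration; the rows of known are copied from the printed mDT above.

```r
library(data.table)

# Known locations: the rows for IDs 1 and 2 of mDT above
known <- data.table(
  ID   = c(1L, 1L, 2L, 2L, 2L, 2L),
  time = c(1L, 2L, 0L, 1L, 3L, 4L),
  loc  = c("a", "b", "b", "a", "a", "c")
)

# Every ID/time combination we want a location for
grid <- CJ(ID = 1:2, time = 1:4)

# For each row of grid, match on ID and take the latest known time at or before
# the requested time (roll = TRUE carries the last observation forward)
grid[, location := known[.SD, on = .(ID, time), roll = TRUE, x.loc]]
```

Times without an exact match in known, such as time 3 for ID 1, pick up the location from the most recent earlier row.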
A similar method to Frank's solution, but using two joins would be:
res <- setDT(expand.grid(ID = unique(dt$ID), time = 1:4))
#Get origin
res[dt[,.(ID, origin, time = time - 1L)], location := origin, on = .(ID = ID, time = time)]
#Update origin and destination
res[dt, location := destination, on = c("ID", "time")][, location := zoo::na.locf(location), by = ID][order(ID, time)]
# ID time location
#1: 1 1 a
#2: 1 2 b
#3: 1 3 b
#4: 1 4 b
#5: 2 1 a
#6: 2 2 a
#7: 2 3 a
#8: 2 4 c
#9: 3 1 b
#10: 3 2 b
#11: 3 3 c
#12: 3 4 c
I don't think you need to do fancy joins for this problem:
maxt = max(dt$time)
dt[, .(location = c(rep(origin[1], time[1] - 1), rep(destination, diff(c(time, maxt + 1)))),
time = 1:maxt), by = ID]
# ID location time
# 1: 1 a 1
# 2: 1 b 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 1
# 6: 2 a 2
# 7: 2 a 3
# 8: 2 c 4
# 9: 3 b 1
#10: 3 b 2
#11: 3 c 3
#12: 3 c 4
I've assumed that within a single ID next origin is the same as previous destination, as per the OP example.
