Reformatting input data - r

I have a data set with two columns (an edge file) holding vertex ids and their connections:
v1,v2
23732,23778
23732,23871
23732,58098
23778,23824
23778,23871
23778,58098
23871,58009
23871,58098
58009,58098
58098,58256
I need to reformat it so that the vertex ids are consecutive and start at one, like this:
v1,v2
1,2
1,3
1,4
2,5
2,3
2,4
3,5
3,4
5,4
4,6
Can anyone suggest how to do this automatically?
I would also need a conversion table with both the original and the new ids.
Your support is appreciated.

Here is another approach which uses factor() for renumbering:
library(data.table)
# reshape from wide to long format using row numbers
tmp <- melt(setDT(DT)[, rn := .I], "rn", value.name = "old")[
# create new ids from factor levels
, new := as.integer(factor(old))][]
# reshape back to wide format again
dcast(tmp, rn ~ variable, value.var = "new")[, -"rn"]
v1 v2
1: 1 2
2: 1 4
3: 1 6
4: 2 3
5: 2 4
6: 2 6
7: 4 5
8: 4 6
9: 5 6
10: 6 7
The translation table can be created by
tmp[, unique(.SD), .SDcols = c("old", "new")]
old new
1: 23732 1
2: 23778 2
3: 23871 4
4: 58009 5
5: 58098 6
6: 23824 3
7: 58256 7
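To number the ids in order of first appearance, as the OP's expected output does, we need to rearrange the factor levels using the fct_inorder() function from the forcats package. (The OP's expected output assigns 5 to both 23824 and 58009, so the result below differs from it in rows 7, 9 and 10.)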
tmp <- melt(DT[, rn := .I], "rn", value.name = "old")[
order(rn, variable), new := as.integer(forcats::fct_inorder(factor(old)))][]
dcast(tmp, rn ~ variable, value.var = "new")[, -"rn"]
v1 v2
1: 1 2
2: 1 3
3: 1 4
4: 2 5
5: 2 3
6: 2 4
7: 3 6
8: 3 4
9: 6 4
10: 4 7
Then, the translation becomes
old new
1: 23732 1
2: 23778 2
3: 23871 3
4: 58009 6
5: 58098 4
6: 23824 5
7: 58256 7
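The difference between the two numberings above comes down to the order of the factor levels. The following base R sketch illustrates this on a short, made-up vector (old_ids is a hypothetical name, not part of the answer): factor() sorts its levels by default, while passing levels = unique(x) keeps them in order of first appearance, which is roughly what fct_inorder() achieves.

```r
# A few ids in the order they would be read from the edge list (illustrative only)
old_ids <- c(23732, 23778, 23732, 23871, 23732, 58098, 23778, 23824)

# Default: levels are sorted, so the new id is the rank of the old id
by_value <- as.integer(factor(old_ids))

# Levels in order of first appearance, so the new id is the order in which the id is first seen
by_appearance <- as.integer(factor(old_ids, levels = unique(old_ids)))

by_value       # 1 2 1 4 1 5 2 3
by_appearance  # 1 2 1 3 1 4 2 5
```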
Data
library(data.table)
DT <- fread(
"v1,v2
23732,23778
23732,23871
23732,58098
23778,23824
23778,23871
23778,58098
23871,58009
23871,58098
58009,58098
58098,58256"
)

This isn't quite what you asked for, as I sorted the node names before assigning IDs.
What I chose to do is get all of the unique node IDs, sort them, and assign them each to an integer.
df <- structure(list(v1 = c(23732L, 23732L, 23732L, 23778L, 23778L,
23778L, 23871L, 23871L, 58009L, 58098L), v2 = c(23778L, 23871L,
58098L, 23824L, 23871L, 58098L, 58009L, 58098L, 58098L, 58256L
)), .Names = c("v1", "v2"), class = "data.frame", row.names = c(NA,
-10L))
# Put nodes in ascending order
df <- df[order(df$v1, df$v2), ]
# create a mapping of node number to node ID (as a vector)
# All unique nodes between the two columns, sorted
node_names <- sort(unique(c(df$v1, df$v2)))
# a vector of integers from 1 to length(node_names)
node_id <- seq_along(node_names)
# assign (map) the node names to the integer values
names(node_id) <- node_names
# Add the node IDs to df
df$v1_id <- node_id[as.character(df$v1)]
df$v2_id <- node_id[as.character(df$v2)]
df
v1 v2 v1_id v2_id
1 23732 23778 1 2
2 23732 23871 1 4
3 23732 58098 1 6
4 23778 23824 2 3
5 23778 23871 2 4
6 23778 58098 2 6
7 23871 58009 4 5
8 23871 58098 4 6
9 58009 58098 5 6
10 58098 58256 6 7
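If the ids should instead be numbered in order of first appearance, a base R variant of the same lookup idea can use match(). This is only a sketch under the assumption that the edge list is read row by row; the names edges, old_ids, lookup and renumbered are made up for illustration.

```r
edges <- data.frame(
  v1 = c(23732, 23732, 23732, 23778, 23778, 23778, 23871, 23871, 58009, 58098),
  v2 = c(23778, 23871, 58098, 23824, 23871, 58098, 58009, 58098, 58098, 58256)
)

# Unique ids in order of first appearance, reading row by row (v1, v2, v1, v2, ...)
old_ids <- unique(as.vector(t(as.matrix(edges))))

# Conversion table: the position in old_ids is the new id
lookup <- data.frame(old = old_ids, new = seq_along(old_ids))

# match() returns the position of each old id in old_ids, i.e. its new id
renumbered <- data.frame(
  v1 = match(edges$v1, old_ids),
  v2 = match(edges$v2, old_ids)
)
```

Here lookup doubles as the translation table, and renumbered holds the edge list with consecutive ids starting at one.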

Related

Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
library(data.table)
example <- data.table(
DebtorId = c(1,1,1,5,5,5,6,7),
DupDebtorId = c(2,3,4,1,2,3,7,6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
a <- newdt[i]$DebtorId
b <- min(which(groups[,-1] == a, arr.ind=TRUE)[,1])
newdt[i]$group <- b
}
Output:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are 2 problems in my approach:
From the output, you can see that it fails to recognize that case 5 belongs to group 1;
The final loop is agonizingly slow, which would render it useless for my use case of 1M rows in my original data, and going the traditional := way does not work with which()
I'm not sure whether my approach could be optimized, or there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to implement it yourself, you can build a graph from your data frame and then extract the cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
# example is the data.table from the question
# (older igraph versions call these functions graph.data.frame() and clusters())
g <- graph_from_data_frame(example)
df_membership <- components(g)$membership
stack(df_membership)
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
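For readers who want to see what the grouping amounts to without igraph, here is a minimal base R sketch of a union-find (disjoint set) pass over the pairs. It is a different technique from the igraph call above, and not a tuned implementation: pure R loops like this are likely to be slow on 1M rows. The names edges, ids, parent, find_root and result are all made up for illustration.

```r
edges <- data.frame(
  DebtorId    = c(1, 1, 1, 5, 5, 5, 6, 7),
  DupDebtorId = c(2, 3, 4, 1, 2, 3, 7, 6)
)

ids <- sort(unique(c(edges$DebtorId, edges$DupDebtorId)))
parent <- seq_along(ids)  # every case starts as its own group

# Follow parent links until reaching a case that is its own parent
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Merge the groups of the two cases in each pair
for (k in seq_len(nrow(edges))) {
  a <- find_root(match(edges$DebtorId[k], ids))
  b <- find_root(match(edges$DupDebtorId[k], ids))
  if (a != b) parent[max(a, b)] <- min(a, b)
}

# Relabel the roots as consecutive group ids
roots <- vapply(seq_along(ids), find_root, integer(1))
result <- data.frame(DebtorId = ids, group = match(roots, unique(roots)))
```

On the example data, result puts cases 1 to 5 in group 1 and cases 6 and 7 in group 2, which is the desired output from the question.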

Merge multiple numeric column as list typed column in data.table [R]

I'm trying to find a way to merge multiple numeric columns into a new list-type column.
Data Table
dt <- data.table(
a=c(1,2,3),
b=c(4,5,6),
c=c(7,8,9)
)
Expected Result
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Attempt 1
I tried appending with a list, using dt[,d:=list(c(a,b,c))], but it concatenates all the values instead and gives an incorrect result:
a b c d
1: 1 4 7 1,2,3,4,5,6,...
2: 2 5 8 1,2,3,4,5,6,...
3: 3 6 9 1,2,3,4,5,6,...
Group by row and place the elements in a list:
dt[, d := .(list(unlist(.SD, recursive = FALSE))), 1:nrow(dt)]
Output:
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Another option is paste and strsplit (note that the list elements are then character, not numeric)
dt[, d := strsplit(do.call(paste, c(.SD, sep=",")), ",")]
Or may use transpose
dt[, d := lapply(data.table::transpose(unname(.SD)), unlist)]
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Or use pmap from purrr
dt[, d := purrr::pmap(.SD, ~c(...))]
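To see why the first attempt goes wrong and why the row-wise answers work, it may help to compare the two operations on plain vectors. This is a base R sketch only; c_col, whole and by_row are hypothetical names, and Map() is used here as a stand-in for the row-wise grouping, not as one of the answers above.

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c_col <- c(7, 8, 9)

# c() on whole columns concatenates them into one long vector
whole <- c(a, b, c_col)

# Map() calls c() once per position, giving one short vector per row
by_row <- Map(c, a, b, c_col)

length(whole)  # 9
by_row[[1]]    # 1 4 7
```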

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
For some reason it displays just 2 rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works fine.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
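As for why the original attempt returned only two rows: order(keycol) sorts the character vector c("x", "y") itself and returns 1 2, so the table is subset to its first two rows. The base R sketch below shows this, plus one way to pass the named columns to order() with do.call(). It uses a plain data.frame (df and sorted are illustrative names) and is not a replacement for setorderv().

```r
keycol <- c("x", "y")

# order() here ranks the two strings "x" and "y", not the columns they name
idx <- order(keycol)
idx  # 1 2

df <- data.frame(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Pass the actual columns to order(), one argument per key column
sorted <- df[do.call(order, unname(as.list(df[keycol]))), ]
sorted$v  # 4 5 6 1 2 3 7 8 9
```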

How to replace a certain value in one data.table with values of another data.table of same dimension

Given two data.table:
dt1 <- data.table(id = c(1,-99,2,2,-99), a = c(2,1,-99,-99,3), b = c(5,3,3,2,5), c = c(-99,-99,-99,2,5))
dt2 <- data.table(id = c(2,3,1,4,3),a = c(6,4,3,2,6), b = c(3,7,8,8,3), c = c(2,2,4,3,2))
> dt1
id a b c
1: 1 2 5 -99
2: -99 1 3 -99
3: 2 -99 3 -99
4: 2 -99 2 2
5: -99 3 5 5
> dt2
id a b c
1: 2 6 3 2
2: 3 4 7 2
3: 1 3 8 4
4: 4 2 8 3
5: 3 6 3 2
How can one replace the -99 of dt1 with the values of dt2?
Wanted results should be dt3:
> dt3
id a b c
1: 1 2 5 2
2: 3 1 3 2
3: 2 3 3 4
4: 2 2 2 2
5: 3 3 5 5
You can do the following:
dt3 <- as.data.frame(dt1)
dt2 <- as.data.frame(dt2)
dt3[dt3 == -99] <- dt2[dt3 == -99]
dt3
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5
If your data is all of the same type (as in your example), then converting both tables to matrices is a lot faster and more transparent:
dt1a <- as.matrix(dt1) ## convert to matrix
dt2a <- as.matrix(dt2)
# make a matrix of the same shape to access the right entries
missing_idx <- dt1a == -99
dt1a[missing_idx] <- dt2a[missing_idx] ## overwrite the flagged entries
This is a vectorized operation, so it should be fast.
Note: If you do this make sure the two data sources match exactly in shape and order of rows/columns. If they don't then you need to join by the relevant keys and pick the correct columns.
EDIT: The conversion to matrix may be unnecessary. See kath's answer for a more terse solution.
A simple way is to use the setDF function to convert to data.frame and use data frame subsetting methods, then restore to data.table at the end.
# Change to data.frame
setDF(dt1)
setDF(dt2)
# Perform assignment
dt1[dt1==-99] = dt2[dt1==-99]
# Restore back to data.table
setDT(dt1)
setDT(dt2)
dt1
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5
This simple trick works efficiently:
dt1 <- as.matrix(dt1)
dt2 <- as.matrix(dt2)
index.replace <- dt1 == -99
dt1[index.replace] <- dt2[index.replace]
# convert back and keep the results
dt1 <- as.data.table(dt1)
dt2 <- as.data.table(dt2)
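A simple, though slow, approach is a double loop over rows and columns:
for (i in seq_len(nrow(dt1))) {
for (j in seq_len(ncol(dt1))) {
# in a data.table, dt1[i, j] does not look up column number j,
# so read the cell with [[j]][i] and assign with set()
if (dt1[[j]][i] == -99) set(dt1, i = i, j = j, value = dt2[[j]][i])
}
}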
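When the columns are not all of one type, a column-by-column replacement avoids the matrix conversion. The following is a base R sketch on plain data frames, assuming (as the note above stresses) that both tables have the same shape and row order; df1, df2 and df3 are illustrative names.

```r
df1 <- data.frame(id = c(1, -99, 2, 2, -99), a = c(2, 1, -99, -99, 3),
                  b = c(5, 3, 3, 2, 5), c = c(-99, -99, -99, 2, 5))
df2 <- data.frame(id = c(2, 3, 1, 4, 3), a = c(6, 4, 3, 2, 6),
                  b = c(3, 7, 8, 8, 3), c = c(2, 2, 4, 3, 2))

# Guard against silently misaligned inputs
stopifnot(identical(dim(df1), dim(df2)), identical(names(df1), names(df2)))

# For each pair of columns, take the df2 value wherever df1 holds -99
df3 <- df1
df3[] <- Map(function(x, y) ifelse(x == -99, y, x), df1, df2)
df3
```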

How do I create a panel dataset out of transition date data in R?

I have a dataset that is structured as follows:
ID origin destination time
1 a b 2
2 b a 1
2 a c 4
3 c b 1
3 b c 3
I would like to turn this into a ID-time panel dataset like:
ID location time
1 a 1
1 b 2
1 b 3
1 b 4
2 a 1
2 a 2
2 a 3
2 c 4
3 b 1
3 b 2
3 c 3
3 c 4
So basically, I need to create the panel rows for when a subject doesn't change location, and fill in the location they are supposed to be at based on the info on origin and destinations. Is there any function in R that can do this smoothly? I'd prefer solutions using data.table or dplyr.
You could make a table with every time for which you want to know the location of each ID:
newDT = DT[, CJ(ID = unique(ID), time = 1:4)]
Then put the original data in long format, inferring that
origin holds for time-1
destination holds for time
mDT = melt(DT, id = c("ID", "time"), value.name = "loc", variable.name = "loc_role")
mDT[loc_role == "origin", time := time - 1L]
mDT[, loc_role := NULL]
setorder(mDT, ID, time)
ID time loc
1: 1 1 a
2: 1 2 b
3: 2 0 b
4: 2 1 a
5: 2 3 a
6: 2 4 c
7: 3 0 c
8: 3 1 b
9: 3 2 b
10: 3 3 c
...and fill in the new table with a rolling update join:
newDT[, location := mDT[.SD, on=.(ID, time), roll=TRUE, x.loc]]
ID time location
1: 1 1 a
2: 1 2 b
3: 1 3 b
4: 1 4 b
5: 2 1 a
6: 2 2 a
7: 2 3 a
8: 2 4 c
9: 3 1 b
10: 3 2 b
11: 3 3 c
12: 3 4 c
(At the time this was written, dplyr had no rolling or update joins, so there was no direct analogue.)
How it works
CJ takes the Cartesian product of some vectors, similar to expand.grid
melt transforms to long form, keeping variables passed as id =
x[i, v := expr] edits column v of table x on rows selected by i
setorder sorts in place
.SD in j of x[i,j] refers to the subset of data (x) selected by i
x[i, on=, roll=, expr] is a rolling join, with rows selected by table i, on= and roll=
The expression x.v inside a join selects column v from x
Regarding the last bullet, the prefix i.* would do the same thing for columns from i.
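As a rough illustration of what roll=TRUE does in this join, the sketch below reproduces the lookup for ID 2 in base R with findInterval(): for each requested time it picks the most recent known row at or before that time. This only mimics the effect on one ID and is not how data.table implements rolling joins; known_time, known_loc, query_time and location are made-up names.

```r
# Rows of mDT for ID 2: the location known at each time
known_time <- c(0, 1, 3, 4)
known_loc  <- c("b", "a", "a", "c")

# Times we want a location for
query_time <- 1:4

# Index of the last known time that is <= each query time
idx <- findInterval(query_time, known_time)
location <- known_loc[idx]

idx       # 2 2 3 4
location  # "a" "a" "a" "c"
```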
A similar method to Frank's solution, but using two joins would be:
library(data.table)
res <- setDT(expand.grid(ID = unique(dt$ID), time = 1:4))
#Get origin
res[dt[,.(ID, origin, time = time - 1L)], location := origin, on = .(ID = ID, time = time)]
#Update origin and destination
res[dt, location := destination, on = c("ID", "time")][, location := zoo::na.locf(location), by = ID][order(ID, time)]
# ID time location
#1: 1 1 a
#2: 1 2 b
#3: 1 3 b
#4: 1 4 b
#5: 2 1 a
#6: 2 2 a
#7: 2 3 a
#8: 2 4 c
#9: 3 1 b
#10: 3 2 b
#11: 3 3 c
#12: 3 4 c
I don't think you need to do fancy joins for this problem:
maxt = max(dt$time)
dt[, .(location = c(rep(origin[1], time[1] - 1), rep(destination, diff(c(time, maxt + 1)))),
time = 1:maxt), by = ID]
# ID location time
# 1: 1 a 1
# 2: 1 b 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 1
# 6: 2 a 2
# 7: 2 a 3
# 8: 2 c 4
# 9: 3 b 1
#10: 3 b 2
#11: 3 c 3
#12: 3 c 4
I've assumed that within a single ID next origin is the same as previous destination, as per the OP example.
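To make the rep() arithmetic concrete, here is the same expression traced by hand for ID 2 in plain R. The vectors are copied from the question's data; lead_in and spans are hypothetical names introduced only for this walk-through.

```r
# Values for ID 2 from the question
origin      <- c("b", "a")
destination <- c("a", "c")
time        <- c(1, 4)
maxt        <- 4

# Periods before the first move: time[1] - 1 = 0, so nothing is prepended
lead_in <- rep(origin[1], time[1] - 1)

# How long each destination is held: until the next move, or until maxt
spans <- diff(c(time, maxt + 1))

location <- c(lead_in, rep(destination, spans))

spans     # 3 1
location  # "a" "a" "a" "c"
```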

Resources