How to do a complex wide-to-long operation for network analysis - r

I have survey data that includes who the respondent is (iAmX), who they work with (withX), how frequently they work with each partner (freqX), and how satisfied they are with each partner (likeX). Participants can select multiple options for who they are and who they work with.
I would like to go from something like this, with one row per respondent:
df <- read.table(header=T, text='
id iAmA iAmB iAmC withA withB withC freqA freqB freqC likeA likeB likeC
1 X X NA X X NA 3 2 NA 3 2 NA
2 NA NA X X NA NA 5 NA NA 5 NA NA
')
To something like this, with one row per combination, where "from" is who the actor is and "to" is who they work with:
goal <- read.table(header=T, text='
id from to freq like
1 A A 3 3
1 B A 3 3
1 A B 2 2
1 B B 2 2
2 C A 5 5
')
I have tried some melt, gather, and reshape functions but frankly I think I'm just not up to the logic puzzle today. I would really appreciate some help!

Although I must admit I have not fully understood the OP's logic, the code below reproduces the expected goal.
The key points here are data.table's incarnation of melt(), which is able to reshape multiple measure columns simultaneously, and the cross-join function CJ().
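For readers unfamiliar with CJ(), here is a minimal sketch of what it does (the values are purely illustrative; it assumes library(data.table) is loaded, as in the code below):
# CJ() builds every combination of its arguments (a cross join),
# so this returns the four pairs A-A, A-B, B-A, B-B, sorted and keyed
CJ(from = c("A", "B"), to = c("A", "B"))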
library(data.table)
# reshape multiple measure columns simultaneously
cols <- c("iAm", "with", "freq", "like")
long <- melt(setDT(df), measure.vars = patterns(cols),
value.name = cols, variable.name = "to")[
# rename factor levels
, to := forcats::fct_relabel(to, function(x) LETTERS[as.integer(x)])]
# create combinations for each id
combi <- long[, CJ(from = na.omit(to[iAm == "X"]), to = na.omit(to[with == "X"])), by = id]
# join to append freq and like
result <- combi[long, on = .(id, to), nomatch = 0L][, -c("iAm", "with")]
# reorder result
setorder(result, id)
result
id from to freq like
1: 1 A A 3 3
2: 1 B A 3 3
3: 1 A B 2 2
4: 1 B B 2 2
5: 2 C A 5 5
The intermediate results are
long
id to iAm with freq like
1: 1 A X X 3 3
2: 2 A <NA> X 5 5
3: 1 B X X 2 2
4: 2 B <NA> <NA> NA NA
5: 1 C <NA> <NA> NA NA
6: 2 C X <NA> NA NA
and
combi
id from to
1: 1 A A
2: 1 A B
3: 1 B A
4: 1 B B
5: 2 C A

Related

Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
library(data.table)
example <- data.table(
DebtorId = c(1,1,1,5,5,5,6,7),
DupDebtorId = c(2,3,4,1,2,3,7,6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
a <- newdt[i]$DebtorId
b <- min(which(groups[,-1] == a, arr.ind=TRUE)[,1])
newdt[i]$group <- b
}
Output:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are 2 problems in my approach:
1. From the output, you can see that it fails to recognize that case 5 belongs to group 1.
2. The final loop is agonizingly slow, which would render it useless for my use case of 1M rows in my original data, and going the traditional := way does not work with which().
I'm not sure whether my approach could be optimized, or there is a better way of doing this altogether.
This functionality already exists in igraph, so unless you need to do it yourself, you can build a graph from your data frame and then extract the cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
g <- graph.data.frame(example) # the question's data.table
df_membership <- clusters(g)$membership
stack(df_membership)
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
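If you want the result in exactly the shape shown in the question, here is a small follow-up sketch (the renaming and reordering are my additions, not part of the original answer):
df_groups <- stack(df_membership)
names(df_groups) <- c("group", "DebtorId")  # values -> group, ind -> DebtorId
df_groups$DebtorId <- as.integer(as.character(df_groups$DebtorId))
df_groups[order(df_groups$DebtorId), c("DebtorId", "group")]
# gives DebtorId 1..7 with groups 1, 1, 1, 1, 1, 2, 2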

replace row values based on another row value in a data.table

I have a trivial question, though I am struggling to find a simple answer. I have a data table that looks something like this:
dt <- data.table(id = rep(c("A", "B", "C"), each = 4),
                 time = rep(1:4, times = 3),
                 score = c(NA, 10, 15, 13, NA, 25, NA, NA, 18, 29, NA, 19))
dt
# id time score
# 1: A 1 NA
# 2: A 2 10
# 3: A 3 15
# 4: A 4 13
# 5: B 1 NA
# 6: B 2 25
# 7: B 3 NA
# 8: B 4 NA
# 9: C 1 18
# 10: C 2 29
# 11: C 3 NA
# 12: C 4 19
I would like to replace the missing values of my group "B" with the values of "A".
The final dataset should look something like this
final
# id time score
# 1: A 1 NA
# 2: A 2 10
# 3: A 3 15
# 4: A 4 13
# 5: B 1 NA
# 6: B 2 25
# 7: B 3 15
# 8: B 4 13
# 9: C 1 18
# 10: C 2 29
# 11: C 3 NA
# 12: C 4 19
In other words, wherever B's score is NA, I would like to replace it with the corresponding score of "A". Do note that "C" remains NA.
I am struggling to find a clean way to do this using data.table. However, if it is simpler with other methods it would still be ok.
Thanks a lot for your help
Here is one option: we get the index of the rows where 'score' is NA and 'id' is "B", then use that to replace the NA with the corresponding 'score' value from 'A'.
library(data.table)
i1 <- setDT(dt)[id == 'B', which(is.na(score))]
dt[, score:= replace(score, id == 'B' & is.na(score), score[which(id == 'A')[i1]])]
Or a similar option in dplyr
library(dplyr)
dt %>%
  mutate(score = replace(score, id == "B" & is.na(score),
                         score[which(id == "A")[i1]]))
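An update join is another way this is sometimes handled (a sketch of an alternative, not the answerer's approach): look up A's score at the same time point and use it only where B's score is missing.
a_scores <- dt[id == "A", .(time, a_score = score)]
# for each row of dt, pull A's score at the same time and
# substitute it only where id is "B" and score is NA
dt[a_scores, on = "time",
   score := fifelse(id == "B" & is.na(score), a_score, score)]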

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry, and if a later entry has more columns than that, they don't get names assigned to them (if I get help for this as well it would be awesome, but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since it is a list and not a data frame it doesn't work as expected. I have tried stack(), but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
Using library(magrittr) (plus data.table's fread() to read the sample data).
data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[, -1]
odds <- df2 %>% ncol %>% {(1:.) %% 2} %>% as.logical
even <- df2 %>% ncol %>% {!(1:.) %% 2}
cbind(df[, 1, drop = FALSE],
      A = unlist(df2[, odds]),
      B = unlist(df2[, even]),
      row.names = NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
We can use data.table. Assuming A and B always follow each other, I created an example with 2 sets of NA's in the header. With grep we can find the columns that fread has named V8, V9, etc. Using R's recycling of vectors, you can rename multiple headers in one go. If in your case these are named differently, change the pattern in the grep command. Then we melt the data into long format via melt().
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
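If the automatically created melt index isn't needed, it can be dropped afterwards (a small addition of mine, not part of the answer):
df_melted[, variable := NULL]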
Thank you for your help, they were great inspirations.
Even though @Andre Elrico gave a solution that worked better on the reproducible example, @phiver gave a solution that worked better on my overall problem.
By using both those I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
                             nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
# They need to be in the opposite order because the first column is going to be replaced with id, and this way they fall on the correct columns after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.

Appending data frames in R based on column names

I am relatively new to R, so bear with me. I have a list of data frames that I need to combine into one data frame. so:
dfList <- list(
df1 = data.frame(x=letters[1:2],y=1:2),
df2 = data.frame(x=letters[3:4],z=3:4)
)
comes out as:
$df1
x y
1 a 1
2 b 2
$df2
x z
1 c 3
2 d 4
and I want to combine them on common columns and add anything not already there. The result would be:
final result
x y z
1 a 1
2 b 2
3 c 3
4 d 4
Is this even possible?
Yep, it's pretty easy, actually:
library(dplyr)
df_merged <- bind_rows(dfList)
df_merged
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
And if you don't want NA in the empty cells, you can replace them like this:
df_merged[is.na(df_merged)] <- 0 # or whatever you want to replace NA with
Just using do.call with rbind.fill from the plyr package:
library(plyr)
do.call(rbind.fill, dfList)
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
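If you'd rather stay in data.table, rbindlist() with fill = TRUE does the same thing (a sketch, not part of the original answer):
library(data.table)
# binds the list of data frames by row, filling columns missing
# from a given frame with NA
rbindlist(dfList, fill = TRUE)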
You could do that with base function merge():
merge(dfList$df1, dfList$df2, by = "x", all = TRUE)
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
Or with dplyr package with function full_join:
dplyr::full_join(dfList$df1, dfList$df2, by = "x")
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
Both keep everything that appears in either data.frame.
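If the list held more than two data frames, the same join idea extends with Reduce() (a sketch building on the full_join shown above):
# successively full-join every data frame in the list on the key column x
Reduce(function(a, b) dplyr::full_join(a, b, by = "x"), dfList)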
Hope that works for you.

Reshaping data.table with cumulative sum

I want to reshape a data.table, and include the historic (cumulative summed) information for each variable. The No variable indicates the chronological order of measurements for object ID. At each measurement additional information is found. I want to aggregate the known information at each timestamp No for object ID.
Let me demonstrate with an example:
For the following data.table:
df <- data.table(ID=c(1,1,1,2,2,2,2),
No=c(1,2,3,1,2,3,4),
Variable=c('a','b', 'a', 'c', 'a', 'a', 'b'),
Value=c(2,1,3,3,2,1,5))
df
ID No Variable Value
1: 1 1 a 2
2: 1 2 b 1
3: 1 3 a 3
4: 2 1 c 3
5: 2 2 a 2
6: 2 3 a 1
7: 2 4 b 5
I want to reshape it to this:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
So the summed values of Value, per Variable by (ID, No), cumulative over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which results in the non-cumulative variant:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 NA 1 NA
3: 1 3 3 NA NA
4: 2 1 NA NA 3
5: 2 2 2 NA NA
6: 2 3 1 NA NA
7: 2 4 NA 5 NA
Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.
EDIT: I just used a, b, c as an example; the original file has about 40 different levels. Furthermore, the NAs are important; there are also Value values of 0, which mean something different from NA.
POSSIBLE SOLUTION
Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.
The idea is to duplicate each row TotalNo - No times, where TotalNo is the maximum No per ID. Then the original dcast function can be used to extract the dataframe. So in code:
df[,TotalNo := .N, by=ID]
df2 <- df[rep(seq(nrow(df)), (df$TotalNo - df$No + 1))] #create duplicates
df3 <- df2[order(ID, No)]
df3[,No:= seq(from=No[1], to=TotalNo[1], by=1), by=.(ID, No)]
df4<- dcast(df3,
formula = ID + No ~ Variable,
value.var = "Value", fill=NA, fun.aggregate = sum)
It is not really nice, because the creation of duplicates uses more memory. I think it can be further optimized, but so far it works for my purposes. In the sample code it goes from 7 rows to 16 rows, in the original file from 241,670 rows to a whopping 978,331. That's over a factor 4 larger.
SOLUTION
Eddi improved my solution's computing time on the full dataset (2.08 seconds for Eddi's versus 4.36 seconds for mine). Those are numbers I can work with! Thanks everybody!
Your solution is good, but you're adding too many rows that are unnecessary if you compute the cumsum beforehand:
# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]
# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
on = c('ID', 'Variable', 'No'), roll = TRUE],
ID + No ~ Variable, value.var = 'CumValue')
# ID No a b c
#1: 1 1 2 NA NA
#2: 1 2 2 1 NA
#3: 1 3 5 1 NA
#4: 2 1 NA NA 3
#5: 2 2 2 NA 3
#6: 2 3 3 NA 3
#7: 2 4 3 5 3
Here's a standard way:
library(zoo)
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var="cv")
lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
One alternative way to do it is using a custom-built cumulative sum function. This is exactly the method in @David Arenburg's comment, but substitutes in a custom cumulative sum function.
EDIT: Using @eddi's much more efficient custom cumulative sum function.
cumsum.na <- function(z){
Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = T), z, accumulate = T)
}
cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[, (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
This definitely isn't the most efficient, but it gets the job done and gives you an admittedly very slow cumulative sum function that handles NAs the way you want.
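A quick illustration (my own sketch) of how cumsum.na() treats missing values: leading NAs stay NA, while NAs that follow a value simply carry the running sum forward.
cumsum.na(c(2, NA, 3))
# [1] 2 2 5
cumsum.na(c(NA, NA, 3))
# [1] NA NA  3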
