Data management: flatten data with R - r

I have the following dataframe gathering the evolution of policies:
Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
date_new = c("20200101","20200115","20200304","20200110","20200215"),
date_end = c("20200503","20200608","20210101","20200403","20200503"),
expend = c("","A_001","A_002","",""))
which looks like this:
Id_policy date_new date_end expend
A_001 20200101 20200503
A_002 20200115 20200608 A_001
A_003 20200304 20210101 A_002
B_001 20200110 20200403
B_002 20200215 20200503
"Id_policy" refers to a specific policy, "date_new" the date of policy issuance, "date_end" the date of policy end. However, sometimes a policy is extended. When it is the case, a new policy is set and the variable "expend" provides the name of the previous policy it changes.
The idea here is to flatten the dataset so we only keep rows corresponding to different policies. So, the output would be something like this:
Id_policy date_new date_end expend
A_001 20200101 20210101
B_001 20200110 20200403
B_002 20200215 20200503
Has someone faced a similar problem?

One way is to treat this as a network problem and use igraph functions (related posts: e.g. Make a group_indices based on several columns; Fast way to group variables based on direct and indirect similarities in multiple columns).
Set the missing 'expend' to 'Id_policy'
Use graph_from_data_frame to create a graph, where 'expend' and 'Id_policy' columns are treated as an edge list.
Use components to get connected components of the graph, i.e. which 'Id_policy' are connected, directly or indirectly.
Select the membership element to get "the cluster id to which each vertex belongs".
Join membership to original data.
Grab relevant data grouped by membership.
I use data.table for the data wrangling steps, but this can of course also be done in base or dplyr (a rough dplyr sketch follows the output below).
library(data.table)
library(igraph)
setDT(Df)
Df[expend == "", expend := Id_policy]
g = graph_from_data_frame(Df[ , .(expend, Id_policy)])
mem = components(g)$membership
Df[.(names(mem)), on = .(Id_policy), mem := mem]
Df[ , .(Id_policy = Id_policy[1],
date_new = first(date_new),
date_end = last(date_end)), by = mem]
# mem Id_policy date_new date_end
# 1: 1 A_001 20200101 20210101
# 2: 2 B_001 20200110 20200403
# 3: 3 B_002 20200215 20200503
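As an aside, here is a rough dplyr version of the same aggregation (my sketch, not part of the original answer); it assumes the mem column created by the join above and, like the data.table code, that rows are ordered chronologically within each membership group:
library(dplyr)
Df %>%
  group_by(mem) %>%
  summarise(Id_policy = first(Id_policy),
            date_new = first(date_new),
            date_end = last(date_end)) %>%
  ungroup()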

Here is a solution using igraph to create a directed network of ids, and data.table to do some binding and joining.
I kept the intermediate results in, to show what each step does.
library( data.table )
library( igraph )
setDT(Df)
#create nodes and links
nodes <- Df[,1:3]
links <- Df[ !expend == "", .(from = expend, to = Id_policy) ]
g = graph_from_data_frame( links, vertices = nodes, directed = TRUE )
plot(g)
#find nodes without incoming edges (these are the start points of the paths)
in.nodes <- V(g)[degree(g, mode = 'in') == 0]
#define subcomponents of the graph by looping over the in.nodes
L <- lapply( in.nodes, function(x) names( subcomponent(g, x) ) )
# $A_001
# [1] "A_001" "A_002" "A_003"
# $B_001
# [1] "B_001"
# $B_002
# [1] "B_002"
L2 <- lapply( L, function(x) {
#get first and last element
dt <- data.table( start = x[1], end = x[ length(x) ] )
})
#bind list together to a single data.table
ans <- rbindlist( L2, use.names = TRUE, fill = TRUE, idcol = "Id_policy" )
# Id_policy start end
# 1: A_001 A_001 A_003
# 2: B_001 B_001 B_001
# 3: B_002 B_002 B_002
#update ans with values from original Df for start and end
ans[ Df, `:=`( start = i.date_new ), on = .(start = Id_policy) ][]
ans[ Df, `:=`( end = i.date_end ), on = .(end = Id_policy) ][]
# Id_policy start end
# 1: A_001 20200101 20210101
# 2: B_001 20200110 20200403
# 3: B_002 20200215 20200503

An outer for loop to go through each policy id in Df, with an inner while loop to find the last extension for an original policy, should work:
Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
date_new = c("20200101","20200115","20200304","20200110","20200215"),
date_end = c("20200503","20200608","20210101","20200403","20200503"),
expend = c("","A_001","A_002","",""),
stringsAsFactors = F)
library(dplyr) # bind_rows() below comes from dplyr
final_df <- data.frame(matrix(nrow = 0, ncol = 0), stringsAsFactors = F)
for (i in seq_len(nrow(Df))) {
# Check to see if the current policy ID is in the column expend
if (Df$Id_policy[i] %in% Df$expend || !Df$expend[i] == "") {
# Loop through expend policy until last one is found
found_last <- F
j <- i
end_date <- ""
c_policy_id <- Df$Id_policy[j]
expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]
if (length(expended_id) > 0) {
if (expended_id %in% Df$expend) {
while(!found_last) {
c_policy_id <- Df$Id_policy[j]
expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]
if (length(expended_id) > 0) {
if (expended_id %in% Df$expend) {
j <- which(Df$expend == expended_id)
}else{
# guard: the extending policy is itself not extended, so the chain ends there
# (without this branch the while loop could spin forever on longer chains)
end_date <- Df$date_end[which(Df$Id_policy == expended_id)]
found_last <- T
}
}else{
end_date <- Df$date_end[j]
found_last <- T
}
}
if (!end_date == "") {
# Add to final df when found the last one
final_df <- bind_rows(final_df, data.frame(Id_policy = Df$Id_policy[i],
date_new = Df$date_new[i],
date_end = end_date,
expend = ""))
}
}
}
}else{
final_df <- bind_rows(final_df, Df[i, ])
}
}
final_df
Id_policy date_new date_end expend
1 A_001 20200101 20210101
2 B_001 20200110 20200403
3 B_002 20200215 20200503

Related

Why does cumsum function not work in data.table package when attempting chaining?

I'm trying to add a cumsum() column to a dataframe using the data.table package per the below code, and it doesn't work in this case (see the section marked # this doesn't work, after adding cumsum at the bottom:). I've used cumsum() before, but as a standalone function in setDT(); now that I'm trying to chain it to the code it doesn't work. What am I doing wrong?
Here are the results I get for the code that does work (marked # this works, before cumsum: in the code below), except that I added a column on the right, "cumX", which is what I tried to add via chaining; "cumX" runs a cumulative sum of the column "1stStateX":
Period 1stStateX cumX
1: 1 0 0
2: 2 1 1
3: 3 1 2
4: 4 0 2
5: 5 0 2
In contrast, when I run the code marked # this doesn't work..., I get nothing back in the RStudio console.
Also, when I run the code below marked # this works, before cumsum: for the simple sum() function (which works as intended), I noticed that the original dataframe testDF is altered. Why is that, and how do I prevent that from happening?
Code:
library(data.table)
testDF <-
data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
# this works, before cumsum:
setDT(testDF)[
, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]
# this doesn't work, after adding cumsum at the bottom:
setDT(testDF)[
, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period][
, cumX := cumsum(`1stStateX`),by = Period]
Final solution using ismirsehregal's answer: after studying the recommended vignette I got chaining to work:
library(data.table)
testDF <- data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
testDT <- testDF
testDT <- setDT(testDT)[, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period][
, cumX := cumsum(`1stStateX`)]
testDT <- as.data.frame(testDT)
print(testDT)
I don't know your expected output; however, I guess your confusion is the result of chaining code that modifies an existing data.table object (please see ?`:=`) with code that results in a new data.table object (please see section d) Select column(s) in j here).
Please check the following:
library(data.table)
testDF <- data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
DT <- copy(testDF)
setDT(DT)
DT[, `1stStateX` := .I == .I[State == 'X'][1], ID] # this step changes DT by reference, please see ?`:=`
aggregatedDT <- DT[, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period] # this does not change DT and results in another data.table
aggregatedDT[, cumX := cumsum(`1stStateX`)]
print(aggregatedDT)
Result:
Period 1stStateX cumX
1: 1 0 0
2: 2 1 1
3: 3 1 2
4: 4 0 2
5: 5 0 2
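To spell out the second part of the question (why testDF itself was altered): setDT() converts the data.frame in place and := then adds columns by reference, so the original object is modified; copy() is what prevents that. A minimal illustration of my own (not from the answer), using a tiny throwaway table:
library(data.table)
testDF <- data.frame(x = 1:3)
setDT(testDF)[, y := x * 2]   # converts and modifies testDF itself by reference
names(testDF)                 # "x" "y" -- the original object now has the new column
testDF <- data.frame(x = 1:3)
DT <- copy(testDF)            # work on an independent copy instead
setDT(DT)[, y := x * 2]
names(testDF)                 # still just "x"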

How to use functional-style programming in R when the list of inputs depends upon one another?

I have a data frame organized like so:
df <- data.frame(id = c(1, 1, 1),
startDate = c("1990-01-01", "1990-01-23", "1990-01-30"),
endDate = c("1990-01-24", "1990-01-25", "1990-01-31"))
Each row represents a start date and an end date. There is some overlap in the dates, though, which I'd like to identify. For example, the first range goes from January 1, 1990 to January 24, 1990, and the second row goes from January 23, 1990 to January 25, 1990.
What I want to do is create a new data frame which would look something like this (illustrated by creating a new R data frame):
df <- data.frame(id = c(1, 1, 1),
startDate = c("1990-01-01", "1990-01-23", "1990-01-30"),
endDate = c("1990-01-24", "1990-01-25", "1990-01-31"),
overlap = c(TRUE, TRUE, FALSE),
newStartDate = c("1990-01-01", "1990-01-01", "1990-01-30"),
newEndDate = c("1990-01-25", "1990-01-25", "1990-01-31"))
So first, identify every row which overlaps with another row. And then create new columns (newStartDate, newEndDate) which would take the earliest start date and the latest end date from all the overlapping dates.
I already have in my mind how this would work using pseudocode. However, I wonder if there is a way to make this work using an "R" style of programming, with vectors, functions and so on. I'm having trouble conceptualizing how it would work, because there are lots of variables you would need to keep track of, and I'm not sure how you could achieve it using things like map, apply, etc.
Hopefully my question is clear!
Below are some alternative approaches.
(1) and (1a) only use Base R. These are the same except that (1) uses an adjacency matrix and (1a) uses a neighborhood list to avoid forming the potentially large adjacency matrix.
(2) is an SQL solution using the sqldf package.
(3) uses the igraph package and may give a different answer than the above alternatives, although in the case of the example in the question the answers are the same. (3a) is like (3) but, like (1a), avoids forming the adjacency matrix.
At the end we provide some graphics.
Alternatives
1) Base R First we convert the dates to Date class giving df2. Then we define a function betw which checks whether its first argument is between the second and third and use that to define a function overlap which given two row indexes into df2 determines whether they overlap (TRUE) or not (FALSE).
If V is the sequence from 1 to the number of rows in df2 then we can form an adjacency matrix adj such that adj[i,j] is 1 if rows i and j overlap. Using that, it is straightforward to compute the overlap, newStartDate and newEndDate columns.
This approach uses no packages.
df2 <- transform(df, startDate = as.Date(startDate), endDate = as.Date(endDate))
betw <- function(x, a, b) x >= a & x <= b
overlap <- function(i, j) {
betw(df2[i, "startDate"], df2[j, "startDate"], df2[j, "endDate"]) ||
betw(df2[j, "startDate"], df2[i, "startDate"], df2[i, "endDate"])
}
# form adjacency matrix of graph having vertices V
V <- 1:nrow(df2)
adj <- sapply(V, function(u) sapply(V, overlap, u)) + 0
orig <- "1970-01-01"
transform(df2, overlap = colSums(adj) > 1,
newStartDate = as.Date(apply(adj, 1, function(ix) min(startDate[ix == 1])), orig),
newEndDate = as.Date(apply(adj, 1, function(ix) max(endDate[ix == 1])), orig))
giving:
id startDate endDate overlap newStartDate newEndDate
1 1 1990-01-01 1990-01-24 TRUE 1990-01-01 1990-01-25
2 1 1990-01-23 1990-01-25 TRUE 1990-01-01 1990-01-25
3 1 1990-01-30 1990-01-31 FALSE 1990-01-30 1990-01-31
1a) A variation of (1) which avoids forming the adj adjacency matrix is to create a neighbors list nbrs such that nbrs[[i]] is a vector of the row numbers that row i overlaps.
nr <- nrow(df2) # number of rows, needed below
nbrs <- lapply(1:nr, function(j) Filter(function(i) overlap(i, j), 1:nr))
names(nbrs) <- 1:nr
orig <- "1970-01-01"
transform(df2, overlap = lengths(nbrs) > 1,
newStartDate = as.Date(sapply(nbrs, function(ix) min(startDate[ix])), orig),
newEndDate = as.Date(sapply(nbrs, function(ix) max(endDate[ix])), orig))
2) sqldf Using df2 we can use SQL to compute the desired output in a single SQL statement:
library(sqldf)
sqldf("select
a.id,
a.startDate as startDate__Date,
a.endDate as endDate__Date,
count(b.rowid) > 1 as overlap__logical,
min(b.startDate) as newStartDate__Date,
max(b.endDate) as newEndDate__Date
from df2 as a
left join df2 as b on (a.startDate between b.startDate and b.endDate) or
(b.startDate between a.startDate and a.endDate)
group by a.rowid
order by a.rowid", method = "name__class")
giving:
id startDate endDate overlap newStartDate newEndDate
1 1 1990-01-01 1990-01-24 TRUE 1990-01-01 1990-01-25
2 1 1990-01-23 1990-01-25 TRUE 1990-01-01 1990-01-25
3 1 1990-01-30 1990-01-31 FALSE 1990-01-30 1990-01-31
3) igraph Another approach which is not equivalent to (1) or (2) but may be preferred is to partition the rows into connected components using a transitive completion of the overlap relation. It is similar to this question here: R: Find groups of vectors that have a > 80% overlap in their elements
Using adj from (1) form a graph g using the igraph package. Then rows that have no other rows in their connected component are not overlapped. If we number the connected components 1, 2, ... then memb is such that memb[i] is the number of the connected component containing row i so for each row we can find the minimum and maximum date over the connected component it belongs to. Although this gives the same answer as (1) for the input in the question, in general, this is different than (1) because, for example, if rows i and j do not overlap but do each overlap row k then i, j and k are all in the same connected component for purposes of computing the columns of the output.
library(igraph)
g <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)
memb <- components(g)$membership
# assemble desired output data frame
transform(df2,
overlap = ave(memb, memb, FUN = length) > 1,
newStartDate = ave(startDate, memb, FUN = min),
newEndDate = ave(endDate, memb, FUN = max))
giving:
id startDate endDate overlap newStartDate newEndDate
1 1 1990-01-01 1990-01-24 TRUE 1990-01-01 1990-01-25
2 1 1990-01-23 1990-01-25 TRUE 1990-01-01 1990-01-25
3 1 1990-01-30 1990-01-31 FALSE 1990-01-30 1990-01-31
3a) Alternately we can form g from nbrs to avoid forming adj like this:
g0 <- graph_from_edgelist(as.matrix(stack(nbrs)), directed = FALSE)
g <- simplify(g0) # remove self loops
Graphics
As an aside using g we can display a graphical representation where node i means row i and edges indicate overlap.
plot(g)
I created a solution for a similar problem. As I needed to apply the same logic to a large dataset, my way to go was Rcpp and data.table (mainly for sorting speed). It also works on multiple groups (ids). The conti() function produces the aggregated ranges of time periods without, in this case, a day of interruption (this can be tuned with the tolerance argument):
conti <- function(
data = df,
group = "id", #the group variable by which to aggregate the dates
dateFrom = "startDate",
dateTo = "endDate",
tolerance = 0, #what gap shall be seen as an uninterrupted range of dates, here 0 tolerance
dateFormat = "%Y-%m-%d" #date format in df
) {
if(!require(Rcpp)){install.packages("Rcpp"); library(Rcpp)}
if(!require(data.table)){install.packages("data.table"); library(data.table)}
cppFunction('DataFrame BezRcpp(DataFrame dtable) {
int marker = 0;
IntegerVector ID = dtable["group"];
DateVector From = dtable["dateFrom"];
DateVector To = dtable["dateTo"];
IntegerVector Difference(ID.size(), 9999);
for (int i = 1; i < ID.size(); i++) {
if(ID[i] != ID[i-1]) {
marker = i;
} else {
Difference[i] = From[i] - To[marker];
if(Difference[i]>1) marker = i;
else if(To[i]>To[marker]){
To[marker] = To[i];
}}}
return DataFrame::create(
_["group"] = ID,
_["Difference"] = Difference,
_["dateFrom"] = From,
_["dateTo"] = To,
_["stringsAsFactors"] = false);
}'
)
conti_Rcpp_ <- function(data){
A <- Sys.time()
if(!"data.table" %in% class(data)) dtable <- as.data.table(data) else dtable <- copy(data)
setnames(dtable, old = c("id", "startDate", "endDate"), new = c("group", "dateFrom", "dateTo"))
if(class(dtable[["dateFrom"]])!="Date" || class(dtable[["dateTo"]])!="Date") for (j in c("dateFrom", "dateTo")) set(dtable, j = j, value = as.Date(dtable[[j]], dateFormat))
setorderv(dtable, c("group", "dateFrom"))
dt <- setDT(BezRcpp(dtable))
dt <- dt[Difference>(tolerance+1), c("group", "dateFrom", "dateTo"), with = F]
setnames(dt, new = c("id", "startDate", "endDate"), old = c("group", "dateFrom", "dateTo"))
B <- Sys.time()
print(paste0("Done in ", round(difftime(B, A, units = "secs"), 1), " secs. A data.table was produced."))
return(dt)
}
return(conti_Rcpp_(data))
}
Then
df <- data.frame(id = c(1L, 1L, 1L),
startDate = c("1990-01-01", "1990-01-23", "1990-01-30"),
endDate = c("1990-01-24", "1990-01-25", "1990-01-31"), stringsAsFactors = F)
conti(df)
#[1] "Done in 0 secs. A data.table was produced."
# id startDate endDate
#1: 1 1990-01-01 1990-01-25
#2: 1 1990-01-30 1990-01-31
You could construct a data.table-dependent function called find_overlaps like below:
library(data.table)
find_overlaps <- function(df,
groups = NULL,
start_var = NULL,
end_var = NULL,
fmt = "%Y-%m-%d") {
calc_cummax_Date <- function(x) setattr(cummax(unclass(x)), "class", c("Date", "IDate"))
df_overlap <- setDT(copy(df))
rangevars <- c(start_var, end_var)
groupsidx <- c(groups, "overlap_idx")
df_overlap <- df_overlap[
, (rangevars) := lapply(.SD, function(x) as.Date(as.character(x), format = fmt)), .SDcols = rangevars][
, max_until_now := shift(calc_cummax_Date(get(end_var)), fill = get(end_var)[1]), by = mget(groups)][
(max_until_now + 1L) < get(start_var), gap_between := 1][
is.na(gap_between), gap_between := 0][
, overlap_idx := cumsum(gap_between), by = mget(groups)][
, `:=` (overlap = .N > 1,
newStartDate = min(get(start_var)),
newEndDate = max(get(end_var))), by = groupsidx][
, c("gap_between", "max_until_now") := NULL
]
return(df_overlap)
}
Calling this function (with [] at the end for printing the output) would give you the desired output:
# Below code will only print the output, you have to save it by e.g. df <- find_overlaps(df, groups = "id", start_var = "startDate", end_var = "endDate")
find_overlaps(df, groups = "id", start_var = "startDate", end_var = "endDate")[]
id startDate endDate overlap_idx overlap newStartDate newEndDate
1: 1 1990-01-01 1990-01-24 0 TRUE 1990-01-01 1990-01-25
2: 1 1990-01-23 1990-01-25 0 TRUE 1990-01-01 1990-01-25
3: 1 1990-01-30 1990-01-31 1 FALSE 1990-01-30 1990-01-31
As you can see, I've also added a column named overlap_idx, as I believe it may be useful to have a separate index for each non-overlapping range within each id.
The function can handle multiple groups (see the sketch below). Since it checks for the cumulative maximum of the end date variable, it also works with cases where a row has the lowest start date but also the highest end date. Additional parameters (like max_days_between, i.e. the definition of what you consider to be continuous: a one-day gap or more) can be added easily.
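For instance, a made-up two-id data frame (hypothetical data, not from the question) can illustrate the grouping: id 1's two overlapping rows should collapse into a single range, while id 2's two disjoint rows should keep separate overlap_idx values.
df_multi <- data.frame(id = c(1, 1, 2, 2),
                       startDate = c("1990-01-01", "1990-01-05", "1990-01-01", "1990-02-01"),
                       endDate = c("1990-01-10", "1990-01-20", "1990-01-15", "1990-02-10"))
find_overlaps(df_multi, groups = "id", start_var = "startDate", end_var = "endDate")[]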
If you're interested, the above function is partly similar to a function called collapse_ranges from my package neatRanges (available on CRAN, but still in experimental state).
It would give you a collapsed output, similar to what you want but as a summary of only first and last start/end dates for each non-overlapping range:
install.packages('neatRanges')
library(neatRanges)
collapse_ranges(df, groups = "id", start_var = "startDate", end_var = "endDate")[]
id startDate endDate
1 1 1990-01-01 1990-01-25
2 1 1990-01-30 1990-01-31

Summary statistics from aggregated groups using data.table

I have a dataset with this structure:
library(data.table)
dt <- data.table(
record=c(1:20),
area=rep(LETTERS[1:4], c(4, 6, 3, 7)),
score=c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
cluster=c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)
I would like to aggregate the data so I can identify the most common cluster in each area for a given score (for example 1). I would also like some basic frequencies and percentages to be calculated with an output looking something like this:
dt_summary_for_1_score <- data.table(
area=c("A","B","C","D"),
cluster_mode=c("X","X","X","Z"),
cluster_pct = c(100,66.6,100,80),
cluster_freq = c(2,2,1,4),
record_freq = c(2,3,1,5)
)
Ideally I would like a solution that uses data.table. Thanks.
I would leverage frank(), though a solution with sort(table(cluster)) is possible as well (a rough sketch of that route follows the output below).
dt_summary =
dt[ , .N, keyby = .(area, score, cluster)
][ , {
idx = frank(-N, ties.method = 'min') == 1
NN = sum(N)
.(
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]
To get the example with score == 1 we can subset this:
dt_summary[score == 1]
# area score cluster_mode cluster_pct cluster_freq record_freq
# 1: A 1 X 100.00000 2 2
# 2: B 1 X 66.66667 2 3
# 3: C 1 X 100.00000 1 1
# 4: D 1 Z 80.00000 4 5
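For comparison, the sort(table(cluster)) route mentioned at the top could look roughly like this (my sketch, filtering to score == 1 directly; note it keeps only one cluster per group in the case of ties):
dt[score == 1, {
  tab = sort(table(cluster), decreasing = TRUE)
  .(cluster_mode = names(tab)[1],
    cluster_pct = as.numeric(100 * tab[1] / sum(tab)),
    cluster_freq = as.integer(tab[1]),
    record_freq = as.integer(sum(tab)))
}, by = area]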
Note that this returns multiple rows in the case of ties. You might instead try something like cluster_mode = paste(cluster[idx], collapse = '|') or cluster_mode = list(cluster[idx]) as alternatives; a sketch of the paste() variant follows the annotated code below.
Breaking down the logic:
# Count how many times each cluster shows up with each area/score
dt[ , .N, keyby = .(area, score, cluster)
][ , {
# Rank each cluster's count within each area/score & take the top;
# ties.method = 'min' guarantees that if there's
# a tie for "winner", _both_ will get rank 1
# (by default, ties.method = 'average')
# Note that it is slightly inefficient to negate N
# in order to sort in descending order, especially
# if there are a large number of groups. We could
# either vectorize negation by using -.N in the
# previous step or by using frankv (a lower-level
# version of frank) which has an 'order' argument
idx = frank(-N, ties.method = 'min') == 1
# calculate here since it's used twice
NN = sum(N)
.(
# use [idx] to subset and make sure there are
# only as many rows on output as there are
# top-ranked clusters for this area/score
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]
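And here is a sketch of the tie-collapsing variant mentioned above (my adaptation of the same query; since ties.method = 'min' gives tied clusters identical counts, N[idx][1] is that shared count):
dt[ , .N, keyby = .(area, score, cluster)
  ][ , {
    idx = frank(-N, ties.method = 'min') == 1
    NN = sum(N)
    .(
      cluster_mode = paste(cluster[idx], collapse = '|'),  # tied modes joined with '|'
      cluster_pct = 100*N[idx][1]/NN,
      cluster_freq = N[idx][1],
      record_freq = NN
    )
  }, by = .(area, score)]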

optimizing solution to find common third on large data set

This is a follow-up question to my previous question. I ran into a problem finding a memory-efficient solution to find a common third for my large data set (3.5 million groups and 6.2 million persons).
The proposed solution using the igraph package works fast for normal-sized data sets, but unfortunately runs into memory issues on bigger data sets because it creates a large matrix. A similar issue comes up with my own solution using concatenated inner joins, where the third inner join inflates the dataframe so that my PC runs out of memory (16 GB).
library(dplyr) # inner_join() and %>% come from dplyr
df.output <- inner_join(df,df, by='group' ) %>%
inner_join(.,df, by=c('person.y'='person')) %>%
inner_join(.,df, by=c('group.y'='group')) %>%
rename(person_in_common='person.y', pers1='person.x',pers2='person') %>%
select(pers1,pers2,person_in_common) %>%
filter(pers1!=pers2) %>%
distinct() %>%
filter(person_in_common!=pers1 & person_in_common!=pers2)
df.output[-3] <- t(apply(df.output[-3], 1,
FUN=function(x) sort(x, decreasing=FALSE)))
df.output <- unique(df.output)
Small data set example and expected output
df <- data.frame(group= c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"), stringsAsFactors = FALSE)
df
group person
1 a Tom
2 a Jerry
3 b Tom
4 b Anna
5 b Sam
6 c Nic
and expected result
df.output
pers1 pers2 person_in_common
1 Anna Jerry Tom
2 Jerry Sam Tom
3 Sam Tom Anna
4 Anna Tom Sam
6 Anna Sam Tom
I unfortunately don't have access to a machine with more RAM and am also not really experienced with cloud computing, so I hope to make it work on my local PC. I would appreciate input on how to optimize any of the solutions, or advice on how to tackle the problem otherwise.
Edit 1
A dataframe which reflects my actual data size.
set.seed(33)
Data <- data.frame(group = sample(1:3700000, 14000000, replace=TRUE),
person = sample(1:6800000, 14000000,replace = TRUE))
Edit 2
My real data is a bit more complex, with larger groups and more persons per group than the example data. Consequently it is more memory-intensive. I could not figure out how to simulate this kind of structure, so here is the real data for download:
Full person-group data
So, I managed to run this on your test data (I have 16 GB of RAM), but if you run it on your small example you will see that it does not give the same results. I did not get why, but maybe you could help me with that. So I will try to explain every step:
myFun <- function(dt) {
require(data.table)
# change the data do data.table:
setDT(dt)
# set key/order the data by group and person:
setkey(dt, group, person)
# I copy the initial data and change the name of soon to be merged column name to "p2"
# which represents person2
dta <- copy(dt)
setnames(dta, "person", "p2")
# the first merge using data.table:
dt1 <- dt[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
# now we remove rows where persons are the same:
dt1 <- dt1[person != p2] # remove equal persons
# and also we need to remove rows where person1 and person2 are the same,
# just in different order , example:
# 2: a Tom Jerry
# 3: a Jerry Tom
# is the same, if I get it right then you did this using apply in the end of code,
# but it would be much better if we could reduce data now
# also my approach will be much faster (we take pairwise min word to 2 column
# and max to the last):
l1 <- pmin(dt1[[2]], dt1[[3]])
l2 <- pmax(dt1[[2]], dt1[[3]])
set(dt1, j = 2L, value = l1)
set(dt1, j = 3L, value = l2)
# now lets clear memory and take unique rows of dt1:
rm(l1, l2, dt)
dt1 <- unique(dt1)
gc()
# change name for group column:
setnames(dta, "group", "g2")
# second merge:
dt2 <- dt1[dta, on = "p2", allow.cartesian = TRUE, nomatch = 0]
rm(dt1)
gc()
setnames(dta, "p2", "p3")
dt3 <- dt2[dta, on = "g2", allow.cartesian = TRUE, nomatch = 0] # third merge
rm(dt2)
gc()
dt3 <- dt3[p3 != p2 & p3 != person] # removing equal persons
gc()
dt3 <- dt3[, .(person, p2, p3)]
gc()
return(dt3[])
}
On the small data set example:
df <- data.frame(group = c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"),
stringsAsFactors = FALSE)
df
myFun(df)
# person p2 p3
# 1: Anna Tom Jerry
# 2: Sam Tom Jerry
# 3: Jerry Tom Anna
# 4: Sam Tom Anna
# 5: Jerry Tom Sam
# 6: Anna Tom Sam
# 7: Anna Sam Tom
Something similar to your result but not quite the same
Now with larger data:
set.seed(33)
N <- 10e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # 13.22 sec
rm(results)
gc()
And:
set.seed(33)
N <- 14e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # around 40 sec, but RAM usage gets close to the maximum
Update:
Maybe you can try this splitting approach, let's say with nparts of 6-10? (A usage sketch follows the function below.)
myFunNew3 <- function(dt, nparts = 2) {
require(data.table)
setDT(dt)
setkey(dt, group, person)
dta <- copy(dt)
# split into N parts
splits <- rep(1:nparts, each = ceiling(dt[, .N]/nparts))
set(dt, j = "splits", value = splits)
dtl <- split(dt, by = "splits", keep.by = F)
set(dt, j = "splits", value = NULL)
rm(splits)
gc()
i = 1
for (i in seq_along(dtl)) {
X <- copy(dtl[[i]])
setnames(dta, c("group", "person"))
X <- X[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
X <- X[person != i.person]
gc()
X <- X[dta, on = "person", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(dta, "group", "i.group")
X <- X[dta, on = "i.group", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(X, "i.person.1", "pers2")
setnames(X, "i.person", "pers1" )
setnames(X, "person", "person_in_common" )
X <- X[, .(pers1, pers2, person_in_common)]
gc()
X <- X[pers1 != pers2 & person_in_common != pers1 & person_in_common != pers2]
gc()
name1 <- "pers1"
name2 <- "pers2"
l1 <- pmin(X[[name1]], X[[name2]])
l2 <- pmax(X[[name1]], X[[name2]])
set(X, j = name1, value = l1)
set(X, j = name2, value = l2)
rm(l1, l2)
gc()
X <- unique(X)
gc()
if (i > 1) {
X1 <- rbindlist(list(X1, X), use.names = T, fill = T)
X1 <- unique(X1)
rm(X)
gc()
} else {
X1 <- copy(X)
}
dtl[[i]] <- 0L
gc()
}
rm(dta, dtl)
gc()
setkey(X1, pers1, pers2, person_in_common)
X1[]
}
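A possible call on the larger simulated data from above (my usage sketch; the best nparts and the timing will depend on the available RAM):
set.seed(33)
N <- 14e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
                 person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFunNew3(dt, nparts = 8))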

R applying a data frame on another data frame

I have two data frames.
set.seed(1234)
df <- data.frame(
id = factor(rep(1:24, each = 10)),
price = runif(20)*100,
quantity = sample(1:100,240, replace = T)
)
df2 <- data.frame(
id = factor(seq(1:24)),
eq.quantity = sample(1:100, 24, replace = T)
)
I would like to use df2$eq.quantity to find the closest absolute value compared to df$quantity, by the factor variable id. I would like to do that for each id in df2 and bind the results into a new data frame, called results.
I can do it like this for each individual ID:
d.1 <- df2[df2$id == 1, 2]
df.1 <- subset(df, id == 1)
id.1 <- df.1[which.min(abs(df.1$quantity-d.1)),]
Which would give the solution:
id price quantity
1 66.60838 84
But I would really like to use a smarter solution, and also gather the results into a dataframe, so if I did it manually it would look kind of like this:
results <- cbind(id.1, id.2, etc..., id.24)
I had some trouble giving this question a good name.
data.tables are smart!
Adding this to your current example...
library(data.table)
dt = data.table(df)
dt2 = data.table(df2)
setkey(dt, id)
setkey(dt2, id)
dt[dt2, dif:=abs(quantity - eq.quantity)]
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
result:
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
id price quantity
1: 1 66.6083758 84
2: 2 29.2315840 19
3: 3 62.3379442 63
4: 4 54.4974836 31
5: 5 66.6083758 6
6: 6 69.3591292 13
...
Merge the two datasets and use lapply to perform the function on each id.
df3 <- merge(df,df2,all.x=TRUE,by="id")
diffvar <- function(i){
# subset df3 to the given id and keep the row with the smallest absolute difference
df4 <- subset(df3, id == i)
df4[which.min(abs(df4$quantity-df4$eq.quantity)),]
}
resultslist <- lapply(levels(df3$id), diffvar)
Combine the resulting list elements in a dataframe:
resultsdf <- data.frame(matrix(unlist(resultslist), ncol=4, byrow=T))
Or more easy:
library(plyr)
resultsdf <- ddply(df3, .(id), function(x)x[which.min(abs(x$quantity-x$eq.quantity)),])
