Take this simple data frame of linked ids:
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
> test
id1 id2
1 10 1
2 10 36
3 1 24
4 1 45
5 24 300
6 8 11
I now want to group together all the ids which link.
By 'link', I mean follow through the chain of links so that all ids in one group
are labelled together. A kind of branching structure. i.e:
Group 1
10 --> 1, 1 --> (24,45)
24 --> 300
300 --> NULL
45 --> NULL
10 --> 36, 36 --> NULL,
Final group members: 10,1,24,36,45,300
Group 2
8 --> 11
11 --> NULL
Final group members: 8,11
Now I roughly know the logic I would want, but don't know how I would implement it elegantly. I am thinking of a recursive use of match or %in% to go down each branch, but am truly stumped this time.
The final result I would be chasing is:
result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))
> result
group id
1 1 10
2 1 1
3 1 24
4 1 36
5 1 45
6 1 300
7 2 8
8 2 11
The Bioconductor package RBGL (an R interface to the BOOST graph library) contains
a function, connectedComp(), which identifies the connected components in a graph --
just what you are wanting.
(To use the function, you will first need to install the graph and RBGL packages from Bioconductor.)
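For current R versions, installation via BiocManager is the usual route; something along these lines should work (a one-time setup sketch):
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("graph", "RBGL"))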
library(RBGL)
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
## Convert your 'from-to' data to a 'node and edge-list' representation
## used by the 'graph' & 'RBGL' packages
g <- ftM2graphNEL(as.matrix(test))
## Extract the connected components
cc <- connectedComp(g)
## Massage results into the format you're after
ld <- lapply(seq_along(cc),
             function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
do.call(rbind, ld)
# group id
# 1 1 10
# 2 1 1
# 3 1 24
# 4 1 36
# 5 1 45
# 6 1 300
# 7 2 8
# 8 2 11
Here's an alternative answer that I discovered myself after a nudge in the right direction from Josh. This answer uses the igraph package.
For those who are searching and come across this answer: my test dataset is what graph theory calls an "edge list" or "adjacency list" (http://en.wikipedia.org/wiki/Graph_theory).
library(igraph)
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
gr.test <- graph_from_data_frame(test)
links <- data.frame(id=unique(unlist(test)),group=components(gr.test)$membership)
links[order(links$group),]
# id group
#1 10 1
#2 1 1
#3 24 1
#5 36 1
#6 45 1
#7 300 1
#4 8 2
#8 11 2
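Note that the id/membership pairing above works because unique(unlist(test)) happens to list the ids in the same order in which graph_from_data_frame() registered the vertices. If you would rather not rely on that, components()$membership is a named vector (the names are the vertex names), so the table can be built directly from it; a small variant, with id coming back as character:
comps <- components(gr.test)
links <- data.frame(id = names(comps$membership),
                    group = comps$membership,
                    row.names = NULL)
links[order(links$group), ]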
Without using packages:
# 2 sets of test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
grouppairs <- function(df){
  # from wide to long format; assumes df is 2 columns of related id's
  test <- data.frame(group = 1:nrow(df), val = unlist(df))
  # keep moving to next pair until all same values have same group
  i <- 0
  while(any(duplicated(unique(test)$val))){
    i <- i+1
    # get group of matching values
    matches <- test[test$val == test$val[i], 'group']
    # change all groups with matching values to same group
    test[test$group %in% matches, 'group'] <- test$group[i]
  }
  # renumber starting from 1 and show only unique values in group order
  test$group <- match(test$group, sort(unique(test$group)))
  unique(test)[order(unique(test)$group), ]
}
# test
grouppairs(test)
grouppairs(mytest)
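For the original test data this reproduces the grouping asked for in the question: 10, 1, 24, 36, 45, 300 end up together in group 1, and 8, 11 in group 2 (row order aside).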
You said recursive... and I thought I'd be super terse while I'm at it.
Test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
Recursive function to get the groupings
aveminrec <- function(v1,v2){
  # within each group defined by v2, replace every value of v1 by the group minimum
  v2 <- ave(v1, by = v2, FUN = min)
  if(identical(v1,v2)){
    # nothing changed on this pass, so the groups are stable; relabel them 1, 2, ...
    as.numeric(as.factor(v2))
  }else{
    # otherwise swap the roles of the two vectors and recurse
    aveminrec(v2,v1)
  }
}
Prep data and simplify after
groupvalues <- function(valuepairs){
  # stack both id columns into one vector; the row number (recycled) puts the
  # two ids from the same row into the same starting group
  val <- unlist(valuepairs)
  grp <- aveminrec(val, 1:nrow(valuepairs))
  # drop duplicates and order by group, then value
  unique(data.frame(grp, val)[order(grp, val), ])
}
Get results
groupvalues(test)
groupvalues(mytest)
aveminrec() is probably along the lines of what you were thinking, though I bet there's a way to be more direct about going down each branch instead of repeating ave(), which is essentially split() plus lapply(). Maybe recursively split and lapply? As it stands, it's like repeated partial branching, or alternately, simplifying the two vectors a little at a time without losing group information.
Maybe parts of this could be used on a real problem, but groupvalues() is too dense to read without at least some comments. I also haven't checked how its performance compares to a for loop that uses ave and flips the groups that way.
Related
I'm running discrete-event simulations in R. The "heart" of my algorithm does the following (pseudocode):
1) Iterate over events
a) Change event[i] depending on resources
b) Change resources depending on outcome of step a)
The following reproducible example catches the main aspects:
Generate some data:
set.seed(4)
n <- 3
nr_resources <- 2
events <- data.frame(
t = as.integer(trunc(cumsum(rexp(n)))),
resource = NA,
worktime = as.integer(trunc(runif(n)*10))
)
resources <- data.frame(
id = 1:nr_resources,
t_free = 0L
)
events
resources
# > events
# t resource worktime
# 0 NA 2
# 4 NA 8
# 5 NA 2
# > resources
# id t_free
# 1 0
# 2 0
Now we can simulate dispatching of resources:
for (i in 1:n) {
events$resource[i] <- resources$id[resources$t_free <= events$t[i]][1]
resources$t_free[events$resource[i]] <- events$t[i] + events$worktime[i]
}
events
resources
# > events
# t resource worktime
# 0 1 2
# 4 1 8
# 5 2 2
# > resources
# id t_free
# 1 12
# 2 7
This approach works fine, but there are a number of drawbacks I'd like to eliminate. Since events and resources are split across two datasets, there is quite a lot of subsetting (search and replace) going on between them, which is not very readable. In the real application it even becomes the performance bottleneck. (Of course the real example is quite a bit more complicated.)
Therefore I ask myself if there are better ways to accomplish this task in R.
I thought about replacing the for-loop with a common higher-order function but did not come to any results.
The typical R lapply approach doesn't work, because lapply is not built for this kind of iterative change to its input data. (As far as I can see.)
My task looks a bit like the Reduce pattern. Since Reduce(sum, 1:3, accumulate = TRUE) uses intermediate results and also preserves them, I thought I could use the Reduce function but did not achieve any results.
I also thought about restructuring my data but with no success till now.
What I tried in detail
On the algorithmic side:
Failing approach with lapply:
l <- list(events = events, resources = resources)
l <- lapply(l, function(x) {
  l$events$resource <- l$resources$id[l$resources$t_free <= l$events$t][1]
  l$resources$t_free[l$events$resource] <- l$events$t + l$events$worktime
  return(l)
})
l$events
l$resources
The result becomes:
# $events
# t resource worktime
# 1 0 1 2
# 2 4 1 8
# 3 5 1 2
#
# $resources
# id t_free
# 1 1 7
# 2 2 0
Intermediate changes to resources are lost, and therefore resource 1 always gets booked.
Failing approach with Reduce:
l <- list(events = events, resources = resources)
l <- Reduce(function(l) {
  l$events$resource <- l$resources$id[l$resources$t_free <= l$events$t][1]
  l$resources$t_free[l$events$resource] <- l$events$t + l$events$worktime
  return(l)}, l, accumulate = TRUE)
This fails with
Error in f(init, x[[i]]) : unused argument (x[[i]])
because the function passed to Reduce has to accept two arguments (the accumulated value and the next element), whereas mine only takes one.
On the data side:
The other approach I can think of is to represent everything in one dataset, for example by crossing the events with the resources (one row per event-resource pair). I tried the following:
data <- merge(events, resources)
data <- data[order(data$t), ]
data
# t resource worktime id t_free
# 0 NA 2 1 0
# 0 NA 2 2 0
# 4 NA 8 1 0
# 4 NA 8 2 0
# 5 NA 2 1 0
# 5 NA 2 2 0
for (i in seq_len(nrow(data))) {
  if (is.na(data$resource[i])) {
    data$resource[data$t == data$t[i]] <- data$id[data$t_free <= data$t[i]][1]
    data$t_free[data$id == data$resource[i]] <- data$t[i] + data$worktime[i]
  }
}
data
# t resource worktime id t_free
# 0 1 2 1 12
# 0 1 2 2 7
# 4 1 8 1 12
# 4 1 8 2 7
# 5 2 2 1 12
# 5 2 2 2 7
events <- unique(data[,1:3])
events
# t resource worktime
# 0 1 2
# 4 1 8
# 5 2 2
resources <- unique(data[,4:5])
resources
# id t_free
# 1 12
# 2 7
This works as well, but I'm not sure whether it leads to better performance, readability and maintainability as things scale up.
So my question is:
Are there any alternatives on the algorithmic side or on the data side that improves my actual solution(s)?
I honestly prefer your first for-loop;
you should consider using something like Rcpp::sourceCpp and migrating your logic to C++.
I think that would be both readable and fast.
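A minimal sketch of what that migration could look like, assuming the integer columns from the example above (dispatch is just an illustrative name, and the snippet is untested):
library(Rcpp)
cppFunction('
List dispatch(IntegerVector t, IntegerVector worktime, IntegerVector t_free0) {
  IntegerVector t_free = clone(t_free0);   // work on a copy, not the R object
  int n = t.size(), m = t_free.size();
  IntegerVector resource(n, NA_INTEGER);
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      if (t_free[j] <= t[i]) {             // first free resource wins
        resource[i] = j + 1;               // 1-based, like resources$id
        t_free[j] = t[i] + worktime[i];    // book it until t + worktime
        break;
      }
    }
  }
  return List::create(_["resource"] = resource, _["t_free"] = t_free);
}')
out <- dispatch(events$t, events$worktime, resources$t_free)
## out$resource and out$t_free correspond to events$resource and resources$t_free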
If you must do it in R,
here's a possibility:
t_free <- Reduce(x = 1L:n,
                 init = rep(0L, nr_resources),
                 accumulate = TRUE,
                 f = function(t_free, i) {
                   # which.max will return the location of the first TRUE
                   id <- which.max(t_free <= events$t[i])
                   # R makes a local copy of t_free here
                   t_free[id] <- events$t[i] + events$worktime[i]
                   # return the chosen resource for this "iteration"
                   attr(t_free, "resource") <- id
                   # return the modified copy
                   t_free
                 })
# your events$resource column comes from the "resource" attribute, ignoring the init element
events$resource <- sapply(t_free[-1L], attr, "resource")
# your resources$t_free column in the last element
resources <- data.frame(id = 1L:nr_resources,
                        t_free = t_free[[n + 1L]])
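As a quick sanity check with the example data from the question, this reproduces the same dispatch as the original for-loop: resource 1, 1, 2 for the three events and t_free 12, 7 for the two resources.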
Suppose the data frame is like this:
df <- data.frame(x = c(1,7,8,15,24,100,9,19,128))
How do I create a new variable that satisfies the following condition:
y = 1 if 1 <= x <= 7
y = 2 if 8 <= x <= 14
y = 3 if 15 <= x <= 21
...
y = k if 1 + 7*(k-1) <= x <= 7 + 7*(k-1)
so that I end up with a new data frame like this:
df <- data.frame(y = c(1,1,2,3,4,15,2,3,19))
I am wondering if a for loop can be applied in this case.
Via simple algebra, you can do:
df$y <- floor((df$x+6)/7)
df
# x y
# 1 1 1
# 2 7 1
# 3 8 2
# 4 15 3
# 5 24 4
# 6 100 15
# 7 9 2
# 8 19 3
# 9 128 19
In R you will often find it easier (less typing and less thinking) to use vectorized operators than for loops for simple computations like this. In this case we performed calls to +, /, and floor over a whole vector instead of looping and using them on each element.
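If it helps to see what that vectorized call replaces, here are two sketches (assuming x contains positive integers; neither is needed in practice):
## equivalent formulation: ceiling instead of the +6 trick
df$y <- ceiling(df$x / 7)
## the same computation as an explicit loop, for contrast
y <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  y[i] <- floor((df$x[i] + 6) / 7)
}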
Please note, I'm not a programmer by trade; I'm a literature student, so please bear with me.
I would like to improve on an existing, working procedure. The split function is certainly one option (though I'm not sure how to apply it here).
Basically, I'm trying to subdivide an existing data frame into a list of sub-samples, in such a way that a run of rows sharing the same id is never split across two list elements.
Here is a working example together with sample data:
df <- data.frame(id=c(rep(1,3),rep(2,2),rep(3,3),rep(4,2),5,6,7,8,9,rep(10,5)),r1=rep(1,40),r2=rep(2,40))
x <- transform(df, rec=ave(df$id,df$id, FUN=seq_along))
x$cum <- cumsum(x$rec)
x$dif <- diff(c(0,x$cum),1)
x$lab <- ifelse(x$dif!=1,0,1)
x$seq <- seq_along(x$id)
x$subs <- x$lab*x$seq
seqrow <- seq(1,nrow(x),3) # how many rows approx. per part
rw <- x$subs[x$subs %in% seqrow]
start_rw <- c(1,rw[2:length(rw)])
end_rw <- c(start_rw[2:length(start_rw)]-1,nrow(x))
df.lst <- list()
for(i in 1:length(start_rw)){
df.lst[[i]] <- x[(start_rw[i]:end_rw[i]), ]
}
Within each list element, the ids should also be sorted in increasing order and the rows arranged by id.
Reading through your code, I would summarize your procedure as:
Compute seqrow, the row numbers at which you would be willing to split the list
Split df only at the positions in seqrow where df$id is new (hasn't appeared above); this list of positions is called start_rw in your code.
You can use duplicated to determine if df$id has appeared above or not, which enables you to grab start_rw more easily:
seqrow <- seq(1,nrow(df),3)
(start_rw <- intersect(which(!duplicated(df$id)), seqrow))
# [1] 1 4 13 16
All that remains is to split df at these positions. You can use diff to compute the number of elements in each grouping:
(groups <- rep(seq(start_rw), times=diff(c(start_rw, nrow(df)+1))))
# [1] 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
df.lst2 <- split(df, groups)
This matches the output of your code:
all.equal(unname(df.lst2), lapply(df.lst, function(x) x[,1:3]))
# [1] TRUE
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x, label) {
  cbind(x[sample(1:nrow(x), 1), ], data.frame(label = label))
}
df <- ddply(df[, c("group1", "group2", "value")],
            .(group1, group2),
            choice,
            label = "test")
Note that in this case, I am also adding an extra column to the data frame called "label" which is specified as an extra argument to the ddply function. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from Python to R for exploratory data analysis, but this type of aggregation is crucial for me. In Python, I can perform these operations very rapidly, but it is inconvenient, as I have to write a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
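With current data.table versions the same grouping is usually written with by = .(...) and .N (the number of rows in each group); this is just a more modern spelling of the call above:
dt[, .(value = value[sample(.N, 1)]), by = .(group1, group2)]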
EDIT 2: Performance comparison: data.table is ~15x faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
  dt[, list(value = value[sample(length(value), 1)]), 'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x) { x[sample(nrow(x), 1), ] }))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample withing groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
I need to get the maximum of a variable in a nested list. For a certain station number "s" and a certain member "m", mylist[[s]][[m]] is of the form:
station date.time member bias
6019 2011-08-06 12:00 mbr003 86
6019 2011-08-06 13:00 mbr003 34
For each station, I need to get the maximum of bias of all members. For s = 3, I managed to do it through:
library(plyr)
var1 <- mylist[[3]]
var2 <- lapply(var1, `[`, 4)
var3 <- laply(var2, .fun = max)
max.value <- max(var3)
Is there a way of avoiding the column number "4" in the second line and using the variable name $bias in lapply instead, or is there a better way of doing this altogether?
You can use [ with the names of columns of data frames as well as their index. So foo[4] will have the same result as foo["bias"] (assuming that bias is the name of the fourth column).
$bias isn't really the name of that column. $ is just another function in R, like [, that is used for accessing columns of data frames (among other things).
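To make that concrete, for a single data frame (call it foo, a hypothetical name, with bias as its fourth column) the accessors compare like this:
foo[4]         # one-column data frame, selected by position
foo["bias"]    # the same one-column data frame, selected by name
foo[["bias"]]  # the bias column as a plain vector
foo$bias       # same as foo[["bias"]]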
But now I'm going to go out on a limb and offer some advice on your data structure. If each element of your nested list contains the data for a unique combination of station and member, here is a simplified toy version of your data:
dat <- expand.grid(station = rep(1:3,each = 2),member = rep(1:3,each = 2))
dat$bias <- sample(50:100,36,replace = TRUE)
tmp <- split(dat,dat$station)
tmp <- lapply(tmp,function(x){split(x,x$member)})
> tmp
$`1`
$`1`$`1`
station member bias
1 1 1 87
2 1 1 82
7 1 1 51
8 1 1 60
$`1`$`2`
station member bias
13 1 2 64
14 1 2 100
19 1 2 68
20 1 2 74
etc.
tmp is a list of length three, where each element is itself a list of length three. Each element is a data frame as shown above.
It's really much easier to record this kind of data as a single data frame. You'll notice I constructed it that way first (dat) and then split it twice. In this case you can rbind it all together again using code like this:
newDat <- do.call(rbind,lapply(tmp,function(x){do.call(rbind,x)}))
rownames(newDat) <- NULL
In this form, these sorts of calculations are much easier:
library(plyr)
#Find the max bias for each unique station+member
ddply(newDat,.(station,member),summarise, mx = max(bias))
station member mx
1 1 1 87
2 1 2 100
3 1 3 91
4 2 1 94
5 2 2 88
6 2 3 89
7 3 1 74
8 3 2 88
9 3 3 99
#Or maybe the max bias for each station across all members
ddply(newDat,.(station),summarise, mx = max(bias))
station mx
1 1 100
2 2 94
3 3 99
Here is another solution using repeated lapply.
lapply(tmp, function(x) lapply(lapply(x, '[[', 'bias'), max))
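If you then want a single maximum per station from that nested result (assuming the same tmp structure as above), one further step is:
## maximum bias per station, across all members
sapply(tmp, function(x) max(sapply(x, function(y) max(y$bias))))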
You may need to use [[ instead of [, but it should work fine with a string (don't use the $). Try:
var2 <- lapply( var1, `[`, 'bias' )
or
var2 <- lapply( var1, `[[`, 'bias' )
depending on whether you want each element back as a one-column data frame ([) or as a plain vector ([[).