I am working with the R programming language.
I have this dataset that contains consecutive exam results from students over a period of time - I am trying to use this dataset to calculate the conditional probability of a student failing the next exam, given that the student failed the previous exam. The data looks something like this:
library(data.table)
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
I have been spending some time learning how to "vectorize" and "parallelize" a function in R that can calculate these conditional probabilities.
My initial code was a for loop. I was told that "vectorizing" your code lets R operate on many elements at once, whereas a for loop works on one element at a time, so vectorized code can be substantially faster. I was also told that "parallelizing" can potentially speed things up further by splitting the task across several "cores" of your computer.
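As a small illustration of the difference (just a toy example, not my actual task):
x <- runif(1e6)
# loop: R handles one element per iteration
total <- 0
for (v in x) total <- total + v
# vectorized: a single call operates on the whole vector at once
total_vec <- sum(x)
all.equal(total, total_vec)  # TRUE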
Thus, I tried to incorporate both of these principles (vectorizing and parallelizing) into my code. Furthermore, I was told about the advantages of the "data.table" library in R and converted my data frame into a "data.table" object for another potential gain in performance.
Here is my attempt:
# First, load the "doParallel" library
library(doParallel)
library(data.table)
# Set up a cluster with 4 cores
cl = makeCluster(4)
registerDoParallel(cl)
my_data = as.data.table(my_data)
# Now, you can use the `foreach` function with the `%dopar%` operator to parallelize the computation
my_vector = foreach(i = unique(my_data$id), .combine = rbind) %dopar% {
  tryCatch({
    # Inside the function, everything works the same as before
    # setDT(my_data)
    start_i = my_data[my_data$id == i,]
    pairs_i = data.table(first = head(start_i$results, -1), second = tail(start_i$results, -1))
    frame_i = data.table(table(pairs_i))
    frame_i$i = i
    # print(frame_i)
    return(frame_i)
  }, error = function(e){})
}
# Don't forget to stop the cluster when you're done
stopCluster(cl)
From here, I would then perform a series of JOINS on the results and calculate the different conditional probabilities (pass given pass, pass given fail, fail given fail, fail given pass), and place these results in a 2x2 contingency table.
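For illustration, the follow-up aggregation I have in mind would be something roughly like this (just a sketch; it assumes my_vector stacks the per-student counts with columns first, second and N):
all_pairs = as.data.table(my_vector)
# total counts of each (previous, next) combination across all students
cond_counts = all_pairs[, .(N = sum(N)), by = .(first, second)]
# conditional probability of the next result given the previous result
cond_counts[, prob := N / sum(N), by = first]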
The code seems to run - but I am interested in knowing if I have done this correctly and what could have been done differently.
Can someone please tell me if I have done this correctly?
Thanks!
This particular problem is not well suited for parallelization. The expense of setting up the cluster exceeds any benefit, especially when a vectorized solution is straightforward.
By sorting the table by id then date_exam_taken, we can get the previous exam result and then drop rows where breaks in id occur.
data.table then allows us to efficiently aggregate by the four combinations of results in order to form the contingency table.
m <- matrix(
setorder(
dt[
# get the results from the previous row (first row is NA)
,res0 := shift(results)
][
# keep only rows that have the same id as the previous row
id == shift(id)
][
# group by the four cases (pass-pass, pass-fail, etc.) and count the occurrences of each
, .N, c("res0", "results")
# sort descending by the current exam, then descending by the previous exam
], -results, -res0
# put the results into a 2x2 matrix
)$N, 2, 2, 0, list(c("pass0", "fail0"), c("pass1", "fail1"))
)
m # 2x2 contingency table by count
#> pass1 fail1
#> pass0 22527 22460
#> fail0 22491 22524
m/rowSums(m) # by proportion pass/fail given the results of the previous exam
#> pass1 fail1
#> pass0 0.5007447 0.4992553
#> fail0 0.4996335 0.5003665
This will be very fast. However, if performance is an overriding concern (as opposed to, e.g., readability), we can squeeze a little more out by using tabulate instead of data.table grouping operations. Put the two solutions in functions and test their performance with microbenchmark.
f1 <- function(dt) {
if (!identical(key(dt), c("id", "date_exam_taken"))) setkey(dt, id, date_exam_taken)
m <- matrix(
setorder(
dt[
,res0 := shift(results)
][
id == shift(id)
][
, .N, c("res0", "results")
], -results, -res0
)$N, 2, 2, 0, list(c("pass0", "fail0"), c("pass1", "fail1"))
)
m/rowSums(m)
}
f2 <- function(dt) {
if (!identical(key(dt), c("id", "date_exam_taken"))) setkey(dt, id, date_exam_taken)
m <- matrix(
with(
dt,
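# encode each (previous, current) result pair as a bin in 1:4:
# (pass, pass) = 1, (fail, pass) = 2, (pass, fail) = 3, (fail, fail) = 4;
# rows where the id changes (and the very first row) give 0 or NA, which tabulate ignores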
tabulate((shift(id) == id)*(4L - 2L*results - shift(results)), 4L)
), 2, 2, 0, list(c("pass0", "fail0"), c("pass1", "fail1"))
)
m/rowSums(m)
}
microbenchmark::microbenchmark(f1(dt),
f2(dt),
check = "identical")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1(dt) 3.0650 3.9351 4.931590 4.1148 4.63785 12.1386 100
#> f2(dt) 1.5193 1.7114 1.933132 1.7479 1.77500 7.8051 100
Data:
library(data.table)
(seed <- sample(.Machine$integer.max, 1))
#> [1] 1849007255
set.seed(seed)
dt <- data.table(
id = sample.int(10000, 100000, replace = TRUE),
results = sample(0:1, 100000, replace = TRUE),
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
)
Related
I am working with the R programming language.
I have this dataset that records exam results (1 = pass, 0 = fail) for a set of students at different times:
library(data.table)
library(doParallel)
# Generate some sample data
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
# Create a data frame from the sample data
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
# Generate some additional columns for each record
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
# Convert the data frame to a data.table
my_data = setDT(my_data)
# Create a cluster with 4 workers
cl = makeCluster(4)
I have this function that tracks, for each student, the number of times the student: failed an exam given that they failed the previous exam, passed given that they passed the previous exam, passed given that they failed the previous exam, and failed given that they passed the previous exam. Here is the function:
my_function <- function(i) {
# Use tryCatch to handle the case where there are no rows in the start_i data frame
tryCatch({
start_i = my_data[my_data$id == i,]
pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
frame_i = as.data.frame(table(pairs_i))
frame_i$i = i
return(frame_i)
}, error = function(err) {
# Return an empty data frame if there are no rows in the start_i data frame
return(data.frame())
})
}
Now, I would like to try to run this function on my data in parallel - that is, I would like to assign data belonging to different students to different cores of my computer, in an effort to reduce the time required to perform this computation. Here is my attempt:
# Export the data frames and the my_function to the workers on the cluster
clusterExport(cl, c("my_data", "my_function", "data.table"))
# Assign each worker a different subset of the data to work on
clusterSetRNGStream(cl)
n = nrow(my_data)
chunks = rep(1:4, each = n / 4)
my_data = my_data[chunks == 1,]
# Evaluate the code on the cluster (final_out is the final result)
final_out = parLapply(cl, unique(my_data$id), my_function)
# alternate version
final_out = clusterApply(cl, unique(my_data$id), my_function)
# Stop the cluster when finished
stopCluster(cl)
The code seems to have run without errors - but I am not sure if I have done everything correctly.
Can someone please comment on this?
Thanks!
So far as I can tell, the approach you've taken does what you expect. I am doubtful that the cluster is giving you any real speed improvement over the alternatives. For example, if you use a dplyr pipeline, you could do it pretty easily:
library(dplyr)

out <- my_data %>%
arrange(id, exam_number) %>%
group_by(id) %>%
mutate(prev_exam = lag(results)) %>%
group_by(id, results, prev_exam) %>%
tally() %>%
na.omit()
On my machine (macOS 12.6, 3.6 GHz Intel i9, 128GB RAM), the dplyr pipeline is about 3.5 times faster than the parallel approach. As #jblood94 said in his comment, the considerable communication overhead makes the cluster solution pretty inefficient. Maybe there is an even better data.table solution.
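For reference, a data.table version of the same tally might look roughly like this (an untested sketch, using the same column names as above):
library(data.table)
out_dt <- as.data.table(my_data)[order(id, exam_number)][
  , prev_exam := shift(results), by = id][
  !is.na(prev_exam), .N, by = .(id, results, prev_exam)]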
I have a data table that provides the length and composition of given vectors
for example:
set.seed(1)
dt = data.table(length = c(100, 150),
n_A = c(30, 30),
n_B = c(20, 100),
n_C = c(50, 20))
I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:
dt_80_list <- list() # create output lists
dt_20_list <- list()
for (i in 1:nrow(dt)){ # for each row in the data.table
sample_vec <- sample( c( rep("A", dt$n_A[i]), # create a randomised vector with the given number of each component.
rep("B", dt$n_B[i]),
rep("C", dt$n_C[i]) ) )
sample_vec_80 <- sample_vec[1:floor(length(sample_vec)*0.8)] # subset 80% of the vector
dt_80_list[[i]] <- data.table( length = length(sample_vec_80), # count the number of each component in the subset and output to list
n_A = length(sample_vec_80[which(sample_vec_80 == "A")]),
n_B = length(sample_vec_80[which(sample_vec_80 == "B")]),
n_C = length(sample_vec_80[which(sample_vec_80 == "C")])
)
dt_20_list[[i]] <- data.table( length = dt$length[i] - dt_80_list[[i]]$length, # subtract the number of each component in the 80% to identify the number in the 20%
n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
n_C = dt$n_C[i] - dt_80_list[[i]]$n_C
)
}
dt_80 <- do.call("rbind", dt_80_list) # collapse lists to output data.tables
dt_20 <- do.call("rbind", dt_20_list)
However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?
Thanks.
(I assumed your dataset consists of many more rows, but only a few columns.)
Here's a version I came up with, with three main changes:
use .N and by= to count the number of "A","B","C" drawn in each row
use the size argument in sample
join the original dt and dt_80 to calculate dt_20 without a for-loop
## draw training data
dt_80 <- dcast(
dt[,row:=1:nrow(dt)
][, .(draw=sample(c(rep("A80",n_A),
rep("B80",n_B),
rep("C80",n_C)),
size=.8*length) )
, by=row
][,.N,
by=.(row,draw)],
row~draw,value.var="N")[,length80:=A80+B80+C80]
## draw test data
dt_20 <- dt[dt_80,
.(A20=n_A-A80,
B20=n_B-B80,
C20=n_C-C80),on="row"][,length20:=A20+B20+C20]
There is probably still room for optimization, but I hope it already helps :)
EDIT
Here I add my initial idea. I did not post it at first because the code above is much faster, but this one might be more memory-efficient, which seems crucial in your case. So even if you already have a working solution, this might be of interest...
library(data.table)
library(Rfast)
## add row numbers
dt[,row:=1:nrow(dt)]
## sampling function
sampfunc <- function(n_A,n_B,n_C){
draw <- sample(c(rep("A80",n_A),
rep("B80",n_B),
rep("C80",n_C)),
size=.8*(n_A+n_B+n_C))
out <- Rfast::Table(draw)
return(as.list(out))
}
## draw training data
dt_80 <- dt[,sampfunc(n_A,n_B,n_C),by=row]
I'm trying to compute all the pairwise dissimilarities between observations in a data set consisting of only nominal variables using some self-defined dissimilarity metric.
Data looks like
set.seed(3424)
(mydata <- data.table(paste(sample(letters[1:5], 5, replace=T),
sample(LETTERS[1:5], 5, replace=T),
sep = ","),
paste(sample(LETTERS[1:5], 5, replace=T),
sample(LETTERS[1:5], 5, replace=T),
sep = ","),
paste(sample(letters[1:5], 5, replace=T),
sample(letters[1:5], 5, replace=T),
sep = ",")))
V1 V2 V3
1: a,A E,E b,b
2: e,D C,A d,d
3: d,B B,C d,d
4: c,B A,E b,d
5: a,B C,D d,a
library(data.table)
library(dplyr)
library(stringr)
metric <- function(pair){
intersection <- 0
union <- 0
for(i in 1:ncol(mydata)){
A <- pair[[1]][[i]]
B <- pair[[2]][[i]]
if(sum(is.na(A),is.na(B))==1)
union = union + 1
if(sum(is.na(A),is.na(B))==0){
intersection <- intersection + length(intersect(A,B))/length(union(A,B))
union = union + 1
}
}
1 - intersection/union
}
diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
for(i in 1:(nrow(mydata)-1)){
print(i) ## to check progress ##
for(j in (i+1):nrow(mydata)){
pair <- rbind(mydata[i], mydata[j])
diss[j, i] <- apply(pair, 1, function(x) strsplit(x, split=",")) %>% metric()
}
}
These loops work, but are really slow when mydata has 1000+ rows and 100+ columns.
The metric I use here is the Jaccard index, but a nested version, since each element in the data is not a single value. Instead of treating two levels as simply a match (0) or different (1), I apply Jaccard when comparing the levels as well.
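For example, comparing the cells "a,A" and "a,B" after splitting on the comma gives a partial similarity rather than a hard 0/1 match:
A <- c("a", "A")
B <- c("a", "B")
length(intersect(A, B)) / length(union(A, B))  # 1/3, instead of a strict mismatch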
Update:
Some context about my data, not the toy data I made up.
Each row represents a query, i.e. "SELECT ... FROM ... WHERE ...".
Each column contains part of the information in the query, i.e. 1st column contains everything between "SELECT" and "FROM", 2nd column contains what's between "FROM" and "WHERE", etc.
There are 100 columns and 400 rows; I don't know why there are so many columns though.
The number of elements in one cell can be quite arbitrary: some cells contain very long lists of values, while many are actually NAs. E.g.
SELECT
1: NA
2:p1.PLAYERID,f1.PLAYERNAME,p2.PLAYERID,f2.PLAYERNAME
3: PLAYER f1,PLAYER f2,PLAYS p1
4: NA
5: NA
6: c1.table_name t1,c2.table_name t2
7: NA
...
400: asd,vrht,yuetr,wxeq,yiknuy,sce,ercher
You can gain some speed pretty easily by doing less work. If you are only interested in pairwise comparisons, you only need to do N choose 2 comparisons instead of N^2. You can implement that with f2() below.
set.seed(3424)
(mydata <- data.table(sample(letters[1:5], 50, replace = T),
sample(LETTERS[1:5], 50, replace = T),
sample(1:3, 50, replace = T)))
mydf<-data.frame(mydata)
f1<- function(){
diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
for(i in 1:(nrow(mydata)-1)){
print(i) ## to check progress ##
for(j in (i+1):nrow(mydata)){
pair <- rbind(mydata[i], mydata[j])
diss[j, i] <- apply(pair, 1, function(x) strsplit(x, split=",")) %>% metric()
}
}
return(diss)
}
f2<-function(){
met<-NULL
A<-NULL
B<-NULL
choices<-choose(nrow(mydf),2)
combs<-combn(nrow(mydf),2)
for(i in 1:choices) {
print(i)
pair<-rbind(mydf[combs[1,i],], mydf[combs[2,i],])
met[i]<- apply(pair, 1, function(x) strsplit(x, split=",")) %>% metric()
A[i]<-mydf[combs[1,i],1]
B[i]<-mydf[combs[2,i],2]
}
results<-data.frame(A,B, met)
return(results)
}
library(microbenchmark)
microbenchmark(f1(), f2(), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
f1() 1381 1391.2 1416.8 1417.6 1434.9 1456 10
f2() 907 923.6 942.3 946.9 948.9 1008 10
It is a little faster, but not mind-blowingly so. My guess is that some more work needs to be done on the metric function you define. I tried to look at it and determine a way to vectorize it, but I could not find a way. If that can be done, this problem becomes trivial. For example, I have a similar program that measures pairwise cosine similarity between ~400 vectors of length ~5000. It has to make 400 choose 2 = 79800 comparisons and the entire program takes about 6 seconds to run.
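For a sense of what full vectorization can look like in that kind of problem (a sketch with made-up data, not your metric), pairwise cosine similarity reduces to a single matrix product:
X <- matrix(rnorm(400 * 5000), nrow = 400)  # hypothetical 400 vectors of length 5000
Xn <- X / sqrt(rowSums(X^2))                # normalize each row to unit length
cos_sim <- tcrossprod(Xn)                   # 400 x 400 matrix of all pairwise cosine similarities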
It's similar to the original, but I made a few changes. It runs more quickly, though I didn't time it precisely; 1000 rows with this code seems about like 100 rows with the original.
The main changes:
remove rbind by passing in variables to mapply
calculate the union variable once instead of adding to it every time (union <- union + 1)
split strings all at once outside of loops
check the length of the intersection before computing the union and adding to intersection (lenint > 0)
Hopefully something helps your case.
rownum <- 1000
(mydata <- data.table(paste(sample(letters[1:5], rownum, replace=T),
sample(LETTERS[1:5], rownum, replace=T),
sep = ","),
paste(sample(LETTERS[1:5], rownum, replace=T),
sample(LETTERS[1:5], rownum, replace=T),
sep = ","),
paste(sample(letters[1:5], rownum, replace=T),
sample(letters[1:5], rownum, replace=T),
sep = ",")))
allsplit <- lapply(mydata,strsplit,split = ',')
allsplitdf <- cbind(allsplit[['V1']],allsplit[['V2']],allsplit[['V3']])
allsplitlist <- split(allsplitdf,1:nrow(allsplitdf))
metric2 <- function(p1,p2){
intersection <- 0  # accumulated across all columns (initialising this inside the loop would reset it each iteration)
for(i in seq_along(p1)){
A <- p1[[i]]
B <- p2[[i]]
if(!any(is.na(A),is.na(B))){
lenint <- length(intersect(A,B))
if(lenint > 0){
intersection <- intersection + lenint/length(union(A,B))
}
}
}
1 - intersection/length(p1)
}
diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
for(i in 1:(nrow(mydata)-1)){
print(i) ## to check progress ##
for(j in (i+1):nrow(mydata)){
diss[j, i] <- mapply(metric2,p1 = allsplitlist[i],p2 = allsplitlist[j])
}
}
When constructing an algorithm it is important to keep in mind the speed/space trade-off. What I mean by that is that by storing your data in a different schema you can usually eliminate for loops; however, data stored in the new schema will generally occupy more space.
The reason your example is slow is because, among other things, you are looping over all the rows and the columns of your data. With a 1000x100 data.frame that is 1e5 computations. One way to eliminate the loop over your rows is to store your data a bit differently. For example, I use the expand.grid command to combine all pairwise comparisons within the same data.frame, dTMP. I then strip the comma and allow each member of the pair to occupy its own column (i.e. "a,A", which is originally contained in one variable, becomes "a" and "A" and represents entries in two separate variables). In general, reshaping data into different formats is quick, or at least quicker than looping over each row. This reshaping clearly, however, generates a data set which takes up more RAM. In your case the data.frame will be 1e6x4, which is very large, but not so large as to clog up all your RAM.
The reward for doing all that hard work is that it is now trivial and extremely fast to obtain the intersect and union variables. You will of course still need to loop over each column; however, we've eliminated one loop simply by rearranging your data. It is possible to remove the loop over the columns by utilizing 3D arrays, however, such an array would not fit into memory.
f3 <- function(){
intersection <- 0
for(v in names(mydata)){
dTMP <- expand.grid(mydata[[v]], mydata[[v]], stringsAsFactors = FALSE)[,c(2,1)]
#There is likely a more elegant way to do this.
dTMP <-
dTMP$Var2 %>%
str_split(., ",") %>%
unlist(.) %>%
matrix(., ncol = 2, nrow = nrow(dTMP), byrow = TRUE) %>%
cbind(., dTMP$Var1%>%
str_split(., ",") %>%
unlist(.) %>%
matrix(., ncol = 2, nrow = nrow(dTMP), byrow = TRUE)) %>%
as.data.frame(., stringsAsFactors = FALSE)
names(dTMP) <- c("v1", "v2", "v3", "v4")
intersect <- rowSums(dTMP[,c("v1", "v2")] == dTMP[,c("v3", "v4")])
intersect <- ifelse(rowSums(dTMP[,c("v1", "v2")] == dTMP[,c("v4", "v3")]) !=0, rowSums(dTMP[,c("v1", "v2")] == dTMP[,c("v4", "v3")]), intersect)
intersect <- ifelse(dTMP[, "v1"] == dTMP[, "v2"], 1, intersect)
MYunion <- sapply(as.data.frame(t(dTMP)), function(x) n_distinct(x))
intersection <- intersection + intersect/MYunion
}
union <- ncol(mydata)
return(matrix(1 - intersection/union, nrow = nrow(mydata), ncol = nrow(mydata), byrow = TRUE)) #This is the diss matrix, I think. Double check that I got the rows and columns correct
}
Update
I'm still having trouble replicating your results; however, I believe the newly updated code is very close. There is only one cell, (2,1), of the dissimilarity matrix where our results differ when set.seed(3424). The problem with the current iteration, however, is that I need to use a sapply to obtain MYunion. If you can think of a faster way to do this, you'll get big speed gains. Read this SO post for suggestions: Efficient Means of Identifying Number of Distinct Elements in a Row
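As a rough sketch of what such a replacement could look like (assuming dTMP keeps exactly the four columns v1-v4; not benchmarked), the per-row distinct count can be built from vectorized comparisons instead of sapply:
# start at 4 and subtract one for every column that duplicates an earlier column in the same row
MYunion <- 4L -
  (dTMP$v2 == dTMP$v1) -
  (dTMP$v3 == dTMP$v1 | dTMP$v3 == dTMP$v2) -
  (dTMP$v4 == dTMP$v1 | dTMP$v4 == dTMP$v2 | dTMP$v4 == dTMP$v3)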
I'm seeking to optimize this smartWindow algorithm (and the process where I apply it). My original post explains some context around the function and how I got here: Vectorizing a loop through lines of data frame R while accessing multiple variables the dataframe.
This currently takes me 240 seconds to run on my actual data. I've tried some profiling with Rprof; it seems that the chg2 <- line of smartWindow is eating the most time. I've also tried byte-compiling in R using cmpfun. I'm wondering whether there's a way to significantly improve the speed of what I'm trying to do.
What I'm really looking for is a technique that accomplishes what I've done below in something closer to 20 seconds than 240 seconds. I've shaved off 1-5% of the computation time using various tweaks, but what I'm really wondering is whether I can decrease the time by a factor greater than 2.
## the function
smartWindow <- function(tdate, aid, chgdf, datev='Submit.Date', assetv='Asset.ID', fdays=30, bdays=30) {
fdays <- tdate+fdays
bdays <- tdate-bdays
chg2 <- chgdf[chgdf[,assetv]==aid & chgdf[,datev]<fdays & chgdf[,datev]>bdays, ]
ret <- nrow(chg2)
return(ret)
}
## set up some data #################################################
dates <- seq(as.Date('2011-01-01'), as.Date('2013-12-31'), by='days')
aids <- paste(rep(letters[1:26], 3), 1:3, sep='')
n <- 3000
inc <- data.frame(
Submit.Date = sample(dates, n, replace=T),
Asset.ID = sample(aids, n, replace=T))
chg <- data.frame(
Submit.Date = sample(dates, n, replace=T),
Asset.ID = sample(aids, n, replace=T))
## applying function to just one incident ###########################
smartWindow(inc$Submit.Date[1], inc$Asset.ID[1], chgdf=chg, bdays=100)
## applying to every incident... this is the process I seek to optimize #########
system.time({
inc$chg_b30 <- apply(inc[,c('Submit.Date', 'Asset.ID')], 1, function(row) smartWindow(as.Date(row[1]), row[2], chgdf=chg,
datev='Submit.Date', assetv='Asset.ID', bdays=30, fdays=0))
})
table(inc$chg_b30)
I have a data frame with 50000 rows and 200 columns. There are duplicate rows in the data, and I want to aggregate the data by choosing the row with the maximum coefficient of variation among the duplicates, using the aggregate function in R. With aggregate I can use "mean" or "sum" by default, but not the coefficient of variation.
For example
aggregate(data, as.columnname, FUN=mean)
Works fine.
I have a custom function for calculating the coefficient of variation, but I am not sure how to use it with aggregate.
co.var <- function(x)
(
100*sd(x)/mean(x)
)
I have tried
aggregate(data, as.columnname, function (x) max (co.var (x, data[index (x),])
but it is giving an error as object x is not found.
Assuming that I understand your problem, I would suggest using tapply() instead of aggregate() (see ?tapply for more info). However, a minimal working example would be very helpful.
co.var <- function(x) ( 100*sd(x)/mean(x) )
## Data with multiple repeated measurements.
## There are three things (ID 1, 2, 3) that
## are measured two times, twice each (val1 and val2)
myDF<-data.frame(ID=c(1,2,3,1,2,3),val1=c(20,10,5,25,7,2),
val2=c(19,9,4,24,4,1))
## Calculate coefficient of variation for each measurement set
myDF$coVar<-apply(myDF[,c("val1","val2")],1,co.var)
## Use tapply() instead of aggregate
mySel<-tapply(seq_len(nrow(myDF)),myDF$ID,function(x){
curSub<-myDF[x,]
return(x[which(curSub$coVar==max(curSub$coVar))])
})
## The mySel vector is then the vector of rows that correspond to the
## maximum coefficient of variation for each ID
myDF[mySel,]
EDIT:
There are faster ways, one of which is below. However, with a 40000 by 100 dataset, the above code only took between 16 and 20 seconds on my machine.
# Create a big dataset
myDF <- data.frame(val1 = c(20, 10, 5, 25, 7, 2),
val2 = c(19, 9, 4, 24, 4, 1))
myDF <- myDF[sample(seq_len(nrow(myDF)), 40000, replace = TRUE), ]
myDF <- cbind(myDF, rep(myDF, 49))
myDF$ID <- sample.int(nrow(myDF)/5, nrow(myDF), replace = TRUE)
# Define a new function to work (slightly) better with large datasets
co.var.df <- function(x) ( 100*apply(x,1,sd)/rowMeans(x) )
# Create two datasets to benchmark the two methods
# (A second method proved slower than the third, hence the naming)
myDF.firstMethod <- myDF
myDF.thirdMethod <- myDF
Time the original method
startTime <- Sys.time()
myDF.firstMethod$coVar <- apply(myDF.firstMethod[,
grep("val", names(myDF.firstMethod))], 1, co.var)
mySel <- tapply(seq_len(nrow(myDF.firstMethod)),
myDF.firstMethod$ID, function(x) {
curSub <- myDF.firstMethod[x, ]
return(x[which(curSub$coVar == max(curSub$coVar))])
}, simplify = FALSE)
endTime <- Sys.time()
R> endTime-startTime
Time difference of 17.87806 secs
Time the second method
startTime3 <- Sys.time()
coVar3<-co.var.df(myDF.thirdMethod[,
grep("val",names(myDF.thirdMethod))])
mySel3 <- tapply(seq_along(coVar3),
myDF[, "ID"], function(x) {
return(x[which(coVar3[x] == max(coVar3[x]))])
}, simplify = FALSE)
endTime3 <- Sys.time()
R> endTime3-startTime3
Time difference of 2.024207 secs
And check to see that we get the same results:
R> all.equal(mySel,mySel3)
[1] TRUE
There is an additional change from the original post, in that the edited code considers that there may be more than one row with the highest CV for a given ID. Therefore, to get the results from the edited code, you must unlist the mySel or mySel3 objects:
myDF.firstMethod[unlist(mySel),]
myDF.thirdMethod[unlist(mySel3),]