I am working with the R programming language.
I have this dataset that records exam results (1 = pass, 0 = fail) for a set of students at different times:
library(data.table)
library(doParallel)
# Generate some sample data
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
# Create a data frame from the sample data
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
# Generate some additional columns for each record
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
# Convert the data frame to a data.table
my_data = setDT(my_data)
# Create a cluster with 4 workers
cl = makeCluster(4)
I have a function that counts, for each student, the four possible transitions between consecutive exams: a fail after a fail, a pass after a pass, a pass after a fail, and a fail after a pass. Here is the function:
my_function <- function(i) {
  # Use tryCatch to handle the case where there are no rows in the start_i data frame
  tryCatch({
    start_i = my_data[my_data$id == i, ]
    pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
    frame_i = as.data.frame(table(pairs_i))
    frame_i$i = i
    return(frame_i)
  }, error = function(err) {
    # Return an empty data frame if there are no rows in the start_i data frame
    return(data.frame())
  })
}
Now, I would like to try to run this function on my data in parallel - that is, I would like to assign data belonging to different students to different cores within my computer, in an effort to reduce the time required to run this function. Here is my attempt:
# Export the data frames and the my_function to the workers on the cluster
clusterExport(cl, c("my_data", "my_function", "data.table"))
# Assign each worker a different subset of the data to work on
clusterSetRNGStream(cl)
n = nrow(my_data)
chunks = rep(1:4, each = n / 4)
my_data = my_data[chunks == 1,]
# Evaluate the code on the cluster (final_out is the final result)
final_out = parLapply(cl, unique(my_data$id), my_function)
# alternate version
final_out = clusterApply(cl, unique(my_data$id), my_function)
# Stop the cluster when finished
stopCluster(cl)
The code seems to have run without errors - but I am not sure if I have done everything correctly.
Can someone please comment on this?
Thanks!
As far as I can tell, the approach you've taken does what you expect. I am doubtful, though, that the cluster is giving you any real speed improvement over simpler alternatives. For example, with a dplyr pipeline you could do it pretty easily:
library(dplyr)
out <- my_data %>%
  arrange(id, exam_number) %>%
  group_by(id) %>%
  mutate(prev_exam = lag(results)) %>%
  group_by(id, results, prev_exam) %>%
  tally() %>%
  na.omit()
On my machine (macOS 12.6, 3.6 GHz Intel i9, 128 GB RAM), the dplyr pipeline is about 3.5 times faster than the parallel approach. As #jblood94 said in his comment, the considerable communication overhead makes the cluster solution pretty inefficient. Maybe there is an even better data.table solution.
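For what it's worth, here is a rough data.table sketch of the same tally (it assumes the my_data built in the question, with its id, results and exam_number columns; I have not benchmarked it against the dplyr version):
library(data.table)
# per student, pair each result with the previous result, then count the pairs
out_dt <- my_data[order(id, exam_number)][
  , prev_exam := shift(results), by = id][
  !is.na(prev_exam), .N, by = .(id, results, prev_exam)]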
I am working with the R programming language.
I have this dataset that contains consecutive exam results from students over a period of time - I am trying to use this dataset to calculate the conditional probability of a student failing the next exam, given that the student failed the previous exam. The data looks something like this:
library(data.table)
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
I have been spending some time learning how to "vectorize and parallelize" a function in R that will be able to calculate these conditional probabilities.
My initial code was a for loop. I was told that "vectorizing" your code lets R operate on many elements at once, whereas a for loop handles one element at a time, so vectorized code tends to run in less time. I was also told that "parallelizing" can make code run faster by splitting the work across several "cores" of your computer.
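For example, this is the kind of difference I have in mind (a toy example, unrelated to my actual data):
x <- 1:5
out_loop <- numeric(length(x))
for (k in seq_along(x)) out_loop[k] <- x[k] * 2  # one element at a time
out_vec <- x * 2                                 # the whole vector at once
identical(out_loop, out_vec)
#> [1] TRUE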
Thus, I tried to exploit both of these principles (vectorizing and parallelizing) and incorporate them both into my code. Furthermore, I was told about the advantages of using the "data.table" library in R and converted my data frame into a "data.table" object for another potential gain in performance.
Here is my attempt:
# First, load the "doParallel" library
library(doParallel)
library(data.table)
# Set up a cluster with 4 cores
cl = makeCluster(4)
registerDoParallel(cl)
my_data = as.data.table(my_data)
# Now, you can use the `foreach` function with the `%dopar%` operator to parallelize the computation
my_vector = foreach(i = unique(my_data$id), .combine = rbind) %dopar% {
  tryCatch({
    # Inside the function, everything works the same as before
    #setDT(my_data)
    start_i = my_data[my_data$id == i, ]
    pairs_i = data.table(first = head(start_i$results, -1), second = tail(start_i$results, -1))
    frame_i = data.table(table(pairs_i))
    frame_i$i = i
    #print(frame_i)
    return(frame_i)
  }, error = function(e) {})
}
# Don't forget to stop the cluster when you're done
stopCluster(cl)
From here, I would then perform a series of JOINS on the results and calculate the different conditional probabilities (pass given pass, pass given fail, fail given fail, fail given pass), and place these results in a 2x2 contingency table.
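Roughly speaking, I have something like this in mind for that follow-up step (just a sketch - the column names first, second and N are my assumptions about what the combined my_vector will contain and may need adjusting):
# sketch only: assumes my_vector has columns "first", "second" and a count "N"
totals <- my_vector[, .(N = sum(N)), by = .(first, second)]
totals[, prob := N / sum(N), by = first]               # P(current | previous)
cont_table <- dcast(totals, first ~ second, value.var = "N")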
The code seems to run - but I am interested in knowing if I have done this correctly and what could have been done differently.
Can someone please tell me if I have done this correctly?
Thanks!
This particular problem is not well suited for parallelization. The expense of setting up the cluster exceeds any benefit, especially when a vectorized solution is straightforward.
By sorting the table by id then date_exam_taken, we can get the previous exam result and then drop rows where breaks in id occur.
data.table then allows us to efficiently aggregate by the four combinations of results in order to form the contingency table.
# sort by student, then by exam date, as described above
setkey(dt, id, date_exam_taken)
m <- matrix(
setorder(
dt[
# get the results from the previous row (first row is NA)
,res0 := shift(results)
][
# keep only rows that have the same id as the previous row
id == shift(id)
][
# group by the four cases (pass-pass, pass-fail, etc.) and count the occurrences of each
, .N, c("res0", "results")
# sort descending by the current exam, then descending by the previous exam
], -results, -res0
# put the results into a 2x2 matrix
)$N, 2, 2, 0, list(c("pass0", "fail0"), c("pass1", "fail1"))
)
m # 2x2 contingency table by count
#> pass1 fail1
#> pass0 22527 22460
#> fail0 22491 22524
m/rowSums(m) # by proportion pass/fail given the results of the previous exam
#> pass1 fail1
#> pass0 0.5007447 0.4992553
#> fail0 0.4996335 0.5003665
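Since the question is specifically about the probability of failing given a previous failure, that single estimate can also be read directly off the count matrix:
# P(fail on the next exam | failed the previous exam)
m["fail0", "fail1"] / sum(m["fail0", ])
#> [1] 0.5003665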
This will be very fast. However, if performance is an overriding concern (as opposed to, e.g., readability), we can squeeze a little more out by using tabulate instead of data.table grouping operations. Put the two solutions in functions and test their performance with microbenchmark.
f1 <- function(dt) {
if (!identical(key(dt), c("id", "date_exam_taken"))) setkey(dt, id, date_exam_taken)
m <- matrix(
setorder(
dt[
,res0 := shift(results)
][
id == shift(id)
][
, .N, c("res0", "results")
], -results, -res0
)$N, 2, 2, 0, list(c("pass0", "fail0"), c("pass1", "fail1"))
)
m/rowSums(m)
}
f2 <- function(dt) {
if (!identical(key(dt), c("id", "date_exam_taken"))) setkey(dt, id, date_exam_taken)
m <- matrix(
with(
dt,
tabulate((shift(id) == id)*(4L - 2L*results - shift(results)), 4L)
), 2, 2, 0, list(c("pass0", "fail0"), c("pass1", "fail1"))
)
m/rowSums(m)
}
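To see how the tabulate() expression in f2 encodes the four cases, the mapping from (previous result, current result) to bins 1-4 can be spelled out:
# bin = 4 - 2*res - prev; rows where the id changes get bin 0, which tabulate() ignores
cases <- expand.grid(prev = 1:0, res = 1:0)
cases$bin <- 4L - 2L*cases$res - cases$prev
cases
#>   prev res bin
#> 1    1   1   1
#> 2    0   1   2
#> 3    1   0   3
#> 4    0   0   4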
microbenchmark::microbenchmark(f1(dt),
f2(dt),
check = "identical")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1(dt) 3.0650 3.9351 4.931590 4.1148 4.63785 12.1386 100
#> f2(dt) 1.5193 1.7114 1.933132 1.7479 1.77500 7.8051 100
Data:
library(data.table)
(seed <- sample(.Machine$integer.max, 1))
#> [1] 1849007255
set.seed(seed)
dt <- data.table(
id = sample.int(10000, 100000, replace = TRUE),
results = sample(0:1, 100000, replace = TRUE),
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
)
I have a dataset which I want to process using tq_mutate and rollapply with different parameter values.
Currently I'm using a for loop to go over all the parameter values, but I'm sure this is not the most efficient or fastest way to do this task (especially when I am going to be looking at large numbers of parameter values). How could the for loop be improved or removed? I suspect the answer involves purrr::map or some other approach (multithreading/multicore, etc.), but I've not been able to find useful examples online.
Below is some sample code. Please ignore the simplicity of the dataset and of the example function's output; it is for illustrative purposes only. What I want to do is iterate over many different V0 values.
library(dplyr)
library(tidyverse)
library(broom)
library(tidyquant)
my_bogus_function <- function(df, V0=1925) {
# WILL HAVE SOMETHING MORE SOPHISTICATED IN HERE BUT KEEPING IT SIMPLE
# FOR THE PURPOSES OF THE QUESTION
c(V0, V0*2)
}
window_size <- 7 * 24
cnames = c("foo", "bar")
df <- c("FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01") %>%
dplyr::select("date", "open")
# CAN THIS LOOP BE DONE IN A MORE EFFICIENT MANNER?
for (i in (1825:1830)){
df <- df %>%
tq_mutate(mutate_fun = rollapply,
width = window_size,
by.column = FALSE,
FUN = my_bogus_function,
col_rename = gsub("$", sprintf(".%d", i), cnames),
V0 = i
)
}
# END OF THE FOR LOOP I WANT FASTER
Given that R uses only one core by default, I found an improvement by using the parallel, doSNOW and foreach packages, which allow multiple cores to be used (note that I'm on a Windows machine, so some other packages are not available).
I'm sure there are other ways to multithread/parallelise/vectorise this code.
Here is the code for anyone interested.
library(dplyr)
library(tidyverse)
library(tidyquant)
library(parallel)
library(doSNOW)
library(foreach)
window_size <- 7 * 24
cnames = c("foo", "bar")
df <- c("FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01") %>%
dplyr::select("date", "open")
my_bogus_function <- function(df, V0=1925) {
# WILL HAVE SOMETHING MORE SOPHISTICATED IN HERE BUT KEEPING IT SIMPLE
# FOR THE PURPOSES OF THE QUESTION
c(V0, V0*2)
}
# CAN THIS LOOP BE DONE IN A MORE EFFICIENT/FASTER MANNER? YES
numCores <- detectCores() # get the number of cores available
cl <- makeCluster(numCores, type = "SOCK")
registerDoSNOW(cl)
# Function to combine the outputs
mycombinefunc <- function(a,b){merge(a, b, by = c("date","open"))}
# Run the loop over multiple cores
meh <- foreach(i = 1825:1830, .combine = "mycombinefunc",
               # make dplyr and tidyquant available on each worker
               .packages = c("dplyr", "tidyquant")) %dopar% {
message(i)
df %>%
# Adjust everything
tq_mutate(mutate_fun = rollapply,
width = window_size,
by.column = FALSE,
FUN = my_bogus_function,
col_rename = gsub("$", sprintf(".%d", i), cnames),
V0 = i
)
}
stopCluster(cl)
# END OF THE FOR LOOP I WANTED FASTER
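Since the question also mentioned purrr::map, here is a rough single-threaded sketch of the same loop with purrr (no parallelism - it just replaces the for loop; res_list and meh_purrr are only illustrative names):
library(purrr)
res_list <- map(1825:1830, function(i) {
  df %>%
    tq_mutate(mutate_fun = rollapply,
              width = window_size,
              by.column = FALSE,
              FUN = my_bogus_function,
              col_rename = gsub("$", sprintf(".%d", i), cnames),
              V0 = i)
})
# merge the pieces the same way mycombinefunc does
meh_purrr <- reduce(res_list, merge, by = c("date", "open"))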
I have some code:
test<-therapyDF %>% group_by(therapyDF$prodcode) %>% summarize(count=n_distinct(therapyDF$patid))
that is designed to group all prodcode entries (drug) and then count how many patients (patid) have an instance of each drug. For example, the raw data is held in a dataframe similar to:
patid prodcode
1 A
1 B
2 C
3 A
3 A
3 B
Thus, output will be:
A 2
B 2
C 1
Is there any way of parallelising this code? The real data is in excess of 100 million records, and it has been running in serial for more than 8 hours.
I'm struggling to adopt the *apply methodologies of R and the numerous R parallel packages. Splitting the original data frame would be a bit tricky, as the data is not very well organised (just how I got it) and it would require grouping and extracting by $patid. I am running this on an 8-core Intel Linux box.
Thanks
It is possible to parallelize a group_by/summarize with a helper like this:
summarize_par <- function(grouped_df, ...) {
  # NOTE: relies on the internal attributes of grouped data frames
  # ("group_sizes", "indices", "labels") used by older dplyr versions
  sizes <- attr(grouped_df, "group_sizes")
  ord <- order(sizes, decreasing = TRUE)
  one_group <- function(gdf, i) {
    size_i <- sizes[i]
    structure(
      gdf[attr(gdf, "indices")[[i]] + 1, ],
      indices = list(0:(size_i - 1)),
      group_sizes = size_i,
      biggest_group_sizes = size_i,
      labels = attr(gdf, "labels")[i, , drop = FALSE]
    )
  }
  dots <- dplyr:::named_quos(...)
  res <- foreach(ic = ord) %dopar% {
    dplyr::summarise(one_group(grouped_df, ic), !!!dots)
  }
  do.call(rbind, res[match(seq_along(ord), ord)])
}
Test:
N <- 2e7
therapyDF <- data.frame(patid = sample.int(N/2, size = N, replace = TRUE),
prodcode = sample(LETTERS, size = N, replace = TRUE))
library(dplyr)
system.time(true <- therapyDF %>%
group_by(prodcode) %>%
summarize(count=n_distinct(patid)))
library(foreach)
library(doParallel)
registerDoParallel(cl <- makeForkCluster(detectCores() / 2))
system.time(test <- therapyDF %>%
group_by(prodcode) %>%
summarize_par(count=n_distinct(patid)))
all.equal(true$count, test$count)
stopCluster(cl)
I get:
sequential: 8.228 seconds
parallel with 2 cores: 6.525 seconds
Here, my computer does not have many cores or much memory; you can expect better results on a better machine.
There is a performance issue with n_distinct at the moment (see https://github.com/tidyverse/dplyr/issues/977). You should use length(unique(patid)) instead of n_distinct(patid) to speed things up.
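A minimal sketch of that swap, applied to the serial pipeline from the test above:
# same aggregation, with length(unique()) replacing n_distinct()
therapyDF %>%
  group_by(prodcode) %>%
  summarize(count = length(unique(patid)))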
You should give data.table a try:
N <- 1e8
therapyDF <- data.frame(patid = sample.int(N/2, size = N, replace = TRUE),
prodcode = sample(LETTERS, size = N, replace = TRUE))
library(dplyr)
system.time(therapyDF %>%
group_by(prodcode) %>%
summarize(count=n_distinct(patid)))
#>    user  system elapsed
#>  36.939   1.196  38.136
library(data.table)
setDT(therapyDF)
system.time(therapyDF[, .(count = uniqueN(patid)), by = prodcode])
#>    user  system elapsed
#>   5.727   0.116   5.843
It uses OpenMP for parallel processing.
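If you want to check or limit how many threads data.table uses, it provides helpers for that:
library(data.table)
getDTthreads()   # number of threads data.table will use
setDTthreads(4)  # optionally cap it, e.g. at 4 threads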
I am using the ctree method from the party R package to generate a decision tree.
My dataset has about 22 columns and 650,000 rows of data. I allocated 10 GB of memory to my R session using the memory.limit command.
I have a 2.3 GHz i3 processor and 6 GB of RAM. What am I doing wrong here?
The error I get is:
Calloc could not allocate memory (6223507 of 8 bytes)
OK, I finally found some time to do this. It's not too elegant, but it should work.
First, load the libraries and the function below (you'll need to install the data.table package):
library(data.table)
library(party)
WeightFunc <- function(data, DV) {
  # Helper paste function used to build unique paths
  paste2 <- function(x) paste(x, collapse = ",")
  ignore <- DV
  # Creating unique paths
  test3 <- apply(data[setdiff(names(data), ignore)], 1, paste2)
  # Binding the unique paths vector back to the original data
  data <- cbind(data, test3)
  #data
  # Getting the values of each explaining variable per unique path
  dt <- data.table(data[setdiff(names(data), ignore)])
  dt.out <- as.data.frame(dt[, head(.SD, 1), by = test3])
  # Creating dummy variables for each value of the dependent variable for further calculations
  DVLvs <- as.character(unique(data[, DV]))
  data[, DVLvs[1]] <- ifelse(data[, DV] == DVLvs[1], 1, 0)
  data[, DVLvs[2]] <- ifelse(data[, DV] == DVLvs[2], 1, 0)
  data[, DVLvs[3]] <- ifelse(data[, DV] == DVLvs[3], 1, 0)
  # Summing the dummy variables per unique path
  dt <- data.table(data[c("test3", DVLvs)])
  dt.out2 <- as.data.frame(dt[, lapply(.SD, sum), by = test3])
  # Binding unique paths with sums
  dt.out2$test3 <- dt.out$test3 <- NULL
  test <- cbind(dt.out, dt.out2)
  # Duplicating the data in order to create weights for every level of the explained variable
  test2 <- test[rep(1:nrow(test), each = 3), ]
  test2 <- cbind(test2, AdjDV = DVLvs)
  test2$Weights <- ifelse(is.element(seq(1:nrow(test2)), grep("[.]1", rownames(test2))), test2[, DVLvs[2]],
                          ifelse(is.element(seq(1:nrow(test2)), grep("[.]2", rownames(test2))), test2[, DVLvs[3]], test2[, DVLvs[1]]))
  # Dropping the columns that are no longer needed
  test2[, DVLvs[1]] <- test2[, DVLvs[2]] <- test2[, DVLvs[3]] <- NULL
  return(test2)
}
Now run this function on your data set, where data is your data and DV is the name of your explained variable (in quotes), and save the result in a new dataset, for example:
Newdata <- WeightFunc(data = Mydata, DV = "Success")
This process could take a while if you have many unique paths, but it shouldn't overload your memory. If you don't have too many unique paths, this function should shrink your data set by a factor of tens or even hundreds. Also, this function only works for a 3-level factor explained variable (like yours).
After that, you can run ctree as before, but with the new data, the new explained variable (which will be called AdjDV), and the weights argument set to the Weights column. You'll also have to exclude Weights from the dataset when running ctree.
Like this:
ct <- ctree(AdjDV ~., data = Newdata[setdiff(names(Newdata), "Weights")], weights = Newdata$Weights)
I am trying to reshape/reduce my data. So far I use a for loop (very slow), but from what I gather this should be quite fast with plyr.
I have many groups (firms, as a factor in the dataset) and I want to drop entirely every firm that has a 0 entry for value in any of its rows. I thus create a new data.frame but leave out all groups that show 0 for value at some point.
The for loop:
Data Creation:
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
value = rpois(40, 2))
-----------------------------
splitby = mydf$firmname
new.data <- data.frame()
for (i in 1:(length(unique(splitby)))) {
  temp <- subset(mydf, splitby == as.character(paste(unique(splitby)[i])))
  if (all(temp$value > 0) == "TRUE") {
    new.data <- rbind(new.data, temp)
  }
}
Then I delete all empty firm factor levels:
new.data$splitby <- factor(new.data$splitby)
Is there a way to achieve that with the plyr package? Can the subset function be used in that context?
EDIT: To make the problem reproducible, the data-creation code suggested by BenBarnes has been added. Ben, thanks a lot for that. Furthermore, my code has been altered to match the answers provided below.
You could supply an anonymous function to the .fun argument in ddply():
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
value = rpois(40, 2))
library(plyr)
ddply(mydf, .(firmname), function(x) if (any(x$value == 0)) NULL else x)
Or using [, as suggested by Andrie:
firms0 <- unique(mydf$firmname[which(mydf$value == 0)])
mydf[-which(mydf$firmname %in% firms0), ]
Note that the results of ddply are sorted according to firmname.
EDIT
For the example in your comments, this approach is again faster than using ddply() to subset, selecting only firms with more than three entries:
firmTable <- table(mydf$firmname)
firmsGT3 <- names(firmTable)[firmTable > 3]
mydf[mydf$firmname %in% firmsGT3, ]
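For comparison, the same filter can also be written with data.table (just a sketch using the sample mydf above; dtf is only an illustrative name):
library(data.table)
dtf <- as.data.table(mydf)
# keep a firm's rows only if none of its value entries are zero
dtf[, if (all(value > 0)) .SD, by = firmname]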