I am trying to learn how to use parallel processing in R. A snapshot of the data and the code is provided below.
Creating a rough dataset
library(truncnorm)
#Creating a mock dataframe
Market =c('City1','City2','City3','City4','City5','City2','City4','City1','City3','City5')
Car_type = c('A','A','A','A','A','B','B','B','B','B')
Variable1=c(.34,.19,.85,.27,.32,.43,.22,.56,.17,.11)
Car_purchased = c(1,0,0,1,0,1,0,0,1,1)
Market_data = data.frame(Market,Car_type,Variable1,Car_purchased)
Market_data2=do.call("rbind", replicate(100, Market_data, simplify = FALSE))
#Create a bigger dataset
Market_data2$Final_value = 0 #create a column for future calculations
empty_list = list()
Writing a function and running the function
Car_Value = function(data){
  market_list = unique(Market_data2$Market)
  for (m in market_list){
    market_subset = Market_data2[which(Market_data2$Market == m), ]
    for (i in 1:nrow(market_subset)){
      if (market_subset[i, 'Car_purchased'] == 1){
        market_subset[i, 'Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = max(market_subset$Variable1), sd = 1)
      } else {
        market_subset[i, 'Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = market_subset[i, 'Variable1'], sd = 1)
      }
    }
    empty_list = rbind(empty_list, market_subset)
  }
  return(empty_list)
}
get_value = Car_Value(data=Market_data2)
In the above example, there are 5 values of "Market" and 2 of "Car_type". Consumers may have bought a car in any market. I have to calculate a value ("Final_value") from a given truncated normal distribution. This value depends only on Variable1 within the given market, which is why I use the outer for loop. The mean of the truncated normal distribution depends on Variable1 (max(Variable1) in a market if Car_purchased==1, or the row's own value if Car_purchased==0). This version of the code runs perfectly fine (although it is not optimized for speed).
Problem
Next, what I would like to do is use parallel processing for the outer for loop, i.e., the loop across markets, since the Final_value of a market depends only on the observations within that market.
Unfortunately, I only know how to implement parallel processing for each line of the dataset. For example, my code (provided below) assigns the 1st row to the 1st core, the 2nd row to the 2nd core, and so on. This is inefficient and takes a long time, since each row has to create the market subset and then find the max of that subset.
My inefficient version
library(parallel)
library(foreach)
library(doParallel)
library(iterators)
library(utils)
library(truncnorm)
cl=parallel::makeCluster(4,type="PSOCK")
registerDoParallel(cl)
clusterEvalQ(cl, {library(truncnorm)})
Car_Value_Parallel <- function(market_data){
  output <- foreach(x = iter(market_data, by = "row"), .combine = rbind) %dopar% {
    market_subset = market_data[which(market_data$Market == x$Market), ]
    if (x['Car_purchased'] == 1){
      x['Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = max(market_subset$Variable1), sd = 1)
    } else {
      x['Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = x['Variable1'], sd = 1)
    }
    return(x)
  }
  output
}
get_value_parallel = Car_Value_Parallel(market_data = Market_data2)
stopCluster(cl)
This is highly inefficient if I run it on a dataset of more than 100K rows (my actual dataset has about 1.2 million rows). However, I could not implement the parallelization at the market level, where the parallel computation would be: run the computation for City1 on the 1st core, City2 on the 2nd core, and so on. Can someone please help? Any help is appreciated. Thanks.
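For reference, this is roughly the market-level split I am imagining (an untested sketch; it reuses the cluster setup above and replaces the inner row loop with vectorized per-row draws, which I believe is equivalent to my original loop):
Car_Value_By_Market <- function(data){
  market_list <- unique(data$Market)
  # each foreach iteration handles one complete market on one worker
  foreach(m = market_list, .combine = rbind, .packages = "truncnorm") %dopar% {
    market_subset <- data[data$Market == m, ]
    max_var1 <- max(market_subset$Variable1)
    market_subset$Final_value <- ifelse(
      market_subset$Car_purchased == 1,
      rtruncnorm(nrow(market_subset), a = -10, b = 0, mean = max_var1, sd = 1),
      rtruncnorm(nrow(market_subset), a = -10, b = 0, mean = market_subset$Variable1, sd = 1)
    )
    market_subset
  }
}
# get_value_by_market = Car_Value_By_Market(data = Market_data2)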
P.S. My apologies for the long question. I just wanted to show all versions of the code that I have used.
I see no reason to pursue parallel processing with your data set. Instead, look into packages like dplyr or data.table for a more efficient solution.
From my understanding of your problem, for each Market you want to apply rtruncnorm to create the variable Final_value, where the mean argument of rtruncnorm depends on the variable Car_purchased.
We can accomplish this without a for loop, using dplyr.
library(truncnorm)
library(dplyr)
# Creating a mock dataframe
Market <- c("City1", "City2", "City3", "City4", "City5", "City2", "City4", "City1", "City3", "City5")
Car_type <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
Variable1 <- c(.34, .19, .85, .27, .32, .43, .22, .56, .17, .11)
Car_purchased <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1)
Market_data <- data.frame(Market, Car_type, Variable1, Car_purchased)
Market_data2 <- replicate(100, Market_data, simplify = FALSE) %>% bind_rows()
#Create a bigger dataset
Market_data2$Final_value = 0 #create a column for future calculations
empty_list = list()
Car_Value2 <- function(data) {
  data %>%
    group_by(Market) %>%
    mutate(
      Final_value = if_else(
        Car_purchased == 1,
        rtruncnorm(1, a = -10, b = 0, mean = max(Variable1), sd = 1),
        rtruncnorm(1, a = -10, b = 0, mean = Variable1, sd = 1)
      )
    )
}
microbenchmark::microbenchmark(
Car_Value(Market_data2),
Car_Value2(Market_data2),
times = 100
)
#> Unit: milliseconds
#>                      expr       min        lq      mean   median        uq       max neval cld
#>   Car_Value(Market_data2) 66.109304 68.043575 69.030763 68.56569 69.681255 89.497035   100   b
#>  Car_Value2(Market_data2)  1.073318  1.101578  1.204737  1.17583  1.230687  3.465425   100   a
# Even bigger dataframe
Market_data3 <- replicate(120000, Market_data, simplify = FALSE) %>% bind_rows()
microbenchmark::microbenchmark(
Car_Value2(data = Market_data3),
times = 100
)
#> Unit: milliseconds
#>                            expr      min       lq     mean   median       uq      max neval
#> Car_Value2(data = Market_data3) 338.4615 341.7134 375.8769 397.7133 399.8733 412.5134   100
Created on 2019-03-10 by the reprex package (v0.2.1)
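If you prefer data.table (the other package suggested above), a rough, untested sketch of the same grouped computation could look like this; note that, like the original loop, it draws one value per row:
library(data.table)
Car_Value_dt <- function(data) {
  DT <- as.data.table(data)
  DT[, Final_value := ifelse(Car_purchased == 1,
                             rtruncnorm(.N, a = -10, b = 0, mean = max(Variable1), sd = 1),
                             rtruncnorm(.N, a = -10, b = 0, mean = Variable1, sd = 1)),
     by = Market]
  DT[]
}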
Related
I am trying to identify the most probable group that an observation belongs to, for several thousand large datasets. It is possible that some of the data is incorrectly classified and I am trying to work out the most likely "true" value. I have tried to use knn3 from the caret package but the predictions take too long to compute. In researching alternatives, I have come across the nn2 function from the RANN package, which performs a nearest-neighbour search that is significantly faster than knn3.
library(RANN)
library(tidyverse)
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
The result of the nn2 function is a list of two matrices: one of indices and one of distances. I want to use the indices matrix to work out the most likely grouping of each observation; however, it returns the row number of the observation and not its group. I need to replace this with the group it belongs to (in this case, the Species column).
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
I have removed the first column as the first nearest neighbour is always the observation itself.
matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
This code gives me the output I want, but is there a tidier way of creating this table and then calculating the most common response for each row? Speed of calculation is the key concern.
Your scaling can be a real bottleneck when you have more columns (tested on a 200 x 22216 gene expression matrix). My version might not seem that impressive on the iris dataset, but on the larger dataset I get 1.3 sec vs. 32.8 sec execution time.
Using tabulate instead of table gives an additional improvement, which is dwarfed, however, by the matrix scaling.
I used a custom scale function here, but using base::scale on a matrix would already be a major improvement.
I also addressed the issue raised by M. Papenberg of "self" not being considered the nearest neighbor by setting those to NA.
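As a point of reference, the base::scale route mentioned above would be a one-liner (not benchmarked here):
iris.scaled <- scale(data.matrix(iris[, 1:4])) # column-wise centering and scaling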
invisible(lapply(c("tidyverse", "matrixStats", "RANN", "microbenchmark", "compiler"),
require, character.only=TRUE))
enableJIT(3)
# faster column scaling (modified from https://www.r-bloggers.com/author/strictlystat/)
colScale <- function(x, center = TRUE, scale = TRUE, rows = NULL, cols = NULL) {
if (!is.null(rows) && !is.null(cols)) {x <- x[rows, cols, drop = FALSE]
} else if (!is.null(rows)) {x <- x[rows, , drop = FALSE]
} else if (!is.null(cols)) x <- x[, cols, drop = FALSE]
cm <- colMeans(x, na.rm = TRUE)
if (scale) csd <- matrixStats::colSds(x, center = cm, na.rm = TRUE) else
csd <- rep(1, length = length(cm))
if (!center) cm <- rep(0, length = length(cm))
x <- t((t(x) - cm) / csd)
return(x)
}
# your posted version (mostly):
oldv <- function(){
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
category_neighbours <- matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
class <- apply(category_neighbours, 1, function(x) {
x1 <- table(x)
names(x1)[which.max(x1)]})
cbind(iris, class)
}
## my version:
myv <- function(){
iris.scaled <- colScale(data.matrix(iris[, 1:(dim(iris)[2]-1)]))
iris.nn2 <- nn2(iris.scaled)
# set self neighbors to NA
iris.nn2$nn.idx[iris.nn2$nn.idx - seq_len(dim(iris.nn2$nn.idx)[1]) == 0] <- NA
# match up categories
category_neighbours <- matrix(iris$Species[iris.nn2$nn.idx[,]],
nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
# turn category_neighbours into numeric for tabulate
cn <- matrix(as.numeric(factor(category_neighbours, exclude=NULL)),
nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
cnl <- levels(factor(category_neighbours, exclude = NULL))
# tabulate frequencies and match up with factor levels
class <- apply(cn, 1, function(x) {
cnl[which.max(tabulate(x, nbins=length(cnl))[!is.na(cnl)])]})
cbind(iris, class)
}
microbenchmark(oldv(), myv(), times=100L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> oldv() 11.015986 11.679337 12.806252 12.064935 12.745082 33.89201 100 b
#> myv() 2.430544 2.551342 3.020262 2.612714 2.691179 22.41435 100 a
Usually, I find myself using a few summary functions or making my own computations to get some additional initial information from the data. For example, I wanted to see the count and percentage per variable, limited to variables with at most a given number of distinct values:
library(magrittr) # for the %>% and %T>% pipes used below
table_transposed <- function(vector){
merge(as.data.frame(table(vector, dnn="values")),
as.data.frame(round(prop.table(table(vector, dnn="values")),2)),
by="values",
all.x=TRUE) %>%
data.table::transpose(keep.names = "values",
make.names = names(.)[1]) %T>%
{.[,c("values")] <- c("Count", "Percentage")}
}
table_transposed_filter <- function(dataframe, max_number_categories) {
(lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
as.vector() %>%
{dataframe[,.]} %>%
lapply(table_transposed)
}
So, you give the dataframe and the threshold of distinct values per variable.
table_transposed_filter(mtcars, 10)
However, it's SUPER slow (maybe because of using merge() instead of left_join() from dplyr). Now, I'm trying to figure out an efficient, fast, and simple way to combine psych::describe(), Hmisc::describe(), others, and my own computations, for numeric and categorical variables (one descriptive function for each). Something like this (for numeric variables):
| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...
If I create this table mainly with sapply(), for example, is that better (more efficient, faster, simpler code) than actually learning to create an R package and developing it there?
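To make the idea concrete, here is a rough, untested sketch of the sapply() route for the numeric part (the statistics and names are just placeholders, not a final design):
numeric_summary <- function(dataframe) {
  num_cols <- Filter(is.numeric, dataframe)
  t(sapply(num_cols, function(x) c(
    mean = mean(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE),
    p25 = unname(quantile(x, 0.25, na.rm = TRUE)),
    median = median(x, na.rm = TRUE),
    p75 = unname(quantile(x, 0.75, na.rm = TRUE))
  )))
}
numeric_summary(mtcars)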
PS: I thought about posting this question on Meta Stack Exchange or Cross Validated, but neither seems to fit it.
Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little on larger data.
This makes sense as the most expensive operation you do is table - your version does it twice, my version does it once. I didn't profile the code, but my guess is table is the bottleneck by more than one order of magnitude on any sizeable data, so it's a waste to try to optimize any other parts of the code.
t_transp = function(x, digits = 2) {
tab = table(x)
prop_tab = prop.table(tab)
df = data.frame(values = c("Count", "Percentage"))
df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
row.names(df) = NULL
df
}
t_transp_filter = function(data, n_max, ...) {
lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}
Benchmarking:
microbenchmark::microbenchmark(
gregor = t_transp_filter(mtcars, n_max = 4),
OP = table_transposed_filter(mtcars, 4),
times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394 20 a
# OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048 20 b
set.seed(47)
df = as.data.frame(matrix(
c(sample(letters[1:5], size = 1e5 * 20, replace = T))
, ncol = 20))
microbenchmark::microbenchmark(
gregor = t_transp_filter(df, n_max = 5),
OP = table_transposed_filter(df, 5),
times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 59.5466 59.95545 63.6825 61.14075 67.2167 75.4270 20 a
# OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651 20 b
I have a large spatial dataset (12M rows). The geometries are points on a map. For each row in the dataset, I'd like to find all the points that are within 500 meters of that point.
In R, using sf, I've been trying to do this by looping in parallel through each row and running st_buffer and st_intersects, then saving the result as a list in a key-value format (the key being the origin point, the values being the neighbors).
The issue is that the dataset is too large. Even when parallelizing to upwards of 60 cores, the operation takes too long (>1 week and usually crashes).
What are the alternatives to this brute-force approach? Is it possible to build indexes using sf? Perhaps push the operation to an external database?
Reprex:
library(sf)
library(tidyverse)
library(parallel)
library(foreach)
# example data, convert to decimal:
nc <- st_read(system.file("shape/nc.shp", package="sf")) %>% st_transform(32618)
# expand the data a bit to make the example more interesting:
nc <- rbind(nc,nc,nc)
nc <- nc %>% mutate(Id = row_number())
## can run in parallel if desired:
# num_cores <- parallel::detectCores()-2
# cl <- makeSOCKcluster(num_cores)
# registerDoSNOW(cl)
# or just run in sequence:
registerDoSEQ()
neighbors <- foreach(ii = 1:nrow(nc)
, .verbose = FALSE
, .errorhandling = "pass") %dopar% {
l = 500 # 500 meters
# isolate the row as the origin point:
row_interest <- filter(nc, row_number()==ii)
# create the buffer:
buffer <- row_interest %>% st_buffer(dist = l)
# extract the row numbers of the neighbors
comps_idx <- suppressMessages(st_intersects(buffer, nc))[[1]]
# get all the neighbors:
comps <- nc %>% filter(row_number() %in% comps_idx)
# remove the geometry:
comps <- comps %>% st_set_geometry(NULL)
# flow control in case there are no neighbors:
if(nrow(comps)>0) {
comps$Origin_Key <- row_interest$Id
} else {
comps <- data_frame("lat" = NA_integer_,"lon" = NA_integer_, "bbl" = row_interest$bbl)
comps$Origin_Key <- row_interest$Id
}
return(comps)
}
closeAllConnections()
length(neighbors)==nrow(nc)
[1] TRUE
When working with sf objects, explicitly looping over features to perform
binary operations such as intersects is usually counterproductive (see also
How can I speed up spatial operations in `dplyr::mutate()`?)
An approach similar to yours (i.e., buffering and intersecting), but without
the explicit for loop works better.
Let's see how it performs on a reasonably big dataset of 50000 points:
library(sf)
library(spdep)
pts <- data.frame(x = runif(50000, 0, 100000),
y = runif(50000, 0, 100000))
pts <- sf::st_as_sf(pts, coords = c("x", "y"), remove = F)
pts_buf <- sf::st_buffer(pts, 5000)
coords <- sf::st_coordinates(pts)
microbenchmark::microbenchmark(
sf_int = {int <- sf::st_intersects(pts_buf, pts)},
spdep = {x <- spdep::dnearneigh(coords, 0, 5000)}
, times = 1)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> sf_int 21.56186 21.56186 21.56186 21.56186 21.56186 21.56186 1
#> spdep 108.89683 108.89683 108.89683 108.89683 108.89683 108.89683 1
You can see here that the st_intersects approach is 5 times faster than
the dnearneigh one.
Unfortunately, this is unlikely to solve your problem. Looking at execution
times for datasets of different sizes we get:
subs <- c(1000, 3000, 5000, 10000, 15000, 30000, 50000)
times <- NULL
for (sub in subs[1:7]) {
pts_sub <- pts[1:sub,]
buf_sub <- pts_buf[1:sub,]
t0 <- Sys.time()
int <- sf::st_intersects(buf_sub, pts_sub)
times <- cbind(times, as.numeric(difftime(Sys.time() , t0, units = "secs")))
}
plot(subs, times)
times <- as.numeric(times)
reg <- lm(times~subs+I(subs^2))
summary(reg)
#>
#> Call:
#> lm(formula = times ~ subs + I(subs^2))
#>
#> Residuals:
#> 1 2 3 4 5 6 7
#> -0.16680 -0.02686 0.03808 0.21431 0.10824 -0.23193 0.06496
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.429e-01 1.371e-01 1.772 0.151
#> subs -2.388e-05 1.717e-05 -1.391 0.237
#> I(subs^2) 8.986e-09 3.317e-10 27.087 1.1e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1908 on 4 degrees of freedom
#> Multiple R-squared: 0.9996, Adjusted R-squared: 0.9994
#> F-statistic: 5110 on 2 and 4 DF, p-value: 1.531e-07
Here, we see an almost perfect quadratic relationship between time and
number of points (as would be expected). On a 10M points subset, assuming
that the behaviour does not change, you would get:
predict(reg, newdata = data.frame(subs = 10E6))
#> 1
#> 898355.4
, which corresponds to about 10 days, assuming that the trend is constant
when further increasing the number of points (but the same would happen for
dnearneigh...)
My suggestion would be to "split" your points in chunks and then work on a
per-split basis.
You could for example order your points at the beginning along
the x-axis and then easily and quickly extract subsets of buffers and of points with which to compare them using data.table.
Clearly, the "points" buffer would need to be larger than that of "buffers" according
to the comparison distance. So, for example, if you make a subset of pts_buf with
centroids in [50000 - 55000], the corresponding subset of pts should include
points in the range [49500 - 55500].
This approach is easily parallelizable by assigning the different subsets to
different cores in a foreach or similar construct.
I do not even know if using spatial objects/operations is beneficial here, since once we have the coordinates all that is needed is computing and subsetting Euclidean distances: I suspect that a carefully coded brute-force data.table-based approach could also be a feasible solution.
HTH!
UPDATE
In the end, I decided to give it a go and see how much speed we could gain from this kind of approach. Here is a possible implementation:
points_in_distance_parallel <- function(in_pts,
maxdist,
ncuts = 10) {
require(doParallel)
require(foreach)
require(data.table)
require(sf)
# convert points to data.table and create a unique identifier
pts <- data.table(in_pts)
pts <- pts[, or_id := 1:dim(in_pts)[1]]
# divide the extent in quadrants in ncuts*ncuts quadrants and assign each
# point to a quadrant, then create the index over "xcut"
range_x <- range(pts$x)
limits_x <-(range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
range_y <- range(pts$y)
limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
pts[, `:=`(xcut = as.integer(cut(x, ncuts, labels = 1:ncuts)),
ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))] %>%
setkey(xcut, ycut)
results <- list()
cl <- parallel::makeCluster(parallel::detectCores() - 2, type =
ifelse(.Platform$OS.type != "windows", "FORK",
"PSOCK"))
doParallel::registerDoParallel(cl)
# start cycling over quadrants
out <- foreach(cutx = seq_len(ncuts), .packages = c("sf", "data.table")) %dopar% {
count <- 0
# get the points included in a x-slice extended by `dist`, and build
# an index over y
min_x_comp <- ifelse(cutx == 1, limits_x[cutx], (limits_x[cutx] - maxdist))
max_x_comp <- ifelse(cutx == ncuts,
limits_x[cutx + 1],
(limits_x[cutx + 1] + maxdist))
subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
setkey(y)
for (cuty in seq_len(ncuts)) {
count <- count + 1
# subset over subpts_x to find the final set of points needed for the
# comparisons
min_y_comp <- ifelse(cuty == 1,
limits_y[cuty],
(limits_y[cuty] - maxdist))
max_y_comp <- ifelse(cuty == ncuts,
limits_y[cuty + 1],
(limits_y[cuty + 1] + maxdist))
subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]
# subset over subpts_comp to get the points included in a x/y chunk,
# which "neighbours" we want to find. Then buffer them.
subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
sf::st_as_sf() %>%
st_buffer(maxdist)
# retransform to sf since the data.table lost the geometric attributes
subpts_comp <- sf::st_as_sf(subpts_comp)
# compute the intersection and save results in a element of "results".
# For each point, save its "or_id" and the "or_ids" of the points within "dist"
inters <- sf::st_intersects(subpts_buf, subpts_comp)
# save results
results[[count]] <- data.table(
id = subpts_buf$or_id,
int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))
}
return(data.table::rbindlist(results))
}
parallel::stopCluster(cl)
data.table::rbindlist(out)
}
The function takes as input a points sf object, a target distance, and a number of "cuts" used to divide the extent into quadrants, and returns a data frame in which, for each original point, the "ids" of the points within maxdist are reported in the int_ids list column.
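For example, a call on the 50000 uniformly distributed test points built earlier might look like this (illustrative only; the parameter values are just examples):
out <- points_in_distance_parallel(pts, maxdist = 5000, ncuts = 10)
head(out)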
On a test dataset with a varying number of uniformly distributed points and two values of maxdist, I got these kinds of results (the "parallel" run is done using 6 cores).
So, here we get a 5-6X speed improvement already on the "serial" implementation, and another 5X thanks to parallelization over 6 cores.
Although the timings shown here are merely indicative, and related to the particular test dataset we built (on a less uniformly distributed dataset I would expect a lower speed improvement), I think this is quite good.
HTH!
PS: a more thorough analysis can be found here:
https://lbusettspatialr.blogspot.it/2018/02/speeding-up-spatial-analyses-by.html
I have two alternatives: one that seems faster, and one that is not. The faster method may not be amenable to parallelization, unfortunately, so it may not help.
library(sf)
nc <- st_transform(st_read(system.file("shape/nc.shp", package="sf")), 32618)
# create points
pts <- st_centroid(nc)
dis <- 50000
result <- list()
Your approach
system.time(
for (i in 1:nrow(pts)) {
b <- st_buffer(pts[i,], dist = dis)
result[[i]] <- st_intersects(b, nc)[[1]]
}
)
Slower alternative
system.time(
for (i in 1:nrow(pts)) {
b <- as.vector(st_distance(pts[i,], pts))
result[[i]] <- which(b <= dis)
}
)
For smaller datasets, without looping:
x <- st_distance(pts)
res <- apply(x, 1, function(i) which(i < dis))
Faster alternative (not obvious how to do it in parallel), and perhaps an unfair comparison, as we do not do the looping ourselves:
library(spdep)
pts2 <- st_coordinates(pts)
system.time(x <- dnearneigh(pts2, 0, dis))
I would first get a list with the indices that indicate the neighbors, and extract attributes after that (that should be fast)
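As a rough, untested sketch of that last step (the attribute columns are just examples), once x holds the neighbour index list you could do something like:
# x is the neighbour index list (one integer vector per point); indices refer to
# rows of pts, which line up with rows of nc here
neighbor_attributes <- lapply(x, function(idx) {
  st_set_geometry(nc[idx, c("NAME", "AREA")], NULL) # drop geometry, keep attributes
})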
Working off of RobertH's answer, it is a bit faster to extract coordinates using sf::st_coordinates in this particular example.
library(sf)
library(spdep)
nc <- st_transform(st_read(system.file("shape/nc.shp", package="sf")), 32618)
# create points
pts <- st_centroid(nc)
dis <- 50000
# quickest solution:
x <- spdep::dnearneigh(sf::st_coordinates(pts), 0, dis)
microbenchmarking:
my_method <- function(pts) {
result <- list()
for (i in 1:nrow(pts)) {
b <- st_buffer(pts[i,], dist = dis)
result[[i]] <- st_intersects(b, nc)[[1]]
}
result
}
library(microbenchmark)
microbenchmark(
my_method(pts),
dnearneigh(as(pts, 'Spatial'), 0, dis),
dnearneigh(st_coordinates(pts), 0, dis)
)
Unit: microseconds
expr min lq mean median uq max neval
my_method(pts) 422807.146 427434.3450 435974.4320 429862.8705 434968.3975 596832.271 100
dnearneigh(as(pts, "Spatial"), 0, dis) 3727.221 3939.8540 4155.3094 4112.8200 4221.9525 7592.739 100
dnearneigh(st_coordinates(pts), 0, dis) 394.323 409.5275 447.1614 430.4285 484.0335 611.970 100
checking equivalence:
x <- dnearneigh(as(pts, 'Spatial'), 0, dis)
y <- dnearneigh(st_coordinates(pts), 0, dis)
all.equal(x,y, check.attributes = F)
[1] TRUE
I'm trying to divide each row of a dataframe by a number stored in a second mapping dataframe.
for(g in rownames(data_table)){
print(g)
data_table[g,] <- data_table[g,]/mapping[g,2]
}
However, this is incredibly slow, each row takes almost 1-2 seconds to run. I know iteration is usually not the best way to do things in R, but I don't know how else to do it. Is there any way I can speed up the runtime?
Try this :
sweep(data_table, 1, mapping[[2]], "/")
In terms of speed, here is a benchmark of the possibilities, including your version, using the iris dataset:
test <- iris[1:4] # assumed setup: `test` holds the numeric columns of iris
microbenchmark::microbenchmark(
A = {
for(g in rownames(test)){
# print(g)
test[g,] <- test[g,]/test[g,2]
}
},
B = sweep(test, 1, test[[2]], "/"),
C = test / test[[2]],
times = 100
)
#Unit: microseconds
#expr min lq mean median uq max neval
#A 82374.693 83722.023 101688.1254 84582.052 147280.057 157507.892 100
#B 453.652 484.393 514.4094 513.850 539.480 623.688 100
#C 404.506 423.794 456.0063 446.101 470.675 729.205 100
You can vectorize this operation if the two data frames have the same number of rows:
dt <- data.frame(a = rnorm(100), b = rnorm(100))
mapping <- data.frame(x = rnorm(100), y = rnorm(100))
dt / mapping[,2]
I have to apply a function to every row of a large table (~ 2M rows). I used to use plyr for that, but the table is growing continuously and the current solution starts to approach unacceptable runtimes. I thought I could just switch to data.table or dplyr and all is fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)
dt = data.table("ID_1" = c(1:1000), # unique ID
"ID_2" = ceiling(runif(1000, 0, 100)), # other ID, duplicates possible
"group" = sample(LETTERS[1:10], 1000, replace = T),
"value" = runif(1000),
"ballast1" = "X", # keeps unchanged in derive_dt
"ballast2" = "Y", # keeps unchanged in derive_dt
"ballast3" = "Z", # keeps unchanged in derive_dt
"value_derived" = 0)
setkey(dt, ID_1)
extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
if(tmprow$group %in% extra_arg){return(NULL)} # exclude entries occurring in extra_arg
group_index = which(LETTERS == tmprow$group)
group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
new_group = LETTERS[group_index]
if(new_group %in% unique(dt$group)){return(NULL)}
new_value = runif(1)
row_derived = tmprow
row_derived$group = new_group
row_derived$value = runif(1)
row_derived$value_derived = 1
return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])
set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))
set.seed(42)
system.time(results_plyr <- x <- ddply(dt, .variable = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer I could reduce runtimes for data.table and dplyr to ~0.6 and 0.8 of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's cool, but I still expected a higher performance increase from these packages... any further suggestions?
The issue is that the assignment you use has a very high overhead in data.table; plyr converts the row to a data.frame before passing it to your derive function, and thus avoids that overhead:
library(microbenchmark)
df = as.data.frame(dt)
microbenchmark({dt$group = dt$group}, {df$group = df$group})
#Unit: microseconds
# expr min lq mean median uq max neval
# { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406 100
# { df$group = df$group } 26.045 45.244 64.13909 61.6045 79.635 157.266 100
I can't suggest a good fix, since you say your example is not the real problem, so there is no point in solving it better. Some basic suggestions to look at are vectorizing the code, and using := or set instead (depending on what exactly you end up doing).
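To illustrate the last point with a minimal sketch (not tied to your real problem): := and set() modify a data.table by reference and avoid the copy that the $<- assignment above triggers:
library(data.table)
library(microbenchmark)
dt2 <- data.table(group = sample(LETTERS, 1e5, replace = TRUE))
microbenchmark(
  dollar_assign = {dt2$group <- dt2$group},                 # copies, as shown above
  colon_equal   = dt2[, group := group],                    # modifies by reference
  set_call      = set(dt2, j = "group", value = dt2$group)  # by reference, minimal overhead
)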