I hope this message finds you well.
I have recently written some R code that works perfectly, but unfortunately it is very time-consuming on big data since it is based on a loop. Now I wonder how I can rewrite this code so that the loop is no longer needed. I would really appreciate it if anyone could help.
for (k in 1:nrow(migration)) {
  migration$district[k] <-
    sample(0:30, size = 1, replace = TRUE, prob = migration[k, 7:37])
}
It seems that in this case, using apply might be more efficient than a loop. An example running on a test dataset is:
# Make up some test data
n <- 10000
migration <- data.frame(lapply(1:37, function(x) runif(n)))
names(migration) <- c("district", paste0("col", 2:37))
# Code from question
system.time(
  for (k in 1:nrow(migration)) {
    migration$district[k] <-
      sample(0:30, size = 1, replace = TRUE, prob = migration[k, 7:37])
  }
)
# user system elapsed
# 1.99 0.01 2.00
# Alternative approach
system.time({
  migration$district <- apply(migration[7:37], 1, function(x) {
    sample(0:30, size = 1, replace = TRUE, prob = x)
  })
})
# user system elapsed
# 0.06 0.00 0.06
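If even the apply version is too slow, one further sketch is to invert the row-wise cumulative distribution directly; this is not part of the answer above, and it assumes the probabilities in columns 7:37 are non-negative (they need not sum to 1, since each row is normalized first):
p <- as.matrix(migration[7:37])
p <- p / rowSums(p)                 # normalize each row to sum to 1
cum <- t(apply(p, 1, cumsum))       # row-wise cumulative probabilities
u <- runif(nrow(p))                 # one uniform draw per row
# Count how many cumulative thresholds each draw exceeds, giving a value in 0:30;
# pmin guards against a rare floating-point edge case at the upper end.
migration$district <- pmin(rowSums(u > cum), ncol(p) - 1)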
I am a beginner with data.table and I am trying to do a really simple operation, which with base data frames would look like this:
percentages[percentages<0] = abs(percentages[percentages<0])
The data looks like this:
percentages
p1 p2 p3
1: 0.689 0.206 0.106
The solution for data.table that I have found so far to just get the data is:
percentages[,which(percentages<0),with=FALSE]
but it's more complicated than with a data.frame... there should be something better, but I can't come up with anything. Any suggestions?
A general option may be to use set. It still involves a for loop, but it is more efficient because we loop over the columns and avoid building a full logical matrix with df1 < 0 (for huge datasets, that would consume some memory). Using set is efficient because, as the documentation says, the overhead of [.data.table is avoided:
library(data.table)
for (j in seq_along(df1)) {
  set(df1, i = which(df1[[j]] < 0), j = j, value = abs(df1[[j]][df1[[j]] < 0]))
}
Since the OP wants a one-liner, for the single-row example shown:
df1[, lapply(.SD, function(x) replace(x, x < 0, abs(x)))]
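If the goal is to update the original data.table by reference rather than return a new one, a possible sketch uses := over all the existing columns (same replace logic as above):
# Assign back into df1 by reference, column by column
df1[, (names(df1)) := lapply(.SD, function(x) replace(x, x < 0, abs(x)))]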
Benchmarks
Based on system.time with a slightly bigger dataset:
set.seed(42)
dfN <- data.frame(p1 = rnorm(1e7), p2 = rnorm(1e7), p3 = rnorm(1e7), p4 = rnorm(1e7))
dfN1 <- copy(dfN)
setDT(dfN1)
system.time({
i1 <- dfN < 0
dfN[i1] <- abs(dfN[i1])
})
# user system elapsed
# 1.63 0.50 2.12
system.time({
  for (j in seq_along(dfN1)) {
    set(dfN1, i = which(dfN1[[j]] < 0), j = j, value = abs(dfN1[[j]][dfN1[[j]] < 0]))
  }
})
# user system elapsed
# 0.91 0.08 0.98
As akrun posted above, the one-liner reply is
df1[, lapply(.SD, function(x) replace(x, x < 0, abs(x)))]
However, this is not exactly what I was looking for, since data.table seems much more syntactically complicated than data.frame (at least in this example):
we are basically doing the vectorization ourselves in data.table (via the lapply), while with a data.frame it happens automatically.
I am comparing common "tidying" operations in dplyr and in "plain R" (see the output here and source here to see what I mean).
I have a hard time finding a "canonical" and concise way to select columns using only variable names (by canonical, I mean pure plain R that is easily understandable for anyone with a minimal understanding of R, so no "voodoo tricks").
Example:
## subset: all columns from "var_1" to "var_2" excluding "var_3"
## dplyr:
table %>% select(var_1:var_2, -var_3)
## plain R:
r <- sapply(c("var_1", "var_2", "var_3"), function(x) which(names(table) == x))
table[, setdiff(r[1]:r[2], r[3])]
Any suggestions to improve the plain R syntax?
Edit
I implemented some of the suggestions and compared performance across the different syntaxes, and noticed that the use of match and subset leads to a surprising drop in performance:
# plain R, v1
system.time(for (i in 1:100) {
r <- sapply(c("size", "country"), function(x) which(names(cran_df)==x))
cran_df[,r[1]:r[2]] } )
## user system elapsed
## 0.006 0.000 0.007
# plain R, using match
system.time(for (i in 1:100) {
r <- match(c("size", "country"), names(cran_df))
cran_df[,r[1]:r[2]] %>% head(n=3) } )
## user system elapsed
## 0.056 0.028 0.084
# plain R, using match and subset
system.time(for (i in 1:100) {
r <- match(c("size", "country"), names(cran_df))
subset(cran_df, select=r[1]:r[2]) %>% head(n=3) } )
## user system elapsed
## 11.556 1.057 12.640
# dplyr
system.time(for (i in 1:100) select(cran_tbl_df,size:country))
## user system elapsed
## 0.034 0.000 0.034
Looks like the implementation of subset is sub-optimal...
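Putting the match-based lookup together with the exclusion from the original example gives a compact plain-R sketch (assuming the column names are unique and that table is the data frame from the question):
# Locate the three columns by name, then keep the range minus var_3
r <- match(c("var_1", "var_2", "var_3"), names(table))
table[, setdiff(r[1]:r[2], r[3]), drop = FALSE]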
You can use the built-in subset function, which takes a select argument that follows similar (though not identical) syntax to dplyr::select. Note that dropping columns has to be done in a second step:
t1 <- subset(table, select = var_1:var_2)
t2 <- subset(t1, select = -var_3)
or:
subset(subset(table, select = var_1:var_2), select = -var_3)
For example:
subset(subset(mtcars, select = c(mpg:wt)), select = -hp)
I'm working with a fairly large graph in igraph R (~5 million vertices, 40 million edges).
I want to create a new attribute for each vertex which is the average value of an attribute of each of their connections.
For example:
Person A has an X value of 10; they're connected to persons B, C and D, who have X values of 20, 50 and 65 respectively. I want to assign a new value of 45 to person A (the average of 20, 50 and 65).
I'm currently using the following method (from another Stack Overflow answer), running on 10 cores:
library(igraph)
library(plyr)

adjcency_list <- get.adjlist(g)
avg_contact_val <- ldply(adjcency_list, function(neis) {
  mean(V(g)[neis]$X, na.rm = TRUE)
}, .parallel = TRUE)
V(g)$avg_contact_val <- avg_contact_val
This works exactly as I need it to, but it doesn't scale very well and would take a (very!) long time to do on the full graph.
Is there a more efficient method of doing this?
Could this fall under a PageRank-type algorithm, using the X value instead of degree?
Would it be possible to use a GPU somehow?
Would this be quicker in igraph Python?
EDIT:
Here's some sample data and an attempt at the approaches suggested:
set.seed(12345)
g <- erdos.renyi.game(10000, .0005)
V(g)$NAME <- c(1:10000)
V(g)$X <- round(runif(10000,0,30))
adjcency_list <- get.adjlist(g)
sub_ages <- data.frame(NAME = V(g)$NAME, X = V(g)$X)
dta.table <- data.table(sub_ages, key = "NAME")
DATA TABLE APPROACH
system.time(
  avg_contact_ages <- ldply(adjcency_list, function(neis) {
    mean(dta.table[neis, mean(X)], na.rm = TRUE)
  }, .progress = "tk")
)
user system elapsed
38.87 1.50 40.37
DATA FRAME APPROACH
sub_ages2 <- data.frame(row.names = V(g)$NAME, X = V(g)$X)
system.time(
  avg_contact_ages <- ldply(adjcency_list, function(neis) {
    mean(sub_ages2[neis, "X"], na.rm = TRUE)
  }, .progress = "tk")
)
user system elapsed
8.69 1.28 9.99
ORIGINAL APPROACH
system.time(
  avg_contact_ages <- ldply(adjcency_list, function(neis) {
    mean(V(g)[neis]$X, na.rm = TRUE)
  }, .progress = "tk")
)
user system elapsed
16.74 2.35 19.14
Shadow's approach
system.time(
  avg_nei <- ldply(V(g), function(vert) {
    mean(get.vertex.attribute(g, "X", index = neighbors(g, vert)), na.rm = TRUE)
  }, .progress = "tk")
)
user system elapsed
8.80 1.42 10.23
Is there a more efficient method of doing this?
I think so. Do not call V(g) all the time, but put the attribute in a vector, and index it. If you include some example data, then I'll also include some code.
Could this fall under a PageRank-type algorithm, using the X value instead of degree?
No; PageRank is recursive, so your rank depends on the whole network, not only on the scores of your neighbors.
Would it be possible to use a GPU somehow?
Not with igraph. But you can certainly make this fast enough without a GPU, so I would not go that way.
Would this be quicker in igraph Python?
It depends on how you write it. If you write it the right way in R, it will not be faster in Python either, IMO.
EDIT:
I left out the progress bar, because that is actually the slowest part.
Fastest solution above with data frame
system.time({
sub_ages2 <- data.frame(row.names = V(g)$NAME, X = V(g)$X);
avg_contact_ages <- ldply(adjcency_list, function(neis) {
mean(sub_ages2[neis, "X"], na.rm = T)
})
})
# user system elapsed
# 0.368 0.020 0.386
Slightly faster with sapply
system.time({
sub_ages2 <- data.frame(row.names = V(g)$NAME, X = V(g)$X);
avg_contact_ages <- sapply(adjcency_list, function(neis) {
mean(sub_ages2[neis, "X"], na.rm = TRUE)
})
})
# user system elapsed
# 0.340 0.017 0.356
Using factors
system.time({
adj_vec <- unlist(adjcency_list)
adj_fac <- factor(rep(seq_along(adjcency_list), sapply(adjcency_list, length)),
                  levels = seq_len(vcount(g)))
avg_contact_ages <- tapply(V(g)$X[adj_vec], adj_fac, mean, na.rm=TRUE)
})
# user system elapsed
# 0.131 0.008 0.138
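The same quantity can also be sketched as a single sparse matrix product, assuming an unweighted graph with no NA values in X (isolated vertices come out as NaN because of the division by a zero degree):
library(Matrix)
# Sparse adjacency matrix times the attribute vector gives neighbour sums;
# dividing by the vertex degree turns the sums into means.
A <- get.adjacency(g, sparse = TRUE)
avg_contact_ages_mat <- as.vector(A %*% V(g)$X) / degree(g)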
If you need more speedup, you'll probably need to go to C/C++; Rcpp would be a relatively easy route.
The function get.vertex.attribute adds some speed, but for the size of your graph, that probably won't be enough. Anyway, here's my slightly faster version (in my benchmark tests on much smaller graphs, it's about 2.5 times faster than your version):
avg_nei <- ldply(V(g), function(vert) {
  mean(get.vertex.attribute(g, "X", index = neighbors(g, vert)), na.rm = TRUE)
}, .parallel = TRUE)
V(g)$avg_contact_val <- avg_nei
I've got a column in a CSV file that looks like c("", "1", "1 1e-3") (i.e. whitespace-separated). I'm trying to run through all the values, taking the sum() where there is at least one value and returning NA otherwise.
My code currently does something like this:
x <- c("", "1", "1 2 3")
x2 <- as.numeric(rep(NA, length(x)))
for (i in 1:length(x)) {
  si <- scan(text = x[[i]], quiet = TRUE)
  if (length(si) > 0)
    x2[[i]] <- sum(si)
}
I'm struggling to make this fast; x is really a set of columns from a CSV file containing a few hundred thousand rows, and I think it should be possible to do this quickly in R.
(These are thinned samples from the posterior of a reversible-jump MCMC algorithm, hence the need to combine multiple values as the dimensionality changes throughout the file while still ending up with useful columns.)
Building on the idea from @Chase, but handling NA and also avoiding a name for the helper function:
unlist(lapply(strsplit(x, " "),
              function(v)
                if (length(v) > 0)
                  sum(as.numeric(v))
                else
                  NA))
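A closely related variation (a sketch, not from the original answers) uses vapply to enforce one numeric result per element, which removes the need for unlist:
# vapply guarantees a numeric(1) result for every input element
vapply(strsplit(x, " "),
       function(v) if (length(v) > 0) sum(as.numeric(v)) else NA_real_,
       numeric(1))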
This seems to perform a bit faster and may work for you.
# Define a helper function
f <- function(x) sum(as.numeric(x))
unlist(lapply(strsplit(x, " "), f))
#-----
[1] 0 1 6
This will return a zero instead of NA, but maybe that isn't a deal breaker for you?
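If the NA really is needed, one hedged tweak (my own sketch, using a hypothetical result name res) is to blank out the entries that came from empty strings afterwards:
res <- unlist(lapply(strsplit(x, " "), f))
res[!nzchar(x)] <- NA   # restore NA where the original string was empty
res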
Let's see how this scales to a larger problem:
#set up variables
x3 <- rep(x, 1e5)
x4 <- as.numeric(rep(NA,length(x3)))
# Initial approach
system.time(for (i in 1:length(x3)) {
  si <- scan(text = x3[[i]], quiet = TRUE)
  if (length(si) > 0)
    x4[[i]] <- sum(si)
})
#-----
user system elapsed
30.5 0.0 30.5
#New approach:
system.time(unlist(lapply((strsplit(x3, " ")), f)))
#-----
user system elapsed
0.82 0.01 0.84
I have a matrix mat and would like to calculate the mean of the columns based on a grouping variable gp.
mat<-embed(1:5000,1461)
gp<-c(rep(1:365,each=4),366)
To do this, I use the following:
colavg<-t(aggregate(t(mat),list(gp),mean))
But it takes much longer than I expect.
Any suggestions on making the code run faster?
Here is a fast approach; I have commented the code.
system.time({
# create a list of column indices per group
gp.list <- split(seq_len(ncol(mat)), gp)
# for each group, compute the row means
means.list <- lapply(gp.list, function(cols)rowMeans(mat[,cols, drop = FALSE]))
# paste everything together
colavg <- do.call(cbind, means.list)
})
# user system elapsed
# 0.08 0.00 0.08
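The same idea can also be sketched with base R's rowsum(), which computes the per-group sums in one call (this relies on rowsum() and table() ordering the groups identically, which holds here since gp is numeric):
# Group sums of the columns via rowsum() on the transpose, divided by the
# group sizes to get means; transpose back to match colavg's layout.
colavg_rs <- t(rowsum(t(mat), gp) / as.vector(table(gp)))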
You could use an apply function, for example from the excellent plyr package:
# Create data
mat<-embed(1:5000,1461)
gp<-c(rep(1:365,each=4),366)
# Your code
system.time(colavg<-t(aggregate(t(mat),list(gp),mean)))
library(plyr)
# Put all data in a data frame
df <- data.frame(t(mat))
df$gp <- gp
# Using an apply function
system.time(colavg2 <- t(daply(df, .(gp), colMeans)))
Output:
> # Your code
> system.time(colavg<-t(aggregate(t(mat),list(gp),mean)))
user system elapsed
134.21 1.64 139.00
> # Using an apply function
> system.time(colavg2 <- t(daply(df, .(gp), colMeans)))
user system elapsed
52.78 0.06 53.23