R: Speed up for loop

This is my R script with three nested for loops. It takes more than 2 minutes to finish 1 out of 2000 rounds of the loop. How can I speed this up?
col<-NULL
row<-NULL
rep<-ncol(dat)-2
dist<-NULL
c1=3
for (i in 1:rep){
  c2=3
  for(j in 1:rep){
    r=1
    for (k in 1:nrow(dat)){
      p<-(dat[r,c1]-dat[r,c2])^2
      row<-rbind(row, p)
      r=r+1
    }
    row<-sqrt(sum(row))
    row1<-(1/(2*length(unique(dat[,1]))))*row
    col<-cbind(col, row1)
    c2=c2+1
    row<-NULL
  }
  dist<-rbind(dist,col)
  col<-NULL
  c1=c1+1
}
EDIT:
> head(dat)
mark alle G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 G14 G15 G16 G17 G18 G19 G20 G21 G22 G23 G24
1 M1 228 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.0 0.5 0 0
2 M1 234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.5 0.5 1 1
3 M1 232 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0 0
4 M1 240 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0 0
5 M1 230 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.5 0.0 0 0
6 M1 238 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0 0

I don't know the Modified Rogers genetic distance, but it looks like the Euclidean distance multiplied by 1/(2*length(unique(dat$mark))):
f <- 1/(2*length(unique(dat$mark)))
d <- f*dist(t(dat[, -c(1, 2)]), method="euclidean")

The biggest thing you can do to speed up the loop is to preallocate the vectors and matrices before the loops. Then, instead of using cbind() and rbind(), add the results to the vectors/matrices like so:
# Was: row<-rbind(row, p)
row[k] <- p
# Was: col<-cbind(col, row1)
col[j] <- row1
# Was: dist<-rbind(dist,col)
dist[i, ] <- col
After that, you can explore ways to vectorize the operation or, better yet, see if a function already exists to perform this task (or if the task is based on something for which a function exists). Further, anything that does not depend on the loop (such as the factor 1/(2*length(unique(dat[,1])))) should be moved out of the loop. Otherwise you are just recalculating the same value over and over, which negatively affects performance.
The key with loops is to avoid rbind() and cbind(): preallocating the vectors and matrices before the loop provides a large performance boost.
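As a minimal sketch (not the original poster's final code), here is the same triple loop with everything preallocated and the constant factor hoisted out of the loop, assuming dat has the layout shown above (markers in column 1, genotype columns starting at column 3):
rep  <- ncol(dat) - 2
f    <- 1/(2*length(unique(dat[,1])))             # constant factor, computed once
row  <- numeric(nrow(dat))                        # preallocated inner vector
dist <- matrix(NA_real_, nrow = rep, ncol = rep)  # preallocated result
for (i in 1:rep) {
  for (j in 1:rep) {
    for (k in 1:nrow(dat)) {
      row[k] <- (dat[k, i + 2] - dat[k, j + 2])^2   # c1 = i + 2, c2 = j + 2
    }
    dist[i, j] <- f * sqrt(sum(row))
  }
}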

Although a similar function already exists, I tried my own approach.
I removed one complete for loop and the rbind and cbind calls.
Now it takes only 124 seconds to build the full 1014 x 1014 matrix, compared with the 2 minutes it previously took for just one round of that matrix (i.e. one 1 x 1014 row).
dat<-read.table("alreq1.txt", sep="\t", header=T)
col<-NULL
row<-NULL
rep<-ncol(dat)-2
dist<-NULL
dist<-data.frame(matrix(NA, nrow = rep, ncol = rep))
m<-1/sqrt(2*length(unique(dat[,1])))
c1=3
for (i in 1:rep){
  c2=3
  for(j in 1:rep){
    p<-na.omit(dat[,c1]-dat[,c2])^2
    row<-sum(p)
    row<-sqrt(row)*m
    col[j] <- row
    c2=c2+1
    row<-NULL
    p<-NULL
  }
  dist[i,] <- col
  c1=c1+1
  col<-NULL
}
Hopefully this code can still be improved.
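For reference, the remaining double loop can also be replaced by a single dist() call in the spirit of the first answer; this is a sketch assuming the same layout of dat and the 1/sqrt(2*...) factor used above (note that dist() handles missing values differently from na.omit(), so results can differ if there are NAs):
m <- 1/sqrt(2*length(unique(dat[,1])))
D <- m * as.matrix(dist(t(dat[, -c(1, 2)]), method = "euclidean"))  # rep x rep matrix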

Related

Replace a sequence in data frame column

I have a data frame in R that looks somewhat like this:
A | B
0 0
1 0
0 0
0 0
0 1
0 1
1 0
1 0
1 0
I now want to replace all sequences of more than one "1" in the columns so that only the first "1" is kept and the others are replaced by "0", so that the result looks like this
A | B
0 0
1 0
0 0
0 0
0 1
0 0
1 0
0 0
0 0
I hope you understood what I meant (English is not my mother tongue and especially the R "vocabulary" is a bit hard for me, which is probably why I couldn't find a solution through googling). Thank you in advance!
Try this solution:
Input data
df <- data.frame(
  A = c(1,0,0,0,0,0,1,1,1,0),
  B = c(1,1,0,1,0,0,1,1,0,0))
f <- function(X)
{
  return(as.numeric(diff(c(0, X)) > 0))
}
Your output
data.frame(lapply(df,f))
A B
1 1 1
2 0 0
3 0 0
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 0 0
10 0 0
You can use ave and create groups based on the difference of your values to capture the consecutive 1s and 0s as different groups, and replace duplicates with 0, i.e.
df[] <- lapply(df, function(i)ave(i, cumsum(c(1, diff(i) != 0)),
FUN = function(i) replace(i, duplicated(i), 0)))
which gives,
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
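To see the grouping, you can inspect the cumsum(c(1, diff(i) != 0)) part on a single column, e.g. column A from the question (values written out by hand):
A <- c(0,1,0,0,0,0,1,1,1)
cumsum(c(1, diff(A) != 0))
# 1 2 3 3 3 3 4 4 4   (each run of equal values gets its own group id)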
Here's a simple one line answer:
> df * rbind(c(0,0), sapply(df, diff))
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
This takes advantage of the fact that all unwanted 1's in the original data will become 0's with the diff function.
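A quick check on one column makes this concrete (column A from the question, with the values written out by hand):
A <- c(0,1,0,0,0,0,1,1,1)
c(0, diff(A))       # 0 1 -1 0 0 0 1 0 0 -- repeats within a run of 1s become 0
A * c(0, diff(A))   # 0 1  0 0 0 0 1 0 0 -- only the first 1 of each run survives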
Here is an option with rleid
library(data.table)
df[] <- lapply(df, function(x) +(x==1 & !ave(x, rleid(x), FUN = duplicated)))
df
# A B
#1 0 0
#2 1 0
#3 0 0
#4 0 0
#5 0 1
#6 0 0
#7 1 0
#8 0 0
#9 0 0
Here's a more functional approach. There are shorter answers here, but it's good to know a possible implementation under the hood:
# helper function
make_zero <- function(val)
{
  get_index <- c()
  for (i in seq(val))
  {
    if (val[i] == 1) get_index <- c(get_index, i)
    else get_index <- c()                  # a non-1 breaks the run, so reset the indices
    if (all(diff(get_index) == 1))         # indices form one consecutive run
    {
      val[get_index[-1]] <- 0              # keep the first 1 of the run, set the rest to 0
    }
  }
  return(val)
}
df <- sapply(df, make_zero)
df
A B
[1,] 0 0
[2,] 1 0
[3,] 0 0
[4,] 0 0
[5,] 0 1
[6,] 0 0
[7,] 1 0
[8,] 0 0
[9,] 0 0
Explanation:
1. We save the indexes of consecutive 1s in get_index.
2. Next, we check that the differences between those indexes are all 1 (i.e. the 1s are consecutive).
3. If so, we keep the first value of the run and set the rest to 0.

Transformations on data.table seem to take longer than on data.frame

I have a big data frame (mydata) (5.2 GB: 6,000 rows, 230,000 columns) from which I need to subset the columns whose names match the values in column Name of the annotation file (annot), then convert the result into a matrix. I read that data.tables are supposed to be much faster, so I am trying to convert my script to handle mydata as a data.table instead of a data.frame.
So far I don't have the impression that processing is any faster when I convert the original data to a data.table, and I am trying to understand why.
Some example data:
mydata=read.table(text="IID A B E G H W Z D N K
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 2 0 2 0 0 0 1 1
7 0 2 0 1 0 0 0 0 0 0
8 0 0 2 0 0 0 0 0 0 0
9 0 0 0 0 2 0 0 0 1 0
10 1 0 0 0 0 0 2 0 0 2
11 0 0 0 0 0 0 0 0 0 0
12 0 0 0 1 0 0 0 0 0 0
13 2 0 0 0 0 0 0 0 2 1
14 0 0 0 0 0 0 2 0 0 0
15 1 0 0 0 0 0 0 0 0 0
16 0 0 0 1 0 0 0 0 0 1
17 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0",h=T)
annot=read.table(text="Name Gene
A Gene1
B Gene2
E Gene3
G Gene4
H Gene5
W Gene6
Z Gene7
D Gene8
N Gene9
K Gene10",stringsAsFactors=F,h=T)
genes = c("Gene2","Gene4","Gene9")
Sample code:
#as DATA FRAME
start <- proc.time()
annot1 = annot[which(annot$Gene %in% genes),]
mydata1=mydata[,c(1,which(colnames(mydata) %in% annot1$Name))]
mydata1=mydata1[order(mydata1$IID),]
genomxwork=as.matrix(mydata1[,2:dim(mydata1)[2]])
df_time <- proc.time() - start
df_time
> user system elapsed
> 0.00 0.00 0.24
Same code but this time mydata is transformed into data.table:
#as DATA TABLE
mydataDT=as.data.table(mydata)
annotDT=as.data.table(annot)
start <- proc.time()
annotDT1 = annotDT[which(Gene %in% genes),,]
mydataDT1=mydataDT[,c(1,which(colnames(mydataDT) %in% annot1$Name)),with=F]
mydataDT1=mydataDT1[order(IID),,]
genomxworkDT=as.matrix(mydataDT1[,2:dim(mydataDT1)[2]])
dt_time <- proc.time() - start
dt_time
> user system elapsed
> 0.00 0.00 0.25
In this example the difference is barely noticeable, but on my real data it grows quickly when I have a few thousand columns to extract.
This was for 90,000 columns:
df_time
> user system elapsed
> 10.91 0.91 12.02
dt_time
> user system elapsed
> 24.00 0.00 24.15
The various comments suggested
to use the microbenchmark package,
to improve the data.table code,
and to use matrices.
(In addition, general concerns were raised regarding the small size of the data set used for benchmarking and the layout of the production data set with 40 times more columns than rows.)
However, I've repeated the benchmark using microbenchmark with
the original data.frame code of the OP,
the original data.table code of the OP,
a streamlined data.table version,
a streamlined data.frame version,
a matrix version.
Results
Unit: microseconds
expr min lq mean median uq max neval
df_orig 192.944 218.2420 235.02202 225.7940 236.9325 394.950 100
dt_orig 1012.672 1038.1590 1104.42052 1063.2675 1093.2855 3483.561 100
dt2 962.454 984.7315 1040.32245 1001.3445 1026.8315 3130.523 100
df2 47.953 53.2400 64.45366 63.0565 65.6995 217.109 100
mat 2.644 4.5310 6.46469 6.4190 7.1750 51.352 100
With the given small data set of 20 rows and 11 columns, the matrix version is ten times faster than the streamlined data.frame version, which in turn is about 4 times faster than the original data.frame version. For this simple data-retrieval task, the strong points of data.table (e.g., updating without copying) aren't used, so it's no wonder that overhead dominates in this toy example.
Benchmark code
microbenchmark::microbenchmark(
  df_orig = {
    annot1 = annot[which(annot$Gene %in% genes),]
    mydata1 = mydata[, c(1, which(colnames(mydata) %in% annot1$Name))]
    mydata1 = mydata1[order(mydata1$IID),]
    genomxwork = as.matrix(mydata1[, 2:dim(mydata1)[2]])
  },
  dt_orig = {
    annotDT1 = annotDT[which(Gene %in% genes),,]
    mydataDT1 = mydataDT[, c(1, which(colnames(mydataDT) %in% annot1$Name)), with=F]
    mydataDT1 = mydataDT1[order(IID),,]
    genomxworkDT = as.matrix(mydataDT1[, 2:dim(mydataDT1)[2]])
  },
  dt2 = {
    genomxworkDT <- as.matrix(mydataDT[
      , .SD, .SDcols = annotDT[J(genes), on = "Gene"]$Name])
  },
  df2 = {
    genomxwork <- as.matrix(mydata0[, names(annot_vec)[annot_vec %in% genes]])
  },
  mat = {
    genomxwork <- mat[, names(annot_vec)[annot_vec %in% genes]]
  },
  times = 100L
)
Data
The conversion of the data to data.table and matrix, respectively, is done outside of the benchmark, as this would probably also be the case in a production setting.
mydata=read.table(text="IID A B E G H W Z D N K
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 2 0 2 0 0 0 1 1
7 0 2 0 1 0 0 0 0 0 0
8 0 0 2 0 0 0 0 0 0 0
9 0 0 0 0 2 0 0 0 1 0
10 1 0 0 0 0 0 2 0 0 2
11 0 0 0 0 0 0 0 0 0 0
12 0 0 0 1 0 0 0 0 0 0
13 2 0 0 0 0 0 0 0 2 1
14 0 0 0 0 0 0 2 0 0 0
15 1 0 0 0 0 0 0 0 0 0
16 0 0 0 1 0 0 0 0 0 1
17 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0",h=T)
library(data.table)   # needed for as.data.table() and J() below
mydataDT=as.data.table(mydata)
mat <- as.matrix(mydata[order(mydata$IID), ])
mydata0 <- mydata[order(mydata$IID), ]   # assumed definition: the pre-ordered data.frame used by the df2 variant (not shown in the original)
annot=read.table(text="Name Gene
A Gene1
B Gene2
E Gene3
G Gene4
H Gene5
W Gene6
Z Gene7
D Gene8
N Gene9
K Gene10",stringsAsFactors=F,h=T)
annotDT=as.data.table(annot)
annot_vec <- setNames(annot$Gene, annot$Name)
genes = c("Gene2","Gene4","Gene9")

Find all m-tuples that sum to n

I want to find ALL the non-negative integer solutions to the equation i+j+k+l+m=n where n is a non-negative integer. That is, I want to find all possible 5-tuples (i,j,k,l,m) with respect to a certain n, in R.
I wrote some code which is not working; I suspect there is something wrong in the looping.
For your convenience, I have taken n=3, so I am basically trying to compute all vectors (i,j,k,l,m), of which there are 35, and the 35-by-5 matrix a is supposed to hold those vectors. The whole thing is in the function sample(n): calling sample(3) should give me the matrix a. Please note that a (35 by 5) is defined beforehand with all entries 0.
sample=function(n){
  i=0
  j=0
  k=0
  l=0
  m=0
  for(p in 1:35){
    while(i<=3){
      while(j<=3){
        while(k<=3){
          while(l<=3){
            m=n-(i+j+k+l)
            if(m>-1){
              a[p,]=c(i,j,k,l,m)
            }
            l=l+1
          }
          k=k+1
        }
        j=j+1
      }
      i=i+1
    }
  }
  return(a)
}
When I call sample(3), I get my original a i.e. the matrix with all elements 0. What is wrong with this code? Please rectify it.
I don't think a brute-force approach will bring you much joy for this task. Instead you should look for existing functions that can be used and are efficient (i.e. implemented in C/C++).
n <- 3
library(partitions)
blockparts(rep(n, 5), n)
#[1,] 3 2 1 0 2 1 0 1 0 0 2 1 0 1 0 0 1 0 0 0 2 1 0 1 0 0 1 0 0 0 1 0 0 0 0
#[2,] 0 1 2 3 0 1 2 0 1 0 0 1 2 0 1 0 0 1 0 0 0 1 2 0 1 0 0 1 0 0 0 1 0 0 0
#[3,] 0 0 0 0 1 1 1 2 2 3 0 0 0 1 1 2 0 0 1 0 0 0 0 1 1 2 0 0 1 0 0 0 1 0 0
#[4,] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 3 0 0 0 0 0 0 1 1 1 2 0 0 0 1 0
#[5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3
I believe that your code isn't answering your stated problem (as I understand it), on top of possible errors in your code.
One way to think of the problem is that, given the quadruple (i,j,k,l), the value of m = n - (i + j + k + l), while noting that the quadruple (i,j,k,l) is constrained so that n >= i+j+k+l AND i,j,k,l >= 0. For example, consider the following algorithm:
Let i freely take any value between 0 and n.
Given i, j can take values between 0 and n-i.
Given (i,j), k takes values between 0 and n-i-j.
Given (i,j,k), l takes values between 0 and n-i-j-k.
Given (i,j,k,l), m is defined as m = n - i - j - k -l.
The following code ought to answer your question. Please comment if this is not what you were looking for.
sample.example = function(n){
  a = array(0, c(0, 5))
  for(i in 0:n){
    for(j in seq(from=0, to=n-i, by=1)){
      for(k in seq(from=0, to=n-i-j, by=1)){
        for(l in seq(from=0, to=n-i-j-k, by=1)){
          m = n - i - j - k - l
          a = rbind(a, c(i, j, k, l, m))
        }
      }
    }
  }
  return(a)
}
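As a quick sanity check (a usage note added here, not part of the original answer), calling the function with n = 3 should return all choose(3 + 4, 4) = 35 tuples:
a <- sample.example(3)
dim(a)                 # 35  5
all(rowSums(a) == 3)   # TRUE: every 5-tuple sums to n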

Generate random number of missing values in R

I have a data frame such like that:
df<-data.frame(time1=rbinom(100,1,0.3),
time2=rbinom(100,1,0.4),
time3=rbinom(100,1,0.5),
time4=rbinom(100,1,0.6))
How could I generate random missing values for each time variable, with up to 20% missing? In this case that means fewer than 20 missing values per column, placed at random across subjects (rows).
You could do:
insert_nas <- function(x) {
  len <- length(x)
  n <- sample(1:floor(0.2*len), 1)
  i <- sample(1:len, n)
  x[i] <- NA
  x
}
df2 <- sapply(df, insert_nas)
df2
This will give you at most 20% missing values per column:
colSums(is.na(df2)) / nrow(df2)
time1 time2 time3 time4
0.09 0.16 0.19 0.14
Here's one way:
as.data.frame(lapply(df, function(x)
"is.na<-"(x, sample(seq(x), floor(length(x) * runif(1, 0, .2))))))
Something like this, you mean?
nomissing <- sample(1:20,1)
testnos <- rbinom(100 - nomissing,1,0.3)
testnas <- rep(NA,nomissing)
testmix <- sample(x = c(testnos,testnas),100)
Output -
> testmix
[1] 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
[37] 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 NA 0 1 0 0
[73] 0 0 1 1 0 0 1 0 0 1 1 0 0 NA 1 0 0 0 0 0 1 0 NA NA 1 0 0 0

How to create weighted adjacency list/matrix from edge list?

My problem is very simple: I need to create an adjacency list/matrix from a list of edges.
I have an edge list stored in a csv document with column1 = node1 and column2 = node2 and I would like to convert this to a weighted adjacency list or a weighted adjacency matrix.
To be more precise, here's what the data looks like, where the numbers are simply node ids:
node1,node2
551,548
510,512
548,553
505,504
510,512
552,543
512,510
512,510
551,548
548,543
543,547
543,548
548,543
548,542
Any tips on how to achieve the conversion from this to a weighted adjacency list/matrix?
This is how I previously tried to do it, without success (courtesy of Dai Shizuka):
dat=read.csv(file.choose(),header=TRUE) # choose an edgelist in .csv file format
el=as.matrix(dat) # coerces the data into a two-column matrix format that igraph likes
el[,1]=as.character(el[,1])
el[,2]=as.character(el[,2])
g=graph.edgelist(el,directed=FALSE) # turns the edgelist into a 'graph object'
Thank you!
This response uses base R only. The result is a standard matrix used to represent the adjacency matrix.
el <- cbind(a=1:5, b=5:1) #edgelist (a=origin, b=destination)
mat <- matrix(0, 5, 5)
mat[el] <- 1
mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 0 0 0 1
#[2,] 0 0 0 1 0
#[3,] 0 0 1 0 0
#[4,] 0 1 0 0 0
#[5,] 1 0 0 0 0
Here mat is your adjacency matrix defined from edgelist el, which is a simple cbind of the vectors 1:5 and 5:1.
If your edgelist includes weights, then you need a slightly different solution.
el <- cbind(a=1:5, b=5:1, c=c(3,1,2,1,1)) # edgelist (a=origin, b=destination, c=weight)
mat<-matrix(0, 5, 5)
for(i in 1:NROW(el)) mat[ el[i,1], el[i,2] ] <- el[i,3] # SEE UPDATE
mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 0 0 0 3
#[2,] 0 0 0 1 0
#[3,] 0 0 2 0 0
#[4,] 0 1 0 0 0
#[5,] 1 0 0 0 0
UPDATE
Some time later I realized that the for loop (3rd line) in the previous weighted edgelist example is unnecessary. You can replace it with the following vectorized operation:
mat[el[,1:2]] <- el[,3]
The post on my website you mention in the question (https://sites.google.com/site/daishizuka/toolkits/sna/sna_data) uses the igraph package, so make sure that is loaded.
Moreover, I recently realized that igraph provides a much easier way to create a weighted adjacency matrix from edgelists, using graph.data.frame(). I've updated this on my site, but here is a simple example:
library(igraph)
el=matrix(c('a','b','c','d','a','d','a','b','c','d'),ncol=2,byrow=TRUE) #a sample edgelist
g=graph.data.frame(el)
get.adjacency(g,sparse=FALSE)
That should do it. The sparse=FALSE argument tells it to show the 0s in the adjacency matrix.
If you really don't want to use igraph, I think this is a clunky way to do it:
el=matrix(c('a','b','c','d','a','d','a','b','c','d'),ncol=2,byrow=TRUE) #a sample edgelist
lab=names(table(el)) #extract the existing node IDs
mat=matrix(0,nrow=length(lab),ncol=length(lab),dimnames=list(lab,lab)) #create a matrix of 0s with the node IDs as rows and columns
for (i in 1:nrow(el)) mat[el[i,1],el[i,2]]=mat[el[i,1],el[i,2]]+1 #for each row in the edgelist, find the appropriate cell in the empty matrix and add 1.
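For comparison, a compact base-R alternative (an addition for illustration, not part of the original answer) gets the same counts without a loop by tabulating the edge pairs:
tab <- table(factor(el[,1], levels = lab), factor(el[,2], levels = lab))  # counts each origin/destination pair
mat <- unclass(tab)   # plain matrix of edge counts (the weighted adjacency matrix)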
Start with your data frame edges and use igraph to obtain adjacency matrix:
head(edges)
node1 node2
1 551 548
2 510 512
3 548 553
4 505 504
5 510 512
6 552 543
library(igraph)
as.matrix(get.adjacency(graph.data.frame(edges)))
551 510 548 505 552 512 543 553 504 547 542
551 0 0 2 0 0 0 0 0 0 0 0
510 0 0 0 0 0 2 0 0 0 0 0
548 0 0 0 0 0 0 2 1 0 0 1
505 0 0 0 0 0 0 0 0 1 0 0
552 0 0 0 0 0 0 1 0 0 0 0
512 0 2 0 0 0 0 0 0 0 0 0
543 0 0 1 0 0 0 0 0 0 1 0
553 0 0 0 0 0 0 0 0 0 0 0
504 0 0 0 0 0 0 0 0 0 0 0
547 0 0 0 0 0 0 0 0 0 0 0
542 0 0 0 0 0 0 0 0 0 0 0
Another possibility with the qdapTools package:
library(qdapTools)
el[rep(seq_len(nrow(el)), el[,'c']), c('a', 'b')] %>%
{split(.[,'b'], .[,'a'])} %>%
mtabulate()
## 1 2 3 4 5
## 1 0 0 0 0 3
## 2 0 0 0 1 0
## 3 0 0 2 0 0
## 4 0 1 0 0 0
## 5 1 0 0 0 0
