R data.table: How to use the apply function?

I need to briefly explain the context before asking my question.
I am trying to process a large graph, namely the "Social circles: Google+" dataset. The file gplus_combined.txt downloaded from that site is read using the data.table package:
library(data.table)
data = fread('gplus_combined.txt',stringsAsFactors = TRUE)
The variable data has dimensions dim(data) = c(30494865, 2); here is an example row:
> data[1,]
1: 112188647432305746617 107727150903234299458
The two long integer strings are ids of nodes of the graph, and each row of data corresponds to an edge between the first and second node ids. Since working with node ids like those is not very convenient, I'd like to convert them to numbers using the R function strtoi. Here is what I have tried:
M = matrix(0,2,2)
for (i in 1:2) {
  for (j in 1:2) {
    M[i,j] = strtoi(data[i,j,with = FALSE])
  }
}
print(M)
      [,1]  [,2]
[1,] 47826 45374
[2,] 65616  2462
This works well for just two rows of data, but it is too slow for processing about 30 million rows. So I want to use the R function apply to speed up the calculation. The problem is that if I just use
apply(data[1:2,], 1:2, strtoi)
[1,] NA NA
[2,] NA NA
then it returns a 2x2 matrix with NA entries. Note that to get the matrix M above, I need to include the parameter with = FALSE,
strtoi(data[i,j,with = FALSE])
otherwise M would also be a matrix of NA entries. Is there a way to pass the option with = FALSE to the apply function? Or is there any other, faster way to get the same result as matrix M? Any suggestions or comments are greatly appreciated!
Thank you for taking the time to read this long post!
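One possible direction, sketched under assumptions rather than taken from the thread: the ids have 21 digits, which is beyond R's 32-bit integer range, so strtoi returns NA when applied to the id strings themselves. The numbers in M above are therefore probably factor level codes created by stringsAsFactors = TRUE (an unverified guess). If compact integer ids are all that is needed, match against the vector of distinct ids is vectorised over a whole column and needs no loop. The toy data frame edges and its third id are made up for illustration; the same match calls should work on the columns of a data.table.

```r
# Hypothetical miniature edge list standing in for gplus_combined.txt.
edges <- data.frame(
  from = c("112188647432305746617", "107727150903234299458", "112188647432305746617"),
  to   = c("107727150903234299458", "100000000000000000001", "100000000000000000001"),
  stringsAsFactors = FALSE
)

# A 21-digit id overflows R's integer type, so strtoi yields NA.
overflow <- strtoi("112188647432305746617")

# Give every distinct node id a compact integer label, in order of first appearance.
nodes <- unique(c(edges$from, edges$to))

# match() is vectorised: one call per column, no row-by-row loop.
M <- cbind(from = match(edges$from, nodes),
           to   = match(edges$to,   nodes))
print(M)
```

The original id of node k can be recovered later as nodes[k].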

Related

Populating a matrix using values from two other matrices of unpredictable size

I'd like to populate a matrix using information from two other matrices.
I have managed to do this with a given dataset, but I need to integrate this within a larger script, and the size of the two matrices I'm using to populate the larger matrix may differ each time.
Example data:
days = 150
block <- matrix(c(50,120,150), nrow=3, ncol=1)
[,1]
[1,] 50
[2,] 120
[3,] 150
e1 <- matrix(c(0.1,0.5,0.7), nrow=3, ncol=1)
[,1]
[1,] 0.1
[2,] 0.5
[3,] 0.7
result <- matrix(0, nrow = 150, ncol=1)
I need to create a vector of numbers (taken from e1) that repeat themselves depending on each number in 'block'.
The code below demonstrates the desired outcome in this instance; however, I'm trying to write a more flexible script that can cope with fewer or more than 3 'blocks'.
I appreciate there is probably a much easier way of doing this, but my head is stuck in loop mode and I can't seem to get out of it!
for (v1 in 1:days){
  if(v1 <= block[1,1]){
    result[v1,1] <- e1[1,1]
  }
  else if (v1 > block[1,1] & v1 <= block[2,1]){
    result[v1,1] <- e1[2,1]
  }
  else if (v1 > block[2,1] & v1 <= block[3,1]){
    result[v1,1] <- e1[3,1]
  }
}
Any help would be much appreciated!
You can get this by using a nice feature of rep:
result <- rep(e1, c(block[1], diff(block)))
# cast the vector as a column matrix
result <- matrix(result, length(result))
This works because rep will accept a vector in its second argument that tells it how many times to repeat each element of its first argument.
If you know the length ahead of time, you can combine the lines, like
result <- matrix(rep(e1, c(block[1], diff(block))), days)
for example.
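As a quick check of the rep approach, here is a small sketch using the question's own data. It uses diff(c(0, block)) to get the run lengths, which is an equivalent way of writing c(block[1], diff(block)), and then repeats the idea with a made-up four-block case (block4 and e4 are hypothetical names and values) to show that nothing depends on there being exactly three blocks.

```r
# Data from the question.
block <- matrix(c(50, 120, 150), nrow = 3, ncol = 1)
e1    <- matrix(c(0.1, 0.5, 0.7), nrow = 3, ncol = 1)

# Run lengths: 50, 70, 30 (differences between consecutive block ends).
times  <- diff(c(0, block))
result <- matrix(rep(e1, times), ncol = 1)

# Hypothetical four-block case; the same two lines work unchanged.
block4  <- c(10, 25, 40, 60)
e4      <- c(0.2, 0.4, 0.6, 0.8)
result4 <- matrix(rep(e4, diff(c(0, block4))), ncol = 1)
```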

Using paste function within colnames

I want to use iteration to turn the entries in a list into 2x2 matrices, and then assign the same column and row names to these tables, as well as integer values for the matrix cells.
For example's sake, let's pretend this is the list with the entries whose names I want to turn into matrices:
cnames <- c("Honda", "Toyota", "Nissan")
Creating the tables themselves seems to work fine with the assign function:
for (i in 1:length(cnames)){
  assign(paste(cnames[i],"table",sep="_"), matrix(,nrow=2,ncol=2))
}
Which when I type, for instance:
> Honda_table
...returns:
[,1] [,2]
[1,] NA NA
[2,] NA NA
But if I try to assign column names in the original loop, like so:
for (i in 1:length(cnames)){
  assign(paste(cnames[i],"table",sep="_"), matrix(,nrow=2,ncol=2))
  colnames(paste(cnames[i],"table",sep="_")) <- c("A","B")
}
...I get this error instead:
Error : attempt to set 'colnames' on an object with less than two dimensions
I don't understand why this is coming up, since after using the original assign function, if I look up the dimensions of any of the tables, such as:
> dim(Honda_table)
...I get:
[1] 2 2
Which indicates it is a 2x2 object.
Moreover, I cannot assign pre-set values to the matrix cells, like so:
for (i in 1:length(cnames)){
  assign(paste(cnames[i],"table",sep="_"), matrix(,nrow=2,ncol=2))
  paste(cnames[i],"table",sep="_")[1,1] = 1
}
...without getting this error:
Error : incorrect number of subscripts on matrix
What is going on here?
Thanks.
I am not sure it is the best or the most beautiful way, but it seems to work:
for (i in 1:length(cnames)){
  tab <- matrix(,nrow=2,ncol=2)
  colnames(tab) <- c("A","B")
  assign(paste(cnames[i],"table",sep="_"), tab)
}
rm(tab)
After many suggestions I ended up scrapping the assign function and simply created a vector of tables instead.
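The errors in the question arise because paste(...) returns a character string, not the object that happens to have that name, so colnames<- and [1,1] are applied to a one-element character vector. A minimal sketch of the "collection of tables" idea, assuming a named list is acceptable (the names tables and table_names are illustrative, not from the thread):

```r
cnames <- c("Honda", "Toyota", "Nissan")
table_names <- paste(cnames, "table", sep = "_")

# One named list holds all tables, so no assign()/get() is needed.
tables <- setNames(vector("list", length(cnames)), table_names)

for (nm in table_names) {
  tab <- matrix(NA_real_, nrow = 2, ncol = 2)
  colnames(tab) <- c("A", "B")   # works: tab is the matrix itself, not its name
  tables[[nm]] <- tab
}

# Cells can be set by looking the table up by name.
tables[["Honda_table"]][1, 1] <- 1
```

Each table is then available as tables$Honda_table, tables[["Toyota_table"]], and so on.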

How to feed two arrays into apply

I have two 3-D arrays, and I want to calculate some statistics on them. As long as I am working with only one variable, I know how to do it. For example, to calculate the mean over the first dimension, I use the following:
obs<-array(1:8,c(2,2,2));
mod<-array(9:2,c(2,2,2));
meanObs <- apply(obs,c(2,3),mean) # mean of observation
meanMod <- apply(mod,c(2,3),mean) # mean of model simulation/forecast
However, I do not know how to feed two sliced arrays into apply. For example, I am trying to calculate the correlation coefficient over the first dimension. I can do it with the following nested loops:
pearsonCor <- matrix(, nrow = dim(obs)[2], ncol = dim(obs)[3])
for (i in 1:dim(obs)[2]){
  for (j in 1:dim(obs)[3]){
    pearsonCor[i,j] <- tryCatch(suppressWarnings(cor(obs[,i,j], mod[,i,j], method = "pearson")),
                                error=function(cond) {return(NA)})
  }
}
result:
> pearsonCor
[,1] [,2]
[1,] -1 -1
[2,] -1 -1
But I want to learn how to deal with this situation using apply. Any help would be very much appreciated.
Thanks,
You can use expand.grid to get the index combination as in your nested for loop. Then apply over the data.frame of indices.
pearsonCor[] <- apply(expand.grid(1:dim(obs)[2], 1:dim(obs)[3]), 1, function(x)
cor(obs[,x[[1]], x[[2]]], mod[,x[[1]], x[[2]]]))
expand.grid varies its first variable fastest (corresponding to i in the loops). That matches R's column-major fill order, so assigning into pearsonCor[] places each value at the same [i, j] position as the nested loops in your question.
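An alternative sketch, not from the original answer: wrap the per-slice correlation in a small helper (cor_at is a hypothetical name) and let outer build the index grid, with Vectorize making the helper accept index vectors. This assumes both arrays have the same dimensions, and it omits the tryCatch wrapper from the question.

```r
obs <- array(1:8, c(2, 2, 2))
mod <- array(9:2, c(2, 2, 2))

# Correlation over the first dimension for one (i, j) slice.
cor_at <- function(i, j) cor(obs[, i, j], mod[, i, j], method = "pearson")

# outer() passes whole index vectors, so the scalar helper is vectorised first.
pearsonCor <- outer(seq_len(dim(obs)[2]),
                    seq_len(dim(obs)[3]),
                    Vectorize(cor_at))
print(pearsonCor)
```

The result has one row per index of the second dimension and one column per index of the third, the same layout as the loop version.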

Creating an array or lists of lists in R

I have a list of matrices such that my_list[[1]] consists of a matrix and my_list[[2]] contains another matrix and so on. I want to embed this list inside a loop such that for every iteration of the loop I have a different my_list with different matrices, and want to be able to access them later. Is there any way I could do this in R? For example like creating an array (of size = number of iterations of the loop), and each index of the array would have a different list of matrices. Or something similar. And how can I access it. Could anyone please help me with this? I would greatly appreciate the help. I have looked around but cannot find a way to do this. Lists of lists seem to be an option, and I have tried to experiment with it for one iteration but it gives this error:
> nes <- list()
> nes[[1]] <- append(nes[[1]], my_list[[1]])
Error in nes[[1]] : subscript out of bounds
Would be great if anyone could help me with this.
EDIT:
Basically what I have is an initial list known as particles. Something like this:
for (k in 1:10)
{
  # three centroids; k = 3
  particle[[k]] <- rbind(features.dataf[sample(1:10, 1),2:4],
                         features.dataf[sample(1:10, 1),2:4],
                         features.dataf[sample(1:10, 1),2:4])
  row.names(particle[[k]]) <- c(1,2,3)
}
Then I run this loop again. With an extra outer loop.
for (n in 1:30) {
  for (k in 1:10) {
    ### some calculations
    ### create a vector f[k] with an f value for each k (calculated according to some formula)
    pbestFitness[n,k] <- f[k] ## create an n x k dataframe that stores the f[k] value for every iteration of n
    ### over here I want to create a list of lists
  }
}
At the marked point in the code above, I want to build the list of lists so that for every iteration of the outer loop the particle[[k]] matrices are stored.
Any particle[[k]] is of the form:
           [,1]      [,2]      [,3]
[1,] 0.96436532 0.8958297 0.6089338
[2,] 0.08555853 0.7762849 0.6647247
[3,] 0.30792817 0.8061227 0.5099790
The desired output is that when I access this new list of lists (nes), nes[[n]] holds a list of k matrices.
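The "subscript out of bounds" error above occurs because nes is an empty list, so nes[[1]] does not exist yet when append tries to read it. A minimal sketch of one way to build the list of lists, assuming it is enough to store a snapshot of particle at the end of each outer iteration. The random 3x3 matrices are placeholders for the real centroid calculations, and n_iter and n_particles are illustrative names.

```r
set.seed(1)        # only to make the placeholder data reproducible
n_iter      <- 30  # outer loop iterations
n_particles <- 10  # matrices per iteration

# Preallocate the outer list so nes[[n]] can be assigned directly.
nes <- vector("list", n_iter)

for (n in seq_len(n_iter)) {
  particle <- vector("list", n_particles)
  for (k in seq_len(n_particles)) {
    # Placeholder for the real computation of the k-th particle.
    particle[[k]] <- matrix(runif(9), nrow = 3)
  }
  nes[[n]] <- particle   # store this iteration's whole list of matrices
}

# Access: the 2nd matrix from the 5th iteration.
m <- nes[[5]][[2]]
```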

Using the Outer Function

I'm having difficulty using the outer function. I've looked at a few threads, but haven't been able to find a solution.
I have a matrix, prices, with the following information:
25 26
I use the outer function as follows to multiply these numbers together:
a = outer(prices[1,1:2],prices[1,1:2],FUN ="*")
This gives me the following error:
Error in as.vector(X) %*% t(as.vector(Y)) :
requires numeric/complex matrix/vector arguments
If, however, I do the exact same thing, but with the numbers directly, it works as I would like it to:
a = outer(c(25,26),c(25,26),FUN ="*")
and returns a 2x2 matrix with the products.
Any help would be greatly appreciated.
Your prices matrix is apparently a data.frame instead of a matrix. You can either change that:
prices <- as.matrix(prices)
a <- outer(prices[1,1:2],prices[1,1:2],FUN ="*")
or you can just convert to numeric when you use it:
a <- outer(as.numeric(prices[1,1:2]),as.numeric(prices[1,1:2]),FUN ="*")
With a genuine matrix, the original call works as written:
prices <- matrix(c(25,26), nrow=1)
a = outer(prices[1,1:2],prices[1,1:2],FUN ="*")
#      [,1] [,2]
#[1,]  625  650
#[2,]  650  676
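To see the difference between the two cases, here is a small sketch assuming prices was a one-row data.frame (the column names p1 and p2 are made up). Subsetting a row of a data.frame gives another data.frame, which is a list rather than a numeric vector, so it has to be flattened before being passed to outer.

```r
# Hypothetical data.frame version of prices.
prices_df <- data.frame(p1 = 25, p2 = 26)

row1 <- prices_df[1, 1:2]
still_df <- is.data.frame(row1)   # a one-row data.frame, not a numeric vector

# unlist() flattens the row to a named numeric vector that outer() accepts.
a <- outer(unlist(row1), unlist(row1), FUN = "*")
print(a)
```

The result carries the column names as dimnames; unname(a) drops them if a bare matrix is wanted.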

Resources