Creating large matrices in R in reasonable time - r

I am working on a movie recommender that predicts a user's rating for an unseen movie. Most of the work is done: I have created a 7000x3000 matrix userRatingsNew containing 7000 users and their ratings for 3000 movies, with all missing values replaced by predicted ratings.
I was provided two other files, mapping and test, and used read.csv() to load them into matrices of the following format.
mapping is a 8,400,000x3 matrix that contains id, user, movie, where id is basically the transaction id associated with a user's rating of movie x.
test is a 8,400,000x2 matrix that contains id, rating, where rating is the user's rating for that movie associated with id. The values in the rating column are empty and I need to fill those in using the predicted values that I have already calculated.
Here is my code:
writeResult <- function(userRatingsNew, mapping, test, writeToFile = FALSE) {
  start <- Sys.time()
  result <- test
  entries <- nrow(test)
  for (i in 1:entries) {
    result[i, 2] <- userRatingsNew[mapping[i, 2], mapping[i, 3]]
  }
  if (writeToFile)
    write.csv(result, "result.csv", row.names = FALSE)
  print(Sys.time() - start)
  return(result)
}
My problem is that the first 100 iterations (i = 1:100) take ~7 seconds, so processing all 8.4 million entries would take ~163 hours. I tried parallelizing with the doMC package, but then my computer ran out of memory. What exactly can I do to speed this process up?

You can index a matrix with another matrix, as in:
M <- matrix(1:25, nc = 5, nr = 5)
M
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    6   11   16   21
# [2,]    2    7   12   17   22
# [3,]    3    8   13   18   23
# [4,]    4    9   14   19   24
# [5,]    5   10   15   20   25
m <- cbind(1:5, 5:1)
m
#      [,1] [,2]
# [1,]    1    5
# [2,]    2    4
# [3,]    3    3
# [4,]    4    2
# [5,]    5    1
M[m]
# [1] 21 17 13  9  5
So try
result[,2] <- userRatingsNew[mapping[,2:3]]
You should not need a loop.
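Applied to the question's setup, here is a minimal sketch with shrunken toy dimensions (the object names follow the question; the numbers are invented for illustration):

```r
# Toy stand-ins for the question's objects: 3 users x 4 movies
userRatingsNew <- matrix(1:12, nrow = 3, ncol = 4)
mapping <- cbind(id    = 1:5,
                 user  = c(1, 2, 3, 1, 2),
                 movie = c(4, 1, 2, 3, 3))
test <- data.frame(id = 1:5, rating = NA)

result <- test
# One vectorized lookup: each row of mapping[, 2:3] is a (user, movie) pair
result[, 2] <- userRatingsNew[mapping[, 2:3]]
```

The whole 8.4-million-row lookup then runs as a single indexing operation instead of 8.4 million loop iterations.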

A thought:
Instead of attaching the full 3000-movie dimension directly to each of the 7000 users, you could store, for each user, a small array of (movie, rating) pairs covering only the movies that user actually rated. Presumably most users will not rate all 3000 films. If they rate 20 movies on average, you only need on the order of 7000 x (20x2 + 20) entries, where 20x2 is the 20 ratings plus the references to the films, and the other 20 is the cost of retrieving the film names. You can compute all results first using array positions and attach the names at the end by looking them up in a separate array of film names.
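A rough sketch of that idea, storing only the movies each user actually rated (all names and numbers below are invented for illustration):

```r
# Sparse per-user storage: each user keeps only (movie, rating) pairs
# instead of a full 3000-column row
ratings <- list(
  user1 = cbind(movie = c(2, 5), rating = c(4, 3)),
  user2 = cbind(movie = c(1, 5), rating = c(5, 2))
)
movieNames <- c("A", "B", "C", "D", "E")  # separate array of film names

# Look up a user's rating for a movie; NA if unrated
lookup <- function(u, m) {
  tab <- ratings[[u]]
  hit <- tab[tab[, "movie"] == m, "rating"]
  if (length(hit)) hit else NA
}
```

For example, lookup("user2", 5) returns that user's rating for movie 5, and the film name is recovered at the end via movieNames[5].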

Related

Import Excel Data into R

I'm working with an Excel file containing a 261 x 10 matrix. The matrix consists of the weekly returns of 10 stocks from 2010 to 2015, so I have 10 variables (stocks) and 261 observations (weekly returns) for each variable.
For my master's thesis I have to apply a "rearrangement algorithm" developed by Rüschendorf and Puccetti (2012) to my matrix. I won't go into further detail on the theoretical side of that concept. The thing is that I downloaded a package capable of performing the rearrangement algorithm in R. I tested it out and it works perfectly.
Actually the only thing I need to know is how to import my Excel matrix into R so that I can perform the rearrangement algorithm on it. I can rewrite my matrix in R manually, encoding every element of the matrix with the matrix programming formula in R:
A = matrix( c(), nrow= , ncol= , byrow=TRUE)
The problem is that doing so for such a big matrix (261 x 10) would be very time consuming. Is there any way to import my Excel matrix into R so that R recognizes it as a matrix of numerical values ready for calculations (as if I had entered it manually)? That way I would just have to run the "rearrangement algorithm" function provided in R.
Thanks in advance.
I made a selection within an open Excel sheet and copied it to the clipboard. This then worked on a Mac:
> con=pipe("pbpaste")
> dat <- data.matrix( read.table(con) )
> dat
      V1 V2 V3
 [1,]  1  1  1
 [2,]  2  2  2
 [3,]  3  3  3
 [4,]  4  4  4
 [5,]  5  5  5
 [6,]  6  6  6
 [7,]  7  7  7
 [8,]  8  8  8
 [9,]  9  9  9
[10,] 10 10 10
[11,] 11 11 11
[12,] 12 12 12
[13,] 13 13 13
[14,] 14 14 14
The method is somewhat different on Windows devices but the help page for ?connections should have your OS-specific techniques.
You didn't provide a minimal reproducible example, so the answers will probably be of lesser quality. Anyway, you should be able to load the Excel file with something like:
require(XLConnect)
wrkbk <- loadWorkbook("path/to/your/data.xlsx")
df <- readWorksheet(wrkbk, sheet = 1, header = TRUE)
And then convert the data.frame to a matrix via
ans <- as.matrix(df)
Otherwise, you need to save your file as a .txt or .csv plain-text file and use read.table or read.csv and the like. Consult their respective help pages.
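For the plain-text route, here is a minimal sketch; "returns.csv" is just a stand-in written on the fly so the example is self-contained, not the thesis data file:

```r
# Write a small example file, then read it back as a numeric matrix
write.csv(matrix(1:20, nrow = 4), "returns.csv", row.names = FALSE)
A <- as.matrix(read.csv("returns.csv", header = TRUE))
dim(A)         # 4 rows, 5 columns
is.numeric(A)  # ready for matrix arithmetic
```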

Extract minima returns

I am trying to apply the block maxima (in my case minima) approach of Extreme Value Theory to financial returns. I have daily returns for 30 financial indices stored in a csv file called 'Returns'. I start by loading the data
Returns<-read.csv("Returns.csv", header=TRUE)
I then extract the minimum returns over consecutive non-overlapping blocks of equal length (i.e., 5 days) for each index I have in my 'Returns.csv' file. For that, I do the following
for (xx in Returns) {  # Obtain the minima
  rows <- length(xx)   # This is the number of returns
  m <- 5               # m <- 5 gives weekly minima; change accordingly (e.g., 20)
  k <- rows/m          # This is the number of blocks (number of returns / block size),
  bm <- rep(0, k)      # which is also the number of extremes
  for (i in 1:k) {
    bm[i] <- min(xx[((i - 1)*m + 1):(i*m)])
  }
  # Store the minima in a file 'minima.csv'
  write.table(bm, file = "minima.csv", append = TRUE, row.names = FALSE, col.names = FALSE)
}
The code extracts the minima returns for all indices correctly but when the minima are stored in the file 'minima.csv' they all appear in the same column (appended).
What I want the code to do is to read the financial returns contained in the first column of the file 'Returns.csv', extract the minima returns over consecutive non-overlapping blocks of equal length (i.e., 5 days) and store them in the first column of the file 'minima.csv'. Then do exactly the same for the financial returns contained in the second column of the file 'Returns.csv' and store the minima returns in the second column of the file 'minima.csv', and so on, until I reach column 30.
I think your data looks similar to this:
> m <- matrix(1:40, ncol = 4)
> m
      [,1] [,2] [,3] [,4]
 [1,]    1   11   21   31
 [2,]    2   12   22   32
 [3,]    3   13   23   33
 [4,]    4   14   24   34
 [5,]    5   15   25   35
 [6,]    6   16   26   36
 [7,]    7   17   27   37
 [8,]    8   18   28   38
 [9,]    9   19   29   39
[10,]   10   20   30   40
Obviously you have more rows and columns and your data is not just the sequence of 1 to 40. To chunk each column with a size of 5 and find the minimum for each column run:
> apply(m, 2, function(x) sapply(split(x, ceiling(seq_along(x)/5)), min))
  [,1] [,2] [,3] [,4]
1    1   11   21   31
2    6   16   26   36
Basically, apply splits m by columns and applies the function to each column. The inner function takes each column, chunks it, and returns the minimum of each chunk. Your data is in a data frame, not a matrix, so you need to convert it before you run the command above.
m <- as.matrix(Returns)
To write this to a csv
> mins <- apply(m, 2, function(x) sapply(split(x, ceiling(seq_along(x)/5)), min))
> write.table(mins, file="test.min.csv", sep=',', row.names=F, col.names=F, quote=F)

filling matrix with circular pattern

I want to write a function that fills an m by m matrix, where m is odd, as follows:
1) It starts from the middle cell of the matrix (for example, for a 3 by 3 matrix A, the middle cell is A[2,2]) and puts the number 1 there.
2) It goes one cell to the right, adds 1 to the previous value, and puts 2 in that second cell.
3) It goes down and puts 3, then left 4, left 5, up 6, up 7, ...
For example, the resulting matrix could look like this:
7 8 9
6 1 2
5 4 3
Could somebody help me implement this?
max_x <- 5
len <- max_x^2
middle <- ceiling(max_x/2)
A <- matrix(NA, max_x, max_x)
increments <- Reduce(
  f = function(lhs, rhs) c(lhs, (-1)^(rhs/2 + 1) * rep(1, rhs)),
  x = 2 * (1:max_x),
  init = 0
)[1:len]
idx_x <- Reduce(
  f = function(lhs, rhs) c(lhs, rep(c(TRUE, FALSE), each = rhs)),
  1:max_x,
  init = FALSE
)[1:len]
increments_x <- increments
increments_y <- increments
increments_x[!idx_x] <- 0
increments_y[idx_x] <- 0
A[(middle + cumsum(increments_x) - 1) * max_x + middle + cumsum(increments_y)] <- 1:(max_x^2)
Gives
#> A
#     [,1] [,2] [,3] [,4] [,5]
#[1,]   21   22   23   24   25
#[2,]   20    7    8    9   10
#[3,]   19    6    1    2   11
#[4,]   18    5    4    3   12
#[5,]   17   16   15   14   13
Explanation:
The vector increments denotes the steps along the path of the increasing numbers: 0/+1/-1 for unchanged/increasing/decreasing row and column indices. Importantly, these numbers do not differentiate between steps along columns and steps along rows. That is managed by the vector idx_x: it masks out increments that are either along a row (TRUE) or a column (FALSE).
The last line takes into account R's indexing logic (the matrix index increases along columns).
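That column-major linear indexing can be checked on a small example:

```r
B <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns, filled column by column
# Element (row i, column j) sits at linear position (j - 1) * nrow + i
i <- 2; j <- 3
B[(j - 1) * nrow(B) + i] == B[i, j]  # both pick the same element
```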
Edit:
As per request of the OP, here some more information about how the increments vector is calculated.
You always walk two consecutive straight lines of equal length (one row-wise, one column-wise), and the length increases by 1 after each such pair. This corresponds to the x = 2*(1:max_x) argument together with rep(1, rhs). The first two consecutive walks are in the increasing column/row direction, the next two in the negative direction, and so on, alternating. This is accounted for by (-1)^(rhs/2 + 1).
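To see what that construction produces, here is the increments vector for a smaller case, max_x = 3, using the Reduce call from the answer:

```r
max_x <- 3
len <- max_x^2
# Blocks of length 2, 4, 6 with alternating sign, truncated to len steps
increments <- Reduce(
  f = function(lhs, rhs) c(lhs, (-1)^(rhs/2 + 1) * rep(1, rhs)),
  x = 2 * (1:max_x),
  init = 0
)[1:len]
increments  # 0  1  1 -1 -1 -1 -1  1  1
```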

Finding the index of the minimum value which is larger than a threshold in R

This is probably very simple, but I'm missing the correct syntax needed to simplify it.
Given a matrix, find the entry in one column which is the lowest value greater than some input parameter. Then return an entry in a different column on that corresponding row. Not very complicated... and I've found something that works, but a more efficient solution would be greatly appreciated.
I found this link: Better way to find a minimum value that fits a condition?
which is great... but that method of finding the least entry loses the index information required to find a corresponding value in the corresponding row.
Let's say column 2 is the condition column, and column 1 is the one I want to return... currently I've made this (note that this only works because column two is full of numbers which are less than 1):
matrix[which.max((matrix[,2] > threshold)/matrix[,2]), 1]
Any thoughts? I'm expecting that there is probably some quick and easy function which has this effect... it's just never been introduced to me, haha.
rmk's answer shows the basic way to get a lot of info out of your matrix. But if you know which column you're testing for the minimum value (above your threshold), and then want to return a different value in that row, maybe something like
incol <- df[, 4]  # select the column to search
outcol <- 2       # select the column of the found row you want returned
threshold <- 5
hit <- which(incol > threshold)         # rows above the threshold
df[hit[which.min(incol[hit])], outcol]  # row with the smallest qualifying value
You could try the following. Say,
df <- matrix(sample(1:35,35),7,5)
> df
     [,1] [,2] [,3] [,4] [,5]
[1,]   18   16   27   19   31
[2,]   24    1    7   12    5
[3,]   28   35   23    4    6
[4,]   33    3   25   26   15
[5,]   14   10   11   21   20
[6,]    9    2   32   17   13
[7,]   30    8   29   22   34
Say your threshold is 5:
apply(df, 2, function(x) { x[x < 5] <- max(x); which.min(x) })
[1] 6 7 2 2 2
Corresponding to the values:
[1] 9 8 7 12 5
This should give you, for each column, the index of the smallest entry greater than the threshold, in terms of the original column indexing.

Perform 'cross product' of two vectors, but with addition

I am trying to use R to perform an operation (ideally with similarly displayed output) such as
> x<-1:6
> y<-1:6
> x%o%y
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    2    4    6    8   10   12
[3,]    3    6    9   12   15   18
[4,]    4    8   12   16   20   24
[5,]    5   10   15   20   25   30
[6,]    6   12   18   24   30   36
where each entry is found through addition not multiplication.
I would also be interested in creating the 36 ordered pairs (1,1) , (1,2), etc...
Furthermore, I want to use another vector like
z<-1:4
to create all the ordered triplets possible between x, y, and z.
I am using R to look into the likelihoods of possible totals when rolling dice with varied numbers of sides.
Thank you for all your help! This site has been a big help to me. I appreciate anyone that takes the time to answer a stranger's question.
UPDATE: So I found that outer(x, y, "+") will do what I wanted first. But I still don't know how to create the ordered pairs or ordered triplets.
Your first question is easily handled by outer:
outer(1:6,1:6,"+")
For the others, I suggest you try expand.grid, although there are specialized combination and permutation functions out there as well if you do a little searching.
expand.grid can answer your second question:
expand.grid(1:6,1:6)
expand.grid(1:6,1:6,1:4)
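A short sketch pulling the pieces together for the dice use case (the d4 comes from the question's z):

```r
x <- 1:6; y <- 1:6; z <- 1:4

sums    <- outer(x, y, "+")                  # 6x6 matrix of pairwise sums
pairs   <- expand.grid(x = x, y = y)         # all 36 ordered pairs
triples <- expand.grid(x = x, y = y, z = z)  # all 144 ordered triplets

# Distribution of totals when rolling two d6 and one d4
totals <- table(rowSums(triples))
```

Each row of the expand.grid result is one ordered pair or triplet, so rowSums gives the total for each possible roll.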