I'm trying to create a big matrix in R and fill it with predicted values from a fitted model. I created the matrix on a remote server and am filling it in using a for loop.
library("bigmemory")
pred = filebacked.big.matrix(nrow=50000,ncol=150000,backingfile="pred.bin",descriptorfile="pred.desc",backingpath="~/pred")
for(i in 1:nrow(data)){
pred[i,] = as.vector(intercept+beta%*%data[i,])
}
The actual .bin file is on a remote server. Yet the process of filling in the matrix still takes up a lot of physical memory on my computer. How do I avoid this?
I HOPE I UNDERSTOOD YOU RIGHT!
Filling in the matrix takes a lot of physical memory on your computer because the calculations are made on you computer.
If i see it right, you could think about it that way:
If i ask you to calculate something and write it down on my sheet of paper, you still have the whole calculation-process in "your brain";But the results are on my paper;
Well, how could you avoid this:
set up an r runtime on the Serve r- and let the server calculate !
The Server should do the "fill in data-process";
Related
I have successfully set up a R SQL Server ODBC connection by doing:
DBI_connection <- dbConnect(odbc(),
driver = "SQL Server"
server = server_name
database = database_name)
Dataset_in_R <- dbFetch(dbSendQuery(DBI_connection,
"SELECT * FROM MyTable_in_SQL"))
3 quick questions:
1-Is there a quicker way to copy data from SQL Server to R? This table has +44million rows and it is still running...
2-If I make any changes to this data in R does it change anything in my MyTable_in_SQL? I dont think so because I have saved it in a global data.frame variable in R, but just checking.
3-How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
1: Is there a quicker way to copy data from SQL Server to R?
The answer here is rather simple to answer. the odbc package in R does quite a bit under-the-hood to ensure compatibility between the result fetched from the server and R's data structure. It might be possible to obtain a slight increase in speed by using an alternative package (RODBC is an old package, and it sometimes seems faster). In this case however, with 44 mil. rows, I expect that the bigger performance boost comes from preparing your sql-statement. The general idea would be to
Remove any unnecessary columns. Remember each column will need to be downloaded, so if you have 20 columns, removing 1 column may reduce your query execution time by ~5% (assuming linear run-time)
If you plan on performing aggregation, it will (very close to almost) faster to perform this directly in your query, eg, if you have a column called Ticker and a column called Volume and you want the average value of Volume you could calculate this directly in your query. Similar for last row using last_value(colname) over ([partition by [grouping col 1], [grouping col 2] ...] order by [order col 1], [order col 2]) as last_value_colname.
If you choose to do this, it might be beneficial to test your query on a small subset of rows using TOP N or LIMIT N eg: select [select statement] from mytable_in_sql order by [order col] limit 100 which would only return the first 100 rows. As Martin Schmelzer commented this can be done via R with the dplyr::tbl function as well, however it is always faster to correct your statement.
Finally if your query becomes more complex (does not seem to be the case here), it might be beneficial to create a View on the table CREATE VIEW with the specific select statement and query this view instead. The server will then try to optimize the query, and if your problem is on the server side rather than local side, this can improve performance.
Finally one must state the obvious. As noted above when you query the server you are downloading some (maybe quite a lot) of data. This can be improved by improving your internet connection either by repositioning your computer, router or directly connecting via a cord (or purely upgrading ones internet connection). For 44 Mil. rows if you have only a single 64 bit double precision variable, you have 44 * 10^6 / 1024^3 = 2.6 GiB of data (if not compressed). If you have 10 columns, this goes up to 26 GiB of data. It simply is going to take quite a long time to download all of this data. Thus getting this row count down would be extremely helpful!
As a side note you might be able to simply download the table directly via SSMS slightly faster (still slow due to table size) and then import the file locally. For the fastest speed you likely have to look into the Bulk import and export functionality of SQL-server.
2: If I make any changes to this data in R does it change anything in my MyTable_in_SQL?
No: R has no internal pointer/connection once the table has been loaded. I don't even believe a package exists (in R at least) that opens a stream to the table which could dynamically update the table. I know that a functionality like this exists in Excel, but even using this has some dangerous side effects and should (in my opinion) only be used in read-only applications, where the user wants to see a (almost) live-stream of the data.
3: How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
To avoid this, simply save the table after every session. Whenever you close Rstudio it will ask you if you want to save your current session, and here you may click yes, at which point it will save .Rhistory and .Rdata in the getwd() directory, which will be imported the next time you open your session (unless you changed your working directory before closing the session using setwd(...). However I highly suggest you do not do this for larger datasets, as it will cause your R session to take forever to open the next time you open R, as well as possibly creating unnecessary copies of your data (for example if you import it into df and make a transformation in df2 then you will suddenly have 2 copies of a 2.6+ GiB dataset to load every time you open R). Instead I highly suggest saving the file using arrow::write_parquet(df, file_path), which is a much (and I mean MUCH!!) faster alternative to saving as RDS or csv files. These can't be opened as easily in Excel, but can be opened in R using arrow::read_parquet and python using pandas.read_parquet or pyarrow.parquet.read_parquet, while being compressed to a size that is usually 50 - 80 % smaller than the equivalent csv file.
Note:
If you already did save your R session after loading in the file, and you experience a very slow startup, I suggest removing the .RData file from your working directory. Usually the documents folder (C:/Users/[user]/Documents) from your system.
On question 2 you're correct, any changes in R won't change anything in the DB.
About question 3, you can save.image() or save.image('path/image_name.Rdata') and it will save your environment so you can recover it later on another session with load.image('path/image_name.Rdata').
Maybe with this you don't need a faster way to get data from a DB.
I'm facing a pretty expected problem while I'm running irritatingly the below code which creates all possible combinations for a specified sequence and then it stores them in the final.grid variable. The thing is that there is no only one sequence but about hundred of thousands of them and each one could have enough combinations.
for()...
combs = get.all.combs(sequence)
final.grid = rbind(final.grid, combs)
Anyway. Tried to run my code in a windows PC with 4GB RAM and after 4 hours (not even half of the combinations being calculated) R returned this error
Error: cannot allocate vector of size 4.0 Gb
What i was though as solution is to write after each iteration the final.grid to a file , free the allocated memory and continue. The truth is that I have not experience on such implementations with R and I don't know which solution to choose and if there are some of them that will do better and more efficiently. Have in mind that probably my final grid will need some GBs.
Somewhere in the stack exchange I read about ff package but there was not enough discussion on the subject (at least I didn't found it) and preferred to ask here for your opinions.
Thanks
I cannot understand very well your question, because the piece of code that you put is not clear to figure it out your problem.
But, you can try saving your results as .RData or .nc files, depending on the nature of your data. However, it could be better if you are more explicit about your problem, for instance showing what code is behind get.all.combs function or sequence data.
One thing you can try is the memory.limit() function to see if you can allocate enough memory for your work. This may not work if your Windows OS is 32 bit.
If you have large data object that you don't need for some parts of your program, you could first save them, and them remove using 'rm', and when you need them again you can load the objects.
The link below has more info that could be useful to you.
Increasing (or decreasing) the memory available to R processes
EDIT:
You can use object.size function to see memory requirement for objects you have. If they are too big, try loading them only when you need them.
It is possible one of the functions you use try to allocate more memory than you have. See if you can try to find where exactly the program crashes.
I have 2 moderate-size datasets that I am using in R. I want to check one dataset if its referenece number matches with the reference numbers in the other dataset and if so, allot a column in the second dataset which contains the value present in the column in the other dataset.
ghi2$state=ifelse(b1$accntnumber %in% ghi2$referencenumber,b1$address,0)
Every time I am running this code, my RStudio hangs up and is unresponsive for a long time. Is it because its taking the time to process the command or is my command wrong.
I am using a 2GB RAM system so I think R hangs up. Should I use the == operator instead of %in%? Would I get the same result?
1. Should I use the == operator instead of %in%?
No (!). See #2.
2. Would I get the same result?
No. The order and position have to match with ==. Also, see #Akrun's comment.
3. How to make it faster and/or deal with RStudio freezing
If RStudio freezes you can save your log file info, send it to the RStudio team who will quickly respond, and also you could bring your log files here for help.
Beyond that, general Big Data rules apply. Here are some tips:
Try data.table
Try it on the command line instead of RStudio
Watch your Resource Monitor (or whatever you use to monitor resources) and observe the memory and CPU usage
If it's a RAM issue you can
a. use a cloud account to get more RAM
b. buy some more RAM (just sayin')
c. use 64-bit R and increase the RAM available to R to its max if it's not already
If it's a CPU issue you can consider parallelization
If any of these ID's are being repeated (and this makes sense in the context of your specific use-case) you can use unique to avoid redundant comparisons
There are lots of other tips you can find in pre-existing Big Data Q&A's on SO as well.
I have a simple analysis to be done. I just need to calculate the correlation of the columns (or rows ,if transposed). Simple enough? I am unable to get the results for the whole week and I have looked through most of the solutions here.
My laptop has a 4GB RAM. I do have access to a server with 32 nodes. My data cannot be loaded here as it is huge (411k columns and 100 rows). If you need any other information or maybe part of the data I can try to put it up here, but the problem can be easily explained without really having to see the data. I simply need to get a correlation matrix of size 411k X 411k which means I need to compute the correlation among the rows of my data.
Concepts I have tried to code: (all of them in some way give me memory issues or run forever)
The most simple way, one row against all, write the result out using append.T. (Runs forever)
biCorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using ff matrix. (unable to allocate memory to assign the corMAT matrix using ff() in my server)
split the data into sets (every 10000 continuous rows will be a set) and do correlation of each set against the other (same logic as bigcorPar) but I am unable to find a way to store them all together finally to generate the final 411kX411k matrix.
I am attempting this now, bigcorPar.r on 10000 rows against 411k (so 10000 is divided into blocks) and save the results in separate csv files.
I am also attempting to run every 1000 vs 411k in one node in my server and today is my 3rd day and I am still on row 71.
I am not an R pro so I could attempt only this much. Either my codes run forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in the binary format (economical compared to text) it would take 411,000 x 411,000 x 8 bytes = 1.3 TB. If this what you want and you are OK with the storage required for that, I can provide my code for such calculations and storage.
I am not sure there are any R users out there, but just in case:
I am a novice at R and was kindly "handed down" the following R code snippet:
Beta <- exp(as.matrix(read.table('beta.transpose')))
WordFreq <- read.table('freq-matrix')
WordProbs <- WordFreq$V1 / sum(WordFreq)
infile <- file('freq-matrix')
outfile <- file('doc_topic_prob_matrix', 'w')
open(infile)
open(outfile)
for (i in 1:93049) {
vec <- t(scan(infile, nlines=1))
topics <- (vec/WordProbs) %*% Beta
write.table(topics, outfile, append=T, row.names=F, col.names=F)
}
When I tried running this on my dataset, the system thrashed and swapped like crazy. Now I realize that has a simple reason: the file freq-matrix holds a large (22GB) matrix and I was trying to read it into memory.
I have been told to use the Matrix package, because freq-matrix has many, many zeros all over the place and it handles such cases well. Will that help? If so, any hints on how to change this code would be most welcome. I have no R experience and just started reading through the introduction PDF available on the site.
Many thanks
~l
My suggestion might be completely off, because you don't give enough details about the contents of your files, and I had to guess from the code. Anyway, here it goes.
You don't state it, but I would assume that your code crashes on the second line, when you read in the big matrix. The loop reads the lines one-at-a-time, and should not crash. The only reason you need that big matrix is to calculate the WordProbs vector. So why don't you rewrite that part using the same looping using scan? In fact, you could probably don't even need to store the WordProbs vector, just sum(WordFreq) - you can get that using an initial run through hte file. Then rewrite the formula within the loop to calculate the current WordProb.
Belated answer, but I'd recommend reading the data into a memory mapped file, using the bigmemory package. After that, I'd look for the non-zero entries, which can then be represented as a 3 column matrix: (ix_row, ix_col, value). This is called a coordinate object list (COO), though the name is unimportant. From there, Matrix supports the creation of sparse matrices (via sparseMatrix). After you get the COO, you're pretty much set - conversion to and from the sparse matrix format is reasonably fast. Multiplying the matrix by Beta should be reasonably fast. If you need even greater speed, you could use an optimized BLAS library, but that opens up more questions. :)