Load NPZ sparse matrix in R

How can I read a sparse matrix that I have saved with Python as a *.npz file into R? I already came across two answers* on Stack Overflow, but neither seems to do the job in my case.
The data set was created with Python from a Pandas data frame via:
scipy.sparse.save_npz(
    "data.npz",
    scipy.sparse.csr_matrix(DataFrame.values)
)
It seems like the first steps for importing the data set in R are as follows.
library(reticulate)
np = import("numpy")
npz1 <- np$load("data.npz")
However, this does not yield a data frame yet.
*1 Load sparse NumPy matrix into R
*2 Reading .npz files from R

I cannot access your dataset, so I can only speak from experience. When I try loading a sparse CSR matrix with numpy, it does not work; the class of the resulting object is numpy.lib.npyio.NpzFile, which I can't use in R.
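For illustration, inspecting the object from your own snippet gives something along these lines (the exact class strings may vary with your numpy/reticulate versions):
class(npz1)
# [1] "numpy.lib.npyio.NpzFile" "python.builtin.object"
# i.e. an NpzFile handle, not a matrix that R's Matrix package can work with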
The way I found to import the matrix into an R object, as has been said in a post you've linked, is to use scipy.sparse.
library(reticulate)
scipy_sparse = import("scipy.sparse")
csr_matrix = scipy_sparse$load_npz("path_to_your_file")
csr_matrix, which was a scipy.sparse.csr_matrix object in Python (Compressed Sparse Row matrix), is automatically converted into a dgRMatrix from the R package Matrix. Note that if you had used scipy.sparse.csc_matrix in Python, you would get a dgCMatrix (Compressed Sparse Column matrix). The function doing the hard work of converting the Python object into something R can use is py_to_r.scipy.sparse.csr.csr_matrix, from the reticulate package.
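For example, you can confirm the conversion right after loading (illustrative output):
class(csr_matrix)
# [1] "dgRMatrix"
# attr(,"package")
# [1] "Matrix"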
If you want to convert the dgRMatrix into a data frame, you can simply use
df <- as.data.frame(as.matrix(csr_matrix))
although this might not be the best thing to do memory-wise if your dataset is big.
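If memory is tight, here is a sketch of a sparser alternative, assuming a long-format table of only the non-zero entries is enough for you (summary() here is the Matrix package's triplet view of a sparse matrix):
library(Matrix)
trip <- as.data.frame(summary(csr_matrix))  # columns i, j, x: row, column, value of each non-zero cell
head(trip)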
I hope this helped!

Related

Is there a size limit in Databricks for converting an R dataframe to a Spark dataframe?

I am new to Stack Overflow and have tried many ways to solve the error, but without any success. My problem: I CAN convert subsets of an R dataframe to a Spark dataframe, but not the whole dataframe. Similar questions, but not the same, include:
Not able to to convert R data frame to Spark DataFrame and
Is there any size limit for Spark-Dataframe to process/hold columns at a time?
Here some information about the R dataframe:
library(SparkR)
sparkR.session()
sparkR.version()
[1] "2.4.3"
dim(df)
[1] 101368 25
class(df)
[1] "data.frame"
When converting this to a Spark DataFrame:
sdf <- as.DataFrame(df)
Error in handleErrors(returnStatus, conn) :
Error in handleErrors(returnStatus, conn) :
Error in handleErrors(returnStatus, conn) :
However, when I subset the R dataframe, it does NOT result in an error:
sdf_sub1 <- as.DataFrame(df[c(1:50000), ])
sdf_sub2 <- as.DataFrame(df[c(50001:101368), ])
class(sdf_sub1)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
class(sdf_sub2)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
How can I write the whole dataframe to a Spark DataFrame? (I want to saveAsTable afterwards).
I was thinking about a problem with capacity but I do not have a clue how to solve it.
Thanks a lot!!
In general you'll see poor performance when converting from R dataframes to Spark dataframes, and vice versa. Objects are represented differently in memory in Spark and R, and there is significant expansion of the object size when converting from one to the other. This often blows out the memory of the driver, making it difficult to copy/collect large objects to/from Spark. Fortunately, you have a few options.
Use Apache Arrow to establish a common in-memory format for objects, eliminating the need to copy and convert from the R representation to the Spark one. The link I provided has instructions on how to set this up on Databricks.
Write the dataframe to disk as parquet (or CSV) and then read it into Spark directly. You can use the arrow library in R to do this; a sketch of this route follows below.
Increase the size of your driver node to accommodate the memory expansion. On Databricks you can select the driver node type (or ask your admin to do it) for your cluster - make sure you pick one with a lot of memory. For reference, I tested collecting a 2GB dataset and needed a 30GB+ driver. With arrow that comes down dramatically.
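As a rough sketch of the parquet route mentioned above (the path and table name are placeholders; on Databricks the file must live somewhere the cluster can read, e.g. DBFS):
library(arrow)
write_parquet(df, "/dbfs/tmp/df.parquet")                  # write the local data.frame to disk

library(SparkR)
sdf <- read.df("/dbfs/tmp/df.parquet", source = "parquet") # read it back as a SparkDataFrame
saveAsTable(sdf, "my_table")                               # placeholder table name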
Anecdotally, there is a memory-dependent limit on the size of table that SparkR will convert from DataFrame to data.table. It is also far smaller than I would have expected, around 50,000 rows for my work.
I had to convert some very large data.tables to DataFrames and ended up making a script to chunk them into smaller pieces to get around this. Initially I chunked by a fixed number of rows, but when a very wide table was converted the error returned, so my work-around was to cap the number of elements (rows times columns) converted at a time.
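A rough sketch of the row-chunking variant of that idea (chunk_rows is just an illustrative knob, not a SparkR setting; tune it to what your driver tolerates):
chunk_rows <- 25000
starts <- seq(1, nrow(df), by = chunk_rows)
sdf <- NULL
for (s in starts) {
  e <- min(s + chunk_rows - 1, nrow(df))
  piece <- as.DataFrame(df[s:e, ])                       # convert one block at a time
  sdf <- if (is.null(sdf)) piece else union(sdf, piece)  # stack the SparkDataFrame pieces
}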

Explaining Simple Loop in R

I successfully wrote a for loop in R. That is okay and I am very happy that it works. But I also want to understand what I've done exactly because I will have to work with loops later on in my analysis as well.
I work with raster data (DEMs). I load the files into the environment as rasters and then use the getValues function in the loop, since I want to do some calculations. It looks as follows:
list <- dir(pattern = ".tif", full.names = TRUE)
tif.files <- list()
tif.files.values <- tif.files
for (i in 1:length(list)) {
  tif.files[[i]] <- raster(list[[i]])
  tif.files.values[[i]] <- getValues(tif.files[[i]])
}
Okay, so far so good. What I don't get is why I have to define tif.files and tif.files.values before I use them in the loop, and why I have to define them exactly the way I did. For the first part, the raster operation, I had a pattern to follow. Maybe someone can explain the context; I really want to understand R.
When you do:
tif.files[[i]] <- raster (list[[i]])
then tif.files[[i]] is the result of running raster(list[[i]]), so that is storing the raster object. This object contains the metadata (extent, number of rows and columns, etc.) and the data, although if the TIFF is huge it doesn't actually read the data in at that point.
tif.files.values[[i]] <- getValues(tif.files[[i]])
That line calls getValues on the raster object, which reads the values from the raster and returns a vector. The values of the grid cells are now in tif.files.values[[i]].
Experiment by printing tif.files[[1]] and tif.files.values[[1]] at the R prompt.
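As for why the lists exist before the loop: an assignment like tif.files[[i]] <- ... needs an already existing list to index into, so you create empty ones first. A common variant (just an illustrative sketch of the same script) pre-allocates both lists to their final length up front:
files <- dir(pattern = "\\.tif$", full.names = TRUE)
tif.files        <- vector("list", length(files))   # empty list with one slot per file
tif.files.values <- vector("list", length(files))
for (i in seq_along(files)) {                        # assumes library(raster) is loaded, as in your script
  tif.files[[i]]        <- raster(files[[i]])
  tif.files.values[[i]] <- getValues(tif.files[[i]])
}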
Note
This is R, not RStudio, which is the interface you are using that has all the buttons and menus. The R language exists quite happily without it, and your question is just a language question. I've edited and tagged it now for you.

Convert Document Term Matrix (DTM) to Data Frame (R Programming)

I am a beginner with the R programming language and am currently working on a project.
There's a huge Document Term Matrix (DTM) that I would like to convert into a data frame.
However, due to the restrictions of the functions, I am not able to do so.
The method that I have been using is to first convert it into a matrix, and then convert it to data frame.
DF <- data.frame(as.matrix(DTM), stringsAsFactors=FALSE)
It was working perfectly with a smaller DTM. However, when the DTM is too large I am not able to convert it to a matrix, which yields the error shown below:
Error: cannot allocate vector of size 2409.3 Gb
Tried looking online for a few days however I am not able to find a solution.
Would be really thankful if anyone is able to suggest what is the best way to convert a DTM into a DF (especially when dealing with large size DTM).
In the tidytext package there is actually a function to do just that. Try using the tidy function, which will return a tibble (basically a fancy data frame that prints nicely). The nice thing about the tidy function is that it takes care of the pesky stringsAsFactors=FALSE issue by not converting strings to factors, and it deals nicely with the sparsity of your DTM.
as.matrix is trying to convert your DTM into a non-sparse matrix with an entry for every document and term, even if the term occurs 0 times in that document, which is causing your memory usage to balloon. tidy will convert it into a data frame where each document only has counts for the terms found in it.
In your example here you'd run
library(tidytext)
DF <- tidy(DTM)
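The result is a long-format tibble with roughly one row per non-zero (document, term) pair, which you can check with, for example:
head(DF)   # columns: document, term, count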
There's even a vignette on how to use the tidytext package (meant to work in the tidyverse) here.
It's possible that as.data.frame(as.matrix(DTM), stringsAsFactors=FALSE) instead of data.frame(as.matrix(DTM), stringsAsFactors=FALSE) might do the trick.
The API documentation notes that as.data.frame() simply coerces a matrix into a data frame, whereas data.frame() creates a new data frame from the input.
as.data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.data.frame.html
data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html

H2O-R: Apply custom library function on each row of H2OFrame

After importing a relatively big table from MySQL into H2O on my machine, I tried to run a hashing algorithm (murmurhash from the R digest package) on one of its columns and save the result back to H2O. As I found out, using as.data.frame on an H2OFrame object is not always advised: originally my H2OFrame has ~43k rows, but the coerced data.frame usually contains only ~30k rows for some reason (the same goes for using base::apply/base::sapply/etc. on the H2OFrame).
I found out there is an apply function that works on H2OFrames as well, but as far as I can see, it can only be used with built-in R functions.
So, for example my code would look like this:
data[, "subject"] <- h2o::apply(data[, "subject"], 2, function(x)
digest(x, algo = "murmur32"))
I get the following error:
Error in .process.stmnt(stmnt, formalz, envs) :
Don't know what to do with statement: digest
I understand that only the predefined functions from the Java backend can be used to manipulate H2O data, but is there perhaps another way to use the digest package from the client side without converting the data to a data.frame? I was thinking that, in the worst case, I will have to use the R-MySQL driver to load the data first, manipulate it as a data.frame and then upload it to the H2O cloud. Thanks in advance for the help.
Due to the way H2O works, it cannot support arbitrary user-defined functions applied to H2OFrames the way that you can apply any function to a regular R data.frame. We already use the Murmur hash function in the H2O backend, so I have added a JIRA ticket to expose it to the H2O R and Python APIs. What I would recommend in the meantime is to copy just the single column of interest from the H2O cluster into R, apply the digest function and then update the H2OFrame with the result.
The following code will pull the "subject" column into R as a 1-column data.frame. You can then use the base R apply function to apply the murmur hash to every row, and lastly you can copy the resulting 1-column data.frame back into the "subject" column in your original H2OFrame, called data.
library(digest)                                       # for the digest() murmur hash
sub <- as.data.frame(data[, "subject"])               # pull the single column into R
subhash <- apply(sub, 1, digest, algo = "murmur32")   # hash each row
data[, "subject"] <- as.h2o(subhash)                  # copy the result back into the H2OFrame
Since you only have 43k rows, I would expect that you'd still be able to do this with no issues on even a mediocre laptop since you are only copying a single column from the H2O cluster to R memory (rather than the entire data frame).

Plot data from SparkR DataFrame

I have an Avro file which I am reading as follows:
avroFile <- read.df(sqlContext, "avro", "com.databricks.spark.avro")
This file has lat/lon columns, but I am not able to plot them like a regular dataframe.
Neither am I able to access the column using the '$' operator.
ex.
avroFile$latitude
Any help regarding Avro files and operations on them using R is appreciated.
If you want to use ggplot2 for plotting, try ggplot2.SparkR. This package allows you to pass a SparkR DataFrame directly as input to the ggplot() function call.
https://github.com/SKKU-SKT/ggplot2.SparkR
And you won't be able to plot it directly. A SparkR DataFrame is not compatible with functions that expect a data.frame as input. It is not even a data structure in a strict sense, but simply a recipe for how to process the input data; it is materialized only when you execute an action.
If you want to plot it, you'll have to collect it first. Beware that this fetches all the data to the local machine, so it is typically something you want to avoid on the full data set.
As zero323 mentioned, you cannot currently run R visualizations on distributed SparkR DataFrames, but you can run them on local data.frames. Here is one way you could make a new DataFrame with just the columns you want to plot, and then collect a random sample of them into a local data.frame that you can plot from:
latlong <- select(avroFile, avroFile$latitude, avroFile$longitude)
latlongsample <- collect(sample(latlong, FALSE, 0.1))
plot(latlongsample)
The signature for the sample method is:
sample(x, withReplacement, fraction, seed)
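So the call above, written with named arguments, would be something like:
latlongsample <- collect(sample(latlong, withReplacement = FALSE, fraction = 0.1))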
