After importing a relatively big table from MySQL into H2O on my machine, I tried to run a hashing algorithm (murmurhash from the R digest package) on one of its columns and save it back to H2O. As I found out, using as.data.frame on an H2OFrame object is not always advisable: my original H2OFrame has ~43k rows, but the coerced data.frame usually contains only ~30k rows for some reason (the same happens when using base::apply, base::sapply, etc. on the H2OFrame).
I found out there is an apply function for H2OFrames as well, but as far as I can see, it can only be used with built-in R functions.
So, for example my code would look like this:
data[, "subject"] <- h2o::apply(data[, "subject"], 2, function(x)
digest(x, algo = "murmur32"))
I get the following error:
Error in .process.stmnt(stmnt, formalz, envs) :
Don't know what to do with statement: digest
I understand that only the predefined functions from the Java backend can be used to manipulate H2O data, but is there perhaps another way to use the digest package from the client side without converting the data to a data.frame? I was thinking that, in the worst case, I will have to use the R-MySQL driver to load the data first, manipulate it as a data.frame and then upload it to the H2O cloud. Thanks in advance for any help.
Due to the way H2O works, it cannot support arbitrary user-defined functions applied to H2OFrames the way that you can apply any function to a regular R data.frame. We already use the Murmur hash function in the H2O backend, so I have added a JIRA ticket to expose it to the H2O R and Python APIs. What I would recommend in the meantime is to copy just the single column of interest from the H2O cluster into R, apply the digest function and then update the H2OFrame with the result.
The following code will pull the "subject" column into R as a 1-column data.frame. You can then use the base R apply function to apply the murmur hash to every row, and lastly you can copy the resulting 1-column data.frame back into the "subject" column in your original H2OFrame, called data.
sub <- as.data.frame(data[, "subject"])
subhash <- apply(sub, 1, digest, algo = "murmur32")
data[, "subject"] <- as.h2o(subhash)
Since you only have 43k rows, I would expect that you'd still be able to do this with no issues on even a mediocre laptop since you are only copying a single column from the H2O cluster to R memory (rather than the entire data frame).
Related
I am new to Stack Overflow and have tried many ways to solve this error without any success. My problem: I CAN convert subsets of an R dataframe to a Spark DataFrame, but not the whole dataframe. Similar, but not identical, questions include:
Not able to to convert R data frame to Spark DataFrame and
Is there any size limit for Spark-Dataframe to process/hold columns at a time?
Here is some information about the R dataframe:
library(SparkR)
sparkR.session()
sparkR.version()
[1] "2.4.3"
dim(df)
[1] 101368 25
class(df)
[1] "data.frame"
When converting this to a Spark Dataframe:
sdf <- as.DataFrame(df)
Error in handleErrors(returnStatus, conn) : Error in handleErrors(returnStatus, conn) :
Error in handleErrors(returnStatus, conn) :
However, when I subset the R dataframe, it does NOT result in an error:
sdf_sub1 <- as.DataFrame(df[c(1:50000), ])
sdf_sub2 <- as.DataFrame(df[c(50001:101368), ])
class(sdf_sub1)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
class(sdf_sub2)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
How can I write the whole dataframe to a Spark DataFrame? (I want to saveAsTable afterwards).
I was thinking about a problem with capacity but I do not have a clue how to solve it.
Thanks a lot!!
In general you'll see poor performance when converting from R dataframes to Spark dataframes, and vice versa. Objects are represented differently in memory in Spark and R, and the object size expands significantly when converting from one to the other. This often exhausts the driver's memory, making it difficult to copy/collect large objects to/from Spark. Fortunately, you have a couple of options.
Use Apache Arrow to establish a common in-memory format for objects, eliminating the need to copy and convert from the representation in R to the one in Spark. The link I provided has instructions on how to set this up on Databricks.
Write the dataframe to disk as parquet (or CSV) and then read it into Spark directly. You can use the arrow library in R to do this.
Increase the size of your driver node to accommodate the memory expansion. On Databricks you can select the driver node type (or ask your admin to do it) for your cluster - make sure you pick one with a lot of memory. For reference, I tested collecting a 2GB dataset and needed a 30GB+ driver. With arrow that comes down dramatically.
Anecdotally, there is a memory-dependent limit on the size of table that SparkR will convert from DataFrame to data.table. It is also far smaller than I would have expected, around 50,000 rows for my work.
I had to convert some very large data.tables to DataFrames and ended up writing a script to chunk them into smaller pieces to get around this. Initially I chose to chunk the data by n rows, but the error returned when a very wide table was converted. My workaround was to limit the number of elements (rows times columns) converted per chunk.
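The element-based chunking described above might look roughly like the sketch below. The helper name chunk_by_elements and the max_elements default are hypothetical (they are not from the original script), and the SparkR calls are left as comments because they assume an attached SparkR session.

```r
# Hypothetical helper: split a data.frame into row chunks so that each chunk
# holds at most `max_elements` cells (rows x columns).
chunk_by_elements <- function(df, max_elements = 1e6) {
  stopifnot(is.data.frame(df), ncol(df) > 0, max_elements >= ncol(df))
  if (nrow(df) == 0) return(list(df))
  # Wider tables get fewer rows per chunk.
  rows_per_chunk <- max(1L, as.integer(floor(max_elements / ncol(df))))
  chunk_id <- (seq_len(nrow(df)) - 1L) %/% rows_per_chunk
  unname(split(df, chunk_id))
}

# Small local demonstration (no Spark needed): 3 columns and a budget of
# 12 elements gives 4 rows per chunk.
demo <- data.frame(id = 1:10, value = (1:10) * 2, flag = rep(c(TRUE, FALSE), 5))
chunks <- chunk_by_elements(demo, max_elements = 12)
sapply(chunks, nrow)

# Assumption: with SparkR attached, each chunk could then be converted and
# the pieces combined by row, for example:
# sdf_list <- lapply(chunks, as.DataFrame)
# sdf <- do.call(rbind, sdf_list)
```

A suitable element budget depends on the driver's memory, so it would need to be found by experiment.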
I'm sure this will be very easy as I'm still an R beginner but here goes...
I've started with a data frame which I've successfully put through lapply-split followed by rbindlist to regenerate as a dataframe.
From this same data set, I've subset some data and performed lapply-split followed by rbindlist and get the following error:
"Error in rbindlist(df) : Item 1 of list input is not a data.frame,
data.table or list"
This is confusing since it's the same (sub)set of data being split by the same parameter.
When I call:
df[1]
I get:
$SWS1Ami
[1] 13451.02
which is the mean value I wanted to calculate for the SWS1Ami group (so it seems to have done the lapply split correctly). When I call:
typeof(df[1])
I see it tells me this element(?) type is a list.
Two questions:
(1) What could cause rbindlist to not work after doing lapply-split? Why does this seem to sometimes work and sometimes not work?
(2) Is there a quick litmus test to tell if your dataframe is in the "right" setup to undergo lapply-split-rbindlist?
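As a possible litmus test, one could check whether every item of the list is a data.frame (or at least a list), since that is what the error message says rbindlist requires. The sketch below uses made-up values; only the group name SWS1Ami comes from the question, and do.call(rbind, ...) stands in for rbindlist so that it runs without data.table.

```r
# Hypothetical data: the values are invented for illustration.
dat <- data.frame(
  group = c("SWS1Ami", "SWS1Ami", "Other", "Other"),
  value = c(10, 20, 1, 3)
)

# Returning a bare number per group yields a list of numeric vectors,
# which matches the `$SWS1Ami [1] 13451.02` shape shown in the question.
means_list <- lapply(split(dat$value, dat$group), mean)

# Litmus test: is every item a data.frame or a list?
ok_items <- vapply(
  means_list,
  function(item) is.data.frame(item) || is.list(item),
  logical(1)
)
all(ok_items)

# One way to make each item bindable: return a one-row data.frame per group.
rows_list <- lapply(split(dat, dat$group), function(piece) {
  data.frame(group = piece$group[1], mean_value = mean(piece$value))
})
combined <- do.call(rbind, rows_list)  # data.table::rbindlist(rows_list) accepts the same list
```

Note that typeof(df[1]) reports "list" for any list because single brackets return a sub-list; inspecting df[[1]] shows the type of the item itself.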
I have an avro file which I am reading as follows:
avroFile <- read.df(sqlContext, "avro", "com.databricks.spark.avro")
This file has lat/lon columns, but I am not able to plot them like a regular dataframe.
Neither am I able to access the column using the '$' operator.
ex.
avroFile$latitude
Any help regarding avro files and operations on them using R is appreciated.
If you want to use ggplot2 for plotting, try ggplot2.SparkR. This package allows you to pass a SparkR DataFrame directly as input to a ggplot() function call.
https://github.com/SKKU-SKT/ggplot2.SparkR
And you won't be able to plot it directly. A SparkR DataFrame is not compatible with functions that expect a data.frame as input. It is not even a data structure in the strict sense, but simply a recipe for how to process the input data. It is materialized only when you execute an action.
If you want to plot it, you'll have to collect it first. Beware that this fetches all the data to the local machine, so it is typically something you want to avoid on a full data set.
As zero323 mentioned, you cannot currently run R visualizations on distributed SparkR DataFrames. You can run them on local data.frames. Here is one way you could make a new DataFrame with just the columns you want to plot, and then collect a random sample of them into a local data.frame that you can plot from:
latlong <- select(avroFile, avroFile$latitude, avroFile$longitude)
latlongsample <- collect(sample(latlong, FALSE, .1))
plot(latlongsample)
The signature of the sample method is:
sample(x, withReplacement, fraction, seed)
I have a dataset that looks like this, except it's much longer and with many more values:
dataset <- data.frame(grps = c("a","b","c","a","d","b","c","a","d","b","c","a"), response = c(1,4,2,6,4,7,8,9,4,5,0,3))
In R, I would like to remove all rows containing the values "b" or "c" using a vector of values to remove, i.e.
remove<-c("b","c")
The actual dataset is very long with many hundreds of values to remove, so removing values one-by-one would be very time consuming.
Try:
dataset[!(dataset$grps %in% remove),]
There's also subset:
subset(dataset, !(grps %in% remove))
... which is really just a wrapper around [ that lets you skip writing dataset$ over and over when there are multiple subset criteria. But, as the help page warns:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
‘[’, and in particular the non-standard evaluation of argument
‘subset’ can have unanticipated consequences.
I've never had any problems, but the majority of my R code is scripting for my own use with relatively static inputs.
2013-04-12
I have now had problems. If you're building a package for CRAN, R CMD check will throw a NOTE if you use subset in this way in your code - it will wonder if grps is a global variable, even though subset is evaluating it within dataset's environment (not the global one). So if there's any possibility your code will end up in a package and you feel squeamish about NOTEs, stick with Rcoster's method.
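For reference, here is a small check of the two approaches above against the sample data from the question. The variable names kept_bracket and kept_subset are only illustrative.

```r
dataset <- data.frame(
  grps = c("a", "b", "c", "a", "d", "b", "c", "a", "d", "b", "c", "a"),
  response = c(1, 4, 2, 6, 4, 7, 8, 9, 4, 5, 0, 3)
)
remove <- c("b", "c")

# Standard subsetting with [ : safe to use inside functions and packages.
kept_bracket <- dataset[!(dataset$grps %in% remove), ]

# Interactive convenience: subset() evaluates grps inside dataset.
kept_subset <- subset(dataset, !(grps %in% remove))

nrow(kept_bracket)                       # 6 rows remain (groups "a" and "d")
unique(as.character(kept_bracket$grps))
all.equal(kept_bracket, kept_subset)
```

Both calls keep the original row names, so the surviving rows can still be traced back to their positions in dataset.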
The arules package in R uses the class 'transactions'. So in order to use the function apriori() I need to convert my existing data. I've got a matrix with 2 columns and roughly 1.6 million rows and tried to convert the data like this:
transaction_data <- as(split(original_data[,"id"], original_data[,"type"]), "transactions")
where original_data is my data matrix. Because of the amount of data, I used the largest AWS machine, with 64 GB of RAM. After a while I get
resulting vector exceeds vector length limit in 'AnswerType'
The memory usage of the machine was still 'only' at 60%. Is this an R-based limitation? Is there any way to work around this other than using sampling? When using only 1/4 of the data, the transformation worked fine.
Edit: As pointed out, one of the variables was a factor instead of character. After changing it to character, the transformation ran quickly and correctly.
I suspect that your problem is arising because one of the functions uses integers (rather than, say, floats) to index values. In any case, the size isn't too big, so this is surprising. Maybe the data has some other issue, such as characters as factors?
In general, though, I'd really recommend using memory mapped files, via bigmemory, which you can also split and process via bigsplit or mwhich. If offloading the data works for you, then you can also use a much smaller instance size and save $$. :)
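The factor-versus-character issue raised above, and confirmed in the question's edit, can be checked and fixed before the conversion along these lines. This is a minimal sketch on a toy stand-in for original_data; the values are invented, and the arules call is left as a comment because it assumes that package is installed.

```r
# Toy stand-in for original_data (the real data has far more rows).
original_data <- data.frame(
  id   = factor(c("i1", "i2", "i1", "i3")),
  type = c("t1", "t1", "t2", "t2"),
  stringsAsFactors = FALSE
)

# Quick diagnostic: which columns are factors?
is_fac <- vapply(original_data, is.factor, logical(1))
is_fac

# Convert every factor column to character, leaving other columns untouched.
original_data[is_fac] <- lapply(original_data[is_fac], as.character)

# Same split as in the question, now on character data.
item_lists <- split(original_data[, "id"], original_data[, "type"])

# Assumption: with arules loaded, the conversion from the question would be
# transaction_data <- as(item_lists, "transactions")
```

Running str(original_data) before and after the conversion is another quick way to confirm the column types.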