Plot data from SparkR DataFrame - r

I have an avro file which I am reading as follows:
avroFile <- read.df(sqlContext, "avro", "com.databricks.spark.avro")
This file has lat/lon columns, but I am not able to plot them like a regular data frame.
Neither am I able to access the column using the '$' operator.
ex.
avroFile$latitude
Any help with avro files and operating on them in R is appreciated.

If you want to use ggplot2 for plotting, try ggplot2.SparkR. This package lets you pass a SparkR DataFrame directly to the ggplot() function call.
https://github.com/SKKU-SKT/ggplot2.SparkR

You won't be able to plot it directly. A SparkR DataFrame is not compatible with functions that expect a data.frame as input. It is not even a data structure in the strict sense, but simply a recipe for how to process the input data. It is materialized only when you execute an action.
If you want to plot it, you'll have to collect it first. Beware that this fetches all the data to the local machine, so it is typically something you want to avoid on the full data set.

As zero323 mentioned, you cannot currently run R visualizations on distributed SparkR DataFrames. You can run them on local data.frames. Here is one way to make a new DataFrame with just the columns you want to plot, and then collect a random sample of it into a local data.frame that you can plot from:
latlong <- select(avroFile, avroFile$latitude, avroFile$longitude)
latlongsample <- collect(sample(latlong, FALSE, .1))
plot(latlongsample)
the signature for sample method is:
sample(x, withReplacement, fraction, seed)
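Once the sample has been collected, it is an ordinary local data.frame, so the '$' operator and plot() behave as usual. A minimal sketch of that last step, using base R only (no Spark) and made-up coordinates. The latlongsample data here is a hypothetical stand-in for the collected result, and the local downsampling only roughly mimics what SparkR's sample does with fraction = 0.1:

```r
# Hypothetical stand-in for a collected result: a plain local data.frame
set.seed(42)
latlongsample <- data.frame(
  latitude  = runif(1000, min = -90,  max = 90),
  longitude = runif(1000, min = -180, max = 180)
)

# Keep about 10% of the rows, without replacement (local analogue only)
keep <- sample(nrow(latlongsample), size = round(0.1 * nrow(latlongsample)))
local_sample <- latlongsample[keep, ]

# `$` works on a local data.frame
head(local_sample$latitude)

# Longitude on the x axis, latitude on the y axis
plot(local_sample$longitude, local_sample$latitude,
     xlab = "longitude", ylab = "latitude")
```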

Related

Explaining Simple Loop in R

I successfully wrote a for loop in R. That is okay and I am very happy that it works. But I also want to understand what I've done exactly because I will have to work with loops later on in my analysis as well.
I work with Raster Data (DEMs). I load them into the environment as rasters and then I use the getValues function in the loop as I want to do some calculations. Looks as follows:
list <- dir(pattern=".tif", full.names=T)
tif.files <- list()
tif.files.values <- tif.files
for (i in 1: length(list)){
tif.files[[i]] <- raster (list[[i]])
tif.files.values[[i]] <- getValues(tif.files[[i]])
}
Okay, so far so good. I don't get why I have to specify tif.files and tif.files.values before I use them in the loop and I don't know why to specify them exactly how I did that. For the first part, the raster operation, I had a pattern. Maybe someone can explain the context. I really want to understand R.
When you do:
tif.files[[i]] <- raster (list[[i]])
then tif.files[[i]] is the result of running raster(list[[i]]), so that is storing the raster object. This object contains the metadata (extent, number of rows, cols etc) and the data, although if the tiff is huge it doesn't actually read it in at the time.
tif.files.values[[i]] <- getValues(tif.files[[i]])
that line calls getValues on the raster object, which reads the values from the raster and returns a vector. The values of the grid cells are now in tif.files.values[[i]].
Experiment by printing tif.files[[1]] and tif.files.values[[1]] at the R prompt.
Note
This is R, not RStudio, which is the interface you are using that has all the buttons and menus. The R language exists quite happily without it, and your question is just a language question. I've edited and tagged it now for you.
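On the question of why tif.files and tif.files.values have to exist before the loop: an indexed assignment such as x[[i]] <- value modifies an existing object, so the object has to be created first, and an empty list() is enough. A small sketch with made-up names (squares and not_created are illustrative only, not part of the original code):

```r
# Create the empty container first, then fill it by index
squares <- list()
for (i in 1:3) {
  squares[[i]] <- i^2
}
length(squares)

# Assigning by index into a name that was never created fails,
# because there is no object to modify
result <- tryCatch({
  not_created[[1]] <- 1
  "no error"
}, error = function(e) "error")
result
```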

H2O-R: Apply custom library function on each row of H2OFrame

After importing a relatively big table from MySQL into H2O on my machine, I tried to run a hashing algorithm (murmurhash from the R digest package) on one of its columns and save it back to H2O. As I found out, using as.data.frame on a H2OFrame object is not always advised: originally my H2OFrame is ~43k rows large, but the coerced DataFrame contains usually only ~30k rows for some reason (the same goes for using base::apply/base::sapply/etc on the H2OFrame).
I found out there is an apply function used for H2OFrames as well, but as I see, it can only be used with built-in R functions.
So, for example my code would look like this:
data[, "subject"] <- h2o::apply(data[, "subject"], 2, function(x)
digest(x, algo = "murmur32"))
I get the following error:
Error in .process.stmnt(stmnt, formalz, envs) :
Don't know what to do with statement: digest
I understand the fact that only the predefined functions from the Java backend can be used to manipulate H2O data, but is there perhaps another way to use the digest package from the client side without converting the data to DataFrame? I was thinking that in the worst case, I will have to use the R-MySQL driver to load the data first, manipulate it as a DataFrame and then upload it to the H2O cloud. Thanks for help in advance.
Due to the way H2O works, it cannot support arbitrary user-defined functions applied to H2OFrames the way that you can apply any function to a regular R data.frame. We already use the Murmur hash function in the H2O backend, so I have added a JIRA ticket to expose it to the H2O R and Python APIs. What I would recommend in the meantime is to copy just the single column of interest from the H2O cluster into R, apply the digest function and then update the H2OFrame with the result.
The following code will pull the "subject" column into R as a 1-column data.frame. You can then use the base R apply function to apply the murmur hash to every row, and lastly you can copy the resulting 1-column data.frame back into the "subject" column in your original H2OFrame, called data.
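library(digest)  # provides digest()
sub <- as.data.frame(data[, "subject"])  # pull the single column into R
subhash <- apply(sub, 1, digest, algo = "murmur32")  # hash each row
data[, "subject"] <- as.h2o(subhash)  # copy the result back into the H2OFrame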
Since you only have 43k rows, I would expect that you'd still be able to do this with no issues on even a mediocre laptop since you are only copying a single column from the H2O cluster to R memory (rather than the entire data frame).

Efficient way to review formulas that generate named objects in R

If I have a named object (in my case a named plot) in R, is there an efficient way to double check the formula that generated it? As of now I am scrolling back through the console, but I'm hoping that there is a more efficient way.
For example, at the start of my project I input
Boxplot <- ggplot(plotting input) + geom_boxplot(plotting input)
Now I can call Boxplot by name to plot it, but I want to be able to efficiently review my ggplot input. Is there a tool to do this?
For your example, you can see the elements of Boxplot using:
names(Boxplot)
So you can see, for example, the input data using:
Boxplot$data
Or the parameters and type of the plot using:
Boxplot$layers

Extracting point data from a large shape file in R

I'm having trouble extracting point data from a large shape file (916.2 Mb, 4618197 elements - from here: https://earthdata.nasa.gov/data/near-real-time-data/firms/active-fire-data) in R. I'm using readShapeSpatial in maptools to read in the shape file which takes a while but eventually works:
worldmap <- readShapeSpatial("shp_file_name")
I then have a data.frame of coordinates that I want to extract data for. However, R is really struggling with this and either loses connection or freezes, even with just one set of coordinates!
pt <-data.frame(lat=-64,long=-13.5)
pt<-SpatialPoints(pt)
e<-over(pt,worldmap)
Could anyone advise me on a more efficient way of doing this?
Or is it the case that I need to run this script on something more powerful (currently using a mac mini with 2.3 GHz processor)?
Many thanks!
By 'point data' do you mean the longitude and latitude coordinates? If that's the case, you can obtain the data underlying the shapefile with:
worldmap@data
You can view this in the same way you would any other data frame, for example:
View(worldmap@data)
You can also access columns in this data frame in the same way you normally would, except you don't need the @data, e.g.:
worldmap$LATITUDE
Finally, it is recommended to use readOGR from the rgdal package rather than maptools::readShapeSpatial as the former reads in the CRS/projection information.

CSV file to Histogram in R

I'm a total newbie with R, and I'm trying to create a histogram (with value and frequency as the axes) from a csv file (just one row of values). Any idea how I can do this?
I'm also an R newbie, and I ran into the same thing. I made two separate mistakes, actually, so I'll describe them both here.
Mistake 1: Passing a frequency table to hist(). Originally I was trying to pass a frequency table to hist() instead of passing in the raw data. One way to fix this is to use the rep() ("replicate") function to explode your frequency table back into a raw dataset, as described here:
Creating a histogram using aggregated data
Simple R (histogram) from counted csv file
Instead of that, though, I just decided to read in my original dataset instead of the frequency table.
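For reference, the rep() approach mentioned above can be sketched as follows. The values and counts are made up for illustration:

```r
# Made-up frequency table: each value and how often it occurred
values <- c(1, 2, 3, 4)
counts <- c(2, 5, 3, 1)

# Repeat each value by its count to recover the raw observations
raw <- rep(values, times = counts)

# hist() now sees raw data rather than a frequency table
hist(raw, breaks = seq(0.5, 4.5, by = 1))
```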
Mistake 2: Wrong data type. My raw data CSV file contains two columns: hostname and bookings (idea is to count the number of bookings each host generated during some given time period). I read it into a table.
> tbl <- read.csv('bookingsdata.csv')
Then when I tried to generate a histogram off the second column, I did this:
> hist(tbl[2])
This gave me the "'x' must be numeric" error you mention in a comment. (tbl[2] returns a one-column data frame, not a numeric vector, and hist() needs a numeric vector.)
This fixed it:
> hist(tbl$bookings)
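The difference between the two calls comes down to what each expression returns, which is easy to check on a small table. A sketch with made-up data standing in for bookingsdata.csv:

```r
# Made-up stand-in for the contents of the CSV file
tbl <- data.frame(
  hostname = c("a", "b", "c", "d"),
  bookings = c(3, 7, 7, 12)
)

class(tbl[2])         # single brackets keep the data.frame wrapper
class(tbl[[2]])       # double brackets extract the column itself
class(tbl$bookings)   # `$` extracts the column by name

# Either of the extracted vectors can be passed to hist()
hist(tbl$bookings)
```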
You should really start to read some basic R manual...
CRAN offers a lot of them (look into the Manuals and Contributed sections)
In any case:
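setwd("path/to/csv/file")
# header = FALSE, since the file is just one row of values with no header line
myvalues <- read.csv("filename.csv", header = FALSE)
# read.csv returns a data frame; hist() needs a numeric vector
hist(as.numeric(unlist(myvalues[1, ])), breaks = 100) # Example: 100 breaks, but you can specify them at will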
See the manual pages for those functions for more help (accessible through ?read.table, ?read.csv and ?hist).
To plot a histogram, the values must be numeric. Here x seems to be of some other class.
Run the following command and see:
sapply(myvalues[1,],class)
