I'm working with spark using R API, and have a grasp on how data is processed from spark, either when only spark native functions are used in which cases it is transparent for the user or in cases where spark_apply() is used, where it is required to have a better understanding on how the partitions are handled.
My doubt is regarding to plots where no aggregation is done, for example, is my understanding that if a group by is used before a plot not all the data will be used. But if I need to make say a scatter plot with 100 million dots, where is that data stored at this point? is it still distributed between all nodes? or is it at one node only, if the later... with the cluster get frozen because of this?
I know you write that no aggregation is (should be?) done, but I'd wager that is precisely what you need and want to do. The point of distributed computing is largely that partial results are computed, well, distributed at each node. For very big data sets, each node (often) sees only a subset of the data.
In regards to the plotting: a scatter plot more that even a few thousand (not to mention a 100 million) points will contain a significant amount of overplotting. Either you 'fix' that by making the points transparent, you do a density estimate, or you do some binning of the data (e.g. a hexbin plot or a heatmap). The latter can be done distributed by the nodes and the plot. The returned binned results from each node can then be aggregated to a final results by the master node and be plotted.
Even if you somehow had a node making a scatter plot of 100 million points, what is your output format? Vector graphics (e.g. pdf/svg) would create a huge file. Raster graphics (e.g. jpg, png) will effectively aggregate on your behalf when the plot is rasterized -- so you might as well control that yourself with bins the size of pixels.
Related
before k-means clustering for consumer segmentation, I want to identify and delete outliers of my sample. I tried hierarchical clustering with single linkage algorithm. The problem is, I have a sample with more than 800 cases, and in my plot (single linkage dendrogram) the numbers are written across each other and therefore not readable, so it is impossible for me to clearly identify the outliers by just looking at the graph :-/
Here they say, you can create boxplots based on the branch distance to identify outliers in a more objective way. I thought that would be also a great way to just make the row numbers of the outliers in my dataset readable, however I am struggling with creating the boxplots..
https://link.springer.com/article/10.1186/s12859-017-1645-5/figures/3
Does anyone know, how to write the code to get the boxplots based on the height of the branches?
This is the code I use for clustering and attached you can see the plot
dr_dist<-dist(dr_ma_cluster[,c(148:154)])
hc_dr<-hclust(dr_dist,method = "single") #single linkage
plot(hc_dr,labels=(row.names(dr_ma_cluster)))
This is my failed trial to do the boxplot, as I don't know how to address the branch height
> boxplot(hc_dr)
Error in x[floor(d)] + x[ceiling(d)] :
non-numeric argument for binary operator
> boxplot(hc_dr[,c(148:154)])
Error in hc_dr[, c(148:154)] : Incorrect number of dimensions
And here another way to do the graph (and some automated outlier detection approach), but it makes the readability even worse with large datasets..
Another code to plot the tree, even less readable for large datasets:
Delete outliers automatically of a calculated agglomerative hierarchical clustering data
Thanks for any help!!
boxplot(hc_dr$height) as suggested by StupidWolf was the simple thing I was looking for.
Unfortunately I did not manage to label the outlier dots with the rownames from the original dataframe. Rownames from the branch height table were useless as they were assigned in ascending order.
hang = 0.0001 gave a better look to the dendrogram, but labels were still unreadable as still over eachother.
If anyone has a similar problem check R Shiny, zoomable dendrogram program
the code given there in the answer was super easy to adapt, resulting in a zoomable dendrogram, which makes it easy to identify the relevant cases (->outliers). for details search dendextendas proposed by csgroen.
Both together, the boxplot and this nice tool served to identify the rownames of the outliers after single linkage clustering in order to delete them before km means clustering
For example, this is a heatmap from a website using GPS data:
I have gotten some degree of success with adding a weight parameter to each vertex and calculating the number of events that have vertices near those, but that takes a long time, especially with a large amount of data. It also appears a bit spotty when the distance between vertices is a bit wonky, which causes random splotches of different colors throughout the heatmap. It looks kind of cool, but it makes the data a bit harder to read.
When you zoom out, it looks a bit more continuous due to the paths overlapping more.
In R, the closest I can do to this involves using an alpha channel, but that only gets me a monochromatic heatmap, which is not always desirable, especially when you want to see lesser-traveled paths visibly. In theory I could do two lines to resolve the visibility part (first opaque, second semi-transparent), but I would like to be able to have different hue values.
Ideally I would like this to work with ggplot, but if it cannot, I would accept other methods, provided they are reasonably quick computationally.
Edit: The data format is a data frame with sequential (latitude, longitude) coordinate pairs, along with some associated data that can be used for filter & grouping (such as activity type and event ID).
Here is a sample of the data for the region displayed in the above images (~1.5 MB):
https://www.dropbox.com/s/13p2jtz4760m26d/sample_coordinate_data.csv?dl=0
I would try something like
ggplot() + geom_count(data, aes(longitude, latitude, alpha=..prop..))
but you need to show some data to check how it works.
I have a data frame with several million points in it - each having two values.
When I plot this like this:
plot(myData)
All the points are plotted, but the plot is quite busy, so I thought I'd plot it as a line:
plot(myData, type="l")
But while the x axis doesn't change (i.e. goes from 0 to 7e+07), the actual plotting stops at about 3e+07 and I don't actually get a proper line plot either.
Is there a limitation on line plotting?
Update
If I use
plot(myData, type="h")
I get correct and useable output, but I still wonder why the type="l" option fails so badly.
Further update
I am plotting a time series - here is one output using type="h":
That's perfectly usable, but having a line would allow me to compare several outputs.
High dimensional data graphic representation is growing issue in data analysis. The problem, actually, is not create the graph. The problem is make the graph capable of communicate information that we could transform in useful knowledge. Allow me to present an example to produce this point, by considering a data with a million observations, that is, not that big.
x <- rnorm(10^6, 0, 1)
y <- rnorm(10^6, 0, 1)
Let's plot it. R can yes easily manage such a problem. But can we? Probably not.
Afterall, what kind of information can we deduce from an ink hard stain? Probably, no more than a tasseographyst trying to divinate the future in patterns of tea leaves, coffee grounds, or wine sediments.
plot(x, y)
A different approach is represented by the smoothScatter function. It creates a density plot of bivariate data. There, we create two examples.
First, with defaults.
smoothScatter(x, y)
Second, the bandwidth was specified to be a little larger than the default, and five points are specified to be shown using a different symbol pch = 3.
smoothScatter(x, y, bandwidth=c(5,1)/(1/3), nrpoints=5, pch=3)
As you can see, the problem is not solved. Nevertheless, we can have a better grasp on the distribution of our data. This kind of approach is still in development, and there are several matters that are discussed and evolved. If this approach represents a more suitable approach to represent your big dataset, I suggest you to visit this blog that discuss throughfully the issue.
For what it's worth, all the evidence I have is that is computer - even though it was a lump of big iron - ran out of memory.
i have a problem with plotting two Raster Data Sets in R.
I use two different IRS LISS III Scenes (with the same Extent) and what i want is to plot the pixel values of both scenes in one Scatterplot (x= Layer1 and y=Layer2).
My problem is now the handling of the big amount of data. Each Scene has about 80.000.000 pixels due reclassification and other processing i was able to scale down the values to a amount of 12.000.000 in each raster. But when i try to import these values e.g. in a data.frame or load them from an ascii file i always got problems with my memory.
Is it possible two plot such an amount of data, and when yes it would be great if someone could help me, i was trying it for two days now and right now im desperated.
Many thanks,
Stefan
Use the raster package, there's a good chance it will work out of the box since it has good "out-of-memory" handling. If it doesn't work with the ASCII grids, convert them to something more efficient (like an LZW-compressed and tiled GeoTIFF) with GDAL. And if they are still too big resize them, that's all the graphics rendering process will do anyway. (You don't say how you resized originally, or give any details on how you are trying to read them).
Background:
I'm running a Monte Carlo simulation to show that a particular process (a cumulative mean) does not converge over time, and often diverges wildly in simulation (the expectation of the random variable = infinity). I want to plot about 10 of these simulations on a line chart, where the x axis has the iteration number, and the y axis has the cumulative mean up to that point.
Here's my problem:
I'll run the first simulation (each sim. having 10,000 iterations), and build the main plot based on its current range. But often one of the simulations will have a range a few orders of magnitude large than the first one, so the plot flies outside of the original range. So, is there any way to dynamically update the ylim or xlim of a plot upon adding a new set of points or lines?
I can think of two workarounds for this: 1. store each simulation, then pick the one with the largest range, and build the base graph off of that (not elegant, and I'd have to store a lot of data in memory, but would probably be laptop-friendly [[EDIT: as Marek points out, this is not a memory-intense example, but if you know of a nice solution that'd support far more iterations such that it becomes an issue (think high dimensional walks that require much, much larger MC samples for convergence) then jump right in]]) 2. find a seed that appears to build a nice looking version of it, and set the ylim manually, which would make the demonstration reproducible.
Naturally I'm holding out for something more elegant than my workarounds. Hoping this isn't too pedestrian a problem, since I imagine it's not uncommon with simulations in R. Any ideas?
I'm not sure if this is possible using base graphics, if someone has a solution I'd love to see it. However graphics systems based on grid (lattice and ggplot2) allow the graphics object to be saved and updated. It's insanely easy in ggplot2.
require(ggplot2)
make some data and get the range:
foo <- as.data.frame(cbind(data=rnorm(100), numb=seq_len(100)))
make an initial ggplot object and plot it:
p <- ggplot(as.data.frame(foo), aes(numb, data)) + layer(geom='line')
p
make some more data and add it to the plot
foo <- as.data.frame(cbind(data=rnorm(200), numb=seq_len(200)))
p <- p + geom_line(aes(numb, data, colour="red"), data=as.data.frame(foo))
plot the new object
p
I think (1) is the best option. I actually don't think this isn't elegant. I think it would be more computationally intensive to redraw every time you hit a point greater than xlim or ylim.
Also, I saw in Peter Hoff's book about Bayesian statistics a cool use of ts() instead of lines() for cumulative sums/means. It looks pretty spiffy: