Fast method for calculating frequency with Rcpp [duplicate] - r

I have ~ 5 very large vectors (~ 108 MM entries), so any plot or operation I do with them in R takes quite a long time.
I am trying to visualize their distributions (histograms) and was wondering what the best way would be to superimpose their histogram distributions in R without it taking too long. I am thinking of first fitting a distribution to each histogram and then plotting all the fitted distribution lines together in one plot.
Do you have some suggestions on how to do that?
Let us say my vectors are:
x1, x2, x3, x4, x5.
I am trying to use this code: Overlaying histograms with ggplot2 in R
Example of the code I am using for 3 vectors (R fails to do the plot):
library(ggplot2)
n <- length(x1)
dat <- data.frame(xx = c(x1, x2, x3), yy = rep(letters[1:3], each = n))
ggplot(dat, aes(x = xx)) +
  geom_histogram(data = subset(dat, yy == 'a'), fill = "red", alpha = 0.2) +
  geom_histogram(data = subset(dat, yy == 'b'), fill = "blue", alpha = 0.2) +
  geom_histogram(data = subset(dat, yy == 'c'), fill = "green", alpha = 0.2)
but it takes forever to produce the plot, and eventually it kicks me out of R. Any ideas on how to use ggplot2 efficiently for large vectors? It seems that I would have to create a data frame of 5 * 108 MM entries and then plot it, which is highly inefficient in my case.
Thanks!
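To illustrate the "fit a distribution, then plot the fits" idea I have in mind, here is a minimal sketch using kernel density estimates (assuming x1..x5 are already in memory; densdf is just an illustrative name, and density() reduces each vector to 512 (x, y) pairs, so the plotting data stays tiny):
library(ggplot2)
dens <- lapply(list(a = x1, b = x2, c = x3, d = x4, e = x5), density)
densdf <- do.call(rbind, lapply(names(dens), function(nm)
  data.frame(x = dens[[nm]]$x, y = dens[[nm]]$y, vector = nm)))
ggplot(densdf, aes(x, y, colour = vector)) + geom_line()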

Here's a little snippet of Rcpp that bins data very efficiently - on my computer it takes about a second to bin 100,000,000 observations:
library(Rcpp)
cppFunction('
  std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
    int bin, nmissing = 0;
    std::vector<int> out;
    for (NumericVector::iterator x_it = x.begin(); x_it != x.end(); ++x_it) {
      double val = *x_it;
      if (ISNAN(val)) {
        ++nmissing;
      } else {
        bin = (val - origin) / width;
        if (bin < 0) continue;
        // Make sure there\'s enough space
        if (bin >= (int) out.size()) {
          out.resize(bin + 1);
        }
        ++out[bin];
      }
    }
    // Put missing values in the last position
    out.push_back(nmissing);
    return out;
  }
')
x8 <- runif(1e8)
system.time(bin3(x8, 1/100))
# user system elapsed
# 1.373 0.000 1.373
That said, hist is pretty fast here too:
system.time(hist(x8, breaks = 100, plot = F))
# user system elapsed
# 7.281 1.362 8.669
It's straightforward to use bin3 to make a histogram or frequency polygon:
# First we create some sample data, and bin each column
library(reshape2)
library(ggplot2)
df <- as.data.frame(replicate(5, runif(1e6)))
bins <- vapply(df, bin3, width = 1/100, FUN.VALUE = integer(100 + 1))
# Next we match up the bins with the breaks (bin left edges; the last slot holds the NA count)
binsdf <- data.frame(
  breaks = c(seq(0, 0.99, by = 1/100), NA),
  bins)
# Then melt and plot
binsm <- subset(melt(binsdf, id = "breaks"), !is.na(breaks))
qplot(breaks, value, data = binsm, geom = "line", colour = variable)
FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)
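For the original five-vector problem the same pre-binning idea also works with plain hist() - a minimal sketch, assuming the x1..x5 vectors from the question and a shared range of [0, 1] (adjust breaks otherwise):
breaks <- seq(0, 1, by = 1/100)
# bin each vector separately; only 100 counts per vector are kept
counts <- lapply(list(a = x1, b = x2, c = x3, d = x4, e = x5),
                 function(v) hist(v, breaks = breaks, plot = FALSE)$counts)
plotdf <- data.frame(
  mid    = rep(head(breaks, -1) + 1/200, times = length(counts)),
  count  = unlist(counts, use.names = FALSE),
  vector = rep(names(counts), each = length(breaks) - 1))
ggplot(plotdf, aes(mid, count, colour = vector)) + geom_line()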

Related

deSolve: differential equations with two consecutive dynamics

I am simulating a ring tube with flowing water and a temperature gradient using deSolve::ode(). The ring is modelled as a vector where each element has a temperature value and position.
I am modelling the heat diffusion formula:
(1)    dT/dt = alpha * d2T/dx2
But I'm struggling with also moving the water along the ring. In theory, it's just a matter of substituting the temperature at element i in the tube vector with the temperature s places earlier. Since s may not be an integer, it can be separated into an integer part n and a fractional part p, with s = n + p. Consequently, the change in temperature due to the water moving becomes:
(2)    dT[i] = T[i-n] + p * (T[i-n-1] - T[i-n]) - T[i]
The problem is that s equals the water velocity v multiplied by the dt chosen at each iteration of the ODE solver.
My idea is to treat the phenomena as additive, that is, first computing (1), then (2), and finally adding them together. I'm worried, though, about the effect of time: the ODE solver with implicit methods decides the time step automatically and scales the unit change delta down linearly.
My question is whether just returning (1) + (2) in the derivative function is correct or if I should break the two processes apart and compute the derivatives separately. In the second case, what would be the suggested approach?
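In R, the fractional shift in (2) can be sketched as the following helper (shift_frac is just an illustration, not part of the model; it mirrors the roll()-based code commented out further below):
# hypothetical helper: shift a circular vector v by a possibly fractional
# amount s, interpolating linearly between the two nearest integer shifts
shift_frac <- function(v, s) {
  n <- floor(s); p <- s - n
  idx  <- (seq_along(v) - n - 1) %% length(v) + 1  # shift by n
  idx1 <- (seq_along(v) - n - 2) %% length(v) + 1  # shift by n + 1
  v[idx] + p * (v[idx1] - v[idx])
}
# the flow term in (2) would then be: dT_flow <- shift_frac(v, s) - v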
EDIT:
As suggested by @tpetzoldt, I tried to implement the water flow using ReacTran::advection.1D(). My model has multiple sources of temperature variation: the spontaneous symmetric heat diffusion; the water flow; a heat source that is turned on if the temperature near a sensor (placed before the heat source) drops below a lower threshold and turned off if it rises above an upper threshold; and a constant heat dispersion determined by a cyclical external temperature.
Below the "Moving water" section there is still my previous version of the code, now replaced by ReacTran::advection.1D().
The plot_type argument allows visualizing either a time sequence of the temperature in the water tube ("pipe"), or the temperature sequence at the sensors (before and after the heater).
library(deSolve)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ReacTran)
test <- function(simTime = 5000, vel = 1, L = 500, thresh = c(16, 25), heatT = 25,
                 heatDisp = .0025, baseTemp = 15, alpha = .025,
                 adv_method = 'up', plot_type = c('pipe', 'sensors')) {
  plot_type <- match.arg(plot_type)
  sensorP <- round(L/2)
  # state: L temperatures along the ring, plus one extra element for the heater
  vec <- c(rep(baseTemp, L), 0)
  eventfun <- function(t, y, pars) {
    heat <- y[L + 1] > 0
    if (y[sensorP] < thresh[1] & heat == FALSE) { # heat is FALSE -> T was above the threshold
      y[L + 1] <- heatT
    }
    if (y[sensorP] > thresh[2] & heat == TRUE) {  # heat is TRUE -> T was below the threshold
      y[L + 1] <- 0
    }
    return(y)
  }
  rootfun <- function(t, y, pars) {
    heat <- y[L + 1] > 0
    trigger_root <- 1
    if (y[sensorP] < thresh[1] & heat == FALSE & t > 1) { # heat is FALSE -> T was above the threshold
      trigger_root <- 0
    }
    if (y[sensorP] > thresh[2] & heat == TRUE & t > 1) {  # heat is TRUE -> T was below the threshold
      trigger_root <- 0
    }
    return(trigger_root)
  }
  # circular shift of x by n positions
  roll <- function(x, n) {
    x[((1:length(x)) - (n + 1)) %% length(x) + 1]
  }
  fun <- function(t, y, pars) {
    v <- y[1:L]
    # Heat diffusion: dT/dt = alpha * d2T/dx2
    d2Td2X <- c(v[2:L], v[1]) + c(v[L], v[1:(L - 1)]) - 2 * v
    dT_diff <- pars * d2Td2X
    # Moving water (previous hand-rolled version, kept for reference):
    # nS <- floor(vel)
    # pS <- vel - nS
    # v_shifted <- roll(v, nS)
    # v_shifted1 <- roll(v, nS + 1)
    # dT_flow <- v_shifted + pS * (v_shifted1 - v_shifted) - v
    dT_flow <- advection.1D(v, v = vel, dx = 1, C.up = v[L], C.down = v[1],
                            adv.method = adv_method)$dC
    dT <- dT_flow + dT_diff
    # heating of the ring after the sensor
    dT[sensorP + 1] <- dT[sensorP + 1] + y[L + 1]
    # heat dispersion towards the cyclical external temperature
    dT <- dT - heatDisp * (v - baseTemp + 2.5 * sin(t/(60*24) * pi * 2))
    return(list(c(dT, 0)))
  }
  out <- ode.1D(y = vec, times = 1:simTime, func = fun, parms = alpha, nspec = 1,
                events = list(func = eventfun, root = TRUE),
                rootfunc = rootfun)
  if (plot_type == 'sensors') {
    ## Trend of the temperature at the sensor positions
    out %>%
      {.[, c(1, sensorP + 1, sensorP + 3, L + 2)]} %>%
      as.data.frame() %>%
      setNames(c('time', 'pre', 'post', 'heat')) %>%
      mutate(Amb = baseTemp + 2.5 * sin(time/(60*24) * pi * 2)) %>%
      pivot_longer(-time, values_to = "val", names_to = "trend") %>%
      ggplot(aes(time, val)) +
      geom_hline(yintercept = thresh) +
      geom_line(aes(color = trend)) +
      theme_minimal() +
      theme(panel.spacing = unit(0, "lines")) +
      labs(x = 'time', y = 'T°', color = 'sensor')
  } else {
    ## Trend of the temperature in the whole pipe
    out %>%
      as.data.frame() %>%
      pivot_longer(-time, values_to = "val", names_to = "x") %>%
      filter(time %in% round(seq.int(1, simTime, length.out = 40))) %>%
      ggplot(aes(as.numeric(x), val)) +
      geom_hline(yintercept = thresh) +
      geom_line(alpha = .5, show.legend = FALSE) +
      geom_point(aes(color = val)) +
      scale_color_gradient(low = "#56B1F7", high = "red") +
      facet_wrap(~ time) +
      theme_minimal() +
      theme(panel.spacing = unit(0, "lines")) +
      labs(x = 'x', y = 'T°', color = 'T°')
  }
}
It's interesting that with a higher number of segments (L = 500) and a higher speed (vel = 2) it is possible to observe a spiking sequence at the post-heating sensor. Also, the processing time increases drastically, but more as an effect of the increased velocity than of the increased pipe resolution.
My biggest doubt now is whether ReacTran::advection.1D() makes sense in my context, since I'm modelling water temperature, while this function seems more related to the concentration of a solute in flowing water.
The problem looks like a PDE example with a mobile and a fixed phase. A good introduction to the "method of lines" (MOL) approach with R/deSolve can be found in the paper about ReacTran by Soetaert and Meysman (2012), doi.org/10.1016/j.envsoft.2011.08.011.
An example PDE can be found on slide 55 of some workshop slides, and more in the teaching package RTM.
R/deSolve/ReacTran tries to make ODEs/PDEs easy, but pitfalls remain. If numerical dispersion or oscillations occur, they can be caused by violating the Courant–Friedrichs–Lewy (CFL) condition.
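As a quick check, the Courant number C = v * dt / dx of the advection term should stay at or below 1; a minimal sketch (courant is just an illustrative helper, with dt standing for the solver's effective time step):
# Courant number of an advection term: C = v * dt / dx
courant <- function(v, dt, dx) v * dt / dx
courant(v = 2, dt = 1, dx = 1)  # 2: violates the CFL condition (C <= 1)
courant(v = 1, dt = 1, dx = 1)  # 1: right at the stability limit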

How to delete outliers from a QQ-plot graph made with ggplot()?

I have a two-dimensional dataset (say, columns x and y). I use the following code to plot a QQ-plot of this data.
# Creating toy data for presentation
df = cbind(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
# Plotting the QQ-plot
library(ggplot2)
df_qq = as.data.frame(qqplot(df[,1], df[,2], plot.it = FALSE))
ggplot(df_qq) +
  geom_point(aes(x = x, y = y), size = 2) +
  geom_abline(intercept = 0, slope = 1)
That is the resulting graph:
My question is: how do I avoid plotting the last point (i.e. (12,32))? I would rather not delete it manually because I have several of these data pairs and there are similar outliers in each of them. What I would like is code that somehow identifies the points that are too far from the 45-degree line and eliminates them from df_qq (for instance, a point could be eliminated if it is 5 times further from the line than the average distance). My main objective is to make the graph easier to read: when outliers are not eliminated, the more regular part of the QQ-plot occupies too small a part of the graph, which prevents me from visually evaluating the similarity of the two vectors apart from the outliers.
I would appreciate any help.
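For instance, I imagine something like the following sketch of the 5-times-average rule I have in mind (for the line y = x the vertical gap |y - x| is proportional to the perpendicular distance, so it can stand in for it):
d <- abs(df_qq$y - df_qq$x)  # distance proxy to the 45-degree line
keep <- d <= 5 * mean(d)     # drop points more than 5x the average distance
ggplot(df_qq[keep, ]) +
  geom_point(aes(x = x, y = y), size = 2) +
  geom_abline(intercept = 0, slope = 1)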
There is a CRAN package, referenceIntervals, that uses Cook's distance to detect outliers. Applying it to the values of df_qq$y gives an index into df_qq of the rows to be removed.
library(referenceIntervals)
out <- cook.outliers(df_qq$y)$outliers
i <- which(df_qq$y %in% out)
ggplot(df_qq[-i, ]) +
  geom_point(aes(x = x, y = y), size = 2) +
  geom_abline(intercept = 0, slope = 1)
Edit.
Following the OP's comment,
"But as far as I understand this function does not look at the relation between x & y,"
maybe the following function is what is needed: it removes points only if they are outliers in one of the vectors but not in both.
cookOut <- function(X) {
  # outliers detected separately in each column
  out1 <- cook.outliers(X[[1]])$outliers
  out2 <- cook.outliers(X[[2]])$outliers
  i <- X[[1]] %in% out1
  j <- X[[2]] %in% out2
  # drop rows flagged in exactly one of the two columns
  w <- which((!i & j) | (i & !j))
  if (length(w)) X[-w, ] else X
}
Test with the second data set, the one in the comment.
The extra vector, id, is just to make faceting easier.
df1 <- data.frame(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
df2 <- data.frame(x = c(1,5,8,2,9,6,1,7,32), y = c(1,4,10,1,6,5,2,1,32))
df_qq1 = as.data.frame(qqplot(df1[,1], df1[,2], plot.it=FALSE))
df_qq2 = as.data.frame(qqplot(df2[,1], df2[,2], plot.it=FALSE))
df_qq_out1 <- cookOut(df_qq1)
df_qq_out2 <- cookOut(df_qq2)
df_qq_out1$id <- "A"
df_qq_out2$id <- "B"
df_qq_out <- rbind(df_qq_out1, df_qq_out2)
ggplot(df_qq_out) +
  geom_point(aes(x = x, y = y), size = 2) +
  geom_abline(intercept = 0, slope = 1) +
  facet_wrap(~ id)

ggplot loess line from one dataset over scatterplot of another

The function below calculates binned averages, sizes the bin points on the graph relative to the number of observations in each bin, and plots a lowess line through the bin means. Instead of plotting the lowess line through the bin means, however, I would like to plot the line through the original dataset so that the error bands on the lowess line represent the uncertainty in the actual dataset, not the uncertainty in the binned averages. How do I modify geom_smooth() so that it will plot the line using df instead of dfplot?
library(fields)
library(ggplot2)
binplot <- function(df, yvar, xvar, sub = FALSE, N = 50, size = 40,
                    xlabel = "X", ylabel = "Y") {
  if (sub != FALSE) {
    df <- subset(df, eval(parse(text = sub)))
  }
  # bin centers, bin means, and bin counts via fields::stats.bin
  out <- stats.bin(df[, xvar], df[, yvar], N = N)
  x <- out$centers
  y <- out$stats[c("mean"), ]
  n <- out$stats[c("N"), ]
  dfplot <- as.data.frame(cbind(x, y, n))
  if (size != FALSE) {
    sizes <- n * (size / max(n))  # point size proportional to bin count
  } else {
    sizes <- 3
  }
  ggplot(dfplot, aes(x, y)) +
    xlab(xlabel) +
    ylab(ylabel) +
    geom_point(shape = 1, size = sizes) +
    geom_smooth()
}
Here is a reproducible example that demonstrates how the function currently works:
sampleSize <- 10000
x1 <- rnorm(n=sampleSize, mean = 0, sd = 4)
y1 <- x1 * 2 + x1^2 * .3 + rnorm(n=sampleSize, mean = 5, sd = 10)
binplot(data.frame(x1,y1), "y1", "x1", N = 25)
As you can see, the error band on the lowess line reflects the uncertainty as if each bin had an equal number of observations, but they do not: the bins at the extremes have far fewer observations (as illustrated by the size of the points), and the lowess line's error band should reflect that.
You can explicitly set the data= parameter for each layer. You will also need to change the aesthetic mapping since the original data.frame had different column names. Just change your geom_smooth call to
geom_smooth(data = df, aes_string(xvar, yvar))
With the sample data, this returned the desired plot.
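In context, the final plotting block of binplot would then read (a sketch of the one-line change, everything else as above):
ggplot(dfplot, aes(x, y)) +
  xlab(xlabel) +
  ylab(ylabel) +
  geom_point(shape = 1, size = sizes) +
  # smooth and error band now computed from the raw df, not the bin means
  geom_smooth(data = df, aes_string(xvar, yvar))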

Plotting family of functions with qplot without duplicating data

Given a family of functions f(x; q) (x is the argument and q is a parameter), I'd like to visualize this function family for x taken from the interval [0, 1] and 9 values of q (from 0.1 to 0.9). So far my solution is:
f = function(p,q=0.9) {1-(1-(p*q)^3)^1024}
x = seq(0.0,0.99,by=0.01)
q = seq(0.1,0.9,by=0.1)
qplot(rep(x,9), f(rep(x,9),rep(q,each=100)), colour=factor(rep(q,each=100)),
geom="line", size=I(0.9), xlab="x", ylab=expression("y=f(x)"))
I get a quick and easy visual with qplot:
My concern is that this method is rather memory-hungry, as I need to duplicate x for each parameter and duplicate each parameter value for the whole x range. What would be an alternative way to produce the same graph without these duplications?
At some point ggplot will need to have the data available to plot it and the way that package works prohibits simply doing what you want. I suppose you could set up a blank plot if you know the x and y axis limits, and then loop over the 9 values of q, generating the data for that q, and adding a geom_line layer to the existing plot object. However, you'll have to produce the colours for each layer yourself.
If this is representative of the size of problem you have, I wouldn't worry too much about the memory footprint. We're only talking about two vectors of length 900:
> object.size(rnorm(900))
7240 bytes
and the 100 values over the range of x appear sufficient to give a smooth plot.
Here's a for loop to add layers to ggplot:
require("ggplot2")
## something to replicate ggplot's colour palette, sure there is something
## to do this already in **ggplot** now...
ggHueColours <- function(n, h = c(0, 360) + 15, l = 65, c = 100,
                         direction = 1, h.start = 0) {
  turn <- function(x, h.start, direction) {
    (x + h.start) %% 360 * direction
  }
  if ((diff(h) %% 360) < 1) {
    h[2] <- h[2] - 360 / n
  }
  hcl(h = turn(seq(h[1], h[2], length = n), h.start = h.start,
               direction = direction), c = c, l = l)
}
f <- function(p, q = 0.9) { 1 - (1 - (p * q)^3)^1024 }
x <- seq(0.0, 0.99, by = 0.01)
q <- seq(0.1, 0.9, by = 0.1)
cols <- ggHueColours(n = length(q))
for (i in seq_along(q)) {
  df <- data.frame(y = f(x, q[i]), x = x)
  if (i == 1) {
    plt <- ggplot(df, aes(x = x, y = y)) + geom_line(colour = cols[i])
  } else {
    plt <- plt + geom_line(data = df, colour = cols[i])
  }
}
plt
which gives:
I'll leave the rest to you - I'm not familiar enough with ggplot to draw a legend manually.
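That said, one way to get a legend without duplicating x is to carry the current q value in each layer's small data frame, map it to colour inside aes(), and supply the palette via scale_colour_manual() - a minimal sketch along those lines (qval is just an illustrative column name):
for (i in seq_along(q)) {
  df <- data.frame(y = f(x, q[i]), x = x, qval = factor(q[i]))
  if (i == 1) {
    plt <- ggplot(df, aes(x = x, y = y, colour = qval)) + geom_line()
  } else {
    plt <- plt + geom_line(data = df, aes(colour = qval))
  }
}
plt + scale_colour_manual(name = "q", values = setNames(cols, q))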
