Hexbin: apply function for every bin

I would like to build a hexbin plot where, for each bin, the ratio between the class 1 and class 2 points falling into that bin is plotted (either on a log scale or not).
x <- rnorm(10000)
y <- rnorm(10000)
h <- hexbin(x,y)
plot(h)
l <- as.factor(c( rep(1,2000), rep(2,8000) ))
Any suggestions on how to implement this? Is there a way to apply a function to every bin based on the bin statistics?

@cryo111's answer has the most important ingredient - IDs = TRUE. After that it's just a matter of figuring out what you want to do with Infs and how much you need to scale the ratios by to get integers that will produce a pretty plot.
library(hexbin)
library(data.table)
set.seed(1)
x = rnorm(10000)
y = rnorm(10000)
h = hexbin(x, y, IDs = TRUE)
# put all the relevant data in a data.table
dt = data.table(x, y, l = c(1,1,1,2), cID = h@cID)
# group by cID and calculate whatever statistic you like
# in this case, ratio of 1's to 2's,
# and then Inf's are set to be equal to the largest ratio
dt[, list(ratio = sum(l == 1)/sum(l == 2)), keyby = cID][,
   ratio := ifelse(ratio == Inf, max(ratio[is.finite(ratio)]), ratio)][,
   # scale up (I chose the scaling manually to get a prettier graph),
   # convert to integer, and replace the counts in h
   as.integer(ratio*10)] -> h@count
plot(h)

You can determine the number of class 1 and class 2 points in each bin as follows:
library(hexbin)
library(plyr)
x = rnorm(10000)
y = rnorm(10000)
# generate a hexbin object with IDs = TRUE
# the object then includes a slot with a vector cID
# cID maps point (x[i], y[i]) to cell number cID[i]
HexObj = hexbin(x, y, IDs = TRUE)
#find count statistics for first 2000 points (class 1) and the rest (class 2)
CountDF = merge(count(HexObj@cID[1:2000]),
                count(HexObj@cID[2001:length(x)]),
                by = "x",
                all = TRUE)
#replace NAs by 0
CountDF[is.na(CountDF)]=0
#check if all points are included
sum(CountDF$freq.x)+sum(CountDF$freq.y)
But plotting them is another story. For instance, what if there are no class 2 points in a bin? The fraction is not defined then.
In addition, as far as I understand, hexbin is just a two-dimensional histogram: it counts the number of points that fall into a given bin. I do not think it can handle non-integer cell values such as these ratios.
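A minimal sketch of one way to finish the job from CountDF (my own addition, not from the answer; the freq.x/freq.y column names come from the merge above, and substituting the largest finite ratio for undefined ones is just one convention):
# ratio of class 1 to class 2 counts per bin (the merge key "x" holds the cID)
CountDF$ratio = CountDF$freq.x / CountDF$freq.y
# bins with no class 2 points give Inf; replace non-finite ratios
# with the largest finite one
finite = is.finite(CountDF$ratio)
CountDF$ratio[!finite] = max(CountDF$ratio[finite])
head(CountDF)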

Data frames using conditional probabilities to extract a certain range of values

I would like some help answering the following question:
Dr Barchan makes 600 independent recordings of Eric’s coordinates (X, Y, Z), selects the cases where X ∈ (0.45, 0.55), and draws a histogram of the Y values for these cases.
By construction, these values of Y follow the conditional distribution of Y given X ∈ (0.45,0.55). Use your function sample3d to mimic this process and draw the resulting histogram. How many samples of Y are displayed in this histogram?
We can argue that the conditional distribution of Y given X ∈ (0.45, 0.55) approximates the conditional distribution of Y given X = 0.5 — and this approximation is improved if we make the interval of X values smaller.
Repeat the above simulations selecting cases where X ∈ (0.5 − δ, 0.5 + δ), using a suitably chosen δ and a large enough sample size to give a reliable picture of the conditional distribution of Y given X = 0.5.
I know that for the first paragraph we want to take the values of x, y, z generated by sample3d(600) and then restrict the x's to the range 0.45-0.55. Is there a way to code this (maybe with an if statement) that would keep the values of x in this range but discard all the x's from the 600 generated that are not in the range? Also, does anyone have any hints for the conditional-probability part in the third paragraph?
sample3d = function(n)
{
  df = data.frame()
  while (n > 0)
  {
    X = runif(1, -1, 1)
    Y = runif(1, -1, 1)
    Z = runif(1, -1, 1)
    a = X^2 + Y^2 + Z^2
    if (a < 1)
    {
      b = (X^2 + Y^2 + Z^2)^0.5
      vector = data.frame(X = X/b, Y = Y/b, Z = Z/b)
      df = rbind(vector, df)
      n = n - 1
    }
  }
  df
}
sample3d(n)
Any help would be appreciated, thank you.
Your function produces a data frame. The part of the question that asks you to find those values in a data frame that are in a given range can be solved by filtering the data frame. Notice that you're looking for an open interval (the endpoint values aren't included).
df <- sample3d(600)
df[df$X > 0.45 & df$X < 0.55,]
Pay attention to the comma.
You can use a dplyr solution as well, but don't use the helper between(), since it checks a closed interval (endpoints included) and you need the open one.
library(dplyr)
filter(df, X > 0.45 & X < 0.55)
For the remainder of your assignment, see what you can figure out and if you run into a specific problem, stack overflow can help you.
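For the third paragraph, here is a minimal sketch of the idea (my own illustration, not part of the original answer; the delta and the sample size are placeholders you would tune):
# keep only the Y values whose X falls within delta of 0.5
delta <- 0.05
df2 <- sample3d(10000)   # a larger sample gives a more reliable picture
ySel <- df2$Y[df2$X > 0.5 - delta & df2$X < 0.5 + delta]
length(ySel)             # how many Y values the histogram displays
hist(ySel)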

(Simple) Generating Multiple (1000) random variables in R

I'm attempting to create 1000 samples of a certain variable Z, for which I first generate 12 uniform RVs Ui and then set Z = ∑ (Ui − 6) from i=1 to 12. I can generate one Z from
u <- runif(12)
Z <- sum(u-6)
However, I am not sure how to go about repeating that 1000 times. In the end, the goal is to plot a histogram of the Z's, which should ideally resemble the normal curve. Sorry, clearly I am as beginner as you can get in this realm. Thank you!
If I understand the question, this is a pretty straightforward way to do it -- use replicate() to perform the calculation as many times as you want. (As an aside: as written, ∑(Ui − 6) = ∑Ui − 72 is centered at −66; if you meant the classic normal approximation centered at 0, sum the twelve uniforms first and subtract 6 once: sum(runif(12)) - 6.)
# number of values to draw per iteration
n_samples <- 12
# number of iterations
n_iters <- 1000
# get samples, subtract 6 from each element, sum them (1000x)
Zs <- replicate(n_iters, sum(runif(n_samples) - 6))
# print a histogram
hist(Zs)
Is this what you're after?
set.seed(2017);
n <- 1000;
u <- matrix(runif(12 * n), ncol = 12);
z <- apply(u, 1, function(x) sum(x - 6));
# Density plot
require(ggplot2);
ggplot(data.frame(z = z), aes(x = z)) + geom_density();
Explanation: draw 12 * 1000 uniform samples in one go, store them in a 1000 x 12 matrix, and then sum the row entries of x - 6.
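As a side note (my own addition), the row-wise apply() has a vectorized equivalent:
# rowSums() over the shifted matrix gives the same z in a single call
z <- rowSums(u - 6);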

Error in rollapply: subscript out of bounds

I'd first like to describe my problem:
What I want to do is calculate the number of spikes in prices within a 24-hour window, while I possess half-hourly data.
I have seen all Stackoverflow posts like e.g. this one:
Rollapply for time series
(If there are more relevant ones, please let me know ;) )
As I cannot and probably also should not upload my data, here's a minimal example:
I simulate a random variable, convert it to an xts object, and use a user-defined function to detect "spikes" (of course pretty ridiculous in this case, but it illustrates the error).
library(xts)
##########Simulate y as a random variable
y <- rnorm(n=100)
##########Add a date variable so i can convert it to a xts object later on
yDate <- as.Date(1:100)
##########bind both variables together and convert to a xts object
z <- cbind(yDate,y)
z <- xts(x=z, order.by=yDate)
##########use the rollapply function on the xts object:
x <- rollapply(z, width=10, FUN=mean)
The function works as it is supposed to: it takes the 10 preceding values and calculates the mean.
Then I defined my own function to find peaks: a peak is a local maximum (higher than the m points around it) AND at least as big as the mean of the time series + h.
This leads to:
find_peaks <- function(x, m, h){
  shape <- diff(sign(diff(x, na.pad = FALSE)))
  pks <- sapply(which(shape < 0), FUN = function(i){
    z <- i - m + 1
    z <- ifelse(z > 0, z, 1)
    w <- i + m + 1
    w <- ifelse(w < length(x), w, length(x))
    if(all(x[c(z:i, (i + 2):w)] <= x[i + 1]) & x[i + 1] > mean(x) + h) return(i + 1) else return(numeric(0))
  })
  pks <- unlist(pks)
  pks
}
And it works fine. Back to the example:
plot(yDate, y)
# supposed to find the points that are higher than the 3 points
# around them and higher than the average -- and it does
points(yDate[find_peaks(y, 3, 0)], y[find_peaks(y, 3, 0)], col = "red")
However, using the rollapply() function leads to:
x <- rollapply(z,width = 10,FUN=function(x) find_peaks(x,3,0))
#Error in `[.xts`(x, c(z:i, (i + 2):w)) : subscript out of bounds
I first thought that the error might occur because the index can run negative for the first points, because of the m parameter. Sadly, setting m to zero does not change the error.
I have tried to trace this error too, but do not find the source.
Can anyone help me out here?
Edit: a picture of the spikes, from the Australian electricity market: find_peaks(20, 50) determines the red points to be spikes, while find_peaks(0, 50) additionally finds the blue ones to be spikes (so the second parameter h is important, because the blue points are clearly not what we want to analyse when we talk about spikes).
I'm still not entirely sure what it is that you are after. On the assumption that, given a window of data, you want to identify whether its center is greater than the rest of the window while also being greater than the mean of the window + h, you could do the following:
peakfinder = function(x, h = 0){
  xdat = as.numeric(x)
  meandat = mean(xdat)
  center = xdat[ceiling(length(xdat)/2)]
  ifelse(all(center >= xdat) & center >= (meandat + h), center, NA)
}
y <- rnorm(n=100)
z = xts(y, order.by = as.Date(1:100))
plot(z)
points(rollapply(z,width = 7, FUN = peakfinder, align = "center"), col = "red", pch = 19)
Although it would appear to me that if the center point is greater than its neighbours, it is necessarily greater than the local mean too, so this part of the function would not be necessary if h <= 0. If you want to use the global mean of the time series, just substitute the calculation of meandat with the pre-calculated global mean passed as an argument to peakfinder.
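As for the original "subscript out of bounds" error, a hedged guess (my own, not a full diagnosis): the question's z carries two columns (yDate and y), and xts subsetting errors on out-of-range integer subscripts where a plain vector would return NA. Rolling over the data column alone and coercing each window to a numeric vector sidesteps both issues, and wrapping the result in length() returns the fixed-size per-window summary (number of peaks) that rollapply expects:
# using the two-column z from the question: roll over the y column only,
# hand find_peaks a plain numeric vector, and return one number per window
res <- rollapply(z[, "y"], width = 10,
                 FUN = function(w) length(find_peaks(as.numeric(w), 3, 0)))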

How to quickly calculate number of times a series crosses through a value?

I have a discrete cumulative random series in R:
x<-cumsum(rnorm(100))
When plotted:
What would be the best way to calculate how many times it crosses through a specific value; in this case, -2?
#sample data
set.seed(17)
x<-cumsum(rnorm(100))
And we can count the crossings by subtracting the value we are interested in and seeing how often the sign of the result changes:
target <- 5
idx <- which(diff(sign(x-target))!=0)
print(paste("it crosses", length(idx), "times"))
and here's what the results look like with the crossing intervals highlighted
plot(x, type="l")
abline(h=target, col="green")
abline(v=c(idx,idx+1), col=c("red","blue"), lty=2)
set.seed(2)
x <- cumsum(rnorm(100))
plot(x, type = "l")
abline(h = 0)
Count the runs in the logical vector x > 0 with the rle() function; the series crosses the value (0 in this case) once per change of run, i.e. the number of runs minus 1.
length(rle(x > 0)$lengths) - 1
## [1] 6
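The same idea works for an arbitrary crossing level (my own generalization; -2 matches the value in the question):
target <- -2
length(rle(x > target)$lengths) - 1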
Perhaps this will help.
x <- cumsum(rnorm(100))
l1 <- x >= 5
n <- length(x)
temp <- cbind.data.frame(l1[-1], l1[-n])
colnames(temp) <- c("l1", "l2")
temp$val <- temp$l1 + temp$l2
sum(temp$val == 1)
The strategy here is to create a logical vector and its lagged version, bind them, and count the entries where the two logicals differ (the series crosses the line there); those are exactly the rows where val equals 1.
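An equivalent one-liner under the same setup (my own addition):
# rows where the current and lagged logicals differ are crossings
sum(l1[-1] != l1[-n])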

Detecting dips in a 2D plot

I need to automatically detect dips in a 2D plot, like the regions marked with red circles in the figure below. I'm only interested in the "main" dips, meaning the dips have to span a minimum length in the x axis. The number of dips is unknown, i.e., different plots will contain different numbers of dips. Any ideas?
Update:
As requested, here's the sample data, together with an attempt to smooth it using median filtering, as suggested by vines.
It looks like I now need a robust way to approximate the derivative at each point, one that ignores the little blips remaining in the data. Is there a standard approach?
y <- c(0.9943,0.9917,0.9879,0.9831,0.9553,0.9316,0.9208,0.9119,0.8857,0.7951,0.7605,0.8074,0.7342,0.6374,0.6035,0.5331,0.4781,0.4825,0.4825,0.4879,0.5374,0.4600,0.3668,0.3456,0.4282,0.3578,0.3630,0.3399,0.3578,0.4116,0.3762,0.3668,0.4420,0.4749,0.4556,0.4458,0.5084,0.5043,0.5043,0.5331,0.4781,0.5623,0.6604,0.5900,0.5084,0.5802,0.5802,0.6174,0.6124,0.6374,0.6827,0.6906,0.7034,0.7418,0.7817,0.8311,0.8001,0.7912,0.7912,0.7540,0.7951,0.7817,0.7644,0.7912,0.8311,0.8311,0.7912,0.7688,0.7418,0.7232,0.7147,0.6906,0.6715,0.6681,0.6374,0.6516,0.6650,0.6604,0.6124,0.6334,0.6374,0.5514,0.5514,0.5412,0.5514,0.5374,0.5473,0.4825,0.5084,0.5126,0.5229,0.5126,0.5043,0.4379,0.4781,0.4600,0.4781,0.3806,0.4078,0.3096,0.3263,0.3399,0.3184,0.2820,0.2167,0.2122,0.2080,0.2558,0.2255,0.1921,0.1766,0.1732,0.1205,0.1732,0.0723,0.0701,0.0405,0.0643,0.0771,0.1018,0.0587,0.0884,0.0884,0.1240,0.1088,0.0554,0.0607,0.0441,0.0387,0.0490,0.0478,0.0231,0.0414,0.0297,0.0701,0.0502,0.0567,0.0405,0.0363,0.0464,0.0701,0.0832,0.0991,0.1322,0.1998,0.3146,0.3146,0.3184,0.3578,0.3311,0.3184,0.4203,0.3578,0.3578,0.3578,0.4282,0.5084,0.5802,0.5667,0.5473,0.5514,0.5331,0.4749,0.4037,0.4116,0.4203,0.3184,0.4037,0.4037,0.4282,0.4513,0.4749,0.4116,0.4825,0.4918,0.4879,0.4918,0.4825,0.4245,0.4333,0.4651,0.4879,0.5412,0.5802,0.5126,0.4458,0.5374,0.4600,0.4600,0.4600,0.4600,0.3992,0.4879,0.4282,0.4333,0.3668,0.3005,0.3096,0.3847,0.3939,0.3630,0.3359,0.2292,0.2292,0.2748,0.3399,0.2963,0.2963,0.2385,0.2531,0.1805,0.2531,0.2786,0.3456,0.3399,0.3491,0.4037,0.3885,0.3806,0.2748,0.2700,0.2657,0.2963,0.2865,0.2167,0.2080,0.1844,0.2041,0.1602,0.1416,0.2041,0.1958,0.1018,0.0744,0.0677,0.0909,0.0789,0.0723,0.0660,0.1322,0.1532,0.1060,0.1018,0.1060,0.1150,0.0789,0.1266,0.0965,0.1732,0.1766,0.1766,0.1805,0.2820,0.3096,0.2602,0.2080,0.2333,0.2385,0.2385,0.2432,0.1602,0.2122,0.2385,0.2333,0.2558,0.2432,0.2292,0.2209,0.2483,0.2531,0.2432,0.2432,0.2432,0.2432,0.3053,0.3630,0.3578,0.3630,0.3668,0.3263,0.3992,0.4037,0.4556,0.4703,0.5173,0.6219,0.6412,0.7275,0.6984,0.6756,0.7079,0.7192,0.7342,0.7458,0.7501,0.7540,0.7605,0.7605,0.7342,0.7912,0.7951,0.8036,0.8074,0.8074,0.8118,0.7951,0.8118,0.8242,0.8488,0.8650,0.8488,0.8311,0.8424,0.7912,0.7951,0.8001,0.8001,0.7458,0.7192,0.6984,0.6412,0.6516,0.5900,0.5802,0.5802,0.5762,0.5623,0.5374,0.4556,0.4556,0.4333,0.3762,0.3456,0.4037,0.3311,0.3263,0.3311,0.3717,0.3762,0.3717,0.3668,0.3491,0.4203,0.4037,0.4149,0.4037,0.3992,0.4078,0.4651,0.4967,0.5229,0.5802,0.5802,0.5846,0.6293,0.6412,0.6374,0.6604,0.7317,0.7034,0.7573,0.7573,0.7573,0.7772,0.7605,0.8036,0.7951,0.7817,0.7869,0.7724,0.7869,0.7869,0.7951,0.7644,0.7912,0.7275,0.7342,0.7275,0.6984,0.7342,0.7605,0.7418,0.7418,0.7275,0.7573,0.7724,0.8118,0.8521,0.8823,0.8984,0.9119,0.9316,0.9512)
yy <- runmed(y, 41)
plot(y, type="l", ylim=c(0,1), ylab="", xlab="", lwd=0.5)
points(yy, col="blue", type="l", lwd=2)
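For what it's worth, a minimal sketch of the derivative idea on the smoothed series (my own addition; it flags the points where the first difference of the runmed output switches from falling to rising):
dyy <- diff(yy)
minima <- which(diff(sign(dyy)) > 0) + 1   # falling-to-rising switches
points(minima, yy[minima], col = "red", pch = 19)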
EDITED: the function now strips the regions down to nothing but their lowest part, if wanted.
Actually, using the mean is easier than using the median: it allows you to find regions where the real values are continuously below the mean. The median is not smooth enough for an easy application.
One example function to do this would be:
FindLowRegion <- function(x, n = length(x)/4, tol = length(x)/20, p = 0.5){
  nx <- length(x)
  n <- 2*(n %/% 2) + 1
  # smooth out based on means
  sx <- rowMeans(embed(c(rep(NA, n/2), x, rep(NA, n/2)), n), na.rm = TRUE)
  # find which stretches lie below the running mean
  rlesx <- rle((sx - x) > 0)
  # construct start and end of regions
  int <- embed(cumsum(c(1, rlesx$lengths)), 2)
  # which regions fulfill the requirements
  id <- rlesx$values & rlesx$lengths > tol
  # cut the regions down to their part below the p-quantile
  regions <-
    apply(int[id, ], 1, function(i){
      i <- min(i):max(i)
      tmp <- x[i]
      id <- which(tmp < quantile(tmp, p))
      id <- min(id):max(id)
      i[id]
    })
  # return
  unlist(regions)
}
where
n determines how many values are used to calculate the running mean,
tol determines how many consecutive values have to be lower than the running mean before we speak of a low region, and
p determines the cutoff (as a quantile) used for stripping the regions down to their lowest part. With p = 1, the complete low region is kept.
The function is tweaked to work on the data as you presented it, but the numbers might need adjusting a bit to work with other data. It returns a set of indices, which lets you locate the low regions. Illustrated with your y vector:
Lows <- FindLowRegion(y)
newx <- seq_along(y)
newy <- ifelse(newx %in% Lows, y, NA)
plot(y, col = "blue", type = "l", lwd = 2)
lines(newx, newy, col = "red", lwd = 3)
This gives a plot with the low regions drawn in red.
You have to smooth the graph in some way. Median filtering is quite useful for that purpose (see http://en.wikipedia.org/wiki/Median_filter). After smoothing, you simply have to search for the minima, just as usual (i.e. search for the points where the first derivative switches from negative to positive).
A simpler answer (which also does not require smoothing) could be provided by adapting the maxdrawdown() function from the tseries package. A drawdown is commonly defined as the retreat from the most recent maximum; here we want the opposite. Such a function could then be used in a sliding window over the data, or over segmented data.
maxdrawdown <- function(x) {
  if(NCOL(x) > 1)
    stop("x is not a vector or univariate time series")
  if(any(is.na(x)))
    stop("NAs in x")
  cmaxx <- cummax(x) - x
  mdd <- max(cmaxx)
  to <- which(mdd == cmaxx)
  from <- double(NROW(to))
  for (i in 1:NROW(to))
    from[i] <- max(which(cmaxx[1:to[i]] == 0))
  return(list(maxdrawdown = mdd, from = from, to = to))
}
So instead of using cummax(), one would have to switch to cummin() etc.
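A minimal sketch of that adaptation (my own, with the assumed name maxrunup; it measures the largest rise above the running minimum, i.e. the recovery out of a dip):
maxrunup <- function(x) {
  if(NCOL(x) > 1)
    stop("x is not a vector or univariate time series")
  if(any(is.na(x)))
    stop("NAs in x")
  # height above the running minimum; large values mark recoveries from dips
  cminx <- x - cummin(x)
  mru <- max(cminx)
  to <- which(mru == cminx)
  from <- double(NROW(to))
  for (i in 1:NROW(to))
    from[i] <- max(which(cminx[1:to[i]] == 0))  # last time at the running minimum
  return(list(maxrunup = mru, from = from, to = to))
}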
My first thought was something much cruder than filtering: why not look for big drops followed by long enough stable periods?
span.b <- 20
threshold.b <- 0.2
dy.b <- c(rep(NA, span.b), diff(y, lag = span.b))
span.f <- 10
threshold.f <- 0.05
dy.f <- c(diff(y, lag = span.f), rep(NA, span.f))
down <- which(dy.b < -1 * threshold.b & abs(dy.f) < threshold.f)
plot(y, type = "l")   # added so the abline has something to draw on
abline(v = down)
The plot shows that it's not perfect, but it doesn't discard the outliers (I guess it depends on your take on the data).
