Maximum at any point of two lines in R

Suppose you have two lines, L1 and L2, with known points at each x value (x1 and x2, for example): L1={(x1,L1_y1), (x2,L1_y2)} and L2={(x1,L2_y1), (x2,L2_y2)}. When these points are joined, the lines may or may not intersect at some x3 where x1 < x3 < x2.
Now suppose you want to know the maximum of both lines at any x value (not restricted to x1, x2, etc., but anywhere along the axis). This is often trivial to calculate for a few lines and a few x values, but in my case I have several tens of thousands of x values and a few lines to check against, so it can't be done manually.
In R, is there some code which will calculate the maximum at any given point x3?
An example of this can be seen here with L1={(1,1), (2,4)}, and L2={(1,4),(2,1)}, illustrated by:
Here the intersection of these lines is at (1.5, 2.5). L2 is the maximum before this, and L1 after. This maximum line is shown in red below.
As you can see, it isn't enough to take the max at every known point and join these up, so the code will need to treat the lines as some form of function and then take the maximum of those functions.
Also, as mentioned before, there are several thousand x values, so it will need to generalise to larger data.
To test the code further if you wish you can randomly generate y values for some x values, and it will be clear to see from a plot if it works correctly or not.
Thanks in advance!

First, define the points constituting your lines from the example:
L1 <- list(x = c(1, 2), y = c(1, 4))
L2 <- list(x = c(1, 2), y = c(4, 1))
Then define a function that takes the pointwise maximum of the two functions corresponding to the lines:
myMax <- function(x)
  pmax(approxfun(L1$x, L1$y)(x), approxfun(L2$x, L2$y)(x))
This gives
plot(L1$x, L1$y, type = 'l')
lines(L2$x, L2$y, col = 'red')
curve(myMax(x), from = 1, to = 2, col = 'blue', add = TRUE)
This extends to more complex L1 and L2, because approxfun simply builds a piecewise-linear interpolant through the given points. You can also add L3, L4, and so on as further arguments to pmax.
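A minimal sketch of how the same idea might be wrapped up for any number of lines. The helper name upper_envelope is hypothetical (it is not from the answer above), and it assumes every line is given as a list with numeric vectors x and y:

```r
# Hypothetical helper: pointwise maximum of any number of piecewise-linear lines.
# Each element of `lines_list` is a list with numeric vectors `x` and `y`.
upper_envelope <- function(lines_list) {
  # One interpolating function per line
  funs <- lapply(lines_list, function(l) approxfun(l$x, l$y))
  # Return a function that evaluates every line at x and takes the maximum
  function(x) do.call(pmax, lapply(funs, function(f) f(x)))
}

L1 <- list(x = c(1, 2), y = c(1, 4))
L2 <- list(x = c(1, 2), y = c(4, 1))
env <- upper_envelope(list(L1, L2))

env(c(1, 1.5, 2))
#> [1] 4.0 2.5 4.0
```

Because the envelope is evaluated pointwise, a kink at an intersection is only drawn exactly if that x value is among the points evaluated. Note also that approxfun returns NA outside the x range of a line by default, so the lines should cover the range you query.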


Computing the Tukey median

I am trying to compute the data depth of two variables with the following function:
library(depth)
x <- data.frame(data$`math score`, data$`reading score`)
depth(1000, x, method = "Tukey", approx = FALSE, eps = 1e-8, ndir = 1000)
The first argument to depth is u, which the documentation describes as the "Numerical vector whose depth is to be calculated. Dimension has to be the same as that of the observations."
I have 1000 observations however I get the following error message:
Error in depth(1000, x, method = "Tukey", approx = FALSE, eps = 1e-08, :
Dimension mismatch between the data and the point u.
Does someone know how to solve this issue?
Thank you in advance!
If you look at the documentation for the function depth, it says:
u    Numerical vector whose depth is to be calculated. Dimension has to be the same as that of the observations.
So u has to be a point in multidimensional space represented by a vector with n components, whereas x has to be a matrix or data frame of m by n components, (m rows for m points). You are comparing u to all the other multidimensional points in the set x to find the minimum number of points that could share a half-space with u.
Let's create a very simple example in two-dimensional space:
library(depth)
set.seed(100)
x <- data.frame(x = c(rnorm(10, -5, 2), rnorm(10, 5, 2)), y = rnorm(20, 0, 2))
plot(x)
The depth function calculates the depth of a particular point relative to the data. So let's use the origin:
u <- data.frame(x = 0, y = 0)
points(u, col = "red", pch = 16)
Naively we might think that the origin here has a depth of 10/20 points (i.e. the most obvious way to partition this dataset is a vertical line through the origin with 10 points on each side), but instead we find:
depth(u, x)
#> [1] 0.35
This indicates that there is a half-space including the origin that only contains 0.35 of the points, i.e. 7 points out of 20:
depth(u, x) * nrow(x)
#> [1] 7
And we can see that visually like this:
abline(0, -0.07)
points(x[x$y < (-0.07 * x$x),], col = "blue", pch = 16)
Where we have coloured these 7 points blue.
So it's not clear what result you expect from the depth function, but you will need to give it a value of c(math_score, reading_score) where math_score and reading_score are test values for which you want to know the depth.
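To make the half-space idea concrete, here is a rough base-R sketch that approximates the Tukey depth of a two-dimensional point by scanning a grid of directions. This is not the algorithm used by the depth package; the function name approx_tukey_depth and the ndir default are assumptions made for illustration, and a coarse direction grid can overestimate the true depth.

```r
# Hypothetical helper (not part of the depth package): approximate Tukey
# (half-space) depth of a point u relative to the rows of a 2-column data set x.
approx_tukey_depth <- function(u, x, ndir = 360) {
  x <- as.matrix(x)
  u <- as.numeric(unlist(u))
  # u must have one value per column of x, which is the condition the
  # "Dimension mismatch" error in the question complains about
  stopifnot(ncol(x) == 2, length(u) == 2)
  centred <- sweep(x, 2, u)                       # shift so that u is the origin
  angles  <- seq(0, 2 * pi, length.out = ndir + 1)[-1]
  dirs    <- rbind(cos(angles), sin(angles))      # 2 x ndir unit directions
  proj    <- centred %*% dirs                     # n x ndir projections
  counts  <- colSums(proj >= 0)                   # points in each closed half-plane
  min(counts) / nrow(x)                           # smallest fraction found
}

# Four points around the origin: every half-plane through the origin
# contains at least two of them
square <- rbind(c(1, 0), c(-1, 0), c(0, 1), c(0, -1))
approx_tukey_depth(c(0, 0), square)
approx_tukey_depth(c(10, 10), square)   # a point far outside the data
```

The key point for the question is the shape of u: a vector with one value per variable, such as c(math_score, reading_score), rather than the number of observations.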

How can I plot mixed integer problem result?

I solved a linear bi-objective mixed integer problem and I want to plot the results. The results include lines and points, for example:
list=[([0.0; 583000.0], 0), ([190670.0; 149600.0], 0), ([69686.0, 385000.0], 1), ([33296.0, 484000.0], 1), ([136554.0, 2.38075e5], 1), ([24556.0, 503800.0], 0), ([47462.0, 437800.0], 1), ([129686.0, 253000.0], 1), ([164278.0, 178200.0], 1)]
In this list, each element is a point together with a flag. For the third element, ([69686.0, 385000.0], 1), the flag 1 means that this point is connected by a line to the prior point; a flag of 0 means there is no such connection.
I coded that as follows:
using JuMP,Plots
list=[([0.0, 583000.0], 0), ([24556.0, 503800.0], 0), ([33296.0, 484000.0],1), ([47462.0, 437800.0], 1), ([69686.0, 385000.0], 1), ([129686.0, 253000.0], 1), ([136554.0, 23805.0], 1), ([164278.0, 178200.0], 1), ([190670.0, 149600.0], 0)]
x=zeros(1,1)
for i=1:size(list,1)
x=[x;list[i][1][1]]
end
row=1
x = x[setdiff(1:end, row), :]
y=zeros(1,1)
for i=1:size(list,1)
y=[y;list[i][1][2]]
end
row=1
y = y[setdiff(1:end, row), :]
for i=2:size(list,1)
if list[i][2]==0
plot(Int(x[i]),Int(y[i]),seriestype=:scatter)
plot(Int(x[i+1]),Int(y[i+1]),seriestype=:scatter)
end
if list[i][2]==1
plot(Int(x[i]),Int(y[i]))
plot(Int(x[i+1]),Int(y[i+1]))
end
end
But it does not work. Could you please help me?
Thanks
You can simply push each line segment's x and y values to two separate arrays, x and y in the code below. After each line segment's values (i.e. x1 and x2, or y1 and y2), put a NaN into the arrays. This prevents a line segment from being connected to the next one where there should be no connection (for example, when a 1 is followed by a 0). Finally, call plot(x, y).
The following code snippet does this. Note that allx and ally hold all points regardless of connection status; you may want to exclude connected points from them. x and y hold the connected line segments.
using Plots
x, y, allx, ally = Float64[], Float64[], Float64[], Float64[]
# iterate through list
for i = 1:length(list)-1
    if list[i+1][2] == 1
        # push x1 from the first point, x2 from the second and a `NaN`
        push!(x, list[i][1][1], list[i+1][1][1], NaN)
        # push y1, y2, `NaN`
        push!(y, list[i][1][2], list[i+1][1][2], NaN)
    end
    push!(allx, list[i][1][1])
    push!(ally, list[i][1][2])
end
push!(allx, list[end][1][1])
push!(ally, list[end][1][2])
# scatter all points
scatter(allx, ally)
# plot connections with markers
plot!(x, y, linewidth=2, color=:red, marker=:circle)
This should hopefully give you the plot you wanted.
If you happen to use Gadfly.jl instead of Plots.jl, you can get a similar plot with
using Gadfly
connectedpoints = layer(x=x, y=y, Geom.path, Geom.point, Theme(default_color="red"))
allpoints = layer(x=allx, y=ally, Geom.point)
plot(connectedpoints, allpoints)
As a side note, if you plan to plot another series on top of a plot object that has already been created, you should use plot! instead of plot.

slice3Drgl R plot several slices

My dataset is a 3-dimensional array (30,15,4) representing temperature through the water column at four different depths (0.2 (layer 1), 0.4 (layer 2), 0.6 (layer 3), 0.8 (layer 4)). I am trying to get 3D plots with 4 layers. So far, I have used the slice3Drgl function. However, in the plots I get, the 2nd and 3rd layers look the same (T data from the 0.4 depth is plotted twice), and the fourth layer (the upper layer in the plot) shows T data belonging to the 0.6 depth.
This is my code:
dd<-array(d$T,dim=c(30,15,4))
x = seq(0.126,3.780, by=0.126)
y=seq(0.125,1.875,by=0.125)
z = seq(0.2,0.8,by=0.2)
grid<-mesh(x,y,z)
colvar<-with(grid,dd)
col=jet.col(100)
slice3Drgl (x, y, z, xs=NULL, ys=1,zs=c(0.2,0.4,0.6,0.8),colvar = colvar, col=col, clim = c( -1.392,0),theta = 60,smooth=TRUE,
colkey(clim = c(-1.392,0),at = do.breaks(c(-1.392,0), 2)))
This is the plot I get:
I have checked the data and it is correct: there is no repeated T data in the 2nd and 3rd layers.
In addition when defining zs as:
zs<-z[seq(1,length(z),len=4)]
The 3D plot improves a bit. I can see the correct T data on layers 1, 2 and 3, but the 4th layer still shows T data corresponding to the 0.6 depth instead of the 0.8 depth.
I hope my question is reasonably clear and that someone can give me some useful insight;
unfortunately, there is not much information about this function on the web.
Thanks in advance,
Best,
Amaia
This error originates with the zs parameter, for which the package documentation provides no information. Here is a workaround that I have used in the past:
dt<-seq(0.01,27, by=0.01)
dd<-array(dt,dim=c(30,15,5))
x = seq(0.126,3.780, by=0.126)
y=seq(0.125,1.875,by=0.125)
z = seq(0.2,1.0,by=0.2)
grid<-mesh(x,y,z)
colvar<-with(grid,dd)
col=jet2.col(100)
slice3Drgl (x, y, z, xs=NULL, ys=NULL,zs=z,zlim=c(0.2,0.8),colvar = colvar, col=col, clim = c(0,18),theta = 60,smooth=FALSE,
colkey(clim = c(0,18),at = do.breaks(c(0,18), 2)))
Basically, all you need to do is add a dummy z layer to the data and then ignore it during plotting by setting zlim.
I hope this helps.

Hexbin: apply function for every bin

I would like to build a hexbin plot where, for every bin, the ratio between the class 1 and class 2 points falling into that bin is plotted (either log or not).
x <- rnorm(10000)
y <- rnorm(10000)
h <- hexbin(x,y)
plot(h)
l <- as.factor(c( rep(1,2000), rep(2,8000) ))
Any suggestions on how to implement this? Is there a way to introduce function to every bin based on bin statistics?
@cryo111's answer has the most important ingredient: IDs = TRUE. After that, it's just a matter of figuring out what you want to do with the Inf values and how much you need to scale the ratios by to get integers that will produce a pretty plot.
library(hexbin)
library(data.table)
set.seed(1)
x = rnorm(10000)
y = rnorm(10000)
h = hexbin(x, y, IDs = TRUE)
# put all the relevant data in a data.table
dt = data.table(x, y, l = c(1,1,1,2), cID = h@cID)
# group by cID and calculate whatever statistic you like
# in this case, ratio of 1's to 2's,
# and then Inf's are set to be equal to the largest ratio
dt[, list(ratio = sum(l == 1)/sum(l == 2)), keyby = cID][,
     ratio := ifelse(ratio == Inf, max(ratio[is.finite(ratio)]), ratio)][,
     # scale up (I chose a scaling manually to get a prettier graph)
     # and convert to integer and change h
     as.integer(ratio*10)] -> h@count
plot(h)
You can determine the number of class 1 and class 2 points in each bin by
library(hexbin)
library(plyr)
x=rnorm(10000)
y=rnorm(10000)
#generate hexbin object with IDs=TRUE
#the object includes then a slot with a vector cID
#cID maps point (x[i],y[i]) to cell number cID[i]
HexObj=hexbin(x,y,IDs = TRUE)
#find count statistics for first 2000 points (class 1) and the rest (class 2)
CountDF=merge(count(HexObj@cID[1:2000]),
              count(HexObj@cID[2001:length(x)]),
              by="x",
              all=TRUE
             )
#replace NAs by 0
CountDF[is.na(CountDF)]=0
#check if all points are included
sum(CountDF$freq.x)+sum(CountDF$freq.y)
But printing them is another story. For instance, what if there are no class 2 points in one bin? The fraction is not defined then.
In addition, as far as I understand hexbin is just a two dimensional histogram. As such, it counts the number of points that fall into a given bin. I do not think that it can handle non-integer data as in your case.
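One possible way around the undefined fraction, sketched in base R: add a small pseudo-count to both class counts before taking the log ratio, so that empty classes give a finite value. The vectors cID and label below are made-up stand-ins for the cell IDs and class labels, and the pseudo-count of 0.5 is an arbitrary assumption, not something from either answer.

```r
# Hypothetical inputs: cID stands in for hexbin(x, y, IDs = TRUE)@cID,
# label for the class (1 or 2) of each point.
cID   <- c(1, 1, 1, 2, 2, 3, 3, 3)
label <- c(1, 2, 2, 1, 1, 1, 2, 1)

# Per-cell counts of each class
n1 <- tapply(label == 1, cID, sum)
n2 <- tapply(label == 2, cID, sum)

# Smoothed log ratio: the 0.5 pseudo-count is an arbitrary choice that keeps
# the value finite when a cell has no class 2 (or no class 1) points
log_ratio <- log((n1 + 0.5) / (n2 + 0.5))
log_ratio
```

Here cell 2 has no class 2 points, yet its log ratio is still finite. Whether a pseudo-count or capping at the largest finite ratio is more appropriate depends on how the plot will be read.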

Sample regression, x = months, huge bandwidths

I have two vectors, x and y.
x is a vector where each entry represents a month for a period of several years, so I have (let's say) 10 years of data, then length(x) = 120 and so on.
(I have used as.POSIXct so they really are "months" in that sense, but couldn't I just have x as a numerical vector like c(1:n), since I already know which month and which year a certain element of c(1:n) corresponds to? I.e. if x = c(1:n), I know that x[13] is January of the second year, and so on.)
y is a vector where each element is an observation of a particular variable in a certain month.
So the observed data is grouped like this (january,0.123), (february,2.125) and so on.
I have two vectors for the months;
x1 = seq(as.POSIXct("YYYY-MM-DD", tz="GMT"),
as.POSIXct("YYYY-MM-DD", tz="GMT"),
by="month")
x2 = c(1:length(x1))
What I want to do is to run ksmooth:
plot(x1,y)
smooth = ksmooth(x2,y,"normal")
lines(smooth)
The reason that I use x1 in the plot() command is that I don't know how to otherwise get the x-axis in time.
R should automatically find a decent smoothing parameter when I haven't specified anything. The result is that ksmooth$y is equal to the input vector y! Also, a vertical bar is produced in the plot. If I replace x2 by x1 in the code above, ksmooth$y is NA for all values except for the first and last, which equal those of the input y.
So I tried some bandwidths:
h = 0.1: now smooth$y = y, as before. A vertical bar is produced (it is the same color as I specified in the lines() command, so it must have to do with the ksmooth command.)
h = 10: get some non-strange results for smooth$y, however, a vertical bar is produced as before.
Then, I tried the crazy idea of very large bandwidths;
h = 1e+06: This produced nothing when I used x1 and x2 as in the code above. When I changed x2 to x1 however, I get some good results. For h = 1e+09 (that's huge!!) I get a very nice result. (I get a curve that fits the data and looks nice)
But is h = 1e+09 reasonable? In all the examples I have looked at, h is somewhere between 0.1 and 10, give or take. I have also heard of a rule of thumb that h should equal n^(-1/5), where n is the number of data points.
I think the one thing that you are missing is that R doesn't find a decent smoothing parameter when you haven't specified anything, it just uses a bandwidth of 0.5, which is totally useless in your case.
The other thing you might be missing is that in ksmooth the bandwidth parameter is expressed in the units of x. When ksmooth is given x values of class POSIXct, it converts them to numeric, which is a number of seconds, so your bandwidth is measured in seconds, which is not what you want. When ksmooth is given x as a month index, it defaults to a bandwidth of 0.5 months, which is also too small to be useful.
What you want to do is specify a reasonable bandwidth for the x that you are using. Here is an example:
x1 = seq(as.POSIXct("2000-01-01", tz="GMT"),
as.POSIXct("2010-12-31", tz="GMT"),
by="month")
x2 = c(1:length(x1))
set.seed(1)
y = runif(length(x1))
plot(x1,y,type='l')
smooth = ksmooth(x2,y,"normal")
lines(x1,smooth$y,col='blue',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=2)$y,col='red',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=10)$y,col='green',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=20)$y,col='orange',lwd=2)
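The unit issue can be checked directly. The sketch below is only an illustration: the dates are made up, and months_to_seconds is a hypothetical helper that converts a bandwidth in months to seconds using the median spacing of the time stamps.

```r
# Monthly time stamps, as in the question (the dates are illustrative)
x1 <- seq(as.POSIXct("2000-01-01", tz = "GMT"),
          as.POSIXct("2009-12-01", tz = "GMT"),
          by = "month")
set.seed(1)
y <- runif(length(x1))

# Typical spacing between observations, in seconds
step_seconds <- median(diff(as.numeric(x1)))

# Hypothetical helper: express a bandwidth given in months in seconds
months_to_seconds <- function(months) months * step_seconds

# Same nominal bandwidth (10 months) on the two x scales
fit_index <- ksmooth(seq_along(x1), y, "normal", bandwidth = 10)
fit_time  <- ksmooth(as.numeric(x1), y, "normal",
                     bandwidth = months_to_seconds(10))
```

One month is roughly 2.6 million seconds, which is why bandwidths on the order of 1e+06 and above only start to have an effect when x1 is used: 1e+06 seconds is about 11.6 days. The two fits need not agree exactly, since calendar months differ in length.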
