Identify outliers in 3D space - math

I want to plot the trajectory of a rocket using X, Y, and Z coordinates. I think the first step is to identify and eliminate from the data set any outliers. How should I approach this issue (i.e., identifying outliers)?
I'm not a math/stats person, so I've read through Stack Overflow looking for options, but haven't found any that seem appropriate.
I have no code to show.
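One possible starting point, offered as a minimal sketch rather than a definitive method: assuming the samples are in time order, fit a robust smooth curve to each coordinate, measure how far each sample sits from the smoothed path, and flag samples whose distance is far above the typical one. The helper name flag_outliers, the span and threshold values, and the simulated track are all assumptions made for illustration; only base R (loess, mad) is used.

```r
# Hypothetical helper: flag samples that sit far from a robustly smoothed path.
# Assumes the rows of xyz are in time order.
flag_outliers <- function(xyz, span = 0.3, threshold = 6) {
  xyz <- as.matrix(xyz)
  idx <- seq_len(nrow(xyz))
  # Residual of each coordinate from a robust local quadratic fit over the sample index.
  res <- apply(xyz, 2, function(v) {
    residuals(loess(v ~ idx, span = span, degree = 2, family = "symmetric"))
  })
  # Euclidean distance of each sample from the smoothed path.
  dist <- sqrt(rowSums(res^2))
  # Flag distances far above the median, measured in robust (MAD) units.
  dist > median(dist) + threshold * mad(dist)
}

# Simulated track (invented data): smooth flight path plus noise and two glitches.
set.seed(42)
secs <- seq(0, 10, length.out = 200)
track <- cbind(x = 50 * secs, y = 20 * secs, z = 100 * secs - 4.9 * secs^2) +
  matrix(rnorm(600, sd = 0.5), ncol = 3)
track[c(40, 120), ] <- track[c(40, 120), ] + 200

flagged <- flag_outliers(track)
which(flagged)            # indices of suspected outliers
clean <- track[!flagged, ]
```

The threshold trades false alarms against missed glitches, so it would need tuning against real telemetry.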

Related

Jump y axis values when highest value is too far away from the other points

Basically, I'm building an area graph with Chart.js. The data I'm using usually contains a peak that is much higher than the rest of the points, so the y-axis range becomes too large to show the differences between the lower points, and they look almost like a line parallel to the x-axis, as we can see in this image:
Graph with problems
The solution I want to try is to skip the values from the y-axis between the lower points and the peak of the graph, and accomplish a graph presentation similar to this one:
Solution graph sketch
As we can see in this sketch, the y-axis has a normal scale until 300, but then, as the next point is too far away from the others, the y-axis values in between are skipped.
So what I want to know is whether this jump in the y-axis values is possible to achieve with this library (Chart.js) and, if so, where I can find documentation about it, because I already looked everywhere and couldn't find a thing. If not, I would ask you for recommendations of any other libraries where I could achieve this.

R plotting strangeness with large dataset

I have a data frame with several million points in it - each having two values.
When I plot this like this:
plot(myData)
All the points are plotted, but the plot is quite busy, so I thought I'd plot it as a line:
plot(myData, type="l")
But while the x axis doesn't change (i.e. goes from 0 to 7e+07), the actual plotting stops at about 3e+07 and I don't actually get a proper line plot either.
Is there a limitation on line plotting?
Update
If I use
plot(myData, type="h")
I get correct and useable output, but I still wonder why the type="l" option fails so badly.
Further update
I am plotting a time series - here is one output using type="h":
That's perfectly usable, but having a line would allow me to compare several outputs.
Graphical representation of high-dimensional data is a growing issue in data analysis. The problem is not actually creating the graph; it is making the graph communicate information that we can turn into useful knowledge. Allow me to illustrate this point with a data set of a million observations, which is not that big.
x <- rnorm(10^6, 0, 1)
y <- rnorm(10^6, 0, 1)
Let's plot it. R can easily manage such a problem. But can we? Probably not.
After all, what kind of information can we deduce from a solid ink stain? Probably no more than a tasseographer trying to divine the future in patterns of tea leaves, coffee grounds, or wine sediments.
plot(x, y)
A different approach is offered by the smoothScatter function, which creates a density plot of bivariate data. Here are two examples.
First, with defaults.
smoothScatter(x, y)
Second, the bandwidth is set explicitly to be larger than the default, and the five points in the lowest-density regions (nrpoints=5) are drawn with a different symbol, pch = 3.
smoothScatter(x, y, bandwidth=c(5,1)/(1/3), nrpoints=5, pch=3)
As you can see, the problem is not solved. Nevertheless, we get a better grasp of the distribution of our data. This kind of approach is still in development, and several aspects are still being discussed and refined. If it seems a more suitable way to represent your big dataset, I suggest you visit this blog, which discusses the issue thoroughly.
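In the same spirit, here is a minimal base R sketch of a related idea (swapped in for illustration, not the method smoothScatter uses internally): count the points falling in each cell of a regular grid and draw the counts as an image. The grid size nbin and the log scaling are arbitrary assumptions.

```r
set.seed(1)
x <- rnorm(1e5)
y <- rnorm(1e5)

nbin <- 64                                   # assumed grid resolution
bx <- seq(min(x), max(x), length.out = nbin + 1)   # cell boundaries on x
by <- seq(min(y), max(y), length.out = nbin + 1)   # cell boundaries on y

# Cell index of every point along each axis (values 1..nbin).
ix <- findInterval(x, bx, all.inside = TRUE)
iy <- findInterval(y, by, all.inside = TRUE)

# Count points per cell; rows index x cells, columns index y cells.
counts <- matrix(tabulate((iy - 1) * nbin + ix, nbins = nbin^2), nrow = nbin)

# log1p keeps sparse cells visible next to the dense centre.
image(bx, by, log1p(counts), xlab = "x", ylab = "y")
```

Because only the nbin x nbin count matrix is drawn, the cost of rendering no longer grows with the number of observations.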
For what it's worth, all the evidence I have is that the computer - even though it was a lump of big iron - ran out of memory.

Difficulties with adding arrows to plot in R

I am attempting to project data onto a plot in R and see the correlation between the points. I have added a line to let the reader see the connection between these points. However, I am stumped when it comes to adding arrows to show the direction of the line. rddproj is just an arbitrary name given to the data. Three pairs of x and y coordinates are plotted: x=c(-0.7159425, -0.8129311, -0.7392371); y=c(0.7743088, 0.7732762, 0.7490996). Here is the example:
x<-rddproj[1:3,1]; y<-rddproj[1:3,2]
plot(x,y)
My concern is that the second pair of coordinates is the most negative point on the x-axis. When drawing a line with arrows, the arrow will most likely point towards this point, when the line should instead form a V with that point in the middle. Is it possible to plot arrows that reflect the order of the points in a group, and not just go from the most positive point to the most negative point or vice versa?
The arrows function (a modified segments function) is used for this purpose in base R (to the extent that I understand the question):
# fixed your assignment code.
x <- c(-0.7159425, -0.8129311, -0.7392371)
y <- c(0.7743088, 0.7732762, 0.7490996)
# empty plot, then one arrow from each point to the next
plot(NA, xlim=range(x), ylim=range(y) )
arrows(head(x,-1),head(y,-1),tail(x,-1), tail(y,-1), angle=30)
An alternative reading of your question would have the glaringly obvious solution: plot(x,y), which I hope is not what you were asking, since that should have been satisfactory.

Polygon/contour around subset of vertices on graph (more precise than mark.groups in igraph)

Problem definition
I need to produce a number of specific graphs, and on these graphs, highlight subsets of vertices (nodes) by drawing a contour/polygon/range around or over them (see image below).
A graph may have multiple of these contours/ranges, and they may overlap, iff one or more vertices belong to multiple subsets.
Given a graph of N vertices, any subset may be of size 1..N.
However, vertices not belonging to a subset must not be inside the contour (as that would be misleading, so that's priority no. 1). This is the gist of my problem.
All these graphs happen to have the property that the ranges are continuous, as the data they represent covers only directly connected subsets of vertices.
All graphs will be undirected and connected (no unconnected vertices will ever be plotted).
Reproducible attempts
I am using R and the igraph package. I have already tried some solutions, but none of them work well enough.
First attempt, mark.groups in plot.igraph:
library(igraph)
g = make_graph("Frucht")
l = layout.reingold.tilford(g,1)
plot(g, layout=l, mark.groups = c(1,3,6,12,5), mark.shape=1)
# bad, vertex 11 should not be inside the contour
plot(g, layout=l, mark.groups = c(1,6,12,5,11), mark.shape=1)
# 3 should not be in; image below
# just choosing another layout here is not a generalizable solution
The plot.igraph function calls igraph.polygon, which calls convex_hull (also igraph), which calls xspline. The result is, from what I understand, something called a convex hull (which otherwise looks very nice!), but for my purposes that is not precise enough, since it covers vertices that should not be covered.
Second attempt with contour. So I tried implementing my own version, based on the solution suggested here:
library(MASS)
xx <- runif(5, 0, 1); yy <- abs(xx) + rnorm(5, 0, 0.2)
plot(xx, yy, xlim = c(min(xx) - sd(xx), max(xx) + sd(xx)), ylim = c(min(yy) - sd(yy), max(yy) + sd(yy)))
dens2 <- kde2d(xx, yy, lims = c(min(xx) - sd(xx), max(xx) + sd(xx), min(yy) - sd(yy), max(yy) + sd(yy)), h = c(bandwidth.nrd(xx) / 1.5, bandwidth.nrd(xx) / 1.5), n = 50)
contour(dens2, level = 0.001, col = "red", add = TRUE, drawlabels = FALSE)
The contour plot looks in principle like something I could use, given enough tweaking of the bandwidth and level values (to make the contour snug enough that it doesn't cover any points outside the group). However, this solution has the drawback that when the level value is too small, the contour breaks (doesn't produce a continuous area), so if I went that way, I would need to check for continuity (and determine good bandwidth/level values on the fly) automatically. Another problem is that I cannot quite see how I could plot the contour over the plots produced by igraph: the layout.* commands produce what looks like a coordinate matrix, but the coordinates do not match the axis coordinates on the plot:
# compare:
layout.reingold.tilford(g,1)
plot(g, layout=l, axes=T)
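A likely explanation for the mismatch, stated here as an assumption rather than something verified against this graph: plot.igraph has a rescale argument that defaults to TRUE, which maps the layout coordinates linearly into [-1, 1] on each axis before drawing. The base R sketch below reproduces that mapping on a small stand-in matrix; rescale_layout and l_demo are hypothetical names, not igraph functions.

```r
# Hypothetical helper: map each column of a layout matrix linearly into [-1, 1],
# mimicking what plot.igraph is assumed to do when rescale = TRUE.
rescale_layout <- function(l) {
  apply(l, 2, function(v) {
    r <- range(v)
    # A constant column has no spread; centre it at 0.
    if (diff(r) == 0) rep(0, length(v)) else 2 * (v - r[1]) / diff(r) - 1
  })
}

# Stand-in for the matrix returned by a layout.* function (invented values).
l_demo <- cbind(c(0, 1, 2, 3), c(0, -1, -1, -2))
l_plot <- rescale_layout(l_demo)
l_plot   # coordinates in the units of the plot axes
```

Under the same assumption, calling plot with rescale=FALSE and xlim/ylim taken from range(l[,1]) and range(l[,2]) should keep the original layout coordinates on the axes instead.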
The question:
What would be a better way to achieve the plotting of such ranges on graphs (ideally igraphs) in R that would meet the criteria outlined above - ranges that include only the vertices that belong to their subset and exclude all else - while being continuous ranges?
The solution I am looking for should be scalable to graphs of different sizes and layouts that I may need to create (so hand-tweaking each graph using e.g. tkplot is not a good solution). I am aware that on some graphs with some vertex groups, meeting both criteria will indeed be impossible in practice, but intuitively it should be possible to implement something that still works most of the time with smallish (10..20 vertices) and not-too-complex graphs (ideally it would be possible to detect and give a warning if a perfectly fitting range could not be plotted). Either an improvement of the mark.groups approach (not necessarily within the package, but using the hull idea mentioned above), or something with contour or a similar suitable function, or something else entirely would be welcome, as long as it works (most of the time).
Update stemming from the discussion: a solution that only utilizes functions of core R or CRAN packages (not external software) is desirable, since I will eventually want to incorporate this functionality in a package.
Edit: specified the last paragraph as per the comments.
The comment area is not long enough to fit my answer, so I'm putting it here, although I'd rather post it as a comment since it is not a full solution.
It's quite a long shot, but the first thing that popped into my mind is support vector machines. The idea would be to construct a support vector machine classifier that classifies your points into two groups (in or out) based on the coordinates of the vertices, using some non-linear kernel function (I would try the radial basis function). Then you plot the separating hyperplane of the trained support vector machine (with a non-linear kernel, this appears as a curved decision boundary in the plane of the layout). One drawback is that the area you obtain this way might be unbounded (i.e. go to infinity in some directions), so this idea definitely requires some further thinking, but at least that's one possible direction to go.
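A rough sketch of that idea, assuming the CRAN package e1071 is installed (the answer does not name a package, so this choice is an assumption). The coordinates, the group membership, and the gamma and cost values are all invented for illustration; the boundary is drawn as the zero contour of the classifier's decision values on a grid.

```r
if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(1)
  # Stand-in for vertex layout coordinates (invented, not from a real graph).
  xy <- cbind(x = runif(12), y = runif(12))
  # Assumed membership: the first five vertices form the highlighted subset.
  member <- factor(ifelse(seq_len(12) <= 5, "in", "out"))

  # Radial-basis SVM; gamma and cost are arbitrary and would need tuning.
  fit <- e1071::svm(x = xy, y = member, kernel = "radial",
                    gamma = 10, cost = 100, scale = FALSE)

  # Evaluate the decision function on a grid covering the layout.
  gx <- seq(-0.2, 1.2, length.out = 100)
  gy <- seq(-0.2, 1.2, length.out = 100)
  grid <- as.matrix(expand.grid(x = gx, y = gy))
  pred <- predict(fit, grid, decision.values = TRUE)
  dv <- matrix(attr(pred, "decision.values"), nrow = length(gx))

  # The zero level of the decision values is the class boundary.
  plot(xy, pch = 19, col = ifelse(member == "in", "red", "grey40"))
  contour(gx, gy, dv, levels = 0, add = TRUE, drawlabels = FALSE)
}
```

Nothing here guarantees that the boundary is a single closed region or that every vertex lands on the correct side, so a check of the predictions at the vertex coordinates would still be needed.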

Force starting point of lines()

Perhaps because the question is so basic, the keywords that I can think of for this question all direct me to other things. I am trying to draw a graph with a spiky line that connects the medians. The real data is very big, but the starting values are duplicates of (0,0):
# use = (not <-) inside data.frame() so the columns are actually named time and conc
DATA<-data.frame(time=sort(rep(c(0,2,4,8,12),4)),
conc=c(rep(0,4),rnorm(n=4,mean=30),
rnorm(n=4,mean=10),
rnorm(n=4,mean=35),
rnorm(n=4,mean=15)))
# Create blank graph
plot(NULL,NULL,xlab="Time",ylab="Conc",
xlim=c(0,15),ylim=c(0,40),main="Example")
# Add line
require(quantreg)
require(plyr)
require(MatrixModels)
DATA<-plyr::arrange(DATA,time)
fit3<-rqss(DATA$conc~qss(DATA$time,constraint="N"),tau=0.5,data = DATA)
lines(unique(DATA$time)[-1],fit3$coef[1] + fit3$coef[-1],lwd=2)
As you can see, the line does not connect to the starting (0,0) values and instead starts at the next time point.
I was tempted to cheat, but it does not connect to the lines and I would really prefer to work it out with the rest of the code instead of trying to pass off two lines as one:
# Cheating getaway but does not work well, segments are not connected
segments(x0=0,y0=0,x1=2,y1=30,lwd=2)
Some relevant answers that I found were not appropriate for my situation.
Line in R plot should start at a different timepoint, for example, suggests modifying the data, which would not help to extend my line; besides, my actual data is so big that I would be wary of doing this kind of manipulation. I would not want to use plot(x,y,type="l"), even though it goes through the (0,0) point, because 1) it looks bad on the huge data, and 2) I would have to overlay another similar line using lines(). I wonder whether it has more to do with rqss and less with lines?
I apologize if this has already been asked before.
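One reading of the lines() call above, offered as an assumption about how rqss lays out its coefficients rather than a confirmed fact: the first unique time acts as the baseline, so its fitted value is the intercept alone, and both the [-1] on the times and the [-1] on the coefficients drop that first point. The sketch below uses an invented stand-in for fit3$coef (coef_demo is hypothetical, not output from rqss) to show how the first point could be put back.

```r
# Stand-in for fit3$coef: intercept first, then one coefficient per remaining
# unique time. These numbers are invented for illustration only.
coef_demo <- c(0, 30, 10, 35, 15)
times <- c(0, 2, 4, 8, 12)

# Assumption: the fitted value at the first time is the intercept by itself;
# every later time adds its own coefficient to the intercept.
fitted_demo <- c(coef_demo[1], coef_demo[1] + coef_demo[-1])

plot(NA, type = "n", xlab = "Time", ylab = "Conc",
     xlim = c(0, 15), ylim = c(0, 40), main = "Example")
lines(times, fitted_demo, lwd = 2)   # one connected line starting at time 0
```

With the real fit, the corresponding call would be lines(unique(DATA$time), c(fit3$coef[1], fit3$coef[1] + fit3$coef[-1]), lwd=2), again only if the assumption about the coefficient layout holds.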
