Collision detection and nearing in R - r

I have x and y values for points (on a grid with discrete steps). I want to find those points that are in the same position or within a certain range from another point. I tried with the functions match(), duplicated(), which(), for loops, and if cases of different kinds and somehow got stuck.
As an example:
x <- c(23, 45, 98, 23, 12)
y <- c(15, 90, 10, 15, 70)
[1] and [4] would 'collide' in this case.
x <- c(24, 45, 98, 23, 12)
y <- c(14, 90, 10, 15, 70)
range<-1
[1] and [4] would again 'collide' in this case.
Either index or values of the points will do, however I will need one information per collision.

This is brute force but should work well as long and x and y are not massive.
x <- c(24, 45, 98, 23, 12)
y <- c(14, 90, 10, 15, 70)
range <- 2
temp = as.matrix(dist(cbind(x, y)))
diag(temp) = Inf
unique(t(apply(which(temp < range, arr.ind = TRUE), 1, sort)))
# [,1] [,2]
#4 1 4

Related

How are residuals of aov() calculated?

I'm wondering how residuals in aov() are calculated. I looked already for hours but can't figure it out.
I use an ANOVA for repeated measurements.
Data <- data.frame(subject = factor(rep(1:10, 3)),
age = factor(c(rep(4, 10),
rep(10, 10),
rep(35, 10))),
weight = c(20, 9, 16, 14, 30, 26, 26, 27, 13, 15,
27, 18, 30, 26, 43, 48, 38, 38, 22, 47,
50, 44, 52, 46, 64, 70, 73, 57, 54, 63))
ANOVA_MW <- aov(weight ~ age +
Error(subject / age),
data = Data)
summary(ANOVA_MW)
I know that the following command gives me something.
round(ANOVA_MW$subject:age$residuals, 2)
However, I get only 20 rather than 30 values. It starts with 11. This has propably something to do with the residuals of subject. I don't know.
The result of proj(ANOVA_MW) gives me the residuals that I calculated manually (value - personal mean - group mean + overall mean).
My question is, what are the other residuals above and why is everybody (so it feels) using them for normality testing?
I would love some helpful input. I already dove into the function but could not find an explanation.
Thanks.
residual sum of square = total sum of square - Factor sum of squares
In your case, factor is age.
The residuals should be normally distributed, it is one of the assumption of ANOVA.

How do I order components of bars within bar plot in R/ggplot2

Similar questions to this have been asked, but I have not been able to apply the suggested solutions successfully.
I have created a plot like so;
> elective_ga <- c(68, 51, 29, 10, 5)
> elective_epidural <- c(29, 42, 19, 3, 1)
> elective_cse <- c(0, 0, 0, 20, 7)
> elective_spinal <- c(3, 7, 52, 67, 87)
> years <- c('1982', '1987', '1992', '1997', '2002')
> values <- c(elective_ga, elective_epidural, elective_cse, elective_spinal)
> elective_technique <- data.frame(years, values)
> p <- ggplot(elective_technique, aes(years, values))
> p +geom_bar(stat='identity', aes(fill=c(rep('GA', 5), rep('Epidural', 5), rep('CSE', 5), rep('Spinal', 5)))) +labs(x='Year', y='Percent', fill='Type')
which produces the following chart;
I was expecting the bars to be stacked in the order (from top to bottom) GA, Epidural, CSE, Spinal. I would have thought the way I constructed the data frame that they should be ordered in this way but obviously I have not. Can anyone explain why the bars are ordered the way they are, and how to get them the way I want?
How about this?
elective_ga <- c(68, 51, 29, 10, 5)
elective_epidural <- c(29, 42, 19, 3, 1)
elective_cse <- c(0, 0, 0, 20, 7)
elective_spinal <- c(3, 7, 52, 67, 87)
years <- c('1982', '1987', '1992', '1997', '2002')
values <- c(elective_ga, elective_epidural, elective_cse, elective_spinal)
Type=c(rep('GA', 5), rep('Epidural', 5), rep('CSE', 5), rep('Spinal', 5))
elective_technique <- data.frame(years, values,Type)
elective_technique$Type=factor(elective_technique$Type,levels=c("GA","Epidural","CSE","Spinal"))
p <- ggplot(elective_technique, aes(years, values,fill=Type))+geom_bar(stat='identity') +
labs(x='Year', y='Percent', fill='Type')
One way is to reorder the levels of the factor.
library(ggplot2)
elective_ga <- c(68, 51, 29, 10, 5)
elective_epidural <- c(29, 42, 19, 3, 1)
elective_cse <- c(0, 0, 0, 20, 7)
elective_spinal <- c(3, 7, 52, 67, 87)
years <- c('1982', '1987', '1992', '1997', '2002')
values <- c(elective_ga, elective_epidural, elective_cse, elective_spinal)
type = c(rep('GA', 5), rep('Epidural', 5), rep('CSE', 5), rep('Spinal', 5))
elective_technique <- data.frame(years, values, type)
# reorder levels in factor
elective_technique$type <- factor(elective_technique$type,
levels = c("GA", "Epidural", "CSE", "Spinal"))
p <- ggplot(elective_technique, aes(years, values))
p +
geom_bar(stat='identity', aes(fill = type)) +
labs(x = 'Year', y = 'Percent', fill = 'Type')
The forcats package may provide a cleaner solution.

Highlight few points in plot

I try to plot the following data with few positions (points) to highlight
plot(b$pos,b$log_p,col==ifelse(b$pos==c(14824849,13920386,14837470),90,100), pch=19, xlab='Chromosome 21 position', ylab='-log10(p)')
The plot produced, only show one point highlighted red with the following warning message:
In b$pos == c(14824849, 13920386,14837470) : longer object length is not a multiple of shorter object length
OK, the issue is likely to be your condition in the ifelse. If you attempt the condition (b$pos==c(14824849,13920386,14837470)) outside of your ifelse() you will get an error message along the lines of:
longer object length is not a multiple of shorter object length
If you change the condition to:
b$pos %in% c(14824849,13920386,14837470)
You will get a vector of TRUE/FALSE values determining whether each entry in b$pos is present in the vector (14824849,13920386,14837470) rather than whether the entries in b$pos are equal to c(14824849,13920386,14837470).
x = c(49, 7, 66, 51, 43, 70, 35, 53, 6, 29)
y = c(10, 98, 44, 31, 37, 14, 64, 84, 4, 34)
x %in% c(6, 7)
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
plot(x, y, col=ifelse(x %in% c(6, 7), 'red', 'blue'))
Now this dataset has 10 x values, if you were to write this:
plot(x, y, col=ifelse(x == c(1, 7), 'red', 'blue'))
This would work fine, the x values would be compared against 1 and 7 alternately e.g:
49 == 1 ?
7 == 7 ?
66 == 1?
51 == 7? .... etc etc.
The error message was saying that your vector length of 3 did not exactly go into the length of the b$pos.
Within the tidyverse and ggplot you can try
library(tidyverse)
tibble(x = c(49, 7, 66, 51, 43, 70, 35, 53, 6, 29),
y = c(10, 98, 44, 31, 37, 14, 64, 84, 4, 34),
gr=x %in% c(6, 7)) %>%
ggplot(aes(x,y, col=gr)) +
geom_point(size=2) +
ggalt::geom_encircle(data= . %>% filter(gr), color="green", s_shape=0) +
theme_bw()
Using ggalt::geom_encircle function you can draw a circle around your points of interest.

Extract x and y values from cdplot()

How can I extract the exact probabilities for each factor y at any value of x with cdplot(y~x)
Thanks
Following the example from the help file of ?cdplot you can do...
## NASA space shuttle o-ring failures
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1),
levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
## CD plot
result <- cdplot(fail ~ temperature)
And this is a simple way to obtain the probabilities from the cdplot output.
# Getting the probabilities for each group.
lapply(split(temperature, fail), result[[1]])
$no
[1] 0.8166854 0.8209055 0.8209055 0.8209055 0.8090438 0.7901473 0.7718317 0.7718317 0.7579343
[10] 0.7664731 0.8062898 0.8326761 0.8326761 0.8905854 0.9185472 0.9626185
$yes
[1] 3.656304e-05 6.273653e-03 1.910046e-02 6.007471e-01 7.718317e-01 7.718317e-01 8.062898e-01
Note that result is a conditional density function (cumulative over the levels of fail) returned invisibly by cdplot, therefore we can split temperature by fail and apply the returned function over those values using lapply.
Here a simple version of getS3method('cdplot','default') :
get.props <- function(x,y,n){
ny <- nlevels(y)
yprop <- cumsum(prop.table(table(y)))
dx <- density(x, n )
y1 <- matrix(rep(0, n * (ny - 1L)), nrow = (ny - 1L))
rval <- list()
for (i in seq_len(ny - 1L)) {
dxi <- density(x[y %in% levels(y)[seq_len(i)]],
bw = dx$bw, n = n, from = min(dx$x), to = max(dx$x))
y1[i, ] <- dxi$y/dx$y * yprop[i]
}
}

R finding relative maximum from outliers

Suppose I have a vector of numbers that I want to find a general cutoff for. For example:
x <- c(35, 2, 3, 30, 1, 4, 33, 6, 36)
In this case, I would want to only extract a subset that countains 35, 30, 33, 36. In this case the cutoff would be at 30 Without hardcoding a definite cutoff, I would like my code to adapt to different vectors of numbers in order to find that cutoff.
Another example would be:
x <- c(1, 20, 42, 13, 118, 149, 130, 30, 11, 32, 120, 0.5, 0.03)
In this case, a reasonable cutoff would be around 118.
Currently I am hard coding the cutoffs because I am dealing with simple cases, however I would like to make this process more modular for more variable vectors.
You could use the quantile function
cutoff <- function(y, prob=0.7) y[y > quantile(y, prob)]
x <- c(35, 2, 3, 30, 1, 4, 33, 6, 36)
cutoff(x)
[1] 35 33 36
x <- c(1, 20, 42, 13, 118, 149, 130, 30, 11, 32, 120, 0.5, 0.03)
cutoff(x)
[1] 118 149 130 120
And you can define a different probability as desired
cutoff(x, 0.8)
[1] 149 130 120

Resources