Find two densities' point of intersection in R

I have two densities that overlap as seen in the attached picture. I want to find out where the two lines meet. How would I go about doing that?
This is the code that produced the image:
... #reading in files etc.
pdf("test-plot.pdf")
d1 <- density(somedata)
d2 <- density(someotherdata)
plot(d1)
lines(d2, col = "red")  # draw the second density in red
dev.off()
The original data is just two monodimensional vectors, so what I'm interested in is the intersection point of their densities.
I tried to use the solution shown here, but unfortunately it neither gives me a number nor draws the lines correctly.
Edit: I have found what I was looking for:

# create and plot example data
set.seed(1)
plotrange <- c(-1,8)
d1 <- density(rchisq(1000, df=2), from=plotrange[1], to=plotrange[2])
d2 <- density(rchisq(1000, df=3)-1, from=plotrange[1], to=plotrange[2])
plot(d1)
lines(d2)
# look for points of intersection
poi <- which(diff(d1$y > d2$y) != 0)
# Mark those points with a circle:
points(x=d1$x[poi], y=d1$y[poi], col="red")
# or with lines:
abline(v=d1$x[poi], col="orange", lty=2)
abline(h=d1$y[poi], col="orange", lty=2)
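Note that poi indexes the grid point just before each sign change, so d1$x[poi] is only as precise as the density grid. If a more exact crossing is needed, one refinement (a sketch, assuming as above that d1 and d2 were evaluated on the same x grid, which the shared from/to arguments guarantee) is to interpolate the difference of the two curves and find its roots:
# interpolate the difference between the curves and solve for its zeros
f <- approxfun(d1$x, d1$y - d2$y)
crossings <- sapply(poi, function(i)
  uniroot(f, interval = c(d1$x[i], d1$x[i + 1]))$root)
crossings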

intersect(x,y)
see this help file
For example, if your data are in the same data.frame df:
intersect(df$col1, df$col2)

Here is a small example extending John's answer.
require(ggplot2)
require(reshape2)
set.seed(12)
df <- data.frame(x = round(rnorm(100, 20, 10),1), y = round((100/log(100:199)),1))
# Melt and plot
mdf <- melt(df)
str(mdf)
# 'data.frame': 200 obs. of 2 variables:
# $ variable: Factor w/ 2 levels "x","y": 1 1 1 1 1 1 1 1 1 1 ...
# $ value : num 16.8 25.7 20.5 22 19 ...
ggplot(mdf) +
  geom_density(aes(x = value, color = variable))
# Find points that intersect
intersect(df$x, df$y)
# [1] 18.9 20.1 21.3 21.5 21.0 19.6 19.0 20.0 19.8
# To make the answer more complete, here is the source code of intersect.
function (x, y)
{
y <- as.vector(y)
unique(y[match(as.vector(x), y, 0L)])
}
<bytecode: 0x10285d400>
<environment: namespace:base>
# It's actually possible to use unique and match to produce the same output
unique(as.vector(df$y)[match(as.vector(df$x), df$y, 0L)])
# [1] 18.9 20.1 21.3 21.5 21.0 19.6 19.0 20.0 19.8

I'm sure your answers are correct, but here's what finally worked for me:
d1$x[abs(d1$y - d2$y) < 0.00001 & d1$x < 1000 & d1$x > 500]
(Because I really only needed to find one value and am a total R newbie, I found it difficult to follow your answers, since I don't yet understand most basic R concepts. Thank you for your help, and sorry.)
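For a single crossing in a known window, a slightly more robust variant (again just a sketch, assuming d1 and d2 share the same x grid) picks the x where the two curves are closest instead of relying on a fixed tolerance:
# x value in (500, 1000) where the two density curves are closest
win <- d1$x > 500 & d1$x < 1000
d1$x[win][which.min(abs(d1$y - d2$y)[win])]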

Related

Fill missing values with linear regression

I have a dataframe that contains 7 columns.
str(df)
'data.frame': 8760 obs. of 7 variables:
$ G1_d20_2014.SE1_ : num 25.1 25.1 25 25 25.1 ...
$ G1_d20_2014.SE4_ : num 42.4 42.3 42.3 42.3 42.3 ...
$ G1_d20_2014.SE7_ : num 34.4 34.4 34.4 34.4 34.4 ...
$ G1_d20_2014.SE22_: num 42.5 42.4 42.3 42.4 42.3 ...
$ G1_d20_2014.SE14_: num 52.5 52.5 52.5 52.5 52.4 ...
$ G1_d20_2014.SE26 : num 40.8 40.8 40.8 40.8 40.8 ...
Each column represents a unique sensor and contains that sensor's measurement data. Some of the columns contain missing values. I want to fill the data gaps in each column by linear regression. I already did this manually, but there is one important condition, and I'm looking for a function that handles it on its own, since doing this for all the columns would take too much time. Here's the condition:
Let's say G1_d20_2014.SE1_ contains missing data. Then I want to fill that sensor's data gaps using the complete data of another sensor, choosing the one with which the correlation coefficient is highest.
Here is how I did that manually:
I created a function that creates an indicator variable: it turns to 1 if the value is not NA and to 0 if it is NA. Then I added this variable as a column to the dataset:
Indvar <- function(t) {
  x <- numeric(length(t))  # 0/1 indicator vector, same length as t
  x[which(!is.na(t))] <- 1
  x[which(is.na(t))] <- 0
  return(x)
}
df$I <- Indvar(df$G1_d20_2014.SE1_)
Next I looked at which sensor's correlation coefficient with sensor 1 is highest (in this case, the correlation is highest between SE1 and SE14). Then I computed the linear regression, took the equation from it, and put it into a for loop that fills up the NA values according to the equation whenever the indicator variable is 0:
lm(G1_d20_2014.SE1_ ~ G1_d20_2014.SE14_, data = df)
for (i in 1:nrow(df)) {
  if (df$I[i] == 0) {
    df$G1_d20_2014.SE1_[i] <- 8.037 + 0.315 * df$G1_d20_2014.SE14_[i]
  }
}
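Incidentally, the same gap-filling can be written without the loop; a vectorized sketch using the coefficients above:
# vectorized equivalent of the loop above
idx <- df$I == 0
df$G1_d20_2014.SE1_[idx] <- 8.037 + 0.315 * df$G1_d20_2014.SE14_[idx]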
This works perfectly fine, but it takes too much time because I have a lot of dataframes that look like the one above.
I already tried using impute_lm from the simputation package but unfortunately it does not seem to care about where the correlation is highest before filling the data gaps. Here is what I wrote:
impute_fun <- impute_lm(df,
formula = SE1_ + SE4_ ~ SE14_ + SE26)
As I wrote SE14_ + SE26_, I checked whether it uses the values from SE14 for imputing the values in SE1, but it doesn't, as the result differs from my manual result.
Is there any function that does what I want? I'm really frustrated because I've been looking for this for over two weeks now. I'd really appreciate some help!
EDIT/Answer to @jay.sf
So I tried to make a function out of it (see below), but there's something I struggle with:
I don't know how to make the function work on every column, removing the name of the sensor currently being filled from the sapply(c("SE1_", "SE2_", ...)) vector. Obviously, if I run this for SE1_ while SE1_ is still in that vector, the correlation will be 1 and nothing happens. As you can see, the same problem affects the rest of the code, e.g. the line cor(df$SE1_, df[, x], use = "complete.obs"), which hard-codes df$SE1_, and likewise the df$SE1_imp <- ... line.
Of course I could just delete the sensor from the sapply(...) call so the first problem does not occur; I'm just wondering if there's a nicer way to do this. Same for the df$SE1_ parts: if I want to impute the values for SE2_, I have to change df$SE1_ to df$SE2_, and so on.
I tried to run the code like this (but without the SE1_ in the sapply(...), of course) and got the error: Error in df[, x] : incorrect number of dimensions.
Any ideas how to solve these issues?
impFUN <- function(df) {
  corr <- sapply(c("SE1_", "SE2_", "SE4_", "SE5_", "SE6_",
                   "SE7_", "SE12_", "SE13_", "SE14_", "SE15_",
                   "SE16_", "SE22_", "SE23", "SE24", "SE25",
                   "SE26", "SE33", "SE34", "SE35", "SE36",
                   "SE37", "SE46", "SE51", "SE52", "SE53",
                   "SE54", "SE59", "SE60", "SE61", "SE62",
                   "SE68", "SE69", "SE70", "SE71", "SE72",
                   "SE73", "SE74", "SE82", "SE83", "SE84",
                   "SE85", "SE86", "SE87", "SE99", "SE100",
                   "SE101", "SE102", "SE103", "SE104",
                   "SE106", "SE107", "SE121"), function(x)
    cor(df$SE1_, df[, x], use = "complete.obs"))
  imp.use <- names(which.max(corr))
  regr.model <- lm(reformulate(imp.use, "SE1_"))
  df$SE1_imp <-
    ifelse(is.na(df$SE1_), lm.cf[1] + df[[imp.use]]*lm.cf[2], df$SE1_)
}
What about this? First check which sensor correlates most with sensor 1.
corr <- sapply(c("sensor.2", "sensor.3", "sensor.4"), function(x)
cor(dat$sensor.1, dat[,x], use="complete.obs"))
# sensor.2 sensor.3 sensor.4
# 0.04397132 0.26880412 -0.06487781
imp.use <- names(which.max(corr))
# [1] "sensor.3"
Calculate the regression model,
lm.cf <- lm(reformulate(imp.use, "sensor.1"), dat)$coef
and to impute sensor 1 use the coefficients in an ifelse like this:
dat$sensor.1.imp <-
ifelse(is.na(dat$sensor.1), lm.cf[1] + dat[[imp.use]]*lm.cf[2], dat$sensor.1)
Result
head(dat)
# sensor.1 sensor.2 sensor.3 sensor.4 sensor.1.imp
# 1 2.0348728 -0.6374294 2.0005714 0.03403394 2.0348728
# 2 -0.8830567 -0.8779942 0.7914632 -0.66143678 -0.8830567
# 3 NA 1.2481243 -0.9897785 -0.36361831 -0.1943438
# 4 NA -0.1162450 0.6672969 -2.84821295 0.2312968
# 5 1.0407590 0.1906306 0.3327787 1.16064011 1.0407590
# 6 0.5817020 -0.6133034 0.5689318 0.71543751 0.5817020
Toy data:
library('MASS')
set.seed(42)
M <- mvrnorm(n=1e2, mu=c(0, 0, 0, 0),
Sigma=matrix(c(1, .2, .3, .1,
.2, 1, 0, 0,
.3, 0, 1, 0,
.1, 0, 0, 1), nrow=4),
empirical=TRUE)
dat <- as.data.frame(`colnames<-`(M, paste0("sensor.", 1:4)))
dat[sample(1:nrow(dat), 30), "sensor.1"] <- NA ## generate 30% missings
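To address the edit above: one way to generalize this to every column is to derive the candidate predictors from names(df) and drop the current target inside the loop, which removes the correlation-with-itself problem. A sketch only, assuming all columns are numeric sensor columns with syntactic names and enough complete observations for cor():
impute_all <- function(df) {
  for (target in names(df)) {
    candidates <- setdiff(names(df), target)  # exclude the target itself
    corr <- sapply(candidates, function(x)
      cor(df[[target]], df[[x]], use = "complete.obs"))
    imp.use <- names(which.max(corr))  # best-correlated sensor
    lm.cf <- lm(reformulate(imp.use, target), data = df)$coef
    df[[target]] <- ifelse(is.na(df[[target]]),
                           lm.cf[1] + df[[imp.use]] * lm.cf[2],
                           df[[target]])
  }
  df
}
dat.imp <- impute_all(dat)  # e.g. on the toy data above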

Calculating mean and interquartile range of 'cut' data to plot

Apologies, I am new to R. I have a dataset with height and canopy density of trees, for example:
i_h100 i_cd
2.89 0.0198
2.88 0.0198
17.53 0.658
27.23 0.347
I want to regroup i_h100 into 2 m intervals going from the 2 m minimum to the 30 m maximum. I then want to calculate the mean i_cd value and its interquartile range for each of these intervals, so that I can plot these with a least-squares regression. There is something wrong with the code I am using to get the mean. This is what I have so far:
mydata=read.csv("irelandish.csv")
height=mydata$i_h100
breaks=seq(2,30,by=2) #2m intervals
height.cut=cut(height, breaks, right=TRUE)
#attempt at calculating means per group
install.packages("dplyr")
mean=summarise(group_by(cut(height, breaks, right=TRUE),
mean(mydata$i_cd)))
install.packages("reshape2")
dcast(mean)
Thanks in advance for any advice.
Using aggregate() to calculate the groupwise means.
# Some example data
set.seed(1)
i_h100 <- round(runif(100, 2, 30), 2)
i_cd <- rexp(100, 1/i_h100)
mydata <- data.frame(i_cd, i_h100)
# Grouping i_h100
mydata$i_h100_2m <- cut(mydata$i_h100, seq(2, 30, by=2))
head(mydata)
# i_cd i_h100 i_h100_2m
# 1 2.918093 9.43 (8,10]
# 2 13.735728 12.42 (12,14]
# 3 13.966347 18.04 (18,20]
# 4 2.459760 27.43 (26,28]
# 5 8.477551 7.65 (6,8]
# 6 6.713224 27.15 (26,28]
# Calculate groupwise means of i_cd
i_cd_2m_mean <- aggregate(i_cd ~ i_h100_2m, mydata, mean)
# And IQR
i_cd_2m_iqr <- aggregate(i_cd ~ i_h100_2m, mydata, IQR)
upper <- i_cd_2m_mean[,2]+(i_cd_2m_iqr[,2]/2)
lower <- i_cd_2m_mean[,2]-(i_cd_2m_iqr[,2]/2)
# Plotting the result
plot.default(i_cd_2m_mean, xaxt = "n", ylim = range(c(upper, lower)),
             main = "Groupwise means \U00B1 0.5 IQR", type = "n")
points(upper, pch = 2, col = "lightblue", lwd = 1.5)
points(lower, pch = 6, col = "pink", lwd = 1.5)
points(i_cd_2m_mean, pch = 16)
axis(1, at = seq_along(i_cd_2m_mean[, 1]),
     labels = as.character(i_cd_2m_mean[, 1]), cex.axis = 0.6, las = 2)
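The question also asks for a least-squares fit through these points. A sketch, fitting against the interval index, since that is what the x-axis of the plot above actually shows:
# least-squares line through the groupwise means
fit <- lm(i_cd ~ as.integer(i_h100_2m), data = i_cd_2m_mean)
abline(fit, lty = 3)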
Here is a solution:
library(reshape2)
library(dplyr)
mydata <- data_frame(i_h100=c(2.89,2.88,17.53,27.23),i_cd=c(0.0198,0.0198,0.658,0.347))
height <- mydata$i_h100
breaks <- seq(2,30,by=2) #2m intervals
height.cut <- cut(height, breaks, right=TRUE)
mydata$height.cut <- height.cut
mean_i_h100 <- mydata %>% group_by(height.cut) %>% summarise(mean_i_h100 = mean(i_h100))
A few remarks:
it is better to avoid naming variables with function names, so I changed the mean variable to mean_i_h100
I am using the pipe notation, which makes the code more readable by avoiding repetition of the first argument of each function; you can find a more detailed explanation here.
Without the pipe notation, the last line of code would be:
mean_i_h100 <- summarise(group_by(mydata,height.cut),mean_i_h100 = mean(i_h100))
you have to load the two packages you installed, with library()
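Since the question also asks for the mean and interquartile range of i_cd per interval, the same pipeline extends naturally; a sketch:
stats_i_cd <- mydata %>%
  group_by(height.cut) %>%
  summarise(mean_i_cd = mean(i_cd),
            iqr_i_cd = IQR(i_cd))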

Levelplot with incomplete data

Using the lattice package in R:
I have gotten myself deep into a rabbit hole and now I need some help to get out.
I have some (expensive) data points that naturally live on a 32x32 grid, but I don't have all the possible data points:
> str(data)
'data.frame': 53 obs. of 3 variables:
$ X: num 16 16 16 16 13 13 13 13 23 23 ...
$ Y: num 20 16 23 10 16 23 20 10 16 23 ...
$ Z: num 1558 1561 1555 1540 1538 ...
When I try to use levelplot like this,
> levelplot(data$Z ~ rbind(data$X, data$X) * rbind(data$Y, data$Y),
xlim=c(0.5, 32.5), ylim=c(0.5, 32.5))
the plot has the colored patches clustered in a (for me) confusing way. [Figure: output from levelplot]
What I would like to achieve is one colored patch per 1-by-1 index pair corresponding to my data. Absent grid points can be left white.
I tried to understand the R documentation but have given up.
Further, I have tried a grid of dummy NA values, and then tried filling in the relevant data points. Something like
> x <- seq(1, 32, length.out=32)
> y <- seq(1, 32, length.out=32)
> data <- expand.grid(X=x, Y=y)
> data$Z <- NA
> tmp <- res[selected_data, ]
> data[(data$X == tmp$X) & (data$Y == tmp$Y), 'Z'] <- tmp$Z
Error in `[<-.data.frame`(`*tmp*`, (data$X == tmp$Input_Channel) & (data$Y == :
replacement has 53 rows, data has 1024
Here res is the source of the data points and selected_data is a vector of logicals used to select data from res. Anyway, this doesn't work.
Regardless, trying to make this latter approach work has been a wrong turn. I'd rather have a proper solution with levelplot than my failed workaround.
I found a workable solution which I share to help others:
> dataX <- c(seq(1, 32), rep(1, 32), tmp$X)
> dataY <- c(rep(1, 32), seq(1, 32), tmp$Y)
> dataZ <- c(rep(NA, 64), tmp$Z)
> levelplot(dataZ ~ dataX * dataY)
Adding the NAs in this manner gives the desired output. [Figure: desired output from levelplot]
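An alternative sketch, assuming the measured points sit in a data frame data with columns X, Y and Z as in the str() output above, and that the grid coordinates are whole numbers: build the full 32x32 grid explicitly and merge the measurements in, so absent cells carry Z = NA and levelplot leaves them blank.
library(lattice)
# full grid; cells without a measurement get Z = NA after the merge
full <- expand.grid(X = 1:32, Y = 1:32)
full <- merge(full, data[, c("X", "Y", "Z")], by = c("X", "Y"), all.x = TRUE)
levelplot(Z ~ X * Y, data = full, xlim = c(0.5, 32.5), ylim = c(0.5, 32.5))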

How to find value for/match to coordinates of closest proximity in a second df

I have a series of geographical positions at sea for which I am trying to get geological sediment type information. I am using an export of the national British geological sediment database (df1), which is a large data set of coordinates and sediment information.
Currently I round the coordinates in the BGS export (df1) and average/recalculate the sediment type for these coordinate squares; then I round my own coordinates (df2) and match them to these squares to get a sediment classification.
The BGS export (df1) looks like this:
NUM X Y GRAV SAND MUD
1 228 1.93656 52.31307 1.07 98.83 0.10
2 142 1.84667 52.45333 0.00 52.60 47.40
3 182 1.91950 52.17750 9.48 90.38 0.14
4 124 1.88333 52.70833 0.00 98.80 1.20
5 2807 1.91050 51.45000 2.05 97.91 0.05
6 2787 1.74683 51.99382 41.32 52.08 6.60
7 2776 1.66117 51.63550 9.83 87.36 2.81
8 2763 1.82467 51.71767 43.92 47.25 8.83
9 2753 1.76867 51.96349 57.66 39.18 3.15
10 68 2.86967 52.96333 0.30 98.90 0.80
11 2912 1.70083 51.77783 26.90 64.87 8.22
12 2914 1.59750 51.88882 32.00 65.02 2.97
13 2886 1.98833 51.34267 1.05 98.91 0.04
14 2891 1.87817 51.31549 68.57 31.34 0.08
15 2898 1.37433 51.41249 35.93 61.48 2.59
16 45 2.06667 51.82500 9.70 88.10 2.20
17 2904 1.63617 51.45999 16.28 66.67 17.05
My positions at sea (df2) look like this:
haul DecStartLat DecStartLong
1993H_2 55.23983 -5.512830
2794H_1 55.26670 -5.516700
1993H_1 55.27183 -5.521330
0709A_71 55.26569 -5.519730
0396H_2 55.44120 -5.917800
0299H_2 55.44015 -5.917310
0514A_26 55.46897 -5.912167
0411A_64 55.47289 -5.911820
0410A_65 55.46869 -5.911930
0514A_24 55.63585 -5.783500
0295H_4 55.57250 -5.754300
0410A_62 55.63656 -6.041870
0413A_53 55.73280 -6.020600
0396H_13 55.66470 -6.002300
2794H_8 55.83330 -5.883300
0612A_15 55.84025 -5.912130
0410A_74 55.84311 -5.910180
0299H_16 55.90568 -5.732490
0200H_18 55.88600 -5.742900
0612A_18 55.90450 -5.835880
This is my script...
get.Sed.type <- function(x, y) {
  # Aggregate the BGS samples (x) into 0.1-degree squares
  x$Y2 <- round(x$Y, digits = 1)
  x$X2 <- round(x$X, digits = 1)
  x$BGSQ <- paste(x$Y2, x$X2, sep = "_")
  x$RATIO <- x$SAND / x$MUD
  x <- aggregate(cbind(GRAV, RATIO) ~ BGSQ, data = x, FUN = mean)
  # Bin gravel percentage and sand:mud ratio for the Folk classification
  FOLK <- x$GRAV
  FOLK[FOLK < 1] <- 0
  FOLK[FOLK >= 1 & FOLK < 5] <- 1
  FOLK[FOLK >= 5 & FOLK < 30] <- 5
  FOLK[FOLK >= 30 & FOLK < 80] <- 30
  FOLK[FOLK >= 80] <- 80
  R_CLASS <- x$RATIO
  R_CLASS[R_CLASS < 1/9] <- 0
  R_CLASS[R_CLASS >= 1/9 & R_CLASS < 1] <- 0.1
  R_CLASS[R_CLASS >= 1 & R_CLASS < 9] <- 1
  R_CLASS[R_CLASS >= 9] <- 9
  x$FOLK_CLASS <- NA
  x$FOLK_CLASS[R_CLASS == 0 & FOLK == 0] <- "M"
  x$FOLK_CLASS[R_CLASS %in% c(0, 0.1) & FOLK == 5] <- "gM"
  x$FOLK_CLASS[R_CLASS == 0.1 & FOLK == 0] <- "sM"
  x$FOLK_CLASS[R_CLASS == 0 & FOLK == 1] <- "(g)M"
  x$FOLK_CLASS[R_CLASS == 0.1 & FOLK == 1] <- "(g)sM"
  x$FOLK_CLASS[R_CLASS == 9 & FOLK == 0] <- "S"
  x$FOLK_CLASS[R_CLASS == 1 & FOLK == 0] <- "mS"
  x$FOLK_CLASS[R_CLASS == 9 & FOLK == 1] <- "(g)S"
  x$FOLK_CLASS[R_CLASS == 1 & FOLK == 1] <- "(g)mS"
  x$FOLK_CLASS[R_CLASS == 1 & FOLK == 5] <- "gmS"
  x$FOLK_CLASS[R_CLASS == 9 & FOLK == 5] <- "gS"
  x$FOLK_CLASS[FOLK == 80] <- "G"
  x$FOLK_CLASS[R_CLASS %in% c(0, 0.1) & FOLK == 30] <- "mG"
  x$FOLK_CLASS[R_CLASS == 1 & FOLK == 30] <- "msG"
  x$FOLK_CLASS[R_CLASS == 9 & FOLK == 30] <- "sG"
  # Round the positions at sea (y) to the same squares and merge
  y$Lat <- round(y$DecStartLat, digits = 1)
  y$Long <- round(y$DecStartLong, digits = 1)
  y$LATLONG100_sq <- paste(y$Lat, y$Long, sep = "_")
  y <- merge(y, x[, c(1, 4)], all.x = TRUE, by.x = "LATLONG100_sq", by.y = "BGSQ")
  # Delete unwanted columns
  y <- y[, !(colnames(y) %in% c("Lat", "Long", "LATLONG100_sq"))]
  # Name the column something logical
  colnames(y)[colnames(y) == "FOLK_CLASS"] <- "BGS_class"
  return(y)
}
However, I have a dozen or so positions in df2 for which there are no corresponding values in the BGS export (df1). I want to know how I can either make it average over the squares surrounding the missing square (i.e. round to a coarser grid and repeat the process), or make it find the coordinate in the BGS export that is closest in proximity and take that existing value.
Going for the second option stated in the question, I suggest framing the problem as follows:
Say that you have a set of m coordinates from db1 and n coordinates from db2, m <= n, and that currently the intersection of these sets is empty.
You'd like to match each point from db1 with a point from db2 such that the "error" of the matching, e.g. sum of distances, will be minimized.
A simple greedy approach for solving this might be to generate an m x n matrix with the distances between each pair of coordinates, and sequentially select the closest match for each point.
Of course, if there are many points to match, or if you're after an optimal solution, you may want to consider more elaborate matching algorithms (e.g. the Hungarian algorithm).
Code:
# generate some data (chosen so that greedy matching will be sub-optimal)
db1 <- data.frame(id=c("a1","a2","a3","a4"), x=c(1,5,10,20), y=c(1,5,10,20))
db2 <- data.frame(id=c("b1","b2","b3","b4"), x=c(1.1,2.1,8.1,14.1), y=c(1.1,1.1,8.1,14.1))
# create the cartesian product
product <- merge(db1, db2, by=NULL)
# calculate Euclidean distances for each possible matching
product$d <- sqrt((product$x.x - product$x.y)^2 + (product$y.x - product$y.y)^2)
#(naively & greedily) find the best match for each point
sorted <- product[ order(product[,"d"]), ]
found <- vector()
res <- vector() #this vector will hold the result
for (i in 1:nrow(db1)) {
  for (j in 1:nrow(sorted)) {
    db2_val <- as.character(sorted[j, "id.y"])
    if (sorted[j, "id.x"] == db1[i, "id"] && length(grep(db2_val, found)) == 0) {
      # print(paste("matching ", db1[i, "id"], " with ", db2_val))
      res[i] <- db2_val
      found <- c(found, db2_val)
      break
    }
  }
}
Note that I'm sure the code can be improved and made more elegant by using methods other than explicit loops.
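For instance, a more compact variant of the same per-point greedy idea (a sketch using the db1/db2 toy data above; each matched db2 point is knocked out by setting its column to Inf):
# distance matrix: rows index db1, columns index db2
D <- outer(seq_len(nrow(db1)), seq_len(nrow(db2)), function(i, j)
  sqrt((db1$x[i] - db2$x[j])^2 + (db1$y[i] - db2$y[j])^2))
res2 <- character(nrow(db1))
for (i in seq_len(nrow(db1))) {
  j <- which.min(D[i, ])
  res2[i] <- as.character(db2$id[j])
  D[, j] <- Inf  # this db2 point is now taken
}
res2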
Hopefully I do not misunderstand, but as far as I can tell from the title, you need to match based on minimum distance. If this distance is allowed to be the Euclidean distance, then one can use the fast RANN package; if not, then one needs to compute the great circle distance.
Some of the provided data:
BGS_df <-
read.table(text =
" NUM X Y GRAV SAND MUD
1 228 1.93656 52.31307 1.07 98.83 0.10
2 142 1.84667 52.45333 0.00 52.60 47.40
3 182 1.91950 52.17750 9.48 90.38 0.14
4 124 1.88333 52.70833 0.00 98.80 1.20
5 2807 1.91050 51.45000 2.05 97.91 0.05",
header = TRUE)
my_positions <-
read.table(text =
"haul DecStartLat DecStartLong
1993H_2 55.23983 -5.512830
2794H_1 55.26670 -5.516700
1993H_1 55.27183 -5.521330",
header = TRUE)
Euclidean distance (using RANN package)
library(RANN)
# For each point in my_positions, find the nearest neighbor from BGS_df:
# Give X and then Y (longitude and then latitude)
# Note that argument k sets the number of nearest neighbours, here 1 (the closest)
closest_RANN <- RANN::nn2(data = BGS_df[, c("X", "Y")],
query = my_positions[, c("DecStartLong", "DecStartLat")],
k = 1)
results_RANN <- cbind(my_positions[, c("haul", "DecStartLong", "DecStartLat")],
BGS_df[closest_RANN$nn.idx, ])
results_RANN
# haul DecStartLong DecStartLat NUM X Y GRAV SAND MUD
# 4 1993H_2 -5.51283 55.23983 124 1.88333 52.70833 0 98.8 1.2
# 4.1 2794H_1 -5.51670 55.26670 124 1.88333 52.70833 0 98.8 1.2
# 4.2 1993H_1 -5.52133 55.27183 124 1.88333 52.70833 0 98.8 1.2
Great circle distance (using geosphere package)
library(geosphere)
# Compute matrix of great circle distances
dist_mat <- geosphere::distm(x = BGS_df[, c("X", "Y")],
y = my_positions[, c("DecStartLong", "DecStartLat")],
fun = distHaversine) # can try other distances
# For each column (point in my_positions) get the index of row of min dist
# (corresponds to row index in BGS_df)
BGS_idx <- apply(dist_mat, 2, which.min)
results_geo <- cbind(my_positions[, c("haul", "DecStartLong", "DecStartLat")],
BGS_df[BGS_idx, ])
identical(results_geo, results_RANN) # here TRUE, but not always expected

Divide a range of values into bins of equal length: cut vs cut2

I'm using the cut function to split my data into equal bins. It does the job, but I'm not happy with the way it returns the values: what I need is the center of each bin, not the upper and lower ends.
I've also tried to use cut2{Hmisc}. This gives me the center of each bin, but it divides the range of the data into bins that contain the same number of observations, rather than bins of the same length.
Does anyone have a solution to this?
It's not too hard to make the breaks and labels yourself, with something like this. Here, since the midpoint is a single number, I don't actually return a factor with labels but instead a numeric vector.
cut2 <- function(x, breaks) {
  r <- range(x)
  # 2*breaks+1 equally spaced points: odd positions are the bin edges,
  # even positions are the bin midpoints
  b <- seq(r[1], r[2], length = 2*breaks + 1)
  brk <- b[0:breaks*2 + 1]
  mid <- b[1:breaks*2]
  brk[1] <- brk[1] - 0.01  # nudge the lowest edge so min(x) falls inside
  k <- cut(x, breaks = brk, labels = FALSE)
  mid[k]
}
There's probably a better way to get the bin breaks and midpoints; I didn't think about it very hard.
Note that this answer is different from Joshua's; his gives the median of the data in each bin while this gives the center of each bin.
> head(cut2(x,3))
[1] 16.666667 3.333333 16.666667 3.333333 16.666667 16.666667
> head(ave(x, cut(x,3), FUN=median))
[1] 18 2 18 2 18 18
Use ave like so:
set.seed(21)
x <- sample(0:20, 100, replace=TRUE)
xCenter <- ave(x, cut(x,3), FUN=median)
We can use smart_cut from package cutr:
devtools::install_github("moodymudskipper/cutr")
library(cutr)
Using #Joshua's sample data:
median by interval (same output as @Joshua's, except it's an ordered factor):
smart_cut(x,3, "n_intervals", labels= ~ median(.))
# [1] 18 2 18 2 18 18 ...
# Levels: 2 < 11 < 18
center of each interval (same output as @Aaron's, except it's an ordered factor):
smart_cut(x,3, "n_intervals", labels= ~ mean(.y))
# [1] 16.67 3.333 16.67 3.333 16.67 16.67 ...
# Levels: 3.333 < 10 < 16.67
mean of values by interval:
smart_cut(x,3, "n_intervals", labels= ~ mean(.))
# [1] 17.48 2.571 17.48 2.571 17.48 17.48 ...
# Levels: 2.571 < 11.06 < 17.48
labels can be a character vector just like in base::cut.default, but it can also be, as it is here, a function of two parameters: the first is the values contained in the bin, and the second is the cut points of the bin.
more on cutr and smart_cut
