Using apply with multiple sources of data? - r

I'm still in the beginning stages of R but I've gotten a few functions down and now I'm looking for my final "project."
I've created a function that takes each of my four sources of data (different populations), creates histograms, performs Kolmogorov-Smirnov tests, and then graphs any significant results for a given row. What I want to do is turn it into an apply function. However, the issue is that my function takes four variables, and I don't know a way to make apply take four sources of data.
hist_fx <- function(w,x,y,z) {
hist(w,prob=TRUE,col="green",xlim=c(-1,1),ylim=c(0,3))
lines(density(w),col="red")
abline(v=c(mean(w)),col="red")
hist(x,prob=TRUE,col="blue",xlim=c(-1,1),ylim=c(0,3))
lines(density(x),col="red")
abline(v=c(mean(x)),col="red")
hist(y,prob=TRUE,col="yellow",xlim=c(-1,1),ylim=c(0,3))
lines(density(y),col="red")
abline(v=c(mean(y)),col="red")
hist(z,prob=TRUE,col="purple",xlim=c(-1,1),ylim=c(0,3))
lines(density(z),col="red")
abline(v=c(mean(z)),col="red")
all <- c(w,x,y,z)
hist(all,prob=TRUE,xlim=c(-1,0.5),ylim=c(0,3))
lines(density(w),col="purple")
lines(density(x),col="red")
lines(density(y),col="blue")
lines(density(z),col="green")
plot(ecdf(w),col="green")
plot(ecdf(x),col="blue",add=TRUE)
plot(ecdf(y),col="red",add=TRUE)
plot(ecdf(z),col="purple",add=TRUE)
t1 <- ks.test(w,x)
print(t1)
t2 <- ks.test(w,y)
print(t2)
t3 <- ks.test(w,z)
print(t3)
if(t1$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(x),col="blue",add=TRUE)
}
if(t2$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(y),col="red",add=TRUE)
}
if(t3$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(z),col="purple",add=TRUE)
}
}
I'm able to use this function with apply for one population at a time (i.e. turn hist_fx into a function of one variable). However, I can't find a way to make this work for all four populations at the same time. I've messed around with some for loops, though they haven't been successful as of yet.
One last thing that might be of use: my data is arranged such that independent variables are the rows and the dependent variables are columns. Consequently, I need to run these per row (hence my idea of a for loop).
EDIT:
Here's the dput for one of the populations:
dput(k2)
structure(c(-0.15, 0.13, 0.23, -0.23, 0.06, -0.11, 0.107, 0.06,
-0.17, 0.12, 0.06, -0.25, -0.32, 0.13, 0.06, -0.2, -0.08, 0.06,
0.12, 0.02, 0.11, -0.11, -0.15, 0.097, 0.347, -0.307, 0.097,
-0.047, 0.09, 0.01, -0.217, 0.117, 0.03, -0.3, -0.33, 0.13, 0.19,
-0.24, -0.08, -0.01, 0.15, 0.61, 0.18, -0.15, -0.103, 0.135,
0.31, -0.25, 0.157, -0.105, -0.08, 0.01, -0.165, 0.17, 0.1, -0.23,
-0.28, 0.15, 0.13, -0.14, -0.06, 0.01, 0.07, -0.02, 0.11, -0.06,
-0.123, 0.13, 0.35, -0.27, 0.165, -0.065, 0.135, 0.13, -0.17,
0.135, 0.08, -0.21, -0.25, 0.2, 0.16, -0.18, NA, -0.04, 0.05,
-0.02, 0.13, -0.14, -0.13, 0.098, 0.27, -0.193, 0.062, -0.08,
0.057, 0.028, -0.199, 0.1, 0.04, -0.24, -0.32, 0.13, 0.13, -0.15,
-0.05, 0.01, 0.08, -0.04, 0.1, -0.1, -0.14, 0.154, 0.261, -0.194,
0.1, -0.129, 0.063, 0.142, -0.136, 0.136, 0.08, -0.23, -0.24,
0.12, 0.1, -0.16, -0.06, 0.04, 0.09, -0.01, 0.04, -0.08, -0.127,
0.133, 0.337, -0.06, 0.11, -0.107, 0.16, 0.167, -0.183, 0.103,
0.05, -0.2, -0.3, 0.22, -0.01, -0.17, -0.14, 0.02, 0.07, 0.01,
0.11, -0.11, -0.155, 0.221, 0.22, -0.172, 0.09, -0.15, 0.12,
0.03, -0.153, 0.146, 0.11, -0.2, -0.24, 0.16, 0.07, -0.19, -0.1,
0.03, 0.17, 0.02, 0.09, -0.16, -0.062, 0.19, 0.269, -0.265, 0.118,
-0.11, 0.126, 0.094, -0.186, 0.151, 0.08, -0.26, -0.31, 0.13,
0.09, -0.23, -0.12, 0.05, 0.13, 0.01, 0.11, -0.14, -0.095, 0.14,
0.24, -0.46, 0.09, -0.17, 0.08, 0.01, -0.24, 0.16, 0.04, -0.38,
-0.39, 0.11, 0.06, -0.31, -0.25, 0.03, 0.21, -0.14, 0, -0.22,
-0.07, 0.148, 0.311, -0.27, 0.11, -0.055, 0.16, 0.04, -0.197,
0.064, 0.09, -0.24, -0.34, 0.17, 0.07, -0.15, -0.18, 0.03, 0.13,
0.07, 0.13, -0.08, -0.136, 0.142, 0.27, -0.257, 0.1, -0.13, 0.103,
0.064, -0.197, 0.118, 0.06, -0.29, -0.35, 0.13, 0.1, -0.19, -0.13,
0.01, 0.1, -0.01, 0.13, -0.15), .Dim = c(22L, 12L))
To further clarify, here's the format of the actual data frame:
c1 c2 c3 c4
r2 x x x
r3 x x x
r4 x x x
Each column represents a star's values for the variable on the row. As such, I want to create a histogram for each row, for each dataset.
For the values of the function, I just used those variables for simplicity's sake. w = population 1, x = population 2, y = population 3, z = population 4.
As for an example:
> hist_fx(k2[1,],n2[1,],j2[1,],g2[1,])
Two-sample Kolmogorov-Smirnov test
data: w and x
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided
Two-sample Kolmogorov-Smirnov test
data: w and y
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided
Two-sample Kolmogorov-Smirnov test
data: w and z
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided
My problem is that currently, I can only run the function one row at a time. I'd like to be able to do it for all rows. I was thinking of using apply because I've used it in a very similar context except only for one source of data.

I'm not quite sure of your needs, but consider transposing with t() so that row data can be processed column-wise. Also consider mapply(), the multivariate member of the apply family, which applies a function element-wise across several equal-length objects at the same time. It may also help to break the operations apart, since running them all together may leave only the last iteration's plot on screen.
Transpose (data used were slight variations of posted dput matrix)
pop1 <- data.frame(t(data))
pop2 <- data.frame(t(data))
pop3 <- data.frame(t(data))
pop4 <- data.frame(t(data))
Histograms
hist_fx <- function(w,x,y,z) {
whist <- hist(w,prob=TRUE,col="green",xlim=c(-1,1),ylim=c(0,3))
lines(density(w),col="red")
abline(v=c(mean(w)),col="red")
xhist <- hist(x,prob=TRUE,col="blue",xlim=c(-1,1),ylim=c(0,3))
lines(density(x),col="red")
abline(v=c(mean(x)),col="red")
yhist <- hist(y,prob=TRUE,col="yellow",xlim=c(-1,1),ylim=c(0,3))
lines(density(y),col="red")
abline(v=c(mean(y)),col="red")
zhist <- hist(z,prob=TRUE,col="purple",xlim=c(-1,1),ylim=c(0,3))
lines(density(z),col="red")
abline(v=c(mean(z)),col="red")
}
# HISTOGRAM PLOTS FOR EACH DF COLUMN
output <- mapply(hist_fx, w=pop1, x=pop2, y=pop3, z=pop4)
Kolmogorov-Smirnov tests (using slight variations of dput data)
hist_fx <- function(w,x,y,z) {
t1 <- ks.test(w,x)
t2 <- ks.test(w,y)
t3 <- ks.test(w,z)
if(t1$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(x),col="blue",add=TRUE)
}
if(t2$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(y),col="red",add=TRUE)
}
if(t3$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(z),col="purple",add=TRUE)
}
return(c(t1, t2, t3))
}
output <- mapply(hist_fx, w=pop1, x=pop2, y=pop3, z=pop4)
output
# X1
# statistic 0.1666667
# p.value 0.9962552
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and x"
# statistic 0.25
# p.value 0.8474885
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and y"
# statistic 0.08333333
# p.value 1
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and z"
# X2
# statistic 0.25
# p.value 0.8474885
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and x"
# statistic 0.08333333
# p.value 1
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and y"
# statistic 0.1666667
# p.value 0.9962552
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and z"
# ...
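As an alternative to transposing, the row-wise comparison can be sketched by iterating over row indices so that all four matrices are sliced at the same row. This is only a minimal sketch: the names k2, n2, j2 and g2 come from the question, but the data below are simulated stand-ins, and ks_row is a hypothetical helper that keeps only the p-values.

```r
set.seed(42)

# Simulated stand-ins for the four population matrices
# (rows = variables, columns = stars); not the real data.
make_pop <- function(shift) {
  matrix(rnorm(22 * 12, mean = shift, sd = 0.2), nrow = 22)
}
k2 <- make_pop(0)
n2 <- make_pop(0.1)
j2 <- make_pop(0)
g2 <- make_pop(0.5)

# Hypothetical helper: compare population 1 with the other three for one row
ks_row <- function(w, x, y, z) {
  c(w_vs_x = ks.test(w, x)$p.value,
    w_vs_y = ks.test(w, y)$p.value,
    w_vs_z = ks.test(w, z)$p.value)
}

# One iteration per row index; sapply() returns a 3 x 22 matrix,
# so transpose it to get one row of p-values per variable.
pvals <- t(sapply(seq_len(nrow(k2)), function(i) {
  ks_row(k2[i, ], n2[i, ], j2[i, ], g2[i, ])
}))

# Rows where population 1 differs from population 4 at the 5% level
significant_rows <- which(pvals[, "w_vs_z"] < 0.05)
```

The plotting calls from hist_fx could be placed inside the same anonymous function. If the real matrices contain NA values, as the posted dput does, density() and mean() would need them removed first.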

Related

How to change the x axis to a time series in ggplot2

I'm trying to replicate the graph provided at https://www.chicagofed.org/research/data/cfnai/current-data since I will soon need similar-looking graphs for other data sets. I'm almost there, but I can't figure out how to change the x axis to dates when using ggplot2. Specifically, I would like it to show the dates in the Date column. I tried about a dozen ways and nothing is working. The data for this graph is under "indexes" on the website. Here's my code and the graph, where dataSet is the data from the website:
library(ggplot2)
library(reshape2)
library(tidyverse)
library(lubridate)
df = data.frame(time = index(dataSet), melt(as.data.frame(dataSet)))
df
str(df)
df$data1.Date = as.Date(as.character(df$data1.Date))
str(df)
replicaPlot1 = ggplot(df, aes(x = time, y = value)) +
geom_area(aes(colour = variable, fill = variable)) +
stat_summary(fun = sum, geom = "line", size = 0.4) +
labs(title = "Chicago Fed National Activity Index (CFNAI) Current Data")
replicaPlot1 + scale_x_continuous(name = "time", breaks = waiver(), labels = waiver(), limits =
df$data1.Date)
replicaPlot1
Any sort of help on this would be very much appreciated!
I'm not sure what your intention is with data.frame(time = index(dataSet), melt(as.data.frame(dataSet))). When I downloaded the data and read it via readxl::read_excel, I got a nice tibble with a date(time) column. After reshaping via tidyr::pivot_longer it can easily be plotted, and scale_x_datetime gives a nicely formatted date axis.
Using just the first 20 rows of data, try this:
library(ggplot2)
library(readxl)
library(tidyr)
df <- pivot_longer(df, -Date, names_to = "variable")
ggplot(df, aes(x = Date, y = value)) +
geom_area(aes(colour = variable, fill = variable)) +
stat_summary(fun = sum, geom = "line", size = 0.4) +
labs(title = "Chicago Fed National Activity Index (CFNAI) Current Data") +
scale_x_datetime(name = "time")
#> Warning: Removed 4 rows containing non-finite values (stat_summary).
#> Warning: Removed 4 rows containing missing values (position_stack).
Created on 2021-01-28 by the reprex package (v1.0.0)
DATA
# Data downloaded from https://www.chicagofed.org/~/media/publications/cfnai/cfnai-data-series-xlsx.xlsx?la=en
# df <- readxl::read_excel("cfnai-data-series-xlsx.xlsx")
# dput(head(df, 20))
df <- structure(list(Date = structure(c(
-87004800, -84412800, -81734400,
-79142400, -76464000, -73785600, -71193600, -68515200, -65923200,
-63244800, -60566400, -58060800, -55382400, -52790400, -50112000,
-47520000, -44841600, -42163200, -39571200, -36892800
), tzone = "UTC", class = c(
"POSIXct",
"POSIXt"
)), P_I = c(
-0.26, 0.16, -0.43, -0.09, -0.19, 0.58, -0.05,
0.21, 0.51, 0.33, -0.1, 0.12, 0.07, 0.04, 0.35, 0.04, -0.1, 0.14,
0.05, 0.11
), EU_H = c(
-0.06, -0.09, 0.01, 0.04, 0.1, 0.22, -0.04,
0, 0.32, 0.16, -0.2, 0.34, 0.06, 0.17, 0.17, 0.07, 0.12, 0.12,
0.15, 0.18
), C_H = c(
-0.01, 0.01, -0.05, 0.08, -0.07, -0.01,
0.12, -0.11, 0.1, 0.15, -0.04, 0.04, 0.17, -0.03, 0.05, 0.08,
0.09, 0.05, -0.06, 0.09
), SO_I = c(
-0.01, -0.07, -0.08, 0.02,
-0.16, 0.22, -0.08, -0.07, 0.38, 0.34, -0.13, -0.1, 0.08, -0.07,
0.06, 0.07, 0.12, -0.3, 0.35, 0.14
), CFNAI = c(
-0.34, 0.02, -0.55,
0.04, -0.32, 1, -0.05, 0.03, 1.32, 0.97, -0.46, 0.39, 0.38, 0.11,
0.63, 0.25, 0.22, 0.01, 0.49, 0.52
), CFNAI_MA3 = c(
NA, NA, -0.29,
-0.17, -0.28, 0.24, 0.21, 0.33, 0.43, 0.77, 0.61, 0.3, 0.1, 0.29,
0.37, 0.33, 0.37, 0.16, 0.24, 0.34
), DIFFUSION = c(
NA, NA, -0.17,
-0.14, -0.21, 0.16, 0.11, 0.17, 0.2, 0.5, 0.41, 0.28, 0.2, 0.32,
0.36, 0.32, 0.33, 0.25, 0.31, 0.47
)), row.names = c(NA, -20L), class = c(
"tbl_df",
"tbl", "data.frame"
))

How do I make segments (of my probabilities?)

I was wondering if there is a function that can help me with segmentation. Via mixtools (logisregmixEM), I got an optimum of 3 segments with corresponding sizes of 2.5%, 40.3% and 57.2%. I also got posterior probabilities. Is there a way to create three separate segments of observations based on these probabilities, so that I end up with 3 segments of the sizes given above?
For what it's worth, here is some background information on my coefficients and probabilities:
> dput(head(betas))
structure(list(comp1 = c(4.57, 0.08, 0.91, -0.11, 0.09, 0.07),
comp2 = c(2.04, -0.22, 0.19, 0.34, -0.34, -0.01), comp3 = c(0.88,
0.03, 0.42, -0.02, -0.17, -0.01)), row.names = c("beta.0",
"beta.1", "beta.2", "beta.3", "beta.4", "beta.5"), class = "data.frame")
> dput(head(posteriorp))
structure(c(0.06, 0.03, 0, 0.03, 0, 0, 0.61, 0.42, 0.07, 0.41,
0.31, 0.41, 0.33, 0.56, 0.93, 0.56, 0.69, 0.59), .Dim = c(6L,
3L), .Dimnames = list(NULL, c("comp.1", "comp.2", "comp.3")))
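One common way to turn posterior probabilities into segments is a hard assignment: each observation goes to the component with the highest posterior. The sketch below applies this to the six rows posted above. The names segment, members and shares are illustrative. The resulting shares need not match the mixing proportions reported by the model, since those are averages of the posteriors rather than counts of hard assignments.

```r
posteriorp <- structure(c(0.06, 0.03, 0, 0.03, 0, 0, 0.61, 0.42, 0.07, 0.41,
0.31, 0.41, 0.33, 0.56, 0.93, 0.56, 0.69, 0.59), .Dim = c(6L,
3L), .Dimnames = list(NULL, c("comp.1", "comp.2", "comp.3")))

# Index of the column with the largest posterior in each row
segment <- max.col(posteriorp, ties.method = "first")

# Observation indices per segment (empty segments are kept)
members <- split(seq_len(nrow(posteriorp)), factor(segment, levels = 1:3))

# Share of observations that landed in each segment
shares <- prop.table(table(factor(segment, levels = 1:3)))
```

For these six rows, the first observation falls into component 2 and the other five into component 3. On the full data, rows of the original data frame could then be selected with something like data[segment == 1, ].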

How to plot truncated distributions (truncdist) with fitdistrplus?

I am attempting to plot goodness of fit curves to truncated distributions from the fitdistrplus package using its plot function.
library(fitdistrplus)
library(truncdist)
library(truncnorm)
dataNum <- c(433.6668, 413.0450, 435.9952, 449.7559, 457.3629, 498.6187, 598.0335, 637.5611, 644.9193, 634.4843, 620.8676, 590.6622, 581.6411, 572.5022, 594.0925, 587.7293, 608.4948, 626.7594, 599.0286, 611.2966, 572.1749, 545.0071, 490.0298, 478.8484, 458.8293, 437.4878, 467.7026, 477.4094, 467.4182, 519.3056, 599.0155, 648.8603, 623.0672, 606.3737, 552.3653, 558.7612, 553.1345, 549.5961, 546.0578, 565.4582, 562.6825, 606.6225, 578.1584, 572.6201, 546.4735, 514.8147, 479.4638, 462.7702, 430.3652, 452.9671)
If I use the library(truncnorm) to fit a truncated normal distribution, everything works fine.
fit.dataNormTrunc2 <- fitdist(dataNum, "truncnorm", fix.arg=list(a=min(dataNum)), start = list(mean = mean(dataNum), sd = sd(dataNum)))
plot(fit.dataNormTrunc2)
However, if I try to use the truncdist package, only the histogram comparison plot prints without any of the other plots (e.g. qq-plot). I also get an error:
Error in qtNorm(p = c(0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, :
unused argument (p = c(0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19, 0.21, 0.23, 0.25, 0.27, 0.29, 0.31, 0.33, 0.35, 0.37, 0.39, 0.41, 0.43, 0.45, 0.47, 0.49, 0.51, 0.53, 0.55, 0.57, 0.59, 0.61, 0.63, 0.65, 0.67, 0.69, 0.71, 0.73, 0.75, 0.77, 0.79, 0.81, 0.83, 0.85, 0.87, 0.89, 0.91, 0.93, 0.95, 0.97, 0.99))
The code used is:
dtNorm <- function(x, mean, sd) {
dtrunc(x, "norm", mean, sd, a=min(dataNum), b=Inf)
}
ptNorm <- function(x, mean, sd) {
ptrunc(x, "norm", mean, sd, a=min(dataNum), b=Inf)
}
qtNorm <- function(x, mean, sd) {
qtrunc(x, "norm", mean, sd, a=min(dataNum), b=Inf)
}
fit.dataNormTrunc <- fitdist(dataNum, "tNorm", start = c(mean=mean(dataNum), sd=sd(dataNum)))
plot(fit.dataNormTrunc)
I have also tried the truncdist approach with the lognormal function, and again the other 3 plots don't print and I get the same error about an unused argument.
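The error message shows the quantile function being called with an argument named p, while qtNorm above declares its first argument as x. R's usual convention is x for density functions, q for distribution functions and p for quantile functions, so renaming the first arguments is a plausible fix. The sketch below illustrates the convention with a left-truncated normal written in base R only, swapped in for the truncdist wrappers so that it runs without extra packages. The truncation point a and the parameter values are illustrative assumptions.

```r
# Assumed lower truncation point; in the question this is min(dataNum)
a <- 413.045

# Left-truncated normal on [a, Inf), base R only.
# First arguments follow the d/p/q convention: x, q, p.
dtNorm <- function(x, mean, sd) {
  upper_mass <- pnorm(a, mean, sd, lower.tail = FALSE)
  ifelse(x >= a, dnorm(x, mean, sd) / upper_mass, 0)
}
ptNorm <- function(q, mean, sd) {
  upper_mass <- pnorm(a, mean, sd, lower.tail = FALSE)
  ifelse(q >= a, (pnorm(q, mean, sd) - pnorm(a, mean, sd)) / upper_mass, 0)
}
qtNorm <- function(p, mean, sd) {
  upper_mass <- pnorm(a, mean, sd, lower.tail = FALSE)
  qnorm(pnorm(a, mean, sd) + p * upper_mass, mean, sd)
}

# Round trip: the quantile function should invert the distribution function
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
round_trip <- ptNorm(qtNorm(probs, mean = 540, sd = 70), mean = 540, sd = 70)
```

The same renaming would apply to the truncdist wrappers in the question, for example qtNorm <- function(p, mean, sd) qtrunc(p, "norm", mean, sd, a = min(dataNum), b = Inf) and likewise ptNorm with q. I have not verified that all four diagnostic plots then render.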

Understanding and implementing numerical integration with a quantile function in R

I need to calculate this integral below, using R:
The q_theta(x) function I managed to do in R with quantile regression (package: quantreg).
matrix=structure(c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2,
0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31,
0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42,
0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53,
0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64,
0.65, 0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75,
0.76, 0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86,
0.87, 0.88, 0.89, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97,
0.98, 0.99, -22.2830664155772, -22.2830664155772, -19.9298291765612,
-18.2066426767652, -15.2657135034479, -14.921522915965, -13.5035945028536,
-13.1557269916064, -12.9495709618481, -11.6168348488161, -11.3999095021713,
-10.6962766764396, -10.0588239375837, -9.12944363439522, -8.15648778610587,
-8.04133299299019, -7.66558386420434, -7.50906566627427, -6.95626096568998,
-6.90630556403136, -6.53374879831376, -6.39324677042686, -6.20705804899049,
-6.09754765999465, -5.91272058217526, -5.75771166206242, -5.3770131257001,
-5.20892464393192, -5.07372162687422, -4.96706814289334, -4.64404095131293,
-4.1567394053577, -4.13209444755342, -3.85483644113723, -3.64855238293205,
-3.53054113507559, -3.46035383338799, -3.03155417364444, -2.93100183005178,
-2.90491824855193, -2.64056616049773, -2.51857727614607, -2.25163805172486,
-2.00934783937474, -1.89925824841417, -1.71405007411747, -1.65905834683964,
-1.47502511311988, -1.42755073292529, -1.20464216637298, -1.08574103345057,
-0.701134735371922, -0.590656010656201, -0.290335898959635, -0.0575062007348038,
0.0778328375033378, 0.165234593185889, 0.230651883848336, 0.316817885358695,
0.34841775605248, 0.516869604496075, 0.59743162507581, 0.857843937404964,
0.939734010162078, 1.12533017928147, 1.27037182428776, 1.52040854525927,
1.76577933448152, 2.07456447851822, 2.17389787235523, 2.27567786362425,
2.3850323163509, 2.55365596853891, 2.61208242890655, 2.77359226593771,
2.93275094039929, 3.07968072488942, 3.0822647851901, 3.26452177629061,
3.46223321951649, 3.66011832966054, 3.85710605543097, 4.05385887531972,
4.83943843494744, 5.05864734149161, 5.25501778319145, 5.38941130574907,
5.88571117751377, 6.5116611852713, 6.98632496342285, 7.21816245728101,
7.73244825971004, 7.80401007592906, 8.34648625541999, 9.83184090479964,
10.8324874884172, 11.3060100107816, 12.3048113953808, 13.1300123358331
), .Dim = c(99L, 2L), .Dimnames = list(NULL, c("Theta", "q(x)_(Theta)"
)))
This is the q_theta(x) function that I estimated in R. The questions I have are:
a> If x follows a standard normal distribution, this integral is zero, right?
b> Otherwise, as in my case, the integral is not zero. How do I treat q_(1-Theta)(x)? Is it simply sort(matrix[,"q(x)_(Theta)"],decreasing=TRUE)?
And the integration would be:
sintegral(thau[1:50], (matrix[,"q(x)_(Theta)"][1:50] - sort(matrix[,"q(x)_(Theta)"],TRUE)[1:50])[1:50])$value
The median would be a common point of these two functions, right?
Thanks.
Recalling your previous post, "Building a function by defining X and Y and then Integrating in R", we build a linear interpolation function:
## note `rule = 2` to enable "extrapolation";
## otherwise `rule = 1` gives `NA` outside [0.01, 0.5]
integrand <- approxfun(mat[, 1], y, rule = 2)
Then we can perform numeric integration on [0, 0.5]:
integrate(integrand, lower = 0, upper = 0.5)
# -5.594405 with absolute error < 4e-04
Now for a>, let's have a proof first.
Note that your quantile function is not that of a normal distribution, so this result does not hold. You can verify this:
quant <- approxfun(mat[, 1], mat[, 2], rule = 2)
integrate(quant, lower = 0, upper = 0.5)
# -3.737973 with absolute error < 0.00029
Compared with previous integration result -5.594405, the difference is not a factor of 2.
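To see what the factor of 2 looks like when symmetry does hold, here is a minimal sketch using standard normal quantiles on the same 0.01 to 0.99 grid. It assumes the integrand is q_theta minus q_(1-theta), as in the question's sintegral call. The reflected quantile is obtained by evaluating the interpolated function at 1 - theta, which on this grid matches sorting the values in decreasing order. The data and the names diff_int, half_int and ratio are illustrative.

```r
# Assumed example data: standard normal quantiles on a symmetric grid
theta <- seq(0.01, 0.99, by = 0.01)
quant <- approxfun(theta, qnorm(theta), rule = 2)

# q_(1 - theta) is the same quantile function evaluated at 1 - theta
integrand <- function(t) quant(t) - quant(1 - t)

diff_int <- integrate(integrand, lower = 0, upper = 0.5)$value
half_int <- integrate(quant, lower = 0, upper = 0.5)$value

# For a distribution symmetric about zero, q_(1 - theta) = -q_theta,
# so this ratio should be close to 2.
ratio <- diff_int / half_int
```

For the estimated quantiles in the question, the corresponding ratio is -5.594405 / -3.737973, roughly 1.5, which reflects the asymmetry.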

Interpolate within points in a vector

Vector V1 contains 56 observations for X, and vector BS contains a bootstrapped sample of V1 of length 100000. I would like to interpolate linearly within points in BS to fill in any missing values. For example, V1 contains no 0.27 values, and hence neither does BS. But BS would contain a few 0.28 and 0.26. I would like the interpolation to create a few 0.27 values and add those to BS. And so on for any missing values within the two extremes in the vector.
V1 <- c(0.18, 0.2, 0.24, 0.35, -0.22, -0.17, 0.28, -0.28, -0.14, 0.03, 0.87, -0.2, 0.06, -0.1, -0.72, 0.18, 0.01, 0.31, -0.36, 0.61, -0.16, -0.07, -0.13, 0.01, -0.09, 0.26, -0.14, 0.08, -0.62, -0.2, 0.3, -0.21, -0.11, 0.05, 0.06, -0.28, -0.27, 0.17, 0.42, -0.05, -0.15, 0.05, -0.07, -0.22, -0.34, 0.16, 0.34, 0.1, -0.12, 0.24, 0.45, 0.37, 0.61, 0.9, -0.25, 0.02)
BS <- sample(V1, 100000, replace=TRUE)
The approxfun functions do not seem to help, as they are for interpolating within data sets. I have found a few questions/answers covering interpolation within different data sets, but not within one data set. Thank you for your help.
EDIT: please note I do not want to fit a normal distribution (or any other) to create those points.
You can use approx() (or approxfun()) to do this by treating BS as the y-coordinate and using sequential x-coordinates:
set.seed(1L); BS <- sample(V1,1e5L,TRUE);
res <- approx(seq_along(BS),BS,n=length(BS)*2L-1L)$y;
The specification of n here is important. It ensures that exactly one interpolated value will be produced halfway between each adjacent pair of input values.
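A tiny worked example may make the role of n clearer. With four input values, n = 2 * 4 - 1 = 7 output points are requested, so approx() returns the four originals plus one midpoint between each adjacent pair. The vector small is made up for illustration.

```r
small <- c(0.26, 0.28, 0.10, 0.30)

# Seven evenly spaced x positions: 1, 1.5, 2, 2.5, 3, 3.5, 4
mid <- approx(seq_along(small), small, n = length(small) * 2L - 1L)$y
mid
# 0.26 0.27 0.28 0.19 0.10 0.20 0.30
```

Note that the new values are midpoints between neighbouring draws in sample order, so they are not limited to filling gaps such as 0.27.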
Here's a plot of an excerpt of the result, centered around the first occurrence of an adjacent pair of 0.26 and 0.28:
i <- which(BS[-length(BS)]==0.26 & BS[-1L]==0.28)[1L];
j <- i*2L-1L;
xlim <- c(j-6L,j+8L);
ylim <- c(-1,1);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.05);
plot(NA,xlim=xlim,ylim=ylim,xlab='res index',ylab='y',axes=F,xaxs='i',yaxs='i');
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
axis(1L,xticks,cex.axis=0.7);
axis(2L,yticks,sprintf('%.02f',round(yticks,2L)),las=1L,cex.axis=0.7);
x <- seq(xlim[1L],xlim[2L],2L); y <- BS[seq(i-3L,len=8L)];
points(x,y,pch=16L,col='red',xpd=NA);
x <- seq(xlim[1L],xlim[2L]); y <- res[x];
points(x,y,pch=4L,cex=1.2,col='blue',xpd=NA);
text(x+0.24,y+0.03,y,cex=0.7,xpd=NA);
legend(xlim[1L]+1.5,0.87,c('input value','interpolated'),col=c('red','blue'),pch=c(16L,4L));
