How can I reproduce the dist function - r

I simulated a data frame of points, x and y values, for various calculations. The dist function works pretty well to calculate the distances between every possible combination. And I've been trying to reproduce a simplified version that only does that (getting a Euclidean distance matrix of a data frame) but it hasn't been working so far.
If I was entering the two columns, I would do something like this but I'm trying to use just one input, the data frame.
dist <- function(x,y) {
distance <- sqrt(sum((x - y)^2))
return(distance)
}
I've tried using the source code for dist, but I can't figure out how to strip away all the stuff I don't want without breaking it.
function (x, method = "euclidean", diag = FALSE, upper = FALSE,
p = 2)
{
if (!is.na(pmatch(method, "euclidian")))
method <- "euclidean"
METHODS <- c("euclidean", "maximum", "manhattan", "canberra",
"binary", "minkowski")
method <- pmatch(method, METHODS)
if (is.na(method))
stop("invalid distance method")
if (method == -1)
stop("ambiguous distance method")
x <- as.matrix(x)
N <- nrow(x)
attrs <- if (method == 6L)
list(Size = N, Labels = dimnames(x)[[1L]], Diag = diag,
Upper = upper, method = METHODS[method], p = p,
call = match.call(), class = "dist")
else list(Size = N, Labels = dimnames(x)[[1L]], Diag = diag,
Upper = upper, method = METHODS[method], call = match.call(),
class = "dist")
.Call(C_Cdist, x, method, attrs, p)
}
Is anyone able to point me to a viable first step? I'm really trying to learn how to program without always relying on pre-packaged functions.

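You could use outer, applied to each coordinate separately and then combined:
df <- data.frame(x = rnorm(100), y = rnorm(100))
# outer(df$x, df$y, ...) would compare x_i with y_j; pair each coordinate with itself instead
sqrt(outer(df$x, df$x, "-")^2 + outer(df$y, df$y, "-")^2)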
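Since the question asks for a function that takes only the data frame, here is a minimal sketch using explicit loops, which may be easier to follow when learning. The name euclid_dist is hypothetical, and the comparison calls stats::dist explicitly because the question defines its own function called dist.

```r
# Hypothetical helper: full Euclidean distance matrix between the rows of df
euclid_dist <- function(df) {
  m <- as.matrix(df)          # one row per point, one column per coordinate
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sqrt(sum((m[i, ] - m[j, ])^2))
    }
  }
  d
}

set.seed(1)
df <- data.frame(x = rnorm(5), y = rnorm(5))

# Compare with the built-in result, ignoring dimnames
all.equal(euclid_dist(df), as.matrix(stats::dist(df)), check.attributes = FALSE)
```

The double loop computes every pair twice and is slow for large inputs, but it works for any number of numeric columns.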

Related

Avoiding duplication in R

I am trying to fit a variety of (truncated) probability distributions to the same very thin set of quantiles. I can do it but it seems to require lots of duplication of the same code. Is there a neater way?
I am using this code by Nadarajah and Kotz to compute the quantiles of the truncated distributions:
qtrunc <- function(p, spec, a = -Inf, b = Inf, ...)
{
tt <- p
G <- get(paste("p", spec, sep = ""), mode = "function")
Gin <- get(paste("q", spec, sep = ""), mode = "function")
tt <- Gin(G(a, ...) + p*(G(b, ...) - G(a, ...)), ...)
return(tt)
}
where spec can be the name of any untruncated distribution for which code in R exists, and the ... argument is used to provide the names of the parameters of that untruncated distribution.
To achieve the best fit I need to measure the distance between the given quantiles and those calculated using arbitrary values of the parameters of the distribution. In the case of the gamma distribution, for example, the code is as follows:
spec <- "gamma"
fit_gamma <- function(x, l = 0, h = 20, t1 = 5, t2 = 13){
ct1 <- qtrunc(p = 1/3, spec, a = l, b = h, shape = x[1],rate = x[2])
ct2 <- qtrunc(p = 2/3, spec, a = l, b = h, shape = x[1],rate = x[2])
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2- ct2)^2
return(sqrt(sum(dist)))
}
where l is the lower truncation, h is the higher and I am given the two tertiles t1 and t2.
Finally, I seek the best fit using optim, thus:
gamma_fit <- optim(par = c(2, 4),
fn = fit_gamma,
l = l,
h = h,
t1 = t1,
t2 = t2,
method = "L-BFGS-B",
lower = c(1.01, 1.4))
Now suppose I want to do the same thing but fitting a normal distribution instead. The names of the parameters of the normal distribution that I am using in R are mean and sd.
I can achieve what I want but only by writing a whole new function fit_normal that is extremely similar to my fit_gamma function but with the new parameter names used in the definition of ct1 and ct2.
The problem of duplication of code becomes very severe because I wish to try fitting a large number of different distributions to my data.
What I want to know is whether there is a way of writing a generic fit_spec as it were so that the parameter names do not have to be written out by me.
Use x as a named list to create a list of arguments to pass into qtrunc() using do.call().
fit_distro <- function(x, spec, l = 0, h = 20, t1 = 5, t2 = 13){
args <- c(x, list(spec = spec, a = l, b = h))
ct1 <- do.call(qtrunc, args = c(list(p = 1/3), args))
ct2 <- do.call(qtrunc, args = c(list(p = 2/3), args))
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2 - ct2)^2
return(sqrt(sum(dist)))
}
This is called as follows, and it gives the same result as your original function.
fit_distro(list(shape = 2, rate = 3), "gamma")
# [1] 13.07425
fit_gamma(c(2, 3))
# [1] 13.07425
This will work with other distributions, for however many parameters they have.
fit_distro(list(mean = 10, sd = 3), "norm")
# [1] 4.08379
fit_distro(list(shape1 = 2, shape2 = 3, ncp = 10), "beta")
# [1] 12.98371
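To connect this back to optim, here is a sketch of a generic wrapper. The name fit_spec and its start argument (a named numeric vector of starting values) are assumptions for illustration. qtrunc and fit_distro are repeated in condensed form so the block runs on its own, and the names are re-attached inside the objective in case the optimizer drops them.

```r
# qtrunc() as in the question (Nadarajah and Kotz), condensed
qtrunc <- function(p, spec, a = -Inf, b = Inf, ...) {
  G <- get(paste0("p", spec), mode = "function")
  Gin <- get(paste0("q", spec), mode = "function")
  Gin(G(a, ...) + p * (G(b, ...) - G(a, ...)), ...)
}

# Distance between the target tertiles and those of the truncated distribution
fit_distro <- function(x, spec, l = 0, h = 20, t1 = 5, t2 = 13) {
  args <- c(as.list(x), list(spec = spec, a = l, b = h))
  ct1 <- do.call(qtrunc, c(list(p = 1/3), args))
  ct2 <- do.call(qtrunc, c(list(p = 2/3), args))
  sqrt((t1 - ct1)^2 + (t2 - ct2)^2)
}

# Hypothetical wrapper: further arguments in ... are passed on to optim()
fit_spec <- function(spec, start, l = 0, h = 20, t1 = 5, t2 = 13, ...) {
  par_names <- names(start)
  objective <- function(par) {
    names(par) <- par_names  # parameter names select the distribution's arguments
    fit_distro(par, spec, l = l, h = h, t1 = t1, t2 = t2)
  }
  optim(par = start, fn = objective, ...)
}

gamma_fit <- fit_spec("gamma", c(shape = 2, rate = 4),
                      method = "L-BFGS-B", lower = c(1.01, 1.4))
gamma_fit$par
```

Fitting another distribution then only needs a different spec and named start vector, for example fit_spec("norm", c(mean = 10, sd = 3)), with bounds chosen to suit that distribution.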

MLE - Optimization with constraints as non-linear functions of the variables

I have a problem with the following optimization. In particular, I would like to add the following constraint to the MLE problem: (x - location)/scale > 0. Without this constraint, the LL is Inf and the L-BFGS-B optimization fails with the error shown further down.
library(PearsonDS)
x <- rpearsonIII(n=1000, shape = 5, location = 6, scale = 7)
dpearson3 <- function (x, shape, location, scale, log = FALSE)
{
gscale <- abs(scale)
ssgn <- sign(scale)
density <- dgamma(ssgn * (x - location), shape = shape, scale = gscale, log = log)
return(density)
}
LL3 <- function(theta, x, display)
{
shape <- as.numeric(theta[1])
location <- as.numeric(theta[2])
scale <- as.numeric(theta[3])
tmp <- -sum(log(dpearson3(x, shape, location, scale, log = FALSE)))
if (is.na(tmp)) +Inf else tmp
if(display == 1){print(c(tmp, theta))}
return(sum(tmp))
}
control.list <- list(maxit = 100000, factr=1e-12, fnscale = 1)
fit <- optim(par = param,
fn = LL3,
hessian = TRUE,
method = "L-BFGS-B",
lower = c(0,-Inf,-Inf),
upper = c(Inf,Inf,Inf),
control = control.list,
x = x, display = 1)
Assume that I start the search from
param <- c(100,1000,10)
I then get the following error:
Error in optim(par = param, fn = LL3, hessian = TRUE, method = "L-BFGS-B", :
L-BFGS-B needs finite values of 'fn'
How to solve the issue?
Changing the MLE function to
LL3 <- function(theta, x, display){
shape <- as.numeric(theta[1])
location <- as.numeric(theta[2])
scale <- as.numeric(theta[3])
tmp <- -sum(log(dpearson3(x, shape, location, scale, log = FALSE)))
if (min((x - location) / scale) < 0) tmp <- 1e11 # I added this line
if (!is.finite(tmp)) tmp <- 1e11 # assigned here; the bare if/else discarded its result
if(display == 1){print(c(tmp, theta))}
return(tmp)
}
is the smartest thing I could find. In this way I avoid the Inf problem. Any better answer?
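One possible refinement, offered as a sketch rather than a definitive answer: compute the log-density directly with log = TRUE and return a large finite penalty whenever the result is not finite. The names nll and penalty are hypothetical. To keep the block self-contained, the data are simulated with rgamma instead of PearsonDS, assuming a positive scale so that x = location + scale * Gamma(shape), which matches the dpearson3 definition in the question.

```r
set.seed(1)
# Pearson III sample with shape 5, location 6, scale 7 (assumed positive scale)
x <- 6 + 7 * rgamma(1000, shape = 5)

# Negative log-likelihood with a finite penalty outside the support
nll <- function(theta, x, penalty = 1e10) {
  shape <- theta[1]
  location <- theta[2]
  scale <- theta[3]
  ll <- suppressWarnings(
    dgamma(sign(scale) * (x - location), shape = shape,
           scale = abs(scale), log = TRUE)
  )
  val <- -sum(ll)
  if (is.finite(val)) val else penalty  # covers Inf, -Inf, NA and NaN
}

# Start from a feasible point: location below the smallest observation
fit <- optim(par = c(2, min(x) - 1, 5), fn = nll, x = x,
             method = "L-BFGS-B",
             lower = c(1e-3, -Inf, -Inf), upper = c(Inf, Inf, Inf))
fit$par
```

The penalty only keeps the objective finite; it does not make the constraint smooth, so a feasible starting value still matters. At the infeasible start c(100, 1000, 10) from the question, nll returns the penalty instead of Inf.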

How to create a plot of power of test in R?

I want to compare normality tests (Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, Cramer-von Mises and Adjusted Jarque-Bera) based on the power of the test (1-beta) for sample sizes n = 10, 20, 30, 40 and 50.
testnormal=function(n,m,alfa)
{
require(nortest)
require(normtest)
require(xlsx)
pvalue=matrix(0,m,5)
decision=matrix(0,m,5)
for (i in 1:m)
{
data=runif(n,2,5)
test1=shapiro.test(data)
pv1=test1$p.value
pvalue[i,1]=pv1
if (pv1<alfa)
{
decision[i,1]=1
}
test2=ks.test(data,"pnorm",mean=mean(data),sd=sd(data))
pv2=test2$p.value
pvalue[i,2]=pv2
if (pv2<alfa)
{
decision[i,2]=1
}
test3=ad.test(data)
pv3=test3$p.value
pvalue[i,3]=pv3
if (pv3<alfa)
{
decision[i,3]=1
}
test4=cvm.test(data)
pv4=test4$p.value
pvalue[i,4]=pv4
if (pv4<alfa)
{
decision[i,4]=1
}
test5=ajb.norm.test(data)
pv5=test5$p.value
pvalue[i,5]=pv5
if (pv5<alfa)
{
decision[i,5]=1
}
}
result1=data.frame(pvalue)
result2=data.frame(decision)
colnames(result1)=c("SW","KS","AD","CvM","AJB")
colnames(result2)=c("SW","KS","AD","CvM","AJB")
write.xlsx(result1,"testnormal_pvalue.xlsx")
write.xlsx(result2,"testnormal_decision.xlsx")
one_min_beta=t(colSums(decision)/m) # power = share of rejections (decision == 1), since the data are not normal
test.of.power=data.frame(one_min_beta)
colnames(test.of.power)=c("SW","KS","AD","CvM","AJB")
return(test.of.power)
}
simulation=testnormal(10,100,0.05)
simulation2=testnormal(20,100,0.05)
simulation3=testnormal(30,100,0.05)
simulation4=testnormal(40,100,0.05)
simulation5=testnormal(50,100,0.05)
output=rbind(simulation,simulation2,simulation3,simulation4,simulation5)
output
I want to plot the power of each test against the sample size to see how it rises and falls. Can anyone help, please?
I went through your code and rewrote along the way to better understand what you want (what is the excel stuff for?). I have broken it down to smaller functions to allow you to have more control in these kinds of simulation studies. The code is not particularly efficient.
But does this give you what you want?
library("nortest")
library("normtest")
library("dplyr")
library("ggplot2")
# Function for doing all tests and putting it into a data.frame
tests <- function(data) {
list_of_tests <- list(
SW = shapiro.test(data),
KS = ks.test(data, pnorm, mean = mean(data), sd = sd(data)),
AD = ad.test(data) ,
CMV = cvm.test(data),
AJB = ajb.norm.test(data)
)
# Combine to tibble
res <- bind_rows(lapply(list_of_tests, unclass))
res[c("method", "p.value")] # Keep only method and p-value cols
}
# Test it with e.g. 'tests(data = runif(8, 2, 5))'
# Function for repeated simulation and testing, combine results and derive power
testnormal <- function(n, m, alpha) {
# Important that runif is inside replicate
test_res <-
bind_rows(replicate(tests(data = runif(n, 2, 5)), n = m,
simplify = FALSE))
test_of_powers <-
test_res %>%
group_by(method) %>%
summarize(power = mean(p.value < alpha)) %>%
mutate(n = n, m = m, alpha = alpha)
return(test_of_powers)
}
# Repeat over a number of simulations:
sims <- expand.grid(n = c(10, 20, 30, 40, 50),
m = 1000,
alpha = 0.05)
output <- bind_rows(
mapply(testnormal, n = sims$n, m = sims$m, alpha = sims$alpha,
SIMPLIFY = FALSE)
)
Actually doing the plot:
# Plot it
ggplot(output, aes(x = n, y = power, col = method)) +
geom_line()
This approach should make plotting easier, and it also makes it easy to run simulations over other grids of values (e.g. varying alpha) or to expand your range of n, etc.
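The core idea, that the estimated power is the share of simulated p-values below alpha, can be sketched in base R alone. This is a reduced illustration using only shapiro.test from the stats package, and the name power_sw is hypothetical.

```r
set.seed(42)

# Hypothetical helper: estimated power of the Shapiro-Wilk test against
# uniform(2, 5) data, from m simulated samples of size n
power_sw <- function(n, m = 200, alpha = 0.05) {
  p <- replicate(m, shapiro.test(runif(n, 2, 5))$p.value)
  mean(p < alpha)  # share of simulations in which normality is rejected
}

ns <- c(10, 20, 30, 40, 50)
power_table <- data.frame(n = ns, power = vapply(ns, power_sw, numeric(1)))
power_table
```

A quick base-graphics look at the trend is then plot(power ~ n, data = power_table, type = "b").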

label ylab in timeSeries::plot, type = 'o'

How do I label the y-axis with Greek letters using timeSeries::plot, i.e. change SB, SP, etc. to \alpha, \beta, etc.? I am aware that I need expression() in some way. However, I can't even get to the labels (I normally use ggplot2). Code below.
# install.packages("xtable", dependencies = TRUE)
library("timeSeries")
## Load Swiss Pension Fund Benchmark Data -
LPP <- LPP2005REC[1:12, 1:4]
colnames(LPP) <- abbreviate(colnames(LPP), 2)
finCenter(LPP) <- "GMT"
timeSeries::plot(LPP, type = "o")
It has been pointed out that the object structure, obtained with str(), is quite particular in LPP compared to, say, this object z:
z <- ts(matrix(rnorm(300), 100, 3), start = c(1961, 1), frequency = 12)
plot(z)
If anyone has an answer to either or both, I would appreciate it. I realize I can convert the data and plot it with ggplot2 (I have seen that here on SO), but I am interested in doing it directly on the timeSeries object LPP and the stats time-series object z.
[ REVISION & Edited ]
When plot.type is "multiple", we can't define ylab directly. Both plot(ts.obj) (S3 method) and plot(timeSeries.obj) (S4 method) take colnames(obj) as ylab, and I don't know of any way to use Greek letters as column names. (The difference in structure mainly comes from the difference between S3 and S4; colnames(timeSeries.obj) is equivalent to timeSeries.obj@units; the defaults are Series i and TS.i.)
We can set the y-axis label through the panel argument (it expects a function; the default is lines). The panel function is called inside for(i in 1:ncol(data)). I couldn't find a way to pass a suitable "i" to the panel function (there may be one, but I couldn't think of it), so I recover "i" by checking which column the data matches.
for timeSeries
ylabs <- expression(alpha, beta, gamma, delta)
row1 <- LPP[1,]
timeSeries.panel.f <- function(x, y, ...) {
lines(x, y, ...)
mtext(ylabs[which(row1 %in% y[1])], 2, line = 3)
}
plot(LPP, panel = timeSeries.panel.f, type = "o", ann = F)
title("Title")
mtext("Time", 1, line = 3)
## If you aren't so concerned about warnings, here is more general.
## (Many functions read `...` and they return warnings).
timeSeries.panel.f2 <- function(x, y, ..., ylabs = ylabs, row1 = row1) {
lines(x, y, ...)
mtext(ylabs[which(row1 %in% y[1])], 2, line = 3)
}
plot(LPP, panel = timeSeries.panel.f2, type = "o", ann = F,
ylabs = expression(alpha, beta, gamma, delta), row1 = LPP[1,])
title("Title")
mtext("Time", 1, line = 3)
for ts
ylabs <- expression(alpha, beta, gamma)
row1 <- z[1,]
ts.panel.f <- function(y, ...) {
lines(y, ...)
mtext(ylabs[which(row1 %in% y[1])], 2, line = 3)
}
plot(z, panel = ts.panel.f, ann = F)
title("Title")
mtext("Time", 1, line = 3)
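For the ts case, a different route that avoids the panel argument is to draw one stacked panel per column and pass each label directly as ylab. This is a sketch of a swapped-in technique, not the panel approach above. The helper name plot_ts_greek is hypothetical, and the layout will not match plot.ts exactly.

```r
set.seed(1)
z <- ts(matrix(rnorm(300), 100, 3), start = c(1961, 1), frequency = 12)

# Hypothetical helper: one panel per series, each with its own y-axis label
plot_ts_greek <- function(z, ylabs, ...) {
  stopifnot(length(ylabs) == ncol(z))
  op <- par(mfrow = c(ncol(z), 1), mar = c(2, 5, 1, 1))
  on.exit(par(op))  # restore the previous graphics settings
  for (i in seq_len(ncol(z))) {
    # ylabs[i] is a length-one expression, rendered via plotmath
    plot(z[, i], ylab = ylabs[i], xlab = "", ...)
  }
  invisible(NULL)
}
```

It would be called as plot_ts_greek(z, expression(alpha, beta, gamma), type = "o").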
Of course you can also achieve it with new functions derived from the originals (mostly identical to them). I show only the modified parts.
modified plot(ts.obj) (made from plot.ts)
my.plot.ts <- function(~~~, my.ylab = NULL) {
:
nm <- my.ylab # before: nm <- colnames(x)
:
}
# use
my.plot.ts(z, my.ylab = expression(alpha, beta, gamma), type = "o")
modified plot(timeSeries.obj)
# made from `.plot.timeSeries`
my.plot.timeSeries <- function(~~~, my.ylab = NULL) {
:
my.plotTimeSeries(~~~, my.ylab = my.ylab)
}
# made from `timeSeries:::.plotTimeSeries`
my.plotTimeSeries <- function(~~~, my.ylab) {
:
nm <- my.ylab # before: nm <- colnames(x)
:
}
#use
my.plot.timeSeries(LPP, my.ylab = expression(alpha, beta, gamma, delta), type="o")

Fixing an error in a function that uses `replicate()` in R?

I have an R function called RR. I'm wondering how to fix the following error:
Error in rbinom(1, size = n, prob = p) :
promise already under evaluation: recursive default argument reference or
earlier problems?
RR = function(n, p, n.sim){
fun <- function(n = n, p = p){
x <- rbinom(1, size = n, prob = p)
res <- binom.test(x, n, p)[[4]]
c(Lower = res[1], Upper = res[2])
}
sim <- t(replicate(n.sim, fun()))
mean(sim[,1] <= p & p<= sim[,2])
}
# Example of use:
RR(n = 15, p = .5, n.sim = 5)
R throws this error message when you define defaults for a function that have the same name as the function parameters, and then call that function from within another function with the same parameter names. So function(x = x) is generally not a good idea. If you just change fun to
fun <- function(n2 = n, p2 = p) your code runs without issues.
I do not completely understand myself why this happens, but it is easy to avoid.
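The likely reason is that default arguments are evaluated lazily inside the function's own environment, so in function(n = n) the default n refers to the argument itself rather than to the n of the enclosing function. A minimal sketch that reproduces the error and shows one way around it, dropping the inner arguments so that n and p are found in the enclosing RR:

```r
# Reproduce the error in isolation: the default for n refers to n itself
f <- function(n = n) n
msg <- tryCatch(f(), error = function(e) conditionMessage(e))
msg

# One possible fix: let fun() look up n and p in the enclosing function
RR <- function(n, p, n.sim) {
  fun <- function() {
    x <- rbinom(1, size = n, prob = p)
    ci <- binom.test(x, n, p)$conf.int
    c(Lower = ci[1], Upper = ci[2])
  }
  sim <- t(replicate(n.sim, fun()))
  mean(sim[, 1] <= p & p <= sim[, 2])  # share of intervals that cover p
}

set.seed(1)
coverage <- RR(n = 15, p = 0.5, n.sim = 5)
coverage
```

Renaming the inner arguments, as suggested above, works for the same reason: the defaults then refer to names that are not the arguments themselves.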
