Using Parameter Vectors of Different Lengths in R

Assume I have two different parameter vectors used as inputs to a function, and assume the two vectors are of different lengths. Is there a way to output all the possible values of that function? I know that if I use two parameter vectors of different lengths, the shorter vector is simply recycled, so that doesn't work. I can solve this "manually," as you can see below, but I'd like to find a more efficient way of calculating all the possible combinations. An example follows:
Assume I'm using the dbinom() function, which has as inputs x (the number of "successes" in the sample), n (the number of observations in the sample), and p (the probability of "success" on each trial). n stays constant at 20; however, my x varies over 7, 8, 9, ..., 20 ("x7" below) or 0, 1 ("x1" below). I also want to evaluate dbinom() at different values of p, specifically from 0 to 1 in 0.1 increments ("p" below). As you can see, the three parameter vectors x7, x1, and p have different lengths: 14, 2, and 11, respectively.
> p<-seq(from=0,to=1,by=.1)
> x7<-seq.int(7,20)
> x1<-c(0,1)
> n<-20
I can evaluate each combination by using one of the vectors (x7 and x1, or p) in dbinom() and then selecting a value for the remaining parameter. As you can see below, I used the vectors x7 and x1 and then "manually" changed p to equal 0, .1, .2, ..., 1.
> sum(dbinom(x7,n,.1))
[1] 0.002386089
> sum(dbinom(x7,n,.1))+sum(dbinom(x1,n,.1))
[1] 0.3941331
> sum(dbinom(x7,n,.2))+sum(dbinom(x1,n,.2))
[1] 0.1558678
> sum(dbinom(x7,n,.3))+sum(dbinom(x1,n,.3))
[1] 0.3996274
> sum(dbinom(x7,n,.4))+sum(dbinom(x1,n,.4))
[1] 0.7505134
> sum(dbinom(x7,n,.5))+sum(dbinom(x1,n,.5))
[1] 0.9423609
> sum(dbinom(x7,n,.6))+sum(dbinom(x1,n,.6))
[1] 0.9935345
> sum(dbinom(x7,n,.7))+sum(dbinom(x1,n,.7))
[1] 0.999739
> sum(dbinom(x7,n,.8))+sum(dbinom(x1,n,.8))
[1] 0.9999982
> sum(dbinom(x7,n,.9))+sum(dbinom(x1,n,.9))
[1] 1
> sum(dbinom(x7,n,1))+sum(dbinom(x1,n,1))
[1] 1
Basically, I want to know if there is a way to get R to print all the sums above, from 0.3941331, 0.1558678, ..., 1, with a single line of input, or some other more efficient way of varying the parameter p without simply copying each line and changing p.
I'm new to Stack Overflow, so I apologize in advance if I have not formulated my question conventionally.

You are using dbinom with a range of x values and then summing. Instead, use pbinom, which calculates the probability P(X <= q) (or P(X > q) if lower.tail = FALSE).
Thus you can calculate P(X > 6) + P(X <= 1) (which is what you appear to want) for every value of p at once, since pbinom is vectorized over p:
pbinom(q = 6, p = p, size = n, lower.tail = FALSE) + pbinom(q = 1, p = p, size = n)
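As a cross-check, here is a sketch that reproduces the manual sums from the question in a single line by looping over p with sapply; it should agree with the pbinom call above element by element:
# each element is sum(dbinom(x7, n, pr)) + sum(dbinom(x1, n, pr)) for one pr
sapply(p, function(pr) sum(dbinom(x7, n, pr)) + sum(dbinom(x1, n, pr)))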

Related

Create log spaced numbers in R with high upper bound

I want to create a set of 10 log-spaced numbers from zero to some big number M, say M = 60,000, in R.
First, I tried to use lseq() from the package emdbook. The problem with lseq, however, is that it cannot handle 0 as a starting point (it tries to calculate log(0) and fails).
Next, I tried to use logspace() from the pracma package in the following way:
Numbers <- log(logspace(0,M,10),base=10)
This works fine for values of M up to about 340; beyond that, the numbers in the set become Inf because the exponentiation overflows.
Is there any other way in R to create a set of log-spaced numbers from zero to some big number M that does not turn most of the numbers in the set into infinity and that can actually handle zero as a starting point?
Correct me if I am wrong, but can't you just calculate the logspace for smaller numbers and then multiply? They should be linearly related, right? Just look at this output:
library(pracma)
> log(logspace(0,60, 10), base = 10)[1:5]
[1] 0.000000000000000 6.666666666666667 13.333333333333334 20.000000000000000 26.666666666666668
> log(logspace(0,600, 10), base = 10)[1:5]
[1] 0.000000000000000 66.666666666666671 133.333333333333343 200.000000000000000 266.666666666666686
> x1 <- (log(logspace(0,600, 10), base = 10)*100)[2]
> x1
[1] 6666.666666666667
> x2 <- seq(0 , 9, 1)*x1
> x2
[1] 0.000000000000 6666.666666666667 13333.333333333334 20000.000000000000 26666.666666666668
[6] 33333.333333333336 40000.000000000000 46666.666666666672 53333.333333333336 60000.000000000000
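In fact (my observation, so please verify for your use case), logspace(0, M, n) is just 10^seq(0, M, length.out = n), so taking log base 10 recovers a plain linear sequence, and the overflow can be avoided entirely:
M <- 60000
Numbers <- seq(0, M, length.out = 10)  # same values as log(logspace(0, M, 10), base = 10), without computing 10^M
Numbers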

Generate numbers with specific correlation [with only positive values in the output]

I want to obtain a dataframe with simulated values which have a specific correlation to each other.
I need to use this function, but in the returned output there are also negative values, which do not have meaning for my purposes:
COR <- function(n, xmean, xsd, ymean, ysd, correlation) {
  x <- rnorm(n)
  y <- rnorm(n)
  z <- correlation * scale(x)[, 1] +
    sqrt(1 - correlation^2) * scale(resid(lm(y ~ x)))[, 1]
  xresult <- xmean + xsd * scale(x)[, 1]
  yresult <- ymean + ysd * z
  data.frame(x = xresult, y = yresult)
}
Please note that my question starts from this previous post (currently closed):
another similar discussion
Is there a method able to exclude from the final output all the rows which have at least one negative value? (In other words, x and y must always be positive.)
I have spent many hours on this without any concrete result.
Filtering rows which have at least one negative value can be done with the apply function, e.g.
df <- COR(100, 1, 1, 1, 1, 0.8)
filter <- apply(df, 1, function(x) sum(x < 0) > 0)
df <- df[!filter,]
plot(df)
First, I create a dataframe df with your function. Then I apply the function sum(x < 0) > 0 row-wise to the dataframe (the second argument of apply, 1, indicates going along the first dimension of the dataframe or array). This creates a logical vector that is TRUE for every row with at least one negative value. Subsetting the dataframe with the inverse of that (!filter) leaves you with all rows that have no negative values.
UPDATE:
It seems the package VineCopula offers functions to create distributions with a given correlation. I have not dug into the math deeply enough to fully grasp how copulas (i.e. multivariate probability distributions) work, but using this package you can at least simulate from, e.g., a Gaussian copula:
library(VineCopula)
BC <- BiCop(family = 1, par = 0.9)
sim <- BiCopSim(N = 1000, obj = BC)
cor(sim[,1], sim[,2])
plot(sim)
You might then be able to transform the resulting matrix to achieve a certain standard deviation.
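One possible follow-up (my sketch, not part of the original answer; the mean and sd are arbitrary choices): BiCopSim returns uniform margins on [0, 1], so qnorm can map them to normal marginals, and a mean well above zero keeps essentially all values positive:
gauss <- qnorm(sim, mean = 5, sd = 1)  # elementwise; the matrix shape is kept
cor(gauss[, 1], gauss[, 2])            # should be close to the copula parameter
plot(gauss)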

kernel density bandwidth in R

I have two vectors: 1) ~1000 sample means and 2) the corresponding ~1000 standard deviations of those means. I would like to create a kernel density plot of these data, using the sample means as the observations from which density is estimated, and the standard deviation of each mean as the bandwidth for that observation. Problem is, density only allows a vector of length 1 to be used as a bandwidth. For example:
plot(density(means,bw=error))
returns the following warnings:
1: In if (!is.finite(bw)) stop("non-finite 'bw'") :
the condition has length > 1 and only the first element will be used
2: In if (bw <= 0) stop("'bw' is not positive.") :
the condition has length > 1 and only the first element will be used
3: In if (!is.finite(from)) stop("non-finite 'from'") :
the condition has length > 1 and only the first element will be used
4: In if (!is.finite(to)) stop("non-finite 'to'") :
the condition has length > 1 and only the first element will be used
...and I get a plot that uses the error of the first item in the list as the bandwidth for all of my observations.
Any ideas on how I could implement a separate, user-defined bandwidth for each observation used to produce a kernel density plot?
It doesn't look like density supports this sort of bandwidth specification. I suppose you could roll your own with something like:
mydensity <- function(means, sds) {
  # grid spanning all observations plus/minus 3 sd
  x <- seq(min(means - 3 * sds), max(means + 3 * sds), length.out = 512)
  # average of normal kernels, one per observation with its own sd
  y <- sapply(x, function(v) mean(dnorm(v, means, sds)))
  cbind(x, y)
}
This will be a good deal slower than the real function (which appears to use fft in the computation). Here it is at work, with small bandwidths at the left and large at the right:
set.seed(144)
means <- runif(1000)
sds <- ifelse(means < 0.5, 0.001, 0.05)
plot(mydensity(means, sds))
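One small usage note (my addition): plot draws points by default, so type = "l" gives the familiar continuous density curve:
plot(mydensity(means, sds), type = "l", xlab = "x", ylab = "density")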

To find the distance between two roots in R

Suppose I have a function f(x) that is well defined on an interval I. I want to find the greatest and smallest roots of f(x) and then take their difference. What is a good way to program this?
To be precise, f can at worst be a rational function like (1+x)/(1-x), and should be a (high-degree) polynomial most of the time. I only need the result numerically, to some precision.
I am thinking about the following:
1. Convert f(x) into a form recognizable by R. (I can do this.)
2. Use R to list all roots of f(x) on I. (I found that the uniroot function only gives me one root.)
3. Use R to find the maximum and minimum elements in the list. (Should be possible once I convert it to a vector.)
4. Take the difference of the two roots. (Should be trivial.)
I am stuck on step (2) and do not know what to do. My professor gave a brute-force solution, suggesting that I:
1. Divide the interval I into one million pieces.
2. Evaluate f at each endpoint, and find the endpoints where f >= 0.
3. Choose the maximum and minimum elements from the set formed in step 2.
4. Take the difference between them.
I feel this approach is not very efficient and might not work for all f in general, but I am having trouble implementing it even for quadratics, and I do not know how to do step (2) either. So I want to ask for a hint or some toy examples.
At this point I am trying to implement the following code:
Y = rep(0, 200)
dim(Y) = c(100, 2)
for (i in 1:100) {
  X = rnorm(9, 0, 1)
  Z = rnorm(16, 0, 1)
  a = 0.64
  b = a * sum(Z^2) / sum(X^2)
  root_intervals <- function(f, interval, n = 1e6) {
    xvals <- seq(interval[1], interval[2], length = n)
    yvals <- f(xvals)
    ypos <- yvals > 0
    x1 <- which(tail(ypos, -1) != head(ypos, -1))
    x2 <- x1 + 1
    ## so all the zeroes we can see are between x1 and x2
    return(cbind(xvals[x1], xvals[x2]))
  }
Up to here everything is okay, but when I try to extract the roots into Y[i,1] and Y[i,2] with
Y[i,1]=(ri<-root intervals(function(x)(x/(a*x+b))^{9/2}*(1/((1-a)+a*(1-a)/b*x))^4-0.235505, c(0,40),n=1e6)[1]
I found I cannot evaluate it anymore. R keeps telling me
Error: unexpected symbol in:
"}
Y[i,1]=(ri<-root intervals"
and I got stuck. I would really appreciate everyone's help, as I am feeling lost.
I checked the function's expression many times using the plot function and it has no syntax mistakes. Also, I believe it is well defined for every x in the interval.
This should give you a good start on the brute force solution. You're right, it's not elegant, but for relatively simple univariate functions, evaluating 1 million points is trivial.
root_intervals <- function(f, interval, n = 1e6) {
  xvals <- seq(interval[1], interval[2], length = n)
  yvals <- f(xvals)
  # TRUE wherever the function is positive
  ypos <- yvals > 0
  # indices where consecutive grid points differ in sign
  x1 <- which(ypos[-1] != head(ypos, -1))
  x2 <- x1 + 1
  ## so all the zeroes we can see are between x1 and x2
  return(cbind(xvals[x1], xvals[x2]))
}
This function returns a two-column matrix of x values; in each row, the function changes sign between column 1 and column 2:
f1 <- function (x) 0.05 * x^5 - 2 * x^4 + x^3 - x^2 + 1
> (ri <- root_intervals(f1, c(-10, 10), n = 1e6))
[,1] [,2]
[1,] -0.6372706 -0.6372506
[2,] 0.8182708 0.8182908
> f1(ri)
[,1] [,2]
[1,] -3.045326e-05 6.163467e-05
[2,] 2.218895e-05 -5.579081e-05
Wolfram Alpha confirms results on the specified interval.
The top and bottom rows will be the min and max intervals found. These intervals (over which the function changes sign) are precisely what uniroot wants for its interval argument, so you could use it to solve for the (more) exact roots. Of course, if the function changes sign twice within one interval (or any even number of times), the crossing won't be picked up, so choose a big n!
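For example, a quick sketch of that refinement, reusing the ri matrix computed above:
exact <- apply(ri, 1, function(iv) uniroot(f1, lower = iv[1], upper = iv[2])$root)
exact  # the two crossings, now at uniroot's default tolerance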
Response to edited question:
Looks like you're trying to define a bunch of functions, but your edits have syntax errors. Here's what I think you're trying to do (this first part might take some more work to get right):
my_funs <- list()
Y <- rep(0, 200)
dim(Y) <- c(100, 2)
for (i in 1:100) {
  X <- rnorm(9, 0, 1)
  Z <- rnorm(16, 0, 1)
  a <- 0.64
  b <- a * sum(Z^2) / sum(X^2)
  # local() captures this iteration's a and b; otherwise every closure
  # would see only the values from the final iteration
  my_funs[[i]] <- local({
    a <- a
    b <- b
    function(x) (x / (a * x + b))^(9/2) * (1 / ((1 - a) + a * (1 - a) / b * x))^4 - 0.235505
  })
}
Here is root_intervals applied to the first of your generated functions:
> root_intervals(my_funs[[1]], interval = c(0, 40))
[,1] [,2]
[1,] 0.8581609 0.8582009
[2,] 11.4401314 11.4401714
Notice the output: a matrix, with each root of the function lying between the first and second columns of a row. Being a matrix, it can't be assigned to a single element of a vector. If you want a single root, use uniroot, using each row to set the lower and upper bounds. This is left as an exercise for the reader.
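For completeness, a minimal sketch of that exercise (my illustration, not part of the original answer):
ri <- root_intervals(my_funs[[1]], interval = c(0, 40))
# refine each bracketing row with uniroot, then take the spread of the roots
roots <- apply(ri, 1, function(iv) uniroot(my_funs[[1]], lower = iv[1], upper = iv[2])$root)
diff(range(roots))  # distance between the greatest and smallest roots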

Calculating correlation between residuals of linear regression with NAs and independent variable in R

I am trying to calculate the correlation coefficient between the residuals of a linear regression and the independent variable p.
Basically, the linear regression estimates the current sales as a function of the current price p and the past price p1.
The vector of current prices mydf$p has length 8, but the vector of residuals has length 7, because one observation has been deleted due to the NA value of p1.
# lag vector and pad with NAs
# Source: http://heuristically.wordpress.com/2012/10/29/lag-function-for-data-frames/
lagpad <- function(x, k) {
  if (!is.vector(x))
    stop('x must be a vector')
  if (!is.numeric(x))
    stop('x must be numeric')
  if (!is.numeric(k))
    stop('k must be numeric')
  if (1 != length(k))
    stop('k must be a single number')
  c(rep(NA, k), x)[1:length(x)]
}
mydf <- data.frame(p = c(10, 8, 10, 9, 10, 9, 10, 8))
mydf$p1 <- lagpad(mydf$p,1)
mydf$sales <- with(mydf, 200 - 15 * p + 5 * p1) + rnorm(nrow(mydf), 0,0.13)
model <- lm(data = mydf, formula = 'sales ~ p + p1')
print(summary(model))
print(cor(residuals(model), mydf$p))
# Error in cor(residuals(model), mydf$p) : incompatible dimensions
In this particular case, it is easy to use mydf$p[2:8] instead of mydf$p.
However, in general, there may be multiple rows, at random locations, in which NAs cause deletions.
How do I access the independent variables that were actually used in the regression after removing the rows containing NA?
One of my attempts was based on the R documentation for lm. I tried to access the "x" matrix through model[['x']] but that did not work.
You can get the actual data used to fit the model from model$model, and from there the p column:
cor(residuals(model), model$model$p)
Alternatively, is.na(mydf$p1) will tell you which rows in mydf have an NA in column p1:
cor(residuals(model), mydf$p[!is.na(mydf$p1)])
In general, is.na(x) tells us whether elements in x are NA or not:
> is.na(c(1,2,NA,4,NA,6))
[1] FALSE FALSE TRUE FALSE TRUE FALSE
model.matrix(model) seems to be what you are looking for. You can then select the variables you want with [ ] and the column number or name.
The x matrix is only created if you specify x = TRUE in your call to lm; then model$x will give you the value of x (this is more idiomatic than model[['x']]).
lm handles missing values by completely omitting any observation in which a value is missing. Maybe you want to do something like the following, subscripting on p1, the column that actually contains the NA?
cor(residuals(model), mydf$p[!is.na(mydf$p1)])
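Another option worth knowing (my addition, not from the answers above): fit with na.action = na.exclude, which makes residuals() pad the dropped rows with NA so the vectors line up again:
model2 <- lm(sales ~ p + p1, data = mydf, na.action = na.exclude)
# residuals(model2) has length 8, with NA where p1 was missing
cor(residuals(model2), mydf$p, use = "complete.obs")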
