Simulate Values under custom density - r

I have a theoretical and coding question that has to do with densities and simulating values.
I am building custom densities via the density(x) command, and I would like to generate 1,000 to 10,000 simulated values from such a density. The overall goal is to take two densities built with density(x$y), run simulations, and say that density A is higher than density B x% of the time. I would compare each pair of simulated values and count how many times A is higher than B.
Is there a way to accomplish this? Or is there some way to accomplish something similar with these densities? Thanks!

The sample function can take the midpoints of the intervals of the sample density (here dens is the object returned by density()) and use the density values as the prob argument.
mysamp <- sample(x = dens$x, size = 1000, prob = dens$y, replace = TRUE)
This has the disadvantage that you may need to jitter the result to avoid lots of duplicates.
mysamp <- jitter(mysamp)
Another method is to use approxfun and ecdf. You may need to invert the function (reverse the roles of x and y) so that you can sample by feeding runif(1000) into the result. I'm pretty sure there are worked examples of this on SO, and I'm pretty sure that I am one of many who have posted such code to R-help in the past. (If your searches have failed to find them, then post the search strategies and others can try to improve upon them.)
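The inversion idea above can be sketched as follows. This is only a minimal illustration under stated assumptions: the data vectors a and b, the helper make_sampler, and the sample size n_sim are hypothetical names that do not appear in the original posts, and the CDF is approximated by a cumulative sum over the grid that density() returns.

```r
set.seed(42)
# Hypothetical raw data standing in for x$y of groups A and B
a <- rnorm(200, mean = 1)
b <- rnorm(200, mean = 0)

# Build an inverse-CDF sampler from a density() object (illustrative helper)
make_sampler <- function(dens) {
  cdf <- cumsum(dens$y)
  cdf <- cdf / cdf[length(cdf)]          # normalise so the CDF ends at 1
  keep <- !duplicated(cdf)               # approxfun() wants distinct x values
  inv_cdf <- approxfun(cdf[keep], dens$x[keep], rule = 2)
  function(n) inv_cdf(runif(n))          # push uniforms through the inverse CDF
}

sample_a <- make_sampler(density(a))
sample_b <- make_sampler(density(b))

n_sim <- 10000
# Share of simulated pairs in which the draw from A exceeds the draw from B
p_a_gt_b <- mean(sample_a(n_sim) > sample_b(n_sim))
p_a_gt_b
```

Because the inverse CDF is interpolated linearly, the draws are continuous and no jitter should be needed.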

Following @DWin's tip to invert the ecdf, here is how to implement such an approach, using a spline to fit the inverted step function:
Given
z <- c(rnorm(40), runif(40))
plot(density(z))
Define
spl <- with(environment(ecdf(z)), splinefun(y, x))
sampler <- function(n)spl(runif(n))
Now you can call sampler() with the size you want:
plot(density(sampler(1000)))
Final note: This will generate few, if any, values outside the range of the original data (the spline can only extrapolate or overshoot slightly at the ends), but duplicates will be extremely rare:
> anyDuplicated(sampler(1e4))
[1] 0
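If values beyond the observed range are acceptable, a different technique from the two answers above is the smoothed bootstrap: a draw from a Gaussian kernel density estimate is a resampled data point plus normal noise whose standard deviation is the bandwidth. The sketch below assumes the default Gaussian kernel of density(); kde_sampler and draws are illustrative names.

```r
set.seed(1)
z <- c(rnorm(40), runif(40))
d <- density(z)   # default Gaussian kernel; d$bw is the kernel's standard deviation

# Smoothed bootstrap: resample the data, then add kernel noise
kde_sampler <- function(n) {
  sample(z, n, replace = TRUE) + rnorm(n, mean = 0, sd = d$bw)
}

draws <- kde_sampler(1e4)
range(z)       # range of the original data
range(draws)   # simulated values may fall outside that range
```

This samples from the kernel density estimate itself rather than from an interpolation of its grid, so no inversion step is involved.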

Related

Centering/standardizing variables in R [duplicate]

I'm trying to understand the definition of scale that R provides. I have data (mydata) that I want to make a heat map with, and there is a VERY strong positive skew. I've created a heatmap with a dendrogram for both scale(mydata) and log(mydata), and the dendrograms are different. Why? What does it mean to scale my data, versus log transform my data? And which would be more appropriate if I want to look at the dendrogram illustrating the relationship between the columns of my data?
Thank you for any help! I've read the definitions but they are whooshing over my head.
log simply takes the logarithm (base e, by default) of each element of the vector.
scale, with default settings, will calculate the mean and standard deviation of the entire vector (or of each column, if you pass a matrix), then "scale" each element by those values by subtracting the mean and dividing by the sd. (If you use scale(x, scale=FALSE), it will only subtract the mean but not divide by the standard deviation.)
Note that this will give you the same values
set.seed(1)
x <- runif(7)
# Manually scaling
(x - mean(x)) / sd(x)
scale(x)
scale does nothing more than standardize the data. The resulting values are known under several different names, one of them being z-scores ("Z" because the standard normal distribution is also known as the "Z distribution").
More can be found here:
http://en.wikipedia.org/wiki/Standard_score
This is a late addition, but I was looking for information on the scale function myself and thought it might help somebody else as well.
I would like to modify the response from Ricardo Saporta a little bit.
Scaling is not always done using the standard deviation, at least not in version 3.6.1 of R; I base this on "Becker, R. (2018). The new S language. CRC Press." and my own experimentation. Without centering, scale divides by the root mean square shown below, which coincides with the standard deviation only when the data are centered.
X.man.scaled <- X/sqrt(sum(X^2)/(length(X)-1))
X.aut.scaled <- scale(X, center = FALSE)
The results of these two lines are exactly the same; I show it without centering for simplicity.
I would respond in a comment but did not have enough reputation.
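The two cases discussed above can be checked side by side. This is a small sketch; the vector X and the names rms, same_rms and same_sd are made up for illustration.

```r
set.seed(1)
X <- runif(7)

# Without centering, scale() divides by the root mean square (n - 1 denominator)
rms <- sqrt(sum(X^2) / (length(X) - 1))
same_rms <- all.equal(as.vector(scale(X, center = FALSE)), X / rms)

# With the default centering, the divisor is the ordinary standard deviation
same_sd <- all.equal(as.vector(scale(X)), (X - mean(X)) / sd(X))

same_rms
same_sd
```

Both comparisons should come back TRUE, which reconciles the two answers: the root mean square of centered data is the standard deviation.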
I thought I would contribute by providing a concrete example of the practical use of the scale function. Say you have 3 test scores (Math, Science, and English) that you want to compare. You may even want to generate a composite score based on the 3 tests for each observation. Your data could look like this:
student_id <- seq(1,10)
math <- c(502,600,412,358,495,512,410,625,573,522)
science <- c(95,99,80,82,75,85,80,95,89,86)
english <- c(25,22,18,15,20,28,15,30,27,18)
df <- data.frame(student_id,math,science,english)
Obviously it would not make sense to compare the means of these 3 scores, as their scales are vastly different. By scaling them, however, you get more comparable scoring units:
z <- scale(df[,2:4],center=TRUE,scale=TRUE)
You could then use these scaled results to create a composite score. For instance, average the values and assign a grade based on the percentiles of this average. Hope this helped!
Note: I borrowed this example from the book "R In Action". It's a great book! Would definitely recommend.
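The composite score mentioned above could be built along these lines. This is a sketch only: the quintile cut points and the letter labels are illustrative assumptions, not something stated in the answer.

```r
math    <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)
science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86)
english <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18)

# Standardize each test so the three columns share the same units
z <- scale(cbind(math, science, english))

# Composite score: average of the three z-scores per student
score <- rowMeans(z)

# Assign a grade from the percentiles of the composite (assumed cut points)
cuts  <- quantile(score, c(0.2, 0.4, 0.6, 0.8))
grade <- cut(score, breaks = c(-Inf, cuts, Inf),
             labels = c("F", "D", "C", "B", "A"))

data.frame(student_id = 1:10, score = round(score, 2), grade)
```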

Low-pass filtering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a google search, came up with a dazzling range of packages. Some apply to 1D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R? ); some apply to images. However I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go this way as they typically involve (i) installation of more or less complex/heavy solutions (imageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r<-(0:360)/360*(2*pi)
x<-cos(r)
y<-sin(r)
z<-outer(x,y,"*")
noise<-0.3*matrix(runif(length(x)*length(y)),nrow=length(x))
zz<-z+noise
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience in signal processing!
Thanks.
One option is to fit a non-linear regression model and use its fitted values as the cleaned signal.
For example, with a polynomial regression the fitted curve (drawn in purple in the original plot) approximates the underlying data.
Following the same logic, you can apply this to every column of the zz matrix:
predictions <- matrix(, nrow = nrow(zz), ncol = 0)
for (i in 1:ncol(zz)) {
  # fit a degree-2 polynomial down column i and keep the fitted values
  pred <- as.matrix(fitted(lm(zz[,i] ~ poly(1:nrow(zz), 2, raw = TRUE))))
  predictions <- cbind(predictions, pred)
}
Then you can plot the predictions:
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree to get a better fit across the columns. Alternatively, you can use other, more powerful non-linear prediction methods (SVM, ANN, etc.) to get a more accurate model.
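For a filter in the signal-processing sense, a separable moving average in base R may be enough. The sketch below is not the method from the answer above: box_smooth and the window width k = 15 are assumptions chosen for illustration, and circular = TRUE is used only because this sample data is periodic.

```r
set.seed(1)
r  <- (0:360) / 360 * (2 * pi)
z  <- outer(cos(r), sin(r), "*")
zz <- z + 0.3 * matrix(runif(length(z)), nrow = nrow(z))

# Separable box (moving-average) low-pass filter; k is the window width
box_smooth <- function(m, k = 15) {
  w <- rep(1 / k, k)
  smooth1d <- function(v) {
    as.numeric(stats::filter(v, w, sides = 2, circular = TRUE))
  }
  m <- apply(m, 2, smooth1d)   # smooth down each column
  t(apply(m, 1, smooth1d))     # then along each row
}

smoothed <- box_smooth(zz)
# image(smoothed, main = "Box-filtered")
```

Since runif() noise has mean 0.15 here, the filtered matrix should sit close to z + 0.15 rather than z itself. A wider window removes more noise but also flattens the signal more.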

Random Data Sets within loops

Here is what I want to do:
I have a time series data frame with let us say 100 time-series of length 600 - each in one column of the data frame.
I want to pick 4 of the time series at random and then assign them random weights that sum to one (e.g. 0.1, 0.5, 0.3, 0.1). Using those, I want to compute the mean of the sum of the 4 weighted time series (i.e. a convex combination).
I want to do this let us say 100k times and store each result in the form
ts1.name, ts2.name, ts3.name, ts4.name, weight1, weight2, weight3, weight4, mean
so that I get a 9*100k df.
I tried some things already, but R is very slow with loops and I know vectorized solutions are better because of R's design.
Here is what I did, and I know it is horrible.
The df is in the form
v1,v2,v2.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
res=NULL
for (x in 1:100000)
{
s=sample(1:100,4)#pick 4 variables randomly
a=sample(seq(0,1,0.01),1)
b=sample(seq(0,1-a,0.01),1)
c=sample(seq(0,(1-a-b),0.01),1)
d=1-a-b-c
e=c(a,b,c,d)#4 random weights
average=mean(as.matrix(timeseries.df[,s])%*%e)#mean of the weighted sum
res=rbind(res,c(s,e,average))#in the end i get the 9*100k df
}
The procedure runs way too slowly.
EDIT:
Thanks for the help I got. I am not used to thinking in R, and I am not very used to translating every problem into a matrix algebra equation, which is what you need in R.
The problem becomes a little more complex if I want to calculate the standard deviation.
I need the covariance matrix, and I am not sure if or how I can pick the relevant elements for each sample from the covariance matrix of the original timeseries.df and then compute the sample variance
t(sampleweights)%*%sample_cov.mat%*%sampleweights
to get the ts.weighted_standard_dev matrix in the end.
Last question: what is the best way to proceed if I want to bootstrap the original df x times and then apply the same computations to test the robustness of my data?
Thanks
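OK, let me try to solve your problem. As a foreword: I can think of no application where it is sensible to do what you are doing. However, that is for you to judge (nonetheless, I would be interested in the application...)
First, note that the mean of the weighted sums equals the weighted sum of the means, since for weights w_i and series x_{i,t} of length T:
\frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{n} w_i x_{i,t} = \sum_{i=1}^{n} w_i \left(\frac{1}{T}\sum_{t=1}^{T} x_{i,t}\right)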
Let's generate some sample data:
timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol=40))
n <- 4 # number of items in the convex combination
replications <- 100 # number of replications
Thus, we may first compute the mean of all columns and do all further computations using this mean:
ts.means <- apply(timeseries.df, 2, mean)
Let's create some samples:
samples <- replicate(replications, sample(1:length(ts.means), n))
and the corresponding weights for those samples:
weights <- matrix(runif(replications*n), nrow=n)
# Now norm the weights so that each column sums up to 1:
weights <- weights / matrix(apply(weights, 2, sum), nrow=n, ncol=replications, byrow=TRUE)
That part was a little bit tricky. Run each function on its own with a small number of replications to figure out what it is doing. Note that I took a different approach for generating the weights: first draw uniformly distributed values and then normalize them by their sum. The result should serve the same purpose as your approach, but with arbitrary resolution and much better performance.
Again a little bit tricky: get the means for each time series and multiply them by the weights just computed:
ts.weightedmeans <- matrix(ts.means[samples], nrow=n) * weights
# and sum them up:
weights.sum <- apply(ts.weightedmeans, 2, sum)
Now we are basically done: all the information is available and ready to use. The rest is just a matter of correctly formatting the data.frame.
result <- data.frame(t(matrix(names(ts.means)[samples], nrow=n)), t(weights), weights.sum)
# For completeness, use better names:
colnames(result) <- c(paste("Sample", 1:n, sep=''), paste("Weight", 1:n, sep=''), "WeightedMean")
I would expect this approach to be rather fast: on my system the code took 1.25 seconds with the number of repetitions you stated.
Final word: You were in luck that I was looking for something to keep me thinking for a while. Your question was not asked in a way that encourages users to think about your problem and give good answers. The next time you have a problem, I suggest you read www.whathaveyoutried.com first and try to break the problem down as far as you can. The more concrete your problem, the faster and better your answers will be.
Edit
You correctly pointed out that the weights generated above are not uniformly distributed over the whole range of values. (I still have to object that even (0.9, 0.05, 0.025, 0.025) is possible, but it is very unlikely.)
Now we are playing in a different league, though. I am pretty sure that the approach you took is not uniformly distributed either: the probability of the last value being 0.9 is far lower than the probability of the first one being that large. Honestly, I do not have a good idea ready for you concerning the generation of uniformly distributed random numbers on the unit sphere with respect to the L_1 distance. (Actually, it is not really a unit sphere, but both problems should be identical.)
Thus, I have to give up on this.
I suggest you raise a new question at stats.stackexchange.com concerning the generation of those random vectors. It is probably fairly simple using the correct technique. However, I doubt that this question, with that heading and a fairly long answer, will attract a potential responder... (If you ask the question over there, I would appreciate a link, as I would like to know the solution ;)
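One standard technique for this, which is not part of the answer above, is to normalize independent exponential draws: the result follows a Dirichlet(1, ..., 1) distribution, which is uniform over the set of non-negative weight vectors summing to one. A minimal sketch, in which runif_simplex is a hypothetical helper name:

```r
set.seed(1)

# Draw `reps` weight vectors of length n, uniform on the simplex:
# normalised Exp(1) variables follow a Dirichlet(1, ..., 1) distribution
runif_simplex <- function(n, reps) {
  e <- matrix(rexp(n * reps), nrow = n)
  sweep(e, 2, colSums(e), "/")   # divide each column by its sum
}

weights <- runif_simplex(n = 4, reps = 100)
colSums(weights)[1:5]   # every column sums to 1
```

The resulting matrix has one weight vector per column, the same layout as the weights matrix used earlier in the answer.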
Concerning the variance: I do not fully understand which standard deviation you want to compute. If you just want the standard deviation of each time series, why not use the built-in function sd? In the computation above you could simply replace mean with it.
Bootstrapping: That is a whole new question. Please separate different topics by starting new questions.
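If what is wanted is the standard deviation of each weighted combination, the formula from the question's edit can be applied by computing the full covariance matrix once and indexing into it for each sample. This is a sketch under that reading of the question; the names S and weighted_sd are illustrative.

```r
set.seed(1)
timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol = 40))
n <- 4               # number of items in the convex combination
replications <- 100  # number of replications

S <- cov(timeseries.df)   # full covariance matrix, computed once

samples <- replicate(replications, sample(ncol(timeseries.df), n))
weights <- matrix(runif(replications * n), nrow = n)
weights <- sweep(weights, 2, colSums(weights), "/")   # columns sum to 1

# sd of each weighted series: sqrt(w' S[s, s] w) with the sampled sub-matrix
weighted_sd <- sapply(seq_len(replications), function(j) {
  s <- samples[, j]
  w <- weights[, j]
  sqrt(drop(t(w) %*% S[s, s] %*% w))
})
head(weighted_sd)
```

Only the 4 x 4 sub-matrix for the sampled columns enters each product, so the cost per replication does not depend on the length of the series.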

analytical derivative of splinefun()

I'm trying to fit a natural cubic spline to probabilistic data (probabilities that a random variable is smaller than certain values) to obtain a cumulative distribution function, which works well enough using splinefun():
cutoffs <- c(-90,-60,-30,0,30,60,90,120)
probs <- c(0,0,0.05,0.25,0.5,0.75,0.9,1)
CDF.spline <- splinefun(cutoffs,probs, method="natural")
plot(cutoffs,probs)
curve(CDF.spline(x), add=TRUE, col=2, n=1001)
I would then, however, like to use the density function, i.e. the derivative of the spline, to perform various calculations (e.g. to obtain the expected value of the random variable).
Is there any way of obtaining this derivative as a function, rather than just evaluated at a discrete number of points via CDF.spline(x, deriv=1)?
This is pretty close to what I'm looking for, but alas the example doesn't seem to work in R version 2.15.0.
Barring an analytical solution, what's the cleanest numerical way of going about this?
If you change the environment assignment line for g in the code that Berwin Turlach provided on R-help to this:
environment(g) <- environment(f)
... you succeed in R 2.15.1.
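As a numerical alternative that does not depend on the R-help code, the function returned by splinefun() already accepts a deriv argument, so it can be wrapped as a density and passed to integrate(). This is a sketch; PDF.spline, total_mass and expected are illustrative names, and note that nothing forces the derivative of a natural spline to be non-negative everywhere.

```r
cutoffs <- c(-90, -60, -30, 0, 30, 60, 90, 120)
probs   <- c(0, 0, 0.05, 0.25, 0.5, 0.75, 0.9, 1)
CDF.spline <- splinefun(cutoffs, probs, method = "natural")

# Density as a function: first derivative of the fitted spline
PDF.spline <- function(x) CDF.spline(x, deriv = 1)

# Sanity check: the density should integrate to about 1 over the support
total_mass <- integrate(PDF.spline, -90, 120)$value

# Expected value of the random variable: integral of x * f(x)
expected <- integrate(function(x) x * PDF.spline(x), -90, 120)$value

total_mass
expected
```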
