I'm trying to calculate the quantiles for a large dataset in R. My code currently looks like this:
percentile <- numeric(length = 5000000)
for (i in 1:5000000) {
  percentile[i] <- quantile(Result[1:i], 0.1)
}
Where Result is a vector of 5 million observations. It is important that the quantile is calculated based on the number of observations to date, as I'm testing simulation convergence. Currently this code takes an extremely long time to run, making it unusable. Is there a quicker way to do this, using vectorisation or some function in the plyr package? I've already tried the foreach package and although slightly faster, this still takes a massive amount of time.
Thanks!
You are calculating many more quantiles than you actually need. The code below should do the trick:
percentile <- sapply(
  seq(1000, 5000000, by = 1000),
  function(i) {
    quantile(head(Result, i), 0.1)
  }
)
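To check convergence you could then plot this running quantile against the number of observations used, for example (a quick sketch):
idx <- seq(1000, 5000000, by = 1000)
plot(idx, percentile, type = "l",
     xlab = "Observations used", ylab = "10% quantile")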
So, I'd like to test how precise the t-test is at detecting a mean for various distributions, but I don't want to have to define the sampling distribution inside the function each time I run it. If I write function(data, mju) and then pass rnorm(n) or any other random sample as data, I obviously get the same results when replicating the function, because there is only the one "data" sample that was passed in at the start. To make clearer what I want, here is the code:
t_ci <- function(data, mju){
  prod(t.test(data)$conf.int - mju)
}

set.seed(NULL)

prec_t <- function(data, n, N, mju){
  sim <- replicate(N, t_ci(data, mju))
  sim[sim < 0]/N
}
The first function checks whether the real theoretical parameter "mju" is in the confidence interval. The second one replicates the function t_ci N times, to see how precise the t-test confidence intervals are for the selected data. I'd like to have an option to just indicate the distribution, and then it would generate n-sized samples N times and calculate the precision. But as far as my code goes, it only replicates the same data over and over. Maybe there is a solution for this problem?
Also, something seems to be wrong with the function prec_t: I'd like to count the number of times t_ci produced a negative outcome and then divide that count by N.
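To illustrate, something along these lines is roughly what I am after (just a sketch, I don't know whether it is the right way to do it):
# sketch: pass a sample-generating function instead of a fixed data vector
prec_t2 <- function(rsample, N, mju){
  sim <- replicate(N, t_ci(rsample(), mju))  # draws a fresh sample on every replication
  sum(sim < 0) / N                           # share of CIs that contain mju
}
# e.g. prec_t2(function() rnorm(30), N = 1000, mju = 0)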
Any help would be greatly appreciated! Thanks in advance.
Objective: The overall objective of the problem is to calculate the confidence interval (CI) of the mean for various sample sizes (n = 2, 4, ..., 1024) of rnorm draws, 10,000 times each, and then count the number of times each one fails (this likely requires a counter and an if/else statement). Finally, the results are to be plotted.
I am trying to calculate CIs of the means for simulations over several sample sizes; however, I am first trying to break down the code for one specific sample size, a = 8.
The problem I have is that I do not know how to generate a linear model for each row. Would anyone know how I can do this? Here is what I have so far:
a <- 8
n.sim.3 <- 10000
for (i in a) {
  r.mat <- matrix(rnorm(i*n.sim.3), nrow = n.sim.3, ncol = a)
  # The lm command is where I'm stuck; I don't think this is correct
  lm.tmp <- apply(r.mat, 1, lm(n.sim.3 ~ 1))
  confint.tmp <- confint(lm.tmp)
}
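For what it's worth, my current guess is something like the following, but I'm not sure it is right:
# guess: fit an intercept-only lm to each row and take its confidence interval
ci.mat <- t(apply(r.mat, 1, function(row) confint(lm(row ~ 1))))
colnames(ci.mat) <- c("lower", "upper")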
I am trying to fit about 300 piecewise regressions with the segmented function from the segmented package in R. This is taking a lot of time (~4 days) because of the segmented function. I am already using all the cores of my computer, but I am not a programmer and I guess this code is probably not optimal. Can I improve the code below to make it run faster? How?
Here is a reproducible example. df is a simulated data frame that corresponds to one of the 300 datasets that I want to analyze. Each dataset is one day, and during each day I measure the temperature every 5 minutes, x is the temperature and y the time of the day. The figure below shows what my data look like. The pattern is very specific and repeatable across days and each change in slope corresponds to well understood biological mechanisms. This is why I can guess all the values of psi (for ex. time of sunrise and sunset).
Of course the real data are more variable and I use many iterations (about 200, here I reduced to 10 for the example) to increase my chances of getting a successful fit.
library(segmented)
y<-seq(1,288,1)
x<-c(seq(0,-30,-1),seq(-30,-54,-2),seq(-54,30,1),seq(30,10,-1),seq(10,90,1),seq(90,34,-1))
df<-data.frame(x,y)
head(df)
plot(x~y)
t1=31
t2=44
t3=129
t4=150
t5=231
iterations<-10
for (j in 1:iterations) {
  res <- lm(formula = x ~ y, data = df)
  try(result <- segmented(
    res, seg.Z = ~y, psi = c(t1, t2, t3, t4, t5),
    control = seg.control(it.max = 200, display = F, K = 4, h = 0.1, n.boot = 100, random = T)))
}
result
Taking the lm out of the loop doesn't significantly improve the speed of the loop.
One thing that should help is to break out of the iterations once a result has been found. In most cases it will find something on the first iteration, which avoids running the remaining unnecessary iterations (around 200 in your real runs).
rm(result)
for (j in 1:iterations) {
  res <- lm(formula = x ~ y, data = df)
  try(result <- segmented(
    res, seg.Z = ~y, psi = c(t1, t2, t3, t4, t5),
    control = seg.control(it.max = 200, display = F, K = 4, h = 0.1, n.boot = 100, random = T)))
  if (exists("result")) break
}
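If it helps, the same retry logic could be wrapped in a function and applied to each day's data; here df_list stands for a hypothetical list holding your ~300 daily data frames:
fit_one_day <- function(df, iterations = 10) {
  res <- lm(x ~ y, data = df)
  result <- NULL
  for (j in 1:iterations) {
    try(result <- segmented(
      res, seg.Z = ~y, psi = c(t1, t2, t3, t4, t5),
      control = seg.control(it.max = 200, display = F, K = 4, h = 0.1, n.boot = 100, random = T)))
    if (!is.null(result)) break   # stop as soon as a fit succeeds
  }
  result
}
# fits <- lapply(df_list, fit_one_day)   # df_list is assumed, not part of the original post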
I'm looking to fit a weighted distribution to a data set I have.
I'm currently using the fitdist command but don't know if there is a way to add weighting.
library(fitdistrplus)
df<-data.frame(value=rlnorm(100,1,0.5),weight=runif(100,0,2))
#This is what I'm doing but not really what I want
fit_df<-fitdist(df$value,"lnorm")
#How to do this
fit_df_weighted<-fitdist(df$value,"lnorm",weight=df$weight)
I'm sure this has been answered before somewhere but I've looked and can't find anything.
thanks in advance,
Gordon
Perhaps you could use the rep() function and a quick loop to approximate the distribution.
You could multiply each weight by, say, 10000, round it, and then use that number to indicate how many copies of the corresponding value you need in your vector. After running a quick loop, you could then run the vector through the fitdist() algorithm.
df$scaled_weight <- round(df$weight*10000,0)
my_vector <- vector()
## quick loop
for (i in 1:nrow(df)){
  values <- rep(df$value[i], df$scaled_weight[i])
  my_vector <- c(my_vector, values)
}
## find parameters
fit_df_weighted <- fitdist(my_vector,"lnorm")
The standard errors would be rubbish, but the estimated parameters should be sufficient.
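As an aside, rep() is vectorised over its times argument, so the loop could probably be replaced by a one-liner that builds the same vector:
my_vector <- rep(df$value, df$scaled_weight)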
Here is what I want to do:
I have a time series data frame with let us say 100 time-series of length 600 - each in one column of the data frame.
I want to pick up 4 of the time-series randomly and then assign them random weights that sum up to one (ie 0.1, 0.5, 0.3, 0.1). Using those I want to compute the mean of the sum of the 4 weighted time series variables (e.g. convex combination).
I want to do this let us say 100k times and store each result in the form
ts1.name, ts2.name, ts3.name, ts4.name, weight1, weight2, weight3, weight4, mean
so that I get a 9*100k df.
I tried some things already, but R is very slow with loops and I know vector-oriented solutions are better because of R's design.
Here is what I did and I know it is horrible
The df is in the form
v1,v2,v3,.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
e <- NULL
for (x in 1:100000) {
  s <- sample(1:100, 4)                      # pick 4 variables randomly
  a <- sample(seq(0, 1, 0.01), 1)
  b <- sample(seq(0, 1 - a, 0.01), 1)
  c <- sample(seq(0, (1 - a - b), 0.01), 1)
  d <- 1 - a - b - c
  e <- c(a, b, c, d)                         # 4 random weights
  average <- mean(timeseries.df[, s] %*% t(e))
  e <- rbind(e, s, average)                  # in the end I get the 9*100k df
}
The procedure runs way too slowly.
EDIT:
Thanks for the help so far; I am not used to thinking in R, and I am not very used to translating every problem into a matrix-algebra expression, which is what you need in R.
The problem becomes a bit more complex if I want to calculate the standard deviation as well.
I need the covariance matrix, and I am not sure if/how I can pick the corresponding elements for each sample from the covariance matrix of the original timeseries.df and then compute the sample variance
t(sampleweights) %*% sample_cov.mat %*% sampleweights
to end up with the ts.weighted_standard_dev matrix.
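To make it concrete, for a single draw I have something like this in mind (just a sketch, with s the sampled columns and e the weight vector from the loop above; I am not sure it is right):
cov.sub <- cov(timeseries.df[, s])        # covariance matrix of the 4 sampled series
weighted.var <- t(e) %*% cov.sub %*% e    # variance of the weighted combination
weighted.sd <- sqrt(weighted.var)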
Last question: what is the best way to proceed if I want to bootstrap the original df x times and then apply the same computations to test the robustness of my data?
thanks
Ok, let me try to solve your problem. As a foreword: I can think of no application where it is sensible to do what you are doing. However, that is for you to judge (nonetheless, I would be interested in the application...).
First, note that the mean of the weighted sum equals the weighted sum of the means, i.e. mean(w1*x1 + w2*x2 + ... + wn*xn) = w1*mean(x1) + w2*mean(x2) + ... + wn*mean(xn).
Let's generate some sample data:
timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol=40))
n <- 4 # number of items in the convex combination
replications <- 100 # number of replications
Thus, we can first compute the means of all columns and do all further computations using these means:
ts.means <- apply(timeseries.df, 2, mean)
Let's create some samples:
samples <- replicate(replications, sample(1:length(ts.means), n))
and the corresponding weights for those samples:
weights <- matrix(runif(replications*n), nrow=n)
# Now norm the weights so that each column sums up to 1:
weights <- weights / matrix(apply(weights, 2, sum), nrow=n, ncol=replications, byrow=T)
That part was a little bit tricky. Run the individual functions on their own with a small number of replications to figure out what they are doing. Note that I took a different approach for generating the weights: first get uniformly distributed values and then normalise them by their sum. The result should be equivalent to your approach, but with arbitrary resolution and much better performance.
Again a little bit tricky: get the means for each time series and multiply them by the weights just computed:
ts.weightedmeans <- matrix(ts.means[samples], nrow=n) * weights
# and sum them up:
weights.sum <- apply(ts.weightedmeans, 2, sum)
Now, we are basically done - all the information is available and ready to use. The rest is just a matter of correctly formatting the data.frame.
result <- data.frame(t(matrix(names(ts.means)[samples], nrow=n)), t(weights), weights.sum)
# For readability, use better names:
colnames(result) <- c(paste("Sample", 1:n, sep=''), paste("Weight", 1:n, sep=''), "WeightedMean")
I would assume this approach to be rather fast - on my system the code took 1.25 seconds with the number of repetitions you stated.
Final word: You were in luck that I was looking for something to keep me thinking for a while. Your question was not asked in a way that encourages users to think about your problem and give good answers. The next time you have a problem, I would suggest you read www.whathaveyoutried.com first and try to break the problem down as far as you are able to. The more concrete your problem, the faster and the higher quality your answers will be.
Edit
You mentioned correctly that the weights generated above are not uniformly distributed over the whole range of values. (I still have to object that even (0.9, 0.05, 0.025, 0.025) is possible, but it is very unlikely).
Now we are playing in a different league, though. I am pretty sure that the approach you took is not uniformly distributed either - the probability of the last value being 0.9 is far less than the probability of the first one being that large. Honestly, I do not have a good idea ready for you concerning the generation of random numbers uniformly distributed on the unit sphere with respect to the L_1 distance. (Actually, it is not really the whole unit sphere, only the part with non-negative coordinates, i.e. a simplex, but the two problems should be equivalent.)
Thus, I have to give up on this.
I would suggest you raise a new question at stats.stackexchange.com concerning the generation of those random vectors. It is probably fairly simple using the correct technique. However, I doubt that this question, with this heading and an already fairly long answer, will attract a potential responder... (If you ask the question over there, I would appreciate a link, as I would like to know the solution ;)
Concerning the variance: I do not fully understand which standard deviation you want to compute. If you just want to compute the standard deviation of each time series, why do you not use the built-in function sd? In the computation above you could just replace mean by it.
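For instance, if that is what you mean, something along these lines should do it:
ts.sds <- apply(timeseries.df, 2, sd)   # standard deviation of each time series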
Bootstrapping: That is a whole new question. Separate different topics by starting new questions.