I'm trying to generate a random double in R between 0 and 100, with 0 being a possible result, and normally runif(1, min, max) would do what I need. However, if I understand it correctly, runif will only give you results strictly between the min and max and never the actual limits.
Is there a way in R to generate a random double that includes only one of the limits? (In this case, 0≤x<100)
josliber created a custom function that includes both limits (https://stackoverflow.com/a/24070116/6429759), but I don't know whether it can be modified to include only the min.
I do realise this would only change the outcome an extremely small fraction of the time, but it's part of a function that will be run extremely frequently, so it's not for nothing.
Ignoring R for the moment, the probability of obtaining exactly 0 when sampling from the uniform distribution is 0 - so for all practical purposes, drawing from the open interval and the closed interval are essentially the same.
Now, in R (or any computer-based system, for that matter), we cannot actually represent an infinite number of numbers, because we're working with a finite representational system. So you technically are drawing from a finite population, and there is a non-zero probability of drawing a boundary point. However, a good random number generator (and R has several pretty good ones) will do a pretty good job of mimicking reality - which means that even if you drew from a closed interval instead of an open interval, the probability of actually drawing 0 is negligible.
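That said, if you do want a draw that is guaranteed to satisfy 0 ≤ x < 100, a minimal sketch is to map random integers onto the half-open interval. This is my own illustration, not a standard R idiom, and the helper name runif_halfopen is made up; note the grid is coarser than runif's doubles.

runif_halfopen <- function(n, min = 0, max = 100) {
  m <- .Machine$integer.max                    # 2^31 - 1
  i <- sample.int(m, n, replace = TRUE) - 1L   # uniform integers in 0 .. m-1
  min + (max - min) * (i / m)                  # values in [min, max): min possible, max never reached
}

set.seed(1)
runif_halfopen(5)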
I have recently posted a "very newbie to R" question about the correct way of doing this; if you are interested, you can find it [here].
I have now managed to develop a simple R script that does the job, but now the results are what troubles me.
Long story short, I'm using R to analyze lpp (Linear Point Pattern) objects with mad.test. That function performs a hypothesis test where the null hypothesis is that the points are randomly distributed. Currently I have 88 lpps to analyze, and according to the p.value 86 of them are randomly distributed and 2 of them are not.
These are the two not randomly distributed lpps.
Looking at them you can see some kind of clustering in the first one, but the second one only has three points, and it seems to me that there is no way one can claim that just three points do not correspond to a random distribution. There are other tracks with one, two, or three points, but they all fall into the "random" lpps category, so I don't know why this one is different.
So here is the question: how many points are too few for CSR testing?
I have also noticed that these two lpps have a much lower $statistic$rank than the others. I have tried to find out what that means but I'm clueless right now, so here is another newbie question: is the $statistic$rank some kind of quality indicator, and can I therefore use it to group my lpp analyses into "significant ones" and "too few points" ones?
My R script and all the shp files can be downloaded from here (850 KB).
Thank you so much for your help.
It is impossible to give a universal answer to the question of how many points are needed for an analysis. Usually 0, 1 or 2 are too few for a standalone analysis. However, if they are part of repeated measurements of the same thing they might still be interesting. Also, I would normally say that your example with 3 points is too few to say anything interesting. However, an extreme example would be a single long line segment where one point occurs close to one end and the two others occur close to each other at the other end. This is not so likely to happen under CSR, and you may be inclined not to believe that hypothesis. This appears to be what happened in your case.
Regarding your question about the rank, you might want to read up a bit more on the Monte Carlo test you are performing. Basically, you summarise the point pattern by a single number (the maximum absolute deviation of the linear K function) and then you look at how extreme this number is compared to numbers generated at random from CSR. Assuming you use 99 simulations of CSR, you have 100 numbers in total. If your data ranks as the most extreme ($statistic$rank==1) among these, it has a p-value of 1%. If it ranks as the 50th number, the p-value is 50%. If you used another number of simulations you have to calculate accordingly, i.e. with 199 simulations rank 1 is 0.5%, rank 2 is 1%, etc.
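As a small sketch of that arithmetic (the numbers here are just the example values from above, not your data):

nsim <- 99                          # number of CSR simulations used by the test
rank_obs <- 1                       # the $statistic$rank reported for your pattern
p_value <- rank_obs / (nsim + 1)    # rank 1 out of 100 numbers -> 0.01, i.e. 1%
p_value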
There is a fundamental problem here with multiple testing. You are applying a hypothesis test 88 times. The test is (by default) designed to give a false positive in 5 percent (1 in 20) of applications, so if the null hypothesis is true, you should expect 88/20 = 4.4 false positives to have occurred in your 88 tests. So getting only 2 positive results ("non-random") is entirely consistent with the null hypothesis that ALL of the patterns are random. My conclusion is that the patterns are random.
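A quick sketch of that arithmetic in R, under the assumption that all 88 null hypotheses are true:

n_tests <- 88
alpha <- 0.05
n_tests * alpha                          # expected number of false positives: 4.4
pbinom(2, size = n_tests, prob = alpha)  # P(2 or fewer "significant" results) is roughly 0.18, so 2 is unremarkable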
I am currently working on an online machine learning algorithm, where I need to make sure each feature in the input vector has a 0 mean and 1 variance across the samples.
I think it's trivial to do this when you have all the samples beforehand, but this isn't the case in online learning.
Does anybody know how to normalize a new given vector in such a way so that each feature across the previous samples (+ the new one) has 0 mean and 1 variance?
Is it even possible?
Thanks
Bootstrap the first few hundred samples, estimate the mean and variance, and do Gaussian normalization to mean 0 and variance 1; then normalize any future vector against those estimates. No ML algorithm is very strict about normalization to 0,1 and this should suffice.
For a strictly online problem where you learn from the first defect onwards, I am not sure how to do it, unless you have some idea of the range of the variables, like the max value of a pixel in a greyscale image, etc. Renormalizing and re-training after, say, every x defects are collected would prove way too costly.
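A rough sketch of the first suggestion (all names and sizes here are placeholders for illustration, not from the question): estimate the per-feature mean and standard deviation from an initial batch, then reuse those frozen estimates for every vector that arrives later.

set.seed(42)
init_batch <- matrix(rnorm(300 * 10, mean = 5, sd = 3), nrow = 300)  # first 300 samples, 10 features
mu <- colMeans(init_batch)                 # per-feature means from the initial batch
sigma <- apply(init_batch, 2, sd)          # per-feature standard deviations

normalize_new <- function(x) (x - mu) / sigma   # apply the frozen estimates to any new vector

new_vec <- rnorm(10, mean = 5, sd = 3)     # a newly arriving sample
normalize_new(new_vec)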
In a question on Cross Validated (How to simulate censored data), I saw that the optim function was used as a kind of solver instead of as an optimizer. Here is an example:
optim(1, fn=function(scl){(pweibull(.88, shape=.5, scale=scl, lower.tail=F)-.15)^2})
# $par
# [1] 0.2445312
# ...
pweibull(.88, shape=.5, scale=0.2445312, lower.tail=F)
# [1] 0.1500135
I have found a tutorial on optim here, but I am still not able to figure out how to use optim to work as a solver. I have several questions:
What is the first parameter (i.e., the value 1 being passed in)?
What is the function that is passed in?
I can understand that it is taking the Weibull probability distribution and subtracting 0.15, but why are we squaring the result?
I believe you are referring to my answer. Let's walk through a few points:
The OP (of that question) wanted to generate (pseudo-)random data from a Weibull distribution with specified shape and scale parameters, where the censoring would be applied to all data past a certain censoring time, ending up with a prespecified censoring rate. The problem is that once you have specified any three of those, the fourth is necessarily fixed. You cannot specify all four simultaneously unless you are very lucky and the values you specify happen to fit together perfectly. As it happened, the OP was not so lucky with the four preferred values; it was impossible to have all four as they were inconsistent. At that point, you can decide to specify any three and solve for the last. The code I presented showed examples of how to do that.
As noted in the documentation for ?optim, the first argument is par "[i]nitial values for the parameters to be optimized over".
Very loosely, the way the optimization routine works is that it calculates an output value given a function and an input value. Then it 'looks around' to see if moving to a different input value would lead to a better output value. If that appears to be the case, it moves in that direction and starts the process again. (It stops when it does not appear that moving in either direction will yield a better output value.)
The point is that it has to start somewhere, and the user is obliged to specify that value. In each case, I started with the OP's preferred value (although really I could have started almost anywhere).
The function that I passed in is ?pweibull. It is the cumulative distribution function (CDF) of the Weibull distribution. It takes a quantile (X value) as its input and returns the proportion of the distribution that has been passed through up to that point. Because the OP wanted to censor the most extreme 15% of that distribution, I specified that pweibull return the proportion that had not yet been passed through instead (that is the lower.tail=F part). I then subtracted .15 from the result.
Thus, the ideal output (from my point of view) would be 0. However, it is possible to get values below zero by finding a scale parameter that makes the output of pweibull < .15. Since optim (or really most any optimizer) finds the input value that minimizes the output value, that is what it would have done. To keep that from happening, I squared the difference. That means that when the optimizer went 'too far' and found a scale parameter that yielded an output of .05 from pweibull, the difference was -.10 (i.e., < 0), and the squaring makes the ultimate output +.01 (i.e., > 0, or worse). This pushes the optimizer back towards the scale parameter for which pweibull outputs exactly .15, giving (.15-.15)^2 = 0.
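A small illustration of that point (mine, not part of the original answer): at a clearly wrong scale the raw difference is negative, and so looks "better" to a minimizer, while the squared difference is correctly worse than the ideal value of 0.

obj_raw <- function(scl) pweibull(.88, shape = .5, scale = scl, lower.tail = FALSE) - .15
obj_sq  <- function(scl) obj_raw(scl)^2

obj_raw(0.001)     # about -0.15 : smaller than 0, so a minimizer would chase it
obj_sq(0.001)      # about  0.0225 : squaring makes the same point worse than 0
obj_sq(0.2445312)  # about  0 : the scale optim found in the question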
In general, the distinction you are making between an "optimizer" and a "solver" is opaque to me. They seem like two different views of the same elephant.
Another possible confusion here involves optimization vs. regression. Optimization is simply about finding an input value[s] that minimizes (maximizes) the output of a function. In regression, we conceptualize data as draws from a data generating process that is a stochastic function. Given a set of realized values and a functional form, we use optimization techniques to estimate the parameters of the function, thus extracting the data generating process from noisy instances. Part of regression analyses partakes of optimization then, but other aspects of regression are less concerned with optimization and optimization itself is much larger than regression. For example, the functions optimized in my answer to the other question are deterministic, and there were no "data" being analyzed.
I'm using R to convert some shapefiles. R does this using just one core of my processor, and I want to speed it up using parallel processing. So I've parallelized the process like this.
Given files which is a list of files to convert:
library(doMC)
registerDoMC()
foreach(f=files) %dopar% {
# Code to do the conversion
}
This works just fine and it uses 2 cores. According to the documentation for registerDoMC(), by default that function uses half the cores detected by the parallel package.
My question is why should I use half of the cores instead of all the cores? (In this case, 4 cores.) By using the function registerDoMC(detectCores()) I can use all the cores on my system. What, if any, are the downsides to doing this?
Besides the question of scalability, there is a simple rule: Intel Hyperthreading cores do not help, under Windows at least. So I get 8 with detectCores(), but I never found an improvement when going beyond 4 cores, even with MCMC parallel threads which in general scale perfectly.
If someone has a case (under Windows) where there is such an improvement from Hyperthreading, please post it.
Any time you do parallel processing there is some overhead (which can be nontrivial, especially with locking data structures and blocking calls). For small batch jobs, running on a single core or two cores is much faster due to the fact that you're not paying that overhead.
I don't know the size of your job, but you should probably run some scaling experiments where you time your job on 1 processor, 2 processors, 4 processors, 8 processors, until you hit the max core count for your system (typically, you always double the processor count). EDIT: It looks like you're only using 4 cores, so time with 1, 2, and 4.
Run timing results for ~32 trials for each core count and get a confidence interval, then you can say for certain whether running on all cores is right for you. If your job takes a long time, reduce the # of trials, all the way down to 5 or so, but remember that more trials will give you a higher degree of confidence.
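A sketch of such a scaling experiment, reusing the foreach loop from the question (files and the conversion code are placeholders carried over from the question; the trial count is the 32 suggested above):

library(doMC)
library(foreach)

core_counts <- c(1, 2, 4)
n_trials <- 32

timings <- lapply(core_counts, function(nc) {
  registerDoMC(nc)                               # use nc workers for this batch of trials
  replicate(n_trials, system.time(
    foreach(f = files) %dopar% {
      # Code to do the conversion
    }
  )["elapsed"])                                  # keep the wall-clock time of each trial
})
names(timings) <- paste0("cores_", core_counts)
sapply(timings, mean)                            # mean elapsed time per core count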
To elaborate:
Student's t-test:
The Student's t-test essentially says: "You calculated an average time for this core count, but that's not the true average. We could only get the true average if we had an infinite number of data points. The true average actually lies in some interval around your computed average."
The t-test for significance then basically compares the intervals around the two computed averages and says whether they are significantly different or not. So you may have one average time be less than another, but because the standard deviation is sufficiently high, we can't say for certain that it's actually less; the true averages may be identical.
So, to compute this test for significance:
Run your timing experiments
For each core count:
Compute your mean and standard deviation. The standard deviation should be the population standard deviation, which is the square root of population variance
Population variance is (1/N) * summation_for_all_data_points((datapoint_i - mean)^2)
Now you will have a mean and standard deviation for each core count: (m_1, s_1), (m_2, s_2), etc.
For every pair of core counts:
Compute a t-value: t = (mean_1 - mean_2) / (s_1 / sqrt(#dataPoints))
The example t-value I showed tests whether the mean timing results for a core count of 1 are significantly different from the timing results for a core count of 2. You could test the other way around by saying:
t = (m_2 - m_1)/(s_2/ sqrt(#dataPoints))
After you computed these t-values, you can tell whether they're significant by looking at the critical value table. Now, before you click that, you need to know about 2 more things:
Degrees of Freedom
This is related to the number of datapoints you have. The more datapoints you have, the smaller the interval around mean probably is. Degrees of freedom kind of measures your computed mean's ability to move about, and it is #dataPoints - 1 (v in the link I provided).
Alpha
Alpha is a probability threshold. In the Gaussian (Normal, bell-curved) distribution, alpha cuts the bell curve on both the left and the right. Any probability in the middle of the cutoffs falls inside the threshold and is an insignificant result. A lower alpha makes it harder to get a significant result. That is, alpha = 0.01 means only the top 1% of probabilities are significant, and alpha = 0.05 means the top 5%. Most people use alpha = 0.05.
In the table I link to, 1-alpha determines the column you will go down looking for a critical value. (so alpha = 0.05 gives 0.95, or a 95% confidence interval), and v is your degrees of freedom, or row to look at.
If your computed t (absolute value) is greater than the critical value, then your result is significant. If your computed t (absolute value) is less than the critical value, then it is NOT significant.
Edit: The Student's t-test assumes that variances and standard deviations are the same between the two means being compared. That is, it assumes the distribution of data points around the true mean is equal. If you DON'T want to make this assumption, then you're looking for Welch's t-test, which is slightly different. The wiki page has a good formula for computing t-values for this test.
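In R you do not have to do any of this by hand; t.test covers both variants. A hedged sketch with made-up timing vectors (in practice these would be your 32 elapsed times per core count):

set.seed(1)
times_1core <- rnorm(32, mean = 120, sd = 6)        # placeholder timings, in seconds
times_2core <- rnorm(32, mean = 70, sd = 5)

t.test(times_1core, times_2core, var.equal = TRUE)  # Student's t-test (assumes equal variances)
t.test(times_1core, times_2core)                    # Welch's t-test (the default in R)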
There is one situation you want to avoid:
spreading a task over all N cores
having each core work the task using something like OpenBLAS or MKL with all cores
because now you have an N by N contention: each of the N tasks wants to farm its linear algebra work out to all N cores.
Another (trivial) counter example is provided in a multiuser environment where not all M users on a machine can (simultaneously) farm out to N cores.
Another reason not to use all the available cores is if your tasks use a lot of memory and you don't have enough memory to support that number of workers. Note that it can be tricky to determine how many workers can be supported by a given amount of memory, because doMC uses mclapply, which forks the workers, so memory can be shared between the workers unless it is modified by one of them.
From the answers to this question, it's pretty clear that it's not always easy to figure out the right number of workers to use. One could argue that there shouldn't be a default value, and the user should be forced to specify the number, but I'm not sure if I'd go that far. At any rate, there isn't anything very magical about using half the number of cores.
Hm. I'm not a parallel processing expert, but I always thought the downside of using all your cores was that it made your machine sluggish when you tried to do anything else. I've had this happen to me personally when I've used all the cores, so my practice now is to use 7 of my 8 cores when I'm doing something parallel, leaving me one core to do other things.
I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs 0-35000 and follow a power law distribution (at least definitely not the normal distribution)
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches are:
Standardizing the variables (subtract mean and divide by stddev). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
Re-scaling variables to the range [0,1] by subtracting min(variable) and dividing by max(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular the means will be different.
Equalize the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
Any other ideas?
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10-point scale based on whether it falls in the 0-10% percentile, the 10-20% percentile, ..., or the 90-100% percentile. These transformed variates have, by construction, a uniform distribution on 1,2,...,10, and you can combine them however you wish.
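A short sketch of that decile idea (the data here are just a heavy-tailed toy vector standing in for one of your metrics):

set.seed(1)
x <- c(rexp(500, rate = 0.1), rexp(20, rate = 0.001))    # toy heavy-tailed metric

# Score each value 1-10 by the decile of the empirical distribution it falls in.
decile_score <- ceiling(10 * rank(x, ties.method = "average") / length(x))

# Equivalently, via the empirical CDF:
decile_score2 <- ceiling(10 * ecdf(x)(x))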
You could translate each to a percentage and then apply each to a known quantity, then use the sum of the new values.
((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?
Very interesting question. Could something like this work:
Let's assume that we want to scale both variables to a range of [-1,1].
Take the example of betweenness_centrality, which has a range of 0-35000.
Choose a large number on the order of the range of the variable. As an example, let's choose 25,000.
Create 25,000 bins in the original range [0-35000] and 25,000 bins in the new range [-1,1].
For each number x_i, find the bin it falls into in the original range. Let this be B_i.
Find the corresponding bin B_i in the new range [-1,1].
Use either the max or min of that bin in [-1,1] as the scaled version of x_i.
This preserves the power-law distribution while also scaling it down to [-1,1], and does not have the problem experienced by (x-mean)/sd.
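A rough sketch of that binning scheme (the ranges and bin count are the ones assumed in the example above, and x is a placeholder vector):

x <- c(0, 120, 5000, 34999)                            # some betweenness_centrality values
n_bins <- 25000
breaks_old <- seq(0, 35000, length.out = n_bins + 1)   # bin edges over the original range
breaks_new <- seq(-1, 1, length.out = n_bins + 1)      # matching bin edges over [-1, 1]

bin <- findInterval(x, breaks_old, rightmost.closed = TRUE, all.inside = TRUE)
scaled <- breaks_new[bin]                              # lower edge of the matching bin in [-1, 1]
scaled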
Normalizing to [0,1] would be my short-answer recommendation for combining the 2 values, as it will maintain the distribution shape as you mentioned and should solve the problem of combining the values.
If the distribution of the 2 variables is different, which sounds likely, this won't really give you what I think you're after, which is a combined measure of where each variable sits within its own distribution. You would have to come up with a metric that determines where in the given distribution the value lies. This could be done many ways, one of which would be to determine how many standard deviations away from the mean the given value is; you could then combine these 2 values in some way to get your index (addition may no longer be sufficient).
You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.