The requirement is to divide the range between a min and a max number into a given number of intervals.
For example: Min = 0, Max = 100, Intervals count = 6, output = [0, 20, 40, 60, 80, 100]
The algorithm should work for any combination of numbers, such as negative to positive, negative to negative, etc. My solution is failing for one of these combinations. Any hints?
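As a hint, here is a minimal sketch of one way to do it. It assumes that "Intervals count" in the example means the number of boundary points (6 points enclose 5 intervals); the name `interval_points` is made up for illustration. Computing the step from `hi - lo` means the sign of either bound never needs special handling.

```python
def interval_points(lo, hi, count):
    """Return `count` evenly spaced points from lo to hi, both included."""
    if count < 2:
        raise ValueError("count must be at least 2")
    # The step is derived from the width, so negative bounds work unchanged.
    step = (hi - lo) / (count - 1)
    points = [lo + i * step for i in range(count)]
    points[-1] = hi  # pin the last point so float rounding cannot miss the max
    return points


print(interval_points(0, 100, 6))    # positive to positive
print(interval_points(-10, 10, 5))   # negative to positive
print(interval_points(-30, -10, 3))  # negative to negative
```

If the count is meant to be the number of intervals rather than points, divide by `count` instead and generate `count + 1` points.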
I would like to generate random numbers with both a specified mean & sd, AND a specified min and/or max (i.e., give me 100 random numbers between 0 and 50 with a mean=20 and sd=10; OR give me 100 random numbers with min=10, mean=25, sd=15).
I can specify mean & sd in rnorm, and a min and max in runif.
I can specify a range in sample (though I don't think I can specify ONLY a min or a max)
I need something where I can specify both, in R.
Thanks!
I am new to statistics, so I apologize if this question is trivial.
I have a variable that is normally distributed with a range between -15 and +15 like the following one:
df <- data.frame("weight" = runif(1000, min=-15, max=15), stringsAsFactors = FALSE)
The median and mean value of this variable is 0.
I need to transform this variable to use it as a weight in my regression. For substantive reasons, it does not make any sense to have negative values in my variable (it is itself the result of previous transformations).
Negative values of my variable should simply reduce the effect of my main explanatory variable (hence should be bounded between 0 and 1), while positive values should have a multiplicative effect on my explanatory variable (greater than 1). Values of my weight close to 0 should have no effect on my explanatory variable (close to 1).
Hence I would like to centre my variable so that the minimum value of my weight is 0 and the median value becomes 1. I do not want to put constraints on the maximum value, though this will necessarily change the mean (it will become greater than 1). I am not concerned about this provided that the median remains 1.
So far I have considered standardizing the variable between 0 and 2:
library(BBmisc)
df$normalizedweight <- normalize(df$weight, method = "range",
range = c(0, 2))
however, this operation puts an unnecessary constraint to my normalized variable as the effect of my weight can be greater than a factor of two, while
To clarify, in the real data, negative values of the weight perfectly mirror positive values of the weight. Ideally, once I have standardized the data, multiplying the same number by the maximum and by the minimum value of the weight should increase and decrease it by the same proportion.
For example, take a response value of 5. If the maximum value of my weight is 10, the minimum value should be 0.1, so that 5*10 and 5*0.1 are a proportional increase and decrease of my original value by a factor of 10.
I thank you in advance for all the help you are able to provide
Best
One option is to use the exponential transformation. All your negative values will be between 0 and 1, and all your positive values will be over 1. And your median will be close to 1.
Moreover, as exp() will create very large values (exp(15) = 3 269 017), you can first divide your values by their maximum.
x <- runif(10000, min=-15, max=15)  # named x to avoid masking base::sample
sample_transform <- exp(x / max(x))
median(sample_transform)
# [1] 0.9930663
hist(sample_transform)
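For readers who want to check the mirroring property, here is a small Python sketch of the same idea. One deliberate difference from the R snippet, and an assumption on my part: it divides by the largest absolute value rather than the maximum, so that -15 and +15 map to exact reciprocals. The name `exp_weight` is made up for illustration.

```python
import math
import random
import statistics


def exp_weight(values):
    """Map raw weights to positive multipliers with exp(v / scale)."""
    # Assumption: scale by the largest absolute value so that v and -v
    # always map to reciprocal multipliers.
    scale = max(abs(v) for v in values)
    if scale == 0:
        return [1.0 for _ in values]  # all weights are zero: no effect
    return [math.exp(v / scale) for v in values]


rng = random.Random(0)
raw = [rng.uniform(-15, 15) for _ in range(10_000)]
weights = exp_weight(raw)

print(statistics.median(weights))  # close to 1 when the raw median is near 0
print(min(weights), max(weights))  # bounded by 1/e and e with this scaling
```

With this scaling the extremes are e and 1/e. To get a different factor at the extremes (say 10 and 0.1), one could use a different base, i.e. multiply the exponent by log(10).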
In some code I want to choose n random numbers in [0,1) which sum to 1.
I do so by choosing the numbers independently in [0,1) and normalizing them by dividing each one by the total sum:
from random import random
numbers = [random() for _ in range(n)]
total = sum(numbers)  # compute the sum once, before normalizing
numbers = [x / total for x in numbers]
My "problem" is that the distribution I get out is quite skewed. Out of a million numbers chosen this way, not a single one is over 1/2. With some effort I've calculated the pdf, and it's not nice.
Here is the weird looking pdf I get for 5 variables:
Do you have an idea for a nice algorithm to choose the numbers, that result in a more uniform or simple distribution?
You are looking to partition the distance from 0 to 1.
Choose n - 1 numbers uniformly from 0 to 1, sort them, and take the distances between consecutive values, using 0 and 1 as the outer endpoints. That gives n distances.
This will partition the space 0 to 1, which should yield the occasional large result which you aren't getting.
Even so, for large values of n, you can generally expect your max value to decrease as well, just not as quickly as your method.
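The partition approach described above can be sketched as follows. This is only a minimal illustration; the function name `random_partition` and the optional `rng` parameter are my own.

```python
import random


def random_partition(n, rng=random):
    """Split [0, 1] into n random pieces using n - 1 sorted cut points."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]  # add the outer endpoints
    # Each piece is the gap between two neighbouring edges.
    return [b - a for a, b in zip(edges, edges[1:])]


parts = random_partition(5)
print(parts)
print(sum(parts))  # 1 up to floating-point rounding
```

Because the pieces are gaps between sorted uniform points, every piece has the same distribution, and a piece larger than 1/2 can occur.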
You might be interested in the Dirichlet distribution, which is used to generate quantities that sum to 1, if you're looking for probabilities. There's also a section on how to generate them using gamma distributions here.
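A rough sketch of the gamma-based construction mentioned above: draw one gamma variate per component and divide by their sum. The function name `dirichlet_sample` is made up; `random.gammavariate` is the standard-library gamma sampler.

```python
import random


def dirichlet_sample(alphas, rng=random):
    """Draw one Dirichlet(alphas) vector by normalizing gamma variates."""
    # One Gamma(alpha_i, 1) draw per component.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


sample_vec = dirichlet_sample([1.0] * 5)
print(sample_vec)
print(sum(sample_vec))  # 1 up to floating-point rounding
```

Setting every alpha to 1 gives the uniform distribution over all vectors that sum to 1; larger alphas concentrate the values around 1/n.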
Another way to get n random numbers which sum up to 1:
import random
def create_norm_arr(n, remaining=1.0):
    random_numbers = []
    for _ in range(n - 1):
        r = random.random()  # get a random number in [0, 1)
        r = r * remaining
        remaining -= r
        random_numbers.append(r)
    random_numbers.append(remaining)
    return random_numbers
random_numbers = create_norm_arr(5)
print(random_numbers)
print(sum(random_numbers))
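This makes large values more likely. Note that the positions are not interchangeable: earlier entries tend to be larger (the first averages 1/2, the second 1/4, and so on), so shuffle the result if the order should not matter.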
I'm using the General Inquirer dictionary with the SentimentAnalysis package and I can't figure out how they assign the sentiment score...
For example, if I run the following code:
sentiment <- analyzeSentiment(sampledf)
summary(sentiment$SentimentGI)
I'll get an output like this:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.80000 -0.16667 -0.07692 -0.07313 0.00000 0.66667
What's the scale being used here? -1 to 1? I don't know how to interpret these results.
Thanks!
All sentiment-related scores are calculated based on the formula
(#positive - #negative) / #all
where #positive refers to the number of positive words, #negative to the number of negative words and #all to the total word count. Hence, the sentiment score comes from the interval [-1, +1]. A value of 0 indicates that there are as many positive as negative words in a document.
NB: In practice, the empirical mean/median value is not necessarily located at exactly zero, because positive or negative wording may be perceived as stronger or may simply appear more frequently. Hence, one may prefer to choose a different cutoff point to discriminate positive from negative.
Other scores are as follows:
Negativity or positivity only count the ratio of negative or positive words, respectively. Hence, this value is given by e.g. #negative / #all and is in [0, 1].
Polarity uses the formula (#positive - #negative) / (#positive + #negative).
Ratio is the share of dictionary expressions, i.e. (#positive + #negative) / #all.
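To make the formulas above concrete, here is a small Python sketch that computes them from raw word counts. It is only an illustration of the arithmetic, not the package's code; the function name `sentiment_scores`, the dictionary keys, and the choice to return NaN for empty denominators are all my own assumptions.

```python
def sentiment_scores(n_positive, n_negative, n_all):
    """Compute the scores described above from raw word counts."""
    nan = float("nan")
    matched = n_positive + n_negative  # words found in the dictionary
    return {
        # Assumption: return NaN when a denominator would be zero.
        "sentiment": (n_positive - n_negative) / n_all if n_all else nan,
        "negativity": n_negative / n_all if n_all else nan,
        "positivity": n_positive / n_all if n_all else nan,
        "polarity": (n_positive - n_negative) / matched if matched else nan,
        "ratio": matched / n_all if n_all else nan,
    }


# A 10-word document with 3 positive and 1 negative word.
print(sentiment_scores(3, 1, 10))
```

For that document the sentiment is (3 - 1) / 10 = 0.2, while the polarity is (3 - 1) / (3 + 1) = 0.5, which shows why the two scores should not be compared directly.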
According to ?runif, this function will not generate either the min or the max bound. How can I do something like runif but including min and max?
This is just for pure theory. I was wondering - what if I actually needed to randomly generate some values from uniform distribution, including the lower bound.
You can write your own uniform distribution function that includes the endpoints using the sample function:
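myrunif <- function(n, min=0, max=1) {
  # draw integers from 1..integer.max with replacement, then map them onto [min, max]
  min + (sample(.Machine$integer.max, n, replace = TRUE) - 1) / (.Machine$integer.max - 1) *
    (max - min)
}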
With this function, each endpoint has a small probability, 1/.Machine$integer.max, of being returned on any single draw, because sample picks among .Machine$integer.max equally likely integers.
However, it's worthwhile remembering that mathematically the probability of drawing either a or b (or any particular value) from a U(a, b) random variable is 0, so the current behavior of runif makes a lot of sense.
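The integer-grid idea above translates to other languages too. Here is a hedged Python sketch: the function name `inclusive_uniform` and the `steps` parameter are made up, and the default of 2**31 - 2 steps is chosen to match the number of grid points in the R version.

```python
import random


def inclusive_uniform(lo, hi, steps=2**31 - 2, rng=random):
    """Draw from an evenly spaced grid on [lo, hi], endpoints included."""
    k = rng.randint(0, steps)  # randint includes both 0 and steps
    if k == steps:
        return hi  # return the bound itself to avoid float rounding
    return lo + (k / steps) * (hi - lo)


# With a coarse grid the endpoints show up quickly.
rng = random.Random(42)
print(sorted({inclusive_uniform(0, 1, steps=4, rng=rng) for _ in range(200)}))
```

Each grid point, including both endpoints, is returned with probability 1 / (steps + 1).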
In pure theory the probability of any single value being generated from a continuous distribution will be 0, so the probability of min or max is 0.
From a practical standpoint, if you really want to generate a uniform (which will be rounded to a finite set of values, each therefore having probability greater than 0 of being seen) with the possibility of seeing the desired min and max values, then just generate a uniform between min-epsilon and max+epsilon. Now min and max are in the range and have a chance of being chosen just like the other values. You just need to choose a value of epsilon such that values between min-epsilon and min will round to min, and similarly for the max.
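A minimal sketch of the epsilon approach, assuming the values are rounded to a fixed number of decimal places and that min and max are themselves representable at that precision. Epsilon is then half of one rounding step. The function name `rounded_uniform` is made up for illustration.

```python
import random


def rounded_uniform(lo, hi, decimals=2, rng=random):
    """Uniform draw rounded to `decimals` places, with lo and hi reachable."""
    eps = 0.5 * 10 ** (-decimals)  # half of one rounding step
    u = rng.uniform(lo - eps, hi + eps)
    # Clamp to guard against floating-point edge cases at the outer limits.
    return min(max(round(u, decimals), lo), hi)


rng = random.Random(1)
print(sorted({rounded_uniform(0, 3, decimals=0, rng=rng) for _ in range(500)}))
```

Widening the range by half a step on each side gives min and max the same share of the interval as every other rounded value, so they are as likely as the interior values.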