Scale down a number relative to a max number - math

So I've got a little thing I'd like to do, but I'm not sure how the maths would work; it is not my strongest side. I'd like to scale down a number (n) to be below a max number (m). For example:
n = 120, m = 80. Requirement: 120 scaled down so that it is relative to the maximum of 80, with the minimum number being 0.
Is this something that can be done mathematically, or do I need to change my model? I'm looking for a mathematical way to do it so I can translate it into code.
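One common reading (an assumption on my part, since the post does not say what the largest possible n is): if you know the maximum value the unscaled number can take, the rescaling is a simple proportion. A minimal R sketch, with a hypothetical n_max and function name of my own choosing:
scale_to_max <- function(n, n_max, m) n / n_max * m  # map [0, n_max] onto [0, m]
scale_to_max(120, 150, 80)  # with an assumed n_max of 150, 120 maps to 64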

Related

How can I create a normal distributed set of data in R?

I'm a newbie in statistics and I'm studying R.
I decided to do this exercise to practice some analysis with an original dataset.
This is the issue: I want to create a dataset of, let's say, 100 subjects, and for each one of them I have a test score.
This test score has a range that goes from 0 to 70 and the mean score is 48 (and it's improbable that someone scores 0).
Firstly I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out, using plot(x), that the values were not normally distributed.
So I searched for another R command and found this, but I couldn't set the min/max:
ex1 <- round(rnorm(100, mean=48 , sd=5))
I really can't understand what I have to do!
I would like to write a function that gives me a set of data normally distributed, in a range of 0-70, with a mean of 48 and a not-too-big standard deviation, in order to do some t-tests later...
Any help?
Thanks a lot in advance guys
The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution, chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:
ex1 <- round(rnorm(100, mean=48 , sd=5))
ex1 <- pmin(ex1, 70)
ex1 <- pmax(ex1, 0)
You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:
pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22
This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds that are even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method might be to draw more samples than you need, and keep the first n samples that fall within your bounds. If you really care to do things right, there are packages that implement truncated normal distributions.
(Because the normal distribution is symmetric, and 100 is farther from your mean than 0 is, the probability of observations > 100 is even smaller.)
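As a hedged illustration of the resampling idea mentioned above (draw more samples than you need and keep only in-range values), with the question's mean of 48 and SD of 5:
draws <- round(rnorm(1000, mean = 48, sd = 5))     # oversample by a generous factor
ex1 <- head(draws[draws >= 0 & draws <= 70], 100)  # keep the first 100 in-range draws
length(ex1)                                        # 100, barring an extremely unlucky run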

How do I write a mathematical formula that non-linearly scales a continuous variable to a 0-100 span, where f(x) → 100 as x → Inf

I am building an index of user qualities, defined as a sum of (often) correlated continuous variables representing user activity. The index is well-calibrated and serves the purpose of my analysis, but it is tricky to communicate to my co-workers, particularly since outlier activities cause extremely tenacious users to score very highly on the activity index.
For 97% of users, the index is distributed near-normally between 0 and 100, with a right tail of 3% of hyper-active users with an index > 100. Index-values beyond 200 should be extremely rare but are theoretically possible.
I'm looking to scale the tail back into a 0-100 span, but not linearly, since I would like the 3% tail to be represented as small variances within the top range of the 0-100 index. What I'm looking for is a non-linear formula to scale my index, like this:
so that the lower tier of the unscaled index remains close to the scaled one, high index values diverge, and the scaled values never reach 100 as my index goes towards infinity; i.e. f(0) = 0, but when x = 140, f(x) ≈ 99 or something similar.
I'll implement the scaling in R, Python and BigQuery.
There are lots of ways to do this: take any function with the right shape and tweak it to your needs.
One family of functions with the right shape is
f(x) = x/pow(1 + pow(x/100, n), 1/n)
You can vary the parameter n to adjust the shape: increasing n pushes f(100) closer to 100. With n=5 you get something that looks pretty close to your drawing:
f(x) = x/pow(1 + pow(x/100, 5), 0.2)
Another option is taking the hyperbolic tangent function tanh, which you can of course tweak in similar ways:
f(x) = 100*pow(tanh(pow(x/100, n)), 1/n)
Here's the curve with n = 2:
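For reference, a minimal R sketch of the first family of curves (the function name scale_index is mine, not from the answer):
scale_index <- function(x, n = 5) x / (1 + (x / 100)^n)^(1 / n)  # tends toward 100 as x grows
scale_index(c(0, 50, 100, 140, 200))  # 0 stays 0, low values barely move, the tail is compressed below 100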

Generating groups of skewed size but whose elements add to a fixed sum

I have some fixed number of people (e.g. 1000). I would like to split these 1000 people into some random number of classes Y (e.g. 5), but not equally. I want them to be distributed unevenly, according to some probability distribution that is heavily skewed (something like a power-law distribution).
My intuition is that I need to generate a distribution of probabilities that is (1) skewed and (2) which also adds up to 1.
My ad hoc solution was to generate random numbers from a power law distribution, multiply these by some scalar that ensures these add up to something close to my target number, adjust my target number to that new number, and then split accordingly.
But it seems awfully inelegant, and y_sizes doesn't always sum to 1000, which requires looping through and trying again. What's a better approach?
require(poweRlaw)
library(magrittr)  # for the %>% pipe used below
x <- 1000  # total number of people
y <- 10    # number of groups to draw
y_sizes <- rpldis(y, xmin = 5, alpha = 2, discrete_max = x)  # skewed group sizes from a power law
y_sizes <- round(y_sizes * x / sum(y_sizes))  # rescale so the sizes sum to roughly x
newx <- sum(y_sizes)  # newx is only approximately equal to x, rather than exactly equal
people <- 1:x
groups <- cut(
  people,
  c(0, cumsum(y_sizes))
) %>% as.numeric
data.frame(
  people = people,
  group = groups
)
The algorithm presented by Smith and Tromble in "Sampling Uniformly from the Unit Simplex" shows a solution. I have pseudocode on this algorithm in my section "Random Integers with a Given Positive Sum".
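A minimal R sketch of that algorithm under the standard stars-and-bars formulation (the function and argument names are mine, and this is my reading of the reference, not the referenced pseudocode itself): choose k - 1 distinct cut points among total + k - 1 slots and take the gaps, which are uniform over all splits and sum exactly to the target.
random_integers_with_sum <- function(k, total) {
  bars <- sort(sample(total + k - 1, k - 1))  # k - 1 distinct cut positions
  diff(c(0, bars, total + k)) - 1             # k gap sizes, summing exactly to total
}
set.seed(1)
sizes <- random_integers_with_sum(10, 1000)
sum(sizes)  # exactly 1000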

Simple 2- or 3-parameter float PRNG formula that changes faster than the float resolution and produces white noise?

I'm looking for a 2- or 3-parameter math formula with the following characteristics:
Simple (the fewer operations the better)
Random output (non-periodic)
Normalized (meaning the output will never be outside a given range; the range itself doesn't matter, since once I know it I can divide and add/subtract to get it into the 0 to 1 range I'm looking for)
White noise (the more samples you get the more evenly distributed the outputs get across the range of possible output values, with no gaps or hotspots, to the extent permitted by the floating-point standard)
Random all the way down (no gradual changes between output values even if the inputs are changed by the smallest amount the float standard will allow. I understand that given the nature of randomness, it is possible two output values might be close together once in a while, but that must only happen by coincidence, and not because of smoothness or periodicity)
Uses only the operations listed below (but of course, any operations that can be done by a combination of the ones listed below are also allowed)
I need this because I need a good source of controllable randomness for some experiments I'm doing with Cycles material nodes in Blender. And since that is where the formula will be implemented, the only operations I have available are:
Addition
Subtraction
Multiplication
Division
Power (X to the power of Y)
Logarithm (I think it's X Log Y; I'm not very familiar with the logarithm operation, so I'm not 100% sure if that is enough to specify which type of logarithm it is; let me know if you need more information about it)
Sine
Cosine
Tangent
Arcsine
Arccosine
Arctangent (not Atan2, but that can be created by combining operations if necessary)
Minimum (Returns the lowest of 2 numbers)
Maximum (Returns the highest of 2 numbers)
Round (Returns the nearest integer to the input)
Less-than (Returns 1 if X is less than Y, zero otherwise)
Greater-than (Returns 1 if X is more than Y, zero otherwise)
Modulo (Produces a sawtooth pattern of period Y; for positive X values it's in the 0 to Y range, and for negative values of X it's in the -Y to zero range)
Absolute (strips the sign of the input value, makes it positive if it was negative, doesn't do anything if it's already positive)
There is no iteration or looping functionality available (and of course, branching can only be done by calculating all the branches and then doing something like multiplying the results of the branches not meant to be taken by zero and adding all the results together).
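As a hedged side note (not from the original post), that branchless selection can be prototyped outside Blender; an R sketch, treating the Less-than node as a function returning 1 or 0:
lt <- function(x, y) as.numeric(x < y)  # stand-in for the Less-than node
branchless_pick <- function(x, y, a, b) lt(x, y) * a + (1 - lt(x, y)) * b  # a if x < y, else b
branchless_pick(2, 5, 10, 20)  # returns 10, since 2 < 5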

Finding a reasonable (noise-free) maximum element in a vector

Consider a vector V riddled with noisy elements. What would be the fastest (or any) way to find a reasonable maximum element?
For e.g.,
V = [1 2 3 4 100 1000]
rmax = 4;
I was thinking of sorting the elements and finding the second differential (i.e. diff(diff(unique(V)))).
EDIT: Sorry about the delay.
I can't post any representative data since it contains 6.15e5 elements. But here's a plot of the sorted elements.
By just looking at the plot, a piecewise linear function may work.
Anyway, regarding my previous conjecture about using differentials, here's a plot of diff(sort(V)):
I hope it's clearer now.
EDIT: Just to be clear, the desired "maximum" value would be the value right before the step in the plot of the sorted elements.
NEW ANSWER:
Based on your plot of the sorted amplitudes, your diff(sort(V)) algorithm would probably work well. You would simply have to pick a threshold for what constitutes "too large" a difference between the sorted values. The first point in your diff(sort(V)) vector that exceeds that threshold is then used to get the threshold to use for V. For example:
diffThreshold = 2e5;  % Threshold for what counts as "too large" a jump between sorted values
sortedVector = sort(V);  % Sort the data
index = find(diff(sortedVector) > diffThreshold,1,'first');  % First jump exceeding the threshold
signalThreshold = sortedVector(index);  % Value just before the step; use it to threshold V
Another alternative, if you're interested in toying with it, is to bin your data using HISTC. You would end up with groups of highly-populated bins at both low and high amplitudes, with sparsely-populated bins in between. It would then be a matter of deciding which bins you count as part of the low-amplitude group (such as the first group of bins that contain at least X counts). For example:
binEdges = min(V):1e7:max(V); % Create vector of bin edges
n = histc(V,binEdges); % Bin amplitude data
binThreshold = 100; % Pick threshold for number of elements in bin
index = find(n < binThreshold,1,'first'); % Find first bin whose count is low
signalThreshold = binEdges(index);
OLD ANSWER (for posterity):
Finding a "reasonable maximum element" is wholly dependent upon your definition of reasonable. There are many ways you could define a point as an outlier, such as simply picking a set of thresholds and ignoring everything outside of what you define as "reasonable". Assuming your data has a normal-ish distribution, you could probably use a simple data-driven thresholding approach for removing outliers from a vector V using the functions MEAN and STD:
nDevs = 2; % The number of standard deviations to use as a threshold
index = abs(V-mean(V)) <= nDevs*std(V); % Index of "reasonable" values
maxValue = max(V(index)); % Maximum of "reasonable" values
I would not sort and then difference. If you have some reason to expect continuity or bounded change (e.g. the vector holds consecutive sensor readings), then sorting will destroy the time information (or whatever the vector index represents). Filtering by detecting large spikes isn't a bad idea, but you would want to compare the spike to a larger neighborhood (the 2nd difference effectively has you looking within a window of ±2).
You need to describe formally the expected information in the vector, and the type of noise.
You need to know the frequency and distribution of errors and non-errors. In the simplest model, the elements in your vector are independent and identically distributed, and errors are all or none (you randomly choose to store the true value, or an error). You should be able to figure out for each element the chance that it's accurate, vs. the chance that it's noise. This could be very easy (error data values are always in a certain range which doesn't overlap with non-error values), or very hard.
To simplify: don't make any assumptions about what kind of data an error produces (the worst case is: you can't rule out any of the error data points as ridiculous, but they're all at or above the maximum among non-error measurements). Then, if the probability of error is p and your vector has n elements, the chance that the kth highest element in the vector is less than or equal to the true maximum is given by the cumulative binomial distribution - http://en.wikipedia.org/wiki/Binomial_distribution
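A hedged R illustration of that binomial argument (the p and k values below are made up; only the vector length comes from the question's edit): if at most k - 1 of the n elements are errors, the kth highest element is at or below the true maximum.
n <- 6.15e5   # number of elements, as stated in the question's edit
p <- 0.01     # hypothetical per-element error probability
k <- 10000    # take the 10000th-highest element as the "reasonable maximum"
pbinom(k - 1, size = n, prob = p)  # chance that element is <= the true maximum (essentially 1 here)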
First, pick your favorite method for identifying outliers...
If you expect the numbers to come from a normal distribution, you can use, say, 2×SD (standard deviations) above the mean to determine your max.
Do you have access to bounds on your noise-free elements? For example, do you know that your noise-free elements are between -10 and 10?
In that case, you could remove the noise and then find the max:
max( V( find(V <= 10 & V >= -10) ) )
