Determination of threshold values to group a variable into ranges - R

I have, let's say, 60 empirical realizations of PPR. My goal is to create a PPR vector containing average values of the empirical PPR. These averages depend on which upper and lower TTM limits I take: I can take TTM from 60 to 1, calculate one average, and put that single number in rows 1 to 60 of the PPR vector, or I can calculate one average of PPR for TTM between 60 and 31 and another for TTM between 30 and 1, and put those two numbers into the vector according to the TTM values. Finally, I want to obtain something like this on a chart (the x-axis is TTM, the green line is my empirical PPR, and the black line is the average based on significant changes over TTM). I want to write an algorithm that will help me find the TTM thresholds that best fit the black line to the green line.
TTM PPR
60 0,20%
59 0,16%
58 0,33%
57 0,58%
56 0,41%
...
10 1,15%
9 0,96%
8 0,88%
7 0,32%
6 0,16%
Can you suggest a statistical method that might be applicable in this case, or a basic idea for an algorithm that I could implement in VBA/R?
I have used Solver with GRG Nonlinear** to deal with it, but I believe there is something more suitable.
** With Solver I had the problem that it found an optimal solution - OK, but when I re-ran Solver it found a new solution (with slightly different TTM values) and the value of the target function was lower than the first time (so, was the first solution really optimal?)

I think this is what you want. The next step would be to include a method that can recognize the break points. You will need to define two new parameters: one for the sensitivity, and one for the minimum number of points a sample must contain to be accepted as a section (between two break points, including the start and end points).
You can download the Excel file from here:
http://www.filedropper.com/statisticspatternchange
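As a rough sketch of the break-point search idea (not necessarily what the linked workbook does), here is one way to find a single TTM threshold in R by exhaustive search: split the series at every candidate point, replace each side by its mean, and keep the split with the smallest squared error against the empirical PPR. The ppr values below are made-up placeholders for your 60 observations.

ppr <- c(0.20, 0.16, 0.33, 0.58, 0.41, 0.70, 0.90, 1.15, 0.96, 0.88) / 100  # placeholder data
ttm <- seq(from = 60, by = -1, length.out = length(ppr))                    # matching TTM values

best_split <- function(y) {
  sse  <- function(v) sum((v - mean(v))^2)        # squared error of one flat segment
  cand <- 3:(length(y) - 1)                       # keep at least 2 points per side
  errs <- sapply(cand, function(k) sse(y[1:(k - 1)]) + sse(y[k:length(y)]))
  list(split_index = cand[which.min(errs)], error = min(errs))
}

fit <- best_split(ppr)
ttm[fit$split_index]   # the TTM value at which the fitted average changes

Applying best_split recursively to each side gives a simple greedy multi-threshold fit; comparing the error of k splits against k + 1 splits (with a penalty for each extra split) is one way to decide how many thresholds are worth keeping. Dedicated change-point packages for R (for example, changepoint) implement more principled versions of the same idea.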

Related

correlation; lower values better than higher values R

I am trying to calculate the correlation between a vector of investment returns and a matching vector that has a number from 1 to 5 rating the quality of the company. It looks something like this (let's call this data returnrank):
company returns rank
at&t 0.09034 2
verizon 0.23341 1
sprint 0.03021 3
How can I make it so that when I calculate cor(returnrank$returns, returnrank$rank) it treats lower values in the rank column as better and higher values as worse?
(i.e. if a stock has high returns and a rank R would consider a low score (1), I want to see a high positive correlation, because I am treating 1 as better than 5).
You probably just want:
cor(returnrank$returns, max(returnrank$rank) - returnrank$rank)
It may be better to just graph the data, since it's unlikely to be a linear relationship given the nature of rank.
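As a runnable sketch using the three rows shown in the question: note that flipping the rank this way is equivalent to simply negating the ordinary correlation, and since rank is ordinal, a rank-based measure such as Spearman's may be more appropriate.

# Example data copied from the question.
returnrank <- data.frame(
  company = c("at&t", "verizon", "sprint"),
  returns = c(0.09034, 0.23341, 0.03021),
  rank    = c(2, 1, 3)
)

# Flip the rank so that "1 = best" becomes the largest value, then correlate.
cor(returnrank$returns, max(returnrank$rank) - returnrank$rank)

# Equivalent: negate the ordinary correlation.
-cor(returnrank$returns, returnrank$rank)

# Rank-based alternative that only depends on the ordering.
cor(returnrank$returns, max(returnrank$rank) - returnrank$rank, method = "spearman")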

How to calculate the expected cost?

I am not good at probability, and I know this is not directly a coding problem, but I hope you can help me with it. While solving a computation problem I ran into this difficulty:
Problem definition:
The Little Elephant from the Zoo of Lviv is going to the Birthday
Party of the Big Hippo tomorrow. Now he wants to prepare a gift for
the Big Hippo. He has N balloons, numbered from 1 to N. The i-th
balloon has the color Ci and it costs Pi dollars. The gift for the Big
Hippo will be any subset (chosen randomly, possibly empty) of the
balloons such that the number of different colors in that subset is at
least M. Help Little Elephant to find the expected cost of the gift.
Input
The first line of the input contains a single integer T - the number
of test cases. T test cases follow. The first line of each test case
contains a pair of integers N and M. The next N lines contain N pairs
of integers Ci and Pi, one pair per line.
Output
In T lines print T real numbers - the answers for the corresponding test cases. Your answer will be considered correct if it has at most 10^-6 absolute or relative error.
Example
Input:
2
2 2
1 4
2 7
2 1
1 4
2 7
Output:
11.000000000
7.333333333
So, here I don't understand why the expected cost of the gift for the second case is 7.333333333, because the expected cost equals the sum of x*P(x), and according to this formula it should be 33/2, shouldn't it?
Yes, it is a CodeChef question, but I am not asking for the solution or the algorithm (if I just took the algorithm from someone else it would not improve my coding ability). I simply don't understand their example, and because of that I can't start thinking about the algorithm.
Please help. Thanks in advance!
There are three possible choices with at least M = 1 colors (the empty set has zero colors and is excluded): {1}, {2} and {1, 2}, with costs 4, 7 and 11. Each is equally likely, so the expected cost is (4 + 7 + 11) / 3 = 22 / 3 = 7.33333.
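To see where this comes from, here is a small brute-force check of the second test case in R: enumerate every subset, keep those with at least M distinct colors, and average their costs. All values are taken from the example input.

colors <- c(1, 2)   # Ci from the second test case
prices <- c(4, 7)   # Pi from the second test case
M <- 1

n <- length(prices)
costs <- c()
for (mask in 0:(2^n - 1)) {
  members <- which(bitwAnd(mask, 2^(0:(n - 1))) > 0)   # balloons in this subset
  if (length(unique(colors[members])) >= M) {
    costs <- c(costs, sum(prices[members]))
  }
}
mean(costs)   # 7.333333 -- every qualifying subset is equally likely

Running the same check with M = 2 leaves only the subset {1, 2}, which reproduces the 11.0 of the first test case.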

Generate a specific amount of random numbers that add up to a defined value

I would like to unit test the time writing software used at my company. In order to do this I would like to create sets of random numbers that add up to a defined value.
I want to be able to control the parameters:
Min and max value of the generated number
The n of the generated numbers
The sum of the generated numbers
For example, in 250 days a person worked 2000 hours. The 2000 hours have to be randomly distributed over the 250 days. The maximum time spent per day is 9 hours and the minimum is 0.25 hours.
I worked my way through this SO question and found the method
diff(c(0, sort(runif(249)), 2000))
This results in 1 big number and 249 small numbers. That's why I would like to be able to set a min and max for the generated numbers, but I don't know where to start.
You will have no problem meeting any two out of your three constraints, but all three might be a problem. As you note, the standard way to generate N random numbers that add to a sum is to generate N-1 random numbers in the range of 0..sum, sort them, and take the differences. This is basically treating your sum as a number line, choosing N-1 random points, and your numbers are the segments between the points.
But this might not be compatible with constraints on the numbers themselves. For example, what if you want 10 numbers that add to 1000, but each has to be less than 100? That won't work. Even if you have ranges that are mathematically possible, forcing compliance with all the constraints might mean sacrificing uniformity or other desirable properties.
I suspect the only way to do this is to keep the sum constraint, the N constraint, do the standard N-1, sort, and diff thing, but restrict the resolution of the individual randoms to your desired minimum (in other words, instead of 0..100, maybe generate 0..10 times 10).
Or, instead of generating N-1 uniformly random points along the line, generate a random sample of points along the line within a similar low-resolution constraint.
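For what it's worth, here is one crude way (a sketch, not a uniform sampler) to satisfy all three constraints at once for the numbers in the question: start every slot at the minimum, then repeatedly spread whatever total is still missing across the slots that still have room below the maximum.

random_fixed_sum <- function(n, total, min_val, max_val) {
  stopifnot(total >= n * min_val, total <= n * max_val)   # otherwise infeasible
  x <- rep(min_val, n)
  remaining <- total - sum(x)
  while (remaining > 1e-9) {
    room <- max_val - x
    open <- which(room > 1e-9)                  # slots that can still grow
    w <- runif(length(open))                    # random share for each open slot
    add <- pmin(remaining * w / sum(w), room[open])
    x[open] <- x[open] + add
    remaining <- remaining - sum(add)
  }
  x
}

set.seed(1)
hours <- random_fixed_sum(250, 2000, 0.25, 9)   # the example from the question
c(sum = sum(hours), min = min(hours), max = max(hours))

As noted above, forcing all three constraints sacrifices uniformity: with these particular numbers most days end up close to the 9-hour cap, because averaging 8 hours over 250 days leaves little slack below the maximum.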

Cumulative sum of a georeferenced variable in R

I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code below creates the same data:
track_1 <- data.frame(
  LAT   = c(32.34855, 31.54764, 31.38293, 31.21447, 30.76365,
            30.75872, 30.60261, 30.62818, 31.35912, 31.15218),
  LON   = c(-35.49264, -35.58691, -35.25243, -35.25830, -35.38881,
            -35.54733, -35.95472, -36.27024, -35.73573, -36.38027),
  SCORE = c(80.67, 18.14, 46.70, 22.65, 11.93, 22.97, 35.98, 31.09, 14.97, 37.60))
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.
Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2 # one cell for every 0.5 degrees
lonScale <- 2 # likewise
track_1$cell <- factor(with(track_1,
    paste(floor(LAT * latScale), floor(LON * lonScale), sep = '.')))

library(plyr)
ddply(track_1, .(cell), summarize,
      LAT = mean(LAT), LON = mean(LON), SCORE = sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.
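If you would rather avoid plyr, here is a base-R sketch of the same idea (cellSize is a hypothetical name for the grid edge length in degrees; tapply does the per-cell summaries):

cellSize <- 0.5   # degrees per grid cell edge; change this to resize the cells
cells <- with(track_1, paste(floor(LAT / cellSize), floor(LON / cellSize), sep = "."))

data.frame(
  LAT   = tapply(track_1$LAT,   cells, mean),   # mean position of the points in each cell
  LON   = tapply(track_1$LON,   cells, mean),
  SCORE = tapply(track_1$SCORE, cells, sum)     # pooled score per cell
)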

How to determine FFT resulted indexes Freq and plot the Amplitude/Frequency Graph

I have a bit of a hypothetical question to help me understand this concept.
Let's say I captured a mono voice clip with an 8000 Hz sample rate that is 4096 bytes of data.
Feeding the first 512 bytes (16-bit encoding) through an FFT of size 256 will return 128 values, which I convert to amplitude.
So my frequencies for this output are
FFT BIN #1
0: 0*8000/256
1: 1*8000/256
.
.
127: 127*8000/256
So far so good, eh? Now I have 3584 bytes of unprocessed data left, so I perform another FFT of size 256 on the next 512 bytes of data and get the same number of results.
So for this, do I again have frequencies of:
FFT BIN #2:
Example1:
0: 0*8000/256
1: 1*8000/256
.
.
127: 127*8000/256
or
FFT BIN #2
Example2:
128: 128*8000/256
129: 129*8000/256
.
.
255: 255*8000/256
I would like to plot this amplitude/frequency graph, but I don't understand whether all these FFT bins should be overlapped on the same frequencies as in example 1, or spread out as in example 2.
Or am I trying to do something that is completely redundant? What I want to accomplish is to find the peak amplitude value of every 30-50 ms time frame, to use for comparison with other sound files.
If anyone can clear this up for me, I'd be very grateful.
Your FFT result bins represent the same set of frequencies in every FFT, as in your example #1, but for different slices of time.
Each FFT will allow you to plot magnitude vs. frequency for about a 32 ms window of time (256 samples at 8000 Hz).
You could also average the squared FFT magnitudes across frames to get a Welch-method PSD (power spectral density) estimate for a longer time frame.
If you want to find the peak amplitude value in every 30-50 ms time frame, you just need to look at the amplitude spectrum for the signal in each of those time frames and pick its maximum.
Also, if you take FFT of 256 samples for each frame, then you should get 129, not 128, frequency components. The first one is the DC component, and the last one is the Nyquist frequency component.
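As a small R illustration of the points above (a synthetic 440 Hz tone stands in for the decoded voice clip; in practice you would use your 2048 samples), every frame is indexed on the same bin frequencies, and the peak of each frame is just the maximum of its magnitude spectrum:

fs <- 8000                                   # sample rate from the question
frame_len <- 256                             # FFT size from the question
x <- sin(2 * pi * 440 * (0:2047) / fs)       # placeholder signal, 2048 samples

frames <- split(x, rep(seq_len(length(x) / frame_len), each = frame_len))
freqs  <- (0:(frame_len / 2)) * fs / frame_len   # 0, 31.25, ..., 4000 Hz; identical for every frame

peak_per_frame <- sapply(frames, function(frame) {
  mag <- Mod(fft(frame))[1:(frame_len / 2 + 1)]  # bins 0..128 (DC up to Nyquist)
  freqs[which.max(mag)]                          # peak frequency of this ~32 ms frame
})
peak_per_frame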
