Graphite - graph a related metric from a series list

I want to graph the mean response time for the 10 API calls which are called the most.
I have:
api.<route>.count
api.<route>.mean
I want to graph the mean value for the series with the highest counts.
I have the 10 highest counts by using highestCount( api.*.count ), so how do I take that list and replace .count with .mean?
The useSeriesAbove method is very close to what I want... but I don't want to provide it with a static count.
useSeriesAbove(seriesList, value, search, replace): Compares the maximum of each series against the given value. If the series maximum is greater than value, the regular expression search and replace is applied against the series name to plot a related metric.
E.g. given useSeriesAbove(ganglia.metric1.reqs,10,'reqs','time'), the response time metric will be plotted only when the maximum value of the corresponding request/s metric is > 10.
&target=useSeriesAbove(ganglia.metric1.reqs,10,"reqs","time")

Use limit(sortByMaxima(api.<route>.mean),10) to get the top 10 results.
Also, mean time may not be what you want if you are measuring latency; use the 95th or 99.9th percentile instead - see https://news.ycombinator.com/item?id=10485804
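For reference, the answer's expression dropped into a render target, in the same form as the useSeriesAbove example above (the wildcard metric path is the one from the question):

&target=limit(sortByMaxima(api.*.mean),10)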

Related

Looping to calculate sum of adjacent values while skipping blanks

I have a time series of emotional responses and I want to calculate a variable from the sum of absolute differences between these responses. For example, I have 10 variables for the intensity of sadness for T1-T10. However, there is some missing data for some participants, because some only responded for, e.g., T1-T5 or T1-T8. So the number of responses I have varies per participant.
Now I want to calculate a new variable (SAD_s) from the sum of absolute differences between these variables like this (T1s is the intensity of sadness for T1, T2s for T2 and so on):
COMPUTE SAD_s=abs(T2s-T1s)+abs(T3s-T2s) + abs(T4s-T3s) +abs(T5s-T4s)+abs(T6s-T5s) + abs(T7s-T6s) +abs(T8s-T7s)+abs(T9s-T8s) + abs(T10s-T9s) .
EXECUTE.
However, that only works for participants with the maximum of possible responses. For everyone else with missing data I get no value.
How can I make this work for participants who have missing data at the end of the time series (e.g. missing values from T7 onward, but complete data before that)? In principle, I would also like a solution for participants with missing values in between (e.g. T1-T7 complete, T8 missing, T9-T10 complete), but I would prioritize the former.
I also have a variable indicating the number of Ts participants responded to. I have a faint idea that I need to use a loop that is being repeated the number of times this variable indicates, but I don't know how to implement that.
If you want to just skip the missing values and still calculate differences between all pairs of adjacent valid values, you can do it this way:
* #lstvr is a scratch variable holding the last valid (non-missing) value seen so far.
* Each pass adds abs(vr - #lstvr) to sad_s only when both the current value and the last valid value exist.
compute #lstvr=T1.
compute sad_s=0.
do repeat vr=T2 to T10.
if not missing(vr) and not missing(#lstvr) sad_s=sad_s+abs(vr-#lstvr).
if not missing(vr) #lstvr=vr.
end repeat.
execute.
If, as I understand from your comment, you do not want to compare values from the two sides of a missing value, just replace the second IF line within the loop like this:
compute #lstvr=vr. /* instead of "if not missing(vr) #lstvr=vr."
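For clarity, the full loop with that substitution (the same code as above, with only that line changed):

compute #lstvr=T1.
compute sad_s=0.
do repeat vr=T2 to T10.
if not missing(vr) and not missing(#lstvr) sad_s=sad_s+abs(vr-#lstvr).
compute #lstvr=vr.
end repeat.
execute.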

How to measure the fixed effect of a pair of individuals using the plm package in R?

I have a panel data set consisting of bonds with daily prices observed over a period of time. Thus each bond is repeated downwards with the corresponding daily price observations and dates (ref picture below). Half of the bonds are green (identified by a dummy variable) and each green bond is matched with a non-green bond, each pair is identified with a pair-id. So a green bond and its matched non-green bond have the same pair-id, and are observed over the same time span (say 100 days each), but the individual bond-id is unique.
I want to measure the fixed effect within each pair of bonds to figure out if there is a significant difference in yield to maturity (variable used = ask.yield) between the green bond and its matching non-green bond. Thus, I believe that when defining the panel data in R, the individual index should be pair.id and the time index should be date. I use the following regression:
fixed <- plm(ask.yield ~ liquidity + green, data = paneldata, index = c("pair.id", "dates"), model = "within")
Desired output (do not mind the numbers):
I get an error message saying:
Error in pdim.default(index[1], index[2]) :
duplicate couples (id-time)
I understand the error message: each pair.id in the panel data is recorded over the same dates twice (once for the green bond and once for the matching non-green bond).
Does anyone know how to get around this problem and still be able to measure the fixed effect within each pair of bonds?
From the error, there are duplicated id-time couples, i.e. the combination of pair.id and dates is not unique. Can you check whether the date values are unique for each pair.id?
If they are, you might still need to convert the dates to strings: depending on the data type, the dates might be coerced to values that introduce the duplicates.
Hope this helps; since I don't have the data, I have no way to reproduce the issue.
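A quick way to run that check in R (a minimal sketch, assuming the data frame and column names from the question, i.e. paneldata with pair.id and dates):

# flag rows whose pair.id/dates combination has already appeared earlier in the data
dup <- duplicated(paneldata[, c("pair.id", "dates")])
sum(dup)                 # number of duplicated id-time rows plm is complaining about
head(paneldata[dup, ])   # inspect a few of the offending rows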

How to get these moving average results

I am trying to get a moving average in this format: NA stays NA; otherwise show the average of three periods, but for the first and second periods treat the missing values as an extension of the existing value.
I have been trying the rollmean and rollapply functions with varying inputs, but I am not getting the results I want.
tempo[,toto:= rollmean(original,3,align="left", fill="extend")]
tempo[,toto1:= rollapply(original,3,mean,align="left", na.pad=FALSE)]
tempo<-data.table(original = c(NA,NA,NA,10,0,0,0,10,10,10,0,NA,NA),
desired = c(NA,NA,NA,10,5,3.3,0,3.3,6.6,10,6.6,NA,NA))
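One way that reproduces the desired column above (a sketch, not from the original post: it assumes zoo's rollapplyr and computes a right-aligned three-period mean with partial windows at the start, restricted to the non-NA rows so the leading and trailing NAs stay NA):

library(data.table)
library(zoo)

tempo <- data.table(original = c(NA,NA,NA,10,0,0,0,10,10,10,0,NA,NA))
# average the current value and up to two preceding values, but only within
# the block of non-NA rows; rows where original is NA are left as NA
tempo[!is.na(original), toto := rollapplyr(original, 3, mean, partial = TRUE)]
# toto: NA NA NA 10 5 3.33 0 3.33 6.67 10 6.67 NA NA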

Tableau - Average of Ranking based on Average

For a certain date range, for a specific dimension, I need to calculate the average value of a daily rank that is itself based on the average value.
First of all, this is the starting point:
This is quite simple: for each day and Category I get the AVG(Value) and the rank based on that AVG(Value), computed using Category.
Now what I need is "just" a table with one row per Category showing the average value of that rank over the overall period.
Something like this:
Category Global Rank
A (blue) 1,6 (1+3+1+1+1+3)/6
B (orange) 2,3 (3+2+3+2+2+2)/6
C (red) 2,0 (2+1+2+3+3+1)/6
I tried using a LOD expression, but it's not possible to use the RANK table calculation inside one, so I'm wondering if I'm missing something or if this is even possible in Tableau.
Please find attached the twbx with the raw data here:
Any Help would be appreciated.

Sampling according to distribution from a large vector in R

I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit on the number of values that can be held in a vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: The data is just integers. And many of them are repeated multiple times.
large.vec <- c(1,2,3,4,1,1,8,7,4,1,...,216280)
To create the probabilities of 500k samples across the distribution I will first create the probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, convert these probabilities to position in the original sequence.
position.vec <- prob.vec*11034432564
The reason I created the position vector is so that I can pick the data point at each specific position after I order the population data.
Now I count the occurrences of each integer value in the population, create a data frame with the integer values and their counts, and also create the position interval for each of these values:
integer.values  counts       lw.interval  up.interval
0               300,000,034  0            300,000,034
1               169,345,364  300,000,034  469,345,398
2               450,555,321  469,345,399  919,900,719
...
Now, using the position vector, I identify which interval each position value falls into and, based on that, take the integer value of that interval.
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference: Calculate quantiles for large data.
I wanted to know if there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population.
This process does take a considerable amount of time, as the position vector has to be checked against all possible intervals in the data frame. To speed it up I have parallelized it using RHIPE.
I understand that I will be able to do this only because the data can be ordered.
I am not trying to sample randomly here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly reducing 11 billion values to 500k.
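For what it's worth, the interval-lookup step described above can be done in one vectorized pass with findInterval, so the position vector does not have to be checked against every interval row one by one. A minimal sketch, assuming a data frame with the integer.values and counts columns shown above (the name counts.df is hypothetical):

# order by integer value and build the cumulative upper bound of each value's position range
counts.df <- counts.df[order(counts.df$integer.values), ]
up.interval <- cumsum(as.numeric(counts.df$counts))   # as.numeric avoids integer overflow past 2^31 - 1
total <- up.interval[length(up.interval)]              # total population size (~11 billion)

# evenly spaced positions across the ordered population, as in the question
prob.vec <- seq(0, 1, length.out = 500000)
position.vec <- pmax(1, round(prob.vec * total))

# for each position, find the interval it falls into and take that integer value
idx <- findInterval(position.vec, up.interval, left.open = TRUE) + 1
sample.vec <- counts.df$integer.values[idx]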
