Obtaining the 95th percentile of a matrix and then plotting it - r

EDIT:
I have been asked to add more detail. Originally I have a 360x180 matrix containing E-P values, where E stands for evaporation and P for precipitation; these values basically indicate sources (E-P > 0) and sinks (E-P < 0) of moisture. To obtain the most important sources of moisture I have to take only the positive values and compute their 95th percentile, then plot the values that are above this threshold. Since I wanted a reproducible example, I used the peaks data:
I have done this in MATLAB, but if it can be done in R that works for me as well.
I have an example 49x49 matrix like this:
a = peaks;                    % built-in 49x49 example surface
pcolor(a);                    % pseudocolor plot of the matrix
caxis([-10 10]);              % fix the color axis limits
cbh = colorbar('v');          % vertical colorbar handle
set(cbh, 'YTick', -10:1:10)   % tick every integer
And it shows something like this
What I want to do is obtain the 95th percentile of only the positive values, and then plot the values above it.
How can I do this? Also, which would be better: replacing all the values less than zero with 0s or with NaNs?

If you have the Statistics Toolbox, you can use the function prctile to obtain a percentile. I don't have this toolbox, so I wrote my own version (a long time ago) based on the code for the function median. With either prctile or my percentile you can do:
a = peaks;
t = percentile(a(a>0), 95);   % 95th percentile of the positive values only
b = a > t;                    % logical mask: 1 where a exceeds the threshold
subplot(1,2,1)
pcolor(a);                    % original data
subplot(1,2,2)
pcolor(b);                    % values above the 95th percentile
a(a>0) is a vector with all the positive values in a. t is the 95th percentile of this vector.
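Since you mentioned R would work as well, here is a minimal R sketch of the same idea using quantile() and image(); there is no base-R equivalent of peaks, so a random matrix stands in for it:
a <- matrix(rnorm(49 * 49), 49, 49)     # stand-in for MATLAB's peaks
t <- quantile(a[a > 0], 0.95)           # 95th percentile of the positive values
b <- a > t                              # TRUE where a exceeds the threshold
par(mfrow = c(1, 2))
image(a, main = "a")
image(b * 1, main = "above threshold")  # coerce logical to numeric for image()
As for replacing the negative values with 0s or NaNs: for plotting, NaN (NA in R) is usually the better choice, since pcolor leaves NaN cells blank and image() leaves NA cells unpainted, whereas zeros are drawn as a color and would also distort any statistics computed on the matrix later.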

Related

Trying to coerce the data onto a Gaussian curve and the results are not as expected

This is not a question about curve fitting. I have a collection of 60 different sites, from which I can collect maximum, minimum, and average temperatures. I need to use these data to calculate the operating temperature of a photovoltaic cell; it doesn't make sense to use the reported average temperatures, however, because they include values from after sunset. Instead, I first create a "fake" average temperature (totalityoftemperatures_fakemeans), which is the mean of the maximum and minimum temperatures. I then calculate an adjusted minimum temperature by subtracting one standard deviation from the fake mean (assuming 6 * sd = max - min), and finally calculate an "adjusted" mean temperature as the average of this new minimum (fake mean - 1 * sd) and the pre-existing maximum temperature.
What really bothers me is that this recalculated average ought to be higher than the "fake" mean; after all, it is the average of the adjusted minimum and the original maximum value. I might also cross-post this to the statistics Stack Exchange, but I'm pretty sure this is a coding issue right now. Is there anyone out there who can look at the R code below?
#The first data sets of maxima and minima are taken from empirical data
for (i in 1:nrow(totalityofsites)) {
  for (j in 1:12) {
    totalityoftemperatures_fakemeans[i, j] = mean(totalityoftemperatures_maxima[i, j], totalityoftemperatures_minima[i, j])
  }
}

totality_onesigmaDF = abs((1/6) * (totalityoftemperatures_maxima - totalityoftemperatures_minima))
totalityoftemperatures_adjustedminima = totalityoftemperatures_fakemeans - totality_onesigmaDF

for (i in 1:nrow(totalityofsites)) {
  for (j in 1:12) {
    totalityoftemperatures_adjustedmeans[i, j] = mean(totalityoftemperatures_adjustedminima[i, j], totalityoftemperatures_maxima[i, j])
  }
}
#The second calculation of the average should be higher than "fake" but that is not the case
I think your problem lies in your use of the mean function. When you do this:
mean(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j])
You are calling mean with two arguments. The function expects a single vector of numbers as its first argument; a second positional argument is matched to mean's trim parameter, so the 100 is never averaged at all. Look:
mean(2, 100)
#[1] 2
Whereas if you concatenate the values into a single vector, you get the right answer:
mean(c(2, 100))
#[1] 51
So you need to change
mean(totalityoftemperatures_maxima[i,j], totalityoftemperatures_minima[i,j])
to
mean(c(totalityoftemperatures_maxima[i,j], totalityoftemperatures_minima[i,j]))
and
mean(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j])
to
mean(c(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j]))
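As an aside, the loops can be dropped entirely, since arithmetic in R is elementwise. A vectorized sketch of the same computation, assuming the objects are numeric matrices (or data frames) of matching dimensions:
# elementwise means over whole matrices; no loops needed
totalityoftemperatures_fakemeans <- (totalityoftemperatures_maxima + totalityoftemperatures_minima) / 2
totality_onesigmaDF <- abs((totalityoftemperatures_maxima - totalityoftemperatures_minima) / 6)
totalityoftemperatures_adjustedminima <- totalityoftemperatures_fakemeans - totality_onesigmaDF
totalityoftemperatures_adjustedmeans <- (totalityoftemperatures_adjustedminima + totalityoftemperatures_maxima) / 2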

Periodogram (TSA In R) can't find correct frequency

I'm trying to process a sinusoidal time series data set:
I am using this code in R:
library(readxl)
library(stats)
library(matplot.lib)
library(TSA)
Data_frame<-read_excel("C:/Users/James/Documents/labssin2.xlsx")
# compute the periodogram (an estimate of the power spectrum)
p = periodogram(Data_frame$NormalisedVal)
dd = data.frame(freq = p$freq, spec = p$spec)
order = dd[order(-dd$spec), ]
top2 = head(order, 5)
# display the 5 highest "power" frequencies
top2
time = 1/top2$freq
time
However, when examining the frequency spectrum, the frequency (which is in Hz) is ridiculously low, ~0.02 Hz, whereas there should be one much larger frequency of around 1 Hz and another smaller one of 0.02 Hz (just visually, assuming this is a sinusoid enveloped in another sinusoid).
Might be a rather trivial problem, but has anyone got any ideas as to what could be going wrong?
Thanks in advance.
Edit 1: Using
result <- abs(fft(df$Data_frame.NormalisedVal))
Produces what I am expecting to see.
Edit 2: As requested, a text file with the output of dput(Data_frame):
http://m.uploadedit.com/bbtc/1553266283956.txt
The periodogram function returns normalized frequencies in the [0, 0.5] range, where 0.5 corresponds to the Nyquist frequency, i.e. half your sampling rate. Since you appear to have data sampled at 60 Hz, the spike at 0.02 corresponds to a frequency of 0.02 * 60 = 1.2 Hz, which is consistent with your expectation and in the neighborhood of what can be seen in the data you provided (the bulk of the spike lying in the range 0.7-1.1 Hz).
On the other hand, the x-axis of the last graph you show, based on the fft, is an index and not a frequency. The corresponding frequency should be computed according to the following formula:
f <- (index-1)*fs/N
where fs is the sampling rate and N is the number of samples used by the fft. So in your graph, the same 1.2 Hz would appear at an index of ~31, assuming N is approximately 1500.
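Concretely, both conversions can be sketched in R using the objects already defined in your question, assuming a 60 Hz sampling rate (the rate itself is an assumption based on your data):
fs <- 60                                   # assumed sampling rate in Hz
f_hz <- p$freq * fs                        # periodogram: normalized frequency -> Hz
N <- length(Data_frame$NormalisedVal)      # number of samples seen by fft
f_fft <- (seq_along(result) - 1) * fs / N  # fft bin index -> Hz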
Note: the sampling interval in the data you provided is not quite constant and may affect the results as both periodogram and fft assume a regular sampling interval.

How to extract saved envelope values in Spatstat?

I am new to both R & spatstat and am working with the inhomogeneous pair correlation function. My dataset consists of point values spread across several time intervals.
sp77.ppp = ppp(sp77.dat$Plot_X, sp77.dat$Plot_Y, window = window77, marks = sp77.dat$STATUS)
Dvall77 = envelope((Y = dv77.ppp[dv77.ppp$marks == '2']), fun = pcfinhom,
                   r = seq(0, 20, 0.25), nsim = 999, divisor = 'd',
                   simulate = expression((rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks == '1']),
                                         (rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks == '2'])),
                   savepatterns = T, savefuns = T)
I am trying to make multiple pairwise comparisons (from different time periods) and need to create a function that, for every calculated envelope value at each 'r' value, finds the min and max differences between the envelopes.
My question is: how do I find the saved envelope values? I know that savefuns = T saves all the simulated envelope values, but I can't find how to extract them. The summary (below) says that the values are stored. How do I call and extract them?
> print(Dvall77)
Pointwise critical envelopes for g[inhom](r)
and observed value for ‘(Y = dv77.ppp[dv77.ppp$marks == "2"])’
Edge correction: “iso”
Obtained from 999 evaluations of user-supplied expression
(All simulated function values are stored)
(All simulated point patterns are stored)
Alternative: two.sided
Significance level of pointwise Monte Carlo test: 2/1000 = 0.002
.......................................................................................
Math.label Description
r r distance argument r
obs {hat(g)[inhom]^{obs}}(r) observed value of g[inhom](r) for data pattern
mmean {bar(g)[inhom]}(r) sample mean of g[inhom](r) from simulations
lo {hat(g)[inhom]^{lo}}(r) lower pointwise envelope of g[inhom](r) from simulations
hi {hat(g)[inhom]^{hi}}(r) upper pointwise envelope of g[inhom](r) from simulations
.......................................................................................
Default plot formula: .~r
where “.” stands for ‘obs’, ‘mmean’, ‘hi’, ‘lo’
Columns ‘lo’ and ‘hi’ will be plotted as shading (by default)
Recommended range of argument r: [0, 20]
Available range of argument r: [0, 20]
Thanks in advance for any suggestions!
If you are looking to access the values of the summary statistic (ginhom) for each of the randomly labelled patterns, this is in principle documented in help(envelope.ppp). Admittedly the help file is long, and if you are new to both R and spatstat it is easy to get lost. The clue is in the value section of the help file. The result is a data.frame with some additional classes (envelope and fv), and as the help file says:
Additionally, if ‘savepatterns=TRUE’, the return value has an
attribute ‘"simpatterns"’ which is a list containing the ‘nsim’
simulated patterns. If ‘savefuns=TRUE’, the return value has an
attribute ‘"simfuns"’ which is an object of class ‘"fv"’
containing the summary functions computed for each of the ‘nsim’
simulated patterns.
Then of course you need to know how to access an attribute in R, which is done using attr:
funs <- attr(Dvall77, "simfuns")
Then funs is a data.frame (and fv-object) with all the function values for each randomly labelled pattern.
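For example, to get the pointwise extremes across all 999 simulated curves (a sketch; since fv objects are data frames, ordinary data.frame operations apply):
sims <- as.data.frame(funs)
curves <- as.matrix(sims[, -1])          # drop the first column, the r argument
pointwise_min <- apply(curves, 1, min)   # smallest simulated value at each r
pointwise_max <- apply(curves, 1, max)   # largest simulated value at each r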
I can't really tell from your question whether you just need the values of the upper and lower curves defining the envelope. In that case you just access them like columns of an ordinary data.frame (and there is no need to save all the individual function values in the envelope):
lo <- Dvall77$lo
hi <- Dvall77$hi
d <- hi - lo
More elegantly you can do:
d <- with(Dvall77, hi - lo)
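And if what you are after is the largest and smallest gap between the envelopes over all r, a small follow-up sketch:
d <- with(Dvall77, hi - lo)
range(d)                   # narrowest and widest envelope
Dvall77$r[which.max(d)]    # r value at which the envelope is widest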

Label or score outliers in R

I'm looking for some easy-to-use algorithms in R to label (outlier or not) or score (say, 7.5) outliers row-wise. That is, I have a matrix m that contains several rows, and I want to identify the rows that represent outliers compared to the other rows.
m <- matrix( data = c(1,1,1,0,0,0,1,0,1), ncol = 3 )
To illustrate some more, I want to compare all the (complete) rows in the matrix with each other to spot outliers.
Here's some really simple outlier detection (using either the boxplot statistics or quantiles of the data) that I wrote a few years ago.
Outliers
But, as noted, it would be helpful if you'd describe your problem with greater precision.
Edit:
Also, you say you want row-wise outliers. Do you mean that you're interested in identifying whole rows, as opposed to observations within a variable (as is typically done)? If so, you'll want to use some sort of distance metric, though which metric you choose will depend on your data.
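For instance, here is a minimal sketch of one such score, using the mean Euclidean distance from each row to all the others (Euclidean distance is just an assumption for illustration; substitute whatever metric fits your data):
m <- matrix(c(1, 1, 1, 0, 0, 0, 1, 0, 1), ncol = 3)
d <- as.matrix(dist(m))               # pairwise Euclidean distances between rows
score <- rowSums(d) / (nrow(m) - 1)   # mean distance from each row to the rest
score                                 # larger scores = more atypical rows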

How do you select the rank-k approximation for SVDImpute (package: imputation) in R?

I have a matrix with nominal values from 1-5, with some missing values. I would like to use SVDImpute (from the "imputation" package) in R to fill in the missing values, but I am unsure what number to use for k (the rank of the approximation) in the function.
The help page description of SVDImpute is:
Imputation using the SVD. First fill missing values using the mean of the column. Then, compute a low, rank-k approximation of x. Fill the missing values again from the rank-k approximation. Recompute the rank-k approximation with the imputed values and fill again, repeating num.iters times.
To me, this sounds like the column means are calculated as part of the function; is this correct? If so, how was the value of k = 3 chosen in the example?
library(imputation)         # provides SVDImpute
x = matrix(rnorm(100), 10, 10)
x.missing = x > 1           # flag roughly the top 16% of values
x[x.missing] = NA           # knock them out
SVDImpute(x, 3)             # impute with a rank-3 approximation
Any help is greatly appreciated.
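For what it's worth, one package-agnostic way to sanity-check a choice of k is to look at the singular-value spectrum of a mean-filled copy of x, mirroring the first step the help page describes. This is only a heuristic sketch, not something provided by the imputation package:
# heuristic sketch (not part of the imputation package): mean-fill the
# columns, then see how much variation the leading singular values capture
x.filled <- apply(x, 2, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)   # mean-fill each column
  col
})
s <- svd(x.filled)$d
cumsum(s^2) / sum(s^2)   # pick the smallest k that captures "enough", e.g. ~90%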
