How to calculate the Zipf exponent in R?

The generalized Zipf's law states that, if we rank a collection of n objects in non-decreasing order according to their size, the product of a power of the rank and of the size of each object is constant throughout the collection, i.e.
$$r^{\alpha} \, z_r = \text{constant},$$
where r is the rank, z_r is the size of the r-th object, and alpha ($\alpha$) is Zipf's parameter.
I would like to calculate the exponent of the function that describes Zipf's law in my data, i.e. Zipf's parameter/exponent. My data are the following:
> dput(df)
structure(list(x = c(1.06936486607035, 1.3232662468642, 1.57716762765805,
1.83106900845189, 2.08497038924574, 2.33887177003959, 2.59277315083344,
2.84667453162729, 3.10057591242114, 3.35447729321498, 3.60837867400883,
3.86228005480268, 4.11618143559653, 4.37008281639038, 4.62398419718422,
4.87788557797807, 5.13178695877192, 5.38568833956577, 5.63958972035962,
5.89349110115347, 6.14739248194731, 6.40129386274116, 6.65519524353501,
6.90909662432886, 7.16299800512271, 7.41689938591655, 7.6708007667104,
7.92470214750425, 8.1786035282981, 8.43250490909195, 8.6864062898858,
8.94030767067964, 9.19420905147349, 9.44811043226734, 9.70201181306119,
9.95591319385504, 10.2098145746489, 10.4637159554427, 10.7176173362366,
10.9715187170304, 11.2254200978243, 11.4793214786181, 11.733222859412,
11.9871242402058, 12.2410256209997, 12.4949270017935, 12.7488283825874,
13.0027297633812, 13.2566311441751, 13.5105325249689, 13.7644339057628,
14.0183352865566, 14.2722366673505, 14.5261380481443, 14.7800394289382,
15.033940809732, 15.2878421905258, 15.5417435713197, 15.7956449521135,
16.0495463329074, 16.3034477137012, 16.5573490944951, 16.8112504752889,
17.0651518560828, 17.3190532368766, 17.5729546176705, 17.8268559984643,
18.0807573792582, 18.334658760052, 18.5885601408459, 18.8424615216397,
19.0963629024336, 19.3502642832274, 19.6041656640213, 19.8580670448151,
20.111968425609, 20.3658698064028, 20.6197711871967, 20.8736725679905,
21.1275739487844, 21.3814753295782, 21.6353767103721, 21.8892780911659,
22.1431794719598, 22.3970808527536, 22.6509822335474, 22.9048836143413,
23.1587849951351, 23.412686375929, 23.6665877567228, 23.9204891375167,
24.1743905183105, 24.4282918991044, 24.6821932798982, 24.9360946606921
), y = c(-2.97228886692625, -2.95440976170107, -2.93928459279152,
-2.92685672250007, -2.91707897563357, -2.91054871731668, -2.90679861996743,
-2.90554785065139, -2.90675006309313, -2.91036572966993, -2.91696470816554,
-2.92597057051316, -2.93718053039632, -2.95054999876795, -2.96603736909913,
-2.98406085693689, -3.00405379487858, -3.02588740495999, -3.04950848046858,
-3.07486235427239, -3.10210692287855, -3.13082120061712, -3.16091945841148,
-3.19233074728207, -3.2249788128355, -3.25869455640463, -3.29332682179158,
-3.32879100108009, -3.36499680219032, -3.40182490231023, -3.43885206667123,
-3.47620809318544, -3.51379996912, -3.55153068719991, -3.58922700390204,
-3.62638735300239, -3.66333893302836, -3.70000206447245, -3.73629766644354,
-3.77202286263472, -3.80675861286812, -3.84092561898948, -3.87447795824782,
-3.90737403696004, -3.93941841852314, -3.97035436941129, -4.00059707307105,
-4.03013619484928, -4.05896597861451, -4.0869186255246, -4.11388702659758,
-4.14021632809744, -4.16591776523316, -4.19100561781447, -4.21534097913824,
-4.23891623603497, -4.26199110836985, -4.2845881782092, -4.30673264230625,
-4.32831990641116, -4.34940948783001, -4.37019317123469, -4.39070825537989,
-4.41099561113224, -4.43100956980122, -4.45086204575797, -4.47069657988101,
-4.49057219155847, -4.51055207455314, -4.53067990556232, -4.55108865901739,
-4.57188026595389, -4.59313206537618, -4.61492481492843, -4.63739606090163,
-4.6606424296565, -4.68472597512036, -4.70972674413206, -4.73572647040504,
-4.76290834240927, -4.79128541380379, -4.820888083087, -4.85177836410324,
-4.88401718152641, -4.91772243314579, -4.95283585285162, -4.98936501364757,
-5.02733146115584, -5.0667505747377, -5.10751161205962, -5.14958042252788,
-5.19293189917589, -5.23752155483956, -5.28329086087404, -5.3297251199846
)), row.names = c(NA, -95L), class = c("tbl_df", "tbl", "data.frame"
))
They result from a kernel density estimate of the degree distribution of a network (i.e. the x axis is the degree and the y axis is the logarithm of the number of nodes with that degree).
How can I estimate the Zipf's exponent from this dataset?

You can check out the gamlss package, which provides functions to fit the Zipf distribution (and several variants of it).
https://cran.r-project.org/web/packages/gamlss/gamlss.pdf (pg. 47)
# install.packages('gamlss')
library(gamlss)
gamlss(
formula = ...,
data = ...,
family = ZIPF(mu.link = 'log')
)
But I don't really understand your data. You said the x axis is the degree, so why are the values not integers? Is your network a valued network?
Also, you said the y axis is the logarithm of the number of nodes. Since all y values are negative, that would imply the number of nodes is between 0 and 1, which doesn't really make sense.
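If the goal is only a rough exponent from a log-density curve, one common shortcut is a straight-line fit on the log-log scale. The sketch below assumes that y is the natural logarithm of the density and that the density follows a power law, p(x) proportional to x^(-alpha), over the fitted range. It uses synthetic data as a stand-in for df (the names alpha_true, fit and alpha_hat are illustrative only) and is not a substitute for a proper maximum-likelihood fit such as the gamlss one above.

```r
# Synthetic stand-in for the question's df: x = degree, y = log(density).
# ASSUMPTION: log p(x) = c - alpha * log(x), i.e. a pure power law.
set.seed(1)
alpha_true <- 1.5
x <- seq(1, 25, length.out = 95)
y <- -3 - alpha_true * log(x) + rnorm(length(x), sd = 0.01)

# y is already on the log scale, so only x needs to be log-transformed.
fit <- lm(y ~ log(x))

# The exponent is minus the slope of the fitted line.
alpha_hat <- -coef(fit)[["log(x)"]]
alpha_hat
```

With real data it may be necessary to restrict the fit to the range of x where y against log(x) actually looks linear, since a kernel density estimate is usually curved near the mode.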

Related

Finding peaks with minimum peak width in R - similar to MATLAB function

I need to find peaks in time series data, but the result needs to match the output of the findpeaks function in MATLAB with the argument 'MinPeakWidth' set to 10. I have already tried many functions to achieve this: pracma::findpeaks, fluoR::find_peaks, splus2R::peaks, IDPmisc::peaks (this one has an argument for peak width, but the result is not the same). I have also looked at other functions, including packages for chromatography and spectroscopy analysis in Bioconductor. Beyond that, I have tried the functions (with small alterations) from this other Stack Overflow question: Finding local maxima and minima
The findpeaks function in MATLAB is used for finding local maxima and has the following characteristics:
Find the local maxima. The peaks are output in order of occurrence. The first sample is not included despite being the maximum. For the flat peak, the function returns only the point with lowest index.
The explanation of the 'MinPeakWidth' argument on the MATLAB web site is:
Minimum peak width, specified as the comma-separated pair consisting of 'MinPeakWidth' and a positive real scalar. Use this argument to select only those peaks that have widths of at least 'MinPeakWidth'.
If you specify a location vector, x, then 'MinPeakWidth' must be expressed in terms of x. If x is a datetime array, then specify 'MinPeakWidth' as a duration scalar or as a numeric scalar expressed in days.
If you specify a sample rate, Fs, then 'MinPeakWidth' must be expressed in units of time.
If you specify neither x nor Fs, then 'MinPeakWidth' must be expressed in units of samples.
Data Types: double | single | duration
This is the data:
valores <- tibble::tibble(V1 = c(
0.04386573, 0.06169861, 0.03743560, 0.04512523, 0.04517977, 0.02927114, 0.04224937, 0.06596527, 2.15621006, 0.02547804, 0.03134409, 0.02867694,
0.08251871, 0.03252856, 0.06901365, 0.03201109, 0.04214851, 0.04679828, 0.04076178, 0.03922274, 1.65163662, 0.03630282, 0.04146608, 0.02618668,
0.04845364, 0.03202031, 0.03699149, 0.02811389, 0.03354410, 0.02975296, 0.03378896, 0.04440788, 0.46503730, 0.06128226, 0.01934736, 0.02055138,
0.04233819, 0.03398005, 0.02528630, 0.03694652, 0.02888223, 0.03463824, 0.04380172, 0.03297124, 0.04850558, 0.04579087, 1.48031231, 0.03735059,
0.04192204, 0.05789367, 0.03819694, 0.03344671, 0.05867103, 0.02590745, 0.05405133, 0.04941912, 0.63658824, 0.03134409, 0.04151859, 0.03502503,
0.02182294, 0.15397702, 0.02455722, 0.02775277, 0.04596132, 0.03900906, 0.03383408, 0.03517160, 0.02927114, 0.03888822, 0.03077891, 0.04236406,
0.05663730, 0.03619537, 0.04294887, 0.03497815, 0.03995837, 0.04374904, 0.03922274, 0.03596561, 0.03157820, 0.26390591, 0.06596527, 0.04050374,
0.02888223, 0.03824380, 0.05459656, 0.02969611, 0.86277224, 0.02385613, 0.03888451, 0.06496997, 0.03930725, 0.02931837, 0.06021005, 0.03330982,
0.02649659, 0.06600261, 0.02854480, 0.03691669, 0.06584168, 0.02076757, 0.02624355, 0.03679596, 0.03377049, 0.03590172, 0.03694652, 0.03575540,
0.02532416, 0.02818711, 0.04565318, 0.03252856, 0.04121822, 0.03147210, 0.05002047, 0.03809792, 0.02802299, 0.03399243, 0.03466543, 0.02829443,
0.03339476, 0.02129232, 0.03103367, 0.05071605, 0.03590172, 0.04386435, 0.03297124, 0.04323263, 0.03506247, 0.06225121, 0.02862442, 0.02862442,
0.06032925, 0.04400082, 0.03765090, 0.03477973, 0.02024540, 0.03564245, 0.05199116, 0.03699149, 0.03506247, 0.02129232, 0.02389752, 0.04996414,
0.04281258, 0.02587514, 0.03079668, 0.03895791, 0.02639014, 0.07333564, 0.02639014, 0.04074970, 0.04346211, 0.06032925, 0.03506247, 0.04950545,
0.04133673, 0.03835127, 0.02616212, 0.03399243, 0.02962473, 0.04800780, 0.03517160, 0.04105323, 0.03649472, 0.03000509, 0.05367187, 0.03858981,
0.03684529, 0.02941408, 0.04733265, 0.02590745, 0.02389752, 0.02385495, 0.03649472, 0.02508245, 0.02649659, 0.03152265, 0.02906310, 0.04950545,
0.03497815, 0.04374904, 0.03610649, 0.03799523, 0.02912771, 0.03694652, 0.05105353, 0.03000509, 0.02902378, 0.06425520, 0.05660319, 0.03065341,
0.04449069, 0.03638436, 0.02582273, 0.03753463, 0.02756006, 0.07215131, 0.02418869, 0.03431030, 0.04474425, 0.42589279, 0.02879489, 0.02872819,
0.02512494, 0.02450022, 0.03416346, 0.04560013, 1.40417366, 0.04784363, 0.04950545, 0.04685682, 0.03346052, 0.03255004, 0.07296053, 0.04491526,
0.02910482, 0.05448995, 0.01934736, 0.02195528, 0.03506247, 0.03157064, 0.03504810, 0.03754736, 0.03301058, 0.06886929, 0.03994190, 0.05130644,
0.21007323, 0.05630628, 0.02893721, 0.03683226, 0.03825290, 0.02494987, 0.02633410, 0.02721408, 0.03798986, 0.33473991, 0.04236406, 0.02389752,
0.03562747, 0.04662421, 0.02373767, 0.04918125, 0.04478894, 0.02418869, 0.03511514, 0.02871556, 0.05586166, 0.49014922, 0.03406339, 0.84823093,
0.03416346, 0.08729506, 0.03147210, 0.02889640, 0.06181828, 0.04940672, 0.03666858, 0.03019139, 0.03919279, 0.04864613, 0.03720420, 0.04726722,
0.04141298, 0.02862442, 0.29112744, 0.03964319, 0.05657445, 0.03930888, 0.04400082, 0.02722065, 0.03451685, 0.02911419, 0.02831578, 0.04001334,
0.05130644, 0.03134409, 0.03408579, 0.03232126, 0.03624218, 0.04708792, 0.06291741, 0.05663730, 0.03813209, 0.70582932, 0.04149421, 0.03607614,
0.03201109, 0.02055138, 0.03727305, 0.03182562, 0.02987404, 0.04142461, 0.03433624, 0.04264550, 0.02875086, 0.05797661, 0.04248705, 0.04476514))
From the data above, I obtain 22 peaks using the pracma::findpeaks function with the code below:
picos_r <- pracma::findpeaks(-valores$V1, minpeakdistance = 10)
Using the MATLAB function
picos_matlab = findpeaks(-dado_r, 'MinPeakWidth', 10);
I obtain the following 11 peaks:
picos_matlab <- c(-0.02547804, -0.02618668, -0.01934736, -0.02182294, -0.0245572200000000, -0.0202454, -0.02385495, -0.01934736, -0.02373767, -0.02862442, -0.02722065)
I used pracma::findpeaks because it has already given an identical result in another part of the function that I am writing. I have already tried to change the code of pracma::findpeaks, but with little success.
The package cardidates contains a heuristic peak hunting algorithm that can be fine-tuned to some extent using the parameters xmax, minpeak and mincut. It was designed for a special problem, but may also be used for other things. Here is an example:
library("cardidates")
p <- peakwindow(valores$V1)
plot(p) # detects 14 peaks
p <- peakwindow(valores$V1, minpeak=0.18)
plot(p) # detects 11 peaks
Details are described in the package vignette and in https://doi.org/10.1007/s00442-007-0783-2
Another option is to run a smoother before peak detection.
I'm not sure what your test case is: -valores$V1, valores$V1, or -dado_r (what is that)?
I think pracma::findpeaks() does quite well if you do:
x <- valores$V1
P <- pracma::findpeaks(x,
minpeakdistance = 10, minpeakheight = sd(x))
plot(x, type = 'l', col = 4)
grid()
points(P[,2], P[, 1], pch=20, col = 2)
It finds 11 peaks that stick out, while four or five others are too close to a larger peak to be counted. All peaks lower than one standard deviation of x are ignored.
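Neither answer filters by peak width, which is what 'MinPeakWidth' does. As a rough base-R sketch of that idea, the function below measures the width of each strict local maximum at half its prominence, with linear interpolation of the crossing points, and then keeps only peaks at least as wide as a threshold. The name peak_widths is hypothetical, flat-topped peaks and end points are ignored, and this is an approximation of the idea that has not been checked against MATLAB's output.

```r
# Width of each strict interior local maximum at half its prominence
# (simplified sketch; not guaranteed to reproduce MATLAB's findpeaks).
peak_widths <- function(x) {
  n <- length(x)
  i <- 2:(n - 1)
  idx <- i[x[i] > x[i - 1] & x[i] > x[i + 1]]  # strict maxima only
  out <- lapply(idx, function(p) {
    h <- x[p]
    # Extend left/right until a higher sample or the end of the signal.
    l <- p; while (l > 1 && x[l - 1] <= h) l <- l - 1
    r <- p; while (r < n && x[r + 1] <= h) r <- r + 1
    base <- max(min(x[l:p]), min(x[p:r]))  # higher of the two valley floors
    ref <- h - (h - base) / 2              # half-prominence level
    # Walk outwards to the first samples at or below the reference level.
    a <- p; while (a > 1 && x[a] > ref) a <- a - 1
    b <- p; while (b < n && x[b] > ref) b <- b + 1
    # Linearly interpolate the crossing position on each side.
    left  <- if (x[a] > ref) a else a + (ref - x[a]) / (x[a + 1] - x[a])
    right <- if (x[b] > ref) b else b - (ref - x[b]) / (x[b - 1] - x[b])
    data.frame(index = p, height = h, width = right - left)
  })
  do.call(rbind, out)
}

# Small made-up signal: a narrow peak at index 2 and a wide one at index 7.
sig <- c(0, 1, 0, 0, 2, 4, 6, 4, 2, 0, 0)
pw <- peak_widths(sig)
kept <- pw[pw$width >= 2, ]  # keep peaks at least 2 samples wide
kept
```

For the question's case the equivalent call would be peak_widths(-valores$V1) followed by a filter on width >= 10; whether that yields the same 11 peaks as MATLAB would need to be checked.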

MLE of a distribution in R: fitdistrplus (SGT distribution), how do I do it?

For return data, I am doing research on the importance of skewness and kurtosis for the cVaR calculation. We first compare some distributions by estimating their parameters with fitdist() from the R package "fitdistrplus". However, we want to do this for a number of distributions (see picture: SGT, GT, SGED, GED, t, norm).
Below is sample code for the SGT, where there is a problem: it produces NaNs for the standard errors of the parameters p and q. I also don't really know how to choose the starting values.
SGTstart <- list(mu=0,sigma=2, lambda = 0.5, p=2, q=8)
SGTfit_R <- fitdistrplus::fitdist(data = as.vector(coredata(R)), distr = "sgt", method = "mle", SGTstart)
summary(SGTfit_R)
Sample of the data to make it reproducible (the return vector of my stock index):
c("0", "-1,008599424", "0,73180187", "0,443174024", "-0,351935172", "-1,318784086", "-2,171323799", "1,270243431", "-0,761354019", "0,417350946", "0,906432976", "-0,066736422", "-0,867085373", "-0,119914361", "-0,300989601", "0,482518259", "0,787365385", "-1,443105439", "-0,318546686", "-3,467674998", "1,041540157", "1,371281289", "-1,176752782", "-1,116893343", "-0,127522915", "-0,658070287", "1,098348016", "0,296391358", "-0,810635352", "-0,041779322", "0,353974233", "0,120090141", "0,304927119", "-1,22772592", "0,040768364", "1,182218724", "0,123136685", "-0,682709972", "-0,174093506", "-0,539704174", "0,579080595", "0,326346169", "0,205503526", "-0,771928642", "1,490828799", "0,734822712", "-0,025733101", "0,246531452", "-0,695585736", "-0,732413919", "0,806417952", "0,396105099", "0,024558388", "-0,791232528", "0,730410255", "-1,438890702", "0,668400286", "1,440996039", "0,731823553", "0,177515522", "0,740085418", "0,926248628", "-0,63516084", "-0,89996829", "1,655117371", "0,501033581", "0,06526534", "1,320866692", "-0,496350734", "-0,10157668", "0,022333393", "-1,236934596", "-1,070586427", "0,661662029", "0,871334714", "0,758891429", "0,064748766", "-0,305132153", "-0,424033661", "1,223444774", "-0,441840866", "-0,661390655", "-2,148399329", "0,843067435", "0,601099664", "-0,329590349", "0,210791225", "-0,341341769", "-0,555892395", "0,624026986", "0,218851965", "-0,015859171", "0,524283138", "-0,855634719", "0,339281481", "0,038507713", "-1,943784688", "0,315857689", "-0,368982834", "-1,111684011", "-0,2409217", "0,421815833", "-0,079319721", "0,915338199", "0,537387704", "-0,023004636", "-0,331854888", "0,702733882", "-1,084343115", "0,16901282", "0,559404916", "-0,538587484", "0,153683523", "-0,336562411", "-0,274946953", "0,862901957", "0,117407383", "1,205205829", "0,633347347", "0,058712615", "-0,083562948", "1,343190727", "1,281380185", "0,750972389", "-1,538678151", "0,228222073", "0,635385022", "0,037379479", "-0,491444798", "-1,220272752", 
"1,093162287", "1,499512169", "0,041394336", "-0,113330512", "0,657485999", "-0,264647978", "0,115056075", "-0,009763771", "0,454629881", "0,322398317", "0,347112494", "0,948127411", "0,461194301", "-0,407013048", "-0,469481931", "-0,536045151", "0,114726251", "0,396772868", "0,525885581")
Best, Enjo
The answer was to use the sgt package.
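One practical detail about the sample above: the returns are stored as character strings with decimal commas, and fitdist() needs a numeric vector. A minimal sketch of the conversion, using the first few values from the sample (the names ret_chr and ret_num are illustrative):

```r
# First few values of the posted sample, as character strings with decimal commas.
ret_chr <- c("0", "-1,008599424", "0,73180187", "0,443174024", "-0,351935172")

# Replace the decimal comma with a point, then convert to numeric.
ret_num <- as.numeric(sub(",", ".", ret_chr, fixed = TRUE))
ret_num
```

Calling as.numeric() directly on the original strings would return NA (with a warning) for every value that contains a comma.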

How to calculate the root mean squared deviation in R?

I should say up front that I have never used the root mean squared deviation before.
I'm just trying to reproduce what I found in an article.
In practice, I have to quantify the noise of a "methodology" (the result of several noise sources, because two instruments are coupled) by measuring it at three different points outside normal operation, where we know that it measures only noise.
In the procedure I'm following, you then calculate the standard deviation among these three points and multiply it by a factor of 1.96 for the 95th percentile confidence interval (and in this way get a limit of detection for the method).
The time resolution is 30 minutes, so there is a standard deviation among three points, then one among the following three points, and so on. I already have a dataset organized in this way and I have calculated the standard deviation.
The article I'm following compares the limit of detection calculated using the standard deviation with the one calculated using the root mean squared deviation.
Because in the end I have to use this limit of detection to filter noise out of my data, I want to compare which method is best for my situation, as they did.
How can I calculate the root mean squared deviation in the same way I calculated the standard deviation? That is, for each set of three points (three different columns), then the next three points (the same three columns but the following row), and so on?
I already tried the rmse function of the Metrics package, but the problem is that it requires two arguments: an actual and a predicted value.
Of course, as I did for the standard deviation calculation, I should use aggregate to iterate the function over each group of three columns for each row.
EDIT
Here is an excerpt from the article I'm following for this method, to make clearer what I'm after:
"For time lags much different from the true time lag, the standard deviation of the flux provides a measure of the random error affecting the flux" ... "Multiplying this measure of the random error by α gives an estimate of the measurement precision at a given confidence interval (α = 1.96 for the 95th percentile; α = 3 for the 99th percentile) which can be used as the flux limit of detection (LoD) (i.e. LoDσ = α × REσ)" ... "A modification of the LoDσ approach is to calculate the random error based on the root mean squared deviation (RMSE) of flux from zero, which reflects the variability in the crosscovariance function in these regions, but also its offset from zero"
Below are part of the dataset and the code that I tried to use.
Dataset
structure(list(`101_LOD` = c(-0.00647656063436054, 0.00645714072316343,
0.00174533523902105, -0.000354643362187957, -0.000599093190801188,
0.00086188829059792), `101_LOD.1` = c(0.00380625456526623, -0.00398115037246045,
0.00158673927930099, -0.00537583996746438, -0.00280048350643599,
0.00348232298529063), `101_LOD.2` = c(-0.00281100080425964, -0.00335537844222041,
0.00611652518452308, -0.000738139825060029, 0.00485039477849737,
0.00412428118507656), `107_LOD` = c(0.00264717678436649, 0.00339296025595841,
0.00392733001719888, 0.0106686039973083, 0.00886643251752075,
0.0426091484273961), `107_LOD.1` = c(0.000242380702002215, -0.00116108069669281,
0.0119784744970561, 0.00380805756323248, 0.00190407945251567,
0.00199684331869391), `107_LOD.2` = c(-0.0102716279438754, -0.00706528150567528,
-0.0108745954674186, -0.0122962259781756, -0.00590383880635847,
-0.00166664119985051), `111_LOD` = c(-0.00174374098054644, 0.00383270191075735,
-0.00118363208946644, 0.00107908760333878, -9.30127551375776e-05,
-0.00141500588842743), `111_LOD.1` = c(0.000769378300959002,
0.00253820252869653, 0.00110643824418424, -0.000338050323261079,
-0.00313666295753596, 0.0043919374295125), `111_LOD.2` = c(0.000177265973907964,
0.00199829884609846, -0.000490950219515303, -0.00100263695578483,
0.00122606902671889, 0.00934018452187161), `113_LOD` = c(0.000997977666838309,
0.0062400770296875, -0.00153620247996209, 0.00136849054508488,
-0.00145700847633675, -0.000591288575933268), `113_LOD.1` = c(-0.00114161441697546,
0.00152607521404826, 0.000811193628975422, -0.000799514037634276,
-0.000319008435039752, -0.0010086036089075), `113_LOD.2` = c(-0.000722312098377764,
0.00364767954707251, 0.000547744649351312, 0.000352509651080838,
-0.000852173274761947, 0.00360487150682726), `135_LOD` = c(-0.00634051802134062,
0.00426062889500736, 0.00484049067127332, 0.00216220020394825,
0.00165634168942681, -0.00537970105199375), `135_LOD.1` = c(-0.00209301968088832,
0.00535855274344209, -0.00119679744329422, 0.0041216882161451,
0.00512978202611836, 0.0014048506490567), `135_LOD.2` = c(0.00022377545723911,
0.00400550696583795, 0.00198972253447825, 0.00301341644871015,
0.00256802839330668, 0.00946109288597202), `137_LOD` = c(-0.0108508893475138,
-0.0231919072487789, -0.00346546003410657, -0.00154066625155414,
0.0247266017774909, -0.0254464953061609), `137_LOD.1` = c(-0.00363025194918789,
-0.00291104074373261, 0.0024998477144967, 0.000877707284759669,
0.0095477003599792, 0.0501795740749602), `137_LOD.2` = c(0.00930498343499501,
-0.011839104725282, 0.000274929503053888, 0.000715665078729413,
0.0145503185102915, 0.0890428314632625), `149_LOD` = c(-0.000194406250680231,
0.000355157226357547, -0.000353931679163222, 0.000101471293242973,
-0.000429409422518444, 0.000344585379249552), `149_LOD.1` = c(-0.000494386150759807,
0.000384907974061922, 0.000582537329068263, -0.000173285705433721,
-6.92758935962043e-05, 0.00237942557324254), `149_LOD.2` = c(0.000368606958615297,
0.000432568466833549, 3.33092313366271e-05, 0.000715304544370804,
-0.000656902381786168, 0.000855422043674721), `155_LOD` = c(-0.000696168382693618,
-0.000917607266525328, 4.77049670728094e-06, 0.000140297660927979,
-5.99898679530658e-06, 6.71169142984434e-06), `155_LOD.1` = c(-0.000213644203677328,
-3.44396001911029e-07, -0.000524232671878577, -0.000830180665933627,
1.47799998238307e-06, -5.97640014667251e-05), `155_LOD.2` = c(-0.000749882784933487,
0.000345737159390042, -0.00076916001239521, -0.000135205762575321,
-2.55352420251723e-06, -3.07199008030628e-05), `31_LOD` = c(-0.00212014938530172,
0.0247411322547065, -0.00107990654365844, -0.000409195814154659,
-0.00768439381433953, 0.001860128524035), `31_LOD.1` = c(-0.00248488588195854,
-0.011146734518705, -0.000167943850441196, -0.0021998906531997,
0.0166775965182051, -0.0156939303287719), `31_LOD.2` = c(0.00210626277375321,
-0.00327815351414411, -0.00271043947479133, 0.00118991079627845,
-0.00838520090692615, 0.0255825346347586), `33_LOD` = c(0.0335175783154054,
0.0130192144768818, 0.0890608024914352, -0.0142431454793663,
0.00961009674973182, -0.0429774973256228), `33_LOD.1` = c(0.018600175159935,
0.04588362587764, 0.0517479021554752, 0.0453766081395813, -0.0483559729403664,
0.123771869764484), `33_LOD.2` = c(0.01906507758481, -0.00984821669825455,
0.134177176083007, -0.00544320457445977, 0.0516083894733814,
-0.0941500564321804), `39_LOD` = c(-0.148517395684098, -0.21311281527214,
0.112875846920874, -0.134256453140454, 0.0429030528286934, -0.0115143877745049
), `39_LOD.1` = c(-0.0431568202849291, -0.159003698955288, 0.0429009071238143,
-0.126060096927082, -0.078848020069061, -0.0788748111534866),
`39_LOD.2` = c(-0.16276833960171, 0.0236589399437796, 0.0828435027244962,
-0.50219849047847, -0.105196237549017, -0.161206838628339
), `42_LOD` = c(-0.00643926654994104, -0.0069253267922805,
7.63419856289838e-05, -0.0185223126108671, 0.00120855708103566,
-0.00275288147011515), `42_LOD.1` = c(-0.000866169150506504,
-0.00147791175852563, -0.000670310173141084, -0.00757733007180311,
0.0151353172950393, -0.00114193461500327), `42_LOD.2` = c(0.00719928454572906,
0.00311615354837406, 0.00270759483782046, -0.0108062423259522,
0.00158765505419478, -0.0034831499672973), `45_LOD` = c(0.00557787518897268,
0.022337270533665, 0.00657118689440082, -0.00247269227623608,
0.0191646343214611, 0.0233090596023039), `45_LOD.1` = c(-0.0305395220788143,
0.077105031761457, -0.00101713990356452, 0.0147500116150713,
-5.43009569586179e-05, -0.0235006181977403), `45_LOD.2` = c(-0.0216498682456909,
-0.0413426968184435, -0.0210779895848601, -0.0147549519865421,
0.00305229143870313, -0.0483293292336662), `47_LOD` = c(-0.00467568767221499,
-0.0199796182799552, 0.00985966068611855, -0.031010117051163,
0.0319279109813341, 0.0350743318265918), `47_LOD.1` = c(0.00820166533285921,
-0.00748186905620154, -0.010483251821707, -0.00921919551377505,
0.0129546148757833, 0.000223462281435923), `47_LOD.2` = c(0.00172469728530889,
0.0181683409295075, 0.00264937907258855, -0.0569837400476351,
0.00514558635349483, 0.0963339573489031), `59_LOD` = c(-0.00664210061621158,
-0.062069664217766, 0.0104345353700492, 0.0115323589989968,
-0.000701276829098035, -0.0397759501000331), `59_LOD.1` = c(-0.00844888486350536,
0.0207426674766074, -0.0227755432761471, -0.00370561240222376,
0.0152046240483297, -0.0127327412801225), `59_LOD.2` = c(-0.000546590647534814,
0.0178115310450356, 0.00776130696191998, 0.00162470375408126,
-0.036140754156005, 0.0197791914089296), `61_LOD` = c(0.00797528044191513,
-0.00358928087671818, 0.000662870138322471, -0.0412142836466128,
-0.00571822580078707, -0.0333870884803465), `61_LOD.1` = c(0.000105849888219735,
-0.00694734283847093, -0.00656216592134899, 0.00161225110022219,
0.0125744958934939, -0.0178560868664668), `61_LOD.2` = c(0.0049288443167774,
0.0059411543659837, -0.00165857112209555, -0.0093669075333705,
0.00655185371925189, 0.00516436591134869), `69_LOD` = c(0.0140014747729604,
0.0119645827116724, 0.0059880663080946, -0.00339119330845176,
0.00406436116298777, 0.00374425148741196), `69_LOD.1` = c(0.00465076983995792,
0.00664902297016735, -0.00183936649215524, 0.00496509351837152,
-0.0224812403463345, -0.0193087796456654), `69_LOD.2` = c(-0.00934638876711703,
-0.00802183076602164, 0.00406752039394799, -0.000421337136630527,
-0.00406768983408334, -0.0046016148041856), `71_LOD` = c(-0.00206064862123214,
0.0058604630066848, -0.00353440181333921, -0.000305197461077327,
0.00266085011303462, -0.00105635261106644), `71_LOD.1` = c(3.66652318354654e-06,
0.00542612739642576, 0.000860385212430484, 0.00157520645492044,
-0.00280256517377998, -0.00474358065422048), `71_LOD.2` = c(-0.00167098030843413,
0.0059622082597603, -0.00121597491543965, -0.000791592953383716,
-0.0022790991468459, 0.00508978650148816), `75_LOD` = c(NA,
-0.00562613898652477, -0.000103076958936504, -3.76628574664693e-05,
-0.000325767611573817, 0.000117404893823389), `75_LOD.1` = c(NA,
NA, -0.000496324358203359, -0.000517476831074487, -0.00213096062838051,
-0.00111202867609916), `75_LOD.2` = c(NA, NA, -0.000169651845347418,
-4.72864955070539e-05, -0.00144880109085214, 0.00421635976535877
), `79_LOD` = c(-0.0011901810540199, 0.00731686066269579,
0.00538551997145174, -0.00578723012473479, -0.0030246805255648,
0.00146141135533218), `79_LOD.1` = c(-0.00424278455960268,
-0.010593752642875, 0.0065136497427927, -0.00427355522802769,
0.000539975609490915, -0.0206849687839064), `79_LOD.2` = c(-0.00366739576561779,
-0.00374066839898667, -0.00132764684703939, -0.00534145222725701,
0.00920940542227595, -0.0101871763957068), `85_LOD` = c(-0.0120254177480422,
0.00369546541331518, -0.00420718877886963, 0.00414911885475517,
-0.00130381692844529, -0.00812757789798261), `85_LOD.1` = c(-0.00302024868281014,
0.00537704163310547, 0.00184264538884543, -0.00159032685888543,
-0.0062127769817834, 0.00349476605688194), `85_LOD.2` = c(0.0122689407380797,
-0.00509605601025503, -0.00641413996554198, 0.000592176121486696,
0.00131237912317341, -0.00535018996837309), `87_LOD` = c(0.00613621268007298,
0.000410268892659307, -0.00239014321624482, -0.00171179729894864,
-0.00107159765522861, -0.00708388174601732), `87_LOD.1` = c(0.00144787264098156,
-0.0025946273860992, -0.00194897899110034, 0.00157863310440493,
-0.0048913305554607, -0.000585669821053749), `87_LOD.2` = c(-0.00224691693198253,
-0.00277315666829267, 0.00166487067514155, -0.00173757960229744,
-0.00362252480121682, -0.0101992979591839), `93_LOD` = c(-0.0234225447373586,
0.0390095666365413, 0.00606244490932179, 0.0264258422783391,
0.0161211132913951, -0.0617678157059), `93_LOD.1` = c(-0.0124876313221369,
-0.0309636779639578, 0.00610883313140442, -0.0192442672220773,
0.0129557286224975, -0.00869066964782635), `93_LOD.2` = c(-0.0219837540560547,
-0.00521242297372905, 0.0179965615561871, 0.0081370991723329,
1.45427765512579e-06, -0.0111199632179688), `99_LOD` = c(0.00412086456443205,
-0.00259940538393106, 0.00742537463584133, -0.00302091572866969,
-0.00320466045653491, -0.00168702410433936), `99_LOD.1` = c(0.00280546156134205,
-0.00472591065687533, 0.00518402193979284, -0.00130887074314965,
0.00148769905391341, 0.00366250488078969), `99_LOD.2` = c(-0.00240469207099292,
-9.57307699040024e-05, -0.000145493235845501, 0.000667454164326723,
-0.0057445759245933, 0.00433464631989088), H_LOD = c(-6248.9128518109,
-10081.9540490064, -6696.91582671427, -5414.20614601348,
-3933.64339240365, -13153.7509294302), H_LOD.1 = c(-6.2489128518109,
-10.0819540490064, -6.69691582671427, -5.41420614601348,
-3.93364339240365, -13.1537509294302), H_LOD.2 = c(-6248.9128518109,
-10081.9540490064, -6696.91582671427, -5414.20614601348,
-3933.64339240365, -13153.7509294302)), row.names = c(NA,
6L), class = "data.frame")
Code
LOD_rdu=sapply(split.default(LOD_ut, rep(seq((ncol(LOD_ut) / 3)), each = 3)), function(i)
apply(i, 1, rmse))
And I get this error:
Error in mse(actual, predicted) : argument "predicted" is missing, with no default
It is difficult to understand precisely what you need, but I will try to answer.
According to Wikipedia, the RMSD is used to compare a dataset generated by a model (the model in your article, I guess) with an observed distribution.
From CRAN, the rmse function in the modelr package has two arguments, model and data:
modelr::rmse(model = ,data = )
This function will give you the fit of your model to the data. The first argument is a model, meaning you will probably use a function like lm() to generate it. Because you don't describe the model, I can't help you more.
The second argument is the dataset; the one you provide is quite confusing to me. R will expect a tidy dataset with two columns: x, the time of the observation, and y, the value.
You can first work out which group each column belongs to:
GRP = sub("[.][0-9]*","",colnames(LOD_ut))
head(GRP)
[1] "101_LOD" "101_LOD" "101_LOD" "107_LOD" "107_LOD" "107_LOD"
You can also use 1:(ncol(LOD_ut)/3) as group ids; the advantage of the above is that it gives you the group names back.
We can use the groups like this:
LOD_ut[,GRP=="101_LOD"]
101_LOD 101_LOD.1 101_LOD.2
1 -0.0064765606 0.003806255 -0.0028110008
2 0.0064571407 -0.003981150 -0.0033553784
3 0.0017453352 0.001586739 0.0061165252
4 -0.0003546434 -0.005375840 -0.0007381398
5 -0.0005990932 -0.002800484 0.0048503948
6 0.0008618883 0.003482323 0.0041242812
This selects your first group of three columns. If you do apply(.., 1, sd) on it you get the row-wise standard deviation; now we just do it over all groups:
SD = sapply(unique(GRP), function(i) apply(LOD_ut[, GRP == i], 1, sd, na.rm = TRUE))
If you must do RMSE, use the mean as the predicted value:
# RMSE of the non-missing values in x around their own mean
rmse_func = function(x){
  Metrics::rmse(mean(x, na.rm = TRUE), x[!is.na(x)])
}
# one column per group, one row per time step
RMSE = sapply(unique(GRP), function(i){
  apply(LOD_ut[, GRP == i], 1, rmse_func)
})
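Note that an RMSE taken around the mean is just the standard deviation with a divisor of n instead of n - 1, so it carries no new information. The article excerpt in the question instead describes the root mean squared deviation of the flux from zero. A minimal base-R sketch of that reading follows; it assumes "from zero" means sqrt(mean(x^2)), and it uses a small made-up data frame in place of LOD_ut (the names LOD_toy, rms_zero, RMSD0 and LoD are illustrative).

```r
# Toy stand-in for LOD_ut: two groups of three columns each (values are made up).
LOD_toy <- data.frame(
  A_LOD = c(3, -1), A_LOD.1 = c(-4, 2), A_LOD.2 = c(0, NA),
  B_LOD = c(1, 0),  B_LOD.1 = c(1, 0),  B_LOD.2 = c(1, 6),
  check.names = FALSE
)

# Group label = column name without the trailing ".1", ".2", ...
GRP <- sub("[.][0-9]*$", "", colnames(LOD_toy))

# ASSUMPTION: "deviation from zero" is read as sqrt(mean(x^2)), NAs dropped.
rms_zero <- function(x) sqrt(mean(x^2, na.rm = TRUE))

# One column per group, one row per time step.
RMSD0 <- sapply(unique(GRP), function(g) apply(LOD_toy[, GRP == g], 1, rms_zero))

# Limit of detection at the 95th percentile, as in the quoted excerpt.
LoD <- 1.96 * RMSD0
LoD
```

Unlike the standard deviation, this measure also grows when the three values share a common offset from zero, which is the property the excerpt highlights.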

Very high residual Sum-of-Squares

I'm having a problem with the residual sum of squares of a fit. The residual sum of squares is very high, which suggests that the fit is not very good. However, visually the fit looks fine, which seems at odds with such a high residual value. Can anyone help me understand what's going on?
My data:
x=c(0.017359, 0.019206, 0.020619, 0.021022, 0.021793, 0.022366, 0.025691, 0.025780, 0.026355, 0.028858, 0.029766, 0.029967, 0.030241, 0.032216, 0.033657,
0.036250, 0.039145, 0.040682, 0.042334, 0.043747, 0.044165, 0.044630, 0.046045, 0.048138, 0.050813, 0.050955, 0.051910, 0.053042, 0.054853, 0.056886,
0.058651, 0.059472, 0.063770,0.064567, 0.067415, 0.067802, 0.068995, 0.070742,0.073486, 0.074085 ,0.074452, 0.075224, 0.075853, 0.076192, 0.077002,
0.078273, 0.079376, 0.083269, 0.085902, 0.087619, 0.089867, 0.092606, 0.095944, 0.096327, 0.097019, 0.098444, 0.098868, 0.098874, 0.102027, 0.103296,
0.107682, 0.108392, 0.108719, 0.109184, 0.109623, 0.118844, 0.124023, 0.124244, 0.129600, 0.130892, 0.136721, 0.137456, 0.147343, 0.149027, 0.152818,
0.155706,0.157650, 0.161060, 0.162594, 0.162950, 0.165031, 0.165408, 0.166680, 0.167727, 0.172882, 0.173264, 0.174552,0.176073, 0.185649, 0.194492,
0.196429, 0.200050, 0.208890, 0.209826, 0.213685, 0.219189, 0.221417, 0.222662, 0.230860, 0.234654, 0.235211, 0.241819, 0.247527, 0.251528, 0.253664,
0.256740, 0.261723, 0.274585, 0.278340, 0.281521, 0.282332, 0.286166, 0.288103, 0.292959, 0.295201, 0.309456, 0.312158, 0.314132, 0.319906, 0.319924,
0.322073, 0.325427, 0.328132, 0.333029, 0.334915, 0.342098, 0.345899, 0.345936, 0.350355, 0.355015, 0.355123, 0.356335, 0.364257, 0.371180, 0.375171,
0.377743, 0.383944, 0.388606, 0.390111, 0.395080, 0.398209, 0.409784, 0.410324, 0.424782 )
y= c(34843.40, 30362.66, 27991.80 ,28511.38, 28004.74, 27987.13, 22272.41, 23171.71, 23180.03, 20173.79, 19751.84, 20266.26, 20666.72, 18884.42, 17920.78, 15980.99, 14161.08, 13534.40, 12889.18, 12436.11,
12560.56, 12651.65, 12216.11, 11479.18, 10573.22, 10783.99, 10650.71, 10449.87, 10003.68, 9517.94, 9157.04, 9104.01, 8090.20, 8059.60, 7547.20, 7613.51, 7499.47, 7273.46, 6870.20, 6887.01,
6945.55, 6927.43, 6934.73, 6993.73, 6965.39, 6855.37, 6777.16, 6259.28, 5976.27, 5835.58, 5633.88, 5387.19, 5094.94, 5129.89, 5131.42, 5056.08, 5084.47, 5155.40, 4909.01, 4854.71,
4527.62, 4528.10, 4560.14, 4580.10, 4601.70, 3964.90, 3686.20, 3718.46, 3459.13, 3432.05, 3183.09, 3186.18, 2805.15, 2773.65, 2667.73, 2598.55, 2563.02, 2482.63, 2462.49, 2478.10,
2441.70, 2456.16, 2444.00, 2438.47, 2318.64, 2331.75, 2320.43, 2303.10, 2091.95, 1924.55, 1904.91, 1854.07, 1716.52, 1717.12, 1671.00, 1602.70, 1584.89, 1581.34, 1484.16, 1449.26,
1455.06, 1388.60, 1336.71, 1305.60, 1294.58, 1274.36, 1236.51, 1132.67, 1111.35, 1095.21, 1097.71, 1077.05, 1071.04, 1043.99, 1036.22, 950.26, 941.06, 936.37, 909.72, 916.45,
911.01, 898.94, 890.68, 870.99, 867.45, 837.39, 824.93, 830.61, 815.49, 799.77, 804.84, 804.88, 775.53, 751.95, 741.01, 735.86, 717.03, 704.57, 703.74, 690.63,
684.24, 650.30, 652.74, 612.95 )
I then fit the model using the nlsLM function (minpack.lm package):
library(magicaxis)
library(minpack.lm)
sig.backg=3*10^(-3)
mod <- nlsLM(y ~ a *( 1 + (x/b)^2 )^c+sig.backg,
start = c(a = 0, b = 1, c = 0),
trace = TRUE)
## plot data
magplot(x, y, main = "data", log = "xy", pch=16)
## plot fitted values
lines(x, fitted(mod), col = 2, lwd = 4 )
This is the fit summary, including the residual sum of squares:
> print(mod)
Nonlinear regression model
model: y ~ a * (1 + (x/b)^2)^c + sig.backg
data: parent.frame()
a b c
68504.2013 0.0122 -0.6324
residual sum-of-squares: 12641435
Number of iterations to convergence: 34
Achieved convergence tolerance: 0.0000000149
The residual sum of squares is very high: 12641435.
Is that expected, or is something wrong with the fit? Is the fit bad?
It makes sense, since the squared mean of your response variable is 38110960. You can scale your data if you prefer to work with smaller numbers.
The residual sum of squares doesn't have much meaning without knowing the total sum of squares (from which R^2 can be calculated). Its value is going to increase if your data have large values or if you add more data points, regardless of how good your fit is. Also, you may want to look at a plot of your residuals versus fitted values: there is a clear pattern that your model should account for before you can treat the errors as Normally distributed.
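To make the point about scale concrete, here is a minimal base-R sketch. It uses synthetic data and a simple log-log lm fit as a stand-in, not the nlsLM model from the question; the helper fit_summary and the names x_toy, y_toy are made up for illustration. Dividing the response by 1000 shrinks the residual sum of squares by a factor of 10^6, while R^2 is unchanged.

```r
# Hypothetical helper: residual SS, total SS and R^2 for any set of fitted values
fit_summary <- function(obs, fitted_vals) {
  rss <- sum((obs - fitted_vals)^2)
  tss <- sum((obs - mean(obs))^2)
  c(rss = rss, tss = tss, r2 = 1 - rss / tss)
}

# Synthetic power-law data with multiplicative noise (assumed, for illustration)
set.seed(42)
x_toy <- seq(0.02, 0.4, length.out = 50)
y_toy <- 500 * x_toy^(-1.2) * exp(rnorm(50, sd = 0.05))

# Stand-in fit on the log-log scale, back-transformed to the original scale
fit <- lm(log(y_toy) ~ log(x_toy))
pred <- exp(fitted(fit))

s_raw    <- fit_summary(y_toy, pred)
s_scaled <- fit_summary(y_toy / 1000, pred / 1000)

s_raw
s_scaled
```

For the model in the question, the same helper could be called as fit_summary(y, fitted(mod)), and plot(fitted(mod), resid(mod)) would show the residual pattern mentioned above. R^2 for a nonlinear fit should be read with some caution, but it is at least independent of the units of y.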

Efficient computation of bivariate empirical cdf in R/Fortran

Given an n*2 data matrix X I'd like to calculate the bivariate empirical cdf for each observation, i.e. for each i in 1:n, return the percentage of observations with 1st element not greater than X[i,1] and 2nd element not greater than X[i,2].
Because of the nested search involved it gets terribly slow for n ~ 100k, even after porting it to Fortran. Does anyone know if there's a better way of handling sample sizes like this?
Edit: I believe this problem is similar (in terms of complexity) to finding Kendall's tau, which is of order O(n^2). In that case Knight (1966) has an algorithm to reduce it to O(n log(n)). Just wondering if there's any O(n*log(n)) algorithm for finding bivariate ecdf already out there.
Edit 2: This is the code I have in Fortran, as requested. This is called in R in the usual way, so the R code is omitted here. The code is meant for arbitrary dimensions, but for the specific thing I'm doing a bivariate one is good enough.
! Calculates multivariate empirical cdf for each point
! n: number of observations
! d: dimension (>=2)
! umat: data matrix
! outvec: vector of ecdf
subroutine mecdf(n,d,umat,outvec)
  implicit none
  integer :: n, d, i, j, k, tempsum
  double precision, dimension(n) :: outvec
  double precision, dimension(n,d) :: umat
  logical :: flag
  do i = 1,n
    tempsum = 0
    do j = 1,n
      flag = .true.
      do k = 1,d
        if (umat(i,k) < umat(j,k)) then
          flag = .false.
          exit
        end if
      end do
      if (flag) then
        tempsum = tempsum + 1
      end if
    end do
    outvec(i) = real(tempsum)/n
  end do
  return
end subroutine
I think my first effort was not really an ecdf, although it did map the points to the interval [0,1]. The example is a 50 x 2 matrix generated with:
#M <- matrix(runif(100), ncol=2)
M <-
structure(c(0.0468267474789172, 0.296053855214268, 0.205678076483309,
0.467400068417192, 0.968577065737918, 0.435642971657217, 0.929023026255891,
0.038406387437135, 0.304360694251955, 0.964778139721602, 0.534192910650745,
0.741682186257094, 0.0848641532938927, 0.405901980120689, 0.957696850644425,
0.384813814423978, 0.639882878866047, 0.231505588628352, 0.271994129288942,
0.786155494628474, 0.349499785574153, 0.279077709652483, 0.206662984099239,
0.777465222170576, 0.705439242534339, 0.643429880728945, 0.887209519045427,
0.0794123203959316, 0.849177583120763, 0.704594585578889, 0.736909110797569,
0.503158083418384, 0.49449566937983, 0.408533290959895, 0.236613316927105,
0.297427259152755, 0.0677345870062709, 0.623845702270046, 0.139933609170839,
0.740499466424808, 0.628097783308476, 0.678438259987161, 0.186680511338636,
0.339367639739066, 0.373212536331266, 0.976724133593962, 0.94558056560345,
0.610417427960783, 0.887977657606825, 0.663434249348938, 0.447939050383866,
0.755168803501874, 0.478974275058135, 0.737040047068149, 0.429466919740662,
0.0021107573993504, 0.697435079608113, 0.444197302218527, 0.108997165458277,
0.856855363817886, 0.891898229718208, 0.93553287582472, 0.991948011796921,
0.630414301762357, 0.0604106825776398, 0.908968194155023, 0.0398679254576564,
0.251426834380254, 0.235532913124189, 0.392070295521989, 0.530511683085933,
0.319339724024758, 0.534880011575297, 0.92030712752603, 0.138276003766805,
0.213625695323572, 0.407931711757556, 0.605797187192366, 0.424798395251855,
0.471233424032107, 0.0105366336647421, 0.625802840106189, 0.524665891425684,
0.0375960320234299, 0.54812005511485, 0.0105806747451425, 0.438266788609326,
0.791981092421338, 0.363821814302355, 0.157931488472968, 0.47945317090489,
0.906797411618754, 0.762243523262441, 0.258681379957125, 0.308056800393388,
0.91944490163587, 0.412255838746205, 0.347220918396488, 0.68236422073096,
0.559149842709303), .Dim = c(50L, 2L))
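So the task is to do a single summation of a two-part logical test on N items, which I suspect costs about 3*N elementary operations, i.e. O(N), per query point (and therefore O(N^2) if you evaluate it at every observation). It might be marginally faster if implemented in Rcpp, but these are already vectorized operations.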
# Wrong: ecdf2d <- function(m,i,j) { ord <- rank(m[ , 1]^2+m[ , 2]^2)
# ord[i]/nrow(m)} # scales to [0,1] interval
# uses <= so that it matches the "not greater than" definition in the question
ecdf2d.v2 <- function(obj, x, y) sum( obj[,1] <= x & obj[,2] <= y)/nrow(obj)
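Evaluating a count like this at every observation is O(N^2) overall. One possible way to get down to O(N log N), sketched here under my own assumptions (this is not Knight's algorithm, and the function names ecdf2d_brute and ecdf2d_fast are made up), is to sort the points by the first coordinate and keep a Fenwick tree (binary indexed tree) over the ranks of the second coordinate. Points with tied first coordinates are inserted as a group before any of them is queried, so that "not greater than" holds with ties.

```r
# Reference O(N^2) version: fraction of rows with both coordinates <= row i
ecdf2d_brute <- function(m) {
  sapply(seq_len(nrow(m)), function(i)
    mean(m[, 1] <= m[i, 1] & m[, 2] <= m[i, 2]))
}

# Sketch of an O(N log N) version using a Fenwick tree over y-ranks
ecdf2d_fast <- function(m) {
  n <- nrow(m)
  ord <- order(m[, 1])                          # process points by increasing x
  xs <- m[ord, 1]
  yr <- rank(m[, 2], ties.method = "min")[ord]  # tied y values share one rank
  tree <- integer(n)                            # Fenwick tree of counts
  counts <- integer(n)
  i <- 1L
  while (i <= n) {
    # find the block i..j of points sharing the same x value
    j <- i
    while (j < n && xs[j + 1L] == xs[i]) j <- j + 1L
    # insert the whole block first so tied x values count each other
    for (k in i:j) {
      p <- yr[k]
      while (p <= n) {
        tree[p] <- tree[p] + 1L
        p <- p + (p - bitwAnd(p, p - 1L))       # add the lowest set bit
      }
    }
    # prefix sum up to the y-rank = inserted points with y <= this y
    for (k in i:j) {
      p <- yr[k]
      s <- 0L
      while (p > 0L) {
        s <- s + tree[p]
        p <- bitwAnd(p, p - 1L)                 # clear the lowest set bit
      }
      counts[k] <- s
    }
    i <- j + 1L
  }
  out <- numeric(n)
  out[ord] <- counts / n                        # back to the original row order
  out
}

# Quick consistency check on random data
set.seed(1)
m_demo <- matrix(runif(200), ncol = 2)
all.equal(ecdf2d_fast(m_demo), ecdf2d_brute(m_demo))
```

The loops are plain interpreted R, so the constant factor is not small; the same structure should carry over directly to Fortran or Rcpp if that matters for n around 100k. This sketch only handles the bivariate case.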

Resources