I'm trying to determine the asymptotic growth rate for the plot below, which has a logarithmic x-axis (base 2) and a linear y-axis. It seems sub-logarithmic to me, but how exactly would one determine the rate (in the big-O notation of asymptotic complexity)?
The original plot is above; in the one below, the blue line is sqrt(), the green one is log(), and the last one is the original function.
Assuming you can exhibit a constant c such that f(2^(i+1)) / f(2^i) = c for every integer i, you can use the fact that
f(2^i) = c * f(2^(i-1)) = c^i * f(1)
So for any integer k,
f(k) = f(2^log2(k))
     = c^log2(k) * f(1)
     = k^log2(c) * f(1)
I tried to estimate a few values of the ratio f(2^(i+1)) / f(2^i):
f(2^12) / f(2^11) ~= 0.250 / 0.175 ~= 1.43
f(2^11) / f(2^10) ~= 0.175 / 0.125 ~= 1.4
f(2^10) / f(2^9) ~= 0.125 / 0.085 ~= 1.47
f(2^9) / f(2^8) ~= 0.085 / 0.070 ~= 1.21
And it becomes too hard to read the values of the function for lower values of x.
It is not clear to me whether you truly have a constant ratio f(2^(i+1))/f(2^i) (you probably need more data for x > 2^13), but, as an example, if you adopt the value c = 1.4, you end up with f(k)/f(1) ~= k^0.49 ~= sqrt(k), i.e. f/f(1) would be "close" to the square root function.
Disclaimer:
Please take "close" here with extra care: asymptotically, x^(0.5 +/- epsilon) for epsilon > 0 is anything but close to sqrt(x) (the difference between the two functions can be made arbitrarily large as x -> +Inf).
I was trying to fit this dataset:
#Mydataset damped sine wave data
#X ---- Y
45.80 320.0
91.60 -254.0
137.4 198.0
183.2 -156.0
229.0 126.0
274.8 -100.0
320.6 80.0
366.4 -64.0
412.2 52.0
458.0 -40.0
503.8 34.0
549.6 -26.0
595.4 22.0
641.2 -18.0
which, as you can see by the plot below, has the classical trend of a damped sine wave:
So I first defined the function for the fit
f(x) = exp(-a*x)*sin(b*x)
and then I ran the fit
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7377200000e+05 0.00e+00 1.10e-19 1.000000e+00 1.000000e+00
Current data point
=========================
# = 1 out of 14
x = -5.12818e+20
z = 320
Current set of parameters
=========================
a = -5.12818e+20
b = -1.44204e+20
Function evaluation yields NaN ("not a number")
getting a NaN as a result. So I looked around on Stack Overflow and remembered that I have already had problems in the past when fitting exponentials: their fast growth/decay requires you to set initial parameters in order not to get this error (as I've asked here). So I tried setting the starting parameters a and b to the expected values, a = 9000, b = 146000, but the result was even more frustrating than before:
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7377200000e+05 0.00e+00 0.00e+00 9.000000e+03 1.460000e+05
Singular matrix in Givens()
I've thought: "these are too much large numbers, let's try with smaller ones".
So i entered the values for a and b and started fitting again
a = 0.01
b = 2
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7429059500e+05 0.00e+00 1.71e+01 1.000000e-02 2.000000e+00
1 2.7346318324e+05 -3.03e+02 1.71e+00 1.813940e-02 -9.254913e-02
* 1.0680927157e+137 1.00e+05 1.71e+01 -2.493611e-01 5.321099e+00
2 2.7344431789e+05 -6.90e+00 1.71e+00 1.542835e-02 4.310193e+00
* 6.1148639318e+81 1.00e+05 1.71e+01 -1.481123e-01 -1.024914e+01
3 2.7337226343e+05 -2.64e+01 1.71e+00 1.349852e-02 -9.008087e+00
* 6.4751980241e+136 1.00e+05 1.71e+01 -2.458835e-01 -4.089511e+00
4 2.7334273482e+05 -1.08e+01 1.71e+00 1.075319e-02 -4.346296e+00
* 1.8228530731e+121 1.00e+05 1.71e+01 -2.180542e-01 -1.407646e+00
* 2.7379223634e+05 1.64e+02 1.71e+02 8.277720e-03 -1.440256e+00
* 2.7379193486e+05 1.64e+02 1.71e+03 1.072342e-02 -3.706519e+00
5 2.7326800742e+05 -2.73e+01 1.71e+02 1.075288e-02 -4.338196e+00
* 2.7344116255e+05 6.33e+01 1.71e+03 1.069793e-02 -3.915375e+00
* 2.7327905718e+05 4.04e+00 1.71e+04 1.075232e-02 -4.332930e+00
6 2.7326776014e+05 -9.05e-02 1.71e+03 1.075288e-02 -4.338144e+00
iter chisq delta/lim lambda a b
After 6 iterations the fit converged.
final sum of squares of residuals : 273268
rel. change during last iteration : -9.0493e-07
degrees of freedom (FIT_NDF) : 12
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 150.905
variance of residuals (reduced chisquare) = WSSR/ndf : 22772.3
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 0.0107529 +/- 3.114 (2.896e+04%)
b = -4.33814 +/- 3.678 (84.78%)
correlation matrix of the fit parameters:
a b
a 1.000
b 0.274 1.000
I saw that it produced some result, so I thought everything was OK, but my happiness lasted only a few seconds, until I plotted the output:
Wow. A really good one.
And I'm still here wondering what's wrong and how to get a proper fit of a damped sine wave dataset with gnuplot.
Hope someone knows the answer :)
The function you are fitting to the data is not a good match for it. The envelope of the data is a decaying function, so you want a positive damping parameter a. But then the magnitude of your fitting function cannot exceed 1 for positive x, unlike your data. Also, by using a sine function in your fit you assume something about the phase behavior: the fitted function will always be zero at x = 0. However, your data looks like it should have a large, negative amplitude there.
So let's choose a better fitting function, and give gnuplot a hand by choosing some reasonable initial guesses for the parameters:
f(x)=c*exp(-a*x)*cos(b*x)
a=1./500
b=2*pi/100.
c=-400.
fit f(x) 'data.txt' via a,b,c
plot f(x), "data.txt" w p
gives
I am using a data set of about 54K records and 5 classes (pop), of which one class is insignificant. I am using the caret package and the following call to run rpart:
model <- train(pop ~ pe + chl_small, method = "rpart", data = training)
and I get the following tree:
n= 54259
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 54259 38614 pico (0.0014 0.18 0.29 0.25 0.28)
2) pe< 5004 39537 23961 pico (0 0.22 0.39 2.5e-05 0.38)
4) chl_small< 32070.5 16948 2900 pico (0 0.00012 0.83 5.9e-05 0.17) *
5) chl_small>=32070.5 22589 10281 ultra (0 0.39 0.068 0 0.54) *
3) pe>=5004 14722 1113 synecho (0.0052 0.052 0.0047 0.92 0.013) *
It is obvious that node 5 should be split further, but rpart is not doing it. I tried values from cp = .001 to cp = .1 and also minbucket = 1000 as additional parameters, but saw no improvement.
Appreciate any help on this.
Try running the model with an even smaller cp=0.00001 or cp = -1. If it is still not splitting that node then it means that the split will not improve the overall fit.
You can also try changing the splitting criteria from the default Gini impurity to information gain criterion: parms = list(split = "information")
If you do force it to split, it might be a good idea to do a quick check:
compare the accuracy on the training vs. testing set for the original model and for the model with the small cp.
If the difference between training and testing accuracy is much smaller for the original model, then the model with the small cp probably overfits the data.
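If you want a concrete version of that check, here is a minimal sketch using rpart directly (it assumes training and testing data frames with the same columns as in the question):
library(rpart)

# force a deeper tree: tiny cp plus the information-gain split criterion
fit_deep <- rpart(pop ~ pe + chl_small, data = training, method = "class",
                  parms = list(split = "information"),
                  control = rpart.control(cp = 1e-5, minbucket = 1000))

# compare training vs. testing accuracy to see whether the extra splits overfit
acc <- function(fit, d) mean(predict(fit, newdata = d, type = "class") == d$pop)
acc(fit_deep, training)
acc(fit_deep, testing)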
I am currently doing a PCA on some data with 35 rows and 21 columns, using the R package FactoMineR. I'm doing this for my bachelor thesis and I'm studying forestry, so "I have no clue what I'm doing" :).
It works somehow, and the interpretation is another chapter, but my professors unfortunately also have no clue about this kind of statistics, so they expect the results in nice little Word sheets, with the data neatly arranged into tables.
I print the text output with the following methods:
capture.output(mydata)
summary.PCA(mydata)
summary(mydata)
summary.PCA is a function that comes directly with the FactoMineR package, and I use it because capture.output keeps giving me errors when I try to capture PCA("whatever") with it.
But this output is impossible to import into a table unless I do it all by hand, which I cannot accept as a solution (at least I very much hope it isn't one).
Output like the following, for example, is something I see no way to put into a table:
Call:
PCA(mydata)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
Variance 8.539 2.937 1.896 1.644 1.576 1.071 0.738 0.695 0.652 0.463 0.261 0.184 0.136 0.108 0.049 0.021 0.019 0.010 0.000 0.000 0.000
% of var. 40.662 13.984 9.027 7.830 7.505 5.100 3.517 3.311 3.106 2.203 1.242 0.878 0.650 0.513 0.233 0.102 0.093 0.046 0.000 0.000 0.000
Cumulative % of var. 40.662 54.645 63.672 71.502 79.007 84.107 87.624 90.934 94.041 96.244 97.486 98.363 99.013 99.526 99.759 99.862 99.954 100.000 100.000 100.000 100.000
So is there a way to do this? Do I have to transform the data before I can print it into a table?
I hope very much I have expressed myself clearly!
All the best!
Lukas
The summary.PCA function just writes out the tables; all of the data are available in the output object.
So you can do:
res <- PCA(mydata)
res$eig ### and you will have the table with the eigenvalues in an object
res$ind$coord ## and you will have the coordinate of the individuals in an object
write.infile(res,file="OutputFile.csv") ## and all the outputs will be written in a csv file
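If you only need one of these tables in Word, a simple route (just a sketch) is to round it and write it to its own csv file, which Word or Excel can open directly:
write.csv(round(res$eig, 3), file = "eigenvalues.csv")   ## only the eigenvalue table
## the same works for the other result tables, e.g. res$ind$coord or res$var$coord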
Hope it helps,
Francois
I am trying to understand PCA by finding practical examples online. Sadly most tutorials I have found don't really seem to show simple practical applications of PCA. After a lot of searching, I came across this
http://yatani.jp/HCIstats/PCA
It is a nice, simple tutorial. I want to re-create its results in Matlab, but the tutorial is in R, and so far I have been unsuccessful; I am new to Matlab. I have created the arrays as follows:
Price = [6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2];
Software = [5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3];
Aesthetics = [3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7];
Brand = [4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7];
Then in his example, he does this
data <- data.frame(Price, Software, Aesthetics, Brand)
I did a quick search online, and this apparently converts the vectors into a data frame in R. So in Matlab I did this:
dataTable(:,1) = Price;
dataTable(:,2) = Software;
dataTable(:,3) = Aesthetics;
dataTable(:,4) = Brand;
Now it is the next part I am unsure of.
pca <- princomp(data, cor=TRUE)
summary(pca, loadings=TRUE)
I have tried using Matlab's PCA function
[COEFF SCORE LATENT] = princomp(dataTable)
But my results do not match the ones shown in the tutorial at all. My results are
COEFF =
-0.5958 0.3786 0.7065 -0.0511
-0.1085 0.8343 -0.5402 -0.0210
0.6053 0.2675 0.3179 -0.6789
0.5166 0.2985 0.3287 0.7321
SCORE =
-2.3362 0.0276 0.6113 0.4237
-4.3534 -2.1268 1.4228 -0.3707
-1.1057 -0.2406 1.7981 0.4979
-3.6847 0.4840 -2.1400 1.0586
-1.4218 2.9083 1.2020 -0.2952
-3.3495 -1.3726 0.5049 0.3916
-4.1126 0.1546 -2.4795 -1.0846
-1.7309 0.2951 0.9293 -0.2552
2.8169 0.5898 0.4318 0.7366
3.7976 -2.1655 -0.2402 -1.2622
3.3041 1.0454 -0.8148 0.7667
1.4969 2.9845 0.7537 -0.8187
2.3993 -1.1891 -0.3811 0.7556
1.7836 -0.0072 -0.2255 -0.7276
2.2613 -0.1977 -2.4966 0.0326
4.2350 -1.1899 1.1236 0.1509
LATENT =
9.3241
2.2117
1.8727
0.5124
Yet the results in the tutorial are
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5589391 0.9804092 0.6816673 0.37925777
Proportion of Variance 0.6075727 0.2403006 0.1161676 0.03595911
Cumulative Proportion 0.6075727 0.8478733 0.9640409 1.00000000
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Price -0.523 0.848
Software -0.177 0.977 -0.120
Aesthetics 0.597 0.134 0.295 -0.734
Brand 0.583 0.167 0.423 0.674
Could anyone please explain why my results differ so much from the tutorial? Am I using the wrong Matlab function?
Also, if you are able to provide any other nice, simple practical applications of PCA, that would be very beneficial. I am still trying to get my head around all the concepts in PCA, and I like examples where I can code things up and see the results myself so I can play around with them; I find it easier to learn this way.
Any help would be much appreciated!!
Edit: The issue is purely the scaling.
R code:
summary(princomp(data, cor = FALSE), loadings=T, cutoff = 0.01)
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Price -0.596 -0.379 0.706 -0.051
Software -0.109 -0.834 -0.540 -0.021
Aesthetics 0.605 -0.268 0.318 -0.679
Brand 0.517 -0.298 0.329 0.732
According to the Matlab help you should use this if you want scaling:
Matlab code:
princomp(zscore(X))
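For completeness, the R-side equivalent of that scaled call is to standardize explicitly before the PCA (just a sketch; the loadings should agree with princomp(data, cor = TRUE) up to sign and up to the N versus N-1 divisor in the standard deviations):
R code:
# correlation-based PCA: center and scale the columns explicitly
prcomp(data, center = TRUE, scale. = TRUE)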
Old answer (a red herring):
From help(princomp) (in R):
The calculation is done using eigen on the correlation or covariance
matrix, as determined by cor. This is done for compatibility with the
S-PLUS result. A preferred method of calculation is to use svd on x,
as is done in prcomp.
Note that the default calculation uses divisor N for the covariance
matrix.
In the documentation of the R function prcomp (help(prcomp)) you can read:
The calculation is done by a singular value decomposition of the
(centered and possibly scaled) data matrix, not by using eigen on the
covariance matrix. This is generally the preferred method for
numerical accuracy. [...] Unlike princomp, variances are computed with
the usual divisor N - 1.
The Matlab function apparently uses the svd algorithm. If I use prcomp (without scaling, i.e., not based on correlations) with the example data I get:
> prcomp(data)
Standard deviations:
[1] 3.0535362 1.4871803 1.3684570 0.7158006
Rotation:
PC1 PC2 PC3 PC4
Price -0.5957661 0.3786184 -0.7064672 0.05113761
Software -0.1085472 0.8342628 0.5401678 0.02101742
Aesthetics 0.6053008 0.2675111 -0.3179391 0.67894297
Brand 0.5166152 0.2984819 -0.3286908 -0.73210631
This is (apart from the irrelevant signs) identical to the Matlab output.
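The divisor difference quoted above is also easy to check directly (a sketch using the same data object; there are 16 observations in this example):
R code:
n <- nrow(data)                       # 16 observations here
sd_princomp <- princomp(data)$sdev    # computed with divisor n
sd_prcomp   <- prcomp(data)$sdev      # computed with divisor n - 1
all.equal(unname(sd_prcomp), unname(sd_princomp) * sqrt(n / (n - 1)))  # should be TRUE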