Plot more than 50 components in the Rssa package in R

require(Rssa)
t=ssa(co2,200) #200 here should be the number of components
plot(t) # this way it plots only the first 50 not 200!
The code above produces a graph of the first 50 components only. I need to plot more than 50 components.
I tried
plot(t$sigma[1:200],type='l',log='y')
but it didn't work!
Example : similar to this case
accessing eigenvalues in RSSA package in R

Looking at the help page ?ssa, we see a parameter named neig, which is documented as:
integer, number of desired eigentriples. If 'NULL', then sane default value will be used, see 'Details'
Using that as a named parameter:
t=ssa(co2, neig=200)
plot(t)
And:
> t$sigma
[1] 78886.190749 329.031810 327.198387 184.659743 88.695271 88.191805 52.380502
[8] 40.527875 31.329930 29.409384 27.157698 22.334446 17.237926 14.175096
[15] 14.111402 12.976716 12.943775 12.216524 11.830642 11.614243 11.226010
[22] 10.457529 10.435998 (remaining values snipped)
(Apparently, the package authors do not consider 200 to be a "sane" number to use, although looking at the results from neig=50 and neig=200 I do not see a discernible cutpoint at the 50th eigenvalue. But they must set the default somewhere in the code, which I've shown you how to access.)
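A possible reason the plot(t$sigma[1:200], ...) attempt in the question showed nothing new, assuming only 50 values had been computed by default: indexing a vector past its end yields NA, and NA points are not drawn. A minimal base-R sketch of that behaviour follows; sigma_demo is a made-up stand-in, not Rssa output.

```r
# Hypothetical stand-in for a vector that holds only the computed values.
sigma_demo <- c(100, 10, 1)

# Asking for more elements than exist pads the result with NA.
padded <- sigma_demo[1:5]
print(padded)

# Count how many of the requested values are actually available.
n_available <- sum(!is.na(padded))
print(n_available)
```

Checking length(t$sigma) before plotting is a quick way to see how many values are really there.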

Related

Why does this xlim error occur in circlize initialization?

I want to initialize a new chord diagram with circlize, but I'm getting an error that doesn't seem to make any sense given the data I'm feeding into it:
Error: Since `xlim` is a matrix, it should have same number of rows as the length of the level of `sectors` and number of columns of 2.
I understand the requirement, but when I try to produce different plots, it fails for some but not others. Here's the relevant code snippet with some output for debugging
dev.new()
circos.clear()
circos.par(cell.padding=c(0,0,0,0), track.margin=c(0,0.01), gap.degree=1)
xlim = cbind(0, regionTotal)
print(class(region))
print(length(region))
print(class(xlim))
print(dim(xlim))
circos.initialize(factors=region, xlim=xlim)
The output for a plot that works fine:
[1] "character"
[1] 24
[1] "matrix" "array"
[1] 24 2
And for one that returns the error:
[1] "character"
[1] 50
[1] "matrix" "array"
[1] 50 2
Error: Since `xlim` is a matrix, it should have same number of rows as the length of the level of `sectors` and number of columns of 2.
I am aware of these questions:
this one led me to check the class
and this one led me to check my circlize version (0.4.11)
What am I missing??? Thanks for any help you can provide.
After a lot of hair pulling, I figured out the problem: there was a repeated value in my region variable (the factors or sectors entry in circos.initialize), so the effective number of sectors was lower than the dimension of the variable. Hopefully nobody else is dumb enough to make this mistake, but just in case they are, now they can have an additional thing to check if they come across this error.
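The check described above can be sketched in base R before calling circos.initialize. The data here are invented for illustration (region and regionTotal mirror the names in the question, but the values are hypothetical).

```r
# Hypothetical data: "B" appears twice, as in the failing case.
region <- c("A", "B", "C", "B")
regionTotal <- c(10, 20, 30, 20)
xlim <- cbind(0, regionTotal)

# Number of distinct sectors versus number of xlim rows.
n_sectors <- length(unique(region))
n_rows <- nrow(xlim)

# Collect the values that occur more than once.
dups <- unique(region[duplicated(region)])

if (n_sectors != n_rows) {
  message("xlim has ", n_rows, " rows but only ", n_sectors,
          " distinct sectors; repeated: ", paste(dups, collapse = ", "))
}
```

If the message fires, the repeated entries need to be renamed or aggregated so that each sector has exactly one row in xlim.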

Correcting mis-typed data in R

I want to correct wrongly entered data in R. For example if I have a vector
V=c('PO','PO','P0')
I want R to recognize that the 0 in the last entry should be an O and to change it. Is there any way to do that? I have been trying to use correctTypos in the deducorrect package in R. However, I am having some problems with the editset: I cannot seem to specify that all the entries have to be letters. Any help is greatly appreciated.
Another example would be
V2=c('PL','P1','PL','XX')
That 1 should be an L.
The Jaro-Winkler distance was developed to find data entry issues. But on entries only 2 characters long that is going to be difficult, as a single error tends to score higher than you want it to. You could combine it with other distance measures available in the stringdist package, but in this case that might be too complicated.
Given your examples you might want to use the base function chartr and set up a replacement of numbers to letters.
chartr("01","OL", V2)
[1] "PL" "PL" "PL" "XX"
chartr("01","OL", V)
[1] "PO" "PO" "PO"
This will always replace a 1 with an L and a 0 (zero) with an O. You can add 5 for S, and so on. But if there are other combinations it might get complicated.
Also note that the next iteration of the deducorrect package is the deductive package.
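The chartr mapping above can be extended and combined with a check for entries that still contain digits afterwards. A small sketch, assuming a made-up vector V3 and the extra 5-to-S rule mentioned above:

```r
# Hypothetical input: three fixable typos and one digit with no mapping.
V3 <- c("P0", "P1", "5X", "P7")

# Map 0 -> O, 1 -> L, 5 -> S, character by character.
fixed <- chartr("015", "OLS", V3)
print(fixed)

# Flag entries that still contain a digit and need manual review.
leftover <- grepl("[0-9]", fixed)
print(fixed[leftover])
```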

How to get an Element from a vector without using numbers or indices?

Lets say I have these two vectors in my R workspace with the following content:
[1] "Atom.Type"  "Molar.Mass"
> Atom.Type
[1] "Oxygen" "Lithium" "Nitrogen" "Hydrogen"
> Molar.Mass
[1] 16 6.9 14 1
I now want to assign the Molar.Mass belonging to "Lithium" (i.e. 6.9) to a new variable called mass.
The problem is: I have to do that without using any numbers or indices.
Does anyone have a suggestion for this problem?
This should work:
mass <- Molar.Mass[Atom.Type == "Lithium"]
Clearly this assumes the two vectors are of the same length and in the same order.
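A self-contained sketch of the logical-index lookup, with the vectors recreated from the question. The named-vector variant (masses, mass2) is an added alternative, not part of the original answer.

```r
# Vectors as shown in the question.
Atom.Type <- c("Oxygen", "Lithium", "Nitrogen", "Hydrogen")
Molar.Mass <- c(16, 6.9, 14, 1)

# The lookup only makes sense if both vectors line up element by element.
stopifnot(length(Atom.Type) == length(Molar.Mass))

# Logical indexing: keep the mass where the type matches.
mass <- Molar.Mass[Atom.Type == "Lithium"]
print(mass)

# Alternative: attach the names once, then look up by name.
masses <- setNames(Molar.Mass, Atom.Type)
mass2 <- masses[["Lithium"]]
print(mass2)
```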

Retrieve best number of clusters from NbClust

Many functions in R provide some sort of console output (such as NbClust() etc.) Is there any way of retrieving some of the output (e.g. a certain integer value) without having a look at the output? Any way of reading from the console?
Imagine the output would look like the following output from example code provided in the package manual:
[1] "Frey index : No clustering structure in this data set"
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 1 proposed 2 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 2 proposed 6 as the best number of clusters
* 1 proposed 7 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 4
*******************************************************************
How would I retrieve the value 4 from the last line of the above output?
It is better to work with objects rather than output in the console. Any "good" function hopefully returns structured output that can be accessed using the $ or @ operators; use str() to see the object's structure.
In your case, I think this should work:
length(unique(res$Best.partition))
Another option is:
max(unlist(res[4]))
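If a function really offers nothing but printed text, the console output can be captured and parsed as a fallback. This is a sketch only: noisy() is a hypothetical stand-in that prints lines like those in the question, and capture.output() only catches text written to standard output, which is an assumption about how any given function prints.

```r
# Hypothetical stand-in that only prints its conclusion.
noisy <- function() {
  cat("* 2 proposed 4 as the best number of clusters\n")
  cat("* According to the majority rule, the best number of clusters is  4\n")
  invisible(NULL)
}

# Capture everything printed to standard output as a character vector.
out <- capture.output(noisy())

# Pick the conclusion line and extract the trailing integer.
line <- grep("majority rule", out, value = TRUE)
best_k <- as.integer(sub(".*clusters is[[:space:]]*([0-9]+).*", "\\1", line))
print(best_k)
```

Parsing printed text is brittle, so reading the value from the returned object remains the safer route whenever one is available.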

gnuplot computing stats over multiple columns

I have a simple 9-column file. I want to compute certain statistics for each column and then plot them (using gnuplot).
1) This is how I compute statistics for every column excluding the first one.
stats 'data' every ::2 name "stats"
2) In the output screen I can see that the operation is successful. Note that the number of columns/records is 8
* FILE:
Records: 8
Out of range: 0
Invalid: 0
Blank: 0
Data Blocks: 1
* COLUMNS:
Mean: 6.5000 491742.6625
Std Dev: 2.2913 703.4865
Sum: 52.0000 3.93394e+06
Sum Sq.: 380.0000 1.93449e+12
Minimum: 3.0000 [0] 490312.0000 [2]
Maximum: 10.0000 [7] 492643.5000 [7]
Quartile: 4.5000 491329.5000
Median: 6.5000 491911.1500
Quartile: 8.5000 492252.2500
Linear Model: y = 121.8 x + 4.91e+05
Correlation: r = 0.3966
Sum xy: 2.558e+07
3) Now I can access statistics on the first 2 columns by appending _x and _y like this
print stats_median_x
print stats_median_y
My questions are:
How can I access statistics (let's say medians) for the remaining 6 columns?
How could I plot, let's say, a line over all medians against some X axis?
I know that I can simply add a python script to pre-compute all this, but I would prefer to avoid it if there is an easy way to do it using gnuplot itself.
Thanks!
Short answer(s)
"How can I access statistics of the other columns?"
With stats 'data' using n you access the nth column.
"How can I plot, for example, all medians?"
For example, a set print and a do for loop can create a data file that you can use for the plot.
A working solution
set print "StatDat.dat"
do for [i=2:9] { # Here you will use i for the column.
stats 'data.dat' u i nooutput ;
print i, STATS_median, STATS_mean , STATS_stddev # ...
}
set print
plot "StatDat.dat" us 1:2 # or whatever column you want...
Some words more about it
Asking gnuplot itself for help with help stats, you can read a lot of interesting things :-).
Syntax:
stats 'filename' [using N[:M]] [name 'prefix'] [[no]output]
This command prepares a statistical summary of the data in one or two columns of a file. The using specifier is interpreted in the same way as for plot commands. See plot for details on the index, every, and using directives.
From the first sentence we can see that it prepares statistics for one or at most two columns at a time (a pity; let's see what the future brings).
From the second sentence we can see that it follows the same syntax as the plot command:
so stats 'data' using 3 will give you the statistics of the 3rd column in x
and stats 'data' using 4:5 those of the 4th and 5th columns in x,y.
Notes about your interpretations
You said
This is how I compute statistics for every column excluding the first one.
stats 'data' every ::2 name "stats"
Not really: this gives the statistics for the first two columns, excluding the first two lines; indeed, the line counter starts from 0, not from 1.
As a consequence, when we read
Records: 8
it means that 8 lines were used: your file had 10 (usable) lines, you specified every ::2, which skips the first two, so 8 records remain for the statistics.
This also helps to understand what help stats means by
STATS_records # total number of in-range data records
implying "used to compute this statistic".
Tested on gnuplot 4.6 patchlevel 4
Working on gnuplot Version 5.0 patchlevel 1

Resources