Working with Self Organizing Maps - How do I interpret the results?

I have this data set that I thought would be a good candidate for making a SOM.
So, I converted it to text thusly:
10
12 1 0 0
13 3 0 0
14 21 0 0
19 1983 15 0
24 5329 48 0
29 4543 50 0
34 3164 32 0
39 1668 22 1
44 459 4 0
49 17 0 0
I'm using Octave, so I transformed the data with these commands:
dataIn = fopen('data.txt','r');
n = fscanf(dataIn,'%d',1);
D = fscanf(dataIn,'%f'); % D is a column vector holding all remaining values
D = D'; % transpose: D is now a row vector
D = reshape(D, 4, []); % reshape D into a 4 x n matrix, one column per line of the file
D = D(2:4, :); % the dimensions used for the SOM are the bottom three rows (the age column is dropped)
Now, I'm applying an SOM script to produce a map using D.
The script is here
and it's using findBMU defined as:
% finds the best matching unit (BMU) in the SOM O for the input vector iv
function [r c] = findBMU( iv, O )
dist = zeros(size(O));
for i = 1:3
  dist(:,:,i) = O(:,:,i) - iv(i); % component-wise difference from the input vector
end
dist = sum(dist.^2, 3); % squared Euclidean distance at every map node
[v r] = min(min(dist, [], 2)); % row of the closest node
[v c] = min(min(dist, [], 1)); % column of the closest node
In the end, it starts with a random map (initial map image) and it becomes the trained map (final map image).
The thing is, I don't know what my SOM is saying. How do I read it?

Firstly, you should be aware that Octave provides at best an approximation to the SOM methodology. The main methodological advantage of the SOM is transparent access to (all of) the implied parameters, and those parameters cannot be accessed here.
Secondly, considering your data, it does not make much sense to first destroy information by summarizing it and then feed a SOM with the result. Basically you have four variables in the table shown above: age, total N, single N, and twin N. What you have destroyed is the information about the region.
Thus you put three distributions into the SOM. The only thing you could expect is clusters. Yet the SOM is not built for building clusters. Instead, the SOM is used for diagnostic and predictive modeling, in order to find the most accurate model and the most relevant variables. Note the term "best matching unit"!
In your example, however, you find just a distribution in the SOM. Basically, there is no interpretation, as there are neither predictor variables nor a predictive/diagnostic purpose.
You could build a model, for instance one determining the similarity of the distributions. Yet for that you should use a non-parametric goodness-of-fit test (Kolmogorov-Smirnov), not the SOM.
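As a concrete illustration of that last point, here is a minimal sketch (in R, not Octave) of the kind of two-sample Kolmogorov-Smirnov comparison meant above; the vectors just re-enter the counts from the table, expanded back into raw observations:
# compare the distributions implied by the "total N" and "single N" columns
age    <- c(12, 13, 14, 19, 24, 29, 34, 39, 44, 49)
total  <- c(1, 3, 21, 1983, 5329, 4543, 3164, 1668, 459, 17)
single <- c(0, 0, 0, 15, 48, 50, 32, 22, 4, 0)
ks.test(rep(age, times = total), rep(age, times = single))  # two-sample KS test on the expanded data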

Related

R: Lagrange multiplier test (lm.LMtests): what to put for the listw argument

I want to do a Lagrange multiplier test on a panel dataset of the following type:
UGA Date Sales Nb_AM Nb_BX ......
A 01/2017 1 4 14
A 02/2017 8 5 17
A 03/2017 26 2 24
B 01/2017 3 3 35
B 02/2017 5 10 42
B 03/2017 8 24 2
I want to use the following command: lm.LMtests()
However, according to the R documentation, I need to supply an argument of type "listw" to lm.LMtests, but I have no idea what to put in my case. Could you help me?
For the moment my code is the following:
fusion2 <- read_excel("C:/Users/david/OneDrive/Bureau/Master data/Mémoire data analyst/Bases de données/Fusion/fusion.xlsx")
modeleam <- Sales ~ Nb_AM + Nb_BX +
Total_PdS_sensibilisés_aux_événement_AM + Mails_AM_ouvert +
Mails_AM_non_ouvert + Total_PdS_sensibilisés_aux_RP_AM +
Total_PdS_sensibilisés_aux_Staff_AM + Total_PdS_sensibilisés_aux_Congrés_AM +
Total_PdS_sensibilisés_aux_Opportunités_AM
mcoam <-lm(modeleam, data=fusion2)
lagrangeam <- lm.LMtests(mcoam, ,test="all")
Thanks in advance
Once you are into the subject matter, it's pretty basic. This test was created for spatial statistics, and a listw object is nothing more than a representation of neighbour dependence, that is, how strongly one value could potentially be influenced by neighbouring values.
For that you need, for example, a simple-feature object with the geometries of a landscape or a city, so that you can assign each value to a specific polygon. From this pattern you can create a neighbourhood and then the neighbourhood weights (listw object).
Small tutorial:
library(spdep); library(sf)
#Get your data
shape_and_data <- st_read("your/shape")
#Create your neighbourhood, nb-object
data_nb <- poly2nb(shape_and_data)
#Create the neighbour weights, listw-object
data_listw <- nb2listw(data_nb)
#Calculate
lm.LMtests(lm(...), listw = data_listw, test = "all")
This is a really basic example. For creating the neighbourhood (nb-object) you can choose different methods and for the weights (listw) there are also several methods.
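For instance, if your observations are points rather than polygons, a k-nearest-neighbour based listw is one common alternative. This is only a sketch under that assumption; the file name, the choice of k and the row-standardised style "W" are placeholders, not taken from the question:
library(spdep); library(sf)
#Point geometries instead of polygons
pts <- st_read("your/points")
coords <- st_coordinates(pts)
#Link each observation to its 5 nearest neighbours, then build the weights
knn_nb <- knn2nb(knearneigh(coords, k = 5))
knn_listw <- nb2listw(knn_nb, style = "W")
#Calculate
lm.LMtests(lm(...), listw = knn_listw, test = "all")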
Hope it helped a bit,
Loubert

Adjusted survival curve based on weighted Cox regression

I'm trying to make an adjusted survival curve based on a weighted Cox regression performed on a case-cohort data set in R, but unfortunately I can't make it work. I was therefore hoping that some of you may be able to figure out why it isn't working.
In order to illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data,subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
The data set is looking like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then, I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves-function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted Cox regression performed on a case-cohort data set, so generally I would also be interested in hearing whether there are alternatives to the adjusted survival curves.
There are two reasons it is throwing errors.
The ggadjustedcurves function is not being given a coxph object, which its help page indicates is the intended first argument.
The specification of the variable argument is incorrect. The correct way to specify a column is with a length-1 character vector that matches one of the names in the formula. You gave it the column itself, a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not give you everything you want, but it does answer the "why the error" part of your question. You might want to review the methods used by Terry Therneau, Cynthia S. Crowson, and Elizabeth J. Atkinson in their vignette on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf
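As for graphically showing the effect of a continuous variable such as age, one possible workaround (my own sketch, not part of the answer above, and it assumes ggadjustedcurves will accept any grouping column present in data) is to cut the variable into a few bands and pass the band column as variable:
library(survival); library(survminer)
# hypothetical age bands; the cut points are arbitrary and should be chosen to suit the data
ccoh.data$age_band <- cut(ccoh.data$age, breaks = c(0, 2, 4, Inf))
fit.cox <- coxph(Surv(edrel, rel) ~ stage + histol + age, data = ccoh.data)
ggadjustedcurves(fit.cox, variable = "age_band", data = ccoh.data)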

How would I build a selfStart with custom formula or insert my formula into nls()?

Please bear with me, as this is my first post in my first month of starting with R. I have some biphasic decay data, an example of which is included below:
N   Time        Signal
1   0.0001101   2.462455
2   0.0002230   2.362082
3   0.0003505   2.265309
4   0.0004946   2.180061
5   0.0006573   2.136348
6   0.0008411   2.071639
7   0.0010487   2.087519
8   0.0012832   1.971550
9   0.0015481   2.005190
10  0.0018473   1.969274
11  0.0021852   1.915299
12  0.0025669   1.893703
13  0.0029981   1.905901
14  0.0034851   1.839294
15  0.0040352   1.819827
16  0.0046565   1.756207
17  0.0053583   1.704472
18  0.0061510   1.630652
19  0.0070464   1.584315
20  0.0080578   1.574424
21  0.0092002   1.493813
22  0.0104905   1.349054
23  0.0119480   1.318979
24  0.0135942   1.242094
25  0.0154536   1.115491
26  0.0175539   1.065381
27  0.0199262   0.968143
28  0.0226057   0.846351
29  0.0256323   0.765699
30  0.0290509   0.736105
31  0.0329122   0.588751
32  0.0372736   0.539969
33  0.0421999   0.467340
34  0.0477642   0.389153
35  0.0540492   0.308323
36  0.0611482   0.250392
37  0.0691666   0.247006
38  0.0782235   0.177039
39  0.0884534   0.174750
40  0.1000082   0.191918
I have multiple curves to fit with a double falling exponential, i.e. a sum of two exponential decay terms in which some fraction of particle A decays fast (described by k1) and the remaining fraction decays slowly (described by k2); here A is the particle fraction, k1 is the fast rate, k2 is the slow rate, and T is time. I believe it should be entered as
DFE <- y ~ (A*exp(-c*t)) + ((A-b)*exp(-d*t))
I would like to create a selfStart code to apply to over 40 sets of data without having to guess the start values each time. I found some R documentation for this, but can't figure out where to go from here.
The problem is that I am very new to R (and programming in general) and really don't know how to do this. I have had success (meaning convergence was achieved) with
nls(Signal~ SSasymp(Time, yf, y0, log_alpha), data = DecayData)
which is a close estimate but not a truly good model. I was hoping I could somehow alter the SSasymp code to work with my equation, but I think that I am perhaps too naive to know even where to begin.
I would like to compare the asymptotic model with my double falling exponential, but the double falling exponential model never seems to reach convergence despite many, many, many trials and permutations. At this point, I am not even sure if I have entered the formula correctly anymore. So, I am wondering how to write a selfStart that would ideally give me extractable coefficients/half-times.
Thanks so much!
Edit:
As per Chris's suggestion in the comments, I have tried to insert the formula itself into the nls() command like so:
DFEm = nls("Signal" ~ (A*exp(-c*Time)) + ((A-b)*exp(-d*Time)), data = "Signal", trace= TRUE)
which returns
"Error in nls(Signal ~ (A * exp(-c * Time)) + ((A - b) * exp(-d * : 'data' must be a list or an environment"
So I am unsure of how to proceed, as I've checked spelling and capitalization. Is there something silly that I am missing?
Thanks in advance!
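A note on that error and on the selfStart request, offered only as a sketch (it assumes the data frame is called DecayData with columns Time and Signal): nls() wants data to be the data frame itself, not the string "Signal", and the response name in the formula should not be quoted. Also, base R already ships a selfStart model for a sum of two exponentials, SSbiexp, which computes its own starting values:
# sketch, assuming DecayData has columns Time and Signal
# SSbiexp fits A1*exp(-exp(lrc1)*input) + A2*exp(-exp(lrc2)*input)
DFEss <- nls(Signal ~ SSbiexp(Time, A1, lrc1, A2, lrc2), data = DecayData, trace = TRUE)
summary(DFEss)
rates <- exp(coef(DFEss)[c("lrc1", "lrc2")]) # back-transform to the two rate constants
log(2) / rates                               # corresponding half-lives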

Stata twoway graph of means with confidence intervals

Using
clear
score group test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
I want to scatter plot mean score by group for each test (same graph) with confidence intervals (the real data has thousands of observations). The resulting graph would have two sets of two dots: one set for test=="A" (group==0 vs group==1) and one set for test=="B" (group==0 vs group==1).
My current approach works but it is laborious. I compute all of the needed statistics using egen: the mean, number of observations, standard deviations...for each group by test. I then collapse the data and plot.
There has to be another way, no?
I assumed that Stata would be able to take the score, group, and test variables as input and then compute and present this pretty standard graph.
After spending a lot of time on Google, I had to ask.
Although there are user-written programs, I lean towards statsby as a basic approach here. Discussion is accessible in this paper.
This example takes your data example (almost executable code). Some choices depend on the large confidence intervals implied. Note that if your version of Stata is not up-to-date, the syntax of ci will be different. (Just omit means.)
clear
input score group str1 test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
save cj12 , replace
* test A
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "A"
gen test = "A"
save cj12results, replace
* test B
use cj12
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "B"
gen test = "B"
append using cj12results
* graph; show sample sizes too, but where to show them is empirical
set scheme s1color
gen where = -20
scatter mean group, ms(O) mcolor(blue) || ///
rcap ub lb group, lcolor(blue) ///
by(test, note("95% confidence intervals") legend(off)) ///
subtitle(, fcolor(ltblue*0.2)) ///
ytitle(score) xla(0 1) xsc(r(-0.25 1.25)) yla(-10(10)10, ang(h)) || ///
scatter where group, ms(none) mla(N) mlabpos(12) mlabsize(*1.5)
We can't comment on your complete code or your graph, because you show neither.

Cluster center mean of DBSCAN in R?

Using dbscan in package fpc I am able to get an output of:
dbscan Pts=322 MinPts=20 eps=0.005
0 1
seed 0 233
border 87 2
total 87 235
but I need to find the cluster center (mean of cluster with most seeds). Can anyone show me how to proceed with this?
You need to understand that as DBSCAN looks for arbitrarily shaped clusters, the mean can be well outside of the cluster. Looking at means of DBSCAN clusters therefore is not really sensible.
Just index back into the original data using the cluster ID of your choice. Then you can easily do whatever further processing you want to the subset. Here is an example:
library(fpc)
n = 100
set.seed(12345)
data = matrix(rnorm(n*3), nrow=n)
data.ds = dbscan(data, 0.5)
> data.ds
dbscan Pts=100 MinPts=5 eps=0.5
0 1 2 3
seed 0 1 3 1
border 83 4 4 4
total 83 5 7 5
> colMeans(data[data.ds$cluster==0, ])
[1] 0.28521404 -0.02804152 -0.06836167
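And to get the specific quantity asked about, the mean of the cluster with the most members, one possible follow-up (a sketch; note that label 0 is the noise/unassigned group in fpc's dbscan output, so it is excluded here):
cl <- data.ds$cluster
biggest <- as.integer(names(which.max(table(cl[cl != 0])))) # ID of the largest real cluster
colMeans(data[cl == biggest, , drop = FALSE])               # its per-dimension mean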
