Error when using ''ward'' method with pvclust R package - r

I am having trouble with a cluster analysis that I am trying to run with the pvclust package.
Specifically, I have a data matrix of species (rows) by sampling stations (columns). I want to perform a cluster analysis (CA) to group my sampling stations according to species abundance (which I have previously log(x+1) transformed).
After preparing my matrix, I tried to run the CA with pvclust, using Ward's clustering method and Bray-Curtis as the distance index. However, every time I get the following error message:
''Error in hclust(distance, method = method.hclust) :
invalid clustering method''
I then tried the same analysis with another clustering method and had no problem. I also ran the same analysis using the hclust function together with the vegan package, again without any problems.
To illustrate the problem, here is part of my matrix and the script that I used to perform the analysis:
P1 P2 P3 P4 P5 P6
1 10.8750000 3.2888889 2.0769231 1.4166667 3.2395833 5.333333
3 0.3645833 0.3027778 0.3212038 0.7671958 0.4993676 0.000000
4 0.0000000 0.0000000 2.3500000 0.0000000 0.0000000 0.264000
5 0.0000000 0.7333333 0.2692308 0.0000000 0.2343750 0.000000
6 0.0000000 0.9277778 0.0000000 0.2936508 0.7291667 0.000000
7 0.4166667 6.3500000 1.0925463 0.5476190 0.1885169 0.000000
8 1.6250000 0.0000000 0.0000000 0.0000000 5.2187500 0.000000
9 0.0000000 0.8111111 0.0000000 0.0000000 0.0000000 0.000000
10 2.6770833 0.6666667 2.3304890 4.5906085 2.9652778 0.000000
15 1.8020833 0.9666667 1.4807137 3.3878968 0.1666667 0.000000
16 17.8750000 4.9555556 1.4615385 6.5000000 7.8593750 7.666667
19 4.5312500 1.0555556 3.5766941 6.7248677 2.3196181 0.000000
20 0.0000000 0.6777778 0.5384615 0.0000000 0.0000000 0.000000
21 0.0000000 0.9777778 0.0000000 0.2500000 0.0000000 0.000000
24 1.2500000 3.0583333 0.1923077 0.0000000 4.9583333 0.000000
25 0.0000000 0.0000000 2.5699634 0.0000000 0.0000000 0.000000
26 6.6666667 2.2333333 24.8730020 55.9980159 17.6239583 0.000000
Here P1-P6 are my sampling stations, and the numbers in the leftmost column identify my species. I'll refer to this example matrix as ''platforms''.
I then used the following code:
dist <- function(x, ...){
vegdist(x, ...)
}
result<-pvclust(platforms,method.dist = "bray",method.hclust = "ward")
Note that I ran the first three lines because the Bray-Curtis index isn't originally available in the pvclust package. Running them allowed me to specify the Bray-Curtis index in the pvclust function.
Does anyone know why it doesn't work with the pvclust package?
Any help will be much appreciated.
Kind regards,
Marie

There are two related issues:
The value passed as method.hclust must be a method name that hclust accepts. In theory pvclust checks for "ward" and converts it to "ward.D", but it is safer to pass one of the current names, "ward.D" or "ward.D2", explicitly.
Defining your own dist in the workspace does not change the dist that pvclust uses internally, so you cannot overwrite it in that fashion. However, you can pass a custom distance function to pvclust through method.dist.
For instance, this should work:
library(vegan)
library(pvclust)
sample.data <- "P1 P2 P3 P4 P5 P6
10.8750000 3.2888889 2.0769231 1.4166667 3.2395833 5.3333330
0.3645833 0.3027778 0.3212038 0.7671958 0.4993676 0.0000000
0.0000000 0.0000000 2.3500000 0.0000000 0.0000000 0.2640000
0.0000000 0.7333333 0.2692308 0.0000000 0.2343750 0.0000000
0.0000000 0.9277778 0.0000000 0.2936508 0.7291667 0.0000000
0.4166667 6.3500000 1.0925463 0.5476190 0.1885169 0.0000000
1.6250000 0.0000000 0.0000000 0.0000000 5.2187500 0.0000000
0.0000000 0.8111111 0.0000000 0.0000000 0.0000000 0.0000000
2.6770833 0.6666667 2.3304890 4.5906085 2.9652778 0.0000000
1.8020833 0.9666667 1.4807137 3.3878968 0.1666667 0.0000000
17.8750000 4.9555556 1.4615385 6.5000000 7.8593750 7.6666670
4.5312500 1.0555556 3.5766941 6.7248677 2.3196181 0.0000000
0.0000000 0.6777778 0.5384615 0.0000000 0.0000000 0.0000000
0.0000000 0.9777778 0.0000000 0.2500000 0.0000000 0.0000000
1.2500000 3.0583333 0.1923077 0.0000000 4.9583333 0.0000000
0.0000000 0.0000000 2.5699634 0.0000000 0.0000000 0.0000000
6.6666667 2.2333333 24.8730020 55.9980159 17.6239583 0.0000000"
platforms <- read.table(text = sample.data, header = TRUE)
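# pvclust clusters the columns of the data, while vegdist() computes
# dissimilarities between rows, so transpose inside the custom function.
result <- pvclust(platforms,
                  method.dist = function(x) {
                    vegdist(t(x), method = "bray")
                  },
                  method.hclust = "ward.D")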
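To see what the custom distance function has to produce, here is a minimal base-R sketch. The helper bray_cols and the toy matrix are hypothetical illustrations, not part of pvclust or vegan; the helper computes Bray-Curtis dissimilarities between columns by hand (the role that vegdist on the transposed data would play) and passes the result to hclust with one of the accepted Ward names.

```r
# Hypothetical helper: Bray-Curtis dissimilarity between the COLUMNS of x,
# i.e. sum(|a - b|) / sum(a + b) for every pair of columns a, b.
bray_cols <- function(x) {
  x <- as.matrix(x)
  n <- ncol(x)
  d <- matrix(0, n, n, dimnames = list(colnames(x), colnames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(x[, i] - x[, j])) / sum(x[, i] + x[, j])
    }
  }
  as.dist(d)
}

# Toy abundance matrix (made-up numbers): 3 species in rows, 3 stations in columns
toy <- cbind(P1 = c(10, 0, 2), P2 = c(8, 1, 2), P3 = c(0, 5, 0))

d <- bray_cols(toy)

# hclust accepts "ward.D" and "ward.D2" as Ward method names
hc <- hclust(d, method = "ward.D")
```

The resulting dist object has one entry per pair of stations, which is the shape a function supplied as method.dist should return.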

Related

Calculating median of observations in particular set of columns in R

I have an sf object containing the following columns of data:
HR60 HR70 HR80 HR90 HC60 HC70 HC80 HC90
0.000000 0.000000 8.855827 0.000000 0.0000000 0.0000000 0.3333333 0.0000000
0.000000 0.000000 17.208742 15.885624 0.0000000 0.0000000 1.0000000 1.0000000
1.863863 1.915158 3.450775 6.462453 0.3333333 0.3333333 1.0000000 2.0000000
...
How can I calculate the median of the HR60 to HR90 columns for each observation and place it in a new column, say HR_median? I tried to use apply(), but it only seems to work on the whole dataset, and I need only these 4 columns to be considered.
We can select those columns before calling apply():
df1$HR_median <- apply(subset(df1, select = HR60:HR90), 1, median)
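A small self-contained sketch of the same idea, using a plain data frame built from the three rows shown in the question (only one HC column is included for brevity). The vector hr_cols is a hypothetical name introduced here. With a real sf object the geometry column tends to stay attached when subsetting, so it may be necessary to drop it first, for example with sf::st_drop_geometry().

```r
# Rebuild the first rows of the question's data as an ordinary data frame
df1 <- data.frame(
  HR60 = c(0, 0, 1.863863),
  HR70 = c(0, 0, 1.915158),
  HR80 = c(8.855827, 17.208742, 3.450775),
  HR90 = c(0, 15.885624, 6.462453),
  HC60 = c(0, 0, 0.3333333)
)

# Name the columns explicitly so the selection does not depend on column order
hr_cols <- c("HR60", "HR70", "HR80", "HR90")

# MARGIN = 1 applies median() to each row of the selected columns
df1$HR_median <- apply(df1[, hr_cols], 1, median)
```

Selecting by name does the same job as subset(df1, select = HR60:HR90) and keeps working if the columns are later reordered.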

How can I calculate the progressive mean in a vector BUT restarting when a condition is met?

Given the following vector,
x<-c(0,0,0,5.0,5.1,5.0,6.5,6.7,6.0,0,0,0,0,3.8,4.0,3.6)
I would like to have a vector with the cumulative mean, like
cumsum(x)/seq_along(x)
but restarting the computation each time the difference between two subsequent values is greater than 1.3 or less than -1.3. My aim is to obtain a vector like
d<-c(0,0,0,5,5.05,5.03,6.5,6.6,6.37,0,0,0,0,3.8,3.9,3.8)
You can use cumsum(abs(diff(x)) > 1.3) to define groups, which are then used in aggregate to restart cumsum(x)/seq_along(x) each time the difference is greater than 1.3 or less than -1.3.
unlist(aggregate(x, list(c(0, cumsum(abs(diff(x)) > 1.3))),
function(x) cumsum(x)/seq_along(x))[,2])
# [1] 0.000000 0.000000 0.000000 5.000000 5.050000 5.033333 6.500000 6.600000
# [9] 6.400000 0.000000 0.000000 0.000000 0.000000 3.800000 3.900000 3.800000
Maybe you can try ave + findInterval like below
ave(x,findInterval(seq_along(x),which(abs(diff(x))>1.3)+1),FUN = function(v) cumsum(v)/seq_along(v))
which gives
[1] 0.000000 0.000000 0.000000 5.000000 5.050000 5.033333 6.500000 6.600000
[9] 6.400000 0.000000 0.000000 0.000000 0.000000 3.800000 3.900000 3.800000
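The two answers can be combined into a step-by-step sketch that makes the grouping explicit: build the group ids with cumsum as in the first answer, then apply the running mean within each group with ave as in the second. The names grp and restart_mean are introduced here for illustration only.

```r
x <- c(0, 0, 0, 5.0, 5.1, 5.0, 6.5, 6.7, 6.0, 0, 0, 0, 0, 3.8, 4.0, 3.6)

# A new group starts wherever the jump from the previous value exceeds 1.3
# in absolute value; the leading 0 assigns the first element to group 0.
grp <- c(0, cumsum(abs(diff(x)) > 1.3))

# Running mean computed separately inside each group, returned in input order
restart_mean <- ave(x, grp, FUN = function(v) cumsum(v) / seq_along(v))
```

Printing grp shows where each restart happens, which makes it easy to check the threshold before trusting the means.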

How should I interpret the results of function multinom in R?

I have a dataset with five categorical variables. I ran a multinomial logistic regression with the function multinom in package nnet, and then derived the p values from the coefficients. But I do not know how to interpret the results.
The p values were derived according to UCLA's tutorial: https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ .
Just like this:
z <- summary(test)$coefficients/summary(test)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
And I got this:
(Intercept) Age1 Age2 Age3 Age4 Unit1 Unit2 Unit3 Unit4 Unit5 Level1 Level2 Area1 Area2
Not severe 0.7388029 9.094373e-01 0 0.000000e+00 0.000000e+00 0 0.75159758 0 0 0.0000000 0.8977727 0.9333862 0.6285447 0.4457171
Very severe 0.0000000 1.218272e-09 0 6.599380e-06 7.811761e-04 0 0.00000000 0 0 0.0000000 0.7658748 0.6209889 0.0000000 0.0000000
Severe 0.0000000 8.744405e-08 0 1.052835e-06 3.299770e-04 0 0.00000000 0 0 0.0000000 0.8843606 0.4862364 0.0000000 0.0000000
Just so so 0.0000000 1.685045e-07 0 5.507560e-03 2.973261e-06 0 0.08427447 0 NaN 0.3010429 0.5552963 0.7291180 0.0000000 0.0000000
Not severe at all 0.0000000 0.000000e+00 0 0.000000e+00 0.000000e+00 0 NaN NaN 0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
But how should I interpret these p values? Is Age3 significantly related to Very severe? I am new to statistics and have no idea. Please help me understand the results. Thank you in advance.
I suggest using the stargazer package to display coefficients and p-values (I believe that it is a more convenient and common way).
Regarding the interpretation of the results, in a multinomial model you can say: keeping all other variables constant, if Age3 is higher by one unit, the log odds of Very severe relative to the reference category are higher or lower by the amount given by the coefficient. The p-value just shows you whether the association between these two variables (predictor and response) is significant or not. The interpretation is the same as in other regression models.
Note: for these p-values the null hypothesis is that the coefficient is equal to zero (no effect at all). When the p-value is less than 0.05, you can reject the null hypothesis at the 5% level and state that the predictor has an effect on the response variable.
I hope this gives you some hints.
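To make the link between coefficients, z values, and p-values concrete, here is a minimal sketch with made-up numbers. The matrices coefs and ses are hypothetical stand-ins shaped like summary(test)$coefficients and summary(test)$standard.errors; none of the values come from the question's model.

```r
# Hypothetical coefficients (log odds relative to the reference category)
coefs <- matrix(c(0.8, -0.1,
                  1.5,  0.05),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Very severe", "Severe"),
                                c("Age3", "Level1")))

# Hypothetical standard errors with the same layout
ses <- matrix(c(0.4, 0.5,
                0.3, 0.5),
              nrow = 2, byrow = TRUE, dimnames = dimnames(coefs))

# Wald z statistic: coefficient divided by its standard error
z <- coefs / ses

# Two-sided p-value; the upper-tail form avoids rounding tiny p-values to 0
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Exponentiated coefficients are odds ratios relative to the reference category
odds_ratio <- exp(coefs)
```

In this toy setup the Age3 coefficient for "Very severe" is two standard errors from zero, so its p-value falls just under 0.05, while Level1 is far from significant. Reading odds_ratio alongside p is often easier than reading raw log odds.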

Calculate rolling annual returns from monthly over XTS object in R

I have an XTS object of monthly returns across multiple columns, I'm trying to calculate rolling annual returns (geometric) for each column.
Date Manager 1 Manager 2 Manager 3 Manager 4 Manager 5
20160430 0.0152000 0.0100700 0.0102210 0.0046160 NA
20160531 0.0462000 0.0515240 0.0287490 0.0374920 NA
20160630 0.0007000 0.0126830 0.0156410 0.0130820 NA
20160731 0.0200000 0.0158810 0.0239540 0.0214950 NA
20160831 0.0339000 0.0531980 0.0021170 0.0476160 0.0457650
20160930 -0.0071000 0.0047540 -0.0088080 0.0031540 -0.0034070
20161031 -0.0224000 -0.0181930 0.0181410 -0.0048280 0.0170850
20161130 -0.0439000 -0.0131600 -0.0243030 -0.0064650 -0.0007180
20161231 -0.0051000 0.0200130 0.0204210 0.0160740 0.0172270
20170131 0.0083000 0.0146560 0.0247000 0.0203410 0.0227060
20170228 0.0211000 -0.0067120 0.0257530 0.0029940 0.0124730
20170331 0.0530000 0.0532190 0.0283950 0.0416190 0.0237900
20170430 0.0638300 0.0592280 0.0341340 0.0437430 0.0293500
20170531 0.0339000 0.0264270 0.0287670 0.0207810 0.0179080
20170630 NA -0.0046950 -0.0091310 -0.0074520 -0.0137600
20170731 NA 0.0109280 0.0029630 0.0146560 0.0167990
20170831 NA 0.0290430 0.0372960 0.0284390 0.0229930
20170930 NA 0.0226390 0.0030190 0.0063850 -0.0087170
Expected results:
Date Manager 1 Manager 2 Manager 3 Manager 4 Manager 5
20160430
20160531
20160630
20160731
20160831
20160930
20161031
20161130
20161231
20170131
20170228
20170331 0.121979182 0.212964432 0.176317288 0.213932804
20170430 0.175724107 0.271996881 0.204161963 0.261212111
20170531 0.161901314 0.241637796 0.204183032 0.240897626
20170630 0.220330851 0.174812396 0.215746067
20170731 0.214381041 0.150728807 0.207606539 0.200188843
20170831 0.186529323 0.191124778 0.185500853 0.174054195
20170930 0.207649992 0.205337395 0.189319163 0.167798654
I've been using the PerformanceAnalytics package, but having some trouble applying the function across each column:
apply.rolling(ManagerReturns, width = 12, trim = FALSE ,FUN = Return.annualized)
apply.rolling is a wrapper around rollapply. For some reason apply.rolling doesn't work correctly with your data, but using rollapply directly solves the issue.
Using rollapply I can get close to your expected outcome, with one caveat: Return.annualized removes the NA values but continues to calculate. You can see this happening with Manager1 and Manager5. This is caused by Return.annualized, not by rollapply. For example, Return.annualized(my_data$Manager5[1:12]) returns an annualized return of 0.2207884.
ra <- rollapply(my_data, width = 12, FUN = Return.annualized, fill = 0)
Manager1 Manager2 Manager3 Manager4 Manager5
2016-04-30 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-05-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-06-30 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-07-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-08-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-09-30 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-10-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-11-30 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-12-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2017-01-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2017-02-28 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2017-03-31 0.1219792 0.2129644 0.1763173 0.2139328 0.2207884
2017-04-30 0.1757241 0.2719969 0.2041620 0.2612121 0.2409790
2017-05-31 0.1619013 0.2416378 0.2041830 0.2408976 0.2406184
2017-06-30 0.1769613 0.2203309 0.1748124 0.2157461 0.1982881
2017-07-31 0.1682027 0.2143810 0.1507288 0.2076065 0.2001888
2017-08-31 0.1368823 0.1865293 0.1911248 0.1855009 0.1740542
2017-09-30 0.1676742 0.2076500 0.2053374 0.1893192 0.1677987
Now you could do something like ra * !is.na(my_data) which will multiply ra with a 0 in case of NA's and will remove the last 4 records of Manager1. But it will not help with Manager5.
data:
my_data <- structure(c(0.0152, 0.0462, 7e-04, 0.02, 0.0339, -0.0071, -0.0224,
-0.0439, -0.0051, 0.0083, 0.0211, 0.053, 0.06383, 0.0339, NA,
NA, NA, NA, 0.01007, 0.051524, 0.012683, 0.015881, 0.053198,
0.004754, -0.018193, -0.01316, 0.020013, 0.014656, -0.006712,
0.053219, 0.059228, 0.026427, -0.004695, 0.010928, 0.029043,
0.022639, 0.010221, 0.028749, 0.015641, 0.023954, 0.002117, -0.008808,
0.018141, -0.024303, 0.020421, 0.0247, 0.025753, 0.028395, 0.034134,
0.028767, -0.009131, 0.002963, 0.037296, 0.003019, 0.004616,
0.037492, 0.013082, 0.021495, 0.047616, 0.003154, -0.004828,
-0.006465, 0.016074, 0.020341, 0.002994, 0.041619, 0.043743,
0.020781, -0.007452, 0.014656, 0.028439, 0.006385, NA, NA, NA,
NA, 0.045765, -0.003407, 0.017085, -0.000718, 0.017227, 0.022706,
0.012473, 0.02379, 0.02935, 0.017908, -0.01376, 0.016799, 0.022993,
-0.008717), .Dim = c(18L, 5L), .Dimnames = list(NULL, c("Manager1",
"Manager2", "Manager3", "Manager4", "Manager5")), index = structure(c(1461974400,
1464652800, 1467244800, 1469923200, 1472601600, 1475193600, 1477872000,
1480464000, 1483142400, 1485820800, 1488240000, 1490918400, 1493510400,
1496188800, 1498780800, 1501459200, 1504137600, 1506729600), tzone = "UTC", tclass = "Date"), class = c("xts",
"zoo"), .indexCLASS = "Date", tclass = "Date", .indexTZ = "UTC", tzone = "UTC")
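If the goal is to get NA whenever the 12-month window is incomplete, as in the expected results, one option is a small custom window function. The sketch below is base R only and is not part of PerformanceAnalytics; rolling_annual is a hypothetical helper, and it assumes that a window of 12 monthly returns spans exactly one year, so the geometric return of the window needs no further annualization.

```r
# Geometric return over a trailing window; NA unless the window is complete
rolling_annual <- function(r, width = 12) {
  vapply(seq_along(r), function(i) {
    if (i < width) return(NA_real_)          # not enough history yet
    w <- r[(i - width + 1):i]                # trailing window ending at i
    if (anyNA(w)) NA_real_ else prod(1 + w) - 1
  }, numeric(1))
}

# Manager1 monthly returns, copied from the data above
manager1 <- c(0.0152, 0.0462, 7e-04, 0.02, 0.0339, -0.0071, -0.0224,
              -0.0439, -0.0051, 0.0083, 0.0211, 0.053, 0.06383, 0.0339,
              NA, NA, NA, NA)

ra1 <- rolling_annual(manager1)
```

For the full object, something like apply(coredata(my_data), 2, rolling_annual) should give one column per manager, with NA wherever a window contains a missing month.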

Clustering, but with conditions (in R)

I am clustering documents using the cosine similarity between each pair of documents. This is fine. However, my problem is a little unusual in that I only want to cluster certain documents with others, not all of the documents against each other. Here's an example...
I have two spreadsheets with 3 labels apiece. I want to cluster the labels that are similar to each other BETWEEN the documents but not within the same document, so for instance
Doc1: has labels: sex and gender, tobacco use years, current age
Doc2: has labels: gender, age now, time of use
I want to cluster the labels between the two documents but not inside the document, so I've created a similarity matrix that looks like this:
d1_l1 d1_l2 d1_l3 d2_l1 d2_l2 d2_l3
d1_l1 1.0000000 NA NA 0.5773503 0.0 0.0000000
d1_l2 NA 1.0000000 NA 0.0000000 0.0 0.3333333
d1_l3 NA NA 1.0 0.0000000 0.5 0.0000000
d2_l1 0.5773503 0.0000000 0.0 1.0000000 NA NA
d2_l2 0.0000000 0.0000000 0.5 NA 1.0 NA
d2_l3 0.0000000 0.3333333 0.0 NA NA 1.0000000
where the cosine similarity between labels in the same document is set as NA. The problem is that agnes and other hierarchical clustering methods don't accept NA values. So what should I do? Am I thinking about this the wrong way?
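One possible workaround, offered only as a sketch: turn similarities into distances and replace each NA with a distance larger than any real one, then use complete linkage so that a cluster containing two labels from the same document can only form at that large height. This assumes cosine similarities lie between 0 and 1 (as for non-negative term vectors), so real distances never exceed 1; the value 2 and the cut height 0.99 are arbitrary choices made here.

```r
labs <- c("d1_l1", "d1_l2", "d1_l3", "d2_l1", "d2_l2", "d2_l3")

# Similarity matrix from the question; NA marks label pairs from the same document
sim <- matrix(c(
  1,         NA,        NA,  0.5773503, 0,   0,
  NA,        1,         NA,  0,         0,   0.3333333,
  NA,        NA,        1,   0,         0.5, 0,
  0.5773503, 0,         0,   1,         NA,  NA,
  0,         0,         0.5, NA,        1,   NA,
  0,         0.3333333, 0,   NA,        NA,  1
), nrow = 6, byrow = TRUE, dimnames = list(labs, labs))

# Cosine distance; forbidden pairs get a distance above the real maximum of 1
dis <- 1 - sim
dis[is.na(dis)] <- 2

# Complete linkage: a cluster's height is its largest internal pairwise
# distance, so any cluster holding a forbidden pair sits at height 2
hc <- hclust(as.dist(dis), method = "complete")

# Cutting below 1 keeps only clusters whose members are all genuinely similar
groups <- cutree(hc, h = 0.99)
```

With the example matrix this pairs each label in Doc1 with its closest counterpart in Doc2 and never groups two labels from the same document. Whether a penalty distance is appropriate depends on the data; a bipartite matching approach would be another way to think about the problem.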
