Removing time series seasonality with monthly dummies in R

I am working with weekly Google search volume data (scaled 0-100) for different words from 2018 to 2021. For example, my data for the term "gold price" reads as follows:
gold <- (ts(SVI_Log_returns_Winsorized$`gold price`,frequency =52,start = c(2018,1), end = c(2021,52)))
Time Series:
Start = c(2018, 1)
End = c(2021, 52)
Frequency = 52
[1] -0.10919929 0.10919929 -0.03509132 0.00000000 0.13353139 -0.16989904 -0.16034265 0.04255961 -0.04255961 -0.09097178 0.13353139 0.00000000 0.04082199 -0.04082199 0.00000000 -0.08701138
[17] 0.00000000 -0.04652002 0.09097178 -0.04445176 -0.04652002 0.00000000 0.00000000 0.04652002 0.04445176 -0.04445176 0.00000000 0.08701138 0.00000000 -0.04255961 0.00000000 0.26570317
[33] -0.14310084 -0.03922071 -0.12783337 0.08701138 -0.08701138 0.00000000 0.04445176 0.16034265 -0.07696104 -0.04082199 -0.04255961 0.00000000 -0.04445176 0.08701138 -0.08701138 0.08701138
[49] -0.04255961 0.23180161 0.15906469 -0.15906469 -0.10919929 -0.08004271 0.00000000 0.08004271 -0.12260232 0.00000000 0.08338161 -0.04082199 0.00000000 -0.04255961 0.04255961 -0.04255961
[65] -0.04445176 -0.04652002 0.04652002 -0.04652002 0.04652002 0.00000000 0.08701138 -0.08701138 -0.04652002 0.25131443 -0.07696104 0.27763174 0.08701138 -0.11778304 -0.06453852 0.03278982
[81] -0.03278982 0.03278982 0.30228422 0.00000000 -0.15028220 0.02666825 -0.08223810 -0.05884050 -0.06252036 0.00000000 0.00000000 -0.10178269 0.00000000 0.00000000 -0.07410797 0.03774033
[97] -0.03774033 -0.03922071 -0.04082199 0.04082199 0.03922071 0.03774033 0.10536052 0.15415068 0.25131443 -0.22977835 -0.03390155 0.09844007 -0.06453852 -0.06899287 0.22314355 0.30228422
[113] -0.20875481 0.30228422 0.08252102 -0.22977835 -0.22977835 0.00000000 0.07696104 0.05406722 -0.22977835 -0.07410797 -0.05264373 0.05264373 -0.16705408 0.00000000 0.05884050 -0.05884050
[129] 0.08701138 0.02739897 0.12675171 -0.10008346 0.30228422 0.30228422 0.00000000 -0.13503628 -0.21414799 -0.22977835 -0.06453852 -0.19574458 -0.05556985 0.13353139 -0.10536052 0.00000000
[145] 0.00000000 0.00000000 0.05406722 -0.14107860 0.24116206 -0.10008346 0.07598591 -0.02469261 -0.07796154 0.02666825 0.00000000 0.02597549 0.28768207 -0.14458123 -0.04546237 -0.02353050
[161] 0.30228422 -0.22977835 -0.02469261 0.13976194 0.06317890 -0.08515781 -0.11778304 -0.07796154 0.02666825 -0.05406722 -0.02817088 0.02817088 -0.05715841 0.11122564 0.12361396 -0.04762805
[177] -0.05001042 -0.02597549 -0.05406722 0.05406722 0.00000000 -0.08223810 -0.05884050 0.02985296 0.00000000 -0.02985296 -0.03077166 0.24783616 -0.15822401 -0.05884050 -0.06252036 -0.06669137
[193] 0.12921173 0.05884050 0.00000000 -0.02898754 0.00000000 -0.02985296 0.08701138 -0.02817088 0.10821358 -0.05264373 -0.08455739 0.02898754 -0.05884050 0.05884050 -0.02898754 -0.06062462
Plotting this data gives the following figure:
[Figure: DiffLog 'Gold price']
As seen in the plot, the data seems to have seasonal components.
Decomposing the data using
decomp <- stl(gold,"periodic")
plot(decomp)
gives the following figure:
[Figure: 'Gold Price' decomposed]
Looking at the seasonal graph, it seems like the search volume for the word "gold price" drops a lot during the middle of each year.
I'm not quite sure how to eliminate the seasonality in my data. I've found a couple of papers that regress such data on monthly dummies and keep the residuals. I've tried to replicate this, but I'm at a loss as to where to start. Can somebody advise me on how to approach the problem of seasonality?
Thanks!

I think that the book Forecasting - Principles and Practice is a great place to start from.
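The approach mentioned in the question (regress the series on monthly dummies and keep the residuals) could be sketched as follows. This is a minimal sketch, not a definitive method: the series `x` is synthetic stand-in data, and the weekly dates starting on 2018-01-01 are an assumption, so the real observation dates should be used instead.

```r
# Synthetic stand-in for the weekly series (208 weeks, 2018-2021); values are made up
set.seed(1)
n_weeks <- 208
week_dates <- seq(as.Date("2018-01-01"), by = "week", length.out = n_weeks)  # assumed dates
x <- rnorm(n_weeks, sd = 0.1)

# Calendar month of each week; lm() expands the factor into monthly dummies
month <- factor(format(week_dates, "%m"))

# Regress the series on the monthly dummies and keep the residuals
fit <- lm(x ~ month)
gold_deseason <- ts(residuals(fit), frequency = 52, start = c(2018, 1))

# By construction, the residuals average to (numerically) zero within each month
month_means <- tapply(residuals(fit), month, mean)
```

With real data, `x` would be replaced by the values of `gold`, and `gold_deseason` would be the series with the monthly means removed.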

Related

`stoch.sens()` produces different results across code structures

I am using the hudsonia dataset from popbio for a reprex. Assume that I have two lists of projection matrices for which I want to calculate the sensitivity of the stochastic growth rate.
What I want R to do is apply stoch.sens to each nested list and give me a new nested list of sensitivities and elasticities for each list of projection matrices.
library(popbio)
data("hudsonia")
library(purrr)
library(tidyverse)
hudson <- list(hudsonia[1:2], hudsonia[3:4])
I called set.seed(500), but it doesn't help. These three versions all do the job as demonstrated in https://rdrr.io/cran/popbio/man/stoch.sens.html, but they give different results. Fortunately, the results are broadly similar. How do I know which one is the most accurate?
scenario1_stoch.sens1 <- hudson %>%
map(., ~{stoch.sens(.)})
scenario1_stoch.sens1[[1]]
$sensitivities
seed seedlings tiny small medium large
[1,] 0.03305056 2.771649e-05 0.0001287404 0.0001431969 0.0001538398 0.0001839186
[2,] 13.00203595 1.134659e-02 0.0452484230 0.0484751986 0.0478620603 0.0884144846
[3,] 26.78972167 2.334817e-02 0.0933100332 0.1000829077 0.0989421815 0.1816309641
[4,] 46.35258286 4.059941e-02 0.1627644750 0.1744453985 0.1728116142 0.3133614288
[5,] 60.17717120 5.218833e-02 0.2105849414 0.2261332506 0.2239088481 0.4059634764
[6,] 73.30117860 6.334906e-02 0.2570030445 0.2762777810 0.2737179892 0.4930947167
$elasticities
seed seedlings tiny small medium large
[1,] 0.016508753 0.00000000 0.0005893993 0.001738768 0.0034331970 0.009230784
[2,] 0.005200814 0.00000000 0.0001764688 0.000494447 0.0008998067 0.003739933
[3,] 0.000000000 0.01114408 0.0552075615 0.017941807 0.0063520008 0.000000000
[4,] 0.000000000 0.00000000 0.0359152049 0.077326821 0.0301908444 0.026247911
[5,] 0.000000000 0.00000000 0.0000000000 0.066648140 0.1172797856 0.034134910
[6,] 0.000000000 0.00000000 0.0000000000 0.007905494 0.0624854738 0.409207593
scenario1_stoch.sens2 <- map(hudson, ~{stoch.sens(.x)})
scenario1_stoch.sens2[[1]]
$sensitivities
seed seedlings tiny small medium large
[1,] 0.03403483 2.845839e-05 0.0001207616 0.0001396471 0.0001501461 0.0001875122
[2,] 13.00850282 1.128956e-02 0.0517530409 0.0564579344 0.0569689775 0.0809149273
[3,] 26.69896759 2.310674e-02 0.1057631968 0.1155037098 0.1166128499 0.1659382881
[4,] 45.93156321 4.011413e-02 0.1806746089 0.1973586368 0.1990428759 0.2871106860
[5,] 57.80866176 5.033736e-02 0.2282652316 0.2491555124 0.2513725389 0.3608284037
[6,] 69.50472834 6.057715e-02 0.2735822612 0.2987184500 0.3012894262 0.4344459633
$elasticities
seed seedlings tiny small medium large
[1,] 0.017000397 0.00000000 0.0005528708 0.0016956653 0.003350765 0.009411144
[2,] 0.005203401 0.00000000 0.0002018369 0.0005758709 0.001071017 0.003422701
[3,] 0.000000000 0.01102885 0.0632803733 0.0202711514 0.007591172 0.000000000
[4,] 0.000000000 0.00000000 0.0392556049 0.0875884490 0.034936656 0.029256857
[5,] 0.000000000 0.00000000 0.0000000000 0.0739450382 0.132775014 0.037177426
[6,] 0.000000000 0.00000000 0.0000000000 0.0093164903 0.066822493 0.344268758
scenario1_stoch.sens3 <- stoch.sens(hudsonia[1:2])
$sensitivities
seed seedlings tiny small medium large
[1,] 0.03084729 0.0000258864 0.0001033805 0.0001115822 0.0001068774 0.0002138166
[2,] 12.76976503 0.0111371273 0.0541500215 0.0584835187 0.0602080777 0.0763170039
[3,] 26.19161727 0.0227584944 0.1105507047 0.1193949501 0.1227492926 0.1570462498
[4,] 45.34229432 0.0395745196 0.1895626453 0.2051304050 0.2107298137 0.2728195593
[5,] 56.73178410 0.0493258466 0.2361536206 0.2552820940 0.2616984321 0.3432671780
[6,] 68.02535325 0.0591084382 0.2809327390 0.3038881720 0.3110907601 0.4133270712
$elasticities
seed seedlings tiny small medium large
[1,] 0.015408220 0.00000000 0.0004732968 0.0013548868 0.002385150 0.010731348
[2,] 0.005107906 0.00000000 0.0002111851 0.0005965319 0.001131912 0.003228209
[3,] 0.000000000 0.01086263 0.0670680736 0.0204936687 0.008178951 0.000000000
[4,] 0.000000000 0.00000000 0.0401334404 0.0911770528 0.037510314 0.029048719
[5,] 0.000000000 0.00000000 0.0000000000 0.0762261032 0.139968147 0.036758280
[6,] 0.000000000 0.00000000 0.0000000000 0.0100422826 0.066522076 0.325381617
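A likely explanation, assuming (as the popbio documentation suggests) that stoch.sens works by simulation and draws a random sequence of matrices on every call: each call consumes random numbers, so a single set.seed() at the top of the script only fixes the first call. The three results would then differ by Monte Carlo noise rather than by accuracy. The sketch below illustrates the principle in base R with a hypothetical stand-in, `sim_estimate`, which is not part of popbio.

```r
# Hypothetical stand-in for a simulation-based estimator such as stoch.sens():
# it draws `tlimit` matrices at random and averages their dominant eigenvalues
sim_estimate <- function(mats, tlimit = 100) {
  draws <- sample(seq_along(mats), tlimit, replace = TRUE)
  mean(vapply(mats[draws], function(m) max(Mod(eigen(m)$values)), numeric(1)))
}

# Made-up 2x2 projection matrices
mats <- list(matrix(c(0.5, 0.3, 1.2, 0.6), 2),
             matrix(c(0.4, 0.2, 1.5, 0.7), 2))

set.seed(500)
a <- sim_estimate(mats)   # first call after seeding
b <- sim_estimate(mats)   # second call: the RNG state has moved on

set.seed(500)
a2 <- sim_estimate(mats)  # reseeding immediately before the call reproduces `a`

# Inside lapply()/map(), reseed inside the function so every element is reproducible
scenarios <- list(mats, rev(mats))
per_scenario <- lapply(scenarios, function(m) {
  set.seed(500)
  sim_estimate(m)
})
```

If stoch.sens has a `tlimit` argument as its help page indicates, raising it should shrink the run-to-run differences, at the cost of longer run time.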

How to access and compare LASSO model coefficients with MLR3 (glmnet learner)?

Goal
Create a LASSO model using MLR3
Use nested CV, with an inner CV or bootstrap for hyperparameter (lambda) selection and an outer CV for model performance evaluation (instead of just one train-test split), and find the standard deviation of the LASSO regression coefficients across the different model instances.
Predict on a test data set that is not available yet.
Issues
I am unsure whether the nested CV approach as described is implemented correctly in my code below.
I am unsure whether alpha is fixed correctly, i.e. at alpha = 1 only.
I do not know how to access the LASSO coefficients for the selected lambda when using resampling in mlr3. (importance() in mlr3learners does not yet support LASSO.)
I don't know how to apply a fitted model to the not-yet-available test set in mlr3.
Code
library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
# Data ------
# Prepared according to the Blog post by Julia Silge
# https://juliasilge.com/blog/lasso-the-office/
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> .default = col_double()
#> )
#> See spec(...) for full column specifications.
# Add a factor to data
data$factor = as.factor(c(rep('a', 20), rep('b', 50), rep('c', 30), rep('a', 6), rep('c', 10), rep('b', 20)))
# Task creation
task =
TaskRegr$new(
id = 'office',
backend = data,
target = 'imdb_rating'
)
# Model creation
graph =
po('scale') %>>%
po('encode') %>>% # make factors numeric
# How to normalize predictors, leaving target unchanged?
lrn('regr.cv_glmnet', # 10-fold CV for inner loop. Is alpha permanently set to 1?
id = 'rp', alpha = 1, family = 'gaussian'
)
graph_learner = GraphLearner$new(graph)
# Execution (actual modeling)
result =
resample(
task,
graph_learner,
rsmp('cv', folds = 5) # 5-fold for outer CV
)
#> INFO [13:21:53.485] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 3/5)
#> INFO [13:21:54.937] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 2/5)
#> INFO [13:21:55.242] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 1/5)
#> INFO [13:21:55.500] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 4/5)
#> INFO [13:21:55.831] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 5/5)
# How to access results, i.e. lambda coefficients,
# and compare them (why no variable importance for glmnet)?
# Access prediction
result$prediction()
#> <PredictionRegr> for 136 observations:
#> row_id truth response
#> 2 8.3 8.373798
#> 6 8.7 8.455151
#> 9 8.4 8.358964
#> ---
#> 116 9.7 8.457607
#> 119 8.2 8.130352
#> 128 7.8 8.224150
Created on 2020-06-11 by the reprex package (v0.3.0)
Edit 1 (LASSO coefficients)
According to a comment from missuse, the LASSO coefficients can be accessed through result$data$learner[[1]]$model$rp$model$glmnet.fit$beta. Additionally, I found that store_models = TRUE needs to be set in the resample() call that creates result in order to store the models and, in turn, access the coefficients.
Despite setting alpha = 1, I obtained multiple sets of LASSO coefficients. I would like the 'best' LASSO coefficients (e.g. those for lambda = lambda.min or lambda.1se). What do the different s1, s2, s3, ... mean? Are these different lambdas?
The different coefficients do indeed seem to stem from different lambda values, denoted s1, s2, s3, ... (the number is an index). I suppose the 'best' coefficients can be accessed by first finding the index of the 'best' lambda, index_lamda.1se = which(ft$lambda == ft$lambda.1se)[[1]]; index_lamda.min = which(ft$lambda == ft$lambda.min)[[1]], and then selecting the corresponding set of coefficients. A more concise approach to finding the 'best' coefficients is given in the comments by missuse.
library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]
# Add a factor to data
data$factor = as.factor(c(rep('a', 20), rep('b', 50), rep('c', 30), rep('a', 6), rep('c', 10), rep('b', 20)))
# Task creation
task =
TaskRegr$new(
id = 'office',
backend = data,
target = 'imdb_rating'
)
# Model creation
graph =
po('scale') %>>%
po('encode') %>>% # make factors numeric
# How to normalize predictors, leaving target unchanged?
lrn('regr.cv_glmnet', # 10-fold CV for inner loop. Is alpha permanently set to 1?
id = 'rp', alpha = 1, family = 'gaussian'
)
graph$keep_results = TRUE
graph_learner = GraphLearner$new(graph)
# Execution (actual modeling)
result =
resample(
task,
graph_learner,
rsmp('cv', folds = 5), # 5-fold for outer CV
store_models = TRUE # Store models; needed to access coefficients
)
# LASSO coefficients
# Why more than one coefficient per predictor?
# What are s1, s2 etc.? Shouldn't 'lrn' fix alpha = 1?
# How to obtain the best coefficients (for lambda.1se or lambda.min) if there are multiple?
as.matrix(result$data$learner[[1]]$model$rp$model$glmnet.fit$beta)
#> s0 s1 s2 s3 s4 s5
#> andy 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> angela 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> b_j_novak 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> brent_forrester 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> darryl 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> dwight 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> episode 0 0.000000000 0.00000000 0.00000000 0.010297763 0.02170423
#> erin 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> gene_stupnitsky 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> greg_daniels 0 0.000000000 0.00000000 0.00000000 0.001845101 0.01309437
#> jan 0 0.000000000 0.00000000 0.00000000 0.005663699 0.01357832
#> jeffrey_blitz 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> jennifer_celotta 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> jim 0 0.006331732 0.01761548 0.02789682 0.036853510 0.04590513
#> justin_spitzer 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> [...]
#> s6 s7 s8 s9 s10
#> andy 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> angela 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> b_j_novak 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> brent_forrester 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> darryl 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> dwight 0.002554576 0.007006995 0.011336058 0.01526851 0.01887180
#> episode 0.031963475 0.040864492 0.047487987 0.05356482 0.05910066
#> erin 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> gene_stupnitsky 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> greg_daniels 0.023040791 0.031866343 0.040170917 0.04779004 0.05472702
#> jan 0.021030152 0.028094541 0.035062678 0.04143812 0.04725379
#> jeffrey_blitz 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> jennifer_celotta 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> jim 0.053013058 0.058503984 0.062897112 0.06683734 0.07041964
#> justin_spitzer 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> kelly 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> ken_kwapis 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> kevin 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> lee_eisenberg 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> michael 0.057190859 0.062963830 0.068766981 0.07394472 0.07865977
#> mindy_kaling 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> oscar 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> pam 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> paul_feig 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> paul_lieberstein 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> phyllis 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> randall_einhorn 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> ryan 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> season 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> toby 0.000000000 0.000000000 0.005637169 0.01202893 0.01785309
#> factor.a 0.000000000 -0.003390125 -0.022365768 -0.03947047 -0.05505681
#> factor.b 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> factor.c 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> s11 s12 s13 s14
#> andy 0.000000000 0.000000000 0.000000000 0.0000000000
#> angela 0.000000000 0.000000000 0.000000000 0.0000000000
#> b_j_novak 0.000000000 0.000000000 0.000000000 0.0000000000
#> brent_forrester 0.000000000 0.000000000 0.000000000 0.0000000000
#> darryl 0.000000000 0.000000000 0.000000000 0.0017042281
#> dwight 0.022170870 0.025326337 0.027880703 0.0303865693
#> episode 0.064126846 0.069018240 0.074399623 0.0794693480
#> [...]
Created on 2020-06-15 by the reprex package (v0.3.0)
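The index-based lookup described in Edit 1 can be illustrated with a small mock of the relevant pieces of a cv.glmnet fit. All numbers below are made up, and `lambda`, `beta`, `lambda_min` and `lambda_1se` only mimic the corresponding fields; with a real fit, `coef(fit, s = "lambda.min")` from glmnet should return the selected coefficients (including the intercept) directly.

```r
# Mock of a lambda path and the matching coefficient matrix (columns s0, s1, ...)
lambda <- c(0.50, 0.25, 0.10, 0.05, 0.01)
beta <- matrix(
  c(0, 0.00, 0.02, 0.04, 0.06,   # jim
    0, 0.00, 0.00, 0.03, 0.05,   # michael
    0, 0.00, 0.00, 0.00, 0.01),  # toby
  nrow = 3, byrow = TRUE,
  dimnames = list(c("jim", "michael", "toby"), paste0("s", 0:4))
)

# Mock values for the lambdas selected by the inner cross-validation
lambda_1se <- 0.10
lambda_min <- 0.05

# Column k of beta belongs to lambda[k], so look up the column by position
idx_1se <- which(lambda == lambda_1se)[[1]]
idx_min <- which(lambda == lambda_min)[[1]]

coef_1se <- beta[, idx_1se]   # coefficients at lambda.1se
coef_min <- beta[, idx_min]   # coefficients at lambda.min
```

Each column s0, s1, ... is therefore one LASSO solution along the lambda path, not a different value of alpha.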
Edit 2 (optional follow up question)
Nested CV provides an evaluation of the discrepancy among multiple models. The discrepancy can be expressed as an error (e.g. RMSE) obtained by the outer CV. While that error may be small, individual LASSO coefficients (importance of predictors) from the models (instantiated by the outer CV) may vary considerably.
Does mlr3 provide functionality describing the consistency in the quantitative importance of predictor variables, i.e. the RMSE of LASSO coefficients among the models created by the outer CV? Or should a custom function be created that retrieves the LASSO coefficients using result$data$learner[[i]]$model$rp$model$glmnet.fit$beta (suggested by missuse), with i = 1, 2, 3, 4, 5 being the folds of the outer CV, and then takes the RMSE of the matching coefficients?
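A custom summary along the lines suggested above might look like the following base R sketch. It assumes the coefficient vector at the selected lambda has already been extracted for each outer fold; `fold_coefs` and its values are hypothetical placeholders, not output from the model above.

```r
# Hypothetical coefficients at the selected lambda, one named vector per outer fold
fold_coefs <- list(
  c(jim = 0.06, michael = 0.07, toby = 0.00),
  c(jim = 0.07, michael = 0.08, toby = 0.00),
  c(jim = 0.05, michael = 0.06, toby = 0.01),
  c(jim = 0.06, michael = 0.07, toby = 0.00),
  c(jim = 0.06, michael = 0.07, toby = 0.02)
)

# Stack into a folds x predictors matrix
coef_mat <- do.call(rbind, fold_coefs)

# Per-predictor summaries across the outer folds
coef_mean <- colMeans(coef_mat)        # average coefficient
coef_sd   <- apply(coef_mat, 2, sd)    # spread across folds
n_nonzero <- colSums(coef_mat != 0)    # how often the predictor was selected
```

The selection count is worth reporting next to the standard deviation, because a LASSO coefficient that is zero in most folds has a small spread without being a stable predictor.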

Problems when coding NLS in R with dims

I'm trying to code a non-linear regression in R to fit data I have regarding the relationship between temperatures and types of precipitations.
I first created 2 vectors with my data:
vec_temp_num
[1] -8.5 -8.0 -6.5 -6.1 -5.9 -5.8 -5.6 -5.4 -5.3 -5.1 -4.9 -4.8 -4.7 -4.5 -4.3 -4.2 -4.1
[18] -4.0 -3.9 -3.8 -3.7 -3.6 -3.5 -3.4 -3.3 -3.2 -3.1 -3.0 -2.9 -2.8 -2.6 -2.5 -2.4 -2.3
vec_rain
[1] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
[9] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 1.00000000 0.00000000 0.00000000
[17] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
[25] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.33333333 0.00000000 0.00000000
vec_temp_num contains a list of temperatures, and vec_rain holds, for each temperature, the proportion of observations with a given type of precipitation, basically rain or snow (I chose to start with one of them to simplify the process). Both vectors contain 300 values and are numeric.
The function is the following:
func_rain <- function(x,b){(1/(1+exp(-b*x)))}
I then tested my function and got a plot that looks as it should, so up to this step everything seems to be OK.
But when I try to write the nls formula:
Rain_fit<-nls(vec_rain~func_rain(vec_temp_num,b), start=c(vec_temp_num=2.6, b=1))
I get an error message saying:
Error in qr(.swts * gr) :
dims [product 2] do not match the length of object [300]
It seems that I should have my data as a matrix as opposed to a vector (which I don't understand, because some forums advise creating vectors), so I then tried to use the data directly from my data frame (dp.w2, with rain and temperature columns):
Rain_fit<-
nls(dp.w2$rain~func_rain(dp.w2$temperature,b),start=list(temperature=2.6,rain=1,
b=1))
and got another error message:
Error in parse(text = x, keep.source = FALSE) :
:2:0: unexpected end of input
1: ~
^
I've read a lot of questions and answers about the nls function, but it's been some days now and I just can't find the right way to fit my data, so thanks a lot in advance for your help!
PS: I'm a total beginner so if you could provide a "step by step" or detailed (for dummies) answer it would be awesome!
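One likely cause of both errors is the start argument: it should name only the parameters to be estimated (here just b), whereas vec_temp_num, temperature and rain are data. A minimal sketch with simulated data follows; the data frame `dp`, the true value b = 1.5 and the noise level are assumptions made for illustration only.

```r
# Logistic curve from the question
func_rain <- function(x, b) 1 / (1 + exp(-b * x))

# Simulated stand-in data: 300 temperatures and noisy rain proportions
set.seed(42)
dp <- data.frame(temperature = seq(-8.5, 8.5, length.out = 300))
dp$rain <- func_rain(dp$temperature, b = 1.5) + rnorm(300, sd = 0.05)

# Step 1: refer to the columns by name and pass the data frame via `data`
# Step 2: list only the unknown parameter (b) in `start`
Rain_fit <- nls(rain ~ func_rain(temperature, b),
                data = dp,
                start = list(b = 1))

# Step 3: inspect the estimate
b_hat <- coef(Rain_fit)[["b"]]
```

With the original vectors, the equivalent call would be nls(vec_rain ~ func_rain(vec_temp_num, b), start = list(b = 1)).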

Error when using ''ward'' method with pvclust R package

I am having some trouble with a cluster analysis that I am trying to do with the pvclust package.
Specifically, I have a data matrix of species (rows) by sampling stations (columns). I want to perform a cluster analysis (CA) in order to group my sampling stations according to species abundance (which I have previously log(x+1) transformed).
Having prepared my matrix, I tried to run a CA with the pvclust package, using Ward's clustering method and Bray-Curtis as the distance index. However, every time I get the following error message:
''Error in hclust(distance, method = method.hclust) :
invalid clustering method''
I then tried to perform the same analysis using another clustering method, and I had no problem. I also tried the same analysis using hclust with the vegan package, and again it ran without any problems.
To better illustrate my problem, I'll show part of my matrix and the script that I used to perform the analysis:
P1 P2 P3 P4 P5 P6
1 10.8750000 3.2888889 2.0769231 1.4166667 3.2395833 5.333333
3 0.3645833 0.3027778 0.3212038 0.7671958 0.4993676 0.000000
4 0.0000000 0.0000000 2.3500000 0.0000000 0.0000000 0.264000
5 0.0000000 0.7333333 0.2692308 0.0000000 0.2343750 0.000000
6 0.0000000 0.9277778 0.0000000 0.2936508 0.7291667 0.000000
7 0.4166667 6.3500000 1.0925463 0.5476190 0.1885169 0.000000
8 1.6250000 0.0000000 0.0000000 0.0000000 5.2187500 0.000000
9 0.0000000 0.8111111 0.0000000 0.0000000 0.0000000 0.000000
10 2.6770833 0.6666667 2.3304890 4.5906085 2.9652778 0.000000
15 1.8020833 0.9666667 1.4807137 3.3878968 0.1666667 0.000000
16 17.8750000 4.9555556 1.4615385 6.5000000 7.8593750 7.666667
19 4.5312500 1.0555556 3.5766941 6.7248677 2.3196181 0.000000
20 0.0000000 0.6777778 0.5384615 0.0000000 0.0000000 0.000000
21 0.0000000 0.9777778 0.0000000 0.2500000 0.0000000 0.000000
24 1.2500000 3.0583333 0.1923077 0.0000000 4.9583333 0.000000
25 0.0000000 0.0000000 2.5699634 0.0000000 0.0000000 0.000000
26 6.6666667 2.2333333 24.8730020 55.9980159 17.6239583 0.000000
Where P1-P6 are my sampling stations, and the leftmost row numbers are my different species. I'll denote this example matrix just as ''platforms''.
Afterwards, I've used the following code lines:
dist <- function(x, ...){
vegdist(x, ...)
}
result<-pvclust(platforms,method.dist = "bray",method.hclust = "ward")
Note that I ran the first three lines of code because the Bray-Curtis index isn't originally available in the pvclust package. Running them allowed me to specify the Bray-Curtis index in the pvclust function.
Does anyone know why it doesn't work with the pvclust package?
Any help will be much appreciated.
Kind regards,
Marie
There are two related issues:
When calling method.hclust you need to pass hclust compatible methods. In theory pvclust checks for ward and converts to ward.D, but you probably want to pass the (correct) names of either ward.D or ward.D2.
You cannot over-write dist in that fashion. However, you can pass a custom function to pvclust.
For instance, this should work:
library(vegan)
library(pvclust)
sample.data <- "P1 P2 P3 P4 P5 P6
10.8750000 3.2888889 2.0769231 1.4166667 3.2395833 5.3333330
0.3645833 0.3027778 0.3212038 0.7671958 0.4993676 0.0000000
0.0000000 0.0000000 2.3500000 0.0000000 0.0000000 0.2640000
0.0000000 0.7333333 0.2692308 0.0000000 0.2343750 0.0000000
0.0000000 0.9277778 0.0000000 0.2936508 0.7291667 0.0000000
0.4166667 6.3500000 1.0925463 0.5476190 0.1885169 0.0000000
1.6250000 0.0000000 0.0000000 0.0000000 5.2187500 0.0000000
0.0000000 0.8111111 0.0000000 0.0000000 0.0000000 0.0000000
2.6770833 0.6666667 2.3304890 4.5906085 2.9652778 0.0000000
1.8020833 0.9666667 1.4807137 3.3878968 0.1666667 0.0000000
17.8750000 4.9555556 1.4615385 6.5000000 7.8593750 7.6666670
4.5312500 1.0555556 3.5766941 6.7248677 2.3196181 0.0000000
0.0000000 0.6777778 0.5384615 0.0000000 0.0000000 0.0000000
0.0000000 0.9777778 0.0000000 0.2500000 0.0000000 0.0000000
1.2500000 3.0583333 0.1923077 0.0000000 4.9583333 0.0000000
0.0000000 0.0000000 2.5699634 0.0000000 0.0000000 0.0000000
6.6666667 2.2333333 24.8730020 55.9980159 17.6239583 0.0000000"
platforms <- read.table(text = sample.data, header = TRUE)
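result <- pvclust(platforms,
                  method.dist = function(x){
                    # pvclust clusters columns, so the custom function must
                    # return distances between columns (hence the transpose)
                    vegdist(t(x), "bray")
                  },
                  method.hclust = "ward.D")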
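To make explicit what a custom distance function has to return, here is a base R sketch that computes Bray-Curtis dissimilarities between columns by hand and clusters them with a valid Ward method name. The helper `bray_cols` and the toy matrix are hypothetical; in practice vegdist from vegan is the more robust choice.

```r
# Hypothetical helper: Bray-Curtis dissimilarity between the COLUMNS of x,
# returned as a "dist" object
bray_cols <- function(x) {
  x <- as.matrix(x)
  p <- ncol(x)
  d <- matrix(0, p, p, dimnames = list(colnames(x), colnames(x)))
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      d[i, j] <- sum(abs(x[, i] - x[, j])) / sum(x[, i] + x[, j])
    }
  }
  as.dist(d)
}

# Toy abundance matrix: 3 species (rows) x 3 stations (columns)
toy <- matrix(c(10, 0, 2,
                 8, 1, 3,
                 0, 5, 9),
              nrow = 3,
              dimnames = list(NULL, c("P1", "P2", "P3")))

d_toy <- bray_cols(toy)

# "ward.D" and "ward.D2" are the method names that hclust documents
hc <- hclust(d_toy, method = "ward.D2")
```

A function of this shape (one argument, returns a dist object over columns) could be passed as method.dist.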

Fixing nodes in igraph

I have a directed graph with a start and an end node, defined such that no edge leaves end and no edge enters start. In my graph I want to fix the start node at the top of the plot and the end node at the bottom, with the intermediate nodes staying in between. How can I achieve this?
> final_data_graph
(conversion) (start) alpha beta delta epsilon eta gamma iota kappa lambda mi theta
(conversion) 0.00000000 0 0.00000000 0.00000000 0.000000e+00 0.000000000 0.00000000 0.000000000 0.00000000 0.000000000 0.00000000 0.000000e+00 0.00000000
(start) 0.00000000 0 0.03771482 0.14413063 8.571551e-05 0.006128659 0.18025972 0.013071615 0.47426392 0.002914327 0.03891484 4.285776e-05 0.10118716
alpha 0.18078800 0 0.58092440 0.03215991 1.049263e-04 0.017732543 0.03667174 0.002675620 0.06395257 0.005666020 0.03242222 0.000000e+00 0.03840302
beta 0.09504413 0 0.08766124 0.35022064 8.486083e-05 0.009164969 0.24753904 0.004327902 0.12075696 0.004752206 0.02274270 0.000000e+00 0.04760692
delta 0.53333333 0 0.00000000 0.00000000 0.000000e+00 0.066666667 0.00000000 0.000000000 0.26666667 0.066666667 0.00000000 0.000000e+00 0.06666667
epsilon 0.38628763 0 0.13991081 0.04347826 0.000000e+00 0.105351171 0.08193980 0.005574136 0.10200669 0.007246377 0.05128205 0.000000e+00 0.05351171
eta 0.42928641 0 0.11002583 0.09969325 0.000000e+00 0.023167582 0.19058767 0.002421698 0.07402325 0.008072328 0.01840491 0.000000e+00 0.03535680
gamma 0.28192371 0 0.14427861 0.05804312 0.000000e+00 0.021558872 0.08291874 0.066334992 0.15754561 0.018242123 0.05306799 0.000000e+00 0.09950249
iota 0.23902022 0 0.06370199 0.04091585 1.102111e-04 0.009202623 0.03240205 0.001790930 0.53868408 0.004573759 0.02669863 5.510553e-05 0.03160302
kappa 0.43064985 0 0.06886518 0.03685742 9.699321e-04 0.018428710 0.06498545 0.002909796 0.09602328 0.128031038 0.05431620 0.000000e+00 0.08244423
lambda 0.34914361 0 0.08695652 0.02561850 5.855658e-04 0.020348412 0.02547211 0.002488655 0.07539160 0.034401991 0.31620553 0.000000e+00 0.04977309
mi 0.00000000 0 0.25000000 0.00000000 0.000000e+00 0.000000000 0.00000000 0.000000000 0.50000000 0.000000000 0.00000000 2.500000e-01 0.00000000
theta 0.13940821 0 0.17562196 0.07949360 1.472104e-04 0.025320183 0.07198587 0.004269101 0.13513911 0.019431768 0.20491683 0.000000e+00 0.12939791
zeta 0.09929633 0 0.15871775 0.07427678 0.000000e+00 0.039874902 0.07974980 0.001563722 0.23612197 0.007036747 0.08444097 0.000000e+00 0.07271306
zeta
(conversion) 0.000000000
(start) 0.001285733
alpha 0.008499029
beta 0.010098439
delta 0.000000000
epsilon 0.023411371
eta 0.008960284
gamma 0.016583748
iota 0.011241528
kappa 0.015518914
lambda 0.013614405
mi 0.000000000
theta 0.014868247
zeta 0.146207975
ig <- graph.adjacency(final_data_graph, mode="directed", weighted=TRUE)
plot(ig,edge.label=round(E(ig)$weight,3),edge.width=.01,edge.arrow.size=.05,layout=layout.reingold.tilford(ig, root=which(V(ig)$name=='(start)')),vertex.color="white")
I assume you want to plot your graph with the start node at the top and end at the bottom. If so, you can use layout.reingold.tilford e.g. :
library(igraph)
# example graph
g <- graph.empty(directed = T)
g <- g + vertices(c('D','A','E','F','C','B'))
g <- g + edge('A','B')
g <- g + edge('A','C')
g <- g + edge('B','E')
g <- g + edge('B','D')
g <- g + edge('C','D')
g <- g + edge('D','F')
g <- g + edge('E','F')
# create the layout specifying the root node (i.e. start)
ly <- layout.reingold.tilford(g, root=which(V(g)$name=='A'),flip.y=T)
# let's plot
plot.igraph(g,layout=ly)
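For graphs with cycles or self-loops, such as the transition matrix in the question, a tree layout may not place the end node at the bottom. An alternative is to build the layout matrix by hand, since igraph's plot accepts a two-column coordinate matrix as layout. The sketch below uses only base R; `pin_layout` is a hypothetical helper that puts the intermediate nodes on a unit circle and pins start above and end below it.

```r
# Hypothetical helper: coordinates with `start` pinned on top and `end` at the bottom
pin_layout <- function(node_names, start, end) {
  mid <- setdiff(node_names, c(start, end))
  k <- length(mid)
  angle <- 2 * pi * seq_len(k) / k

  ly <- matrix(0, nrow = length(node_names), ncol = 2,
               dimnames = list(node_names, c("x", "y")))
  ly[mid, ] <- cbind(cos(angle), sin(angle))  # intermediates on a unit circle
  ly[start, ] <- c(0, 2)                      # above the circle
  ly[end, ] <- c(0, -2)                       # below the circle
  ly
}

nodes <- c("(start)", "alpha", "beta", "gamma", "(conversion)")
ly_pinned <- pin_layout(nodes, start = "(start)", end = "(conversion)")

# With igraph loaded, the rows must follow the vertex order, e.g.:
# plot(ig, layout = ly_pinned[V(ig)$name, ])
```

The circle is only a placeholder for the intermediate nodes; any other arrangement works as long as their y values stay between those of the two pinned nodes.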
