I am analysing questionnaire data in R, testing whether different metadata explain differences in the answers; I use the chi-squared test for that. Here are two examples, where the question is which pet a person has and I am analysing whether people from different countries and from different professions answer the question differently:
tab <- matrix(c(7, 5, 14, 19, 3, 2, 17, 6, 12), ncol=3, byrow=TRUE)
colnames(tab) <- c('dog','cat','rabbit')
rownames(tab) <- c('Italy','Greece','Hungary')
tab <- as.table(tab)
tab
chisq.test(tab)
tab2 <- matrix(c(9, 8, 12, 18, 1, 5, 16, 5, 11), ncol=3, byrow=TRUE)
colnames(tab2) <- c('dog','cat','rabbit')
rownames(tab2) <- c('Nurse','Technician','Teacher')
tab2 <- as.table(tab2)
tab2
chisq.test(tab2)
However, I know that "country" and "profession" are not independent; there is indeed a statistically significant association between them. My question is: how could I do some kind of adjusted chi-squared test, to test the association of country and of profession with the answers independently of each other? Or how would you handle the data?
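One common way to adjust each association for the other (a sketch of my own, not a confirmed answer; respondents is a hypothetical one-row-per-person data frame with factors pet, country and profession) is a log-linear model on the three-way table: the deviance of a model that omits the country:pet term is a chi-squared test of the pet-country association after accounting for profession.
library(MASS)   # loglm()
## Assumption: individual-level responses exist as `respondents`
## (hypothetical), with columns pet, country, profession.
tab3 <- xtabs(~ country + profession + pet, data = respondents)
## H0: pet independent of country, given profession. The model allows
## country:profession and profession:pet associations; its deviance tests
## the omitted country:pet term (reported as LR and Pearson chi-squares).
loglm(~ country * profession + profession * pet, data = tab3)
## Swap the roles to test profession adjusted for country:
loglm(~ country * profession + country * pet, data = tab3)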
I have been trying to do unsupervised feature selection using LASSO (by removing the class column). The dataset includes categorical (factor) and continuous (numeric) variables. Here is the link. I built a design matrix using model.matrix(), which creates dummy variables for each level of the categorical variables.
library(openxlsx)  # for read.xlsx()
dataset <- read.xlsx("./hepatitis.data.xlsx", sheet = "hepatitis", na.strings = "")
names_df <- names(dataset)
formula_LASSO <- as.formula(paste("~ 0 +", paste(names_df, collapse = " + ")))
LASSO_df <- model.matrix(object = formula_LASSO, data = dataset,
                         contrasts.arg = lapply(dataset[, sapply(dataset, is.factor)],
                                                contrasts, contrasts = FALSE))
### Group LASSO using gglasso package
gglasso_group <- c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 15, 16, 17, 17)
library(gglasso)
fit <- gglasso(x = LASSO_df, y = y_k, group = gglasso_group, loss = "ls", intercept = FALSE, nlambda = 100)
# Cross validation
fit.cv <- cv.gglasso(x = LASSO_df, y = y_k, group = gglasso_group, nfolds = 10)
# Best lambda
best_lambda_fit.cv <- fit.cv$lambda.1se
# Final coefficients of variables
coefs <- coef(fit, s = best_lambda_fit.cv)
### Group LASSO with grpreg package
library(grpreg)
group_lasso <- grpreg(X = LASSO_df, y = y_k, group = gglasso_group, penalty = "grLasso")
plot(group_lasso)
cv_group_lasso <- cv.grpreg(X = LASSO_df, y = y_k, group = gglasso_group, penalty = "grLasso", se = "quick")
# Best lambda
best_lambda_group_lasso <- cv_group_lasso$lambda.min
coef_mat_group_lasso <- as.matrix(coef(cv_group_lasso))
If you check coefs and coef_mat_group_lasso, you will see that they are not the same. The best lambda values are not the same either (note that one is lambda.1se from cv.gglasso and the other lambda.min from cv.grpreg). I am not sure which one to choose for feature selection.
Any idea how to remove the intercept in the grpreg() function? intercept = FALSE is not working.
Any help is appreciated. Thanks in advance.
Please refer to the gglasso paper and the grpreg paper.
Different objective functions. On page 175 of the grpreg paper, the authors perform a step called group standardization, which normalizes the feature matrix within each group by right-multiplying it by an orthonormal matrix and a non-negative diagonal matrix. After the group lasso step with group standardization, the estimated coefficients are left-multiplied by the same matrices, which recovers the coefficients of the original linear model. Done this way, however, the group lasso penalty is not equivalent to the penalty without group standardization; see page 175 for the detailed discussion.
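To make the group standardization step concrete, here is a minimal sketch (my own illustration of the description above, not grpreg's actual internals; it assumes each group block has full column rank, so all singular values are positive):
## Right-multiply each group block by an orthonormal matrix (V from the
## SVD) and a non-negative diagonal matrix (sqrt(n)/d), so the
## standardized block B satisfies crossprod(B) == n * diag(ncol(B)).
group_standardize <- function(X, group) {
  n <- nrow(X)
  X_std <- X
  transforms <- list()
  for (g in unique(group)) {
    idx <- which(group == g)
    sv <- svd(X[, idx, drop = FALSE])
    T_g <- sv$v %*% diag(sqrt(n) / sv$d, nrow = length(sv$d))
    X_std[, idx] <- X[, idx, drop = FALSE] %*% T_g
    transforms[[as.character(g)]] <- T_g
  }
  list(X = X_std, transforms = transforms)
}
## After fitting on the standardized matrix, the coefficients of group g
## are recovered as transforms[[as.character(g)]] %*% beta_std_g; because
## the penalty is applied on the standardized scale, it is not equivalent
## to the unstandardized group lasso penalty.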
Different algorithms. grpreg uses block coordinate descent, while gglasso uses an algorithm called groupwise-majorization-descent. It is natural to see small numerical differences when the algorithms are not the same.
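A small simulated check (my own sketch, not taken from either paper) makes the point visible: even at the same nominal lambda the two packages return different coefficients, partly because of group standardization and partly because of the different algorithms.
library(gglasso)
library(grpreg)
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X[, 1] - X[, 2] + rnorm(100)
grp <- rep(1:3, each = 2)
## Coefficients at the same nominal lambda differ between the packages:
b1 <- coef(gglasso(X, y, group = grp, loss = "ls"), s = 0.1)
b2 <- coef(grpreg(X, y, group = grp, penalty = "grLasso"), lambda = 0.1)
cbind(gglasso = as.numeric(b1), grpreg = as.numeric(b2))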
I'm trying to calculate running percentiles in a column: for the nth data point, calculate the percentile using only the first n values.
I have tried using quantile, but can't seem to figure out how to generalize it.
mydata <- c(1, 25, 43, 2, 5, 17, 40, 15, 12, 8)
perc.fn <- function(vec, n) {
  (rank(vec[1:n], na.last = TRUE) - 1) / (length(vec[1:n]) - 1)
}
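If the goal is one running value per position (my reading of the question), a sketch: apply perc.fn at every n and keep only the nth element each time. The first element comes back NaN (0/0), since a single value has no percentile rank.
## For each n, the percentile of mydata[n] among the first n values
running_perc <- sapply(seq_along(mydata), function(n) perc.fn(mydata, n)[n])
round(running_perc, 2)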
I am trying to calculate and visualize the Bray-Curtis dissimilarity between communities at paired/pooled sites using the vegan package in R.
Below is a simplified example dataframe:
Site = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
PoolNumber = c(1, 3, 4, 2, 4, 1, 2, 3, 4, 4)
Sp1 = c(3, 10, 7, 0, 12, 9, 4, 0, 4, 3)
Sp2 = c(2, 1, 17, 1, 2, 9, 3, 1, 6, 7)
Sp3 = c(5, 12, 6, 10, 2, 4, 0, 1, 3, 3)
Sp4 = c(9, 6, 4, 8, 13, 5, 2, 20, 13, 3)
df = data.frame(Site, PoolNumber, Sp1, Sp2, Sp3, Sp4)
"Site" is a variable indicating the location where each sample was taken
The "Sp" columns indicate abundance values of species at each site.
I want to compare pairs of sites that have the same "PoolNumber" and get a dissimilarity value for each comparison.
Most examples suggest I should create a matrix with only the "Sp" columns and use this code:
library(vegan)
comm <- df[, 3:6]   # species abundance columns only (avoid masking base::matrix)
braycurtis = vegdist(comm, "bray")
hist(braycurtis)
However, I'm not sure how to tell R which rows to compare if I eliminate the "PoolNumber" and "Site" columns. Would this involve sorting by "PoolNumber", using it as a row name, and then writing a loop to compare every two rows?
I am also finding the output difficult to interpret. Lower Bray-Curtis values (closer to 0) indicate more similar communities, while higher values (closer to 1) indicate more dissimilar communities; but is there a way to tell directionality, i.e. which one of the pair is more diverse?
I am a beginner R user, so I apologize for any misuse of terminology/formatting. All suggestions are appreciated.
Thank you
Do you mean that you want to get a subset of dissimilarities with equal PoolNumber? The vegdist function will give you all pairwise dissimilarities, and you can pick your pairs from those. This is easiest if you first transform the dissimilarities into a symmetric matrix and then pick your subset from that symmetric matrix:
braycurtis <- vegdist(df[,3:6])
as.matrix(braycurtis)[df$PoolNumber==4,df$PoolNumber==4]
as.dist(as.matrix(braycurtis)[df$PoolNumber==4,df$PoolNumber==4])
If you only want averages, the vegan::meandist function will give you those:
meandist(braycurtis, df$PoolNumber)
Here the diagonal values are mean dissimilarities within a PoolNumber and the off-diagonal values are mean dissimilarities between different PoolNumbers. Looking at the code of vegan::meandist you can see how this is done.
Bray-Curtis dissimilarities (like all standard dissimilarities) are a symmetric measure, so they carry no notion of which community is more diverse. You can assess the degree of diversity for each site, but then you first need to say what you mean by "diverse" (diversity in the technical sense, or something else?). Then you just use those values in your calculations.
If you just want to look at the number of items (species), the following function will give you the differences in the lower triangle (the upper-triangle values are the same with the sign switched):
designdist(df[,3:6], "A-B", "binary")
Alternatively, you can work with row-wise statistics and look at their differences. Here is an example with the Shannon-Weaver diversity index:
H <- diversity(df[,3:6])
outer(H, H, "-")
To get the subsets, work in the same way as with the Bray-Curtis index above.
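For instance, combining the two snippets above gives the Shannon differences restricted to one pool:
## Pairwise Shannon diversity differences among the sites of pool 4 only
outer(H, H, "-")[df$PoolNumber == 4, df$PoolNumber == 4]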
I'm having issues doing a multivariate Granger causality test. I'd like to check whether conditioning on a third variable affects the results of a causal test.
Here's one sample for a single dependent and independent variable, based on an earlier question I asked that was answered by @Alex:
Granger's causality test by column
library(lmtest)
M1 <- matrix(c(2, 3, 1, 4, 3, 3, 1, 1, 5, 7), nrow = 5, ncol = 2)
M2 <- matrix(c(7, 3, 6, 9, 1, 2, 1, 2, 8, 1), nrow = 5, ncol = 2)
M3 <- matrix(c(1, 3, 1, 5, 7, 3, 1, 3, 3, 4), nrow = 5, ncol = 2)
For example, the equation for a conditioned linear regression would be
formula = y ~ w + x * z
How do I carry out this test as a function of a third or fourth variable, please?
1. The solution for stationary variables is well-established: see the FIAR (v 0.3) package.
This is the paper related to the package, which includes a concrete example of multivariate Granger causality (for the case where all of the variables are stationary).
Page 12: theory; page 15: practice.
2. In the case of mixed (stationary, nonstationary) variables, first make all the nonstationary variables stationary (via differencing etc.); leave the already-stationary ones untouched. Then finish with the procedure above (case 1).
3. In the case of "non-cointegrated nonstationary" variables, there is no need for a VECM. Run a VAR on the variables (after making them stationary first, of course) and apply FIAR::condGranger etc.; the sketch at the end of this answer illustrates the underlying idea.
4. In the case of "cointegrated nonstationary" variables, the answer is really very long:
Apply the Johansen procedure (detect the cointegration rank via urca::ca.jo).
Apply vars::vec2var to convert the VECM into a VAR (since FIAR is based on VAR).
John Hunter's latest book nicely summarizes what can happen and what can be done in this last case.
You may want to read this as well.
To my knowledge, conditional/partial Granger causality supersedes GC via the "block exogeneity Wald test over VAR".
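For intuition, here is a sketch of the underlying idea in base R with simulated data (my own illustration, not the FIAR API): conditional Granger causality at lag 1 is a nested-model F test. Does lagged x improve the prediction of y once lags of y and of the conditioning variable z are already included?
set.seed(1)
n <- 100
z <- rnorm(n)                                   # conditioning variable
x <- rnorm(n)                                   # candidate cause
y <- 0.5 * c(0, head(x, -1)) + 0.3 * c(0, head(z, -1)) + rnorm(n)
d <- data.frame(y  = y[-1],
                y1 = head(y, -1),               # lag-1 of y
                x1 = head(x, -1),               # lag-1 of x
                z1 = head(z, -1))               # lag-1 of z
restricted   <- lm(y ~ y1 + z1, data = d)       # condition on z, omit x
unrestricted <- lm(y ~ y1 + z1 + x1, data = d)  # add lagged x
anova(restricted, unrestricted)                 # small p => x Granger-causes y given z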
Assume this easy example:
treatment <- factor(rep(c(1, 2), c(43, 41)), levels = c(1, 2), labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)), levels = c(1, 2, 3), labels = c("none", "some", "marked"))
numberofdrugs <- rpois(84, 50) + 1
healthvalue <- rpois(84, 5)
y <- data.frame(healthvalue, numberofdrugs, treatment, improved)
test <- lm(healthvalue ~ numberofdrugs + treatment + improved, y)
What am I supposed to do when I'd like to estimate a beta-binomial regression in R? Is anybody familiar with it? Any thought is appreciated!
I don't see how this example relates to beta-binomial regression (i.e., you have generated count data rather than (number out of a total possible)). To simulate beta-binomial data, see rbetabinom in either the emdbook or the rmutil package ...
library(sos); findFn("beta-binomial") finds a number of useful starting points, including
aod (analysis of overdispersed data), betabin function
betabinomial family in VGAM
hglm package
emdbook package (for dbetabinom) plus mle2 package
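Putting the pieces together, a minimal sketch (assuming emdbook's rbetabinom with the prob/size/theta parameterization and VGAM's betabinomial family, both named above): simulate beta-binomial responses and fit a regression on them.
library(emdbook)   # rbetabinom()
library(VGAM)      # vglm(), betabinomial
set.seed(1)
n <- 84
x <- rnorm(n)
size <- 20                                 # trials per subject
p <- plogis(-1 + 0.5 * x)                  # per-trial success probability
successes <- rbetabinom(n, prob = p, size = size, theta = 10)
dat <- data.frame(successes, failures = size - successes, x)
## The response is (successes, failures) out of `size` trials,
## not a raw count as in the lm() example above.
fit <- vglm(cbind(successes, failures) ~ x, betabinomial, data = dat)
summary(fit)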