How to use sample weights in R

I am planning to fit a Multi-Group Confirmatory Factor Analysis about views on ethical matters. I will compare people from the regions of Wallonia and Flanders in Belgium. My two samples need to be weighted in order to be representative of their populations in terms of age, gender, education and party choice.
Sampling weights were already provided in my dataset. I then created a variable wreg, combining the weights for respondents from Wallonia and Flanders.
I am new to R, and have read the documentation for lavaan.survey and svydesign to learn the code. However, I haven't yet succeeded in writing something correct: I always get error messages about the part concerning the weights. Apparently the program cannot read the sampling weights variable correctly.
Here is the code I used:
library(lavaan.survey)
f <- "C:/.../bges07_small.csv"
s <- read.csv(f,sep=";")
r <- s[is.na(s$flawal),]
rDesign <- svydesign(ids=~1, data=r, weights=~wreg)
model.1 <- 'ethic =~ q96_1+ q96_2 +q96_3'
fit <- cfa(model.1, data=r,ordered=c("q96_1","q96_2","q96_3"))
summary(fit, fit.measures=TRUE, modindices=FALSE,standardized=FALSE)
And this is the error message I had:
Error in 1/as.matrix(weights) :
non-numeric argument to binary operator
Any suggestion on how I should write my model with R? Thanks a lot!

From the results of summary(r$wreg), it looks like your weights column is a factor, not a numeric vector. Make sure you've read your data in correctly and that the column doesn't contain any character-like values. Note that calling as.numeric() directly on a factor returns the internal level codes rather than the original values, so convert via character first:
r$wreg <- as.numeric(as.character(r$wreg))
before running your model. Also, those look like very large weight values. Are you sure they are correct?
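A minimal illustration of the pitfall (toy values, not taken from the original data):

```r
# A numeric column read in as a factor -- e.g. because of a stray
# non-numeric entry or a decimal-comma locale -- silently yields
# level codes, not values, under a plain as.numeric().
w <- factor(c("1.52", "0.87", "1.52"))

as.numeric(w)                # level codes (2 1 2) -- not the weights
as.numeric(as.character(w))  # the actual values (1.52 0.87 1.52)
```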

Related

Mice in R - how can I understand what this command does?

mice_mod <- mice(titanicData[, !names(titanicData) %in%
                   c('PassengerId','Name','Ticket','Cabin','Survived')],
                 method='rf')
mice_output <- complete(mice_mod)
I am new to R and we had a college lecture yesterday. What does this command do? I have read the online documentation and broken the command down into a series of outputs, with no joy.
The mice function imputes missing values. In your case you are using method='rf', which means the random forest imputation algorithm is used. Since I can't reproduce your dataset, I'm using airquality, a built-in R dataset with NA values, which can be imputed the same way. With mice you are in effect building a prediction model; the result is actually a mids object, which mice uses to store imputed datasets (see the documentation). If you want to use those imputations, call complete to create the filled data frame.
library(mice)
df<-airquality
mice_mod <- mice(df, method='rf')
mice_output <- complete(mice_mod)
When you compare df and mice_output, you'll see that the NA values in Ozone and Solar.R have been replaced.
In your example your lecturer is keeping all columns whose names are not in the given list, i.e. he is filtering the data frame before imputing.
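The column-exclusion idiom on its own, with a throwaway data frame (the column names here are placeholders, not the actual Titanic file):

```r
# Keep every column whose name is NOT in the exclusion list.
df <- data.frame(PassengerId = 1:3,
                 Age  = c(22, NA, 35),
                 Fare = c(7.25, 71.3, 8.05))
drop <- c("PassengerId", "Name", "Ticket", "Cabin", "Survived")
kept <- df[, !names(df) %in% drop]
names(kept)  # "Age" "Fare"
```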
If you want more information about the algorithm: according to the documentation it is described in
Doove, L.L., van Buuren, S., Dusseldorp, E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, 92-104.

How to run ANOVA/multifactor regression with unequal sample sizes in R

I am new to R, and I unfortunately don't have a nice, simple dataset. I am trying to figure out how to set up my data and code so that it will give me a way to run a comparison on the following:
Experiment type factor levels ("Experiment"): Control, Heated
Competition type factor levels ("Comp"): Inter, Light, Med, Heavy
My goal is to see if there are interactions between temperature (Experiment type factor) and type of competition (Competition type factor) on rates of parasitism ("ParasRate") and rates of survival ("Survival") of the host.
I have tried following this YouTube tutorial.
I think I set something up wrong, since it is asking for numerical values, which my factors of course are not. So I don't think that tutorial by itself is what I need. I need to be able to edit it so I can run it in a meaningful way to look at these interactions. My hypothesis, briefly stated, is that there will be interactions between heat ("Heated") and inter-specific ("Inter") competition.
My data from Excel loads no problem and looks right, so it is not a data file problem.
#fit model using Competition Type and Experiment Type as X-variables
mod_CompExperiment <- (ParasRate ~ Experiment + Level)
summary(mod_CompExperiment)
Length Class Mode
3 formula call
cor(ParasRate, Comp, method="pearson")
Error in cor(ParasRate, Comp, method = "pearson") : 'y' must be numeric
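For what it's worth, the summary() output above hints at the underlying problem: wrapping a formula in parentheses only stores the formula (hence Class formula); no model is ever fitted. A hedged sketch with simulated data (the variable names follow the question; the numbers are made up, so treat this as an illustration only) of fitting the two-factor model with an interaction via aov():

```r
set.seed(1)
# Simulated stand-in for the question's data (Experiment, Comp, ParasRate).
dat <- data.frame(
  Experiment = factor(rep(c("Control", "Heated"), each = 8)),
  Comp       = factor(rep(c("Inter", "Light", "Med", "Heavy"), times = 4)),
  ParasRate  = runif(16)
)

# Experiment * Comp expands to both main effects plus their interaction;
# factors do not need to be numeric here, unlike in cor().
mod <- aov(ParasRate ~ Experiment * Comp, data = dat)
summary(mod)  # the Experiment:Comp row tests the interaction
```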

how to find differentially methylated regions (for example with probe lasso in Champ) based on regression continuous variable ~ beta (with CpGassoc)

I performed 450K Illumina methylation chips on human samples, and want to search for the association between a continuous variable and beta, adjusted for other covariates. For this, I used the CpGassoc package in R. I would also like to search for differentially methylated regions (DMRs) based on the significant CpG sites. However, the probe lasso function in the Champ package, and other packages for 450K DMR analyses as well, always assume two groups between which DMRs need to be found. I do not have two groups, but this continuous variable. Is there a way to load my output from CpGassoc into the probe lasso function from Champ? Or into another bump hunter package? I'm an MD, not a bioinformatician, so comb-p etc. would not be possible for me.
Thank you very much for your help.
Kind regards,
Line
I have not worked with methylation data before, so take what I say with a grain of salt. Also, don't use acronyms without describing them; I'm guessing most people on this site don't know what a DMR is.
You could use the lasso from the glmnet package. For example, suppose your continuous variable is age, and meth.dt is your methylation data.table, with columns holding the amount of methylation at each site and rows corresponding to subjects. I'm not sure whether methylation data are considered Poisson-distributed (I know RNA-seq data are), and I can't get too specific, but the following code should work after adjusting the column range to your number of columns:
#load libraries
library(data.table)
library(glmnet)
#read in data (path is a placeholder)
meth.dt <- fread("/data")
#lasso: methylation columns as predictors, Age as the response
AgeLasso <- glmnet(as.matrix(meth.dt[, 1:70999, with = FALSE]), meth.dt$Age, family = "poisson")
#cross-validate to choose the penalty
cv.AgeLasso <- cv.glmnet(as.matrix(meth.dt[, 1:70999, with = FALSE]), meth.dt$Age, family = "poisson")
#keep only the nonzero coefficients at the one-standard-error lambda
coefTranscripts <- coef(cv.AgeLasso, s = "lambda.1se")[, 1][coef(cv.AgeLasso, s = "lambda.1se")[, 1] != 0]
This will give you the methylation sites that are the best predictors of your continuous variable, using a parsimonious model. For additional info about glmnet see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
You might also want to ask the people over at Cross Validated; they may have better answers. http://stats.stackexchange.com
What is your continuous variable just out of curiosity?
Let me know how you ended up solving it if you don't use this method.

Error in bn.fit predict function in bnlearn R

I have learned and fitted a Bayesian network with the bnlearn R package and I wish to predict its "event" node value.
fl="data/discrete_kdd_10.txt"
h=TRUE
dtbl1 = read.csv(file=fl, head=h, sep=",")
net=hc(dtbl1)
fitted=bn.fit(net,dtbl1)
I want to predict the value of "event" node based on the evidence stored in another file with the same structure as the file used for learning.
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
However, predict fails with
Error in check.data(data) : variable duration must have at least two levels.
I don't understand why there should be any restriction on the number of levels of variables in the evidence data.frame.
The dtbl2 data.frame contains only few rows, one for each scenario in which I want to predict the "event" value.
I know I can use cpquery, but I wish to use the predict function also for networks with mixed variables (both discrete and continuous). I haven't found out how to make use of evidence on a continuous variable in cpquery.
Can someone please explain what I'm doing wrong with the predict function and how should I do it right?
Thanks in advance!
The problem was that reading the evidence data.frame with
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
caused the categorical variables to be factors with a different number of levels (a subset of the levels of the original training set).
I used the following code to solve this issue:
for (i in seq_len(ncol(dtbl2))) {
  dtbl2[[i]] <- factor(dtbl2[[i]], levels = levels(dtbl1[[i]]))
}
By the way, the bnlearn package does fit models with mixed variables and also provides functions for prediction in them.
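The underlying factor-level mismatch can be reproduced with plain vectors (toy data, independent of bnlearn): a factor built from a subset of the data only carries the levels it happens to contain, and re-specifying levels= restores the full set:

```r
train <- factor(c("low", "mid", "high"))  # levels seen during training
test  <- factor(c("low", "low"))          # evidence file: only one level present

nlevels(test)                             # 1 -- the kind of thing check.data() rejects
test <- factor(test, levels = levels(train))
nlevels(test)                             # 3 -- now matches the training data
```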

Getting factor analysis scores when using the factanal function

I have used the factanal function in R to do a factor analysis on a data set.
Viewing the summary of the output, I see I have access to the loadings and other objects, but I am interested in the scores of the factor analysis.
How can I get the scores when using the factanal function?
I attempted to calculate the scores myself:
m <- t(as.matrix(factor$loadings))
n <- (as.matrix(dataset))
scores <- m%*%n
and got the error:
Error in m %*% n : non-conformable arrays
which I don't understand, since I double-checked the dimensions of the data and they are in agreement.
Thanks everyone for your help.
Ah.
factormodel$loadings[,1] %*% t(dataset)
This question might be a bit dated, but nevertheless:
factanal can return a matrix of scores. You simply access it the way you accessed the loadings: factor$scores. No need to calculate it yourself. But you do need to tell the function to produce the scores, using the "scores" argument (e.g. scores = "regression").
Your solution of multiplying the loadings by the observation matrix is wrong. According to the FA model, the observed dataset should be (approximately) the scores multiplied by the loadings (plus the unique contributions, and then rotation); this is not equivalent to what you wrote. I think you treated the loadings as the coefficients from observed data to scores, rather than the other way around (from scores to observations).
I found this paper that explains different ways to extract scores; it might be useful.
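A minimal sketch of the scores argument with a built-in dataset (mtcars is just a stand-in for the asker's data):

```r
# Scores are only produced when requested at fit time.
fa <- factanal(mtcars, factors = 2, scores = "regression")

dim(fa$scores)   # one row per observation, one column per factor
head(fa$scores)
```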
