for loops for regression over multiple variables & outputting a subset - r

I have tried to apply this Q&A: "efficient looping logistic regression in R" to my own problem, but I cannot quite make it work. I haven't tried apply, but I was told by a few people that a for loop is the best approach here (if someone believes otherwise, please feel free to explain!). I think this problem is pretty generalizable and not too esoteric for the forum.
This is what I want to achieve: I have a dataset with 3 predictor variables (gender, age, race) and a dependent variable (a proportion) for 86 genetic positions, for several people. I want to run bivariate linear regressions for each position (so 86 linear regressions for each of the 3 predictor variables). Then I want to output the results in some easily legible format; my idea is a matrix with rows = gender, age, and race, and columns = the 86 positions, holding a p value for each row × column combination. Then I could flag the p values < 0.1 (or whatever threshold I want) to easily see which predictors are significantly associated with proportion at each position.
This is the code I have so far.
BB <- seq.csv[, 6:91]   # the data frame containing the 86 positions
AA <- seq.csv[, 2:4]    # the data frame containing the 3 predictor variables
linreg <- matrix(NA, 3, 86)   # make a results matrix and fill it with NA
for (i in 1:86) {   # loop over each position variable
  for (j in 1:3) {  # for each position variable, loop over each predictor
    linreg[i, j] <- lm(BB[, i] ~ AA[, j])   # bivariate linear regression
  }
}
No matter how I change this (for example, simplifying it to loop over the positions for only one predictor), I still get an error that my matrices are not the same length (number of items to replace is not a multiple of replacement length). In fact, length(linreg) = 258 (3 × 86), length(BB) = 86, and length(AA) = 3. I know the latter two are data frames, not matrices... but if I convert them to matrices I get an invalid type error (invalid type (list) for variable 'BB[, i]'). I do not know how to resolve this error because I just don't understand R well enough... I've consulted the books Applied Statistical Genetics with R and The Art of R Programming to no avail, and I've been Google-searching all day. And I haven't even gotten to the code for outputting the results...
I'd appreciate any debugging tips or some suggestions on a better way to code this! Thank you all in advance.

Really hard to give a definitive answer without knowing the structure of your data beforehand, but this might work. I'm assuming that your two data frames have the same number of rows (observations):
df <- cbind(AA, BB)   # the 3 predictors in columns 1:3, the 86 positions in columns 4:89
mods <- apply(df[, 4:89], 2, FUN = function(x) { lm(x ~ df[, 1] + df[, 2] + df[, 3]) })
# The rows of this matrix will correspond to the intercept, gender, age, race, and the columns are the results for each of your 86 genetic positions
pvals <- sapply(mods, function(x) { summary(x)$coefficients[, 4] })
As to whether or not that is the right thing to do I will trust to your judgement as a genetic epidemiologist!
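If you do want the 86 × 3 separate bivariate fits described in the question, here is a minimal sketch of that loop (my own, building on the data frames above; it assumes each predictor enters as a single column, i.e. numeric or 0/1 coded). The fix for the original error is to store just the slope's p value, a single number, rather than the whole lm object:
pmat <- matrix(NA, 3, 86, dimnames = list(names(AA), names(BB)))
for (i in 1:86) {
  for (j in 1:3) {
    fit <- lm(BB[, i] ~ AA[, j])
    pmat[j, i] <- summary(fit)$coefficients[2, 4]  # p value of the slope, not the lm object
  }
}
which(pmat < 0.1, arr.ind = TRUE)  # predictor/position pairs below the threshold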

Related

Multidimensional scaling with missing data

I've got a survey in which each respondent is randomly asked 8 out of 20 policy-based questions. I want to use MDS to bring these questions down to a single dimension that captures the ideology of each respondent. However, because people answer only a few questions, I can't build a dissimilarity matrix between respondents: very few are asked the same 8 questions. I also can't remove rows with NAs, because every row has 12 NAs. I have two options:
Regress each of the 20 variables on the questions asked of every participant (age, gender, etc.), and impute the NAs from those models.
Use some sort of MDS method that doesn't require a complete matrix.
So far, I've been working with the first option, but the resulting models aren't always good. Since the policy questions are yes-no, I fit a binomial glm for each question:
complete <- function(x) {
  q_and_predictors <- data.frame(question = x, predictors)
  logistic_reg <- glm(question ~ ., data = q_and_predictors, family = "binomial")
  predictions <- predict(logistic_reg, newdata = predictors)
  x <- ifelse(is.na(x), exp(predictions) / (1 + exp(predictions)), x)
  return(x)
}
complete_questions <- apply(questions, 2, complete)
The questions dataframe contains all the policy questions, and the predictors dataframe contains all non-policy questions.
I found the McFadden R^2 value for each logistic model, and some were very good (>0.35), but some were not (<0.1). Ideally, I'd like to find a way to either impute the missing values with greater accuracy, or use an MDS algorithm that works with missing values.
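For the imputation route, multiple imputation may beat a single plug-in prediction per question. A sketch with the mice package (my suggestion, not part of the original code; it assumes the policy questions can be coded as two-level factors and that the predictors are complete):
library(mice)
dat <- data.frame(lapply(questions, factor), predictors)  # yes-no questions as 2-level factors
imp <- mice(dat, m = 5, method = "logreg", seed = 1)      # logistic-regression imputation
imputed1 <- mice::complete(imp, 1)  # namespaced to avoid the complete() helper above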

Factor scores from factor analysis on ordinal categorical data in R

I'm having trouble computing factor scores from an exploratory factor analysis on ordered categorical data. I've managed to assess how many factors to retain, and to run the factor analysis using the psych package, but I can't figure out how to get factor scores for individual participants, and I haven't found much help online. Here is where I'm stuck:
library(polycor)
library(nFactors)
library(psych)
# load data
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
# convert to ordered factors
for (i in seq_along(dat)) {
  dat[, i] <- as.ordered(dat[, i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# choose number of factors
ev <- eigen(pc$correlations)
ap <- parallel(subject = nrow(dat),
               var = ncol(dat), rep = 100, cent = .05)
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
dev.new(height = 4, width = 6, noRStudioGD = TRUE)
plotnScree(nS)   # 2 factors, maybe 1
# run FA
faPC <- fa(r = pc$correlations, nfactors = 2, rotate = "varimax", fm = "ml")
faPC$loadings
Edit: I've found a way to get scores using irt.fa() and scoreIrt(), but it involved converting my ordered categories to numeric so I'm not sure it's valid. Any advice would be much appreciated!
for (i in seq_along(dat)) { dat[, i] <- as.numeric(dat[, i]) }  # back to numeric first
x <- as.matrix(dat)
fairt <- irt.fa(x = x, nfactors = 2, correct = TRUE, plot = TRUE, n.obs = NULL,
                rotate = "varimax", fm = "ml", sort = FALSE)
scoreIrt(stats = fairt, items = dat, cut = 0.2, mod = "logistic")
That's an interesting problem. Regular factor analysis assumes your input measures are interval or ratio scaled. With ordinal variables you have a few options: you could use an IRT-based approach (in which case you'd be using something like the Graded Response Model), or do as you do in your example and use the polychoric correlation matrix as the input to the factor analysis. You can see more discussion of this issue here.
Most factor analysis packages have a method for getting factor scores, but they will give you different output depending on what you choose to use as input. For example, normally you can just use factor.scores() to get your expected factor scores, but only if you input your original raw data. The problem here is the requirement to use the polychoric matrix as input.
I'm not 100% sure (and someone please correct me if I'm wrong), but I think the following should be OK in your situation:
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
dat_orig <- dat
# convert to ordered factors
for (i in seq_along(dat)) {
  dat[, i] <- as.ordered(dat[, i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# run FA
faPC <- fa(r = pc$correlations, nfactors = 2, rotate = "varimax", fm = "ml")
factor.scores(dat_orig, faPC)
In essence what you're doing is:
1. Calculate the polychoric correlation matrix
2. Use that matrix to conduct the factor analysis and extract 2 factors and associated loadings
3. Use the loadings from the FA and the raw (numeric) data to get your factor scores
Both this method, and the method you use in your edit, treat the original data as numeric rather than factors. I think this should be OK because you're just taking your raw data and projecting it down on the factors identified by the FA, and the loadings there are already taking into account the ordinal nature of your variables (as you used the polychoric matrix as input into FA). The post linked above cautions against this approach, however, and suggests some alternatives, but this is not a straightforward problem to solve
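One usage note (mine, not from the thread): factor.scores() returns a list, so the per-participant scores still have to be pulled out of its scores element:
fs <- factor.scores(dat_orig, faPC)
head(fs$scores)  # one row per participant, one column per extracted factor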

lm() saving residuals with group_by in R - confused SPSS user

This is a complete re-edit of my original question.
Let's assume I'm working on RT data gathered in a repeated-measures experiment. As part of my usual routine, I always transform RTs to natural logarithms and then compute a Z score for each RT within each participant, adjusting for trial number. This is typically done with a simple regression in SPSS syntax:
split file by subject.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT rtLN
/METHOD=ENTER trial
/SAVE ZRESID.
split file off.
To reproduce the same procedure in R, generate data:
# load libraries
library(dplyr); library(magrittr)
# generate data
ob <- factor(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3))
trial <- c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6)
rt <- c(300, 305, 290, 315, 320, 320, 350, 355, 330, 365, 370, 370, 560, 565, 570, 575, 560, 570)
cond <- c("first", "first", "first", "snd", "snd", "snd", "first", "first", "first",
          "snd", "snd", "snd", "first", "first", "first", "snd", "snd", "snd")
# the following variable is what I would get after running the SPSS code
ZreSPSS <- c(0.4207, 0.44871, -1.7779, 0.47787, 0.47958, -0.04897,
             0.45954, 0.45487, -1.7962, 0.43034, 0.41075, 0.0407,
             -0.6037, 0.0113, 0.61928, 1.22038, -1.32533, 0.07806)
sym <- data.frame(ob, trial, rt, cond, ZreSPSS)
I could apply a formula (a blend of Mark's and Daniel's solutions) to compute residuals from an lm(log(rt) ~ trial) regression, but for some reason group_by is not working here:
sym %<>%
  group_by(ob) %>%
  mutate(z = residuals(lm(log(rt) ~ trial)),
         obM = mean(rt), obSd = sd(rt), zRev = z * obSd + obM)
Resulting values clearly show that grouping hasn't kicked in.
Any idea why it didn't work out?
Using dplyr and magrittr, you should be able to calculate z-scores within individuals with this code (it breaks things into the groups you tell it to, then calculates within each group):
experiment %<>%
  group_by(subject) %>%
  mutate(rtLN = log(rt)
         , ZRE1 = scale(rtLN))
You should then be able to use that in your model. However, one thing that may help your shift to R thinking is that you can likely build your model directly, instead of having to make all of these columns ahead of time. For example, using lme4 to treat subject as a random effect:
library(lme4)
withRandVar <-
  lmer(log(rt) ~ cond + (1 | as.factor(subject))
       , data = experiment)
Then, the residuals should already be on the correct scale. Further, if you use the z-scores, you probably should be plotting on that scale. I am not actually sure what running with the z-scores as the response gains you -- it seems like you would lose information about the degree of difference between the groups.
That is, if the groups are tight, but the difference between them varies by subject, a z-score may always show them as a similar number of z-scores away. Imagine, for example, that you have two subjects, one scores (1,1,1) on condition A and (3,3,3) on condition B, and a second subject that scores (1,1,1) and (5,5,5) -- both will give z-scores of (-.9,-.9,-.9) vs (.9,.9,.9) -- losing the information that the difference between A and B is larger in subject 2.
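A quick sketch of that point with scale():
as.vector(scale(c(1, 1, 1, 3, 3, 3)))  # roughly -0.91 ... +0.91
as.vector(scale(c(1, 1, 1, 5, 5, 5)))  # the same z-scores, though the raw gap doubled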
If, however, you really want to convert back, you can probably use this to store the subject means and sds, then multiply the residuals by subjSD and add subjMean.
experiment %<>%
  group_by(subject) %>%
  mutate(rtLN = log(rt)
         , ZRE1 = scale(rtLN)
         , subjMean = mean(rtLN)
         , subjSD = sd(rtLN))
mylm <- lm(x ~ y)
rstandard(mylm)
This returns the standardized residuals of the model. To assign them to a variable you can do:
zresid <- rstandard(mylm)
EXAMPLE:
a <- rnorm(10, mean = 10)
b <- rnorm(10, mean = 10)
mylm <- lm(a ~ b)
mylm.zresid <- rstandard(mylm)
See also:
summary(mylm)
and
mylm$coefficients
mylm$fitted.values
mylm$xlevels
mylm$residuals
mylm$assign
mylm$call
mylm$effects
mylm$qr
mylm$terms
mylm$rank
mylm$df.residual
mylm$model
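Tying the two answers back to the grouped SPSS workflow, here is a sketch (mine, assuming a current dplyr): rstandard() can run inside a grouped mutate, giving per-subject standardized residuals much like /SAVE ZRESID. Note that rstandard() is leverage-adjusted, while SPSS's ZRESID simply divides each residual by the residual standard deviation, so the values will differ slightly.
library(dplyr)
sym <- sym %>%
  group_by(ob) %>%
  mutate(zre = rstandard(lm(log(rt) ~ trial))) %>%  # one regression per subject
  ungroup()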

lmList diagnostic plots - is it possible to subset data during a procedure or do data frames have to be subset and then passed in?

I am new to R and am trying to produce a vast number of diagnostic plots for linear models for a huge data set.
I discovered the lmList function from the nlme package.
This works a treat, but what I now need is a means of passing a fraction of this data into the plot function so that the resulting plots are not minute and unreadable.
In the example below 27 plots are nicely displayed. I want to produce diagnostics for much more data.
Is it necessary to subset the data first (presumably with loops), or is it possible to subset within the plotting function (again presumably with some kind of loop), rather than create 270 data frames and pass them all in separately?
I'm sorry to say that my R is so basic that I do not even know how to build variable names and values together in for loops (I tried using the paste function but it failed).
The data and function for the example are below; I would be picking values of Subject by their row numbers within the data frame. I grant that the 27 plots here display nicely, but for the sake of example it would be nice to split them into, say, 3 sets of 9.
fm1 <- lmList(distance ~ age | Subject, Orthodont)
# observed versus fitted values by Subject
plot(fm1, distance ~ fitted(.) | Subject, abline = c(0,1))
Examples from:
https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/plot.lmList.html
I would be most grateful for help and hope that my question isn't insulting to anyone's intelligence or otherwise annoying.
I can't see how to pass a subset to the plot.lmList function. But here is a way to do it using the standard split-apply-combine strategy: the Subjects are split into three arbitrary groups of 9, and lmList is applied to each group.
## Make 3 lmLists
fits <- lapply(split(unique(Orthodont$Subject), rep(1:3, each = 3)), function(x) {
  eval(substitute(
    lmList(distance ~ age | Subject,                      # fit the model to a subset
           data = Orthodont[Orthodont$Subject %in% x, ]), # use the subset
    list(x = x)))  # substitute the actual x values so the proper call gets stored
})
## Make plots
for (i in seq_along(fits)) {
  dev.new()
  print(plot(fits[[i]], distance ~ fitted(.) | Subject, abline = c(0, 1)))
}
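If you would rather collect the pages in a single file than open one device per group, the same loop works inside a pdf() device (a sketch; the file name is my own):
pdf("lmList-diagnostics.pdf")  # each group's lattice plot becomes one page
for (i in seq_along(fits)) {
  print(plot(fits[[i]], distance ~ fitted(.) | Subject, abline = c(0, 1)))
}
dev.off()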

Basic - T-Test -> Grouping Factor Must have Exactly 2 Levels

I am relatively new to R. For my assignment I have to start by conducting a t-test looking at the effect of a politician's party (Conservative or Labour) on their real gross wealth and real net wealth. I have to attempt to estimate the effect of serving in office on wealth using a simple t-test.
The dataset is called takehome.dta
Labour and Tory are binary where 1 indicates that they serve for that party and 0 otherwise.
The variables for wealth are lnrealgross and lnrealnet.
I have imported and attached the dataset, but when I attempt to conduct a simple t-test, I get the following message: "grouping factor must have exactly 2 levels." Not quite sure where I am going wrong. Any assistance would be appreciated!
Are you doing this:
t.test(y~x)
when you mean to do this
t.test(y,x)
In general, use the ~ when you have data like
y <- 1:10
x <- rep(letters[1:2], each = 5)
and the , when you have data like
y <- 1:5
x <- 6:10
I assume you're doing something like:
y <- 1:10
x <- rep(1,10)
t.test(y~x) #instead of t.test(y,x)
because the error suggests you have no variation in the grouping factor x
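In the poster's setting, a minimal sketch (assuming the Stata file is read with haven and that labour is coded 0/1 as described):
library(haven)
takehome <- read_dta("takehome.dta")
# labour is 0/1; provided both values actually occur, the formula interface
# coerces it to a two-level factor and the t-test runs
t.test(lnrealgross ~ labour, data = takehome)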
The difference between ~ and , is how your data are laid out, and hence which test you run. y ~ x expects long format: one outcome column y and a two-level grouping column x. t.test(y, x) expects wide format: two separate columns of measurements, one per sample (add paired = TRUE if the observations are matched pairs, e.g. before and after). The two calls are not interchangeable.
I was having a similar problem and, given the size of my dataset, had not realized that one of my y variables had no values for one of my levels. I had taken a series of gene readings for two groups, and one gene had readings only for group 2, not group 1. I hadn't even noticed, but this presents with the same error as having too many levels. The solution is to remove that y (in my case, that gene) from the analysis, and the error is solved.
