Difference between R and SPSS linear model results

I'm a beginner at statistics, currently attending an introductory course that uses SPSS. I've been trying to learn R at the same time, and so far I've consistently been getting the same results with both tools, as expected.
However, we're currently doing correlations (Pearson's rho) and fitting linear models, and I'm consistently getting different results between R and SPSS.
The dataset is GSS2012.sav, contained in this zip file.
d = GSS2012$tolerance
e = GSS2012$age
f = GSS2012$polviews
g = GSS2012$educ
Term          SPSS      R             Std. error (SPSS)
(Intercept)    6.694     7.29707726   0.623
e (age)       -0.031    -0.03130627   0.006
f (polviews)  -0.123    -0.20586503   0.072
g (educ)       0.411     0.40029541   0.033
Full, minimal working examples to get the results above are given below.
I've tried the different use = "..." options for cor(); it didn't make a difference.
cor(d, e, use = "pairwise.complete.obs")
Full, minimal working example for lm:
> library(haven)
> GSS2012 <- read_sav("full version/GSS2012.sav")
> lm(GSS2012$tolerance ~ GSS2012$age + GSS2012$polviews + GSS2012$educ, na.action="na.exclude", singular.ok = F)
Call:
lm(formula = GSS2012$tolerance ~ GSS2012$age + GSS2012$polviews +
GSS2012$educ, na.action = "na.exclude", singular.ok = F)
Coefficients:
(Intercept) GSS2012$age GSS2012$polviews GSS2012$educ
7.29708 -0.03131 -0.20587 0.40030
Nothing has so far given me the same values as SPSS. Not that I know the latter are necessarily correct; I'd just like to replicate the results.
SPSS script:
DATASET ACTIVATE DataSet1.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT tolerance
/METHOD=ENTER age polviews educ.
Articles like these are probably related: link1; link2; link3. But I haven't been able to use the information in them to replicate the SPSS output. (Again, R might have the more accurate results; I don't know. But I'm in "an SPSS environment", so it would be good if I could get the same results for now.)

This is only a partial answer, as I can see what the problem is, although I'm not sure what is causing it. The issue has to do with missing values and how they're handled in the SPSS file. Let's just take the educ variable as an example.
In the SPSS file you can see that the values 97, 98, and 99 are defined as missing values.
If you sort the SPSS file by the educ column, you can see there are two data rows with these missing values: IDs 837 and 1214.
In R, you can confirm that those rows do in fact contain missing values (NA):
> which(is.na(GSS2012$educ))
[1] 837 1214
The problem is that in SPSS, when you actually tell it to count how many rows are missing, it says there's only 1 missing data row:
FREQUENCIES VARIABLES=educ
/FORMAT=NOTABLE
/ORDER= ANALYSIS .
The problem is with ID 1214: SPSS is not treating that 99 value for 1214 as missing. For example, try changing educ for 837 to any other (non-missing) number, and you'll see that SPSS then reports 0 missing rows for educ, when in fact 1214 should still be missing (99).
I haven't checked, but I'm guessing a similar thing is happening to a number of rows for the polviews variable.
The consequence of this is that SPSS isn't treating those rows as missing data when you run the analysis, whereas in R those values are correctly set as missing and omitted. In other words, SPSS is using more data for the model than it should. You can confirm this by looking at the SPSS and R output: the degrees of freedom differ between the two programs, which then leads to a (slight) difference in the results.
I'm not sure why SPSS isn't treating those rows as missing. It could be a bug (it wouldn't be the first for SPSS...) or something to do with the way the file has been set up. I haven't checked the latter, because it's a big file and I'm not familiar enough with the dataset to know where to look.
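One quick way to see this from the R side (a sketch, assuming GSS2012 has been read with haven::read_sav() as in the question): count the rows that survive listwise deletion on the four model variables, and compare that with the N and degrees of freedom SPSS reports.
# How many complete rows does R's lm() actually use?
vars <- GSS2012[, c("tolerance", "age", "polviews", "educ")]
sum(complete.cases(vars))

# Which rows R treats as missing for educ, and the value labels (if any) haven attached
which(is.na(GSS2012$educ))
attr(GSS2012$educ, "labels")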

Related

Not able to visualize linear regression with ggPredict

I have a data set called dataTreadmill that contains 5 columns of text information; columns 6 through 56 contain the variables I would like to analyse.
For every variable I would like to perform a linear regression to see how my data changes across the different conditions.
I figured out that I could make a lm plot in the following way:
lmTreadmill = lm(StrideRegularity_AP~ConditionNr, data = dataTreadmill)
Visualizing this gives a nice plot:
ggPredict(lmTreadmill,se=TRUE,interactive=TRUE)
However, as I have 54 variables other than StrideRegularity_AP, I would like to use lapply:
col <- c(6:56) # these are the only columns containing data;
allFits = lapply(dataTreadmill[,col], function(x) (lm(x~dataTreadmill$ConditionNr+dataTreadmill$Group, data=dataTreadmill)))
Now I get a nice list for every variable with the information about the regression.
However, when I want to plot any of these linear regressions using this code:
ggPredict(allFits$StrideRegularity_AP)
When I compare allFits$StrideRegularity_AP with lmTreadmill (which should be the same model), I do not see any difference in structure or values. However, R gives the following error:
Error in `[[<-.data.frame`(`*tmp*`, yname, value = c(`1` = 0.616668527648763, :
replacement has 419 rows, data has 30
In addition: Warning message:
'newdata' had 30 rows but variables found have 419 rows
Why am I not able to visualize the linear regression after using lapply?
Thanks in advance!
Iris
Drop the dataTreadmill$ from the lm call and try again.
allFits = lapply(dataTreadmill[,col], function(x) (lm(x~ConditionNr+Group, data=dataTreadmill)))
It's not needed since you specify the data anyway (and on a quick test with and without it, I get the same error as you; no idea why, though).
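For the record, here is a sketch of an alternative that avoids the dataTreadmill$ prefixes altogether (assuming ggPredict comes from the ggiraphExtra package, as in your first plot): build each model from a formula via do.call(), so every fitted object stores the actual formula with plain variable names, which downstream plotting helpers tend to handle better.
library(ggiraphExtra)

outcomes <- names(dataTreadmill)[6:56]   # the columns you want to model
allFits <- lapply(outcomes, function(v) {
  f <- as.formula(paste(v, "~ ConditionNr + Group"))
  # do.call() embeds the evaluated formula in the stored call
  do.call("lm", list(formula = f, data = quote(dataTreadmill)))
})
names(allFits) <- outcomes

ggPredict(allFits$StrideRegularity_AP, se = TRUE, interactive = TRUE)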
Both solutions work! Thank you.
Although I still wonder why, because the data looks completely identical to the suggestion I wrote. However, now at least I can plot the data :)

Tests of Between-Subjects Effects

I am using the R code below to reproduce the "Tests of Between-Subjects Effects" table calculated in SPSS, but the results are different and I really don't know why. Please help me. Thanks a lot.
fit2222 <- manova(cbind(postrunning, postbalance, postbilateral, poststrngth,
                        postupperlimb, postresponse, postvisumotor, postdexterity)
                  ~ group + prerunning + prebalance + prebilateralcoordination +
                    prestrength + preupperlimbcoordination + preresponse +
                    previsumotor + predexterity,
                  data = Data)
summary.aov(fit2222)
[screenshot: output in R]
[screenshot: output in SPSS]
It's impossible to give a complete answer to this, or at least to say exactly what you should be doing differently, but it's clear from the output you're showing that you're not comparing the same results in the two packages. Note the numerator degrees of freedom (df), which are all 1 in R and 2 in SPSS.
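A sketch of the first thing to check (an assumption on my part, since the data aren't shown): a numerator df of 2 in SPSS suggests group is being treated as a categorical factor with three levels, whereas a numeric group column in R gets only 1 df. Converting it to a factor before refitting would at least make the df comparable; note also that summary.aov() reports sequential (Type I) sums of squares while SPSS GLM defaults to Type III, which can still leave small differences.
# Assumption: 'group' is stored as a numeric code in Data
Data$group <- factor(Data$group)

fit2222 <- manova(cbind(postrunning, postbalance, postbilateral, poststrngth,
                        postupperlimb, postresponse, postvisumotor, postdexterity)
                  ~ group + prerunning + prebalance + prebilateralcoordination +
                    prestrength + preupperlimbcoordination + preresponse +
                    previsumotor + predexterity,
                  data = Data)
summary.aov(fit2222)   # per-response between-subjects tables (Type I SS)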

Can I use quickpred in Mice to impute a subset of variables from a larger set of variables in a nested longitudinal (and long) dataframe?

I've tried to create a test data.frame to demonstrate my question, but my R skills aren't quite strong enough to even do that. I am not in a position to share my true database. I hope my question can stand on its own.
I am working with a nested longitudinal dataset that is saved as a long file (1000 subjects nested in 8 sites, 4 potential time points/subject, 68 potential predictor variables). I want to impute missing values on 4 static predictors (e.g., maternal education, family income) prior to conducting lme on the longitudinal outcomes in order to have a consistent number of cases for all models.
I am working with the package mice in r. From all that I have read, it is recommended that I use all the variables in my models and any other variables that may predict the missing values in my imputation. Given the number of variables in my models, I need something like quickpred to simplify this. But I'm getting an error that I do not understand.
I tried the following initial code for my database N2NPL, indicating c(14, 16, 18, 19) as the variables that I want to predict.
iniN2NPL <- mice(N2NPL[, c(14, 16, 18, 19)],
                 pred = quickpred(N2NPL, minpuc = 0.25,
                                  exclude = c('ID', 'TypeConvNon', 'TypeCtPr',
                                              'TypeName', 'CHR_converter')),
                 maxit = 0)
"Error in check.predictorMatrix(setup) :
The predictorMatrix has 73 rows and 73 columns. Both should be 4'
I know that the predictorMatrix produced by mice::quickpred needs to be square, but is there any way of not imputing all of the variables? Is it sufficient to include site as a predictor, given the nesting of subjects within sites?
Thank you for any help directing me to the proper code or instructions on this. The examples I see all seem much simpler than mine, and are thus of little help with the issues I'm having.
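One pattern that may help (a sketch, assuming columns 14, 16, 18, and 19 are the only variables you want to impute and the rest should only serve as predictors): call mice() on the full data frame, so its dimensions match the predictorMatrix from quickpred(), and blank out the imputation method for every variable you don't want imputed.
library(mice)

pred <- quickpred(N2NPL, minpuc = 0.25,
                  exclude = c('ID', 'TypeConvNon', 'TypeCtPr',
                              'TypeName', 'CHR_converter'))

meth <- make.method(N2NPL)                 # default method per column
keep <- names(N2NPL)[c(14, 16, 18, 19)]    # the four static predictors to impute
meth[!names(meth) %in% keep] <- ""         # "" means: do not impute this variable

iniN2NPL <- mice(N2NPL, predictorMatrix = pred, method = meth, maxit = 0)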

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but am coming up with an error and a warning that I cannot resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; there are no missing data. My dependent variable is binary; all other variables (besides the ID variable PTNO) are continuous.
I suspect something very generic is happening, but obviously fail to see it and have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example, is:
glm8 <- glmmLasso(Group ~ NDUFV2_dCTABL + GPER1_dCTABL + ESR1_dCTABL + ESR2_dCTABL +
                    KLF12_dCTABL + SP4_dCTABL + SP1_dCTABL + PGAM1_dCTABL + ANK3_dCTABL +
                    RASGRP1_dCTABL + AKT1_dCTABL + NUDT1_dCTABL + POLG_dCTABL +
                    ADARB1_dCTABL + OGG_dCTABL + PDE4B_dCTABL + GSK3B_dCTABL +
                    APOE_dCTABL + MAPK6_dCTABL,
                  rnd = list(PTNO = ~1),
                  family = poisson(link = log), data = stackdata, lambda = 100,
                  control = list(print.iter = TRUE, start = c(1, rep(0, 29)), q.start = 0.7))
The error message is displayed below. Specifically, I do not believe there are any NAs in the dataset, and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downloaded here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or you cannot download it. ANY help is much appreciated.
Just to follow up on @Kristofersen's comment above: it is indeed the start vector that messes your analysis up.
If I run
glm8 <- glmmLasso(Group ~ NDUFV2_dCTABL + GPER1_dCTABL + ESR1_dCTABL + ESR2_dCTABL +
                    KLF12_dCTABL + SP4_dCTABL + SP1_dCTABL + PGAM1_dCTABL + ANK3_dCTABL +
                    RASGRP1_dCTABL + AKT1_dCTABL + NUDT1_dCTABL + POLG_dCTABL +
                    ADARB1_dCTABL + OGG_dCTABL + PDE4B_dCTABL + GSK3B_dCTABL +
                    APOE_dCTABL + MAPK6_dCTABL,
                  rnd = list(PTNO = ~1),
                  family = binomial(),
                  data = stackdata,
                  lambda = 100,
                  control = list(print.iter = TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You have copied the example with Poisson regression, and you need to tweak the code to your situation. I have no idea whether the output makes sense.
Quick note: I ran with the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks, then Poisson may be reasonable (and it also converges), but you need to recode your outcome, since the two groups are coded as 1 and 2 and that will certainly mess up the Poisson regression.
In other words do a
stackdata$Group <- stackdata$Group-1
before you run the analysis.

Nested model in R

I'm having a huge problem with a nested model I am trying to fit in R.
I have a response-time experiment with 2 conditions, 46 people per condition, and 32 measures per person. I would like measures to be nested within people and people nested within conditions, but I can't get it to work.
The code I thought should make sense was:
nestedmodel <- lmer(responsetime ~ 1 + condition +
                      (1 | condition:person) + (1 | person:measure), data = dat)
However, all I get is an error:
Error in checkNlevels(reTrms$flist, n = n, control) :
number of levels of each grouping factor must be < number of observations
Unfortunately, I do not even know where to start looking what the problem is here.
Any ideas? Please, please, please? =)
Cheers!
This might be more appropriate on CrossValidated, but: lme4 is trying to tell you that one or more of your random effects is confounded with the residual variance. As you've described your data, I don't quite see why: you should have 2*46*32=2944 total observations, 2*46=92 combinations of condition and person, and 46*32=1472 combinations of measure and person.
If you do
lf <- lFormula(responsetime ~ 1 + condition +
(1|condition:person) + (1|person:measure), data=dat)
and then
lapply(lf$reTrms$Ztlist,dim)
to look at the transposed random-effect design matrices for each term, what do you get? You should (based on your description of your data) see that these matrices are 1472 by 2944 and 92 by 2944, respectively.
As @MrFlick says, a reproducible example would be nice. Other things you could show us are:
fit the model anyway, using lmerControl(check.nobs.vs.nRE = "ignore") to ignore the test, and show us the results (especially the random-effect variances and the statement of the numbers of groups); a sketch combining these steps is given after this list
show us the results of with(dat, table(table(interaction(condition, person)))) to give information on the number of replicates per combination (and similarly for measure)
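A sketch of what both suggestions might look like together (assuming dat has the columns responsetime, condition, person, and measure):
library(lme4)

# Fit despite the check, then inspect the random-effect variances and group counts
fit <- lmer(responsetime ~ 1 + condition +
              (1 | condition:person) + (1 | person:measure),
            data = dat,
            control = lmerControl(check.nobs.vs.nRE = "ignore"))
summary(fit)

# Replicates per condition-by-person combination (repeat analogously for measure)
with(dat, table(table(interaction(condition, person))))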

Resources