Error with sem function at R : differences in factors - r

I wanted to use the function sem (with the package lavaan) on my data in R :
Model1<- 'Transfer~Amotivation+Gender+Age
Amotivation~Gender+Age
transfer are 4 questions with a 5 point likert scale
Amotivation: 4 questions with a 5 pint likert scale
Gender: 0 (=male) and 1 (=female)
Age: just the different ages
And i got next error:
in getDataFull (data= data, group = group, grow.label = group.label,:
lavaan WARNING: some observed variances are (at least) a factor 100 times larger than others; please rescale
Is anybody familiar with this error? Does it influence my results? Do I have to change anything? I really don't know what this error means.

Your scales are not equivalent. Your gender variables are constrained to be either 0 or 1. Amotivation is constrained to be between 1 and 5, but age is even less constrained. I created some sample data for gender, age, and amotivation. You can see that the variance for the age variable is over 4,000 times higher than the variance for gender, and about 500 times higher than sample amotivation data.
gender <- c(0,1,1,1,0,0,1,1,0,1,1,0,0,1,1,1)
age <- c(18,42,87,12,24,26,98,84,23,12,95,44,54,23,10,16)
set.seed(42)
amotivation <- rnorm(16, 3, 1.5)
var(gender) # 0.25 variance
var(age) # 1017.27 variance
var(amotivation) # 2.21 variance
I'm not sure how the unequal variances influence your results, or if you need to do anything at all. To make your age variable more closely match the amotivation scale, you could transform the data so that it's also on a 5 point scale.
newage <- age/max(age)*5
var(newage) # 2.65 variance
You could try running the analysis both ways (using your original data and the transformed data) and see if there are differences.

Related

R: Find cutoffpoint for continous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have the two species A and B and want to find out which is the best cutoffpoint for value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar values and it worked for them, but not for the one given above (which is also the reason why i needed to include all 70 observations here).
D$Species_b<-ifelse(D$Species=="A",0,1)
my.glm<-glm(Species_b~Value, data = D, family = binomial)
dose.p(my.glm,p=0.5)
gives me 3.633957 as threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
this results in 45 correct assignments. however, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is obviously better.
Why does the function work for other values, but not for this one? Am I missing an obvious mistake? Or is there maybe a different/ better approach to solving my problem? I have several values I need to do this for, so I really do not want to just randomly test values until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic curve (ROC) for this type of analysis. This allows a visual and numerical assessment of how the sensitivity and specificity of your cutoff changes as you adjust your threshold. This allows you to select the optimum threshold based on when the overall accuracy is optimum. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21

ivprobit error: the leading minor of order # is not positive definite

I am trying to run a ivprobit regression on individual crime and demographic variables. The dependent variable is annualassault and I have 24 independent variables regarding to the demographic. The r.h.s endogenous variable is sogs_total, and the instruments are drivetime and distance. I got the error: the leading minor of order 4 is not positive definite.
I reduced my independent variables to only 5 because I think I might be over fitting the model (too many variables), but I still get the same error.
library(ivprobit)
ivprobit.assault = ivprobit(annualassault ~ age +sex+ education_1+ education_2+
education_3+ education_4+ education_5+ education_6+ education_7
+maritalstatus_0+maritalstatus_1+maritalstatus_2+maritalstatus_3+maritalstatus_4+
a_nondm+d_nonam+m_nonad+ ad_nonm+am_nond
+dm_nona+ adm |sogs_total|drivetime + distance, newdata)
ivprobit.assault2 = ivprobit(annualassault ~ age +sex+ education_1+maritalstatus_1+
am_nond|sogs_total|drivetime + distance, newdata)
Error in chol.default(mat) :
the leading minor of order 4 is not positive definite

Confusing p values with ANOVA on a big dataframe

I am trying to analyse the significant differences between different car company performance values across different countries. I am using ANOVA to do this.
Running ANOVA on my real dataset (30 countries, 1000 car companies and 90000 measurement scores) gave every car a zero p-value.
Confused by this, I created a reproducible example (below) with 30 groups, 3 car companies, 90000 random scores. Purposely, I kept a score of 1 for the Benz company where you shouldn't see any difference between countries. After running anova, I see a pvalue of 0.46 instead of 1.
Does any one know why is this ?
Reproducible example
set.seed(100000)
qqq <- 90000
df = data.frame(id = c(1:90000), country = c(rep("usa",3000), rep("usb",3000), rep("usc",3000), rep("usd",3000), rep("use",3000), rep("usf",3000), rep("usg",3000), rep("ush",3000), rep("usi",3000), rep("usj",3000), rep("usk",3000), rep("usl",3000), rep("usm",3000), rep("usn",3000), rep("uso",3000), rep("usp",3000), rep("usq",3000), rep("usr",3000), rep("uss",3000), rep("ust",3000), rep("usu",3000), rep("usv",3000), rep("usw",3000), rep("usx",3000), rep("usy",3000), rep("usz",3000), rep("usaa",3000), rep("usab",3000), rep("usac",3000), rep("usad",3000)), tesla=runif(90000), bmw=runif(90000), benz=rep(1, each=qqq))
str(df)
out<-data.frame()
for(j in 3:ncol(df)){
amod2 <- aov(df[,j]~df$country)
out[(j-2),1]<-colnames(df)[j]
out[(j-2),2]<-summary(amod2, test = adjusted("bonferroni"))[[1]][[1,"Pr(>F)"]]
}
colnames(out)<-c("cars","pvalue")
write.table(out,"df.output")
df.output
"cars" "pvalue"
"1" "tesla" 0.245931589754359
"2" "bmw" 0.382730335188437
"3" "benz" 0.465083026215268
With respect to the "benz" p-value in your reproducible example: an ANOVA analysis requires positive variance (i.e., non-constant data). If you violate this assumption, the model is degenerate. Technically, the p-value is based on an F-statistic whose value is a normalized ratio of the variance attributable to the "country" effect (for "benz" in your example, zero) divided by the total variance (for "benz" in your example, zero), so your F-statistic has "value" 0/0 or NaN.
Because of the approach R takes to calculating the F-statistic (using a QR matrix decomposition to improve numerical stability in "nearly" degenerate cases), it calculates an F-statistic equal to 1 (w/ 29 and 89970 degrees of freedom). This gives a p-value of:
> pf(1, 29, 89970, lower=FALSE)
[1] 0.465083
>
but it is, of course, largely meaningless.
With respect to your original problem, with large datasets relatively small effects will yield very small p-values. For example, if you add the following after your df definition above to introduce a difference in country usa:
df = within(df, {
o = country=="usa"
tesla[o] = tesla[o] + .1
bmw[o] = bmw[o] + .1
benz[o] = benz[o] + .1
rm(o)
})
you will find that out looks like this:
> out
cars pvalue
1 tesla 9.922166e-74
2 bmw 5.143542e-74
3 benz 0.000000e+00
>
Is this what you're seeing, or are you seeing all of them exactly zero?

T-test with grouping variable

I've got a data frame with 36 variables and 74 observations. I'd like to make a two sample paired ttest of 35 variables by 1 grouping variable (with two levels).
For example: the data frame contains "age" "body weight" and "group" variables.
Now I suppose I can do the ttest for each variable with this code:
t.test(age~group)
But, is there a way to test all the 35 variables with one code, and not one by one?
Sven has provided you with a great way of implementing what you wanted to have implemented. I, however, want to warn you about the statistical aspect of what you are doing.
Recall that if you are using the standard confidence level of 0.05, this means that for each t-test performed, you have a 5% chance of committing Type 1 error (incorrectly rejecting the null hypothesis.) By the laws of probability, running 35 individual t-tests compounds your probability of committing type 1 error by a factor of 35, or more exactly:
Pr(Type 1 Error) = 1 - (0.95)^35 = 0.834
Meaning that you have about an 83.4% chance of falsely rejecting a null hypothesis. Basically what this means is that, by running so many T-tests, there is a very high probability that at least one of your T-tests is going to provide an incorrect result.
Just FYI.
An example data frame:
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
weight = rnorm(10, 30), group = gl(2,5))
You can use lapply:
lapply(dat[1:3], function(x)
t.test(x ~ dat$group, paired = TRUE, na.action = na.pass))
In the command above, 1:3 represents the numbers of the columns including the variables. The argument paired = TRUE is necessary to perform a paired t-test.

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect in plants yields (and the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used in the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead one has to use the argument strata of the boot function :
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p.value :
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a -biased- estimate of the confidence interval, but the significance testing about the difference should be done by permutation over the complete dataset.
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you have two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion then the plots would be statified samples and you'd want to treat them as such. But that isn't the case here.

Resources