selectiveInference package: fixedLassoInf after lasso logit - R

Following Taylor and Tibshirani (2015), I'm applying the selectiveInference package in R after a lasso logit fit with glmnet. Specifically, I'm interested in inference for the lasso at a fixed lambda.
Below I report the code.
First, I standardized the X matrix (as suggested in the package documentation: https://cran.r-project.org/web/packages/selectiveInference/selectiveInference.pdf).
Then I fit glmnet and extracted the beta coefficients for a lambda previously picked with LOOCV.
library(glmnet)
library(selectiveInference)

# Standardize the design matrix
X.std <- std(X1[1:2833, ])

# Lasso logit fit
fit <- glmnet(X.std, Y[1:2833], alpha = 1, family = "binomial")
fit$lambda

# Lambda previously chosen with LOOCV
lambda <- 0.00431814
n <- 2833

# glmnet scales its loss by 1/n, so the coefficients at this lambda are
# extracted at s = lambda/n (as the selectiveInference documentation suggests)
beta_hat <- coef(fit, x = X.std, y = Y[1:2833], s = lambda / n, exact = TRUE)
beta_hat

# Selective inference at the fixed lambda
out <- fixedLassoInf(X.std, Y[1:2833], beta_hat, lambda, family = "binomial")
out
After I run the code, this is what I get. I understand that it is something related to the KKT conditions, and that the problem is specific to the lasso logit: when I try with family="gaussian", I do not get any warnings or errors.
Warning message:
In fixedLogitLassoInf(x, y, beta, lambda, alpha = alpha, type = type, :
Solution beta does not satisfy the KKT conditions (to within specified tolerances)
> res
Call:
fixedLassoInf(x = X.std, y = Y[1:2833], beta = b, lambda = lam,
family = c("binomial"))
Testing results at lambda = 0.004, with alpha = 0.100
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 58.558 6.496 0.000 46.078 124.807 0.049 0.050
2 -8.008 -2.815 0.005 -13.555 -3.106 0.049 0.049
3 -18.514 -6.580 0.000 -31.262 -14.153 0.049 0.048
4 -1.070 -0.390 0.447 -22.976 19.282 0.050 0.050
5 -0.320 -1.231 0.610 -0.660 1.837 0.050 0.000
6 -0.448 -1.906 0.619 -2.378 5.056 0.050 0.050
7 -47.732 -9.624 0.000 -161.370 -44.277 0.050 0.050
8 -39.023 -8.378 0.000 -54.988 -31.510 0.050 0.048
10 23.827 1.991 0.181 -20.151 42.867 0.049 0.049
11 -2.454 -0.522 0.087 -269.951 9.345 0.050 0.050
12 0.045 0.018 0.993 -Inf -14.962 0.000 0.050
13 -18.647 -1.143 0.156 -149.623 25.464 0.050 0.050
14 -3.508 -1.140 0.305 -8.444 7.000 0.049 0.049
15 -0.620 -0.209 0.846 -3.486 46.045 0.050 0.050
16 -3.960 -1.288 0.739 -6.931 47.641 0.049 0.050
17 -8.587 -3.010 0.023 -42.700 -2.474 0.050 0.049
18 2.851 0.986 0.031 2.745 196.728 0.050 0.050
19 -6.612 -1.258 0.546 -14.967 37.070 0.049 0.000
20 -11.621 -2.291 0.021 -29.558 -2.536 0.050 0.049
21 -76.957 -0.980 0.565 -186.701 483.180 0.049 0.050
22 -13.556 -5.053 0.000 -126.367 -13.274 0.050 0.049
23 -4.836 -0.388 0.519 -109.667 125.933 0.050 0.050
24 11.355 0.898 0.492 -55.335 30.312 0.050 0.049
25 -1.118 -0.146 0.919 -4.439 232.172 0.049 0.050
26 -7.776 -1.298 0.200 -17.540 8.006 0.050 0.049
27 0.678 0.234 0.515 -42.265 38.710 0.050 0.050
28 32.938 1.065 0.335 -77.314 82.363 0.050 0.049
Does someone know how to resolve this warning?
I would like to understand which kind of "tolerances" I should specify.
Thanks for the help.
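For what it's worth, fixedLassoInf's KKT check is governed by its tol.beta and tol.kkt arguments, and the warning often stems from the lasso solution passed in not being solved precisely enough. A minimal sketch of both adjustments (the specific values are illustrative, not recommendations for this data set):
# Solve the lasso to a tighter convergence threshold before extracting beta
fit <- glmnet(X.std, Y[1:2833], alpha = 1, family = "binomial", thresh = 1e-10)
beta_hat <- coef(fit, x = X.std, y = Y[1:2833], s = lambda / n, exact = TRUE)

# tol.beta (which coefficients count as nonzero) and tol.kkt (how strictly the
# KKT conditions must hold) can also be loosened if the solution is only
# slightly off; the values below are illustrative
out <- fixedLassoInf(X.std, Y[1:2833], beta_hat, lambda,
                     family = "binomial",
                     tol.beta = 1e-5, tol.kkt = 0.2)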

Related

Can I trust my cfa results if the variance-covariance matrix does not appear to be positive definite?

I am trying to create a structural equation model that tests the structure of latent variables underlying a Big Five dataset found on Kaggle. More specifically, I would like to replicate a finding which suggests that common method variance (e.g., response biases) inflates the often-observed high intercorrelations between the manifest variables/items of the Big Five (Chang, Connelly, & Geeza, 2012).
big5_CFAmodel_cmv <-'EXTRA =~ EXT1 + EXT2 + EXT3 + EXT4 + EXT5 + EXT7 + EXT8 + EXT9 + EXT10
AGREE =~ AGR1 + AGR2 + AGR4 + AGR5 + AGR6 + AGR7 + AGR8 + AGR9 + AGR10
EMO =~ EST1 + EST2 + EST3 + EST5 + EST6 + EST7 + EST8 + EST9 + EST10
OPEN =~ OPN1 + OPN2 + OPN3 + OPN5 + OPN6 + OPN7 + OPN8 + OPN9 + OPN10
CON =~ CSN1 + CSN2 + CSN3 + CSN4 + CSN5 + CSN6 + CSN7 + CSN8 + CSN9
CMV =~ EXT1 + EXT2 + EXT3 + EXT4 + EXT5 + EXT7 + EXT8 + EXT9 + EXT10 + AGR1 + AGR2 + AGR4 + AGR5 + AGR6 + AGR7 + AGR8 + AGR9 + AGR10 + CSN1 + CSN2 + CSN3 + CSN4 + CSN5 + CSN6 + CSN7 + CSN8 + CSN9 + EST1 + EST2 + EST3 + EST5 + EST6 + EST7 + EST8 + EST9 + EST10 + OPN1 + OPN2 + OPN3 + OPN5 + OPN6 + OPN7 + OPN8 + OPN9 + OPN10 '
big5_CFA_cmv <- cfa(model = big5_CFAmodel_cmv,
data = big5, estimator = "MLR")
Here is my full code on Github. Now I get a warning from lavaan:
lavaan WARNING:
The variance-covariance matrix of the estimated parameters (vcov)
does not appear to be positive definite! The smallest eigenvalue
(= -4.921738e-07) is smaller than zero. This may be a symptom that
the model is not identified.
But when I run summary(big5_CFA_cmv, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE), lavaan appeared to end normally and produced good fit statistics.
lavaan 0.6-8 ended normally after 77 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 150
Used Total
Number of observations 498 500
Model Test User Model:
Standard Robust
Test Statistic 2459.635 2262.490
Degrees of freedom 885 885
P-value (Chi-square) 0.000 0.000
Scaling correction factor 1.087
Yuan-Bentler correction (Mplus variant)
Model Test Baseline Model:
Test statistic 9934.617 8875.238
Degrees of freedom 990 990
P-value 0.000 0.000
Scaling correction factor 1.119
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.824 0.825
Tucker-Lewis Index (TLI) 0.803 0.805
Robust Comparative Fit Index (CFI) 0.830
Robust Tucker-Lewis Index (TLI) 0.810
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -31449.932 -31449.932
Scaling correction factor 1.208
for the MLR correction
Loglikelihood unrestricted model (H1) -30220.114 -30220.114
Scaling correction factor 1.105
for the MLR correction
Akaike (AIC) 63199.863 63199.863
Bayesian (BIC) 63831.453 63831.453
Sample-size adjusted Bayesian (BIC) 63355.347 63355.347
Root Mean Square Error of Approximation:
RMSEA 0.060 0.056
90 Percent confidence interval - lower 0.057 0.053
90 Percent confidence interval - upper 0.063 0.059
P-value RMSEA <= 0.05 0.000 0.000
Robust RMSEA 0.058
90 Percent confidence interval - lower 0.055
90 Percent confidence interval - upper 0.061
Standardized Root Mean Square Residual:
SRMR 0.061 0.061
Parameter Estimates:
Standard errors Sandwich
Information bread Observed
Observed information based on Hessian
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
EXTRA =~
EXT1 1.000 0.455 0.372
EXT2 1.010 0.323 3.129 0.002 0.459 0.358
EXT3 0.131 0.301 0.434 0.664 0.059 0.049
EXT4 1.393 0.430 3.240 0.001 0.633 0.532
EXT5 0.706 0.168 4.188 0.000 0.321 0.263
EXT7 1.001 0.183 5.477 0.000 0.455 0.323
EXT8 1.400 0.545 2.570 0.010 0.637 0.513
EXT9 1.468 0.426 3.446 0.001 0.667 0.505
EXT10 1.092 0.335 3.258 0.001 0.497 0.387
AGREE =~
AGR1 1.000 0.616 0.486
AGR2 0.721 0.166 4.349 0.000 0.444 0.374
AGR4 1.531 0.205 7.479 0.000 0.944 0.848
AGR5 0.999 0.141 7.085 0.000 0.615 0.568
AGR6 1.220 0.189 6.464 0.000 0.752 0.661
AGR7 0.743 0.155 4.795 0.000 0.458 0.406
AGR8 0.836 0.126 6.614 0.000 0.515 0.502
AGR9 1.292 0.209 6.176 0.000 0.796 0.741
AGR10 0.423 0.124 3.409 0.001 0.261 0.258
EMO =~
EST1 1.000 0.856 0.669
EST2 0.674 0.063 10.626 0.000 0.577 0.485
EST3 0.761 0.059 12.831 0.000 0.651 0.580
EST5 0.646 0.081 7.970 0.000 0.552 0.444
EST6 0.936 0.069 13.542 0.000 0.801 0.661
EST7 1.256 0.128 9.805 0.000 1.075 0.880
EST8 1.298 0.131 9.888 0.000 1.111 0.883
EST9 0.856 0.071 11.997 0.000 0.733 0.602
EST10 0.831 0.085 9.744 0.000 0.711 0.545
OPEN =~
OPN1 1.000 0.593 0.518
OPN2 0.853 0.106 8.065 0.000 0.506 0.492
OPN3 1.064 0.205 5.186 0.000 0.631 0.615
OPN5 1.012 0.124 8.161 0.000 0.600 0.654
OPN6 1.039 0.204 5.085 0.000 0.616 0.553
OPN7 0.721 0.089 8.115 0.000 0.428 0.481
OPN8 0.981 0.077 12.785 0.000 0.582 0.474
OPN9 0.550 0.106 5.187 0.000 0.326 0.332
OPN10 1.269 0.200 6.332 0.000 0.753 0.772
CON =~
CSN1 1.000 0.779 0.671
CSN2 1.151 0.128 8.997 0.000 0.897 0.665
CSN3 0.567 0.068 8.336 0.000 0.442 0.437
CSN4 1.054 0.107 9.867 0.000 0.821 0.669
CSN5 0.976 0.083 11.749 0.000 0.760 0.593
CSN6 1.393 0.133 10.464 0.000 1.085 0.779
CSN7 0.832 0.082 10.175 0.000 0.648 0.583
CSN8 0.684 0.077 8.910 0.000 0.532 0.500
CSN9 0.938 0.075 12.535 0.000 0.730 0.574
CMV =~
EXT1 1.000 0.815 0.666
EXT2 1.074 0.091 11.863 0.000 0.875 0.683
EXT3 1.112 0.159 7.001 0.000 0.907 0.749
EXT4 0.992 0.090 11.067 0.000 0.809 0.679
EXT5 1.194 0.108 11.064 0.000 0.974 0.798
EXT7 1.253 0.069 18.133 0.000 1.021 0.725
EXT8 0.733 0.109 6.706 0.000 0.597 0.481
EXT9 0.857 0.105 8.136 0.000 0.698 0.529
EXT10 1.010 0.088 11.446 0.000 0.824 0.641
AGR1 0.047 0.142 0.328 0.743 0.038 0.030
AGR2 0.579 0.173 3.336 0.001 0.472 0.397
AGR4 -0.144 0.167 -0.859 0.390 -0.117 -0.105
AGR5 0.154 0.143 1.075 0.282 0.125 0.116
AGR6 -0.156 0.161 -0.971 0.332 -0.127 -0.112
AGR7 0.581 0.178 3.270 0.001 0.474 0.421
AGR8 0.224 0.123 1.820 0.069 0.183 0.178
AGR9 -0.043 0.145 -0.299 0.765 -0.035 -0.033
AGR10 0.540 0.137 3.935 0.000 0.440 0.436
CSN1 -0.109 0.143 -0.761 0.446 -0.089 -0.077
CSN2 -0.289 0.150 -1.931 0.054 -0.235 -0.175
CSN3 -0.064 0.114 -0.561 0.575 -0.052 -0.052
CSN4 0.041 0.166 0.246 0.806 0.033 0.027
CSN5 0.009 0.132 0.065 0.948 0.007 0.005
CSN6 -0.307 0.181 -1.694 0.090 -0.251 -0.180
CSN7 -0.206 0.132 -1.555 0.120 -0.168 -0.151
CSN8 0.102 0.137 0.741 0.459 0.083 0.078
CSN9 0.016 0.151 0.107 0.915 0.013 0.010
EST1 -0.063 0.167 -0.375 0.708 -0.051 -0.040
EST2 0.136 0.109 1.248 0.212 0.110 0.093
EST3 -0.103 0.165 -0.625 0.532 -0.084 -0.075
EST5 0.117 0.125 0.932 0.351 0.095 0.076
EST6 0.002 0.158 0.010 0.992 0.001 0.001
EST7 -0.253 0.239 -1.058 0.290 -0.206 -0.169
EST8 -0.216 0.243 -0.888 0.375 -0.176 -0.140
EST9 0.159 0.136 1.168 0.243 0.129 0.106
EST10 0.331 0.135 2.462 0.014 0.270 0.207
OPN1 -0.025 0.150 -0.169 0.866 -0.021 -0.018
OPN2 0.042 0.127 0.332 0.740 0.034 0.033
OPN3 -0.088 0.110 -0.799 0.424 -0.072 -0.070
OPN5 0.208 0.139 1.499 0.134 0.170 0.185
OPN6 -0.012 0.116 -0.102 0.919 -0.010 -0.009
OPN7 0.146 0.126 1.156 0.248 0.119 0.133
OPN8 -0.140 0.135 -1.036 0.300 -0.114 -0.093
OPN9 -0.074 0.103 -0.723 0.470 -0.060 -0.062
OPN10 0.035 0.138 0.250 0.802 0.028 0.029
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
EXTRA ~~
AGREE -0.096 0.036 -2.692 0.007 -0.342 -0.342
EMO -0.089 0.050 -1.782 0.075 -0.228 -0.228
OPEN -0.013 0.025 -0.534 0.594 -0.048 -0.048
CON -0.060 0.042 -1.440 0.150 -0.170 -0.170
CMV -0.063 0.081 -0.783 0.434 -0.171 -0.171
AGREE ~~
EMO -0.003 0.057 -0.059 0.953 -0.006 -0.006
OPEN 0.068 0.040 1.712 0.087 0.186 0.186
CON 0.085 0.047 1.818 0.069 0.177 0.177
CMV 0.239 0.046 5.185 0.000 0.476 0.476
EMO ~~
OPEN 0.040 0.042 0.957 0.338 0.079 0.079
CON 0.229 0.050 4.542 0.000 0.343 0.343
CMV 0.250 0.066 3.810 0.000 0.358 0.358
OPEN ~~
CON 0.058 0.044 1.308 0.191 0.125 0.125
CMV 0.098 0.069 1.412 0.158 0.202 0.202
CON ~~
CMV 0.185 0.072 2.576 0.010 0.291 0.291
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.EXT1 0.754 0.059 12.680 0.000 0.754 0.503
.EXT2 0.804 0.065 12.443 0.000 0.804 0.489
.EXT3 0.658 0.084 7.843 0.000 0.658 0.449
.EXT4 0.537 0.059 9.162 0.000 0.537 0.379
.EXT5 0.545 0.049 11.184 0.000 0.545 0.366
.EXT7 0.892 0.080 11.107 0.000 0.892 0.450
.EXT8 0.907 0.117 7.740 0.000 0.907 0.589
.EXT9 0.971 0.099 9.763 0.000 0.971 0.556
.EXT10 0.867 0.081 10.666 0.000 0.867 0.525
.AGR1 1.207 0.109 11.087 0.000 1.207 0.750
.AGR2 0.790 0.085 9.293 0.000 0.790 0.561
.AGR4 0.439 0.079 5.592 0.000 0.439 0.355
.AGR5 0.708 0.066 10.721 0.000 0.708 0.602
.AGR6 0.803 0.075 10.670 0.000 0.803 0.621
.AGR7 0.628 0.056 11.266 0.000 0.628 0.495
.AGR8 0.664 0.059 11.168 0.000 0.664 0.631
.AGR9 0.548 0.056 9.726 0.000 0.548 0.474
.AGR10 0.647 0.059 10.934 0.000 0.647 0.636
.EST1 0.935 0.080 11.644 0.000 0.935 0.571
.EST2 1.026 0.077 13.359 0.000 1.026 0.724
.EST3 0.869 0.070 12.409 0.000 0.869 0.689
.EST5 1.196 0.075 15.912 0.000 1.196 0.773
.EST6 0.826 0.067 12.380 0.000 0.826 0.562
.EST7 0.453 0.059 7.653 0.000 0.453 0.304
.EST8 0.457 0.065 7.044 0.000 0.457 0.289
.EST9 0.862 0.067 12.860 0.000 0.862 0.581
.EST10 0.986 0.074 13.395 0.000 0.986 0.579
.OPN1 0.964 0.098 9.828 0.000 0.964 0.735
.OPN2 0.792 0.070 11.309 0.000 0.792 0.750
.OPN3 0.670 0.085 7.903 0.000 0.670 0.635
.OPN5 0.413 0.039 10.466 0.000 0.413 0.490
.OPN6 0.866 0.099 8.780 0.000 0.866 0.696
.OPN7 0.574 0.048 11.944 0.000 0.574 0.725
.OPN8 1.181 0.094 12.627 0.000 1.181 0.784
.OPN9 0.863 0.083 10.424 0.000 0.863 0.894
.OPN10 0.376 0.051 7.358 0.000 0.376 0.395
.CSN1 0.774 0.079 9.836 0.000 0.774 0.574
.CSN2 1.082 0.099 10.961 0.000 1.082 0.595
.CSN3 0.837 0.072 11.594 0.000 0.837 0.820
.CSN4 0.817 0.067 12.117 0.000 0.817 0.542
.CSN5 1.063 0.077 13.728 0.000 1.063 0.646
.CSN6 0.856 0.089 9.613 0.000 0.856 0.442
.CSN7 0.850 0.065 13.025 0.000 0.850 0.688
.CSN8 0.817 0.057 14.298 0.000 0.817 0.721
.CSN9 1.079 0.077 13.982 0.000 1.079 0.667
EXTRA 0.207 0.141 1.467 0.142 1.000 1.000
AGREE 0.380 0.101 3.744 0.000 1.000 1.000
EMO 0.732 0.104 7.075 0.000 1.000 1.000
OPEN 0.352 0.098 3.603 0.000 1.000 1.000
CON 0.606 0.089 6.792 0.000 1.000 1.000
CMV 0.665 0.203 3.269 0.001 1.000 1.000
R-Square:
Estimate
EXT1 0.497
EXT2 0.511
EXT3 0.551
EXT4 0.621
EXT5 0.634
EXT7 0.550
EXT8 0.411
EXT9 0.444
EXT10 0.475
AGR1 0.250
AGR2 0.439
AGR4 0.645
AGR5 0.398
AGR6 0.379
AGR7 0.505
AGR8 0.369
AGR9 0.526
AGR10 0.364
EST1 0.429
EST2 0.276
EST3 0.311
EST5 0.227
EST6 0.438
EST7 0.696
EST8 0.711
EST9 0.419
EST10 0.421
OPN1 0.265
OPN2 0.250
OPN3 0.365
OPN5 0.510
OPN6 0.304
OPN7 0.275
OPN8 0.216
OPN9 0.106
OPN10 0.605
CSN1 0.426
CSN2 0.405
CSN3 0.180
CSN4 0.458
CSN5 0.354
CSN6 0.558
CSN7 0.312
CSN8 0.279
CSN9 0.333
However, there are some negative factor loadings on the common method variance factor. Additionally, extraversion seems to correlate negatively with CMV.
What does this mean? And can I trust the fit statistics, or is my model misspecified?
First, let me clear up your misinterpretation of the warning message. It refers to the covariance matrix of the estimated parameters (i.e., vcov(big5_CFA_cmv), from which SEs are calculated as the square roots of the variances on the diagonal), not to the estimates themselves. Redundancy among estimated parameters can indicate a lack of identification, which you can check empirically by saving the model-implied covariance matrix and fitting the same model to it:
MI_COV <- lavInspect(big5_CFA_cmv, "cov.ov")
summary(cfa(model = big5_CFAmodel_cmv,
            sample.cov = MI_COV,
            sample.nobs = nobs(big5_CFA_cmv)))
If your estimates change, that is evidence that your model is not identified. If the estimates remain the same, the empirical check is inconclusive (i.e., it might still not be identified, but the optimizer just found the same local solution that seemed stable enough to stop searching the parameter space; criteria for inferring convergence are not perfect).
Regarding your model specification, I doubt it is identified, because your CMV factor (on which all indicators load) is allowed to correlate with the trait factors (which are also allowed to correlate with each other). That contradicts the definition of a "method factor": something about the way the data were measured that has nothing to do with what is being measured. Even when traits are orthogonal to methods, empirical identification becomes tenuous when traits and/or methods are allowed to correlate among themselves. Multitrait-multimethod (MTMM) models are notorious for such problems, as are many bifactor models (which typically have one trait factor and many method factors; your model resembles that structure, but reversed).
What does this mean?
Your negative (and most of your positive) CMV loadings are not significant. Varying around 0 (in both directions) is consistent with the null hypothesis that they are zero. More noteworthy (and related to my concern above) is that the CMV loadings are significant for all EXT indicators but only a few others (three AGR indicators and one EST indicator). The correlations between CMV and the traits really complicate the interpretation, as does using reference indicators. Before you interpret anything, I would recommend fixing all factor variances to 1 using std.lv=TRUE and making CMV orthogonal to the traits: EXTRA + AGREE + EMO + OPEN + CON ~~ 0*CMV (a sketch of that respecification follows).
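A minimal sketch of that respecification (same model syntax as above, with the orthogonality constraint appended and factor variances fixed to 1; untested against your data):
big5_CFAmodel_cmv_orth <- paste(
  big5_CFAmodel_cmv,
  "EXTRA + AGREE + EMO + OPEN + CON ~~ 0*CMV",  # CMV orthogonal to the trait factors
  sep = "\n"
)

big5_CFA_cmv_orth <- cfa(model = big5_CFAmodel_cmv_orth,
                         data = big5,
                         estimator = "MLR",
                         std.lv = TRUE)  # fix all factor variances to 1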
However, I still anticipate problems because you are estimating so many model parameters with a relatively small sample of 500 (498 after listwise deletion). That is not nearly a large enough sample to reliably estimate 150 free parameters from 45*46/2 = 1035 observed (co)variances.

mean() returning error "argument is not numeric or logical: returning NA" but only for some columns in data frame?

I'm pretty new to R, so maybe this is something obvious, but I'm not sure what's going on. I load a file with a bunch of data that I then split up into separate data frames. They look something like this:
V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
3 1.000 2 3 4 5 6 7.000 8.000 9.000 10.000 11.000 12.000
4 0.042 0.067 0.292 0.206 0.071 0.067 0.040 0.063 0.059 0.040 0.066 0.040
5 0.043 0.172 0.179 0.199 0.073 0.067 0.040 0.062 0.058 0.039 0.066 0.039
6 0.040 0.066 0.29 0.185 0.072 0.067 0.040 0.062 0.058 0.039 0.065 0.039
7 0.039 0.068 0.291 0.189 0.075 0.069 0.040 0.064 0.058 0.041 0.064 0.039
8 0.042 0.063 0.271 0.191 0.07 0.068 0.040 0.065 0.058 0.041 0.066 0.040
9 0.041 0.067 0.342 0.199 0.069 0.066 0.041 0.065 0.057 0.040 0.065 0.042
10 0.044 0.064 0.295 0.198 0.069 0.067 0.039 0.064 0.057 0.040 0.067 0.041
11 0.041 0.067 0.29 0.211 0.066 0.067 0.043 0.056 0.058 0.042 0.067 0.042
I'm trying to find the means of rows 4-6 and 7-9 for each column. I have each data frame in a list called "plates". When I use the line:
plates[[1]][2:4, 7]
I end up with the output:
[1] 0.04 0.04 0.04
If I wrap mean() around the code above, it works fine for columns 7 and higher. However, when I use the same code for columns lower than 7, say column 2, I end up with:
[1] 0.067 0.172 0.066
57 Levels: 0.063 0.064 0.066 0.067 0.068 0.069 0.07 0.071 0.072 0.08 0.081 0.082 0.083 0.084 0.085 ... PlateFormat
I have no idea what this "57 Levels:" output is, but I'm assuming it is my problem. I only want the mean of the three numbers (0.067, 0.172, 0.066), but whatever is producing the "57 Levels" appears to be causing mean() to give me the error in the title. Any help would be greatly appreciated.
There is an entry somewhere in that column that can't be parsed as a number, so read.csv() (or whatever you used) is reading the column in as a factor. It could be a typo (something as simple as an extra decimal point or a trailing comma), a missing-value code such as "?", or a stray bit of text (note the "PlateFormat" string among your factor levels, which looks like a header that got mixed into the data).
You can use
numify <- function(x) as.numeric(as.character(x))
mydata[] <- lapply(mydata, numify)
to convert by brute force, but it would be better to use
bad_vals <- function(x) {
  x[!is.na(x) & is.na(numify(x))]  # entries that are non-missing but not parseable as numbers
}
lapply(mydata, bad_vals)
to identify what the bad values are, so you can fix them upstream in your data file (or add missing-value codes to the na.strings= argument in your input call).
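For example (the file name and missing-value code here are hypothetical; adjust them to your actual input call):
# Hypothetical re-read: treat "?" as missing and keep strings as characters,
# so a stray token no longer turns the whole column into a factor
plates_raw <- read.csv("plates.csv", na.strings = c("NA", "?"),
                       stringsAsFactors = FALSE)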

Boxplot factor across many samples with R

Given the data
Step A B C D E F G I J
1 1 0.158 0.011 0.099 6.504 5.914 0.000 0.100 0.330 0.000
2 2 0.345 0.016 0.102 6.050 5.285 0.000 0.102 0.316 0.001
3 1 0.324 0.015 0.100 7.146 6.426 0.000 0.101 0.293 0.000
4 2 0.264 0.015 0.099 5.864 5.202 0.000 0.101 0.296 0.000
5 1 0.346 0.022 0.101 5.889 5.027 0.000 0.101 0.411 0.000
6 2 0.397 0.022 0.130 6.061 5.311 0.000 0.131 0.220 0.000
7 1 0.337 0.015 0.048 7.417 6.839 0.000 0.110 0.129 0.000
8 2 0.362 0.016 0.143 5.726 4.951 0.001 0.144 0.268 0.000
9 1 0.178 0.011 0.099 5.831 5.290 0.000 0.100 0.261 0.000
d <- read.table('sample.txt', header=T) gives me a data frame, and boxplot(d$A ~ d$Step) yields a reasonable graph, but I cannot seem to get all the plots on the same graph. Something like boxplot(d ~ d$Step) is what I expected to work, but I get the following error:
Error in model.frame.default(formula = d ~ d$Step) :
invalid type (list) for variable 'd'
I've tried making Step a factor (d$Step <- as.factor(d$Step)), but that seems to have no effect.
An alternative is to plot these in base R, each on its own scale, like this:
par(mfrow = c(3, 3))
for (i in 2:10) {
  boxplot(d[, i] ~ d$Step, main = names(d)[i])
}
We can do this with the tidyverse:
library(tidyverse)
gather(d, Var, Val, -Step) %>%
  mutate(Step = factor(Step)) %>%
  ggplot(aes(x = Var, y = Val, fill = Step)) +
  geom_boxplot() +
  scale_fill_manual(values = c("red", "blue"))
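gather() still works but has been superseded in tidyr; an equivalent reshape with pivot_longer() (assuming a recent version of tidyr) would be:
pivot_longer(d, cols = -Step, names_to = "Var", values_to = "Val") %>%
  mutate(Step = factor(Step)) %>%
  ggplot(aes(x = Var, y = Val, fill = Step)) +
  geom_boxplot() +
  scale_fill_manual(values = c("red", "blue"))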

import specific rows from "txt" into R

I have a "example.txt" document just as follows:
SIGNAL: 40 41 42
0.406 0.043 0.051 0.021 0.013
0.056 0.201 0.026 0.009 0.000
0.000 0.128 0 0.009 0.000
TOTAL: 0.657
SIGNAL: 44 45 46 48
0.128 0.338 0.026
0.333 0.03 0.000
0.060 0.013 0.004
0.009 0.017 0.009
0.013 0 0.000
TOTAL: 0.704
SIGNAL: 51 52 54
0.368 0.081 0.085 0.004
0.162 0.09 0.064 0.073
0.013 0.017 0.009 0.000
TOTAL: 0.266
SIGNAL: 60 61 62 63 64 65 66 67
0.530 0.030
0.009 0.179
0.154 0.004
0.068 0.009
TOTAL: 0.796
I want to import the rows between "SIGNAL: 44 45 46 48" and "TOTAL: 0.704" into R. I use read.table("example.txt", skip=6, nrow=5) to extract these specific rows, and it works:
V1 V2 V3
1 0.128 0.338 0.026
2 0.333 0.030 0.000
3 0.060 0.013 0.004
4 0.009 0.017 0.009
5 0.013 0.000 0.000
However, my real data is very big (450,000 rows). If I want to extract the rows between "SIGNAL: 3000 3001 3002 3003" and the next "TOTAL", how can I do it? Thank you so much!
I have worked it out based on akrun's code. For example, to extract the first two sets I can use:
lines <- readLines('example.txt')
g <- c(40, 44)  # first number in each SIGNAL header of interest
sapply(seq_along(g), function(x) {
  Map(function(i, j) read.table(text = lines[(i + 1):(j - 1)], sep = '', header = FALSE),
      grep(paste('SIGNAL:', g[x]), lines),
      grep('TOTAL', lines)[which(grep(paste('SIGNAL:', g[x]), lines) == grep('SIGNAL', lines))])
})
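For a file this large, a small helper that locates one SIGNAL header and the next TOTAL line may be easier to read. A minimal sketch (read_block is a hypothetical helper, not part of the original answer):
lines <- readLines('example.txt')

# Hypothetical helper: read the block that starts right after a given SIGNAL
# header and ends right before the next TOTAL line
read_block <- function(lines, header) {
  start <- grep(header, lines, fixed = TRUE)[1]  # position of the SIGNAL header
  ends  <- grep('^TOTAL', lines)                 # positions of all TOTAL lines
  end   <- ends[ends > start][1]                 # first TOTAL after the header
  read.table(text = lines[(start + 1):(end - 1)], header = FALSE)
}

read_block(lines, 'SIGNAL: 44 45 46 48')
# For the real data:
# read_block(lines, 'SIGNAL: 3000 3001 3002 3003')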

Evaluating a matrix by row for a condition being met in R

I've got data in the following format.
P10_neg._qn P11_neg._qn P12_neg._qn P14_neg._qn P17_neg._qn P24_neg._qn P25_neg._qn
1 -0.025 -0.037 -0.032 -0.061 -0.176 0.033 -0.011
2 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
3 0.033 -0.127 0.042 0.014 0.097 0.105 0.048
4 0.033 -0.127 0.042 0.014 0.097 0.105 0.048
5 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
6 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
What is the best way to check, for every row, how many entries are greater than, for instance, 0.1, and return a vector of counts?
You can use the rowSums function for this task. Assuming that dat is your matrix:
rowSums(dat > 0.1)
Using the sample data provided, we have:
dat <- read.table(text = ' P10_neg._qn P11_neg._qn P12_neg._qn P14_neg._qn P17_neg._qn P24_neg._qn P25_neg._qn
1 -0.025 -0.037 -0.032 -0.061 -0.176 0.033 -0.011
2 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
3 0.033 -0.127 0.042 0.014 0.097 0.105 0.048
4 0.033 -0.127 0.042 0.014 0.097 0.105 0.048
5 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
6 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087',
row.names = 1, header = TRUE)
rowSums(dat > 0.1)
## 1 2 3 4 5 6
## 0 1 1 1 1 1
An apply() alternative gives the same counts:
apply(dat, 1, function(x) sum(x > .1))
# [1] 0 1 1 1 1 1
Here is an Rcpp version:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector countGreaterThan2(NumericMatrix M, double val) {
  IntegerVector res;
  for (int i = 0; i < M.nrow(); i++) {
    NumericVector row = M(i, _);
    // count the entries in this row that exceed val
    int num = std::count_if(row.begin(), row.end(),
                            [&val](const double& x) -> bool { return x > val; });
    res.push_back(num);
  }
  return res;
}
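The benchmark below uses an object dfx that is not defined in the post; presumably it is just a large numeric matrix, for example (hypothetical setup, only to make the comparison concrete):
dfx <- matrix(runif(5e6), ncol = 10)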
But rowSums is unbeatable:
system.time(rowSums(dfx>0.2))
user system elapsed
0.01 0.00 0.02
> system.time(countGreaterThan2(dfx,0.2))
user system elapsed
0.06 0.00 0.06
