I am trying to fit a logistic curve to my data for two separate years. The data for the two years are very similar and have exactly the same format; I removed all observations with NA's or NaN's and made sure the columns are numeric. However, for some reason the script works perfectly for the first year but not for the second.
Here is a bit of what my 2013 data looks like (101 observations total):
x y
1 0.070 95.392
2 0.079 100.000
3 0.109 100.000
4 0.072 100.000
5 -0.005 100.000
6 0.014 100.000
7 -0.008 100.000
8 0.307 52.523
9 0.696 0.000
10 -0.045 100.000
And the 2018 data (116 observations total):
x y
1 0.133 100.000
2 0.139 100.000
3 0.152 100.000
4 0.124 100.000
5 0.051 100.000
6 0.062 100.000
7 -0.050 100.000
8 0.356 80.282
9 0.545 0.000
10 -0.029 62.857
Here is my script:
##2013 data
x <- veg13$`Elevation`
y <- veg13$`pclowspecies`
##estimate the parameters of the logistic curve
fit <- nls(y ~ SSlogis(x, Asym, xmid, scal), data=data.frame(x, y))
summary(fit)
It works fine for 2013, but when I repeat it with the 2018 data I get the following error:
Error in qr.default(.swts * gr) :
NA/NaN/Inf in foreign function call (arg 1)
I have read posts from other people who had the same issue, but their solutions do not work for me since I don't have any NA's or NaN's in my data.
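A minimal diagnostic sketch of the kind of checks that might isolate this (veg18, x18, and y18 are placeholder names for the 2018 objects; the starting values are guesses read off the data shown): SSlogis can fail to find initial values even when there are no NA's, so it is worth checking for non-finite values explicitly and trying nls() with manual starting values instead of the self-starter.
##check for non-finite values that an NA filter would miss (Inf, -Inf)
x18 <- veg18$`Elevation`
y18 <- veg18$`pclowspecies`
any(!is.finite(x18)); any(!is.finite(y18))
##try explicit starting values instead of the SSlogis self-starter
##(Asym, xmid, scal guessed from the data; scal < 0 because y falls as x rises)
fit18 <- nls(y18 ~ Asym / (1 + exp((xmid - x18) / scal)),
             start = list(Asym = 100, xmid = 0.3, scal = -0.1))
summary(fit18)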
Thank you for your help!
I'm trying to build a logistic regression with a dataset containing 9 variables and 3000 observations. This is what it looks like with the head() function:
Sex Length Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Age
1516 3 0.655 0.510 0.215 1.7835 0.8885 0.4095 0.4195 1
529 1 0.570 0.450 0.160 0.9715 0.3965 0.2550 0.2600 1
1244 2 0.385 0.280 0.085 0.2175 0.0970 0.0380 0.0670 2
1880 2 0.545 0.430 0.140 0.6870 0.2615 0.1405 0.2500 2
1311 2 0.545 0.405 0.135 0.5945 0.2700 0.1185 0.1850 2
1759 3 0.735 0.590 0.215 1.7470 0.7275 0.4030 0.5570 1
My membership class is Age and I want to build a lasso model with it, which I have done. The problem is that the predict() function returns this error and I have no idea what to do about it: "The number of variables in newx must be 8".
The code I have used is below:
library(glmnet)
myabalone<-read.table(file.choose(), header = T, stringsAsFactors = T)
myabalone$Sex<-as.numeric(myabalone$Sex)
myabalone$Age<-as.numeric(myabalone$Age)
set.seed(69)
mysampleabalone<-myabalone[sample(nrow(myabalone)),]
train<-mysampleabalone[1:floor(nrow(mysampleabalone)*0.7),]
test<-mysampleabalone[(floor(nrow(mysampleabalone)*0.7)+1):nrow(mysampleabalone),]
set.seed(300)
x<-model.matrix(Age~., train)[,-1]
y<-ifelse(train$Age=="1", "1", "0")
cv.lasso<-cv.glmnet(x,y, alpha=1, family="binomial")
model2<-glmnet(x,y, family="binomial", alpha=1, lambda=cv.lasso$lambda.1se)
set.seed(123)
predicted<-predict(model2, test, type = "response")
This is where the "The number of variables in newx must be 8" error occurs.
Why should there be 8 variables and not 9, if the training data I used to build the model also has 9 variables? I have seen it suggested in other posts that I should try to pass the test set through as.data.frame(), because there might be issues with the column names, but I tried that and nothing changed. Plus, when I use the head() function on it, it returns exactly the same column names as the training set used to build the model.
Does anybody have any ideas how to fix this?
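For what it's worth, a sketch of the usual fix (untested against this exact data): glmnet predicts from a numeric matrix, not a data frame, so the test set has to go through the same model.matrix() step as the training set. model.matrix(Age ~ ., train)[,-1] turns the 9 columns into 8 predictor columns (Age is the response and the intercept column is dropped), which is where the "must be 8" comes from.
##build the test design matrix exactly like the training one
x.test <- model.matrix(Age ~ ., test)[, -1]
##glmnet's predict() takes the matrix through the newx argument
predicted <- predict(model2, newx = x.test, type = "response")
head(predicted)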
I ran randomForest on a dataset with a binary outcome and want the predicted probabilities (on the same dataset - I don't need separate train/test sets for this). I was expecting the values for p1 and p2 below to be the same, but clearly they are not, and I haven't been able to find a clear description of how they differ. Any help would be appreciated.
library(randomForest)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
rf = randomForest(factor(admit)~., data = mydata)
p1 = predict(rf, mydata[,c(2:4)], type = "prob")
p2 <- rf$votes
> head(p1)
0 1
1 0.926 0.074
2 0.584 0.416
3 0.166 0.834
4 0.722 0.278
5 0.968 0.032
6 0.258 0.742
> head(p2)
0 1
1 0.8324324 0.16756757
2 0.7663043 0.23369565
3 0.2447917 0.75520833
4 0.9695431 0.03045685
5 0.9264706 0.07352941
6 0.3351351 0.66486486
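In case it helps, a sketch illustrating the difference (as far as I understand the randomForest semantics): rf$votes holds out-of-bag predictions, where each row is scored only by the trees that did not see it during training, whereas predict(rf, newdata, type = "prob") lets every tree vote, including trees fit on that very row, so in-sample probabilities look more extreme.
##OOB probabilities: omitting newdata returns the out-of-bag predictions,
##which should line up with rf$votes (votes are normalized by default)
p.oob <- predict(rf, type = "prob")
##in-sample probabilities: every tree votes, including trees that
##were trained on the row itself
p.in <- predict(rf, newdata = mydata, type = "prob")
all.equal(as.vector(p.oob), as.vector(rf$votes))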
I have paired data from multiple samples that I want to display as a parallel coordinates graph with a p-value above it (i.e. plot each data point in each group, link the pairs with a line, and show the comparison statistic above the plotted data).
I can get the graph to (largely) look the way I want it to, but when I try to add a p-value using stat_compare_means(paired=TRUE), I get 3 errors:
Twice:
"Don't know how to automatically pick scale for object of type quosure/formula. Defaulting to continuous."
Once:
"Error in validDetails.text(x) : 'pairlist' object cannot be coerced to type 'double'".
My data is a data.frame with three variables: a sample variable so I know which pair is which, a group variable so I know which category each value belongs to, and the value variable. I've pasted the code below and am more than happy to take suggestions on other ways to make the code look better as well.
ggplot(test_OCI, aes(x=test_OCI$variable, y=test_OCI$value, group =test_OCI$Pt)) +
geom_point(aes(x=test_OCI$variable),size=3)+
geom_line(aes(x=test_OCI$variable),group=test_OCI$Pt)+
theme_bw()+
theme(panel.border=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
axis.line=element_line(color="black"))+
scale_x_discrete(labels=c("OCI_pre_ART"="Pre-ART OCI", "OCI_on_ART"="On-ART OCI"))+
stat_compare_means(paired=TRUE)
edit 1: adding sample data
There isn't too much data, but I've added it below per request.
Pt variable value
1 Pt1 OCI_pre_ART 0.024
2 Pt2 OCI_pre_ART 0.027
3 Pt3 OCI_pre_ART 0.027
4 Pt4 OCI_pre_ART 0.010
5 Pt5 OCI_pre_ART 0.075
6 Pt6 OCI_pre_ART 0.040
7 Pt7 OCI_pre_ART 0.070
8 Pt8 OCI_pre_ART 0.011
9 Pt9 OCI_pre_ART 0.022
10 Pt10 OCI_pre_ART 0.006
11 Pt11 OCI_pre_ART 0.019
12 Pt1 OCI_on_ART 0.223
13 Pt2 OCI_on_ART 0.166
14 Pt3 OCI_on_ART 0.163
15 Pt4 OCI_on_ART 0.126
16 Pt5 OCI_on_ART 0.090
17 Pt6 OCI_on_ART 0.139
18 Pt7 OCI_on_ART 0.403
19 Pt8 OCI_on_ART 0.342
20 Pt9 OCI_on_ART 0.092
edit 2: packages
All lines in the figure code are from ggplot2, except stat_compare_means(paired=TRUE), which is from ggpubr.
I'm not sure if this is the reason, but it appears that the stat_compare_means() line was not interpreting the x~y aesthetic. Changing the line to
stat_compare_means(comparisons = list(c("OCI_pre_ART","OCI_on_ART")), paired=TRUE)
resulted in a functional graph.
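For completeness, a sketch of how the full call might look with that change (untested; it also moves the columns into a single aes() instead of the test_OCI$ references, which is the usual ggplot2 style):
library(ggplot2)
library(ggpubr)
ggplot(test_OCI, aes(x = variable, y = value, group = Pt)) +
  geom_point(size = 3) +
  geom_line() +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(color = "black")) +
  scale_x_discrete(labels = c("OCI_pre_ART" = "Pre-ART OCI",
                              "OCI_on_ART" = "On-ART OCI")) +
  stat_compare_means(comparisons = list(c("OCI_pre_ART", "OCI_on_ART")),
                     paired = TRUE)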
My problem is the following: I have the table below
0 1-5 6-10 11-15 16-20 21-26 27-29
a 0.019 0.300 0.296 0.211 0.117 0.042 0.014
b 0.058 0.448 0.308 0.120 0.042 0.019 0.005
c 0.026 0.277 0.316 0.187 0.105 0.068 0.020
d 0.054 0.297 0.378 0.108 0.108 0.041 0.014
e 0.004 0.252 0.358 0.216 0.102 0.053 0.015
f 0.032 0.097 0.312 0.280 0.161 0.065 0.054
g 0.113 0.500 0.233 0.094 0.043 0.014 0.003
h 0.328 0.460 0.129 0.050 0.020 0.010 0.003
representing some marginal frequencies (by row) for each subgroup of my data (a to h).
My dataset is actually in the long format (very long, with more than 100,000 entries); the first 6 rows are shown below:
RX_SUMM_SURG_PRIM_SITE Nodes.Examined.Class
1 Wedge Resection 1-5
2 Segmental Resection 1-5
3 Lobectomy w/mediastinal LNdissection 6-10
4 Lobectomy w/mediastinal LNdissection 6-10
5 Lobectomy w/mediastinal LNdissection 1-5
6 Lobectomy w/mediastinal LNdissection 11-15
When I plot a barplot by group (the table above is simply the cross-tabulation of these two covariates with the row marginal probabilities taken), here's what happens:
The code I have for this is
ggplot(data.ln.red, aes(x=Nodes.Examined.Class))+geom_bar(aes(x=Nodes.Examined.Class, group=RX_SUMM_SURG_PRIM_SITE))+
facet_grid(RX_SUMM_SURG_PRIM_SITE~.)
Actually, I would be very happy to have just the marginal frequencies (i.e. the ones in the table) on the y-axis of each facet of the plot (instead of the counts).
Can anybody help me with this?
Thanks for all your help!
EM
geom_bar calculates both counts and proportions of observations. You can access these calculated proportions with either ..prop.. (the old way) or calc(prop) (introduced in newer versions of ggplot2; current releases spell it after_stat(prop)). Use this as your y aesthetic.
You can also get rid of the aes you have in geom_bar, as this is just a repeat of what you've already covered by ggplot and facet_grid.
It looks like your counts/proportions are going to vary widely between groups, so I'm adding free y-scaling to the faceting.
Here's an example of a similar plot with the iris data, which you can model your code off of:
library(tidyverse)
ggplot(iris, aes(x = Sepal.Length, y = calc(prop))) +
geom_bar() +
facet_grid(Species ~ ., scales = "free_y")
Created on 2018-04-06 by the reprex package (v0.2.0).
Edit: the calculated prop variable is proportions within each group, not proportions across all groups, so it works differently when x is a factor. For categorical x, prop treats x as the group; to override this, include group = 0 or some other dummy value in your aes. Sorry I missed that the first time!
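Applied to the data in the question, that might look something like the sketch below (assuming the data frame is data.ln.red as above; with a discrete x the dummy group makes prop the share within each facet panel rather than within each x value, which should reproduce the row marginal frequencies from the table):
library(ggplot2)
##current ggplot2 spells the computed variable after_stat(prop);
##use calc(prop) or ..prop.. on older versions
ggplot(data.ln.red, aes(x = Nodes.Examined.Class,
                        y = after_stat(prop), group = 1)) +
  geom_bar() +
  facet_grid(RX_SUMM_SURG_PRIM_SITE ~ ., scales = "free_y")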
I want to fit a structural equation model with the sem() function from the R package lavaan.
There are two categorical variables, one latent exogenous and one latent endogenous, that I want to include in the final version of the model.
When I include one of the categorical variables in the model, however, R produces the following warning:
1: In estimateVCOV(lavaanModel, samplestats = lavaanSampleStats,
options = lavaanOptions, : lavaan WARNING: could not compute
standard errors!
2: In computeTestStatistic(lavaanModel, partable = lavaanParTable, :
lavaan WARNING: could not compute scaled test statistic
Code used:
model1 <- '
Wertschaetzung_Essen =~ abwechslungsreiche_M + schnell_zubereitbar + koche_sehr_gerne + koche_sehr_haeufig
Fleischverzicht =~ Ern_Index1
Fleischverzicht ~ Wertschaetzung_Essen
'
fit_model1 <- sem(model1, data=survey2_subset, ordered = c("Ern_Index1"))
Note: this is only a small version of the final model, in which I introduce only one categorical variable. The warning, however, is the same for more complex versions of the model.
Output
str(survey2_subset):
'data.frame': 3676 obs. of 116 variables:
$ abwechslungsreiche_M : num 4 2 3 4 3 3 4 3 3 3 ...
$ schnell_zubereitbar : num 0 3 2 0 0 1 3 2 1 1 ...
$ koche_sehr_gerne : num 1 3 3 1 3 1 4 4 4 3 ...
$ koche_sehr_haeufig : num 2 2 3 NA 3 2 2 4 3 3 ...
$ Ern_Index1 : num 1 1 1 1 0 0 1 0 1 0 ...
summary(fit_model1, fit.measures = TRUE, standardized=TRUE)
lavaan (0.5-15) converged normally after 31 iterations
Used Total
Number of observations 3469 3676
Estimator DWLS Robust
Minimum Function Test Statistic 13.716 NA
Degrees of freedom 4 4
P-value (Chi-square) 0.008 NA
Scaling correction factor NA
Shift parameter
for simple second-order correction (Mplus variant)
Model test baseline model:
Minimum Function Test Statistic 2176.159 1582.139
Degrees of freedom 10 10
P-value 0.000 0.000
User model versus baseline model:
Comparative Fit Index (CFI) 0.996 NA
Tucker-Lewis Index (TLI) 0.989 NA
Root Mean Square Error of Approximation:
RMSEA 0.026 NA
90 Percent Confidence Interval 0.012 0.042 NA NA
P-value RMSEA <= 0.05 0.994 NA
Parameter estimates:
Information Expected
Standard Errors Robust.sem
Estimate Std.err Z-value P(>|z|) Std.lv Std.all
Latent variables:
Wertschaetzung_Essen =~
abwchslngsr_M 1.000 0.363 0.436
schnll_zbrtbr 1.179 0.428 0.438
koche_shr_grn 2.549 0.925 0.846
koche_shr_hfg 2.530 0.918 0.775
Fleischverzicht =~
Ern_Index1 1.000 0.249 0.249
Regressions:
Fleischverzicht ~
Wrtschtzng_Es 0.302 0.440 0.440
Intercepts:
abwchslngsr_M 3.133 3.133 3.760
schnll_zbrtbr 1.701 1.701 1.741
koche_shr_grn 2.978 2.978 2.725
koche_shr_hfg 2.543 2.543 2.148
Wrtschtzng_Es 0.000 0.000 0.000
Fleischvrzcht 0.000 0.000 0.000
Thresholds:
Ern_Index1|t1 0.197 0.197 0.197
Variances:
abwchslngsr_M 0.562 0.562 0.810
schnll_zbrtbr 0.771 0.771 0.808
koche_shr_grn 0.339 0.339 0.284
koche_shr_hfg 0.559 0.559 0.399
Ern_Index1 0.938 0.938 0.938
Wrtschtzng_Es 0.132 1.000 1.000
Fleischvrzcht 0.050 0.806 0.806
Is the model not identified? There should be enough degrees of freedom, and the loadings of the first manifest indicators are fixed to one.
How can I resolve this issue?
My first thought was:
You can't have missing values in the data frame, because with categorical variables WLSMV is used, and FIML (missing="ML") is only usable with ML estimation. Perhaps that's the problem.
Also: does lavaan automatically fix the residual variance of "Fleischverzicht" to 0 (or some other value)? A single-indicator latent variable would not be identified without that, I think.
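If the single-indicator latent turns out to be the issue, a sketch of the usual workaround in lavaan syntax (untested; whether fixing the residual variance to zero is sensible for an ordered indicator is worth double-checking):
library(lavaan)
model1b <- '
  Wertschaetzung_Essen =~ abwechslungsreiche_M + schnell_zubereitbar +
                          koche_sehr_gerne + koche_sehr_haeufig
  ##single-indicator latent: fix the loading to 1 and the residual
  ##variance to 0 so the latent is identified
  Fleischverzicht =~ 1*Ern_Index1
  Ern_Index1 ~~ 0*Ern_Index1
  Fleischverzicht ~ Wertschaetzung_Essen
'
fit_model1b <- sem(model1b, data = survey2_subset, ordered = "Ern_Index1")
summary(fit_model1b, fit.measures = TRUE, standardized = TRUE)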