Not able to visualize linear regression with ggPredict - r

I have a data set called dataTreadmill that contains 5 columns of text information; columns 6 to 56 contain the variables I would like to analyse.
For every variable I would like to perform a linear regression to see how my data changes across different conditions.
I figured out that I could make a lm plot in the following way:
lmTreadmill = lm(StrideRegularity_AP~ConditionNr, data = dataTreadmill)
Visualizing this gives a nice plot:
ggPredict(lmTreadmill,se=TRUE,interactive=TRUE)
However, as I have 54 variables besides StrideRegularity_AP, I would like to use lapply:
col <- c(6:56) # these are the only columns containing data;
allFits = lapply(dataTreadmill[,col], function(x) (lm(x~dataTreadmill$ConditionNr+dataTreadmill$Group, data=dataTreadmill)))
Now I get a nice list with the regression information for every variable.
However, when I try to plot any of these linear regressions using this code:
ggPredict(allFits$StrideRegularity_AP)
R gives the following error, even though I see no difference in structure or values when comparing allFits$StrideRegularity_AP with lmTreadmill (which are the same):
Error in `[[<-.data.frame`(`*tmp*`, yname, value = c(`1` = 0.616668527648763, :
replacement has 419 rows, data has 30
In addition: Warning message:
'newdata' had 30 rows but variables found have 419 rows
Why am I not able to visualize the linear regression after using lapply?
Thanks in advance!
Iris

Drop the dataTreadmill$ from the lm call and try again.
allFits = lapply(dataTreadmill[,col], function(x) (lm(x~ConditionNr+Group, data=dataTreadmill)))
It's not needed as you specify the data anyway (and on a quick test with and without I get the same error as you - no idea why though)
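A likely explanation, sketched with base R's predict() rather than ggPredict() itself. The assumption is that ggPredict() calls predict() with a newdata frame internally, which the "'newdata' had 30 rows" warning suggests. A formula term written as dataTreadmill$ConditionNr is always looked up in the original data frame, so newdata is ignored. The data frame df and the column StepLength below are made up for illustration.

```r
# Toy stand-in for dataTreadmill (hypothetical values)
set.seed(1)
df <- data.frame(
  ConditionNr = rep(1:3, each = 10),
  StrideRegularity_AP = rnorm(30),
  StepLength = rnorm(30)
)
new <- data.frame(ConditionNr = c(1, 2))

# Term is literally "df$ConditionNr": it is not a column of newdata, so
# predict() evaluates it against the original 30 rows and warns
bad <- lm(df$StrideRegularity_AP ~ df$ConditionNr)
n_bad <- length(suppressWarnings(predict(bad, newdata = new)))

# Term is "ConditionNr": predict() finds it in the 2 rows of newdata
good <- lm(StrideRegularity_AP ~ ConditionNr, data = df)
n_good <- length(predict(good, newdata = new))

c(bad = n_bad, good = n_good)

# Looping over column names keeps plain variable names in every formula
vars <- c("StrideRegularity_AP", "StepLength")
allFits <- lapply(setNames(vars, vars), function(v) {
  lm(reformulate("ConditionNr", response = v), data = df)
})
```

With fits built this way, predict() with newdata behaves as expected for each element of allFits; whether ggPredict() accepts them was not tested here.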

Both the solutions work! Thank you.
Although I still wonder why, because the data looks completely similar to the suggestion I wrote. However, now at least I can plot the data :)

Related

R Cross Validation lm predict function [duplicate]

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running that last line gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
fit your model like this:
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
Isn't that so much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model or the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = Standards)
?? In which case, you'd need
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
say. Note you don't need to include the log() bit in the name. R recognises that as a function and applies it to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.
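One simple way to do that inversion, as a sketch: solve the fitted line log(Abs550nm) = b0 + b1 * ng_mL for ng_mL. This gives point estimates only and ignores the uncertainty of the calibration; the helper name invert_fit is made up for illustration.

```r
Standards <- data.frame(ng_mL = c(0, 0.4, 1, 4),
                        Abs550nm = c(1.7535, 1.5896, 1.4285, 0.9362))
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)

# Hypothetical helper: rearrange log(Abs) = b0 + b1 * ng_mL for ng_mL
invert_fit <- function(model, abs) {
  b <- coef(model)
  unname((log(abs) - b[["(Intercept)"]]) / b[["ng_mL"]])
}

Abs <- c(1.7812, 1.7309, 1.3537)
est <- data.frame(Abs550nm = Abs, ng_mL_est = invert_fit(mod, Abs))
est
```

Absorbances above the fitted value at ng_mL = 0 come out as negative concentrations, so estimates outside the range of the standards should be treated with caution.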

Why are polychoric correlation coefficients in matrices calculated by different R packages slightly different for the same data?

I calculated polychoric correlation matrices for the same data frame (20 ordinal variables, 190 missing values) in R, using three different packages, and the coefficients for the same variables are slightly different from each other.
I used the lavCor function from "lavaan" (I did list the ordinal variables when calling the function), the polychoric function from "psych" (1.9.1) (took the rhos), and the cor_auto function from "qgraph" (which is supposed to automatically calculate polychoric correlations for ordinal data). I am confused because I thought they were supposed to give exactly the same results. I read the package documentation but could not find anything that helped me understand why. Could anyone let me know why this happens? I am sure I am missing some tiny difference between them, but I cannot figure it out.
PS: I guess this could have happened because psych package adjusts missing values (I have 190) using the correction for continuity, but I still do not understand why qgraph yields different results than lavaan as qgraph says it uses lavaan's lavCor function to calculate polychoric correlations.
Thanks!!
depanx<-data[1:20]
cor.depanx<-cor_auto(depanx)
polychor<-polychoric(depanx)
polymat<-polychor$rho
lav<-lavCor(depanx,ordered=c("unh","enj","trd","rst","noG","cry","cnc","htd","bdp","lnl","lov",
"cmp","wrg","pst","sch","dss","hlt","bad","ftr","oth"))
# as a result, matrices "cor.depanx", "polymat", and "lav" are different from each other.
Nice question! I do not know what the "data" dataset in your example is, but I recreated the two possible scenarios that most probably caused the discrepancy between the cor_auto and lavCor results. In summary, first you must set the "ordinalLevelMax" argument in cor_auto based on your data, and second you need to synchronize the "missing" argument in the two functions. A detailed explanation is in the code snippet below:
library(qgraph) # provides cor_auto(); lavaan is called via lavaan:: below
depanx<-data.frame(lapply(1:5,function(x)sample(1:6,100,replace = T)),
stringsAsFactors = F)
colnames(depanx)=LETTERS[1:5]
lav<-lavaan::lavCor(depanx,ordered=colnames(depanx))
cor.depanx<-cor_auto(depanx)
all(lav==cor.depanx)#TRUE
#The first argument of cor_auto you need to pay attention to is "ordinalLevelMax".
#It is set to 7 by default, so any variable with more than 7 levels is sent to
#lavCor as plain numeric rather than ordinal.
#Now we create the same dataset with 8-level variables. lavCor treats them all as
#ordinal, since we labelled them so via its "ordered" argument, and uses polychoric
#correlations. Because "ordinalLevelMax" in cor_auto is still at its default of 7,
#cor_auto detects none of them as ordinal and does not pass them to lavCor as
#ordinal variables, so lavCor computes Pearson correlations between all of them.
depanx2<-data.frame(lapply(1:5,function(x)sample(1:8,100,replace =T)),
stringsAsFactors = F)
colnames(depanx2)=LETTERS[1:5]
lav2<-lavaan::lavCor(depanx2,ordered=colnames(depanx2))
cor.depanx2<-cor_auto(depanx2)
all(lav2==cor.depanx2)#FALSE
#The next argument you must synchronise between lavCor and cor_auto is "missing",
#which defaults to "pairwise" in cor_auto and "listwise" in lavCor.
#Here we set rows 10:20 of the fifth variable to NA without synchronising the
#argument.
depanx3<-data.frame(lapply(1:5,function(x)sample(1:6,100,replace =T)),
stringsAsFactors = F)
colnames(depanx3)=LETTERS[1:5]
depanx3[10:20,5]<-NA
lav3<-lavaan::lavCor(depanx3,ordered=colnames(depanx3))
cor.depanx3<-cor_auto(depanx3)
all(lav3==cor.depanx3)#FALSE

Predicting in a Bayesian Network in R -- message: consider using the 'smooth' argument

I've gotten code figured out to predict probabilities in each category of the target node. However, when I adapt it to a different dataset (just more nodes of similar data) and run the predict function, it gives this message (in black text, not red like an error or warning):
The evidence for row 25 has probability smaller than 0.00000 in then
model. Consider using the 'smooth' argument when building the network.
Exiting...
Here is the code:
library("bnlearn")
pigment.test <- read.table("pigment_test.csv", sep=",", header=T)
bn.mle <- bn.fit(dg_pigment, data=pigment.test[,2:19], method="mle", smooth=0.01)
bn.grain <- as.grain(bn.mle)
predict.mle <- predict(bn.grain, "pigment", newdata=pigment_val[,2:18],
type="distribution")
predict.mle
I tried putting smooth in the bn.fit or the as.grain part of the code, but it says it's an unused argument. This also happens for other rows (not just row 25, once I remove it). Does anyone know where I could include a smooth argument here? Or is there a different function? I was trying to have the program automatically calculate the conditional probabilities rather than me manually creating the tables.

residuals() function error: replacement has x rows and data has y rows

I have a data set that has reading time for each word that numerous individuals read.
I am trying to calculate reading time residuals for each individual in my data. Word lengths and the order of presentation (of a particular word) are factors in calculating a regression for each individual.
The reading time was log-transformed (logRT) and word lengths were calculated by nchar(). The order of presentation is also log-transformed.
model1<-lmer(logRT~wlen+log(order)+(1|subject), data=mydata)
Then, I try to get a residual column for every data point by doing the following,
mydata$logResid<-residuals(model1)
Then, I get this error.
Error in `$<-.data.frame`(`*tmp*`, "LogResid", value = c(0.145113408056189, :
replacement has 30509 rows, data has 30800
Does anyone have any advice? I am totally confused, all the more so because this is an analysis I've been doing every day with no such error so far.
I would say you should try
model1 <- lmer(logRT~wlen+log(order)+(1|subject), data=mydata,
na.action=na.exclude)
and see if that helps; it should fill in NA values in the appropriate places.
From ?na.exclude:
... when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.
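The padding behaviour can be seen with a small base R sketch. It uses lm() as a stand-in for lmer() so that it runs without lme4, and the data frame d is made up for illustration.

```r
# Toy data with two missing responses (hypothetical values)
set.seed(42)
d <- data.frame(x = 1:10, y = rnorm(10))
d$y[c(3, 7)] <- NA

# na.omit drops the incomplete rows, so residuals are shorter than the data
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
# na.exclude also drops them for fitting, but pads residuals with NA
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

n_omit <- length(residuals(fit_omit))
n_excl <- length(residuals(fit_excl))
c(omit = n_omit, exclude = n_excl, rows = nrow(d))

# Assignment now works because the lengths match
d$resid <- residuals(fit_excl)
```

A quick check such as sum(!complete.cases(mydata[, c("logRT", "wlen", "order", "subject")])) would show whether the 291 missing rows in the question come from NA values in the model variables.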

Model runs with glm but not bigglm

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10000) gives a rather complex model with 5 interaction terms: Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5. The glm() function could fit this model with 10000 rows of data, but not with the whole dataset (320,000).
Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():
fit <- bigglm(Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5,
data=sqlQuery(myconn,train_dat),family=binomial(link="logit"),
chunksize=1000, maxit=10)
Error in coef.bigqr(object$qr) :
NA/NaN/Inf in foreign function call (arg 3)
> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D,
bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar),
ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)
bigglm was able to fit a smaller model with fewer interaction terms. However, it could not fit the full model even with a small dataset (10000 rows).
Has anyone run into this problem before? Any other approach to run a complex logistic model with big data?
I've run into this problem many times, and it was always caused by the fact that the chunks processed by bigglm did not contain all the levels of a categorical (factor) variable.
bigglm crunches data in chunks, and the default chunk size is 5000. If you have, say, 5 levels in your categorical variable, e.g. (a,b,c,d,e), and your first chunk (rows 1:5000) contains only (a,b,c,d) but no "e", you will get this error.
What you can do is increase the size of the "chunksize" argument and/or cleverly reorder your dataframe so that each chunk contains ALL the levels.
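A minimal base R sketch of how one might check this before calling bigglm, assuming the data fit in memory as a data frame. The helper levels_per_chunk_ok and the toy factor g are made up for illustration.

```r
set.seed(1)
# Toy data sorted by level, so the rare level "e" only appears at the end
d <- data.frame(g = factor(rep(c("a", "b", "c", "d", "e"),
                               times = c(4000, 4000, 4000, 4000, 500))))
chunksize <- 5000

# Hypothetical helper: TRUE if every consecutive chunk contains all levels
levels_per_chunk_ok <- function(f, chunksize) {
  chunk_id <- ceiling(seq_along(f) / chunksize)
  present <- tapply(f, chunk_id, function(x) length(unique(x)))
  all(present == nlevels(f))
}

sorted_ok <- levels_per_chunk_ok(d$g, chunksize)

# Shuffling the rows spreads the levels across chunks
shuffled <- d[sample(nrow(d)), , drop = FALSE]
shuffled_ok <- levels_per_chunk_ok(shuffled$g, chunksize)

c(sorted = sorted_ok, shuffled = shuffled_ok)
```

Shuffling is only one way to reorder; for very rare levels a larger chunksize may still be needed.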
hope this helps (at least somebody)
Ok so we were able to find the cause for this problem:
for one category in one of the interaction terms, there's no observation. "glm" function was able to run and provide "NA" as the estimated coefficient, but "bigglm" doesn't like it. "bigglm" was able to run the model if I drop this interaction term.
I'll do more research on how to deal with this kind of situation.
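An empty cell like this can be spotted with a cross-tabulation before fitting. The sketch below uses made-up factors X2 and X3 and plain glm(), which reports the unestimable interaction as NA, as described above.

```r
# Balanced toy design with one factor combination removed (hypothetical data)
d <- expand.grid(X2 = c("a", "b"), X3 = c("u", "v", "w"), rep = 1:20,
                 stringsAsFactors = TRUE)
d <- d[!(d$X2 == "b" & d$X3 == "w"), ]
set.seed(1)
d$Y <- rbinom(nrow(d), 1, 0.5)

# Cross-tabulate the interacting factors and list cells with no observations
cells <- table(d$X2, d$X3)
empty <- which(cells == 0, arr.ind = TRUE)
empty

# glm() still runs and marks the coefficient it cannot estimate as NA
fit <- glm(Y ~ X2 * X3, data = d, family = binomial)
na_terms <- names(coef(fit))[is.na(coef(fit))]
na_terms
```

Running the same table() check on each pair of interacting factors in the real data would identify which interaction term to drop or recode.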
I met this error before, though it was from randomForest instead of biglm. The reason could be that the function cannot handle character variables, so you need to convert characters to factors. Hope this can help you.

Resources