How to reproduce results of the predict function in R

Let's say I train a model in R.
model <- lm(as.formula(paste((model_Data)[2],"~",paste((model_Data)[c(4,5,6,7,8,9,10,11,12,13,15,16,17,18,20,21,22,63,79,90,91,109,125,132,155,175,197,202,210,251,252,279,287,292,300,313,318)],collapse="+"),sep="")),data=model_Data)
I then use the model to predict an unknown.
prediction <- predict(model, unknown[1,])
#        1
# 8.037219
Instead of using predict, let's pull out the coefficients and do it manually.
model$coefficients
9.250265284
0.054054202
0.052738367
-0.55119556
0.019686046
0.392728331
0.794558094
0.200555755
-0.63218309
0.050404541
0.089660195
-0.04889444
-0.24645514
0.225817891
-0.10411162
0.108317865
0.004281512
0.219695437
0.037514904
-0.00914805
0.077885231
0.656321472
-0.05436867
0.033296525
0.072551915
-0.11498145
-0.03414029
0.081145352
0.11187141
0.690106624
NA
-0.11112986
-0.18002883
0.006238802
0.058387332
-0.04469568
-0.02520228
0.121577926
Looks like the model couldn't find a coefficient for one of the variables.
Here are the independent variables for our unknown.
2.048475484
1.747222331
-1.240658767
-1.26971135
-0.61858754
-1.186401425
-1.196781456
-0.437969964
-1.37330171
-1.392555895
-0.147275619
0.315190159
0.544014105
-1.137999082
0.464498153
-1.825631473
-1.824991143
0.61730876
-1.311527708
-0.457725059
-0.455920549
-0.196326975
0.636723746
0.128123676
-0.0064055
-0.788435688
-0.493452602
-0.563353694
-0.441559371
-1.083489708
-0.882784077
-0.567873188
1.068504735
1.364721122
0.294178454
2.302875604
-0.998685333
If I multiply each independent variable by its coefficient and add on the intercept, the predicted value for the unknown is 8.450137349.
The predict function gave us 8.037219 and the manual calculation gave 8.450137349. What is happening within the predict function that is causing it to predict a different value than the manual calculation? What has to be done to make the values match?

I get a lot closer to the predict answer when using the code below:
b <- c(9.250265284, 0.054054202, 0.052738367, -0.55119556, 0.019686046, 0.392728331, 0.794558094, 0.200555755, -0.63218309, 0.050404541, 0.089660195, -0.04889444, -0.24645514, 0.225817891, -0.10411162, 0.108317865, 0.004281512, 0.219695437, 0.037514904, -0.00914805, 0.077885231, 0.656321472, -0.05436867, 0.033296525, 0.072551915, -0.11498145, -0.03414029, 0.081145352, 0.11187141, 0.690106624, NA, -0.11112986, -0.18002883, 0.006238802, 0.058387332, -0.04469568, -0.02520228, 0.121577926)
x <- c(1, 2.048475484, 1.747222331, -1.240658767, -1.26971135, -0.61858754, -1.186401425, -1.196781456, -0.437969964, -1.37330171, -1.392555895, -0.147275619, 0.315190159, 0.544014105, -1.137999082, 0.464498153, -1.825631473, -1.824991143, 0.61730876, -1.311527708, -0.457725059, -0.455920549, -0.196326975, 0.636723746, 0.128123676, -0.0064055, -0.788435688, -0.493452602, -0.563353694, -0.441559371, -1.083489708, -0.882784077, -0.567873188, 1.068504735, 1.364721122, 0.294178454, 2.302875604, -0.998685333)
# remove the missing value in `b` and the corresponding value in `x`
x <- x[-31]
b <- b[-31]
x %*% b
# [,1]
# [1,] 8.036963
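More generally, predict() for an lm object can be reproduced by multiplying the model matrix by the coefficient vector with the NA entries dropped (lm reports NA for predictors it removed due to collinearity). A minimal self-contained sketch on the built-in mtcars data, with a deliberately collinear column:

```r
# mtcars with a duplicated (collinear) predictor, so one coefficient comes out NA
dat <- mtcars
dat$wt2 <- dat$wt * 2                     # perfectly collinear with wt
fit <- lm(mpg ~ wt + hp + wt2, data = dat)
coef(fit)                                 # wt2 is NA: lm dropped it

new <- dat[1, ]
p1  <- predict(fit, newdata = new)

# manual version: model matrix times coefficients, with NA terms removed
b  <- coef(fit)
X  <- model.matrix(delete.response(terms(fit)), new)
p2 <- drop(X[, !is.na(b), drop = FALSE] %*% b[!is.na(b)])

all.equal(unname(p1), unname(p2))         # TRUE
```

The key point is that predict() silently skips NA coefficients, which is exactly why removing element 31 from both vectors makes the manual result match.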

Related

Using the naive Bayes function on a test and training set of data

I am trying to use the NaiveBayes function on a training and test set of data. I am using this helpful website: https://rpubs.com/riazakhan94/naive_bayes_classifier_e1071
However, for some reason it is not working, and this is the error that I am getting: `Error in table(train$Class, trainPred) : all arguments must have the same length`
Here is the code that I am using; I am guessing it's a super simple fix. The x and y columns of the data set are predicting on the Class column:
https://github.com/samuelc12359/NaiveBayes.git
test <- read.csv(file="TestX.csv",header=FALSE)
train <- read.csv(file="TrainX.csv",header=FALSE)
Names <- c("x","y","Class")
colnames(test)<- Names
colnames(train)<- Names
NBclassfier=naiveBayes(Class~x+y, data=train)
print(NBclassfier)
trainPred=predict(NBclassfier,train, type="class")
trainTable=table(train$Class, trainPred)
testPred=predict(NBclassfier, newdata=test, type="class")
testTable=table(test$Class, testPred)
print(trainTable)
print(testTable)
You need to turn the Class column into factors, e.g. like this:
train$Class = factor(train$Class)
test$Class = factor(test$Class)
Then when you call naiveBayes() to train, and later to predict, it will do what you expect.
Alternatively, you can change prediction type to "raw" and turn them into outcomes directly. E.g. like this:
train_predictions = predict(NBclassfier,train, type="raw")
trainPred = 1 * (train_predictions[, 2] >= 0.5 )
trainTable=table(train$Class, trainPred)
test_predictions = predict(NBclassfier, newdata=test, type="raw")
testPred = 1 * (test_predictions[, 2] >= 0.5 )
testTable=table(test$Class, testPred)
print(trainTable)
print(testTable)
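To make the factor fix concrete, here is a self-contained illustration on synthetic data (e1071 assumed to be installed; the data and column names are made up for this sketch):

```r
library(e1071)

set.seed(1)
train <- data.frame(x = rnorm(200), y = rnorm(200))
train$Class <- as.integer(train$x + train$y > 0)   # numeric 0/1 label

# the fix: make the target a factor so naiveBayes treats it as a class label
train$Class <- factor(train$Class)

NBclassfier <- naiveBayes(Class ~ x + y, data = train)
trainPred   <- predict(NBclassfier, train, type = "class")

# both arguments now have the same length, so table() works
table(train$Class, trainPred)
```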

specify CFA with disturbances being the sum of exogenous correlations

I am trying to specify a curious model in lavaan in R language.
The model looks like this:
My specification attempt is shown below. What I find difficult to achieve is to fix the unique error of the observed variables to be the sum of two correlations of unique items.
For instance, item y*1,2 covaries with y*1,3 and y*2,3, and its error is supposed to be cov y*1,3 + cov y*2,3.
How can I explicitly fix the item error to equal the sum of these covariances in the lavaan syntax below?
cfa_model_spesification<-'
C=~ #C4_12*i10i11+C4_13*i10i12+
#C5_12*i13i14+C5_13*i13i15+
#C6_12*i17i18+C6_13*i17i19+
C1_12*i1i2+C1_13*i1i3+
C2_12*i4i5+C2_13*i4i6+
C3_12*i7i8+C3_13*i7i9
R=~ #R4_23*i10i11+R4_12*i11i12+
#R5_23*i13i14+R5_12*i14i16+
#R6_23*i17i18+R6_12*i18i19+
R1_12*i1i2+R1_23*i2i3+
R2_12*i4i5+R2_23*i5i6+
R3_12*i7i8+R3_23*i8i9
O=~ #O4_13*i10i12+O4_23*i11i12+
#O5_13*i13i15+O5_23*i14i16+
#O6_13*i17i19+O6_23*i18i19+
O1_13*i1i3+O1_23*i2i3+
O2_13*i4i6+O2_23*i5i6+
O3_13*i7i9+O3_23*i8i9
O~~1*O
C~~1*C
R~~1*R
O~~C+R
R~~C
R1_23==-R1_12
R2_23==-R2_12
R3_23==-R3_12
R1_23>0
R2_23>0
R3_23>0
# R1_12<0
# R2_12<0
# R3_12<0
O1_13<0
O1_23<0
O2_13<0
O2_23<0
O3_13<0
O3_23<0
i1i2~~i1i3
i1i2~~i2i3
i1i3~~i2i3
i4i5~~i4i6
i4i5~~i5i6
i4i6~~i5i6
i7i8~~i7i9
i7i8~~i8i9
i7i9~~i8i9
i1i2~~1*i1i2
i4i5~~1*i4i5
i7i8~~1*i7i8
# i1i3~~equal("i1i3~~i1i2+i1i3~~i2i3")*i1i3
# i2i3~~equal("i2i3~~i1i2+i2i3~~i1i3")*i2i3
# i4i6~~equal("i4i6~~i4i5+i4i6~~i5i6")*i4i6
# i5i6~~equal("i5i6~~i4i5+i5i6~~i4i6")*i5i6
# i7i9~~equal("i7i9~~i7i8+i7i9~~i8i9")*i7i9
# i8i9~~equal("i8i9~~i7i8+i8i9~~i7i9")*i8i9
'
The syntax for this in mplus looks like this
TITLE:
Example
DATA:
FILE IS triplets.dat;
VARIABLE:
NAMES=i1i2 i1i3 i2i3 i4i5 i4i6 i5i6 i7i8 i7i9 i8i9 i10i11 i10i12 i11i12;
CATEGORICAL=i1i2-i11i12;
ANALYSIS:
ESTIMATOR=ulsmv;
PARAMETERIZATION=THETA;
MODEL:
Trait1 BY
i1i2*1 i1i3*1 (L1)
i4i5*-1 i4i6*-1 (L4)
i7i8*1 i7i9*1 (L7)
i10i11*1 i10i12*1 (L10);
Trait2 BY
i1i2*-1 (L2_n)
i2i3*1 (L2)
i4i5*-1 (L5_n)
i5i6*1 (L5)
i7i8*-1 (L8_n)
i8i9*1 (L8)
i10i11*1 (L11_n)
i11i12*-1 (L11);
Trait3 BY
i1i3*-1 i2i3*-1 (L3_n)
i4i6*-1 i5i6*-1 (L6_n)
i7i9*1 i8i9*1 (L9_n)
i10i12*-1 i11i12*-1 (L12_n);
Trait1-Trait3@1;
Trait1 WITH Trait2*-0.4 Trait3*0;
Trait2 WITH Trait3*0.3;
i1i2*2 (e1e2);
i1i3*2 (e1e3);
i2i3*2 (e2e3);
i4i5*2 (e4e5);
i4i6*2 (e4e6);
i5i6*2 (e5e6);
i7i8*2 (e7e8);
i7i9*2 (e7e9);
i8i9*2 (e8e9);
i10i11*2 (e10e11);
i10i12*2 (e10e12);
i11i12*2 (e11e12);
i1i2 WITH i1i3*1 (e1);
i1i2 WITH i2i3*-1 (e2_n);
i1i3 WITH i2i3*1 (e3);
i4i5 WITH i4i6*1 (e4);
i4i5 WITH i5i6*-1 (e5_n);
i4i6 WITH i5i6*1 (e6);
i7i8 WITH i7i9*1 (e7);
i7i8 WITH i8i9*-1 (e8_n);
i7i9 WITH i8i9*1 (e9);
i10i11 WITH i10i12*1 (e10);
i10i11 WITH i11i12*-1 (e11_n);
i10i12 WITH i11i12*1 (e12);
MODEL CONSTRAINT:
L2_n=-L2;
L5_n=-L5;
L8_n=-L8;
L11_n=-L11;
e1e2=e1-e2_n;
e1e3=e1+e3;
e2e3=-e2_n+e3;
e4e5=e4-e5_n;
e4e6=e4+e6;
e5e6=-e5_n+e6;
e7e8=e7-e8_n;
e7e9=e7+e9;
e8e9=-e8_n+e9;
e10e11=e10-e11_n;
e10e12=e10+e12;
e11e12=-e11_n+e12;
e1=1;
e4=1;
e7=1;
e10=1;
This is the model specification in lavaan that I came up with for a TIRT (Thurstonian IRT) model:
lavaan_6_model_spesification<-'# factor loadings (lambda)
trait1=~start(1)*L1*i1i2+start(1)*L1*i1i3+start(1)*L4*i4i5+start(1)*L4*i4i6+start(1)*L7*i7i8+start(1)*L7*i7i9+start(1)*L10*i10i11+start(1)*L10*i10i12+start(1)*L13*i13i14+start(1)*L13*i13i15+start(1)*L16*i16i17+start(1)*L16*i16i18
trait2=~start(-1)*L2n*i1i2+start(1)*L2*i2i3+start(-1)*L5n*i4i5+start(1)*L5*i5i6+start(-1)*L8n*i7i8+start(1)*L8*i8i9+start(-1)*L11n*i10i11+start(1)*L11*i11i12+start(-1)*L14n*i13i14+start(1)*L14*i14i15+start(-1)*L17n*i16i17+start(1)*L17*i17i18
trait3=~start(-1)*L3n*i1i3+start(-1)*L3n*i2i3+start(-1)*L6n*i4i6+start(-1)*L6n*i5i6+start(-1)*L9n*i7i9+start(-1)*L9n*i8i9+start(-1)*L12n*i10i12+start(-1)*L12n*i11i12+start(-1)*L15n*i13i15+start(-1)*L15n*i14i15+start(-1)*L18n*i16i18+start(-1)*L18n*i17i18
# fix factor variances to 1
trait1~~1*trait1
trait2~~1*trait2
trait3~~1*trait3
# factor correlations
trait1~~trait2+trait3
trait2~~trait3
# fix factor loadings of the same item to the same value
L2==-L2n
L5==-L5n
L8==-L8n
L11==-L11n
L14==-L14n
L17==-L17n
# declare uniquenesses (psi)
i1i2~~P1P2*i1i2
i1i3~~P1P3*i1i3
i2i3~~P2P3*i2i3
i4i5~~P4P5*i4i5
i4i6~~P4P6*i4i6
i5i6~~P5P6*i5i6
i7i8~~P7P8*i7i8
i7i9~~P7P9*i7i9
i8i9~~P8P9*i8i9
i10i11~~P10P11*i10i11
i10i12~~P10P12*i10i12
i11i12~~P11P12*i11i12
i13i14~~P13P14*i13i14
i13i15~~P13P15*i13i15
i14i15~~P14P15*i14i15
i16i17~~P16P17*i16i17
i16i18~~P16P18*i16i18
i17i18~~P17P18*i17i18
# correlated uniquenesses
i1i2~~start(1)*P1*i1i3
i1i2~~start(-1)*P2n*i2i3
i1i3~~start(1)*P3*i2i3
i4i5~~start(1)*P4*i4i6
i4i5~~start(-1)*P5n*i5i6
i4i6~~start(1)*P6*i5i6
i7i8~~start(1)*P7*i7i9
i7i8~~start(-1)*P8n*i8i9
i7i9~~start(1)*P9*i8i9
i10i11~~start(1)*P10*i10i12
i10i11~~start(-1)*P11n*i11i12
i10i12~~start(1)*P12*i11i12
i13i14~~start(1)*P13*i13i15
i13i14~~start(-1)*P14n*i14i15
i13i15~~start(1)*P15*i14i15
i16i17~~start(1)*P16*i16i18
i16i17~~start(-1)*P17n*i17i18
i16i18~~start(1)*P18*i17i18
# pair uniqueness is equal to the sum of 2 utility uniquenesses
P1P2==P1-P2n
P1P3==P1+P3
P2P3==-P2n+P3
P4P5==P4-P5n
P4P6==P4+P6
P5P6==-P5n+P6
P7P8==P7-P8n
P7P9==P7+P9
P8P9==-P8n+P9
P10P11==P10-P11n
P10P12==P10+P12
P11P12==-P11n+P12
P13P14==P13-P14n
P13P15==P13+P15
P14P15==-P14n+P15
P16P17==P16-P17n
P16P18==P16+P18
P17P18==-P17n+P18
# fix one uniqueness per block for identification
P1==1
P4==1
P7==1
P10==1
P13==1
P16==1
# force item parameters of the same item to be equal
'
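For reference, the constraint pattern above (e.g. `P1P2 == P1 - P2n`) is not arbitrary: it falls out of the difference structure of the binary-coded pairs. Writing the latent response to pair (i, k) as $y^*_{ik} = t_i - t_k$, with item errors $e_i, e_k$ and uniquenesses $\psi_i = \operatorname{Var}(e_i)$:

```latex
\operatorname{Var}(e_i - e_k) = \psi_i + \psi_k, \qquad
\operatorname{Cov}(e_i - e_k,\; e_i - e_l) = \psi_i, \qquad
\operatorname{Cov}(e_i - e_k,\; e_k - e_l) = -\psi_k
```

so each pair's uniqueness is the sum of the two item uniquenesses, and the n-suffixed covariance labels carry the negative sign, which is why the sums appear as differences in the constraints.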

What is the difference between fitdistr and fitdist for t-distribution?

More specifically, I have a set of data x, e.g.
x <- scan(textConnection(
"10.210126 10.277015 10.402625 10.208137 9.831884 9.672501
9.815476 9.980124 9.710509 9.148997 9.406072
9.991224 10.324005 10.402747 10.449439 10.051304 9.886748 9.771041
9.761175 10.049102 9.457981 9.114380 9.461333 9.804220 10.395986
10.192419 10.202962 9.984330 9.765604 9.473166 9.966462 10.120895
9.631744"))
If I use
fitdistr(x, "t")
I get the following results:
m s df
10.63855874 0.50766169 37.08954639
( 0.02217171) ( 0.01736012) (23.12225558)
Compared to
fitdist(x,"t",start=list(length(x)-1,mean(x)),lower=c(0))
(starting parameters were found in an answer on this site) which results in:
estimate Std. Error
1 103129.28018 NA
2 10.63869 NA
Why is there a difference? Am I perhaps using the wrong starting points for fitdist()?
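The two functions are not interchangeable here: the m/s/df output with standard errors in parentheses is the print format of MASS::fitdistr(), whose "t" is a location-scale t family, whereas fitdistrplus::fitdist() fits stats::dt() directly, which has only df (and ncp) parameters and no location or scale, so the optimizer compensates with an enormous df. A sketch of the location-scale fit using only MASS and base R (the simulated data and starting values below are my own choices):

```r
library(MASS)   # ships with R

set.seed(1)
x <- 10 + 0.5 * rt(500, df = 8)   # simulated: location 10, scale 0.5

# MASS::fitdistr knows "t" as a location-scale t with parameters m, s, df
fit1 <- fitdistr(x, "t")

# the same model by hand: scale dt() explicitly, minimise the negative log-likelihood
dt_ls <- function(x, m, s, df) dt((x - m) / s, df) / s
nll   <- function(p) -sum(log(dt_ls(x, p[1], exp(p[2]), exp(p[3]))))
fit2  <- optim(c(mean(x), log(sd(x)), log(5)), nll)

c(fitdistr = unname(fit1$estimate["m"]), by_hand = fit2$par[1])   # both near 10
```

To get comparable results from fitdist() you would have to supply a custom location-scale density like dt_ls above, with named starting values for all three parameters.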

Do a nonlinear least squares fit in R

I have two vectors:
y <- c(0.044924, 0.00564, 0.003848, 0.002385, 0.001448, 0.001138,
0.001025, 0.000983, 0.00079, 0.000765, 0.000721, 0.00061, 0.000606,
0.000699, 0.000883, 0.001069, 0.001226, 0.001433, 0.00162, 0.001685,
0.001604, 0.001674, 0.001706, 0.001683, 0.001505, 0.001497, 0.001416,
0.001449, 0.001494, 0.001544, 0.00142, 0.001458, 0.001544, 0.001279,
0.00159, 0.001756, 0.001749, 0.001909, 0.001885, 0.002063, 0.002265,
0.002137, 0.002391, 0.002619, 0.002733, 0.002957, 0.003244, 0.003407,
0.003563, 0.003889, 0.004312, 0.004459, 0.004946, 0.005248, 0.005302,
0.00574, 0.006141, 0.006977, 0.007386, 0.007843, 0.008473, 0.008949,
0.010164, 0.010625, 0.011279, 0.01191, 0.012762, 0.014539, 0.01477)
x <- 0:68
I am trying to use the non-linear least squares function to fit the data but I keep getting the error:
Error in nlsModel(formula, mf, start, wts) : singular gradient matrix at initial parameter estimates
My code is:
a=0.00012
b=0.08436
k=0.21108
fit = nls(y ~ (a*b*exp(b*x)*k)/((k*b)+(a*(exp(b*x)-1))), start=list(a=a,b=b,k=k))
The parameters I have entered are parameters that I know are close to the expected values. Does anyone know what I am doing wrong here?
I have tried various initial values for the parameters a, b and k, but I always get some kind of error.
Use optim() instead. You have to write a function which takes a, b and k as input (collected in a vector) and returns the sum of squared errors:
func <- function(pars) {
  a <- pars["a"]
  b <- pars["b"]
  k <- pars["k"]
  fitted <- (a*b*exp(b*x)*k)/((k*b)+(a*(exp(b*x)-1)))
  sum((y - fitted)^2)
}
Then we run optim() using the initial values:
result <- optim(c(a=0.00012, b=0.08436, k=0.21108), func)
To test the resulting fit:
plot(x, y)
a <- result$par["a"]
b <- result$par["b"]
k <- result$par["k"]
lines(x, (a*b*exp(b*x)*k)/((k*b)+(a*(exp(b*x)-1))), col = "blue")
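As a sanity check that this optim() approach can fit the model form at all, here is a self-contained run on synthetic data generated from the same equation (the true parameter values and noise level are invented for illustration; it uses its own xs/ys so as not to clobber the question's data):

```r
set.seed(42)
xs <- 0:68
f  <- function(p, x) (p["a"] * p["b"] * exp(p["b"] * x) * p["k"]) /
                     ((p["k"] * p["b"]) + p["a"] * (exp(p["b"] * x) - 1))
true_p <- c(a = 2e-4, b = 0.08, k = 0.2)
ys     <- f(true_p, xs) + rnorm(length(xs), sd = 1e-5)

sse   <- function(p) sum((ys - f(p, xs))^2)
start <- c(a = 1e-4, b = 0.1, k = 0.3)
est   <- optim(start, sse, control = list(maxit = 5000))

est$value < sse(start)   # the optimiser improved on the starting guess
```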

For each possible permutation of factor levels, apply function and also name list of results

Improve the following code by rewriting it to be more compact (a one-liner with alply or similar?). Also, can it be made more performant, if possible?
I have a dataframe with several categorical variables, each with a varying number of levels. (Examples: T1_V4: B,C,E,G,H,N,S,W and T1_V7: A,B,C,D)
For any specific one of those categorical vars, I want to do the following:
Construct all possible level-permutations e.g. using DescTools::Permn()
Then for each level.perm in those level.perms...
Construct a list of function results where we apply some function to level.perm (in my particular case, recode the factor levels using level.perms, then take as.numeric, then compute correlation wrt some numeric response variable)
Finally, name that list with the corresponding string-concatenated values of level.perm (e.g. 'DBCA')
An example for permutations of A,B,C,D and a reproducible example are both at the bottom.
The following code does this. Can you improve on it? (I tried alply.)
require(DescTools)
level.perms <- Permn(levels(MyFactorVariable))
tmp <- with(df,
  apply(level.perms, 1,
    function(var.levels) {
      cor(MyResponseVariable,
          as.numeric(factor(MyFactorVariable, levels = var.levels)))
    })
)
names(tmp) <- apply(level.perms, 1, paste, collapse = '')
Example (for CategVar1 with levels A,B,C,D):
ABCD BACD BCAD ACBD CABD CBAD BCDA ACDB
0.031423 0.031237 0.002338 0.002116 -0.026496 -0.026386 -0.008743 -0.009104
CADB CBDA ABDC BADC CDAB CDBA ADBC BDAC
-0.037228 -0.037364 0.048423 0.048075 -0.048075 -0.048423 0.037364 0.037228
BDCA ADCB DABC DBAC DBCA DACB DCAB DCBA
0.009104 0.008743 0.026386 0.026496 -0.002116 -0.002338 -0.031237 -0.031423
Reproducible example using randomly-generated dataframe:
set.seed(120)
df <- data.frame(ResponseVar = exp(runif(1000, 0, 4)),
                 CategVar1 = factor(sample(c('A','B','C','D'), 1000, replace = TRUE)),
                 CategVar2 = factor(sample(c('B','C','E','G','H','N'), 1000, replace = TRUE)))
cor(as.numeric(df$CategVar1), df$ResponseVar)
# 0.03142
cor(as.numeric(df$CategVar2), df$ResponseVar)
# 0.02112
#then if you run the above code you get the above table of correlation values
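For comparison, here is a dependency-free variant: a small recursive helper stands in for DescTools::Permn() (the ordering of permutations may differ), and setNames() + vapply() builds the named result in a single expression:

```r
set.seed(120)
df <- data.frame(ResponseVar = exp(runif(1000, 0, 4)),
                 CategVar1 = factor(sample(c('A','B','C','D'), 1000, replace = TRUE)))

# tiny stand-in for DescTools::Permn: returns a list of all level orderings
perms <- function(v) if (length(v) <= 1) list(v) else
  unlist(lapply(seq_along(v),
                function(i) lapply(perms(v[-i]), function(p) c(v[i], p))),
         recursive = FALSE)

lv  <- perms(levels(df$CategVar1))
tmp <- setNames(
  vapply(lv, function(ord)
    cor(df$ResponseVar, as.numeric(factor(df$CategVar1, levels = ord))),
    numeric(1)),
  vapply(lv, paste, character(1), collapse = ""))

tmp["ABCD"]   # matches cor(as.numeric(df$CategVar1), df$ResponseVar)
```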
