STAN IRT via R programming, issue with parameter declaration?

I'm following along with the official Stan tutorial on IRT models. The model is copied below:
data {
  int<lower=1> J; // number of students
  int<lower=1> K; // number of questions
  int<lower=1> N; // number of observations
  int<lower=1,upper=J> jj[N]; // student for observation n
  int<lower=1,upper=K> kk[N]; // question for observation n
  int<lower=0,upper=1> y[N]; // correctness for observation n
}
parameters {
  real delta; // mean student ability
  real alpha[J]; // ability of student j - mean ability
  real beta[K]; // difficulty of question k
}
model {
  alpha ~ std_normal(); // informative true prior
  beta ~ std_normal(); // informative true prior
  delta ~ normal(0.75, 1); // informative true prior
  for (n in 1:N)
    y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
}
I'm not certain which variables do and do not need to be declared in R code.
toy_data <- list(
  J = 5,
  K = 4,
  N = 20,
  y = c(1,1,1,1,1,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0)
)
fit <- stan(file = '1PL_stan.stan', data = toy_data)
However, the following error is triggered.
Error in mod$fit_ptr() :
Exception: variable does not exist; processing stage=data initialization; variable name=jj; base type=int (in 'model920c4330dff_1PL_stan' at line 5)
In addition: Warning messages:
1: In readLines(file, warn = TRUE) :
incomplete final line found on 'C:\Users\jacob.moore\Downloads\1PL_stan.stan'
2: In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
'-E' not found
failed to create the sampler; sampling not done
In my past work I've used Python almost exclusively, so R has been quite a learning curve; I'm also very new to Stan, hence the toy example.
The core idea is that there are 20 child/question pairings: 5 children and 4 different questions. I'm not sure why my code triggers the error or how to correct it. What needs to change for this code to run without an error?

Every variable declared in the data block (J, K, N, jj, kk, and y) must be supplied in toy_data. You've left out jj and kk, which is why Stan reports that the variable jj does not exist during data initialization.
You have 5 students (J=5) answering 4 questions each (K=4). jj is the student ID, and kk is the question ID, so assuming your responses are ordered by student and then by question, you would have something like
jj = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)
kk = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
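Putting it together, here is a minimal sketch of the complete data list. It assumes the 20 responses in y are ordered by student and then by question; the rep() calls are just a compact way to build the same index vectors as above. The stan() call is left commented out because it needs rstan and the model file.

```r
J <- 5  # number of students
K <- 4  # number of questions
y <- c(1,1,1,1,1,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0)

# Student index: 1,1,1,1,2,2,2,2,... (each student answers K questions)
jj <- rep(1:J, each = K)
# Question index: 1,2,3,4,1,2,3,4,... (questions cycle within each student)
kk <- rep(1:K, times = J)

toy_data <- list(J = J, K = K, N = length(y), jj = jj, kk = kk, y = y)

# Sanity check: every variable declared in the Stan data block is present
stopifnot(all(c("J", "K", "N", "jj", "kk", "y") %in% names(toy_data)))

# fit <- rstan::stan(file = "1PL_stan.stan", data = toy_data)
```

Deriving N from length(y) keeps the count and the response vector from drifting apart if the data change.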

Related

variogram function in R returns one single observation

I'm trying to construct a variogram cloud in R using the variogram function from the gstat package. I'm not sure if there's something about the topic that I've misunderstood, but surely I should get more than one observation, right? Here's my code:
data = data.frame(matrix(c(2, 4, 8, 7, 6, 4, 7, 9, 4, 4, -1.01, .05, .47, 1.36, 1.18), nrow=5, ncol=3))
data = rename(data, X=X1, Y=X2, Z=X3)
coordinates(data) = c("X","Y")
var.cld = variogram(Z ~ 1,data=data, cloud = TRUE)
And here's the output:
> var.cld
  dist  gamma dir.hor dir.ver   id left right
1    1 0.0162       0       0 var1    5     4

I found the problem! Apparently the default value of the cutoff argument was too low for my specific set of data. Specifying a higher value resulted in additional observations.
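To see why only one pair appeared: according to the gstat documentation, the default cutoff is the length of the diagonal of the box spanning the data, divided by three. Below is a rough base-R sketch of that calculation for the five points in the question. The cutoff of 8 in the final comment is just an illustrative value larger than the largest pairwise distance.

```r
# Coordinates from the question (columns X and Y of the data frame)
xy <- cbind(X = c(2, 4, 8, 7, 6), Y = c(4, 7, 9, 4, 4))

# All pairwise Euclidean distances (5 points give 10 pairs)
d <- dist(xy)

# Diagonal of the bounding box divided by three (documented gstat default)
extent <- apply(xy, 2, function(v) diff(range(v)))
default_cutoff <- sqrt(sum(extent^2)) / 3

n_default <- sum(d <= default_cutoff)  # pairs within the default cutoff
n_all <- length(d)                     # pairs available in total
print(c(default_cutoff = default_cutoff, n_default = n_default, n_all = n_all))

# With gstat loaded, a larger cutoff should then include the remaining pairs:
# var.cld <- variogram(Z ~ 1, data = data, cloud = TRUE, cutoff = 8)
```

Only the pair at distance 1 falls inside the default cutoff of about 2.6, which matches the single row in the output above.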

Caret using C5.0 method, how to plot the final tree

I am using caret's train() with method = "C5.0" and would like to see the finalModel plotted as a tree.
The resulting model has been fit:
The final values used for the model were trials = 15, model = tree and winnow = FALSE.
When I try to plot the tree using plot or rattle's fancyRpartPlot, I get the errors below:
Using plot:
plot(diabetes.c50$finalModel,trials=15)
Error in plot(diabetes.c50$finalModel, trials = 15) :
object 'diabetes.c50' not found
Using rattle:
fancyRpartPlot(diabetes.C50$finalModel,trials=15)
Error in if (model$method == "class") { : argument is of length zero
The finalModel has been defined:
> diabetes.C50$finalModel
Call:
C5.0.default(x = structure(c(6, 8, 0, 8, 4, 10, 10, 1, 5, 7, 1, 1, 3, 8, 7, 9, 11, 10, 7, 1, 13, 5, 5, 3, 6, 4, 11, 9, 4, 3, 9, 7, 0,
"outcome", seed = 2187L), .Names = c("subset", "bands", "winnow", "noGlobalPruning", "CF", "minCases", "fuzzyThreshold",
"sample", "earlyStopping", "label", "seed")), verbose = FALSE)
Classification Tree
Number of samples: 538
Number of predictors: 8
Number of boosting iterations: 15
Average tree size: 12.9
Non-standard options: attempt to group attributes

The data structure representing a C5.0 tree is different from that representing an rpart tree. Rattle's fancyRpartPlot() assumes an rpart tree, hence the error (recent versions of rattle check the model class and explain this rather than failing with the indecipherable message above).
Your first error, though, looks like a typo, and the error message is self-explanatory: you meant diabetes.C50$finalModel rather than diabetes.c50$finalModel (capital C50 rather than lowercase c50).
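A small sketch of that kind of class check, using a stand-in object rather than a real fitted model (fake_model and check_rpart are hypothetical names for illustration only). The C50 package also ships its own plot method for C5.0 objects, which may be worth trying on the finalModel; whether it works on a model fitted through caret is not something this sketch verifies.

```r
# Stand-in for diabetes.C50$finalModel; a real C5.0 fit carries class "C5.0"
fake_model <- structure(list(), class = "C5.0")

# Hypothetical helper: report a readable message instead of failing deep
# inside a plotting function that expects an rpart object
check_rpart <- function(model) {
  if (!inherits(model, "rpart")) {
    return(paste0("not an rpart tree (class: ",
                  paste(class(model), collapse = "/"), ")"))
  }
  "ok"
}

msg <- check_rpart(fake_model)
print(msg)
```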

Some "train" columns aren't present in "test"

Hello everyone.
I have a problem. I need to perform kNN classification in R using leave-one-out (LOO) cross-validation. I found the packages "knncat" and "loo" for this, and I wrote the following code (without LOO so far):
library(knncat)
x <- c(1, 2, 3, 4)
y <- c(5, 6, 7, 8)
train <- data.frame(x, y)
x1 <- c(9, 10, 11, 12)
y1 <- c(13, 14, 15, 16)
test <- data.frame(x1, y1)
answer <- knncat(train, test, classcol = 2)
I get the error "Some "train" columns aren't present in "test"". I don't understand what I'm doing wrong. How can I fix this error?
If something's wrong with my English, sorry, I'm from Russia:)

Well, there are some problems with your approach and knncat:
You have to specify class labels for the train and test data sets and set classcol accordingly.
Only class labels which appear in train must be present in test.
The column names of train and test must be the same, or knncat will throw the error you've mentioned: "Some "train" columns aren't present in "test"".
Moreover, if you are using integer values as class labels, they have to start from zero, or knncat will throw an error: "Number in class 0 is 0! Abort!".
Here is a working example:
train <- data.frame(x1=1:4, x2=5:8, y=c(0, 0, 1, 1))
test <- data.frame(x1=9:12, x2=13:16, y=c(1, 0, 0, 1))
knncat(train, test, classcol = 3)
With the result:
Test set misclass rate: 50%
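The column-name mismatch can be spotted before calling knncat at all. Here is a minimal base-R sketch using the data frames from the question (missing_cols is just an illustrative variable name):

```r
# Data frames as defined in the question
train <- data.frame(x = c(1, 2, 3, 4), y = c(5, 6, 7, 8))
test <- data.frame(x1 = c(9, 10, 11, 12), y1 = c(13, 14, 15, 16))

# Columns of train that have no counterpart in test
missing_cols <- setdiff(names(train), names(test))
print(missing_cols)

# Renaming the test columns removes the mismatch
names(test) <- names(train)
stopifnot(length(setdiff(names(train), names(test))) == 0)
```

Renaming alone only addresses the column-name error. The data still need a proper class-label column with labels starting from zero, as in the working example above.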

R's equivalence of numpy.linalg.lstsq

I have multiple linear regressions of the form vc = x1 * va + x2 * vb.
(The first example below is too minimal: all vectors have the same values, which leads to warnings in R. A second data set further down illustrates my issue.)
In Python, I wrote:
#!/usr/bin/env python3
import numpy as np
va = np.array([1, 2, 3, 4, 5])
vb = np.array([1, 2, 3, 4, 5])
vc = np.array([1, 2, 3, 4, 5])
A = np.vstack([va, vb]).T
print(A)
result = np.linalg.lstsq(A, vc)
print(result)
Output:
(array([ 0.5, 0.5]), array([], dtype=float64), 1, array([ 1.04880885e+01, 3.14018492e-16]))
I thought the following code would be equivalent:
#!/usr/bin/Rscript
va <- c(1, 2, 3, 4, 5)
vb <- c(1, 2, 3, 4, 5)
vc <- c(1, 2, 3, 4, 5)
reg <- lm(vc ~ va + vb)
reg
summary(reg)
However, I get the following output (excerpt):
Coefficients:
A1  A2
 1  NA
Residual standard error: 7.022e-16 on 4 degrees of freedom
In summary.lm(reg) : essentially perfect fit: summary may be unreliable
Even if I adjust the numbers somewhat, R still complains.
I assume I am doing something basic wrong, but I can't figure out what. I also tried to construct a matrix A (containing va and vb as columns) and then use reg <- lm(vc ~ 0 + A). There, I get 3 degrees of freedom, but with the same coefficients.
2nd data set
va = np.array([1, 2, 3, 4, 5])
vb = np.array([2, 2, 2, 2, 2])
vc = np.array([3.1, 3.2, 3.3, 3.4, 3.5])
va <- c(1, 2, 3, 4, 5)
vb <- c(2, 2, 2, 2, 2)
vc <- c(3.1, 3.2, 3.3, 3.4, 3.5)
If I add 0 + (which results in lm(vc ~ 0 + va + vb)), I get 3 degrees of freedom and the same result. Looks good.
The 0 + removes the "implied intercept term" (whatever that is).

The problem is that you have a singular fit, and multiple combinations of coefficients will represent it equally well. IMHO, both numpy and R should really throw an error in this case by default. You can get R to give you an error by adding singular.ok = FALSE to your arguments. Additionally, although your intercept in this case is zero, your regression equation indicates that you're not looking to fit one. To fit a linear model without an intercept in R, use a formula of the form below (equivalent to the 0 + form):
lm(vc ~ va + vb - 1)
So, to (properly) return an error in this singular fit, you would call:
reg <- lm(vc ~ va + vb - 1, singular.ok = FALSE)
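If the goal is instead to reproduce what numpy.linalg.lstsq returns for the rank-deficient case, one option is the minimum-norm least-squares solution via the singular value decomposition. The following is a base-R sketch under that assumption, not a drop-in replacement for lm(). The tolerance rule is an illustrative choice, and keep, U and V are names introduced here.

```r
va <- c(1, 2, 3, 4, 5)
vb <- c(1, 2, 3, 4, 5)
vc <- c(1, 2, 3, 4, 5)

A <- cbind(va, vb)  # design matrix without an intercept column
s <- svd(A)

# Treat singular values below a small relative tolerance as zero
tol <- max(dim(A)) * .Machine$double.eps * max(s$d)
keep <- s$d > tol

U <- s$u[, keep, drop = FALSE]
V <- s$v[, keep, drop = FALSE]

# Minimum-norm solution: x = V diag(1/d) U' vc, restricted to kept components
x <- as.numeric(V %*% (crossprod(U, vc) / s$d[keep]))
print(x)          # 0.5 0.5
print(sum(keep))  # effective rank: 1
```

Among all coefficient pairs with x1 + x2 = 1 that fit these data exactly, this picks the one with the smallest norm. That is why the Python output shows 0.5 and 0.5, while lm() instead drops the redundant column and reports NA.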

How to obtain the full marginal distribution of a parameter in stan

When starting from a standard example from the Stan webpage like the following:
schools_code <- '
  data {
    int<lower=0> J; // number of schools
    real y[J]; // estimated treatment effects
    real<lower=0> sigma[J]; // s.e. of effect estimates
  }
  parameters {
    real theta[J];
    real mu;
    real<lower=0> tau;
  }
  model {
    theta ~ normal(mu, tau);
    y ~ normal(theta, sigma);
  }
'
schools_dat <- list(J = 8,
                    y = c(28, 8, -3, 7, -1, 1, 18, 12),
                    sigma = c(15, 10, 16, 11, 9, 11, 10, 18))
fit <- stan(model_code = schools_code, data = schools_dat,
            iter = 1000, chains = 4)
(This example was taken from the Stan website.)
However, this only gives me a few quantiles of the posterior of each parameter. So my question is: how do I obtain other percentiles? I guess it should be similar to BUGS(?)
Remark: I tried to add the tag stan, but I have too little reputation ;) Sorry for that.

As of rstan v1.0.3 (not released yet), you will be able to use the workhorse apply() function directly on an object of class stanfit, as produced by the stan() function. If fit is an object obtained from stan(), then for example,
apply(fit, MARGIN = "parameters", FUN = quantile, probs = (1:100) / 100)
or
apply(as.matrix(fit), MARGIN = 2, FUN = quantile, probs = (1:100) / 100)
The former applies FUN to each parameter in each chain, while the latter combines the chains before applying FUN to each parameter. If you were only interested in one parameter, then something like
beta <- extract(fit, pars = "beta", inc_warmup = FALSE, permuted = TRUE)[[1]]
quantile(beta, probs = (1:100) / 100)
is an option.
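To illustrate the shape of the result without running a sampler, here is a sketch on simulated numbers. The matrix draws is a made-up stand-in for as.matrix(fit) (rows are posterior draws, columns are parameters), and the column names and distributions are arbitrary assumptions.

```r
set.seed(1)

# Simulated stand-in for as.matrix(fit): 4000 draws of two parameters
draws <- cbind(mu = rnorm(4000, mean = 8, sd = 5),
               tau = abs(rnorm(4000, mean = 0, sd = 5)))

# Percentiles 1%..100% for every parameter, chains already combined
q <- apply(draws, MARGIN = 2, FUN = quantile, probs = (1:100) / 100)

dim(q)                     # 100 rows (percentiles) by 2 columns (parameters)
q[c("5%", "50%", "95%"), ] # pick out any percentiles of interest
```

The rows are labelled "1%", "2%", and so on, so individual percentiles can be looked up by name.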

Here's my attempt; I hope this is correct:
Suppose fit is an object obtained from stan(...). Then the posterior percentiles are obtained from:
quantile(fit@sim$sample[[1]]$beta, probs=c((1:100)/100))
where the number in square brackets is the chain, I guess. In case this hasn't been clear: I use rstan.
