R's equivalence of numpy.linalg.lstsq - r

I have multiple linear regressions of the form vc = x1 * va + x2 * vb.
(Now, a too minimal example follows - it has the same values, which leads to warnings in R. Below a second data set illustrating my issue)
In Python, I programmed
#!/usr/bin/env python3
import numpy as np
va = np.array([1, 2, 3, 4, 5])
vb = np.array([1, 2, 3, 4, 5])
vc = np.array([1, 2, 3, 4, 5])
A = np.vstack([va, vb]).T
print(A)
result = np.linalg.lstsq(A, vc)
print(result)
Output:
(array([ 0.5, 0.5]), array([], dtype=float64), 1, array([ 1.04880885e+01, 3.14018492e-16]))
I thought, following code would be identical:
#!/usr/bin/Rscript
va <- c(1, 2, 3, 4, 5)
vb <- c(1, 2, 3, 4, 5)
vc <- c(1, 2, 3, 4, 5)
reg <- lm(vc ~ va + vb)
reg
summary(reg)
However, I get following output (excerpt):
Coefficients:
A1 A2
1 NA
esidual standard error: 7.022e-16 on 4 degrees of freedom
In summary.lm(reg) : essentially perfect fit: summary may be unreliable
Even if I adjust the numbers somehow, R still keeps complaining.
I assume, I am doing something basic wrong, but I can't figure out. I also tried to construct a matrix A (containg vb and vc as colums) and then use reg <- lm(vc ~ 0 + A). There, I get 3 degrees of freedom, but with the same Coefficients.
2nd data set
va = np.array([1, 2, 3, 4, 5])
vb = np.array([2, 2, 2, 2, 2])
vc = np.array([3.1, 3.2, 3.3, 3.4, 3.5])
va <- c(1, 2, 3, 4, 5)
vb <- c(2, 2, 2, 2, 2)
vc <- c(3.1, 3.2, 3.3, 3.4, 3.5)
If I add 0 + (which results in lm(vc ~ 0 + va + vb)), I geed 3 degrees of freedom and the same result. Looks good.
The 0 + removes the "implied intercept term" (whatever this this). Source

The problem is that you have a singular fit, and multiple combinations of coefficients will represent it equally well. IMHO, both numpy and R should really throw an error in this case by default. You can get R to give you an error by adding singular.ok = FALSE to your arguments. Additionally, altough your intercept in this case is zero, your regression equation indicates that you're not looking to fit one. To fit a linear model without an intercept in R, use a formula in the form:
lm(vc ~ va + vb - 1)
So, to (properly) return an error in this singular fit, you would call:
reg <- lm(vc ~ va + vb - 1, singular.ok = FALSE)

Related

variogram function in R returns one single observation

I'm trying to construct a variogram cloud in R using the variogram function from the gstat package. I'm not sure if there's something about the topic that I've misunderstood, but surely I should get more than one observation, right? Here's my code:
data = data.frame(matrix(c(2, 4, 8, 7, 6, 4, 7, 9, 4, 4, -1.01, .05, .47, 1.36, 1.18), nrow=5, ncol=3))
data = rename(data, X=X1, Y=X2, Z=X3)
coordinates(data) = c("X","Y")
var.cld = variogram(Z ~ 1,data=data, cloud = TRUE)
And here's the output:
> var.cld
dist gamma dir.hor dir.ver id left right
1 1 0.0162 0 0 var1 5 4
I found the problem! Apparently the default value of the cutoff argument was too low for my specific set of data. Specifying a higher value resulted in additional observations.

STAN IRT via R programming, issue with parameter declaration?

I'm following along with this official IRT w/ STAN tutorial. The details of the model are copied below:
data {
int<lower=1> J; // number of students
int<lower=1> K; // number of questions
int<lower=1> N; // number of observations
int<lower=1,upper=J> jj[N]; // student for observation n
int<lower=1,upper=K> kk[N]; // question for observation n
int<lower=0,upper=1> y[N]; // correctness for observation n
}
parameters {
real delta; // mean student ability
real alpha[J]; // ability of student j - mean ability
real beta[K]; // difficulty of question k
}
model {
alpha ~ std_normal(); // informative true prior
beta ~ std_normal(); // informative true prior
delta ~ normal(0.75, 1); // informative true prior
for (n in 1:N)
y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
}
I'm not certain which variables do and do not need to be declared in R code.
toy_data <- list(
J= 5,
K = 4,
N =20,
y= c(1,1,1,1,1,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0)
)
fit <- stan(file = '1PL_stan.stan', data = toy_data)
However, the following error is triggered.
Error in mod$fit_ptr() :
Exception: variable does not exist; processing stage=data initialization; variable name=jj; base type=int (in 'model920c4330dff_1PL_stan' at line 5)
In addition: Warning messages:
1: In readLines(file, warn = TRUE) :
incomplete final line found on 'C:\Users\jacob.moore\Downloads\1PL_stan.stan'
2: In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
'-E' not found
failed to create the sampler; sampling not done
In my past work, I've used python almost exclusively. So learning R has been quite the learning curve; additionally, I'm very new to STAN, hence the toy example.
The core idea is that there are 20 child/question pairings. 5 children and 4 different questions. I'm uncertain why my code is triggering the error, and what I should do to correct it. Can you clarify what needs adjustment for this code to run without triggering an error?
Every parameter listed in the data block (J, K, N, jj, kk, and y) needs to be included in the variable toy_data. You've left out jj and kk.
You have 5 students (J=5) answering 4 questions each (K=4). jj is the student ID, and kk is the question ID, so assuming your responses are ordered by student and then by question, you would have something like
jj = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)
kk = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)

Which packages in R allow for lagged variables for time series analysis?

I would like to include multiple lags of an exogenous variable in a regression. Let's say that I have the following data:
X = c(1, 4, 8, 9, 3, 5...)
X2 = c(4, 6, 7, 9, 7, 8...)
I want to use lags of X2 to predict X. Does anyone know why package allows for me to do this? I have tried using dynlm and lag() from stats.
Thanks
library(zoo)
set.seed(1111)
x <- as.zoo(rnorm(10, 0, 0.02))
y <- lag(x, 2, na.pad = TRUE)
cbind(x, y)
This performs an ordinary linear regression of X on the first 2 lags of X2 with an intercept (fit2), on the first lag with an intercept (fit1) and just on an intercept (fit0). Note that in R one normally uses negative numbers to lag so for convenience we defined a Lag function which uses positive numbers to indicate lags. lag.zoo allows vector lags so Lag(z2, 1:2) has two columns, one column for each of the two lags.
library(dyn)
X = c(1, 4, 8, 9, 3, 5)
X2 = c(4, 6, 7, 9, 7, 8)
z <- zoo(X)
z2 <- zoo(X2)
Lag <- function(x, k = 1) lag(x, k = -k)
fit2 <- dyn$lm(z ~ Lag(z2, 1:2))
fit1 <- dyn$lm(z ~ Lag(z2))
fit0 <- dyn$lm(z ~ 1)
For example, here is fit2.
> fit2
Call:
lm(formula = dyn(z ~ Lag(z2, 1:2)))
Coefficients:
(Intercept) Lag(z2, 1:2)1 Lag(z2, 1:2)2
19.3333 -1.4242 -0.4242
Here is a comparison of the three fits showing that the one and two lag fits are not significantly better than just using the intercept; however, there is a quite drop in residual sum of squares by adding the first lag to the intercept only model so you might want to ignore the statistical significance and use the first lag anyways.
> anova(fit0, fit1, fit12)
Analysis of Variance Table
Model 1: z ~ 1
Model 2: z ~ Lag(z2)
Model 3: z ~ Lag(z2, 1:2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 3 22.7500
2 2 8.4211 1 14.3289 2.1891 0.3784
3 1 6.5455 1 1.8756 0.2865 0.6871
It would also be possible to use ts class in place of the zoo class; however, lag.ts does not support vector lags so with ts each term would have to be written out separately. Lag is from above.
tt <- ts(X)
tt2 <- ts(X2)
fits12_ts <- dyn$lm(tt ~ Lag(tt2) + Lag(tt2, 2))
No external R library is required, I would say
X2 = c(4, 6, 7, 9, 7, 8)
lag = 2
lagged_data <- function(x) c(tail(X2, -x), rep(NA, x))
lagged_data(lag)
# [1] 7 9 7 8 NA NA

Spearman's rho between ordered factors in R

I have two ordered factors and simply want to find Spearman's rho between them.
However:
> cor(dat$UEMS.2,dat$SCIM23_SubScore1.2,use="pairwise.complete.obs",method="spearman")
Error in cor(dat$UEMS.2, dat$SCIM23_SubScore1.2, use = "pairwise.complete.obs", :
'x' must be numeric
And just as a sanity check:
> class(dat$UEMS.2)
[1] "ordered" "factor"
> class(dat$SCIM23_SubScore1.2)
[1] "ordered" "factor"
How do I find spearman's rho for ordered factors using R?
I did find the following:
Calculate correlation - cor() - for only a subset of columns
Which raises the same issue: R's cor() function only accepts numerical data. This doesn't seem right to me, because spearman's rho should be able to handle ordinal variables. Ordered factors are ordinal variables.
Thanx in advance.
You can use the pspearman package to handle ordinal variables:
a <- factor(c(1, 2, 3, 4, 4, 4, 3, 4, 2, 2, 1), ordered=TRUE)
b <- factor(c(1, 4, 2, 2, 4, 1, 1, 4, 4, 3, 3), ordered=TRUE)
library(pspearman)
spearman.test(a, b)
# Rsquare F df1 df2 pvalue n
# 0.001015235 0.009146396 1.000000000 9.000000000 0.925904654 11.000000000

How to grab coefficients with R when estimating a Zero Inflation Model

Probably pretty easy, but I want to know, how to grab coefficients when using the zeroinfl command?
treatment <- factor(rep(c(1, 2), c(43, 41)),
levels = c(1, 2),labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)),
levels = c(1, 2, 3),labels = c("none", "some", "marked"))
numberofdrugs <- rpois(84, 2)
healthvalue <- rpois(84,0.5)
y <- data.frame(healthvalue,numberofdrugs, treatment, improved)
require(pscl)
ZIP<-zeroinfl(healthvalue~numberofdrugs+treatment+improved, y)
summary(ZIP)
I usually use ZIP$coef[1] to grab a coefficient, but unfortunately here you grab a whole bunch. So how can I grab one single coeficients from a ZIP model?
Use the coef extraction function to list all coefficients in one long vector, and then you can use single index notation to select them:
coef(ZIP)[1]
count_(Intercept)
0.1128742
Alternatively, you need to select which model you want to get the coefficients from first:
ZIP$coef$count[1]
(Intercept)
0.1128742
ZIP$coef[[1]][1]
(Intercept)
0.1128742
If you wanted to get fancy you could split the coefficients into a list:
clist <- function(m) {
cc <- coef(m)
ptype <- gsub("_.+$","",names(cc))
ss <- split(cc,ptype)
lapply(ss, function(x) names(x) <- gsub("^.*_","",names(x)))
}
> clist(ZIP)
$count
(Intercept) numberofdrugs treatmenttreated improvedsome
-1.16112045 0.16126724 -0.07200549 -0.34807344
improvedmarked
0.23593220
$zero
(Intercept) numberofdrugs treatmenttreated improvedsome
7.509235 -14.449669 -58.644743 -8.060501
improvedmarked
58.034805
c1 <- clist(ZIP)
c1$count["numberofdrugs"]

Resources