R: interaction between continuous and categorical vars in 'isat' regression ('gets' package)

I want to calculate the differential response of y to x (continuous) depending on the categorical variable z.
In the standard lm setup:
lm(y~ x:z)
However, I want to do this while allowing for Impulse Indicator Saturation (IIS) via isat in the 'gets' package, and the following syntax produces an error:
isat(y, mxreg=x:z, iis=TRUE)
The error message is of the form:
"Error in solve.qr(out, tol = tol, LAPACK = LAPACK) :
singular matrix 'a' in 'solve"
1: In x:z :
numerical expression has 96 elements: only the first used
2: In x:z :
numerical expression has 96 elements: only the first used"
How should I modify the syntax?
Thank you!

At the moment, alas, isat doesn't handle categorical/character variables the way lm does, nor does it support the * and : formula operators. We hope to address that in a future release.
In the meantime you'll have to create distinct variables in your dataset representing the interaction. I guess something like the following...
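library(gets)
# Simulate data: the slope of y on x differs across the four categories of z
N <- 100
x <- rnorm(N)
z <- c(rep("A",N/4),rep("B",N/4),rep("C",N/4),rep("D",N/4))
e <- rnorm(N)
y <- 0.5*x*as.numeric(z=="A") + 1.5*x*as.numeric(z=="B") - 0.75*x*as.numeric(z=="C") + 5*x*as.numeric(z=="D") + e
lm.reg <- lm(y ~ x:z)
# Does not work: outside a formula, x:z is the sequence operator, not an interaction
# arx.reg.0 <- arx(y,mxreg=x:z)
data <- data.frame(y,x,z,stringsAsFactors=FALSE)
# Create one interaction variable per category: Zx.A, Zx.B, Zx.C, Zx.D
for(i in unique(z)) {
  data[[paste("Zx",i,sep=".")]] <- data$x * as.numeric(data$z==i)
}
# x plus three of the four interactions (D acts as the reference category)
arx.reg.1 <- arx(data$y,mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")])
isat.1 <- isat(data$y,mc=TRUE,mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")],max.block.size=20)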
Note that as you'll be creating dummies for each category, there's a chance those dummies will cause singularity of your matrix of explanatory variables (if, as in my example, isat automatically uses 4 blocks). Using the argument max.block.size enables you to avoid this problem.
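As a possible shortcut for building the interaction columns, base R's model.matrix can expand x:z for you. The following is only a sketch under assumptions: the names X_int, fit_lm, fit_mx and isat.2 are introduced here for illustration, the isat call is only attempted if gets is installed, and the iis/sis settings are chosen to match the IIS request in the question rather than taken from the answer above.

```r
set.seed(1)
N <- 100
x <- rnorm(N)
z <- factor(rep(c("A", "B", "C", "D"), each = N / 4))
y <- 0.5 * x * (z == "A") + 1.5 * x * (z == "B") -
  0.75 * x * (z == "C") + 5 * x * (z == "D") + rnorm(N)

# One column per category, each equal to x within that category and 0 elsewhere
X_int <- model.matrix(~ 0 + x:z)
colnames(X_int) <- paste0("Zx.", levels(z))

# Sanity check: the hand-built columns reproduce lm's own x:z interaction
fit_lm <- lm(y ~ x:z)
fit_mx <- lm(y ~ X_int)

# Assumption: gets is installed; the arguments mirror the answer's isat call
if (requireNamespace("gets", quietly = TRUE)) {
  isat.2 <- gets::isat(y, mc = TRUE, mxreg = X_int, iis = TRUE, sis = FALSE,
                       max.block.size = 20)
}
```

Comparing coef(fit_lm) with coef(fit_mx) is a quick way to confirm that the matrix passed to mxreg encodes the same model as lm(y ~ x:z).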
Let me know if I haven't addressed your particular point.

Related

R function to find which of 3 variables correlates most with another value?

I am conducting a study that analyzes speakers' production and measures their average F2 values. What I need is an R function that allows me to find a relationship for these F2 values with 3 other variables, and if there is, which one is the most significant. These variables have been coded as 1, 2, or 3 for things like "yes" "no" answers or whether responses are positive, neutral or negative (1, 2, 3 respectively).
Is there a particular technique or R function/test that we can use to approach this problem? I've considered using ANOVA or a T-Test but am unsure if this will give me what I need.
A quick solution might look like this. Here, the cor function is used. Read its help page (?cor) to understand what is calculated. By default, the Pearson correlation coefficient is used. The function below returns the variable with the highest Pearson correlation with the reference variable. Note that which.max picks the largest signed value, so a strong negative correlation will not be selected.
set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- 0.5*x + 0.5*rnorm(100)
find_max_corr <- function(vars, ref){
val <- sapply(vars, cor, y = ref)
val[which.max(val)]
}
find_max_corr(list('x' = x, 'y' = y, 'z' = z), ref)
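If strong negative relationships should count as well, and a significance measure is wanted, a variant along these lines might help. This is a sketch, not part of the answer above: rank_by_abs_cor is a hypothetical helper, the data are simulated, and for predictors coded 1/2/3 you may prefer method = "spearman" or an ANOVA instead of Pearson.

```r
# Hypothetical helper: rank variables by absolute correlation with ref
# and report the cor.test p-value for each one
rank_by_abs_cor <- function(vars, ref, method = "pearson") {
  res <- t(sapply(vars, function(v) {
    ct <- cor.test(v, ref, method = method)
    c(estimate = unname(ct$estimate), p.value = ct$p.value)
  }))
  res[order(-abs(res[, "estimate"])), , drop = FALSE]
}

set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- -0.5 * x + 0.5 * rnorm(100)  # negatively related to x by construction

ranking <- rank_by_abs_cor(list(x = x, y = y, z = z), ref)
ranking
```

The first row of ranking is the variable most strongly related to ref in absolute terms.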

How to update code to create a function for calculating Welch's for polynomial trends?

I am trying to reproduce the SPSS output for the significance of a linear trend among means when equal variances are not assumed.
I have gratefully used code from http://www-personal.umich.edu/~gonzo/coursenotes/file3.pdf to create a function for calculating separate variances, which based on my searching I understand as the “equal variances not assumed” output in SPSS.
My problem/goal:
I am only assessing polynomial orthogonal trends (mostly linear). I want to adapt the code creating the function so that the contrast argument can take pre-made contrast matrices rather than manually specifying the coefficients each time (room for typos!).
When I pass a contrast matrix to the function, I receive Error in contrast %*% means : non-conformable arguments. I have played around with the code but I can’t get it to work.
Code for creating the function from the notes:
sepvarcontrast <- function(dv, group, contrast) {
  # Per-group means, variances and sample sizes
  means <- c(by(dv, group, mean))
  vars <- c(by(dv, group, var))
  ns <- c(by(dv, group, length))
  # Contrast estimate and its separate-variance standard error
  ihat <- contrast %*% means
  t.denominator <- sqrt(contrast^2 %*% (vars/ns))
  t.welch <- ihat/ t.denominator
  num.contrast <- ifelse(is.null(dim(contrast)),1,dim(contrast)[1])
  df.welch <- rep(0, num.contrast)
  if (is.null(dim(contrast))) contrast <- t(as.matrix(contrast))
  # Welch-Satterthwaite degrees of freedom, one per contrast (row);
  # the variances are divided by ns so the result also holds for unequal group sizes
  for (i in 1:num.contrast) {
    num <- (contrast[i,]^2 %*% (vars/ns))^2
    den <- sum((contrast[i,]^2 * vars/ns)^2 / (ns-1))
    df.welch[i] <- num/den
  }
  p.welch <- 2*(1- pt(abs(t.welch), df.welch))
  result <- list(ihat = ihat, se.ihat = t.denominator, t.welch = t.welch,
                 df.welch = df.welch, p.welch = p.welch)
  return(result)
}
I would like to be able to use the function like this:
# Create a polynomial contrast matrix for 5 groups, then save
contr.mat5 <- contr.poly(5)
# Calculate separate variance
sepvarcontrast(dv, group, contrast = contr.mat5)
I have tried those exact commands to see if they would work but receive Error in contrast %*% means : non-conformable arguments.
All suggestions are appreciated! I am still learning how to create a reprex...
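A likely cause of the error is orientation: contr.poly(5) returns a 5 x 4 matrix with one contrast per column, whereas the function multiplies contrast %*% means and therefore expects one contrast per row, so passing t(contr.poly(5)) may already be enough. The sketch below is one possible adaptation rather than the code from the notes; sepvar_contrast_any is a hypothetical name, the data are simulated, and it assumes that a matrix with as many rows as groups (but a different number of columns) stores contrasts in columns.

```r
# Hypothetical variant: accepts a coefficient vector, a matrix with one
# contrast per row, or a contr.poly()-style matrix with one contrast per column
sepvar_contrast_any <- function(dv, group, contrast) {
  group <- factor(group)
  means <- as.vector(tapply(dv, group, mean))
  vars  <- as.vector(tapply(dv, group, var))
  ns    <- as.vector(tapply(dv, group, length))
  k <- length(means)

  if (is.null(dim(contrast))) {
    contrast <- matrix(contrast, nrow = 1)
  } else if (ncol(contrast) != k && nrow(contrast) == k) {
    contrast <- t(contrast)  # contrasts were stored in columns
  }
  stopifnot(ncol(contrast) == k)

  ihat <- drop(contrast %*% means)
  se   <- sqrt(drop(contrast^2 %*% (vars / ns)))
  t.welch <- ihat / se
  # Welch-Satterthwaite degrees of freedom for each contrast
  df.welch <- se^4 / drop(contrast^4 %*% ((vars / ns)^2 / (ns - 1)))
  p.welch <- 2 * pt(abs(t.welch), df.welch, lower.tail = FALSE)
  data.frame(ihat = ihat, se.ihat = se, t.welch = t.welch,
             df.welch = df.welch, p.welch = p.welch)
}

# Simulated example: 5 groups of 8 observations with a linear trend in the means
set.seed(42)
group5 <- rep(1:5, each = 8)
dv5 <- rnorm(40, mean = 0.5 * group5)
trend_res <- sepvar_contrast_any(dv5, group5, contr.poly(5))
trend_res
```

Each row of trend_res corresponds to one polynomial trend (linear, quadratic, cubic, quartic). With two groups and the contrast c(1, -1), the result should agree with t.test's default Welch test, which is a convenient check.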

R: multicollinearity issues using glib(), Bayesian Model Averaging (BMA-package)

I am experiencing difficulties estimating a BMA-model via glib(), due to multicollinearity issues, even though I have clearly specified which columns to use. Please find the details below.
The data I'll be using for the estimation via Bayesian Model Averaging:
Cij <- c(357848,766940,610542,482940,527326,574398,146342,139950,227229,67948,
352118,884021,933894,1183289,445745,320996,527804,266172,425046,
290507,1001799,926219,1016654,750816,146923,495992,280405,
310608,1108250,776189,1562400,272482,352053,206286,
443160,693190,991983,769488,504851,470639,
396132,937085,847498,805037,705960,
440832,847631,1131398,1063269,
359480,1061648,1443370,
376686,986608,
344014)
n <- length(Cij);
TT <- trunc(sqrt(2*n))
i <- rep(1:TT,TT:1); #row numbers: year of origin
j <- sequence(TT:1) #col numbers: year of development
k <- i+j-1 #diagonal numbers: year of payment
#Since k=i+j-1, we have to leave out another dummy in order to avoid multicollinearity
k <- ifelse(k == 2, 1, k)
I want to evaluate the effect of i and j both via levels and factors, but of course not in the same model. Since I can decide to include i and j as factors, levels, or not include them at all and for k either to include as level, or exclude, there are a total of 18 (3x3x2) models. This brings us to the following data frame:
X <- data.frame(Cij,i.factor=as.factor(i),j.factor=as.factor(j),k,i,j)
X <- model.matrix(Cij ~ -1 + i.factor + j.factor + k + i + j,X)
X <- as.data.frame(X[,-1])
Next, via the following declaration I specify which variables to consider in each of the 18 models. As far as I can tell, no linear dependence exists within any of these specifications.
model.set <- rbind(
c(rep(0,9),rep(0,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,0),
c(rep(0,9),rep(0,9),0,0,1),
c(rep(0,9),rep(0,9),1,0,0),
c(rep(1,9),rep(0,9),0,0,0),
c(rep(0,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,1),
c(rep(0,9),rep(0,9),1,1,0),
c(rep(0,9),rep(1,9),0,1,0),
c(rep(0,9),rep(0,9),1,0,1),
c(rep(1,9),rep(0,9),0,0,1),
c(rep(1,9),rep(0,9),1,0,0),
c(rep(0,9),rep(1,9),1,0,0),
c(rep(1,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),1,1,1),
c(rep(0,9),rep(1,9),1,1,0),
c(rep(1,9),rep(0,9),1,0,1),
c(rep(1,9),rep(1,9),1,0,0))
Then I call the glib() function, telling it to select the specified columns from X according to model.set.
library(BMA)
model.glib <- glib(X,Cij,error="poisson", link="log",models=model.set)
which results in the error
Error in glim(x, y, n, error = error, link = link, scale = scale) : X matrix is not full rank
The function apparently checks whether the whole matrix X has full column rank before it evaluates which columns to select from X via model.set. How do I circumvent this, or is there any other way to include all 18 models in the glib() function?
Thank you in advance.
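To see where the rank problem comes from, the rank of the combined design matrix can be compared with the rank of each individual specification. This is a diagnostic sketch only, not a fix for glib(): the names d, specs, term, full_rank_each and M_all are introduced here, an intercept is assumed in every model, and the 18 specifications are regenerated with expand.grid (same set as model.set, different order).

```r
TT <- 10
i <- rep(1:TT, TT:1)        # year of origin
j <- sequence(TT:1)         # year of development
k <- i + j - 1              # year of payment
k <- ifelse(k == 2, 1, k)
d <- data.frame(i.factor = factor(i), j.factor = factor(j), k, i, j)

# All 3 x 3 x 2 combinations: i and j as none/level/factor, k in or out
specs <- expand.grid(i = c("none", "level", "factor"),
                     j = c("none", "level", "factor"),
                     k = c(FALSE, TRUE), stringsAsFactors = FALSE)

term <- function(name, how) {
  switch(how, none = NULL, level = name, factor = paste0(name, ".factor"))
}

# TRUE if the design matrix of a specification has full column rank
full_rank_each <- sapply(seq_len(nrow(specs)), function(r) {
  terms <- c(term("i", specs$i[r]), term("j", specs$j[r]),
             if (specs$k[r]) "k")
  f <- reformulate(if (length(terms)) terms else "1")
  M <- model.matrix(f, d)
  qr(M)$rank == ncol(M)
})

# The combined matrix holds i and j both as levels and as factors
M_all <- model.matrix(~ i.factor + j.factor + k + i + j, d)
c(rank = qr(M_all)$rank, columns = ncol(M_all))
all(full_rank_each)
```

If every specification is full rank while M_all is not, the error stems from checking the combined matrix (the level version of i is a linear combination of the i.factor dummies and the intercept, and likewise for j), not from any single model.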

How to do ma and loess normalization in R?

Attempting to do loess on two variables x and y in R using MA normalization (http://en.wikipedia.org/wiki/MA_plot) like this:
> x = rnorm(100) + 5
> y = x + 0.6 + rnorm(100)*0.8
> m = log2(x/y)
> a = 0.5*log2(x*y)
I want to normalize x and y in such a way that the average m is 0, as in standard MA normalization, and then back-calculate the correct x and y values. First running loess on MA:
> l = loess(m ~ a)
What is the way to get corrected m values then? Is this correct?
> mc <- predict(l, a)
# original MA plot
> plot(a,m)
# corrected MA plot
> plot(a,m-mc)
It is not clear to me what predict actually does in the case of loess objects and how it differs from using l$residuals in the object l returned by loess. Can someone explain?
Finally, how can I back-calculate new x and y values based on this correction?
First, yes, your proposed method gets the corrected m values.
Regarding the predict function: yes, l$residuals, m - fitted(l), and m - predict(l) all give the same result: the corrected m values. However, the predict function is more general: it will take any new values as input. This is useful if you want to use only a subset of the data to fit the loess, and then predict on the totality of the data (for example, when using spiked-in standards).
Finally, how can you back calculate new x and y values based on this correction? If you transform your data into log-space, by creating two new variables x1 <- log2(x) and y1 <- log2(y), it becomes easier to see. Since we're in log-space, calculating m and a is simpler:
m <- x1 - y1
a <- (x1 + y1)/2
Now, for correcting your data based on the fitted loess model, instead of updating the m variable by your mc correction, you can update x1 and y1 instead. Set:
x1 <- x1 - mc / 2
y1 <- y1 + mc / 2
This update has the same effect as updating m <- m - mc (because m will be recomputed as the difference between the updated x1 and y1) and has no effect on the a value.
To get your corrected data out, transform them by returning 2^x1 and 2^y1.
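Putting the steps together, a minimal end-to-end sketch could look like this. The data are simulated and positive by construction, and the names fit, x_corr and y_corr are introduced here for illustration only.

```r
set.seed(1)
# Simulated positive intensities (assumption: the real data are strictly positive)
x <- 2^rnorm(100, mean = 8)
y <- x * 2^(0.6 + 0.8 * rnorm(100))

# Work in log2 space
x1 <- log2(x)
y1 <- log2(y)
m <- x1 - y1
a <- (x1 + y1) / 2

# Fit the trend of m against a and evaluate it at the observed a values
fit <- loess(m ~ a)
mc <- predict(fit, newdata = data.frame(a = a))

# Split the correction evenly between the two channels, then back-transform
x_corr <- 2^(x1 - mc / 2)
y_corr <- 2^(y1 + mc / 2)

# The corrected m equals m - mc, and a is unchanged
range(log2(x_corr / y_corr) - (m - mc))
range(0.5 * log2(x_corr * y_corr) - a)
```

Both ranges should be zero up to floating-point error, which confirms that shifting x1 and y1 by half the correction each is equivalent to updating m alone.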
This is the method used by the authors of the normalize.loess function in the affy package, as originally described here (the linked function can also cycle through all pairs of variables, as opposed to the single pair in this case): http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/normalizeCyclicLoess.html

Create function to automatically create plots from summary(fit <- lm( y ~ x1 + x2 +... xn))

I am running the same regression with small alterations of the x variables several times. My aim, after having determined the fit and significance of each variable for this linear regression model, is to view all major plots. Instead of having to create each plot one by one, I want a function to loop through my variables (x1...xn) from the following model.
fit <- lm(y ~ x1 + x2 + ... + xn)
The plots I want to create for all x are
1) x versus y
2) x versus predicted y
3) x versus residuals
4) x versus time, where time is not a variable used in the regression but is provided in the data frame the data comes from.
I know how to access the coefficients from fit, however I am not able to use the coefficient names from the summary and reuse them in a function for creating the plots, as the names are characters.
I hope my question has been clearly described and hasn't been asked already.
Thanks!
Create some mock data
dat <- data.frame(x1=rnorm(100), x2=rnorm(100,4,5), x3=rnorm(100,8,27),
x4=rnorm(100,-6,0.1), t=(1:100)+runif(100,-2,2))
dat <- transform(dat, y=x1+4*x2+3.6*x3+4.7*x4+rnorm(100,3,50))
Make the fit
fit <- lm(y~x1+x2+x3+x4, data=dat)
Compute the predicted values
dat$yhat <- predict(fit)
Compute the residuals
dat$resid <- residuals(fit)
Get a vector of the variable names
vars <- names(coef(fit))[-1]
A plot can be made from this character representation of the name if you use it to build a string version of a formula and convert that with as.formula. The four plots are below, wrapped in a loop over all the vars. Additionally, the loop is surrounded by setting ask to TRUE so that you get a chance to see each plot. Alternatively, you can arrange multiple plots on the screen, or write them all to files to review later.
opar <- par(ask=TRUE)
for (v in vars) {
plot(as.formula(paste("y~",v)), data=dat)
plot(as.formula(paste("yhat~",v)), data=dat)
plot(as.formula(paste("resid~",v)), data=dat)
plot(as.formula(paste("t~",v)), data=dat)
}
par(opar)
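For the file-based alternative, the loop can be wrapped in a function that writes one page of four plots per variable to a PDF. This is a sketch under assumptions: plot_fit_vars is a hypothetical name, the term labels of the model are assumed to be plain column names of the data frame, and no rows are assumed to have been dropped for missing values.

```r
# Hypothetical helper: one PDF page per regressor, four panels per page
plot_fit_vars <- function(fit, data, time_var, file) {
  vars <- attr(terms(fit), "term.labels")
  resp <- all.vars(formula(fit))[1]
  pdf(file)
  on.exit(dev.off())
  par(mfrow = c(2, 2))
  for (v in vars) {
    plot(data[[v]], data[[resp]], xlab = v, ylab = resp)
    plot(data[[v]], fitted(fit), xlab = v, ylab = paste("predicted", resp))
    plot(data[[v]], residuals(fit), xlab = v, ylab = "residuals")
    plot(data[[v]], data[[time_var]], xlab = v, ylab = time_var)
  }
  invisible(vars)
}

# Mock data in the same spirit as above
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100, 4, 5),
                  t = (1:100) + runif(100, -2, 2))
dat$y <- dat$x1 + 4 * dat$x2 + rnorm(100, 3, 5)
fit <- lm(y ~ x1 + x2, data = dat)

out_file <- file.path(tempdir(), "fit_plots.pdf")
plotted <- plot_fit_vars(fit, dat, "t", out_file)
```

Using data[[v]] with a character name avoids building formulas from strings, and the function returns the plotted variable names invisibly so the call can be checked afterwards.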
The coefficients are stored in the fit objects as you say, but you can access them generically in a function by referring to them this way:
x <- 1:10
y <- x*3 + rnorm(10) # one noise value per observation
plot(x,y)
fit <- lm(y~x)
fit$coefficients[1] # intercept
fit$coefficients[2] # slope
str(fit) # a lot of info, but you can see how the fit is stored
My guess is that when you say you know how to access the coefficients, you are getting them from summary(fit), which is a bit harder to access than taking them directly from the fit. By using fit$coefficients[1] etc. you don't need the name of the variable in your function.
Three options to directly answer what I think was the question: How to access the coefficients using character arguments:
x <- 1:10
y <- x*3 + rnorm(10)
fit <- lm(y~x)
# 1
fit$coefficients["x"]
# 2
coefname <- "x"
fit$coefficients[coefname]
# 3
coef(fit)[coefname]
If the question was how to plot the various functions then you should supply a sufficiently complex construction (in R) to allow demonstration of methods with a well-specified set of objects.
