Backward selection in LME, singularity in backsolve occurred - r

I have data in which flight speed is the response variable and the predictors are group (experimental/control), test (first/second), FL (fuel load, % of lean body mass: from 0 to ~25%), and wing (wing length in mm). Since we tested the same birds twice (first and second test; the experimental group was infected), I want to fit a mixed model with a random term ~1|ring. I also added a weights term for the test variable because of heteroscedasticity.
mod<-lme(speed~test* group * FL * wing,weight=~1|test,random=~1|ring,data=data,method="ML")
This is what the full model looks like (I use the nlme package). After that I start the backward selection. I do it manually (according to the lowest AIC) and then check the result with the function stepAIC (MASS package). In this case the first two steps of selection go fine, but when I continue with the model:
mod3<-lme(speed~test+group + FL + wing+ test:group + group:FL + FL:wing + test:group:wing, weight=~1|test,random=~1|ring,data=data,method="ML")
I got an error:
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
As far as I understand, it means that not all interactions of factors exist. But then I should have got the same error already with the full model. And with other response variables it works well. If any of you have an idea, I would be glad!
original data
ring group wing speed_aver FL test
1 XZ13125 E 75 0.62 16.2950000 first
2 XZ13125 E 75 0.22 12.5470149 second
3 XZ13126 E 68 0.39 7.7214876 first
4 XZ13127 C 75 0.52 9.1573643 first
5 XZ13127 C 75 0.17 -1.9017391 second
6 XZ13129 C 73 0.46 10.3821705 first
7 XZ13129 C 73 0.33 -0.5278261 second
8 XZ13140 C 73 0.48 13.0774436 first
9 XZ13140 C 73 0.27 18.0092199 second
10 XZ13144 C 73 0.36 7.5144000 first
11 XZ13144 C 73 0.36 9.6820312 second
12 XZ13146 E 73 0.32 14.3651852 first
13 XZ13146 E 73 0.28 20.8171233 second
14 XZ13159 C 74 0.55 20.2760274 first
15 XZ13159 C 74 0.37 19.1687500 second
16 XZ13209 E 72 0.35 8.1464000 first
17 XZ13209 E 72 0.43 10.9945736 second
18 XZ13213 E 74 0.57 5.3682927 first
19 XZ13213 E 74 0.26 1.3584746 second
20 XZ13220 C 73 0.30 6.0105691 first
21 XZ13220 C 73 0.36 -8.0439252 second
22 XZ13230 E 74 0.44 5.3682927 first
23 XZ13230 E 74 0.31 3.0025000 second
24 XZ13231 C 75 0.28 6.2504000 first
25 XZ13231 C 75 0.37 7.7267717 second
26 XZ13232 C 74 0.34 16.8592857 first
27 XZ13232 C 74 0.33 13.7800000 second
28 XZ13271 C 73 0.32 16.2268116 first
29 XZ13271 C 73 0.28 14.3651852 second
30 XZ13278 E 72 0.45 15.5757353 first
31 XZ13278 E 72 0.37 14.9503704 second
32 XZ13280 C 74 0.33 15.0386861 first
33 XZ13280 C 74 0.36 7.6214286 second
34 XZ13340 E 73 0.62 16.8294964 first
35 XZ13340 E 73 0.26 13.7261194 second
36 XZ13367 E 75 0.42 23.4071895 first
37 XZ13370 E 71 0.25 13.6159091 first

This is pretty tricky as it turns out. I think the problem is that due to the way you're constructing your second formula, R is not automatically removing collinear variables from the model matrix.
tl;dr this is a bit stream-of-consciousness, but I think the basic take-home points are:
- lme doesn't necessarily check/handle aliasing in a model specification for you (unlike lm, or to a lesser extent lmer)
- you can get in trouble with R's formulas if you violate marginality, which you've done here by including the test:group:wing interaction without including the group:wing and test:wing interactions. R lets you do this, but the model doesn't necessarily make sense ... I'm a little bit surprised you ended up with this model specification -- usually stepAIC, and drop1, and R's other built-in model simplification tools, try to respect marginality and thus wouldn't let you end up here ...
- if you really want to fit these kinds of models, use lmer (although dealing with heteroscedasticity is harder), or construct your own numeric dummy variables with model.matrix() ...
- checking out these kinds of aliasing problems can best be done with model.matrix(), outside the scope of the model-fitting (lm/lme/lmer) function itself ...
For simplicity I'm going to leave out the variance model (weights=varIdent(form=~1|test)) as it doesn't seem to be relevant to this specific problem (I didn't know that a priori, but tests with and without it didn't differ).
library("nlme")
form1 <- speed_aver~test* group * FL * wing
form2 <- speed_aver~test+group + FL + wing+
test:group + group:FL + FL:wing +
test:group:wing
mod <- lme(form1,random=~1|ring,data=dd,method="ML") ## OK
update(mod,form2)
## fails with "Singularity in backsolve" error
What if we try it with lme4?
## ugh, I wish I knew a better way to append to a formula
form1L <- formula(paste(deparse(form1),"(1|ring)",sep="+"))
form2L <- formula(paste(deparse(form2,width=100),"(1|ring)",sep="+"))
library("lme4")
mod2 <- lmer(form1L, data=dd)
mod3 <- lmer(form2L, data=dd)
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
Aha! lmer automatically detects that the model matrix is rank-deficient. lm does this automatically and substitutes NA values for the aliased terms. At present lmer just drops them, although with reasonably recent versions of lme4 the (documented but unadvertised) option add.dropped=TRUE to fixef() will put the NA values back in the appropriate places.
So let's investigate the model matrices:
X0 <- model.matrix(form1,data=dd)
c(rankMatrix(X0)==ncol(X0)) ## TRUE; both are 16
X <- model.matrix(form2,data=dd)
c(rankMatrix(X))==ncol(X) ## FALSE; 11<12
Try to identify aliased columns: 12th element of svd(X)$d is tiny (1e-15)
ss <- svd(X)
(zz <- zapsmall(ss$v[,12])) ## elements of collinear grouping
## [1] 0.0000000 0.0000000 0.0000000 0.0000000 -0.4472136 0.0000000
## [7] 0.0000000 0.0000000 0.4472136 0.4472136 0.4472136 0.4472136
So the sum of columns 9-12 is exactly the same as column 5 (the coefficients have the same magnitude but opposite signs). What's going on here?
colnames(X)[zz!=0]
## [1] "wing" "testfirst:groupC:wing" "testsecond:groupC:wing"
## [4] "testfirst:groupE:wing" "testsecond:groupE:wing"
It looks like we somehow got all of the levels of the test-by-group interaction interacting with wing, along with the wing variable itself ...
mm <- X[,zz!=0]
colnames(mm) <- gsub("(test|group|:wing)","",colnames(mm))
head(mm)
## wing first:C second:C first:E second:E
## 1 75 0 0 75 0
## 2 75 0 0 0 75
## 3 68 0 0 68 0
## 4 75 75 0 0 0
## 5 75 0 75 0 0
## 6 73 73 0 0 0
I'm still not 100% sure why this happens, but you can see that R expands the three-way interaction to include all four levels of the two-way interaction (which in turn interact with the continuous wing variable), while the model matrix also contains wing on its own:
colnames(X)
## [1] "(Intercept)" "testsecond" "groupE"
## [4] "FL" "wing" "testsecond:groupE"
## [7] "groupE:FL" "FL:wing" "testfirst:groupC:wing"
## [10] "testsecond:groupC:wing" "testfirst:groupE:wing" "testsecond:groupE:wing"
colnames(X0)
## [1] "(Intercept)" "testsecond"
## [3] "groupE" "FL"
## [5] "wing" "testsecond:groupE"
## [7] "testsecond:FL" "groupE:FL"
## [9] "testsecond:wing" "groupE:wing"
## [11] "FL:wing" "testsecond:groupE:FL"
## [13] "testsecond:groupE:wing" "testsecond:FL:wing"
## [15] "groupE:FL:wing" "testsecond:groupE:FL:wing"
If we define a model that respects marginality, then we're OK again ...
form3 <- speed_aver~test*group*wing+FL*(group+wing)
X1 <- model.matrix(form3,dd)
c(rankMatrix(X1)== ncol(X1)) ## TRUE
And we can replicate the problem more simply this way:
form4 <- speed_aver~wing+test:group:wing
X2 <- model.matrix(form4,dd)
c(rankMatrix(X2)== ncol(X2)) ## FALSE
This model has the three-way interaction (explicitly), but is missing the two-way interactions it depends on. If we used ~wing*test*group, or even ~wing+wing*test*group, we would be OK ...
form5 <- speed_aver~wing+test*group*wing
X3 <- model.matrix(form5,dd)
c(rankMatrix(X3)== ncol(X3)) ## TRUE
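For anyone who wants to reproduce the aliasing without the bird data, here is a minimal sketch on made-up data. The data frame dd_demo, its values, and the helper full_rank are my own assumptions for illustration, not the original dd. It uses base R's qr() rather than Matrix::rankMatrix() to compare the rank of each model matrix with its number of columns.

```r
# Hypothetical balanced toy data: 2 tests x 2 groups, wing varies within each cell
dd_demo <- data.frame(
  test  = factor(rep(c("first", "second"), times = 10)),
  group = factor(rep(c("C", "E"), each = 10)),
  wing  = rep(c(70, 72, 73, 74, 75), times = 4)
)

# Illustrative helper: TRUE if the model matrix has full column rank
full_rank <- function(form, data) {
  X <- model.matrix(form, data)
  qr(X)$rank == ncol(X)
}

# Three-way interaction without its two-way margins: wing ends up aliased
bad_ok  <- full_rank(~ wing + test:group:wing, dd_demo)
# Same three variables with all lower-order terms present
good_ok <- full_rank(~ wing * test * group, dd_demo)

print(c(violates_marginality = bad_ok, respects_marginality = good_ok))
```

Running a check like this on each candidate formula before handing it to lme would flag the problem early, since lme itself only reports the singularity once fitting has started.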


Finding value of a series in R without for-loop

I am a newbie in R and I found this problem:
Calculate the following sum using R:
1+(2/3)+(2/3)(4/5)+...+(2/3)(4/5)...(38/39)
I was enthusiastic to know how to solve this without using a for loop, and using only vector operations.
My thoughts and what I've tried till now:
Suppose I create two vectors such as
x<-2*(1:19)
y<-2*(1:19)+1
Then, x consists of all the numerators in the question and y has all the denominators. Now
z<-x/y
will create a vector of length 19 holding the values 2/3, 4/5, ..., 38/39.
I was thinking of using the prod function in R to find the required products. So, I created a vector such that
i<-1:19
In hopes of traversing z from the first element to the last, I did write:
prod(z[1:i])
But it failed miserably, giving me the result:
[1] 0.6666667
Warning message:
In 1:i : numerical expression has 19 elements: only the first used
What I wanted to do:
I expected to store the values of (2/3), (2/3)(4/5), ..., (2/3)(4/5)...(38/39) individually in another vector (say p) which will thus have 19 elements in it. I then intend to use the sum function to finally find out the sum of all those...
Where am I stuck:
As described in the R documentation, the prod function returns the product of all the values present in its arguments. So,
prod(z[1:1])
prod(z[1:2])
prod(z[1:3])
will return the values of (2/3), (2/3)(4/5), (2/3)(4/5)(6/7) respectively which it does:
> prod(z[1:1])
[1] 0.6666667
> prod(z[1:2])
[1] 0.5333333
> prod(z[1:3])
[1] 0.4571429
But it's not possible to go on like this and do it for all the 19 elements of the vector z. I am stuck here thinking as to what could be done. I wanted to iterate all the elements of z one-by-one for which I created another vector i as described above, but it didn't go as I had thought. Any help, suggestions, and hints will be really great as to how this can be done. I seem to have run out of ideas here.
More Information:
Here, I am providing with all the outputs in a systematic manner for others to understand my problem better:
> x
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
> y
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
> z
[1] 0.6666667 0.8000000 0.8571429 0.8888889 0.9090909 0.9230769 0.9333333
[8] 0.9411765 0.9473684 0.9523810 0.9565217 0.9600000 0.9629630 0.9655172
[15] 0.9677419 0.9696970 0.9714286 0.9729730 0.9743590
> i
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Short Note (controversial statement ahead): This post would really have benefited from the use of LaTeX, but unfortunately, due to extremely heavy dependencies, as is mentioned in several posts regarding inclusion of LaTeX on Stack Overflow (like this), that is not a thing till now.
You can use cumprod to get the cumulative product of a vector, which is what you are after:
p <- cumprod(z)
p
# [1] 0.6666667 0.5333333 0.4571429 0.4063492 0.3694084 0.3409923 0.3182595
# [8] 0.2995384 0.2837732 0.2702602 0.2585097 0.2481694 0.2389779 0.2307373
# [15] 0.2232941 0.2165276 0.2103411 0.2046562 0.1994087
A less-efficient but more generalized alternative to cumprod would be
p <- sapply(i, function(x) prod(z[1:x]))
Here sapply takes the place of the loop and passes a different ending index for each product.
Then you can do
1 + sum(p)
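Putting the pieces together, here is a small self-contained sketch of the whole calculation, with a plain loop included only to cross-check the vectorized result. The names k, total, and loop_total are mine, chosen for illustration.

```r
# Terms 2/3, 4/5, ..., 38/39 built directly from the index k
k <- 1:19
z <- (2 * k) / (2 * k + 1)

# Running products (2/3), (2/3)(4/5), ... and the requested sum
p <- cumprod(z)
total <- 1 + sum(p)

# Cross-check with an explicit loop (only for verification)
loop_total <- 1
running <- 1
for (term in z) {
  running <- running * term
  loop_total <- loop_total + running
}

print(total)
print(isTRUE(all.equal(total, loop_total)))
```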

How to fix the error "Subscript out of bounds"

I have a question about fixing the error:
"subscript out of bounds".
I am analyzing data of an eye-tracking experiment. You may find example data below:
Stimulus Timebin Language Percentage on AOI
1 11 L1 0.80
1 11 L2 0.60
1 12 L1 0.80
1 12 L2 0.50
1 13 L1 0.83
1 13 L2 0.50
...
10 37 L1 0.00
10 37 L2 0.50
10 38 L1 0.70
10 38 L2 0.50
10 39 L1 0.60
10 39 L2 0.70
10 40 L1 0.75
10 40 L2 0.89
...
I would like to do a growth curve analysis with Language and Timebin as independent variables and the percentage on the area of interest (AOI) as the dependent variable, with Stimulus as a random factor. I have 40 timebins for each stimulus and condition. To avoid potential collinearity, I want to create orthogonal polynomials. The code below was used to create independent (orthogonal) polynomial time terms (linear, quadratic, and cubic).
Gaze_1_Poly <- poly((unique(Gaze_1$timebin)), 3)
Gaze_1[,paste("ot", 1:3, sep="")] <- Gaze_1_Poly[Gaze_1$timebin, 1:3]
I always get an error telling me that a subscript is out of bounds:
Error in Gaze_1_Poly[Gaze_1$timebin, :
subscript out of bounds
So I checked the classes of the variables, and I don't see a problem there:
Stimulus Timebin Language percentage on AOI
"character" "integer" "factor" "numeric"
I can not figure out the reason. Can someone give me a hand?
See comment above. Let me know if this is what you had in mind.
library(dplyr)
Gaze_1 %>%
left_join(data.frame(Timebin = unique(.$Timebin), poly(unique(.$Timebin), degree = 3)),
by = 'Timebin') %>%
setNames(c("Stimulus", "Timebin", "Language", "Percentage on AOI", "ot1", "ot2", "ot3"))
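As for why the original indexing fails, one plausible cause (an assumption on my part, based on the sample data where Timebin starts at 11) is that the raw Timebin values are used as row numbers of a matrix that has only one row per unique timebin. A base-R sketch of that situation and of a match()-based alternative to the join follows; Gaze_demo is a made-up stand-in for Gaze_1.

```r
# Hypothetical stand-in for Gaze_1: timebins start at 11, not at 1
Gaze_demo <- expand.grid(
  Language = c("L1", "L2"),
  Timebin  = 11:16,
  Stimulus = c("1", "2")
)

bins <- sort(unique(Gaze_demo$Timebin))
P <- poly(bins, 3)   # one row per unique timebin: 6 rows here

# Indexing by the raw timebin value asks for rows 11..16 of a 6-row matrix
msg <- tryCatch(P[Gaze_demo$Timebin, 1:3],
                error = function(e) conditionMessage(e))

# match() converts each timebin value into its row position in P
rows <- match(Gaze_demo$Timebin, bins)
Gaze_demo[, paste0("ot", 1:3)] <- P[rows, 1:3]

print(msg)
head(Gaze_demo)
```

If the real timebins do run from 1 to 40 without gaps, this explanation would not apply and the cause would have to be looked for elsewhere.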

Expand a large bipartite network plot to avoid overlapping vertices

I am plotting a bipartite graph using the igraph package in R. There are about 10,000 edges, and I want to expand the width of the whole plot so that the state vertices do not overlap.
My data looks like this:
> test2
user_id state meanlat meanlon countUS countS degState
<chr> <chr> <dbl> <dbl> <int> <int> <int>
1 -_1ctLaz3jhPYc12hKXsEQ NC 35.19401 -80.83235 909 3 18487
2 -_1ctLaz3jhPYc12hKXsEQ NV 36.11559 -115.18042 29 3 37884
3 -_1ctLaz3jhPYc12hKXsEQ SC 35.05108 -80.96166 4 3 665
4 -0wUMy3vgInUD4S6KJInnw IL 40.11227 -88.22955 2 3 1478
5 -0wUMy3vgInUD4S6KJInnw NV 36.11559 -115.18042 23 3 37884
6 -0wUMy3vgInUD4S6KJInnw WI 43.08051 -89.39835 20 3 3963
And below is my code for creating and configuring the graph.
g2 <- graph_from_data_frame(test2,directed = F)
V(g2)$type <- ifelse(names(V(g2)) %in% UserStateR$user_id, 'user', 'state')
V(g2)$label <- ifelse(V(g2)$type == 'user', " ", paste(names(V(g2)),"\n",as.character(test2$degState),sep=""))
V(g2)$size <- ifelse(V(g2)$type == 'user', 3, 20)
V(g2)$color <- ifelse(V(g2)$type == 'user', 'wheat', 'salmon')
V(g2)$type <- ifelse(names(V(g2)) %in% UserStateR$user_id, T, F )
E(g2)$color <- heat.colors(8)[test2$countS]
plot(g2,layout=layout.bipartite(g2, types = names(V(g2)) %in% UserStateR$state, hgap = 50, vgap = 50))
As you can see, I have tried changing the hgap and vgap arguments, but that apparently doesn't work. I have also tried the asp argument, but that is not what I want.
I know this might be too late for #floatsd but I was struggling with this today and had a really hard time finding an answer, so this might help others out.
First, in general, there is an argument to plot.igraph called asp that very simply controls how rectangular your plot is. Simply do
l=layout.bipartite(CCM_net)
plot(CCM_net, layout=l, asp=0.65)
for a wide plot. asp smaller than 1 gives you a wide plot, asp larger than 1 a tall plot.
However, this might still not give you the layout you want. The bipartite command basically generates a matrix with coordinates for your vertices, and I actually don't understand yet how it comes up with the x-coordinates, so I ended up changing them myself.
Below the example (I am assuming you know how to turn your data into data frames with the edge list and edge/vertex attributes for making graphs so am skipping that).
My data is CCM_data_sign and is
from to value
2 EVI MAXT 0.67
4 EVI MINT 0.81
5 EVI P 0.70
7 EVI SM 0.79
8 EVI AMO 0.86
11 MAXT EVI 0.81
18 MAXT AMO 0.84
21 MEANT EVI 0.88
28 MEANT AMO 0.83
29 MEANT PDO 0.71
31 MINT EVI 0.96
39 MINT PDO 0.78
40 MINT MEI 0.66
41 P EVI 0.91
49 P PDO 0.77
50 P MEI 0.71
51 PET EVI 0.90
58 PET AMO 0.89
59 PET PDO 0.70
61 SM EVI 0.94
68 SM AMO 0.90
69 SM PDO 0.81
70 SM MEI 0.73
74 AMO MINT 0.93
76 AMO PET 0.66
79 AMO PDO 0.71
80 AMO MEI 0.83
90 PDO MEI 0.82
The graph I generated from it for plotting is called CCM_net.
First a bipartite plot without any layout adjustments
V(CCM_net)$size<-30
l=layout.bipartite(CCM_net)
plot(CCM_net,
layout=l,
edge.arrow.size=1,
edge.arrow.width=2,
vertex.label.family="Helvetica",
vertex.label.color="black",
vertex.label.cex=2,
vertex.label.dist=c(3,3,3,3,3,3,3,3,3,3,3),
vertex.label.degree=c(pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,pi/2,pi/2,pi/2), #0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
edge.lty=1)
This gives you the following
If I use asp I get the following
plot(CCM_net,
layout=l,
edge.arrow.size=1,
vertex.label.family="Helvetica",
vertex.label.color="black",
vertex.label.cex=2,
vertex.label.dist=c(3,3,3,3,3,3,3,3,3,3,3),
vertex.label.degree=c(pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,pi/2,pi/2,pi/2), #0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
edge.arrow.width=2,
edge.lty=1,
asp=0.6) # controls how rectangular the plot is. <1 = wide, >1 = tall
dev.off()
This is looking better, but still not really what I want - see how some vertices are closer to each other than others?
So eventually I took the following approach. Setting the coordinates as bipartite looks like this
coords <- layout_as_bipartite(CCM_net)
coords
[,1] [,2]
[1,] 3.0 0
[2,] 0.0 1
[3,] 2.0 1
[4,] 3.5 1
[5,] 6.0 1
[6,] 1.0 1
[7,] 5.0 1
[8,] 7.0 1
[9,] 1.0 0
[10,] 4.5 0
[11,] 5.5 0
This matrix shows the x coordinates of your vertices in the first column and the y coordinates in the second column, ordered according to your list with names. My list with names is
id name
1 EVI EVI
2 MAXT MAXT
3 MEANT MEANT
4 MINT MINT
5 P P
6 PET PET
7 SM SM
8 SR SR
9 AMO AMO
10 PDO PDO
11 MEI MEI
In my graph, EVI, AMO, PDO and MEI are on the bottom, but note their x coordinates: 3.0, 1.0, 4.5 and 5.5. I haven't figured out yet how the code comes up with that, but I don't like it so I simply changed it.
coords[,1]=c(2,0,4,8,12,16,20,24,9,16,24)
Now the plotting code (also with asp) and the output becomes
plot(CCM_net,
layout=coords,
edge.arrow.size=1,
vertex.label.family="Helvetica",
vertex.label.color="black",
vertex.label.cex=1,
vertex.label.dist=c(4,4,4,4,4,4,4,4,4,4,4),
vertex.label.degree=c(pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,pi/2,pi/2,pi/2), #0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
edge.arrow.width=2,
edge.lty=1,
asp=0.6) # controls how rectangular the plot is. <1 = wide, >1 = tall
Now the vertices are nicely spaced in a rectangular plot!
Note - I also decreased the size of the vertices, the size of the labels and their positioning, for better readability.
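If typing the x coordinates by hand gets tedious for larger graphs, the same idea can be automated. The sketch below is not what I did above; it is an alternative under my own assumptions: spread_rows is a hypothetical helper, and width = 24 is an arbitrary choice. It keeps the left-to-right order within each row of the layout and spaces the vertices evenly.

```r
# Layout matrix as returned above: column 1 = x, column 2 = y
coords <- cbind(
  c(3, 0, 2, 3.5, 6, 1, 5, 7, 1, 4.5, 5.5),
  c(0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
)

# Hypothetical helper: evenly space the vertices of each row over [0, width],
# preserving their existing left-to-right order
spread_rows <- function(coords, width = 24) {
  out <- coords
  for (yy in unique(coords[, 2])) {
    idx <- which(coords[, 2] == yy)
    pos <- rank(coords[idx, 1], ties.method = "first")
    n <- length(idx)
    out[idx, 1] <- if (n == 1) width / 2 else (pos - 1) * width / (n - 1)
  }
  out
}

new_coords <- spread_rows(coords)
print(new_coords)
```

The result can be passed to the plotting call in place of the hand-edited matrix, for example as layout=new_coords together with asp=0.6.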
I think you can output to PDF and then zoom in.
Or use the rgexf package to output a GEXF file, then visualize it in Gephi.
I think Gephi is a good tool for network visualization.

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance, but for ecology (a cross-section of the population of any kind of wild fauna), so essentially leaving out variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass=c(1,2,3,4,5,6)
Sample=c(100,99,87,46,32,19)
# proportion surviving to each age class, relative to the initial 100
for(i in 1:6){
  PropSurv=c(Sample/100)
}
LifeTab1=data.frame(cbind(AgeClass,Sample,PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each interval (DeathInt) by taking the number of survivors in each row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so on). The result should look like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a numeric vector x, you can calculate the differences between consecutive elements with the diff function. diff(x) returns x[i+1] - x[i], so negate it and pad the last row with NA.
In your case it would be
LifeTab1$DeathInt <- c(-diff(Sample), NA)
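Regarding the side note about the for() loop: arithmetic on vectors is already element-wise in R, so the loop is not needed. A minimal sketch of the whole table, using the numbers from the question (the name LifeTab is mine, and dividing by the first age class instead of a hard-coded 100 is my own choice):

```r
AgeClass <- 1:6
Sample   <- c(100, 99, 87, 46, 32, 19)

LifeTab <- data.frame(AgeClass, Sample)

# Proportion surviving relative to the first age class; no loop required
LifeTab$PropSurv <- LifeTab$Sample / LifeTab$Sample[1]

# Deaths in each interval: current row minus the row below; last row has no successor
LifeTab$DeathInt <- c(-diff(LifeTab$Sample), NA)

print(LifeTab)
```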

Sum of random poisson numbers across data frame

In a simulation I am programming, in each iteration I need to compute the sum of some random poisson numbers for each row in a data frame where the parameters are stored in another column of that row.
Here's a sample of the data (called 'studies' in the code below):
phase Sites enroll_rate rec_months stud_months enrolled m_enroll
51 2 1 2.95920139 2.0000000 5.000000 6 0
52 2 24 0.20784867 2.0000000 5.000000 10 0
53 2 3 0.46501736 3.0000000 6.000000 2 0
54 2 2 1.40480769 3.0000000 6.000000 7 0
55 2 1 1.31299020 5.0000000 7.000000 3 0
64 2 29 0.04373204 0.9712526 1.971253 2 0
And here's the code I've been using to achieve this:
for (j in 1:nrow(studies)) {
studies$m_enroll[j] <- sum(rpois(studies$Sites[j],studies$enroll_rate[j]))
}
This does the job, but given that the data frame is hundreds of rows and I'm doing this simulation tens of thousands of times, it is quite inefficient.
I feel like there's a way to do this using one of the apply functions, but my experience with them is limited. Any ideas?
studies <- studies[rep(1:6,3000),]
system.time(for (j in 1:nrow(studies)){studies$m_enroll[j] <-
sum(rpois(studies$Sites[j],studies$enroll_rate[j]))})
user system elapsed
105.74 0.00 106.30
system.time(test <- sapply(1:nrow(studies),function(x)
sum(rpois(studies$Sites[x],studies$enroll_rate[x]))))
user system elapsed
0.36 0.00 0.36
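It may be possible to go further and drop the per-row function call entirely. The sum of n independent Poisson draws with rate r is itself Poisson with rate n * r, so a single vectorized rpois call gives totals with the same distribution (though not the same random numbers for a given seed). A sketch, assuming all sites within a study share the study's enroll_rate as in the question; studies_demo is a made-up stand-in for the first rows of studies.

```r
set.seed(42)  # only so that the sketch is reproducible

# Hypothetical stand-in for the first rows of 'studies'
studies_demo <- data.frame(
  Sites       = c(1, 24, 3, 2, 1, 29),
  enroll_rate = c(2.95920139, 0.20784867, 0.46501736,
                  1.40480769, 1.31299020, 0.04373204)
)

# Row-wise version without an explicit loop: one draw per site, then summed
m_mapply <- mapply(function(n, r) sum(rpois(n, r)),
                   studies_demo$Sites, studies_demo$enroll_rate)

# Fully vectorized: one Poisson(Sites * enroll_rate) draw per study
m_vec <- rpois(nrow(studies_demo),
               studies_demo$Sites * studies_demo$enroll_rate)

print(m_mapply)
print(m_vec)
```

In the simulation this would become studies$m_enroll <- rpois(nrow(studies), studies$Sites * studies$enroll_rate), which avoids both the loop and the repeated single-element assignment into the data frame.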
