Plotting nested data - Multilevel regression with categorical predictors - R

I'm writing my master's thesis and I'm stuck with the complexity of my data, so I'd like to plot it to see what's in there.
My data frame looks like this: I have 333 perceivers (PID) who each rated 60 target photos (TID), resulting in 19980 rows. Each perceiver (PID) rated every target's photo on how likeable the target is (Rating) and provided multiple self-reports about themselves (SDO_mean, KSA_mean, threats_overall).
The photos were either of photo type A (Dwithin = 0) or type B (Dwithin = 1), which is my within-subject factor, as every perceiver saw all photos. In addition, perceivers were assigned to one of two between-subject conditions (Dbetween): all photos (TID) of type B (Dwithin = 1) were labeled either as people with a migration background (Dbetween = 0) or as refugees (Dbetween = 1).
This results in a nested design where the Ratings are nested in the PID and also in the TID. My data look like this:
TID PID Dwithin Dbetween Rating SDO_mean KSA_mean threats_overall
1 1 0 0 5 3.1 2.3 2.2
2 1 1 0 2 3.1 2.3 2.2
3 1 0 0 5 3.1 2.3 2.2
4 1 1 0 1 3.1 2.3 2.2
5 1 0 0 3 3.1 2.3 2.2
6 1 1 0 3 3.1 2.3 2.2
Now I want to predict the likeability rating mainly from the categorical variables Dwithin and Dbetween. As Dbetween can only be interpreted in interaction with Dwithin (because the label applied only to Dwithin = 1 targets), the formula would be:
model1 <- lmer(Rating~1+Dwithin+Dbetween+Dwithin*Dbetween+(1+Dwithin|PID)+(1|TID),data=df)
Now I want to plot the data that go into this regression. One option could be to plot the Rating separately for each Dwithin/Dbetween condition, or to plot the regression as specified in the model1 formula. But as these are categorical predictors, I didn't manage to plot the data in the right way. I looked into the lattice package but couldn't apply it to my data. Could anyone help me plot this? Thanks a lot in advance!
@SASpencer: I thought, for example, of something like this. But my y-scale isn't continuous; it only has integer values from 1 to 5. The combination of Dwithin and Dbetween could also be interesting (so like in your plot).
Here is a reproducible example:
mysamp <- function(n, m, s, lwr, upr, nnorm) {
  # draw nnorm normal values, keep those inside [lwr, upr], then sample n of them
  set.seed(1)
  samp <- rnorm(nnorm, m, s)
  samp <- samp[samp >= lwr & samp <= upr]
  if (length(samp) >= n) {
    return(sample(samp, n))
  }
}
options(digits=2)
TID <- rep(1:60, times=333)
PID <- rep(1:333,each=60)
Dwithin <- rep(rep(0:1, times=19980/2))
Dbetween <- rep(rep(0:1, each=60),times=333)[1:19980]
Rating <- floor(runif(19980, min=1, max=6))
SDO_mean <- rep(mysamp(n=333, m=4, s=2.5, lwr=1, upr=5, nnorm=1000000), each=60)
KSA_mean <- rep(mysamp(n=333, m=2, s=0.8, lwr=1, upr=5, nnorm=1000000), each=60)
threats_overall <- rep(mysamp(n=333, m=3, s=1.5, lwr=1, upr=5, nnorm=1000), each=60)
df <- data.frame(TID,PID,Dwithin,Dbetween, Rating, SDO_mean, KSA_mean, threats_overall)
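For a first look at the raw data, here is a minimal plotting sketch (assuming ggplot2; the jittered points and the red mean markers are just illustrative choices, not part of the original code) that shows the Ratings separately for each Dwithin/Dbetween combination:

library(ggplot2)

ggplot(df, aes(x = factor(Dwithin), y = Rating)) +
  geom_jitter(width = 0.2, height = 0.1, alpha = 0.05) +                 # raw integer ratings, jittered
  stat_summary(fun = mean, geom = "point", size = 3, colour = "red") +   # mean rating per cell
  facet_wrap(~ Dbetween, labeller = label_both) +
  labs(x = "Dwithin (photo type)", y = "Rating (1-5)")

Because the ratings only take integer values from 1 to 5, jittering (or first aggregating to one mean rating per PID per cell) keeps the points from stacking on top of each other.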

Related

How to loop a function through groups of data?

I have code that calculates a Kaplan-Meier product-limit estimate of the mean and SD:
km_mean <- function(x, nd) {
  library(tidyverse)
  # first remove any missing data
  df <- tibble(x, nd) %>% filter(!is.na(x))
  x <- df %>% pull(x); nd <- df %>% pull(nd)
  # handle cases of all detects or all nondetects; in these situations, no Kaplan-Meier
  # estimate is possible or necessary; instead treat all detects as actual concentration estimates
  # and all NDs as imputed at half their reporting limits
  if (all(nd == 0)) return(tibble(mean = mean(x), sd = sd(x)))
  if (all(nd == 1)) return(tibble(mean = mean(x/2), sd = sd(x/2)))
  # for cases with mixed detects and NDs, table by nd status;
  # determine unique x values; first subtract epsilon from each nondetect to associate
  # a larger rank with detects tied with NDs at the same reporting limits
  eps <- 1e-6
  x <- x - nd*eps
  nn <- nlevels(factor(x))
  # determine number of at-risk values; build Kaplan-Meier CDF and survival function;
  # note: need to augment and adjust <tab> for the calculation below to work correctly
  km.lev <- as.numeric(levels(factor(x)))
  xa <- c(x, max(x) + 1); nda <- c(nd, 0)
  tab <- table(xa, nda)
  tab[nn + 1, 1] <- 0
  km.rsk <- cumsum(tab[, 1] + tab[, 2])
  km.cdf <- rev(cumprod(1 - rev(tab[, 1])/rev(km.rsk)))[-1]
  names(km.cdf) <- as.character(km.lev)
  km.surv <- 1 - km.cdf
  km.out <- tibble(km.lev, km.rsk = km.rsk[-length(km.rsk)], km.cdf, km.surv)
  row.names(km.out) <- NULL
  # estimate adjusted mean and SD
  xm <- km.lev[1] + sum(diff(km.lev)*km.surv[-length(km.surv)])
  dif <- diff(c(0, km.cdf))
  xsd <- sqrt(sum(dif*(km.lev - xm)^2))
  names(xm) <- NULL; names(xsd) <- NULL
  tibble(mean = xm, sd = xsd)
}
My data has three columns, a sample-ID, value (x), and detect/non-detect flag (nd).
a1 0.23 0
a1 2.3 0
a1 1.6 0
a2 3.0 1
a2 3.1 0
a2 2.76 0
How can I adapt the function to run on all a1 samples as a group, then a2, etc.?
I've tried group_by commands, but can't seem to break through.
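Not knowing the exact column names, here is a minimal sketch (assuming the data frame is called dat and its columns are id, x, and nd; those names are placeholders) of running km_mean() once per sample ID with dplyr:

library(dplyr)

results <- dat %>%
  group_by(id) %>%
  group_modify(~ km_mean(.x$x, .x$nd)) %>%   # km_mean() returns a one-row tibble per group
  ungroup()

results   # one row per sample ID, with the mean and sd returned by km_mean()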

Why do results of matching depend on order of data (MatchIt package)?

When using the matchit() function for full matching, the results depend on the order of the input data frame: if the order of the data is changed, the results change, too. This is surprising because, in my understanding, optimal full matching should yield a single best solution.
Am I missing something or is this an error?
Similar differences occur with the optimal algorithm.
Below is a reproducible example. The subclasses should be identical for the two data sets, but they are not.
Thank you for your help!
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching (requires the MatchIt package)
library(MatchIt)
model_oldorder <- matchit(group~x1, data=df, method="full", distance="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument
that fullmatch has. The matching algorithm requires integer distances,
so we have to scale then truncate floating point distances. For a
given set of integer distances, there may be multiple matchings that
achieve the minimum, so the solver is free to pick among these
non-unique solutions.
Developing your example a little more:
library(optmatch)
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50), rep(1, 50))
df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df_oldorder[order(outcome),] # re-order data.frame
glm_oldorder <- match_on(glm(group~x1, data=df_oldorder), data = df_oldorder)
glm_neworder <- match_on(glm(group~x1, data=df_neworder), data = df_neworder)
fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
fm_new <- fullmatch(glm_neworder, data=df_neworder)
mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
## 0.06216174
mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.062058
mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
  mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may require increased run time, in order to get closer to the true floating-point minimum. We found 0.001 seemed to work well in practice, but there is nothing special about this value.
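Following that advice, a minimal sketch of tightening the tolerance with fullmatch() directly (reusing the objects from the developers' example above; tol is an existing fullmatch() argument, and whether your MatchIt version forwards it through matchit(..., method = "full") is worth checking in its documentation):

library(optmatch)

# tighter tolerance: closer to the true floating-point optimum, at the cost of run time
fm_old_tight <- fullmatch(glm_oldorder, data = df_oldorder, tol = 1e-6)
fm_new_tight <- fullmatch(glm_neworder, data = df_neworder, tol = 1e-6)

# the mean matched distances for the two orderings should now agree much more closely
mean(sapply(matched.distances(fm_old_tight, glm_oldorder), mean))
mean(sapply(matched.distances(fm_new_tight, glm_neworder), mean))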

Resample a time series in the frequency domain (FFT)

I'm trying to implement the "synthesis equation" from the DSP Guide, equation 8-2, so I can resample a time series in the frequency domain. The way I read the equation, N is the number of output points, and given the loop of k from 0 to N/2, I can only resample to twice the original sampling rate, at most.
I tried writing a quick implementation in R, but the results are not anything close to what I expect. My code:
input <- c(1:9)
nin <- 9
nout <- 17
b <- fft(input)
reals <- Re(b) / (nout / 2)
imags <- Im(b) / (nout / 2)
reals[1] <- reals[1] / 2
reals[(nout/2)] <- reals[(nout/2)] / 2
output <- c(1:nout)
for (i in 1:nout)
{
  realSum <- 0
  imagSum <- 0
  for (k in 1:(nout/2))
  {
    angle <- 2 * pi * (k-1) * (i-1) / nout
    realSum <- realSum + (reals[k] * cos(angle))
    imagSum <- imagSum - (imags[k] * sin(angle))
  }
  output[i] <- (realSum + imagSum)
}
For my input (sampled at say 1 second, resampling to 0.5 second)
[1] 1 2 3 4 5 6 7 8 9
I get the output
 [1] -0.7941176 1.5150954 0.7462716 1.5022387 1.6478971 1.8357487 2.4029773
 [8]  2.1965426 3.1585254 2.6178195 3.7284660 3.3721128 3.8433588 4.6390705
[15]  3.4699088 6.3005605 2.8175240
while my expected output is
[1] 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9
What am I doing wrong?
The code above does look like the inverse DFT from the article you referenced. But judging from your expected input/output, what you actually need is a cheap linear interpolation rather than a DFT/inverse-DFT solution: you have a time-series input and want a time-series output.
An inverse DFT, which is what you coded, goes from the frequency domain back to the time domain. If you do want to go the DFT route, start with the series 1 to 9, do a forward DFT (to get to the frequency domain), then take that result into an inverse DFT (to get back to the time domain).
Hopefully this is a hint in the right direction for you.
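For what it's worth, the linear-interpolation route is essentially a one-liner in base R (a sketch assuming the same 9-point input and 17 output points):

input <- 1:9
nout <- 17

# interpolate onto an evenly spaced grid of nout points across the original index range
output <- approx(x = seq_along(input), y = input, n = nout)$y
output
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0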

R: too long computation of likelihood function for conditional logit model

I am trying to maximize a log-likelihood function to get coefficients for a conditional logit model. I have a big data frame with about 9M rows (300k choice sets) and about 40 parameters to be estimated. It looks like this:
ChoiceSet Choice SKU Price Caramel etc.
1 1 1234 1.0 1 ...
1 0 145 2.0 1 ...
1 0 5233 2.0 0 ...
2 0 1432 1.5 1 ...
2 0 5233 2.0 0 ...
2 1 8320 2.0 0 ...
3 0 1234 1.5 1 ...
3 1 145 1.0 1 ...
3 0 8320 1.0 0 ...
where ChoiceSet is the set of products available in the store at the moment of purchase and Choice = 1 when that SKU is chosen.
Since the choice sets vary, I use the following log-likelihood function:
# requires data.table and foreach, with a parallel backend registered for %dopar%
clogit.ll <- function(beta, X) { #### This is the function to be maximized
  X <- as.data.table(X)
  setkey(X, ChoiceSet, Choice)
  sum((as.matrix(X[J(t(as.vector(unique(X[,1,with=F]))),1), 3:ncol(X), with=F])) %*% beta) -
    sum(foreach(chset=unique(X[,list(ChoiceSet)])$ChoiceSet, .combine='c', .packages='data.table') %dopar% {
      Z <- as.matrix(X[J(chset,0:1), 3:ncol(X), with=F])
      Zb <- Z %*% beta
      e <- exp(Zb)
      log(sum(e))
    })
}
I create a new data frame without SKU (it isn't needed) and a vector of zero starting values:
X0 <- Data[,-3]
b0 <- rep(0,ncol(X0)-2)
I maximize this function with the help of the maxLik package, supplying a gradient to make the calculation faster:
grad.clogit.ll <- function(beta, X) { ### gradient of the log-likelihood function
  X <- as.data.table(X)
  setkey(X, ChoiceSet, Choice)
  colSums(foreach(chset=unique(X[,list(ChoiceSet)])$ChoiceSet, .combine='rbind', .packages='data.table') %dopar% {
    Z <- as.matrix(X[J(chset,0:1), 3:ncol(X), with=F])
    Zb <- Z %*% beta
    e <- exp(Zb)
    as.vector(X[J(chset,1), 3:ncol(X), with=F] - t(as.vector(X[J(chset,0:1), 3:ncol(X), with=F])) %*% (e/sum(e)))
  })
}
The maximization call is the following:
fit <- maxLik(logLik = clogit.ll, grad = grad.clogit.ll, start=b0, X=X0, method="NR", tol=10^(-6), iterlim=100)
Generally, it works fine for small samples, but it takes too long for big ones:
Number of Choice sets Duration of computation
300 4.5min
400 10.5min
1000 25min
But when I run it for 5000+ choice sets, R terminates the session.
So (if you are still reading), how can I maximize this function if I have 300,000+ choice sets and 1.5 weeks to finish my course work? Please help, I have no idea.
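As a possible starting point: the per-choice-set foreach loop is where almost all the time goes, and the whole log-likelihood can instead be computed with a single matrix product plus one grouped sum. A minimal sketch (assuming, as with X0 above, that X is a plain data frame whose first two columns are ChoiceSet and Choice and whose remaining columns are the predictors):

clogit.ll.fast <- function(beta, X) {
  Z  <- as.matrix(X[, -(1:2)])   # predictor columns only
  Zb <- drop(Z %*% beta)         # linear predictor for every alternative
  # utilities of the chosen alternatives minus the log-sum-exp term of each choice set
  sum(Zb[X$Choice == 1]) - sum(log(rowsum(exp(Zb), X$ChoiceSet)))
}

The gradient can be vectorized in the same way. Alternatively, packages such as mlogit or survival::clogit fit conditional logit models in compiled code and may handle a data set of this size better than a hand-rolled maxLik setup.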

JAGS Runtime Error: Cannot insert node into X[ ]. Dimension Mismatch

I'm trying to add a bit of code to a data-augmentation capture-recapture model and am running into some errors I haven't encountered before. In short, I want to estimate a series of survivorship phases that each last more than a single time interval. I want the model to estimate the length of each survivorship phase and use that to improve the capture-recapture model. I tried and failed with a few different approaches, and am now trying to accomplish this using a switching state array for the survivorship phases:
for (t in 1:(n.occasions-1)){
  phi1switch[t] ~ dunif(0,1)
  phi2switch[t] ~ dunif(0,1)
  phi3switch[t] ~ dunif(0,1)
  phi4switch[t] ~ dunif(0,1)
  psphi[1,t,1] <- 1-phi1switch[t]
  psphi[1,t,2] <- phi1switch[t]
  psphi[1,t,3] <- 0
  psphi[1,t,4] <- 0
  psphi[1,t,5] <- 0
  psphi[2,t,1] <- 0
  psphi[2,t,2] <- 1-phi2switch[t]
  psphi[2,t,3] <- phi2switch[t]
  psphi[2,t,4] <- 0
  psphi[2,t,5] <- 0
  psphi[3,t,1] <- 0
  psphi[3,t,2] <- 0
  psphi[3,t,3] <- 1-phi3switch[t]
  psphi[3,t,4] <- phi3switch[t]
  psphi[3,t,5] <- 0
  psphi[4,t,1] <- 0
  psphi[4,t,2] <- 0
  psphi[4,t,3] <- 0
  psphi[4,t,4] <- 1-phi4switch[t]
  psphi[4,t,5] <- phi4switch[t]
  psphi[5,t,1] <- 0
  psphi[5,t,2] <- 0
  psphi[5,t,3] <- 0
  psphi[5,t,4] <- 0
  psphi[5,t,5] <- 1
}
So this creates a [5,t,5] array where the survivorship state can only switch to the subsequent state and not backwards (e.g. 1 to 2, 4 to 5, but not 4 to 3). Now I create a vector where the survivorship state is defined:
PhiState[1] <- 1
for (t in 2:(n.occasions-1)){
  # State process: draw PhiState(t) given PhiState(t-1)
  PhiState[t] ~ dcat(psphi[PhiState[t-1], t-1,])
}
We start in state 1 always, and then take a categorical draw at each time step 't' for remaining in the current state or moving on to the next one given the probabilities within the array. I want a maximum of 5 states (assuming that the model will be able to functionally produce fewer by estimating the probability of moving from state 3 to 4 and onwards near 0, or making the survivorship value of subsequent states the same or similar if they belong to the same survivorship value in reality). So I create 5 hierarchical survival probabilities:
for (a in 1:5){
  mean.phi[a] ~ dunif(0,1)
  phi.tau[a] <- pow(phi_sigma[a],-2)
  phi.sigma[a] ~ dunif(0,20)
}
Now this next step is where the errors start. Now that I've assigned values 1-5 to my PhiState vector it should look something like this:
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 5
or maybe
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
and I now want to assign a mean.phi[] to my actual phi[] term, which feeds into the model:
for(t in 1:(n.occasions-1)){
  phi[t] ~ dnorm(mean.phi[PhiState[t]],phi.tau[PhiState[t]])
}
However, when I try to run this I get the following error:
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Cannot insert node into mean.phi[1:5]. Dimension mismatch
It's worth noting that the model works just fine when I use the following phi[] determinations:
phi[t] ~ dunif(0,1) #estimate independent annual phi's
or
phi[t] ~ dnorm(mean.phi,phi_tau) #estimate hierarchical phi's from a single mean.phi
or
#Set fixed survival periods (this works the best, but I don't want to have to tell it when
#the periods start/end and how many there are, hence the current exercise):
for (a in 1:21){
  surv[a] ~ dnorm(mean.phi1,phi1_tau)
}
for (b in 22:30){
  surv[b] ~ dnorm(mean.phi2,phi2_tau)
}
for (t in 1:(n.occasions-1)){
  phi[t] <- surv[t]
}
I did read this post: https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/36c48f25/
but I don't see where I'm redefining variables in this case... Any help fixing this or advice on a better approach would be most welcome!
Many thanks,
Josh
I'm a bit confused as to what your actual data are (the phi[t]?), but the following might give you a starting point:
nt <- 29
nstate <- 5
M <- function() {
  phi_state[1] <- 1
  for (t in 2:nt) {
    up[t-1] ~ dbern(p[t-1])
    p[t-1] <- ifelse(phi_state[t-1]==nstate, 0, p_[t-1])
    p_[t-1] ~ dunif(0, 1)
    phi_state[t] <- phi_state[t-1] + equals(up[t-1], 1)
  }
  for (k in 1:nstate) {
    mean_phi[k] ~ dunif(0, 1)
    phi_sigma[k] ~ dunif(0, 20)
  }
  for (t in 1:(nt-1)) {
    phi[t] ~ dnorm(mean_phi[phi_state[t]], phi_sigma[phi_state[t]]^-2)
  }
}
library(R2jags)
fit <- jags(list(nt=nt, nstate=nstate), NULL,
            c('phi_state', 'phi', 'mean_phi', 'phi_sigma', 'p'),
            M, DIC=FALSE)
Note that above, p is a vector of probabilities of moving up to the next (adjacent) state.
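As a quick check of what the sampler produces (a sketch assuming the fit object from above; BUGSoutput$summary is the standard R2jags summary matrix):

# posterior summaries for the latent state sequence and the state-specific means
summ <- fit$BUGSoutput$summary
round(summ[grep("phi_state|mean_phi", rownames(summ)), c("mean", "sd", "2.5%", "97.5%")], 2)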
