I am trying to maximize the log-likelihood function to get the coefficients of a conditional logit model. I have a big data frame with about 9M rows (300k choice sets) and about 40 parameters to estimate. It looks like this:
ChoiceSet Choice  SKU  Price Caramel etc.
        1      1 1234    1.0       1 ...
        1      0  145    2.0       1 ...
        1      0 5233    2.0       0 ...
        2      0 1432    1.5       1 ...
        2      0 5233    2.0       0 ...
        2      1 8320    2.0       0 ...
        3      0 1234    1.5       1 ...
        3      1  145    1.0       1 ...
        3      0 8320    1.0       0 ...
Where ChoiceSet is the set of products available in the store at the moment of purchase and Choice = 1 when the SKU is chosen.
Since the choice sets vary in composition, I use this log-likelihood function:
clogit.ll <- function(beta, X) { #### This is the function to be maximized
  X <- as.data.table(X)
  setkey(X, ChoiceSet, Choice)
  # sum of x'beta over the chosen alternatives ...
  sum((as.matrix(X[J(t(as.vector(unique(X[, 1, with = F]))), 1), 3:ncol(X), with = F])) %*% beta) -
    # ... minus the log-sum-exp over each choice set
    sum(foreach(chset = unique(X[, list(ChoiceSet)])$ChoiceSet,
                .combine = 'c', .packages = 'data.table') %dopar% {
      Z  <- as.matrix(X[J(chset, 0:1), 3:ncol(X), with = F])
      Zb <- Z %*% beta
      e  <- exp(Zb)
      log(sum(e))
    })
}
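One setup note: for %dopar% to actually run in parallel, a backend has to be registered, and the optimization call further down needs the maxLik package. A minimal sketch of the assumed setup, using doParallel (one of several possible backends):
library(data.table)
library(foreach)
library(doParallel)   # doMC, doSNOW, etc. would also work
library(maxLik)
cl <- makeCluster(detectCores() - 1)  # leave one core free
registerDoParallel(cl)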
Next I create a new data frame without SKU (it isn't needed) and a zero starting vector:
X0 <- Data[,-3]
b0 <- rep(0,ncol(X0)-2)
I maximize this function with the help of the maxLik package, supplying an analytic gradient to speed up the calculation:
grad.clogit.ll <- function(beta, X) { ### This is the gradient of the log-likelihood function
  X <- as.data.table(X)
  setkey(X, ChoiceSet, Choice)
  colSums(foreach(chset = unique(X[, list(ChoiceSet)])$ChoiceSet,
                  .combine = 'rbind', .packages = 'data.table') %dopar% {
    Z  <- as.matrix(X[J(chset, 0:1), 3:ncol(X), with = F])
    Zb <- Z %*% beta
    e  <- exp(Zb)
    # chosen covariates minus the probability-weighted covariates of the whole set
    as.vector(X[J(chset, 1), 3:ncol(X), with = F] -
                t(as.vector(X[J(chset, 0:1), 3:ncol(X), with = F])) %*% (e / sum(e)))
  })
}
The maximization call is:
fit <- maxLik(logLik = clogit.ll, grad = grad.clogit.ll, start=b0, X=X0, method="NR", tol=10^(-6), iterlim=100)
It works fine for small samples, but takes too long for big ones:

Number of choice sets   Duration of computation
300                     4.5 min
400                     10.5 min
1000                    25 min
But when I run it on 5,000+ choice sets, R terminates the session.
So (if you are still reading) how can I maximize this function when I have 300,000+ choice sets and 1.5 weeks to finish my coursework? Please help, I have no idea.
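For reference, here is a minimal sketch of a fully vectorized log-likelihood (my own rewrite; the function name clogit.ll.vec is hypothetical). One matrix product over all rows plus one grouped log-sum-exp typically beats a per-set foreach loop by orders of magnitude. Note also that packages such as mlogit, or clogit() in survival, fit this model directly.
library(data.table)
clogit.ll.vec <- function(beta, X) {
  X <- as.data.table(X)
  covs <- setdiff(names(X), c("ChoiceSet", "Choice"))
  # utilities of all alternatives in one matrix product
  zb <- as.vector(as.matrix(X[, covs, with = FALSE]) %*% beta)
  # chosen utilities minus the per-choice-set log-sum-exp
  lse <- data.table(set = X$ChoiceSet, zb = zb)[, log(sum(exp(zb))), by = set]
  sum(zb[X$Choice == 1]) - sum(lse$V1)
}
The gradient can be vectorized along the same lines, computing the per-set choice probabilities with one grouped operation instead of a loop.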
I'm writing my master's thesis and I'm stuck with the complexity of my data, so I'd like to plot the data to see what's in there.
My data frame looks like this: I have 333 perceivers (PID) who each rated 60 target photos (TID), resulting in 19,980 rows. Each perceiver (PID) rated every target's photo on how likeable it is (Rating) and provided several self-reports (SDO_mean, KSA_mean, threats_overall).
The photos were either of photo type A (Dwithin = 0) or type B (Dwithin = 1), which is my within-subject factor, as every perceiver saw all photos. In addition, perceivers were assigned to one of two between-subject conditions (Dbetween): all photos (TID) of type B (Dwithin = 1) were labeled either as people with a migration background (Dbetween = 0) or as refugees (Dbetween = 1).
This results in a nested design where the ratings are nested in PID and also in TID. My data look like this:
TID PID Dwithin Dbetween Rating SDO_mean KSA_mean threats_overall
  1   1       0        0      5      3.1      2.3             2.2
  2   1       1        0      2      3.1      2.3             2.2
  3   1       0        0      5      3.1      2.3             2.2
  4   1       1        0      1      3.1      2.3             2.2
  5   1       0        0      3      3.1      2.3             2.2
  6   1       1        0      3      3.1      2.3             2.2
Now I want to predict the likeability rating mainly from the categorical variables Dwithin and Dbetween. As Dbetween can only be interpreted as an interaction of Dwithin*Dbetween (because the label applied only to Dwithin = 1 targets), the formula would be:
model1 <- lmer(Rating~1+Dwithin+Dbetween+Dwithin*Dbetween+(1+Dwithin|PID)+(1|TID),data=df)
Now I want to plot the data that go into this regression. One option could be to plot the rating separately for each Dwithin/Dbetween condition, or to plot the regression as specified in the model1 formula. But as these are categorical predictors, I didn't manage to plot the data the right way. I looked into the lattice package but couldn't apply it to my data. Is there anyone who could help me plot it? Thanks a lot in advance!
@SASpencer: I thought, for example, of something like this. But my y-scale isn't continuous: it only has integer values from 1 to 5. The combination of Dwithin and Dbetween (as in your plot) could also be interesting.
Here is a reproducible example:
# Draw n values from a normal(m, s) distribution truncated to [lwr, upr]
# (note: returns NULL if fewer than n of the nnorm draws survive the truncation)
mysamp <- function(n, m, s, lwr, upr, nnorm) {
  set.seed(1)
  samp <- rnorm(nnorm, m, s)
  samp <- samp[samp >= lwr & samp <= upr]
  if (length(samp) >= n) {
    return(sample(samp, n))
  }
}
options(digits=2)
TID <- rep(1:60, times=333)
PID <- rep(1:333,each=60)
Dwithin <- rep(rep(0:1, times=19980/2))
Dbetween <- rep(rep(0:1, each=60),times=333)[1:19980]
Rating <- floor(runif(19980, min=1, max=6))
SDO_mean <- rep(mysamp(n=333, m=4, s=2.5, lwr=1, upr=5, nnorm=1000000), each=60)
KSA_mean <- rep(mysamp(n=333, m=2, s=0.8, lwr=1, upr=5, nnorm=1000000), each=60)
threats_overall <- rep(mysamp(n=333, m=3, s=1.5, lwr=1, upr=5, nnorm=1000), each=60)
df <- data.frame(TID,PID,Dwithin,Dbetween, Rating, SDO_mean, KSA_mean, threats_overall)
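Not part of the original post: a minimal plotting sketch with ggplot2, assuming the df built in the example above. Cell means with standard errors sidestep the discrete 1-5 scale; for model-based effect plots, packages such as ggeffects can plot lmer predictions directly.
library(ggplot2)
# Mean rating per Dwithin x Dbetween cell, with standard errors,
# drawn over the jittered raw ratings
ggplot(df, aes(x = factor(Dwithin), y = Rating,
               colour = factor(Dbetween), group = factor(Dbetween))) +
  geom_jitter(width = 0.15, height = 0.15, alpha = 0.05, colour = "grey60") +
  stat_summary(fun = mean, geom = "line",
               position = position_dodge(width = 0.2)) +
  stat_summary(fun = mean, geom = "point", size = 3,
               position = position_dodge(width = 0.2)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1,
               position = position_dodge(width = 0.2)) +
  labs(x = "Dwithin (0 = type A, 1 = type B)",
       colour = "Dbetween", y = "Rating (1-5)")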
I'm trying to implement the "synthesis equation" from the DSP Guide (equation 8-2) so I can resample a time series in the frequency domain. The way I read the equation, N is the number of output points, and given the loop of k from 0 to N/2, I can resample to at most twice the original sampling rate.
I tried writing a quick implementation in R, but the results are not anything close to what I expect. My code:
input <- c(1:9)
nin   <- 9
nout  <- 17
b     <- fft(input)
reals <- Re(b) / (nout / 2)
imags <- Im(b) / (nout / 2)
reals[1] <- reals[1] / 2
reals[(nout / 2)] <- reals[(nout / 2)] / 2
output <- c(1:nout)
for (i in 1:nout) {
  realSum <- 0
  imagSum <- 0
  for (k in 1:(nout / 2)) {
    angle <- 2 * pi * (k - 1) * (i - 1) / nout
    realSum <- realSum + (reals[k] * cos(angle))
    imagSum <- imagSum - (imags[k] * sin(angle))
  }
  output[i] <- (realSum + imagSum)
}
For my input (sampled at, say, 1-second intervals, resampling to 0.5-second intervals)
[1] 1 2 3 4 5 6 7 8 9
I get the output
 [1] -0.7941176  1.5150954  0.7462716  1.5022387  1.6478971  1.8357487
 [7]  2.4029773  2.1965426  3.1585254  2.6178195  3.7284660  3.3721128
[13]  3.8433588  4.6390705  3.4699088  6.3005605  2.8175240
while my expected output is
[1] 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9
What am I doing wrong?
The above code does look like the inverse DFT referenced in the article. From your expected input/output, though, it seems like you need cheap linear interpolation rather than a DFT/inverse-DFT solution.
An inverse DFT, which is what you coded, goes from the frequency domain to the time domain. From what I can gather from your statements, you want to start with the series 1:9, do a forward DFT (to get to the frequency domain), then take that result into an inverse DFT of a different length (to get back to the time domain).
Hopefully this is a hint in the right direction for you.
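To make the hint concrete, here is my own sketch (not from the original answer). If linear interpolation is really what is wanted, base R's approx() reproduces the expected output exactly. A band-limited (FFT) resample zero-pads the spectrum instead of re-running the synthesis loop, but it treats the ramp as periodic and so shows ringing at the ends:
input <- 1:9
nout  <- 17
# Linear interpolation: gives exactly 1, 1.5, 2, ..., 9
approx(input, n = nout)$y
# Band-limited resampling sketch (works for odd-length inputs like this one;
# an even length would require splitting the Nyquist bin as well)
nin  <- length(input)
X    <- fft(input)
half <- ceiling(nin / 2)                 # DC plus the positive-frequency bins
Y    <- c(X[1:half], rep(0, nout - nin), X[(half + 1):nin])
Re(fft(Y, inverse = TRUE)) / nin         # periodic, so expect edge ringing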
I'm trying to add a bit of code to a data-augmentation capture-recapture model and am coming up with some errors I haven't encountered before. In short, I want to estimate a series of survivorship phases that each last more than a single time interval. I want the model to estimate the length of each survivorship phase and use that to improve the capture recapture model. I tried and failed with a few different approaches, and am now trying to accomplish this using a switching state array for the survivorship phases:
for (t in 1:(n.occasions-1)){
  phi1switch[t] ~ dunif(0,1)
  phi2switch[t] ~ dunif(0,1)
  phi3switch[t] ~ dunif(0,1)
  phi4switch[t] ~ dunif(0,1)
  psphi[1,t,1] <- 1-phi1switch[t]
  psphi[1,t,2] <- phi1switch[t]
  psphi[1,t,3] <- 0
  psphi[1,t,4] <- 0
  psphi[1,t,5] <- 0
  psphi[2,t,1] <- 0
  psphi[2,t,2] <- 1-phi2switch[t]
  psphi[2,t,3] <- phi2switch[t]
  psphi[2,t,4] <- 0
  psphi[2,t,5] <- 0
  psphi[3,t,1] <- 0
  psphi[3,t,2] <- 0
  psphi[3,t,3] <- 1-phi3switch[t]
  psphi[3,t,4] <- phi3switch[t]
  psphi[3,t,5] <- 0
  psphi[4,t,1] <- 0
  psphi[4,t,2] <- 0
  psphi[4,t,3] <- 0
  psphi[4,t,4] <- 1-phi4switch[t]
  psphi[4,t,5] <- phi4switch[t]
  psphi[5,t,1] <- 0
  psphi[5,t,2] <- 0
  psphi[5,t,3] <- 0
  psphi[5,t,4] <- 0
  psphi[5,t,5] <- 1
}
So this creates a [5,t,5] array where the survivorship state can only switch to the subsequent state and not backwards (e.g. 1 to 2, 4 to 5, but not 4 to 3). Now I create a vector where the survivorship state is defined:
PhiState[1] <- 1
for (t in 2:(n.occasions-1)){
  # State process: draw PhiState(t) given PhiState(t-1)
  PhiState[t] ~ dcat(psphi[PhiState[t-1], t-1, ])
}
We always start in state 1, and then at each time step t we take a categorical draw to either remain in the current state or move on to the next one, with the probabilities given by the array. I want a maximum of 5 states (assuming the model can functionally produce fewer, either by estimating the probability of moving from state 3 to 4 and onwards as near 0, or by making the survivorship values of subsequent states the same or similar if they really belong to one survivorship phase). So I create 5 hierarchical survival probabilities:
for (a in 1:5){
  mean.phi[a] ~ dunif(0,1)
  phi.sigma[a] ~ dunif(0,20)
  phi.tau[a] <- pow(phi.sigma[a],-2)
}
Now this next step is where the errors start. Now that I've assigned values 1-5 to my PhiState vector it should look something like this:
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 5
or maybe
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
and I now want to assign a mean.phi[] to my actual phi[] term, which feeds into the model:
for (t in 1:(n.occasions-1)){
  phi[t] ~ dnorm(mean.phi[PhiState[t]], phi.tau[PhiState[t]])
}
However, when I try to run this I get the following error:
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Cannot insert node into mean.phi[1:5]. Dimension mismatch
It's worth noting that the model works just fine when I use the following phi[] determinations:
phi[t] ~ dunif(0,1) #estimate independent annual phi's
or
phi[t] ~ dnorm(mean.phi, phi.tau) #estimate hierarchical phi's from a single mean.phi
or
#Set fixed survival periods (this works the best, but I don't want to have to tell it
#when the periods start/end and how many there are, hence the current exercise):
for (a in 1:21){
  surv[a] ~ dnorm(mean.phi1, phi1.tau)
}
for (b in 22:30){
  surv[b] ~ dnorm(mean.phi2, phi2.tau)
}
for (t in 1:(n.occasions-1)){
  phi[t] <- surv[t]
}
I did read this post: https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/36c48f25/
but I don't see where I'm redefining variables in this case... Any help fixing this or advice on a better approach would be most welcome!
Many thanks,
Josh
I'm a bit confused as to what your actual data are (the phi[t]?), but the following might give you a starting point:
nt <- 29
nstate <- 5

M <- function() {
  phi_state[1] <- 1
  for (t in 2:nt) {
    up[t-1] ~ dbern(p[t-1])
    p[t-1] <- ifelse(phi_state[t-1] == nstate, 0, p_[t-1])
    p_[t-1] ~ dunif(0, 1)
    phi_state[t] <- phi_state[t-1] + equals(up[t-1], 1)
  }
  for (k in 1:nstate) {
    mean_phi[k] ~ dunif(0, 1)
    phi_sigma[k] ~ dunif(0, 20)
  }
  for (t in 1:(nt-1)) {
    phi[t] ~ dnorm(mean_phi[phi_state[t]], phi_sigma[phi_state[t]]^-2)
  }
}

library(R2jags)
fit <- jags(list(nt = nt, nstate = nstate), NULL,
            c('phi_state', 'phi', 'mean_phi', 'phi_sigma', 'p'),
            M, DIC = FALSE)
Note that above, p is a vector of probabilities of moving up to the next (adjacent) state.
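To inspect the result (my addition, assuming the fit object above; R2jags provides both of these):
print(fit)       # posterior summaries of the monitored parameters
traceplot(fit)   # quick visual convergence check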
I am using the lpSolveAPI package in RStudio. When I type the name of a model with a few decision variables, I get a printout of the current constraints in the model. For example:
> lprec
Model name: 
            COLONE  COLTWO  COLTHREE  COLFOUR
Minimize         1       3      6.24      0.1
THISROW          0   78.26         0      2.9  >=  92.3
THATROW       0.24       0     11.31        0  <=  14.8
LASTROW      12.68       0      0.08      0.9  >=     4
Type          Real    Real      Real     Real
Upper          Inf     Inf       Inf    48.98
Lower         28.6       0         0       18
But when I make a model that has more than 9 decision variables, it no longer gives the full summary; instead I see:
> lprec
Model name:
a linear program with 13 decision variables and 258 constraints
Does anyone know how I can see the same detailed summary of the model when there are large numbers of decision variables?
Bonus Question: Is RStudio the best console for working with R?
Here is an example:
> lprec <- make.lp(0, 5)
This makes a new model called lprec, with 0 constraints and 5 variables. Even if you call the name now, you get:
> lprec
Model name: 
            C1    C2    C3    C4    C5
Minimize     0     0     0     0     0
Kind       Std   Std   Std   Std   Std
Type      Real  Real  Real  Real  Real
Upper      Inf   Inf   Inf   Inf   Inf
Lower        0     0     0     0     0
The C columns correspond to the 5 variables. Right now there are no constraints and the objective function is 0.
You can add a constraint with
> add.constraint(lprec, c(1, 3, 4, 2, -8), "<=", 0)
This adds the constraint C1 + 3*C2 + 4*C3 + 2*C4 - 8*C5 <= 0. Now the printout is:
Model name: 
            C1    C2    C3    C4    C5
Minimize     0     0     0     0     0
R1           1     3     4     2    -8  <=  0
Kind       Std   Std   Std   Std   Std
Type      Real  Real  Real  Real  Real
Upper      Inf   Inf   Inf   Inf   Inf
Lower        0     0     0     0     0
Anyway, the point is that no matter how many constraints there are, if the model has more than 9 variables I don't get the full printout:
> lprec <- make.lp(0, 15)
> lprec
Model name:
a linear program with 15 decision variables and 0 constraints
Write it out to a file for examination
When I work with LPs using lpSolveAPI, I prefer to write them out to a file. The LP format works fine for my needs, and I then examine the model in any text editor. If you click on the output file in the "Files" panel in RStudio, it will open there too, so you can inspect it.
write.lp(lprec, "lpfilename.lp", "lp") #write it to a file in LP format
You can also write it out as MPS format if you so choose.
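For instance (my example; the file names are just placeholders):
write.lp(lprec, "lpfilename.mps", "mps")           # fixed MPS format
write.lp(lprec, "lpfilename_free.mps", "freemps")  # free MPS format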
Here's the help file on write.lp().
Hope that helps.
Since it is an S3 object of class lpExtPtr, the function called to display it is print.lpExtPtr. If you check its code, you will see that it displays the object differently depending on its size -- details for very big objects would not be very useful. Unfortunately, the threshold cannot be changed.
class(r)
# [1] "lpExtPtr"
print.lpExtPtr
# function (x, ...)
# {
# (...)
# if (n > 8) {
# cat(paste("Model name: ", name.lp(x), "\n", " a linear program with ",
# n, " decision variables and ", m, " constraints\n",
# sep = ""))
# return(invisible(x))
# }
# (...)
You can access the contents of the object with the various get.* functions, as the print method does. Alternatively, you can just change the print method.
# A function to modify functions
patch <- function(f, before, after) {
  f_text <- capture.output(dput(f))
  g_text <- gsub(before, after, f_text)
  g <- eval(parse(text = g_text))
  environment(g) <- environment(f)
  g
}
# Sample data
library(lpSolveAPI)
r <- make.lp(0,5)
r # Shows the details
r <- make.lp(0,20)
r # Does not show the details
# Set the threshold to 800 variables instead of 8
print.lpExtPtr <- patch( print.lpExtPtr, "8", "800" )
r # Shows the details
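One caveat (my addition, not in the original answer): gsub replaces every literal "8" in the deparsed function, which could clobber unrelated numbers. A more targeted pattern is safer, given that the size test reads "n > 8" in the excerpt above:
# Only rewrite the size test itself
print.lpExtPtr <- patch(print.lpExtPtr, "n > 8", "n > 800")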
I am trying to conduct a hierarchical Bayesian analysis but am having a little trouble with the R and WinBUGS code. I don't have balanced data and am struggling with the coding. I have temperature data collected daily with iButtons (temperature-recording devices) placed along transects, and I am trying to build a model that relates these to remote-sensing data. Unfortunately, each transect has a different number of iButtons, so creating a 3D matrix of button (i), in transect (j), repeatedly "sampled" on day (t) is a problem for me.
Ultimately, my model will be something like:
Level 1:
  Temp[i,j,t] ~ N(theta[i,j,t], tau)
  theta[i,j,t] = b0 + b1*x1 + . . . + bn*xn
Level 2:
  b0 = a00 + a01*y1 + . . . + a0n*yn
  b1 = a10 + a11*y1 + . . .
Level 3 (maybe?): random level-2 intercepts
Normally I would do something like this:
Wide <- reshape(Data1, idvar = c("iButton", "block"), timevar = "julian",
                direction = "wide")
J  <- length(unique(Data$block))
I  <- length(unique(Data$iButton))
Ti <- length(unique(Data$julian))
Temp <- array(NA, dim = c(I, Ti, J))
for (t in 1:Ti) {
  sel.rows  <- Wide$block == t
  Temp[,,t] <- as.matrix(Wide)[sel.rows, 3:Ti]
}
Then I could have a 3D matrix that I could loop through in WinBUGS or OpenBUGS as such:
for (i in 1:J) {        # Loop over transects/blocks
  for (j in 1:I) {      # Loop over buttons
    for (t in 1:Ti) {   # Loop over days
      Temp[i,j,t] ~ dnorm(theta[i,j,t])
      theta[i,j,t] <- alpha.lam[i] + blam1*radiation[i,j] + blam2*cwd[i,j] + blam3*swd[i,j]
    }
  }
}
Anyway, don't worry about the details of the code above, it's just thrown together as an example from other analyses. My main question is how to do this type of analysis when I don't have a balanced design with equal numbers of iButtons per transect? Any help would be greatly appreciated. I'm clearly new to R and WinBUGS and don't have much previous computer coding experience.
Thanks!
Oh, and here is what the data look like in long (stacked) format:
> Data[1:15, 1:4]
iButton julian block aveT
1 1 1 1 -4.5000000
2 1 2 1 -5.7500000
3 1 3 1 -3.5833333
4 1 4 1 -4.6666667
5 1 5 1 -2.5833333
6 1 6 1 -3.0833333
7 1 7 1 -1.5833333
8 1 8 1 -8.3333333
9 1 9 1 -5.0000000
10 1 10 1 -2.4166667
11 1 11 1 -1.7500000
12 1 12 1 -3.2500000
13 1 13 1 -3.4166667
14 1 14 1 -2.0833333
15 1 15 1 -1.7500000
Create a vector or array of lengths and use subindexing.
Using your example:
J  <- length(unique(Data$block))
I  <- tapply(Data$iButton, Data$block, function(x) length(unique(x)))
Ti <- tapply(Data$julian, list(Data$iButton, Data$block),
             function(x) length(unique(x)))
for (i in 1:J) {            # Loop over transects/blocks
  for (j in 1:I[i]) {       # Loop over buttons
    for (t in 1:Ti[i, j]) { # Loop over days
      Temp[i,j,t] ~ dnorm(theta[i,j,t])
      theta[i,j,t] <- alpha.lam[i] + blam1*radiation[i,j] + blam2*cwd[i,j] + blam3*swd[i,j]
    }
  }
}
I think it would work, but I haven't tested it since there is no data.
Can you try using a list instead? This allows a variable length for each item in the list, where each index corresponds to a transect.
So, something like this:
theta <- list()
for (i in unique(Data$block)) {
  ibuttons <- unique(Data$iButton[Data$block == i])
  days     <- unique(Data$julian[Data$block == i])
  theta[[i]] <- matrix(NA, length(ibuttons), length(days)) # empty matrix of NAs
  for (j in 1:length(ibuttons)) {
    for (t in 1:length(days)) {
      # fn() is a placeholder for however theta is computed
      theta[[i]][j,t] <- fn(i, ibuttons[j], days[t])
    }
  }
}
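A further alternative (my suggestion, not from the original answers): since BUGS cannot handle ragged arrays directly, you can keep the data in the long (stacked) format shown in the question and use nested indexing by row. Here block[n] is the transect of observation n, passed in as data:
for (n in 1:Nobs) {  # Nobs = number of rows in the long data
  Temp[n] ~ dnorm(theta[n], tau)
  theta[n] <- alpha.lam[block[n]] + blam1*radiation[n] +
              blam2*cwd[n] + blam3*swd[n]
}
This sidesteps the balanced-design problem entirely, because each row carries its own indices.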