RStan: Use matrix in ordered_logistic()

I would like to use the ordered_logistic() function in RStan.
First, here is a sample of the data:
  gender   age partyID treatment_rand treatment_bias y_randT y_biasT
   <dbl> <dbl>   <dbl>          <dbl>          <dbl>   <dbl>   <dbl>
1      0    21       1              0              0       1       4
2      1    21       7              1              1       3       2
3      0    67       7              0              0       4       4
4      0    78       1              0              0       2       4
5      0    35       8              0              1       4       2
I use a subset of the data:
X <- cbind(1, sample[, c("age", "partyID", "gender")])
choice_num <- 5
data_randT <- list(N=nrow(sample), D=ncol(X), t=sample$treatment_rand, X=X, y=sample$y_randT, K=choice_num)
My Stan code is:
data{
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  int<lower=1, upper=K> y[N];
  real<lower=0, upper=1> t[N];
  matrix[N,D] X;
}
parameters{
  vector[D] betaX;
  real betaT;
  ordered[K-1] c;
}
model{
  for(n in 1:N)
    y[n] ~ ordered_logistic(betaT*t[n] + X[n]*betaX, c);
}
When I run this code, I get the following error:
> fit_rand <- stan(model_code=stan_OrderedLogit, data=data_randT, seed=seed)
Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) :
variable does not exist; processing stage=data initialization; variable name=X; base type=matrix_d
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In FUN(X[[i]], ...) : data with name X is not numeric and not used
failed to create the sampler; sampling not done
What is wrong with the code?

I believe it will work if you make X a matrix in R, rather than a data.frame, as in
X <- as.matrix(cbind(1, sample[, c("age", "partyID", "gender")]))
However, I would think you would first want to expand partyID into a set of dummy variables (excluding a reference category), for example as sketched below.
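For illustration, a minimal sketch of that expansion using model.matrix(), assuming partyID is categorical rather than truly numeric (the names mirror the question's data):
# sketch: treat partyID as a factor so model.matrix() builds the dummy columns,
# dropping one reference level and adding the intercept column automatically
sample$partyID <- factor(sample$partyID)
X <- model.matrix(~ age + gender + partyID, data = sample)
data_randT <- list(N=nrow(sample), D=ncol(X), t=sample$treatment_rand, X=X, y=sample$y_randT, K=choice_num)
Since model.matrix() returns a numeric matrix, this also avoids the data.frame issue above.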

Related

vectorization of "cumulated" expected values from regression

I have data
set.seed(42)
dat <- data.frame(t=1:1000,x1=runif(1000,1,10),x2=round(runif(1000,0,1)))
dat$y <- 8*dat$x1 - 5*dat$x2 + rnorm(1000)
> head(dat)
t x1 x2 y
1 1 9.233254 1 71.19109
2 2 9.433679 0 75.99355
3 3 3.575256 1 24.57278
4 4 8.474029 1 63.16920
5 5 6.775710 0 53.20974
6 6 5.671864 0 44.77743
where t gives points in time. I would like to obtain the expected value of y at each point in time based on a regression of y on x1 and x2 using the preceding points in time.
I could do this in a for-loop but I wonder whether there is a solution with data.table. In a related question, Michael Chirico gave an excellent hint on how to do the regression and get the coefficients,
dat[dat, on=.(t<t), allow.cartesian = TRUE, nomatch=0L][ , as.list(coef(lm(y ~ x1 + x2))), keyby = t]
but using them to get the expected values would be even better.
You probably want something like this:
dat[dat, on=.(t<t), allow.cartesian = TRUE, nomatch=0L][ , .( exp=predict(lm(y ~ x1 + x2),list(x1=i.x1[1],x2=i.x2[1]))), keyby = t]
t exp
1: 2 71.191094
2: 3 -64.382779
3: 4 64.935556
4: 5 54.437024
5: 6 44.693841
---
995: 996 17.828209
996: 997 47.443171
997: 998 12.177957
998: 999 43.640271
999: 1000 3.516452
Anyway, that method might be terribly inefficient in terms of memory usage (e.g. this small example already creates a throw-away data.table of 499,500 rows!).
I would use a simple for-loop without data.table (it takes more or less the same time):
expected <- rep.int(NA, nrow(dat))
for(n in 2:nrow(dat)){
  LM <- lm(y ~ x1 + x2, data = dat[1:(n-1), ])  # fit on all preceding rows
  expVal <- predict(LM, dat[n, ])               # expected value for row n
  expected[n] <- expVal
}
dat$exp <- expected
> dat
t x1 x2 y exp
1 1 9.233254 1 71.191094 NA
2 2 9.433679 0 75.993552 71.191094
3 3 3.575256 1 24.572780 -64.382779
4 4 8.474029 1 63.169202 64.935556
5 5 6.775710 0 53.209744 54.437024
6 6 5.671864 0 44.777425 44.693841
7 7 7.629295 1 56.199610 57.353776
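If refitting lm() from scratch at every step becomes the bottleneck, an incremental fit is one option. Here is a rough sketch with the biglm package, whose update() method folds new rows into an existing fit; note it needs a few initial rows before the coefficients are estimable, so the first predictions differ slightly from the lm() loop above:
library(biglm)
expected <- rep(NA_real_, nrow(dat))
fit <- biglm(y ~ x1 + x2, data = dat[1:3, ])  # initialize on the first few rows
for(n in 4:nrow(dat)){
  expected[n] <- sum(coef(fit) * c(1, dat$x1[n], dat$x2[n]))  # predict row n by hand
  fit <- update(fit, dat[n, ])  # fold row n into the fit for later predictions
}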

Run if loop in parallel

I have a data set with ~4 million rows that I need to loop over. The structure is that IDs repeat, rows within an ID depend on each other, and the data are independent across IDs: for each ID, row [i+1] depends on row [i]. Here is a reproducible example. I realize this example is not practical in terms of the inner functions, but it simply demonstrates the structure I have.
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
y = rep(0,15)
df = as.data.frame(cbind(ids,month,x,y))
for (i in 1:nrow(df)){
  if(i > 1 && df[i,1] == df[i-1,1]){
    # Main functions go here
    df[i,4] = df[i-1,4]^2 + df[i,3]
  } else {
    df[i,4] = 1
  }
}
The issue is that, in reality, 1,000 iterations of the real function take ~90 seconds, so 4 million rows would take days, which isn't feasible. However, the IDs are independent and don't need to run together. My question is: is there a way to run this type of loop in parallel? A very inelegant solution would be to split the file into 50 sections without splitting an ID and simply run the same code on each of the 50 sub-files, but I figure there should be a way to code this.
EDIT: Added month column to show why the rows are dependent on each other. To address two comments below:
1) There are actually 6-7 lines of functions to run. Could I use ifelse() with multiple functions?
2) The desired output would be the full data frame. In reality there are more columns but I need each row in a data frame.
ids month x y
1 1 1 -1 1
2 1 2 1 2
3 1 3 10 14
4 1 4 2 198
5 1 5 3 39207
6 2 1 11 1
7 2 2 4 5
8 2 3 -4 21
9 2 4 -1 440
10 2 5 0 193600
11 3 1 8 1
12 3 2 4 5
13 3 3 4 29
14 3 4 3 844
15 3 5 -1 712335
EDIT2: I've tried applying the foreach() package from another post, but it doesn't seem to work. This code will run, but I think the issue is the way rows are distributed among cores. If each row is sent sequentially to a different core, then the same ID will never be on the same core.
library(foreach)
library(doParallel)
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
y = rep(0,15)
df = as.data.frame(cbind(ids,month,x,y))
#setup parallel backend to use many processors
cores=detectCores()
cl <- makeCluster(cores[1]-1) #not to overload your computer
registerDoParallel(cl)
finalMatrix <- foreach(i=1:nrow(df), .combine=cbind) %dopar% {
  for (i in 1:nrow(df)){
    if(i > 1 && df[i,1] == df[i-1,1]){
      # Main functions go here
      df[i,4] = df[i-1,4]^2 + df[i,3]
    } else {
      df[i,4] = 1
    }
  }
}
#stop cluster
stopCluster(cl)
So, simply recode your loop with Rcpp:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fill_y(const NumericVector& x) {
  int n = x.length();
  NumericVector y(n);
  y[0] = 1;
  for (int i = 1; i < n; i++) {
    y[i] = pow(y[i - 1], 2) + x[i];
  }
  return y;
}
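To run it (a sketch, assuming the C++ code above is saved in a file called fill_y.cpp):
Rcpp::sourceCpp("fill_y.cpp")  # compiles the file and exports fill_y() to the R session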
And, to apply it on each group, use dplyr:
library(dplyr)
df %>%
  group_by(ids) %>%
  mutate(y2 = fill_y(x))
I think this should be fast enough so that you don't need parallelism.
Actually, I ran it on @Val's testdat and it took only 2 seconds (on an old computer).
Tell me if it's okay; otherwise, I'll make a parallel version.
Here's a solution using foreach. It's hard to say how it would work in your real-life example, but at least it works with the test data.
First I generate some testdata:
# function to generate testdata
genDat <- function(id){
  # observations per id, fixed or random
  n <- 50
  #n <- round(runif(1,5,1000))
  return(
    data.frame(id=id, month=rep(1:12, ceiling(n/12))[1:n], x=round(rnorm(n,2,5)), y=rep(0,n))
  )
}
# generate testdata
testdat <- do.call(rbind, lapply(1:90000, genDat))
> head(testdat)
id month x y
1 1 1 7 0
2 1 2 6 0
3 1 3 -9 0
4 1 4 3 0
5 1 5 -9 0
6 1 6 8 0
> str(testdat)
'data.frame': 4500000 obs. of 4 variables:
$ id : int 1 1 1 1 1 1 1 1 1 1 ...
$ month: int 1 2 3 4 5 6 7 8 9 10 ...
$ x : num 7 6 -9 3 -9 8 -4 13 0 5 ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
So the testdata has ~ 4.5 million rows with 90k unique ids.
Now since your calculations are independent between the IDs, the idea would be to ship off data with unique IDs to each core ... this would ultimately also get rid of the necessity for an if or ifelse condition.
To do this, I first generate a matrix with start and stop row indices, to split the dataset in unique IDs:
id_len <- rle(testdat$id)
ixmat <- cbind(c(1,head(cumsum(id_len$lengths)+1,-1)),cumsum(id_len$lengths))
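For the test data above (50 rows per id), the first rows of ixmat are (1, 50), (51, 100), (101, 150), and so on.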
This matrix can then be passed on to foreach for running the specific parts in parallel.
In this example I modify your calculations slightly to avoid astronomical values leading to Inf.
library(parallel)
library(doParallel)
library(iterators)
cl <- makeCluster(parallel::detectCores())
registerDoParallel(cl) #create a cluster
r <- foreach (i = iter(ixmat, by='row')) %dopar% {
  x <- testdat$x[i[1,1]:i[1,2]]
  y <- testdat$y[i[1,1]:i[1,2]]
  y[1] <- 1
  for(j in 2:length(y)){
    #y[j] <- (y[j-1]^2) + x[j] ##gets INF
    y[j] <- y[j-1] + x[j]
  }
  return(y)
}
parallel::stopCluster(cl)
Finally you could replace the values in the original dataframe:
testdat$y <- unlist(r)
As for the time, the foreach loop runs in about 40 seconds on my 8 core machine.
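For reference, a single-threaded baseline that does the same per-id recursion with split()/lapply() and no parallel machinery (a sketch, using the same modified recursion as above):
fill_y_r <- function(x) {
  y <- numeric(length(x))
  y[1] <- 1
  for(j in 2:length(x)) y[j] <- y[j-1] + x[j]  # same modified recursion as above
  y
}
# split() returns the groups in id order, which matches the row order of testdat here
testdat$y <- unlist(lapply(split(testdat$x, testdat$id), fill_y_r), use.names = FALSE)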
Base R Matrix operations and melt/dcast from data.table
As discussed in the comments above, this solution is very specific to the use case in the example, but perhaps might be applicable to your use case.
Using matrix operations and the dcast.data.table and melt.data.table functions from the data.table package to make fast transitions from a long to wide format and back is pretty efficient.
All things considered, the bigger constraint will likely be how much RAM you have available rather than processing time with these methods.
library(data.table)
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
# y = rep(0,15) ## no need to pre-define y with this method
df = as.data.frame(cbind(ids,month,x))
setDT(df) ## Convert to data.table by reference
wide <- dcast.data.table(df, month ~ ids, value.var = "x") ## pivot to 'wide' format
mat <- data.matrix(wide[,-c("month")]) ## Convert to matrix
print(mat)
gives
1 2 3
[1,] -1 11 8
[2,] 1 4 4
[3,] 10 -4 4
[4,] 2 -1 3
[5,] 3 0 -1
Then operating on it as a matrix:
mat[1,] <- 1 ## fill the first row with 1's as in your example
for (i in 2:nrow(mat)){
  mat[i,] = mat[i-1L,]^2 + mat[i,]
}
print(mat)
gives
1 2 3
[1,] 1 1 1
[2,] 2 5 5
[3,] 14 21 29
[4,] 198 440 844
[5,] 39207 193600 712335
Next, melt back to a long format and then join back to the original data on key columns ids and month:
yresult <- as.data.table(mat) ## convert back to data.table format
yresult[,month := wide[,month]] ## Add back the month column
ylong <- melt.data.table(yresult,
                         id.vars = "month",
                         variable.factor = FALSE,
                         variable.name = "ids",
                         value.name = "y") ## Pivot back to 'long' format
ylong[,ids := as.numeric(ids)] ## reclass ids to match input ids
setkey(ylong, ids, month) ## set keys for join on 'ids' and 'month'
setkey(df, ids,month)
merge(df,ylong) ## join data.table with the result
yields the final result:
ids month x y
1: 1 1 -1 1
2: 1 2 1 2
3: 1 3 10 14
4: 1 4 2 198
5: 1 5 3 39207
6: 2 1 11 1
7: 2 2 4 5
8: 2 3 -4 21
9: 2 4 -1 440
10: 2 5 0 193600
11: 3 1 8 1
12: 3 2 4 5
13: 3 3 4 29
14: 3 4 3 844
15: 3 5 -1 712335
Scale Testing
To test and illustrate scaling, the function testData below generates a data set by cross joining a given number of ids and a given number of months. Then, the function testFunc performs the recursive row-wise matrix operations.
testData <- function(id_count, month_count) {
  id_vector <- as.numeric(seq_len(id_count))
  months_vector <- seq_len(month_count)
  df <- CJ(ids = id_vector, month = months_vector)
  df[, x := rnorm(.N, 0, 0.1)]
  return(df)
}
testFunc <- function(df) {
  wide <- dcast.data.table(df, month ~ ids, value.var = "x")
  mat <- data.matrix(wide[, -c("month")])
  mat[1,] <- 1
  for (i in 2:nrow(mat)){
    mat[i,] = mat[i-1L,]^2 + mat[i,]
  }
  yresult <- as.data.table(mat)
  yresult[, month := wide[, month]]
  ylong <- melt.data.table(yresult,
                           id.vars = "month",
                           variable.factor = FALSE,
                           variable.name = "ids",
                           value.name = "y")
  ylong[, ids := as.numeric(ids)]
  setkey(ylong, ids, month)
  setkey(df, ids, month)
  merge(df, ylong)
}
With 90,000 ids and 45 months:
foo <- testData(90000,45)
system.time({
testFunc(foo)
})
user system elapsed
8.186 0.013 8.201
Run-time comes in under 10 seconds with a single thread.
With 100,000 ids and 1,000 months:
This three-column input data.table is ~1.9 GB.
foo <- testData(1e5,1e3)
system.time({
testFunc(foo)
})
user system elapsed
52.790 4.046 57.031
A single-threaded run-time of under a minute seems pretty manageable, depending on how many times this needs to be run. As always, this could be sped up further by improving my code or by converting the recursive portion to C++ with Rcpp, but avoiding the mental overhead of learning C++ and switching between languages in your workflow is always nice!

Sweep equivalent in Julia

From R documentation:
sweep: Return an array obtained from an input array by sweeping out a summary
statistic.
For example, here is how I divide each row by its row sum:
> rs = rowSums(attitude)
> ratios = sweep(attitude, 1, rs, FUN="/")
> head(ratios)
rating complaints privileges learning raises critical advance
1 0.1191136 0.1412742 0.08310249 0.1080332 0.1689751 0.2548476 0.12465374
2 0.1518072 0.1542169 0.12289157 0.1301205 0.1518072 0.1759036 0.11325301
3 0.1454918 0.1434426 0.13934426 0.1413934 0.1557377 0.1762295 0.09836066
4 0.1568123 0.1619537 0.11568123 0.1208226 0.1388175 0.2159383 0.08997429
5 0.1680498 0.1618257 0.11618257 0.1369295 0.1473029 0.1721992 0.09751037
6 0.1310976 0.1676829 0.14939024 0.1341463 0.1646341 0.1493902 0.10365854
> rowSums(ratios) # check that ratios sum up to 1
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
My attempt in Julia:
x = rand(3, 4)
x[1, 1] = 10
x[2, 1] = 20
x[3, 1] = 30
rowsum = sum(x, 2)
rowsum_mat = repmat(rowsum, 1, size(x, 2))
x = x ./ rowsum_mat
This works but is clunky. Is there a more elegant and efficient way of doing this?
No need to use repmat — all of Julia's .-operators do "broadcasting" by default. This means it matches the dimensions of the two arguments and then expands any dimensions that have length 1 (the singleton dimensions) to match the other array. Since reductions keep the same dimensionality of the source array, they can be used directly with any dot-operator.
In your case, you can just use:
x ./ sum(x, 2)
since:
julia> x ./ rowsum_mat == x ./ rowsum
true
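Note that sum(x, 2) is the old (pre-0.7) reduction syntax; on current Julia the dimension is a keyword argument, so the equivalent is x ./ sum(x, dims=2).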

rnorm in For Loop Different n each iteration

I am trying to use a for loop to draw from a normal distribution 100 times, but with different n values that are given in a data frame. I am using the code below, but it gives me an error. Is there a way to use a different n for each iteration?
Many thanks,
Krina
> head(dat2_Placebo_n)
Source: local data frame [6 x 2]
Repl N
<int> <int>
1 1 78
2 2 71
3 3 60
4 4 66
5 5 71
6 6 82
> cv.tumor= 0.40
> sd.tumor<-sqrt(log((cv.tumor)^2+1))
> nRep <- 100
> result<-list()
> for(i in 1:nRep) {
+ n<- for(i in 1:dat2_Placebo_n$N) {n<-N[1:100,]}
+ Log_Tumor <- rnorm(n, log(6.8), sd.tumor)
+ Tumor <- exp(Log_Tumor)
+ result[[i]]<-Base_Tumor
+ }
Error: object 'N' not found
In addition: Warning message:
In 1:dat2_Placebo_n$N :
numerical expression has 100 elements: only the first used
> result <- as.data.frame(unlist(result))
This worked:
cv.tumor <- 0.40
sd.tumor <- sqrt(log((cv.tumor)^2 + 1))
nRep <- 100
result_Placebo <- list()
for(i in 1:nRep) {
  ni <- dat2_Placebo_n$N[[i]]  # sample size for replicate i
  Log_Tumor <- rnorm(ni, log(6.8), sd.tumor)
  Tumor <- exp(Log_Tumor)
  result_Placebo[[i]] <- Tumor
}
result_Placebo <- as.data.frame(unlist(result_Placebo))
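A loop-free variant of the same idea (a sketch using lapply() over the first nRep sample sizes):
ns <- dat2_Placebo_n$N[seq_len(nRep)]  # one n per replicate
draws <- lapply(ns, function(ni) exp(rnorm(ni, log(6.8), sd.tumor)))
result_Placebo <- data.frame(Tumor = unlist(draws))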

R ddply, applying if and ifelse functions

I'm trying to apply a function to a data frame using ddply from the plyr package, but I'm getting some results that I don't understand. I have 3 questions about the results.
Given:
mydf <- data.frame(n = c(12,34,9,3,22,55),
                   x = c(1,2,1,1,2,2),
                   x1 = c(0,1,2,1,1,2))
mydf looks like this:
n x x1
1 12 1 0
2 34 2 1
3 9 1 2
4 3 1 1
5 22 2 1
6 55 2 2
Question #1
If I do:
k <- function(x) {
  mydf$z <- ifelse(x == 1, 0, mydf$n)
  return(mydf)
}
mydf <- ddply(mydf, c("x"), .fun = k, .inform = TRUE)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "z", value = structure(c(12, 34, 9, :
replacement has 3 rows, data has 6
Error: with piece 1:
n x x1
1 12 1 0
2 9 1 2
3 3 1 1
I get this error regardless of whether I specify the variable to split by as c("x"), "x", or .(x). I don't understand why I'm getting this error message.
Question #2
But, what I really want to do is set up an if/else function because my dataset has variables x1, x2, x3, and x4 and I want to take those variables into account as well. But when I try something simple such as:
j <- function(x) {
  if(x == 1){
    mydf$z <- 0
  } else {
    mydf$z <- mydf$n
  }
  return(mydf)
}
mydf <- ddply(mydf, x, .fun = j, .inform = TRUE)
I get:
Warning messages:
1: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
2: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
Question #3
I'm confused about when to use function() and when to use function(x). Using function() for either j() or k() gives me a different error:
Error in .fun(piece, ...) : unused argument (piece)
Error: with piece 1:
n x x1 z
1 12 1 0 12
2 9 1 2 9
3 3 1 1 3
4 12 1 0 12
5 9 1 2 9
6 3 1 1 3
7 12 1 0 12
8 9 1 2 9
9 3 1 1 3
10 12 1 0 12
11 9 1 2 9
12 3 1 1 3
where column z is not correct. Yet I see a lot of functions written as function().
I sincerely appreciate any comments that can help me out with this.
There's a lot that needs explaining here. Let's start with the simplest case. In your first example, all you need is:
mydf$z <- with(mydf,ifelse(x == 1,0,n))
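which, for the example data, sets z to 0 where x == 1 and to n otherwise:
   n x x1  z
1 12 1  0  0
2 34 2  1 34
3  9 1  2  0
4  3 1  1  0
5 22 2  1 22
6 55 2  2 55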
An equivalent ddply solution might look like this:
ddply(mydf,.(x),transform,z = ifelse(x == 1,0,n))
Probably your biggest source of confusion is that you seem not to understand what is being passed as the argument to functions within ddply.
Consider your first attempt:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
The way ddply works is that it splits mydf up into several smaller data frames, based on the values in the column x. That means that each time ddply calls k, the argument passed to k is a data frame: specifically, a subset of your primary data frame.
So within k, x is a subset of mydf, with all the columns. You should not be trying to modify mydf from within k; modify x and then return the modified version (if you must use a separate function at all; the options I showed above are better). So we might re-write your k like this:
k <- function(x) {
  x$z <- ifelse(x$x == 1, 0, x$n)
  return(x)
}
Note that you've created some confusion by using x both as the argument to k and as the name of one of your columns.
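With k rewritten this way, the original call works as intended:
mydf <- ddply(mydf, .(x), .fun = k)  # each piece gains its z column and the pieces are re-stacked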
