I am looking for an approach to train a dynamic Bayesian network (DBN), using the package bnstruct, for a special case where data is collected from similar events. That being so, 1) I would like to train my DBN by feeding it one event at a time.
As in the real case the numbers of events, rows and columns are large, 2) it would be better if some parallel processing could be used to improve performance.
Dummy code is provided below, where all data must be fed at once, disregarding event boundaries.
library(bnstruct)
numEvents <- 40
numRows <- 5
numCols <- 3
mat <- matrix(data = rnorm(numEvents * numRows * numCols), ncol = numCols)
varNames <- paste0("var", 1:numCols)
colnames(mat) <- varNames
dataset <- BNDataset(data = mat, discreteness = rep(F, ncol(mat)), variables = varNames, node.sizes = rep(3, ncol(mat)))
dbn <- learn.dynamic.network(dataset, num.time.steps = numCols)
Thanks.
The data you are generating is treated in bnstruct as a DBN with 3 layers, each consisting of a single node. The right way of treating a dataset as a sequence of events is to consider variable X in event i as a variable distinct from the same variable X in event j, since learn.dynamic.network is just a proxy for learn.network with an implicit layering. That is, your dataset has to be constructed not by adding rows, but by adding columns.
Section 4.1.2 of the vignette has an explanation of how to learn a DBN.
The right way of constructing and using a dataset in your example is
mat <- matrix(data = rnorm(numEvents * numRows * numCols), ncol = numCols * numEvents)
varNames <- rep(paste0("var", 1:numCols), numEvents)
colnames(mat) <- varNames
dataset <- BNDataset(data = mat, discreteness = rep(F, ncol(mat)), variables = varNames, node.sizes = rep(3, ncol(mat)))
dbn <- learn.dynamic.network(dataset, num.time.steps = numEvents)
dbn will have 120 effective nodes, divided into 40 layers.
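As a quick sanity check (dag is the same accessor used further below to extract the adjacency matrix), the learned network's DAG should be a 120 x 120 matrix, one row and one column per effective node:
dim(dag(dbn)) # expected: 120 120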
Coming to the first question: one idea is to provide an initial network as a starting point for the successive time steps. Assuming the dataset at time step t+1 is obtained by adding new columns to the dataset used at time step t, you have to manually adapt the BN object to represent the larger dataset.
From the package vignette:
It is also possible to provide an initial network as starting point for the
structure search. This can be done using the initial.network argument, which
accepts three kinds of inputs:
a BN object (with a structure);
a matrix containing the adjacency matrix representing the structure of a
network;
the string random.chain for starting from a randomly sampled chain-like
network.
The simplest option is probably to keep and expand the DAG, padding it with 0s at every augmentation, so that you get a network with more nodes and no edges going into the new nodes, and to use that new DAG as the starting point. In your example:
library(bnstruct)
numEvents <- 40
numRows <- 5
numCols <- 3
mat <- matrix(data = rnorm(numRows * numCols), ncol = numCols)
varNames <- paste0("var", 1:numCols)
colnames(mat) <- varNames
dataset <- BNDataset(data = mat,
                     discreteness = rep(F, ncol(mat)),
                     variables = varNames,
                     node.sizes = rep(3, ncol(mat)))
dbn <- learn.network(dataset)
for (event in 2:numEvents) {
  # collect new data
  new.mat <- matrix(data = rnorm(numRows * numCols), ncol = numCols)
  colnames(new.mat) <- paste0(varNames, "_", event)
  mat <- cbind(mat, new.mat)
  dataset <- BNDataset(data = mat,
                       discreteness = rep(F, ncol(mat)),
                       variables = colnames(mat),
                       node.sizes = rep(3, ncol(mat)))
  # expand structure of the DBN, adding the nodes relative to the new event
  dbn.dag <- dag(dbn)
  n.nodes <- ncol(dbn.dag)
  new.dag <- matrix(0, nrow = ncol(mat), ncol = ncol(mat))
  new.dag[1:n.nodes, 1:n.nodes] <- dbn.dag
  # learn
  dbn <- learn.dynamic.network(dataset,
                               initial.network = new.dag,
                               num.time.steps = event)
}
This will, however, re-learn the whole DBN every time. If edges can go only to the immediately following layer, you can trim the search space by providing a layer.struct parameter, or by learning using two events at a time and manually building the larger DBN.
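For illustration, a rough sketch of the layer.struct route, assuming one layer per event; the exact semantics of the matrix (which index is the parent layer and which the child layer) should be checked in ?learn.network before relying on it, so this is only an assumption:
n.layers <- numEvents
layer.mat <- matrix(0, n.layers, n.layers)
diag(layer.mat) <- 1                                        # allow edges within an event
layer.mat[cbind(1:(n.layers - 1), 2:n.layers)] <- 1         # allow edges only to the next event
dbn <- learn.dynamic.network(dataset,
                             num.time.steps = n.layers,
                             layer.struct = layer.mat)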
For the second question: bnstruct does not currently provide parallel processing.
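That said, if you go the "two events at a time" route mentioned above, the pairwise learns are independent of each other and could be parallelized outside the package, e.g. with parallel::mclapply. A rough sketch, where pair.datasets is a hypothetical list of BNDataset objects, one per pair of consecutive events:
library(parallel)
# pair.datasets: hypothetical list of 2-event BNDataset objects (one per consecutive event pair)
pair.dbns <- mclapply(pair.datasets,
                      function(d) learn.dynamic.network(d, num.time.steps = 2),
                      mc.cores = 4)
# the pairwise DAGs still have to be stitched together into the full DBN manually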
I need to repeat the sampling procedure in the loop below 1000 times using a second loop.
This is the simplified code I produced for reproducibility (the inner loop):
##Number of iterations
N = 8
##Store data from inner loop in vectors
PMSE <- rep(1, N)
PolynomialDegree <- rep(1, N)
for (I in 1:N){
  PolynomialDegree[I] <- I
  PMSE[I] <- I * rnorm(1)
}
Now, using a second, outer loop, I want to repeat this "sampling procedure" 1000 times and store the data from all those vectors in a single data frame. I'm struggling to write the outer loop and was hoping for some assistance.
This is my attempt with non-reproducible code; I hope it is clear what I am attempting to do.
##Set number of iterations
N <- 8
M <- 1000
##Store data
OUTPUT <- rep(1, M)
##Outer loop starts
for (J in 1:M){
  PMSE <- rep(1, N)
  PolynomialDegree <- rep(1, N)
  sample <- sample(nrow(tempraindata), floor(nrow(tempraindata) * 0.7))
  training <- tempraindata[sample, ]
  testing <- tempraindata[-sample, ]
  ##Inner loop starts
  for (I in 1:N){
    ##Set up linear model with x polynomial of degree I x = year, y = temp
    mymodel <- lm(tem ~ poly(Year, degree = I), data = training)
    ##fit model on testing set and save predictions
    predictions <- predict(mymodel, newdata = testing, raw = FALSE)
    ##define and store PMSE
    PMSE[I] <- (1 / (nrow(tempraindata) - nrow(training))) * (sum(testing$tem - predictions))^2
    PolynomialDegree[I] <- I
  } ## End of inner loop
  OUTPUT[J] <- ##THIS IS WHERE I WANT TO SAVE THE DATA
} ##End outer loop
I want to store all the data inside OUTPUT and make it a data frame; if done correctly, it should contain 8000 values of PMSE and 8000 values of PolynomialDegree.
Avoid the bookkeeping of initializing vectors and then assigning elements by index. Consider a single sapply (or vapply) over both iterations to build the 8,000 PMSE values as a 1000 x 8 matrix: every column is then a polynomial degree and every row a training/testing split.
## Set number of iterations
N <- 8
M <- 1000
## Define a function that returns one PMSE value for a given run and degree
calc_PMSE <- function(run, degree) {
  ## Randomly build training/testing sets
  set.seed(run + degree)                      # TO REPRODUCE RANDOM SAMPLES
  sample <- sample(nrow(tempraindata), floor(nrow(tempraindata) * 0.7))
  training <- tempraindata[sample, ]
  testing <- tempraindata[-sample, ]
  ## Set up linear model with polynomial of the given degree; x = Year, y = tem
  mymodel <- lm(tem ~ poly(Year, degree = degree), data = training)
  ## Fit model on testing set and save predictions
  predictions <- predict(mymodel, newdata = testing)
  ## Return single PMSE value
  (1 / (nrow(tempraindata) - nrow(training))) *
    (sum(testing$tem - predictions))^2
}
## RETURN (1000 x 8) MATRIX: one row per run, one column per polynomial degree
PMSE_matrix <- sapply(1:N, function(deg) sapply(1:M, calc_PMSE, degree = deg))
## equivalently, type-checked with vapply
PMSE_matrix <- vapply(1:N, function(deg) vapply(1:M, calc_PMSE, numeric(1), degree = deg),
                      numeric(M))
Should you need an 8,000-row data frame of two columns, consider reshape to long format:
long_df <- reshape(
  data.frame(PMSE_matrix),
  varying = 1:8,
  timevar = "PolynomialDegree",
  v.names = "PMSE",
  ids = NULL,
  new.row.names = 1:8000,
  direction = "long"
)
I have a data table that provides the length and composition of given vectors, for example:
library(data.table)
set.seed(1)
dt <- data.table(length = c(100, 150),
                 n_A = c(30, 30),
                 n_B = c(20, 100),
                 n_C = c(50, 20))
I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:
dt_80_list <- list() # create output lists
dt_20_list <- list()
for (i in 1:nrow(dt)){ # for each row in the data.table
  # create a randomised vector with the given number of each component
  sample_vec <- sample(c(rep("A", dt$n_A[i]),
                         rep("B", dt$n_B[i]),
                         rep("C", dt$n_C[i])))
  # subset 80% of the vector
  sample_vec_80 <- sample_vec[1:floor(length(sample_vec) * 0.8)]
  # count the number of each component in the subset and output to list
  dt_80_list[[i]] <- data.table(length = length(sample_vec_80),
                                n_A = length(sample_vec_80[which(sample_vec_80 == "A")]),
                                n_B = length(sample_vec_80[which(sample_vec_80 == "B")]),
                                n_C = length(sample_vec_80[which(sample_vec_80 == "C")]))
  # subtract the counts in the 80% subset to identify the counts in the 20% subset
  dt_20_list[[i]] <- data.table(length = dt$length[i] - dt_80_list[[i]]$length,
                                n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
                                n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
                                n_C = dt$n_C[i] - dt_80_list[[i]]$n_C)
}
dt_80 <- do.call("rbind", dt_80_list) # collapse lists to output data.tables
dt_20 <- do.call("rbind", dt_20_list)
However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?
Thanks.
(I assumed your dataset consists of many more rows, but only a few columns.)
Here's a version I came up with, with three main changes:
use .N and by= to count the number of "A","B","C" drawn in each row
use the size argument in sample
join the original dt and dt_80 to calculate dt_20 without a for-loop
## draw training data
dt_80 <- dcast(
  dt[, row := 1:nrow(dt)
     ][, .(draw = sample(c(rep("A80", n_A),
                           rep("B80", n_B),
                           rep("C80", n_C)),
                         size = .8 * length)),
       by = row
     ][, .N,
       by = .(row, draw)],
  row ~ draw, value.var = "N")[, length80 := A80 + B80 + C80]
## draw test data
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80), on = "row"][, length20 := A20 + B20 + C20]
There is probably still room for optimization, but I hope it already helps :)
EDIT
Here I add my initial idea. I did not post it at first because the code above is much faster, but this one might be more memory-efficient, which seems crucial in your case. So even if you already have a working solution, this might be of interest...
library(data.table)
library(Rfast)
## add row numbers
dt[,row:=1:nrow(dt)]
## sampling function
sampfunc <- function(n_A, n_B, n_C){
  draw <- sample(c(rep("A80", n_A),
                   rep("B80", n_B),
                   rep("C80", n_C)),
                 size = .8 * (n_A + n_B + n_C))
  out <- Rfast::Table(draw)
  return(as.list(out))
}
## draw training data
dt_80 <- dt[,sampfunc(n_A,n_B,n_C),by=row]
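The 20% part can then be obtained exactly as in the faster version above, by joining back to the original dt; a sketch reusing that join:
dt_80[, length80 := A80 + B80 + C80]
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80), on = "row"][, length20 := A20 + B20 + C20]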
I tried to create a matrix from a list which consists of N unequal matrices...
The reason for doing this is to create R individual bootstrap samples.
In the example below you can find, e.g., 2 companies: one with 10 and one with just 5 observations.
Data:
set.seed(7)
Time <- c(10, 5)
xv <- matrix(c(rnorm(10, 5, 2), rnorm(5, 20, 1), rnorm(10, 5, 2), rnorm(5, 20, 1)), ncol = 2)
y <- matrix(c(rnorm(10, 5, 2), rnorm(5, 20, 1)))
z <- matrix(c(rnorm(10, 5, 2), rnorm(5, 20, 1), rnorm(10, 5, 2), rnorm(5, 20, 1)), ncol = 2)
# create data frame of input variables which helps
# to conduct the rowwise bootstrapping
data <- data.frame(y = y, xv = xv, z = z)
rows <- dim(data)[1]
cols <- dim(data)[2]
# create the index to sample from the different panels
cumTime <- c(0, cumsum(Time))
index <- findInterval(seq(1:rows), cumTime, left.open = TRUE)
# draw R = 5 individual bootstrap samples
bootList <- replicate(n = 5, list(), simplify = FALSE)
bootList <- lapply(bootList, function(x)
  by(data, INDICES = index, FUN = function(x)
    dplyr::sample_n(tbl = x, size = dim(x)[1], replace = TRUE)))
---------- UNLISTING ---------
Currently, I try to do it like this, which is incorrect:
Example for just 1 entry of the list:
matrix(unlist(bootList[[1]], recursive = T), ncol = cols)
The desired output is just
bootList[[1]]
as a matrix.
Do you have an idea how to do this and, if possible, do it reasonably efficiently?
The matrices are then processed in unfortunately slow MLE estimations...
I found a solution for you. From what I gather, you have a data frame containing all observations of all companies, which may have different panel lengths. As a result, you would like to have a bootstrap sample for each company of the same size as the original panel length.
You merely have to add a company indicator:
data$company = c(rep(1, 10), rep(2, 5)) # this could even be a factor.
L1 = split(data, data$company)
L2 = lapply(L1, FUN = function(s) s[sample(x = 1:nrow(s), size = nrow(s), replace = TRUE),] )
Stop here if you would like to have separate bootstrap samples, e.g. in case you want to estimate separately.
bootdata = do.call(rbind, L2)
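And if your MLE code expects a plain numeric matrix rather than a data frame (as in your unlist attempt), one extra line on top of that; drop the company column first if it should not enter the estimation:
bootmatrix <- data.matrix(bootdata)  # numeric matrix version of the stacked bootstrap sample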
Best wishes,
Tim
I thought that the following problem must have been answered or a function must exist to do it, but I was unable to find an answer.
I have a nested loop that takes a row from one 3-column data frame and copies it next to each of the other rows, to form a 6-column data frame (with all possible combinations). This works fine, but with a medium-sized data set (800 rows) the loops take forever to complete the task.
I will demonstrate on a sample data set:
Sdat <- data.frame(
  x = c(10, 20, 30, 40),
  y = c(15, 25, 35, 45),
  ID = c(1, 2, 3, 4)
)
compar <- data.frame(matrix(nrow = 0, ncol = 6)) # to contain all combinations
names(compar) <- c("x", "y", "ID", "x", "y", "ID")
N <- nrow(Sdat) # how many different points we have
for (i in 1:N)
{
  for (j in 1:N)
  {
    Temp1 <- Sdat[i, ] # data from 1st point
    Temp2 <- Sdat[j, ] # data from 2nd point
    C <- cbind(Temp1, Temp2)
    compar <- rbind(C, compar)
  }
}
These loops provide exactly the output that I need for further analysis. Any suggestion for vectorizing this section?
You can do:
ind <- seq_len(nrow(Sdat))
grid <- expand.grid(ind, ind)
compar <- cbind(Sdat[grid[, 1], ], Sdat[grid[, 2], ])
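If you then want unambiguous column names (your compar header uses x, y, ID twice), you can rename the result; x1/y1/ID1 are just one possible choice:
names(compar) <- c("x", "y", "ID", "x1", "y1", "ID1")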
A naive solution using rep (assuming you are happy with a data frame output):
compar <- data.frame(x = rep(Sdat$x, each = N),
                     y = rep(Sdat$y, each = N),
                     id = rep(1:N, each = N),
                     x1 = rep(Sdat$x, N),
                     y1 = rep(Sdat$y, N),
                     id_1 = rep(1:N, N))
I have a customized function (psup2) that gets data from a data frame and returns a result. The problem is that it takes a while since I am using a "for" loop that runs for every row and column.
Input:
I have a table that contains the ages (table_customers), an n*m matrix of different terms, and two different mortality tables (for males and females).
The mortality tables I'm using contain one column for ages and another one for the corresponding survival probabilities.
Output:
I want to create a separate data frame with the same size as the terms table. The function will take the data from the relevant mortality table (depending on the gender) and then apply the psup2 function, taking the ages from the customer table and the terms from the terms matrix.
Up to now I managed to create a very inefficient way to do this...but hopefully by using one of the functions from the apply family this could get faster.
The following code shows the idea of what I am trying to do:
#Function
psup2 <- function(x, age, term) {
  P1 <- 1
  for (i in 1:term) {
    P <- x[age + i, 2]
    P1 <- P1 * P
  }
  return(P1)
}
#Inputs
terms <- data.frame(V1 = c(1, 2, 3), V2 = c(1, 3, 4), V3 = c(2, 3, 4))
male <- data.frame(age = c(0, 1, 2, 3, 4, 5),
                   probability = c(0.9981, 0.9979, 0.9978, 0.994, 0.992, 0.99))
female <- data.frame(age = c(0, 1, 2, 3, 4, 5),
                     probability = c(0.9983, 0.998, 0.9979, 0.997, 0.9964, 0.995))
table_customers <- data.frame(id = c(1, 2, 3), age = c(0, 0, 0), gender = c(1, 2, 1))
#Loop
output <- data.frame(matrix(NA, nrow = 3, ncol = 0))
for (i in 1:3) {
  for (j in 1:3) {
    prob <- ifelse(table_customers[j, 3] == 1,
                   psup2(male, as.numeric(table_customers[j, 2]), as.numeric(terms[j, i])),
                   psup2(female, as.numeric(table_customers[j, 2]), as.numeric(terms[j, i])))
    output[j, i] <- prob
  }
}
Your psup2 function can be simplified into:
psup2 <- function(x, age, term) { prod(x$probability[age + (1:term)]) }
So actually we won't use it; we'll use the formula directly.
We'll put your male and female data frames next to each other, so we can use the value of the gender column to choose one or the other.
mf <- merge(male, female, by = "age") # assuming you have the same ages on both sides
input_df <- cbind(table_customers, terms)
output <- t(apply(input_df, 1, function(x) {
  sapply(1:3, function(i) {
    prod(mf[x["age"] + (1:x[3 + i]), x["gender"] + 1])
  })
}))
And that's it :)
The sapply call is used to loop over the columns of terms.
x["age"]+(1:x[3+i]) are the indices of the rows you want to multiply
x["gender"]+1 is the relevant column of the mf data.frame