Using R, I would like to build a Keras custom layer that performs 3D Region of Interest (RoI) pooling. My 3D RoI pooling function works as expected outside of an R6Class KerasLayer, but I am having trouble integrating it into a custom layer. I am not sure whether I am using autograph correctly in the example below. I am under the impression that, to use R constructs (e.g. a for loop) in the "call" method of the custom layer, I have to wrap the function in autograph.
Is the custom layer below, which applies 3D RoI pooling (without trainable weights), appropriate syntax for an R user of Keras?
The input for the 3d_RoI_Pool layer includes:
(i) output layer from a VNet model,
shape=(None, 16, 16, 40, 1)
(ii) tensor defining the "index" for cropping the input layer to the extent of each RoI,
- tf.Variable 'Variable:0' shape=(1, 1, 6) dtype=float64, numpy=array([[[ 5., 5., 4., 5., 10., 20.]]])
(iii) output dimension for the pooled 3D RoIs.
In this case it is c(1, 1, 5, 5, 5, 1) for c(Batch_Size, RoI_Size, X, Y, Z, Channels)
Below are some questions that I am unclear about:
I initially applied layer_cropping_3d to crop each RoI, but I am not sure whether a custom layer is allowed to use the existing R Keras layers. For this reason, I perform my own cropping in the code.
Is this the correct use of autograph within a custom layer?
Should I instead use layer_lambda() for such a function (i.e. wrap an arbitrary expression as a layer), given that I do not have trainable weights in the 3D RoI pooling layer?
Below is the Code that gets the output layer from a V-Net CNN and prepares inputs for the custom layer:
# FINAL LAYERS OF V_Net
Output_shortcut = up1_DeConv
Final_Conv <- up1_Concat %>%
# 1st
layer_conv_3d(filters = Filter_Start, kernel_size = Kernel_Conv_Size, padding = "same", name = "Final_Conv") %>%
layer_batch_normalization() %>%
layer_activation("relu")
Final_Conv <- layer_add(c(Final_Conv, Output_shortcut), name = "Final_Conv_ResNet")
# PREPARE INPUT FOR CUSTOM LAYER
n_RoI <- 1
Batch_Size <- 1
Table_RoI <- list(list(c(5, 5), c(4, 5), c(10, 20)))
Table_RoI <- array_reshape(unlist(Table_RoI), c(Batch_Size,n_RoI,6), order="F")
Table_RoI <- tf$Variable(Table_RoI, tf$int16)
output_dim <- c(Batch_Size, n_RoI, 5,5,5, as.numeric(dim(Final_Conv)[5]))
# FEED CUSTOM LAYER TO CNN
Final_Conv2 <- Final_Conv %>% layer_3D_ROI_pooled(Table_RoI=Table_RoI, output_dim= output_dim)
Below is the actual 3D RoI Pooling Custom Layer that generates an error:
# 3D ROIpooled_Layer (R KERAS )
ROIpooled_Layer <- R6::R6Class("KerasLayer",
inherit = KerasLayer,
public = list(
Table_RoI = NULL,
output_dim = NULL,
###############
# INITIALISE
###############
initialize = function(Table_RoI, output_dim) {
self$Table_RoI = Table_RoI
self$output_dim = output_dim
},
###############
# CALL FUNCTION ROIpooled_Function <-
###############
call = autograph(function(x, mask = NULL,
Table_RoI,
output_dim) {
## Input_L ROI_Table
n_Batch <- output_dim[1]
n_RoI <- output_dim[2]
Channels <- output_dim[6]
for(r in 1:n_RoI){
# layer_cropping_3d(Input_L, cropping = list(list(as.numeric(Table_RoI[,r,1]), as.numeric(Table_RoI[,r,2])),
# list(as.numeric(Table_RoI[,r,3]), as.numeric(Table_RoI[,r,4])),
# list(as.numeric(Table_RoI[,r,5]), as.numeric(Table_RoI[,r,6]))))
RoI_Cropped <-x[,(as.numeric(Table_RoI[,r,1])+1):(dim(feature_map_T)[2]-as.numeric(Table_RoI[,r,2])),
(as.numeric(Table_RoI[,r,3])+1):(dim(feature_map_T)[3]-as.numeric(Table_RoI[,r,4])),
(as.numeric(Table_RoI[,r,5])+1):(dim(feature_map_T)[4]-as.numeric(Table_RoI[,r,6])),]
RoI_X_Res <- as.array(k_shape(RoI_Cropped)[2])
RoI_Y_Res <- as.array(k_shape(RoI_Cropped)[3])
RoI_Z_Res <- as.array(k_shape(RoI_Cropped)[4])
New_X_Res <- as.array(output_dim[3])
New_Y_Res <- as.array(output_dim[4])
New_Z_Res <- as.array(output_dim[5])
X_step = RoI_X_Res / New_X_Res
Y_step = RoI_Y_Res / New_Y_Res
Z_step = RoI_Z_Res / New_Z_Res
for(ch in 1:Channels) {
print(paste("ch", ch))
for (k in 1:New_Z_Res) {
print(paste("k", k))
for (j in 1:New_Y_Res) {
print(paste("j", j, "k", k))
for (i in 1:New_X_Res) {
# INDEX X
Index_Xstart <- floor((i-1)*X_step+1)
if(i+1 <= RoI_X_Res){
Index_Xend <- floor((i)*X_step)
}else{
Index_Xend <- RoI_X_Res
}
# INDEX Y
Index_Ystart <- floor((j-1)*Y_step+1)
if(j+1 <= RoI_Y_Res){
Index_Yend <- floor((j)*Y_step)
}else{
Index_Yend <-RoI_Y_Res
}
# INDEX Z
Index_Zstart <- floor((k-1)*Z_step+1)
if(k+1 <= RoI_Z_Res){
Index_Zend <- floor((k)*Z_step)
}else{
Index_Zend <-RoI_Z_Res
}
Max_Pool_X_Value <- as.array(k_max(RoI_Cropped[n_Batch,Index_Xstart:Index_Xend, Index_Ystart:Index_Yend, Index_Zstart:Index_Zend,ch])) # ADD BATCH AND CHANNEL LAYERS
RoI_Pooled_Array[,r,i,j,k,ch] <- Max_Pool_X_Value
}# i LOOP
} # j LOOP
} # k Loop
} #Ch LOOP
} # r LOOP (ROI)
feature_map_ROIpooled <- tf$Variable(RoI_Pooled_Array, tf$int16) # ??? NOT SURE IF RETURN NEEDS TO BE A VARIABLE
return (feature_map_ROIpooled)
}), # END OF AUTO, # END OF CALL ... AUTOGRAPH FUNCTION
##############
# OUTPUT SHAPE
##############
compute_output_shape = function(input_shape) {
list(self$output_dim)
}
)
)
#
###############################
# Create layer wrapper function
###############################
layer_3D_ROI_pooled <- function(object, Table_RoI, output_dim, name = NULL, trainable = TRUE) {
create_layer(ROIpooled_Layer, object, list(Table_RoI = Table_RoI,
output_dim = as.integer(output_dim),
name = name,
trainable = FALSE
))
}
ERROR:
Error in value[[3L]](cond) :
The R function's signature must not contains esoteric Python-incompatible constructs. Detailed traceback: SyntaxError: non-default argument follows default argument (<string>, line 3)
Any help/insight/clarity would be appreciated.
Kind Regards,
Dom
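I'm not sure if this is the correct protocol for answering my own question, but I think I have a working 3D RoI pooling custom layer to share. There were many mistakes in the code above. The call method now takes only x and mask = NULL and reads the RoI list and output dimensions from self, with no autograph wrapper; I think the signature error came from the extra arguments without defaults that followed mask = NULL. The most notable change is in the for loop: I think I needed to first generate a list of tensors, one per pooled RoI element, and then reshape it to the desired output shape.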
################################################################################################
# ROI_3D_pooled_Layer (Custom layer class)
##########################################
ROI_3D_pooled_Layer <- R6::R6Class("KerasLayer",
inherit = KerasLayer,
public = list(
List_RoI = NULL,
output_dim = NULL,
initialize = function(List_RoI, output_dim) {
self$List_RoI = List_RoI
self$output_dim = output_dim
},
call = function(x, mask = NULL) {
List_RoI <- self$List_RoI
output_dim <- self$output_dim
# EXTRACT INFORMATION ON OUTPUT DIMENSION
n_Batch <- as.integer(output_dim[1])
n_RoIs <- as.integer(output_dim[2])
n_Channels <- as.integer(output_dim[6])
New_X_Res <- as.numeric(output_dim[3])
New_Y_Res <- as.numeric(output_dim[4])
New_Z_Res <- as.numeric(output_dim[5])
input_shape <- dim(x)
# EMPTY LIST TO STORE TENSORS
output_list = list()
for(r in 1:n_RoIs){ # LOOP RoIs
# GET one RoI AND CROP INPUT LAYER
if(n_RoIs > 1){
oneList_RoI <- List_RoI[[r]]
}else{
oneList_RoI <- List_RoI
}
RoI_Cropped <-x[,(oneList_RoI[[1]][1]+1):(as.numeric(input_shape[2])-oneList_RoI[[1]][2]),
(oneList_RoI[[2]][1]+1):(as.numeric(input_shape[3])-oneList_RoI[[2]][2]),
(oneList_RoI[[3]][1]+1):(as.numeric(input_shape[4])-oneList_RoI[[3]][2]),] # Z end uses the second crop value, as for X and Y
# GET RoI Dimensions for XYZ
RoI_X_Res <- as.numeric(dim(RoI_Cropped)[2])
RoI_Y_Res <- as.numeric(dim(RoI_Cropped)[3])
RoI_Z_Res <- as.numeric(dim(RoI_Cropped)[4])
# CALCULATE STEPS IN ALL DIMENSIONS FOR POOLING
X_step = RoI_X_Res / New_X_Res
Y_step = RoI_Y_Res / New_Y_Res
Z_step = RoI_Z_Res / New_Z_Res
for(ch in 1:n_Channels) { # LOOP CHANNEL
for (k in 1:New_Z_Res) { # LOOP Z
for (j in 1:New_Y_Res) { # LOOP Y
for (i in 1:New_X_Res) { # LOOP X
# INDEX X
Index_Xstart <- floor((i-1)*X_step+1)
if(i+1 <= RoI_X_Res){
Index_Xend <- floor((i)*X_step)
}else{
Index_Xend <- RoI_X_Res
}
# INDEX Y
Index_Ystart <- floor((j-1)*Y_step+1)
if(j+1 <= RoI_Y_Res){
Index_Yend <- floor((j)*Y_step)
}else{
Index_Yend <-RoI_Y_Res
}
# INDEX Z
Index_Zstart <- floor((k-1)*Z_step+1)
if(k+1 <= RoI_Z_Res){
Index_Zend <- floor((k)*Z_step)
}else{
Index_Zend <-RoI_Z_Res
}
# MAX POOL VOLUME FOR EACH ELEMENT IN FINAL LAYER AND PUT IN EMPTY ARRAY
Max_Pool_X_Value <-k_max(RoI_Cropped[,Index_Xstart:Index_Xend, Index_Ystart:Index_Yend, Index_Zstart:Index_Zend,ch])
# APPEND EACH RoI_Pooled element into a list
output_list[[length(output_list) + 1]] <- Max_Pool_X_Value # base R append; list.append() would need library(rlist)
}# i LOOP
} # j LOOP
} # k Loop
} #Ch LOOP
} # r LOOP (ROI)
# STACK THE OUTPUT LIST AND RESHAPE TO THE DESIRED OUTPUT SIZE
output_Stack <- k_stack(output_list, axis = 1)
feature_map_ROIpooled <- k_reshape(output_Stack, shape = c(n_Batch, n_RoIs, New_X_Res, New_Y_Res, New_Z_Res, n_Channels))
return (feature_map_ROIpooled)
},
compute_output_shape = function(input_shape) {
return(self$output_dim)
}
)
)
# Create layer wrapper function
layer_3D_ROI_pooled <- function(object, List_RoI, output_dim) {
create_layer(ROI_3D_pooled_Layer, object, list(List_RoI = List_RoI,
output_dim = as.integer(output_dim)
))
}
This is my first attempt with a Keras Custom Layer (using R) so please provide suggestions for improvement.
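The bin indexing used in the loops above can be checked on an ordinary R array, outside of Keras. The following is only a minimal sketch under stated assumptions: roi_max_pool_3d() is a hypothetical helper name, it works on one cropped region with a single channel, and it uses integer division for the bin edges rather than floor() on a fractional step, so that the last bin always ends exactly at the region boundary.

```r
# Max-pool one cropped 3D region (a plain R array) to a fixed output size.
# roi_max_pool_3d() is a hypothetical helper, not part of keras or tensorflow.
roi_max_pool_3d <- function(roi, out_dim) {
  stopifnot(length(dim(roi)) == 3, length(out_dim) == 3,
            all(dim(roi) >= out_dim))

  # Bin b along one axis covers start[b]..end[b]; integer division avoids
  # floating point rounding in the step size.
  bin_edges <- function(n_in, n_out) {
    b <- seq_len(n_out)
    list(start = ((b - 1) * n_in) %/% n_out + 1,
         end   = (b * n_in) %/% n_out)
  }
  ex <- bin_edges(dim(roi)[1], out_dim[1])
  ey <- bin_edges(dim(roi)[2], out_dim[2])
  ez <- bin_edges(dim(roi)[3], out_dim[3])

  pooled <- array(NA_real_, dim = out_dim)
  for (i in seq_len(out_dim[1])) {
    for (j in seq_len(out_dim[2])) {
      for (k in seq_len(out_dim[3])) {
        # Maximum over the sub-volume that belongs to output cell (i, j, k)
        pooled[i, j, k] <- max(roi[ex$start[i]:ex$end[i],
                                   ey$start[j]:ey$end[j],
                                   ez$start[k]:ez$end[k]])
      }
    }
  }
  pooled
}

# Made-up 4 x 4 x 6 region pooled down to 2 x 2 x 3
roi <- array(seq_len(4 * 4 * 6), dim = c(4, 4, 6))
pooled <- roi_max_pool_3d(roi, c(2, 2, 3))
dim(pooled)
```

Comparing the output of such a reference function with the layer output for a small input is one way to confirm that the stacking and reshaping order in the layer matches the intended (X, Y, Z) layout.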
I have a dataframe that has two columns, x and y (both populated with numbers). I am trying to look at a moving window within the data, and I've done it like this (source):
# Extract just x and y from the original data frame
df <- dat_fin %>% select(x, y)
# Moving window creation
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
This lapply creates a list of tibbles that are each 10 (x, y) pairs. At this point, I am trying to compute a single quantity using each of the sets of 10 pairs; my current (not working) code looks like this:
library(shotGroups)
for (f in 1:length(windfs)) {
tsceps[f] = getCEP(windfs[f], accuracy = TRUE)
}
When I run this, I get the error:
Error in getCEP.default(windfs, accuracy = TRUE) : xy must be numeric
My goal is that the variable that I've called tsceps should be a 1 x length(windfs) data frame, each value in which comes from the getCEP calculation for each of the windowed subsets.
I've tried various things with unnest and unlist, all of which were unsuccessful.
What am I missing?
Working code:
df <- dat_fin %>% select(x, y)
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
tsceps <- vector(mode = "numeric", length = length(windfs))
library(shotGroups)
for (j in 1:length(windfs)) {
tsceps[j] <- getCEP(windfs[[j]], type = "CorrNormal", CEPlevel = 0.50, accuracy = TRUE)
}
ults <- unlist(tsceps)
ults_cep <- vector(mode = "numeric", length = length(ults))
for (k in 1:length(ults)) {
ults_cep[k] <- ults[[k]]
}
To get this working with multiple type arguments to getCEP, just use additional code blocks for each type required.
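The key difference from the failing loop is the indexing: windfs[f] returns a list of length one, whereas windfs[[j]] returns the data frame itself. A small self-contained sketch of that difference follows; the data are random and window_stat() is a hypothetical stand-in for getCEP(), so shotGroups is not needed to run it.

```r
set.seed(42)
df <- data.frame(x = rnorm(25), y = rnorm(25))  # made-up (x, y) data

window_size <- 10
starts <- seq_len(nrow(df) - window_size + 1)
windows <- lapply(starts, function(i) df[i:(i + window_size - 1), ])

class(windows[1])    # single brackets keep the list wrapper
class(windows[[1]])  # double brackets extract the data frame

# Hypothetical stand-in for getCEP(): mean distance to the window centre
window_stat <- function(xy) {
  stopifnot(is.data.frame(xy))
  mean(sqrt((xy$x - mean(xy$x))^2 + (xy$y - mean(xy$y))^2))
}

# vapply() passes each list element as if extracted with [[ ]]
stats <- vapply(windows, window_stat, numeric(1))
length(stats)
```

Using vapply() also removes the need to pre-allocate the result vector and to unlist it afterwards.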
My task is to recreate the result of this for loop by using the map() function. Unfortunately, I can't get my head around this.
play_roulette <- function(bet, number) {
draw <- sample(0:36, 1)
tibble(
winning_number = draw,
your_number = number,
your_bet = bet,
your_return = if (number == draw) {
bet * 36
} else {
0
}
)
}
desired_length <- 10
list_w_for <- vector(mode = "list", length = desired_length)
for(i in seq_along(list_w_for)){
list_w_for[[i]] <- play_roulette(bet = 1, number = 5)
}
for_tibble <- bind_rows(list_w_for)
for_tibble
My current map code:
num_vec <- 1:10
bet_vec <- 1
tibble_2c <- tibble(x= bet_vec, y= num_vec)
map_dfc( tibble_2c,
play_roulette(bet = x, number = y))
You just have to call the function 10 times, since the iterator i is not used inside of the function.
# use *_dfr to row_bind the result
map_dfr(
# call the function ten times
1:10,
# note that .x, the default iterator in map, is not needed inside of the function
~play_roulette(bet = 1, number = 5))
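For comparison, the same "call it ten times and row-bind" idea can be sketched in base R with replicate(). This is an illustration only: play_roulette_df() is a hypothetical variant of the function above that returns a data.frame instead of a tibble, so the snippet runs without tidyverse packages.

```r
# Hypothetical data.frame variant of play_roulette(), for a dependency-free sketch
play_roulette_df <- function(bet, number) {
  draw <- sample(0:36, 1)
  data.frame(
    winning_number = draw,
    your_number    = number,
    your_bet       = bet,
    your_return    = if (number == draw) bet * 36 else 0
  )
}

set.seed(1)
# simplify = FALSE keeps one data frame per spin in a list
spins <- replicate(10, play_roulette_df(bet = 1, number = 5), simplify = FALSE)
result <- do.call(rbind, spins)
nrow(result)
```

As with map_dfr(), nothing is iterated over except the number of repetitions.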
I am using the package depmixS4 to fit a HMM on time-series data. Here is an example with some high vol and low vol data.
In the getpars function we can see the parameter value estimates.
What is happening is that sometimes the first two values in the density are the low vol state and sometimes the second two values are the low vol state. Is there any way to fix this (maybe by setting initial priors)?
set.seed(1)
a <- data.frame(v1 = c(rnorm(n = 100, sd = 10), rnorm(n=100, sd = 1)))
a <- sample(a)
my_model <- depmixS4::depmix(response = v1 ~ 1 , nstates = 2 , data = a)
fitted_model <- depmixS4::fit(my_model)
getpars(fitted_model)
for (i in 100:200) {
my_model2 <- depmixS4::depmix(response = v1 ~ 1 , nstates = 2 , data = a[1:i, , drop = FALSE])
fitted_model2 <- depmixS4::fit(my_model2)
pars <- getpars(fitted_model2)
if (pars[8] > 8) {
print(i)
}
}
This is called label switching.
Models in which you swap the label of states (e.g., relabel state 1 as state 2 and state 2 as state 1) have the same likelihood and hence are both valid maximum likelihood solutions.
You can try to "fix" this issue by:
setting initial values for the parameters (which make it more likely that the EM algorithm will converge to a particular solution, although this is not guaranteed!);
or by setting order constraints (e.g. forcing the mean for state 1 to be larger than the mean for state 2). Such constraints can be supplied to the fit method in depmixS4 (see examples in ?fit);
a final option is to switch the labels of a fitted depmixS4 object.
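To illustrate the relabelling idea without depending on depmixS4, here is a minimal sketch that works on a plain table of per-state estimates. The numbers are made up, order_states_by_sd() is a hypothetical helper, and "state 1 is the low-volatility state" is just one possible convention.

```r
# Hypothetical estimates from two fits of the same 2-state model;
# the fits agree except that the state labels are swapped.
fit_a <- data.frame(state = 1:2, mean = c(0.1, -0.3), sd = c(1.0, 9.8))
fit_b <- data.frame(state = 1:2, mean = c(-0.3, 0.1), sd = c(9.8, 1.0))

# Put states into a canonical order: smallest sd becomes state 1
order_states_by_sd <- function(est) {
  est <- est[order(est$sd), ]
  est$state <- seq_len(nrow(est))
  rownames(est) <- NULL
  est
}

# After reordering, both fits describe the states identically
same <- isTRUE(all.equal(order_states_by_sd(fit_a), order_states_by_sd(fit_b)))

# New label for each original state of fit_b (original state 1 becomes 2, etc.)
new_labels <- order(order(fit_b$sd))
```

A permutation such as new_labels has the form "new label for original state 1, new label for original state 2, ...", which is the form a relabelling function for a fitted model would need.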
Here is a function to relabel a fitted depmix object I have used before (not tested well though!):
label_switch <- function(mod,labels) {
  # labels is vector, first element is new integer label for original state integer 1, second is new integer label for original state integer 2, etc.
  if(!is(mod,"depmix") || !is(mod,"depmix.fitted")) stop("this function is for depmix models")
  n_states <- mod@nstates
  if(length(labels) != n_states || length(unique(labels)) != n_states || !all(labels %in% 1:n_states)) {
    stop("labels needs to be a vector of unique integers between 1 and ", n_states)
  }
  inv_labels <- sapply(1:n_states,function(x) which(labels == x))
  tmp <- mod
  # relabel prior
  ppars <- getpars(mod@prior)
  fpars <- getpars(mod@prior,which="fixed")
  out_pars <- as.numeric(t(matrix(ppars,nrow=length(ppars)/n_states,byrow = TRUE)[,inv_labels]))
  out_fixed <- as.logical(t(matrix(fpars,nrow=length(fpars)/n_states,byrow = TRUE)[,inv_labels]))
  if(!tmp@prior@family$link=="identity") tmp@prior@family$base <- labels[tmp@prior@family$base]
  # relabel transition
  for(i in 1:n_states) {
    ppars <- getpars(mod@transition[[inv_labels[i]]])
    fpars <- getpars(mod@transition[[inv_labels[i]]],which="fixed")
    out_pars <- c(out_pars,as.numeric(t(matrix(ppars,nrow=length(ppars)/n_states,byrow = TRUE)[,inv_labels])))
    out_fixed <- c(out_fixed,as.logical(t(matrix(fpars,nrow=length(fpars)/n_states,byrow = TRUE)[,inv_labels])))
    tmp@transition[[i]] <- mod@transition[[inv_labels[i]]]
    if(!tmp@transition[[i]]@family$link=="identity") tmp@transition[[i]]@family$base <- labels[tmp@transition[[i]]@family$base]
    #out_pars <- c(out_pars,getpars(mod@transition[[inv_labels[i]]]))
  }
  # relabel response
  for(i in 1:n_states) {
    out_pars <- c(out_pars,unlist(lapply(mod@response[[inv_labels[i]]],getpars)))
    out_fixed <- c(out_fixed,unlist(lapply(mod@response[[inv_labels[i]]],getpars,which="fixed")))
  }
  tmp <- setpars(tmp,out_fixed,which="fixed")
  tmp <- setpars(tmp,out_pars)
  if(is(tmp,"depmix.fitted")) tmp@posterior <- viterbi(tmp)
  return(tmp)
}
I am trying to generate a gexf file using igraph, but unfortunately I have a problem with the layout. How can I solve it to get a good graph like the second one?
First image created with 1000 nodes but second one with 500
gD <- simplify(graph.data.frame(dataSet, directed=FALSE))
# Print number of nodes and edges
#vcount(gD)
#ecount(gD)
############################################################################################
# Calculate some node properties and node similarities that will be used to illustrate
# different plotting abilities
# Calculate degree for all nodes
degAll <- igraph::degree(gD, v = V(gD), mode = "all")
# Calculate betweenness for all nodes
#tnet::betweenness_w(data.frame(V1 = dataSet$V1,V2=dataSet$V2, V3 =dataSet$V3 ))
betAll <- igraph::betweenness(gD, v = V(gD),normalized=TRUE)
betAll <- betAll
#options("scipen"=100000000)
#betweenness()
#betAll.norm <- betAll / 100000000000000
betAll.norm <-betAll
#betAll.norm <- (betAll - min(betAll))/(max(betAll) - min(betAll))
rm(betAll)
# Calculate Dice similarities between all pairs of nodes
dsAll <- similarity.dice(gD, vids = V(gD), mode = "all")
############################################################################################
# Add new node/edge attributes based on the calculated node properties/similarities
newdataSet1 <- data.frame(username = dataSet$V1,gender = dataSet$V4)
newdataSet2 <- data.frame(username = dataSet$V2,gender = dataSet$V4)
newdataSet<-rbind(newdataSet1,newdataSet2)
genderdata<-subset(newdataSet,!duplicated(newdataSet$username))
gD <- set.vertex.attribute(gD, "degree", index = V(gD), value = degAll)
gD <- set.vertex.attribute(gD, "betweenness", index = V(gD), value = betAll.norm)
gD <- set.vertex.attribute(gD, "gender", index = V(gD), value = genderdata$gender)
newdataSet1 <- data.frame(username = dataSet$V1,sentiment = dataSet$V5)
newdataSet2 <- data.frame(username = dataSet$V2,sentiment = dataSet$V5)
newdataSet<-rbind(newdataSet1,newdataSet2)
sentimentdata<-subset(newdataSet,!duplicated(newdataSet$username))
gD <- set.vertex.attribute(gD, "sentiment", index = V(gD), value = sentimentdata$sentiment)
# Check the attributes
# summary(gD)
F1 <- function(x) {data.frame(V4 = dsAll[which(V(gD)$name == as.character(x$V1)), which(V(gD)$name == as.character(x$V2))])}
cl = createCluster(6, export = list("F1","dsAll","gD"), lib = list("igraph","plyr"))
system.time(dataSet.ext <- ddply(dataSet, .variables=c("V1", "V2", "V3"), function(x) data.frame(F1(x)),.parallel = TRUE))
#res = ddply(dat, .(category), bla, .parallel = TRUE)
stopCluster(cl)
gD <- set.edge.attribute(gD, "weight", index = E(gD), value = 0)
gD <- set.edge.attribute(gD, "similarity", index = E(gD), value = 0)
# The order of interactions in gD is not the same as it is in dataSet or as it is in the edge list,
# and for that reason these values cannot be assigned directly
#length(E(gD)[as.character(dataSet.ext$V1) %--% as.character(dataSet.ext$V2)]$weight )
E(gD)[as.character(dataSet.ext$V1) %--% as.character(dataSet.ext$V2)]$weight <- as.numeric(dataSet.ext$V3)
E(gD)[as.character(dataSet.ext$V1) %--% as.character(dataSet.ext$V2)]$similarity <- as.numeric(dataSet.ext$V4)
# Check the attributes
# summary(gD)
####################################
# Print network in the file format ready for Gephi
# This requires rgexf package
# Create a dataframe nodes: 1st column - node ID, 2nd column -node name
nodes_df <- data.frame(ID = c(1:vcount(gD)), NAME = V(gD)$name)
# Create a dataframe edges: 1st column - source node ID, 2nd column -target node ID
edges_df <- as.data.frame(get.edges(gD, c(1:ecount(gD))))
# Define node and edge attributes - these attributes won't be directly used for network visualization, but they
# may be useful for other network manipulations in Gephi
#
# Create a dataframe with node attributes: 1st column - attribute 1 (degree), 2nd column - attribute 2 (betweenness)
nodes_att <- data.frame(DEG = V(gD)$degree, BET = V(gD)$betweenness,gender = V(gD)$gender,sentement = V(gD)$sentiment)
#
# Create a dataframe with edge attributes: 1st column - attribute 1 (weight), 2nd column - attribute 2 (similarity)
edges_att <- data.frame(WGH = E(gD)$weight, SIM = E(gD)$similarity)
# Define node/edge visual attributes - these attributes are the ones used for network visualization
#
# Calculate node coordinate - needs to be 3D
#nodes_coord <- as.data.frame(layout.fruchterman.reingold(gD, weights = E(gD)$similarity, dim = 3, niter = 10000))
# We'll cheat here, as 2D coordinates result in a better (2D) plot than 3D coordinates
nodes_coord <- as.data.frame(layout.fruchterman.reingold(gD))
nodes_coord <- cbind(nodes_coord, rep(0, times = nrow(nodes_coord)))
#
# Calculate node size
# We'll interpolate node size based on the node betweenness centrality, using the "approx" function
uniqueNess<-unique(V(gD)$betweenness)
approxVals <- approx(c(1, 5), n = length(unique(V(gD)$betweenness)))
# And we will assign a node size for each node based on its betweenness centrality
nodes_size <- sapply(V(gD)$betweenness, function(x) approxVals$y[which(sort(unique(V(gD)$betweenness)) == x)])
#
# Define node color
# We'll interpolate node colors based on the node degree using the "colorRampPalette" function from the "grDevices" library
# This function returns a function corresponding to a color palette of "bias" number of elements
F2 <- colorRampPalette(c("#F5DEB3", "#FF0000"), bias = length(unique(V(gD)$degree)), space = "rgb", interpolate = "linear")
# Now we'll create a color for each degree
colCodes <- F2(length(unique(V(gD)$degree)))
#################test parallel####################
cl = createCluster(6, export = list("F2","dsAll","gD","colCodes"), lib = list("igraph","plyr"))
system.time(nodes_col<-parSapply(cl,V(gD)$degree, function(x) colCodes[which(sort(unique(V(gD)$degree)) == x)]))
#res = ddply(dat, .(category), bla, .parallel = TRUE)
stopCluster(cl)
#############################
# And we will assign a color for each node based on its degree
#nodes_col <- sapply(V(gD)$degree, function(x) colCodes[which(sort(unique(V(gD)$degree)) == x)])
# Transform it into a data frame (we have to transpose it first)
nodes_col_df <- as.data.frame(t(col2rgb(nodes_col, alpha = FALSE)))
# And add alpha (between 0 and 1). The alpha from "col2rgb" function takes values from 0-255, so we cannot use it
nodes_col_df <- cbind(nodes_col_df, alpha = rep(1, times = nrow(nodes_col_df)))
# Assign visual attributes to nodes (colors have to be 4dimensional - RGBA)
nodes_att_viz <- list(color = nodes_col_df, position = nodes_coord, size = nodes_size)
# Assign visual attributes to edges using the same approach as we did for nodes
F2 <- colorRampPalette(c("#FFFF00", "#006400"), bias = length(unique(E(gD)$weight)), space = "rgb", interpolate = "linear")
colCodes <- F2(length(unique(E(gD)$weight)))
#################test parallel####################
cl = createCluster(12, export = list("F2","dsAll","gD","colCodes"), lib = list("igraph","plyr"))
system.time(edges_col<-parSapply(cl,E(gD)$weight, function(x) colCodes[which(sort(unique(E(gD)$weight)) == x)]))
stopCluster(cl)
#############################
#edges_col <- sapply(E(gD)$weight, function(x) colCodes[which(sort(unique(E(gD)$weight)) == x)])
edges_col_df <- as.data.frame(t(col2rgb(edges_col, alpha = FALSE)))
edges_col_df <- cbind(edges_col_df, alpha = rep(1, times = nrow(edges_col_df)))
edges_att_viz <-list(color = edges_col_df)
# Write the network into a gexf (Gephi) file
#write.gexf(nodes = nodes_df, edges = edges_df, nodesAtt = nodes_att, edgesWeight = E(gD)$weight, edgesAtt = edges_att, nodesVizAtt = nodes_att_viz, edgesVizAtt = edges_att_viz, defaultedgetype = "undirected", output = "lesmis.gexf")
# And without edge weights
write.gexf(nodes = nodes_df, edges = edges_df, nodesAtt = nodes_att, edgesAtt = edges_att, nodesVizAtt = nodes_att_viz, edgesVizAtt = edges_att_viz, defaultedgetype = "undirected", output = "arctic.gexf")
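The node size and color lookups in the script map each value to its position among the sorted unique values, using sapply() with which() (and a cluster for the colors). As a sketch only, the same lookup can be written with match(), which is vectorised and may make the parallel step unnecessary. The betweenness values below are made up and the variable names are illustrative, not taken from the script.

```r
# Made-up centrality values for four nodes
betweenness_vals <- c(0.2, 0, 0.5, 0.2)

# One size per distinct value, interpolated between 1 and 5 as in the script
levels_sorted <- sort(unique(betweenness_vals))
size_scale <- approx(c(1, 5), n = length(levels_sorted))$y

# match() gives the rank of each value among the sorted unique values
node_size <- size_scale[match(betweenness_vals, levels_sorted)]

# The same lookup pattern works for a color ramp
col_scale <- colorRampPalette(c("#F5DEB3", "#FF0000"))(length(levels_sorted))
node_col <- col_scale[match(betweenness_vals, levels_sorted)]
```

Nodes with equal values get equal sizes and colors, which is the behaviour of the original which()-based lookup.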
I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?
test.df <- data.frame(a = sample(c(0,1),100, rep = T),
b = factor(sample(letters, 100, rep = T)),
c = factor(sample(c("apple", "orange"), 100, rep = T)))
set.seed(123)
partition <- function(data, train.size = .7){
train <- data[sample(1:nrow(data), round(train.size*nrow(data)), rep= FALSE), ]
test <- data[-as.numeric(row.names(train)), ]
partitioned.data <- list(train = train, test = test)
return(partitioned.data)
}
part.data <- partition(test.df)
table(part.data$train[,'b'])
table(part.data$test[,'b'])
EDIT - New function using 'caret' package and createDataPartition():
partition <- function(data, factor=NULL, train.size = .7){
if (("package:caret" %in% search()) == FALSE){
stop("Install and Load 'caret' package")
}
if (is.null(factor)){
train.index <- createDataPartition(as.numeric(row.names(data)),
times = 1, p = train.size, list = FALSE)
train <- data[train.index, ]
test <- data[-train.index, ]
}
else{
train.index <- createDataPartition(factor,
times = 1, p = train.size, list = FALSE)
train <- data[train.index, ]
test <- data[-train.index, ]
}
partitioned.data <- list(train = train, test = test)
return(partitioned.data)
}
Try the caret package, particularly the function createDataPartition(). It should do exactly what you need. It is available on CRAN, and the homepage is here:
caret - data splitting
The function I mentioned is partly code I found on the net a while back, which I then modified slightly to better handle edge cases (such as asking for a sample size larger than the set, or than a subset).
stratified <- function(df, group, size) {
# USE: * Specify your data frame and grouping variable (as column
# number) as the first two arguments.
# * Decide on your sample size. For a sample proportional to the
# population, enter "size" as a decimal. For an equal number
# of samples from each group, enter "size" as a whole number.
#
# Example 1: Sample 10% of each group from a data frame named "z",
# where the grouping variable is the fourth variable, use:
#
# > stratified(z, 4, .1)
#
# Example 2: Sample 5 observations from each group from a data frame
# named "z"; grouping variable is the third variable:
#
# > stratified(z, 3, 5)
#
require(sampling)
temp = df[order(df[group]),]
colsToReturn <- ncol(df)
#Don't want to attempt to sample more than possible
dfCounts <- table(df[group])
if (size > min(dfCounts)) {
size <- min(dfCounts)
}
if (size < 1) {
size = ceiling(table(temp[group]) * size)
} else if (size >= 1) {
size = rep(size, times=length(table(temp[group])))
}
strat = strata(temp, stratanames = names(temp[group]),
size = size, method = "srswor")
(dsample = getdata(temp, strat))
dsample <- dsample[order(dsample[1]),]
dsample <- data.frame(dsample[,1:colsToReturn], row.names=NULL)
return(dsample)
}
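For readers who want neither caret nor sampling, the requirement from the question (at least one training observation per factor level) can be sketched in base R by sampling within each level. This is an assumption-laden illustration: partition_by_level() is a hypothetical name, it stratifies on a single factor column only, and levels with a single observation end up in the training set alone.

```r
# Hypothetical helper: split rows into train/test, sampling within each level
partition_by_level <- function(data, factor_col, train.size = 0.7) {
  # Row indices grouped by the levels that actually occur
  rows_by_level <- split(seq_len(nrow(data)), data[[factor_col]], drop = TRUE)

  train_idx <- unlist(lapply(rows_by_level, function(rows) {
    # Take at least one row per level, otherwise the requested proportion
    n_take <- max(1, round(train.size * length(rows)))
    rows[sample.int(length(rows), n_take)]
  }), use.names = FALSE)

  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

set.seed(123)
test.df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                      b = factor(sample(letters, 100, replace = TRUE)))
part <- partition_by_level(test.df, "b")
table(part$train$b)
```

Stratifying on several factors at once would need an interaction of the columns, for example via interaction(), which this sketch does not attempt.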