I am using the CRAN package DMwR to deal with the problem of imbalanced data.
The code is the following:
require(DMwR)
dm = read.table("C:/data/exampleData.txt", sep=",")
ncols<-ncol(dm)
dm<-cbind(dm[2:ncols],dm[1])
dmSmote<-SMOTE(target ~ . , dm,k=5,perc.over = 1400,perc.under=140)
dm<-cbind(dmSmote[ncols],dmSmote[1:(ncols-1)])
Data:
5.901487,5.176487,1
6.917943,3.979710,0
5.247007,3.628324,1
5.157673,6.212658,0
4.836749,3.978392,0
4.590970,5.547353,0
3.895904,5.350865,0
4.312977,3.853151,0
5.844978,5.450767,0
4.009195,5.108031,0
Column 1 = variable 1, column 2 = variable 2, column 3 = Class
I am getting the following error: attempt to change an attribute to NULL
Link to the library documentation: http://cran.fhcrc.org/web/packages/DMwR/DMwR.pdf
What am I not getting right?
The class variable (target in your code) needs to be a factor, not a numeric 0/1 column.
require(DMwR)
## data
dm = structure(
c(5.901487, 6.917943, 5.247007, 5.157673, 4.836749,
4.59097, 3.895904, 4.312977, 5.844978, 4.009195, 5.176487, 3.97971,
3.628324, 6.212658, 3.978392, 5.547353, 5.350865, 3.853151, 5.450767,
5.108031, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0),
.Dim = c(10L, 3L),
.Dimnames = list(NULL, NULL))
dm = data.frame(dm)
## column names
colnames(dm) = c("var1", "var2", "target")
## you must convert the classifier variable to a factor
dm$target = factor(dm$target)
## SMOTE algorithm
dmSmote <- SMOTE(target ~ ., data = dm, k = 5, perc.over = 1400, perc.under = 140)
Using debug() on the function in question is a good starting point for diagnosing errors.
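As a minimal sketch of the same idea, the class column can also be read in as a factor directly, which avoids the separate conversion step. The column names var1, var2 and target are the ones used in the answer above; the data are inlined here only so the example runs on its own.

```r
# The ten rows from the question, inlined so the sketch is self-contained.
csv_text <- "5.901487,5.176487,1
6.917943,3.979710,0
5.247007,3.628324,1
5.157673,6.212658,0
4.836749,3.978392,0
4.590970,5.547353,0
3.895904,5.350865,0
4.312977,3.853151,0
5.844978,5.450767,0
4.009195,5.108031,0"

# colClasses makes the third column a factor at read time.
dm <- read.table(text = csv_text, sep = ",",
                 col.names = c("var1", "var2", "target"),
                 colClasses = c("numeric", "numeric", "factor"))

# Check the type and the class imbalance before calling SMOTE.
is.factor(dm$target)
table(dm$target)
```

Checking is.factor() and table() before calling SMOTE makes it easy to confirm that the minority class is the one you expect.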
I want to create a matrix like this:
dat <- matrix(
"an_no" = c(14, 17),
"an_yes" = c(3, 1),
row.names = c("TL-MCT-t", "ops"),
stringsAsFactors = FALSE
)
but I get an "unused arguments" error.
What did I do wrong, and how do I build the matrix correctly with these arguments?
as.matrix didn't help.
Thanks for your help.
You are using the arguments that you would use to build a data frame. If you want a matrix using this syntax you can do:
dat <- as.matrix(data.frame(
an_no = c(14, 17),
an_yes = c(3, 1),
row.names = c("TL-MCT-t", "ops")))
dat
#> an_no an_yes
#> TL-MCT-t 14 3
#> ops 17 1
You don't need stringsAsFactors = FALSE because none of your data elements are strings, and in any case stringsAsFactors has been FALSE by default since R 4.0.0. You also don't need quotation marks around an_no and an_yes because these are both legal variable names in R.
The signature of the matrix function is this:
matrix(data = NA,
nrow = 1,
ncol = 1,
byrow = FALSE,
dimnames = NULL)
It appears you're trying to create a data.frame:
data.frame(row_names = c("TL-MCT-t", "ops"),
an_no = c(14,17),
an_yes = c(3,1)
)
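If a matrix is what you want, a minimal sketch using matrix() directly looks like the following. The values are supplied column by column (the default, byrow = FALSE), and the row and column names from the question go into dimnames.

```r
# Values are filled column-wise: first the an_no column, then the an_yes column.
dat <- matrix(c(14, 17, 3, 1),
              nrow = 2,
              dimnames = list(c("TL-MCT-t", "ops"),   # row names
                              c("an_no", "an_yes")))  # column names

is.matrix(dat)
dat["ops", "an_no"]
```

This keeps the result numeric and skips the intermediate data frame.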
I am trying to fit a list of dataframes and I can't figure out why I can't define conc and t0 outside of the function.
If I do it like this, I get the error:
'Error in nls.multstart::nls_multstart(y ~ fit_drx_mono(assoc_time,
t0, : There must be as many parameter starting bounds as there are
parameters'
conc <- 5e-9
t0 <- 127
nls.multstart::nls_multstart(y ~ fit_mono(assoc_time, t0, conc, kon, koff, ampon, ampoff),
data = data_to_fit,
iter = 100,
start_lower = c(kon = 1e4, koff = 0.00001, ampon = 0.05, ampoff = 0),
start_upper = c(kon = 1e7, koff = 0.5, ampon = 0.6, ampoff = 0.5),
lower = c(kon = 0, koff = 0, ampon = 0, ampoff = 0))
When I specify the values inside the function, everything works as it is supposed to, and I don't understand why.
It turned out that I cannot pass data = data_to_fit; otherwise the function looks for variables only in that data frame. Once I defined every variable outside of the function, without specifying data, it worked.
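One way to reason about the error message is sketched below. It assumes (this is a guess about the internals, not something confirmed in this thread) that every name on the right-hand side of the formula that is not a column of data is counted as a parameter needing start bounds. The data frame here is a hypothetical stand-in, and only base R is used, so nothing is actually fitted.

```r
fit_formula <- y ~ fit_mono(assoc_time, t0, conc, kon, koff, ampon, ampoff)

# Hypothetical stand-in for data_to_fit with the two columns the formula needs.
data_to_fit <- data.frame(assoc_time = c(130, 140, 150),
                          y = c(0.10, 0.15, 0.18))
start_lower <- c(kon = 1e4, koff = 0.00001, ampon = 0.05, ampoff = 0)

# Names used on the right-hand side that are not columns of the data.
rhs_names <- all.vars(fit_formula[[3]])
not_in_data <- setdiff(rhs_names, names(data_to_fit))
not_in_data
length(not_in_data) == length(start_lower)  # counts differ under this assumption

# If t0 and conc are stored as constant columns, only the four
# parameters with start bounds remain outside the data.
data_with_constants <- transform(data_to_fit, t0 = 127, conc = 5e-9)
remaining <- setdiff(rhs_names, names(data_with_constants))
remaining
```

Under that assumption, t0 and conc would be counted alongside kon, koff, ampon and ampoff, which would explain a mismatch with the four start bounds. Storing the constants as columns of the data frame may be another option worth trying.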
I am trying to graph a Bayesian Network (BN) with instantiated nodes using the libraries bnlearn and Rgraphviz. My workflow is as follows:
After creating a data frame with random data (the data I am actually using is obviously not random) I then discretise the data, structure learn the directed acyclic graph (DAG), fit the data to the DAG and then plot the DAG. I also plot a DAG which shows the posterior probabilities of each of the nodes.
#rm(list = ls())
library(bnlearn)
library(Rgraphviz)
# Generating random dataframe
data_clean <- data.frame(a = runif(min = 0, max = 100, n = 1000),
b = runif(min = 0, max = 100, n = 1000),
c = runif(min = 0, max = 100, n = 1000),
d = runif(min = 0, max = 100, n = 1000),
e = runif(min = 0, max = 100, n = 1000))
# Discretising the data into 3 bins
bins <- 3
data_discrete <- discretize(data_clean, breaks = bins)
# Creating factors for each bin in the data
lv <- c("low", "med", "high")
for (i in names(data_discrete)){
levels(data_discrete[, i]) = lv
}
# Structure learning the DAG from the training set
whitelist <- matrix(c("a", "b",
"b", "c",
"c", "e",
"a", "d",
"d", "e"),
ncol = 2, byrow = TRUE, dimnames = list(NULL, c("from", "to")))
bn.hc <- hc(data_discrete, whitelist = whitelist)
# Plotting the DAG
dag.hc <- graphviz.plot(bn.hc,
layout = "dot")
# Fitting the data to the structure
fitted <- bn.fit(bn.hc, data = data_discrete, method = "bayes")
# Plotting the DAG with posteriors
graphviz.chart(fitted, type = "barprob", layout = "dot")
The next thing I do is to manually change the distributions in the bn.fit object, assigned to fitted, and then plot a DAG that shows the instantiated nodes and the updated posterior probability of the response variable e.
# Manually instantiating
fitted_evidence <- fitted
cpt.a = matrix(c(1, 0, 0), ncol = 3, dimnames = list(NULL, lv))
cpt.c = c(1, 0, 0,
0, 1, 0,
0, 0, 1)
dim(cpt.c) <- c(3, 3)
dimnames(cpt.c) <- list("c" = lv, "b" = lv)
cpt.b = c(1, 0, 0,
0, 1, 0,
0, 0, 1)
dim(cpt.b) <- c(3, 3)
dimnames(cpt.b) <- list("b" = lv, "a" = lv)
cpt.d = c(0, 0, 1,
0, 1, 0,
1, 0, 0)
dim(cpt.d) <- c(3, 3)
dimnames(cpt.d) <- list("d" = lv, "a" = lv)
fitted_evidence$a <- cpt.a
fitted_evidence$b <- cpt.b
fitted_evidence$c <- cpt.c
fitted_evidence$d <- cpt.d
# Plotting the DAG with instantiation and posterior for response
graphviz.chart(fitted_evidence, type = "barprob", layout = "dot")
This is the result I get but my actual BN is much larger with many more arcs and it would be impractical to manually change the bn.fit object.
I would like to find out if there is a way to plot a DAG with instantiation without changing the bn.fit object manually? Is there a workaround or function that I am missing?
I think/hope I have read the documentation for bnlearn thoroughly. I appreciate any feedback and would be happy to change anything in the question if I have not conveyed my thoughts clearly enough.
Thank you.
How about using cpdist to draw samples from the posterior given the evidence? You can then estimate the updated parameters by passing the cpdist samples to bn.fit, and plot as before.
An example:
set.seed(69184390) # for sampling
# Your evidence vector
ev <- list(a = "low", b="low", c="low", d="high")
# draw samples
updated_dat <- cpdist(fitted, nodes=bnlearn::nodes(fitted), evidence=ev, method="lw", n=1e6)
# refit : you'll get warnings over missing levels
updated_fit <- bn.fit(bn.hc, data = updated_dat)
# plot
par(mar=rep(0,4))
graphviz.chart(updated_fit, type = "barprob", layout = "dot")
Note I used bnlearn::nodes as nodes is masked by a dependency of Rgraphviz. I tend to load bnlearn last.
I need to calculate the sum of some variables with imputed values. I did this with complete --> as.mids --> with --> do.call
I needed to do the same thing but in a survey context. Therefore, I did: update --> with --> MIcombine
The means of the variables calculated both ways do not match. Which one is correct?
You can check this different behavior with this toy dataset:
library(tidyverse)
library(mice)
library(mitools)
library(survey)
mydata <- structure(list(dis1 = c(NA, NA, 1, 0, 0, 1, 1, 1, 1, 0),
dis2 = c(0, 1, 0, 1, NA, 1, 1, 1, 1, 0),
dis3 = c(1, 1, 0, 0, NA, 1, 1, 1, 1, 0),
sex = c(0,0,0,1,0,1,1,1,1,0),
clus = c(1,1,1,1,1,2,2,2,2,2)),
row.names = c(NA, 10L),
class = c("tbl_df", "tbl", "data.frame") )
imp <- mice::mice(mydata, m = 5, seed = 237856)
# calculating numenf with mice::complete
long <- mice::complete(imp, action = "long", include = TRUE)
long$numenf <- long$dis1 + long$dis2 + long$dis3
imp2 <- mice::as.mids(long)
res <- with(imp2, mean(numenf))
do.call(mean, res$analyses) # mean = 2.1
#calculating numenf with update (from survey)
imp1 <- mice::complete(imp)
imp2 <- mice::complete(imp, 2)
imp3 <- mice::complete(imp, 3)
imp4 <- mice::complete(imp, 4)
imp5 <- mice::complete(imp, 5)
listimp <- mitools::imputationList(list(imp1, imp2, imp3, imp4, imp5))
clus <- survey::svydesign(id = ~clus, data = listimp)
clus <- stats::update(clus, numenf = dis1 + dis2 + dis3)
res <- with(clus, survey::svymean(~numenf))
summary(mitools::MIcombine(res)) # mean = 1.98
Answer
Replace do.call(mean, res$analyses) with mean(unlist(res$analyses)).
Rationale
In the first code snippet, res$analyses is a list. When entering it into do.call, you are essentially calling:
mean(res$analyses[[1]], res$analyses[[2]], res$analyses[[3]], res$analyses[[4]], res$analyses[[5]])
mean averages the vector given as its first argument. The remaining arguments are not treated as data; they are matched to the other parameters of mean (see ?mean). Hence, you're just getting 2.1 back, since that is the value of the first analysis.
We can make a vector out of the list by using unlist(res$analyses). Then, we can just feed it to mean as an argument:
mean(unlist(res$analyses))
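A small base-R sketch of the difference, using a plain list of made-up numbers in place of res$analyses (the values are hypothetical and chosen only for illustration):

```r
# Hypothetical per-imputation means standing in for res$analyses.
analyses <- list(2.1, 1.9, 2.0)

# Equivalent to mean(2.1, 1.9, 2.0): only 2.1 is treated as data,
# the other values are matched to the trim and na.rm parameters.
first_only <- do.call(mean, analyses)

# Flatten the list into a numeric vector first, then average it.
pooled <- mean(unlist(analyses))

first_only
pooled
```

Here first_only is just the first element, while pooled is the average over all elements, which is what the original code intended.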
My data are not a time series, but they have sequential properties.
Consider one sample:
data1 = matrix(rnorm(10, 0, 1), nrow = 1)
label1 = rnorm(1, 0, 1)
label1 is a function of data1, but the data matrix is not a time series. I suppose that the label is a function not of just one data sample but also of older samples, which are naturally ordered in time (not sampled randomly). In other words, the data samples depend on one another.
I have a batch of examples, say, 16.
With that, I want to understand how I can design an RNN/LSTM model that will memorize all 16 examples from the batch to construct the internal state. I am especially confused by the seq_len parameter, which, as I understand it, is specifically the length of the time series used as input to a network, which is not the case here.
Now this piece of code (taken from a timeseries example) only confuses me because I don't see how my task fits in.
rm(symbol)
symbol <- rnn.graph.unroll(seq_len = 5,
num_rnn_layer = 1,
num_hidden = 50,
input_size = NULL,
num_embed = NULL,
num_decode = 1,
masking = F,
loss_output = "linear",
dropout = 0.2,
ignore_label = -1,
cell_type = "lstm",
output_last_state = F,
config = "seq-to-one")
graph.viz(symbol, type = "graph", direction = "LR",
graph.height.px = 600, graph.width.px = 800)
train.data <- mx.io.arrayiter(
data = matrix(rnorm(100, 0, 1), ncol = 20)
, label = rnorm(20, 0, 1)
, batch.size = 20
, shuffle = F
)
Sure, you can treat them as time steps, and apply LSTM. Also check out this example: https://github.com/apache/incubator-mxnet/tree/master/example/multivariate_time_series as it might be relevant for your case.
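To make "treat them as time steps" concrete, here is a minimal base-R sketch that turns ordered samples into overlapping windows. The window length win_len plays the role of seq_len. The features x time steps x windows layout and the choice of taking the label of the last sample in each window are assumptions for illustration, so check them against what your iterator and network expect.

```r
set.seed(1)
n_samples  <- 16   # ordered samples, as in the batch from the question
n_features <- 10
win_len    <- 5    # consecutive samples per input sequence (assumed value)

# Hypothetical data: one row per sample, one label per sample.
X <- matrix(rnorm(n_samples * n_features), nrow = n_samples)
y <- rnorm(n_samples)

# One window for every position where win_len consecutive samples exist.
n_windows <- n_samples - win_len + 1
windows <- array(NA_real_, dim = c(n_features, win_len, n_windows))
for (w in seq_len(n_windows)) {
  rows <- w:(w + win_len - 1)
  # Transpose so that rows are features and columns are time steps.
  windows[, , w] <- t(X[rows, , drop = FALSE])
}

# Each window is paired with the label of its last (most recent) sample.
labels <- y[win_len:n_samples]

dim(windows)
length(labels)
```

With this framing, seq_len is simply how many previous samples the network sees for each prediction, which is a modelling choice rather than a property of the data.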