R, DMwR-package, SMOTE-function won't work

I need to apply the SMOTE algorithm to a data set, but can't get it to work.
Example:
x <- c(12,13,14,16,20,25,30,50,75,71)
y <- c(0,0,1,1,1,1,1,1,1,1)
frame <- data.frame(x,y)
library(DMwR)
smotedobs <- SMOTE(y~ ., frame, perc.over=300)
This gives the following error:
Error in scale.default(T, T[i, ], ranges) : subscript out of bounds
In addition: Warning messages:
1: In FUN(newX[, i], ...) :
no non-missing arguments to max; returning -Inf
2: In FUN(newX[, i], ...) : no non-missing arguments to min; returning Inf
I would appreciate any help or hints.

SMOTE has a bug on 32-bit Windows 7: it assumes that the target variable named in the 'form' parameter is the last column of the dataset. The following code demonstrates this:
library(DMwR)
data(iris)
# data <- iris[, c(1, 2, 5)] # SMOTE works
data <- iris[, c(2, 5, 1)] # SMOTE fails
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
head(data)
table(data$Species)
newData <- SMOTE(Species ~., data, perc.over=600, perc.under=100)
table(newData$Species)
It shows the following message:
Error in `colnames<-`(`*tmp*`, value = c("Sepal.Width", "Species", "Sepal.Length" :
'names' attribute [3] must be the same length as the vector [2]
On 64-bit Windows 7, the column-order problem does not occur.

I don't have the full answer. I can provide another clue though:
If you convert 'y' to a factor, SMOTE will return without error - but the synthesized observations have NA values for x.

There is a bug in the SMOTE code: it assumes the target variable it is given is already a factor, and it currently does not handle the edge case of non-factors. Make sure to cast the target to a factor before calling the function.
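The two workarounds mentioned in the answers (target as a factor, target as the last column) can be applied before calling SMOTE. This is a minimal base-R sketch; `prepare_target` is a hypothetical helper, not part of DMwR, and the SMOTE call is left commented out because whether it then succeeds depends on the DMwR version and the data (one answer above reports NA values for x in this single-predictor example).

```r
# Hypothetical helper (not part of DMwR): cast the target to a factor
# and move it to the last column of the data frame.
prepare_target <- function(df, target) {
  df[[target]] <- as.factor(df[[target]])
  others <- setdiff(names(df), target)
  df[, c(others, target), drop = FALSE]
}

x <- c(12, 13, 14, 16, 20, 25, 30, 50, 75, 71)
y <- c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1)

# Build the frame with y first on purpose, then let the helper reorder it.
frame <- prepare_target(data.frame(y, x), "y")

str(frame)  # y is now a factor and the last column

# Assumption: DMwR is installed; uncomment to try it on the prepared data.
# library(DMwR)
# smotedobs <- SMOTE(y ~ ., frame, perc.over = 300)
```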

Related

How to correctly use lm for Regression in R

I'm attempting to run a regression on a dataset for a class exercise.
The dataset is split into two columns, X and Y, with NA values scattered throughout.
Running the regression with the lm() call produces the following error:
lm(formula = Y ~ X, data = data2)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion
When I first ran into this error, I read that it could be due to the NA values in the data, so I attempted to remove them on import using the following method.
> library(readxl)
> data2 <- read_excel("data2.xlsx", na = "0")
That seemingly loaded my data successfully. However, when I use View() I can still see the NA values in my data, and running the regression with "lm(formula = Y ~ X, data = data2)" produces the same error.
Any help would be greatly appreciated, thanks for taking the time to read the post.
Your variable Y is probably not numeric; check its type with str(data2).
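As an illustration of that check, here is a small sketch on made-up data (the values are assumptions standing in for data2, not the asker's file). It shows a Y column that was read as text, the conversion to numeric, and lm() then fitting on the complete rows under R's default na.action.

```r
# Made-up data standing in for data2: Y was read as text, with an "NA" string.
data2 <- data.frame(
  X = c(1, 2, 3, 4, 5, 6),
  Y = c("2.1", "3.9", "NA", "8.2", "9.8", "12.1"),
  stringsAsFactors = FALSE
)

str(data2)  # Y shows up as chr, which is what makes lm() fail

# Convert to numeric; strings that are not numbers (such as "NA") become NA.
data2$Y <- suppressWarnings(as.numeric(data2$Y))

# With the default options, lm() drops rows containing NA (na.action = na.omit).
fit <- lm(Y ~ X, data = data2)
nobs(fit)  # number of rows actually used in the fit
```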

Error in .lm.fit(x, y) : NA/NaN/Inf in 'x' in R

I have been trying to find the correlations between all the categorical and numerical variables in my dataset using association from greybox, as follows:
library(readxl)
library(timeDate)
library(greybox)
library(dplyr)
library(mice)
library(Hmisc)
carData33 <- read.csv("carData.csv")
#removing the first column since it's not necessary, it represents the ID number
carData33 <- carData33[,c(2:15)]
#replacing NA with 0
carData33[is.na(carData33)] <-0
assoc(carData33)
The main objective is to run a regression using variables selected by their correlation values.
But while doing so, the following error pops up:
Error in .lm.fit(x, y) : NA/NaN/Inf in 'x'
In addition: Warning message:
In .lm.fit(x, y) : NAs introduced by coercion
The dataset is as follows:
https://i.stack.imgur.com/ZhjwR.png
Use as.factor() on the columns containing categorical data.
For example: test$manufacturer <- as.factor(test$manufacturer)
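The same conversion can be applied to every text column at once. The sketch below uses a made-up data frame; its column names and values are assumptions, since the real carData33 is only shown as an image.

```r
# Made-up stand-in for carData33; the column names are assumptions.
cars <- data.frame(
  manufacturer = c("audi", "bmw", "audi", "ford"),
  fuel = c("petrol", "diesel", "petrol", "diesel"),
  price = c(21000, 35000, 23000, 18000),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving numeric columns alone.
is_chr <- vapply(cars, is.character, logical(1))
cars[is_chr] <- lapply(cars[is_chr], as.factor)

str(cars)
```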

R: Error in plot.window(...) : need finite 'ylim' values

I am working with R. I am trying to follow the code from a previous stackoverflow post over here: Kullback-Leibler distance between 2 samples
In particular, I am trying to determine the "distance" between two datasets:
#load library
library(FNN)
library(dplyr)
#create two data sets
df = iris
data1 = sample_n(df, 20)
data2 = sample_n(df, 20)
#plot KL divergence
plot(KLx.dist(data1,data2))
However, this produces the following error:
Error in plot.window(...) : need finite 'ylim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
Does anyone know why this error is being produced?
Thanks
According to the KLx.dist documentation, this function requires a data matrix as input. For the iris dataset, we therefore need to remove the Species column, which is a factor variable. Removing the Species column before sampling solves the problem:
data(iris)
library(FNN)
library(dplyr)
#create two data sets
df = iris[,1:4]
data1 = sample_n(df, 20)
data2 = sample_n(df, 20)
#plot KL divergence
plot(KLx.dist(data1,data2))
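Instead of hard-coding columns 1:4, the numeric columns can be selected programmatically and converted to a matrix. This is a base-R sketch of the preparation step only; it does not call FNN, and the seed is an arbitrary choice for reproducibility.

```r
data(iris)

# Keep only the numeric columns and convert them to a matrix.
num_cols <- vapply(iris, is.numeric, logical(1))
m <- as.matrix(iris[, num_cols])

# Sample 20 rows for each data set by row index.
set.seed(1)
data1 <- m[sample(nrow(m), 20), ]
data2 <- m[sample(nrow(m), 20), ]

is.numeric(data1)  # a numeric matrix, with no factor column left
dim(data1)
```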

Error in panel spatial model in R using spml

I am trying to fit a panel spatial model in R using the package spml. I first define the NxN weighting matrix as follows
neib <- dnearneigh(coordinates(coord), 0, 50, longlat = TRUE)
dlist <- nbdists(neib, coordinates(coord))
idlist <- lapply(dlist, function(x) 1/x)
w50 <- nb2listw(neib,zero.policy=TRUE, glist=idlist, style="W")
Thus I define two observations to be neighbours if they are at most 50 km apart. The weight attached to each pair of neighbouring observations is the inverse of their distance, so that closer neighbours receive higher weights. I also use the option zero.policy=TRUE so that observations without neighbours are associated with a vector of zero weights.
Once I do this I try to fit the panel spatial model in the following way
mod <- spml(y ~ x , data = data_p, listw = w50, na.action = na.fail, lag = F, spatial.error = "b", model = "within", effect = "twoways" ,zero.policy=TRUE)
but I get the following error and warning messages
Error in lag.listw(listw, u) : Variable contains non-finite values In
addition: There were 50 or more warnings (use warnings() to see the
first 50)
Warning messages: 1: In mean.default(X[[i]], ...) : argument is not
numeric or logical: returning NA
...
50: In mean.default(X[[i]], ...) : argument is not numeric or
logical: returning NA
I believe this to be related to the non-neighbour observations. Can anyone please help me with this? Is there any way to deal with non-neighbour observations besides the zero.policy option?
Many many thanks for helping me.
You should check two things:
1) Make sure that the weight matrix is row-normalized.
2) Properly handle any NA values in the dataset as well as in the W matrix.
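Both checks can be sketched in base R on a plain matrix. The weights and the data frame below are made up for illustration; in the question, style="W" in nb2listw already requests row-standardized weights, so this only shows what the property means and how non-finite or non-numeric values could be spotted.

```r
# Made-up 4 x 4 inverse-distance weights; unit 4 has no neighbours.
W <- matrix(c(0,    0.5, 0.25, 0,
              0.5,  0,   1,    0,
              0.25, 1,   0,    0,
              0,    0,   0,    0),
            nrow = 4, byrow = TRUE)

# Row-normalize: divide each row by its sum, leaving all-zero rows as zeros.
rs <- rowSums(W)
W_norm <- W
W_norm[rs > 0, ] <- W[rs > 0, , drop = FALSE] / rs[rs > 0]

rowSums(W_norm)          # 1 for units with neighbours, 0 otherwise
any(!is.finite(W_norm))  # TRUE would indicate NA/NaN/Inf weights

# The same kind of check on the model variables (made-up data frame).
data_p <- data.frame(y = c(1.2, 3.4, NA, 2.2), x = c(0.5, 0.1, 0.9, 0.3))
ok <- vapply(data_p, function(v) is.numeric(v) && all(is.finite(v)), logical(1))
ok  # FALSE flags a column that is non-numeric or contains non-finite values
```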

How to perform clustering without removing rows where NA is present in R

I have data in which some elements are NA.
What I want to do is perform clustering without removing the rows
where NA is present.
I understand that the Gower distance measure in daisy allows for this.
But why doesn't my code below work?
I welcome alternatives to 'daisy'.
# plot heat map with dendrogram together.
library("gplots")
library("cluster")
# Arbitrarily assigning NA to some elements
mtcars[2,2] <- "NA"
mtcars[6,7] <- "NA"
mydata <- mtcars
hclustfunc <- function(x) hclust(x, method="complete")
# Initially I wanted to use this but it didn't take NA
#distfunc <- function(x) dist(x,method="euclidean")
# Try using daisy GOWER function
# which is supposed to work with NA values
distfunc <- function(x) daisy(x,metric="gower")
d <- distfunc(mydata)
fit <- hclustfunc(d)
# Perform clustering heatmap
heatmap.2(as.matrix(mydata),dendrogram="row",trace="none", margin=c(8,9), hclust=hclustfunc,distfun=distfunc);
The error message I got is this:
Error in which(is.na) : argument to 'which' is not logical
Calls: distfunc.g -> daisy
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In daisy(x, metric = "gower") :
binary variable(s) 8, 9 treated as interval scaled
Execution halted
At the end of the day, I'd like to perform hierarchical clustering on data that contain NA values.
Update
Converting with as.numeric works with the example above.
But why does this code fail when the data are read from a text file?
library("gplots")
library("cluster")
# This time read from file
mtcars <- read.table("http://dpaste.com/1496666/plain/",na.strings="NA",sep="\t")
# Following suggestion convert to numeric
mydata <- apply( mtcars, 2, as.numeric )
hclustfunc <- function(x) hclust(x, method="complete")
#distfunc <- function(x) dist(x,method="euclidean")
# Try using daisy GOWER function
distfunc <- function(x) daisy(x,metric="gower")
d <- distfunc(mydata)
fit <- hclustfunc(d)
heatmap.2(as.matrix(mydata),dendrogram="row",trace="none", margin=c(8,9), hclust=hclustfunc,distfun=distfunc);
The error I get is this:
Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
Error in hclust(x, method = "complete") :
NA/NaN/Inf in foreign function call (arg 11)
Calls: hclustfunc -> hclust
Execution halted
The error is due to the presence of non-numeric variables in the data (numbers encoded as strings).
You can convert them to numbers:
mydata <- apply( mtcars, 2, as.numeric )
d <- distfunc(mydata)
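To see where the string-encoded numbers come from, the sketch below contrasts assigning the string "NA" with assigning a real NA on a copy of mtcars. It also shows a column-by-column conversion that keeps the data frame and its row names, offered as an alternative to apply(), which returns a matrix without the row names.

```r
a <- mtcars
a[2, 2] <- "NA"               # a string: the whole cyl column becomes character
class_before <- class(a$cyl)
class_before

b <- mtcars
b[2, 2] <- NA                 # a real missing value: cyl stays numeric
class(b$cyl)

# Convert column by column, keeping the data frame and its row names.
a[] <- lapply(a, function(col) suppressWarnings(as.numeric(col)))
str(a$cyl)
```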
Using as.numeric may help in this case, but I do think that the original question points to a bug in the daisy function. Specifically, it has the following code:
if (any(ina <- is.na(type3)))
stop(gettextf("invalid type %s for column numbers %s",
type2[ina], pColl(which(is.na))))
The intended error message is not printed, because which(is.na) is wrong. It should be which(ina).
I guess I should find out where / how to submit this bug now.
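The difference between the two calls can be reproduced in isolation. In this sketch, type3 is a made-up vector standing in for the variable inside daisy.

```r
type3 <- c(1, NA, 3)   # made-up stand-in for the variable in daisy()

ina <- is.na(type3)    # logical vector marking the NA entries

# is.na on its own is a function, not a logical vector, so which() rejects it.
msg <- tryCatch(which(is.na), error = function(e) conditionMessage(e))
msg

which(ina)             # index of the NA entry
```

The captured message is the same "argument to 'which' is not logical" error shown in the question, which is why the intended "invalid type" message never appears.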
