Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
It seems that the 'SwarmSVM' package used to have a kmeans.predict function, but no longer does.
I would like to split a data frame into training and testing subsets so that I can train a model and then test it. At the moment I can only use the 'kmeans' function to create clusters, and I can't figure out which functions or packages to use to train and test a model.
k-means is a clustering method, i.e. for unsupervised learning, not supervised, and as such isn't designed to predict on future data, as adding more data would change the centers. Supervised alternatives that can do classification include k-NN, LDA/QDA, and SVMs, but such an approach would require a training set with known classes.
All that said, you could write a predict method for stats::kmeans using dist, as you're presumably really looking for the closest center to the point. Hardly optimized, but functional:
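predict.kmeans <- function(object, newdata){
    centers <- object$centers
    n_centers <- nrow(centers)
    # distances between every pair of rows in (centers, newdata)
    dist_mat <- as.matrix(dist(rbind(centers, newdata)))
    # keep the newdata rows and the center columns; drop = FALSE keeps a
    # matrix even when newdata has a single row
    dist_mat <- dist_mat[-seq(n_centers), seq(n_centers), drop = FALSE]
    # index of the nearest center for each new point
    max.col(-dist_mat)
}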
set.seed(47)
in_train <- sample(nrow(iris), 100)
mod_kmeans <- kmeans(iris[in_train, -5], 3)
test_preds <- predict(mod_kmeans, iris[-in_train, -5])
table(test_preds, iris$Species[-in_train])
#>
#> test_preds setosa versicolor virginica
#>          1      0          0        10
#>          2      0         18         7
#>          3     15          0         0
install.packages("class")
library(class)
Then use the knn function. For further help, see:
?knn
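As a minimal sketch of how that could look on a train/test split (assuming the iris data and k = 5, both chosen here only for illustration):

```r
library(class)

set.seed(47)
# hold out 50 of the 150 iris rows for testing
in_train <- sample(nrow(iris), 100)

# knn() takes the training features, the test features and the training labels
knn_preds <- knn(train = iris[in_train, -5],
                 test  = iris[-in_train, -5],
                 cl    = iris$Species[in_train],
                 k     = 5)  # k = 5 is an arbitrary illustrative choice

# compare predictions with the known species of the held-out rows
conf_mat <- table(knn_preds, iris$Species[-in_train])
accuracy <- mean(knn_preds == iris$Species[-in_train])
print(conf_mat)
print(accuracy)
```

Unlike k-means, this needs known class labels for the training rows.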
Which R packages, other than 'iZid', contain built-in functions to simulate zero-inflated distributions related to popular discrete models such as the Poisson, negative binomial, COM-Poisson, Poisson inverse Gaussian, and Poisson-Lindley?
Have a look at the CRAN Task View on Distributions. This is a curated look at R packages that help you work with distributions. You can search the page for "inflated" to quickly find the relevant parts.
If you have an existing function that generates random deviates from a non-zero-inflated distribution, you can write a wrapper (or decorator) that creates a zero-inflated-deviate simulator. The only assumption I've made here is that the first argument of the original function is called n and specifies the number of random deviates to pick.
For example, if we want to extend rbinom to return zero-inflated binomial deviates ...
ziversion <- function(rfun) {
    f <- function(n, ..., zi) {
        x <- rfun(n, ...)
        x <- ifelse(runif(n) < zi, 0, x)
        return(x)
    }
    return(f)
}
rzibinom <- ziversion(rbinom)
set.seed(101)
rzibinom(10, size = 10, prob = 0.2, zi = 0.5)
## [1] 1 0 3 2 0 1 2 0 0 0
zi is the zero-inflation probability. With a little bit of effort the code could be made more efficient ...
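One possible way to make it more efficient, sketched here under the assumption that the extra arguments passed through ... are scalars (not vectors aligned with n), is to draw from the original generator only for the observations that are not structural zeros. The name ziversion_fast is made up for this example:

```r
# Hypothetical variant of the wrapper: decide which draws are structural
# zeros first, then call rfun only for the remaining ones.
ziversion_fast <- function(rfun) {
    function(n, ..., zi) {
        is_zero <- runif(n) < zi                  # TRUE = structural zero
        x <- numeric(n)                           # start with all zeros
        x[!is_zero] <- rfun(sum(!is_zero), ...)   # fill in the non-inflated draws
        x
    }
}

rzipois <- ziversion_fast(rpois)
set.seed(101)
x <- rzipois(10000, lambda = 3, zi = 0.4)
mean(x == 0)   # should be close to zi + (1 - zi) * exp(-lambda)
```

This version consumes random numbers in a different order from the original wrapper, so the two will not give identical draws for the same seed.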
I would like to estimate the parameters of a nonlinear regression model with LAD regression. In essence the LAD estimator is an M-estimator. As far as I know it is not possible to use the robustbase package to do this. How could I use R to do LAD regression? Could I use a standard package?
You could do this with the built-in optim() function.
Make up some data (make sure x is positive, so that a*x^b makes sense - raising negative numbers to fractional powers is problematic):
set.seed(101)
a <- 1; b <- 2
dd <- data.frame(x=rnorm(1000,mean=7))
dd$y <- a*dd$x^b + rnorm(1000,mean=0,sd=0.1)
Define objective function:
objfun <- function(p) {
    pred <- p[1]*dd$x^p[2]   ## a*x^b
    sum(abs(pred-dd$y))      ## least-absolute-deviation criterion
}
Test objective function:
objfun(c(0,0))
objfun(c(-1,-1))
objfun(c(1,2))
Optimize:
o1 <- optim(fn=objfun, par=c(0,0), hessian=TRUE)
You do need to specify starting values, and deal with any numerical/computational issues yourself ...
I'm not sure I know how to compute standard errors: you can use sqrt(diag(solve(o1$hessian))), but I don't know if the standard theory on which this is based still applies ...
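On the point about starting values: for a power-law model like a*x^b with positive x and y, one option (a sketch, not the only way) is to take rough starting values from an ordinary least-squares fit on the log-log scale and then refine them with the LAD criterion. The names lad_obj, start_fit and o2 are introduced only for this example, and the data are regenerated so the block runs on its own:

```r
set.seed(101)
dd <- data.frame(x = rnorm(1000, mean = 7))
dd$y <- 1 * dd$x^2 + rnorm(1000, mean = 0, sd = 0.1)

# least-absolute-deviation criterion for y = a * x^b
lad_obj <- function(p) sum(abs(p[1] * dd$x^p[2] - dd$y))

# rough starting values: log(y) = log(a) + b * log(x), fitted by OLS
# (assumes all x and y are positive, as they are for these simulated data)
start_fit <- lm(log(y) ~ log(x), data = dd)
start <- c(exp(coef(start_fit)[[1]]), coef(start_fit)[[2]])

# refine with optim() (Nelder-Mead by default) under the LAD criterion
o2 <- optim(par = start, fn = lad_obj)
o2$par          # estimates of a and b
o2$convergence  # 0 means optim reports convergence
```

Starting close to a sensible fit does not remove the need to check the convergence code and try other starting points.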
I am looking for packages that can do multiclass oversampling, undersampling, or both. I tried the ROSE package, but it only works for binary classes.
My target variable has 4 classes, and their percentages are:
"0"-70% "1"-15% "2"-10% "3"-5% "4"-5%
I believe you should be able to downsample or upsample with more than two classes using the caret package (see its downSample() and upSample() functions).
If caret doesn't handle your case, the best option is probably to write custom code that randomly samples an equal number of rows from each class.
In practice, downsampling and upsampling are mostly used for binary classification, so you may want to consider a one-versus-all approach. If you downsample, you then have to adjust the predicted probabilities back so that they are not distorted by the different sampling rates across classes.
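As a rough illustration of that adjustment, the sketch below applies the usual prior-shift correction: each predicted class probability is multiplied by the ratio of the original class proportion to the proportion in the resampled training data, and each row is then renormalised. It assumes the resampling changed only the class proportions and that the columns of probs are in the same order as the two prior vectors. The function name adjust_probs and its arguments are made up for this example:

```r
# Hypothetical helper: map probabilities from a model trained on resampled
# data back to the original class proportions.
adjust_probs <- function(probs, train_priors, true_priors) {
    w <- true_priors / train_priors        # per-class correction weights
    adjusted <- sweep(probs, 2, w, `*`)    # multiply column j by w[j]
    adjusted / rowSums(adjusted)           # make each row sum to 1 again
}

# two example predictions from a model trained on perfectly balanced data
probs <- matrix(c(1/3, 1/3, 1/3,
                  0.6, 0.3, 0.1), nrow = 2, byrow = TRUE)
adj <- adjust_probs(probs,
                    train_priors = rep(1/3, 3),
                    true_priors  = c(0.7, 0.2, 0.1))
adj
```

A prediction that was uninformative on the balanced data (the first row) falls back to the original class proportions after the correction.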
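Update: sample code:
y <- c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "C")
x <- c(1, 2, 1, 2, 3, 4, 5, 4, 5, 6, 7)
# a data frame keeps x1 numeric and lets us filter on its own y column
data <- data.frame(y = y, x1 = x)
fin <- NULL
for (i in unique(data$y)) {
    sub <- data[data$y == i, ]            # rows of the current class
    sam <- sub[sample(nrow(sub), 2), ]    # draw 2 rows at random
    fin <- rbind(fin, sam)
}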
results:
y x1
A 2
A 1
B 3
B 2
C 6
C 7
Here I have sampled 2 rows from each class of y; instead of 2, you should use the size of the smallest class in your y.
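A sketch of that idea, with the sample size computed from the data rather than hard-coded, might look like this. The function name downsample_to_min and the argument class_col are made up for the example, and it assumes the class column has no empty factor levels:

```r
# Hypothetical helper: downsample every class to the size of the smallest one.
downsample_to_min <- function(df, class_col) {
    n_min <- min(table(df[[class_col]]))                  # smallest class size
    idx_by_class <- split(seq_len(nrow(df)), df[[class_col]])
    keep <- unlist(lapply(idx_by_class, function(idx) {
        idx[sample.int(length(idx), n_min)]               # n_min rows per class
    }))
    df[sort(unname(keep)), , drop = FALSE]
}

set.seed(1)
df <- data.frame(
    y  = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "C"),
    x1 = c(1, 2, 1, 2, 3, 4, 5, 4, 5, 6, 7)
)
balanced <- downsample_to_min(df, "y")
table(balanced$y)
```

Here the smallest class (B) has 2 rows, so every class is reduced to 2 rows.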
You can use the R UBL package. It implements several techniques for oversampling multiclass problems, e.g. ADASYN, along with other algorithms for dealing with unbalanced classes.
You can try SMOTE.
SMOTE over- or under-samples the data, generating synthetic observations where needed, so it often outperforms other sampling techniques.
This is a snippet of code in Python. In R, it is a little harder to equalize the level distribution of the target variable using SMOTE, but it can be done by considering 2 classes at a time.
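import pandas
from imblearn.over_sampling import SMOTE
# sampling_strategy replaces the ratio argument of older imbalanced-learn releases;
# "auto" resamples every class except the majority up to the majority size
sm = SMOTE(random_state=99, sampling_strategy="auto")
# fit_resample replaces the older fit_sample method
x_train, y_train = sm.fit_resample(X_var, target_class)
print(pandas.Series(y_train).value_counts())  # verify class distribution here
sampling_strategy (called ratio in older imbalanced-learn releases) is a hyperparameter here.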
Hope this helps.
I usually build decision trees in SPSS to get targets from a database. I did a bit of research and found three packages available for R: tree, party and rpart. Which is better for that task?
Thanks!
I have used rpart before, and it is handy. I used it for predictive modeling by splitting the data into training and test sets. Here is the code. Hope this will give you some idea...
library(rpart)
library(rattle)
library(rpart.plot)
### Build the training/validate/test...
data(iris)
nobs <- nrow(iris)
train <- sample(nrow(iris), 0.7*nobs)
test <- setdiff(seq_len(nrow(iris)), train)
colnames(iris)
### The following variable selections have been noted.
input <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
numeric <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
categoric <- NULL
target <-"Species"
risk <- NULL
ident <- NULL
ignore <- NULL
weights <- NULL
#set.seed(500)
# Build the Decision Tree model.
rpart <- rpart(Species~.,
               data=iris[train, ],
               method="class",
               parms=list(split="information"),
               control=rpart.control(minsplit=12,
                                     usesurrogate=0,
                                     maxsurrogate=0))
# Generate a textual view of the Decision Tree model.
print(rpart)
printcp(rpart)
# Decision Tree Plot...
prp(rpart)
dev.new()
fancyRpartPlot(rpart, main="Decision Tree Graph")
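The code above builds a test set but does not use it. A minimal sketch of the evaluation step, assuming the same 70/30 split of iris and default rpart settings (the names fit, test_pred and conf_mat are introduced here for illustration), could be:

```r
library(rpart)

set.seed(500)
# 70/30 split of the row indices
train <- sample(nrow(iris), 0.7 * nrow(iris))
test  <- setdiff(seq_len(nrow(iris)), train)

# fit a classification tree on the training rows only
fit <- rpart(Species ~ ., data = iris[train, ], method = "class")

# predict class labels for the held-out rows and compare with the truth
test_pred <- predict(fit, newdata = iris[test, ], type = "class")
conf_mat  <- table(predicted = test_pred, actual = iris$Species[test])
accuracy  <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

The same predict() call works for a tree fitted with the parms and control settings shown earlier.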
Does anyone know of a package or method in R of carrying out a MANOVA whilst controlling for phylogenetic non-independence?
Thank you!
The sos package is your friend:
library('sos')
findFn('phylogenetic MANOVA')
It seems that the geiger package, and more precisely its aov.phylo function, performs phylogenetic ANOVA or MANOVA.
Here is an example from the help:
library(geiger)
geo=get(data(geospiza))
dat=geo$dat
d1=dat[,1]
grp<-as.factor(c(rep(0, 7), rep(1, 6)))
names(grp)=rownames(dat)
## MANOVA
x=aov.phylo(dat~grp, geo$phy, nsim=50, test="Wilks")
Multivariate Analysis of Variance Table

Response: dat
          Df   Wilks approx-F num-Df den-Df   Pr(>F) Pr(phy)
group      1 0.27872   3.6229      5      7 0.061584   0.549
Residuals 11