Data Standardisation for Neural Network in R - r

I have built a multilayer perceptron neural network in SPSS 22. I try the same using "neuralnet" package in R, but the results are not desirable.
SPSS standardises data before performing training and I am wondering:
Does "neuralnet" package perform any sort of standardization? I could not find in its guide.
According to SPSS guide here, standardised process is done as below:
Subtract the mean and divide by the standard deviation, (x−mean)/s.
Is there an optimal function that can do this in R? Since the method is quite simple, I can implement the scaling by my own, but it might not be efficient since number of data elements and records are very large.
Or maybe should I use another neural network package like "monmlp"? that standardize data automatically?
Many thanks

This might be useful if you need to standardize multiple columns in a data frame (call it foo):
# Index of columns to standardize
cols <- c(1,2,3,4)
# Standardize
library(plyr)
standardize <- function(x) as.numeric((x - mean(x)) / sd(x))
foo[cols] <- plyr::colwise(standardize)(foo[cols])

Related

Low-pass fltering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a google search, came up with a dazzling range of packages. Some apply to 1D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R? ); some apply to images. However I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go this way as they typically involve (i) installation of more or less complex/heavy solutions (imageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r<-seq(0:360)/360*(2*pi)
x<-cos(r)
y<-sin(r)
z<-outer(x,y,"*")
noise<-0.3*matrix(runif(length(x)*length(y)),nrow=length(x))
zz<-z+noise
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience on signal processing !
Thanks.
One option may be using a non-linear prediction method and getting the fitted values from the model.
For example by using a polynomial regression, we can predict the original data as the purple one,
By following the same logic, you can do the same thing to all columns of the zz matrix as,
predictions <- matrix(, nrow = 361, ncol = 0)
for(i in 1:ncol(zz)) {
pred <- as.matrix(fitted(lm(zz[,i]~poly(1:nrow(zz),2,raw=TRUE))))
predictions <- cbind(predictions,pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that, I used a polynomial regression with degree 2, you can change the degree for a better fitting across the columns. Or maybe, you can use some other powerful non-linear prediction methods (maybe SVM, ANN etc.) to get a more accurate model.

Efficient plotting of part of a hierarchical cluster

I am running agglomerative clustering on a data set of 130K rows (130K unique keys) and 7 columns, each column ranging from 20 to 2000 unique levels. The data are categorical, specifically alphanumeric codes. At most they can be thought of as factors. I am experimenting with what results I might get from a couple of alternatives to k-modes, including hierarchical clustering and MCA.
My question is, is there any good way to visualize the results up to a certain level with the tree structure?
Standard steps are not a problem:
library{cluster}
Compute Gower distance,
ptm <- proc.time()
gower.dist <- daisy(df[,colnams], metric = c("gower"))
elapsed <- proc.time() - ptm
c(elapsed[3],elapsed[3]/60)
Compute agglomerative clustering object from Gower distance
aggl.clust.c <- hclust(gower.dist, method = "complete")
Now to plotting it. The following line works, but the plot is humanly unreadable
plot(aggl.clust.c, main = "Agglomerative, complete linkages")
Ideally what I am looking for would be something like so (the below is pseudocode that failed on my system)
plot(cutree(aggl.clust.c, k=7), main = "Agglomerative, complete linkages")
I am running R version 3.2.3. That version cannot change (and I don't believe it ought to make a difference for what I am trying to do).
I'd be interested in doing the same in Python, if anyone has good pointers.
I found a useful answer to my question re plotting part of a tree using the as.dendogram() method. Link: http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning

Multinomial logit: estimation on a subset of alternatives in R

As McFadden (1978) showed, if the number of alternatives in a multinomial logit model is so large that computation becomes impossible, it is still feasible to obtain consistent estimates by randomly subsetting the alternatives, so that the estimated probabilities for each individual are based on the chosen alternative and C other randomly selected alternatives. In this case, the size of the subset of alternatives is C+1 for each individual.
My question is about the implementation of this algorithm in R. Is it already embedded in any multinomial logit package? If not - which seems likely based on what I know so far - how would one go about including the procedure in pre-existing packages without recoding extensively?
Not sure whether the question is more about doing the sampling of alternatives or the estimation of MNL models after sampling of alternatives. To my knowledge, there are no R packages that do sampling of alternatives (the former) so far, but the latter is possible with existing packages such as mlogit. I believe the reason is that the sampling process varies depending on how your data is organized, but it is relatively easy to do with a bit of your own code. Below is code adapted from what I used for this paper.
library(tidyverse)
# create artificial data
set.seed(6)
# data frame of choser id and chosen alt_id
id_alt <- data.frame(
id = 1:1000,
alt_chosen = sample(1:30, 1)
)
# data frame for universal choice set, with an alt-specific attributes (alt_x2)
alts <- data.frame(
alt_id = 1:30,
alt_x2 = runif(30)
)
# conduct sampling of 9 non-chosen alternatives
id_alt <- id_alt %>%
mutate(.alts_all =list(alts$alt_id),
# use weights to avoid including chosen alternative in sample
.alts_wtg = map2(.alts_all, alt_chosen, ~ifelse(.x==.y, 0, 1)),
.alts_nonch = map2(.alts_all, .alts_wtg, ~sample(.x, size=9, prob=.y)),
# combine chosen & sampled non-chosen alts
alt_id = map2(alt_chosen, .alts_nonch, c)
)
# unnest above data.frame to create a long format data frame
# with rows varying by choser id and alt_id
id_alt_lf <- id_alt %>%
select(-starts_with(".")) %>%
unnest(alt_id)
# join long format df with alts to get alt-specific attributes
id_alt_lf <- id_alt_lf %>%
left_join(alts, by="alt_id") %>%
mutate(chosen=ifelse(alt_chosen==alt_id, 1, 0))
require(mlogit)
# convert to mlogit data frame before estimating
id_alt_mldf <- mlogit.data(id_alt_lf,
choice="chosen",
chid.var="id",
alt.var="alt_id", shape="long")
mlogit( chosen ~ 0 + alt_x2, id_alt_mldf) %>%
summary()
It is, of course, possible without using the purrr::map functions, by using apply variants or looping through each row of id_alt.
Sampling of alternatives is not currently implemented in the mlogit package. As stated previously, the solution is to generate a data.frame with a subset of alternatives and then using mlogit (and importantly to use a formula with no intercepts). Note that mlogit can deal with unbalanced data, ie the number of alternatives doesn't have to be the same for all the choice situations.
My recommendation would be to review the mlogit package.
Vignette:
https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit2.pdf
the package has a set of example exercises that (in my opinion) are worth looking at:
https://cran.r-project.org/web/packages/mlogit/vignettes/Exercises.pdf
You may also want to take a look at the gmnl package (I have not used it)
https://cran.r-project.org/web/packages/gmnl/index.html
Multinomial Logit Models with Continuous and Discrete Individual Heterogeneity in R: The gmnl Package
Mauricio Sarrias' (Author) gmnl Web page
Question: What specific problem(s) are you trying to apply a multinomial logit model too? Suitably intrigued.
Aside from the above question, I hope the above points you in the right direction.

How to sample rows in the randomForest package

I have a dataset with 1 million rows and 100 columns. randomForest is quite slow for data this big so I would like to train each tree on a subset of, say, 50000 columns each.
How do I achieve this with the randomForest function? Do I have to hack something together manually? I am not able to find any instruction on this in the vignette.
Do you mean that the sample for each tree should be different?
To start with, I would consider sampling before calling randomforest. Indeed, the fact that you take different samples for each tree could have an impact on the final result, and the importance matrix would probably be partly biased.
You can achieve this by doing that:
numrow <- nrow(data)
subset <- sample(numrow, 50000)
learn <- data[subset,]
test <- data[-subset,]
model_rf <- randomForest(formula=[...], data=learn, importance=T)

Function and data format for doing vector-based clustering in R

I need to run clustering on the correlations of data row vectors, that is, instead of using individual variables as clustering predictor variables, I intend to use the correlations between the vector of variables between data rows.
Is there a function in R that does vector-based clustering. If not and I need to do it manually, what is the right data format to feed in a function such as cmeans or kmeans?
Say, I have m variables and n data rows, the m variables constitute one vector for each data row. so I have a n X n matrix for correlation or cosine. Can this matrix be plugged in the clustering function directly or certain processing is required?
Many thanks.
You can transform your correlation matrix into a dissimilarity matrix,
for instance 1-cor(x) (or 2-cor(x) or 1-abs(cor(x))).
# Sample data
n <- 200
k <- 10
x <- matrix( rnorm(n*k), nr=k )
x <- x * row(x) # 10 dimensions, with less information in some of them
# Clustering
library(cluster)
r <- pam(1-cor(x), diss=TRUE, k=5)
# Check the results
plot(prcomp(t(x))$x[,1:2], col=r$clustering, pch=16, cex=3)
R clustering is often a bit limited. This is a design limitation of R, since it heavily relies on low-level C code for performance. The fast kmeans implementation included with R is an example of such a low-level code, that in turn is tied to using Euclidean distance.
There are a dozen of extensions and alternatives available in the community around R. There are PAM, CLARA and CLARANS for example. They aren't exactly k-means, but closely related. There should be a "spherical k-means" somewhere, that is sensible for cosine distance. There is the whole family of hierarchical clusterings (which scale rather badly - usually O(n^3), with O(n^2) in a few exceptions - but are very easy to understand conceptually).
If you want to explore some more clustering options, have a look at ELKI, it should allow clustering (with various methods, including k-means) by correlation based distances (and it also includes such distance functions). It's not R, though, but Java. So if you are bound to using R, it won't work for you.

Resources