Predicting a numeric attribute through high dimensional nominal attributes - bigdata

I'm having difficulties mining a big (100K entries) dataset of mine concerning logistics transportation. I have around 10 nominal string attributes (e.g. city/region/country names, customer/vessel identification codes, etc.). Along with those, I have one date attribute "departure" and one ratio-scaled numeric attribute "goal".
What I'm trying to do is use a training set to find out which attributes have strong correlations with "goal" and then validate these patterns by predicting the "goal" value of entries in a test set.
I assume clustering, classification and neural networks could be useful for this problem, so I used RapidMiner, KNIME and ELKI and tried to apply some of their tools to my data. However, most of these tools only handle numeric data, so I got no useful results.
Is it possible to transform my nominal attributes into numeric ones? Or do I need to find different algorithms that can actually handle nominal data?

You most likely want to use a tree-based algorithm; these work well with nominal features. Please be aware that you do not want to use "id-like" attributes (unique identifiers carry no signal that generalizes).
I would recommend RapidMiner's AutoModel feature as a start. GBT and Random Forest should work well.
Best,
Martin

The handling of nominal attributes does not depend on the tool; it depends on the algorithm you use. For example, k-means with Euclidean distance can't handle string values, but other distance functions can, and some algorithms handle them directly, for example the random forest implementation in RapidMiner.
You can also, of course, transform the nominal attributes to numerical ones, for example by using a binary dummy encoding or by assigning a unique integer value to each level (which might introduce some bias, since it imposes an artificial order). In RapidMiner you have the Nominal to Numerical operator for that.
Depending on the distribution of your nominal values, it might also be useful to handle rare values: either group them together in a new category (such as "other") or use a feature selection algorithm after you apply the dummy encoding.
See the screenshot for a sample RapidMiner process (which uses the Replace Rare Values operator from the Operator Toolbox extension).
Edit: Martin is also right; AutoModel will be a good start to check for problematic attributes and find a fitting algorithm.
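For readers working outside RapidMiner, here is a minimal R sketch of the same idea (the data frame df, the column name city and the rare-level threshold are assumptions for illustration; goal is the target from the question):

    # Group rare city levels into "other", then apply a binary dummy encoding.
    df$city <- as.character(df$city)
    rare <- names(which(table(df$city) < 50))   # threshold chosen for illustration
    df$city[df$city %in% rare] <- "other"
    df$city <- factor(df$city)

    # One dummy column per level (no intercept); keep the remaining columns as-is.
    dummies <- model.matrix(~ city - 1, data = df)
    df_encoded <- cbind(df[setdiff(names(df), "city")], dummies)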

Related

Is there an R function/package to transform time series data for confidentiality reasons?

I wish to share a dataset (largely time-series data) with a group of data scientists so they can explore the statistical relationships within the data (e.g. between variables). However, for confidentiality reasons, I am unable to share the original dataset, so I was wondering whether I could transform the data with some random transformation that I know but the recipients don't. Is this a common practice? Is there an associated R package?
I have been exploring the use of synthetic datasets and have looked at 'synthpop', but my challenge seems slightly different. For example, I don't necessarily want the data to include fictional individuals that resemble the original file. Rather, I'd prefer the values of a specific variable to be unclear (e.g. still numerical but nonsensical) to a human viewer while still enabling statistical analysis (e.g. despite the actual values being obscured, the relationship between variables 'x' and 'y' remains the same).
I have a feeling that this is probably quite a simple process (e.g. change the names of variables, apply the same transformation across all variables), but I'm not a mathematician/statistician and I don't want to destroy the underlying relationships through an inappropriate transformation.
Thanks!
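For what it's worth, a minimal R sketch of the kind of masking described above (purely illustrative, assuming an all-numeric data frame named dat): an affine transform per column preserves Pearson correlations, though not the original scales, units or any nonlinear structure.

    set.seed(1)
    m <- as.matrix(dat)                               # hypothetical all-numeric data
    a <- runif(ncol(m), 0.5, 2)                       # secret positive slopes
    b <- rnorm(ncol(m), 0, 10)                        # secret offsets
    masked <- sweep(sweep(m, 2, a, `*`), 2, b, `+`)   # x_new = a * x + b, per column
    colnames(masked) <- paste0("V", seq_len(ncol(masked)))    # rename the variables too
    all.equal(cor(m), cor(masked), check.attributes = FALSE)  # correlations unchanged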

K-Means Distance Measure - Large Data and mixed Scales

I've a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal (Likert 0-5) variables (I think it would be okay to handle them like continuous) and 5 binary variables. Do you have any suggestions for a distance measure that we can use for our k-means approach, given the "large" dataset? We are sticking to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use the 19-dimensional Euclidean distance (3 continuous + 11 ordinal + 5 binary variables). Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that k-means in 19 dimensions will definitely be hard to interpret. You could use a dimensionality reduction technique to hopefully make the k-means output easier to read, but you know way more about the dataset than we ever could, so our ability to help you is limited.
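A minimal R sketch of that approach (assuming a hypothetical data frame dat in which all 19 variables are already coded numerically; k = 4 is an arbitrary choice):

    dat_scaled <- scale(dat)                            # z-score every column
    set.seed(42)
    km <- kmeans(dat_scaled, centers = 4, nstart = 25)  # Euclidean k-means
    table(km$cluster)                                   # cluster sizes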
You can certainly encode these binary variables as 0/1 too.
It is best practice in statistics not to treat Likert-scale variables as numeric, because of their uneven distribution.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense on continuous variables. Discrete variables usually lack the "resolution" for this to work well; the mean then degrades to a "frequency", and such data should be handled very differently.
Do not choose the problem by the hammer. Maybe your data is not a nail; and even if you'd like to do it with k-means, it won't solve your problem... Instead, formulate your problem, then choose the right tool. So given your data, what is a good cluster? Until you have an equation that measures this, hammering the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem. Rather, it will only increase the data dimensionality, an added burden. It is best practice in statistics not to alter the original data into another form (continuous to categorical or vice versa). If you do convert the data, the conversion must be in line with the question you are trying to solve, and you must provide a valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues like missing values, outliers and zero variance, and consider principal component analysis (for continuous variables), correspondence analysis (for categorical variables), etc. This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80% of the analysis.
Regarding the distance measure for mixed data types: the mean in k-means only works for continuous variables, so I do not understand the logic of using k-means for mixed data types.
Consider choosing another algorithm like k-modes. k-modes is an extension of k-means: instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects), and instead of means it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
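As an illustration, a minimal sketch of k-modes in R via the klaR package (the package choice, the data frame cat_dat of factor columns and k = 4 are assumptions, not something the answer above specifies):

    library(klaR)
    set.seed(42)
    km_modes <- kmodes(cat_dat, modes = 4, iter.max = 10)  # k-modes on categorical data
    km_modes$cluster                                       # cluster assignment per row
    km_modes$modes                                         # the modal "centres" of each cluster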
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM, which models, within each cluster, the continuous variables with Gaussian distributions and the ordinal/binary variables with multinomial distributions.
Moreover, missing values can be managed by the model at hand.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/
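A minimal sketch of the package's main entry point, roughly following that tutorial (dat_mixed is a hypothetical data frame of numeric and factor columns, possibly with NAs):

    library(VarSelLCM)
    fit <- VarSelCluster(dat_mixed, gvals = 2:5)  # tries 2 to 5 clusters, with variable selection
    summary(fit)                                  # selected variables and chosen model
    fitted(fit)                                   # most-probable cluster per observation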

using dimension reduction before real data classification

I have a dataset containing 13 features and a column which represents the class.
I want to do a binary classification based on the features, but I am using a method which can work only with 2 features. So I need to reduce the features to 2 columns.
My problem is that some of my features are real valued like age, heart rate and blood pressure and some of them are categorical like type of the chest pain etc.
Which method of dimensionality reduction suits my work?
Is PCA a good choice?
If so, how can I use PCA for my categorical features?
I work with R.
You can just code the categorical features as numbers, for example 1 representing cat, 2 representing dog, and so on.
PCA is a useful dimensionality reduction method, but it assumes linear structure; you can just try it and see the result. Kernel PCA is designed for nonlinear data, so you can also try that.
Other methods include LLE, Isomap, CCA, LDA, ...; you can try those and see which gives a better result.
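A minimal R sketch of the PCA route (assuming a hypothetical data frame df whose categorical columns are factors and whose class column is target; it uses a dummy encoding rather than the integer coding mentioned above, to avoid imposing an artificial order):

    X  <- model.matrix(~ . - target - 1, data = df)  # dummy-encode the factors
    pc <- prcomp(X, center = TRUE, scale. = TRUE)
    reduced <- pc$x[, 1:2]                           # the 2 inputs for the classifier
    summary(pc)                                      # variance explained per component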
Check the H2O library for GLRM (generalized low-rank) models; it can handle categorical variables.
If that does not work for you, target encoding techniques could be useful before applying PCA.
You can try using CatBoost (https://catboost.ai, https://github.com/catboost/catboost) - a new gradient boosting library with good handling of categorical features.

Alternative to Boruta function in R for Large data set

I have a dataset with 90,275 rows and 60 variables. I want to do feature engineering for this dataset. Previously I used Boruta() from the Boruta package for this, but given the size of the dataset, I feel that Boruta() will take a very long time.
Can you please suggest some alternatives to Boruta for feature engineering on a large dataset?
The general answer would be: it depends on your data format (the types of your variables), since the input space for different FE/FS algorithms varies significantly.
So, first of all, please provide the structure of your data frame.
But for the moment, I would assume you have one of the following formats:
1) numerical
2) factors, characters, logical and dummy variables
3) a mix of numeric and factor variables
Numerical input: PCA, LDA, ANOVA and Pearson correlation should help you decrease dimensionality. They work quite fast since the data is numeric.
Factor & mix: ANOVA, or tree-based solutions (random forest, xgboost, cubist), checking the important variables of the models. These options are quite fast as well, assuming that your data does not have too many levels (e.g. a variable "city" with 200 distinct values).
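For the tree-based route, a minimal sketch with ranger (the package is my choice here, not something the answer requires; df is a hypothetical data frame with target column y):

    library(ranger)
    set.seed(42)
    rf <- ranger(y ~ ., data = df, num.trees = 200, importance = "impurity")
    head(sort(rf$variable.importance, decreasing = TRUE), 20)  # top 20 candidate features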
Boruta can be a good alternative to Boruta. Ironic, right?
Boruta allows you to specify the tool it uses. One of the controls is "getImp".
If it is "getImpRfZ", then it is a Z-score of the mean decrease in accuracy from a ranger-based forest. ranger tends to be faster than the Breiman-Cutler implementation.
If it is "getImpXgboost", then it uses XGBoost, which is fast and handles big data well.
If it is "getImpExtraZ", then it uses the extraTrees library, which is supposed to be decent on big data as well. Note: use roughly 5x the tree count you would in a textbook random forest.
You can also homebrew your own importance function, but it should be pretty tough to beat XGBoost.
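A minimal sketch of swapping Boruta's importance source (df is a hypothetical data frame with target y; note that the XGBoost-based source may expect numeric predictors, so dummy-encode factors first if needed):

    library(Boruta)
    set.seed(42)
    b <- Boruta(y ~ ., data = df, getImp = getImpXgboost, maxRuns = 50)
    getSelectedAttributes(b, withTentative = FALSE)  # confirmed attributes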

treating categorical/binary variables in lm

When building a regression model using lm, do we need to explicitly specify which variables should be treated as categorical or binary? If so, how do we do that? Thanks.
This brings up another important question: Is whether a variable is numeric or categorical a property of the data or a property of the analysis?
Back in the early days of statistical computing it was easier to store categorical variables as numbers and therefore it was necessary to designate at some point that these variables were indeed representing categories rather than the numbers themselves having meaning. The common place to designate this was at the point of analysis. This results in a legacy of having the variable type be a property of the analysis.
R (and others) is a much more modern language and takes the approach that this should be a property of the data itself. This simplifies things in that you can make this designation once and all resulting analyses/graphs/tables/etc. will treat the variable properly. I think this approach is much simpler and more intuitive: after all, if a particular variable is categorical for one analysis, shouldn't it be categorical for all analyses, graphs, tables, etc.?
This has been a bit of a long answer, but the idea is to help you shift your thinking from how to designate this in the analysis to thinking how to designate the properties for the data itself. If you designate that your data is a factor (using the factor, ordered or other functions) before any analysis, then the R analysis/graphing/table tools will do the correct thing. Depending on how your data looks and how it was entered/imported, this conversion may have already been done for you.
Other properties, such as the order of the categories, should also be properties of the data, not the analysis/graph/table/etc.
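A minimal sketch of the point above (hypothetical data frame df with columns outcome, age and a categorical treatment): declare the factor once, and lm() builds the dummy coding itself.

    df$treatment <- factor(df$treatment, levels = c("control", "A", "B"))  # hypothetical levels
    fit <- lm(outcome ~ age + treatment, data = df)
    summary(fit)   # one coefficient per non-reference level of treatment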
