When building a regression model with lm, do we need to explicitly specify which variables should be treated as categorical or binary? If so, how do we do that? Thanks.
This brings up another important question: Is whether a variable is numeric or categorical a property of the data or a property of the analysis?
Back in the early days of statistical computing it was easier to store categorical variables as numbers, so at some point you had to designate that these variables represented categories rather than the numbers themselves having meaning. The usual place to make this designation was at the point of analysis, which leaves a legacy of the variable type being a property of the analysis.
R (and other modern languages) takes the approach that this should be a property of the data itself. This simplifies things: you make the designation once, and every subsequent analysis, graph, table, etc. treats the variable properly. I think this approach is much more intuitive and simple; after all, if a particular variable is categorical for one analysis, shouldn't it be categorical for all analyses, graphs, tables, etc.?
This has been a bit of a long answer, but the idea is to help you shift your thinking from how to designate this in the analysis to how to designate the properties of the data itself. If you designate that a variable is a factor (using factor, ordered, or related functions) before any analysis, then R's analysis, graphing, and table tools will do the correct thing. Depending on how your data look and how they were entered or imported, this conversion may already have been done for you.
Other properties, such as the order of the categories, should also be properties of the data, not the analysis/graph/table/etc.
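For example, here is a minimal sketch (with made-up variable and level names) of designating a variable as categorical once, so that lm and everything downstream treats it correctly:

    # Hypothetical data frame 'dat' in which 'group' was entered as the numbers 1, 2, 3
    dat$group <- factor(dat$group,
                        levels = c(1, 2, 3),
                        labels = c("control", "low", "high"))

    # If the categories have a natural order, make that part of the data too
    dat$dose <- factor(dat$dose,
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)

    # lm() now builds the appropriate contrasts automatically,
    # and plots/tables will also treat these variables as categorical
    fit <- lm(outcome ~ group + dose, data = dat)
    summary(fit)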
I wish to share a dataset (largely time-series data) with a group of data scientists so they can explore the statistical relationships within it (e.g. between variables). However, for confidentiality reasons I am unable to share the original dataset, so I was wondering whether I could transform the data with some random transformation that I know but that the recipients do not. Is this a common practice? Is there an associated R package?
I have been exploring the use of synthetic datasets and have looked at 'synthpop', but my challenge seems slightly different. I don't necessarily want the data to consist of fictional individuals that resemble the original file. Rather, I'd prefer the values of a given variable to be unclear to a human viewer (e.g. still numerical but essentially nonsensical) while still permitting statistical analysis (e.g. even though the actual values are obscured, the relationship between variables 'x' and 'y' remains the same).
I have a feeling this is probably quite a simple process (e.g. rename the variables and apply the same transformation across all of them), but I'm not a mathematician/statistician, so I don't want to distort the underlying relationships through an inappropriate transformation.
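To illustrate the kind of thing I have in mind, here is a rough sketch of a column-wise linear rescaling with a secret random scale and shift per column (the object names are made up); as far as I understand, a positive linear rescaling leaves Pearson correlations between variables unchanged even though the raw values become meaningless:

    set.seed(42)  # the key I would keep to myself

    # 'orig' is the confidential data frame; assume all columns are numeric here
    obfuscate <- function(x) {
      a <- runif(1, 0.5, 5)    # random positive scale
      b <- rnorm(1, 0, 100)    # random shift
      a * x + b
    }

    masked <- as.data.frame(lapply(orig, obfuscate))
    names(masked) <- paste0("v", seq_along(masked))  # hide the variable names too

    # Pearson correlations are unchanged by positive linear rescaling
    all.equal(cor(orig), cor(masked), check.attributes = FALSE)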
Thanks!
I have a large dataset with around 200 columns and 1 million rows. I have a treatment group, and I'm trying to create a control group using propensity score matching based on about 15 different variables.
I have two questions for which I've found conflicting answers online, and I would appreciate it if you could help me out.
1) How should I organize the data to best run the matching process? My data has a mix of numeric, character, and factor variables (some ordered, some not). I've seen some people online saying that MatchIt runs the analysis with character variables, while others say it does not work with the 'nearest' method but does with other ones. So, should I put some effort into converting everything to numeric or factor (which I'm not sure will be possible), or can I run MatchIt with my variables as they are?
2) Has MatchIt been updated to tolerate NAs in variables that are not used for matching? I've seen some old posts saying that MatchIt needed a COMPLETE dataset, even for variables not used in the matching, but those posts also said this would probably be fixed. Is it still the case?
Thanks
1) Beyond the data type, the question you should ask yourself is what sense it makes to feed categorical data into a propensity score setting. Propensity score matching is based on distances between observations, and defining distances between categorical attributes is not straightforward. So even though, technically speaking, MatchIt supports other types, numeric features are the only really sensible input. You can either discard the categorical variables or convert them to numeric (by creating dummy variables and numerically encoding ordinal features). Alternatively, you can keep the categorical features and impose exact matching on them using the exact argument of the matchit function (note that in this case you are no longer really doing propensity score matching).
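For concreteness, here is a minimal sketch of both options (the variable names are invented, and the exact-matching interface has changed across MatchIt releases, so check the documentation for your installed version):

    library(MatchIt)

    # Hypothetical data: 'treat' (0/1), numeric covariates 'age' and 'income',
    # and a categorical covariate 'region'.

    # Option A: create 0/1 dummy columns for the categorical covariate yourself
    # and include those dummies in the propensity score model.
    dat <- cbind(dat, model.matrix(~ region - 1, data = dat))

    # Option B: keep 'region' as a factor and require exact agreement on it,
    # while the propensity score is estimated from the numeric covariates only.
    m.out <- matchit(treat ~ age + income,
                     data   = dat,
                     method = "nearest",
                     exact  = "region")

    summary(m.out)                 # covariate balance diagnostics
    matched <- match.data(m.out)   # extract the matched sample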
2) This issue has not been solved in the current version, 3.0.2, which is obviously annoying.
I'm having difficulty mining a big (100K entries) dataset of mine concerning logistics transportation. I have around 10 nominal string attributes (e.g. city/region/country names, customer/vessel identification codes, etc.). Along with those, I have one date attribute, "departure", and one ratio-scaled numeric attribute, "goal".
What I'm trying to do is use a training set to find out which attributes have strong correlations with "goal" and then validate these patterns by predicting the "goal" value of entries in a test set.
I assume clustering, classification, and neural networks could be useful for this problem, so I used RapidMiner, KNIME, and ELKI and tried to apply some of their tools to my data. However, most of these tools only handle numeric data, so I got no useful results.
Is it possible to transform my nominal attributes into numeric ones? Or do I need to find different algorithms that can actually handle nominal data?
You most likely want to use a tree-based algorithm; these handle nominal features well. Please be aware that you do not want to use "id-like" attributes (such as the customer or vessel identification codes), since they identify individual rows rather than carrying information that generalizes.
I would recommend RapidMiner's AutoModel feature as a start. GBT and RandomForest should work well.
Best,
Martin
The handling of nominal attributes does not depend on the tool; it is a question of which algorithm you use. For example, k-means with Euclidean distance can't handle string values, but other distance functions can, and some algorithms handle nominal attributes directly, for example the random forest implementation in RapidMiner.
You can also, of course, transform the nominal attributes to numeric ones, for example by using a binary dummy encoding or by assigning each category a unique integer value (which might introduce some bias). In RapidMiner you have the Nominal to Numerical operator for that.
Depending on the distribution of your nominal values, it might also be useful to handle rare values. You could either group them together into a new category (such as "other") or use a feature selection algorithm after you apply the dummy encoding.
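If you prefer to script these two steps rather than use the RapidMiner operators, they look roughly like this in R (a sketch with made-up column names):

    # Hypothetical data frame 'dat' with a nominal column 'city' and the numeric target 'goal'

    # Group rare categories into an "other" level (analogous to Replace Rare Values)
    counts <- table(dat$city)
    rare   <- names(counts[counts < 50])   # the threshold is arbitrary; adjust to your data
    dat$city <- as.character(dat$city)
    dat$city[dat$city %in% rare] <- "other"
    dat$city <- factor(dat$city)

    # Binary dummy encoding (analogous to Nominal to Numerical)
    dummies <- model.matrix(~ city - 1, data = dat)
    dat_num <- cbind(dat[setdiff(names(dat), "city")], dummies)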
See the screen shot for a sample RapidMiner process (which uses the Replace Rare Values operator from the Operator Toolbox extension).
Edit: Martin is also right; AutoModel is a good start to check for problematic attributes and find a fitting algorithm.
Let me preface this:
I have looked extensively into this matter and have found several intriguing possibilities (such as this and this). I've also looked into principal component analysis, and I've seen some sources claim it's a poor method for dimension reduction. However, I feel it may be a good method; I'm just unsure how to implement it. The sources I've found give good explanations, but they rarely provide any advice on how to actually go about applying one of these methods (i.e. how one can actually apply a method in R).
So, my question is: is there a clear-cut way to go about dimension reduction in R? My dataset contains both numeric and categorical variables (with multiple levels) and is quite large (~40k observations, 18 variables (but 37 if I transform categorical variables into dummies)).
A few points:
If I want to use PCA, then I would have to somehow convert my categorical variables to numeric. Would it be okay to simply use a dummy variable approach for this?
For any sort of dimension reduction for unsupervised learning, how do I treat ordinal variables? Does the concept of an ordinal variable even make sense in unsupervised learning?
My real issue with PCA is that once I perform it and have my principal components, I have no idea what to actually do with them. From what I understand, each principal component is a combination of the variables, so I'm not really sure how this helps me pick and choose which variables are best.
I don't think this is an R question. This is more like a statistics question.
PCA doesn't work for categorical variables: it relies on decomposing the covariance matrix, which is not meaningful for categorical data.
Ordinal variables make lots of sense in both supervised and unsupervised learning. What exactly are you looking for? You should only apply PCA to ordinal variables if they are not skewed and have many levels.
PCA only gives you a new representation of the data in terms of principal components and their eigenvalues; by itself it is not dimension reduction. You reduce your data set only when you select a subset of the principal components. PCA is useful for regression, data visualisation, exploratory analysis, etc.
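In R, a minimal sketch of that selection step (assuming a data frame X containing only numeric variables):

    # Scale the variables so that no single variable dominates the components
    pca <- prcomp(X, center = TRUE, scale. = TRUE)

    # Proportion of variance explained by each component
    summary(pca)

    # Keep the first k components, e.g. enough to explain ~90% of the variance
    var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    k <- which(var_explained >= 0.90)[1]

    X_reduced <- pca$x[, 1:k]   # scores on the first k components: the reduced data set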
A common way is to apply optimal scaling to transform your categorical variables for PCA:
Read this:
http://www.sicotests.com/psyarticle.asp?id=159
You may also want to consider correspondence analysis for categorical variables and multiple factor analysis for a mix of categorical and continuous variables.
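If you want to try these in R, one option is the FactoMineR package; a minimal sketch, assuming a data frame cat_df of categorical variables and a data frame mixed_df with both types (the names are hypothetical):

    library(FactoMineR)

    # Multiple correspondence analysis for purely categorical variables
    res_mca <- MCA(cat_df, graph = FALSE)
    res_mca$eig        # variance explained by each dimension

    # Factor analysis of mixed data, a close relative that handles
    # categorical and continuous variables together
    res_famd <- FAMD(mixed_df, ncp = 5, graph = FALSE)
    res_famd$eig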
I am starting to use R and learning about PCA too, but my problem here relates to categorical variables.
I have three main types of categorical variables in my data set.
Type one: presence/absence of a trait. I know that I can transform that into 1/0 in the table, no worries.
Type two: frequency or abundance of a trait, which I assume R will understand if I transform something that is absent/less abundant/abundant into 0/1/2, for example.
Type three: the problematic one. This one is truly categorical, functioning as a label. The variable is called "cambial variant", and the possibilities are: absent/lobed/compound/fissured/corded and so on...
These are different types of cambial variants, and those different types can be found in different species of plants.
At first I assumed it would be OK to simply use different numbers for these different types (e.g. absent = 0, lobed = 1, compound = 2, and so on). I ran prcomp like that, and the results were exactly what I expected. But then I realized I had made a mistake by using different numbers for things that are not related at all, right? The program understands that 3 is bigger than 2, and that is wrong here, because the "fissured" type and the "compound" type are not levels of intensity (abundance, frequency, whatever) of the same thing. They are different types of the same structure, but one does not transform into the other; they are not related in that way.
So, in sum, I want to know how I could transform these label-like variables into something that prcomp can understand and use. Or, if that is not an option, whether there is another function in R that can do a PCA with categorical variables that can't be converted into numbers.
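To show what I mean, here is roughly what I did versus what I think the alternative might be (a tiny made-up example):

    # A tiny made-up example of the "cambial variant" variable
    cambial <- factor(c("absent", "lobed", "compound", "fissured", "lobed"))

    # What I did first: arbitrary integer codes (this imposes a false ordering)
    as.numeric(cambial)

    # What I think I should do instead: one 0/1 column per type,
    # so that no type is treated as "bigger" than another
    dummies <- model.matrix(~ cambial - 1)
    dummies

    # These dummy columns could then be combined with the other traits before prcomp()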
Thank you all, and I am sorry if my question is way too dumb; I am really just a beginner here!