I am new to R and have the following problem: I am working on a dataset that has not only numerical values but also non-numerical values (gender, state). I wanted to start by looking through the data and finding some correlations first. This obviously works only for numerical values, and no correlations are reported for the non-numerical columns. I tried it out with ggcorr and it omits the non-numerical columns.
My question is: how do you treat such datasets? How do you find correlations if you have many non-numerical categories? Also, what is the workflow for creating a linear model for such a dataset? The model should predict whether a person earns more or less than 50k a year.
Thanks for any help!
Edit: This is the dataset I am talking about. I was thinking about converting the categories into numerical values and then correlating through cor.test(), but I am not sure I would get a valid correlation number this way. So basically my question is: how do I check the correlation between non-numerical and numerical data?
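For context, a minimal sketch of how the pieces could fit together in base R, assuming a data frame adult with a factor column income (<=50K / >50K), a numeric column age and a factor column sex; the names are illustrative, not taken from the actual file, and a logistic regression (glm) is used instead of a plain linear model because the outcome is binary:

# numeric vs. categorical: compare the distribution of age across income groups
aggregate(age ~ income, data = adult, FUN = mean)

# categorical vs. categorical: chi-squared test on the contingency table
chisq.test(table(adult$sex, adult$income))

# a first predictive model for a binary outcome: logistic regression
fit <- glm(income ~ age + sex, data = adult, family = binomial)
summary(fit)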
I have searched Stack Overflow and Google regarding this but have not yet found a fitting answer.
I have a data frame column with ages of individuals.
Out of around 10000 observations, 150 are NAs.
I do not want to impute those with the mean age of the whole column, but rather assign random ages based on the distribution of the ages in my data set, i.e. in this column.
How do I do that? I tried fiddling around with the MICE package but didn't make much progress.
Do you have a solution for me?
Thank you,
corkinabottle
You could simply sample 150 values from your observations:
# sample 150 values from the observed (non-NA) ages, here stored in a vector called obs
samplevals <- sample(obs, 150)
You could also stratify your observations across quantiles to increase the chances of sampling your tail values by sampling within each quantile range.
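To fill the NAs directly from the observed distribution, a minimal sketch (assuming the data frame is df with an age column; those names are placeholders, not from the question):

# locate the ~150 missing ages and the observed ones
missing  <- is.na(df$age)
observed <- df$age[!missing]

# draw one random observed age for each NA and assign it
set.seed(42)  # for reproducibility
df$age[missing] <- sample(observed, sum(missing), replace = TRUE)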
I currently have a list of 27 correlation matrices with 7 variables each, for social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally, if it contains at least some value, i.e. something other than "NA" and other than its correlation with itself. Since there are 7 variables, I am keeping anything that does NOT consist of six "NA"s plus the self-correlation of 1 -> this is the tricky part, because 1 is a value, but it is meaningless to me in a correlation matrix.
Appreciate if anyone could enlighten me regarding the code.
I am rather new to R, and the only thought I have is to use an if statement to set the condition. I have been trying for hours but to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I am first going to convert your matrix into a data frame, and then I am going to assume that you want to check whether your data frame df has a variable var with at least one value other than NA or 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a dataframe
table(df$var) will show you the distribution of values in your data frame's variable. From here you can make your judgement call on whether to keep the variable or not.
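As an additional sketch (not part of the original answer), the same "at least one meaningful value" check can be written directly against a correlation matrix, assuming the 7x7 matrices carry row and column names; m, var and cor_list are placeholder names:

# keep a variable if its column has at least one correlation that is
# neither NA nor the trivial self-correlation of 1
keep_var <- function(m, var) {
  vals <- m[, var]
  vals <- vals[rownames(m) != var]   # drop the diagonal entry
  any(!is.na(vals))
}

# applied to every matrix in the list of 27, e.g.:
# sapply(cor_list, keep_var, var = "myvar")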
For my data set, I need to perform some sort of two-factor repeated measures ANOVA. I have one between-subject factor called "Treatment" and one within-subject factor called "Frequency" with 8 levels. My problem is that most of my subjects don't have responses, called "Threshold", for all 8 levels of frequency (missing values). In addition, my two treatment groups are also unbalanced (about 23 subjects for the first treatment type and about 21 for the other).
What R code do you suggest I try? And what would that code look like? I've been looking at the aov and Anova (car package) functions. I also need to figure out the formula for my model. I was thinking of something like
aov(Threshold ~ (Treatment * Frequency) + Error(Subject/(Treatment * Frequency)))
but I keep getting error messages like "In aov (......) Error() model is singular."
My question here is whether you only include within-subject factors in the error term, so Error(Subject/Frequency) or just Error(Subject), or whether I had it right in including everything. Also, should I include rows of responses for every frequency per bird, even if I don't have that specific value? Should I put NAs in those missing-value cells, or delete the entire row of data for that level?
Any help would be greatly appreciated! I'm new to more advanced statistics and modeling, so keep that in mind! And if I need to clarify or add anything, just let me know. Thanks!
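For reference, a minimal sketch of the specification with only the within-subject factor in the error term, which is the usual form when Treatment varies between subjects and Frequency within them; the data frame name dat is a placeholder, the column names come from the question, and this is only an illustration, not a verdict on which formula is right for these data:

# make sure the grouping variables are factors before passing them to aov()
dat$Subject   <- factor(dat$Subject)
dat$Treatment <- factor(dat$Treatment)
dat$Frequency <- factor(dat$Frequency)

# between-subject: Treatment; within-subject: Frequency
fit <- aov(Threshold ~ Treatment * Frequency + Error(Subject/Frequency), data = dat)
summary(fit)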
Or will the package realize that they are not continuous and treat them as factors? I know that, for classification, the feature being classified does need to be a factor. But what about predictive features? I've run it on a couple of toy datasets, and I get slightly different results depending on whether categorical features are numeric or factors, but the algorithm is random, so I do not know if the differences in my results are meaningful.
Thank you!
Yes, there is a difference between the two. If you want to use a factor variable, you should specify it as such and not leave it as numeric.
For categorical data (this is actually a very good answer on CrossValidated):
A split on a factor with N levels is actually a selection of one of the (2^N)−2 possible combinations. So, the algorithm will check all the possible combinations and choose the one that produces the better split
For numerical data (as seen here):
Numerical predictors are sorted then for every value Gini impurity or entropy is calculated and a threshold is chosen which gives the best split.
So yes, it makes a difference whether you add it as a factor or as a numeric variable. How much of a difference depends on the actual data.
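To make the difference concrete, a small sketch with the randomForest package on made-up data; the data frame and column names are illustrative only:

library(randomForest)

set.seed(1)
df <- data.frame(
  y   = factor(sample(c("a", "b"), 200, replace = TRUE)),  # outcome must be a factor for classification
  cat = sample(1:4, 200, replace = TRUE)                    # a categorical feature coded as 1-4
)

rf_numeric <- randomForest(y ~ cat, data = df)   # cat treated as numeric: threshold splits
df$cat     <- factor(df$cat)
rf_factor  <- randomForest(y ~ cat, data = df)   # cat treated as a factor: subset splits

rf_numeric$confusion
rf_factor$confusion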
I have a panel dataset with population data. I am working mostly with two vectors: population and households. The household vector (there are 3 countries) has a substantial amount of missing values; the population vector is complete. I use a model with population as the independent variable to get the missing values of households. What function should I use to extract these values? I do not need to make any forecasts, just to impute the missing data.
Thank you.
EDIT:
This is a screenshot of my dataset:
https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
As you can see, many values of the datatype = "original" data are missing and I need to impute them somehow. I have created several panel data models (pooled, within, between) and, without further consideration, tried to extract the missing data with each of them; however, I do not know how to do this.
EDIT 2: What I need is not how to determine which model to use, but how to get the missing values from the model (so as to make the dataset more balanced).
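A minimal sketch of one way to do the filling with a plain pooled linear model and predict(); the names panel, households, population and country are placeholders for whatever the columns in the screenshot are called, and the model itself is only illustrative:

# fit households ~ population on the rows where households is observed
# (lm() drops rows with NA in the response by default)
fit <- lm(households ~ population + factor(country), data = panel)

# predict households for the rows where it is missing and write the values back
missing <- is.na(panel$households)
panel$households[missing] <- predict(fit, newdata = panel[missing, ])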