Is there a command to see how a categorical variable is coded?
For example, I have a variable called HbA1c and the categories I see are <5.7 and >=5.7. I want to know what values <5.7 and >=5.7 take (whether each is a 0, a 1, or a 2). I need this for regression analysis.
I am sorry if this question has been addressed already but I was not able to find the post.
Thank you in advance.
If f is a factor (the technical name for a categorical variable in R), then levels(f) gives you the levels in order, so something like
setNames(seq_along(levels(f)),levels(f))
## a b c
## 1 2 3
will give you a correspondence table.
Your question in the comments isn't entirely clear, but if you wanted to run a regression with numeric values starting at zero, I would try something like:
mydata$n <- as.numeric(mydata$f)-1
(the numeric codes associated with factors always run from 1 to N; this gives you a numeric variable running from 0 to N-1). Then you can run a regression something like this:
lm(y~n,data=mydata)
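To make this concrete for the HbA1c case in the question, here is a minimal sketch. The vector hba1c and its four values are made up for illustration; only the two level labels come from the question. It shows the internal integer codes, the zero-based version, and the 0/1 coding that lm() would use under R's default treatment contrasts.

```r
# Hypothetical data: only the level labels are taken from the question
hba1c <- factor(c("<5.7", ">=5.7", ">=5.7", "<5.7"),
                levels = c("<5.7", ">=5.7"))

# Correspondence table between level labels and internal integer codes
codes <- setNames(seq_along(levels(hba1c)), levels(hba1c))
print(codes)

# Internal codes start at 1; subtract 1 for a zero-based numeric variable
zero_based <- as.integer(hba1c) - 1L
print(zero_based)

# Dummy coding used by lm()/glm() when the factor is passed directly
# (default treatment contrasts: the first level is the reference)
print(contrasts(hba1c))
```

If the factor is passed to lm() as is, the first level acts as the reference category, so the manual conversion is often unnecessary.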
I am new to machine learning and I am running a classification algorithm (xgboost) on my data, using the caret package in R.
However, I am confused regarding the conversion of some categorical variables into numerical variables for the purpose of machine learning. I have scoured the web but I cannot find a specific rule, if it exists, on the subject.
The xgboost vignette at the following url (xgboost) mentions that "Xgboost manages only numeric vectors." Doesn't that mean that all my features (variables) need to contain only numeric values? However, I've seen some tutorials using xgboost where the variables were categorical variables.
Any help on the subject would be highly appreciated.
The primary way categorical features are treated in statistics/machine learning is through a mechanism called one-hot encoding.
Take the following data, for example:
outcome animal
1 cat
1 dog
0 dog
1 cat
Say you wanted to predict outcome (whatever that is) based on the type of animal a given case (observation/row/subject/etc.) is. The way to do this is to encode animal in a one-hot fashion, like this:
outcome is_dog is_cat
1 0 1
1 1 0
0 1 0
1 0 1
Here the animal column of cardinality k has been encoded into k new columns indicating the presence or absence of a particular category/attribute given the value of animal for that row.
From there, you can use whatever model you want to predict outcome from the (now differently encoded) animal column. But make sure to leave one animal (one group) out of the model as the reference group; otherwise the indicator columns are perfectly collinear with the intercept. In this case, you might fit a logistic regression model outcome ~ is_dog and interpret the slope coefficient for is_dog as the change in the log-odds of the 1 outcome for a dog in comparison to a cat.
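In R, one way to get both encodings is model.matrix(). The sketch below rebuilds the four-row example as a data frame named d (a name chosen here for illustration). Note that model.matrix() names the columns animalcat and animaldog rather than is_cat and is_dog.

```r
# The toy data from the example above
d <- data.frame(outcome = c(1, 1, 0, 1),
                animal  = factor(c("cat", "dog", "dog", "cat")))

# Full one-hot encoding: one indicator column per level, no intercept
one_hot <- model.matrix(~ animal - 1, data = d)
print(one_hot)

# Reference-level encoding: intercept plus k - 1 indicators
# (cat, the first level, is the reference group)
ref_coded <- model.matrix(~ animal, data = d)
print(ref_coded)
```

Either matrix is purely numeric, so it can be handed to a learner that accepts only numeric input. Which of the two is appropriate depends on the model: regression models with an intercept need the reference-level form.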
This is my first time posting here and I hope this is all in the right place. I have been using R for basic statistical analysis for some time, but haven't really used it for anything computationally challenging, and I'm very much a beginner in the programming/data manipulation side of R.
I have presence/absence (binary) data on 72 plant species in 323 plots in a single catchment. The dataframe is 323 rows, each representing a plot, with 72 columns, each representing a species. This is a sample of the first 4 columns (some row numbers are missing because the 323 plots are a subset of a larger number of preassigned plots, not all of which were surveyed):
> head(plots[,1:4])
Agrostis.canina Agrostis.capillaris Alchemilla.alpina Anthoxanthum.odoratum
1 1 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
8 0 0 0 0
I want to determine whether any of the plant species in this catchment are associated with any others, and if so, whether that is a positive or negative association. To do this I want to perform a chi-squared test of independence on each combination of species. I need to create a 2x2 contingency table for each species-by-species comparison, run a chi-squared test on each of those contingency tables, and save the output. Ultimately I would like to end up with a list or matrix of all species-by-species tests that shows whether that combination of species has a positive, negative, or no significant association. I'd also like to incorporate some code that only shows an association as positive if all expected values were greater than 5.
I have made a start by writing the following function:
CHI <- function(sppx, sppy) {
  test <- chisq.test(table(sppx, sppy))
  result <- c(test$statistic, test$p.value,
              sign((table(sppx, sppy) - test$expected)[2, 2]))
  return(result)
}
This returns the following:
> CHI(plots$Agrostis.canina, plots$Agrostis.capillaris)
X-squared
1.095869e-27 1.000000e+00 -1.000000e+00
Warning message:
In chisq.test(chitbl) : Chi-squared approximation may be incorrect
Now I'm trying to figure out a way to apply this function to each speciesxspecies combination in the data frame. I essentially want R to take each column, apply the CHI function to that column and each other column in sequence, and so on through all the columns, subtracting each column from the dataframe as it is done so the same species pair is not tested twice. I have tried various methods trying to use "for" loops or "apply" functions, but have not been able to figure this out.
I hope that is clear enough. Any help here would be much appreciated. I have tried looking for existing solutions to this specific problem online, but haven't been able to find any that really helped. If anyone could link me to an existing answer to this that would also be great.
You need the combn function to find all the combinations of the columns and then apply them to your function, something like this:
apply(combn(1:ncol(plots), 2), 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))
I think you are looking for something like this. I used the iris dataset.
require(datasets)
ind<-combn(NCOL(iris),2)
lapply(1:NCOL(ind), function (i) CHI(iris[,ind[1,i]],iris[,ind[2,i]]))
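For what it's worth, here is a self-contained sketch of the combn approach that also keeps the species names and the smallest expected count, so pairs with expected values below 5 can be filtered afterwards. The plots data frame here is simulated (four made-up species, 100 plots), and this CHI is a modified version of the function in the question: the factor(..., levels = 0:1) calls and the min.expected element are additions of mine, not part of the original.

```r
set.seed(42)
# Simulated presence/absence data standing in for the real plots data frame
plots <- as.data.frame(matrix(rbinom(100 * 4, 1, 0.5), ncol = 4,
                              dimnames = list(NULL, paste0("sp", 1:4))))

CHI <- function(sppx, sppy) {
  # Fixing the levels guarantees a 2x2 table even if a value is absent
  tab  <- table(factor(sppx, levels = 0:1), factor(sppy, levels = 0:1))
  # Warnings about small expected counts are suppressed here because
  # min.expected is returned and can be checked explicitly
  test <- suppressWarnings(chisq.test(tab))
  c(statistic    = unname(test$statistic),
    p.value      = test$p.value,
    sign         = sign((tab - test$expected)[2, 2]),
    min.expected = min(test$expected))
}

# Every unordered pair of column names, each pair tested exactly once
pairs  <- combn(names(plots), 2)
res    <- t(apply(pairs, 2, function(p) CHI(plots[[p[1]]], plots[[p[2]]])))
result <- data.frame(species.x = pairs[1, ], species.y = pairs[2, ], res)
print(result)
```

From result, something like subset(result, p.value < 0.05 & min.expected > 5) would pick out the significant pairs that meet the expected-count rule. With 72 species there are 2556 pairs, so some correction for multiple testing is probably worth considering.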
The R code below runs a chi-square test of every categorical variable (factor) in a data frame against one given variable (that argument of chisq.test is held fixed and is defined explicitly):
Define your variable
Change df$variable1 to the factor variable you want to test, and df to the data frame that contains all the factor variables to be tested against df$variable1.
Define your data frame
A new matrix (df2) is created that will contain the chi-square statistic, degrees of freedom, and p-value for each comparison of the given variable against a column of the data frame.
The code was adapted from similar posts on Stack Overflow, none of which produced my desired outcome.
Chi-square statistic / df / p-value for variable vs. data frame
The "2" argument requests column-wise comparisons; see the MARGIN argument of apply.
df2 <- t(round(cbind(apply(df, 2, function(x) {
ch <- chisq.test(df$variable1, x)
c(unname(ch$statistic), ch$parameter, ch$p.value )})), 3))
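A runnable variant of the same idea, as a sketch on simulated data: the data frame df and its columns variable1, g1 and g2 are invented for illustration. It uses sapply over the other columns instead of apply, which avoids testing variable1 against itself and keeps named result columns.

```r
set.seed(1)
# Simulated factors; only the name variable1 follows the code above
df <- data.frame(
  variable1 = factor(sample(c("a", "b"), 200, replace = TRUE)),
  g1        = factor(sample(c("x", "y", "z"), 200, replace = TRUE)),
  g2        = factor(sample(c("lo", "hi"), 200, replace = TRUE))
)

# Test variable1 against every other column
others <- setdiff(names(df), "variable1")
res <- t(sapply(df[others], function(x) {
  ch <- chisq.test(df$variable1, x)
  c(statistic = unname(ch$statistic),
    df        = unname(ch$parameter),
    p.value   = ch$p.value)
}))
print(round(res, 3))
```

Each row of res corresponds to one tested column, so it can be converted with as.data.frame(res) if a data frame is needed.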
I have a vector (really a column of a data frame) that looks like this:
data$outcome
[1] Good Good Good Good Poor
Levels: Good Poor
Here is the str on it:
str(data$outcome)
Factor w/ 2 levels "Good","Poor": 1 1 1 1 2
I don't want 1's and 2's as in as.numeric(data$outcome)
[1] 1 1 1 1 2
I know you are not supposed to dummy-code the variables "manually" for regression, and I know about {psych} dummy.code(), which returns a matrix. I understand that I could use something like model.matrix() on the data.frame:
data$outcome <- model.matrix(lm(s100b ~ outcome, data))[,2]
Not nice...
Isn't there something like dummify(data$outcomes) somewhere in R? Please refrain from easy jokes...
I slightly prefer
data$isGood <- as.numeric(data$outcome == 'Good')
because it is a bit more explicit / less opaque, and would still work even if someone added a new level 'Awesome' to the factor.
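A small sketch comparing the options on the five-value example from the question (the vector is rebuilt here as a standalone factor called outcome). It shows the explicit comparison, the code-minus-one shortcut, and the model.matrix column side by side.

```r
outcome <- factor(c("Good", "Good", "Good", "Good", "Poor"))

# Explicit indicator: 1 where the outcome is Good, 0 otherwise
isGood <- as.numeric(outcome == "Good")

# Shortcut via the internal codes: depends on the level order,
# so it silently changes meaning if a level is added or reordered
isPoor <- as.integer(outcome) - 1L

# What a model formula would generate for this factor
mmPoor <- model.matrix(~ outcome)[, "outcomePoor"]

print(cbind(isGood, isPoor, mmPoor))
```

The explicit comparison is the only one of the three whose meaning does not depend on how the levels happen to be ordered.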
This seems like it should be simple, but I have been struggling for a while to solve. I am trying to extract the value of variable Z - given the values of two categorical variables X & Y.
** BUT, I want to do this for all combinations of X & Y **
So, for any given value, this is easy - I can get Z by using the following code (Assume the data frame is called df):
df[df$X == 1 & df$Y == 2, ]$Z
But, I would like to use this to build a cross-reference table.
The following example will make this easy to understand.
Here's a simplified version of the data frame as it comes in:
Person ID Question Number Response
1 10 YES
1 20 NO
1 30 YES
2 10 YES
2 20 MAYBE
2 30 YES
3 10 YES
3 30 NO
4 20 NO
4 30 MAYBE
I want to be able to take this data and make a cross-reference data.frame, like so:
[row names are the levels of "Person ID" and col names are the levels of "Question Number"]
[10] [20] [30]
[1] YES NO YES
[2] YES MAYBE YES
[3] YES N/A NO
[4] N/A NO MAYBE
I have tried the "table" function, but it gives me summary statistics (frequency counts). So, if I use the following:
table(df$Person.Id, df$Question.Num)
I get the right row and column headings, but the values are frequency counts. Since this is a cross-reference table, I need that to be the value for df$Response instead of the frequency count.
As I said before, I can manually find every value of df$Response using the following code:
df[df$Person.Id == "1" & df$Question.Num == "20", ]$Response
But I cannot manage to stitch this together into a data.frame. I tried to use nested for loops, but couldn't get it to work. I could get all the values out, but found no way to stitch everything into a cross-reference table, as described above.
Just a background note: this is a necessary preparatory step so I can minimize logit linear model.
Based on the suggestion by Metrics, I did the following:
install.packages("reshape")
library("reshape")
cast(df, Person.Id~Question.Num, value = "Response")
That last part is key. The value = "Response" tells the cast function what variable to use to fill in the table.
This is a fantastic package. You can find more information on it here:
http://www.statmethods.net/management/reshape.html
and, the original paper published in the Journal of Statistical Software:
http://www.jstatsoft.org/v21/i12/paper
Thanks for the tips!
I never tried the tidyr package, because the reshape package worked so well. Perhaps, it is just as easy with that package. I leave it to the community to figure that out.
Thanks!
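For anyone who would rather not install a package, the same cross-reference table can be sketched in base R with tapply. This assumes each Person/Question pair occurs at most once and that Response is stored as character rather than as a factor; the data frame below just retypes the example rows.

```r
df <- data.frame(
  Person.Id    = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  Question.Num = c(10, 20, 30, 10, 20, 30, 10, 30, 20, 30),
  Response     = c("YES", "NO", "YES", "YES", "MAYBE", "YES",
                   "YES", "NO", "NO", "MAYBE"),
  stringsAsFactors = FALSE
)

# One cell per (person, question) pair; pairs with no row become NA
xref <- with(df, tapply(Response, list(Person.Id, Question.Num),
                        function(x) x[1]))
print(xref)

# Convert the character matrix to a data frame if needed
xref_df <- as.data.frame(xref, stringsAsFactors = FALSE)
```

The function(x) x[1] part simply takes the first response in each cell, so duplicated pairs would be silently dropped; that is a design shortcut for this example, not a general solution.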
The dataset named data has both categorical and continuous variables. I would like to delete the categorical variables.
I tried:
data.1 <- data[,colnames(data)[[3L]]!=0]
No error is printed, but the categorical variables stay in data.1. Where is the problem?
The summary of "head(data)" is
id 1,2,3,4,...
age 45,32,54,23,...
status 0,1,0,0,...
...
(more variables like the ones above)
All variables are defined as "Factor".
What are you trying to do with that code? First of all, colnames(data) is not a list, so using [[]] doesn't make sense. Second, the only thing you test is whether the third column name is not equal to zero. As a syntactically valid column name can never start with a number, that's pretty much always true. So your code translates to:
data.1 <- data[,TRUE]
Not what you intend to do.
I suppose you know the meaning of binomial. One way of doing that is defining your own function is.binomial() like this :
is.binomial <- function(x,na.action=c('na.omit','na.fail','na.pass')){
FUN <- match.fun(match.arg(na.action))
length(unique(FUN(x)))==2
}
in case you want to take care of NA's. This you can then apply to your dataframe :
data.1 <- data[!sapply(data,is.binomial)]
This way you drop all binomial columns, i.e. columns with only two distinct values.
@Shimpei Morimoto,
I think you need a different approach.
Are the categorical variables defined in the data frame as factors?
If so, you can use:
data.1 <- data[,!sapply(data,is.factor)]
The test you perform now checks whether column name number 3L is not 0.
I don't think that is what you intend.
Another approach is
data.1 <- data[,-3L]
This works only if 3L is the index of the only column with categorical variables.
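Here is a small sketch of dropping factor columns, on a made-up data frame loosely modeled on the summary in the question (only status is a factor here, which is an assumption). The check is done with sapply() rather than apply(), because apply() first coerces the data frame to a matrix, after which no column is a factor any more.

```r
# Toy data: id and age numeric, status stored as a factor
data <- data.frame(id     = 1:4,
                   age    = c(45, 32, 54, 23),
                   status = factor(c(0, 1, 0, 0)))

# sapply() inspects each column as it is stored in the data frame
is_fac <- sapply(data, is.factor)
print(is_fac)

# Keep only the non-factor columns; drop = FALSE keeps a data frame
# even when a single column remains
data.1 <- data[, !is_fac, drop = FALSE]
print(names(data.1))

# apply() works on as.matrix(data), so is.factor() is FALSE everywhere
print(apply(data, 2, is.factor))
```

If every column really is stored as a factor, as the question states, this would drop all of them, so the numeric columns would need converting first.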
I think you're getting there, with your last comment to @Mischa Vreeburg. It might make sense (as you suggest) to reformat your original data file, but you should also be able to solve the problem within R. I can't quite replicate the undefined columns error you got.
Construct some data that look as much like your data as possible:
X <- read.csv(textConnection(
"id,age,pre.treat,status
1,'27', 0,0
2,'35', 1,0
3,'22', 0,1
4,'24', 1,2
5,'55', 1,3
, ,yes(vs)no,"),
quote="\"'")
Take a look:
str(X)
'data.frame': 6 obs. of 4 variables:
$ id : int 1 2 3 4 5 NA
$ age : int 27 35 22 24 55 NA
$ pre.treat: Factor w/ 3 levels " 0"," 1","yes(vs)no": 1 2 1 2 2 3
$ status : int 0 0 1 2 3 NA
Define @Joris Mey's function:
is.binomial <- function(x,na.action=c('na.omit','na.fail','na.pass')) {
FUN <- match.fun(match.arg(na.action))
length(unique(FUN(x)))==2
}
Try it out: you'll see that it does not detect pre.treat as binomial, and keeps all the variables.
sapply(X,is.binomial)
X1 <- X[!sapply(X,is.binomial)]
names(X1)
## keeps everything
We can drop the last row and try again:
X2 <- X[-nrow(X),]
sapply(X2,is.binomial)
It is true in general that R does not expect "extraneous" information such as level IDs to be in the same column as the data themselves. On the one hand, you can do even better in the R world by simply leaving the data as their original, meaningful values ("no", "yes", or "healthy", "sick" rather than 0, 1); on the other hand the data take up slightly more space if stored as a text file, and, more important, it becomes harder to incorporate other meta-data such as units in the file along with the data ...