I am trying to write code that subsets by factor level position rather than by level name, so that minimal editing is required to run the code on other data sets. The number of factor levels is always 4, yet the names of those levels may change depending on the data set.
Specifically, the names of the SET1$Bearing levels may change. For example, "001°" may be "002°" in the next data set. Presently I use this code:
SET1Bearing1<-SET1[SET1$Bearing=="001°",]
Yet this only subsets by the name of the level. Is there a way to subset by level position regardless of the level's name?
SET1Bearing1<-SET1[SET1$Bearing == levels(SET1$Bearing)[4],]
This should work.
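As a rough illustration of indexing levels by position, here is a minimal sketch. The data frame below is a made-up stand-in for SET1 (its values, the Value column, and the plain level names without the degree sign are all assumptions), and split() is offered only as one possible way to get all four subsets at once.

```r
# Hypothetical stand-in for SET1; the values are invented for illustration.
SET1 <- data.frame(
  Bearing = factor(c("001", "090", "180", "270", "001", "180")),
  Value   = c(1.2, 3.4, 5.6, 7.8, 9.0, 2.1)
)

# levels() returns the level names in their stored order.
bearing_levels <- levels(SET1$Bearing)

# Subset by level position, without hard-coding the level name.
SET1Bearing1 <- SET1[SET1$Bearing == bearing_levels[1], ]

# One data frame per level, in level order.
by_level <- split(SET1, SET1$Bearing)
SET1Bearing4 <- by_level[[4]]
```

Because nothing refers to a level by name, the same lines should run unchanged when the level names differ in the next data set.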
Related
I have a factor with a great many levels and would like to eliminate 2 of them. I know that I could use subset to extract particular levels of a factor to form a new dataset. Since I have many levels, however, I would rather eliminate particular levels than list all the ones to keep.
In the following example, how do I eliminate the fish level?
catsDogsFish=c(rep("cat", 5), rep("dog", 5), rep("fish", 5))
happyInside=c(7,4,9,7,8, 2,4,7,3,3,9,8,9,10,10)
happyPets=data.frame(catsDogsFish, happyInside)
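One possible approach, sketched under the assumption that the column should be a factor (hence stringsAsFactors = TRUE, which is not in the original code): exclude the rows with %in%, then call droplevels() so the now-empty level goes away. The names unwanted and noFish are made up for this example.

```r
catsDogsFish <- c(rep("cat", 5), rep("dog", 5), rep("fish", 5))
happyInside <- c(7, 4, 9, 7, 8, 2, 4, 7, 3, 3, 9, 8, 9, 10, 10)
# stringsAsFactors = TRUE is an assumption so the column is a factor.
happyPets <- data.frame(catsDogsFish, happyInside, stringsAsFactors = TRUE)

# Levels to eliminate; add more names here to remove several at once.
unwanted <- c("fish")

# Keep every row whose level is NOT in the unwanted set.
noFish <- happyPets[!(happyPets$catsDogsFish %in% unwanted), ]

# Subsetting keeps the empty "fish" level, so drop it explicitly.
noFish <- droplevels(noFish)
levels(noFish$catsDogsFish)
```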
I need to score a big dataframe: 250M rows x ~2000 cols (1TB). A fit has been developed (from C5.0 library) based on training data. Currently the entire dataset won't fit in memory. Big data solutions are being investigated, but I'm wondering if I can cut up the file and run predict() in chunks.
The basic problem is that I want to translate column classes and factor levels from one dataframe to another. More detail:
PROBLEM #1: Some of the column classes don't match when being read from hadoop.
PROBLEM #2: By chunking, you can get factors that have fewer levels than in the training dataset (since you're looking at a subsample of the full set). Because of this, predict() refuses to score a set that has missing factor levels.
QUESTION: I was hoping to just take the classes and factor levels from the training set, then class the columns and 'relevel' the factors in the big scoring set with levels(). Can one take the column classes and factor levels of one dataframe and transfer them to the variables of the same name in another dataframe? I suppose this could be done with a for loop by reading factors from one frame and applying them to the other. But this would require lots of if statements for all the classes, which seems messy. Is there a way to do this with apply functions or a simpler 'one-liner'?
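A minimal sketch of the idea, not a tested solution for data of this size: match_schema() is a hypothetical helper, and the toy train and chunk frames are invented. It handles only factor, numeric, integer, character and logical columns, and any value in a chunk that is not a training level becomes NA.

```r
# Hypothetical helper: coerce each shared column of `chunk` to the class
# (and, for factors, the levels) of the same-named column in `train`.
match_schema <- function(chunk, train) {
  shared <- intersect(names(train), names(chunk))
  chunk[shared] <- Map(function(new_col, ref_col) {
    # Go through character so factor codes are never mistaken for values.
    if (is.factor(new_col)) new_col <- as.character(new_col)
    if (is.factor(ref_col)) {
      # Values absent from the training levels become NA here.
      return(factor(as.character(new_col), levels = levels(ref_col),
                    ordered = is.ordered(ref_col)))
    }
    switch(class(ref_col)[1],
           numeric   = as.numeric(new_col),
           integer   = as.integer(new_col),
           character = as.character(new_col),
           logical   = as.logical(new_col),
           new_col)  # unknown class: leave the column untouched
  }, chunk[shared], train[shared])
  chunk
}

# Toy data standing in for the training set and one chunk read as text.
train <- data.frame(color = factor(c("red", "green", "blue")),
                    n = c(1.5, 2, 3))
chunk <- data.frame(color = c("red", "red"), n = c("4", "5"),
                    stringsAsFactors = FALSE)
aligned <- match_schema(chunk, train)
```

Each chunk would be passed through the helper before predict(), so every chunk carries the full set of training levels even when some levels do not occur in it.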
I have a factor like this one: [image: factor]
This factor has 4 levels (the four possible responses to the survey), but people replied with only 2 of the 4 choices.
I want to convert it to an ordered factor without loss of information, that is, without losing the unused choices. I used the "ordered" function, but the new ordered factor has lost the 2 unused levels: [image: ordered factor]
I would like to emphasize that I have a lot of factors to convert, so I don't want a manual solution.
Thank you for your help.
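One way this might be done is to pass the existing levels back in explicitly, since factor() keeps every level it is given even when some are unused. The survey data and the helper name to_ordered below are invented for illustration.

```r
# Hypothetical survey data: four possible choices, only two actually used.
choices <- c("never", "rarely", "often", "always")
survey <- data.frame(
  q1 = factor(c("rarely", "often", "rarely"), levels = choices),
  q2 = factor(c("often", "often", "rarely"), levels = choices)
)

# Re-create the factor as ordered while supplying all existing levels,
# so unused levels are kept.
to_ordered <- function(f) factor(f, levels = levels(f), ordered = TRUE)

# Apply to every factor column at once, no manual per-column work.
is_fac <- vapply(survey, is.factor, logical(1))
survey[is_fac] <- lapply(survey[is_fac], to_ordered)
```

The order of the resulting ordered factor is the order of the existing levels, so those need to be in the intended order beforehand.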
I have an R data frame, and some of the variables are categorical. For example, sex is "male" or "female" and "do you smoke" is 0 or 1. Other variables are continuous.
I would like to know if there is any way to decide whether a variable is categorical and, if so, compute its frequencies.
I think in my case a good test would be to check whether the variable takes fewer than k=4 distinct values.
While you should use factors for categorical variables, you can find the unique values in a vector x with unique, and count them:
length(unique(x))
You can use class(dataframe$variable) to find the class of a variable within a data frame, and so determine whether the variable is a factor or not.
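Putting the k=4 test together with length(unique(x)) could look like the sketch below. The data frame df and its values are made up, and treating "fewer than k distinct values" as categorical is the asker's heuristic, not a general rule.

```r
# Hypothetical data: two categorical-looking columns and one continuous.
df <- data.frame(
  sex    = c("male", "female", "female", "male", "female"),
  smoke  = c(0, 1, 0, 0, 1),
  height = c(1.80, 1.65, 1.70, 1.75, 1.62),
  stringsAsFactors = FALSE
)

k <- 4

# TRUE for each column with fewer than k distinct values.
is_categorical <- vapply(df, function(x) length(unique(x)) < k, logical(1))

# Frequency table for each column flagged as categorical.
freqs <- lapply(df[is_categorical], table)
```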
I have subsetted away observations with a certain factor level. When checking whether this has been done with summary() the levels were still listed, but with zero observations. Shouldn't they disappear during the subsetting?
Subsetting doesn't drop empty levels. This is a feature, not a bug. Think of the factor levels as determining the possible categories of a thing. If you take only a subset of these things, the possible categories don't change; your subset just doesn't contain any members of some of them.
If you want to drop these empty levels, see ?droplevels.
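A small sketch of this behaviour, using an invented factor f:

```r
f <- factor(c("a", "b", "c", "a"))

# Plain subsetting: level "c" has no observations left but is still a level.
sub <- f[f != "c"]
table(sub)

# droplevels() removes the levels that no longer occur.
table(droplevels(sub))
```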
To make the extra levels disappear, use drop=TRUE when subsetting:
newfactor <- oldfactor[indices, drop=TRUE]
Incidentally, one reason this is not the default is that factors with different levels cannot be compared. So if you want to compare your factors with the original vector, or perhaps a different subset of the vector, you'd need to keep the extra levels.
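To illustrate both points, here is a sketch with made-up values for oldfactor and indices. The tryCatch() is only there so the failed comparison can be inspected instead of stopping the script.

```r
oldfactor <- factor(c("a", "b", "c", "a"))
indices <- oldfactor != "c"

kept      <- oldfactor[indices]               # still has levels a, b, c
newfactor <- oldfactor[indices, drop = TRUE]  # levels reduced to a, b

# Same level set on both sides: element-wise comparison works.
same_levels_ok <- all(kept == oldfactor[indices])

# Different level sets: == on two factors signals an error,
# captured here as a message string.
cmp <- tryCatch(newfactor == kept,
                error = function(e) conditionMessage(e))
```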