Ghost factor levels in R [duplicate] - r

Possible Duplicate:
dropping factor levels in a subsetted data frame in R
I have subsetted away the observations with a certain factor level. When I checked the result with summary(), that level was still listed, but with zero observations. Shouldn't it disappear during the subsetting?

Subsetting doesn't drop empty levels, and this is by design. Think of the factor levels as the set of possible categories a thing can belong to. If you take only a subset of these things, the set of possible categories doesn't change; your subset just happens not to contain any members of some of them.
If you want to drop these empty levels, see ?droplevels.
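A small illustration of the behaviour described above, using a made-up factor (the names f and sub exist only for this sketch):

```r
# A toy factor with three levels.
f <- factor(c("a", "b", "c", "a"))

# Subsetting away every "c" keeps "c" as a possible level.
sub <- f[f != "c"]
levels(sub)              # "a" "b" "c"
table(sub)               # "c" is still counted, with 0 observations

# droplevels() returns a copy without the unused levels.
cleaned <- droplevels(sub)
levels(cleaned)          # "a" "b"
```
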

To make the extra levels disappear, use drop=TRUE when subsetting:
newfactor <- oldfactor[indices, drop=TRUE]
Incidentally, one reason this is not the default is that factors with different levels cannot be compared. So if you want to compare your factors with the original vector, or perhaps a different subset of the vector, you'd need to keep the extra levels.
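As a rough sketch of both points, with invented data (oldfactor and indices are placeholders matching the line above, not anything from the asker's code):

```r
oldfactor <- factor(c("x", "y", "z", "x"))
indices   <- c(1, 2, 4)

# Default subsetting keeps all three levels; drop = TRUE removes the unused one.
kept    <- oldfactor[indices]
dropped <- oldfactor[indices, drop = TRUE]
levels(kept)      # "x" "y" "z"
levels(dropped)   # "x" "y"

# Factors that share the same level set can be compared element-wise.
same <- kept == oldfactor[indices]

# Comparing factors with different level sets raises an error,
# which is caught here so the script keeps running.
res <- tryCatch(dropped == kept, error = function(e) conditionMessage(e))
res
```
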

Related

How can I separate categorical data from continuous data in a dataset in R? [duplicate]

This question already has answers here:
Selecting only numeric columns from a data frame
I have a dataset and I need to plot histograms for all the continuous columns, which I know how to do. However, I can't use a loop because there are categorical columns too, and trying to draw a histogram for those raises an error. Is there a way to separate the continuous data from the categorical data? If worst comes to worst, I can just remove the categorical features manually, but I would like to know if there's a way to do this automatically for future reference.
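You can use the "dplyr" package. The example below keeps all columns that are factors; use is.numeric instead of is.factor to keep the continuous columns.
library(dplyr)
# keep only the factor columns
data <- data %>%
  select_if(is.factor)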
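For the histogram use case in the question, a base R variant might look like the sketch below. The data frame df and its column names are invented for illustration; the idea is simply to test each column with is.numeric and loop over the ones that pass.

```r
# Hypothetical data: two continuous columns and one categorical column.
df <- data.frame(
  height = c(1.6, 1.7, 1.8, 1.75),
  weight = c(60, 70, 80, 72),
  group  = factor(c("a", "b", "a", "b"))
)

# Names of the columns that are numeric.
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]

# One histogram per numeric column; the factor column is skipped.
for (nm in num_cols) {
  hist(df[[nm]], main = nm, xlab = nm)
}
```
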

"order" function in R is not working properly with repeated values [duplicate]

This question already has answers here:
rank and order in R
It looks like the "order" function is not working properly with repeated values. For example, check the code below. As you can see, identical elements get a different order.
Is there any way to fix this?
special <- c(0.8612482, 0.1728704, 0.1728704, 0.6933106, 0.4718281, 0.4718281, 0.8275597,
0.3934772, 0.3934772, 0.6777266, 0.2526969, 0.0605038, 0.7352600, 0.7352600,
2.2376845, 0.8814698, 2.7420961, 2.7420961, 1.5314565, 1.4855230)
special[8]
order(special)[8]
special[9]
order(special)[9]
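I think the function you are looking for is sort(), not order(). order() returns the positions of the elements in sorted order rather than the sorted values themselves, so tied values necessarily get different positions.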
sort(special)[8]
0.4718281
sort():
Sort (or order) a vector or factor (partially) into ascending or
descending order. For ordering along more than one variable, e.g., for
sorting data frames, see order.
order():
order returns a permutation which rearranges its first argument into
ascending or descending order, breaking ties by further arguments.
sort.list is the same, using only one argument.
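To make the difference concrete, here is a minimal sketch on a short made-up vector x with one tie. rank() is included because it is probably what the asker expected order() to do, though that is an assumption about the intent.

```r
x <- c(0.8, 0.2, 0.2, 0.5)

# order() gives the indices that would sort x; the tied values sit at
# positions 2 and 3, so they get two different entries.
ord <- order(x)
ord                              # 2 3 4 1

# Indexing x by those positions gives the same result as sort().
x[ord]                           # 0.2 0.2 0.5 0.8
sort(x)                          # 0.2 0.2 0.5 0.8

# rank() assigns tied values the same rank.
r <- rank(x, ties.method = "min")
r                                # 4 1 1 3
```
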

R Subset by Factor level

I am attempting to write code that will subset by factor level, not factor name, so that minimal editing is required to run the code with other data sets. The number of factor levels is always 4, yet the names of those levels may change depending on the data set.
Specifically, the names of the SET1$Bearing factor levels may change. For example, "001°" may be "002°" in the next data set. Presently I use this code...
SET1Bearing1<-SET1[SET1$Bearing=="001°",]
Yet this only subsets by the name of the factor. Is there a way to subset by factor level regardless of the factor name?
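SET1Bearing1<-SET1[SET1$Bearing == levels(SET1$Bearing)[4],]
This should work. It selects the rows belonging to the fourth level, whatever its label is; change the index to pick a different level.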
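A self-contained sketch of the same idea on toy data. The SET1 frame below is invented, with plain labels standing in for the bearing names; split() is shown as one possible way to get all four subsets at once.

```r
# Hypothetical stand-in for the asker's data: a factor with four levels.
SET1 <- data.frame(
  Bearing = factor(c("001", "090", "180", "270", "001", "180")),
  Value   = 1:6
)

# Pick a level by position instead of by label.
lv <- levels(SET1$Bearing)
SET1Bearing1 <- SET1[SET1$Bearing == lv[1], ]

# Alternatively, split into a list with one data frame per level,
# in level order, and index the list by position.
by_level <- split(SET1, SET1$Bearing)
by_level[[1]]
```
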

class/type columns and match factor levels of dataframe based on another dataframe

I need to score a big dataframe: 250M rows x ~2000 cols (1TB). A fit has been developed (with the C5.0 library) based on training data. Currently the entire dataset won't fit in memory. Big data solutions are being investigated, but I'm wondering if I can cut up the file and run predict() in chunks.
The basic problem is that I want to translate column classes and factor levels from one dataframe to another. More detail:
PROBLEM #1: Some of the column classes don't match when being read from hadoop.
PROBLEM #2: With chunking, a chunk can end up with factors that have fewer levels than the training dataset (since you're looking at a subsample of the full set). Because of this, predict() refuses to score a set that has missing factor levels.
QUESTION: I was hoping to just take the classes and factor levels from the training set, then class the columns and 'relevel' the factors in the big scoring set with levels(). Can one take the column classes and factor levels of one dataframe and transfer them to the variables of the same name in another dataframe? I suppose this could be done with a for loop that reads the factors from one frame and applies them to the other, but this would require lots of if statements for all the classes and seems messy. Is there a way to do this with apply functions or a simpler 'one-liner'?
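One possible approach, as a rough sketch only: loop over the shared column names, rebuild factors with the training levels, and coerce everything else to the training column's class. match_types is a hypothetical helper written for this example (it is not part of C5.0 or base R), the toy frames are invented, and only a few basic classes are handled. Note that values in a chunk that never occurred in training become NA.

```r
# Hypothetical training frame and a scoring chunk that came back with
# character columns and fewer observed categories.
train <- data.frame(
  grp = factor(c("a", "b", "c")),
  n   = c(1L, 2L, 3L),
  x   = c(0.5, 1.5, 2.5)
)
chunk <- data.frame(
  grp = c("a", "a", "b"),
  n   = c("4", "5", "6"),
  x   = c(1, 2, 3),
  stringsAsFactors = FALSE
)

match_types <- function(chunk, train) {
  for (nm in intersect(names(train), names(chunk))) {
    ref <- train[[nm]]
    if (is.factor(ref)) {
      # Rebuild the factor with the full set of training levels.
      chunk[[nm]] <- factor(as.character(chunk[[nm]]), levels = levels(ref))
    } else {
      # Pick a coercion function from the training column's class;
      # unknown classes are left untouched.
      coerce <- switch(class(ref)[1],
        integer   = as.integer,
        numeric   = as.numeric,
        character = as.character,
        logical   = as.logical,
        identity
      )
      chunk[[nm]] <- coerce(chunk[[nm]])
    }
  }
  chunk
}

fixed <- match_types(chunk, train)
str(fixed)
```
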

How to clean/reconstruct factor in R [duplicate]

Possible Duplicate:
dropping factor levels in a subsetted data frame in R
I have a data frame with a factor column, and I would like to use subset to extract only part of its data. But the extracted data frame's factor column still has the same levels, even though some levels no longer have any values. This affects my subsequent steps (like visualization using ggplot).
The following is a sample code.
d<-data.frame(c1=factor(c(1,1,2,3)),c2=c("a","b","c","d"))
d<-subset(d,c1 %in% c(1,2))
d$c1
The column c1 still has 3 levels (1,2,3), but I'd like it to have only (1,2), because there is no value for level 3. Then in the visualization, nothing would be drawn for level 3.
How can I achieve that? Thanks
Use droplevels:
d <- droplevels(d)
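Applied to the sample code from the question, this looks roughly as follows. The copies d2 and d3 are introduced here only so the before and after can be compared; re-creating the column with factor() is an alternative for a single column.

```r
d <- data.frame(c1 = factor(c(1, 1, 2, 3)), c2 = c("a", "b", "c", "d"))
d <- subset(d, c1 %in% c(1, 2))
levels(d$c1)        # "1" "2" "3": the empty level is still there

# droplevels() on a data frame drops unused levels from its factor columns.
d2 <- droplevels(d)
levels(d2$c1)       # "1" "2"

# For one column, calling factor() again has the same effect.
d3 <- d
d3$c1 <- factor(d3$c1)
levels(d3$c1)       # "1" "2"
```
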
