This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
dropping factor levels in a subsetted data frame in R
I have a data frame with several variables that I'm running a mixed model on using lme(). One of the variables, ForAgeCat, has five factor levels: 1,2,3,4,5.
str(mvthab.3hr.fc$ForAgeCat)
>Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
The problem is that factor level 3 actually doesn't exist, that is, in this dataset (which is a subset of a larger dataset) there are no observations from factor level 3, which I think is messing with my modeling in lme(). Can someone help me to remove/eliminate factor level 3 from the list of factor levels?
Use the function droplevels, like so:
> DF$factor_var = droplevels(DF$factor_var)
More detail:
> # create a sample dataframe:
> col1 = runif(10)
> col1
[1] 0.6971600 0.1649196 0.5451907 0.9660817 0.8207766 0.9527764
0.9643410 0.2179709 0.9302741 0.4195046
> col2 = gl(n=2, k=5, labels=c("M", "F"))
> col2
[1] M M M M M F F F F F
Levels: M F
> DF = data.frame(Col1=col1, Col2=col2)
> DF
Col1 Col2
1 0.697 M
2 0.165 M
3 0.545 M
4 0.966 M
5 0.821 M
6 0.953 F
7 0.964 F
8 0.218 F
9 0.930 F
10 0.420 F
> # now filter DF so that only *one* factor value remains
> DF1 = DF[DF$Col2=="M",]
> DF1
Col1 Col2
1 0.697 M
2 0.165 M
3 0.545 M
4 0.966 M
5 0.821 M
> str(DF1)
'data.frame': 5 obs. of 2 variables:
$ Col1: num 0.697 0.165 0.545 0.966 0.821
$ Col2: Factor w/ 2 levels "M","F": 1 1 1 1 1
> # but still 2 factor *levels*, even though only one value
> DF1$Col2 = droplevels(DF1$Col2)
> # now Col2 has only a single level:
> str(DF1)
'data.frame': 5 obs. of 2 variables:
$ Col1: num 0.697 0.165 0.545 0.966 0.821
$ Col2: Factor w/ 1 level "M": 1 1 1 1 1
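Applied to the situation in the question, a minimal sketch might look like the following. The vector `ForAgeCat` here is made-up illustration data (not the asker's data set): five declared levels, with level 3 never observed.

```r
# Hypothetical stand-in for the question's column: levels 1-5, but no 3s
ForAgeCat <- factor(c(5, 5, 1, 2, 4, 4), levels = 1:5)

# The unused level shows up with a count of zero
table(ForAgeCat)

# droplevels() keeps every observation and discards levels that never occur
ForAgeCat_clean <- droplevels(ForAgeCat)
levels(ForAgeCat_clean)
```

Calling droplevels() on a whole data frame does the same for every factor column at once.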
I am trying to reduce the number of levels in each factor variable in my data, using two operations:
If the number of levels is larger than a cut-off, replace the least frequent levels with a new level until the number of levels has reached the cut-off.
Replace levels that have too few observations with a new level.
I wrote a function which works fine, but I don't like the code. It does not matter if the new level REMAIN itself has too few observations. I would prefer a dplyr approach.
library(data.table)  # the function relies on data.table's := syntax
ReplaceFactor <- function(data, max_levels, min_values_factor){
  # First make sure that no factor has too many levels
  for(i in colnames(data)){
    if(is.factor(data[[i]])){
      if(length(levels(data[[i]])) > max_levels){
        # keep the (max_levels - 1) most frequent levels, lump the rest
        levels_keep <- names(sort(table(data[[i]]), decreasing = TRUE))[1 : (max_levels - 1)]
        data[!get(i) %in% levels_keep, (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  # Now make sure that each level has enough observations
  for(i in colnames(data)){
    if(is.factor(data[[i]])){
      if(min(table(data[[i]])) < min_values_factor){
        levels_replace <- table(data[[i]])[table(data[[i]]) < min_values_factor]
        data[get(i) %in% names(levels_replace), (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  return(data)
}
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E", "E", "E","E","E", "E", "E"),
                 stringsAsFactors = TRUE)  # needed on R >= 4.0 to get factor columns
str(df)
'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 3 3 3 4 4
$ D: Factor w/ 3 levels "A","B","E": 1 2 3 3 3 3 3 3 3
dt2 <- ReplaceFactor(data = data.table(df),
max_levels = 3,
min_values_factor = 2)
str(dt2)
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 3 levels "A","C","REMAIN": 1 1 3 3 2 2 2 3 3
$ D: Factor w/ 2 levels "E","REMAIN": 2 2 1 1 1 1 1 1 1
- attr(*, ".internal.selfref")=<externalptr>
dt2
A B C D
1: A 1 A REMAIN
2: A 2 A REMAIN
3: B 3 REMAIN E
4: B 4 REMAIN E
5: C 5 C E
6: C 6 C E
7: C 7 C E
8: C 8 REMAIN E
9: C 9 REMAIN E
Using forcats:
library(dplyr)
library(forcats)
max_levels <- 3
min_values_factor <- 2
df %>%
mutate_if(is.factor, fct_lump, n = max_levels,
other_level = "REMAIN", ties.method = "first") %>%
mutate_if(is.factor, fct_lump, prop = (min_values_factor - 1) / nrow(.),
other_level = "REMAIN")
# A B C D
# 1 A 1 A REMAIN
# 2 A 2 A REMAIN
# 3 B 3 B E
# 4 B 4 B E
# 5 C 5 C E
# 6 C 6 C E
# 7 C 7 C E
# 8 C 8 REMAIN E
# 9 C 9 REMAIN E
(Oh, and I wasn't able to replicate the exact behavior of your function, but you might get what you want by tweaking ties.method and subtracting 1 from max_levels.)
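For the second operation alone (merging levels with too few observations), a dependency-free sketch in base R could look like this. The helper name `lump_rare` and its arguments are made up for illustration; it is not part of forcats or dplyr, and it is not a drop-in replacement for the function in the question.

```r
# Hypothetical helper: merge every level seen fewer than min_count times
# into a single level named `other`
lump_rare <- function(f, min_count, other = "REMAIN") {
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  lev <- levels(f)
  lev[lev %in% rare] <- other
  # assigning duplicated labels to levels() merges those levels
  levels(f) <- lev
  f
}

# Column D from the question: "A" and "B" occur once each
D <- factor(c("A", "B", "E", "E", "E", "E", "E", "E", "E"))
D_lumped <- lump_rare(D, min_count = 2)
table(D_lumped)
```

Because the merge happens through `levels<-`, the observations themselves are never touched; only the labels change.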
arrange() in dplyr produces an incorrect result.
library(dplyr)
x <- as.data.frame(cbind(name=c("A","B","C","D"), val=c(0.032, 0.077, 0.4, 0.0001)))
x.1 <- x %>% arrange(val)
x.2 <- x %>% arrange(desc(val))
The outputs are:
> x
name val
1 A 0.032
2 B 0.077
3 C 0.4
4 D 1e-04
> x.1
name val
1 A 0.032
2 B 0.077
3 C 0.4
4 D 1e-04
> x.2
name val
1 D 1e-04
2 C 0.4
3 B 0.077
4 A 0.032
Both the ascending and the descending sort produce incorrect output.
What am I doing wrong here?
Thank you.
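as.data.frame(cbind()) is what you are doing wrong there. cbind() builds a matrix, so everything is converted to character, and as.data.frame() then turns those character columns into factors (before R 4.0; from R 4.0 on they stay character, which still sorts as text rather than as numbers). Have a look ...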
str(x)
# 'data.frame': 4 obs. of 2 variables:
# $ name: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ val : Factor w/ 4 levels "0.032","0.077",..: 1 2 3 4
I don't know where people are learning this method of creating data frames, but it's terrible practice and should never be used.
Use data.frame() to create data frames; that's why it's there (or, when using dplyr, there is data_frame() as well, which has since been deprecated in favour of tibble()).
library(dplyr)
x <- data.frame(name=c("A","B","C","D"), val=c(0.032, 0.077, 0.4, 0.0001))
x.1 <- x %>% arrange(val)
x.2 <- x %>% arrange(desc(val))
x.1
# name val
# 1 D 0.0001
# 2 A 0.0320
# 3 B 0.0770
# 4 C 0.4000
x.2
# name val
# 1 C 0.4000
# 2 B 0.0770
# 3 A 0.0320
# 4 D 0.0001
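If the data has already been built the cbind() way, the numbers can usually be recovered. The following is a small sketch of the diagnosis and the repair, using the same values as the question; the object names `m` and `val_num` are only for illustration.

```r
# cbind() on mixed inputs yields a character matrix, not numbers
m <- cbind(name = c("A", "B", "C", "D"), val = c(0.032, 0.077, 0.4, 0.0001))
typeof(m)

# Go through as.character() first: it is harmless on text and essential
# on a factor, where as.numeric() alone would return the integer codes
val_num <- as.numeric(as.character(m[, "val"]))

# Row order for an ascending numeric sort
order(val_num)
```

The same as.numeric(as.character(...)) idiom applies to a factor column such as x$val from the question.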
Here is an example that was taken from a fellow SO member.
# define a %not% to be the opposite of %in%
library(dplyr)
# data
f <- c("a","a","a","b","b","c")
s <- c("fall","spring","other", "fall", "other", "other")
v <- c(3,5,1,4,5,2)
(dat0 <- data.frame(f, s, v, stringsAsFactors = TRUE))  # factor columns on R >= 4.0 too
# f s v
#1 a fall 3
#2 a spring 5
#3 a other 1
#4 b fall 4
#5 b other 5
#6 c other 2
(sp.tmp <- filter(dat0, s == "spring"))
# f s v
#1 a spring 5
(str(sp.tmp))
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 3 levels "a","b","c": 1
# $ s: Factor w/ 3 levels "fall","other",..: 3
# $ v: num 5
The df resulting from filter() has retained all the levels from the original df.
What would be the recommended way to drop the unused level(s), i.e. "fall" and "other", within the dplyr framework?
You could do something like:
dat1 <- dat0 %>%
filter(s == "spring") %>%
droplevels()
Then
str(dat1)
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 1 level "a": 1
# $ s: Factor w/ 1 level "spring": 1
# $ v: num 5
You could use droplevels
sp.tmp <- droplevels(sp.tmp)
str(sp.tmp)
#'data.frame': 1 obs. of 3 variables:
#$ f: Factor w/ 1 level "a": 1
#$ s: Factor w/ 1 level "spring": 1
# $ v: num 5
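When only some factor columns should lose their unused levels, the data frame method of droplevels() takes an `except` argument with the indices of columns to leave alone. A minimal sketch, using a small made-up data frame shaped like the one in the question:

```r
# Illustration data (not the original dat0): two factor columns, one numeric
dat <- data.frame(
  f = factor(c("a", "a", "b", "c")),
  s = factor(c("fall", "spring", "other", "other")),
  v = 1:4
)
sp <- dat[dat$s == "spring", ]

# Drop unused levels in every factor column
all_dropped <- droplevels(sp)

# Drop unused levels everywhere except column 1 (f)
keep_f <- droplevels(sp, except = 1)

sapply(all_dropped[c("f", "s")], nlevels)
sapply(keep_f[c("f", "s")], nlevels)
```

Keeping the full level set on selected columns can be useful when several subsets are later compared or plotted with a common legend.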
Users,
I have this data frame:
A<- c(10,2,4,5,3,5,98,65,36,65,6,100,70,54,25,23,22,30,15,23)
B<- c(1,0.1,0.5,0.8,0.2,0.9,3,1.2,5.6,3.5,15.9,10.2,5,5.1,7.1,5,6,10,4,8)
C<- c("a","a","a","a","a","a","b","b","b","b","c","c","c","c","d","d","d","d","d","d")
mydf<- data.frame(A,B,C, stringsAsFactors = TRUE)
and I did a subset keeping only the level "a".
subset<- subset(mydf, mydf$C=="a")
But when I make a plot (please see the image) the graph shows also the deleted levels.
plot(B~ C, data=subset)
How can I plot the subsetted data frame avoiding deleted levels?
Thank you!
Use droplevels:
subset$C <- droplevels(subset$C)
plot(B~ C, data=subset)
By the way, subset is not a good name for a data.frame.
str(subset)
#'data.frame': 6 obs. of 3 variables:
# $ A: num 10 2 4 5 3 5
# $ B: num 1 0.1 0.5 0.8 0.2 0.9
# $ C: Factor w/ 4 levels "a","b","c","d": 1 1 1 1 1 1
Remove the missing factor levels by means of factor:
subset$C <- factor(subset$C)
str(subset)
#'data.frame': 6 obs. of 3 variables:
#$ A: num 10 2 4 5 3 5
#$ B: num 1 0.1 0.5 0.8 0.2 0.9
#$ C: Factor w/ 1 level "a": 1 1 1 1 1 1
Just do:
plot(B~ droplevels(C), data=subset)
I'm looking for a way to remove rows of a data frame that belong to groups with fewer than 3 observations. Let me explain in more detail.
I have a data frame with 6 independent variables and 1 dependent variable. As I'm doing a density plot in ggplot2 using faceting, groups with fewer than 3 observations are not plotted (obviously). I'm looking for a way to delete the rows of these groups. This is an example of the data:
'data.frame': 432 obs. of 6 variables:
$ ID : Factor w/ 439 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Forno : Factor w/ 8 levels "Micro","Macro",..: 1 1 1 6 6 6 4 4 4 5 ...
$ Varieta: Factor w/ 11 levels "cc","dd",..: 11 11 11 6 6 6 1 1 1 6 ...
$ Impiego: Factor w/ 5 levels "aperto","chiuso",..: 2 2 2 3 3 3 2 2 2 5 ...
$ MediaL : num 60.7 58.9 60.5 55.9 56.1 ...
$ MediaL.sd : num 4.81 4.79 4.84 5.27 5.64 ...
ggplot code:
ggplot(d1,aes(MediaL))+geom_density(aes(fill=Varieta),colour=NA,alpha=0.5)+
scale_fill_brewer(palette="Set1")+facet_grid(Forno~Impiego)+
theme(axis.text.x=element_text(angle=90,hjust=1))+theme_mio +xlim(45,65)+
stat_bin(geom="text",aes(y=0,label=..count..),size=2,binwidth=2)
I would like to remove all the interactions with fewer than 3 observations.
Providing the actual output of your sample data would be useful. You can provide this via dput(yourObject) instead of the text representation you provided. However, it does seem like the same basic approach below works equally well with a matrix, data.frame, and table data structure.
#Matrix
x <- matrix(c(5,4,4,3,1,5,1,8,2), ncol = 3, byrow = TRUE)
x[x < 3] <- NA
#----
[,1] [,2] [,3]
[1,] 5 4 4
[2,] 3 NA 5
[3,] NA 8 NA
#data.frame
xd <- as.data.frame(matrix(c(5,4,4,3,1,5,1,8,2), ncol = 3, byrow = TRUE))
xd[xd < 3] <- NA
#----
V1 V2 V3
1 5 4 4
2 3 NA 5
3 NA 8 NA
#Table. Simulate some data first
set.seed(1)
samp <- data.frame(x1 = sample(c("acqua", "fango", "neve"), 20, TRUE),
x2 = sample(c("pippo", "pluto", "paperino"), 20, TRUE))
x2 <-table(samp)
x2[x2 < 3] <- NA
#----
x2
x1 paperino pippo pluto
acqua 3
fango 3
neve 3 3
ggplot generally likes data to be in long format, most often achieved via the melt() command in reshape2. If you provide your plotting code, that may illustrate a better way to remove the data you don't want to plot.
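To drop the rows themselves rather than blank out cells of a table, one option is to count the rows per combination and filter on that count. The sketch below uses base R's ave() on a small made-up data frame; the column names follow the question, but the values and the choice of grouping columns (Forno and Impiego) are assumptions for illustration.

```r
# Made-up data in the shape of the question's data frame
d <- data.frame(
  Forno   = c("Micro", "Micro", "Micro", "Macro", "Macro", "Micro"),
  Impiego = c("chiuso", "chiuso", "chiuso", "aperto", "aperto", "aperto"),
  MediaL  = c(60.7, 58.9, 60.5, 55.9, 56.1, 57.0),
  stringsAsFactors = TRUE
)

# For every row, the number of rows sharing its Forno x Impiego combination
n_per_group <- ave(d$MediaL, d$Forno, d$Impiego, FUN = length)

# Keep only rows from combinations with at least 3 observations,
# then discard factor levels that no longer occur
d_kept <- droplevels(d[n_per_group >= 3, ])
d_kept
```

With dplyr, the same filter can be written as group_by(Forno, Impiego) followed by filter(n() >= 3).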