Users,
I have this data frame:
A<- c(10,2,4,5,3,5,98,65,36,65,6,100,70,54,25,23,22,30,15,23)
B<- c(1,0.1,0.5,0.8,0.2,0.9,3,1.2,5.6,3.5,15.9,10.2,5,5.1,7.1,5,6,10,4,8)
C<- c("a","a","a","a","a","a","b","b","b","b","c","c","c","c","d","d","d","d","d","d")
mydf<- data.frame(A,B,C, stringsAsFactors=TRUE)  # in R >= 4.0 this is needed for C to be a factor
and I did a subset keeping only the level "a".
subset<- subset(mydf, mydf$C=="a")
But when I make a plot, the graph also shows the deleted levels.
plot(B~ C, data=subset)
How can I plot the subsetted data frame avoiding deleted levels?
Thank you!
Use droplevels:
subset$C <- droplevels(subset$C)
plot(B~ C, data=subset)
By the way, subset is not a good name for a data.frame, since it is also the name of the base function subset().
str(subset)
#'data.frame': 6 obs. of 3 variables:
# $ A: num 10 2 4 5 3 5
# $ B: num 1 0.1 0.5 0.8 0.2 0.9
# $ C: Factor w/ 4 levels "a","b","c","d": 1 1 1 1 1 1
Remove the missing factor levels by means of factor:
subset$C <- factor(subset$C)
str(subset)
#'data.frame': 6 obs. of 3 variables:
#$ A: num 10 2 4 5 3 5
#$ B: num 1 0.1 0.5 0.8 0.2 0.9
#$ C: Factor w/ 1 level "a": 1 1 1 1 1 1
Just do:
plot(B~ droplevels(C), data=subset)
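A minimal sketch of the same idea applied to the whole data frame at the point of subsetting, so that no column keeps unused levels. The name only_a is just an illustrative choice (it avoids reusing subset as a variable name), and stringsAsFactors = TRUE is assumed so that C is a factor in R >= 4.0.

```r
A <- c(10,2,4,5,3,5,98,65,36,65,6,100,70,54,25,23,22,30,15,23)
B <- c(1,0.1,0.5,0.8,0.2,0.9,3,1.2,5.6,3.5,15.9,10.2,5,5.1,7.1,5,6,10,4,8)
C <- c("a","a","a","a","a","a","b","b","b","b","c","c","c","c","d","d","d","d","d","d")
mydf <- data.frame(A, B, C, stringsAsFactors = TRUE)

# subset and drop the unused levels of every factor column in one step
only_a <- droplevels(subset(mydf, C == "a"))

levels(mydf$C)    # the original factor still has all four levels
levels(only_a$C)  # only the level that is actually present remains
```

After this, plot(B ~ C, data = only_a) should draw a single box for "a".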
I produced a large data frame (1700+obs,159 variables) with a function that collects info from a website. Usually, the function finds numeric values for some columns, and thus they're numeric. Sometimes, however, it finds some text, and converts the whole column to text.
I have one df whose column classes are correct, and I would like to "paste" those classes to a new, incorrect df.
Say, for example:
dfCorrect<-data.frame(x=c(1,2,3,4),y=as.factor(c("a","b","c","d")),z=c("bar","foo","dat","dot"),stringsAsFactors = F)
str(dfCorrect)
'data.frame': 4 obs. of 3 variables:
$ x: num 1 2 3 4
$ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
$ z: chr "bar" "foo" "dat" "dot"
## now I have my "wrong" data frame:
dfWrong<-as.data.frame(sapply(dfCorrect,paste,sep=""),stringsAsFactors=TRUE)  # stringsAsFactors=TRUE reproduces the factor columns in R >= 4.0
str(dfWrong)
'data.frame': 4 obs. of 3 variables:
$ x: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
$ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
$ z: Factor w/ 4 levels "bar","dat","dot",..: 1 4 2 3
I wanted to copy the classes of each column of dfCorrect into dfWrong, but haven't found how to do it properly.
I've tested:
dfWrong1<-dfWrong
dfWrong1[0,]<-dfCorrect[0,]
str(dfWrong1) ## bad result
'data.frame': 4 obs. of 3 variables:
$ x: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
$ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
$ z: Factor w/ 4 levels "bar","dat","dot",..: 1 4 2 3
dfWrong1<-dfWrong
str(dfWrong1)<-str(dfCorrect)
'data.frame': 4 obs. of 3 variables:
$ x: num 1 2 3 4
$ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
$ z: chr "bar" "foo" "dat" "dot"
Error in str(dfWrong1) <- str(dfCorrect) :
could not find function "str<-"
With this small matrix I could go by hand, but what about larger ones? Is there a way to "copy" the classes from one df to another without having to know the individual classes (and indexes) of each column?
Expected final result (after properly "pasting" classes):
all.equal(sapply(dfCorrect,class),sapply(dfWrong,class))
[1] TRUE
You could try this:
dfWrong[] <- mapply(FUN = as,dfWrong,sapply(dfCorrect,class),SIMPLIFY = FALSE)
...although my first instinct is to agree with Oliver that if it were me I'd try to ensure the correct class at the point you're reading the data.
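One thing to watch with any class-copying approach: calling as.numeric() on a factor returns its integer codes, not the values printed as labels, so it is safer to go through as.character() first. Below is a rough sketch of that idea using only base R. The helper copy_class and the result name dfFixed are hypothetical names introduced here, and the sketch assumes the target classes are factor or one of the basic types that have an as.<class>() function (numeric, integer, character, logical).

```r
dfCorrect <- data.frame(x = c(1, 2, 3, 4),
                        y = factor(c("a", "b", "c", "d")),
                        z = c("bar", "foo", "dat", "dot"),
                        stringsAsFactors = FALSE)
# a "wrong" copy in which every column has become a factor
dfWrong <- as.data.frame(lapply(dfCorrect, as.character),
                         stringsAsFactors = TRUE)

# hypothetical helper: convert one column to the class of a reference column
copy_class <- function(wrong, right) {
  txt <- as.character(wrong)              # labels, not integer codes
  if (is.factor(right)) {
    factor(txt, levels = levels(right))   # values outside these levels become NA
  } else {
    # looks up as.numeric, as.integer, as.character or as.logical
    match.fun(paste0("as.", class(right)[1]))(txt)
  }
}

dfFixed <- dfWrong
dfFixed[] <- Map(copy_class, dfWrong, dfCorrect)
str(dfFixed)
```

Map() pairs the columns by position, so this assumes both data frames have their columns in the same order.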
Here is an example that was taken from a fellow SO member.
# load dplyr for filter() and %>%
library(dplyr)
# data
f <- c("a","a","a","b","b","c")
s <- c("fall","spring","other", "fall", "other", "other")
v <- c(3,5,1,4,5,2)
(dat0 <- data.frame(f, s, v, stringsAsFactors = TRUE))  # in R >= 4.0 this is needed for f and s to be factors
# f s v
#1 a fall 3
#2 a spring 5
#3 a other 1
#4 b fall 4
#5 b other 5
#6 c other 2
(sp.tmp <- filter(dat0, s == "spring"))
# f s v
#1 a spring 5
(str(sp.tmp))
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 3 levels "a","b","c": 1
# $ s: Factor w/ 3 levels "fall","other",..: 3
# $ v: num 5
The df resulting from filter() has retained all the levels from the original df.
What would be the recommended way to drop the unused level(s), i.e. "fall" and "other", within the dplyr framework?
You could do something like:
dat1 <- dat0 %>%
filter(s == "spring") %>%
droplevels()
Then
str(dat1)
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 1 level "a": 1
# $ s: Factor w/ 1 level "spring": 1
# $ v: num 5
You could use droplevels
sp.tmp <- droplevels(sp.tmp)
str(sp.tmp)
#'data.frame': 1 obs. of 3 variables:
#$ f: Factor w/ 1 level "a": 1
#$ s: Factor w/ 1 level "spring": 1
# $ v: num 5
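If only some factor columns should lose their unused levels, the data frame method of droplevels has an except argument that takes the indices of columns to leave alone. The following is a small sketch in base R (no dplyr needed); the names sp, sp_all and sp_keep are made up for this illustration.

```r
dat0 <- data.frame(f = c("a","a","a","b","b","c"),
                   s = c("fall","spring","other","fall","other","other"),
                   v = c(3,5,1,4,5,2),
                   stringsAsFactors = TRUE)
sp <- dat0[dat0$s == "spring", ]

sp_all  <- droplevels(sp)              # drop unused levels in every factor column
sp_keep <- droplevels(sp, except = 1)  # leave column 1 (f) untouched

nlevels(sp_all$f)   # 1
nlevels(sp_keep$f)  # 3
nlevels(sp_keep$s)  # 1
```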
My dataset is pretty big: about 2,000 variables and 1,000 observations.
I want to run a model for each variable, using the other variables.
To do so, for each dependent variable I need to drop the variables that have missing values in rows where the dependent variable does not.
For instance, for variable "A" I need to drop variables C and D, because they have missing values where variable A does not. For variable "C" I can keep variable "D".
data <- read.table(text="
A B C D
1 3 9 4
2 1 3 4
NA NA 3 5
4 2 NA NA
2 5 4 3
1 1 1 2",header=T,sep="")
I think I need to make a loop to go through each variable.
I think this gets what you need:
for (i in seq_len(ncol(data))) {
  # keep only the rows where column 'i', the column
  # we currently care about, is not NA
  tmp <- data[!is.na(data[[i]]), ]
  # column 'i' now has no NA values, so drop every other column
  # that still contains an NA in the remaining rows
  tmp <- tmp[sapply(tmp, function(x) !any(is.na(x)))]
  # run your model on 'tmp'
}
With the example data above, the tmp data frame for each iteration of i looks like:
'data.frame': 5 obs. of 2 variables:
$ A: int 1 2 4 2 1
$ B: int 3 1 2 5 1
'data.frame': 5 obs. of 2 variables:
$ A: int 1 2 4 2 1
$ B: int 3 1 2 5 1
'data.frame': 5 obs. of 2 variables:
$ C: int 9 3 3 4 1
$ D: int 4 4 5 3 2
'data.frame': 5 obs. of 2 variables:
$ C: int 9 3 3 4 1
$ D: int 4 4 5 3 2
I'll provide a way to get the usable variables for each column you choose:
getVars <- function(data, col) {
  # TRUE for columns with no NA in the rows where 'col' is not NA
  tmp <- !sapply(data[!is.na(data[[col]]), ], function(x) any(is.na(x)))
  names(data)[tmp & names(data) != col]
}
PS: I'm on my phone so I didn't test the above nor had the chance for a good code styling.
EDIT: Styling fixed!
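As a quick check, here is a sketch that runs getVars on the example data from the question. The function body is repeated so the snippet is self-contained; the list name usable is only an illustrative choice.

```r
data <- read.table(text = "
A B C D
1 3 9 4
2 1 3 4
NA NA 3 5
4 2 NA NA
2 5 4 3
1 1 1 2", header = TRUE)

getVars <- function(data, col) {
  # TRUE for columns with no NA in the rows where 'col' is not NA
  tmp <- !sapply(data[!is.na(data[[col]]), ], function(x) any(is.na(x)))
  names(data)[tmp & names(data) != col]
}

getVars(data, "A")  # "B"
getVars(data, "C")  # "D"

# usable predictors for every column at once, as a named list
usable <- lapply(setNames(names(data), names(data)), getVars, data = data)
```

Because getVars only returns names, it avoids copying the full data set for each of the 2,000 variables; the model data can then be built with data[c(col, getVars(data, col))].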
I want to convert variables into factors using apply():
a <- data.frame(x1 = rnorm(100),
x2 = sample(c("a","b"), 100, replace = T),
x3 = factor(c(rep("a",50) , rep("b",50))))
a2 <- apply(a, 2,as.factor)
apply(a2, 2,class)
results in:
x1 x2 x3
"character" "character" "character"
I don't understand why this results in character vectors instead of factor vectors.
apply converts your data.frame to a character matrix. Use lapply:
lapply(a, class)
# $x1
# [1] "numeric"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
In the second command, apply again coerces the result to a character matrix. With lapply instead:
a2 <- lapply(a, as.factor)
lapply(a2, class)
# $x1
# [1] "factor"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
But for a quick look at the column classes you could use str:
str(a)
# 'data.frame': 100 obs. of 3 variables:
# $ x1: num -1.79 -1.091 1.307 1.142 -0.972 ...
# $ x2: Factor w/ 2 levels "a","b": 2 1 1 1 2 1 1 1 1 2 ...
# $ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
Additional explanation according to comments:
Why does the lapply work while apply doesn't?
The first thing that apply does is convert its argument to a matrix. So apply(a, ...) is equivalent to apply(as.matrix(a), ...). As you can see, str(as.matrix(a)) gives you:
chr [1:100, 1:3] " 0.075124364" "-1.608618269" "-1.487629526" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "x1" "x2" "x3"
There are no more factors, so class returns "character" for all columns.
lapply works on columns so gives you what you want (it does something like class(a$column_name) for each column).
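The difference can be seen on a tiny data frame. This is only a sketch with made-up values, using names m, a2 and a3 chosen for the illustration:

```r
a <- data.frame(x1 = c(0.5, 1.5, 2.5),
                x2 = factor(c("a", "b", "a")))

# apply() starts from as.matrix(a), which is a character matrix here
m <- as.matrix(a)
typeof(m)            # "character"

# so the result of apply() is a character matrix as well
a2 <- apply(a, 2, as.factor)
is.character(a2)     # TRUE

# lapply() works column by column; assigning into a3[] keeps the data.frame
a3 <- a
a3[] <- lapply(a, as.factor)
sapply(a3, class)    # both columns are "factor"
```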
The help page for apply explains why apply with as.factor doesn't work:
In all cases the result is coerced by
as.vector to one of the basic vector
types before the dimensions are set,
so that (for example) factor results
will be coerced to a character array.
The help page for sapply explains why sapply with as.factor doesn't work either:
Value (...) An atomic vector or matrix
or list of the same length as X (...)
If simplification occurs, the output
type is determined from the highest
type of the return values in the
hierarchy NULL < raw < logical <
integer < real < complex < character <
list < expression, after coercion of
pairlists to lists.
So you never get a matrix of factors or a data.frame.
How to convert output to data.frame?
Simple: use as.data.frame, as you wrote in your comment:
a2 <- as.data.frame(lapply(a, as.factor))
str(a2)
'data.frame': 100 obs. of 3 variables:
$ x1: Factor w/ 100 levels "-2.49629293159922",..: 60 6 7 63 45 93 56 98 40 61 ...
$ x2: Factor w/ 2 levels "a","b": 1 1 2 2 2 2 2 1 2 2 ...
$ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
But if you want to replace selected character columns with factor there is a trick:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: chr "a" "b" "c" "d" ...
$ x2: chr "A" "B" "C" "D" ...
$ x3: chr "A" "B" "C" "D" ...
columns_to_change <- c("x1","x2")
a3[, columns_to_change] <- lapply(a3[, columns_to_change], as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: chr "A" "B" "C" "D" ...
You could use it to replace all columns using:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
a3[, ] <- lapply(a3, as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
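The same replacement trick can be driven by the column types instead of a hand-written list of names. A minimal sketch, assuming the goal is to convert every character column and leave the rest alone (the name is_chr and the example columns are invented for this illustration):

```r
a4 <- data.frame(x1 = letters[1:3],
                 x2 = c(1, 2, 3),
                 x3 = LETTERS[1:3],
                 stringsAsFactors = FALSE)

# logical vector marking the character columns
is_chr <- vapply(a4, is.character, logical(1))

# replace only those columns; numeric columns are left untouched
a4[is_chr] <- lapply(a4[is_chr], as.factor)
str(a4)
```

vapply() is used rather than sapply() so the selector is guaranteed to be a logical vector, even for a data frame with zero columns.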
I have a data frame with several variables that I'm running a mixed model on using lme(). One of the variables, ForAgeCat, has five factor levels: 1,2,3,4,5.
str(mvthab.3hr.fc$ForAgeCat)
>Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
The problem is that factor level 3 actually doesn't exist, that is, in this dataset (which is a subset of a larger dataset) there are no observations from factor level 3, which I think is messing with my modeling in lme(). Can someone help me to remove/eliminate factor level 3 from the list of factor levels?
Use the function droplevels, like so:
> DF$factor_var = droplevels(DF$factor_var)
More detail:
> # create a sample dataframe:
> col1 = runif(10)
> col1
[1] 0.6971600 0.1649196 0.5451907 0.9660817 0.8207766 0.9527764
0.9643410 0.2179709 0.9302741 0.4195046
> col2 = gl(n=2, k=5, labels=c("M", "F"))
> col2
[1] M M M M M F F F F F
Levels: M F
> DF = data.frame(Col1=col1, Col2=col2)
> DF
Col1 Col2
1 0.697 M
2 0.165 M
3 0.545 M
4 0.966 M
5 0.821 M
6 0.953 F
7 0.964 F
8 0.218 F
9 0.930 F
10 0.420 F
> # now filter DF so that only *one* factor value remains
> DF1 = DF[DF$Col2=="M",]
> DF1
Col1 Col2
1 0.697 M
2 0.165 M
3 0.545 M
4 0.966 M
5 0.821 M
> str(DF1)
'data.frame': 5 obs. of 2 variables:
$ Col1: num 0.697 0.165 0.545 0.966 0.821
$ Col2: Factor w/ 2 levels "M","F": 1 1 1 1 1
> # but still 2 factor *levels*, even though only one value
> DF1$Col2 = droplevels(DF1$Col2)
> # now Col2 has only a single level:
> str(DF1)
'data.frame': 5 obs. of 2 variables:
$ Col1: num 0.697 0.165 0.545 0.966 0.821
$ Col2: Factor w/ 1 level "M": 1 1 1 1 1
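To confirm which levels are actually empty before dropping them, table() can be used, since it reports a count of zero for unused levels. The sketch below uses a small made-up factor standing in for ForAgeCat; the values are not from the original data set.

```r
# stand-in for ForAgeCat: five declared levels, but level "3" never occurs
ForAgeCat <- factor(c(1, 2, 4, 5, 5), levels = 1:5)

counts <- table(ForAgeCat)
empty  <- names(counts)[counts == 0]
empty                    # "3"

cleaned <- droplevels(ForAgeCat)
levels(cleaned)          # "1" "2" "4" "5"
```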