R: Subsetting data frame by factor

Assume we have the following data frame:
foo
k h=1 h=2 h=3
1 3 3 6 9
2 2 2 5 8
3 1 1 4 7
with
str(foo)
'data.frame': 3 obs. of 4 variables:
$ k : Factor w/ 3 levels "3","2","1": 1 2 3
$ h=1: int 3 2 1
$ h=2: int 6 5 4
$ h=3: int 9 8 7
How can I subset my data frame based on the factor k? For instance, to get only the row for k=3, or all rows with k<3. I tried subset(foo, k=3), but it doesn't work. I also tried converting the column k to numeric, but then my data.frame loses its order. It's important that the data stays in descending order with regard to k (so 3, 2, 1).

Bracket notation should be able to subset on factors without any problems:
# Returns all rows of foo where k == '3'
foo[foo$k == '3',]
Two possible problems with what you tried:
1) subset(foo, k=3) should be subset(foo, k==3). Inside a function call, = names an argument rather than testing equality, so use the equality operator (==).
2) Since you're comparing against a level of your factor, check for equality with the character '3' rather than the numeric 3. You can see in the output from str() that k's levels are "3","2","1", with quotes, whereas the integer values of the other variables are shown without quotes: 3 2 1.
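The question also asks for all rows with k<3, which the points above do not cover. A minimal sketch, assuming the level labels of k are always numeric strings: convert the labels (not the internal codes) to numbers and use the result as a logical index. The names k_num, below_three and exactly_three are made up for this illustration.

```r
# Rebuild the example data; check.names = FALSE keeps the "h=1" style column names
foo <- data.frame(
  k = factor(c("3", "2", "1"), levels = c("3", "2", "1")),
  "h=1" = c(3L, 2L, 1L),
  "h=2" = c(6L, 5L, 4L),
  "h=3" = c(9L, 8L, 7L),
  check.names = FALSE
)

# Go through as.character() first: as.numeric(foo$k) alone would return the
# internal codes 1 2 3, not the labels 3 2 1
k_num <- as.numeric(as.character(foo$k))

# Logical indexing keeps the original row order, so k stays descending
below_three   <- foo[k_num < 3, ]
exactly_three <- foo[foo$k == "3", ]
```

Because the column k itself is never converted, the data frame keeps its factor and its row order; only the temporary vector k_num is numeric.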

Related

How to get levels for each factor variable in R

I understand R assigns values to a factor vector alphabetically. In this following example:
x <- as.factor(c("A","B","C","A","A","A","A","A","A","B","C","B","C","B","C","B","C"))
str(x)
This prints
Factor w/ 3 levels "A","B","C": 1 2 3 1 1 1 1 1 1 2 ...
Since I have only three levels it is easier to understand the level - value association i.e., A = 1, B = 2, so on and so forth.
In a scenario where I have hundreds of levels, is there an easier way to print a table that displays all the levels along with their values, like this:
Levels Values
A 1
B 2
C 3
Why do you want to know the underlying numeric values that R assigns to each factor level? I ask because this generally wouldn't be an important thing to keep track of. Can you say more about what you're trying to accomplish? We may be able to provide additional advice if we know more about the underlying problem you're trying to solve. For now, below are examples of how to do what you ask that also show why the results might not be what you expect.
Do all the columns in your data frame have different combinations of the same underlying categories? If not, what you're asking for could give unexpected and undesirable results. Below are a couple of examples, based on a fake data frame with 3 factor columns, two of which are upper case letters and one of which is lower case letters.
# Fake data
set.seed(2)
x = c("C","A","B","C","A","A","A","A","A","A","B","C","B","C","B","C","B","C")
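dat = data.frame(x=x,
                 y=sample(LETTERS[1:5], length(x), replace=TRUE),
                 z=sample(letters[1:3], length(x), replace=TRUE),
                 w=rnorm(length(x)),
                 stringsAsFactors=TRUE)  # needed in R >= 4.0, where strings are no longer converted to factors by default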
Note that the numeric codes assigned to each factor level are not unique across columns. The lower case letters and the upper case letters can both have factor codes 1 through 3.
# Return a list with factor levels and numeric codes for each factor column
lapply(dat[ , sapply(dat, is.factor)], function(v) {
  # levels(v) lists every level; its positions are the integer codes
  data.frame(Levels=levels(v),
             Values=seq_along(levels(v)))
})
$x
Levels Values
1 A 1
2 B 2
3 C 3
$y
Levels Values
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
$z
Levels Values
1 a 1
2 b 2
3 c 3
Another potential complication is whether the order of the factor levels is the same for different columns. As an example, let's change the factor order for one of the upper case columns. This creates a new issue: the same letter can have a different code in different columns, and the same code can be assigned to different letters. For example, A has code 1 in column x and code 5 in column y. Furthermore, code 1 is assigned to E in column y, rather than to A.
dat$y = factor(dat$y, levels = LETTERS[5:1])
# Return a list with factor levels and numeric codes for each factor column
lapply(dat[ , sapply(dat, is.factor)], function(v) {
  # levels(v) lists every level; its positions are the integer codes
  data.frame(Levels=levels(v),
             Values=seq_along(levels(v)))
})
$x
Levels Values
1 A 1
2 B 2
3 C 3
$y
Levels Values
1 E 1
2 D 2
3 C 3
4 B 4
5 A 5
$z
Levels Values
1 a 1
2 b 2
3 c 3

How can I compare two factors with different levels?

Is it possible to compare two factors of same length, but different levels? For example, if we have these 2 factor variables:
A <- factor(1:5)
str(A)
Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
B <- factor(c(1:3,6,6))
str(B)
Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
If I try to compare them using, for example, the == operator:
mean(A == B)
I get the following error:
Error in Ops.factor(A, B) : level sets of factors are different
Convert to character then compare:
# data
A <- factor(1:5)
B <- factor(c(1:3,6,6))
str(A)
# Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
str(B)
# Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
mean(A == B)
# Error in Ops.factor(A, B) : level sets of factors are different
mean(as.character(A) == as.character(B))
# [1] 0.6
Or another approach would be
mean(levels(A)[A] == levels(B)[B])
which is 2 times slower on a 1e8 dataset.
Converting to character as in @zx8754's answer is the easiest solution to this problem, and probably the one you'd want to use almost always. Another option, though, is to correct the 2 variables so that they have the same levels. You might want to do this if you want to keep these variables as factors for some reason and don't want to clutter your code with repeated calls to as.character.
A <- factor(1:5)
B <- factor(c(1:3,6,6))
mean(A == B)
Error in Ops.factor(A, B) : level sets of factors are different
We can take the union of the levels of both factors to get all levels in either factor, and then remake the factors using that union as the levels. Now, even though the 2 factors have different values, the levels are the same between them and you can compare them:
C = factor(A, levels = union(levels(A), levels(B)))
D = factor(B, levels = union(levels(A), levels(B)))
mean(C==D)
[1] 0.6
As you can see, the values are unchanged, but the levels are now identical.
C
[1] 1 2 3 4 5
Levels: 1 2 3 4 5 6
D
[1] 1 2 3 6 6
Levels: 1 2 3 4 5 6

How to stop single variable data.frame becoming a vector?

When subsetting a data.frame by asking for only one variable, we get a vector. This is what we asked for, so it is not strange. However, in other situations (if we ask for more than one column), we get a data.frame object. Example:
> data <- data.frame(a=1:10, b=letters[1:10])
> str(data)
'data.frame': 10 obs. of 2 variables:
$ a: int 1 2 3 4 5 6 7 8 9 10
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
> data <- data[, "b"]
> str(data)
Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
If I need my data object to keep its type as data.frame even when it has only one variable, what do I have to do? The only thing that comes to my mind is:
data <- data[, "a"]
data <- as.data.frame(data)
...but this seems terribly redundant. Is there a better way, i.e. a way of saying "stay a data.frame, just give me a certain column"?
The problem is that I need to:
1) subset using vectors of variable names of different lengths
2) get data.frames with names unchanged as output each time.
The best option is list-style subsetting with single brackets and no comma. All of these will return a data.frame:
data['a']
data[c('a')]
data[c('a', 'b')]
Using matrix subsetting, you would have to add drop = FALSE:
data[, 'a', drop = FALSE]
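For the case in the question, where the vector of column names can have any length, a small sketch may help. The wrapper pick_columns is a hypothetical helper written for this illustration, not a base R function; it only wraps the drop = FALSE form shown above.

```r
data <- data.frame(a = 1:10, b = letters[1:10])

# Hypothetical helper: always returns a data.frame, whatever the length of `cols`
pick_columns <- function(df, cols) {
  df[, cols, drop = FALSE]
}

one <- pick_columns(data, "a")          # one column, still a data.frame
two <- pick_columns(data, c("a", "b"))  # two columns

# The list-style form behaves the same way without needing drop = FALSE
one_list_style <- data["a"]
```

In both forms the column names are preserved, which addresses the second requirement in the question.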

How will factor levels be ordered with regard to the original values?

If I create a factor from a vector of numerical values, will the factor categories be ordered automatically by the values that are now considered categories?
i.e. [1,4,7,3,2] -> categories = {1,2,3,4,7}
Short answer: yes.
Long answer: it depends. R will sort the unique values and assign the categories in that order if you convert a vector using the function factor() without any extra arguments:
> x <- c(3,1,4,5,1,4)
> factor(x)
[1] 3 1 4 5 1 4
Levels: 1 3 4 5
It won't, however, when you supply the levels argument:
> factor(x, levels=unique(x))
[1] 3 1 4 5 1 4
Levels: 3 1 4 5
In this case, it takes the order of the levels as the order in which it assigns categories.
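As a further illustration, the levels argument can also impose a descending order, and the integer codes then follow whatever level order was chosen. This is a sketch with made-up variable names (f_sorted, f_desc):

```r
x <- c(3, 1, 4, 5, 1, 4)

# Default: levels are the sorted unique values
f_sorted <- factor(x)

# Explicit levels: here the unique values in descending order
f_desc <- factor(x, levels = sort(unique(x), decreasing = TRUE))

levels(f_sorted)      # "1" "3" "4" "5"
levels(f_desc)        # "5" "4" "3" "1"

# The integer codes are positions in the level vector, so they differ
as.integer(f_sorted)  # 2 1 3 4 1 3
as.integer(f_desc)    # 3 4 2 1 4 2
```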

R Frequency table containing 0

I'm working on a data.frame with about 700 000 rows. It contains the ids of status updates and the corresponding usernames from Twitter. I just want to know how many different users are in there and how many times they've tweeted. So I thought this was a very simple task using tables. But now I've noticed that I'm getting different results.
Recently I did it by converting the column to character, like this:
>freqs <- as.data.frame(table(as.character(w_dup$from_user)))
>nrow(freqs)
[1] 239678
2 months ago I did it like that
>freqs <- as.data.frame(table(w_dup$from_user))
>nrow(freqs)
[1] 253594
I noticed that this way the data frame contains usernames with a Frequency 0. How can that be? If the username is in the dataset it must occur at least one time.
?table didn't help me. Neither was I able to reproduce this issue on smaller datasets.
What am I doing wrong? Or am I misunderstanding the use of tables?
The type of the column is the problem here. Keep in mind that a factor keeps all of its levels when you subset the data frame, even levels that no longer occur in the remaining rows:
# Full data frame
(df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE))  # stringsAsFactors needed in R >= 4.0
x y
1 a 1
2 b 2
3 c 3
# Its structure - all three levels as it should be
str(df)
'data.frame': 3 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3
$ y: int 1 2 3
# A smaller data frame
(newDf <- df[1:2, ])
x y
1 a 1
2 b 2
# But the same three levels
str(newDf)
'data.frame': 2 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2
$ y: int 1 2
So the first column is a factor. In this case:
table(newDf$x)
a b c
1 1 0
all the levels ("a","b","c") are taken into consideration. And here
table(as.character(newDf$x))
a b
1 1
the values are no longer a factor, so only the values actually present are counted.
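If the column should stay a factor, an alternative to as.character() is base R's droplevels(), which removes levels that no longer occur. A minimal sketch on the same toy data (counts_all, counts_used and n_present are names chosen for this example):

```r
df <- data.frame(x = factor(c("a", "b", "c")), y = 1:3)
newDf <- df[1:2, ]

# All three levels are tabulated, so "c" shows up with a count of 0
counts_all <- table(newDf$x)

# droplevels() discards the unused level "c" before tabulating
counts_used <- table(droplevels(newDf$x))

# Number of distinct values actually present in the subset
n_present <- nlevels(droplevels(newDf$x))
```

Applied to the question, something like table(droplevels(w_dup$from_user)) should give the same number of rows as the as.character() version, assuming from_user is a factor carrying levels from a larger data set.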

Resources