When subsetting a data.frame and asking for only one variable, we get a vector. This is what we asked for, so it is not strange. However, in other situations (if we ask for more than one column), we get a data.frame object. Example:
> data <- data.frame(a=1:10, b=letters[1:10])
> str(data)
'data.frame': 10 obs. of 2 variables:
$ a: int 1 2 3 4 5 6 7 8 9 10
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
> data <- data[, "b"]
> str(data)
Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
If I need my data object not to change its type from data.frame, no matter whether it has only one variable, what do I have to do? The only thing that comes to my mind is:
data <- data[, "a"]
data <- as.data.frame(data)
...but this seems terribly redundant. Is there a better way, i.e. a way of saying "stay a data.frame, just give me a certain column"?
The problem is that I need:
to subset using vectors of variable names of different lengths, and
to get data.frames with unchanged names as output each time.
The best option is to use list subsetting. All of these will return a data.frame:
data['a']
data[c('a')]
data[c('a', 'b')]
Using matrix subsetting, you would have to add drop = FALSE:
data[, 'a', drop = FALSE]
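To tie this back to the question, here is a minimal sketch of both forms when the column names arrive as vectors of different lengths. The vectors `one` and `two` are made-up names for illustration, and `stringsAsFactors = TRUE` is set only so that the column `b` is a factor as in the question's output (newer R versions default to character).

```r
# Toy data matching the question
data <- data.frame(a = 1:10, b = letters[1:10], stringsAsFactors = TRUE)

one <- "b"            # hypothetical selection of length 1
two <- c("a", "b")    # hypothetical selection of length 2

# List-style subsetting keeps the data.frame class whatever the length
res_one <- data[one]
res_two <- data[two]

# Matrix-style subsetting needs drop = FALSE to behave the same way
res_drop <- data[, one, drop = FALSE]

# With the default drop = TRUE a single column is simplified to a vector
simplified <- data[, one]
```

In all three `res_*` cases the column names are carried over unchanged, which is the second requirement in the question.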
Assume we have the following data frame:
foo
  k h=1 h=2 h=3
1 3   3   6   9
2 2   2   5   8
3 1   1   4   7
with
str(foo)
'data.frame': 3 obs. of 4 variables:
$ k : Factor w/ 3 levels "3","2","1": 1 2 3
$ h=1: int 3 2 1
$ h=2: int 6 5 4
$ h=3: int 9 8 7
How can I subset my data frame based on the factor k? For instance, to get only the row for k=3, or all rows with k<3. I tried working with subset(foo, k=3) but it doesn't work. I also tried to convert the column k to numeric, but then my data.frame loses its order. It's important that the data stays in descending order with regard to k (so 3, 2, 1).
Bracket notation should be able to subset on factors without any problems:
# Returns all rows of foo where k == '3'
foo[foo$k == '3',]
Two possible problems with what you did before:
1) subset(foo, k=3) should be subset(foo, k==3). Don't confuse the equality operator (==) with =, which is used for assignment and, inside a function call, for naming arguments.
2) Since you're comparing with the actual level of your factor, you should check for equality with the character string '3' instead of the numeric 3. You can see in the output from str() that k's levels are "3","2","1", with quotes, whereas the integers for the other variables are shown without quotes: 3 2 1.
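The question also asks for all rows with k<3, which an equality test does not cover. A minimal sketch, assuming the level labels are numeric strings as in the str() output: convert the labels to numbers only for the comparison and leave the factor column itself untouched, so the row order is preserved. The names `k_num`, `k3` and `below3` are made up for illustration.

```r
# Rebuild the example; check.names = FALSE keeps the "h=1" style names
foo <- data.frame(k = factor(c(3, 2, 1), levels = c("3", "2", "1")),
                  "h=1" = 3:1, "h=2" = 6:4, "h=3" = 9:7,
                  check.names = FALSE)

# Exact match against the level label
k3 <- foo[foo$k == "3", ]

# For an ordering comparison, turn the labels into numbers on the fly.
# as.character() comes first because as.numeric() applied directly to a
# factor returns the internal level codes, not the labels.
k_num <- as.numeric(as.character(foo$k))
below3 <- foo[k_num < 3, ]
```

Because `foo$k` is never overwritten, the rows of `below3` stay in the original descending order (2, then 1).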
Is it possible to compare two factors of same length, but different levels? For example, if we have these 2 factor variables:
A <- factor(1:5)
str(A)
Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
B <- factor(c(1:3,6,6))
str(B)
Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
If I try to compare them using, for example, the == operator:
mean(A == B)
I get the following error:
Error in Ops.factor(A, B) : level sets of factors are different
Convert to character then compare:
# data
A <- factor(1:5)
B <- factor(c(1:3,6,6))
str(A)
# Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
str(B)
# Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
mean(A == B)
# Error in Ops.factor(A, B) : level sets of factors are different
mean(as.character(A) == as.character(B))
# [1] 0.6
Or another approach would be
mean(levels(A)[A] == levels(B)[B])
which is 2 times slower on a 1e8 dataset.
Converting to character as in @zx8754's answer is the easiest solution to this problem, and probably the one you'd want to use almost always. Another option, though, is to correct the 2 variables so that they have the same levels. You might want to do this if you want to keep these variables as factors for some reason and don't want to clog up your code with repeated calls to as.character.
A <- factor(1:5)
B <- factor(c(1:3,6,6))
mean(A == B)
Error in Ops.factor(A, B) : level sets of factors are different
We can take the union of the levels of both factors to get all levels in either factor, and then remake the factors using that union as the levels. Now, even though the 2 factors have different values, the levels are the same between them and you can compare them:
C = factor(A, levels = union(levels(A), levels(B)))
D = factor(B, levels = union(levels(A), levels(B)))
mean(C==D)
[1] 0.6
As you can see, the values are unchanged, but the levels are now identical.
C
[1] 1 2 3 4 5
Levels: 1 2 3 4 5 6
D
[1] 1 2 3 6 6
Levels: 1 2 3 4 5 6
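If this comes up repeatedly, the two steps can be wrapped up. The following is only a sketch: `harmonise` is a hypothetical helper name, not a base R function, and it simply applies the union-of-levels idea above to both inputs.

```r
A <- factor(1:5)
B <- factor(c(1:3, 6, 6))

# Hypothetical helper: rebuild both factors on the union of their levels
harmonise <- function(x, y) {
  lv <- union(levels(x), levels(y))
  list(x = factor(x, levels = lv), y = factor(y, levels = lv))
}

h <- harmonise(A, B)

# Comparing the harmonised factors and comparing the character
# conversions should give the same proportion of matches
agree_factor <- mean(h$x == h$y)
agree_char <- mean(as.character(A) == as.character(B))
```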
Say I have a data frame where some columns are of factor type and there is a column named "index" which is not a factor. I want to extract the columns
which are of factor type and
the "index" column.
For example let
df<-data.frame(a=runif(10),b=as.factor(sample(10)),index=as.numeric(1:10))
So df is:
a b index
0.16187501 5 1
0.75214741 8 2
0.08741729 3 3
0.58871514 2 4
0.18464752 9 5
0.98392420 1 6
0.73771960 10 7
0.97141474 6 8
0.15768011 7 9
0.10171931 4 10
The desired output is (let it be a data frame called df1):
df1:
b index
5 1
8 2
3 3
2 4
9 5
1 6
10 7
6 8
7 9
4 10
which consists of the factor column and the column named "index".
I use the following code:
vars<-apply(df,2,function(x) {(is.factor(x)) || (names(x)=="index")})
df1<-df[,vars]
However, this code does not work. How can I return df1 using an apply-type function in R? I will be very glad for any help. Thanks a lot.
You could do:
df[ , sapply(df, is.factor) | grepl("index", names(df))]
I think two things went wrong with your method: First, apply converts the data frame to a matrix, which doesn't store values as factors. Also, in a matrix, every value has to be of the same mode (character, numeric, etc.). In this case, everything gets coerced to character, so there's no factor to find.
Second, the column name isn't accessible within apply (AFAIK), so names(x) returns NULL and names(x)=="index" returns logical(0).
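The difference between the two approaches can be checked directly. A minimal sketch on the question's data: the seed is arbitrary and only makes the toy data repeatable, and `is_fac`, `via_apply` and `keep` are made-up names. It uses an exact name comparison instead of grepl(), on the assumption that only a column called exactly "index" is wanted.

```r
set.seed(1)  # arbitrary seed, only so the toy data is repeatable
df <- data.frame(a = runif(10),
                 b = as.factor(sample(10)),
                 index = as.numeric(1:10))

# sapply() visits the columns as they are stored, so b is still a factor
is_fac <- sapply(df, is.factor)

# apply() converts to a matrix first; here that is a character matrix,
# so no column is reported as a factor
via_apply <- apply(df, 2, is.factor)

# Exact name match; grepl("index", ...) would also match e.g. "index2"
keep <- is_fac | names(df) == "index"
df1 <- df[, keep, drop = FALSE]
```

The `drop = FALSE` is there so that `df1` stays a data frame even if only one column qualifies.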
When I do:
var = names(df)[2]
df$var
I get NULL. I think that var is a string inside quotes and that is why this is happening. How could I get the columns in a data frame and dynamically query them?
It has been suggested that I use df[var], but what if my dataframe has another dataframe within it? df[var][x] or df[var]$x won't work.
Get a column of a data frame or item in a list by value of a variable by doing:
df[[var]]
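A small sketch of why the original attempt returned NULL. The data frame here is invented for illustration: `$` takes the name after it literally, so `df$var` looks for a column called "var", whereas `[[` evaluates its argument first.

```r
# Invented toy data frame
df <- data.frame(x = 1:3, y = c("u", "v", "w"))
var <- names(df)[2]      # the string "y"

by_dollar <- df$var      # looks for a column literally named "var"
by_double <- df[[var]]   # uses the value stored in var, i.e. column y
by_single <- df[var]     # one-column data.frame rather than the column itself
```

`[[` can also be chained, so an element nested inside a list can be reached with something like `df[[var]][[x]]`, provided each level really is a list or data frame.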
It's hard to know what error-inducing situation has been constructed without dput output of the offending data frame. It's modestly difficult to get a column name as described (with actual quotes in the column name), but it's possible. First we can try and fail to get such a beast:
df2 <- data.frame("\"col1\""=1:10)
df2[["\"col1\""]]
#NULL
df2
# the data.frame function coerced it to a valid column name with no quotes
X.col1.
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
So we can bypass the validity checks. Now we need escapes preceding the quotes:
df2 <- data.frame("\"col1\""=1:10, check.names=FALSE)
> df2[["\"col1\""]]
[1] 1 2 3 4 5 6 7 8 9 10
If the df[[var]]$x approach worked for you, then the answer is more likely that df is not a dataframe but rather is an ordinary R named list and that it is x that is a dataframe. You should check this by doing:
str(df)
You could make such a structure very simply with:
> df3 <- list( item=data.frame(x=1:10, check.names=FALSE))
> var1 = "item"
> df3[[var1]]$x
[1] 1 2 3 4 5 6 7 8 9 10
> str(df3)
List of 1
$ item:'data.frame': 10 obs. of 1 variable:
..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
I'm working on a data.frame with about 700 000 rows. It contains the IDs of status updates and the corresponding usernames from Twitter. I just want to know how many different users are in there and how many times they've tweeted. So I thought this was a very simple task using tables. But now I noticed that I'm getting different results.
Recently I did it by converting the column to character, like this:
>freqs <- as.data.frame(table(as.character(w_dup$from_user)))
>nrow(freqs)
[1] 239678
Two months ago I did it like this:
>freqs <- as.data.frame(table(w_dup$from_user))
>nrow(freqs)
[1] 253594
I noticed that this way the data frame contains usernames with a Frequency 0. How can that be? If the username is in the dataset it must occur at least one time.
?table didn't help me. Neither was I able to reproduce this issue on smaller datasets.
What am I doing wrong? Or am I misunderstanding the use of tables?
The type of the column is the problem here. Also keep in mind that the levels of a factor stay the same when you subset the data frame:
# Full data frame
(df <- data.frame(x = letters[1:3], y = 1:3))
x y
1 a 1
2 b 2
3 c 3
# Its structure - all three levels as it should be
str(df)
'data.frame': 3 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3
$ y: int 1 2 3
# A smaller data frame
(newDf <- df[1:2, ])
x y
1 a 1
2 b 2
# But the same three levels
str(newDf)
'data.frame': 2 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2
$ y: int 1 2
So the first column is a factor. In this case:
table(newDf$x)
a b c
1 1 0
all the levels ("a","b","c") are taken into consideration. And here
table(as.character(newDf$x))
a b
1 1
the values are no longer a factor, so only the values that actually occur are counted.
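If the column should remain a factor, base R's droplevels() removes the unused levels instead. A minimal sketch on the same toy data, with the factor built explicitly so the result does not depend on the stringsAsFactors default; the result names are made up for illustration.

```r
df <- data.frame(x = factor(letters[1:3]), y = 1:3)
newDf <- df[1:2, ]

# The unused level "c" is still counted, with frequency 0
with_unused <- table(newDf$x)

# droplevels() discards levels that no longer occur
dropped <- table(droplevels(newDf$x))

# Number of distinct values actually present
n_present <- nlevels(droplevels(newDf$x))
```

Applied to the question, counting the rows of the table built from the dropped-levels factor should then agree with the as.character() version.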