R frequency table containing 0

I'm working on a data.frame with about 700,000 rows. It contains the ids of status updates and the corresponding usernames from Twitter. I just want to know how many different users are in there and how many times they've tweeted, so I thought this would be a very simple task using tables. But now I've noticed that I'm getting different results.
Recently I did it by converting the column to character, like this:
>freqs <- as.data.frame(table(as.character(w_dup$from_user)))
>nrow(freqs)
[1] 239678
Two months ago I did it like this:
>freqs <- as.data.frame(table(w_dup$from_user))
>nrow(freqs)
[1] 253594
I noticed that this way the data frame contains usernames with a frequency of 0. How can that be? If a username is in the dataset, it must occur at least once.
?table didn't help me, and I wasn't able to reproduce the issue on smaller datasets.
What am I doing wrong? Or am I misunderstanding the use of tables?

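The type of the column is the problem here. Keep in mind that the levels of a factor stay the same when you subset the data frame:
# Full data frame (stringsAsFactors = TRUE is needed in R >= 4.0.0,
# where strings are no longer converted to factors by default)
(df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE))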
x y
1 a 1
2 b 2
3 c 3
# Its structure - all three levels as it should be
str(df)
'data.frame': 3 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3
$ y: int 1 2 3
# A smaller data frame
(newDf <- df[1:2, ])
x y
1 a 1
2 b 2
# But the same three levels
str(newDf)
'data.frame': 2 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2
$ y: int 1 2
So the first column is a factor. In this case:
table(newDf$x)
a b c
1 1 0
all the levels ("a", "b", "c") are taken into account, including the unused level "c", which is counted 0 times. And here
table(as.character(newDf$x))
a b
1 1
the values are plain character strings rather than a factor, so only the values that actually occur are counted.
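If the column should stay a factor, another option is to drop the unused levels before tabulating. The following is a minimal sketch on the toy data from above, using base R's droplevels(); the variable names counts_all and counts_used are only illustrative.

```r
# Toy data from the answer above; stringsAsFactors = TRUE makes x a factor
df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE)
newDf <- df[1:2, ]

# The subset still carries all three levels, so "c" is counted 0 times
counts_all <- table(newDf$x)

# droplevels() removes levels that no longer occur in the data
counts_used <- table(droplevels(newDf$x))

# It also works on a whole data frame, cleaning every factor column
newDf <- droplevels(newDf)
nlevels(newDf$x)
```

Applied to the question, calling droplevels() on the subsetted data frame before table() should give the same row count as the as.character() version.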

Related

Using dplyr group_by corrupt data frame: columns will be truncated or padded with NAs

I tried to replicate the approach from "Means multiple columns by multiple groups" to find the means for different groups in my dataset, using the following code:
newtest %>%
group_by(aligntool, paired) %>%
summarise(vars("read_per_length"), mean)
However, I get the following error message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
I tested to see if this was a problem with zero values, so I removed those and got the same problem. I also made the dataset smaller to see if this was a memory issue. For reference, my dataframe looks like this:
str(newtest)
'data.frame': 100 obs. of 4 variables:
$ Run_Sample : Factor w/ 6 levels "Run_1768_Sample_77304",..: 5 6 3 3 4 6 2 1 6 6 ...
$ paired : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 2 1 ...
$ aligntool : Factor w/ 2 levels "bbmap","kallisto": 2 1 1 2 1 1 2 2 1 1 ...
$ read_per_length: num 2.60e-10 1.87e-09 3.28e-09 7.63e-10 1.38e-09 ...
Is there a problem in how my dataframe is formatted somehow? How do I resolve this issue?
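summarise() expects name = expression pairs, whereas a vars() selection belongs in the scoped variant summarise_at(). This should work: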
newtest %>%
group_by(aligntool, paired) %>%
summarise_at(vars("read_per_length"), mean)
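For a single column, the same result can also be written with plain summarise() and a named expression. The sketch below assumes dplyr is installed and uses a small made-up data frame in place of the asker's newtest; only the column names are taken from the question.

```r
library(dplyr)

# Made-up stand-in for the asker's data (values are illustrative only)
newtest <- data.frame(
  aligntool = factor(c("bbmap", "bbmap", "kallisto", "kallisto", "bbmap", "kallisto")),
  paired = factor(c("N", "N", "Y", "Y", "Y", "N")),
  read_per_length = c(1, 3, 2, 4, 5, 6)
)

# One output row per (aligntool, paired) combination, holding the group mean
group_means <- newtest %>%
  group_by(aligntool, paired) %>%
  summarise(read_per_length = mean(read_per_length))

group_means
```

In dplyr 1.0 and later the scoped verbs such as summarise_at() are superseded by across(), so summarise(across(read_per_length, mean)) is another way to express the same thing there.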

Convert Continuous Dataframe to Categorical

I know how to convert individual continuous variables of a dataframe into categorical variables. But how can this be done for an entire dataframe at once? It seems there should be some simple way to do this but I am not seeing it. My dataframe has 34 rows and 65 variables and all variables take either a 0, 1 or 2 value. I want each value to be categorical. And the meaning of a 0, 1 or 2 is the same across all variables. Below is some R code to recreate a small subset of the data:
continuous<-data.frame(c(0,1,0,2),c(2,2,0,0),c(1,0,1,0),c(2,1,0,0))
colnames(continuous)<-c('A','B','C','D')
continuous$A<-as.factor(continuous$A) #This works, for individual variables
continuous<-as.factor(continuous) #This throws an error for the whole dataframe
You can use lapply for this:
continuous<-data.frame(c(0,1,0,2),c(2,2,0,0),c(1,0,1,0),c(2,1,0,0))
colnames(continuous)<-c('A','B','C','D')
c2 <- lapply(continuous, as.factor)
str(c2)
List of 4
$ A: Factor w/ 3 levels "0","1","2": 1 2 1 3
$ B: Factor w/ 2 levels "0","2": 2 2 1 1
$ C: Factor w/ 2 levels "0","1": 2 1 2 1
$ D: Factor w/ 3 levels "0","1","2": 3 2 1 1
Though you likely want a data frame or tibble instead of a list, so:
c2 <- data.frame(lapply(continuous, as.factor))
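Note that in the output above, B and C end up with only two levels because not every value occurs in every column. Since the question says 0, 1 and 2 mean the same thing across all variables, one possible refinement is to pass the full level set explicitly. This is a sketch under that assumption; assigning into continuous[] keeps the object a data frame.

```r
# Same toy data as in the question
continuous <- data.frame(A = c(0, 1, 0, 2), B = c(2, 2, 0, 0),
                         C = c(1, 0, 1, 0), D = c(2, 1, 0, 0))

# Convert every column to a factor with the shared levels 0, 1, 2.
# The empty brackets replace the columns in place, so the result
# is still a data frame rather than a plain list.
continuous[] <- lapply(continuous, factor, levels = 0:2)

str(continuous)
```

With shared levels, tables and comparisons across columns line up even when a value is missing from a particular column.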

R:Subsetting data frame by factor

Assume we have the following data frame:
foo
k h=1 h=2 h=3
1 3 3 6 9
2 2 2 5 8
3 1 1 4 7
with
str(foo)
'data.frame': 3 obs. of 4 variables:
$ k : Factor w/ 3 levels "3","2","1": 1 2 3
$ h=1: int 3 2 1
$ h=2: int 6 5 4
$ h=3: int 9 8 7
How can I subset my data frame based on the factor k? For instance, to get only the row for k=3, or all rows with k<3. I tried subset(foo, k=3), but it doesn't work. I also tried converting the column k to numeric, but then my data.frame loses its order. It's important that the data stays in descending order with regard to k (so 3, 2, 1).
Bracket notation should be able to subset on factors without any problems:
# Returns all rows of foo where k == '3'
foo[foo$k == '3',]
Two possible problems with what you did before:
1) subset(foo, k=3) should be subset(foo, k==3). Don't confuse the equality operator (==) with the assignment operator (=).
2) Since you're comparing with the actual level of your factor, you should check for equality with the character '3' instead of the numeric 3. You can see in the output from str() that k's levels are "3","2","1", with quotes, whereas the integers for the other variables are shown without quotes: 3 2 1.
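For the k<3 part of the question, ordering comparisons are not meaningful on a plain factor, so one option is to build a temporary numeric vector for the comparison and leave the column itself untouched. The sketch below reconstructs foo from the str() output in the question; k_num and below3 are illustrative names.

```r
# Reconstruction of foo; check.names = FALSE keeps the "h=1" style names
foo <- data.frame(
  k = factor(c("3", "2", "1"), levels = c("3", "2", "1")),
  `h=1` = 3:1, `h=2` = 6:4, `h=3` = 9:7,
  check.names = FALSE
)

# Go through as.character() first: as.numeric() directly on a factor
# returns the internal level codes (1, 2, 3 here), not the labels
k_num <- as.numeric(as.character(foo$k))

# Row order is preserved because foo$k itself is never modified
below3 <- foo[k_num < 3, ]
below3
```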

How can I compare two factors with different levels?

Is it possible to compare two factors of same length, but different levels? For example, if we have these 2 factor variables:
A <- factor(1:5)
str(A)
Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
B <- factor(c(1:3,6,6))
str(B)
Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
If I try to compare them using, for example, the == operator:
mean(A == B)
I get the following error:
Error in Ops.factor(A, B) : level sets of factors are different
Convert to character then compare:
# data
A <- factor(1:5)
B <- factor(c(1:3,6,6))
str(A)
# Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
str(B)
# Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
mean(A == B)
Error in Ops.factor(A, B) : level sets of factors are different
mean(as.character(A) == as.character(B))
# [1] 0.6
Another approach would be
mean(levels(A)[A] == levels(B)[B])
which was about 2 times slower on a 1e8 dataset.
Converting to character as in @zx8754's answer is the easiest solution to this problem, and probably the one you'd want to use almost always. Another option, though, is to correct the 2 variables so that they have the same levels. You might want to do this if you want to keep these variables as factors for some reason and don't want to clog up your code with repeated calls to as.character.
A <- factor(1:5)
B <- factor(c(1:3,6,6))
mean(A == B)
Error in Ops.factor(A, B) : level sets of factors are different
We can take the union of the levels of both factors to get all levels in either factor, and then remake the factors using that union as the levels. Now, even though the 2 factors have different values, the levels are the same between them and you can compare them:
C = factor(A, levels = union(levels(A), levels(B)))
D = factor(B, levels = union(levels(A), levels(B)))
mean(C==D)
[1] 0.6
As you can see, the values are unchanged, but the levels are now identical.
C
[1] 1 2 3 4 5
Levels: 1 2 3 4 5 6
D
[1] 1 2 3 6 6
Levels: 1 2 3 4 5 6

How to stop single variable data.frame becoming a vector?

When subsetting a data.frame and asking for only one variable, we get a vector. This is what we ask for, so it is not strange. However, in other situations (if we ask for more than one column), we get a data.frame object. Example:
> data <- data.frame(a=1:10, b=letters[1:10])
> str(data)
'data.frame': 10 obs. of 2 variables:
$ a: int 1 2 3 4 5 6 7 8 9 10
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
> data <- data[, "b"]
> str(data)
Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
If I need my data object not to change its type from data.frame, no matter whether it has only one variable, what do I have to do? The only thing that comes to my mind is:
data <- data[, "a"]
data <- as.data.frame(data)
...but this seems terribly redundant. Is there a better way, i.e. a way of saying "stay a data.frame, just give me a certain column"?
The problem is that I need:
1) to subset using vectors of variable names of different lengths
2) to get data.frames with names unchanged as output each time.
The best approach is to use list subsetting. All of these will return a data.frame:
data['a']
data[c('a')]
data[c('a', 'b')]
Using matrix subsetting, you would have to add drop = FALSE:
data[, 'a', drop = FALSE]
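To illustrate the requirement of subsetting with name vectors of different lengths, here is a small sketch. The variables cols_one and cols_two are hypothetical stand-ins for whatever vectors of column names are in use.

```r
data <- data.frame(a = 1:10, b = letters[1:10])

# Hypothetical name vectors of different lengths
cols_one <- "a"
cols_two <- c("a", "b")

# List-style subsetting always returns a data.frame with names intact
one <- data[cols_one]
two <- data[cols_two]

# Matrix-style subsetting needs drop = FALSE to do the same
one_matrix_style <- data[, cols_one, drop = FALSE]

# Without drop = FALSE a single column collapses to a vector
collapsed <- data[, cols_one]
```

Because list-style subsetting behaves the same for one name or many, it avoids special-casing the single-column situation.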
