Convert Continuous Dataframe to Categorical - r

I know how to convert individual continuous variables of a dataframe into categorical variables. But how can this be done for an entire dataframe at once? It seems there should be some simple way to do this but I am not seeing it. My dataframe has 34 rows and 65 variables and all variables take either a 0, 1 or 2 value. I want each value to be categorical. And the meaning of a 0, 1 or 2 is the same across all variables. Below is some R code to recreate a small subset of the data:
continuous<-data.frame(c(0,1,0,2),c(2,2,0,0),c(1,0,1,0),c(2,1,0,0))
colnames(continuous)<-c('A','B','C','D')
continuous$A<-as.factor(continuous$A) #This works, for individual variables
continuous<-as.factor(continuous) #This throws an error for the whole dataframe

I think you can use lapply for this:
continuous<-data.frame(c(0,1,0,2),c(2,2,0,0),c(1,0,1,0),c(2,1,0,0))
colnames(continuous)<-c('A','B','C','D')
c2 <- lapply(continuous, as.factor)
str(c2)
List of 4
$ A: Factor w/ 3 levels "0","1","2": 1 2 1 3
$ B: Factor w/ 2 levels "0","2": 2 2 1 1
$ C: Factor w/ 2 levels "0","1": 2 1 2 1
$ D: Factor w/ 3 levels "0","1","2": 3 2 1 1
Though you likely want a data frame or tibble instead of a list, so:
c2 <- data.frame(lapply(continuous, as.factor))
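Note that in the str() output above, B and C end up with only two levels, because as.factor() derives the levels from the values present in each column. Since the question says 0, 1 and 2 mean the same thing in every variable, it may be preferable to give every column the same level set. A minimal sketch, assuming the full set of codes is exactly 0, 1 and 2:

```r
# Recreate the small example data from the question
continuous <- data.frame(A = c(0, 1, 0, 2), B = c(2, 2, 0, 0),
                         C = c(1, 0, 1, 0), D = c(2, 1, 0, 0))

# Assigning into continuous[] keeps the data.frame structure;
# levels = c(0, 1, 2) is an assumption based on the question's description
continuous[] <- lapply(continuous, factor, levels = c(0, 1, 2))

str(continuous)
```

Assigning into `continuous[]` replaces the columns in place, so no extra data.frame() call is needed.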

Related

Using dplyr group_by corrupt data frame: columns will be truncated or padded with NAs

I tried to replicate the approach from "Means multiple columns by multiple groups" to find the means for different groups in my dataset, using the following code:
newtest %>%
group_by(aligntool, paired) %>%
summarise(vars("read_per_length"), mean)
However, I get the following error message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
I tested to see if this was a problem with zero values, so I removed those and got the same problem. I also made the dataset smaller to see if this was a memory issue. For reference, my dataframe looks like this:
str(newtest)
'data.frame': 100 obs. of 4 variables:
$ Run_Sample : Factor w/ 6 levels "Run_1768_Sample_77304",..: 5 6 3 3 4 6 2 1 6 6 ...
$ paired : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 2 1 ...
$ aligntool : Factor w/ 2 levels "bbmap","kallisto": 2 1 1 2 1 1 2 2 1 1 ...
$ read_per_length: num 2.60e-10 1.87e-09 3.28e-09 7.63e-10 1.38e-09 ...
Is there a problem in how my dataframe is formatted somehow? How do I resolve this issue?
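This should work. vars() together with a function belongs to summarise_at(); plain summarise() expects name-value expressions instead: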
newtest %>%
group_by(aligntool, paired) %>%
summarise_at(vars("read_per_length"), mean)
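As a cross-check that does not depend on the dplyr version, the same group means can be sketched in base R with aggregate(). The toy data below are made up for illustration; only the column names come from the question.

```r
# Hypothetical stand-in for newtest, using the question's column names
newtest <- data.frame(
  aligntool = c("bbmap", "bbmap", "kallisto", "kallisto"),
  paired = c("N", "Y", "N", "N"),
  read_per_length = c(1, 3, 2, 4)
)

# Mean of read_per_length for each aligntool/paired combination
means <- aggregate(read_per_length ~ aligntool + paired,
                   data = newtest, FUN = mean)
print(means)
```

If both approaches give the same means on your data, the data frame itself is fine and the error came from the summarise() call.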

How can I compare two factors with different levels?

Is it possible to compare two factors of same length, but different levels? For example, if we have these 2 factor variables:
A <- factor(1:5)
str(A)
Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
B <- factor(c(1:3,6,6))
str(B)
Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
If I try to compare them using, for example, the == operator:
mean(A == B)
I get the following error:
Error in Ops.factor(A, B) : level sets of factors are different
Convert to character then compare:
# data
A <- factor(1:5)
B <- factor(c(1:3,6,6))
str(A)
# Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
str(B)
# Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
mean(A == B)
# Error in Ops.factor(A, B) : level sets of factors are different
mean(as.character(A) == as.character(B))
# [1] 0.6
Or another approach would be
mean(levels(A)[A] == levels(B)[B])
which is 2 times slower on a 1e8 dataset.
Converting to character as in @zx8754's answer is the easiest solution to this problem, and probably the one you'd want to use almost always. Another option, though, is to correct the two variables so that they have the same levels. You might want to do this if you want to keep these variables as factors for some reason and don't want to clog up your code with repeated calls to as.character.
A <- factor(1:5)
B <- factor(c(1:3,6,6))
mean(A == B)
Error in Ops.factor(A, B) : level sets of factors are different
We can take the union of the levels of both factors to get all levels in either factor, and then remake the factors using that union as the levels. Now, even though the two factors have different values, the levels are the same between them and you can compare them:
C = factor(A, levels = union(levels(A), levels(B)))
D = factor(B, levels = union(levels(A), levels(B)))
mean(C==D)
[1] 0.6
As you can see, the values are unchanged, but the levels are now identical.
C
[1] 1 2 3 4 5
Levels: 1 2 3 4 5 6
D
[1] 1 2 3 6 6
Levels: 1 2 3 4 5 6
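If this is needed in several places, the union step can be wrapped in a small helper. This is only a sketch; align_levels is a hypothetical name, not a function from base R or any package.

```r
# Hypothetical helper: rebuild two factors on the union of their levels
align_levels <- function(x, y) {
  lv <- union(levels(x), levels(y))
  list(x = factor(x, levels = lv), y = factor(y, levels = lv))
}

A <- factor(1:5)
B <- factor(c(1:3, 6, 6))

aligned <- align_levels(A, B)
mean(aligned$x == aligned$y)
# [1] 0.6
```

The helper returns new factors and leaves A and B unchanged.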

Change the levels of a factor in all the columns of a variable in R

Is there a way to change the levels of the factors in each column more efficiently (iteratively or with a generic script)? Each column should be modified to have levels 1 to r, where r is the number of levels in that factor.
Currently, I am modifying them by writing a command for each column:
setattr(lizards$Diameter,"levels",c(1,2))
setattr(lizards$Species,"levels",c(1,2))
setattr(lizards$Height,"levels",c(1,2))
str(lizards)
lizards 409 obs. of 3 variables
Species : Factor w/ 2 levels "Sagrei","Distichus": 1 1...
Diameter: Factor w/ 2 levels "narrow","wide": 1 1 1 1 ...
Height : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 ...
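One possible way to do this for every factor column at once is to loop over the columns with lapply() and relabel the levels as 1 to r. This is a sketch in base R rather than with setattr; the toy lizards data frame below is invented and only borrows the column names from the question.

```r
# Hypothetical miniature version of the lizards data
lizards <- data.frame(
  Species = factor(c("Sagrei", "Distichus", "Sagrei"),
                   levels = c("Sagrei", "Distichus")),
  Diameter = factor(c("narrow", "wide", "narrow")),
  Height = factor(c("low", "high", "low"))
)

# Relabel the levels of each factor column as 1..r (r = number of levels)
is_fac <- vapply(lizards, is.factor, logical(1))
lizards[is_fac] <- lapply(lizards[is_fac], function(f) {
  levels(f) <- seq_len(nlevels(f))
  f
})

str(lizards)
```

Only the labels change; the underlying integer codes of each factor stay the same, and non-factor columns are left alone.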

R Frequency table containing 0

I'm working on a data.frame with about 700 000 rows. It contains the ids of status updates and the corresponding usernames from Twitter. I just want to know how many different users are in there and how many times they've tweeted. So I thought this was a very simple task using tables. But now I have noticed that I'm getting different results.
Recently I did it by converting the column to character, like this:
>freqs <- as.data.frame(table(as.character(w_dup$from_user)))
>nrow(freqs)
[1] 239678
Two months ago I did it like this:
>freqs <- as.data.frame(table(w_dup$from_user))
>nrow(freqs)
[1] 253594
I noticed that this way the data frame contains usernames with a Frequency 0. How can that be? If the username is in the dataset it must occur at least one time.
?table didn't help me. Neither was I able to reproduce this issue on smaller datasets.
What am I doing wrong? Or am I misunderstanding the use of tables?
The problem here is the type of the column: it is a factor, and the levels of a factor stay the same when you subset the data frame:
# Full data frame (stringsAsFactors = TRUE is needed in R >= 4.0.0 to get a factor column)
(df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE))
x y
1 a 1
2 b 2
3 c 3
# Its structure - all three levels as it should be
str(df)
'data.frame': 3 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3
$ y: int 1 2 3
# A smaller data frame
(newDf <- df[1:2, ])
x y
1 a 1
2 b 2
# But the same three levels
str(newDf)
'data.frame': 2 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2
$ y: int 1 2
So the first column is a factor. In this case:
table(newDf$x)
a b c
1 1 0
all the levels ("a","b","c") are taken into consideration. And here
table(as.character(newDf$x))
a b
1 1
the values are no longer a factor, so only the values that actually occur are counted.
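If the column should stay a factor, base R's droplevels() removes the unused levels before tabulating. A minimal sketch on the same toy data:

```r
# Same toy data as above; stringsAsFactors = TRUE makes x a factor in R >= 4.0.0
df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE)
newDf <- df[1:2, ]

counts_all <- table(newDf$x)               # keeps the unused level "c" with count 0
counts_used <- table(droplevels(newDf$x))  # only levels that actually occur

length(counts_all)
length(counts_used)
```

Calling droplevels() on the whole data frame, as in `newDf <- droplevels(newDf)`, does the same for every factor column.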

Set ordering of factor levels for multiple columns in a data frame

I've loaded data from a CSV file into a data frame. Each column represents a survey question, and all of the answers are on a five-point Likert scale, with the labels: ("None", "Low", "Medium", "High", "Very High").
When I read in the data initially, R correctly interprets those values as factors but doesn't know what the ordering should be. I want to specify what the ordering is for the values so I can do some numerical calculations. I thought the following code would work:
X <- read.csv('..')
likerts <- data.frame(apply(X, 2, function(X){factor(X,
levels = c("None", "Low", "Medium", "High", "Very High"),
ordered = T)}))
What happens instead is that all of the level data gets converted into strings. How do I do this correctly?
When you use data.frame(), R converts the result back to a normal factor (or to a string if stringsAsFactors = FALSE). Use lapply() with as.data.frame() instead. A trivial example with a toy data frame:
X <- data.frame(
var1=rep(letters[1:5],3),
var2=rep(letters[1:5],each=3)
)
likerts <- as.data.frame(lapply(X, function(X){ordered(X,
levels = letters[5:1],labels=letters[5:1])}))
> str(likerts)
'data.frame': 15 obs. of 2 variables:
$ var1: Ord.factor w/ 5 levels "e"<"d"<"c"<"b"<..: 5 4 3 2 1 5 4 3 2 1 ...
$ var2: Ord.factor w/ 5 levels "e"<"d"<"c"<"b"<..: 5 5 5 4 4 4 3 3 3 2 ...
As a side note, ordered() gives you an ordered factor, and for data frames lapply(X, ...) is preferable to apply(X, 2, ...), because apply() first coerces the data frame to a matrix.
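Applied to the Likert labels from the question, the same idea might look like the sketch below. The two survey columns q1 and q2 are invented for illustration; only the five labels come from the question.

```r
# Hypothetical survey answers using the question's five labels
X <- data.frame(
  q1 = c("Low", "High", "None"),
  q2 = c("Very High", "Medium", "Low")
)

likert_levels <- c("None", "Low", "Medium", "High", "Very High")

# Convert every column to an ordered factor with the same level order
likerts <- X
likerts[] <- lapply(X, factor, levels = likert_levels, ordered = TRUE)

# Integer codes (1 = "None", ..., 5 = "Very High") for numerical work
scores <- sapply(likerts, as.integer)
str(likerts)
```

Any answer that is not one of the five labels becomes NA, so it is worth checking for unexpected NA values after the conversion.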
And the obligatory plyr solution (using Joris's example above):
> require(plyr)
> Y <- catcolwise( function(v) ordered(v, levels = letters[5:1]))(X)
> str(Y)
'data.frame': 15 obs. of 2 variables:
$ var1: Ord.factor w/ 5 levels "e"<"d"<"c"<"b"<..: 5 4 3 2 1 5 4 3 2 1 ...
$ var2: Ord.factor w/ 5 levels "e"<"d"<"c"<"b"<..: 5 5 5 4 4 4 3 3 3 2 ...
Note that one good thing about catcolwise is that it applies the function only to the columns of X that are factors, leaving the others alone. To explain what is going on: catcolwise is a function that takes a function as an argument and returns a function that operates "columnwise" on the factor columns of the data frame. So we can imagine the above line in two stages: fn <- catcolwise(...); Y <- fn(X). Note that there are also the functions colwise (operates on all columns) and numcolwise (operates only on numerical columns).
