Understanding coercion of factors into characters in an R dataframe - r

Trying to figure out how coercion of factors in a data frame works in R. I am trying to plot boxplots for a subset of a data frame. Let's go through it step by step.
x = rnorm(30, 1, 1)
Created a vector x with normal distribution
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
Created a character string to later use as a factor for plotting boxplots for x1, x2, x3
df = data.frame(x,c)
combined x and c into a data.frame. So now we would expect class of df: dataframe, df$x: numeric, df$c: factor (because we sent c into a dataframe) and is.data.frame and is.list applied on df should give us TRUE and TRUE. (I assumed that all dataframes are lists as well? and that's why we are getting TRUE for both checks.)
And that's what happens below. All good till now.
class(df)
#[1] "data.frame"
is.data.frame(df)
#[1] TRUE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Now I plot the spread of x grouped by the factor levels present in c. So the first argument is x ~ c. But I want boxplots for just two levels: x1 and x2. So I used the subset argument of the boxplot function.
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
This is the plot we get. Notice that since c is a factor, the level x3 is still plotted,
i.e. we still got 3 categories on the x-axis of the boxplot in spite of subsetting to 2 categories.
So, one solution I found was to change the class of df variables into numeric and character
class(df)<- c("numeric", "character")
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
New boxplot. This is what we wanted, so it worked! We plotted boxes for just x1 and x2 and got rid of x3.
But if we run the same checks that we ran before this coercion, on all variables, we get these outputs.
Anything funny?
class(df)
#[1] "numeric" "character"
is.data.frame(df)
#[1] FALSE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Check out that df$c (the second variable, containing the categories x1, x2, x3) is still a factor!
And df stopped being a list (so was it ever a list?)
And what did we do exactly by class(df)<- c("numeric", "character") this coercion if not changing the datatype of df $ c?
So to sum up,
my questions for tldr version:
Are all dataframes, also lists in R?
Why did our boxplot drop x3 in the 2nd case (when we coerced class(df) into numeric and character)?
If we did coerce the factor into characters by doing the above steps, why is it still showing that the variable's class is factor?
And why did df stop being a dataframe after we did the above steps?

The answers make more sense if we take your questions in a different order.
Are all dataframes, also lists in R?
Yes. A data frame is a list of vectors (the columns).
And why did df stop being a list after we did the above steps?
It didn't. It stopped being a data frame, because you changed the class with class(df) <- c("numeric", "character"). is.list(df) still returns TRUE.
If we did coerce the factor into characters by doing the above steps, why is it still showing that the variable's class is factor?
class(df) operates on the df object itself, not the columns. Look at str(df). The factor column is still a factor. class(df) set the class attribute on the data frame object itself to a vector.
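A minimal sketch of this point, using a small hypothetical data frame (the values are made up for illustration, and c is created as a factor explicitly so the result does not depend on the R version). Replacing the class attribute changes how R treats the object as a whole, but the list of columns underneath is left alone:

```r
# Hypothetical three-row data frame; `c` is built as a factor on purpose
df <- data.frame(x = c(1.5, 2.5, 3.5), c = factor(c("x1", "x2", "x3")))

# Overwrite the class attribute of the object itself
class(df) <- c("numeric", "character")

is.data.frame(df)  # FALSE: "data.frame" is no longer in the class vector
is.list(df)        # TRUE: the underlying storage is still a list of columns
class(df$c)        # "factor": the columns were never touched
```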
Why did our boxplot drop x3 in the 2nd case (when we coerced class(df) into numeric and character)?
You've messed up your data frame object by explicitly setting the class attribute of the object to a vector c("numeric", "character"). It's hard to predict the full effects of this. My best guess is that boxplot or the functions that draw the axes accessed the class attribute of the data frame somehow.
To do what you really wanted:
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c)
df$c <- as.character(df$c)
or
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c, stringsAsFactors=FALSE)

Use droplevels like this:
df0 <- subset(df, c %in% c("x1", "x2"))
df0 <- transform(df0, c = droplevels(c))
levels(df0$c)
## [1] "x1" "x2"
Note that now c only has two levels, not three.
We can write this as a pipeline using magrittr like this:
library(magrittr)
df %>%
  subset(c %in% c("x1", "x2")) %>%
  transform(c = droplevels(c)) %>%
  boxplot(x ~ c, data = .)

Related

Unique list of variable strings for model estimation

I want to create a vector of unique variable combinations to estimate various regression models for different sets of variables, while fixing one variable to be always included.
For example, I always want to include variable X1, plus a distinct combination of up to, say, three (this threshold could be varying depending on the specific data and research question at hand) other variables from the full list of available variables X2, X3, ..., XN.
The bi-variate case is rather simple, I guess.
However, already for tri-variate models, the variable combination "X1 X2 X3" will yield the same coefficients as "X1 X3 X2". Further, I also want to exclude combinations which contain same variables twice, e.g "X1 X2 X2".
How to exclude these "double-counting"/redundant combinations best? Or how to create such a vector of all possible distinct combinations?
Test code I have tried so far (separating variables with an underscore):
library(dplyr)
'%!in%' <- function(x,y)!('%in%'(x,y))
A <- c("X1", "X2", "X3", "X4", "X5") # all variables in dataset
a <- "X1" # keep X1 in all models
A_minus_a <- A[A %!in% a]
# first combination:
C1 <- outer(a, A_minus_a, paste, sep = "_")
# second set of combinations:
C2 <- outer(C1, A_minus_a, paste, sep = "_") %>% as.vector
# third set of combinations:
C3 <- outer(C2, A_minus_a, paste, sep = "_") %>% as.vector
# full list of model combinations, but including many "double-counted"/redundant models:
C <- c(C1, C2, C3)
Any help you can provide is very much appreciated!
P.S. for the second step I could prevent the problem by formatting the result of outer() into a matrix and then extracting the lower triangular elements without the diagonal of the matrix. However, when turning to the third set of combinations this does not work anymore. So, there might be a better solution from start.
How about using combn()? e.g. for sets of three variables:
cc <- combn(A_minus_a, m=3)
apply(cc,2,paste,collapse="_")
## [1] "X2_X3_X4" "X2_X3_X5" "X2_X4_X5" "X3_X4_X5"
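The same idea can be extended to the full request in the question. This is only a sketch: max_extra is a name introduced here for the "up to three" threshold, and the fixed variable is simply pasted in front of each combination. Because combn() enumerates unordered subsets without repetition, neither "X1 X3 X2" nor "X1 X2 X2" can appear.

```r
A <- c("X1", "X2", "X3", "X4", "X5")  # all variables in the dataset
a <- "X1"                              # variable kept in every model
others <- setdiff(A, a)
max_extra <- 3                         # assumed threshold, adjust as needed

# For each size m = 1..max_extra, build every subset of `others` of that size
# and prepend the fixed variable
models <- unlist(lapply(seq_len(max_extra), function(m) {
  combn(others, m, FUN = function(v) paste(c(a, v), collapse = "_"))
}))
models
```

With four candidate variables this gives 4 + 6 + 4 = 14 model strings.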

R wilcox.test by categorical subset error

I'm trying to specify categorical subgroups. I found a source which suggests you can simply use this layout:
wilcox.test(growth ~ sugar, data= carbs, subset= sugar %in% c("test", "C"))
However, on my dataset it doesn't work, though the same format works if I convert the groups to numerical values in Excel.
wilcox.test(Distance~Application, data= walking.dat,
subset = Application %in% c("Control", "Cue-Lure"))
Error in wilcox.test.formula(Distance ~ Application, data = walking.dat, :
grouping factor must have exactly 2 levels
Any suggestions would be great.
Thanks!
It is amazing you can still see my deleted comments. I made two comments earlier, pointing out two possible issues.
issue 1:
It is highly likely that there is no "Control" or "Cue-Lure" in walking.dat$Application. I would suggest you try
with(walking.dat, unique(Application[Application %in% c("Control", "Cue-Lure")]))
to see what you get. Possibly you either get a single element, or nothing.
I can easily reconstruct the error you encountered. Consider the built-in R dataset airquality.
data(airquality)
unique(airquality$Month) ## 5 6 7 8 9
wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(6, 7)) ## fine
wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(1, 7)) ## fail
In the second case, you get an error:
Error in wilcox.test.formula(Ozone ~ Month, data = foo, subset = Month %in% :
grouping factor must have exactly 2 levels
because 1 is not an available value of Month.
issue 2
If both levels exist, then I guess your variable Application is a factor. Check class(walking.dat$Application). The problem with a factor can be seen here:
x <- factor(letters[1:4])
x[x %in% c("a", "b")]
#[1] a b
#Levels: a b c d
Note that the unused factor levels are not dropped after subsetting with %in%: all four levels are still listed. However, if you do:
x <- as.character(x)
x[x %in% c("a", "b")]
#[1] "a" "b"
Although you get characters, the formula method will coerce it into factors automatically. In this way, there is no danger that additional unused factor levels could break wilcox.test().
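As a small illustration of the factor behaviour described under issue 2, the sketch below compares the levels before and after droplevels(). The name kept is introduced here only for the example. Whether leftover levels matter in practice depends on how a given test function treats its grouping variable.

```r
x <- factor(letters[1:4])
kept <- x[x %in% c("a", "b")]

levels(kept)              # all four levels survive the subset
levels(droplevels(kept))  # only the levels that are actually present

# Going through character and back to factor has the same effect
levels(factor(as.character(kept)))
```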

How to change values in data frame by column class in R

I've got a frame with a set of different variables - integers, factors, logicals - and I would like to recode all of the "NAs" as a numeric across the whole dataset while preserving the underlying variable class. For example:
frame <- data.frame("x" = rnorm(10), "y" = rep("A", 10))
frame[6,] <- NA
dat <- as.data.frame(apply(frame,2, function(x) ifelse(is.na(x)== TRUE, -9, x) ))
dat
str(dat)
However, here the integers turn into factors; when I include as.numeric(x) in the apply() function, this introduces errors. Thanks for any and all thoughts on how to deal with this.
apply first converts the data frame to a matrix, which here is of type character. as.data.frame then turns this into factors by default (in R versions before 4.0.0). Instead, you could do
dat <- as.data.frame(lapply(frame, function(x) ifelse(is.na(x), -9, x) ) )
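If the column classes really must survive, one possible approach is to treat each class separately. The helper below, replace_na_keep_class, is a hypothetical name introduced for this sketch and is not part of the answer above. It assumes that for a factor the code should become an extra level, and note that a logical column could not keep its class, since -9 is not a logical value.

```r
# Hypothetical helper: replace NA in one column without changing its class
replace_na_keep_class <- function(col, value = -9) {
  if (is.factor(col)) {
    # A factor can only hold its levels, so register the code as a level first
    levels(col) <- c(levels(col), as.character(value))
    col[is.na(col)] <- as.character(value)
  } else if (is.character(col)) {
    col[is.na(col)] <- as.character(value)
  } else if (is.integer(col)) {
    col[is.na(col)] <- as.integer(value)
  } else {
    col[is.na(col)] <- value
  }
  col
}

frame <- data.frame(x = rnorm(10), y = factor(rep("A", 10)))
frame[6, ] <- NA

dat <- frame
dat[] <- lapply(frame, replace_na_keep_class)  # dat[] keeps the data frame
str(dat)
```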

how is a column decided to be of class factor in a data frame?

On creating a column whose contents contain duplicate values, I notice the following with regard to factors.
1. If a column with duplicate character values is made part of a data frame at the time of data frame creation, it is of class factor, but if the same column is appended later, it is of class character, though the values in both cases are the same. Why is this?
#creating a data frame
name = c('waugh','waugh','smith')
age = c(21,21,27)
df = data.frame(name,age)
#adding a new column which has the same values as the 'name' column above, to the data frame
df$newcol = c('waugh','waugh','smith')
#you can see that the classes of the two are different though the values are the same
class(df$name)
## [1] "factor"
class(df$newcol)
## [1] "character"
2. Only the column which has duplicate alphabetic contents becomes a factor; if a column contains duplicate numeric values, it is not treated as a factor. Why is that? I could very well mean 1 = Male, 0 = Female, in which case it should be a factor?
note that both these columns contain duplicate values
class(df$name)
## [1] "factor"
class(df$age)
## [1] "numeric"
This was basically answered in the comments, but I'll put the answer here to close out the question.
When you use data.frame() to create a data frame, that function actually manipulates the arguments you pass in. Specifically, it has a parameter named stringsAsFactors, which defaulted to TRUE in R versions before 4.0.0, so it converts every character vector you pass in to a factor. The rationale is that you normally treat these values as categorical variables in statistical tests, and it can be more efficient to store character values as a factor if many values are repeated in the vector. Numeric vectors are never converted automatically, whatever values they contain.
df <- data.frame(name,age)
class(df$name)
# [1] "factor"
df <- data.frame(name,age, stringsAsFactors=FALSE)
class(df$name)
# [1] "character"
Note that the data.frame itself doesn't remember the "stringsAsFactors" value used during its construction. It is only used when you actually run data.frame(). So if you add columns by assigning them via the $<- syntax, the coercion will not happen. cbind() is different: its data frame method is a wrapper around data.frame(), so in R versions before 4.0.0 it converts character vectors to factors unless stringsAsFactors=FALSE is in effect.
df1 <- data.frame(name,age)
df2 <- data.frame(name,age, stringsAsFactors=FALSE)
df1$name2 <- name
df2$name2 <- name
df3 <- cbind(data.frame(name,age), name2=name)
class(df1$name2)
# [1] "character"
class(df2$name2)
# [1] "character"
class(df3$name2)
# [1] "factor"
If you want to add the column as a factor, you will need to convert to factor yourself
df = data.frame(name,age)
df$name2 <- factor(name)
class(df$name2)
# [1] "factor"

Apply function to each column in a data frame observing each columns existing data type

I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:
apply(t,2,max,na.rm=1)
It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".
I then tried this:
sapply(t,max,na.rm=1)
but it complains about max not meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.
BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.
If it were an "ordered factor" things would be different. Which is not to say I like "ordered factors", I don't, only to say that some relationships are defined for 'ordered factors' that are not defined for "factors". Factors are thought of as ordinary categorical variables. You are seeing the natural sort order of factors which is alphabetical lexical order for your locale. If you want to get an automatic coercion to "numeric" for every column, ... dates and factors and all, then try:
sapply(df, function(x) max(as.numeric(x)) ) # not generally a useful result
Or if you want to test for factors first and return as you expect then:
sapply( df, function(x) if("factor" %in% class(x) ) {
max(as.numeric(as.character(x)))
} else { max(x) } )
# @Darren's comment does work better:
sapply(df, function(x) max(as.character(x)) )
max does succeed with character vectors.
The reason that max works with apply is that apply is coercing your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
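A short sketch of that coercion, using a made-up two-column data frame (the column names and values are illustrative only). as.matrix() is what apply() calls on a data frame before doing anything else, so inspecting its result shows what max actually receives:

```r
# Hypothetical mixed-type data frame
d <- data.frame(name = c("ZEBRA", "APPLE"), value = c(-99.5, 3),
                stringsAsFactors = FALSE)

m <- as.matrix(d)   # the conversion apply() performs internally
typeof(m)           # "character": the numbers have become strings

apply(d, 2, max)    # so this compares strings, column by column

# sapply()/lapply() visit each column with its own type intact
sapply(d, class)
max(d$value)        # numeric comparison on the original column
```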
The default behavior when you create a data frame is for categorical columns to be stored as factors. Unless you specify that it is an ordered factor, operations like max and min will be undefined, since R is assuming that you've created an unordered factor.
You can change this behavior by specifying options(stringsAsFactors = FALSE), which will change the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() construction call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.
Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:
#Some test data
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
d[4,] <- NA
#Similar function to DWin's answer
fun <- function(x){
  if(is.numeric(x)){max(x, na.rm = TRUE)}
  else{max(as.character(x), na.rm = TRUE)}
}
#Use colwise from plyr package
library(plyr)
colwise(fun)(d)
v1 v2 v3 v4
1 0.8478983 j 1.999435 J
If you want to get to know your data, summary(df) provides the min, 1st quartile, median, mean, 3rd quartile and max of numerical columns, and the frequency of the top levels of the factor columns.
The best way to do this is to avoid base apply(), which coerces the entire data frame to a matrix, possibly losing information.
If you wanted to apply a function as.numeric to every column, a simple way is using mutate_all from dplyr:
t %>% mutate_all(as.numeric)
Alternatively use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."
t %>% (colwise(as.numeric))
In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert or type_convert from readr.
Less interesting answer: we can apply on each column with a for-loop:
for (i in seq_along(t)) { t[[i]] <- readr::parse_guess(t[[i]]) }  # loop over columns, not rows
I don't know of a good way of doing assignment with *apply while preserving data frame structure.
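For what it is worth, one base R idiom that seems to cover this case is assigning the result of lapply() into t[], which replaces the columns while keeping the data frame attributes. The sketch below uses a hypothetical all-character data frame standing in for t, and type.convert() with as.is = TRUE so that text columns stay character rather than becoming factors.

```r
# Hypothetical all-character data frame standing in for `t`
t <- data.frame(a = c("1", "2", "3"), b = c("x", "y", "z"),
                stringsAsFactors = FALSE)

# Assigning into t[] swaps in the new columns but keeps the data frame
t[] <- lapply(t, type.convert, as.is = TRUE)

class(t)          # still a data frame
sapply(t, class)  # a is now integer, b stays character
```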
Building on @ltamar's answer:
Use summary and munge the output into something useful!
library(tidyr)
library(dplyr)
df %>%
summary %>%
data.frame %>%
select(-Var1) %>%
separate(data=.,col=Freq,into = c('metric','value'),sep = ':') %>%
rename(column_name=Var2) %>%
mutate(value=as.numeric(value),
metric = trimws(metric,'both')
) %>%
filter(!is.na(value)) -> metrics
It's not pretty and it is certainly not fast but it gets the job done!
These days loops are just as fast, so this is more than sufficient:
df <- data.frame(a = c("1","2","3"), b = c("1","3","3"), stringsAsFactors = FALSE)
for (i in seq_along(df)) {
  # replace each column with its numeric maximum
  df[, i] <- max(as.numeric(df[, i]))
}
A solution using retype() from hablar to coerce factors to character or numeric type depending on feasibility. I'd use dplyr for applying max to each column.
Code
library(dplyr)
library(hablar)
# retype() simplifies each column's type, e.g. always removes factors
d <- d %>% retype()
# Check max for each column
d %>% summarise_all(max)
Result
Note the new column types.
v1 v2 v3 v4
<dbl> <chr> <dbl> <chr>
1 0.974 j 1.09 J
Data
# Sample data borrowed from @joran
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
df <- head(mtcars)
df$string <- c("a","b", "c", "d","e", "f"); df
my.min <- unlist(lapply(df, min))
my.max <- unlist(lapply(df, max))
