I'm working with a data frame that consists of several different data types (numerics, characters, timestamps), but unfortunately all of them are received as characters. Hence I need to coerce them into their "appropriate" types dynamically and as efficiently as possible.
Consider the following example:
df <- data.frame("val1" = c("1","2","3","4"), "val2" = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
I obviously want val1 to be numeric and val2 to remain as a character. Therefore, my result should look like this:
'data.frame': 4 obs. of 2 variables:
$ val1: num 1 2 3 4
$ val2: chr "A" "B" "C" "D"
Right now I'm accomplishing this by checking whether the coercion would result in NA and coercing only if it does not:
res <- as.data.frame(lapply(df, function(x){
x <- sapply(x, function(y) {
if (is.na(as.numeric(y))) {
return(y)
} else {
y <- as.numeric(y)
return(y)
}
})
return(x)
}), stringsAsFactors = FALSE)
However, this doesn't strike me as the correct solution because of multiple issues:
I suspect that there is a faster way of accomplishing this
For some reason I receive the warning In FUN(X[[i]], ...) : NAs introduced by coercion, although this isn't the case (see result)
This seems inappropriate when handling other data types, i.e. dates
Is there a general, heuristic approach to this, or another, more sustainable solution? Thanks
The recent file readers like data.table::fread or the readr package do a pretty decent job in identifying and converting columns to the appropriate type.
So my first reaction was to suggest writing the data to a file and reading it back in, e.g.,
library(data.table)
fwrite(df, "dummy.csv")
df_new <- fread("dummy.csv")
str(df_new)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
- attr(*, ".internal.selfref")=<externalptr>
or without actually writing to disk:
df_new <- fread(paste(capture.output(fwrite(df, "")), collapse = "\n"))
However, d.b's suggestions are much smarter but need some polishing to avoid coercion to factor:
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
or
df[] <- lapply(df, readr::parse_guess)
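For comparison with the element-wise loop in the question, here is a minimal base-R sketch of a column-wise check. The helper name to_numeric_if_possible is hypothetical (it is not part of any package); it parses each column once, suppresses the expected coercion warning, and keeps the numeric result only when no value was lost. It handles numerics only; dates would need their own check.

```r
# Hypothetical helper: convert a character vector to numeric only if every
# non-missing value parses cleanly; otherwise return the input unchanged.
to_numeric_if_possible <- function(x) {
  # as.numeric() warns for values such as "A"; the warning is expected here
  parsed <- suppressWarnings(as.numeric(x))
  # Accept the conversion only if it introduced no new NAs
  if (all(is.na(parsed) == is.na(x))) parsed else x
}

df <- data.frame(val1 = c("1", "2", "3", "4"),
                 val2 = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)

# df[] keeps the data.frame structure while replacing every column
df[] <- lapply(df, to_numeric_if_possible)
str(df)
```

Because the check runs once per column rather than once per element, the warning from the question disappears and mixed columns stay character as a whole.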
You should check the dataPreparation package. Its findAndTransformNumerics function will do exactly what you want.
require(dataPreparation)
data("messy_adult")
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"character" "character" "factor"
messy_adult is an ugly data set to illustrate functions from this package. Here num1 and num2 are strings :/
messy_adult <- findAndTransformNumerics(messy_adult)
[1] "findAndTransformNumerics: It took me 0.18s to identify 3 numerics column(s), i will set them as numerics"
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum1"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum2"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I am doing the columnnum3"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "findAndTransformNumerics: It took me 0.09s to transform 3 column(s) to a numeric format."
Here we performed the search and it logged what it found.
And now:
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"numeric" "numeric" "factor"
Hope it helps!
Disclaimer: I'm the author of this package.
I have the following setup.
df <- data.frame(aa = rnorm(1000), bb = rnorm(1000))
apply(df, 2, typeof)
# aa bb
#"double" "double"
apply(df, 2, class)
# aa bb
#"numeric" "numeric"
Then I try to convert one of the columns to "factor". But as you can see below, I am not getting any "factor" type or classes. Am I doing anything wrong ?
df[, 1] <- as.factor(df[, 1])
apply(df, 2, typeof)
# aa bb
#"character" "character"
apply(df, 2, class)
# aa bb
#"character" "character"
Sorry, I felt my original answer was badly written. Why did I put that "matrix of factors" at the very beginning? Here is a better try.
From ?apply:
If ‘X’ is not an array but an object of a class with a non-null
‘dim’ value (such as a data frame), ‘apply’ attempts to coerce it
to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data
frame) or via ‘as.array’.
So a data frame is converted to a matrix by as.matrix, before FUN is applied row-wise or column-wise.
From ?as.matrix:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
The default method for ‘as.matrix’ calls ‘as.vector(x)’, and hence
e.g. coerces factors to character vectors.
I am not a native English speaker and I can't read the following (which looks rather important!). Can someone clarify it?
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying ‘as.vector’ to factors and ‘format’ to other non-character columns.
From ?as.vector:
Note that factors are _not_ vectors; ‘is.vector’ returns ‘FALSE’
and ‘as.vector’ converts a factor to a character vector for ‘mode
= "any"’.
Simply put, as long as you have a factor column in a data frame, as.matrix gives you a character matrix.
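A small sketch of that rule, using a made-up two-column data frame: the same data gives a numeric matrix before the factor conversion and a character matrix after it, which is what apply sees.

```r
df <- data.frame(aa = c(1.5, 2.5), bb = c(3, 4))

# All columns numeric: as.matrix() keeps a numeric matrix
numeric_before <- is.numeric(as.matrix(df))

# Turn one column into a factor, as in the question
df$aa <- as.factor(df$aa)

# One factor column is enough to make the whole matrix character
type_after <- typeof(as.matrix(df))

# sapply() walks the columns directly, so the real classes are visible
column_classes <- sapply(df, class)

print(numeric_before)
print(type_after)
print(column_classes)
```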
I believe this apply-with-data-frame problem has been raised many times, so the above just adds another duplicate answer. Really sorry, I failed to read OP's question carefully. What struck me at first is that R cannot build a true matrix of factors.
f <- factor(letters[1:4])
matrix(f, 2, 2)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "d"
## a sneaky way to get a matrix of factors by setting `dim` attribute
dim(f) <- c(2, 2)
f
# [,1] [,2]
#[1,] a c
#[2,] b d
#Levels: a b c d
is.matrix(f)
#[1] TRUE
class(f)
#[1] "factor" ## not a true matrix with "matrix" class
While this is interesting, it is less relevant to OP's question.
Sorry again for making a mess here. So bad!!
So if I do sapply would it help? Because I have many columns that need to be converted to factor.
Use lapply actually. sapply would simplify the result to an array, which is a matrix in 2D case. Here is an example:
dat <- head(trees)
sapply(dat, as.factor)
# Girth Height Volume
#[1,] "8.3" "70" "10.3"
#[2,] "8.6" "65" "10.3"
#[3,] "8.8" "63" "10.2"
#[4,] "10.5" "72" "16.4"
#[5,] "10.7" "81" "18.8"
#[6,] "10.8" "83" "19.7"
new_dat <- data.frame(lapply(dat, as.factor))
str(new_dat)
#'data.frame': 6 obs. of 3 variables:
# $ Girth : Factor w/ 6 levels "8.3","8.6","8.8",..: 1 2 3 4 5 6
# $ Height: Factor w/ 6 levels "63","65","70",..: 3 2 1 4 5 6
# $ Volume: Factor w/ 5 levels "10.2","10.3",..: 2 2 1 3 4 5
sapply(new_dat, class)
# Girth Height Volume
#"factor" "factor" "factor"
apply(new_dat, 2, class)
# Girth Height Volume
#"character" "character" "character"
Regarding typeof, factors are actually stored as integers.
sapply(new_dat, typeof)
# Girth Height Volume
#"integer" "integer" "integer"
When you dput a factor you can see this. For example:
dput(new_dat[[1]])
#structure(1:6, .Label = c("8.3", "8.6", "8.8", "10.5", "10.7",
#"10.8"), class = "factor")
The real values are 1:6. Character levels are just an attribute.
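One practical consequence, sketched with a made-up factor: converting a factor straight to a number returns the integer codes, not the values shown in the labels. Going through as.character first recovers the labelled values.

```r
f <- factor(c("10.5", "8.3", "10.5"))

# Levels are sorted as strings, so "10.5" comes before "8.3"
lev <- levels(f)

# Direct conversion exposes the underlying integer codes
codes <- as.integer(f)

# Converting the labels first gives back the original numbers
values <- as.numeric(as.character(f))

print(lev)
print(codes)
print(values)
```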
I have a character vector of classes that I would like to apply to a dataframe, so as to convert the current class of each field in that dataframe to the corresponding entry in the vector. For example:
frame <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
With a for-loop, I know that this can be accomplished using lapply. For example:
for(i in 1:2) { frame[i] <- lapply(frame[i], paste("as", classes[i], sep = ".")) }
For my purposes, however, a for-loop cannot work. Is there another solution that I am missing?
Thank you in advance for your input!
Note: I have been informed that this might be a duplicate of this post. And, yes, my question is similar to it. But I have looked at the class() approach before. And it does not seem to effectively deal with converting fields to factors. The lapply approach, on the other hand, does it well. But, unfortunately, I cannot utilize a for-loop in this instance
If you're not averse to using lapply without a for loop, you can try something like the following.
frame[] <- lapply(seq_along(frame), function(x) {
FUN <- paste("as", classes[x], sep = ".")
match.fun(FUN)(frame[[x]])
})
str(frame)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
However, a better option is to try to apply the correct classes when you're reading the data in to begin with.
x <- tempfile() # Just to pretend....
write.csv(frame2, x, row.names = FALSE) # ... that we are reading a csv
frame3 <- read.csv(x, colClasses = classes)
str(frame3)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
Sample data:
frame <- frame2 <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
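As a variation on the lapply answer above, the columns and the class names can be walked in parallel with Map, which avoids indexing by position. This is only a sketch under the same assumption as the answer: each entry in classes must have a matching as.<class> function.

```r
frame <- data.frame(A = 2:5, B = 3:6)
classes <- c("character", "factor")

# Map() pairs each column with its target class; the result is a list
# named after the columns, so frame[] can take it directly.
frame[] <- Map(function(column, cls) {
  convert <- match.fun(paste0("as.", cls))  # e.g. as.character, as.factor
  convert(column)
}, frame, classes)

str(frame)
```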
I am trying to create a function that allows the conversion of selected columns of a data frame to categorical data type (factor) before running a regression analysis.
Question is how do I slice a particular column from a data frame using a string (character).
Example:
strColumnNames <- "Admit,Rank"
strDelimiter <- ","
strSplittedColumnNames <- strsplit(strColumnNames, strDelimiter)
for( strColName in strSplittedColumnNames[[1]] ){
dfData$as.name(strColName) <- factor(dfData$get(strColName))
}
Tried:
dfData$as.name()
dfData$get(as.name())
dfData$get()
Error Msg:
Error: attempt to apply non-function
Any help would be greatly appreciated! Thank you!!!
You need to change
dfData$as.name(strColName) <- factor(dfData$get(strColName))
to
dfData[[strColName]] <- factor(dfData[[strColName]])
You may read ?"[[" for more.
In your case, the column names are generated programmatically, so [[ ]] is the way to go. Maybe this example will be clear enough to illustrate the problem with $:
dat <- data.frame(x = 1:5, y = 2:6)
z <- "x"
dat$z
# [1] NULL
dat[[z]]
# [1] 1 2 3 4 5
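Putting this together for the original goal, here is a minimal sketch of a function that takes a delimited string of column names and converts those columns to factors. The function name convert_to_factor and the sample data are hypothetical; only the column names Admit and Rank come from the question.

```r
# Hypothetical helper: convert the columns named in a delimited string
# (e.g. "Admit,Rank") to factors and return the modified data frame.
convert_to_factor <- function(data, col_string, delimiter = ",") {
  cols <- trimws(strsplit(col_string, delimiter, fixed = TRUE)[[1]])
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  # List-style indexing with a character vector, no $ involved
  data[cols] <- lapply(data[cols], factor)
  data
}

# Made-up sample data for illustration
dfData <- data.frame(Admit = c(0, 1, 1), Rank = c(1, 2, 3), Gpa = c(3.1, 3.6, 3.9))
dfData <- convert_to_factor(dfData, "Admit,Rank")
str(dfData)
```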
Regarding the other answer
apply definitely does not work, because the function you apply is as.factor or factor. apply always works on a matrix (if you feed it a data frame, it will convert it into a matrix first) and returns a matrix, while you can't have factor data class in matrix. Consider this example:
x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4, stringsAsFactors = FALSE)
x[, 1:2] <- apply(x[, 1:2], 2, as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: chr "a" "b" "c" "d"
# $ x2: chr "A" "B" "C" "D"
# $ x3: int 1 2 3 4
Note, you still have character variable rather than factor. As I said, we have to use lapply:
x[1:2] <- lapply(x[1:2], as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# $ x2: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ x3: int 1 2 3 4
Now we see the factor class in x1 and x2.
Using apply for a data frame is never a good idea. If you read the source code of apply:
dl <- length(dim(X))
if (is.object(X))
X <- if (dl == 2L)
as.matrix(X)
else as.array(X)
You see that a data frame (which has two dimensions) will be coerced to a matrix first. This is very slow. If your data frame columns have several different classes, the resulting matrix will have only one class, and the result of such coercion can be hard to predict.
Yet apply is written in R not C, with an ordinary for loop:
for (i in 1L:d2) {
tmp <- forceAndCall(1, FUN, newX[, i], ...)
if (!is.null(tmp))
ans[[i]] <- tmp
so it is no better than an explicit for loop you write yourself.
I would use a different method.
Create a vector of column names you want to change to factors:
factorCols <- c("Admit", "Rank")
Then extract these columns by index:
myCols <- which(names(dfData) %in% factorCols)
Finally, use lapply to change these columns to factors:
dfData[myCols] <- lapply(dfData[myCols], as.factor)
Consider this example:
df <- data.frame(id=1:10,var1=LETTERS[1:10],var2=LETTERS[6:15])
fun.split <- function(x) tolower(as.character(x))
df$new.letters <- apply(df[ ,2:3],2,fun.split)
df$new.letters.var1
#NULL
colnames(df)
# [1] "id" "var1" "var2" "new.letters"
df$new.letters
# var1 var2
# [1,] "a" "f"
# [2,] "b" "g"
# [3,] "c" "h"
# [4,] "d" "i"
# [5,] "e" "j"
# [6,] "f" "k"
# [7,] "g" "l"
# [8,] "h" "m"
# [9,] "i" "n"
# [10,] "j" "o"
Would be someone so kind and explain what is going on here? A new dataframe within dataframe?
I expected this:
colnames(df)
# id var1 var2 new.letters.var1 new.letters.var2
The reason is because you assigned a single new column to a 2 column matrix output by apply. So, the result will be a matrix in a single column. You can convert it back to normal data.frame with
do.call(data.frame, df)
A more straightforward method will be to assign 2 columns and I use lapply instead of apply as there can be cases where the columns are of different classes. apply returns a matrix and with mixed class, the columns will be 'character' class. But, lapply gets the output in a list and preserves the class
df[paste0('new.letters.', names(df)[2:3])] <- lapply(df[2:3], fun.split)
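The matrix-in-a-column effect can be reproduced with a much smaller, made-up data frame. The sketch below shows that the matrix counts as a single column until do.call(data.frame, ...) spreads it out.

```r
df <- data.frame(id = 1:3)

# Assigning a two-column matrix to one name stores it as ONE column
df$m <- cbind(a = 1:3, b = 4:6)

n_before <- ncol(df)       # the matrix counts as a single column
names_before <- names(df)

# Rebuilding the data frame from its components expands the matrix
flat <- do.call(data.frame, df)
names_after <- names(flat)

print(n_before)
print(names_before)
print(names_after)
```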
@akrun solved 90% of my problem. But I had data.frames buried within data.frames, buried within data.frames and so on, without knowing the depth to which this was happening.
In this case, I thought sharing my recursive solution might be helpful to others searching this thread as I was:
unnest_dataframes <- function(x) {
  y <- do.call(data.frame, x)
  # Keep the result of the recursive call; otherwise only one level is unpacked
  if (any(sapply(y, is.data.frame))) y <- unnest_dataframes(y)
  y
}
new_data <- unnest_dataframes(df)
This itself sometimes has problems, though, and it can be helpful to separate all columns of class "data.frame" from the original data set and then cbind() them back together, like so:
# Find all columns that are data.frame
# Assuming your data frame is stored in variable 'y'
data.frame.cols <- unname(sapply(y, is.data.frame))
z <- y[, !data.frame.cols]
# All columns of class "data.frame"
dfs <- y[, data.frame.cols]
# Recursively unnest each of these columns
unnest_dataframes <- function(x) {
  y <- do.call(data.frame, x)
  if (any(sapply(y, is.data.frame))) {
    # Keep the result of the recursive call so deeper levels are unpacked too
    y <- unnest_dataframes(y)
  } else {
    cat('Nested data.frames successfully unpacked\n')
  }
  y
}
df2 <- unnest_dataframes(dfs)
# Combine with original data
all_columns <- cbind(z, df2)
In this case R doesn't behave like one would expect but maybe if we dig deeper we can solve it. What is a data frame? as Norman Matloff says in his book (chapter 5):
a data frame is a list, with the components of that list being
equal-length vectors
The following code might be useful to understand.
class(df$new.letters)
[1] "matrix"
str(df)
'data.frame': 10 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ var1 : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
$ var2 : Factor w/ 10 levels "F","G","H","I",..: 1 2 3 4 5 6 7 8 9 10
$ new.letters: chr [1:10, 1:2] "a" "b" "c" "d" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "var1" "var2"
Maybe the reason why it looks strange is in the print methods. Consider this:
colnames(df$new.letters)
[1] "var1" "var2"
Maybe there is something in the print methods that combines the sub-names of objects and displays them all.
For example here the vectors that constitute the df are:
names(df)
[1] "id" "var1" "var2" "new.letters"
but in this case the vector new.letters also has a dim attribute (in fact it is a matrix) whose dimensions have the names var1 and var2 too. See this code:
attributes(df$new.letters)
$dim
[1] 10 2
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "var1" "var2"
but when we print we see all of them as if they were separate vectors (and so columns of the data.frame!).
Edit: Print methods
Just for curiosity in order to improve this question I looked inside the methods of the print functions:
methods(print)
The previous code produces a very long list of methods for the generic function print. I did not spot the one for data.frame at first (base R does provide print.data.frame), so the one I looked at instead, although I am sure there is a more technical way to find the right one, is listof.
getS3method("print", "listof")
function (x, ...)
{
nn <- names(x)
ll <- length(x)
if (length(nn) != ll)
nn <- paste("Component", seq.int(ll))
for (i in seq_len(ll)) {
cat(nn[i], ":\n")
print(x[[i]], ...)
cat("\n")
}
invisible(x)
}
<bytecode: 0x101afe1c8>
<environment: namespace:base>
Maybe I am wrong, but it seems to me that this code might contain useful information about why that happens, specifically where if (length(nn) != ll) is evaluated.
A very unexpected behavior of the useful data.frame in R arises from keeping character columns as factor. This causes many problems if it is not considered. For example suppose the following code:
foo=data.frame(name=c("c","a"),value=1:2)
# name value
# 1 c 1
# 2 a 2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
# [,1] [,2]
# a 1 4
# b 2 5
# c 3 6
Then what do you expect of running bar[foo$name,]? It normally should return the rows of bar that are named according to the foo$name that means rows 'c' and 'a'. But the result is different:
bar[foo$name,]
# [,1] [,2]
# b 2 5
# a 1 4
The reason is this: foo$name is not a character vector but a factor, which is stored as integer codes, and the matrix is indexed by those codes.
foo$name
# [1] c a
# Levels: a c
To have the expected behavior, I manually convert it to character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
# [,1] [,2]
# c 3 6
# a 1 4
But the problem is that we may easily forget to do this and end up with hidden bugs in our code. Is there any better solution?
This is a feature and R is working as documented. This can be dealt with generally in a few ways:
use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame
if you detest this behaviour so, set the option globally via
options(stringsAsFactors = FALSE)
(as noted by @JoshuaUlrich in comments) a third option is to wrap character variables in I(...). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem as the object inherits (in this case) the class "character" so should work as before.
You can check what the default for stringsAsFactors is on the currently running R process via:
> default.stringsAsFactors()
[1] TRUE
The issue is slightly wider than data.frame() in scope as this also affects read.table(). In that function, as well as the two options above, you can also tell R what all the classes of the variables are via argument colClasses and R will respect that, e.g.
> tmp <- read.table(text = '"Var1","Var2"
+ "A","B"
+ "C","C"
+ "B","D"', header = TRUE, colClasses = rep("character", 2), sep = ",")
> str(tmp)
'data.frame': 3 obs. of 2 variables:
$ Var1: chr "A" "C" "B"
$ Var2: chr "B" "C" "D"
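To tie this back to the indexing example in the question, here is a short sketch that sets stringsAsFactors explicitly in both directions, so it does not depend on the session default. With the factor column the matrix is indexed by the integer codes; with the character column it is indexed by row name.

```r
bar <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))

# Factor column: levels are "a", "c", so the codes for c("c", "a") are 2, 1
foo_factor <- data.frame(name = c("c", "a"), value = 1:2,
                         stringsAsFactors = TRUE)
rows_by_factor <- rownames(bar[foo_factor$name, ])

# Character column: rows are matched by name, as intended
foo_char <- data.frame(name = c("c", "a"), value = 1:2,
                       stringsAsFactors = FALSE)
rows_by_name <- rownames(bar[foo_char$name, ])

print(rows_by_factor)
print(rows_by_name)
```

Passing stringsAsFactors explicitly, rather than relying on the global option, keeps the behaviour the same from one session to the next.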
In the example data below, author and title are automatically converted to factor (unless you add the argument stringsAsFactors = FALSE when you are creating the data). What if we forgot to change the default setting and don't want to set the options globally?
Some code I found somewhere (most likely SO) uses sapply() to identify factors and convert them to strings.
dat = data.frame(title = c("title1", "title2", "title3"),
author = c("author1", "author2", "author3"),
customerID = c(1, 2, 1))
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : Factor w/ 3 levels "title1","title2",..: 1 2 3
# $ author : Factor w/ 3 levels "author1","author2",..: 1 2 3
# $ customerID: num 1 2 1
dat[sapply(dat, is.factor)] = lapply(dat[sapply(dat, is.factor)],
as.character)
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : chr "title1" "title2" "title3"
# $ author : chr "author1" "author2" "author3"
# $ customerID: num 1 2 1
I assume this would be faster than re-reading in the dataset with the stringsAsFactors = FALSE argument, but have never tested.
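The same idiom can be wrapped in a small function so that the factor check runs only once. The name factors_to_character is hypothetical, and the sketch sets stringsAsFactors = TRUE explicitly so the sample data contains factors regardless of the session default.

```r
# Hypothetical helper: convert every factor column to character,
# leaving all other columns untouched.
factors_to_character <- function(data) {
  is_fac <- vapply(data, is.factor, logical(1))  # one check per column
  data[is_fac] <- lapply(data[is_fac], as.character)
  data
}

dat <- data.frame(title = c("title1", "title2", "title3"),
                  author = c("author1", "author2", "author3"),
                  customerID = c(1, 2, 1),
                  stringsAsFactors = TRUE)

dat <- factors_to_character(dat)
str(dat)
```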