Data frame and the very common mistake while using character columns - r

A surprising behavior of data.frame in R is that it converts character columns to factors by default. This causes many problems if it is overlooked. For example, consider the following code:
foo=data.frame(name=c("c","a"),value=1:2)
# name value
# 1 c 1
# 2 a 2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
# [,1] [,2]
# a 1 4
# b 2 5
# c 3 6
What do you expect from running bar[foo$name,]? It should return the rows of bar whose names appear in foo$name, that is, rows 'c' and 'a'. But the result is different:
bar[foo$name,]
# [,1] [,2]
# b 2 5
# a 1 4
The reason is that foo$name is not a character vector but a factor, which is stored as integer codes (here 2 and 1), so bar is indexed by position rather than by name.
foo$name
# [1] c a
# Levels: a c
To get the expected behavior, I manually convert it to a character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
# [,1] [,2]
# c 3 6
# a 1 4
But the problem is that it is easy to forget this step and end up with hidden bugs in our code. Is there a better solution?
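The mechanism described in the question can be seen directly. This is a minimal sketch that builds the factor explicitly, so the result does not depend on the R version (the stringsAsFactors default changed to FALSE in R 4.0.0); the variable names codes, by_code and by_label are illustrative only.

```r
# Build the factor explicitly instead of relying on data.frame() defaults.
name <- factor(c("c", "a"))
bar <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))

# Levels are sorted ("a", "c"), so "c" has code 2 and "a" has code 1.
codes <- as.integer(name)

# Indexing with the factor uses the integer codes, i.e. rows 2 and 1.
by_code <- rownames(bar[name, ])

# Indexing with the character labels selects rows by name.
by_label <- rownames(bar[as.character(name), ])

print(codes)     # 2 1
print(by_code)   # "b" "a"
print(by_label)  # "c" "a"
```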

This is a feature and R is working as documented. It can be dealt with in a few general ways:
use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame
if you detest this behaviour, set the option globally via
options(stringsAsFactors = FALSE)
(as noted by @JoshuaUlrich in the comments) a third option is to wrap character variables in I(...). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem, as the object is (in this case) still a character vector and so should work as before.
You can check what the default for stringsAsFactors is on the currently running R process via:
> default.stringsAsFactors()
[1] TRUE
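As a small sketch of the per-call options listed above (the names foo_chr and foo_asis are made up for illustration), both keep the column as character data, although the I() variant carries the extra "AsIs" class:

```r
# Option 1: switch the conversion off for this one call.
foo_chr <- data.frame(name = c("c", "a"), value = 1:2,
                      stringsAsFactors = FALSE)

# Option 3: protect the column with I() so data.frame() leaves it alone.
foo_asis <- data.frame(name = I(c("c", "a")), value = 1:2)

print(class(foo_chr$name))         # "character"
print(class(foo_asis$name))        # "AsIs"
print(is.character(foo_asis$name)) # TRUE: the underlying type is still character
```

Passing stringsAsFactors explicitly also makes the code behave the same regardless of the session default.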
The issue is slightly wider in scope than data.frame(), as it also affects read.table(). In that function, in addition to the stringsAsFactors options above, you can tell R the classes of all the variables via the colClasses argument, and R will respect that, e.g.
> tmp <- read.table(text = '"Var1","Var2"
+ "A","B"
+ "C","C"
+ "B","D"', header = TRUE, colClasses = rep("character", 2), sep = ",")
> str(tmp)
'data.frame': 3 obs. of 2 variables:
$ Var1: chr "A" "C" "B"
$ Var2: chr "B" "C" "D"

In the example data below, author and title are automatically converted to factors (unless you add the argument stringsAsFactors = FALSE when creating the data). What if we forgot to change the default setting and don't want to set the option globally?
Some code I found somewhere (most likely on SO) uses sapply() to identify the factor columns and lapply() to convert them to strings.
dat = data.frame(title = c("title1", "title2", "title3"),
author = c("author1", "author2", "author3"),
customerID = c(1, 2, 1))
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : Factor w/ 3 levels "title1","title2",..: 1 2 3
# $ author : Factor w/ 3 levels "author1","author2",..: 1 2 3
# $ customerID: num 1 2 1
dat[sapply(dat, is.factor)] = lapply(dat[sapply(dat, is.factor)],
as.character)
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : chr "title1" "title2" "title3"
# $ author : chr "author1" "author2" "author3"
# $ customerID: num 1 2 1
I assume this would be faster than re-reading the dataset with the stringsAsFactors = FALSE argument, but I have never tested it.

Related

#R #Non-numeric argument for binary operator #xts object * integer

I have an xts object, that is, a time series of a company's "outstanding shares" ordered by date.
I want to multiply this time series by a factor of 7 to account for a stock split.
> outstanding_shares_xts <- shares_xts1[,1]
> adjusted <- outstanding_shares_xts*7
Error: Non-numeric argument for binary operator.
The series "outstanding_shares_xts" is a column of integers.
Does anyone have an idea?
My guess is that they may look like integers but are in fact not.
Sleuthing:
I initially thought it could be [-vs-[[ column subsetting, since tibble(a=1:2)[,1] does not produce an integer vector (it produces a single-column tibble), but tibble(a=1:2)[,1] * 7 still works.
Then I thought it could be due to factors, but it's a different error:
data.frame(a=factor(1:2))[,1]*7
# Warning in Ops.factor(data.frame(a = factor(1:2))[, 1], 7) :
# '*' not meaningful for factors
# [1] NA NA
One possibility is that you have character values that look like integers.
dat <- data.frame(a=as.character(1:2))
dat
# a
# 1 1
# 2 2
dat[,1]*7
# Error in dat[, 1] * 7 : non-numeric argument to binary operator
Try converting that column to integer, something like
str(dat)
# 'data.frame': 2 obs. of 1 variable:
# $ a: chr "1" "2"
dat$a <- as.integer(dat$a)
str(dat)
# 'data.frame': 2 obs. of 1 variable:
# $ a: int 1 2
dat[,1]*7
# [1] 7 14
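One way such character values can arise, offered only as a guess about the asker's data: an xts object stores its data in a matrix, and a matrix has a single type, so one text column turns every column into character. The sketch below uses a plain base R matrix as a stand-in for the xts object; the column names and values are hypothetical.

```r
# A matrix holds one type only: mixing numbers and text yields character.
m <- cbind(shares = c(100, 200), ticker = c("AAA", "BBB"))
print(typeof(m))  # "character"

shares <- m[, "shares"]

# Multiplying the character column fails, as in the question.
failed <- inherits(tryCatch(shares * 7, error = function(e) e), "error")
print(failed)     # TRUE

# Converting to numeric first makes the arithmetic work.
adjusted <- as.numeric(shares) * 7
print(adjusted)   # 700 1400
```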

After doing bind_rows() and rbind() on same data.tables , identical() = FALSE?

Caveat: novice. I have several data.tables with millions of rows each, variables are mostly dates and factors. I was using rbindlist() to combine them because. Yesterday, after breaking up the tables into smaller pieces vertically (instead of the current horizontal splicing), I was trying to understand rbind better (especially with fill = TRUE) and also tried bind_rows() and then tried to verify the results but identical() returned FALSE.
library(data.table)
library(dplyr)
DT1 <- data.table(a=1, b=2)
DT2 <- data.table(a=4, b=3)
DT_bindrows <- bind_rows(DT1,DT2)
DT_rbind <- rbind(DT1,DT2)
identical(DT_bindrows,DT_rbind)
# [1] FALSE
Visually inspecting the results from bind_rows() and rbind() suggests they are indeed identical. I read this and this (from where I adapted the example). My questions: (a) what am I missing, and (b) if the number, names, and order of my columns are the same, should I be concerned that identical() returns FALSE?
identical() also compares attributes, and these are not the same for the two objects. all.equal() has an option to skip the attribute check (check.attributes):
all.equal(DT_bindrows, DT_rbind, check.attributes = FALSE)
#[1] TRUE
If we check the str of both the datasets, it becomes clear
str(DT_bindrows)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
str(DT_rbind)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
# - attr(*, ".internal.selfref")=<externalptr> # reference attribute
After setting that attribute to NULL, identical() returns TRUE
attr(DT_rbind, ".internal.selfref") <- NULL
identical(DT_bindrows, DT_rbind)
#[1] TRUE
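The same effect can be reproduced in base R without data.table. In this sketch the attribute "note" is a hypothetical stand-in for .internal.selfref; it only illustrates that a single extra attribute is enough to make identical() return FALSE.

```r
a <- data.frame(a = c(1, 4), b = c(2, 3))
b <- a

# Attach an extra attribute to one copy only (hypothetical stand-in).
attr(b, "note") <- "extra"

same_with_attr <- identical(a, b)
same_values <- isTRUE(all.equal(a, b, check.attributes = FALSE))

# Remove the attribute again and the objects are identical.
attr(b, "note") <- NULL
same_after_reset <- identical(a, b)

print(same_with_attr)    # FALSE
print(same_values)       # TRUE
print(same_after_reset)  # TRUE
```

So if the columns and values match, a FALSE from identical() caused only by such an attribute says nothing about the data itself.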

Coerce variables in data frame to appropriate format

I'm working with a data frame which consists of multiple different data types (numerics, characters, timestamps), but unfortunately all of them are received as characters. Hence I need to coerce them into their "appropriate" format dynamically and as efficiently as possible.
Consider the following example:
df <- data.frame("val1" = c("1","2","3","4"), "val2" = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
I obviously want val1 to be numeric and val2 to remain as a character. Therefore, my result should look like this:
'data.frame': 4 obs. of 2 variables:
$ val1: num 1 2 3 4
$ val2: chr "A" "B" "C" "D"
Right now I'm accomplishing this by checking whether the coercion would result in NA and coercing only if it doesn't:
res <- as.data.frame(lapply(df, function(x){
  x <- sapply(x, function(y) {
    if (is.na(as.numeric(y))) {
      return(y)
    } else {
      y <- as.numeric(y)
      return(y)
    }
  })
  return(x)
}), stringsAsFactors = FALSE)
However, this doesn't strike me as the correct solution because of multiple issues:
I suspect that there is a faster way of accomplishing this
For some reason I receive the warning In FUN(X[[i]], ...) : NAs introduced by coercion, although this isn't the case (see result)
This seems inappropriate when handling other data types, i.e. dates
Is there a general, heuristic approach to this, or another, more sustainable solution? Thanks
The recent file readers like data.table::fread or the readr package do a pretty decent job in identifying and converting columns to the appropriate type.
So my first reaction was to suggest to write the data to file and read it in again, e.g.,
library(data.table)
fwrite(df, "dummy.csv")
df_new <- fread("dummy.csv")
str(df_new)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
- attr(*, ".internal.selfref")=<externalptr>
or without actually writing to disk:
df_new <- fread(paste(capture.output(fwrite(df, "")), collapse = "\n"))
However, d.b's suggestions are much smarter but need some polishing to avoid coercion to factor:
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
or
df[] <- lapply(df, readr::parse_guess)
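For comparison, the explicit check attempted in the question can be vectorised per column. This is only a sketch: to_numeric_if_possible is a hypothetical helper, not part of any package, and it handles numbers only (dates would need a separate rule). The coercion warning is suppressed on purpose, because as.numeric() is still called on the non-numeric strings, which is where the "NAs introduced by coercion" warning in the question comes from.

```r
# Hypothetical helper: convert a character vector to numeric only if
# the conversion introduces no new NAs.
to_numeric_if_possible <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  if (all(is.na(parsed) == is.na(x))) parsed else x
}

df_demo <- data.frame(val1 = c("1", "2", "3", "4"),
                      val2 = c("A", "B", "C", "D"),
                      stringsAsFactors = FALSE)

# df_demo[] keeps the data frame structure while replacing the columns.
df_demo[] <- lapply(df_demo, to_numeric_if_possible)
str(df_demo)
```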
You should check the dataPreparation package. Its findAndTransformNumerics function will do exactly what you want.
require(dataPreparation)
data("messy_adult")
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"character" "character" "factor"
messy_adult is an ugly data set to illustrate functions from this package. Here num1 and num2 are strings :/
messy_adult <- findAndTransformNumerics(messy_adult)
[1] "findAndTransformNumerics: It took me 0.18s to identify 3 numerics column(s), i will set them as numerics"
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum1"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum2"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I am doing the columnnum3"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "findAndTransformNumerics: It took me 0.09s to transform 3 column(s) to a numeric format."
Here we performed the search, and it logged what it found.
And now:
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"numeric" "numeric" "factor"
Hope it helps!
Disclaimer: I'm the author of this package.

How to get labels from hclust result

Let's say I have a dataset like this
dt<-data.frame(id=1:4,X=sample(4),Y=sample(4))
and then I try to run a hierarchical clustering using the code below
dis<-dist(dt[,-1])
clusters <- hclust(dis)
plot(clusters)
and it works well.
The point is that when I ask for
clusters$labels
it gives me NULL, whereas I expect to see the labels of the individuals in order, like
1, 4, 2, 3
It is important to have them in the order in which they appear in the plot.
Use clusters$order rather than labels if you have not assigned any labels.
In fact, you can see all the components by calling summary():
clusters <- hclust(dis)
plot(clusters)
summary(clusters)
clusters$order
You can compare with the plot I got on my end; it is of course a little different from yours
My outcome:
> clusters$order
[1] 4 1 2 3
Content of summary command:
> summary(clusters)
Length Class Mode
merge 6 -none- numeric
height 3 -none- numeric
order 4 -none- numeric
labels 0 -none- NULL
method 1 -none- character
call 2 -none- call
dist.method 1 -none- character
You can see that labels is NULL, which is why you are not getting any labels. To get labels you need to assign them first, using clusters$labels <- c("A","B","C","D") or the row names; once labels are assigned, you will see the names/labels instead of the numbers.
In my case I have not assigned any names, so I get the numbers instead.
You can also pass the labels to the plot function itself.
From the documentation ?hclust
labels
A character vector of labels for the leaves of the tree. By
default the row names or row numbers of the original data are used. If
labels = FALSE no labels at all are plotted.
You could use the following code:
# your data, I changed the id to characters to make it more clear
set.seed(1234) # for reproducibility
dt<-data.frame(id=c("A", "B", "C", "D"),X=sample(4),Y=sample(4))
dt
# your code, no labels
dis<-dist(dt[,-1])
clusters <- hclust(dis)
clusters$labels
# add labels, plot and check labels
clusters$labels <- dt$id
plot(clusters)
## labels in the order plotted
clusters$labels[clusters$order]
## [1] A D B C
## Levels: A B C D
Please let me know whether this is what you want.
Please make sure you use rownames(...) to ensure your data has labels
> rownames(dt) <- dt$id
> dt
id X Y
1 1 2 1
2 2 4 3
3 3 1 2
4 4 3 4
> dis<-dist(dt[,-1])
> clusters <- hclust(dis)
> str(clusters)
List of 7
$ merge : int [1:3, 1:2] -1 -2 1 -3 -4 2
$ height : num [1:3] 1.41 1.41 3.16
$ order : int [1:4] 1 3 2 4
$ labels : chr [1:4] "1" "2" "3" "4"
$ method : chr "complete"
$ call : language hclust(d = dis)
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
>
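A reproducible variant of the row-name approach, as a sketch with made-up coordinates (the X and Y values and the name leaf_labels are assumptions chosen for illustration): once the row names are set before calling dist(), hclust() picks them up as labels, and indexing them by clusters$order gives the left-to-right order of the leaves in the plot.

```r
# Hypothetical data; character ids are used as row names.
dt <- data.frame(id = c("A", "B", "C", "D"),
                 X = c(1, 2, 6, 7),
                 Y = c(1, 2, 6, 8),
                 stringsAsFactors = FALSE)
rownames(dt) <- dt$id

# dist() keeps the row names, and hclust() stores them as labels.
clusters <- hclust(dist(dt[, -1]))
print(clusters$labels)  # "A" "B" "C" "D"

# Labels in the order the leaves are drawn by plot(clusters).
leaf_labels <- clusters$labels[clusters$order]
print(leaf_labels)
```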

Names of variables inside the 'for loop' [duplicate]

I am trying to create a function that allows the conversion of selected columns of a data frame to categorical data type (factor) before running a regression analysis.
The question is how to select a particular column from a data frame using a string (character).
Example:
strColumnNames <- "Admit,Rank"
strDelimiter <- ","
strSplittedColumnNames <- strsplit(strColumnNames, strDelimiter)
for( strColName in strSplittedColumnNames[[1]] ){
  dfData$as.name(strColName) <- factor(dfData$get(strColName))
}
Tried:
dfData$as.name()
dfData$get(as.name())
dfData$get()
Error Msg:
Error: attempt to apply non-function
Any help would be greatly appreciated! Thank you!!!
You need to change
dfData$as.name(strColName) <- factor(dfData$get(strColName))
to
dfData[[strColName]] <- factor(dfData[[strColName]])
You may read ?"[[" for more.
In your case, column names are generated programmatically, so [[]] is the only way to go. Maybe this example will be clear enough to illustrate the problem with $:
dat <- data.frame(x = 1:5, y = 2:6)
z <- "x"
dat$z
# NULL
dat[[z]]
# [1] 1 2 3 4 5
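Applied to the loop from the question, the fix might look like the sketch below. The contents of dfData are invented here (the question does not show them), with a third column Gre added only to show that unlisted columns are left untouched.

```r
# Hypothetical stand-in for the asker's data frame.
dfData <- data.frame(Admit = c(0, 1, 1),
                     Rank  = c(1, 2, 3),
                     Gre   = c(380, 660, 800))

strColumnNames <- "Admit,Rank"

# strsplit() returns a list; [[1]] extracts the character vector of names.
for (strColName in strsplit(strColumnNames, ",", fixed = TRUE)[[1]]) {
  # [[ ]] accepts the column name held in a variable; $ does not.
  dfData[[strColName]] <- factor(dfData[[strColName]])
}

col_classes <- sapply(dfData, class)
print(col_classes)
```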
Regarding the other answer
apply definitely does not work here, because the function you apply is as.factor or factor. apply always works on a matrix (if you feed it a data frame, it converts it to a matrix first) and in this case returns a character matrix, and a matrix cannot hold factor data. Consider this example:
x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4, stringsAsFactors = FALSE)
x[, 1:2] <- apply(x[, 1:2], 2, as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: chr "a" "b" "c" "d"
# $ x2: chr "A" "B" "C" "D"
# $ x3: int 1 2 3 4
Note that you still have character variables rather than factors. As I said, we have to use lapply:
x[1:2] <- lapply(x[1:2], as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# $ x2: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ x3: int 1 2 3 4
Now we see the factor class in x1 and x2.
Using apply for a data frame is never a good idea. If you read the source code of apply:
dl <- length(dim(X))
if (is.object(X))
X <- if (dl == 2L)
as.matrix(X)
else as.array(X)
You see that a data frame (which has 2 dimensions) will be coerced to a matrix first. This is very slow. If your data frame columns have different classes, the resulting matrix will have only one type, and the result of that coercion may not be what you expect.
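The coercion is easy to observe on a small mixed data frame. In this sketch (the object names are illustrative), apply() sees only the character matrix, while sapply() works on the original columns:

```r
x <- data.frame(x1 = c("a", "b"), x3 = 1:2, stringsAsFactors = FALSE)

# as.matrix() on mixed columns produces a character matrix.
matrix_type <- typeof(as.matrix(x))
print(matrix_type)     # "character"

# apply() therefore reports every column as character ...
classes_apply <- apply(x, 2, class)
print(classes_apply)

# ... while sapply() iterates over the real columns and keeps their types.
classes_sapply <- sapply(x, class)
print(classes_sapply)
```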
Moreover, apply is written in R rather than C, with an ordinary for loop:
for (i in 1L:d2) {
tmp <- forceAndCall(1, FUN, newX[, i], ...)
if (!is.null(tmp))
ans[[i]] <- tmp
so it is no better than an explicit for loop you write yourself.
I would use a different method.
Create a vector of column names you want to change to factors:
factorCols <- c("Admit", "Rank")
Then extract these columns by index:
myCols <- which(names(dfData) %in% factorCols)
Finally, use lapply to change these columns to factors:
dfData[myCols] <- lapply(dfData[myCols], as.factor)
