How do I stop merge from converting characters into factors?

E.g.
chr <- c("a", "b", "c")
intgr <- c(1, 2, 3)
str(chr)
str(base::merge(chr,intgr, stringsAsFactors = FALSE))
gives:
> str(base::merge(chr,intgr, stringsAsFactors = FALSE))
'data.frame': 9 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
$ y: num 1 1 1 2 2 2 3 3 3
I originally thought it had something to do with how merge coerces its arguments into data frames. However, I expected that adding the argument stringsAsFactors = FALSE would override the default coercion of char -> factor, yet it does not.
EDIT: Doing the following gives me expected behaviour:
options(stringsAsFactors = FALSE)
str(base::merge(chr,intgr))
that is:
> str(base::merge(chr,intgr))
'data.frame': 9 obs. of 2 variables:
$ x: chr "a" "b" "c" "a" ...
$ y: num 1 1 1 2 2 2 3 3 3
but this is not ideal as it changes the global stringsAsFactors setting.

You can accomplish this particular "merge" using expand.grid(), since you're really just taking the cartesian product. This allows you to pass the stringsAsFactors argument:
sapply(expand.grid(x=chr,y=intgr,stringsAsFactors=F),class);
## x y
## "character" "numeric"
Here's a way of working around this limitation of merge():
sapply(merge(data.frame(x=chr,stringsAsFactors=F),intgr),class);
## x y
## "character" "numeric"
I would argue that it never makes sense to pass an atomic vector to merge(), since it is only really designed for merging data.frames.
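A minimal sketch of that last point, assuming both inputs are wrapped in data frames before merging (the column names x and y are chosen here to match the output above). With no key columns, passing by = NULL makes merge() return the Cartesian product, and the character column stays character because it was built with stringsAsFactors = FALSE:

```r
chr <- c("a", "b", "c")
intgr <- c(1, 2, 3)

# Wrap each vector in a data frame so merge() does not have to coerce them.
left  <- data.frame(x = chr, stringsAsFactors = FALSE)
right <- data.frame(y = intgr)

# by = NULL (no key columns) yields the Cartesian product of the rows.
res <- merge(left, right, by = NULL)
sapply(res, class)
```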

We can use CJ from data.table as well:
library(data.table)
str(CJ(chr, intgr))
#Classes ‘data.table’ and 'data.frame': 9 obs. of 2 variables:
#$ V1: chr "a" "a" "a" "b" ...
#$ V2: num 1 2 3 1 2 3 1 2 3

Related

Why does R convert numbers and characters to factors when coercing to data frame?

Recently I have come across a problem where my data has been converted to factors.
This is a large nuisance, as it's not (always) easily picked up on.
I am aware that I can convert them back with solutions such as as.numeric(paste(x)) or as.character(paste(x)), but that seems really unnecessary.
Example code:
nums <- c(1,2,3,4,5)
chars <- c("A","B","C,","D","E")
str(nums)
#> num [1:5] 1 2 3 4 5
str(chars)
#> chr [1:5] "A" "B" "C," "D" "E"
df <- as.data.frame(cbind(a = nums, b = chars))
str(df)
#> 'data.frame': 5 obs. of 2 variables:
#> $ a: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
#> $ b: Factor w/ 5 levels "A","B","C,","D",..: 1 2 3 4 5
1) Don't use cbind: it converts the data to a matrix, and a matrix can hold data of only one type, so the numbers are converted to characters.
2) Use data.frame, because as.data.frame(a = nums, b = chars) returns an error.
3) Use stringsAsFactors = FALSE, because in data.frame the default value of stringsAsFactors is TRUE, which converts characters to factors. The numbers also become factors because in 1) they were changed to characters.
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
str(df)
#'data.frame': 5 obs. of 2 variables:
# $ a: num 1 2 3 4 5
# $ b: chr "A" "B" "C," "D" ...
EDIT: As of R 4.0.0, the default value of stringsAsFactors has changed to FALSE.
This should no longer happen if you have updated R: data frames don't automatically turn chr to fct. In a way, data frames are now more similar to tibbles.
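If the data has already been turned into factors, a common pitfall is calling as.numeric() directly on a factor, which returns the internal level codes rather than the original values. A small sketch, assuming a factor f that holds numbers stored as text (f is an illustrative name, not from the question):

```r
f <- factor(c("10", "20", "10"))

codes  <- as.numeric(f)                 # internal level codes, not the values
values <- as.numeric(as.character(f))   # the original numbers

# Passing stringsAsFactors = FALSE explicitly behaves the same way
# whichever default the installed R version uses.
df <- data.frame(a = c(1, 2, 3), b = c("A", "B", "C"),
                 stringsAsFactors = FALSE)
sapply(df, class)
```

Going through as.character() first is what recovers the values; the explicit argument simply avoids the conversion in the first place.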

An Elegant way to change columns type in dataframe in R

I have a data.frame which contains columns of different types, such as integer, character, numeric, and factor.
I need to convert the integer columns to numeric for use in the next step of analysis.
Example: test.data includes 4 columns (though there are thousands in my real data set): age, gender, work.years, and name; age and work.years are integer, gender is factor, and name is character. What I need to do is change age and work.years into a numeric type. And I wrote one piece of code to do this.
test.data[sapply(test.data, is.integer)] <-lapply(test.data[sapply(test.data, is.integer)], as.numeric)
It works, but it does not look good enough. So I am wondering whether there is a more elegant way to do this. Any creative method will be appreciated.
I think elegant code is sometimes subjective. For me, this is elegant but it may be less efficient compared to the OP's code. However, as the question is about elegant code, this can be used.
test.data[] <- lapply(test.data, function(x) if(is.integer(x)) as.numeric(x) else x)
Also, another elegant option is dplyr
library(dplyr)
library(magrittr)
test.data %<>%
mutate_each(funs(if(is.integer(.)) as.numeric(.) else .))
Now very elegant in dplyr (with magrittr %<>% operator)
test.data %<>% mutate_if(is.integer,as.numeric)
It's tasks like this that I think are best accomplished with explicit loops. You don't buy anything here by replacing a straightforward for-loop with the hidden loop of a function like lapply(). Example:
## generate data
set.seed(1L);
N <- 3L; test.data <- data.frame(age=sample(20:90,N,T),gender=factor(sample(c('M','F'),N,T)),work.years=sample(1:5,N,T),name=sample(letters,N,T),stringsAsFactors=F);
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : int 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: int 5 4 4
## $ name : chr "b" "f" "e"
## solution
for (cn in names(test.data)[sapply(test.data,is.integer)])
    test.data[[cn]] <- as.double(test.data[[cn]]);
## result
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : num 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: num 5 4 4
## $ name : chr "b" "f" "e"
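If the same kind of conversion is needed for other type pairs, the pattern can be wrapped in a small function. This is only a sketch: convert_cols() is a hypothetical helper name, not part of base R or dplyr, and the data frame below is typed in by hand to mirror the example above.

```r
# Hypothetical helper: apply `converter` to every column for which
# `predicate` returns TRUE, leaving all other columns untouched.
convert_cols <- function(df, predicate, converter) {
  idx <- vapply(df, predicate, logical(1))
  df[idx] <- lapply(df[idx], converter)
  df
}

test.data <- data.frame(
  age = c(38L, 46L, 60L),
  gender = factor(c("F", "M", "F")),
  work.years = c(5L, 4L, 4L),
  name = c("b", "f", "e"),
  stringsAsFactors = FALSE
)

result <- convert_cols(test.data, is.integer, as.numeric)
sapply(result, class)
```

vapply() is used instead of sapply() so that the selector is always a logical vector, even for a data frame with zero columns.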

R - How to add columns to a dataset incrementally using a loop?

I'm trying to get the error rates for a Naive Bayes classifier by adding in each variable incrementally. For example, I have 25 variables in my dataset, and I want to get the error rate of the model as I add in one variable at a time. That is, it would output the error rate of the model with the first 2 columns, then with the first 3 columns, then with the first 4 columns, and so on up to the last column.
Here is the pseudocode of what I'm trying to achieve
START
IMPORT DATASET WITH ALL VARIABLES
num_variables = num_dataset_cols
i= 1
WHILE (i <= num_variables)
{
CREATE NEW DATASET WITH x COLUMNs
BUILD THE MODEL
GET THE ERROR RATE
ADD IN NEXT COLUMN
i = i + 1
}
Here is a reproducible question. Obviously you can't build a NB classifier with this data, but that's not my problem. My problem is adding in the columns one by one. So far, the only way I can do it is by overwriting each column. For a NB classifier, the first column is the class node, so there must be at least 2 columns starting off in order for it to run.
#REPRODUCIBLE EXAMPLE
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
num_variables <- ncol(dataset)
i <- 1
while (i <= num_variables)
{
data <- dataset[c(1, i+1)]
str(data)
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
You should be able to see from the str(data) that each time the column is overwritten. Does anyone know how I could go about adding each column without overwriting the previous one? Someone suggested an array to me, but I'm not too familiar with arrays in R. Would this work?
I think this is what you want.
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
dataset
num_variables <- ncol(dataset)
num_variables
i <- 1
while (i <= num_variables) {
data <- dataset[, 1:i]
print(str(data))
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
NULL
'data.frame': 5 obs. of 2 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
NULL
'data.frame': 5 obs. of 3 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
NULL
'data.frame': 5 obs. of 4 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
NULL
'data.frame': 5 obs. of 5 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
$ col5: Factor w/ 5 levels "10","100","15",..: 1 3 5 2 4
NULL
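One detail in the loop above: dataset[, 1:i] drops to a plain vector when i is 1, which is why the first str() output is a factor rather than a data frame. Below is a sketch of a variant that keeps a data frame at every step by using drop = FALSE, and that starts at two columns because the question says the classifier needs the class column plus at least one predictor. The model-building step is left as a comment since the question does not show it, and widths is just an illustrative name for recording what each iteration saw.

```r
dataset <- data.frame(
  col1 = c("A", "B", "C", "D", "E"),
  col2 = c(1, 2, 3, 4, 5),
  col3 = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  col4 = c("n", "y", "y", "n", "y"),
  col5 = c("10", "15", "50", "100", "20")
)

widths <- integer(0)
for (i in 2:ncol(dataset)) {
  # Columns 1..i; drop = FALSE guarantees a data frame is returned.
  data <- dataset[, seq_len(i), drop = FALSE]
  # BUILD MODEL AND GET VALIDATION ERROR here
  widths <- c(widths, ncol(data))
}
widths
```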
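You can use the append function after defining an output variable. Note that append returns a new object rather than modifying output in place, so assign the result back:
data <- dataset[c(1, i+1)]
output <- append(output, data)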
str(data)
Using the "assign" function within a while loop can be helpful for issues like this. You don't show the model syntax, but something like this should work:
dataset$errorrate <- [whatever makes this calculation, assuming it is vectorized]
name <- paste0("errorrate", i)
assign(name, dataset$errorrate)
...
This should leave you with i variables, each containing the error estimate for one model run. If you are looking for one parameter estimate per model, you can assign each estimate a unique name in the global environment using the process above and then rbind them together after the loop has finished.
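As an alternative to creating one variable per iteration with assign(), the estimates can be collected in a single named vector. This is only a sketch: fit_and_score() is a hypothetical placeholder that returns the number of columns it was given, standing in for the real model fit and error calculation, which the question does not show. The small dataset is likewise made up for illustration.

```r
dataset <- data.frame(col1 = c("A", "B", "C"), col2 = c(1, 2, 3),
                      col3 = c(TRUE, FALSE, TRUE))

# Hypothetical stand-in for "build the model and return its error rate".
fit_and_score <- function(data) ncol(data)

sizes <- 2:ncol(dataset)
errorrates <- vapply(sizes, function(i) {
  as.numeric(fit_and_score(dataset[, seq_len(i), drop = FALSE]))
}, numeric(1))
names(errorrates) <- paste0("errorrate", sizes)
errorrates
```

Keeping the results in one vector avoids filling the global environment with numbered variables and makes the later rbind step unnecessary.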

Removing empty cells when using as.character in R

I have a table that looks like this:
A B C
AB ABC CBS
AB ABC
ADS
BBB
I want to use the columns as character vectors, so I used this:
A = as.character(table$A)
This results in c("AB", "AB", ""), but my goal was c("AB", "AB"), i.e. without the empty cell "". To get rid of the empty cell I used A = A[!A == ""], which gives the result I want, but there must be a more elegant way of accomplishing the same goal.
My questions are: 1) Is there a better way of removing empty characters/cells?
Or, more generally: 2) Is there a way to transform the 3 columns (A, B, C) into character vectors A, B, C without the empty cells?
Thanks
'data.frame': 3 obs. of 3 variables:
$ A: Factor w/ 2 levels "","AB": 2 2 1
$ B: Factor w/ 3 levels "","ABC","ADS": 2 1 3
$ C: Factor w/ 3 levels "ABC","BBB","CBS": 3 1 2
Try specifying the argument na.strings during data import. Also, instead of using read.csv(), you could write read.csv2() which uses sep = ";" by default.
# Import data
data <- read.csv2("/path/to/data.csv", header = TRUE,
na.strings = "", stringsAsFactors = FALSE)
str(data)
'data.frame': 4 obs. of 3 variables:
$ A: chr "AB" "AB" NA NA
$ B: chr "ABC" NA "ADS" NA
$ C: chr "CBS" "ABC" NA "BBB"
# Exclude NAs
as.character(na.exclude(data$A))
[1] "AB" "AB"
If you prefer not to read your data set again, you can use:
# not in ('') or ("")
A <- table$A[!table$A %in% '']
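For the second, more general question, one possible sketch applies the same filter to every column at once. The data frame below is rebuilt from the str() output in the question (tbl is used instead of table to avoid masking the base function), and the result is a list rather than a data frame because the columns end up with different lengths.

```r
# Data rebuilt from the str() output in the question.
tbl <- data.frame(
  A = c("AB", "AB", ""),
  B = c("ABC", "", "ADS"),
  C = c("CBS", "ABC", "BBB"),
  stringsAsFactors = TRUE
)

# Convert every column to character and keep only non-empty, non-NA entries.
cols <- lapply(tbl, function(col) {
  col <- as.character(col)
  col[!is.na(col) & nzchar(col)]
})
cols$A
cols$B
```

The is.na() check is included so the same code also works after importing with na.strings = "", where the empty cells arrive as NA instead of "".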

Creating a data frame from two vectors using cbind

Consider the following R code.
> x = cbind(c(10, 20), c("[]", "[]"), c("[[1,2]]","[[1,3]]"))
> x
[,1] [,2] [,3]
[1,] "10" "[]" "[[1,2]]"
[2,] "20" "[]" "[[1,3]]"
Similarly
> x = rbind(c(10, "[]", "[[1,2]]"), c(20, "[]", "[[1,3]]"))
> x
[,1] [,2] [,3]
[1,] "10" "[]" "[[1,2]]"
[2,] "20" "[]" "[[1,3]]"
Now, I don't want the integers 10 and 20 to be converted to strings.
How can I perform this operation without any such conversion? I would of
course also like to know why this conversion happens. I looked at
the cbind help and also tried Googling, but had no luck finding a
solution. I also believe that in some cases R converts strings to
factors, and I don't want that to happen either, though it doesn't seem
to be happening here.
Vectors and matrices can only be of a single type and cbind and rbind on vectors will give matrices. In these cases, the numeric values will be promoted to character values since that type will hold all the values.
Note that in your rbind example, the promotion happens within the c call:
> c(10, "[]", "[[1,2]]")
[1] "10" "[]" "[[1,2]]"
If you want a rectangular structure where the columns can be different types, you want a data.frame. Any of the following should get you what you want:
> x = data.frame(v1=c(10, 20), v2=c("[]", "[]"), v3=c("[[1,2]]","[[1,3]]"))
> x
v1 v2 v3
1 10 [] [[1,2]]
2 20 [] [[1,3]]
> str(x)
'data.frame': 2 obs. of 3 variables:
$ v1: num 10 20
$ v2: Factor w/ 1 level "[]": 1 1
$ v3: Factor w/ 2 levels "[[1,2]]","[[1,3]]": 1 2
or (using specifically the data.frame version of cbind)
> x = cbind.data.frame(c(10, 20), c("[]", "[]"), c("[[1,2]]","[[1,3]]"))
> x
c(10, 20) c("[]", "[]") c("[[1,2]]", "[[1,3]]")
1 10 [] [[1,2]]
2 20 [] [[1,3]]
> str(x)
'data.frame': 2 obs. of 3 variables:
$ c(10, 20) : num 10 20
$ c("[]", "[]") : Factor w/ 1 level "[]": 1 1
$ c("[[1,2]]", "[[1,3]]"): Factor w/ 2 levels "[[1,2]]","[[1,3]]": 1 2
or (using cbind, but making the first a data.frame so that it combines as data.frames do):
> x = cbind(data.frame(c(10, 20)), c("[]", "[]"), c("[[1,2]]","[[1,3]]"))
> x
c.10..20. c("[]", "[]") c("[[1,2]]", "[[1,3]]")
1 10 [] [[1,2]]
2 20 [] [[1,3]]
> str(x)
'data.frame': 2 obs. of 3 variables:
$ c.10..20. : num 10 20
$ c("[]", "[]") : Factor w/ 1 level "[]": 1 1
$ c("[[1,2]]", "[[1,3]]"): Factor w/ 2 levels "[[1,2]]","[[1,3]]": 1 2
Using data.frame instead of cbind should be helpful
x <- data.frame(col1=c(10, 20), col2=c("[]", "[]"), col3=c("[[1,2]]","[[1,3]]"))
x
col1 col2 col3
1 10 [] [[1,2]]
2 20 [] [[1,3]]
sapply(x, class) # looking into x to see the class of each element
col1 col2 col3
"numeric" "factor" "factor"
As you can see, the elements of col1 are numeric, as you wish.
A data.frame can have variables of different classes (numeric, factor and character), but a matrix cannot: once you put a character element into a matrix, all the other elements are converted to character, no matter what class they were before.
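The question also asked to avoid the conversion of strings to factors, which the outputs above still show. A short sketch that puts both points together, assuming the same values as in the question (mixed and m are illustrative names):

```r
# c() promotes mixed inputs to a single common type.
mixed <- c(10, "[]")
class(mixed)

# cbind() on plain vectors builds a matrix, so everything becomes character.
m <- cbind(c(10, 20), c("[]", "[]"))
typeof(m)

# A data frame keeps one type per column; stringsAsFactors = FALSE
# keeps the strings as character instead of factor.
x <- data.frame(v1 = c(10, 20), v2 = c("[]", "[]"),
                v3 = c("[[1,2]]", "[[1,3]]"), stringsAsFactors = FALSE)
sapply(x, class)
```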
