Create a character variable with data.frame function [duplicate] - r

Using the data.frame function in R, I am creating an example dataset. However, the vectors with strings are converted to a factor column.
How can I make vectors with strings (e.g. var1) become character column in my data set?
Current Code
df = data.frame(var1 = c("1","2","3","4"),
var2 = c(1,2,3,4))
Resulting Output
As shown below, var1 is a factor. I need var1 to have the chr class.
> str(df)
'data.frame': 4 obs. of 2 variables:
$ var1 : Factor w/ 4 levels "1","2","3","4": 1 2 3 4
$ var2 : num 1 2 3 4
Trouble-shooting
Based on this post, I tried adding as.character, but var1 remains a factor.
df = data.frame(var1 = as.character(c("1","2","3","4")),
var2 = c(1,2,3,4))

stringsAsFactors is your friend. Namely:
df = data.frame(var1 = c("1","2","3","4"), var2 = c(1,2,3,4), stringsAsFactors = FALSE)
yielding:
> str(df)
'data.frame': 4 obs. of 2 variables:
$ var1: chr "1" "2" "3" "4"
$ var2: num 1 2 3 4

Based on the comments, adding the argument stringsAsFactors=FALSE will create character variables instead of factor variables.
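As a minimal sketch of the two routes discussed above, the first part sets stringsAsFactors explicitly at creation time, and the second converts a column that is already a factor. The names col_classes and df_old are illustrative only and do not come from the original post.

```r
# Set stringsAsFactors explicitly so the result does not depend on the
# default of the installed R version.
df <- data.frame(var1 = c("1", "2", "3", "4"),
                 var2 = c(1, 2, 3, 4),
                 stringsAsFactors = FALSE)

# Inspect the class of every column at once.
col_classes <- sapply(df, class)
print(col_classes)

# If a data frame already holds a factor column, convert it afterwards.
df_old <- data.frame(var1 = factor(c("1", "2", "3", "4")),
                     var2 = c(1, 2, 3, 4))
df_old$var1 <- as.character(df_old$var1)
str(df_old)
```

Checking the classes with sapply(df, class) right after construction is a cheap way to notice an unintended factor column early.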

Related

Why does R convert numbers and characters to factors when coercing to data frame?

Recently I have come across a problem where my data has been converted to factors.
This is a large nuisance, as it's not (always) easily picked up on.
I am aware that I can convert them back with solutions such as as.character(paste(x)) or as.numeric(paste(x)), but that seems really unnecessary.
Example code:
nums <- c(1,2,3,4,5)
chars <- c("A","B","C,","D","E")
str(nums)
#> num [1:5] 1 2 3 4 5
str(chars)
#> chr [1:5] "A" "B" "C," "D" "E"
df <- as.data.frame(cbind(a = nums, b = chars))
str(df)
#> 'data.frame': 5 obs. of 2 variables:
#> $ a: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
#> $ b: Factor w/ 5 levels "A","B","C,","D",..: 1 2 3 4 5
1) Don't use cbind: it converts the data to a matrix, and a matrix can hold data of only one type, so it converts the numbers to characters.
2) Use data.frame rather than as.data.frame, because as.data.frame(a = nums, b = chars) returns an error.
3) Use stringsAsFactors = FALSE, because in older versions of R the default value of stringsAsFactors in data.frame is TRUE, which converts characters to factors. The numbers also change to factors because in 1) they have been changed to characters.
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
str(df)
#'data.frame': 5 obs. of 2 variables:
# $ a: num 1 2 3 4 5
# $ b: chr "A" "B" "C," "D" ...
EDIT: As of R 4.0.0, the default value of stringsAsFactors has changed to FALSE.
This should no longer happen if you have updated R: data frames no longer automatically turn chr columns into factors. In that respect, data frames now behave more like tibbles.
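The coercion described in the answer can be checked directly. The following is a small sketch using the vectors from the question; the names m and default_is_false are illustrative and not part of the original answer.

```r
nums  <- c(1, 2, 3, 4, 5)
chars <- c("A", "B", "C,", "D", "E")

# cbind() on atomic vectors returns a matrix, and a matrix has a single
# type, so the numbers become character before any data frame exists.
m <- cbind(a = nums, b = chars)
print(is.matrix(m))   # TRUE
print(typeof(m))      # "character"

# data.frame() keeps each column's own type.
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
print(sapply(df, class))

# The default of stringsAsFactors depends on the installed R version.
default_is_false <- getRversion() >= "4.0.0"
print(default_is_false)
```

Passing stringsAsFactors explicitly keeps a script's behaviour the same on both sides of the 4.0.0 change.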

Issue with user defined function in R

I am trying to change the data type of my variables in data frame to 'factor' if they are 'character'. I have tried to replicate the problem using sample data as below
a <- c("AB","BC","AB","BC","AB","BC")
b <- c(12,23,34,45,54,65)
df <- data.frame(a,b)
str(df)
'data.frame': 6 obs. of 2 variables:
$ a: chr "AB" "BC" "AB" "BC" ...
$ b: num 12 23 34 45 54 65
I wrote the below function to achieve that
abc <- function(x) {
for(i in names(x)){
if(is.character(x[[i]])) {
x[[i]] <- as.factor(x[[i]])
}
}
}
The function executes without error if I pass the data frame (df), but it still doesn't change the 'character' column to 'factor'.
abc(df)
str(df)
'data.frame': 6 obs. of 2 variables:
$ a: chr "AB" "BC" "AB" "BC" ...
$ b: num 12 23 34 45 54 65
NOTE: It works perfectly with for loop and if condition. When I tried to generalize it by writing a function around it, there's a problem.
Please help. What am I missing ?
Besides the comment from @Roland, you should make use of R's nice indexing possibilities and learn about the *apply family. With that you can rewrite your code to
change_to_factor <- function(df_in) {
chr_ind <- vapply(df_in, is.character, logical(1))
df_in[, chr_ind] <- lapply(df_in[, chr_ind, drop = FALSE], as.factor)
df_in
}
Explanation
vapply loops over all elements of a list, applies a function to each element and returns a value of the given type (here a boolean logical(1)). Since in R data frames are in fact lists where each (list) element is required to be of the same length, you can conveniently loop over all the columns of the data frame and apply the function is.character to each column. vapply then returns a boolean (logical) vector with TRUE/FALSE values depending on whether the column was a character column or not.
You can then use this boolean vector to subset your data frame to look only at columns which are character columns.
lapply is yet another member of the *apply family; it loops through list elements and returns a list. We now loop over the character columns, apply as.factor to them and get back a list, which we conveniently store in the original positions in the data frame.
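To make the three steps of the explanation concrete, here is a sketch that runs them one at a time on a small data frame so the intermediate values can be inspected. The sample data and the name converted are illustrative assumptions.

```r
df_in <- data.frame(a = c("AB", "BC", "AB"),
                    b = c(12, 23, 34),
                    stringsAsFactors = FALSE)

# Step 1: one TRUE/FALSE per column, TRUE where the column is character.
chr_ind <- vapply(df_in, is.character, logical(1))
print(chr_ind)

# Step 2: convert only the flagged columns. drop = FALSE keeps a data
# frame even when a single column is selected.
converted <- lapply(df_in[, chr_ind, drop = FALSE], as.factor)

# Step 3: write the converted columns back into their original positions.
df_in[, chr_ind] <- converted
str(df_in)
```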
By the way, if you look at str(df) in an R version before 4.0.0, you will see that column a is already a factor. This is because data.frame there automatically converts character columns to factors. To avoid that you need to pass stringsAsFactors = FALSE to data.frame:
a <- c("AB", "BC", "AB", "BC", "AB", "BC")
b <- c(12, 23, 34, 45, 54, 65)
df <- data.frame(a, b)
str(df) # column a is a factor (R < 4.0.0)
# 'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "AB","BC": 1 2 1 2 1 2
# $ b: num 12 23 34 45 54 65
str(df2 <- data.frame(a, b, stringsAsFactors = FALSE))
# 'data.frame': 6 obs. of 2 variables:
# $ a: chr "AB" "BC" "AB" "BC" ...
# $ b: num 12 23 34 45 54 65
str(change_to_factor(df2))
# 'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "AB","BC": 1 2 1 2 1 2
# $ b: num 12 23 34 45 54 65
It may also be worth learning the tidyverse syntax, with which you can simply do
library(tidyverse)
df2 %>%
mutate_if(is.character, as.factor) %>%
str()
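For completeness, the loop from the question can also be kept, with two changes. R functions work on a copy of their argument, and a for loop evaluates to NULL, so the function has to return the modified copy and the caller has to assign the result. The name abc_fixed is a hypothetical choice for this sketch.

```r
abc_fixed <- function(x) {
  for (i in names(x)) {
    if (is.character(x[[i]])) {
      x[[i]] <- as.factor(x[[i]])
    }
  }
  x  # return the modified copy; without this the function returns NULL
}

df <- data.frame(a = c("AB", "BC", "AB", "BC", "AB", "BC"),
                 b = c(12, 23, 34, 45, 54, 65),
                 stringsAsFactors = FALSE)

# Calling abc_fixed(df) alone leaves df unchanged; assign the result.
df <- abc_fixed(df)
str(df)
```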

as.numeric on subset of dataframe [duplicate]

How can I convert columns 1 and 2 to numeric? I tried this, which seemed obvious to me, but it did not work.
Thanks
df <- data.frame(v1 = c(1,2,'x',4,5,6), v2 = c(1,2,'x',4,5,6), v3 = c(1,2,3,4,5,6), stringsAsFactors = FALSE)
as.numeric(df[,1:2])
We can use lapply to loop over the columns of interest and convert to numeric
df[1:2] <- lapply(df[1:2], as.numeric)
I suggest this approach, using unlist. I particularly like the "one line" approach, when possible. And for more options and references, I strongly suggest this beautiful post.
df <- data.frame(v1 = c(1,2,'x',4,5,6), v2 = c(1,2,'x',4,5,6),
v3 = c(1,2,3,4,5,6), stringsAsFactors = FALSE)
df[, c("v1","v2")] <- as.numeric(as.character(unlist(df[, c("v1","v2")])))
Warning message:
NAs introduced by coercion
str(df)
'data.frame': 6 obs. of 3 variables:
$ v1: num 1 2 NA 4 5 6
$ v2: num 1 2 NA 4 5 6
$ v3: num 1 2 3 4 5 6
You can use matrix indexing without any hidden loops:
df[, 1:2] <- as.numeric(as.matrix(df[, 1:2]))
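The sketch below shows why the original attempt fails and how the column-wise conversion behaves on the non-numeric entry. It reuses the data from the question; the names res and cols are illustrative, and the as.character step is an added precaution for the case where the columns are factors.

```r
df <- data.frame(v1 = c(1, 2, "x", 4, 5, 6),
                 v2 = c(1, 2, "x", 4, 5, 6),
                 v3 = c(1, 2, 3, 4, 5, 6),
                 stringsAsFactors = FALSE)

# A data frame subset is a list of columns, which as.numeric() cannot
# coerce, so the original attempt raises an error.
res <- try(as.numeric(df[, 1:2]), silent = TRUE)
print(inherits(res, "try-error"))

# Convert column by column. suppressWarnings() hides the expected
# "NAs introduced by coercion" warning caused by the "x" entries.
cols <- c("v1", "v2")
df[cols] <- lapply(df[cols], function(col) {
  suppressWarnings(as.numeric(as.character(col)))
})

# Rows that could not be converted are now NA.
print(which(is.na(df$v1)))
str(df)
```

Silencing the warning is only reasonable when the NA values are expected; otherwise it is safer to let the warning through and inspect the affected rows.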

Modifying an R factor?

Say I have a data.frame object in R where all the character columns have been transformed to factors. I then need to "modify" the value associated with a certain row in the data frame, but keep it encoded as a factor. I first need to extract a single row, so here is what I'm doing. Here is a reproducible example:
a = c("ab", "ba", "ca")
b = c("ab", "dd", "da")
c = c("cd", "fa", "op")
data = data.frame(a,b,c, row.names = c("row1", "row2", "row3"))
colnames(data) <- c("col1", "col2", "col3")
data[,"col1"] <- as.factor(data[,"col1"])
newdat <- data["row1",]
newdat["col1"] <- "ca"
When I assign "ca" to newdat["col1"], the Factor object associated with that column was overwritten by the string "ca". This is not the intended behavior. Instead, I want to modify the numeric value that encodes which level is present in newdat. So I want to change the contents of newdat["col1"] as follows:
Before:
Factor object, levels = c("ab", "ba", "ca"): 1 (the value it had)
After:
Factor object, levels = c("ab", "ba", "ca"): 3 (the value associated with the level "ca")
How can I accomplish this?
What you are doing is equivalent to:
x = factor(letters[1:4]) #factor
x1 = x[1] #factor; subset of 'x'
x1 = "c" #assign new value
i.e. you assign a new object to an existing symbol. In your example, you simply replace the "factor" of newdat["col1"] with "ca".
Instead, to subassign to a factor (subassigning with a non-level results in NA), you could use
x = factor(letters[1:4])
x1 = x[1]
x1[1] = "c" #factor; subset of 'x' with the 3rd level
And in your example (I use local to avoid changing newdat again and again for the below):
str(newdat)
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 1
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat["col1"] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: chr "ca"
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat[1, "col1"] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 3
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat[["col1"]][1] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 3
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
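The same behaviour can be seen on a bare factor. This sketch covers subassigning an existing level, the NA that results from a non-level, and one way to add a level first. The value "zz" and the names x2 and was_na are illustrative assumptions.

```r
x <- factor(c("ab", "ba", "ca"))

# Subsetting keeps all levels, so an existing level can be subassigned.
x1 <- x[1]
x1[1] <- "ca"
print(as.integer(x1))  # 3, the code of level "ca"

# Subassigning a value that is not a level yields NA (with a warning).
x2 <- x[1]
suppressWarnings(x2[1] <- "zz")
was_na <- is.na(x2)
print(was_na)

# To store a new value, extend the levels first, then subassign.
levels(x2) <- c(levels(x2), "zz")
x2[1] <- "zz"
print(x2)
```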

Removing empty cells when using as.character in R

I have a table that looks like this
A    B    C
AB   ABC  CBS
AB        ABC
     ADS
          BBB
I want to use the columns as character vectors, so I used this
A= as.character(table$A)
This results in c("AB", "AB", ""); my goal was c("AB", "AB"), i.e. without the empty cell "". To get rid of the empty cell I used A=A[!A==""], which gives the result I want, but there must be a more elegant way of accomplishing the same goal.
My questions are: 1) is there a better way of removing empty characters/cells?
Or, more generally, 2) is there a way to transform the 3 columns (A, B, C) into character vectors A, B, C without the empty cells?
Thanks
'data.frame': 3 obs. of 3 variables:
$ A: Factor w/ 2 levels "","AB": 2 2 1
$ B: Factor w/ 3 levels "","ABC","ADS": 2 1 3
$ C: Factor w/ 3 levels "ABC","BBB","CBS": 3 1 2
Try specifying the argument na.strings during data import. Also, instead of using read.csv(), you could write read.csv2() which uses sep = ";" by default.
# Import data
data <- read.csv2("/path/to/data.csv", header = TRUE,
na.strings = "", stringsAsFactors = FALSE)
str(data)
'data.frame': 4 obs. of 3 variables:
$ A: chr "AB" "AB" NA NA
$ B: chr "ABC" NA "ADS" NA
$ C: chr "CBS" "ABC" NA "BBB"
# Exclude NAs
as.character(na.exclude(data$A))
[1] "AB" "AB"
If you prefer not to read your data set again, you can use:
# not in ('') or ("")
A <- table$A[!table$A %in% '']
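For the second, more general question, one possible sketch is to loop over all columns at once and keep the non-empty strings with nzchar. The data frame tab below is a hypothetical reconstruction of the table, following the four-row reading used in the answer above, and non_empty is an illustrative name.

```r
# Hypothetical reconstruction of the table; empty cells are "".
tab <- data.frame(A = c("AB", "AB", "", ""),
                  B = c("ABC", "", "ADS", ""),
                  C = c("CBS", "ABC", "", "BBB"),
                  stringsAsFactors = TRUE)

# Convert every column to character and keep only the non-empty entries.
# nzchar() is TRUE for strings with at least one character.
non_empty <- lapply(tab, function(col) {
  col <- as.character(col)
  col[nzchar(col)]
})

print(non_empty$A)
print(non_empty$B)
print(non_empty$C)
```

The result is a list rather than a data frame because the columns end up with different lengths once the empty cells are dropped.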
