Modifying an R factor? - r

Say I have a data.frame object in R where all the character columns have been converted to factors. I then need to "modify" the value associated with a certain row in the data frame, but keep it encoded as a factor. I first need to extract a single row, so here is what I'm doing. Here is a reproducible example:
a = c("ab", "ba", "ca")
b = c("ab", "dd", "da")
c = c("cd", "fa", "op")
data = data.frame(a,b,c, row.names = c("row1", "row2", "row3"))
colnames(data) <- c("col1", "col2", "col3")
data[,"col1"] <- as.factor(data[,"col1"])
newdat <- data["row1",]
newdat["col1"] <- "ca"
When I assign "ca" to newdat["col1"], the factor object associated with that column is overwritten by the string "ca". This is not the intended behavior. Instead, I want to modify the numeric value that encodes which level is present in newdat. So I want to change the contents of newdat["col1"] as follows:
Before:
Factor object, levels = c("ab", "ba", "ca"): 1 (the value it had)
After:
Factor object, levels = c("ab", "ba", "ca"): 3 (the value associated with the level "ca")
How can I accomplish this?

What you are doing is equivalent to:
x = factor(letters[1:4]) #factor
x1 = x[1] #factor; subset of 'x'
x1 = "c" #assign new value
i.e. you assign a new object to an existing symbol. In your example, you simply replace the factor in newdat["col1"] with the string "ca".
Instead, to subassign to a factor (subassigning with a non-level results in NA), you could use
x = factor(letters[1:4])
x1 = x[1]
x1[1] = "c" #factor; subset of 'x' with the 3rd level
And in your example (I use local to avoid changing newdat again and again for the below):
str(newdat)
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 1
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat["col1"] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: chr "ca"
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat[1, "col1"] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 3
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat[["col1"]][1] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 3
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
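The same behaviour can be reproduced on a standalone factor. The following is a minimal sketch in base R, with illustrative variable names that are not part of the original example. It shows that element-wise assignment keeps the factor and only changes the underlying integer code, that assigning a value outside the levels gives NA with a warning, and that extending the levels first avoids this.

```r
f <- factor(c("ab", "ba", "ca"))

# Element-wise assignment keeps the factor and updates its integer code
x1 <- f[1]          # still a factor carrying all three levels
x1[1] <- "ca"
code_x1 <- as.integer(x1)   # position of "ca" in levels(x1)

# Assigning a value that is not a level produces NA (R also emits a warning)
x2 <- f[1]
x2[1] <- "zz"

# Extend the levels first if the new value should be allowed
x3 <- f[1]
levels(x3) <- c(levels(x3), "zz")
x3[1] <- "zz"
```

Here `code_x1` is 3, `x2` is NA, and `x3` holds the new level "zz".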

Related

What is the proper conversion for factors to numeric, when the original factor levels are composed of strings?

We have a data frame with factors:
df <- data.frame("Var1" = c("A", "B", "B", "C"),
"Var2" = c("Can", "Can", "Not", "Not"))
> str(df)
'data.frame': 4 obs. of 2 variables:
$ Var1: Factor w/ 3 levels "A","B","C": 1 2 2 3
$ Var2: Factor w/ 2 levels "Can","Not": 1 1 2 2
Now we need to convert the factors to numeric values
dfb <- df
dfb[sapply(dfb, is.factor)] <- lapply(dfb[sapply(dfb, is.factor)],
as.numeric)
> str(dfb)
'data.frame': 4 obs. of 2 variables:
$ Var1: num 1 2 2 3
$ Var2: num 1 1 2 2
> summary(as.factor(dfb$Var1))
1 2 3
1 2 1
> summary(df$Var1)
A B C
1 2 1
The quantity within each level is equivalent. Yet according to the documentation for factor, there is this warning sign:
In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).
If we apply that function to the original data frame we get NAs.
df <- data.frame("Var1" = c("A", "B", "B", "C"),
"Var2" = c("Can", "Can", "Not", "Not"))
df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)],
function(x) as.numeric(levels(x))[x])
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
So, is the proper way to convert these factors to numeric simply
as.numeric()
I am applying this to 18 variables, some of which have upwards of 52 levels, and I need to ensure that I am converting them correctly to numeric. It seems correct based on my tests, but the documentation's warning is throwing me off. I believe I'm fundamentally misunderstanding something.
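A small sketch may help separate the two cases the documentation has in mind. It assumes only base R, and the variable names are illustrative. For a factor whose labels are words, the integer codes are simply positions in the sorted levels, so as.numeric (or as.integer) returns those codes and there is no "original number" to recover. For a factor whose labels look like numbers, the codes and the values differ, which is the situation the warning is about.

```r
# Labels are words: the codes are positions in levels(), nothing more
f_chr <- factor(c("A", "B", "B", "C"))
codes <- as.integer(f_chr)
labels_back <- levels(f_chr)[codes]   # maps the codes back to the labels

# Labels look like numbers: the codes follow the sorted levels, not the values
f_num <- factor(c("10", "15", "100"))
codes_num <- as.integer(f_num)
values_num <- as.numeric(levels(f_num))[f_num]   # idiom quoted from the documentation
```

Whether integer codes are a sensible numeric encoding for a string-valued factor depends on what the downstream model expects, since the codes only reflect the alphabetical order of the levels.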

Why does mutate not accept a data.frame as a column to nest?

library(tidyverse)
a = data.frame(c1 = c(1,2,3), c2 = c("a","b","c"))
b = data.frame(c3 = c(TRUE,FALSE,TRUE))
a %>% mutate(c_nested = b)
produces an error:
Error: Column c_nested is of unsupported class data.frame
How do I add a column that contains a nested data.frame?
Many thanks!
We can pass it as a list column
a %>%
mutate(c_nested = list(b))
res <-
a %>%
`$<-`(c_nested, b)
str(res)
# 'data.frame': 3 obs. of 3 variables:
# $ c1 : num 1 2 3
# $ c2 : Factor w/ 3 levels "a","b","c": 1 2 3
# $ c_nested:'data.frame': 3 obs. of 1 variable:
# ..$ c3: logi TRUE FALSE TRUE
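The two results above are different structures, which a base R sketch can make explicit. This is an illustration under the assumption that no tidyverse packages are loaded, and the names `nested` and `listed` are made up for the example. A data-frame column requires the inner data frame to have as many rows as the outer one, while a list column stores one arbitrary object per row.

```r
a <- data.frame(c1 = c(1, 2, 3), c2 = c("a", "b", "c"))
b <- data.frame(c3 = c(TRUE, FALSE, TRUE))

# Data-frame column: b is stored as a single column, row-aligned with a
nested <- a
nested$c_nested <- b
inner <- nested$c_nested$c3

# List column: each row holds its own copy of b
listed <- a
listed$c_list <- rep(list(b), nrow(a))
first_cell <- listed$c_list[[1]]
```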

Create a character variable with data.frame function [duplicate]

Using the data.frame function in R, I am creating an example dataset. However, the vectors with strings are converted to a factor column.
How can I make vectors with strings (e.g. var1) become character column in my data set?
Current Code
df = data.frame(var1 = c("1","2","3","4"),
var2 = c(1,2,3,4))
Resulting Output
As shown below, var1 is a factor. I need var1 to have the chr class.
> str(df)
'data.frame': 4 obs. of 2 variables:
$ var1 : Factor w/ 4 levels "1","2","3","4": 1 2 3 4
$ var2 : num 1 2 3 4
Trouble-shooting
Based on this post, I tried adding as.character, but var1 remains a factor.
df = data.frame(var1 = as.character(c("1","2","3","4")),
var2 = c(1,2,3,4))
stringsAsFactors is your friend. Namely:
df = data.frame(var1 = c("1","2","3","4"), var2 = c(1,2,3,4), stringsAsFactors = FALSE)
yielding:
> str(df)
'data.frame': 4 obs. of 2 variables:
$ var1: chr "1" "2" "3" "4"
$ var2: num 1 2 3 4
Based on the comments, adding the argument stringsAsFactors=FALSE will create character variables instead of factor variables.
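As a sketch of both directions, the example below sets stringsAsFactors explicitly so that it behaves the same on any R version (from R 4.0.0 the default is FALSE; earlier versions defaulted to TRUE). The names `df_chr` and `df_fct` are illustrative. It also shows one way to convert factor columns back to character after the data frame has already been built.

```r
# Character columns are kept as character
df_chr <- data.frame(var1 = c("1", "2", "3", "4"),
                     var2 = c(1, 2, 3, 4),
                     stringsAsFactors = FALSE)

# Character columns are converted to factors
df_fct <- data.frame(var1 = c("1", "2", "3", "4"),
                     var2 = c(1, 2, 3, 4),
                     stringsAsFactors = TRUE)
was_factor <- is.factor(df_fct$var1)

# Convert every factor column of an existing data frame to character
is_fct <- vapply(df_fct, is.factor, logical(1))
df_fct[is_fct] <- lapply(df_fct[is_fct], as.character)
```

Wrapping the input in as.character() inside data.frame() has no effect when stringsAsFactors is TRUE, because the conversion to factor happens afterwards, inside data.frame() itself.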

How do I stop merge from converting characters into factors?

E.g.
chr <- c("a", "b", "c")
intgr <- c(1, 2, 3)
str(chr)
str(base::merge(chr,intgr, stringsAsFactors = FALSE))
gives:
> str(base::merge(chr,intgr, stringsAsFactors = FALSE))
'data.frame': 9 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
$ y: num 1 1 1 2 2 2 3 3 3
I originally thought it has something to do with how merge coerces arguments into data frames. However, I thought that adding the argument stringsAsFactors = FALSE would override the default coercion behaviour of char -> factor, yet this is not working.
EDIT: Doing the following gives me expected behaviour:
options(stringsAsFactors = FALSE)
str(base::merge(chr,intgr))
that is:
> str(base::merge(chr,intgr))
'data.frame': 9 obs. of 2 variables:
$ x: chr "a" "b" "c" "a" ...
$ y: num 1 1 1 2 2 2 3 3 3
but this is not ideal as it changes the global stringsAsFactors setting.
You can accomplish this particular "merge" using expand.grid(), since you're really just taking the cartesian product. This allows you to pass the stringsAsFactors argument:
sapply(expand.grid(x = chr, y = intgr, stringsAsFactors = FALSE), class)
## x y
## "character" "numeric"
Here's a way of working around this limitation of merge():
sapply(merge(data.frame(x = chr, stringsAsFactors = FALSE), intgr), class)
## x y
## "character" "numeric"
I would argue that it never makes sense to pass an atomic vector to merge(), since it is only really designed for merging data.frames.
We can use CJ from data.table as well
library(data.table)
str(CJ(chr, intgr))
# Classes ‘data.table’ and 'data.frame': 9 obs. of 2 variables:
#  $ V1: chr "a" "a" "a" "b" ...
#  $ V2: num 1 2 3 1 2 3 1 2 3

R - How to add columns to a dataset incrementally using a loop?

I'm trying to get the error rates for a Naive Bayes classifier by adding in each variable incrementally. For example, I have 25 variables in my dataset, and I want to get the error rate of the model as I add in one variable at a time. That is, it would output the error rate of the model with the first 2 columns, the error rate with the first 3 columns, then with the first 4 columns, and so on up to the last column.
Here is the pseudocode of what I'm trying to achieve
START
IMPORT DATASET WITH ALL VARIABLES
num_variables = num_dataset_cols
i= 1
WHILE (i <= num_variables)
{
CREATE NEW DATASET WITH x COLUMNs
BUILD THE MODEL
GET THE ERROR RATE
ADD IN NEXT COLUMN
i = i + 1
}
Here is a reproducible question. Obviously you can't build a NB classifier with this data, but that's not my problem. My problem is adding in the columns one by one. So far, the only way I can do it is by overwriting each column. For a NB classifier, the first column is the class node, so there must be at least 2 columns starting off in order for it to run.
#REPRODUCIBLE EXAMPLE
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
num_variables <- ncol(dataset)
i <- 1
while (i <= num_variables)
{
data <- dataset[c(1, i+1)]
str(data)
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
You should be able to see from the str(data) that each time the column is overwritten. Does anyone know how I could go about adding each column without overwriting the previous one? Someone suggested an array to me, but I'm not too familiar with arrays in R. Would this work?
I think this is what you want.
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
dataset
num_variables <- ncol(dataset)
num_variables
i <- 1
while (i <= num_variables) {
data <- dataset[, 1:i]
print(str(data))
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
NULL
'data.frame': 5 obs. of 2 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
NULL
'data.frame': 5 obs. of 3 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
NULL
'data.frame': 5 obs. of 4 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
NULL
'data.frame': 5 obs. of 5 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
$ col5: Factor w/ 5 levels "10","100","15",..: 1 3 5 2 4
NULL
You can use the append function after defining an output variable, assigning the result back to it:
data <- dataset[c(1, i+1)]
output <- append(output, data)
str(data)
Using the "assign" function within a while loop can be helpful for issues like this. You don't show the model syntax, but something like this should work:
dataset$errorrate <- [whatever makes this calculation, assuming it is vectorized]
name <- paste0("errorrate", i)
assign(name, dataset$errorrate)
...
This should leave you with i variables containing error estimate for each model run. If you are looking for one parameter estimate per model you can assign the single estimate a unique name within the global environment using the process above and then rbind them together after the loop has finished
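An alternative to creating one variable per iteration is to collect the results in a single named vector. The sketch below assumes the first column is the class and that at least two columns are needed per model, as described in the question. `fit_and_score` is a hypothetical placeholder that stands in for the real model fit and error estimate, which the question does not show.

```r
dataset <- data.frame(
  col1 = c("A", "B", "C", "D", "E"),
  col2 = c(1, 2, 3, 4, 5),
  col3 = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  col4 = c("n", "y", "y", "n", "y")
)

# Hypothetical placeholder: here it only counts the predictors.
# Replace the body with the actual model fit and error-rate calculation.
fit_and_score <- function(d) ncol(d) - 1

n_cols <- ncol(dataset)
error_rates <- numeric(n_cols - 1)
names(error_rates) <- names(dataset)[-1]   # one entry per added predictor

for (k in 2:n_cols) {
  # Class column plus the first k - 1 predictors; drop = FALSE keeps a data frame
  subset_k <- dataset[, seq_len(k), drop = FALSE]
  error_rates[k - 1] <- fit_and_score(subset_k)
}
```

After the loop, `error_rates` holds one value per model, named after the last column added, so no rbind step or global variable lookup is needed.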
