R - How to add columns to a dataset incrementally using a loop?

I'm trying to get the error rates for a Naive Bayes classifier by adding in each variable incrementally. For example, I have 25 variables in my dataset, and I want to get the error rate of the model as I add in one variable at a time. That is, it would output the error rate of the model with the first 2 columns, then with the first 3 columns, then with the first 4 columns, and so on up to the last column.
Here is the pseudocode of what I'm trying to achieve
START
IMPORT DATASET WITH ALL VARIABLES
num_variables = num_dataset_cols
i= 1
WHILE (i <= num_variables)
{
CREATE NEW DATASET WITH x COLUMNs
BUILD THE MODEL
GET THE ERROR RATE
ADD IN NEXT COLUMN
i = i + 1
}
Here is a reproducible example. Obviously you can't build a NB classifier with this data, but that's not my problem. My problem is adding in the columns one by one. So far, the only way I can do it is by overwriting each column. For a NB classifier, the first column is the class node, so there must be at least 2 columns starting off in order for it to run.
#REPRODUCIBLE EXAMPLE
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
num_variables <- ncol(dataset)
i <- 1
while (i < num_variables)  # stop before i + 1 runs past the last column
{
data <- dataset[c(1, i+1)]
str(data)
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
You should be able to see from the str(data) that each time the column is overwritten. Does anyone know how I could go about adding each column without overwriting the previous one? Someone suggested an array to me, but I'm not too familiar with arrays in R. Would this work?

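I think this is what you want: dataset[, 1:i] selects the first i columns, so each pass keeps the earlier columns and adds the next one. (The NULL lines in the output come from print(str(data)): str prints the structure itself and returns NULL.)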
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
dataset
num_variables <- ncol(dataset)
num_variables
i <- 1
while (i <= num_variables) {
data <- dataset[, 1:i]
print(str(data))
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
NULL
'data.frame': 5 obs. of 2 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
NULL
'data.frame': 5 obs. of 3 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
NULL
'data.frame': 5 obs. of 4 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
NULL
'data.frame': 5 obs. of 5 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
$ col5: Factor w/ 5 levels "10","100","15",..: 1 3 5 2 4
NULL
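
In the output above, the first pass yields a bare factor rather than a data frame, because selecting a single column with dataset[, 1:i] drops it to a vector. A minimal sketch of the same loop that always keeps a data frame (drop = FALSE) and starts at two columns, as the question requires. get_error_rate is a hypothetical placeholder, not a real modelling function; it stands in for whatever builds the model and computes the error:

```r
# Example data from the question
dataset <- data.frame(
  col1 = c("A", "B", "C", "D", "E"),
  col2 = c(1, 2, 3, 4, 5),
  col3 = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  col4 = c("n", "y", "y", "n", "y"),
  col5 = c("10", "15", "50", "100", "20")
)

# Hypothetical placeholder for "build the model and get the error rate".
# It returns NA because no real model is fitted here.
get_error_rate <- function(data) NA_real_

num_variables <- ncol(dataset)
error_rates <- rep(NA_real_, num_variables)  # slot i holds the result for the first i columns
names(error_rates) <- names(dataset)
widths <- integer(0)                         # records how many columns each pass used

# Skip i = 1: the class column needs at least one predictor next to it
for (i in seq_len(num_variables)[-1]) {
  data <- dataset[, seq_len(i), drop = FALSE]  # first i columns, always a data frame
  widths <- c(widths, ncol(data))
  error_rates[i] <- get_error_rate(data)
}
```

Storing the results in one preallocated vector keeps every error rate available after the loop without creating a separate variable per run.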

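You can use the append function after defining an output variable (for example, output <- list() before the loop). append returns a new object, so its result has to be assigned back:
data <- dataset[c(1, i+1)]
output <- append(output, list(data))
str(data)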

Using the "assign" function within a while loop can be helpful for issues like this. You don't show the model syntax, but something like this should work:
dataset$errorrate <- [whatever makes this calculation, assuming it is vectorized]
name <- paste0("errorrate", i)
assign(name, dataset$errorrate)
...
This should leave you with one variable per model run, each containing the error estimate for that run. If you are looking for one parameter estimate per model, you can assign each single estimate a unique name within the global environment using the process above and then rbind them together after the loop has finished.
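
A minimal sketch of that assign-then-rbind idea. The numbers in fake_rates are made up and only stand in for real error estimates; the variable names built with paste0 are likewise just an illustration:

```r
# Made-up values standing in for the error rate of each model run
fake_rates <- c(0.40, 0.35, 0.30, 0.28)

i <- 2
while (i <= 5) {
  name <- paste0("errorrate", i)   # "errorrate2", "errorrate3", ...
  assign(name, fake_rates[i - 1])  # creates a variable with that name in the current environment
  i <- i + 1
}

# Fetch the individually named values and stack them into a one-column matrix
all_rates <- do.call(rbind, mget(paste0("errorrate", 2:5)))
```

mget returns a named list, so the row names of all_rates are the variable names.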

Related

append to dataframe in function - is globalenv really required

I am using the following code, which works fine (improvement suggestions very much welcome):
library(sqldf)  # provides sqldf()
library(broom)  # provides tidy()
WeeklySlopes <- function(Year, Week){
DynamicQuery <- paste('select DayOfYear, Week, Year, Close from SourceData where year =', Year, 'and week =', Week, 'order by DayOfYear')
SubData = sqldf(DynamicQuery)
SubData$X <- as.numeric(rownames(SubData))
lmfit <- lm(Close ~ X, data = SubData)
lmfit <- tidy(lmfit)
Slope <- as.numeric(sqldf("select estimate from lmfit where term = 'X'"))
e <- globalenv()
e$WeeklySlopesDf[nrow(e$WeeklySlopesDf) + 1,] = c(Year,Week, Slope)
}
WeeklySlopesDf <- data.frame(Year = integer(), Week = integer(), Slope = double())
WeeklySlopes(2017, 15)
WeeklySlopes(2017, 14)
head(WeeklySlopesDf)
Is there really no other way to append a row to my existing data frame? I seem to need to access the globalenv. On the other hand, why can sqldf 'see' the 'global' data frame SourceData?
dfrm <- data.frame(a=1:10, b=letters[1:10]) # reproducible example
myfunc <- function(new_a=20){ g <- globalenv(); g$dfrm[3,1] <- new_a; cat(dfrm[3,1])}
myfunc()
20
dfrm
a b
1 1 a
2 2 b
3 20 c # so your strategy might work, although it's unconventional.
Now try to extend dataframe outside a function:
dfrm[11, ] <- c(a=20,b="c")
An occult disaster (conversion of numeric column to character):
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: chr "1" "2" "20" "4" ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So use a list to avoid occult coercion:
dfrm <- data.frame(a=1:10, b=letters[1:10]) # start over
dfrm[11, ] <- list(a=20,b="c")
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
Now try within a function:
myfunc <- function(new_a=20, new_b="ZZ"){ g <- globalenv(); g$dfrm[nrow(dfrm)+1, ] <- list(a=new_a,b=new_b)}
myfunc()
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "ZZ") :
invalid factor level, NA generated
str(dfrm)
'data.frame': 12 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So it succeeds, but if there are any factor columns, non-existent levels will get turned into NA values (with a warning). Your method of using named access to objects in the global environment is rather unconventional, but there is a set of tested methods that you might want to examine. Look at ?R6. Other options are <<- and assign, which allows one to specify the environment in which the assignment is to occur.
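
A more conventional alternative, sketched here under assumptions, is to have the function return the extended data frame and let the caller reassign it, so nothing reaches into the global environment. append_slope is a hypothetical helper and the slope values are made up; in the question they would come from the lm fit:

```r
# Hypothetical helper: returns a copy of `df` with one extra row
append_slope <- function(df, year, week, slope) {
  # Building the new row as a data frame keeps each column's type intact
  new_row <- data.frame(Year = as.integer(year), Week = as.integer(week), Slope = slope)
  rbind(df, new_row)
}

WeeklySlopesDf <- data.frame(Year = integer(), Week = integer(), Slope = double())
WeeklySlopesDf <- append_slope(WeeklySlopesDf, 2017, 15, 0.12)   # made-up slope
WeeklySlopesDf <- append_slope(WeeklySlopesDf, 2017, 14, -0.05)  # made-up slope
```

Because the new row is a data frame rather than the result of c(), the numeric columns are not coerced to character.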

How to get my for loop working in R

I have a few variables (AGE, ACT_TYPE, GENDER) in my data frame. Instead of printing each factor variable's level distribution separately, I used a for loop to print the distributions. However, nothing is printed. Please let me know how to resolve the issue.
> str(combin)
Classes ‘data.table’ and 'data.frame': 500000 obs. of 333 variables:
$ CUSTOMER_ID : int 385793 286891 108751 278651 23637 130723 5694 275523 163723 469852 ...
$ ACT_TYPE : Factor w/ 2 levels "CSA","SA": 1 1 1 1 1 1 2 2 2 1 ...
$ GENDER : Factor w/ 3 levels "","F","M": 3 3 3 3 3 3 3 3 3 3 ...
$ LEGAL_ENTITY : Factor w/ 7 levels "ASSOCIATION",..: 3 3 3 3 3 3 3 3 3 3
combin[, prop.table(table(GENDER))]
GENDER
F M
0.000272 0.232436 0.767292
combin[, prop.table(table(ACT_TYPE))]
ACT_TYPE
CSA SA
0.710686 0.289314
If I replace the calls above with a for loop, I don't see any output.
Please let me know where I am going wrong.
for(i in names(combin)) {
combin[, prop.table(table(names(combin)[i]))]
}
Also, please suggest how I can add a condition to the for loop so that it prints the distribution only if the variable is a factor.
You could use purrr to loop over each column of the data frame and return a list in which each item corresponds to a column; the factor columns are replaced by their prop.tables, and the other columns are returned unchanged:
library(purrr)
library(tibble)  # tibble() replaces the deprecated data_frame()
# generate some random data like yours
mydf <- tibble(
id = sample(1:100, 10, replace = FALSE)
, ACT_TYPE = factor(sample(c("CSA", "SA"), 10, replace = TRUE))
, GENDER = factor(sample(c("", "F", "M"), 10, replace = TRUE))
)
# use map_if to generate prop.tables when the column is a factor
map_if(mydf, is.factor, ~prop.table(table(.x)))
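
The original for loop can also be repaired in base R. Two things go wrong in it: i is already a column name, so names(combin)[i] evaluates to NA, and results are not auto-printed inside a for loop. A minimal sketch, using a small made-up data frame in place of the real combin:

```r
# Made-up data with the same kinds of columns as in the question
combin <- data.frame(
  CUSTOMER_ID = 1:6,
  ACT_TYPE = factor(c("CSA", "CSA", "SA", "CSA", "SA", "CSA")),
  GENDER = factor(c("M", "F", "M", "M", "", "M"))
)

props <- list()
for (col in names(combin)) {
  x <- combin[[col]]                      # [[ ]] extracts the column itself by name
  if (is.factor(x)) {                     # only factor columns get a distribution
    props[[col]] <- prop.table(table(x))
    print(props[[col]])                   # explicit print: for loops do not auto-print
  }
}
```

The same [[ ]] extraction also works when combin is a data.table, as it is in the question.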

How do I remove a particular level occurring in all factors in a dataframe

After reading in data and cleaning it, I ended up with factor columns that have levels that should no longer be there.
For example, d below has one blank cell in excel. When it’s read in, the factor columns have a level "", which shouldn’t be part of the data.
d <- read.csv(header = TRUE, text='
x,y,value
a,one,1
,,5
b,two,4
c,three,10
')
d
#> x y value
#> 1 a one 1
#> 2 5
#> 3 b two 4
#> 4 c three 10
str(d)
#> 'data.frame': 4 obs. of 3 variables:
#> $ x : Factor w/ 4 levels "","a","b","c": 2 1 3 4
#> $ y : Factor w/ 4 levels "","one","three",..: 2 1 4 3
#> $ value: int 1 5 4 10
How do I remove this level, "", from the factors (there are about 20 of them in the data frame) without deleting every row that has just one empty cell? Deleting those rows would reduce my sample size from 299000 to just 7 observations (I have tried this before).
One way would be to replace the '' with NA and use droplevels to remove the unused levels
d[1:2] <- lapply(d[1:2], function(x) droplevels(replace(x, x=="", NA)))
levels(d$x)
#[1] "a" "b" "c"
levels(d$y)
#[1] "one" "three" "two"
Another option, applied while reading the dataset (we assume the OP wanted factor columns), would be
d <- read.csv("yourfile.csv", na.strings = "")
This should make sure that the '' will be read as NA.
Update
Suppose there are numeric columns in between and we need to do the replace/droplevels only for the factor columns:
d[] <- lapply(d, function(x) if(is.factor(x)) droplevels(replace(x, x== "", NA))
else x)
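
A small self-contained check of the na.strings approach, reading the question's data from an inline string instead of a file. stringsAsFactors = TRUE is passed explicitly here because, from R 4.0.0 onward, read.csv no longer converts strings to factors by default:

```r
# The example data from the question, as an inline string
csv_text <- "x,y,value
a,one,1
,,5
b,two,4
c,three,10"

# Empty fields become NA instead of the factor level ""
d <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

levels(d$x)  # the "" level is gone
nrow(d)      # the row with the empty cells is kept
```

The row count is unchanged; only the empty cells turn into NA.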

How do I stop merge from converting characters into factors?

E.g.
chr <- c("a", "b", "c")
intgr <- c(1, 2, 3)
str(chr)
str(base::merge(chr,intgr, stringsAsFactors = FALSE))
gives:
> str(base::merge(chr,intgr, stringsAsFactors = FALSE))
'data.frame': 9 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
$ y: num 1 1 1 2 2 2 3 3 3
I originally thought it has something to do with how merge coerces arguments into data frames. However, I thought that adding the argument stringsAsFactors = FALSE would override the default coercion behaviour of char -> factor, yet this is not working.
EDIT: Doing the following gives me expected behaviour:
options(stringsAsFactors = FALSE)
str(base::merge(chr,intgr))
that is:
> str(base::merge(chr,intgr))
'data.frame': 9 obs. of 2 variables:
$ x: chr "a" "b" "c" "a" ...
$ y: num 1 1 1 2 2 2 3 3 3
but this is not ideal as it changes the global stringsAsFactors setting.
You can accomplish this particular "merge" using expand.grid(), since you're really just taking the cartesian product. This allows you to pass the stringsAsFactors argument:
sapply(expand.grid(x=chr,y=intgr,stringsAsFactors=F),class);
## x y
## "character" "numeric"
Here's a way of working around this limitation of merge():
sapply(merge(data.frame(x=chr,stringsAsFactors=F),intgr),class);
## x y
## "character" "numeric"
I would argue that it never makes sense to pass an atomic vector to merge(), since it is only really designed for merging data.frames.
We can use CJ from data.table as well:
library(data.table)
str(CJ(chr, intgr))
#Classes ‘data.table’ and 'data.frame': 9 obs. of 2 variables:
#$ V1: chr "a" "a" "a" "b" ...
#$ V2: num 1 2 3 1 2 3 1 2 3

Modifying an R factor?

Say I have a data.frame object in R where all the character columns have been transformed to factors. I then need to "modify" the value associated with a certain row in the data frame, but keep it encoded as a factor. I first need to extract a single row, so here is what I'm doing. Here is a reproducible example:
a = c("ab", "ba", "ca")
b = c("ab", "dd", "da")
c = c("cd", "fa", "op")
data = data.frame(a,b,c, row.names = c("row1", "row2", "row3"))
colnames(data) <- c("col1", "col2", "col3")
data[,"col1"] <- as.factor(data[,"col1"])
newdat <- data["row1",]
newdat["col1"] <- "ca"
When I assign "ca" to newdat["col1"], the Factor object associated with that column in data was overwritten by the string "ca". This is not the intended behavior. Instead, I want to modify the numeric value that encodes which level is present in newdat. So I want to change the contents of newdat["col1"] as follows:
Before:
Factor object, levels = c("ab", "ba", "ca"): 1 (the value it had)
After:
Factor object, levels = c("ab", "ba", "ca"): 3 (the value associated with the level "ca")
How can I accomplish this?
What you are doing is equivalent to:
x = factor(letters[1:4]) #factor
x1 = x[1] #factor; subset of 'x'
x1 = "c" #assign new value
i.e. assigning a new object to an existing symbol. In your example, you simply replace the "factor" of newdat["col1"] with "ca".
Instead, to subassign to a factor (subassigning with a non-level results in NA), you could use
x = factor(letters[1:4])
x1 = x[1]
x1[1] = "c" #factor; subset of 'x' with the 3rd level
And in your example (I use local to avoid changing newdat again and again for the below):
str(newdat)
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 1
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat["col1"] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: chr "ca"
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat[1, "col1"] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 3
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
local({ newdat[["col1"]][1] = "ca"; str(newdat) })
#'data.frame': 1 obs. of 3 variables:
# $ col1: Factor w/ 3 levels "ab","ba","ca": 3
# $ col2: Factor w/ 3 levels "ab","da","dd": 1
# $ col3: Factor w/ 3 levels "cd","fa","op": 1
