Factors and Dummy Variables in R - r

I am new to data analytics and am learning R. I have a few very basic questions that I am not clear about, and I hope to find some help here. Please bear with me, I am still learning.
I wrote a small function to perform basic exploratory analysis on a data set with 9 variables out of which 8 are of Int/Numeric type and 1 is Factor. The function is like this :
out <- function(x)
{
c <- class(x)
na.len <- length(which(is.na(x)))
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
uc <- m+3*s
lc <- m-3*s
return(c(classofvar = c, noofNA = na.len, mean=m, stdev=s, UpperCap = uc, LowerCap = lc))
}
And I apply it to the data set using :
stats <- apply(train, 2, FUN = out)
But the output file has all the class of variables as Character and all the Means as NA. After some head hurting, I figured that the problem is due to the Factor variable. I converted it to Numeric using this :
train$MonthlyIncome=as.numeric(as.character(train$MonthlyIncome))
It worked fine. But I am confused: if I use the above function without first looking at the data set, it won't work. How can I handle this situation?
When should I consider creating dummy variables?
Thank you in advance, and I hope the questions are not too silly!

Note that c() returns a vector, and all elements within a vector must be of the same class. If the elements have different classes, c() coerces them to the least complex class that can hold all the information. E.g. integer and numeric will result in numeric; character and integer will result in character.
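The coercion rule is easy to check interactively. The sketch below uses a small made-up data frame (toy is a hypothetical stand-in for train). It also illustrates a likely reason why apply() reported every column as character in the question: apply() converts a data frame to a matrix first, and a matrix can hold only one type, so a single factor column turns everything into character.

```r
# c() coerces mixed inputs to one common type
class(c(1L, 2.5))   # integer + double    -> "numeric"
class(c(1L, "a"))   # integer + character -> "character"

# Hypothetical toy data: one integer column, one factor column
toy <- data.frame(a = 1:3, b = factor(c("x", "y", "z")))

# apply() goes through as.matrix(), so both columns become character
by_apply <- apply(toy, 2, class)

# sapply()/lapply() visit the columns one by one and keep their classes
by_sapply <- sapply(toy, class)
```

Here by_apply is "character" for both columns, while by_sapply reports "integer" and "factor".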
Use a list or a data.frame if you need different classes.
out <- function(x)
{
c <- class(x)
na.len <- length(which(is.na(x)))
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
uc <- m+3*s
lc <- m-3*s
return(data.frame(classofvar = c, noofNA = na.len, mean=m, stdev=s, UpperCap = uc, LowerCap = lc))
}
sum(is.na(x)) is faster than length(which(is.na(x)))
Use lapply to run the function on each variable. Use do.call to append the resulting dataframes.
stats <- do.call(
rbind,
lapply(train, out)
)
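One caveat: mean() and sd() are not meaningful for a factor, and in recent R versions sd() on a factor stops with an error. A minimal sketch of a guarded variant follows, assuming you want NA for the numeric summaries of non-numeric columns. The name out_safe and the toy train data are hypothetical, for illustration only.

```r
# Hypothetical guarded variant: numeric summaries only for numeric columns
out_safe <- function(x) {
  is_num <- is.numeric(x)
  m <- if (is_num) mean(x, na.rm = TRUE) else NA_real_
  s <- if (is_num) sd(x, na.rm = TRUE) else NA_real_
  data.frame(
    classofvar = class(x)[1],
    noofNA     = sum(is.na(x)),
    mean       = m,
    stdev      = s,
    UpperCap   = m + 3 * s,
    LowerCap   = m - 3 * s,
    stringsAsFactors = FALSE
  )
}

# Made-up stand-in for the real data set
train <- data.frame(
  age    = c(25, 40, NA, 31),
  income = factor(c("low", "high", "low", NA))
)

# One row per variable; row names come from the column names
stats <- do.call(rbind, lapply(train, out_safe))
```

With this version the function can be run without inspecting the data set first; factor columns still report their class and NA count.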

Related

Apply arbitrary function that returns a list to vector in R

I recently asked a similar question (link), but the example that I gave there was a little too simple, and the answers did not work for my actual use case. Again, I am using R and want to apply a function to a vector. The function returns a list, and I want the results to be formatted as a list of vectors, where the names of the output list correspond to the names in the list returned by the function, and the value of each list element is the vector of values over the elements of the input vector. The following example shows a basic setup, together with two ways of calculating the desired output (sum.of.differences and sum.of.differences.2). The first method (sum.of.differences) seems to be the easiest way to understand the desired output; the second method (sum.of.differences.2) avoids two major problems with the first method: computing the function twice for each element of the input vector, and being forced to give the names of the list elements explicitly. However, the second method also seems relatively complicated for such a fundamental task. Is there a more idiomatic way to get the desired results in R?
x <- rnorm(n = 10)
a <- seq(from = -1, to = +1, by = 0.01)
sum.of.differences.fun <- function(a) {
d <- x - a
list(
sum.of.absolute.differences = sum(abs(d)),
sum.of.squared.differences = sum(d^2)
)
}
sum.of.differences <- list(
sum.of.absolute.differences = sapply(
X = a,
FUN = function(a) sum.of.differences.fun(a)$sum.of.absolute.differences
),
sum.of.squared.differences = sapply(
X = a,
FUN = function(a) sum.of.differences.fun(a)$sum.of.squared.differences
)
)
sum.of.differences.2 <- (function(lst) {
processed.lst <- lapply(
X = names(lst[[1]]),
FUN = function(name) {
sapply(
X = lst,
FUN = function(x) x[[name]]
)
}
)
names(processed.lst) <- names(lst[[1]])
return(processed.lst)
})(lapply(X = a, FUN = sum.of.differences.fun))
What language did you learn before R? It seems as though you might be using design patterns from a different functional language (I'd guess Lisp). The following code is much simpler and the output is identical (aside from names) as far as I can tell.
x <- rnorm(n = 10)
a <- seq(from = -1, to = +1, by = 0.01)
funs <- c(
sumabsdiff = function(a) sum(abs(x - a)),
sumsquarediff = function(a) sum((x - a) ^ 2)
)
sumdiff <- lapply(
funs,
function(fun) sapply(a, fun)
)
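If the function really has to return a list (for example because the two results share an expensive intermediate step), another option is to call it once per element and then "transpose" the resulting list of lists with Map. This is only a sketch of that idiom; the names per_element and sum.of.differences.3 are made up here, and set.seed is added just to make the example reproducible.

```r
set.seed(1)
x <- rnorm(n = 10)
a <- seq(from = -1, to = 1, by = 0.01)

sum.of.differences.fun <- function(a) {
  d <- x - a
  list(
    sum.of.absolute.differences = sum(abs(d)),
    sum.of.squared.differences  = sum(d^2)
  )
}

# Call the function once per element of a: a list of small named lists
per_element <- lapply(a, sum.of.differences.fun)

# Transpose: Map(c, l1, l2, ...) concatenates the first components of all
# lists, then the second components, keeping the names of the first list
sum.of.differences.3 <- do.call(Map, c(list(f = c), per_element))
```

The result is a named list with one numeric vector per component, each as long as a, and the component names never have to be typed out.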

Applying a function across 3 columns in a data frame only returns one column

I am trying to normalize all columns within a data frame using the function written below. When I try to apply it to all columns using the for loop below, the output returns only one column when I would expect three. The output is normalized correctly suggesting the function works and the for loop is the issue.
seq_along(df) returns the same output
### example df
df <- cbind(as.data.frame(c(2:12), c(3:13), c(4:14)))
### normalization function
rescale <- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
### for loop that returns one column although properly normalized
for (i in 1:ncol(df)){
i <- df[[i]]
output <- as.data.frame(rescale(i))
}
The signature of as.data.frame is as.data.frame(x, row.names = NULL, optional = FALSE, ...). That means that in the call as.data.frame(c(2:12), c(3:13), c(4:14)), c(3:13) is matched to the argument row.names and c(4:14) to optional, so neither becomes a column. Your data.frame therefore only has one column: 2:12.
The correct version should be: df <- as.data.frame(cbind(c(2:12), c(3:13), c(4:14))). Also consider the built-in function scale instead of writing your own; note that by default it standardises each column (subtracts the mean and divides by the standard deviation) rather than rescaling to the range 0 to 1.
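Note also that the loop in the question assigns to output on every iteration, so even with a three-column data frame only the last column would survive. A minimal sketch of applying rescale to every column at once follows, assuming the goal is a data frame of the same shape (normalized is a name chosen here for illustration).

```r
# Three-column example data (columns are named V1, V2, V3)
df <- as.data.frame(cbind(2:12, 3:13, 4:14))

# Min-max normalization from the question
rescale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# lapply() visits each column; assigning into normalized[] keeps the
# data.frame structure and column names instead of overwriting one object
normalized <- df
normalized[] <- lapply(df, rescale)
```

Each column of normalized now runs from 0 to 1.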

How to dynamically create variables names and dynamically assign them a functions to avoid repeating code in R?

I want to make my code cleaner and more maintainable. For example, take the following quantsList list :
var <- "temperature"
quantsList <- list(
q05 = paste0('quantile(',var,', probs=.05, na.rm = TRUE)'),
q10 = paste0('quantile(',var,', probs=.10, na.rm = TRUE)'),
q25 = paste0('quantile(',var,', probs=.25, na.rm = TRUE)'),
q50 = paste0('quantile(',var,', probs=.50, na.rm = TRUE)'),
q75 = paste0('quantile(',var,', probs=.75, na.rm = TRUE)'),
q80 = paste0('quantile(',var,', probs=.80, na.rm = TRUE)'),
q90 = paste0('quantile(',var,', probs=.90, na.rm = TRUE)'),
q95 = paste0('quantile(',var,', probs=.95, na.rm = TRUE)')
)
This is a lot of repetition for creating this 8-element list, and I want to understand how to avoid this kind of bad coding, especially for lists that could have more than 100 elements. My searches led me to the assign() function and its counterpart get(). But I can't really figure out how to properly "play" with these. For now, this is what I have :
# Attempt to create the list "quants" with dynamically named elements storing dynamically created functions
assignFuns <- function(q){
quant = paste0("q",q)
assign(quant, paste0('quantile(',var,', probs=.',q,', na.rm = TRUE)'))
return(get(quant))
}
quants <- list(05,10,25,50,75,80,90,95)
quantsList <- lapply(quants, assignFuns)
Doing so, quantsList contains the elements storing the quantile functions, but the list elements are unnamed.
I know I can simply name the elements of the list using :
names(quantsList) <- lapply(quants, function(x) paste0("q",x))
But this workflow seems too convoluted to me, and at the moment my combination of assign and get is useless. I guess there should be a more efficient use of assign() or get(). Should I adapt my assignFuns() function so that it returns both the name and the corresponding function, or should I proceed in a totally different way?
Thanks for the insights and the help you could provide me regarding this problem
Edit
So according to @Roland, assign() should be avoided. So I'm simply using this code :
assignFuns <- function(q){
quantFun <- paste0('quantile(',var,', probs=.',q,', na.rm = TRUE)')
return(quantFun)
}
quants <- list(05,10,25,50,75,80,90,95)
quantsFuns <- lapply(quants, assignFuns)
names(quantsFuns) <- lapply(quants, function(x) paste0("q",x))
R doesn't work this way. It's not a macro language. And you really do not want to create a bunch of loose variables in the global environment. Instead create a named vector (or list). The quantile function is designed to return a vector of quantiles.
t_probs <- c( 05,10,25,50,75,80,90,95)/100
temp_quants <- quantile( temperature, probs=t_probs)
# If you need them to be named then:
names(temp_quants) <- paste0( "Q_", t_probs)
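If you do need a named list of functions rather than a vector of values, closures avoid both assign() and code stored as strings. The following is only a sketch: make_quantile_fun, pcts and the temperature values are made-up names and data for illustration.

```r
# Percentages kept as integers so they can be formatted into names safely
pcts <- c(5L, 10L, 25L, 50L, 75L, 80L, 90L, 95L)

# Function factory: returns a function that computes one fixed quantile
make_quantile_fun <- function(p) {
  force(p)  # evaluate p now so each closure keeps its own probability
  function(x) quantile(x, probs = p, na.rm = TRUE, names = FALSE)
}

# Named list of functions: q05, q10, ..., q95
quant_funs <- setNames(
  lapply(pcts / 100, make_quantile_fun),
  sprintf("q%02d", pcts)
)

# Made-up example data
temperature <- c(12.1, 15.3, NA, 18.7, 21.0, 9.4)

median_temp <- quant_funs$q50(temperature)
all_quants  <- sapply(quant_funs, function(f) f(temperature))
```

The list can then be applied to any numeric vector, and the names are generated from the same values as the probabilities, so the two cannot drift apart.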

R validate package. Create indicator object for arbitrary column

I'm new to R and I'm coming up against a problem that makes me very puzzled about how the language works. There is a package, "validate", that can create objects you can use to check that your data is as expected.
Testing it out on some toy data, I found that while the following code worked as expected:
library(validate)
I <- indicator(
cnt_misng = number_missing(x)
, sum = sum(x, na.rm = TRUE)
, min = min(x, na.rm = TRUE)
, mean = mean(x, na.rm = TRUE)
, max = max(x, na.rm = TRUE)
)
dat <- data.frame(x=1:4, y=c(NA,11,7,8), z=c(NA,2,0,NA))
C <- confront(dat, I)
values(C)
However, I found that I could not create a function that would return an indicator object for any arbitrary column of the data frame. This was my failed attempt:
check_values <- function(data, x){
print(x)
I <- indicator(
cnt_misng = number_missing(eval(x))
, max = max(eval(x), na.rm = TRUE)
)
C <- confront(df, I)
return(C)
}
df <- data.frame(A=1:4, B=c(NA,11,7,8), C=c(NA,2,0,NA))
C <- check_values(df,'B')
values(C)
If I have a large dataset, I'd like to be able to loop through a list of columns and have an identically formatted report for each one in the list. At this point, I'll probably give up on this package and find another way to more directly do that. However, I am still curious how this could be made to work. It seems like there should be a way to functionalize the creation of this indicator object so I can reuse the code to check the same stats for any arbitrary column of a data frame.
Any ideas?
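For the "another way to more directly do that" route, here is a minimal base-R sketch that does not use the validate package at all. It assumes the column is passed by name as a string, and it computes only the two statistics from the failed attempt.

```r
# Base-R alternative (no validate): one summary row per requested column
check_values <- function(data, col) {
  x <- data[[col]]  # [[ ]] accepts the column name as a string
  data.frame(
    column    = col,
    cnt_misng = sum(is.na(x)),
    max       = max(x, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}

df <- data.frame(A = 1:4, B = c(NA, 11, 7, 8), C = c(NA, 2, 0, NA))

# Loop over any list of column names and stack the identically shaped rows
report <- do.call(rbind, lapply(names(df), function(nm) check_values(df, nm)))
```

Because the column is looked up with data[[col]], no eval() is needed, and the function uses its data argument rather than a global df.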

Evaluate code within a function call in R (Use ICC::ICCbare within a loop)

I want to use the ICC::ICCbare function within a loop. However, the ICCbare uses the concrete variable names as input, e.g.:
ICCbare(x = group, y = variable1, data = dat)
where both "group" and "variable1" are columns of the data.frame "dat" (i.e., dat$variable1); ICCbare cannot be used with y = dat[, i].
In order to program a loop I therefore need to evaluate some R code within the function call of ICCbare. My idea was the following:
for(i in 1:10){
ICCbare(group, names(dat)[i], data = dat)
}
However, this does not work. The following error is printed:
Error in `[.data.frame`(data, yc) : undefined columns selected
Is there a way to evaluate the statement names(dat)[i] first, before it is passed to the function call?
Here is a minimum working example for my problem:
# Create data set
dat <- data.frame(group=c(rep("A",5),
rep("B",5)),
variable1=1:10,
variable2=rnorm(10))
# Loop
for (i in names(dat)[2:3]){
ICCbare("group", i, data = dat)
}
I agree with @agstudy. This is a bad example of non-standard evaluation. You can use this as a workaround:
v <- "variable1"
ICCbare("group", v, data = dat)
#Error in `[.data.frame`(data, yc) : undefined columns selected
eval(bquote(ICCbare("group", .(v), data = dat)))
#$ICC
#[1] 0.8275862
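To see why bquote helps without needing the ICC package, here is a sketch with a toy function, toy_nse, that is hypothetical and not part of ICC. It captures its second argument with match.call in the same spirit as the older ICCbare code shown in this thread.

```r
# Hypothetical toy function that reads the *expression* given for y,
# not its value
toy_nse <- function(x, y, data) {
  cl <- match.call()
  yc <- as.character(cl[[3L]])  # text of whatever was written for y
  mean(data[yc][[1]])
}

dat <- data.frame(group = rep(c("A", "B"), each = 5), variable1 = 1:10)
v <- "variable1"

# Direct call: the function sees the symbol v and looks for a column "v"
direct <- try(toy_nse("group", v, data = dat), silent = TRUE)

# bquote() substitutes the value of v into the call before it is evaluated,
# so the function sees the string "variable1"
fixed <- eval(bquote(toy_nse("group", .(v), data = dat)))
```

Here direct is a try-error (undefined columns selected), while fixed is the mean of variable1.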
This is a bug in ICCbare, which tries to handle its arguments as names in a bad manner.
function (x, y, data)
{
ICCcall <- Call <- match.call()
xc <- as.character(ICCcall[[2L]]) ## this is ugly!
yc <- as.character(ICCcall[[3L]])
inds <- unique(data[xc])[[1]]
tdata <- data.frame(data[yc], data[xc])
Personally I would remove the first lines and just assume that the arguments are column names.
ICCbare_simple <-
function (xc, yc, data)
{
## remove lines before this one
inds <- unique(data[xc])[[1]]
## the rest of the code
.....
}
I'm the maintainer of ICC and I want to thank you for the excellent discussion. I know this is a very late reply, but I just updated the package and the new version (v2.3.0) should fix the "ugly" code and the problem encountered by the OP. See examples in this gist.
I just wanted to post this here in case anyone was searching with a similar problem. Thanks again, sorry for the delay.
Here is the content of the gist:
ICC non-standard evaluation examples
The ICC package for R calculates the intraclass correlation coefficient (ICC) from a one-way analysis of variance. Recently, the package was updated to better execute R's non-standard evaluation within each function (version 2.3.0 and higher). The package functions should now be able to handle a range of possible scenarios for calling the functions in, what I hope, is a less grotesque and more standard way of writing R functions. To demonstrate, below are some of those scenarios. Note, the examples use the ICCbare function, but the way in which the function arguments are supplied will apply to all of the functions in ICC.
First, load the package (and make sure the version is >= 2.3.0)
library(ICC)
packageVersion("ICC")
Columns of a data.frame
Here we supply the column names and the data.frame that contains the data to calculate the ICC. We will use the ChickWeight data frame.
data(ChickWeight)
ICCbare(x = Chick, y = weight, data = ChickWeight)
#$ICC
#[1] 0.1077609
Iterating through columns of a data.frame
In this case, we might have a data.frame in which we want to estimate the ICC for a number of different types of measurements that each has the same grouping or factor variable (e.g., x). The extreme of this might be in a simulation or bootstrapping scenario or even with some fancy high-throughput phenotyping/data collection. The point being, we want to automate the calculation of the ICC for each column.
First, we will simulate our own dataset with 3 traits to use in the example:
set.seed(101)
n <- 15 # number of individuals/groups/categories/factors
k <- 3 # number of measures per 'n'
va <- 1 # variance among
icc <- 0.6 # expected ICC
vw <- (va * (1 - icc)) / icc # solve for variance within
simdf <- data.frame(ind = rep(LETTERS[1:n], each = k),
t1 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
t2 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
t3 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)))
Two ways to run through the columns come to mind: iteratively pass the name of each column or iteratively pass the column index. I will demonstrate both below. I do these in for loops so it is easier to see, but an easy extension would be to vectorise this by using something from the apply family of functions. First, passing the name:
for(i in names(simdf)[-1]){
cat(i, ":")
tmp.icc <- ICCbare(x = ind, y = i, data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
or even like this:
for(i in 1:3){
cat(paste0("t", i), ": ")
tmp.icc <- ICCbare(x = ind, y = paste0("t", i), data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
Alternatively, pass the column index:
for(i in 2:ncol(simdf)){
cat(names(simdf)[i], ": ")
tmp.icc <- ICCbare(x = ind, y = simdf[, i], data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
Passing a character as an argument is deprecated
Note that the function will still work if a character is passed directly (e.g., "t1"), albeit with a warning. The warning just means that this may no longer work in future versions of the package. For example:
ICCbare(x = ind, y = "t1", data = simdf)
#[1] 0.60446
#Warning message:
#In ICCbare(x = ind, y = "t1", data = simdf) :
# passing a character string to 'y' is deprecated since ICC version
# 2.3.0 and will not be supported in future versions. The argument
# to 'y' should either be an unquoted column name of 'data' or an object
Note, however, that an expression evaluating to a character (e.g., paste0("t", 1)) doesn't throw the warning, which is nice!
