I'm new to R and I've run into a problem that leaves me very puzzled about how the language works. There is a package, "validate", that can create objects you can use to check that your data is as expected.
Testing it out on some toy data, I found that the following code worked as expected:
library(validate)
I <- indicator(
    cnt_misng = number_missing(x)
  , sum = sum(x, na.rm = TRUE)
  , min = min(x, na.rm = TRUE)
  , mean = mean(x, na.rm = TRUE)
  , max = max(x, na.rm = TRUE)
)
dat <- data.frame(x=1:4, y=c(NA,11,7,8), z=c(NA,2,0,NA))
C <- confront(dat, I)
values(C)
However, I found that I could not create a function that would return an indicator object for any arbitrary column of the data frame. This was my failed attempt:
check_values <- function(data, x){
  print(x)
  I <- indicator(
      cnt_misng = number_missing(eval(x))
    , max = max(eval(x), na.rm = TRUE)
  )
  C <- confront(df, I)
  return(C)
}
df <- data.frame(A=1:4, B=c(NA,11,7,8), C=c(NA,2,0,NA))
C <- check_values(df,'B')
values(C)
If I have a large dataset, I'd like to be able to loop through a list of columns and have an identically formatted report for each one in the list. At this point, I'll probably give up on this package and find another way to more directly do that. However, I am still curious how this could be made to work. It seems like there should be a way to functionalize the creation of this indicator object so I can reuse the code to check the same stats for any arbitrary column of a data frame.
Any ideas?
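One workaround I'd sketch here (not verified against every version of the validate package, so treat it as an assumption about its behaviour) is to build the whole indicator() call as a text string containing the literal column name, then parse and evaluate it; that way indicator() captures exactly the same expressions as in the working example above:

check_values <- function(data, col) {
  # Build the indicator() call as text with the column name spelled out
  # literally, then parse and evaluate it.
  ind_txt <- sprintf(
    "indicator(cnt_misng = number_missing(%s), max = max(%s, na.rm = TRUE))",
    col, col
  )
  I <- eval(parse(text = ind_txt))
  confront(data, I)
}

df <- data.frame(A = 1:4, B = c(NA, 11, 7, 8), C = c(NA, 2, 0, NA))
C <- check_values(df, "B")
values(C)

Looping over names(df) and calling check_values() for each column should then give an identically formatted report per column.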
I am trying to normalize all columns within a data frame using the function written below. When I try to apply it to all columns using the for loop below, the output returns only one column when I would expect three. The output is normalized correctly, suggesting the function works and the for loop is the issue.
Using seq_along(df) instead of 1:ncol(df) returns the same output.
### example df
df <- cbind(as.data.frame(c(2:12), c(3:13), c(4:14)))
### normalization function
rescale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
### for loop that returns one column although properly normalized
for (i in 1:ncol(df)){
  i <- df[[i]]
  output <- as.data.frame(rescale(i))
}
The syntax of as.data.frame is as.data.frame(x, row.names = NULL, optional = FALSE, ...). That means that in the call as.data.frame(c(2:12), c(3:13), c(4:14)), c(3:13) is being assigned to the argument row.names and c(4:14) to the ellipsis (...). This means that your data.frame only has one column: 2:12.
The correct version should be: df <- as.data.frame(cbind(c(2:12), c(3:13), c(4:14))). Of course, if centring and scaling by the standard deviation is acceptable, you could use the built-in function scale() instead of writing the rescaling yourself.
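As a side note on the loop itself: output is overwritten on every pass, so only the last rescaled column survives. A minimal sketch (using the corrected data frame construction) that rescales every column at once:

df <- as.data.frame(cbind(c(2:12), c(3:13), c(4:14)))

rescale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# lapply() applies rescale() to each column; wrapping the result in
# as.data.frame() keeps one rescaled column per original column.
output <- as.data.frame(lapply(df, rescale))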
I want to make my code cleaner and more maintainable. For example, take the following quantsList list:
var <- "temperature"
quantsList <- list(
  q05 <- paste0('quantile(', var, ', probs=.05, na.rm = TRUE)'),
  q10 <- paste0('quantile(', var, ', probs=.10, na.rm = TRUE)'),
  q25 <- paste0('quantile(', var, ', probs=.25, na.rm = TRUE)'),
  q50 <- paste0('quantile(', var, ', probs=.50, na.rm = TRUE)'),
  q75 <- paste0('quantile(', var, ', probs=.75, na.rm = TRUE)'),
  q80 <- paste0('quantile(', var, ', probs=.80, na.rm = TRUE)'),
  q90 <- paste0('quantile(', var, ', probs=.90, na.rm = TRUE)'),
  q95 <- paste0('quantile(', var, ', probs=.95, na.rm = TRUE)')
)
This is a lot of repetition for creating this eight-element list, and I want to understand how to avoid this kind of bad coding, especially for lists that could have > 100 elements. My searches led me to the assign() function and its counterpart get(). But I can't really figure out how to properly "play" with these. For now, here is what I have:
# Attempt to create the list "quants" with dynamically named elements storing dynamically created functions
assignFuns <- function(q){
  quant <- paste0("q", q)
  assign(quant, paste0('quantile(', var, ', probs=.', q, ', na.rm = TRUE)'))
  return(get(quant))
}
quants <- list(05,10,25,50,75,80,90,95)
quantsList <- lapply(quants, assignFuns)
Doing so, quantsList contains the elements storing the quantile expressions, but the list elements are unnamed.
I know I can simply name the elements of the list using:
names(quantsList) <- lapply(quants, function(x) paste0("q", x))
But this workflow seems too convoluted to me, and at the moment my combination of assign and get is useless. I guess there should be a more efficient use of assign() or get(). Should I adapt my assignFuns() function so that it returns both the name and the corresponding function, or should I proceed in a totally different way?
Thanks for the insights and help you can provide regarding this problem.
Edit
So according to @Roland, assign() is to be avoided. So I'm simply using this code:
assignFuns <- function(q){
  quantFun <- paste0('quantile(', var, ', probs=.', q, ', na.rm = TRUE)')
  return(quantFun)
}
quants <- list(05,10,25,50,75,80,90,95)
quantsFuns <- lapply(quants, assignFuns)
names(quantsFuns) <- lapply(quants, function(x) paste0("q",x))
R doesn't work this way. It's not a macro language. And you really do not want to create a bunch of loose variables in the global environment. Instead create a named vector (or list). The quantile function is designed to return a vector of quantiles.
t_probs <- c(05, 10, 25, 50, 75, 80, 90, 95) / 100
temp_quants <- quantile(temperature, probs = t_probs)
# If you need them to be named then:
names(temp_quants) <- paste0("Q_", t_probs)
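If the same set of quantiles is later needed for many numeric columns (the "> 100 elements" concern), the same idea extends without any assign()/get(); a minimal sketch, assuming a data frame df whose columns are all numeric:

t_probs <- c(05, 10, 25, 50, 75, 80, 90, 95) / 100

# quantile() is applied to each column; the result is a matrix with one
# column per variable and one row per requested quantile.
all_quants <- sapply(df, quantile, probs = t_probs, na.rm = TRUE)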
I'm trying to get my head around non-standard evaluation as used by dplyr, but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
A simplified version of my function...
my_summarise <- function(df = temp,
                         to.sum = 'eg1',
                         ...){
  ## Summarise
  results <- summarise_(df,
                        n = ~n(),
                        mean = mean(~to.sum, na.rm = TRUE))
  return(results)
}
And running it with some dummy data...
set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
              rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
N is calculated, but the mean is not; I can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
                         group.by = 'group',
                         to.sum = c('eg1', 'eg2'),
                         ...){
  results <- list()
  ## Select columns
  df <- dplyr::select_(df, .dots = c(group.by, to.sum))
  ## Summarise overall
  results$all <- summarise_each(df,
                                funs(n = ~n(),
                                     mean = mean(~to.sum, na.rm = TRUE)))
  ## Summarise by specified group
  results$by.group <- group_by_(df, ~to.group) %>%
    summarise_each(df,
                   funs(n = ~n(),
                        mean = mean(~to.sum, na.rm = TRUE)))
  return(results)
}
...but before I move on to this more complex version (for which I was using this example for guidance), I need to get the evaluation working in the simple version first, as that's the stumbling block; the call to dplyr::select_() works OK.
Appreciate any advice as to where I'm going wrong.
Thanks in advance
The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
                         to.sum = 'eg1',
                         ...){
  ## Summarise
  results <- summarise_(df,
                        n = ~n(),
                        mean = lazyeval::interp(~mean(x, na.rm = TRUE),
                                                x = as.name(to.sum)))
  return(results)
}
Here is what I do when I struggle to get things working:
Remember that, just like the ~n() you already have, the call will have to start with a ~.
Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
In this case I would probably first write interp(~mean(x, na.rm = TRUE), x = to.sum). But running that gives us ~mean("eg1", na.rm = TRUE), which treats eg1 as a character string instead of a variable name. So we use as.name, as is taught in vignette("nse").
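For example, running just the interp() calls makes the difference visible:

lazyeval::interp(~mean(x, na.rm = TRUE), x = as.name("eg1"))
# ~mean(eg1, na.rm = TRUE)     -> a bare column name, which summarise_() can evaluate

lazyeval::interp(~mean(x, na.rm = TRUE), x = "eg1")
# ~mean("eg1", na.rm = TRUE)   -> a character string, so mean() returns NA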
I am new to data analytics and learning R. I have a few very basic questions which I am not very clear about. I hope to find some help here. Please bear with me, still learning.
I wrote a small function to perform basic exploratory analysis on a data set with 9 variables, of which 8 are integer/numeric and 1 is a factor. The function looks like this:
out <- function(x)
{
  c <- class(x)
  na.len <- length(which(is.na(x)))
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  uc <- m + 3*s
  lc <- m - 3*s
  return(c(classofvar = c, noofNA = na.len, mean = m, stdev = s, UpperCap = uc, LowerCap = lc))
}
And I apply it to the data set using:
stats <- apply(train, 2, FUN = out)
But the output has the class of every variable as character and all the means as NA. After some head-scratching, I figured out that the problem is due to the factor variable. I converted it to numeric using this:
train$MonthlyIncome=as.numeric(as.character(train$MonthlyIncome))
It worked fine. But I am confused: if I use the above function without looking at the dataset first, it won't work. How can I handle this situation?
When should I consider creating dummy variables?
Thank you in advance, and I hope the questions are not too silly!
Note that c() results in a vector, and all elements within the vector must be of the same class. If the elements have different classes, then c() uses the least complex class that is able to hold all of the information. E.g. numeric and integer will result in numeric; character and integer will result in character.
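A quick illustration of that coercion rule:

c(1L, 2.5)                # integer and numeric   -> numeric:   1.0 2.5
c(1L, "a")                # integer and character -> character: "1" "a"
class(c(TRUE, 2L, 3.7))   # logical, integer, numeric -> "numeric"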
Use a list or a data.frame if you need different classes.
out <- function(x)
{
  c <- class(x)
  na.len <- length(which(is.na(x)))
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  uc <- m + 3*s
  lc <- m - 3*s
  return(data.frame(classofvar = c, noofNA = na.len, mean = m, stdev = s, UpperCap = uc, LowerCap = lc))
}
sum(is.na(x)) is faster than length(which(is.na(x)))
Use lapply to run the function on each variable. Use do.call to bind the resulting data frames together row-wise:
stats <- do.call(
  rbind,
  lapply(train, out)
)
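As a quick check on a toy data frame (hypothetical column names, only to show the shape of the result), each column of train becomes one row of stats, with the row name taken from the column name:

train <- data.frame(age    = c(25, NA, 40, 31),
                    income = c(50, 60, NA, 70))

stats <- do.call(rbind, lapply(train, out))
# stats has one row per column of train, with columns
# classofvar, noofNA, mean, stdev, UpperCap and LowerCap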