Set an argument to remove NAs in a function - R

I am trying to complete a function that I may share with other users in the future. I would like it to have an argument that lets users either exclude all rows with missing values from every analysis, or keep the data as is so that each component uses whatever data are available. I wonder if there is a standard way to do this, or an R convention for it.
To show my point:
mydata <- data.frame(x = c(1, 2, 3, 4, 5, NA, 7),
                     y = c(2, NA, 4, 5, 6, 7, NA))
myfun <- function(data, na.omit = FALSE, ...) {
  if (isTRUE(na.omit)) {
    data <- na.omit(data)
  }
  # computing a lot of things
  print(data)
}
myfun(data = mydata, na.omit = FALSE)
myfun(data = mydata, na.omit = TRUE)
Although it works fine now, I am still a little worried about this because na.omit is an existing R function. Should I change this argument to something like na_omit or complete_set?
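For what it's worth, the code works because R skips non-function objects when it looks up a name that is being called, so the logical argument na.omit does not hide the function na.omit(). A different name is still easier to read. Base functions such as mean() and sum() call this kind of switch na.rm, so one option is to borrow that name. A minimal sketch, assuming the function simply returns the (possibly filtered) data; the body is a placeholder, not a recommendation for how the real analyses should handle NAs:

```r
mydata <- data.frame(x = c(1, 2, 3, 4, 5, NA, 7),
                     y = c(2, NA, 4, 5, 6, 7, NA))

# `na.rm` mirrors the argument name used by base functions such as mean()
myfun <- function(data, na.rm = FALSE, ...) {
  if (isTRUE(na.rm)) {
    # keep only the rows with no missing value in any column
    data <- stats::na.omit(data)
  }
  # placeholder for the real computations
  data
}

complete_rows <- myfun(mydata, na.rm = TRUE)
all_rows <- myfun(mydata)
```

Writing stats::na.omit() explicitly also makes it obvious which na.omit is meant, whatever the argument ends up being called.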

Related

Text Argument in R Function With GLM: Object Not Found

A function included in a package is throwing an error when I attempt to supply weights to the function. The portion of the package call is required to be specified like this:
weights = c("kernel_wght")
Inside the function, the following two lines of code are used to specify a data frame object called weight:
weight1 <- sprintf("dataarg$%s", weights)
weight <- as.data.frame(eval(parse(text = weight1)))
However, the analytic portion of the function attempts to use glm to conduct an analysis of data, using the weights provided.
result1 <- glm(f1, family="gaussian", weights=weight, data=dataarg)
Doing so yields the following error:
Error in (function (arg) : object 'weight' not found
I've seen some recommendations that the whole glm call should be re-specified, and I've seen some references to global environment objects. Why can I print the data frame, verifying that it is indeed created, but can't refer to it in the call to glm? Is there a fix that I have overlooked?
As requested, here is a small example. I created some sample data, as if it had come from a multiple imputation generating process:
dat <- c(1, 1, 0, .5, 1, 3, 0, 1, 1, 4, 0, .5, 1, 5, 1, 1, 1, 2, 1,
.5,
2, 7, 1, 1, 2, 3, 0, .5, 2, 2, 0, 1, 2, 4, 1, .5)
dat <- data.frame(matrix(dat,ncol=4, byrow=T))
colnames(dat) <- c("id", "y", "tx", "wt")
imp_lst <- lapply(1:2, function(s) dplyr::filter(dat, id == s))
for (i in 1:length(imp_lst)) { assign(paste0("imp", i),
as.data.frame(imp_lst[[i]])) }
df_lst <- list()
for (i in 1:length(imp_lst)) {
assign(paste0("imp", i), as.data.frame(imp_lst[[i]]))
df_lst <- append(df_lst, list(get(paste0("imp", i))))
names(df_lst)[i] <- paste0("imp", i)
}
And here is a small example, mostly taken from the package, that re-creates the problem:
my_ex <- function(datasets, y, treatment, weights=NULL, ...) {
data <- names(datasets)
for (i in 1:length(treatment)) {
d1 <- sprintf("datasets$%s", data[i])
dataarg <- eval(parse(text=d1))
print(dataarg)
if(!is.null(weights)) {
weight1 <- sprintf("dataarg$%s", weights)
weight <- as.data.frame(eval(parse(text = weight1)))
print(weight)
} else {
dataarg$weight <- weight <- rep(1,nrow(dataarg))
}
f1 <- sprintf("%s ~ %s ", y, treatment)
print(f1)
result1 <- glm(f1, family="gaussian", weights=weight, data=dataarg)
print(summary(result1))
}
}
Using the following call, the error appears:
testrun <- my_ex(df_lst, y = c("y","y"), treatment = c("tx","tx"), weights = c("wt","wt"))
The proximal problem is that you are defining the formula as a character string and passing it to glm. It gets converted to a formula within glm, but when that happens its environment is the environment of glm, so it doesn't know where to look for the weights variable (loosely speaking, glm will look (1) within the data frame provided as data and (2) in the environment of the formula). You can work around this by using as.formula() to convert the string to a formula before passing it to glm (e.g. glm(as.formula(f1), ...)).
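To make the as.formula() workaround concrete, here is a minimal sketch. The toy data, the function name fit_from_string and its arguments are hypothetical and only for illustration; the point is that the formula is created inside the function where the weights vector lives.

```r
# Hypothetical toy data; the column names y, tx and wt are assumptions
toy <- data.frame(
  y  = c(1, 3, 4, 5, 2),
  tx = c(0, 0, 0, 1, 1),
  wt = c(0.5, 1, 0.5, 1, 0.5)
)

fit_from_string <- function(d, f_string, w_name) {
  w <- d[[w_name]]  # numeric vector that exists only in this function
  # as.formula() is called here, so the formula's environment is this
  # function's frame, which is where glm() then looks for `w`
  form <- as.formula(f_string)
  glm(form, family = gaussian(), weights = w, data = d)
}

fit <- fit_from_string(toy, "y ~ tx", "wt")
coef(fit)
```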
However: using functions like eval, parse, assign is a code smell in R — it means there's probably a more natural, simpler, more robust way to do what you want. For example, I think this function does the same as what your function is trying to do, relying on indexing within lists rather than using eval(parse(...)) and friends.
my_ex2 <- function(datasets, y, treatment, weights = NULL, ...) {
  result <- list()
  for (i in seq_along(treatment)) {
    form <- reformulate(treatment[i], response = y[i])
    data <- datasets[[i]]
    ## note double brackets around second term - we want
    ## the results to be a vector, not a data frame;
    ## with weights = NULL, glm() fits an unweighted model
    weight <- if (is.null(weights)) NULL else data[[weights[i]]]
    result[[i]] <- glm(formula = form, weights = weight, data = data)
  }
  result
}
Then, to print out all the summaries, lapply(result, summary) (if you really think you only need the summary, you can save the summary instead of the fitted object inside the loop).
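The same list-indexing idea can be written without an explicit loop. This is a sketch under assumptions: fit_all is a hypothetical name, the data are the sample values from the question split by id, and Map() is swapped in for the for loop.

```r
# Sample data from the question, split into one data frame per id
dat <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  y  = c(1, 3, 4, 5, 2, 7, 3, 2, 4),
  tx = c(0, 0, 0, 1, 1, 1, 0, 0, 1),
  wt = c(0.5, 1, 0.5, 1, 0.5, 1, 0.5, 1, 0.5)
)
df_lst <- split(dat, dat$id)
names(df_lst) <- paste0("imp", names(df_lst))

# Hypothetical helper: one weighted fit per data set
fit_all <- function(datasets, y, treatment, weights) {
  Map(function(d, yy, tx, w) {
    form <- reformulate(tx, response = yy)  # built in this frame
    wts <- d[[w]]                           # vector, not a data frame
    glm(form, family = gaussian(), weights = wts, data = d)
  }, datasets, y, treatment, weights)
}

fits <- fit_all(df_lst, y = c("y", "y"), treatment = c("tx", "tx"),
                weights = c("wt", "wt"))
summaries <- lapply(fits, summary)
```

Map() keeps the names of the first list, so the fits can be picked out as fits$imp1 and fits$imp2.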

Calculating summary scores using apply function and if else statement

I'm calculating two summary scores based on two 3-item scales. I'm calculating each like so:
tasksum <- paste("item",c(8,11,12), sep="")
all_data_2$summary_score_task <- apply(all_data_2[,tasksum], 1, sum, na.rm = FALSE)
activesum <- paste("item",c(14,21,25), sep="")
all_data_2$summary_score_act <- apply(all_data_2[,activesum], 1, sum, na.rm = FALSE)
I would like to accomplish the following:
for summary_score_task, in cases where "item8" is NA, I would like to calculate the summary score with the following expression: ((item11 + item12)/2)*3. In cases where it's not NA, I would like to continue to calculate the summary score the same way as above.
for activity_score_act, in cases where "item21" is NA, I would like to calculate the summary score with the following expression: (2*ibr14 + ibr25). In cases where it's not NA, I would like to continue to calculate the summary score the same way as above.
I'm sort of new to R so I would appreciate some help with this. Thanks.
First, the function rowSums will handle the simple case of getting sums for every row (and more efficiently), though there is nothing wrong with using apply.
Second, to do the custom set of calculations you want to do, you can write your own anonymous function for use with apply that will do exactly the task you desire. apply with the margin argument set to 1 as you have will apply that function to each row of the input data. Without access to your data, here's an example:
set.seed(2)
all_data_2 <- data.frame(
item8 = c(rnorm(48), NA, NA),
item11 = rnorm(50),
item12 = rnorm(50)
)
tasksum <- paste("item", c(8, 11, 12), sep = "")
all_data_2$summary_score_task <-
apply(all_data_2[,tasksum], 1, function(x) {
# Note I am using the fact that I know the first element is item8
if (is.na(x[1])) {
((x[2] + x[3])/2)*3
} else {
sum(x, na.rm = TRUE)
}
})
You can accomplish your second task very similarly, I think. Examine your data after doing this and confirm it is doing what you want.
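For the second score, a vectorized alternative to apply is ifelse() combined with rowSums(). This is only a sketch: the three-row data frame is made up, and I am assuming the ibr14 and ibr25 in the question refer to item14 and item25.

```r
# Made-up data; item21 is missing in the second row
scores <- data.frame(
  item14 = c(1, 2, 3),
  item21 = c(4, NA, 6),
  item25 = c(7, 8, 9)
)
activesum <- paste0("item", c(14, 21, 25))

scores$summary_score_act <- ifelse(
  is.na(scores$item21),
  2 * scores$item14 + scores$item25,  # fallback when item21 is NA
  rowSums(scores[, activesum])        # plain sum otherwise
)
scores$summary_score_act
```

Rows where item14 or item25 are missing still come out as NA, which matches the na.rm = FALSE behaviour of the original code.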

Using testthat to check each variable in a data frame for NA values

I'm building a dataset from very messy raw files, and am using testthat to make sure things don't break as new data is added or cleaning rules are corrected. I'd like to add a test to see if there are any NA values in the data, and, if so, to report which columns they are in.
It's trivial to do so manually, by writing a test for each column. But that solution will be a pain to maintain and error-prone, as I don't want to have to remember to update the test-NA file every time a column is added to or removed from the dataset.
Here is example code for what I have
df <- tidyr::tribble(
~A, ~B, ~C,
1, 2, 3,
NA, 2, 3,
1, 2, NA
)
# checks all variables, doesn't report which have NA values
testthat::test_that("NA Values", {
testthat::expect_true(sum(is.na(df)) == 0)
})
# Checks each column, but is a pain to maintain
testthat::test_that("Variable specific checks", {
testthat::expect_true(sum(is.na(df$A)) == 0)
testthat::expect_true(sum(is.na(df$B)) == 0)
testthat::expect_true(sum(is.na(df$C)) == 0)
})
Solution 1: quick and (not so) dirty
df <- tidyr::tribble(
~A, ~B, ~C,
1, 2, 3,
NA, 2, 3,
1, 2, NA
)
# Checks every column at once and reports which ones contain NAs
testthat::test_that("Variable specific checks", {
  # res is TRUE for each column that contains at least one NA
  res <- apply(df, 2, function(x) sum(is.na(x)) > 0)
  # the test should pass only when no column contains an NA
  testthat::expect_true(!any(res), label = paste(paste(which(res), collapse = ", "), "contain(s) NA(s)"))
})
which should return something like
Error: Test failed: 'Variable specific checks'
* 1, 3 contain(s) NA(s) isn't true.
Solution 2: tailor an expect_() function to your needs
expect_true2 <- function(object, info = NULL, label = NULL) {
act <- testthat::quasi_label(rlang::enquo(object), label, arg = "object")
testthat::expect(identical(as.vector(act$val), TRUE), sprintf("Column %s contain(s) NA(s).",
act$lab), info = info)
invisible(act$val)
}
testthat::test_that("Variable specific checks", {
res <- apply(df, 2, function(x) sum(is.na(x))>0)
expect_true2(!any(res), label = paste(which(res), collapse=","))
})
which should return
Error: Test failed: 'Variable specific checks'
* Column 1,3 contain(s) NA(s).
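If column names are more useful than positions in the failure message, one possibility is to compare the names of the offending columns against an empty character vector. A minimal sketch, assuming a hypothetical helper na_columns (not part of testthat) and two small made-up data frames:

```r
# Hypothetical helper: names of the columns that hold at least one NA
na_columns <- function(data) {
  names(data)[colSums(is.na(data)) > 0]
}

dirty <- data.frame(A = c(1, NA, 1), B = c(2, 2, 2), C = c(3, 3, NA))
clean <- data.frame(A = 1:3, B = 4:6)

na_columns(dirty)  # names of the columns with NAs in the dirty data

# Only run the expectation when testthat is installed
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("no column contains NA", {
    testthat::expect_identical(na_columns(clean), character(0))
  })
}
```

When the expectation fails, the reported difference contains the column names themselves, and nothing has to be updated when columns are added or removed.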

How to refer to unnamed object in R

I want to perform a simple task in R. I want to call a method on an object which has not been assigned to any variable yet.
Like this:
a <- c(5, 2, 11, 3)
b <- order(a, decreasing = TRUE)[1:floor(0.1 * length(.))]
So I guess I would like to find out what to pass to the length function here. I know that I can do it like this:
a <- c(5, 2, 11, 3)
b <- order(a, decreasing = TRUE)
b <- b[1:floor(0.1 * length(b))]
But I wanted to make it like I wrote above.
As far as I know, there is no built-in way that is more efficient than the base code
a <- c(5, 2, 11, 3)
b <- order(a, decreasing = TRUE)
b[1:floor(0.1 * length(b))]
However one can achieve something similar to what you are asking, using either the magrittr, the dplyr or similar packages, which allow for piping calls. This would look similar to
library(magrittr)
a <- c(5, 2, 11, 3)
c <- a %>% order(., decreasing = TRUE) %>% .[1:floor(0.1 * length(.))]
identical(b[1:floor(0.1 * length(b))],c)
[1] TRUE
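In this particular case the placeholder can also be avoided, because order() returns one index per element, so length(a) equals the length of the unnamed result. A minimal sketch using head(), assuming a longer made-up vector so that 10% is at least one element:

```r
# Made-up 20-element vector so that 10% of its length is 2
a <- c(5, 2, 11, 3, 8, 1, 9, 4, 7, 6, 10, 12, 15, 13, 14, 0, 20, 18, 19, 17)
n_top <- floor(0.1 * length(a))
top <- head(order(a, decreasing = TRUE), n_top)

# With the 4-element vector from the question n is 0: head() then returns
# an empty vector, whereas 1:0 counts down and selects the first element
short <- c(5, 2, 11, 3)
none <- head(order(short, decreasing = TRUE), floor(0.1 * length(short)))
```

Note the difference in the edge case: the 1:floor(...) indexing returns one element when the count rounds down to zero, while head() returns none.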

R function for creating, naming and lagging variables

I have some data like so:
a <- c(1, 2, 9, 18, 6, 45)
b <- c(12, 3, 34, 89, 108, 44)
c <- c(0.5, 3.3, 2.4, 5, 13,2)
df <- data.frame(a, b,c)
I need to create a function to lag many variables at once for a very large time series analysis with dozens of variables, without typing each one out. In short, I would like to create variables a.lag1, b.lag1 and c.lag1 and be able to add them to the original df specified above. I figure the best way to do so is by creating a custom function, something along the lines of:
lag.fn <- function(x) {
assign(paste(x, "lag1", sep = "."), lag(x, n = 1L)
return (assign(paste(x, "lag1", sep = ".")
}
The desired output is:
a.lag1 <- c(NA, 1, 2, 9, 18, 6, 45)
b.lag1 <- c(NA, 12, 3, 34, 89, 108, 44)
c.lag1 <- c(NA, 0.5, 3.3, 2.4, 5, 13, 2)
However, I don't get what I am looking for. Should I change the environment to the global environment? I would like to be able to use cbind to add the new variables to the original df. Thanks.
This is easy with dplyr. Avoid naming a data frame df, since that can be confused with the function of the same name. I'm using df1.
library(dplyr)
df1 <- data.frame(a, b, c)
df1 <- df1 %>%
  mutate(a.lag1 = lag(a),
         b.lag1 = lag(b),
         c.lag1 = lag(c))
The data frame statement in the question is invalid since a, b and c are not the same length. What you can do is create a zoo series. Note that the lag specified in lag.zoo can be a vector of lags as in the second example below.
library(zoo)
z <- merge(a = zoo(a), b = zoo(b), c = zoo(c))
lag(z, -1) # lag all columns
lag(z, 0:-1) # each column and its lag
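We can lag every column at once. Older dplyr code did this with mutate_all and funs, but funs is deprecated; in current dplyr the equivalent is mutate with across:
library(dplyr)
df %>%
  mutate(across(everything(), lag, .names = "{.col}.lag1"))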
If everything else fails, you can use a simple base R function (seq_len avoids the backwards 1:0 sequence when steps equals the length of x):
my_lag <- function(x, steps = 1) {
  c(rep(NA, steps), x[seq_len(length(x) - steps)])
}
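To lag every column without typing each one out, the base function can be combined with lapply() and cbind(), as asked in the question. This is a sketch, assuming steps is smaller than the number of rows; the my_lag here is a variant of the function above that uses seq_len().

```r
df <- data.frame(
  a = c(1, 2, 9, 18, 6, 45),
  b = c(12, 3, 34, 89, 108, 44),
  c = c(0.5, 3.3, 2.4, 5, 13, 2)
)

# Variant of my_lag using seq_len(); assumes steps < length(x)
my_lag <- function(x, steps = 1) {
  c(rep(NA, steps), x[seq_len(length(x) - steps)])
}

# Lag every column, rename, and bind to the original data frame
lagged <- as.data.frame(lapply(df, my_lag))
names(lagged) <- paste(names(df), "lag1", sep = ".")
df_lagged <- cbind(df, lagged)
```

Each lagged column keeps the same number of rows as the original, with NA in the first position.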
