Error: calling arguments in user functions using quosures - r

I am trying to create cross tables using weights::wtd.chi.sq.
The data:
library(readr)  # read_table2() comes from readr
data_in <- read_table2("Q50_1 Q50_2 Q38 Q90 pov gender wgt1 wgt2
never always Yes 2 High M 1.3 0.4
sometimes always No 2 Med F 0.4 0.2
always sometimes Yes 4 Low F 1.2 0.7
never never No 2 High M 0.5 0.7
always always No 4 High M 0.7 0.8
sometimes never Yes 3 Low F 0.56 0.3
sometimes never Yes 2 Med F 0.9 0.1
")
The cross-tab function, which later feeds into another function:
xtab_func <- function(dat, col, target, wgt){
  col <- rlang::as_string(rlang::ensym(col))
  target <- rlang::as_string(rlang::ensym(target))
  wgt <- rlang::as_string(rlang::ensym(wgt))
  weights::wtd.chi.sq(dat[[target]], dat[[col]], weight = dat[[wgt]])
}
Running it gives:
xtab_func(data_in, 'Q50_1','pov','wgt1')
Chisq df p.value
7.3395092 4.0000000 0.1189981
Now I am looping over a vector of columns to repeat this tabulation for each column. The error happens when I try to pass target and wgt through to the xtab function above. I've tried three different ways, but none of them work.
crosstab <- function(dat, target, columns, wgt, target_name, school_type){
  # browser()
  target <- rlang::as_string(rlang::ensym(target))
  print(target)
  wgt <- rlang::as_string(rlang::ensym(wgt))
  target_name <- rlang::enquo(target_name)
  school_type <- rlang::enquo(school_type)
  d <- list()
  for (i in columns){
    # OPTION 1
    # x <- xtab_func(dat, i, !!target, !!wgt)
    # OPTION 2
    x <- xtab_func(dat, i, target, wgt)
    # OPTION 3
    # x <- xtab_func(dat, i, dat[[target]], dat[[wgt]])
    x$i <- i
    d[[i]] <- x
    df <- do.call(rbind, d)
  }
  return(df)
}
When I run this I should see the chi-squared values for each of these columns crossed with pov...
cols <- data_in %>% select(starts_with("Q"))
cols <- names(cols)
crosstab(data_in,'pov',cols, 'wgt1', 'pov','trad')
But I get these errors:
Error: Only strings can be converted to symbols
OR
Error in model.frame.default(formula = weight ~ var1 + var2) :
invalid type (NULL) for variable 'var1'
Any idea how I call those variables? Thank you!

It's not clear to me at all why you are trying to use all the rlang stuff when you are just passing character values to your functions. This could be greatly simplified to
xtab_func <- function(dat, col, target, wgt){
  weights::wtd.chi.sq(dat[[target]], dat[[col]], weight = dat[[wgt]])
}
and
crosstab <- function(dat, target, columns, wgt, target_name, school_type){
  d <- list()
  for (i in columns){
    x <- as.data.frame(as.list(xtab_func(dat, i, target, wgt)))
    x$i <- i
    d[[i]] <- x
  }
  df <- do.call(rbind, d)
  return(df)
}
Just use [[]] with character values to index into your data.
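With these simplified functions, the call from the question works unchanged and returns one data frame row per column, with the column name stored in i:
crosstab(data_in, 'pov', cols, 'wgt1', 'pov', 'trad')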

Keeping the OP's original xtab_func unchanged, we can modify crosstab to
library(purrr)
library(dplyr)
crosstab <- function(dat, target, columns, wgt, target_name, school_type){
  purrr::map_dfr(columns, ~ {
    xtab_func(dat, !!.x, !!target, !!wgt)
  })
}
Testing:
crosstab(data_in,'pov', cols, 'wgt1', 'pov','trad')
# A tibble: 4 x 3
  Chisq    df p.value
  <dbl> <dbl>   <dbl>
1  7.34     4   0.119
2  6.02     4   0.198
3  1.47     2   0.480
4  4.83     4   0.306

Related

Fix a column in for loop while doing Chi-square test

I want to perform chi-square test of independence on the following dataset. The dataset consists of four categorical variables. The test is performed on two variables at a time with variable V4 fixed. Essentially, I want to perform chi-square for 3 combinations: V1-V4, V2-V4, and V3-V4. Now, I want to perform this in a loop since the actual analysis consists of operations over a large number of combinations.
V1 V2 V3 V4
A SUV Yes Good
A SUV No Good
B SUV No Good
B SUV Yes Satisfactory
C car Yes Excellent
C SUV No Poor
D SUV Yes Poor
D van Yes Satisfactory
E car No Excellent
What I have tried:
x <- c(1:3)
for (i in x) {
  test <- chisq.test(df[, i], df[, 4])
  out <- data.frame("X" = colnames(df)[i],
                    "Y" = colnames(df[4]),
                    "Chi.Square" = round(test$statistic, 3),
                    "df" = test$parameter,
                    "p.value" = round(test$p.value, 3))
  return(out)
}
However, I only receive the output for V1-V4 combination.
Reference for code: Chi Square Analysis using for loop in R
out gets overwritten in each iteration with the current output, and the return(out) exits on the first pass, so only one combination's result is returned (return() belongs inside a function, not a bare for loop). We can initialize a list with the length of 'x' to store the output:
x <- 1:3
out <- vector('list', length(x))
for (i in x) {
  test <- chisq.test(df[, i], df[, 4])
  out[[i]] <- data.frame("X" = colnames(df[i]),
                         "Y" = colnames(df[4]),
                         "Chi.Square" = round(test$statistic, 3),
                         "df" = test$parameter,
                         "p.value" = round(test$p.value, 3))
}
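The filled list can then be bound into a single data frame:
out <- do.call(rbind, out)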
You can use lapply to perform this loop.
x <- 1:3
out <- do.call(rbind, lapply(x, function(i) {
  test <- chisq.test(df[, i], df[, 4])
  data.frame("X" = colnames(df)[i],
             "Y" = colnames(df[4]),
             "Chi.Square" = round(test$statistic, 3),
             "df" = test$parameter,
             "p.value" = round(test$p.value, 3))
}))
rownames(out) <- NULL
out
# X Y Chi.Square df p.value
#1 V1 V4 14.25 12 0.285
#2 V2 V4 12.75 6 0.047
#3 V3 V4 2.25 3 0.522

Aggregate a table by applying a function of multiple columns

Considering the following table df, with categorical variables noted x1 and x2 and numerical measurements noted y1, y2 and y3:
df <- data.frame(x1 = sample(letters[1:3], 20, replace = TRUE),
                 x2 = sample(letters[4:6], 20, replace = TRUE),
                 y1 = rnorm(20), y2 = rnorm(20), y3 = rnorm(20))
I'd like to apply to it a function of the three numerical measurements y, grouped by the categorical variables x. For example the following function, where the input y is a table of 3 columns, which should output one new column:
f <- function(y){ sum((y[,1] - y[,2]) / y[,3]) }
I tried it with aggregate, dplyr, summaryBy... without success, as it seems that for every method, mixing the input columns is not an option. Any idea how to do this with such functions (i.e. taking advantage of aggregation)?
aggregate(data = df, y1 + y2 + y3 ~ x1 + x2, FUN = f)
To clarify, the expected result can be obtained with something like:
groups <- unique(df[, c("x1", "x2")])  # co-occurrences of explanatory variables
res <- c()
for (i in 1:nrow(groups)){  # get the subtables
  temp <- df[df$x1 == groups[i, 1] & df$x2 == groups[i, 2], c("y1", "y2", "y3")]
  res <- c(res, f(temp))  # apply function on subtables
}
groups$res <- res  # aggregate results
This is not too bad for this simple toy example, but it becomes very impractical with more complex data.
The problem is on the input side of your function. The way you specified it, it expects a data frame.
A possible solution is to feed the function a list of columns, with a small change to your function:
f <- function(y) sum((y[[1]] - y[[2]]) / y[[3]])
You can now use it in a dplyr-chain:
df %>%
  group_by(x1, x2) %>%
  summarise(sum_y = f(list(y1, y2, y3)))
which gives:
# A tibble: 9 x 3
# Groups:   x1 [?]
  x1    x2     sum_y
  <fct> <fct>  <dbl>
1 a     d      1.20
2 a     e      0.457
3 a     f     -9.46
4 b     d     -1.11
5 b     e     -0.176
6 b     f     -1.34
7 c     d     -0.994
8 c     e      3.38
9 c     f     -2.63
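Since summarise() evaluates its expressions within each group anyway, the helper function can also be dropped and the formula written inline, which gives the same result:
df %>%
  group_by(x1, x2) %>%
  summarise(sum_y = sum((y1 - y2) / y3))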

Replacing Defined Outliers Using Apply/Tapply R

Good Afternoon R wizards,
I searched through a few posts on replacing outliers in a data set - two that came closest to answering my questions were Changing outliers for NA in all columns in a dataset in R and Replace outliers by quantiles in R
The code in the 2nd reference works great if you want to update a column or two, but I have 40+ and would like to be able to use apply function to hit all the columns at once.
I want to set a threshold "max" of quantile(probs = .75) for each column, and replace any x>"max" with "max"
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a", "b")
# apply(x, 2, quantile, probs = .75)
Winsor75 <- function(x) {
  Max <- quantile(x, probs = .75)
  return(Max)
}
y <- as.data.frame(x)
y$a[y$a > Winsor75(x)] <- Winsor75(x)
The last line of code effectively replaces any defined outliers (in my case, values above the 75th percentile), but it uses the 75th percentile of the entire matrix x, whereas I would like (a) the quantile to be computed per column and (b) to be able to use the function in apply/tapply etc., so I can perform the operation on all columns efficiently.
Any suggestions?
Thanks!
as.data.frame(lapply(y, function(x) pmin(x, quantile(x, 0.75, na.rm = TRUE))))
As a function:
df_winsor <- function(df, p) {
  as.data.frame(lapply(df,
    function(x) pmin(x, quantile(x, probs = p, na.rm = TRUE))))
}
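Applied to the example data from the question:
df_winsor(y, 0.75)
pmin() compares each element against its own column's 75th percentile in one vectorized step, so no explicit index-and-replace is needed.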
Statistician's Disclaimer: I've solved the programming problem you asked. This should not be taken as an endorsement of the idea of automatically checking for, or doing anything with, so-called "outliers".
One option is to use mutate_all with a custom function to apply the rule to all columns.
Approach:
I have created a replaceOutlier function (based on the OP's function) which calculates Max and then caps any item greater than Max before returning the vector. replaceOutlier is applied over all columns using dplyr::mutate_all.
library(tidyverse)
replaceOutlier <- function(x) {
  Max <- quantile(x, probs = .75)
  x[x > Max] <- Max
  return(x)
}
x %>% as_tibble() %>% mutate_all(funs(replaceOutlier))
#Results
# # A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 -0.626 1.08
# 2 0.698 0.390
# 3 -0.836 -0.621
# 4 0.698 1.08
# 5 0.330 1.08
# 6 -0.820 -0.0449
# 7 0.487 -0.0162
# 8 0.698 0.944
# 9 0.576 0.821
# 10 -0.305 0.594
#
Data
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
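Note that funs() has since been deprecated; with dplyr >= 1.0.0 the same transformation can be written with across():
x %>% as_tibble() %>% mutate(across(everything(), replaceOutlier))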

min max scaling/normalization in r for train and test data

I am looking to create a function that takes in the training set and the testing set as its arguments, min-max scales/normalizes and returns the training set and uses those same values of minimum and range to min-max scale/normalize and return the test set.
So far this is the function I have come up with:
min_max_scaling <- function(train, test){
  min_vals <- sapply(train, min)
  range1 <- sapply(train, function(x) diff(range(x)))
  # scale the training data
  train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))
  for(i in seq_len(ncol(train))){
    column <- (train[, i] - min_vals[i]) / range1[i]
    train_scaled[i] <- column
  }
  colnames(train_scaled) <- colnames(train)
  # scale the testing data using the min and range of the train data
  test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))
  for(i in seq_len(ncol(test))){
    column <- (test[, i] - min_vals[i]) / range1[i]
    test_scaled[i] <- column
  }
  colnames(test_scaled) <- colnames(test)
  return(list(train = train_scaled, test = test_scaled))
}
The definition of min max scaling is similar to this question asked earlier on SO - Normalisation of a two column data using min and max values
My questions are:
1. Is there a way to vectorize the two for loops in the function? e.g. using sapply()
2. Are there any packages that allow us to do what we are looking to do here?
Here is the code for the min-max normalization. See this Wikipedia page for the formulae, and also other ways of performing feature scaling.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
To normalize a single column, call the function on it directly, e.g. normalize(df$name); to scale every column of a data frame at once, use lapply:
as.data.frame(lapply(df, normalize))
Update to address Holger's suggestion.
If you want to pass additional arguments to min() and max(), e.g., na.rm, then you can use:
normalize <- function(x, ...) {
  return((x - min(x, ...)) / (max(x, ...) - min(x, ...)))
}
x <- c(1, NA, 2, 3)
normalize(x)
# [1] NA NA NA NA
normalize(x, na.rm = TRUE)
# [1] 0.0  NA 0.5 1.0
Just keep in mind, that whatever you pass to min() via the ellipsis ... you also implicitly pass to max(). In this case, this shouldn't be a big problem since both min() and max() share the same function signature.
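Regarding your 1st question, the two for loops can be vectorized with sweep(), which subtracts and divides column-wise in one step (a sketch reusing the min_vals and range1 already computed inside your function):
train_scaled <- sweep(sweep(train, 2, min_vals, "-"), 2, range1, "/")  # (train - min) / range
test_scaled  <- sweep(sweep(test,  2, min_vals, "-"), 2, range1, "/")  # same train statistics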
Regarding your 2nd question, you can use the caret package:
library(caret)
train = data.frame(a = 1:3, b = 10:12)
test = data.frame(a = 1:6, b = 7:12)
pp = preProcess(train, method = "range")
predict(pp, train)
# a b
# 1 0.0 0.0
# 2 0.5 0.5
# 3 1.0 1.0
predict(pp, test)
# a b
# 1 0.0 -1.5
# 2 0.5 -1.0
# 3 1.0 -0.5
# 4 1.5 0.0
# 5 2.0 0.5
# 6 2.5 1.0
This package also provides other transformation methods, see: http://machinelearningmastery.com/pre-process-your-dataset-in-r/
set.seed(1984)
### simulating a data set
df <- data.frame(var1 = rnorm(100, 5, 3),
                 var2 = rpois(100, 15),
                 var3 = runif(50, 90, 100))
df_train <- df[1:60, ]
df_test <- df[61:100, ]
## the function
normalize_data <- function(train_set, test_set) ## the args are the two sets
{
  ranges <- sapply(train_set, function(x) max(x) - min(x)) ## range calculation
  normalized_train <- train_set / ranges # the normalization
  normalized_test <- test_set / ranges
  return(list(ranges = ranges, # returning a list
              normalized_train = normalized_train,
              normalized_test = normalized_test))
}
z <- normalize_data(df_train, df_test) ## applying the function
## the results
z$ranges
var1 var2 var3
13.051448 22.000000 9.945934
> head(z$normalized_train)
var1 var2 var3
1 0.47715854 1.1492978 7.289028
2 0.18322387 0.4545455 4.280883
3 0.69451066 1.3070668 9.703761
4 -0.04125108 1.6090169 7.277882
5 0.35731555 0.7272727 4.133561
6 0.86120315 0.6032616 9.246209

How to avoid for-loops with multiple criteria in function which()

I have a 25 years data set that looks similar to the following:
date name value tag
1 2014-12-01 f -0.338578654 12
2 2014-12-01 a 0.323379254 4
3 2014-12-01 f 0.004163806 9
4 2014-12-01 f 1.365219477 2
5 2014-12-01 l -1.225602543 7
6 2014-12-01 d -0.308544089 9
This is how to replicate it:
set.seed(9)
date <- rep(seq(as.Date("1990-01-01"), as.Date("2015-01-01"), by = "months"), each = 50)
N <- length(date)
name <- sample(letters, N, replace = TRUE)
value <- rnorm(N)
tag <- sample(c(1:50), N, replace = TRUE)
mydata <- data.frame(date, name, value, tag)
head(mydata)
I would like to create a new matrix that stores values that satisfy multiple criteria. For instance, the sum of values that have a name j and a tag i. I use two for-loops and the which() function to filter out the correct values. Like this:
S <- matrix(data = NA, nrow = length(unique(mydata$tag)), ncol = length(unique(mydata$name)))
for(i in 1:nrow(S)){
  for (j in 1:ncol(S)){
    foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
    S[i, j] <- sum(mydata$value[foo])
  }
}
This is ok for small data sets, but too slow for larger ones. Is it possible to avoid the for-loops or somehow speed up the process?
You can use dcast from package reshape2, with a custom function to sum your values:
library(reshape2)
dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
Or simply xtabs, base R:
xtabs(value~name+tag, mydata)
Some benchmarks:
funcPer = function(){
  S <- matrix(data = NA, nrow = length(unique(mydata$tag)), ncol = length(unique(mydata$name)))
  for(i in 1:nrow(S)){
    for (j in 1:ncol(S)){
      foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
      S[i, j] <- sum(mydata$value[foo])
    }
  }
}
colonel1 = function() dcast(mydata, name ~ tag, value.var = 'value', fun.aggregate = sum)
colonel2 = function() xtabs(value ~ name + tag, mydata)
#> system.time(colonel1())
# user system elapsed
# 0.01 0.00 0.01
#> system.time(colonel2())
# user system elapsed
# 0.05 0.00 0.05
#> system.time(funcPer())
# user system elapsed
# 4.67 0.00 4.82
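A base R alternative that returns a matrix directly, much like the S built in the question, is tapply (a sketch; the default argument requires R >= 3.4.0 and fills name/tag combinations with no rows, matching xtabs):
with(mydata, tapply(value, list(name, tag), sum, default = 0))  # rows = name, cols = tag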
