Create R Function with flexibility to reference different datasets - r

I am trying to create a simple function in R that can reference multiple datasets and multiple variable names. Using the following code, I get an error, which I believe is due to referencing:
set.seed(123)
dat1 <- data.frame(x = sample(10), y = sample(10), z = sample(10))
dat2 <- data.frame(x = sample(10), y = sample(10), z = sample(10))
table(dat1$x, dat1$y)
table(dat2$x, dat2$y)
fun <- function(dat, sig, range){print(table(dat$sig, dat$range))}
fun(dat = dat1, sig = x, range = y)
fun(dat = dat2, sig = x, range = y)
Any idea how to adjust this code so that it can return the table appropriately?

The [[ ]] operator on a data frame is similar to $, but it evaluates its argument, so you can pass in a variable and the column is looked up by that variable's value. When you call the function, pass the column name as a string, for example sig = "x". If you leave out the quotes, R will look for an object named x instead.
fun <- function(dat, sig, range){print(table(dat[[sig]], dat[[range]]))}
fun(dat = dat1, sig = "x", range = "y")
fun(dat = dat2, sig = "x", range = "y")
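To see why the original version fails, here is a minimal sketch comparing the two operators. The data frame d and the variable col are made up for illustration and are not part of the question.

```r
# Hypothetical toy data, not from the question
d <- data.frame(x = 1:3, y = c("a", "b", "c"))
col <- "x"

# `$` takes its right-hand side literally: it looks for a column named "col"
by_dollar <- d$col

# `[[` evaluates its argument first, so it looks for the column named "x"
by_brackets <- d[[col]]

print(is.null(by_dollar))  # no column is called "col", so `$` finds nothing
print(by_brackets)
```

Inside the original function the same thing happens: dat$sig looks for a column literally called sig, regardless of what was passed as the sig argument.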

Related

R: passing a formula into an function as the first input

library(RSSL)
set.seed(1)
df <- generateSlicedCookie(1000,expected=FALSE) %>%
add_missinglabels_mar(Class~.,0.98)
class_erlr <- EntropyRegularizedLogisticRegression(Class ~., df, lambda=0.01,lambda_entropy = 100)
In the EntropyRegularizedLogisticRegression function from the RSSL package, the example in the documentation passed in the formula Class ~. as the input. I was looking at the source code, and these are the parameters for the function
function (X, y, X_u = NULL, lambda = 0, lambda_entropy = 1, intercept = TRUE,
init = NA, scale = FALSE, x_center = FALSE)
I tried manually defining what X, y, X_u are based on the df I generated. But running the following gives me an error with the optimization:
y <- df$Class
X <- df[, -1]
ids <- which(is.na(y))
X_u <- X[ids, ]
class_erlr_manual <- EntropyRegularizedLogisticRegression(X = X, y = y, X_u = X_u, lambda=0.01,lambda_entropy = 100)
The error reads:
Error in optim(w, fn = loss_erlr, gr = grad_erlr, X, y, X_u, lambda = lambda, :
initial value in 'vmmin' is not finite
Why does changing the formula input Class ~. into X=X, y =y, X_u = X_u result in an error? Can anyone point me to where in the source code the formula input is being used?

Create a one-row dataframe where every column is the same as an existing variable

I often write code like this:
answer.df = data.frame(x = numeric(0), y = numeric(0), z = numeric(0))
for (i in 1:100) {
x = do_stuff(i)
y = do_more_stuff(i)
z = yet_more_stuff(i)
# Is there a better way of doing this:
temp.df = data.frame(x = x, y = y, z = z)
answer.df = rbind(answer.df, temp.df)
}
My question is, in the line temp.df = data.frame(x = x, y = y, z = z), is there a neater way of doing this? Imagine it with ten or more variables to understand my problem.
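Try this (xfun and yfun stand for your own functions). Returning a data.frame from each iteration means rbind produces a data frame rather than a matrix of list cells:
do.call("rbind", lapply(1:100, function(i) data.frame(x = xfun(i), y = yfun(i))))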
Also try rbindlist from data.table which may have some performance advantages:
library(data.table)
rbindlist(lapply(1:100, function(i) list(x = xfun(i), y = yfun(i))))
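As a rough sketch of the same idea in base R only: build one single-row data frame per iteration and bind them once at the end. The three functions below are stand-ins (the real do_stuff() and friends are not shown in the question), and the use of mget() to collect the local variables by name is one possible way to avoid typing x = x, y = y, z = z, not the only one.

```r
# Stand-in computations; the real functions are not shown in the question
do_stuff       <- function(i) i^2
do_more_stuff  <- function(i) i + 0.5
yet_more_stuff <- function(i) -i

rows <- lapply(1:5, function(i) {
  x <- do_stuff(i)
  y <- do_more_stuff(i)
  z <- yet_more_stuff(i)
  # Collect the local variables by name into a one-row data frame
  as.data.frame(mget(c("x", "y", "z"), envir = environment()))
})

# Bind all rows in a single call instead of growing the data frame in a loop
answer.df <- do.call(rbind, rows)

print(dim(answer.df))
print(names(answer.df))
```

With ten or more variables, only the character vector passed to mget() grows, and binding once at the end avoids repeatedly copying answer.df inside the loop.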

representing the name of variables in a scatterplot

I need to write a function which draws a plot for the variables. The problem is that it doesn't print the name of variables.
visual<-function( x , y){
df<-cbind(x,y)
df<-scale(df, center = TRUE, scale = TRUE)
df<-as.data.frame(df)
ggpairs(df, columns=1:2,xlab = colnames(df)[1],ylab =colnames(df)[2])
}
If we have these two vectors:
a <- c(128.095014, 71.430997, 88.704595, 48.180638)
b <- c(10.584888, 10.246740, 4.422322, 9.621246)
visual(a,b)
What is wrong with that?
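You can use substitute to get the names of the objects passed into your function. Wrapping it in deparse turns the captured symbol into a character string, which is what names() expects.
visual<-function(x, y){
xname <- deparse(substitute(x))
yname <- deparse(substitute(y))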
df<-cbind(x,y)
df<-scale(df, center = TRUE, scale = TRUE)
df<-as.data.frame(df)
names(df) <- c(xname, yname)
GGally::ggpairs(df, columns=1:2, xlab = colnames(df)[1], ylab =colnames(df)[2])
}
b<-c(128.095014, 71.430997, 88.704595, 48.180638)
a<-c(10.584888, 10.246740, 4.422322, 9.621246)
visual(a,b)
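A minimal sketch of the name-capturing step on its own, without the plotting code. The helper arg_names is hypothetical and exists only to show what deparse(substitute()) returns for each argument.

```r
# Hypothetical helper: returns the names of the objects passed as arguments
arg_names <- function(x, y) {
  c(deparse(substitute(x)), deparse(substitute(y)))
}

a <- c(10.584888, 10.246740, 4.422322, 9.621246)
b <- c(128.095014, 71.430997, 88.704595, 48.180638)

labels <- arg_names(a, b)
print(labels)
```

Note that this only gives a short, tidy name when the caller passes a plain variable. If an expression such as a * 2 is passed, the whole expression text is returned.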

Error in VAR: Different row size of y and exogen

I am trying to fit a VAR model in R with an exogenous variable:
vndata <- read.csv("vndata.txt", sep="")
names(vndata)
da <- data.frame(vndata[2:dim(vndata),])
# STOCK PRICE MODEL
y <- da[, c("irate", "stockp", "mrate", "frate")]
x <- data.frame(da[, c("cdi")])
library("vars")
VARselect(y, lag.max = 8,exogen = x)
var1 <- restrict(VAR(y, p = 2,exogen = x),method = c("ser"),thresh = 1.56)
Then, I want to plot the impulse response function:
plot(irf(var1, impulse = c("irate"), response = c("frate"), boot = T,
cumulative = FALSE,n.ahead = 20))
However, it produces the following error:
Error in VAR(y = ysampled, p = 2, exogen = x) :
Different row size of y and exogen.
I cannot figure out what is happening. I have used dim() to make sure that y and x have the same number of rows.
Try this, it worked for me:
.GlobalEnv$exogen <- x
VARselect(y, lag.max = 8,exogen = .GlobalEnv$exogen)

Display multiple time series with rCharts hPlot

Using a simple data frame to illustrate this problem:
df <- data.frame(x=c(1,2,3), y1=c(1,2,3), y2=c(3,4,5))
Single time series plot is easy:
hPlot(y="y1", x="x", data=df)
I cannot figure out how to plot both y1 and y2 together. I tried this, but it returns an error:
> hPlot(x='x', y=c('y1','y2'), data=df)
Error in .subset2(x, i, exact = exact) : subscript out of bounds
Checking the code in hPlot where it uses [[ to extract one column from input data.frame, does it mean it only works for single time series?
hPlot <- highchartPlot <- function(..., radius = 3, title = NULL, subtitle = NULL, group.na = NULL){
rChart <- Highcharts$new()
# Get layers
d <- getLayer(...)
data <- data.frame(
x = d$data[[d$x]],
y = d$data[[d$y]]
)
Try using long-format data with group:
hPlot(x = "x", y = "value", group = "variable", data = reshape2::melt(df, id.vars = "x"))
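For reference, here is a sketch of what the long format looks like, built with base R only so that it does not depend on reshape2. The column names variable and value are chosen to match what melt produces. One difference is an assumption worth flagging: melt returns variable as a factor, while this version keeps it as character.

```r
df <- data.frame(x = c(1, 2, 3), y1 = c(1, 2, 3), y2 = c(3, 4, 5))

# Stack y1 and y2 into one "value" column; "variable" records the source column
long <- data.frame(
  x        = rep(df$x, times = 2),
  variable = rep(c("y1", "y2"), each = nrow(df)),
  value    = c(df$y1, df$y2)
)

print(long)
```

Each original row now appears once per series, so a single y column plus a grouping column is enough to describe both lines.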
