Making a function with variable inputs in R

I am trying to make a function that does a simple calculation; however, I want to apply it across several columns, and each column has a different constant in the equation.
This is the formula I want to make into a function.
Example
df <- iris[1:10, 2:4]
y_true <- c(3, 1, 0.4) # each number (constant) corresponds to a column in df
y_true_sepal_width <- 3 # (1 corresponds to Petal.Length, 0.4 to Petal.Width)
R <- 10 # number of samples
y_estimated <- mean(df$Sepal.Width)
sqrt((y_estimated - y_true_sepal_width)^2 / R) * 100 # this is (I believe) how to find the value for one column manually
I want to apply this formula across the data frame, taking the mean of each column and substituting in the corresponding y_true as it moves from column to column. I figured this would be done by putting the true constants into a vector, but I haven't had any luck incorporating it into the function.

Given that you have a df and y_true, you can create an estimate function as follows:
estimate <- function(df, y_true) {
  R <- nrow(df)                      # number of samples (rows)
  y_estimated <- apply(df, 2, mean)  # column means
  (sqrt((y_estimated - y_true)^2 / R) / y_true) * 100
}
and then, you can use it with your data as follows:
df <- iris[1:10, 2:4]
y_true <- c(3, 1, 0.4)
estimate(df = df, y_true = y_true)
which outputs:
Sepal.Width Petal.Length Petal.Width
3.267687 14.230249 14.230249
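If df is large, colMeans() is a drop-in replacement for apply(df, 2, mean) and is usually faster; here is a minimal sketch of the same function using it (same inputs and output as above):

estimate <- function(df, y_true) {
  R <- nrow(df)                   # number of samples (rows)
  y_estimated <- colMeans(df)     # equivalent to apply(df, 2, mean)
  (sqrt((y_estimated - y_true)^2 / R) / y_true) * 100
}
estimate(iris[1:10, 2:4], c(3, 1, 0.4))  # should reproduce the values above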

Using Rollapply to return both the Coefficient and RSquare

I have a dataset that looks something like this:
library(data.table)
library(zoo)  # for rollapply

dtset <- data.table(x = 11:30, y = rnorm(20))
I would like to calculate the rolling regression coefficient and rsquared over the last 10 items:
dtset[, coefficient := rollapply(1:20, width = 10, FUN = function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(data = c(subdtset$x, rep(1, nrow(subdtset))),
                       nrow = nrow(subdtset), ncol = 2), subdtset$y)
  return(coef(reg)[1])
}, align = "right", fill = NA)]

dtset[, rsquare := rollapply(1:20, width = 10, FUN = function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(data = c(subdtset$x, rep(1, nrow(subdtset))),
                       nrow = nrow(subdtset), ncol = 2), subdtset$y)
  return(1 - sum((subdtset$y - reg$fitted.values)^2) /
             sum((subdtset$y - mean(subdtset$y, na.rm = TRUE))^2))
}, align = "right", fill = NA)]
The code above accomplishes this, but my dataset has millions of rows and I have multiple columns where I want to make these calculations, so it is taking a very long time. I am hoping there is a way to speed things up:
Is there a better way to capture the last 10 items in rollapply rather than passing the row numbers as the variable a and then doing subdtset <- dtset[a]? I tried using .SD and .SDcols but was unable to get that to work. I can only figure out how to get rollapply to accept one column or vector as the input, not two columns/vectors.
Is there a way to return 2 values from one rollapply statement? I think I could get significant time savings if I only had to do the regression once, and then from that take the coefficient and calculate RSquare. It's pretty inefficient to do the same calculations twice.
Thanks for the help!
Use by.column = FALSE to pass both columns to the function. In the function, calculate the slope and R-squared directly to avoid the overhead of lm.fit. Note that rollapply can return a vector, and that rollapplyr (with an r on the end) is right-aligned. This also works if dtset consists of a single x column followed by multiple y columns, as in the example below with the built-in anscombe data frame.
library(data.table)
library(zoo)

stats <- function(X, x = X[, 1], y = X[, -1]) {
  c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
}

rollapplyr(dtset, 10, stats, by.column = FALSE, fill = NA)

a <- anscombe[c("x3", "y1", "y2", "y3")]
rollapplyr(a, 3, stats, by.column = FALSE, fill = NA)
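To address the second question (getting both values in one pass), the same call can also be assigned straight back into the data.table as two new columns in a single := statement; a sketch, assuming dtset has the x and y columns defined in the question:

dtset[, c("slope", "rsq") := as.data.frame(
  rollapplyr(as.matrix(.SD), 10, stats, by.column = FALSE, fill = NA)
), .SDcols = c("x", "y")]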
Check
We check the formulas using the built-in BOD data frame.
fm <- lm(demand ~ Time, BOD)
c(coef(fm)[[2]], summary(fm)$r.squared)
## [1] 1.7214286 0.6449202
stats(BOD)
## slope rsq
## 1.7214286 0.6449202

Logistic Regression in R: glm() vs rxGlm()

I fit a lot of GLMs in R. Usually I use revoScaleR::rxGlm() for this because I work with large data sets and use quite complex model formulae - and glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes

df_reprex$p <- df_reprex$y / df_reprex$x    # success rate

# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number

glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)

exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1]))  # overall fitted average 0.333 - correct

glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")

exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))  # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. Reading the docs for rxLogit(): https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
                 data = df_reprex,
                 pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(By using the revoScaleR package I occasionally paint myself into a corner like this, because not many other people seem to use it.)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment as to what the results were? I can then edit/delete this post accordingly.
Two things to try:
1. Expand the data and get rid of the weights statement.
2. Use cbind(y, x - y) ~ 1 in either rxLogit or rxGlm, without weights and without expanding the data.
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to a single 1 or 0 response, and this expanded data is then run in a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
Does rxLogit allow a cbind() representation, like glm() does (I put an example as glm_1b)? That would allow the data to stay the same size… From the rxLogit page I'm guessing not for rxLogit, but rxGlm might allow it, given the following note on the formula page:
A formula typically consists of a response, which in most RevoScaleR functions can be a single variable or multiple variables combined using cbind, the "~" operator, and one or more predictors, typically separated by the "+" operator. The rxSummary function typically requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1),  # number of successes
                        trial = c("first", "second", "third", "fourth"))  # trial label

df_reprex$p <- df_reprex$y / df_reprex$x    # success rate

# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number

glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)

exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1]))  # overall fitted average 0.333 - correct

df_reprex_expanded <- data.frame(y = c(0, 1, 0, 0, 1, 0),
                                 trial = c("first", "second", "third", "third", "fourth", "fourth"))

## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
              family = binomial,
              data = df_reprex_expanded)

exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1]))  # overall fitted average 0.333 - correct

## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y, x - y) ~ 1,
              family = binomial,
              data = df_reprex)

exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1]))  # overall fitted average 0.333 - correct

glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")

exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))  # overall fitted average 0.167 - incorrect

glm_2a <- rxLogit(y ~ 1,
                  data = df_reprex_expanded)

exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1]))  # overall fitted average ???

# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y, x - y) ~ 1,
                  data = df_reprex)

exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1]))  # overall fitted average ???

# cbind() + rxGlm + family = binomial FTW(?)
glm_2c <- rxGlm(cbind(y, x - y) ~ 1,
                family = binomial,
                data = df_reprex)

exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1]))  # overall fitted average ???
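Since df_reprex_expanded was written out by hand above, here is a hedged sketch of a helper (expand_binary is a hypothetical name, not part of revoScaleR) that builds the one-row-per-response frame programmatically from the trial/success counts, in case the expansion route is the one that works on the real data:

# hypothetical helper: one output row per individual 0/1 response
expand_binary <- function(df) {
  data.frame(
    trial = rep(df$trial, df$x),   # repeat each label once per trial
    y     = unlist(mapply(function(successes, trials)
                            c(rep(1, successes), rep(0, trials - successes)),
                          df$y, df$x, SIMPLIFY = FALSE))
  )
}

identical(expand_binary(df_reprex)$y, c(0, 1, 0, 0, 1, 0))  # matches df_reprex_expanded above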

R bootstrap through all columns in a data frame

I'm trying to get bootstrapped descriptive statistics for the columns of a data frame using boot() in R. I can't figure out how to write the "statistic" function required by boot() so that it goes through all the columns and returns four statistics for each.
I'm basically trying to modify code from a book chapter introducing bootstrapping, but the example there is using dlply to group up the numeric values in one column based on the group names in another. I don't need that step; I just need to do the same thing to every column. I reproduce the code from the textbook below:
library(boot)
library(plyr)
library(moments)

DescStat <- function(data, i) {
  temp <- data[i, ]
  desc <- dlply(temp, "GroupName", summarize,
                mean = mean(ValueColumn),
                sd   = sd(ValueColumn),
                skew = skewness(ValueColumn),
                kurt = kurtosis(ValueColumn))
  l.desc <- unlist(desc)
  return(l.desc)
}

DasBoot <- boot(dataframe, DescStat, 1000)
print(DasBoot)
That returns a table of each one of those stats, plus the bias and the SE, each on its own line. So Group1 mean and bias and SE, then Group1 SD and bias and SE on the next line, and on down the list, 4 lines for each group.
I would like to do the exact same thing, but for each column in the data frame instead of for groups of cases.
I have kind of been floundering. It seems like this should not be that difficult to do, but all the examples and tutorials online are either for one column (which I'm able to do just fine by specifying it in the statistic function), or for trickier manipulations such as in the above.
Any help would be greatly appreciated.
I was able to figure it out using either apply() or purrr::map(). Here's the apply() version of the statistic function, which doesn't require any extra (tidyverse) packages installed:
ApStat <- function(data, i) {
  temp <- data[i, ]
  desc <- apply(temp, 2, FUN = function(x) {
    list(mean = mean(x, trim = 0.2),  # 20% trimmed mean
         sd   = sd(x),
         skew = skewness(x),
         kurt = kurtosis(x))
  })
  l.desc <- unlist(desc)
  return(l.desc)
}
Thanks to the guys/gals at r/rstats!
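For completeness, a hedged end-to-end sketch of how this statistic function plugs into boot(); the data frame here (the four numeric columns of iris) is only a stand-in for your own data:

library(boot)
library(moments)   # skewness(), kurtosis()

dataframe <- iris[, 1:4]   # any all-numeric data frame

set.seed(1)
DasBoot <- boot(dataframe, ApStat, R = 1000)
print(DasBoot)   # original value, bias, and SE for every column/statistic combination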

Generate numbers with specific correlation [with only positive values in the output]

I want to obtain a dataframe with simulated values which have a specific correlation to each other.
I need to use this function, but in the returned output there are also negative values, which do not have meaning for my purposes:
COR <- function(n, xmean, xsd, ymean, ysd, correlation) {
  x <- rnorm(n)
  y <- rnorm(n)
  z <- correlation * scale(x)[, 1] +
       sqrt(1 - correlation^2) * scale(resid(lm(y ~ x)))[, 1]
  xresult <- xmean + xsd * scale(x)[, 1]
  yresult <- ymean + ysd * z
  data.frame(x = xresult, y = yresult)
}
Please note that my question starts from this previous post (currently closed):
another similar discussion
Is there a method able to exclude from the final output all the rows which have at least one negative value? (In other words, x and y must always be positive.)
I spent many hours without any concrete result.
Filtering rows which have at least one negative value can be done with the apply function, e.g.
df <- COR(100, 1, 1, 1, 1, 0.8)
filter <- apply(df, 1, function(x) sum(x < 0) > 0)
df <- df[!filter, ]
plot(df)
First, I create a data frame df from your function. Then, I apply the function sum(x < 0) > 0 row-wise to the data frame (the second argument of apply, 1, indicates going along the first dimension of the data frame or array). This creates a logical vector that is TRUE for every row with at least one negative value. Subsetting the data frame with the inverse of that (!filter) leaves you with all rows that have no negative values.
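An equivalent, slightly more compact filter uses rowSums() instead of apply():

df <- COR(100, 1, 1, 1, 1, 0.8)
df <- df[rowSums(df < 0) == 0, ]   # keep only rows with no negative values
plot(df)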
UPDATE:
It seems the package VineCopula offers functions to create distributions with a given correlation. However, I did not dive deeply enough into the math to fully grasp how copulas (i.e. multivariate probability distributions) work. Using this package, you can at least create, e.g., two Gaussian distributions.
library(VineCopula)

BC <- BiCop(family = 1, par = 0.9)   # family = 1 is the Gaussian copula; par is its correlation parameter
sim <- BiCopSim(N = 1000, obj = BC)

cor(sim[, 1], sim[, 2])
plot(sim)
You might then be able to scale the resulting matrix to achieve a certain standard deviation.
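The simulated values have uniform margins on (0, 1), so they are already strictly positive; to move them onto another strictly positive scale you could push each margin through a positive-valued quantile function such as the log-normal. A hedged sketch (the meanlog/sdlog values are arbitrary placeholders):

x <- qlnorm(sim[, 1], meanlog = 0, sdlog = 0.5)   # strictly positive
y <- qlnorm(sim[, 2], meanlog = 1, sdlog = 0.5)   # strictly positive
cor(x, y)    # rank correlation is preserved exactly; Pearson correlation approximately
plot(x, y)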

How can I find numerical intervals of k-means clusters?

I'm trying to discretize a numerical variable using k-means.
It worked pretty well, but I'm wondering how I can find the intervals in my clusters.
I work with FactoMineR to do my k-means.
I found 3 clusters according to the following graph:
My point now is to identify the intervals of my numerical variable within the clusters.
Is there any option or method in FactoMineR or another package to do it?
I can do it manually, but as I have to do it for a number of variables, I'd like to find an easy way to identify them.
Since you have not provided data, I have used the example from the kmeans documentation, which produces two groups for data with two columns, x and y. You can split the original data by the cluster each row belongs to and then extract data from each group. I am not sure if my example data resembles your data, but in the code below I have simply used the difference between the minimum value of column x and the maximum value of column y as the boundaries of a potential interval (whether this makes sense depends on the use case). Does that help you?
data <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
              matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(data) <- c("x", "y")

cl <- kmeans(data, 2)
data <- as.data.frame(cbind(data, cluster = cl$cluster))

lapply(split(data, data$cluster), function(x) {
  min_x <- min(x$x)
  max_y <- max(x$y)
  diff  <- max_y - min_x
  c(min_x = min_x, max_y = max_y, diff = diff)
})
# $`1`
#      min_x      max_y       diff
# -0.6906124  0.5123950  1.2030074
#
# $`2`
#     min_x     max_y      diff
# 0.2052112 1.6941800 1.4889688
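If the variable being discretized is a single numeric column (the usual case when binning one variable with k-means), the interval for each cluster can be read off directly as that cluster's minimum and maximum; a hedged sketch on made-up data, assuming the variable is called v:

v  <- c(rnorm(50, mean = 2), rnorm(50, mean = 6), rnorm(50, mean = 10))  # made-up 1-D variable
cl <- kmeans(v, centers = 3)

# (min, max) of v within each cluster = the interval covered by that cluster
tapply(v, cl$cluster, range)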
