Bootstrap-t Method for Comparing Trimmed Means in R

I am confused by the different robust methods for comparing independent means; I found a good explanation in statistical textbooks, for example yuen() for the case of equal sample sizes. My samples are rather unequal in size, so I would like to try a bootstrap-t method (from Wilcox's book Introduction to Robust Estimation and Hypothesis Testing, p. 163). It says yuenbt() would be a possible solution.
But all the textbooks say I can simply pass vectors here:
yuenbt(x,y,tr=0.2,alpha=0.05,nboot=599,side=F)
If I check the locally installed help, however, it says:
yuenbt(formula, data, tr = 0.2, nboot = 599)
What's wrong with my attempt:
x <- c(1,2,3)
y <- c(5,6,12,30,2,2,3,65)
yuenbt(x,y)
Why can't I call the yuenbt() function with my two vectors? Thank you very much.

Looking at the help for yuenbt (for those wondering, yuenbt is from the package WRS2...), it takes a formula and a data frame as arguments. My impression is that it expects data in long format.
With your example data, we can achieve that like so:
library(WRS2)
x <- c(1,2,3)
y <- c(5,6,12,30,2,2,3,65)
# long format: one 'value' column plus a 'group' factor identifying the sample
dat <- data.frame(value = c(x, y), group = rep(c("x", "y"), c(length(x), length(y))))
We can then use the function:
yuenbt(value~group, data=dat)
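For completeness, the same long-format reshaping can be done with base R's stack(); the following is just a sketch equivalent to the data.frame() construction above (tr and nboot are the defaults quoted in the question, and with only three observations in x the bootstrap output won't mean much):
# stack() returns a data frame with a 'values' column and an 'ind' grouping factor
dat2 <- stack(list(x = x, y = y))
yuenbt(values ~ ind, data = dat2, tr = 0.2, nboot = 599)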

Related

Centering/standardizing variables in R [duplicate]


sample variance

function(x) var(sum(((x - mean(x))^2)/(n - 1)))
This is my function but it does not seem to work.
Are we allowed to use var() inside the function?
Unless I'm mistaken, this is significantly harder than you think; you can't compute it directly the way you're trying to do it. (Hint: when you take the sum() you get a single value, i.e. a vector of length 1, so the sample variance of that vector is NA.) This post on Mathematics Stack Exchange derives the variance of the sample variance as mu_4/n - sigma^4*(n-3)/(n*(n-1)), where mu_4 is the fourth central moment and sigma^4 is the squared variance; you could also see this post on CrossValidated or this Wikipedia page (different derivations present the result in terms of the fourth central moment, the kurtosis, or the excess kurtosis). So, plugging in sample estimates:
set.seed(101)
r <- rnorm(100)
mu_4 <- mean((r-mean(r))^4)      # estimated fourth central moment
sigma_4 <- var(r)^2              # squared sample variance
n <- length(r)
mu_4/n - sigma_4*(n-3)/(n*(n-1)) ## 0.01122
Be careful:
var(replicate(100000,var(rnorm(100))))
gives about 0.0202. At least it's not the wrong order of magnitude, but it's a factor of 2 different from the estimate above. My guess is that the estimate is itself highly variable, which wouldn't be surprising since it depends on the fourth moment (I tried the estimation method above a few times and it does indeed seem highly variable).
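To wrap this up in the kind of function the question asks for, a minimal sketch (the name var_of_sample_var is mine):
var_of_sample_var <- function(x) {
  n <- length(x)
  mu_4 <- mean((x - mean(x))^4)   # estimated fourth central moment
  sigma_4 <- var(x)^2             # squared sample variance
  mu_4/n - sigma_4*(n - 3)/(n*(n - 1))
}
var_of_sample_var(rnorm(100))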

Samples from scaled inverse chisquare distribution

I want to generate samples from a scaled inverse chi-squared distribution in R. I know geoR has an R function for this, but I want to generate them via the gamma distribution.
I think these two are equivalent:
X ~ rinvchisq(100, df=d, scale=s)
1/X ~ rgamma(100, shape=d/2, scale=2/(d*s))
Aren't they? Could there be any numerical problems due to extreme values?
More specifically you would need X <- rinvchisq(...) and X <- 1/rgamma(...) (the ~ notation works this way in programs such as WinBUGS, and in statistics notation, but not in R). If you look at the code of geoR::rinvchisq, the relevant part is just
return((df * scale)/rchisq(n, df = df))
so if you have problems taking the reciprocal of very large or small chi-squared deviates you'll be in trouble anyway (although rchisq is internally using .External(C_rchisq, n, df), which falls through to C code, presumably for efficiency in this special case, rather than calling rgamma). If I were you I would go ahead and superimpose densities of some test samples just to make sure I hadn't screwed up the arithmetic or parameterization somewhere ...
For what it's worth there are also rinvgamma() functions in a variety of packages (library(sos); findFn("rinvgamma"))
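Here is a minimal sketch of the density-overlay check suggested above (geoR is assumed to be installed; d, s, and the sample size are made-up illustrative values):
library(geoR)
set.seed(101)
d <- 5; s <- 2; n <- 1e5
x1 <- rinvchisq(n, df = d, scale = s)
x2 <- 1/rgamma(n, shape = d/2, scale = 2/(d*s))
# compare on the log scale so the heavy right tail doesn't dominate the picture
plot(density(log(x1)), main = "scaled inverse chi-squared samples (log scale)")
lines(density(log(x2)), col = "red")   # the two curves should overlap closely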

Understanding `scale` in R

I'm trying to understand the definition of scale that R provides. I have data (mydata) that I want to make a heat map with, and there is a VERY strong positive skew. I've created a heatmap with a dendrogram for both scale(mydata) and log(mydata), and the dendrograms are different for both. Why? What does it mean to scale my data, versus log-transform my data? And which would be more appropriate if I want to look at the dendrogram illustrating the relationship between the columns of my data?
Thank you for any help! I've read the definitions, but they go right over my head.
log simply takes the logarithm (base e, by default) of each element of the vector.
scale, with default settings, will calculate the mean and standard deviation of the entire vector, then "scale" each element by those values by subtracting the mean and dividing by the sd. (If you use scale(x, scale=FALSE), it will only subtract the mean but not divide by the std deviation.)
Note that these two give you the same values:
set.seed(1)
x <- runif(7)
# manual scaling: subtract the mean, divide by the standard deviation
(x - mean(x)) / sd(x)
# same values, returned as a one-column matrix with the centering/scaling attributes attached
scale(x)
This is nothing more than a standardization of the data. The values it creates are known under several different names, one of them being z-scores ("z" because the standard normal distribution is also known as the "Z distribution").
More can be found here:
http://en.wikipedia.org/wiki/Standard_score
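Since the question is about a whole data matrix for a heatmap, note also that scale() works column by column on a matrix, while log() is applied element-wise. A small sketch with made-up, positively skewed data:
set.seed(1)
m <- matrix(rexp(30, rate = 0.1), nrow = 10)   # skewed toy data
round(colMeans(scale(m)), 10)                  # every column now has mean 0
apply(scale(m), 2, sd)                         # ... and standard deviation 1
range(log(m)); range(scale(m))                 # log compresses the values, scale recenters and rescales them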
This is a late addition, but I was looking for information on the scale function myself and thought it might help somebody else as well.
To modify the response from Ricardo Saporta a little bit: when the data are not centered, scaling is not done using the standard deviation, at least not in version 3.6.1 of R. I base this on "Becker, R. (2018), The New S Language, CRC Press" and my own experimentation: scale() divides each column by its root mean square, sqrt(sum(x^2)/(n - 1)), which coincides with the standard deviation only when the column has been centered.
X.man.scaled <- X/sqrt(sum(X^2)/(length(X) - 1))   # divide by the root mean square
X.aut.scaled <- scale(X, center = FALSE)
The results of these two lines are exactly the same; I show it without centering for simplicity.
I would respond in a comment but did not have enough reputation.
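A quick check that the two lines above agree, using an arbitrary test vector (the runif() call is purely for illustration):
X <- runif(7)
all.equal(X/sqrt(sum(X^2)/(length(X) - 1)), as.vector(scale(X, center = FALSE)))   # TRUE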
I thought I would contribute by providing a concrete example of the practical use of the scale function. Say you have 3 test scores (Math, Science, and English) that you want to compare. You may even want to generate a composite score based on each of the 3 tests for each observation. Your data could look like this:
student_id <- seq(1,10)
math <- c(502,600,412,358,495,512,410,625,573,522)
science <- c(95,99,80,82,75,85,80,95,89,86)
english <- c(25,22,18,15,20,28,15,30,27,18)
df <- data.frame(student_id,math,science,english)
Obviously it would not make sense to compare the means of these 3 scores, as the scales of the scores are vastly different. By scaling them, however, you get more comparable scoring units:
z <- scale(df[,2:4],center=TRUE,scale=TRUE)
You could then use these scaled results to create a composite score, for instance by averaging the values and assigning a grade based on the percentiles of this average, as sketched below. Hope this helped!
Note: I borrowed this example from the book "R In Action". It's a great book! Would definitely recommend.
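As a minimal sketch of that composite-score idea (rowMeans(), the quartile cut points, and the letter grades are illustrative choices of mine, not taken from R in Action):
composite <- rowMeans(z)                       # average of the three z-scores
breaks <- quantile(composite, probs = c(0, 0.25, 0.5, 0.75, 1))
df$grade <- cut(composite, breaks = breaks, labels = c("D", "C", "B", "A"), include.lowest = TRUE)
df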

Calculate many AUCs in R

I am fairly new to R. I am using the ROCR package in R to calculate AUC, which I can do for one predictor just fine. What I am looking to do is perform many AUC calculations for 100 different variables.
What I have done so far is the following:
varlist <- names(mydata)[2:101]
formlist <- lapply(varlist, function(x) paste0("prediction(",x,"mydata$V1))
However, the formulas are then in text format, and as.formula is giving me an error. Any help appreciated! Thanks in advance!
The function inside your lapply is just building a text string like "prediction(varmydata$V1". I am guessing you actually want to run that command. If so, you probably want something like
lapply(varlist, function(x) prediction(mydata[[x]], mydata$V1))
but it is hard to tell without a reproducible example. Also, it looks like your paste0() call has a missing quote (and a missing comma between the two arguments).
If I understand you correctly, you want to use the first column of mydata as predictions, and all other variables as labels, one after the other.
Is this the correct way to treat mydata? This setup is rather uncommon; it is more common to have the same true labels for several different predictions (e.g. iterated cross-validation, or a comparison of different classifiers).
However, to answer your original question:
predictions and labels need to have the same shape for ROCR::prediction, e.g.
either as matrices (one column per label variable):
prediction(matrix(rep(mydata$V1, ncol(mydata) - 1), ncol = ncol(mydata) - 1), mydata[, -1])
or as lists (here via data frame subsetting):
prediction(mydata[rep(1, ncol(mydata) - 1)], mydata[-1])
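To then pull the AUC values themselves out of the ROCR objects (which is what the question is ultimately after), a minimal sketch assuming the labels in columns 2:101 are binary:
library(ROCR)
pred <- prediction(mydata[rep(1, ncol(mydata) - 1)], mydata[-1])
auc <- performance(pred, measure = "auc")
unlist(auc@y.values)   # one AUC per label column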
