Right (or left) side trimmed mean - r

Using:
mean (x, trim=0.05)
This removes 5% from each side of the distribution (trim is the fraction cut from each end), which is fine for symmetrical two-tailed data. But if I have one-tailed or highly asymmetric data I would like to be able to remove just one side of the distribution. Is there a function for this or do I have to write a new one myself? If so, how?

Just create a modified mean.default. First look at mean.default:
mean.default
Then modify it to accept a new argument:
mean.default <-
function (x, trim = 0, na.rm = FALSE, ..., side = "both")
{
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
        warning("argument is not numeric or logical: returning NA")
        return(NA_real_)
    }
    if (na.rm)
        x <- x[!is.na(x)]
    if (!is.numeric(trim) || length(trim) != 1L)
        stop("'trim' must be numeric of length one")
    n <- length(x)
    if (trim > 0 && n) {
        if (is.complex(x))
            stop("trimmed means are not defined for complex data")
        if (any(is.na(x)))
            return(NA_real_)
        if (trim >= 0.5)
            return(stats::median(x, na.rm = FALSE))
        # 'side' names the tail that gets trimmed: "left", "right" or "both"
        lo <- if (side == "both" || side == "left") floor(n * trim) + 1 else 1
        hi <- if (side == "both" || side == "right") n + 1 - (floor(n * trim) + 1) else n
        x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    }
    .Internal(mean(x))
}

I don't know of a function. Something like the following would trim off the upper tail of the distribution before taking the mean.
upper.trim.mean <- function(x,trim) {
x <- sort(x)
mean(x[1:floor(length(x)*(1-trim))])
}

This should account for either side, or both sides for trimming.
trim.side.mean <- function(x, trim, type = "both") {
  if (type == "both") {
    mean(x, trim)
  } else if (type == "right") {
    x <- sort(x)
    mean(x[1:floor(length(x) * (1 - trim))])
  } else if (type == "left") {
    x <- sort(x)
    # start one past the trimmed observations so that they are actually dropped
    mean(x[(floor(length(x) * trim) + 1):length(x)])
  }
}

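one.sided.trim.mean <- function(x, trim, upper = TRUE) {
  if (upper) {
    # trim the upper tail: keep values below the (1 - trim) quantile
    mean(x[x < quantile(x, 1 - trim)])
  } else {
    # trim the lower tail: keep values above the trim quantile
    mean(x[x > quantile(x, trim)])
  }
}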

I found that the answers posted do not match when checked manually, so I created one of my own. It's long but simple enough to understand. Note that trim here is the number of observations to drop from a side, not a fraction.
get_trim <- function(x,trim,type)
{
x <- sort(x)
ans<-0
if (type=="both")
{
for (i in (trim+1):(length(x)-trim))
{
ans=ans+x[i];
}
print(ans/(length(x)-(2*trim)))
}
else if(type=="left")
{
for (i in (trim+1):(length(x)))
{
ans=ans+x[i];
}
print(ans/(length(x)-trim))
}
else if (type=="right")
{
for (i in 1:(length(x)-trim))
{
ans=ans+x[i];
}
print(ans/(length(x)-trim))
}
}
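For comparison, here is a minimal sketch of a fraction-based one-sided trimmed mean in base R. The name one_sided_trim_mean and its side argument are hypothetical; the number of dropped observations follows the floor(n * trim) rule used by mean.default, and side names the tail that is removed.

```r
# Hypothetical helper: trim a fraction `trim` from one tail only.
one_sided_trim_mean <- function(x, trim = 0.05, side = c("right", "left")) {
  side <- match.arg(side)
  stopifnot(is.numeric(x), trim >= 0, trim < 1)
  x <- sort(x)             # sort() drops NA values by default
  n <- length(x)
  k <- floor(n * trim)     # number of observations to drop
  if (side == "right") {
    mean(x[seq_len(n - k)])      # drop the k largest values
  } else {
    mean(x[seq.int(k + 1, n)])   # drop the k smallest values
  }
}

x <- c(1:9, 100)
one_sided_trim_mean(x, trim = 0.1, side = "right")  # mean(1:9) = 5
one_sided_trim_mean(x, trim = 0.1, side = "left")   # mean(c(2:9, 100)) = 16
```

With ten values and trim = 0.1 exactly one observation is dropped, so the result is easy to check by hand.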

Related

How to define a range of values for an argument inside a function?

Suppose we have the following function:
demo_function <- function(x){
if(is.na(x)){
return(NA)
} else if(1 < x < 2){
return("something")
} else {
return("Nothing")
}
}
The idea is that when the argument x is between 1 and 2, say x=1.5, the function returns "something".
However, when I try to run the above function, the following error arises:
Error: no function to return from, jumping to top level
How can I adjust the function to get "something" for the specified argument?
The issue is in the else if condition: R does not support chained comparisons as in mathematical notation. Each comparison is a separate expression, and the expressions are joined with logical operators:
else if(1 < x && x < 2)
i.e.
demo_function <- function(x){
if(is.na(x)){
return(NA)
} else if(1 < x && x < 2){
return("something")
} else {
return("Nothing")
}
}
> demo_function(0.01)
[1] "Nothing"
> demo_function(1.5)
[1] "something"
> demo_function(NA)
[1] NA
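If the function should also accept a whole vector, && no longer fits, because it expects single values. A minimal sketch of a vectorized variant, using the element-wise & and ifelse (demo_vec is a hypothetical name, not part of the original answer):

```r
# Hypothetical vectorized variant of demo_function
demo_vec <- function(x) {
  ifelse(is.na(x),
         NA_character_,
         ifelse(1 < x & x < 2, "something", "Nothing"))
}

demo_vec(c(0.01, 1.5, NA))
# [1] "Nothing"   "something" NA
```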

How to best combine unique and match in R?

I often find myself writing code such as
#' @param x input vector
#' @param ... passed to [slow_fun()]
fast_fun <- function(x, ...) {
u <- unique(x)
i <- match(x, u)
v <- slow_fun(u, ...)
v[i]
}
I do this to accelerate a slow vectorized "pure" function where each input entry could in principle be computed independently and where the input is expected to contain many duplicates.
Now I wonder whether this is the best way to achieve such a speedup, or whether there is some function (preferably in base R or the tidyverse) that does something like unique and match at the same time.
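To make the pattern concrete, here is a small self-contained sketch. slow_nchar is a made-up stand-in for an expensive pure function and dedup_apply is a hypothetical generalization of fast_fun; neither comes from a package.

```r
# Hypothetical stand-in for an expensive, pure, vectorized function
slow_nchar <- function(x) {
  Sys.sleep(0.001 * length(x))  # pretend each element is costly
  nchar(x)
}

# Evaluate `fun` once per distinct value, then expand back to full length
dedup_apply <- function(x, fun, ...) {
  u <- unique(x)
  v <- fun(u, ...)
  v[match(x, u)]
}

set.seed(1)
x <- sample(c("a", "bb", "ccc"), 1000, replace = TRUE)
res <- dedup_apply(x, slow_nchar)   # slow_nchar sees only 3 values
```

The result has the same length and order as x, while the slow function is called on the distinct values only.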
Benchmarks so far
Thanks for the provided answers. I've written a small benchmark suite to compare the approaches:
method <- list(
brute = slow_fun,
unique_match = function(x, ...) {
u <- unique(x)
i <- match(x, u)
v <- slow_fun(u, ...)
v[i]
},
unique_factor = function(x, ...) {
if (is.character(x)) {
x <- factor(x)
i <- as.integer(x)
u <- levels(x)
} else {
u <- unique(x)
i <- as.integer(factor(x, levels = u))
}
v <- slow_fun(u, ...)
v[i]
},
unique_match_df = function(x, ...) {
u <- unique(x)
i <- if (is.numeric(x)) {
match(data.frame(t(round(x, 10))), data.frame(t(round(u, 10))))
} else {
match(data.frame(t(x)), data.frame(t(u)))
}
v <- slow_fun(u, ...)
v[i]
},
rcpp_uniquify = function(x, ...) {
iu <- uniquify(x)
v <- slow_fun(iu[["u"]], ...)
v[iu[["i"]]]
}
)
exprs <- lapply(method, function(fun) substitute(fun(x), list(fun = fun)))
settings$bench <- lapply(seq_len(nrow(settings)), function(i) {
cat("\rBenchmark ", i, " / ", nrow(settings), sep = "")
x <- switch(
settings$type[i],
integer = sample.int(
n = settings$n_distinct[i],
size = settings$n_total[i],
replace = TRUE
),
double = sample(
x = runif(n = settings$n_distinct[i]),
size = settings$n_total[i],
replace = TRUE
),
character = sample(
x = stringi::stri_rand_strings(
n = settings$n_distinct[i],
length = 20L
),
size = settings$n_total[i],
replace = TRUE
)
)
microbenchmark::microbenchmark(
list = exprs
)
})
library(tidyverse)
settings %>%
mutate(
bench = map(bench, summary)
) %>%
unnest(bench) %>%
group_by(n_distinct, n_total, type) %>%
mutate(score = median / min(median)) %>%
group_by(expr) %>%
summarise(mean_score = mean(score)) %>%
arrange(mean_score)
Currently, the Rcpp-based approach is best in all tested settings on my machine, but it only barely beats the unique-then-match method.
I suspect the advantage grows as x gets longer, because unique-then-match needs two passes over the data while uniquify() needs only one.
|expr | mean_score|
|:---------------|----------:|
|rcpp_uniquify | 1.018550|
|unique_match | 1.027154|
|unique_factor | 5.024102|
|unique_match_df | 36.613970|
|brute | 45.106015|
Maybe you can try factor + as.integer like below
as.integer(factor(x))
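One thing to keep in mind with this approach: the integer codes index into levels(factor(x)), which are sorted, not into unique(x), which keeps the order of first appearance. A short illustration with made-up data:

```r
x <- c("b", "a", "b", "c")
f <- factor(x)
i <- as.integer(f)   # 2 1 2 3
u <- levels(f)       # "a" "b" "c" (sorted)

u[i]                 # "b" "a" "b" "c", reproduces x
unique(x)[i]         # "a" "b" "a" "c", wrong lookup table
```

So the lookup table paired with these codes should be levels(f), as in the unique_factor method of the benchmark above.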
I found a cool, and fast, answer recently,
match(data.frame(t(x)), data.frame(t(y)))
As always, beware when working with floats. I recommend something like
match(data.frame(t(round(x,10))), data.frame(t(round(y,10))))
in such cases.
I've finally managed to beat unique() and match() using Rcpp to hand-code the algorithm in C++ using a std::unordered_map as core bookkeeping data structure.
Here is the source code, which can be used in R by writing it into a file and running Rcpp::sourceCpp on it.
#include <Rcpp.h>
#include <unordered_map>
using namespace Rcpp;
template <int T>
List uniquify_impl(Vector<T> x) {
IntegerVector idxes(x.length());
typedef typename Rcpp::traits::storage_type<T>::type storage_t;
std::unordered_map<storage_t, int> unique_map;
int n_unique = 0;
// 1. Pass through x once
for (int i = 0; i < x.length(); i++) {
storage_t curr = x[i];
int idx = unique_map[curr];
if (idx == 0) {
unique_map[curr] = ++n_unique;
idx = n_unique;
}
idxes[i] = idx;
}
// 2. Write the unique values out in order of first appearance
Vector<T> uniques(unique_map.size());
for (auto &pair : unique_map) {
uniques[pair.second - 1] = pair.first;
}
return List::create(
_["u"] = uniques,
_["i"] = idxes
);
}
// [[Rcpp::export]]
List uniquify(RObject x) {
switch (TYPEOF(x)) {
case INTSXP: {
return uniquify_impl(as<IntegerVector>(x));
}
case REALSXP: {
return uniquify_impl(as<NumericVector>(x));
}
case STRSXP: {
return uniquify_impl(as<CharacterVector>(x));
}
default: {
warning(
"Invalid SEXPTYPE %d (%s).\n",
TYPEOF(x), type2name(x)
);
return R_NilValue;
}
}
}

What is the difference between these two functions?

I don't know what the difference between these two functions is. When I use ml.norm(iris[1:4], mode="uv",na.rm=FALSE) and dh.norm(iris[1:4], mode="uv",na.rm=FALSE), the results are different.
ml.norm <- function(x, mode="uv", na.rm=FALSE){
if(class(x)=="data.frame"){
x <- as.matrix(x)
}
else{
return (apply(x,2,ml.norm, mode=mode, na.rm=na.rm))
}
if (mode =="uv"){
x = x/sd(x, na.rm=na.rm)
}
else if (mode =="z"){
x = (x-mean(x))/sd(x, na.rm=na.rm)
}
else{stop(paste("unknow mode", mode))}
return(x)
}
dh.norm <- function (x,mode="uv",na.rm=FALSE) {
# need to check if x is a matrix
if (is.data.frame(x)) {
x=as.matrix(x)
}
if (is.matrix(x)) {
return(apply(x,2,dh.norm,mode=mode,na.rm=na.rm))
}
if (mode == "uv") {
x = x/sd(x,na.rm=na.rm)
} else if (mode == "z") {
# your code here
x = (x - mean(x))/sd(x,na.rm=na.rm)
} else {
stop(paste("unknown mode",mode))
}
return(x)
}
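ml.norm
IF x IS a data.frame DO convert it into a matrix, ELSE apply ml.norm on the columns. THEN check the mode and DO stuff on the whole matrix at once.
dh.norm
IF x IS a data.frame DO convert it into a matrix. THEN check if x is a matrix and apply dh.norm on the columns. THEN check the mode and DO stuff.
So in ml.norm the return(apply(x,2,[YOUR FUNCTION],mode=mode,na.rm=na.rm)) part sits in the else branch and is never reached when you run it on a data.frame. sd(x) is then computed over all entries of the matrix instead of per column.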
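The effect can be seen directly on the iris columns. This is only an illustrative sketch of the two scalings, not a rewrite of either function:

```r
m <- as.matrix(iris[1:4])

pooled     <- sd(m)             # one value over all 600 entries
per_column <- apply(m, 2, sd)   # four values, one per column

whole  <- m / pooled                      # what ml.norm effectively does
by_col <- sweep(m, 2, per_column, "/")    # per-column scaling, as in dh.norm

apply(by_col, 2, sd)   # every column now has standard deviation 1
```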

implement matrix determinant in R

I was asked to implement a function that calculates the determinant of an n x n matrix using Laplace expansion. This involves recursion. I developed this:
minor<-function(A,i,j) {
return(A[c(1:(i-1),(i+1):dim(A)[1]),c(1:(j-1),(j+1):dim(A)[2])])
}
determinantRec<-function(X,k) {
if (dim(X)[1] == 1 && dim(X)[2] == 1) return(X[1][1])
else {
s = 0
for (i in 1:dim(X)[2]) {
s = s + X[k][i]*(-1)^(k+i)*determinantRec(minor(X,k,i),k)
}
return(s)
}
}
where k in determinantRec(X,k) indicates the row along which I want to apply the Laplace expansion.
My problem is when I run determinantRec(matrix(c(1,2,3,4),nrow = 2,ncol = 2),1) this error appears:
C stack usage 7970628 is too close to the limit
What is wrong with my code?
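@julia, there is one simple typo to watch for. Keep the '*' at the end of the line that updates 's': it tells R that the expression continues on the next line. Without it, the recursive call is evaluated separately and its result is discarded. Also index the matrix as X[k,i], not X[k][i].
determinantRek<-function(X,k) {
  if (dim(X)[1] == 1 && dim(X)[2] == 1)
    return(X[1,1])
  if (dim(X)[1] == 2 && dim(X)[2] == 2)
    return(X[1,1]*X[2,2]-X[1,2]*X[2,1])
  s = 0
  for (i in 1:dim(X)[2]) {
    # the trailing '*' continues the expression onto the next line
    s = s + X[k,i]*(-1)^(k+i)*
      determinantRek(X[-k,-i],k)
  }
  return(s)
}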
I did it this way and it works just fine for matrices of size 3x3 and larger, although it is very slow compared to the det function in base R
laplace_expansion <- function(mat){
det1 <- function(mat){
mat[1]*mat[4]-mat[2]*mat[3]
}
determinant <- 0
for(j in 1:ncol(mat)){
mat1 <- mat[-1,-j]
if(nrow(mat1) == 2){
determinant <- determinant+mat[1,j]*(-1)^(1+j)*det1(mat1)
}else{
val <- mat[1,j]*(-1)^(1+j)
if(val != 0){
determinant <- determinant+val*laplace_expansion(mat1)
}
}
}
return(determinant)
}
This is my approach, I think it's cleaner.
deter <- function(X) {
stopifnot(is.matrix(X))
stopifnot(identical(ncol(X), nrow(X)))
if (all(dim(X) == c(1, 1))) return(as.numeric(X))
i <- 1:nrow(X)
out <- purrr::map_dbl(i, function(i){
X[i, 1] * (-1)^(i + 1) * deter(X[-i, -1, drop = FALSE])
})
return(sum(out))
}
Thank you @ArtemSokolov and @MrFlick for pointing out the cause of the problem; that was it. I also discovered that this code did not calculate the determinant of a 2x2 matrix properly. In the end it looks like this:
determinantRek<-function(X,k) {
if (dim(X)[1] == 1 && dim(X)[2] == 1)
return(X[1,1])
if (dim(X)[1] == 2 && dim(X)[2] == 2)
return(X[1,1]*X[2,2]-X[1,2]*X[2,1])
else
s = 0
for (i in 1:dim(X)[2]) {
s = s + X[k,i]*(-1)^(k+i)*
determinantRek(X[-k,-i],k)
}
return(s)
}
Debugging with browser() was also helpful :)
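For reference, a compact base R sketch of the same Laplace expansion that can be checked against det(). The name laplace_det is hypothetical. It uses drop = FALSE so that minors stay matrices all the way down to 1x1, and it expands along row k only at the top level, then along row 1 in the recursion so the row index always exists.

```r
# Hypothetical helper: determinant by Laplace expansion along row k
laplace_det <- function(X, k = 1) {
  stopifnot(is.matrix(X), nrow(X) == ncol(X))
  n <- nrow(X)
  if (n == 1) return(X[1, 1])
  s <- 0
  for (j in seq_len(n)) {
    minor <- X[-k, -j, drop = FALSE]   # keep matrix shape for the recursion
    s <- s + X[k, j] * (-1)^(k + j) * laplace_det(minor, 1)
  }
  s
}

laplace_det(matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2))  # 1*4 - 3*2 = -2
```

The running time grows factorially with n, so this is only practical for small matrices.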

R function wrapper that maintains function signature

I am trying to write a very simple function wrapper in R that will accept f and return g, where g returns zero whenever the first argument is not positive. I have the following code:
wrapper <- function(f) {
function(x, ...) {
if( x <= 0 ) { 0 }
else { f(x, ...) }
}
}
The wrapper works as expected, but is there a way to maintain the function signature?
> wdnorm <- wrapper(dnorm)
> args(dnorm)
function (x, mean = 0, sd = 1, log = FALSE)
NULL
> args(wdnorm)
function (x, ...)
NULL
I would like to do something like this (but obviously it doesn't work)
args(g) <- args(f)
Is this possible in R?
Here is what you want. Do you really need this, though?
wrapper <- function(f) {
f2 = function(x) {
if (x <= 0) { 0 }
else { do.call(f, as.list( match.call())[-1]) }
}
formals(f2) = formals(f)
f2
}
wdnorm <- wrapper(dnorm)
args(dnorm)
args(wdnorm)
wdnorm(-5)
wdnorm(5)
output
> args(dnorm)
function (x, mean = 0, sd = 1, log = FALSE)
NULL
> args(wdnorm)
function (x, mean = 0, sd = 1, log = FALSE)
NULL
> wdnorm(-5)
[1] 0
> wdnorm(5)
[1] 1.48672e-06
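An alternative sketch that avoids match.call and do.call: copy f and prepend the guard to its body, so the formals and the enclosing environment stay exactly those of f. The name guard_first_arg is hypothetical, and the approach assumes f is a closure (it will not work for primitives such as sum). As in the original wrapper, the guard expects a single value for the first argument.

```r
# Hypothetical helper: same signature as f, returns 0 when the first
# argument is not positive, otherwise runs the original body of f.
guard_first_arg <- function(f) {
  stopifnot(is.function(f), !is.primitive(f))
  first <- as.name(names(formals(f))[1])
  g <- f
  body(g) <- bquote({
    if (.(first) <= 0) return(0)
    .(body(f))
  })
  g
}

wdnorm <- guard_first_arg(dnorm)
args(wdnorm)          # same argument list as args(dnorm)
wdnorm(-5)            # 0
wdnorm(5, mean = 4)   # same as dnorm(5, mean = 4)
```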

Resources