Plotting CDF and PDF in R with a custom function

I was wondering if there is any way to plot this PDF and CDF in R. I found them on a different question a user asked, and was curious. The density is 0.5 on [0, 1), 1 on [1, 1.5), and 0 elsewhere; the CDF is its integral.
I know that I have to create a function and then plot it, but I'm struggling with the different parameters and am unsure how to translate this to R. I have only ever plotted a PDF/CDF using a normal distribution, or from datasets.

You can write the pdf and cdf as functions, using ifelse to specify the values within the ranges, then simply plot the functions over the desired range:
pdf <- function(x) {
  ifelse(x >= 0 & x < 1, 0.5, ifelse(x >= 1 & x < 1.5, 1, 0))
}
cdf <- function(x) {
  ifelse(x < 0, 0,
         ifelse(x >= 0 & x < 1, 0.5 * x,
                ifelse(x >= 1 & x < 1.5, x - 0.5, 1)))
}
plot(pdf, xlim = c(-1, 2), type = "s")
plot(cdf, xlim = c(-1, 2))
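As a quick sanity check (a minimal sketch, assuming the pdf and cdf definitions above), the cdf should agree with numerically integrating the pdf:
# Sketch: compare cdf() with a numerical integral of pdf(); should be TRUE up to integration error
F_num <- Vectorize(function(t) integrate(pdf, lower = -1, upper = t)$value)
xs <- seq(-1, 2, by = 0.25)
all.equal(cdf(xs), F_num(xs), tolerance = 1e-4)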

ifelse can be very slow, so we can fill an output vector instead. numeric() creates a vector of zeroes of the specified length; we then only change the entries that should not be zero.
pdf_vec <- function(x) {
  out <- numeric(length(x))
  out[x >= 0 & x < 1] <- 0.5
  out[x >= 1 & x < 1.5] <- 1
  out
}
cdf_vec <- function(x) {
  out <- numeric(length(x))
  tmp <- x >= 0 & x < 1
  out[tmp] <- 0.5 * x[tmp]
  tmp <- x >= 1 & x < 1.5
  out[tmp] <- x[tmp] - 0.5
  tmp <- x >= 1.5
  out[tmp] <- 1
  out
}
set.seed(42)
x <- rnorm(1e6)
stopifnot(all.equal(cdf(x), cdf_vec(x)))
stopifnot(all.equal(pdf(x), pdf_vec(x)))
@Allan Cameron already showed nicely how to plot it.
Microbenchmark
It's about three times faster than the ifelse solution.
microbenchmark::microbenchmark(
  cdf_ifelse = cdf(x), cdf_vec = cdf_vec(x), check = 'equal'
)
# Unit: milliseconds
#       expr       min        lq     mean    median        uq       max neval cld
# cdf_ifelse 110.66285 121.25428 133.0789 133.86041 142.53296 167.64401   100   b
#    cdf_vec  43.56277  45.08759  48.8924  46.83869  49.46047  74.95487   100  a
microbenchmark::microbenchmark(
  pdf_ifelse = pdf(x), pdf_vec = pdf_vec(x), check = 'equal'
)
# Unit: milliseconds
#       expr      min       lq      mean   median       uq      max neval cld
# pdf_ifelse 71.39501 76.71747 101.17738 81.43037 87.82162 192.3919   100   b
#    pdf_vec 27.82642 30.02056  34.55301 30.38457 31.29751 133.2798   100  a

We can try the code below, which builds the pdf as a sum of indicator functions and obtains the cdf by numerical integration:
f <- function(x) 0.5 * (x >= 0) + 0.5 * (x >= 1) - (x >= 1.5)
F <- Vectorize(function(x) integrate(f, -Inf, x)$value)
plot(f, -2, 2, col = "red")
curve(F, -2, 2, add = TRUE, col = "blue")
legend(-1, 0.8,
       legend = c("pdf", "cdf"),
       col = c("red", "blue"),
       lty = 1, cex = 2,
       box.col = "white"
)

Related

How to write an if-else function in R to generate a rule?

I want to generate a rule like the following:
rule(0,0,0) = 0
rule(0,0,1) = 1
rule(0,1,0) = 1
rule(0,1,1) = 1
rule(1,0,0) = 1
rule(1,0,1) = 0
rule(1,1,0) = 0
rule(1,1,1) = 0
I tried to write a set of if-else conditions that test for zeroes and ones in the input values, returning the corresponding value from the table as output, but it didn't work.
A single comparison will do. This would be the most performant implementation I can think of.
rule <- function(a, b, c) +(abs(4*a + 2*b + c - 2.5) < 2)
A parsimonious solution:
rule <- function(a, b, c) +xor(a, b | c)
Note both solutions are vectorized.
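To confirm that both formulas reproduce the question's truth table, here is a small check (a sketch; note that expand.grid varies its first column fastest, so expected follows that row order):
rule1 <- function(a, b, c) +(abs(4*a + 2*b + c - 2.5) < 2)
rule2 <- function(a, b, c) +xor(a, b | c)
inputs <- expand.grid(a = 0:1, b = 0:1, c = 0:1)
expected <- c(0, 1, 1, 0, 1, 0, 1, 0) # rule values in expand.grid row order
stopifnot(all(rule1(inputs$a, inputs$b, inputs$c) == expected),
          all(rule2(inputs$a, inputs$b, inputs$c) == expected))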
Benchmarking:
rule1 <- function(a, b, c) +(abs(4*a + 2*b + c - 2.5) < 2)
rule2 <- function(a, b, c) +xor(a, b | c)
rule3 <- function(x, y, z) +((4 * x + 2 * y + z) %in% 1:4)
abc <- matrix(sample(0:1, 3e6, replace = TRUE), 1e6, 3)
microbenchmark::microbenchmark(rule1 = rule1(abc[, 1], abc[, 2], abc[, 3]),
                               rule2 = rule2(abc[, 1], abc[, 2], abc[, 3]),
                               Thomas = rule3(abc[, 1], abc[, 2], abc[, 3]),
                               check = "identical")
#> Unit: milliseconds
#>    expr     min       lq     mean   median       uq     max neval
#>   rule1 13.5161 16.58425 20.74505 17.69030 20.22745 53.9513   100
#>   rule2 32.7552 35.05735 39.41473 36.27760 39.38165 74.8564   100
#>  Thomas 24.6562 28.39065 33.78937 29.70875 33.19045 65.9709   100
A possibility is to concatenate the three digits and then use switch:
rule <- function(a, b, c) {
  x <- paste0(a, b, c)
  switch(
    x,
    "000" = 0,
    "001" = 1,
    ......
  )
}
You can try
rule <- function(x, y, z) {
  +((4 * x + 2 * y + z) %in% 1:4)
}
or
rule <- function(x, y, z) {
  bitwXor(x, bitwOr(y, z))
}
Benchmark
rule1 <- function(a, b, c) +(abs(4 * a + 2 * b + c - 2.5) < 2)
rule2 <- function(a, b, c) +xor(a, b | c)
rule3 <- function(x, y, z) +((4 * x + 2 * y + z) %in% 1:4)
rule4 <- function(x, y, z) bitwXor(x, bitwOr(y, z))
abc <- matrix(sample(0:1, 3e6, replace = TRUE), 1e6, 3)
microbenchmark::microbenchmark(
  rule1 = rule1(abc[, 1], abc[, 2], abc[, 3]),
  rule2 = rule2(abc[, 1], abc[, 2], abc[, 3]),
  Thomas1 = rule3(abc[, 1], abc[, 2], abc[, 3]),
  Thomas2 = rule4(abc[, 1], abc[, 2], abc[, 3]),
  check = "identical"
)
gives
Unit: milliseconds
    expr     min       lq     mean   median       uq      max neval
   rule1 16.1315 22.82880 32.91071 24.48080 28.29635 113.1915   100
   rule2 33.6093 40.93665 50.12914 44.77415 48.90045 128.0033   100
 Thomas1 26.6938 34.78615 43.34770 37.63255 42.49940 114.3973   100
 Thomas2  9.1119 12.25080 18.46705 16.26445 18.46835 105.1263   100
If your rule can't be expressed as a mathematical expression (as jblood94 and ThomasIsCoding have done), you can create a list with the possible inputs, check which one was passed, and return the result associated with that input.
rule = function(n1, n2, n3){
  combin = list(c(0,0,0), c(1,0,0), c(0,1,0), c(0,0,1), c(1,1,0), c(1,0,1), c(0,1,1), c(1,1,1))
  result = c(0, 1, 1, 1, 0, 0, 1, 0)
  index = which(sapply(combin, function(x){identical(x, c(n1, n2, n3))}))
  result[index]
}
rule(0, 1, 0)
[1] 1
For the fun of brevity:
rule <- function(first, second, third){
  bits_as_decimal = paste(first, second, third, sep = '') |>
    strtoi(base = 2)
  bits_as_decimal %in% 1:4 |> as.integer()
}
What happens:
the three arguments are treated as one three-bit binary number (e.g. 001)
the binary is converted to its decimal equivalent (strtoi)
check whether the decimal equivalent falls into the range 1-4 (the cases you want to map to 1)
re-cast the resulting boolean to an integer 0/1 (a short demo follows)
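A quick illustration of the conversion steps (a small sketch using inputs from the question's table):
# binary strings -> decimal via strtoi, then the range check from the function above
strtoi(c("000", "001", "100", "111"), base = 2)
#[1] 0 1 4 7
as.integer(strtoi("100", base = 2) %in% 1:4)
#[1] 1   (so rule(1, 0, 0) is 1)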

Split a vector into chunks such that sum of each chunk is approximately constant

I have a large data frame with more than 100 000 records where the values are sorted
For example, consider the following dummy data set
df <- data.frame(values = c(1,1,2,2,3,4,5,6,6,7))
I want to split the above values into 3 groups (keeping the original order) such that the sum of each group is more or less the same.
So for the above vector, if I divide the sorted df into 3 groups as follows, their sums will be
1. 1 + 1 + 2 + 2 + 3 + 4 = 13
2. 5 + 6 = 11
3. 6 + 7 = 13
How can I set up this optimization in R? Any logic?
So, let's use pruning. I think the other solutions give a good split, but not the best one.
First, we want to minimize
(S_p - S*)^2 + (S_q - S_p - S*)^2 + (S_n - S_q - S*)^2, with S* = S_n / 3,
where S_n is the cumulative sum of the first n elements and p < q are the two cut points.
computeD <- function(p, q, S) {
  n <- length(S)
  S.star <- S[n] / 3
  if (all(p < q)) {
    (S[p] - S.star)^2 + (S[q] - S[p] - S.star)^2 + (S[n] - S[q] - S.star)^2
  } else {
    stop("You shouldn't be here!")
  }
}
I think the other solutions optimize over p and q independently, which won't give a global minimum (as expected for some particular cases).
optiCut <- function(v) {
  S <- cumsum(v)
  n <- length(v)
  S_star <- S[n] / 3
  # good starting values
  p_star <- which.min((S - S_star)^2)
  q_star <- which.min((S - 2 * S_star)^2)
  print(min <- computeD(p_star, q_star, S))
  count <- 0
  for (q in 2:(n - 1)) {
    S3 <- S[n] - S[q] - S_star
    if (S3 * S3 < min) {
      count <- count + 1
      D <- computeD(seq_len(q - 1), q, S)
      ind <- which.min(D)
      if (D[ind] < min) {
        # Update optimal values
        p_star <- ind
        q_star <- q
        min <- D[ind]
      }
    }
  }
  c(p_star, q_star, computeD(p_star, q_star, S), count)
}
This is as fast as the other solutions because the condition S3 * S3 < min prunes away most of the iterations. But it gives the optimal solution; see optiCut(c(1, 2, 3, 3, 5, 10)).
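A worked check on that example (a small sketch; the cumulative sums of c(1, 2, 3, 3, 5, 10) are 1, 3, 6, 9, 14, 24, so S* = 8):
v <- c(1, 2, 3, 3, 5, 10)
S <- cumsum(v)
D <- function(p, q) (S[p] - 8)^2 + (S[q] - S[p] - 8)^2 + (S[6] - S[q] - 8)^2
D(4, 5) # the independent starting cuts give group sums 9, 5, 10 -> D = 14
D(3, 5) # the optimal cuts give group sums 6, 8, 10 -> D = 8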
For the solution with K >= 3, I basically reimplemented trees with nested tibbles; that was fun!
optiCut_K <- function(v, K) {
  S <- cumsum(v)
  n <- length(v)
  S_star <- S[n] / K
  # good starting values
  p_vec_first <- sapply(seq_len(K - 1), function(i) which.min((S - i * S_star)^2))
  min_first <- sum((diff(c(0, S[c(p_vec_first, n)])) - S_star)^2)
  compute_children <- function(level, ind, val) {
    # leaf
    if (level == 1) {
      val <- val + (S[ind] - S_star)^2
      if (val > min_first) {
        return(NULL)
      } else {
        return(val)
      }
    }
    P_all <- val + (S[ind] - S[seq_len(ind - 1)] - S_star)^2
    inds <- which(P_all < min_first)
    if (length(inds) == 0) return(NULL)
    node <- tibble::tibble(
      level = level - 1,
      ind = inds,
      val = P_all[inds]
    )
    node$children <- purrr::pmap(node, compute_children)
    node <- dplyr::filter(node, !purrr::map_lgl(children, is.null))
    `if`(nrow(node) == 0, NULL, node)
  }
  compute_children(K, n, 0)
}
This gives you all the solutions that are at least as good as the greedy one:
v <- sort(sample(1:1000, 1e5, replace = TRUE))
test <- optiCut_K(v, 9)
You need to unnest this:
full_unnest <- function(tbl) {
  tmp <- try(tidyr::unnest(tbl), silent = TRUE)
  `if`(identical(class(tmp), "try-error"), tbl, full_unnest(tmp))
}
print(test <- full_unnest(test))
And finally, to get the best solution:
test[which.min(test$children), ]
Here is one approach:
splitter <- function(values, N){
  inds = c(0, sapply(1:N, function(i) which.min(abs(cumsum(as.numeric(values)) - sum(as.numeric(values))/N*i))))
  dif = diff(inds)
  re = rep(1:length(dif), times = dif)
  return(split(values, re))
}
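Applied to the question's example, it reproduces the desired grouping (group sums 13, 11, 13):
df <- data.frame(values = c(1, 1, 2, 2, 3, 4, 5, 6, 6, 7))
splitter(df$values, 3)
#$`1`
#[1] 1 1 2 2 3 4
#$`2`
#[1] 5 6
#$`3`
#[1] 6 7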
How good is it?
# I calculate the mean and sd of the maximal difference of the sums in the
# splits of 100 runs:
# split into 15 parts
set.seed(5)
z1 = as.data.frame(matrix(1:15, nrow=1))
repeat{
  values = sort(sample(1:1000, 1000000, replace = T))
  z = splitter(values, 15)
  z = lapply(z, sum)
  z = unlist(z)
  z1 = rbind(z1, z)
  if (nrow(z1) > 101){
    break
  }
}
z1 = z1[-1,]
mean(apply(z1, 1, function(x) max(x) - min(x)))
[1] 1004.158
sd(apply(z1, 1, function(x) max(x) - min(x)))
[1] 210.6653
# with fewer splits (4)
set.seed(5)
z1 = as.data.frame(matrix(1:4, nrow=1))
repeat{
  values = sort(sample(1:1000, 1000000, replace = T))
  z = splitter(values, 4)
  z = lapply(z, sum)
  z = unlist(z)
  z1 = rbind(z1, z)
  if (nrow(z1) > 101){
    break
  }
}
z1 = z1[-1,]
mean(apply(z1, 1, function(x) max(x) - min(x)))
#632.7723
sd(apply(z1, 1, function(x) max(x) - min(x)))
#260.9864
library(microbenchmark)
With 1M values:
values = sort(sample(1:1000, 1000000, replace = T))
microbenchmark(
  sp_27 = splitter(values, 27),
  sp_3 = splitter(values, 3)
)
Unit: milliseconds
  expr      min       lq      mean    median        uq       max neval cld
 sp_27 897.7346 934.2360 1052.0972 1078.6713 1118.6203 1329.3044   100   b
  sp_3 108.3283 116.2223  209.4777  173.0522  291.8669  409.7050   100  a
By the way, F. Privé is correct: this function does not give the globally optimal split. It is greedy, which is not a good characteristic for such a problem. It gives splits with sums closer to global_sum / N in the initial part of the vector, but behaving this way compromises the splits in the later part of the vector.
Here is a test comparison of the three functions posted so far:
db = function(values, N){
  temp = floor(sum(values)/N)
  inds = c(0, which(c(0, diff(cumsum(values) %% temp)) < 0)[1:(N-1)], length(values))
  dif = diff(inds)
  re = rep(1:length(dif), times = dif)
  return(split(values, re))
} # had to change it a bit since the posted one would not work - the core
  # which calculates the splitting positions is the same
missuse <- function(values, N){
  inds = c(0, sapply(1:N, function(i) which.min(abs(cumsum(as.numeric(values)) - sum(as.numeric(values))/N*i))))
  dif = diff(inds)
  re = rep(1:length(dif), times = dif)
  return(split(values, re))
}
prive = function(v, N){ # added dummy N argument because of the tester function
  dummy = N
  computeD <- function(p, q, S) {
    n <- length(S)
    S.star <- S[n] / 3
    if (all(p < q)) {
      (S[p] - S.star)^2 + (S[q] - S[p] - S.star)^2 + (S[n] - S[q] - S.star)^2
    } else {
      stop("You shouldn't be here!")
    }
  }
  optiCut <- function(v, N) {
    S <- cumsum(v)
    n <- length(v)
    S_star <- S[n] / 3
    # good starting values
    p_star <- which.min((S - S_star)^2)
    q_star <- which.min((S - 2*S_star)^2)
    print(min <- computeD(p_star, q_star, S))
    count <- 0
    for (q in 2:(n-1)) {
      S3 <- S[n] - S[q] - S_star
      if (S3*S3 < min) {
        count <- count + 1
        D <- computeD(seq_len(q - 1), q, S)
        ind <- which.min(D)
        if (D[ind] < min) {
          # Update optimal values
          p_star <- ind
          q_star <- q
          min <- D[ind]
        }
      }
    }
    c(p_star, q_star, computeD(p_star, q_star, S), count)
  }
  z3 = optiCut(v)
  inds = c(0, z3[1:2], length(v))
  dif = diff(inds)
  re = rep(1:length(dif), times = dif)
  return(split(v, re))
} # added output to be more in line with the other two
Function for testing:
tester = function(split, seed){
  set.seed(seed)
  z1 = as.data.frame(matrix(1:3, nrow=1))
  repeat{
    values = sort(sample(1:1000, 1000000, replace = T))
    z = split(values, 3)
    z = lapply(z, sum)
    z = unlist(z)
    z1 = rbind(z1, z)
    if (nrow(z1) > 101){
      break
    }
  }
  m = mean(apply(z1, 1, function(x) max(x) - min(x)))
  s = sd(apply(z1, 1, function(x) max(x) - min(x)))
  return(c("mean" = m, "sd" = s))
} # tests 100 random 1M-length vectors with elements drawn from 1:1000
tester(db, 5)
#mean sd
#779.5686 349.5717
tester(missuse, 5)
#mean sd
#481.4804 216.9158
tester(prive, 5)
#mean sd
#451.6765 174.6303
prive is the clear winner; however, it takes quite a bit longer than the other two and can only split into 3 groups.
microbenchmark(
missuse(values, 3),
prive(values, 3),
db(values, 3)
)
Unit: milliseconds
               expr        min        lq      mean    median        uq       max neval cld
 missuse(values, 3)  100.85978  111.1552  185.8199  120.1707  304.0303  393.4031   100  a
   prive(values, 3) 1932.58682 1980.0515 2096.7516 2043.7133 2211.6294 2671.9357   100   b
      db(values, 3)   96.86879  104.5141  194.0085  117.6270  306.7143  500.6455   100  a
N = 3
temp = floor(sum(df$values)/N)
inds = c(0, which(c(0, diff(cumsum(df$values) %% temp)) < 0)[1:(N-1)], NROW(df))
split(df$values, rep(1:N, ifelse(N == 1, NROW(df), diff(inds))))
#$`1`
#[1] 1 1 2 2 3 4
#$`2`
#[1] 5 6
#$`3`
#[1] 6 7

function ifelse calculation

I am trying to write a function in R, but keep getting an error. Within one simulation run I generate random values from 2 intervals, in order to generate 2 different output values:
se.m if the input parameter lies within [0, 1]
se.st if the input parameter lies within [1, 5]
(the floating-point overlap at the boundary is negligible)
Then these randomly generated values are used as input to the following function, which is 0 below 1/3, 0.12*x^2 - 0.04*x between 1/3 and 3.06, and 1 above 3.06.
This is the code I have used:
fuchs08 <- function(n){
  x.m = se.m = x.st = se.st = NULL
  for(i in 1:n){
    se.m[i] <- ifelse(runif(n = 1, min = 0, max = 1) < 0.33, 0, 0.12 * (x.m[i]^2) - 0.04 * x.m[i])
    se.st[i] <- ifelse(runif(n = 1, min = 1, max = 5) < 3.06, 0.12 * (x.m[i]^2) - 0.04 * x.m[i], 1)
  }
  return(cbind(se.m, se.st))
}
I don't get the expected results. I believe the error is in the ifelse statement, but I cannot find a solution to it.
> fuchs08(5)
     se.m se.st
[1,]    0    NA
[2,]   NA     1
[3,]   NA     1
[4,]   NA    NA
[5,]    0     1
The overall idea is to add this function to a list of functions called funktionen, then run a simulation 100 times. Each simulation randomly chooses a function from the list funktionen and executes it. (The function creates two outputs for the aforementioned intervals, se.m and se.st, which are combined with the outputs from the other simulations.) Therefore the function needs the signature function(n) in order to run the random function selection. Here is my code for that part:
funktionen <- list(akbas, bell.glade, borter, fell.hartford, fuchs07, fuchs08)
fxn_list_sample <- sample(seq_along(funktionen), 100, replace = TRUE) # draw from all six functions
fxn_list_result <- unlist(sapply(fxn_list_sample, function(x) do.call(funktionen[[x]], args = list(n = 1))))
results <- as.data.frame(t(fxn_list_result))
colnames(results) <- c("se.m", "se.st")
results <- melt(results)
results$value <- round(results$value, 4)
separate(results, variable, into = c("Parameter", "Intensitaet")) %>%
  mutate(Intensitaet = c(3, 2)[(Intensitaet == "m") + 1])
Any suggestions on how to fix this?
To answer your trouble with ifelse():
ifelse() needs three arguments to make sense (condition, yes, no). It works with only the condition if the condition evaluates to NA (hence the NAs in your results), and it works with two arguments if the condition evaluates to TRUE (hence the 1s in your results). As Konrad says in the comment, the use of ifelse seems redundant. For illustration:
> ifelse(1==1)
Error in ifelse(1 == 1) : argument "yes" is missing, with no default
> ifelse(NA)
[1] NA
> ifelse(1==1, 4)
[1] 4
> ifelse(1!=1, 4)
Error in ifelse(1 != 1, 4) : argument "no" is missing, with no default
> ifelse(1!=1, 4, 10)
[1] 10
Regarding your original problem, I am not sure if I understand your question correctly, but maybe this does what you want:
fuchs08 <- function(x){
  ifelse(x < 1/3, 0,
         ifelse(x <= 3.06, 0.12*x^2 - 0.04*x, 1))
}
fuchs08_with_n_inputs_two_outputcols <- function(n) {
  df <- data.frame(input = runif(n, 0, 5))
  df$se.m <- ifelse(df$input < 1, fuchs08(df$input), NA)
  df$se.st <- ifelse(df$input > 1 & df$input < 5, fuchs08(df$input), NA)
  return(df)
}
fuchs08_with_n_inputs_two_outputcols(10)
Edit: I replaced n by x to avoid confusion and added a second function after having read your answer (the name is long for the sake of clarity...). It is not the output in your answer but may easily be transformed to that. I think it would be helpful to give an example of the output you want and which format it should have (data.frame, named vector...?).
I think ifelse & if-and-else are both awkward. You could try something like:
fuchs08 <- function(n, min, max) {
  x <- runif(n, min, max)
  y <- x
  y[x < 1/3] <- 0
  y[x >= 1/3 & x <= 3.06] <- 0.12 * y[x >= 1/3 & x <= 3.06]^2 - 0.04 * y[x >= 1/3 & x <= 3.06]
  y[x > 3.06] <- 1
  return(y)
}
(want <- cbind(fuchs08(100, 0, 1), fuchs08(100, 1, 5)))
This seems to work. However, it's not a very elegant answer. Feel free to give me tips to improve it, reduce redundant elements, etc.
fuchs08 <- function(n) {
  x.m = se.m = x.st = se.st = NULL
  for(i in 1:n){
    print(x.m[i] <- runif(n = 1, min = 0, max = 1))
    se.m[i] <- ifelse(runif(n = 1, min = 0, max = 1) < 0.33, 0, 0.12 * x.m[i]^2 - 0.04 * x.m[i])
    print(x.st[i] <- runif(n = 1, min = 1, max = 5))
    se.st[i] <- ifelse(runif(n = 1, min = 1, max = 5) < 3.06, 0.12 * x.st[i]^2 - 0.04 * x.st[i], 1)
  }
  return(cbind(se.m, se.st))
}
fuchs08(10)
The whole code is:
library(reshape2)
library(stringr)
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
install.packages("data.table")
library(data.table)
# AKBAS u.a. (2009)
akbas <- function(n){
  x.m = se.m = x.st = se.st = NULL
  for(i in 1:n){
    print(x.m[i] <- runif(n = 1, min = 0, max = 1))
    se.m[i] <- 0.17 * (x.m[i]^2) - 0.03 * x.m[i]
    print(x.st[i] <- runif(n = 1, min = 1, max = 5))
    se.st[i] <- 0.17 * (x.st[i]^2) - 0.03 * x.st[i]
  }
  return(cbind(se.m, se.st))
}
# FUCHS u.a. (2007)
fuchs07 <- function(n){
  x.m = se.m = x.st = se.st = NULL # solves indexing problem
  for(i in 1:n){
    print(x.m[i] <- runif(n = 1, min = 0, max = 1))
    se.m[i] <- 0.11 * (x.m[i]^2) - 0.02 * x.m[i]
    print(x.st[i] <- runif(n = 1, min = 1, max = 5))
    se.st[i] <- 0.11 * (x.st[i]^2) - 0.02 * x.st[i]
  }
  return(cbind(se.m, se.st))
}
# BELL AND GLADE (2004)
bell.glade <- function(n){
  x.m = se.m = x.st = se.st = NULL
  for(i in 1:n){
    se.m[i] <- ifelse(runif(n = 1, min = 0, max = 1) < 1, 0.2, 0.2)
    se.st[i] <- ifelse(runif(n = 1, min = 0, max = 1) < 1, 0.5, 0.5)
  }
  return(cbind(se.m, se.st))
}
# BORTER (1999b,a)
borter <- function(n){
  x.m = se.m = x.st = se.st = NULL
  for(i in 1:n){
    se.m[i] <- ifelse(runif(n = 1, min = 0, max = 1) < 1, 0.1, 0.1)
    se.st[i] <- ifelse(runif(n = 1, min = 0, max = 1) < 1, 0.5, 0.5)
  }
  return(cbind(se.m, se.st))
}
# FELL UND HARTFORD (1997)
fell.hartford <- function(n){
  x.m = se.m = x.st = se.st = NULL
  for(i in 1:n){
    se.m[i] <- ifelse(runif(n = 1, min = 0, max = 1) < 0.25, 0.1, 0.4)
    se.st[i] <- ifelse(runif(n = 1, min = 1, max = 5) < 1.5, 0.4, 0.7)
  }
  return(cbind(se.m, se.st))
}
# FUCHS (2008, 2009)
fuchs08 <- function(n) {
  x.m = se.m = x.st = se.st = NULL
  for(i in 1:n){
    print(x.m[i] <- runif(n = 1, min = 0, max = 1))
    se.m[i] <- ifelse(runif(n = 1, min = 0, max = 1) < 0.33, 0, 0.12 * x.m[i]^2 - 0.04 * x.m[i])
    print(x.st[i] <- runif(n = 1, min = 1, max = 5))
    se.st[i] <- ifelse(runif(n = 1, min = 1, max = 5) < 3.06, 0.12 * x.st[i]^2 - 0.04 * x.st[i], 1)
  }
  return(cbind(se.m, se.st))
}
funktionen <- list(akbas, bell.glade, borter, fell.hartford, fuchs07, fuchs08)
fxn_list_sample <- sample(seq_along(funktionen), 100, replace = TRUE) # draw from all six functions
fxn_list_result <- unlist(sapply(fxn_list_sample, function(x) do.call(funktionen[[x]], args = list(n = 1))))
results <- as.data.frame(t(fxn_list_result))
colnames(results) <- c("se.m", "se.st")
results <- melt(results)
results$value <- round(results$value, 4)
separate(results, variable, into = c("Parameter", "Intensitaet")) %>%
  mutate(Intensitaet = c(3, 2)[(Intensitaet == "m") + 1])
write.csv(results, "murgang-test.csv")

Filling a dataframe with a 0 and 1 vector [duplicate]

This question already has answers here:
Create binary column (0/1) based on condition in another column
(2 answers)
Closed 8 years ago.
I have a dataframe with two columns (A and B). Column A is categorical and B is numeric (ranging from 0.0 to 1.0). I want to create a column C whose values are 1 when the value in column B is greater than or equal to 0.5, and 0 when it is less than 0.5. Any suggestions on how to do this? The final df should look like this:
A = c('spA', 'spB', 'spC', 'spD')
B = c(0.25, 0.15, 0.50, 0.75)
C = c(0,0,1,1)
df = data.frame(A, B, C)
Just use
A = c('spA', 'spB', 'spC', 'spD')
B = c(0.25, 0.15, 0.50, 0.75)
df = data.frame(A, B)
df$C <- as.numeric(df$B >= 0.5)
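Printing df then shows the desired column:
df
#    A    B C
# 1 spA 0.25 0
# 2 spB 0.15 0
# 3 spC 0.50 1
# 4 spD 0.75 1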
@David Arenburg: speed comparison of all 3 solutions pointed out above.
To be honest, I don't know why it is that much faster.
require(microbenchmark)
microbenchmark(
  df$C <- ifelse(df$B >= 0.5, 1, 0),
  transform(df, C = as.numeric(B >= 0.5)),
  df$C <- as.numeric(df$B >= 0.5)
)
Result:
Unit: microseconds
                                    expr     min       lq   median       uq    max neval
       df$C <- ifelse(df$B >= 0.5, 1, 0)  33.585  35.7580  38.1285  41.6845 140.66   100
 transform(df, C = as.numeric(B >= 0.5)) 143.821 149.7470 155.0815 164.5640 284.48   100
         df$C <- as.numeric(df$B >= 0.5)  20.546  22.9165  24.2995  27.2630  53.34   100
EDIT: Larger dataset
df <- data.frame(B=runif(100000))
require(microbenchmark)
microbenchmark(
  df$C <- ifelse(df$B >= 0.5, 1, 0),
  transform(df, C = as.numeric(B >= 0.5)),
  df$C <- as.numeric(df$B >= 0.5)
)
Unit: microseconds
                                    expr       min        lq     median        uq       max neval
       df$C <- ifelse(df$B >= 0.5, 1, 0) 31620.826 33623.452 34529.8380 55652.9290 62707.064   100
 transform(df, C = as.numeric(B >= 0.5))   811.561   979.286  1032.6255  1248.5550  2333.137   100
         df$C <- as.numeric(df$B >= 0.5)   606.498   764.542   808.0045   979.0875 23805.112   100

Is there a weighted.median() function?

I'm looking for something similar in form to weighted.mean(). I've found some solutions via search that write out the entire function but would appreciate something a bit more user friendly.
The following packages all have a function to calculate a weighted median: 'aroma.light', 'isotone', 'limma', 'cwhmisc', 'ergm', 'laeken', 'matrixStats', 'PSCBS', and 'bigvis' (on github).
To find them I used the invaluable findFn() in the 'sos' package which is an extension for R's inbuilt help.
findFn('weighted median')
Or,
???'weighted median'
as ??? is a shortcut in the same way ?some.function is for help(some.function)
Some experience using the answers from @wkmor1 and @Jaitropmange:
I've checked 3 functions from 3 packages: isotone, laeken, and matrixStats. Only matrixStats works properly. The other two (just like the median(rep(x, times=w)) solution) give integer output. Since I was calculating the median age of populations, decimal places matter.
Reproducible example: calculating the median age of a population.
df <- data.frame(age = 0:100,
                 pop = spline(c(4,7,9,8,7,6,4,3,2,1), n = 101)$y)
library(isotone)
library(laeken)
library(matrixStats)
isotone::weighted.median(df$age,df$pop)
# [1] 36
laeken::weightedMedian(df$age,df$pop)
# [1] 36
matrixStats::weightedMedian(df$age,df$pop)
# [1] 36.164
median(rep(df$age, times=df$pop))
# [1] 35
Summary
matrixStats::weightedMedian() is the reliable solution
To calculate the weighted median of a vector x using a same-length vector of (integer) weights w:
median(rep(x, times=w))
This is just a simple solution, ready to use almost anywhere.
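For example, with the value 40 carrying twice the weight of the value 10:
median(rep(c(10, 40), times = c(1, 2)))
#[1] 40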
weighted.median <- function(x, w) {
  w <- w[order(x)]
  x <- x[order(x)]
  prob <- cumsum(w)/sum(w)
  ps <- which(abs(prob - .5) == min(abs(prob - .5)))
  return(x[ps])
}
Really old post, but I just came across it and did some testing of the different methods. spatstat::weighted.median() seemed to be about 14 times faster than median(rep(x, times=w)), and it's actually noticeable if you want to run the function more than a couple of times. Testing was with a relatively large survey, about 15,000 people.
One can also use stats::density to create a weighted PDF, then convert this to a CDF, as elaborated here:
my_wtd_q = function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n),
       x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
Then my_wtd_q(x, w, .5) will be the weighted median.
One could also be more careful to ensure that the total area under the density is one by re-normalizing.
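A sketch of that re-normalized variant (the name my_wtd_q_renorm is just for illustration; same inputs as my_wtd_q above):
my_wtd_q_renorm <- function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n), {
    cdf <- cumsum(y * (x[2L] - x[1L])) # discretized CDF; may not reach exactly 1
    x[which.max(cdf / cdf[length(cdf)] >= prob)] # divide so the total area is 1
  })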
A way in base R to get a weighted median is to order by the values, build the cumulative sum of the weights, and take the value(s) at half the total weight.
medianWeighted <- function(x, w, q = .5) {
  n <- length(x)
  i <- order(x)
  w <- cumsum(w[i])
  p <- w[n] * q
  j <- findInterval(p, w)
  Vectorize(function(p, j) if(w[n] <= 0) NA else
    if(j < 1) x[i[1]] else
    if(j == n) x[i[n]] else
    if(w[j] == p) (x[i[j]] + x[i[j+1]]) / 2 else
    x[i[j+1]])(p, j)
}
This gives the following results with simple input data.
medianWeighted(c(10, 40), c(1, 2))
#[1] 40
median(rep(c(10, 40), c(1, 2)))
#[1] 40
medianWeighted(c(10, 40), c(2, 1))
#[1] 10
median(rep(c(10, 40), c(2, 1)))
#[1] 10
medianWeighted(c(10, 40), c(1.5, 2))
#[1] 40
medianWeighted(c(10, 40), c(3, 4))
#[1] 40
median(rep(c(10, 40), c(3, 4)))
#[1] 40
medianWeighted(c(10, 40), c(1.5, 1.5))
#[1] 25
medianWeighted(c(10, 40), c(3, 3))
#[1] 25
median(rep(c(10, 40), c(3, 3)))
#[1] 25
medianWeighted(c(10, 40), c(0, 1))
#[1] 40
medianWeighted(c(10, 40), c(1, 0))
#[1] 10
medianWeighted(c(10, 40), c(0, 0))
#[1] NA
It can also be used for other quantiles:
medianWeighted(1:10, 10:1, seq(0, 1, 0.25))
#[1]  1  2  4  6 10
Compare with other methods.
# Functions from other answers
weighted.median <- function(x, w) {
  w <- w[order(x)]
  x <- x[order(x)]
  prob <- cumsum(w)/sum(w)
  ps <- which(abs(prob - .5) == min(abs(prob - .5)))
  return(x[ps])
}
my_wtd_q = function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n),
       x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
                              na.rm = FALSE, names = TRUE) {
  if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
  if (length(w) == 1) w <- rep(w, length(x))
  if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
  if (isTRUE(na.rm)) {
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]
  }
  w <- w[order(x)] / sum(w)
  x <- x[order(x)]
  cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
  res <- approx(x = cum_w, y = x, xout = probs)$y
  if (isTRUE(names)) {
    res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
  }
  res
}
Methods
M <- alist(
  medRep = median(rep(DF$x, DF$w)),
  isotone = isotone::weighted.median(DF$x, DF$w),
  laeken = laeken::weightedMedian(DF$x, DF$w),
  spatstat1 = spatstat.geom::weighted.median(DF$x, DF$w, type = 1),
  spatstat2 = spatstat.geom::weighted.median(DF$x, DF$w, type = 2),
  spatstat4 = spatstat.geom::weighted.median(DF$x, DF$w, type = 4),
  survey = survey::svyquantile(~x, survey::svydesign(id = ~1, weights = ~w, data = DF), 0.5)$x[1],
  RAndres = weighted.median(DF$x, DF$w),
  matrixStats = matrixStats::weightedMedian(DF$x, DF$w),
  MichaelChirico = my_wtd_q(DF$x, DF$w, .5),
  Leonardo = weighted.quantile(DF$x, DF$w, .5),
  GKi = medianWeighted(DF$x, DF$w)
)
Results
DF <- data.frame(x=c(10, 40), w=c(1, 2))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 40.00000 40.00000 40.00000 40.00000 25.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 17.50000 40.00000 10.00000 30.00000 34.15005
# Leonardo.50% GKi
# 25.00000 40.00000
DF <- data.frame(x=c(10, 40), w=c(1, 1))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 25.00000 25.00000 40.00000 10.00000 10.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 10.00000 10.00000 10.00000 25.00000 25.05044
# Leonardo.50% GKi
# 25.00000 25.00000
In those two cases only isotone and GKi give results identical to what median(rep(x, w)) returns.
If you're working with the survey package, assuming you've defined your survey design and x is your variable of interest:
svyquantile(~x, mydesign, c(0.5))
I got here looking for weighted quantiles, so I thought I might as well leave for future readers what I ended up with. Naturally, using probs = 0.5 will return the weighted median.
I started with MichaelChirico's answer, which unfortunately was off at the edges. Then I decided to switch from density() to approx(). Finally, I believe I nailed the correction factor to ensure consistency with the default algorithm of the unweighted quantile().
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
                              na.rm = FALSE, names = TRUE) {
  if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
  if (length(w) == 1) w <- rep(w, length(x))
  if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
  if (isTRUE(na.rm)) {
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]
  }
  w <- w[order(x)] / sum(w)
  x <- x[order(x)]
  cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
  res <- approx(x = cum_w, y = x, xout = probs)$y
  if (isTRUE(names)) {
    res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
  }
  res
}
When weights are uniform, the weighted quantiles are identical to regular unweighted quantiles:
x <- rnorm(100)
stopifnot(identical(weighted.quantile(x, w = 1), quantile(x)))
Example using the same data as in the weighted.mean() man page.
x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)/15
stopifnot(isTRUE(all.equal(
  weighted.quantile(x, w, 0:4/4, names = FALSE),
  c(2.8, 3.33611111111111, 3.46111111111111, 3.58157894736842, 3.7)
)))
And this is for whoever solely wants the weighted median value:
weighted.median <- function(x, w, ...) {
  weighted.quantile(x, w, probs = 0.5, names = FALSE, ...)
}
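With the example data above, the wrapper returns the 50% value from the quantile check:
weighted.median(x, w)
#[1] 3.461111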
