Find the closest average of three numbers - code optimization - r

This may seem trivial, but I have code that finds the average of the closest two numbers in a set of three numbers. So for these 5 examples:
x1 <- c(1,5,7)
x2 <- c(NA,2,3)
x3 <- c(2,6,4)
x4 <- c(1,NA,NA)
x5 <- c(1,3,1)
I would get an output of
y1 = 6
y2 = 2.5
y3 = 4
y4 = 1
y5 = 1
respectively. Basically, finding the closest 2 numbers, then averaging them, accounting for NA and ties.
This code is a monstrosity:
x <- x[!is.na(x)]
x <- x[order(x)]
y <- ifelse(length(x) == 1, x,
       ifelse(length(x) == 2, mean(x),
         ifelse(length(x) == 3,
           ifelse(abs(x[1] - x[2]) == abs(x[2] - x[3]), mean(x),
             ifelse(abs(x[1] - x[2]) > abs(x[2] - x[3]), mean(x[2:3]),
               ifelse(abs(x[1] - x[2]) < abs(x[2] - x[3]), mean(x[1:2]),
                 "error"))), NA)))
It works, but because this is part of a larger for loop, I was wondering if there's a better way of doing this.

We define an S3 generic with "list" and "default" methods.
The "default" method takes a vector and sorts it (which also removes NA values). If the length of what is left is <= 1, it returns the single value, or NA if there is none. If the length is 2, or the two successive differences are the same, it returns the mean of all values; otherwise, it finds the index of the first element of the closest pair and returns the mean of those two values.
The "list" method applies the default method repeatedly.
mean_min_diff <- function(x) UseMethod("mean_min_diff")
mean_min_diff.list <- function(x) sapply(x, mean_min_diff.default)
mean_min_diff.default <- function(x) {
  x0 <- sort(x)
  if (length(x0) <= 1) c(x0, NA)[1]
  else if (length(x0) == 2 || sd(diff(x0)) == 0) mean(x0)
  else mean(x0[seq(which.min(diff(x0)), length = 2)])
}
Now test it out:
mean_min_diff(x1)
## [1] 6
mean_min_diff(list(x1, x2, x3, x4, x5))
## [1] 6.0 2.5 4.0 1.0 1.0
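Since the question mentions that this runs inside a larger for loop, one possible way to handle many triples at once is to store them as rows of a matrix and use apply. This is only a sketch: the matrix m is made-up illustration data built from the five example vectors, and the default method is repeated (with the partial argument length spelled out as length.out) so that the block runs on its own.

```r
# Default method repeated from the answer above so this block is self-contained.
mean_min_diff.default <- function(x) {
  x0 <- sort(x)  # sort() also drops NA values
  if (length(x0) <= 1) c(x0, NA)[1]
  else if (length(x0) == 2 || sd(diff(x0)) == 0) mean(x0)
  else mean(x0[seq(which.min(diff(x0)), length.out = 2)])
}

# Hypothetical input: one triple per row (the five examples from the question).
m <- rbind(c(1, 5, 7), c(NA, 2, 3), c(2, 6, 4), c(1, NA, NA), c(1, 3, 1))

# Apply the method to every row; no explicit for loop is needed.
res <- apply(m, 1, mean_min_diff.default)
res
## [1] 6.0 2.5 4.0 1.0 1.0
```

Note that sd(diff(x0)) == 0 is an exact floating-point comparison, so with non-integer inputs a tolerance-based check may be safer.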

Related

A confusion about `ifelse`

I am confused about ifelse. I hope someone can help explain.
Consider code below:
x1 = c(1,4,3)
y1 = c(2,3,5)
# 1
> ifelse(x1 > y1, x1^2 + y1^2,y1)
[1] 2 25 5
# 2
> ifelse(x1 > y1, sum(x1),y1)
[1] 2 8 5
# Based on #1, I expected the second element to be sum(x1[2]), i.e. sum(4) = 4
Why?
Update:
After reading the book The Art of R Programming, I solved my problem.
ifelse(b,u,v) where b is a Boolean vector, and u and v are vectors.
The return value is itself a vector; element i is u[i] if b[i] is true, or v[i] if b[i] is false
So
ifelse(x1 > y1, sum(x1),y1) == ifelse(x1 > y1, c(sum(x1),sum(x1),sum(x1)),c(2,3,5)) # by recycling
# then b = c(F,T,F), u = c(8,8,8), v = c(2,3,5)
# therefore output would be (v[1],u[2],v[3]), i.e.
# [1] 2 8 5
sum(x1) is 8 because 1 + 4 + 3 = 8. You might wonder why ifelse seems to evaluate the two expressions differently. It does not: in both cases the second argument is evaluated in full before ifelse picks elements from it. The difference is that ^ and + work element-wise, so x1^2 + y1^2 is a vector of length 3, whereas sum() collapses the whole vector to a single value, which is then recycled to length 3. Try ifelse(x1 > y1, x1, y1).
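The recycling explanation can be checked directly. The following is a minimal sketch using the vectors from the question; the variable names test, a, b and elementwise are introduced here purely for illustration.

```r
x1 <- c(1, 4, 3)
y1 <- c(2, 3, 5)

test <- x1 > y1                      # FALSE TRUE FALSE

# sum(x1) is a single value (8); ifelse recycles it to the length of test.
a <- ifelse(test, sum(x1), y1)

# Writing the recycling out by hand gives the same result.
b <- ifelse(test, rep(sum(x1), length(test)), y1)

identical(a, b)                      # TRUE
a                                    # 2 8 5

# An element-wise expression already has length 3, so nothing is recycled.
elementwise <- ifelse(test, x1^2 + y1^2, y1)
elementwise                          # 2 25 5
```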

Updating a vector within a dataframe using a random experiment

I have the following dataframes "df1" and "df2":
x1 <- c(1,1,1,2,2,3)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
y <- c(0,1,2)
p <- c(0.1,0.6,0.9)
df2 <- data.frame(y,p)
What I want to do is to update df1$x1 to a new vector df1$x2, based on a random experiment. This can be manually done using the following function and "lapply" on vector df1$x1:
example_function <- function(x,p){
  if(runif(1) <= p) return(x + 1)
  return(x)
}
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,example_function,0.5))
The function performs a random experiment and compares the result with a given probability p. Depending on the outcome, either x stays the same in df1$x2 or it increases by 1.
In the procedure described above, "p" was selected manually within the function (here 0.5 for all x-values in df1). However, I want p to be chosen automatically depending on the combination of df1$x1 and df1$y1. This is where df2 comes into play: df2 shows which p-values are related to which y-values. For example, df1$x1[3] equals 1 and the corresponding y value df1$y1[3] is also equal to 1. df2 shows that the associated p-value has to be 0.6 (that is the p-value for y equal to 1). In order to determine the corresponding value df1$x2, p = 0.6 should be used in "example_function". Depending on df1$y1, p should be 0.1 for df1$x1[1] and df1$x1[2], 0.6 for df1$x1[3] and df1$x1[4], and 0.9 for df1$x1[5] and df1$x1[6].
The following approach works, but only if vector df1$x1 contains only distinct values:
x1 <- c(1,2,3,4,5,6)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,
function(z) {
example_function(z, df2$p[df2$y == df1$y1[df1$x1 == z]])
}))
df1
x1 y1 x2
#1 1 0 1
#2 2 0 2
#3 3 1 4
#4 4 1 4
#5 5 2 5
#6 6 2 7
Using x1 <- c(1,1,1,2,2,3), as mentioned above, leads to warnings and errors:
x1 <- c(1,1,1,2,2,3)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,
function(z) {
example_function(z, df2$p[df2$y == df1$y1[df1$x1 == z]])
}))
Error in if (runif(1) <= p) return(x + 1) : argument is of length zero
In addition: Warning message:
In df2$y == df1$y1[df1$x1 == z] :
longer object length is not a multiple of shorter object length
Is there anyone who has an idea how to fix that problem? I am very grateful for any help.
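Working with "merge" seems to be one solution. Note that merge sorts its result by the key column by default, so the rows of df_new only line up with the rows of df1 here because df1 is already ordered by y1: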
df_new <- merge(df1, df2, by.x = 'y1', by.y = 'y')
set.seed(123)
df1$x2 <- mapply(example_function,df1$x1,df_new$p)
> df1
x1 y1 x2
1 1 0 1
2 1 0 1
3 1 1 2
4 2 1 2
5 2 2 2
6 3 2 4
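An alternative that does not depend on the row order of df1 is to look up p with match() and draw all random numbers in one call. This is a hedged sketch rather than part of the original answer; p_row is a name introduced here for illustration. Because runif(6) produces the same sequence as six successive runif(1) calls under the same seed, the result agrees with the mapply output above.

```r
df1 <- data.frame(x1 = c(1, 1, 1, 2, 2, 3), y1 = c(0, 0, 1, 1, 2, 2))
df2 <- data.frame(y = c(0, 1, 2), p = c(0.1, 0.6, 0.9))

# For every row of df1, find the position of its y1 value in df2$y
# and pick the corresponding probability.
p_row <- df2$p[match(df1$y1, df2$y)]
p_row
## [1] 0.1 0.1 0.6 0.6 0.9 0.9

set.seed(123)
# One uniform draw per row; TRUE counts as 1, so x1 increases by 1
# exactly where the draw is <= p.
df1$x2 <- df1$x1 + (runif(nrow(df1)) <= p_row)
df1$x2
## [1] 1 1 2 2 2 4
```

If some y1 value is missing from df2$y, match() returns NA for that row, so p_row and x2 become NA there instead of raising an error.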

Data cleaning: Function to find very similar variables

I have some large data, which partly consists of very similar variables. Some variables have missing values (e.g. x3 and x5 in the example below) and some variables are similar, but with different labels (e.g. x2 and x5). In order to clean my data, I want to identify and eventually delete these similar variables. I am trying to write a function, which returns the column names of all similar variable pairs. Here is some exemplifying data:
# Example data
set.seed(222)
N <- 100
x1 <- round(rnorm(N, 0, 10))
x2 <- round(rnorm(N, 10, 20))
x3 <- x1
x3[sample(1:N, 7)] <- NA
x4 <- x1
x4[sample(1:N, 5)] <- round(rnorm(5, 0, 10))
x5 <- x2
x5 <- paste("A", x5, sep = "")
x5[sample(1:N, 15)] <- NA
df <- data.frame(x1, x2, x3, x4, x5)
df$x1 <- as.character(df$x1)
df$x2 <- as.character(df$x2)
df$x3 <- as.character(df$x3)
df$x4 <- as.character(df$x4)
df$x5 <- as.character(df$x5)
head(df)
As you can see, x1, x3, and x4 are very similar; and x2 and x5 are very similar as well. My function should print a list, which includes all pairs with the same values in 80% or more of the cases. Here is what I got so far:
# My attempt to write such a function
fun_clean <- function(data, similarity) {
  output <- list()
  data <- data[complete.cases(data), ]
  for(i in 1:ncol(data)) {
    if(i < ncol(data)) {
      for(j in (i + 1):ncol(data)) {
        similarity_ij <- sum(data[ , i] == data[ , j]) / nrow(data)
        if(similarity_ij >= similarity) {
          output[[length(output) + 1]] <- colnames(data)[c(i, j)]
        }
      }
    }
  }
  output
}
fun_clean(data = df, similarity = 0.8)
I managed to identify the similarity of x1, x3, and x4. The similarity of x2 and x5 (i.e. similar variables with different labels) is not found by my function. Further, my function is very slow. Therefore, I have the following question:
Question: How could I identify all similar variables in a computationally efficient way?
In order to compare your columns, you need numeric values first. You can extract only the numeric values by using gsub() and then transform to numeric values. After this transformation, you'll be good to go:
df <- apply(df, 2, function(x) as.numeric(gsub("[^0-9-]", "", x))) # keep digits and the minus sign, otherwise -5 and 5 would count as equal
Now you can compare all columns by first using combn(5, 2) to get all pairs of columns you want to compare. Then you can use that to compare the columns and calculate the percentage of entries that are equal.
combs <- combn(ncol(df), 2)
res <- apply(combs, 2, function(x){
sum(df[, x[1]] == df[, x[2]], na.rm = TRUE)/nrow(df)
})
thresh <- 0.8
combs[, res >= thresh]
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    2    3
# [2,]    3    4    5    4
So columns 1 & 3, 1 & 4, 2 & 5 and 3 & 4 are equal to each other in at least 80% of the cases.
Note: If one or both of the compared values have an NA, this will be considered as not a match!
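The steps of this answer can be wrapped into one function that reports the pairs by column name. The following is a sketch under stated assumptions: similar_pairs and the toy data frame are hypothetical names and data introduced here, the values are assumed to be integers with an optional label prefix (a decimal point would be stripped by the pattern), and the minus sign is kept so that -3 and 3 are not treated as equal.

```r
# Hypothetical helper (not from the original answer): returns similar column
# pairs by name, together with the share of rows in which they agree.
similar_pairs <- function(data, threshold = 0.8) {
  # Keep only digits and the minus sign, so "A-3" becomes -3.
  num <- sapply(data, function(col) as.numeric(gsub("[^0-9-]", "", col)))
  combs <- combn(ncol(num), 2)          # all column pairs, one per column
  share <- apply(combs, 2, function(ix) {
    # Rows where either value is NA count as "no match".
    sum(num[, ix[1]] == num[, ix[2]], na.rm = TRUE) / nrow(num)
  })
  keep <- share >= threshold
  data.frame(var1 = colnames(num)[combs[1, keep]],
             var2 = colnames(num)[combs[2, keep]],
             share = share[keep])
}

# Made-up toy data: b is a labelled copy of a with one missing value.
toy <- data.frame(a = c("1", "2", "-3", "4", "5"),
                  b = c("A1", "A2", "A-3", "A4", NA),
                  c = c("9", "8", "7", "6", "5"))
similar_pairs(toy, threshold = 0.8)
##   var1 var2 share
## 1    a    b   0.8
```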
The caret package has functionality to find correlated variables, and also variables that are linear combinations of others:
http://topepo.github.io/caret/pre-processing.html

R: calculating mean of positive values in every column of a data frame

I want to calculate the mean of every column of a data frame, but only the positive values should be considered. The positive mean values of every column should be collected in one vector.
My code:
x <- data.frame(replicate(3, sample(-5000:7000, 1000, rep = TRUE)))
meanxpositive <- c(NA)
for (n_col in 1:3) {
z <- mean(x[which(x[, ncol] > 0)])
meanxpositive[n_col] <- z
}
This code doesn't work. Maybe someone has a better idea.
sapply(x, function (y) mean(y[y > 0]))
colSums(as.matrix(x) * (x > 0)) / colSums(x > 0) # sum of the positive values divided by their count
A colMeans approach might look like:
x[x<=0] <- NA
colMeans(x, na.rm=TRUE)
# X1 X2 X3
#3483.664 3626.115 3533.687
Since you also mentioned you wanted to see this solution using a for loop, you could adjust your old code to:
meanxpositive <- rep(NA, ncol(x))
for (n_col in seq_len(ncol(x))) {
  z <- mean(x[which(x[, n_col] > 0), n_col]) # Changed this line to reference "n_col" instead of "ncol"
  meanxpositive[n_col] <- z
}
meanxpositive
#[1] 3483.664 3626.115 3533.687
Note that I changed the references from ncol to n_col and also selected column n_col inside mean(...); before, you weren't selecting any column.
Lastly, with for loops it is best to "pre-allocate" the memory of your result. In this case that means setting it to the size that you know it should be (3). In R, growing objects inside loops is slow and inefficient.
Data:
set.seed(1)
x<-data.frame(replicate(3,sample(-5000:7000,1000,rep=TRUE)))
If you want to keep your data in a data.frame,
library(dplyr)
set.seed(47)
x <- data.frame(replicate(3, sample(-5000:7000, 1000, replace = TRUE)))
x %>% summarise_all(~mean(.x[.x > 0]))
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or with the old funs notation,
x %>% summarise_all(funs(mean(.[. > 0])))
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or in base R,
aggregate(. ~ TRUE, x, function(x){mean(x[x > 0])})
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or data.table,
library(data.table)
setDT(x)[, lapply(.SD, function(x){mean(x[x > 0])})]
#> X1 X2 X3
#> 1: 3578.912 3535.614 3358.444
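As a quick sanity check, the same idea can be tried on a data frame small enough to verify by hand. The data frame x_small and the name pos_means are made up for this illustration. The sketch also shows that zeroing out the non-positive entries and averaging over all rows gives a different number, because the zeros still count in the denominator.

```r
# Made-up data: each column has exactly two positive values.
x_small <- data.frame(a = c(-2, 4, 6), b = c(1, -1, 3), c = c(10, 20, -5))

# Mean of the positive entries only, one value per column.
pos_means <- vapply(x_small, function(col) mean(col[col > 0]), numeric(1))
pos_means
##  a  b  c
##  5  2 15

# Replacing non-positive entries by 0 and averaging over all 3 rows
# divides by 3 instead of 2, so the result is smaller.
zero_filled <- colMeans(as.matrix(x_small) * (x_small > 0))
zero_filled   # 10/3, 4/3 and 10
```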

Nested for loop in R

I wrote the following code, and I need to repeat it 100 times. I know I need to use another for loop, but I don't know how to do it. Here is the code:
mean <- c(5,5,10,10,5,5,5)
x <- NULL
u <- NULL
delta1 <- NULL
w1 <- NULL
for (i in 1:7 ) {
  x[i] <- rexp(1, rate = mean[i])
  u[i] <- (1/1.2)*runif(1, min=0, max=1)
  y1 <- min(x,u)
  if (y1 == min(x)) {
    delta1 <- 1
  }
  else {
    delta1 <- 0
  }
  if (delta1 == 0)
  {
    w1 <- NULL
  }
  else {
    if(y1== x[[1]])
    {
      w1 <- "x1"
    }
  }
}
output <- cbind(delta1,w1)
output
I want the final output to be a 100-row by 3-column matrix representing run number, delta1, and w1.
Any thoughts will be truly appreciated.
Here's what I gather you're trying to achieve from your code:
Given two vectors drawn from different distributions (Exponential and Uniform)
Find out which distribution the smallest number comes from
Repeat this 100 times.
There are a couple of problems with your code if you want to achieve this, so here's a cleaned-up example:
rates <- c(5, 5, 10, 10, 5, 5, 5) # 'mean' is an inbuilt function
# Initialise the output data frame:
output <- data.frame(number=rep(0, 100), delta1=rep(1, 100), w1=rep("x1", 100))
for (i in 1:100) {
  # Generating u doesn't require a for loop. Additionally, the factor
  # (1/1.2) can be folded into max.
  u <- runif(7, min=0, max=5/6)
  # Generating x doesn't need a loop either. It's better to use apply functions
  # when you can!
  x <- sapply(rates, function(x) { rexp(1, rate=x) })
  y1 <- min(x, u)
  # Now we can store the output (here, the overall minimum)
  output[i, "number"] <- y1
  # Two things here:
  # 1) use all.equal instead of == to compare floating point numbers
  # 2) We initialised the data frame to assume they always came from x.
  #    So we only need to overwrite it where it comes from u.
  if (isTRUE(all.equal(y1, min(u)))) {
    output[i, "delta1"] <- 0
    output[i, "w1"] <- NA # Can't use NULL in a character vector.
  }
}
output
Here's an alternative, more efficient approach with replicate:
Mean <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100 # number of runs
res <- t(replicate(n, {
  x <- rexp(n = length(Mean), rate = Mean)
  u <- runif(n = length(Mean), min = 0, max = 1/1.2)
  mx <- min(x)
  delta1 <- mx <= min(u)
  w1 <- delta1 & mx == x[1]
  c(delta1, w1)
}))
output <- data.frame(run = seq.int(n), delta1 = as.integer(res[ , 1]),
                     w1 = c(NA, "x1")[res[ , 2] + 1])
The result:
head(output)
# run delta1 w1
# 1 1 1 <NA>
# 2 2 1 <NA>
# 3 3 1 <NA>
# 4 4 1 x1
# 5 5 1 <NA>
# 6 6 0 <NA>
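The simulation can also be written without replicate by drawing all runs at once into matrices. This is a sketch, not part of the original answers: the seed is arbitrary, the names k, min_x and min_u are introduced here, and it assumes the same reading of the task as above (delta1 is 1 when the overall minimum comes from x, and w1 is "x1" when that minimum is x[1]). The individual random numbers differ from the replicate version because they are drawn in a different order.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
rates <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100                          # number of runs
k <- length(rates)

# n-by-k matrices: one row per run. matrix() fills column by column, so
# rep(rates, each = n) makes column j use rates[j].
x <- matrix(rexp(n * k, rate = rep(rates, each = n)), nrow = n)
u <- matrix(runif(n * k, min = 0, max = 1 / 1.2), nrow = n)

min_x <- apply(x, 1, min)         # smallest exponential draw per run
min_u <- apply(u, 1, min)         # smallest uniform draw per run

delta1 <- as.integer(min_x <= min_u)
w1 <- ifelse(delta1 == 1L & x[, 1] == min_x, "x1", NA_character_)

output <- data.frame(run = seq_len(n), delta1 = delta1, w1 = w1)
head(output)
```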
