Decile function in R - nested ifelse() statements lead to poor runtime

I wrote a function that assigns each element of a vector to its decile. I am doing this with the intention of creating graphics to evaluate the efficacy of a predictive model. There has to be an easier way to do this, but I haven't been able to figure it out. Does anyone have any idea how I could score a vector in this way without so many nested ifelse() statements? I included the function as well as some code to reproduce my results.
# function
decile <- function(x){
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i)
}
return (ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10))))))))))
}
# check functionality
test.df <- data.frame(a = 1:10, b = rnorm(10, 0, 1))
test.df$deciles <- decile(test.df$b)
test.df
# order data frame
test.df[with(test.df, order(b)),]

You can use quantile() and findInterval():
# find the decile locations
decLocations <- quantile(test.df$b, probs = seq(0.1,0.9,by=0.1))
# use findInterval with -Inf and Inf as lower and upper bounds
findInterval(test.df$b,c(-Inf,decLocations, Inf))
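As a rough sketch, the two steps above can be wrapped in a small helper. The name decile_fi and the simulated input are assumptions made for illustration, not part of the original answer. It uses the same boundary convention as the nested ifelse() version: a value equal to a cut point goes to the higher bin.

```r
# Hypothetical helper (name chosen for this sketch): decile label 1..10 per element.
decile_fi <- function(x) {
  # The nine interior cut points (10%, 20%, ..., 90%).
  breaks <- quantile(x, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)
  # findInterval() returns i when breaks[i] <= x < breaks[i + 1];
  # padding with -Inf and Inf puts every value into one of 10 bins.
  findInterval(x, c(-Inf, breaks, Inf))
}

set.seed(1)            # only so the sketch is repeatable
x <- rnorm(1000)
d <- decile_fi(x)
table(d)               # counts per decile
```

Because the whole vector is handled in one vectorized call, there is no chain of ifelse() evaluations over the full vector.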

Another solution is to use ecdf(), described in the help files as the inverse of quantile().
round(ecdf(test.df$b)(test.df$b) * 10)
Note that @mnel's solution is around 100 times faster.
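One caveat, offered as an observation rather than as part of the original answer: with round(), values whose empirical CDF is below 0.05 are rounded down to 0, so on longer vectors the ecdf() approach can produce eleven groups (0 to 10) instead of ten. A minimal check on an assumed input of 1:100:

```r
x <- 1:100   # assumed example input

# ecdf-based labels as in the answer above
by_ecdf <- round(ecdf(x)(x) * 10)

# quantile/findInterval-based labels for comparison
breaks <- quantile(x, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)
by_breaks <- findInterval(x, c(-Inf, breaks, Inf))

range(by_ecdf)     # lowest label is 0 here
range(by_breaks)   # labels stay within 1..10
```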

Related

R using set.seed() within a purrr::map iteration structure

When generating new data in R, I can use set.seed() to ensure that I get the same data sets every time the code is run:
set.seed(12345)
a <- rnorm(500, mean = 50, sd = 10)
set.seed(12345)
b <- rnorm(500, mean = 50, sd = 10)
identical(a, b)
# TRUE
If I comment out the set.seed() lines, identical(a,b) returns FALSE.
Now I want to use a purrr::map() structure to generate multiple data sets with slightly different parameters:
library(tidyverse)
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
set.seed(12345)
data <- map2(
means,
sds,
~
rnorm(500, mean = .x, sd = .y)
)
The map2() call generates a list of three numeric vectors. With this relatively simple operation, I get identical output every time I run the code. But I'm finding that with more complex, longer functional pipelines involving certain packages (e.g., bestNormalize), I'm not getting identical output when the set.seed() command is outside the iterative looping structure of map().
I'm at a loss for how to bring set.seed() within the map() iteration structure so that it is called anew at the beginning of each iteration. To be clear, the larger goal is to be able to iterate over functions that use random number generation, and to get identical results every time. Perhaps there's a better way to accomplish this in the tidyverse that doesn't depend on set.seed(). Thanks in advance for any help!
I hope this answers your question about where to place the seed inside the map call:
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
myfun <- function(means, sds){
set.seed(12345) # set it before each call
ret <- rnorm(500, mean = means, sd = sds)
return(ret)
}
data <- purrr::map2(means,
sds,
~ myfun(.x, .y))
As a followup, here is the most concise way to solve my original problem:
library(tidyverse)
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
data <- map2(
means,
sds,
~ {
set.seed(12345)
rnorm(500, mean = .x, sd = .y)
}
)
This code returns identical results each time it is run.
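One thing to keep in mind with the approach above: resetting the same seed in every iteration means each data set is built from the same underlying random draws, just shifted and rescaled. If that matters, a possible variation is to pass a different seed to each iteration. The sketch below uses base R's Map() so that it runs without extra packages; the seeds vector and the draw() helper are assumptions made for illustration, and the same idea should carry over to purrr::pmap().

```r
means <- c(40, 50, 60)
sds   <- c(9, 10, 11)
seeds <- c(101, 102, 103)   # hypothetical seeds, one per iteration

# Hypothetical helper: seed, then draw.
draw <- function(mean, sd, seed) {
  set.seed(seed)
  rnorm(500, mean = mean, sd = sd)
}

run1 <- Map(draw, means, sds, seeds)
run2 <- Map(draw, means, sds, seeds)
identical(run1, run2)                  # reproducible across runs

# For comparison: a single shared seed (recycled by Map) gives
# data sets that are perfectly correlated with each other.
same_seed <- Map(draw, means, sds, 12345)
cor(same_seed[[1]], same_seed[[2]])
cor(run1[[1]], run1[[2]])
```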

K-means iterated for same data for 10 times

I am new to R. I am trying to see whether I can improve a k-means result (using R) by calling the kmeans routine repeatedly, 10 to 15 times, on the same dataset with the same value of k (k = 3 in my case). The clustering changes on every call, and even the total sum of squares and withinss change, but I am not sure how to stop at the best result.
Can anyone guide me?
code:
run_kmeans <- function(xtimes)
{
for (x in 1:xtimes)
{
kmeans_results <- kmeans(filtered_data, 3)
print(kmeans_results["totss"])
print(kmeans_results["tot.withinss"])
}
return(kmeans_results)
}
kmeans_results = run_kmeans(10)
I'm not sure I understood your question, because this is not the usual way of selecting the best partition (the elbow method, the silhouette method, etc. are more common).
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could run kmeans repeatedly like this:
xtimes <- 10
# store the fits under a name other than kmeans to avoid masking the function
fits <- lapply(seq_len(xtimes), function(i) kmeans(x, 3))
lapply is often preferable to a for loop here, because it returns a list of results. To extract tot.withinss and see which run is minimal:
perf <- sapply(fits, function(d) d$tot.withinss)
which.min(perf)
However, unless I misunderstood your objective, this is a strange way to select the best-performing partition. Usually, it is the number of clusters that is evaluated, not different partitions produced with the same sample data and the same number of clusters.
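If the goal really is to keep the best of several random starts, kmeans() has an nstart argument that does this internally. The sketch below shows both routes side by side on simulated data like the ?kmeans example; the seed and the object names are assumptions made for illustration.

```r
set.seed(123)   # only so the sketch is repeatable
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Manual route: 10 independent runs, keep the one with the lowest tot.withinss.
fits <- lapply(seq_len(10), function(i) kmeans(x, centers = 3))
perf <- vapply(fits, function(fit) fit$tot.withinss, numeric(1))
best_manual <- fits[[which.min(perf)]]

# Built-in route: kmeans() tries 10 random starts and returns the best one.
best_nstart <- kmeans(x, centers = 3, nstart = 10)

best_manual$tot.withinss
best_nstart$tot.withinss
```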
Edit from your comment
OK, so you want to find the combination of columns that gives the best performance. Below is an example in which every pair of columns drawn from three variables is tested. You could generalize this (but the number of possible combinations with 8 variables is large, so you should have a routine to reduce the number of combinations tested).
# 150 values per block so that each block fills a 50 x 3 matrix exactly
x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
           matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3)
)
colnames(x) <- c("x", "y","z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
# one k-means fit per pair of columns
fits <- lapply(combinations, function(i) kmeans(x[, i], 3))
perf <- sapply(fits, function(d) d$tot.withinss)
which.min(perf)

Calculating the log-likelihood of a set of observations sampled from a mixture of two normal distributions using R

I wrote a function to calculate the log-likelihood of a set of observations sampled from a mixture of two normal distributions. This function is not giving me the correct answer.
I will not know which of the two distributions any given sample is from, so the function needs to sum over possibilities.
This function takes a vector of five model parameters as its first argument (μ1, σ1, μ2, σ2 and p), where μi and σi are the mean and standard deviation of the ith distribution and p is the probability that a sample is from the first distribution. For the second argument, the function takes a vector of observations.
I have written the following function:
mixloglik <- function(p, v) {
sum(log(dnorm(v, p[1], p[2])*p[5] + dnorm(v,p[3],p[4]))*p[5])
}
I can create test data, for which I know the solution should be ~ -854.6359:
set.seed(42)
v<- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)
When I test this function on the test data, I do not get the correct solution:
> mixloglik(p, v)
[1] -356.7194
I know the solution should be ~ -854.6359. Where am I going wrong in my function?
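In your version, p[5] multiplies the log rather than the first density, and the second density has no weight at all. The weights p[5] and (1 - p[5]) must multiply the two densities inside the log. The correct expression for the log-likelihood is the following.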
mixloglik <- function(p, v) {
sum(log(p[5]*dnorm(v, p[1], p[2]) + (1 - p[5])*dnorm(v, p[3], p[4])))
}
Now try it:
set.seed(42)
v<- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)
mixloglik(p, v)
#[1] -854.6359
In cases like this, the best way to find the error is to start over: write the expression out on paper and then code it again.
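As an optional variation, not something the original answer requires, the same log-likelihood can be computed on the log scale, which may help when observations are so far from both components that the densities underflow to zero. The sketch below assumes the same parameter layout (μ1, σ1, μ2, σ2, p); the name mixloglik_lse is made up for this example, and the log-sum-exp step is written out by hand.

```r
# Direct version, as in the answer above.
mixloglik <- function(p, v) {
  sum(log(p[5] * dnorm(v, p[1], p[2]) + (1 - p[5]) * dnorm(v, p[3], p[4])))
}

# Hypothetical log-scale version using the log-sum-exp trick.
mixloglik_lse <- function(p, v) {
  la <- log(p[5])    + dnorm(v, p[1], p[2], log = TRUE)  # log(p * f1(v))
  lb <- log1p(-p[5]) + dnorm(v, p[3], p[4], log = TRUE)  # log((1 - p) * f2(v))
  m  <- pmax(la, lb)                                     # subtract the larger term first
  sum(m + log(exp(la - m) + exp(lb - m)))
}

set.seed(42)
v <- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)
all.equal(mixloglik(p, v), mixloglik_lse(p, v))
```

For the test data in the question, both versions should agree to within floating-point tolerance.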

Vapply command and mvtnorm

I am not familiar with applying a function over a vector in R.
I would like a vector of cumulative probabilities of a bivariate normal distribution as some of its parameters change simultaneously. For example:
library(mvtnorm)
m<-2
corr<-diag(2)
corr[2,1]<-0
vapply(2*1:3,function(x)
pmvnorm(mean=c(2,x),corr,lower=c(-Inf,-Inf), upper=c(1,2)),1)
[1] 7.932763e-02 3.609428e-03 5.024809e-06
This gives the cumulative probability when the mean of the second variable takes the values 2, 4 and 6.
My problem is that I also want to change the mean of the first variable at the same time. I can't work out how to write the vapply call with more than one varying argument. What can I do?
Thank you very much
You can use mapply for this task:
library(mvtnorm)
corr <- diag(2)
m1 <- c(3, 5, 7)
m2 <- c(2, 4, 6)
mapply(function(x, y)
pmvnorm(mean = c(x, y), corr, lower = c(-Inf, -Inf), upper = c(1, 2)),
m1, m2)
## [1] 1.1375e-02 7.2052e-07 3.1246e-14
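As a cross-check that needs only base R: because corr is the identity matrix here, the two components are independent, so the joint probability is the product of two univariate normal probabilities. The sketch below is valid only under that independence assumption. It also shows that vapply can handle two varying arguments by iterating over an index; the object names are made up for this example.

```r
m1 <- c(3, 5, 7)
m2 <- c(2, 4, 6)
upper <- c(1, 2)

# mapply over both mean vectors in parallel (independence assumed, sd = 1)
p_mapply <- mapply(function(x, y) pnorm(upper[1], mean = x) * pnorm(upper[2], mean = y),
                   m1, m2)

# Same thing with vapply, iterating over positions instead of values
p_vapply <- vapply(seq_along(m1),
                   function(i) pnorm(upper[1], mean = m1[i]) * pnorm(upper[2], mean = m2[i]),
                   numeric(1))

# pnorm() is itself vectorized, so no apply function is strictly needed here
p_vec <- pnorm(upper[1], mean = m1) * pnorm(upper[2], mean = m2)

p_mapply
```

With a non-zero correlation the product shortcut no longer applies, and pmvnorm() inside mapply() or an index-based vapply() is still needed.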

How to generate such random numbers in R

I want to generate bivariates in the following way. I have four lists with equal length n. I need to use the first two lists as means lists, and the latter two as variance lists, and generate normal bivariates.
For example n=2, I have the lists as (1, 2), (3, 4), (5, 6), (7, 8), and I need
matrix(c(rnorm(1, mean=1, sd=sqrt(5)), rnorm(1, mean=2, sd=sqrt(6)), rnorm(1, mean=3, sd=sqrt(7)), rnorm(1, mean=4, sd=sqrt(8))), ncol=2)
How can I do this in R in a more functional way?
Here is one way:
m <- 1:4
s <- 5:8
# note: sd expects standard deviations; if s holds variances, pass sd = sqrt(s)
rnorm(n = 4, mean = m, sd = s)
[1] 4.599257 1.661132 16.987241 3.418957
This works because, like many R functions, rnorm() is 'vectorized', meaning that it allows you to call it once with vectors as arguments, rather than many times in a loop that iterates through the elements of the vectors.
Your main task, then, is to convert the 'lists' in which you've got your arguments right now into vectors that can be passed to rnorm().
NOTE: If you want to produce more than one random variate for each mean/sd combination (let's say 3), rnorm(n=rep(3,4), mean=m, sd=s) will not work. You'll have to either: (a) repeat elements of the m and s vectors like so: rnorm(n=3*4, mean=rep(m, each=3), sd=rep(s, each=3)); or (b) use mapply() as described in DWin's answer.
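Putting these pieces together, here is a small sketch under the assumption, taken from the question, that the last two lists hold variances, so they are converted with sqrt() before being passed as sd. The seed and the object names are illustrative only.

```r
set.seed(1)          # only so the sketch is repeatable
m <- 1:4             # means, flattened from the first two lists
v <- 5:8             # variances, flattened from the last two lists

# One draw per mean/variance pair, arranged as in the question:
# row i holds the i-th bivariate observation.
one_each <- matrix(rnorm(n = 4, mean = m, sd = sqrt(v)), ncol = 2)

# Three draws per pair: repeat the parameters, not n.
three_each <- rnorm(n = 3 * 4, mean = rep(m, each = 3), sd = rep(sqrt(v), each = 3))

# The vectorized call consumes the random stream like an element-wise loop would.
set.seed(1)
vec <- rnorm(4, mean = m, sd = sqrt(v))
set.seed(1)
looped <- sapply(1:4, function(i) rnorm(1, mean = m[i], sd = sqrt(v[i])))
all.equal(vec, looped)
```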
I'm taking you at your word that you have a list, i.e. an R list:
plist <- list( a=list(1, 2), b=list(3, 4), c=list(5, 6), d=list(7, 8))
means <-plist[c("a","b")] # or you could use means <- plist[1:2]
vars <- plist[c("c","d")]
# the third argument is matched to sd; use sqrt(unlist(vars)) if these are variances
mapply(rnorm, n=rep(1,4), unlist(means), unlist(vars))
#[1] 3.9382147 1.0502025 0.9554021 -7.3591917
You used the term bivariate. Did you really want to have x,y pairs that had a specific correlation?
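If correlated x,y pairs are what is wanted, one possible base R route is to transform independent standard normals with a Cholesky factor of the covariance matrix. The means, variances and correlation below are made-up values for illustration; packages such as MASS or mvtnorm provide ready-made generators as well.

```r
set.seed(2024)         # only so the sketch is repeatable
n    <- 10000
mu   <- c(1, 3)        # hypothetical means
vars <- c(5, 7)        # hypothetical variances
rho  <- 0.5            # hypothetical correlation

cov12 <- rho * sqrt(vars[1] * vars[2])
Sigma <- matrix(c(vars[1], cov12, cov12, vars[2]), nrow = 2)

# Independent standard normal draws, one row per observation.
z <- matrix(rnorm(2 * n), ncol = 2)

# chol() returns an upper-triangular R with t(R) %*% R equal to Sigma,
# so the rows of z %*% R have covariance Sigma; sweep() then adds the means.
xy <- sweep(z %*% chol(Sigma), 2, mu, "+")

cor(xy[, 1], xy[, 2])  # should be close to rho
colMeans(xy)           # should be close to mu
```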
