Automating a subsetting process in R

It is probably easy, but I can't figure it out.
I have a data frame with over 70 variables. I make predictions using all those variables. For a sensitivity analysis I would like to subset the data frame automatically to see how the prediction performs on each specific subset.
I have done this manually but with over 100 different subset options it is very tedious.
Here is the data/code and my desired solution:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
a = c(1.7, 3.3, 5.1)
df = data.frame(n, s, b, a)
df
To calculate the accuracy of the prediction of a:
df$calc <- df$a - df$n
df$difference <- sqrt(df$calc * df$calc)
With these values I can now calculate the Mean and SD
Mean <- mean(df$difference)
SD <- sd(df$difference)
Let's say I would like to get an overview of the prediction accuracy for all cases where b = TRUE. (Or other subsets of the data)
Ideally I would like a data frame to look like this:
subset = c("b=TRUE", "b=FALSE", "s=aa")
amount = c(2, 1, 1) # number of times this subset occurs
Mean = c(0.22, 0.3, 0.1)
SD = c(0.1, 0.2, 0.5)
OV = data.frame(subset, amount, Mean, SD)
OV
Considering that I have more than 100 different subsets that I would like to create, I need a fast solution that generates an overview like the OV data frame. I tried a loop, but I have trouble defining a vector for subsetting the data.
Thanks!
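One possible approach (a hedged sketch, not from the original thread): store each subset as an unevaluated condition in a named list, then lapply over the list and rbind the per-subset summaries into the OV overview. The three conditions below just mirror the examples above; with 100+ subsets you would only extend the list.
# Hedged sketch: the subset labels and conditions are illustrative only.
subsets <- list(
  "b=TRUE"  = quote(b == TRUE),
  "b=FALSE" = quote(b == FALSE),
  "s=aa"    = quote(s == "aa")
)
OV <- do.call(rbind, lapply(names(subsets), function(nm) {
  rows <- df[eval(subsets[[nm]], df), ]          # rows matching this condition
  data.frame(subset = nm,
             amount = nrow(rows),                # how often this subset occurs
             Mean   = mean(rows$difference),
             SD     = sd(rows$difference))       # NA for single-row subsets
}))
OV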

Related

K-means iterated on the same data 10 times

I am new to R. I am trying to evaluate whether I can optimize k-means (using R) by iteratively calling the k-means routine on the same dataset with the same value of K (k = 3 in my case) 10 to 15 times, to see if it gives me good results. I see that the clustering changes at every call, and even the total sum of squares and withinss change, but I am not sure how to stop at the best solution.
Can anyone guide me?
code:
run_kmeans <- function(xtimes) {
  for (x in 1:xtimes) {
    kmeans_results <- kmeans(filtered_data, 3)
    print(kmeans_results["totss"])
    print(kmeans_results["tot.withinss"])
  }
  return(kmeans_results)
}
kmeans_results <- run_kmeans(10)
I am not sure I understood your question, because this is not the usual way of selecting the best partition (that would be the elbow method, silhouette method, etc.).
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could write something like this to run kmeans repeatedly:
xtimes <- 10
km_runs <- lapply(seq_len(xtimes), function(i) {
  kmeans(x, 3)
})
Using lapply here is generally preferable to a for loop: you get a list of results back. To extract tot.withinss and see which run is minimal:
perf <- sapply(km_runs, function(d) d$tot.withinss)
which.min(perf)
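If it helps, the best run itself can then be pulled out of the list (a small addition, not part of the original answer):
best_run <- km_runs[[which.min(perf)]]   # kmeans object with the smallest tot.withinss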
However, unless I misunderstood your objective, this is a strange way to select the best-performing partition. Usually it is the number of clusters that is evaluated, not different partitions produced from the same sample data with the same number of clusters.
Edit from your comment
Ok, so you want to find the combination of columns that gives you the best performance. I give you an example below where every pairwise combination of three variables is tested. You could generalize it a little (but the number of possible combinations with 8 variables is very large, so you should have a routine to reduce the number of tested combinations).
x <- rbind(matrix(rnorm(99, sd = 0.3), ncol = 3),
           matrix(rnorm(99, mean = 1, sd = 0.3), ncol = 3))  # 99 values fill a 3-column matrix evenly
colnames(x) <- c("x", "y", "z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
km_combos <- lapply(combinations, function(i) {
  kmeans(x[, i], 3)
})
perf <- sapply(km_combos, function(d) d$tot.withinss)
which.min(perf)
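And to see which pair of columns actually won (again a small addition to the answer above):
combinations[[which.min(perf)]]   # column names of the best-performing pair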

Weighted Pearson's Correlation with one Object

I want to create a correlation matrix using data but weighted based on significant edges.
m <- matrix(data = rnorm(36), nrow = 6, ncol = 6)
x <- LETTERS[1:6]
y <- character(0)  # initialise y before appending in the loop
for (a in 1:length(x)) y <- c(y, paste("c", a, sep = ""))
mCor <- cor(t(m))
w <- sample(x = seq(0.5, 0.8, by = 0.01), size = 36, replace = TRUE)  # only 31 candidate values, so sampling 36 needs replacement
The object w represents the weights for mCor. I know of other packages that provide weighted correlation, but their input has to be two vectors x and y of the same length. I want to calculate a pairwise weighted Pearson's correlation table, using the data in each row across all columns.
I just want to make sure it is correct, but my idea was to compute a weighted correlation for each pair of rows A and B by multiplying each value by the given weight. You typically need three vectors of the same length: two for the data and one for the weights.
I am using the data.table package, so fast solutions are welcome. Also, I am not sure whether I should pass a table with two columns for the connections and one for the weights. Do the existing functions preserve order or match automatically?
library(data.table)
weight <- data.table(x = rep(LETTERS[1:3], each = 12), y = rep(LETTERS[4:6], times = 3), w = w)
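For the pairwise computation itself, here is a minimal hedged sketch of a weighted Pearson correlation between two rows of m. weighted_pearson is a hypothetical helper (my own, not from any package), using the usual weighted covariance divided by the weighted standard deviations; stats::cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor should give the same correlation.
# Hypothetical helper (assumption, not an existing function): weighted Pearson
# correlation of two equal-length vectors x and y with observation weights w.
weighted_pearson <- function(x, y, w) {
  w  <- w / sum(w)                          # normalise the weights
  mx <- sum(w * x); my <- sum(w * y)        # weighted means
  num <- sum(w * (x - mx) * (y - my))       # weighted covariance
  den <- sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))
  num / den
}
# usage sketch: one weight per column of m (six here), e.g. equal weights
obs_w <- rep(1/6, 6)
weighted_pearson(m[1, ], m[2, ], obs_w)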

How can I find numerical intervals of k-means clusters?

I'm trying to discretize a numerical variable using Kmeans.
It worked pretty well, but I'm wondering how I can find the intervals within my clusters.
I work with FactoMineR to do my kmeans.
I found 3 clusters according to a graph (not included here).
My point now is to identify the intervals of my numerical variable within the clusters.
Is there an option or method in FactoMineR or another package to do this?
I can do it manually, but since I have to do it for a certain number of variables, I'd like to find an easy way to identify them.
Since you have not provided data, I have used the example from the kmeans documentation, which produces two groups for data with two columns x and y. You can split the original data by the cluster each row belongs to and then extract values from each group. I am not sure whether my example data resembles yours, but in the code below I have simply used the difference between the minimum of column x and the maximum of column y as the boundaries of a potential interval (whether this makes sense depends on the use case). Does that help?
data <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(data) <- c("x", "y")
cl <- kmeans(data, 2)
data <- as.data.frame(cbind(data, cluster = cl$cluster))
lapply(split(data, data$cluster), function(x) {
  min_x <- min(x$x)
  max_y <- max(x$y)
  diff <- max_y - min_x
  c(min_x = min_x, max_y = max_y, diff = diff)
})
# $`1`
# min_x max_y diff
# -0.6906124 0.5123950 1.2030074
#
# $`2`
# min_x max_y diff
# 0.2052112 1.6941800 1.4889688
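For the single numeric variable the question actually describes, a shorter hedged variant of the same idea: run kmeans on the variable itself and report the observed min/max inside each cluster as the interval boundaries (the example variable v below is invented).
set.seed(42)
v  <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))   # stand-in for the real variable
cl <- kmeans(v, centers = 3)
# one row per cluster with the lower/upper bound of the observed interval
do.call(rbind, tapply(v, cl$cluster, function(z) c(lower = min(z), upper = max(z))))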

R: weighted imputation/imputation preferences

Suppose I have a dataset with multiple columns and one of them is gender. As far as I understand, knnImputation() with standard options will compute a distance metric in which all variables are treated equally, while I wish to create a rule where, for example, having the same gender is strongly preferred when searching for neighbours (e.g., gender has a stronger influence on the total weight, or only rows with the same gender are chosen; the latter can be done by splitting and then reassembling both training and testing sets, but maybe there is a simpler way).
I see that kNNImpute() has the impute.fn parameter for the imputation function and knnImputation() has meth for the method. How can I create such a rule so that it is flexible and easy to edit (e.g. written as a function or something like that)?
This will not do variable selection, but it will impute with kNN using only the rows that have the matching gender g, as you suggest in the comments:
Sys.setenv("PKG_CXXFLAGS"="-std=c++0x") # needed for the lambda functions in Rcpp
# install/load package, create example data
devtools::install_github("alexwhitworth/imputation")
library(imputation)
set.seed(1345)
g <- sample(c("M", "F"), 100, replace=T)
a <- matrix(rnorm(1000), ncol=10)
a[a>1.5] <- NA
df <- data.frame(a,g)
# subset by gender, exclude character column from kNN (which doesn't
# handle character variables)
df_f <- kNN_impute(df[df$g == "F", 1:10], k= 3, q= 2, check_scale = FALSE, parallel= FALSE)
df_m <- kNN_impute(df[df$g == "M", 1:10], k= 3, q= 2, check_scale = FALSE, parallel= FALSE)
# recombine. Can use rownames as key
df2 <- data.frame(rbind(df_f$x, df_m$x))
df2 <- df2[order(as.integer(rownames(df2))),]
df2$g <- df$g
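If the same pattern is needed for other grouping variables, a hedged generalization of the steps above could look like the helper below. impute_by_group is my own name (not from the imputation package), and it assumes, as the code above does, that kNN_impute returns the imputed values in $x with rows kept in their original order.
impute_by_group <- function(data, group_col, impute_cols, k = 3, q = 2) {
  res <- data[, impute_cols]
  # impute each gender (or other group) separately, writing back by row index
  for (idx in split(seq_len(nrow(data)), data[[group_col]])) {
    out <- kNN_impute(data[idx, impute_cols], k = k, q = q,
                      check_scale = FALSE, parallel = FALSE)
    res[idx, ] <- out$x
  }
  res[[group_col]] <- data[[group_col]]
  res
}
# equivalent to the manual split/recombine above:
df2 <- impute_by_group(df, "g", 1:10)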

Sample from an unknown probability distribution

I have a vector of length ~100k, with values between 0 and 1 representing habitat suitability at geographic locations. While some of the values are very small, many of them are around 0.9, so the sum is much greater than one.
I would like to generate 1000 random samples of locations, each sample having length 6 (without replacement), with the probability that a location is chosen being weighted by the value of the vector at that location.
Dummy data below. Any ideas?
mylocs = letters[1:10]
myprobs = c(0.1,NA,0.01,0.2,0.6,NA,0.001,0.03,0.9,NA)
mydata = data.frame(mylocs,myprobs)
I'm a bit confused by your question, so here are two possible answers.
If you want to sample 1000 groups of six values, where groups can share values, then:
locs = letters[1:15]
probs = c(0.1,NA,0.01,0.2,0.6,NA,0.001,0.03,0.9,NA, 0.1, 0.1, 0.1, 0.1, 0.1)
mydata = data.frame(locs,probs)
d = na.omit(mydata)
replicate(1000, sample(d$locs, size=6, prob=d$probs, replace=F))
If groups shouldn't share values, then just do:
## Change the "2" to 1000 in the real data set
s = sample(d$locs, size=6*2, prob=d$probs, replace=F)
matrix(s, ncol=6)
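A small hedged follow-up: to turn the non-overlapping draws into a long data frame with a group id (convenient for joining back to mydata), something like this should work:
m6 <- matrix(s, ncol = 6)                    # one group of six locations per row
draws <- data.frame(group = rep(seq_len(nrow(m6)), times = ncol(m6)),
                    loc   = as.vector(m6))   # column-wise order matches the group ids
draws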
