Apologies, I am new to R. I have a dataset with height and canopy density of trees, for example:
i_h100 i_cd
2.89 0.0198
2.88 0.0198
17.53 0.658
27.23 0.347
I want to regroup i_h100 into 2m intervals going from 2m min to 30m max. I then want to calculate the mean i_cd value and interquartile range for each of these intervals so that I can plot these with a least squares regression. There is something wrong with the code I am using to get the mean. This is what I have so far:
mydata=read.csv("irelandish.csv")
height=mydata$i_h100
breaks=seq(2,30,by=2) #2m intervals
height.cut=cut(height, breaks, right=TRUE)
#attempt at calculating means per group
install.packages("dplyr")
mean=summarise(group_by(cut(height, breaks, right=TRUE),
mean(mydata$i_cd)))
install.packages("reshape2")
dcast(mean)
Thanks in advance for any advice.
Here is an approach using aggregate() to calculate the groupwise means.
# Some example data
set.seed(1)
i_h100 <- round(runif(100, 2, 30), 2)
i_cd <- rexp(100, 1/i_h100)
mydata <- data.frame(i_cd, i_h100)
# Grouping i_h100
mydata$i_h100_2m <- cut(mydata$i_h100, seq(2, 30, by=2))
head(mydata)
# i_cd i_h100 i_h100_2m
# 1 2.918093 9.43 (8,10]
# 2 13.735728 12.42 (12,14]
# 3 13.966347 18.04 (18,20]
# 4 2.459760 27.43 (26,28]
# 5 8.477551 7.65 (6,8]
# 6 6.713224 27.15 (26,28]
# Calculate groupwise means of i_cd
i_cd_2m_mean <- aggregate(i_cd ~ i_h100_2m, mydata, mean)
# And IQR
i_cd_2m_iqr <- aggregate(i_cd ~ i_h100_2m, mydata, IQR)
upper <- i_cd_2m_mean[,2]+(i_cd_2m_iqr[,2]/2)
lower <- i_cd_2m_mean[,2]-(i_cd_2m_iqr[,2]/2)
# Plotting the result
plot.default(i_cd_2m_mean, xaxt="n", ylim=range(c(upper, lower)),
main="Groupwise means \U00B1 0.5 IQR", type="n")
points(upper, pch=2, col="lightblue", lwd=1.5)
points(lower, pch=6, col="pink", lwd=1.5)
points(i_cd_2m_mean, pch=16)
axis(1, at=seq_along(i_cd_2m_mean[,1]), labels=as.character(i_cd_2m_mean[,1]), cex.axis=0.6, las=2)
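The question also asks to plot these with a least squares regression. A minimal, hedged sketch of one way to do that, reusing i_cd_2m_mean from above and assuming you want to place each 2m bin at its interval midpoint:
# Hedged sketch: regress the groupwise mean i_cd on the 2m interval midpoints
# (drawn in a separate plot from the one above).
mids <- seq(3, 29, by=2)[as.integer(i_cd_2m_mean$i_h100_2m)]  # bin midpoints
fit <- lm(i_cd_2m_mean$i_cd ~ mids)
plot(mids, i_cd_2m_mean$i_cd, xlab="i_h100 interval midpoint (m)",
     ylab="mean i_cd", pch=16)
abline(fit)
summary(fit)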
Here is a solution using dplyr:
library(reshape2)
library(dplyr)
mydata <- data_frame(i_h100=c(2.89,2.88,17.53,27.23),i_cd=c(0.0198,0.0198,0.658,0.347))
height <- mydata$i_h100
breaks <- seq(2,30,by=2) #2m intervals
height.cut <- cut(height, breaks, right=TRUE)
mydata$height.cut <- height.cut
mean_i_h100 <- mydata %>% group_by(height.cut) %>% summarise(mean_i_h100 = mean(i_h100))
A few remarks:
It is better to avoid naming variables after functions, so I changed the mean variable to mean_i_h100.
I am using the pipe notation, which makes the code more readable and avoids repeating the first argument of each function; you can find a more detailed explanation here.
Without the pipe notation, the last line of code would be:
mean_i_h100 <- summarise(group_by(mydata,height.cut),mean_i_h100 = mean(i_h100))
You have to load the two packages you installed with library().
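Note that the question actually asks for the mean and interquartile range of i_cd within each height interval; under that assumption, a hedged sketch extending the same pipeline (reusing mydata and height.cut from above) would be:
# Hedged sketch: groupwise mean and IQR of i_cd per 2m height interval.
summary_i_cd <- mydata %>%
  group_by(height.cut) %>%
  summarise(mean_i_cd = mean(i_cd),
            iqr_i_cd = IQR(i_cd))
summary_i_cd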
I'm just beating my head against the wall trying to get a Cholesky decomposition to work in order to simulate correlated price movements.
I use the following code:
cormat <- as.matrix(read.csv("http://pastebin.com/raw/qGbkfiyA"))
cormat <- cormat[,2:ncol(cormat)]
rownames(cormat) <- colnames(cormat)
cormat <- apply(cormat,c(1,2),FUN = function(x) as.numeric(x))
chol(cormat)
#Error in chol.default(cormat) :
# the leading minor of order 8 is not positive definite
cholmat <- chol(cormat, pivot=TRUE)
#Warning message:
# In chol.default(cormat, pivot = TRUE) :
# the matrix is either rank-deficient or indefinite
rands <- array(rnorm(ncol(cholmat)), dim = c(10000,ncol(cholmat)))
V <- t(t(cholmat) %*% t(rands))
#Check for similarity
cor(V) - cormat ## Not all zeros!
#Check the standard deviations
apply(V,2,sd) ## Not all ones!
I'm not really sure how to properly use the pivot = TRUE statement to generate my correlated movements. The results look totally bogus.
Even if I take a simple matrix and try out "pivot", I get bogus results...
cormat <- matrix(c(1,.95,.90,.95,1,.93,.90,.93,1), ncol=3)
cholmat <- chol(cormat)
# No Error
cholmat2 <- chol(cormat, pivot=TRUE)
# No warning... pivot changes column order
rands <- array(rnorm(ncol(cholmat)), dim = c(10000,ncol(cholmat)))
V <- t(t(cholmat2) %*% t(rands))
#Check for similarity
cor(V) - cormat ## Not all zeros!
#Check the standard deviations
apply(V,2,sd) ## Not all ones!
There are two errors in your code:
You did not use the pivoting index to revert the pivoting applied to the Cholesky factor. Note that a pivoted Cholesky factorization of a positive semi-definite matrix A computes:
P'AP = R'R
where P is a column pivoting matrix and R is an upper triangular matrix. To recover A from R, we need to apply the inverse of P (i.e., P'):
A = PR'RP' = (RP')'(RP')
A multivariate normal with covariance matrix A is then generated by:
XRP'
where X is multivariate normal with zero mean and identity covariance.
Your generation of X
X <- array(rnorm(ncol(R)), dim = c(10000,ncol(R)))
is wrong. First, the number of columns should be the numerical rank r (extracted below via attr(R, "rank")), not ncol(R). Second, you are recycling rnorm(ncol(R)) down the columns, so the resulting matrix is not random at all and cor(X) is never close to an identity matrix. The correct code is:
X <- matrix(rnorm(10000 * r), 10000, r)
As a model implementation of the above theory, consider your toy example:
A <- matrix(c(1,.95,.90,.95,1,.93,.90,.93,1), ncol=3)
We compute the upper triangular factor (suppressing possible rank-deficient warnings) and extract inverse pivoting index and rank:
R <- suppressWarnings(chol(A, pivot = TRUE))
piv <- order(attr(R, "pivot")) ## reverse pivoting index
r <- attr(R, "rank") ## numerical rank
Then we generate X. For a better result we centre X so that its column means are 0.
X <- matrix(rnorm(10000 * r), 10000, r)
## for best effect, we centre `X`
X <- sweep(X, 2L, colMeans(X), "-")
Then we generate target multivariate normal:
## compute `V = RP'`
V <- R[1:r, piv]
## compute `Y = X %*% V`
Y <- X %*% V
We can verify that Y has target covariance A:
cor(Y)
# [,1] [,2] [,3]
#[1,] 1.0000000 0.9509181 0.9009645
#[2,] 0.9509181 1.0000000 0.9299037
#[3,] 0.9009645 0.9299037 1.0000000
A
# [,1] [,2] [,3]
#[1,] 1.00 0.95 0.90
#[2,] 0.95 1.00 0.93
#[3,] 0.90 0.93 1.00
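To apply the same recipe to the rank-deficient correlation matrix from the question, the steps can be collected into a small helper. This is a hedged sketch under the assumptions above; the name rmvn_pivchol is illustrative, not an existing function:
# Hedged sketch: draw n samples whose correlation is (approximately) A, using a
# pivoted Cholesky factor so that rank-deficient A is handled.
rmvn_pivchol <- function(n, A) {
  R <- suppressWarnings(chol(A, pivot = TRUE))
  piv <- order(attr(R, "pivot"))          # reverse pivoting index
  r <- attr(R, "rank")                    # numerical rank
  X <- matrix(rnorm(n * r), n, r)
  X <- sweep(X, 2L, colMeans(X), "-")     # centre columns
  X %*% R[1:r, piv]                       # Y = X R P'
}
# e.g. with the question's cleaned cormat:
# V <- rmvn_pivchol(10000, cormat)
# max(abs(cor(V) - cormat))               # should be small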
If I think I understand something, I like to verify it, so in this case I was trying to verify the calculation of the partial autocorrelation, pacf().
What I end up with is something a little different. My understanding is that the pacf at a given lag is the coefficient on the last/furthest lag in a regression on all of the previous lags. To set up some code, I'm using the Canadian employment data from the book Elements of Forecasting by F. Diebold (1998), Chapter 6.
#Obtain Canadian Employment dataset
caemp <- c(83.090255, 82.7996338824, 84.6344380294, 85.3774583529, 86.197605, 86.5788438824, 88.0497240294, 87.9249263529, 88.465131, 88.3984638824, 89.4494320294, 90.5563753529, 92.272335, 92.1496788824, 93.9564890294, 94.8114863529, 96.583434, 96.9646728824, 98.9954360294, 101.138164353, 102.882122, 103.095394882, 104.006386029, 104.777404353, 104.701732, 102.563504882, 103.558486029, 102.985774353, 102.098281, 101.471734882, 102.550696029, 104.021564353, 105.093652, 105.194954882, 104.594266029, 105.813184353, 105.149642, 102.899434882, 102.354736029, 102.033974353, 102.014299, 101.835654882, 102.018806029, 102.733834353, 103.134062, 103.263354882, 103.866416029, 105.393274353, 107.081242, 108.414274882, 109.297286029, 111.495994353, 112.680072, 113.061304882, 112.376636029, 111.244054353, 107.305192, 106.678644882, 104.678246029, 105.729204353, 107.837082, 108.022364882, 107.281706029, 107.016934353, 106.045452, 106.370704882, 106.049966029, 105.841184353, 106.045452, 106.650644882, 107.393676029, 108.668584353, 109.628702, 110.261894882, 110.920946029, 110.740154353, 110.048622, 108.190324882, 107.057746029, 108.024724353, 109.712692, 111.409654882, 108.765396029, 106.289084353, 103.917902, 100.799874882, 97.3997700294, 93.2438143529, 94.123068, 96.1970798824, 97.2754290294, 96.4561423529, 92.674237, 92.8536228824, 93.4304540294, 93.2055593529, 93.955896, 94.7296738824, 95.5665510294, 95.5459793529, 97.09503, 97.7573598824, 96.1609430294, 96.5861653529, 103.874812, 105.094384882, 106.804276029, 107.786744353, 106.596022, 107.310354882, 106.897156029, 107.210924353, 107.134682, 108.829774882, 107.926196029, 106.298904353, 103.365872, 102.029554882, 99.3000760294, 95.3045073529, 90.50099, 88.0984848824, 86.5150710294, 85.1143943529, 89.033584, 88.8229008824, 88.2666710294, 87.7260053529, 88.102896, 87.6546968824, 88.4004090294, 88.3618013529, 89.031151, 91.0202948824, 91.6732820294, 92.0149173529)
# create time series with the canadian employment dataset
caemp.ts<-ts(caemp, start=c(1961, 1), end=c(1994, 4), frequency=4)
caemp.ts2<-window(caemp.ts,start=c(1961,5), end=c(1993,4))
# set up max lag the book says use sqrt(T) but in this case i'm using 3 for the example
lag.max <- 3
# R Code using pacf()
pacf(caemp.ts2, lag.max=3, plot=F)
# initialize vector to capture the partial autocorrelations
pauto.corr <- rep(0, lag.max)
# Set up lagged data frame
pa.mat <- as.data.frame(caemp.ts2)
for(i in 1:lag.max){
a <- c(rep(NA, i), pa.mat[1:(length(caemp.ts2) - i),1])
pa.mat <- cbind(pa.mat, a)
}
names(pa.mat) <- c("0":lag.max)
# Set up my base linear model
base.lm <- lm(pa.mat[, 1] ~ 1)
### I could not get the for loop to work successfully here
i <- 1
base.lm <- update(base.lm, .~. + pa.mat[,2])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
i<-2
base.lm <-update(base.lm, .~. + pa.mat[,3])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
i<-3
base.lm <-update(base.lm, .~. + pa.mat[,4])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
# Compare results...
round(pauto.corr,3)
pacf(caemp.ts2, lag.max=3, plot=F)
For the output
> round(pauto.corr,3)
[1] 0.971 -0.479 -0.072
> pacf(caemp.ts2, lag.max=3, plot=F)
Partial autocorrelations of series ‘caemp.ts2’, by lag
0.25 0.50 0.75
0.949 -0.244 -0.100
Maybe it is because my example is quarterly and not monthly data, or I could just be wrong?
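For comparison, pacf() does not run successive OLS regressions; as far as I understand, for a univariate series it is derived from the sample autocorrelations via the Durbin-Levinson recursion, which is one reason OLS coefficients on lagged columns need not match it exactly. A hedged sketch of that recursion, reusing caemp.ts2 from the question:
# Hedged sketch: partial autocorrelations from the sample ACF via Durbin-Levinson;
# this should reproduce pacf(caemp.ts2, lag.max=3, plot=FALSE).
rho <- acf(caemp.ts2, lag.max = 3, plot = FALSE)$acf[-1]  # sample ACF, lags 1..3
phi_prev <- numeric(0)
pacf_dl <- numeric(3)
for (k in 1:3) {
  if (k == 1) {
    phi_kk <- rho[1]
    phi_cur <- phi_kk
  } else {
    phi_kk <- (rho[k] - sum(phi_prev * rho[(k - 1):1])) /
              (1 - sum(phi_prev * rho[1:(k - 1)]))
    phi_cur <- c(phi_prev - phi_kk * rev(phi_prev), phi_kk)
  }
  pacf_dl[k] <- phi_kk
  phi_prev <- phi_cur
}
round(pacf_dl, 3)  # compare with pacf(caemp.ts2, lag.max=3, plot=FALSE)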
In the caret package, the help for the related varImp() function says:
Partial Least Squares: the variable importance measure here is based
on weighted sums of the absolute regression coefficients. The weights
are a function of the reduction of the sums of squares across the
number of PLS components and are computed separately for each outcome.
Therefore, the contribution of the coefficients are weighted
proportionally to the reduction in the sums of squares.
Below is the variable importance output of a classification model fitted with caret using method="pls":
> varImp(plsFitvac)
pls variable importance
variables are sorted by average importance across the classes
H P R Q
IL17A 9.516 100.000 19.813 61.20
IL8 17.814 1.344 80.628 34.33
IL6ST 10.319 75.452 62.296 68.41
IL23A 7.662 55.422 43.188 44.17
IL27RA 10.311 0.000 45.932 24.76
IL12RB2 15.497 28.467 38.848 33.73
IL12B 13.569 22.799 32.728 27.25
IL12RB1 12.292 23.431 6.395 18.67
IL12A 10.394 22.774 12.330 18.94
EBI3 12.039 6.932 14.877 11.01
IL23R 13.053 10.018 9.708 13.22
That's fine, but when I extract this data frame by this line of code:
df <- varImp(plsFitvac)$importance
I get the same table as above but unsorted; it would be very nice if it were sorted, though. Anyway, to sort this data frame based on the average importance across classes (as stated in the output) I did this:
df$Sort <- apply(df, 1, sum)
df$Sort <- df$Sort/ncol(df) # not needed since sum and average will be sorted alike
df[order(df$Sort,decreasing=TRUE),]
> df[order(df$Sort,decreasing=TRUE),]
H P R Q Sort
IL6ST 10.318521 75.451572 62.295779 68.40740 43.294655
IL17A 9.515726 100.000000 19.813439 61.20098 38.106029
IL23A 7.662351 55.422249 43.187811 44.16892 30.088267
IL8 17.813522 1.343589 80.628315 34.32519 26.822122
IL12RB2 15.497069 28.466890 38.847943 33.73476 23.309331
IL12B 13.569266 22.798682 32.727759 27.24567 19.268275
IL27RA 10.311489 0.000000 45.932101 24.76301 16.201321
IL12A 10.393673 22.773860 12.329890 18.94323 12.888131
IL12RB1 12.291526 23.431046 6.395495 18.66685 12.156983
IL23R 13.053380 10.018339 9.708473 13.22094 9.200227
EBI3 12.039321 6.931682 14.877214 11.00619 8.970881
That ended up with a different ordering than the sorted list caret prints via the varImp() function. Am I missing something here? Thanks.
Note:
I didn't pass the importance = TRUE argument to the train() call for the PLSDA model, i.e., method = "pls".
Here is dput() of the extracted $importance data frame:
> dput(df)
structure(list(H = c(17.8135216215421, 9.51572613703257, 7.66235106434041,
13.0533801732928, 12.0393206867905, 10.3185210244416, 10.3936725783446,
15.4970686175322, 13.569265567599, 12.291526066084, 10.3114887728613
), P = c(1.34358921525031, 100, 55.4222485106407, 10.0183388053119,
6.93168239216908, 75.4515720604057, 22.7738599760963, 28.4668895810321,
22.7986823025468, 23.4310464801875, 0), R = c(80.6283150180913,
19.8134392303359, 43.1878112878907, 9.70847280019312, 14.8772141493434,
62.2957787591232, 12.3298895434334, 38.8479426109151, 32.7277593254102,
6.39549491068232, 45.932101268196), Q = c(34.3251855315416, 61.2009790458015,
44.1689231007598, 13.2209412495112, 11.0061874803613, 68.4074013762385,
18.9432341406872, 33.7347566350668, 27.2456691770754, 18.6668467881651,
24.7630136095146)), .Names = c("H", "P", "R", "Q"), row.names = c("IL8",
"IL17A", "IL23A", "IL23R", "EBI3", "IL6ST", "IL12A", "IL12RB2",
"IL12B", "IL12RB1", "IL27RA"), class = "data.frame")
Question:
How does caret measure importance across classes? Can I trust the unsorted varImp() output?
EDIT:
Ranking the variables by their max() importance across classes:
vi <- varImp(plsFitvac)$importance
vi$max <- apply(vi, 1, max)
vi[order(-vi$max),]
resulted in the same ordering as varImp():
varImp(plsFitvac)
which yielded this:
> vi[order(-vi$max),]
H P R Q max
IL17A 9.515726 100.000000 19.813439 61.20098 100.00000
IL8 17.813522 1.343589 80.628315 34.32519 80.62832
IL6ST 10.318521 75.451572 62.295779 68.40740 75.45157
IL23A 7.662351 55.422249 43.187811 44.16892 55.42225
IL27RA 10.311489 0.000000 45.932101 24.76301 45.93210
IL12RB2 15.497069 28.466890 38.847943 33.73476 38.84794
IL12B 13.569266 22.798682 32.727759 27.24567 32.72776
IL12RB1 12.291526 23.431046 6.395495 18.66685 23.43105
IL12A 10.393673 22.773860 12.329890 18.94323 22.77386
EBI3 12.039321 6.931682 14.877214 11.00619 14.87721
IL23R 13.053380 10.018339 9.708473 13.22094 13.22094
but using the sum() of importance across classes yielded a different ranking (see above). So which one is correct, and what happens in case of ties with the max() method?
The output shown using varImp(plsFitvac) is formatted and shown to some abbreviated level of precision:
> format(9.515726, digits = 4)
[1] "9.516"
Try using various values of digits in this code:
format(varImp(plsFitvac)$importance, digits = 4)
and you should be able to see that they are the same values.
When you print the data frame, print.data.frame uses digits = getOption("digits") while print.varImp.train uses max(3, getOption("digits") - 3).
The default value of getOption("digits") gives me a headache, which is why my function is the way that it is.
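A small illustrative sketch (assuming the plsFitvac object from the question): render the same importance values under the two digit settings to see that they are the same numbers.
# Hedged sketch: the same values, shown with the two different print defaults.
imp <- varImp(plsFitvac)$importance
print(imp, digits = getOption("digits"))               # print.data.frame default
print(imp, digits = max(3, getOption("digits") - 3))   # print.varImp.train default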
EDIT: if the question is about the ordering, the way the function ranks these is to find the maximum importance across the classes for each predictor and order them based on that. There is a little more to it (in case of ties etc) and the code is in the undocumented internal function sortImp. This code should approximate that function:
vi$max <- apply(vi, 1, max)
vi[order(-vi$max),]
Max
Try using write.csv2(varImp(plsFitvac)$importance, "vi.csv") and you can sort it in Excel.
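If you would rather sort in R before exporting, a hedged sketch (mirroring the max-based ordering discussed above) could be:
# Hedged sketch: sort by max importance across classes, then export.
vi2 <- varImp(plsFitvac)$importance
write.csv2(vi2[order(-apply(vi2, 1, max)), ], "vi_sorted.csv")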
I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable names). Think of df1 as the training set and df2 as the testing set; I want to cluster on the training set and assign each test point to the correct cluster.
I know how to do this with the apply function and a few simple user-defined functions (previous posts on the topic have usually proposed something similar):
df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)
closest.cluster <- function(x) {
cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x-y)^2)))
return(which.min(cluster.dist)[1])
}
clusters2 <- apply(df2, 1, closest.cluster)
However, I'm preparing this clustering example for a course in which students will be unfamiliar with the apply function, so I would much prefer if I could assign the clusters to df2 with a built-in function. Are there any convenient built-in functions to find the closest cluster?
You could use the flexclust package, which has an implemented predict method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:
as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
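Applied to the objects in the question, the conversion route might look like this (a hedged sketch; it assumes km, df1 and df2 as defined there):
# Hedged sketch: convert the question's kmeans() fit to a kcca object and
# predict cluster membership for the new data frame df2.
library(flexclust)
km.kcca <- as.kcca(km, data = df1)
clusters2 <- predict(km.kcca, newdata = df2)
head(clusters2)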
Something I noticed about both the approach in the question and the flexclust approach is that they are rather slow (benchmarked here on a training and a testing set with 1 million observations and 2 features each).
Fitting the original model is reasonably fast:
set.seed(144)
df1 <- data.frame(x=runif(1e6), y=runif(1e6))
df2 <- data.frame(x=runif(1e6), y=runif(1e6))
system.time(km <- kmeans(df1, centers=3))
# user system elapsed
# 1.204 0.077 1.295
The solution I posted in the question is slow at calculating the test-set cluster assignments, since it separately calls closest.cluster for each test-set point:
system.time(pred.test <- apply(df2, 1, closest.cluster))
# user system elapsed
# 42.064 0.251 42.586
Meanwhile, the flexclust package seems to add a lot of overhead regardless of whether we convert the fitted model with as.kcca or fit a new one ourselves with kcca (though the prediction at the end is much faster):
# APPROACH #1: Convert from the kmeans() output
system.time(km.flexclust <- as.kcca(km, data=df1))
# user system elapsed
# 87.562 1.216 89.495
system.time(pred.flexclust <- predict(km.flexclust, newdata=df2))
# user system elapsed
# 0.182 0.065 0.250
# Approach #2: Fit the k-means clustering model in the flexclust package
system.time(km.flexclust2 <- kcca(df1, k=3, kccaFamily("kmeans")))
# user system elapsed
# 125.193 7.182 133.519
system.time(pred.flexclust2 <- predict(km.flexclust2, newdata=df2))
# user system elapsed
# 0.198 0.084 0.302
It seems that there is another sensible approach here: using a fast k-nearest neighbors solution like a k-d tree to find the nearest neighbor of each test-set observation within the set of cluster centroids. This can be written compactly and is relatively speedy:
library(FNN)
system.time(pred.knn <- get.knnx(km$centers, df2, 1)$nn.index[,1])
# user system elapsed
# 0.315 0.013 0.345
all(pred.test == pred.knn)
# [1] TRUE
You can use the ClusterR::KMeans_rcpp() function, which uses RcppArmadillo. It allows for multiple initializations (which can be parallelized if OpenMP is available). Besides the optimal_init, quantile_init, random and kmeans++ initializations, one can specify the centroids using the CENTROIDS parameter. The running time and convergence of the algorithm can be adjusted using the num_init, max_iters and tol parameters.
library(scorecard)
library(ClusterR)
library(dplyr)
library(ggplot2)
## Generate data
set.seed(2019)
x = c(rnorm(200000, 0,1), rnorm(150000, 5,1), rnorm(150000,-5,1))
y = c(rnorm(200000,-1,1), rnorm(150000, 6,1), rnorm(150000, 6,1))
df <- split_df(data.frame(x,y), ratio = 0.5, seed = 123)
system.time(
kmrcpp <- KMeans_rcpp(df$train, clusters = 3, num_init = 4, max_iters = 100, initializer = 'kmeans++'))
# user system elapsed
# 0.64 0.05 0.82
system.time(pr <- predict_KMeans(df$test, kmrcpp$centroids))
# user system elapsed
# 0.01 0.00 0.02
p1 <- df$train %>% mutate(cluster = as.factor(kmrcpp$clusters)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("train data")
p2 <- df$test %>% mutate(cluster = as.factor(pr)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("test data")
gridExtra::grid.arrange(p1,p2,ncol = 2)