Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question concerns the output of fit$cluster. I know that it gives a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Is there a way to have the clusters labeled 1, 2, etc. in order of decreasing numerical value of their center?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the cluster numbers randomly. I've looked into documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups based on a logical ordinal set. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
library(dplyr)  # provides tibble(), arrange(), mutate() and %>%
kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
k_means_df <- tibble(index=kmeans_index, centres=k_means_centres)  # data_frame() is deprecated
k_means_df <- k_means_df %>%
  arrange(centres)
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
index centres factors
1 2 23.33770 very_young
2 5 39.15239 young
3 1 55.31727 middle_age
4 4 67.49422 old
5 3 79.38353 very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- tibble(cluster=k5$cluster, factor=NA_character_)
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df))
{
index_val <- k_means_df$index[i]
factor_val <- as.character(k_means_df$factors[i])
cluster_vals <- cluster_vals %>%
mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voila; I now have a vector of factors/characters that were applied based on their ordinal logic to the randomly created cluster vector.
# A tibble: 3,163 x 2
cluster factor
<int> <chr>
1 4 old
2 2 very_young
3 2 very_young
4 2 very_young
5 3 very_old
6 3 very_old
7 4 old
8 4 old
9 2 very_young
10 5 young
# ... with 3,153 more rows
Hope this helps.
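For what it's worth, the loop above can probably be replaced by a vectorized lookup. This is only a sketch: the age vector is simulated (an assumption, since all_data is not shown), and pos and age_group are names made up for illustration.

```r
# Simulated ages standing in for all_data$age (assumption: the real data is not shown)
set.seed(42)
age <- c(rnorm(50, 23, 2), rnorm(50, 39, 2), rnorm(50, 55, 2),
         rnorm(50, 67, 2), rnorm(50, 79, 2))

k5 <- kmeans(age, centers = 5, nstart = 25)

labels <- c("very_young", "young", "middle_age", "old", "very_old")

# pos[j] is the position of cluster j when the centres are sorted ascending
pos <- as.integer(rank(k5$centers[, 1]))

# Look up each point's cluster position and turn it into an ordered factor
age_group <- factor(labels[pos[k5$cluster]], levels = labels, ordered = TRUE)

table(age_group)
```

Because the result is an ordered factor, comparisons such as age_group >= "old" should work as well.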
K-means is a randomized algorithm. It is expected behavior that the labels are not consistent across runs and are not in "ascending" order.
But you can of course remap the labels as you like, you know...
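A minimal sketch of such a remapping, using the toy data from the question. The names new_label and relabelled are just for illustration, and nstart = 10 is my own addition to make the toy fit stable.

```r
x <- c(1.5, 1.4, 1.45, 0.2, 0.3, 0.3)

set.seed(1)
fit <- kmeans(x, 2, nstart = 10)

# rank(-centers): the cluster with the largest center gets label 1, the next gets 2, ...
new_label <- as.integer(rank(-fit$centers[, 1]))

# Translate the original (arbitrary) labels into the ordered ones
relabelled <- new_label[fit$cluster]
relabelled
```

For ascending order instead, drop the minus sign inside rank().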
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.
I'm using the first principal component from a PCA analysis as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time, the model updates and produces a new forecast based on the new observation included into the model. Since PCA uses data from all observations included in the model for its calculations, I need to run also the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise, the PCA-result could reveal information about the future, and help the model produce a more accurate answer than it would have otherwise). I think a loop might be the solution, but I am struggling with how to formulate the code.
As a more specific example, consider if I have the following data.frame
data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")
> data
V1 V2 V3
1 6 15 23
2 9 11 22
3 7 13 23
4 6 12 25
5 7 13 23
At each observation date, I wish to run a PCA (function prcomp() from the stats package) for all observations up to, and including, that observation. So I want to first run PCA for the first two observations
pca2 <- prcomp(data[1:2,], scale = TRUE)
next I want to run PCA with the first, second and third observation as input
pca3 <- prcomp(data[1:3,], scale = TRUE)
next I want to run PCA with the first, second, third and fourth observation as input
pca4 <- prcomp(data[1:4,], scale = TRUE)
and so on, until the last run of the PCA, which includes all observations in the dataframe. For each of these "runs" of the PCA, I wish to extract the last value (though for pca2, I use both the first and second value) of the first principal component (PC1), and merge these into a final dataframe, where each monthly observation is the last value of the first principal component of PCA results for each of the runs.
The principal component outputs are:
> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
PC1 PC2
1 -1.224745 -5.551115e-17
2 1.224745 5.551115e-17
> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
PC1 PC2 PC3
1 -1.4172321 -0.2944338 6.106227e-16
2 1.8732448 -0.1215046 3.330669e-16
3 -0.4560127 0.4159384 4.163336e-16
> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
PC1 PC2 PC3
1 -1.03030993 -1.10154914 0.015457199
2 2.00769890 0.07649216 0.011670433
3 0.03301806 -0.24226508 -0.033461874
4 -1.01040702 1.26732205 0.006334242
So I want my final output to be a dataframe to look like
>final.output
PC1
1 -1.224745
2 1.224745
3 -0.4560127
4 -1.01040702
Comment: yes, it looks a bit weird with the two first values, but please don't pay too much attention to that. My point is that I wish to build a dataframe that consists of the last calculated value for the first principal component for each of the PCA runs.
I am thinking that a for loop might be the best solution here, but I have not been successful in finding any threads that might guide me closer to a coding solution. How can I make the loop use an increasing amount of the dataframe in the calculations? Does anyone have any suggestions/tips/links? Any help on this is much appreciated!
I had a very similar approach.
PCA <- vector("list", length=nrow(data)-1)
for(i in 1:(nrow(data)-1)) {
if(i==1) j <- 1:2 else j<-i+1
PCA[[i]] <- as.data.frame(prcomp(data[1:(1+i),], scale = TRUE)$x)[j, 1]
}
unlist(PCA)
You can use a for loop. It's maybe not the most efficient solution, but it will work.
First, you create an empty list to store your results:
all_results <- list()
Next, you iterate from 2 to the number of rows of data with a loop. For each iteration, run prcomp on data[1:i,]. You can directly create the PCA data frame and extract PC1 from it as a vector, then store it in the list at index i - 1:
for(i in 2:nrow(data))
{
all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i,], scale = TRUE)$x)$PC1
}
Now to extract all the results, you use lapply (list apply) to extract only the last element from each PC1 vector:
PC1 <- lapply(all_results, function(pca) pca[length(pca)] )
Now you convert these from a list of single elements to a vector:
PC1 <- do.call("c", PC1)
Finally, you want to stick the first value of the first analysis back on to the front of this vector:
PC1 <- c(all_results[[1]][1], PC1)
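The same expanding-window idea can also be sketched with sapply, building the data frame from the question in one go. This is a sketch under the question's setup; last_pc1 and first are illustrative names.

```r
data <- as.data.frame(rbind(c(6, 15, 23), c(9, 11, 22), c(7, 13, 23),
                            c(6, 12, 25), c(7, 13, 23)))
names(data) <- c("V1", "V2", "V3")

# For each window 1:i, fit the PCA and keep only the PC1 score of the newest row
last_pc1 <- sapply(2:nrow(data), function(i) {
  pc1 <- prcomp(data[1:i, ], scale. = TRUE)$x[, "PC1"]
  unname(pc1[i])
})

# The question also wants the first row's score from the two-observation PCA
first <- unname(prcomp(data[1:2, ], scale. = TRUE)$x[1, "PC1"])

final.output <- data.frame(PC1 = c(first, last_pc1))
final.output
```

One caveat: the sign of a principal component is arbitrary, so PC1 may flip sign from one window to the next. If the series feeds a forecasting model, it may be worth fixing a sign convention per window.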
While stats::cutree() takes an hclust-object and cuts it into a given number of clusters, I'm looking for a function that takes a given amount of elements and attempts to set k accordingly. In other words: Return the first cluster with n elements.
For example:
Searching for the first cluster with n = 9 objects.
library(psych)
data(bfi)
x <- bfi
hclust.res <- hclust(dist(abs(cor(na.omit(x)))))
cutree.res <- cutree(hclust.res, k = 2)
cutree.table <- table(cutree.res)
cutree.table
# no cluster with n = 9 elements
> cutree.table
cutree.res
1 2
23 5
while k = 3 yields
cutree.res <- cutree(hclust.res, k = 3)
cutree.table <- table(cutree.res)
# three clusters, whereas cluster 2 contains the required amount of objects
> cutree.table
cutree.res
1 2 3
14 9 5
Is there a more convenient way than iterating over this?
Thanks
You can easily write code for this yourself that only does one pass over the dendrogram rather than calling cutree in a loop.
Just execute the merges one by one and note the cluster sizes. Then keep the one that you "liked" the best.
Note that there might be no such solution. For example on the 1 dimensional data set -11 -10 +10 +11, cutting the dendrogram in merge order will return clusters with 1,2, or 4 elements only. So you'll have to handle this case, too.
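A rough sketch of that single pass, reading cluster sizes off the merge matrix that hclust returns. merge_sizes, wanted and k_first are hypothetical names, not part of stats; the example uses the 1-dimensional data set mentioned above.

```r
# Size of the cluster created by each merge, computed in one pass over hc$merge
merge_sizes <- function(hc) {
  n_merges <- nrow(hc$merge)
  sizes <- integer(n_merges)
  for (i in seq_len(n_merges)) {
    parts <- hc$merge[i, ]
    # Negative entries are single observations; positive entries point to earlier merges
    sizes[i] <- sum(ifelse(parts < 0, 1L, sizes[abs(parts)]))
  }
  sizes
}

x <- c(-11, -10, 10, 11)
hc <- hclust(dist(x))
sizes <- merge_sizes(hc)
sizes

# Look for a cluster with exactly `wanted` elements; there may be none
wanted <- 3
hit <- which(sizes == wanted)
n <- length(hc$order)
# The cluster created by merge i first appears when cutting into k = n - i clusters
k_first <- if (length(hit) > 0) n - hit[1] else NA_integer_
k_first
```

Here no merge produces a cluster of size 3, so k_first stays NA, which is the case the answer warns about.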
I have a large dataset (data frame) where I want to find the number and the names of the categories in a column.
For example, suppose my df looks like this:
A B
1 car
2 car
3 bus
4 car
5 plane
6 plane
7 plane
8 plane
9 plane
10 train
I would want to find :
car
bus
plane
train
4
How would I do that?
categories <- unique(yourDataFrame$yourColumn)
numberOfCategories <- length(categories)
Pretty painless.
This gives unique, length of unique, and frequency:
table(df$B)
bus car plane train
1 3 5 1
length(table(df$B))
[1] 4
You can simply use unique:
x <- unique(df$B)
And it will extract the unique values in the column. You can combine it with lapply to get them from each column too!
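A small sketch of the per-column version, with the example data typed in by hand. per_column and n_per_column are illustrative names.

```r
df <- data.frame(
  A = 1:10,
  B = c("car", "car", "bus", "car", "plane", "plane",
        "plane", "plane", "plane", "train"),
  stringsAsFactors = FALSE
)

# Unique values of every column, returned as a list (one element per column)
per_column <- lapply(df, unique)

# Number of distinct values in every column
n_per_column <- sapply(df, function(col) length(unique(col)))

per_column$B
n_per_column
```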
I would recommend you use factors here, if you are not already. It's straightforward and simple.
levels() gives the unique categories and nlevels() gives the number of them. If we run droplevels() on the data first, we take care of any levels that may no longer be in the data.
with(droplevels(df), list(levels = levels(B), nlevels = nlevels(B)))
# $levels
# [1] "bus" "car" "plane" "train"
#
# $nlevels
# [1] 4
Additionally, to see sorted values you can use the following:
sort(table(df$B), decreasing = TRUE)
And you will see the values in the decreasing order.
Firstly you must ensure that your column is in the correct data type. Most probably R had read it in as a 'chr' which you can check with 'str(df)'.
For the data you have provided as an example, you will want to change this to a 'factor':
df$column <- as.factor(df$column)
Once the data is in the correct format, you can then use 'levels(df$column)' to get a summary of levels you have in the dataset
I have a data frame DF which looks like this:
ID Area time
1 1 182.685 1
2 2 182.714 1
3 3 182.275 1
4 4 211.928 1
5 5 218.804 1
6 6 183.445 1
...
1 1 184.334 2
2 2 196.765 2
3 3 186.435 2
4 4 213.322 2
5 5 214.766 2
6 6 172.667 2
... and so on for each ID up to ID = 6. I want to apply an autocorrelation function on each ID, i.e. compare ID = 1 at time 1 with ID = 1 at time 2 and so on.
What is the most straightforward way to apply e.g. acf() to my data frame?
When I try to use
autocorr = aggregate(x = DF$Area, by = list(DF$ID), FUN = acf)
I get a weird object.
Thanks in advance!
I want to apply an autocorrelation function on each ID
OK, good, so you don't want any cross-correlation, which makes things much easier.
I get a weird object
acf returns a bunch of things, i.e., it returns a list of things. I think you will be only interested in ACF values, so you need:
FUN = function (u) c(acf(u, plot = FALSE)$acf)
Also, using aggregate is not a good idea. You may want split and sapply:
## so your data frame is called `x`
oo <- sapply(split(x$Area, x$ID), FUN = function (u) c(acf(u, plot = FALSE)$acf) )
If you have balanced data, i.e., if you have equal number of observations for each ID, oo will be simplified into a matrix for sure. If you do not have balanced data, you may want to explicitly control the lag.max argument in acf. By default, acf will auto-decide on this value based on the number of observations.
Now suppose we want lag 0 to lag 7, we can set:
oo <- sapply(split(x$Area, x$ID),
FUN = function (u) c(acf(u, plot = FALSE, lag.max = 7)$acf) )
Thus the result oo is a matrix with 8 rows (one row per lag, one column per ID). I don't see any benefit in using a data frame to hold this result, but in case you want one, simply do:
data.frame(oo)
With data either in a matrix or a data frame, it is easy for you to do further analysis.
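To make this reproducible, here is a sketch on simulated balanced data. The numbers are made up (6 IDs, 20 time points each); only the split / sapply / acf pattern comes from the above.

```r
# Simulated stand-in for the real data frame (assumption: 6 IDs, 20 time points each)
set.seed(1)
x <- data.frame(
  ID   = rep(1:6, times = 20),
  time = rep(1:20, each = 6),
  Area = rnorm(120, mean = 190, sd = 15)
)

# Make sure each series is in time order before computing the ACF
x <- x[order(x$ID, x$time), ]

oo <- sapply(split(x$Area, x$ID),
             FUN = function(u) c(acf(u, plot = FALSE, lag.max = 7)$acf))
rownames(oo) <- paste0("lag", 0:7)

dim(oo)   # 8 lags by 6 IDs
```

The first row (lag 0) is always 1, so it can be dropped if only the non-trivial lags are of interest.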
-----------
For a complete description of acf, please read Produce a boxplot for multiple ACFs
How can I plot a recurrence in R?
Any solution with base plot, ggplot2, lattice, or a dedicated package is welcome.
For example:
Imagine I have these data:
mydata <- data.frame(t=1:10, Y=runif(10))
t Y
1 0.3744869
2 0.6314202
3 0.3900789
4 0.6896278
5 0.6894134
6 0.5549006
7 0.4296244
8 0.4527201
9 0.3064433
10 0.5783539
I could transform it like this:
mydata2 <- data.frame(t=c(NA,mydata$t),Y=c(NA,mydata$Y),Y2=c(mydata$Y, NA))
t Y Y2
NA NA 0.9103703
1 0.9103703 0.1426041
2 0.1426041 0.4150476
3 0.4150476 0.2109258
4 0.2109258 0.4287504
5 0.4287504 0.1326900
6 0.1326900 0.4600964
7 0.4600964 0.9429571
8 0.9429571 0.7619739
9 0.7619739 0.9329098
10 0.9329098 NA
(or similar methods, but I can have problems with missing data)
And plot it
plot(Y2~Y, data=mydata2)
I guess I must use some grouping function such as ave or apply. But it's not an elegant solution, and if I have more columns it can become difficult to generalize the transformation.
For example
mydata3 <- data.frame(x=sample(10,100, replace=T),t=1:100, Y=2*runif(100)+1)
For every x (or combination of values on other columns) I want to plot Y_{i+1} ~ Y_i, on the same plot.
Other tools, such as Mathematica have functions to plot sequences directly.
I've found a solution, though not a very beautiful one.
For this sample data:
mydata <- data.frame(x=sample(4,25, replace=T),t=1:25, Y=2*runif(25)+1)
newdata <- mydata[order(mydata$x, mydata$t), ]
newdata$prev <- ave(newdata$Y, newdata$x, FUN=function(x) c(NA,head(x,-1)))
plot(Y~prev, data=newdata)
In this example there are not rows for every t value; you would first need to generate NAs for the missing values. But it's just a quick solution. In my real data I have many observations for each t.
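To tell the groups apart on the same plot, one option is to colour the points by x. This is just a sketch on the same kind of sample data; lag1 is a helper name I made up.

```r
set.seed(1)
mydata <- data.frame(x = sample(4, 25, replace = TRUE), t = 1:25, Y = 2 * runif(25) + 1)

# Sort by group and time so "previous" means the previous observation within the group
newdata <- mydata[order(mydata$x, mydata$t), ]

# Shift a vector by one position, padding the front with NA
lag1 <- function(v) c(NA, head(v, -1))
newdata$prev <- ave(newdata$Y, newdata$x, FUN = lag1)

# One colour per group; rows with prev = NA are simply not drawn
with(newdata, plot(prev, Y, col = x, pch = 19,
                   xlab = "Y[i]", ylab = "Y[i+1]"))
legend("topright", legend = sort(unique(newdata$x)),
       col = sort(unique(newdata$x)), pch = 19, title = "x")
```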
lag.plot can plot recurrence plots but not within each subgroup.