I have this array:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
[25] 1 1 1 1 2 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
[49] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[73] 1 1 1 1 1 4 3 2 5 3 2 3 3 2 3 2 3 2 3 3 2 3 3 2
[97] 3 2 2 2 3 2 2 2 2 2 3 2 3 3 2 3 2 1 2 2 3 2 2 3
I need a function that returns only the number of occurrences of the most frequent value. For example, if I use:
table(x[1:80])
I will get:
1 2 3 4
74 3 2 1
How can I get the value '74' automatically? Meaning that I can't know in advance whether '1' or '2' and so on is the most frequent value in my array. Thanks!
Edit:
I ran:
tf <- tabulate(x)
[1] 75 24 19 1 1
and tried to run a for loop to get the "maximum" of each element of the tabulate result, as follows:
for (element in tf) {
  b <- max(table(x[element]))
  print(b)
}
I don't get the expected result; it is probably simple, but not for me.
I tried this:
> a=max(table(C[1:75]))
[1] 72
> b=max(table(C[76:99]))
[1] 11
> c=max(table(C[100:118]))
[1] 12
> d=max(table(C[119]))
[1] 1
> e=max(table(C[120]))
[1] 1
and so on.
and it works but it's really long and not fun if I have a big dataset.
Following the commenter's tip, if you want a function, use:
maximum <- function(vector, upto=length(vector)) {
max(table(vector[1:upto]))
}
So for:
set.seed(123)
x <- sample(1:3, 100, replace=T)
maximum(x)
[1] 34
maximum(x, 55) # checking up to the 55th number in the vector
[1] 19
Update
To answer your edited question, use this function:
maxtable <- function(vector) {
index <- cumsum(1:length(vector) %in% cumsum(tabulate(vector)))
s <- split(vector, index)
sapply(s, function(v) max(table(v)))
}
maxtable(x)
0 1 2 3 4 5
71 11 12 1 1 1
Edit
I think this small change is closer to what you're looking for:
maxtable2 <- function(vector) {
index <- cumsum(1:length(vector) %in% (cumsum(tabulate(vector))+1))
s <- split(vector, index+1)
sapply(s, function(v) max(table(v)))
}
maxtable2(x)
1 2 3 4 5
72 11 12 1 1
There is a function called mfv in the modeest library that gets you most of the way there, but the function itself is simple enough to write yourself:
> mfv
function (x, ...)
{
f <- factor(x)
tf <- tabulate(f)
return(as.numeric(levels(f)[tf == max(tf)]))
}
<environment: namespace:modeest>
So you can do sum(x == mfv(x)) to get 74.
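For completeness, a minimal usage sketch (the vector x below is reconstructed from the table counts in the question, so it is only illustrative):
library(modeest)                             # provides mfv()
x <- c(rep(1, 74), rep(2, 3), rep(3, 2), 4)  # 80 values matching the question's table
mfv(x)                                       # most frequent value: 1
sum(x == mfv(x))                             # its number of occurrences: 74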
Related
I have a matrix and I would like to know the center and min/max size of each cluster represented by the same number value.
For example, to get the center position and size of the clusters (or the min/max column/row) represented by the number 2 in the following matrix. The idea is close to the one performed on an image in How to obtain size of cluster of pixels in R and How to obtain size of multi clusters in matrix (R).
But when I use apply(matrix2, 2, mean) and apply(matrix2, 2, range), the results merge the two clusters. Is there a way to get each cluster separately?
> matrix<- read.csv("2_ind_matrix.csv")
X1 X1.1 X1.2 X1.3 X1.4 X1.5
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 1
4 1 1 1 2 2 2
5 1 1 1 1 2 2
6 1 1 1 1 1 1
7 1 1 1 1 1 1
8 1 1 1 1 1 1
9 1 1 1 1 1 1
10 1 1 1 1 1 1
11 2 1 1 1 1 1
12 2 1 1 1 1 1
13 2 1 1 1 1 1
14 2 2 1 1 1 1
15 2 2 2 1 1 1
16 2 2 2 2 2 2
17 2 2 2 2 2 2
> matrix2<- which(matrix == 2, TRUE)
> apply(matrix2, 2, range) #Range
row col
[1,] 4 1
[2,] 17 6
> apply(matrix2, 2, mean) #Center
row col
13.16 3.20
You first need to decide how many clusters there are. Here I assume there are 2 clusters; those can be found by kmeans, using the positions returned by which.
y <- which(x==2, TRUE)
y <- cbind(y, cluster=kmeans(y, 2)$cluster)
aggregate(y[,1:2], list(y[,3]), range)
# Group.1 row.1 row.2 col.1 col.2
#1 1 4 5 4 6
#2 2 11 17 1 6
aggregate(y[,1:2], list(y[,3]), mean)
# Group.1 row col
#1 1 4.40 5.2
#2 2 15.35 2.7
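If you also want the number of cells in each cluster, a small addition to the above works (cluster labels depend on the particular kmeans run):
table(y[, "cluster"])   # number of matrix cells assigned to each cluster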
I am trying to build R code that will take my loop and run it 20 times. Each time I would like to add a column to the existing data frame. Here I tried it by adding the code 3 times, but I feel like there must be an easier way to automate this. I am very grateful for any help.
My original data file (called "igel") contains two columns ("Year" and "Grid") and 1096 rows. With the loop I pick a random number from the column "Grid" and check whether it has been picked before. If so, it adds 0 to a new column; if not, it adds 1.
Here the code:
a <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(a) <- x
for (i in 1:1096) {
num_i <- sample(igel$Grid, 1)
count_i <- c(if (num_i %in% a$number == TRUE) {0} else {1})
a<-a %>% add_row(number = num_i, count = count_i)
}
b <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(b) <- x
for (i in 1:1096) {
num_i <- sample(igel$Grid, 1)
count_i <- c(if (num_i %in% b$number == TRUE) {0} else {1})
b<-b %>% add_row(number = num_i, count = count_i)
}
c <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(c) <- x
for (i in 1:1096) {
num_i <- sample(igel$Grid, 1)
count_i <- c(if (num_i %in% c$number == TRUE) {0} else {1})
c<-c %>% add_row(number = num_i, count = count_i)
}
df.total<- cbind(a$count,b$count, c$count)
Consider sapply and its wrapper, replicate, and calculate number and count separately with vectorized calculations instead of growing an object row by row in a loop.
# RUNS 3 SAMPLES OF igel$Grid 1,096 TIMES (ADJUST 3 TO ANY POSITIVE INT LIKE 20)
grid_number <- data.frame(replicate(3, replicate(1096, sample(igel$Grid, 1))))
# RUNS ACROSS 3 COLUMNS TO CHECK CURRENT ROW VALUE IS INCLUDED FOR ALL VALUES BEFORE ROW
grid_count <- sapply(grid_number, function(col)
  sapply(seq_along(col), function(i)
    # a value already seen earlier in the column gets 0, a first occurrence gets 1
    ifelse(i > 1 && col[i] %in% col[1:(i - 1)], 0, 1)
  )
)
While the above does not exactly reproduce your output, df.total (a matrix, not a data frame), because of the random sampling within iterations, the two have a similar structure:
dim(df.total)
# [1] 1096 3
dim(grid_count)
# [1] 1096 3
Try to avoid iterating through rows; it is rarely, if ever, necessary. Here is one approach (replace n with 1096, elem with igel$Grid, and 1:5 in the loop with 1:20 for 20 repetitions):
n = 20
elem = 1:5
df.total = list()
for (i in 1:5) {
a = data.frame(number = sample(elem, n, replace=TRUE))
a$count = as.numeric(duplicated(a$number))
df.total[[i]] = a
}
df.total = as.data.frame(df.total)
df.total
## number count number.1 count.1 number.2 count.2 number.3 count.3 number.4 count.4
## 1 4 0 2 0 5 0 4 0 1 0
## 2 3 0 5 0 3 0 4 1 3 0
## 3 5 0 3 0 4 0 2 0 4 0
## 4 5 1 1 0 2 0 5 0 3 1
## 5 2 0 4 0 2 1 5 1 5 0
## 6 4 1 2 1 2 1 5 1 5 1
## 7 5 1 1 1 3 1 2 1 4 1
## 8 5 1 2 1 5 1 5 1 4 1
## 9 2 1 1 1 1 0 1 0 1 1
## 10 3 1 1 1 5 1 4 1 1 1
## 11 5 1 3 1 1 1 3 0 5 1
## 12 2 1 1 1 2 1 5 1 1 1
## 13 3 1 5 1 4 1 5 1 4 1
## 14 1 0 4 1 2 1 4 1 1 1
## 15 4 1 4 1 2 1 5 1 1 1
## 16 4 1 2 1 5 1 2 1 5 1
## 17 3 1 1 1 1 1 3 1 2 0
## 18 2 1 2 1 2 1 2 1 2 1
## 19 2 1 3 1 1 1 2 1 1 1
## 20 1 1 3 1 2 1 1 1 3 1
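Note that duplicated() marks repeated values with TRUE, so count above is 0 for a first pick and 1 for a repeat; if you want the opposite (1 when the number has not been picked before, as in your original loop), a one-character change does it:
a$count = as.numeric(!duplicated(a$number))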
I'd like to know how to do multiple sampling in R. For example, when I divide some data into 60 (train data) : 40 (validation data), I can write the code like this:
original.data = read.csv("~.csv", na.strings="")
train.index = sample(c(1:dim(original.data)[1]), dim(original.data)[1]*0.6)
train.data = original.data[train.index,]
valid.data = original.data[-train.index,]
However, it is hard to figure out how to do multiple sampling, like dividing some data into 60:20:20.
I would appreciate it if you could show me the best solution!
If you want more than two sets, then the other solutions are close but you need just a little more. There are at least two options.
First:
set.seed(2)
table(samp <- sample(1:3, size = nrow(iris), prob = c(0.6, 0.2, 0.2), replace = TRUE))
# 1 2 3
# 93 35 22
nrow(iris) # 150
set1 <- iris[samp == 1,]
set2 <- iris[samp == 2,]
set3 <- iris[samp == 3,]
nrow(set1)
# [1] 93
nrow(set2)
# [1] 35
nrow(set3)
# [1] 22
Because it's random, you won't always get your exact proportions.
Second:
If you must have exact proportions, you can do this:
ns <- nrow(iris) * c(0.6, 0.2, 0.2)
sum(ns)
# [1] 150
### in case of rounding (and sum != nrow) ... just fix one of ns
rep(1:3, times = ns)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [46] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [91] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [136] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
set.seed(2)
head(samp <- sample(rep(1:3, times = ns)))
# [1] 1 2 1 1 3 3
set1 <- iris[samp == 1,]
set2 <- iris[samp == 2,]
set3 <- iris[samp == 3,]
nrow(set1)
# [1] 90
nrow(set2)
# [1] 30
nrow(set3)
# [1] 30
This can easily be generalized to support an arbitrary number of partitions.
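For instance, here is a minimal sketch of one way to generalize the exact-proportion variant (the helper name partition_exact is made up here):
partition_exact <- function(data, props, seed = 2) {
  ns <- round(nrow(data) * props)
  ns[1] <- nrow(data) - sum(ns[-1])  # absorb any rounding error in the first group
  set.seed(seed)
  samp <- sample(rep(seq_along(props), times = ns))
  split(data, samp)                  # list of data frames, one per group
}
sets <- partition_exact(iris, c(0.6, 0.2, 0.2))
sapply(sets, nrow)
#  1  2  3
# 90 30 30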
I've built a cluster model in R (kmeans):
fit <- kmeans(sales_DP_DCT_agg_tr_bel_mod, 4)
Now I want to use this model and segment a brand new data set. How can I:
store the model
run the model on a new data set?
Let's say you're using iris as a dataset.
data = iris[,1:4] ## Don't want the categorical feature
model = kmeans(data, 3)
Here's what the output looks like:
>model
K-means clustering with 3 clusters of sizes 96, 33, 21
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.314583 2.895833 4.973958 1.7031250
2 5.175758 3.624242 1.472727 0.2727273
3 4.738095 2.904762 1.790476 0.3523810
Clustering vector:
[1] 2 3 3 3 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 3 2 3 2 2 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[76] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Within cluster sum of squares by cluster:
[1] 118.651875 6.432121 17.669524
(between_SS / total_SS = 79.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
Notice you have access to the centroids using model$centers. All you have to do to classify an incoming sample is find which centroid it's closest to. You could define a Euclidean distance function as follows:
eucDist <- function(x, y) sqrt(sum( (x-y)^2 ))
And then a classifying function:
classifyNewSample <- function(newData, centroids = model$centers) {
dists = apply(centroids, 1, function(y) eucDist(y,newData))
order(dists)[1]
}
> classifyNewSample(c(7,3,6,2))
[1] 1
> classifyNewSample(c(6,2.7,4.3,1.4))
[1] 2
As far as model persistence goes, check out ?save here.
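For example, a minimal sketch (the file name is arbitrary):
save(model, file = "kmeans_model.RData")  # persist the fitted kmeans object
load("kmeans_model.RData")                # later: restores an object named model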
Edit:
To apply the classification function to a new matrix:
## I'm just generating a random matrix of 50x4 here:
r <- 50
c <- 4
m0 <- matrix(0, r, c)
new_data = apply(m0, c(1,2), function(x) sample(seq(0,10,0.1),1))
new_labels = apply(new_data, 1, classifyNewSample)
>new_labels
[1] 1 2 3 3 2 1 3 1 3 1 2 3 3 1 1 3 1 1 1 3 1 1 1 1 1 1 3 1 1 3 3 1 1 3 2 1 3 2 3 1 2 1 2 1 1 2 1 3 2 1
I have an input file of this form:
0.35217720 1 201 1
0.26413283 1 209 1
1.1665874 1 210 1
...
0.30815500 2 194 1
0.15407741 2 196 1
0.15407741 2 197 1
0.33016610 2 205 1
...
where the first column is a scalar value, the second is the x coordinate on a discrete lattice, the third is the y coordinate, and the last one is a discrete time-like component.
I would like to make a two-dimensional heatmap of the scalar values at a fixed time. How can I do this? Edit: I don't know how to make image() use the second and third columns as the x, y coordinates.
Example file:
7.62939453 1 1 1
1.3153768 1 2 1
7.5560522 1 3 1
4.5865011 1 4 1
5.3276706 1 5 1
2.1895909 2 1 1
0.47044516 2 2 1
6.7886448 2 3 1
6.7929626 2 4 1
9.3469286 2 5 1
3.8350201 3 1 1
5.1941633 3 2 1
8.3096523 3 3 1
0.34571886 3 4 1
0.53461552 3 5 1
5.2970004 4 1 1
6.7114925 4 2 1
7.69805908 4 3 1
3.8341546 4 4 1
0.66842079 4 5 1
4.1748595 5 1 1
6.8677258 5 2 1
5.8897662 5 3 1
9.3043633 5 4 1
8.4616680 5 5 1
Reshape your data to a matrix and then use heatmap():
This worked on R version 2.10.1 (2009-12-14):
txt <- textConnection("7.62939453 1 1 1
1.3153768 1 2 1
7.5560522 1 3 1
4.5865011 1 4 1
5.3276706 1 5 1
2.1895909 2 1 1
0.47044516 2 2 1
6.7886448 2 3 1
6.7929626 2 4 1
9.3469286 2 5 1
3.8350201 3 1 1
5.1941633 3 2 1
8.3096523 3 3 1
0.34571886 3 4 1
0.53461552 3 5 1
5.2970004 4 1 1
6.7114925 4 2 1
7.69805908 4 3 1
3.8341546 4 4 1
0.66842079 4 5 1
4.1748595 5 1 1
6.8677258 5 2 1
5.8897662 5 3 1
9.3043633 5 4 1
8.4616680 5 5 1
")
df <- read.table(txt)
close(txt)
names(df) <- c("value", "x", "y", "t")
require(reshape)
dfc <- cast(df[ ,-4], x ~ y)
heatmap(as.matrix(dfc))
## Some copy/pasteable fake data for you (dput() works nicely for pasteable real data)
your_matrix <- cbind(runif(25, 0, 10), rep(1:5, each = 5), rep(1:5, 5), rep(1, 25))
heatmap_matrix <- matrix(your_matrix[, 1], nrow = 5)
## alternatively, if your_matrix isn't in order
## (The reshape method in EDi's answer is a nicer alternative)
for (i in 1:nrow(your_matrix)) {
  heatmap_matrix[your_matrix[i, 2], your_matrix[i, 3]] <- your_matrix[i, 1]
}
heatmap(heatmap_matrix) # one option
image(z = heatmap_matrix) # another option
require(gplots)
heatmap.2(heatmap_matrix) # this has fancier preferences
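To address the edit in the question about image(): you can pass the lattice coordinates explicitly. A minimal sketch, assuming the 5x5 heatmap_matrix built above with rows indexed by x and columns by y:
image(x = 1:5, y = 1:5, z = heatmap_matrix, xlab = "x", ylab = "y")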