I'd like to know how to take multiple samples in R. For example, to divide some data into 60 (training data) : 40 (validation data), I can write code like this:
original.data = read.csv("~.csv", na.strings = "")
train.index = sample(seq_len(nrow(original.data)), nrow(original.data) * 0.6)
train.data = original.data[train.index, ]
valid.data = original.data[-train.index, ]
However, I can't figure out how to do multiple sampling, like dividing the data into 60:20:20. I would appreciate any suggestions!
If you want more than two sets, the two-set approach gets you close, but you need just a little more. There are at least two options.
First:
set.seed(2)
table(samp <- sample(1:3, size = nrow(iris), prob = c(0.6, 0.2, 0.2), replace = TRUE))
#  1  2  3
# 93 35 22
nrow(iris) # 150
set1 <- iris[samp == 1,]
set2 <- iris[samp == 2,]
set3 <- iris[samp == 3,]
nrow(set1)
# [1] 93
nrow(set2)
# [1] 35
nrow(set3)
# [1] 22
Because it's random, you won't always get your exact proportions.
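For illustration (my own re-draw; output omitted because it varies), a different seed gives different group sizes:
set.seed(3)
table(sample(1:3, size = nrow(iris), prob = c(0.6, 0.2, 0.2), replace = TRUE))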
Second:
If you must have exact proportions, you can do this:
ns <- nrow(iris) * c(0.6, 0.2, 0.2)
sum(ns)
# [1] 150
### in case of rounding (and sum != nrow) ... just fix one of ns
rep(1:3, times = ns)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [46] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [91] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [136] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
set.seed(2)
head(samp <- sample(rep(1:3, times = ns)))
# [1] 1 2 1 1 3 3
set1 <- iris[samp == 1,]
set2 <- iris[samp == 2,]
set3 <- iris[samp == 3,]
nrow(set1)
# [1] 90
nrow(set2)
# [1] 30
nrow(set3)
# [1] 30
This can easily be generalized to support an arbitrary number of partitions.
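For example, a minimal sketch of such a helper (partition() and its arguments are my own illustrative names, not part of the answer above):
partition <- function(df, probs) {
  ns <- round(nrow(df) * probs)
  ns[1] <- nrow(df) - sum(ns[-1])  # absorb any rounding error into the first group
  samp <- sample(rep(seq_along(probs), times = ns))
  split(df, samp)
}
set.seed(2)
sets <- partition(iris, c(0.6, 0.2, 0.2))
sapply(sets, nrow)
#  1  2  3
# 90 30 30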
I am trying to create a subset that keeps the first value of each run of repeated numbers in a column. I tried to use:
df %>% group_by(x) %>% slice_head(n = 1)
But that only keeps the first row of each distinct value, not of each run.
Example data, where column x contains the repeated runs:
x = c(2,2,2,3,3,3,1,1,1,5,5,5,2,2,2,1,1,1,3,3,3)
y = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 2 1
3 2 1
4 3 1
5 3 1
6 3 1
7 1 1
8 1 1
9 1 1
10 5 1
11 5 1
12 5 1
13 2 1
14 2 1
15 2 1
16 1 1
17 1 1
18 1 1
19 3 1
20 3 1
21 3 1
So the end result that I would like to achieve is:
x = c(2,3,1,5,2,1,3)
y = c(1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 3 1
3 1 1
4 5 1
5 2 1
6 1 1
7 3 1
Could you please help or point me to any useful existing topics? I haven't managed to find any.
Thanks
You can try rleid from the data.table package:
> library(data.table)
> setDT(df)[!duplicated(rleid(x))]
x y
1: 2 1
2: 3 1
3: 1 1
4: 5 1
5: 2 1
6: 1 1
7: 3 1
Base R: keep each row whose value differs from the previous row's (prepending a nonzero value keeps the first row).
df[c(1, diff(df$x)) != 0, ]
Or with other helper functions from data.table:
library(data.table)
df[rowid(rleid(df$x)) == 1L, ]
# x y
# 1 2 1
# 4 3 1
# 7 1 1
# 10 5 1
# 13 2 1
# 16 1 1
# 19 3 1
Using rle, indexing the first row of each run. (match(rle(df$x)$values, df$x) looks tempting, but match() returns the first position of each value, so repeated runs would map back to earlier rows; computing the run starts from the run lengths avoids that.)
r <- rle(df$x)
df[cumsum(c(1, head(r$lengths, -1))), ]
#    x y
# 1  2 1
# 4  3 1
# 7  1 1
# 10 5 1
# 13 2 1
# 16 1 1
# 19 3 1
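Since the question started from dplyr, here is a hedged sketch of the same run-based filter in that idiom (lag() is NA on the first row, so that row is kept explicitly):
library(dplyr)
df %>% filter(row_number() == 1 | x != lag(x))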
From a data frame in R that has X Y coordinates (see example), I would like to add two columns (final_X and final_Y) to show all possible pairwise comparisons between the two.
dt = data.frame(X = seq(1, 5, by=1), Y = seq(1, 5, by=1))
This is the final goal, but there should be a row for every possible combination of X, Y and final_X, final_Y.
You can use expand.grid:
eg <- expand.grid(final_Y = 1:5, Y = 1:5, final_X = 1:5, X = 1:5)[,c(4,2,3,1)]
head(eg, n=20)
# X Y final_X final_Y
# 1 1 1 1 1
# 2 1 1 1 2
# 3 1 1 1 3
# 4 1 1 1 4
# 5 1 1 1 5
# 6 1 2 1 1
# 7 1 2 1 2
# 8 1 2 1 3
# 9 1 2 1 4
# 10 1 2 1 5
# 11 1 3 1 1
# 12 1 3 1 2
# 13 1 3 1 3
# 14 1 3 1 4
# 15 1 3 1 5
# 16 1 4 1 1
# 17 1 4 1 2
# 18 1 4 1 3
# 19 1 4 1 4
# 20 1 4 1 5
nrow(eg)
# [1] 625
I defined the columns out of order and reordered them simply to match the ordering of your expected output. One could easily do expand.grid(X=,Y=,final_X=,final_Y=) and leave off the [,c(...)] and the effective results would be the same but in a different row-order.
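For example, a sketch of that natural ordering built from the question's dt (the same 625 rows, just in a different order):
eg2 <- expand.grid(X = dt$X, Y = dt$Y, final_X = dt$X, final_Y = dt$Y)
nrow(eg2)
# [1] 625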
I am trying to write R code that runs my loop 20 times, each time adding a column to an existing data frame. Below I did it by repeating the code three times, but I feel like there must be an easier way to automate this. I am very grateful for any help.
My original data file (called "igel") contains two columns ("Year" and "Grid") and 1096 rows. In the loop I pick a random value from the column "Grid" and check whether it has been picked before. If it has, a 0 is added to a new column; if not, a 1.
Here is the code:
library(dplyr)   # for the %>% pipe
library(tibble)  # for add_row()
a <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(a) <- c("number", "count")
for (i in 1:1096) {
  num_i <- sample(igel$Grid, 1)
  count_i <- if (num_i %in% a$number) 0 else 1
  a <- a %>% add_row(number = num_i, count = count_i)
}
b <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(b) <- c("number", "count")
for (i in 1:1096) {
  num_i <- sample(igel$Grid, 1)
  count_i <- if (num_i %in% b$number) 0 else 1
  b <- b %>% add_row(number = num_i, count = count_i)
}
c <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(c) <- c("number", "count")
for (i in 1:1096) {
  num_i <- sample(igel$Grid, 1)
  count_i <- if (num_i %in% c$number) 0 else 1
  c <- c %>% add_row(number = num_i, count = count_i)
}
df.total <- cbind(a$count, b$count, c$count)
Consider sapply and its wrapper replicate, and calculate number and count separately with vectorized operations instead of growing an object row by row inside a loop.
# RUNS 3 SAMPLES OF igel$Grid 1,096 TIMES (ADJUST 3 TO ANY POSITIVE INT LIKE 20)
grid_number <- data.frame(replicate(3, replicate(1096, sample(igel$Grid, 1))))
# RUNS ACROSS 3 COLUMNS TO CHECK WHETHER CURRENT ROW VALUE APPEARED IN ANY EARLIER ROW
grid_count <- sapply(grid_number, function(col)
  sapply(seq_along(col), function(i)
    # GUARD i == 1: col[1:(i-1)] WOULD OTHERWISE INDEX col[1] ITSELF AND ALWAYS RETURN 0
    if (i > 1 && col[i] %in% col[1:(i-1)]) 0 else 1
  )
)
While the above does not exactly reproduce your df.total (and it returns a matrix, not a data frame) because of the random sampling within iterations, the two share the same structure:
dim(df.total)
# [1] 1096 3
dim(grid_count)
# [1] 1096 3
Try to avoid iterating through rows; it is rarely necessary, if ever. Here is one approach (replace n with 1096, elem with igel$Grid, and 1:5 with 1:20 for 20 runs):
n = 20
elem = 1:5
df.total = list()
for (i in 1:5) {
a = data.frame(number = sample(elem, n, replace=TRUE))
a$count = as.numeric(duplicated(a$number))  # 1 = seen before; use !duplicated() for the question's 1 = first pick
df.total[[i]] = a
}
df.total = as.data.frame(df.total)
df.total
## number count number.1 count.1 number.2 count.2 number.3 count.3 number.4 count.4
## 1 4 0 2 0 5 0 4 0 1 0
## 2 3 0 5 0 3 0 4 1 3 0
## 3 5 0 3 0 4 0 2 0 4 0
## 4 5 1 1 0 2 0 5 0 3 1
## 5 2 0 4 0 2 1 5 1 5 0
## 6 4 1 2 1 2 1 5 1 5 1
## 7 5 1 1 1 3 1 2 1 4 1
## 8 5 1 2 1 5 1 5 1 4 1
## 9 2 1 1 1 1 0 1 0 1 1
## 10 3 1 1 1 5 1 4 1 1 1
## 11 5 1 3 1 1 1 3 0 5 1
## 12 2 1 1 1 2 1 5 1 1 1
## 13 3 1 5 1 4 1 5 1 4 1
## 14 1 0 4 1 2 1 4 1 1 1
## 15 4 1 4 1 2 1 5 1 1 1
## 16 4 1 2 1 5 1 2 1 5 1
## 17 3 1 1 1 1 1 3 1 2 0
## 18 2 1 2 1 2 1 2 1 2 1
## 19 2 1 3 1 1 1 2 1 1 1
## 20 1 1 3 1 2 1 1 1 3 1
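A compact hedged variant of the same idea, flipped back to the question's convention of 1 for a first-time pick (this assumes igel$Grid as in the question; change 20 to the number of runs you need):
df.total <- replicate(20, {
  num <- sample(igel$Grid, 1096, replace = TRUE)
  as.integer(!duplicated(num))
})
dim(df.total)
# [1] 1096   20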
I've built a cluster model in R (kmeans):
fit <- kmeans(sales_DP_DCT_agg_tr_bel_mod, 4)
Now I want to use this model and segment a brand new data set. How can I:
store the model
run the model on a new data set?
Let's say you're using iris as a dataset.
data = iris[,1:4] ## Don't want the categorical feature
model = kmeans(data, 3)
Here's what the output looks like:
>model
K-means clustering with 3 clusters of sizes 96, 33, 21
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.314583 2.895833 4.973958 1.7031250
2 5.175758 3.624242 1.472727 0.2727273
3 4.738095 2.904762 1.790476 0.3523810
Clustering vector:
[1] 2 3 3 3 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 3 2 3 2 2 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[76] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Within cluster sum of squares by cluster:
[1] 118.651875 6.432121 17.669524
(between_SS / total_SS = 79.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
Notice you have access to the centroids via model$centers. All you have to do to classify an incoming sample is find which centroid it is closest to. You could define a Euclidean distance function as follows:
eucDist <- function(x, y) sqrt(sum( (x-y)^2 ))
And then a classifying function:
classifyNewSample <- function(newData, centroids = model$centers) {
dists = apply(centroids, 1, function(y) eucDist(y,newData))
order(dists)[1]
}
> classifyNewSample(c(7,3,6,2))
[1] 1
> classifyNewSample(c(6,2.7,4.3,1.4))
[1] 2
As far as model persistence goes, check out ?save (or saveRDS()/readRDS()).
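For instance, a minimal sketch of saving and restoring the fitted object with base R (the file name is arbitrary):
saveRDS(model, "kmeans_model.rds")    # serialize the kmeans object to disk
model <- readRDS("kmeans_model.rds")  # restore it in a later session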
Edit:
To apply the classifying function to a new matrix:
## I'm just generating a random matrix of 50x4 here:
r <- 50
c <- 4
m0 <- matrix(0, r, c)
new_data = apply(m0, c(1,2), function(x) sample(seq(0,10,0.1),1))
new_labels = apply(new_data, 1, classifyNewSample)
>new_labels
[1] 1 2 3 3 2 1 3 1 3 1 2 3 3 1 1 3 1 1 1 3 1 1 1 1 1 1 3 1 1 3 3 1 1 3 2 1 3 2 3 1 2 1 2 1 1 2 1 3 2 1
I have this array:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
[25] 1 1 1 1 2 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
[49] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[73] 1 1 1 1 1 4 3 2 5 3 2 3 3 2 3 2 3 2 3 3 2 3 3 2
[97] 3 2 2 2 3 2 2 2 2 2 3 2 3 3 2 3 2 1 2 2 3 2 2 3
I need a function that returns the count of the most frequent value. For example, if I use:
table(x[1:80])
I will get:
1 2 3 4
74 3 2 1
How can I get the value '74' automatically? The point is that I can't know in advance whether '1' or '2' and so on is the most frequent value in my array. Thanks!
Edit:
I run:
tf <- tabulate(x)
[1] 75 24 19 1 1
and tried to run a for loop to get the "maximum" of each element of the tabulate result, as follows:
for (element in tf) {
  b = max(table(x[element]))
  print(b)
}
I don't get the expected result; it's probably simple, but not for me.
I tried this:
> a=max(table(C[1:75]))
[1] 72
> b=max(table(C[76:99]))
[1] 11
> c=max(table(C[100:118]))
[1] 12
> d=max(table(C[119]))
[1] 1
> e=max(table(C[120]))
[1] 1
and so on.
and it works but it's really long and not fun if I have a big dataset.
Following the commenter's tip, if you want a function, use:
maximum <- function(vector, upto=length(vector)) {
max(table(vector[1:upto]))
}
So for:
set.seed(123)
x <- sample(1:3, 100, replace=T)
maximum(x)
[1] 34
maximum(x, 55) #checking at the 55th number in the vector
[1] 19
Update
To answer your edited question. Use this function:
maxtable <- function(vector) {
index <- cumsum(1:length(vector) %in% cumsum(tabulate(vector)))
s <- split(vector, index)
sapply(s, function(v) max(table(v)))
}
maxtable(x)
0 1 2 3 4 5
71 11 12 1 1 1
Edit
I think this small change is more of what you're looking for:
maxtable2 <- function(vector) {
index <- cumsum(1:length(vector) %in% (cumsum(tabulate(vector))+1))
s <- split(vector, index+1)
sapply(s, function(v) max(table(v)))
}
maxtable2(x)
1 2 3 4 5
72 11 12 1 1
There is a function called mfv in the modeest package that gets you most of the way there. But the function itself is simple enough to write yourself:
> mfv
function (x, ...)
{
f <- factor(x)
tf <- tabulate(f)
return(as.numeric(levels(f)[tf == max(tf)]))
}
<environment: namespace:modeest>
So you can do sum(x == mfv(x)) to get the count of the most frequent value; on the question's subset, sum(x[1:80] == mfv(x[1:80])) returns 74.