I am trying to build some machine learning models, so I need training data and validation data.
Suppose I have N examples and want to select x of them at random from a data frame. For example, if I have 100 examples and need 10 of them, is there an efficient way to generate 10 random INTEGER indices with which to extract the training data from my sample data?
I tried using a while loop, gradually replacing the repeated numbers, but the running time is not ideal, so I am looking for a more efficient way to do it.
Can anyone help, please?
sample (or sample.int) does this:
sample.int(100, 10)
# [1] 58 83 54 68 53 4 71 11 75 90
will generate ten random numbers from the range 1–100. By default, sample.int samples without replacement, so the values are unique; if you instead want sampling with replacement (repeats allowed), pass replace = TRUE:
sample.int(20, 10, replace = TRUE)
# [1] 10 2 11 13 9 9 3 13 3 17
More generally, sample samples n observations from a vector of arbitrary values.
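For instance, drawing three values from a character vector works the same way (a quick illustration of the same idea):
sample(letters[1:5], 3)
# three of the five letters, in random order, e.g. "b" "e" "a"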
If I understand correctly, you are trying to create a hold-out sample. This is usually done using probabilities. So if you have n.rows samples and want a fraction training.fraction of them used for training, you can do something like this:
select.training <- runif(n=n.rows) < training.fraction
data.training <- my.data[select.training, ]
data.testing <- my.data[!select.training, ]
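Note that with this approach the size of the training set is itself random: the number of selected rows follows a Binomial(n.rows, training.fraction) distribution, so the split only approximates the requested fraction. A quick illustration of the realized count varying from run to run:
sel <- runif(1000) < 0.75  # probabilistic selection of 1000 rows
sum(sel)                   # close to 750, but generally not exactly 750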
If you want to specify the EXACT number of training cases, you can do something like:
indices.training <- sample(x=seq(n.rows), size=training.size, replace=FALSE) #replace=FALSE makes sure the indices are unique
data.training <- my.data[indices.training, ]
data.testing <- my.data[-indices.training, ] #note that index negation means "take everything except for those"
From the raster package:
raster::sampleInt(242, 10, replace = FALSE)
## 95 230 148 183 38 98 137 110 188 39
This is useful because base R's sample.int may fail if the limits are too large:
sample.int(1e+12, 10)
I'm trying to reduce the input data size by first performing K-means clustering in R, then sampling 50-100 samples per representative cluster for downstream classification and feature selection.
The original dataset was split 80/20, and the 80% went into K-means training. I know the input data has 2 columns of labels and 110 columns of numeric variables. From the label column, I know there are 7 different drug treatments. In parallel, I used the elbow method to find the optimal number of clusters, which is around 8, so I picked 10 to have a few more clusters to sample from downstream.
Now that I have finished running model <- kmeans(), the output list has me a little confused about what to do next. Since I had to scale only the numeric variables before passing them to kmeans, the output cluster memberships no longer carry the treatment labels. This I can overcome by appending the cluster memberships to the original training data table.
Then, for the 10 centroids, how do I find out what their labels are? I can't just do
training_set$centroids <- model$centers
And the most important question: how do I find the 100 samples per cluster that are closest to their respective centroids? I have seen one post on this in Python but no R resources yet:
Output 50 samples closest to each cluster center using scikit-learn.k-means library
Any pointers?
First we need a reproducible example of your data:
set.seed(42)
x <- matrix(runif(150), 50, 3)
kmeans.x <- kmeans(x, 10)
Now you want to find the observations in the original data x that are closest to the centroids stored in kmeans.x. We use the get.knnx() function from package FNN, getting just the 5 closest observations for each of the 10 clusters.
library(FNN)
y <- get.knnx(x, kmeans.x$centers, 5)
str(y)
# List of 2
# $ nn.index: int [1:10, 1:5] 42 40 50 22 39 47 11 7 8 16 ...
# $ nn.dist : num [1:10, 1:5] 0.1237 0.0669 0.1316 0.1194 0.1253 ...
y$nn.index[1, ]
# [1] 42 38 3 22 43
idx1 <- sort(y$nn.index[1, ])
cbind(idx1, x[idx1, ])
# idx1
# [1,] 3 0.28614 0.3984854 0.21657
# [2,] 22 0.13871 0.1404791 0.41064
# [3,] 38 0.20766 0.0899805 0.11372
# [4,] 42 0.43577 0.0002389 0.08026
# [5,] 43 0.03743 0.2085700 0.46407
The row indices of the nearest neighbors are stored in nn.index, so for the first cluster the 5 closest observations are rows 42, 38, 3, 22 and 43.
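To collect these for every cluster at once (on your real data you would use k = 100 instead of 5), here is a minimal sketch building on the y object above:
# one list element per cluster: the rows of x nearest that cluster's centroid
closest <- lapply(seq_len(nrow(y$nn.index)), function(i) {
  x[y$nn.index[i, ], , drop = FALSE]
})
str(closest[[1]])  # a 5 x 3 matrix for cluster 1
Since nn.index holds row numbers of the input data, you can use the same indices on your original (unscaled) training table to recover the treatment labels.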
I am quite familiar with R, but I have never before needed to create an exactly equal random data partition with createDataPartition.
index = createDataPartition(final_ts$SAR,p=0.5, list = F)
final_test_data = final_ts[index,]
final_validation_data = final_ts[-index,]
This code creates two datasets with 1396 and 1398 observations, respectively.
I am surprised that p=0.5 doesn't do what it is supposed to do. Does it have something to do with the resulting datasets not having an odd number of observations by default?
Thanks in advance!
It has to do with the number of cases of the response variable (final_ts$SAR in your case).
For example:
y <- rep(c(0,1), 10)
table(y)
y
0 1
10 10
# even number of cases
Now we split (computing the partition once, so that train and test are exact complements):
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
train <- y[idx]
table(train) # we have 10 obs.
train
0 1
5 5
test <- y[-idx]
table(test) # we have 10 obs.
test
0 1
5 5
If we instead build an example with an odd number of cases per class:
y <- rep(c(0,1), 11)
table(y)
y
0 1
11 11
We have:
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
train <- y[idx]
table(train) # we have 12 obs.
train
0 1
6 6
test <- y[-idx]
table(test) # we have 10 obs.
test
0 1
5 5
Here is another thread which explains why the number returned from createDataPartition might seem "off" to us, even though it matches what the function is trying to do.
So, it depends on what you have in final_ts$SAR and how the data are spread.
If it is a categorical variable, e.g. T and F, and out of 100 values 55 are T and 45 are F, then invoking it the way you do in your code will return 51 indices, because the split is computed per class and rounded up:
55 * 0.5 = 27.5 and 45 * 0.5 = 22.5; rounding each result up gives 28 + 23 = 51.
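To see this in code (a quick sketch, assuming the caret package is installed; the count follows the per-class rounding described above):
y <- factor(rep(c("T", "F"), times = c(55, 45)))
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
length(idx)
# 51, i.e. ceiling(55 * 0.5) + ceiling(45 * 0.5) = 28 + 23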
You can refer to the thread below, which has a great explanation of this when the values you want to split are numeric:
R - caret createDataPartition returns more samples than expected
I want to make a series of randomly sampled training sets that are exactly 75% of the size of the full data set. The code below is an example of what I want to achieve, except that I always want 75 samples of 1 and 25 samples of 2; this code only gives me sample sizes that are close to those, not exact.
column <- c(rep("A", 40), rep("B", 60))
data <- as.data.frame(column)
index <- sample(2,100, replace=TRUE, prob=c(0.75,0.25))
I want to achieve this kind of partitioning with just base R, without additional packages, if possible. Packages don't seem to work for me the vast majority of the time, which is why I have found it difficult to find a solution already.
That's how sample with prob is intended to work: the probabilities apply to each draw independently, so the realized group sizes are only approximately 75/25. For exact sizes, do it in two steps instead:
idxTrain <- sample(100, 75)
head(idxTrain)
# [1] 54 70 3 42 72 67
length(idxTrain)
# [1] 75
idxTest <- setdiff(1:100, idxTrain)
head(idxTest)
# [1] 5 7 13 14 19 24
length(idxTest)
# [1] 25
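Applied to the data frame from your question (data is the one-column example built there):
dataTrain <- data[idxTrain, , drop = FALSE]  # exactly 75 rows
dataTest  <- data[idxTest,  , drop = FALSE]  # exactly 25 rows
nrow(dataTrain)
# [1] 75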
I would like to sample, let's say, the ages of 100 persons above 65 years old, where the probabilities for the age groups are as follows:
65-74<- 0.56
75-84<- 0.30
85<- 0.24
I know of the sample function and I tried it as follows, but unfortunately that didn't work:
list65_74<-range(65,74)
list75_84<-range(75,84)
list85<-range(85,100)
age <- sample(c(list65_74, list75_84, list85), size = 10, replace = TRUE, prob = c(0.56, 0.30, 0.24))
I then got the following error:
Error in sample.int(length(x), size, replace, prob) :
incorrect number of probabilities
So I was wondering: what is the proper way to sample from multiple lists?
Thank you very much in advance!
First, I'll call those three objects groups rather than lists, since they don't use the list function.
The way you define them could be fine, but it's somewhat more direct to write, e.g., 65:74 rather than range(65, 74), which just returns c(65, 74). So, ultimately I put the three groups in the following list:
groups <- list(group65_74 = 65:74, group75_84 = 75:84, group85 = 85:100)
Now, the first problem with your usage of sample was your x argument value, which per the documentation should be
either a vector of one or more elements from which to choose, or a
positive integer. See ‘Details.’
Meanwhile, your x was just
c(list65_74, list75_84, list85)
# [1] 65 74 75 84 85 100
Lastly, the value of prob was inappropriate: you supplied 3 numbers for a vector of 6 candidate values, but sample expects one probability per element of x. Instead, you need to assign an appropriate probability to each age in each group, as in
rep(c(0.56, 0.30, 0.24), times = sapply(groups, length))
So that the result is
sample(unlist(groups), size = 10, replace = TRUE,
prob = rep(c(0.56, 0.30, 0.24), times = sapply(groups, length)))
# [1] 82 72 69 74 72 72 69 70 74 70
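Note that repeating a group's probability for every member makes the group's total weight proportional to probability × group size (sample normalizes prob internally, so the weights need not sum to 1). If you instead want each group's total probability to be exactly 0.56 / 0.30 / 0.24, divide by the group lengths:
probs <- rep(c(0.56, 0.30, 0.24) / lengths(groups), times = lengths(groups))
sample(unlist(groups), size = 10, replace = TRUE, prob = probs)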
I'm new to R and I want to sample from a list of 97 values. The list is composed of 3 different values (1, 2 and 3), each representing a certain condition; the 97 values represent 97 individuals.
Let's assume the list is called original_pop. I want to randomly choose 50 individuals and store them as males, and store the remaining 47 individuals as females. A simple and similar scenario:
original_pop = [1 2 3 3 1 2 2 1 3 1 ...]
male_pop = [50 random values from original_pop]
female_pop = [the 47 values that are not in male_pop]
I created original_pop with sample, so the values are already random, but I don't know how to do the rest. Right now I've stored the first 50 values of original_pop as males and the last 47 as females; that might work because original_pop was randomly generated, but I think it would be more appropriate to choose the values from original_pop randomly rather than in order.
Appreciate your responses!
In the absence of your original_pop data, we simulate it below.
n <- 97
original_pop <- sample(1:3, size = n, replace = TRUE)
maleIndexes <- sample(n, 50)          # 50 unique positions out of 97
males <- original_pop[maleIndexes]    # the randomly chosen 50
females <- original_pop[-maleIndexes] # the remaining 47
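A quick sanity check on the result:
length(males)   # 50
length(females) # 47
table(c(males, females)) # same counts as table(original_pop)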