Subsets with random sampling with replacement in R - r

I have a vector of 300 numbers (from 1 to 300). I want to create two subsets, i.e., model/training (200 numbers) and testing set (100 numbers) with replacement. I tried to use sample and subset but didn't got the results I want.
MWE:
x=(1,2,3,.......300)
x1 = (1,1,2,3,5,5,...........,300) (Consider it training set of 200 samples)
x2 = (1,3,9,101,130,130,..........299)
Any suggestion please !!!!!

You could create a set of random indices for the training set and then select all but those indices for the test set, like this:
data <- c(1,3,8,7,19,5,4,10,11,20)
i <- sample(1:length(data), 5)
training <- data[i]
test <- data[-i]
This will get five points for the training set and all the remaining points will go in the test set.

Related

How to apply MICE imputations on test set?

I have two separate data sets: one for train (1000000 observation) and the other one for test (1000000 observation). I divided the train set into 3 sets (mytrain: 700000 observations, myvalid: 150000 observations, mytest:150000 observations). Thetest set with 1000000 observations doesn't include the target variable, so it should be used for the final test. Since there are some missing values for categorical variables, I need to use mice to impute them. I should reuse the imputation done on mytrain set to fill the missing values in the myvalid, mytest and test sets. Based on the answer to this question, I should do this:
data2 <- rbind(mytrain,myval,mytest,test)
data2$ST_EMPL <- as.factor(data2$ST_EMPL)
data2$TYP_RES <- as.factor(data2$TYP_RES)
imp <- mice(data2, method = "cart", m = 1, maxit = 1, seed = 123,
ignore = c(rep(FALSE, 700000),rep(TRUE, 1300000)))
data2.imp <- complete(imp,1)
summary(imp)
mytrainN <- data2.imp[1:700000,]
myvalN <- data2.imp[700001:850000,]
mytestN <- data2.imp[850001:1000000,]
testN <- data2.imp[1000001:2000000,]
However, since the test set does not have the target column, it is not possible to merge it with mytrain, mytest, and myvalid. Is it possible to add a hypothetical target column (with the value of say 10 for all 1000000 observations) to the test set?

Split 100 times randomly train and test data using caret in R

To better estimate the accuracy of the classifier, I would like to carry out classification procedure which will be repeated 100 times with randomly changing sets of training and test samples (divided 50% by class). I don't know how to split this sets randomly and repeat it 100 times.
You can create multiple splits based on the outcome variable at once with the times argument to caret::createDataPartition().
For example, the following line of code will produce a list of 100 numeric vectors of data indices. (Note: I set list = TRUE so we can use purrr::map() next)
indices <- caret::createDataPartition(extracted$class, p = 0.5, list = TRUE, times = 100)
Now you can use purrr::map() to iterate over each and get a list of training and testing sets.
resample_data <- tibble(
training_sets = map(indices, ~ extracted[.x, ]),
testing_sets = map(indices, ~ extracted[-.x, ])
)

Logistic regression training and test data

I am a beginner to R and am having trouble with something that feels basic but I am not sure how to do it. I have a data set with 1319 rows and I want to setup training data for observations 1 to 1000 and the test data for 1001 to 1319.
Comparing with notes from my class and the professor set this up by doing a Boolean vector by the 'Year' variable in her data. For example:
train=(Year<2005)
And that returns the True/False statements.
I understand that and would be able to setup a Boolean vector if I was subsetting my data by a variable but instead I have to strictly by the number of rows which I do not understand how to accomplish. I tried
train=(data$nrow < 1001)
But got logical(0) as a result.
Can anyone lead me in the right direction?
You get logical(0) because nrow is not a column
You can also subset your dataframe by using row numbers
train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data)
train_data = data[train,]
test_data = data[test,]
But be careful, unless the order of rows in your dataframe is completely random, you probably want to get 1000 rows randomly and not the 1000 first ones, you can do this using
train = sample(1:nrow(data),1000)
You can then get your train_data and test_data using
train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]
The setdiff function is used to get all rows not selected in train
The issue with splitting your data set by rows is the potential to introduce bias into your training and testing set - particularly for ordered data.
# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = T),
data = sample(seq(0, 1, by = 0.01), 1000, replace = T))
nrow(data)
[1] 1000
If you really want to take the first n rows then you can try:
first.n.rows <- data[1:1000, ]
The caret package provides a more reliable approach to using cross validation in your models.
First create the partition rule:
library(caret)
inTrain <- createDataPartition(y = data$year,
p = 0.8, list = FALSE)
Note y = data$year this tells R to use the variable year to sample from, ensuring you don't get ordered data and introduced bias to the model.
The p argument tells caret how much of the original data should be partitioned to the training set, in this case 80%.
Then apply the partition to the data set:
# Create the training set
train <- data[inTrain,]
# Create the testing set
test <- data[-inTrain,]
nrow(train) + nrow(test)
[1] 1000

Arrange different data set using matrix code

I'm trying to use repeat loop to generate 100 data set of Poisson Distribution with sample size n=100 and I would like to arrange the result in by row and column but it is just show me repeating to show me the last set of data while not all the data set. At the same time I would also trying to figure out the way to get the mean, variance and MSE of the 100 data set.
set.seed(124)
a <- 1
repeat{
b = rpois(100, lambda = 3)
Storage100 <- matrix(data=b,nrow=100,ncol=1)
a = a+1
print(b)
if (a>100){break
}
}
Storage100
I'm expecting that my 100 data set can be show like first set of data in first column, second set of data in second column.....
Use replicate with simplify as TRUE to get matrix of dimension 100 X 100 where each column represents the distribution.
set.seed(124)
m1 <- replicate(100, matrix(data=rpois(100, lambda = 3),ncol = 1), simplify = TRUE)
To get the mean for each column we can use colMeans (thanks to #jay.sf)
colMeans(m1)

How do I produce a set of predictions based on a new set of data using predict in R? [duplicate]

This question already has answers here:
Predict() - Maybe I'm not understanding it
(4 answers)
Closed 6 years ago.
I'm struggling to understand how the predict function works and can be used with different sample data. For instance the following code...
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(my$y ~ my$x)
mySample <- my[sample(nrow(my), 100),]
predict(fit, mySample)
I would understand should return 100 y predictions based on the sample. But it returns 1,000 row with the warning message :
'newdata' had 100 rows but variables found have 1000 rows
How do I produce a set of predictions based on a new set of data using predict? Or am I using the wrong function? I am a noob so apologise in advance if I am asking stupid questions.
It's never a good idea to use the $ symbol when using the formula syntax (and most of the times it's completely unnecessary. This is especially true when you are trying to make predictions because the predict() function works hard to exactly match up column names and data.types. So rather than
fit <- lm(my$y ~ my$x)
use
fit <- lm(y ~ x, my)
So a complete example would be
set.seed(15) # for reproducibility
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(y ~ x, my)
mySample <- my[sample(1:nrow(my), 100),]
head(predict(fit, mySample))
# 694 278 298 825 366 980
# 0.43593108 -0.67936324 -0.42168723 -0.04982095 -0.72499087 0.09627245
couple of things wrong with the code: you are overwriting the sample function with your variable named sample. you want something like mysample<- sample(my\$x,100) ... its nothing to do with predict. From my limited understanding dataframes are 'lists of columns' so sampling my means creating 100 samples of (the 1000 row) column x. by using my\$x you now are referring to the column ( in the dataframe), which is a list of rows.
In other words you are sampling from a list of columns (which only has a single element), but you actually want to sample from a list of the rows in column x
Is this what you want
library(caret)
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
## Divide data into train and test set
Index <- createDataPartition(my$y, p = 0.8, list = FALSE, times = 1)
train <- my[Index, ]
test <- my[-Index,]
lmfit<- train(y~x,method="lm",data=train,trControl = trainControl(method = "cv"))
lmpredict<-predict(lmfit,test)
this for an in-sample prediction for pseudo out of sample prediction (forecasting one step ahead) you just need lag the independent variable by 1
Lag(x)

Resources