Row sampling in R - r

I use the example data to ask the question.
set.seed(1)
X <- data.frame(matrix(rnorm(200), nrow=20))
I want to select 10 random rows each time, without replacement, and run a multiple regression. I tried
hi=X[sample(1:20,10),]
MR1<-lm(X10~., data=hi)
R1<-summary(MR1)$r.squared #extract the R squared
Is it possible to create 25 such datasets, sampling 10 rows each time? In the end, I would like to store the sampled datasets, run a multiple regression on each, and extract the R squared values from the 25 models as well.

You could use lapply:
set.seed(1)
X <- data.frame(matrix(rnorm(200), nrow=20))
n <- 25
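res <- lapply(1:n,
  function(i) {
    samples <- sample(1:20, 10)      # 10 row indices, without replacement
    hi <- X[samples, ]
    MR1 <- lm(X10 ~ ., data = hi)    # fit on the sampled rows, not on the full X
    R1 <- summary(MR1)$r.squared
    return(list(Samples=samples, Hi=hi, MR1=MR1, R1=R1))
  })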
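To pull the 25 R squared values and the sampled datasets back out of the list, something like the following may work. It is a minimal sketch that rebuilds res along the lines of the answer; r2 and datasets are illustrative names, not part of the original answer. One caveat worth checking: with 10 rows and 9 predictors plus an intercept there are no residual degrees of freedom, so each R squared should come out at (numerically) 1. Sampling more rows or using fewer predictors would give more informative values.

```r
set.seed(1)
X <- data.frame(matrix(rnorm(200), nrow = 20))
n <- 25

# One list element per resample: indices, data, model and R squared
res <- lapply(seq_len(n), function(i) {
  samples <- sample(nrow(X), 10)
  hi <- X[samples, ]
  MR1 <- lm(X10 ~ ., data = hi)
  list(Samples = samples, Hi = hi, MR1 = MR1, R1 = summary(MR1)$r.squared)
})

# Collect the R squared values into a numeric vector (illustrative name)
r2 <- sapply(res, function(x) x$R1)

# Collect the sampled datasets into a list of data frames (illustrative name)
datasets <- lapply(res, function(x) x$Hi)
```

Individual pieces can then be reached with, for example, res[[3]]$MR1 for the third model.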

Related

How do I run a for loop so it generates repeated samples of n observations?

I first generated random data from a Gamma distribution using the following code
data <- rgamma(9, shape=32, scale=1/4)
I proceeded to generate a single sample of 9 observations from the population.
sample(data, 9)
I'm trying to run a for loop in R so that I can repeatedly generate samples of 9 observations and save the mean of each sample into a new vector. I want to do this 500,000 times. After the for loop I then want to create a null distribution based on the distribution created from the for loop. I am also wanting to sample with replacement. (I am also very new to R, so any suggestions or help is greatly appreciated).
Here is the code I have tried for the for loop:
v <- 500000
Storage <- numeric(9)
for (i in v) {
Storage[i] <- mean(i)
}
The easiest way to do this is like this...
means <- replicate(500000, mean(rgamma(9, shape=32, scale=1/4)))
This will generate 9 gamma variates, take the mean, and repeat the process 500,000 times, storing the result in the vector means. Definitely no need for a for loop!
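For comparison, the loop from the question can probably be repaired as well. Three things look off in the attempt: for (i in v) iterates once over the single value 500000, Storage only has 9 slots, and mean(i) averages the loop counter rather than a sample. The sketch below assumes the intent was to resample the 9 observed values with replacement; v is reduced to 1000 here only to keep the example quick.

```r
set.seed(2023)
data <- rgamma(9, shape = 32, scale = 1/4)

v <- 1000L                 # use 500000L for the full run
Storage <- numeric(v)      # one slot per resample, allocated up front

for (i in seq_len(v)) {
  # Draw 9 values from the observed data with replacement and store the mean
  Storage[i] <- mean(sample(data, 9, replace = TRUE))
}
```

hist(Storage) would then show the resampled distribution of the mean.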
Use replicate to create the vectors, then compute the means with the fast colMeans.
set.seed(2023)
data <- rgamma(9, shape=32, scale=1/4)
v <- 500000L
Storage <- replicate(v, sample(data, 9, TRUE))
mean_Storage <- colMeans(Storage)
hist(mean_Storage, freq = FALSE)
Created on 2023-02-03 with reprex v2.0.2
Or maybe you want to sample from a Gamma distribution.
set.seed(2023)
v <- 500000L
Storage <- replicate(v, rgamma(9, shape=32, scale=1/4))
mean_Storage <- colMeans(Storage)
hist(mean_Storage, freq = FALSE)

How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, and I would like to compare which predictors each one selects as important on the same data set. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt to solve this is to extract the top variables from each iterated model, put them into a list, and then find the mean of the top variables over some number of loops. However, my issue is that the top 10 most used predictors still vary, so I cannot manually alter the variable names because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because CV creates new models in every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors due to only having 10 variables in this data set.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
  # CV and Splitting
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  train <- df[ind, ]
  xtrain <- model.matrix(mpg~., train)[,-1]
  ytrain <- df[ind, 1]
  test <- df[-ind, ]
  xtest <- model.matrix(mpg~., test)[,-1]
  ytest <- df[-ind, 1]
  # Create Model per Loop
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  # Store Coefficients per loop
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  # Store all nonzero Coefficients
  topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
If you rerun the code chunk above, you will notice that the top 3 variables change. Renaming these variables is difficult when they are not constant from run to run. Any suggestions on how I could approach this?
You can use set.seed() to make sample() return the same draws on every run. For example:
set.seed(123)
When I add this to above code and then run twice, the following is returned both times:
wt carb hp
98 89 86
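Where the seed is set matters: it presumably belongs once before the loop, since calling it inside the loop would make every iteration draw the same resample. The base R sketch below only checks that the bootstrap indices repeat across runs; draw_indices is a hypothetical helper, and 32 is the number of rows in mtcars.

```r
# Hypothetical helper: reproduce the index draws of the bootstrap loop
draw_indices <- function(seed, n_rows = 32, n_boot = 100) {
  set.seed(seed)  # set once, before any resampling
  lapply(seq_len(n_boot), function(i) {
    unique(sample(n_rows, n_rows, replace = TRUE))
  })
}

run1 <- draw_indices(123)
run2 <- draw_indices(123)

# TRUE when both runs drew exactly the same rows in every iteration
same <- identical(run1, run2)
```

With identical training rows in each iteration and a fixed lambda, the selected variables should be the same from run to run.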

R: looped variable assignment, augmenting variable calculation each time

I am trying to calculate a regression variable based on a range of variables in my data set. I would like the regression variable (i.e. Threshold1) to be calculated using a different variable in each iteration of the regression.
The aim is to collect the SSR value for each threshold range, and thus identify the ideal threshold based on the data.
Data (df) variables: Yield, Prec, Price, 0C, 1C, 2C, 3C, 4C, 5C, 6C, 7C, 8C, 9C, 10C
Each loop calculates "thresholds" by selecting a different "b" each time.
a <- df$0C
b <- df$1C
Threshold1 <- (a-b)
Threshold2 <- (b)
Where "b" would be changing in each loop, ranging from 1C to 9C.
Each individual threshold set (1 and 2) should be used to run a regression, and save the SSR for comparison with the subsequent regression utilizing thresholds based on a new "b" value (ranging from 1C TO 9C)
Regression:
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
For each loop of the regression, I vary the components used to calculate the thresholds. My current approach is centered around the following code:
df <- read.csv("Data.csv",header=TRUE)
names(df)
# 0C-9C
varlist <- names(df)[9:19]
ssr.vec <- matrix(,21,1)
for(i in 1:length(varlist)){
a <- df$0C
b <- df$[i]
Threshold1 <- (a-b)
Threshold2 <- (b)
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
r2 <- summary(reg)$r.squared
ssr.vec[i,] <- c(varlist,r2)
}
colnames(ssr.vec) <- c("varlist","r2")
I am failing to achieve the desired result with the above approach.
Thank you.
I can spot quite a few mistakes...
You need to add the variables of interest (Threshold1 and Threshold2) to the data used in the regression. Also, I think you need varlist[i] rather than varlist when filling ssr.vec. ssr.vec needs 2 columns, so declare it as a two-column matrix. You also cannot use something like df$[i] to extract a column! Why does the matrix have 21 rows?! Finally, rename the columns to C0,..,C9 instead of 0C,..,9C, because df$0C is not valid R syntax.
For future reference, solve the simple errors before asking a question... and include error messages in your post!
This should do the job:
df <- read.csv("Data.csv", header=TRUE)
# Assumes columns 9 to 19 hold 0C..10C; the new names start with a letter
names(df)[9:19] <- paste0("C", 0:10)
varlist <- paste0("C", 1:9)  # candidate "b" columns, 1C to 9C
ssr.vec <- matrix(NA, length(varlist), 2)
for (i in seq_along(varlist)) {
  a <- df$C0
  b <- df[[varlist[i]]]  # select by name so the column matches its label
  df$Threshold1 <- (a-b)
  df$Threshold2 <- (b)
  reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
  r2 <- summary(reg)$r.squared
  ssr.vec[i,] <- c(varlist[i],r2)
}
colnames(ssr.vec) <- c("varlist","r2")
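Since Data.csv is not available, here is a self-contained sketch of the same loop on simulated data. Everything about the data is hypothetical (column values, sample size, the results and best names); only the formula and the threshold construction follow the question. It stores results in a data frame so that R squared stays numeric, and also records the SSR, which the question asks for, as the sum of squared residuals.

```r
set.seed(42)
n <- 50

# Hypothetical stand-in for Data.csv
df <- data.frame(Yield = exp(rnorm(n)),
                 Price = exp(rnorm(n)),
                 prec  = runif(n, 1, 10))
temps <- matrix(runif(n * 11, 0, 100), nrow = n)
colnames(temps) <- paste0("C", 0:10)
df <- cbind(df, temps)

varlist <- paste0("C", 1:9)  # candidate "b" columns
results <- data.frame(variable = varlist, r2 = NA_real_, ssr = NA_real_)

for (i in seq_along(varlist)) {
  b <- df[[varlist[i]]]
  df$Threshold1 <- df$C0 - b
  df$Threshold2 <- b
  reg <- lm(log(Yield) ~ Threshold1 + Threshold2 + log(Price) + prec + I(prec^2),
            data = df)
  results$r2[i]  <- summary(reg)$r.squared
  results$ssr[i] <- sum(residuals(reg)^2)  # sum of squared residuals
}

# Threshold variable with the smallest SSR
best <- results$variable[which.min(results$ssr)]
```

Because every model here has the same response and the same number of terms, ranking by smallest SSR and by largest R squared should agree.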

Create a matrix out of remaining data from a random row selection of a matrix, and use data to calculate RMSE in R

I have a matrix A with 42 rows and 2 columns. I then have a function that randomly selects 12 of these rows, fits a linear regression to the selected rows, and outputs the coefficients (slope and intercept).
In R, I want to then get the other 30 rows from the original matrix that were not selected in my random function, and then use that data with my newly calculated coefficients, to generate a point (y-value). So I will have 30 y-values, and then from there I would like to calculate the RMSE (http://upload.wikimedia.org/math/e/f/b/efb7882a7dbfa5fe48d771565d2675f3.png) using the new y-values, and 1 of the columns in my new 30 row matrix.
The code below is what I currently have right now:
#Calibration Equation 1 (TC OFF)
A <- matrix(c(Box.CR, Box.DC.ww), nrow=42)
randco <- function(A) {
  B <- A[sample(42,12),]
  lm(B[,2] ~ B[,1])$coefficients
}
Z <- t(replicate(10000, randco(A)))
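# Paste each row into a single string so that rows can be matched
# (B must be available outside randco() for this to work)
arows <- apply(A, 1, paste, collapse="_")
brows <- apply(B, 1, paste, collapse="_")
A[-match(brows, arows), ]  # rows of A that are not in B
Alternative method, converting the matrix to a data.table
(not recommended if your sole purpose is what's described above)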
library(data.table)
A <- as.data.table(A)
B <- A[sample(nrow(A), 12)]
setkey(A)
setkey(B)
A[!B]
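Another option, sketched below, is to keep the sampled row indices inside the function, so that the 30 held-out rows are simply A[-idx, ] and the RMSE can be computed in the same call. Box.CR and Box.DC.ww are not available here, so A is simulated; rmse_once and n_fit are hypothetical names, and the number of replicates is reduced to keep the example quick.

```r
set.seed(1)

# Simulated stand-in for the 42 x 2 matrix (column 1 = x, column 2 = y)
x <- runif(42)
A <- cbind(x, 2 + 3 * x + rnorm(42, sd = 0.1))

rmse_once <- function(A, n_fit = 12) {
  idx  <- sample(nrow(A), n_fit)       # rows used for the fit
  fit  <- A[idx, , drop = FALSE]
  held <- A[-idx, , drop = FALSE]      # the remaining rows
  cf   <- unname(lm(fit[, 2] ~ fit[, 1])$coefficients)
  pred <- cf[1] + cf[2] * held[, 1]    # predicted y for held-out rows
  sqrt(mean((held[, 2] - pred)^2))     # root mean squared error
}

rmse <- replicate(1000, rmse_once(A))
```

Returning c(cf, rmse) from the function instead would keep the coefficients alongside each RMSE.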

Random sample from given bivariate discrete distribution

Suppose I have a bivariate discrete distribution, i.e. a table of probability values P(X=i,Y=j), for i=1,...n and j=1,...m. How do I generate a random sample (X_k,Y_k), k=1,...N from such distribution? Maybe there is a ready R function like:
sample(100,prob=biprob)
where biprob is 2 dimensional matrix?
One intuitive way to sample is the following. Suppose we have a data.frame
dt=data.frame(X=x,Y=y,P=pij)
Where x and y come from
expand.grid(x=1:n,y=1:m)
and pij are the P(X=i,Y=j).
Then we get our sample (Xs,Ys) of size N, the following way:
set.seed(1000)
Xs <- sample(dt$X,size=N,prob=dt$P)
set.seed(1000)
Ys <- sample(dt$Y,size=N,prob=dt$P)
I use set.seed() to simulate the "bivariateness". Intuitively I should get something similar to what I need. I am not sure that this is the correct way, though. Hence the question :)
Another way is to use Gibbs sampling, marginal distributions are easy to compute.
I tried googling, but nothing really relevant came up.
You are almost there. Assuming you have the data frame dt with the x, y, and pij values, just sample the rows!
dt <- expand.grid(X=1:3, Y=1:2)
dt$p <- runif(6)
dt$p <- dt$p / sum(dt$p) # get fake probabilities
idx <- sample(1:nrow(dt), size=8, replace=TRUE, prob=dt$p)
sampled.x <- dt$X[idx]
sampled.y <- dt$Y[idx]
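If the probabilities are already stored as a matrix, as in the question, the same idea can be sketched without building a data frame: sample the cell positions and convert them back to row and column indices with arrayInd. The matrix P below is made up for illustration, and emp is only there as a rough sanity check on the empirical frequencies.

```r
set.seed(1)

# Hypothetical 3 x 2 table of P(X = i, Y = j); entries sum to 1
P <- matrix(c(0.10, 0.20, 0.30,
              0.15, 0.05, 0.20), nrow = 3)
N <- 10000

# Sample linear cell positions with the joint probabilities as weights
k <- sample(length(P), size = N, replace = TRUE, prob = as.vector(P))

# Convert linear positions back to (row, column) = (X, Y)
ij <- arrayInd(k, dim(P))
Xs <- ij[, 1]
Ys <- ij[, 2]

# Empirical joint frequencies, which should be close to P for large N
emp <- table(factor(Xs, levels = 1:3), factor(Ys, levels = 1:2)) / N
```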
It's not clear to me why you should care that it is bivariate. The probabilities sum to one and the outcomes are discrete, so you are just sampling from a categorical distribution. The only difference is that you are indexing the observations using rows and columns rather than a single position. This is just notation.
In R, you can therefore easily sample from your distribution by reshaping your data and sampling from a categorical distribution. Sampling from a categorical can be done using rmultinom and using which to select the index, or, as Aniko suggests, using sample to sample the rows of the reshaped data. Some bookkeeping can take care of your exact case.
Here's a solution:
library(reshape)
# Reshape data to long format.
# Note: these values sum to 1.25; rmultinom rescales prob internally to sum to 1
data <- matrix(data = c(.25,.5,.1,.4), nrow=2, ncol=2)
pmatrix <- melt(data)
# Sample categorical n times.
rcat <- function(n, pmatrix) {
  rows <- which(rmultinom(n,1,pmatrix$value)==1, arr.ind=TRUE)[,'row']
  indices <- pmatrix[rows, c('X1','X2')]
  colnames(indices) <- c('i','j')
  rownames(indices) <- seq(1,nrow(indices))
  return(indices)
}
rcat(3,pmatrix)
This returns 3 random draws from your matrix, reporting the i and j of the rows and columns:
  i j
1 1 1
2 2 2
3 2 2
