I have a data set of x entries, and I need to resample it to y entries, with y smaller than x. My data set is not a series of numbers but rather x rows, and I need the entire row of information when resampling.
I am aware of the sample() function, but since my dataset is not a vector I am unclear how the exact code should be written.
Any help would be appreciated!
The idea is that you want to sample the row indices, then use them to pull back all of the columns for those rows, like so:
set.seed(4444) # for reproducibility
data(iris)
x <- nrow(iris)
y <- 7
irisSubset <- iris[sample(x,y),]
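To convince yourself that index sampling keeps whole rows intact, you can compare a sampled row with its source; a minimal check on the same iris example:

```r
set.seed(4444)                # for reproducibility
idx <- sample(nrow(iris), 7)  # 7 distinct row numbers
irisSubset <- iris[idx, ]     # those rows, with all 5 columns
identical(irisSubset[1, ], iris[idx[1], ])  # TRUE: the row is carried over unchanged
```

Because sample() is called without replace = TRUE, the y row numbers are distinct, so no row appears twice in the subset.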
Hi, I'm trying to create 10 sub-training sets (from a training set of 75%) in a loop, extracting randomly from a data frame (DB). I'm using
smp_size<- floor((0.75* nrow(DB))/10)
train_ind<-sample(seq_len(nrow(DB)), size=(smp_size))
training<- matrix(ncol=(ncol(DB)), nrow=(smp_size))
for (i in 1:10){
  training[i] <- DB[train_ind, ]
}
what's wrong?
To partition your dataset into 10 equally sized subsets, you may use the following:
# Randomly order the rows in your training set:
DB <- DB[order(runif(nrow(DB))), ]
# You will create a sequence 1,2,..,10,1,2,...,10,1,2.. you will use to subset
inds <- rep(1:10, nrow(DB)/10)
# split() will store the subsets (created by inds) in a list
subsets <- split(DB, inds)
Note, however, that this approach only gives you equally sized subsets. Therefore, it might (and, unless nrow(DB) is divisible by 10, will) happen that some of the observations are not included in any of the subsets.
If you wish to use all observations, at the cost of some subsets being larger than others, use inds <- rep(1:10, length.out = nrow(DB)) instead.
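A small sketch of the length.out variant, on a hypothetical 23-row data frame (deliberately not divisible by 10), to show how the subset sizes come out:

```r
set.seed(1)
DB <- data.frame(a = rnorm(23), b = rnorm(23))  # 23 rows: not divisible by 10
DB <- DB[order(runif(nrow(DB))), ]              # random row order
inds <- rep(1:10, length.out = nrow(DB))        # 1..10, 1..10, 1, 2, 3
subsets <- split(DB, inds)
sapply(subsets, nrow)  # subsets 1-3 get 3 rows each, the rest get 2
```

Every one of the 23 rows lands in exactly one subset, with the first few subsets one row larger than the others.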
I have two data sets, TEST and TRAIN. TEST is a subset of TRAIN. Using the columns "prod" and "clnt", I need to find all rows in TRAIN that correspond to TEST (it is a one-to-many correspondence). Then I make a temporal analysis of the respective values of the column "order" (the first column, "week", is the time).
So I take the first row of TEST and compare it against all rows of TRAIN, checking whether any of them contain the same combination of "prod" and "clnt" values, and record the respective values of "order" in TS. Usually I get zero to about ten values in TS per row of TEST. Then I do some calculations on TS (in this artificial case just mean(TS)) and record the result, as well as the "id" of the row of TEST, in a data set Subm.
The algorithm works, but because I have millions of rows in TRAIN and TEST, I need it to be as fast as possible, and especially to get rid of the loop, which is the slowest part. I have probably also messed up the data.frame declaration/usage, but I am not sure.
set.seed(42)
NumObsTrain=100000 # this can be as much as 70 000 000
NumObsTest=10000 # this can be as much as 6 000 000
#create the TRAIN data set
train1=floor(runif(NumObsTrain, min=0, max=NumObsTrain+1))
train1=matrix(train1,ncol = 2)
train=cbind(8,train1) #week
train=rbind(train,cbind(9,train1)) #week
train=cbind(train,runif(NumObsTrain,min=1,max=10)) #order
train=cbind(c(1:nrow(train)),train)# id number of each row
colnames(train)=c("id","week","prod","clnt","order")
train=as.data.frame(train)
train=train[sample(nrow(train)),] # reshuffle the rows of train
# Create the TEST dataset
test=train[1:NumObsTest,]
test[,"week"][1:{NumObsTest/2}]=10
test[,"week"][{(NumObsTest/2)+1}:NumObsTest]=11
TS=numeric(length = 10)
id=c(1:NumObsTest*2)
order=c(1:NumObsTest*2)
Subm=data.frame(id,order)
ptm <- proc.time()
# This is the loop
for (i in 1:NumObsTest){
  Subm$id[i] = test$id[i]
  TS = train$order[train$clnt == test$clnt[i] & train$prod == test$prod[i]]
  Subm$order[i] = mean(TS)
}
proc.time() - ptm
The following will create a data.frame with all (prod, clnt) and order combinations, then group them by prod and clnt, and take the mean of the order within each group. The final result is missing the id, and for some reason there are more rows in the final data.frame than expected, which I cannot explain. But the order results are correct.
newtrain <- train[, 3:5]
newtest <- test[, c(1, 3:4)]
x <- dplyr::inner_join(newtest, newtrain)
y <- dplyr::group_by(x, prod, clnt)
z <- dplyr::summarise(y, mean(order))
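The extra rows most likely come from duplicated (prod, clnt) pairs in test: an inner join returns one output row per matching pair of input rows. If you also want to keep the id, the same join-then-aggregate idea works in base R; here is a sketch on a tiny hypothetical example (column names as in the question):

```r
train <- data.frame(prod = c(1, 1, 2), clnt = c(5, 5, 6), order = c(2, 4, 10))
test  <- data.frame(id = 1:2, prod = c(1, 2), clnt = c(5, 6))
joined <- merge(test, train, by = c("prod", "clnt"))  # inner join on prod and clnt
Subm   <- aggregate(order ~ id, data = joined, FUN = mean)
Subm$order  # 3 (mean of 2 and 4) and 10
```

Grouping on id instead of (prod, clnt) gives one output row per test row, which is the shape the original loop produced.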
I want to apply a statistical function to increasingly larger subsets of a data frame, starting at row 1 and incrementing by, say, 10 rows each time. So the first subset is rows 1-10, the second rows 1-20, and the final subset is rows 1-nrows. Can this be done without a for loop? And if so, how?
Here is one solution:
# some sample data
df <- data.frame(x = sample(1:105, 105))
# getting the endpoints of the sequences you wanted: 10, 20, ..., 100, 105
# (unique() avoids a duplicated endpoint when nrow(df) is a multiple of 10)
row_seq <- unique(c(seq(10, nrow(df), 10), nrow(df)))
#getting the datasubsets filtering df from 1 to each endpoint
data.subsets <- lapply(row_seq, function(x) df[1:x, ])
# applying the mean function to each data-set
# just replace the function mean by whatever function you want to use
lapply(data.subsets, mean)
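If the real data frame has more than one column, df[1:x, ] stays a data frame and mean() no longer applies directly; one sketch (with hypothetical two-column data) is to take colMeans of each growing subset:

```r
set.seed(1)
df2 <- data.frame(a = rnorm(35), b = rnorm(35))
row_seq2 <- unique(c(seq(10, nrow(df2), 10), nrow(df2)))     # 10, 20, 30, 35
stats <- lapply(row_seq2, function(x) colMeans(df2[1:x, ]))  # one (a, b) mean pair per subset
```

Replace colMeans with any function of a data frame, and each list element holds the statistic for rows 1 through the corresponding endpoint.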
I have two dataframes as follows:
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow=10))
where the rows represent the genes and the columns are the genotypes.
For each round of bootstrapping (n=1000), genotypes should be selected at random without replacement from this dataset (X) to form two groups (X' should have 5 genotypes and Y' should have 5 genotypes). Basically, in the end I will have a thousand such datasets X' and Y', each containing 5 random genotypes from the full expression dataset.
I tried using replicate and apply, but it did not work.
B <- 1000
replicate(B, apply(X, 2, sample, replace = FALSE))
I think it might make more sense for you to first select the column numbers, 10 from 200 without replacement (five for each of X' and Y'):
colnums_boot <- replicate(1000,sample.int(200,10))
From there, as you evaluate each iteration, i from 1 to 1000, you can grab
Xprime <- X[,colnums_boot[1:5,i]]
Yprime <- X[,colnums_boot[6:10,i]]
This saves you from making a 3-dimensional array (the generalization of matrix in R).
Also, if speed is a concern, I think it would be much faster to leave X as a matrix instead of a data frame. Maybe someone else can comment on that.
EDIT: Here's a way to grab them all up-front (in a pair of three-dimensional arrays):
Z <- as.matrix(X)
Xprimes <- array(dim = c(10, 5, 1000))
Xprimes[] <- Z[, colnums_boot[1:5, ]]
Yprimes <- array(dim = c(10, 5, 1000))
Yprimes[] <- Z[, colnums_boot[6:10, ]]
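A quick sanity check on the up-front arrays, using the dimensions from the question (10 genes x 200 genotypes): because R fills arrays column-major, slice i of the array should match direct indexing for iteration i.

```r
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow = 10))   # 10 genes x 200 genotypes
colnums_boot <- replicate(1000, sample.int(200, 10))
Z <- as.matrix(X)
Xprimes <- array(dim = c(10, 5, 1000))
Xprimes[] <- Z[, colnums_boot[1:5, ]]
all(Xprimes[, , 3] == Z[, colnums_boot[1:5, 3]])  # TRUE: slice 3 is the third X'
```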
I've got 2 dataframes, each with 150 rows and 10 columns plus column and row IDs. I want to correlate every row in one dataframe with every row in the other (i.e. 150x150 correlations) and plot the distribution of the resulting 22500 values. (Then I want to calculate p values etc. from the distribution, but that's the next step.)
Frankly, I don't know where to start with this. I can read my data in and see how to correlate vectors or matching slices of two matrices, but I can't get a handle on what I'm trying to do here.
set.seed(42)
DF1 <- as.data.frame(matrix(rnorm(1500),150))
DF2 <- as.data.frame(matrix(runif(1500),150))
#transform to matrices for better performance
m1 <- as.matrix(DF1)
m2 <- as.matrix(DF2)
#use outer to get all combinations of row numbers and apply a function to them
#22500 combinations is small enough to fit into RAM
cors <- outer(seq_len(nrow(DF1)),seq_len(nrow(DF2)),
#you need a vectorized function
#Vectorize takes care of that, but is just a hidden loop (slow for huge row numbers)
FUN=Vectorize(function(i,j) cor(m1[i,],m2[j,])))
hist(cors)
You can use cor with two arguments:
cor( t(m1), t(m2) )
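The two-argument call correlates the columns of its inputs, so transposing makes it correlate rows of m1 against rows of m2, giving the full 150x150 matrix in one vectorized step. It agrees with the outer() result on the sample data above:

```r
set.seed(42)
m1 <- matrix(rnorm(1500), 150)  # 150 rows, 10 columns
m2 <- matrix(runif(1500), 150)
cors_outer <- outer(seq_len(nrow(m1)), seq_len(nrow(m2)),
                    FUN = Vectorize(function(i, j) cor(m1[i, ], m2[j, ])))
cors_mat <- cor(t(m1), t(m2))   # rows of m1 vs rows of m2
all.equal(cors_outer, cors_mat) # TRUE
```

For 150-row inputs the two-argument form is also much faster, since it avoids the 22500 individual cor() calls hidden inside Vectorize.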