sample() command is too slow in R

I want to create a random subset of a data.table df that is very large (around 2 million lines).
The data.table has a weight column, wgt, that indicates how many observations each line represents.
To generate the vector of row numbers I want to extract, I proceed as follows:
I get the exact number of observations:
ns <- length(df$wgt)
I get the number of desired lines (30% of the sample):
lines <- round(0.3 * ns)
I compute the vector of probabilities:
pr <- df$wgt / sum(df$wgt)
And then I compute the vector of line numbers to get the subsample:
ssout <- sample(1:ns, size = lines, prob = pr)
The final aim is to subset the data using df[ssout,]. However, R gets stuck when computing ssout.
Is there a faster/more efficient way to do this?
Thank you!

I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):
# example data
wgt <- sample(10,2e6,replace=TRUE)
nobs<- sum(wgt)
pr <- wgt/sum(wgt)
# select rows
system.time(x <- sample.int(2e6,size=.3*nobs,prob=pr,replace=TRUE))
# user system elapsed
# 0.20 0.02 0.22
Sampling rows without replacement takes forever on my computer, but is also something that I don't think one needs to do here.
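Putting that back into the original setup, a minimal sketch of the full workflow (assuming df is a data.table with a wgt column; the id column below is just a stand-in for the real data) might be:
library(data.table)
# toy stand-in for the real ~2-million-row table
df <- data.table(id = 1:2e6, wgt = sample(10, 2e6, replace = TRUE))
nobs <- sum(df$wgt)                # size of the underlying population
pr   <- df$wgt / nobs              # selection probability for each row
# draw row numbers with replacement, then subset the data.table
ssout  <- sample.int(nrow(df), size = round(0.3 * nobs), prob = pr, replace = TRUE)
subsam <- df[ssout]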

Related

Identical values generated from random samples from a uniform distribution in dplyr

This is a follow-up to a previous question. My question was not fully formulated and therefore not fully answered in my last post. Forgive me, I'm new to using Stack Overflow.
My professor has assigned a problem set, and we are required to use dplyr and other tidyverse packages. I'm well aware that most (if not all) of the tasks I'm trying to execute are possible in base R, but that's not in keeping with my instructions.
First we are asked to generate a tibble of 1000 random samples from a uniform distribution:
2a. Create a new tibble called uniformDf containing a variable called unifSamples that contains 10000 random samples from a uniform distribution. You should use the runif() function to create the uniform samples. {r 2a}
uniformDf <- tibble(unifSamples = runif(1000))
This goes well.
Then we are asked to loop through this tibble 1000 times, each time choosing 20 random samples, computing the mean, and saving it to a tibble:
2c. Now let's loop through 1000 times, sampling 20 values from a uniform distribution and computing the mean of the sample, saving this mean to a variable called sampMean within a tibble called uniformSampleMeans. {r 2c}
unif_sample_size = 20 # sample size
n_samples = 1000 # number of samples
# set up a data frame to contain the results
uniformSampleMeans <- tibble(sampMean=rep(NA,n_samples))
# loop through all samples. for each one, take a new random sample,
# compute the mean, and store it in the data frame
for (i in 1:n_samples){
  uniformSampleMeans$sampMean[i] <- uniformDf %>%
    sample_n(unif_sample_size) %>%
    summarize(sampMean = mean(unifSamples))
}
This all runs well, I believe, until I look at my uniformSampleMeans tibble, which looks like this:
1 0.471271611726843
2 0.471271611726843
3 0.471271611726843
4 0.471271611726843
5 0.471271611726843
6 0.471271611726843
7 0.471271611726843
...
1000 0.471271611726843
All the values are identical! Does anyone have any insight as to why my output looks like this? I'd be less concerned if they varied by +/- 0.000x, seeing as this is from a distribution that ranges from 0 to 1, but the values are identical even out to the 15th decimal place! Any help is much appreciated!
The following selects unif_sample_size random rows and gives their mean:
library(dplyr)
uniformDf %>% sample_n(unif_sample_size) %>% pull(unifSamples) %>% mean
#[1] 0.5563638
If you want to do this n times, use replicate:
n <- 10
replicate(n, uniformDf %>%
  sample_n(unif_sample_size) %>%
  pull(unifSamples) %>% mean)
#[1] 0.5070833 0.5259541 0.5617969 0.4695862 0.5030998 0.5745950 0.4688153 0.4914363 0.4449804 0.5202964
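If the assignment specifically requires filling a tibble inside a for loop, a minimal corrected sketch (reusing the names from the question; the key change is extracting a plain number with pull() before storing it) could look like this:
library(dplyr)
library(tibble)
unif_sample_size <- 20
n_samples <- 1000
uniformDf <- tibble(unifSamples = runif(1000))
# pre-allocate a numeric column, then store a single number per iteration
uniformSampleMeans <- tibble(sampMean = rep(NA_real_, n_samples))
for (i in 1:n_samples){
  uniformSampleMeans$sampMean[i] <- uniformDf %>%
    sample_n(unif_sample_size) %>%
    pull(unifSamples) %>%
    mean()
}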

Resampling in R

Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(200,200,3000), x2=c(200,200,1000), y=c(100,100,3))
So I have 3 observations.
Now I want to select 2 observations randomly out of d three times (without repetition - there are three combinations in total). For each of these three times I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get three numbers, of which I want to calculate the average. Can someone show how to do this with a loop in R?
Example:
There are three combinations in total, so in this case I can only get the same results. If I do the calculation manually, I get the following three results:
0.977 0.977 1
(The results could of course come in another order.)
And the mean of these three numbers is:
0.984
This is a simple example. In my real case there are many combinations and I don't select all of them (e.g. there could be, say, 1,000,000 combinations of which I only select 1,000).
I think it's better if you use sample.int and replicate instead of going through all the combinations; see my example:
nsample <- 2 # Number of selected observations
nboot <- 10 # Number of times you repeat the process
replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
                      weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
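If the goal is the average of those numbers, one sketch (reusing the setup above) is to keep the replicate() output and take its mean:
library(Benchmarking)
d <- data.frame(x1 = c(200, 200, 3000), x2 = c(200, 200, 1000), y = c(100, 100, 3))
nsample <- 2   # number of selected observations per draw
nboot   <- 10  # number of times the process is repeated
res <- replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
                             weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
mean(res)      # the final average over all resamples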
I have also checked the link you mentioned regarding this issue. If I understood correctly, you want to extract two rows (observations) each time without replacement; for that you can use sample:
SelObs <- sample(1:nrow(d),2)
# for getting the selected observations just
dSel <- d[SelObs,]
And then do your calculations
If you want the already selected observations not to be selected in the next random selection, it is similar, but you need an index:
Obs <- 1:nrow(d)
SelObs <- sample(Obs, 2)
dSel <- d[SelObs, ]
# and now, remove those already selected (by value, so it still works on later rounds)
Obs <- setdiff(Obs, SelObs)
# and keep going with next random selections and the above code
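Strung together, that bookkeeping might look like the sketch below (the draws list and the stopping rule are illustrative additions, not part of the original answer):
Obs   <- 1:nrow(d)
draws <- list()
while (length(Obs) >= 2) {
  SelObs <- sample(Obs, 2)
  draws[[length(draws) + 1]] <- d[SelObs, ]   # store this selection
  Obs <- setdiff(Obs, SelObs)                 # never reuse a row
}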

omitting certain data in R to maintain overall data integrity

I have a function that returns 50 data values, in a one-column matrix, for each of 100 different data frames. However, due to circumstances the function sometimes returns a NaN in one or more of the 50 values in a data frame. This perturbs the data, as a data frame that has one or more NaN values is now effectively left with only 49 or 48 usable values.
df1 df2
112.4563 112.4563
110.1210 110.1210
109.2143 109.2143
NaN 108.1806 <- now uneven and can not perform iterations
107.3700 107.3700
How can I tell my computer/subsequent commands, when iterating through these 100 data frames of 50 rows each, to "ignore" the NaN values in such a way that each of the 100 still has 50 values and is consistently iterable? Or is it even possible to have a varying iteration range, e.g. for(i in 1:47) up to for(i in 1:50), so that the computer forgives the variance in row numbers?
this is also with respect to graphs.
As someone else has noted, it also depends on what you want to do with the NaN values. However, to answer the part about an iterative range, you can do something like the following. I'll use the data frame mtcars as an example.
df = mtcars
length(df$mpg)
length(rownames(df))
length(colnames(df))
If you need to iterate over the total number of rows in your data frame, you can use length(rownames(df)). If you need to iterate over the number of columns instead, you can use length(colnames(df)).
In a for loop, you would do the following:
for (i in 1:length(rownames(df))) {
  # iterative code
}
This will iterate over the total number of rows in a given data frame.
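What to do with the NaN values themselves depends on the calculation you run on each data frame, but as one hedged illustration (the list results and the column val are hypothetical stand-ins for your 100 data frames), you can keep all 50 rows and drop NaN only inside each computation:
# 100 hypothetical data frames, each with a 50-value column "val" containing one NaN
results <- replicate(100,
                     data.frame(val = c(rnorm(49, mean = 110), NaN)),
                     simplify = FALSE)
# iterate over all 100 data frames; each keeps its 50 rows,
# but NaN is ignored when computing the summary
col_means <- numeric(length(results))
for (i in seq_along(results)) {
  x <- results[[i]]$val
  col_means[i] <- mean(x[!is.nan(x)])   # drop NaN for this calculation only
}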

Select multiple observations in a matrix based on a specific condition

I am very new to the R interface but need to use the program in order to run the relevant analyses for my clinical doctorate thesis. So, apologies in advance if this is a novice question.
I have a matrix of beta methylation values with the following dimensions:485577x894. The row names of the matrix refer to cpg sites which range in non-numerical and non-ascending order (e.g. "cg00000029" "cg00000108" "cg00000109" "cg00000165"), while the column names refer to participant IDs which are also in non-numerical and non-ascending order (e.g. "11209" "14140" "1260" "5414").
I would like to identify which beta methylation values are > 0.5 so that I can exclude them from further analyses. In doing so, I need the data to stay in a matrix format. All attempts I have made to conduct this analysis have resulted in retrieval of integer variables rather than the data in a matrix format.
I would be so grateful if someone could please advise me of the code to conduct this analysis.
Thank you for your time.
Cheers,
Alicia
set.seed(1)                                    # so the example is reproducible
m <- matrix(runif(1000, 0, 0.6), nrow = 100)   # 100 rows x 10 cols, data in U[0, 0.6]
m[m > 0.5] <- NA                               # anything > 0.5 set to NA
z <- na.omit(m)                                # remove all rows containing any NAs
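If, before replacing them, you want to see where the values above 0.5 sit (while keeping the matrix intact), which() with arr.ind = TRUE returns their row/column positions; a small sketch on a fresh copy of the example matrix:
set.seed(1)
m <- matrix(runif(1000, 0, 0.6), nrow = 100)
idx <- which(m > 0.5, arr.ind = TRUE)   # row/column positions of values above 0.5
head(idx)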

Exclude data based on the number of non NA observations for each value of key

I have a dataset consisting of monthly observations of returns for US companies. I am trying to exclude from my sample all companies which have fewer than a certain number of non-NA observations.
I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example which shows how I accomplished what I wanted and hopefully makes my goal clear
#load required packages
library(data.table)
library(foreach)
#example data
myseries <- data.table(
  X = sample(letters[1:6], 30, replace = TRUE),
  Y = sample(c(NA, 1, 2, 3), 30, replace = TRUE))
setkey(myseries,"X") #so X is the company identifier
#here I create another data table with each company identifier and its number
#of non NA observations
nobsmyseries <- myseries[,list(NOBSnona = length(Y[complete.cases(Y)])),by=X]
# then I select the companies which have less than 3 non NA observations
comps <- nobsmyseries[NOBSnona <3,]
#finally I exclude all companies which are in the list "comps",
#that is, I exclude companies which have less than 3 non NA observations
#but I do for each of the companies in the list, one by one,
#and this is what makes it slow.
for (i in 1:dim(comps)[1]){
  myseries <- myseries[X != comps$X[i],]
}
How can I do this more efficiently? Is there a data.table way of getting the same result?
If you have more than one column you wish to consider for NA values, then you can use complete.cases(.SD); however, as you want to test a single column, I would suggest something like
naCases <- myseries[, list(totalNA = sum(!is.na(Y))), by = X]  # note: totalNA here counts the non-NA values of Y
You can then join given a threshold of non-NA values, e.g.
threshold <- 3
myseries[naCases[totalNA > threshold]]
You could also use a not-join to select the cases you have excluded:
myseries[!naCases[totalNA > threshold]]
As noted in the comments, something like
myseries[,totalNA := sum(!is.na(Y)),by=X][totalNA > 3]
would work; however, in this case you are performing a vector scan on the entire data.table, whereas the previous solution performed the vector scan on a data.table that has only nrow(unique(myseries[['X']])) rows.
Given that this is a single vector scan, it will be efficient regardless (and perhaps a binary join plus a small vector scan may even be slower than one larger vector scan), so I doubt there will be much difference either way.
How about aggregating the number of non-NA values in Y over X, and then subsetting?
# Aggregate the number of non-NA values per company
num_nas <- as.data.table(aggregate(Y ~ X, data = myseries, FUN = function(x) sum(!is.na(x))))
# Keep only companies with at least 3 non-NA observations
myseries[X %in% num_nas$X[num_nas$Y >= 3], ]
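For completeness, a compact alternative sketch in plain data.table that keeps only the groups meeting the threshold in a single step (using the example data above):
threshold <- 3
myseries[, if (sum(!is.na(Y)) >= threshold) .SD, by = X]
Here .SD returns a group's rows whenever the condition holds, so companies with fewer than threshold non-NA values of Y are dropped.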
