How to Bootstrap Resample Count Data in R

How to Bootstrap Resample Count Data in R - r

I have a vector of counts which I want to resample with replacement in R:
X350277 128
X193233 301
X514940 3715
X535375 760
X953855 50
X357046 236
X196664 460
X589071 898
X583656 670
X583117 1614
(Note the second column is counts, the first column is the object the counts represent)
From reading various documentation it seems easy to resample data where each row or column represents a single observation. But how do I do this when each row represents multiple observations summed together (as in a table of counts)?

You can use weighted sampling (as user20650 also mentioned in the comments):
sample_weights <- dat$count/sum(dat$count)
mysample <- dat[sample(1:nrow(dat),1000,replace=T,prob=sample_weights),]
A less efficient approach - which might have its uses depending on what you want to do - is to turn your data to 'long' again:
dat_large <- dat[rep(1:nrow(dat),dat$count),]
#then sampling is easy
mysample <- dat_large[sample(1:nrow(dat_large),1000,replace=T),]

Related

R loop to execute the same code

I'm trying to figure out how to repeat the same code 30 times without typing each one at a time... any help will be much appreciated.
SRS_1 <- sample(1:nrow(MyData_points), size=.10*nrow(MyData_points))
data_sample_1 <- MyData_points[SRS_1,]
fpc.srs <- rep(6399875, 639987)
design_SRS_1 <- svydesign(id=~1, strata=NULL, data=data_sample_1, fpc=fpc.srs)
ONStotal_SRS1 <- svytotal(~data_sample_1$V4, design=design_SRS_1)
ONSmean_SRS1 <- svymean(~data_sample_1$V4, design=design_SRS_1)
CI_SRS_1 <- confint(svytotal(~data_sample_1$V4, design=design_SRS_1))
The first code calculates a Simple Random Sampling with a probability of .10 from the data. The second gets the sample from the data. Third, calculates the fpc, which is the 10% of the total data points. Now, in order to estimate the population I need to do a design of the sample without replacement including the fpc. Then, for the last three codes, I calculate a population estimate, mean and confidence interval based on that sample.
What changes is that I must repeat 30 different Simple Random Samplings from the data. Therefore, the resulting estimation, mean and confidence intervals will be obtained from 30 different samples. They might be close but not equal
How can I make this code better so I can run it 30 times each and be able to print a table with (ONStotal_SRS1, ONSmean_SRS1,CI_SRS_1)?

Usually I would use either rbindlist from the data.table package or bind_rows from dplyr in combination with an lapply to build the table a row at a time and then bind the rows together. Here is an example using bind_rows with the mtcars data set:
library(dplyr)
combined_data <- bind_rows(lapply(1:30, function(...) {
# Take a sample
SRS_1 <- sample(1:nrow(mtcars), size = .10 * nrow(mtcars))
data_sample_1 <- mtcars[SRS_1, ]
# Compute some things from the sample
m_disp <- mean(data_sample_1$disp)
m_hp <- mean(data_sample_1$hp)
# Make a one row data.frame that will be returned by the function
data.frame(m_disp, m_hp)
}))
Which gives this data.frame:
> str(combined_data)
'data.frame': 30 obs. of 2 variables:
$ m_disp: num 235 272 410 115 249 ...
$ m_hp : num 147 159 195 113 154 ...

Conducting a t-test with a grouping variable

Getting started on an assignment with R, and I haven't really worked with it before, so apologies if this is basic.
brain is an excel dataframe. Its format is as follows (for an odd 40-some rows):
para1 para2 para3 para4 para5 para6 para7
FF 133 132 124 118 64.5 816932
highVAL = ifelse(brain$para2>=130,1, 0)
highVAL gives me a vector of 1's and 0's, categorized by para2.
I'm looking to perform a t-test on the mean para7 between two sets: rows that have para2 > 130 and those that have para2 < 130.
In Python, I would construct two new arrays and append values in, and perform a t-test there. Not sure how I would go about it in R.

You're closer than you think! Your highVAL variable should be added as a new column to the brain data frame:
brain$highVAL <- brain$FSIQ >= 130
This adds a true/false column to the dataset. Then you can run the test using t-test's formula interface:
result <- t.test(MRIcount ~ highVAL, data = brain)

Calculate Fisher's exact test p-value in dataframe rows

I have a list of 1700 samples in a data frame where every row represents the number of colorful items that every assistant has counted in a random number of specimens from different boxes. There are two available colors and two individuals counting the items so this could easily create a 2x2 contingency table.
df
Box-ID 1_Red 1_Blue 2_Red 2_Blue
1 1075 918 29 26
2 903 1076 135 144
I would like to know how can I treat every row as a contigency table (either vector or matrix) in order to perform a chi-square test (like Fisher's or Barnard's) and generate a sixth column with p-values.
This is what I've tried so far, but I am not sure if it's correct
df$p-value = chisq.test(t(matrix(c(df[,1:4]), nrow=2)))$p.value

I think you could do something like this
df$p_value <- apply(df,1,function(x) fisher.test(matrix(x[-1],nrow=2))$p.value)

For loop inside a for loop? in R

I am new to R and am trying create a new dataframe of bootstrapped resamples of groups of different sizes. My dataframe has 6 variables and a group designation, and there are 128 groups of different Ns. Here is an example of my data:
head(PhenoM2)
ID Name PhenoNames Group HML RML FML TML FHD BIB
1 378607 PaleoAleut PaleoAleut 1 323.5 248.75 434.50 355.75 46.84 NA
2 378664 PaleoAleut PaleoAleut 1 NA 238.50 441.50 353.00 45.83 277.0
3 378377 PaleoAleut PaleoAleut 1 309.5 227.75 419.00 332.25 46.39 284.0
4 378463 PaleoAleut PaleoAleut 1 283.5 228.75 397.75 331.00 44.37 255.5
5 378602 PaleoAleut PaleoAleut 1 279.5 230.00 393.00 329.50 45.93 265.0
6 378610 PaleoAleut PaleoAleut 1 307.5 234.25 419.50 338.50 43.98 271.5
Pulling from this question - bootstrap resampling for hierarchical/multilevel data - and taking some advice from others (thanks!) I wrote the code:
resample.M <- NULL
for(i in 1000){
groups <- unique(PhenoM2$"Group")
for(ii in 1:128)
data.i.ii <- PhenoM2[PhenoM2$"Group"==groups[ii],]
resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii),replace=T),]
}
Unfortunately, this gives me the warning:
In resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii), replace = T),:
number of items to replace is not a multiple of replacement length
Which I understand, since each of the 128 groups has a different N and none of it is a multiple of 1000. I put in resample.M[i] to try and accumulate all of the 1000x resamples of the 128 groups into a single database, and I'm pretty sure the problem is here.
Nearly all of the examples of for loops I've read create a vector database - numeric(1000) - then plug in the information, but since I'm wanting all of the data (which include factors, integers, and numerics) this doesn't work. I tried making a matrix to put the info in (there are 2187 unique individuals in the dataframe):
resample.M <- matrix(ncol=2187000,nrow=10)
But it's giving me the same warning.
So, since I'm sure I'm missing something basic here, I have three questions:
How can I get this code to resample all of the groups (with replacement and based on their individual Ns)?
How can I get this code to repeat this resampling 1000x?
How can I get the resamples of every group into the same database?
Thank you so much for your insight and expertise!

I think you may have wanted to use double square bracket, to store the results in a list, i.e. resample.M[[i]] <- .... Apart from that it makes more sense to write PhenoM2$Group than PhenoM2$"Group" and also groups <- unique(PhenoM2$Group) can go outside of your for loop since you only need to compute it once. Also replace 1:128 by 1:length(groups) or seq_along(groups), so that you don't need to hard code the length of the vector.
Because you will often need to operate on data frames grouped by some variable, I suggest you familiarise yourself with a package designed to do that, rather than using for loops, which can be very slow. The best one for a beginner in R may be plyr, which has an easy syntax (although there are many possibilities, including the slightly more "advanced" packages like dplyr and data.table).
So for a subset d <- subset(PhenoM2, Group == 1), you already have the function you need to perform on it: function(d) d[sample(1:nrow(d), replace = TRUE),].
Now to go over all such subsets, perform this operation and then arrange the results in a new data frame named samples you do
samples <- ddply(PhenoM2, .(Group),
function(d) d[sample(1:nrow(d), replace = TRUE),])
So what remains is to iterate this 1000 or however many times you want. You can use a for loop for this, storing the results in a list. Note that you need to use double square bracket [[ to set elements of the list.
n <- 1000 # number of iterations
samples <- vector("list", n) # list of length n to store results
for (i in seq_along(samples))
samples[[i]] <- ddply(PhenoM2, .(Group),
function(d) d[sample(1:nrow(d), replace = TRUE),])
An alternative way would be to use the function replicate, that performs the same task many times.
Once you have done this, all resamples will be stored in a list. I am not sure what you mean by "How can I get the resamples of every group into the same database". If you want to group them in a single data frame, you do all.samples <- do.call(rbind, samples). In general, you can format your list of samples using do.call and lapply together with a function.

Identifying duplicate columns in a dataframe

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete continuous and categorical variables.
My approach has been to generate a table for each column in the frame into a list, then use the duplicated() function to find rows in the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?

How about:
testframe[!duplicated(as.list(testframe))]

You can do with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
Not 100% but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]

A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))

unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M

It is probably best for you to first find the duplicate column names and treat them accordingly (for example summing the two, taking the mean, first, last, second, mode, etc... To find the duplicate columns:
names(df)[duplicated(names(df))]

What about just:
unique.matrix(testframe, MARGIN=2)

Actually you just would need to invert the duplicated-result in your code and could stick to using subset (which is more readable compared to bracket notation imho)
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))

Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]

If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
names(testframe2) <- gsub('.x','',names(testframe2))

Since this Q&A is a popular Google search result but the answer is a bit slow for a large matrix I propose a new version using exponential search and data.table power.
This a function I implemented in dataPreparation package.
The function
dataPreparation::which_are_bijection
which_are_in_double(testframe)
Which return 3 and 4 the columns that are duplicated in your example
Build a data set with wanted dimensions for performance tests
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12){
testframe = rbind(testframe,testframe)
}
# Result in 49152 rows
for (i in 1:5){
testframe = cbind(testframe,testframe)
}
# Result in 160 columns
The benchmark
To perform the benchmark, I use the library rbenchmark which will reproduce each computations 100 times
benchmark(
which_are_in_double(testframe, verbose=FALSE),
duplicated(lapply(testframe, summary)),
duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which are bijection 3 to 1.5 times faster than other proposed solutions.
NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.
NB 2: Please note, the way this data set is constructed we have a lot of duplicated columns which reduce the advantage of exponential search. With just a few duplicated columns, one would have much better performance for which_are_bijection and similar performances for other methods.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to Bootstrap Resample Count Data in R - r

Related

R loop to execute the same code

Conducting a t-test with a grouping variable

Calculate Fisher's exact test p-value in dataframe rows

For loop inside a for loop? in R

Identifying duplicate columns in a dataframe

Categories

Resources