I have a dataset with two groups and replicates. The data are split by group, so I have 24 samples in group 1 and 20 samples in group 2. Each set has 4 replicates, hence I have 6 sets in group 1 and 5 sets in group 2. I have assigned indices (1-11) to the sets to make the permutation easier. What I want to do now is a routine permutation analysis to obtain the test statistic. I am using a non-parametric method with resampling with replacement.
I am trying to permute the group labels. My null hypothesis is that there is no difference between the mean values of the two groups. My problem in R is that I have to pool the data together and then resample the group labels. When I do this, I have to make sure I maintain the sample size of each group (that is, after resampling the group labels, my new dataset should still contain 6 sets (24 samples) in group 1 and 5 sets (20 samples) in group 2). I am unable to achieve the latter.
How can I achieve this in R?
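In case it helps, here is a minimal sketch of one way to do a single permutation step, assuming a standard label permutation (set labels shuffled without replacement) and a data frame dat with hypothetical columns set (the set index, 1-11) and value (the measurement):

set.seed(1)
n_perm  <- 10000
set_ids <- 1:11                         # 11 sets of 4 replicates each

perm_stats <- replicate(n_perm, {
  shuffled <- sample(set_ids)           # permute the set labels
  g1_sets  <- shuffled[1:6]             # first 6 sets -> "group 1" (24 samples)
  # difference in group means under the permuted labels
  mean(dat$value[dat$set %in% g1_sets]) -
    mean(dat$value[!dat$set %in% g1_sets])
})

Because whole sets are reassigned, every permuted dataset keeps 6 sets (24 samples) in one group and 5 sets (20 samples) in the other.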
Related
I want to run a Chi-Square test to see whether there is a significant association between Status (a variable with 3 categories) and Year (a variable with 4 categories).
You can see the data I have in the picture. Right now the years are in separate columns and the numbers are person counts.
How can I generate a table from this one that allows me to run chisq.test() in RStudio?
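Since the picture is not reproduced here, a rough sketch with made-up Status levels and year columns (names and numbers are assumptions) of how such a wide table can be turned into the count matrix chisq.test() expects:

# hypothetical data frame standing in for the table in the picture
df <- data.frame(
  Status = c("A", "B", "C"),
  Y2018  = c(10, 20, 30),
  Y2019  = c(15, 25, 35),
  Y2020  = c(12, 22, 32),
  Y2021  = c(18, 28, 38)
)

tab <- as.matrix(df[, -1])   # drop the Status column, keep the counts
rownames(tab) <- df$Status   # rows = Status, columns = Year
chisq.test(tab)              # chisq.test() accepts a matrix of counts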
I need to conduct a cluster mean comparison using the "compareGroups" function. However, my R output says that the number of groups must be less than or equal to 5.
How can I increase the allowed number of groups to 6, since I have six clusters to compare?
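If the limit comes from the max.ylev argument of compareGroups::compareGroups(), which defaults to 5 and caps how many levels the grouping variable may have, raising it should allow six clusters. A sketch (the formula and data name are hypothetical):

library(compareGroups)

# max.ylev raises the cap on the number of levels of the grouping variable
res <- compareGroups(cluster ~ ., data = mydata, max.ylev = 6)
createTable(res)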
This is a small sample of my data:
The top row contains codes that denote different tree species (e.g. PJ = jack pine). The numerical values are counts of each species within survey plots. Each case represents a stand that was assessed once by a forest manager (B) and then audited with a plot-based survey by the regulating government agency (A). I want to use chisq.test in R to determine the probability that the two samples were taken from the same population, essentially comparing the results from source A to those from source B. For each case, I want to convert the 0 values to NA wherever both the A and B sources have a 0; otherwise I am unnecessarily inflating the degrees of freedom for the test. I am very new to R.
I want to do a chi squared test across many cases, and the entire data set contains up to 15 species. In most cases there are 2-7 species to deal with.
Thanks for your help
You can use tidyverse functions and try something like:
library(dplyr)
df %>%
  group_by(Case) %>%
  # within each Case, sum(PJ) is 0 only when both sources have a 0
  mutate(PJ = if_else(sum(PJ) == 0, NA, PJ))
What this does: if the measure is 0 for both sources, the group sum is 0, and the value is replaced with NA for that group. It eliminates the need to convert the data to wide format.
Also, possibly look at mutate_at (or across() in current dplyr) to mutate multiple columns at the same time, as sketched below.
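A sketch of that multi-column version using across(); the PJ:PB column range is just a placeholder for whatever species columns you have:

library(dplyr)

df %>%
  group_by(Case) %>%
  # apply the same rule to every species column at once
  mutate(across(PJ:PB, ~ if_else(sum(.x) == 0, NA, .x))) %>%
  ungroup()

(On older dplyr versions you may need a typed NA such as NA_real_ inside if_else().)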
I have a data matrix in R with 45 rows. Each row represents the value of an individual sample. I need to do a trial simulation: I want to pair up samples randomly and calculate their differences. I want a large sample (maybe 10,000) from all the possible permutations and combinations.
This is how I have managed to do it so far:
My data matrix ("data") has 45 rows and 2 columns. I selected 45 rows randomly and subtracted them from another 45 randomly selected rows.
n1 <- data[sample(nrow(data), size = 45, replace = FALSE), ] -
  data[sample(nrow(data), size = 45, replace = FALSE), ]
This gave me a random set of differences of size 45.
I made 50 such vectors (n1 to n50) and did rbind, which gave me a big data matrix containing random differences.
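For reference, a compact way to write that same step (a sketch; data is the 45 x 2 matrix from above, and the stacked result is called new, the name used in the next step):

set.seed(1)

diff_once <- function() {
  data[sample(nrow(data), 45, replace = FALSE), ] -
    data[sample(nrow(data), 45, replace = FALSE), ]
}

# 50 random difference sets stacked into one matrix
new <- do.call(rbind, replicate(50, diff_once(), simplify = FALSE))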
Of course, some rows from the first random set matched rows from the second random set and cancelled out to zero. I removed those as follows:
row_sub <- apply(new, 1, function(row) all(row != 0))
new.remove.zero <- new[row_sub, ]
BUT, is there a cleaner way to do this? A simpler way to generate random pairs of rows, calculate their differences, and bind them together into a new matrix?
Thanks in advance.
I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points according to the existing probabilities/distribution. R has a limit on the number of values that can be loaded in a vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: The data is just integers. And many of them are repeated multiple times.
large.vec <- c(1, 2, 3, 4, 1, 1, 8, 7, 4, 1, ..., 216280)
To spread the 500k samples across the distribution, I will first create the probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, I convert these probabilities to positions in the original sequence.
position.vec <- prob.vec*11034432564
The reason I created the position vector is so that I can pick the data point at that specific position after I order the population data.
Now I count the occurrences of each integer value in the population and create a data frame with the integer values and their counts. I also create the interval for each of these values:
integer.values        counts   lw.interval   up.interval
             0   300,000,034             0   300,000,034
             1   169,345,364   300,000,034   469,345,398
             2   450,555,321   469,345,399   919,900,719
           ...
Now, using the position vector, I identify which interval each position falls into and, based on that, take the integer value of that interval.
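A sketch of that interval lookup, assuming the counts have already been summarised into a data frame (column names follow the table above; the three rows shown are just the ones from the question, and findInterval() stands in for the explicit loop over intervals):

counts.df <- data.frame(
  integer.values = c(0, 1, 2),
  counts         = c(300000034, 169345364, 450555321)
)

N <- sum(counts.df$counts)                      # population size covered
prob.vec     <- seq(0, 1, length.out = 500000)
position.vec <- ceiling(prob.vec * N)
position.vec[position.vec == 0] <- 1            # guard the 0% quantile

upper <- cumsum(counts.df$counts)               # upper bound of each interval

# map each position to the interval it falls in, then read off the value
idx        <- findInterval(position.vec, upper, left.open = TRUE) + 1
sample.vec <- counts.df$integer.values[idx]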
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference: Calculate quantiles for large data.
I wanted to know if there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population.
This process takes a considerable amount of time, as the position vector has to go through all possible intervals in the data frame. To speed it up I have parallelised it using RHIPE.
I understand that I will be able to do this only because the data can be ordered.
I am not trying to randomly sample here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly reducing 11 billion values to 500k.