Permuting data and random simulation for a chi-squared test in R

I am new to R and I am trying to compare a table of observed values with one of expected values and calculate the chi-squared statistic. As part of my assignment, I need to compare the expected table with a set of 999 tables created by randomly permuting the observed values. I need to calculate the chi-squared value for each table (nsim=999) and then plot a histogram of all the chi-squared values along with the actual chi-squared value from the observed data. Here is the data and code I am using:
> survival=table(titanic[,c("CLASS","SURVIVED")])
> survival
      SURVIVED
CLASS   no yes
  1st  122 203
  2nd  167 118
  3rd  528 178
  crew 673 212
> expected=expected(survival) #library(epitools)
> expected
      SURVIVED
CLASS        no       yes
  1st  220.0136 104.98637
  2nd  192.9350  92.06497
  3rd  477.9373 228.06270
  crew 599.1140 285.88596
> nsim=999
> random=rep(survival,nsim)
and now I am stuck!

The simplest way to generate permutations is to use the sample command on your "SURVIVED" column:
sample(titanic[,"SURVIVED"])
This will shuffle the yes/no labels for that column. You can then repeat this 999 times:
chisq_vals <- replicate(999, {
  permSurvival <- sample(titanic[,"SURVIVED"])
  # rebuild the contingency table with the shuffled labels and keep the statistic
  chisq.test(table(titanic[,"CLASS"], permSurvival))$statistic
})
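To finish the task from the question, you can then compare the permutation statistics with the observed one in a histogram. A minimal sketch, assuming titanic has one row per passenger with the CLASS and SURVIVED columns used above:
# chi-squared statistic from the observed table
obs_chisq <- chisq.test(survival)$statistic
# histogram of the 999 permutation statistics, with the observed value marked in red
hist(chisq_vals, main = "Permutation distribution", xlab = "chi-squared statistic")
abline(v = obs_chisq, col = "red", lwd = 2)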

Related

R loop to execute the same code

I'm trying to figure out how to repeat the same code 30 times without typing it out one run at a time... any help will be much appreciated.
SRS_1 <- sample(1:nrow(MyData_points), size=.10*nrow(MyData_points))
data_sample_1 <- MyData_points[SRS_1,]
fpc.srs <- rep(6399875, 639987)
design_SRS_1 <- svydesign(id=~1, strata=NULL, data=data_sample_1, fpc=fpc.srs)
ONStotal_SRS1 <- svytotal(~data_sample_1$V4, design=design_SRS_1)
ONSmean_SRS1 <- svymean(~data_sample_1$V4, design=design_SRS_1)
CI_SRS_1 <- confint(svytotal(~data_sample_1$V4, design=design_SRS_1))
The first line draws a simple random sample of 10% of the data points. The second subsets the data to that sample. The third builds the fpc vector, repeating the population size once for each sampled point (10% of the total). Then, in order to estimate the population, I set up a design for the sample without replacement, including the fpc. The last three lines calculate a population total estimate, a mean, and a confidence interval based on that sample.
What changes is that I must repeat 30 different simple random samples from the data, so the resulting estimates, means, and confidence intervals will come from 30 different samples. They might be close but not equal.
How can I improve this code so I can run it 30 times and print a table with (ONStotal_SRS1, ONSmean_SRS1, CI_SRS_1)?
Usually I would use either rbindlist from the data.table package or bind_rows from dplyr in combination with an lapply to build the table a row at a time and then bind the rows together. Here is an example using bind_rows with the mtcars data set:
library(dplyr)
combined_data <- bind_rows(lapply(1:30, function(...) {
# Take a sample
SRS_1 <- sample(1:nrow(mtcars), size = .10 * nrow(mtcars))
data_sample_1 <- mtcars[SRS_1, ]
# Compute some things from the sample
m_disp <- mean(data_sample_1$disp)
m_hp <- mean(data_sample_1$hp)
# Make a one row data.frame that will be returned by the function
data.frame(m_disp, m_hp)
}))
This gives the following data.frame:
> str(combined_data)
'data.frame': 30 obs. of 2 variables:
$ m_disp: num 235 272 410 115 249 ...
$ m_hp : num 147 159 195 113 154 ...
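Applying the same pattern to the survey code from the question might look like the sketch below. It assumes MyData_points, V4, and the population size from the question, so treat it as a starting point rather than tested code:
library(dplyr)
library(survey)
combined_data <- bind_rows(lapply(1:30, function(i) {
  # Take a 10% simple random sample
  idx <- sample(1:nrow(MyData_points), size = .10 * nrow(MyData_points))
  data_sample <- MyData_points[idx, ]
  # Design without replacement, with the fpc, as in the question
  design <- svydesign(id = ~1, strata = NULL, data = data_sample,
                      fpc = rep(6399875, nrow(data_sample)))
  total <- svytotal(~V4, design = design)
  ci <- confint(total)
  # One row per replicate
  data.frame(run = i,
             total = coef(total),
             mean = coef(svymean(~V4, design = design)),
             ci_lower = ci[1], ci_upper = ci[2])
}))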

Sorting data by type in R

I am struggling to write a function for a dataset that looks like this:
identifier age occupation
pers1 18 student
pers2 45 teacher
pers3 65 retired
What I am trying to do, is to write a function that will:
1. sort my variables into numerical vs. factor variables
2. for the numerical variables, give me the mean, min and max
3. for the factor variables, give me a frequency table
4. return points (2) and (3) in a "nice" format (data frame, vector or table)
So far, I have tried this:
describe <- function(x) {
  if (is.numeric(x)) {
    data.frame(mean = mean(x), min = min(x), max = max(x))
  } else {
    table(x)
  }
}
stats <- lapply(data, describe)
Problems:
My problem is that now, "stats" is a list that is difficult to read and to export to Excel or share. I don't know how to make the list "stats" more reader-friendly.
Alternatively, is there maybe a better way to build the function "describe"?
Any thoughts on how to solve these two problems are much appreciated!
I may be late to the party, but maybe you still need a solution. I combined the answers from some of the comments on your post into the following code. It assumes you only have numerical columns and factors, and it scales to a large number of columns, as you specified:
# Just some sample data for my example, you don't need ggplot2.
library(ggplot2)
data=diamonds
# Find which columns are numeric, and which are not. Using is.numeric
# also catches integer columns and avoids problems with columns whose
# class vector has more than one element (e.g. ordered factors).
numeric = which(sapply(data, is.numeric))
non_numeric = which(!sapply(data, is.numeric))
# create the summary objects
summ_numeric = summary(data[,numeric])
summ_non_numeric = summary(data[,non_numeric])
# result is easily written to csv
write.csv(summ_non_numeric,file="test.csv")
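The numeric summary can be exported the same way:
write.csv(summ_numeric, file="test_numeric.csv")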
Hope this helps.
The desired functionality is already available elsewhere, so if you are not interested in coding it yourself, you can use this instead. The Publish package can generate a table for presentation in a paper. It is not on CRAN, but you can install it from GitHub:
devtools::install_github('tagteam/Publish')
library(Publish)
library(isdals) # Get some data
data(fev)
fev$Smoke <- factor(fev$Smoke, levels=0:1, labels=c("No", "Yes"))
fev$Gender <- factor(fev$Gender, levels=0:1, labels=c("Girl", "Boy"))
The univariateTable function can generate a publication-ready table presenting the data. By default, univariateTable computes the mean and standard deviation for numeric variables, and the distribution of observations across categories for factors. These values can be computed and compared across groups. The main input to univariateTable is a formula whose right-hand side lists the variables to be included in the table, while the left-hand side, if present, specifies a grouping variable.
univariateTable(Smoke ~ Age + Ht + FEV + Gender, data=fev)
This produces the following output:
Variable Level No (n=589) Yes (n=65) Total (n=654) p-value
1 Age mean (sd) 9.5 (2.7) 13.5 (2.3) 9.9 (3.0) <1e-04
2 Ht mean (sd) 60.6 (5.7) 66.0 (3.2) 61.1 (5.7) <1e-04
3 FEV mean (sd) 2.6 (0.9) 3.3 (0.7) 2.6 (0.9) <1e-04
4 Gender Girl 279 (47.4) 39 (60.0) 318 (48.6)
5 Boy 310 (52.6) 26 (40.0) 336 (51.4) 0.0714

Conducting a t-test with a grouping variable

Getting started on an assignment with R, and I haven't really worked with it before, so apologies if this is basic.
brain is a data frame imported from Excel. Its format is as follows (for some 40-odd rows):
para1 para2 para3 para4 para5 para6 para7
FF 133 132 124 118 64.5 816932
highVAL = ifelse(brain$para2>=130,1, 0)
highVAL gives me a vector of 1's and 0's, categorized by para2.
I'm looking to perform a t-test on the mean of para7 between two groups: rows with para2 >= 130 and those with para2 < 130.
In Python, I would construct two new arrays, append values to them, and perform the t-test there. I'm not sure how I would go about it in R.
You're closer than you think! Your highVAL variable should be added as a new column of the brain data frame:
brain$highVAL <- brain$para2 >= 130
This adds a TRUE/FALSE column to the dataset. Then you can run the test using t.test's formula interface:
result <- t.test(para7 ~ highVAL, data = brain)
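Equivalently, mirroring the two-array approach from Python, you can pass the two groups to t.test directly:
# same test, selecting the two groups explicitly
result <- t.test(brain$para7[brain$para2 >= 130],
                 brain$para7[brain$para2 < 130])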

Calculate Fisher's exact test p-value in dataframe rows

I have 1700 samples in a data frame where every row holds the numbers of colored items that each of two assistants counted in a random number of specimens from different boxes. There are two available colors and two individuals counting the items, so each row could easily form a 2x2 contingency table.
df
Box-ID 1_Red 1_Blue 2_Red 2_Blue
1 1075 918 29 26
2 903 1076 135 144
I would like to know how I can treat every row as a contingency table (either a vector or a matrix) in order to perform an independence test (such as Fisher's or Barnard's) and generate a sixth column with p-values.
This is what I've tried so far, but I am not sure it's correct:
df$p-value = chisq.test(t(matrix(c(df[,1:4]), nrow=2)))$p.value
I think you could do something like this:
df$p_value <- apply(df,1,function(x) fisher.test(matrix(x[-1],nrow=2))$p.value)
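For example, with the two rows from the question reconstructed as a data frame (the column names here are hypothetical, since names starting with digits such as 1_Red are awkward in R):
# hypothetical reconstruction of the example data
df <- data.frame(Box_ID = 1:2,
                 Red_1 = c(1075, 903), Blue_1 = c(918, 1076),
                 Red_2 = c(29, 135), Blue_2 = c(26, 144))
# drop the ID column, fold each row into a 2x2 table, keep the p-value
df$p_value <- apply(df[,-1], 1,
                    function(x) fisher.test(matrix(x, nrow=2))$p.value)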

How to Bootstrap Resample Count Data in R

I have a vector of counts which I want to resample with replacement in R:
X350277 128
X193233 301
X514940 3715
X535375 760
X953855 50
X357046 236
X196664 460
X589071 898
X583656 670
X583117 1614
(Note the second column is counts, the first column is the object the counts represent)
From reading various documentation it seems easy to resample data where each row or column represents a single observation. But how do I do this when each row represents multiple observations summed together (as in a table of counts)?
You can use weighted sampling, as user20650 also mentioned in the comments (this assumes your counts are in a data frame dat with a count column):
sample_weights <- dat$count/sum(dat$count)
mysample <- dat[sample(1:nrow(dat),1000,replace=T,prob=sample_weights),]
A less efficient approach, which might have its uses depending on what you want to do, is to expand your data back to 'long' format, one row per observation:
dat_large <- dat[rep(1:nrow(dat),dat$count),]
#then sampling is easy
mysample <- dat_large[sample(1:nrow(dat_large),1000,replace=T),]
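If what you want is a bootstrap replicate of the count vector itself rather than sampled rows, a single multinomial draw achieves the same thing. A sketch, again assuming the counts are in dat$count:
# one bootstrap replicate of the counts: redistribute the total count
# across objects with probabilities proportional to the observed counts
boot_counts <- rmultinom(1, size = sum(dat$count), prob = dat$count)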
