Calculate Fisher's exact test p-value in dataframe rows - r

I have 1700 samples in a data frame, where every row holds the number of colored items that each assistant counted in a random number of specimens from different boxes. There are two available colors and two individuals counting the items, so every row can easily be turned into a 2x2 contingency table.
df
Box-ID 1_Red 1_Blue 2_Red 2_Blue
1 1075 918 29 26
2 903 1076 135 144
I would like to know how I can treat every row as a contingency table (either a vector or a matrix) in order to perform a test of independence (such as the chi-square, Fisher's exact, or Barnard's test) and generate a sixth column with the p-values.
This is what I've tried so far, but I am not sure if it's correct
df$p-value = chisq.test(t(matrix(c(df[,1:4]), nrow=2)))$p.value

I think you could do something like this:
df$p_value <- apply(df,1,function(x) fisher.test(matrix(x[-1],nrow=2))$p.value)
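For reference, here is a minimal self-contained sketch of that approach using the two example rows from the question. The column names (Red_1, Blue_1, ...) are placeholders for the 1_Red-style names above, which are not syntactically valid R names without backticks:
df <- data.frame(Box_ID = c(1, 2),
                 Red_1  = c(1075, 903),
                 Blue_1 = c(918, 1076),
                 Red_2  = c(29, 135),
                 Blue_2 = c(26, 144))
# For each row: drop the ID, fold the four counts into a 2x2 table
# (rows = colour, columns = counter) and keep the Fisher exact p-value
df$p_value <- apply(df[, -1], 1, function(x)
  fisher.test(matrix(x, nrow = 2))$p.value)
df
Dropping the ID column before apply() keeps it from being folded into the 2x2 table.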

Related

Explanation for aggregate and cbind function

First, I can't understand the aggregate and cbind functions; I need an explanation in really simple words. Second, I have this data:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
I need to process this with:
data_processed2 <- aggregate(cbind(return)~permno, Data_summary, median)
I can't understand this command; please explain it to me very simply. Thank you!
cbind takes two or more tables (dataframes), puts them side by side, and then makes them into one big table. So for example, if you have one table with columns A, B and C, and another with columns D and E, after you cbind them, you'll have one table with five columns: A, B, C, D and E. For the rows, cbind assumes all tables are in the same order.
As noted by Rui, in your example cbind doesn't do anything, because return is not a table, and even if it were, it's only a single variable.
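A tiny made-up illustration of the column-binding described above:
abc <- data.frame(A = 1:3, B = 4:6, C = 7:9)
de  <- data.frame(D = 10:12, E = 13:15)
cbind(abc, de)   # one table with five columns: A, B, C, D, E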
aggregate takes a table, divides it by some variable, and then calculates a statistic on a variable within each group. For example, if I have data for sales by month and day of month, I can aggregate by month, and calculate the average sales per day for each of the months.
The command you provided uses the following syntax:
aggregate(VARIABLES~GROUPING, DATA, FUNCTION)
Variables (cbind(return), which doesn't really make sense here) is the list of all the variables for which your statistic will be calculated.
Grouping (permno) is the variable by which you will break the data into groups (in the sample data you provided, each row has a unique value for this variable, so that doesn't really make sense either).
Data is the dataframe you're using.
Function is median.
So this call will break Data_summary into groups that have the same permno, and calculate the median for each of the columns.
With the data you provided, you'll basically get the same table back, since you're grouping the data into groups of one row each. Actually, since your variable list is effectively empty, as far as I can tell, you'll get nothing back.
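As a concrete illustration of that syntax, here is a sketch using the built-in mtcars data rather than your data (your permno groups contain a single row each, so they would not show the grouping):
# median mpg and hp within each group defined by cyl
aggregate(cbind(mpg, hp) ~ cyl, mtcars, median)
#   cyl  mpg    hp
# 1   4 26.0  91.0
# 2   6 19.7 110.0
# 3   8 15.2 192.5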

How to Bootstrap Resample Count Data in R

I have a vector of counts which I want to resample with replacement in R:
X350277 128
X193233 301
X514940 3715
X535375 760
X953855 50
X357046 236
X196664 460
X589071 898
X583656 670
X583117 1614
(Note the second column is counts, the first column is the object the counts represent)
From reading various documentation it seems easy to resample data where each row or column represents a single observation. But how do I do this when each row represents multiple observations summed together (as in a table of counts)?
You can use weighted sampling (as user20650 also mentioned in the comments):
sample_weights <- dat$count/sum(dat$count)  # probability of drawing each row, proportional to its count
mysample <- dat[sample(1:nrow(dat),1000,replace=T,prob=sample_weights),]  # draw 1000 rows with replacement
A less efficient approach, which might have its uses depending on what you want to do, is to turn your data back into 'long' format:
dat_large <- dat[rep(1:nrow(dat),dat$count),]
#then sampling is easy
mysample <- dat_large[sample(1:nrow(dat_large),1000,replace=T),]
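Putting the weighted-sampling approach together with the counts from the question (a sketch; the column names id and count are assumptions):
dat <- data.frame(id    = c("X350277", "X193233", "X514940", "X535375", "X953855",
                            "X357046", "X196664", "X589071", "X583656", "X583117"),
                  count = c(128, 301, 3715, 760, 50, 236, 460, 898, 670, 1614))
# resample 1000 observations with replacement, weighting each id by its count
sample_weights <- dat$count / sum(dat$count)
mysample <- dat[sample(1:nrow(dat), 1000, replace = TRUE, prob = sample_weights), ]
# collapse the resample back into a table of counts per id
table(mysample$id)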

omitting certain data in R to maintain overall data integrity

I have a function that returns 50 data values, in a one-column matrix, for each of 100 different data frames. However, due to circumstances the function sometimes returns a NaN in one or more of the 50 values for a data frame. This perturbs the data, as a data frame with one or more NaN values is then considered to have only 49 or 48 usable values.
df1 df2
112.4563 112.4563
110.1210 110.1210
109.2143 109.2143
NaN 108.1806 <- now uneven and can not perform iterations
107.3700 107.3700
How can I tell my computer/subsequent commands, when iterating through these 100 data frames of 50 rows each, to "ignore" the NaN values in such a way that each of the 100 still has 50 values and is consistently iterable? Or is it even possible to have a varying iteration range, something like for(i in 1:(47-50)), so that the computer forgives the variance in row numbers?
This is also with respect to graphs.
As someone else has noted, it can also depend on what you want to do with the NaN value. However, to answer the question about an iterative range, you can do something like the following. I'll be using the dataframe mtcars as an example.
df = mtcars
length(df$mpg)       # number of values in one column (32)
length(rownames(df)) # number of rows (32)
length(colnames(df)) # number of columns (11)
If you need to iterate over the total number of rows in your data frame, you can use length(rownames(df)). If you need to iterate over the number of columns instead, you can use length(colnames(df)).
In a for loop, you would do the following:
for (i in 1:length(rownames(df))) {
  # iterative code
}
This will iterate over the total number of rows in a given data frame.
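If the goal is simply to forgive the NaN entries while iterating, one possibility is to test each value with is.nan() inside the loop, or to drop the NaN rows up front. This is only a sketch, assuming the 50 values sit in a single column called values:
# example frame with one NaN, mimicking df1 from the question
df1 <- data.frame(values = c(112.4563, 110.1210, 109.2143, NaN, 107.3700))
# Option 1: keep every row, but skip the NaN ones inside the loop
for (i in 1:length(rownames(df1))) {
  if (is.nan(df1$values[i])) next  # forgive the missing value and move on
  # iterative code using df1$values[i]
}
# Option 2: drop the NaN rows before iterating (the frame then has fewer rows)
df1_clean <- df1[!is.nan(df1$values), , drop = FALSE]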

Finding aggregate correlation of multiple columns against one column in r

I have a data frame with 11 columns out of which 9 are numeric. I am trying to find out the correlation of 8 columns together against the remaining column i.e., correlation of 8 variables with 1 variable which should generate one value of correlation instead of generating 9 different values in a matrix.
Is it possible? Or do I need to calculate the average correlation after calculating the individual correlations? E.g., I am trying to find the correlation of X, Y, Z to A. Using the mentioned methods I get a matrix which gives me an individual score of association for X, Y and Z with A, whereas I need one score which takes all three of X, Y and Z into account.
A simulated df is presented below for illustration purposes
x y z a
1 1.72480753 0.007053053 0.32435032 10
2 0.97227885 -0.844118498 -0.75534119 20
3 -0.53844294 -0.036178789 0.89396765 30
4 1.34695331 0.870119744 0.99400826 40
5 0.02336335 0.514481676 0.95894286 50
6 -0.15239307 0.386061290 0.73541287 60
7 -0.29878116 1.615012645 -0.04416341 70
8 -1.10907706 -1.581093487 -0.93293702 80
9 2.73021114 -0.130141775 1.85304372 90
10 0.22417487 1.170900385 -0.68312974 100
I can do correlation of each row and variable with a but what I want is correlation of x,y,z combined with a
corr.test(df[,1:3],df[,4])
I will appreciate any help towards this problem.
Regards,
Pearson correlation is defined to be a number relating one sequence (or vector) of values to another (look it up). As far as I know there is no roughly equivalent definition relating a group of vectors to another vector, but you could do something like take the average vector (of the 3 vectors) and correlate a to that.
To me at least that has a more immediate geometric meaning than taking the average of the 3 correlation values.
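A one-line sketch of that idea with the simulated data above: correlate a with the row-wise mean of x, y and z to get a single association score.
cor(rowMeans(df[, c("x", "y", "z")]), df$a)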
If you want to compute the correlation of each variable with a, you could do something like:
head(cor(df)[,"a"], -1)
# x y z
# -0.14301569 0.19188340 -0.06561505
You said you wanted to combine these values by averaging, so I suppose you could just take the mean of that:
mean(head(cor(df)[,"a"], -1))
# [1] -0.005582445

permuting data and random simulation for chisq test on R

I am new to R and I am trying to compare a table of observed values with one of expected values and calculate chisq. As a part of my assignment, I need to compare the expected values table with a set of 999 tables that I created using random permutations from the observed values. I need to calculate the chisq value for each table (nsim=999) and then plot a histogram of all chisq values along with the actual chisq from observed data. Here is the data and codes I am using:
> survival=table(titanic[,c("CLASS","SURVIVED")])
> survival
SURVIVED
CLASS no yes
1st 122 203
2nd 167 118
3rd 528 178
crew 673 212
> expected=expected(survival) #library(epitools)
> expected
SURVIVED
CLASS no yes
1st 220.0136 104.98637
2nd 192.9350 92.06497
3rd 477.9373 228.06270
crew 599.1140 285.88596
> nsim=999
> random=rep(survival,nsim)
and now I am stuck!
The simplest way to generate permutations is to use the sample command on your "SURVIVED" column:
sample(titanic[,"SURVIVED"])
This will shuffle the yes/no labels for that column; then you can repeat this 999 times:
replicate(999, {
  permSurvival <- sample(titanic[,"SURVIVED"])
  # code to compute the chi-square statistic goes here
})
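Here is a sketch of the full simulation, assuming titanic has one row per passenger with CLASS and SURVIVED columns (as in the table() call above); the object names are placeholders:
# observed chi-square statistic from the real table
obs_chisq <- chisq.test(table(titanic$CLASS, titanic$SURVIVED))$statistic
# 999 permutation replicates: shuffle the SURVIVED labels, rebuild the table,
# and keep the chi-square statistic each time
perm_chisq <- replicate(999, {
  permSurvived <- sample(titanic$SURVIVED)
  chisq.test(table(titanic$CLASS, permSurvived))$statistic
})
# histogram of the permutation distribution with the observed value marked
hist(perm_chisq, xlim = range(c(perm_chisq, obs_chisq)),
     main = "Permutation chi-square values", xlab = "Chi-square statistic")
abline(v = obs_chisq, col = "red", lwd = 2)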
