Calculating significance of two variables in a dataset for every column - r

I would like to create a column of p-values to check the significance of each variable for every observation. Specifically, I want the p-values for the two columns on the right side of the data set. I think the most efficient way is to run a t.test for every column, but I don't know how to do so.
This is what I tried, but it didn't give me the significance of every column.
t.test(Elasapp,Elashuis,var.equal=TRUE)
Results:
Two Sample t-test
data: Elasapp and Elashuis
t = 41.674, df = 48860, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.07461778 0.08198304
sample estimates:
mean of x mean of y
0.085672044 0.007371636
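One way to get a p-value per column is to apply the test to each column in turn. A minimal sketch, with illustrative stand-in data since the original data set is not shown (only the column names Elasapp and Elashuis come from the question):

```r
set.seed(1)
# Stand-in data frame; in the real case, replace with your own data set.
df <- data.frame(Elasapp  = rnorm(100, mean = 0.086),
                 Elashuis = rnorm(100, mean = 0.007),
                 other    = rnorm(100))
# One t-test per column against Elashuis, collecting the p-values:
pvals <- sapply(df, function(col) t.test(col, df$Elashuis, var.equal = TRUE)$p.value)
round(pvals, 4)
```

Note that the Elashuis-vs-itself entry is 1 by construction (the mean difference is exactly zero), which is a quick sanity check that the loop works.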

Related

R is making wrong contingency tables for me

I am creating a contingency table for Fisher's Exact test from my 'covid' dataframe using the table function. The table is the following
table1 <- matrix(c(117,390,861,669),ncol=2,byrow=TRUE)
colnames(table1) <- c("No","Yes")
rownames(table1) <- c("Hospital","House")
table1 <- as.table(table1)
table1
The code I wrote to make the table was following:
table1 = table(covid$Treatment_place, covid$Cognitive_dysfunction)
table1
            No       Yes
Hospital  (a) 117  (b) 390
House     (c) 861  (d) 669
However, my suspicion is that R is not counting my positive and negative cases correctly. In this case, cell b and c would be positive cases (I have manually labelled the cells for your understanding), but R is counting them as negative and therefore giving a flawed analysis. The following is the odds ratio given by R, which is possible only if R counts cell a and d as positive cases (which should be negative cases).
fisher.test(table1)
Fisher's Exact Test for Count Data
data: table1
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.1836178 0.2947950
sample estimates:
odds ratio
0.2332679
Is there any way I can specify which cells are to be counted as positives? I think interchanging my 'Yes' and 'No' columns would give me a correct result too, but I don't know how.
Also, is there any way I can transfer the original column names over as table labels?
Thanks.
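One way to flip which column is treated first is to reorder the factor levels before building the table; the dnn argument of table() carries the original column names across as labels. A sketch with counts reconstructed from the table above (the covid data frame itself is not available, so it is simulated here):

```r
# Illustrative reconstruction of the data: 117+390 hospital cases,
# 861+669 house cases, matching the table in the question.
covid <- data.frame(
  Treatment_place       = c(rep("Hospital", 507), rep("House", 1530)),
  Cognitive_dysfunction = c(rep("No", 117), rep("Yes", 390),
                            rep("No", 861), rep("Yes", 669))
)
# Put "Yes" first so it is counted as the positive outcome:
covid$Cognitive_dysfunction <- factor(covid$Cognitive_dysfunction,
                                      levels = c("Yes", "No"))
# dnn keeps the original variable names as table labels:
table1 <- table(covid$Treatment_place, covid$Cognitive_dysfunction,
                dnn = c("Treatment_place", "Cognitive_dysfunction"))
table1
fisher.test(table1)
```

With "Yes" as the first column the reported odds ratio is the reciprocal of the original one, i.e. it now describes the odds of a positive outcome.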

How to calculate p-values for each feature in R using two sample t-test

I have two data frames, cases and controls, and I performed a two sample t-test as shown below. But I am doing feature extraction from a feature set of 1299 features/columns, so I want to calculate a p-value for each feature. Based on the p-value generated for each feature I want to reject or accept the null hypothesis.
Can anyone explain to me how the below output is interpreted and how to calculate the p-values for each feature?
t.test(New_data_zero,New_data_one)
Welch Two Sample t-test
data: New_data_zero_pca and New_data_one_pca
t = -29.086, df = 182840000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02499162 -0.02183612
sample estimates:
mean of x mean of y
0.04553462 0.06894849
Look at ?t.test. x and y are supposed to be vectors, not matrices, so the function is automatically converting them to vectors. What you want to do, assuming that columns are features and the two matrices have the same features, is:
pvals <- vector()
for (i in seq(ncol(New_data_zero))) {
  pvals[i] <- t.test(New_data_zero[, i], New_data_one[, i])$p.value
}
Then you can look at pvals (probably on a log scale), after correcting for multiple hypothesis testing (see ?p.adjust).
Let's also address the enormously bad idea of this approach to finding differences among your features. Even if all of the effects across these 1299 features are literally zero, you will find "significant" results in about 5% of all 1299 comparisons at alpha = 0.05, which makes this strategy effectively meaningless. I would strongly suggest taking a look at an introductory statistics text, especially the section on family-wise type I error rates, before proceeding.
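The correction mentioned above can be applied directly to the vector of p-values with p.adjust. A self-contained sketch with simulated stand-in matrices, since the real New_data_zero/New_data_one are not available (10 features here rather than 1299):

```r
set.seed(42)
# Stand-in data: 50 observations x 10 features per group, with a shifted mean.
New_data_zero <- matrix(rnorm(50 * 10), ncol = 10)
New_data_one  <- matrix(rnorm(50 * 10, mean = 0.5), ncol = 10)
# One Welch t-test per feature (column):
pvals <- sapply(seq_len(ncol(New_data_zero)), function(i)
  t.test(New_data_zero[, i], New_data_one[, i])$p.value)
# Benjamini-Hochberg false discovery rate correction:
padj <- p.adjust(pvals, method = "BH")
sum(padj < 0.05)   # number of features surviving correction
```

The adjusted p-values are never smaller than the raw ones, so fewer features survive after correction, which is the point of the family-wise warning above.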

How do extract the p value from a binomial test

I have the result of a binomial test and it looks like this:
data: x and n
number of successes = 0, number of trials = 7, p-value = 0.01563
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.0000000 0.4096164
sample estimates:
probability of success
0
All I would like to know is how to extract just the p-value in R. I tried grep and pmatch but they appear to require a table or vector.
You only need to do:
binom.test(3,15)$p.value
Take a look at str(binom.test(3,15)) to see the other results from the binomial test. Here 3 and 15 are chosen arbitrarily; any numbers work so long as the number of successes is no larger than the number of trials.
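The same pattern works for any test object of class "htest" (t.test, binom.test, fisher.test, and so on): the result is a list, and each piece can be pulled out by name rather than parsed from the printed output. For example, using the same arbitrary 3-successes-in-15-trials call:

```r
res <- binom.test(3, 15)   # store the test result instead of printing it
res$p.value                # just the p-value
res$conf.int               # the confidence interval
res$estimate               # estimated probability of success, 3/15 = 0.2
names(res)                 # all components available by name
```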

Interpreting var.test results in R

I am trying to learn the F test, and on performing the built-in var.test() in R I obtained the following result:
var.test(gardenB,gardenC)
F test to compare two variances
data: gardenB and gardenC
F = 0.09375, num df = 9, denom df = 9, p-value = 0.001624
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.02328617 0.37743695
sample estimates:
ratio of variances
0.09375
I understand that based on the p-value, I should reject the null hypothesis.
However, I am unable to understand the meaning conveyed by the 95 percent confidence interval.
I tried reading through the explanation provided for the queries:
https://stats.stackexchange.com/questions/31454/how-to-interpret-the-confidence-interval-of-a-variance-f-test-using-r
But I am still unable to understand the meaning conveyed by the confidence interval. Any help would be really appreciated.
Sorry, I know this is an old post but it showed up as the second result on google so I will try to answer the question still.
The confidence interval is for the RATIO of the two variances.
For example, if the variances are equal, i.e. var1 = var2, the ratio var1/var2 would be 1.
var.test() is usually used to test whether the variances are equal. If 1 is not in the 95% confidence interval, it is safe to assume that the variances are not equal and thus to reject the null hypothesis.
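To make that concrete, here is a small simulation (gardenB and gardenC are stand-ins with deliberately unequal spread, since the original data are not shown): if 1 falls outside the reported interval, the equal-variance null is rejected.

```r
set.seed(7)
gardenB <- rnorm(10, mean = 5, sd = 1)   # illustrative stand-in data
gardenC <- rnorm(10, mean = 5, sd = 3)
res <- var.test(gardenB, gardenC)
ci <- res$conf.int                 # 95% CI for the ratio var(B)/var(C)
ci
1 < ci[1] || 1 > ci[2]             # TRUE means: reject equal variances
```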

Sum of N independent standard normal variables

I wanted to simulate sum of N independent standard normal variables.
sums <- numeric(5000)
for (i in 1:5000) {
  sums[i] <- sum(rnorm(5000, 0, 1))
}
I tried to draw N=5000 standard normal and sum them. Repeat for 5000 simulation paths.
I would expect the expectation of sums be 0, and variance of sums be 5000.
> mean(sums)
[1] 0.4260789
> var(sums)
[1] 5032.494
The simulated expectation is too big. When I tried it again, I got 1.309206 for the mean.
@ilir is correct: the value you get is essentially zero.
If you look at the plot, you get values between -200 and 200. 0.42 is for all intents and purposes 0.
You can test this with t.test.
> t.test(sums, mu = 0)
One Sample t-test
data: sums
t = -1.1869, df = 4999, p-value = 0.2353
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-3.167856 0.778563
sample estimates:
mean of x
-1.194646
There is no evidence that your mean value differs from zero (given that the null hypothesis is true).
This is just plain normal that the mean does not fall exactly on 0, because it is an empirical mean computed from "only" 5000 realizations of the random variable.
However, the distribution of the realizations contained in the sums vector should "look" Gaussian.
For example, when I plot the histogram and the Q-Q plot of 10000 realizations of the sum of 5000 Gaussian variables (created in this way: sums <- replicate(1e4, sum(rnorm(5000, 0, 1)))), it looks normal, as you can see in the following figures:
hist(sums)
qqnorm(sums)
A sum of independent normals is again normal, with mean the sum of the means and variance the sum of the variances. So sum(rnorm(5000,0,1)) is equivalent to rnorm(1,0,sqrt(5000)). The sample average of normals is again a normal variable. In your case you take a sample average of 5000 independent normal variables, each with zero mean and variance 5000. This is a normal variable with zero mean and unit variance, i.e. standard normal.
So in your case mean(sums) is distributed like rnorm(1): any value in the interval (-1.96, 1.96) will come up 95% of the time.
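The equivalence described above can be checked empirically; this sketch compares summing 5000 standard normals against drawing directly from a normal with standard deviation sqrt(5000):

```r
set.seed(123)
# 5000 simulation paths, each summing 5000 standard normals:
sums   <- replicate(5000, sum(rnorm(5000, 0, 1)))
# The equivalent single draws, N(0, sd = sqrt(5000)):
direct <- rnorm(5000, mean = 0, sd = sqrt(5000))
c(mean(sums), mean(direct))   # both near 0
c(var(sums),  var(direct))    # both near 5000
mean(sums)                    # average of 5000 N(0, 5000) draws, i.e. ~ N(0, 1)
```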
