Suitable test for a one-group sample (if values are under 0) - R

I did some regressions/paired t-tests on my sample. For my final hypothesis I now want to test whether a single group of 50 observations is significantly under 0 (the values themselves, not just the mean), i.e. x < 0, and if it is, by how much, as in the average of all the negative values. Which kind of test should I use?
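As a hedged sketch (with invented data, not the asker's sample), two common ways to frame "is this single group significantly under 0" in R are a sign/binomial test on the proportion of negative values and a one-sample location test against 0:

# Invented data standing in for the 50 observations
set.seed(1)
x <- rnorm(50, mean = -0.3, sd = 1)

# Reading 1: are the values themselves mostly below 0?  Sign test on the
# proportion of negative observations (a test on the median, not the mean).
binom.test(sum(x < 0), length(x), p = 0.5, alternative = "greater")

# Reading 2: is the location of the distribution below 0?  Wilcoxon signed-rank
# test against 0 (or t.test(x, mu = 0, alternative = "less") if normality holds).
wilcox.test(x, mu = 0, alternative = "less")

# "By how much": for example, the average of the negative values only
mean(x[x < 0])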

Related

How to determine the time lag at different time intervals between two correlated variables

I have two variables, x and y, measured at one minute intervals for over two years. The average daily values of x and y are almost 90% correlated. However, when I analyze x and y in one minute intervals they are only 50% correlated. How can I detect the time interval at which this correlation becomes 90%? Ideally I'd like to do this in R.
I'm new to statistics/econometrics, so my apologies if this question is very basic!
I'm not quite sure what you are asking here. What do you mean by x and y being 90 "percent" correlated? Do you mean you get a correlation coefficient of .9?
Beyond this clarification, you can absolutely have a situation where the averages of two variables are more correlated than the underlying measurements. In other words, order matters, so the correlation of the averages is not the average of the correlations. For example, the R code below shows that if we took 3 measurements each hour for 2 hours (6 measurements total), the overall correlation is roughly .5, while the correlation of the average hourly measures is a perfect 1. Essentially, when you take the correlation of averages you remove the effect of how the measurement values are ordered within the interval you are averaging over, and that ordering turns out to matter a lot for the correlation. Let me know if I missed something about your question though.
X <- c(1, 2, 3, 4, 5, 6)                  # minute-level measurements: hour 1, then hour 2
Y <- c(3, 2, 1, 6, 5, 4)                  # same values, but reordered within each hour
cor(X, Y)                                 # raw correlation: about 0.54
HourAvgX <- c(mean(X[1:3]), mean(X[4:6])) # hourly averages of X
HourAvgY <- c(mean(Y[1:3]), mean(Y[4:6])) # hourly averages of Y
cor(HourAvgX, HourAvgY)                   # correlation of the averages: exactly 1
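To go after the asker's actual question (at what averaging interval the correlation reaches 0.9), one hedged sketch is to compute the correlation of window means over a grid of window widths and pick the smallest width that crosses 0.9. The series below are synthetic stand-ins (a slow shared signal plus fast noise), not the asker's data.

# Synthetic one-minute series standing in for the asker's data
set.seed(42)
minutes <- 1:10000
x <- sin(minutes / 1440) + rnorm(10000, sd = 0.5)
y <- sin(minutes / 1440) + rnorm(10000, sd = 0.5)

# Correlation of window means for a given window width (in minutes)
window_cor <- function(x, y, width) {
  n   <- floor(length(x) / width) * width        # drop the incomplete final window
  grp <- rep(seq_len(n / width), each = width)   # window index for each minute
  cor(tapply(x[1:n], grp, mean), tapply(y[1:n], grp, mean))
}

widths <- c(1, 5, 15, 60, 240, 1440)             # 1 minute up to 1 day
sapply(widths, function(w) window_cor(x, y, w))
# The interval of interest is the smallest width whose correlation exceeds 0.9.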

Contradiction between Pearson and Pairwise.prop.test

I have two vectors a and b of the same length. The vectors contain the number of times each game has been played. So, for example, game 1 has been played 265350 times in group a, while it has been played 52516 times in group b.
a <- c(265350, 89148, 243182, 208991, 113090, 124698, 146574, 33649, 276435, 9320, 58630, 20139, 26178, 7837, 6405, 399)
b <- c(52516, 42840, 60571, 58355, 46975, 47262, 58197, 42074, 50090, 27198, 45491, 43048, 44512, 27266, 43519, 28766)
I want to use Pearson's chi-squared test to test independence between the two vectors. In R I type
chisq.test(a,b)
and I get a p-value of 0.2348, suggesting that the two vectors are independent (H0 is not rejected).
But when I run pairwise.prop.test(a,b) I get all the pairwise p-values, and almost all of them are very low, suggesting pairwise dependence between the two vectors, which contradicts the first result. How can that be?
The pairwise.prop.test is not the correct test for your case.
As it mentions in the documentation:
Calculate pairwise comparisons between pairs of proportions with correction for multiple testing
And also:
x (first argument).
Vector of counts of successes or a matrix with 2 columns giving the counts of successes and failures, respectively.
And
n (second argument).
Vector of counts of trials; ignored if x is a matrix.
So, x is the number of successes out of n trials, i.e. each element of x must be less than or equal to the corresponding element of n. That is why pairwise.prop.test is used for proportions. As an example, imagine tossing a coin 1000 times and getting heads 550 times: x would be 550 and n would be 1000. In your case you do not have anything like that; you just have counts of how often a game was played in two groups.
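As a minimal sketch of the successes-out-of-trials form the function expects, extending the coin-toss example (the second coin's counts are invented purely for illustration):

heads  <- c(550, 480)    # successes: heads seen for two hypothetical coins
tosses <- c(1000, 1000)  # trials: tosses per coin; each heads[i] <= tosses[i]
pairwise.prop.test(heads, tosses)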
The correct hypothesis test for testing independence is the chisq.test(a,b) that you have already used and I would trust that.

Prop.Test in R: How to correct for a large number of observations

This is not really a coding question but more of a statistical question.
I'm doing a proportions test on multiple proportions for many subjects.
For example, subject 1 will have multiple proportions (multiple "successes per total trials"), and subject 2 will have multiple proportions. For each subject we're testing whether all of these proportions are the same. The proportions could range from 30 successes out of 60 to around 300 successes out of 1000 (just to show the range of trials and successes for each subject). Furthermore, the number of proportions can vary across subjects: subject 1 could have 50 proportions, whereas subject 2 could have only 2. The idea is that for each subject we test that all of their proportions are the same, and reject if they are different.
However, I'm realizing that subjects with many more proportions end up with more significant p-values than subjects with only 2 proportions when using prop.test. I was wondering if there is a way to approach this problem differently: any sort of correction I could apply, or a way to take the number of proportions into account.
Any suggestions would be helpful.
One way to approach your example of comparing proportions for a single subject is to perform null hypothesis tests using the Z-statistic, comparing one proportion against the other proportions. The Z-statistic inherently normalizes for different sample sizes. As an example, assuming that one subject has 50 proportions, you would have 50 tests, and in the method below 5% total error is allowed for each subject. You can set this up with the following:
Research Question:
For a single subject with 50 proportions, is the first proportion the same as the other proportions?
Hypothesis
Null hypothesis: u_1 = u_2 = ... = u_50
Alternative hypothesis: u_i != 1/49 sum (u_j) where j != i
Calculate the Statistic
Use the Z-test to compare each proportion to the pooled value of the other 49 proportions (repeat for all 50 proportions)
N is your number of trials
Compute the appropriate test statistic and rejection criteria
5% total error is allowed for each subject, so compare each test's p-value against 5% / 50 (splitting the error across the 50 tests)
You would repeat this method for each proportion for this subject (i.e. perform null hypothesis testing 50 times for this subject).
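Here is a rough sketch of that recipe in R with invented successes/trials for one subject; the pooled two-proportion z-statistic is one standard way to formalize "compare proportion i to the other proportions combined".

x <- c(30, 45, 300, 50, 120)      # successes for one subject (invented numbers)
n <- c(60, 90, 1000, 80, 200)     # trials for the same proportions

k <- length(x)
alpha_per_test <- 0.05 / k        # 5% error for the subject, split across the k tests

z_vs_rest <- function(i) {
  p1 <- x[i] / n[i]                           # proportion i
  p2 <- sum(x[-i]) / sum(n[-i])               # the other proportions, pooled
  p  <- sum(x) / sum(n)                       # pooled estimate under the null
  (p1 - p2) / sqrt(p * (1 - p) * (1 / n[i] + 1 / sum(n[-i])))
}

z     <- sapply(seq_len(k), z_vs_rest)
pvals <- 2 * pnorm(-abs(z))                   # two-sided p-values
data.frame(z = round(z, 2), p = signif(pvals, 3), reject = pvals < alpha_per_test)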

fisher's exact test (R) - simulated p-value does not vary

I have a problem using Fisher's exact test in R with a simulated p-value, but I don't know whether it's caused by "the technique" (R) or whether it is (statistically) intended to work that way.
One of the datasets I want to work with:
matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),ncol=2,nrow=27)
The resulting p-value is always the same, no matter how often I repeat the test:
p = 1 / (B+1)
(B = number of replicates used in the Monte Carlo test)
When I shorten the matrix, it works if the number of rows is lower than 19. Nevertheless, it is not a matter of the number of cells in the matrix: after transforming it into a matrix with 3 columns it still does not work, although it does when using the same numbers in just two columns.
Varying simulated p-values:
>a <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=2,nrow=18)
>b <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704),ncol=2,nrow=19)
>c <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=3,nrow=12)
>fisher.test(a,simulate.p.value=TRUE)$p.value
The numbers of cells in a and c are the same, but the simulation only works with matrix a.
Does anyone know if it is a statistical issue or an R issue and, if so, how it could be solved?
Thanks for your suggestions
I think that you are just seeing a very significant result. The p-value is being computed as the number of simulated matrices (plus the original) that are as extreme or more extreme than the original. If none of the randomly generated matrices are as or more extreme, then the p-value will just be 1 (the original matrix is as extreme as itself) divided by the total number of matrices, which is $B+1$ (the B simulated matrices and the 1 original matrix). If you run the function with enough samples (high enough B) then you will start to see some of the random matrices be as or more extreme, and therefore varying p-values, but the time to do so is probably not reasonable.
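For what it is worth, fisher.test exposes the number of Monte Carlo replicates through its B argument (the default is 2000), so the 1/(B+1) floor can be pushed down by raising B, at the cost of run time:

m <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,
              869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),
            ncol = 2, nrow = 27)

fisher.test(m, simulate.p.value = TRUE)$p.value            # floor is 1 / 2001
fisher.test(m, simulate.p.value = TRUE, B = 1e5)$p.value   # floor is 1 / 100001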

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix whose dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the remaining columns represent the probabilities of a multinomial distribution over these counts. In other words, in each row the first column is n and the remaining k columns are the probabilities of the k categories. Another point is that the matrix is sparse, meaning that in each row many columns have the value 0.
Here's a toy matrix I created:
mat=rbind(c(5,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),c(2,0.2,0.2,0.2,0.2,0.2,0,0,0,0,0),c(22,0.4,0.6,0,0,0,0,0,0,0,0),c(5,0.5,0.2,0,0.1,0.2,0,0,0,0,0),c(4,0.4,0.15,0.15,0.15,0.15,0,0,0,0,0),c(10,0.6,0.1,0.1,0.1,0.1,0,0,0,0,0))
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
where, say, samples = 100000.
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive, at least in terms of RAM. Is there a more efficient solution to this problem in R?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
# per-category sample variances of the simulated counts, one column per row of mat
vars = apply(mat, 1, function(x) apply(rmultinom(samples, x[1], x[-1]), 1, var))
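Worth noting: if only the marginal (per-category) variances are needed, the multinomial gives them in closed form, Var(X_i) = n * p_i * (1 - p_i), so no simulation is required at all. Over the whole matrix this is a single vectorized expression:

# Closed-form marginal variances: n * p * (1 - p), computed row-wise over mat
var_mat <- mat[, 1] * mat[, -1] * (1 - mat[, -1])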
