Calculating percentiles by factor using ave() in R

I'm trying to calculate percentiles (0.3 and 0.7) of a z-score variable (BuildingZ) by factor level (DistrictName). My research so far has pointed me toward the ave() function, which allows computing by factor level, but given my "newness" to working in R, I don't really know how to work through this issue. Here's what I have tried:
Create the data frame with the variables I need:
MathGap2=data.frame(MathGap$DistrictName, MathGap$BuildingName, MathGap$Grade, MathGap$BuildingZ)
Use the ave() function to calculate the desired quantile as a new column:
MathGap2$Thirty<-ave(MathGap2$BuildingZ, MathGap2$DistrictName, fun=quantile(MathGap2$BuildingZ, c(0.3)))
I'm not sure if invoking "quantile" is even valid, or whether I'd have to write a function for this (which is beyond my experience). I've seen similar attempts here, but can't get them to work.
P.S. If it's any help, some of the factor levels may occur only 1-3 times. I'm not sure whether this affects the ability to calculate percentiles. While this seems silly, please ignore the shady math for now; I'm simply trying to replicate an existing study as closely as possible.
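For reference, a minimal sketch of the ave() call this seems to be aiming at. Two things to note: the data.frame() call above produces column names like MathGap.BuildingZ unless names are supplied explicitly, and ave() takes its function through FUN (capitalized), which must be a function rather than the value of a function call:

MathGap2 <- data.frame(DistrictName = MathGap$DistrictName,
                       BuildingName = MathGap$BuildingName,
                       Grade        = MathGap$Grade,
                       BuildingZ    = MathGap$BuildingZ)
# 30th and 70th percentiles of BuildingZ within each DistrictName
MathGap2$Thirty  <- ave(MathGap2$BuildingZ, MathGap2$DistrictName,
                        FUN = function(x) quantile(x, 0.3))
MathGap2$Seventy <- ave(MathGap2$BuildingZ, MathGap2$DistrictName,
                        FUN = function(x) quantile(x, 0.7))

Districts with only 1-3 buildings still get a value: quantile() simply interpolates between the few observations available (or returns the single observation itself).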

How to deal with too large vector size in lm with many-level factor as control

I'm trying to fit a linear model with roughly 900,000 observations and just two explanatory variables. Yet, I additionally need to include a control variable that is a many-level factor variable (11,135 levels). The code for the regression looks like this:
model1 <- lm(dep_var ~ expl_var_1 + expl_var_2 + factor(control_var), data = data)
However, R throws me the error "Cannot allocate a vector of size 75.6 GB"
I'm well aware that this is due to the many-level factor variable, however, I need to include this variable as a control. Please note: this is not an ordered factor; it is simply an id without any order.
I've tried to find a solution, but ran into the following problems:
I looked into plm - but that doesn't work because, while my control variable can be interpreted as an ID, time doesn't play a role (and even if it did, there can be more than one observation per ID per time)
I looked into biglm - but that fits the big-data case better than the many-level-factor case
My questions:
Is there a way to include a variable in the regression but leave it out when assigning the outcome of the regression to model1? I'm really not interested in the coefficients for the individual control-variable factor levels; I just need to control for it.
If there isn't: can I efficiently split up my regression even if I cannot ensure that every control-variable factor level is present in each chunk (that isn't feasible, because some levels have just one observation)?
I'd appreciate any starting points for a solution and ideas where to look for a solution - currently I'm just stuck with my level of knowledge and understanding.
Thanks in advance for your time, support, and patience.
I am late to the party, but I actually don't see why biglm would not work. You would not need to expand the control into dummies yourself; keeping it as one factor makes the problem much less sparse. The only thing needed is to create chunks of the data ahead of the biglm call (which you can do with split, or sample and split), run biglm on the first chunk, and then feed the remaining chunks through biglm's update function. The number of chunks will depend on your memory.
Just make sure you define the factor levels in each chunk the exact same way (using levels(), with or without relevel(), before chunking). For levels absent from a chunk, biglm will return NA, which gets updated in the later stages.
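A minimal sketch of that chunk-and-update approach, using the variable names from the question (the chunk count of 20 is arbitrary - tune it to your memory):

library(biglm)
data$control_var <- factor(data$control_var)      # identical level coding in every chunk
chunks <- split(data, rep_len(1:20, nrow(data)))  # 20 roughly equal row chunks
fit <- biglm(dep_var ~ expl_var_1 + expl_var_2 + control_var,
             data = chunks[[1]])
for (ch in chunks[-1]) fit <- update(fit, ch)     # biglm's update(object, moredata)
summary(fit)                                      # control_var coefficients can be ignored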

Using PERMANOVA in R to analyse the effect of 3 independent variables on reef systems

I am trying to understand how to run a PERMANOVA using adonis2 in R to analyse some data that I have collected. I have been looking online but, as often happens, the explanations are a bit convoluted, so I am asking for your help. I have some fish and coral groups as columns, as well as 3 independent variables (reef age, depth, and material). [Snapshot of my dataset structure.]
I think I have understood that p-values are not the only important bit of the output, and that the R2 values indicate how much each variable contributes to the model. Is there something wrong, or something I am missing here? Also, I think I understood that I should check for homogeneity of variance, but I have not understood whether I should check for it on each variable independently, or whether I should include them all in the same bit of code (which does not seem to work). Here is the code that I am using to run the PERMANOVA (1), and the code that I am trying to use to assess homogeneity of variance, which does not work (2).
(1) adonis2(species ~ Age + Material + Depth, data = data.var, by = "margin")
'species' is the subset of the dataset including all the species counts, while 'data.var' is the subset including the 3 independent variables. Also, what is the difference between using '+' and '*' in the formula? When I use '*' it gives me 'Error in qr.X(object$CCA$QR) : need larger value of 'ncol' as pivoting occurred'. What does this mean?
(2) variance.check <- betadisper(species.distance, data.var, type = c("centroid"), bias.adjust = FALSE)
'species.distance' is a distance matrix calculated through 'vegdist' using the Bray-Curtis method. I used 'data.var' to check variance across all 3 independent variables at once, but it does not work, while it does work if I check them independently (3). Why is that?
(3) variance.check <- betadisper(species.distance, data$Depth, type = c("centroid"), bias.adjust = FALSE)
Thank you in advance for your responses, and for your help. It will really help me to get my head around it (and sorry for the many questions).
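For what it's worth, betadisper() expects a single grouping vector as its second argument rather than a data frame of several factors, which would explain why (2) fails while (3) works. A minimal sketch, keeping the column names from the question, that tests one factor at a time or combines all three with interaction():

library(vegan)
species.distance <- vegdist(species, method = "bray")
# one dispersion test per factor...
disp.depth <- betadisper(species.distance, data.var$Depth, type = "centroid")
permutest(disp.depth)    # permutation test for homogeneity of dispersions
# ...or one test on the combined factor-level groups
grp <- interaction(data.var$Age, data.var$Material, data.var$Depth, drop = TRUE)
disp.all <- betadisper(species.distance, grp, type = "centroid")
permutest(disp.all)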

ARTool package in R - multiple within factors

I have recently discovered the ARTool package for R (https://cran.r-project.org/web/packages/ARTool/) when looking for a non-parametric alternative for a repeated measures ANOVA.
I have used ARTool and find it really very useful, but I came across a problem that I am not sure how to deal with. Specifically, the Df.res values seem to be strongly inflated as soon as I have more than one within-subjects factor. I have not come across this with two between factors, one between and one within factor, or two between and one within factor, but whenever I add a second within factor, Df.res seems to become inflated.
I just wondered whether I am misunderstanding something or maybe there is an explanation that I am not aware of.
Any response would be greatly appreciated.
Many thanks!
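For concreteness, a minimal sketch of the kind of design meant, with a hypothetical data frame df containing a numeric response Y, two within-subjects factors A and B, and a Subject identifier (ARTool's lme4-style syntax):

library(ARTool)
df$Subject <- factor(df$Subject)                # ARTool requires factors throughout
m <- art(Y ~ A * B + (1 | Subject), data = df)  # aligned-rank-transform model
anova(m)                                        # Df.res is the column in question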

1 sample t-test from summarized data in R

I can perform a 1-sample t-test in R with the t.test command, but that requires an actual set of data; it can't use summary statistics (sample size, sample mean, standard deviation). I can work around this using the BSDA package, but are there any other ways to accomplish this 1-sample t-test in R without the BSDA package?
Many ways. I'll list a few:
directly calculate the p-value by computing the statistic and calling pt with that and the df as arguments, as commenters suggest above (it can be done with a single short line in R - ekstroem shows the two-tailed case; for the one-tailed case you wouldn't double it; see the sketch after this list)
alternatively, if it's something you need a lot, you could convert that into a nice robust function, even adding in tests against non-zero mu and confidence intervals if you like. Presumably if you go this route you'll want to take advantage of the functionality built around the htest class
(code and even a reasonably complete function can be found in the answers to this stats.SE question.)
If the samples are not huge (smaller than a few million, say), you can simulate data with exactly the same mean and standard deviation and call the ordinary t.test function: if m, s and n are the mean, sd and sample size, t.test(scale(rnorm(n))*s + m) should do (it doesn't matter what distribution you use, so runif would suffice). Note the importance of calling scale there. This makes it easy to change your alternative or get a CI without writing more code, but it wouldn't be suitable if you had millions of observations and needed to do it more than a couple of times (this approach is also sketched below)
call a function in a different package that will calculate it -- there are at least one or two other such packages (you don't make it clear whether using BSDA was a problem or whether you wanted to avoid packages altogether)
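To make the first and third options concrete, a minimal sketch (assuming the summary statistics m, s, n and a null mean mu0 are already defined):

# option 1: direct computation of the statistic and two-tailed p-value
t.stat  <- (m - mu0) / (s / sqrt(n))
p.value <- 2 * pt(-abs(t.stat), df = n - 1)  # drop the doubling for a one-tailed test

# option 3: simulate data with exactly that mean and sd, then use t.test() as usual
x <- scale(rnorm(n)) * s + m   # scale() forces sample mean 0 and sd 1 before rescaling
t.test(x, mu = mu0)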

Customized Fisher exact test in R

Beginner's question ahead!
(After spending much time, I could not find a straightforward solution.)
After trying all relevant posts, I can't seem to find the answer, perhaps because my question is quite basic.
I want to run fisher.test on my data (whatever data - it doesn't really matter to me; mine is Rubin's Children's TV Workshop example from QR33 - http://www.stat.columbia.edu/~cook/qr33.pdf). It has to simulate a completely randomized experiment.
My assumption is that an RCT in this context means that all units have the same probability of being assigned to treatment (1/N). (Of course, correct me if I'm wrong - thanks.)
I was asked to create a customized function and my function has to include the following arguments:
Treatment observations (vector)
Control observations (vector)
A scalar representing the value, e.g., zero, of the sharp null hypothesis; and
The number of simulated experiments the function should run.
When digging into R's fisher.test I see that I can specify x, y and many other parameters, but I'm unsure regarding the following:
What's the meaning of y? (The documentation's 'a factor object; ignored if x is a matrix' is not informative as to the statistical meaning.)
How do I specify my null hypothesis (i.e. if I don't want to use 0)? I see that there is a class 'htest' with null.value, but how can I use it in the function?
Regarding the number of simulations, my plan is to run everything through a loop - that sounds expensive - any ideas on how to write it better?
Thanks for helping - this is not an easy task, I believe; hopefully it will be useful for many people.
Cheers,
NB - I found the following explanations unsatisfying:
https://www.r-bloggers.com/contingency-tables-%E2%80%93-fisher%E2%80%99s-exact-test/
https://stats.stackexchange.com/questions/252234/compute-a-fisher-exact-test-in-r
https://stats.stackexchange.com/questions/133441/computing-the-power-of-fishers-exact-test-in-r
https://stats.stackexchange.com/questions/147559/fisher-exact-test-on-paired-data
It's not completely clear to me that a Fisher test is necessarily the right thing for what you're trying to do (that would be a good question for stats.SE), but I'll address the R questions.
As is explained at the start of the section on "Details", R offers two ways to specify your data.
You can either (1) supply to the argument x a contingency table of counts (omitting y entirely), or (2) supply the observations on individuals as two factor vectors, x and y, indicating the row and column categories (it doesn't matter which is which). [I'm not sure why it also doesn't let you specify x as a vector of counts and y as a data frame of factors, but it's easy enough to convert.]
With a Fisher test, the null hypothesis under which (conditionally on the margins) the observation categories become exchangeable is independence, but you can choose to make it one- or two-tailed (via the alternative argument).
I'm not sure I clearly understand the simulation aspect, but I almost never use a loop for simulations (not for efficiency, but for clarity and brevity). The function replicate is very good for doing simulations; I use it roughly daily, sometimes many times a day.
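A minimal sketch of both input forms, and of replicate() in place of a loop, with made-up data (treat and outcome are hypothetical factor vectors, not taken from the QR33 data):

# x as a 2x2 contingency table of counts; y is omitted
tab <- matrix(c(8, 2, 3, 7), nrow = 2)
fisher.test(tab)

# x and y as per-unit factor vectors giving row and column categories
treat   <- factor(rep(c("T", "C"), each = 10))
outcome <- factor(sample(c("pass", "fail"), 20, replace = TRUE))
fisher.test(treat, outcome)

# replicate() in place of a loop: p-values over simulated re-randomizations
sims <- replicate(1000, fisher.test(sample(treat), outcome)$p.value)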
