ezANOVA not providing Greenhouse-Geisser corrected df even though sphericity is violated - R

I've noticed that sometimes when I use ezANOVA from the ez package I get columns stating what the Greenhouse-Geisser corrected df values are, and other times the tables with the sphericity corrections do not include the new df values, even though there are violations. For example, I just ran a two-way repeated-measures ANOVA, and my table output looks like this:
I wish I could give reproducible data, but I genuinely don't know why it does or doesn't do it sometimes. Does anyone else know? I'll show my code below in case there's something I'm missing regarding the actual ezANOVA function. I could compute the corrected df values by hand, but I haven't found a good resource online showing how to correct them using the epsilon value, and unfortunately I was never taught that.
ez::ezANOVA(data = IntA2, wid = rat, dv = numReinforcers, within = .(component, minute))
Edit: A very nice person on the internet has explained to me how to calculate the new df values by hand (multiplying the GG epsilon by the old dfs, in case anyone else was wondering!), but I'm still unclear on why sometimes the function does it for you and other times it does not.
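For reference, a minimal sketch of that hand correction, assuming the ezANOVA result is stored in res and that the effect name used below is a placeholder for whichever effect violates sphericity; the corrected df are just the uncorrected df multiplied by the Greenhouse-Geisser epsilon:
res <- ez::ezANOVA(data = IntA2, wid = rat, dv = numReinforcers,
                   within = .(component, minute))
eff <- "component:minute"   # placeholder: the effect you want to correct
eps <- res$`Sphericity Corrections`$GGe[res$`Sphericity Corrections`$Effect == eff]
df_num <- res$ANOVA$DFn[res$ANOVA$Effect == eff]   # uncorrected numerator df
df_den <- res$ANOVA$DFd[res$ANOVA$Effect == eff]   # uncorrected denominator df
c(GG_df_num = eps * df_num, GG_df_den = eps * df_den)   # corrected df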

Related

Error in plsm... manifest variables must be contained in data

I am trying to make a PLS-SEM model and I am using the plsm() function in R from the semPLS package. However, at first I got an error saying:
The latent variables are not allowed to coincide with names of observed variables.
I understood that one, but after going through my input, and even after adding single-factor constructs (directly measured variables) to my measurement model matrix, I now get the following:
mod <- plsm(data = survey, strucmod = smin, measuremod = mmin)
Error in plsm(data = survey, strucmod = smin, measuremod = mmin) :
The manifest variables must be contained in the data.
I am at a loss as to how I should proceed. It seems that whenever I "fix" one problem, it directly causes another. Does anyone have any examples aside from the standard mobi example from the package where I could see how it's done when I have both latent and directly measured variables?
Found the code for the function, but now I'm even more confused.
https://github.com/cran/semPLS/blob/master/R/plsm.R
Could anyone explain in a simple manner how I am supposed to name my df columns, and the measurement model to avoid this problem?
I don't know if you ever solved this, but I just had a similar issue and seemed to be the only other person with it. I ended up getting it solved through some trial and error.
I created three tables:
Structural model: SM - column names: Source | Target
Measurement model: MM - column names: Source | Target
Data: column names - the measurement (indicator) headers
I converted the SM and MM tables to matrices:
datamatrix_SM = as.matrix(SM)
datamatrix_MM = as.matrix(MM)
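To make the naming requirements concrete, here is a minimal sketch with hypothetical construct and indicator names (Trust, Usage, and q1-q3 are all made up). The key points are that the latent variable names must not coincide with any column name in the data, and every manifest variable in the measurement model's Target column must be a column of the data frame:
library(semPLS)
# Hypothetical survey data: the columns are the manifest (indicator) variables.
survey <- data.frame(q1 = rnorm(50), q2 = rnorm(50), q3 = rnorm(50))
# Structural model: latent variable -> latent variable.
SM <- rbind(c("Trust", "Usage"))
colnames(SM) <- c("Source", "Target")
# Measurement model: latent variable -> manifest variable (a data column).
MM <- rbind(c("Trust", "q1"),
            c("Trust", "q2"),
            c("Usage", "q3"))   # single-indicator (directly measured) construct
colnames(MM) <- c("Source", "Target")
mod <- plsm(data = survey, strucmod = SM, measuremod = MM)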

R - select cases so that the mean of a variable is some given number

I previously worked on a project where we examined some sociological data. I did the descriptive statistics, and after several months I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they are different. Upon investigating further, I noticed that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know if this was due to updates of some packages, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variable is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
filter_if(mean(.$Petal.Length) == 1.3)
I know that this was an incorrect attempt but I don't know any other way that I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
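As an illustration only (the functions referenced above aren't reproduced here): for a small toy vector you can brute-force which subset of rows sums closest to extra_sum. A search like this does not scale to picking 28 rows out of 730, so treat it purely as a sketch of the idea, with made-up numbers:
# Toy illustration: find the k rows whose values sum closest to extra_sum.
set.seed(1)
x <- round(rnorm(12, mean = 10), 1)   # hypothetical variable values
k <- 3                                # hypothetical number of extra cases
extra_sum <- 31.5                     # hypothetical target sum
combos <- combn(seq_along(x), k)                       # all k-row subsets
sums <- apply(combos, 2, function(idx) sum(x[idx]))    # their sums
combos[, which.min(abs(sums - extra_sum))]             # candidate "extra" rows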

How to interpret the prediction in this plot of classification tree?

I have followed this tutorial and was able to reproduce the results. However, the last graph confuses me. I understand that most of the time these values are probabilities, but why are there negative numbers? Since the response is Survived, how should I interpret the numbers in the predictions? How do I convert those numbers to Yes and No?
https://www.h2o.ai/blog/finally-you-can-plot-h2o-decision-trees-in-r/
EDIT 11/19/2019: by the way, I did find a similar post on Cross Validated. The answer was not certain, since it ended with a question mark.
https://stats.stackexchange.com/questions/374569/may-somebody-help-with-interpretation-of-trees-from-h2o-gbm-see-as-photo-attach
I filtered the data using the logic in the tree and looked at the unique predictions of the subset. I was able to find the threshold for 'Yes' and 'No' predictions. I also changed the original code (starting at line 34) so that each leaf shows the ultimate result instead of the raw numbers. However, this is just a way to hack the plot. If someone can tell me how the numbers are derived, that would be great.
if (class(left_node)[[1]] == 'H2OLeafNode') {
  leftLabel = ifelse(left_node@prediction >= threshold, 'Yes', 'No')
} else {
  leftLabel = left_node@split_feature
}
if (class(right_node)[[1]] == 'H2OLeafNode') {
  rightLabel = ifelse(right_node@prediction >= threshold, 'Yes', 'No')
} else {
  rightLabel = right_node@split_feature
}
Since the picture is a plot of a GBM tree, it's not as straightforward as you might like: the inference calculation does some math on the value extracted from the leaf of the tree.
The actual code is here:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/gbm/GbmMojoModel.java
Look at the score0 function.
My advice would be to build a 1-tree DRF instead, then write a short Java program and try to single-step through it in a Java debugger.
The Java snippet to start from is the example of how to compile and run a MOJO in this document:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
If you do this, you will be able to step through the exact steps that produce the answer (for GBM as well if you prefer), and nothing will be unknown at that point.
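As a rough sketch of that suggestion on the R side (the training frame and output path are placeholders; the response column Survived is taken from the question), you could train a one-tree DRF and download its MOJO together with the genmodel jar to step through in a Java debugger:
library(h2o)
h2o.init()
train <- as.h2o(my_training_data)   # hypothetical training data
drf1 <- h2o.randomForest(
  x = setdiff(names(train), "Survived"),
  y = "Survived",
  training_frame = train,
  ntrees = 1)                       # a single tree keeps the scoring path easy to follow
# Export the MOJO (plus h2o-genmodel.jar) for the Java single-stepping exercise.
h2o.download_mojo(drf1, path = "mojo_out", get_genmodel_jar = TRUE)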

How do I use prodlim function with a non-binary variable in formula?

I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works when dividing the data into 2 groups, but when I try to adjust it for a 4-group situation, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and I haven't understood much of what I've looked up about it, sadly :/ But the documentation seems to indicate it supports continuous or categorical variables, and doesn't seem limited to binary ones either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary variable (=1 when a member of this population is taking the medication), and dz is a disease that some portion of them develop.
this does not:
(either of these gets the error below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it. :/

Error in function - argument is of length zero in RStudio

deadcheck <- function(a, t) { # function to check if dead, for a specific age, at the time step sent to the function
  roe <- which(birthmort$age[i] == fertmortc$min & fertmortc$max) # checks row in fertmortc (chart) to pick an age that meets the min and max age requirements -- I think this could be wrong...
  prob <- 1 - (((1 - fertmortc$mortality[roe])^(1/365))^t) # finds the prob for the row that meets the above requirements
  if (runif(1, 0, 1) <= prob) {d <- TRUE} else {d <- FALSE} # I have a row that has the probability of death every 7 days.
  return(d) # outputs if dead
}
Background: I am creating an agent-based model, stored as a population in a data frame, that simulates how tuberculosis spreads in a population. (I know that there are probably 10000 better ways of having done this.) I have thus far created a loop that populates my data frame with people, ages, etc. I am now trying to create a function that looks up a chart listing the probability of death per year based on an age bracket: 0-5, 5-10, 10-15, etc. (The math is in there because I want it to check who lives, dies, and makes babies every 7 days.) I have a similar function that checks who is pregnant, and it works. However, I for the life of me can't figure out why this function is not working. I keep getting the following error.
Error in if (runif(1, 0, 1) <= prob) { : argument is of length zero
I am unsure how to fix this.
I apologize in advance if this is a dumb question; I have been trying to teach myself to code over the last 4-5 months. If I asked this question in the wrong format or incorrectly, please let me know how to do it correctly.
The value of prob has length zero, which means prob is empty, e.g.
prob = NULL
or numeric(0), in this case. Try altering your code to add
print(prob)
so you can check the intermediate result.
As you suspected in your comments, the expression
birthmort$age[i]==fertmortc$min & fertmortc$max
is problematic. What this does is evaluate the comparison birthmort$age[i] == fertmortc$min, and then combine the result of that comparison with fertmortc$max using the AND operator (&). That amounts to ANDing a Boolean value with a numeric value, which is unlikely to make much sense.
Just guessing, you perhaps want:
birthmort$age[i] >= fertmortc$min & birthmort$age[i] <= fertmortc$max
I don't know if this will fix your problem -- you haven't given enough to test it. For optimal help, you should give a reproducible example. See this for how to do so in R.
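To make the suggested fix concrete, here is a minimal self-contained sketch with made-up toy data (the real birthmort and fertmortc tables aren't shown in the question); the age is passed in directly rather than indexed out of birthmort inside the function:
# Toy lookup table of yearly mortality by age bracket (values made up).
fertmortc <- data.frame(min = c(0, 5, 10),
                        max = c(5, 10, 15),
                        mortality = c(0.02, 0.01, 0.015))
deadcheck <- function(age, t) {
  # Corrected range check: the bracket whose [min, max] contains this age.
  roe <- which(age >= fertmortc$min & age <= fertmortc$max)[1]
  # Convert the yearly mortality into the probability of dying within t days.
  prob <- 1 - (((1 - fertmortc$mortality[roe])^(1/365))^t)
  runif(1, 0, 1) <= prob  # TRUE if this agent dies in this time step
}
deadcheck(age = 7, t = 7)  # example call: a 7-year-old over a 7-day step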
