Can effect size go beyond 1 using Hedges' g or Cohen's d in meta-analysis? - metafor

I am using the R metafor package to calculate effect sizes. Hedges' g is applied because my sample sizes differ between the comparison groups.
I have read that an effect size of 0.9 is considered very strong, but I am getting 1.3, and I am not sure whether that is right.

There is in principle no lower/upper bound on a standardized mean difference. If the mean difference is large or the SD is small, it can get very large. However, if it gets very large (e.g., values like 5 or 10), I would start to get worried that something is off with the information provided/used (a common mistake is that standard errors of the mean are used instead of standard deviations in computing the standardized mean difference, leading to values that are way too large).
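As a sanity check, here is a minimal sketch (with made-up summary statistics, not your data) showing that escalc() in metafor returns a Hedges' g above 1 whenever the mean difference is larger than the pooled standard deviation:
library(metafor)
# hypothetical two-group summary data: mean difference of 13, pooled SD of 10
dat <- escalc(measure = "SMD",
              m1i = 30, sd1i = 10, n1i = 25,   # group 1
              m2i = 17, sd2i = 10, n2i = 40)   # group 2
dat   # yi (Hedges' g) is about 1.29; vi is its sampling variance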

Related

Uniqueness component of exploratory factor analysis

I am applying an exploratory factor analysis to a dataset using the factanal() function in R. After applying the scree test, I found that 2 factors should be retained from my 20 features.
Trying to find out what this uniqueness represents, I found the following from here:
"A high uniqueness for a variable usually means it doesn’t fit neatly into our factors. ..... If we subtract the uniquenesses from 1, we get a quantity called the communality. The communality is the proportion of variance of the ith variable contributed by the m common factors. ......In general, we’d like to see low uniquenesses or high communalities, depending on what your statistical program returns."
I understand that if the uniqueness value is high, the variable is not well captured by the common factors. But what is a good threshold for this uniqueness measure? All of my features show a value greater than 0.3, and most of them range from 0.3 to 0.7. Does this mean that my factor analysis doesn't work well on my data? I have tried rotation, and the results are not very different. What else should I try?
You can partition an indicator variable's variance into its...
Uniqueness (u2 = 1 - h2): the variance that is not explained by the common factors
Communality (h2): the variance that is explained by the common factors
Which values can be considered "good" depends on your context. You should look for examples in your application domain to know what you can expect. In the package psych, you can find some examples from psychology:
library(psych)
# maximum-likelihood factor analysis with 2 factors, no rotation,
# on the Thurstone.33 correlation matrix bundled with psych
m0 <- fa(Thurstone.33, 2, rotate = "none", fm = "mle")
m0
m0$loadings
When you run the code, you can see that the communalities are around 0.6. The absolute factor loadings of the unrotated solution vary between 0.27 and 0.85.
An absolute value of 0.4 is often used as an arbitrary cutoff for acceptable factor loadings in psychological domains.
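If you are working with factanal() as in the question, here is a minimal sketch (mydata is a placeholder for your 20-feature data set) of how to extract the uniquenesses and convert them to communalities:
# mydata is a placeholder for your own data matrix or data frame
fit <- factanal(mydata, factors = 2, rotation = "none")
u2 <- fit$uniquenesses   # uniqueness of each variable
h2 <- 1 - u2             # communality = 1 - uniqueness
sort(h2)                 # variables with low communality are poorly captured by the factors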

Relrisk function and bandwidth selection in spatstat

I'm having trouble interpreting the results I got from relrisk. My data are a multitype point pattern with two marks (two rodent species, AA and RE), and I want to know whether they are spatially segregated or not.
> summary(REkm)
Marked planar point pattern: 46 points
Average intensity 0.08101444 points per square unit
*Pattern contains duplicated points*
Coordinates are given to 3 decimal places
i.e. rounded to the nearest multiple of 0.001 units
Multitype:
   frequency proportion  intensity
AA        15   0.326087 0.02641775
RE        31   0.673913 0.05459669
Window: rectangle = [4, 38] x [0.3, 17] units
                    (34 x 16.7 units)
Window area = 567.8 square units
relkm <- relrisk(REkm)
plot(relkm, main="Relrisk default")
The bandwidth of this relrisk estimation is selected automatically by default (bw.relrisk), but when I try other numeric values, such as sigma = 0.5 or 1, the results look rather strange.
How did this happen? Is it because of the large proportion of blank space in my ppp?
According to Chapter 14 of the Spatial Point Patterns book and the previous discussion, I assume the default relrisk shows the ratio of intensities (case divided by control; in my case, RE divided by AA), but if I set casecontrol=FALSE, I get the spatially varying probability of each type.
Then why does the image for type RE with casecontrol=FALSE look exactly the same as the default relrisk estimation? Or do they both estimate p(RE) = λ_RE / (λ_RE + λ_AA) at each location?
Any help will be appreciated! Thanks a lot!
That's two questions.
Why does the image for RE when casecontrol=FALSE look the same as the default output from relrisk?
The definitive source of information about spatstat functions is the online documentation in the help files. The help file for relrisk.ppp gives full details of the behaviour of this function. It says that the calculation of probabilities and risks is controlled by the argument relative. If relative=FALSE (the default), the code calculates the spatially varying probability of each type. If relative=TRUE it calculates the relative risk of each type i, defined as the ratio of the probability of type i to the probability of type c where c is the type designated as the control. If you wanted the relative risk then you should set relative=TRUE.
Very different results obtained when setting sigma=0.5 compared to the automatically selected bandwidth.
Your example output says that the window is 34 by 17 units. A smoothing bandwidth of sigma=0.5 is very small for this region. Imagine each data point being replaced by a blurry circle of radius 0.5; there would be a lot of empty space. The smoothing procedure is encountering numerical problems which are causing the funky artefacts.
You could try a range of different values of sigma, say from 1 to 15, and decide which value produces the most satisfactory result.
The plot of relrisk(REkm, casecontrol=FALSE) suggests that the automatic bandwidth selector bw.relrisk chose a much larger value of sigma, perhaps about 10. You can investigate this by
b <- bw.relrisk(REkm)
print(b)
plot(b)
The print command will print the chosen value of sigma that was used in the default calculation. The plot command will show the cross-validation criterion that was optimised to select the bandwidth. This gives you an idea of the range of values of sigma that are acceptable to the automatic selector.
Read the help file for bw.relrisk about the different options available for the bandwidth selection method. Perhaps a different choice of method would give you a more acceptable result from your point of view.
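To make the comparison concrete, here is a minimal sketch of this kind of bandwidth experiment. It uses the built-in urkiola pattern (a two-type point pattern of tree species shipped with spatstat) as a stand-in for REkm, since your rodent data are not available; the specific sigma values are arbitrary.
library(spatstat)
X <- urkiola                         # two-type point pattern (birch/oak)
b <- bw.relrisk(X)                   # automatic bandwidth selection
print(b)                             # chosen value of sigma
plot(b)                              # cross-validation criterion versus sigma
plot(relrisk(X), main = "automatic bandwidth")
for (s in c(5, 10, 20, 40)) {        # a range of fixed bandwidths (metres)
  plot(relrisk(X, sigma = s), main = paste("sigma =", s))
}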

Is there any way to calculate effect size between a pre-test and a post-test when scores on the pre-test are 0 (or almost 0)?

I would like to calculate an effect size between scores from pre-test and post-test of my studies.
However, due to the nature of my research, pre-test scores are usually 0 or almost 0 (before the treatment, participants usually do not have any knowledge of the topic in question).
I cannot just use Cohen's d to calculate effect sizes since the pre-test scores do not follow a normal distribution.
Is there any way I can calculate effect sizes in this case?
Any suggestions would be greatly appreciated.
You are looking for Cohen's d to see if the difference between the two time points (pre- and post-treatment) is large or small. The Cohen's d can be calculated as follows:
d = (mean_post - mean_pre) / sqrt((variance_post + variance_pre) / 2)
where variance_post and variance_pre are the sample variances. Nothing here requires the pre- and post-treatment scores to be normally distributed.
There are multiple R packages that provide a function for Cohen's d: effsize, pwr and lsr. With lsr your code would look like this:
library(lsr)
cohensD(pre_test_vector, post_test_vector)
Side note: by the Central Limit Theorem, the average scores are approximately normally distributed once your sample size is large enough, even if the individual scores are not.
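For illustration only, here is a small simulated example (the data and variable names are made up) comparing the hand formula above with lsr::cohensD; with equal group sizes the two agree:
library(lsr)
set.seed(1)
pre_test_vector  <- rbinom(40, size = 10, prob = 0.02)  # mostly zeros, like a pre-test
post_test_vector <- rbinom(40, size = 10, prob = 0.60)
d_hand <- (mean(post_test_vector) - mean(pre_test_vector)) /
  sqrt((var(post_test_vector) + var(pre_test_vector)) / 2)
d_lsr  <- cohensD(pre_test_vector, post_test_vector)
c(d_hand, d_lsr)   # identical with equal group sizes (cohensD reports the magnitude)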

Convergence of R density() function to a delta function

I'm a bit puzzled by the behavior of the R density() function in an edge case...
Suppose I add more and more points with x=0 into a simulated data set. What I expect is that the density estimate will very quickly converge (I'm being deliberately vague about what that means...) to a delta function at x=0. In practice, the fit certainly gets narrower, but very slowly, as shown by this sequence of plots:
plot(density(c(0,0)), xlim=c(-2,2))
plot(density(c(0,0,0,0)), xlim=c(-2,2))
plot(density(c(rep(0,10000))), xlim=c(-2,2))
plot(density(c(rep(0,10000000))), xlim=c(-2,2))
But if you add a tiny bit of noise to the simulated data, the behavior is much better:
plot(density(0.0000001*rnorm(10000000) + c(rep(0,10000000))), xlim=c(-2,2))
Just let sleeping dogs lie? Or am I missing something about the usage of density()?
Per ?bw.nrd0, the default bandwidth selector for density:
bw.nrd0 implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It defaults to 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power (= Silverman's ‘rule of thumb’, Silverman (1986, page 48, eqn (3.31)) unless the quartiles coincide when a positive result will be guaranteed.
When your data is constant, then the quartiles coincide, so the last clause guaranteeing a positive result kicks in. This basically means that the bandwidth chosen is not a continuous function of the spread of the data, at zero.
To illustrate:
> bw.nrd0(rep(0, 1e6))
[1] 0.05678616
> bw.nrd0(rnorm(1e6, sd = 1e-6))
[1] 5.672872e-08
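A small follow-up sketch (not from the original exchange): if you bypass the bw.nrd0 fallback by passing an explicit bandwidth, density() produces the narrow spike you expected.
# with an explicit (tiny) bandwidth the estimate collapses onto x = 0
plot(density(rep(0, 10000), bw = 1e-6), xlim = c(-2, 2))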
Actually (...tail between legs...) I now realize that my entire question was misguided. Being fairly new to R, I had instantly assumed that density() tries to fit Gaussians of different widths to the data points, optimizing both the number of Gaussians and their individual widths. But in fact it is doing something much simpler. It just smears out each data point, and adds up the smears to give a smoothed estimate of the data. density() is just a simple smoothing algorithm. So, yes indeed, RTFM :)

Likert Rank ordering optimization heuristic possible?

I can't find the type of problem I have, and I was wondering if someone knows what kind of statistics it involves. I'm not sure it's even something that can be optimized.
I'd like to optimize over three variables, or more precisely over a combination of two of them. The first is a Likert-scale average, the second is the number of times that item has been rated on that Likert scale, and the third is the item ID. The Likert scale is [1, 2, 3, 4].
So:
3.25, 200, item1 would mean that item1 was rated 200 times and got an average rating of 3.25.
I have a bunch of items and I'd like to find the high-value items. For instance, an item at 4, 1 would suck because, while it has the highest rating, it was rated only once. And 1, 1000 would also suck, for the inverse reason.
Is there a way to do this with a simple heuristic? Someone told me to look into confidence bands, but I am not sure how that would work. Thanks!
Basically you want to ignore scores with fewer than x ratings, where x is a threshold that can be estimated based on the variance in your data.
I would recommend estimating the variance (standard deviation) of your data and putting a threshold on your standard error, then translating that error into the minimum number of samples required to produce that bound with 95% confidence. See: http://en.wikipedia.org/wiki/Standard_error_(statistics)
For example, if your data have a standard deviation of 0.5 and you want the standard error of the mean to be at most 0.1, you need (0.5/0.1)^2 = 25 ratings; if you want a 95% confidence interval no wider than ±0.1, you need roughly (1.96 × 0.5 / 0.1)^2 ≈ 96 ratings.
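A hedged sketch of the confidence-band idea: rank items by a lower confidence bound on their mean rating, so sparsely rated items are pulled down. The data frame, column names, and the assumed per-item standard deviation below are all made up for illustration.
items <- data.frame(
  item        = c("item1", "item2", "item3"),
  mean_rating = c(3.25, 4.00, 1.00),
  n_ratings   = c(200, 1, 1000)
)
sd_assumed <- 0.5                              # assumed rating standard deviation
items$lower_bound <- items$mean_rating -
  1.96 * sd_assumed / sqrt(items$n_ratings)    # 95% lower confidence bound
items[order(-items$lower_bound), ]             # high-value items first; the item rated 4 only once drops below item1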
