I'm using the 'metafor' package in R to compute log response ratios. Some of my means are zero, which seems to be the cause of a warning after my escalc command (since log(0) is -Inf). The metafor package provides a way to add a small value to zeros to avoid this. The documentation states:
"Cell entries with a zero can be problematic especially for the relative risk and the odds ratio. Adding a small constant to the cells of the 2 × 2 tables is a common solution to this problem [...] When to = "only0", the value of add is added to each cell of the 2 × 2 tables only in those tables with at least one cell equal to 0."
For some reason this is not resolving my error, perhaps because my data is not a 2x2 table? (It is output from summarise with ddply from the plyr package, similar to the formatting in this example.) Must I replace the zero values with a small number manually, or is there a more elegant way? (Note that in this example the rows with zero also have a sample size of 1 and thus no variance, so they will be dropped from the analysis anyway. I just want to know how this works for the future.)
Reproducible example:
# Output of dput(Bin_Y_count_summary_wide), assigned to dat
dat <-
structure(list(Species.ID = c("CAFERANA", "TR11", "TR118", "TR500",
"TR504", "TR9", "TR9_US1"), Y_num_mean.early = c(2, 147.375,
4.5, 0.5, 12.5, 93.4523809523809, 5), N.early = c(1L, 4L, 2L,
4L, 4L, 7L, 2L), sd.early = c(NA, 174.699444284558, 6.36396103067893,
1, 22.4127939653523, 137.506118190001, 7.07106781186548), se.early = c(NA,
87.3497221422789, 4.5, 0.5, 11.2063969826762, 51.9724274972283,
5), Y_num_mean.late = c(0, 3.625, 2.98482142857143, 0.8, 3, 47.2,
0), N.late = c(1L, 4L, 7L, 10L, 10L, 8L, 1L), sd.late = c(NA,
7.25, 5.10407804830748, 1.75119007154183, 8.03118920210451, 40.7351024477486,
NA), se.late = c(NA, 3.625, 1.9291601697265, 0.553774924194538,
2.53968501984006, 14.4020335865659, NA), Y_num_mean.wet = c(NA,
71.5, 0, 12, 27, 0, NA), N.wet = c(NA, 2L, 1L, 2L, 2L, 2L, NA
), sd.wet = c(NA, 17.6776695296637, NA, 9.89949493661167, 38.1837661840736,
0, NA), se.wet = c(NA, 12.5, NA, 7, 27, 0, NA)), row.names = c(NA,
7L), .Names = c("Species.ID", "Y_num_mean.early", "N.early",
"sd.early", "se.early", "Y_num_mean.late", "N.late", "sd.late",
"se.late", "Y_num_mean.wet", "N.wet", "sd.wet", "se.wet"), class = "data.frame", reshapeWide = structure(list(
v.names = c("Y_num_mean", "N", "sd", "se"), timevar = "early_or_late",
idvar = "Species.ID", times = c("early", "late", "wet"),
varying = structure(c("Y_num_mean.early", "N.early", "sd.early",
"se.early", "Y_num_mean.late", "N.late", "sd.late", "se.late",
"Y_num_mean.wet", "N.wet", "sd.wet", "se.wet"), .Dim = c(4L,
3L))), .Names = c("v.names", "timevar", "idvar", "times",
"varying")))
library(metafor)
# Warning produced from this command
test <- escalc(measure="ROM", m1i=Y_num_mean.early, sd1i=sd.early, n1i=N.early, m2i=Y_num_mean.late, sd2i=sd.late, n2i=N.late, data=dat, add=1/2, to="only0")
The paragraph you are quoting applies to measures that one can calculate based on 2x2 tables (i.e., RR, OR, RD, AS, and PETO). The add and to arguments do not have any effect for measures such as SMD and ROM.
The only way you can get a mean of 0 for a ratio scale variable (which is what use of response ratios assumes) is if every value is equal to 0. Therefore, by definition, the variance must also be 0. This applies whether the sample size is 1 (in which case the variance is of course also 0) or whether you have a larger sample size.
In general, whenever at least one of the two means is 0, one cannot calculate the log response ratio. Of course, one could start adding some kind of constant to the means manually (and the same for the SDs), but this seems rather arbitrary. The adjustments we can do to counts in 2x2 tables are motivated based on statistical theory (those adjustments are actually bias reductions, which also happen to make the calculation of certain measures possible when there is a 0 count).
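To see concretely why a zero mean breaks the log response ratio, here is a minimal base R sketch that computes the measure by hand. It assumes the usual formulas for ROM, yi = log(m1/m2) and vi = sd1^2/(n1*m1^2) + sd2^2/(n2*m2^2). The helper name log_rr and the toy numbers are made up for illustration and are not part of metafor.

```r
# Hypothetical helper (not a metafor function): log response ratio by hand.
log_rr <- function(m1, sd1, n1, m2, sd2, n2) {
  yi <- log(m1 / m2)                                # effect size
  vi <- sd1^2 / (n1 * m1^2) + sd2^2 / (n2 * m2^2)   # sampling variance
  data.frame(yi = yi, vi = vi)
}

# Toy values (illustration only): row 2 has a zero mean in the second group.
toy <- data.frame(m1 = c(12.5, 5), sd1 = c(22.4, 7.1), n1 = c(4, 2),
                  m2 = c(3.0, 0), sd2 = c(8.0, 0.0), n2 = c(10, 2))

es <- with(toy, log_rr(m1, sd1, n1, m2, sd2, n2))

# Flag the rows where both the estimate and its variance are defined.
usable <- is.finite(es$yi) & is.finite(es$vi)
es[usable, ]
```

Rows flagged as unusable can then be dropped before fitting a model, rather than patched with an arbitrary constant.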
I want to use a two-way ANOVA to test, for each of the 10 environmental variables (height, iwdo, rdos, etc., through no2), for differences among period and site.
This is for three independent watersheds, grouped in stream.
For each stream I need to check normality with shapiro.test and homoscedasticity with leveneTest. Then I run the model aov(nest_database[nest_database=="stream name (i.e. smeltaite)",]environmental variable (i.e.iwdo)~period*site).
So, is there a way to automate this process for the three streams and, at the same time, repeat it for each environmental variable column, giving me a summary of the shapiro.test, leveneTest and aov results respectively?
Below is the head of my dataset:
nest_data<-structure(list(stream = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label =
c("blendziava",
"smeltaite", "sventoji"), class = "factor"), period = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("February", "March", "April",
"May"), class = c("ordered", "factor")), site = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("N", "NN"), class = "factor"),
stake = c("A", "A", "B", "B", "C", "C"), class = c("low",
"medium", "low", "low", "low", "high"), height = c(0, 10,
0, 3.5, 0, 15), iwdo = c(13, 8.37, 10.8, 3.3, 11, 5.3), rdos = c(89.041095890411,
57.3287671232877, 73.972602739726, 22.6027397260274, 75.3424657534247,
36.3013698630137), iwc = c(359, 375, 357, 340, 360, 357),
dwc = c(2, 14, 4, 21, 1, 4), iwt = c(2.2, 2.1, 2.3, 2.3,
2.6, 2.3), dt = c(0, 0.1, 0.0999999999999996, 0.0999999999999996,
0.4, 0.0999999999999996), no3 = c(0.8104551, 0.6300294, 1.1296698,
1.2962166, 0.963123, 1.240701), nh4 = c(0.2187052, 0.1457344,
0.186718, 0.2177056, 0.2297008, 0.2187052), no2 = c(0.0133336,
0.0100408, 0.0116872, 0.0083944, 0.0127848, 0.009492)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
So far I'm using the code:
library(dplyr)  # provides %>%
library(broom)  # provides tidy()

nest_data %>%
  split(.$stream) %>%
  purrr::map(function(x) {
    aov(iwdo ~ period * site, data = x) %>%
      tidy()
  }) -> results
df <- as.data.frame(do.call(rbind, results))
This lets me run the test on the three streams, but only on one column.
I presume I should use a for loop, but I'm not sure where to put it inside the function.
Thanks in advance, and I hope I was clear since this is my first question here!
Consider generalizing all your steps into a function, then calling that function iteratively; the base R functions by and sapply can help with this. Use reformulate to build the formula for each variable. Please fill in each ellipsis (...).
env_vars <- c("height", "iwdo", "rdos", ..., "no2")
proc_model <- function(sub_df) {
  # NAMED LIST OF ENVIRONMENT VARS MODEL AND TESTS
  sapply(env_vars, function(env) {
    model <- aov(reformulate("period*site", env), data = sub_df)
    sp <- shapiro.test(...)
    lv <- leveneTest(...)
    # NAMED LIST OF MODEL AND TESTS
    list(
      aov_result = model, shapiro_test = sp, levene_test = lv
    )
  }, simplify=FALSE)
}
# NESTED NAMED LIST BY STREAM FOR EACH ENV VAR
results_list <- by(nest_data, nest_data$stream, proc_model)
To access results:
results_list$smeltaite$height$aov_result
results_list$smeltaite$height$shapiro_test
results_list$smeltaite$height$levene_test
For your original implementation:
results <- nest_data %>%
split(.$stream) %>%
purrr::map(proc_model)
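As a rough end-to-end illustration of the pattern above, the sketch below runs on simulated data using base R only. Several choices here are assumptions of mine rather than part of the answer: the simulated values, running shapiro.test on the model residuals, and using bartlett.test as a base R stand-in for leveneTest (which comes from the car package and could be used at that step instead). It also uses split() with lapply() rather than by(), so the result is a plain named list.

```r
set.seed(1)  # simulated data, for illustration only
sim <- expand.grid(stream = c("blendziava", "smeltaite", "sventoji"),
                   period = c("February", "March"),
                   site   = c("N", "NN"),
                   rep    = 1:5,
                   stringsAsFactors = TRUE)
sim$iwdo   <- rnorm(nrow(sim), mean = 10, sd = 2)
sim$height <- rnorm(nrow(sim), mean = 5,  sd = 1)

env_vars <- c("iwdo", "height")

proc_model <- function(sub_df) {
  sapply(env_vars, function(env) {
    model <- aov(reformulate("period*site", env), data = sub_df)
    # Assumption: normality is checked on the model residuals
    sp <- shapiro.test(residuals(model))
    # Assumption: Bartlett's test across period-by-site cells,
    # used here only because it ships with base R
    lv <- bartlett.test(sub_df[[env]],
                        interaction(sub_df$period, sub_df$site))
    list(aov_result = model, shapiro_test = sp, variance_test = lv)
  }, simplify = FALSE)
}

# Nested named list: stream -> environmental variable -> results
results_list <- lapply(split(sim, sim$stream), proc_model)

summary(results_list$smeltaite$iwdo$aov_result)
results_list$smeltaite$iwdo$shapiro_test$p.value
```

Because every element is named, individual results can be pulled out by stream and variable, or the whole structure can be walked with a nested lapply to build a summary table.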
I'm trying to write an SPSS file with some haven_labelled variables and factors that are created from that variable. It's just convenient for me and my use case to use nearly identical variable names. I've used all lower case for the haven_labelled variables and title case for the respective factor variable.
When I export the data frame with write_sav, SPSS records the name of the title-case factor as var1 rather than the title-case name, in this case Francophone. Note that when I change the variable name substantially, the name is written as expected.
#This makes the data frame of haven labelled variable and a corresponding factor
test<-structure(list(francophone = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), labels = c(Francophone = 1, `Not Francophone` = 0), label = "Dichotomous variable, R is francophone", class = c("haven_labelled", "vctrs_vctr", "double")), Francophone = structure(c(1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Not Francophone", "Francophone"), label = "Dichotomous variable, R is francophone", class = "factor")), row.names = c(NA,-10L), class = c("tbl_df", "tbl", "data.frame"))
#Create a second factor variable equivalent to the first factor, but with a different variable name
test$Franco<-test$Francophone
library(tidyverse)
library(haven)
#Write out the file; sorry I do not know how to use tempfile() in this case.
test %>%
write_sav(., path="~/Desktop/test2.sav")
To close the loop, variable names in SPSS have to be unique, and uniqueness is checked without regard to case, so francophone and Francophone clash: https://www.ibm.com/docs/en/spss-statistics/version-missing?topic=list-variable-names
This has always been the case (and will probably not change).
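One way to catch this before exporting is to look for column names that differ only by case. The sketch below is a minimal base R check, assuming (as the behaviour in the question suggests) that SPSS compares names case-insensitively. The helper name case_clashes and the toy data frame are hypothetical and not part of haven.

```r
# Hypothetical helper: return column names that collide once case is ignored.
case_clashes <- function(df) {
  lower <- tolower(names(df))
  names(df)[lower %in% lower[duplicated(lower)]]
}

# Toy data frame mirroring the naming pattern in the question.
toy <- data.frame(francophone = c(0, 0),
                  Francophone = c("a", "b"),
                  Franco      = c("a", "b"))

case_clashes(toy)
```

Any names this returns would need to be renamed to something distinct (for example by adding a suffix) before calling write_sav.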
I have this data set; I've attached a screenshot of the real data rather than code.
Sorry for the mess, I am new to R.
[image: screenshot of the data]
I want to turn the categorical "13 Source" column into dummy variables, summarized by "HH No". The result should look like this:
[image: desired output]
I've tried to.dummy from varhandle and model.matrix, but ended up with a messy dataset.
Could anybody help me with this?
Thanks a million in advance.
There are a number of ways to make dummy variables from factors - here is one way to create a summary presence table.
Assume df is your data frame. You can use xtabs to start with, which will create a frequency table from your 2 columns.
By comparing to see if your values are > 0, you will get TRUE if > 0, and FALSE otherwise. Adding 0 at the end will make TRUE the number 1 and FALSE the number 0.
(xtabs(~ HH_No + Source, df) > 0) + 0
Output
     Source
HH_No Deep_well Rainwater
    1         1         1
    3         1         1
    4         0         1
Data
df <- structure(list(HH_No = c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3,
3, 3, 4, 4), Source = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c("Deep_well",
"Rainwater"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L))
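If a regular data frame with one row per household is needed rather than a table, the presence matrix can be converted afterwards. The following is a small sketch of one way to do that in base R; the toy data and the object names presence and out are made up for illustration.

```r
# Toy data (illustration only), same shape as the example above.
toy <- data.frame(
  HH_No  = c(1, 1, 1, 3, 3, 4),
  Source = factor(c("Rainwater", "Deep_well", "Rainwater",
                    "Rainwater", "Deep_well", "Rainwater"))
)

# Counts per household and source, turned into 0/1 presence flags.
presence <- (xtabs(~ HH_No + Source, toy) > 0) + 0

# Convert the two-way table to a data frame, one column per source,
# and move the household id from the row names into its own column.
out <- as.data.frame.matrix(presence)
out <- cbind(HH_No = rownames(out), out)
rownames(out) <- NULL
out
```

Note that HH_No comes back as character here, since it is taken from the row names; wrap it in as.numeric() if a numeric id is required.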
I am trying to see if there is a relation between the proportion of target vs non target species in some communities and their phylogenetic structure. I cannot show you the real data but it looks something similar to this (although I have over 18000 data points and 'hab' has 12 levels):
df <- structure(list(hab = structure(c(6L, 6L, 6L, 6L, 6L, 6L, 6L,12L, 12L, 9L),
.Label = c("Eur_nitro_herb", "Forest_deciduous","Forest_everg_Eur",
"Forest_med", "Grass_alp", "Grass_mont_subalp","Margin_mantle", "Med_nitro_herb", "Rocks", "Shrub_med", "Shrub_mont_subalp","Wet_humid"), class = "factor"),
ses_mpd = c(0.05408747785078,-0.578266990137644, -1.48812316684822, -0.345401572814568, 0.124151290090708,-1.51817069020564, 0.0530607986221243, 0.00261416940904258, 0.665908557766837,-0.701477005797007),
target = c(1, 2, 3, 1, 2, 0, 0, 0, 0, 1),non_target = c(32, 27, 20, 30, 34, 26, 30, 9, 12, 6)), row.names = c(1L,2L, 3L, 4L, 5L, 6L, 7L, 18793L, 18794L, 18795L), class = "data.frame")
df
ses.mpd, as calculated by the picante package, is a standardized effect size, so it can take positive and negative values as well as zero.
I want to see the relationship between proportion of target and non target with ses_mpd controlling for possible differences between habitats. I have used a Generalized Linear Mixed Model to do so but when I check the residuals they look fairly skewed:
library(lme4)
mod<-glmer(cbind(target,non_target)~ses_mpd+(1|hab), family = binomial(logit), data = df)
plot(mod, resid(., type='response')~fitted(.), main="Normalized Residuals v Fitted Values",abline=c(0,0))
res <- resid(mod, type="response")
qqnorm(res)
qqline(res)
Given the large sample size, I was expecting the residuals to be normal, but I was clearly wrong. I guess these results cannot be trusted, so my question is whether there is another way to analyze these data.
Cheers and thanks in advance.
I made an ordination of a time series of some vegetation data, using the vegan package. Since ordination diagrams are often cluttered with many data points, I extracted the eigenvalues of the first two ordination axes and took the mean of each group. Now I have only one point per site (11 sites total). To still show some of the variation, I added ellipses with standard deviation and 95% confidence interval:
The last thing I want to do is to connect points of the same group (either A, B or C) with an arrow, indicating direction of change over time. All movement is from right to left.
I initially wanted to use the ordiarrows function in vegan, but this works only when the class is decorana. My class is a factor.
Using ggplot2 does not seem like a valid option as the ordiellipse function (creating the ellipses) does not work there.
code for plotting data:
install.packages("vegan")
library(vegan)
plot(Ord_KIKKER, type = "n", main = "Kikkervalleien",
xlab = "DCA1 Eigenvalue = 0.62", ylab = "DCA2 Eigenvalue = 0.39")
points(Ord_KIKKER, cex = 2, pch = 19,
col = c("black", "black", "black", "red","red", "green", "green", "green", "blue", "blue", "blue"))
The resulting plot looks a bit different since I posted a reduced dataset here.
My data (Ord_KIKKER):
structure(list(DCA1 = c(2.676616032, 0.361181861, -1.363464067,
3.176862449, -0.087190269, 2.059548542, 0.167440366, -0.459090096,
1.571536367, 0.309623788, -0.25787459), DCA2 = c(0.276788721,
0.422077659, 0.181723453, 0.221610649, 0.940063655, -0.116083905,
-0.539375059, -0.545053063, -0.06120542, -0.367148924, -1.679257818
), Unique = structure(c(1L, 5L, 8L, 2L, 9L, 3L, 6L, 10L, 4L,
7L, 11L), .Label = c("2001A", "2001B", "2001C", "2001D", "2008A",
"2008C", "2008D", "2018A", "2018B", "2018C", "2018D"), class = "factor"),
BLOCK = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L,
4L), .Label = c("A", "B", "C", "D"), class = "factor")), .Names = c("DCA1",
"DCA2", "Unique", "BLOCK"), class = "data.frame", row.names = c("2001A",
"2008A", "2018A", "2001B", "2018B", "2001C", "2008C", "2018C",
"2001D", "2008D", "2018D"))
vegan::ordiarrows() will work, if you give it only the variables that have scores:
ordiarrows(Ord_KIKKER[,1:2], Ord_KIKKER$BLOCK) # one way
However, you should also remember to set asp=1 in the initial plot to force an equal aspect ratio for the axes.
I cannot do full testing, because the graph cannot be reproduced with the data you posted: if you issue plot(Ord_KIKKER, ...) with a data frame, you will not get an ordinary plot but a panel plot of all variables against each other (a pairs() plot), and the type = "n" argument will also give an error. It seems that you instead used some non-standard graphics tools, and I am not sure that the standard R graphics of vegan::ordiarrows() can be combined with those.
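If the scores live in a plain data frame, the same idea can also be sketched with base graphics alone: order the points by time within each group and draw an arrow from each point to the next. The code below is only an illustration under that assumption; the numbers, the separate year column, and the helper name arrow_segments are made up and are not part of vegan.

```r
# Made-up mean scores for two groups over three years (illustration only).
sc <- data.frame(
  DCA1  = c(2.7, 0.4, -1.4, 2.1, 0.2, -0.5),
  DCA2  = c(0.3, 0.4, 0.2, -0.1, -0.5, -0.5),
  year  = c(2001, 2008, 2018, 2001, 2008, 2018),
  BLOCK = factor(c("A", "A", "A", "C", "C", "C"))
)

# Hypothetical helper: one row per arrow, from each time point to the
# next one within the same group.
arrow_segments <- function(d) {
  d <- d[order(d$BLOCK, d$year), ]
  do.call(rbind, lapply(split(d, d$BLOCK), function(g) {
    n <- nrow(g)
    if (n < 2) return(NULL)  # a single point has nothing to connect to
    data.frame(BLOCK = g$BLOCK[-n],
               x0 = g$DCA1[-n], y0 = g$DCA2[-n],
               x1 = g$DCA1[-1], y1 = g$DCA2[-1])
  }))
}

segs <- arrow_segments(sc)

# asp = 1 keeps the two axes on the same scale.
plot(sc$DCA1, sc$DCA2, asp = 1, pch = 19, xlab = "DCA1", ylab = "DCA2")
arrows(segs$x0, segs$y0, segs$x1, segs$y1, length = 0.1)
```

Keeping the segment computation separate from the drawing makes it easy to check the direction of change (here every arrow points toward smaller DCA1) before anything is plotted.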