How do I structure data to use R lmer

I am trying to do a trending analysis of reliability data. A typical case would be to determine if a 10-year trend exists in the demand rate for a specified system at specified plants.
I am trying to generate a test case but am a bit confused about how to structure the data. The trend years range from 2004 to 2013. In my test case I have, for each year, 10 systems for which demands have been counted. I am using normally distributed demand counts with different means and variances each year. Of course real data will likely not have the same system count each year, and the demand counts are not necessarily normally distributed.
The following R code produces a data frame (df1) that seems reasonable to me:
yr <- 2004:2013
y2004 <- rnorm(10, 10, 3)
y2005 <- rnorm(10, 11, 2)
y2006 <- rnorm(10, 12, 1)
y2007 <- rnorm(10, 13, 5)
y2008 <- rnorm(10, 14, 3)
y2009 <- rnorm(10, 15, 4)
y2010 <- rnorm(10, 16, 1)
y2011 <- rnorm(10, 17, 2)
y2012 <- rnorm(10, 18, 4)
y2013 <- rnorm(10, 19, 1)
df1 <- data.frame(cbind(yr), y2004, y2005, y2006, y2007, y2008, y2009, y2010, y2011, y2012,y2013)
df2 <- data.frame(cbind(rep(0.0, 100), rep(0.0, 100)))
names(df2) <- c("x", "y")
k <-1
for (i in 1:10) {
for (j in 1:10) {
df2$x[k] <- df1$yr[i]
df2$y[k] <- df1[j,i+1]
k <- k + 1
}
}
boxplot(y ~ x, df2)
Anyway, my first problem is the construction of df2 seems unnecessary given I already have the data in df1 - it's just that the call to lmer seems to require the organization of df2. My call to lmer looks like the following:
fit <- lmer(y ~ x + (1|x), data=df2)
So is there a way to use lmer without the construction of df2, using df1 directly? Or is there a better way to structure the data entirely?
My second problem is I am not really sure how to use lmer to do what I want to do. Basically I am looking to pool the count data for each year and fit the mean demand count each year with a straight line. The best fit should consider the variance in the data in each pooled year group. Am I going about it correctly?

Nearly all plotting and modeling functions in R require data in the "long" format (i.e., df2), so if anything, I would skip the construction of df1. If you want to generate df2 more directly, you could do
df2 <- do.call("rbind.data.frame", Map(cbind,
y=Map(function(n,m,s) rnorm(n,m,s), 10, 10:19, c(3,2,1,5,3,4,1,2,4,1)),
x=2004:2013))
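If the data already exist in a wide layout like df1 (one column per year, one row per system), the nested loop can be replaced by base R's stack(). The following is a minimal sketch under that assumption; the names wide and long are hypothetical, and the simulated values use the means and standard deviations from the question.

```r
set.seed(1)
yr    <- 2004:2013
means <- 10:19
sds   <- c(3, 2, 1, 5, 3, 4, 1, 2, 4, 1)

# Wide layout: one column per year, one row per system (as in df1)
wide <- as.data.frame(mapply(function(m, s) rnorm(10, m, s), means, sds))
names(wide) <- paste0("y", yr)

# stack() returns a long data frame with columns `values` and `ind`
long <- stack(wide)

# Recover the numeric year from the column name to get the x/y layout of df2
df2 <- data.frame(
  x = as.integer(sub("^y", "", as.character(long$ind))),
  y = long$values
)
```

The resulting df2 has one row per observation, which is the layout that boxplot(y ~ x, df2) and model-fitting functions expect.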


R using set.seed() within a purrr::map iteration structure

When generating new data in R, I can use set.seed() to ensure that I get the same data sets every time the code is run:
set.seed(12345)
a <- rnorm(500, mean = 50, sd = 10)
set.seed(12345)
b <- rnorm(500, mean = 50, sd = 10)
identical(a, b)
# TRUE
If I comment out the set.seed() lines, identical(a,b) returns FALSE.
Now I want to use a purrr::map() structure to generate multiple data sets with slightly different parameters:
library(tidyverse)
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
set.seed(12345)
data <- map2(
means,
sds,
~
rnorm(500, mean = .x, sd = .y)
)
The map2() call generates a list of three numeric vectors. With this relatively simple operation, I get identical output every time I run the code. But I'm finding that with more complex, longer functional pipelines involving certain packages (e.g., bestNormalize), I'm not getting identical output when the set.seed() command is outside the iterative looping structure of map().
I'm at a loss for how to bring set.seed() within the map() iteration structure so that it is called anew at the beginning of each iteration. To be clear, the larger goal is to be able to iterate over functions that use random number generation, and to get identical results every time. Perhaps there's a better way to accomplish this in the tidyverse that doesn't depend on set.seed(). Thanks in advance for any help!
I hope this answers your question about how to position the seed inside the map call:
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
myfun <- function(means, sds){
set.seed(12345) # set it before each call
ret <- rnorm(500, mean = means, sd = sds)
return(ret)
}
data <- purrr::map2(means,
sds,
~ myfun(.x, .y))
As a followup, here is the most concise way to solve my original problem:
library(tidyverse)
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
data <- map2(
means,
sds,
~ {
set.seed(12345)
rnorm(500, mean = .x, sd = .y)
}
)
This code returns identical results each time it is run.
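One caveat with resetting the same seed in every iteration: each call to rnorm() then starts from the same random stream, so the three vectors are the same underlying draws, only shifted and rescaled. If independent data sets are wanted, one option is to pass a different seed per iteration. The sketch below uses base R's Map() to stay self-contained; the seeds vector and the gen_one helper are hypothetical names, not part of the original code.

```r
means <- c(40, 50, 60)
sds   <- c(9, 10, 11)
seeds <- c(101, 102, 103)  # hypothetical: one seed per iteration

# Reset the RNG with the iteration's own seed, then draw the sample
gen_one <- function(m, s, seed) {
  set.seed(seed)
  rnorm(500, mean = m, sd = s)
}

run1 <- Map(gen_one, means, sds, seeds)
run2 <- Map(gen_one, means, sds, seeds)

identical(run1, run2)  # the two runs can be compared directly
```

With purrr, the same idea should work as pmap(list(means, sds, seeds), gen_one), since pmap() passes the elements of an unnamed list positionally.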

How do I run this command in the xlsx library?

I'm trying to run a command in this tutorial on RMANOVA:
https://finnstats.com/index.php/2021/04/06/repeated-measures-of-anova-in-r/
However, when I try to run this command:
data <- read.xlsx("D:/RStudio/data.xlsx",sheetName="Sheet1")
It gives me the following error:
Error in loadWorkbook(file, password = password) : Cannot find
D:/RStudio/data.xlsx
It appears that this is not loading because it requires a local data file that I don't have. However, I don't see that there is this data file in the tutorial page. Am I correct in assuming the file is missing, or is this command supposed to build the spreadsheet?
Any help would be great. Thanks!
In short
You are correct in assuming the file is missing. This command will not create any spreadsheets but it will read a spreadsheet stored at "D:/RStudio/data.xlsx".
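To make this failure mode explicit, the read can be guarded with file.exists(). This is only a sketch: read_if_present is a hypothetical helper, and it assumes the xlsx package is installed whenever the file is actually found.

```r
# Hypothetical helper: read the sheet only if the file is really there
read_if_present <- function(path, sheet = "Sheet1") {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  xlsx::read.xlsx(path, sheetName = sheet)
}

data <- read_if_present("D:/RStudio/data.xlsx")
```

If data comes back as NULL, the spreadsheet has to be obtained or created before the rest of the tutorial can run.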
Possible solution
You could create a dataset on your own, e.g., like this:
# Treatment A: Samples for the different time steps T0, T1, and T2
T0A <- rnorm(12, mean = 7.853, sd = 3.082)
T1A <- rnorm(12, mean = 9.298, sd = 2.090)
T2A <- rnorm(12, mean = 5.586, sd = 0.396)
# Treatment B: Samples for the different time steps T0, T1, and T2
T0B <- rnorm(12, mean = 7.853, sd = 3.082)
T1B <- rnorm(12, mean = 9.298, sd = 2.090)
T2B <- rnorm(12, mean = 5.586, sd = 0.396)
# Combine the values in a data.frame
data <- data.frame(time = c(rep("T0", 12), rep("T1", 12), rep("T2", 12),
rep("T0", 12), rep("T1", 12), rep("T2", 12)),
score = c(T0A, T1A, T2A, T0B, T1B, T2B),
Treatment = c(rep("A", 36), rep("B", 36))
)
# make time and Treatment factors
data$time <- as.factor(data$time)
data$Treatment <- as.factor(data$Treatment)
# Here, we are at the Summary Statistics step of the tutorial already
library(dplyr)
library(rstatix)
summary<-data %>%
group_by(time) %>%
get_summary_stats(score, type = "mean_sd")
data.frame(summary)
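Note that treatments A and B are drawn from the same distributions in this case (identical means and standard deviations), although the sampled values themselves differ. Depending on what you want to test, you can alter the mean and standard deviation of the different treatments.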
Additional ideas
You could also introduce outliers to your self-created data set. In the following, I just chose the mean value from T0A and 4 times the standard deviation of T0A. Then we can set a value N for how many potential outliers we want. Subsequently, we create random values that can be up to 4 standard deviations higher or lower than the mean of T0A and use those values to replace random score values within the data.frame. In this case we set a maximum of N = 1 outlier. Of course, this script could be adapted to set certain ranges of potential values dependent on the time and Treatment factors (but that's beyond the scope of this answer for now).
mean_value <- 7.853
extreme <- 4*3.082
N <- 1
outlier_values <- runif(N, min = mean_value - extreme, max = mean_value + extreme)
outliers <- round(runif(N, min = 1, max = nrow(data)), digits = 0)
data$score[outliers] <- outlier_values
In my opinion, this data set is more useful than any example data set, because you can now change mean and standard deviation values and introduce outliers, etc, so you can experiment with the data and see how your statistical tests respond in various situations.

vegdist function cannot handle datasets of abundance containing 0

As marine biologists, we need to figure out whether the abundance of 4 different fish species, counted three times over a year, differs from one artificial reef to another (reef A, B, and C) and from one month to another (June, September, November). For each area, 3 different replicates are generated (1, 2, 3).
Let's consider the gathered data (including the factors for better understanding) as follows:
data <- as.data.frame(matrix(NA, 27, 4, dimnames =
list(1:27, c("Diplodus sargus", "Chelon labrosus", "Oblada melanura", "Seriola dumerii"))))
#fish counts
data$`Diplodus sargus` <- as.numeric(c(0,0,0,0,0,0,0,0,0,5,0,0,3,0,0,0,0,1,0,0,0,0,0,0,4,0,0))
data$`Oblada melanura` <- as.numeric(c(0,0,0,10,0,0,0,0,0,0,0,0,10,5,0,0,0,0,1,0,2,3,0,2,0,0,0))
data$`Chelon labrosus`<- as.numeric(c(0,0,0,0,2,0,6,0,0,0,0,0,3,0,0,2,0,0,0,0,0,3,0,0,0,0,1))
data$`Seriola dumerii` <-as.numeric(c(4,0,2,0,1,1,0,0,9,0,0,0,0,0,3,0,0,7,0,0,0,8,0,0,0,1,0))
#factors
data$reef <- rep(c(rep("A", 3), rep("B",3), rep("C", 3)),3)
data$month <- rep(c("June", "September", "November"), each = 9)
data$combined <- c(rep("JuneA", 3), rep("JuneB",3), rep("JuneC", 3), rep("SepA", 3), rep("SepB",3), rep("SepC", 3),rep("NovA", 3), rep("NovB",3), rep("NovC", 3))
data$Replicate <- rep(c("1", "2", "3"), times = 9)
#square-root data
comp <- sqrt(data[, 1:4])
library(vegan)
mydist <- vegdist(comp, method = "bray")
pl.clust <- hclust(mydist, method = "complete")
Error in hclust(mydist, method = "complete") :
NA/NaN/Inf in foreign function call (arg 11)
The aim is to perform a permutational ANOVA on the Bray-Curtis similarities of square-root-transformed data in order to determine whether samples (assemblages of counted species) differ significantly depending on the factors (alone or combined). However, vegdist cannot handle this data set: some samples contain only zeros, so the resulting dist object contains NaN, which in turn cannot be handled by hclust or the adonis function. I thought of simply adding 1 to each count, as it is the differences between the samples that matter and not the absolute values. However, mydist <- ecodist::bcdist(squared_data,rmzero=FALSE) gives a very different result from that first solution. Is anybody familiar with this issue and how to handle it correctly?
Thank you; I look forward to your replies.
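The source of the NaN values can be reproduced without vegan. Bray-Curtis dissimilarity divides the summed absolute differences by the summed abundances, so a pair of samples that are both empty gives 0/0. The sketch below uses a hand-written bray helper (hypothetical, not vegan's implementation) on four rows taken from the data above (samples 1, 2, 3, and 8).

```r
# Hand-written Bray-Curtis dissimilarity between two abundance vectors
bray <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Square-root-transformed counts for samples 1, 2, 3 and 8 of the data above
comp <- sqrt(rbind(
  s1 = c(0, 0, 0, 4),
  s2 = c(0, 0, 0, 0),
  s3 = c(0, 0, 0, 2),
  s8 = c(0, 0, 0, 0)
))

bray(comp["s1", ], comp["s3", ])  # finite: both samples contain fish
bray(comp["s2", ], comp["s8", ])  # 0/0: both samples are empty

# Rows that sum to zero are the empty samples behind the NaN entries
empty <- rowSums(comp) == 0
which(empty)
```

Running rowSums(comp) == 0 on the full comp object identifies every empty sample, which shows which rows need a decision (for example removal or some adjustment) before calling hclust or adonis.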

Creating an Excel one-way data table in R -- Problem with my for loop

I'm trying to create an Excel one-way data table in R so that I can find the exponent that minimizes errors of a coefficient in an equation. I have a for loop that produces the correct result but it does something strange that I can't figure out.
Here is an example of the data. I'll use the Pythagorean Win formula from baseball and use a for loop to find the exponent that minimizes the mean absolute error in the win projections.
## Create Data
Teams <- c("Bulls", "Sharks", "Snakes", "Dogs", "Cats")
Wins <- c(5, 3, 8, 1, 9)
Losses <- 10 - Wins
Win.Pct <- Wins/(Wins + Losses)
Points.Gained <- c(30, 50, 44, 28, 60)
Points.Allowed <- c(28, 74, 40, 92, 25)
season <- data.frame(Teams, Wins, Losses, Win.Pct, Points.Gained, Points.Allowed)
season
## Calculate Scoring Ratio
season$Score.Ratio <- with(season, Points.Gained/Points.Allowed)
## Predict Wins from Scoring Ratio
exponent <- 2
season$Predicted.Wins <- season$Score.Ratio^exponent / (1 + season$Score.Ratio^exponent)
## Calculate Mean Absolute Error
season$Abs.Error <- with(season, abs(Win.Pct - Predicted.Wins))
mae <- mean(season$Abs.Error)
mae
Here is my for loop that is looking at a range of exponent options to see if any of them are better than the exponent, 2, used above. For some strange reason, when I run the for loop, it keeps repeating the table several times (many of the tables with incorrect results) until finally producing the correct table as the last one. Can anyone explain to me what is wrong with my for loop and why this is happening?
## Identify potential exponent options that minimize mean absolute error
exp.options <- seq(from = 0.5, to = 3, by = 0.1)
mae.results <- data.frame("Exp" = exp.options, "Results" = NA)
for(i in 1:length(exp.options)){
win.pct <- season$Predicted.Wins
pred.win.pct <-
(season$Points.Gained/season$Points.Allowed)^exp.options[i] /
(1 + (season$Points.Gained/season$Points.Allowed)^exp.options[i])
mae.results[i,2] <- mean(abs(win.pct - pred.win.pct))
print(mae.results)
}
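The repeated tables come from print(mae.results) sitting inside the loop: the whole data frame is printed once per iteration, and every table before the last still has NA in the rows that have not been computed yet. Below is a minimal sketch that fills the table first and prints it once. It assumes the intent is to compare predictions against the actual Win.Pct (the loop above compares against season$Predicted.Wins, i.e. the exponent-2 predictions); mae_for is a hypothetical helper name.

```r
season <- data.frame(
  Wins           = c(5, 3, 8, 1, 9),
  Points.Gained  = c(30, 50, 44, 28, 60),
  Points.Allowed = c(28, 74, 40, 92, 25)
)
season$Win.Pct <- season$Wins / 10  # every team played 10 games
ratio <- season$Points.Gained / season$Points.Allowed

# Mean absolute error of the predicted win percentage for one exponent
mae_for <- function(e) {
  pred <- ratio^e / (1 + ratio^e)
  mean(abs(season$Win.Pct - pred))
}

exp.options <- seq(from = 0.5, to = 3, by = 0.1)
mae.results <- data.frame(Exp = exp.options,
                          Results = sapply(exp.options, mae_for))

print(mae.results)  # printed once, after all rows are filled
best <- mae.results[which.min(mae.results$Results), ]
```

Keeping the original for loop also works; moving print(mae.results) to after the closing brace is enough to stop the intermediate tables from appearing.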

Cluster robust standard errors after multiple imputation using mice R package

I would like to compute cluster robust standard errors using a mids class object. This arises from multiple imputation of missing values in a column of my original data. A minimal example is below.
library(mice)
y <- c(1,0,0,1,1,1,1,0)
x <- c(26, 34, 55, 15, 31 ,47, 97, 12)
z <- c(2, NA, 0, NA, 3 ,7,7, 5)
mydata <- as.data.frame(cbind(y,x,z))
tempData <- mice(mydata,m=5,maxit=5,meth='pmm',seed=500)
class(tempData)
# [1] "mids"
modelFit <- with(tempData,lm(y ~ x + z))
summary(modelFit)
At this point I would like to get the cluster robust standard errors. Unfortunately miceadds::lm.cluster does not allow "mids" class objects.
The function lm.cluster in miceadds is intended for regular data frames. An example for an application to multiply imputed data is given in the documentation.
Given below is a version adapted to your question. I used the first variable as a cluster indicator because your example didn't have one.
library(mice)
library(miceadds)
id <- c(1,0,0,1,1,1,1,0)
y <- c(26,34,55,15,31,47,97,12)
x <- c(2,NA,0,NA,3,7,7,5)
dat <- data.frame(id,y,x)
imp <- mice(dat, m=5, maxit=5, method='pmm', seed=500)
implist <- lapply(1:5, function(i) complete(imp,i))
mod <- lapply( implist, function(i){
lm.cluster( i, formula=y~x, cluster=i$id )
})
# extract parameters and covariance matrices
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# pool
summary(pool_mi( qhat=betas, u=vars ))
