Unbalanced ANOVA, R possibly disregarding duplicates

After running a one-way ANOVA on my dataset, I noticed that it reports the results as unbalanced despite there being an equal number of entries for every variable.
Then, using ezPrecis to look at the data frame, it seems that some values are not being counted even though the correct number of rows is registered. For example, for just method C from id 1, it says there are 46 values in ct even though it registers 50 rows (and has 50 values under ct). Is it possible that R is disregarding the duplicate values? Looking at the raw file, there are four 400's and two 1684's; if you eliminate the duplicates, that is exactly 4 items not counted, which lines up with the 46 ct's shown by ezPrecis. Is this why the ANOVA is unbalanced? If so, how do I fix it?
library(ez)
library(dplyr)    # needed for %>%, group_by, summarise
library(forcats)  # needed for as_factor
data1 <- read.csv("data.csv")
data1
data1$id <- as.character(data1$id)
data1$id <- as_factor(data1$id)
data1$method <- as_factor(data1$method)
ezPrecis(data1)
ezDesign(data = data1, x = method, y = id)
data2 <- data1 %>%
  group_by(method) %>%
  summarise(mean = mean(ct, na.rm = TRUE),
            sd = sd(ct, na.rm = TRUE),
            se = sd(ct) / sqrt(length(ct)))
data2
data2anova <- ezANOVA(data = data1, dv = ct, wid = id, within = .(method),
                      type = 3, detailed = TRUE, return_aov = TRUE)
data2anova
Raw data: https://ufile.io/cfe1w

All the rows are used in the ANOVA. The ezPrecis function reports the number of unique values in each column. This is clear from the function's help page, which describes the "values" column as "unique"; why that column name was changed to "values" is anyone's guess.
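A minimal sketch, with made-up numbers mimicking a 50-row cell containing four 400's and two 1684's, shows the distinction:
# 44 distinct values plus four 400's and two 1684's: 50 rows in total
x <- c(rnorm(44), rep(400, 4), rep(1684, 2))
length(x)         # 50 rows, all of which enter the ANOVA
length(unique(x)) # 46, the count ezPrecis reports as "values"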
The output from the ANOVA says "Estimated effects may be unbalanced". The number of replications for each variable, as computed by the replications function, is likely inspected during the aov processing, and the warning tells you that some imbalance may be present.
Your data frame yields the following:
replications(~ . - ct, data = data1)
       id    method testblock     trial
      150       100        60        30
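If you want to inspect the cell counts yourself, a quick cross-tabulation (a sketch, assuming the id and method columns shown above) makes any imbalance visible:
# rows per id/method cell; unequal counts indicate imbalance
with(data1, table(id, method))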

Related

Dealing with Missing Values for one Variable in R

I'm currently dealing with a data set that has missing values, but only for one single variable. I was trying to determine whether they are missing at random, so that I can simply remove them from the data frame. Hence, I am trying to find potential correlations between the NA's in the data frame and the values of the other variables. I found the following code online:
library("VIM")
data(sleep)
x <- as.data.frame(abs(is.na(sleep)))
head(sleep)
head(x)
y <- x[which(sapply(x, sd) > 0)]
cor(y)
However, this only shows how the missing values themselves are correlated with each other, in case they are distributed across all variables.
Is there a way to find not the correlation between the missing values in a data frame, but the correlation between the missing values of one variable and the values of another variable? For example, if a survey asks optionally for family income, how could you determine in R whether the missing values are correlated with, say, low income?
library(finalfit)
library(dplyr)
df <- data.frame(
  A = c(1, 2, 4, 5),
  B = c(55, 44, 3, 6),
  C = c(NA, 4, NA, 5)
)
# plot each variable split by the missingness of the others
df %>%
  missing_pairs("A", "C")

Student t-test (Paired) on multiple matrices with different number of rows using R

I need to run Student's t-test on the columns of two matrices with dimensions 21 x 4044 and 36 x 4044, respectively. The columns are identical in both; only the number of rows differs.
Sample code for my example input data
mat1 <- matrix(rnorm(100), ncol = 5)
mat2 <- matrix(rnorm(125), ncol = 5)
f <- function(x, y) {
  test <- t.test(x, y, paired = TRUE)
  out <- data.frame(stat = test$statistic,
                    df = test$parameter,
                    pval = test$p.value,
                    conl = test$conf.int[1],
                    conh = test$conf.int[2])
  return(out)
}
sapply(seq(ncol(mat1)), function(x) f(mat1[, x], mat2[, x]))
But it gives the following error:
Error in complete.cases(x, y) : not all arguments have the same length
How do I deal with this error?
It works fine for matrices having the same number of rows.
A paired t-test assumes that you have two results for each entity; for example, you might measure the heart rate of the same person before and after a race, leaving you with reading 1 and reading 2 that are 'paired'. This is what you're requesting with paired = TRUE.
In your example, you have differently sized vectors, which suggests that you are not recording two readings for the same entity. From here:
If you have not been collecting pairs of readings from the same subject, switch to paired = FALSE.
If you have been collecting pairs of readings from the same subject, then you are missing some readings (by virtue of one column having more readings than the other) and you should remove the cases where you don't have both.
Hopefully that makes sense and helps a little.
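For reference, a sketch of the unpaired version (f2 is just your f with paired = FALSE; this assumes the samples are independent):
f2 <- function(x, y) {
  # Welch's two-sample t-test tolerates vectors of different lengths
  test <- t.test(x, y, paired = FALSE)
  data.frame(stat = test$statistic,
             df = test$parameter,
             pval = test$p.value,
             conl = test$conf.int[1],
             conh = test$conf.int[2])
}
sapply(seq(ncol(mat1)), function(i) f2(mat1[, i], mat2[, i]))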
EDIT: Having just made that change and run your code, I get:
     [,1]       [,2]       [,3]       [,4]      [,5]
stat -0.1336019 -0.8981109 -0.1962769 0.9045503 0.3164153
df   42.35801   42.9418    38.21301   40.52551  41.40109
pval 0.8943501  0.3741347  0.8454336  0.3710499 0.7532772
conl -0.7211962 -1.069044  -0.6361448 -0.363129 -0.5404484
conh 0.6316144  0.4102729  0.5236731  0.9519358 0.7413329

R Replace Intermittent NA Values With Last Observation Carried Forward (NA.LOCF)

Background
I need to replace the NA's in my data frame using different methods depending on the nature of each NA. My data frame comes from a study with repeated measures, where some of the NA's are the result of subjects dropping out, while others are the result of intermittent missing measurements, defined as one or a sequence of several missing measurements followed by a measured value.
I will be referring to intermittent missing measurements as intermittent NA's.
Problem
I am having trouble testing whether the NA's are the result of intermittent missing measurements, and deciding which functions to use to replace them. I would ideally replace the intermittent NA's using the na.locf method. But dropout NA's need to be replaced with the baseline OR the last observed value, whichever is greater.
Examples
Example 1
Here is a clean example of NA's that I want to be treated as intermittent NA's with the na.locf imputation:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,NA,NA,15,16,19,NA,12,23,31))
and how I want the end result to be:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,34,34,15,16,19,19,12,23,31))
Example 2
Here is a clean example of NA's (dropout NA's) that I want imputed with the previous non-NA observation OR the baseline value (visit 1), whichever is greater:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,22,18,15,16,19,NA,NA,NA,NA))
And how I want the end result to be:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,22,18,15,16,19,34,34,34,34))
Example 3
Here is a more complex example with a mixture of NA's needing different imputations, where the previous non-NA observation is greater than the baseline observation (visit 1) for the dropout NA's:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,NA,NA,42,16,19,NA,38,NA,NA))
How I need the result to be:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,34,34,42,16,19,19,38,38,38))
Example 4
Another complex example, where the baseline observation (visit 1) is greater than the previous non-NA value for the dropout NA's:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(40,NA,NA,42,16,19,NA,38,NA,NA))
How I need the result to be:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(40,40,40,42,16,19,19,38,40,40))
What I have tried
As suggested by @Gregor, upon my stating that this would solve my problem, it is possible to test for the presence of intermittent NA's with:
mutate(intermittent = is.na(value) & !is.na(lead(value)))
But this does not help me impute all the intermittent NA's, in particular those that occur in a sequence (NA1, NA2, NA3, 14), where only NA3 is returned as TRUE by this test.
We can use na.locf(..., fromLast = TRUE) to identify the trailing NA values and use pmax on them with the baseline. We'll demonstrate on the examples from your question in a nice all-together format:
# consolidate example data (examples 1-3 from the question)
dd = data.frame(
  example = rep(1:3, each = 10),
  visit = rep(1:10, 3),
  value = c(34,NA,NA,15,16,19,NA,12,23,31,
            34,22,18,15,16,19,NA,NA,NA,NA,
            34,NA,NA,42,16,19,NA,38,NA,NA),
  goal = c(34,34,34,15,16,19,19,12,23,31,
           34,22,18,15,16,19,34,34,34,34,
           34,34,34,42,16,19,19,38,38,38)
)
library(dplyr)
dd = dd %>% group_by(example) %>%
  mutate(
    # TRUE wherever a later non-NA value exists (observed values and
    # intermittent NA's); FALSE only for trailing (dropout) NA's
    to_fill = !is.na(zoo::na.locf(value, fromLast = TRUE, na.rm = FALSE)),
    # intermittent NA's get plain LOCF; trailing NA's get the larger of
    # the baseline (first value) and the last observed value
    result = if_else(to_fill,
                     zoo::na.locf(value, na.rm = FALSE),
                     pmax(first(value), zoo::na.locf(value, na.rm = FALSE)))
  )
all(dd$goal == dd$result)
# [1] TRUE
As you can see, the result matches the goal column perfectly.

Covariance matrices by group, lots of NA

This is a follow-up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group-specific covariance matrices for them (based on the variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis; removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
MMatrix = MMatrix2[1:2187, 4:10]
This worked fine for calculating an overall covariance matrix with:
cov(MMatrix, use = "pairwise.complete.obs", method = "pearson")
To get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ operator) with:
CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
cov.list <- lapply(unique(CovDataM$group), function(x) cov(CovDataM[CovDataM$group == x, -1]))
I figured this was because of my NA's, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the call, and it still returned NULL. I read somewhere that "pairwise.complete.obs" can only be used with method = "pearson", but adding that didn't make a difference either. I need covariance matrices of these variables by group, with all the available data included if possible, and I am thoroughly stuck.
Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol = 6,
            dimnames = list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels = paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use = 'pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
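The individual matrices can then be pulled out by name or position, for example:
# covariance matrix for the first group
covmats[["group 1"]]
covmats[[1]]  # equivalent, by position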
Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3, 4] <- NA
CovData[1, 3] <- NA
CovData[4, 2] <- NA
CovDataM <- data.frame(CovData, group = c(rep("a", 5), rep("b", 5), rep("c", 5)))
colnames(CovDataM) <- c("a", "b", "c", "d", "e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.
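If you would rather end up with a list, as in your original lapply attempt, a sketch using split on the toy CovDataM above (assuming your grouping column really is named group) sidesteps the indexing problem:
# split the measurement columns by group, then compute each covariance matrix
cov.list <- lapply(split(CovDataM[, 1:5], CovDataM$group),
                   cov, use = "pairwise.complete.obs", method = "pearson")
cov.list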

Finding the mean of a variable within an imputed data set for population quintiles

I have a data set "base_data" which has missing values. I have therefore used the package 'Amelia' to impute the missing values into an object "a.output".
I have been able to find the mean for some variables within the imputed results using the following code:
q.out <- NULL
se.out <- NULL
for (i in 1:m) {
  dclus <- svydesign(id = ~site, data = a.output$base_data[[i]])
  q.out <- rbind(q.out, coef(svymean(~hh_expenditure, dclus)))
  se.out <- rbind(se.out, SE(svymean(~hh_expenditure, dclus)))
}
I have combined the results using:
svymean.combine <- mi.meld(q = q.out, se = se.out)
This gives me the mean and standard error for household expenditure (hh_expenditure) across the population.
However, I have a variable which splits the population into wealth quintiles (wealth_quin).
As such, I now want to find the average, and standard error, of the household expenditure per wealth_quin (a variable which is 1, 2, 3, 4, or 5).
I initially tried subsetting the imputed data, but this came up with many errors.
Is there a way to do this without having to split up the data into the 5 wealth quintiles before imputing the data?
Cheers,
Timothy
EDIT: HERE IS A WORKABLE EXAMPLE
require(Amelia)
require(survey)
a <- as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b <- as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c <- as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d <- as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e <- as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute <- cbind(a, b, c, d, e)
names(impute) <- c("X", "site", "var2", "var3", "hh_inc")
So now we have a data frame to work with, with missing values for hh_inc which I want to impute.
First, set the number of imputations:
m <- 5
Now run the imputation:
a.output <- amelia(x = impute, m = m, autopri = 0.5, cs = "X",
                   idvars = c("site", "var2"),
                   logs = c("hh_inc", "var3"))
a.output now holds the data from the 5 imputations.
What I now want to do is find the average (and standard error) of hh_inc for site 1 and site 2 separately, using the imputed values from amelia.
How is that possible to do? I know it is possible if I just ignore the NA's, but that might introduce bias, which is why I imputed the values in the first place.
Cheers,
Timothy
EDIT:
I have placed a bounty on this. If no one knows the exact way to do it, then the results from the individual imputed data sets can be combined using Rubin's formula (http://sites.stat.psu.edu/~jls/mifaq.html#minf).
As such, I will award the bounty to someone who can transform the 5 separate imputed datasets from the Amelia object into 5 separate, complete data frames.
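For what it's worth, the amelia object already stores the completed data sets as a list of data frames in its imputations element; a sketch using the a.output from above:
# each element of a.output$imputations is one completed data frame
imps <- a.output$imputations
length(imps)     # 5
head(imps[[1]])  # first completed data set
# optionally assign them to separate named objects (names are illustrative)
for (i in seq_along(imps)) assign(paste0("imputed_df", i), imps[[i]])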
require(Amelia)
require(survey)
require(data.table)
require(plotrix)
a <- as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b <- as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c <- as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d <- as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e <- as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute <- cbind(a, b, c, d, e)
names(impute) <- c("X", "site", "var2", "var3", "hh_inc")
summary(impute)
m <- 5
a.output <- amelia(x = impute, m = m, autopri = 0.5, cs = "X",
                   idvars = c("site", "var2"),
                   logs = c("hh_inc", "var3"))
stats.out <- NULL
for (i in 1:m) {
  df2 <- data.table(a.output$imputations[[i]])
  df3 <- data.frame(dataset = i,
                    df2[, list(std.error(hh_inc), mean(hh_inc)), by = "site"])
  stats.out <- rbind(stats.out, df3)
}
colnames(stats.out) <- c("dataset", "site", "stdError", "mean")
stats.out
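To combine the per-site means and standard errors across the 5 imputations, the same mi.meld approach from the top of the question can be applied per site; a sketch, assuming stats.out as built above:
# meld the imputation-specific estimates for each site (Rubin's rules)
for (s in unique(stats.out$site)) {
  q <- as.matrix(stats.out$mean[stats.out$site == s])
  se <- as.matrix(stats.out$stdError[stats.out$site == s])
  melded <- mi.meld(q = q, se = se)
  cat("site", s, "mean:", melded$q.mi, "SE:", melded$se.mi, "\n")
}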
I'm not sure I understand your question or the structure of your data (specifically the importance of whether the data is imputed or not), but here's how I've done some summary statistics by group.
require(data.table)
require(plotrix)
# create some data
df1 <- data.frame(id = seq(1, 50, 1), wealth = runif(50) * 1000)
df1$cutter <- cut(df1$wealth, 5, labels = FALSE)
head(df1)
# put the data into a data.table to speed things up
df2 <- as.data.table(df1)
head(df2)
grp1StdErr <- df2[, std.error(wealth), by = "cutter"]
grp1Mean <- df2[, mean(wealth), by = "cutter"]
Hope this helps.
Or, in one grouping step :
df2[, list(std.error(wealth), mean(wealth)), by = cut(wealth, 5, labels = FALSE)]
