Comparing means (int) by zipcode (factor) in R

I have a list of zipcodes and the number of covid deaths per zipcode in a data frame (not real numbers, just examples):
City            Total
Richmond          552
Las Vegas         994
San Francisco     388
I want to see if there is any relationship between zipcode and the total number of deaths.
I made a linear model using the lm() function:
mod_zip <- lm(Total ~ City, data=zipcode)
But when I call summary(mod_zip) I get NA for everything except the estimate column.
Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
CityRichmond      2851          NA       NA        NA
CityLasVegas     -2604          NA       NA        NA
CitySanFran       -966          NA       NA        NA
What am I doing wrong?

lm() turns the factor into dummy (one-hot) columns, so you have a global intercept plus a parameter for each city except the reference level.
Then (assuming, without seeing your data, that you have one row per city) you are estimating n data points with n parameters. The fit succeeds, but it leaves no residual degrees of freedom, so standard errors cannot be estimated.
Simplified Example to reproduce:
df <- data.frame(x = LETTERS, y = rnorm(26), stringsAsFactors = TRUE)
fit <- lm(y~x, data = df)
summary(fit)
You will see an intercept and parameters for levels B through Z (26 parameters for 26 observations); the residual degrees of freedom are therefore 0, so the standard errors and related statistics cannot be calculated.
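You can confirm this directly from the fitted object:
# As many coefficients as observations leaves no residual degrees of freedom
length(coef(fit))   # 26 coefficients (intercept + 25 dummy terms)
nobs(fit)           # 26 observations
df.residual(fit)    # 0, so standard errors cannot be estimated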

It sounds like you are looking to test whether City is a relevant factor for predicting deaths. In other words, would you expect to see the observed range of values if each death had an equal chance of occurring in any City?
My intuition on this would be that there should certainly be a difference, based on many City-varying differences in demographics, rules, norms, vaccination rates, and the nature of an infectious disease that spreads more if more people are infected to begin with.
If you want to confirm this intuition, you could use simulation. Let's say all Cities had the same underlying risk rate of 800, and all variation was totally due to chance.
set.seed(2021)
Same_risk = 800
Same_risk_deaths = rpois(100, Same_risk)
mean(Same_risk_deaths)
sd(Same_risk_deaths)
Here, the observed mean is indeed close to 800, with a standard deviation of around 3% of the average value.
If we instead had a situation where some cities, for whatever combination of reasons, had different risk factors (say, 600 or 1000), then we could see the same average around 800, but with a much higher standard deviation around 25% of the average value.
Diff_risk = rep(c(600, 1000), 50)
Diff_risk_deaths = rpois(100, Diff_risk)
mean(Diff_risk_deaths)
sd(Diff_risk_deaths)
I imagine your data does not look like the first distribution and is instead much more varied.
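One rough way to relate this back to your own data (a sketch only, assuming your data frame is called zipcode with a Total column, as in your lm() call) is to compare the relative spread of the observed totals with the equal-risk simulation:
# Coefficient of variation (sd / mean): observed totals vs. equal-risk Poisson simulation
c(observed = sd(zipcode$Total) / mean(zipcode$Total),
  equal_risk_sim = sd(Same_risk_deaths) / mean(Same_risk_deaths))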

Related

Model predicted values around mean using training data

I tried to ask these questions through imputations, but I want to see if this can be done with predictive modelling instead. I am trying to use information from 2003-2004 NHANES to predict future NHANES cycles. For some context, in 2003-2004 NHANES measured blood contaminants in individual people's blood. In this cycle, they also measured things such as triglycerides, cholesterol etc. that influence the concentration of these blood contaminants.
The first step in my workflow is to impute the missing blood contaminant concentrations in 2003-2004 using the measured values of triglycerides, cholesterol, etc. This is an easy step and very straightforward. This will be my training dataset.
For future NHANES years (for example 2005-2006), they took individual blood samples, combined (pooled) them, and then measured blood contaminants. I need to figure out what the individual concentrations were in these cycles. I have individual measurements for triglycerides, cholesterol, etc., and the pooled value is considered the mean. Could I use the mean and the 2003-2004 data to unpool, or predict, the individual values? For example, if a pool contains 8 individuals, we know the mean, the distribution (from 2003-2004) and the other parameters (triglycerides) which we can use in the regression to estimate the blood contaminants in those 8 individuals. This would be my test dataset, where I have the same contaminants as in the training dataset, with a column for the number of individuals in each pool and the mean value. Alternatively, I can create rows of empty values for the contaminants and add the mean values separately.
I can easily run MICE, but I need to make sure that the distribution of the imputed data matches 2003-2004 and that the average of the imputed 8 individuals from the pools is equal to the measured pool. So the 8 values for each pool, need to average to the measured pool value while the distribution has to be the same as 2003-2004.
Does that make sense? Happy to provide context if need be. There is an outline code below.
library(mice)
library(tidyverse)
library(VIM)
#Papers detailing these functions can be found in MICE Cran package
df <- read.csv('2003_2004_template.csv', stringsAsFactors = TRUE, na.strings = c("", NA))
#Checking out the NA's that we are working with
non_detect_summary <- as.data.frame(df %>% summarize_all(~ sum(is.na(.))))
#helpful representation of ND
aggr_plot <- aggr(df[, 7:42], col = c('navyblue', 'red'),
                  numbers = TRUE,
                  sortVars = TRUE,
                  labels = names(df[, 7:42]),
                  cex.axis = .7,
                  gap = 3,
                  ylab = c("Histogram of Missing Data", "Pattern"))
#Mice time, m is the number of imputed datasets (you can think of this as # of cycles)
#You can check out what regression methods below in console
methods(mice)
#Pick Method based on what you think is the best method. Read up.
#Now apply the right method
imputed_data <- mice(df, m = 30)
summary(imputed_data)
#if you want to see imputed values
imputed_data$imp
#finish the dataset
finished_imputed_data <- complete(imputed_data)
#Check for any missing values
sapply(finished_imputed_data, function(x) sum(is.na(x))) #All features should have a value of zero
#Helpful plot is the density plot. The density of the imputed data for each imputed dataset is shown
#in magenta while the density of the observed data is shown in blue.
#Again, under our previous assumptions we expect the distributions to be similar.
densityplot(x = imputed_data, data = ~ LBX028LA+LBX153LA+LBX189LA)
#Print off finished dataset
write_csv(finished_imputed_data, "finished_imputed_data.csv")
#This is where I need to use the finished_imputed_data to impute the values in the future years.

R point-to-point method for calculating x given y

I am using a commercial ELISA kit which contains four standards. These standards are used to create a standard curve, with optical densities from the ELISA reader on the y axis and concentrations in international units per milliliter (IU/mL) on the x axis.
I now need to use this standard curve to get concentrations for samples in which I only have the optical density readings. The ELISA kit instructions specifically state "Use “point-to-point” plotting for calculation of the standard curve by computer".
I am assuming they mean to derive the value of x by seeing where y hits the line between the points on the standard curve and dropping down to the x axis from there. The problem is I have no idea how to do this in R (which is what I am using for my full analytical pipeline). I have searched in vain for any R packages, functions or code which correspond to "point-to-point" but can't find anything. All the R packages that deal with ELISA data and/or standard curves (e.g. drc and ELISAtools) seem to do something much more complex, i.e. fit a log model and account for inter-plate variances etc., which is not what I need.
Please note that I don't need to visualise the standard curve - I just need a method to get the concentrations from the standard curve data based on the point-to-point line.
Here is some sample data:
# Data for standard curve:
library(data.table)
scdt <- data.table(id = c("Cal1", "Cal2", "Cal3", "Cal4"),
                   conc = c(200, 100, 25, 5),
                   od = c(1.783, 1.395, 0.594, 0.164))
> scdt
id conc od
1: Cal1 200 1.783
2: Cal2 100 1.395
3: Cal3 25 0.594
4: Cal4 5 0.164
# Some example OD values for which I would like to derive concentration:
unknowns <- c(0.015, 0.634, 0.891, 1.510, 2.345, 3.105)
In the example values I want to solve for x, I have also included some that are outside the range covered by the standards as this also occurs in my real data from time to time. The kit manufacturer advises against reporting IU/mL for anything with an OD exceeding that of the highest standard (Cal1) which is sensible.
How can I do the R equivalent of finding x with a ruler and graph paper from the standard curve, and what is this formally called? (I think one reason I might not have found anything is because "point-to-point" isn't a mathematical term, but there must be one for this - is it interpolation?)
It sounds like you want a simple linear interpolation. This is achieved in R using the function approx. You feed it your known x values, your known y values and the new values for x for which you want the corresponding y values. (Note that it doesn't matter which variable you call x and which you call y, as long as you are consistent).
To get a result that is easier to work with, we can convert the response to a data frame with appropriate column names:
new_data <- approx(scdt$od, scdt$conc, xout = unknowns) |>
  setNames(c("od", "conc")) |>
  as.data.frame()
new_data
#> od conc
#> 1 0.015 NA
#> 2 0.634 28.74532
#> 3 0.891 52.80899
#> 4 1.510 129.63918
#> 5 2.345 NA
#> 6 3.105 NA
Note that, as the manufacturer recommends, optical densities falling outside the range of your calibration points will give NA values for concentration. To get these you would need to extrapolate rather than interpolate.
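If you ever do need values beyond the calibration range (despite the manufacturer's advice), one option is to extend the outermost line segments linearly; a minimal sketch, assuming the Hmisc package is available:
# Linear extrapolation beyond the outermost standards - use with caution
extrap <- Hmisc::approxExtrap(scdt$od, scdt$conc, xout = unknowns)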
Just to confirm this is what you're looking for, let's plot the results of this interpolation in red over the curve formed from the initial data:
plot(scdt$od, scdt$conc, type = "l", lty = 2)
points(scdt$od, scdt$conc)
points(new_data$od, new_data$conc, col = "red")
We can see that the estimated concentrations at each new optical density lie on the lines connecting the calibration points.

How can I add an amount random error to a numerical variable in R?

I am working on investigating the relationship between body measurements and overall weight in a set of biological specimens using regression equations. I have been comparing my results to previous studies, which did not draw their measurement data and body weights from the same series of individuals. Instead, these studies used the mean values reported for each species from the previously published literature (with body measurements and weight drawn from different sets of individuals) or just took the midpoint of reported ranges of body measurements.
I am trying to figure out how to introduce a small amount of random error in my data to simulate the effects of drawing measurement and weight data from different sources. For example, mutating all data to be slightly altered from their actual value by roughly +/- 5% of their actual value, which is close to the difference I get between my measurements and the literature measurements, and seeing how much that affects accuracy statistics. I know there is the jitter() command, but that only seems to work with plotting data.
There is a jitter() function in base R which allows you to add random noise to the data.
x <- 1:10
set.seed(123)
jitter(x)
#[1] 0.915 2.115 2.964 4.153 5.176 5.818 7.011 8.157 9.021 9.983
Check ?jitter which explains different ways to control the noise added.
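For example, the amount argument bounds the noise; a quick illustration:
set.seed(123)
jitter(1:10, amount = 0.5)   # each value perturbed by uniform noise within +/- 0.5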
Straightforward if you know what the error looks like (i.e., how your error is distributed). Is the error normally distributed? Uniform?
v1 <- rep(100, 10) # measurements with no noise
v1_n <- v1 + rnorm(10, 0, 20) #error with mean 0 and sd 20 sampled from normal distribution
v1_u <- v1 + runif(10, -5, 5) #error with mean 0 min -5 and max 5 from uniform distribution
v1_n
[1] 87.47092 103.67287 83.28743 131.90562 106.59016 83.59063 109.74858 114.76649 111.51563 93.89223
v1_u
[1] 104.34705 97.12143 101.51674 96.25555 97.67221 98.86114 95.13390 98.82388 103.69691 98.40349
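Since you describe the error as roughly +/- 5% of each actual value, a multiplicative version of the same idea may be closer to what you want; a small sketch with made-up measurements:
set.seed(123)
weights <- c(12.4, 80.1, 5.3, 41.0)                   # hypothetical measurements
weights * (1 + runif(length(weights), -0.05, 0.05))   # each value shifted by up to +/- 5% of itself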

GAMM4 smoothing spline for time variable

I am constructing a GAMM model (for the first time) to compare longitudinal slopes of cognitive performance in a Bipolar Disorder (BD) sample, compared to a control (HC) sample. The study design is referred to as an "accelerated longitudinal study" where participants across a large span of ages 25-60, are followed for 2 years (HC group) and 4 years (BD group).
Hypothesis (1) The BD group’s yearly rate of change on processing speed will be higher overall than the healthy control group, suggesting a more rapid cognitive decline in BD than seen in HC.
Here is my R code formula, which I think is a bit off:
RUN2 <- gamm4(BACS_SC_R ~ group + s(VISITMONTH, bs = "cc") +
s(VISITMONTH, bs = "cc", by=group), random=~(1|SUBNUM), data=Df, REML = TRUE)
The visitmonth variable is coded as "months from first visit." Visit 1 would equal 0, and the following visits (3 per year) are coded as months elapsed from visit 1. Is a cyclic smooth correct in this case?
I plan on adding additional variables (i.e peripheral inflammation) to the model to predict individual slopes of cognitive trajectories in BD.
If you have any other suggestions, it would be greatly appreciated. Thank you!
If VISITMONTH is over years (i.e. for a BD observation we would have VISITMONTH in {0, 1, 2, ..., 48} for the four years), then no, you don't want a cyclic smooth unless there is some 4-year periodicity that would mean months 0 and 48 should be constrained to be the same.
The default thin plate spline bs = 'tp' should suffice.
I'm also assuming that there are many possible values for VISITMONTH as not everyone was followed up at the same monthly intervals? Otherwise you're not going to have many degrees of freedom available for the temporal smooth.
Is group coded as an ordered factor here? If so that's great; the by smooth will encode the difference between the reference level (be sure to set HC as the reference level) and the other level so you can see directly in the summary a test for a difference of the BD group.
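Putting those pieces together, here is a hedged sketch of the structure being described (variable names are taken from your code; the HC/BD level names are assumed, and this is not meant as a drop-in fix):
# Sketch only: default thin plate smooths, group as an ordered factor so the
# by-smooth becomes a difference smooth for BD relative to the HC reference level
Df$group <- factor(Df$group, levels = c("HC", "BD"), ordered = TRUE)
RUN2 <- gamm4::gamm4(BACS_SC_R ~ group + s(VISITMONTH) +
                       s(VISITMONTH, by = group),
                     random = ~ (1 | SUBNUM), data = Df, REML = TRUE)
summary(RUN2$gam)   # the by-group smooth tests the BD - HC difference in trajectory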
It's not clear how you are dealing with the fact that the HC group is followed up over fewer months than the BD group. It looks like the model has VISITMONTH representing the full time of the study, not just a within-year term. So how do you intend to compare the BD group with the HC group for the 2 years where the HC group is not observed?

Interpolation with na.approx : How does it do that?

I am doing some light un-suppression of employment data, and I stumbled on the na.approx approach in the zoo package. The data represent the percentage of total government employment, and I figured a rough estimate would be to look at the trends of change between state and local government. They should add to one.
Year   State %       Local %
2001   NA            NA
2002   NA            NA
2003   NA            NA
2004   0.118147539   0.881852461
2005   0.114500321   0.885499679
2006   0.117247083   0.882752917
2007   0.116841331   0.883158669
I use the spline variant (na.spline), which allows the estimation of the leading NAs:
library(zoo)
z <- zoo(DF2, 1:7)
d <- na.spline(z, na.rm = FALSE, maxgap = Inf)
Which gives the output:
State % Local %
0.262918013 0.737081987
0.182809891 0.817190109
0.137735231 0.862264769
0.118147539 0.881852461
0.114500321 0.885499679
0.117247083 0.882752917
0.116841331 0.883158669
Great, right? The part that amazes me is that the approximated NA values sum to 1 (which is what I want, but unexpected!), even though the documentation for na.approx says that it works on each column separately, column-wise. Am I missing something? My money's on mis-reading the documentation.
I believe it's just a chance property of linear least squares. The slopes from both regressions sum to zero, as a result of the constraint that the sum of the two series equals one, and the intercepts sum to one. Hence the fitted values from both regressions at any point in time sum to one.
EDIT: A bit more explanation.
y1 = a + beta * t + epsilon
y2 = 1 - y1 = (1 - a) + (-beta) * t - epsilon
Therefore, running OLS will give intercepts summing to one, and slopes to zero.
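You can check this numerically with the observed values from the table (a sketch; the year is used as the time index):
# The two columns sum to one, so their OLS fits have complementary coefficients
yr <- 2004:2007
y1 <- c(0.118147539, 0.114500321, 0.117247083, 0.116841331)  # State %
y2 <- 1 - y1                                                  # Local %
coef(lm(y1 ~ yr))   # intercept a, slope b
coef(lm(y2 ~ yr))   # intercept 1 - a, slope -b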
