Interpolation with na.approx: How does it do that? (R)

I am doing some light un-suppression of employment data, and I stumbled on the na.approx approach in the zoo package. The data represent the percentage of total government employment, and I figured a rough estimate would be to look at the trends of change between state and local government; the two shares should add to one.
Year State %     Local %
2001 NA          NA
2002 NA          NA
2003 NA          NA
2004 0.118147539 0.881852461
2005 0.114500321 0.885499679
2006 0.117247083 0.882752917
2007 0.116841331 0.883158669
I use na.spline, which allows estimation of the leading NAs:
library(zoo)
z <- zoo(DF2, 1:7)
d <- na.spline(z, na.rm = FALSE, maxgap = Inf)
Which gives the output:
State % Local %
0.262918013 0.737081987
0.182809891 0.817190109
0.137735231 0.862264769
0.118147539 0.881852461
0.114500321 0.885499679
0.117247083 0.882752917
0.116841331 0.883158669
Great, right? The part that amazes me is that the approximated values sum to 1 (which is what I want, but unexpected!), even though the documentation says the interpolation is done on each column separately, column-wise. Am I missing something? My money's on my misreading the documentation.

I believe it's just a chance property of linear least squares. The slopes from both regressions sum to zero, as a result of the constraint that the sum of the series equals one, and the intercepts sum to one. Hence the fitted values from both regressions at any point in time sum to one.
EDIT: A bit more explanation.
y1 = a + beta * t + epsilon
y2 = 1 - y1 = (1 - a) + (-beta) * t - epsilon
Therefore, running OLS will give intercepts summing to one, and slopes to zero.
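For anyone who wants to check this numerically, here is a minimal sketch that rebuilds the data from the question (the construction of DF2 is an assumption, since the question doesn't show it) and confirms the row sums:

library(zoo)

## Reconstruct the data from the question: Local is exactly 1 - State
state <- c(NA, NA, NA, 0.118147539, 0.114500321, 0.117247083, 0.116841331)
DF2   <- data.frame(State = state, Local = 1 - state)

z <- zoo(DF2, 1:7)
d <- na.spline(z, na.rm = FALSE, maxgap = Inf)
rowSums(d)   # all equal to 1 (up to floating point), including the extrapolated years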

Related

Comparing means (int) by zipcode (factor) in R

I have a list of zipcodes and the number of covid deaths per zipcode in a data frame (not real numbers, just examples):
City           Total
Richmond       552
Las Vegas      994
San Francisco  388
I want to see if there is any relationship between zipcode and the total number of deaths.
I made a linear model using the lm() function:
mod_zip <- lm(Total ~ City, data=zipcode)
But when I call summary(mod_zip) I get NA for everything except the estimate column.
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
CityRichmond      2851         NA      NA       NA
CityLasVegas     -2604         NA      NA       NA
CitySanFran       -966         NA      NA       NA
What am I doing wrong?
lm will turn the factor into one-hot columns, so you have a parameter for each city except one and a global intercept.
Then (assuming so, without seeing your data) you are trying to fit n data points with n parameters, which lm manages to do, but it has no residual degrees of freedom left with which to estimate a standard error.
Simplified example to reproduce:
df <- data.frame(x = LETTERS, y = rnorm(26), stringsAsFactors = TRUE)  # 26 observations
fit <- lm(y ~ x, data = df)                                            # intercept + 25 dummies = 26 parameters
summary(fit)
You will see an intercept and parameters B through Z (26 parameters for 26 observations); the residual degrees of freedom are therefore 0, hence the standard errors and related metrics cannot be calculated.
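For contrast, a hypothetical variation on the same example: with more than one observation per level there are residual degrees of freedom left over, and the standard errors come back.

## Hypothetical variation: 3 observations per level -> 78 - 26 = 52 residual df
set.seed(1)
df2  <- data.frame(x = rep(LETTERS, each = 3), y = rnorm(78), stringsAsFactors = TRUE)
fit2 <- lm(y ~ x, data = df2)
summary(fit2)   # Std. Error, t value and Pr(>|t|) are now populated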
It sounds like you are looking to test whether City is a relevant factor for predicting deaths. In other words, would you expect to see the observed range of values if each death had an equal chance of occurring in any City?
My intuition on this would be that there should certainly be a difference, based on many City-varying differences in demographics, rules, norms, vaccination rates, and the nature of an infectious disease that spreads more if more people are infected to begin with.
If you want to confirm this intuition, you could use simulation. Let's say all Cities had the same underlying risk rate of 800, and all variation was totally due to chance.
set.seed(2021)
Same_risk = 800
Same_risk_deaths = rpois(100, Same_risk)
mean(Same_risk_deaths)
sd(Same_risk_deaths)
Here, the observed mean is indeed close to 800, with a standard deviation of around 3% of the average value.
If we instead had a situation where some cities, for whatever combination of reasons, had different risk factors (say, 600 or 1000), then we could see the same average around 800, but with a much higher standard deviation around 25% of the average value.
Diff_risk = rep(c(600, 1000), 50)
Diff_risk_deaths = rpois(100, Diff_risk)
mean(Diff_risk_deaths)
sd(Diff_risk_deaths)
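To put the two scenarios side by side, a small sketch using the simulated vectors above compares their coefficients of variation (sd/mean); it simply restates the 3% and 25% figures quoted in the text.

## Coefficient of variation: ~3% when every city shares one rate,
## ~25% when the underlying rates differ (600 vs 1000)
sd(Same_risk_deaths) / mean(Same_risk_deaths)
sd(Diff_risk_deaths) / mean(Diff_risk_deaths)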
I imagine your data does not look like the first distribution and is instead much more varied.

In R, how do you impute left-censored data that is below a limit of detection?

This is probably a simple problem but I just can't work it out. I have a dataframe of biochemistry test results. Some of these tests like base_crp are returning values like <3 because of limits of detection. I need to impute this data before moving forward. I'd like to do this properly, so not just substituting.
I tried multLN from the zCompositions package, but it seems to think that all the <3 values are negative (the error says X contains negative values). There also doesn't seem to be much documentation out there; is this an obscure package?
I also looked at LODI, but it wants me to specify covariates for the imputation model; is there a proper way to select these? Anyway, I picked 3 that would theoretically correlate well and used this code:
clmi.out <- clmi(formula = log(base_crp) ~ base_wcc + base_neut + base_lymph, df = all, lod = crplim, seed = 12345, n.imps = 5)
where base_crp is the variable I'm trying to fix. I replaced all the <3 with NA and inserted a new column all$crplim <- "3". However, this is just returning
Error in sprintf("%s must be numeric.") : too few arguments.
Even if I can get LODI working, I'm not sure if it's the right tool. I'm only an undergraduate university student with little statistical background, so I don't really understand what I'm doing; I just want something that will populate the column with numbers so I can move forward with Pearson correlations, linear regressions, etc. I would really appreciate some help with this. Thanks in advance.
I've done a bit of statistical modelling of CRP (C reactive protein) levels before - see this peer-reviewed paper as an example. CRP has an approximately log-normal distribution, and the median value in an unselected population across all testing indications is usually around 3.5 mg/l (most healthy people will be in that "<3mg/l" category). You probably don't want to be using an imputation model, because these are for missing data. The low CRP data is not missing. You already know it lies within a certain range, so you are losing information if you do the imputation this way.
It is reasonable to want to replace "<3" with a numeric value for regressions etc, as long as you are using this to correlate CRP with clinical findings etc and not (as Ben Norris points out) for CRP machine calibration.
I can tell you from data on over 10,000 samples of high-sensitivity CRP measurements in the study I linked above that the mean CRP in people with CRP < 3 is about 1.3, and it would be reasonable to substitute all of your "CRP < 3" measurements with 1.3 for most real-world clinical observational studies.
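If you go with that simple substitution, something along these lines would do it (a sketch assuming base_crp is currently stored as character values like "<3" or "7.2"; crp_num is just an illustrative name):

## Replace "<3" with 1.3 mg/l, then convert the column to numeric
all$crp_num <- as.numeric(ifelse(all$base_crp == "<3", "1.3", all$base_crp))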
If you really need to have plausible numerical values on the missing CRP, you could impute the bottom half of a lognormal distribution. The following function would give you numbers that would likely be indistinguishable from real-life CRP measurements:
impute_crp <- function(n)
{
  # draw from a log-normal distribution fitted to population CRP, then
  # keep only the n values that fall below the 3 mg/l detection limit
  x <- exp(rnorm(10 * n, 1.355, 1.45))
  round(x[x < 3][seq(n)], 1)
}
So you could do
impute_crp(10)
#> [1] 1.5 2.0 1.1 0.4 2.5 0.1 0.7 1.5 1.4 0.4
And
base_crp[base_crp == "<3"] <- impute_crp(length(which(base_crp == "<3")))
However, you will notice that I didn't use imputation at all in my own CRP models. Replacing the lower value with the threshold of detection was good enough for the purposes of modelling - and I'm fairly sure whether you replace the "< 3" with a lognormal tail, or all 1.3, or all 2, it will make no difference to the conclusions you are trying to draw.

zero-inflated overdispersed count data glmmTMB error in R

I am working with count data (available here) that are zero-inflated and overdispersed, and that have random effects. The package best suited to this sort of data is glmmTMB (details here and troubleshooting here).
Before working with the data, I inspected it for normality (it is zero-inflated), homogeneity of variance, correlations, and outliers. The data had two outliers, which I removed from the dataset linked above. There are 351 observations from 18 locations (prop_id).
The data looks like this:
euc0 ea_grass ep_grass np_grass np_other_grass month year precip season prop_id quad
3 5.7 0.0 16.7 4.0 7 2006 526 Winter Barlow 1
0 6.7 0.0 28.3 0.0 7 2006 525 Winter Barlow 2
0 2.3 0.0 3.3 0.0 7 2006 524 Winter Barlow 3
0 1.7 0.0 13.3 0.0 7 2006 845 Winter Blaber 4
0 5.7 0.0 45.0 0.0 7 2006 817 Winter Blaber 5
0 11.7 1.7 46.7 0.0 7 2006 607 Winter DClark 3
The response variable is euc0 and the random effects are prop_id and quad. The rest of the variables are fixed effects (all representing the percent cover of different plant species).
The model I want to run:
library(glmmTMB)
seed0<-glmmTMB(euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass + month + year*precip + season*precip + (1|prop_id) + (1|quad), data = euc, family=poisson(link=identity))
fit_zinbinom <- update(seed0, family=nbinom2) # allow the variance to increase quadratically
The error I get after running the seed0 code is:
Error in optimHess(par.fixed, obj$fn, obj$gr) : gradient in optim
evaluated to length 1 not 15 In addition: There were 50 or more
warnings (use warnings() to see the first 50)
warnings() gives:
1. In (function (start, objective, gradient = NULL, hessian = NULL, ... :
NA/NaN function evaluation
I also normally mean center and standardize my numerical variables, but this only removes the first error and keeps the NA/NaN error. I tried adding a glmmTMBControl statement like this OP, but it just opened a whole new world of errors.
How can I fix this? What am I doing wrong?
A detailed explanation would be appreciated so that I can learn how to troubleshoot this better myself in the future. Alternatively, I am open to a MCMCglmm solution as that function can also deal with this sort of data (despite taking longer to run).
An incomplete answer ...
identity-link models for limited-domain response distributions (e.g. Gamma or Poisson, where negative values are impossible) are computationally problematic; in my opinion they're often conceptually problematic as well, although there are some reasonable arguments in their favor. Do you have a good reason to do this?
This is a pretty small data set for the model you're trying to fit: 13 fixed-effect predictors and 2 random-effect predictors. The rule of thumb would be that you want about 10-20 times that many observations: that seems to fit in OK with your 345 or so observations, but ... only 40 of your observations are non-zero! That means your 'effective' number of observations/amount of information will be much smaller (see Frank Harrell's Regression Modeling Strategies for more discussion of this point).
That said, let me run through some of the things I tried and where I ended up.
GGally::ggpairs(euc, columns=2:10) doesn't detect anything obviously terrible about the data (I did throw out the data point with euc0==78)
In order to try to make the identity-link model work I added some code in glmmTMB. You should be able to install via remotes::install_github("glmmTMB/glmmTMB/glmmTMB#clamp") (note you will need compilers etc. installed to install this). This version takes negative predicted values and forces them to be non-negative, while adding a corresponding penalty to the negative log-likelihood.
Using the new version of glmmTMB I don't get an error, but I do get these warnings:
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
The Hessian (second-derivative) matrix being non-positive-definite means there are some (still hard-to-troubleshoot) problems. heatmap(vcov(f2)$cond,Rowv=NA,Colv=NA) lets me look at the covariance matrix. (I also like corrplot::corrplot.mixed(cov2cor(vcov(f2)$cond),"ellipse","number"), but that doesn't work when vcov(.)$cond is non-positive definite. In a pinch you can use sfsmisc::posdefify() to force it to be positive definite ...)
Tried scaling:
eucsc <- dplyr::mutate_at(euc1,dplyr::vars(c(ea_grass:precip)), ~c(scale(.)))
This will help some - right now we're still doing a few silly things like treating year as a numeric variable without centering it (so the 'intercept' of the model is at year 0 of the Gregorian calendar ...)
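A sketch of what that might look like (year_c and year_f are made-up names; euc1 is the data frame from the scaling step above):

## Centre year on the first survey year so the intercept is interpretable ...
euc1$year_c <- euc1$year - min(euc1$year)
## ... or, with only a couple of distinct years, treat year as a factor instead
euc1$year_f <- factor(euc1$year)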
But that still doesn't fix the problem.
Looking more closely at the ggpairs plot, it looks like season and year are confounded: with(eucsc,table(season,year)) shows that observations occur in Spring and Winter in one year and Autumn in the other year. season and month are also confounded: if we know the month, then we automatically know the season.
At this point I decided to give up on the identity link and see what happened. update(<previous_model>, family=poisson) (i.e. using a Poisson with a standard log link) worked! So did using family=nbinom2, which was much better.
I looked at the results and discovered that the CIs for the precip X season coefficients were crazy, so dropped the interaction term (update(f2S_noyr_logNB, . ~ . - precip:season)) at which point the results look sensible.
A few final notes:
the variance associated with quadrat is effectively zero
I don't think you necessarily need zero-inflation; low means and overdispersion (i.e. family=nbinom2) are probably sufficient.
the distribution of the residuals looks OK, but there still seems to be some model mis-fit (library(DHARMa); plot(simulateResiduals(f2S_noyr_logNB2))). I would spend some time plotting residuals and predicted values against various combinations of predictors to see if you can localize the problem.
PS A quicker way to see that there's something wrong with the fixed effects (multicollinearity):
X <- model.matrix(~ ea_grass + ep_grass +
                    np_grass + np_other_grass + month +
                    year*precip + season*precip,
                  data = euc)
ncol(X)                 ## 13
Matrix::rankMatrix(X)   ## 11
lme4 has tests like this, and machinery for automatically dropping aliased columns, but they aren't implemented in glmmTMB at present.
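If you want to see which columns of X are aliased before fitting, one option (an extra dependency, not something glmmTMB provides) is caret::findLinearCombos():

## Identify linearly dependent sets of columns in the fixed-effect model matrix
cc <- caret::findLinearCombos(X)
cc$linearCombos          # groups of column indices that are linearly dependent
colnames(X)[cc$remove]   # columns caret would drop to restore full rank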

How to remove outliers from distance matrix or Hierarchical clustering in R?

I have some questions.
First, I don't know how to find and remove outliers in a distance matrix or symmetric matrix.
Second, I used hierarchical clustering with average linkage.
My data is engmale161 (already a symmetric matrix computed with DTW):
engmale161 <- na.omit(engmale161)
engmale161 <- scale(engmale161)
d <- dist(engmale161, method = "euclidean")
hc1_engmale161 <- hclust(d, method="average")
I found the optimal number of clusters, k = 4, using silhouette, WSS and the gap statistic.
> sub_grp <- cutree(hc1_engmale161, h = 60, k = 4)
> table(sub_grp)
sub_grp
  1   2   3   4
741  16   7   1
> subset(sub_grp, sub_grp == 4)
4165634865
         4
> fviz_cluster(list(data = engmale161, cluster = sub_grp), geom = "point")
So I think the upper-right point (4165634865) is an outlier; it forms a cluster of its own. How do I remove this outlier from the hierarchical clustering workflow?
Just some ideas. In a nutshell:
1. don't do na.omit on engmale161;
2. find the outlier(s) using quantiles and box-and-whiskers;
3. set the outliers to NA in the dist matrix;
4. proceed with your processing.
Long version:
dist behaves nicely with NAs (from the R documentation: "Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA.")
To find an outlier I would use concepts from exploratory statistics. Use quantile with the default probs and na.rm = TRUE (because your dist matrix still contains NAs); you'll get values for the quartiles (the dataset split into four parts: 0-25%, 25-50%, and so on). The 25-75% range is the "box". How to find the "whiskers" is a debated topic. The standard approach is to compute the interquartile range (IQR), which is the third minus the first quartile; then the first quartile - 1.5*IQR is the lower whisker, and the third quartile + 1.5*IQR is the upper whisker. Any value outside the whiskers is to be considered an outlier. Mark them as NA, and proceed.
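A sketch of that quantile/IQR rule applied to the pairwise distances (d is the dist object from the question; the 1.5 multiplier is the usual convention, not a requirement):

dm <- as.matrix(d)        # symmetric matrix of pairwise distances
diag(dm) <- NA            # ignore self-distances (all zero)
dv <- dm[upper.tri(dm)]   # each pairwise distance once
q  <- quantile(dv, probs = c(0.25, 0.75), na.rm = TRUE)
iqr   <- unname(q[2] - q[1])
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

## observations involved in at least one distance outside the whiskers
out_rows <- which(rowSums(dm < lower | dm > upper, na.rm = TRUE) > 0)
out_rows

## from here, either set those distances to NA, or drop the offending
## observations from engmale161 and rebuild d and hc1_engmale161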
Best of luck, and my compliments for being someone who actually looks at the data!

How is NaN handled in Pearson correlation user-user similarity matrix in a recommender system?

I am generating a user-user similarity matrix from user-rating data (specifically the MovieLens 100K data). Computing the correlation leads to some NaN values. I have tested this on a smaller dataset:
User-Item rating matrix
I1 I2 I3 I4
U1 4 0 5 5
U2 4 2 1 0
U3 3 0 2 4
U4 4 4 0 0
User-User Pearson Correlation similarity matrix
U1 U2 U3 U4 U5
U1 1 -1 0 -nan 0.755929
U2 -1 1 1 -nan -0.327327
U3 0 1 1 -nan 0.654654
U4 -nan -nan -nan -nan -nan
U5 0.755929 -0.327327 0.654654 -nan 1
For computing the Pearson correlation, only co-rated items are considered between two users (see "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", Gediminas Adomavicius and Alexander Tuzhilin).
How can I handle the NaN values?
EDIT
Here is the code with which I compute the Pearson correlation in R. R is the user-item rating matrix, containing ratings on a 1-5 scale where 0 means not rated. S is the user-user correlation matrix.
for (i in 1:nrow(R))
{
  cat("user: ", i, "\n")
  for (k in 1:nrow(R))
  {
    if (i != k)
    {
      # items rated (non-zero) by both user i and user k
      corated_list <- which((R[i, ] != 0) & (R[k, ] != 0))
      ui <- R[i, corated_list] - mean(R[i, corated_list])
      uk <- R[k, corated_list] - mean(R[k, corated_list])
      temp <- sum(ui * uk) / sqrt(sum(ui^2) * sum(uk^2))
      S[i, k] <- ifelse(is.nan(temp), 0, temp)
    }
    else
    {
      S[i, k] <- 0
    }
  }
}
Note that in the S[i, k] <- ifelse(is.nan(temp), 0, temp) line I am replacing the NaNs with 0.
I recently developed a recommender system in Java for user-user and user-item matrices. Firstly, as you have probably already found, recommender systems are difficult. For my implementation I used the Apache Commons Math library, which is fantastic; you are using R, which is probably fairly similar in how it calculates Pearson's.
Your question was: how can I handle NaN values? This was followed by an edit saying that you are treating NaN as 0.
My answer is this:
You shouldn't really handle NaN values as 0, because what you are saying is that there is absolutely no correlation between users or users/items. This might be the case, but it is likely not always the case. Ignoring this will skew your recommendations.
Firstly, you should be asking yourself, "why am I getting NaN values?" Here are some reasons from the Wikipedia page on NaN detailing why you might get a NaN value:
There are three kinds of operations that can return NaN:
Operations with a NaN as at least one operand.
Indeterminate forms
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
The standard has alternative functions for powers:
The standard pow function and the integer exponent pown function define 0^0, 1^∞, and ∞^0 as 1.
The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
Real operations with complex results, for example:
The square root of a negative number.
The logarithm of a negative number
The inverse sine or cosine of a number that is less than −1 or greater than +1.
You should debug your application and step through each step to see which of the above reasons is the offending cause.
Secondly, understand that Pearson's correlation can be represented in a number of different ways; you need to consider whether you are calculating it across a sample or a population and then find the appropriate method of calculating it. For example:
cor(X, Y) = Σ[(x_i - E(X))(y_i - E(Y))] / [(n - 1) * s(X) * s(Y)]
where E(X) is the mean of the X values, E(Y) is the mean of the Y values, s(X) and s(Y) are the standard deviations, a standard deviation is the positive square root of the variance, variance = sum((x_i - mean)^2) / (n - 1), mean is the arithmetic mean, and n is the number of sample observations.
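If it helps to see that formula in action, here is a tiny sketch (with synthetic x and y) checking the hand-rolled version against R's built-in cor():

set.seed(42)
x <- rnorm(20)
y <- 2 * x + rnorm(20)

num <- sum((x - mean(x)) * (y - mean(y)))
den <- (length(x) - 1) * sd(x) * sd(y)
num / den   # hand-rolled Pearson correlation
cor(x, y)   # same value from the built-in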
This is probably where your NaNs are appearing, i.e. dividing by 0 for items that are not rated. If you can, I would suggest not using the value 0 to mean "not rated"; instead use null. I would do this for two reasons:
1. The 0 is probably what is cocking up your results with NaNs, and
2. Readability/understandability. Your scale is 1-5, so 0 should not feature; it confuses things. So avoid it if possible.
Thirdly, from a recommender standpoint, think about things from a recommendation point of view. If you have two users and they only have one rating in common, say U1 and U4 for I1 in your smaller dataset, is that one item in common really enough to offer recommendations on? The answer is, of course, not. So may I also suggest you set a minimum threshold of ratings in common to ensure that the quality of the recommendations is better. The minimum you can set for this threshold is 2, but consider setting it a bit higher. If you read the MovieLens research, they set it to somewhere between 5 and 10 (I can't remember off the top of my head). The higher you set this, the less coverage you will get, but you will achieve "better" (lower error score) recommendations. If you have done your reading of the academic literature you will probably have picked up on this point, but I thought I would mention it anyway.
On the above point: look at U4 and compare with every other user. Notice how U4 does not have more than one item in common with any user. Now, hopefully, you will notice that the NaNs appear exclusively with U4. If you have followed this answer, you will hopefully now see that the reason you are getting NaNs is that you cannot actually compute Pearson's with just one item in common :).
Finally, one thing that slightly bothers me about the sample dataset above is the number of correlations that are 1 and -1. Think about what that is actually saying about these users' preferences, then sense-check it against the actual ratings. E.g. look at the U1 and U2 ratings: for Item 1 they agree strongly (both rated it a 4), but for Item 3 they disagree strongly (U1 rated it 5, U2 rated it 1). It seems strange that the Pearson correlation between these two users is -1 (i.e. their preferences are completely opposite). This is clearly not the case; really the Pearson score should be a bit above or a bit below 0. This issue links back to the points about using 0 on the scale and about comparing only a small number of items.
Now, there are strategies for "filling in" items that users have not rated. I am not going to go into them here (you will need to read up on that), but essentially they amount to using the average score for that item, or the average rating for that user. Both methods have their downsides, and personally I don't really like either of them. My advice is to only calculate Pearson correlations between users when they have 5 or more items in common, and to ignore items that have a 0 (or better, null) rating, as in the sketch below.
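One way to fold that advice into the loop from the question is to wrap the per-pair computation in a small helper. This is only a sketch: the threshold of 5 and the NA return value are choices rather than requirements.

## Pearson similarity between users i and k, computed only when they share
## at least `min_corated` rated items (0 = not rated, as in the question)
pearson_sim <- function(R, i, k, min_corated = 5) {
  corated <- which(R[i, ] != 0 & R[k, ] != 0)
  if (length(corated) < min_corated) return(NA_real_)   # not enough overlap
  ui <- R[i, corated] - mean(R[i, corated])
  uk <- R[k, corated] - mean(R[k, corated])
  denom <- sqrt(sum(ui^2) * sum(uk^2))
  if (denom == 0) return(NA_real_)                      # no variation among the co-rated items
  sum(ui * uk) / denom
}

You would then decide separately how to treat the NA pairs when generating recommendations, e.g. simply exclude them as candidate neighbours.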
So to conclude.
NaN does not equal 0 so do not set it to 0.
0's in your scale are better represented as null.
You should only calculate Pearson correlations when the number of items in common between two users is >1, preferably greater than 5-10.
Only calculate the Pearson correlation for two users where they have commonly rated items; do not include items in the score that have not been rated by the other user.
Hope that helps and good luck.
