R correlation score calculation - r

My dataset contains fields such as FeedbackDate, SubCategory (issue) and location, covering six months. The temporal correlation was computed by cross-tabulating the issue subcategories against the week of their feedback dates and then calculating the Pearson correlation for every pair of cross-tabulated issues. See the code below.
#weekly correlation
require(ISOweek)
datacfs_date$FeedbackWeek <- ISOweek(datacfs_date$FeedbackDate)
raw_timecor_matrix <- table(datacfs_date$SubCategory, datacfs_date$FeedbackWeek)
raw_timecor_matrix <- t(raw_timecor_matrix)
timecor_matrix <- cor(raw_timecor_matrix)
#Invert correlation to get distance matrix
inverse_tcc <- 1-timecor_matrix
Now the question is: how do I calculate this on a biweekly and a monthly basis, instead of weekly, for the same six months of data?

Just make your own period labels (assuming year(), month() and week() from the lubridate package), e.g.
library(lubridate)
datacfs_date$FeedbackMonth <- paste0(year(datacfs_date$FeedbackDate), "-M", month(datacfs_date$FeedbackDate))
datacfs_date$FeedbackBiWeek <- paste0(year(datacfs_date$FeedbackDate), "-W", (ceiling(week(datacfs_date$FeedbackDate)/2)*2)-1, ":", (ceiling(week(datacfs_date$FeedbackDate)/2)*2))
and correlate on those instead of FeedbackWeek.
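The whole pipeline can be sketched end to end in base R. This is only an illustration under stated assumptions: the data frame is simulated, period_cor is a hypothetical helper name, and the biweekly label here is a plain 14-day bin counted from the earliest date rather than the week-pair label above.

```r
# Toy data standing in for datacfs_date (column names taken from the question)
set.seed(1)
datacfs_date <- data.frame(
  FeedbackDate = as.Date("2021-01-01") + sample(0:180, 500, replace = TRUE),
  SubCategory  = sample(c("A", "B", "C"), 500, replace = TRUE)
)

# Hypothetical helper: cross-tabulate issues by period label, then correlate.
# Rows of the count matrix are periods, columns are issue subcategories.
period_cor <- function(issues, labels) {
  counts <- unclass(table(labels, issues))
  cor(counts)
}

# Monthly labels, e.g. "2021-M03"
month_lab <- format(datacfs_date$FeedbackDate, "%Y-M%m")

# Biweekly labels: 14-day bins counted from the earliest date (an assumption)
days_in    <- as.integer(datacfs_date$FeedbackDate - min(datacfs_date$FeedbackDate))
biweek_lab <- days_in %/% 14

monthly_cor  <- period_cor(datacfs_date$SubCategory, month_lab)
biweekly_cor <- period_cor(datacfs_date$SubCategory, biweek_lab)

# Invert correlation to get distance matrices, as in the weekly version
inverse_monthly  <- 1 - monthly_cor
inverse_biweekly <- 1 - biweekly_cor
```

With six months of data the monthly version correlates only six points per issue, so those coefficients are likely to be much noisier than the weekly ones.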

Related

When setting your obsCovs for the function pcount (package unmarked) how does R "know" which obsCov observation corresponds to each y value?

I'm relatively new to R, and particularly to this package. I am running N-mixture models to assess detection probability and abundance. I have abundance data, site covariates and observation covariates. There are three repeated observations (rounds) per site. The observation covariates are set up as columns (three columns per covariate, one for each round), and the rows are individual sites. The abundance data is formatted similarly, with each column representing a different round. I've copied my code below.
y.abun2<-COYE[2:4]
obsCovs.ss <- list(temp=Covariate2021[3:5], Date=Covariate2021[13:15], Cloud=Covariate2021[17:19], Wind=Covariate2021[21:23],Observ=Covariate2021[25:27])
siteCovs.ss <- Covariate2021[c(29,30,31,32)]
coyeabund<-unmarkedFramePCount(y=y.abun2, siteCovs = siteCovs.ss,
obsCovs = obsCovs.ss)
After this I scale using this code:
coyeabund@siteCovs$TreeCover <- scale(coyeabund@siteCovs$TreeCover)
Moving on to my model I use this code:
abun.coye.full<-pcount(~TreeCover+temp+Date+Cloud+Wind+Observ ~ HHSDI+ProportionNH+Quality, coyeabund,mixture="NB", K=132,se=TRUE)
Is the model matching the observation covariates to the abundance measurements for each round (i.e., can it tell that temp column 5 corresponds to the third round of abundance measurements)?
The models seem fine so far but I am so new at this I want to confirm that I haven't gone astray.
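As far as the unmarked documentation describes it, the matching is purely positional: each element of the obsCovs list is expected to have the same sites-by-rounds shape as y, so cell [i, j] of a covariate goes with cell [i, j] of y. The sketch below uses only base R with made-up numbers and hypothetical object names (no unmarked calls) to show the site-major long layout that such wide tables correspond to.

```r
# Toy data: 2 sites (rows) x 3 rounds (columns); all values are made up
y <- matrix(c(3, 1, 4,
              0, 2, 5), nrow = 2, byrow = TRUE)
temp <- matrix(c(15.2, 17.8, 20.1,
                 14.9, 18.3, 21.0), nrow = 2, byrow = TRUE)

# Site-major long format: all rounds of site 1, then all rounds of site 2.
# t() is needed because as.vector() reads a matrix column by column.
long <- data.frame(
  site  = rep(seq_len(nrow(y)), each = ncol(y)),
  round = rep(seq_len(ncol(y)), times = nrow(y)),
  y     = as.vector(t(y)),
  temp  = as.vector(t(temp))
)
print(long)
```

So the third temp column is paired with the third count column only if both tables list the rounds in the same column order. Inspecting obsCovs(coyeabund) or summary(coyeabund) after building the frame is one way to check this on the real data.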

Model predicted values around mean using training data

I tried to ask these questions through imputations, but I want to see if this can be done with predictive modelling instead. I am trying to use information from 2003-2004 NHANES to predict future NHANES cycles. For some context, in 2003-2004 NHANES measured blood contaminants in individual people's blood. In this cycle, they also measured things such as triglycerides, cholesterol etc. that influence the concentration of these blood contaminants.
The first step in my workflow is to impute missing blood contaminant concentrations in 2003-2004 using the measured values of triglycerides, cholesterol etc. This is an easy and very straightforward step. This will be my training dataset.
For future NHANES years (for example 2005-2006), they combined (pooled) individual blood samples and then measured blood contaminants in the pools. I need to figure out what the individual concentrations were in these cycles. I have individual measurements for triglycerides, cholesterol etc., and the pooled value is considered the mean. Could I use the pool mean together with the 2003-2004 data to unpool, i.e. predict, the individual values? For example, if a pool contains 8 individuals, we know the mean, the distribution (from 2003-2004) and the other parameters (triglycerides), which we can use in the regression to estimate the blood contaminants in those 8 individuals. This would be my test dataset, where I have the same contaminants as in the training dataset, with a column for the number of individuals in each pool and the mean value. Alternatively, I can create rows of empty values for the contaminants and add the mean values separately.
I can easily run MICE, but I need to make sure that the distribution of the imputed data matches 2003-2004 and that the average of the imputed 8 individuals from the pools is equal to the measured pool. So the 8 values for each pool, need to average to the measured pool value while the distribution has to be the same as 2003-2004.
Does that make sense? Happy to provide context if need be. There is an outline code below.
library(mice)
library(tidyverse)
library(VIM)
#Papers detailing these functions can be found in MICE Cran package
df <- read.csv('2003_2004_template.csv', stringsAsFactors = TRUE, na.strings = c("", NA))
#Count the NAs (non-detects) in each column; across() replaces the deprecated funs()
non_detect_summary <- as.data.frame(df %>% summarise(across(everything(), ~ sum(is.na(.x)))))
#helpful representation of ND
aggr_plot <- aggr(df[, 7:42], col=c('navyblue', 'red'),
numbers=TRUE,
sortVars=TRUE,
labels=names(df[, 7:42]),
cex.axis=.7,
gap=3,
ylab=c("Histogram of Missing Data", "Pattern"))
#Mice time, m is the number of imputed datasets (you can think of this as # of cycles)
#You can list the available imputation methods in the console
methods(mice)
#Pick the method you think fits each variable best. Read up.
#Now apply the right method (here the defaults are used)
imputed_data <- mice(df, m = 30)
summary(imputed_data)
#if you want to see imputed values
imputed_data$imp
#finish the dataset
finished_imputed_data <- complete(imputed_data)
#Check for any missing values
sapply(finished_imputed_data, function(x) sum(is.na(x))) #All features should have a value of zero
#A helpful plot is the density plot. The density of the imputed data for each imputed dataset is shown
#in magenta while the density of the observed data is shown in blue.
#Again, under our previous assumptions we expect the distributions to be similar.
densityplot(x = imputed_data, data = ~ LBX028LA+LBX153LA+LBX189LA)
#Print off finished dataset
write_csv(finished_imputed_data, "finished_imputed_data.csv")
#This is where I need to use the finished_imputed_data to impute the values in the future years.
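One possible way to make the imputed individuals agree with the measured pool is to predict each individual from a model fitted on the training cycle and then rescale within each pool. The sketch below is only an illustration under strong assumptions: all data are simulated, the column names (trig, conc, pool) and the log-linear model are hypothetical, and the multiplicative rescaling is a simple stand-in, not a MICE feature. Rescaling also shifts the distribution of the imputed values, so it would still need to be compared against 2003-2004.

```r
set.seed(42)

# Hypothetical training data standing in for the imputed 2003-2004 cycle
train <- data.frame(trig = rlnorm(200, meanlog = 4.8, sdlog = 0.4))
train$conc <- exp(0.5 + 0.6 * log(train$trig) + rnorm(200, sd = 0.3))
fit <- lm(log(conc) ~ log(trig), data = train)

# Hypothetical pooled cycle: 2 pools of 8 individuals with measured pool means
pooled <- data.frame(pool = rep(1:2, each = 8),
                     trig = rlnorm(16, meanlog = 4.8, sdlog = 0.4))
pool_mean <- c("1" = 30, "2" = 45)

# Draw individual values from the fitted model (prediction plus residual noise)
pred <- exp(predict(fit, newdata = pooled) +
              rnorm(nrow(pooled), sd = sigma(fit)))

# Rescale within each pool so the 8 values average to the measured pool mean
target <- unname(pool_mean[as.character(pooled$pool)])
pooled$conc <- pred * target / ave(pred, pooled$pool)

# Check: per-pool means of the unpooled values
tapply(pooled$conc, pooled$pool, mean)
```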

How can I get the spatial correlation between two datasets in R?

I have two arrays:
data1=array(-10:30, c(2160,1080,12))
data2=array(-20:30, c(2160,1080,12))
#Add in some NAs
ind <- which(data1 %in% sample(data1, 1500))
data1[ind] <- NA
One is modelled global gridded data (lon,lat,month) and the other, global gridded observations (lon,lat,month).
I want to assess how 'skillful' the modelled data is at recreating the obs. I think the best way to do this is with a spatial correlation between the datasets. How can I do that?
I tried a straightforward x<-cor(data1,data2) but that just returned x<-NA_real_.
Then I was thinking that I probably have to break it up by month or season. So, just looking at one month x<-cor(data1[,,1],data2[,,1]) it returned a matrix of size 1080*1080 (most of which are NAs).
How can I get a spatial correlation between these two datasets? i.e. I want to see where the modelled data performs 'well' i.e. has high correlation with observations, or where it does badly (low correlation with observations).
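Two common readings of "spatial correlation" can be sketched in base R: a correlation over time at every grid cell (which gives a lon x lat map of where the model does well), and a pattern correlation across all cells for each month. The example uses a small simulated grid with hypothetical names (obs, mod, cell_cor, pattern_cor) and applies no latitude weighting; it is a sketch, not a full skill assessment.

```r
set.seed(1)
nlon <- 6; nlat <- 4; nmon <- 12

# Simulated observations and a noisy "model" of them, with some NAs added
obs <- array(rnorm(nlon * nlat * nmon), c(nlon, nlat, nmon))
mod <- obs + array(rnorm(nlon * nlat * nmon, sd = 0.5), dim(obs))
mod[sample(length(mod), 20)] <- NA

# (1) Correlation over the 12 months at each grid cell -> lon x lat map
cell_cor <- matrix(NA_real_, nlon, nlat)
for (i in seq_len(nlon)) {
  for (j in seq_len(nlat)) {
    cell_cor[i, j] <- cor(mod[i, j, ], obs[i, j, ],
                          use = "pairwise.complete.obs")
  }
}

# (2) Pattern correlation across all grid cells, one value per month
pattern_cor <- sapply(seq_len(nmon), function(m) {
  cor(as.vector(mod[, , m]), as.vector(obs[, , m]), use = "complete.obs")
})
```

cor(data1[,,1], data2[,,1]) returns a 1080 x 1080 matrix because cor() on two matrices correlates every column of one with every column of the other. Flattening each month to a vector, or looping over cells, avoids that.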

Generate random data based on correlation matrix for multiple timesteps in R

I would like to simulate data for some cases (e.g. nPerson = 1000 observations) at
some consecutive timesteps (e.g. ts = 3) for N intercorrelated variables (e.g. N = 5).
The simulation should be based on a correlation matrix (corrMat, an N x N matrix).
corrMat should be identical for all timesteps.
I already found out that the MASS package has a function to create
random data fitting the constraints given by corrMat.
library(MASS)
t1 <- mvrnorm(nPerson, mu = rep(0, N), Sigma = corrMat, empirical = TRUE)
Now I would like to simulate t2 as a function of t1 and corrMat.
The data of t2 should therefore correlate according to corrMat
and should also have the same variances as the variables of t1.
One important constraint: for the initial values corrMat[i,i] = 1, but
for consecutive timesteps it should be possible that corrMat[i,i] < 1,
because each variable depends on itself one timestep before,
but a perfect correlation is not intended.
Maybe there is a variance decomposition of the correlation matrix
that yields an error variance for each of the N variables at the
next timestep, so that one could calculate the
values at timestep t+1 as the sum of the correlation-weighted
variables at timestep t plus a random error, distributed
according to the error variance (with mean of error = 0), that replicates
the correlation matrix again at t+1.
Assuming normal errors:
getRand <- function (range) {
return (rnorm(1,mean=0, sd=range) )
}
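That's the (very simplified) code for the i-th variable x_i, where x_t holds the values of all N variables at timestep t:
x_next_i <- 0                                      # will become x_i[t+1]
for (j in 1:N) {
  x_next_i <- x_next_i + corrMat[i, j] * x_t[j]    # weight x_j[t] by corrMat[i,j]
}
x_next_i <- x_next_i + getRand(sdErr)              # add the random error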
So the question would be more specific: how to calculate sdErr?
For simplification I try to assume, that the variance for all variables
should be 1.
Thank you for any hint, how to get one step further!
I will post a mathematical formulation of the problem on stats.stackexchange.com,
as mikeck suggested, to discuss the details of the correlation problem in more
depth.
I am still interested in finding a general formula to calculate sdErr
to use it in the calculation of x_i[t+1].
But meanwhile I found a useful practical solution to the specific question "how to calculate sdErr?" without a formula for sdErr:
(1) simply calculate all variables WITHOUT errors (according to the equation above).
(2) calculate the variances of the new variables
(3) calculate (for each i) the difference var(x_i[t]) - var(x_i[t+1]) = sdErr ^ 2
An error with standard deviation sdErr can then be added to each variable for each new observation.
This should lead to observations at t+1 which at least have the same variances as the observations at t.
Details concerning the question of whether the model definition is adequate
will be part of another post.
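Steps (1) to (3) can be sketched as follows. Everything beyond the steps themselves is an assumption: the weight matrix W (with diagonal below 1) is a hypothetical stand-in for the transition weights, step_forward is an illustrative name, and t1 is drawn with a base-R Cholesky factor instead of mvrnorm so the example has no package dependency.

```r
set.seed(123)
nPerson <- 1000
N <- 3

# Target correlation matrix at the first timestep (example values)
corrMat <- matrix(0.3, N, N)
diag(corrMat) <- 1

# t1: correlated standard-normal data (base-R stand-in for MASS::mvrnorm)
t1 <- matrix(rnorm(nPerson * N), nPerson, N) %*% chol(corrMat)

# Hypothetical transition weights: each variable depends mostly on itself
W <- matrix(0.1, N, N)
diag(W) <- 0.6

step_forward <- function(x, W) {
  noErr  <- x %*% t(W)                               # (1) values without error
  errVar <- apply(x, 2, var) - apply(noErr, 2, var)  # (2) + (3): sdErr^2 per variable
  stopifnot(all(errVar >= 0))                        # otherwise no real sdErr exists
  err <- sapply(sqrt(errVar), function(s) rnorm(nrow(x), sd = s))
  noErr + err                                        # values at t+1
}

t2 <- step_forward(t1, W)
t3 <- step_forward(t2, W)

# Variances should stay close to those of t1
round(rbind(t1 = apply(t1, 2, var), t2 = apply(t2, 2, var), t3 = apply(t3, 2, var)), 2)
```

This only restores the variances. The correlations between variables at t+1 are not guaranteed to match corrMat, and the stopifnot() guard fails whenever the error-free values already have a larger variance than the values at t.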

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix whose dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the remaining columns represent the probabilities of a multinomial distribution of these counts. In other words, in each row the first column is n and the remaining k columns are the probabilities of the k categories. The matrix is also sparse, meaning that in each row many columns have a value of 0.
Here's a toy matrix I created:
mat=rbind(c(5,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),c(2,0.2,0.2,0.2,0.2,0.2,0,0,0,0,0),c(22,0.4,0.6,0,0,0,0,0,0,0,0),c(5,0.5,0.2,0,0.1,0.2,0,0,0,0,0),c(4,0.4,0.15,0.15,0.15,0.15,0,0,0,0,0),c(10,0.6,0.1,0.1,0.1,0.1,0,0,0,0,0))
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
Where say samples=100000
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive, at least in terms of RAM. Is there a more efficient solution to this problem in R?
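If all you need is the variance of the counts, compute it inside the loop instead of returning the intermediate simulated draws. Note that rmultinom returns one column per draw, so the variance has to be taken across each row (category):
vars = apply(mat,1,function(x) apply(rmultinom(samples,x[1],x[2:ncol(mat)]),1,var))
This returns one column per row of mat, with the k category variances in its rows.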
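Because the counts in each row are multinomial, the variance of each category count also has a closed form, n * p * (1 - p), so simulation may not be needed at all. A minimal sketch on the first rows of the toy matrix (var_counts is an illustrative name):

```r
# First column is n, remaining columns are the category probabilities
mat <- rbind(c(5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
             c(2, 0.2, 0.2, 0.2, 0.2, 0.2, 0, 0, 0, 0, 0),
             c(22, 0.4, 0.6, 0, 0, 0, 0, 0, 0, 0, 0))

n <- mat[, 1]
p <- mat[, -1]

# Theoretical variance of each category count: n * p * (1 - p).
# n is recycled down the columns, so every row is scaled by its own n.
var_counts <- n * p * (1 - p)
var_counts
```

This is fully vectorised and keeps the zeros of a sparse input at zero, so it should scale far better than drawing samples. For the real data it could be applied in row chunks if the full result does not fit in memory.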

Resources