Constructing an error bar with compositional data in R

I have a problem and I hope somebody can help me.
I have a data set with compositional data: for each day of the week over 160 weeks, the proportions of three types of cars are measured, and the three proportions sum to 1.
My task is to construct the mean and an 'error bar' for each day. I used the following lines of code in R:
library(ggplot2)
Day <- rep(c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday",
             "Saturday"), 3)
cars <- c(rep("nissan",7), rep("toyota",7), rep("bmw",7))
y <- colMeans(datadag, na.rm = TRUE)          # per-day means of the ratios
delta <- apply(datadag, 2, sd, na.rm = TRUE)  # per-day standard deviations
df <- data.frame(Day, cars, y, delta)
p <- ggplot(df, aes(x = Day, y = y, group = cars, color = cars)) +
  geom_point() +
  geom_errorbar(aes(ymin = y - delta, ymax = y + delta), width = .6)
print(p)
The code above gives the following plot:
The problem I face is that the error bounds exceed 0 and 1, which is not possible because the data are compositional. Can anybody tell me what I did wrong?

Your problem is statistical, not to do with R. You are assuming that the standard deviation will "know" that your data cannot be negative. Consider the following.
foo <- c(0,0,1,1000)
mean(foo) - sd(foo)
[1] -249.5836
I am not sure if the same problem can arise with the standard error but I suspect it can...
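If you want an interval that is guaranteed to stay inside [0, 1], one option is to summarise each day with quantiles instead of mean ± sd. A minimal sketch, reusing datadag, Day and cars from the question (the 2.5%/97.5% limits are just an illustrative choice):
# Sketch: quantile-based "error bars" that cannot leave [0, 1].
y     <- apply(datadag, 2, median,   na.rm = TRUE)
lower <- apply(datadag, 2, quantile, probs = 0.025, na.rm = TRUE)
upper <- apply(datadag, 2, quantile, probs = 0.975, na.rm = TRUE)
df <- data.frame(Day, cars, y, lower, upper)
ggplot(df, aes(x = Day, y = y, group = cars, color = cars)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = .6)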

Related

Investigate correlations between a large number of binary variables and a metric variable

I am trying to investigate a dataset with about 260 binary variables and one metric variable. The binary variables are dummies of categorical variables, which I want to regress on the metric variable.
How can I visualize them?
I first tried plot(), but it was not possible to use it on the whole dataset, and even when I use only a few variables I cannot interpret the result.
I tried pairs(), but got the output:
'Error in plot.new() : figure margins too large'
I also tried sjp.corr() from the sjPlot package, but the plot is too small and not interpretable.
I am not really experienced in handling data like this. What would you recommend? How would you analyse and interpret the data (even non-graphically)? Would you recommend not trying to interpret it graphically at all? I also run into a problem when I investigate it non-graphically with rcorr() from the Hmisc package: I only get a 3 x 260 table and it omits 258 rows. What can I do?
I am really sorry, but I cannot show you the data. I would still be glad if you could give me some advice.
You did not provide the data, but from your plot I can infer a few things:
You have two features; one of them is binary (0/1), while the other is an integer from 0 to 600.
Both 0s and 1s are most frequent when the other feature is between 0 and 150.
Given the above, I generate a random dataset and answer your question based on that data.
dt <- data.frame(binary = sample(c("0", "1"), 100, replace = TRUE),
                 price  = rnbinom(100, size = 100, prob = 0.5))
In my dataset, binary is a string that can only be "1" or "0", and price is a numeric value.
The first thing I do is study the price feature with a histogram, which shows its distribution:
library(ggplot2)
ggplot(dt, aes(x = price, fill = binary)) +
  geom_histogram(position = "identity", alpha = .5) +
  geom_density()
and the result is:
In the next step, I want to compare the frequency of 1s with 0s:
library(ggplot2)
ggplot(dt, aes(binary, fill = binary)) +
  geom_bar()
which shows their frequencies:
I doubt that regression is a good choice for prediction here. I would say the better choice is classification, for example using rpart:
library(rpart)
model <- rpart(binary ~ price, data = dt, method = "class")
But don't forget to keep separate training and test data, as sketched below.
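A minimal sketch of such a split, reusing the simulated dt from above (the 70/30 proportion is just an illustrative choice):
# Sketch: hold out a test set before fitting the tree.
set.seed(1)
train_idx <- sample(seq_len(nrow(dt)), size = 0.7 * nrow(dt))
train <- dt[train_idx, ]
test  <- dt[-train_idx, ]
model <- rpart(binary ~ price, data = train, method = "class")
pred  <- predict(model, newdata = test, type = "class")
table(predicted = pred, actual = test$binary)  # simple confusion matrix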

PCA scores vs. Varimax-rotated PCA scores

I have performed PCA using prcomp in R with my databases of 75-76 indicator variables and 7232 companies, including NAs. Before applying the function, I centred my data, but did not rescale them because they are all indicator variables. (Is my reasoning correct?)
After that, I varimax-rotated the loadings of the first 2 or 3 principal components, following the instructions by amoeba here.
Since I had centred but not rescaled my data, I changed the code to:
Varimax_results <- varimax(rawLoadings, normalize = FALSE)
invLoadings <- t(pracma::pinv(Varimax_results$loadings))
scores <- scale(DatosPCA, scale = FALSE) %*% invLoadings
Now I am trying to figure out why the scores given by "prcomp" and the scores obtained using the code above are not the same.
I am probably missing some theoretical background, so I would be grateful if someone could tell me if the scores are supposed to be the same and, in that case, what I am doing wrong in my code. If they are not supposed to be the same, which ones should I use?
Thank you very much!
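For reference, here is a minimal, self-contained sketch of the pipeline described above, with simulated data standing in for the real database (the names pca, k and the simulated DatosPCA are illustrative, and the rawLoadings line follows the linked answer); it is not the asker's actual analysis:
library(pracma)
set.seed(1)
DatosPCA <- matrix(rbinom(200 * 10, 1, 0.3), ncol = 10)  # simulated 0/1 indicator data
k <- 3                                                   # number of components to rotate
pca <- prcomp(DatosPCA, center = TRUE, scale. = FALSE)
# Loadings scaled by the component standard deviations, as in the linked answer
rawLoadings <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k], k, k)
Varimax_results <- varimax(rawLoadings, normalize = FALSE)
invLoadings <- t(pinv(unclass(Varimax_results$loadings)))
scores <- scale(DatosPCA, scale = FALSE) %*% invLoadings  # rotated-component scores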

Cross-correlation of 5 time series (distance) and interpretation

I would really appreciate some input on this!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean, and the profiles were taken 6 h apart. Within each profile the observations are spaced vertically by 0.1 m (and the steps are 6 h apart in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example:
I find the R documentation on this topic not very helpful. So far I have used the MTS package with the ccm function to create cross-correlation matrices. However, interpreting the figures is rather difficult given the sparse documentation, so I would appreciate some help with that.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save it as cross_correlation_stack.csv, or change the name as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 <- read.csv(d1)
# Using package MTS
mod1 <- ccm(d2, lag = 1000, level = TRUE)
# Using base R
acf(d2, lag.max = 1000)
# MQ plot, also from the MTS package
mq(d2, lag = 1000)
The ccm command produces several cross-correlation figures, and the acf command from above produces a multivariate ACF plot (figures omitted).
My question now is whether somebody can tell me if I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures do not get any titles etc., what exactly am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it calculates the autocovariance or autocorrelation, which I assume is not what I want. But then again it is the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of 150 (15 meters) the p-values increase. How would you interpret that with regard to my data: 0.1 m intervals of species sightings and many significant lags up to 100-150? Would that mean something like peaks in sightings being stable over the 5 time steps on a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which estimates the cross-correlation function between any two variables x and y. However, it only works on vectors, so you will have to loop over the columns of your data frame (d2 in your code). Something like:
n_pairs <- choose(ncol(d2), 2)                 # number of column pairs
cc <- vector("list", n_pairs)                  # one CCF estimate per pair
par(mfrow = c(ceiling(n_pairs / 2), 2))
cnt <- 1
for (i in 1:(ncol(d2) - 1)) {
  for (j in (i + 1):ncol(d2)) {
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i],
                                   " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x, y) is an estimate of the correlation between x[t+k] and y[t].
All of that said, the CCF is really only appropriate for data that are more or less normally distributed, and your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should look into other measures of "association", such as mutual information estimated from entropy. I suggest checking out the R packages entropy and infotheo; a small example with infotheo is sketched below.
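A minimal sketch of that idea with infotheo, applied to the first two columns of d2 (the equal-frequency binning with 10 bins is just an illustrative choice, not a recommendation):
# Sketch: mutual information between two of the profiles, using infotheo.
library(infotheo)
x_disc <- discretize(d2[, 1], disc = "equalfreq", nbins = 10)
y_disc <- discretize(d2[, 2], disc = "equalfreq", nbins = 10)
mutinformation(x_disc, y_disc, method = "emp")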

Explanation of the formula object used in the coxph function in R

I am a complete novice when it comes to survival analysis. I am working on a project that requires I use the coxph function in the "survival" package, but I am running into trouble because I do not understand what is required by the formula object.
Most descriptions I can find about the function are as follows:
"a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the Surv function. "
I know what needs to be on the left of the operator; the issue is what the function expects on the right-hand side.
Here is a link showing what my data looks like (the actual data set is much larger; I am only displaying the first 20 data points for brevity):
Short explanation of the data:
- Row 1 is the header.
- Each row after that is a separate patient.
- The first column is the age of the patient at the time of the study.
- Columns 2 through 14 (headed by x2-x13), plus columns 19 (x18) and 20 (x19), are covariates such as race, relationship status, and medical conditions that take either true (1) or false (0) values.
- Columns 15 (x14) through 18 (x17) are covariates such as tumor size, which take whole-number values greater than 0.
- The second-to-last column, "sur", is the number of months survived, and "index" indicates whether that time is right-censored (1 for true, 0 for false).
Given this data, I need to plot a Cox proportional hazards curve, but I end up with an incorrect plot because the right-hand side of the formula object is wrong.
Here is my code; "temp4" is the name I gave to the data table:
library("survival")
temp4 <- read.table("~/data.txt", header=TRUE)
seerCox <- coxph(Surv(sur, index)~ temp4$x1 + temp4$x2 + temp4$x3 + temp4$x4 + temp4$x5 + temp4$x6 + temp4$x7 + temp4$x8 + temp4$x9 + temp4$x10 + temp4$x11 + temp4$x12 + temp4$x13 + temp4$x14 + temp4$x15 + temp4$x16 + temp4$x17 + temp4$x18 + temp4$x19, data=temp4, singular.ok=TRUE)
plot(survfit(seerCox), main= "Cox Estimate", mark.time=FALSE, ylab="Probability", xlab="Survival Time in Months", col=c("blue", "red", "green"))
I should also note that I have tried replacing the right-hand side with the number 1, with a period, and with leaving it blank; these produce a Kaplan-Meier curve.
The following is the console output:
Each new line is an example of the error produced depending on how I filter the data (i.e. if I only include patients with ages greater than 85, etc.).
If someone could explain how it works, it would be greatly appreciated.
PS: I have searched for over a week for a solution, and I am asking for help here as a last resort.
You should not be using the prefix temp4$ if you are also using a data argument. The whole purpose of supplying a data argument is to allow you to drop those prefixes in the formula.
seerCox <- coxph(Surv(sur, index) ~ ., data = temp4, singular.ok = TRUE)
The above would use all of the x-variables in your temp4 data.frame. This will use just the first three:
seerCox <- coxph(Surv(sur, index) ~ x1 + x2 + x3, data = temp4)
Exactly what the warnings signify depends on the data (as you have in one sense already exemplified by producing different sorts of collinearity with different subsets). If you have collinear columns, you get singularities in the inversion of the model matrix, and the software will attempt to drop aliased columns with a warning. This is really telling you that you do not have enough data to build the large models you are attempting. Exploring that possibility with table calls is often informative; see the sketch below.
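A minimal sketch of that kind of exploration on the binary covariates from the question (x2 and x3 are just two of the dummies; any pair works the same way):
# Sketch: cross-tabulate pairs of binary covariates to spot empty or
# near-empty cells, which lead to aliased/collinear columns in the model.
with(temp4, table(x2, x3))
# Quickly scan how sparse each binary covariate is overall.
sapply(temp4[, paste0("x", 2:13)], table)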
Bottom line: this is not so much a problem with your formula construction as a problem of not understanding the limitations of the chosen method with the dataset you have assembled. You need to be more careful about defining your goals. What is the highest priority in this research? Do you really need every variable? Is it possible to aggregate some of these anonymous variables into clinically meaningful categories such as diagnostic categories or comorbidities?

"system is computationally singular" error when using `gmm` (GMM Estimation)

I am trying to use the gmm package in R to estimate the parameters (a-f) of a linear model:
LEV1 = a*Macro + b*Firm + c*Sector + d*qtr + e*fqtr + f*tax
Macro, Firm and Sector are matrices with n rows; qtr, fqtr and tax are vectors with n elements.
I have one large data frame called unconstrd that has all of the data. First, I break that data into separate matrices:
v_LEV1 <- as.matrix(unconstrd$LEV1)
Macro  <- as.matrix(cbind(unconstrd$Agg_Corp_Prof, unconstrd$R1000_TR,
                          unconstrd$CP_Spread))
Firm   <- as.matrix(cbind(unconstrd$ppe_ratio, unconstrd$op_inc_ratio_avg,
                          unconstrd$selling_exp_avg, unconstrd$tax_avg,
                          unconstrd$Mark_to_Bk, unconstrd$mc_ratio))
Sector <- as.matrix(cbind(unconstrd$Sect_Flag03, unconstrd$Sect_Flag04,
                          unconstrd$Sect_Flag05, unconstrd$Sect_Flag06,
                          unconstrd$Sect_Flag07, unconstrd$Sect_Flag08,
                          unconstrd$Sect_Flag12, unconstrd$Sect_Flag13,
                          unconstrd$Sect_Flag14, unconstrd$Sect_Flag15,
                          unconstrd$Sect_Flag17))
v_qtr  <- as.matrix(unconstrd$qtr)
v_fqtr <- as.matrix(unconstrd$fqtr)
v_tax  <- as.matrix(unconstrd$tax_dummy)
Then, I bind the data together for the x variable called by gmm:
h <- cbind(Macro, Firm, Sector, v_qtr, v_fqtr, v_tax)
Then, I invoke gmm:
library(gmm)
gmm1 <- gmm(v_LEV1 ~ Macro + Firm + Sector + v_qtr + v_fqtr + v_tax, x = h)
I get the message:
Error in solve.default(crossprod(hm, xm), crossprod(hm, ym)) :
system is computationally singular: reciprocal condition number = 1.10214e-18
I apologize in advance and admit that I am a neophyte at R and have never used GMM before. The gmm function is very general; I have looked at the examples available on the web, but nothing seems specific enough to help my situation.
You are trying to fit onto a matrix which does not have full rank; try excluding some of the variables and/or look for errors (a quick rank check is sketched below). We cannot say much more without your data, or at least a sample.
That's more of a modelling question for Crossvalidated.com than a programming question for StackOverflow.
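For what it's worth, a minimal sketch of such a rank check on the regressor matrix h built in the question (the commented caret line is an optional extra, assuming the caret package is installed):
# Sketch: check whether the regressor matrix has full column rank.
# If qr(h)$rank is less than ncol(h), some columns are linearly dependent
# (e.g. a complete set of sector dummies plus an implicit intercept).
qr(h)$rank
ncol(h)
# The caret package can point at the offending columns (optional):
# caret::findLinearCombos(h)$remove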
I was pretty certain there was no linear dependency between my variables, but I went through the exercise of adding one variable at a time to see what was causing the errors. In the end, I asked a colleague to run the GMM estimation in SAS and it ran perfectly, with no error messages. I am not sure what the problem with the R version is, but at this point I have a solution and have given up on gmm in R.
Thanks to everyone who tried to help.
