Barplot mean /w SD in R-Project - r

Sounds like a trivial one, but some research didn't turn up an elegant solution:
I have a data frame with a categorical variable (GROUP) and a continuous read-out variable (bloodpressure).
How can I make a simple bar plot showing the mean of each group with its standard deviation?
There are multiple groups: A, B, C, D. How can I perform an ANOVA post-hoc analysis within the data frame? How does it work with the Mann-Whitney U test? Can I mark the significance level in the bar plot?
How can I extend this to multiple continuous variables (dia_bloodpressure, sys_bloodpressure, mean_bloodpressure) and sink() the output to different files (named after the variable)?

After some research I came across the agricolae package, which provides multiple group comparisons. The resulting objects can be passed to a decent plotting function for groupwise bar graphs +/- SD or SEM. Unfortunately, there is no way to add markers of significance between groups to the plots.

After some more programming in R, I stumbled over another nice package suitable for medical research: psych.
Regarding the question above, describe() and describeBy() give a statistical overview of a data frame; describeBy() breaks it down by a grouping variable.
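The function error.bars.by() is an advanced plotting function for group means with error bars (as far as I can tell from its documentation, these are confidence intervals by default, and its sd argument switches them to standard deviations).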
The package offers many functions on covariate analysis, which are useful in psychological research but might also help for medical and marketing research.

A possible code snippet:
library(psych)
x<-c(1,2,3,4,5,6,7,8,9,NA)
y<-c(2,3,NA,3,4,NA,2,3,NA,2)
group<-rep((factor(LETTERS[1:2])),5)
df<-data.frame(x,y,group)
df
by(df$x,df$group,summary)
by(df$x,df$group,mean)
sd(df$x) #result: NA
sd(df$x, na.rm=TRUE) #result: 2.738613
v = c("x", "y")#or
v = colnames(df)[1:2]
sapply(v, function(i) tapply(df[[i]], df$group, sd, na.rm=TRUE))
describeBy(df$x, df$group)
error.bars.by(df$x, df$group, bars=TRUE)
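For readers who prefer to stay in base R, here is a minimal sketch of the workflow asked about in the question: group means with SD error bars, an ANOVA with Tukey post-hoc comparisons, and pairwise Mann-Whitney (Wilcoxon rank-sum) tests. The data are simulated and the column names GROUP and bloodpressure are taken from the question; the group means, sample sizes, and the Holm correction are assumptions made only for illustration.

```r
# Simulated example data (assumption: four groups of ten observations each)
set.seed(42)
df <- data.frame(
  GROUP = factor(rep(LETTERS[1:4], each = 10)),
  bloodpressure = rnorm(40, mean = rep(c(120, 125, 135, 128), each = 10), sd = 8)
)

# Group means and standard deviations
means <- tapply(df$bloodpressure, df$GROUP, mean, na.rm = TRUE)
sds   <- tapply(df$bloodpressure, df$GROUP, sd, na.rm = TRUE)

# Bar plot of the means; barplot() returns the bar midpoints on the x axis
mids <- barplot(means, ylim = c(0, max(means + sds) * 1.1),
                ylab = "bloodpressure (mean +/- SD)")
# Flat-headed arrows drawn in both directions serve as error bars
arrows(mids, means - sds, mids, means + sds, angle = 90, code = 3, length = 0.05)

# One-way ANOVA followed by Tukey's post-hoc comparisons
fit   <- aov(bloodpressure ~ GROUP, data = df)
tukey <- TukeyHSD(fit)

# Pairwise Mann-Whitney U tests with Holm-adjusted p-values
mwu <- pairwise.wilcox.test(df$bloodpressure, df$GROUP, p.adjust.method = "holm")
```

Significance markers are not drawn automatically; one option is to place them by hand with text() at the x positions stored in mids.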

Related

Is there a R package for computing intraindividual Gamma correlations?

I'm searching for an R package that enables me to compute Goodman and Kruskal's gamma correlations within each subject. I have 2 variables with 16 items each, which I would like to correlate per subject.
So far I have used the Hmisc package and its rcorr.cens() function. However, the function computes one correlation across all subjects, and I failed to adapt the code to get a correlation for each subject. This is what I have tried so far:
```
Gamma_correlation <- dataframe %>%
  group_by(Subject) %>%
  rcorr.cens(dataframe$Variable_1,
             dataframe$Variable_2,
             outx = TRUE)[2]
```
You could possibly achieve this by dropping the dataframe$ prefixes and keeping just the variable names, so that they are evaluated within each group. Hmisc also seems to mask dplyr's summarize function, so put the rcorr.cens call inside dplyr::summarize(), with the explicit dplyr:: prefix.
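To make the per-subject idea concrete without depending on Hmisc or dplyr, here is a small base R sketch. Note the swap: instead of rcorr.cens() it uses a hand-written helper, gk_gamma() (a hypothetical name), that counts concordant and discordant pairs directly, and it groups with split() rather than group_by(). The toy data and the column names Subject, Variable_1 and Variable_2 follow the question but are otherwise made up.

```r
# Goodman and Kruskal's gamma: (concordant - discordant) / (concordant + discordant).
# Tied pairs are ignored. Every pair is counted twice here, which cancels in the ratio.
gk_gamma <- function(x, y) {
  s <- sign(outer(x, x, "-")) * sign(outer(y, y, "-"))
  concordant <- sum(s > 0, na.rm = TRUE)
  discordant <- sum(s < 0, na.rm = TRUE)
  (concordant - discordant) / (concordant + discordant)
}

# Toy data: three subjects with four items each (assumption for illustration)
dat <- data.frame(
  Subject    = rep(1:3, each = 4),
  Variable_1 = rep(1:4, times = 3),
  Variable_2 = c(2, 4, 6, 8,   # subject 1: perfectly concordant
                 4, 3, 2, 1,   # subject 2: perfectly discordant
                 1, 3, 2, 4)   # subject 3: five concordant pairs, one discordant
)

# One gamma per subject
gamma_by_subject <- sapply(split(dat, dat$Subject),
                           function(d) gk_gamma(d$Variable_1, d$Variable_2))
gamma_by_subject
```

The same split-then-apply pattern should work with rcorr.cens() inside the anonymous function if you want to stay with Hmisc.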

Multinomial logit: estimation on a subset of alternatives in R

As McFadden (1978) showed, if the number of alternatives in a multinomial logit model is so large that computation becomes impossible, it is still feasible to obtain consistent estimates by randomly subsetting the alternatives, so that the estimated probabilities for each individual are based on the chosen alternative and C other randomly selected alternatives. In this case, the size of the subset of alternatives is C+1 for each individual.
My question is about the implementation of this algorithm in R. Is it already embedded in any multinomial logit package? If not - which seems likely based on what I know so far - how would one go about including the procedure in pre-existing packages without recoding extensively?
Not sure whether the question is more about doing the sampling of alternatives or the estimation of MNL models after sampling of alternatives. To my knowledge, there are no R packages that do sampling of alternatives (the former) so far, but the latter is possible with existing packages such as mlogit. I believe the reason is that the sampling process varies depending on how your data is organized, but it is relatively easy to do with a bit of your own code. Below is code adapted from what I used for this paper.
library(tidyverse)
# create artificial data
set.seed(6)
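# data frame of chooser id and chosen alt_id
# (draw one chosen alternative per chooser; a single draw would be
# recycled, so that every chooser picks the same alternative)
id_alt <- data.frame(
id = 1:1000,
alt_chosen = sample(1:30, 1000, replace = TRUE)
)
# data frame for universal choice set, with an alt-specific attribute (alt_x2)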
alts <- data.frame(
alt_id = 1:30,
alt_x2 = runif(30)
)
# conduct sampling of 9 non-chosen alternatives
id_alt <- id_alt %>%
mutate(.alts_all =list(alts$alt_id),
# use weights to avoid including chosen alternative in sample
.alts_wtg = map2(.alts_all, alt_chosen, ~ifelse(.x==.y, 0, 1)),
.alts_nonch = map2(.alts_all, .alts_wtg, ~sample(.x, size=9, prob=.y)),
# combine chosen & sampled non-chosen alts
alt_id = map2(alt_chosen, .alts_nonch, c)
)
# unnest the data frame above to create a long-format data frame
# with rows varying by chooser id and alt_id
id_alt_lf <- id_alt %>%
select(-starts_with(".")) %>%
unnest(alt_id)
# join long format df with alts to get alt-specific attributes
id_alt_lf <- id_alt_lf %>%
left_join(alts, by="alt_id") %>%
mutate(chosen=ifelse(alt_chosen==alt_id, 1, 0))
library(mlogit)
# convert to mlogit data frame before estimating
id_alt_mldf <- mlogit.data(id_alt_lf,
choice="chosen",
chid.var="id",
alt.var="alt_id", shape="long")
mlogit( chosen ~ 0 + alt_x2, id_alt_mldf) %>%
summary()
It is, of course, possible without using the purrr::map functions, by using apply variants or looping through each row of id_alt.
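As a rough illustration of the loop-based alternative, the sketch below does the sampling step in base R with lapply(). It uses only five choosers to keep the output small, and the object names mirror the code above; the choice of 9 sampled non-chosen alternatives is carried over from it.

```r
set.seed(6)
n_alts <- 30

# One chosen alternative per chooser (five choosers for illustration)
id_alt <- data.frame(id = 1:5,
                     alt_chosen = sample(n_alts, 5, replace = TRUE))

# For each chooser, keep the chosen alternative and sample 9 of the other 29
sampled <- lapply(seq_len(nrow(id_alt)), function(i) {
  chosen <- id_alt$alt_chosen[i]
  others <- sample(setdiff(seq_len(n_alts), chosen), size = 9)
  data.frame(id     = id_alt$id[i],
             alt_id = c(chosen, others),
             chosen = rep(c(1, 0), times = c(1, 9)))
})

# Stack into one long-format data frame: 10 rows per chooser
id_alt_lf <- do.call(rbind, sampled)
head(id_alt_lf, 10)
```

From here the alternative-specific attributes can be joined on alt_id and the result converted for mlogit as shown above.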
Sampling of alternatives is not currently implemented in the mlogit package. As stated previously, the solution is to generate a data.frame with a subset of alternatives and then use mlogit (and, importantly, a formula with no intercepts). Note that mlogit can deal with unbalanced data, i.e. the number of alternatives doesn't have to be the same for all the choice situations.
My recommendation would be to review the mlogit package.
Vignette:
https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit2.pdf
the package has a set of example exercises that (in my opinion) are worth looking at:
https://cran.r-project.org/web/packages/mlogit/vignettes/Exercises.pdf
You may also want to take a look at the gmnl package (I have not used it)
https://cran.r-project.org/web/packages/gmnl/index.html
Multinomial Logit Models with Continuous and Discrete Individual Heterogeneity in R: The gmnl Package
Mauricio Sarrias' (Author) gmnl Web page
Question: What specific problem(s) are you trying to apply a multinomial logit model to? Suitably intrigued.
Aside from the above question, I hope the above points you in the right direction.

lm() saving residuals with group_by with R- confused spss user

This is a complete re-edit of my original question.
Let's assume I'm working on RT data gathered in a repeated-measures experiment. As part of my usual routine I always transform RT to natural logarithms and then compute a Z score for each RT within each participant, adjusting for trial number. This is typically done with a simple regression in SPSS syntax:
split file by subject.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT rtLN
/METHOD=ENTER trial
/SAVE ZRESID.
split file off.
To reproduce the same procedure in R, first generate the data:
#load libraries
library(dplyr); library(magrittr)
#generate data
ob<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
ob<-factor(ob)
trial<-c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
rt<-c(300,305,290,315,320,320,350,355,330,365,370,370,560,565,570,575,560,570)
cond<-c("first","first","first","snd","snd","snd","first","first","first","snd","snd","snd","first","first","first","snd","snd","snd")
#Following variable is what I would get after using SPSS code
ZreSPSS<-c(0.4207,0.44871,-1.7779,0.47787,0.47958,-0.04897,0.45954,0.45487,-1.7962,0.43034,0.41075,0.0407,-0.6037,0.0113,0.61928,1.22038,-1.32533,0.07806)
sym<-data.frame(ob, trial, rt, cond, ZreSPSS)
I could apply a formula (a blend of Mark's and Daniel's solutions) to compute residuals from a lm(log(rt)~trial) regression, but for some reason group_by is not working here:
sym %<>%
group_by (ob) %>%
mutate(z=residuals(lm(log(rt)~trial)),
obM=mean(rt), obSd=sd(rt), zRev=z*obSd+obM)
Resulting values clearly show that grouping hasn't kicked in.
Any idea why it didn't work out?
Using dplyr and magrittr, you should be able to calculate z-scores within individual with this code (it breaks things into the groups you tell it to, then calculates within that group).
experiment %<>%
group_by(subject) %>%
mutate(rtLN = log(rt)
, ZRE1 = scale(rtLN))
You should then be able to use that in your model. However, one thing that may help your shift to R thinking is that you can likely build your model directly, instead of having to make all of these columns ahead of time. For example, using lme4 to treat subject as a random effect:
withRandVar <-
lmer(log(rt) ~ cond + (1|as.factor(subject))
, data = experiment)
Then, the residuals should already be on the correct scale. Further, if you use the z-scores, you probably should be plotting on that scale. I am not actually sure what running with the z-scores as the response gains you -- it seems like you would lose information about the degree of difference between the groups.
That is, if the groups are tight, but the difference between them varies by subject, a z-score may always show them as a similar number of z-scores away. Imagine, for example, that you have two subjects, one scores (1,1,1) on condition A and (3,3,3) on condition B, and a second subject that scores (1,1,1) and (5,5,5) -- both will give z-scores of (-.9,-.9,-.9) vs (.9,.9,.9) -- losing the information that the difference between A and B is larger in subject 2.
If, however, you really want to convert back, you can probably use this to store the subject means and sds, then multiply the residuals by subjSD and add subjMean.
experiment %<>%
group_by(subject) %>%
mutate(rtLN = log(rt)
, ZRE1 = scale(rtLN)
, subjMean = mean(rtLN)
, subjSD = sd(rtLN))
mylm <- lm(x~y)
rstandard(mylm)
This returns the standardized residuals of the fitted model. To bind these to a variable you can do:
zresid <- rstandard(mylm)
EXAMPLE:
a<-rnorm(10, mean=10)
b<-rnorm(10, mean=10)
mylm <- lm(a~b)
mylm.zresid<-rstandard(mylm)
See also:
summary(mylm)
and
mylm$coefficients
mylm$fitted.values
mylm$xlevels
mylm$residuals
mylm$assign
mylm$call
mylm$effects
mylm$qr
mylm$terms
mylm$rank
mylm$df.residual
mylm$model
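Returning to the per-subject regression from the question, here is a base R sketch that fits lm(log(rt) ~ trial) separately for each subject and scales the residuals. It rests on one assumption: that SPSS's ZRESID is the raw residual divided by the residual standard error of the fit, which is not quite what rstandard() returns (rstandard() also adjusts for leverage). On the question's data this assumption appears to reproduce the ZreSPSS column to about three decimals.

```r
# Data from the question
sym <- data.frame(
  ob    = factor(rep(1:3, each = 6)),
  trial = rep(1:6, times = 3),
  rt    = c(300, 305, 290, 315, 320, 320,
            350, 355, 330, 365, 370, 370,
            560, 565, 570, 575, 560, 570)
)
ZreSPSS <- c(0.4207, 0.44871, -1.7779, 0.47787, 0.47958, -0.04897,
             0.45954, 0.45487, -1.7962, 0.43034, 0.41075, 0.0407,
             -0.6037, 0.0113, 0.61928, 1.22038, -1.32533, 0.07806)

# Fit one regression per subject and divide the raw residuals by the
# residual standard error (assumed to match SPSS's ZRESID definition)
z_by_subject <- lapply(split(sym, sym$ob), function(d) {
  fit <- lm(log(rt) ~ trial, data = d)
  residuals(fit) / summary(fit)$sigma
})

# Put the per-subject vectors back in the original row order
sym$z <- unsplit(z_by_subject, sym$ob)

# Largest absolute difference from the SPSS values
max(abs(sym$z - ZreSPSS))
```

Because each model is fitted on an explicitly split subset, the grouping cannot silently be lost, which makes this a useful cross-check for the group_by() version.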

Cross-correlation of 5 time series (distance) and interpretation

I would appreciate some input in this a lot!
I have data for 5 time series (an example of 1 step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean which were investigated 6h apart. All 5 steps are spaced vertically by 0.1m (and the 6h in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example:
I find the documentation in R on that not so great, so what I did so far is use the package MTS with the ccm function to create cross correlation matrices. However, the interpretation of the figures is rather difficult with sparse documentation. I would appreciate some help with that a lot.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save in file cross_correlation_stack.csv or change as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 = read.csv(d1)
# USING package MTS
mod1<-ccm(d2,lag=1000,level=T)
#USING base R
acf(d2,lag.max=1000)
# MQ plot also from MTS package
mq(d2,lag=1000)
The ccm command produces three figures, and the acf command from above produces a fourth (figures not reproduced here).
My question now is if somebody can give some input in whether I am going in the right direction or are there better suited packages and commands?
Since the default figures don't get any titles etc. What am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it "... calculates autocovariance or autocorrelation ...", and I assume this is not what I want. But then again, it's the only command that seems to work with multivariate input. I am confused.
The plot with the significance values shows that after a lag of 150 (15 meters) the p values increase. How would you interpret that regarding my data? 0.1 intervals of species sightings and many lags up to 100-150 are significant? Would that mean something like that peaks in sightings are stable over the 5 time-steps on a scale of 150 lags aka 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on a pair of vectors, so you'll have to loop over the columns of your data frame (d2 in your code; d1 only holds the file path). Something like:
# one list slot and one plot panel per pair of columns
cc <- vector("list",choose(dim(d2)[2],2))
par(mfrow=c(ceiling(choose(dim(d2)[2],2)/2),2))
cnt <- 1
for(i in 1:(dim(d2)[2]-1)) {
for(j in (i+1):dim(d2)[2]) {
cc[[cnt]] <- ccf(d2[,i],d2[,j],main=paste0("Cross-correlation of ",colnames(d2)[i]," with ",colnames(d2)[j]))
cnt <- cnt + 1
}
}
This will plot each of the estimated CCF's and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
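The lag convention is easy to get backwards, so here is a small check on simulated data. The series are made up: x is built as a copy of y delayed by three steps plus a little noise, so that x[t+3] is approximately y[t] and the estimated CCF should peak at lag +3.

```r
set.seed(1)
n <- 200
y <- rnorm(n)

# x is y delayed by 3 steps (x[t + 3] is roughly y[t]), plus small noise
x <- c(rep(0, 3), y[1:(n - 3)]) + rnorm(n, sd = 0.1)

# Estimate the cross-correlation without plotting
cc <- ccf(x, y, lag.max = 10, plot = FALSE)

# Lag at which the estimated correlation is largest
peak_lag <- as.vector(cc$lag)[which.max(as.vector(cc$acf))]
peak_lag
```

Swapping the arguments, ccf(y, x), moves the peak to lag -3, which is a quick way to confirm which series leads in your own data.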
All of that said, however, the ccf is only defined for data that are more-or-less normally distributed, but your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association" such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.

Data Standardisation for Neural Network in R

I have built a multilayer perceptron neural network in SPSS 22. I tried the same using the "neuralnet" package in R, but the results are not satisfactory.
SPSS standardises the data before training, and I am wondering:
Does the "neuralnet" package perform any sort of standardisation? I could not find anything in its guide.
According to the SPSS guide here, standardisation is done as follows:
Subtract the mean and divide by the standard deviation, (x−mean)/s.
Is there an efficient function that can do this in R? Since the method is quite simple, I can implement the scaling on my own, but it might not be efficient since the number of data elements and records is very large.
Or should I use another neural network package, like "monmlp", that standardises data automatically?
Many thanks
This might be useful if you need to standardize multiple columns in a data frame (call it foo):
# Index of columns to standardize
cols <- c(1,2,3,4)
# Standardize
library(plyr)
standardize <- function(x) as.numeric((x - mean(x)) / sd(x))
foo[cols] <- plyr::colwise(standardize)(foo[cols])
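Base R's scale() performs the same (x - mean)/sd transformation column by column without extra packages. The sketch below uses a made-up data frame foo with two numeric columns and one non-numeric column; the column names are assumptions for illustration.

```r
set.seed(1)
foo <- data.frame(
  a     = rnorm(100, mean = 50, sd = 10),
  b     = runif(100, min = 0, max = 1000),
  label = sample(letters, 100, replace = TRUE)
)

# Columns to standardize (numeric only)
cols <- c("a", "b")

# scale() centers each column on its mean and divides by its standard deviation
sc <- scale(foo[cols])
foo[cols] <- sc

# The means and SDs that were used are stored as attributes, so the same
# transformation can be applied later to new data
centers <- attr(sc, "scaled:center")
scales  <- attr(sc, "scaled:scale")

colMeans(foo[cols])   # approximately 0
sapply(foo[cols], sd) # 1
```

Keeping centers and scales matters for neural networks: test or prediction data should be transformed with the training-set values, for example via scale(newdata[cols], center = centers, scale = scales).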
