Mahalanobis distance between profiles in R

A sample of 100 subjects responded to two personality tests. These tests have slightly different wordings but are generally the same, i.e. they both measure the same 4 attitudes. Therefore, I have 2 matrices like this, with 4 scores per subject:
>test1
subj A1 A2 A3 A4
1 -2.14 1.21 0.93 -1.72
2 0.25 1.17 0.67 0.67
>test2
subj A1 A2 A3 A4
1 -1.99 1.11 1.00 -1.52
2 0.24 1.20 0.71 0.65
I'd like to evaluate the similarity of profiles in the two tests, i.e. the similarity of the two sets of 4 scores for each individual. I feel like the Mahalanobis distance is the measure I need, and I checked some packages (HDMD, StatMatch) but couldn't find the right function.

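One approach is to create a matrix of difference scores and then calculate the Mahalanobis distance of each subject's difference scores. Note that mahalanobis() returns squared distances, so take sqrt() if you want the distance itself; if test1 and test2 carry a subj column, drop it before subtracting.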
testDiff <- test1 - test2
testDiffMahalanobis <- mahalanobis(testDiff,
center = colMeans(testDiff),
cov = cov(testDiff))
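A minimal, self-contained sketch of the same idea on simulated data. The simulated scores, the noise level, and the assumption that test1 and test2 are numeric matrices holding only the four attitude columns are illustrative, not taken from the question:

```r
set.seed(1)
n <- 100

# Simulated stand-ins for the two tests: 4 attitude scores per subject
test1 <- matrix(rnorm(n * 4), ncol = 4,
                dimnames = list(NULL, paste0("A", 1:4)))
test2 <- test1 + matrix(rnorm(n * 4, sd = 0.1), ncol = 4)

# Per-subject difference profile
testDiff <- test1 - test2

# mahalanobis() returns SQUARED distances
d2 <- mahalanobis(testDiff,
                  center = colMeans(testDiff),
                  cov = cov(testDiff))
d <- sqrt(d2)

# Subjects whose two profiles differ most, relative to the typical difference
head(order(d, decreasing = TRUE))
```

With this centering, a large distance flags a subject whose difference profile is unusual compared with the average difference across subjects, not necessarily one whose raw scores differ by a large amount.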


Heatmap of effect sizes and p-values using different exposures and outcomes in ggplot2

I want to create a heat map that graphically shows effect sizes between different outcomes and exposures and whether the p-values were significant.
I have created one big data frame containing all exposure-outcome tests with p-values and effect sizes. The effect direction can be positive or negative. There are great resources for doing this with correlation matrices, such as corrplot.
I don't understand how to do this for effect sizes with different exposures and outcomes.
This would be the sample data frame. There would be 20 exposures and 15 outcomes.
Here is a shortened example. The estimates and p-values are made up, so disregard the statistical nonsense in the values.
dat
# id Exposure Outcome beta p-value se x
# 1 a 1 0.02 0.04 0.001
# 1 a 2 0.52 0.001 0.02
# 1 a 3 0.001 0.54 0.001
# 1 b 1 -0.02 0.09 0.045
# 1 b 2 0.06 0.12 0.03
# 1 b 3 -0.1 0.41 0.09
# 1 c 1 -0.42 0.01 0.08
This is an example of a similar plot using correlation.

Monthly slope coefficients: regressing the first column on the rest of the columns

I would like to regress the first column, the market return (as y), on each of the remaining columns (as X) and create a data frame of monthly slope coefficients. My data frame looks like this:
Date      Market return  AFARAK GROUP PLC  AFFECTO OYJ
1/3/2007  -0.45          0.00              0.85
1/4/2007  -0.92          2.47              -0.85
1/5/2007  -1.98          3.98              -1.14
The expected output, a data frame of slope coefficients, looks like this:
Date    AFARAK GROUP PLC  AFFECTO OYJ
Jan-07  1                 0.5
Feb-07  2                 1.5
Mar-07  2                 1
Apr-07  3                 2
Could someone help me with this?
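One possible approach, sketched in base R under stated assumptions: the dates are read as month/day/year, the data are simulated, and the column names Market, StockA and StockB as well as the helper slope() are hypothetical stand-ins for the real ones.

```r
set.seed(42)

# Simulated daily returns standing in for the real data
dates <- seq(as.Date("2007-01-03"), as.Date("2007-03-30"), by = "day")
returns <- data.frame(
  Date   = format(dates, "%m/%d/%Y"),
  Market = rnorm(length(dates)),
  StockA = rnorm(length(dates)),
  StockB = rnorm(length(dates))
)

# Assumption: dates are month/day/year; build a year-month key
returns$Month <- format(as.Date(returns$Date, format = "%m/%d/%Y"), "%Y-%m")

stocks <- c("StockA", "StockB")

# Slope of a simple regression of y on x
slope <- function(y, x) unname(coef(lm(y ~ x))[2])

# One regression per month and per stock: market return (y) on stock return (x)
by_month <- split(returns, returns$Month)
slopes <- t(sapply(by_month, function(d) {
  sapply(stocks, function(s) slope(d$Market, d[[s]]))
}))
slopes <- data.frame(Month = rownames(slopes), slopes, row.names = NULL)
slopes
```

The result has one row per month and one slope column per stock. If the stock return should be the response instead, swap the arguments in the call to slope().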

Find where species accumulation curve reaches asymptote

I have used the specaccum() command to develop species accumulation curves for my samples.
Here is some example data:
site1<-c(0,8,9,7,0,0,0,8,0,7,8,0)
site2<-c(5,0,9,0,5,0,0,0,0,0,0,0)
site3<-c(5,0,9,0,0,0,0,0,0,6,0,0)
site4<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site5<-c(5,0,9,0,0,6,6,0,0,0,0,0)
site6<-c(5,0,9,0,0,0,6,6,0,0,0,0)
site7<-c(5,0,9,0,0,0,0,0,7,0,0,3)
site8<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site9<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site10<-c(5,0,9,0,0,0,0,0,0,0,1,6)
site11<-c(5,0,9,0,0,0,5,0,0,0,0,0)
site12<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site13<-c(5,1,9,0,0,0,0,0,0,0,0,0)
species_counts<-rbind(site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,site11,site12,site13)
accum <- specaccum(species_counts, method="random", permutations=100)
plot(accum)
To ensure I have sampled sufficiently, I need to make sure the species accumulation curve reaches an asymptote, defined as a slope of <0.3 between the last two points (i.e. between sites 12 and 13).
results <- with(accum, data.frame(sites, richness, sd))
Produces this:
sites richness sd
1 1 3.46 0.9991916
2 2 4.94 1.6625403
3 3 5.94 1.7513054
4 4 7.05 1.6779918
5 5 8.03 1.6542263
6 6 8.74 1.6794660
7 7 9.32 1.5497149
8 8 9.92 1.3534841
9 9 10.51 1.0492422
10 10 11.00 0.8408750
11 11 11.35 0.7017295
12 12 11.67 0.4725816
13 13 12.00 0.0000000
I feel like I'm getting there. I could generate an lm with site vs richness and extract the exact slope (tangent?) between sites 12 and 13. Going to search a bit longer here.
Streamlining your data generation process a little bit:
species_counts <- matrix(c(0,8,9,7,0,0,0,8,0,7,8,0,
5,0,9,0,5,0,0,0,0,0,0,0, 5,0,9,0,0,0,0,0,0,6,0,0,
5,0,9,0,0,0,0,0,0,0,0,0, 5,0,9,0,0,6,6,0,0,0,0,0,
5,0,9,0,0,0,6,6,0,0,0,0, 5,0,9,0,0,0,0,0,7,0,0,3,
5,0,9,0,0,0,0,0,0,0,1,0, 5,0,9,0,0,0,0,0,0,0,1,0,
5,0,9,0,0,0,0,0,0,0,1,6, 5,0,9,0,0,0,5,0,0,0,0,0,
5,0,9,0,0,0,0,0,0,0,0,0, 5,1,9,0,0,0,0,0,0,0,0,0),
byrow=TRUE,nrow=13)
Always a good idea to set.seed() before running randomization tests (and let us know that specaccum is in the vegan package):
set.seed(101)
library(vegan)
accum <- specaccum(species_counts, method="random", permutations=100)
Extract the richness and sites components from within the returned object and compute d(richness)/d(sites) (note that the slope vector is one element shorter than the original sites/richness vectors: be careful if you're trying to match up slopes with particular numbers of sites)
(slopes <- with(accum,diff(richness)/diff(sites)))
## [1] 1.45 1.07 0.93 0.91 0.86 0.66 0.65 0.45 0.54 0.39 0.32 0.31
In this case, the slope never actually goes below 0.3, so this code for finding the first time that the slope falls below 0.3:
which(slopes<0.3)[1]
returns NA.
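The index bookkeeping can be sketched without vegan by using the mean richness values from the results table in the question. The helper first_flat_site() is a hypothetical name introduced here for illustration:

```r
# Mean richness per number of sites, copied from the table in the question
sites <- 1:13
richness <- c(3.46, 4.94, 5.94, 7.05, 8.03, 8.74, 9.32, 9.92,
              10.51, 11.00, 11.35, 11.67, 12.00)

# slopes[i] is the slope between sites[i] and sites[i + 1]
slopes <- diff(richness) / diff(sites)

# Hypothetical helper: smallest number of sites at which the slope of the
# preceding segment is below `threshold`; NA if that never happens
first_flat_site <- function(sites, richness, threshold = 0.3) {
  s <- diff(richness) / diff(sites)
  idx <- which(s < threshold)[1]
  sites[idx + 1]
}

last_slope <- tail(slopes, 1)     # slope between sites 12 and 13
last_slope < 0.3                  # FALSE: the last slope is about 0.33
first_flat_site(sites, richness)  # NA: no segment is flatter than 0.3
```

Because the randomization changes the richness values slightly from run to run, a last slope this close to the threshold may land on either side of 0.3 with a different seed or more permutations.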

R - How to select specific values

I work in healthcare and I need help using R.
I have a data set like this:
S1 S2 S3 S4 S5
0.498 1.48 1.43 0.536 0.548
2.03 1.7 3.74 2.13 2.02
0.272 0.242 0.989 0.534 0.787
0.986 2.03 2.53 1.65 2.31
0.307 0.934 0.633 0.36 0.281
0.78 0.76 0.706 0.81 1.11
0.829 2.03 0.667 1.48 1.42
0.497 1.27 0.952 1.23 1.73
0.553 0.286 0.513 0.422 0.573
Here are my objectives:
Compute the correlation between every pair of columns
Calculate p-values
Calculate R-squared
Only show the pairs where R2>0.5 and p-value<0.05
Here is my code so far (it's not the most efficient, but it works):
> e<-read.table('Workbook8nm.csv', header=TRUE, sep=",", dec=".", na.strings="NA")
> f<-data.frame(e)
> M<-cor(f, use="complete") #Do the correlation like I want
> library('psych')
> N<-corr.test(f) #Give me p-values
So far I have my correlations in M and my p-values in N.
How do I get R2?
Second, how do I make R show only the pairs where R2>0.5 and p-value<0.05? As practice, I used this line:
P<-M[which(M>0.9)]
to keep only the Pearson coefficients above 0.9. But it just gives me a list of every value above 0.9, so I don't know which pair of columns each coefficient comes from. Ideally it would show the significant values in a table with the column names, so I can easily identify them.
The reason I want to do this is that my table is 570 by 570, so I can't look through every p-value to keep only the significant ones.
I hope I was clear! This is my first post here, so tell me if I made any mistakes!
Thanks for your help!
I'm sure there is a function somewhere in the R space to do this quicker, but I wrote a quick function to expand a matrix into a data.frame with the "row" and "column" as columns, and the value as a third column.
matrixToFrame <- function(m, name) {
e <- expand.grid(row=rownames(m), col=colnames(m))
e[name] <- as.vector(m)
e
}
We can transform the correlation matrix into a data frame like so:
> matrixToFrame(cor(f), "cor")
row col cor
1 S1 S1 1.0000000
2 S2 S1 0.5322052
3 S3 S1 0.8573687
4 S4 S1 0.8542438
5 S5 S1 0.6820144
6 S1 S2 0.5322052
....
And we can merge the result of corr.test and cor because the columns match up:
> b <- merge(matrixToFrame(corr.test(f)$p, "p"), matrixToFrame(cor(f), "cor"))
> head(b)
row col p cor
1 S1 S1 0.0000000000 1.0000000
2 S1 S2 0.2743683745 0.5322052
3 S1 S3 0.0281656707 0.8573687
4 S1 S4 0.0281656707 0.8542438
5 S1 S5 0.2134783039 0.6820144
6 S2 S1 0.1402243214 0.5322052
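Then we can just filter for the elements that we want. The thresholds below are only illustrative; for the criteria in the question, use b$cor^2 > .5 & b$p < .05 instead.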
> b[b$cor > .5 & b$p > .2,]
row col p cor
2 S1 S2 0.2743684 0.5322052
5 S1 S5 0.2134783 0.6820144
8 S2 S3 0.2743684 0.5356585
10 S2 S5 0.2134783 0.6724486
15 S3 S5 0.2134783 0.6827349
EDIT: I found the question "R matrix to rownames colnames values", which provides a couple of attempts at matrixToFrame; nothing particularly more elegant than what I have here, though.
EDIT2: Make sure to read the docs carefully for corr.test. The p matrix it returns is not symmetric: as far as I can tell, the entries below the diagonal are raw p-values and those above the diagonal are adjusted for multiple testing, so the results here may be deceptive. You may want to restrict the pairs with lower.tri or upper.tri before the final filtering step.
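As a hedged alternative that avoids the duplicated pairs altogether, the sketch below uses only base R: it visits each unordered pair of columns once, takes the raw (unadjusted) p-value from cor.test(), and computes R2 as the squared Pearson correlation. The data frame f is rebuilt here from the sample values in the question; the object names pairs, res and sig are illustrative.

```r
# Sample data from the question
f <- data.frame(
  S1 = c(0.498, 2.03, 0.272, 0.986, 0.307, 0.78, 0.829, 0.497, 0.553),
  S2 = c(1.48, 1.7, 0.242, 2.03, 0.934, 0.76, 2.03, 1.27, 0.286),
  S3 = c(1.43, 3.74, 0.989, 2.53, 0.633, 0.706, 0.667, 0.952, 0.513),
  S4 = c(0.536, 2.13, 0.534, 1.65, 0.36, 0.81, 1.48, 1.23, 0.422),
  S5 = c(0.548, 2.02, 0.787, 2.31, 0.281, 1.11, 1.42, 1.73, 0.573)
)

# Every unordered pair of columns exactly once (no self-pairs, no mirrors)
pairs <- t(combn(names(f), 2))

# Pearson correlation and raw p-value for each pair
tests <- apply(pairs, 1, function(p) {
  ct <- cor.test(f[[p[1]]], f[[p[2]]])
  c(cor = unname(ct$estimate), p = ct$p.value)
})

res <- data.frame(row = pairs[, 1], col = pairs[, 2], t(tests),
                  stringsAsFactors = FALSE)
res$r2 <- res$cor^2   # R-squared of a simple linear fit between the two columns

# Keep only the pairs meeting the criteria from the question
sig <- res[res$r2 > 0.5 & res$p < 0.05, ]
sig
```

With 570 columns this runs about 162,000 tests, so some form of multiple-testing adjustment (for example p.adjust() on res$p) is worth considering before filtering.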

Combining and appending columns of different lengths, by row number, R

I'm working with biochemical data from subjects, analysing the results by sex. I have 19 biochemical tests to analyse for each sex, for each of two drugs (haematology and anatomy tests coming later).
For reasons of reproducibility of results and for preventing transcription errors, I am trying to summarise each test into one table. Included in the table output, I need a column for the Dunnett post hoc comparison p-values. Because the Dunnett test compares to the control results, with a control and 3 drug levels I only get 3 p-values. However, I have 4 mean and sd values.
Using ddply to get the mean and sd results (having limited the number of significant figures), I get a dataset that looks like this:
Sex<- c(rep("F",4), rep("M",4))
Druglevel <- c(rep(0:3,2))
Sample <- c(rep(10,8))
Mean <- c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57)
sd <- c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
Drug1Biochem1 <- data.frame(Sex, Druglevel, Sample, Mean, sd)
I have used glht in the package multcomp to perform the Dunnett tests on the aov object I constructed from undertaking a normal aov. I've extracted the p-values from the glht summary (I've rounded these to three decimal places). The male and female analyses have been run using separate ANOVA so I have one set of output for each sex. The female results are:
femaleR <- c(0.371, 0.973, 0.490)
and the male results are:
maleR <- c(0.862, 0.999, 0.738)
How can I append a column for the p-values to my original dataframe (Drug1Biochem1) so that both femaleR and maleR are in that final column, with row 1 and row 5 of that column empty (i.e. no p-values for the control)?
I wish to output the resulting combination to html, which can be inserted into a Word document so no transcription errors occur. I have set a seed value so that the results of the program are reproducible (when I finally stop debugging).
In summary, I would like a data frame (or table, or whatever I can output to html) that has the following format:
Sex  Druglevel  Sample  Mean  sd    p-value
F    0          10      0.44  0.07
F    1          10      0.50  0.07  0.371
F    2          10      0.46  0.09  0.973
F    3          10      0.49  0.12  0.490
M    0          10      0.48  0.18
M    1          10      0.55  0.19  0.862
M    2          10      0.47  0.13  0.999
M    3          10      0.57  0.41  0.738
For each test, I wish to reproduce this exact table. There will always be 4 groups per sex, and there will never be a p-value for the control, which will always be summarised in row 1 (F) and row 5 (M).
You could try merge
dN <- data.frame(Sex=rep(c('M', 'F'), each=3), Druglevel=1:3,
pval=c(maleR, femaleR))
merge(Drug1Biochem1, dN, by=c('Sex', 'Druglevel'), all=TRUE)
# Sex Druglevel Sample Mean sd pval
#1 F 0 10 0.44 0.07 NA
#2 F 1 10 0.50 0.07 0.371
#3 F 2 10 0.46 0.09 0.973
#4 F 3 10 0.49 0.12 0.490
#5 M 0 10 0.48 0.18 NA
#6 M 1 10 0.55 0.19 0.862
#7 M 2 10 0.47 0.13 0.999
#8 M 3 10 0.57 0.41 0.738
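If the control rows should be blank instead of NA, as in the table requested in the question, one option is to format the merged column as text before exporting. This is a self-contained sketch of the same merge; the object name out and the choice of three decimal places are assumptions made for illustration:

```r
# Data from the question
Drug1Biochem1 <- data.frame(
  Sex       = rep(c("F", "M"), each = 4),
  Druglevel = rep(0:3, 2),
  Sample    = 10,
  Mean      = c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57),
  sd        = c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
)
femaleR <- c(0.371, 0.973, 0.490)
maleR   <- c(0.862, 0.999, 0.738)

# p-values keyed by Sex and Druglevel; the control (Druglevel 0) has none
dN <- data.frame(Sex       = rep(c("F", "M"), each = 3),
                 Druglevel = rep(1:3, 2),
                 pval      = c(femaleR, maleR))

# Left join: control rows get NA because they have no match in dN
out <- merge(Drug1Biochem1, dN, by = c("Sex", "Druglevel"), all.x = TRUE)

# Show blanks for the controls and a fixed number of decimals elsewhere
out$pval <- ifelse(is.na(out$pval), "",
                   formatC(out$pval, format = "f", digits = 3))
out
```

Matching on Sex and Druglevel, and not on row position, means the p-values stay attached to the right group even if the row order of the summary table changes.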
