Multidimensional scaling plot in R

I have a dataset ("data") that looks like this:
PatientID Visit Var1 Var2 Var3 Var4 Var5
1 ID1 0 44.28 4.57 23.56 4.36 8.87
2 ID1 1 58.60 5.34 4.74 3.76 6.96
3 ID1 2 72.44 11.18 21.22 2.15 8.34
4 ID2 0 65.98 6.91 8.57 1.19 7.39
5 ID2 1 10.33 38.27 0.48 14.41 NA
6 ID2 2 69.45 11.18 20.69 2.15 8.34
7 ID3 0 69.16 6.17 10.98 1.91 6.12
8 ID3 1 86.02 3.28 16.29 4.28 5.74
9 ID3 2 69.45 NA 20.69 2.15 8.34
10 ID4 0 98.55 26.75 2.89 3.92 2.19
11 ID4 1 32.66 14.38 4.96 1.13 4.78
12 ID4 2 70.45 11.42 21.78 2.15 8.34
I need to generate an MDS plot with all data points. I also need the visit points to be linked by a line and coloured consistently for all individuals: green for visit 1, red for visit 2, and black for visit 3.
My code looks like this (quite lengthy, and it doesn't work):
data.cor <- cor(t(data[,3:7]), use = "pairwise.complete.obs", method = "spearman")
dim(data.cor)
dim(data)
rownames(data.cor) <- paste0(data$PatientID, "V", data$Visit)
colnames(data.cor) <- paste0(data$PatientID, "V", data$Visit)
c <- dist(data.cor)
fit <- cmdscale(c,eig=TRUE, k=2)
ff <- fit$points
ff <- as.data.frame(ff)
ff$pair <- paste0(substr(rownames(ff),1,6))
ff$pair <- factor(ff$pair)
pc.pair.distances <- matrix(nrow = nlevels(ff$pair), ncol = 1)
for(i in 1:nlevels(ff$pair)){
pair2 <- ff[ff$pair %in% levels(ff$pair)[i] , ]
pc.pair.distances[i,1] <- sqrt(
((pair2[1,1] - pair2[2,1]) * (pair2[1,1] - pair2[2,1]))
+ ((pair2[1,2] - pair2[2,2]) * (pair2[1,2] - pair2[2,2]))
)
rm(pair2)
}
plot(ff[,1], ff[,2], xlab="Principal 1", ylab="Principal 2", type = "n", las = 1)
for(i in 1:nlevels(ff$pair)){
lines(ff[ff$pair == levels(ff$pair)[i],1], ff[ff$pair == levels(ff$pair)[i],2], col = "grey")
}
points(ff[,1], ff[,2], xlab="Coordinate 1", ylab="Coordinate 2", type = "p",
pch = ifelse(grepl(x = substr(rownames(ff), 7,8), "V1"), 20, 18),
cex = 1.3)
I would really appreciate your help.

I suggest modifying your data.frame to add a column for the visit number and one for the individual id, using the function sapply.
ff$visit <- sapply(ff$pair,function(x){substr(x,5,5)})
ff$indiv <- sapply(ff$pair,function(x){substr(x,3,3)})
The ggplot2 package is then very useful for plotting the data. First, draw the points:
g <- ggplot(ff,aes(V1,V2))+geom_point(aes(color=visit))
Then add a line for each individual (a consolidated single-call version is sketched after the loop):
for (i in unique(ff$indiv)){
g <- g+geom_line(data=ff[ff$indiv==i,],aes(V1,V2))
}
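For reference, a consolidated sketch of the same idea (not from the answer above): the group aesthetic replaces the loop, and scale_colour_manual pins the requested colours, assuming the visit codes 0/1/2 in the data correspond to visits 1/2/3:
library(ggplot2)
ggplot(ff, aes(V1, V2)) +
  geom_line(aes(group = indiv), colour = "grey") +   # one line per individual
  geom_point(aes(colour = visit), size = 2) +
  scale_colour_manual(values = c("0" = "green", "1" = "red", "2" = "black")) +
  labs(x = "Coordinate 1", y = "Coordinate 2", colour = "Visit")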

Related

Is there another way to calculate within-subject Hedges' g (and its error)?

I'm carrying out a meta-analysis of within-subject (crossover) studies. Some papers I've read used the esc package (more precisely, the esc_mean_sd function) to calculate Hedges' g, so I did the same. However, its output doubles the "n" of each study.
Note that "n" is 12 for all three studies in the data, while the output shows n = 24.
ID mean_exp mean_con sd_exp sd_con n
1 A 150 130 15 22 12
2 B 166 145 10 8 12
3 C 179 165 11 14 12
# What I did:
e1 <- esc_mean_sd(data[1,2],data[1,4],data[1,6],
data[1,3],data[1,5],data[1,6],
r = .9,es.type = "g")
e2 <- esc_mean_sd(data[2,2],data[2,4],data[2,6],
data[2,3],data[2,5],data[2,6],
r = .9,es.type = "g")
e3 <- esc_mean_sd(data[3,2],data[3,4],data[3,6],
data[3,3],data[3,5],data[3,6],
r = .9,es.type = "g")
data2 <- combine_esc(e1, e2, e3)
colnames(data2) <- c("study","es","weight","n","se","var","lCI","uCI","measure")
head(data2, 3)
# study es weight n se var lCI uCI measure
# 1 1.80 4.18 24 0.489 0.239 0.842 2.76 g
# 2 4.53 1.60 24 0.791 0.626 2.983 6.08 g
# 3 2.14 3.71 24 0.519 0.269 1.126 3.16 g
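No answer is recorded for this one, but the doubled n is presumably because esc_mean_sd reports the total sample size as grp1n + grp2n, and in a crossover design both "groups" are the same 12 participants. As a hand-rolled cross-check, here is one common within-subject convention (the d_z standardizer with the Hedges small-sample correction); this is a sketch under those assumptions, not the esc implementation, so verify it against your meta-analytic conventions:
# Hypothetical helper, not part of the esc API: within-subject Hedges' g via
# the SD of the paired differences (d_z), with correction J = 1 - 3/(4*df - 1).
g_within <- function(m1, m2, sd1, sd2, n, r) {
  sd_diff <- sqrt(sd1^2 + sd2^2 - 2 * r * sd1 * sd2)  # SD of the differences
  d  <- (m1 - m2) / sd_diff                           # Cohen's d_z
  J  <- 1 - 3 / (4 * (n - 1) - 1)                     # Hedges correction
  g  <- J * d
  se <- sqrt(J^2 * (1 / n + d^2 / (2 * n)))           # approximate SE
  c(g = g, se = se, n = n)                            # n stays at 12, not 24
}
g_within(150, 130, 15, 22, 12, r = 0.9)               # study A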

standardize a variable's values differently based on another categorical variable in R (using base R)

I have a large dataset with a continuous variable "Cholesterol" measured at two visits for each participant (each participant has two rows: first visit = Before, second visit = After). I'd like to standardise cholesterol, but with the Before and After visits merged the standardisation is inaccurate, since it is calculated from a single overall mean and SD.
Using base R, how can I create a new standardised cholesterol variable based on Visit in the same dataset? Standardisation should happen twice (once for Before, once for After), but the output (standardised values) should be one variable again, following the same structure as this data frame:
DF$Cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
DF$Visit< -c(Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before, After,Before,After,Before,After)
# the standardisation function I want to apply
standardise <- function(x) {return((x-min(x,na.rm = T))/sd(x,na.rm = T))}
thank you in advance
Let's make your data, fix the df$visit assignment, change the standardise function to use the mean rather than the min, assume each new occurrence of "before" marks the next person, pivot to wide format, and then mutate our before and after standardised variables:
df <- data.frame(x = rep(1, 30))
df$cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
df$visit <- rep(c("before", "after"), 15)
standardise <- function(x) {return((x-mean(x,na.rm = T))/sd(x,na.rm = T))}
library(dplyr)
library(tidyr)
df <- df %>%
mutate(person = cumsum(visit == "before"))%>%
pivot_wider(names_from = visit, id_cols = person, values_from = cholesterol)%>%
mutate(before_std = standardise(before),
after_std = standardise(after))
gives:
person before after before_std after_std
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.986 2.92 -1.16 1.33
2 2 3.93 2.95 1.63 1.36
3 3 4.22 2.48 1.91 0.842
4 4 0.997 0.388 -1.15 -1.49
5 5 1.18 1.41 -0.979 -0.356
6 6 1.05 1.28 -1.10 -0.503
7 7 2.85 0.437 0.609 -1.44
8 8 2.24 0.757 0.0300 -1.08
9 9 3.04 1.73 0.788 0.00940
10 10 1.92 2.45 -0.271 0.814
11 11 2.85 0.843 0.611 -0.985
12 12 1.70 2.40 -0.483 0.749
13 13 2.65 1.22 0.420 -0.567
14 14 2.34 1.90 0.126 0.199
15 15 1.17 2.73 -0.986 1.12
If you actually want min in your standardise function rather than mean, editing it should be simple enough.
Edited to add a base R solution, with the caveat that there's probably a much neater way:
df <- data.frame(id = rep(c(seq(1, 15, 1)), each = 2))
df$cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
df$visit <- rep(c("before", "after"), 15)
df <- reshape(df, direction = "wide", idvar = "id", timevar = "visit")
standardise <- function(x) {return((x-mean(x,na.rm = T))/sd(x,na.rm = T))}
df$before_std <- round(standardise(df$cholesterol.before), 2)
df$after_std <- round(standardise(df$cholesterol.after), 2)
gives:
id cholesterol.before cholesterol.after before_std after_std
1 1 0.9861551 2.9154158 -1.16 1.33
3 2 3.9302373 2.9453085 1.63 1.36
5 3 4.2248018 2.4789901 1.91 0.84
7 4 0.9972635 0.3879830 -1.15 -1.49
9 5 1.1782336 1.4065341 -0.98 -0.36
11 6 1.0495609 1.2750138 -1.10 -0.50
13 7 2.8515144 0.4369885 0.61 -1.44
15 8 2.2410429 0.7566147 0.03 -1.08
17 9 3.0395565 1.7335131 0.79 0.01
19 10 1.9242212 2.4539439 -0.27 0.81
21 11 2.8528908 0.8432039 0.61 -0.99
23 12 1.7002653 2.3952744 -0.48 0.75
25 13 2.6522959 1.2178764 0.42 -0.57
27 14 2.3426695 1.9030782 0.13 0.20
29 15 1.1708246 2.7267124 -0.99 1.12
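For completeness, a base-R sketch that avoids reshaping entirely: ave() applies a function within groups and returns a vector in the original row order, so the standardised values land in one new column with the same long structure the question asks for (using the mean-based standardise from above):
# Standardise cholesterol within each visit group, keeping the long format.
standardise <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
df$cholesterol_std <- ave(df$cholesterol, df$visit, FUN = standardise)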

applying a rolling error function using rollapply

I am trying to calculate a rolling error function in R: I take the last 30 days and compute the RMSE, then move forward one day, take the last 30 days from that point, and compute a new RMSE.
My data looks like the following where I have a date and two values:
dates val1 val2
1 2010-01-01 -0.479526441 -0.294149127
2 2010-01-02 -0.860588950 0.426375720
3 2010-01-03 -0.660643894 -1.483020861
4 2010-01-04 -0.938748812 -1.631823690
Where am I going wrong in the code?
Data & attempt:
d <- data.frame(
dates = seq(from = as.Date("2010-01-01"), to = as.Date("2012-12-31"), by = 1),
val1 = rnorm(1096),
val2 = rnorm(1096)
)
d %>%
mutate(rollRMSE = rollapply(., width = 30, by = 1, FUN = Metrics::rmse(val1, val2)))
EDIT: setting the window size as a variable.
I have split the calculation into steps; with a window of 30, rollapply returns 29 fewer values than there are rows, so you may want to collect the result in another tibble.
suppressPackageStartupMessages(library(dplyr))
d <- data.frame(
dates = seq(from = as.Date("2010-01-01"), to = as.Date("2012-12-31"), by = 1),
val1 = rnorm(1096),
val2 = rnorm(1096)
)
rse <- function(x, y){sqrt((x-y)**2)}
# assign window size for moving average
window <- 30
d %>% tibble::as_tibble() %>%
mutate(err = rse(val1, val2),
roll = c(zoo::rollapply(err, width = window, by = 1, FUN = mean), rep(NA, window -1) )
)
#> # A tibble: 1,096 x 5
#> dates val1 val2 err roll
#> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 2010-01-01 -0.0248 1.18 1.20 1.40
#> 2 2010-01-02 -0.684 0.603 1.29 1.38
#> 3 2010-01-03 -0.344 -1.92 1.58 1.42
#> 4 2010-01-04 0.447 0.319 0.128 1.38
#> 5 2010-01-05 0.123 -0.810 0.933 1.42
#> 6 2010-01-06 0.00384 2.29 2.29 1.43
#> 7 2010-01-07 -1.51 -1.03 0.487 1.39
#> 8 2010-01-08 0.394 -1.25 1.64 1.41
#> 9 2010-01-09 -1.30 1.61 2.92 1.42
#> 10 2010-01-10 0.394 0.117 0.278 1.33
#> # ... with 1,086 more rows
You can do it manually with base R, i.e.
sapply(seq(0, (nrow(d) - 30)), function(i) Metrics::rmse(d$val1[(seq(30) + i)], d$val2[(seq(30) + i)]))
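For what it's worth, the original attempt fails because Metrics::rmse(val1, val2) is evaluated immediately instead of being passed as a function. Also note that the edit above computes a rolling mean of per-row absolute errors (sqrt((x-y)^2) is just |x-y|), which is not quite an RMSE; a true RMSE averages the squared errors inside the window before taking the root. A minimal sketch, assuming the d from above:
library(dplyr)
library(zoo)
d <- d %>%
  mutate(rollRMSE = rollapplyr(
    (val1 - val2)^2,                    # per-row squared errors
    width = 30,
    FUN = function(se) sqrt(mean(se)),  # RMSE = sqrt of the window mean
    fill = NA                           # NA for the first 29 rows
  ))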

plotting with specific values for heatmap in pheatmap

I have a data frame like this:
gene s1 s2 s3
1 -3.83 -8.17 -8.59
2 0.33 -4.51 -7.27
3 0.15 -5.26 -6.2
4 -0.08 -6.13 -5.95
5 -1.15 -4.82 -5.75
6 -0.99 -4.11 -4.85
7 0.42 -4.18 -4.54
8 -0.32 -3.43 -4.4
9 -0.72 -3.37 -4.39
I want to make a heatmap using pheatmap where anything below -4 is green, anything above +4 is red, and everything in between gets red/green shades. I also don't want to scale my data, and I want no clustering. I have this code so far in R:
d <- read.table("test.txt", header = TRUE, sep = "\t", row.names = 1, quote = "")
pheatmap(as.matrix(d),       # matrix
         scale = "none",             # no row/column scaling
         cluster_cols = FALSE,       # do not cluster columns
         cluster_rows = FALSE,       # do not cluster rows
         treeheight_row = 0,         # do not show the row dendrogram
         show_rownames = FALSE,      # do not show row names, i.e. gene names
         main = "test.txt",
         color = colorRampPalette(c("#0016DB", "#FFFFFF", "#FFFF00"))(50))
How can I plot this with the color scheme I mentioned above?
Thanks
d <-read.table(text="gene s1 s2 s3
1 -3.83 -8.17 -8.59
2 0.33 -4.51 -7.27
3 0.15 -5.26 -6.20
4 -0.08 -6.13 -5.95
5 -1.15 -4.82 -5.75
6 -0.99 -4.11 -4.85
7 0.42 -4.18 -4.54
8 -0.32 -3.43 -4.40
9 -0.72 -3.37 -4.39", header = TRUE, row.names = 1)
library(pheatmap)
# Breaks: one just below the data minimum, a fine grid from -4 to 4, and one
# just above the maximum; min(d, -4)/max(d, 4) guard against data that never
# reaches the +/-4 caps, so the breaks stay strictly increasing.
my_colors <- c(min(d, -4) - 0.01, seq(-4, 4, by = 0.01), max(d, 4) + 0.01)
# One colour per interval between breaks: solid green below -4, a green-to-red
# ramp in between, solid red above +4.
my_palette <- c("green",
                colorRampPalette(colors = c("green", "red"))(length(my_colors) - 3),
                "red")
pheatmap(as.matrix(d),
scale = "none",
cluster_cols=FALSE,
cluster_rows = FALSE,
treeheight_row=0,
show_rownames=FALSE,
main = "test.txt",
color = my_palette,
breaks = my_colors)
Created on 2019-05-29 by the reprex package (v0.3.0)
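An alternative sketch, not from the answer above: clamp the values to [-4, 4] with pmin()/pmax() before plotting, so a plain green-to-red ramp saturates at the caps without hand-built breaks:
library(pheatmap)
# Saturate values at +/-4, then let pheatmap spread its default breaks
# across the clamped range.
d_clamped <- pmin(pmax(as.matrix(d), -4), 4)
pheatmap(d_clamped,
         scale = "none",
         cluster_cols = FALSE,
         cluster_rows = FALSE,
         show_rownames = FALSE,
         main = "test.txt (values clamped to [-4, 4])",
         color = colorRampPalette(c("green", "red"))(100))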

Rolling correlation with id and date

I have some data that has a name, a date, and two variables (x, y). I would like to calculate a rolling correlation between them; here is some sample data:
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
I would like to add a column named "Correl" that for each date, takes the correlation of the previous 5 periods. However, when the name changes, I would like it to give NA's instead.
As you can see below, when the data switches from ABC to XYZ, the correlation is NA for the first 4 periods; once there are 5 data points, the correlation begins again.
name date x y Correl
ABC 1/1/2013 -3.59 -5.13 NA
ABC 1/2/2013 -8.69 4.22 NA
ABC 1/3/2013 2.80 -0.59 NA
ABC 1/4/2013 0.54 5.06 NA
ABC 1/5/2013 1.13 3.49 -0.03
ABC 1/6/2013 0.52 5.16 -0.38
ABC 1/7/2013 -0.24 -5.40 0.08
ABC 1/8/2013 3.26 -2.75 -0.16
ABC 1/9/2013 1.33 5.94 -0.04
ABC 1/10/2013 2.24 1.14 -0.01
ABC 1/11/2013 0.01 9.87 -0.24
ABC 1/12/2013 2.29 1.28 -0.99
ABC 1/13/2013 1.03 -6.30 -0.41
ABC 1/14/2013 0.62 4.82 -0.47
ABC 1/15/2013 1.08 -1.17 -0.50
ABC 1/16/2013 2.43 8.86 0.45
ABC 1/17/2013 -3.43 9.38 -0.35
ABC 1/18/2013 -5.73 7.59 -0.38
ABC 1/19/2013 1.77 3.13 -0.44
ABC 1/20/2013 -0.97 -0.77 -0.24
XYZ 1/1/2013 2.12 10.22 NA
XYZ 1/2/2013 -0.81 0.22 NA
XYZ 1/3/2013 -1.55 -2.25 NA
XYZ 1/4/2013 -4.53 3.63 NA
XYZ 1/5/2013 2.95 -1.51 0.13
XYZ 1/6/2013 6.76 24.16 0.69
XYZ 1/7/2013 3.33 7.31 0.66
XYZ 1/8/2013 -1.47 -4.23 0.67
XYZ 1/9/2013 3.89 -0.43 0.81
XYZ 1/10/2013 5.63 17.95 0.86
XYZ 1/11/2013 3.29 -7.09 0.63
XYZ 1/12/2013 6.03 -9.03 0.29
XYZ 1/13/2013 -5.63 6.96 -0.19
XYZ 1/14/2013 1.70 13.59 -0.18
XYZ 1/15/2013 -1.19 -16.79 -0.29
XYZ 1/16/2013 4.76 4.91 -0.11
XYZ 1/17/2013 9.02 25.16 0.57
XYZ 1/18/2013 4.56 6.48 0.84
XYZ 1/19/2013 5.30 11.81 0.99
XYZ 1/20/2013 -0.60 3.38 0.84
UPDATE: I have tried all of your suggestions and have run into problems using the actual data. I have attached a subset of the data below:
https://www.dropbox.com/s/6k4xhwuinlu0p1f/TEST_SUBSET.csv?dl=0
I cannot get this to work. I've tried removing the NA's, renaming the rows, reading the data in differently, formatting the date differently. Nothing is working for me. Can you see if what you are running is working for this dataset? Thank you very much folks!
Apply ave to the row indexes of df to process by name and use rollapplyr to perform the rolling computations. Note that i is a vector of indexes:
library(zoo)
corx <- function(x) cor(x[, 1], x[, 2])
df$Correl <- ave(1:nrow(df), df$name, FUN = function(i)
rollapplyr(df[i, c("x", "y")], 5, corx, by.column = FALSE, fill = NA))
Update: Changed rollapply to rollapplyr to be consistent with the output shown in the question. If you want centred correlations, change it back to rollapply.
This is a little late to the party, but below is a pretty compact solution with dplyr and rollapply from the zoo package.
library(dplyr)
library(zoo)
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
df<-df %>%
group_by(name)%>%
arrange(date) %>%
do({
correl <- rollapply(.[-(1:2)],width = 5, function(a) cor(a[,1],a[,2]), by.column = FALSE, align = "right", fill = NA)
data.frame(., correl)
})
which returns...
> df
Source: local data frame [40 x 5]
Groups: name
name date x y correl
1 ABC 2013-01-01 -0.61707785 -0.7299461 NA
2 ABC 2013-01-02 1.35353618 9.1314743 NA
3 ABC 2013-01-03 2.60815932 0.2511828 NA
4 ABC 2013-01-04 -2.89619789 -1.2586655 NA
5 ABC 2013-01-05 2.23750886 4.6616034 0.52013407
6 ABC 2013-01-06 -1.97573999 3.6800832 0.37575664
7 ABC 2013-01-07 1.70360813 2.2621718 0.32390612
8 ABC 2013-01-08 0.02017797 2.5088032 0.64020507
9 ABC 2013-01-09 0.96263256 1.6711756 -0.00557611
10 ABC 2013-01-10 -0.62400803 5.2011656 -0.66040650
.. ... ... ... ... ...
checking that the other group responds correctly...
> df %>%
+ filter(name=="XYZ")
Source: local data frame [20 x 5]
Groups: name
name date x y correl
1 XYZ 2013-01-01 3.4199729 5.0866361 NA
2 XYZ 2013-01-02 4.7326297 -5.4613465 NA
3 XYZ 2013-01-03 3.8983329 11.1635903 NA
4 XYZ 2013-01-04 1.5235936 3.9077184 NA
5 XYZ 2013-01-05 -5.4885373 7.8961020 -0.3755766
6 XYZ 2013-01-06 0.2311371 2.0157046 -0.3754510
7 XYZ 2013-01-07 2.6903306 -3.2940181 -0.1808097
8 XYZ 2013-01-08 -0.2584268 3.6047800 -0.8457930
9 XYZ 2013-01-09 -0.2897795 2.1029431 -0.9526992
10 XYZ 2013-01-10 5.9571558 18.5810947 0.7025559
11 XYZ 2013-01-11 -7.5250647 -8.0858699 0.7949917
12 XYZ 2013-01-12 2.8438336 -8.4072829 0.6563161
13 XYZ 2013-01-13 7.2295030 -0.1236801 0.5383666
14 XYZ 2013-01-14 -0.7579570 -0.2830291 0.5542751
15 XYZ 2013-01-15 4.3116507 -6.5291051 0.3894343
16 XYZ 2013-01-16 1.4334510 0.5957465 -0.1480032
17 XYZ 2013-01-17 -2.6444881 6.1261976 -0.6183805
18 XYZ 2013-01-18 0.8517223 0.5587499 -0.9243050
19 XYZ 2013-01-19 6.2140131 -3.0944259 -0.8939475
20 XYZ 2013-01-20 11.2871086 -0.1187153 -0.6845300
Hope this helps!
FOLLOW UP
I just ran the following on your actual data set:
library(dplyr)
library(zoo)
import <- read.csv("TEST_SUBSET.CSV", header=TRUE, stringsAsFactors = FALSE)
str(head(import))
import_df<-import %>%
group_by(id)%>%
arrange(asof_dt) %>%
do({
correl <- rollapply(.[-(1:2)],width = 5, function(a) cor(a[,1],a[,2]), by.column = FALSE, align = "right", fill = NA)
data.frame(., correl)
})
import_df
and received the following:
> import_df
Source: local data frame [15,365 x 5]
Groups: id
id asof_dt x y correl
1 DC1123 1/10/1990 -0.003773632 NA NA
2 DC1123 1/10/1991 0.014034992 NA NA
3 DC1123 1/10/1992 -0.004109765 NA NA
4 DC1123 1/10/1994 0.006369326 0.012176085 NA
5 DC1123 1/10/1995 0.014900600 0.001241080 NA
6 DC1123 1/10/1996 0.005763689 -0.013112491 NA
7 DC1123 1/10/1997 0.006949765 0.010737034 NA
8 DC1123 1/10/2000 0.044052805 0.003346296 0.02724175
9 DC1123 1/10/2001 0.009452785 0.017582638 0.01362101
10 DC1123 1/10/2002 -0.018876970 0.004346372 0.01343657
.. ... ... ... ... ...
so it feels like its working.
The cor function only returns a value once it has 5 complete input points, which doesn't happen until row 8 (see the NA-handling sketch below).
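If the NAs in the real data (visible in the y column above) are part of the problem rather than just the window size, one option, a sketch not from the original answers, is to let cor drop incomplete pairs inside each window:
# Drop rows with NA inside each 5-row window; cor still returns NA when
# too few complete pairs remain.
corx <- function(x) cor(x[, 1], x[, 2], use = "pairwise.complete.obs")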
Here is a solution using base R; note that it requires that the data set be sorted by name and date, in that order.
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
rollcorr = function(df, lag = 4) {
  out = numeric(nrow(df) - lag)
  for (i in seq_along(out)) {
    window = i:(i + lag)                      # the current row plus the previous `lag` rows
    out[i] = cor(df$x[window], df$y[window])
  }
  out <- c(rep(NA, lag), out)                 # pad the first `lag` positions with NA
  return(out)
}
df$Correl <- do.call(c, by(df[, -1], df[, 1], rollcorr))
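For comparison, a grouped-mutate sketch of the same computation (assuming the df built above): applying zoo::rollapplyr to the row indices within each group restarts the 5-row window at each name and keeps the NA padding:
library(dplyr)
library(zoo)
df <- df %>%
  group_by(name) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(Correl = rollapplyr(
    seq_along(x), width = 5,
    FUN = function(ix) cor(x[ix], y[ix]),  # correlate each 5-row window
    fill = NA                              # NA for the first 4 rows of each group
  )) %>%
  ungroup()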
