I have an array of iterations in an MCMC algorithm. The rows represent draws from a distribution. The columns represent parameters (variables) in the distribution. For ease of exposition: assume two variables, five iterations. So I have:
> draws <- data.frame( iteration = c(1:5),
alpha = rnorm(5,0,1),
beta = rnorm(5,0,1))
iteration alpha beta
1 1 -0.3157940 0.2122465
2 2 1.0087298 -0.2346733
3 3 1.0366165 0.3472915
4 4 -2.4256564 0.9863279
5 5 -0.6089072 -1.1213000
When I melt the dataset, I get:
> melt(draws)
Using as id variables
variable value
1 iteration 1.0000000
2 iteration 2.0000000
3 iteration 3.0000000
4 iteration 4.0000000
5 iteration 5.0000000
6 alpha -0.1042616
7 alpha 1.0707001
8 alpha 0.2166865
9 alpha 0.0771617
10 alpha -0.8893614
11 beta -0.4846693
12 beta -1.5950729
13 beta -0.7178340
14 beta 1.0149766
15 beta -0.3128256
But I want to hold iteration out so that I get the equivalent of (hand edited):
> melt(draws)
Using as id variables
iteration variable value
1 1 alpha -0.1042616
2 2 alpha 1.0707001
3 3 alpha 0.2166865
4 4 alpha 0.0771617
5 5 alpha -0.8893614
6 1 beta -0.4846693
7 2 beta -1.5950729
8 3 beta -0.7178340
9 4 beta 1.0149766
10 5 beta -0.3128256
Supply the id variable to melt:
melt(draws, id = "iteration")
Gives:
iteration variable value
1 1 alpha -0.02765436
2 2 alpha -1.42138702
3 3 alpha 0.83525096
4 4 alpha -1.10677555
5 5 alpha 0.72465936
6 1 beta 0.59269720
7 2 beta -0.32164072
8 3 beta -1.31097204
9 4 beta 0.94993620
10 5 beta 0.20919169
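For comparison, the same long layout can be built without reshape at all. This is only a base-R sketch, assuming every non-id column is numeric; the names value_cols and long are made up for illustration and are not part of any package.

```r
# Rebuild the example data (the seed is arbitrary, only for reproducibility)
set.seed(1)
draws <- data.frame(iteration = 1:5,
                    alpha = rnorm(5, 0, 1),
                    beta  = rnorm(5, 0, 1))

# Every column except the id column is treated as a measured variable
value_cols <- setdiff(names(draws), "iteration")

# Repeat the id once per measured variable and stack the values column by column
long <- data.frame(
  iteration = rep(draws$iteration, times = length(value_cols)),
  variable  = factor(rep(value_cols, each = nrow(draws)), levels = value_cols),
  value     = unlist(draws[value_cols], use.names = FALSE)
)

print(long)
```

The result has one row per (iteration, variable) pair, which is the shape melt returns when iteration is given as the id variable.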
Bah. I always ask a question right before finding the answer...
I had been reading help(melt.array), but converting to a data.frame in order to post my question eventually led me to the answer in help(melt.data.frame).
To get what I want, I will use:
myMelt <- melt(draws, id.vars = "iteration")
So that I can then make a faceted plot:
ggplot(myMelt, aes(x = iteration,y = value)) + geom_point() + stat_smooth() + facet_grid(variable ~ ., scales="free")
The following data are first split randomly into two groups according to the sl variable. The model is then fitted to each group using the for loop shown below the data set.
mydata
y x sl
1 5.297967 1 1
2 3.322833 2 1
3 4.969813 3 1
4 4.276666 4 1
5 5.972807 1 2
6 6.619440 2 2
7 8.045588 3 2
8 7.377759 4 2
9 6.907755 5 2
10 8.672486 6 2
11 8.283999 7 2
12 8.455318 8 2
13 7.414573 9 2
14 8.634087 10 2
15 7.356355 1 3
16 6.606247 2 3
17 6.396930 9 3
18 6.579251 10 3
19 5.521110 1 4
20 2.224221 2 4
21 6.742881 3 4
22 6.709304 4 4
23 6.875232 5 4
24 8.476371 6 4
25 7.360104 7 4
I run the model with the lme() function for both groups, then store the beta coefficients as a matrix and theta (the random intercept term) as a vector:
sl.no=unique(mydata$sl)
m=length(unique(mydata$sl))
ngrp=2
set.seed(125)
idx=sample(1:ngrp, size=m, replace = T)
beta=matrix(NA, nrow = ngrp, ncol=3, byrow=T) #null matrix to store coefficients from both groups
theta=rep(0,m) #null vector to store intercepts from both groups
library(nlme)
for ( g in 1:ngrp){
rg=sl.no[idx==g]
mydata_rG=mydata[mydata$sl %in% rg,] #Data set belongs to group-g
lme_mod=lme(y~x+I(x^2),random = ~ 1|sl,
data = mydata_rG, method = "ML") #mixed effect model for each group
beta[g,]=c(unlist(lme_mod$coefficients[1])[[1]],
unlist(lme_mod$coefficients[1])[[2]],
unlist(lme_mod$coefficients[1])[[3]])
theta=c(unname(lme_mod$coefficients$random$sl))
}
I am expecting a theta vector of length m. Unfortunately, theta comes out with length one.
Any help is appreciated.
Results for beta and theta:
beta
[,1] [,2] [,3]
[1,] 4.895805 0.7954474 -0.05602771
[2,] 6.423533 -1.7441753 0.32049662
theta
[1] 4.264366e-21 #it should be length of m.
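It's only about how you store the theta values. In your loop, theta is overwritten on every iteration, so only the last group's random intercepts survive. Start from an empty vector and append to it instead: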
sl.no=unique(mydata$sl)
m=length(unique(mydata$sl))
ngrp=2
set.seed(125)
idx=sample(1:ngrp, size=m, replace = T)
beta=matrix(NA, nrow = ngrp, ncol=3, byrow=T)
theta=numeric() #Change here
library(nlme)
for ( g in 1:ngrp){
rg=sl.no[idx==g]
mydata_rG=mydata[mydata$sl %in% rg,]
lme_mod=lme(y~x+I(x^2),random = ~ 1|sl,
data = mydata_rG, method = "ML")
beta[g,]=c(unlist(lme_mod$coefficients[1])[[1]],
unlist(lme_mod$coefficients[1])[[2]],
unlist(lme_mod$coefficients[1])[[3]])
theta=c(theta, unname(lme_mod$coefficients$random$sl)) #Change here
}
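To see the difference between the storage strategies without fitting any model, here is a toy sketch. The group assignment idx and the stand-in values vals are hypothetical; in the real loop vals would be the random intercepts taken from lme_mod. The third variant (filling by position) is my own suggestion, not something from the code above, and it assumes the intercepts come back in the same order as rg.

```r
sl.no <- 1:4
idx   <- c(1, 2, 2, 1)   # hypothetical group assignment for each sl

theta_overwrite <- rep(0, length(sl.no))          # what the question does
theta_append    <- numeric()                      # what the answer does
theta_indexed   <- rep(NA_real_, length(sl.no))   # alternative: fill by position

for (g in 1:2) {
  rg   <- sl.no[idx == g]
  vals <- rg * 10   # stand-in for the per-sl random intercepts of group g

  theta_overwrite <- vals                      # replaced on every pass
  theta_append    <- c(theta_append, vals)     # grows, ordered by group
  theta_indexed[idx == g] <- vals              # stays aligned with sl.no
}

print(theta_overwrite)
print(theta_append)
print(theta_indexed)
```

Note that the appended vector is ordered group by group, not by sl. If you later need theta to line up with sl.no, filling by position may be more convenient.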
I'm trying to plot three data series in a single plot. The X and Y coordinates of each series are in separate columns in my data frame:
X1 Y1 X2 Y2 X3 Y3
1 0 1 0 2 0 3
2 1 2 1 3 1 4
3 2 3 2 4 2 5
4 3 4 3 5 3 6
5 4 5 4 6 4 7
6 5 6 5 7 5 8
7 6 7 6 8 6 9
8 0 0 7 9 7 8
9 0 0 8 8 0 0
10 0 0 9 7 0 0
Since the trailing (0,0) data points of each series are invalid, only this subset of points should eventually be plotted:
X1 Y1 X2 Y2 X3 Y3
1 0 1 0 2 0 3
2 1 2 1 3 1 4
3 2 3 2 4 2 5
4 3 4 3 5 3 6
5 4 5 4 6 4 7
6 5 6 5 7 5 8
7 6 7 6 8 6 9
8 7 9 7 8
9 8 8
10 9 7
Additionally, the X-axis of the first series should be inverted:
Even without cleaning up the data frame first, I struggled to plot the column pairs as individual series in ggplot2 (see 'legend').
require(ggplot2)
report <- function(df){
plot = ggplot(data=df, aes(x=-X1, y=Y1, size=3)) + #inverted X-axis of series 1
layer(geom="point") +
geom_point(aes(X2, Y2, colour="red", size=2)) +
geom_point(aes(X3, Y3, colour="blue", size=1)) +
xlab("X") + ylab("Y")
print(plot)
}
X1 = c(0,1,2,3,4,5,6,0,0,0)
Y1 = c(1,2,3,4,5,6,7,0,0,0)
X2 = c(0,1,2,3,4,5,6,7,8,9)
Y2 = c(2,3,4,5,6,7,8,9,8,7)
X3 = c(0,1,2,3,4,5,6,7,0,0)
Y3 = c(3,4,5,6,7,8,9,8,0,0)
df <- data.frame(X1,Y1,X2,Y2,X3,Y3)
colnames(df) <- c("X1","Y1","X2","Y2","X3","Y3")
report(df)
What would be the best way to get rid of the invalid (0,0) data points in each series, and how should I plot them properly?
I think you actually want to transform your data.frame first, which makes the ggplot call more concise. Here is an updated version that plots your data correctly, using the dplyr package to transform the data.
In response to a comment requesting additional info on dplyr: it provides the %>% operator, which simply passes the value on its left into the function on its right as the first argument. This allows for much more readable R code. The mutate function adds the Series variable, set manually from the knowledge of which rows belong to which series. The filter function then removes the (0,0) points which you indicated were not wanted. You can inspect df after these operations to see the final output. Hope this helps interpret the code below; the dplyr documentation has more detail.
library(dplyr)
df <- rbind.data.frame(
data.frame(X=-X1, Y=Y1),
data.frame(X=X2, Y=Y2),
data.frame(X=X3, Y=Y3))
df <- df %>%
mutate(Series=rep(c('S1', 'S2', 'S3'), each=10)) %>%
filter(!(X == 0 & Y == 0))
png('foo.png')
ggplot(df) + geom_point(aes(x=X, y=Y, color=Series, size=Series))
dev.off()
If you also want to set the color and size values manually and add lines, as in your ideal example plot, here is a more complex ggplot command:
ggplot(df, aes(x=X, y=Y, color=Series, size=Series)) +
geom_point() + geom_line(size=1) + theme_bw() +
scale_color_manual(values=c('black', 'red', 'blue')) +
scale_size_manual(values=seq(4,2,-1))
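If you would rather not depend on dplyr, the same reshaping can be sketched in base R. This is an illustration only; the list series and the data frame long are names I made up. It filters each series before stacking, which gives the same rows as filtering afterwards.

```r
X1 <- c(0,1,2,3,4,5,6,0,0,0); Y1 <- c(1,2,3,4,5,6,7,0,0,0)
X2 <- c(0,1,2,3,4,5,6,7,8,9); Y2 <- c(2,3,4,5,6,7,8,9,8,7)
X3 <- c(0,1,2,3,4,5,6,7,0,0); Y3 <- c(3,4,5,6,7,8,9,8,0,0)

# One data frame per series; the X values of series 1 are negated to invert it
series <- list(
  S1 = data.frame(X = -X1, Y = Y1),
  S2 = data.frame(X = X2,  Y = Y2),
  S3 = data.frame(X = X3,  Y = Y3)
)

# Drop the invalid (0,0) points from each series, label it, then stack
long <- do.call(rbind, lapply(names(series), function(nm) {
  s <- series[[nm]]
  s <- s[!(s$X == 0 & s$Y == 0), ]
  s$Series <- nm
  s
}))

print(table(long$Series))
```

The resulting long data frame has the same X, Y and Series columns as the dplyr version, so it can be passed to the ggplot calls above unchanged.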
I have the following data set for which I've written some code to do permutation testing
df <- read.table(text="Group var1 var2 var3 var4 var5
1 3 5 7 3 7
1 3 7 5 9 6
1 5 2 6 7 6
1 9 5 7 0 8
1 2 4 5 7 8
1 2 3 1 6 4
2 4 2 7 6 5
2 0 8 3 7 5
2 1 2 3 5 9
2 1 5 3 8 0
2 2 6 9 0 7
2 3 6 7 8 8
2 10 6 3 8 0", header = TRUE)
This is my code. However it doesn't seem to work for some reason - all the p values I get at the end are about 0.5. Can anyone see what I'm doing wrong??
data = df[,2:6]
t.test.pvals = matrix(NA,nrow=1000,ncol=5)
ids.group1 = c(1,2,3,4,5,6)
ids.group2 = c(7,8,9,10,11,12,13)
#Define binary vector type for the t test
group1.binary <- rep(0,times=6)
group2.binary <- rep(1,times=7)
type <- c(group1.binary,group2.binary)
#Permutation testing
for (i in 1:1000) {
index = sample(1:13, size=13, replace=F)
group1 = data[which(index %in% ids.group1),]
group2 = data[which(index %in% ids.group2),]
group.total = rbind(group1,group2)
temp = t(sapply(group.total, function(x)
unlist(t.test(x~type)[c("p.value")])))
temp = as.vector(temp)
t.test.pvals[i,] = temp
}
You can either do a t-test or do permutation testing. In the permutation testing, you don't use t-tests. See for instance here for a tutorial on permutation testing. Below you find the code for your particular example (e.g. var5):
# t-test
with(df, t.test(var5~Group))$p.value
# Permutation testing
# mean difference
mean.diff <- with(df, abs(mean(var5[Group==1])-mean(var5[Group==2])))
# function that calculates the absolute mean difference for one random relabelling
one.test <- function(x,y) {
xstar<-sample(x)
abs(mean(y[xstar==1])-mean(y[xstar==2]))
}
# calculating the resampled mean differences (the observed one is included)
many.diff <- c(mean.diff, with(df, replicate(1000, one.test(Group, var5))))
# pvalue
p5 <- mean(abs(many.diff) >= abs(mean.diff))
p5
The way you did it, you resampled and then calculated p-values from a t-test. After the resampling, the p-value is uniformly distributed between 0 and 1. Therefore when you look at summary(t.test.pvals), you see uniformly distributed p-values (as expected).
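To make the recipe above reusable for any of the columns, it could be wrapped in a small function. This is a sketch rather than a vetted implementation: perm_pvalue and its n_perm argument are names I invented, and the seed is arbitrary. Counting the observed statistic once in both numerator and denominator mirrors what many.diff does above.

```r
# var5 and the group labels copied from the question's data
group <- rep(c(1, 2), times = c(6, 7))
var5  <- c(7, 6, 6, 8, 8, 4, 5, 5, 9, 0, 7, 8, 0)

# Two-sided permutation test on the absolute difference in group means
perm_pvalue <- function(y, g, n_perm = 1000) {
  observed <- abs(mean(y[g == 1]) - mean(y[g == 2]))
  permuted <- replicate(n_perm, {
    g_star <- sample(g)   # shuffle the labels, keep the values fixed
    abs(mean(y[g_star == 1]) - mean(y[g_star == 2]))
  })
  # The observed statistic counts as one of the permutations
  (sum(permuted >= observed) + 1) / (n_perm + 1)
}

set.seed(42)
p <- perm_pvalue(var5, group)
print(p)
```

The key difference from the loop in the question is that the labels are shuffled while the statistic is compared against the one observed value; no t-test is run inside the loop.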
@shadow explained the issue with your code well. If I were you, I would generally refrain from coding this kind of thing from scratch. The coin package implements a wide range of permutation tests, so there is no need to re-invent the wheel.
This code
library(coin)
sapply(df[,-1], function(x) pvalue(oneway_test(x ~ as.factor(df$Group))))
## var1 var2 var3 var4 var5
## 0.548 0.544 0.898 0.685 0.304
does what you seem to want to do (i.e., test whether there is a shift in the distribution of varX in Group 1 versus Group 2).
I have the following code that performs hierarchical clustering and plots the result as a heatmap.
library(gplots)
set.seed(538)
# generate data
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
# the actual data is much larger than the above
# perform hierarchical clustering and plot heatmap
test <- heatmap.2(y)
Which plots this:
What I want to do is to get the cluster members at each level of the hierarchy in the plot,
yielding:
Clust 1: g3-g2-g4
Clust 2: g2-g4
Clust 3: g4-g7
etc
Cluster last: g1-g2-g3-g4-g5-g6-g7-g8-g9-g10
Is there a way to do it?
I did have the answer, after all! @zkurtz identified the problem: the data I was using were different from the data you were using. I added a set.seed(538) statement to your code to stabilize the data.
Use the following code to create a matrix of cluster membership for the row dendrogram:
cutree(as.hclust(test$rowDendrogram), 1:dim(y)[1])
This will give you:
1 2 3 4 5 6 7 8 9 10
g1 1 1 1 1 1 1 1 1 1 1
g2 1 2 2 2 2 2 2 2 2 2
g3 1 2 2 3 3 3 3 3 3 3
g4 1 2 2 2 2 2 2 2 2 4
g5 1 1 1 1 1 1 1 4 4 5
g6 1 2 3 4 4 4 4 5 5 6
g7 1 2 2 2 2 5 5 6 6 7
g8 1 2 3 4 5 6 6 7 7 8
g9 1 2 3 4 4 4 7 8 8 9
g10 1 2 3 4 5 6 6 7 9 10
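Each column of that matrix is one cut of the tree, so turning a column into explicit member lists is a matter of splitting the row names by cluster id. The sketch below avoids gplots and calls hclust(dist(y)) directly, on the assumption that these are the defaults heatmap.2 uses for the rows; members_at is a helper name I made up.

```r
set.seed(538)
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))

# Assumption: equivalent to as.hclust(test$rowDendrogram) under heatmap.2 defaults
hc <- hclust(dist(y))

# One column per number of clusters k = 1..10, named by k
membership <- cutree(hc, k = 1:nrow(y))

# Hypothetical helper: list the members of each cluster when the tree is cut into k groups
members_at <- function(k) {
  split(rownames(membership), membership[, as.character(k)])
}

print(members_at(3))
```

Calling members_at(k) for every k from 1 to nrow(y) walks through the hierarchy one split at a time.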
This solution requires computing the cluster structure using a different package:
# Generate data
y = matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
# The new package:
library(nnclust)
# Create the links between all pairs of points with
# squared euclidean distance less than threshold
links = nncluster(y, threshold = 2, fill = 1, give.up =1)
# Assign a cluster number to each point
clusters=clusterMember(links, outlier = FALSE)
# Display the points that are "alone" in their own cluster:
nas = which(is.na(clusters))
print(rownames(y)[nas])
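# For each cluster (with at least two points), display the included points.
# Index with which() instead of dropping the NA entries, so that the
# positions in clusters stay aligned with rownames(y).
for(i in 1:max(clusters, na.rm = TRUE)) print(rownames(y)[which(clusters == i)])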
Obviously you would want to revise this into a function of some kind to be more user friendly. In particular, this gives the clusters at only one level of the dendrogram. To get the clusters at other levels, you would have to play with the threshold parameter.
It's my first day learning R and ggplot. I've followed some tutorials and would like to produce plots like the one generated by the following command:
qplot(age, circumference, data = Orange, geom = c("point", "line"), colour = Tree)
It looks like the figure on this page:
http://www.r-bloggers.com/quick-introduction-to-ggplot2/
I had a handmade test data file I created, which looks like this:
site temp humidity
1 1 1 3
2 1 2 4.5
3 1 12 8
4 1 14 10
5 2 1 5
6 2 3 9
7 2 4 6
8 2 8 7
but when I try to read and plot it with:
test <- read.table('test.data')
qplot(temp, humidity, data = test, color=site, geom = c("point", "line"))
the lines on the plot aren't separate series, but link together:
http://imgur.com/weRaX
What am I doing wrong?
Thanks.
You need to tell ggplot2 how to group the data into separate lines. It's not a mind reader! ;)
dat <- read.table(text = " site temp humidity
1 1 1 3
2 1 2 4.5
3 1 12 8
4 1 14 10
5 2 1 5
6 2 3 9
7 2 4 6
8 2 8 7",sep = "",header = TRUE)
qplot(temp, humidity, data = dat, group = site,color=site, geom = c("point", "line"))
Note that you probably also wanted to do color = factor(site) in order to force a discrete color scale, rather than a continuous one.
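The same idea written with ggplot() instead of qplot() might look like the sketch below. Converting site to a factor up front is my own choice here; with a discrete colour variable ggplot2 should group the lines by it automatically, but the explicit group aesthetic is kept to make the intent obvious. The plot is only drawn if ggplot2 is installed.

```r
dat <- read.table(text = "site temp humidity
1 1 1 3
2 1 2 4.5
3 1 12 8
4 1 14 10
5 2 1 5
6 2 3 9
7 2 4 6
8 2 8 7", header = TRUE)

# A factor gives a discrete colour scale and one line per site
dat$site <- factor(dat$site)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x = temp, y = humidity, colour = site, group = site)) +
    geom_point() +
    geom_line()
  print(p)
}
```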