Alter my values to surround a certain point R - r

I have the following data, which shows the values for 5 different cohorts of patients (3 patients in each cohort):
dat <- data.frame(Cohort=c(1,1,1, 2,2,2, 3,3,3, 4,4,4, 5,5,5),
LEN_Dose=c(15,15,15, 25,25,25, 15,15,15, 10,10,10, 10,10,10),
DLT=c("N","N","N", "Y","Y","N", "Y","N","Y", "N","N","Y", "N","N","Y"))
I would like to modify the cohort levels to be +/- 0.2 of the main cohort number so they don't sit on top of one another in a graph. I can achieve what I want like this:
dat$Cohort <- dat$Cohort-0.2
dat$Cohort <- ifelse(duplicated(dat$Cohort), dat$Cohort+0.2, dat$Cohort)
dat$Cohort <- ifelse(duplicated(dat$Cohort), dat$Cohort+0.2, dat$Cohort) # have to run this twice as there are 3 patients
So the result is:
head(dat)
# Cohort LEN_Dose DLT
# 0.8 15 N
# 1.0 15 N
# 1.2 15 N
# 1.8 25 Y
# 2.0 25 Y
# 2.2 25 N
But I'm wondering if there's a better way to do this? E.g. somehow supplying the base cohort level and having a function automatically work out the 3 values I need?
The point is to eventually graph the data using this graph:
ggplot(aes(x=Cohort, y=as.numeric(LEN_Dose)), data = dat) +
ylab("Dose Level\n") +
xlab("\nCohort") +
ggtitle("\n") +
scale_y_continuous(breaks = c(5, 10, 15, 25),
label = c("1.2mg/kg\n5mg", "1.2mg/kg\n10mg", "1.8mg/kg\n15mg", "1.8mg/kg\n25mg")) +
scale_fill_manual(values = c("white", "darkred"),
name="Had DLT") +
geom_line(colour="grey20", size=1) +
geom_point(shape=23, size=6, aes(fill=DLT), stroke=1.1, colour="grey20") + # 21 for circles
theme_classic() +
theme(legend.box.margin=margin(c(0,0,0,-10))) +
expand_limits(y=c(5,25))
EDIT: I have tried position = position_jitter, position = position_dodge and all the other types of positions within ggplot itself, but they don't space the points equally or in any particular order, which is why I'm trying to modify the dataframe itself

How about writing your own jitter function, something like:
jitterit <- function(xTojitter = dat$Cohort, howMuchjitter = 0.2) {
  x <- xTojitter
  for (u in unique(x[!is.na(x)])) {
    # positions of this value, looked up once in the untouched input
    idx <- which(xTojitter == u)
    n <- length(idx)
    half <- n %/% 2
    # first half moves left, second half moves right; for an odd count
    # the middle element stays on the original value (NAs are left alone)
    offsets <- c(rep(-howMuchjitter, half), rep(0, n %% 2), rep(howMuchjitter, half))
    x[idx] <- u + offsets
  }
  return(x)
}
It works for any number of duplicates (even or odd):
jitterit(xTojitter = c(1,1,2,1,2,NA), howMuchjitter=0.2)
[1] 0.8 1.0 1.8 1.2 2.2 NA
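If strictly equal spacing is the goal, a shorter vectorised variant is possible with ave() from base R. This is only a sketch: spread_ties and its step argument are names made up here, not part of the answer above. Unlike jitterit, it keeps a gap of exactly step between neighbours, so two ties end up at -step/2 and +step/2.

```r
# Hypothetical helper (not from the answer above): spread tied values
# evenly around their shared value, in order of appearance.
spread_ties <- function(x, step = 0.2) {
  out <- as.numeric(x)
  ok <- !is.na(x)
  # Position of each element within its group of ties, centred on zero:
  # 3 ties -> -1, 0, 1; 2 ties -> -0.5, 0.5; a single value -> 0
  centred <- ave(as.numeric(seq_along(x[ok])), x[ok],
                 FUN = function(i) seq_along(i) - (length(i) + 1) / 2)
  out[ok] <- out[ok] + centred * step
  out
}

cohort <- c(1, 1, 1, 2, 2, 2)
spread_ties(cohort)
# [1] 0.8 1.0 1.2 1.8 2.0 2.2
```

With the question's data this would be called as dat$Cohort <- spread_ties(dat$Cohort); missing values are passed through unchanged.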

Related

plot gradient with three dimensions (coded as red, green & blue) in a ternary plot (ggtern)

I have the following data frame;
df <- data.frame(expand.grid(A = seq(0,1,0.1), B = seq(0,1,0.1), C = seq(0,1,0.1)))
df <- df[abs(rowSums(df[,1:3]) - 1) < 1e-9,] # compare with a tolerance: == 1 can drop valid rows through floating-point error
df <- data.frame(df, value_A = 0.2 * df$A + -0.4 * df$B + 0 * df$C, value_B = 0.8 * df$A + 0.1 * df$B + 0.5 * df$C, value_C = 0 * df$A + 0.8 * df$B + 0 * df$C)
df[,c(4:6)] <- (df[,c(4:6)] + abs(apply(df[,c(4:6)],1,min))) / (apply(df[,c(4:6)],1,max) + abs(apply(df[,c(4:6)],1,min)))
df <- data.frame(df, color = rgb(red=df$value_A, green=df$value_B, blue=df$value_C))
In the data-frame, each row gives me a proportion of A's, B's and C's that sum up to one.
Further, for each A, B and C, I have a value.
From these three values I generate one RGB value that gives me an indication of the relative importance of A,B & C under different proportions of A, B and C's.
Now I would like to plot these RGB values as a surface in a ternary plot.
I can plot one "dimension" with the following code using the ggtern package in R;
library(ggtern)
ggtern(df, aes(A,B,C, value=value_A)) +
theme_showarrows() +
stat_interpolate_tern(geom="polygon",
formula=value~x+y,
n=20, method='lm',
breaks=seq(0,1, by=0.001),
aes(fill=..level..), expand=F
) +
scale_fill_gradient(low="red", high="green")
But instead of value_A, I actually want to use the RGB values directly.
However, I cannot figure out how to specify, instead of value=value_A, the color value of each point directly, so that the surface is calculated from those colors.
Is this possible with ggplot2 / ggtern?

How to dodge points in ggplot2 in R

df = data.frame(subj=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10), block=factor(rep(c(1,2),10)), acc=c(0.75,0.83,0.58,0.75,0.58,0.83,0.92,0.83,0.83,0.67,0.75,0.5,0.67,0.83,0.92,0.58,0.75,0.5,0.67,0.67))
ggplot(df,aes(block,acc,group=subj)) + geom_point(position=position_dodge(width=0.3)) + ylim(0,1) + labs(x='Block',y='Accuracy')
How do I get points to dodge each other uniformly in the horizontal direction? (I grouped by subj in order to get it to dodge at all, which might not be the correct thing to do...)
I think this might be what you were looking for, although no doubt you have solved it by now.
Hopefully it will help someone else with the same issue.
A simple way is to use geom_dotplot like this:
ggplot(df,aes(x=block,y=acc)) +
geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.03) + ylim(0,1) + labs(x='Block',y='Accuracy')
This looks like this:
Note that x (block in this case) has to be a factor for this to work.
If they don't have to be perfectly aligned horizontally, here's one quick way of doing it, using geom_jitter. You don't need to group by subj.
Method 1 [Simpler]: Using geom_jitter()
ggplot(df,aes(x=block,y=acc)) + geom_jitter(position=position_jitter(width=0.05, height=0)) + ylim(0,1) + labs(x='Block',y='Accuracy') # height=0 leaves the acc values untouched
Increase the jitter width for a greater degree of jittering.
which produces:
Method 2: Deterministically calculating the jitter value for each row
We first use aggregate to count the number of duplicated entries. Then in a new data frame, for each duplicated value, move it horizontally to the left by an epsilon distance.
df$subj <- NULL #drop this so that aggregate works.
#a new data frame that shows duplicated values
agg.df <- aggregate(list(numdup=seq_len(nrow(df))), df, length)
agg.df$block <- as.numeric(agg.df$block) #convert block from factor to numeric
# block acc numdup
#1 2 0.50 2
#2 1 0.58 2
#3 2 0.58 1
#4 1 0.67 2
#...
epsilon <- 0.02 #jitter distance
new.df <- NULL #create an expanded dataframe, with block value jittered deterministically
r <- 0
for (i in 1:nrow(agg.df)) {
for (j in 1:agg.df$numdup[i]) {
r <- r+1 #row counter in the expanded df
new.df$block[r] <- agg.df$block[i]
new.df$acc[r] <- agg.df$acc[i]
new.df$jit.value[r] <- agg.df$block[i] - (j-1)*epsilon
}
}
new.df <- as.data.frame(new.df)
ggplot(new.df,aes(x=jit.value,y=acc)) + geom_point(size=2) + ylim(0,1) + labs(x='Block',y='Accuracy') + xlim(0,3)
which produces:
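The double loop in Method 2 can also be written without loops. The sketch below applies the same rule (shift the k-th duplicate left by (k - 1) * epsilon) using ave() from base R; the small data frame is made up for illustration and only mimics the structure of the question's data.

```r
# Made-up miniature of the accuracy data: block is a factor, acc has ties
df <- data.frame(block = factor(c(1, 2, 1, 2, 1, 2)),
                 acc   = c(0.75, 0.83, 0.75, 0.50, 0.58, 0.83))
epsilon <- 0.02  # jitter distance, as in Method 2

# k = 1 for the first point with a given (block, acc) pair, 2 for the second, ...
k <- ave(seq_len(nrow(df)), df$block, df$acc, FUN = seq_along)

# Same rule as the loop: shift the k-th duplicate left by (k - 1) * epsilon
df$jit.value <- as.numeric(df$block) - (k - 1) * epsilon
df$jit.value
# [1] 1.00 2.00 0.98 2.00 1.00 1.98
```

Because the original rows are kept, df can be passed straight to ggplot with aes(x=jit.value, y=acc), and other columns such as subj do not need to be dropped first.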

Coloring line segments in ggplot2

Suppose I have the following data for a student's score on a test.
set.seed(1)
df <- data.frame(question = 0:10,
resp = c(NA,sample(c("Correct","Incorrect"),10,replace=TRUE)),
score.after.resp=50)
for (i in 1:10) {
ifelse(df$resp[i+1] == "Correct",
df$score.after.resp[i+1] <- df$score.after.resp[i] + 5,
df$score.after.resp[i+1] <- df$score.after.resp[i] - 5)
}
df
question resp score.after.resp
1 0 <NA> 50
2 1 Correct 55
3 2 Correct 60
4 3 Incorrect 55
5 4 Incorrect 50
6 5 Correct 55
7 6 Incorrect 50
8 7 Incorrect 45
9 8 Incorrect 40
10 9 Incorrect 35
11 10 Correct 40
I want to get the following graph:
library(ggplot2)
ggplot(df,aes(x = question, y = score.after.resp)) + geom_line() + geom_point()
My problem is: I want to color the segments of this line according to the student's response. If the response is correct (score increasing), the line segment should be green; if it is incorrect (score decreasing), it should be red.
I tried the following code, but it did not work:
ggplot(df,aes(x = question, y = score.after.resp, color=factor(resp))) +
geom_line() + geom_point()
Any ideas?
I would probably approach this a little differently, and use geom_segment instead:
df1 <- as.data.frame(with(df,cbind(embed(score.after.resp,2),embed(question,2))))
colnames(df1) <- c('yend','y','xend','x')
df1$col <- ifelse(df1$y - df1$yend >= 0,'Decrease','Increase')
ggplot(df1) +
geom_segment(aes(x = x,y = y,xend = xend,yend = yend,colour = col)) +
geom_point(data = df,aes(x = question,y = score.after.resp))
A brief explanation:
I'm using embed to transform the x and y variables into starting and ending points for each line segment, and then simply adding a variable that indicates whether each segment went up or down. Then I used the previous data frame to add the original points themselves.
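To see what embed does here, a small sketch on made-up scores may help. The names score, question and segs are illustrative only; the column naming follows the answer's code.

```r
score    <- c(50, 55, 60, 55)  # made-up scores after questions 0..3
question <- 0:3

# embed(x, 2) pairs each value with its predecessor: column 1 is x[t],
# column 2 is x[t - 1], which is why the columns are named yend and y
segs <- as.data.frame(cbind(embed(score, 2), embed(question, 2)))
colnames(segs) <- c("yend", "y", "xend", "x")
segs$col <- ifelse(segs$y - segs$yend >= 0, "Decrease", "Increase")
segs
#   yend  y xend x      col
# 1   55 50    1 0 Increase
# 2   60 55    2 1 Increase
# 3   55 60    3 2 Decrease
```

Each row is one line segment from (x, y) to (xend, yend), so n points give n - 1 segments.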
Alternatively, I suppose you could use geom_line something like this:
df$resp1 <- c(as.character(df$resp[-1]),NA)
ggplot(df,aes(x = question, y = score.after.resp, color=factor(resp1),group = 1)) +
geom_line() + geom_point(color = "black")
By default ggplot2 groups the data according to the aesthetics that are mapped to factors. You can override this default by setting group explicitly:
last_plot() + aes(group=NA)

How to create summary tables and graphs in R by looping through the response variables (in columns)

I have a dataset with multiple response variables and three treatments. Treatment 2 is nested within treatment 1, and treatment 3 is nested within treatment 2. I would like to run this over 22 response variables, of which only 3 are shown in the demo table for the sake of simplicity.
My objective:
To visualize how the response variable(s) change based on the treatment combination. I have created a script that does this for one response variable, and I am copy-pasting this code to run it on the other columns, which seems to me an extremely crude way to do it. This leads to my second objective.
Automate or modify the following script so that it automatically loops through the columns and produces the desired tables and graphs.
Demo data:
demo.table
Here is my script:
library(doBy)
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
attach (demo)
cdataNA <- summaryBy(tyr ~ spp + wat + ins, data=demo, FUN=c(length2,mean,sd), na.rm=TRUE)
# Rename column tyr.length2 to just N
names(cdataNA)[names(cdataNA)=="tyr.length2"] <- "N"
# Calculate standard error of the mean
cdataNA$tyr.SE <- cdataNA$tyr.sd / sqrt(cdataNA$N)
cdataNA
# Now create a barplot using ggplot2
library(ggplot2)
a <- ggplot(cdataNA, aes(x = wat, y = tyr.mean, fill = ins))
b <- a + geom_bar(stat = "identity", position = "dodge") + facet_grid (~ spp)
# Now put errorbars.
c <- b + geom_errorbar(aes(ymin=tyr.mean-tyr.SE, ymax=tyr.mean+tyr.SE),
width=.2, # Width of the error bars
position=position_dodge(.9)) +
xlab ("wat") +
ylab ("tyr (PA/PA std)")
c
## esc
library(doBy)
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
cdataNA1 <- summaryBy(esc ~ spp + wat + ins, data=demo, FUN=c(length2,mean,sd), na.rm=TRUE)
# Rename column esc.length2 to just N
names(cdataNA1)[names(cdataNA1)=="esc.length2"] <- "N"
# Calculate standard error of the mean
cdataNA1$esc.SE <- cdataNA1$esc.sd / sqrt(cdataNA1$N)
cdataNA1
# Now create a barplot using ggplot2
library(ggplot2)
a1 <- ggplot(cdataNA1, aes(x = wat, y = esc.mean, fill = ins))
b1 <- a1 + geom_bar(stat = "identity", position = "dodge") + facet_grid (~ spp)
# Now put errorbars.
c1 <- b1 + geom_errorbar(aes(ymin=esc.mean-esc.SE, ymax=esc.mean+esc.SE),
width=.2, # Width of the error bars
position=position_dodge(.9)) +
xlab ("wat") +
ylab ("esc (PA/PA std)")
c1
Resulting table for tyr:
spp wat ins N tyr.mean tyr.sd tyr.SE
1 Bl High No 4 0.305325 0.034102041 0.017051020
2 Bl High Yes 5 0.186140 0.045165894 0.020198802
3 Bl Low No 5 0.310540 0.061810096 0.027642315
4 Bl Low Yes 5 0.202840 0.029034944 0.012984822
5 Man High No 4 0.122725 0.075867005 0.037933503
6 Man High Yes 5 0.081800 0.013463469 0.006021046
7 Man Low No 5 0.079880 0.009569587 0.004279650
8 Man Low Yes 4 0.083550 0.018431947 0.009215973
Resulting graph for esc:
demo figure for esc
So the whole thing works, but it still requires considerable manual labor, which impedes the workflow. It would be great to achieve automation.
Thanks in advance.
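You can organize the data in just two lines with melt and cast from the reshape package (here dta is the data frame that the question calls demo):
library(reshape)
melt.dta <- melt(dta, id.vars = c("spp", "wat", "ins"), measure.vars = "tyr")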
cast(melt.dta, spp + wat + ins ~ .,
function (x) c("N" = sum(!is.na(x)),
"mean" = mean(x, na.rm = TRUE),
"sd" = sd(x, na.rm = TRUE),
"se" = sd(x, na.rm = TRUE)/sqrt(sum(!is.na(x)))))
It returns:
spp wat ins N mean sd se
1 Bl High No 4 0.3053 0.03410 0.01705
2 Bl High Yes 5 0.1861 0.04517 0.02020
3 Bl Low No 5 0.3105 0.06181 0.02764
4 Bl Low Yes 5 0.2028 0.02903 0.01298
5 Man High No 4 0.1227 0.07587 0.03793
6 Man High Yes 5 0.0818 0.01346 0.00602
7 Man Low No 5 0.0799 0.00957 0.00428
8 Man Low Yes 4 0.0835 0.01843 0.00922
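To cover the looping part of the question, one possible sketch wraps the summary and the plot in two functions and applies them to a vector of column names. Everything here is an assumption for illustration: the demo data frame is made up (only the column names spp, wat, ins, tyr and esc come from the question), and summarise_response and plot_response are hypothetical helper names. The summary uses aggregate() from base R rather than doBy or reshape.

```r
# Made-up stand-in for the demo data: three grouping factors and two of
# the 22 response columns, with one missing value
demo <- expand.grid(spp = c("Bl", "Man"), wat = c("High", "Low"),
                    ins = c("No", "Yes"), rep = 1:3)
demo$tyr <- seq_len(nrow(demo)) / 100
demo$esc <- rev(demo$tyr)
demo$tyr[1] <- NA

groups    <- c("spp", "wat", "ins")
responses <- c("tyr", "esc")  # extend to all response column names

# Summary table (N, mean, sd, se) for one response column
summarise_response <- function(response, data, groups) {
  stats <- aggregate(data[[response]], by = data[groups],
                     FUN = function(x) c(N    = sum(!is.na(x)),
                                         mean = mean(x, na.rm = TRUE),
                                         sd   = sd(x, na.rm = TRUE)))
  # aggregate() returns the three statistics as a matrix column named x
  out <- cbind(stats[groups], as.data.frame(stats$x))
  out$se <- out$sd / sqrt(out$N)
  out
}

# Dodged bar plot with error bars for one summary table
plot_response <- function(tab, response) {
  ggplot2::ggplot(tab, ggplot2::aes(x = wat, y = mean, fill = ins)) +
    ggplot2::geom_col(position = "dodge") +
    ggplot2::geom_errorbar(ggplot2::aes(ymin = mean - se, ymax = mean + se),
                           width = 0.2,
                           position = ggplot2::position_dodge(0.9)) +
    ggplot2::facet_grid(~ spp) +
    ggplot2::ylab(paste(response, "(PA/PA std)"))
}

# One table per response, stored in a named list
tables <- lapply(setNames(responses, responses), summarise_response,
                 data = demo, groups = groups)
tables$tyr

# One plot per response (only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plots <- Map(plot_response, tables, names(tables))
}
```

Because every table uses the same generic column names (N, mean, sd, se), a single plotting function serves all responses; tables$esc and plots$esc then give the table and graph for esc.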

Generate multiple serial graphs/scatterplots from data in two dataframes

I have 2 dataframes, Tg and Pf, each of 127 columns. All columns have at least one row and can have up to thousands of them. All the values are between 0 and 1 and there are some missing values (empty cells). Here is a little subset:
Tg
Tg1 Tg2 Tg3 ... Tg127
0.9 0.5 0.4 0
0.9 0.3 0.6 0
0.4 0.6 0.6 0.3
0.1 0.7 0.6 0.4
0.1 0.8
0.3 0.9
0.9
0.6
0.1
Pf
Pf1 Pf2 Pf3 ...Pf127
0.9 0.5 0.4 1
0.9 0.3 0.6 0.8
0.6 0.6 0.6 0.7
0.4 0.7 0.6 0.5
0.1 0.6 0.5
0.3
0.3
0.3
Note that some cells are empty, and the vectors for the same subset (i.e. 1 to 127) can differ greatly in length and are rarely exactly the same length.
I want to generate 127 graphs, one for each pair of vectors (i.e. graph 1 is for column 1 from each dataframe, graph 2 is for column 2 from each dataframe, etc.):
Hope that makes sense. I'm looking forward to your assistance as I don't want to make those graphs one by one...
Thanks!
Here is an example to get you started (data at https://gist.github.com/1349300). For further tweaking, check out the excellent ggplot2 documentation that is all over the web.
library(ggplot2)
library(reshape2) # melt()
library(plyr) # ddply(), join(), summarize()
# Load data
Tg = read.table('Tg.txt', header=T, fill=T, sep=' ')
Pf = read.table('Pf.txt', header=T, fill=T, sep=' ')
# Format data
Tg$x = as.numeric(rownames(Tg))
Tg = melt(Tg, id.vars='x')
Tg$source = 'Tg'
Tg$variable = factor(as.numeric(gsub('Tg(.+)', '\\1', Tg$variable)))
Pf$x = as.numeric(rownames(Pf))
Pf = melt(Pf, id.vars='x')
Pf$source = 'Pf'
Pf$variable = factor(as.numeric(gsub('Pf(.+)', '\\1', Pf$variable)))
# Stack data
data = rbind(Tg, Pf)
# Plot
dev.new(width=5, height=4)
p = ggplot(data=data, aes(x=x)) + geom_line(aes(y=value, group=source, color=source)) + facet_wrap(~variable)
p
Highlighting the area between the lines
First, interpolate the data onto a finer grid. This way the ribbon will follow the actual envelope of the lines, rather than just where the original data points were located.
data = ddply(data, c('variable', 'source'), function(x) data.frame(approx(x$x, x$value, xout=seq(min(x$x), max(x$x), length.out=100))))
names(data)[4] = 'value'
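As a small illustration of the interpolation step, the sketch below runs approx() on made-up values for a single column (x, y and fine are illustrative names). approx() returns a list with components x and y, which is why the fourth column is renamed to value above.

```r
x <- 1:4
y <- c(0.9, 0.4, 0.6, 0.1)  # made-up values for one column

# Linear interpolation onto 7 equally spaced points between min(x) and max(x)
fine <- approx(x, y, xout = seq(min(x), max(x), length.out = 7))
fine$x
# [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
fine$y
# [1] 0.90 0.65 0.40 0.50 0.60 0.35 0.10
```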
Next, calculate the data needed for geom_ribbon - namely ymax and ymin.
ribbon.data = ddply(data, c('variable', 'x'), summarize, ymin=min(value), ymax=max(value))
Now it is time to plot. Notice how we've added a new ribbon layer, for which we've substituted our new ribbon.data frame.
dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax), alpha=0.3, data=ribbon.data)
Dynamic coloring between the lines
The trickiest variation is if you want the coloring to vary based on the data. For that, you currently must create a new grouping variable to identify the different segments. Here, for example, we might use a function that indicates when the "Tg" group is on top:
GetSegs <- function(x) {
segs = x[x$source=='Tg', ]$value > x[x$source=='Pf', ]$value
segs.rle = rle(segs)
on.top = ifelse(segs, 'Tg', 'Pf')
on.top[is.na(on.top)] = 'Tg'
group = rep.int(1:length(segs.rle$lengths), times=segs.rle$lengths)
group[is.na(segs)] = NA
data.frame(x=unique(x$x), group, on.top)
}
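The grouping trick inside GetSegs relies on run-length encoding. A minimal sketch on a made-up comparison vector (tg_on_top is an illustrative name) shows how rle() and rep() turn consecutive runs into segment ids:

```r
tg_on_top <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)  # made-up comparison

runs <- rle(tg_on_top)
runs$lengths
# [1] 2 3 1
runs$values
# [1]  TRUE FALSE  TRUE

# Expand the run index back to one id per observation
group <- rep(seq_along(runs$lengths), times = runs$lengths)
group
# [1] 1 1 2 2 2 3
```

Each id marks one stretch where the same series is on top, which is what lets the ribbon be drawn and filled segment by segment.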
Now we apply it and merge the results back with our original ribbon data.
groups = ddply(data, 'variable', GetSegs)
ribbon.data = join(ribbon.data, groups)
For the plot, the key is that we now specify a grouping aesthetic to the ribbon geom.
dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax, group=group, fill=on.top), alpha=0.3, data=ribbon.data)
Code is available together at: https://gist.github.com/1349300
Here is a three-liner to do the same :-). We first use reshape from base R to convert the data into long form. Then it is melted to suit ggplot2. Finally, we generate the plot!
mydf <- reshape(cbind(Tg, Pf), varying = 1:8, direction = 'long', sep = "")
mydf_m <- melt(mydf, id.var = c(1, 4), variable = 'source')
qplot(id, value, colour = source, data = mydf_m, geom = 'line') +
facet_wrap(~ time, ncol = 2)
NOTE. The reshape function in base R is extremely powerful, albeit very confusing to use. It is used to transform data between long and wide formats.
Kudos for automating something you used to do in Excel using R! That's exactly how I got started with R and a common path to R enlightenment :)
All you really need is a little looping. Here's an example, most of which is creating example data that represents your data structure:
## create some example data
Tg <- data.frame(Tg1 = rnorm(10))
for (i in 2:10) {
vec <- rep(NA, 8)
vec <- c(rnorm(sample(5:10,1)), vec)
Tg[paste("Tg", i, sep="")] <- vec[1:10]
}
Pf <- data.frame(Pf1 = rnorm(10))
for (i in 2:10) {
vec <- rep(NA, 8)
vec <- c(rnorm(sample(5:10,1)), vec)
Pf[paste("Pf", i, sep="")] <- vec[1:10]
}
## ok, sample data created
## now lets loop through all the columns
## if you didn't know how many columns there are you could
## use ncol(Tg) to figure out
for (i in 1:10) {
plot(1:10, Tg[,i], type = "l", col="blue", lwd=5, ylim=c(-3,3),
xlim=c(1, max(length(na.omit(Tg[,i])), length(na.omit(Pf[,i])))))
lines(1:10, Pf[,i], col="red", lwd=5) # ylim is set by plot(); lines() does not accept it
dev.copy(png, paste('rplot', i, '.png', sep=""))
dev.off()
}
This writes 10 PNG files (rplot1.png to rplot10.png) to your working directory, one per pair of columns.
