I have data that looks like this:
#val Freq1 Freq2
0.000 178 202
0.001 4611 5300
0.002 99 112
0.003 26 30
0.004 17 20
0.005 15 20
0.006 11 14
0.007 11 13
0.008 13 13
...many more lines..
Full data can be found here:
http://dpaste.com/173536/plain/
What I intend to do is to have a cumulative graph
with "val" as x-axis with "Freq1" & "Freq2" as
y-axis, plot together in 1 graph.
I have this code. But it creates two plots instead of 1.
dat <- read.table("stat.txt",header=F);
val<-dat$V1
freq1<-dat$V2
freq2<-dat$V3
valf1<-rep(val,freq1)
valf2<-rep(val,freq2)
valfreq1table<- table(valf1)
valfreq2table<- table(valf2)
cumfreq1=c(0,cumsum(valfreq1table))
cumfreq2=c(0,cumsum(valfreq2table))
plot(cumfreq1, ylab="CumFreq",xlab="Loglik Ratio")
lines(cumfreq1)
plot(cumfreq2, ylab="CumFreq",xlab="Loglik Ratio")
lines(cumfreq2)
What's the right way to approach this?
data <- read.table("http://dpaste.com/173536/plain/", header = FALSE)
sample1 <- unlist(apply(as.matrix(data),1,function(x) rep(x[1],x[2])))
sample2 <- unlist(apply(as.matrix(data),1,function(x) rep(x[1],x[3])))
plot(ecdf(sample1), verticals=TRUE, do.p=FALSE,
main="ECDF plot for both samples", xlab="Scores",
ylab="Cumulative Percent",lty="dashed")
lines(ecdf(sample2), verticals=TRUE, do.p=FALSE,
col.h="red", col.v="red",lty="dotted")
legend(100,.8,c("Sample 1","Sample 2"),
col=c("black","red"),lty=c("dashed","dotted"))
Try the ecdf() function in base R --- which uses plot.stepfun() if memory serves --- or the Ecdf() function in Hmisc by Frank Harrell. Here is an example from help(Ecdf) that uses a grouping variable to show two ecdfs in one plot:
# Example showing how to draw multiple ECDFs from paired data
pre.test <- rnorm(100,50,10)
post.test <- rnorm(100,55,10)
x <- c(pre.test, post.test)
g <- c(rep('Pre',length(pre.test)),rep('Post',length(post.test)))
Ecdf(x, group=g, xlab='Test Results', label.curves=list(keys=1:2))
Just for the record, here is how you get multiple lines in the same plot "by hand":
plot(cumfreq1, ylab="CumFreq",xlab="Loglik Ratio", type="l")
# or type="b" for lines and points
lines(cumfreq2, col="red")
Related
I have a table which has multiple columns and rows. I want to access the each value by its column name and rowname, and make a plot with these values.
The table looks like this with 101 columns:
IDs Exam1 Exam2 Exam3 Exam4 .... Exam100
Ellie 12 48 33 64
Kate 98 34 21 76
Joe 22 53 49 72
Van 77 40 12
Xavier 88 92
What I want is to be able to reach the marks for given row (IDs),and given column(exams) as:
table[Ellie,Exam3] --> 48
table[Ellie,Exam100] --> 64
table[Ellie,Exam2] --> (empty)
Then with these numbers, I want to see the distribution of how Ellie did comparing the rest of exams to Exam2,3 and 100.
I have almost figured out this part with R:
library(data.table)
library(ggplot2)
pdf("distirbution_given_row.pdf")
selectedvalues <- c(table[Ellie,Exam3] ,table[Ellie,Exam100])
library(plyr)
cdat <- ddply(selected values, "IDs", summarise, exams.mean=mean(exams))
selectedvaluesggplot <- ggplot(selectedvalues, aes(x=IDs, colour=exams)) + geom_density() + geom_vline(data=cdat, aes(xintercept=exams.mean, colour=IDs), linetype="dashed", size=1)
dev.off()
Which should generate the Ellie's marks for exams of interests versus the rest of the marks ( if it is a blank, then it should not be seen as zero. It is still a blank.)
Red: Marks for Exam3, 100 and 2 , Blue: The marks for the remaining 97 exams
(The code and the plot are taken as an example of ggplot2 from this link.)
All ideas are appreciated!
For accessing your data at least you can do the following:
df=data.frame(IDs=c("Ellie","Kate","Joe","Van","Xavier"),Exam1=c(12,98,22,77,NA),Exam2=c(NA,34,53,NA,NA),
Exam3=c(48,21,49,40,NA),Exam4=c(33,76,NA,12,88))
row.names(df)=df$IDs
df=df%>%select(-IDs)
> df['Joe','Exam2']
[1] 53
Now I prepared an example with random created numbers to illustrate a bit what you could do. First let us create an example data frame
df=as.data.frame(matrix(rnorm(505,50,10),ncol=101))
colnames(df)=c("IDs",paste0("Exam",as.character(1:100)))
df$IDs=c("Ellie","Kate","Joe","Van","Xavier")
To work with ggplot it is recomended to convert it to long format:
df0=df%>%gather(key="exams",value="score",-IDs)
From here on you can play with your variables as desired. For instance plotting the density of the score per ID:
ggplot(df0, aes(x=score,col=IDs)) + geom_density()
or selecting only Exams 2,3,100 and plotting density for different exams
df0=df0%>%filter(exams=="Exam2"|exams=="Exam3"|exams=="Exam100")
ggplot(df0, aes(x=score,col=exams)) + geom_density()
IIUC - you want to plot each IDs select exams with all else exams. Consider the following steps:
Reshape your data to long format even replace NAs with zero as needed.
Run by() to subset data by IDs and build mean aggregrate data and ggplots.
Within by, create a SelectValues indicator column on the select exams then graph with vertical line mean summation.
Data
txt = 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text=txt, header = TRUE)
# ADD OTHER EXAM COLUMNS (SEEDED FOR REPRODUCIBILITY)
set.seed(444)
exams_df[paste0("Exam", seq(5:99))] <- replicate(99-4, sample(100, 5))
Reshape and Graph
library(ggplot2) # ONLY PACKAGE NEEDED
# FILL NA
exams_df[is.na(exams_df)] <- 0
# RESHAPE (BASE R VERSION)
exams_long_df <- reshape(exams_df,
timevar = "Exam",
times = names(exams_df)[grep("Exam", names(exams_df))],
v.names = "Score",
varying = names(exams_df)[grep("Exam", names(exams_df))],
new.row.names = 1:1000,
direction = "long")
# GRAPH BY EACH ID
by(exams_long_df, exams_long_df$IDs, FUN=function(df) {
df$SelectValues <- ifelse(df$Exam %in% c("Exam1", "Exam3", "Exam100"), "Select Exams", "All Else")
cdat <- aggregate(Score ~ SelectValues, df, FUN=mean)
ggplot(df, aes(Score, colour=SelectValues)) +
geom_density() + xlim(-50, 120) +
labs(title=paste(df$IDs[[1]], "Density Plot of Scores"), x ="Exam Score", y = "Density") +
geom_vline(data=cdat, aes(xintercept=Score, colour=SelectValues), linetype="dashed", size=1)
})
Output
I am trying to build a complex bar plot that has categories to distinguish. Here is the data frame
Treatment DCA.f Megalorchestia Talitridae Traskorchestia
1 A (-Inf,0] 8.000000 4843.6667 1394.0000
2 U (-Inf,0] 21.000000 2905.3333 483.6667
3 A (0,0.1] 25.000000 254.8571 41.0000
4 U (0,0.1] 30.714286 691.0000 360.1429
5 A (0.1,0.2] 35.400000 1355.2000 127.4000
6 U (0.1,0.2] 104.400000 705.4000 50.2000
7 A (0.2,0.3] 3.857143 649.7143 633.4286
8 U (0.2,0.3] 10.857143 510.4286 268.7143
9 A (0.3,0.4] 13.444444 981.5556 207.5556
10 U (0.3,0.4] 10.666667 1567.5556 417.5556
11 A (0.4, Inf] 0.000000 3.0000 1.2000
12 U (0.4, Inf] 0.000000 3.8000 0.0000
I want a barplot that for each DCA.f group shows 6 values for the three organisms categories (the right three columns), separated by treatment (A v U). So if you read the bottom of the plot there would be a big category for DCA.f and then with in that category there would be six bars. Two for each genera color coded by treatment. And then repeated for all DAC.f. I have looked through many of the other barplot posts and they have not gotten me anywhere.
Any help?
Here is one possibility. When using barplot, each column of the input matrix will correspond to a group of bars, and each row to different bars within groups. Thus, we need to reshape the data so that columns represent the levels of 'DCA.f'
library(reshape2)
library(RColorBrewer)
# reshape data
df2 <- melt(df)
df3 <- dcast(df2, Treatment + variable ~ DCA.f)
# create color palette
ncols <- length(unique(df3$variable))
cols <- c(brewer.pal(ncols, "Greens"), brewer.pal(ncols, "Reds"))
# plot
barplot(as.matrix(df3[ , -c(1, 2)]),
beside = TRUE,
col = cols,
cex.names = 0.7)
# add legend
legend(x = 15, y = 4000, legend = paste(df3$Treatment, df3$variable), fill = cols)
I have the data like this, I want to draw multiple lines in R, the lines contained SC1, SC2, SC3, SC4 and SC5, the xlab is chr (from 1 to 10).
chr pos SC1 SC2 SC3 SC4 SC5
chr01.8.5 1 0.000 2.420907e-02 1.317053e+00 7.171021e-02 3.280758e-03 1.185807e+00
chr01.6.5 1 0.714 0.040931607 1.150449274 0.042270667 0.044192568 0.976696855
A quick and slightly dirty way is to use ?matlines
# assume d is your data
plot(d$chr, d$pos) # plots the data as points
matlines(d$chr, d[,-(1:2)]) # plots every column except 1,2 against d$chr
How to create a categorical bubble plot, using GNU R, similar to that used in systematic mapping studies (see below)?
EDIT: ok, here's what I've tried so far. First, my dataset (Var1 goes to the x-axis, Var2 goes to the y-axis):
> grid
Var1 Var2 count
1 Does.Not.apply Does.Not.apply 53
2 Not.specified Does.Not.apply 15
3 Active.Learning..general. Does.Not.apply 1
4 Problem.based.Learning Does.Not.apply 2
5 Project.Method Does.Not.apply 4
6 Case.based.Learning Does.Not.apply 22
7 Peer.Learning Does.Not.apply 6
10 Other Does.Not.apply 1
11 Does.Not.apply Not.specified 15
12 Not.specified Not.specified 15
21 Does.Not.apply Active.Learning..general. 1
23 Active.Learning..general. Active.Learning..general. 1
31 Does.Not.apply Problem.based.Learning 2
34 Problem.based.Learning Problem.based.Learning 2
41 Does.Not.apply Project.Method 4
45 Project.Method Project.Method 4
51 Does.Not.apply Case.based.Learning 22
56 Case.based.Learning Case.based.Learning 22
61 Does.Not.apply Peer.Learning 6
67 Peer.Learning Peer.Learning 6
91 Does.Not.apply Other 1
100 Other Other 1
Then, trying to plot the data:
# Based on http://flowingdata.com/2010/11/23/how-to-make-bubble-charts/
grid <- subset(grid, count > 0)
radius <- sqrt( grid$count / pi )
symbols(grid$Var1, grid$Var2, radius, inches=0.30, xlab="Research type", ylab="Research area")
text(grid$Var1, grid$Var2, grid$count, cex=0.5)
Here's the result:
Problems: axis labels are wrong, the dashed grid lines are missing.
Here is ggplot2 solution. First, added radius as new variable to your data frame.
grid$radius <- sqrt( grid$count / pi )
You should play around with size of the points and text labels inside the plot to perfect fit.
library(ggplot2)
ggplot(grid,aes(Var1,Var2))+
geom_point(aes(size=radius*7.5),shape=21,fill="white")+
geom_text(aes(label=count),size=4)+
scale_size_identity()+
theme(panel.grid.major=element_line(linetype=2,color="black"),
axis.text.x=element_text(angle=90,hjust=1,vjust=0))
This will get you started by adding the tick marks to your xaxis.
To add the lines, just add a line at each level
ggs <- subset(gg, count > 0)
radius <- sqrt( ggs$count / pi )
# ggs$Var1 <- as.character(ggs$Var1)
# set up your tick marks
# (this can all be put into a single line in `axis`, but it's placed separate here to be more readable)
#--------------
# at which values to place the x tick marks
x_at <- seq_along(levels(gg$Var1))
# the string to place at each tick mark
x_labels <- levels(gg$Var1)
# use xaxt="n" to supress the standard axis ticks
symbols(ggs$Var1, ggs$Var2, radius, inches=0.30, xlab="Research type", ylab="Research area", xaxt="n")
axis(side=1, at=x_at, labels=x_labels)
text(ggs$Var1, ggs$Var2, ggs$count, cex=0.5)
also, notice that instead of calling the object grid I called it gg, and then ggs for the subset. grid is a function in R. While it is "allowed" to overwrite the function with an object, it is not recommended and can lead to annoying bugs down the line.
Here a version using levelplot from latticeExtra.
library(latticeExtra)
levelplot(count~Var1*Var2,data=dat,
panel=function(x,y,z,...)
{
panel.abline(h=x,v=y,lty=2)
cex <- scale(z)*3
panel.levelplot.points(x,y,z,...,cex=5)
panel.text(x,y,label=z,cex=0.8)
},scales=(x=list(abbreviate=TRUE))) ## to get short labels
To get the size of bubble proprtional to the count , you can do this
library(latticeExtra)
levelplot(count~Var1*Var2,data=dat,
panel=function(x,y,z,...)
{
panel.abline(h=x,v=y,lty=2)
cex <- scale(z)*3
panel.levelplot.points(x,y,z,...,cex=5)
panel.text(x,y,label=z,cex=0.8)
})
I don't display it since the render is not clear as in the fix size case.
I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. The way I am computing this is like this:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in data frame
mycounts <- mycounts[sample(n, round(n/10)), ] # get an n/10 sample to the same data frame
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
value_colname <- deparse(substitute(.value))
ddply(df, ..., .fun = function(rdf) {
xs <- rdf[[value_colname]]
qxs <- sort(unique(.quantizer(xs)))
data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
})
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)