Combining ggplot with ddply - r

I have sample data test.data as follows.
income expend id
9142.7 1576.2 1
23648.75 2595 2
9014.25 156 1
4670.4 604.4 3
6691.4 3654.4 3
14425.2 66 2
8563.45 1976.2 2
2392 6 1
7915.95 619.2 3
4424.2 504.2 2
I first use ddply to get the mean income and expend for each id
library(plyr)
group<-ddply(test.data, .(id), summarize, income.mean=mean(income), expend.mean=mean(expend))
Now, I use the qplot function from ggplot2 to plot income.mean and expend.mean by id.
library (ggplot2)
plot.income<-qplot(id,income.mean,data=group)
plot.expend<-qplot(id,expend.mean,data=group)
While the above code runs without any error, I am looking for an efficient way to combine the qplot call with ddply, or vice versa. Also, if I need to combine both of these plots, how do I do that?
Thanks .

I think what you're trying to get at is going to require switching from the 'qplot' function to the 'ggplot' function. Including the graphing functions inside your 'ddply' function is not going to be very pretty, and vice versa. You're better leaving them separate, so I'm going to just focus on combining the graphs. There are two good (in my opinion) ways to do this:
Option 1: Just do both plots as separate geometries on the same 'ggplot' object. This isn't too hard to do, and works like this:
ggplot(group) + geom_point(aes(x=id, y=income.mean), colour="red") + geom_point(aes(x=id, y=expend.mean), colour="blue")
This is a fast option and gets the job done with minimal computation. However, it requires that you specify a new geometry for each column. In your sample data, this isn't an issue, but in many cases, you want to do this with code, instead of doing it by hand.
Option 2: Reshape your data to long format so that both sets of means sit in one plot. Then we can specify the grouping by coloring by the variable column:
library(reshape2)
plot_Data <- melt(group, id="id")
# Output of plot_Data
# id variable value
# 1 1 income.mean 6849.650
# 2 2 income.mean 12765.400
# 3 3 income.mean 6425.917
# 4 1 expend.mean 579.400
# 5 2 expend.mean 1285.350
# 6 3 expend.mean 1626.000
ggplot(plot_Data, aes(x=id, y=value, col=variable)) + geom_point()
The disadvantage of this method is that we are doing a lot more computation, so large complicated data frames may become slow to process. However, the advantage (and this is huge) is that we don't have to know what columns existed in the data frame we are plotting. Everything is sorted, colored, and plotted without our intervention, so we can use this flexibly for just about anything.
You should be able to adjust from here to suit your needs.
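For readers who want to check the numbers without plyr or reshape2, the summarise-then-reshape steps of Option 2 can be sketched in base R. This is only an illustrative equivalent, not the approach used above: `aggregate()` stands in for `ddply`, the hand-built `long` data frame stands in for `melt`, and the name `long` is an assumption of this sketch.

```r
# Sample data from the question
test.data <- read.table(text = "income expend id
9142.7 1576.2 1
23648.75 2595 2
9014.25 156 1
4670.4 604.4 3
6691.4 3654.4 3
14425.2 66 2
8563.45 1976.2 2
2392 6 1
7915.95 619.2 3
4424.2 504.2 2", header = TRUE)

# Per-id means of both columns (stand-in for the ddply step)
group <- aggregate(cbind(income, expend) ~ id, data = test.data, FUN = mean)
names(group) <- c("id", "income.mean", "expend.mean")

# Wide to long by hand (stand-in for melt): one row per id and variable
long <- data.frame(
  id       = rep(group$id, times = 2),
  variable = rep(c("income.mean", "expend.mean"), each = nrow(group)),
  value    = c(group$income.mean, group$expend.mean)
)
print(long)
```

The resulting `long` data frame has the same three columns as `plot_Data`, so it could be passed to the same `ggplot` call.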

To combine both plots I had to throw in the reshape2 package to melt the data:
library(ggplot2)
library(plyr)
library(reshape2)
test.data <- read.table(text="income expend id
9142.7 1576.2 1
23648.75 2595 2
9014.25 156 1
4670.4 604.4 3
6691.4 3654.4 3
14425.2 66 2
8563.45 1976.2 2
2392 6 1
7915.95 619.2 3
4424.2 504.2 2", header=TRUE)
qplot(data=melt(ddply(test.data, .(id), colwise(mean)), id.vars="id"), x=id, y=value, colour=variable)

Well, your question is not very precise because we don't know what you exactly want to do. But here is a guess :
d <- read.table(textConnection("income expend id
9142.7 1576.2 1
23648.75 2595 2
9014.25 156 1
4670.4 604.4 3
6691.4 3654.4 3
14425.2 66 2
8563.45 1976.2 2
2392 6 1
7915.95 619.2 3
4424.2 504.2 2"), header=TRUE)
library(reshape2)
library(ggplot2)
d2 <- melt(d, id.vars="id")
# fun= replaces fun.y=, which is deprecated since ggplot2 3.3.0
ggplot(data=d2, aes(x=id,y=value)) + stat_summary(fun="mean", geom="bar") + facet_grid(.~variable)
Will give :

Related

R Plot multiple lines with dates

I am trying to create a line plot in R. For each 'RuleID' in my data frame I want to plot the 'ErrorCount' at each 'ProcessorTimeStamp'
DQ_Counts= data.frame(RuleID=c(1,2,1,2),
ProcessorTimeStamp=as.Date(c('2016-08-04','2016-08-04','2016-08-08','2016-08-08')),
ErrorCount=c(6,8,3,4))
# RuleID ProcessorTimeStamp ErrorCount
# 1 1 2016-08-04 6
# 2 2 2016-08-04 8
# 3 1 2016-08-08 3
# 4 2 2016-08-08 4
This is a plot I found online that I would like the end result to resemble, although I am obviously not plotting trees. The code for that plot is here: Code for Tree Growth Plot, but I don't understand it well enough to make it work for me.
For my plot, 'ProcessorTimeStamp' would be my x and 'ErrorCount' would be my y. Each line would represent a different 'RuleID'.
One thing to note is that my 'ErrorCount' values range from 0 to over 3 million (this is why I need to report on them to get them fixed!).
Thanks in advance.
This is probably the easiest way to get a basic plot like the one above with your data
lattice::xyplot(ErrorCount~ProcessorTimeStamp, DQ_Counts,
groups=RuleID, auto.key=T, type="l")
Which returns
or you could use ggplot2
library(ggplot2)
ggplot(DQ_Counts, aes(ProcessorTimeStamp, ErrorCount, color=factor(RuleID))) + geom_line()
to get one coloured line per RuleID.
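If neither lattice nor ggplot2 is available, the same one-line-per-RuleID idea can be sketched with base graphics. This is a minimal sketch under that assumption; the helper name `by_rule` is hypothetical and not part of the answers above.

```r
DQ_Counts <- data.frame(
  RuleID = c(1, 2, 1, 2),
  ProcessorTimeStamp = as.Date(c("2016-08-04", "2016-08-04",
                                 "2016-08-08", "2016-08-08")),
  ErrorCount = c(6, 8, 3, 4)
)

# One data frame per RuleID
by_rule <- split(DQ_Counts, DQ_Counts$RuleID)

# Empty frame sized to the full data, then one line per rule
plot(DQ_Counts$ProcessorTimeStamp, DQ_Counts$ErrorCount, type = "n",
     xlab = "ProcessorTimeStamp", ylab = "ErrorCount")
for (k in seq_along(by_rule)) {
  d <- by_rule[[k]]
  d <- d[order(d$ProcessorTimeStamp), ]  # draw points in time order
  lines(d$ProcessorTimeStamp, d$ErrorCount, col = k)
}
legend("topright", legend = names(by_rule), col = seq_along(by_rule),
       lty = 1, title = "RuleID")
```

Setting up the empty plot from the whole data frame first means every line fits inside the axes, whatever the range of each individual rule.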

How to plot recurrences in R

How can I plot a recurrence in R?
Any solution with base plot, ggplot2, lattice, or a dedicated package is welcome.
For example:
Imagine I have these data:
mydata <- data.frame(t=1:10, Y=runif(10))
t Y
1 0.3744869
2 0.6314202
3 0.3900789
4 0.6896278
5 0.6894134
6 0.5549006
7 0.4296244
8 0.4527201
9 0.3064433
10 0.5783539
I could transform it like this:
mydata2 <- data.frame(t=c(NA,mydata$t),Y=c(NA,mydata$Y),Y2=c(mydata$Y, NA))
t Y Y2
NA NA 0.9103703
1 0.9103703 0.1426041
2 0.1426041 0.4150476
3 0.4150476 0.2109258
4 0.2109258 0.4287504
5 0.4287504 0.1326900
6 0.1326900 0.4600964
7 0.4600964 0.9429571
8 0.9429571 0.7619739
9 0.7619739 0.9329098
10 0.9329098 NA
(or similar methods, but I can have problems with missing data)
And plot it
plot(Y2~Y, data=mydata2)
I guess I must use some grouping function such as ave or apply. But it's not an elegant solution, and if I have more columns it can become difficult to generalize the transformation.
For example
mydata3 <- data.frame(x=sample(10,100, replace=T),t=1:100, Y=2*runif(100)+1)
For every x (or combination of values on other columns) I want to plot Y_{i+1} ~ Y_i, on the same plot.
Other tools, such as Mathematica have functions to plot sequences directly.
I've found a solution, though not a very beautiful one:
For this sample data.
mydata <- data.frame(x=sample(4,25, replace=T),t=1:25, Y=2*runif(25)+1)
newdata <- mydata[order(mydata$x, mydata$t), ]
newdata$prev <- ave(newdata$Y, newdata$x, FUN=function(x) c(NA,head(x,-1)))
plot(Y~prev, data=newdata)
In this example there may not be a row for every t value; you would first need to generate NAs for the missing ones. But it's just a quick solution. In my real data I have many observations for each t.
lag.plot can plot recurrence plots but not within each subgroup.
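To see what the `ave` call above produces, here is the same lag-within-group step on a small deterministic data frame. The data frame `toy` and its values are made up purely for illustration.

```r
# Hypothetical toy data: two groups (x), interleaved in time (t)
toy <- data.frame(x = c(1, 1, 2, 2, 1),
                  t = 1:5,
                  Y = c(10, 20, 30, 40, 50))

# Sort by group, then by time, so "previous" means previous within the group
toy <- toy[order(toy$x, toy$t), ]

# Within each group, shift Y down by one position; the first value has no predecessor
toy$prev <- ave(toy$Y, toy$x, FUN = function(y) c(NA, head(y, -1)))
print(toy)
```

Each group's first row gets NA because it has no previous observation, and no value leaks from one group into the next, which is the property a plain `c(NA, head(Y, -1))` over the whole column would not have.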

Creating stacked barplots in R using different variables

I am a novice R user, hence the question. I refer to the solution on creating stacked barplots from R programming: creating a stacked bar graph, with variable colors for each stacked bar.
My issue is slightly different. I have data with 4 columns. The last column is the sum of the first 3 columns. I want to plot a bar chart in which 1) the height of each bar is the summed total (i.e. the 4th column), and 2) each bar is split by the relative contributions of the three columns.
I was hoping someone could help.
Regards,
Bernard
If I understood correctly, this may do the trick.
The following code works for this example data frame df:
df <- read.table(text="a b c sum
1 9 8 18
3 6 2 11
1 5 4 10
23 4 5 32
5 12 3 20
2 24 1 27
1 2 4 7", header=TRUE)
As you don't want to plot a count of observations but the actual values in your data frame, you need to use geom_bar(stat="identity") in ggplot2. Some data manipulation is necessary too. You don't need the sum column, because ggplot stacks the three values and the bar height is their sum.
library(reshape2) #for melt()
library(ggplot2)
df <- df[,-ncol(df)] #drop the last column (assumed to be the sum)
df$event <- seq.int(nrow(df)) #row index, so values from the same original row share one bar
df <- melt(df, id='event') #reshape to long format so ggplot can read it
px = ggplot(df, aes(x = event, y = value, fill = variable)) + geom_bar(stat = "identity")
print (px)
This code generates a stacked bar chart with one bar per event.
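As a cross-check, base graphics can draw the same kind of stacked chart directly from the three component columns. This is a sketch of an alternative, not the method used above; the name `parts` is an assumption of the sketch.

```r
df <- read.table(text = "a b c sum
1 9 8 18
3 6 2 11
1 5 4 10
23 4 5 32
5 12 3 20
2 24 1 27
1 2 4 7", header = TRUE)

# Only the three components are needed; stacking them reproduces the sum
parts <- as.matrix(df[, c("a", "b", "c")])
stopifnot(all(rowSums(parts) == df$sum))

# barplot() stacks the rows of a matrix, one bar per column, hence the transpose
barplot(t(parts), names.arg = seq_len(nrow(df)), legend.text = TRUE,
        xlab = "event", ylab = "value")
```

The `stopifnot` line confirms that the total height of each stacked bar equals the fourth column, which is why that column can be dropped before plotting.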

R_Multiple plots on same figure using a for loop

I have 2 data frames, mydf1 and mydf2
> mydf1
id a b c
1 1 2 10 2
2 2 3 11 4
3 3 5 12 6
4 4 7 13 8
5 5 8 14 10
> mydf2
id a b c
1 1 4 20 4
2 2 6 22 8
3 3 10 24 12
4 4 14 26 16
5 5 16 28 20
I would like to plot variables a,b & c against id (sample graphs is given below). I want similar graphs for variables b and c too and I want to do it in a loop and then export it to a local folder. So, I am using the following code
for (i in 2:4) {
jpeg(paste("C:/Data/myplot",i,".jpg"))
ymin<-min(mydf1[,i],mydf2[,i])
ymax<-max(mydf1[,i],mydf2[,i])
plot(mydf1[,1],mydf1[,i],ylim=c(ymin,ymax),xlab="id",ylab=colnames(mydf)[i])
points(mydf2[,1],mydf2[,i],pch=2)
legend("topright",c("mydf1","mydf2"),pch=c(1,2))
dev.off()
}
My problem is that I would like to get all three graphs (id vs a, id vs b, and id vs c, each showing mydf1 and mydf2) in one figure, something like two along the first row and the third one in the second row, with a legend. I tried the following:
jpeg("C:/Data/myplot.jpg")
par(mfrow=c(2,2))
for (i in 2:4) {
ymin<-min(mydf1[,i],mydf2[,i])
ymax<-max(mydf1[,i],mydf2[,i])
plot(mydf1[,1],mydf1[,i],ylim=c(ymin,ymax),xlab="id",ylab=colnames(mydf)[i])
points(mydf2[,1],mydf2[,i],pch=2)
legend("topright",c("mydf1","mydf2"),pch=c(1,2))
dev.off()
}
But it didn't work. Any suggestion to do this?
p.s: This is the simplified version of my task. Actually I have hundreds of columns, that's why I am using a loop operation
Sample plot id vs a (mydf1 and mydf2) plotted on the same graph
It is unclear what you are trying to do. Do you want 2 plots, one for mydf1 and one for mydf2 or all on one figure? If two panels, you should change to mfrow=c(2,1) instead of c(2,2) which is currently making 4 panels?
If you want them all on a single plot, then remove the par(mfrow... line.
Then within the plots, you are plotting the first series from mydf1 and the other two series from mydf2. Is that actually what you want?
Using base graphics, you should move your plot line outside the loop so it is done once, change the loop to start at 3, and then keep the points statements inside the loop. Alternatively, you could put an if statement inside the loop to see if it is the first time.
You also have a typo in the plot statement with mydf (no number).
And move your dev.off() outside the loop so it only closes the figure once.
Here is some code that generates a single-panel plot, and you should be able to modify it to work for your desired output...
jpeg("myplot.jpg")
for (i in 2:4) {
ymin<-min(mydf1[,i],mydf2[,i])
ymax<-max(mydf1[,i],mydf2[,i])
if (i==2){
plot(mydf1[,1],mydf1[,i],ylim=c(ymin,ymax),xlab="id",ylab=colnames(mydf1)[i])
legend("topright",c("mydf1","mydf2"),pch=c(1,2))
}
else{
points(mydf2[,1],mydf2[,i],pch=2)
}
}
dev.off()
EDIT: After your clarified question, I think the only problem is that dev.off() should be outside the loop. (I recommend PNG or PDF instead of JPEG for any plot worth presenting....)
png("myplot.png")
par(mfrow=c(2,2))
for (i in 2:4) {
ymin<-min(mydf1[,i],mydf2[,i])
ymax<-max(mydf1[,i],mydf2[,i])
plot(mydf1[,1],mydf1[,i],ylim=c(ymin,ymax),xlab="id",ylab=colnames(mydf1)[i])
legend("topright",c("mydf1","mydf2"),pch=c(1,2))
points(mydf2[,1],mydf2[,i],pch=2)
}
dev.off()
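One further detail concerns the one-file-per-column loop in the question: `paste()` separates its arguments with a space by default, so the generated file names contain spaces. A small sketch of the difference follows; the variable names `spaced`, `tight` and `out` are illustrative only.

```r
i <- 2

# paste() inserts sep = " " between its arguments
spaced <- paste("myplot", i, ".jpg")

# paste0() concatenates without a separator
tight <- paste0("myplot", i, ".jpg")

# file.path() joins directory and file name with the path separator
out <- file.path(tempdir(), tight)

print(spaced)  # "myplot 2 .jpg"
print(tight)   # "myplot2.jpg"
```

Using `paste0()` (or `sprintf("myplot%d.jpg", i)`) inside the loop gives file names without embedded spaces.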
I'd do something like
mydf1$g <- 1
mydf2$g <- 2
d3 <- rbind(mydf1, mydf2)
library(reshape2)
d3 <- melt(d3, id.vars = c('id', 'g'))
library(ggplot2)
ggplot(d3, aes(x=id, y=value)) +
geom_point(aes(colour = as.factor(g), shape = variable))
or using facets
ggplot(d3, aes(x=id, y=value)) +
geom_point(aes(colour = as.factor(g))) +
facet_wrap(~variable)
to finally export it
ggsave(filename = file.path(tempdir(), 'myplot.png'),
plot = last_plot()
)

Histograms in R with a "more" category, similar to MS Excel

Consider the following frequency data:
> table(income)
income
3 5 6 7 8 5000
2 7 2 2 2 1
When I type >hist(income) I get the following histogram
As you can see, most income values are concentrated around 5, and the one value far from the others makes the histogram hard to read. MS Excel can treat the 5000 value as a separate category, so the data would look like this instead:
> table(income)
income
3 5 6 7 8 more
2 7 2 2 2 1
So plotting this as a histogram would look much better, so you can see the frequency within a shorter range:
Is there any way to do this, either with the hist() function or with other functions from lattice or ggplot2? However, I don't want to overwrite the values that exceed a certain threshold, so that I don't lose any information.
Thanks a lot!
Data generation:
income <- c(rep(3,2), rep(5,7), rep(6,2), rep(7,2), rep(8,2), 5000)
Function for preparing data for plotting:
nice.data <- function(x, threshold=10){
x[x>threshold] <- "More"
x
}
Plotting:
library(ggplot2)
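# nice.data() returns character values; geom_bar counts a discrete x, whereas geom_histogram needs a continuous one in current ggplot2
ggplot() + geom_bar(aes(x=nice.data(income))) + xlab("Income")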
Result: a bar chart with one bar per income value plus a final "More" bar.
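A base R variant of the same idea is sketched below. It leaves `income` itself untouched, as the question asks, and uses an ordered factor so that the "more" category always comes last regardless of how the labels would sort as text. The names `threshold`, `labels`, `lvls`, `binned` and `counts` are assumptions of this sketch.

```r
income <- c(rep(3, 2), rep(5, 7), rep(6, 2), rep(7, 2), rep(8, 2), 5000)
threshold <- 10

# Build display labels in a separate vector; income keeps its original values
labels <- ifelse(income > threshold, "more", as.character(income))

# Levels in numeric order, with "more" placed last
lvls <- c(sort(unique(income[income <= threshold])), "more")
binned <- factor(labels, levels = lvls)

counts <- table(binned)
print(counts)
barplot(counts, xlab = "Income", ylab = "Frequency")
```

Fixing the level order explicitly matters once values such as 9 and 10 both appear, because sorted as character strings "10" would come before "9".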
