R: plot multiple lines in different colours from subset of database

I've created a database with six different countries and multiple GDP and inequality measures.
For starters, I want to plot the GDP growth of the countries in one plot. This works out perfectly fine:
plot(my_six_countries$Year, my_six_countries$GDP.growth.rate, main = "Development of GDP growth", xlab = "Year", ylab = "GDP growth", type = "l", col = 600)
However, I want the lines for the different countries to be displayed in different colours, not just colour 600. I've spent virtually the whole day on this super nooby problem and tried all sorts of things, from creating a colour vector and subsetting manually to playing with ggplot, but I'm really stuck.
Any idea how the lines could be displayed in different colours?
Thank you so much!

I just wanted to say that I ended up using a way less elegant method - but it worked.
Firstly, I subsetted my countries.
c1 <- subset(countries, Country == "c1")
c2 <- subset(countries, Country == "c2")
c3 <- subset(countries, Country == "c3")
Secondly, I plotted the lines one by one.
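# ylim spans all three countries so the lines added later are not clipped
plot(c1$Year, c1$GDP, type = "l", bty = "l", col = "brown", ylim = range(c1$GDP, c2$GDP, c3$GDP, na.rm = TRUE))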
lines(c2$Year, c2$GDP, col="cornflowerblue")
lines(c3$Year, c3$GDP, col="darkblue")
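The subset-and-plot steps above can also be written as a loop, so the number of countries does not have to be hard-coded. This is only a minimal sketch, not the poster's code: the `countries` data frame is simulated here, and the column names `Country`, `Year` and `GDP` are assumed to match the ones used above.

```r
# Simulated stand-in for the real data: three countries, ten years each (assumption)
set.seed(1)
countries <- data.frame(
  Country = rep(c("c1", "c2", "c3"), each = 10),
  Year    = rep(2001:2010, times = 3),
  GDP     = rnorm(30, mean = 2)
)

country_names <- unique(as.character(countries$Country))
line_cols <- c("brown", "cornflowerblue", "darkblue")[seq_along(country_names)]

# Empty plot whose axes cover all countries, not just the first one
plot(range(countries$Year), range(countries$GDP), type = "n",
     bty = "l", xlab = "Year", ylab = "GDP")

# One line per country, each in its own colour
for (i in seq_along(country_names)) {
  one <- countries[countries$Country == country_names[i], ]
  lines(one$Year, one$GDP, col = line_cols[i])
}

legend("topright", legend = country_names, col = line_cols, lty = 1, bty = "n")
```

Setting up the axes with `type = "n"` first means no single country determines the plotting range.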

Related

Display a specific value with boxplot

I make a histogram and a boxplot on different data. I want to display a specific value on these graphs (a point with a label); is it possible with boxplot and hist or is it better to use the ggplot2 package? I don't really know how to use it.
Thanks in advance for your help on this probably very simple question!
The first one represents the distribution of the share of couples without children in different localities and the second one the share of under 14 years old in the population of these localities.
I would like to indicate with a dot the value of a particular locality (present in the source data table)
hist(tableau.complet$proportion.coupleSenf,
     main = "ex.main",
     xlab = "ex.x",
     ylab = "ex.y",
     col = "brown")
boxplot(tableau.complet$part.moins.14,
        main = "ex..main",
        ylab = "ex.ylab",
        col = "brown",
        las = 1)
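With base graphics, one way to mark a single value is to add to each plot after it has been drawn. The sketch below assumes simulated data: `share` and the highlighted locality `my_value` are hypothetical stand-ins for a column of `tableau.complet` and one of its rows.

```r
# Simulated shares for 200 localities (assumption; replace with the real column)
set.seed(42)
share <- rnorm(200, mean = 25, sd = 5)
my_value <- share[17]          # the locality to highlight (hypothetical choice)

# Histogram: hist() returns the bin counts, which helps place the label
h <- hist(share, main = "ex.main", xlab = "ex.x", ylab = "ex.y", col = "brown")
abline(v = my_value, lwd = 2, lty = 2)
text(my_value, max(h$counts), labels = "Locality 17", pos = 4)

# Boxplot: a single box is drawn at x = 1, so the point goes at (1, value)
boxplot(share, main = "ex.main", ylab = "ex.ylab", col = "brown", las = 1)
points(1, my_value, pch = 19, col = "blue")
text(1, my_value, labels = "Locality 17", pos = 4)
```

On the histogram a vertical line is used rather than a dot, because the y axis shows counts and a single observation has no natural height there.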

change axis/scale for time series plot after forecast

I'm struggling with changing the x-axis (time) for my time series forecast plot. I have run many models and keep hitting the same issue. Below is the code for the model fit, the forecast and the plot for one of the models. First, here is my original time series. Note: I'm fitting my model on training data from 2008-2016 and testing it on test data for the 11 months in 2017.
Data Split.
sal.ts <- window(sal.ts.original, start=c(2008,1), end=c(2016,12))
sal.test <- window(sal.ts.original, start=c(2017,1))
Now, the model.
sal.hw.mul <- HoltWinters(sal.ts, seasonal = "mult")
sal.hw.mul
fc.hwm <- forecast(sal.hw.mul, h=11)
fc.hwm
plot(fc.hwm, xlim=c(2017,2017+11/12), main = "Forecast from Multiplicative HW", xlab = "Year", ylab = "Total Sales, $M")
lines(sal.test,col='red', lwd=2)
legend("topleft", c("Actual", "Predicted"), col = c(2,4), lty = 1)
Here's my forecast plot:
See that ugly 2017.0, 2017.2.... 2017.8? I want it to instead say 1,2,3,....11 for the 11 months of 2017.
Yes, I only want to plot my test data and forecast on it and not the whole series.
I am pretty sure my problem is around my use of the xlim argument. I am using xlim to plot just the months of 2017; if I don't use it, R plots the whole series from 2008-2017. I tried to play around with the axis function a lot by setting xaxt="n" in the plot command but still couldn't figure it out.
Let me know if you need more information from me. Any help will be appreciated.
Update, on someone's suggestion I tried to write a custom axis by setting xaxt = 'n' in my plot. Here's the change in code.
x <- seq(1,11,1)
fc.hwm <- forecast(sal.hw.mul, h=11)
fc.hwm
layout(1:1)
plot(fc.hwm, xaxt='n', xlim=c(2017,2017+11/12), main = "Forecast from Multiplicative HW", xlab = "Year", ylab = "Total Sales, $M")
axis(side=1, at= x, labels=c("1","2","3","4","5","6","7","8","9","10","11"))
lines(sal.test,col='red', lwd=2)
legend("topleft", c("Actual", "Predicted"), col = c(2,4), lty = 1)
As you can see, it gets me halfway there. I can remove the current axis labels, but I am not able to draw a new axis. The new code doesn't even give me an error, or else I would have tried to debug it. R accepts the code but doesn't produce the desired output.
Here's an idea. I'm not sure what the data look like, but I'm guessing that you have a Date type for the date variable, which would mean that your sequence of integers from 1 to 11 places the new labels outside the plot limits. Try using a Date sequence instead.
Change this:
x <- seq(1,11,1)
To something like this:
x <- seq.Date(as.Date("2017-01-01"), as.Date("2017-11-01"), "months")
I'm not sure how far into November your data go, so you might want to set that "to" Date in the sequence to December instead, so you can fully cover your November data points.
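If the plotted series is a `ts` object rather than a Date-indexed one (the calls to `window()` and `HoltWinters()` in the question suggest so), the x axis is in decimal years, and month m of 2017 sits at 2017 + (m - 1)/12. Under that assumption the tick positions can be computed directly. A minimal sketch, with simulated monthly data standing in for `sal.test`:

```r
# Simulated stand-in for the 11 test months of 2017 (assumption)
set.seed(1)
sal.test <- ts(runif(11, 90, 110), start = c(2017, 1), frequency = 12)

# Tick positions in the same units as the ts time axis (decimal years)
tick_at <- 2017 + (0:10) / 12

plot(sal.test, xaxt = "n", xlim = c(2017, 2017 + 11 / 12),
     xlab = "Month of 2017", ylab = "Total Sales, $M", col = "red", lwd = 2)
axis(side = 1, at = tick_at, labels = 1:11)
```

Since `xaxt = 'n'` already suppresses the axis in the question's forecast plot, the same `axis()` call can be added after it.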

R X-axis Date Labels using plot()

Using the plot() function in R, I'm trying to produce a scatterplot of points of the form (SaleDate,SalePrice) = (saldt,salpr) from a time-series, cross-section real estate sales dataset in dataframe format. My problem concerns labels for the X-axis. Just about any series of annual labels would be adequate, e.g. 1999,2000,...,2013 or 1999-01-01,...,2013-01-01. What I'm getting now, a single label, 2000, at what appears to be the proper location, won't work.
The following is my call to plot():
plot(r12rgr0$saldt, r12rgr0$salpr/1000, type="p", pch=20, col="blue", cex.axis=.75,
     xlim=c(as.Date("1999-01-01"),as.Date("2014-01-01")),
     ylim=c(100,650),
     main="Heritage Square Sales Prices $000s 1990-2014",xlab="Sale Date",ylab="$000s")
The xlim and ylim are called out to bound the date and price ranges of the data to be plotted; note prices are plotted as $000s. r12rgr0$saldt really is a date; str(r12rgr0$saldt) returns:
Date[1:4190], format: "1999-10-26" "2013-07-06" "2003-08-25" NA NA "2000-05-24" xx
I have reviewed several threads here concerning similar questions, and see that the solution probably lies with turning off the default X-axis behavior and using axis.Date, but i) At my current level of R skill, I'm not sure I'd be able to solve the problem, and ii) I wonder why the plotting defaults are producing these rather puzzling (to me, at least) results?
Addl Observations: The Y-axis labels are just fine 100, 200,..., 600. The general appearance of the scatterplot indicates the called-for date ranges are being observed and the relative positions of the plotted points are correct. Replacing xlim=... as above with xlim=c("1999-01-01","2014-01-01")
or
xlim=c(as.numeric(as.character("1999-01-01")),as.numeric(as.character("2014-01-01")))
or
xlim=c(as.POSIXct("1999-01-01", format="%Y-%m-%d"),as.POSIXct("2014-01-01", format="%Y-%m-%d"))
all result in error messages.
With plots it's very hard to reproduce results without sample data. Here's a sample I'll use
dd<-data.frame(
saldt=seq(as.Date("1999-01-01"), as.Date("2014-01-10"), by="6 mon"),
salpr = cumsum(rnorm(31))
)
A simple plot with
with(dd, plot(saldt, salpr))
produces only a few year marks on the x-axis.
If I wanted more control, I could use axis.Date as you alluded to
with(dd, plot(saldt, salpr, xaxt="n"))
axis.Date(1, at=seq(min(dd$saldt), max(dd$saldt), by="30 mon"), format="%m-%Y")
which gives
Note that xlim only zooms in on part of the plot. It is not directly connected to the axis labels, but the axis labels will adjust to provide a "pretty" range covering the data that is plotted. Doing just
xlim=c(as.Date("1999-01-01"),as.Date("2014-01-01"))
is the correct way to zoom the plot. No need for conversion to numeric or POSIXct.
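For the annual labels asked for in the question, the `axis.Date` approach above can be given one tick per year. This is a minimal sketch under assumptions: the data frame `dd` is simulated again, and the variable name `year_ticks` is made up for illustration.

```r
# Simulated sales dates and prices (assumption; stands in for r12rgr0)
set.seed(1)
dd <- data.frame(
  saldt = as.Date("1999-01-01") + sort(sample(0:5400, 200)),
  salpr = runif(200, 100, 650)
)

# One tick per year, on 1 January
year_ticks <- seq(as.Date("1999-01-01"), as.Date("2014-01-01"), by = "year")

plot(dd$saldt, dd$salpr, type = "p", pch = 20, col = "blue", xaxt = "n",
     xlim = range(year_ticks), ylim = c(100, 650),
     xlab = "Sale Date", ylab = "$000s")
axis.Date(1, at = year_ticks, format = "%Y", cex.axis = 0.75, las = 2)
```

`las = 2` turns the labels perpendicular to the axis, which keeps sixteen year labels from overlapping.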
If you are running a plot in real time and don't mind some warnings, you can just pass, e.g., format = "%Y-%m-%d" in the plot function. For instance:
plot(seq((Sys.Date()-9),Sys.Date(), 1), runif(10), xlab = "Date", ylab = "Random")
yields:
while:
plot(seq((Sys.Date()-9), Sys.Date(), 1), runif(10), format = "%Y-%m-%d", xlab = "Date", ylab = "Random")
yields:
with lots of warnings about format not being a graphical parameter.

R heat map: Ordering by value; label issues

I am looking to improve upon output I implemented in R based on Jeromy's answer here (thanks!). Mine is a 31x31 matrix with positive and negative values, and uses basically the same ggplot2 code:
library(ggplot2)
library(reshape)
z<-cor(insheet3,use="complete.obs",method="kendall")
zm<-melt(z)
ggplot(zm, aes(X1,X2, fill=value)) + geom_tile() +
scale_fill_gradient2(low = "blue", high = "dark violet")
I need to change three things:
1. Right now, the rows appear in reverse alphabetical order, which means no visible data trends. How can I influence the order of the rows and columns, such that either:
A. (Preferred:) The columns are ordered by correlation value (negative to positive or vice versa), as they are in the ellipse package output on that same page; or
B. The columns are manually ordered, so that I can group similar variables?
2. Along the bottom X-axis, my variable names are overlapping dramatically and are unreadable. They need to remain long (i.e., OrthoPhos, Ammonia, Residential...), so how can I rotate their labels 90 degrees?
3. Is there a way to remove the "X1" and "X2" labels along each axis?
Thank you!
Following what I'll call an extensive/religious R journey into correlation matrix possibilities, I wanted to share what I'm finally going to use. Also, thanks to the previous answerers; I've found that there are many "right" answers to this.
Since my reviewers insisted I include numbers and not just colors, and that I stay away from more "confusing" and "busy" output like correlogram, I finally found "image" and based my final output on this example. Thanks @Marcinthebox.
Also to appease StackOverflow, here is a link to the image, rather than the image itself.
Because some of these specifications took a while to figure out and were critical to the final output, here's my code, shortened as much as I could.
#Subsetting to only the vectors I want to see in the correlation, as ordered
insheet<-subset(insheet1,
select=c("Cond", "CL", "SO4", "TN", "TP", "OrthoPhos", "DO", ...., "Rural"))
#Defining "high" and "low" colors
library(colorspace)
mycolors<-diverge_hcl(8, h = c(8, 240), c = 80, l = c(50,100), power = 1)
#Correlating them into a matrix
sheet<-cor(insheet,use="complete.obs")
#Making it!
par(mar=c(5.5, 5.5, 2, 1)) #Widens margins for the axis labels; must be set before plotting
image(x=seq(dim(sheet)[2]), y=seq(dim(sheet)[2]), z=sheet, ann=FALSE,
      col=mycolors, xlab="x column", ylab="y column", xaxt='n', yaxt='n')
text(expand.grid(x=seq(dim(sheet)[2]), y=seq(dim(sheet)[2])),
     labels=round(c(sheet),2), cex=0.5)
#Axis labels come from the correlation matrix itself
axis(1, 1:dim(sheet)[2], colnames(sheet), las=2)
axis(2, 1:dim(sheet)[2], colnames(sheet), las=2)
I was also able to for-loop this to output multiple .wmf files, once errors were suppressed. Too bad I couldn't visualize significant p-values as well... another time. Thanks!
I assume that you mean "clustering" for point 1.?
For such tasks I prefer the heatmap.2() function from the gplots package, which offers various clustering options.
For point 2 and 3: The heatmap.2() function will also take care of the 90º rotation and the labels since it is using a data matrix as input instead of a data table.
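`heatmap.2()` requires the gplots package. The same idea (cluster the variables, then draw the reordered matrix with rotated labels) can be sketched in base R. This is an illustration under assumptions, not the method of either poster: the built-in `mtcars` dataset stands in for `insheet3`, and 1 minus the correlation is used as the clustering distance.

```r
# mtcars is a built-in dataset, used here only as a stand-in for insheet3
z <- cor(mtcars, use = "complete.obs", method = "kendall")

# Cluster variables so that strongly correlated ones end up next to each other
ord <- hclust(as.dist(1 - z))$order
z_ord <- z[ord, ord]

n <- ncol(z_ord)
op <- par(mar = c(6, 6, 2, 1))            # room for the rotated labels
image(x = seq_len(n), y = seq_len(n), z = z_ord,
      zlim = c(-1, 1), col = colorRampPalette(c("blue", "white", "red"))(11),
      xaxt = "n", yaxt = "n", xlab = "", ylab = "")
# las = 2 draws the labels perpendicular to each axis
axis(1, at = seq_len(n), labels = colnames(z_ord), las = 2)
axis(2, at = seq_len(n), labels = rownames(z_ord), las = 2)
par(op)
```

For a manual ordering instead, `ord` can be replaced by a character vector of column names in the desired order.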

How to extract coordinates to plot line segments connecting legend keys in ggplot2?

I've long puzzled over a concise way to communicate the significance of an interaction between numeric and categorical variables in a line plot (response on the Y-axis, numeric predictor variable on the X-axis, and each level of the categorical variable a line of a different color or pattern plotted on those axes). I finally came up with the idea of drawing the traditional "brackets and p-values" connecting legend keys instead of lines of data.
Here is a mockup of what I mean:
library(ggplot2);
mydat <- do.call(rbind,lapply(1:3,function(ii) data.frame(
y=seq(0,10)*c(.695,.78,1.39)[ii]+c(.322,.663,.847)[ii],
a=factor(ii-1),b=0:10)));
myplot <- ggplot(data=mydat,aes(x=b,y=y,colour=a,group=a)) +
geom_line()+theme(legend.position=c(.1,.9));
# Plotting with p-value bracket:
myplot +
# The three line segments making up the bracket
geom_segment(x=1.2,xend=1.2,y=13.8,yend=13) +
geom_segment(x=1.1,xend=1.2,y=13,yend=13) +
geom_segment(x=1.1,xend=1.2,y=13.8,yend=13.8) +
# The text accompanying the bracket.
geom_text(label='p < 0.001',x=2,y=13.4);
This is less cluttered than trying to plot brackets someplace on the line-plot itself.
The problem is that the x and y values for the geom_segments and geom_text were obtained by trial and error and for another dataset these coordinates would be completely wrong. That's a problem if I'm trying to write a function whose purpose is to automate the process of pulling these contrasts out of models and plotting them (kind of like the effects package, but with more flexibility about how to represent the data).
My question is: is there a way to somehow pull the actual coordinates of each box comprising the legend and convert them to the scale used by geom_segment and geom_text, or manually specify the coordinates of each box when creating the myplot object, or reliably predict where the individual boxes will be and convert them to the plot's scale given that myplot$theme$legend.position returns 0.1 0.9?
I'd like to do this within ggplot2, because it's robust, elegant, and perfect for all the other things I want to do with my script. I'm open to using additional packages that extend ggplot2 and I'm also open to other approaches to visually indicating significance level on line-plots. However, suggestions that amount to "you shouldn't even do that" are not constructive-- because whether or not I personally agree with you, my collaborators and their editors don't read Stackoverflow (unfortunately).
Update:
This question kind of simplifies to: if the myplot$theme$legend.key.height is in lines and myplot$theme$legend.position seems to be roughly in fractions of the overall plot area (but not exactly) how can I convert these to the units in which the x and y axes are delineated, or alternatively, convert the x and y axis scales to the units of legend.key.height and legend.position?
I don't know the answer to your question as posed. But another approach, less fancy but definitely quick to do, is to rename the levels so that the level names include significance codes. In your first example, you could use
levels(mydat$a) <- list("0" = "0", "1 *" = "1", "2 *" = "2")
And then the legend will reflect this:
With more levels and combos of significance, you could probably work out a set of symbols. Then mention in your figure legend the p level reflected in each set of symbols.
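As a small check of how the relabelling behaves, the list form of `levels<-` takes the new labels as names and the old levels as values. The sketch below uses a toy factor `a` with the same three levels as `mydat$a`; which levels get a star is a made-up assumption.

```r
# A factor with the same three levels as mydat$a in the question
a <- factor(rep(0:2, each = 2))

# Names are the new labels, values are the old levels;
# starring levels 1 and 2 is a hypothetical significance result
levels(a) <- list("0" = "0", "1 *" = "1", "2 *" = "2")

levels(a)
table(a)
```

Because only the labels change, the underlying grouping of the data is unaffected, and the legend picks up the new names.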
This might be a related way to convey the information: The figure below is produced by rxnNorm in HandyStuff here. Unfortunately, this is another non-answer as I have not been able to make this work with the new version of ggplot2. Hopefully I can figure it out soon.
My answer is not using ggplot2, but the lattice package. I think dotplot is what I would use if I want to compare a continuous variable versus categorical variables.
Here I use dotplot in 2 manners, one where I reproduce your plot, and another where each level of a gets its own panel.
library(lattice)
library(latticeExtra) ## to get ggplot2 theme
#y versus levels of B, in different panel of A
p1 <- dotplot(b~y|a ,
              data = mydat,
              groups = a,
              type = c("p", "h"),
              main = "interaction between numeric and categorical variables ",
              xlab = "continuous value",
              par.settings = ggplot2like())
#y versus levels of B , grouped by a(color and line are defined by a)
p2 <- dotplot(b~y, groups= a ,
              data = mydat,
              type = c("l"),
              main = "interaction between numeric and categorical variables ",
              xlab = "continuous value",
              par.settings = ggplot2like())
library(gridExtra) ## to arrange many grid plots
grid.arrange(p1,p2)
