Multiple data points in one R ggplot2 plot - r

I have two sets of data points that share the same primary axis but differ in their secondary axis. Is there some way to plot them on top of each other in R using ggplot2?
What I am looking for is basically something that looks like this:
4+ |
| x . + 220
3+ . . |
| x |
2+ . + 210
| x |
1+ . x x |
| + 200
0+-+-+-+-+-+-+
time
. temperature
x car sale
(This is just an example of possible data)

Shane's answer, "you can't in ggplot2," is correct, if incomplete. Arguably, it's not something you want to do. How do you decide how to scale the Y axis? Do you want the means of the lines to be the same? The range? There's no principled way of doing it, and it's too easy to make the results look like anything you want them to look like. Instead, what you might want to do, especially in a time series like that, is to normalize the two lines of data so that at a particular value of t, often min(t), Y1 = Y2 = 100. Here's an example I pulled off of the Bonddad Blog (not using ggplot2, which is why it's ugly!). You can still clearly tell the relative increase and decrease of the two lines, which have completely different underlying scales.
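The normalization described above can be sketched in base R. This is only an illustration: the data values and the helper name `index_to_100` are made up for the example, and the plot uses `matplot` rather than ggplot2.

```r
# Hypothetical example data: two series on very different scales
t <- 1:6
temperature <- c(1, 2, 3, 3, 4, 1)
car_sales   <- c(200, 205, 210, 220, 200, 200)

# Rebase a series so that its value at the earliest time point equals 100
# (index_to_100 is an illustrative helper, not part of any package)
index_to_100 <- function(y, t) {
  100 * y / y[which.min(t)]
}

temp_idx  <- index_to_100(temperature, t)
sales_idx <- index_to_100(car_sales, t)

# Both series now share one y axis and start at 100
matplot(t, cbind(temp_idx, sales_idx), type = "l", lty = 1,
        col = c("red", "blue"), xlab = "time", ylab = "index (first value = 100)")
legend("topleft", legend = c("temperature", "car sales"),
       col = c("red", "blue"), lty = 1)
```

After rebasing, a value of 150 means "50% above the starting value" for either series, so the lines can be compared directly.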

I'm not an expert on this, but it's my understanding that this is possible with lattice, but not with ggplot2. See this learnr blog post for an example of a secondary axis plot. Also see Hadley's response to this question.
Here's an example of how to do it in lattice (from Gabor Grothendieck):
library(lattice)
library(grid) # needed for grid.text
# data
Lines.raw <- "Date Fo Co
6/27/2007 57.1 13.9
6/28/2007 57.7 14.3
6/29/2007 57.8 14.3
6/30/2007 57 13.9
7/1/2007 57.1 13.9
7/2/2007 57.2 14.0
7/3/2007 57.3 14.1
7/4/2007 57.6 14.2
7/5/2007 58 14.4
7/6/2007 58.1 14.5
7/7/2007 58.2 14.6
7/8/2007 58.4 14.7
7/9/2007 58.7 14.8
"
# in reality next stmt would be DF <- read.table("myfile.dat", header = TRUE)
DF <- read.table(textConnection(Lines.raw), header = TRUE)
DF$Date <- as.Date(DF$Date, "%m/%d/%Y")
par.settings <- list(
layout.widths = list(left.padding = 10, right.padding = 10),
layout.heights = list(bottom.padding = 10, top.padding = 10)
)
xyplot(Co ~ Date, DF, default.scales = list(y = list(relation = "free")),
ylab = "C", par.settings = par.settings)
trellis.focus("panel", 1, 1, clip.off = TRUE)
pr <- pretty(DF$Fo)
at <- 5/9 * (pr - 32)
panel.axis("right", at = at, lab = pr, outside = TRUE)
grid.text("F", x = 1.1, rot = 90) # right y axis label
trellis.unfocus()

Related

R: interpolate a value from dataframe based on two inputs

I have a data frame that looks like this:
Teff logg M_div_H U B V R I J H K L Lprime M
1 2000 4.0 -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345 4.765 5.680
2 2000 4.5 -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352 4.772 5.687
3 2000 5.0 -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361 4.779 5.685
4 2000 5.5 -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355 4.753 5.638
5 2200 3.5 -0.1 -11.866 -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929 4.289 4.842
6 2200 4.5 -0.1 -11.845 -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947 4.305 4.939
...
Let's say I have two values:
input_Teff = 4.8529282904170595E+003
input_log_g = 1.9241934741026787E+000
Notice how every V value has a unique Teff, logg combination. From the input values, I would like to interpolate a value for V. Is there a way to do this in R?
Edit 1: Here is the link to the full data frame: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
Building on Ian Campbell's observation that you can consider your data as points on a two-dimensional plane, you can use spatial interpolation methods. The simplest approach is inverse-distance weighting, which you can implement like this:
library(data.table)
d <- fread("https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=1")
setnames(d,"#Teff","Teff")
First rescale the data as appropriate (not shown here, see Ian's answer)
library(gstat)
# fit model
idw <- gstat(id="V", formula = V~1, locations = ~Teff+logg, data=d, nmax=7, set=list(idp = .5))
# new "points" to predict to
newd <- data.frame(Teff=c(4100, 4852.928), logg=c(1.5, 1.9241934741026787))
p <- predict(idw, newd)
#[inverse distance weighted interpolation]
p$V.pred
#[1] -0.9818571 -0.3602857
For higher dimensions you could use fields::Tps (I think you can force that to be an exact method, that is, exactly honor the observations, by making each observation a node)
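To make the idea of inverse-distance weighting concrete, here is a minimal base R sketch of the weighting itself. It is not gstat's implementation: the function name `idw_predict`, its arguments, and the toy points are assumptions for illustration, and the inputs are assumed to be already rescaled to comparable units.

```r
# Illustrative inverse-distance-weighted prediction at a single point (x0, y0).
# x, y: coordinates of the known points; z: values at those points.
# power plays the role of the distance exponent; nmax limits the number of
# nearest neighbours used.
idw_predict <- function(x, y, z, x0, y0, power = 2, nmax = 7) {
  d <- sqrt((x - x0)^2 + (y - y0)^2)
  # If the target coincides with a known point, return that value exactly
  if (any(d == 0)) return(z[which(d == 0)[1]])
  nearest <- order(d)[seq_len(min(nmax, length(d)))]
  w <- 1 / d[nearest]^power
  sum(w * z[nearest]) / sum(w)
}

# Toy usage: the target lies midway between two points, so the weights are equal
idw_predict(x = c(0, 2), y = c(0, 0), z = c(0, 10), x0 = 1, y0 = 0)
```

Closer points get larger weights, so the prediction is pulled toward the values of the nearest neighbours.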
We can imagine that Teff and logg exist in a 2-dimensional plane. We can see that your input point exists in that same space:
library(tidyverse)
ggplot(data,aes(x = Teff, y = logg)) +
geom_point() +
geom_point(data = data.frame(Teff = 4.8529282904170595e3, logg = 1.9241934741026787),
color = "orange")
However, we can see the scale of Teff and logg are not the same. Simply taking log(Teff) gets us pretty close, but not quite. So we can rescale between 0 and 1 instead. We can create a custom rescale function. It will become clear why we can't use scales::rescale in a moment.
rescale = function(x,y){(x - min(y))/(max(y)-min(y))}
We can now rescale the data:
data %>%
mutate(Teff.scale = rescale(Teff,Teff),
logg.scale = rescale(logg,logg)) -> data
From here, we might use raster::pointDistance to calculate the distance from the input point to all of the scaled values:
raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),
data[,c("Teff.scale","logg.scale")],
lonlat = FALSE)
We can use which.min to find the row with the minimum distance:
data[which.min(raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),
data[,c("Teff.scale","logg.scale")],
lonlat = FALSE)),]
Teff logg M_div_H U B V R I J H K L Lprime M Teff.scale logg.scale
1: 4750 2 -0.1 -2.447 -1.438 -0.355 0.159 0.589 1.384 1.976 1.881 2.079 2.083 2.489 0.05729167 0.4631902
Here we can visualize the result:
ggplot(data,aes(x = Teff.scale, y = logg.scale)) +
geom_point() +
geom_point(data = data[which.min(raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),data[,c("Teff.scale","logg.scale")], FALSE)),],
color = "blue") +
geom_point(data = data.frame(Teff.scale = rescale(input_Teff,data$Teff),logg.scale = rescale(input_log_g,data$logg)),
color = "orange")
And access the appropriate value for V:
data[which.min(raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),data[,c("Teff.scale","logg.scale")], FALSE)),"V"]
V
1: -0.355
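The same nearest-point lookup can be sketched without the raster package. Note that this picks the closest grid row rather than interpolating between rows. The helper `nearest_row` and the small grid below are assumptions for illustration; only the row with Teff = 4750, logg = 2 and V = -0.355 comes from the output above.

```r
# Rescale x to [0, 1] using the range of a reference vector
rescale <- function(x, ref) (x - min(ref)) / (max(ref) - min(ref))

# Index of the row closest to (teff, logg) in rescaled coordinates
# (nearest_row is an illustrative helper)
nearest_row <- function(df, teff, logg) {
  dx <- rescale(df$Teff, df$Teff) - rescale(teff, df$Teff)
  dy <- rescale(df$logg, df$logg) - rescale(logg, df$logg)
  which.min(sqrt(dx^2 + dy^2))
}

# Toy grid: V values other than -0.355 are made up for the example
grid <- data.frame(Teff = c(4500, 4750, 4750, 5000),
                   logg = c(2.0, 2.0, 2.5, 2.0),
                   V    = c(-0.1, -0.355, -0.3, -0.5))

i <- nearest_row(grid, teff = 4852.928, logg = 1.9241934741026787)
grid[i, ]
```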
Data:
library(data.table)
data <- fread("https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=1")
setnames(data,"#Teff","Teff")
input_Teff = 4.8529282904170595E+003
input_log_g = 1.9241934741026787E+000

ggplot2 labeling scatter points in R

I'm trying to label my scatter points in R. This is my first plot, and it's very straightforward, but I can't figure out how to add text. I've looked at some of the other posts here and they partially make sense, but I just don't understand the lingo yet.
stats <- read.csv(file.choose())
qplot(data=stats, x=Avg.of.FD.Points, y=Avg.FD.Dev)
text(x, y, label=Home.Skater)
Home.Skater Avg.of.FD.Points Avg.FD.Dev
A.J. Greer | 4.27 | 2.84
Aaron Ekblad | 12.40 | 6.22
Aaron Ness | 5.60 | 4.00
Here is a simple scatterplot example with geom_text based on your sample data.
df <- read.table(text =
"Home.Skater Avg.FD.PTS Avg.FD.Dev
A.J._Greer 4.27 2.84
Aaron_Ekblad 12.40 6.22
Aaron_Ness 5.60 4.00", header = TRUE)
library(ggplot2)
ggplot(df, aes(x = Avg.FD.PTS, y = Avg.FD.Dev, label = Home.Skater)) +
  geom_point() +
  geom_text(hjust = 0, nudge_x = 0.05) +
  xlim(0, 15)
To avoid clutter when there are many labels, you may want to consider the R package ggrepel.

subset dataframe and plot all the subsets with a loop [R]

I'm working with a data frame with 8 useful variables. The idea of the code is to plot 4 variables (3 on the y axis and a common x axis). The data frame looks like this:
It has about 6500 rows.
I want to subset the data frame by the file column, and then plot LogP on the x axis and Temperature, RH and Ozone on the y axis.
I tried using subset inside the plot function but it didn't go well. I used this code to plot one of the original files, but I have no idea how to include the subset:
> plot(DataOzono$LogP, DataOzono$Temperature, axes= F,type="l",col="red", ylab = NULL, xlab = 'LogP',xaxt="n",yaxt="n" )
axis(2,ylim(c(min(DataOzono$Temperature),max(DataOzono$Temperature)), layout.widths(2)))
mtext(text = 'T',line = 2,side = 2)
par(new=TRUE)
plot(DataOzono$LogP, DataOzono$RH,type="l",col="blue",xaxt="n",yaxt="n",xlab="",ylab="")
axis(4)
mtext("RH",side=4,line=2)
par(new=TRUE)
plot(DataOzono$LogP, DataOzono$Ozone,type="l",col="green",xaxt="n",yaxt="n",xlab="",ylab="")
mtext("O3",side=5,line=3)
axis(2, line = 4)
any advice will be very helpful.
Here's how to plot the charts in a loop. In the example you gave, we only have one file number. However, it should create a chart for every number in the file column. On Windows, you can use savePlot to save to your drive. I simplified your example because I was getting errors.
DataOzono <- read.table(text="pressure height Temperature RH Ozone file LogP
753.6 2541 16.8 76 0 80131 0.3475673
748.0 2604 17.7 32 0 80131 0.347959
743.5 2656 15.9 38 0 80131 0.3482766
739.8 2697 15.4 39 0 80131 0.3485396
736.6 2734 15.0 41 0 80131 0.3487685
731.8 2790 14.5 42 0 80131 0.3491142", header=TRUE, stringsAsFactors=FALSE)
original_par <- par(no.readonly = TRUE) # save only the settable graphical parameters
par(mar=c(5.1, 8.1, 4.1, 3.1))
for (i in unique(DataOzono$file)){
  DataOzono_subset <- DataOzono[DataOzono$file==i,] # keep only rows for that file number
  plot(DataOzono_subset$LogP, DataOzono_subset$Temperature, axes= F,type="l",col="red", ylab = "", xlab = 'LogP',xaxt="n",yaxt="n" )
  axis(2,col="red",col.axis="red")
  mtext(text = 'T',line = 2,side = 2,col="red",col.lab="red")
  par(new=TRUE)
  plot(DataOzono_subset$LogP, DataOzono_subset$RH,type="l",col="blue",xaxt="n",yaxt="n",xlab="",ylab="")
  axis(4,col="blue",col.axis="blue")
  mtext("RH",side=4,line=2,col="blue",col.lab="blue" )
  par(new=TRUE)
  plot(DataOzono_subset$LogP, DataOzono_subset$Ozone,type="l",col="darkgreen",xaxt="n",yaxt="n",xlab="",ylab="")
  mtext("O3",side=2,line=6,col="darkgreen",col.lab="darkgreen")
  axis(2, line = 4,col="darkgreen",col.axis="darkgreen")
  savePlot(filename=paste0("c:/temp/",i,".png"),type="png")
}
par(original_par) # restore par to its initial value
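Since savePlot is tied to the Windows screen device, a more portable variant of the loop might open a file device for each subset instead. The sketch below is an assumption-laden illustration: the toy data (including the second file id), the temporary output directory, and the choice of pdf() are not from the original post, and only one variable is drawn to keep it short.

```r
# Toy data (hypothetical values; two file ids so the loop runs twice)
DataOzono <- data.frame(
  file        = rep(c(80131, 80132), each = 3),
  LogP        = c(0.3476, 0.3480, 0.3483, 0.3485, 0.3488, 0.3491),
  Temperature = c(16.8, 17.7, 15.9, 15.4, 15.0, 14.5)
)

out_dir   <- tempdir()                       # assumption: write to a temp folder
subsets   <- split(DataOzono, DataOzono$file)  # one data frame per file id
out_files <- character(0)

for (id in names(subsets)) {
  sub <- subsets[[id]]
  out <- file.path(out_dir, paste0(id, ".pdf"))
  pdf(out)                                   # open a file device for this subset
  plot(sub$LogP, sub$Temperature, type = "l", col = "red",
       xlab = "LogP", ylab = "T", main = paste("file", id))
  dev.off()                                  # close the device so the file is written
  out_files <- c(out_files, out)
}
```

split() does the subsetting in one step, and each iteration produces one file named after its file id.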

join axes in barplot

I would like to eliminate the gap between the x and y axes in barplot and extend the predicted line back to intersect the y axis, preferably in base R. Is this possible? Thank you for any advice or suggestions.
my.data <- read.table(text = '
band mid.point count
1 0.5 74
2 1.5 73
3 2.5 79
4 3.5 70
5 4.5 78
6 5.5 63
7 6.5 59
8 7.5 60
', header = TRUE)
my.data
x <- my.data$mid.point^2
my.model <- lm(count ~ x, data = my.data)
my.plot <- barplot(my.data$count, ylim=c(0,100), space=0, col=NA)
axis(1, at=my.plot+0.5, labels=my.data$band)
lines(predict(my.model, data.frame(x=x), type="resp"), col="black", lwd = 1.5)
EDIT November 26, 2014
I just realized the two plots are not the same (the plot in the original post and the plot in my answer below). Compare the two curved lines closely, particularly at the right-side of the plot. Clearly the two curved lines intersect the top of the 8th bar in different locations. However, I have not yet had time to figure out why the plots differ.
Here is one way to extrapolate the predicted line back to the y axis. I incorporate rawr's suggestion regarding eliminating the gap between the y axis and the x axis.
setwd('c:/users/markm/simple R programs/')
jpeg(filename = "barplot_and_line.jpeg")
my.data <- read.table(text = '
band mid.point count
1 0.5 74
2 1.5 73
3 2.5 79
4 3.5 70
5 4.5 78
6 5.5 63
7 6.5 59
8 7.5 60
', header = TRUE)
x <- my.data$mid.point^2
my.model <- lm(count ~ x, data = my.data)
z <- seq(0,8,0.01)
y <- my.model$coef[1] + my.model$coef[2] * z^2
barplot(my.data$count, ylim=c(0,100), space=0, col=NA, xaxs = 'i')
points(z, y, type='l', col=1)
dev.off()
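A likely explanation for the difference noted in the edit above: lines() called with only the predicted values plots them against the indices 1 to 8, whereas bars drawn with space=0 are centered at 0.5 to 7.5, so the curve in the question is shifted half a bar to the right. The sketch below draws the fitted curve against explicit x positions; writing the model as count ~ I(mid.point^2) is an assumption made here so that predict() can be used directly.

```r
my.data <- data.frame(band = 1:8,
                      mid.point = seq(0.5, 7.5, by = 1),
                      count = c(74, 73, 79, 70, 78, 63, 59, 60))

# Same quadratic-in-mid.point model, written so predict() accepts new mid.point values
my.model <- lm(count ~ I(mid.point^2), data = my.data)

# Evaluate the fitted curve on a fine grid from the y axis (0) to the last bar edge (8)
z <- seq(0, 8, by = 0.01)
y <- predict(my.model, newdata = data.frame(mid.point = z))

barplot(my.data$count, ylim = c(0, 100), space = 0, col = NA, xaxs = "i")
lines(z, y, lwd = 1.5)   # x positions given explicitly, so the curve lines up with the bars
```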

Find two densities' point of intersection in R

I have two densities that overlap as seen in the attached picture. I want to find out where the two lines meet. How would I go about doing that?
This is the code that produced the image:
... #reading in files etc.
pdf("test-plot.pdf")
d1 <- density(somedata)
d2 <- density(someotherdata)
plot(d1)
par(col="red")
lines(d2)
dev.off()
The original data is just two monodimensional vectors, so what I'm interested in is the intersection point of their densities.
I tried to use the solution shown in here, but unfortunately, it neither gives me a number nor even draws the lines correctly:
edit: I have found what I was looking for
# create and plot example data
set.seed(1)
plotrange <- c(-1,8)
d1 <- density(rchisq(1000, df=2), from=plotrange[1], to=plotrange[2])
d2 <- density(rchisq(1000, df=3)-1, from=plotrange[1], to=plotrange[2])
plot(d1)
lines(d2)
# look for points of intersection
poi <- which(diff(d1$y > d2$y) != 0)
# Mark those points with a circle:
points(x=d1$x[poi], y=d1$y[poi], col="red")
# or with lines:
abline(v=d1$x[poi], col="orange", lty=2)
abline(h=d1$y[poi], col="orange", lty=2)
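The indices in poi locate each crossing only to the nearest grid point. If a sharper estimate is wanted, one option is to interpolate the difference of the two densities and solve for its root within each grid interval that brackets a sign change. This is a sketch under the assumption that both densities were evaluated on the same grid (same from, to and n), as they are here; the names diff_fun and crossings are illustrative.

```r
set.seed(1)
plotrange <- c(-1, 8)
d1 <- density(rchisq(1000, df = 2), from = plotrange[1], to = plotrange[2])
d2 <- density(rchisq(1000, df = 3) - 1, from = plotrange[1], to = plotrange[2])

# Piecewise-linear interpolation of the difference between the two density curves
diff_fun <- approxfun(d1$x, d1$y - d2$y)

# Grid intervals [x[i], x[i + 1]] in which the difference changes sign
poi <- which(diff(d1$y > d2$y) != 0)

# Solve for the root inside each bracketing interval
crossings <- vapply(poi, function(i) {
  uniroot(diff_fun, lower = d1$x[i], upper = d1$x[i + 1])$root
}, numeric(1))

crossings
```

Crossings far out in the tails, where both densities are close to zero, are usually noise and may need to be filtered out by hand.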
intersect(x,y)
see this help file
For example: If your data are in the same data.frame df
intersect(df$col1, df$col2)
Here is a small example extending John's answer.
require(ggplot2)
require(reshape2)
set.seed(12)
df <- data.frame(x = round(rnorm(100, 20, 10),1), y = round((100/log(100:199)),1))
# Melt to long format (the structure shown below is that of the melted data)
mdf <- melt(df)
str(mdf)
# 'data.frame': 200 obs. of 2 variables:
# $ variable: Factor w/ 2 levels "x","y": 1 1 1 1 1 1 1 1 1 1 ...
# $ value : num 16.8 25.7 20.5 22 19 ...
# Plot
ggplot(mdf) +
geom_density(aes(x = value, color = variable))
# Find points that intersect
intersect(df$x, df$y)
# [1] 18.9 20.1 21.3 21.5 21.0 19.6 19.0 20.0 19.8
# To make the answer more complete, here is the source code of intersect.
function (x, y)
{
y <- as.vector(y)
unique(y[match(as.vector(x), y, 0L)])
}
<bytecode: 0x10285d400>
<environment: namespace:base>
# It's actually possible to use unique and match to produce the same output
unique(as.vector(df$y)[match(as.vector(df$x), df$y, 0L)])
# [1] 18.9 20.1 21.3 21.5 21.0 19.6 19.0 20.0 19.8
I'm sure your answers are correct, but here's what finally worked for me:
d1$x[abs(d1$y-d2$y) < 0.00001 & d1$x < 1000 & d1$x > 500]
(I really only needed to find one value, and as a total R newbie I found your answers difficult to understand, since I don't yet understand even the most basic R concepts. Thank you for your help, and sorry.)
