ggplot2 labeling scatter points in R - r

I'm trying to label my scatter points in R. This is my first plot and it is very straightforward, but I can't figure out how to add text. I've looked at some of the other posts here and they partially make sense, but I just don't understand the lingo yet.
stats <- read.csv(file.choose())
qplot(data=stats, x=Avg.of.FD.Points, y=Avg.FD.Dev)
text(x, y, label=Home.Skater)
Home.Skater | Avg.of.FD.Points | Avg.FD.Dev
A.J. Greer | 4.27 | 2.84
Aaron Ekblad | 12.40 | 6.22
Aaron Ness | 5.60 | 4.00

Here is a simple scatterplot example with geom_text based on your sample data.
df <- read.table(text =
"Home.Skater Avg.FD.PTS Avg.FD.Dev
A.J._Greer 4.27 2.84
Aaron_Ekblad 12.40 6.22
Aaron_Ness 5.60 4.00", header = TRUE)
library(ggplot2)
# Map the player name to the label aesthetic so geom_text() can draw it
ggplot(df, aes(x = Avg.FD.PTS, y = Avg.FD.Dev, label = Home.Skater)) +
  geom_point() +
  geom_text(hjust = 0, nudge_x = 0.05) +
  xlim(0, 15)
To avoid clutter when there are many labels, you may want to consider the R package ggrepel.
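A minimal sketch of the ggrepel variant, assuming the package is installed and reusing the three sample rows from the question (the column names follow the answer's data frame). `geom_text_repel()` takes the place of `geom_text()` and moves labels away from the points and from each other:

```r
library(ggplot2)
library(ggrepel)  # assumed to be installed: install.packages("ggrepel")

# Sample rows from the question
df <- data.frame(
  Home.Skater = c("A.J. Greer", "Aaron Ekblad", "Aaron Ness"),
  Avg.FD.PTS  = c(4.27, 12.40, 5.60),
  Avg.FD.Dev  = c(2.84, 6.22, 4.00)
)

# Same mapping as before; only the text layer changes
p <- ggplot(df, aes(x = Avg.FD.PTS, y = Avg.FD.Dev, label = Home.Skater)) +
  geom_point() +
  geom_text_repel()
p
```

With only three points the result looks much like the `geom_text()` version; the difference shows up once labels start to overlap.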

Related

Extra Surface problem on the graph of R colored based on positive and negative value

I have colored a graph with ggplot2 based on a threshold value of 1. Surface scores greater than 1
are colored azure and surface scores less than 1 are colored beige. Here is my sample code.
library(ggplot2)
setwd("F:/SUST_mutation/Graph_input")
d <- read.csv(file = "N.csv", sep = ",", header = TRUE)
ggplot(d, aes(x= Position,y= wild_Score)) + xlab("Positions") + ylab("Scores") +
geom_ribbon(aes(ymin=pmin(wild_Score,1), ymax=1), fill="beige", alpha= 1.5) +
geom_ribbon(aes(ymin=1, ymax=pmax(wild_Score,1)), fill="azure", alpha= 1.5)
My problem is that where the curve passes from the upper surface to the lower surface, I expect the two surfaces to share one continuous boundary line.
But as the figure shows, they do not. Around the threshold line, the lower surface does not meet the upper surface; instead, some extra surface appears. For convenience, I have marked these portions with red circles.
extra surface on the negative portion close to threshold:
Position Wild_Score
4 1.048
5 1.052
6 1.016
7 0.996
8 0.97
9 0.951
10 0.971
11 1.047
12 1.036
13 1.051
14 1.124
15 1.172
16 1.172
17 1.164
18 1.145
19 1.186
20 1.197
21 1.197
22 1.216
23 1.193
24 1.216
25 1.216
26 1.262
Problem-2:
I have a data frame like following.
Position Score_1 Score_2
4 1.048 1.048
5 1.052 1.052
6 1.016 1.016
7 0.996 1.433
8 0.97 1.432
9 0.951 1.567
10 0.971 1.231
11 1.047 1.055
12 1.036 1.036
13 1.051 1.051
14 1.124 1.124
15 1.172 1.172
16 1.172 1.172
17 1.164 1.164
I plot the surface for Position vs Score_1 using a tibble, and I want to draw a line graph on that surface for the same positions vs Score_2, like the following:
desired graph
Because the line differs only at some points, I subsetted the main dataset (both columns and rows).
I get the following error:
"Error: Aesthetics must be either length 1 or the same as the data (13): x" I guess this is because I used two different data frames for the graphs.
Here is my code:
d <- read.csv(file = "E.csv", sep = ",", header = TRUE)
d1 <- tibble::tibble(
x = seq(min(d$Position), max(d$Position), length.out = 1000),
y = approx(d$Position, d$Score_1, xout = x)$y
)
ggplot(d1, aes(x= x,y= y)) + xlab("Positions") + ylab("Scores") +
geom_ribbon(aes(ymin=pmin(y,1), ymax=1), fill="red", alpha= 1.5) +
geom_ribbon(aes(ymin=1, ymax=pmax(y,1)), fill="blue", alpha= 1.5) +
geom_line(aes(y=1)) + geom_line(d = d[c(3:10), c(1,3)],aes(y =
Score_2), color = "blue", size = 1)
I want to know what is causing the problem and how should I deal with it?
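On Problem-2, a likely cause (my reading of the error, not something confirmed in the thread) is that the extra `geom_line()` layer inherits `x = x` from `ggplot()`, while the subsetted data frame has no column named `x`. A sketch that gives the second layer its own data and its own `x` mapping, with the table above typed in by hand and `line_df` as an illustrative name:

```r
library(ggplot2)

# Data from the Problem-2 table
d <- data.frame(
  Position = 4:17,
  Score_1 = c(1.048, 1.052, 1.016, 0.996, 0.97, 0.951, 0.971,
              1.047, 1.036, 1.051, 1.124, 1.172, 1.172, 1.164),
  Score_2 = c(1.048, 1.052, 1.016, 1.433, 1.432, 1.567, 1.231,
              1.055, 1.036, 1.051, 1.124, 1.172, 1.172, 1.164)
)

# Interpolated surface, as in the question
d1 <- data.frame(x = seq(min(d$Position), max(d$Position), length.out = 1000))
d1$y <- approx(d$Position, d$Score_1, xout = d1$x)$y

# Rows and columns needed for the extra line
line_df <- d[3:10, c("Position", "Score_2")]

p <- ggplot(d1, aes(x = x, y = y)) +
  xlab("Positions") + ylab("Scores") +
  geom_ribbon(aes(ymin = pmin(y, 1), ymax = 1), fill = "red") +
  geom_ribbon(aes(ymin = 1, ymax = pmax(y, 1)), fill = "blue") +
  geom_hline(yintercept = 1) +
  # This layer names its own x and y columns instead of inheriting x = x
  geom_line(data = line_df, aes(x = Position, y = Score_2), colour = "blue")
p
```

Using two data frames in one plot is fine as long as every layer can find the columns its aesthetics refer to.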
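It's because the negative surface between, for example, rows 3 and 4 starts from 1 and goes to 0.996, instead of going from 1.016 to 0.996. Each ribbon is clamped to the threshold only at the observed positions, so its edge cannot follow the line between two observations that lie on opposite sides of the threshold. Relevant discussion and other examples at ggplot2's issue tracker.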
This problem is typically only visible if the number of observations is small-ish, so the typical way people overcome this problem is to interpolate the data. You can find an example of that below (I've omitted your colours because it was hard to see):
library(ggplot2)
# txt <- "your_example_table" # Omitted for brevity
df <- read.table(text = txt, sep = "\t", header = TRUE)
data2 <- tibble::tibble(
x = seq(min(df$Position), max(df$Position), length.out = 1000),
y = approx(df$Position, df$Wild_Score, xout = x)$y
)
ggplot(data2, aes(x= x,y= y)) + xlab("Positions") + ylab("Scores") +
geom_ribbon(aes(ymin=pmin(y,1), ymax=1, fill = "A")) +
geom_ribbon(aes(ymin=1, ymax=pmax(y,1), fill = "B"))
This is great for hiding the problem, but calculating the exact line intersection points is a bit of a pain. I apologise for the self-promotion but I ran into this too and wrapped my solution for finding these line intersection points in a function on the dev version of my package ggh4x, which you might find useful.
library(ggh4x) # devtools::install_github("teunbrand/ggh4x")
ggplot(df, aes(x= Position,y= Wild_Score)) +
stat_difference(aes(ymin = 1, ymax = Wild_Score))
Created on 2021-08-15 by the reprex package (v1.0.0)
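If you'd rather not depend on an extra package, the crossing points can also be found by hand with linear interpolation between neighbouring observations. This is only a sketch of that idea, not how ggh4x does it internally; `crossings()` and `filled` are illustrative names, and the data are typed in from the question:

```r
# Data from the question
df <- data.frame(
  Position = 4:26,
  Wild_Score = c(1.048, 1.052, 1.016, 0.996, 0.97, 0.951, 0.971, 1.047,
                 1.036, 1.051, 1.124, 1.172, 1.172, 1.164, 1.145, 1.186,
                 1.197, 1.197, 1.216, 1.193, 1.216, 1.216, 1.262)
)
threshold <- 1

# x positions where the piecewise-linear curve crosses the threshold
crossings <- function(x, y, threshold) {
  d <- y - threshold
  # Segments whose end points lie strictly on opposite sides of the threshold
  i <- which(d[-length(d)] * d[-1] < 0)
  # Linear interpolation within each such segment
  x[i] + (threshold - y[i]) * (x[i + 1] - x[i]) / (y[i + 1] - y[i])
}

cross_x <- crossings(df$Position, df$Wild_Score, threshold)

# Add the crossing points as extra rows so both ribbons meet on the threshold
filled <- rbind(df, data.frame(Position = cross_x, Wild_Score = threshold))
filled <- filled[order(filled$Position), ]
```

Plotting the two `geom_ribbon()` layers from `filled` instead of `df` should make the surfaces meet exactly on the threshold line, because every sign change now has an observation sitting on it.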

stat_density2d - What does the legend mean?

I have a map done in R with stat_density2d. This is the code:
ggplot(data, aes(x=Lon, y=Lat)) +
stat_density2d(aes(fill = ..level..), alpha=0.5, geom="polygon",show.legend=FALSE)+
geom_point(colour="red")+
geom_path(data=map.df,aes(x=long, y=lat, group=group), colour="grey50")+
scale_fill_gradientn(colours=rev(brewer.pal(7,"Spectral")))+
xlim(-10,+2.5) +
ylim(+47,+60) +
coord_fixed(1.7) +
theme_void()
And it produces this:
Great. It works. However, I do not know what the legend means. I did find this Wikipedia page:
https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
And the example they used (which contains red, orange and yellow) stated:
The coloured contours correspond to the smallest region which contains
the respective probability mass: red = 25%, orange + red = 50%, yellow
+ orange + red = 75%
However, using stat_density2d, I have 11 contours in my map. Does anyone know how stat_density2d works and what the legend means? Ideally I wanted to be able to state something like the red contour contains 25% of the plots etc.
I have read this: https://ggplot2.tidyverse.org/reference/geom_density_2d.html and I am still none the wiser.
Let's take the faithful example from ggplot2:
ggplot(faithful, aes(x = eruptions, y = waiting)) +
stat_density_2d(aes(fill = factor(stat(level))), geom = "polygon") +
geom_point() +
xlim(0.5, 6) +
ylim(40, 110)
(apologies in advance for not making this prettier)
The level is the height at which the 3D "mountains" were sliced. I don't know of a way (others might) to translate that to a percentage, but I do know how to get you said percentages.
If we look at that chart, level 0.002 contains the vast majority of the points (all but 2). Level 0.004 is actually 2 polygons, and they contain all but about a dozen of the points. If I'm getting the gist of what you're asking, that's what you want to know, except not the count but the percentage of points encompassed by polygons at a given level. That's straightforward to compute using the methodology from the various ggplot2 "stats" involved.
Note that while we're importing the tidyverse and sp packages we'll use some other functions fully-qualified. Now, let's reshape the faithful data a bit:
library(tidyverse)
library(sp)
xdf <- select(faithful, x = eruptions, y = waiting)
(easier to type x and y)
Now, we'll compute the two-dimensional kernel density estimation the way ggplot2 does:
h <- c(MASS::bandwidth.nrd(xdf$x), MASS::bandwidth.nrd(xdf$y))
dens <- MASS::kde2d(
xdf$x, xdf$y, h = h, n = 100,
lims = c(0.5, 6, 40, 110)
)
# Build the grid data frame first: `breaks` below depends on zdf$z
zdf <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
breaks <- pretty(range(zdf$z), 10)
z <- tapply(zdf$z, zdf[c("x", "y")], identity)
cl <- grDevices::contourLines(
x = sort(unique(dens$x)), y = sort(unique(dens$y)), z = dens$z,
levels = breaks
)
I won't clutter the answer with str() output but it's kinda fun looking at what happens there.
We can use spatial ops to figure out how many points fall within given polygons, then we can group the polygons at the same level to provide counts and percentages per-level:
SpatialPolygons(
lapply(1:length(cl), function(idx) {
Polygons(
srl = list(Polygon(
matrix(c(cl[[idx]]$x, cl[[idx]]$y), nrow=length(cl[[idx]]$x), byrow=FALSE)
)),
ID = idx
)
})
) -> cont
coordinates(xdf) <- ~x+y
tibble( # data_frame() is deprecated in current tibble releases
ct = sapply(over(cont, geometry(xdf), returnList = TRUE), length),
id = 1:length(ct),
lvl = sapply(cl, function(x) x$level)
) %>%
count(lvl, wt=ct) %>%
mutate(
pct = n/length(xdf),
pct_lab = sprintf("%s of the points fall within this level", scales::percent(pct))
)
## # A tibble: 12 x 4
## lvl n pct pct_lab
## <dbl> <int> <dbl> <chr>
## 1 0.002 270 0.993 99.3% of the points fall within this level
## 2 0.004 259 0.952 95.2% of the points fall within this level
## 3 0.006 249 0.915 91.5% of the points fall within this level
## 4 0.008 232 0.853 85.3% of the points fall within this level
## 5 0.01 206 0.757 75.7% of the points fall within this level
## 6 0.012 175 0.643 64.3% of the points fall within this level
## 7 0.014 145 0.533 53.3% of the points fall within this level
## 8 0.016 94 0.346 34.6% of the points fall within this level
## 9 0.018 81 0.298 29.8% of the points fall within this level
## 10 0.02 60 0.221 22.1% of the points fall within this level
## 11 0.022 43 0.158 15.8% of the points fall within this level
## 12 0.024 13 0.0478 4.8% of the points fall within this level
I only spelled it out to avoid blathering more but the percentages will change depending on how you modify the various parameters to the density computation (same holds true for my ggalt::geom_bkde2d() which uses a different estimator).
If there is a way to tease out the percentages without re-performing the calculations there's no better way to have that pointed out than by letting other SO R folks show how much more clever they are than the person writing this answer (hopefully in more diplomatic ways than seem to be the mode of late).
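For the kind of statement the question asks for ("the red contour contains 25%"), one alternative, which is not what the answer above does, is to go the other way round: pick the probabilities first and look up the density heights whose contours enclose that share of the estimated density. A rough sketch on the same `faithful` grid; `level_for_mass()` and `hdr_levels` are illustrative names, and the result describes mass under the kernel estimate, not the fraction of observed points:

```r
library(MASS)  # kde2d() and bandwidth.nrd()

x <- faithful$eruptions
y <- faithful$waiting

# Same estimator settings as in the answer above
h <- c(bandwidth.nrd(x), bandwidth.nrd(y))
dens <- kde2d(x, y, h = h, n = 100, lims = c(0.5, 6, 40, 110))

# Approximate probability mass per grid cell: density times cell area
cell_area <- diff(dens$x[1:2]) * diff(dens$y[1:2])
z_sorted  <- sort(as.vector(dens$z), decreasing = TRUE)
cum_mass  <- cumsum(z_sorted) * cell_area
cum_mass  <- cum_mass / max(cum_mass)  # renormalise, since the grid cuts off the tails

# Density height whose contour encloses a share p of the estimated mass
level_for_mass <- function(p) z_sorted[which(cum_mass >= p)[1]]

probs <- c(0.25, 0.50, 0.75)
hdr_levels <- sapply(probs, level_for_mass)
hdr_levels
```

These heights could then be used as contour levels, for example as the `levels` argument of `grDevices::contourLines()` in the code above, so that each polygon corresponds to a chosen probability rather than to an arbitrary "pretty" height.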

Group bar plot in R

I have the following data in R.
> dat
algo taxi.d taxi.s hanoi.d. hanoi.s ep
1 plain VI 7.81 9.67 32.92 38.12 140.33
2 model VI 12.00 46.67 53.17 356.68 229.89
3 our algorithm 6.66 6.86 11.71 21.96 213.27
I have made a graph of this in Excel, now I want something similar in R. Please note that the vertical scale is logarithmic, with powers of 2.
What R commands do I need to use to have this?
Sorry if this is a very easy question, I am a complete novice to R.
The reshape2 and ggplot2 packages should help accomplish what you want:
dat = read.table(header=TRUE, text=
"algo taxi.d taxi.s hanoi.d hanoi.s ep
1 'plain VI' 7.81 9.67 32.92 38.12 140.33
2 'model VI' 12.00 46.67 53.17 356.68 229.89
3 'our algorithm' 6.66 6.86 11.71 21.96 213.27")
install.packages("reshape2") # only run the first time
install.packages("ggplot2") # only run the first time
library(reshape2)
library(ggplot2)
# convert the data into a more graph-friendly format
data2 = melt(dat, id.vars='algo', value.name='performance', variable.name='benchmark')
# graph data + bar chart + log scale
ggplot(data2) +
geom_bar(aes(x = benchmark, y = performance, fill = algo), stat='identity', position='dodge') +
scale_y_log10()
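The question asks for a vertical scale in powers of 2, whereas `scale_y_log10()` gives powers of 10. A sketch of a base-2 variant, with the table typed in by hand and reshaped in base R instead of with `melt()`. The `trans` argument is called `transform` in newer ggplot2 releases, so adjust it to your installed version:

```r
library(ggplot2)

dat <- data.frame(
  algo    = c("plain VI", "model VI", "our algorithm"),
  taxi.d  = c(7.81, 12.00, 6.66),
  taxi.s  = c(9.67, 46.67, 6.86),
  hanoi.d = c(32.92, 53.17, 11.71),
  hanoi.s = c(38.12, 356.68, 21.96),
  ep      = c(140.33, 229.89, 213.27)
)

# Long format: one row per (algo, benchmark) pair
long <- data.frame(
  algo        = rep(dat$algo, times = ncol(dat) - 1),
  benchmark   = rep(names(dat)[-1], each = nrow(dat)),
  performance = unlist(dat[-1], use.names = FALSE)
)

# Base-2 log axis with breaks at 1, 2, 4, ..., 512
p <- ggplot(long, aes(x = benchmark, y = performance, fill = algo)) +
  geom_col(position = "dodge") +
  scale_y_continuous(trans = "log2", breaks = 2^(0:9))
p
```

On a log scale the bars start at 1 (that is, 2^0), which works here because every value is above 1.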
I hope this base R code helps you with your plot:
dat <- matrix(c(
c(0.25,0.25,0.25,0.25),
c(0.05,0,0.95,0),
c(0.4,0.1,0.1,0.4)),
nrow=4,ncol=3,byrow=FALSE,
dimnames=list(c("A","C","G","T"),
c("E","S","I"))
)
barplot(dat,border=FALSE,beside=TRUE,
col=rainbow(4),ylim=c(0,1),
legend=rownames(dat),main="Plot name",
xlab="State",ylab="observation")
grid()
box()

Dashed linetype in legend of ggplot in R

I'm new here, but I hope that you can help me. I googled my problem but couldn't solve it.
I have a data frame containing lots of data which I want to plot with ggplot in R. Everything works very well, but the legend drives me crazy. The linetypes in the legend are always solid instead of what I defined.
I'm loading a CSV file, then making subsets with loops and summarizing the subsets with SummarySE().
A subset is looking like this:
ExperimentCombinations LB TargetPosition N C_measured sd se ci
1 HS 0.10 Foveal 10 0.11007970 0.04114193 0.013010221 0.02943116
2 HS 0.21 Foveal 10 0.09821870 0.04838134 0.015299523 0.03460992
3 HS 0.30 Foveal 9 0.07911856 0.04037776 0.013459252 0.03103709
4 HS 1.00 Foveal 11 0.06657355 0.02688821 0.008107099 0.01806374
5 LED 0.10 Foveal 8 0.12569725 0.03607487 0.012754393 0.03015935
6 LED 0.21 Foveal 10 0.08797370 0.02091996 0.006615472 0.01496524
7 LED 0.30 Foveal 10 0.07358290 0.03002596 0.009495042 0.02147928
8 LED 1.00 Foveal 8 0.06630350 0.01894423 0.006697796 0.01583777
In this case TargetPosition has the levels Foveal or Peripheral.
The ggplot code I'm using is (looks awful because I was trying to solve my problem...):
ColourFoveal <- c("#FFCC33","#00CCFF")
ColourPeripheral <- c("#FFCC33","#00CCFF")
#ColourPeripheral <- c("#FF9900","#0066FF")
PointType <- c(20,20)
PointTypeSweepUp <- c(24,24)
PointTypeSweepDown <- c(25,25)
ColourHSFillFoveal <- c("#FFCC33", "#FFCC33")
ColourHSFillPeripheral <- c("#FF9900", "#FF9900")
#LineTypeFoveal <- c("solid", "solid")
LineTypePeripheral <- c("dashed","dashed")
xbreaks <- c(0.1,0.21,0.3,1.0)
plotsYmax <- 0.2
if(field=="Peripheral"){
lineType<-sprintf("dashed")
lineColour<-ColourPeripheral
}else{
lineType<-"solid"
lineColour<-ColourFoveal
}
ggplot(df, aes(x=LB, y=C_measured, shape=ExperimentCombinations, colour=ExperimentCombinations)) +
geom_errorbar(aes(ymin=C_measured-se, ymax=C_measured+se), width=.1) +
geom_line(linetype=lineType) +
geom_point() +
ggtitle(paste(targetDataFrame$AgeGroup, targetDataFrame$TargetPosition)) +
scale_colour_manual(name="", values=lineColour)+
scale_shape_manual(name="", values=PointType)+
scale_fill_manual(name="", values=lineColour)+
scale_x_continuous(breaks=xbreaks)+
coord_cartesian(ylim = c(0, plotsYmax+0.01))+
scale_y_continuous(breaks=c(0,0.05,0.1,0.15,0.2))+
theme(axis.line=element_line(colour="black"))+
theme(panel.grid=element_blank())+
theme_bw()+
theme(legend.key.width=unit(2,"line"))
}
The peripheral plots should have dashed lines, the foveal ones solid lines.
What I get is always like this: (As a new user I'm not allowed to post images!)
The lines are dashed and in the colours I want, and the points are right, too. But in the legend, the lines are solid instead of dashed. The colours and points are alright in the legend, too.
Could you help me to define the linetypes in the legend as dashed in the peripheral case?

Multiple data points in one R ggplot2 plot

I have two sets of data points that both relate to the same primary axis, but that differ in their secondary axis. Is there some way to plot them on top of each other in R using ggplot2?
What I am looking for is basically something that looks like this:
4+ |
| x . + 220
3+ . . |
| x |
2+ . + 210
| x |
1+ . x x |
| + 200
0+-+-+-+-+-+-+
time
. temperature
x car sale
(This is just an example of possible data)
Shane's answer, "you can't in ggplot2," is correct, if incomplete. Arguably, it's not something you want to do. How do you decide how to scale the Y axis? Do you want the means of the lines to be the same? The range? There's no principled way of doing it, and it's too easy to make the results look like anything you want them to look like. Instead, what you might want to do, especially in a time-series like that, is to norm the two lines of data so that at a particular value of t, often min(t), Y1 = Y2 = 100. Here's an example I pulled off of the Bonddad Blog (not using ggplot2, which is why it's ugly!) But you can cleanly tell the relative increase and decrease of the two lines, which have completely different underlying scales.
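As a sketch of the norming idea, here both series are rescaled so that each equals 100 at the first time point. The numbers are made up to resemble the question's drawing, and `index_to_100()` is just an illustrative helper:

```r
# Hypothetical data resembling the question's sketch
sales <- data.frame(
  time        = 1:6,
  temperature = c(1, 2, 3, 3.5, 3, 1),
  car_sales   = c(200, 205, 212, 218, 205, 203)
)

# Rescale a series so that its first value becomes 100
index_to_100 <- function(v) 100 * v / v[1]

indexed <- data.frame(
  time        = sales$time,
  temperature = index_to_100(sales$temperature),
  car_sales   = index_to_100(sales$car_sales)
)

# Both series now share one y axis: percent of the value at the first time point
matplot(indexed$time, indexed[, -1], type = "l", lty = 1:2, col = 1:2,
        xlab = "time", ylab = "index (first value = 100)")
legend("topleft", legend = names(indexed)[-1], lty = 1:2, col = 1:2)
```

This only makes sense for quantities where ratios are meaningful; for something like temperature in degrees Celsius you would probably want a different normalisation.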
I'm not an expert on this, but it's my understanding that this is possible with lattice, but not with ggplot2. See this learnr blog post for an example of a secondary axis plot. Also see Hadley's response to this question.
Here's an example of how to do it in lattice (from Gabor Grothendieck):
library(lattice)
library(grid) # needed for grid.text
# data
Lines.raw <- "Date Fo Co
6/27/2007 57.1 13.9
6/28/2007 57.7 14.3
6/29/2007 57.8 14.3
6/30/2007 57 13.9
7/1/2007 57.1 13.9
7/2/2007 57.2 14.0
7/3/2007 57.3 14.1
7/4/2007 57.6 14.2
7/5/2007 58 14.4
7/6/2007 58.1 14.5
7/7/2007 58.2 14.6
7/8/2007 58.4 14.7
7/9/2007 58.7 14.8
"
# in reality next stmt would be DF <- read.table("myfile.dat", header = TRUE)
DF <- read.table(textConnection(Lines.raw), header = TRUE)
DF$Date <- as.Date(DF$Date, "%m/%d/%Y")
par.settings <- list(
layout.widths = list(left.padding = 10, right.padding = 10),
layout.heights = list(bottom.padding = 10, top.padding = 10)
)
xyplot(Co ~ Date, DF, default.scales = list(y = list(relation = "free")),
ylab = "C", par.settings = par.settings)
trellis.focus("panel", 1, 1, clip.off = TRUE)
pr <- pretty(DF$Fo)
at <- 5/9 * (pr - 32)
panel.axis("right", at = at, lab = pr, outside = TRUE)
grid.text("F", x = 1.1, rot = 90) # right y axis label
trellis.unfocus()
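Note that ggplot2 releases from 2.2.0 onward do offer `sec_axis()`, for exactly this case where the second axis is a one-to-one transformation of the first. A sketch of the Celsius/Fahrenheit plot above in ggplot2, with the `Co` column typed in by hand; it still cannot show two unrelated scales, which is the point made in the answers above:

```r
library(ggplot2)

# Celsius readings from the lattice example, 2007-06-27 to 2007-07-09
DF <- data.frame(
  Date = as.Date("2007-06-27") + 0:12,
  Co   = c(13.9, 14.3, 14.3, 13.9, 13.9, 14.0, 14.1,
           14.2, 14.4, 14.5, 14.6, 14.7, 14.8)
)

# The right-hand axis is the left-hand axis converted to Fahrenheit
p <- ggplot(DF, aes(x = Date, y = Co)) +
  geom_line() +
  scale_y_continuous(
    name = "C",
    sec.axis = sec_axis(~ . * 9 / 5 + 32, name = "F")
  )
p
```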

Resources