Labeling extrema with stat_peaks/stat_valleys produces duplicate labels - r

I extracted some longitudinal temperature data from a .nc weather dataset (ncdf4 package) and would like to label the local extrema with their dates from the x-axis, using ggplot2 and its extension ggpmisc, which provides stat_peaks/stat_valleys. Oddly, all the labels read the same: "Dec 1969".
I figured the most likely culprit was that my data used for the x-axis was not formatted correctly as Date, but the x-axis displays correctly and I have checked the class of the input data to confirm. I also tried applying group=1 which resulted in no change -- I admit I am new to R and ggplot2 (more familiar with Python/Pandas) and do not completely understand what group=1 does, though it was necessary to get the line to display correctly. Perhaps this is the result of a bug?
ggplot(df_denver, aes(x=Date, y=Temp..C., group=1)) +
geom_line() +
scale_x_date(date_labels="%b %Y", date_breaks = "10 years", expand=c(0,0)) +
stat_peaks(span=24, ignore_threshold = 0.80, color="red") +
stat_peaks(geom="text", span=24, ignore_threshold = 0.80, x.label.fmt = "%b %Y", color="red", angle=90, hjust=-0.1) +
stat_valleys(span=24, ignore_threshold = 0.55, color="blue") +
stat_valleys(geom="text", span=24, ignore_threshold = 0.55, x.label.fmt = "%b %Y", color="blue", angle=90, hjust=1.1) +
labs(x="Date", y="Temp (C)", title="Monthly Air Surface Temp for Denver from 1880 on")
Here are the first 100 rows of my dataset that produce 3 peaks and 3 valleys to illustrate:
Date Temp..C.
1 1880-01-01 2.91287017
2 1880-02-01 -2.73586297
3 1880-03-01 -2.04185677
4 1880-04-01 0.37948364
5 1880-05-01 0.78548384
6 1880-06-01 0.44176754
7 1880-07-01 -1.06966007
8 1880-08-01 -0.53162575
9 1880-09-01 -0.29665694
10 1880-10-01 -2.08401608
11 1880-11-01 -9.46955109
12 1880-12-01 -1.52052176
13 1881-01-01 -2.53366208
14 1881-02-01 -1.88263988
15 1881-03-01 -0.06864686
16 1881-04-01 3.32321167
17 1881-05-01 1.75613177
18 1881-06-01 2.82765651
19 1881-07-01 1.76543093
20 1881-08-01 1.39409852
21 1881-09-01 -0.98141575
22 1881-10-01 -0.63346595
23 1881-11-01 -1.95676208
24 1881-12-01 3.28983855
25 1882-01-01 -0.64792717
26 1882-02-01 2.15854502
27 1882-03-01 2.91465187
28 1882-04-01 0.56616443
29 1882-05-01 -1.89441001
30 1882-06-01 -0.63149375
31 1882-07-01 -0.64883423
32 1882-08-01 0.82802373
33 1882-09-01 0.66150969
34 1882-10-01 -0.54113626
35 1882-11-01 -1.21310496
36 1882-12-01 1.30559540
37 1883-01-01 -1.41802752
38 1883-02-01 -6.39232874
39 1883-03-01 2.96320987
40 1883-04-01 -0.48122203
41 1883-05-01 -0.99614143
42 1883-06-01 -0.67229420
43 1883-07-01 -0.56595141
44 1883-08-01 0.52161294
45 1883-09-01 0.09190032
46 1883-10-01 -2.65115738
47 1883-11-01 1.88332438
48 1883-12-01 -0.19942272
49 1884-01-01 -0.34669495
50 1884-02-01 -2.21085262
51 1884-03-01 0.55254096
52 1884-04-01 -1.21859336
53 1884-05-01 -0.40969065
54 1884-06-01 0.44454563
55 1884-07-01 1.28881764
56 1884-08-01 -1.09331822
57 1884-09-01 1.52377772
58 1884-10-01 1.76569140
59 1884-11-01 0.72411090
60 1884-12-01 -4.64927006
61 1885-01-01 -1.03242493
62 1885-02-01 -0.79325873
63 1885-03-01 0.65910935
64 1885-04-01 -0.10181000
65 1885-05-01 -1.50702798
66 1885-06-01 -1.25801849
67 1885-07-01 -0.88433135
68 1885-08-01 -1.18410277
69 1885-09-01 0.15284735
70 1885-10-01 -0.91721576
71 1885-11-01 1.82403481
72 1885-12-01 1.68553519
73 1886-01-01 -4.21202993
74 1886-02-01 2.43953681
75 1886-03-01 -2.24947429
76 1886-04-01 -1.22557247
77 1886-05-01 2.66594267
78 1886-06-01 -0.21662886
79 1886-07-01 1.09909940
80 1886-08-01 0.63720244
81 1886-09-01 -0.11845125
82 1886-10-01 0.49225059
83 1886-11-01 -3.16969180
84 1886-12-01 2.18220520
85 1887-01-01 0.51427501
86 1887-02-01 -0.69656581
87 1887-03-01 3.96693182
88 1887-04-01 0.92614591
89 1887-05-01 1.66550291
90 1887-06-01 1.88668025
91 1887-07-01 -1.48990893
92 1887-08-01 -0.98355341
93 1887-09-01 0.93172997
94 1887-10-01 -1.12551820
95 1887-11-01 1.07798636
96 1887-12-01 -2.15758419
97 1888-01-01 -1.69266903
98 1888-02-01 2.55955243
99 1888-03-01 -1.83599913
100 1888-04-01 3.63450384
As you can see, the labels produced by stat_peaks and stat_valleys are identical and not even within the range of the abbreviated data, rather than the correct dates corresponding to the x-axis.
[Figure: Monthly Air Surface Temp for Denver from 1880 on]

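stat_peaks and stat_valleys format their labels correctly when the x variable is POSIXct rather than Date:
library(ggplot2)
library(ggpmisc)
# convert the Date column to POSIXct so the labels are computed from date-times
df_denver$Date <- as.POSIXct(df_denver$Date, format = "%Y-%m-%d")
# y uses the column name shown in the question's data (Temp..C.)
ggplot(df_denver, aes(x=Date, y=Temp..C.)) +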
geom_line() +
scale_x_datetime(date_labels="%b %Y", date_breaks = "1 year", expand=c(0,0)) +
stat_peaks(span=24, ignore_threshold = 0.80, color="red") +
stat_peaks(geom="text", span=24, ignore_threshold = 0.80, x.label.fmt = "%b %Y", color="red", angle=90, hjust=-0.1) +
stat_valleys(span=24, ignore_threshold = 0.55, color="blue") +
stat_valleys(geom="text", span=24, ignore_threshold = 0.55, x.label.fmt = "%b %Y", color="blue", angle=90, hjust=1.1) +
labs(x="Date", y="Temp (C)", title="Monthly Air Surface Temp for Denver from 1880 on") +
expand_limits(y = 6)
Note: scale_x_date was replaced with scale_x_datetime to match the POSIXct column. In addition, date_breaks was changed to "1 year" so that the x-axis labels are visible for the example data, and expand_limits(y = 6) was added so the peak labels remain readable. group=1 should not be needed.
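One plausible explanation for the repeated "Dec 1969" label, offered as an assumption rather than a confirmed reading of the ggpmisc internals: a Date is stored as a count of days since 1970-01-01, and if that count is interpreted as seconds, every date in the data lands within a few hours of the epoch. The base R sketch below shows the arithmetic for the first date in the question.

```r
# A Date is stored as the number of days since 1970-01-01.
first_date <- as.Date("1880-01-01")
days <- as.numeric(first_date)
days
# -32872

# Assumption: the label code reads that number as seconds since the epoch.
misread <- as.POSIXct(days, origin = "1970-01-01", tz = "UTC")
format(misread, "%Y-%m-%d", tz = "UTC")
# "1969-12-31"

# Converting to POSIXct first stores seconds, so the round trip is correct.
as_datetime <- as.POSIXct(first_date)
format(as_datetime, "%Y-%m-%d", tz = "UTC")
# "1880-01-01"
```

If this reading is right, it also explains why the x-axis itself looked fine: the scale handles Date values correctly, and only the label formatting goes wrong.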

Related

How can I stop geom_point from removing rows in order to create a map

My intention is to plot several locations for which I have the longitude and the latitude onto a map (as simple dots). The locations are distributed across Uganda.
print(locations)
Latitude Longitude
1 0.482980 30.212160
2 0.647717 30.315984
3 0.44735 30.18063
4 0.58416316 30.2066327
5 0.60012 30.19998
6 0.433483 30.20179
7 0.625317 30.224837
8 0.654277 30.251667
9 0.387517 30.197475
10 0.607402 30.292068
11 0.770128 30.403456
12 0.767266 30.414246
13 0.777873 30.389111
14 0.631774 30.290356
15 0.734015 30.279161
16 0.722133 30.277941
17 0.66322994 30.22795225
18 0.66900827 30.21357739
19 0.450372 30.197764
20 0.493699 30.250891
21 0.479716 30.180958
22 0.483242 30.284576
23 0.645044 30.321270
24 0.602389 30.275637
25 0.868827 30.465939
26 0.631194 30.263565
27 0.631576 30.263855
28 0.413701 30.247934
29 0.67135 30.2675
30 0.492360 30.223620
31 0.81481 30.39311
32 0.396665 30.26309
33 0.666170 30.308960
34 0.610067 30.306058
35 0.677144 30.196810
36 0.677144 30.196810
37 0.555555 30.231681
38 0.63874 30.231691
39 0.512953 30.207603
40 0.442291 30.279173
41 0.575658 30.310231
42 0.423129 30.211289
43 0.623838 30.256925
44 0.639643 30.341620
45 0.653550 30.170428
46 0.752630 30.401040
47 0.478544 30.191938
48 0.48114 30.198471
49 0.679820 30.259800
50 0.581293 30.158619
51 0.730410 30.376620
52 0.504059 30.178556
53 0.587441 30.310364
54 0.588072 30.277877
55 0.70893233 30.19008103
56 0.81699 30.41799
57 0.609300 30.271613
58 0.595226 30.315580
59 0.459029 30.277659
60 0.727873 30.216385
61 0.647722 30.217760
62 0.690064 30.193881
63 0.512339 30.140107
64 0.649181 30.302570
65 0.649881 30.303974
66 0.649736 30.302481
67 0.722082 30.226063
68 0.463480 30.203050
69 0.692930 30.281880
70 0.652864 30.229106
71 0.491520 30.233780
72 0.778370 30.415920
73 0.682090 30.276460
74 0.564670 30.148920
75 0.655588 30.243047
76 0.647717 30.315984
77 0.518769 30.159384
78 0.683070 30.339650
79 0.662980 30.253890
80 0.591899 30.145857
81 0.699690 30.344650
82 0.441030 30.177240
83 0.612202 30.213022
84 0.472530 30.236980
85 0.473722 30.165020
86 0.499181 30.159485
87 0.6598021 30.29158
88 0.6601362 30.29119
89 0.48386 30.23142
90 0.679470 30.282190
91 0.685860 30.271070
92 0.528797 30.171251
93 0.514863 30.243976
94 0.603612 30.258705
95 0.484708 30.142588
96 0.523857 30.233239
97 0.395356 30.215351
98 0.612247 30.269341
99 0.55878815 30.17702095
100 0.747630 30.384240
101 0.538778 30.326353
102 0.554198 30.299815
103 0.504410 30.298260
104 0.418705 30.259747
105 0.669850 30.324100
106 0.654277 30.251667
107 0.460830 30.214070
108 0.378725 30.216429
Here is what I managed to do so far:
locations$Latitude=as.numeric(levels(locations$Latitude))[locations$Latitude]
locations$Longitude=as.numeric(levels(locations$Longitude))[locations$Longitude]
uganda <- raster::getData('GADM', country='UGA', level=1)
ggplot() +
geom_polygon(data = uganda,
aes(x = long, y = lat, group = group),
colour = "grey10", fill = "#fff7bc") +
geom_point(data = locations,
aes(x = Longitude, y = Latitude)) +
coord_map() +
theme_bw() +
xlab("Longitude") + ylab("Latitude")
As you can see by executing the code above, the map of Uganda is loaded from the GADM database and displayed correctly. However, I get the following warning message:
Warning:
Removed 108 rows containing missing values (geom_point).
I read in another post (Explain ggplot2 warning: "Removed k rows containing missing values") that this warning might be caused by axis ranges that exclude the data. I'm not familiar with plotting geographic data and GADM maps, though, so I wasn't able to adjust the ranges (I guess this would be done in the geom_polygon part). Can somebody help me, please?
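I am not sure why you are running the first part of your code:
locations$Latitude=as.numeric(levels(locations$Latitude))[locations$Latitude]
locations$Longitude=as.numeric(levels(locations$Longitude))[locations$Longitude]
That idiom is meant for factor columns. If the columns are not factors, levels() returns NULL, so the lookup no longer matches anything and can produce NA for every row, which is why all 108 points are removed. If you skip that part, there won't be any NAs, so the following code should work: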
library(tidyverse)
library(raster)
uganda <- raster::getData('GADM', country='UGA', level=1)
ggplot() +
geom_polygon(data = uganda,
aes(x = long, y = lat, group = group),
colour = "grey10", fill = "#fff7bc") +
geom_point(data = locations,
aes(x = Longitude, y = Latitude)) +
coord_map() +
theme_bw() +
xlab("Longitude") + ylab("Latitude")
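To see why the conversion step can wipe out the coordinates, here is a minimal base R sketch. It assumes the columns were read as character rather than factor, which the question does not confirm; `to_numeric` is a hypothetical helper, not part of any package.

```r
# The factor idiom works when the column really is a factor.
lat_factor <- factor(c("0.482980", "0.647717"))
ok <- as.numeric(levels(lat_factor))[lat_factor]
ok

# On a character column, levels() is NULL, so every lookup returns NA.
lat_chr <- c("0.482980", "0.647717")
broken <- as.numeric(levels(lat_chr))[lat_chr]
broken
# NA NA

# Hypothetical helper: convert in the way the column type calls for.
to_numeric <- function(x) {
  if (is.factor(x)) as.numeric(levels(x))[x] else as.numeric(x)
}
to_numeric(lat_factor)
to_numeric(lat_chr)
```

Checking class(locations$Latitude) before converting is usually enough to decide which branch applies.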

ggplot facets: show annotated text in selected facets

I want to create a 2 by 2 faceted plot with a vertical line shared by the four facets. However, because the facets on top have the same date information as the facets at the bottom, I only want to have the vline annotated twice: in this case in the two facets at the bottom.
I looked, among other places, here, which does not work for me. (I also doubt whether that is still valid code today.) I also looked here. I also looked up how to change the font size in geom_text: according to the help pages this is size. In the case below it does not work as expected.
This is my code:
library(ggplot2)
library(tidyr)
my_df <- read.table(header = TRUE, text =
"Date AM_PM First_Second Systolic Diastolic Pulse
01/12/2017 AM 1 134 83 68
01/12/2017 PM 1 129 84 76
02/12/2017 AM 1 144 88 56
02/12/2017 AM 2 148 93 65
02/12/2017 PM 1 131 85 59
02/12/2017 PM 2 129 83 58
03/12/2017 AM 1 153 90 62
03/12/2017 AM 2 143 92 59
03/12/2017 PM 1 139 89 56
03/12/2017 PM 2 141 86 56
04/12/2017 AM 1 140 87 58
04/12/2017 AM 2 135 85 55
04/12/2017 PM 1 140 89 67
04/12/2017 PM 2 128 88 69
05/12/2017 AM 1 134 99 67
05/12/2017 AM 2 128 90 63
05/12/2017 PM 1 136 88 63
05/12/2017 PM 2 123 83 61
")
# setting the classes right
my_df$Date <- as.Date(as.character(my_df$Date), format = "%d/%m/%Y")
my_df$First_Second <- as.factor(my_df$First_Second)
# to tidy format
my_df2 <- gather(data = my_df, key = Measure, value = Value,
-c(Date, AM_PM, First_Second), factor_key = TRUE)
# Measures in 1 facet, facets split over AM_PM and First_Second
## add annotations column for geom_text
my_df2$Annotations <- rep("", 54)
my_df2$Annotations[c(4,6)] <- "Start"
p2 <- ggplot(data = my_df2) +
ggtitle("Blood Pressure and Pulse as a function of AM/PM,\n Repetition, and date") +
geom_line(aes(x = Date, y = Value, col= Measure, group = Measure), size = 1.) +
geom_point(aes(x = Date, y = Value, col= Measure, group = Measure), size= 1.5) +
facet_grid(First_Second ~ AM_PM) +
geom_vline(aes(xintercept = as.Date("2017/12/02")), linetype = "dashed",
colour = "darkgray") +
theme(axis.text.x=element_text(angle = -90))
p2
yields this graph:
This is the basic plot from which I start. Now we try to annotate it.
p2 + annotate(geom="text", x = as.Date("2017/12/02"), y= 110, label="start", size= 3)
yielding this plot:
This plot has the problem that the annotation occurs 4 times, while we only want it in the bottom parts of the graph.
Now we use geom_text, which will use the "Annotations" column in our data frame, in line with this SO question. Be careful: the column added to the data frame must already be present when "p2" is first created (that is why we added the column above).
p2 + geom_text(aes(x=as.Date("2017/12/02"), y=100, label = Annotations, size = .6))
yielding this plot:
Yes, we succeeded in getting the annotation only in the bottom two parts of the graph. But the font is too big ( ... and ugly) and when we try to correct it with size, two things are interesting: (1) the font size is not changed (although you would expect that from the help pages) and (2) a legend is added.
I have been clicking around a lot and have been unable to solve this after hours and hours. Any help would be appreciated.
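One way to get the label into selected facets only, sketched here under assumptions: the annotation goes into its own small data frame that contains the facetting variables, and size is set outside aes() so that it is treated as a constant rather than mapped to a scale (mapping it inside aes() is what adds the legend and ignores the intended font size). The data frames `demo` and `ann` and the y position 135 are illustrative choices; `demo` holds a few systolic values from the table above.

```r
library(ggplot2)

# Small stand-in for the long data (systolic values for 2 and 3 Dec).
demo <- data.frame(
  Date = as.Date(rep(c("2017-12-02", "2017-12-03"), times = 4)),
  AM_PM = rep(c("AM", "PM"), each = 2, times = 2),
  First_Second = factor(rep(c("1", "2"), each = 4)),
  Value = c(144, 153, 131, 139, 148, 143, 129, 141)
)

# One row per facet that should carry the label (bottom row only).
ann <- data.frame(
  Date = rep(as.Date("2017-12-02"), 2),
  Value = rep(135, 2),
  AM_PM = c("AM", "PM"),
  First_Second = factor(rep("2", 2), levels = levels(demo$First_Second)),
  label = rep("Start", 2)
)

p <- ggplot(demo, aes(x = Date, y = Value)) +
  geom_line() +
  geom_point() +
  facet_grid(First_Second ~ AM_PM) +
  geom_vline(aes(xintercept = as.Date("2017-12-02")),
             linetype = "dashed", colour = "darkgray") +
  # size is a constant here, so no legend is created
  geom_text(data = ann, aes(label = label), size = 3, hjust = -0.2)
p
```

Because `ann` has no rows for First_Second == "1", the top facets receive the dashed line but no text.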

How to use ggplot to plot the trend of four variables in R?

I have a data set that records tumor size at four different time points (each row is one patient). I want to analyze this data set to show that, overall, tumor size decreases after each time point.
What kind of analysis can I do? How should I use ggplot to visualize these data and show the trend? Many thanks!
SUBJECTID Baseline 1 2 3
1001 88 78 30 14
1002 29 26 66 16
1003 50 64 54 46
1004 91 90 99 43
1005 98 109 60 42
1007 100 100 54
1008 45 49 47 32
1009 75 66 57 7
1010 60 52 20 3
1011 68 68 56 47
1012 78 84 56 57
1013 71 70 8 5
1015 79 50 11 3
1016 73 60 57 36
1017 54 27 16
1018 50 37 33 26
1019 115 68 33 67
1021 63 55 0 0
1022 98 91 76 75
1024 76 76 0
1025 47 45 42 42
1026 32 25 14 0
1027 40 37 65
1028 60 110 110 0
A box plot might work. Try the following:
library(tidyverse)
df %>%
gather(key = "time", value = "tumor_size", -SUBJECTID) %>%
ggplot(aes(time, tumor_size)) +
geom_boxplot() +
labs(title = "Tumor Size ~ Time",
subtitle = "Insert subtitle if you want",
caption = "Insert caption if you want",
x = "Time",
y = "Tumor Size (insert unit)") +
theme_bw() +
theme(
panel.grid.major.x = element_blank(),
text = element_text(family = "Palatino"),
plot.title = element_text(face = "bold", size = 20)
)
You could also add geom_jitter() if you'd like. After the geom_boxplot() + line, add:
geom_jitter(width = 0.1, pch = 21, fill = "grey") +
You'll get something like this:
To show that overall tumor size is decreasing after each time point, you usually want a mean tumor size after each time frame. It's much easier to plot than every individual element. I've written how to do this using your first four rows, producing a dot graph:
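# tumor sizes of the first four patients at each time point
baseline <- c(88, 29, 50, 91)
dAC <- c(78, 26, 64, 90)
InterReg <- c(30, 66, 54, 99)
PreSurg <- c(14, 16, 46, 43)
# one row per time point; named "sizes" to avoid masking base::matrix
sizes <- rbind(baseline, dAC, InterReg, PreSurg)
# mean tumor size at each time point
means <- rowMeans(sizes)
plot(means)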
Dot graph:
In terms of what analysis to do, I can't really answer that. That depends on what you want it to look like. What I've done is the most basic way of representing the data. You may want to use a column graph, a bar graph, a line graph etc. That's up to your personal preference. In terms of using ggplot, here are many different examples you can use: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
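The same per-time-point means can also be computed directly from a data frame shaped like the one in the question. This is a minimal sketch using only the first four patients; the column names T1 to T3 are placeholders chosen here, and na.rm = TRUE is included because the full table has missing measurements for some patients.

```r
# First four patients from the question, one column per time point.
tumor <- data.frame(
  SUBJECTID = c(1001, 1002, 1003, 1004),
  Baseline = c(88, 29, 50, 91),
  T1 = c(78, 26, 64, 90),
  T2 = c(30, 66, 54, 99),
  T3 = c(14, 16, 46, 43)
)

# Mean per time point, ignoring missing values if there are any.
time_means <- colMeans(tumor[, -1], na.rm = TRUE)
time_means
# Baseline 64.50, T1 64.50, T2 62.25, T3 29.75

# Connect the means with a line and label the axis with the column names.
plot(time_means, type = "b", xaxt = "n",
     xlab = "Time point", ylab = "Mean tumor size")
axis(1, at = seq_along(time_means), labels = names(time_means))
```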

R Multiple Line Graph for MLB Team Wins by Year

I am trying to make it so that there is a line for each team, with the color of that line matching the color in the legend. I wrote the program as if it were a bar chart, since I know how to do that, so I think there are only a few changes that need to be made in order to make it into lines. Note: I don't want a line of best fit, but rather, one that connects from dot to dot.
This next part may be very time consuming, so I don't expect anyone to help with this, but I would also really like to have the team logos in the legend, maybe replacing the team names. Then in the legend, I would like to have the color associated with the team as a line rather than a box.
Any help with either or both of these would be very much appreciated.
EDIT: I would like to keep all the features that the program below has, such as the gray background, white grids, etc.
df <- read.table(textConnection(
'Year Orioles RedSox Yankees Rays BlueJays
1998 79 92 114 63 88
1999 78 94 98 69 84
2000 74 85 87 69 83
2001 63 82 95 62 80
2002 67 93 103 55 78
2003 71 95 101 63 86
2004 78 98 101 70 67
2005 74 95 95 67 80
2006 70 86 97 61 87
2007 69 96 94 66 83
2008 68 95 89 97 86
2009 64 95 103 84 75
2010 66 89 95 96 85
2011 69 90 97 91 81
2012 93 69 95 90 73
2013 85 97 85 92 74
2014 96 71 84 77 83
2015 81 78 87 80 93
2016 89 93 84 68 89'), header = TRUE)
df %>%
gather(Team, Wins, -Year) %>%
mutate(Team = factor(Team, c("Orioles", "RedSox", "Yankees","Rays","BlueJays"))) %>%
ggplot(aes(x=Year, y=Wins)) +
ggtitle("AL East Wins") +
ylab("Wins") +
geom_col(aes(fill = Team), position = "dodge") +
scale_fill_manual(values = c("orange", "red", "blue", "black","purple"))+
theme(
plot.title = element_text(hjust=0.5),
axis.title.y = element_text(angle = 0, vjust = 0.5),
panel.background = element_rect(fill = "gray"),
panel.grid = element_line(colour = "white")
)
You can use geom_path(aes(color = Team)) instead of geom_col(aes(fill = Team)) and a named color palette to achieve your basic goals. geom_path connects points in the order they appear in the data; since the rows are sorted by Year within each team, geom_line would draw the same lines here:
library(tidyverse)
# break this off the pipeline
df <- gather(df, Team, Wins, -Year) %>%
mutate(Team = factor(Team, c("Orioles", "RedSox", "Yankees","Rays","BlueJays")))
# if you want to reuse the same theme a bunch this is nice
# theme_grey() is the default theme
theme_set(theme_grey() +
theme(plot.title = element_text(hjust=0.5),
axis.title.y = element_text(angle = 0, vjust = 0.5),
panel.background = element_rect(fill = "gray")))
# named palettes are easy
# for specific colors I like hex codes the best
# I just grabbed these off this nice website, TeamColorCodes, could be fun!
cust <- c("#FC4C00", "#C60C30", "#1C2841", "#79BDEE","#003DA5")
names(cust) <- levels(df$Team)
# use geom_path in place of geom_col
ggplot(df, aes(x=Year, y=Wins, color = Team)) +
geom_path(aes(color = Team)) +
scale_color_manual(values = cust) +
labs(title = "AL East Wins",
subtitle = "Ahhh",
y = "Wins",
x = "Year")
Link to teamcolorcodes.com
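For comparison, here is a self-contained sketch of the same idea with geom_line plus geom_point, which gives the dot-to-dot look the question asks for. It uses only the first three seasons from the question's table and the question's original colour names; pivot_longer stands in for gather, and the object names `wins`, `wins_long` and `team_cols` are illustrative.

```r
library(ggplot2)
library(tidyr)

# First three seasons from the question's table.
wins <- data.frame(
  Year = c(1998, 1999, 2000),
  Orioles = c(79, 78, 74),
  RedSox = c(92, 94, 85),
  Yankees = c(114, 98, 87),
  Rays = c(63, 69, 69),
  BlueJays = c(88, 84, 83)
)
teams <- names(wins)[-1]

# Long format: one row per team and year.
wins_long <- pivot_longer(wins, cols = -Year,
                          names_to = "Team", values_to = "Wins")
wins_long$Team <- factor(wins_long$Team, levels = teams)

# A named vector ties each colour to a team regardless of level order.
team_cols <- c(Orioles = "orange", RedSox = "red", Yankees = "blue",
               Rays = "black", BlueJays = "purple")

p <- ggplot(wins_long, aes(x = Year, y = Wins, colour = Team)) +
  geom_line() +
  geom_point() +
  scale_colour_manual(values = team_cols) +
  scale_x_continuous(breaks = wins$Year) +
  labs(title = "AL East Wins", y = "Wins") +
  theme(plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "gray"),
        panel.grid = element_line(colour = "white"))
p
```

Mapping colour instead of fill is also what turns the legend keys from boxes into line segments.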

Creating grouped bar-plot of multi-column data in R

I have the following data
Input Rtime Rcost Rsolutions Btime Bcost
1 12 proc. 1 36 614425 40 36
2 15 proc. 1 51 534037 50 51
3 18-proc 5 62 1843820 66 66
4 20-proc 4 68 1645581 104400 73
5 20-proc(l) 4 64 1658509 14400 65
6 21-proc 10 78 3923623 453600 82
I want to create a grouped bar chart from this data such that x-axis contains Input field (as groups) and y axis represent the log scale for the Rtime and Btime fields (the two bars).
All solutions/examples I checked online had similar data put into a three column layout. I do not know how to use the data I have to generate the grouped bar-chart. Or if there is a way to convert this data (manually converting is not an options because it is a huge file with a lot of rows) into a R and ggplot compatible data format.
Edit :
Graph generated using gncs solution
As requested, a ggplot2 solution that also uses reshape2:
library(reshape2)
library(ggplot2)
df <- read.table(text = " Input Rtime Rcost Rsolutions Btime Bcost
1 12-proc. 1 36 614425 40 36
2 15-proc. 1 51 534037 50 51
3 18-proc 5 62 1843820 66 66
4 20-proc 4 68 1645581 104400 73
5 20-proc(l) 4 64 1658509 14400 65
6 21-proc 10 78 3923623 453600 82",header = TRUE,sep = "")
dfm <- melt(df[,c('Input','Rtime','Btime')],id.vars = 1)
ggplot(dfm,aes(x = Input,y = value)) +
geom_bar(aes(fill = variable),stat = "identity",position = "dodge") +
scale_y_log10()
Note a style difference here: since log10(1) = 0, ggplot2 draws a bar of zero height for a value of 1, so nothing is visible, whereas barplot draws a little stub (which in my opinion is a little misleading).
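To make the reshaping step concrete, the base R sketch below builds the three-column layout by hand for the first three rows of the data and shows where the zero-height bars come from. The object names `wide` and `long` are illustrative; melt() and similar functions produce the same shape for any number of rows.

```r
# First three rows of the question's data, two time columns only.
wide <- data.frame(
  Input = c("12-proc.", "15-proc.", "18-proc"),
  Rtime = c(1, 1, 5),
  Btime = c(40, 50, 66)
)

# Three-column ("long") layout: one row per Input and variable.
long <- data.frame(
  Input = rep(wide$Input, times = 2),
  variable = rep(c("Rtime", "Btime"), each = nrow(wide)),
  value = c(wide$Rtime, wide$Btime)
)
long

# On a log10 axis a value of 1 maps to 0, hence the invisible bars.
log10(long$value)
```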
I think I understand the problem and this is what I would suggest (short run - option):
data <- read.table("data.txt", header=TRUE)
subset <- t(data.frame(data$Rtime, data$Btime))
barplot(subset, legend = c("Rtime", "Btime"), names.arg=data$Input, log="y", beside=TRUE)
Is that what you want? It is kind of dirty, but it does the job.
Update: code corrected.
As requested, a ggplot2 solution that also uses pivot_longer() https://tidyr.tidyverse.org/reference/pivot_longer.html to transform the data into a format that geom_bar() can easily plot.
library(tidyr)  # provides pivot_longer()
library(ggplot2)
df <- read.table(text = " Input Rtime Rcost Rsolutions Btime Bcost
1 12-proc. 1 36 614425 40 36
2 15-proc. 1 51 534037 50 51
3 18-proc 5 62 1843820 66 66
4 20-proc 4 68 1645581 104400 73
5 20-proc(l) 4 64 1658509 14400 65
6 21-proc 10 78 3923623 453600 82",
header = TRUE,sep = "")
# keep only the two columns the question asks for; pivoting with -Input would also stack Rcost, Rsolutions and Bcost
dfm <- pivot_longer(df, cols = c(Rtime, Btime), names_to="variable", values_to="value")
## pivot_longer takes the input data frame, turns the names of the selected columns into the variable "variable" (often called the "key"), and assigns their values to the variable "value".
ggplot(dfm,aes(x = Input,y = value)) +
geom_bar(aes(fill = variable),stat = "identity",position = "dodge") +
scale_y_log10()
joran's answer helped me a lot, but I had to use stat="identity" in the ggplot statement like this:
ggplot(dfm, aes(x = Input,y = value)) +
geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
scale_y_log10()
My version of R is 3.2.2 and ggplot2 version 1.0.1
Thanks.
