Create one line chart per country using ggplot in R - r

My dataset is constructed as follows:
# A tibble: 20 x 8
iso3 year Var1 Var1_imp Var2 Var2_imp Var1_type Var2_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ATG 2000 NA 144 NA 277 imputed imputed
2 ATG 2001 NA 144 NA 277 imputed imputed
3 ATG 2002 NA 144 NA 277 imputed imputed
4 ATG 2003 NA 144 NA 277 imputed imputed
5 ATG 2004 NA 144 NA 277 imputed imputed
6 ATG 2005 NA 144 NA 277 imputed imputed
7 ATG 2006 NA 144 NA 277 imputed imputed
8 ATG 2007 144 144 277 277 observed observed
9 ATG 2008 45 45 NA 301 observed imputed
10 ATG 2009 NA 71.3 NA 325 imputed imputed
11 ATG 2010 NA 97.7 NA 349 imputed imputed
12 ATG 2011 NA 124 NA 373 imputed imputed
13 ATG 2012 NA 150. NA 397 imputed imputed
14 ATG 2013 NA 177. 421 421 imputed observed
15 ATG 2014 NA 203 434 434 imputed observed
16 ATG 2015 NA 229. 422 422 imputed observed
17 ATG 2016 NA 256. 424 424 imputed observed
18 ATG 2017 282 282 429 429 observed observed
19 ATG 2018 NA 282 435 435 imputed observed
20 EGY 2000 NA 38485 NA 146761 imputed imputed
I am new to R and I would like to create a line chart for each country showing the time series for Var1_imp and Var2_imp on the same chart (I have 193 countries in my database, with data from 2000 to 2018). Points should be filled circles when the data are observed and unfilled circles when they are imputed (based on Var1_type and Var2_type). Circles should be joined with a solid line if two consecutive data points are both observed; otherwise they should be joined with a dotted line.
The main goal is to check country by country if the method used to impute missing data is good or bad, depending on whether there are outliers in time series.
I have tried the following:
ggplot(df, aes(x=year, y=Var1_imp, group=Var1_type)) +
geom_point(size=2, shape=21) + # shape = 21 for unfilled circles and shape = 19 for filled circles
geom_line(linetype = "dashed") # omit linetype for a solid line, otherwise linetype = "dashed"
I am having difficulty figuring out:
1/ how to do one single chart per country per variable
2/ how to include both Var1_imp and Var2_imp on the same chart
3/ how to use geom_point based on conditions (imputed versus observed in Var1_type)
4/ how to use geom_line based on conditions (plain line if two subsequent observed data points, otherwise dotted).
Thank you very much for your help - I think this exercise is not easy and I would learn a lot from your inputs.

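You can use the following code (df is the sample data given under Data below, which includes a row-index column sl):
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%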
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, group=variable)) +
geom_point(size=2, shape=21) +
geom_line(linetype = "dashed") + facet_wrap(iso3~., scales = "free") +
xlab("Year") + ylab("Imp")
It is better to distinguish the two variables by colour:
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, colour=variable)) +
geom_point(size=2, shape=21) +
geom_line() + facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Update: to map point shape to observed/imputed, pivot the _imp and _type columns together:
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2),
names_to = c("group", ".value"),
names_pattern = "(.*)_(.*)") %>%
ggplot(aes(x=year, y=imp, shape = type, colour=group)) +
geom_line(aes(group = group, colour = group), size = 0.5) +
geom_point(aes(group = group, colour = group, shape = type),size=2) +
scale_shape_manual(values = c('imputed' = 21, 'observed' = 16)) +
facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Data
df = structure(list(sl = 1:20, iso3 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L
), .Label = c("ATG", "EGY"), class = "factor"), year = c(2000L,
2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L,
2000L), Var1 = c(NA, NA, NA, NA, NA, NA, NA, 144L, 45L, NA, NA,
NA, NA, NA, NA, NA, NA, 282L, NA, NA), Var1_imp = c(144, 144,
144, 144, 144, 144, 144, 144, 45, 71.3, 97.7, 124, 150, 177,
203, 229, 256, 282, 282, 38485), Var2 = c(NA, NA, NA, NA, NA,
NA, NA, 277L, NA, NA, NA, NA, NA, 421L, 434L, 422L, 424L, 429L,
435L, NA), Var2_imp = c(277L, 277L, 277L, 277L, 277L, 277L, 277L,
277L, 301L, 325L, 349L, 373L, 397L, 421L, 434L, 422L, 424L, 429L,
435L, 146761L), Var1_type = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L), .Label = c("imputed",
"observed"), class = "factor"), Var2_type = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L), .Label = c("imputed", "observed"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
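None of the code above addresses point 4 of the question (a solid line only between two consecutive observed points). One possible sketch, not a definitive solution, is to draw each year-to-year link as its own segment with geom_segment() and derive the line type from the type of both endpoints. The toy data frame and the column names next_year, next_imp and link are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up sample in the same shape as the question's data (assumption).
toy <- data.frame(
  iso3      = "ATG",
  year      = 2006:2010,
  Var1_imp  = c(144, 144, 45, 71.3, 97.7),
  Var1_type = c("imputed", "observed", "observed", "imputed", "imputed"),
  Var2_imp  = c(277, 277, 301, 325, 349),
  Var2_type = c("imputed", "observed", "imputed", "imputed", "imputed")
)

# Long format: one row per country, year and variable, with columns imp and type.
long <- toy %>%
  pivot_longer(cols = -c(iso3, year),
               names_to = c("group", ".value"),
               names_pattern = "(.*)_(.*)") %>%
  arrange(iso3, group, year) %>%
  group_by(iso3, group) %>%
  mutate(
    next_year = lead(year),
    next_imp  = lead(imp),
    # solid only when this point and the next one are both observed
    link = ifelse(type == "observed" & lead(type) == "observed",
                  "solid", "dotted")
  ) %>%
  ungroup()

# The last year of each series has no following point, so it starts no segment.
segments <- filter(long, !is.na(next_year))

p <- ggplot(long, aes(x = year, y = imp, colour = group)) +
  geom_segment(data = segments,
               aes(xend = next_year, yend = next_imp, linetype = link)) +
  geom_point(aes(shape = type), size = 2) +
  scale_shape_manual(values = c(imputed = 21, observed = 16)) +
  scale_linetype_manual(values = c(solid = "solid", dotted = "dotted")) +
  facet_wrap(~ iso3, scales = "free") +
  xlab("Year") + ylab("Imp")
p
```

With 193 countries a single facet_wrap() will be crowded, so it may be more practical to filter to a handful of countries per plot.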

Plotting two variables at the same time in a meaningful way in a line chart is going to be a bit hard. It's easier if you use pivot_longer to create one column containing both the Var1_imp and Var2_imp values. You will then have a key column containing the names Var1_imp and Var2_imp, and a values column containing the values for those two. You can then plot using year as x and the new values column as y, with colour set to the key column (lines take colour, not fill). You'll then get two lines per country.
However, looking for outliers by eye in line charts for 193 countries isn't a very good idea. Use something like
outlier_values <- boxplot.stats(airquality$Ozone)$out
to get the outliers in a column, or combine it with sapply to cover multiple columns. By default boxplot.stats flags values lying more than 1.5 times the IQR beyond the quartiles (hinges), so it's easy to work out which ones they are.
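As a rough illustration of applying boxplot.stats per country rather than to one column, here is a minimal base R sketch. The toy data frame and the names outliers_by_country and flagged are made up; only iso3 and Var1_imp follow the question.

```r
# Made-up data (assumption): two countries, one with an obvious spike.
toy <- data.frame(
  iso3     = rep(c("AAA", "BBB"), each = 8),
  year     = rep(2000:2007, times = 2),
  Var1_imp = c(10, 11, 12, 11, 10, 12, 11, 95,
               50, 52, 51, 49, 50, 51, 52, 50)
)

# For each country, collect the values that boxplot.stats() flags.
outliers_by_country <- lapply(split(toy$Var1_imp, toy$iso3),
                              function(v) boxplot.stats(v)$out)

# Keep only the countries with at least one flagged value.
flagged <- Filter(function(v) length(v) > 0, outliers_by_country)
print(flagged)
```

Keep in mind that this rule ignores time order, so a series with a strong trend can have flagged values that are not really anomalies.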

Related

How would I color values in a scatterplot in ggplot2 IF the variable is defining how it is plotted?

I have the following dataset:
Species Country IUCN_Area IUCN.Estimate Estimate.year
1 Reticulated Kenya Embu 0 2018
2 Reticulated Kenya Laikipia_Isiolo_Samburu 3043 2018
3 Reticulated Kenya Marsabit 625 2018
4 Reticulated Kenya Meru 999 2018
5 Reticulated Kenya Turkana 0 2018
6 Reticulated Kenya West Pokot 0 2018
GEC_Stratum_Detect_Estimate UpperCI_detect LowerCI_detect
1 130 277 -17
2 16414 19919 12910
3 57 347 -233
4 4143 6232 2054
5 0 0 0
6 0 0 0
I want to create a scatterplot which has on the x-axis "IUCN Estimate", and on the y-axis the "GEC_Stratum_Detect_Estimate". I then want to color the dots by type, i.e. "IUCN" and "GEC". However, how would I color the dots by their type, if the variables are defining the axes? I'm pretty sure there must be a simple code to layer on, but it's been stumping me so far. I've also tried rejigging the dataset but haven't managed to get anywhere. Here's the plot code:
ggplot(df, aes(x=IUCN.Estimate, y=GEC_Stratum_Detect_Estimate, shape=Species)) +
geom_point() +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
theme_classic()
And here is the data:
structure(list(Species = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Maasai",
"Reticulated", "Southern"), class = "factor"), Country = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Botswana", "Kenya", "Tanzania"
), class = "factor"), IUCN_Area = structure(c(4L, 10L, 12L, 13L,
23L, 25L), .Label = c("Burigi-Biharamulo", "Central District",
"Chobe", "Embu", "Kajiado", "Katavi-Rukwa", "Kilifi", "Kitui",
"Kwale", "Laikipia_Isiolo_Samburu", "Makueni/ Machakos", "Marsabit",
"Meru", "Moremi GR", "Narok", "Ngamiland", "No IUCN Estimate",
"Nxai and Makgadikgadi", "Ruahu-Rungwa-Kisigo", "Selous-Mikumi",
"Taita Taveta", "Tana River", "Turkana", "Ugalla GR", "West Pokot"
), class = "factor"), IUCN.Estimate = c(0L, 3043L, 625L, 999L,
0L, 0L), Estimate.year = c(2018L, 2018L, 2018L, 2018L, 2018L,
2018L), GEC_Stratum_Detect_Estimate = c(130L, 16414L, 57L, 4143L,
0L, 0L), UpperCI_detect = c(277L, 19919L, 347L, 6232L, 0L, 0L
), LowerCI_detect = c(-17L, 12910L, -233L, 2054L, 0L, 0L)), row.names = c(NA,
6L), class = "data.frame")
Thank you in advance.
The following will color the dots by the IUCN_Area:
library(ggplot2)
ggplot(df, aes(x=IUCN.Estimate, y=GEC_Stratum_Detect_Estimate, shape=Species)) +
geom_point(aes(color=IUCN_Area)) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
theme_classic()
And the following colors by IUCN.Estimate. Because IUCN.Estimate is numeric, ggplot uses a continuous color gradient by default, whereas the factor IUCN_Area above gets discrete colors.
library(ggplot2)
ggplot(df, aes(x=IUCN.Estimate, y=GEC_Stratum_Detect_Estimate, shape=Species)) +
geom_point(aes(color=IUCN.Estimate)) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
theme_classic()
As the OP asked to color by both IUCN and GEC, the following does that, though how to interpret it may be another matter. Any value can be mapped to color; here I've added the two numbers together and wrapped the sum in as.factor(). Presumably in large datasets the sum might identify a group of note.
library(ggplot2)
ggplot(df, aes(x=IUCN.Estimate, y=GEC_Stratum_Detect_Estimate, shape=Species)) +
geom_point(aes(color=as.factor(IUCN.Estimate+GEC_Stratum_Detect_Estimate))) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
theme_classic()

Delete rows if column values are equal

I want to delete rows if the values in the columns (YEAR, POL, CTY, ID, AMOUNT) are equal to those in another row. Please see the output table below.
Table:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 36 3500 RANGE1 L0015N20W23
2017 30408 11 36 3500 RANGE1 L00210N20W24
2017 30408 11 36 3500 RANGE1 L00310N20W25
2017 30409 11 36 3500 RANGE1 L0015N20W23
2017 30409 11 35 3500 RANGE2 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
Output:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 35 3500 RANGE1 L0015N20W23
You can try this:
no_duplicate_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")
new_df <- df[!duplicated(df[, no_duplicate_cols]), ]
The data frame new_df keeps the first row of each (YEAR, POL, CTY, ID, AMOUNT) combination from df and drops the later repeats.
If I understood the question correctly then I think you can try this
library(dplyr)
df %>%
group_by(YEAR, POL, CTY, ID, AMOUNT) %>%
filter(n() == 1)
Output (it seems the output shown in the original question contains a typo):
# A tibble: 1 x 7
# Groups: YEAR, POL, CTY, ID, AMOUNT [1]
YEAR POL CTY ID AMOUNT RAN LEGAL
1 2017 30409 11 36 3500 RANGE1 L0015N20W23
#sample data
> dput(df)
structure(list(YEAR = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L), POL = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L,
30409L), CTY = c(11L, 11L, 11L, 11L, 11L, 11L, 11L), ID = c(36L,
36L, 36L, 36L, 35L, 35L, 35L), AMOUNT = c(3500L, 3500L, 3500L,
3500L, 3500L, 3500L, 3500L), RAN = structure(c(1L, 1L, 1L, 1L,
2L, 3L, 3L), .Label = c("RANGE1", "RANGE2", "RANGE3"), class = "factor"),
LEGAL = structure(c(1L, 2L, 3L, 1L, 4L, 4L, 4L), .Label = c("L0015N20W23",
"L00210N20W24", "L00310N20W25", "NANANA"), class = "factor")), .Names = c("YEAR",
"POL", "CTY", "ID", "AMOUNT", "RAN", "LEGAL"), class = "data.frame", row.names = c(NA,
-7L))
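The two answers above behave differently: !duplicated() keeps one row per combination, while filter(n() == 1) drops every row whose combination occurs more than once. A base R sketch of both, using the question's rows (the names key_cols, first_kept, in_dup_group and unique_only are just for illustration):

```r
# The question's sample rows, with LEGAL kept for reference.
df <- data.frame(
  YEAR   = rep(2017L, 7),
  POL    = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L, 30409L),
  CTY    = rep(11L, 7),
  ID     = c(36L, 36L, 36L, 36L, 35L, 35L, 35L),
  AMOUNT = rep(3500L, 7),
  LEGAL  = c("L0015N20W23", "L00210N20W24", "L00310N20W25",
             "L0015N20W23", "NANANA", "NANANA", "NANANA")
)
key_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")
keys <- df[, key_cols]

# Variant 1: keep the first row of every key combination.
first_kept <- df[!duplicated(keys), ]

# Variant 2: drop every row whose key combination appears more than once.
# Scanning from both ends marks all members of a duplicated group.
in_dup_group <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
unique_only <- df[!in_dup_group, ]

print(first_kept)
print(unique_only)
```

Variant 2 is the base R counterpart of the dplyr group_by/filter answer and leaves only the POL 30409, ID 36 row.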

geom_area group and fill by different variables

I have a geom_area plot that looks like this:
The x-axis is a time series, and I want to color the fill of each facet by groups of the variable "estacion" (seasons of the year). Here's a sample of my data:
año censo estacion tipoEuro censEu censTot pCensEu
2010 2010-01-01 Invierno HA frisona 13 32 40.62500
2010 2010-01-01 Invierno Bovinos jovenes 10 32 31.25000
2010 2010-01-02 Invierno HA frisona 13 32 40.62500
---
2014 2014-12-30 Invierno Bovinos jovenes 15 26 57.69231
2014 2014-12-31 Invierno HA frisona 3 26 11.53846
2014 2014-12-31 Invierno Terneros 8 26 30.76923
Here's the code I'm using to make the plot:
ggplot(censTot1,aes(x=censo, y=pCensEu,group=tipoEuro)) +
geom_area() +
geom_line()+ facet_grid(tipoEuro ~ .)
And this is the code I intend to use, along with the error it generates:
ggplot(censTot1,aes(x=censo,y=pCensEu,group=tipoEuro,fill=estacion)) +
geom_area() +
geom_line()+ facet_grid(tipoEuro ~ .)
Error: Aesthetics can not vary with a ribbon
I'm not sure about the desired output. Would this solve your problem?
library(ggplot2)
ggplot(df, aes(x=censo, y=pCensEu,group=tipoEuro))+
geom_area(aes(fill=estacion))+
geom_line()+
facet_grid(tipoEuro ~ .)
The data used
df <- structure(list(año = c(2010L, 2010L, 2010L, 2014L, 2014L, 2014L
), censo = structure(c(1L, 1L, 2L, 3L, 4L, 4L), .Label = c("01/01/2010",
"02/01/2010", "30/12/2014", "31/12/2014"), class = "factor"),
estacion = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Invierno", class = "factor"),
tipoEuro = structure(c(2L, 1L, 2L, 1L, 2L, 3L), .Label = c("Bovinos jovenes",
"HA frisona", "Terneros"), class = "factor"), censEu = c(13L,
10L, 13L, 15L, 3L, 8L), censTot = c(32L, 32L, 32L, 26L, 26L,
26L), pCensEu = c(40.625, 31.25, 40.625, 57.69231, 11.53846,
30.76923)), .Names = c("año", "censo", "estacion", "tipoEuro",
"censEu", "censTot", "pCensEu"), class = "data.frame", row.names = c(NA,
-6L))
You can also try this:
ggplot(df, aes(x=censo, y=pCensEu, color=estacion, group=interaction(tipoEuro,estacion))) +
geom_area() +
geom_line()+ facet_grid(tipoEuro ~ .)

controlling text when using add_tooltip in ggvis - r

I am trying to get more control over the text that appears when using add_tooltip in ggvis.
Say I want to plot 'totalinns' against 'avg' for this dataframe. Color points by 'country'.
The text I want to appear in the hovering tooltip would be: 'player', 'country', 'debutyear', 'avg'
tmp:
# player totalruns totalinns totalno totalout avg debutyear country
# 1 AG Ganteaume 112 1 0 1 112.00000 1948 WI
# 2 DG Bradman 6996 80 10 70 99.94286 1928 Aus
# 3 MN Nawaz 99 2 1 1 99.00000 2002 SL
# 4 VH Stollmeyer 96 1 0 1 96.00000 1939 WI
# 5 DM Lewis 259 5 2 3 86.33333 1971 WI
# 6 Abul Hasan 165 5 3 2 82.50000 2012 Ban
# 7 RE Redmond 163 2 0 2 81.50000 1973 NZ
# 8 BA Richards 508 7 0 7 72.57143 1970 SA
# 9 H Wood 204 4 1 3 68.00000 1888 Eng
# 10 JC Buttler 200 3 0 3 66.66667 2014 Eng
I understand that I need to make a key/id variable as ggvis only takes information supplied to it. Therefore I need to refer back to the original data. I have tried changing my text inside of my paste0() command, but still can't get it right.
tmp$id <- 1:nrow(tmp)
all_values <- function(x) {
if(is.null(x)) return(NULL)
row <- tmp[tmp$id == x$id, ]
paste0(tmp$player, tmp$country, tmp$debutyear,
tmp$avg, format(row), collapse = "<br />")
}
tmp %>% ggvis(x = ~totalinns, y = ~avg, key := ~id) %>%
layer_points(fill = ~factor(country)) %>%
add_tooltip(all_values, "hover")
Find below code to reproduce example:
tmp <- structure(list(player = c("AG Ganteaume", "DG Bradman", "MN Nawaz",
"VH Stollmeyer", "DM Lewis", "Abul Hasan", "RE Redmond", "BA Richards",
"H Wood", "JC Buttler"), totalruns = c(112L, 6996L, 99L, 96L,
259L, 165L, 163L, 508L, 204L, 200L), totalinns = c(1L, 80L, 2L,
1L, 5L, 5L, 2L, 7L, 4L, 3L), totalno = c(0L, 10L, 1L, 0L, 2L,
3L, 0L, 0L, 1L, 0L), totalout = c(1L, 70L, 1L, 1L, 3L, 2L, 2L,
7L, 3L, 3L), avg = c(112, 99.9428571428571, 99, 96, 86.3333333333333,
82.5, 81.5, 72.5714285714286, 68, 66.6666666666667), debutyear = c(1948L,
1928L, 2002L, 1939L, 1971L, 2012L, 1973L, 1970L, 1888L, 2014L
), country = c("WI", "Aus", "SL", "WI", "WI", "Ban", "NZ", "SA",
"Eng", "Eng")), .Names = c("player", "totalruns", "totalinns",
"totalno", "totalout", "avg", "debutyear", "country"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
I think this is closer:
all_values <- function(x) {
if(is.null(x)) return(NULL)
# look up the hovered point's row via the key passed in x$id
row <- tmp[tmp$id == x$id, ]
paste(row$player, row$country, row$debutyear,
row$avg, sep="<br>")
}

R: Assign colors to values/color gradient palette

I have a sample dataframe which looks like this:
reg1 <- structure(list(REGION = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("REG1", "REG2"), class = "factor"),STARTYEAR = c(1959L, 1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1966L, 1967L, 1945L, 1946L, 1947L, 1948L, 1949L), ENDYEAR = c(1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1966L, 1967L, 1968L, 1946L, 1947L, 1948L, 1949L, 1950L), Y_START = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 2L, 2L, 2L), Y_END = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), COLOR_VALUE = c(-969L, -712L, -574L, -312L, -12L, 1L, 0L, -782L, -999L, -100L, 23L, 45L, NA, 999L)), .Names = c("REGION", "STARTYEAR", "ENDYEAR", "Y_START", "Y_END", "COLOR_VALUE"), class = "data.frame", row.names = c(NA, -14L))
REGION STARTYEAR ENDYEAR Y_START Y_END COLOR_VALUE
1 REG1 1959 1960 0 1 -969
2 REG1 1960 1961 0 1 -712
3 REG1 1961 1962 0 1 -574
4 REG1 1962 1963 0 1 -312
5 REG1 1963 1964 0 1 -12
6 REG1 1964 1965 0 1 1
7 REG1 1965 1966 0 1 0
8 REG1 1966 1967 0 1 -782
9 REG1 1967 1968 0 1 -999
10 REG2 1945 1946 2 3 -100
11 REG2 1946 1947 2 3 23
12 REG2 1947 1948 2 3 45
13 REG2 1948 1949 2 3 NA
14 REG2 1949 1950 2 3 999
I am creating a plot with the rect() function which works fine.
xx = unlist(reg1[, c(2, 3)])
yy = unlist(reg1[, c(4, 5)])
png(width=1679, height=1165, res=150)
if(any(xx < 1946)) {my_x_lim <- c(min(xx), 2014)} else {my_x_lim <- c(1946, 2014)}
plot(xx, yy, type='n', xlim = my_x_lim)
apply(reg1, 1, function(y)
rect(y[2], y[4], y[3], y[5]))
dev.off()
In my reg1 data I have a 6th column which contains values between +1000 and -1000. What I was wondering is if there is a method that I could colour the rectangles in my plot according to my color values. Low values should be blue, values around 0 should result in white and high values in red (if no value is present or NA, then grey should be plotted).
My question: How could I create a color palette that ranges from values 1000 to -1000 (from red over white to blue) and apply it to my plot so that each rectangle gets coloured according to the color value?
Here is how you get a color ramp and match it to the data frame.
# colorRampPalette() returns a function that interpolates n colors between blue, white and red
my.colors <- colorRampPalette(c("blue", "white", "red"))
# one color per integer value from -1000 to 1000 (2001 colors)
color.df <- data.frame(COLOR_VALUE = seq(-1000, 1000, 1), color.name = my.colors(2001), stringsAsFactors = FALSE)
# all.x = TRUE keeps rows whose COLOR_VALUE is NA (their color.name is NA); note that merge() reorders rows
reg1.with.color <- merge(reg1, color.df, by = "COLOR_VALUE", all.x = TRUE)
I can't help you with the rect() plotting, I've never used it
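For the rect() part, here is a minimal sketch of one way to finish the job, assuming COLOR_VALUE always lies between -1000 and 1000. It indexes the palette directly instead of merging, so the row order is preserved. The small reg1 below is a made-up subset, and palette_2001, idx and fill are illustrative names.

```r
# Made-up subset in the same shape as the question's reg1 (assumption).
reg1 <- data.frame(
  STARTYEAR   = c(1959, 1960, 1961, 1945, 1946),
  ENDYEAR     = c(1960, 1961, 1962, 1946, 1947),
  Y_START     = c(0, 0, 0, 2, 2),
  Y_END       = c(1, 1, 1, 3, 3),
  COLOR_VALUE = c(-969, 0, 1, NA, 999)
)

my.colors <- colorRampPalette(c("blue", "white", "red"))
palette_2001 <- my.colors(2001)  # position 1 is -1000, position 2001 is +1000

# Shift each value to its palette position; NA stays NA and becomes grey.
idx <- round(reg1$COLOR_VALUE) + 1001
reg1$fill <- ifelse(is.na(idx), "grey", palette_2001[idx])

# Empty plot spanning all rectangles, then draw them in one vectorised call.
plot(range(c(reg1$STARTYEAR, reg1$ENDYEAR)),
     range(c(reg1$Y_START, reg1$Y_END)),
     type = "n", xlab = "Year", ylab = "")
rect(reg1$STARTYEAR, reg1$Y_START, reg1$ENDYEAR, reg1$Y_END, col = reg1$fill)
```

rect() accepts vectors for all four corners and for col, so no apply() loop is needed.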

Resources