geom_area group and fill by different variables - r

I have a geom_area plot that looks like this:
The x-axis is a time series, and I want to color the fill of each facet by groups of the variable "estacion" (seasons of the year). Here's a sample of my data:
año censo estacion tipoEuro censEu censTot pCensEu
2010 2010-01-01 Invierno HA frisona 13 32 40.62500
2010 2010-01-01 Invierno Bovinos jovenes 10 32 31.25000
2010 2010-01-02 Invierno HA frisona 13 32 40.62500
---
2014 2014-12-30 Invierno Bovinos jovenes 15 26 57.69231
2014 2014-12-31 Invierno HA frisona 3 26 11.53846
2014 2014-12-31 Invierno Terneros 8 26 30.76923
Here's the code I'm using to make the plot:
ggplot(censTot1,aes(x=censo, y=pCensEu,group=tipoEuro)) +
geom_area() +
geom_line()+ facet_grid(tipoEuro ~ .)
And this is the code I intend to use, with the error it generates:
ggplot(censTot1,aes(x=censo,y=pCensEu,group=tipoEuro,fill=estacion)) +
geom_area() +
geom_line()+ facet_grid(tipoEuro ~ .)
Error: Aesthetics can not vary with a ribbon

I'm not sure about the desired output. Would this solve your problem?
library(ggplot2)
ggplot(df, aes(x=censo, y=pCensEu,group=tipoEuro))+
geom_area(aes(fill=estacion))+
geom_line()+
facet_grid(tipoEuro ~ .)
The data used
df <- structure(list(año = c(2010L, 2010L, 2010L, 2014L, 2014L, 2014L
), censo = structure(c(1L, 1L, 2L, 3L, 4L, 4L), .Label = c("01/01/2010",
"02/01/2010", "30/12/2014", "31/12/2014"), class = "factor"),
estacion = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Invierno", class = "factor"),
tipoEuro = structure(c(2L, 1L, 2L, 1L, 2L, 3L), .Label = c("Bovinos jovenes",
"HA frisona", "Terneros"), class = "factor"), censEu = c(13L,
10L, 13L, 15L, 3L, 8L), censTot = c(32L, 32L, 32L, 26L, 26L,
26L), pCensEu = c(40.625, 31.25, 40.625, 57.69231, 11.53846,
30.76923)), .Names = c("año", "censo", "estacion", "tipoEuro",
"censEu", "censTot", "pCensEu"), class = "data.frame", row.names = c(NA,
-6L))

You can also try this:
ggplot(df, aes(x=censo, y=pCensEu, color=estacion, group=interaction(tipoEuro,estacion))) +
geom_area() +
geom_line()+ facet_grid(tipoEuro ~ .)
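The error arises because a ribbon or area cannot change its fill within a single group. One possible workaround, sketched here on hypothetical toy data (the `toy` data frame and the `run` column are illustrative and not part of the original question), is to give every run of consecutive identical seasons its own group, so that the fill is constant inside each group. This assumes `censo` is a Date and the rows are sorted by date.

```r
library(ggplot2)

# Hypothetical toy data: one series that crosses from winter into spring
toy <- data.frame(
  censo    = as.Date("2010-03-18") + 0:5,
  tipoEuro = "HA frisona",
  estacion = rep(c("Invierno", "Primavera"), each = 3),
  pCensEu  = c(40, 41, 39, 42, 44, 43),
  stringsAsFactors = FALSE
)

# Number each run of consecutive identical seasons (rows must be sorted by date)
runs <- rle(toy$estacion)
toy$run <- rep(seq_along(runs$lengths), runs$lengths)

# One area per run, so fill never varies within a group;
# the line keeps a single group per series so it stays continuous
p <- ggplot(toy, aes(x = censo, y = pCensEu)) +
  geom_area(aes(fill = estacion, group = interaction(tipoEuro, run)),
            stat = "identity", position = "identity") +
  geom_line(aes(group = tipoEuro)) +
  facet_grid(tipoEuro ~ .)
```

With this grouping a small unfilled gap appears between the last date of one season and the first date of the next. With several series, the run number would need to be computed separately within each tipoEuro.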

Related

Create one line chart per country using ggplot in R

My dataset is constructed as follows:
# A tibble: 20 x 8
iso3 year Var1 Var1_imp Var2 Var2_imp Var1_type Var2_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ATG 2000 NA 144 NA 277 imputed imputed
2 ATG 2001 NA 144 NA 277 imputed imputed
3 ATG 2002 NA 144 NA 277 imputed imputed
4 ATG 2003 NA 144 NA 277 imputed imputed
5 ATG 2004 NA 144 NA 277 imputed imputed
6 ATG 2005 NA 144 NA 277 imputed imputed
7 ATG 2006 NA 144 NA 277 imputed imputed
8 ATG 2007 144 144 277 277 observed observed
9 ATG 2008 45 45 NA 301 observed imputed
10 ATG 2009 NA 71.3 NA 325 imputed imputed
11 ATG 2010 NA 97.7 NA 349 imputed imputed
12 ATG 2011 NA 124 NA 373 imputed imputed
13 ATG 2012 NA 150. NA 397 imputed imputed
14 ATG 2013 NA 177. 421 421 imputed observed
15 ATG 2014 NA 203 434 434 imputed observed
16 ATG 2015 NA 229. 422 422 imputed observed
17 ATG 2016 NA 256. 424 424 imputed observed
18 ATG 2017 282 282 429 429 observed observed
19 ATG 2018 NA 282 435 435 imputed observed
20 EGY 2000 NA 38485 NA 146761 imputed imputed
I am new to R and I would like to create a line chart for each country, with the time series for Var1_imp and Var2_imp on the same chart (I have 193 countries in my database, with data from 2000 to 2018). I want filled circles where data are observed and unfilled circles where data are imputed (based on Var1_type and Var2_type). Circles should be joined with solid lines if two consecutive data points are observed, and with dotted lines otherwise.
The main goal is to check, country by country, whether the method used to impute missing data is good or bad, depending on whether there are outliers in the time series.
I have tried the following:
ggplot(df, aes(x=year, y=Var1_imp, group=Var1_type)) +
geom_point(size=2, shape=21) + # shape = 21 for unfilled circles and shape = 19 for filled circles
geom_line(linetype = "dashed") # omit linetype for a solid line, otherwise linetype = "dashed"
I am having difficulty working out:
1/ how to do one single chart per country per variable
2/ how to include both Var1_imp and Var2_imp on the same chart
3/ how to use geom_point based on conditions (imputed versus observed in Var1_type)
4/ how to use geom_line based on conditions (plain line if two subsequent observed data points, otherwise dotted).
Thank you very much for your help - I think this exercise is not easy and I would learn a lot from your inputs.
You can use the following code (it needs dplyr, tidyr and ggplot2):
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, group=variable)) +
geom_point(size=2, shape=21) +
geom_line(linetype = "dashed") + facet_wrap(iso3~., scales = "free") +
xlab("Year") + ylab("Imp")
It is better to use colour to tell the two variables apart, like this:
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, colour=variable)) +
geom_point(size=2, shape=21) +
geom_line() + facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Update
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2),
names_to = c("group", ".value"),
names_pattern = "(.*)_(.*)") %>%
ggplot(aes(x=year, y=imp, shape = type, colour=group)) +
geom_line(aes(group = group, colour = group), size = 0.5) +
geom_point(aes(group = group, colour = group, shape = type),size=2) +
scale_shape_manual(values = c('imputed' = 21, 'observed' = 16)) +
facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
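The question also asks for a solid line only between two consecutive observed points and a dotted line otherwise. One way to approach that, sketched here on a hypothetical toy series (the `toy` and `seg` data frames are illustrative names, not part of the original answer), is to build one row per segment and map the line type to whether both endpoints are observed. It assumes the rows are sorted by year and belong to a single country and variable.

```r
library(ggplot2)

# Hypothetical toy series for one country and one variable
toy <- data.frame(
  year = 2007:2011,
  imp  = c(144, 45, 71.3, 97.7, 124),
  type = c("observed", "observed", "imputed", "imputed", "observed"),
  stringsAsFactors = FALSE
)

# One row per segment joining consecutive years
n <- nrow(toy)
seg <- data.frame(
  x    = toy$year[-n], y    = toy$imp[-n],
  xend = toy$year[-1], yend = toy$imp[-1],
  both_observed = toy$type[-n] == "observed" & toy$type[-1] == "observed"
)

# Solid segments between two observed points, dotted otherwise;
# filled circles (16) for observed, hollow circles (21) for imputed
p <- ggplot() +
  geom_segment(data = seg,
               aes(x = x, y = y, xend = xend, yend = yend,
                   linetype = both_observed)) +
  geom_point(data = toy, aes(x = year, y = imp, shape = type), size = 2) +
  scale_linetype_manual(values = c("TRUE" = "solid", "FALSE" = "dotted")) +
  scale_shape_manual(values = c(imputed = 21, observed = 16))
```

For the full dataset the segments would have to be built within each country and variable before plotting, so that no segment joins the last year of one country to the first year of the next.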
Data
df = structure(list(sl = 1:20, iso3 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L
), .Label = c("ATG", "EGY"), class = "factor"), year = c(2000L,
2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L,
2000L), Var1 = c(NA, NA, NA, NA, NA, NA, NA, 144L, 45L, NA, NA,
NA, NA, NA, NA, NA, NA, 282L, NA, NA), Var1_imp = c(144, 144,
144, 144, 144, 144, 144, 144, 45, 71.3, 97.7, 124, 150, 177,
203, 229, 256, 282, 282, 38485), Var2 = c(NA, NA, NA, NA, NA,
NA, NA, 277L, NA, NA, NA, NA, NA, 421L, 434L, 422L, 424L, 429L,
435L, NA), Var2_imp = c(277L, 277L, 277L, 277L, 277L, 277L, 277L,
277L, 301L, 325L, 349L, 373L, 397L, 421L, 434L, 422L, 424L, 429L,
435L, 146761L), Var1_type = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L), .Label = c("imputed",
"observed"), class = "factor"), Var2_type = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L), .Label = c("imputed", "observed"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
Plotting two variables at the same time in a meaningful way in a line chart is going to be a bit hard. It's easier if you use pivot_longer to create one column containing both the Var1_imp and Var2_imp values. You will then have a key column containing the names Var1_imp and Var2_imp, and a values column containing the values for those two. You can then plot using year as x and the new values column as y, with colour set to the key column. You'll then get two lines per country.
However, looking for outliers by eye in line charts for 193 countries isn't a very good idea. Use, for example,
outlier_values <- boxplot.stats(airquality$Ozone)$out
to get the outliers in a column, or something similar with sapply to cover multiple columns. Outliers are conventionally defined as values more than 1.5 * IQR beyond the quartiles, so it's easy to work out which ones they are.
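As a minimal sketch of the multi-column version, the following applies boxplot.stats to every column of a hypothetical data frame (`vals` and its columns are made up for illustration). lapply is used rather than sapply because the number of outliers can differ between columns.

```r
# Hypothetical numeric columns; column a contains one extreme value
vals <- data.frame(
  a = c(1, 2, 3, 4, 100),
  b = c(10, 11, 12, 13, 14)
)

# boxplot.stats() flags points more than 1.5 * the hinge spread beyond the hinges
outliers <- lapply(vals, function(col) boxplot.stats(col)$out)
```

Here `outliers$a` holds the value 100 and `outliers$b` is empty.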

How would I color values in a scatterplot in ggplot2 IF the variable is defining how it is plotted?

I have the following dataset:
Species Country IUCN_Area IUCN.Estimate Estimate.year
1 Reticulated Kenya Embu 0 2018
2 Reticulated Kenya Laikipia_Isiolo_Samburu 3043 2018
3 Reticulated Kenya Marsabit 625 2018
4 Reticulated Kenya Meru 999 2018
5 Reticulated Kenya Turkana 0 2018
6 Reticulated Kenya West Pokot 0 2018
GEC_Stratum_Detect_Estimate UpperCI_detect LowerCI_detect
1 130 277 -17
2 16414 19919 12910
3 57 347 -233
4 4143 6232 2054
5 0 0 0
6 0 0 0
I want to create a scatterplot which has on the x-axis "IUCN Estimate", and on the y-axis the "GEC_Stratum_Detect_Estimate". I then want to color the dots by type, i.e. "IUCN" and "GEC". However, how would I color the dots by their type, if the variables are defining the axes? I'm pretty sure there must be a simple code to layer on, but it's been stumping me so far. I've also tried rejigging the dataset but haven't managed to get anywhere. Here's the plot code:
ggplot(df, aes(x=IUCN.Estimate, y=GEC_Stratum_Detect_Estimate, shape=Species)) +
geom_point() +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
theme_classic()
And here is the data:
df <- structure(list(Species = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Maasai",
"Reticulated", "Southern"), class = "factor"), Country = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Botswana", "Kenya", "Tanzania"
), class = "factor"), IUCN_Area = structure(c(4L, 10L, 12L, 13L,
23L, 25L), .Label = c("Burigi-Biharamulo", "Central District",
"Chobe", "Embu", "Kajiado", "Katavi-Rukwa", "Kilifi", "Kitui",
"Kwale", "Laikipia_Isiolo_Samburu", "Makueni/ Machakos", "Marsabit",
"Meru", "Moremi GR", "Narok", "Ngamiland", "No IUCN Estimate",
"Nxai and Makgadikgadi", "Ruahu-Rungwa-Kisigo", "Selous-Mikumi",
"Taita Taveta", "Tana River", "Turkana", "Ugalla GR", "West Pokot"
), class = "factor"), IUCN.Estimate = c(0L, 3043L, 625L, 999L,
0L, 0L), Estimate.year = c(2018L, 2018L, 2018L, 2018L, 2018L,
2018L), GEC_Stratum_Detect_Estimate = c(130L, 16414L, 57L, 4143L,
0L, 0L), UpperCI_detect = c(277L, 19919L, 347L, 6232L, 0L, 0L
), LowerCI_detect = c(-17L, 12910L, -233L, 2054L, 0L, 0L)), row.names = c(NA,
6L), class = "data.frame")
Thank you in advance.
The following will color the dots by the IUCN_Area:
library(ggplot2)
ggplot(df, aes(x=IUCN.Estimate, y=GEC_Stratum_Detect_Estimate, shape=Species)) +
geom_point(aes(color=IUCN_Area)) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
theme_classic()
And the following colors by IUCN.Estimate. As IUCN.Estimate is numeric, ggplot uses a continuous color gradient by default, whereas the factor IUCN_Area above gets discrete colors.
library(ggplot2)
ggplot(df, aes(x=IUCN.Estimate, y=GEC_Stratum_Detect_Estimate, shape=Species)) +
geom_point(aes(color=IUCN.Estimate)) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
theme_classic()
As the OP asked to color by both IUCN and GEC, the code below does this. How it is interpreted may be another matter, but any value can be given to the color aesthetic. Here, I've added the two numbers together and wrapped the sum in as.factor(). Presumably in large datasets the sum might identify a group of note.
library(ggplot2)
ggplot(df, aes(x=IUCN.Estimate, y=GEC_Stratum_Detect_Estimate, shape=Species)) +
geom_point(aes(color=as.factor(IUCN.Estimate+GEC_Stratum_Detect_Estimate))) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
theme_classic()

Delete rows if column values are equal

I want to delete the rows whose values in the columns (YEAR, POL, CTY, ID, AMOUNT) are equal to those of another row. Please see the output table below.
Table:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 36 3500 RANGE1 L0015N20W23
2017 30408 11 36 3500 RANGE1 L00210N20W24
2017 30408 11 36 3500 RANGE1 L00310N20W25
2017 30409 11 36 3500 RANGE1 L0015N20W23
2017 30409 11 35 3500 RANGE2 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
Output:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 35 3500 RANGE1 L0015N20W23
You can try this:
no_duplicate_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")
new_df <- df[!duplicated(df[, no_duplicate_cols]), ]
The data frame new_df will hold the first row of df for each combination of those columns; later rows that repeat a combination are dropped.
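Note that !duplicated() keeps one row per combination, whereas the requested output drops every row whose combination occurs more than once. A base R sketch of both behaviours, on a hypothetical cut-down table (`toy` uses only two of the key columns for brevity):

```r
# Hypothetical cut-down version of the table, keyed on POL and ID only
toy <- data.frame(
  POL = c(30408, 30408, 30409, 30409, 30409),
  ID  = c(36, 36, 36, 35, 35),
  RAN = c("RANGE1", "RANGE1", "RANGE1", "RANGE2", "RANGE3"),
  stringsAsFactors = FALSE
)
key <- toy[, c("POL", "ID")]

# Keeps the first row of every key combination
first_only <- toy[!duplicated(key), ]

# Drops every row whose key combination appears more than once
dup_any <- duplicated(key) | duplicated(key, fromLast = TRUE)
unique_only <- toy[!dup_any, ]
```

Scanning from both ends with fromLast = TRUE marks the first occurrence of a repeated combination as well as the later ones, so only combinations that occur exactly once survive.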
If I understood the question correctly then I think you can try this
library(dplyr)
df %>%
group_by(YEAR, POL, CTY, ID, AMOUNT) %>%
filter(n() == 1)
Output (it seems the output given in the original question contains a typo):
# A tibble: 1 x 7
# Groups: YEAR, POL, CTY, ID, AMOUNT [1]
YEAR POL CTY ID AMOUNT RAN LEGAL
1 2017 30409 11 36 3500 RANGE1 L0015N20W23
#sample data
> dput(df)
structure(list(YEAR = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L), POL = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L,
30409L), CTY = c(11L, 11L, 11L, 11L, 11L, 11L, 11L), ID = c(36L,
36L, 36L, 36L, 35L, 35L, 35L), AMOUNT = c(3500L, 3500L, 3500L,
3500L, 3500L, 3500L, 3500L), RAN = structure(c(1L, 1L, 1L, 1L,
2L, 3L, 3L), .Label = c("RANGE1", "RANGE2", "RANGE3"), class = "factor"),
LEGAL = structure(c(1L, 2L, 3L, 1L, 4L, 4L, 4L), .Label = c("L0015N20W23",
"L00210N20W24", "L00310N20W25", "NANANA"), class = "factor")), .Names = c("YEAR",
"POL", "CTY", "ID", "AMOUNT", "RAN", "LEGAL"), class = "data.frame", row.names = c(NA,
-7L))

Replacing loop in dplyr R

So I am trying to write a function with dplyr without a loop, and here is something I do not know how to do.
Say we have TV stations (x, y, z) and months (2, 3). If I group by these, we get
this output, with a summarised numeric value:
TV months value
x 2 52
y 2 87
z 2 65
x 3 180
y 3 36
z 3 99
This is for the evaluated brand.
Then I have many brands that I need to filter, keeping only those whose value is >= 0.8 * the value of the evaluated brand and <= 1.2 * the value of the evaluated brand.
So, for example, from the table below I would want to keep only the first two rows, and this should be done for all month and TV combinations.
brand TV MONTH value
sdg x 2 60
sdfg x 2 55
shs x 2 120
sdg x 2 11
sdga x 2 5000
As @akrun said, you need to use a combination of merging and subsetting. Here's a base R solution.
m <- merge(df, data, by.x=c("TV", "MONTH"), by.y=c("TV", "months"))
m[m$value.x >= m$value.y*0.8 & m$value.x <= m$value.y*1.2,][,-5]
# TV MONTH brand value.x
#1 x 2 sdg 60
#2 x 2 sdfg 55
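Since the question asks about dplyr, the same merge-then-subset idea can be sketched with inner_join and filter. The data frames `ref` and `brands` and the column `value_ref` below are hypothetical stand-ins for the evaluated brand's summary and the candidate brands.

```r
library(dplyr)

# Hypothetical reference values of the evaluated brand per TV and month
ref <- data.frame(
  TV = c("x", "y"), months = c(2L, 2L), value_ref = c(52, 87),
  stringsAsFactors = FALSE
)

# Hypothetical candidate brands
brands <- data.frame(
  brand = c("sdg", "sdfg", "shs"), TV = "x", MONTH = 2L,
  value = c(60, 55, 120),
  stringsAsFactors = FALSE
)

# Attach the matching reference value, then keep brands within +/- 20% of it
kept <- brands %>%
  inner_join(ref, by = c("TV" = "TV", "MONTH" = "months")) %>%
  filter(value >= 0.8 * value_ref, value <= 1.2 * value_ref) %>%
  select(-value_ref)
```

Because the join matches on TV and month, the filter is applied to every TV and month combination in one pass, with no loop.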
Data
data <- structure(list(TV = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("x",
"y", "z"), class = "factor"), months = c(2L, 2L, 2L, 3L, 3L,
3L), value = c(52L, 87L, 65L, 180L, 36L, 99L)), .Names = c("TV",
"months", "value"), class = "data.frame", row.names = c(NA, -6L
))
df <- structure(list(brand = structure(c(2L, 1L, 4L, 2L, 3L), .Label = c("sdfg",
"sdg", "sdga", "shs"), class = "factor"), TV = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "x", class = "factor"), MONTH = c(2L,
2L, 2L, 2L, 2L), value = c(60L, 55L, 120L, 11L, 5000L)), .Names = c("brand",
"TV", "MONTH", "value"), class = "data.frame", row.names = c(NA,
-5L))

Why ggplot2 pie-chart facet confuses the facet labelling

I have two types of data that looks like this:
Type 1 (http://dpaste.com/1697615/plain/)
Cluster-6 abTcells 1456.74119
Cluster-6 Macrophages 5656.38478
Cluster-6 Monocytes 4415.69078
Cluster-6 StemCells 1752.11026
Cluster-6 Bcells 1869.37056
Cluster-6 gdTCells 1511.35291
Cluster-6 NKCells 1412.61504
Cluster-6 DendriticCells 3326.87741
Cluster-6 StromalCells 2008.20603
Cluster-6 Neutrophils 12867.50224
Cluster-3 abTcells 471.67118
Cluster-3 Macrophages 1000.98164
Cluster-3 Monocytes 712.92273
Cluster-3 StemCells 557.88648
Cluster-3 Bcells 599.94109
Cluster-3 gdTCells 492.61994
Cluster-3 NKCells 524.42522
Cluster-3 DendriticCells 647.28811
Cluster-3 StromalCells 876.27875
Cluster-3 Neutrophils 1025.24105
And type two, (http://dpaste.com/1697602/plain/).
These values are identical with Cluster-6 in type 1 above:
abTcells 1456.74119
Macrophages 5656.38478
Monocytes 4415.69078
StemCells 1752.11026
Bcells 1869.37056
gdTCells 1511.35291
NKCells 1412.61504
DendriticCells 3326.87741
StromalCells 2008.20603
Neutrophils 12867.50224
But with the type 1 data, this code:
library(ggplot2);
library(RColorBrewer);
filcol <- brewer.pal(10, "Set3")
dat <- read.table("http://dpaste.com/1697615/plain/")
ggplot(dat,aes(x=factor(1),y=dat$V3,fill=dat$V2))+
facet_wrap(~V1)+
xlab("") +
ylab("") +
geom_bar(width=1,stat="identity",position = "fill") +
scale_fill_manual(values = filcol,guide = guide_legend(title = "")) +
coord_polar(theta="y")+
theme(strip.text.x = element_text(size = 8, colour = "black", angle = 0))
Ready data:
> dput(dat)
structure(list(V1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Cluster-3",
"Cluster-6"), class = "factor"), V2 = structure(c(1L, 5L, 6L,
9L, 2L, 4L, 8L, 3L, 10L, 7L, 1L, 5L, 6L, 9L, 2L, 4L, 8L, 3L,
10L, 7L), .Label = c("abTcells", "Bcells", "DendriticCells",
"gdTCells", "Macrophages", "Monocytes", "Neutrophils", "NKCells",
"StemCells", "StromalCells"), class = "factor"), V3 = c(1456.74119,
5656.38478, 4415.69078, 1752.11026, 1869.37056, 1511.35291, 1412.61504,
3326.87741, 2008.20603, 12867.50224, 471.67118, 1000.98164, 712.92273,
557.88648, 599.94109, 492.61994, 524.42522, 647.28811, 876.27875,
1025.24105)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-20L))
generates the following figure:
Notice that the facet labels are swapped: the panel labelled Cluster-3 should be Cluster-6, where Neutrophils take up the largest proportion.
How can I resolve the problem?
With the type 2 data there is no problem at all.
library(ggplot2)
df <- read.table("http://dpaste.com/1697602/plain/");
library(RColorBrewer);
filcol <- brewer.pal(10, "Set3")
ggplot(df,aes(x=factor(1),y=V2,fill=V1))+
geom_bar(width=1,stat="identity")+coord_polar(theta="y")+
theme(axis.title = element_blank())+
scale_fill_manual(values = filcol,guide = guide_legend(title = "")) +
theme(strip.text.x = element_text(size = 8, colour = "black", angle = 0))
Ready data:
> dput(df)
structure(list(V1 = structure(c(1L, 5L, 6L, 9L, 2L, 4L, 8L, 3L,
10L, 7L), .Label = c("abTcells", "Bcells", "DendriticCells",
"gdTCells", "Macrophages", "Monocytes", "Neutrophils", "NKCells",
"StemCells", "StromalCells"), class = "factor"), V2 = c(1456.74119,
5656.38478, 4415.69078, 1752.11026, 1869.37056, 1511.35291, 1412.61504,
3326.87741, 2008.20603, 12867.50224)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-10L))
It's because you use the data frame name in aes(...). This fixes the problem.
ggplot(dat,aes(x=factor(1),y=V3,fill=V2))+
facet_wrap(~V1)+
xlab("") +
ylab("") +
geom_bar(width=1,stat="identity",position = "fill") +
scale_fill_manual(values = filcol,guide = guide_legend(title = "")) +
coord_polar(theta="y")+
theme(strip.text.x = element_text(size = 8, colour = "black", angle = 0))
In defining the facets, you reference V1 in the context of the default dataset, and ggplot sorts alphabetically by level (so "Cluster-3" comes first). In your call to aes(...) you reference dat$V3 directly, so ggplot goes out of the context of the default dataset to the original dataframe. There, Cluster-6 is first.
As a general comment, one should never reference data in aes(...) outside the context of the dataset defined with data=.... So:
ggplot(data=dat, aes(y=V3...)) # good
ggplot(data=dat, aes(y=dat$V3...)) # bad
Your problem is a perfect example of why the second option is bad.
