Plotting a graph using the ggplot2 plotting system in R

I have the following data frame
year type Measure
1 1989 NP 2107
2 2002 NP 109
3 2003 NP 159
4 2008 NP 137
5 1989 NR 522
6 2002 NR 240
7 2003 NR 248
8 2008 NR 55
9 1989 OR 346
10 2002 OR 134
11 2003 OR 130
12 2008 OR 88
13 1989 P 296
14 2002 P 569
15 2003 P 1202
16 2008 P 34
I want to plot Measure vs. year separately for each type using ggplot2. Can someone help me produce this plot? I want a single figure with one Measure vs. year subplot per type.
The output of packageDescription("ggplot2"):
packageDescription("ggplot2")
Package: ggplot2
Type: Package
Title: An Implementation of the Grammar of Graphics
Version: 1.0.1
Authors@R: c( person("Hadley", "Wickham", role = c("aut", "cre"), email
= "h.wickham@gmail.com"), person("Winston", "Chang", role =
"aut", email = "winston@stdout.org") )
Description: An implementation of the grammar of graphics in R. It
combines the advantages of both base and lattice graphics:
conditioning and shared axes are handled automatically, and you
can still build up a plot step by step from multiple data
sources. It also implements a sophisticated multidimensional
conditioning system and a consistent interface to map data to
aesthetic attributes. See http://ggplot2.org for more
information, documentation and examples.
Depends: R (>= 2.14), stats, methods
Imports: plyr (>= 1.7.1), digest, grid, gtable (>= 0.1.1), reshape2,
scales (>= 0.2.3), proto, MASS
Suggests: quantreg, Hmisc, mapproj, maps, hexbin, maptools, multcomp,
nlme, testthat, knitr, mgcv
VignetteBuilder: knitr
Enhances: sp
License: GPL-2
URL: http://ggplot2.org, https://github.com/hadley/ggplot2
BugReports: https://github.com/hadley/ggplot2/issues
LazyData: true
Collate: 'aaa-.r' 'aaa-constants.r' 'aes-calculated.r' .....
Packaged: 2015-03-16 20:29:42 UTC; winston
Author: Hadley Wickham [aut, cre], Winston Chang [aut]
Maintainer: Hadley Wickham <h.wickham@gmail.com>
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2015-03-17 17:49:38
Built: R 3.2.1; ; 2015-07-19 04:13:46 UTC; unix
-- File: /home/R/i686-pc-linux-gnu-library/3.2/ggplot2/Meta/package.rds
Output of dput(head(main_data)):
dput(head(main_data))
structure(list(Measure = c(6.532,
78.88, 0.92, 10.376, 10.859, 83.025), type = c("P", "P",
"P", "P", "P", "P"), year = c(1989L, 1989L, 1989L,
1989L, 1989L, 1989L)), .Names = c("Measure", "type", "year"), row.names = c("114288", "114296",
"114300", "114308", "114325", "114329"), class = "data.frame")

Something like this?
df <- structure(list(year = c(1989L, 2002L, 2003L, 2008L, 1989L, 2002L,
2003L, 2008L, 1989L, 2002L, 2003L, 2008L, 1989L, 2002L, 2003L,
2008L), type = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c(" NP ", " NR ",
" OR ", " P "), class = "factor"), Measure = c(2107L,
109L, 159L, 137L, 522L, 240L, 248L, 55L, 346L, 134L, 130L, 88L,
296L, 569L, 1202L, 34L)), .Names = c("year", "type", "Measure"
), class = "data.frame", row.names = c(NA, -16L))
library(ggplot2)
ggplot(df, aes(x=year, y=Measure)) +
  geom_bar(stat='identity') +
  facet_grid(. ~ type)

Related

Cut values to intervals and plot a heatmap in ggplot2

Given a dataframe as follows:
df <- structure(list(year = c(2001L, 2001L, 2001L, 2001L, 2002L, 2002L,
2002L, 2002L, 2003L, 2003L, 2003L, 2003L), quater = c(1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), value = c(4L, 23L, 14L,
12L, 6L, 22L, 45L, 12L, 34L, 15L, 3L, 40L)), class = "data.frame", row.names = c(NA,
-12L))
Out:
year quater value
0 2001 1 4
1 2001 2 23
2 2001 3 14
3 2001 4 12
4 2002 1 6
5 2002 2 22
6 2002 3 45
7 2002 4 12
8 2003 1 34
9 2003 2 15
10 2003 3 3
11 2003 4 40
How could I plot a chart similar to the plot below:
Please note that year and quater in this dataset correspond to year and week in the plot above.
I need to first cut the value column by (0, 10], (10, 20], (20, 30], (30, 40], (40, 50] then plot them.
The code I have tried:
ggplot(df, aes(quater, year, fill= value)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="red")
Out:
As you can see, the legend is different to what I need.
Thanks for your help.
You should first use cut to get the classes (as Ronak Shah already mentioned) and then you can use scale_fill_brewer to change the color of the tiles.
library(tidyverse)
df %>%
mutate(class = cut(value, seq(0, 50, 10))) %>%
ggplot(aes(quater, year, fill = class) ) +
geom_tile() +
scale_fill_brewer(type = "seq",
direction = 1,
palette = "RdPu")
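To see what the cut step produces before it reaches ggplot, here is a minimal base-R sketch using the value column from the question. The name value_class is made up for illustration. By default cut builds right-closed intervals, which matches the (0, 10], (10, 20], ... bins requested.

```r
# Values copied from the question's data frame.
value <- c(4L, 23L, 14L, 12L, 6L, 22L, 45L, 12L, 34L, 15L, 3L, 40L)

# Breaks at 0, 10, ..., 50 give five right-closed intervals.
value_class <- cut(value, breaks = seq(0, 50, 10))

levels(value_class)  # the interval labels that end up in the legend
table(value_class)   # how many observations fall in each interval
```

Because the result is a factor, ggplot treats it as discrete, which is why a discrete scale such as scale_fill_brewer applies here rather than scale_fill_gradient.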

Create one line chart per country using ggplot in R

My dataset is constructed as follows:
# A tibble: 20 x 8
iso3 year Var1 Var1_imp Var2 Var2_imp Var1_type Var2_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ATG 2000 NA 144 NA 277 imputed imputed
2 ATG 2001 NA 144 NA 277 imputed imputed
3 ATG 2002 NA 144 NA 277 imputed imputed
4 ATG 2003 NA 144 NA 277 imputed imputed
5 ATG 2004 NA 144 NA 277 imputed imputed
6 ATG 2005 NA 144 NA 277 imputed imputed
7 ATG 2006 NA 144 NA 277 imputed imputed
8 ATG 2007 144 144 277 277 observed observed
9 ATG 2008 45 45 NA 301 observed imputed
10 ATG 2009 NA 71.3 NA 325 imputed imputed
11 ATG 2010 NA 97.7 NA 349 imputed imputed
12 ATG 2011 NA 124 NA 373 imputed imputed
13 ATG 2012 NA 150. NA 397 imputed imputed
14 ATG 2013 NA 177. 421 421 imputed observed
15 ATG 2014 NA 203 434 434 imputed observed
16 ATG 2015 NA 229. 422 422 imputed observed
17 ATG 2016 NA 256. 424 424 imputed observed
18 ATG 2017 282 282 429 429 observed observed
19 ATG 2018 NA 282 435 435 imputed observed
20 EGY 2000 NA 38485 NA 146761 imputed imputed
I am new to R and I would like to create a line chart for each country showing the time series for Var1_imp and Var2_imp on the same chart (I have 193 countries in my database, with data from 2000 to 2018). Observed data points should be filled circles and imputed ones unfilled circles (based on Var1_type and Var2_type). Circles should be joined with solid lines if two consecutive data points are observed, and with dotted lines otherwise.
The main goal is to check country by country if the method used to impute missing data is good or bad, depending on whether there are outliers in time series.
I have tried the following:
ggplot(df, aes(x=year, y=Var1_imp, group=Var1_type)) +
  geom_point(size=2, shape=21) + # shape = 21 for unfilled circles and shape = 19 for filled circles
  geom_line(linetype = "dashed") # omit linetype for a solid line, otherwise linetype = "dashed"
I am having difficulty working out:
1/ how to do one single chart per country per variable
2/ how to include both Var1_imp and Var2_imp on the same chart
3/ how to use geom_point based on conditions (imputed versus observed in Var1_type)
4/ how to use geom_line based on conditions (plain line if two subsequent observed data points, otherwise dotted).
Thank you very much for your help - I think this exercise is not easy and I would learn a lot from your inputs.
You can use the following code:
library(tidyverse)
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, group=variable)) +
geom_point(size=2, shape=21) +
geom_line(linetype = "dashed") + facet_wrap(iso3~., scales = "free") +
xlab("Year") + ylab("Imp")
It is better to distinguish the variables by colour, like this:
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, colour=variable)) +
geom_point(size=2, shape=21) +
geom_line() + facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Update
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2),
names_to = c("group", ".value"),
names_pattern = "(.*)_(.*)") %>%
ggplot(aes(x=year, y=imp, shape = type, colour=group)) +
geom_line(aes(group = group, colour = group), size = 0.5) +
geom_point(aes(group = group, colour = group, shape = type),size=2) +
scale_shape_manual(values = c('imputed' = 21, 'observed' = 16)) +
facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Data
df = structure(list(sl = 1:20, iso3 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L
), .Label = c("ATG", "EGY"), class = "factor"), year = c(2000L,
2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L,
2000L), Var1 = c(NA, NA, NA, NA, NA, NA, NA, 144L, 45L, NA, NA,
NA, NA, NA, NA, NA, NA, 282L, NA, NA), Var1_imp = c(144, 144,
144, 144, 144, 144, 144, 144, 45, 71.3, 97.7, 124, 150, 177,
203, 229, 256, 282, 282, 38485), Var2 = c(NA, NA, NA, NA, NA,
NA, NA, 277L, NA, NA, NA, NA, NA, 421L, 434L, 422L, 424L, 429L,
435L, NA), Var2_imp = c(277L, 277L, 277L, 277L, 277L, 277L, 277L,
277L, 301L, 325L, 349L, 373L, 397L, 421L, 434L, 422L, 424L, 429L,
435L, 146761L), Var1_type = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L), .Label = c("imputed",
"observed"), class = "factor"), Var2_type = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L), .Label = c("imputed", "observed"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
Plotting two variables at the same time in a meaningful way in a line chart is going to be a bit hard. It's easier if you use pivot_longer to create one column containing both the Var1_imp and Var2_imp values. You will then have a key column containing the names Var1_imp and Var2_imp, and a values column containing the values for those two. You can then plot using year as x and the new values column as y, with colour set to the key column (lines ignore fill). You'll then get two lines per country.
However, looking for outliers in line charts for 193 countries isn't a very good idea. Use
outlier_values <- boxplot.stats(airquality$Ozone)$out
to get the outliers in a column, or something similar with sapply to cover multiple columns. Outliers are conventionally defined as values more than 1.5 * IQR beyond the quartiles, so it's easy to work out which ones they are.
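As a rough sketch of the multi-column version, the code below applies boxplot.stats to several columns of the built-in airquality dataset, then writes the 1.5 * IQR rule out by hand for one column. The names cols, outliers, fence, and manual are illustrative only. lapply is used instead of sapply so the result is always a list, since each column can have a different number of outliers.

```r
# airquality is a built-in example dataset (datasets package).
cols <- c("Ozone", "Solar.R", "Wind", "Temp")

# One vector of flagged values per column; boxplot.stats() ignores NAs.
outliers <- lapply(airquality[cols], function(x) boxplot.stats(x)$out)

# The same rule written out by hand for a single column.
# Note: boxplot.stats() uses hinges, which can differ slightly from quantile().
x <- airquality$Ozone
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * diff(q)
manual <- x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
```

For the country data, the same call could be applied per country (for example with split on iso3), assuming each country has enough points for quartiles to be meaningful.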

How to take an Average of + or - SD

I have data where [1] the dependent variable is measured against a controlled independent variable and [2] a further independent variable. The mean and SD are taken from [1].
(a) This is the result for the SD:
Year Species Pop_Index
1 1994 Corn Bunting 2.082483
5 1998 Corn Bunting 2.048155
10 2004 Corn Bunting 2.061617
15 2009 Corn Bunting 2.497792
20 1994 Goldfinch 1.961236
25 1999 Goldfinch 1.995600
30 2005 Goldfinch 2.101403
35 2010 Goldfinch 2.138496
40 1995 Grey Partridge 2.162136
(b) And the result of mean:
Year Species Pop_Index
1 1994 Corn Bunting 2.821668
5 1998 Corn Bunting 2.916975
10 2004 Corn Bunting 2.662797
15 2009 Corn Bunting 4.171538
20 1994 Goldfinch 3.226108
25 1999 Goldfinch 2.452807
30 2005 Goldfinch 2.954816
35 2010 Goldfinch 3.386772
40 1995 Grey Partridge 2.207708
(c) This is the Code for SD:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.0824833420524, 2.04815530904537,
2.06161673349657, 2.49779159320587, 1.96123572400404, 1.99559986715288,
2.10140285528351, 2.13849611018009, 2.1621364896722)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(d) This is the code for mean:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.82166841455814, 2.91697463618566,
2.66279663056763, 4.17153795031277, 3.22610845074252, 2.45280743991572,
2.95481600904799, 3.38677188055508, 2.20770835158744)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(e) And this is the code used to take the mean of mean Pop_Index over the years:
df2 <- aggregate(Pop_Index ~ Year, df1, mean)
(f) And this is the result:
Year Pop_Index
1 1994 3.023888
2 1995 2.207708
3 1998 2.916975
4 1999 2.452807
5 2004 2.662797
6 2005 2.954816
7 2009 4.171538
8 2010 3.386772
Now it wouldn't make sense for me to average the SDs by repeating the same procedure with the function mean or sd.
I have looked online and found someone in a similar predicament with this data:
Month: January
Week 1 Mean: 67.3 Std. Dev: 0.8
Week 2 Mean: 80.5 Std. Dev: 0.6
Week 3 Mean: 82.4 Std. Dev: 0.8
And the response:
"With equal samples size, which is what you have, the standard deviation you are looking for is:
Sqrt [ (.64 + .36 + .64) / 3 ] = 0.739369"
How would I do this in R, or is there another way of doing it? I want to plot error bars, and the dataset being plotted is like (f); it would be absurd to plot the SD of (a) against it because the vector lengths would differ.
Sample from original data.frame with a few columns and many rows not included:
structure(list(GRIDREF = structure(c(1L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("SP8816", "SP9212", "SP9322",
"SP9326", "SP9440", "SP9513", "SP9632", "SP9939", "TF7133", "TF9437"
), class = "factor"), Lat = c(51.83568688, 51.83568688, 51.79908899,
51.88880822, 51.92476157, 52.05042795, 51.80757645, 51.97818159,
52.04057068, 52.86730817, 52.89542895), Long = c(-0.724233561,
-0.724233561, -0.667258035, -0.650074995, -0.648996758, -0.630626734,
-0.62349292, -0.603710436, -0.558026241, 0.538966197, 0.882597783
), Year = c(2006L, 2007L, 1999L, 2004L, 1995L, 2009L, 2011L,
2007L, 2011L, 1996L, 2007L), Species = structure(c(4L, 7L, 5L,
10L, 4L, 6L, 8L, 3L, 2L, 9L, 1L), .Label = c("Blue Tit", "Buzzard",
"Canada Goose", "Collared Dove", "Greenfinch", "Jackdaw", "Linnet",
"Meadow Pipit", "Robin", "Willow Warbler"), class = "factor"),
Pop_Index = c(0L, 0L, 2L, 0L, 1L, 0L, 1L, 4L, 0L, 0L, 8L)), row.names = c(1L,
100L, 1000L, 2000L, 3000L, 4000L, 5000L, 6000L, 10000L, 20213L,
30213L), class = "data.frame")
A look into this data.frame:
GRIDREF Lat Long Year Species Pop_Index TempJanuary
1 SP8816 51.83569 -0.7242336 2006 Collared Dove 0 2.128387
100 SP8816 51.83569 -0.7242336 2007 Linnet 0 4.233226
1000 SP9212 51.79909 -0.6672580 1999 Greenfinch 2 5.270968
2000 SP9322 51.88881 -0.6500750 2004 Willow Warbler 0 4.826452
3000 SP9326 51.92476 -0.6489968 1995 Collared Dove 1 4.390322
4000 SP9440 52.05043 -0.6306267 2009 Jackdaw 0 2.934516
5000 SP9513 51.80758 -0.6234929 2011 Meadow Pipit 1 3.841290
6000 SP9632 51.97818 -0.6037104 2007 Canada Goose 4 7.082580
10000 SP9939 52.04057 -0.5580262 2011 Buzzard 0 3.981290
20213 TF7133 52.86731 0.5389662 1996 Robin 0 3.532903
30213 TF9437 52.89543 0.8825978 2007 Blue Tit 8 7.028710
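One way to apply the quoted formula in R is to pool the SDs within each year as the square root of the mean variance. The sketch below is a minimal illustration, not a definitive answer. It assumes equal sample sizes behind every SD in (a); with unequal sizes a weighted version would be needed. The names pool_sd and sd_df are made up here, and the values are the rounded ones shown in (a).

```r
# SDs per Year and Species, copied (rounded) from table (a).
sd_df <- data.frame(
  Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L, 2005L, 2010L, 1995L),
  Pop_Index = c(2.082483, 2.048155, 2.061617, 2.497792, 1.961236,
                1.995600, 2.101403, 2.138496, 2.162136)
)

# Pooled SD for equal sample sizes: square root of the mean variance.
pool_sd <- function(s) sqrt(mean(s^2))

# Check against the quoted example: sqrt((0.64 + 0.36 + 0.64) / 3).
pool_sd(c(0.8, 0.6, 0.8))

# One pooled SD per year, mirroring the aggregate() call used for the means in (e).
sd_by_year <- aggregate(Pop_Index ~ Year, sd_df, pool_sd)
```

The result has one row per Year, the same shape as (f), so the two can be merged by Year and used for error bars. Note that this pooled value describes the spread within species only; it ignores differences between the species means.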

Issue Merging Dataframes: Warning joining factor and character vector

I have two data frames that look like this:
> df1
county state code
ANDERSON Texas 1
ANDREWS Texas 2
ANGELINA Texas 3
....
> df2
county state citations year
ANDERSON Texas 124 2011
ANDREWS Texas 32 2011
ANGELINA Texas 491 2011
....
I have tried to merge the two of these a few different ways:
merge <- full_join(df1, df2, by = c("county", "state"))
merge <- merge(df1, df2, by = c("county", "state"))
In both cases, I receive the following warning:
Warning message:
Column `county` joining factor and character vector, coercing into
character vector
The resulting data frame does not have any data for df2, even after coercing the factor into a character. I tried it again after turning the county column into a character in both data frames and still have issues.
Here are the heads of the two data frames I am attempting to merge:
> dput(head(data))
structure(list(year = c(2011L, 2011L, 2011L, 2011L, 2011L, 2011L
), month = c(1L, 1L, 1L, 1L, 1L, 1L), county = c("ANDERSON COUNTY",
"ANGELINA COUNTY", "ARANSAS COUNTY", "ATASCOSA COUNTY", "BASTROP COUNTY",
"BELL COUNTY"), state = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Montana",
"Texas"), class = "factor"), citations = c(218L, 422L, 55L, 472L,
745L, 1403L), warnings = c(521L, 711L, 124L, 1173L, 819L, 2242L
), population = c(56760L, 82812L, 24721L, 43589L, 72248L, 276975L
), d_revenue = c(-736L, -6723L, 1134L, 71L, 2308L, 852L), crashes = c(73L,
133L, 18L, 71L, 95L, 422L), density = c(55, 108.8, 91.9, 36.8,
83.5, 295.2), unemp_rate = c(8, 8.3, 9.6, 8.5, 8.5, 8), stops =
c(739L, 1133L, 179L, 1645L, 1564L, 3645L), stops_per_cap = c(0.013019732,
0.013681592, 0.007240807, 0.037738879, 0.021647658, 0.013160032
), crashes_per_cap = c(0.001286117, 0.001606047, 0.000728126,
0.001628851, 0.001314915, 0.001523603)), .Names = c("year", "month",
"county", "state", "citations", "warnings", "population", "d_revenue",
"crashes", "density", "unemp_rate", "stops", "stops_per_cap",
"crashes_per_cap"), row.names = c(NA, 6L), class = "data.frame")
> dput(head(codes))
structure(list(county = c("ANDERSON COUNTY ", "ANDREWS COUNTY ",
"ANGELINA COUNTY ", "ARANSAS COUNTY ", "ARCHER COUNTY ", "ARMSTRONG COUNTY "
), state = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Montana",
"Texas"), class = "factor"), code = 1:6), .Names = c("county",
"state", "code"), row.names = c(NA, 6L), class = "data.frame")

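Judging from the dput output above, the county values in codes carry a trailing space ("ANDERSON COUNTY ") while those in data do not, so the keys never match exactly. The sketch below is a minimal illustration under that assumption, using a few rows retyped from the dput output. It trims whitespace with trimws before merging; codes_raw and merged are names chosen here.

```r
# A few rows retyped from the dput output; note the trailing spaces in codes_raw.
data <- data.frame(
  county = c("ANDERSON COUNTY", "ANGELINA COUNTY", "ARANSAS COUNTY"),
  state = "Texas", citations = c(218L, 422L, 55L),
  stringsAsFactors = FALSE
)
codes_raw <- data.frame(
  county = c("ANDERSON COUNTY ", "ANDREWS COUNTY ", "ANGELINA COUNTY ", "ARANSAS COUNTY "),
  state = "Texas", code = 1:4,
  stringsAsFactors = FALSE
)

# Without cleaning, no key matches, so an inner merge returns zero rows.
nrow(merge(data, codes_raw, by = c("county", "state")))

# Strip leading/trailing whitespace from the key, then merge again.
codes <- codes_raw
codes$county <- trimws(codes$county)
merged <- merge(data, codes, by = c("county", "state"))
```

The factor/character warning is a separate, harmless coercion. Converting both key columns with as.character before joining would silence it, but only the trimming changes which rows match.
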
Plot benchmark experiment results as one plot

I was benchmarking different implementations of an algorithm as part of a project and wanted to visualize this data in R succinctly. I was thinking I could assign different versions to different colors and use x axis of the plot to represent size and y axis to represent time. Please let me know how to plot these results in R. If there is any other way that would look better, I'd be happy to follow as well.
Size Version1 Version2 Version3 Version4
8192 2 1 1 1
65536 10 5 4 4
1048576 81 60 63 52
8388608 675 555 572 464
16777216 1334 1124 1171 953
33554432 2780 2348 2438 2014
67108864 5853 5229 4957 4238
134217728 12437 10303 10521 8921
Just putting @BenBolker's answer down here to close out the question. (If you would like, Ben, feel free to copy/paste this as your own.)
Here's your sample input:
mytab <- structure(list(Size = c(8192L, 65536L, 1048576L, 8388608L, 16777216L,
33554432L, 67108864L, 134217728L), Version1 = c(2L, 10L, 81L,
675L, 1334L, 2780L, 5853L, 12437L), Version2 = c(1L, 5L, 60L,
555L, 1124L, 2348L, 5229L, 10303L), Version3 = c(1L, 4L, 63L,
572L, 1171L, 2438L, 4957L, 10521L), Version4 = c(1L, 4L, 52L,
464L, 953L, 2014L, 4238L, 8921L)), .Names = c("Size", "Version1",
"Version2", "Version3", "Version4"), class = "data.frame", row.names = c(NA,
-8L))
and here's how to make the plot:
library(reshape2)
library(ggplot2)
d <- melt(mytab, "Size")
ggplot(d,aes(x=Size,y=value,colour=variable))+
geom_point()+
geom_line()+
scale_x_log10()+scale_y_log10()
which gives
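The melt step above turns the wide table into a long one with one row per Size/version pair, which is the shape ggplot needs to map colour to the version. As a rough base-R sketch of the same reshaping (an alternative for illustration, not what the answer used), stack can be applied to the version columns. Only the first three rows of the sample input and two of its version columns are used here, and the names wide and long are made up.

```r
# First three rows of the sample input, two version columns only.
wide <- data.frame(
  Size = c(8192L, 65536L, 1048576L),
  Version1 = c(2L, 10L, 81L),
  Version2 = c(1L, 5L, 60L)
)

# stack() returns columns `values` and `ind`; Size is repeated once per version column.
long <- data.frame(
  Size = rep(wide$Size, times = ncol(wide) - 1),
  stack(wide[, -1])
)

# Rename to match the column names melt() produces (value, variable).
names(long) <- c("Size", "value", "variable")
```

The resulting long data frame can be passed to the same ggplot call in place of d.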

Resources