R ggplot barplot with people's names over it

I have a data frame like the one below, covering 4 years:
State Sex Year Name Percent
Arizona M 1962 John 0.3
Arizona F 1962 Mary 0.6
Arizona M 1963 Peter 0.4
Arizona F 1963 Jane 0.9
Arizona M 1964 Dave 0.7
Arizona F 1964 Lara 0.3
Arizona M 1965 Den 0.7
Arizona F 1965 Kate 0.2
I need a barplot with each person's name over the bars for every year, using only two colors, such as green and red.
One example is shown below:
So in my case:
the x-axis is Year
the y-axis is Percent
The numbers over the bars should be the people's names, and instead of blue I need red and green.

You can do it all in ggplot, using stat_summary to place the text as well. The key is to use cumsum to get the y-positions.
ggplot(df, aes(x=Year, y=Percent, fill=Sex)) +
geom_bar(stat='identity') +
stat_summary(aes(label=Name, order=desc(Sex)), fun.y=cumsum,
position='stack', geom='text', vjust=1)
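In more recent ggplot2 releases, fun.y has been renamed fun and the order aesthetic has been deprecated, so the call above may need adjusting. The following is a minimal sketch of an alternative, assuming a ggplot2 version that provides geom_col() and position_stack(): position_stack() places each label inside its own bar segment, and scale_fill_manual() sets the two fill colours. The data frame is rebuilt inline from the question's table (State omitted), and which sex gets which colour is an arbitrary choice.

```r
library(ggplot2)

# Data from the question, rebuilt inline (State column omitted)
df <- data.frame(
  Sex     = rep(c("M", "F"), times = 4),
  Year    = rep(1962:1965, each = 2),
  Name    = c("John", "Mary", "Peter", "Jane", "Dave", "Lara", "Den", "Kate"),
  Percent = c(0.3, 0.6, 0.4, 0.9, 0.7, 0.3, 0.7, 0.2)
)

p <- ggplot(df, aes(x = factor(Year), y = Percent, fill = Sex)) +
  geom_col() +                                   # stacked bars, one segment per person
  geom_text(aes(label = Name),
            position = position_stack(vjust = 0.5)) +  # label centred in each segment
  scale_fill_manual(values = c(M = "green", F = "red")) +
  labs(x = "Year")

print(p)
```

Setting vjust = 1 inside position_stack() would instead place each name at the top edge of its segment.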

Here is a solution. The only catch is the position of the text labels: you have to compute them beforehand. My solution assumes there are only two observations per year and that they are ordered M first, F second.
# Read the sample data from an inline string so the script also runs non-interactively
df <- read.table(text = "State Sex Year Name Percent
Arizona M 1962 John 0.3
Arizona F 1962 Mary 0.6
Arizona M 1963 Peter 0.4
Arizona F 1963 Jane 0.9
Arizona M 1964 Dave 0.7
Arizona F 1964 Lara 0.3
Arizona M 1965 Den 0.7
Arizona F 1965 Kate 0.2", header = TRUE, stringsAsFactors = FALSE)
library(ggplot2)
library(dplyr)
df <- group_by(df,Year) %>%
mutate(pos=ifelse(Sex=="M",Percent,Percent+lag(Percent)))
ggplot(df,aes(x=Year,label=Name,fill=Sex)) +
geom_bar(aes(y=Percent),stat="identity",position="stack") +
geom_text(aes(y=pos),vjust=1)
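The label positions can also be computed without relying on there being exactly two rows per year. This is a small base R sketch, assuming only that the rows within each year are already in stacking order (bottom segment first): ave() with cumsum gives the running total per year, which is the top of each stacked segment.

```r
# Data from the question, rebuilt inline (State column omitted)
df <- data.frame(
  Sex     = rep(c("M", "F"), times = 4),
  Year    = rep(1962:1965, each = 2),
  Name    = c("John", "Mary", "Peter", "Jane", "Dave", "Lara", "Den", "Kate"),
  Percent = c(0.3, 0.6, 0.4, 0.9, 0.7, 0.3, 0.7, 0.2)
)

# Running total of Percent within each Year = top edge of each stacked segment
df$pos <- ave(df$Percent, df$Year, FUN = cumsum)

print(df)
```

The resulting pos column can be passed to geom_text(aes(y = pos)) exactly as in the answer above.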

Related

Join 2 dataframes together if two columns match

I have 2 dataframes:
CountryPoints
From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1
and another dataframe with neighbouring/bordering countries:
From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy
I would like to add another column to CountryPoints called Neighbour (Y/N), depending on whether the From/To pair is found in the neighbouring/bordering countries dataframe. Is this possible? It is a kind of join, but the result should be a boolean column.
The result should be:
From.country To.Country points Neighbour
Belgium Finland 4 Y
Belgium Germany 5 Y
Malta Italy 12 Y
Malta UK 1 N
The question below shows how to merge, but it doesn't show how to add that extra boolean column.
Two alternative approaches:
1) with base R:
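# Build one key per row so that the (From, To) pair is matched as a whole,
# rather than each column being matched independently
idx <- paste(df1$From.country, df1$To.Country) %in%
paste(df2$From.country, df2$To.Country)
df1$Neighbour <- c('N','Y')[1 + idx]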
2) with data.table:
library(data.table)
setDT(df1)
setDT(df2)
df1[, Neighbour := 'N'][df2, on = .(From.country, To.Country), Neighbour := 'Y'][]
which both give (data.table-output shown):
From.country To.Country points Neighbour
1: Belgium Finland 4 Y
2: Belgium Germany 5 Y
3: Malta Italy 12 Y
4: Malta UK 1 N
Borrowing the idea from this post:
df1$Neighbour <- duplicated(rbind(df2[, 1:2], df1[, 1:2]))[ -seq_len(nrow(df2)) ]
df1
# From.country To.Country points Neighbour
# 1 Belgium Finland 4 TRUE
# 2 Belgium Germany 5 TRUE
# 3 Malta Italy 12 TRUE
# 4 Malta UK 1 FALSE
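What about something like this? Sorting each pair before pasting makes the comparison independent of direction, so Finland/Belgium would also match Belgium/Finland.
sortpaste <- function(x) paste0(sort(x), collapse = "_")
df1$Neighbour <- apply(df1[, 1:2], 1, sortpaste) %in% apply(df2[, 1:2], 1, sortpaste)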
# From.country To.Country points Neighbour
#1 Belgium Finland 4 TRUE
#2 Belgium Germany 5 TRUE
#3 Malta Italy 12 TRUE
#4 Malta UK 1 FALSE
Sample data
df1 <- read.table(text =
"From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1", header = T)
df2 <- read.table(text =
"From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy", header = T)
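Since the question frames this as a kind of join, here is a sketch of the same result using base R's merge(). It assumes the sample data above, rebuilt inline so the block runs on its own; flagged and out are illustrative names. The idea is to tag every row of the neighbour table with "Y", left-join it onto the points table, and fill the unmatched rows with "N".

```r
df1 <- data.frame(
  From.country = c("Belgium", "Belgium", "Malta", "Malta"),
  To.Country   = c("Finland", "Germany", "Italy", "UK"),
  points       = c(4, 5, 12, 1),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  From.country = c("Belgium", "Belgium", "Malta"),
  To.Country   = c("Finland", "Germany", "Italy"),
  stringsAsFactors = FALSE
)

# Tag every known neighbour pair, then left-join onto the points table
flagged <- df2
flagged$Neighbour <- "Y"
out <- merge(df1, flagged, by = c("From.country", "To.Country"), all.x = TRUE)

# Pairs with no match come back as NA; recode them as "N"
out$Neighbour[is.na(out$Neighbour)] <- "N"

print(out)
```

Note that merge() sorts the result by the key columns, so the row order may differ from df1 on other data.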

Replace NA with 0 in factor-type column

I have a continuous field (HighBAC) that I am trying to categorize based on whether the BAC falls into a certain range:
BAC >= .16 and < .21 : category 1
BAC >= .09 and < .16 : category 2
State Year high_bac
Alabama 2016 0.15
Alabama 2015 0.15
Alabama 2011 N
Alabama 2010 N
Arizona 2016 0.15
Arizona 2015 0.15
Idaho 2016 0.2
Idaho 2015 0.2
Idaho 2014 O
Idaho 2013 O
But I can't do that until I've replaced the 'N' and 'O' characters with NA. Otherwise, the cut() function won't work.
df_codes$high_bac[df_codes$high_bac=='N'|df_codes$high_bac=='O'] = NA
df_codes$high_bac = as.numeric(df_codes$high_bac)
df_codes$high_bac <- cut(df_codes$high_bac, breaks=c(.09, .16, .22), right=FALSE, labels=c(2:1))
Output:
State Year high_bac
Alabama 2016 2
Alabama 2015 2
Alabama 2011 NA
Alabama 2010 NA
Arizona 2016 2
Arizona 2015 2
Idaho 2016 1
Idaho 2015 1
Idaho 2014 NA
Idaho 2013 NA
I would then like to replace the NAs with 0s, AFTER I categorize them, because 0 is a special code (that should not be included as a category). But when I use the solution suggested in this post, I get the following warning:
df_codes$high_bac[is.na(df_codes$high_bac)] <- 0
Warning message:
In [<-.factor(*tmp*, is.na(df_codes$high_bac), value = c(1L, :
invalid factor level, NA generated
Does is.na() not work with factor-type columns? If that's the case, is there another way I can replace all the NAs with 0? Or is there a way to use cut() with non-numeric columns?
(Context: I'm categorizing states based on their threshold for super-drunk drivers. States with lower BAC thresholds get more points. But states that have no super-drunk BAC levels ('N' or 'O') get zero points.)
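One way to read the warning: is.na() does work on factors, but a factor only accepts values that are among its levels, and "0" is not one of the levels that cut() created. The sketch below, which uses a plain vector bac in place of the df_codes column (an assumption for illustration), adds "0" as a level first and then fills in the NAs.

```r
# BAC values from the question, with 'N' and 'O' already turned into NA
bac <- c(0.15, 0.15, NA, NA, 0.15, 0.15, 0.2, 0.2, NA, NA)

# Same categorisation as in the question: [.09, .16) -> "2", [.16, .22) -> "1"
bac_cat <- cut(bac, breaks = c(.09, .16, .22), right = FALSE, labels = 2:1)

# "0" must exist as a level before it can be assigned
levels(bac_cat) <- c(levels(bac_cat), "0")
bac_cat[is.na(bac_cat)] <- "0"

print(bac_cat)
```

If numeric points are needed afterwards, as.numeric(as.character(bac_cat)) converts the labels rather than the internal level codes.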

Maps, ggplot2, fill by state is missing certain areas on the map

I am working with maps and ggplot2 to visualize the number of certain crimes in each state for different years. The data set that I am working with was produced by the FBI and can be downloaded from their site or from here (if you don't want to download the dataset I don't blame you, but it is way too big to copy and paste into this question, and including a fraction of the data set wouldn't help, as there wouldn't be enough information to recreate the graph).
The problem is easier seen than described.
As you can see California is missing a large chunk as well as a few other states. Here is the code that produced this plot:
# load libraries
library(maps)
library(ggplot2)
# load data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
states <- map_data("state")
# merge data sets by region
fbi$region <- tolower(fbi$state)
fbimap <- merge(fbi, states, by="region")
# plot robbery numbers by state for year 2012
fbimap12 <- subset(fbimap, Year == 2012)
qplot(long, lat, geom="polygon", data=fbimap12,
facets=~Year, fill=Robbery, group=group)
This is what the states data looks like:
long lat group order region subregion
1 -87.46201 30.38968 1 1 alabama <NA>
2 -87.48493 30.37249 1 2 alabama <NA>
3 -87.52503 30.37249 1 3 alabama <NA>
4 -87.53076 30.33239 1 4 alabama <NA>
5 -87.57087 30.32665 1 5 alabama <NA>
6 -87.58806 30.32665 1 6 alabama <NA>
And this is what the fbi data looks like:
Year Population Violent Property Murder Forcible.Rape Robbery
1 1960 3266740 6097 33823 406 281 898
2 1961 3302000 5564 32541 427 252 630
3 1962 3358000 5283 35829 316 218 754
4 1963 3347000 6115 38521 340 192 828
5 1964 3407000 7260 46290 316 397 992
6 1965 3462000 6916 48215 395 367 992
Aggravated.Assault Burglary Larceny.Theft Vehicle.Theft abbr state region
1 4512 11626 19344 2853 AL Alabama alabama
2 4255 11205 18801 2535 AL Alabama alabama
3 3995 11722 21306 2801 AL Alabama alabama
4 4755 12614 22874 3033 AL Alabama alabama
5 5555 15898 26713 3679 AL Alabama alabama
6 5162 16398 28115 3702 AL Alabama alabama
I then merged the two sets along region. The subset I am trying to plot is
region Year Robbery long lat group
8283 alabama 2012 5020 -87.46201 30.38968 1
8284 alabama 2012 5020 -87.48493 30.37249 1
8285 alabama 2012 5020 -87.95475 30.24644 1
8286 alabama 2012 5020 -88.00632 30.24071 1
8287 alabama 2012 5020 -88.01778 30.25217 1
8288 alabama 2012 5020 -87.52503 30.37249 1
... ... ... ...
Any ideas on how I can create this plot without those ugly missing spots?
I played with your code. The problem arises when you call merge. I drew the states map using geom_path and confirmed that there were a couple of stray lines that do not exist in the original map data. I then investigated further by comparing merge and inner_join. Both perform the same join here, but with one difference: with merge the row order changed, so the values of the order column were no longer in sequence, whereas inner_join kept them in sequence. You can see this in the California rows below. Your approach was right, but merge did not work in your favour. I am not sure why the function changed the order, though.
library(dplyr)
### Call US map polygon
states <- map_data("state")
### Get crime data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
fbi$state <- tolower(fbi$state)
### Check if both files have identical state names: The answer is NO
### states$region does not have Alaska, Hawaii, and Washington D.C.
### fbi$state does not have District of Columbia.
setdiff(fbi$state, states$region)
#[1] "alaska" "hawaii" "washington d. c."
setdiff(states$region, fbi$state)
#[1] "district of columbia"
### Select data for 2012 and choose two columns (i.e., state and Robbery)
fbi2 <- fbi %>%
filter(Year == 2012) %>%
select(state, Robbery)
Now I created two data frames with merge and inner_join.
### Create two data frames with merge and inner_join
ana <- merge(fbi2, states, by.x = "state", by.y = "region")
bob <- inner_join(fbi2, states, by = c("state" ="region"))
ana %>%
filter(state == "california") %>%
slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -119.8685 38.90956 4 676 <NA>
#2 california 56521 -119.5706 38.69757 4 677 <NA>
#3 california 56521 -119.3299 38.53141 4 678 <NA>
#4 california 56521 -120.0060 42.00927 4 667 <NA>
#5 california 56521 -120.0060 41.20139 4 668 <NA>
bob %>%
filter(state == "california") %>%
slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -120.0060 42.00927 4 667 <NA>
#2 california 56521 -120.0060 41.20139 4 668 <NA>
#3 california 56521 -120.0060 39.70024 4 669 <NA>
#4 california 56521 -119.9946 39.44241 4 670 <NA>
#5 california 56521 -120.0060 39.31636 4 671 <NA>
ggplot(data = bob, aes(x = long, y = lat, fill = Robbery, group = group)) +
geom_polygon()
The problem is the order of the arguments to merge.
fbimap <- merge(fbi, states, by="region")
has the thematic data first and the geo data second. If you switch the order with
fbimap <- merge(states, fbi, by="region")
the polygons should all close up.
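Whichever way the merge is written, a defensive option is to re-sort the merged rows by the order column before plotting, since geom_polygon connects points in row order. The sketch below uses a hypothetical toy outline (shape) and value table (values) rather than the real map and FBI data, so that it runs without the maps package.

```r
# Toy stand-in for map_data("state"): two square regions with a drawing order
shape <- data.frame(
  region = rep(c("a", "b"), each = 4),
  long   = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat    = c(0, 0, 1, 1, 0, 0, 1, 1),
  group  = rep(1:2, each = 4),
  order  = 1:8
)

# Toy stand-in for the thematic data: one value per region
values <- data.frame(region = c("b", "a"), Robbery = c(10, 20))

merged <- merge(values, shape, by = "region")

# Restore the drawing order of the polygon vertices before plotting
merged <- merged[order(merged$group, merged$order), ]

print(merged)
```

With the real data, the equivalent step would be to sort fbimap12 by its order column before calling qplot.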

How to drop unused factors in faceted R ggplot boxplot?

Below is some example code I use to make some boxplots:
stest <- read.table(text=" site year conc
south 2001 5.3
south 2001 4.67
south 2001 4.98
south 2002 5.76
south 2002 5.93
north 2001 4.64
north 2001 6.32
north 2003 11.5
north 2003 6.3
north 2004 9.6
north 2004 56.11
north 2004 63.55
north 2004 61.35
north 2005 67.11
north 2006 39.17
north 2006 43.51
north 2006 76.21
north 2006 158.89
north 2006 122.27
", header=TRUE)
require(ggplot2)
ggplot(stest, aes(x=year, y=conc)) +
geom_boxplot(horizontal=TRUE) +
facet_wrap(~site, ncol=1) +
coord_flip() +
scale_y_log10()
Which results in this:
I tried everything I could think of but cannot make a plot where the south facet only contains years where data is displayed (2001 and 2002). Is what I am trying to do possible?
Here is a link (DEAD) to the screenshot showing what I want to achieve:
Use the scales='free_x' argument to facet_wrap. But I suspect you'll need to do more than that to get the plot you're looking for.
Specifically, use aes(x=factor(year), y=conc) in your initial ggplot call, so that year is treated as a discrete variable.
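Putting the two hints together might look like the sketch below. It assumes ggplot2 3.3.0 or later, where geom_boxplot draws horizontal boxes when the discrete variable is mapped to y; coord_flip is therefore dropped and the free scale becomes free_y. The data are the question's, rebuilt inline.

```r
library(ggplot2)

stest <- data.frame(
  site = c(rep("south", 5), rep("north", 14)),
  year = c(2001, 2001, 2001, 2002, 2002,
           2001, 2001, 2003, 2003, 2004, 2004, 2004, 2004,
           2005, 2006, 2006, 2006, 2006, 2006),
  conc = c(5.3, 4.67, 4.98, 5.76, 5.93,
           4.64, 6.32, 11.5, 6.3, 9.6, 56.11, 63.55, 61.35,
           67.11, 39.17, 43.51, 76.21, 158.89, 122.27)
)

# year as a factor makes the axis discrete; a free y scale lets each
# facet keep only the years that actually occur at that site
p <- ggplot(stest, aes(x = conc, y = factor(year))) +
  geom_boxplot() +
  facet_wrap(~site, ncol = 1, scales = "free_y") +
  scale_x_log10() +
  labs(y = "year")

print(p)
```

All facets are drawn at the same height, so the two south boxes will be taller than the five north ones.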
A simple way to circumvent your problem (with a fairly good result):
generate the two boxplots separately and then join them together using the grid.arrange function of the gridExtra package.
library(ggplot2)
library(gridExtra)
# One plot per site, so each keeps only its own years
p1 <- ggplot(subset(stest,site=="north"), aes(x=factor(year), y=conc)) +
geom_boxplot() + coord_flip() + scale_y_log10(name="")
p2 <- ggplot(subset(stest,site=="south"), aes(x=factor(year), y=conc)) +
geom_boxplot() + coord_flip() +
scale_y_log10(name="X Title",breaks=seq(4,6,by=.5))
# Stack the two plots in a single column
grid.arrange(p1, p2, ncol=1)

Change colour scheme for ggplot geom_polygon in R

I'm creating a map using the maps library and ggplot's geom_polygon. I'd simply like to change the default blue, red, purple colour scheme to something else. I'm extremely new to ggplot, so please forgive me if I'm just not using the right data types. Here's what the data I'm using looks like:
> head(m)
region long lat group order subregion Group.1 debt.to.income.ratio.mean ratio total
17 alabama -87.46201 30.38968 1 1 <NA> alabama 12.4059 20.51282 39
18 alabama -87.48493 30.37249 1 2 <NA> alabama 12.4059 20.51282 39
19 alabama -87.52503 30.37249 1 3 <NA> alabama 12.4059 20.51282 39
20 alabama -87.53076 30.33239 1 4 <NA> alabama 12.4059 20.51282 39
21 alabama -87.57087 30.32665 1 5 <NA> alabama 12.4059 20.51282 39
22 alabama -87.58806 30.32665 1 6 <NA> alabama 12.4059 20.51282 39
> head(v)
Group.1 debt.to.income.ratio.mean ratio region total
alabama alabama 12.40590 20.51282 alabama 39
alaska alaska 11.05333 33.33333 alaska 6
arizona arizona 11.62867 25.55556 arizona 90
arkansas arkansas 11.90300 5.00000 arkansas 20
california california 11.00183 32.59587 california 678
colorado colorado 11.55424 30.43478 colorado 92
Here's the code:
library(ggplot2)
library(maps)
states <- map_data("state")
m <- merge(states, v, by="region")
m <- m[order(m$order),]
p<-qplot(long, lat, data=m, group=group, fill=ratio, geom="polygon")
I've tried the below and more:
cols <- c("8" = "red","4" = "blue","6" = "darkgreen", "10" = "orange")
p + scale_colour_manual(values = cols)
p + scale_colour_brewer(palette="Set1")
p + scale_color_manual(values=c("#CC6666", "#9999CC"))
The problem is that you are using a color scale but are using the fill aesthetic in the plot. You can use scale_fill_gradient() for two colors and scale_fill_gradient2() for three colors:
p + scale_fill_gradient(low = "pink", high = "green") #UGLY COLORS!!!
I was getting issues with scale_fill_brewer() complaining about a continuous variable supplied when a discrete variable was expected. One easy fix is to create discrete bins with cut() and then use that as the fill aesthetic:
m$breaks <- cut(m$ratio, 5) #Change to number of bins you want
p <- qplot(long, lat, data = m, group = group, fill = breaks, geom = "polygon")
p + scale_fill_brewer(palette = "Blues")
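To see what cut() produces before it is used as a fill aesthetic, here is a small base R sketch using the six ratio values shown in head(v). The explicit break points are an assumption chosen for illustration; cut(ratio, 5), as above, would instead split the observed range into five equal-width bins.

```r
# ratio values from head(v) in the question
ratio <- c(20.51282, 33.33333, 25.55556, 5.00000, 32.59587, 30.43478)

# Explicit break points give readable, fixed bins (right-closed by default)
bins <- cut(ratio, breaks = c(0, 10, 20, 30, 40))

print(table(bins))
```

The result is a factor, which is why scale_fill_brewer() accepts it where the continuous ratio column was rejected.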
