geom_line is behaving strangely - r

I'm having an issue with a pretty simple visualization.
I'm just trying to do a simple time series plot of some very clean data, which looks like the following (it is from the Fatalities dataset in the AER package).
Fatalities %>%
select(year, state, fatal_rate) %>%
filter(state %in% c('ca', 'az'))
   year state fatal_rate
8  1982    az    2.49914
9  1983    az    2.26738
10 1984    az    2.82878
11 1985    az    2.80201
12 1986    az    3.07106
13 1987    az    2.76728
14 1988    az    2.70565
22 1982    ca    1.86194
23 1983    ca    1.80672
24 1984    ca    1.94611
25 1985    ca    1.88128
26 1986    ca    1.94548
27 1987    ca    1.98966
28 1988    ca    1.90365
When I plot it I almost get what I want, which is one line for each state, but the lines for different states connect to each other for some reason. It is always the end of one state's time series connecting to the beginning of another state's. If I add more states it just looks messier, and the result is the same: lots of differently colored lines connected to each other.
Does anyone know how I can get ggplot2 to stop doing this? And any idea why this is happening, so I can avoid such issues in the future? Thank you in advance for any advice.
Fatalities %>%
select(year, state, fatal_rate) %>%
filter(state %in% c('ca', 'az')) %>%
ggplot(aes(year, fatal_rate)) +
geom_line(aes(color = state), group = 1) +
theme_bw()

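Passing group = 1 outside aes() puts every row into a single group, so geom_line() draws one continuous path that runs from the end of one state's series into the start of the next. I would make state the grouping variable instead; ggplot will then provide the behaviour you desire: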
library(tidyverse)
data(Fatalities, package = "AER")
Fatalities %>%
select(year, state, afatal) %>%
filter(state %in% c('ca', 'az')) %>%
ggplot(aes(year, afatal, group = state)) +
geom_line(aes(color = state)) +
theme_bw()
This way it knows that these are two separate time series.
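To see the difference between the two calls, here is a small sketch that rebuilds both plots from a few of the rows printed in the question and counts the groups each line layer ends up with. The data frame `rates` and the plot names are made up for illustration; `layer_data()` is ggplot2's helper for inspecting the data a layer draws.

```r
library(ggplot2)

# A few rows copied from the question's output (illustrative subset only)
rates <- data.frame(
  year       = rep(1982:1984, times = 2),
  state      = rep(c("az", "ca"), each = 3),
  fatal_rate = c(2.49914, 2.26738, 2.82878, 1.86194, 1.80672, 1.94611)
)

# group = 1 set as a fixed parameter: every row lands in the same group
one_group <- ggplot(rates, aes(year, fatal_rate)) +
  geom_line(aes(color = state), group = 1)

# group mapped to state inside aes(): one group per state
per_state <- ggplot(rates, aes(year, fatal_rate, group = state)) +
  geom_line(aes(color = state))

# Number of distinct groups each line layer will draw
n_groups_one   <- length(unique(layer_data(one_group)$group))
n_groups_state <- length(unique(layer_data(per_state)$group))
```

Dropping group = 1 altogether also works here, because ggplot2 groups by the discrete color mapping by default.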


How to reproduce this graph?

Here is my code:
library(rvest)
library(dplyr)
library(tidyr)
col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>% html_nodes("table#tablepress-73") %>%
html_table() %>% .[[1]]
new_data <- col_table %>%
select(Year, Country, `Excess Mortality midpoint`)
new_data
I would like to arrange the years and countries so that I can use them in a graph, but I can't. My objective is to reproduce this graph:
My problem is that in the "Year" column, some entries span several years for a country. For example, to show that the famine in Ireland lasted from 1846 to 1852 it says "1846-52", and I cannot use the data in this form for a graph.
Year Country `Excess Mortality midpoint`
<chr> <chr> <chr>
1 1846–52 Ireland 1,000,000
2 1860-1 India 2,000,000
3 1863-67 Cape Verde 30,000
4 1866-7 India 961,043
5 1868 Finland 100,000
6 1868-70 India 1,500,000
7 1870–1871 Persia (now Iran) 1,000,000
8 1876–79 Brazil 750,000
9 1876–79 India 7,176,346
10 1877–79 China 11,000,000
# ... with 67 more rows
I think this is more a question of data than of R programming. You could try matching the year periods to decades. However, if a year range spans several decades, the data has to be split up in some way (e.g. a simple proportional split) to accommodate that. If the chart you linked to was made with this data, some assumptions had to be made to adjust it, and without knowing those assumptions you won't be able to reproduce the chart exactly.
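As a rough sketch of the proportional split suggested above: the helper `parse_year_range()` and the column name `deaths` are hypothetical, and spreading deaths evenly over the years of a famine is only one possible assumption, not necessarily the one behind the original chart. The three rows are copied from the table in the question.

```r
# Hypothetical helper: turn "1846-52", "1860-1", "1868" into start and end years.
parse_year_range <- function(x) {
  x <- gsub("\u2013", "-", x)                       # en dash -> plain hyphen
  parts <- strsplit(x, "-", fixed = TRUE)
  start <- vapply(parts, function(p) p[1], character(1))
  end   <- vapply(parts, function(p) p[length(p)], character(1))
  # An abbreviated end year ("52", "1") borrows its leading digits from the start year
  end <- paste0(substr(start, 1, nchar(start) - nchar(end)), end)
  data.frame(start = as.integer(start), end = as.integer(end))
}

famines <- data.frame(
  Year    = c("1846\u201352", "1860-1", "1868"),
  Country = c("Ireland", "India", "Finland"),
  deaths  = c("1,000,000", "2,000,000", "100,000")
)
famines$deaths <- as.numeric(gsub(",", "", famines$deaths))  # strip thousands separators

rng <- parse_year_range(famines$Year)

# Assumption: deaths are spread evenly over the years of each famine
per_year <- do.call(rbind, lapply(seq_len(nrow(famines)), function(i) {
  yrs <- rng$start[i]:rng$end[i]
  data.frame(decade = yrs %/% 10 * 10,
             deaths = famines$deaths[i] / length(yrs))
}))

by_decade <- aggregate(deaths ~ decade, data = per_year, FUN = sum)
```

An abbreviated range that crosses a century boundary (a hypothetical "1899-00") would be parsed incorrectly by this helper and would need separate handling.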

top_n() returning all the rows in R

Below is the head of my tibble. I am trying to find the two countries with the highest r.squared using the top_n() command. Why am I getting back the whole data frame instead of just 2 rows? Appreciate inputs.
head(model_p)
country     r.squared
<fctr>      <dbl>
Algeria     0.9522064
Argentina   0.9843108
Australia   0.9830777
Austria     0.9866741
Bangladesh  0.9485248
Belgium     0.9902805
dim(model_p)
[1] 77 2
model_p %>% top_n(n=2, wt=r.squared)
country     r.squared
<fctr>      <dbl>
Algeria     0.952206405
Argentina   0.984310769
Australia   0.983077726
Austria     0.986674082
Bangladesh  0.948524805
Belgium     0.990280511
Benin       0.963144992
Bolivia     0.992357210
Botswana    0.013649835
Brazil      0.994024334
...
1-10 of 77 rows
After some research and advice from others, I understood the problem. At some earlier point my data frame had been grouped, and top_n() works within each group, so a grouped data frame returns up to n rows per group rather than n rows overall. It is good practice to ungroup a data frame before using it for further analysis, and piping mine through ungroup() worked fine.
model_p <- model_p %>% ungroup()
model_p %>% top_n(wt=r.squared, n=2)
Since dplyr has superseded top_n() with the slice_*() functions, the two lines can be written as:
model_p <- model_p %>% ungroup()
model_p %>% slice_max(order_by=r.squared, n=2)
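The effect of grouping can be reproduced on a few of the rows shown in the question. This is only a sketch: it assumes the data frame was grouped by country, which the question does not state, and the four-row `model_p` here is a cut-down stand-in for the real 77-row tibble.

```r
library(dplyr)

# Four rows copied from the head() output in the question
model_p <- tibble(
  country   = factor(c("Algeria", "Argentina", "Australia", "Austria")),
  r.squared = c(0.9522064, 0.9843108, 0.9830777, 0.9866741)
)

# Assumption: the original tibble was still grouped by country
grouped <- model_p %>% group_by(country)

# Within each one-row group the "top 2" is that single row, so every row comes back
grouped_top <- grouped %>% slice_max(order_by = r.squared, n = 2)

# After ungroup() the ranking is done over the whole table
top2 <- grouped %>% ungroup() %>% slice_max(order_by = r.squared, n = 2)
```

Calling `group_vars(model_p)` on the real data is a quick way to check whether any grouping is still attached.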
The dplyr documentation states that top_n() has been superseded by slice_min()/slice_max(). This should work for you:
mtcars %>%
slice_max(order_by = mpg, n = 2)
Where order_by = mpg, orders the data according to mpg.
Edit:
After reviewing your question, I just realized that you are trying to sort according to a factor. Convert these to numeric, and you will be able to get top N.

Why are multiple points showing in this gganimate plot?

So I am trying to plot this data using gganimate:
YEAR WEEK COUNTRY CODE MARKET ARRIVALS DATE pct.chg
2020 1 Usa US CONTAINER SHIPS 347 2020-01-08 7.7639752
2020 2 Usa US CONTAINER SHIPS 395 2020-01-15 -2.2277228
2020 3 Usa US CONTAINER SHIPS 353 2020-01-22 -15.1442308
2020 4 Usa US CONTAINER SHIPS 359 2020-01-29 -11.3580247
2020 5 Usa US CONTAINER SHIPS 385 2020-02-05 0.2604167
The data is in an object called changesimp. I want to plot the arrivals over time, as you might expect. So here is the code I'm using to do that:
library(tidyverse)
library(gganimate)
library(ggthemes)  # provides theme_clean()
changesimp %>%
filter(COUNTRY == "Usa") %>%
filter(YEAR == "2020") %>%
ggplot(aes(DATE, pct.chg)) +
geom_line() +
geom_point()+
labs(y="Year-over-year % change",
x="",
title="Percent change in port calls")+
theme_clean()+
transition_reveal(DATE)
This worked fine when I was just using geom_line. But when I added the geom_point part, things got a little weird and it gave me this output (this is just one frame from the animation):
What I'm trying to get is something like this, found here:
There is only one value of pct.chg per week, I have checked already. So I'm not sure why it is plotting multiple points like this. Any thoughts? Thanks.
When I use dummy data as
df <- data.frame(
COUNTRY = c(rep("Usa",26)),
YEAR = c(rep("2020",26)),
WEEK = c(1:26),
pct.chg = c(rnorm(26,0,15))
)
changesimp <- df %>% mutate(DATE=(7*WEEK+as.Date('2020-01-01', format="%Y-%m-%d")))
Your program works fine and generates the following output:

Need some tips on how R can handle this situation

I am using the csv version of the Lahman 2018 database found here: http://www.seanlahman.com/baseball-archive/statistics/.
In R, I would like to identify how many extra-base hits all Mets rookies have hit in their rookie seasons by game 95. I want to find out which Met rookie hit the most extra-base hits by game 95.
I have been experimenting with dplyr functions including select, filter, and summarize.
The main thing I am uncertain about is how to get only each Mets players' doubles, triples, and homers for the first 95 games of his first season.
This code shows more of what I have done than how I think my problem can be solved; for that I am seeking tips.
library(dplyr)
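# Column names that start with a digit must be wrapped in backticks.
# Assumes the columns are literally named 2B and 3B (e.g. read with readr::read_csv);
# base read.csv() renames them to X2B and X3B by default.
df %>% filter(teamID=='NYN') %>%
select(playerID, yearID, G, `2B`, `3B`, HR) %>%
group_by(playerID, yearID) %>%
summarise(xbh = sum(`2B`) + sum(`3B`) + sum(HR)) %>%
arrange(desc(xbh))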
Here is how I would like the output to appear:
Player Season 2B 3B HR XBH
x 1975 10 2 8 20
y 1980 5 5 5 15
z 2000 9 0 4 13
and so on.
I would like the XBH to be in descending order.
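One possible starting point, sketched on made-up rows: treat a player's first season in the table as the rookie season, keep it only if it was with the Mets, and total the extra-base hits. The `batting` tibble, its player IDs and its numbers are hypothetical, and "first season in the data" is only an approximation of official rookie status. Note also that the Lahman batting table holds season totals, so the "by game 95" cutoff cannot be computed from it alone; that part would need game-level data.

```r
library(dplyr)

# Hypothetical toy rows in the shape of the Lahman batting table
batting <- tibble(
  playerID = c("x01", "x01", "y01", "y01", "z01"),
  yearID   = c(1975, 1976, 1979, 1980, 2000),
  teamID   = c("NYN", "NYN", "BOS", "NYN", "NYN"),
  G        = c(120, 140, 30, 100, 90),
  `2B`     = c(10, 20, 2, 5, 9),
  `3B`     = c(2, 1, 0, 5, 0),
  HR       = c(8, 15, 1, 5, 4)
)

rookies <- batting %>%
  group_by(playerID) %>%
  filter(yearID == min(yearID)) %>%   # assumption: first season in the data = rookie season
  ungroup() %>%
  filter(teamID == "NYN") %>%         # keep only rookie seasons played for the Mets
  group_by(playerID, yearID) %>%      # sum over stints within the season
  summarise(`2B` = sum(`2B`), `3B` = sum(`3B`), HR = sum(HR), .groups = "drop") %>%
  mutate(XBH = `2B` + `3B` + HR) %>%
  arrange(desc(XBH))
```

In this toy data player y01 is dropped because the first recorded season was with another team.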

How to create a separate histograms for different years in panel data?

I have a data frame newspaper_yearly, which contains the circulation ($CIRC) for a set of newspapers per year. I want to see how the distribution of these numbers change over time. So I want to create multiple, separate histograms for these different years.
I have tried the following:
ggplot(newspaper_yearly,aes(x=CIRC))+geom_histogram()+facet_grid(~YEAR==2004)+theme_bw()
But this shows two histograms, one where YEAR == 2004 is true and one where it is false. I only want to see the histogram for the rows where YEAR == 2004 is true.
Edit:
here's a cleaned up sample of the basic data structure:
YEAR CIRC
45938 1972 16557
10396 1900 2320
56311 2000 1195
1002 1872 1200
53335 1992 17764
7376 1896 1760
30101 1940 100651
18633 1916 11956
3171 1884 1900
54022 1992 5530
38751 1956 8006
42125 1964 10208
636 1872 1500
48706 1980 18830
22497 1924 NA
28024 1936 7211
7684 1896 21752
56087 2000 107129
43935 1968 9288
34692 1948 5083
I understand I could just make a subset like this (which is effectively the result I want), but I want to circumvent making a subset for every single year.
datahist2000 <- newspaper_yearly[ which(newspaper_yearly$YEAR == "2000"), ]
hist(datahist2000$CIRC)
Something like this might help.
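par(mfrow=c(3,3))
# unique() works whether or not YEAR is a factor; levels() returns NULL for a numeric column
for(i in sort(unique(newspaper_yearly$YEAR))){
datahist <- newspaper_yearly[which(newspaper_yearly$YEAR == i), ]
hist(datahist$CIRC, main = paste("YEAR", i))}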
I used your subset approach to solve the problem with a for loop. I don't quite know if that is what you are trying to accomplish. I presume that there are quite a few entries for 'CIRC' per year, right? Otherwise separate plots don't make much sense, at least not for the data you provided.
If I understand the question correctly, you want a histogram for each year separately? In that case you can simply do
ggplot(newspaper_yearly, aes(x = CIRC)) + geom_histogram() + facet_grid(~YEAR) + theme_bw()
In case you want to group years in a more complicated way, I recommend adding a new grouping variable, such as the following:
group_year<- function(year){
if (year >= 1900 && year < 1980) return ("1900 - 1980")
if (year >= 1980 && year < 2020) return ("1980 - 2020")
return ("default")
}
newspaper_yearly$group = sapply(newspaper_yearly$YEAR, group_year)
ggplot(newspaper_yearly, aes(x = CIRC)) + geom_histogram() + facet_grid(~group) + theme_bw()
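The same grouping can also be sketched without a row-by-row function, using base R's cut(). The five rows below are copied from the sample in the question; the break points and labels mirror the ones above, and treating everything outside them as "default" is carried over from that function.

```r
library(ggplot2)

# First five rows of the sample data from the question
newspaper_yearly <- data.frame(
  YEAR = c(1972, 1900, 2000, 1872, 1992),
  CIRC = c(16557, 2320, 1195, 1200, 17764)
)

# right = FALSE makes the intervals [1900, 1980) and [1980, 2020)
grp <- cut(newspaper_yearly$YEAR,
           breaks = c(1900, 1980, 2020),
           labels = c("1900 - 1980", "1980 - 2020"),
           right  = FALSE)

# Years outside the breaks come back as NA; label them "default" as above
newspaper_yearly$group <- ifelse(is.na(grp), "default", as.character(grp))

p <- ggplot(newspaper_yearly, aes(x = CIRC)) +
  geom_histogram(bins = 30) +
  facet_grid(~group) +
  theme_bw()
```

Because cut() is vectorised, adding another period only means adding one break point and one label.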
