Why are multiple points showing in this gganimate plot? - r

So I am trying to plot this data using gganimate:
YEAR WEEK COUNTRY CODE MARKET ARRIVALS DATE pct.chg
2020 1 Usa US CONTAINER SHIPS 347 2020-01-08 7.7639752
2020 2 Usa US CONTAINER SHIPS 395 2020-01-15 -2.2277228
2020 3 Usa US CONTAINER SHIPS 353 2020-01-22 -15.1442308
2020 4 Usa US CONTAINER SHIPS 359 2020-01-29 -11.3580247
2020 5 Usa US CONTAINER SHIPS 385 2020-02-05 0.2604167
The data is in an object called changesimp. I want to plot the arrivals over time, as you might expect. So here is the code I'm using to do that:
library(tidyverse)
library(gganimate)
changesimp %>%
filter(COUNTRY == "Usa") %>%
filter(YEAR == "2020") %>%
ggplot(aes(DATE, pct.chg)) +
geom_line() +
geom_point()+
labs(y="Year-over-year % change",
x="",
title="Percent change in port calls")+
theme_clean()+
transition_reveal(DATE)
This worked fine when I was just using geom_line. But when I added the geom_point part then things got a little weird and it give me this output (this is just one frame from the animation):
What I'm trying to get is something like this, found here:
There is only one value of pct.chg per week, I have checked already. So I'm not sure why it is plotting multiple points like this. Any thoughts? Thanks.

When I use dummy data as
df <- data.frame(
COUNTRY = c(rep("Usa",26)),
YEAR = c(rep("2020",26)),
WEEK = c(1:26),
pct.chg = c(rnorm(26,0,15))
)
changesimp <- df %>% mutate(DATE=(7*WEEK+as.Date('2020-01-01', format="%Y-%m-%d")))
Your program works fine and generates the following output:

Related

How to plot total cumulative row count over time ggplot

All I'm trying to do is plot a cumulative row count (so that by 2021 the graph line has reached 73) over time. I'm quite new to r, and I feel like this is really easy so I don't know why it's not really working.
My data looks like this:
ID name year
73 name73 2021
72 name72 2021
71 name71 2019
70 name70 2017
69 name69 2015
68 name68 2015
I've tried this code and it kind of works but sometimes the line goes down which doesn't seem right, since I just want a cumulative count.
ggplot(df, aes(x=year, y=ID)) +
geom_line()
Any help would be much appreciated!
Order the data by year and ID before plotting and it will go from the first year to the last and within year the smaller ID first.
x <- 'ID name year
73 name73 2021
72 name72 2021
71 name71 2019
70 name70 2017
69 name69 2015
68 name68 2015'
df <- read.table(textConnection(x), header = TRUE)
library(ggplot2)
i <- order(df$year, df$ID)
ggplot(df[i,], aes(x=year, y=ID)) +
geom_line()
Created on 2022-07-08 by the reprex package (v2.0.1)
An alternative, that I do not know is what the question is asking for, is to aggregate the IDs by year keeping the maximum in each year.
The code below does this and pipes to the plot directly, without creating an extra data set in the global environment.
aggregate(ID ~ year, df, max) |>
ggplot(aes(x=year, y=ID)) +
geom_line()
Created on 2022-07-08 by the reprex package (v2.0.1)

How to reproduce this graph?

Here is my code ;
library(rvest)
library(dplyr)
library(tidyr)
col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>% html_nodes("table#tablepress-73") %>%
html_table() %>% . [[1]]
new_data <- col_table %>%
select(Year, Country, `Excess Mortality midpoint`)
new_data
I would like to arrange the years and countries in such a way that I can use them in a graph but I can't. My objective is to reproduce this graph :
My problem is that in the "year" column, some data last several years for a country. For example to show that the famine lasted from 1846 to 1852 in Ireland it says "1846-52" and this is a problem because I cannot use the data in this form for a graph.
Year Country `Excess Mortality midpoint`
<chr> <chr> <chr>
1 1846–52 Ireland 1,000,000
2 1860-1 India 2,000,000
3 1863-67 Cape Verde 30,000
4 1866-7 India 961,043
5 1868 Finland 100,000
6 1868-70 India 1,500,000
7 1870–1871 Persia (now Iran) 1,000,000
8 1876–79 Brazil 750,000
9 1876–79 India 7,176,346
10 1877–79 China 11,000,000
# ... with 67 more rows
I think it's more of a question of data than R programming, you could try matching the year periods to the decades. However if a year range spans several decades the data should be 'split up' in some way (e.g. do a simple proportional split) to accommodate that. If the chart you linked to is made with this data, some assumptions had to made to adjust the data, without knowing those assumptions you won't be able to reproduce the chart.

geom_line is behaving strangely

i'm having an issue with a pretty simple visualization.
i'm just trying to do a simple time series plot of some very clean data, which looks like the following. (it is from the Fatalities dataset from the AER package.)
Fatalities %>%
select(year, state, fatal_rate) %>%
filter(state %in% c('ca', 'az'))
year state fatal_rate
8 1982 az 2.49914
9 1983 az 2.26738
10 1984 az 2.82878
11 1985 az 2.80201
12 1986 az 3.07106
13 1987 az 2.76728
14 1988 az 2.70565
22 1982 ca 1.86194
23 1983 ca 1.80672
24 1984 ca 1.94611
25 1985 ca 1.88128
26 1986 ca 1.94548
27 1987 ca 1.98966
28 1988 ca 1.90365
when i plot it i almost get what i want, which is one line plot for each state, but there is this problem of the lines for different states connecting to each other for some reason. it's always one state connecting to another state from the end of its time series at the beginning of the other state's time series. if i add more states it just looks messier, and the result is the same: lots of different colored lines connected to each other.
anyone know how i can get ggplot2 to stop doing this? and any idea why this is happening so i can avoid such issues in the future? thank you in advance for any advice.
Fatalities %>%
select(year, state, fatal_rate) %>%
filter(state %in% c('ca', 'az')) %>%
ggplot(aes(year, fatal_rate)) +
geom_line(aes(color = state), group = 1) +
theme_bw()
I would just make state a grouper rather than have them as the same group as you are doing this. ggplot will then provide the behaviour you desire:
library(tidyverse)
data(Fatalities, package = "AER")
Fatalities %>%
select(year, state, afatal) %>%
filter(state %in% c('ca', 'az')) %>%
ggplot(aes(year, afatal, group = state)) +
geom_line(aes(color = state)) +
theme_bw()
This way it knows that these are two separate time series.

RStudio: Separate YYYY-MM-DD into Individual Columns

I am fairly new to R and I am pulling my hair out trying to do what is probably something super simple.
I downloaded the crime data for Los Angeles from 2010 - 2019. There are 2,114,010 rows of data. Right now, it is called 'df' in my Global Environment area.
I want to manipulate one specific column titled "Occurred" - which is a date reference to when the crime occurred.
Right now, it is set up as YYYY-MM-DD (ie., 2010-02-20).
I am trying to separate all three into individual columns. I have Googled, and Googled, and Googled and tried and tried and tried things from this forum and StackExchange and just cannot get it to work.
I have tried Lubridate and followed instructions to other answers, but it simply won't create new columns (one each for Year, Month, Day).
Here is a bit of the reprex from the dataset ... I did not include all of the different variables, because they aren't the issue.
As mentioned, I am trying to separate 'occurred' into individual Year, Month, and Day columns.
> head(df, 10)[c('dr_no','occurred','time','area_name')]
dr_no occurred time area_name
1 1307355 2010-02-20 1350 Newton
2 11401303 2010-09-12 45 Pacific
3 70309629 2010-08-09 1515 Newton
4 90631215 2010-01-05 150 Hollywood
5 100100501 2010-01-02 2100 Central
6 100100506 2010-01-04 1650 Central
7 100100508 2010-01-07 2005 Central
8 100100509 2010-01-08 2100 Central
9 100100510 2010-01-09 230 Central
10 100100511 2010-01-06 2100 Central
We can do this with tidyverse and lubridate
library(dplyr)
library(lubridate)
df <- df %>%
mutate(occurred = as.Date(occurred),
year = year(occurred), month = month(occurred), day = day(occurred))

Need some tips on how R can handle this situation

I am using the csv version of the Lahman 2018 database found here: http://www.seanlahman.com/baseball-archive/statistics/.
In R, I would like to identify how many extra-base hits all Mets rookies have hit in their rookie seasons by game 95. I want to find out which Met rookie hit the most extra-base hits by game 95.
I have been experimenting with dplyr functions including select, filter, and summarize.
The main thing I am uncertain about is how to get only each Mets players' doubles, triples, and homers for the first 95 games of his first season.
This code shows more of what I have done then how I think my problem can be solved -- for that I am seeking tips.
library(dplyr)
df %>% filter(teamID=='NYN') %>%
select(c(playerID, yearID, G, 2B, 3B, HR)) %>%
group_by(playerID, yearID) %>%
summarise(xbh = sum(2B) + sum(3B)+ sum(HR)) %>%
arrange(desc(xbh))
Here is how I would like the output to appear:
Player Season 2B 3B HR XBH
x 1975 10 2 8 20
y 1980 5 5 5 15
z 2000 9 0 4 13
and so on.
I would like the XBH to be in descending order.

Resources