Here is my code:
library(rvest)
library(dplyr)
library(tidyr)
col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>%
  html_nodes("table#tablepress-73") %>%
  html_table() %>%
  .[[1]]
new_data <- col_table %>%
  select(Year, Country, `Excess Mortality midpoint`)
new_data
I would like to arrange the years and countries in such a way that I can use them in a graph, but I can't. My objective is to reproduce this graph:
My problem is that in the "Year" column, some entries span several years for a country. For example, to show that the famine in Ireland lasted from 1846 to 1852, it says "1846–52". This is a problem because I cannot use the data in this form for a graph.
Year Country `Excess Mortality midpoint`
<chr> <chr> <chr>
1 1846–52 Ireland 1,000,000
2 1860-1 India 2,000,000
3 1863-67 Cape Verde 30,000
4 1866-7 India 961,043
5 1868 Finland 100,000
6 1868-70 India 1,500,000
7 1870–1871 Persia (now Iran) 1,000,000
8 1876–79 Brazil 750,000
9 1876–79 India 7,176,346
10 1877–79 China 11,000,000
# ... with 67 more rows
I think it's more a question of data than of R programming. You could try matching the year periods to decades. However, if a year range spans several decades, the data should be split up in some way (e.g. a simple proportional split) to accommodate that. If the chart you linked to was made with this data, some assumptions had to be made to adjust the data; without knowing those assumptions you won't be able to reproduce the chart exactly.
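Whatever split you settle on, the first step is turning strings like "1846–52" into numeric start and end years. Here is a minimal base R sketch of that step. The function name parse_year_range is made up for illustration, and it assumes every entry is either a single year or two years separated by a hyphen or en dash, with the end year possibly abbreviated:

```r
# Hypothetical helper: split "1846–52", "1860-1", "1868" into start/end years.
parse_year_range <- function(x) {
  x <- trimws(gsub("\u2013", "-", x))            # en dash -> plain hyphen
  parts <- strsplit(x, "-", fixed = TRUE)
  start_chr <- vapply(parts, function(p) p[1], character(1))
  end_chr <- vapply(parts, function(p) p[length(p)], character(1))
  # An abbreviated end year borrows its leading digits from the start year
  n_missing <- nchar(start_chr) - nchar(end_chr)
  end_full <- paste0(substr(start_chr, 1, n_missing), end_chr)
  start <- as.integer(start_chr)
  end <- as.numeric(end_full)
  # If borrowing put the end before the start (e.g. "1899-01"), roll it forward
  rolled <- end < start
  end[rolled] <- end[rolled] + 10^nchar(end_chr[rolled])
  data.frame(start = start, end = as.integer(end))
}

ranges <- parse_year_range(c("1846\u201352", "1860-1", "1868", "1870\u20131871"))
ranges$duration <- ranges$end - ranges$start + 1
```

With start and end as integers, you can assign each famine to a decade (for example with start %/% 10 * 10) or spread the mortality figure over the years it covers; which of those is right depends on the assumptions mentioned above.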
I have a tibble called master_table that is 488 rows by 9 variables. The two relevant variables to the problem are wrestler_name and reign_begin. There are multiple repeats of certain wrestler_name values. I want to reduce the rows to only be one instance of each unique wrestler_name value, decided by the earliest reign_begin date value. Sample tibble is linked below:
So, in this slice of the tibble, the end goal would be to have just five rows instead of eleven, and the single Ric Flair row, for example, would have a reign_begin date of 9/17/1981, the earliest of the four Ric Flair reign_begin values.
I thought that group_by would make the most sense, but the more I think about it and try to use it, the more I think it might not be the right path. Here are some things I tried:
group_by(wrestler_name) %>%
tibbletime::filter_time(reign_begin, 'start' ~ 'start')
#Trying to get it to just filter the first date it finds for each wrestler_name group, but did not work
master_table_2 <- master_table %>%
group_by(wrestler_name) %>%
filter(reign_begin)
#I know that this would not work, but it's the place I'm basically stuck
edit: Per request, here is the head(master_table), which contains slightly different data, but it still expresses the issue:
1 Ric Flair NWA World Heavyweight Championship 40 8 69 1991-01-11 1991-03-21
2 Sting NWA World Heavyweight Championship 39 1 188 1990-07-07 1991-01-11
3 Ric Flair NWA World Heavyweight Championship 38 7 426 1989-05-07 1990-07-07
4 Ricky Steamboat NWA World Heavyweight Championship 37 1 76 1989-02-20 1989-05-07
5 Ric Flair NWA World Heavyweight Championship 36 6 452 1987-11-26 1989-02-20
6 Ronnie Garvin NWA World Heavyweight Championship 35 1 62 1987-09-25 1987-11-26
city_state country
1 East Rutherford, New Jersey USA
2 Baltimore, Maryland USA
3 Nashville, Tennessee USA
4 Chicago, Illinois USA
5 Chicago, Illinois USA
6 Detroit, Michigan USA
The common way to do this for databases involves a join:
# Earliest reign_begin per wrestler
earliest <- master_table %>%
  group_by(wrestler_name) %>%
  summarise(reign_begin = min(reign_begin))
# Keep only the rows that match those earliest dates
master_table_2 <- master_table %>%
  inner_join(earliest, by = c("wrestler_name", "reign_begin"))
No filter is required, as an inner join keeps only the rows that match in both tables.
The above approach is often required for databases because of how they calculate summaries. But as @Martin_Gal suggests, R can handle this differently because it holds the data in memory:
master_table_2 <- master_table %>%
group_by(wrestler_name) %>%
filter(reign_begin == min(reign_begin))
You may also find the lubridate package helpful for working with dates.
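If you have dplyr 1.0.0 or later, slice_min() is another option. Unlike the filter() version, with_ties = FALSE guarantees exactly one row per wrestler even when two reigns share the same start date. A sketch on a few made-up rows in the shape of master_table (only the two relevant columns are included):

```r
library(dplyr)

# Hypothetical miniature of master_table
master_table <- tibble(
  wrestler_name = c("Ric Flair", "Sting", "Ric Flair", "Ricky Steamboat"),
  reign_begin = as.Date(c("1991-01-11", "1990-07-07", "1989-05-07", "1989-02-20"))
)

master_table_2 <- master_table %>%
  group_by(wrestler_name) %>%
  slice_min(reign_begin, n = 1, with_ties = FALSE) %>%  # earliest row per group
  ungroup()
```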
So I am trying to plot this data using gganimate:
YEAR WEEK COUNTRY CODE MARKET ARRIVALS DATE pct.chg
2020 1 Usa US CONTAINER SHIPS 347 2020-01-08 7.7639752
2020 2 Usa US CONTAINER SHIPS 395 2020-01-15 -2.2277228
2020 3 Usa US CONTAINER SHIPS 353 2020-01-22 -15.1442308
2020 4 Usa US CONTAINER SHIPS 359 2020-01-29 -11.3580247
2020 5 Usa US CONTAINER SHIPS 385 2020-02-05 0.2604167
The data is in an object called changesimp. I want to plot the arrivals over time, as you might expect. So here is the code I'm using to do that:
library(tidyverse)
library(gganimate)
library(ggthemes) # provides theme_clean()
changesimp %>%
filter(COUNTRY == "Usa") %>%
filter(YEAR == "2020") %>%
ggplot(aes(DATE, pct.chg)) +
geom_line() +
geom_point()+
labs(y="Year-over-year % change",
x="",
title="Percent change in port calls")+
theme_clean()+
transition_reveal(DATE)
This worked fine when I was just using geom_line. But when I added the geom_point part, things got a little weird and it gives me this output (this is just one frame from the animation):
What I'm trying to get is something like this, found here:
There is only one value of pct.chg per week, I have checked already. So I'm not sure why it is plotting multiple points like this. Any thoughts? Thanks.
When I use dummy data as
df <- data.frame(
COUNTRY = c(rep("Usa",26)),
YEAR = c(rep("2020",26)),
WEEK = c(1:26),
pct.chg = c(rnorm(26,0,15))
)
changesimp <- df %>% mutate(DATE=(7*WEEK+as.Date('2020-01-01', format="%Y-%m-%d")))
Your program works fine and generates the following output:
I am fairly new to R and I am pulling my hair out trying to do what is probably something super simple.
I downloaded the crime data for Los Angeles from 2010 - 2019. There are 2,114,010 rows of data. Right now, it is called 'df' in my Global Environment area.
I want to manipulate one specific column titled "Occurred" - which is a date reference to when the crime occurred.
Right now, it is set up as YYYY-MM-DD (ie., 2010-02-20).
I am trying to separate all three into individual columns. I have Googled, and Googled, and Googled and tried and tried and tried things from this forum and StackExchange and just cannot get it to work.
I have tried Lubridate and followed instructions to other answers, but it simply won't create new columns (one each for Year, Month, Day).
Here is a bit of the reprex from the dataset ... I did not include all of the different variables, because they aren't the issue.
As mentioned, I am trying to separate 'occurred' into individual Year, Month, and Day columns.
> head(df, 10)[c('dr_no','occurred','time','area_name')]
dr_no occurred time area_name
1 1307355 2010-02-20 1350 Newton
2 11401303 2010-09-12 45 Pacific
3 70309629 2010-08-09 1515 Newton
4 90631215 2010-01-05 150 Hollywood
5 100100501 2010-01-02 2100 Central
6 100100506 2010-01-04 1650 Central
7 100100508 2010-01-07 2005 Central
8 100100509 2010-01-08 2100 Central
9 100100510 2010-01-09 230 Central
10 100100511 2010-01-06 2100 Central
We can do this with tidyverse and lubridate
library(dplyr)
library(lubridate)
df <- df %>%
mutate(occurred = as.Date(occurred),
year = year(occurred), month = month(occurred), day = day(occurred))
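If you would rather not depend on lubridate, base R's format() can pull out the same pieces. A minimal sketch on two made-up rows, assuming occurred arrives as text in YYYY-MM-DD form:

```r
# Hypothetical two-row stand-in for the crime data
df <- data.frame(
  dr_no = c(1307355, 11401303),
  occurred = c("2010-02-20", "2010-09-12")
)

df$occurred <- as.Date(df$occurred)               # text -> Date
df$year  <- as.integer(format(df$occurred, "%Y"))
df$month <- as.integer(format(df$occurred, "%m"))
df$day   <- as.integer(format(df$occurred, "%d"))
```

Either way, remember to assign the result back (df <- ... or df$year <- ...); without the assignment the new columns are printed but never stored, which is a common reason for "it won't create new columns".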
I have two problems I'm trying to solve, the first issue is the main one. Hopefully I've explained the second one decently.
1) My initial issue is trying to create spatial polygon dataframe from a tibble. For example, I have a tibble that outlines U.S. states, from the urbnmapr library and I want to be able to plot spatial polygons for all 50 states. (Note: I already have made a map from these data in ggplot but I specifically want spatial polygons to plot and animate in leaflet):
> states <- urbnmapr::states
> states
# A tibble: 83,933 x 10
long lat order hole piece group state_fips state_abbv state_name fips
<dbl> <dbl> <int> <lgl> <fct> <fct> <chr> <chr> <chr> <chr>
1 -88.5 31.9 1 FALSE 1 01.1 01 AL Alabama 01
2 -88.5 31.9 2 FALSE 1 01.1 01 AL Alabama 01
3 -88.5 31.9 3 FALSE 1 01.1 01 AL Alabama 01
...
2) Once I do this, I will want to join additional data from a separate tibble to the spatial polygons by state name. What would be the best way to do that if I have different data for each year? For example, for the 50 states I have three years of data, so would I create 150 different polygons for the states across years, or have 50 state polygons that each hold all the information needed to make 3 different plots of all states for the different years?
I can suggest the following (unchecked, because the urbnmapr package is not available for my R version).
Problem 1
If you specifically want polygons, I think the best would be to join a dataframe to an object that comes from a shapefile.
If you still want to do it on your own, you need to do two things:
1) Convert your tibble into a spatial object with a point geometry
2) Aggregate points by state
The sf package can do both. For the first step (the easy one), use the st_as_sf function.
library(sf)
states
states_spat <- states %>% st_as_sf(coords = c("long", "lat")) # the columns are named long and lat
For the second step, you will need to aggregate geometries. What I can suggest gives you a MULTIPOINT geometry per state, not polygons. This thread may help with converting them into polygons.
states_spat <- states_spat %>% group_by(state_name) %>%
dplyr::summarise(x = n())
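One way to go from grouped points to polygons, sketched here on a made-up two-ring dataset rather than the real urbnmapr tibble (so untested against it): combine the points of each ring in their original order, then cast the result. This assumes each value of group identifies a single ring whose rows are already in drawing order, and it ignores holes.

```r
library(sf)
library(dplyr)

# Hypothetical stand-in for urbnmapr::states: two closed square rings
pts <- data.frame(
  long  = c(0, 1, 1, 0, 0,  2, 3, 3, 2, 2),
  lat   = c(0, 0, 1, 1, 0,  0, 0, 1, 1, 0),
  group = rep(c("01.1", "02.1"), each = 5)
)

polys <- pts %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>%
  group_by(group) %>%
  # do_union = FALSE combines the points as-is, keeping their order
  summarise(do_union = FALSE) %>%
  st_cast("POLYGON")
```

Grouping by group rather than state_name matters for states made of several pieces (islands, for instance); the pieces can be merged per state afterwards if a single MULTIPOLYGON per state is needed.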
Problem 2
That's a standard join based on an attribute shared by your data and the spatial object (e.g. a state code). merge or the *_join functions from dplyr work with sf data as they would with tibbles. You will find some pointers there.
By the way, I think joining your data to ready-made polygons is a better route than creating your own polygons from a series of points.
So I have the following data:
I have 5 regions and the years 1998-2009. What I'd like to do is classify countries by region for each year. I'm new to R, so the only step I've taken so far is the following:
finalData$Region = factor(finalData$Region, levels = c('Former Socialist Bloc', 'Independent', 'Western Europe','Scandinavia', 'Former Yugoslavia'), levels = c(1, 2, 3, 4, 5))
but I get this error:
Error in factor(finalData$Region, levels = c("Former Socialist Bloc",
: formal argument "levels" matched by multiple actual arguments
Could you please tell me how to fix this, and suggest an approach to the classification? Thank you!
This is a terribly formulated question. But I am going to imagine this is roughly what you want to do to give you an idea.
You have samples (rows) which are countries and you have some variables (columns) which are observations about these samples. You want to use all/multiple variables (multivariate analysis) to cluster the countries. If this is what you want to do, then below is one approach.
I am creating a data.frame with a dummy dataset.
dfr <- data.frame(Country=c("USA","UK","Germany","Austria","Taiwan","China","Japan","South Korea"),
                  Year=factor(c(2009,2009,2009,2010,2010,2010,2010,2011)),
                  Language=c("English","English","German","German","Chinese","Chinese","Japanese","Korean"),
                  Region=c("North America","Europe","Europe","Europe","Asia","Asia","Asia","Asia"),
                  stringsAsFactors=TRUE) # needed in R >= 4.0 so the text columns become factors
> head(dfr)
Country Year Language Region
1 USA 2009 English North America
2 UK 2009 English Europe
3 Germany 2009 German Europe
4 Austria 2010 German Europe
5 Taiwan 2010 Chinese Asia
6 China 2010 Chinese Asia
First thing you want to do is move the country names out to row names because country names are the sample labels and they are not observations.
rownames(dfr) <- dfr$Country
dfr$Country <- NULL
Now you want all the remaining variables to be numeric or factors. Do that manually and carefully. I only have categorical observations. Finally, we want to recode all factors to integers, so that our final data.frame contains only numbers.
dfr1 <- as.data.frame(sapply(dfr,as.numeric))
rownames(dfr1) <- rownames(dfr)
> head(dfr1)
Year Language Region
USA 1 2 3
UK 1 2 2
Germany 1 3 2
Austria 2 3 2
Taiwan 2 1 1
China 2 1 1
Now run a multivariate analysis. Here, for example, a PCA.
pc <- prcomp(dfr1)
pcval <- pc$x
> head(pcval)
PC1 PC2 PC3
USA -1.04369951 1.2743507 0.36120850
UK -0.87597336 0.5087910 -0.25990844
Germany 0.06243255 0.8258430 -0.39728520
Austria 0.36452903 0.2660249 0.37429849
Taiwan -1.34455665 -1.1336389 0.02793507
China -1.34455665 -1.1336389 0.02793507
Combine the output principal components with original data.
pcval1 <- cbind(pcval,dfr)
rownames(pcval1) <- rownames(dfr)
> head(pcval1)
PC1 PC2 PC3 Year Language Region
USA -1.04369951 1.2743507 0.36120850 2009 English North America
UK -0.87597336 0.5087910 -0.25990844 2009 English Europe
Germany 0.06243255 0.8258430 -0.39728520 2009 German Europe
Austria 0.36452903 0.2660249 0.37429849 2010 German Europe
Taiwan -1.34455665 -1.1336389 0.02793507 2010 Chinese Asia
China -1.34455665 -1.1336389 0.02793507 2010 Chinese Asia
What PCA is and what is going on here is clearly out of the scope of this answer. In short, it creates some new variables based on all your observed variables.
Scatterplot the principal components 1 and 2. Colour points by some variable. Say "Region". Add country names as text labels.
library(ggplot2)
pcval1$Country <- rownames(pcval1)
ggplot(pcval1,aes(x=PC1,y=PC2,colour=Region))+
geom_point(size=3)+
geom_text(aes(label=Country),hjust=1.5)+
theme_bw(base_size=15)
Now we see that countries have clustered together based on the observations in your dataset. We have the countries roughly grouping by Region. Obviously, there may or may not be any clustering depending on your data.
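PCA only re-expresses the data along new axes; it does not assign each country to a group. If you want explicit cluster labels, hierarchical clustering on the same integer-coded table is one option. A minimal sketch, with the dfr1 values typed in by hand from the factor coding above; the linkage method and k = 3 are arbitrary choices for illustration:

```r
# Integer-coded table, as produced by the recoding step above
dfr1 <- data.frame(
  Year     = c(1, 1, 1, 2, 2, 2, 2, 3),
  Language = c(2, 2, 3, 3, 1, 1, 4, 5),
  Region   = c(3, 2, 2, 2, 1, 1, 1, 1),
  row.names = c("USA", "UK", "Germany", "Austria",
                "Taiwan", "China", "Japan", "South Korea")
)

# Euclidean distances on scaled columns, then average-linkage clustering
hc <- hclust(dist(scale(dfr1)), method = "average")
clusters <- cutree(hc, k = 3)   # named vector: country -> cluster id
# plot(hc) draws the dendrogram
```

Keep in mind that integer codes put an arbitrary order on nominal categories such as Language, so the distances are only a rough guide.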
This is just an example. If you blindly follow this, you may be violating all sorts of statistical assumptions and what-not. You have to take into account what kind of data distributions you are dealing with and what clustering algorithm is suitable for that data etc.
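As for the error in the question: factor() was given levels = twice, which R rejects. The second vector was presumably meant as labels =. If the goal is simply a numeric code per region, as.integer() on the factor gives that directly. A small sketch on a made-up Region vector (the region names are the ones from the question):

```r
# Hypothetical sample of the Region column
region <- c("Independent", "Scandinavia", "Former Socialist Bloc",
            "Western Europe", "Former Yugoslavia", "Scandinavia")

region_levels <- c("Former Socialist Bloc", "Independent", "Western Europe",
                   "Scandinavia", "Former Yugoslavia")

# levels = fixes the order; each level's position becomes its integer code
region_f <- factor(region, levels = region_levels)
region_code <- as.integer(region_f)
```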