Above is my dataset; it is a simple one. It shows the GDP per capita of the richest and the poorest regions in nine countries in 2000 and 2015, as well as the gap in GDP per capita between the poorest and richest regions. Below is a reproducible example of this dataset:
structure(list(Country = c("Britain", "Germany", "United States",
"France", "South Korea", "Italy", "Japan", "Spain", "Sweden"),
Poor2000 = c(69, 50, 74, 52, 79, 50, 80, 80, 90), Poor2015 = c(61,
48, 73, 50, 73, 52, 78, 84, 82), Rich2000 = c(848, 311, 290,
270, 212, 180, 294, 143, 148), Rich2015 = c(1150, 391, 310,
299, 200, 198, 290, 151, 149)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
I wanna make a plot like this:
In this plot I just want to show the GDP per capita of the poorest regions in the nine countries in 2000 and 2015 (the draft picture has only three countries for convenience). But I don't know how to do it using ggplot, because it seems I need to set the x-axis to "Country" and the y-axis to two variables, "Poor2000" and "Poor2015". I don't know how to do that. Many thanks in advance.
Here is a possible solution. Starting from your dataframe, you can first create a new dataframe by reshaping it into a longer format. To do that, I used the pivot_longer function from the tidyr package:
library(tidyr)
library(dplyr)
library(ggplot2) # needed for the plot further down
DF <- df %>% select(Country, Poor2000, Poor2015) %>%
mutate(Diff = Poor2015 - Poor2000) %>%
pivot_longer(-Country, names_to = "Poor", values_to = "value")
# A tibble: 27 x 3
Country Poor value
<fct> <chr> <dbl>
1 Britain Poor2000 69
2 Britain Poor2015 61
3 Britain Diff -8
4 Germany Poor2000 50
5 Germany Poor2015 48
6 Germany Diff -2
7 United States Poor2000 74
8 United States Poor2015 73
9 United States Diff -1
10 France Poor2000 52
# … with 17 more rows
We will also create a second dataframe that will contain the difference of values between Poor2000 and Poor2015:
DF_second_label <- df %>% select(Country, Poor2000, Poor2015) %>%
group_by(Country) %>%
mutate(Diff = Poor2015 - Poor2000, ypos = max(Poor2000,Poor2015))
# A tibble: 9 x 5
# Groups: Country [9]
Country Poor2000 Poor2015 Diff ypos
<fct> <dbl> <dbl> <dbl> <dbl>
1 Britain 69 61 -8 69
2 Germany 50 48 -2 50
3 United States 74 73 -1 74
4 France 52 50 -2 52
5 South Korea 79 73 -6 79
6 Italy 50 52 2 52
7 Japan 80 78 -2 80
8 Spain 80 84 4 84
9 Sweden 90 82 -8 90
Then, we can plot both new dataframes with ggplot2 and select only the countries of interest using the subset function:
ggplot(subset(DF, Poor != "Diff" & Country %in% c("Britain","South Korea","Sweden")),
aes(x = Country, y = value, fill = Poor))+
geom_col(position = position_dodge())+
geom_text(aes(label = value), position = position_dodge(0.9), vjust = -0.5, show.legend = FALSE)+
geom_text(inherit.aes = FALSE,
data = subset(DF_second_label, Country %in% c("Britain","South Korea","Sweden")),
aes(x = Country,
y = ypos+10,
label = Diff), color = "darkgreen", size = 6, show.legend = FALSE)+
labs(x = "", y = "GDP per Person", title = "Poor in 2000 & 2015")+
theme(plot.title = element_text(hjust = 0.5))
And you get:
Reproducible example
df <- data.frame(Country = c("Britain","Germany", "United States", "France", "South Korea", "Italy","Japan","Spain","Sweden"),
Poor2000 = c(69,50,74,52,79,50,80,80,90),
Poor2015 = c(61,48,73,50,73,52,78,84,82),
Rich2000 = c(848,311,290,270,212,180,294,143,148))
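If the differences are not needed, the reshape can be a little shorter. The following is only a sketch under that assumption: it pivots just the two Poor columns and uses names_prefix to strip "Poor", so the legend shows plain years. The names poor_long, Year, GDP and p are made up for this illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Same values as the reproducible example above (Poor columns only)
df <- data.frame(
  Country  = c("Britain", "Germany", "United States", "France", "South Korea",
               "Italy", "Japan", "Spain", "Sweden"),
  Poor2000 = c(69, 50, 74, 52, 79, 50, 80, 80, 90),
  Poor2015 = c(61, 48, 73, 50, 73, 52, 78, 84, 82)
)

# One row per country and year; "Poor2000" becomes Year = "2000"
poor_long <- df %>%
  pivot_longer(-Country, names_to = "Year", names_prefix = "Poor",
               values_to = "GDP")

# Dodged bars: one bar per year within each country
p <- ggplot(poor_long, aes(x = Country, y = GDP, fill = Year)) +
  geom_col(position = "dodge") +
  labs(x = "", y = "GDP per Person", title = "Poor in 2000 & 2015")
```

Printing p draws the chart; filtering poor_long on Country before plotting restricts it to a few countries, as in the draft picture.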
For this week's tidytuesday challenge, for some reason I am not able to group the column names in R, which I was previously doing with the pivot_longer function from tidyr. Here is my code; I don't understand why it throws an error and does not give what I want.
library(tidyverse)
tuesdata <- tidytuesdayR::tt_load(2023, week = 7)
age_gaps <- tuesdata$age_gaps
df_long <- age_gaps %>%
pivot_longer(cols= actor_1_name:actor_2_name, names_to = "actornumber", values_to = "actorname") %>%
pivot_longer(cols= character_1_gender:character_2_gender, names_to = "gendernumber", values_to = "gender") %>%
pivot_longer(cols= actor_1_age:actor_2_age, names_to = "agenumber", values_to = "age") %>%
select(movie_name, release_year, director, age_difference, actorname, gender, age)
As seen from the code, the initial data has 1155 rows, and after the quick data wrangling I expect to get 1155x2=2310 rows, since I would like to merge the columns on actor names and their related information such as age and birthdate. Yet the code does not give me the expected outcome. Why is that, and how can I solve this problem? Thank you in advance for your attention.
Example data (first 6 rows)
age_gaps <- structure(list(movie_name = c("Harold and Maude", "Venus", "The Quiet American",
"The Big Lebowski", "Beginners", "Poison Ivy"), release_year = c(1971,
2006, 2002, 1998, 2010, 1992), director = c("Hal Ashby", "Roger Michell",
"Phillip Noyce", "Joel Coen", "Mike Mills", "Katt Shea"), age_difference = c(52,
50, 49, 45, 43, 42), couple_number = c(1, 1, 1, 1, 1, 1), actor_1_name = c("Ruth Gordon",
"Peter O'Toole", "Michael Caine", "David Huddleston", "Christopher Plummer",
"Tom Skerritt"), actor_2_name = c("Bud Cort", "Jodie Whittaker",
"Do Thi Hai Yen", "Tara Reid", "Goran Visnjic", "Drew Barrymore"
), character_1_gender = c("woman", "man", "man", "man", "man",
"man"), character_2_gender = c("man", "woman", "woman", "woman",
"man", "woman"), actor_1_birthdate = structure(c(-26725, -13666,
-13442, -14351, -14629, -13278), class = "Date"), actor_2_birthdate = structure(c(-7948,
4536, 4656, 2137, 982, 1878), class = "Date"), actor_1_age = c(75,
74, 69, 68, 81, 59), actor_2_age = c(23, 24, 20, 23, 38, 17)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
You could set ".value" in names_to and supply one of names_sep or names_pattern to specify how the column names should be split.
library(tidyr)
age_gaps %>%
pivot_longer(actor_1_name:actor_2_age,
names_prefix = "(actor|character)_",
names_to = c("actor", ".value"),
names_sep = '_')
# A tibble: 12 × 10
movie_name release_year director age_difference couple_number actor name gender birthdate age
<chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <date> <dbl>
1 Harold and Maude 1971 Hal Ashby 52 1 1 Ruth Gordon woman 1896-10-30 75
2 Harold and Maude 1971 Hal Ashby 52 1 2 Bud Cort man 1948-03-29 23
3 Venus 2006 Roger Michell 50 1 1 Peter O'Toole man 1932-08-02 74
4 Venus 2006 Roger Michell 50 1 2 Jodie Whittaker woman 1982-06-03 24
5 The Quiet American 2002 Phillip Noyce 49 1 1 Michael Caine man 1933-03-14 69
6 The Quiet American 2002 Phillip Noyce 49 1 2 Do Thi Hai Yen woman 1982-10-01 20
7 The Big Lebowski 1998 Joel Coen 45 1 1 David Huddleston man 1930-09-17 68
8 The Big Lebowski 1998 Joel Coen 45 1 2 Tara Reid woman 1975-11-08 23
9 Beginners 2010 Mike Mills 43 1 1 Christopher Plummer man 1929-12-13 81
10 Beginners 2010 Mike Mills 43 1 2 Goran Visnjic man 1972-09-09 38
11 Poison Ivy 1992 Katt Shea 42 1 1 Tom Skerritt man 1933-08-25 59
12 Poison Ivy 1992 Katt Shea 42 1 2 Drew Barrymore woman 1975-02-22 17
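As for why the chained version does not give twice the rows: each separate pivot_longer call doubles the row count on its own, so every actor name ends up combined with every age instead of only its own. A small sketch of that effect, using a made-up two-row table (toy, chained and paired are hypothetical names, and names_pattern is used here in place of names_prefix plus names_sep):

```r
library(dplyr)
library(tidyr)

# Made-up miniature of the age_gaps layout
toy <- tibble(
  movie_name   = c("A", "B"),
  actor_1_name = c("a1", "b1"),
  actor_2_name = c("a2", "b2"),
  actor_1_age  = c(50, 40),
  actor_2_age  = c(20, 30)
)

# Two independent pivots: rows double twice, and names get crossed with ages
chained <- toy %>%
  pivot_longer(actor_1_name:actor_2_name,
               names_to = "actornumber", values_to = "actorname") %>%
  pivot_longer(actor_1_age:actor_2_age,
               names_to = "agenumber", values_to = "age")
nrow(chained)  # 8, i.e. 2 * 2 * 2

# One pivot with ".value": name and age stay paired per actor
paired <- toy %>%
  pivot_longer(-movie_name,
               names_to = c("actor", ".value"),
               names_pattern = "actor_(\\d)_(.*)")
nrow(paired)   # 4, i.e. 2 * 2
```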
I'm new to R and I need your help with the following problem.
I have the following dataset:
id  Country  City   Accrued_Jan  Accrued_Feb  Accrued_Mar  Paid_Jan  Paid_Feb  Paid_Mar
01  USA      NY     110          110          130          100       100       110
02  ITALY    ROME   80           90           100          70        70        90
03  FRANCE   PARIS  70           80           90           70        70        90
And the result that I want is the following:
id  Country  City   Month  Accrued  Paid
01  USA      NY     Jan    100      100
01  USA      NY     Feb    110      100
01  USA      NY     Mar    130      110
02  ITALY    ROME   Jan    80       70
02  ITALY    ROME   Feb    90       70
02  ITALY    ROME   Mar    100      90
03  FRANCE   PARIS  Jan    70       70
03  FRANCE   PARIS  Feb    80       70
03  FRANCE   PARIS  Mar    90       90
Any idea on how to do this? Maybe with pivot_longer?
I would like to add a column that identifies the month and keep the names and values of the variables "Accrued" and "Paid" in separate columns.
You could follow a tidyverse approach and chain a few tidyverse functions. One way that helps me visualise it is to think about how I would do something like this in Excel and take it forward from there. Here is a link to learn about tidyverse.
First you will have to load the library:
library(tidyverse)
And then the code would look something like this:
df <- df |> pivot_longer(-c(id,Country,City),
names_to = "type",
values_to = "amount")
df <- df |> separate(col = c(type),sep = "_",into = c("status","Month"))
df <- df |> pivot_wider(id_cols = -c(status,amount),
names_from = status,
values_from = amount)
Hope this helps. Happy learning!
Here is a tidyverse solution
library(tidyverse)
df <- dplyr::tribble(
~id, ~Country, ~City, ~Accrued_Jan, ~Accrued_Feb, ~Accrued_Mar, ~Paid_Jan, ~Paid_Feb, ~Paid_Mar,
1, "USA", "NY", 100, 110, 130, 100, 100, 110,
2, "Italy", "Rome", 80, 90, 100, 70, 70, 90,
3, "France", "Paris", 70, 80, 90, 70, 70, 90) %>%
tidyr::pivot_longer(
cols = -c(id, Country, City),
names_to = c("name", "month"),
names_sep = "_",
values_to = "value") %>%
tidyr::pivot_wider(names_from = name, values_from = value)
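Both answers go long and then wide again. As a possible shortcut, the ".value" sentinel in names_to lets pivot_longer do it in one step: the part of each column name before the underscore becomes the output column (Accrued or Paid) and the part after it becomes Month. A minimal sketch, reusing the values from the tribble above; the name long is made up here:

```r
library(tidyr)

df <- data.frame(
  id          = c("01", "02", "03"),
  Country     = c("USA", "Italy", "France"),
  City        = c("NY", "Rome", "Paris"),
  Accrued_Jan = c(100, 80, 70),
  Accrued_Feb = c(110, 90, 80),
  Accrued_Mar = c(130, 100, 90),
  Paid_Jan    = c(100, 70, 70),
  Paid_Feb    = c(100, 70, 70),
  Paid_Mar    = c(110, 90, 90)
)

# "Accrued_Jan" splits into ".value" = "Accrued" and Month = "Jan"
long <- pivot_longer(
  df,
  cols      = -c(id, Country, City),
  names_to  = c(".value", "Month"),
  names_sep = "_"
)
long
```

The result has one row per id and month, with Accrued and Paid as separate columns.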
I have census data that is listed by country and separated by wards. There is also a variable for continent. Here is a sample dataset.
df1 <- data.frame(country = c("Brazil", "Colombia", "Croatia", "France"), ward_1 = c(45, 35, 15, 80), ward_2 = c(25, 55, 10, 145), ward_23 = c(105, 65, 25, 85), continent = c("Americas", "Americas", "Europe", "Europe"))
I need to sum by continent for each of the wards. This is the output I am trying to achieve:
df2 <- data.frame(continent = c("Americas", "Europe"), ward_1 = c(80, 95), ward_2 = c(80, 155), ward_23 = c(170, 110))
I think I have to use group_by(continent) but then how do you output the sum for each ward?
What you need is summarise() after group_by().
Inside it, across() applies sum() to every column whose name starts with "ward" (selected with starts_with()).
library(dplyr)
df1 %>%
group_by(continent) %>%
summarize(across(starts_with("ward"), ~sum(.)))
# A tibble: 2 x 4
continent ward_1 ward_2 ward_23
<chr> <dbl> <dbl> <dbl>
1 Americas 80 80 170
2 Europe 95 155 110
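If some ward counts could be missing, a plain sum() would return NA for that continent. The sketch below assumes missing values should simply be ignored, which the question does not state, and checks the result against the desired df2. The name result is made up for this illustration.

```r
library(dplyr)

df1 <- data.frame(
  country   = c("Brazil", "Colombia", "Croatia", "France"),
  ward_1    = c(45, 35, 15, 80),
  ward_2    = c(25, 55, 10, 145),
  ward_23   = c(105, 65, 25, 85),
  continent = c("Americas", "Americas", "Europe", "Europe")
)
df2 <- data.frame(
  continent = c("Americas", "Europe"),
  ward_1    = c(80, 95),
  ward_2    = c(80, 155),
  ward_23   = c(170, 110)
)

# Assumption: NA counts are dropped rather than propagated
result <- df1 %>%
  group_by(continent) %>%
  summarise(across(starts_with("ward"), ~ sum(.x, na.rm = TRUE)))

# Column-by-column comparison with the desired output
all(result$ward_1 == df2$ward_1,
    result$ward_2 == df2$ward_2,
    result$ward_23 == df2$ward_23)
```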
Say you have a database like gapminder with the population per country. Even though the current year is 2021, you also have predictions for the following years to come.
location 2020.0 2021.0 2022.0
Canada 5 7 9
China 23 34 54
Congo 1 2 3
and another database like this, vaccins
location date amount_of_vaccins
Canada 2020-01-02 50
China 2021-05-03 59
Congo 2022-03-05 34
How can I merge the population of each country into the second database, so that each row gets the population for the year of its date?
I managed to merge them by country like this:
merge(gapminder,vaccins, by = "location")
but I'm getting this
location date amount_of_vaccins 2020.0 2021.0 2022.0
Canada 2020-01-02 50 5 7 9
China 2021-05-03 59 23 34 54
Congo 2022-03-05 34 1 2 3
I'd like to have only a new variable giving the population of the country according to the year. Thank you.
You could do something like this with tidyverse.
library(tidyverse)
df1 <- df1 %>%
pivot_longer(!location, names_to = "date", values_to = "population") %>%
dplyr::mutate(year = str_sub(date, 1, 4))
df2 %>%
dplyr::mutate(year = str_sub(date, end = 4)) %>%
dplyr::left_join(., df1, by = c("location", "year")) %>%
dplyr::select(-c(date.y, year)) %>%
dplyr::rename(date = date.x)
Output
location date amount_of_vaccins population
1 Canada 2020-01-02 50 5
2 China 2021-05-03 59 34
3 Congo 2022-03-05 54 3
Data
df1 <-
structure(
list(
location = c("Canada", "China", "Congo"),
`2020.0` = c(5, 23, 1),
`2021.0` = c(7, 34, 2),
`2022.0` = c(9, 54, 3)
),
class = "data.frame",
row.names = c(NA,-3L)
)
df2 <-
structure(
list(
location = c("Canada", "China", "Congo"),
date = c("2020-01-02",
"2021-05-03", "2022-03-05"),
amount_of_vaccins = c(50, 59, 54)
),
class = "data.frame",
row.names = c(NA,-3L)
)
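A variant of the same idea, sketched with the names and values from the question (gapminder, vaccins): instead of cutting strings, it converts the column names and the dates to integer years before joining. to_year, pop_long and merged are hypothetical names introduced only for this example.

```r
library(dplyr)
library(tidyr)

gapminder <- data.frame(
  location = c("Canada", "China", "Congo"),
  `2020.0` = c(5, 23, 1),
  `2021.0` = c(7, 34, 2),
  `2022.0` = c(9, 54, 3),
  check.names = FALSE  # keep the numeric-looking column names as they are
)
vaccins <- data.frame(
  location          = c("Canada", "China", "Congo"),
  date              = c("2020-01-02", "2021-05-03", "2022-03-05"),
  amount_of_vaccins = c(50, 59, 34)
)

# Hypothetical helper: turns "2020.0" into the integer 2020
to_year <- function(x) as.integer(as.numeric(x))

# One row per location and year
pop_long <- gapminder %>%
  pivot_longer(-location, names_to = "year", values_to = "population",
               names_transform = list(year = to_year))

# Extract the year from each date, then join on location and year
merged <- vaccins %>%
  mutate(year = as.integer(format(as.Date(date), "%Y"))) %>%
  left_join(pop_long, by = c("location", "year")) %>%
  select(-year)
merged
```

Because only the year is added to vaccins before the join, there is no duplicated date column to drop or rename afterwards.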
I have an age column of character type. How do I replace each value in age_char with the corresponding value in corresponding_age_num in the column age?
Thanks
library(tidyverse)
countries <- c("America", "Brazil", "Canada", "Denmark", "England", "France")
age <- c("70 to 75", "unknown", "75 to 80", "70 to 75", "80 and above", "75 to 80")
tbl <- tibble(countries, age)
age_char <- unique(tbl$age)
corresponding_age_num <- c(72.5, 50, 77.5, 85)
If we split at the 'to' to create two numeric columns, we can take the average:
library(dplyr) # for transmute() and case_when()
library(tidyr)
library(stringr)
tbl %>%
separate(age, into = c('age1', 'age2'), sep = '\\s+to\\s+|[^0-9]+',
convert = TRUE, remove = FALSE) %>%
transmute(countries, age, age_mean = case_when(str_detect(age,
'and above') ~ age1 + 5, TRUE ~ (age1 + age2)/2))
-output
# A tibble: 6 x 3
# countries age age_mean
# <chr> <chr> <dbl>
#1 America 70 to 75 72.5
#2 Brazil unknown NA
#3 Canada 75 to 80 77.5
#4 Denmark 70 to 75 72.5
#5 England 80 and above 85
#6 France 75 to 80 77.5
If we need the unique values and their corresponding means, then wrap with distinct:
tbl %>%
separate(age, into = c('age1', 'age2'), sep = '\\s+to\\s+|[^0-9]+',
convert = TRUE, remove = FALSE) %>%
transmute(countries, age, age_mean = case_when(str_detect(age,
'and above') ~ age1 + 5, TRUE ~ (age1 + age2)/2)) %>%
select(-countries) %>%
distinct(age, .keep_all = TRUE)
library(tidyverse)
countries <- c("America", "Brazil", "Canada", "Denmark", "England", "France")
age <- c("70 to 75", "75 to 80", "75 to 80", "70 to 75", "80 and above", "75 to 80")
tbl <- tibble(countries, age)
tbl <- tbl %>%
mutate( num1 = str_extract(age,"^[0-9]{1,3}"),
num2 = str_extract(age, "to [0-9]{1,3}" ),
num2 = str_extract(num2, "[0-9]{1,3}"),
num1 = num1 %>% as.numeric(),
num2 = num2 %>% as.numeric())
tbl$mean <-apply(tbl[,3:4], mean , MARGIN = 1)
tbl
# A tibble: 6 x 5
countries age num1 num2 mean
<chr> <chr> <dbl> <dbl> <dbl>
1 America 70 to 75 70 75 72.5
2 Brazil 75 to 80 75 80 77.5
3 Canada 75 to 80 75 80 77.5
4 Denmark 70 to 75 70 75 72.5
5 England 80 and above 80 NA NA
6 France 75 to 80 75 80 77.5
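Going back to the data in the question, where age_char and corresponding_age_num are already given in matching order, the mapping can also be applied directly with a named lookup vector. This is a sketch that assumes the i-th number belongs to the i-th unique age string; lookup and tbl_num are made-up names.

```r
library(dplyr)

countries <- c("America", "Brazil", "Canada", "Denmark", "England", "France")
age <- c("70 to 75", "unknown", "75 to 80", "70 to 75", "80 and above", "75 to 80")
tbl <- tibble(countries, age)

age_char <- unique(tbl$age)
corresponding_age_num <- c(72.5, 50, 77.5, 85)

# Named vector: names are the age strings, values are the numbers
lookup <- setNames(corresponding_age_num, age_char)

# Index the lookup by each row's age string to get its number
tbl_num <- tbl %>%
  mutate(age = unname(lookup[age]))
tbl_num
```

Unlike the computed means, this keeps the asker's own choice of 50 for "unknown".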