Properly modify a data set with dplyr

I have modified a data set so many times that the code no longer looks clean and tidy. I need some help putting everything into a clean dplyr style. This is my code:
ddd_dataset <- read_excel("data/ddd_dataset.xlsx")
new_data = ddd_dataset[ddd_dataset$`Indicator name`=="Population covered by at least a 4G mobile network (%)",]
new_data = new_data[order(new_data$Country),]
new_data = spread(new_data[-c(1553, 1554), c(1,5,6)], Year, value = Value)
# Data imputation
new_data = new_data %>% pivot_longer(-Country, names_to = "year") %>%
mutate(value = value %>% as.numeric()) %>%
group_by(Country) %>%
fill(value, .direction = "updown") %>%
pivot_wider(names_from = year, values_from = value)
# Change column
itu_emi_countries <- read_csv("data/itu-emi-countries.csv")
itu_emi_countries <- itu_emi_countries %>% rename(Country = `ITU Name`)
# left_join() takes `by`, not merge()'s by.x/by.y
new_data = left_join(new_data, itu_emi_countries, by = "Country")
new_data$Country = new_data$`EMI Name`
new_data = new_data[,1:10]
# Turn data into long format
new_long =
new_data %>%
pivot_longer(-Country, names_to = "year", values_to = "x") %>%
mutate(across(year, as.numeric))
Does anyone know how I can rewrite these functions into a single function that has the style of a dplyr function (using %>%)?

A literal translation, with some inference and caveats:
library(dplyr)
library(tidyr) # pivot_*, complete, fill
# library(readr)
# library(readxl)
ddd_dataset <- readxl::read_excel("ddd_dataset.xlsx")
itu_emi_countries <- readr::read_csv("itu-emi-countries.csv") %>%
rename(Country = `ITU Name`)
new_data <- ddd_dataset %>%
  filter(`Indicator name` == "Population covered by at least a 4G mobile network (%)") %>%
  mutate(Value = suppressWarnings(as.numeric(Value))) %>%
  # id_cols is named explicitly; newer tidyr versions no longer accept it by position
  pivot_wider(id_cols = Country, names_from = Year, values_from = Value) %>%
  # we cannot impute before here, since some countries do not have all years, but now they will
  pivot_longer(-Country, names_to = "Year", values_to = "Value") %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  fill(Value, .direction = "updown") %>%
  pivot_wider(id_cols = Country, names_from = Year, values_from = Value)
new_long <- left_join(new_data, itu_emi_countries, by = "Country") %>%
  # inferring that you want to keep names for countries in new_data not present in itu
  mutate(Country = coalesce(`EMI Name`, Country)) %>%
  # inferring you want all but `EMI Name`, not just hard-coding 1:10
  select(-`EMI Name`) %>%
  pivot_longer(-Country, names_to = "year", values_to = "x") %>%
  mutate(year = as.integer(year))
new_data
# # A tibble: 196 x 10
# Country `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Afghanistan 0 0 0 0 0 4 7 22 26
# 2 Albania 0 0 0 35 80.2 85.3 85.5 95 98.4
# 3 Algeria 0 0 0 0 3.62 30.5 52.8 53.6 76.2
# 4 Andorra 50 50 50 50 50 85 85 85 85
# 5 Angola 7 7 7 7 8 8 8 18 30
# 6 Antigua and Barbuda 65 78.6 80 98 99 99 99 99 99
# 7 Argentina 0 0 0 65 85 85 90.8 91.2 97.7
# 8 Armenia 17.5 44 46 46.5 52.5 90.0 99.1 99.3 100
# 9 Australia 52.2 85 95 94 98 99 99.2 99.4 99.5
# 10 Austria 31.6 58.4 85 98 98 98 98 98 98
# # ... with 186 more rows
new_long
# # A tibble: 1,764 x 3
# Country year x
# <chr> <int> <dbl>
# 1 Afghanistan 2012 0
# 2 Afghanistan 2013 0
# 3 Afghanistan 2014 0
# 4 Afghanistan 2015 0
# 5 Afghanistan 2016 0
# 6 Afghanistan 2017 4
# 7 Afghanistan 2018 7
# 8 Afghanistan 2019 22
# 9 Afghanistan 2020 26
# 10 Albania 2012 0
# # ... with 1,754 more rows
But it seems unnecessary and inefficient to pivot back and forth when you ultimately want the data in long format. In one step:
new_long2 <- ddd_dataset %>%
  filter(`Indicator name` == "Population covered by at least a 4G mobile network (%)") %>%
  left_join(itu_emi_countries, by = "Country") %>%
  mutate(
    Country = coalesce(`EMI Name`, Country), # some `EMI Name` are missing
    Value = suppressWarnings(as.numeric(Value)) # "NULL" -> NA
  ) %>%
  complete(Country, Year) %>%
  arrange(Year) %>%
  group_by(Country) %>%
  fill(Value, .direction = "updown") %>%
  ungroup() %>%
  select(Country, year = Year, x = Value)
(The only difference in the data, other than order, is that Year is a numeric in this last block and is integer above. This can easily be remedied, over to you.)
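To make the complete() and fill() steps concrete, here is a minimal sketch on made-up data. The `toy` tibble and its values are hypothetical and only stand in for the filtered data set; the final mutate() shows one way to get an integer year, which is an assumption about what you want rather than part of the answer above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: country "B" has no row at all for 2013
toy <- tibble(
  Country = c("A", "A", "A", "B", "B"),
  Year    = c(2012, 2013, 2014, 2012, 2014),
  Value   = c(NA, 10, 20, 5, NA)
)

toy_long <- toy %>%
  complete(Country, Year) %>%               # adds the missing B/2013 row with Value = NA
  arrange(Country, Year) %>%
  group_by(Country) %>%
  fill(Value, .direction = "updown") %>%    # fill upwards first, then downwards, within each country
  ungroup() %>%
  mutate(Year = as.integer(Year)) %>%       # numeric -> integer, to match the first approach
  select(Country, year = Year, x = Value)

toy_long
```

Because complete() creates the missing country/year combinations before fill() runs, no round trip through wide format is needed.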

Related

Trying to reproduce a particular pivot table in R

My data take this shape:
set.seed(666)
grouping <- rep(c("A", "B"), 3)
theMonth <- c("2022_01", "2022_01", "2022_02", "2022_02", "2022_03", "2022_03")
revenue <- sample(100:1000, 6)
df <- tibble(grouping, theMonth, revenue)
I'm being asked to spread these data by month...
step1 <- spread(df, theMonth, revenue)
step1
# A tibble: 2 × 4
grouping `2022_01` `2022_02` `2022_03`
<chr> <int> <int> <int>
1 A 673 707 639
2 B 737 222 753
...but also, within the same table, I'm being asked for the cumulative progress of B (and only B) toward a target, say in this case 10000. So the desired output is something like:
grouping `2022_01` `2022_02` `2022_03`
<chr> <int> <int> <int>
1 A 673 707 639
2 B 737 222 753
3 CumSumB 737 959 1712
4 Progress 7.37% 9.59% 17.12%
What's the best way to attack this? Should I do it before I spread, probably using mutate? Or is there a clean way to do it after the spread?
(Answer does not have to use dplyr, but that is my preferred package for this sort of work.)

We may filter the data first, compute the cumulative sum column, bind the result to the original data, and then create the 'Progress' row with add_row:
library(dplyr)
library(tidyr)
library(tibble)
df %>%
filter(grouping == 'B') %>%
mutate(grouping = 'CumSumB', revenue = cumsum(revenue)) %>%
bind_rows(df, .) %>%
pivot_wider(names_from = theMonth, values_from = revenue) %>%
add_row(., tibble(grouping = "Progress", .[3, -1]/10000 * 100))
-output
# A tibble: 4 × 4
grouping `2022_01` `2022_02` `2022_03`
<chr> <dbl> <dbl> <dbl>
1 A 673 707 639
2 B 737 222 753
3 CumSumB 737 959 1712
4 Progress 7.37 9.59 17.1
Adding the % would turn every month column into character. If needed, it can be done as follows:
library(stringr)
df %>%
filter(grouping == 'B') %>%
mutate(grouping = 'CumSumB', revenue = cumsum(revenue)) %>%
bind_rows(df, .) %>%
pivot_wider(names_from = theMonth, values_from = revenue) %>%
add_row(., tibble(grouping = "Progress", .[3, -1]/10000 * 100)) %>%
mutate(across(-grouping, ~ replace(.x, n(), str_c(.x[n()], "%"))))
# A tibble: 4 × 4
grouping `2022_01` `2022_02` `2022_03`
<chr> <chr> <chr> <chr>
1 A 673 707 639
2 B 737 222 753
3 CumSumB 737 959 1712
4 Progress 7.37% 9.59% 17.12%

Here is an alternative approach:
library(dplyr)
library(tidyr)
df %>%
mutate(revenueA = lag(revenue, default = revenue[1])) %>%
filter(row_number() %% 2 == 0) %>%
mutate(CumSum = cumsum(revenue),
Progres = paste0(CumSum/100, "%")) %>%
pivot_longer(-c(grouping, theMonth),
names_to = "key",
values_to = "val",
values_transform = list(val = as.character)) %>%
pivot_wider(names_from = theMonth, values_from = val) %>%
mutate(grouping = case_when(key == "revenue" ~"B",
key == "revenueA" ~ "A",
TRUE ~ key)) %>%
arrange(grouping) %>%
select(-key)
grouping `2022_01` `2022_02` `2022_03`
<chr> <chr> <chr> <chr>
1 A 673 707 639
2 B 737 222 753
3 CumSum 737 959 1712
4 Progres 7.37% 9.59% 17.12%

Here is another option:
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = grouping, values_from = revenue) %>%
mutate(
CumSumB = cumsum(B),
Progress = (CumSumB / 10000) * 100
) %>%
pivot_longer(-theMonth, names_to = "grouping") %>%
pivot_wider(names_from = theMonth, values_from = value)
Returns:
grouping `2022_01` `2022_02` `2022_03`
<chr> <dbl> <dbl> <dbl>
1 A 673 707 639
2 B 737 222 753
3 CumSumB 737 959 1712
4 Progress 7.37 9.59 17.1

I want to show percentage

What should I write in summarise() to show the percentage of `Amount of Accidents`? Thanks.
dfc %>%
group_by(Urban_or_Rural_Area) %>%
summarise(
Accidents = mean(Number_of_Casualties),
`Amount of Accidents` = n()
)

There is likely a dupe somewhere, but ...
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarize(Amt = n()) %>%
ungroup() %>%
mutate(Pct = 100 * Amt / sum(Amt))
# # A tibble: 3 x 3
# cyl Amt Pct
# <dbl> <int> <dbl>
# 1 4 11 34.4
# 2 6 7 21.9
# 3 8 14 43.8
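The same pattern can be sketched with the column names from the question. The `dfc` tibble below is made-up illustration data (the real one is not shown in the question), and the `Percentage` column name is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the question's dfc
dfc <- tibble(
  Urban_or_Rural_Area  = c("Urban", "Urban", "Urban", "Rural"),
  Number_of_Casualties = c(1, 2, 3, 2)
)

res <- dfc %>%
  group_by(Urban_or_Rural_Area) %>%
  summarise(
    Accidents = mean(Number_of_Casualties),
    `Amount of Accidents` = n()
  ) %>%
  ungroup() %>%
  # share of each area in the total number of accidents
  mutate(Percentage = 100 * `Amount of Accidents` / sum(`Amount of Accidents`))

res
```

The percentage has to be computed after summarise(), on the ungrouped result, so that sum() runs over all groups rather than within one.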

How to calculate rates between observations with R

Considering I have a data frame ordered by date, with some quantities for each date, how can I calculate the ratio X_day / X_(day-1) for each row?
My dataset: https://raw.githubusercontent.com/imdevskp/covid_19_jhu_data_web_scrap_and_cleaning/master/covid_19_clean_complete.csv
My processed dataset (R code):
library(tidyverse)
library(lubridate)
covid19 <- read.table(file = "covid_19_clean_complete.csv",
header = TRUE,
stringsAsFactors = FALSE,
sep = ",",
dec = ".",
quote = "\"")
covid19$Date <- mdy(covid19$Date)
brasil <- covid19 %>%
filter(Country.Region == "Brazil") %>%
group_by(Country.Region, Date) %>%
summarise(Cases = sum(Confirmed))
My rate will be calculated over Cases variable.

We can take the lag of 'Cases' and divide 'Cases' by it:
library(dplyr)
out <- covid19 %>%
group_by(Country.Region, Date) %>%
summarise(Cases = sum(Confirmed)) %>%
mutate(Ratio = Cases/lag(Cases))
out %>%
filter(Country.Region == "Brazil") %>%
tail
# A tibble: 6 x 4
# Groups: Country.Region [1]
# Country.Region Date Cases Ratio
# <chr> <date> <int> <dbl>
#1 Brazil 2020-03-08 20 1.54
#2 Brazil 2020-03-09 25 1.25
#3 Brazil 2020-03-10 31 1.24
#4 Brazil 2020-03-11 38 1.23
#5 Brazil 2020-03-12 52 1.37
#6 Brazil 2020-03-13 151 2.90
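A minimal sketch of what lag() does here, on made-up numbers (the `toy` tibble is hypothetical, not taken from the COVID data):

```r
library(dplyr)

# Hypothetical daily counts, already ordered by date
toy <- tibble(
  Date  = as.Date("2020-03-01") + 0:2,
  Cases = c(10, 20, 30)
)

toy_ratio <- toy %>%
  # lag(Cases) shifts the column down by one row, so each row is divided by the previous day's value
  mutate(Ratio = Cases / lag(Cases))

toy_ratio
```

The first row has no previous day, so its Ratio is NA. With several countries in one table, the data must be grouped by country before the mutate(), as in the answer above, so that the lag does not cross from one country into the next.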

how to filter by group keys on an already grouped dataframe (grouped_df)

Do you know how to filter by group keys (or indices) on an already grouped data frame (grouped_df)?
For example:
df <- tibble(id1 = sample(1:10, 100, replace = TRUE),
id2 = sample(1:10, 100, replace = TRUE),
value = runif(100, 0, 1)) %>%
arrange(id1, id2)
If I want to obtain the rows corresponding to the last 5 groups:
df %>%
mutate(grp_id=paste0(id1, "_", id2)) %>%
filter(grp_id %in% tail(unique(grp_id), 5)) %>%
group_by(id1, id2)
df %>%
group_by(id1, id2) %>%
mutate(grp_id = group_indices()) %>%
ungroup() %>%
filter(grp_id %in% tail(unique(grp_id), 5)) %>%
group_by(id1, id2)
Do you know how to write the filter after grouping?
df %>%
group_by(id1, id2) %>%
xxxxxx ?

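To avoid the second group_by() and the ungroup(), create 'grp_id' inside the grouped pipeline and then, in filter(), compare it against the whole column (.$grp_id) rather than the per-group slice:
library(dplyr)
df %>%
  group_by(id1, id2) %>%
  # cur_group_id() replaces the no-argument group_indices(), deprecated since dplyr 1.0.0
  mutate(grp_id = cur_group_id()) %>%
  filter(grp_id %in% tail(unique(.$grp_id), 5))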
# A tibble: 12 x 4
# Groups: id1, id2 [5]
# id1 id2 value grp_id
# <int> <int> <dbl> <int>
# 1 10 3 0.527 59
# 2 10 5 0.264 60
# 3 10 5 0.569 60
# 4 10 5 0.157 60
# 5 10 6 0.0504 61
# 6 10 6 0.703 61
# 7 10 6 0.109 61
# 8 10 7 0.896 62
# 9 10 9 0.785 63
#10 10 9 0.775 63
#11 10 9 0.940 63
#12 10 9 0.450 63

Group_by and mutate by multiple columns in R

I have a data frame with the columns City, Gender, 2013, 2014 and 2015.
City Gender 2013 2014 2015
Aberdeen Female 30 40 50
Aberdeen Male 20 15 16
Aberdeenshire Female 60 80 70
Aberdeenshire Male 50 40 15
.....Includes 425 records.
I want to compute the female-to-male ratio (Female / Male) for each city, so the result should look like this:
City 2013_ratio 2014_ratio 2015_ratio
Aberdeen 1.5 2.67 3.125
Aberdeenshire 1.2 2 4.67
Can anyone help me solve this? I have tried grouping by city, but I don't know how to combine the values from the two Gender rows.

The ratio is easier to calculate if Male and Female are in separate columns. You can reshape the data that way with tidyr:
library(dplyr)
library(tidyr)
df %>%
gather(Year, Value, -City, - Gender) %>%
spread(Gender, Value) %>%
mutate(Ratio = Female/Male, Year = paste0(Year, "_Ratio")) %>%
select(-Female, -Male) %>%
spread(Year, Ratio)

The code for Rob's suggested solution would be the following (with an additional spread() step):
# data
df = data.frame(City = c("a", "a", "b", "b"),
Gender = c("Female", "Male", "Female", "Male"),
`2013` = c(30, 20, 60, 50),
`2014` = c(40, 15, 80, 40),
`2015` = c(50, 16, 70, 15))
# Actual process
library("dplyr")
library("tidyr")
df %>%
# Transform wide table into tidy
gather("Year", "Number", X2013:X2015) %>%
# Reshape gender columns for easier summaries
spread("Gender", "Number") %>%
# Compute ratios
group_by(City, Year) %>%
summarise(ratio = Female/(Male + Female))
#> # A tibble: 6 x 3
#> # Groups: City [?]
#> City Year ratio
#> <fct> <chr> <dbl>
#> 1 a X2013 0.6
#> 2 a X2014 0.727
#> 3 a X2015 0.758
#> 4 b X2013 0.545
#> 5 b X2014 0.667
#> 6 b X2015 0.824
Created on 2018-10-10 by the reprex package (v0.2.1)
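Note that the code above computes the female share, Female/(Male + Female), not the female-to-male ratio. To get exactly your result, use ratio = Female/Male in summarise() and then apply spread() again to spread the ratios over the years: spread(Year, ratio).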
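For reference, the same reshaping can be sketched with pivot_longer() and pivot_wider(), which tidyr now recommends over gather() and spread(). This is only one possible version. The data frame is rebuilt from the four rows shown in the question, with check.names = FALSE so the year columns keep their original names, and the intermediate column names (Value, Ratio) are my own choice.

```r
library(dplyr)
library(tidyr)

# The four rows shown in the question
df <- data.frame(
  City   = c("Aberdeen", "Aberdeen", "Aberdeenshire", "Aberdeenshire"),
  Gender = c("Female", "Male", "Female", "Male"),
  `2013` = c(30, 20, 60, 50),
  `2014` = c(40, 15, 80, 40),
  `2015` = c(50, 16, 70, 15),
  check.names = FALSE
)

ratios <- df %>%
  # one row per City / Gender / Year
  pivot_longer(-c(City, Gender), names_to = "Year", values_to = "Value") %>%
  # put Female and Male side by side so they can be divided
  pivot_wider(names_from = Gender, values_from = Value) %>%
  mutate(Ratio = Female / Male, Year = paste0(Year, "_ratio")) %>%
  select(City, Year, Ratio) %>%
  # back to one column per year
  pivot_wider(names_from = Year, values_from = Ratio)

ratios
```

This gives one row per city with the columns 2013_ratio, 2014_ratio and 2015_ratio.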

With tidyverse:
df = read.table(text="City Gender 2013 2014 2015
Aberdeen Female 30 40 50
Aberdeen Male 20 15 16
Aberdeenshire Female 60 80 70
Aberdeenshire Male 50 40 15", header = T)
library(tidyverse)

df %>%
  group_by(City) %>%
  arrange(City, Gender) %>%
  # funs() is deprecated since dplyr 0.8.0; across() with a lambda is the current equivalent
  summarise(across(X2013:X2015, ~ first(.x) / last(.x), .names = "{.col}_ratio"))
# A tibble: 2 x 4
City X2013_ratio X2014_ratio X2015_ratio
<fct> <dbl> <dbl> <dbl>
1 Aberdeen 1.5 2.67 3.12
2 Aberdeenshire 1.2 2 4.67
or, selecting the rows by Gender explicitly instead of relying on the sort order:
df %>%
  group_by(City) %>%
  arrange(City, Gender) %>%
  summarise(across(X2013:X2015,
                   ~ .x[Gender == "Female"] / .x[Gender != "Female"],
                   .names = "{.col}_ratio"))
