Filtering Down Data, then Adding Brackets - r

Currently working with a data set about billionaires (small summary included below); I am looking to sort them into three age brackets: "40 and under", "41 to 65", and "above 65", and then find the most common category (their kind of profession) of billionaire in each of the three age brackets.
I have tried to select the data down and use mutate or separate to create the brackets, but I am unsure what to do next. The commented-out lines are examples of things I have been trying.
load("bil.RData")
print(bil)
# A tibble: 2,614 x 22
age category citizenship company.name company.type country_code founded
<int> <chr> <chr> <chr> <chr> <chr> <int>
1 NA Financi… Saudi Arab… Rolaco Trad… new SAU 1968
2 34 Financi… United Sta… Fidelity In… new USA 1946
3 59 Non-Tra… Brazil Companhia B… new BRA 1948
4 61 New Sec… Germany Ratiopharm new DEU 1881
5 NA Financi… Hong Kong Swire new HKG 1816
6 NA Traded … Bahrain YBA Kanoo new BHR 1890
7 NA New Sec… Japan Otsuka Hold… new JPN 1921
8 NA Traded … Japan Sony new JPN 1946
9 66 Financi… Japan Mori Buildi… new JPN 1959
10 NA Traded … France Chanel new FRA 1909
# … with 2,604 more rows, and 15 more variables: `from emerging` <chr>,
# gdp <dbl>, gender <chr>, industry <chr>, inherited <chr>, name <chr>,
# rank <int>, region <chr>, relationship <chr>, sector <chr>, `was
# founder` <chr>, `was political` <chr>, wealth.type <chr>,
# worth_billions <dbl>, year <int>
bil %>%
  select(age, category) %>%
  arrange(age) %>%
  filter(!is.na(age), !is.na(category)) %>%
  group_by(age, category) %>%
  # mutate(n = sum(age)) %>%
  # separate(col = age, c("Under 40", "41-65", "Above 65")) %>%
  print()
# A tibble: 2,158 x 2
# Groups: age, category [312]
age category
<int> <chr>
1 12 Financial
2 21 Financial
3 24 Financial
4 24 Financial
5 28 Non-Traded Sectors
6 28 Resource Related
7 29 Financial
8 29 Traded Sectors
9 29 New Sectors
10 29 New Sectors
# … with 2,148 more rows
Preferably, I am looking for a table with three rows (one per age bracket: 40 and under, 41-65, above 65) and three columns (age_bracket, most common category, and n). Also, let me know the best way to include data sets on Stack Overflow, because this set is a bit large for dput() to be useful (I think).
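A minimal sketch for that last point, assuming a small slice is representative enough:
# share a dput() of a small subset instead of all 2,614 rows
dput(head(bil[, c("age", "category")], 20))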

I think there might be many ways to get what you want.
Here is an example using table1:
library(tidyverse)
library(table1)
bil1 <- bil %>%
  mutate(age_group = cut(age,
                         breaks = c(0, 40, 65, 110),
                         labels = c("40 and under", "41 to 65", "above 65")))
table1(~ category | age_group, data = bil1)
You may also want to try other packages such as arsenal and stargazer.
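For instance, a minimal arsenal sketch of the same cross-tabulation, assuming the bil1 built above:
library(arsenal)
# cross-tabulate category by age group and render the summary
summary(tableby(age_group ~ category, data = bil1))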

To divide the data into buckets, use findInterval or cut, and then use top_n to return the top category in each bucket.
library(dplyr)
bil %>%
  filter(!is.na(age)) %>%
  group_by(group = findInterval(age, c(41, 66))) %>%
  count(category) %>%
  top_n(1, n)
This would just return 0, 1 and 2 as group labels; if you want to name the labels you can do
bil %>%
  filter(!is.na(age)) %>%
  group_by(group = c("40 and under", "41 to 65", "above 65")[
    findInterval(age, c(41, 66)) + 1]) %>%
  count(category) %>%
  top_n(1, n)
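If you want exactly the three-row table from the question (age_bracket, most common category, n), a sketch combining cut() with slice_max(), the modern replacement for top_n():
library(dplyr)

bil %>%
  filter(!is.na(age), !is.na(category)) %>%
  mutate(age_bracket = cut(age,
                           breaks = c(-Inf, 40, 65, Inf),
                           labels = c("40 and under", "41 to 65", "above 65"))) %>%
  count(age_bracket, category) %>%   # n per bracket/category pair
  group_by(age_bracket) %>%
  slice_max(n, n = 1) %>%            # most common category per bracket
  ungroup()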

Related

R: Filtering rows based on a group criterion

I have a data frame with over 100,000 rows and with about 40 columns. The schools column has about 100 distinct schools. I have data from 1980 to 2023.
I want to keep all data from schools that have at least 10 rows for each of the years 2018 through 2022. Schools that do not meet that criterion should have all rows deleted.
In my minimal example, Schools, I have three schools.
Computing a table makes it apparent that only Washington should be retained. Adams only has 5 rows for 2018 and Jefferson has 0 for 2018.
Schools2 is what the result should look like.
How do I use the table computation or a dplyr computation to perform the filter?
Schools <-
  data.frame(school = c(rep('Washington', 60),
                        rep('Adams', 70),
                        rep('Jefferson', 100)),
             year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
                      rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
                      rep(2019:2023, each = 20)),
             stuff = rnorm(230))

Schools2 <-
  data.frame(school = rep('Washington', 60),
             year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)),
             stuff = rnorm(60))
table(Schools$school, Schools$year)
Schools |> group_by(school, year) |> summarize(counts = n())
Group by 'school', then keep a school's rows only if every year from 2018 through 2022 is present and each of those years has a count of at least 10:
library(dplyr) # version >= 1.1.0
Schools %>%
  filter(all(table(year[year %in% 2018:2022]) >= 10) &
           all(2018:2022 %in% year), .by = c("school")) %>%
  as_tibble()
Output:
# A tibble: 60 × 3
school year stuff
<chr> <dbl> <dbl>
1 Washington 2016 0.680
2 Washington 2016 -1.14
3 Washington 2016 0.0420
4 Washington 2016 -0.603
5 Washington 2016 2.05
6 Washington 2018 -0.810
7 Washington 2018 0.692
8 Washington 2018 -0.502
9 Washington 2018 0.464
10 Washington 2018 0.397
# … with 50 more rows
Or using count:
library(magrittr)
Schools %>%
  filter(tibble(year) %>%
           filter(year %in% 2018:2022) %>%
           count(year) %>%
           pull(n) %>%
           is_weakly_greater_than(10) %>%
           all,
         all(2018:2022 %in% year), .by = "school")
As it turns out, a friend just helped me come up with a base R solution.
# form a 2-way table, school against year
sdTable <- table(Schools$school, Schools$year)
# keep the years 2018-2022, selecting columns by name rather than position
sdTable <- sdTable[, as.character(2018:2022)]
# which schools have >= 10 rows in all years 2018-2022?
allGtEq <- function(oneRow) all(oneRow >= 10)
whichToKeep <- which(apply(sdTable, 1, allGtEq))
# whichToKeep holds row numbers from the table; get the school names
whichToKeep <- names(whichToKeep)
# back to the school data
whichOrigRowsToKeep <- which(Schools$school %in% whichToKeep)
newSchools <- Schools[whichOrigRowsToKeep, ]
newSchools
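The same table logic can also be folded into a single grouped dplyr filter; a sketch (the factor() call forces zero counts for years that are absent):
library(dplyr)

Schools %>%
  group_by(school) %>%
  filter(all(table(factor(year[year %in% 2018:2022],
                          levels = 2018:2022)) >= 10)) %>%
  ungroup()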

Opening and reshaping xlsx files with nameless columns in r using a pattern

I'm working with French electoral data but I'm having issues opening xlsx files to work on them in r. I was wondering if anyone had had the same problem and found a solution.
The issue is that only the first 29 of 100+ columns have names and the rest are nameless. I've tried editing the column names in Excel before opening the files, but this solution is time-consuming and prone to mistakes. I'm looking for a way to automate the process.
The datasets have a pattern that I'm trying to exploit to rename the columns and reshape the files:
- The first 6 columns correspond to the geographic id of the precinct (region, municipality, etc.).
- The next 15 columns give information about aggregate results in the precinct (number of voters, number of registered voters, participation, etc.).
- The next 8 columns give information about a given candidate and her results in the precinct (name, sex, party id, number of votes, etc.).
- These 29 columns have names.
- The next columns are nameless and correspond to other candidates: they repeat the 8 columns for the other candidates.
There is another layer of difficulty: each precinct does not have the same number of candidates, so the number of nameless columns changes.
Ideally, I would want r to recognize the pattern and reshape the datasets to long format by creating a new row for each candidate, keeping the precinct id and aggregate data in each row. To do this, I would like r to recognize each sequence of 8 nameless columns.
To simplify, let's say that my data frame looks like the following:
precinct_id  tot_votes  candidate_id  candidate_votes  ...1        ...2
Paris 05     1000       Jean Dupont   400              Paul Dupuy  300
Paris 06     500        Jean Dupont   50               Paul Dupuy  150
where:
- candidate_id and candidate_votes correspond to the id and result of the first candidate
- ...1, ...2 is how r automatically renames the nameless columns that correspond to candidate_id and candidate_votes for candidate 2 in the same precinct.
I need r to select the observations in each sequence of 2 columns and paste them into new rows under candidate_id and candidate_votes, while keeping the precinct_id and tot_votes columns:
precinct_id  tot_votes  candidate_id  candidate_votes
Paris 05     1000       Jean Dupont   400
Paris 06     500        Jean Dupont   50
Paris 05     1000       Paul Dupuy    300
Paris 06     500        Paul Dupuy    150
I have no idea how to reshape without column names... Any help would be greatly appreciated! Thanks!
PS: The files come from here: https://www.data.gouv.fr/fr/datasets/elections-legislatives-des-12-et-19-juin-2022-resultats-definitifs-du-premier-tour/
Actually, there's an even simpler solution than the one I suggested. .name_repair can take a function as its value. This function should accept a vector of "input" column names and return a vector of "output" column names. As we want to treat the data for the first candidate in each row in exactly the same way as every subsequent set of eight columns, I'll ignore only the first 21 columns, not the first 29.
library(readxl)

read_excel(
  "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
  .name_repair = function(x) {
    suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
    if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
    for (i in 1:length(x)) {
      if (i > 21) {
        x[i] <- paste0("C", 1 + floor((i - 22) / 8), "_", suffixes[1 + (i - 22) %% 8])
      }
    }
    x
  }
)
# A tibble: 35,429 × 197
`Code du département` `Libellé du dép…` `Code de la ci…` `Libellé de la…` `Code de la co…` `Libellé de la…` `Etat saisie` Inscrits Abstentions
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01 Ain 01 1ère circonscri… 016 Arbigny Complet 327 154
2 01 Ain 01 1ère circonscri… 024 Attignat Complet 2454 1281
3 01 Ain 01 1ère circonscri… 029 Beaupont Complet 446 224
4 01 Ain 01 1ère circonscri… 038 Bény Complet 604 306
5 01 Ain 01 1ère circonscri… 040 Béréziat Complet 362 179
6 01 Ain 01 1ère circonscri… 050 Boissey Complet 262 137
7 01 Ain 01 1ère circonscri… 053 Bourg-en-Bresse Complet 15516 8426
8 01 Ain 01 1ère circonscri… 057 Boz Complet 391 210
9 01 Ain 01 1ère circonscri… 065 Buellas Complet 1408 654
10 01 Ain 01 1ère circonscri… 069 Certines Complet 1169 639
# … with 35,419 more rows, and 188 more variables: `% Abs/Ins` <dbl>, Votants <dbl>, `% Vot/Ins` <dbl>, Blancs <dbl>, `% Blancs/Ins` <dbl>,
# `% Blancs/Vot` <dbl>, Nuls <dbl>, `% Nuls/Ins` <dbl>, `% Nuls/Vot` <dbl>, Exprimés <dbl>, `% Exp/Ins` <dbl>, `% Exp/Vot` <dbl>,
# C1_NPanneau <dbl>, C1_Sexe <chr>, C1_Nom <chr>, C1_Prénom <chr>, C1_Nuance <chr>, C1_Voix <dbl>, C1_PctVoixIns <dbl>, C1_PctVoixExp <dbl>,
# C2_NPanneau <dbl>, C2_Sexe <chr>, C2_Nom <chr>, C2_Prénom <chr>, C2_Nuance <chr>, C2_Voix <dbl>, C2_PctVoixIns <dbl>, C2_PctVoixExp <dbl>,
# C3_NPanneau <dbl>, C3_Sexe <chr>, C3_Nom <chr>, C3_Prénom <chr>, C3_Nuance <chr>, C3_Voix <dbl>, C3_PctVoixIns <dbl>, C3_PctVoixExp <dbl>,
# C4_NPanneau <dbl>, C4_Sexe <chr>, C4_Nom <chr>, C4_Prénom <chr>, C4_Nuance <chr>, C4_Voix <dbl>, C4_PctVoixIns <dbl>, C4_PctVoixExp <dbl>,
# C5_NPanneau <dbl>, C5_Sexe <chr>, C5_Nom <chr>, C5_Prénom <chr>, C5_Nuance <chr>, C5_Voix <dbl>, C5_PctVoixIns <dbl>, C5_PctVoixExp <dbl>, …
That's read the data in and named the columns. To get to the final format you want, we need the standard pivot_longer()/pivot_wider() trick (from tidyr), but the situation here is slightly complicated because some of your columns are character and some are numeric. So first, I'll turn the numeric columns into character columns so that the pivot_longer() step doesn't fail.
For clarity, I'll drop the first 21 columns so that it's easy to see what's going on.
library(dplyr)
library(tidyr)

read_excel(
  "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
  .name_repair = function(x) {
    suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
    if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
    for (i in 1:length(x)) {
      if (i > 21) {
        x[i] <- paste0("C", 1 + floor((i - 22) / 8), "_", suffixes[1 + (i - 22) %% 8])
      }
    }
    x
  }
) %>%
  mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
  pivot_longer(!1:21, names_sep = "_", names_to = c("Candidate", "Variable"), values_to = "Value") %>%
  select(!1:21)
# A tibble: 6,235,504 × 3
Candidate Variable Value
<chr> <chr> <chr>
1 C1 NPanneau 2
2 C1 Sexe M
3 C1 Nom LAHY
4 C1 Prénom Éric
5 C1 Nuance DXG
6 C1 Voix 2
7 C1 PctVoixIns 0.61
8 C1 PctVoixExp 1.23
9 C2 NPanneau 8
10 C2 Sexe M
# … with 6,235,494 more rows
Now add the pivot_wider(), again dropping the first 21 columns, purely for clarity.
read_excel(
  "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
  .name_repair = function(x) {
    suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
    if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
    for (i in 1:length(x)) {
      if (i > 21) {
        x[i] <- paste0("C", 1 + floor((i - 22) / 8), "_", suffixes[1 + (i - 22) %% 8])
      }
    }
    x
  }
) %>%
  mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
  pivot_longer(!1:21, names_sep = "_", names_to = c("Candidate", "Variable"), values_to = "Value") %>%
  pivot_wider(names_from = Variable, values_from = Value) %>%
  select(!1:21)
# A tibble: 779,438 × 9
Candidate NPanneau Sexe Nom Prénom Nuance Voix PctVoixIns PctVoixExp
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 C1 2 M LAHY Éric DXG 2 0.61 1.23
2 C2 8 M GUÉRAUD Sébastien NUP 26 7.95 15.95
3 C3 7 F ARMENJON Eliane ECO 3 0.92 1.84
4 C4 1 M GUILLERMIN Vincent ENS 30 9.17 18.4
5 C5 3 M BRETON Xavier LR 44 13.46 26.99
6 C6 5 M MENDES Michael DSV 3 0.92 1.84
7 C7 6 M BELLON Julien REC 6 1.83 3.68
8 C8 4 F PIROUX GIANNOTTI Brigitte RN 49 14.98 30.06
9 C9 NA NA NA NA NA NA NA NA
10 C10 NA NA NA NA NA NA NA NA
# … with 779,428 more rows
Finally, convert the "temporary character" columns back to numeric. (Still dropping the first 21 columns for clarity.)
read_excel(
  "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
  .name_repair = function(x) {
    suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
    if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
    for (i in 1:length(x)) {
      if (i > 21) {
        x[i] <- paste0("C", 1 + floor((i - 22) / 8), "_", suffixes[1 + (i - 22) %% 8])
      }
    }
    x
  }
) %>%
  mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
  pivot_longer(!1:21, names_sep = "_", names_to = c("Candidate", "Variable"), values_to = "Value") %>%
  pivot_wider(names_from = Variable, values_from = Value) %>%
  mutate(across(c(Voix, PctVoixIns, PctVoixExp), as.numeric)) %>%
  select(!1:21)
# A tibble: 779,438 × 9
Candidate NPanneau Sexe Nom Prénom Nuance Voix PctVoixIns PctVoixExp
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 C1 2 M LAHY Éric DXG 2 0.61 1.23
2 C2 8 M GUÉRAUD Sébastien NUP 26 7.95 16.0
3 C3 7 F ARMENJON Eliane ECO 3 0.92 1.84
4 C4 1 M GUILLERMIN Vincent ENS 30 9.17 18.4
5 C5 3 M BRETON Xavier LR 44 13.5 27.0
6 C6 5 M MENDES Michael DSV 3 0.92 1.84
7 C7 6 M BELLON Julien REC 6 1.83 3.68
8 C8 4 F PIROUX GIANNOTTI Brigitte RN 49 15.0 30.1
9 C9 NA NA NA NA NA NA NA NA
10 C10 NA NA NA NA NA NA NA NA
# … with 779,428 more rows
This, I think, is the format you want, though you may need to arrange() the rows into your desired order. Obviously, you should drop the final %>% select(!1:21) for your production version.
It is an easy matter to convert this code to a function that accepts a filename as its parameter and then use it in an lapply() to read an entire folder into a list of data frames (sketched after the caveats below). However...
- It appears that not every file in the folder has the same layout. resultats-par-niveau-fe-t1-outre-mer.xlsx, for example, appears to have fewer "prefix columns" before the 8-columns-per-candidate repeat begins.
- The import generates several warnings. This appears to be because the election(?) with the largest number of candidates does not appear in the first rows of the worksheet. I've not investigated whether these warnings are generated by meaningful problems with the import.
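A minimal sketch of that function-plus-lapply idea, assuming a folder whose files all share the 21-prefix-column layout (the folder name and helper name are placeholders):
library(readxl)

# hypothetical wrapper around the read + name-repair logic above
read_results <- function(path) {
  read_excel(
    path,
    .name_repair = function(x) {
      suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance",
                    "Voix", "PctVoixIns", "PctVoixExp")
      if ((length(x) - 21) %% 8 != 0)
        stop("Don't know how to handle a sheet with ", length(x), " columns")
      i <- seq_along(x)
      ifelse(i > 21,
             paste0("C", 1 + (i - 22) %/% 8, "_", suffixes[1 + (i - 22) %% 8]),
             x)
    }
  )
}

files <- list.files("elections_2022", pattern = "\\.xlsx$", full.names = TRUE)
results <- lapply(files, read_results)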

I need to filter data without missing information, it is character but I can't filter it

library(XML)
library(dplyr)
library(rvest)

presid <- read_html("https://en.wikipedia.org/wiki/List_of_presidents_of_Peru") %>% # read the html page
  html_nodes("table") %>%          # extract nodes which contain a table
  .[3] %>%                         # select the node which contains the relevant table
  html_table(header = NA, trim = TRUE) # extract the table
t3 <- presid[[1]]                  # flatten data
t4 <- t3[unique(t3$N), ]           # eliminate duplicates
t5 <- subset(t4, !is.na(President))
I need to read this table and filter the data in a way that does not lose a lot of information.
The loss of rows is the problem: t3 has 98 rows, which drops to 72 in t4 and 63 in t5, when in reality I only need to go from 98 rows down to 84, which can be done by filtering on column N.
I have tried these expressions, but with no result:
strsplit(as.character(t3$N), split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)
and
grep("[[:digit:]]{2,}", N, value = TRUE)
The rows of column N that I need to remove are those with decimal values (0.5, 2.5, 6.5, 6.6, and others ending in .5); in total there are 14 rows to remove, which would reduce my data frame from 98 to 84 rows.
I could filter by date, but I have not found much material about that. Thanks!
Since the data from the website has duplicate column names, we can use janitor::clean_names() to get clean column names and then keep only those rows that have whole numbers in the n column.
library(rvest)
library(dplyr)

read_html("https://en.wikipedia.org/wiki/List_of_presidents_of_Peru") %>%
  html_nodes("table") %>%
  .[3] %>%
  html_table(header = NA, trim = TRUE) %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  filter(grepl('^\\d+$', n)) -> result

result
# A tibble: 85 x 10
# n president president_2 president_3 term_of_office term_of_office_2 title
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 "" "" José de la R… 28 February 18… 23 June 1823 President of …
# 2 2 "" "" José Bernard… 16 August 1823 18 November 1823 President of …
# 3 2 "" "" José Bernard… 18 November 18… 10 February 1824 Constitutiona…
# 4 3 "" "" José de La M… 10 June 1827 7 June 1829 Constitutiona…
# 5 4 "" "" Agustín Gama… 7 June 1829 19 December 1829 Antonio Gutié…
# 6 4 "" "" Agustín Gama… 1 September 18… 19 December 1829 Provisional P…
# 7 4 "" "" Agustín Gama… 19 December 18… 19 December 1833 Constitutiona…
# 8 5 "" "" Luis José de… 21 December 18… 21 December 1833 Provisional P…
# 9 6 "" "" Felipe Salav… 25 February 18… 7 February 1836 Supreme Head …
#10 7 "" "" Agustín Gama… 20 January 183… 15 August 1839 Provisional P…
# … with 75 more rows, and 3 more variables: form_of_entry <chr>, vice_president <chr>,
# vice_president_2 <chr>
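A base R sketch of the same filter, under the assumption that the fractional entries in N are exactly the rows to drop:
# keep only rows whose N is a whole number
t6 <- t3[grepl("^[0-9]+$", as.character(t3$N)), ]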

Create new columns in R

I am carrying out an analysis on some Italian regions. I have a dataset similar to the following:
mydata <- data.frame(date = c(2020, 2021, 2020, 2021, 2020, 2021),
                     Region = c('Sicilia', 'Sicilia', 'Sardegna', 'Sardegna', 'Campania', 'Campania'),
                     Number = c(20, 30, 50, 70, 90, 69))
Now I have to create two new columns. The first (called 'Total Population') contains a fixed number for each region (for example, each row with Sicily will have Total Population = 250). The second column contains the % ratio between the value of the 'Number' column and the corresponding value of 'Total Population' (for example, for Sicily the value will be 20/250, and so on).
I hope I explained myself well. Thank you very much!
Like this, perhaps:
library(dplyr)
library(magrittr) # for the %<>% assignment pipe

mydata %<>%
  group_by(Region) %>%
  mutate(`Total Population` = sum(Number),
         `Ratio of Total` = sprintf("%.1f%%", 100 * Number / sum(Number)))
mydata is now:
> mydata
# A tibble: 6 x 5
# Groups: Region [3]
date Region Number `Total Population` `Ratio of Total`
<dbl> <chr> <dbl> <dbl> <chr>
1 2020 Sicilia 20 50 40.0%
2 2021 Sicilia 30 50 60.0%
3 2020 Sardegna 50 120 41.7%
4 2021 Sardegna 70 120 58.3%
5 2020 Campania 90 159 56.6%
6 2021 Campania 69 159 43.4%
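Note that the answer above derives Total Population from the data itself (the sum of Number per region). If you instead want the fixed per-region figures the question describes, one sketch is a lookup table joined onto the data; only Sicily's 250 comes from the question, the other totals are made-up placeholders:
library(dplyr)

region_pop <- data.frame(Region = c('Sicilia', 'Sardegna', 'Campania'),
                         TotalPopulation = c(250, 160, 570)) # placeholder totals

mydata %>%
  left_join(region_pop, by = "Region") %>%
  mutate(Ratio = sprintf("%.1f%%", 100 * Number / TotalPopulation))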

Format a tbl within a dplyr chain

I am trying to add commas for thousands in my data e.g. 10,000 along with dollars e.g. $10,000.
I'm using several dplyr commands along with tidyr gather and spread functions. Here's what I tried:
Cut and paste this code block to generate the random data "dataset" I'm working with:
library(dplyr)
library(tidyr)
library(lubridate)

## Generate some data
channels <- c("Facebook", "Youtube", "SEM", "Organic", "Direct", "Email")
last_month <- Sys.Date() %m+% months(-1) %>% floor_date("month")
mts <- seq(from = last_month %m+% months(-23), to = last_month, by = "1 month") %>% as.Date()
dimvars <- expand.grid(Month = mts, Channel = channels, stringsAsFactors = FALSE)

# metrics
rws <- nrow(dimvars)
set.seed(42)

# generates variability in the random data
randwalk <- function(initial_val, ...){
  initial_val + cumsum(rnorm(...))
}
Sessions <- ceiling(randwalk(3000, n = rws, mean = 8, sd = 1500)) %>% abs()
Revenue <- ceiling(randwalk(10000, n = rws, mean = 0, sd = 3500)) %>% abs()

# make primary df
dataset <- cbind(dimvars, Revenue)
Which looks like:
> tbl_df(dataset)
# A tibble: 144 × 3
Month Channel Revenue
<date> <chr> <dbl>
1 2015-06-01 Facebook 8552
2 2015-07-01 Facebook 12449
3 2015-08-01 Facebook 10765
4 2015-09-01 Facebook 9249
5 2015-10-01 Facebook 11688
6 2015-11-01 Facebook 7991
7 2015-12-01 Facebook 7849
8 2016-01-01 Facebook 2418
9 2016-02-01 Facebook 6503
10 2016-03-01 Facebook 5545
# ... with 134 more rows
Now I want to spread the months into columns to show revenue trend by channel, month over month. I can do that like so:
revenueTable <- dataset %>%
  select(Month, Channel, Revenue) %>%
  group_by(Month, Channel) %>%
  summarise(Revenue = sum(Revenue)) %>%
  #mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
  gather(Key, Value, -Channel, -Month) %>%
  spread(Month, Value) %>%
  select(-Key)
And it looks almost exactly as I want:
> revenueTable
# A tibble: 6 × 25
Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01` `2015-10-01` `2015-11-01` `2015-12-01` `2016-01-01` `2016-02-01` `2016-03-01` `2016-04-01`
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Direct 11910 8417 4012 359 4473 2702 6261 6167 8630 5230 1394
2 Email 7244 3517 671 1339 10788 10575 8567 8406 7856 6345 7733
3 Facebook 8552 12449 10765 9249 11688 7991 7849 2418 6503 5545 3908
4 Organic 4191 978 219 4274 2924 4155 5981 9719 8220 8829 7024
5 SEM 2344 6873 10230 6429 5016 2964 3390 3841 3163 1994 2105
6 Youtube 186 2949 2144 5073 1035 4878 7905 7377 2305 4556 6247
# ... with 13 more variables: `2016-05-01` <dbl>, `2016-06-01` <dbl>, `2016-07-01` <dbl>, `2016-08-01` <dbl>, `2016-09-01` <dbl>, `2016-10-01` <dbl>,
# `2016-11-01` <dbl>, `2016-12-01` <dbl>, `2017-01-01` <dbl>, `2017-02-01` <dbl>, `2017-03-01` <dbl>, `2017-04-01` <dbl>, `2017-05-01` <dbl>
Now the part I'm struggling with. I would like to format the data as currency. I tried adding this in between summarise() and gather() within the chain:
mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
This half works: the dollar sign is prepended, but the comma separators do not show. I tried removing the paste0("$" part to see if I could get the comma formatting to work, with no success.
How can I format my tbl as a currency with dollars and commas, rounded to nearest whole dollars (no $1.99, just $2)?
I think you can just do this at the end with dplyr::mutate_at(). (Incidentally, format()'s grouping-separator argument is big.mark, not big.interval, which is why your attempt showed no commas.)
revenueTable %>% mutate_at(vars(-Channel), funs(. %>% round(0) %>% scales::dollar()))
#> # A tibble: 6 x 25
#> Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Direct $11,910 $8,417 $4,012 $359
#> 2 Email $7,244 $3,517 $671 $1,339
#> 3 Facebook $8,552 $12,449 $10,765 $9,249
#> 4 Organic $4,191 $978 $219 $4,274
#> 5 SEM $2,344 $6,873 $10,230 $6,429
#> 6 Youtube $186 $2,949 $2,144 $5,073
#> # ... with 20 more variables: `2015-10-01` <chr>, `2015-11-01` <chr>,
#> # `2015-12-01` <chr>, `2016-01-01` <chr>, `2016-02-01` <chr>,
#> # `2016-03-01` <chr>, `2016-04-01` <chr>, `2016-05-01` <chr>,
#> # `2016-06-01` <chr>, `2016-07-01` <chr>, `2016-08-01` <chr>,
#> # `2016-09-01` <chr>, `2016-10-01` <chr>, `2016-11-01` <chr>,
#> # `2016-12-01` <chr>, `2017-01-01` <chr>, `2017-02-01` <chr>,
#> # `2017-03-01` <chr>, `2017-04-01` <chr>, `2017-05-01` <chr>
We can use data.table:
library(data.table)
nm1 <- setdiff(names(revenueTable), 'Channel')
setDT(revenueTable)[, (nm1) := lapply(.SD, function(x)
  scales::dollar(round(x))), .SDcols = nm1]
revenueTable[, 1:3, with = FALSE]
# Channel `2015-06-01` `2015-07-01`
#1: Direct $11,910 $8,417
#2: Email $7,244 $3,517
#3: Facebook $8,552 $12,449
#4: Organic $4,191 $978
#5: SEM $2,344 $6,873
#6: Youtube $186 $2,949
