I'm using the data here (specifically the cpi and grosses tibbles) and I would like to calculate each of the columns containing dollar amounts (denoted by having 'gross' or 'price' in their names) in terms of their 2020 value. I would then like to order these columns directly after their matches.
The following method was inspired by the post here:
cpi_recent <- cpi %>% # pulls most recent cpi in the tibble
select(cpi) %>%
slice_tail() %>%
pull()
grosses_adj <- grosses %>%
mutate(year_month = floor_date(week_ending, 'month')) %>%
left_join(cpi, 'year_month') %>%
mutate(across(contains(c('gross', 'price')),
list(adj = ~ cpi_recent/cpi * .))) %>% # creates new col in 2020 dollars w/ _adj suffix
select(-year_month, -cpi)
However, this will place all new columns after the last column, by default.
Is there any way to dynamically order the new columns directly after the matching column, like: gross_1, gross_1_adj, price_2, price_2_adj, etc.? Of course this is trivial using select() manually, but I'm assuming there is some way to reference these columns dynamically using mutate() and the .after argument.
You can do this by following the approach from Stackoverflow: How to relocate several columns in one step using dplyr::relocate?
First of all, you'll need to store the variable names that you are you are using in across selection in a vector or list.
gross_price_vars <- names(grosses) %>% str_subset("gross|price")
gross_price_vars
# [1] "weekly_gross_overall" "weekly_gross" "potential_gross"
# "avg_ticket_price" "top_ticket_price"
With the help of reduce from the purrr package you can relocate the freshly generated variables after the ones you used for calculation and whose names you stored in gross_price_vars.
grosses_adj %>%
reduce(.x = gross_price_vars,
.f = ~ relocate(.x, str_c(.y, "_adj"), .after = .y),
.init = .)
# week_ending week_number weekly_gross_overall weekly_gross_overall_adj show theatre weekly_gross weekly_gross_adj
# <date> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
# 1 1985-06-09 1 3915937 9576099. 42nd Str~ St. Ja~ 282368 690508.
# 2 1985-06-09 1 3915937 9576099. A Chorus~ Sam S.~ 222584 544311.
# 3 1985-06-09 1 3915937 9576099. Aren't W~ Brooks~ 249272 609574.
# 4 1985-06-09 1 3915937 9576099. Arms and~ Circle~ 95688 233997.
# 5 1985-06-09 1 3915937 9576099. As Is Lyceum~ 61059 149315.
Keep in mind, that the .x inside relocate is another object than the .x outside from it. Inside relocate it referes to the .init argument, which is the data.frame, while .y refereres to the names the variable list (the .x argument from the reduce function).
The Variables generated from the mutate(across ...) function have the postfix "_adj" attached to it's original name. In that way we identify the new variables by str_c(old_varname, "_adj").
I would do something simpler, you can use peek_vars to put the columns in order if you're conscientious about how you're naming your new columns (I usually use prefixes or suffixes to select). It looks like it would work fine in this case.
The first columns (week_ending, week_number) will not be part of the sort, everything else after that will be alphabetical.
grosses_adj %>%
select(week_ending, week_number, sort(tidyselect::peek_vars()))
Related
I have a df of lot #'s with all of the data associated with them. Some of that data is experimental. Those lot #'s start with X. For example, X42A7299, where any normal lot would be 42A7299. I want to exclude those rows. The DF is called all_cls4. Here is the code I have tried:
all_cls4new<- all_cls4 %>% filter(!str_detect(Lot_#, ^X))
this returns a +
I also get this result with filter and !grep. What am I missing?
library(dplyr)
library(stringr)
x <- tribble(
~lot, ~other_data,
"X42A7299", 45,
"42A7299", 100
)
x %>%
filter(!(str_detect(lot, '^X')))
#> # A tibble: 1 × 2
#> lot other_data
#> <chr> <dbl>
#> 1 42A7299 100
Also, be careful with a symbol in your column name (e.g. Lot_#). I would rename it to a "clean" name (e.g. snakecase). janitor::clean_names() is useful for this. If you use it as is, you will have to wrap in backticks:
x %>%
filter(!(str_detect(`Lot_#`, '^X')))
I'm trying to make objects directly from information listed in a tibble that can be called on by later functions/tibbles in my environment. I can make the objects manually but I'm working to do this iteratively.
library(tidyverse)
##determine mean from 2x OD Negatives in experimental plates, then save summary for use in appending table
ELISA_negatives = "my_file.csv"
neg_tibble <- as_tibble(read_csv(ELISA_negatives, col_names = TRUE)) %>%
group_by(Species_ab, Antibody, Protein) %>%
filter(str_detect(Animal_ID, "2x.*")) %>%
summarize(ave_neg_U_mL = mean(U_mL, na.rm = TRUE), n=sum(!is.na(U_mL)))
neg_tibble
# A tibble: 4 x 5
# Groups: Species_ab, Antibody [2]
Species_ab Antibody Protein ave_neg_U_mL n
<chr> <chr> <chr> <dbl> <int>
1 Mouse IgG GP 28.2 6
2 Mouse IgG NP 45.9 6
3 Rat IgG GP 5.24 4
4 Rat IgG NP 1.41 1
I can write the object manually based off the above tibble:
Mouse_IgG_GP_cutoff <- as.numeric(neg_tibble[1,4])
Mouse_IgG_GP_cutoff
[1] 28.20336
In my attempt to do this iteratively, I can make a new tibble neg_tibble_string with the information I need. All I would need to do now is make a global object from the Name in the first column Test_Name, and assign it to the numeric value in the second column ave_neg_U_mL (which is where I'm getting stuck).
neg_tibble_string <- neg_tibble %>%
select(Species_ab:Protein) %>%
unite(col='Test_Name', c('Species_ab', 'Antibody', 'Protein'), sep = "_") %>%
mutate(Test_Name = str_c(Test_Name, "_cutoff")) %>%
bind_cols(neg_tibble[4])
neg_tibble_string
# A tibble: 4 x 2
Test_Name ave_neg_U_mL
<chr> <dbl>
1 Mouse_IgG_GP_cutoff 28.2
2 Mouse_IgG_NP_cutoff 45.9
3 Rat_IgG_GP_cutoff 5.24
4 Rat_IgG_NP_cutoff 1.41
I feel like there has to be a way to do this to get this from the above tibble neg_tibble_string, and make this for all four of the rows. I've tried a variant of this and this, but can't get anywhere.
> list_df <- mget(ls(pattern = "neg_tibble_string"))
> list_output <- map(list_df, ~neg_tibble_string$ave_neg_U_mL)
Warning message:
Unknown or uninitialised column: `ave_neg_U_mL`.
> list_output
$neg_tibble_string
NULL
As always, any insight is appreciated! I'm making progress on my R journey but I know I am missing large gaps in knowledge.
As we already returned the object value in a list, we need only to specify the lambda function i.e. .x returns the value of the list element which is a tibble and extract the column
library(purrr)
list_output <- map(list_df, ~.x$ave_neg_U_ml)
If the intention is to create global objects, deframe, convert to a list and then use list2env
library(tibble)
list2env(as.list(deframe(neg_tibble_string)), .GlobalEnv)
I'm working with R for the first time for a class in college. To preface this: I don't know enough to know what I don't know, so I'm sorry if this question has been asked before. I am trying to predict the results of the Texas state house elections in 2020, and I think the best prior for that is the results of the 2018 state house elections. There are 150 races, so I can't bare to input them all by hand, but I can't find any spreadsheet that has data formatted how I want it. I want it in a pretty standard table format:
My desired table format. However, the table from the Secretary of state I have looks like the following:
Gross ugly table.
I wrote some psuedo code:
Here's the Psuedo Code, basically we want to construct a new CSV:
'''%First, we want to find a district, the house races are always preceded by a line of dashes, so I will need a function like this:
Create a New CSV;
for(x=1; x<151 ; x +=1){
Assign x to the cell under the district number cloumn;
Find "---------------" ;
Go down one line;
Go over two lines;
% We should now be in the third column and now want to read in which party got how many votes. The number of parties is not consistant, so we need to account for uncontested races, libertarians, greens, and write ins. I want totals for Republicans, Democrats, and Other.
while(cell is not empty){
Party <- function which reads cell (but I want to read a string);
go right one column;
Votes <- function which reads cell (but I want to read an integer);
if(Party = Rep){
put this data in place in new CSV;
else if (Party = Dem)
put this data in place in new CSV;
else
OtherVote += Votes;
};
};
Assign OtherVote to the column for other party;
OtherVote <- 0;
%Now I want to assign 0 to null cells (ones where no rep, or no Dem, or no other party contested
read through single row 4 spaces, if its null assign it 0;
Party <- null
};'''
But I don't know enough to google what to do! Here's what I need help with: Can I create a new CSV in Rstudio, how? How can I read specific cells in a table, hopefully indexing? Lastly, how do I write to a table in R. Any help is appreciated! Thank you!
Can I create a new CSV in Rstudio, how?
Yes you can. Use the "write.csv" function.
write.csv(df, file = "df.csv") #see help for more information.
How can I read specific cells in a table?
Use the brackets after df,example below.
df <- data.frame(x = c(1,2,3), y = c("A","B","C"), z = c(15,25,35))
df[1,1]
#[1] 1
df[1,1:2]
# x y
#1 1 A
How do I write to a table in R?
If you want to write a table in xlsx use the function write.xlsx from openxlsx package.
Wikipedia seems to have a table that is closer to the format you are looking for.
In order to get to the table you are looking for we need a few steps:
Download data from Wikipedia and extract table.
Clean up table.
Select columns.
Calculate margins.
1. Download data from wikipedia and extract table.
The rvest table helps with downloading and parsing websites into R objects.
First we download the HTML of the whole website.
library(dplyr)
library(rvest)
wiki_html <-
read_html(
"https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections_in_Texas"
)
There are a few ways to get a specific object from an HTML file in this case
I dedided to look for the table that has the class name “wikitable plainrowheaders sortable”,
as I learned from inspecting the code, that the only table with that class is
the one we want to extract.
library(purrr)
html_nodes(wiki_html, "table") %>%
map_lgl( ~ html_attr(., "class") == "wikitable plainrowheaders sortable") %>%
which()
#> [1] 20
Then we can select table number 20 and convert it to a dataframe with html_table()
raw_table <-
html_nodes(wiki_html, "table")[[20]] %>%
html_table(fill = TRUE)
2. Clean up table.
The table has duplicated names, we can change that by using as_tibble() and its .name_repair argument. We then usedplyr::select() to get the columns. Furthermore we usedplyr::filter() to delete the first two rows, that have "District" as a value in theDistrictcolumn. Now the columns are still characters
vectors, but we need them to be numeric, therefore we first delete commas from
all columns and then transform columns 2 to 4 to numeric.
clean_table <-
raw_table %>%
as_tibble(.name_repair = "unique") %>%
filter(District != "District") %>%
mutate_all( ~ gsub(",", "", .)) %>%
mutate_at(2:4, as.numeric)
3. Select columns and 4. Calculate margins.
We use dplyr::select() to select the columns you are interested in and give them more helpful names.
Finally we calculate the margin between democratic and republican votes by first adding up there votes
as total_votes and then dividing the difference by total_votes.
clean_table %>%
select(District,
RepVote = Republican...2,
DemVote = Democratic...4,
OthVote = Others...6) %>%
mutate(
total_votes = RepVote + DemVote,
margin = abs(RepVote - DemVote) / total_votes * 100
)
#> # A tibble: 37 x 6
#> District RepVote DemVote OthVote total_votes margin
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 District 1 168165 61263 3292 229428 46.6
#> 2 District 2 139188 119992 4212 259180 7.41
#> 3 District 3 169520 138234 4604 307754 10.2
#> 4 District 4 188667 57400 3178 246067 53.3
#> 5 District 5 130617 78666 224 209283 24.8
#> 6 District 6 135961 116350 3731 252311 7.77
#> 7 District 7 115642 127959 0 243601 5.06
#> 8 District 8 200619 67930 4621 268549 49.4
#> 9 District 9 0 136256 16745 136256 100
#> 10 District 10 157166 144034 6627 301200 4.36
#> # … with 27 more rows
Edit: In case you want to go with the data provided by the state, it looks to me as if the data you are looking for is in the first, third and fourth column. So what you want to do is.
(All the code below is not tested, as I do not have the original data.)
read data into R
library(readr)
tx18 <- read_csv("filename.csv")
select relevant columns
tx18 <- tx18 %>%
select(c(1,3,4))
clean table
tx18 <- tx18 %>%
filter(!is.na(X3),
X3 != "Party",
X3 != "Race Total")
Group and summarize data by party
tx18 <- tx18 %>%
group_by(X3) %>%
summarise(votes = sum(X3))
Pivot/ Reshape data to wide format
tx18 %>$
pivot_wider(names_from = X3,
values_from = votes)
After this you could then calculate the margin similarly as I did with the Wikipedia data.
I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers as a column divided by a ":".
For e.g. column 1: "age":"52" and column2:"height":"170".
But the data is so messy that sometimes in the column of the age question and answer there is a height question and answer and for some questionnaires questions and answers double.
I would need the questions as variables and the answers as observations. But I have no clue how to get there. I could clean the data in excel first, but with the fact that columns are not constant and there are for e.g. some height questions in the age column I see no chance to do it as I will get new data regularly, formated the same way.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note how I had reordered some of the columns, but it still works. You are probably not done just yet. All columns are now of type character. You need to figure out the desired type for each and convert.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
I am sure you may run into other things, but this should be a start.
Intro
After recently taking Hadley Wickham's functional programming class I decided I'd try applying some of the lessons to my projects at work. Naturally, the first project I tried has proven to be more complicated than the examples worked demonstrated in the class. Does anyone have recommendations for a way to use the purrr package to make the task described below more efficient?
Project Background
I need to assign quintile groups to records in a spatial polygon dataframe. In addition to the record identifier there are several other variables and I need to calculate the quintile group for each.
Here's the crux of the problem: I have been asked to identify outliers in one particular variable and to omit those records from the entire analysis as long as it doesn't change the quintile composition of the first quintile group for any of the other variables.
Question
I have put together a dplyr pipeline (see the example below) that performs this checking process for a single variable, but how might I rewrite this process so that I can efficiently check each variable?
EDIT: While it is certainly possible to change the shape of the data from wide to long as an intermediary step, in the end it needs to return to its wide format so that it matches up with the #polygons slot of the spatial polygons dataframe.
Reproducible Example
You can find the complete script here: https://gist.github.com/tiernanmartin/6cd3e2946a77b7c9daecb51aa11e0c94
Libraries and Settings
library(grDevices) # boxplot.stats()
library(operator.tools) # %!in% logical operator
library(tmap) # 'metro' data set
library(magrittr) # piping
library(dplyr) # exploratory data analysis verbs
library(purrr) # recursive mapping of functions
library(tibble) # improved version of a data.frame
library(ggplot2) # dot plot
library(ggrepel) # avoid label overlap
options(scipen=999)
set.seed(888)
Load the example data and take a small sample of it
data("metro")
m_spdf <- metro
# Take a sample
m <-
metro#data %>%
as_tibble %>%
select(-name_long,-iso_a3) %>%
sample_n(50)
> m
# A tibble: 50 x 10
name pop1950 pop1960 pop1970 pop1980 pop1990
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sydney 1689935 2134673 2892477 3252111 3631940
2 Havana 1141959 1435511 1779491 1913377 2108381
3 Campinas 151977 293174 540430 1108903 1693359
4 Kano 123073 229203 541992 1349646 2095384
5 Omsk 444326 608363 829860 1032150 1143813
6 Ouagadougou 33035 59126 115374 265200 537441
7 Marseille 755805 928768 1182048 1372495 1418279
8 Taiyuan 196510 349535 621625 1105695 1636599
9 La Paz 319247 437687 600016 809218 1061850
10 Baltimore 1167656 1422067 1554538 1748983 1848834
# ... with 40 more rows, and 4 more variables:
# pop2000 <dbl>, pop2010 <dbl>, pop2020 <dbl>,
# pop2030 <dbl>
Calculate quintile groups with and without outlier records
# Calculate the quintile groups for one variable (e.g., `pop1990`)
m_all <-
m %>%
mutate(qnt_1990_all = dplyr::ntile(pop1990,5))
# Find the outliers for a different variable (e.g., 'pop1950')
# and subset the df to exlcude these outlier records
m_out <- boxplot.stats(m$pop1950) %>% .[["out"]]
m_trim <-
m %>%
filter(pop1950 %!in% m_out) %>%
mutate(qnt_1990_trim = dplyr::ntile(pop1990,5))
# Assess whether the outlier trimming impacted the first quintile group
m_comp <-
m_trim %>%
select(name,dplyr::contains("qnt")) %>%
left_join(m_all,.,"name") %>%
select(name,dplyr::contains("qnt"),everything()) %>%
mutate(qnt_1990_chng_lgl = !is.na(qnt_1990_trim) & qnt_1990_trim != qnt_1990_all,
qnt_1990_chng_dir = if_else(qnt_1990_chng_lgl,
paste0(qnt_1990_all," to ",qnt_1990_trim),
"No change"))
With a little help from ggplot2, I can see that in this example six outliers were identified and that their omission did not affect the first quintile group for pop1990.
Importantly, this information is tracked in two new variables: qnt_1990_chng_lgl and qnt_1990_chng_dir.
> m_comp %>% select(name,qnt_1990_chng_lgl,qnt_1990_chng_dir,everything())
# A tibble: 50 x 14
name qnt_1990_chng_lgl qnt_1990_chng_dir qnt_1990_all qnt_1990_trim
<chr> <lgl> <chr> <dbl> <dbl>
1 Sydney FALSE No change 5 NA
2 Havana TRUE 4 to 5 4 5
3 Campinas TRUE 3 to 4 3 4
4 Kano FALSE No change 4 4
5 Omsk FALSE No change 3 3
6 Ouagadougou FALSE No change 1 1
7 Marseille FALSE No change 3 3
8 Taiyuan TRUE 3 to 4 3 4
9 La Paz FALSE No change 2 2
10 Baltimore FALSE No change 4 4
# ... with 40 more rows, and 9 more variables: pop1950 <dbl>, pop1960 <dbl>,
# pop1970 <dbl>, pop1980 <dbl>, pop1990 <dbl>, pop2000 <dbl>, pop2010 <dbl>,
# pop2020 <dbl>, pop2030 <dbl>
I now need to find a way to repeat this process for every variable in the dataframe (i.e., pop1960 - pop2030). Ideally, two new variables would be created for each existing pop* variable and their names would be preceded by qnt_ and followed by either _chng_dir or _chng_lgl.
Is purrr the right tool to use for this? dplyr::mutate_? data.table?
It turns out this problem is solvable using tidyr::gather + dplyr::group_by + tidyr::spread functions. While #shayaa and #Gregor didn't provide the solution I was looking for, their advice helped me course-correct away from the functional programming methods I was researching.
I ended up using #shayaa's gather and group_by combination, followed by mutate to create the variable names (qnt_*_chng_lgl and qnt_*_chng_dir) and then using spread to make it wide again. An anonymous function passed to summarize_all removed all the extra NA's that the wide-long-wide transformations created.
m_comp <-
m %>%
mutate(qnt = dplyr::ntile(pop1950,5)) %>%
filter(pop1950 %!in% m_out) %>%
gather(year,pop,-name,-qnt) %>%
group_by(year) %>%
mutate(qntTrim = dplyr::ntile(pop,5),
qnt_chng_lgl = !is.na(qnt) & qnt != qntTrim,
qnt_chng_dir = ifelse(qnt_chng_lgl,
paste0(qnt," to ",qntTrim),
"No change"),
year_lgl = paste0("qnt_chng_",year,"_lgl"),
year_dir = paste0("qnt_chng_",year,"_dir")) %>%
spread(year_lgl,qnt_chng_lgl) %>%
spread(year_dir,qnt_chng_dir) %>%
spread(year,pop) %>%
select(-qnt,-qntTrim) %>%
group_by(name) %>%
summarize_all(function(.){subset(.,!is.na(.)) %>% first})
Nothing wrong with your analysis it seems to me,
After this part
m <- metro#data %>%
as_tibble %>%
select(-name_long,-iso_a3) %>%
sample_n(50)
Just melt your data and continue your analysis but with group_by(year)
library(reshape2)
library(stringr)
mm <- melt(m)
mm[,2] <- as.factor(str_sub(mm[,2],-4))
names(mm)[2:3] <- c("year", "population")
e.g.,
mm %>% group_by(year) %>%
+ mutate(qnt_all = dplyr::ntile(population,5))