how are you?
So, I have a dataset that looks like this:
dirtax_trev indtax_trev lag2_majority pub_exp
      <dbl>       <dbl>         <dbl>   <dbl>
     0.1542      0.5186             0    9754
     0.1603      0.4935             0    9260
     0.1511      0.5222             1    8926
     0.2016      0.5501             0    9682
     0.6555      0.2862             1   10447
I'm having the following problem. I want to execute a series of t-tests along a dummy variable (lag2_majority), collect the p-values of these tests, and assign them to a vector, using a pipe.
All variables on which I want to run these t-tests are selected below; I then omit NA values for my t-test variable (lag2_majority) and try to summarise with this code:
test <- g %>%
  select(dirtax_trev, indtax_trev, gdpc_ppp, pub_exp,
         SOC_tot, balance, fdi, debt, polity2, chga_demo, b_gov, social_dem,
         iaep_ufs, gini, pov4, informal, lab, al_ethnic, al_language, al_religion,
         lag_left, lag2_left, majority, lag2_majority, left, system, b_system,
         execrlc, allhouse, numvote, legelec, exelec, pr) %>%
  na.omit(lag2_majority) %>%
  summarise_all(funs(t.test(.[lag2_majority], .[lag2_majority == 1])$p.value))
However, once I run this, the response I get is: Error in summarise_impl(.data, dots): Evaluation error: data are essentially constant., which is confusing since there is a clear difference in means along the dummy variable. The same error appears when I replace the last line of the code above with: summarise_all(funs(t.test(. ~ lag2_majority)$p.value)).
Alternatively, since all I want to do is: t.test(dirtax_trev~lag2_majority, g)$p.value, for instance, I thought I could do a loop, like this:
for (i in vars){
  t.test(i ~ lag2_majority, g)$p.value
}
where vars is an object that contains all of the variables selected in the code above. But once again I get an error message. Specifically, this one: Error in model.frame.default(formula = i ~ lag2_majority, data = g): variable lengths differ (found in 'lag2_majority')
What am I doing wrong?
Best Regards!
Your question is not reproducible; please read this for how you could improve its quality.
My answer has been generalised to be reproducible because I don't have your data and cannot therefore adapt your code directly.
Using a tidy approach I'll produce a data frame of p-values for each variable.
library(tidyr)
library(dplyr)
library(purrr)
mtcars %>%
  select_if(is.numeric) %>%
  map(t.test) %>%
  lapply(`[[`, "p.value") %>%
  as_tibble %>%
  gather(key, p.value)
# # A tibble: 11 x 2
# key p.value
# <chr> <dbl>
# 1 mpg 1.526151e-18
# 2 cyl 5.048147e-19
# 3 disp 9.189065e-12
# 4 hp 2.794134e-13
# 5 drat 1.377586e-27
# 6 wt 2.257406e-18
# 7 qsec 7.790282e-33
# 8 vs 2.776961e-05
# 9 am 6.632258e-05
# 10 gear 1.066949e-23
# 11 carb 4.590930e-11
Update
Thank you for updating your question. Note that the value you included in your earlier comment is likely from your original dataset and is still not reproducible here. When I run the code, this is the output.
t.test(dirtax_trev ~ lag2_majority, g)$p.value
# [1] 0.5272474
Please frame your questions in a way that anyone can see the problem in the same way that you do.
To build up the formula you are running through the t.test, I have taken a slightly different approach.
library(magrittr)
library(dplyr)
library(purrr)
g <- tribble(
  ~dirtax_trev, ~indtax_trev, ~lag2_majority, ~pub_exp,
  0.1542, 0.5186, 0, 9754,
  0.1603, 0.4935, 0, 9260,
  0.1511, 0.5222, 1, 8926,
  0.2016, 0.5501, 0, 9682,
  0.6555, 0.2862, 1, 10447
)
dummy <- "lag2_majority"
colnames(g) %>%
  .[. != dummy] %>%            # vector of variables to send through t.test
  paste(., "~", dummy) %>%     # build formula as character
  map(as.formula) %>%          # convert to formula class
  map(t.test, data = g) %$%    # run t.test for each, note the special operator
  tibble(
    data.name = unlist(lapply(., `[[`, "data.name")),
    p.value = unlist(lapply(., `[[`, "p.value"))
  )
# # A tibble: 3 x 2
# data.name p.value
# <chr> <dbl>
# 1 dirtax_trev by lag2_majority 0.5272474
# 2 indtax_trev by lag2_majority 0.5021217
# 3 pub_exp by lag2_majority 0.8998690
If you prefer to drop the dummy variable name from data.name, you could modify its assignment in the tibble with:
data.name = unlist(strsplit(unlist(lapply(., `[[`, "data.name")), paste(" by", dummy)))
N.B. I used the special %$% from magrittr to expose the names from the list of tests to build a data frame. I'm sure there are other ways that may be more elegant, however, I find this form quite easy to reason about.
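A possible alternative that avoids %$% and stays within purrr (just a sketch, reusing the g and dummy objects defined above):
colnames(g) %>%
  .[. != dummy] %>%
  paste(., "~", dummy) %>%
  map(as.formula) %>%
  map(t.test, data = g) %>%
  # pull data.name and p.value out of each htest object into one tibble
  map_dfr(~ tibble(data.name = .x$data.name, p.value = .x$p.value))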
Related
I keep getting "summarise() has grouped output by 'new_brand'. You can override using the .groups argument." I'm not sure if I'm getting this message because I created the columns pos_prop and neg_prop.
superbowl %>%
  group_by(new_brand, superbowl) %>%
  summarize(mean(superbowl$volume, superbowl$pos_prop, superbowl$neg_prop),
            sd(superbowl$volume, superbowl$pos_prop, superbowl$neg_prop)) %>%
  filter(superbowl, superbowl == "0")
When I run rlang::last_error() the code works. I'm not sure how to make the code run properly. Any help will be appreciated.
You're using summarize and such incorrectly. Try this:
superbowl %>%
group_by(new_brand) %>%
summarize(across(c(volume, pos_prop, neg_prop),
list(mu = ~ mean(.), sigma = ~ sd(.)))) %>%
filter(superbowl == "0")
Notes on your code:
once you start a dplyr pipe with superbowl %>%, you should almost never use superbowl$ inside the dplyr verbs (there are very rare exceptions); I also removed the references to superbowl in both group_by and filter, since it is not clear whether you're trying to refer to the original frame symbol again. If you have a superbowl$superbowl column, then they may still be appropriate;
either use across(..) as above or name the calculations, e.g., summarize(volume_mu = mean(volume), pos_mu = mean(pos_prop), ...); and
I'm inferring, but ... mean(volume, pos_prop, neg_prop) (with or without the superbowl$) is an error: in this case, the call is effectively mean(volume, trim=pos_prop, na.rm=neg_prop), which should be producing errors. One could adapt this to be mean(c(volume, pos_prop, neg_prop)) if you really want to aggregate three columns' data into a single number, but I thought that might be unintended over-aggregation.
Demonstration of this with real data:
mtcars %>%
group_by(cyl) %>%
summarize(across(c(disp, mpg),
list(mu = ~ mean(.), sigma = ~ sd(.))))
# # A tibble: 3 x 5
# cyl disp_mu disp_sigma mpg_mu mpg_sigma
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 105. 26.9 26.7 4.51
# 2 6 183. 41.6 19.7 1.45
# 3 8 353. 67.8 15.1 2.56
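To illustrate the last note above on positional matching, here is a minimal sketch with made-up scalars (not your data) showing how the second and third arguments get matched to trim and na.rm, whereas wrapping in c() pools everything first:
x <- c(1, 2, 3)
mean(x, 0.1, TRUE)     # matched as mean(x, trim = 0.1, na.rm = TRUE); returns 2
mean(c(x, 0.1, TRUE))  # one pooled vector c(1, 2, 3, 0.1, 1); returns 1.42
With whole columns in place of the scalars, trim and na.rm receive vectors, which is why the original call should error.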
I have a data frame in which I would like to compute some extra columns as a function of the existing columns, but I want to specify both each new column name and the function dynamically. I have a vector of column names that are already in the dataframe df_daily:
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
The rows of the dataframe have responses to each question from each user each time they answer the questionnaire, as well as a column with the number of days since the user first answered the questionnaire (i.e. Days_From_First_Use = 0 on the very first use, = 1 if it is used the next day, etc.). I want to average the responses to these questions by Days_From_First_Use. I start by grouping my dataframe by Days_From_First_Use:
df_test <- df_daily %>%
group_by(Days_From_First_Use)
and then try averaging the responses in a loop as follows:
for(i in 1:5){
  df_test <- df_test %>%
    mutate(!! paste0('Avg_Score_', DAILY_QUESTIONS[i]) :=
             paste0('mean(', DAILY_QUESTIONS[i], ')'))
}
Unfortunately, while my new variable names are correct ("Avg_Score_Q1_Daily", "Avg_Score_Q2_Daily", "Avg_Score_Q3_Daily", "Avg_Score_Q4_Daily", "Avg_Score_Q5_Daily"), my answers are not: every row in my data frame has a string such as "mean(Q1_Daily)" in the relevant column.
So I'm clearly doing something wrong - what do I need to do to fix this and get the average score across all users on each day?
Sincerely and with many thanks in advance
Thomas Philips
I took a somewhat different approach, using summarize(across(...)) after group_by(Days_From_First_Use). I achieve the dynamic names by using rename_with and a custom function that replaces a leading "Q" with "Avg_Score_Q".
library(dplyr, warn.conflicts = FALSE)
# fake data -- 30 normalized "responses" from 0 to 2 days from first use to 5 questions
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
df_daily <- as.data.frame(do.call('cbind', lapply(1:5, function(i) rnorm(30, i))))
colnames(df_daily) <- DAILY_QUESTIONS
df_daily$Days_From_First_Use <- floor(runif(30, 0, 3))
df_test <- df_daily %>%
  group_by(Days_From_First_Use) %>%
  summarize(across(.fns = mean)) %>%
  rename_with(.fn = function(x) gsub("^Q", "Avg_Score_Q", x))
#> `summarise()` ungrouping output (override with `.groups` argument)
df_test
#> # A tibble: 3 x 6
#> Days_From_First… Avg_Score_Q1_Da… Avg_Score_Q2_Da… Avg_Score_Q3_Da…
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1.26 1.75 3.02
#> 2 1 0.966 2.14 3.48
#> 3 2 1.08 2.45 3.01
#> # … with 2 more variables: Avg_Score_Q4_Daily <dbl>, Avg_Score_Q5_Daily <dbl>
Created on 2020-12-06 by the reprex package (v0.3.0)
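If you would rather keep your original loop-and-mutate structure, the core fix is to evaluate mean() on the column itself instead of pasting a string. A minimal sketch of that variant (assuming your df_daily and DAILY_QUESTIONS objects, and noting that mutate on a grouped frame repeats each group mean down the rows):
library(dplyr)
df_test <- df_daily %>%
  group_by(Days_From_First_Use)
for (q in DAILY_QUESTIONS) {
  df_test <- df_test %>%
    # .data[[q]] looks up the column named by the string q; !! ... := builds the new name
    mutate(!!paste0("Avg_Score_", q) := mean(.data[[q]]))
}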
I have a data frame (df) with the following structure:
ID Long_Ref Lat_Ref Long_1 Lat_1 Long_2 Lat_2
A -71.69 -33.39 -70.29 -33.39 -71.69 -34.19
B -72.39 -34.34 -74.29 -31.19 -70.54 -33.38
C -70.14 -32.38 -70.62 -32.37 -69.76 -32.22
D -70.39 -33.54 -70.42 -34.99 -68.69 -32.33
I am trying to append new columns into my existing data frame with the distance between Ref and 1, Ref and 2, etc using distHaversine.
This is the code that I have worked through so far:
for (i in 1:2){
  for (j in 1:nrow(df)){
    df[, paste("Dist_Mts_", i, sep = "")][j] <- distHaversine(
      matrix(c(df$Long_Ref, df$Lat_Ref), ncol = 2),
      matrix(c(as.matrix(df[, paste("Long_", i, sep = "")]),
               as.matrix(df[, paste("Lat_", i, sep = "")])), ncol = 2))
  }
}
However, I keep getting the following error:
Error in .pointsToMatrix(p2) * toRad :
non-numeric argument to binary operator
In addition: Warning message:
In .pointsToMatrix(p2) : NAs introduced by coercion
I would appreciate any help. Thank you.
Assuming that you have a number of points, possibly more than 2, to which you'd like to calculate the distances, you could do something like this (without using a for loop):
point.no <- 1:2 # depending on how many points you have

library(tidyverse)
library(geosphere) # for distHaversine()

point.no %>%
  map(~ paste0(c('Long_', 'Lat_'), .x)) %>%
  setNames(paste0('Dist_Mts_', point.no)) %>%
  map_df(~ distHaversine(df[c('Long_Ref', 'Lat_Ref')], df[.x]))
If you only need to compute the distances to those two points then you could just write this:
list(Dist_Mts_1=c('Long_1', 'Lat_1'), Dist_Mts_2=c('Long_2', 'Lat_2')) %>%
map_df(~ distHaversine(df[c('Long_Ref', 'Lat_Ref')], df[.x]))
Output:
# A tibble: 4 x 2
Dist_Mts_1 Dist_Mts_2
<dbl> <dbl>
1 130123 89056
2 393159 201653
3 45141 39946
4 161437 208248
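Since the goal was to append these as new columns to the existing data frame, you could bind the result back onto df. A small sketch along the same lines (assuming the point.no pipeline above, with tidyverse and geosphere attached):
df <- point.no %>%
  map(~ paste0(c('Long_', 'Lat_'), .x)) %>%
  setNames(paste0('Dist_Mts_', point.no)) %>%
  map_df(~ distHaversine(df[c('Long_Ref', 'Lat_Ref')], df[.x])) %>%
  bind_cols(df, .)   # original columns first, then Dist_Mts_1, Dist_Mts_2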
Alright, I'm waving my white flag.
I'm trying to compute a loess regression on my dataset.
I want loess to compute a different set of points that plots as a smooth line for each group.
The problem is that the loess calculation is escaping the dplyr::group_by function, so the loess regression is calculated on the whole dataset.
Internet searching leads me to believe this is because dplyr::group_by wasn't meant to work this way.
I just can't figure out how to make this work on a per-group basis.
Here are some examples of my failed attempts.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
do(broom::tidy(predict(loess(Meth ~ AVGMOrder, span = .85, data=.))))
> test2
# A tibble: 136 x 2
# Groups: CpG [4]
CpG x
<chr> <dbl>
1 cg01003813 0.781
2 cg01003813 0.793
3 cg01003813 0.805
4 cg01003813 0.816
5 cg01003813 0.829
6 cg01003813 0.841
7 cg01003813 0.854
8 cg01003813 0.866
9 cg01003813 0.878
10 cg01003813 0.893
This one works, but I can't figure out how to apply the result to a column in my original dataframe. The result I want is column x. If I apply x as a column in a separate line, I run into issues because I called dplyr::arrange earlier.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::do({
predict(loess(Meth ~ AVGMOrder, span = .85, data=.))
})
This one simply fails with the following error.
"Error: Results 1, 2, 3, 4 must be data frames, not numeric"
Also, it still isn't applied as a new column with dplyr::mutate.
fems <- fems %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.)))
This was my first attempt and mostly resembles what I want to do. The problem is that this one performs the loess prediction on the entire dataframe and not on each CpG group.
I am really stuck here. I read online that the purrr package might help, but I'm having trouble figuring it out.
data looks like this:
> head(test)
X geneID CpG CellLine Meth AVGMOrder neworder Group SmoothMeth
1 40 XG cg25296477 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.81107210 1 1 5 0.7808767
2 94 XG cg01003813 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.97052120 1 1 5 0.7927130
3 148 XG cg13176022 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.06900448 1 1 5 0.8045080
4 202 XG cg26484667 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.84077890 1 1 5 0.8163997
5 27 XG cg25296477 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.81623880 2 2 3 0.8285259
6 81 XG cg01003813 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.95569240 2 2 3 0.8409501
unique(test$CpG)
[1] "cg25296477" "cg01003813" "cg13176022" "cg26484667"
So, to be clear, I want to do a loess regression on each unique CpG in my dataframe and apply the resulting "regressed y axis values" to a column matching the original y axis values (Meth).
My actual dataset has a few thousand of those CpG's, not just the four.
https://docs.google.com/spreadsheets/d/1-Wluc9NDFSnOeTwgBw4n0pdPuSlMSTfUVM0GJTiEn_Y/edit?usp=sharing
This is a neat Tidyverse way to make it work:
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
models <- fems %>%
tidyr::nest(-CpG) %>%
dplyr::mutate(
# Perform loess calculation on each CpG group
m = purrr::map(data, loess,
formula = Meth ~ AVGMOrder, span = .5),
# Retrieve the fitted values from each model
fitted = purrr::map(m, `[[`, "fitted")
)
# Apply fitted y's as a new column
results <- models %>%
dplyr::select(-m) %>%
tidyr::unnest()
# Plot with loess line for each group
ggplot(results, aes(x = AVGMOrder, y = Meth, group = CpG, colour = CpG)) +
geom_point() +
geom_line(aes(y = fitted))
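A side note, not part of the original answer: on tidyr 1.0 or later the nesting and unnesting calls have changed slightly, so the same idea would look roughly like this (an assumption about your package versions):
models <- fems %>%
  tidyr::nest(data = -CpG) %>%
  dplyr::mutate(
    # Perform the loess calculation on each CpG group
    m = purrr::map(data, loess, formula = Meth ~ AVGMOrder, span = .5),
    # Retrieve the fitted values from each model
    fitted = purrr::map(m, `[[`, "fitted")
  )
results <- models %>%
  dplyr::select(-m) %>%
  tidyr::unnest(c(data, fitted))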
You may have already figured this out -- but if not, here's some help.
Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.
So for your case:
fems <- fems %>%
group_by(CpG) %>%
arrange(CpG, AVGMOrder) %>%
mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.),
data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))
Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.
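One way to guard against groups that are too small for loess is to drop them before fitting. A sketch, not part of the original answer (the threshold of 4 is a guess, and cur_data() is used so the fit only sees the current group's rows):
fems <- fems %>%
  group_by(CpG) %>%
  filter(n() >= 4) %>%   # drop CpG groups too small for loess; the exact minimum depends on the span
  arrange(CpG, AVGMOrder) %>%
  mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data = cur_data())))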
Unfortunately, the approaches described above did not work in my case. Thus, I implemented the loess prediction in a regular function, which worked very well. In the example below, the data is contained in the df data frame; we group by df$profile and want to fit the loess prediction to the df$daily_sum values.
# Define important variables
span_60 <- 60/365 # 60 days of a year
span_365 <- 365/365 # a whole year
# Group and order the data set
df <- as.data.frame(
  df %>%
    group_by(profile) %>%
    arrange(profile, day)
)
# Define the Loess function. x is the data frame that has to be passed
predict_loess <- function(x) {
  # Declare that the loess columns exist, but are blank
  x$loess_60 <- NA
  x$loess_365 <- NA

  # Identify all unique profile IDs
  all_ids <- unique(x$profile)

  # Iterate through the unique profile IDs, determine the length of each vector
  # (which should correspond to 365 days) and isolate the rows that belong to that profile ID.
  for (i in all_ids) {
    len_entries <- length(which(x$profile == i))
    queried_rows <- x[which(x$profile == i), ]

    # Run the loess fits and write the results to the corresponding columns
    fit_60 <- predict(loess(daily_sum ~ seq(1, len_entries), data = queried_rows, span = span_60))
    fit_365 <- predict(loess(daily_sum ~ seq(1, len_entries), data = queried_rows, span = span_365))
    x[which(x$profile == i), "loess_60"] <- fit_60
    x[which(x$profile == i), "loess_365"] <- fit_365
  }

  # Return the data frame with the two new columns
  return(x)
}
# Run the Loess prediction and put the results into two columns - one for a short and one for a long time span
df <- predict_loess(df)
Intro
After recently taking Hadley Wickham's functional programming class, I decided I'd try applying some of the lessons to my projects at work. Naturally, the first project I tried has proven to be more complicated than the worked examples demonstrated in the class. Does anyone have recommendations for a way to use the purrr package to make the task described below more efficient?
Project Background
I need to assign quintile groups to records in a spatial polygon dataframe. In addition to the record identifier there are several other variables and I need to calculate the quintile group for each.
Here's the crux of the problem: I have been asked to identify outliers in one particular variable and to omit those records from the entire analysis as long as it doesn't change the quintile composition of the first quintile group for any of the other variables.
Question
I have put together a dplyr pipeline (see the example below) that performs this checking process for a single variable, but how might I rewrite this process so that I can efficiently check each variable?
EDIT: While it is certainly possible to change the shape of the data from wide to long as an intermediary step, in the end it needs to return to its wide format so that it matches up with the @polygons slot of the spatial polygons dataframe.
Reproducible Example
You can find the complete script here: https://gist.github.com/tiernanmartin/6cd3e2946a77b7c9daecb51aa11e0c94
Libraries and Settings
library(grDevices) # boxplot.stats()
library(operator.tools) # %!in% logical operator
library(tmap) # 'metro' data set
library(magrittr) # piping
library(dplyr) # exploratory data analysis verbs
library(purrr) # recursive mapping of functions
library(tibble) # improved version of a data.frame
library(ggplot2) # dot plot
library(ggrepel) # avoid label overlap
options(scipen=999)
set.seed(888)
Load the example data and take a small sample of it
data("metro")
m_spdf <- metro
# Take a sample
m <-
  metro@data %>%
  as_tibble %>%
  select(-name_long, -iso_a3) %>%
  sample_n(50)
> m
# A tibble: 50 x 10
name pop1950 pop1960 pop1970 pop1980 pop1990
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sydney 1689935 2134673 2892477 3252111 3631940
2 Havana 1141959 1435511 1779491 1913377 2108381
3 Campinas 151977 293174 540430 1108903 1693359
4 Kano 123073 229203 541992 1349646 2095384
5 Omsk 444326 608363 829860 1032150 1143813
6 Ouagadougou 33035 59126 115374 265200 537441
7 Marseille 755805 928768 1182048 1372495 1418279
8 Taiyuan 196510 349535 621625 1105695 1636599
9 La Paz 319247 437687 600016 809218 1061850
10 Baltimore 1167656 1422067 1554538 1748983 1848834
# ... with 40 more rows, and 4 more variables:
# pop2000 <dbl>, pop2010 <dbl>, pop2020 <dbl>,
# pop2030 <dbl>
Calculate quintile groups with and without outlier records
# Calculate the quintile groups for one variable (e.g., `pop1990`)
m_all <-
  m %>%
  mutate(qnt_1990_all = dplyr::ntile(pop1990, 5))

# Find the outliers for a different variable (e.g., 'pop1950')
# and subset the df to exclude these outlier records
m_out <- boxplot.stats(m$pop1950) %>% .[["out"]]

m_trim <-
  m %>%
  filter(pop1950 %!in% m_out) %>%
  mutate(qnt_1990_trim = dplyr::ntile(pop1990, 5))
# Assess whether the outlier trimming impacted the first quintile group
m_comp <-
  m_trim %>%
  select(name, dplyr::contains("qnt")) %>%
  left_join(m_all, ., "name") %>%
  select(name, dplyr::contains("qnt"), everything()) %>%
  mutate(qnt_1990_chng_lgl = !is.na(qnt_1990_trim) & qnt_1990_trim != qnt_1990_all,
         qnt_1990_chng_dir = if_else(qnt_1990_chng_lgl,
                                     paste0(qnt_1990_all, " to ", qnt_1990_trim),
                                     "No change"))
With a little help from ggplot2, I can see that in this example six outliers were identified and that their omission did not affect the first quintile group for pop1990.
Importantly, this information is tracked in two new variables: qnt_1990_chng_lgl and qnt_1990_chng_dir.
> m_comp %>% select(name,qnt_1990_chng_lgl,qnt_1990_chng_dir,everything())
# A tibble: 50 x 14
name qnt_1990_chng_lgl qnt_1990_chng_dir qnt_1990_all qnt_1990_trim
<chr> <lgl> <chr> <dbl> <dbl>
1 Sydney FALSE No change 5 NA
2 Havana TRUE 4 to 5 4 5
3 Campinas TRUE 3 to 4 3 4
4 Kano FALSE No change 4 4
5 Omsk FALSE No change 3 3
6 Ouagadougou FALSE No change 1 1
7 Marseille FALSE No change 3 3
8 Taiyuan TRUE 3 to 4 3 4
9 La Paz FALSE No change 2 2
10 Baltimore FALSE No change 4 4
# ... with 40 more rows, and 9 more variables: pop1950 <dbl>, pop1960 <dbl>,
# pop1970 <dbl>, pop1980 <dbl>, pop1990 <dbl>, pop2000 <dbl>, pop2010 <dbl>,
# pop2020 <dbl>, pop2030 <dbl>
I now need to find a way to repeat this process for every variable in the dataframe (i.e., pop1960 - pop2030). Ideally, two new variables would be created for each existing pop* variable and their names would be preceded by qnt_ and followed by either _chng_dir or _chng_lgl.
Is purrr the right tool to use for this? dplyr::mutate_? data.table?
It turns out this problem is solvable using the tidyr::gather + dplyr::group_by + tidyr::spread functions. While @shayaa and @Gregor didn't provide the solution I was looking for, their advice helped me course-correct away from the functional programming methods I was researching.
I ended up using @shayaa's gather and group_by combination, followed by mutate to create the variable names (qnt_*_chng_lgl and qnt_*_chng_dir) and then using spread to make it wide again. An anonymous function passed to summarize_all removed all the extra NAs that the wide-long-wide transformations created.
m_comp <-
  m %>%
  mutate(qnt = dplyr::ntile(pop1950, 5)) %>%
  filter(pop1950 %!in% m_out) %>%
  gather(year, pop, -name, -qnt) %>%
  group_by(year) %>%
  mutate(qntTrim = dplyr::ntile(pop, 5),
         qnt_chng_lgl = !is.na(qnt) & qnt != qntTrim,
         qnt_chng_dir = ifelse(qnt_chng_lgl,
                               paste0(qnt, " to ", qntTrim),
                               "No change"),
         year_lgl = paste0("qnt_chng_", year, "_lgl"),
         year_dir = paste0("qnt_chng_", year, "_dir")) %>%
  spread(year_lgl, qnt_chng_lgl) %>%
  spread(year_dir, qnt_chng_dir) %>%
  spread(year, pop) %>%
  select(-qnt, -qntTrim) %>%
  group_by(name) %>%
  summarize_all(function(.) { subset(., !is.na(.)) %>% first })
There is nothing wrong with your analysis, it seems to me. After this part:
m <- metro@data %>%
  as_tibble %>%
  select(-name_long, -iso_a3) %>%
  sample_n(50)
just melt your data and continue your analysis, but with group_by(year):
library(reshape2)
library(stringr)
mm <- melt(m)
mm[,2] <- as.factor(str_sub(mm[,2],-4))
names(mm)[2:3] <- c("year", "population")
e.g.,
mm %>%
  group_by(year) %>%
  mutate(qnt_all = dplyr::ntile(population, 5))
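Since the result ultimately has to go back to wide format to line up with the spatial polygons data frame, a possible final step could look like this (a sketch reusing reshape2; the mm_wide name is just illustrative):
mm_wide <- mm %>%
  group_by(year) %>%
  mutate(qnt_all = dplyr::ntile(population, 5)) %>%
  dcast(name ~ year, value.var = "qnt_all")   # one row per name, one quintile column per year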