Trying to generate ASV table from phyloseq - r

I recognize most people have the opposite problem, but I'm trying to create an ASV table with "identified OTUs" as column names (i.e. the column name is drawn from the taxonomy information in GlobalPatterns@tax_table, rather than being the assigned OTU code stored in GlobalPatterns@otu_table) and sample names as row names.
I also want to append the metadata to the end of the ASV table, to allow for analysis based on said metadata.
I managed to generate a table without the taxonomic information with this code, using GlobalPatterns for reproducibility:
library(phyloseq)
data(GlobalPatterns)
asv.matrix <- as.matrix(GlobalPatterns@otu_table@.Data)
asv <- data.frame(t(asv.matrix)) # transpose so that sample names become the row names
meta.df <- as.data.frame(GlobalPatterns@sam_data)
asv.full <- data.frame(asv, meta.df)
write.csv(asv.full, file = "full_asv.csv", quote = FALSE) # write.csv() sets sep = "," itself and warns if you pass it
However, I can't figure out how to force taxonomy information into the column names, which makes the ASV table functionally useless for analysis.
EDIT:
My preferred format is (abbreviated with faked metadata appended) as below. Tried to make a table, failed, have a fake code chunk.
Sample-ID / Species1 / Species2 / ...etc... / Metadata1 / Metadata2 /...etc... /
--------- / -------- / -------- / --------- / --------- / --------- /--------- /
Sample1 / 1 / 5 / ...etc... / lake / summer /...etc... /
Sample2 / 4 / 0 / ...etc... / bog / spring /...etc... /

I think you're looking for the phyloseq::psmelt function, which combines the otu_table, tax_table and sample_data tables into a single, long format table that is suitable for analysis.
One way of dealing with unresolved taxonomy is to assign the highest known taxonomy to any unresolved level. You can use the name_na_taxa function from the fantaxtic package for this, prior to using psmelt.
EDIT
After seeing your updated post, I understand a bit better what you want. You can take the output from psmelt and pivot this into a semi-wide format; see the code chunk below.
require("phyloseq")
require("fantaxtic")
require("tidyverse")
# Load data
data(GlobalPatterns)
# Generate (unique) species names using fantaxtic
ps <- name_na_taxa(GlobalPatterns)
ps <- label_duplicate_taxa(ps, tax_level = "Species", asv_as_id = TRUE)
# Convert to long data format
ps_long <- psmelt(ps)
# Convert to semi-wide data format where each column has a taxon name
# and contains the abundance in each sample
meta_vars <- sample_variables(ps)
ps_wide <- ps_long %>%
select(all_of(meta_vars), Species, Abundance) %>%
pivot_wider(names_from = Species,
values_from = Abundance)
# Inspect the final table
head(ps_wide)
#> # A tibble: 6 x 19,223
#> X.SampleID Primer Final_Barcode Barcode_truncate~ Barcode_full_le~ SampleType
#> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 AQC4cm ILBC_17 ACAGCT AGCTGT CAAGCTAGCTG Freshwate~
#> 2 LMEpi24M ILBC_13 ACACTG CAGTGT CATGAACAGTG Freshwater
#> 3 AQC7cm ILBC_18 ACAGTG CACTGT ATGAAGCACTG Freshwate~
#> 4 AQC1cm ILBC_16 ACAGCA TGCTGT GACCACTGCTG Freshwate~
#> 5 M31Tong ILBC_10 ACACGA TCGTGT TGTGGCTCGTG Tongue
#> 6 M11Fcsw ILBC_05 AAGCTG CAGCTT CGACTGCAGCT Feces
#> # ... with 19,217 more variables: Description <fct>,
#> # `Unknown Stramenopiles (Order) 549656` <dbl>,
#> # `Unknown Dolichospermum (Genus) 279599` <dbl>,
#> # `Unknown Neisseria (Genus) 360229` <dbl>,
#> # `Unknown Bacteroides (Genus) 331820` <dbl>,
#> # `Haemophilusparainfluenzae 94166` <dbl>,
#> # `Unknown ACK-M1 (Family) 329744` <dbl>, ...
Created on 2022-09-26 by the reprex package (v2.0.1)
Note that this will potentially lead to a table with thousands of columns (about 20k in the case of GlobalPatterns), which might be hard to work with.
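If you would rather stay close to the approach in the question, the same wide table can be built by renaming the columns from a taxonomy lookup. The following is only a minimal base R sketch on toy data: otu, tax and meta are hypothetical stand-ins for the otu_table, tax_table and sample_data of a phyloseq object, not objects extracted from GlobalPatterns.

```r
# Toy stand-in for an OTU table with taxa as rows and samples as columns
otu <- matrix(c(1, 4,
                5, 0,
                2, 7),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("otu1", "otu2", "otu3"),
                              c("Sample1", "Sample2")))

# Toy stand-in for the taxonomy: one species label per OTU code
# (otu2 and otu3 share a label on purpose)
tax <- c(otu1 = "Species1", otu2 = "Species2", otu3 = "Species2")

# Toy stand-in for the sample metadata, keyed by sample name
meta <- data.frame(habitat = c("lake", "bog"),
                   season  = c("summer", "spring"),
                   row.names = c("Sample1", "Sample2"))

# Transpose so that samples are rows, then swap OTU codes for taxon names;
# make.unique() appends .1, .2, ... to repeated names
asv <- as.data.frame(t(otu))
colnames(asv) <- make.unique(unname(tax[colnames(asv)]))

# Append the metadata, matching rows by sample name
asv_full <- cbind(asv, meta[rownames(asv), , drop = FALSE])
asv_full
```

Duplicate taxon names are the main thing to watch for here, which is why the sketch uses make.unique(); label_duplicate_taxa() plays the same role in the fantaxtic version above.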


Add row including each variable value to existing dataframe

I have a data frame with 80 existing rows and 6 variables, they are:
Row_ID
CatName
CatAge
Request
Friends
ID,
and I need to add some outliers to the dataset of generated data by adding a row on to the end containing specific data.
I attempted the following but it does not work. Any tips on how to get this to work?
```{r, create row 1, echo=TRUE,include=TRUE}
Cat_dataframe %>%
add_row(Row_ID = "30",CatName = "Carla",CatAge="30",Request="30",Friends="8",ID="500000")
```
Your command looks pretty good to me:
library(tidyverse)
df <- tribble(~"Row_ID", ~"CatName", ~"CatAge", ~"Request", ~"Friends", ~"ID",
"1", "name1", "31", "request1", "2", "051245")
df %>%
add_row(Row_ID = "30",CatName = "Carla",CatAge="30",Request="30",Friends="8",ID="500000")
#> # A tibble: 2 × 6
#> Row_ID CatName CatAge Request Friends ID
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 name1 31 request1 2 051245
#> 2 30 Carla 30 30 8 500000
Created on 2022-04-03 by the reprex package (v2.0.1)
You may have an issue with your chunk label (i.e. try {r create_row_1, echo=TRUE, include=TRUE} instead of {r, create row 1, echo=TRUE,include=TRUE}), and you may have an issue with mismatched data types, e.g. if "CatAge" is an integer in your original dataframe but a character string in your add_row() command (CatAge = 30 and CatAge = "30" are different types).
If you edit your original question to include the error message(s) you're getting, it will very likely be easier to troubleshoot your problem.
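To illustrate the type issue, here is a small sketch with a made-up cats tibble (the column types are an assumption, since the original data frame isn't shown). Matching types append cleanly; with recent tibble versions a character value for an integer column is expected to be rejected, so that call is wrapped in tryCatch().

```r
library(tibble)

# Hypothetical data frame in which the numeric columns really are numeric
cats <- tibble(Row_ID = 1L, CatName = "name1", CatAge = 31L)

# Types match the existing columns, so the row is appended
ok <- add_row(cats, Row_ID = 30L, CatName = "Carla", CatAge = 30L)
ok

# CatAge is given as a string here; capture the error message, if any,
# instead of stopping the script
mismatch <- tryCatch(
  add_row(cats, Row_ID = 30L, CatName = "Carla", CatAge = "30"),
  error = function(e) conditionMessage(e)
)
mismatch
```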

R , Looping through a dataframe , creating a new one with additional content

Caution, quite new to R - but I really would like to do this in R instead of java.
My csv-file (Swedish redlist for species 2020 ) looks like this:
id,svenskt,latin,Organismgrupp,Kategori,Observationer,Landskapstyp,status_abbrev,Rodlistekriterium
249012,,Abia candens,stekel,Art,3,"Jordbrukslandskap (J) - Stor betydelse, Skog (S) - Har betydelse",DD,
249014,,Abia lonicerae,stekel,Art,2,Skog (S) - Stor betydelse,DD,
261452,,Abia nitens,stekel,Art,0,Jordbrukslandskap (J) - Stor betydelse,DD,
The whole csv-file can be downloaded from SLU by pressing the button 'skapa csv-fil'.
The only columns of interest to me are 'id' and 'status_abbrev'.
I would like to use those columns to update my db-table, doing something like this:
sql<- paste("update redlist SET status_abbrev='",abbrev,"' ","where id=",id,sep="")
reading the csv-file with this command:
library(dplyr)
redlist <- read.csv("rodlistade_arter_tampered_2.csv",header=TRUE);
dat <- select(redlist,'id', 'status_abbrev')
After these three lines:
redlist is a data frame containing the csv with its header.
dat is a data frame containing a subset of redlist (id and status_abbrev).
But which library would be best for iterating through the 'dat' data frame to create something like this?
That is, iterating, picking out abbrev and id, and creating the string below for each row (in the end I would like to write these strings to an sql-batch file and update the roughly 5660 records):
sql<- paste("update redlist SET status_abbrev='",abbrev,"' ","where id=",id,sep="")
so that my resulting string would look like this (and then the same for every row in the file):
update redlist SET status_abbrev='DD' where id=249012
best,i
Using dplyr::mutate() and glue::glue() you can create the strings like this
library(tidyverse)
library(glue)
#>
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
str <- 'id,svenskt,latin,Organismgrupp,Kategori,Observationer,Landskapstyp,status_abbrev,Rodlistekriterium
249012,,Abia candens,stekel,Art,3,"Jordbrukslandskap (J) - Stor betydelse, Skog (S) - Har betydelse",DD,
249014,,Abia lonicerae,stekel,Art,2,Skog (S) - Stor betydelse,DD,
261452,,Abia nitens,stekel,Art,0,Jordbrukslandskap (J) - Stor betydelse,DD,'
df <- read_csv(str)
df2 <- df %>%
mutate(sql_string = glue("update redlist SET status_abbrev='{status_abbrev}' where id={id}"))
df2
#> # A tibble: 3 x 10
#> id svenskt latin Organismgrupp Kategori Observationer Landskapstyp
#> <dbl> <lgl> <chr> <chr> <chr> <dbl> <chr>
#> 1 249012 NA Abia… stekel Art 3 Jordbruksla…
#> 2 249014 NA Abia… stekel Art 2 Skog (S) - …
#> 3 261452 NA Abia… stekel Art 0 Jordbruksla…
#> # … with 3 more variables: status_abbrev <chr>, Rodlistekriterium <lgl>,
#> # sql_string <glue>
df2 %>% pull(sql_string)
#> update redlist SET status_abbrev='DD' where id=249012
#> update redlist SET status_abbrev='DD' where id=249014
#> update redlist SET status_abbrev='DD' where id=261452
Created on 2020-07-27 by the reprex package (v0.3.0)
Is this what you are looking for?
For database integration, have a look at DBI.
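No iteration library is strictly needed, because string-building functions in R are vectorised over rows. As a minimal base R sketch (dat below is a hand-typed stand-in for the subset in the question, and the output file is a temporary file chosen for illustration):

```r
# Stand-in for the 'dat' subset from the question
dat <- data.frame(
  id = c(249012, 249014, 261452),
  status_abbrev = c("DD", "DD", "DD")
)

# sprintf() is vectorised, so this builds one statement per row
sql <- sprintf(
  "update redlist SET status_abbrev='%s' where id=%d;",
  dat$status_abbrev,
  as.integer(dat$id)
)

# Write the statements to a batch file, one per line
batch_file <- tempfile(fileext = ".sql")
writeLines(sql, batch_file)
sql
```

Pasting values straight into SQL is only reasonable for trusted data like this; for anything user-supplied, a parameterised query through DBI would be the safer route.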

How can I use map* and mutate to convert a list into a set of additional columns?

I have tried probably hundreds of permutations of this code for literally days to try to get a function that will do what I want, and I have finally given up. It feels like it should definitely be doable and I am so close!
I have tried to get back to the nub of things here with my reprex below.
Basically I have a single-row dataframe, with a column containing a list of strings ("concepts"). I want to create an additional column for each of those strings, using mutate, ideally with the column taking its name from the string, and then to populate the column with the results of a function call (it doesn't matter which function, for now; I just need the infrastructure of the function to work).
I feel, as usual, like I must be missing something obvious... maybe just a syntax error.
I also wonder whether I need purrr::map at all; maybe a simpler vectorised mapping would work fine.
I feel like the fact that new columns are named ..1 rather than the concept name is a bit of a clue as to what is wrong.
I can create the data frame I want by calling each concept manually (see end of reprex) but since the list of concepts is different for different data frames, I want to functionalise this using pipes and tidyverse techniques rather than do it manually.
I've read the following questions to find help:
How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs
How to mutate multiple columns with dynamic variable using purrr:map function?
(R) Cleaner way to use map() with list-columns
Add multiple output variables using purrr and a predefined function
Creating new variables with purrr (how does one go about that?)
How to compute multiple new columns in a R dataframe with dynamic names
but none of those has quite helped me crack the problem I'm experiencing. [edit: added in last q to that list which may be the technique I need].
# load packages -----------------------------------------------------------
library(rlang)
library(dplyr)
library(tidyr)
library(magrittr)
library(purrr)
library(nomisr)
# set up initial list of tibbles ------------------------------------------
df <- list(
district_population = tibble(
dataset_title = "Population estimates - local authority based by single year",
dataset_id = "NM_2002_1"
),
jsa_claimants = tibble(
dataset_title = "Jobseeker\'s Allowance with rates and proportions",
dataset_id = "NM_1_1"
)
)
# just use the first tibble for now, for testing --------------------------
# ideally I want to map across dfs through a list -------------------------
df <- df[[1]]
# nitty gritty functions --------------------------------------------------
get_concept_list <- function(df) {
dataset_id <- pluck(df, "dataset_id")
nomis_overview(id = dataset_id,
select = c("dimensions", "codes")) %>%
pluck("value", 1, "dimension") %>%
filter(!concept == "geography") %>%
pull("concept")
}
# get_concept_list() returns the strings I need:
get_concept_list(df)
#> [1] "time" "gender" "c_age" "measures"
# Here is a list of examples of types of map* that do various things,
# none of which is what I need it to do
# I'm using toupper() here for simplicity - ultimately I will use
# get_concept_info() to populate the new columns
# this creates four new tibbles
get_concept_list(df) %>%
map(~ mutate(df, {{.x}} := toupper(.x)))
#> [[1]]
#> # A tibble: 1 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 TIME
#>
#> [[2]]
#> # A tibble: 1 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 GENDER
#>
#> [[3]]
#> # A tibble: 1 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 C_AGE
#>
#> [[4]]
#> # A tibble: 1 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 MEASUR~
# this throws an error
get_concept_list(df) %>%
map_chr(~ mutate(df, {{.x}} := toupper(.x)))
#> Error: Result 1 must be a single string, not a vector of class `tbl_df/tbl/data.frame` and of length 3
# this creates three extra rows in the tibble
get_concept_list(df) %>%
map_df(~ mutate(df, {{.x}} := toupper(.x)))
#> # A tibble: 4 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 TIME
#> 2 Population estimates - local authority based by single year NM_2002_1 GENDER
#> 3 Population estimates - local authority based by single year NM_2002_1 C_AGE
#> 4 Population estimates - local authority based by single year NM_2002_1 MEASUR~
# this does the same as map_df
get_concept_list(df) %>%
map_dfr(~ mutate(df, {{.x}} := toupper(.x)))
#> # A tibble: 4 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 TIME
#> 2 Population estimates - local authority based by single year NM_2002_1 GENDER
#> 3 Population estimates - local authority based by single year NM_2002_1 C_AGE
#> 4 Population estimates - local authority based by single year NM_2002_1 MEASUR~
# this creates a single tibble 12 columns wide
get_concept_list(df) %>%
map_dfc(~ mutate(df, {{.x}} := toupper(.x)))
#> # A tibble: 1 x 12
#> dataset_title dataset_id ..1 dataset_title1 dataset_id1 ..11 dataset_title2
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Population e~ NM_2002_1 TIME Population es~ NM_2002_1 GEND~ Population es~
#> # ... with 5 more variables: dataset_id2 <chr>, ..12 <chr>,
#> # dataset_title3 <chr>, dataset_id3 <chr>, ..13 <chr>
# function to get info on each concept (except geography) -----------------
# this is the function I want to use eventually to populate my new columns
get_concept_info <- function(df, concept_name) {
dataset_id <- pluck(df, "dataset_id")
nomis_overview(id = dataset_id) %>%
filter(name == "dimensions") %>%
pluck("value", 1, "dimension") %>%
filter(concept == concept_name) %>%
pluck("codes.code", 1) %>%
select(name, value) %>%
nest(data = everything()) %>%
as.list() %>%
pluck("data")
}
# individual mutate works, for comparison ---------------------------------
# I can create the kind of table I want manually using a line like the one below
# df %>% map(~ mutate(., measures = get_concept_info(., concept_name = "measures")))
df %>% mutate(., measures = get_concept_info(df, "measures"))
#> # A tibble: 1 x 3
#> dataset_title dataset_id measures
#> <chr> <chr> <list>
#> 1 Population estimates - local authority based by sin~ NM_2002_1 <tibble [2 x ~
Created on 2020-02-10 by the reprex package (v0.3.0)
Using !! and := lets you name columns dynamically. map() then returns a list of data frames, one per concept, which reduce() collapses into a single data frame by calling left_join() on the dataset_title and dataset_id columns.
df_2 <-
map(get_concept_list(df),
~ mutate(df,
!!.x := get_concept_info(df, .x))) %>%
reduce(left_join, by = c("dataset_title", "dataset_id"))
df_2
# A tibble: 1 x 6
dataset_title dataset_id time gender c_age measures
<chr> <chr> <list<df[,2]>> <list<df[,2]>> <list<df[,2]>> <list<df[,2]>>
1 Population estimates - local authority based by single year NM_2002_1 [28 x 2] [3 x 2] [121 x 2] [2 x 2]
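The same pattern can be tried without calling the Nomis API. This is only a sketch: the concepts vector is typed in by hand rather than returned by get_concept_list(), and toupper() stands in for get_concept_info(), as in the question.

```r
library(dplyr)
library(purrr)

# Single-row tibble, as in the question
df <- tibble(
  dataset_title = "Population estimates - local authority based by single year",
  dataset_id = "NM_2002_1"
)

# Hand-typed stand-in for the output of get_concept_list(df)
concepts <- c("time", "gender", "c_age", "measures")

# One tibble per concept, each with a column named after the concept,
# then joined back together on the identifying columns
df_wide <- map(concepts, ~ mutate(df, !!.x := toupper(.x))) %>%
  reduce(left_join, by = c("dataset_title", "dataset_id"))

df_wide
```

The difference from the attempts in the question is the !! operator: it evaluates .x first and uses the resulting string as the column name, whereas {{.x}} inside the lambda ends up labelled ..1.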

Cleaning a column in a dataset R [closed]

So I got a dataset with a column that I need to clean.
The column has objects with stuff like: "$10,000 - $19,999", "$40,000 and over."
How do I code this so for example "$10,000 - $19,999" becomes 15000 instead, and "$40,000 and over" becomes 40000 in a new column?
I am new to R so I have no idea how to start. I need to do a regression analysis on this but it doesn't work if I don't get this fixed.
I have been told that some basic string/regex operations are what I need. How should I proceed?
Here's a solution using the tidyverse.
Load packages
library(dplyr) # for general cleaning functions
library(stringr) # for string manipulations
library(magrittr) # for the '%<>%' function
Make a dummy dataset based on your example.
df <- tibble(price = sample(c(rep('$40,000 and over', 10), # tibble() replaces the deprecated data_frame()
rep('$10,000', 10),
rep('$19,999', 10),
rep('$9,000', 10),
rep('$28,000', 10))))
Inspect the new dataframe
print(df)
#> # A tibble: 50 x 1
#> price
#> <chr>
#> 1 $9,000
#> 2 $40,000 and over
#> 3 $28,000
#> 4 $10,000
#> 5 $10,000
#> 6 $9,000
#> 7 $19,999
#> 8 $10,000
#> 9 $19,999
#> 10 $40,000 and over
#> # ... with 40 more rows
Clean up the format of the price strings by removing the $ symbol and the comma. Note the '\\' before the $ symbol: the second \ is the standard regex escape, and the first \ tells R to escape the second \ inside the string literal.
df %<>%
mutate(price = str_remove(string = price, pattern = '\\$'), # remove $ sign
price = str_remove(string = price, pattern = ',')) # remove comma
Quick check of the data.
head(df)
#> # A tibble: 6 x 1
#> price
#> <chr>
#> 1 9000
#> 2 40000 and over
#> 3 28000
#> 4 10000
#> 5 10000
#> 6 9000
Process the number strings into numerics. First convert 40000 and over to 40000, then convert all the strings to numerics, then use logic statements to convert the numbers to the values you want. The functions ifelse() and case_when() are interchangeable, but I tend to use ifelse() for single rules, and case_when() when there are multiple rules because of the more compact format of the case_when().
df %<>%
mutate(price = ifelse(price == '40000 and over', # convert 40000+ to 40000
yes = '40000',
no = price),
price = as.numeric(price), # convert all to numeric
price = case_when( # use logic statements to change values to desired value
price == 40000 ~ 40000,
price >= 30000 & price < 40000 ~ 35000,
price >= 20000 & price < 30000 ~ 25000,
price >= 10000 & price < 20000 ~ 15000,
price >= 0 & price < 10000 ~ 5000
))
Have a final look.
print(df)
#> # A tibble: 50 x 1
#> price
#> <dbl>
#> 1 5000
#> 2 40000
#> 3 25000
#> 4 15000
#> 5 15000
#> 6 5000
#> 7 15000
#> 8 15000
#> 9 15000
#> 10 40000
#> # ... with 40 more rows
Created on 2018-11-18 by the reprex package (v0.2.1)
First you should see what exactly your data is composed of: use the table() function on data$column to see how many unique entries you must account for.
table(data$column)
If whoever entered this data was consistent about their wording, it may be easiest to hard-code a substitution for each unique entry. For example, if unique(data$column)[1] is "$10,000 - $19,999" and unique(data$column)[2] is "$40,000 and over.", then:
data$column[which(data$column==unique(data$column)[1])] <- "15000"
data$column[which(data$column==unique(data$column)[2])] <- "40000"
...
If you have too many unique entries for this approach to be viable, I'd suggest looking for consistencies in character sequences that can be used to make replacements. If you found that whoever entered this data was inconsistent about how they would write "$40,000 and over" such that you had:
unique(data$column)[2]
#> [1] "$40,000 and over."
unique(data$column)[3]
#> [1] "$40,000 and over"
unique(data$column)[4]
#> [1] "above $40,000"
...
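If there are no instances of "$40,000" that belong to other categories, you could combine these entries in a single substitution. Note the fixed = TRUE: without it, grepl() treats $ as the regex end-of-string anchor and the pattern never matches.
data$column[grepl("$40,000", data$column, fixed = TRUE)] <- "40000"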
Inconsistency in qualitative data entry is a very human problem and requires exploring your data to search for trends and easy ways to consolidate your replacements. I think it's a fine idea to use R to identify and replace for patterns you find to save time, but ultimately it will require a fine touch as you get down to individual cases where you have to interpret/correct someone's entries to include them in your desired bins. Depending on your data quality standards, you can always throw out these entries that don't seem to fit your observed patterns.
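If the entries follow a regular "number" or "number - number" shape, the replacement can also be computed instead of hard-coded. The sketch below is base R only and makes two assumptions not stated in the question: a range is represented by its rounded midpoint, and an open-ended entry such as "and over" by its single bound. The income vector is made-up example data.

```r
# Made-up example values in the formats described in the question
income <- c("$10,000 - $19,999", "$40,000 and over", "$20,000 - $29,999")

# Pull out every run of digits and commas; a range yields two matches,
# an open-ended entry yields one
nums <- regmatches(income, gregexpr("[0-9][0-9,]*", income))

# Strip the commas, convert to numeric, and take the rounded mean
# of whatever was found in each entry
income_num <- vapply(
  nums,
  function(x) round(mean(as.numeric(gsub(",", "", x)))),
  numeric(1)
)

income_num
```

Entries that contain no digits at all come out as NaN, which makes them easy to find and handle by hand.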

Programmatically create new variables using purrr?

Intro
After recently taking Hadley Wickham's functional programming class I decided I'd try applying some of the lessons to my projects at work. Naturally, the first project I tried has proven to be more complicated than the worked examples demonstrated in the class. Does anyone have recommendations for a way to use the purrr package to make the task described below more efficient?
Project Background
I need to assign quintile groups to records in a spatial polygon dataframe. In addition to the record identifier there are several other variables and I need to calculate the quintile group for each.
Here's the crux of the problem: I have been asked to identify outliers in one particular variable and to omit those records from the entire analysis as long as it doesn't change the quintile composition of the first quintile group for any of the other variables.
Question
I have put together a dplyr pipeline (see the example below) that performs this checking process for a single variable, but how might I rewrite this process so that I can efficiently check each variable?
EDIT: While it is certainly possible to change the shape of the data from wide to long as an intermediary step, in the end it needs to return to its wide format so that it matches up with the @polygons slot of the spatial polygons dataframe.
Reproducible Example
You can find the complete script here: https://gist.github.com/tiernanmartin/6cd3e2946a77b7c9daecb51aa11e0c94
Libraries and Settings
library(grDevices) # boxplot.stats()
library(operator.tools) # %!in% logical operator
library(tmap) # 'metro' data set
library(magrittr) # piping
library(dplyr) # exploratory data analysis verbs
library(purrr) # recursive mapping of functions
library(tibble) # improved version of a data.frame
library(ggplot2) # dot plot
library(ggrepel) # avoid label overlap
options(scipen=999)
set.seed(888)
Load the example data and take a small sample of it
data("metro")
m_spdf <- metro
# Take a sample
m <-
metro@data %>%
as_tibble %>%
select(-name_long,-iso_a3) %>%
sample_n(50)
> m
# A tibble: 50 x 10
name pop1950 pop1960 pop1970 pop1980 pop1990
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sydney 1689935 2134673 2892477 3252111 3631940
2 Havana 1141959 1435511 1779491 1913377 2108381
3 Campinas 151977 293174 540430 1108903 1693359
4 Kano 123073 229203 541992 1349646 2095384
5 Omsk 444326 608363 829860 1032150 1143813
6 Ouagadougou 33035 59126 115374 265200 537441
7 Marseille 755805 928768 1182048 1372495 1418279
8 Taiyuan 196510 349535 621625 1105695 1636599
9 La Paz 319247 437687 600016 809218 1061850
10 Baltimore 1167656 1422067 1554538 1748983 1848834
# ... with 40 more rows, and 4 more variables:
# pop2000 <dbl>, pop2010 <dbl>, pop2020 <dbl>,
# pop2030 <dbl>
Calculate quintile groups with and without outlier records
# Calculate the quintile groups for one variable (e.g., `pop1990`)
m_all <-
m %>%
mutate(qnt_1990_all = dplyr::ntile(pop1990,5))
# Find the outliers for a different variable (e.g., 'pop1950')
# and subset the df to exclude these outlier records
m_out <- boxplot.stats(m$pop1950) %>% .[["out"]]
m_trim <-
m %>%
filter(pop1950 %!in% m_out) %>%
mutate(qnt_1990_trim = dplyr::ntile(pop1990,5))
# Assess whether the outlier trimming impacted the first quintile group
m_comp <-
m_trim %>%
select(name,dplyr::contains("qnt")) %>%
left_join(m_all,.,"name") %>%
select(name,dplyr::contains("qnt"),everything()) %>%
mutate(qnt_1990_chng_lgl = !is.na(qnt_1990_trim) & qnt_1990_trim != qnt_1990_all,
qnt_1990_chng_dir = if_else(qnt_1990_chng_lgl,
paste0(qnt_1990_all," to ",qnt_1990_trim),
"No change"))
With a little help from ggplot2, I can see that in this example six outliers were identified and that their omission did not affect the first quintile group for pop1990.
Importantly, this information is tracked in two new variables: qnt_1990_chng_lgl and qnt_1990_chng_dir.
> m_comp %>% select(name,qnt_1990_chng_lgl,qnt_1990_chng_dir,everything())
# A tibble: 50 x 14
name qnt_1990_chng_lgl qnt_1990_chng_dir qnt_1990_all qnt_1990_trim
<chr> <lgl> <chr> <dbl> <dbl>
1 Sydney FALSE No change 5 NA
2 Havana TRUE 4 to 5 4 5
3 Campinas TRUE 3 to 4 3 4
4 Kano FALSE No change 4 4
5 Omsk FALSE No change 3 3
6 Ouagadougou FALSE No change 1 1
7 Marseille FALSE No change 3 3
8 Taiyuan TRUE 3 to 4 3 4
9 La Paz FALSE No change 2 2
10 Baltimore FALSE No change 4 4
# ... with 40 more rows, and 9 more variables: pop1950 <dbl>, pop1960 <dbl>,
# pop1970 <dbl>, pop1980 <dbl>, pop1990 <dbl>, pop2000 <dbl>, pop2010 <dbl>,
# pop2020 <dbl>, pop2030 <dbl>
I now need to find a way to repeat this process for every variable in the dataframe (i.e., pop1960 - pop2030). Ideally, two new variables would be created for each existing pop* variable and their names would be preceded by qnt_ and followed by either _chng_dir or _chng_lgl.
Is purrr the right tool to use for this? dplyr::mutate_? data.table?
It turns out this problem is solvable using tidyr::gather + dplyr::group_by + tidyr::spread functions. While @shayaa and @Gregor didn't provide the solution I was looking for, their advice helped me course-correct away from the functional programming methods I was researching.
I ended up using @shayaa's gather and group_by combination, followed by mutate to create the variable names (qnt_*_chng_lgl and qnt_*_chng_dir) and then using spread to make it wide again. An anonymous function passed to summarize_all removed all the extra NAs that the wide-long-wide transformations created.
m_comp <-
m %>%
mutate(qnt = dplyr::ntile(pop1950,5)) %>%
filter(pop1950 %!in% m_out) %>%
gather(year,pop,-name,-qnt) %>%
group_by(year) %>%
mutate(qntTrim = dplyr::ntile(pop,5),
qnt_chng_lgl = !is.na(qnt) & qnt != qntTrim,
qnt_chng_dir = ifelse(qnt_chng_lgl,
paste0(qnt," to ",qntTrim),
"No change"),
year_lgl = paste0("qnt_chng_",year,"_lgl"),
year_dir = paste0("qnt_chng_",year,"_dir")) %>%
spread(year_lgl,qnt_chng_lgl) %>%
spread(year_dir,qnt_chng_dir) %>%
spread(year,pop) %>%
select(-qnt,-qntTrim) %>%
group_by(name) %>%
summarize_all(function(.){subset(.,!is.na(.)) %>% first})
It seems to me there is nothing wrong with your analysis.
After this part
m <- metro@data %>%
as_tibble %>%
select(-name_long,-iso_a3) %>%
sample_n(50)
Just melt your data and continue your analysis but with group_by(year)
library(reshape2)
library(stringr)
mm <- melt(m)
mm[,2] <- as.factor(str_sub(mm[,2],-4))
names(mm)[2:3] <- c("year", "population")
e.g.,
mm %>% group_by(year) %>%
mutate(qnt_all = dplyr::ntile(population,5))
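For reference, the long-then-wide idea can also be written with tidyr's pivot_longer() and pivot_wider(). This is a sketch on simulated data rather than the metro data set: the tibble m, the city names and the single planted outlier are all made up, and only the logical change flag is carried back to wide format.

```r
library(dplyr)
library(tidyr)

set.seed(1)

# Simulated stand-in for the sampled metro data, with one obvious pop1950 outlier
m <- tibble(
  name = paste0("city", 1:20),
  pop1950 = c(runif(19, 1e5, 1e6), 1e7),
  pop1990 = runif(20, 1e5, 2e6)
)

# Records flagged as outliers on pop1950
outlier_names <- m$name[m$pop1950 %in% boxplot.stats(m$pop1950)$out]

# Quintile groups for every pop* variable, using all records
long_all <- m %>%
  pivot_longer(starts_with("pop"), names_to = "year", values_to = "pop") %>%
  group_by(year) %>%
  mutate(qnt_all = ntile(pop, 5)) %>%
  ungroup()

# Quintile groups again, after dropping the outlier records
long_trim <- long_all %>%
  filter(!name %in% outlier_names) %>%
  group_by(year) %>%
  mutate(qnt_trim = ntile(pop, 5)) %>%
  ungroup() %>%
  select(name, year, qnt_trim)

# Flag records whose quintile group changed, then return to one row per record
m_flags <- long_all %>%
  left_join(long_trim, by = c("name", "year")) %>%
  mutate(chng_lgl = !is.na(qnt_trim) & qnt_trim != qnt_all) %>%
  select(name, year, chng_lgl) %>%
  pivot_wider(names_from = year, values_from = chng_lgl,
              names_glue = "qnt_{year}_chng_lgl")

m_flags
```

Because the result has one row per name, it can be joined back onto the wide table with left_join(m, m_flags, by = "name").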
