step_dummy(): dealing with duplicated column names generated by recipe() steps (tidymodels, R)

Dear community,
I have been struggling for a long time to understand what is going on when I run my recipe() steps for my linear (glm) model using the tidymodels framework. The step step_dummy(all_nominal(), -all_outcomes()) was suggested by the usemodels package https://usemodels.tidymodels.org/index.html .
When I comment out step_dummy(), recipe() and prep() work fine; however, it is important to me that these categorical variables are dummified (is that a word?).
This is the first time I am including a reprex in a question on Stack Overflow, so please let me know if you need more information to assist on this matter.
I have looked everywhere and tried, for example, the one_hot = TRUE and keep_original_cols arguments of step_dummy(), but they do not seem to have any effect.
It should be quite easy, as it is a matter of making the generated column names unique, but I have not succeeded. Here is the era.af_train set.
> era.af_train
# A tibble: 7,104 x 44
logRR ID AEZ16simple PrName.Code SubPrName.Code Product Country
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 -0.851 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
2 -1.17 1663 Warm.Semiar~ BP/Mu Mu-N/TW Pearl Mill~ Niger
3 -0.314 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
4 -0.776 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
5 -0.0850 1675 Warm.Semiar~ AP TPM+N Pearl Mill~ Niger
6 -0.159 1689 Warm.Subhum~ Al/AP/BP Al+N/LF/TP/TPM~ Maize Togo
7 -0.579 1701 Warm.Semiar~ BP TW Fodder (Le~ Tunisia
8 -0.662 1729 Warm.Subhum~ Al Al-N/Al+N Cassava or~ Nigeria
9 -1.80 1802 Cool.Subhum~ Al/AP Al+N/TP Wheat Ethiop~
10 -1.74 1802 Cool.Subhum~ Al/AP Al+N/TP/TPI+N Wheat Ethiop~
# ... with 7,094 more rows, and 37 more variables: Latitude <dbl>,
# Longitude <dbl>, Site.Type <fct>, Tree <fct>, Bio01_MT_Anu.Mean <dbl>,
# Bio02_MDR.Mean <dbl>, Bio03_Iso.Mean <dbl>, Bio04_TS.Mean <dbl>,
# Bio05_TWM.Mean <dbl>, Bio06_MinTCM.Mean <dbl>, Bio07_TAR.Mean <dbl>,
# Bio08_MT_WetQ.Mean <dbl>, Bio09_MT_DryQ.Mean <dbl>,
# Bio10_MT_WarQ.Mean <dbl>, Bio11_MT_ColQ.Mean <dbl>,
# Bio12_Pecip_Anu.Mean <dbl>, Bio13_Precip_WetM.Mean <dbl>,
# Bio14_Precip_DryM.Mean <dbl>, Bio15_Precip_S.Mean <dbl>,
# Bio16_Precip_WetQ.Mean <dbl>, Bio17_Precip_DryQ.Mean <dbl>,
# Mean_log.n_tot_ncs <dbl>, Mean_log.ca_mehlich3 <dbl>,
# Mean_log.k_mehlich3 <dbl>, Mean_log.mg_mehlich3 <dbl>,
# Mean_log.p_mehlich3 <dbl>, Mean_log.s_mehlich3 <dbl>,
# Mean_log.fe_mehlich3 <dbl>, Mean_db_od <dbl>, Mean_bdr <dbl>,
# Mean_sand_tot_psa <dbl>, Mean_clay_tot_psa <dbl>, Mean_ph_h2o <dbl>,
# Mean_log.ecec.f <dbl>, Mean_log.c_tot <dbl>, Mean_log.oc <dbl>,
# Slope.mean <dbl>
I am assigning the columns ID, AEZ16simple, PrName.Code, SubPrName.Code, Product, Country, Latitude and Longitude the role of "ID variables", as I wish to compare the glm model later with a random forest model and an XGBoost model.
All help is welcome!
Have a good weekend and
thank you in advance.
library(reprex)
#> Warning: package 'reprex' was built under R version 4.0.5
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(recipes)
#> Warning: package 'recipes' was built under R version 4.0.5
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(readr)
#> Warning: package 'readr' was built under R version 4.0.5
setwd("C:/Users/lindh011/OneDrive - Wageningen University & Research/Internship ICRAF (ERA)/ERA_Agroforestry_WURwork")
era.af_train <- read_csv("era.af_train.csv")
#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> .default = col_double(),
#> AEZ16simple = col_character(),
#> PrName.Code = col_character(),
#> SubPrName.Code = col_character(),
#> Product = col_character(),
#> Country = col_character(),
#> Site.Type = col_character(),
#> Tree = col_character()
#> )
#> i Use `spec()` for the full column specifications.
era.af_train_Tib <- as_tibble(era.af_train)
glmnet_recipe <-
  recipe(formula = logRR ~ ., data = era.af_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), naming = dummy_names) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  update_role(ID,
              AEZ16simple,
              PrName.Code,
              SubPrName.Code,
              Product,
              Country,
              Latitude,
              Longitude,
              new_role = "sample ID") %>%
  step_impute_mode(all_nominal(), -all_outcomes()) %>%
  step_impute_knn(all_numeric_predictors()) %>%
  step_impute_knn(logRR) %>%
  step_corr(all_numeric_predictors()) %>%
  step_nzv(all_numeric_predictors()) %>%
  prep()
#> Error: Column names `SubPrName.Code_AF.N.Al.N.TP`, `SubPrName.Code_AF.N.Al.N.TP.TPM`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N`, and 33 more must not be duplicated.
#> Use .name_repair to specify repair.
Created on 2021-07-02 by the reprex package (v2.0.0)
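A plausible cause, judging only by the names in the error message, is that distinct factor levels collapse to the same string once they are made syntactically valid. As far as I know, the default naming function in recipes sanitizes levels with make.names(), which turns both "+" and "-" into ".". The sketch below uses base R only; the levels are made up to resemble those in SubPrName.Code and are an assumption, not the actual data.

```r
# Hypothetical levels resembling SubPrName.Code (assumed, not the real data)
lvls <- c("Al+N/In-N", "Al-N/In+N", "Al+N", "Al-N")

# make.names() replaces every invalid character with ".",
# so levels that differ only in "+" vs "-" become identical
sanitized <- make.names(lvls)
print(sanitized)

# TRUE when at least two levels collide after sanitizing
has_collision <- anyDuplicated(sanitized) > 0
print(has_collision)

# One possible workaround: derive unique, valid labels up front and use
# them as the factor levels before the recipe ever sees the column
unique_lvls <- make.unique(sanitized, sep = "_")
print(unique_lvls)
```

If this is indeed the cause, recoding the factor levels to such unique labels before calling recipe() should let step_dummy() generate distinct column names.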

Related

step_pca() arguments are not being applied

I'm new to tidymodels, but apparently step_pca() arguments such as num_comp or threshold are not applied when the recipe is trained. As in the example below, I'm still getting 4 components despite setting num_comp = 2.
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
rec <- recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2)
prep(rec) %>% tidy(number = 2, type = "coef") %>%
pivot_wider(names_from = component, values_from = value, id_cols = terms)
#> # A tibble: 4 x 5
#> terms PC1 PC2 PC3 PC4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Murder -0.536 0.418 -0.341 0.649
#> 2 Assault -0.583 0.188 -0.268 -0.743
#> 3 UrbanPop -0.278 -0.873 -0.378 0.134
#> 4 Rape -0.543 -0.167 0.818 0.0890
The full PCA is determined (so you can still compute the variances of each term) and num_comp only specifies how many of the components are retained as predictors. If you want to specify the maximal rank, you can pass that through options:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
rec <- recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))
prep(rec) %>% tidy(number = 2, type = "coef")
#> # A tibble: 8 × 4
#> terms value component id
#> <chr> <dbl> <chr> <chr>
#> 1 Murder -0.536 PC1 pca_AoFOm
#> 2 Assault -0.583 PC1 pca_AoFOm
#> 3 UrbanPop -0.278 PC1 pca_AoFOm
#> 4 Rape -0.543 PC1 pca_AoFOm
#> 5 Murder 0.418 PC2 pca_AoFOm
#> 6 Assault 0.188 PC2 pca_AoFOm
#> 7 UrbanPop -0.873 PC2 pca_AoFOm
#> 8 Rape -0.167 PC2 pca_AoFOm
Created on 2022-01-12 by the reprex package (v2.0.1)
You could also control this via the tol argument from stats::prcomp(), also passed in as an option.
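To see what these two options do outside of a recipe, here is a small sketch that calls stats::prcomp() directly on the built-in USArrests data. The tol value of 0.5 is an arbitrary choice for illustration.

```r
# Base R only; USArrests ships with the datasets package
x <- datasets::USArrests

# Full decomposition: one component per input column
pca_full <- prcomp(x, scale. = TRUE)

# rank. caps the number of components that are returned
pca_rank <- prcomp(x, scale. = TRUE, rank. = 2)

# tol drops components whose standard deviation is at most
# tol times the standard deviation of the first component
pca_tol <- prcomp(x, scale. = TRUE, tol = 0.5)

print(ncol(pca_full$rotation))
print(ncol(pca_rank$rotation))
print(ncol(pca_tol$rotation))
```

Passing options = list(rank. = 2) or options = list(tol = 0.5) to step_pca() forwards the same arguments to prcomp().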
If you bake the recipe it seems to work as intended but I don't know what you aim to achieve afterward.
library(tidyverse)
library(tidymodels)
USArrests <- USArrests %>%
rownames_to_column("Countries")
rec <-
recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2)
prep(rec) %>%
bake(new_data = NULL)
#> # A tibble: 50 x 3
#> Countries PC1 PC2
#> <fct> <dbl> <dbl>
#> 1 Alabama -0.976 1.12
#> 2 Alaska -1.93 1.06
#> 3 Arizona -1.75 -0.738
#> 4 Arkansas 0.140 1.11
#> 5 California -2.50 -1.53
#> 6 Colorado -1.50 -0.978
#> 7 Connecticut 1.34 -1.08
#> 8 Delaware -0.0472 -0.322
#> 9 Florida -2.98 0.0388
#> 10 Georgia -1.62 1.27
#> # ... with 40 more rows
Created on 2022-01-11 by the reprex package (v2.0.1)

Retrieving all releases for a specific R package

What R tools are available to obtain a list of all releases of a specific CRAN package?
I expect to retrieve at least the date on which each package version was released.
Other metadata for each version would be valuable too.
Disclosure: this answer includes self-promotion of my new CRAN package, pacs: https://CRAN.R-project.org/package=pacs
TL;DR
pacs::pac_timemachine
pkgsearch::cran_package_history
pkgdown:::pkg_timeline (non-exported and only Date of publish)
pacs::pac_timemachine in the pacs package
pacs::pac_timemachine uses the CRAN website or crandb.
head(pacs::pac_timemachine("tidyr"), 3)
#> Package Version Released Archived LifeDuration
#> 2 tidyr 0.1 2014-07-21 2015-09-08 414 days
#> 3 tidyr 0.2.0 2015-09-08 2015-09-08 0 days
#> 4 tidyr 0.3.0 2015-09-08 2015-09-10 2 days
#> URL Size
#> 2 Archive/tidyr/tidyr_0.1.tar.gz 134K
#> 3 Archive/tidyr/tidyr_0.2.0.tar.gz 139K
#> 4 Archive/tidyr/tidyr_0.3.0.tar.gz 147K
tail(pacs::pac_timemachine("tidyr"), 3)
#> Package Version Released Archived LifeDuration
#> 25 tidyr 1.1.1 2020-07-31 2020-08-27 27 days
#> 26 tidyr 1.1.2 2020-08-27 2021-03-03 188 days
#> 1 tidyr 1.1.3 2021-03-03 <NA> 192 days
#> URL Size
#> 25 Archive/tidyr/tidyr_1.1.1.tar.gz 859K
#> 26 Archive/tidyr/tidyr_1.1.2.tar.gz 861K
#> 1 tidyr_1.1.3.tar.gz <NA>
We can also get the result for a specific date, a date interval, or a version.
pacs::pac_timemachine("tidyr", at = as.Date("2018-01-01"))
pacs::pac_timemachine("tidyr", version = "1.0.0")
pacs::pac_timemachine("tidyr", from = as.Date("2020-06-01"), to = as.Date("2020-08-01"))
Created on 2021-09-11 by the reprex package (v2.0.1)
pkgsearch::cran_package_history in the pkgsearch package
This one is built on a private database that is regularly updated with the new DESCRIPTION file of each CRAN package release.
head(pkgsearch::cran_package_history("tidyr"), 3)
#> # A tibble: 3 × 25
#> Package Title Version `Authors#R` Description License LazyData URL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 tidyr Easily … 0.1 "'Hadley Wick… tidyr is an e… MIT + … true https…
#> 2 tidyr Easily … 0.2.0 "as.person(c(… An evolution … MIT + … true https…
#> 3 tidyr Easily … 0.3.0 "c(<U+000a>pe… An evolution … MIT + … true https…
#> # … with 17 more variables: VignetteBuilder <chr>, Packaged <chr>,
#> # Author <chr>, Maintainer <chr>, NeedsCompilation <chr>, Repository <chr>,
#> # Date/Publication <chr>, crandb_file_date <chr>, date <chr>,
#> # dependencies <list>, BugReports <chr>, RoxygenNote <chr>, Remotes <chr>,
#> # MD5sum <chr>, Encoding <chr>, SystemRequirements <chr>,
#> # Config/testthat/edition <chr>
tail(pkgsearch::cran_package_history("tidyr"), 3)
#> # A tibble: 3 × 25
#> Package Title Version `Authors#R` Description License LazyData URL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 tidyr Tidy Messy Data 1.1.1 "\nc(perso… "Tools to … MIT + … true http…
#> 2 tidyr Tidy Messy Data 1.1.2 "\nc(perso… "Tools to … MIT + … true http…
#> 3 tidyr Tidy Messy Data 1.1.3 "\nc(perso… "Tools to … MIT + … true http…
#> # … with 17 more variables: VignetteBuilder <chr>, Packaged <chr>,
#> # Author <chr>, Maintainer <chr>, NeedsCompilation <chr>, Repository <chr>,
#> # Date/Publication <chr>, crandb_file_date <chr>, date <chr>,
#> # dependencies <list>, BugReports <chr>, RoxygenNote <chr>, Remotes <chr>,
#> # MD5sum <chr>, Encoding <chr>, SystemRequirements <chr>,
#> # Config/testthat/edition <chr>
Created on 2021-09-11 by the reprex package (v2.0.1)
pkgdown:::pkg_timeline in the pkgdown package
This is a non-exported function, so you have to take that into account. It returns only the date on which each package version was published.

tidymodels bake: Error: Please pass a data set to `new_data`

I'm using the recipe() function from the tidymodels packages to impute missing values and fix imbalanced data.
Here is my data:
mer_df <- mer2 %>%
filter(!is.na(laststagestatus2)) %>%
select(Id, Age_Range__c, Gender__c, numberoflead, leadduration, firsttouch, lasttouch, laststagestatus2)%>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
# A tibble: 197,836 x 8
Id Age_Range__c Gender__c numberoflead leadduration firsttouch lasttouch
<fct> <fct> <fct> <int> <dbl> <fct> <fct>
1 0010~ NA NA 2 5.99 Dealer IB~ Walk in
2 0010~ NA NA 1 0 Online Se~ Online S~
3 0010~ NA NA 1 0 Walk in Walk in
4 0010~ NA NA 1 0 Online Se~ Online S~
5 0010~ NA NA 2 0.0128 Dealer IB~ Dealer I~
6 0010~ NA NA 1 0 OB Call OB Call
7 0010~ NA NA 1 0 Dealer IB~ Dealer I~
8 0010~ NA NA 4 73.9 Dealer IB~ Walk in
9 0010~ NA Male 24 0.000208 OB Call OB Call
10 0010~ NA NA 18 0.000150 OB Call OB Call
# ... with 197,826 more rows, and 1 more variable: laststagestatus2 <fct>
Here is my code:
mer_rec <- recipe(laststagestatus2 ~ ., data = mer_train) %>%
  step_medianimpute(numberoflead, leadduration) %>%
  step_knnimpute(Gender__c, Age_Range__c, firsttouch, lasttouch) %>%
  step_other(Id, firsttouch) %>%
  step_other(Id, lasttouch) %>%
  step_dummy(all_nominal(), -laststagestatus2) %>%
  step_smote(laststagestatus2)
mer_rec
mer_rec %>% prep()
It works fine up to this point:
Data Recipe
Inputs:
role #variables
outcome 1
predictor 7
Training data contained 148377 data points and 147597 incomplete rows.
Operations:
Median Imputation for 2 items [trained]
K-nearest neighbor imputation for Id, ... [trained]
Collapsing factor levels for Id, firsttouch [trained]
Collapsing factor levels for Id, lasttouch [trained]
Dummy variables from Id, ... [trained]
SMOTE based on laststagestatus2 [trained]
But when I run the bake() function, it gives this error:
mer_rec %>% prep() %>% bake(new_data=NULL) %>% count(laststagestatus2)
Error: Please pass a data set to `new_data`.
Could anyone help me figure out what I'm missing here?
There is a fix in the development version of recipes to get this up and working. You can install via:
devtools::install_github("tidymodels/recipes")
Then you can bake() with new_data = NULL to get out the transformed training data.
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))
set.seed(123)
ames_split <- initial_split(ames, prob = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_rec <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_rec %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 2,199 x 71
#> Gr_Liv_Area Year_Built Sale_Price Neighborhood_Co… Neighborhood_Ol…
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 3.22 1960 5.33 0 0
#> 2 2.95 1961 5.02 0 0
#> 3 3.12 1958 5.24 0 0
#> 4 3.21 1997 5.28 0 0
#> 5 3.21 1998 5.29 0 0
#> 6 3.13 2001 5.33 0 0
#> 7 3.11 1992 5.28 0 0
#> 8 3.21 1995 5.37 0 0
#> 9 3.22 1993 5.25 0 0
#> 10 3.17 1998 5.26 0 0
#> # … with 2,189 more rows, and 66 more variables: Neighborhood_Edwards <dbl>,
#> # Neighborhood_Somerset <dbl>, Neighborhood_Northridge_Heights <dbl>,
#> # Neighborhood_Gilbert <dbl>, Neighborhood_Sawyer <dbl>,
#> # Neighborhood_Northwest_Ames <dbl>, Neighborhood_Sawyer_West <dbl>,
#> # Neighborhood_Mitchell <dbl>, Neighborhood_Brookside <dbl>,
#> # Neighborhood_Crawford <dbl>, Neighborhood_Iowa_DOT_and_Rail_Road <dbl>,
#> # Neighborhood_Timberland <dbl>, Neighborhood_Northridge <dbl>,
#> # Neighborhood_Stone_Brook <dbl>,
#> # Neighborhood_South_and_West_of_Iowa_State_University <dbl>,
#> # Neighborhood_Clear_Creek <dbl>, Neighborhood_Meadow_Village <dbl>,
#> # Neighborhood_other <dbl>, Bldg_Type_TwoFmCon <dbl>, Bldg_Type_Duplex <dbl>,
#> # Bldg_Type_Twnhs <dbl>, Bldg_Type_TwnhsE <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwoFmCon <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_Duplex <dbl>, Gr_Liv_Area_x_Bldg_Type_Twnhs <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwnhsE <dbl>, Latitude_ns_01 <dbl>,
#> # Latitude_ns_02 <dbl>, Latitude_ns_03 <dbl>, Latitude_ns_04 <dbl>,
#> # Latitude_ns_05 <dbl>, Latitude_ns_06 <dbl>, Latitude_ns_07 <dbl>,
#> # Latitude_ns_08 <dbl>, Latitude_ns_09 <dbl>, Latitude_ns_10 <dbl>,
#> # Latitude_ns_11 <dbl>, Latitude_ns_12 <dbl>, Latitude_ns_13 <dbl>,
#> # Latitude_ns_14 <dbl>, Latitude_ns_15 <dbl>, Latitude_ns_16 <dbl>,
#> # Latitude_ns_17 <dbl>, Latitude_ns_18 <dbl>, Latitude_ns_19 <dbl>,
#> # Latitude_ns_20 <dbl>, Longitude_ns_01 <dbl>, Longitude_ns_02 <dbl>,
#> # Longitude_ns_03 <dbl>, Longitude_ns_04 <dbl>, Longitude_ns_05 <dbl>,
#> # Longitude_ns_06 <dbl>, Longitude_ns_07 <dbl>, Longitude_ns_08 <dbl>,
#> # Longitude_ns_09 <dbl>, Longitude_ns_10 <dbl>, Longitude_ns_11 <dbl>,
#> # Longitude_ns_12 <dbl>, Longitude_ns_13 <dbl>, Longitude_ns_14 <dbl>,
#> # Longitude_ns_15 <dbl>, Longitude_ns_16 <dbl>, Longitude_ns_17 <dbl>,
#> # Longitude_ns_18 <dbl>, Longitude_ns_19 <dbl>, Longitude_ns_20 <dbl>
Created on 2020-10-12 by the reprex package (v0.3.0.9001)
If you are unable to install packages from GitHub, you could use juice() to do the same thing.
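As a minimal sketch of the juice() route, assuming the recipes package is installed: the example below trains a small recipe on the built-in mtcars data (chosen only for illustration, it is not the data from the question) and extracts the processed training set. Later recipes releases mark juice() as superseded in favor of bake(new_data = NULL), but the function is still available.

```r
library(recipes)

# A small recipe on a built-in data set (illustrative only)
rec <- recipe(mpg ~ ., data = mtcars)
rec <- step_normalize(rec, all_predictors())

# prep() estimates the step parameters from the training data
trained <- prep(rec)

# juice() returns the processed training data stored in the trained recipe
train_out <- juice(trained)
print(dim(train_out))
```

For the recipe in the question, the equivalent call would be mer_rec %>% prep() %>% juice() %>% count(laststagestatus2).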

readr::read_csv() - parsing failure with nested quotations

I have a CSV file in which some quoted fields contain another quotation inside them, for example
"blah blah "nested quote"", and this generates parsing failures. I'm not sure whether this is a bug or whether there is an argument to deal with it.
Reprex (file is here or content pasted below):
readr::read_csv("~/temp/shittyquotes.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> INSTNM = col_character(),
#> ADDR = col_character(),
#> CITY = col_character(),
#> STABBR = col_character(),
#> ZIP = col_character(),
#> CHFNM = col_character(),
#> CHFTITLE = col_character(),
#> EIN = col_character(),
#> OPEID = col_character(),
#> WEBADDR = col_character(),
#> ADMINURL = col_character(),
#> FAIDURL = col_character(),
#> APPLURL = col_character(),
#> ACT = col_character(),
#> IALIAS = col_character(),
#> INSTCAT = col_character(),
#> CCBASIC = col_character(),
#> CCIPUG = col_character(),
#> CCSIZSET = col_character(),
#> CARNEGIE = col_character()
#> # ... with 2 more columns
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col expected actual file
#> 2 IALIAS delimiter or quote C '~/temp/shittyquotes.csv'
#> 2 IALIAS delimiter or quote D '~/temp/shittyquotes.csv'
#> 2 NA 59 columns 100 columns '~/temp/shittyquotes.csv'
#> # A tibble: 2 x 59
#> UNITID INSTNM ADDR CITY STABBR ZIP FIPS OBEREG CHFNM CHFTITLE
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 441238 City … 1500… Duar… CA 9101… 6 8 Dr. … Director
#> 2 441247 Commu… 3800… Mode… CA 9535… 6 8 Vict… Preside…
#> # ... with 49 more variables: GENTELE <dbl>, EIN <chr>, OPEID <chr>,
#> # OPEFLAG <dbl>, WEBADDR <chr>, ADMINURL <chr>, FAIDURL <chr>,
#> # APPLURL <chr>, SECTOR <dbl>, ICLEVEL <dbl>, CONTROL <dbl>,
#> # HLOFFER <dbl>, UGOFFER <dbl>, GROFFER <dbl>, FPOFFER <dbl>,
#> # HDEGOFFR <dbl>, DEGGRANT <dbl>, HBCU <dbl>, HOSPITAL <dbl>,
#> # MEDICAL <dbl>, TRIBAL <dbl>, LOCALE <dbl>, OPENPUBL <dbl>, ACT <chr>,
#> # NEWID <dbl>, DEATHYR <dbl>, CLOSEDAT <dbl>, CYACTIVE <dbl>,
#> # POSTSEC <dbl>, PSEFLAG <dbl>, PSET4FLG <dbl>, RPTMTH <dbl>,
#> # IALIAS <chr>, INSTCAT <chr>, CCBASIC <chr>, CCIPUG <chr>,
#> # CCIPGRAD <dbl>, CCUGPROF <dbl>, CCENRPRF <dbl>, CCSIZSET <chr>,
#> # CARNEGIE <chr>, TENURSYS <dbl>, LANDGRNT <dbl>, INSTSIZE <chr>,
#> # CBSA <dbl>, CBSATYPE <chr>, CSA <dbl>, NECTA <dbl>, DFRCGID <dbl>
Created on 2018-12-04 by the reprex package (v0.2.1)
Also here's the csv content:
UNITID,INSTNM,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,CHFNM,CHFTITLE,GENTELE,EIN,OPEID,OPEFLAG,WEBADDR,ADMINURL,FAIDURL,APPLURL,SECTOR,ICLEVEL,CONTROL,HLOFFER,UGOFFER,GROFFER,FPOFFER,HDEGOFFR,DEGGRANT,HBCU,HOSPITAL,MEDICAL,TRIBAL,LOCALE,OPENPUBL,ACT,NEWID,DEATHYR,CLOSEDAT,CYACTIVE,POSTSEC,PSEFLAG,PSET4FLG,RPTMTH,IALIAS,INSTCAT,CCBASIC,CCIPUG,CCIPGRAD,CCUGPROF,CCENRPRF,CCSIZSET,CARNEGIE,TENURSYS,LANDGRNT,INSTSIZE,CBSA,CBSATYPE,CSA,NECTA,DFRCGID
441238,"City of Hope Graduate School of Biological Science","1500 E Duarte Rd","Duarte","CA","91010-3000", 6, 8,"Dr. Arthur Riggs","Director","6263018293","953432210","03592400",1,"gradschool.coh.org"," "," "," ",2,1,2,9,2,1,2,10,1,2,-2,2,2,21,1,"A ",-2,-2,"-2",1,1,1,1,1," ",1,25,-2,-2,-2,7,-2,-3,1,2,1,31100,1,348,-2,198
441247,"Community Business College","3800 McHenry Ave Suite M","Modesto","CA","95356-1569", 6, 8,"Victor L. Vandenberghe","President","2095293648","484-8230","03615300",7,"www.communitybusinesscollege.edu","www.communitybusinesscollege.edu","www.cbc123.com","www.123.com",9,3,3,1,1,2,2,0,2,2,-2,2,2,12,1,"A ",-2,-2,"-2",1,1,1,1,2,"formerly "Community Business School"",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,33700,1,-2,-2,71
441256,"Design's School of Cosmetology","715 24th St Ste E","Paso Robles","CA","93446", 6, 8,"Sharon Skinner","Administrator","8052378575","80002030","03646300",1,"designsschool.com"," "," "," ",9,3,3,2,1,2,2,0,2,2,-2,2,2,13,1,"A ",-2,-2,"-2",1,1,1,1,2," ",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,42020,1,-2,-2,46
Jim Hester provided this answer:
You need to use the escape_double = FALSE argument to read_delim(). This isn't part of read_csv() because excel style csvs escape inner quotations by doubling them.
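To illustrate the doubling convention mentioned above, here is a small sketch with made-up data. It uses base R's read.csv() as a stand-in (an assumption for illustration, not the readr call from the answer), since it also treats a doubled quote inside a quoted field as one literal quote.

```r
# Two made-up CSV lines: a header and one record whose second field
# contains inner quotes escaped by doubling them ("")
csv_text <- 'id,alias\n1,"formerly ""Community Business School"""'

parsed <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The doubled quotes come back as single literal quotes
print(parsed$alias)
```

The file in the question does not double its inner quotes, which is why a parser expecting this convention reports a failure; escape_double = FALSE tells readr not to expect it.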
data.table's fread() parses the file just fine. It throws a warning about the quotes, but you can ignore it.
library(data.table)
data.table::fread("./temp.csv")
Warning message:
In data.table::fread("./temp.csv") :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

Format a tbl within a dplyr chain

I am trying to format my data with commas as thousands separators, e.g. 10,000, along with a dollar sign, e.g. $10,000.
I'm using several dplyr commands along with the tidyr gather and spread functions. Here's what I tried:
Copy and paste this code block to generate the random data "dataset" I'm working with:
library(dplyr)
library(tidyr)
library(lubridate)
## Generate some data
channels <- c("Facebook", "Youtube", "SEM", "Organic", "Direct", "Email")
last_month <- Sys.Date() %m+% months(-1) %>% floor_date("month")
mts <- seq(from = last_month %m+% months(-23), to = last_month, by = "1 month") %>% as.Date()
dimvars <- expand.grid(Month = mts, Channel = channels, stringsAsFactors = FALSE)
# metrics
rws <- nrow(dimvars)
set.seed(42)
# generates variability in the random data
randwalk <- function(initial_val, ...){
  initial_val + cumsum(rnorm(...))
}
Sessions <- ceiling(randwalk(3000, n = rws, mean = 8, sd = 1500)) %>% abs()
Revenue <- ceiling(randwalk(10000, n = rws, mean = 0, sd = 3500)) %>% abs()
# make primary df
dataset <- cbind(dimvars, Revenue)
Which looks like:
> tbl_df(dataset)
# A tibble: 144 × 3
Month Channel Revenue
<date> <chr> <dbl>
1 2015-06-01 Facebook 8552
2 2015-07-01 Facebook 12449
3 2015-08-01 Facebook 10765
4 2015-09-01 Facebook 9249
5 2015-10-01 Facebook 11688
6 2015-11-01 Facebook 7991
7 2015-12-01 Facebook 7849
8 2016-01-01 Facebook 2418
9 2016-02-01 Facebook 6503
10 2016-03-01 Facebook 5545
# ... with 134 more rows
Now I want to spread the months into columns to show revenue trend by channel, month over month. I can do that like so:
revenueTable <- dataset %>% select(Month, Channel, Revenue) %>%
group_by(Month, Channel) %>%
summarise(Revenue = sum(Revenue)) %>%
#mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
gather(Key, Value, -Channel, -Month) %>%
spread(Month, Value) %>%
select(-Key)
And it looks almost exactly as I want:
> revenueTable
# A tibble: 6 × 25
Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01` `2015-10-01` `2015-11-01` `2015-12-01` `2016-01-01` `2016-02-01` `2016-03-01` `2016-04-01`
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Direct 11910 8417 4012 359 4473 2702 6261 6167 8630 5230 1394
2 Email 7244 3517 671 1339 10788 10575 8567 8406 7856 6345 7733
3 Facebook 8552 12449 10765 9249 11688 7991 7849 2418 6503 5545 3908
4 Organic 4191 978 219 4274 2924 4155 5981 9719 8220 8829 7024
5 SEM 2344 6873 10230 6429 5016 2964 3390 3841 3163 1994 2105
6 Youtube 186 2949 2144 5073 1035 4878 7905 7377 2305 4556 6247
# ... with 13 more variables: `2016-05-01` <dbl>, `2016-06-01` <dbl>, `2016-07-01` <dbl>, `2016-08-01` <dbl>, `2016-09-01` <dbl>, `2016-10-01` <dbl>,
# `2016-11-01` <dbl>, `2016-12-01` <dbl>, `2017-01-01` <dbl>, `2017-02-01` <dbl>, `2017-03-01` <dbl>, `2017-04-01` <dbl>, `2017-05-01` <dbl>
Now the part I'm struggling with. I would like to format the data as currency. I tried adding this between summarise() and gather() in the chain:
mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
This half works. The dollar sign is prepended but the comma separators do not show. I tried removing the paste0("$" part to see if I could get the comma formatting to work with no success.
How can I format my tbl as a currency with dollars and commas, rounded to nearest whole dollars (no $1.99, just $2)?
I think you can just do this at the end with dplyr::mutate_at().
revenueTable %>% mutate_at(vars(-Channel), ~ scales::dollar(round(.x, 0)))
#> # A tibble: 6 x 25
#> Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Direct $11,910 $8,417 $4,012 $359
#> 2 Email $7,244 $3,517 $671 $1,339
#> 3 Facebook $8,552 $12,449 $10,765 $9,249
#> 4 Organic $4,191 $978 $219 $4,274
#> 5 SEM $2,344 $6,873 $10,230 $6,429
#> 6 Youtube $186 $2,949 $2,144 $5,073
#> # ... with 20 more variables: `2015-10-01` <chr>, `2015-11-01` <chr>,
#> # `2015-12-01` <chr>, `2016-01-01` <chr>, `2016-02-01` <chr>,
#> # `2016-03-01` <chr>, `2016-04-01` <chr>, `2016-05-01` <chr>,
#> # `2016-06-01` <chr>, `2016-07-01` <chr>, `2016-08-01` <chr>,
#> # `2016-09-01` <chr>, `2016-10-01` <chr>, `2016-11-01` <chr>,
#> # `2016-12-01` <chr>, `2017-01-01` <chr>, `2017-02-01` <chr>,
#> # `2017-03-01` <chr>, `2017-04-01` <chr>, `2017-05-01` <chr>
We can use data.table
library(data.table)
nm1 <- setdiff(names(revenueTable), 'Channel')
setDT(revenueTable)[, (nm1) := lapply(.SD, function(x)
scales::dollar(round(x))), .SDcols = nm1]
revenueTable[, 1:3, with = FALSE]
# Channel `2015-06-01` `2015-07-01`
#1: Direct $11,910 $8,417
#2: Email $7,244 $3,517
#3: Facebook $8,552 $12,449
#4: Organic $4,191 $978
#5: SEM $2,344 $6,873
#6: Youtube $186 $2,949
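As for why the original attempt only half works: in format(), big.interval sets how many digits sit between separators, while big.mark sets the separator character itself, and its default is an empty string. A base R sketch with made-up revenue values:

```r
# Made-up revenue values for illustration
revenue <- c(11910, 8417.4, 359, 1.99)

# round() gives whole dollars; big.mark inserts the thousands separator;
# trim = TRUE removes the padding that format() adds to align widths
formatted <- paste0("$", format(round(revenue), big.mark = ",", trim = TRUE))
print(formatted)
```

Inside the original chain this would read mutate(Revenue = paste0("$", format(round(Revenue), big.mark = ",", trim = TRUE))).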
