Rvest returning table containing NAs - r

I am trying to scrape data from a table using the rvest package, but the table comes back filled with NAs and missing all but the first row.
How can I solve this?
library(rvest)

Source <- "https://www.viewbase.com/bitfinex_long_short_position"
Longs <- read_html(Source) %>%
  html_node(xpath = '/html/body/div[1]/div[3]/div[2]/div[2]/div[2]/div/div[2]/div/div[2]/table') %>%
  html_table(fill = TRUE) %>%
  as.data.frame()
Longs
Long Position 24H Change Short Position 24H Change % Long vs. Short NA NA NA NA
1 NA NA NA NA NA NA NA NA NA NA

You might have to use RSelenium to get the exact same output that you see on the webpage. However, you can also get most of the information from the JSON feed available on the webpage.
library(magrittr)  # for %>%

jsonlite::fromJSON('https://api.viewbase.com/margin/bfx_long_short_now') %>%
  dplyr::bind_rows()
# BTCUSD BTCUST ETHUSD ETHUST ETHBTC USTUSD XRPUSD XRPBTC BABUSD
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 58323 5.82e4 4.08e3 4078. 7.01e-2 1.00e0 1.57e0 2.69e-5 252
#2 58100 5.81e4 3.88e3 3880. 6.68e-2 1.00e0 1.52e0 2.62e-5 252
#3 31557. 2.30e3 1.93e5 21972. 3.62e+5 3.72e5 5.07e7 2.54e+7 8843.
#4 31547. 2.30e3 1.87e5 21944. 4.05e+5 1.41e6 5.11e7 2.53e+7 8843.
#5 559. 1.97e2 2.43e5 442. 1.02e+4 9.09e6 9.71e6 1.71e+7 15050.
#6 601. 3.81e1 2.46e5 598. 9.35e+3 1.34e5 9.97e6 1.73e+7 15050.
# … with 27 more variables: BABBTC <dbl>, LTCUSD <dbl>, LTCBTC <dbl>,
# EOSUSD <dbl>, EOSBTC <dbl>, ETCUSD <dbl>, ETCBTC <dbl>,
# BSVUSD <dbl>, BSVBTC <dbl>, XTZUSD <dbl>, XTZBTC <dbl>,
# ZECUSD <dbl>, ZECBTC <dbl>, LEOUSD <dbl>, LEOUST <dbl>,
# DSHUSD <dbl>, DSHBTC <dbl>, IOTUSD <dbl>, IOTBTC <dbl>,
# NEOUSD <dbl>, NEOBTC <dbl>, OMGUSD <dbl>, OMGBTC <dbl>,
# XLMUSD <dbl>, XLMBTC <dbl>, XMRUSD <dbl>, XMRBTC <dbl>


assign values inside a function based on specific columns in a data frame

I'm looking for direction on inserting a data frame column's values into a specific place in a function call, and then looping (or similar) to create a series of objects that can be bound into one longer table.
Example data:
a <- c("17", "17", "29")
b <- c("133", "163", "055")
data.frame(a, b)
Doing this manually...
library(zipcodeR)
library(data.table)
T1 <- search_fips("17", "133")
T2 <- search_fips("17", "163")
T3 <- search_fips("29", "055")
TT <- list(T1, T2, T3)
CZ_zips <- rbindlist(TT, use.names = TRUE, fill = TRUE)
I would like to read the a and b columns into a set place in the function, to create a series of vectors or data frames that can then be bound into one longer table.
The search_fips() function pulls from the census FIPS data, where a = state and b = county. The package is zipcodeR.
One simple way is to wrap search_fips() in an lapply() call; the rest stays the same.
library(zipcodeR)
a <- c("17", "17", "29")
b <- c("133", "163", "055")
df <- data.frame(a, b)
output <- lapply(1:nrow(df), function(i) {
  search_fips(df$a[i], df$b[i])
})
answer <- dplyr::bind_rows(output)
Here is a loop you might want to put in your function:
library(dplyr)
library(zipcodeR)
my_list <- list()
for (i in 1:nrow(df)) {
  # use [[ ]] so each result data frame is stored whole as a list element
  my_list[[i]] <- search_fips(df$a[i], df$b[i])
}
new_df <- bind_rows(my_list)
new_df
Using rowwise
library(dplyr)
library(tidyr)
library(zipcodeR)
out <- df %>%
  rowwise %>%
  mutate(result = list(search_fips(a, b))) %>%
  ungroup %>%
  unnest(result)
Output:
> head(out, 2)
# A tibble: 2 × 26
a b zipcode zipcode_type major_city post_office_city common_city_list county state lat lng timezone radius_in_miles area_code_list
<chr> <chr> <chr> <chr> <chr> <chr> <blob> <chr> <chr> <dbl> <dbl> <chr> <dbl> <blob>
1 17 133 62236 Standard Columbia Columbia, IL <raw 20 B> Monroe County IL 38.4 -90.2 Central 7 <raw 15 B>
2 17 133 62244 Standard Fults Fults, IL <raw 17 B> Monroe County IL 38.2 -90.2 Central 7 <raw 15 B>
# … with 12 more variables: population <int>, population_density <dbl>, land_area_in_sqmi <dbl>, water_area_in_sqmi <dbl>, housing_units <int>,
# occupied_housing_units <int>, median_home_value <int>, median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
# bounds_north <dbl>, bounds_south <dbl>
data
df <- structure(list(a = c("17", "17", "29"), b = c("133", "163", "055"
)), class = "data.frame", row.names = c(NA, -3L))
Here is a solution with Map.
The two one-liners below are equivalent; the first is probably more readable, but the second is simpler.
library(zipcodeR)
a <- c("17", "17", "29")
b <- c("133", "163", "055")
df <- data.frame(a, b)
Map(function(x, y) search_fips(x, y), df$a, df$b)
result <- Map(search_fips, df$a, df$b)
result <- dplyr::bind_rows(result)
head(result)
#> # A tibble: 6 x 24
#> zipcode zipcode_type major_city post_office_city common_city_list county state
#> <chr> <chr> <chr> <chr> <blob> <chr> <chr>
#> 1 62236 Standard Columbia Columbia, IL <raw 20 B> Monro~ IL
#> 2 62244 Standard Fults Fults, IL <raw 17 B> Monro~ IL
#> 3 62248 PO Box Hecker Hecker, IL <raw 18 B> Monro~ IL
#> 4 62256 PO Box Maeystown <NA> <raw 21 B> Monro~ IL
#> 5 62279 PO Box Renault Renault, IL <raw 19 B> Monro~ IL
#> 6 62295 Standard Valmeyer Valmeyer, IL <raw 20 B> Monro~ IL
#> # ... with 17 more variables: lat <dbl>, lng <dbl>, timezone <chr>,
#> # radius_in_miles <dbl>, area_code_list <blob>, population <int>,
#> # population_density <dbl>, land_area_in_sqmi <dbl>,
#> # water_area_in_sqmi <dbl>, housing_units <int>,
#> # occupied_housing_units <int>, median_home_value <int>,
#> # median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
#> # bounds_north <dbl>, bounds_south <dbl>
Created on 2022-02-18 by the reprex package (v2.0.1)

step_dummy(): dealing with duplicated column names generated by recipe() steps (tidymodels)

Dear community,
I have been struggling for an extensive amount of time now, trying to understand what is going on when I perform my recipe() steps for my linear (glm) model using the tidymodels framework. The recipe() step step_dummy(all_nominal(), -all_outcomes()) was suggested by the usemodels() function ( https://usemodels.tidymodels.org/index.html ).
When I comment out the step_dummy(), the recipe() and prep() work fine; however, it is important to me that these categorical variables are dummified (..is that a word!?).
This is the first time I am making use of and including a reprex in a question on Stack Overflow, so please let me know if you need more information to assist on this matter.
I have looked everywhere, e.g. including a one_hot = TRUE or keep_original_cols argument in step_dummy(), but it does not seem to be effective.
It should be quite easy, as it is a matter of renaming the generated columns as unique, but I have not succeeded. Here is the era.af_train set.
> era.af_train
# A tibble: 7,104 x 44
logRR ID AEZ16simple PrName.Code SubPrName.Code Product Country
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 -0.851 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
2 -1.17 1663 Warm.Semiar~ BP/Mu Mu-N/TW Pearl Mill~ Niger
3 -0.314 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
4 -0.776 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
5 -0.0850 1675 Warm.Semiar~ AP TPM+N Pearl Mill~ Niger
6 -0.159 1689 Warm.Subhum~ Al/AP/BP Al+N/LF/TP/TPM~ Maize Togo
7 -0.579 1701 Warm.Semiar~ BP TW Fodder (Le~ Tunisia
8 -0.662 1729 Warm.Subhum~ Al Al-N/Al+N Cassava or~ Nigeria
9 -1.80 1802 Cool.Subhum~ Al/AP Al+N/TP Wheat Ethiop~
10 -1.74 1802 Cool.Subhum~ Al/AP Al+N/TP/TPI+N Wheat Ethiop~
# ... with 7,094 more rows, and 37 more variables: Latitude <dbl>,
# Longitude <dbl>, Site.Type <fct>, Tree <fct>, Bio01_MT_Anu.Mean <dbl>,
# Bio02_MDR.Mean <dbl>, Bio03_Iso.Mean <dbl>, Bio04_TS.Mean <dbl>,
# Bio05_TWM.Mean <dbl>, Bio06_MinTCM.Mean <dbl>, Bio07_TAR.Mean <dbl>,
# Bio08_MT_WetQ.Mean <dbl>, Bio09_MT_DryQ.Mean <dbl>,
# Bio10_MT_WarQ.Mean <dbl>, Bio11_MT_ColQ.Mean <dbl>,
# Bio12_Pecip_Anu.Mean <dbl>, Bio13_Precip_WetM.Mean <dbl>,
# Bio14_Precip_DryM.Mean <dbl>, Bio15_Precip_S.Mean <dbl>,
# Bio16_Precip_WetQ.Mean <dbl>, Bio17_Precip_DryQ.Mean <dbl>,
# Mean_log.n_tot_ncs <dbl>, Mean_log.ca_mehlich3 <dbl>,
# Mean_log.k_mehlich3 <dbl>, Mean_log.mg_mehlich3 <dbl>,
# Mean_log.p_mehlich3 <dbl>, Mean_log.s_mehlich3 <dbl>,
# Mean_log.fe_mehlich3 <dbl>, Mean_db_od <dbl>, Mean_bdr <dbl>,
# Mean_sand_tot_psa <dbl>, Mean_clay_tot_psa <dbl>, Mean_ph_h2o <dbl>,
# Mean_log.ecec.f <dbl>, Mean_log.c_tot <dbl>, Mean_log.oc <dbl>,
# Slope.mean <dbl>
I am including the columns ID, AEZ16simple, PrName.Code, SubPrName.Code, Product, Country, Latitude and Longitude as "ID variables", as I wish to compare the glm model later with a random forest model and a XGBoost model.
All help is welcome!
Have a good weekend and
thank you in advance.
library(reprex)
#> Warning: package 'reprex' was built under R version 4.0.5
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(recipes)
#> Warning: package 'recipes' was built under R version 4.0.5
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(readr)
#> Warning: package 'readr' was built under R version 4.0.5
setwd("C:/Users/lindh011/OneDrive - Wageningen University & Research/Internship ICRAF (ERA)/ERA_Agroforestry_WURwork")
era.af_train <- read_csv("era.af_train.csv")
#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> .default = col_double(),
#> AEZ16simple = col_character(),
#> PrName.Code = col_character(),
#> SubPrName.Code = col_character(),
#> Product = col_character(),
#> Country = col_character(),
#> Site.Type = col_character(),
#> Tree = col_character()
#> )
#> i Use `spec()` for the full column specifications.
era.af_train_Tib <- as_tibble(era.af_train)
glmnet_recipe <-
  recipe(formula = logRR ~ ., data = era.af_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), naming = dummy_names) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  update_role(ID,
              AEZ16simple,
              PrName.Code,
              SubPrName.Code,
              Product,
              Country,
              Latitude,
              Longitude,
              new_role = "sample ID") %>%
  step_impute_mode(all_nominal(), -all_outcomes()) %>%
  step_impute_knn(all_numeric_predictors()) %>%
  step_impute_knn(logRR) %>%
  step_corr(all_numeric_predictors()) %>%
  step_nzv(all_numeric_predictors()) %>%
  prep()
#> Error: Column names `SubPrName.Code_AF.N.Al.N.TP`, `SubPrName.Code_AF.N.Al.N.TP.TPM`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N`, and 33 more must not be duplicated.
#> Use .name_repair to specify repair.
Created on 2021-07-02 by the reprex package (v2.0.0)

tidymodels bake: Error: Please pass a data set to `new_data`

I'm using the recipe() function from the tidymodels packages to impute missing values and fix imbalanced data.
Here is my data:
mer_df <- mer2 %>%
  filter(!is.na(laststagestatus2)) %>%
  select(Id, Age_Range__c, Gender__c, numberoflead, leadduration, firsttouch, lasttouch, laststagestatus2) %>%
  mutate_if(is.character, factor) %>%
  mutate_if(is.logical, as.integer)
# A tibble: 197,836 x 8
Id Age_Range__c Gender__c numberoflead leadduration firsttouch lasttouch
<fct> <fct> <fct> <int> <dbl> <fct> <fct>
1 0010~ NA NA 2 5.99 Dealer IB~ Walk in
2 0010~ NA NA 1 0 Online Se~ Online S~
3 0010~ NA NA 1 0 Walk in Walk in
4 0010~ NA NA 1 0 Online Se~ Online S~
5 0010~ NA NA 2 0.0128 Dealer IB~ Dealer I~
6 0010~ NA NA 1 0 OB Call OB Call
7 0010~ NA NA 1 0 Dealer IB~ Dealer I~
8 0010~ NA NA 4 73.9 Dealer IB~ Walk in
9 0010~ NA Male 24 0.000208 OB Call OB Call
10 0010~ NA NA 18 0.000150 OB Call OB Call
# ... with 197,826 more rows, and 1 more variable: laststagestatus2 <fct>
Here is my code:
mer_rec <- recipe(laststagestatus2 ~ ., data = mer_train) %>%
  step_medianimpute(numberoflead, leadduration) %>%
  step_knnimpute(Gender__c, Age_Range__c, firsttouch, lasttouch) %>%
  step_other(Id, firsttouch) %>%
  step_other(Id, lasttouch) %>%
  step_dummy(all_nominal(), -laststagestatus2) %>%
  step_smote(laststagestatus2)
mer_rec
mer_rec %>% prep()
It works just fine up to this point:
Data Recipe
Inputs:
role #variables
outcome 1
predictor 7
Training data contained 148377 data points and 147597 incomplete rows.
Operations:
Median Imputation for 2 items [trained]
K-nearest neighbor imputation for Id, ... [trained]
Collapsing factor levels for Id, firsttouch [trained]
Collapsing factor levels for Id, lasttouch [trained]
Dummy variables from Id, ... [trained]
SMOTE based on laststagestatus2 [trained]
But when I run the bake() function, it gives an error that says:
mer_rec %>% prep() %>% bake(new_data=NULL) %>% count(laststagestatus2)
Error: Please pass a data set to `new_data`.
Could anyone help me with what I'm missing here?
There is a fix in the development version of recipes that gets this working. You can install it via:
devtools::install_github("tidymodels/recipes")
Then you can bake() with new_data = NULL to get out the transformed training data.
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))
set.seed(123)
ames_split <- initial_split(ames, prob = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_rec <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_rec %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 2,199 x 71
#> Gr_Liv_Area Year_Built Sale_Price Neighborhood_Co… Neighborhood_Ol…
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 3.22 1960 5.33 0 0
#> 2 2.95 1961 5.02 0 0
#> 3 3.12 1958 5.24 0 0
#> 4 3.21 1997 5.28 0 0
#> 5 3.21 1998 5.29 0 0
#> 6 3.13 2001 5.33 0 0
#> 7 3.11 1992 5.28 0 0
#> 8 3.21 1995 5.37 0 0
#> 9 3.22 1993 5.25 0 0
#> 10 3.17 1998 5.26 0 0
#> # … with 2,189 more rows, and 66 more variables: Neighborhood_Edwards <dbl>,
#> # Neighborhood_Somerset <dbl>, Neighborhood_Northridge_Heights <dbl>,
#> # Neighborhood_Gilbert <dbl>, Neighborhood_Sawyer <dbl>,
#> # Neighborhood_Northwest_Ames <dbl>, Neighborhood_Sawyer_West <dbl>,
#> # Neighborhood_Mitchell <dbl>, Neighborhood_Brookside <dbl>,
#> # Neighborhood_Crawford <dbl>, Neighborhood_Iowa_DOT_and_Rail_Road <dbl>,
#> # Neighborhood_Timberland <dbl>, Neighborhood_Northridge <dbl>,
#> # Neighborhood_Stone_Brook <dbl>,
#> # Neighborhood_South_and_West_of_Iowa_State_University <dbl>,
#> # Neighborhood_Clear_Creek <dbl>, Neighborhood_Meadow_Village <dbl>,
#> # Neighborhood_other <dbl>, Bldg_Type_TwoFmCon <dbl>, Bldg_Type_Duplex <dbl>,
#> # Bldg_Type_Twnhs <dbl>, Bldg_Type_TwnhsE <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwoFmCon <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_Duplex <dbl>, Gr_Liv_Area_x_Bldg_Type_Twnhs <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwnhsE <dbl>, Latitude_ns_01 <dbl>,
#> # Latitude_ns_02 <dbl>, Latitude_ns_03 <dbl>, Latitude_ns_04 <dbl>,
#> # Latitude_ns_05 <dbl>, Latitude_ns_06 <dbl>, Latitude_ns_07 <dbl>,
#> # Latitude_ns_08 <dbl>, Latitude_ns_09 <dbl>, Latitude_ns_10 <dbl>,
#> # Latitude_ns_11 <dbl>, Latitude_ns_12 <dbl>, Latitude_ns_13 <dbl>,
#> # Latitude_ns_14 <dbl>, Latitude_ns_15 <dbl>, Latitude_ns_16 <dbl>,
#> # Latitude_ns_17 <dbl>, Latitude_ns_18 <dbl>, Latitude_ns_19 <dbl>,
#> # Latitude_ns_20 <dbl>, Longitude_ns_01 <dbl>, Longitude_ns_02 <dbl>,
#> # Longitude_ns_03 <dbl>, Longitude_ns_04 <dbl>, Longitude_ns_05 <dbl>,
#> # Longitude_ns_06 <dbl>, Longitude_ns_07 <dbl>, Longitude_ns_08 <dbl>,
#> # Longitude_ns_09 <dbl>, Longitude_ns_10 <dbl>, Longitude_ns_11 <dbl>,
#> # Longitude_ns_12 <dbl>, Longitude_ns_13 <dbl>, Longitude_ns_14 <dbl>,
#> # Longitude_ns_15 <dbl>, Longitude_ns_16 <dbl>, Longitude_ns_17 <dbl>,
#> # Longitude_ns_18 <dbl>, Longitude_ns_19 <dbl>, Longitude_ns_20 <dbl>
Created on 2020-10-12 by the reprex package (v0.3.0.9001)
If you are unable to install packages from GitHub, you could use juice() to do the same thing.
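Continuing from the prepped `ames_rec` recipe above, the older-API equivalent would be a one-liner along these lines:

```r
# juice() extracts the processed training set from a prepped recipe;
# it is the older equivalent of bake(new_data = NULL).
ames_rec %>% prep() %>% juice()
```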

readr::read_csv() - parsing failure with nested quotations

I have a csv where some quoted fields contain another quotation inside them:
"blah blah "nested quote"", and this generates parsing failures. I'm not sure if this is a bug or whether there is an argument to deal with it.
Reprex (file is here or content pasted below):
readr::read_csv("~/temp/shittyquotes.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> INSTNM = col_character(),
#> ADDR = col_character(),
#> CITY = col_character(),
#> STABBR = col_character(),
#> ZIP = col_character(),
#> CHFNM = col_character(),
#> CHFTITLE = col_character(),
#> EIN = col_character(),
#> OPEID = col_character(),
#> WEBADDR = col_character(),
#> ADMINURL = col_character(),
#> FAIDURL = col_character(),
#> APPLURL = col_character(),
#> ACT = col_character(),
#> IALIAS = col_character(),
#> INSTCAT = col_character(),
#> CCBASIC = col_character(),
#> CCIPUG = col_character(),
#> CCSIZSET = col_character(),
#> CARNEGIE = col_character()
#> # ... with 2 more columns
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col expected actual file
#> 2 IALIAS delimiter or quote C '~/temp/shittyquotes.csv'
#> 2 IALIAS delimiter or quote D '~/temp/shittyquotes.csv'
#> 2 NA 59 columns 100 columns '~/temp/shittyquotes.csv'
#> # A tibble: 2 x 59
#> UNITID INSTNM ADDR CITY STABBR ZIP FIPS OBEREG CHFNM CHFTITLE
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 441238 City … 1500… Duar… CA 9101… 6 8 Dr. … Director
#> 2 441247 Commu… 3800… Mode… CA 9535… 6 8 Vict… Preside…
#> # ... with 49 more variables: GENTELE <dbl>, EIN <chr>, OPEID <chr>,
#> # OPEFLAG <dbl>, WEBADDR <chr>, ADMINURL <chr>, FAIDURL <chr>,
#> # APPLURL <chr>, SECTOR <dbl>, ICLEVEL <dbl>, CONTROL <dbl>,
#> # HLOFFER <dbl>, UGOFFER <dbl>, GROFFER <dbl>, FPOFFER <dbl>,
#> # HDEGOFFR <dbl>, DEGGRANT <dbl>, HBCU <dbl>, HOSPITAL <dbl>,
#> # MEDICAL <dbl>, TRIBAL <dbl>, LOCALE <dbl>, OPENPUBL <dbl>, ACT <chr>,
#> # NEWID <dbl>, DEATHYR <dbl>, CLOSEDAT <dbl>, CYACTIVE <dbl>,
#> # POSTSEC <dbl>, PSEFLAG <dbl>, PSET4FLG <dbl>, RPTMTH <dbl>,
#> # IALIAS <chr>, INSTCAT <chr>, CCBASIC <chr>, CCIPUG <chr>,
#> # CCIPGRAD <dbl>, CCUGPROF <dbl>, CCENRPRF <dbl>, CCSIZSET <chr>,
#> # CARNEGIE <chr>, TENURSYS <dbl>, LANDGRNT <dbl>, INSTSIZE <chr>,
#> # CBSA <dbl>, CBSATYPE <chr>, CSA <dbl>, NECTA <dbl>, DFRCGID <dbl>
Created on 2018-12-04 by the reprex package (v0.2.1)
Also here's the csv content:
UNITID,INSTNM,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,CHFNM,CHFTITLE,GENTELE,EIN,OPEID,OPEFLAG,WEBADDR,ADMINURL,FAIDURL,APPLURL,SECTOR,ICLEVEL,CONTROL,HLOFFER,UGOFFER,GROFFER,FPOFFER,HDEGOFFR,DEGGRANT,HBCU,HOSPITAL,MEDICAL,TRIBAL,LOCALE,OPENPUBL,ACT,NEWID,DEATHYR,CLOSEDAT,CYACTIVE,POSTSEC,PSEFLAG,PSET4FLG,RPTMTH,IALIAS,INSTCAT,CCBASIC,CCIPUG,CCIPGRAD,CCUGPROF,CCENRPRF,CCSIZSET,CARNEGIE,TENURSYS,LANDGRNT,INSTSIZE,CBSA,CBSATYPE,CSA,NECTA,DFRCGID
441238,"City of Hope Graduate School of Biological Science","1500 E Duarte Rd","Duarte","CA","91010-3000", 6, 8,"Dr. Arthur Riggs","Director","6263018293","953432210","03592400",1,"gradschool.coh.org"," "," "," ",2,1,2,9,2,1,2,10,1,2,-2,2,2,21,1,"A ",-2,-2,"-2",1,1,1,1,1," ",1,25,-2,-2,-2,7,-2,-3,1,2,1,31100,1,348,-2,198
441247,"Community Business College","3800 McHenry Ave Suite M","Modesto","CA","95356-1569", 6, 8,"Victor L. Vandenberghe","President","2095293648","484-8230","03615300",7,"www.communitybusinesscollege.edu","www.communitybusinesscollege.edu","www.cbc123.com","www.123.com",9,3,3,1,1,2,2,0,2,2,-2,2,2,12,1,"A ",-2,-2,"-2",1,1,1,1,2,"formerly "Community Business School"",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,33700,1,-2,-2,71
441256,"Design's School of Cosmetology","715 24th St Ste E","Paso Robles","CA","93446", 6, 8,"Sharon Skinner","Administrator","8052378575","80002030","03646300",1,"designsschool.com"," "," "," ",9,3,3,2,1,2,2,0,2,2,-2,2,2,13,1,"A ",-2,-2,"-2",1,1,1,1,2," ",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,42020,1,-2,-2,46
Jim Hester provided this answer:
You need to use the escape_double = FALSE argument to read_delim(). This isn't part of read_csv() because Excel-style CSVs escape inner quotations by doubling them.
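A minimal sketch of that suggestion, using the file path from the question (escape_double = FALSE tells readr not to treat a doubled quote as an escape):

```r
library(readr)

# Read the comma-delimited file with read_delim() so that escape_double
# can be turned off, avoiding the "delimiter or quote" parsing failures.
df <- read_delim("~/temp/shittyquotes.csv", delim = ",",
                 escape_double = FALSE)
```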
data.table's fread() parses the file just fine. It throws a warning about the quotes, but you can ignore it:
library(data.table)
data.table::fread("./temp.csv")
Warning message:
In data.table::fread("./temp.csv") :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

Format a tbl within a dplyr chain

I am trying to add commas for thousands in my data, e.g. 10,000, along with dollar signs, e.g. $10,000.
I'm using several dplyr commands along with tidyr gather and spread functions. Here's what I tried:
Cut and paste this code block to generate the random data ("dataset") I'm working with:
library(dplyr)
library(tidyr)
library(lubridate)
## Generate some data
channels <- c("Facebook", "Youtube", "SEM", "Organic", "Direct", "Email")
last_month <- Sys.Date() %m+% months(-1) %>% floor_date("month")
mts <- seq(from = last_month %m+% months(-23), to = last_month, by = "1 month") %>% as.Date()
dimvars <- expand.grid(Month = mts, Channel = channels, stringsAsFactors = FALSE)
# metrics
rws <- nrow(dimvars)
set.seed(42)
# generates variablility in the random data
randwalk <- function(initial_val, ...) {
  initial_val + cumsum(rnorm(...))
}
Sessions <- ceiling(randwalk(3000, n = rws, mean = 8, sd = 1500)) %>% abs()
Revenue <- ceiling(randwalk(10000, n = rws, mean = 0, sd = 3500)) %>% abs()
# make primary df
dataset <- cbind(dimvars, Revenue)
Which looks like:
> tbl_df(dataset)
# A tibble: 144 × 3
Month Channel Revenue
<date> <chr> <dbl>
1 2015-06-01 Facebook 8552
2 2015-07-01 Facebook 12449
3 2015-08-01 Facebook 10765
4 2015-09-01 Facebook 9249
5 2015-10-01 Facebook 11688
6 2015-11-01 Facebook 7991
7 2015-12-01 Facebook 7849
8 2016-01-01 Facebook 2418
9 2016-02-01 Facebook 6503
10 2016-03-01 Facebook 5545
# ... with 134 more rows
Now I want to spread the months into columns to show revenue trend by channel, month over month. I can do that like so:
revenueTable <- dataset %>% select(Month, Channel, Revenue) %>%
group_by(Month, Channel) %>%
summarise(Revenue = sum(Revenue)) %>%
#mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
gather(Key, Value, -Channel, -Month) %>%
spread(Month, Value) %>%
select(-Key)
And it looks almost exactly as I want:
> revenueTable
# A tibble: 6 × 25
Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01` `2015-10-01` `2015-11-01` `2015-12-01` `2016-01-01` `2016-02-01` `2016-03-01` `2016-04-01`
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Direct 11910 8417 4012 359 4473 2702 6261 6167 8630 5230 1394
2 Email 7244 3517 671 1339 10788 10575 8567 8406 7856 6345 7733
3 Facebook 8552 12449 10765 9249 11688 7991 7849 2418 6503 5545 3908
4 Organic 4191 978 219 4274 2924 4155 5981 9719 8220 8829 7024
5 SEM 2344 6873 10230 6429 5016 2964 3390 3841 3163 1994 2105
6 Youtube 186 2949 2144 5073 1035 4878 7905 7377 2305 4556 6247
# ... with 13 more variables: `2016-05-01` <dbl>, `2016-06-01` <dbl>, `2016-07-01` <dbl>, `2016-08-01` <dbl>, `2016-09-01` <dbl>, `2016-10-01` <dbl>,
# `2016-11-01` <dbl>, `2016-12-01` <dbl>, `2017-01-01` <dbl>, `2017-02-01` <dbl>, `2017-03-01` <dbl>, `2017-04-01` <dbl>, `2017-05-01` <dbl>
Now the part I'm struggling with: I would like to format the data as currency. I tried adding this between summarise() and gather() within the chain:
mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
This half works: the dollar sign is prepended, but the comma separators do not show. I tried removing the paste0("$" part to see if I could get the comma formatting to work, with no success.
How can I format my tbl as a currency with dollars and commas, rounded to nearest whole dollars (no $1.99, just $2)?
I think you can just do this at the end with dplyr::mutate_at().
revenueTable %>% mutate_at(vars(-Channel), funs(. %>% round(0) %>% scales::dollar()))
#> # A tibble: 6 x 25
#> Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Direct $11,910 $8,417 $4,012 $359
#> 2 Email $7,244 $3,517 $671 $1,339
#> 3 Facebook $8,552 $12,449 $10,765 $9,249
#> 4 Organic $4,191 $978 $219 $4,274
#> 5 SEM $2,344 $6,873 $10,230 $6,429
#> 6 Youtube $186 $2,949 $2,144 $5,073
#> # ... with 20 more variables: `2015-10-01` <chr>, `2015-11-01` <chr>,
#> # `2015-12-01` <chr>, `2016-01-01` <chr>, `2016-02-01` <chr>,
#> # `2016-03-01` <chr>, `2016-04-01` <chr>, `2016-05-01` <chr>,
#> # `2016-06-01` <chr>, `2016-07-01` <chr>, `2016-08-01` <chr>,
#> # `2016-09-01` <chr>, `2016-10-01` <chr>, `2016-11-01` <chr>,
#> # `2016-12-01` <chr>, `2017-01-01` <chr>, `2017-02-01` <chr>,
#> # `2017-03-01` <chr>, `2017-04-01` <chr>, `2017-05-01` <chr>
We can use data.table
library(data.table)
nm1 <- setdiff(names(revenueTable), 'Channel')
setDT(revenueTable)[, (nm1) := lapply(.SD, function(x)
scales::dollar(round(x))), .SDcols = nm1]
revenueTable[, 1:3, with = FALSE]
# Channel `2015-06-01` `2015-07-01`
#1: Direct $11,910 $8,417
#2: Email $7,244 $3,517
#3: Facebook $8,552 $12,449
#4: Organic $4,191 $978
#5: SEM $2,344 $6,873
#6: Youtube $186 $2,949
