Format a tbl within a dplyr chain - r

I am trying to add commas for thousands in my data e.g. 10,000 along with dollars e.g. $10,000.
I'm using several dplyr commands along with tidyr gather and spread functions. Here's what I tried:
Cut n paste this code block to generate the random data "dataset" I'm working with:
library(dplyr)
library(tidyr)
library(lubridate)
## Generate some data
channels <- c("Facebook", "Youtube", "SEM", "Organic", "Direct", "Email")
last_month <- Sys.Date() %m+% months(-1) %>% floor_date("month")
mts <- seq(from = last_month %m+% months(-23), to = last_month, by = "1 month") %>% as.Date()
dimvars <- expand.grid(Month = mts, Channel = channels, stringsAsFactors = FALSE)
# metrics
rws <- nrow(dimvars)
set.seed(42)
# generates variablility in the random data
randwalk <- function(initial_val, ...){
initial_val + cumsum(rnorm(...))
}
Sessions <- ceiling(randwalk(3000, n = rws, mean = 8, sd = 1500)) %>% abs()
Revenue <- ceiling(randwalk(10000, n = rws, mean = 0, sd = 3500)) %>% abs()
# make primary df
dataset <- cbind(dimvars, Revenue)
Which looks like:
> tbl_df(dataset)
# A tibble: 144 × 3
Month Channel Revenue
<date> <chr> <dbl>
1 2015-06-01 Facebook 8552
2 2015-07-01 Facebook 12449
3 2015-08-01 Facebook 10765
4 2015-09-01 Facebook 9249
5 2015-10-01 Facebook 11688
6 2015-11-01 Facebook 7991
7 2015-12-01 Facebook 7849
8 2016-01-01 Facebook 2418
9 2016-02-01 Facebook 6503
10 2016-03-01 Facebook 5545
# ... with 134 more rows
Now I want to spread the months into columns to show revenue trend by channel, month over month. I can do that like so:
revenueTable <- dataset %>% select(Month, Channel, Revenue) %>%
group_by(Month, Channel) %>%
summarise(Revenue = sum(Revenue)) %>%
#mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
gather(Key, Value, -Channel, -Month) %>%
spread(Month, Value) %>%
select(-Key)
And it looks almost exactly as I want:
> revenueTable
# A tibble: 6 × 25
Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01` `2015-10-01` `2015-11-01` `2015-12-01` `2016-01-01` `2016-02-01` `2016-03-01` `2016-04-01`
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Direct 11910 8417 4012 359 4473 2702 6261 6167 8630 5230 1394
2 Email 7244 3517 671 1339 10788 10575 8567 8406 7856 6345 7733
3 Facebook 8552 12449 10765 9249 11688 7991 7849 2418 6503 5545 3908
4 Organic 4191 978 219 4274 2924 4155 5981 9719 8220 8829 7024
5 SEM 2344 6873 10230 6429 5016 2964 3390 3841 3163 1994 2105
6 Youtube 186 2949 2144 5073 1035 4878 7905 7377 2305 4556 6247
# ... with 13 more variables: `2016-05-01` <dbl>, `2016-06-01` <dbl>, `2016-07-01` <dbl>, `2016-08-01` <dbl>, `2016-09-01` <dbl>, `2016-10-01` <dbl>,
# `2016-11-01` <dbl>, `2016-12-01` <dbl>, `2017-01-01` <dbl>, `2017-02-01` <dbl>, `2017-03-01` <dbl>, `2017-04-01` <dbl>, `2017-05-01` <dbl>
Now the part I'm struggling with. I would like to format the data as currency. I tried adding this inbetween summarise() and gather() within the chain:
mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
This half works. The dollar sign is prepended but the comma separators do not show. I tried removing the paste0("$" part to see if I could get the comma formatting to work with no success.
How can I format my tbl as a currency with dollars and commas, rounded to nearest whole dollars (no $1.99, just $2)?

I think you can just do this at the end with dplyr::mutate_at().
revenueTable %>% mutate_at(vars(-Channel), funs(. %>% round(0) %>% scales::dollar()))
#> # A tibble: 6 x 25
#> Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Direct $11,910 $8,417 $4,012 $359
#> 2 Email $7,244 $3,517 $671 $1,339
#> 3 Facebook $8,552 $12,449 $10,765 $9,249
#> 4 Organic $4,191 $978 $219 $4,274
#> 5 SEM $2,344 $6,873 $10,230 $6,429
#> 6 Youtube $186 $2,949 $2,144 $5,073
#> # ... with 20 more variables: `2015-10-01` <chr>, `2015-11-01` <chr>,
#> # `2015-12-01` <chr>, `2016-01-01` <chr>, `2016-02-01` <chr>,
#> # `2016-03-01` <chr>, `2016-04-01` <chr>, `2016-05-01` <chr>,
#> # `2016-06-01` <chr>, `2016-07-01` <chr>, `2016-08-01` <chr>,
#> # `2016-09-01` <chr>, `2016-10-01` <chr>, `2016-11-01` <chr>,
#> # `2016-12-01` <chr>, `2017-01-01` <chr>, `2017-02-01` <chr>,
#> # `2017-03-01` <chr>, `2017-04-01` <chr>, `2017-05-01` <chr>

We can use data.table
library(data.table)
nm1 <- setdiff(names(revenueTable), 'Channel')
setDT(revenueTable)[, (nm1) := lapply(.SD, function(x)
scales::dollar(round(x))), .SDcols = nm1]
revenueTable[, 1:3, with = FALSE]
# Channel `2015-06-01` `2015-07-01`
#1: Direct $11,910 $8,417
#2: Email $7,244 $3,517
#3: Facebook $8,552 $12,449
#4: Organic $4,191 $978
#5: SEM $2,344 $6,873
#6: Youtube $186 $2,949

Related

assign values inside a function based on specific columns in a data frame

I'm looking for direction on assigning a data frame column value to a specific place in a function and then looping or something to create a series of objects to be bound into a longer table.
example data
a = c("17","17","29")
b = c("133","163","055")
data.frame(a, b)
doing this manually...
library(zipcodeR)
T1 <- search_fips("17", "133")
T2 <- search_fips("17", "163")
T3 <- search_fips("29", "055")
TT <- list(T1, T2, T3)
CZ_zips <- rbindlist(TT, use.names=TRUE, fill=TRUE)
would like a to read a and b columns into a set place in function to create a series of vectors or data frames that can then be bound into one longer table.
the search_fips function pulls out of the census FIPS data, a = state and b = county. package is zipcodeR.
One simple way is to wrap the search_fips() function in a lapply function and rest stays the same.
library(zipcodeR)
a = c("17","17","29")
b = c("133","163","055")
df<-data.frame(a, b)
output <-lapply(1:nrow(df), function(i) {
search_fips(df$a[i], df$b[i])
})
answer <- dplyr::bind_rows(output)
here is a loop you might want to put in your function:
library(dplyr)
library(zipcodeR)
my_list <- list()
for (i in 1:nrow(df)) {
my_list[i] <- search_fips(df$a[i], df$b[i])
}
new_df <- bind_rows(my_list)
bind_rows(my_list)
Using rowwise
library(dplyr)
library(tidyr)
library(zipcodeR)
out <- df %>%
rowwise %>%
mutate(result = list(search_fips(a, b))) %>%
ungroup %>%
unnest(result)
-output
> head(out, 2)
# A tibble: 2 × 26
a b zipcode zipcode_type major_city post_office_city common_city_list county state lat lng timezone radius_in_miles area_code_list
<chr> <chr> <chr> <chr> <chr> <chr> <blob> <chr> <chr> <dbl> <dbl> <chr> <dbl> <blob>
1 17 133 62236 Standard Columbia Columbia, IL <raw 20 B> Monroe County IL 38.4 -90.2 Central 7 <raw 15 B>
2 17 133 62244 Standard Fults Fults, IL <raw 17 B> Monroe County IL 38.2 -90.2 Central 7 <raw 15 B>
# … with 12 more variables: population <int>, population_density <dbl>, land_area_in_sqmi <dbl>, water_area_in_sqmi <dbl>, housing_units <int>,
# occupied_housing_units <int>, median_home_value <int>, median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
# bounds_north <dbl>, bounds_south <dbl>
data
df <- structure(list(a = c("17", "17", "29"), b = c("133", "163", "055"
)), class = "data.frame", row.names = c(NA, -3L))
Here is a solution with Map.
The two one-liners below are equivalent, the first is probably more readable but the other one is simpler.
library(zipcodeR)
a <- c("17", "17", "29")
b <- c("133", "163", "055")
df <- data.frame(a, b)
Map(function(x, y) search_fips(x, y), df$a, df$b)
result <- Map(search_fips, df$a, df$b)
result <- dplyr::bind_rows(result)
head(result)
#> # A tibble: 6 x 24
#> zipcode zipcode_type major_city post_office_city common_city_list county state
#> <chr> <chr> <chr> <chr> <blob> <chr> <chr>
#> 1 62236 Standard Columbia Columbia, IL <raw 20 B> Monro~ IL
#> 2 62244 Standard Fults Fults, IL <raw 17 B> Monro~ IL
#> 3 62248 PO Box Hecker Hecker, IL <raw 18 B> Monro~ IL
#> 4 62256 PO Box Maeystown <NA> <raw 21 B> Monro~ IL
#> 5 62279 PO Box Renault Renault, IL <raw 19 B> Monro~ IL
#> 6 62295 Standard Valmeyer Valmeyer, IL <raw 20 B> Monro~ IL
#> # ... with 17 more variables: lat <dbl>, lng <dbl>, timezone <chr>,
#> # radius_in_miles <dbl>, area_code_list <blob>, population <int>,
#> # population_density <dbl>, land_area_in_sqmi <dbl>,
#> # water_area_in_sqmi <dbl>, housing_units <int>,
#> # occupied_housing_units <int>, median_home_value <int>,
#> # median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
#> # bounds_north <dbl>, bounds_south <dbl>
Created on 2022-02-18 by the reprex package (v2.0.1)

tidymodels bake:Error: Please pass a data set to `new_data`

I'm using recipe()function in tidymodels packages for imputation missing values and fixing imbalanced data.
here is my data;
mer_df <- mer2 %>%
filter(!is.na(laststagestatus2)) %>%
select(Id, Age_Range__c, Gender__c, numberoflead, leadduration, firsttouch, lasttouch, laststagestatus2)%>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
# A tibble: 197,836 x 8
Id Age_Range__c Gender__c numberoflead leadduration firsttouch lasttouch
<fct> <fct> <fct> <int> <dbl> <fct> <fct>
1 0010~ NA NA 2 5.99 Dealer IB~ Walk in
2 0010~ NA NA 1 0 Online Se~ Online S~
3 0010~ NA NA 1 0 Walk in Walk in
4 0010~ NA NA 1 0 Online Se~ Online S~
5 0010~ NA NA 2 0.0128 Dealer IB~ Dealer I~
6 0010~ NA NA 1 0 OB Call OB Call
7 0010~ NA NA 1 0 Dealer IB~ Dealer I~
8 0010~ NA NA 4 73.9 Dealer IB~ Walk in
9 0010~ NA Male 24 0.000208 OB Call OB Call
10 0010~ NA NA 18 0.000150 OB Call OB Call
# ... with 197,826 more rows, and 1 more variable: laststagestatus2 <fct>
here is my codes;
mer_rec <- recipe(laststagestatus2 ~ ., data = mer_train)%>%
step_medianimpute(numberoflead,leadduration)%>%
step_knnimpute(Gender__c,Age_Range__c,fisrsttouch,lasttouch) %>%
step_other(Id,firsttouch) %>%
step_other(Id,lasttouch) %>%
step_dummy(all_nominal(), -laststagestatus2) %>%
step_smote(laststagestatus2)
mer_rec
mer_rec %>% prep()
it just works fine until here ;
Data Recipe
Inputs:
role #variables
outcome 1
predictor 7
Training data contained 148377 data points and 147597 incomplete rows.
Operations:
Median Imputation for 2 items [trained]
K-nearest neighbor imputation for Id, ... [trained]
Collapsing factor levels for Id, firsttouch [trained]
Collapsing factor levels for Id, lasttouch [trained]
Dummy variables from Id, ... [trained]
SMOTE based on laststagestatus2 [trained]
but when ı run bake() function that gives error says;
mer_rec %>% prep() %>% bake(new_data=NULL) %>% count(laststagestatus2)
Error: Please pass a data set to `new_data`.
Could anyone help me about what I m missing here?
There is a fix in the development version of recipes to get this up and working. You can install via:
devtools::install_github("tidymodels/recipes")
Then you can bake() with new_data = NULL to get out the transformed training data.
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))
set.seed(123)
ames_split <- initial_split(ames, prob = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_rec <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_rec %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 2,199 x 71
#> Gr_Liv_Area Year_Built Sale_Price Neighborhood_Co… Neighborhood_Ol…
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 3.22 1960 5.33 0 0
#> 2 2.95 1961 5.02 0 0
#> 3 3.12 1958 5.24 0 0
#> 4 3.21 1997 5.28 0 0
#> 5 3.21 1998 5.29 0 0
#> 6 3.13 2001 5.33 0 0
#> 7 3.11 1992 5.28 0 0
#> 8 3.21 1995 5.37 0 0
#> 9 3.22 1993 5.25 0 0
#> 10 3.17 1998 5.26 0 0
#> # … with 2,189 more rows, and 66 more variables: Neighborhood_Edwards <dbl>,
#> # Neighborhood_Somerset <dbl>, Neighborhood_Northridge_Heights <dbl>,
#> # Neighborhood_Gilbert <dbl>, Neighborhood_Sawyer <dbl>,
#> # Neighborhood_Northwest_Ames <dbl>, Neighborhood_Sawyer_West <dbl>,
#> # Neighborhood_Mitchell <dbl>, Neighborhood_Brookside <dbl>,
#> # Neighborhood_Crawford <dbl>, Neighborhood_Iowa_DOT_and_Rail_Road <dbl>,
#> # Neighborhood_Timberland <dbl>, Neighborhood_Northridge <dbl>,
#> # Neighborhood_Stone_Brook <dbl>,
#> # Neighborhood_South_and_West_of_Iowa_State_University <dbl>,
#> # Neighborhood_Clear_Creek <dbl>, Neighborhood_Meadow_Village <dbl>,
#> # Neighborhood_other <dbl>, Bldg_Type_TwoFmCon <dbl>, Bldg_Type_Duplex <dbl>,
#> # Bldg_Type_Twnhs <dbl>, Bldg_Type_TwnhsE <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwoFmCon <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_Duplex <dbl>, Gr_Liv_Area_x_Bldg_Type_Twnhs <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwnhsE <dbl>, Latitude_ns_01 <dbl>,
#> # Latitude_ns_02 <dbl>, Latitude_ns_03 <dbl>, Latitude_ns_04 <dbl>,
#> # Latitude_ns_05 <dbl>, Latitude_ns_06 <dbl>, Latitude_ns_07 <dbl>,
#> # Latitude_ns_08 <dbl>, Latitude_ns_09 <dbl>, Latitude_ns_10 <dbl>,
#> # Latitude_ns_11 <dbl>, Latitude_ns_12 <dbl>, Latitude_ns_13 <dbl>,
#> # Latitude_ns_14 <dbl>, Latitude_ns_15 <dbl>, Latitude_ns_16 <dbl>,
#> # Latitude_ns_17 <dbl>, Latitude_ns_18 <dbl>, Latitude_ns_19 <dbl>,
#> # Latitude_ns_20 <dbl>, Longitude_ns_01 <dbl>, Longitude_ns_02 <dbl>,
#> # Longitude_ns_03 <dbl>, Longitude_ns_04 <dbl>, Longitude_ns_05 <dbl>,
#> # Longitude_ns_06 <dbl>, Longitude_ns_07 <dbl>, Longitude_ns_08 <dbl>,
#> # Longitude_ns_09 <dbl>, Longitude_ns_10 <dbl>, Longitude_ns_11 <dbl>,
#> # Longitude_ns_12 <dbl>, Longitude_ns_13 <dbl>, Longitude_ns_14 <dbl>,
#> # Longitude_ns_15 <dbl>, Longitude_ns_16 <dbl>, Longitude_ns_17 <dbl>,
#> # Longitude_ns_18 <dbl>, Longitude_ns_19 <dbl>, Longitude_ns_20 <dbl>
Created on 2020-10-12 by the reprex package (v0.3.0.9001)
If you are unable to install packages from GitHub, you could use juice() to do the same thing.

Apply a function to every case of a tibble

My second participation here, in Stackoverflow.
I have a function called bw_test with several args like this:
bw_test <- function(localip, remoteip, localspeed, remotespeed , duracion =30,direction ="both"){
comando <- str_c("ssh usuario#", localip ," /tool bandwidth-test direction=", direction," remote-tx-speed=",remotespeed,"M local-tx-speed=",localspeed,"M protocol=udp user=usuario password=mipasso duration=",duracion," ",remoteip)
resultado <- system(comando,intern = T,ignore.stderr = T)
# resultado pull from a ssh server a vector like this:
# head(resultado)
#[1] " status: connecting\r" " tx-current: #0bps\r" " tx-10-second-average: 0bps\r"
#[4] " tx-total-average: 0bps\r" " rx-current: #0bps\r" " rx-10-second-average: 0bps\r"
resultado %<>%
replace("\r","") %>%
tail(17) %>%
trimws("both") %>%
as_tibble %>%
mutate(local=localip, remote=remoteip) %>%
separate(value,sep=":", into=c("parametro","valor")) %>%
head(15)
resultado$valor %<>%
trimws() %>%
str_replace("Mbps","") %>% str_replace("%","") %>% str_replace("s","")
resultado %<>%
spread(parametro,valor)
resultado %<>%
mutate(`tx-percentaje`=as.numeric(resultado$`tx-total-average`)/localspeed) %>%
mutate(`rx-percentaje`=as.numeric(resultado$`rx-total-average`)/remotespeed)
return(resultado)
}
this function returns a tibble like this one:
A tibble: 1 x 19
local remote `connection-cou… direction duration `local-cpu-load` `lost-packets` `random-data` `remote-cpu-loa…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.… 192.1… 1 both 4 13 0 no 12
# … with 10 more variables: `rx-10-second-average` <chr>, `rx-current` <chr>, `rx-size` <chr>,
# `rx-total-average` <chr>, `tx-10-second-average` <chr>, `tx-current` <chr>, `tx-size` <chr>,
# `tx-total-average` <chr>, `tx-percentaje` <dbl>, `rx-percentaje` <dbl>
So, when I call the function inside rbind, got the result of every run on a tibble:
rbind(bw_test("192.168.105.10" ,"192.168.105.18", 75,125),
bw_test("192.168.133.11","192.168.133.9", 5 ,50),
bw_test("192.168.254.251","192.168.254.250", 25,150))
My results are for the example:
# A tibble: 3 x 19
local remote `connection-cou… direction duration `local-cpu-load` `lost-packets` `random-data` `remote-cpu-loa…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.… 192.1… 20 both 28 63 232 no 48
2 192.… 192.1… 20 both 29 4 0 no 20
3 192.… 192.1… 20 both 29 15 0 no 22
# … with 10 more variables: `rx-10-second-average` <chr>, `rx-current` <chr>, `rx-size` <chr>,
# `rx-total-average` <chr>, `tx-10-second-average` <chr>, `tx-current` <chr>, `tx-size` <chr>,
# `tx-total-average` <chr>, `tx-percentaje` <dbl>, `rx-percentaje` <dbl>
My problem is to apply the function to the cases of a tibble like like this.
aps <- tribble(
~name, ~ip, ~remoteip , ~bw_test, ~localspeed,~remotespeed,
"backbone_border_core","192.168.253.1", "192.168.253.3", 1,200,200,
"backbone_2_site2","192.168.254.251", "192.168.254.250", 1, 25,150
}
I was trying to use map, but i got:
map(c(aps$ip,aps$remoteip,aps$localspeed,aps$remotespeed), bw_test)
el argumento "remotespeed" está ausente, sin valor por omisión
I believe cause c(aps$ip,aps$remoteip,aps$localspeed,aps$remotespeed) feeds first all cases of aps$ip, later all of aps$remoteip and so on.
I'm using the right strategie? it's map a suitable way
What i'm doing wrong?
¿how can I apply function to every row to get the requested tibble?
I'll appreciate your kindly help.
Greets.
Try using pmap_df.
output <- purrr::pmap_df(list(aps$ip, aps$remoteip, aps$localspeed,
aps$remotespeed), bw_test)

Filtering Down Data, then Adding Brackets

Currently working with a data set about billionaires (small summary included below); I am looking to get them into three age brackets: "40 and under", "41 to 65", and "above 65", then what is the most common category (their kind of profession) of billionaire in each of the three age brackets.
I have tried to select the data down, and mutate or separate to create the brackets, but unsure what to do. The struck out lines are examples of things I have been trying.
load("bil.RData")
print(bil)
# A tibble: 2,614 x 22
age category citizenship company.name company.type country_code founded
<int> <chr> <chr> <chr> <chr> <chr> <int>
1 NA Financi… Saudi Arab… Rolaco Trad… new SAU 1968
2 34 Financi… United Sta… Fidelity In… new USA 1946
3 59 Non-Tra… Brazil Companhia B… new BRA 1948
4 61 New Sec… Germany Ratiopharm new DEU 1881
5 NA Financi… Hong Kong Swire new HKG 1816
6 NA Traded … Bahrain YBA Kanoo new BHR 1890
7 NA New Sec… Japan Otsuka Hold… new JPN 1921
8 NA Traded … Japan Sony new JPN 1946
9 66 Financi… Japan Mori Buildi… new JPN 1959
10 NA Traded … France Chanel new FRA 1909
# … with 2,604 more rows, and 15 more variables: `from emerging` <chr>,
# gdp <dbl>, gender <chr>, industry <chr>, inherited <chr>, name <chr>,
# rank <int>, region <chr>, relationship <chr>, sector <chr>, `was
# founder` <chr>, `was political` <chr>, wealth.type <chr>,
# worth_billions <dbl>, year <int>
bil %>%
select(age, category) %>%
arrange(age) %>%
filter(!is.na(age), !is.na(category)) %>%
group_by(age, category) %>%
#mutate(n = sum(age)) %>%
#separate(col = age, c("Under 40", "41-65", "Above 65")) %>%
print()
# A tibble: 2,158 x 2
# Groups: age, category [312]
age category
<int> <chr>
1 12 Financial
2 21 Financial
3 24 Financial
4 24 Financial
5 28 Non-Traded Sectors
6 28 Resource Related
7 29 Financial
8 29 Traded Sectors
9 29 New Sectors
10 29 New Sectors
Preferably, I am looking for a table with three rows (one per category, under 40, 41-65, 65+), and three columns (age_bracket, most common category, and n). Also, let me know the best way to include data sets on stackoverflow because this set is a bit large for dput() to be useful (I think).
I think there might be many ways to get what want.
Here are an example of using table1 :
library(tidyverse)
library(table1)
bil1<-bil %>%
mutate(age_group = cut(hp,
breaks = c(0, 40, 65, 110),
labels = c("< 40", "40 - 64", "65+")))
table1(~ category) | age_group, data = bil1)
You may also want to try many other packages such as arsenal and stargazer
To divide data into different buckets, use findInterval or cut and then use top_n to return top category in each bucket.
library(dplyr)
bil %>%
filter(!is.na(age)) %>%
group_by(group = findInterval(age, c(40, 65))) %>%
count(category) %>%
top_n(1, n)
This would just return 1, 2, 3 as labels, if you want to name the labels you can do
bil %>%
filter(!is.na(age)) %>%
group_by(group = c("40 and under", "41 to 65", "above 65")[
findInterval(age, c(40, 65)) + 1]) %>%
count(category) %>%
top_n(1, n)

Tidying up my data frame: moving columns to headers and data

I'm using a webscraper to scrape some data from FinViz. Here's an example
The problem is that the data frame is messy, the first column holds what I would ideally want as the headers and the second column holds the corresponding data. Here's an output:
data1 data2 data3 data4 data5 data6 data7 data8 data9 data10
1 Index S&P 500 P/E 36.13 EPS (ttm) 4.60 Insider Own 0.10% Shs Outstand 2.93B
2 Market Cap 487.15B Forward P/E 25.65 EPS next Y 6.48 Insider Trans -86.95% Shs Float 2.33B
3 Income 13.58B PEG 1.36 EPS next Q 1.27 Inst Own 72.50% Short Float 0.87%
4 Sales 33.17B P/S 14.69 EPS this Y 170.20% Inst Trans -0.22% Short Ratio 1.13
5 Book/sh 22.92 P/B 7.26 EPS next Y 21.63% ROA 20.30% Target Price 192.62
6 Cash/sh 12.10 P/C 13.74 EPS next 5Y 26.57% ROE 22.50% 52W Range 113.55 - 175.49
7 Dividend - P/FCF 34.05 EPS past 5Y 62.10% ROI 17.10% 52W High -5.23%
8 Dividend % - Quick Ratio 12.30 Sales past 5Y 49.40% Gross Margin 86.60% 52W Low 46.47%
9 Employees 20658 Current Ratio 12.30 Sales Q/Q 44.80% Oper. Margin 46.40% RSI (14) 49.05
10 Optionable Yes Debt/Eq 0.00 EPS Q/Q 68.80% Profit Margin 40.90% Rel Volume 0.70
11 Shortable Yes LT Debt/Eq 0.00 Earnings Jul 26 AMC Payout 0.00% Avg Volume 17.87M
12 Recom 1.70 SMA20 -1.84% SMA50 2.85% SMA200 17.52% Volume 12,583,873
As you can see, data1 contains the categories and data2 contains the following information.
Ideally I'd want it in this structure:
Index | Market Cap | Income | Sales | Book sh | ...
------------------------------------------------
S&P500 | 487.15B | 13.58B | 33.17B | 22.92 |
So that data1,3,5,7 were all the headers and data2,4,6,8 where all in one row.
Could anyone provide any input? I'm trying to avoid compiling them into 2 different vectors then rbinding the frame together.
Cheerio!
Here is a solution using some tidyverse packages and your dataset.
library(rvest) # for scrapping the data
#> Le chargement a nécessité le package : xml2
library(dplyr, warn.conflicts = F)
library(tidyr)
library(purrr, warn.conflict = F)
Fisrt, we get your data directly from your example url.
tab <- read_html("http://finviz.com/quote.ashx?t=BA") %>%
html_node("table.snapshot-table2") %>%
html_table(header = F) %>%
as_data_frame()
tab
#> # A tibble: 12 x 12
#> X1 X2 X3 X4 X5 X6
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Index DJIA S&P500 P/E 20.77 EPS (ttm) 11.42
#> 2 Market Cap 141.89B Forward P/E 22.14 EPS next Y 10.71
#> 3 Income 7.12B PEG 1.13 EPS next Q 2.62
#> 4 Sales 90.90B P/S 1.56 EPS this Y 2.30%
#> 5 Book/sh -3.34 P/B - EPS next Y 7.28%
#> 6 Cash/sh 17.26 P/C 13.74 EPS next 5Y 18.36%
#> 7 Dividend 5.68 P/FCF 17.94 EPS past 5Y 7.40%
#> 8 Dividend % 2.39% Quick Ratio 0.40 Sales past 5Y 6.60%
#> 9 Employees 150500 Current Ratio 1.20 Sales Q/Q -8.10%
#> 10 Optionable Yes Debt/Eq - EPS Q/Q 885.50%
#> 11 Shortable Yes LT Debt/Eq - Earnings Jul 26 BMO
#> 12 Recom 2.20 SMA20 -0.16% SMA50 8.14%
#> # ... with 6 more variables: X7 <chr>, X8 <chr>, X9 <chr>, X10 <chr>,
#> # X11 <chr>, X12 <chr>
As headers are in every odd column and data in every even column, we
create a tidy tibble of two columns by row binding the subsets. For
that, we generate odd and even index. Then,
purrr::map_dfr allows us to iterates over those 2 lists, applies a function and row bind the results. The function consist of selecting 2 columns with of the table[ ]and rename those two columns withset_names.
col_num <- seq_len(ncol(tab))
even <- col_num[col_num %% 2 == 0]
odd <- setdiff(col_num, even)
tab2 <- map2_dfr(odd, even, ~ set_names(tab[, c(.x, .y)], c("header", "value")))
tab2
#> # A tibble: 72 x 2
#> header value
#> <chr> <chr>
#> 1 Index DJIA S&P500
#> 2 Market Cap 141.89B
#> 3 Income 7.12B
#> 4 Sales 90.90B
#> 5 Book/sh -3.34
#> 6 Cash/sh 17.26
#> 7 Dividend 5.68
#> 8 Dividend % 2.39%
#> 9 Employees 150500
#> 10 Optionable Yes
#> # ... with 62 more rows
You have a nice 2 column long table with all your data. Now if you want
the table in wide format instead of long format, you have to transpose.
But first, we have to deal with some duplicates names in the header
column. You can't have duplicates column names.
tab2 %>%
filter(header == header[duplicated(header)])
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y 7.28%
We just rename the second occurence adding _2
tab3 <- tab2 %>%
mutate(header = case_when(
duplicated(header) ~ paste(header, 2, sep = "_"),
TRUE ~ header)
)
# No more duplicates
any(duplicated(tab3$header))
#> [1] FALSE
tab3 %>% filter(stringr::str_detect(header, "EPS next Y"))
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y_2 7.28%
You can pass in wide format and have 72 columns instead of 72 lines.
tab3 %>%
spread(header, value)
#> # A tibble: 1 x 72
#> `52W High` `52W Low` `52W Range` ATR `Avg Volume` Beta `Book/sh`
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -3.78% 87.78% 126.31 - 246.49 3.77 3.46M 1.18 -3.34
#> # ... with 65 more variables: `Cash/sh` <chr>, Change <chr>, `Current
#> # Ratio` <chr>, `Debt/Eq` <chr>, Dividend <chr>, `Dividend %` <chr>,
#> # Earnings <chr>, Employees <chr>, `EPS (ttm)` <chr>, `EPS next
#> # 5Y` <chr>, `EPS next Q` <chr>, `EPS next Y` <chr>, `EPS next
#> # Y_2` <chr>, `EPS past 5Y` <chr>, `EPS Q/Q` <chr>, `EPS this Y` <chr>,
#> # `Forward P/E` <chr>, `Gross Margin` <chr>, Income <chr>, Index <chr>,
#> # `Insider Own` <chr>, `Insider Trans` <chr>, `Inst Own` <chr>, `Inst
#> # Trans` <chr>, `LT Debt/Eq` <chr>, `Market Cap` <chr>, `Oper.
#> # Margin` <chr>, Optionable <chr>, `P/B` <chr>, `P/C` <chr>,
#> # `P/E` <chr>, `P/FCF` <chr>, `P/S` <chr>, Payout <chr>, PEG <chr>,
#> # `Perf Half Y` <chr>, `Perf Month` <chr>, `Perf Quarter` <chr>, `Perf
#> # Week` <chr>, `Perf Year` <chr>, `Perf YTD` <chr>, `Prev Close` <chr>,
#> # Price <chr>, `Profit Margin` <chr>, `Quick Ratio` <chr>, Recom <chr>,
#> # `Rel Volume` <chr>, ROA <chr>, ROE <chr>, ROI <chr>, `RSI (14)` <chr>,
#> # Sales <chr>, `Sales past 5Y` <chr>, `Sales Q/Q` <chr>, `Short
#> # Float` <chr>, `Short Ratio` <chr>, Shortable <chr>, `Shs Float` <chr>,
#> # `Shs Outstand` <chr>, SMA20 <chr>, SMA200 <chr>, SMA50 <chr>, `Target
#> # Price` <chr>, Volatility <chr>, Volume <chr>
Idea: You can also replace all the spaces by _ in the header column to have column names without spaces. Often simpler to handle.
Would this work ?
data <- data.frame(data1= letters[1:10],data2=LETTERS[1:10],data3= letters[11:20],data4=LETTERS[11:20],stringsAsFactors=F)
# data1 data2 data3 data4
# 1 a A k K
# 2 b B l L
# 3 c C m M
# 4 d D n N
# 5 e E o O
# 6 f F p P
# 7 g G q Q
# 8 h H r R
# 9 i I s S
# 10 j J t T
output <- setNames(data.frame(
t(unlist(data[!as.logical(seq_along(data)%%2)]))),
unlist(data[as.logical(seq_along(data)%%2)]))
# a b c d e f g h i j k l m n o p q r s t
# 1 A B C D E F G H I J K L M N O P Q R S T
You can try:
library(data.table); library(dplyr)
table1 <- df[, 1:2] %>%as.data.table() %>% dcast.data.table(.~data1, value.var = "data2")
table2 <- df[, 3:4] %>%as.data.table() %>% dcast.data.table(.~data3, value.var = "data4")
cbind(table1, table2)
and so on for the rest

Resources