Tidying up my data frame: moving columns to headers and data

Tidying up my data frame: moving columns to headers and data - r

I'm using a webscraper to scrape some data from FinViz. Here's an example
The problem is that the data frame is messy, the first column holds what I would ideally want as the headers and the second column holds the corresponding data. Here's an output:
data1 data2 data3 data4 data5 data6 data7 data8 data9 data10
1 Index S&P 500 P/E 36.13 EPS (ttm) 4.60 Insider Own 0.10% Shs Outstand 2.93B
2 Market Cap 487.15B Forward P/E 25.65 EPS next Y 6.48 Insider Trans -86.95% Shs Float 2.33B
3 Income 13.58B PEG 1.36 EPS next Q 1.27 Inst Own 72.50% Short Float 0.87%
4 Sales 33.17B P/S 14.69 EPS this Y 170.20% Inst Trans -0.22% Short Ratio 1.13
5 Book/sh 22.92 P/B 7.26 EPS next Y 21.63% ROA 20.30% Target Price 192.62
6 Cash/sh 12.10 P/C 13.74 EPS next 5Y 26.57% ROE 22.50% 52W Range 113.55 - 175.49
7 Dividend - P/FCF 34.05 EPS past 5Y 62.10% ROI 17.10% 52W High -5.23%
8 Dividend % - Quick Ratio 12.30 Sales past 5Y 49.40% Gross Margin 86.60% 52W Low 46.47%
9 Employees 20658 Current Ratio 12.30 Sales Q/Q 44.80% Oper. Margin 46.40% RSI (14) 49.05
10 Optionable Yes Debt/Eq 0.00 EPS Q/Q 68.80% Profit Margin 40.90% Rel Volume 0.70
11 Shortable Yes LT Debt/Eq 0.00 Earnings Jul 26 AMC Payout 0.00% Avg Volume 17.87M
12 Recom 1.70 SMA20 -1.84% SMA50 2.85% SMA200 17.52% Volume 12,583,873
As you can see, data1 contains the categories and data2 contains the following information.
Ideally I'd want it in this structure:
Index | Market Cap | Income | Sales | Book sh | ...
------------------------------------------------
S&P500 | 487.15B | 13.58B | 33.17B | 22.92 |
So that data1,3,5,7 were all the headers and data2,4,6,8 where all in one row.
Could anyone provide any input? I'm trying to avoid compiling them into 2 different vectors then rbinding the frame together.
Cheerio!

Here is a solution using some tidyverse packages and your dataset.
library(rvest) # for scrapping the data
#> Le chargement a nécessité le package : xml2
library(dplyr, warn.conflicts = F)
library(tidyr)
library(purrr, warn.conflict = F)
Fisrt, we get your data directly from your example url.
tab <- read_html("http://finviz.com/quote.ashx?t=BA") %>%
html_node("table.snapshot-table2") %>%
html_table(header = F) %>%
as_data_frame()
tab
#> # A tibble: 12 x 12
#> X1 X2 X3 X4 X5 X6
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Index DJIA S&P500 P/E 20.77 EPS (ttm) 11.42
#> 2 Market Cap 141.89B Forward P/E 22.14 EPS next Y 10.71
#> 3 Income 7.12B PEG 1.13 EPS next Q 2.62
#> 4 Sales 90.90B P/S 1.56 EPS this Y 2.30%
#> 5 Book/sh -3.34 P/B - EPS next Y 7.28%
#> 6 Cash/sh 17.26 P/C 13.74 EPS next 5Y 18.36%
#> 7 Dividend 5.68 P/FCF 17.94 EPS past 5Y 7.40%
#> 8 Dividend % 2.39% Quick Ratio 0.40 Sales past 5Y 6.60%
#> 9 Employees 150500 Current Ratio 1.20 Sales Q/Q -8.10%
#> 10 Optionable Yes Debt/Eq - EPS Q/Q 885.50%
#> 11 Shortable Yes LT Debt/Eq - Earnings Jul 26 BMO
#> 12 Recom 2.20 SMA20 -0.16% SMA50 8.14%
#> # ... with 6 more variables: X7 <chr>, X8 <chr>, X9 <chr>, X10 <chr>,
#> # X11 <chr>, X12 <chr>
As headers are in every odd column and data in every even column, we
create a tidy tibble of two columns by row binding the subsets. For
that, we generate odd and even index. Then,
purrr::map_dfr allows us to iterates over those 2 lists, applies a function and row bind the results. The function consist of selecting 2 columns with of the table[ ]and rename those two columns withset_names.
col_num <- seq_len(ncol(tab))
even <- col_num[col_num %% 2 == 0]
odd <- setdiff(col_num, even)
tab2 <- map2_dfr(odd, even, ~ set_names(tab[, c(.x, .y)], c("header", "value")))
tab2
#> # A tibble: 72 x 2
#> header value
#> <chr> <chr>
#> 1 Index DJIA S&P500
#> 2 Market Cap 141.89B
#> 3 Income 7.12B
#> 4 Sales 90.90B
#> 5 Book/sh -3.34
#> 6 Cash/sh 17.26
#> 7 Dividend 5.68
#> 8 Dividend % 2.39%
#> 9 Employees 150500
#> 10 Optionable Yes
#> # ... with 62 more rows
You have a nice 2 column long table with all your data. Now if you want
the table in wide format instead of long format, you have to transpose.
But first, we have to deal with some duplicates names in the header
column. You can't have duplicates column names.
tab2 %>%
filter(header == header[duplicated(header)])
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y 7.28%
We just rename the second occurence adding _2
tab3 <- tab2 %>%
mutate(header = case_when(
duplicated(header) ~ paste(header, 2, sep = "_"),
TRUE ~ header)
)
# No more duplicates
any(duplicated(tab3$header))
#> [1] FALSE
tab3 %>% filter(stringr::str_detect(header, "EPS next Y"))
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y_2 7.28%
You can pass in wide format and have 72 columns instead of 72 lines.
tab3 %>%
spread(header, value)
#> # A tibble: 1 x 72
#> `52W High` `52W Low` `52W Range` ATR `Avg Volume` Beta `Book/sh`
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -3.78% 87.78% 126.31 - 246.49 3.77 3.46M 1.18 -3.34
#> # ... with 65 more variables: `Cash/sh` <chr>, Change <chr>, `Current
#> # Ratio` <chr>, `Debt/Eq` <chr>, Dividend <chr>, `Dividend %` <chr>,
#> # Earnings <chr>, Employees <chr>, `EPS (ttm)` <chr>, `EPS next
#> # 5Y` <chr>, `EPS next Q` <chr>, `EPS next Y` <chr>, `EPS next
#> # Y_2` <chr>, `EPS past 5Y` <chr>, `EPS Q/Q` <chr>, `EPS this Y` <chr>,
#> # `Forward P/E` <chr>, `Gross Margin` <chr>, Income <chr>, Index <chr>,
#> # `Insider Own` <chr>, `Insider Trans` <chr>, `Inst Own` <chr>, `Inst
#> # Trans` <chr>, `LT Debt/Eq` <chr>, `Market Cap` <chr>, `Oper.
#> # Margin` <chr>, Optionable <chr>, `P/B` <chr>, `P/C` <chr>,
#> # `P/E` <chr>, `P/FCF` <chr>, `P/S` <chr>, Payout <chr>, PEG <chr>,
#> # `Perf Half Y` <chr>, `Perf Month` <chr>, `Perf Quarter` <chr>, `Perf
#> # Week` <chr>, `Perf Year` <chr>, `Perf YTD` <chr>, `Prev Close` <chr>,
#> # Price <chr>, `Profit Margin` <chr>, `Quick Ratio` <chr>, Recom <chr>,
#> # `Rel Volume` <chr>, ROA <chr>, ROE <chr>, ROI <chr>, `RSI (14)` <chr>,
#> # Sales <chr>, `Sales past 5Y` <chr>, `Sales Q/Q` <chr>, `Short
#> # Float` <chr>, `Short Ratio` <chr>, Shortable <chr>, `Shs Float` <chr>,
#> # `Shs Outstand` <chr>, SMA20 <chr>, SMA200 <chr>, SMA50 <chr>, `Target
#> # Price` <chr>, Volatility <chr>, Volume <chr>
Idea: You can also replace all the spaces by _ in the header column to have column names without spaces. Often simpler to handle.

Would this work ?
data <- data.frame(data1= letters[1:10],data2=LETTERS[1:10],data3= letters[11:20],data4=LETTERS[11:20],stringsAsFactors=F)
# data1 data2 data3 data4
# 1 a A k K
# 2 b B l L
# 3 c C m M
# 4 d D n N
# 5 e E o O
# 6 f F p P
# 7 g G q Q
# 8 h H r R
# 9 i I s S
# 10 j J t T
output <- setNames(data.frame(
t(unlist(data[!as.logical(seq_along(data)%%2)]))),
unlist(data[as.logical(seq_along(data)%%2)]))
# a b c d e f g h i j k l m n o p q r s t
# 1 A B C D E F G H I J K L M N O P Q R S T

You can try:
library(data.table); library(dplyr)
table1 <- df[, 1:2] %>%as.data.table() %>% dcast.data.table(.~data1, value.var = "data2")
table2 <- df[, 3:4] %>%as.data.table() %>% dcast.data.table(.~data3, value.var = "data4")
cbind(table1, table2)
and so on for the rest

Related

assign values inside a function based on specific columns in a data frame

I'm looking for direction on assigning a data frame column value to a specific place in a function and then looping or something to create a series of objects to be bound into a longer table.
example data
a = c("17","17","29")
b = c("133","163","055")
data.frame(a, b)
doing this manually...
library(zipcodeR)
T1 <- search_fips("17", "133")
T2 <- search_fips("17", "163")
T3 <- search_fips("29", "055")
TT <- list(T1, T2, T3)
CZ_zips <- rbindlist(TT, use.names=TRUE, fill=TRUE)
would like a to read a and b columns into a set place in function to create a series of vectors or data frames that can then be bound into one longer table.
the search_fips function pulls out of the census FIPS data, a = state and b = county. package is zipcodeR.

One simple way is to wrap the search_fips() function in a lapply function and rest stays the same.
library(zipcodeR)
a = c("17","17","29")
b = c("133","163","055")
df<-data.frame(a, b)
output <-lapply(1:nrow(df), function(i) {
search_fips(df$a[i], df$b[i])
})
answer <- dplyr::bind_rows(output)

here is a loop you might want to put in your function:
library(dplyr)
library(zipcodeR)
my_list <- list()
for (i in 1:nrow(df)) {
my_list[i] <- search_fips(df$a[i], df$b[i])
}
new_df <- bind_rows(my_list)
bind_rows(my_list)

Using rowwise
library(dplyr)
library(tidyr)
library(zipcodeR)
out <- df %>%
rowwise %>%
mutate(result = list(search_fips(a, b))) %>%
ungroup %>%
unnest(result)
-output
> head(out, 2)
# A tibble: 2 × 26
a b zipcode zipcode_type major_city post_office_city common_city_list county state lat lng timezone radius_in_miles area_code_list
<chr> <chr> <chr> <chr> <chr> <chr> <blob> <chr> <chr> <dbl> <dbl> <chr> <dbl> <blob>
1 17 133 62236 Standard Columbia Columbia, IL <raw 20 B> Monroe County IL 38.4 -90.2 Central 7 <raw 15 B>
2 17 133 62244 Standard Fults Fults, IL <raw 17 B> Monroe County IL 38.2 -90.2 Central 7 <raw 15 B>
# … with 12 more variables: population <int>, population_density <dbl>, land_area_in_sqmi <dbl>, water_area_in_sqmi <dbl>, housing_units <int>,
# occupied_housing_units <int>, median_home_value <int>, median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
# bounds_north <dbl>, bounds_south <dbl>
data
df <- structure(list(a = c("17", "17", "29"), b = c("133", "163", "055"
)), class = "data.frame", row.names = c(NA, -3L))

Here is a solution with Map.
The two one-liners below are equivalent, the first is probably more readable but the other one is simpler.
library(zipcodeR)
a <- c("17", "17", "29")
b <- c("133", "163", "055")
df <- data.frame(a, b)
Map(function(x, y) search_fips(x, y), df$a, df$b)
result <- Map(search_fips, df$a, df$b)
result <- dplyr::bind_rows(result)
head(result)
#> # A tibble: 6 x 24
#> zipcode zipcode_type major_city post_office_city common_city_list county state
#> <chr> <chr> <chr> <chr> <blob> <chr> <chr>
#> 1 62236 Standard Columbia Columbia, IL <raw 20 B> Monro~ IL
#> 2 62244 Standard Fults Fults, IL <raw 17 B> Monro~ IL
#> 3 62248 PO Box Hecker Hecker, IL <raw 18 B> Monro~ IL
#> 4 62256 PO Box Maeystown <NA> <raw 21 B> Monro~ IL
#> 5 62279 PO Box Renault Renault, IL <raw 19 B> Monro~ IL
#> 6 62295 Standard Valmeyer Valmeyer, IL <raw 20 B> Monro~ IL
#> # ... with 17 more variables: lat <dbl>, lng <dbl>, timezone <chr>,
#> # radius_in_miles <dbl>, area_code_list <blob>, population <int>,
#> # population_density <dbl>, land_area_in_sqmi <dbl>,
#> # water_area_in_sqmi <dbl>, housing_units <int>,
#> # occupied_housing_units <int>, median_home_value <int>,
#> # median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
#> # bounds_north <dbl>, bounds_south <dbl>
Created on 2022-02-18 by the reprex package (v2.0.1)

R task, web scraping

I share my solution for the task, however, I get an error and cannot find the reason. Anyone can help with it?
Data download 1.1 Collect links Data on the Stack Overflow user survey is available on the Stack Overflow website. Create a web scraper that collects the links to the survey files. Select only the links to the surveys from 2017 to 2021.
lst_nodes <- "https://insights.stackoverflow.com/survey/" %>%
read_html() %>%
html_nodes(".js-download-link")
lst_url <- lst_nodes[1:5] %>%
html_attr("href")
print(lst_url)
Complete the function to download the data files from the URLs that extracted.
fun_download <- function(url) {
year <- # extract year from url
zip_file <- paste0("file_", year, ".zip")
zip_dir <- paste0("dir_", year)
download.file(url, zip_file)
unzip(zip_file, exdir = zip_dir, files = "survey_results_public.csv")
out <- read_csv(file.path(zip_dir, "survey_results_public.csv"), col_types = cols(.default = "c")) %>%
mutate(Year = year, ResponseId = row_number())
return(out)
year <- sub(".*[^0-9]([0-9]+)\\.zip$", "\\1", lst_url)
}
Apply the function to the URLs that you extracted and generate a data frame that contains the data from all surveys.
Save the data frame. Note: The read_csv command in the function seems to keep the downloaded csv files locked after reading. So once you tried to open the csv files, you cannot delete them. To overcome this lock, restart the R session.
Best to save the data so that you have to run the download and importing only once.
alldf <- lapply(lst_url, fun_download)
That is all I did so far...but it seems something is wrong

My suggestion to use year <- sub(.) needs to be put in context of the function itself, using its url only. This works.
fun_download <- function(url) {
stopifnot(length(url) == 1L) # just a safeguard
year <- sub(".*[^0-9]([0-9]+)\\.zip$", "\\1", url)
zip_file <- paste0("file_", year, ".zip")
zip_dir <- paste0("dir_", year)
download.file(url, zip_file)
unzip(zip_file, exdir = zip_dir, files = "survey_results_public.csv")
out <- readr::read_csv(file.path(zip_dir, "survey_results_public.csv"), col_types = readr::cols(.default = "c")) %>%
mutate(
Year = year,
ResponseId = row_number()
)
return(out)
}
fun_download(lst_url[[1]])
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip'
# Content type 'application/zip' length 8825103 bytes (8.4 MB)
# downloaded 8.4 MB
# # A tibble: 83,439 x 49
# ResponseId MainBranch Employment Country US_State UK_Country EdLevel Age1stCode LearnCode YearsCode YearsCodePro DevType
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 I am a deve~ Independen~ Slovakia NA NA Seconda~ 18 - 24 y~ Coding Bo~ NA NA Develop~
# 2 2 I am a stud~ Student, f~ Netherl~ NA NA Bachelo~ 11 - 17 y~ Other onl~ 7 NA NA
# 3 3 I am not pr~ Student, f~ Russian~ NA NA Bachelo~ 11 - 17 y~ Other onl~ NA NA NA
# 4 4 I am a deve~ Employed f~ Austria NA NA Master?~ 11 - 17 y~ NA NA NA Develop~
# 5 5 I am a deve~ Independen~ United ~ NA England Master?~ 5 - 10 ye~ Friend or~ 17 10 Develop~
# 6 6 I am a stud~ Student, p~ United ~ Georgia NA Bachelo~ 11 - 17 y~ Other onl~ NA NA NA
# 7 7 I code prim~ I prefer n~ United ~ New Ham~ NA Seconda~ 11 - 17 y~ Other onl~ 3 NA NA
# 8 8 I am a stud~ Student, f~ Malaysia NA NA Bachelo~ 11 - 17 y~ School;On~ 4 NA NA
# 9 9 I am a deve~ Employed p~ India NA NA Bachelo~ 18 - 24 y~ Coding Bo~ 6 4 Develop~
# 10 10 I am a deve~ Employed f~ Sweden NA NA Master?~ 11 - 17 y~ School 7 4 Data sc~
# # ... with 83,429 more rows, and 37 more variables: OrgSize <chr>, Currency <chr>, CompTotal <chr>, CompFreq <chr>,
# # LanguageHaveWorkedWith <chr>, LanguageWantToWorkWith <chr>, DatabaseHaveWorkedWith <chr>, DatabaseWantToWorkWith <chr>,
# # PlatformHaveWorkedWith <chr>, PlatformWantToWorkWith <chr>, WebframeHaveWorkedWith <chr>, WebframeWantToWorkWith <chr>,
# # MiscTechHaveWorkedWith <chr>, MiscTechWantToWorkWith <chr>, ToolsTechHaveWorkedWith <chr>, ToolsTechWantToWorkWith <chr>,
# # NEWCollabToolsHaveWorkedWith <chr>, NEWCollabToolsWantToWorkWith <chr>, OpSys <chr>, NEWStuck <chr>, NEWSOSites <chr>,
# # SOVisitFreq <chr>, SOAccount <chr>, SOPartFreq <chr>, SOComm <chr>, NEWOtherComms <chr>, Age <chr>, Gender <chr>,
# # Trans <chr>, Sexuality <chr>, Ethnicity <chr>, Accessibility <chr>, MentalHealth <chr>, SurveyLength <chr>, ...
From here, use lapply(., fun_download) to produce a list of frames.
list_of_frames <- lapply(lst_url, fun_download)
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip'
# Content type 'application/zip' length 8825103 bytes (8.4 MB)
# downloaded 8.4 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2020.zip'
# Content type 'application/zip' length 9908290 bytes (9.4 MB)
# downloaded 9.4 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2019.zip'
# Content type 'application/zip' length 18681322 bytes (17.8 MB)
# downloaded 17.8 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2018.zip'
# Content type 'application/zip' length 20022841 bytes (19.1 MB)
# downloaded 19.1 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2017.zip'
# Content type 'application/zip' length 9576818 bytes (9.1 MB)
# downloaded 9.1 MB
And a terse summary to show what they hold:
lapply(list_of_frames, function(z) z[1:2, 1:4])
# [[1]]
# # A tibble: 2 x 4
# ResponseId MainBranch Employment Country
# <int> <chr> <chr> <chr>
# 1 1 I am a developer by profession Independent contractor, freelancer, or self-employed Slovakia
# 2 2 I am a student who is learning to code Student, full-time Netherlands
# [[2]]
# # A tibble: 2 x 4
# Respondent MainBranch Hobbyist Age
# <chr> <chr> <chr> <chr>
# 1 1 I am a developer by profession Yes NA
# 2 2 I am a developer by profession No NA
# [[3]]
# # A tibble: 2 x 4
# Respondent MainBranch Hobbyist OpenSourcer
# <chr> <chr> <chr> <chr>
# 1 1 I am a student who is learning to code Yes Never
# 2 2 I am a student who is learning to code No Less than once per year
# [[4]]
# # A tibble: 2 x 4
# Respondent Hobby OpenSource Country
# <chr> <chr> <chr> <chr>
# 1 1 Yes No Kenya
# 2 3 Yes Yes United Kingdom
# [[5]]
# # A tibble: 2 x 4
# Respondent Professional ProgramHobby Country
# <chr> <chr> <chr> <chr>
# 1 1 Student Yes, both United States
# 2 2 Student Yes, both United Kingdom
If you need to assign names (such as the URL used to derive each dataset), then perhaps this, which adds a $url field to each frame.
list_of_frames <- Map(function(x, u) transform(x, url = u), list_of_frames, lst_url)
Data
library(rvest)
lst_nodes <- read_html("https://insights.stackoverflow.com/survey/") %>%
html_nodes(".js-download-link")
lst_url <- html_attr(lst_nodes [1:5], "href")
lst_url
# [1] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip"
# [2] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2020.zip"
# [3] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2019.zip"
# [4] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2018.zip"
# [5] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2017.zip"

Apply a function to every case of a tibble

My second participation here, in Stackoverflow.
I have a function called bw_test with several args like this:
bw_test <- function(localip, remoteip, localspeed, remotespeed , duracion =30,direction ="both"){
comando <- str_c("ssh usuario#", localip ," /tool bandwidth-test direction=", direction," remote-tx-speed=",remotespeed,"M local-tx-speed=",localspeed,"M protocol=udp user=usuario password=mipasso duration=",duracion," ",remoteip)
resultado <- system(comando,intern = T,ignore.stderr = T)
# resultado pull from a ssh server a vector like this:
# head(resultado)
#[1] " status: connecting\r" " tx-current: #0bps\r" " tx-10-second-average: 0bps\r"
#[4] " tx-total-average: 0bps\r" " rx-current: #0bps\r" " rx-10-second-average: 0bps\r"
resultado %<>%
replace("\r","") %>%
tail(17) %>%
trimws("both") %>%
as_tibble %>%
mutate(local=localip, remote=remoteip) %>%
separate(value,sep=":", into=c("parametro","valor")) %>%
head(15)
resultado$valor %<>%
trimws() %>%
str_replace("Mbps","") %>% str_replace("%","") %>% str_replace("s","")
resultado %<>%
spread(parametro,valor)
resultado %<>%
mutate(`tx-percentaje`=as.numeric(resultado$`tx-total-average`)/localspeed) %>%
mutate(`rx-percentaje`=as.numeric(resultado$`rx-total-average`)/remotespeed)
return(resultado)
}
this function returns a tibble like this one:
A tibble: 1 x 19
local remote `connection-cou… direction duration `local-cpu-load` `lost-packets` `random-data` `remote-cpu-loa…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.… 192.1… 1 both 4 13 0 no 12
# … with 10 more variables: `rx-10-second-average` <chr>, `rx-current` <chr>, `rx-size` <chr>,
# `rx-total-average` <chr>, `tx-10-second-average` <chr>, `tx-current` <chr>, `tx-size` <chr>,
# `tx-total-average` <chr>, `tx-percentaje` <dbl>, `rx-percentaje` <dbl>
So, when I call the function inside rbind, got the result of every run on a tibble:
rbind(bw_test("192.168.105.10" ,"192.168.105.18", 75,125),
bw_test("192.168.133.11","192.168.133.9", 5 ,50),
bw_test("192.168.254.251","192.168.254.250", 25,150))
My results are for the example:
# A tibble: 3 x 19
local remote `connection-cou… direction duration `local-cpu-load` `lost-packets` `random-data` `remote-cpu-loa…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.… 192.1… 20 both 28 63 232 no 48
2 192.… 192.1… 20 both 29 4 0 no 20
3 192.… 192.1… 20 both 29 15 0 no 22
# … with 10 more variables: `rx-10-second-average` <chr>, `rx-current` <chr>, `rx-size` <chr>,
# `rx-total-average` <chr>, `tx-10-second-average` <chr>, `tx-current` <chr>, `tx-size` <chr>,
# `tx-total-average` <chr>, `tx-percentaje` <dbl>, `rx-percentaje` <dbl>
My problem is to apply the function to the cases of a tibble like like this.
aps <- tribble(
~name, ~ip, ~remoteip , ~bw_test, ~localspeed,~remotespeed,
"backbone_border_core","192.168.253.1", "192.168.253.3", 1,200,200,
"backbone_2_site2","192.168.254.251", "192.168.254.250", 1, 25,150
}
I was trying to use map, but i got:
map(c(aps$ip,aps$remoteip,aps$localspeed,aps$remotespeed), bw_test)
el argumento "remotespeed" está ausente, sin valor por omisión
I believe cause c(aps$ip,aps$remoteip,aps$localspeed,aps$remotespeed) feeds first all cases of aps$ip, later all of aps$remoteip and so on.
I'm using the right strategie? it's map a suitable way
What i'm doing wrong?
¿how can I apply function to every row to get the requested tibble?
I'll appreciate your kindly help.
Greets.

Try using pmap_df.
output <- purrr::pmap_df(list(aps$ip, aps$remoteip, aps$localspeed,
aps$remotespeed), bw_test)

reading zip file directly with read_csv from readr producing weird results

I'm trying to read directly from a URL to grab a zip file that contains a pipe delimited text file. If I download the file, then use read_csv to read it from disk, I have no problems. But if I try to use read_csv to read the URL directly I get garbage in my resulting df. I can work around this by coding in a download then read. But it seems like it should work directly. Any clues on what's going on here?
library(readr)
url <- "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip"
df <- read_delim(url, delim='|',
col_names = c('year','stFips','stAbbr','coFips','coName',
'cropCd','cropName','planCd','planAbbr','coverCat',
'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
'liab','prem','subsidy','indem', 'lossRatio'))
#> Parsed with column specification:
#> cols(
#> .default = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 7908 parsing failures.
#> row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 year "" embedded null 'https://www.rma.usda.gov/data/sob… file 2 1 <NA> 25 columns 1 columns 'https://www.rma.usda.gov/data/sob… row 3 2 <NA> 25 columns 4 columns 'https://www.rma.usda.gov/data/sob… col 4 3 <NA> 25 columns 2 columns 'https://www.rma.usda.gov/data/sob… expected 5 4 year "" embedded null 'https://www.rma.usda.gov/data/sob…
#> ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
#> See problems(...) for more details.
head(df)
#> # A tibble: 6 x 25
#> year stFips stAbbr coFips coName cropCd cropName planCd planAbbr
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "PK\u00… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 "K\xe6\… "\xf5\x… "\xc5\… "\xfa\… <NA> <NA> <NA> <NA> <NA>
#> 3 "\xb0\x… "\xfd\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 "j`/Q\x… "\x96\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 "\xc0\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 "z\xe4\… "~y\xf5… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> # ... with 16 more variables: coverCat <chr>, deliveryType <chr>,
#> # covLevel <chr>, policyCount <chr>, policyPremCount <chr>,
#> # policyIndemCount <chr>, unitsReportingPrem <chr>, indemCount <chr>,
#> # quantType <chr>, quantNet <chr>, companionAcres <chr>, liab <chr>,
#> # prem <chr>, subsidy <chr>, indem <chr>, lossRatio <chr>
If I download first, I get the following output:
> url <- './data/sobcov_2018.zip'
> df <- read_delim(url, delim='|',
+ col_names = c('year','stFips','stAbbr','coFips','coName',
+ 'cropCd','cropName','planCd','planAbbr','coverCat',
+ 'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
+ 'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
+ 'liab','prem','subsidy','indem', 'lossRatio'))
Parsed with column specification:
cols(
.default = col_integer(),
stFips = col_character(),
stAbbr = col_character(),
coFips = col_character(),
coName = col_character(),
cropCd = col_character(),
cropName = col_character(),
planCd = col_character(),
planAbbr = col_character(),
coverCat = col_character(),
deliveryType = col_character(),
covLevel = col_double(),
quantType = col_character(),
lossRatio = col_double()
)
See spec(...) for full column specifications.
> head(df)
# A tibble: 6 x 25
year stFips stAbbr coFips coName cropCd cropName planCd planAbbr coverCat deliveryType covLevel
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2018 02 AK 999 "All Other … 9999 "All Other C… 01 "YP … "A " RBUP 0.500
2 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "A " RBUP 0.500
3 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "A " RBUP 0.750
4 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "C " RCAT 0.500
5 2018 02 AK 240 "Southeast … 9999 "All Other C… 02 "RP … "A " RBUP 0.600
6 2018 02 AK 240 "Southeast … 9999 "All Other C… 02 "RP … "A " RBUP 0.750
# ... with 13 more variables: policyCount <int>, policyPremCount <int>, policyIndemCount <int>,
# unitsReportingPrem <int>, indemCount <int>, quantType <chr>, quantNet <int>, companionAcres <int>,
# liab <int>, prem <int>, subsidy <int>, indem <int>, lossRatio <dbl>
>

readr can handle only gz compressed files as remote sources, since there are no analogues to base::gzcon() for other compression algorithms. See this github issue for a discussion and the improved documentation (also in ?readr::datasource).

Format a tbl within a dplyr chain

I am trying to add commas for thousands in my data e.g. 10,000 along with dollars e.g. $10,000.
I'm using several dplyr commands along with tidyr gather and spread functions. Here's what I tried:
Cut n paste this code block to generate the random data "dataset" I'm working with:
library(dplyr)
library(tidyr)
library(lubridate)
## Generate some data
channels <- c("Facebook", "Youtube", "SEM", "Organic", "Direct", "Email")
last_month <- Sys.Date() %m+% months(-1) %>% floor_date("month")
mts <- seq(from = last_month %m+% months(-23), to = last_month, by = "1 month") %>% as.Date()
dimvars <- expand.grid(Month = mts, Channel = channels, stringsAsFactors = FALSE)
# metrics
rws <- nrow(dimvars)
set.seed(42)
# generates variablility in the random data
randwalk <- function(initial_val, ...){
initial_val + cumsum(rnorm(...))
}
Sessions <- ceiling(randwalk(3000, n = rws, mean = 8, sd = 1500)) %>% abs()
Revenue <- ceiling(randwalk(10000, n = rws, mean = 0, sd = 3500)) %>% abs()
# make primary df
dataset <- cbind(dimvars, Revenue)
Which looks like:
> tbl_df(dataset)
# A tibble: 144 × 3
Month Channel Revenue
<date> <chr> <dbl>
1 2015-06-01 Facebook 8552
2 2015-07-01 Facebook 12449
3 2015-08-01 Facebook 10765
4 2015-09-01 Facebook 9249
5 2015-10-01 Facebook 11688
6 2015-11-01 Facebook 7991
7 2015-12-01 Facebook 7849
8 2016-01-01 Facebook 2418
9 2016-02-01 Facebook 6503
10 2016-03-01 Facebook 5545
# ... with 134 more rows
Now I want to spread the months into columns to show revenue trend by channel, month over month. I can do that like so:
revenueTable <- dataset %>% select(Month, Channel, Revenue) %>%
group_by(Month, Channel) %>%
summarise(Revenue = sum(Revenue)) %>%
#mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
gather(Key, Value, -Channel, -Month) %>%
spread(Month, Value) %>%
select(-Key)
And it looks almost exactly as I want:
> revenueTable
# A tibble: 6 × 25
Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01` `2015-10-01` `2015-11-01` `2015-12-01` `2016-01-01` `2016-02-01` `2016-03-01` `2016-04-01`
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Direct 11910 8417 4012 359 4473 2702 6261 6167 8630 5230 1394
2 Email 7244 3517 671 1339 10788 10575 8567 8406 7856 6345 7733
3 Facebook 8552 12449 10765 9249 11688 7991 7849 2418 6503 5545 3908
4 Organic 4191 978 219 4274 2924 4155 5981 9719 8220 8829 7024
5 SEM 2344 6873 10230 6429 5016 2964 3390 3841 3163 1994 2105
6 Youtube 186 2949 2144 5073 1035 4878 7905 7377 2305 4556 6247
# ... with 13 more variables: `2016-05-01` <dbl>, `2016-06-01` <dbl>, `2016-07-01` <dbl>, `2016-08-01` <dbl>, `2016-09-01` <dbl>, `2016-10-01` <dbl>,
# `2016-11-01` <dbl>, `2016-12-01` <dbl>, `2017-01-01` <dbl>, `2017-02-01` <dbl>, `2017-03-01` <dbl>, `2017-04-01` <dbl>, `2017-05-01` <dbl>
Now the part I'm struggling with. I would like to format the data as currency. I tried adding this inbetween summarise() and gather() within the chain:
mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
This half works. The dollar sign is prepended but the comma separators do not show. I tried removing the paste0("$" part to see if I could get the comma formatting to work with no success.
How can I format my tbl as a currency with dollars and commas, rounded to nearest whole dollars (no $1.99, just $2)?

I think you can just do this at the end with dplyr::mutate_at().
revenueTable %>% mutate_at(vars(-Channel), funs(. %>% round(0) %>% scales::dollar()))
#> # A tibble: 6 x 25
#> Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Direct $11,910 $8,417 $4,012 $359
#> 2 Email $7,244 $3,517 $671 $1,339
#> 3 Facebook $8,552 $12,449 $10,765 $9,249
#> 4 Organic $4,191 $978 $219 $4,274
#> 5 SEM $2,344 $6,873 $10,230 $6,429
#> 6 Youtube $186 $2,949 $2,144 $5,073
#> # ... with 20 more variables: `2015-10-01` <chr>, `2015-11-01` <chr>,
#> # `2015-12-01` <chr>, `2016-01-01` <chr>, `2016-02-01` <chr>,
#> # `2016-03-01` <chr>, `2016-04-01` <chr>, `2016-05-01` <chr>,
#> # `2016-06-01` <chr>, `2016-07-01` <chr>, `2016-08-01` <chr>,
#> # `2016-09-01` <chr>, `2016-10-01` <chr>, `2016-11-01` <chr>,
#> # `2016-12-01` <chr>, `2017-01-01` <chr>, `2017-02-01` <chr>,
#> # `2017-03-01` <chr>, `2017-04-01` <chr>, `2017-05-01` <chr>

We can use data.table
library(data.table)
nm1 <- setdiff(names(revenueTable), 'Channel')
setDT(revenueTable)[, (nm1) := lapply(.SD, function(x)
scales::dollar(round(x))), .SDcols = nm1]
revenueTable[, 1:3, with = FALSE]
# Channel `2015-06-01` `2015-07-01`
#1: Direct $11,910 $8,417
#2: Email $7,244 $3,517
#3: Facebook $8,552 $12,449
#4: Organic $4,191 $978
#5: SEM $2,344 $6,873
#6: Youtube $186 $2,949

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Tidying up my data frame: moving columns to headers and data - r

You can try: library(data.table); library(dplyr) table1 <- df[, 1:2] %>%as.data.table() %>% dcast.data.table(.~data1, value.var = "data2") table2 <- df[, 3:4] %>%as.data.table() %>% dcast.data.table(.~data3, value.var = "data4") cbind(table1, table2) and so on for the rest

Related

assign values inside a function based on specific columns in a data frame

R task, web scraping

Apply a function to every case of a tibble

reading zip file directly with read_csv from readr producing weird results

Format a tbl within a dplyr chain

Categories

Resources