I am trying to webscrape the table from this following page (https://www.coya.com/bike/fahrrad-index-2019), namely the values the bike index for 50 german cities (if u click "Alle Ergebnisse +", you ll see all 50 cities.
I need especially some columns ("Bewertung spezielle Radwege & Qualität der Radwege", "Investitionen & QUalität der Infrastruktur", "Bewertung der Infrastruktur", "Fahrradsharing-Score", "Autofreier Tag", "Critical-Mass-Fahrrad-aktionen, "Event-Score).
This is what I tried:
library(rvest)
num_link="https://www.coya.com/bike/fahrrad-index-2019"
num_page= read_html(num_link)
xyc= num_page %>% html_nodes("._1200:nth-child(2)") %>% html_text()
I tried Selectorgadget, unfortunately I get all the values of the table in a long String (str_split is challenging, because commas in numbers got mixed with commas between the numbers:
"[1] "Ergebnisse für DeutschlandKriminalitätInfrastrukturFahrrad-SharingEvents#StadtLandSizeTotal Score1OldenburgDeutschlandK57,90,4271,94588,3594,4684,5227,153,0590,3454,1836,4515,0525,75N31,5216,2669,122MünsterDeutschlandK58,740,3910,53445,5883,0488,4328,1551,2388,0453,0535,522630,76N23,8412,4265,933Freiburg i. Breisg.DeutschlandK59,350,"
Could someone help me scraping the table, if possible, especially only some values of specific columns (see above)? Very thankful for any help/tip.
Thank you in advance.
(I am a newbie, please be gentle.)
Here is one way to solve the puzzle. Though the row names use a lot of icons so I just leave empty column name. You can create a vector names and assign them manually using
names(table_content) <- names_vector
Here is the code
library(rvest)
#> Loading required package: xml2
library(dplyr, warn.conflicts = FALSE)
library(purrr)
# Here is just reuse your code
num_link <- "https://www.coya.com/bike/fahrrad-index-2019"
num_page <- read_html(num_link)
# Extract the items from your code but go further down
table_content <- num_page %>% html_nodes("._1200:nth-child(2)") %>%
# Extract the node that contain the table
html_nodes(css = ".w-dyn-list") %>%
# Extract the nodes corresponded to each row
html_nodes(css = ".bike-collection-item") %>%
# Then map the function that take each rows in and convert them to a table
# and bind them together into one table
map_dfr(function(x) {
# suppress the message due to no column name was feed into map_dfc
suppressMessages(
x %>% html_nodes(".td") %>%
map_dfc(function(x) { x %>% html_text })
)
})
Here is the extracted content
#> # A tibble: 70 x 21
#> ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11 ...12 ...13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Olde… Deut… K 57,9 0,427 1,94 588,… 94,46 84,52 27,1 53,05 90,34
#> 2 2 Müns… Deut… K 58,74 0,391 0,53 445,… 83,04 88,43 28,15 51,23 88,04
#> 3 3 Frei… Deut… K 59,35 0,34 2,27 962,… 88,87 77,52 32,57 48,11 93,49
#> 4 4 Bamb… Deut… K 55,59 0,302 0 456,… 89,04 92,66 30,29 47,74 93,75
#> 5 5 Gött… Deut… K 62,66 0,28 3,07 379,… 92,8 80,99 23,03 48,07 89,18
#> 6 6 Heid… Deut… K 63,14 0,22 1,21 394,… 90,39 88,33 29,02 47,88 94,21
#> 7 7 Karl… Deut… K 57,39 0,25 4,23 725,… 90,35 71,62 18,75 46,33 93,93
#> 8 8 Brau… Deut… K 67,36 0,21 0 522,… 85,89 90,97 20,55 49,2 89,78
#> 9 9 Kons… Deut… K 62,77 0,22 4,6 121,… 93,62 76,98 23 48,49 94,09
#> 10 10 Brem… Deut… M 58,86 0,21 1,38 334,… 87,34 87,15 18,64 59,78 94,64
#> # … with 60 more rows, and 8 more variables: ...14 <chr>, ...15 <chr>,
#> # ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>, ...20 <chr>,
#> # ...21 <chr>
Created on 2021-04-08 by the reprex package (v1.0.0)
You probably want to do one table at a time and then the columns one at a time, so you can then create a data frame. Try this for example:
col1 <- num_page %>% html_nodes(paste0(".w-dyn-item :nth-child(2) div")) %>%
html_text()
The selector gadget is nifty but I usually need to experiment a lot to get the right selectors.
Related
I'm trying to scrape all the data from this website. There are icons over some of the competitors names indicating that the person was disqualified for being a 'no-show'.
I would like create a data frame with all the competitors while also specifying who was disqualified, but I'm running into two issues:
(1) trying to add the disclaimer next to the persons name produces the error cannot coerce class ‘"xml_nodeset"’ to a data.frame.
(2) trying to extract the text from just the icon (and not the competitor names) produces a blank data frame.
library(rvest); library(tidyverse)
html = read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')
dq = data.frame(winner = html %>%
html_nodes('.match-card__competitor--red') %>%
html_text(trim = TRUE),
opponent = html %>%
html_nodes('hr+ .match-card__competitor'),
dq = html %>%
html_nodes('.match-card__disqualification') %>%
html_text())
This approach generally works only on tabular data where you can be sure that the number of matches for each of those selectors are constant and order is also fixed. In your example you have:
127 matches for .match-card__competitor--red
127 matches for hr+ .match-card__competitor
14 matches for .match-card__disqualification (you get no results for this because you should use html_attr("title") for title attribute instead of html_text())
Basically you are trying to combine columns of different lengths into the same dataframe. Even if it would work, you'd just add DSQ for 14 first matches.
As you'd probably want to keep information about matched, participants, results and disqualifications instead of just having a list of participants, I'd suggest to work with a list of match cards, i.e. extract all required information from a single card while not breaking relations and then move to the next card.
My purrr is far from perfect, but perhaps something like this:
library(rvest)
library(magrittr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
# helpers -----------------------------------------------------------------
# to keep matches with details (when/where) in header
is_valid_match <- function(element){
return(length(html_elements(element, ".bracket-match-header")) > 0)
}
# detect winner
is_winner <- function(element){
return(length(html_elements(element, ".match-competitor--loser")) < 1 )
}
# extract data from competitor sections
comp_details <- function(comp_card, prefix="_"){
l = lst()
l[paste(prefix, "n", sep = "")] <- comp_card %>% html_element(".match-card__competitor-n") %>% html_text()
l[paste(prefix, "name", sep = "")] <- comp_card %>% html_element(".match-card__competitor-name") %>% html_text()
l[paste(prefix, "club", sep = "")] <- comp_card %>% html_element(".match-card__club-name") %>% html_text()
l[paste(prefix, "dq", sep = "")] <- comp_card %>% html_element(".match-card__disqualification") %>% html_attr("title")
l[paste(prefix, "won", sep = "")] <- comp_card %>% html_element(".match-competitor--loser") %>% length() == 0
return(l)
}
# scrape & process --------------------------------------------------------
html <- read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')
html %>%
# collect all match cards
html_elements("div.tournament-category__match") %>%
keep(is_valid_match) %>%
# apply anonymous function to every item in the list of match cards
map(function(match_card){
match_id <- match_card %>% html_element(".tournament-category__match-card") %>% html_attr("id")
where <- match_card %>% html_element(".bracket-match-header__where") %>% html_text()
when <- match_card %>% html_element(".bracket-match-header__when") %>% html_text()
competitors <- html_nodes(match_card, ".match-card__competitor")
# extract competitior data
comp01 <- competitors[[1]] %>% comp_details(prefix = "comp01_")
comp02 <- competitors[[2]] %>% comp_details(prefix = "comp02_")
winner_idx <- competitors %>% detect_index(is_winner)
# lst for creating a named list
l <- lst(match_id, where, when, winner_idx)
# combine all items and comp lists into single list
l <- c(l,comp01, comp02)
return(l)
}) %>%
# each resulting list item into single-row tibble
map(as_tibble) %>%
# reduce list of tibbles into single tibble
reduce(bind_rows)
Result:
#> # A tibble: 65 × 14
#> match_id where when winne…¹ comp0…² comp0…³ comp0…⁴ comp0…⁵ comp0…⁶ comp0…⁷
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lgl> <chr>
#> 1 match-1-1 FIGH… Sat … 2 58 Christ… Rodrig… <NA> FALSE 66
#> 2 match-1-9 FIGH… Sat … 2 6 Melvin… GF Team Disqua… FALSE 66
#> 3 match-1-… FIGH… Sat … 2 47 Eric R… Atos J… <NA> FALSE 66
#> 4 match-1-… FIGH… Sat … 1 47 Eric R… Atos J… <NA> TRUE 10
#> 5 match-1-… FIGH… Sat … 2 42 Ivan M… CheckM… <NA> FALSE 66
#> 6 match-1-… FIGH… Sat … 2 18 Joel S… Gracie… <NA> FALSE 47
#> 7 match-1-… FIGH… Sat … 1 42 Ivan M… CheckM… <NA> TRUE 26
#> 8 match-1-… FIGH… Sat … 2 34 Matthe… Super … <NA> FALSE 18
#> 9 match-2-9 FIGH… Sat … 1 62 Bryan … Team J… <NA> TRUE 4
#> 10 match-2-… FIGH… Sat … 2 22 Steffe… Six Bl… <NA> FALSE 30
#> # … with 55 more rows, 4 more variables: comp02_name <chr>, comp02_club <chr>,
#> # comp02_dq <chr>, comp02_won <lgl>, and abbreviated variable names
#> # ¹winner_idx, ²comp01_n, ³comp01_name, ⁴comp01_club, ⁵comp01_dq,
#> # ⁶comp01_won, ⁷comp02_n
Created on 2022-09-19 with reprex v2.0.2
Also note that not all matches have a winner and both participants can be disqualified (screenshot), so splitting them to winners & opponents might not be optimal.
I'm looking for direction on assigning a data frame column value to a specific place in a function and then looping or something to create a series of objects to be bound into a longer table.
example data
a = c("17","17","29")
b = c("133","163","055")
data.frame(a, b)
doing this manually...
library(zipcodeR)
T1 <- search_fips("17", "133")
T2 <- search_fips("17", "163")
T3 <- search_fips("29", "055")
TT <- list(T1, T2, T3)
CZ_zips <- rbindlist(TT, use.names=TRUE, fill=TRUE)
would like a to read a and b columns into a set place in function to create a series of vectors or data frames that can then be bound into one longer table.
the search_fips function pulls out of the census FIPS data, a = state and b = county. package is zipcodeR.
One simple way is to wrap the search_fips() function in a lapply function and rest stays the same.
library(zipcodeR)
a = c("17","17","29")
b = c("133","163","055")
df<-data.frame(a, b)
output <-lapply(1:nrow(df), function(i) {
search_fips(df$a[i], df$b[i])
})
answer <- dplyr::bind_rows(output)
here is a loop you might want to put in your function:
library(dplyr)
library(zipcodeR)
my_list <- list()
for (i in 1:nrow(df)) {
my_list[i] <- search_fips(df$a[i], df$b[i])
}
new_df <- bind_rows(my_list)
bind_rows(my_list)
Using rowwise
library(dplyr)
library(tidyr)
library(zipcodeR)
out <- df %>%
rowwise %>%
mutate(result = list(search_fips(a, b))) %>%
ungroup %>%
unnest(result)
-output
> head(out, 2)
# A tibble: 2 × 26
a b zipcode zipcode_type major_city post_office_city common_city_list county state lat lng timezone radius_in_miles area_code_list
<chr> <chr> <chr> <chr> <chr> <chr> <blob> <chr> <chr> <dbl> <dbl> <chr> <dbl> <blob>
1 17 133 62236 Standard Columbia Columbia, IL <raw 20 B> Monroe County IL 38.4 -90.2 Central 7 <raw 15 B>
2 17 133 62244 Standard Fults Fults, IL <raw 17 B> Monroe County IL 38.2 -90.2 Central 7 <raw 15 B>
# … with 12 more variables: population <int>, population_density <dbl>, land_area_in_sqmi <dbl>, water_area_in_sqmi <dbl>, housing_units <int>,
# occupied_housing_units <int>, median_home_value <int>, median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
# bounds_north <dbl>, bounds_south <dbl>
data
df <- structure(list(a = c("17", "17", "29"), b = c("133", "163", "055"
)), class = "data.frame", row.names = c(NA, -3L))
Here is a solution with Map.
The two one-liners below are equivalent, the first is probably more readable but the other one is simpler.
library(zipcodeR)
a <- c("17", "17", "29")
b <- c("133", "163", "055")
df <- data.frame(a, b)
Map(function(x, y) search_fips(x, y), df$a, df$b)
result <- Map(search_fips, df$a, df$b)
result <- dplyr::bind_rows(result)
head(result)
#> # A tibble: 6 x 24
#> zipcode zipcode_type major_city post_office_city common_city_list county state
#> <chr> <chr> <chr> <chr> <blob> <chr> <chr>
#> 1 62236 Standard Columbia Columbia, IL <raw 20 B> Monro~ IL
#> 2 62244 Standard Fults Fults, IL <raw 17 B> Monro~ IL
#> 3 62248 PO Box Hecker Hecker, IL <raw 18 B> Monro~ IL
#> 4 62256 PO Box Maeystown <NA> <raw 21 B> Monro~ IL
#> 5 62279 PO Box Renault Renault, IL <raw 19 B> Monro~ IL
#> 6 62295 Standard Valmeyer Valmeyer, IL <raw 20 B> Monro~ IL
#> # ... with 17 more variables: lat <dbl>, lng <dbl>, timezone <chr>,
#> # radius_in_miles <dbl>, area_code_list <blob>, population <int>,
#> # population_density <dbl>, land_area_in_sqmi <dbl>,
#> # water_area_in_sqmi <dbl>, housing_units <int>,
#> # occupied_housing_units <int>, median_home_value <int>,
#> # median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
#> # bounds_north <dbl>, bounds_south <dbl>
Created on 2022-02-18 by the reprex package (v2.0.1)
I was wondering if anyone had useful ideas or code for web scraping tables from Wikipedia.
Specifically, I'm interested in the Presidential election results table in the "Results by county" section on Wikipedia.
An example table can be found using the following link and scrolling down to the "Results by county" section: https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas
The table looks like this:
I've tried some solutions from the following StackOverflow post: Importing wikipedia tables in R
However, they don't appear to be appliable to the type of table I want to scrape in Wikipedia.
Any advice, solutions, or code would be greatly appreciated. Thank you!
Making use of the rvest package you could get the table by first selecting the element containing the desired table via html_element("table.wikitable.sortable") and then extracting the table via html_table() like so:
library(rvest)
url <- "https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas"
html <- read_html(url)
county_table <- html %>%
html_element("table.wikitable.sortable") %>%
html_table()
head(county_table)
#> # A tibble: 6 x 14
#> County `Harry S. Truman… `Harry S. Truman… `Thomas E. Dewey… `Thomas E. Dewe…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 County # % # %
#> 2 Anders… 3,242 62.37% 1,199 23.07%
#> 3 Andrews 816 85.27% 101 10.55%
#> 4 Angeli… 4,377 69.05% 1,000 15.78%
#> 5 Aransas 418 61.02% 235 34.31%
#> 6 Archer 1,599 86.20% 191 10.30%
#> # … with 9 more variables: Strom ThurmondStates’ Rights Democratic <chr>,
#> # Strom ThurmondStates’ Rights Democratic.1 <chr>,
#> # Henry A. WallaceProgressive <chr>, Henry A. WallaceProgressive.1 <chr>,
#> # Various candidatesOther parties <chr>,
#> # Various candidatesOther parties.1 <chr>, Margin <chr>, Margin.1 <chr>,
#> # Total votes cast[11] <chr>
My second participation here, in Stackoverflow.
I have a function called bw_test with several args like this:
bw_test <- function(localip, remoteip, localspeed, remotespeed , duracion =30,direction ="both"){
comando <- str_c("ssh usuario#", localip ," /tool bandwidth-test direction=", direction," remote-tx-speed=",remotespeed,"M local-tx-speed=",localspeed,"M protocol=udp user=usuario password=mipasso duration=",duracion," ",remoteip)
resultado <- system(comando,intern = T,ignore.stderr = T)
# resultado pull from a ssh server a vector like this:
# head(resultado)
#[1] " status: connecting\r" " tx-current: #0bps\r" " tx-10-second-average: 0bps\r"
#[4] " tx-total-average: 0bps\r" " rx-current: #0bps\r" " rx-10-second-average: 0bps\r"
resultado %<>%
replace("\r","") %>%
tail(17) %>%
trimws("both") %>%
as_tibble %>%
mutate(local=localip, remote=remoteip) %>%
separate(value,sep=":", into=c("parametro","valor")) %>%
head(15)
resultado$valor %<>%
trimws() %>%
str_replace("Mbps","") %>% str_replace("%","") %>% str_replace("s","")
resultado %<>%
spread(parametro,valor)
resultado %<>%
mutate(`tx-percentaje`=as.numeric(resultado$`tx-total-average`)/localspeed) %>%
mutate(`rx-percentaje`=as.numeric(resultado$`rx-total-average`)/remotespeed)
return(resultado)
}
this function returns a tibble like this one:
A tibble: 1 x 19
local remote `connection-cou… direction duration `local-cpu-load` `lost-packets` `random-data` `remote-cpu-loa…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.… 192.1… 1 both 4 13 0 no 12
# … with 10 more variables: `rx-10-second-average` <chr>, `rx-current` <chr>, `rx-size` <chr>,
# `rx-total-average` <chr>, `tx-10-second-average` <chr>, `tx-current` <chr>, `tx-size` <chr>,
# `tx-total-average` <chr>, `tx-percentaje` <dbl>, `rx-percentaje` <dbl>
So, when I call the function inside rbind, got the result of every run on a tibble:
rbind(bw_test("192.168.105.10" ,"192.168.105.18", 75,125),
bw_test("192.168.133.11","192.168.133.9", 5 ,50),
bw_test("192.168.254.251","192.168.254.250", 25,150))
My results are for the example:
# A tibble: 3 x 19
local remote `connection-cou… direction duration `local-cpu-load` `lost-packets` `random-data` `remote-cpu-loa…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.… 192.1… 20 both 28 63 232 no 48
2 192.… 192.1… 20 both 29 4 0 no 20
3 192.… 192.1… 20 both 29 15 0 no 22
# … with 10 more variables: `rx-10-second-average` <chr>, `rx-current` <chr>, `rx-size` <chr>,
# `rx-total-average` <chr>, `tx-10-second-average` <chr>, `tx-current` <chr>, `tx-size` <chr>,
# `tx-total-average` <chr>, `tx-percentaje` <dbl>, `rx-percentaje` <dbl>
My problem is to apply the function to the cases of a tibble like like this.
aps <- tribble(
~name, ~ip, ~remoteip , ~bw_test, ~localspeed,~remotespeed,
"backbone_border_core","192.168.253.1", "192.168.253.3", 1,200,200,
"backbone_2_site2","192.168.254.251", "192.168.254.250", 1, 25,150
}
I was trying to use map, but i got:
map(c(aps$ip,aps$remoteip,aps$localspeed,aps$remotespeed), bw_test)
el argumento "remotespeed" está ausente, sin valor por omisión
I believe cause c(aps$ip,aps$remoteip,aps$localspeed,aps$remotespeed) feeds first all cases of aps$ip, later all of aps$remoteip and so on.
I'm using the right strategie? it's map a suitable way
What i'm doing wrong?
¿how can I apply function to every row to get the requested tibble?
I'll appreciate your kindly help.
Greets.
Try using pmap_df.
output <- purrr::pmap_df(list(aps$ip, aps$remoteip, aps$localspeed,
aps$remotespeed), bw_test)
I'm using a webscraper to scrape some data from FinViz. Here's an example
The problem is that the data frame is messy, the first column holds what I would ideally want as the headers and the second column holds the corresponding data. Here's an output:
data1 data2 data3 data4 data5 data6 data7 data8 data9 data10
1 Index S&P 500 P/E 36.13 EPS (ttm) 4.60 Insider Own 0.10% Shs Outstand 2.93B
2 Market Cap 487.15B Forward P/E 25.65 EPS next Y 6.48 Insider Trans -86.95% Shs Float 2.33B
3 Income 13.58B PEG 1.36 EPS next Q 1.27 Inst Own 72.50% Short Float 0.87%
4 Sales 33.17B P/S 14.69 EPS this Y 170.20% Inst Trans -0.22% Short Ratio 1.13
5 Book/sh 22.92 P/B 7.26 EPS next Y 21.63% ROA 20.30% Target Price 192.62
6 Cash/sh 12.10 P/C 13.74 EPS next 5Y 26.57% ROE 22.50% 52W Range 113.55 - 175.49
7 Dividend - P/FCF 34.05 EPS past 5Y 62.10% ROI 17.10% 52W High -5.23%
8 Dividend % - Quick Ratio 12.30 Sales past 5Y 49.40% Gross Margin 86.60% 52W Low 46.47%
9 Employees 20658 Current Ratio 12.30 Sales Q/Q 44.80% Oper. Margin 46.40% RSI (14) 49.05
10 Optionable Yes Debt/Eq 0.00 EPS Q/Q 68.80% Profit Margin 40.90% Rel Volume 0.70
11 Shortable Yes LT Debt/Eq 0.00 Earnings Jul 26 AMC Payout 0.00% Avg Volume 17.87M
12 Recom 1.70 SMA20 -1.84% SMA50 2.85% SMA200 17.52% Volume 12,583,873
As you can see, data1 contains the categories and data2 contains the following information.
Ideally I'd want it in this structure:
Index | Market Cap | Income | Sales | Book sh | ...
------------------------------------------------
S&P500 | 487.15B | 13.58B | 33.17B | 22.92 |
So that data1,3,5,7 were all the headers and data2,4,6,8 where all in one row.
Could anyone provide any input? I'm trying to avoid compiling them into 2 different vectors then rbinding the frame together.
Cheerio!
Here is a solution using some tidyverse packages and your dataset.
library(rvest) # for scrapping the data
#> Le chargement a nécessité le package : xml2
library(dplyr, warn.conflicts = F)
library(tidyr)
library(purrr, warn.conflict = F)
Fisrt, we get your data directly from your example url.
tab <- read_html("http://finviz.com/quote.ashx?t=BA") %>%
html_node("table.snapshot-table2") %>%
html_table(header = F) %>%
as_data_frame()
tab
#> # A tibble: 12 x 12
#> X1 X2 X3 X4 X5 X6
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Index DJIA S&P500 P/E 20.77 EPS (ttm) 11.42
#> 2 Market Cap 141.89B Forward P/E 22.14 EPS next Y 10.71
#> 3 Income 7.12B PEG 1.13 EPS next Q 2.62
#> 4 Sales 90.90B P/S 1.56 EPS this Y 2.30%
#> 5 Book/sh -3.34 P/B - EPS next Y 7.28%
#> 6 Cash/sh 17.26 P/C 13.74 EPS next 5Y 18.36%
#> 7 Dividend 5.68 P/FCF 17.94 EPS past 5Y 7.40%
#> 8 Dividend % 2.39% Quick Ratio 0.40 Sales past 5Y 6.60%
#> 9 Employees 150500 Current Ratio 1.20 Sales Q/Q -8.10%
#> 10 Optionable Yes Debt/Eq - EPS Q/Q 885.50%
#> 11 Shortable Yes LT Debt/Eq - Earnings Jul 26 BMO
#> 12 Recom 2.20 SMA20 -0.16% SMA50 8.14%
#> # ... with 6 more variables: X7 <chr>, X8 <chr>, X9 <chr>, X10 <chr>,
#> # X11 <chr>, X12 <chr>
As headers are in every odd column and data in every even column, we
create a tidy tibble of two columns by row binding the subsets. For
that, we generate odd and even index. Then,
purrr::map_dfr allows us to iterates over those 2 lists, applies a function and row bind the results. The function consist of selecting 2 columns with of the table[ ]and rename those two columns withset_names.
col_num <- seq_len(ncol(tab))
even <- col_num[col_num %% 2 == 0]
odd <- setdiff(col_num, even)
tab2 <- map2_dfr(odd, even, ~ set_names(tab[, c(.x, .y)], c("header", "value")))
tab2
#> # A tibble: 72 x 2
#> header value
#> <chr> <chr>
#> 1 Index DJIA S&P500
#> 2 Market Cap 141.89B
#> 3 Income 7.12B
#> 4 Sales 90.90B
#> 5 Book/sh -3.34
#> 6 Cash/sh 17.26
#> 7 Dividend 5.68
#> 8 Dividend % 2.39%
#> 9 Employees 150500
#> 10 Optionable Yes
#> # ... with 62 more rows
You have a nice 2 column long table with all your data. Now if you want
the table in wide format instead of long format, you have to transpose.
But first, we have to deal with some duplicates names in the header
column. You can't have duplicates column names.
tab2 %>%
filter(header == header[duplicated(header)])
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y 7.28%
We just rename the second occurence adding _2
tab3 <- tab2 %>%
mutate(header = case_when(
duplicated(header) ~ paste(header, 2, sep = "_"),
TRUE ~ header)
)
# No more duplicates
any(duplicated(tab3$header))
#> [1] FALSE
tab3 %>% filter(stringr::str_detect(header, "EPS next Y"))
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y_2 7.28%
You can pass in wide format and have 72 columns instead of 72 lines.
tab3 %>%
spread(header, value)
#> # A tibble: 1 x 72
#> `52W High` `52W Low` `52W Range` ATR `Avg Volume` Beta `Book/sh`
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -3.78% 87.78% 126.31 - 246.49 3.77 3.46M 1.18 -3.34
#> # ... with 65 more variables: `Cash/sh` <chr>, Change <chr>, `Current
#> # Ratio` <chr>, `Debt/Eq` <chr>, Dividend <chr>, `Dividend %` <chr>,
#> # Earnings <chr>, Employees <chr>, `EPS (ttm)` <chr>, `EPS next
#> # 5Y` <chr>, `EPS next Q` <chr>, `EPS next Y` <chr>, `EPS next
#> # Y_2` <chr>, `EPS past 5Y` <chr>, `EPS Q/Q` <chr>, `EPS this Y` <chr>,
#> # `Forward P/E` <chr>, `Gross Margin` <chr>, Income <chr>, Index <chr>,
#> # `Insider Own` <chr>, `Insider Trans` <chr>, `Inst Own` <chr>, `Inst
#> # Trans` <chr>, `LT Debt/Eq` <chr>, `Market Cap` <chr>, `Oper.
#> # Margin` <chr>, Optionable <chr>, `P/B` <chr>, `P/C` <chr>,
#> # `P/E` <chr>, `P/FCF` <chr>, `P/S` <chr>, Payout <chr>, PEG <chr>,
#> # `Perf Half Y` <chr>, `Perf Month` <chr>, `Perf Quarter` <chr>, `Perf
#> # Week` <chr>, `Perf Year` <chr>, `Perf YTD` <chr>, `Prev Close` <chr>,
#> # Price <chr>, `Profit Margin` <chr>, `Quick Ratio` <chr>, Recom <chr>,
#> # `Rel Volume` <chr>, ROA <chr>, ROE <chr>, ROI <chr>, `RSI (14)` <chr>,
#> # Sales <chr>, `Sales past 5Y` <chr>, `Sales Q/Q` <chr>, `Short
#> # Float` <chr>, `Short Ratio` <chr>, Shortable <chr>, `Shs Float` <chr>,
#> # `Shs Outstand` <chr>, SMA20 <chr>, SMA200 <chr>, SMA50 <chr>, `Target
#> # Price` <chr>, Volatility <chr>, Volume <chr>
Idea: You can also replace all the spaces by _ in the header column to have column names without spaces. Often simpler to handle.
Would this work ?
data <- data.frame(data1= letters[1:10],data2=LETTERS[1:10],data3= letters[11:20],data4=LETTERS[11:20],stringsAsFactors=F)
# data1 data2 data3 data4
# 1 a A k K
# 2 b B l L
# 3 c C m M
# 4 d D n N
# 5 e E o O
# 6 f F p P
# 7 g G q Q
# 8 h H r R
# 9 i I s S
# 10 j J t T
output <- setNames(data.frame(
t(unlist(data[!as.logical(seq_along(data)%%2)]))),
unlist(data[as.logical(seq_along(data)%%2)]))
# a b c d e f g h i j k l m n o p q r s t
# 1 A B C D E F G H I J K L M N O P Q R S T
You can try:
library(data.table); library(dplyr)
table1 <- df[, 1:2] %>%as.data.table() %>% dcast.data.table(.~data1, value.var = "data2")
table2 <- df[, 3:4] %>%as.data.table() %>% dcast.data.table(.~data3, value.var = "data4")
cbind(table1, table2)
and so on for the rest