Working with some Bureau of Labor Statistics data (https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm). I have scraped the table from this URL and am trying to get it into a tidy format. Here is a working example:
Commodity jan feb mar
Nonmetallic mineral products
2020 257.2 258.1 258.5
2021 262.6 263.4 264.4
Concrete ingredients
2020 316.0 316.9 317.8
2021 328.4 328.4 328.4
Construction gravel
2020 359.2 360.7 362.1
2021 375.0 374.7 374.1
How can I get the 2020 and 2021 rows into a "Year" column, and get jan, feb, mar, etc. into a "Month" column like below?
Commodity Month Year Value
Nonmetallic mineral products jan 2020 257.2
Nonmetallic mineral products feb 2020 258.1
Concrete ingredients jan 2020 316.0
Concrete ingredients jan 2021 328.4
We could use the tidyverse to transform the data into the required format:
Create a grouping column 'grp' by taking the cumulative sum (cumsum) of a logical vector, i.e. the presence of letters in the 'Commodity' column
Use mutate to create 'Year' by replacing the first element of each group with NA, and update 'Commodity' with the first value of the group
Remove the first row of each group with slice
ungroup and reshape the data into long format with pivot_longer
library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
group_by(grp = cumsum(str_detect(Commodity, "[A-Za-z]"))) %>%
mutate(Year = replace(Commodity, 1, NA),
Commodity = first(Commodity)) %>%
slice(-1) %>%
ungroup %>%
select(-grp) %>%
pivot_longer(cols = jan:mar, names_to = 'Month')
Output:
# A tibble: 18 x 4
Commodity Year Month value
<chr> <chr> <chr> <dbl>
1 Nonmetallic mineral products 2020 jan 257.
2 Nonmetallic mineral products 2020 feb 258.
3 Nonmetallic mineral products 2020 mar 258.
4 Nonmetallic mineral products 2021 jan 263.
5 Nonmetallic mineral products 2021 feb 263.
6 Nonmetallic mineral products 2021 mar 264.
7 Concrete ingredients 2020 jan 316
8 Concrete ingredients 2020 feb 317.
9 Concrete ingredients 2020 mar 318.
10 Concrete ingredients 2021 jan 328.
11 Concrete ingredients 2021 feb 328.
12 Concrete ingredients 2021 mar 328.
13 Construction gravel 2020 jan 359.
14 Construction gravel 2020 feb 361.
15 Construction gravel 2020 mar 362.
16 Construction gravel 2021 jan 375
17 Construction gravel 2021 feb 375.
18 Construction gravel 2021 mar 374.
Data:
df1 <- structure(list(Commodity = c("Nonmetallic mineral products",
"2020", "2021", "Concrete ingredients", "2020", "2021", "Construction gravel",
"2020", "2021"), jan = c(NA, 257.2, 262.6, NA, 316, 328.4, NA,
359.2, 375), feb = c(NA, 258.1, 263.4, NA, 316.9, 328.4, NA,
360.7, 374.7), mar = c(NA, 258.5, 264.4, NA, 317.8, 328.4, NA,
362.1, 374.1)), class = "data.frame", row.names = c(NA, -9L))
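To see how the grouping trick works, here is the cumsum/str_detect step applied to the 'Commodity' column on its own (a minimal sketch using the first six values above):
library(stringr)
x <- c("Nonmetallic mineral products", "2020", "2021",
       "Concrete ingredients", "2020", "2021")
str_detect(x, "[A-Za-z]")
#> [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
cumsum(str_detect(x, "[A-Za-z]"))
#> [1] 1 1 1 2 2 2
Each commodity header starts a new group, and the year rows that follow inherit its group number.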
Another strategy would be to transform the data while scraping:
library(rvest)
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
"https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm" %>%
read_html() %>%
html_table() %>%
first() %>%
set_names(.[1, ]) %>%
tail(-1) %>%
split(ifelse(str_detect(.$Commodity, "\\d{4}"),
"data", "commodities")) %>%
with(data %>%
select(-`Historical data`) %>%
mutate(Year = as.integer(Commodity),
Commodity = commodities$Commodity %>%
head(nrow(commodities) - 1) %>%
rep(times = rep(2, length(.))),
across(-c(Commodity, Year), readr::parse_number),
.before = 1)) %>%
pivot_longer(-c(Year, Commodity)) %>%
transmute(Commodity, Year, Month = name, Value = value)
Result:
# A tibble: 754 x 4
Commodity Year Month Value
<chr> <int> <chr> <dbl>
1 Nonmetallic mineral products 2020 Jan 257.
2 Nonmetallic mineral products 2020 Feb 258.
3 Nonmetallic mineral products 2020 Mar 258.
4 Nonmetallic mineral products 2020 Apr 258.
5 Nonmetallic mineral products 2020 May 257.
6 Nonmetallic mineral products 2020 Jun 257.
7 Nonmetallic mineral products 2020 Jul 257.
8 Nonmetallic mineral products 2020 Aug 257
9 Nonmetallic mineral products 2020 Sep 258.
10 Nonmetallic mineral products 2020 Oct 257.
# … with 744 more rows
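The rep() call that duplicates each commodity name for its two year rows can be seen in isolation (a small sketch with made-up names; it is equivalent to rep(each = 2)):
c("A", "B", "C") %>% rep(times = rep(2, length(.)))
#> [1] "A" "A" "B" "B" "C" "C"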
Again doing the transformation while scraping: you could also use nth-child range CSS selectors to isolate nodes and then combine them, repeating the retrieved elements for the necessary lengths. CSS selectors are a quick way of filtering data.
Some explanation:
This gathers the commodities list and extends it for the number of years * number of months:
Commodity = page %>% html_nodes("table#ro3ppiconcrete > tbody > tr:nth-child(3n+1)") %>% html_text(trim = T) %>%
rep(each = length(months)) %>% rep(times = 2)
This joins all the Year 2 values (2021) under all the Year 1 (2020) values:
Value = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody > tr:nth-child(3n-1) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T)
)
This stacks the required number of repeats of Year 2 under those of Year 1
Year = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n+2) th") %>% html_text(trim = T) %>% rep(times = 2),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) th") %>% html_text(trim = T) %>% rep(times = 2)
) %>% as.integer()
R:
library(rvest)
library(magrittr)
page <- read_html("https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm")
months <- page %>%
html_nodes("table#ro3ppiconcrete tr:nth-child(2) > th:nth-child(n+2):nth-child(-n+13)") %>%
html_text()
df <- data.frame(
Commodity = page %>% html_nodes("table#ro3ppiconcrete > tbody > tr:nth-child(3n+1)") %>% html_text(trim = T) %>%
rep(each = length(months)) %>% rep(times = 2),
Year = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n+2) th") %>% html_text(trim = T) %>% rep(times = 2),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) th") %>% html_text(trim = T) %>% rep(times = 2)
) %>% as.integer(),
Month = months,
Value = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody > tr:nth-child(3n-1) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T)
)
)
# if you wish to remove the provisional flag from Value and have as numeric
df$Value <- gsub('(p)', '', df$Value, fixed = T) %>% as.double()
:nth-child()
The :nth-child() CSS pseudo-class matches elements based on their
position in a group of siblings.
Functional notation
<An+B> Represents elements in a list whose indices match those found
in a custom pattern of numbers, defined by An+B, where:
A is an integer step size,
B is an integer offset,
n is all nonnegative integers, starting from 0.
It can be read as the An+Bth element of a list.
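As a quick illustration of the An+B notation (a sketch on a toy table, parsed with read_html as above):
library(rvest)
toy <- read_html("<table>
  <tr><td>commodity A</td></tr><tr><td>2020</td></tr><tr><td>2021</td></tr>
  <tr><td>commodity B</td></tr><tr><td>2020</td></tr><tr><td>2021</td></tr>
</table>")
toy %>% html_nodes("tr:nth-child(3n+1)") %>% html_text()  # rows 1, 4, ...
#> [1] "commodity A" "commodity B"
toy %>% html_nodes("tr:nth-child(3n)") %>% html_text()    # rows 3, 6, ...
#> [1] "2021" "2021"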
Related
I have a data frame with over 100,000 rows and with about 40 columns. The schools column has about 100 distinct schools. I have data from 1980 to 2023.
I want to keep all data from schools that have at least 10 rows for each of the years 2018 through 2022. Schools that do not meet that criterion should have all rows deleted.
In my minimal example, Schools, I have three schools.
Computing a table makes it apparent that only Washington should be retained. Adams only has 5 rows for 2018 and Jefferson has 0 for 2018.
Schools2 is what the result should look like.
How do I use the table computation or a dplyr computation to perform the filter?
Schools =
data.frame(school = c(rep('Washington', 60),
rep('Adams',70),
rep('Jefferson', 100)),
year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
rep(2019:2023, each = 20)),
stuff = rnorm(230)
)
Schools2 =
data.frame(school = c(rep('Washington', 60)),
year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)),
stuff = rnorm(60)
)
table(Schools$school, Schools$year)
Schools |> group_by(school, year) |> summarize(counts = n())
Filter by school: within each school, tabulate the years 2018 through 2022 and keep the school only if every one of those years is present and each has a count of at least 10. The .by argument makes filter operate per school:
library(dplyr)# version >= 1.1.0
Schools %>%
filter(all(table(year[year %in% 2018:2022]) >= 10) &
all(2018:2022 %in% year), .by = c("school")) %>%
as_tibble()
Output:
# A tibble: 60 × 3
school year stuff
<chr> <dbl> <dbl>
1 Washington 2016 0.680
2 Washington 2016 -1.14
3 Washington 2016 0.0420
4 Washington 2016 -0.603
5 Washington 2016 2.05
6 Washington 2018 -0.810
7 Washington 2018 0.692
8 Washington 2018 -0.502
9 Washington 2018 0.464
10 Washington 2018 0.397
# … with 50 more rows
Or using count():
library(magrittr)
Schools %>%
filter(tibble(year) %>%
filter(year %in% 2018:2022) %>%
count(year) %>%
pull(n) %>%
is_weakly_greater_than(10) %>%
all, all(2018:2022 %in% year) , .by = "school")
As it turns out, a friend just helped me come up with a base R solution.
# form 2-way table, school against year
sdTable = table(Schools$school, Schools$year)
# say want years 2018-2022 having lots of rows in school data
sdTable = sdTable[,3:7]
# which have >= 10 rows in all years 2018-2022
allGtEq = function(oneRow) all(oneRow >= 10)
whichToKeep = which(apply(sdTable,1,allGtEq))
# now whichToKeep is row numbers from the table; get the school names
whichToKeep = names(whichToKeep)
# back to school data
whichOrigRowsToKeep = which(Schools$school %in% whichToKeep)
newSchools = Schools[whichOrigRowsToKeep,]
newSchools
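A quick check that only Washington's 60 rows survive:
table(newSchools$school)
#> Washington
#>         60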
I have a data manipulation and exclusion challenge that I just can't figure out how to approach successfully. I have data in a tidy format, all observations are rows. Here is a reprex for my dataset:
quarter <- c("Q4", "Q3", "Q2","Q1", "Q3", "Q2", "Q1","Q4", "Q2", "Q1", "Q4", "Q3", "Q2", "Q1","Q4", "Q3", "Q1")
year <- c("2020", "2020","2020","2020","2019","2019","2019", "2020", "2020","2020","2019","2019","2019","2019", "2020", "2020","2020")
country <- c("Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil", "Brazil","Brazil","Brazil","Brazil","France","France","France")
indicator <- c("Testing","Testing", "Testing","Testing","Testing","Testing","Testing","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos", "Testing","Testing","Testing")
value <- sample(c(1:10), 17, replace = T)
quarterlydf <- data.frame(quarter, year, country, indicator, value)
quarter year country indicator value
1 Q4 2020 Brazil Testing 9
2 Q3 2020 Brazil Testing 3
3 Q2 2020 Brazil Testing 2
4 Q1 2020 Brazil Testing 7
5 Q3 2019 Brazil Testing 1
6 Q2 2019 Brazil Testing 5
7 Q1 2019 Brazil Testing 6
8 Q4 2020 Brazil TestingPos 4
9 Q2 2020 Brazil TestingPos 4
10 Q1 2020 Brazil TestingPos 3
11 Q4 2019 Brazil TestingPos 7
12 Q3 2019 Brazil TestingPos 2
13 Q2 2019 Brazil TestingPos 8
14 Q1 2019 Brazil TestingPos 1
15 Q4 2020 France Testing 1
16 Q3 2020 France Testing 1
17 Q1 2020 France Testing 8
For each country and indicator combination, I need to find the most recent contiguous 4 quarter period. For that most recent set of four contiguous quarters (e.g. Q3 2019, Q4 2019, Q1 2020, Q2 2020), I need to create a new row in a new dataframe (annualdf here) with the country, the start and end quarter/year, the indicator, the sum and the mean of the values for the included quarters.
All other contiguous quarter sets should be discarded, and anywhere there is not a contiguous set of four quarters, the data should be discarded as well.
The product should look like this for the preceding frame:
start end country indicator sum mean
1 Q1_2020 Q4_2020 Brazil Testing 21 5.25
2 Q3_2019 Q2_2020 Brazil TestingPos 16 4
I won't go into all I've tried, but it's gotten very very ugly, involving trying to reassign sequential ids to each possible quarter/year combination, then use pivot_wider() to create multiple columns for each id, concatenate those columns into a single result, then use a grotesque set of str_detect() searches to search and assign values. Long story short, I think the entire approach I'm trying is very bad and incredibly inelegant.
There HAS to be an elegant way to do this.
Any suggestions would be very, very much appreciated. Thank you.
EDIT1: Per Limey there was a minor typo in the desired output (Q2_2019 was supposed to be Q2_2020). This has been fixed.
Though the syntax is a bit long (I will try to shorten it), this will work. The only assumption here is that no year is completely missing; otherwise that field would also need to be filled in with complete(). With that caveat, these will work:
quarterlydf %>%
arrange(desc(year), desc(quarter)) %>%
group_by(country, indicator, year) %>%
complete(quarter = rev(c("Q1", "Q2", "Q3", "Q4"))) %>%
group_by(country, indicator) %>%
arrange(desc(year), desc(quarter), .by_group = T) %>%
filter(with(rle(is.na(value)), rep(lengths, lengths)) >=4, !is.na(value)) %>%
slice_head(n = 4) %>%
summarise(start = paste0(last(year),"_", last(quarter)),
end = paste0(first(year),"_", first(quarter)),
sum = sum(value),
mean = mean(value))
# A tibble: 2 x 6
# Groups: country [1]
country indicator start end sum mean
<chr> <chr> <chr> <chr> <int> <dbl>
1 Brazil Testing 2020_Q1 2020_Q4 18 4.5
2 Brazil TestingPos 2019_Q3 2020_Q2 16 4
The same can be done in chronological order too:
quarterlydf %>%
arrange(year, quarter) %>%
group_by(country, indicator, year) %>%
complete(quarter = c("Q1", "Q2", "Q3", "Q4")) %>%
group_by(country, indicator) %>%
filter(with(rle(is.na(value)), rep(lengths, lengths)) >=4, !is.na(value)) %>%
slice_tail(n = 4) %>%
summarise(start = paste0(first(year),"_", first(quarter)),
end = paste0(last(year),"_", last(quarter)),
sum = sum(value),
mean = mean(value))
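The rle() trick in the filter step can be hard to parse; here is a minimal illustration of what it computes:
x <- c(NA, 5, 6, 7, 8, NA, 2)
with(rle(is.na(x)), rep(lengths, lengths))
#> [1] 1 4 4 4 4 1 1
Each element is replaced by the length of the run it belongs to, so the condition >= 4 & !is.na(value) keeps only values that sit inside a run of at least four consecutive non-missing quarters.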
I have to transform my dataframe from the current to the new format (see the structure below). I have no idea how to accomplish that. I want a year for each ID, from 2013-2018 (so each ID has 6 rows, one for every year). The dates are the dates of living at that address (entry date) and leaving that address (end date). So each ID and year gives the zipcode and city where they lived. The place the ID lived (for each year) should be where they lived the longest that year. I've already set the end date to 31-12-2018 if they still live there (shown here with NA). Below are the first 3 rows. Hopefully you can help me out!
Current format:
ID (1, 1, 2)
ZIPCODE (1234AB, 5678CD, 9012EF)
CITY (NEWYORK, LA, MIAMI)
ENTRY_DATE (2-1-2014, 13-3-2017, 10-11-2011)
END_DATE (13-5-2017, 21-12-2018, 6-9-2017)
New format:
ID (1, 1, 1, 1, 1, 1, 2)
YEAR (2013, 2014, 2015, 2016, 2017, 2018, 2013)
ZIPCODE (NA, 1234AB, 1234AB, 1234AB, 5678CD, 5678CD, 9012EF)
CITY (NA, NEWYORK, NEWYORK, NEWYORK, LA, LA, MIAMI)
Here is one approach.
First, create date intervals for each location from start to end dates. Using map2 and unnest you will create additional rows for each year.
Since you wish to include the location information where there was the greatest number of days for that calendar year, you can look at the overlap between two intervals: one interval is the calendar year, and the second is the ENTRY_DATE to END_DATE stay. For each year, you can filter by max(WEEKS) (or, to ensure a single address per year, arrange in descending order by WEEKS and slice(1), or with recent dplyr consider slice_max). This keeps the row with the greatest overlap duration between the intervals.
The final complete will ensure you have rows for all years between 2013-2018.
library(tidyverse)
library(lubridate)
df %>%
mutate(ENTRY_END_INT = interval(ENTRY_DATE, END_DATE),
YEAR = map2(year(ENTRY_DATE), year(END_DATE), seq)) %>%
unnest(YEAR) %>%
mutate(YEAR_INT = interval(as.Date(paste0(YEAR, '-01-01')), as.Date(paste0(YEAR, '-12-31'))),
WEEKS = as.duration(intersect(ENTRY_END_INT, YEAR_INT))) %>%
group_by(ID, YEAR) %>%
arrange(desc(WEEKS)) %>%
slice(1) %>%
group_by(ID) %>%
complete(YEAR = seq(2013, 2018, 1)) %>%
arrange(ID, YEAR) %>%
select(-c(ENTRY_DATE, END_DATE, ENTRY_END_INT, YEAR_INT, WEEKS))
Output
# A tibble: 14 x 4
# Groups: ID [2]
ID YEAR ZIPCODE CITY
<dbl> <dbl> <chr> <chr>
1 1 2013 NA NA
2 1 2014 1234AB NEWYORK
3 1 2015 1234AB NEWYORK
4 1 2016 1234AB NEWYORK
5 1 2017 5678CD LA
6 1 2018 5678CD LA
7 2 2011 9012EF MIAMI
8 2 2012 9012EF MIAMI
9 2 2013 9012EF MIAMI
10 2 2014 9012EF MIAMI
11 2 2015 9012EF MIAMI
12 2 2016 9012EF MIAMI
13 2 2017 9012EF MIAMI
14 2 2018 NA NA
Data
df <- structure(list(ID = c(1, 1, 2), ZIPCODE = c("1234AB", "5678CD",
"9012EF"), CITY = c("NEWYORK", "LA", "MIAMI"), ENTRY_DATE = structure(c(16072,
17238, 15288), class = "Date"), END_DATE = structure(c(17299,
17896, 17415), class = "Date")), class = "data.frame", row.names = c(NA,
-3L))
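To see what intersect() measures for two lubridate intervals, here is the overlap computed by hand for ID 1's first address and the 2017 calendar year (a minimal sketch using the dates above):
library(lubridate)
stay <- interval(as.Date("2014-01-02"), as.Date("2017-05-13"))
yr2017 <- interval(as.Date("2017-01-01"), as.Date("2017-12-31"))
as.duration(intersect(stay, yr2017))
#> [1] "11404800s (~18.86 weeks)"
The first address overlaps 2017 by about 19 weeks, while the second address (entered 13-3-2017) overlaps it by about 42 weeks, so 2017 is assigned to the second address.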
I am trying to use rvest to extract some information. What I have is a list of links, and I would like to bind the rows of the collected data together.
What I currently have is the following;
EDIT: here are the links without the weekend data:
links <- c("https://finance.yahoo.com/calendar/ipo?day=2018-03-05", "https://finance.yahoo.com/calendar/ipo?day=2018-03-06",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-07", "https://finance.yahoo.com/calendar/ipo?day=2018-03-08",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-09", "https://finance.yahoo.com/calendar/ipo?day=2018-03-12",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-13", "https://finance.yahoo.com/calendar/ipo?day=2018-03-14",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-15", "https://finance.yahoo.com/calendar/ipo?day=2018-03-16",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-19", "https://finance.yahoo.com/calendar/ipo?day=2018-03-20",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-21", "https://finance.yahoo.com/calendar/ipo?day=2018-03-22",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-23", "https://finance.yahoo.com/calendar/ipo?day=2018-03-26",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-27", "https://finance.yahoo.com/calendar/ipo?day=2018-03-28",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-29", "https://finance.yahoo.com/calendar/ipo?day=2018-03-30",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-02", "https://finance.yahoo.com/calendar/ipo?day=2018-04-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-04", "https://finance.yahoo.com/calendar/ipo?day=2018-04-05",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-06", "https://finance.yahoo.com/calendar/ipo?day=2018-04-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-10", "https://finance.yahoo.com/calendar/ipo?day=2018-04-11",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-12", "https://finance.yahoo.com/calendar/ipo?day=2018-04-13",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-16", "https://finance.yahoo.com/calendar/ipo?day=2018-04-17",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-18", "https://finance.yahoo.com/calendar/ipo?day=2018-04-19",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-20", "https://finance.yahoo.com/calendar/ipo?day=2018-04-23",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-24", "https://finance.yahoo.com/calendar/ipo?day=2018-04-25",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-26", "https://finance.yahoo.com/calendar/ipo?day=2018-04-27",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-30", "https://finance.yahoo.com/calendar/ipo?day=2018-05-01",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-02", "https://finance.yahoo.com/calendar/ipo?day=2018-05-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-04", "https://finance.yahoo.com/calendar/ipo?day=2018-05-07",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-08", "https://finance.yahoo.com/calendar/ipo?day=2018-05-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-10")
Code:
library(rvest)
library(dplyr)
library(magrittr)
x <- links %>%
read_html() %>%
html_table() %>%
extract2(1) %>%
bind_rows() %>%
as_tibble
This gives the following error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=68].
I am able to get the code working for one link; however, when I try to get it working for all the links, I run into errors. For example, this code works:
x <- "https://finance.yahoo.com/calendar/ipo?day=2018-05-08" %>%
read_html() %>%
html_table() %>%
extract2(1) %>%
bind_rows() %>%
as_tibble
EDIT:
from = "2016-03-04"
to = "2018-05-10"
s <- seq(as.Date(from), as.Date(to), "days")
library(chron)
s <- s[!is.weekend(s)]
links <- paste0("https://finance.yahoo.com/calendar/ipo?day=", s)
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
library(naniar)
IPOs <- links[1:400] %>%
map_dfr(~read_html(.x) %>%
html_table() %>%
extract2(1) %>%
naniar::replace_with_na_all(condition = ~.x == "-") %>%
type.convert(as.is = TRUE) )
It looks like you want to loop through the URLs. For each one you want to read it, parse it into a list of data frames, and extract the first data frame from the list. So the read_html() through extract2() steps should be done within the loop.
One option is to use a purrr::map_dfr() loop, since it looks like you want to bind things into a single tibble in the end.
Nominally that could look like:
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
links %>%
map_dfr(~read_html(.x) %>%
html_table() %>%
extract2(1) )
However, it turns out that you have missing values that are represented by hyphens (-). Some of the tables have these and some don't. When they are present, R reads your integer columns as character; when they are absent, those columns are read as integers. This causes problems when binding everything together.
I did not see an argument in html_table() to deal with these directly (I was looking for the equivalent of na.strings in read.table() or na in readr::read_csv()). My work-around was to convert the hyphens to NA using the function replace_with_na_all() from package naniar (see the vignette here). Then I converted all columns to the appropriate type with type.convert().
All of this was done within the map_dfr() loop.
Here is an example with just the first two URL's in links.
links[1:2] %>%
map_dfr(~read_html(.x) %>%
html_table() %>%
extract2(1) %>%
naniar::replace_with_na_all(condition = ~.x == "-") %>%
type.convert(as.is = TRUE) )
# A tibble: 15 x 9
Symbol Company Exchange Date `Price Range` Price Currency Shares Actions
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr>
1 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 49969000 Priced
2 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 11745600 Priced
3 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 6857200 Priced
4 0000 Vcredit Hldg Ltd HKSE Jun 12, 2018 NA NA HKD NA Expected
5 6571.JP QB Net Holdings Co Ltd Japan OTC Mar 14, 2018 21.11 - 21.11 NA Y 9785900 Expected
6 1621.HK Vico Intl Hldg Ltd HKSE Mar 05, 2018 NA 0.35 HKD 175000000 Priced
7 PZM.AX Piston Mach Ltd ASX Mar 05, 2018 0.32 - 0.32 NA AU 50000000 Expected
8 "" Agp Ltd Karachi Mar 05, 2018 0.76 - 0.76 80 PKR 8750000 Priced
9 GRC.L GRC International Group PLC LSE Mar 05, 2018 0.98 - 0.98 0.7 GBP 8414286 Priced
10 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 3175413 Priced
11 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 7935698 Priced
12 GCI.AX Gryphon Capital Income Tr ASX May 23, 2018 1.57 - 1.57 2 AUD 87650000 Priced
13 GCI.AX Gryphon Capital Income Tr ASX May 04, 2018 1.57 - 1.57 NA AUD 50000000 Expected
14 STRL.L Stirling Inds Plc LSE Mar 06, 2018 1.40 - 1.40 1 GBP 8881002 Priced
15 541006.BO Angel Fibers Ltd BSE Mar 06, 2018 NA 27 INR 6408000 Priced
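As an aside: newer versions of rvest (1.0.0 and later) added an na.strings argument to html_table(), which handles the hyphens at parse time. Assuming such a version, the work-around shrinks to:
links[1:2] %>%
  map_dfr(~read_html(.x) %>%
            html_table(na.strings = "-") %>%
            extract2(1))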
I have a data frame like this:
year <-c(floor(runif(100,min=2015, max=2017)))
month <- c(floor(runif(100, min=1, max=13)))
inch <- c(floor(runif(100, min=0, max=10)))
mm <- c(floor(runif(100, min=0, max=100)))
df = data.frame(year, month, inch, mm);
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to re-arrange the data frame so that the first column is the name of the month and the remaining columns hold the mm values, one column per year.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things needs to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()
To convert month numbers to names, you can index into month.abb; then you can summarize by year and month and spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000
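spread() still works but has been superseded; assuming tidyr >= 1.0.0 and dplyr >= 1.0.0, the same reshape can be written with pivot_wider() (a sketch keeping the same mean aggregation):
df %>%
  group_by(month = month.abb[month], year) %>%
  summarise(mm = mean(mm), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = mm) %>%
  arrange(match(month, month.abb))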