set ... to na when using read_excel() - r

I am trying to read in excel files which use three horizontal dots to represent missing values, for example...
https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/EXCEL_FILES/2_Fertility/WPP2019_FERT_F02_SEX_RATIO_AT_BIRTH.xlsx
Is it possible to set these to NA using read_excel()? I have tried different options for the na argument (see below), none which seem to work
d0 <- read_excel(path = "WPP2019_FERT_F02_SEX_RATIO_AT_BIRTH.xlsx)",
# na = "...", # does not work
# na = "…", # copying the output does not work
# na = "U+2026", # unicode character does not work
sheet = 2, skip = 16)
d0
# # A tibble: 255 x 21
# Index Variant `Region, subreg~ Notes `Country code` Type `Parent code` `1950-1955`
# <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
# 1 1 Estima~ WORLD NA 900 World 0 1.06
# 2 2 Estima~ UN development ~ a 1803 Labe~ 900 …
# 3 3 Estima~ More developed ~ b 901 Deve~ 1803 1.06
# 4 4 Estima~ Less developed ~ c 902 Deve~ 1803 1.06
# 5 5 Estima~ Least developed~ d 941 Deve~ 902 1.04
# 6 6 Estima~ Less developed ~ e 934 Deve~ 902 1.06
# 7 7 Estima~ Less developed ~ NA 948 Deve~ 1803 1.05
# 8 8 Estima~ Land-locked Dev~ f 1636 Spec~ 1803 1.04
# 9 9 Estima~ Small Island De~ g 1637 Spec~ 1803 1.05
# 10 10 Estima~ World Bank inco~ NA 1802 Labe~ 900 …
# # ... with 245 more rows, and 13 more variables: `1955-1960` <chr>, `1960-1965` <chr>,
# # `1965-1970` <chr>, `1970-1975` <chr>, `1975-1980` <chr>, `1980-1985` <chr>,
# # `1985-1990` <chr>, `1990-1995` <chr>, `1995-2000` <chr>, `2000-2005` <chr>,
# # `2005-2010` <chr>, `2010-2015` <chr>, `2015-2020` <chr>
Example column where NA is not being created, and values are not of numeric type...
d3 %>% select(`1950-1955`) %>% pull()
# [1] "1.06" "…" "1.06" "1.06"
# [5] "1.04" "1.06" "1.05" "1.04"
# [9] "1.05" "…" "1.06" "1.06"

As you can see here Print unicode character string in R the issue is with how you define the unicode char.
Try this:
readxl::read_xlsx("C:/Stack/WPP2019_FERT_F02_SEX_RATIO_AT_BIRTH.xlsx", sheet = 2, skip = 16, na="\U2026")

Simple but effective, just convert the column to numeric after importing. Throws a warning, but who cares.
head(df)
# something v
# 1 -0.2168503 1.06
# 2 0.9863558 …
# 3 1.8623381 1.06
# 4 -1.0441477 1.06
# 5 0.4244308 1.04
# 6 1.5825152 1.06
df <- transform(df, v=as.numeric(v))
head(df)
# something v
# 1 -0.2168503 1.06
# 2 0.9863558 NA
# 3 1.8623381 1.06
# 4 -1.0441477 1.06
# 5 0.4244308 1.04
# 6 1.5825152 1.06
Data
df <- structure(list(something = c(0.344600422686915, 1.12754949114835,
0.264102711671497, -0.588052830551214, 0.916134405190614, 0.118418825652515,
-1.5711759894206, 0.561452729377526, -0.27524305006459, -0.611306705421411,
0.614179300117269, 0.765082495652037), v = c("1.06", "…", "1.06",
"1.06", "1.04", "1.06", "1.05", "1.04", "1.05", "…", "1.06",
"1.06")), class = "data.frame", row.names = c(NA, -12L))

Related

Add leading zeros to colum names

I'm surprised to find no one asked this question on Stackoverflow before. Maybe it's too stupid to ask?
So I have a dataframe that contains 48 weather variables, each representing a weather value for a month. I have drawn a simplified table shown below:
weather 1
weather 2
weather 3
weather 4
weather 5
weather 6
weather 7
weather 8
weather 9
weather 10
weather 11
weather 12
12
6
34
9
100
.01
-4
38
64
77
21
34
99
42
-3
34
34
.5
27
19
7
18
NA
20
My objective is to make the column names from "weather 1, weather 2, ..." to "weather 01, weather 02, ...." And I wrote a loop like this:
for (i in 1:9){
colnames(df) = gsub(i, 0+i, colnames(df))
}
However, instead of replacing the single-digit numbers with a leading zero, R replaced the actual letter "i" with "0+i". Can anyone let me know what's going on here and how to fix it? Or is there a better way to add leading zeros to column names?
Thank you very much!
We can use
library(stringr)
colnames(df) <- str_replace(colnames(df), "\\d+",
function(x) sprintf("%02d", as.integer(x)))
Here is another option:
library(tidyverse)
set.seed(35)
example <- tibble(`weather 1` = runif(2),
`weather 2` = runif(2),
`weather 3` = runif(2))
rename_with(example, ~str_replace(., "(weather )(\\d+)", "\\10\\2"), everything())
#> # A tibble: 2 x 3
#> `weather 01` `weather 02` `weather 03`
#> <dbl> <dbl> <dbl>
#> 1 0.857 0.553 0.486
#> 2 0.0108 0.950 0.0939
or with base R
colnames(example) <- gsub("(weather )(\\d+)", "\\10\\2", colnames(example))
example
#> # A tibble: 2 x 3
#> `weather 01` `weather 02` `weather 03`
#> <dbl> <dbl> <dbl>
#> 1 0.857 0.553 0.486
#> 2 0.0108 0.950 0.0939

Opening and reshaping xlsx files with nameless columns in r using a pattern

I'm working with French electoral data but I'm having issues opening xlsx files to work on them in r. I was wondering if anyone had had the same problem and found a solution.
The issue is that only the first 29 columns out of +100 columns have names and the rest are nameless. I've tried editing the column names in excel before opening them but this solution is time consuming and prone to mistakes. I'm looking for a way to automatize the process.
The datasets have a pattern that I'm trying to exploit to rename the columns and reshape the files:
the first 6 columns correspond to the geographic id of the precinct (region, municipality, etc...)
the next 15 columns give information about aggregate results in the precinct (number of voters, number of registered voters, participation, etc..)
The next 8 columns give information about a given candidate and her results in the precinct (name, sex, party id, number of votes, .. etc)
These 29 columns have names.
The next columns are nameless and correspond to other candidates. They repeat the 8 columns for the other candidates.
There is another layer of difficulty since each precinct does not have the same number of candidates so the number of nameless columns changes.
Ideally, I would want r to recognize the pattern and reshape the datasets to long by creating a new row for each candidate keeping the precinct id and aggregate data in each row. To do this, I would like r to recognize each sequence of nameless 8 columns.
To simplify, let's say that my data frame looks like the following:
precinct_id
tot_votes
candidate_id
candidate_votes
...1
...2
Paris 05
1000
Jean Dupont
400
Paul Dupuy
300
Paris 06
500
Jean Dupont
50
Paul Dupuy
150
where:
candidate_id and candidate_votes correspond to the id and result of the first candidate
...1, ...2 is how r is automatically renaming the nameless columns that correspond to candidate_id and candidate_votes for candidate 2 in the same precinct.
I need r to select the observations in each sequence of 2 columns and paste them into new rows under candidate_id candidate_votes while keeping the precinct_id and precinct_votes columns.
precinct_id
tot_votes
candidate_id
candidate_votes
Paris 05
1000
Jean Dupont
400
Paris 06
500
Jean Dupont
50
Paris 05
1000
Paul Dupuy
300
Paris 06
500
Paul Dupuy
150
I have no idea how to reshape without column names... Any help would be greatly appreciated! Thanks!
PS: The files come from here: https://www.data.gouv.fr/fr/datasets/elections-legislatives-des-12-et-19-juin-2022-resultats-definitifs-du-premier-tour/
Actually, there's an even simpler solution to the one I suggested. .names_repair can take a function as its value. This function should accept a vector of "input" column names and return a vector of "output column names". As we want to treat the data for the first candidate in each row in eactly the same way as every subsequent set of eight columns, I'll ignore only the first 21 columns, not the first 29.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
)
# A tibble: 35,429 × 197
`Code du département` `Libellé du dép…` `Code de la ci…` `Libellé de la…` `Code de la co…` `Libellé de la…` `Etat saisie` Inscrits Abstentions
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01 Ain 01 1ère circonscri… 016 Arbigny Complet 327 154
2 01 Ain 01 1ère circonscri… 024 Attignat Complet 2454 1281
3 01 Ain 01 1ère circonscri… 029 Beaupont Complet 446 224
4 01 Ain 01 1ère circonscri… 038 Bény Complet 604 306
5 01 Ain 01 1ère circonscri… 040 Béréziat Complet 362 179
6 01 Ain 01 1ère circonscri… 050 Boissey Complet 262 137
7 01 Ain 01 1ère circonscri… 053 Bourg-en-Bresse Complet 15516 8426
8 01 Ain 01 1ère circonscri… 057 Boz Complet 391 210
9 01 Ain 01 1ère circonscri… 065 Buellas Complet 1408 654
10 01 Ain 01 1ère circonscri… 069 Certines Complet 1169 639
# … with 35,419 more rows, and 188 more variables: `% Abs/Ins` <dbl>, Votants <dbl>, `% Vot/Ins` <dbl>, Blancs <dbl>, `% Blancs/Ins` <dbl>,
# `% Blancs/Vot` <dbl>, Nuls <dbl>, `% Nuls/Ins` <dbl>, `% Nuls/Vot` <dbl>, Exprimés <dbl>, `% Exp/Ins` <dbl>, `% Exp/Vot` <dbl>,
# C1_NPanneau <dbl>, C1_Sexe <chr>, C1_Nom <chr>, C1_Prénom <chr>, C1_Nuance <chr>, C1_Voix <dbl>, C1_PctVoixIns <dbl>, C1_PctVoixExp <dbl>,
# C2_NPanneau <dbl>, C2_Sexe <chr>, C2_Nom <chr>, C2_Prénom <chr>, C2_Nuance <chr>, C2_Voix <dbl>, C2_PctVoixIns <dbl>, C2_PctVoixExp <dbl>,
# C3_NPanneau <dbl>, C3_Sexe <chr>, C3_Nom <chr>, C3_Prénom <chr>, C3_Nuance <chr>, C3_Voix <dbl>, C3_PctVoixIns <dbl>, C3_PctVoixExp <dbl>,
# C4_NPanneau <dbl>, C4_Sexe <chr>, C4_Nom <chr>, C4_Prénom <chr>, C4_Nuance <chr>, C4_Voix <dbl>, C4_PctVoixIns <dbl>, C4_PctVoixExp <dbl>,
# C5_NPanneau <dbl>, C5_Sexe <chr>, C5_Nom <chr>, C5_Prénom <chr>, C5_Nuance <chr>, C5_Voix <dbl>, C5_PctVoixIns <dbl>, C5_PctVoixExp <dbl>, …
That's read the data in and named the columns. To get the final format you want, we will need to do a standard pivot_longer()/pivot_wider() trick, but the situation here is slightly complicated because some of your columns are character and some are numeric. So first, I'll turn the numeric columns into character columns so that the pivot_longer() step doesn't fail.
For clarity, I'll drop the first 21 columns so that it's easy to see what's going on.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
select(!1:21)
# A tibble: 6,235,504 × 3
Candidate Variable Value
<chr> <chr> <chr>
1 C1 NPanneau 2
2 C1 Sexe M
3 C1 Nom LAHY
4 C1 Prénom Éric
5 C1 Nuance DXG
6 C1 Voix 2
7 C1 PctVoixIns 0.61
8 C1 PctVoixExp 1.23
9 C2 NPanneau 8
10 C2 Sexe M
# … with 6,235,494 more rows
Now add the pivot_wider(), again dropping the first 21 columns, purely for clarity.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
pivot_wider(names_from=Variable, values_from=Value) %>%
select(!1:21)
# A tibble: 779,438 × 9
Candidate NPanneau Sexe Nom Prénom Nuance Voix PctVoixIns PctVoixExp
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 C1 2 M LAHY Éric DXG 2 0.61 1.23
2 C2 8 M GUÉRAUD Sébastien NUP 26 7.95 15.95
3 C3 7 F ARMENJON Eliane ECO 3 0.92 1.84
4 C4 1 M GUILLERMIN Vincent ENS 30 9.17 18.4
5 C5 3 M BRETON Xavier LR 44 13.46 26.99
6 C6 5 M MENDES Michael DSV 3 0.92 1.84
7 C7 6 M BELLON Julien REC 6 1.83 3.68
8 C8 4 F PIROUX GIANNOTTI Brigitte RN 49 14.98 30.06
9 C9 NA NA NA NA NA NA NA NA
10 C10 NA NA NA NA NA NA NA NA
# … with 779,428 more rows
Finally, convert the "temporary character" columns back to numeric. (Still dropping the first 21 columns for clarity.)
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
pivot_wider(names_from=Variable, values_from=Value) %>%
mutate(across(c(Voix, PctVoixIns, PctVoixExp), as.numeric)) %>%
select(!1:21)
# A tibble: 779,438 × 9
Candidate NPanneau Sexe Nom Prénom Nuance Voix PctVoixIns PctVoixExp
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 C1 2 M LAHY Éric DXG 2 0.61 1.23
2 C2 8 M GUÉRAUD Sébastien NUP 26 7.95 16.0
3 C3 7 F ARMENJON Eliane ECO 3 0.92 1.84
4 C4 1 M GUILLERMIN Vincent ENS 30 9.17 18.4
5 C5 3 M BRETON Xavier LR 44 13.5 27.0
6 C6 5 M MENDES Michael DSV 3 0.92 1.84
7 C7 6 M BELLON Julien REC 6 1.83 3.68
8 C8 4 F PIROUX GIANNOTTI Brigitte RN 49 15.0 30.1
9 C9 NA NA NA NA NA NA NA NA
10 C10 NA NA NA NA NA NA NA NA
# … with 779,428 more rows
This, I think, is the format you want, though you may need to arrange() the rows into the order you want. Obviously, you should drop the final %>% select(!1:21) for your production version.
It is an easy matter to convert this code to a function that accepts a filename as its parameter and then use this in an lapply to read an entire folder into a list of data frames. However...
It appears that not every file in the folder has the same layout. resultats-par-niveau-fe-t1-outre-mer.xlsx, for example, appears to have fewer "prefix columns" before the 8-columns-per-candidate repeat begins.
The import generates several warnings. This appears to be because the election(?) with the largest number of candidates does not appear in the first rows of the worksheet. I've not investigated whether these warnings are generated by meaningful problems with the import.

how can I make a new data frame where the columns are the unique values with corresponding observations from an old data frame? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 11 months ago.
My data frame has different dates as rows. Every unique date occurs appr. 500 times. I want to make a new data frame where every column is a unique date and where the rows are all the observations of that date from my old dataset. So for every column dat represents a certain date, I should have appr. 500 rows that each represent a rel_spread from that day.
You can use pivot_wider from tidyr:
library(tidyr)
pivot_wider(df, names_from = date, values_from = rel_spread, values_fn = list) %>%
unnest(everything())
#> # A tibble: 2 x 17
#> `20000103` `20000104` `20000105` `20000106` `20000107` `20000108` `20000109`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.0234 -0.0128 0.00729 0.0408 -0.0298 0.0398 0.0445
#> 2 0.0492 -0.0120 0.0277 0.0435 -0.0288 0.0152 -0.0374
#> # ... with 10 more variables: `20000110` <dbl>, `20000111` <dbl>,
#> # `20000112` <dbl>, `20000113` <dbl>, `20000114` <dbl>, `20000115` <dbl>,
#> # `20000116` <dbl>, `20000117` <dbl>, `20000118` <dbl>, `20000119` <dbl>
Note that we don't have your data (and I wasn't about to transcribe a picture of your data), but I created a little reproducible data set which should match the structure of your data set, except it only has two values per date for demo purposes:
set.seed(1)
df <- data.frame(date = rep(as.character(20000103:20000119), 2),
rel_spread = runif(34, -0.05, 0.05))
df
#> date rel_spread
#> 1 20000103 -0.0234491337
#> 2 20000104 -0.0127876100
#> 3 20000105 0.0072853363
#> 4 20000106 0.0408207790
#> 5 20000107 -0.0298318069
#> 6 20000108 0.0398389685
#> 7 20000109 0.0444675269
#> 8 20000110 0.0160797792
#> 9 20000111 0.0129114044
#> 10 20000112 -0.0438213730
#> 11 20000113 -0.0294025425
#> 12 20000114 -0.0323443247
#> 13 20000115 0.0187022847
#> 14 20000116 -0.0115896282
#> 15 20000117 0.0269841420
#> 16 20000118 -0.0002300758
#> 17 20000119 0.0217618508
#> 18 20000103 0.0491906095
#> 19 20000104 -0.0119964821
#> 20 20000105 0.0277445221
#> 21 20000106 0.0434705231
#> 22 20000107 -0.0287857479
#> 23 20000108 0.0151673766
#> 24 20000109 -0.0374444904
#> 25 20000110 -0.0232779331
#> 26 20000111 -0.0113885907
#> 27 20000112 -0.0486609667
#> 28 20000113 -0.0117612043
#> 29 20000114 0.0369690846
#> 30 20000115 -0.0159651003
#> 31 20000116 -0.0017919885
#> 32 20000117 0.0099565825
#> 33 20000118 -0.0006458693
#> 34 20000119 -0.0313782399
Allan’s answer is perfect if you have the same number of rows for each date. If this isn’t the case, the following should work:
library(tidyr)
library(dplyr)
data_wide <- data_long %>%
group_by(date) %>%
mutate(daterow = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = date, values_from = rel_spread) %>%
select(!daterow)
data_wide
Output:
# A tibble: 6 x 4
`20000103` `20000104` `20000105` `20000106`
<dbl> <dbl> <dbl> <dbl>
1 -0.626 0.184 -0.836 -0.621
2 1.60 0.330 -0.820 -2.21
3 0.487 0.738 0.576 1.12
4 -0.305 1.51 0.390 -0.0449
5 NA NA NA -0.0162
6 NA NA NA 0.944
Example data:
set.seed(1)
data_long <- data.frame(
date = c(rep(20000103:20000105, 4), rep(20000106, 6)),
rel_spread = rnorm(18)
)

R task, web scraping

I share my solution for the task, however, I get an error and cannot find the reason. Anyone can help with it?
Data download 1.1 Collect links Data on the Stack Overflow user survey is available on the Stack Overflow website. Create a web scraper that collects the links to the survey files. Select only the links to the surveys from 2017 to 2021.
lst_nodes <- "https://insights.stackoverflow.com/survey/" %>%
read_html() %>%
html_nodes(".js-download-link")
lst_url <- lst_nodes[1:5] %>%
html_attr("href")
print(lst_url)
Complete the function to download the data files from the URLs that extracted.
fun_download <- function(url) {
year <- # extract year from url
zip_file <- paste0("file_", year, ".zip")
zip_dir <- paste0("dir_", year)
download.file(url, zip_file)
unzip(zip_file, exdir = zip_dir, files = "survey_results_public.csv")
out <- read_csv(file.path(zip_dir, "survey_results_public.csv"), col_types = cols(.default = "c")) %>%
mutate(Year = year, ResponseId = row_number())
return(out)
year <- sub(".*[^0-9]([0-9]+)\\.zip$", "\\1", lst_url)
}
Apply the function to the URLs that you extracted and generate a data frame that contains the data from all surveys.
Save the data frame. Note: The read_csv command in the function seems to keep the downloaded csv files locked after reading. So once you tried to open the csv files, you cannot delete them. To overcome this lock, restart the R session.
Best to save the data so that you have to run the download and importing only once.
alldf <- lapply(lst_url, fun_download)
That is all I did so far...but it seems something is wrong
My suggestion to use year <- sub(.) needs to be put in context of the function itself, using its url only. This works.
fun_download <- function(url) {
stopifnot(length(url) == 1L) # just a safeguard
year <- sub(".*[^0-9]([0-9]+)\\.zip$", "\\1", url)
zip_file <- paste0("file_", year, ".zip")
zip_dir <- paste0("dir_", year)
download.file(url, zip_file)
unzip(zip_file, exdir = zip_dir, files = "survey_results_public.csv")
out <- readr::read_csv(file.path(zip_dir, "survey_results_public.csv"), col_types = readr::cols(.default = "c")) %>%
mutate(
Year = year,
ResponseId = row_number()
)
return(out)
}
fun_download(lst_url[[1]])
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip'
# Content type 'application/zip' length 8825103 bytes (8.4 MB)
# downloaded 8.4 MB
# # A tibble: 83,439 x 49
# ResponseId MainBranch Employment Country US_State UK_Country EdLevel Age1stCode LearnCode YearsCode YearsCodePro DevType
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 I am a deve~ Independen~ Slovakia NA NA Seconda~ 18 - 24 y~ Coding Bo~ NA NA Develop~
# 2 2 I am a stud~ Student, f~ Netherl~ NA NA Bachelo~ 11 - 17 y~ Other onl~ 7 NA NA
# 3 3 I am not pr~ Student, f~ Russian~ NA NA Bachelo~ 11 - 17 y~ Other onl~ NA NA NA
# 4 4 I am a deve~ Employed f~ Austria NA NA Master?~ 11 - 17 y~ NA NA NA Develop~
# 5 5 I am a deve~ Independen~ United ~ NA England Master?~ 5 - 10 ye~ Friend or~ 17 10 Develop~
# 6 6 I am a stud~ Student, p~ United ~ Georgia NA Bachelo~ 11 - 17 y~ Other onl~ NA NA NA
# 7 7 I code prim~ I prefer n~ United ~ New Ham~ NA Seconda~ 11 - 17 y~ Other onl~ 3 NA NA
# 8 8 I am a stud~ Student, f~ Malaysia NA NA Bachelo~ 11 - 17 y~ School;On~ 4 NA NA
# 9 9 I am a deve~ Employed p~ India NA NA Bachelo~ 18 - 24 y~ Coding Bo~ 6 4 Develop~
# 10 10 I am a deve~ Employed f~ Sweden NA NA Master?~ 11 - 17 y~ School 7 4 Data sc~
# # ... with 83,429 more rows, and 37 more variables: OrgSize <chr>, Currency <chr>, CompTotal <chr>, CompFreq <chr>,
# # LanguageHaveWorkedWith <chr>, LanguageWantToWorkWith <chr>, DatabaseHaveWorkedWith <chr>, DatabaseWantToWorkWith <chr>,
# # PlatformHaveWorkedWith <chr>, PlatformWantToWorkWith <chr>, WebframeHaveWorkedWith <chr>, WebframeWantToWorkWith <chr>,
# # MiscTechHaveWorkedWith <chr>, MiscTechWantToWorkWith <chr>, ToolsTechHaveWorkedWith <chr>, ToolsTechWantToWorkWith <chr>,
# # NEWCollabToolsHaveWorkedWith <chr>, NEWCollabToolsWantToWorkWith <chr>, OpSys <chr>, NEWStuck <chr>, NEWSOSites <chr>,
# # SOVisitFreq <chr>, SOAccount <chr>, SOPartFreq <chr>, SOComm <chr>, NEWOtherComms <chr>, Age <chr>, Gender <chr>,
# # Trans <chr>, Sexuality <chr>, Ethnicity <chr>, Accessibility <chr>, MentalHealth <chr>, SurveyLength <chr>, ...
From here, use lapply(., fun_download) to produce a list of frames.
list_of_frames <- lapply(lst_url, fun_download)
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip'
# Content type 'application/zip' length 8825103 bytes (8.4 MB)
# downloaded 8.4 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2020.zip'
# Content type 'application/zip' length 9908290 bytes (9.4 MB)
# downloaded 9.4 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2019.zip'
# Content type 'application/zip' length 18681322 bytes (17.8 MB)
# downloaded 17.8 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2018.zip'
# Content type 'application/zip' length 20022841 bytes (19.1 MB)
# downloaded 19.1 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2017.zip'
# Content type 'application/zip' length 9576818 bytes (9.1 MB)
# downloaded 9.1 MB
And a terse summary to show what they hold:
lapply(list_of_frames, function(z) z[1:2, 1:4])
# [[1]]
# # A tibble: 2 x 4
# ResponseId MainBranch Employment Country
# <int> <chr> <chr> <chr>
# 1 1 I am a developer by profession Independent contractor, freelancer, or self-employed Slovakia
# 2 2 I am a student who is learning to code Student, full-time Netherlands
# [[2]]
# # A tibble: 2 x 4
# Respondent MainBranch Hobbyist Age
# <chr> <chr> <chr> <chr>
# 1 1 I am a developer by profession Yes NA
# 2 2 I am a developer by profession No NA
# [[3]]
# # A tibble: 2 x 4
# Respondent MainBranch Hobbyist OpenSourcer
# <chr> <chr> <chr> <chr>
# 1 1 I am a student who is learning to code Yes Never
# 2 2 I am a student who is learning to code No Less than once per year
# [[4]]
# # A tibble: 2 x 4
# Respondent Hobby OpenSource Country
# <chr> <chr> <chr> <chr>
# 1 1 Yes No Kenya
# 2 3 Yes Yes United Kingdom
# [[5]]
# # A tibble: 2 x 4
# Respondent Professional ProgramHobby Country
# <chr> <chr> <chr> <chr>
# 1 1 Student Yes, both United States
# 2 2 Student Yes, both United Kingdom
If you need to assign names (such as the URL used to derive each dataset), then perhaps this, which adds a $url field to each frame.
list_of_frames <- Map(function(x, u) transform(x, url = u), list_of_frames, lst_url)
Data
library(rvest)
lst_nodes <- read_html("https://insights.stackoverflow.com/survey/") %>%
html_nodes(".js-download-link")
lst_url <- html_attr(lst_nodes [1:5], "href")
lst_url
# [1] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip"
# [2] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2020.zip"
# [3] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2019.zip"
# [4] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2018.zip"
# [5] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2017.zip"

Extracting table data from a website using R [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I want to get information from (https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html) using R.
The data is not in .csv or excel format. I am not sure where to start. I know very basic R and would welcome any help! thank you!
Presuming it's the table of data from the page you are looking for
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbl <- html_table(page)[[1]]
tbl <- as.tibble(tbl)
tbl
# A tibble: 260 x 9
`Medicinal\r\n … `Submission Numb… `Innovative Dru… Manufacturer `Drug(s) Containi… `Notice of Compl… `6 Year\r\n … `Pediatric Exte… `Data Protectio…
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 abiraterone ace… 138343 Zytiga Janssen I… N/A 2011-07-27 2017-07-27 N/A 2019-07-27
2 aclidinium bromide 157598 Tudorza Genu… AstraZeneca … Duaklir Genuair 2013-07-29 2019-07-29 N/A 2021-07-29
3 afatinib dimaleate 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01 N/A 2021-11-01
4 aflibercept 149321 Eylea Bayer Inc. N/A 2013-11-08 2019-11-08 N/A 2021-11-08
5 albiglutide 165145 Eperzan GlaxoSmithKl… N/A 2015-07-15 2021-07-15 N/A 2023-07-15
6 alectinib hydrochl… 189442 Alecensaro Hoffmann-La … N/A 2016-09-29 2022-09-29 N/A 2024-09-29
7 alirocumab 183116 Praluent Sanofi-avent… N/A 2016-04-11 2022-04-11 N/A 2024-04-11
8 alogliptin benzoate 158335 Nesina Takeda Ca… "Kazano\r\n … 2013-11-27 2019-11-27 N/A 2021-11-27
9 anthrax immune glo… 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06 Yes 2026-05-06
10 antihemophilic fac… 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22 Yes 2023-02-22
# ... with 250 more rows
To read in the 2nd/3rd/4th table on the page change the number in tbl <- html_table(page)[[1]] to the number table wish to read
You'll be able to extract this data through web scraping.
Try something like
library(rvest)
library(dplyr)
url <- "https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html"
page_html <- read_html(url)
tables <- page_html %>% html_nodes("table")
for (i in 1:length(tables)) {
table <- tables[i]
table_header <- table %>% html_nodes("thead th") %>% html_text(.) %>% trimws(.) %>% gsub("\r", "", .) %>% gsub("\n", "", .)
table_data <- matrix(ncol=length(table_header), nrow=1) %>% as.data.frame(.)
colnames(table_data) <- table_header
rows <- table %>% html_nodes("tr")
for (j in 2:length(rows)) {
table_data[j-1, ] <- rows[j] %>% html_nodes("td") %>% html_text(.) %>% trimws(.)
}
assign(paste0("table_data", i), table_data)
}
You can process them all the same way without a for loop and without using assign() (shudder). Plus, we can assign the table caption (the <h2> above each) to each table for a reference:
library(rvest)
xdf <- read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbls <- html_table(xdf, trim = TRUE)
We clean up the column names using janitor::clean_names() then find the captions, clean them up so they're proper variable names and assign them to each table:
setNames(
lapply(tbls, function(tbl) {
janitor::clean_names(tbl) %>% # CLEAN UP TABLE COLUMN NAMES
tibble::as_tibble() # solely for better printing
}),
html_nodes(xdf, "table > caption") %>% # ASSIGN THE TABLE HEADER TO THE LIST ELEMENT
html_text() %>% # BUT WE NEED TO CLEAN THEM UP FIRST
trimws() %>%
tolower() %>%
gsub("[[:punct:][:space:]]+", "_", .) %>%
gsub("_+", "_", .) %>%
make.unique(sep = "_")
) -> tbls
Now we can access them by name in the list without using the nigh-never-recommended assign() (shudder again):
tbls$products_for_human_use_active_data_protection_period
## # A tibble: 260 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abiraterone … 138343 Zytiga Janssen … N/A 2011-07-27 2017-07-27
## 2 aclidinium brom… 157598 Tudorza Gen… AstraZeneca… Duaklir Genu… 2013-07-29 2019-07-29
## 3 afatinib dimale… 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01
## 4 aflibercept 149321 Eylea Bayer In… N/A 2013-11-08 2019-11-08
## 5 albiglutide 165145 Eperzan GlaxoSmithK… N/A 2015-07-15 2021-07-15
## 6 alectinib hydro… 189442 Alecensaro Hoffmann-La… N/A 2016-09-29 2022-09-29
## 7 alirocumab 183116 Praluent Sanofi-aven… N/A 2016-04-11 2022-04-11
## 8 alogliptin benz… 158335 Nesina Takeda C… "Kazano\r\n … 2013-11-27 2019-11-27
## 9 anthrax immune … 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06
## 10 antihemophilic … 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22
## # ... with 250 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_human_use_expired_data_protection_period
## # A tibble: 92 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abatacept 98531 Orencia Bristol-Mye… N/A 2006-06-29 2012-06-29
## 2 acamprosate cal… 103287 Campral Mylan Pharm… N/A 2007-03-16 2013-03-16
## 3 alglucosidase a… 103381 Myozyme Genzyme Can… N/A 2006-08-14 2012-08-14
## 4 aliskiren hemif… 105388 Rasilez Novartis Ph… "Rasilez HCT\r\… 2007-11-14 2013-11-14
## 5 ambrisentan 113287 Volibris GlaxoSmithK… N/A 2008-03-20 2014-03-20
## 6 anidulafungin 110202 Eraxis Pfizer Cana… N/A 2007-11-14 2013-11-14
## 7 aprepitant 108483 Emend Merck Fross… "Emend Tri-Pack… 2007-08-24 2013-08-24
## 8 aripiprazole 120192 Abilify Bristol-Mye… Abilify Maintena 2009-07-09 2015-07-09
## 9 azacitidine 127108 Vidaza Celgene N/A 2009-10-23 2015-10-23
## 10 besifloxacin 123400 Besivance Bausch & … N/A 2009-10-23 2015-10-23
## # ... with 82 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_veterinary_use_active_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 afoxolaner 163768 Nexgard Merial Cana… Nexgard Spectra 2014-07-08 2020-07-08
## 2 avilamycin 156949 Surmax 100 Pre… Elanco Cana… Surmax 200 Prem… 2014-02-18 2020-02-18
## 3 cefpodoxime pro… 149164 Simplicef Zoetis Cana… N/A 2012-12-06 2018-12-06
## 4 clodronate diso… 172789 Osphos Injecti… Dechra Ltd. N/A 2015-05-06 2021-05-06
## 5 closantel sodium 180678 Flukiver Elanco Divi… N/A 2015-11-24 2021-11-24
## 6 derquantel 184844 Startect Zoetis Cana… N/A 2016-04-27 2022-04-27
## 7 dibotermin alfa… 148153 Truscient Zoetis Cana… N/A 2012-11-20 2018-11-20
## 8 fluralaner 166320 Bravecto Intervet Ca… N/A 2014-05-23 2020-05-23
## 9 gonadotropin re… 140525 Improvest Zoetis Cana… N/A 2011-06-22 2017-06-22
## 10 insulin human (… 150211 Prozinc Boehringer … N/A 2013-04-24 2019-04-24
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
tbls$products_for_veterinary_use_expired_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 acetaminophen 110139 Pracetam 20% O… Ceva Animal… N/A 2009-03-05 2015-03-05
## 2 buprenorphine h… 126077 Vetergesic Mul… Sogeval UK … N/A 2010-02-03 2016-02-03
## 3 cefovecin sodium 110061 Convenia Zoetis Cana… N/A 2007-05-30 2013-05-30
## 4 cephalexin mono… 126970 Vetolexin Vétoquinol … Cefaseptin 2010-06-24 2016-06-24
## 5 dirlotapide 110110 Slentrol Zoetis Cana… N/A 2008-08-14 2014-08-14
## 6 emamectin benzo… 109976 Slice Intervet Ca… N/A 2009-06-29 2015-06-29
## 7 emodepside 112103 / 112106… Profender Bayer Healt… N/A 2008-08-28 2014-08-28
## 8 firocoxib 110661 / 110379 Previcox Merial Cana… N/A 2007-09-28 2013-09-28
## 9 fluoxetine hydr… 109825 / 109826… Reconcile Elanco, Div… N/A 2008-03-28 2014-03-28
## 10 gamithromycin 125823 Zactran Merial Cana… N/A 2010-03-29 2016-03-29
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
There are also N/As in each we can turn into NA and there's a column drug_s_containing_the_medicinal_ingredient_variations common to each that - when an observation is not N/A - is one or more drugs separated by \r\n so we can use that to turn it into a list column that you can post-process with, say, tidyr::unnest():
lapply(tbls, function(x) {
# Make "N/A" into real NAs
x[] <- lapply(x, function(.x) ifelse(.x == "N/A", NA_character_, .x))
# The common `drug_s_containing_the_medicinal_ingredient_variations`
# column - when not N/A - has one drug per-line so we can use that
# fact to turn it into a list column which you can use `tidyr::unnest()` on
x$drug_s_containing_the_medicinal_ingredient_variations <-
lapply(x$drug_s_containing_the_medicinal_ingredient_variations, function(.x) {
strsplit(trimws(.x), "[\r\n]+")
})
x
}) -> tbls

Resources