webscraping using rvest package comes out empty - r

There is a table of taxes by country at the link below that I would like to scrape into a dataframe with Country and Tax columns.
I've tried using the rvest package as follows to get my Country column but the list I generate is empty and I don't understand why.
I would appreciate any pointers on resolving this problem.
library(rvest)
d1 <- read_html(
  "http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates"
)
TaxCountry <- d1 %>%
  html_nodes('.countryNameQC') %>%
  html_text()

The data is loaded dynamically and the DOM is altered when JavaScript runs in the browser. That doesn't happen with rvest, which only retrieves the static HTML.
The following selectors, in the browser, would have isolated your nodes:
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryYear
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryYear
But those classes are not even present in the HTML that rvest returns.
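As a quick sanity check, querying the d1 object from the question for those classes should come back empty, since rvest only ever sees the static HTML:
# both selectors should match zero nodes in the static page
length(html_nodes(d1, ".countryNameQC"))     # expected: 0
length(html_nodes(d1, ".twoCountryWrapper")) # expected: 0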
The data of interest is actually stored in several nodes, all of which have ids sharing the common prefix dspQCLinks. The data inside looks like the example row shown further below.
So, you can gather all those nodes using a CSS attribute = value selector with the starts-with operator (^):
html_nodes(page, "[id^=dspQCLinks]")
Then extract the text and combine it into one string:
paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = '')
Each row in your table is delimited by "!,", so we can split on that to generate the rows:
info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]
An example row would then look like:
"Albania#/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income#15"
If we split each row on the #, the data we want is at indices 1 and 3:
arr = strsplit(i, '#')[[1]]
country <- arr[1]
tax <- arr[3]
Thanks to @Brian's feedback I have removed the loop I had used to build the dataframe and replaced it with, to quote @Brian,
str_split_fixed(info, "#", 3) [which] gives you a character matrix, which can be directly coerced to a dataframe.
df <- data.frame(str_split_fixed(info, "#", 3))
You then remove the empty rows at the bottom of the df.
df <- df[df$Country != "",]
Sample of df:
R:
library(rvest)
library(stringr)
library(magrittr)
page <- read_html('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]
df <- data.frame(str_split_fixed(info, "#", 3))
colnames(df) <- c("Country","Link","Tax")
df <- subset(df, select = c("Country","Tax"))
df <- df[df$Country != "",]
View(df)
Python:
I did this first in Python as it was quicker for me:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
soup = bs(r.content, 'lxml')
text = ''
for i in soup.select('[id^=dspQCLinks]'):
    text += i.text

rows = text.split('!,')
countries = []
tax_info = []

for row in rows:
    if row:
        items = row.split('#')
        countries.append(items[0])
        tax_info.append(items[2])
df = pd.DataFrame(list(zip(countries,tax_info)))
print(df)
Reading:
str_split_fixed

Related

How do I create / name dataframes in a for loop in R?

So I'm currently trying to scrape precinct results by county from JSON files on Virginia's Secretary of State site. I got code working that gets the data from a URL and creates a dataframe named after the county. To speed up the process, I tried to put the code inside a for loop that iterates through Virginia's counties (which I'm sourcing from a 2020 election-by-county CSV already on my computer that I constructed from this: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ), constructs the URL for each county's JSON file (since the format is consistent), and saves it to a dataframe. My current code doesn't save the dataframes, though, so only the last county remains.
This is the code:
library(dplyr)
library(tidyverse)
library(jsonlite)
va <- filter(biden_margin, biden_margin$state_po == "VA")
#i put this line here because the spreadsheet uses spaces to separate "X" and "city" but the URL uses an underline
va$county_name <- gsub(" ", "_", va$county_name)
#i put this line here because the URLs have "county" in the name, but the spreadsheet doesn't; however the spreadsheet does have "city" for the independent cities, like the URLs (and the independent cities are the observations with FIPS above 51199)
va$county_name <- if_else(va$county_fips > 51199, va$county_name, paste0(va$county_name, "_COUNTY"))
#i did this as a list but i realize this might be a bad idea
governor_data <- vector(mode = "list", length = nrow(va))
for (i in nrow(va)) {
  precincts <- paste0("https://results.elections.virginia.gov/vaelections/2021%20November%20General/Json/Locality/", va$county_name[i], "/Governor.json")
  name <- paste0(va$county_name[i], "_governor_2021")
  java_source <- stream_in(file(precincts))
  df <- as.data.frame(java_source$Precincts)
  df$county <- java_source$Locality$LocalityName
  df <- unnest(df, cols = c(Candidates))
  df <- subset(df, select = -c(PoliticalParty, BallotOrder))
  df <- pivot_wider(df, names_from = BallotName, values_from = c(Votes, Percentage))
  #tried append before this, got the same result
  governor_data[i] <- assign(name, df)
}
Any thoughts?
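For what it's worth, two things stand out: for (i in nrow(va)) loops over a single value (the row count itself, hence only the last county), and assign() returns its value rather than storing it in the list. A minimal sketch of the same loop with those two pieces adjusted, assuming the parsing steps work as posted:
library(jsonlite)
library(tidyverse)

governor_data <- vector(mode = "list", length = nrow(va))

for (i in seq_len(nrow(va))) {  # seq_len() visits every row, not just the count
  precincts <- paste0("https://results.elections.virginia.gov/vaelections/2021%20November%20General/Json/Locality/",
                      va$county_name[i], "/Governor.json")
  java_source <- stream_in(file(precincts))
  df <- as.data.frame(java_source$Precincts)
  df$county <- java_source$Locality$LocalityName
  df <- unnest(df, cols = c(Candidates))
  df <- subset(df, select = -c(PoliticalParty, BallotOrder))
  df <- pivot_wider(df, names_from = BallotName, values_from = c(Votes, Percentage))
  # [[ ]] stores the dataframe itself in the list; name the element after the county
  governor_data[[i]] <- df
  names(governor_data)[i] <- paste0(va$county_name[i], "_governor_2021")
}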

Extracting PubMed abstracts in R retrieves each abstract in multiple rows (more rows in abstracts than in PubMed IDs)

I am trying to extract PubMed abstracts and their titles to place them in a dataframe.
With the help of Stack Overflow members, I was able to write the code below, which works. The issue now is that the number of rows in the abstract variable is higher than that of pmid or title, so I am unable to merge them correctly. Looking at the structure of the XML file I have, it appears the abstracts have more than one node, which is why they get extracted into more than one row.
Any suggestion on how to overcome that and have each abstract in one row, so I can merge the variables?
Here is my code:
library(XML)
library(httr)
library(glue)
library(dplyr)
####
query = 'asthma[mesh]+AND+eosinophils[mesh]+AND+2009[pdat]'
reqq = glue ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&RetMax=50&term={query}')
op = GET(reqq)
content(op)
df_op <- op %>% xml2::read_xml() %>% xml2::as_list()
pmids <- df_op$eSearchResult$IdList %>% unlist(use.names = FALSE)
reqq1 = glue("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={paste0(pmids, collapse = ',')}&rettype=abstract&retmode=xml")
op1 = GET(reqq1)
a = xmlParse(content(op1))
pmidd = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID', xmlValue))
title = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/ArticleTitle', xmlValue))
abstract = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText', xmlValue))
nrow(pmidd)
nrow(abstract)
Some articles come with the abstract spread in several sections (Objective, Methods, ....), some have just one entry and then some don't have an abstract at all. You'll have to take care of all these different scenarios.
XML::xmlToList() can be used to extract a list from the XML data. We can then use purrr's map*() commands to flatten the data.
library(purrr)
b <- xmlToList(a)
res <- map_dfr(b, \(x) {
  abstract_l <- x$MedlineCitation$Article$Abstract
  if (is.null(abstract_l))
    abstract_l <- ""
  tibble(
    pmid = x$MedlineCitation$PMID$text,
    title = x$MedlineCitation$Article$ArticleTitle,
    abstract = ifelse(
      length(abstract_l) > 1,
      map_chr(abstract_l, \(y) y[[1]]) |> paste(collapse = "\n"),
      unlist(abstract_l)
    )
  )
})
res$abstract
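If everything lines up, res should now contain exactly one row per article, which can be checked against the PMIDs fetched earlier:
nrow(res)      # should equal length(pmids): one row per article
length(pmids)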

How to write NA for missing results in rvest if there was no content in a node (within a loop), and how to merge a variable with the results

Hi, I'm new to R and am trying to fetch the tickers/symbols from Yahoo Finance for company names in a text file (Adidas, BMW, etc.) in order to run an event study later. The file contains about 800 names. Some of them can be found on Yahoo and some cannot. (That's OK.)
My loop works so far, but missing results aren't shown. Furthermore, it only creates a table with the results that could be found. I would like to create a list which displays the variable i ("firmen") and the result that was found, or an NA in case there was no result.
Hope you guys can help me. Thank you!
My code:
library(rvest)
# company_names
firmen <- c(read.table("Mappe1.txt"))
# init
df <- NULL
# loop for search names in Yahoo Ticker Lookup
for(i in firmen){
  # find url
  url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
  page <- read_html(url, as = "text")

  # grab table
  table <- page %>%
    html_nodes(xpath = "//*[@id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
    html_text() %>%
    as.data.frame()

  # bind to dataframe
  df <- rbind(df, table)
}
I solved the first problem, and now empty nodes (where "i" was not found on the Yahoo page) are displayed as "NA".
Here is the code:
library(rvest)
# teams
firmen <- c(read.table("Mappe1.txt"))
# init
df <- NULL
table <- NULL
# loop
for(i in firmen){
  # find url
  url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
  page <- read_html(url, as = "text")

  # grab ticker from yahoo finance
  table <- page %>%
    html_nodes(xpath = "//*[@id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
    html_text(trim = TRUE) %>%
    replace(!nzchar(table), NA) %>%
    as.data.frame()

  # bind to dataframe
  df <- rbind(df, table)
}
Now there is just one question left: how can I merge "df" and "firmen" into one table with the columns "tickers" (= df) and "firmen" (= firmen)?
df has just one column, named ".", containing the results, while the list firmen contains the company names spread across many columns but with just one row.
Basically I need to transform the list "firmen", but I don't know how.
Thank you for the help.
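One possible approach (a sketch, assuming each element of firmen produced exactly one row in df, in order) is to flatten firmen into a plain character vector and bind it next to the scraped column:
# flatten the one-row, many-column list into a character vector
firmen_vec <- unlist(firmen, use.names = FALSE)

# pair each company name with its scraped ticker; df currently has a single column
result <- data.frame(firmen = firmen_vec, tickers = df[[1]])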

Reading googledrive contents from R

I'm aiming to get a list of all files in a Google Drive folder, as well as the associated metadata for those files. When I use drive_ls, it returns 3 columns {name, id, drive_resource}. drive_resource is structured like this: list(kind = "drive#file", id = "abc", ...). However, some of the list is not wrapped in quotation marks, and commas are also occasionally used where they are not separators.
Any ideas how I might properly unlist this? I can't find anything in the package that handles this.
Using the package 'googledrive', I can get a list of all the files:
a <- drive_ls(path = "abc", recursive = TRUE)
The attempt below gets close, but fails to get the column names and also splits some values in the wrong place when the string itself contains a comma.
a$drive_resource <- vapply(a$drive_resource, paste, collapse = ",", character(1L))
abcd <- a%>% separate(drive_resource, sep = ",", into = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30") )
You can try the following approach. It's an example with only four elements of the list (the selected names are specified in the function). The function maps the list contained in each row to a tibble, so you can unnest it.
require(googledrive)
require(dplyr)
f <- function(l){
  l[c("version", "webContentLink", "viewedByMeTime", "mimeType")] %>% as_tibble()
}
dr_content <- drive_ls(path = "<path>", recursive = TRUE)
dr_content <- dr_content %>% mutate(drive_resource = purrr::map(drive_resource, f))
dr_content <- dr_content %>% tidyr::unnest(drive_resource)
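To pick different fields for f(), the metadata names available in each drive_resource element can be inspected before the mutate() step (assuming at least one file was returned):
# list the metadata field names present for the first file
names(dr_content$drive_resource[[1]])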

Filtering process not fetching full data? Using dplyr filter and grep

I have this log file that has about 1200 characters (max) on a line. What I want to do is read this first and then extract certain portions of the file into new columns. I want to extract rows that contain the text “[DF_API: input string]”.
When I read it and then filter the rows that I am interested in, it almost seems like I am losing data. I tried this using the dplyr filter and using standard grep, with the same result.
Not sure why this is the case. I'd appreciate your help with this. The code and the data are at the following link.
Satish
Code is given below
library(dplyr)
library(stringr)  # needed for str_detect() below
setwd("C:/Users/satis/Documents/VF/df_issue_dec01")
sec1 <- read.delim(file="secondary1_aa_small.log")
head(sec1)
names(sec1) <- c("V1")
sec1_test <- filter(sec1,str_detect(V1,"DF_API: input string")==TRUE)
head(sec1_test)
sec1_test2 = sec1[grep("DF_API: input string",sec1$V1, perl = TRUE),]
head(sec1_test2)
write.csv(sec1_test, file = "test_out.txt", row.names = F, quote = F)
write.csv(sec1_test2, file = "test2_out.txt", row.names = F, quote = F)
Data (and code) is given at the link below. Sorry, I should have used dput.
https://spaces.hightail.com/space/arJlYkgIev
Try the code below, which should give you a dataframe of lines filtered from your file based on a matching condition.
#to read your file
sec1 <- readLines("secondary1_aa_small.log")
#framing a dataframe by extracting required lines from above file
new_sec1 <- data.frame(grep("DF_API: input string", sec1, value = T))
names(new_sec1) <- c("V1")
Edit: Simple way to split the above column into multiple columns
#extracting substring in between < & >
new_sec1$V1 <- gsub(".*[<\t]([^>]+)[>].*", "\\1", new_sec1$V1)
#replacing comma(,) with a white space
new_sec1$V1 <- gsub("[,]+", " ", new_sec1$V1)
#splitting into separate columns
new_sec1 <- strsplit(new_sec1$V1, " ")
new_sec1 <- lapply(new_sec1, function(x) x[x != ""] )
new_sec1 <- do.call(rbind, new_sec1)
new_sec1 <- data.frame(new_sec1)
Change the column names as needed for your analysis.
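For example, placeholder names can be assigned first and refined once you know what each field holds (the names below are just illustrative):
# assign generic names to the split columns, then rename as appropriate
names(new_sec1) <- paste0("field_", seq_along(new_sec1))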
