How can I remove "/url?q=" from text data in R

I want to remove "/url?q=" from text data in RStudio.
This is my code for a Google search:
## Code for Google Search
library(RCurl)  # for getURL()
library(XML)    # for htmlTreeParse(), getNodeSet(), xmlAttrs()
# Enter Search Term Here
search.term <- "r-project"
# Creating Function
getGoogleURL <- function(search.term, domain = '.co.in', quotes=TRUE)
{
# Getting Search Term
search.term <- gsub(' ', '%20', search.term)
if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
# Putting Search Term in Google Search
paste('http://www.google', domain, '/search?q=', search.term, sep = '')
}
## Get Links from Google Search
# Creating Function to Get URLs From Search Results
getGoogleLinks <- function(google.url) {
# Creating a File to Save URLs
doc <- getURL(google.url, httpheader = c("User-Agent" = "R(3.4.0)"))
# Removing HTML code and Setting Nodes
html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
nodes <- getNodeSet(html, "//h3[@class='r']//a")
return(sapply(nodes, function(x) xmlAttrs(x)[["href"]]))
}
## Remove quoted text, Create URL List
quotes <- "FALSE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
links <- getGoogleLinks(search.url)
## Print URL List
links
And my result is:
[1] "/url?q=https://www.r-project.org/&sa=U&ved=0ahUKEwj78ZWXoabUAhUcTI8KHaTEDTIQFggUMAA&usg=AFQjCNEqtiOAIA7OOTa3meWC8zaTjjTy8A"
[2] "/url?q=http://www.cran.r-project.org/&sa=U&ved=0ahUKEwj78ZWXoabUAhUcTI8KHaTEDTIQjBAIGzAB&usg=AFQjCNF8QmYbLzG0c66QZM2wsXF1n1-9tQ"
What can I do to remove "/url?q=" from the links above?

You can use gsub(). Run the same search code as in the question, then clean up the links; note that ? is a regular-expression metacharacter, so the prefix has to be matched as a fixed string:
gsub("/url?q=", "", links)

I solved it this way, since the prefix "/url?q=" is a fixed number of characters (7):
links <- substring(links,8)
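Both approaches above still leave Google's tracking parameters (&sa=..., &ved=..., &usg=...) attached to each link. A hedged variant, not from the original answers, that strips the prefix and those trailing parameters with sub():
clean <- sub("^/url\\?q=", "", links)  # drop the leading "/url?q=" (the ? is escaped)
clean <- sub("&sa=.*$", "", clean)     # drop "&sa=" and everything after it
clean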

As an alternative to @JTeam's answer, you can try this (given that the links always start with "/url?q="):
lapply(links, function(x) paste0(strsplit(x, '=')[[1]][-1], collapse = '='))  # collapse with '=' so the remaining '=' signs in the URL are preserved
This gives you a nice list of clean links (if you prefer a vector, try sapply)
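For example, a sapply() version of the same expression (a small sketch based on the line above) returns a plain character vector:
sapply(links, function(x) paste0(strsplit(x, '=')[[1]][-1], collapse = '='),
       USE.NAMES = FALSE)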

Related

Downloading txt files from multiple directories using R

I've been trying to download .txt files from https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/
So far, I've managed to download the complete set for 1850 by using the code from "Download all the files (.zip and .txt) from a webpage using R", which, for my case, is:
page <- "https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/1850/"
a <- readLines(page)
loc.txt <- grep(".txt", a)
#------------------------------------
convfn <- function(line, marker, page){
i <- unlist(gregexpr(pattern = 'href="', line)) + 6
i2 <- unlist(gregexpr(pattern = marker, line)) + 3
#target file
.destfile <- substring(line, i[1], i2[1])
#target url
.url <- paste(page, .destfile, sep = "/")
#print targets
cat(.url, '\n', .destfile, '\n')
#the workhorse function
download.file(url=.url, destfile=.destfile)
}
#------------------------------------
print(getwd())
sapply(a[loc.txt], FUN = convfn, marker = '.txt"', page = page)
I would like to know how to write a function that automates replacing the years 1850 to 2022, since doing this by hand would be long and repetitive (over 170 years). My idea is stuck on this line:
page <- paste0("https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/", c(seq(1850, 2022, by = 1)), "/")
but I do not know how to turn it into a working function.
Please help, thank you and keep safe
Best regards,
Raven
I'd be inclined to do this differently: it's better to use XML/XPath to extract the file links.
library(httr) # for GET(...)
library(XML) # for htmlParse(...)
base.url <- 'https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M'
get.docs <- function(year) {
url <- paste(base.url, year, sep='/')
html <- htmlParse(content(GET(url), type='text'))
file.names <- html['//td/a/@href'][-1] # first href is to the parent directory; remove it
##
# uncomment next line to save files
#
# mapply(download.file, paste(url, file.names, sep='/'), file.names)
print(sprintf('Downloaded from: %s to: %s', paste(url, file.names, sep='/'), file.names))
}
lapply(1850:2022, get.docs)
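If you would rather keep the readLines() approach from the question, here is a minimal sketch (the download.year name is ours, and it assumes the question's convfn() helper is already defined) that wraps that approach in a function over the years:
base.url <- "https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/"
download.year <- function(year) {
  page <- paste0(base.url, year, "/")  # directory listing for one year
  a <- readLines(page)
  loc.txt <- grep(".txt", a)           # lines that contain links to .txt files
  sapply(a[loc.txt], FUN = convfn, marker = '.txt"', page = page)
}
lapply(1850:2022, download.year)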

Extracting data from html pages using regex

I feel like I'm very close to a solution here but can't seem to figure out why I'm not getting any result. I have an HTML page and I'm trying to parse out some IDs from it. I'm 99% certain my regex code is right, but for some reason I'm not getting any output.
In the HTML source, there are many IDs wrapped in text like /boardgame/9999/asdf. My regex code should pull out the /9999/ bit, but I can't figure out why it's just returning the same input HTML character string that I put in.
library(RCurl)
library(XML)
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.parse <- sub("boardgame(.*?)[a-z]", "\\1", html)
Any thoughts?
I think your pattern was not accurate: it was also picking up other words starting with "boardgame", such as "boardgames".
This should work for one single ID.
id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
In my hands, it returns:
[1] "226501"
Also, I found many IDs in this HTML page. To catch them all in one list, you could do as follows.
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.list <- list()
while (regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html) > 0) {
id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
id.list[[(length(id.list) + 1)]] <- gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
html <- substr(html, (id.pos + attributes(id.pos)$match.length), nchar(html))
}
id.list
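A hedged alternative to the while loop (run it on the original html string, before the loop consumes it): gregexpr() together with regmatches() extracts every match at once, and the digits can then be stripped out in a single vectorized step:
m <- gregexpr("boardgame/[[:digit:]]{3,10}/", html)
ids <- gsub("[^[:digit:]]", "", unlist(regmatches(html, m)))
ids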

How to scrape all pages (1, 2, 3, ..., n) from a website using rvest

# I would like to read the list of .html files to extract data. Appreciate your help.
library(rvest)
library(XML)
library(stringr)
library(data.table)
library(RCurl)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- ("C:/R/BNB/")
pages <- html_text(html_node(u1, ".results_count"))
Total_Pages <- substr(pages, 4, 7)
TP <- as.numeric(Total_Pages)
# reading first two pages, writing them as separate .html files
for (i in 1:TP) {
url <- paste(u0, "page=/", i, sep = "")
download.file(url, paste(download_folder, i, ".html", sep = ""))
#create html object
html <- html(paste(download_folder, i, ".html", sep = ""))
}
Here is a potential solution:
library(rvest)
library(stringr)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- getwd() #note change in output directory
TP <- max(as.integer(html_text(html_nodes(u1, "a.page-numbers"))), na.rm = TRUE)
# read each results page and save it as a separate .html file
for (i in 1:TP ) {
url <- paste(u0,"page/",i, "/", sep="")
print(url)
download.file(url,paste(download_folder,i,".html",sep=""))
#create html object
html <- read_html(paste(download_folder,i,".html",sep=""))
}
I could not find the .results_count class in the HTML, so instead I looked for the page-numbers class and picked the highest value returned.
Also, the function html() is deprecated, so I replaced it with read_html().
Good luck
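If you do not need the intermediate .html files on disk, a minimal sketch (assuming the same URL pattern as above) reads each page straight into a list of parsed documents:
# pages[[i]] is then ready for html_nodes() / html_text()
pages <- lapply(1:TP, function(i) read_html(paste0(u0, "page/", i, "/")))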

xpathSApply not finding required node

I'm trying to write some code to return the values of a given element in an XML feed. The following code works for all of the feeds except uk_legislation_feed. Can someone give me a hint as to why this might be and how to fix the problem? Thanks.
library(XML)
uk_legislation_feed <- c("http://www.legislation.gov.uk/new/data.feed", "xml", "//title")
test_feed <- c("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml", "xml", "//zipcode")
ons_feed <- c("https://www.ons.gov.uk/releasecalendar?rss", "xml", "//title")
read_data <- function(feed) {
if (feed[2] == "xml") {
if (!file.exists(feed[1])) download.file(feed[1], "tmp.xml", "curl")
dat <- xmlRoot(xmlTreeParse("tmp.xml", useInternalNodes = TRUE))
}
titles <- xpathSApply(dat, feed[3], xmlValue)
return(titles)
}
The uk_legislation_feed document declares a default namespace, http://www.w3.org/2005/Atom, without a prefix, so unprefixed XPath expressions such as //title match no nodes. You need to bind that namespace URI to a prefix and use the prefix in the XPath expression:
url <- "http://www.legislation.gov.uk/new/data.feed"
webpage <- readLines(url)
file <- xmlParse(webpage)
nmsp <- c(ns="http://www.w3.org/2005/Atom")
titles <- xpathSApply(file, "//ns:title", xmlValue,
namespaces = nmsp)
titles
# [1] "Search Results"
# [2] "The Air Navigation (Restriction of Flying) (RNAS Culdrose) (Amendment) \
# Regulations 2016"
...
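As a sketch of how this could be folded back into the question's read_data() helper (our assumption, not part of the answer), the namespace mapping can be passed through as an optional argument:
read_data <- function(feed, ns = NULL) {
  if (feed[2] == "xml") {
    download.file(feed[1], "tmp.xml", "curl")
    dat <- xmlParse("tmp.xml")
  }
  # forward the namespace mapping to xpathSApply() only when one is supplied
  if (is.null(ns)) xpathSApply(dat, feed[3], xmlValue)
  else xpathSApply(dat, feed[3], xmlValue, namespaces = ns)
}
uk_legislation_feed <- c("http://www.legislation.gov.uk/new/data.feed", "xml", "//ns:title")
read_data(uk_legislation_feed, ns = c(ns = "http://www.w3.org/2005/Atom"))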

R scrape HTML table with multiple subheaders

I'm trying to import the list of nuclear test sites (from Wikipedia's page) into a data.frame using the code below:
library(RCurl)
library(XML)
theurl <- "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# Find XPath (go the webpage, right-click inspect element, find table then right-click copyXPath)
myxpath <- "//*[@id='mw-content-text']/table[2]"
# Extract table header and contents
tablehead <- xpathSApply(pagetree, paste(myxpath,"/tr/th",sep=""), xmlValue)
results <- xpathSApply(pagetree, paste(myxpath,"/tr/td",sep=""), xmlValue)
# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 5, byrow = TRUE))
names(content) <- c("Testing country", "Location", "Site", "Coordinates", "Notes")
However, there are multiple sub-headers that prevent the data.frame from being populated consistently. How can I fix this?
Take a look at the htmltab package. It allows you to use the subheaders for populating a new column:
library(htmltab)
tab <- htmltab("https://en.wikipedia.org/wiki/List_of_nuclear_test_sites",
which = "/html/body/div[3]/div[3]/div[4]/table[2]",
header = 1 + "//tr/th[@style='background:#efefff;']",
rm_nodata_cols = F)
I found this example by Carson Sievert that worked well for me:
library(rvest)
theurl <- "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites"
# First, grab the page source
content <- read_html(theurl) %>%
# then extract the first node with class of wikitable
html_node(".wikitable") %>%
# then convert the HTML table into a data frame
html_table()
Have you tried this?
l.wiki.url <- getURL( url = "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites" )
l.wiki.par <- htmlParse( file = l.wiki.url )
l.tab.con <- xpathSApply( doc = l.wiki.par
, path = "//table[@class='wikitable']//tr//td"
, fun = xmlValue
)
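A further hedged sketch, assuming a current version of rvest: html_table() fills spanned cells itself, so the sub-header rows come through as rows whose cells all repeat the same text and can be filtered out afterwards:
library(rvest)
tab <- read_html(theurl) %>% html_node(".wikitable") %>% html_table()
# drop sub-header rows: their spanned cells repeat one value across all columns
tab <- tab[apply(tab, 1, function(r) length(unique(r)) > 1), ]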
