For some textual analysis in R, I would like to download several webpages that share a very similar design. I have tried the code below on a couple of pages, and it indeed keeps only the lines I am interested in.
thepage= readLines("http://example/xwfw_665399/s2510_665401/t1480900.shtml")
thepage2 = readLines("http://example/xwfw_665399/s2510_665401/2535_665405/t851768.shtml")
mypattern1 = '<P style=\\"FONT.*\\">'
datalines1 = grep(mypattern1, thepage, value=TRUE)
datalines2 = grep(mypattern1, thepage2, value=TRUE)
mypattern2 = '<STRONG>'
mypattern3 = '</STRONG>'
mypattern4 = '</P>'
page1=gsub(mypattern1,"",datalines1)
page1=gsub(mypattern2,"", page1)
page1=gsub(mypattern3,"",page1)
page1=gsub(mypattern4,"",page1)
page2=gsub(mypattern1,"",datalines2)
page2=gsub(mypattern2,"", page2)
page2=gsub(mypattern3,"",page2)
page2=gsub(mypattern4,"",page2)
As you might see, the URLs are very similar and share the directory s2510_665401/.
Now I wonder: is there a way to automatically retrieve all possible files under s2510_665401/ and have my code run over them? Despite some googling, I haven't been able to find anything. Would it require writing a function? If so, would someone please point me in the right direction?
Thanks!
This is not a final, working answer, and I very rarely do web scraping, so I am not sure how well this generalizes to other websites, but it may help you in the right direction. Consider the R manual index page used in the example below. We can write a function that extracts all .html references from a page, and then crawl those references again with the same function.
So in the code below, unique(refs_1) contains all pages that are one level deep, and unique(refs_2) contains all pages that are two levels deep.
You would still need a wrapper to stop after a certain number of iterations, perhaps to prevent recrawling of already visited pages (setdiff(refs_2, refs_1)?), etc.
When you have all the URLs to scrape (in this case unique(c(refs_1, refs_2))), you should wrap your own read script in a function f and call lapply(x, f), where x is the list/vector of URLs (see the sketch after the example output below).
Anyway, hope this helps!
library(qdapRegex)
get_refs_on_page <- function(page){
  lines <- tryCatch(readLines(page),
                    error   = function(cond) NA,
                    warning = function(cond) NA)
  refs  <- lapply(lines, function(x) rm_between(x, "href=\"", "\"", extract = TRUE)[[1]])
  refs  <- unlist(refs)
  refs  <- refs[!is.na(refs)]
  return(refs)
}
thepage = 'https://stat.ethz.ch/R-manual/R-devel/library/utils/'
refs_1 = get_refs_on_page(thepage)
refs_2 = unlist(lapply(paste0(thepage,refs_1),get_refs_on_page))
Example output:
> unique(c(refs_1,refs_2))
[1] "?C=N;O=D" "?C=M;O=A" "?C=S;O=A"
[4] "?C=D;O=A" "/R-manual/R-devel/library/" "DESCRIPTION"
[7] "doc/" "html/" "?C=N;O=A"
[10] "?C=M;O=D" "?C=S;O=D" "?C=D;O=D"
[13] "/doc/html/R.css" "/doc/html/index.html" "../../../library/utils/doc/Sweave.pdf"
[16] "../../../library/utils/doc/Sweave.Rnw" "../../../library/utils/doc/Sweave.R" "Sweave.Rnw.~r55105~"
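To connect this back to the question, a rough, untested sketch of wrapping the cleanup code from the question in a function and running it over all the collected URLs with lapply() could look like this (the regex patterns are simply copied from the question):
# Sketch only: `urls` is assumed to hold the pages you want to process,
# e.g. unique(c(refs_1, refs_2)) or your own vector of full URLs
scrape_page <- function(url) {
  thepage   <- readLines(url)
  datalines <- grep('<P style=\\"FONT.*\\">', thepage, value = TRUE)
  cleaned   <- gsub('<P style=\\"FONT.*\\">', "", datalines)
  cleaned   <- gsub('<STRONG>|</STRONG>|</P>', "", cleaned)
  cleaned
}
urls  <- unique(c(refs_1, refs_2))
pages <- lapply(urls, scrape_page)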
For a project I am creating different layers which should all be written into one GeoPackage.
I am using QGIS 3.16.1 and the Python console inside QGIS, which runs on Python 3.7.
I have tried many things but cannot figure out how to do this. This is what I have used so far:
vl = QgsVectorLayer("Point", "points1", "memory")
vl2 = QgsVectorLayer("Point", "points2", "memory")

pr = vl.dataProvider()
pr.addAttributes([QgsField("DayID", QVariant.Int), QgsField("distance", QVariant.Double)])
vl.updateFields()

f = QgsFeature()
for x in range(len(tag_temp)):
    f.setGeometry(QgsGeometry.fromPointXY(QgsPointXY(lon[x], lat[x])))
    f.setAttributes([dayID[x], distance[x]])
    pr.addFeature(f)
vl.updateExtents()

# I'll do the same for vl2 but with other data

uri = "D:/Documents/QGIS/test.gpkg"
options = QgsVectorFileWriter.SaveVectorOptions()
context = QgsProject.instance().transformContext()

QgsVectorFileWriter.writeAsVectorFormatV2(vl, uri, context, options)
QgsVectorFileWriter.writeAsVectorFormatV2(vl2, uri, context, options)
The problem is that in 'test.gpkg' a single layer called 'test' is created, not 'points1' or 'points2'.
And the second QgsVectorFileWriter.writeAsVectorFormatV2() call also overwrites the output of the first one instead of appending its layer to the existing GeoPackage.
I also tried creating single GeoPackages and then using the 'Package Layers' processing tool (processing.run("native:package")) to merge all layers into one GeoPackage, but then the attribute types are unfortunately all converted into strings.
Any help is much appreciated. Many thanks in advance.
You need to change the SaveVectorOptions, in particular the actionOnExistingFile mode, after creating the .gpkg file:
options = QgsVectorFileWriter.SaveVectorOptions()
#options.driverName = "GPKG"
options.layerName = vl.name()
QgsVectorFileWriter.writeAsVectorFormatV2(vl, uri, context, options)

# switch mode to append a layer instead of overwriting the file
options.actionOnExistingFile = QgsVectorFileWriter.CreateOrOverwriteLayer
options.layerName = vl2.name()
QgsVectorFileWriter.writeAsVectorFormatV2(vl2, uri, context, options)
The documentation is here: SaveVectorOptions
I also tried creating single GeoPackages and then using the 'Package Layers' processing tool (processing.run("native:package")) to merge all layers into one GeoPackage, but then the attribute types are unfortunately all converted into strings.
That packaging approach is definitely the recommended way; the attribute types being converted to strings sounds like a bug, so please consider reporting it.
My goal: using R, scrape all light bulb model numbers and prices from homedepot.
My problem: I cannot find the URLs for ALL the light bulb pages. I can scrape one page, but I need a way to get the URLs so I can scrape them all.
Ideally I would like these product pages:
https://www.homedepot.com/p/TOGGLED-48-in-T8-16-Watt-Cool-White-Linear-LED-Tube-Light-Bulb-A416-40210/205935901
but even getting list pages like this one would be OK:
https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu
I tried crawlr -> it does not work on homedepot (maybe because of https?), and I tried to get at specific pages directly.
I tried rvest -> I tried using html_form() and set_values() to put 'light bulb' in the search box, but the form comes back as:
[[1]]
<form> 'headerSearchForm' (GET )
<input hidden> '': 21
<input text> '':
<button > '<unnamed>
and set_values() will not work because the input's name is '', so this error comes back:
Error: attempt to use zero-length variable name.
I also tried using the paste function and lapply:
tmp <- lapply(0:696, function(page) {
  url  <- paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu?Nao=", page, "4&Ns=None")
  page <- read_html(url)
  html_table(html_nodes(page, "table"))[[1]]
})
I got the error: Error in html_table(html_nodes(page, "table"))[[1]] : subscript out of bounds.
I am seriously at a loss and any advice or tips would be so fantastic.
You can do it through rvest and tidyverse.
You can find a listing of all the bulbs starting on this page, with a pagination of 24 bulbs per page across 30 pages:
https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79
Take a look at the pagination grid at the bottom of the initial page.
You could extract the link to each page listing 24 bulbs by following/extracting the links in that pagination grid.
Yet, just by comparing the URLs it becomes evident that all pages follow a pattern, with "https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79" as the root and a tail such as "?Nao=24", where the final digits indicate the first light bulb displayed on that page.
So you could simply infer the structure of each URL pointing to a page of bulbs. The following command creates such a list in R:
library(rvest)
library(tidyverse)
index_list <- as.list(seq(0,(24*30), 24)) %>% paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79?Nao=", . )
Now, to extract the URL for each light bulb page, a combination of a function and purrr's map() comes in handy.
To extract the individual bulb URLs from the index pages, we can call this:
scrap_bulbs <- function(url){
  object <- read_html(as.character(url))
  object <- html_nodes(x = object, xpath = "//a[@data-pod-type='pr']")
  object <- html_attr(x = object, 'href')
  Sys.sleep(10)  ## Courtesy pause of 10 seconds, prevents the website from possibly blocking your IP
  paste0('https://www.homedepot.com', object)
}
Now we store the results in a list created by map().
bulbs_list <- map(.x = index_list, .f = scrap_bulbs)
unlist(bulbs_list)
Done!
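From there, to go from the bulb URLs to the model numbers and prices asked for in the question, a rough, untested sketch could look like the following; the CSS selectors ".modelNo" and ".price" are assumptions and need to be checked against the actual page source:
# Sketch only: the selectors below are guesses, inspect the page to confirm them
scrape_bulb <- function(url){
  page <- read_html(url)
  tibble(
    url   = url,
    model = page %>% html_node(".modelNo") %>% html_text(trim = TRUE),
    price = page %>% html_node(".price")   %>% html_text(trim = TRUE)
  )
}
bulbs <- map_dfr(unlist(bulbs_list), scrape_bulb)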
Okay guys, I have what I'm sure is an entry-level problem. Still, I cannot explain it. Here's my code and its error:
> sample1 = readHTMLTable(http://www.pro-football-reference.com/boxscores/201609150buf.htm, which = 16)
Error: unexpected '/' in "sample1 = readHTMLTable(http:/"
It's having a problem with the second forward slash? Not only does every URL have two forward slashes, but I've pored over countless examples of this function, both on this site and others, and they all format this code in this way. So, what am I doing wrong?
Additionally, I've tried it without the http:// part:
> sample1 = readHTMLTable(www.pro-football-reference.com/boxscores/201609150buf.htm, which = 16)
Error: unexpected symbol in "sample1 = readHTMLTable(www.pro-football-reference.com/boxscores/201609150buf.htm"
Here, I'm not even sure which symbol it's talking about.
Please explain.
The issue is that you need to place your URL in quotes (""). The following does return the table from your specified URL:
sample1 = readHTMLTable("www.pro-football-reference.com/boxscores/201609150buf.htm")
As you probably know, the "which=" parameter is used to select which of the tables on that page you would like to retrieve. However, my own attempts show that only 1 and 2 work. Could you tell me which table you are attempting to read into R? If this method doesn't end up working, you can also attempt to read in the entire webpage and parse out the table in question.
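For example, a rough sketch of that alternative with rvest might look like this; the [[1]] index is only a placeholder, so check length(tables) to find the one you need:
# Sketch only: parse the raw HTML and pull out whichever tables it exposes
library(rvest)
page    <- read_html("http://www.pro-football-reference.com/boxscores/201609150buf.htm")
tables  <- html_table(html_nodes(page, "table"), fill = TRUE)
length(tables)          # how many tables were actually parsed
sample1 <- tables[[1]]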
Hope this helps get things started!
I frequently use user defined functions in my code.
RStudio supports automatic code completion using the Tab key. I find this amazing because I can always quickly see what is supposed to go in the (...) of a function call.
However, my user-defined functions just show the parameters: no additional info and, obviously, no help page.
This isn't much of a pain for me, but when I share code I think it would be useful to have some information at hand besides the # comments on every line.
Nowadays, when I share, my functions usually look like this:
myfun <- function(x1, x2, x3, ...){
  # This is a function for this and that
  # x1 is a factor, x2 is an integer ...

  # This line of code is useful for transformation of x2 by x1
  some code here

  # Now we do this other thing
  more code

  # This is where the magic happens
  return(magic)
}
I think this line by line comment is great but I'd like to improve it and make some things handy just like every other function.
Not really an answer, but if you are interested in exploring this further, you should start at the ?rcompgen help page (although rcompgen is not itself a function name) and also examine the code of:
rc.settings
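For instance, a minimal sketch of inspecting and toggling those settings (the arguments are the ones documented on the ?rcompgen page) could be:
rc.settings()              # called without arguments, this should return the current settings
rc.settings(help = TRUE)   # e.g. switch on completion of help topics after '?'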
Also, executing this allows you to see what the .CompletionEnv has in it for currently loaded packages:
names(rc.status())
#-----
[1] "attached_packages" "comps" "linebuffer" "start"
[5] "options" "help_topics" "isFirstArg" "fileName"
[9] "end" "token" "fguess" "settings"
And if you just look at:
rc.status()$help_topics
... you see the character items that the tab-completion mechanism uses for matching. On my machine at the moment there are 8881 items in that vector.
How do I find/use op:except with the multiple xml files?
I've gotten the nodes from file 1 and file 2, and in the XQuery expression I'm trying to find the op:except of those two. When I use op:except, I end up getting an empty set.
XML File 1:
<a>txt</a>
<a>txt2</a>
<a>txt3</a>
XML File 2:
<a>txt2</a>
<a>txt4</a>
<a>txt3</a>
I want the output of op:except($nodesfromfile1, $nodesfromfile2) to be:
<a>txt</a>
It effectively comes down to the single line after the return keyword in the following code. You could put that in a function if you like, but it is already very dense, so maybe it's not worth it.
let $file1 := (
<a>txt</a>,
<a>txt2</a>,
<a>txt3</a>
)
let $file2 := (
<a>txt2</a>,
<a>txt4</a>,
<a>txt3</a>
)
return
$file1[not(. = $file2)]
Note, you also have the except keyword ($file1 except $file2), but that works on node identity, which won't help if the nodes come from different files.
By the way, the above code uses string equality for the comparison. If you would prefer to compare the full node structure, you could also use the deep-equal() function.
HTH!