I just started web scraping with Google Sheets and would like some clarity on the issues below.
Here is a screenshot that illustrates the issues:
Formula used
A28 =IMPORTXML("https://www.thehuboug.com/collections/all?limit=100&sort=price+asc","//div[@class='grid__item small--one-half medium--one-half large--one-quarter']/a/@href")
B28 =CONCATENATE("https://www.thehuboug.com",A28)
C28 =IMPORTXML(B28,"//div[@class='grid__item large--one-half']/h1")
D28 =TRANSPOSE(INDEX(IMPORTHTML(B28,"table"),,2))
Questions
I can't figure out why C28 fails to get the data while C29 has no issue getting it.
D28 is able to get the data, but D29:D33 are not.
When I changed https -> http, I was able to get data for D34 but not C34. What does this mean?
Initially, when I first entered the formulas, everything was extracted correctly. However, when I re-opened the file the next day, I started to get results like this, with the error "Resource at url not found". If I copy the same formulas onto a new sheet, everything works again. Overall, I am pretty confused about this behaviour and would like to understand what can be improved to solve it.
Thanks in advance for helping me with this!
A28:
=IMPORTXML(
"https://www.thehuboug.com/collections/all?limit=100&sort=price+asc",
"//div[#class='grid__item small--one-half medium--one-half large--one-quarter']/a/#href")
B28:
=ARRAYFORMULA("https://www.thehuboug.com"&SUBSTITUTE(A28:A60, "/collections/all",, 1))
C28 and drag down:
=IMPORTXML(B28, "//div[@class='grid__item large--one-half']/h1")
D28 and drag down:
=IFERROR(TRANSPOSE(INDEX(IMPORTHTML(B28, "table"),,2)))
Related
Note: I haven't asked a question here before and am still not sure how to make this legible, so let me know of any confusion or tips on making this more readable.
I'm trying to download user information from the 2004/06 to 2004/09 Internet Archive captures of makeoutclub.com (a wacky, now-defunct social network targeted toward alternative music fans, which was created in ~2000, making it one of the oldest profile-based social networks on the Internet) using R,* specifically the rcrawler package.
So far, I've been able to use the package to get the usernames and profile links into a dataframe, using xpath to identify the elements I want, but somehow it doesn't work for either the location or interests sections of the profiles, both of which are just text rather than other elements in the html. For an idea of the site/data I'm talking about, here's the page I've been testing my xpath on: https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html
I have been testing out my xpath expressions using rcrawler's ContentScraper function, which extracts the set of elements matching the specified xpath from one specific page of the site you need to crawl. Here is my functioning expression that identifies the usernames and links on the site, with the specific page I'm using specified, and returns a vector:
testwaybacktable <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = c("//tr[1]/td/font/a[1]/#href", "//tr[1]/td/font/a[1]"), ManyPerPattern = TRUE)
And here is the bad one, where I'm testing the "location" text, which ends up returning an empty vector:
testwaybacklocations <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = "//td/table/tbody/tr[1]/td/font/text()[2]", ManyPerPattern = TRUE)
And the other bad one, this one looking for the text under "interests":
testwaybackint <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = "//td/table/tbody/tr[2]/td/font/text()", ManyPerPattern = TRUE)
The xpath expressions I'm using here seem to select the right elements when I search for them in the Chrome Inspect tool, but the program doesn't seem to read them. I also tried selecting only one element for each field, and it still produced an empty vector. I know that this tool can read text in this webpage (I tested another random piece of text), but somehow I'm getting nothing when I run this test.
Is there something wrong with my xpath expression? Should I be using different tools to do this?
Thanks for your patience!
*This is for a digital humanities project that will hopefully use some NLP to analyze language around gender and sexuality in particular, in dialogue with NLP analysis of the lyrics of the most popular bands on the site.
A late answer, but maybe it will help nonetheless. Also, I am not sure about the whole TOS question, but I think that's yours to figure out. Long story short ... I will just try to address the technical aspects of your problem ;)
I am not familiar with the rcrawler package. Usually I use rvest for web scraping and I think it is a good choice. To achieve the desired output you would have to use something like:
# parameters
url <- your_url
xpath_pattern <- your_pattern
# get the data
wp <- xml2::read_html(url)
# extract whatever you need
res <- rvest::html_nodes(wp, xpath = xpath_pattern)
I think it is not possible to pass a vector with multiple patterns as the xpath argument, but you can run html_nodes separately for each pattern you want to extract.
I think the first two patterns should work this way. The pattern in your last call seems to be wrong somehow. If you want to extract the text inside the tables, it should probably be something like "//tr[2]/td/font/text()[2]".
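Putting that together for the archived page from the question, a minimal sketch could look like this (the xpath strings are the ones discussed above; I have not verified them against the archived markup, so treat them as a starting point):
library(xml2)
library(rvest)
# the archived profile page from the question
url <- "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html"
wp <- xml2::read_html(url)
# one html_nodes() call per pattern, then pull out the text or the href attribute
usernames <- rvest::html_text(rvest::html_nodes(wp, xpath = "//tr[1]/td/font/a[1]"))
links     <- rvest::html_attr(rvest::html_nodes(wp, xpath = "//tr[1]/td/font/a[1]"), "href")
locations <- rvest::html_text(rvest::html_nodes(wp, xpath = "//tr[1]/td/font/text()[2]"))
interests <- rvest::html_text(rvest::html_nodes(wp, xpath = "//tr[2]/td/font/text()[2]"))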
I'm using dplyr in R (Microsoft R Open 3.5.3, to be precise) and having a slight problem whereby I sometimes see lots of additional information in the data frame I create. For example, with these lines of code:
claims_frame_2 <- left_join(claims_frame,
                            select(new_policy_frame, c(Lookup_Key_4, Exposure_Year, RowName)),
                            by = c("Accident_Year" = "Exposure_Year", "Lookup_Key_4" = "Lookup_Key_4"))
claims_frame_3 <- claims_frame_2 %>% group_by(Claim.Number) %>% filter(RowName == max(RowName))
No problem with the left_join command, but when I run the second command (group by/filter), the data structure of the claims_frame_3 object is different from that of the claims_frame_2 object. It seems to suddenly have lots of attributes (something I know little about) attached to the RowName field. See the attached photo.
Does anyone know why this happens and how I can stop it?
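For reference, attributes on a single column can be inspected and dropped by hand; this is only a generic sketch, not a confirmed explanation of the behaviour above:
# see what attributes RowName picked up after the join/filter
attributes(claims_frame_3$RowName)
# drop them, keeping only the underlying values
claims_frame_3$RowName <- as.vector(claims_frame_3$RowName)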
I had hoped to put together a small chunk of reproducible code that demonstrated this happening, but so far I haven't been successful. I will continue. In the meantime, I'm hoping someone might see this code (from a real project) and immediately know why this is happening!
Grateful for any advice.
Thanks
Alan
So, here's the current situation:
1. I have 2000+ lines of R code that produces a couple dozen text files. This code runs in under 10 seconds.
2. I then manually paste each of these text files into a website, wait ~1 minute for the website's response (they're big text files), then manually copy and paste the response into Excel, and finally save them as text files again. This takes hours and is prone to user error.
3. Another ~600 lines of R code then combines these dozens of text files into a single analysis. This takes a couple of minutes.
I'd like to automate step 2, and I think I'm close; I just can't quite get it to work. Here's some sample code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
The code runs and every time I've done it "balcoResults" comes back with "Status: 200". Success! EXCEPT the file size is 0...
I don't know where the problem is, but my best guess is that the text block isn't getting filled out before the form is submitted. If I go to the website (http://hess.ess.washington.edu/math/v3/v3_age_in.html) and manually submit an empty form, it produces a blank webpage: pure white, nothing on it.
The problem with this potential explanation (and with me fixing the code) is that I don't know why the text block wouldn't be filled out. The result of set_values tells me that "text_block" has 120 characters in it. This is the correct length for textString. I don't know why these 120 characters wouldn't be pasted into the web form.
An alternative possibility is that R isn't waiting long enough to get a response from the website, but this seems less likely because a single sample (as here) runs quickly and the status code of the response is 200.
Yesterday I took the DataCamp course on "Working with Web Data in R." I've explored GET and POST from the httr package, but I don't know how to pick apart the GET response to modify the form and then have POST submit it. I've considered trying the package RSelenium, but according to what I've read, I'd have to download and install a "Selenium Server". This intimidates me, but I could probably do it, if I were convinced that RSelenium would solve my problem. When I look on CRAN at the function names in the RSelenium package, it's not clear which ones would help me. Without firm knowledge of how RSelenium would solve my problem, or even whether it would, this seems like a poor return on the time investment required. (But if you guys told me it was the way to go, and which functions to use, I'd be happy to do it.)
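For reference, a direct httr submission might look roughly like the sketch below; the form's action URL is a placeholder (I haven't checked where this page actually posts to), and the field names are the ones from the rvest code above:
library(httr)
# placeholder: the URL in the form's action attribute (readable from the parsed form)
formAction <- "<the form's action URL>"
r <- POST(formAction,
          body = list(summary = "no", text_block = textString),
          encode = "form")   # send as an ordinary form submission
stop_for_status(r)
resultText <- content(r, as = "text")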
I've explored SO for fixes, but none of the posts that I've found have helped. I've looked here, here, and here, to list three.
Any suggestions?
After two days of thinking, I spotted the problem. I didn't assign the result of the set_values function to a variable (if that's the right R terminology).
Here's the corrected code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
balcoForm <- set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
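From there, the calculator's response can be pulled out of the returned session and written to disk, replacing the hand copy-and-paste step. A small sketch, assuming the session object exposes the underlying httr response as $response (an assumption about rvest's internals in the version used here):
# extract the response body and save it so the downstream R code can read it
# like the old hand-pasted files
resultHtml <- httr::content(balcoResults$response, as = "text")
writeLines(resultHtml, "balco_response.html")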
I am fairly new to R and am having trouble pulling data from the Forbes website.
My current code is:
library(XML)
url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
data <- readHTMLTable(url)
However, when I change the page # in the URL from 1 to 2 (or to any other number), the data that is pulled is the same data from page 1. For some reason R does not pull the data from the correct page. If you manually paste the link with a specific page # into the browser, then it works fine.
Does anyone have an idea as to why this is happening?
Thanks!
This appears to be an issue caused by URL fragments, which the pound sign represents. It essentially creates an anchor on a page and directs your browser to jump to that particular location.
You might be having this trouble because readHTMLTable() may not be designed to handle URL fragments. See if you can find a version of the same table that does not require # in the URL.
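To see that the page selection lives entirely in the fragment, you can parse the URL; everything after the # is handled by the browser (presumably by the page's JavaScript) and is never sent to the server, which is why the same page 1 table comes back each time. A quick illustration using httr::parse_url:
library(httr)
u <- parse_url("http://www.forbes.com/global2000/list/#page:2_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states")
u$fragment   # the part after '#', interpreted client-side rather than by the server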
Here are some helpful links that might shed light on what you are experiencing:
What is it when a link has a pound "#" sign in it
https://support.microsoft.com/kb/202261/en-us
If I come across anything else that's helpful, I'll share it in follow-up comments.
What you might need to do is use the URLencode() function in R.
kdb.url <- "http://m1:5000/q.csv?select from data0 where folio0 = `KF"
kdb.url <- URLencode(kdb.url)
df <- read.csv(kdb.url, header=TRUE)
You might have meta-characters in your URL too. (Mine are the spaces and the backtick.)
>kdb.url
[1] "http://m1:5000/q.csv?select%20from%20data0%20where%20folio0%20=%20%60KF"
They think of everything, those R guys.
I'm a beginner at web scraping and not yet familiar with the nomenclature for the problems I'm trying to solve. Nevertheless, I've searched exhaustively for this specific problem and was unsuccessful in finding a solution. If it has already been answered somewhere else, I apologize in advance and welcome your suggestions.
Getting to it. I'm trying to build a script with R that will:
1. Search for specific keywords in a newspaper website;
2. Give me the headlines, dates and contents for the number of results/pages that I desire.
I already know how to post the form for the search and scrape the results from the first page, but I've had no success so far in getting the content from the next pages. To be honest, I don't even know where to start (I've read about RCurl and so on, but it still hasn't made much sense to me).
Below is a partial sample of the code I've written so far (scraping only the headlines of the first page, to keep it simple).
library(RCurl)
library(XML)
curl <- getCurlHandle()
curlSetOpt(cookiefile='cookies.txt', curl=curl, followlocation = TRUE)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
search=getForm("http://www.washingtonpost.com/newssearch/search.html",
.params=list(st="Dilma Rousseff"),
.opts=curlOptions(followLocation = TRUE),
curl=curl)
results=htmlParse(search)
results=xmlRoot(results)
results=getNodeSet(results,"//div[@class='pb-feed-headline']/h3")
results=unlist(lapply(results, xmlValue))
I understand that I could perform the search directly on the website, inspect the URL for references to the page number or to the index of the first article displayed on each page, and then use a loop to scrape each page.
But please bear in mind that once I learn how to go from page 1 to pages 2, 3, and so on, I will try to develop my script to perform several searches with different keywords on different websites, all at the same time, so the approach in the previous paragraph doesn't seem like the best one to me so far.
If you have any other solution to suggest, I will gladly embrace it. I hope I've stated my issue clearly enough to get a share of your ideas and maybe help others facing similar issues. Thank you all in advance.
Best regards
First, I'd recommend you use httr instead of RCurl; for most problems it's much easier to use.
library(httr)
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff"
)
)
stop_for_status(r)
content(r)
Second, if you look at the URL in your browser, you'll notice that clicking a page number modifies the startat query parameter:
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff",
startat = 10
)
)
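That also means the paging from the original question can be a simple loop over startat; a small sketch, where the step of 10 results per page is an assumption based on the parameter above:
# fetch the first five result pages by stepping startat
pages <- lapply(seq(0, 40, by = 10), function(offset) {
  GET("http://www.washingtonpost.com/newssearch/search.html",
      query = list(st = "Dilma Rousseff", startat = offset))
})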
Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:
# devtools::install_github("hadley/rvest")
library(rvest)
page <- html(r)
links <- page[sel(".pb-feed-headline a")]
links["href"]
html_text(links)
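For readers on a current version of rvest, the equivalent of the block above looks roughly like this (the .pb-feed-headline selector is the one from the answer and may no longer match the live site):
library(rvest)
page <- read_html(r)   # r is the httr response from above
links <- html_nodes(page, ".pb-feed-headline a")
html_attr(links, "href")   # article URLs
html_text(links)           # headline text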
I highly recommend reading the selectorgadget tutorial and using that to figure out what css selectors you need.