I am trying to download the image from the website listed below via R:
http://www.ebay.com/itm/2pk-HP-60XL-Ink-Cartridge-Combo-Pack-CC641WN-CC644WN-/271805060791?ssPageName=STRK:MESE:IT
Which package should I use, and how should I go about it?
Objective: to download the image on this page to a folder, and/or to find the image URL.
I used
url2 <- "http://www.ebay.co.uk/itm/381164104651"
url_content <- read_html(url2)  # html() is deprecated in current rvest; read_html() is the replacement
node1 <- html_node(url_content, "#icImg")
node1 contains the image URL, but when I try to edit that content I get an error saying it is a non-character element.
This solved my problem:
html_node(url_content, "#icImg") %>% xml_attr("src")
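For completeness, a minimal end-to-end sketch that finds the image URL and saves the file to a folder (this assumes the listing still exposes its main photo under the #icImg id, and that you want the file in an images/ directory):
library(rvest)
library(xml2)
url2 <- "http://www.ebay.co.uk/itm/381164104651"
page <- read_html(url2)
# pull the image URL out of the src attribute of the #icImg node
img_url <- page %>% html_node("#icImg") %>% xml_attr("src")
# save the image locally; mode = "wb" keeps binary files intact
dir.create("images", showWarnings = FALSE)
download.file(img_url, destfile = file.path("images", basename(img_url)), mode = "wb")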
Related
I have a problem downloading data over HTTPS in R. I tried using curl, but it doesn't work.
URL <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
options('download.file.method'='curl')
download.file(URL, destfile = "./data.csv", method="auto")
The code downloads a CSV file, but when I check the data the format has changed, so it didn't download correctly. Could someone please help me?
I think you might actually have the URL wrong. I think you want:
https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv
Then you can download the file directly rather than first creating a variable for the URL. (download.file() is base R, so loading RCurl isn't strictly required when method = "libcurl" is used.)
library(RCurl)
download.file("https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv",destfile="./data.csv",method="libcurl")
You can also load the file directly into R from the site (again using the raw URL, since read.csv() can read from a URL connection):
URL <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
out <- read.csv(URL)
You can use the 'raw.githubusercontent.com' link. In the browser, when you go to "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv" you can click "View raw" (it's above "Sorry about that, but we can't show files that are this big right now.") and that takes you to the actual data. You also have some minor typos.
This worked as expected for me:
url <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
download.file(url, destfile = "./data.csv", method="auto")
df <- read.csv("./data.csv")
I'm trying to harvest data using rvest (I also tried XML and selectr) but I'm running into the following problem. In my browser's web inspector the HTML looks like this:
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest); library(selectr); library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(as.character(zoom.turbo), asText = TRUE)  # zoom.turbo is an xml_node, so convert to character first
xmlValue(querySelector(doc, 'span'))
For websites that are difficult to scrape, for example where the content is dynamic, you can use RSelenium. With this package and a browser (for instance one running in a Docker container), you are able to navigate websites with R commands.
I have used this method to scrape a website that had a dynamic login script, which I could not get to work with other methods.
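A minimal sketch of that approach, assuming a Selenium server and browser are already running (e.g. the selenium/standalone-firefox Docker image listening on port 4444) and that the rate is still exposed via the data-widget attribute shown above:
library(RSelenium)
# connect to a Selenium server that is already running (e.g. in a Docker container)
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.zoomtrader.com/trade-now?game=turbo")
Sys.sleep(5)  # give the JavaScript widget time to render
# locate the rate widget and read its rendered text
elem <- remDr$findElement(using = "css selector", "span[data-widget='turboBinary_tradologic1_rate']")
rate <- elem$getElementText()[[1]]
remDr$close()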
I'm trying to scrape a website using R. However, I cannot get all the information from the website, for reasons I don't understand. I found a workaround by first downloading the complete webpage ("save as" from the browser). I was wondering whether it is possible to download complete webpages using some function.
I tried download.file and htmlParse, but they seem to only download the source code.
url = "http://www.tripadvisor.com/Hotel_Review-g2216639-d2215212-Reviews-Ayurveda_Kuren_Maho-Yapahuwa_North_Western_Province.html"
download.file(url , "webpage")
doc <- htmlParse(urll)
ratings = as.data.frame(xpathSApply(doc,'//div[#class="rating reviewItemInline"]/span//#alt'))
This worked with rvest on the first go:
library(rvest); library(plyr); library(stringi)  # plyr for llply(), stringi for stri_replace_all_regex()
llply(read_html(url) %>% html_nodes('div.rating.reviewItemInline'), function(i)
  data.frame(nth_stars = html_nodes(i, 'img') %>% html_attr('alt'),
             date_var = html_text(i) %>% stri_replace_all_regex('(\n|Reviewed)', '')))
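If you want a single data frame rather than a list of per-review data frames, one follow-up (assuming you store the llply() result above in a variable, here hypothetically called review_list) is:
# review_list: hypothetical name for the list returned by the llply() call above
ratings_df <- do.call(rbind, review_list)  # one row per review
head(ratings_df)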
I'd like to apologise in advance for the lack of a reproducible example. The data my script runs on are not live right now and are, in addition, confidential.
I wanted to make a script which can find all links on a certain page. The script works as follows:
* find homepage html to start with
* find all urls on this homepage
* open these urls with Selenium
* save the html of each page in a list
* repeat this (find urls, open urls, save html)
The workhorse of this script is the following function:
function(listofhtmls) {
  urls <- lapply(listofhtmls, scrape)
  urls <- lapply(urls, clean)
  urls <- unlist(urls)
  urls <- urls[!duplicated(urls)]  # safer than urls[-which(duplicated(urls))], which returns an empty vector when there are no duplicates
  urls <- paste("base_url", urls, sep = "")
  html <- lapply(urls, savesource)
  result <- list(html, urls)
  return(result)
}
URLs are scraped, cleaned (I don't need all of them), and duplicates are removed.
All of this works fine for most pages but sometimes I get a strange error while using this function:
Error: '' does not exist in current working directory.
Called from: check_path(path)
I don't see any link between the working directory and the parsing that's going on. I'd like to resolve this error, as it's blocking the rest of my script at the moment. Thanks in advance, and once again apologies for not providing a reproducible example.
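For what it's worth, that particular message is what xml2::read_html() raises when it is handed an empty string: it then falls back to treating the input as a file path, and check_path() fails. A minimal guard, assuming savesource() wraps read_html() (that helper isn't shown in the question), might look like:
library(xml2)
# hypothetical version of savesource() that skips blank or NA URLs,
# which would otherwise trigger the "'' does not exist in current working directory" error
savesource <- function(u) {
  if (is.na(u) || !nzchar(u)) return(NULL)          # skip empty inputs
  tryCatch(read_html(u), error = function(e) NULL)  # don't let one bad URL stop the run
}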
I'm trying to automate downloading company profile images from Crunchbase's OpenDataMap using R. I've tried download.file, GET (in the httr package), and getURLContent in RCurl, but they all return a 416 error. I know I must be forgetting a parameter or a user agent, but I can't figure out what.
Here's an example URL for testing:
http://www.crunchbase.com/organization/google-ventures/primary-image/raw
Thanks for any help that you can provide.
I think I came up with a fairly clever, albeit slow-ish, solution that works in R.
Essentially, I created a headless browser that navigates from page to page, downloading the Crunchbase images I need. This lets me get past the redirect and JavaScript that stop me from getting to the images via a simple curl request.
This may work for other scraping projects.
library(RSelenium)
RSelenium::checkForServer()
startServer()
remDr <- remoteDriver$new()
remDr$open()
# profile_image_url is a vector of image URLs from Crunchbase's open data map
for (row in seq_along(profile_image_url)) {
  print(row)  # keep track of where I am
  # if already downloaded, don't do it again
  if (file.exists(paste0("profileimages/", row, ".png")) ||
      file.exists(paste0("profileimages/", row, ".jpg")) ||
      file.exists(paste0("profileimages/", row, ".gif"))) {
    next
  }
  # navigate to the image page; the browser follows the redirect to the real file
  remDr$navigate(paste0(profile_image_url[row], "?w=500&h=500"))
  imageurl <- remDr$getCurrentUrl()[[1]]
  # get the file extension (to handle pngs, jpgs and gifs)
  file.ext <- gsub('[^\\]*\\.(\\w+)$', "\\1", imageurl)
  # download the image file from the 'real' url (row is used as the file name, matching the checks above)
  download.file(imageurl, paste0("profileimages/", row, ".", file.ext), method = "curl")
  # wait ten seconds to avoid rate-limiting
  Sys.sleep(10)
}
remDr$close()