I wrote some code that checks whether a product is back in stock and, when it is, sends me an email to notify me. This works when the things I'm looking for are in the HTML.
However, sometimes certain objects are loaded through JavaScript. How could I edit my code so that the web scraping also works with JavaScript?
This is my code thus far:
import time
import requests

while True:
    # Set the URL of the IKEA page
    url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
    # Get the text from that page and convert it to lower case
    productpage = requests.get(url).text.lower()
    # Set the strings that should be on the page if the product is not available
    outofstockstrings = ['niet beschikbaar voor levering', 'alleen beschikbaar in de winkel']
    # Check whether any of the strings are in the text of the web page
    if any(x in productpage for x in outofstockstrings):
        time.sleep(1800)
        continue
    else:
        # send me an email and break the loop
        break
Instead of scraping and analyzing the HTML, you could use the unofficial stock API that the IKEA website itself uses. That API returns JSON data, which is much easier to analyze, and you'll also get estimates of when the product will be back in stock.
There is even a project written in JavaScript (Node.js) that provides this kind of information straight from the command line: https://github.com/Ephigenia/ikea-availability-checker
You can easily check the stock amount of the chair in all stores in the Netherlands:
npx ikea-availability-checker stock --country nl 20336841
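If you would rather stay in Python, analyzing a JSON response could look roughly like the sketch below. The payload and its field names (store, stock, restock_date) are made up for illustration; they are not the real schema of the IKEA stock API, so adapt them to whatever the API actually returns.

```python
import json

# Hypothetical payload: these field names are placeholders for illustration,
# not the actual schema of IKEA's stock API.
SAMPLE_RESPONSE = """
[
  {"store": "Store A", "stock": 0, "restock_date": "2021-03-01"},
  {"store": "Store B", "stock": 4, "restock_date": null}
]
"""


def stores_with_stock(payload):
    """Return the names of stores whose reported stock is above zero."""
    records = json.loads(payload)
    return [r["store"] for r in records if r.get("stock", 0) > 0]


available = stores_with_stock(SAMPLE_RESPONSE)
if available:
    print("In stock at:", ", ".join(available))
else:
    print("Out of stock everywhere")
```

Compared with searching the page text for Dutch phrases, a check like this does not break when the wording on the page changes.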
I'm attempting to scrape a site that requires a form to be submitted before getting results. I'm struggling to understand how it works, let alone syntax and other things.
I have been looking at code posted by other people, and many of them use rvest or RSelenium. I can't seem to get my form to submit properly, and I'm not sure how to go about extracting the results into R once it does submit.
Now, I can't share the specific site that I'm working from, but I've found an analog:
https://gapines.org/eg/opac/advanced
For example, I might need to select "Books" under "Item Type," and "Braille" under "Item Form." Once the form is submitted, I would need to capture that results page.
Copying from other people's code, I have the following:
library(rvest)
url <- "https://gapines.org/eg/opac/advanced"
my_session <- html_session(url) # Create a persistent session
unfilled_forms <- html_form(my_session)
login_form <- unfilled_forms[[2]] # select the form you need to fill
filled_form <- set_values(login_form,'fi:item_type'="Books",'fi:item_form'="Braille")
login_session <- submit_form(my_session,filled_form)
When I run the submit_form(), it says "Submitting with 'NULL'."
Once it is submitted, I also want to extract the results, but I'm not sure how to even begin.
Thanks in advance.
I am trying to download reports available in Salesforce from R via their URL, e.g.
http://YOURInstance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv
I have already tried to access the report with httr's GET, but so far without any meaningful results. Unfortunately, R downloads HTML code instead of the desired CSV file. I also tried to implement the approach suggested here:
https://salesforce.stackexchange.com/questions/47414/download-a-report-using-python
The package "RForcecom" allows interaction via an API, but I was not able to figure out how to implement the above solution in R.
General GET-Request:
GET("http://YOUR_Instance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv")
I expect the output to be in csv format, but I receive the report data as html source code.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3...
<html>
<head>
<meta HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
...
Has anyone encountered the same issue and can provide guidance? Any kind of help is much appreciated. Thanks in advance!
UPDATED and not-working R-Snippet:
library(RForcecom)
library(httr)
username='username'
password='password'
instanceURL <- "https://login.salesforce.com/"
session <- rforcecom.login(username, password, instanceURL)
sid=as.character(session['sessionID'])
url='http://YOURInstance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv'
getData <- GET(url,
               add_headers('Content-Type' = 'application/json',
                           'Authorization' = paste0("Bearer ", sid),
                           'X-PrettyPrint' = '1'),
               set_cookies('sid' = sid))
Are you sure you have a valid report id? It doesn't look right (did you just obfuscate it for purposes of this post?). What is in that HTML you're getting, an error message? SF login screen?
What you're doing is effectively "screen scraping". This is not a real API and it can break at any time; you should find or build something that properly uses the Salesforce Analytics API. You've been warned.
But if you're after a quick and dirty solution...
You need to pretend you're an authenticated user, that you have a valid session id. Add a cookie to your GET request.
How to get a valid session id?
Either log in to SF first (for example with the SOAP API's login call; I also listed some REST API ideas here: https://stackoverflow.com/a/56034159/313628 ),
or display the user's session ID in an SF formula or Visualforce page and have the user copy and paste it into your app.
Once you have it - add a Cookie header to your GET with value sid=<session id goes here>
Here's a raw request & response in SoapUI.
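For illustration, the Cookie header described above can be sketched in Python with only the standard library. The URL is the placeholder from the question and SESSION_ID_GOES_HERE is a stand-in, so this only shows how the request is put together; the helper name build_report_request is my own, not part of any Salesforce library.

```python
import urllib.request


def build_report_request(report_url, session_id):
    """Build a GET request that carries the session id in a Cookie header."""
    return urllib.request.Request(
        report_url,
        headers={"Cookie": f"sid={session_id}"},
        method="GET",
    )


# Placeholder values: substitute your own instance URL and a real session id.
url = "http://YOURInstance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv"
request = build_report_request(url, "SESSION_ID_GOES_HERE")
print(request.get_header("Cookie"))

# Sending it would look like this (needs network access and a valid session):
# with urllib.request.urlopen(request) as response:
#     csv_text = response.read().decode("utf-8")
```

The same idea carries over to httr in R: the session id has to travel as the sid cookie, not only as a Bearer token.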
I recently struggled with the same issue. There's a magic parameter you need to add to the query: isdtp=p1
so if you try:
http://YOURInstance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv&isdtp=p1
it should return the file directly.
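If you build the export URL in code, appending the parameter can be sketched like this in Python (standard library only). The helper name with_query_param is hypothetical, and the URL is the placeholder from the question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, key, value):
    """Return url with key=value in its query string, replacing any existing value."""
    parts = urlsplit(url)
    # Keep every existing parameter except the one being set
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != key]
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


export_url = "http://YOURInstance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv"
print(with_query_param(export_url, "isdtp", "p1"))
```

Parsing and re-encoding the query, rather than concatenating strings, avoids adding the parameter twice if the URL already contains it.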
In your example, I don't think that you can use the rforcecom session with httr functions as you are trying.
Here is a slightly different way to solve the problem.
Rather than trying to retrieve a report that you already created in Salesforce, why not specify the report in SOQL and use the rforcecom.query function to execute the SOQL from R? That returns the data in a data frame and requires no further data wrangling in R to make it usable.
I use this technique often, and once you get used to the Salesforce API I think it's probably faster and more powerful for most use cases.
Here is a simple function that I use to return select opportunity data for all opportunities in Salesforce.
getSFOpps <- function(session) {
  # Construct the SOQL query
  soql <- "SELECT Id,
                  Name,
                  AccountId,
                  Amount,
                  CurrencyIsoCode,
                  convertCurrency(Amount) usd_amount,
                  CloseDate,
                  CreatedDate,
                  Region__c,
                  IsClosed,
                  IsWon,
                  LastActivityDate,
                  LeadSource,
                  OwnerId,
                  Probability,
                  StageName,
                  Type,
                  IsDeleted
           FROM Opportunity"
  # Retrieve the opportunity data and return it as a tibble
  tibble::as_tibble(RForcecom::rforcecom.query(session, soql))
}
It requires that you pass in a valid session from rforcecom.login, but you seem to have that part working in your code above.
I hope this helps ...
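If the list of fields changes often, the query string can be assembled from a list instead of being written out by hand. Below is a minimal Python sketch of that idea; build_soql is a hypothetical helper, not part of RForcecom or any Salesforce client, and it performs no escaping or validation of the names it is given.

```python
def build_soql(sobject, fields):
    """Assemble a simple SOQL SELECT statement from a list of field names."""
    if not fields:
        raise ValueError("at least one field is required")
    return f"SELECT {', '.join(fields)} FROM {sobject}"


soql = build_soql("Opportunity", ["Id", "Name", "Amount", "CloseDate", "StageName"])
print(soql)
```

The resulting string is what you would pass to the query function of whichever Salesforce client you use.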
As of v0.2.0, the {salesforcer} R package implements the Salesforce Reports and Dashboards REST API. You can execute and manage reports without needing to write functions from scratch to pull down report data. Below is an example of how to find a report in your Org and then retrieve its data. You can also just use the report Id which appears in the URL bar when viewing the report in Salesforce (highlighted in red in the screenshot below).
# install.packages('salesforcer')
library(dplyr, warn.conflicts = FALSE)
library(salesforcer)
# Authenticate using username, password, and security token ...
sf_auth(username = "test@gmail.com",
        password = "{PASSWORD_HERE}",
        security_token = "{SECURITY_TOKEN_HERE}")
# ... or using OAuth 2.0 authentication
sf_auth()
# find a report in your org and run it
all_reports <- sf_query("SELECT Id, Name FROM Report")
this_report_id <- all_reports$Id[1]
results <- sf_run_report(this_report_id)
results
I am using R and I need to retrieve the few most recent posts from a Twitter user (@ExpressNewsPK) using the twitteR package. I have created an account and have an access token, etc. I used the following commands to extract the tweets:
setup_twitter_oauth(consumerkey,consumersecret,accesstoken,accesssecret)
express_news_tweets <- searchTwitter("@ExpressNewsPK", n = 10, lang = "en")
However, the posts that are returned aren't the most recent ones from this user. Where have I made a mistake?
I think searchTwitter searches for the search string provided (here @ExpressNewsPK). So instead of returning tweets by @ExpressNewsPK, it returns tweets that are directed to @ExpressNewsPK.
To get tweets from @ExpressNewsPK, there is a function named userTimeline, which returns tweets from a particular user.
So after you are done with setup_twitter_oauth, you can try
userTimeline("ExpressNewsPK")
You can read more about it at ?userTimeline.
When you use searchTwitter(), you call the Twitter Search API. The Search API only returns a sample history of the tweets.
What you really need to do is call the Twitter Streaming API. With it you'll be able to download tweets in near real time. You can read more about the Streaming API here: https://dev.twitter.com/streaming/overview
I'm trying to get started with web scraping. My goal is to get the movies I have personally rated from the moviepilot.de page.
For this I need to access the following page: http://www.moviepilot.de/users/schlusie/rated/movies. But without authentication this is not possible.
I've read that the httr package can do something like this: create a handle with handle(), log in to the site through it with your login information, and then navigate to the desired page. It should look like this:
library(httr)
mp = handle("http://moviepilot.de")
# authentication step
GET(handle=mp, path="/users/schlusie/rated/movies")
This is the login page: http://www.moviepilot.de/login
Can someone please give me any pointers?