Change the language of an IMDb URL request in R

I am trying to get data from IMDB with
page <- read_html("URL_of_Movie")
The output is always in German, but I need the content in its original English form, even though my settings are set to "English".
I saw other questions here, like this one:
curl -H "Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3" http://www.imdb.com/title/tt0076306/
which shows how to request the English version with curl, but I don't know how to integrate this into my R code.

I needed to wrap the request in httr's GET() function with the correct syntax for the Accept-Language header:
library(httr)
library(rvest)

page <- read_html(GET(
  "https://www.imdb.com/list/ls020643534/?sort=list_order,asc&st_dt=&mode=detail&page=1&title_type=movie&ref_=ttls_ref_typ",
  add_headers("Accept-Language" = "en-US")
))
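As a quick check that the English version came back, you could pull some text out of the parsed page. The CSS selector below is only an assumption about IMDb's list markup and may need adjusting:

# Hypothetical check: the list-item titles should now be in English.
# ".lister-item-header a" is a guess at the selector for the title links.
page %>%
  html_nodes(".lister-item-header a") %>%
  html_text()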

Related

How to correctly format body of POST request

I am adding a page to a website through a REST API. I use the following in bash and it works: it creates a new page with the specified title and body content.
token="dfrer4e"
curl -X POST -H "Authorization: Bearer $token" \
  https://api/pages \
  -d "wiki_page[title]=New title" \
  -d "wiki_page[body]=New content"
I am trying to do the same using R package httr.
library(httr)
set_config(add_headers("Authorization"=paste0("Bearer dfrer4e")))
This works when I just use the title. It creates a new page with the specified title.
POST(url="https://api/pages/",body="wiki_page[title]=New title")
but I am not sure how to include the body part as well.
Attempts:
I tried providing the body as a character vector, but it doesn't work correctly: both values end up concatenated into the title and the page body stays empty.
POST(url="https://api/pages/",body=c("wiki_page[title]=New page","wiki_page[body]=New content"))
I tried providing the body as a list, but it returns an error.
POST(url="https://api/pages/",body=list("wiki_page"=list("title"="New title","body"="New content")))
Error in curl::handle_setform(handle, .list = req$fields) :
Unsupported value type for form field 'wiki_page'.
I tried to provide the body as JSON, but it returns a status 400 error.
j <- jsonlite::toJSON(list("wiki_page"=list("title"="New title","body"="New content")))
POST(url="https://api/pages/",body=j,encode="json")
Unfortunately, I cannot create a reproducible example.
The proper way to translate that command to httr is
POST(url="https://api/pages/",
body=list(
"wiki_page[title]" = "New page",
"wiki_page[body]" = "New content")
)
You need to separate the names and the values in your body so the values can be properly encoded.
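Note that curl's -d sends the fields as application/x-www-form-urlencoded, while httr's default encoding for a list body is multipart. If the API is strict about the content type, a variant worth trying (same placeholder endpoint and token as in the question) is:

library(httr)

# URL-encode the form fields, mirroring curl's -d behaviour.
# The endpoint and token are the placeholders from the question.
res <- POST(
  url = "https://api/pages/",
  add_headers(Authorization = "Bearer dfrer4e"),
  body = list(
    "wiki_page[title]" = "New title",
    "wiki_page[body]" = "New content"
  ),
  encode = "form"
)
status_code(res)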

Scraping information from site that requires login with R (Maybe use API)

This URL requires the following login data:
Benutzername oder E-Mail (user): testuserscrap@web.de
Passwort (password): testuserscrap
(The website is a kind of fantasy football for the German Bundesliga.)
There is an existing post where someone asks for help with the same website.
However, I do not want to retrieve information about certain players but about the actual team. In the browser these steps are required:
Clicking on the icon circled in red (screenshot not shown) leads to the page from which I would like to retrieve all the player names in lists 1 and 2.
That means I would like an output such as:
Diego Contento
Alfred Finnbogason
...
I am not sure which approach is best. According to the referenced posts there seems to be an API; however, I cannot access the information with the code adapted from the referenced post:
library(rvest)
library(jsonlite)
library(httr)
library(plyr)
library(dplyr)
url<-"https://kickbase.sky.de/"
page<-html_session(url)
page<-rvest:::request_POST(page,url="https://kickbase.sky.de/api/v1/user/login",
body=list("email"="testuserscrap#web.de",
"password"="testuserscrap",
"redirect_url"="https://www.kickbase.com/transfermarkt/kader"),
encode='json'
)
ck <- cookies(page)
player_page<-jump_to(ck$value,"https://api.kickbase.com/leagues/1420282/lineupex")
Unfortunately, I'm not much of an expert in coding or web scraping. I have tried many things but have not come to a solution :/ Therefore, I would be really grateful for any advice or ideas on how to retrieve this information.
Best :)
Wow, this was a tough question, but a very good learning experience for me. To solve it I used the curlconverter package, available on GitHub and installable with the devtools package; see https://github.com/hrbrmstr/curlconverter and other questions/answers posted here on Stack Overflow.
First log into the web page with your browser and navigate to the page of interest. Using the developer tools, copy the request of interest as a cURL command from the network tab. The cURL command can be stripped of nonessential parts, but you would have to determine which parts are noncritical through trial and error.
Then use the straighten() function, edit the user id and password (these are not saved with the copied cURL command), make the request, and parse the result.
#cURL copied from network tab for the requested file
xcurl<-"curl 'https://api.kickbase.com/leagues/1420282/lineupex'
-XGET
-H 'Accept: */*'
-H 'Origin: https://kickbase.com'
-H 'Referer: https://kickbase.com/transfermarkt/kader'
-H 'Accept-Language: en-us'
-H 'Host: api.kickbase.com'
-H 'Authorization: Bearer XU3DGDZBxlHB0sjqG01yLhHihT2AacPeIeWOlY+u3nxz/iokfCjn8a9vaKeKFXwxJpcH/0FXOgGg3J2EfmUUDJ9uwjT+oxHZTGc1EuOxbG0i66fRBBm1RBT0Yd4ACRDQ9BCs8yb+/w9+gOPIyhM2Vio3DZemExATq22osCGeW6VzYmos/3F8MTDbKOAk8NPKQYr5xPSght26ayZ4/X21ag==' \
-H 'Accept-Encoding: br, gzip, deflate'
-H 'Connection: keep-alive'"
# see https://github.com/hrbrmstr/curlconverter; install with devtools
library(curlconverter)
library(dplyr)
my_ip <- straighten(xcurl)
# add password and user id
my_ip[[1]]$password <- "testuserscrap"
my_ip[[1]]$username <- "testuserscrap@web.de"
# make the page request
response <- my_ip %>% make_req()
# retrieve the entire file
# jsonfile <- jsonlite::toJSON(content(response[[1]](), as = "parsed"), auto_unbox = TRUE, pretty = TRUE)
# retrieve only the player info and convert to a data frame
dfs <- lapply(content(response[[1]](), as = "parsed")$players, data.frame)
# not every player has the same fields, hence bind_rows instead of rbind
players <- do.call(bind_rows, dfs)
players
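If you would rather avoid curlconverter, the same flow can be sketched with plain httr: log in, pull the auth token out of the login response, and send it as a Bearer header. The endpoints come from the question's code; the token and player field names are guesses, so inspect the parsed responses to confirm them:

library(httr)

# Log in; endpoint and credential field names are taken from the question.
login <- POST("https://kickbase.sky.de/api/v1/user/login",
              body = list(email = "testuserscrap@web.de",
                          password = "testuserscrap"),
              encode = "json")

# "token" is a guess at the field name -- inspect content(login) to find
# where the auth token actually lives in the response.
token <- content(login)$token

# Reuse the token as a Bearer header, as in the answer above.
lineup <- GET("https://api.kickbase.com/leagues/1420282/lineupex",
              add_headers(Authorization = paste("Bearer", token)))

str(content(lineup)$players[[1]])  # inspect one player entry for the name fields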
In case you are still looking for access to the Kickbase API: I recently wrote a small Python library for it and released it a few days ago. It might still have some bugs, but it serves my purpose, and maybe you want to contribute to it. :)
https://github.com/kevinskyba/kickbase-api-python

Build web graph with wget

I'm using wget with the -r (recursive) option to crawl and download all pages starting from a root.
For debugging purposes I'd like to record which page routed me to another one, for example: https://stackoverflow.com/ -> https://stackoverflow.com/questions
Is there a way to do that?
Please note that I explicitly need to use wget.
The best solution I have found so far is to use the --warc-file option to export a WARC archive of my crawl. This format also stores the Referer header.
Using a Python library to read the output, I wrote the following simple script to export a CSV with source/target columns:
import warc

f = warc.open("crawler.warc")
for record in f:
    if record['WARC-Type'] != 'request':
        continue
    for line in record.payload:
        if line.startswith("Referer:"):
            print line.replace("Referer: ", "").strip('\n\r'), ",", record['WARC-Target-URI']

Use AWK to safely search and replace URLs in Wordpress SQL-Dump

I am working on a webtool to mirror a Wordpress installation into a development system.
The aim is to have a Live system for production and a development system for testing. The webtool then offers a one-click-sync between those systems.
Each of the systems is standalone, with its own webroot, database and url.
I am having trouble with the database dump, in which I have to find all references to the source URL and replace them with the URL of the destination (e.g. "www.example.com" -> "www-dev.example.com").
What I need to do is:
Find all occurrences of the URL and replace them with the new one.
If the match is also part of a serialized string, set the field separator, re-split the line, and update the stored string length to the actual new length.
In a first attempt I tried to solve this with a sed command like: sed -i.orig 's/360\.example\.com/360-dev\.my\.example\.dev/g'.
This did not work because the dump contains serialized arrays that include the URL, and sed cannot update the string-length indicator of those serialized values.
My latest attempt is to use awk, as suggested here, because it is capable of arithmetic operations.
My awk script looks like this:
/360[.]example[.]com/ {
  sub("360.example.com", "360-dev.my.example.dev");
  if ($0 ~ /s:[[:digit:]]+:["](http[s]?:\/\/)?360[.]example[.]com["]/) {
    FS = "\"";
    $0 = $0;
    n = length($2) - 1;
    sub(/:[[:digit:]]+:/, ":" n ":");
  }
} 1
There seem to be some errors in my script which I can't find. It does not replace all occurrences of the URL and it completely skips the length-indicator update.
How can I fix my script to achieve what I want to do?
EDIT: (Added Input/Output samples)
The database dump consists of the whole WordPress database, with CREATE TABLE IF NOT EXISTS and INSERT statements for each table and record.
Normal (unserialized) occurrence:
(36, 'home', 'http://360.example.com/blogname', 'yes'),
should result in:
(36, 'home', 'http://360-dev.my.example.dev/blogname', 'yes'),
Serialized occurrence:
(404, 'wp-maintenance-mode', 'a:21:{s:6:"active";i:1;s:4:"time";i:0;s:4:"link";i:1;s:7:"support";i:0;s:10:"admin_link";i:1;s:7:"rewrite";s:0:"";s:6:"notice";i:1;s:4:"unit";i:1;s:5:"theme";i:0;s:8:"styleurl";s:69:"http://360.example.com/wp-content/themes/blogname/css/maintenance.css";s:5:"index";i:0;s:5:"title";s:0:"";s:6:"header";s:0:"";s:7:"heading";s:0:"";s:4:"text";s:12:"Example Text";s:7:"exclude";a:1:{i:0;s:0:"";}s:6:"bypass";i:0;s:4:"role";a:1:{i:0;s:13:"administrator";}s:13:"role_frontend";a:1:{i:0;s:13:"administrator";}s:5:"radio";i:0;s:4:"date";s:0:"";}', 'yes'),
Should result in:
(404, 'wp-maintenance-mode', 'a:21:{s:6:"active";i:1;s:4:"time";i:0;s:4:"link";i:1;s:7:"support";i:0;s:10:"admin_link";i:1;s:7:"rewrite";s:0:"";s:6:"notice";i:1;s:4:"unit";i:1;s:5:"theme";i:0;s:8:"styleurl";s:76:"http://360-dev.my.example.dev/wp-content/themes/blogname/css/maintenance.css";s:5:"index";i:0;s:5:"title";s:0:"";s:6:"header";s:0:"";s:7:"heading";s:0:"";s:4:"text";s:12:"Example Text";s:7:"exclude";a:1:{i:0;s:0:"";}s:6:"bypass";i:0;s:4:"role";a:1:{i:0;s:13:"administrator";}s:13:"role_frontend";a:1:{i:0;s:13:"administrator";}s:5:"radio";i:0;s:4:"date";s:0:"";}', 'yes'),
EDIT 2:
I am now using wp-cli for the search & replace task.
I've got a multisite setup with blogs numbered (2,3,9).
Executing wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' results in an error telling me that the single-site tables (wp_redirection_items and wp_redirection_groups) cannot be found.
This is true, because they do not exist under those names but rather once per blog (e.g. wp_2_redirection_items and so on). Because of this error, over 9000 occurrences are missed in the search & replace. It is possible to replace everything with wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' wp_*, but it still throws the error.
As suggested by @archimiro, the task is now done with wp-cli.
But since I also have a multisite setup, which led to some errors, I had to figure out the command for a full-database search & replace.
The final command:
wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' wp_*.
Without explicitly telling wp-cli to search & replace in ALL tables (wp_*), it stops as soon as a "table not found" error is thrown.
This is not awk or wp-cli either, but here is a PHP function I wrote that seems to work well.
function snr($search, $replace, $inputfile, $outputfile){
    // Plain text replacement first.
    $sql = file_get_contents($inputfile);
    $sql1 = str_replace($search, $replace, $sql);
    file_put_contents($outputfile, $sql1);
    // Split the dump at every serialized-string marker "s:".
    $serstrings = preg_split("/(?<=[{;])s:/", $sql1);
    foreach ($serstrings as $i => $serstring) {
        if (!!strpos($serstring, $replace)) {
            // Extract the actual string value: escaped backslashes count as one
            // character, and the remaining escape backslashes are dropped ...
            $justString = str_replace("\\", "", str_replace("\\\\", "j", explode('\\";', explode(':\\"', $serstring)[1])[0]));
            // ... then patch the leading length prefix with the real length.
            $correct = strlen($justString);
            $serstrings[$i] = preg_replace('/^\d+/', $correct, $serstrings[$i]);
        }
    }
    file_put_contents($outputfile, implode("s:", $serstrings));
}
I've used this in the past with success:
sed 's|360\.example\.com|360-dev\.my\.example\.dev|g' com.sql > local.sql
Edit: sorry not awk, but neither is wp-cli.
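Also not awk, but for completeness the length-recomputation step (the part a plain sed replacement misses) can be sketched in R as well. This assumes the serialized values contain no embedded double quotes, which holds for the samples above:

# Replace the URL on one dump line, then recompute every s:<len>:"..."
# prefix; lengths of untouched strings simply stay the same.
fix_line <- function(line,
                     from = "360.example.com",
                     to   = "360-dev.my.example.dev") {
  line <- gsub(from, to, line, fixed = TRUE)
  m <- gregexpr('s:[0-9]+:"[^"]*"', line)
  regmatches(line, m) <- lapply(regmatches(line, m), function(parts) {
    vals <- sub('^s:[0-9]+:"(.*)"$', "\\1", parts)
    sprintf('s:%d:"%s"', nchar(vals, type = "bytes"), vals)
  })
  line
}

writeLines(vapply(readLines("com.sql"), fix_line, character(1)), "local.sql")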

curl POST statement to RCurl or httr

I have this working curl statement to post a file to Nokia's HERE batch geocoding service...
curl -X POST -H 'Content-Type: multipart/form-data;boundary=----------------------------4ebf00fbcf09' \
--data-binary @example.txt \
'http://batch.geocoder.cit.api.here.com/6.2/jobs?action=run&mailto=test@gmail.com&maxresults=1&language=es-ES&header=true&indelim=|&outdelim=|&outcols=displayLatitude,displayLongitude,houseNumber,street,district,city,postalCode,county,state,country,matchLevel,relevance&outputCombined=false&app_code=AJKnXv84fjrb0KIHawS0Tg&app_id=DemoAppId01082013GAL'
I have tried this:
library(RCurl)
url <- "http://batch.geocoder.cit.api.here.com/6.2/jobs? action=run&mailto=test#gmail.com&maxresults=1&language=es-ES&header=true&indelim=|&outdelim=|&outcols=displayLatitude,displayLongitude,houseNumber,street,district,city,postalCode,county,state,country,matchLevel,relevance&outputCombined=false&app_code=AJKnXv84fjrb0KIHawS0Tg&app_id=DemoAppId01082013GAL'"
postForm(url, file=fileUpload(filename="example.txt",
contentType="multipart/form-data;boundary=----------------------------4ebf00fbcf09"))
And this:
library(httr)
a <- POST(url, body=upload_file("example.txt", type="text/plain"),
config=c(add_headers("multipart/form-data;boundary=----------------------------4ebf00fbcf09")))
content(a)
Using this file as example.txt: https://gist.github.com/corynissen/4f30378f11a5e51ad9ad
Is there any way to do this properly in R?
I'm not a Nokia developer, and I'm assuming those are not your real API creds. This should help you get further with httr:
url <- "http://batch.geocoder.cit.api.here.com/6.2/jobs"
a <- POST(url, encode="multipart", # this will set the header for you
body=list(file=upload_file("example.txt")), # this is how to upload files
query=list(
action="run",
mailto="test#example.com",
maxresults="1",
language="es-ES", # this will build the query string
header="true",
indelim="|",
outdelim="|",
outcols="displayLatitude,displayLongitude", # i shortened this for the example
outputCombined="false",
app_code="APPCODE",
app_id="APPID"),
verbose()) # this lets you verify what's going on
But, I can't be sure w/o registering (and no time to do that).
This is my solution, based on hrbrmstr's answer:
bod <- paste(readLines("example.txt", warn=F), collapse="\n")
a <- POST(url, encode="multipart", # this will set the header for you
body=bod, # this is how to upload files
query=list(
action="run",
mailto="test#gmail.com",
maxresults="1",
language="es-ES", # this will build the query string
header="true",
indelim="|",
outdelim="|",
outcols="displayLatitude,displayLongitude,houseNumber,street,district,city,postalCode,county,state,country,matchLevel,relevance", # i shortened this for the example
outputCombined="false",
app_code="AJKnXv84fjrb0KIHawS0Tg",
app_id="DemoAppId01082013GAL"),
#config=c(add_headers("multipart/form-data;boundary=----------------------------4ebf00fbcf09")),
verbose()) # this lets you verify what's going on
content(a)
The problem I had to get around was that the normal upload process strips line breaks, but I needed them for the API to work (curl's --data-binary option preserves them). To get around this, I read the data with readLines() and insert it as a single string.
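An alternative to pasting the lines back together is to read the file as raw bytes, which mirrors curl's --data-binary more closely. This is only a sketch, reusing the placeholder boundary header from the question and an abbreviated query:

library(httr)

# Read the file exactly as stored, with line breaks untouched.
bod <- readBin("example.txt", what = "raw",
               n = file.info("example.txt")$size)

a <- POST("http://batch.geocoder.cit.api.here.com/6.2/jobs",
          body = bod,
          content_type("multipart/form-data;boundary=----------------------------4ebf00fbcf09"),
          query = list(action = "run", maxresults = "1"),  # abbreviated query
          verbose())
content(a)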
