Configure random proxies with R for scraping

I am scraping a website whose robots.txt rules allow scraping, but sometimes I still get blocked.
While I contact the admin to understand why, I want to understand how I can use different proxies within R so I can keep scraping without being blocked.
I followed this quick tutorial:
https://support.rstudio.com/hc/en-us/articles/200488488-Configuring-R-to-Use-an-HTTP-or-HTTPS-Proxy
So I edited the environment file:
file.edit('~/.Renviron')
and within this I inserted a list of proxies to be selected randomly:
proxies_list <- c("128.199.109.241:8080","113.53.230.195:3128","125.141.200.53:80","125.141.200.14:80","128.199.200.112:138","149.56.123.99:3128","128.199.200.112:80","125.141.200.39:80","134.213.29.202:4444")
proxy <- paste0('https://', sample(proxies_list, 1))
https_proxy=proxy
But when I scrape with this code:
download.file(url_proxy, destfile ='output.html',quiet = TRUE)
html_output <- read_html('output.html')
I keep being blocked.
Am I not setting the proxies correctly?
Thanks!
M.

You need to set environment variables, not R variables. See ?download.file for more details.
e.g.
Sys.setenv(http_proxy=proxy)
before anything else happens. Also note the warning in the docs:
These environment variables must be set before the download code is
first used: they cannot be altered later by calling 'Sys.setenv'.
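Putting that together, a minimal sketch (assuming the proxies_list from the question; url_proxy stands for whatever page you are fetching, as in the question):
# Pick one proxy at random and expose it as environment variables
# before any download code has run in this session
proxy <- sample(proxies_list, 1)
Sys.setenv(http_proxy  = paste0("http://", proxy),
           https_proxy = paste0("http://", proxy))
# Then download as before
download.file(url_proxy, destfile = "output.html", quiet = TRUE)
Because these variables cannot be changed once the download code has been used, rotating proxies within one session is easier with httr, whose use_proxy() option is passed per request, e.g. httr::GET(url_proxy, httr::use_proxy("128.199.109.241", 8080)).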

Related

Set a path according to the user for a folder

I have a script in which I set the path to the folder containing the datasets I work on, but now the script will be run by other people on the team. How do I make the folder path dynamic according to the user running the script?
setwd("C:/Users/Jonas/Database")
I am already creating a variable that holds the machine's user name, but I don't know how to use it inside setwd.
u <- Sys.info()["user"]
I tried this, but it was unsuccessful:
setwd("C:/Users/u/Database")
Use paste0
u <- Sys.info()["user"]
setwd(paste0("C:/Users/", u, "/Database"))
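An equivalent sketch using file.path(), which inserts the path separators for you (assuming the same C:/Users/<user>/Database layout as in the question):
u <- Sys.info()[["user"]]
setwd(file.path("C:/Users", u, "Database"))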

Scraping does not return the desired data

I am trying to get data from the site https://bill.torrentpower.com/. I want to enter the city "Ahmedabad" and service number "3031629" and extract the table that gives the bill details.
My code is simple:
library(RCurl)
a <- postForm("https://bill.torrentpower.com/billdetails.aspx",
              "ctl00$cph1$drpCity" = 1,
              "ctl00$cph1$txtServiceNo" = "3031629",
              .opts = list(ssl.verifypeer = FALSE)
)
write(a, file = "a.html")
When I open the file a.html, I do not see the table containing the bill details. All other details are visible in a.html. My aim is to capture the tabular output as an R object.
The issue here is that the table is generated by JavaScript after the page has loaded, and hence you will not get the content of the table.
This is a common problem when scraping pages with a lot of dynamic content.
A workaround is to simulate a web browser using RSelenium.
http://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf
This drives a simulated web browser from your R session, and you can navigate the web pages using various methods (see the user manual for details).
Personally, I find the RSelenium and PhantomJS combination the most useful since I deal with a lot of JavaScript. Alternatively, if you find the R syntax a bit troublesome, you may use PhantomJS on its own as well. http://phantomjs.org/
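A rough sketch of that approach (assuming PhantomJS is already running in WebDriver mode on port 4444, e.g. started with phantomjs --webdriver=4444; the field name is taken from the postForm call above, and the city drop-down is left at its default here, so both may need adjusting against the live page):
library(RSelenium)
library(rvest)
# Connect to the running PhantomJS/WebDriver instance
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444,
                      browserName = "phantomjs")
remDr$open()
remDr$navigate("https://bill.torrentpower.com/billdetails.aspx")
# Fill in the service number and submit the form
elem <- remDr$findElement(using = "name", value = "ctl00$cph1$txtServiceNo")
elem$sendKeysToElement(list("3031629", key = "enter"))
Sys.sleep(5)  # give the JavaScript time to render the bill table
# Read the rendered page source and extract any tables from it
page <- read_html(remDr$getPageSource()[[1]])
tables <- html_table(page)
remDr$close()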

Download ASPX page with R

There are a number of fairly detailed answers on SO which cover authenticated login to an ASPX site and a download from it. As a complete n00b I haven't been able to find a simple explanation of how to get data from a web form.
The following MWE is intended as an example only, and this question is more about teaching me how to do it for a wider collection of web pages.
Website:
http://data.un.org/Data.aspx?d=SNA&f=group_code%3a101
What I tried (and which obviously failed):
test=read.csv('http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc')
which gives me gobbledegook in View(test).
Anything that steps me through this or points me in the right direction would be very gratefully received.
The URL you are accessing using read.csv is returning a zipped file. You could download it
using httr, say, and write the contents to a temp file:
library(httr)
urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"
response <- GET(urlUN)
# The handler returns a zip archive, so write the raw bytes to disk first
dir.create("temp", showWarnings = FALSE)
writeBin(content(response, as = "raw"), "temp/temp.zip")
# Find the name of the CSV inside the archive, extract it and read it in
fName <- unzip("temp/temp.zip", list = TRUE)$Name
unzip("temp/temp.zip", exdir = "temp")
read.csv(paste0("temp/", fName))
Alternatively Hmisc has a useful getZip function:
library(Hmisc)
urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"
unData <- read.csv(getZip(urlUN))
The links are being generated dynamically. The other problem is that the content isn't actually at that link. You're making a request to a (very odd and poorly documented) API which will eventually return the zip file. If you look in the Chrome dev tools as you click on that link, you'll see the request and response headers.
There are a few ways you can solve this. If you know some JavaScript, you can script a headless WebKit instance like Phantom to load these pages, simulate click events, wait for a content response, and then pipe that to something.
Alternatively, you may be able to finagle httr into treating this like a proper RESTful API. I have no idea if that's even remotely possible. :)

Attempting to deploy a binary to a location where a different binary is already stored

When I am publishing my page from Tridion 2009, I am getting the error below:
Destination with name 'FTP=[Host=servername, Location=\RET, Password=******, Port=21, UserName=retftp]' reported the following failure:
A processing error occurred processing a transport package Attempting to deploy a binary [Binary id=tcm:553-974947-16 variantId= sg= path=/Images/image_thumbnail01.jpg] to a location where a different binary is already stored Existing binary: tcd:pub[553]/binarymeta[974950]
Below is my code snippet
Component bigImageComp = th.GetComponentValue("bigimage", imageMetaFields);
string bigImagefileName = string.Empty;
string bigImagePath = string.Empty;
bigImagefileName = bigImageComp.BinaryContent.Filename;
bigImagePath = m_Engine.AddBinary(bigImageComp.Id, TcmUri.UriNull, null, bigImageComp.BinaryContent.GetByteArray(), Path.GetFileName(bigImagefileName));
imageBigNode.InnerText = bigImagePath;
Please suggest a fix.
Chris Summers addressed this on his blog. Have a read of the article - http://www.urbancherry.net/blogengine/post/2010/02/09/Unique-binary-filenames-for-SDL-Tridion-Multimedia-Components.aspx
Generally in Tridion Content Delivery we can only keep one version of a Component. To get multiple "versions" of a MMC we have to publish MMC as variants. By this way we can produce as many variants as we need via templating.
You can refer below article for more detail:
http://yatb.mitza.net/2012/03/publishing-images-as-variants.html#!/2012/03/publishing-images-as-variants.html
When adding binaries you must ensure that the file and its metadata are unique. If one of the values, e.g. the filename, appears to be the same but the rest of the metadata does not match, then deployment will fail.
In the given example (as Nuno points out) the binary 910 is trying to deploy over binary 703. The filename is the same, but the binary is identified as not being the same (in this case a different ID from the same publication). For this example you will need to rename one of the binaries (either the file itself or the path) and everything will be fine.
Another scenario is that the same image is used from two different templates and the template ID is used as the variant ID. In that case it is the same image, BUT the variant ID check fails, so to avoid overwriting the image the deployer fails it.
Often unpublishing can help; however, the image is only removed when ALL references to it are removed, so if it is used from more than one place there are still open references.
This is logical protection from the deployer. You would not want the wrong image replacing another and either upsetting the layout or potentially changing the content to another meaning (think advertising banner).
This is the actual cause of and reason for the problem above (taken from a forum).

I'm having problems with configuring a filter that replicates specific tables only

I am trying to use filters to select specific tables to replicate.
I tried running this with the installer
./tools/tungsten-installer --master-slave -a \
...
--svc-extractor-filters=replicate \
--property=replicator.filter.replicate.do=test,*.foo"
and got this exception in trepctl status after the master had not installed properly:
Plugin class name property is missing or null: key=replicator.filter.replicate
Which file is this properties file? How do I find it? Moreover, in specifying the settings for the filter, how do I know what exactly to put?
I discovered that I am supposed to modify the configuration template file prior to configuration, according to Issue 219, but what changes am I supposed to make in tungsten-replicator-2.0.5-diff that will later be patched into the extraction?
Issue 254 suggests that If you want to apply a filter out of the box, you can use these options with tungsten-installer:
-a --property=replicator.filter.Replicate.ignoreFilter=schema_x.tablex,schema_x,tabley,schema_y,tablez
--svc-thl-filter=Replicate
However, when I try using this for --property=replicator.filter.replicate.do,
the problem is still the same:
pendingExceptionMessage: Plugin class name property is missing or null: key=replicator.filter.replicate
Your assistance will be greatly appreciated.
Rumbi
Update:
Hi
I had a look at this file: /root/tungsten/tungsten-replicator/samples/conf/filters/default/tableignore.tpl. According to this sample, a static-SERVICE_NAME.properties file is supposed to have something like this configured; please confirm whether this is the correct syntax:
replicator.filter.tabledo=com.continuent.tungsten.replicator.filter.JavaScriptFilter
replicator.filter.tabledo.script=${replicator.home.dir}/samples/scripts/javascript-advanced/tabledo.js
replicator.filter.tabledo.tables=foo(database).bar(table)
replicator.stage.thl-to-dbms.filters=tabledo
However, I did not find tabledo.js (or something similar) in the directory where tableignore.js exists. Could I please have the location of this file? If there is an alternative way of specifying --property=replicator.filter.replicate.do=test without the use of this .js file, your suggestions are most welcome.
Download the latest version of tungsten replicator. The missing tpl file was added about a month ago. After installation, the filtered tables should be added to static-service.properties under the section FILTERS.
Locate your replicator configuration file in static-YOUR_SERVICE_NAME.properties, e.g.
/opt/continuent/tungsten/tungsten-replicator/conf/static-mysql2vertica.properties
Make sure the individual dbms properties are set, in particular the setting replicator.applier.dbms:
# Batch applier basic configuration information.
replicator.applier.dbms=com.continuent.tungsten.replicator.applier.batch.SimpleBatchApplier
replicator.applier.dbms.url=jdbc:mysql:thin://${replicator.global.db.host}:${replicator.global.db.port}/tungsten_${service.name}?createDB=true
replicator.applier.dbms.driver=org.drizzle.jdbc.DrizzleDriver
replicator.applier.dbms.user=${replicator.global.db.user}
replicator.applier.dbms.password=${replicator.global.db.password}
replicator.applier.dbms.startupScript=${replicator.home.dir}/samples/scripts/batch/mysql-connect.sql
# Timezone and character set.
replicator.applier.dbms.timezone=GMT+0:00
replicator.applier.dbms.charset=UTF-8
# Parameters for loading and merging via stage tables.
replicator.applier.dbms.stageTablePrefix=stage_xxx_
replicator.applier.dbms.stageDirectory=/tmp/staging
replicator.applier.dbms.stageLoadScript=${replicator.home.dir}/samples/scripts/batch/mysql-load.sql
replicator.applier.dbms.stageMergeScript=${replicator.home.dir}/samples/scripts/batch/mysql-merge.sql
replicator.applier.dbms.cleanUpFiles=false
Depending on the database you are replicating to, you may have to omit or modify some of these lines.
For more information see:
https://code.google.com/p/tungsten-replicator/wiki/Replicator_Batch_Loading
I don't know if this problem is still open or not.
I am using version 2.0.6-xxx, and installing the service with these parameters works for me.
I would like to point out that, as its name says, "--svc-extractor-filters" defines an extractor filter, meaning that the parameters will guide the extraction of data on the master server.
If you intend to use it on the slave side, you should use "--svc-applier-filters" instead.
The parameters
--svc-extractor-filters=replicate \
--property=replicator.filter.replicate.do=test,*.foo"
are supposed to create the following in the properties file. This is the filter setup:
replicator.filter.replicate=com.continuent.tungsten.replicator.filter.ReplicateFilter
replicator.filter.replicate.ignore=
replicator.filter.replicate.do=test,*.foo
And you should also be able to find the
replicator.stage.binlog-to-q.filters=replicate
parameter set.
If you intend to use this filter on the slave, please find the line with:
replicator.stage.q-to-dbms.filters=mysqlsessions,pkey,bidiSlave
and change it to:
replicator.stage.q-to-dbms.filters=mysqlsessions,pkey,bidiSlave,replicate
Hope this brief description helps!

Resources