I have a list of 100 URLs that I need to pull analytics for. Suppose the list is as follows:
allUrls <- c("google.com", "facebook.com", "instagram.com", "netflix.com")
I want to be able to pass a segment to Adobe in RSiteCatalyst where the url is equal to one of the list of 100 URLs. I've been able to build segments like the following:
library(RSiteCatalyst)
library(jsonlite)  # for unbox()

rules <- data.frame(name = c("google", "facebook", "instagram", "netflix"),
                    element = rep("url", 4),
                    operator = rep("equals", 4),
                    value = allUrls)
adobeSegment <- list(container = list(type = unbox("hits"),
                                      rules = rules))
but I don't know how to specify that I want these to be OR rules, not AND. I looked into the BuildClassificationValueSegment() function, but I can't seem to make sense of the inputs, nor could I find any good examples of how it works.
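My best guess is that the OR/AND choice belongs on the container rather than on the individual rules, something like the sketch below, but I haven't been able to confirm that the v1.4 segment definition actually accepts an operator key there:

library(jsonlite)  # unbox() keeps scalar fields from serializing as JSON arrays

# Assumption (unconfirmed): containers in the v1.4 segment definition take
# an "operator" key, and "or" treats the rules as alternatives.
adobeSegment <- list(container = list(type = unbox("hits"),
                                      operator = unbox("or"),
                                      rules = rules))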
I'm just hoping for an output of a df that'd look something like this:
url        visits
google       1000
facebook     8000
instagram    3276
netflix      4675
I have this script that takes each domain name in a DataFrame and performs an "inurl:domain automation testing" Google search for each of them. I will scrape the first search result and add it to my DataFrame.
import random
import time

from googlesearch import search  # provided by the "google" package on PyPI

# Convert the Domain column of the DataFrame into a list
working_backlink = backlink_df.iloc[23:len(backlink_df['Domain']), 1:22]
working_domain = working_backlink["Domain"]
domain_list = working_domain.values.tolist()

# Iterate through the list and perform a query search for each domain
for x in range(23, len(domain_list)):
    sleeptime = random.randint(1, 10)
    time.sleep(sleeptime)
    for i in domain_list:
        query = "inurl:{} automation testing".format(i)
        delay = random.randint(10, 30)
        for j in search(query, tld="com", num=1, stop=1, pause=delay):
            working_backlink.iat[x, 5] = j

# Show DataFrame
working_backlink.head(n=40)
I tried using sleeptime and a random delay to prevent the HTTP 429 error, but it still doesn't work. Could you suggest any solution to this? Thanks a lot!
I can retrieve EPPO DB info from GET requests.
I am looking for help to retrieve the info from POST requests.
Example code and other info are in the linked R Markdown HTML output.
As suggested, I have gone through the https://httr.r-lib.org/ site.
Interesting. I followed the links to https://httr.r-lib.org/articles/api-packages.html and then to https://cdn.zapier.com/storage/learn_ebooks/e06a35cfcf092ec6dd22670383d9fd12.pdf.
I suppose that the arguments for the POST() function should be (more or less) as follows, but the response is always 404:
url = "https://data.eppo.int/api/rest/1.0/"
config = list(authtoken=my_authtoken)
body = list(intext = "Fraxinus anomala|Tobacco ringspot virus|invalide name|Sequoiadendron giganteum")
encode = "json"
#handle = ???
How do I find the missing pieces?
It is a little bit tricky:
You need to use the correct URL, with the name of the service, from https://data.eppo.int/documentation/rest, e.g. to use "Search preferred names from EPPO codes list":
url = "https://data.eppo.int/api/rest/1.0/tools/codes2prefnames"
The authorization token should be passed in the body:
body = list(authtoken = "yourtoken", intext = "BEMITA")
So, if you want to check names for two EPPO codes, XYLEFA and BEMITA, the code should look like:
httr::POST(
  url = "https://data.eppo.int/api/rest/1.0/tools/codes2prefnames",
  body = list(authtoken = "yourtoken", intext = "BEMITA|XYLEFA")
)
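To read the reply, httr::content() parses it for you. Assuming (unverified, so check against a live response) each element comes back as a "CODE;preferred name" string, something like this turns it into a named vector:

response <- httr::POST(
  url = "https://data.eppo.int/api/rest/1.0/tools/codes2prefnames",
  body = list(authtoken = "yourtoken", intext = "BEMITA|XYLEFA")
)

# Assumption: each element of the parsed content is a "CODE;preferred name"
# string; split on the semicolon to separate code and name.
parsed <- httr::content(response)
codes <- vapply(parsed, function(x) strsplit(x, ";")[[1]][1], character(1))
prefnames <- vapply(parsed, function(x) strsplit(x, ";")[[1]][2], character(1))
setNames(prefnames, codes)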
Nonetheless, I would also recommend just using the pestr package. However, to find EPPO codes it uses an SQLite database under the hood instead of the EPPO REST API. Since the db itself is not big (circa 40 MB), this shouldn't be an issue.
I found an easy way following a suggestion in the DataCamp course:
"To find an R client for an API service, search 'CRAN <name of the service>'."
I found the pestr package, which gives great access to the EPPO database.
I still do not know how to use the POST() function myself. Any hint on that side is warmly welcome.
Here is a solution that loops over names to get EPPO codes. With minor adjustments this also works for distribution and other information in the EPPO DB:
# vector of species names
host_name <- c("Fraxinus anomala", "Tobacco ringspot virus", "invalide name",
               "Sequoiadendron giganteum")
EPPO_key <- "enter personal key"

# EPPO url
path_eppo_code <- "https://data.eppo.int/api/rest/1.0/tools/names2codes"

# empty list to collect one EPPO code per name
my_list <- list()

for (i in host_name) {
  # POST request against the EPPO database
  response <- httr::POST(path_eppo_code,
                         body = list(authtoken = EPPO_key, intext = i))
  # the reply pairs each input name with its code, separated by ";";
  # keep the part after the semicolon (the EPPO code)
  pest_eppo_code <- strsplit(httr::content(response)[[1]], ";")[[1]][2]
  # add to list
  my_list[[i]] <- pest_eppo_code
}

# list to data frame
data <- plyr::ldply(my_list)
Using the python-ldap search_s() function (https://www.python-ldap.org/en/python-ldap-3.3.0/reference/ldap.html#ldap.LDAPObject.search_s) with params...
base = DC=myorg,DC=local
filterstr = (&(sAMAccountName={login})(|(memberOf=CN=zone1,OU=zones,OU=datagroups,DC=myorg,DC=local)(memberOf=CN=zone2,OU=zones,OU=datagroups,DC=myorg,DC=local)))
...to try to match against a specific AD user.
Yet when I look at the result returned (with login = myuser), I see something like:
[
(u'CN=zone1,OU=zones,OU=datagroups,DC=myorg,DC=local', {u'sAMAccountName': ['myuser']}),
(None, [u'ldap://DomainDnsZones.myorg.local/DC=DomainDnsZones,DC=myorg,DC=local']),
(None, [u'ldap://ForestDnsZones.myorg.local/DC=ForestDnsZones,DC=myorg,DC=local']),
(None, [u'ldap://myorg.local/CN=Configuration,DC=myorg,DC=local'])
]
where there are multiple other hits in the list (besides the myuser sAMAccountName match) that have nothing to do with the search filter.
Looking at the docs (https://www.python-ldap.org/en/python-ldap-3.3.0/faq.html), these appear to be "search continuations" (referrals) that are included when the search base is at the domain level, and the FAQ says they can be turned off with code like...
l = ldap.initialize('ldap://foobar')
l.set_option(ldap.OPT_REFERRALS,0)
as well as trying
ldap.set_option(ldap.OPT_REFERRALS,0)
l = ldap.initialize('ldap://foobar')
...yet adding this code does not change the behavior at all and I get the same results (see https://www.python-ldap.org/en/python-ldap-3.3.0/reference/ldap.html?highlight=set_option#ldap.set_option).
Am I misunderstanding something here? Does anyone know how to stop these from popping up? And does anyone know the structure of the tuples this function returns (the docs do not describe it)?
I just talked to someone more familiar with python-ldap and was told that OPT_REFERRALS controls whether you automatically follow the referral; it doesn't stop AD from sending them.
For now, the only approach they recommended was to filter these values with something like:
results = ldap.search_s(...)
results = [ x for x in results if x[0] is not None ]
Noting that the structure of the results returned from search_s() is:
[
    (dn, {
        attrname: [value, value, ...],
        attrname: [value, value, ...],
    }),
]
When it's a referral, the DN is None and the entry dict is replaced with an array of URIs.
* (Note that in the search_s call you can request specific attributes to be returned in your search too)
* (Note that since my base DN is a domain-level path, the ldap.set_option(ldap.OPT_REFERRALS, 0) snippet was still useful, just to stop search_s() from actually following the referral paths, which was adding a few seconds to the search time.)
Again, I believe this problem is due to the base DN being a domain-level path (unless there is some other base DN or search filter I'm missing, given that the group users are scattered across various AD paths in the domain).
I am making a Shiny app where a user can input an Excel file of terms and search them against Elasticsearch holdings. The Excel file contains one column of keywords and is read into a list; I then want each item in the list searched using Search(). I did this easily in Python with a for loop over the terms and the search connection inside the loop, and got accurate results. I understand that it isn't that easy in R, but I cannot get to the right solution. I am using the R elastic package and have been trying different versions of Search() for over a day. I have not used elastic before, so my apologies for not understanding the syntax much. I know that I need to do something with aggs for a list of terms.
Essentially I want to search on the source.body_ field and I want to use match_phrase for searching my terms.
Here is the code I have in Python that works, but I need everything in R for the shiny app and don't want to use reticulate.
queries = list()
for term in my_terms:
    search_result = es.search(index="cars", body={"query": {"match_phrase": {"body_": term}}}, size=5000)
    search_result.update([("term", term)])
    queries.append(search_result)
I established my elastic connection as con and made sure it can bring back accurate matches on just one keyword with:
match <- {"query": {"match_phrase" : {"body_" : "mustang"}}}
search_results <- Search(con, index="cars", body = match, asdf = TRUE)
That worked how I expected it to with just one keyword explicitly defined.
So after that, here is what I have tried for a list of my_terms:
aggs <- '{"aggs":{"stats":{"terms":{"field":"my_terms"}}}}'
queries <- data.frame()
for (term in my_terms) {
final <- Search(con, index="cars", body = aggs, asdf = TRUE, size = 5000)
rbind(queries, final)
}
When I run this, it just brings back everything in my elastic. I also tried running that with the for loop commented out and that didn't work.
I have also tried embedding my_terms inside my match body from the single-term search at the beginning of this post, like so:
match <- '{"query": {"match_phrase" : {"body_": "my_terms"}}}'
Search(con, index="cars", body = match, asdf = TRUE, size = 5000)
This returns nothing. I have tried so many combinations of the aggs list and match list but nothing returns what I'm looking for. Any advice would be much appreciated, I feel like I've read just about everything so far and now I'm just confused.
UPDATE:
I figured it out with:
p <- data.frame()
for (t in the_terms$keyword) {
  body <- paste0('{"query": {"match_phrase" : {"body_": "', t, '"}}}')
  result <- Search(con, index = "cars", body = body, asdf = TRUE, size = 5000)
  p <- rbind(p, result$hits$hits)  # with asdf = TRUE the hits come back as a data frame
}
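One small variant (an untested sketch) that also records which keyword produced each hit, like the search_result.update([('term', term)]) line in the Python version:

p <- data.frame()
for (t in the_terms$keyword) {
  body <- paste0('{"query": {"match_phrase" : {"body_": "', t, '"}}}')
  result <- Search(con, index = "cars", body = body, asdf = TRUE, size = 5000)
  hits <- result$hits$hits
  if (NROW(hits) > 0) {
    hits$term <- t  # tag each row with the keyword that matched it
    p <- rbind(p, hits)
  }
}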
I want to make thousands of entrez requests. I know that you are not allowed to make more than 3 requests per second unless you have an API key. The rentrez introduction says that rentrez enforces this limit.
I just want to make sure that NCBI does not block me, so the performance is not the issue here.
https://cran.r-project.org/web/packages/rentrez/vignettes/rentrez_tutorial.html
But what if I combine lapply() with entrez_search()?
authors <- c("Uhelski ML", "Manjavachi MN")
authors_pubs <- lapply(authors, function(x) {
  entrez_search(db = "pubmed", term = paste0(x, "[AUTH]"), use_history = TRUE)
})
Does rentrez still enforce the request limit in this case, or do I have to introduce pauses into my code?
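In case it doesn't, the belt-and-braces workaround I'd fall back on (my own sketch, not something from the rentrez docs) is to pause inside the function, which caps the loop below 3 requests per second regardless of what rentrez does internally:

library(rentrez)

authors <- c("Uhelski ML", "Manjavachi MN")
authors_pubs <- lapply(authors, function(x) {
  Sys.sleep(0.34)  # a bit more than 1/3 s between calls: at most ~3 requests/second
  entrez_search(db = "pubmed", term = paste0(x, "[AUTH]"), use_history = TRUE)
})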