rentrez entrez_summary premature EOF - r

Trying to move on from my troubles with RISmed (see "Problems with RISmed and large(ish) data sets"), I decided to use rentrez and entrez_summary to retrieve a large list of PubMed titles from a query:
set_entrez_key("######") # I did provide my real API key here
Sys.getenv("ENTREZ_KEY")
rm(list = ls())
library(rentrez)
query <- "(United States[AD] AND France[AD] AND 1995:2020[PDAT])"
results <- entrez_search(db = "pubmed", term = query, use_history = TRUE)
results
results$web_history
for (seq_start in seq(0, results$count, 100)) {
  Sys.sleep(0.1) # slow things down in case THAT'S a factor here....
  if (seq_start == 0) {
    summary.append.l <- entrez_summary(
      db = "pubmed",
      web_history = results$web_history,
      retmax = 100,
      retstart = seq_start
    )
  } else {
    summary.append.l <- append(
      summary.append.l,
      entrez_summary(
        db = "pubmed",
        web_history = results$web_history,
        retmax = 100,
        retstart = seq_start
      )
    )
  }
}
The good news... I didn't get a flat-out rejection from NCBI like I did with RISmed and EUtilsGet. The bad news... it's not completing. I get either
Error in curl::curl_fetch_memory(url, handle = handle) :
transfer closed with outstanding read data remaining
or
Error: parse error: premature EOF
(right here) ------^
I almost think there's something about using an affiliation search string in the query, because if I change the query to
query="monoclonal[Title] AND antibody[Title] AND 2010:2020[PDAT]"
it completes the run, despite having about the same number of records to deal with. So...any ideas why a particular search string would result in problems with the NCBI servers?
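For what it's worth, one thing that sometimes helps with flaky esummary responses is to pull smaller chunks and retry a chunk that fails instead of aborting the whole run. The sketch below is only an illustration of that idea, not a fix for whatever the server is doing with the affiliation query; it reuses the results object from the search above, and the chunk size of 50, the three-attempt limit, and the fetch_chunk() helper are my own choices rather than anything from rentrez:
# Hypothetical helper: fetch one block of summaries, retrying on curl/parse errors.
fetch_chunk <- function(start, size = 50, tries = 3) {
  for (attempt in seq_len(tries)) {
    out <- tryCatch(
      entrez_summary(db = "pubmed",
                     web_history = results$web_history,
                     retmax = size,
                     retstart = start),
      error = function(e) NULL
    )
    if (!is.null(out)) return(out)
    Sys.sleep(2 * attempt) # back off a little before retrying
  }
  stop("chunk starting at ", start, " failed after ", tries, " attempts")
}

summaries <- list()
for (seq_start in seq(0, results$count, 50)) {
  summaries <- append(summaries, fetch_chunk(seq_start))
  Sys.sleep(0.1)
}

# Each element of summaries is one esummary record; pull the titles out.
titles <- sapply(summaries, function(rec) rec$title)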

Related

I'm aware I'm trying to get data for multiple accounts, but where can I specify that, rather than receiving this error in R?

library(rgoogleads)
library(gargle)
token <- token_fetch()
token
gads_auth(email = 'xx@gmail.com')
Authentication complete.
ad_group_report <- gads_get_report(
  resource = "ad_group",
  fields = c("ad_group.campaign",
             "ad_group.id",
             "ad_group.name",
             "ad_group.status",
             "metrics.clicks",
             "metrics.cost_micros"),
  date_from = "2021-01-08",
  date_to = "2021-01-10",
  where = "ad_group.status = 'ENABLED'",
  order_by = c("metrics.clicks DESC", "metrics.cost_micros")
)
i Multi account request
! The request you sent did not return any results, check the entered parameters and repeat the opposition.
Why do I receive this error? I never received it with the RAdwords package. Where do I specify the argument for multiple accounts?
https://cran.r-project.org/web/packages/rgoogleads/rgoogleads.pdf
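I don't have a definitive answer, but the "Multi account request" message suggests the authenticated login is a manager (MCC) account, so the package tries to pull the report across every client account it can see. As far as I can tell from the manual linked above, you can point the request at one specific client account; the sketch below assumes a gads_set_login_customer_id() setter and a customer_id argument to gads_get_report(), both of which you should verify against your installed version of rgoogleads, and the ID placeholders are obviously hypothetical:
library(rgoogleads)

gads_auth(email = 'xx@gmail.com')

# If the login is a manager (MCC) account, register its ID first...
gads_set_login_customer_id('xxx-xxx-xxxx')   # hypothetical manager account ID

# ...then request the report from one specific client account.
ad_group_report <- gads_get_report(
  resource    = "ad_group",
  fields      = c("ad_group.campaign",
                  "ad_group.id",
                  "ad_group.name",
                  "ad_group.status",
                  "metrics.clicks",
                  "metrics.cost_micros"),
  date_from   = "2021-01-08",
  date_to     = "2021-01-10",
  where       = "ad_group.status = 'ENABLED'",
  order_by    = c("metrics.clicks DESC", "metrics.cost_micros"),
  customer_id = 'yyy-yyy-yyyy'               # hypothetical client account ID
)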

I'm trying to get some tweets with academictwitteR, but the code points to an error with endpoint_url

I'm trying to get some tweets with academictwitteR, but the code throws the following error:
tweets_espn <- get_all_tweets(
  query = "fluminense",
  user = "ESPNBrasil",
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-12-31T00:00:00Z",
  n = 10000
)
query: fluminense (from:ESPNBrasil)
Error in make_query(url = endpoint_url, params = params, bearer_token = bearer_token, :
  something went wrong. Status code: 403
In addition: Warning messages:
1: Recommended to specify a data path in order to mitigate data loss when ingesting large amounts of data.
2: Tweets will not be stored as JSONs or as a .rds file and will only be available in local memory if assigned to an object.
It seems to me that you can only access the Twitter API via academictwitteR if you have been granted "Academic Research" access from the Twitter developer portal, so I don't think it works with Essential or Elevated access.
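If you do have Academic Research access, a call along these lines should go through. This is only a sketch: I've folded the account into the query via the from: operator, written the timestamps in plain ISO 8601, and added a data_path to address the two warnings; get_bearer() assumes the bearer token was already stored via set_bearer(), and the "tweet_data/" folder name is just a placeholder.
library(academictwitteR)

tweets_espn <- get_all_tweets(
  query        = "fluminense (from:ESPNBrasil)",  # from: operator instead of a separate user argument
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets   = "2020-12-31T00:00:00Z",
  bearer_token = get_bearer(),                    # reads the token stored by set_bearer()
  n            = 10000,
  data_path    = "tweet_data/",                   # store raw JSON on disk to avoid data loss
  bind_tweets  = TRUE                             # still return a data frame in memory
)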

How to import data from an HTML table on a website into Excel?

I would like to do some statistical analysis with Python on the live casino game called Crazy Time from Evolution Gaming. There is a website that has the data to do this: https://tracksino.com/crazytime. I want the data from the lowest table, 'Spin History', to be imported into Excel, but I do not know how this can be done. Could anyone give me an idea where to start?
Thanks in advance!
Try the below code:
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv
import datetime

def scrap_history():
    csv_headers = []
    file_path = ''  # mention the directory where you want to save the file
    file_name = 'spin_history.csv'  # file name
    page_number = 1
    while True:
        # dynamic URL fetching data in chunks of 100
        url = 'https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=' + str(page_number) + '&per_page=100&period=24hours'
        print('-' * 100)
        print('URL created : ', url)
        response = requests.get(url, verify=False)
        result = json.loads(response.text)  # parse the JSON response
        history_data = result['data']
        print(history_data)
        if history_data != []:
            with open(file_path + file_name, 'a+') as history:
                # headers for the CSV file
                csv_headers = ['Occured At', 'Slot Result', 'Spin Result', 'Total Winners', 'Total Payout']
                csvwriter = csv.DictWriter(history, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                if page_number == 1:
                    print('Writing CSV header now...')
                    csvwriter.writeheader()
                # write the extracted data to the CSV file row by row
                for item in history_data:
                    value = datetime.datetime.fromtimestamp(item['when'])
                    occured_at = f'{value:%d-%B-%Y @ %H:%M:%S}'
                    csvwriter.writerow({'Occured At': occured_at,
                                        'Slot Result': item['slot_result'],
                                        'Spin Result': item['result'],
                                        'Total Winners': item['total_winners'],
                                        'Total Payout': item['total_payout'],
                                        })
            print('-' * 100)
            page_number += 1
            print(page_number)
            print('-' * 100)
        else:
            break

scrap_history()  # run the scraper
Explanation:
I implemented the above script using the Python requests approach. The API URL https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=1&per_page=50&period=24hours was extracted from the website itself (refer to the screenshot). The script builds a dynamic URL in which the page number changes on every iteration: first page_num = 1, then page_num = 2, and so on until all the data has been extracted.

The New York Times API with R

I'm trying to get article information using The New York Times API. The CSV file I get doesn't reflect my filter query. For example, I restricted the source to 'The New York Times', but the file I got contains other sources as well.
I would like to ask you why the filter query doesn't work.
Here's the code.
if (!require("jsonlite")) install.packages("jsonlite")
library(jsonlite)

api = "apikey"

nytime = function () {
  url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
              '&fq=source:',("The New York Times"),'AND type_of_material:',("News"),
              'AND persons:',("Trump, Donald J"),
              '&begin_date=','20160522&end_date=','20161107&api-key=',api,sep="")
  #get the total number of search results
  initialsearch = fromJSON(url,flatten = T)
  maxPages = round((initialsearch$response$meta$hits / 10)-1)
  #try with the max page limit at 10
  maxPages = ifelse(maxPages >= 10, 10, maxPages)
  #create an empty data frame
  df = data.frame(id=as.numeric(),source=character(),type_of_material=character(),
                  web_url=character())
  #save search results into the data frame
  for(i in 0:maxPages){
    #get the search results for each page
    nytSearch = fromJSON(paste0(url, "&page=", i), flatten = T)
    temp = data.frame(id=1:nrow(nytSearch$response$docs),
                      source = nytSearch$response$docs$source,
                      type_of_material = nytSearch$response$docs$type_of_material,
                      web_url=nytSearch$response$docs$web_url)
    df=rbind(df,temp)
    Sys.sleep(5) #sleep for 5 seconds
  }
  return(df)
}

dt = nytime()
write.csv(dt, "trump.csv")
Here's the CSV file I got.
It seems you need to put the () inside the quotes, not outside. Like this:
url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
            '&fq=source:',"(The New York Times)",'AND type_of_material:',"(News)",
            'AND persons:',"(Trump, Donald J)",
            '&begin_date=','20160522&end_date=','20161107&api-key=',api,sep="")
https://developer.nytimes.com/docs/articlesearch-product/1/overview
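One more thing worth checking, and this is my own assumption rather than part of the answer above: the fq value still contains spaces and parentheses, which are not legal in a raw URL, so it is usually safer to build the filter as a string and percent-encode it before handing the URL to fromJSON(). The quoted-phrase form of the fq value below is also an assumption on my part; compare it with the Article Search docs linked above.
library(jsonlite)

api <- "apikey"  # your NYT API key

# Build the filter query as a plain string first...
fq <- 'source:("The New York Times") AND type_of_material:("News") AND persons:("Trump, Donald J")'

# ...then percent-encode it so spaces, quotes and parentheses survive in the URL.
url <- paste0(
  "http://api.nytimes.com/svc/search/v2/articlesearch.json?",
  "fq=", URLencode(fq, reserved = TRUE),
  "&begin_date=20160522&end_date=20161107",
  "&api-key=", api
)

initialsearch <- fromJSON(url, flatten = TRUE)
initialsearch$response$meta$hits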

Why is so much duplicated data being saved to my Excel sheet by my code?

This code is generally used to scrape data from websites, but the problem is that a large amount of duplicated data is being produced and saved to my Excel sheet.
import re
import time
import urllib.parse

import requests
import pandas as pd
from lxml import html
from pandas import ExcelWriter
from selenium.common.exceptions import WebDriverException, NoSuchElementException

# driver, siteurl, nextbutton() and the allnames/allmails/allphone/alltitles lists
# are defined earlier in the full script.

def extractor():
    time.sleep(10)
    souptree = html.fromstring(driver.page_source)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    for tbu in tburl:
        allurl = []
        allurl.append(urllib.parse.urljoin(siteurl, tbu))
        for tb in allurl:
            get_url = requests.get(tb)
            get_soup = html.fromstring(get_url.content)
            pattern = re.compile(r"^\s+|\s*,\s*|\s+$")
            name = get_soup.xpath('//td[@headers="contactName"]//text()')
            phone = get_soup.xpath('//td[@headers="contactPhone"]//text()')
            mail = get_soup.xpath('//td[@headers="contactEmail"]//a//text()')
            artitle = get_soup.xpath('//td[@headers="contactEmail"]//a//@href')
            artit = ([x for x in pattern.split(str(artitle)) if x][-1])
            title = artit[:-2]
            for (nam, pho, mai) in zip(name, phone, mail):
                fname = nam[9:]
                allmails.append(mai)
                allnames.append(fname)
                allphone.append(pho)
                alltitles.append(title)
                fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails, 'Title': alltitles, 'Phone Numbers': allphone})
                writer = ExcelWriter('G:\\Sheet_Name.xlsx')
                fullfile.to_excel(writer, 'Sheet1', index=False)
                writer.save()
                print(fname, pho, mai, title, sep='\t')

while True:
    time.sleep(10)
    extractor()
    try:
        nextbutton()
    except WebDriverException:
        driver.refresh()
    except NoSuchElementException:
        time.sleep(10)
        driver.quit()
I don't want the output to be duplicated, but almost half or more of the rows come out duplicated each time I run the code.
