The New York Times API with R

I'm trying to get article information using The New York Times API, but the CSV file I get doesn't reflect my filter query. For example, I restricted the source to 'The New York Times', yet the file I got also contains other sources.
Why doesn't the filter query work?
Here's the code:
if (!require("jsonlite")) install.packages("jsonlite")
library(jsonlite)

api = "apikey"

nytime = function () {
  url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
              '&fq=source:',("The New York Times"),'AND type_of_material:',("News"),
              'AND persons:',("Trump, Donald J"),
              '&begin_date=','20160522&end_date=','20161107&api-key=',api,sep="")
  # get the total number of search results
  initialsearch = fromJSON(url, flatten = T)
  maxPages = round((initialsearch$response$meta$hits / 10) - 1)
  # try with the max page limit at 10
  maxPages = ifelse(maxPages >= 10, 10, maxPages)
  # create an empty data frame
  df = data.frame(id = as.numeric(), source = character(), type_of_material = character(),
                  web_url = character())
  # save search results into the data frame
  for (i in 0:maxPages) {
    # get the search results of each page
    nytSearch = fromJSON(paste0(url, "&page=", i), flatten = T)
    temp = data.frame(id = 1:nrow(nytSearch$response$docs),
                      source = nytSearch$response$docs$source,
                      type_of_material = nytSearch$response$docs$type_of_material,
                      web_url = nytSearch$response$docs$web_url)
    df = rbind(df, temp)
    Sys.sleep(5) # sleep for 5 seconds
  }
  return(df)
}

dt = nytime()
write.csv(dt, "trump.csv")
Here's the CSV file I got.

It seems you need to put the parentheses inside the quotes, not outside. Like this:
url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
            '&fq=source:',"(The New York Times)",'AND type_of_material:',"(News)",
            'AND persons:',"(Trump, Donald J)",
            '&begin_date=','20160522&end_date=','20161107&api-key=',api,sep="")
https://developer.nytimes.com/docs/articlesearch-product/1/overview

Related

Using R - my for loop only works on the first iteration

I have a loop where each iteration does at least 1 and at most 2 API GET requests.
I loop through a list of movies: for each one I do a GET request to look up the movie's ID in the API database, then I do a second request to get the reviews using that ID.
When I run the code it works for the first movie, but all the other movies remain empty, even when the first movie has no reviews.
Here is the code:
for (i in 1:9964) {
  id_url <- paste(url, search_title, key, COPY$Film[i], sep = "")
  # GET request for movie
  api_call_id <- GET(id_url)
  # make readable
  read <- rawToChar(api_call_id$content)
  # turn JSON into an object
  JSON <- fromJSON(read, flatten = TRUE)
  if (is.null(JSON$results$id[1])) {
    break;
  }
  else { # get the movie id from the JSON
    id <- JSON$results$id[1]
  }
  # url to get the movie's ratings
  raiting_url <- paste(url, raiting, key, id, sep = "")
  # call
  api_call_raiting <- GET(raiting_url)
  # readable
  read <- rawToChar(api_call_raiting$content)
  # json
  JSON <- fromJSON(read, flatten = TRUE)
  if (is.null(JSON$rottenTomatoes)) {
    # set column value
    COPY[i, 'rTomatoes'] <- "No Review"
  }
  else {
    # set column value
    COPY[i, 'rTomatoes'] <- JSON$rottenTomatoes
  }
}

Wrong page parsed with BeautifulSoup?

I want to enter two values on the website https://hausratversicherung.friday.de/ and retrieve the resulting value after submitting the form. I wrote the following code:
import requests, re
from robobrowser import RoboBrowser
br = RoboBrowser(parser='html.parser')
br.open("https://hausratversicherung.friday.de/")
form = br.get_form()
form['area'] = 100
form['postalCode'] = 44326
br.submit_form(form)
src = str(br.parsed())
start = '<div class="Typography-sc-3c3fuf-0 jEIicc" data-testid="totalPrice">'
end = ' €</div>'
result = re.search('%s(.*)%s' % (start, end), src).group(1)
print(result)
But the browser br is not opening the mentioned page or entering these values.
The postal code 44326 isn't accepted by the server. For other postal codes you can query their API directly:
import json
import requests
area = 100
postalcode = 44309
url = 'https://fdy2-policycenter-production.k8s.blue.friday-prod.de/rest/friday/hc/price?area={area}&postalCode={postalcode}'
data = requests.get(url.format(area=area, postalcode=postalcode)).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some info to screen:
print(data['basicCoverages']['coverages'][0]['insuredSum']['amount'])
print(data['basicCoverages']['coverages'][0]['price']['amount'])
Prints:
65000.0
7.81
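If you want the lookup to fail gracefully for postal codes the server rejects (as with 44326 above), you could check the HTTP status before parsing. This is only a sketch; the assumption that a rejected postal code comes back as a non-2xx status is mine and not confirmed by the answer:
import requests

area = 100
postalcode = 44326  # the postal code the server would not accept
url = 'https://fdy2-policycenter-production.k8s.blue.friday-prod.de/rest/friday/hc/price?area={area}&postalCode={postalcode}'

response = requests.get(url.format(area=area, postalcode=postalcode))
if response.ok:  # HTTP status < 400
    data = response.json()
    print(data['basicCoverages']['coverages'][0]['price']['amount'])
else:
    # assumption: the server signals a rejected postal code with an error status
    print('Request failed with HTTP status', response.status_code)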

How to import data from an HTML table on a website to Excel?

I would like to do some statistical analysis with Python on the live casino game called Crazy Time from Evolution Gaming. There is a website that has the data needed for this: https://tracksino.com/crazytime. I want the data from the bottom table, 'Spin History', to be imported into Excel. However, I do not know how this can be done. Could anyone give me an idea of where to start?
Thanks in advance!
Try the below code:
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv
import datetime

def scrap_history():
    csv_headers = []
    file_path = ''  # path where you want to save the file
    file_name = 'spin_history.csv'  # filename
    page_number = 1
    while True:
        # dynamic URL fetching data in chunks of 100
        url = 'https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=' + str(page_number) + '&per_page=100&period=24hours'
        print('-' * 100)
        print('URL created : ', url)
        response = requests.get(url, verify=False)
        result = json.loads(response.text)  # parse the response as JSON
        history_data = result['data']
        print(history_data)
        if history_data != []:
            with open(file_path + file_name, 'a+') as history:
                # headers for the file
                csv_headers = ['Occured At', 'Slot Result', 'Spin Result', 'Total Winners', 'Total Payout']
                csvwriter = csv.DictWriter(history, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                if page_number == 1:
                    print('Writing CSV header now...')
                    csvwriter.writeheader()
                # write the extracted data into the csv file row by row
                for item in history_data:
                    value = datetime.datetime.fromtimestamp(item['when'])
                    occured_at = f'{value:%d-%B-%Y @ %H:%M:%S}'
                    csvwriter.writerow({'Occured At': occured_at,
                                        'Slot Result': item['slot_result'],
                                        'Spin Result': item['result'],
                                        'Total Winners': item['total_winners'],
                                        'Total Payout': item['total_payout'],
                                        })
            print('-' * 100)
            page_number += 1
            print(page_number)
            print('-' * 100)
        else:
            break
Explanation:
I implemented the above script using the Python requests approach. The API URL https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=1&per_page=50&period=24hours was extracted from the website itself (refer to the screenshot). In the first step the script builds the dynamic URL, where the page number changes on every iteration: first page_num=1, then page_num=2, and so on until all the data has been extracted.
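The answer defines the function but never calls it; to actually produce spin_history.csv you would invoke it yourself, for example like this (a minimal sketch):
if __name__ == '__main__':
    # keeps requesting pages until the API returns an empty 'data' list, then stops
    scrap_history()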

I need help web scraping a comments section

I am currently doing a project on football players and am trying to scrape some public comments about them for sentiment analysis. However, I can't seem to scrape the comments; any help would be MUCH appreciated. It is the comments part I can't get working. Weirdly enough, I had it working, but then it stopped and I can't get it to scrape comments again. The website I am scraping from is: https://sofifa.com/player/192985/kevin-de-bruyne/200025/
likes = []
dislikes = []
follows = []
comments = []

driver_path = '/Users/niallmcnulty/Desktop/GeneralAssembly/Lessons/DSI11-lessons/week05/day2_web_scraping_and_apis/web_scraping/selenium-examples/chromedriver'
driver = webdriver.Chrome(executable_path=driver_path)

# i = 0
for url in tqdm_notebook(urls):
    driver.get(url)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(0.2)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    try:
        dislike = soup1.find('button', attrs={'class': 'bp3-button bp3-minimal bp3-intent-danger dislike-btn need-sign-in'}).find('span', {'class': 'count'}).text.strip()
        dislikes.append(dislike)
    except:
        pass
    try:
        like = soup1.find('button', attrs={'class': 'bp3-button bp3-minimal bp3-intent-success like-btn need-sign-in'}).find('span', {'class': 'count'}).text.strip()
        likes.append(like)
    except:
        pass
    try:
        follow = soup1.find('button', attrs={'class': 'bp3-button bp3-minimal follow-btn need-sign-in'}).find('span', {'class': 'count'}).text.strip()
        follows.append(follow)
    except:
        pass
    try:
        comment = soup1.find_all('p').text[0:10]
        comments.append(comment)
    except:
        pass
    # i += 1
    # if i % 5 == 0:
    #     sentiment = pd.DataFrame({"dislikes": dislikes, "likes": likes, "follows": follows, "comments": comments})
    #     sentiment.to_csv('/Users/niallmcnulty/Desktop/GeneralAssembly/Lessons/DSI11-lessons/projects/cap-csv/sentiment.csv')

sentiment_final = pd.DataFrame({"dislikes": dislikes, "likes": likes, "follows": follows, "comments": comments})
# df_sent = pd.merge(df, sentiment, left_index=True, right_index=True)
The comments section is dynamically loaded. You can try to capture it using the driver:
try:
    comment_elements = driver.find_elements_by_tag_name('p')
    for comment in comment_elements:
        comments.append(comment.text)
except:
    pass

print(comments)
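Since the page loads the comments dynamically, the fixed sleep(0.2) in the question can fire before they exist. As an alternative sketch (not part of the original answer, and the 10-second timeout is an arbitrary choice of mine), you could wait explicitly for <p> elements before reading them:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # wait up to 10 seconds for at least one <p> element to appear in the DOM
    paragraphs = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, 'p'))
    )
    for p in paragraphs:
        comments.append(p.text)
except Exception:
    pass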

Why is so much duplicated data being saved to my Excel sheet by my code?

This code is used to scrape data from a website, but the problem is that a large amount of duplicated data is being produced and saved in my Excel sheet.
def extractor():
    time.sleep(10)
    souptree = html.fromstring(driver.page_source)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    for tbu in tburl:
        allurl = []
        allurl.append(urllib.parse.urljoin(siteurl, tbu))
        for tb in allurl:
            get_url = requests.get(tb)
            get_soup = html.fromstring(get_url.content)
            pattern = re.compile("^\s+|\s*,\s*|\s+$")
            name = get_soup.xpath('//td[@headers="contactName"]//text()')
            phone = get_soup.xpath('//td[@headers="contactPhone"]//text()')
            mail = get_soup.xpath('//td[@headers="contactEmail"]//a//text()')
            artitle = get_soup.xpath('//td[@headers="contactEmail"]//a//@href')
            artit = ([x for x in pattern.split(str(artitle)) if x][-1])
            title = artit[:-2]
            for (nam, pho, mai) in zip(name, phone, mail):
                fname = nam[9:]
                allmails.append(mai)
                allnames.append(fname)
                allphone.append(pho)
                alltitles.append(title)
            fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails, 'Title': alltitles, 'Phone Numbers': allphone})
            writer = ExcelWriter('G:\\Sheet_Name.xlsx')
            fullfile.to_excel(writer, 'Sheet1', index=False)
            writer.save()
            print(fname, pho, mai, title, sep='\t')

while True:
    time.sleep(10)
    extractor()
    try:
        nextbutton()
    except (WebDriverException):
        driver.refresh()
    except (NoSuchElementException):
        time.sleep(10)
        driver.quit()
I don't want the output to be duplicated, but roughly half or more of the data is duplicated each time I run the code.
