Why is so much duplicated data being saved to my Excel sheet by my code? - web-scraping

This code scrapes data from a website, but the problem is that a large amount of duplicated data is being produced and saved to my Excel sheet.
def extractor():
    time.sleep(10)
    souptree = html.fromstring(driver.page_source)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    for tbu in tburl:
        allurl = []
        allurl.append(urllib.parse.urljoin(siteurl, tbu))
        for tb in allurl:
            get_url = requests.get(tb)
            get_soup = html.fromstring(get_url.content)
            pattern = re.compile("^\s+|\s*,\s*|\s+$")
            name = get_soup.xpath('//td[@headers="contactName"]//text()')
            phone = get_soup.xpath('//td[@headers="contactPhone"]//text()')
            mail = get_soup.xpath('//td[@headers="contactEmail"]//a//text()')
            artitle = get_soup.xpath('//td[@headers="contactEmail"]//a//@href')
            artit = ([x for x in pattern.split(str(artitle)) if x][-1])
            title = artit[:-2]
            for (nam, pho, mai) in zip(name, phone, mail):
                fname = nam[9:]
                allmails.append(mai)
                allnames.append(fname)
                allphone.append(pho)
                alltitles.append(title)
                fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails, 'Title': alltitles, 'Phone Numbers': allphone})
                writer = ExcelWriter('G:\\Sheet_Name.xlsx')
                fullfile.to_excel(writer, 'Sheet1', index=False)
                writer.save()
                print(fname, pho, mai, title, sep='\t')

while True:
    time.sleep(10)
    extractor()
    try:
        nextbutton()
    except (WebDriverException):
        driver.refresh()
    except (NoSuchElementException):
        time.sleep(10)
        driver.quit()
I don't want the output to be duplicated, but more than half of the rows are duplicated each time I run the code.
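A minimal sketch of one possible safeguard, assuming (as the code above suggests) that the collector lists allnames, allmails, alltitles and allphone live at module level and therefore keep growing across repeated extractor() calls; the drop_duplicates() step here is purely illustrative and is not part of the original code:

# Hypothetical sketch, not the asker's code: if the collector lists persist
# between extractor() calls, each run re-appends rows that were already written.
# Deduplicating the frame right before writing avoids repeated rows in the sheet.
fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails,
                         'Title': alltitles, 'Phone Numbers': allphone})
fullfile = fullfile.drop_duplicates()  # keep each contact only once
fullfile.to_excel('G:\\Sheet_Name.xlsx', sheet_name='Sheet1', index=False)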

Related

How to import data from an HTML table on a website to Excel?

I would like to do some statistical analysis with Python on the live casino game called Crazy Time from Evolution Gaming. There is a website that has the data to do this: https://tracksino.com/crazytime. I want the data from the bottom table, 'Spin History', to be imported into Excel. However, I do not know how this can be done. Could anyone give me an idea where to start?
Thanks in advance!
Try the below code:
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv
import datetime

def scrap_history():
    csv_headers = []
    file_path = ''  # mention where on your system you want to save the file
    file_name = 'spin_history.csv'  # filename
    page_number = 1
    while True:
        # Dynamic URL fetching data in chunks of 100
        url = 'https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=' + str(page_number) + '&per_page=100&period=24hours'
        print('-' * 100)
        print('URL created : ', url)
        response = requests.get(url, verify=False)
        result = json.loads(response.text)  # loading data to convert to JSON
        history_data = result['data']
        print(history_data)
        if history_data != []:
            with open(file_path + file_name, 'a+') as history:
                # Headers for the file
                csv_headers = ['Occured At', 'Slot Result', 'Spin Result', 'Total Winners', 'Total Payout']
                csvwriter = csv.DictWriter(history, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                if page_number == 1:
                    print('Writing CSV header now...')
                    csvwriter.writeheader()
                # write the extracted data into the csv file one row at a time
                for item in history_data:
                    value = datetime.datetime.fromtimestamp(item['when'])
                    occured_at = f'{value:%d-%B-%Y # %H:%M:%S}'
                    csvwriter.writerow({'Occured At': occured_at,
                                        'Slot Result': item['slot_result'],
                                        'Spin Result': item['result'],
                                        'Total Winners': item['total_winners'],
                                        'Total Payout': item['total_payout'],
                                        })
            print('-' * 100)
            page_number += 1
            print(page_number)
            print('-' * 100)
        else:
            break
Explanation:
I implemented the above script using Python's requests module. The API URL https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=1&per_page=50&period=24hours was extracted from the website itself (refer to the screenshot). In the very first step, the script builds a dynamic URL in which the page number changes on every iteration: first it will be page_num=1, then page_num=2, and so on until all the data has been extracted.
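A minimal usage sketch, assuming the function above is defined in the same script; with file_path left as '', the rows land in spin_history.csv in the current working directory. The conversion to .xlsx via pandas is only an optional illustration (the question asked for Excel output) and assumes pandas plus an Excel writer engine such as openpyxl are installed:

if __name__ == '__main__':
    scrap_history()  # pages through the API until an empty 'data' list is returned

    # Optional, illustrative only: convert the CSV into an Excel workbook
    import pandas as pd
    pd.read_csv('spin_history.csv').to_excel('spin_history.xlsx', index=False)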

Issue importing transfer journal through x++

I have the following problem: the code below works perfectly:
inventJournalTrans.clear();
inventJournalTrans.initFromInventJournalTable(inventJournalTable);
inventJournalTrans.ItemId = "100836M";
frominventDim.InventLocationId="SD";
frominventDim.wMSLocationId = '11_RECEPTION';
fromInventDim.InventSizeId = '1000';
fromInventDim.inventBatchId = 'ID057828-CN';
ToinventDim.InventLocationId = "SD";
ToInventDim.wMSLocationId = '11_A2';
ToInventDim.InventSizeId = '1000';
ToInventDim.inventBatchId = 'T20/0001/1';
ToinventDim = InventDim::findOrCreate(ToinventDim);
frominventDim = InventDim::findOrCreate(frominventDim);
inventJournalTrans.InventDimId = frominventDim.inventDimId;
inventJournalTrans.initFromInventTable(InventTable::find("100836M"));
inventJournalTrans.Qty = -0.5;
inventJournalTrans.ToInventDimId = ToinventDim.inventDimId;
inventJournalTrans.CostAmount = InventJournalTrans.calcCostAmount(-abs(any2real(strReplace('-0.5',',','.'))));
inventJournalTrans.TransDate = SystemDateget();
inventJournalTrans.insert();
inventJournalCheckPost = InventJournalCheckPost::newJournalCheckPost(JournalCheckpostType::Post,inventJournalTable);
inventJournalCheckPost.parmThrowCheckFailed(_throwserror);
inventJournalCheckPost.parmShowInfoResult(_showinforesult);
inventJournalCheckPost.run();
It creates the transfer journal line correctly and posts the transfer journal successfully.
My requirement is to import journal lines from a csv file. I wrote the following code:
inventJournalTrans.clear();
inventJournalTrans.initFromInventJournalTable(inventJournalTable::find(_journalID));
InventJournaltrans.ItemId = conpeek(_filerecord,4);
inventDim_From.InventLocationId = 'SD';
inventDim_From.wMSLocationId = '11_RECEPTION';
InventDim_from.InventSizeId = conpeek(_fileRecord,11);
InventDim_From.inventBatchId = strfmt("%1",conpeek(_fileRecord,5));
InventDim_To.InventLocationId = 'SD';
inventDim_To.wMSLocationId = strfmt("%1",conpeek(_fileRecord,10));
InventDim_To.InventSizeId = conpeek(_fileRecord,11);
InventDim_To.inventBatchId = strfmt("%1",conpeek(_fileRecord,6));
InventDim_From = InventDim::findOrCreate(inventDim_From);
inventDim_To = InventDim::findOrCreate(inventDim_To);
InventJournalTrans.InventDimId = inventDim_From.inventDimId;
InventJournalTrans.initFromInventTable(InventTable::find(conpeek(_filerecord,4)));
inventJournalTrans.Qty = -abs(any2real(strReplace(conpeek(_fileRecord,8),',','.')));
inventJournalTrans.ToInventDimId = inventDim_To.inventDimId;
InventJournalTrans.CostAmount = InventJournalTrans.calcCostAmount(-abs(any2real(strReplace(conpeek(_fileRecord,8),',','.'))));
inventJournalTrans.TransDate = str2date(conpeek(filerecord,9),123);
InventJournalTrans.insert();
I get the following error when I use the insert() method: size does not exist for the itemId. When I look in the InventSize table for my itemId, the size exists. I thought it was an inventDimId problem in inventJournalTrans, but the values are strictly identical to those in the first code example. All my data is the same as in the first example, but it is not hard-coded and comes from reading my CSV file.
I spent a lot of time debugging and found nothing wrong, but the error message remains.
I'm using Dynamics AX V4 SP1.
Thanks a lot for any help.
When you read from files, always trim trailing spaces. You can do that using the strRtrim function.
Like so:
InventDim_To.InventSizeId = strRtrim(conpeek(_fileRecord,11));

Sample Python code to replace a substring value in an xlsx cell

Sample code snippet tried:
for row in range(1, sheet.max_row + 1):
    for col in range(1, sheet.max_column + 1):
        temp = None
        cell_obj = sheet.cell(row=row, column=col)
        temp = re.search(r"requestor", str(cell_obj.value))
        if temp:
            if 'requestor' in cell_obj.value:
                cell_obj.value.replace('requestor', 'ABC')
I am trying to replace the value of an xlsx cell containing "Customer name: requestor " with "Customer name: ABC". How can this be achieved easily?
I found my answer in this post: https://www.edureka.co/community/42935/python-string-replace-not-working
The replace function doesn't store the result in the same variable (see the short illustration after the snippet below). Hence the solution for the above:
mvar = None
for row in range(1, sheet.max_row + 1):
    for col in range(1, sheet.max_column + 1):
        temp = None
        cell_obj = sheet.cell(row=row, column=col)
        temp = re.search(r"requestor", str(cell_obj.value))
        if temp:
            if 'requestor' in cell_obj.value:
                mvar = cell_obj.value.replace('requestor', 'ABC')
                cell_obj.value = mvar
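As a tiny stand-alone illustration of that point (plain Python strings, independent of openpyxl): str.replace() returns a new string rather than modifying the original, so the result has to be assigned back.

s = "Customer name: requestor"
s.replace("requestor", "ABC")       # returns "Customer name: ABC", but s is unchanged
s = s.replace("requestor", "ABC")   # reassigning is what actually keeps the change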
Just keep it simple. Instead of re and replace, search for the given value and overwrite the cell.
The example below also gives you the ability to change 'customer name' if needed:
import openpyxl  # needed for load_workbook

wb = openpyxl.load_workbook("example.xlsx")
sheet = wb["Sheet1"]

customer_name = "requestor"
replace_with = "ABC"
search_string = f"Customer name: {customer_name}"
replace_string = f"Customer name: {replace_with}"

for row in range(1, sheet.max_row + 1):
    for col in range(1, sheet.max_column + 1):
        cell_obj = sheet.cell(row=row, column=col)
        if cell_obj.value == search_string:
            cell_obj.value = replace_string

wb.save("example_copy.xlsx")  # remember that you need to save the results to the file

The New York Times API with R

I'm trying to get articles' information using The New York Times API. The csv file I get doesn't reflect my filter query. For example, I restricted the source to 'The New York Times', but the file I got contains other sources also.
I would like to ask you why the filter query doesn't work.
Here's the code.
if (!require("jsonlite")) install.packages("jsonlite")
library(jsonlite)

api = "apikey"

nytime = function () {
  url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
              '&fq=source:',("The New York Times"),'AND type_of_material:',("News"),
              'AND persons:',("Trump, Donald J"),
              '&begin_date=','20160522&end_date=','20161107&api-key=',api,sep="")
  # get the total number of search results
  initialsearch = fromJSON(url, flatten = T)
  maxPages = round((initialsearch$response$meta$hits / 10) - 1)
  # try with the max page limit at 10
  maxPages = ifelse(maxPages >= 10, 10, maxPages)
  # create an empty data frame
  df = data.frame(id=as.numeric(), source=character(), type_of_material=character(),
                  web_url=character())
  # save search results into the data frame
  for(i in 0:maxPages){
    # get the search results of each page
    nytSearch = fromJSON(paste0(url, "&page=", i), flatten = T)
    temp = data.frame(id=1:nrow(nytSearch$response$docs),
                      source = nytSearch$response$docs$source,
                      type_of_material = nytSearch$response$docs$type_of_material,
                      web_url=nytSearch$response$docs$web_url)
    df = rbind(df, temp)
    Sys.sleep(5) # sleep for 5 seconds
  }
  return(df)
}
dt = nytime()
write.csv(dt, "trump.csv")
Here's the csv file I got.
It seems you need to put the () inside the quotes, not outside. Like this:
url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
            '&fq=source:',"(The New York Times)",'AND type_of_material:',"(News)",
            'AND persons:',"(Trump, Donald J)",
            '&begin_date=','20160522&end_date=','20161107&api-key=',api,sep="")
https://developer.nytimes.com/docs/articlesearch-product/1/overview

How can I cut large csv files using any R packages like ff or data.table?

I want to cut large CSV files (file size larger than RAM) into pieces and use them, or save each piece to disk for later use. Which R package is best for doing this with large files?
I haven't tried it, but using the skip and nrows parameters of read.table or read.csv is worth a try. These are from ?read.table:
skip: integer. The number of lines of the data file to skip before beginning to read data.
nrows: integer. The maximum number of rows to read in. Negative and other invalid values are ignored.
To avoid some troublesome issues at the end, you need to do some error handling; in other words, I don't know what happens when the skip value is greater than the number of rows in your big CSV.
P.S. I also don't know whether header=TRUE affects skip or not; you have to check that as well.
The answer given by @berkorbay is OK, and I can confirm that header can be used with skip. However, if your file is really large it gets painfully slow, as each subsequent read after the first must skip over all previously read lines.
I had to do something similar and, after wasting quite a bit of time, I wrote a short script in Perl which fragments the original file into chunks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:
#!/usr/bin/perl
system("cls");
print("Fragment .csv file keeping header in each chunk\n") ;
print("\nEnter input file name = ") ;
$entrada = <STDIN> ;
print("\nEnter maximum number of lines in each fragment = ") ;
$nlineas = <STDIN> ;
print("\nEnter output file name stem = ") ;
$salida = <STDIN> ;
chop($salida) ;
open(IN,$entrada) || die "Cannot open input file: $!\n" ;
$cabecera = <IN> ;
$leidas = 0 ;
$fragmento = 1 ;
$fichero = $salida.$fragmento ;
open(OUT,">$fichero") || die "Cannot open output file: $!\n" ;
print OUT $cabecera ;
while(<IN>) {
    if ($leidas > $nlineas) {
        close(OUT) ;
        $fragmento++ ;
        $fichero = $salida.$fragmento ;
        open(OUT,">$fichero") || die "Cannot open output file: $!\n" ;
        print OUT $cabecera ;
        $leidas = 0;
    }
    $leidas++ ;
    print OUT $_ ;
}
close(OUT) ;
Just save it with whatever name and execute it. The first line might have to be changed if you have Perl in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
One should use read.csv.ffdf from the ff package with specific parameters like this to read a big file:
library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)
Once the big file is read into an ff object, subsetting the ff object into data frames can be done using:
a[1000:1000000,]
The rest of the code, for subsetting and saving the resulting data frames:
totalrows = dim(a)[1]
row.size = as.integer(object.size(a[1:10000,])) / 10000  # in bytes
block.size = 200000000  # in bytes, i.e. 200 MB
# rows.block is rows per block
rows.block = ceiling(block.size / row.size)
# nmaps is the number of chunks/maps of the big data frame (ff); nmaps = number of maps - 1
nmaps = floor(totalrows / rows.block)
for(i in (0:nmaps)){
  if(i == nmaps){
    df = a[(i*rows.block+1) : totalrows,]
  }
  else{
    df = a[(i*rows.block+1) : ((i+1)*rows.block),]
  }
  # process df or save it
  write.csv(df, paste0("M", i+1, ".csv"))
  # remove df
  rm(df)
}
Alternatively, you can first read the file into MySQL using dbWriteTable and then use the read.dbi.ffdf function from the ETLUtils package to read it back into R. Consider the function below:
read.csv.sql.ffdf <- function(file, name, overwrite = TRUE, header = TRUE, drv = MySQL(),
                              dbname = "new", username = "root", host = 'localhost', password = "1234"){
  conn = dbConnect(drv, user = username, password = password, host = host, dbname = dbname)
  dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
  on.exit(dbRemoveTable(conn, name))
  command = paste0("select * from ", name)
  ret = read.dbi.ffdf(command, dbConnect.args = list(drv = drv, dbname = dbname,
                                                     username = username, password = password))
  return(ret)
}
