How to fix python returning multiple lines in a .csv document instead of one? - web-scraping

I am trying to scrape data from a public forum for a school project, but every time I run the code, the resulting .csv file shows multiple rows for the text variable instead of just one.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = 'https://www.emimino.cz/diskuse/1ivf-repromeda-56566/'
uClient = uReq(my_url)
page_soup = soup(uClient.read(),"html.parser")
uClient.close()
containers = page_soup.findAll("div",{"class":"discussion_post"})
out_filename = "Repromeda.csv"
headers = "text,user_name,date \n"
f = open(out_filename, "w")
f.write(headers)
for container in containers:
    text1 = container.div.p
    text = text1.text
    user_container = container.findAll("span",{"class":"user_category"})
    user_id = user_container[0].text
    date_container = container.findAll("span",{"class":"date"})
    date = date_container[1].text
    print("text: " + text + "\n" )
    print("user_id: " + user_id + "\n")
    print("date: " + date + "\n")
    # writes the dataset to file
    f.write(text.replace(",", "|") + ", " + user_id + ", " + date + "\n")
f.close()
Ideally, I want one row per data entry (i.e. text, user_id and date in one row), but instead I get multiple rows for a single text entry and only one row holding the user_id and date.
[Screenshot: actual output]
[Screenshot: expected output]

Just replace each newline in the text with a space, so the whole post stays on one row.
for container in containers:
    text1 = container.div.p
    text = text1.text.replace('\n', ' ')
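Replacing newlines by hand works, but the standard csv module can also quote fields that contain commas, which removes the need to swap commas for "|". A minimal sketch, assuming the scraped values are already plain strings; the sample row and the write_rows helper are hypothetical and not part of the original code:

```python
import csv
import io


def write_rows(rows, handle):
    """Write (text, user_name, date) rows, flattening line breaks in the text."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["text", "user_name", "date"])
    for text, user_name, date in rows:
        # " ".join(text.split()) collapses newlines, tabs and repeated spaces
        # into single spaces, so each post stays on one physical line.
        writer.writerow([" ".join(text.split()), user_name.strip(), date.strip()])


# Hypothetical sample data standing in for the scraped forum posts.
sample = [("First line,\nsecond line", "user_a", "1. 1. 2020")]
buffer = io.StringIO()  # stands in for open("Repromeda.csv", "w", newline="")
write_rows(sample, buffer)
print(buffer.getvalue())
```

With a real file, opening it with newline="" lets the csv module control the line endings itself.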

Related

Column not iterable, PySpark

I'm trying to perform a count vectorization using the function I've created below. However, I keep getting an error stating "column not iterable", and I cannot figure out why it occurs or how to resolve it.
(dfNew is the data frame with just two of the columns, both of StringType)
import string
import re
import nltk
from nltk.stem import PorterStemmer
from pyspark.sql import functions as F
from sklearn.feature_extraction.text import CountVectorizer
ps = PorterStemmer()
dfNew = df.select(F.col('Description'), F.col('ID'))
def clean_description(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    text = re.sub(r'[\n\r]+', ' ', text).lower()
    Description = re.split(r'\W+', text)
    text = [ps.stem(word) for word in Description if word not in nltk.corpus.stopwords.words('english')]
    more_stop_words = ['please', 'search', 'submitted', 'thank', 'search', 'com', 'url', 'https', 'via', 'www']
    text = [ps.stem(word) for word in Description if word not in more_stop_words]
    return text
count_vectorize = CountVectorizer(analyzer=clean_description)
vectorized = count_vectorize.fit_transform(dfNew['Description'])
What am I doing wrong, and how can this be resolved?

how to split date-time and data from same column in csv using python?

I am storing data from an Arduino on a RaspberryPi. My code is working well, but the date-time and the data are logged together in the first column (picture 1), although they are shown separately on the RaspberryPi display (picture 2). How can I log them in separate columns? My code is below.
import time,datetime
from datetime import datetime
import csv
import serial
arduino_port = "/dev/ttyACM0"
baud = 9600
fileName="Arduino1.csv"
ser = serial.Serial(arduino_port, baud)
print("Connected to Arduino port:" + arduino_port)
file = open(fileName, "a")
print("Created file")
samples =float('inf')
print_labels = False
line = 0
print('Press Ctrl-C to quit...')
print('Time, CO2(ppm), Temp(C), Humi(%), Vis, IR, UV, pH')
print('-' *85)
while line <= samples:
    d=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    if print_labels:
        if line==0:
            print("Printing Column Headers")
        else:
            print("Line " + str(line) + ": writing...")
    getData=str(ser.readline())
    data=getData [1:] [1:-5]
    print(d, data)
    file = open(fileName, "a")
    file.write(d+ data+ "\n")
    line = line+1
file.close()
Try adding a comma after the date-time string so the data goes into its own column. The date-time and data look separate on the RaspberryPi display only because print(d, data) inserts a space between them; in the file they are not comma separated, so both end up in the same column.
Try this:
file.write(d + "," + data + "\n")
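If the Arduino already sends its readings comma separated, csv.writer can add the timestamp as its own field. A minimal sketch, assuming ser.readline() returns bytes such as b"412,23.5,40\r\n" (a made-up sample; the real format depends on the Arduino sketch). The build_row helper is hypothetical:

```python
import csv
import io
from datetime import datetime


def build_row(raw_line, now):
    """Turn one serial line (bytes) into [timestamp, value1, value2, ...]."""
    # decode() avoids the b'...' wrapper that str(bytes) produces,
    # and strip() drops the trailing \r\n.
    values = raw_line.decode("utf-8", errors="replace").strip().split(",")
    return [now.strftime("%Y-%m-%d %H:%M:%S")] + [v.strip() for v in values]


# Hypothetical reading; on the Pi this would come from ser.readline().
row = build_row(b"412,23.5,40\r\n", datetime(2021, 1, 2, 3, 4, 5))

out = io.StringIO()  # stands in for open("Arduino1.csv", "a", newline="")
csv.writer(out, lineterminator="\n").writerow(row)
print(out.getvalue())
```

Decoding the bytes also makes the [1:] [1:-5] slicing unnecessary, since there is no b'...' text to cut away.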

How to import data from a HTML table on a website to excel?

I would like to do some statistical analysis with Python on the live casino game called Crazy Time from Evolution Gaming. There is a website that has the data for this: https://tracksino.com/crazytime. I want to import the data from the bottom table, 'Spin History', into Excel. However, I do not know how this can be done. Could anyone give me an idea of where to start?
Thanks in advance!
Try the below code:
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv
import datetime
def scrap_history():
    csv_headers = []
    file_path = '' # folder on your system where the file should be saved
    file_name = 'spin_history.csv' # filename
    page_number = 1
    while True:
        # Dynamic URL fetching data in chunks of 100
        url = 'https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=' + str(page_number) + '&per_page=100&period=24hours'
        print('-' * 100)
        print('URL created : ',url)
        response = requests.get(url,verify=False)
        result = json.loads(response.text) # parse the JSON response body
        history_data = result['data']
        print(history_data)
        if history_data != []:
            with open(file_path + file_name ,'a+') as history:
                # Headers for file
                csv_headers = ['Occured At','Slot Result','Spin Result','Total Winners','Total Payout',]
                csvwriter = csv.DictWriter(history, delimiter=',', lineterminator='\n',fieldnames=csv_headers)
                if page_number == 1:
                    print('Writing CSV header now...')
                    csvwriter.writeheader()
                # write extracted data into the csv file one row at a time
                for item in history_data:
                    value = datetime.datetime.fromtimestamp(item['when'])
                    occured_at = f'{value:%d-%B-%Y # %H:%M:%S}'
                    csvwriter.writerow({'Occured At':occured_at,
                                        'Slot Result': item['slot_result'],
                                        'Spin Result': item['result'],
                                        'Total Winners': item['total_winners'],
                                        'Total Payout': item['total_payout'],
                                        })
                    print('-' * 100)
            page_number +=1
            print(page_number)
            print('-' * 100)
        else:
            break
Explanation:
I implemented the script above with the Python requests library. The API URL https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=1&per_page=50&period=24hours was extracted from the website itself (refer to the screenshot). The script builds the URL dynamically: the page number changes on every iteration, e.g. first page_num = 1, then page_num = 2, and so on until all the data has been extracted.
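The paging logic described above can be isolated from the network call. A minimal sketch, assuming each response carries a 'data' list that is empty once the pages run out; collect_all_pages, fake_fetch and FAKE_PAGES are hypothetical stand-ins for the requests.get call and the real API:

```python
def collect_all_pages(fetch_page, first_page=1):
    """Call fetch_page(n) for n = first_page, first_page + 1, ... until 'data' is empty."""
    items = []
    page_number = first_page
    while True:
        result = fetch_page(page_number)
        history_data = result.get("data", [])
        if not history_data:
            break  # an empty page signals that everything has been read
        items.extend(history_data)
        page_number += 1
    return items


# Hypothetical in-memory "API" with two pages of results, used instead of HTTP.
FAKE_PAGES = {1: [{"result": "1"}, {"result": "2"}], 2: [{"result": "5"}]}


def fake_fetch(page_number):
    return {"data": FAKE_PAGES.get(page_number, [])}


all_items = collect_all_pages(fake_fetch)
print(len(all_items))
```

Keeping the fetch step as a parameter makes the stop condition easy to check without calling the site.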

Why is my code saving so many duplicate rows to my Excel sheet?

This code scrapes data from a website, but the problem is that it produces a large number of duplicate rows and saves them to my Excel sheet.
def extractor():
    time.sleep(10)
    souptree = html.fromstring(driver.page_source)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    for tbu in tburl:
        allurl = []
        allurl.append(urllib.parse.urljoin(siteurl, tbu))
        for tb in allurl:
            get_url = requests.get(tb)
            get_soup = html.fromstring(get_url.content)
            pattern = re.compile("^\s+|\s*,\s*|\s+$")
            name = get_soup.xpath('//td[@headers="contactName"]//text()')
            phone = get_soup.xpath('//td[@headers="contactPhone"]//text()')
            mail = get_soup.xpath('//td[@headers="contactEmail"]//a//text()')
            artitle = get_soup.xpath('//td[@headers="contactEmail"]//a//@href')
            artit = ([x for x in pattern.split(str(artitle)) if x][-1])
            title = artit[:-2]
            for (nam, pho, mai) in zip(name, phone, mail):
                fname = nam[9:]
                allmails.append(mai)
                allnames.append(fname)
                allphone.append(pho)
                alltitles.append(title)
                fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails, 'Title': alltitles, 'Phone Numbers': allphone})
                writer = ExcelWriter('G:\\Sheet_Name.xlsx')
                fullfile.to_excel(writer, 'Sheet1', index=False)
                writer.save()
                print(fname, pho, mai, title, sep='\t')
while True:
    time.sleep(10)
    extractor()
    try:
        nextbutton()
    except (WebDriverException):
        driver.refresh()
    except(NoSuchElementException):
        time.sleep(10)
        driver.quit()
I want the output to contain no duplicates, but half or more of the rows are duplicated each time I run the code.
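One plausible source of the repeats, assuming allnames, allmails, alltitles and allphone are module-level lists that persist between calls: whenever nextbutton() fails, driver.refresh() reloads the same page and the next extractor() call appends the same contacts again. A minimal sketch of one way to guard against that, by remembering which records were already stored; add_unique and the sample contacts are hypothetical:

```python
def add_unique(records, seen, name, mail, title, phone):
    """Append the record only if this exact combination has not been stored yet."""
    key = (name, mail, title, phone)
    if key in seen:
        return False
    seen.add(key)
    records.append({"Names": name, "Mails": mail, "Title": title, "Phone Numbers": phone})
    return True


records, seen = [], set()
# Hypothetical contacts; the second call simulates the same page being scraped twice.
add_unique(records, seen, "Ann", "ann@example.com", "Paper A", "123")
add_unique(records, seen, "Ann", "ann@example.com", "Paper A", "123")
add_unique(records, seen, "Bob", "bob@example.com", "Paper B", "456")
print(len(records))
```

A list of dicts like records can be passed to pd.DataFrame once after scraping finishes, which also avoids rewriting the Excel file for every contact.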

writing csv adds extra lines

I've to write a csv for a third-party upload. They're saying they can't read the csv file created because it has two extra lines of code and doesn't end (when opened in Notepad) on the last character of the last line. That's true - but can anyone tell me why?
Dim _csvLine As New System.IO.StreamWriter(Server.MapPath("~/folder/_" & rptType.SelectedItem.Value.ToString & ".csv"), False)
Dim tb As New StringBuilder
Dim x As Integer = ds.Tables("csv").Columns.Count, y As Integer = 1
For Each row As Data.DataRow In ds.Tables("csv").Rows
    For Each col As Data.DataColumn In ds.Tables("csv").Columns
        If y <> x Then
            tb.Append(Trim(System.Text.RegularExpressions.Regex.Replace(row.Item(col).ToString, "\s+", " ", RegexOptions.IgnoreCase Or RegexOptions.Multiline)) & ",")
            y = y + 1
        Else
            tb.Append(Trim(System.Text.RegularExpressions.Regex.Replace(row.Item(col).ToString, "\s+", " ", RegexOptions.IgnoreCase Or RegexOptions.Multiline)))
            tb.AppendLine()
            y = 1
        End If
    Next
Next
_csvLine.WriteLine(Left(tb.ToString, Len(tb.ToString) - 2))
_csvLine.Flush()
_csvLine.Close()
_csvLine.Dispose()
One line is appended by _csvLine.WriteLine; use Write() instead.
tb.AppendLine() adds the second line, try avoiding it on the last line.
You're doing _csvLine.WriteLine( which is gonna add a linebreak to the end of the file.
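The effect is easier to see in isolation. A minimal sketch in Python rather than VB.NET, with made-up rows: appending a terminator after every row leaves a trailing line break, while joining the rows with the separator ends the output on the last character of the last row.

```python
rows = [["a", "b", "c"], ["d", "e", "f"]]  # hypothetical table contents

lines = [",".join(row) for row in rows]

# Appending a terminator after every row leaves a trailing line break...
with_trailing = "".join(line + "\r\n" for line in lines)
# ...while joining with the separator ends on the last character of the last row.
without_trailing = "\r\n".join(lines)

print(repr(with_trailing))
print(repr(without_trailing))
```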
