The idea is to scrape a website. I wanted to do this by taking screenshots and then extracting the data from the screenshot, because the data I want is not in the HTML code and, to be honest, I didn't know how else to handle it (I am pretty new to Python/programming).
It works fine so far, but I have the problem that WebDriverWait doesn't work properly.
This is the web page: https://exporo.de/investment/betreutes-wohnen-huerth and, specifically, this dynamic part:
<div class="key">Bereits investiert</div>
<div class="value"
ng-controller="pubSubController as pubSubCtrl"
ng-show="pubSubCtrl.hasProject(2385)"
ng-bind="pubSubCtrl.getProject(2385, 'total')"></div>
This is my code so far (just the loop):
while AktuellerWert1 < Endwert1:
    Zeit = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    driver1.get_screenshot_as_file(png_link % FileName1)
    img = Image.open(png_link % FileName1)
    PNG1 = image_to_string(img)
    PNG1_bearb = PNG1.split()
    AktuellerWert1 = PNG1_bearb[PNG1_bearb.index('investiert') + 1]
    Endwert1 = PNG1_bearb[PNG1_bearb.index('Finanzierungsziel') + 1]
    if AnfangsWert1 != AktuellerWert1:
        with open("/Users/davidoverbeck/Dropbox/Screen/Exporo/%s.csv" % FileName1, 'a') as csvFile:
            writer = csv.writer(csvFile)
            writer.writerow([AktuellerWert1, Zeit])
        print(AktuellerWert1)
    else:
        pass
    AnfangsWert1 = AktuellerWert1
    driver1.refresh()
    element = WebDriverWait(driver1, 2).until(EC.visibility_of_all_elements_located((By.XPATH, '/html/body/main/section[1]/section/div[2]/div[2]/div[1]/div[2]/div[10]/div[2]')))
else:
    with open("/Users/davidoverbeck/Dropbox/Screen/Abgeschlossen.csv", 'a') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerow([Zeit, FileName1])
    print(FileName1, 'abgeschlossen')
driver1.close()
It's working fine for 2 minutes and then it gives me the following error:
selenium.common.exceptions.TimeoutException: Message:
(no message behind it?!)
I am not sure whether the loop does anything at all or, in case it's working, what's wrong with it?
Thank you for your help!
I'm under the impression that the data you're looking for is here:
https://exporo.de/pubsub/initial
In that case there is no need to parse the HTML; you only need to parse the JSON.
To check, open the browser developer tools (F12), go to the Network tab, and look for requests whose Type column says json.
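A minimal sketch of the JSON route, in case it helps. I have not inspected the actual response, so the payload layout below (project ids as keys, each with a 'total' field) is purely an assumption, and find_project_total and sample_body are hypothetical names. The sample string stands in for the body you would download from the URL above.

```python
import json

def find_project_total(payload, project_id):
    """Return the 'total' value for project_id, or None if it is missing.

    ASSUMPTION: payload is a dict keyed by project id (as a string),
    and each project is a dict with a 'total' key. Adjust to the real layout.
    """
    project = payload.get(str(project_id))
    if project is None:
        return None
    return project.get("total")

# Hypothetical sample standing in for the real response body
sample_body = '{"2385": {"total": 1250000}}'

payload = json.loads(sample_body)
total = find_project_total(payload, 2385)
print(total)
```

With something like this there is no screenshot, no OCR, and no WebDriverWait involved; the loop would only need to re-download and re-parse the JSON.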
I have data in a text file. An example of the file looks like this:
"vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);"
Can I convert it into something like this in R?
e.g.
latitude longitude contents
23.8145833 90.4043056 LRP: LRPS Start...Tongi
Solution 1
That looks a lot like javascript code. Execute the javascript (using a web browser) and save the result to JSON, then open the file with R with jsonlite.
With your example, create this file and save it as my_page.html:
<html>
<head>
<script>
  // Initialize locations to be able to push more values in it
  // probably not required with your full code
  var locations = [];
  vLatitude ='23.8145833';
  vLongitude ='90.4043056';
  vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
  vLocations = new Array(vcontents, vLatitude, vLongitude);
  locations.push(vLocations);
  // convert locations to json
  var jsonData = JSON.stringify(locations);
  // actually write the json to file
  function download(content, fileName, contentType) {
    var a = document.createElement("a");
    var file = new Blob([content], {type: contentType});
    a.href = URL.createObjectURL(file);
    a.download = fileName;
    a.click();
  }
  download(jsonData, 'export_json.txt', 'text/plain');
</script>
</head>
<body>
Download should start automatically. You can look at the web console for errors.
</body>
</html>
When you open it with your web browser it should "download" a file, that you can open with R:
jsonlite::read_json("export_json.txt",simplifyVector = TRUE)
One problem is that the javascript code creates an array without names, so the names are not exported. I don't see how you could make javascript export them.
Solution 2
Instead of relying on a browser to execute the javascript code, you could do it directly in R with a javascript engine. It should give you the same result, but makes communication between the two easier.
Solution 3
If the file really looks like that all along, you might be able to remove the javascript lines that organize the arrays, and only keep the lines that define variables. Since the symbols = and ; are technically valid in R, it's not too hard to rewrite the javascript into R code. Note that this solution could be very fragile depending on what else is in your javascript code!
library(stringr)  # str_split, str_trim, str_detect, str_remove; also re-exports %>%
js_script <- "var locations = [];
vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);
// convert locations to json
var jsonData = JSON.stringify(locations);" %>%
str_split(pattern = "\n", simplify=TRUE) %>%
as.character() %>%
str_trim()
# Find the lines that look like defining variables
js_script <- js_script[str_detect(js_script, pattern = "^\\w+ ?= ?'.*' ?;$")]
# make it into an R expression
r_code <- str_remove(js_script, ";$") %>%
paste(collapse = ",")
r_code <- paste0("c(", r_code, ")")
# Execute
eval(str2expression(r_code))
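For comparison, the same line-filtering idea can be sketched in Python rather than R. It uses the same regular expression as above, but captures the name and value instead of evaluating the lines as code. extract_assignments is a hypothetical helper name, and the approach is just as fragile as the R version if the file contains other kinds of statements.

```python
import re

js_source = """var locations = [];
vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);
"""

# Same pattern as the R code, with capture groups for the name and the value
ASSIGNMENT = re.compile(r"^(\w+) ?= ?'(.*)' ?;$")

def extract_assignments(source):
    """Return a dict of name -> value for lines of the form name ='value';"""
    values = {}
    for line in source.splitlines():
        match = ASSIGNMENT.match(line.strip())
        if match:
            name, value = match.groups()
            values[name] = value
    return values

record = extract_assignments(js_source)
print(record["vLatitude"], record["vLongitude"])
```

Lines such as the new Array(...) call and locations.push(...) do not match the pattern and are skipped, just as in the R version.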
I've collected 48,000 URLs and put them in a list. My goal is to extract 8 pieces of data from each page using BeautifulSoup, appending each data point to its own initially empty list.
Before I ran the for loop below, I tested the extraction on 3 URLs from the list and it worked perfectly fine.
I know the code works, but it has already been running for a day and a half, which makes me wonder whether I wrote it inefficiently.
Can someone please review my code and provide any suggestions or ideas on how to make the code run faster?
Thanks in advance!
title_list = []
price_list = []
descrip_list = []
grape_variety_list = []
region_list = []
region_list2 = []
wine_state_list = []
wine_country_list = []
with requests.Session() as session:
    for link in grape_review_links_list:
        response2 = session.get(link, headers=headers)
        wine_html = response2.text
        soup2 = BeautifulSoup(wine_html, 'html.parser')
        wine_title = soup2.find('span', class_='rating').findNext('h1').text
        title_list.append(wine_title)
        wine_price = soup2.find(text='Buy Now').findPrevious('span').text.split(',')[0]
        price_list.append(wine_price)
        wine_descrip = soup2.find('p', class_='description').find(text=True, recursive=False)
        descrip_list.append(wine_descrip)
        wine_grape = soup2.find(text='Buy Now').findNext('a').text
        grape_variety_list.append(wine_grape)
        wine_region = soup2.find(text='Appellation').findNext('a').text
        region_list.append(wine_region)
        wine_region2 = soup2.find(text='Appellation').findNext('a').findNext('a').text
        region_list2.append(wine_region2)
        wine_state = soup2.find(text='Appellation').findNext('a').findNext('a').findNext('a').text
        wine_state_list.append(wine_state)
        wine_country = soup2.find(text='Appellation').findNext('a').findNext('a').findNext('a').findNext('a').text
        wine_country_list.append(wine_country)
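A loop like this spends most of its time waiting for each HTTP response, so one common way to cut the wall-clock time is to run several requests at once with a thread pool. The sketch below is only an illustration under assumptions: scrape_all and fake_scrape are hypothetical names, fake_scrape stands in for the real fetch-and-parse step, and it assumes the site tolerates a handful of concurrent requests.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(links, scrape_one, max_workers=8):
    """Apply scrape_one to every link using a pool of worker threads.

    Results come back in the same order as links.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_one, links))

def fake_scrape(link):
    # Stand-in for the real work: fetch the page, parse it, return one row
    return {"url": link, "title": link.rsplit("/", 1)[-1]}

rows = scrape_all(["https://example.com/a", "https://example.com/b"], fake_scrape)
print(rows)
```

In the real script, scrape_one would wrap the session.get and BeautifulSoup calls and return one dict per page, which also replaces the eight parallel lists with a single list of rows.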
I wanted to create my bot in VBScript (I know it sounds like trolling and is probably a bad idea; I could do it in Lua, Python, C#, PHP, ..., but I decided to try it in VBScript).
The hard part is retrieving information from Telegram's getUpdates.
I've written this code as an example and it partly works; I'll explain what works and what doesn't.
Dim fso, outFile, TeleTest
Set fso = CreateObject("Scripting.FileSystemObject")
Set outFile = fso.CreateTextFile("output.txt", True)
set TeleTest = fso.CreateTextFile("TeleTest.txt", True)
Dim url, req, json
Set req = CreateObject("MSXML2.XMLHTTP")
url = "https://api.telegram.org/bot"[TOKEN]"/getUpdates"
req.open "GET", url, False
req.send
If req.Status = 200 Then
TeleTest.Write req.responseText
End If
' Load the JSON array into a JsonArray:
set jsonArray = CreateObject("Chilkat_9_5_0.JsonArray")
success = jsonArray.Load("TeleTest.txt")
If (success <> 1) Then
outFile.WriteLine(jsonArray.LastErrorText)
WScript.Quit
End If
' Get some information from each record in the array.
numRecords = jsonArray.Size
i = 0
Do While i < numRecords
outFile.WriteLine("------ Record " & i & " -------")
' jsonRecord is a Chilkat_9_5_0.JsonObject
Set jsonRecord = jsonArray.ObjectAt(i)
outFile.WriteLine(" ok: " & jsonRecord.StringOf("ok"))
outFile.WriteLine(" result: " & jsonRecord.SizeOfArray("result"))
' Examine information for this record
u = 0
Do While u < nummessage
nummessage = jsonRecord.SizeOfArray("result[u].message")
Loop
outFile.WriteLine("Number of message: " & nummessage)
j = 0
Do While j < nummessage
jsonRecord.J = j
outFile.WriteLine(" message text: " & jsonRecord.StringOf("result[j].message[j].text"))
j = j + 1
Loop
i = i + 1
Loop
outFile.Close
The first part, which gets the updates and saves them into TeleTest.txt, works fine: it gets the updates and saves the JSON into the .txt file (I could also keep it in a string in the script, or save it as a .json file).
The problem is that the second part, where I'm using Chilkat, gives this error:
ChilkatLog: Load:
ChilkatVersion: 9.5.0.78
Unable to get array at index 0. --Load
--ChilkatLog
Any help or ideas would be appreciated. Also, if Chilkat is not a good fit for this, please tell me why and suggest an alternative (Chilkat was the only DLL I found that works with VBScript and can read JSON).
I got it working. From this example I found out that
Chilkat's JsonArray needs the JSON file to look like this:
[ { json } ]
but the Telegram JSON looks like this:
{ json }
So the fix is easy: change line 15 from TeleTest.Write req.responseText to the code below:
TeleTest.Write "[" + req.responseText + "]"
My code now works fine. If anyone spots something wrong or has another answer to my question, I'd appreciate it.
I hope this helps someone else who needs it.
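The wrapping trick is easy to see in a language with a built-in JSON parser. Below is a small Python sketch of the same idea; sample_body is a made-up response that only mimics the ok / result / message / text fields used in the script above, not real Telegram output.

```python
import json

# Hypothetical sample shaped like the fields the script reads
sample_body = '{"ok": true, "result": [{"update_id": 1, "message": {"text": "hello"}}]}'

# The response is a single JSON object; wrapping it in brackets turns it
# into a one-element array, which is what an array loader expects.
records = json.loads("[" + sample_body + "]")

texts = []
for record in records:
    for update in record.get("result", []):
        message = update.get("message")
        if message is not None and "text" in message:
            texts.append(message["text"])

print(len(records), texts)
```

An alternative that avoids the string concatenation is to load the response as a single object instead of an array, if the JSON library offers an object loader.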
Using two functions to scrape a website results in a driver.get error.
I've tried different variations of while and for loops to get this to work. Now I get a driver.get error. The initial function works on its own, but when running both functions one after another I get this error.
import requests, sys, webbrowser, bs4, time
import urllib.request
import pandas as pd
from selenium import webdriver
driver = webdriver.PhantomJS(executable_path = 'C:\\PhantomJS\\bin\\phantomjs.exe')
jobtit = 'some+job'
location = 'some+city'
urlpag = ('https://www.indeed.com/jobs?q=' + jobtit + '&l=' + location + '%2C+CA')
def initial_scrape():
data = []
try:
driver.get(urlpag)
results = driver.find_elements_by_tag_name('h2')
print('Finding the results for the first page of the search.')
for result in results: # loop 2
job_name = result.text
link = result.find_element_by_tag_name('a')
job_link = link.get_attribute('href')
data.append({'Job' : job_name, 'link' : job_link})
print('Appending the first page results to the data table.')
if result == len(results):
return
except Exception:
print('An error has occurred when trying to run this script. Please see the attached error message and screenshot.')
driver.save_screenshot('screenshot.png')
driver.close()
return data
def second_scrape():
data = []
try:
#driver.get(urlpag)
pages = driver.find_element_by_class_name('pagination')
print('Variable nxt_pg is ' + str(nxt_pg))
for page in pages:
page_ = page.find_element_by_tag_name('a')
page_link = page_.get_attribute('href')
print('Taking a look at the different page links..')
for page_link in range(1,pg_amount,1):
driver.click(page_link)
items = driver.find_elements_by_tag_name('h2')
print('Going through each new page and getting the jobs for ya...')
for item in items:
job_name = item.text
link = item.find_element_by_tag_name('a')
job_link = link.get_attribute('href')
data.append({'Job' : job_name, 'link' : job_link})
print('Appending the jobs to the data table....')
if page_link == pg_amount:
print('Oh boy! pg_link == pg_amount...time to exit the loops')
return
except Exception:
print('An error has occurred when trying to run this script. Please see the attached error message and screenshot.')
driver.save_screenshot('screenshot.png')
driver.close()
return data
Expected:
Initial Function
Get website from urlpag
Find element by tag name and loop through elements while appending to a list.
When done with all elements, exit and return the list.
Second Function
While still on urlpag, find element by class name and get the links for the next pages to scrape.
Once we have each page to scrape, go through the pages, scraping each one and appending the elements to a different table.
Once we reach our pg_amount limit - exit and return the finalized list.
Actual:
Initial Function
Get website from urlpag
Find element by tag name and loop through elements while appending to a list.
When done with all elements, exit and return the list.
Second Function
Finds class pagination, prints nxt_variable and then throws the error below.
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\Scripts\Indeedscraper\indeedscrape.py", line 23, in initial_scrape
driver.get(urlpag)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 333, in get
self.execute(Command.GET, {'url': url})
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: {"errorMessage":"Currently Window handle/name is invalid (closed?)"
For anyone else hitting this error: I ended up switching to chromedriver for web scraping. It appears that the PhantomJS driver will sometimes return this error.
I was having the same issue, until I placed my driver.close() after I was done interacting with selenium objects. I ended up closing the driver at the end of my script just to be on the safe side.
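One way to make "close the driver only once everything is done" hard to get wrong is to put the shutdown in a single place that runs after all scraping functions have finished. The sketch below is an assumption-laden illustration: managed_driver is a hypothetical helper, and StubDriver is a stand-in so the example runs without a browser. With Selenium you would pass the real driver class (for example webdriver.Chrome) instead.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the caller, and quit it once the block ends."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs on success and on error, but only after all scraping is done
        driver.quit()

class StubDriver:
    """Hypothetical stand-in for a Selenium driver, used only for this demo."""
    def __init__(self):
        self.open = True
        self.visited = []

    def get(self, url):
        if not self.open:
            raise RuntimeError("window already closed")
        self.visited.append(url)

    def quit(self):
        self.open = False

with managed_driver(StubDriver) as driver:
    driver.get("https://example.com/page1")  # work of the first function
    driver.get("https://example.com/page2")  # work of the second function

print(driver.visited, driver.open)
```

Neither scraping function closes the driver itself, so the second one can never be handed a window that the first one already closed.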
I'm setting up a spreadsheet for someone else with a form to enter data.
One of the columns is supposed to hold a date. The input date format is like this example: "Jan 26, 2013" (there will be a lot of copy & paste involved to collect data, so changing the format at input step is not a real option).
I need this date column to be sortable, but the spreadsheet doesn't recognize this as a date but simply as a string. (It would recognize "Jan-26-2013", I've tried.)
So I need to reformat the input date.
My question is: how can I do this? I have looked around, and Google Apps Script looks like the way to go (though I haven't found a good example of reformatting yet).
Unfortunately my only programming experience is in Python, and of intermediate level. I could do this in Python without a problem, but I don't know any JavaScript.
(My Python approach would be:
def reformat(date):
    splitted = date.split()
    newdate = "-".join([splitted[0], splitted[1][:-1], splitted[2]])
    return newdate
)
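For reference, the same conversion can also be sketched in Python by parsing the text into a real date object instead of rearranging the string. parse_form_date is a hypothetical name, and the sketch assumes English month abbreviations (the %b directive depends on the locale) and that stray quote characters may surround the value.

```python
from datetime import datetime

def parse_form_date(text):
    """Parse a string like 'Jan 26, 2013' into a date object."""
    # Drop surrounding whitespace and any leading/trailing quote characters
    cleaned = text.strip().strip("'\"")
    return datetime.strptime(cleaned, "%b %d, %Y").date()

parsed = parse_form_date("Jan 26, 2013")
print(parsed.isoformat())
```

A real date object sorts correctly on its own, which is the property the spreadsheet column needs.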
I also don't know how I'd go about linking the script to the spreadsheet - would I attach it to the cell, or the form, or where? And how? Any link to a helpful, understandable tutorial etc. on this point would help greatly.
Any help greatly appreciated!
Edit: Here's the code I ended up with:
//Function to filter unwanted " chars from date entries
function reformatDate() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var startrow = 2;
  var firstcolumn = 6;
  var columnspan = 1;
  var lastrow = sheet.getLastRow();
  var dates = sheet.getRange(startrow, firstcolumn, lastrow, columnspan).getValues();
  var newdates = [];
  for (var i in dates) {
    var mydate = dates[i][0];
    var newdate;
    try {
      newdate = mydate.replace(/"/g,'');
    } catch (err) {
      newdate = mydate;
    }
    newdates.push([newdate]);
  }
  sheet.getRange(startrow, firstcolumn, lastrow, columnspan).setValues(newdates);
}
For other confused google-script Newbies like me:
Attaching the script to the spreadsheet works by creating the script from within the spreadsheet (Tools => Script Editor). Just putting the function in there is enough; you don't seem to need a function call etc.
You select the trigger of the script from the Script Editor (Resources => This Project's Triggers).
Important: the script will only work if there's an empty row at the bottom of the sheet in question! (Most likely this is because the third argument of getRange is a row count, not an end row: starting at row 2 and asking for lastrow rows reaches one row past the last row with data.)
Just an idea :
If you double-click your date string in the spreadsheet, you will see that its real value is 'Jan 26, 2013, with a leading ' that makes it a string instead of a date object. (The form adds it so that you can type whatever you want in the text area, including +322475... if it is a phone number, for example; that's a known trick in spreadsheet cells.) You could simply write a script that runs on form submit and removes the ' from the cells; I guess the spreadsheet would do the rest. (I didn't test this, so give it a try and consider it a suggestion.)
To remove the ' you can simply use the .replace() method:
var newValue = value.replace(/'/g,'');
here are some links to the relevant documentation : link1 link2
EDIT following your comment :
It could be simpler since the replace doesn't generate an error if no match is found. So you could make it like this :
function reformatDate() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var dates = sheet.getRange(2, 6, sheet.getLastRow(), 1).getValues();
  var newdates = [];
  for (var i in dates) {
    var mydate = dates[i][0];
    var newdate = mydate.replace(/"/g,'');
    newdates.push([newdate]);
  }
  sheet.getRange(2, 6, sheet.getLastRow(), 1).setValues(newdates);
}
Also, you used the " in your code, presumably on purpose... my test showed ' instead. What made you choose that?
Solved it, I just had to change the comma to dot and it worked