How to scrape data and keep it updated?

I am trying to scrape information about the highest payout on this page (https://www.oddsportal.com/tennis/australia/itf-w25-cairns-2-women-doubles/gibson-talia-hule-petra-parnaby-alana-preston-taylah-ClZyxWEg/), and I would also like the data to keep updating itself. I have no idea how to do this; please help.

To achieve your goal, you need two things:
- A web scraper, to collect the data.
- A cron job, to run the web scraper at a set time.
1. Building the Scraper
Let's start with the first one. Since I am an engineer at Web Scraping API, here is a simple and effective way to extract the highest and the average payout using our service:
import requests

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://www.oddsportal.com/tennis/australia/itf-w25-cairns-2-women-doubles/gibson-talia-hule-petra-parnaby-alana-preston-taylah-ClZyxWEg/'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,                                  # the payout values are rendered client-side
    "proxy_type": 'residential',
    "country": 'de',
    "timeout": 20000,
    "wait_for_css": '.no-border-right-highest',      # wait until the highest-payout cell is present
    "extract_rules": '{"highest":{"selector":".no-border-right-highest","output":"text"},"average":{"selector":".no-border-right-average","output":"text"}}',
}

response = requests.get(SCRAPER_URL, params=PARAMS)

# Save the extracted JSON to a file
with open('highest.txt', 'w') as file:
    file.write(response.text)
This script returns a JSON object containing:
{
    "highest": [
        "94.5%"
    ],
    "average": [
        "92.2%"
    ]
}
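If you would rather work with the values directly instead of dumping the raw response to a text file, you can parse the JSON in the same script. Here is a minimal sketch, replacing the with open(...) block above and assuming the response layout shown above:
data = response.json()             # {"highest": ["94.5%"], "average": ["92.2%"]}
highest = data["highest"][0]       # e.g. "94.5%"
average = data["average"][0]       # e.g. "92.2%"
print("highest payout:", highest, "- average payout:", average)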
2. Running the Cronjob
- Make sure you have cron on your system (install it if it's missing).
- Make the script above executable: add #!/usr/bin/env python3 as its first line, then run chmod +x web_scraper.py.
- Open up your crontab: crontab -e
- Schedule your cron job:
*/5 * * * * cd /path/to/your/script && ./web_scraper.py
PS. To save and exit vim, press Esc, then type :wq and hit Enter.
PPS. You can use an online crontab generator to build the schedule expression. For example, */5 * * * * means run the command every 5 minutes.
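One more note: the script above overwrites highest.txt on every run, so each cron invocation throws away the previous result. If you want the data to build up over time, a minimal variation (the file name payouts.csv and the CSV layout are my own choices, not part of the API) is to append one timestamped row per run:
#!/usr/bin/env python3
# web_scraper.py - same request as above, but appending one history row per run
import csv
import requests
from datetime import datetime

# ... define API_KEY, SCRAPER_URL, TARGET_URL and PARAMS exactly as in the script above ...

response = requests.get(SCRAPER_URL, params=PARAMS)
data = response.json()

# Append "timestamp, highest, average" so every cron run adds a new row
with open('payouts.csv', 'a', newline='') as file:
    csv.writer(file).writerow([datetime.now().isoformat(),
                               data["highest"][0],
                               data["average"][0]])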

Related

How to get at results of Jenkins XRay Import Step XrayImportBuilder

When run, the XrayImportBuilder step prints a lot of useful information to the log, but I can't see any simple way of getting at this information so it can be used from the Jenkinsfile script code. Specifically, this appears in the log:
XRAY_TEST_EXECS: ENT-8327
and I am hoping to add this info to the current build description. Ideally the info would be returned from the call, but the result is empty. Alternatives might be to scan the log, or to use a curl call and handle all the output myself - the latter feels like a backwards step.
I was successful in extracting that information from the generated logs.
After the Xray import results stage, I added:
stage('Extract Variable from log') {
    steps {
        script {
            // Read the full console log of the current build
            def logContent = Jenkins.getInstance().getItemByFullName(env.JOB_NAME).getBuildByNumber(Integer.parseInt(env.BUILD_NUMBER)).logFile.text
            // Keep the first line that matches "XRAY_TEST_EXECS: ..."
            env.testExecs = (logContent =~ /XRAY_TEST_EXECS:.*/).findAll().first()
            echo env.testExecs
        }
    }
}
stage('Using variable from another stage') {
    steps {
        script {
            echo "${env.testExecs}"
        }
    }
}
You can change the regex used to your specific case. I've added the extracted value to an environment variable so that it can be used in other stages.

Informatica IPC - UNIX script fail

I have created a UNIX script to be executed after the session finishes.
The script basically counts the lines of specific file and then creates a trailer with this specific structure:
T000014800000000000000000000000000000
T - for trailer
0000148 - number of lines
00000000000000000000000000000 - filler
I have tested the script on a Mac. I already know the environments are totally different, but I want to know what needs to be changed in order to execute this script successfully in IPC.
After execution I get the following error message:
The shell command failed with exit code 126.
I invoke the script as follows:
sh -c "$PMRootDir/scripts/exec_trailer_unix.sh $PMRootDir/TgtFiles"
#! /bin/sh
TgtFiles=$1
TgtFilesBody=$TgtFiles/body.txt
TgtFilesTrailer=$TgtFiles/trailer.txt
string1=$(sed -n '$=' $TgtFilesBody)
pad=$(printf '%0.1s' "0"{1..8})
padlength=8
string2='T'
string3=$(printf '%s%*.*s%s\n' "$string2" 0 $((padlength - ${#string1} - ${#string2} )) "$pad" "$string1")
string4='00000000000000000000000000000'
string5=$(printf '%s%*.*s%s\n' "$string3" 0 $((${#string3} - ${#string4} )) "$string4")
echo $string5 > $TgtFilesTrailer
Any ideas would be great.
Thanks in advance.
Please check the below points.
- It looks like a permission issue. Log in as the Informatica user (the user that runs the Informatica daemon) and run the command below; you should then be able to see the actual errors:
sh -c "$PMRootDir/scripts/exec_trailer_unix.sh $PMRootDir/TgtFiles"
- Sometimes the server variable $PMRootDir doesn't get interpreted in UNIX and can result in a null value. Use echo $PMRootDir after logging into UNIX as the above user to check that it resolves.
- You can also create the trailer file easily within Informatica itself: add an Aggregator transformation right before the actual target (group by a dummy field to calculate count(*)), then an Expression transformation to build those strings, and then a trailer-file target. Just three more transformations:
           |--> AGG --> EXP --> Trailer Target file
Final Tr --|
           |--> Final Target

appending log to a file in unix but with different name each day

I am appending the output of a cron job to a file like below:
10,20,30,40,50 * * * * /home/mydir/shellScript.sh >> /home/mydir/shellScript.log 2>&1
but the file size keeps increasing. I want to do either of the following:
1) create a new file after it reaches a certain size
2) create a new file each day it runs
We need to keep the files for at least 15 days for audit reasons. Can someone help with this? Thanks in advance.
I solved it using the below:
10,20,30,40,50 * * * * /home/mydir/shellScript.sh >> /home/mydir/shellScript_`date +\%Y\%m\%d`.log 2>&1
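Since the audit requirement is only 15 days, the dated files will otherwise pile up indefinitely. A small cleanup script, run once a day from the same crontab, can delete logs older than 15 days. This is only a sketch; the directory, file pattern, and script name are assumptions based on the crontab line above (logrotate can do the same thing natively if it is available on the box):
#!/usr/bin/env python3
# cleanup_logs.py - delete rotated logs older than the 15-day audit window
# (schedule e.g. once a day from the same crontab)
import os
import time

LOG_DIR = "/home/mydir"            # directory from the crontab line above
MAX_AGE_DAYS = 15                  # audit retention requirement

cutoff = time.time() - MAX_AGE_DAYS * 24 * 60 * 60
for name in os.listdir(LOG_DIR):
    if name.startswith("shellScript_") and name.endswith(".log"):
        path = os.path.join(LOG_DIR, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)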

Build web graph with wget

I'm using wget with the -r (recursive) option to crawl and download all the pages starting from a root.
For debugging purposes I'd like to output which page routed me to another one, for example: https://stackoverflow.com/ -> https://stackoverflow.com/questions
Is there a way to do that?
Please note that I explicitly need to use wget.
The best solution I have found so far is to use the --warc-file option to export a WARC archive of my crawl. This format also stores the Referer header.
Using a Python library to read the output, I wrote the following simple script to export a CSV with source/target columns:
import warc   # the `warc` library: pip install warc

f = warc.open("crawler.warc")
for record in f:
    # Only request records carry the Referer header
    if record['WARC-Type'] != 'request':
        continue
    for line in record.payload:
        if line.startswith("Referer:"):
            # Print "source,target": the referring page and the requested URI
            print(line.replace("Referer: ", "").strip('\n\r') + "," + record['WARC-Target-URI'])
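If you then want an actual graph structure rather than a flat CSV, you can load those source/target pairs into networkx. This is a sketch only: edges.csv is whatever file you redirected the script's output to, and networkx is a third-party package you would need to install.
import csv
import networkx as nx    # pip install networkx

G = nx.DiGraph()
with open("edges.csv") as f:                 # output of the script above, one "source,target" per line
    for row in csv.reader(f):
        if len(row) == 2:
            source, target = (field.strip() for field in row)
            G.add_edge(source, target)

print(G.number_of_nodes(), "pages and", G.number_of_edges(), "links")
nx.write_gexf(G, "crawl_graph.gexf")         # e.g. to inspect the graph in Gephi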

R script doesn't run from crontab after a while

I've been troubleshooting this for some time now and I'm getting some quite unexpected behaviour. I've placed a job in /etc/crontab to be run bihourly. It's an R script that produces a PNG graphic displayed on my server's webpage. It's called in the form:
0,30 * * * * my_user Rscript /path/to/file
What's odd is that it works for an hour or so before the graphic stops updating. If I ssh into the machine and then edit /etc/crontab without even changing anything, it starts running again. Does anyone know what might cause such an issue?
EDIT: I messed around with it a bit more, and it's getting even weirder. I'm running a PHP file from cron that scrapes some text and writes it to a file. The PHP continues to work even when R has ceased to run.
If you want something to run once every two hours, you will have to use the slash ("/") character in the hour field. The slash is the "step" character. For a two-hourly schedule, the time component of your cron entry will read:
0 */2 * * *
The second field, */2, means every alternate hour.
Similarly, if you want something to run every 3 hours, you can change that field to */3, and so on.
bihourly: Occurring once every two hours.
