Filter extracted data via IMPORTDATA - web-scraping

When trying to extract data from https://int.soccerway.com/ via IMPORTDATA, the spreadsheet sometimes returns a message saying that the data exceeds the size limit.
Instead of importing everything, I would like to filter only the values that sit inside td class="score-time status", because I want to capture the links within that specific "class" of "td".
ImportXML with "//td[@class='score-time status']/@href" is not an option, because some of these links are hidden and only appear in the raw page source, so IMPORTDATA is the only way to reach all the existing links.
=IMPORTDATA("https://int.soccerway.com/")
I have tried many ways of adding ARRAYFORMULA and FILTER so that it filters only this data, but each time it returns an error.
What I need to collect are the links inside:
td class="score-time status"

You can do something like:
=ARRAY_CONSTRAIN(IMPORTDATA("https://int.soccerway.com/"), 8000, 1)
Then you can wrap it in QUERY and filter it however fits you; for example:
=QUERY(ARRAY_CONSTRAIN(IMPORTDATA("https://int.soccerway.com/"), 8000, 1),
"where Col1 contains 'td'", 0)
=QUERY(ARRAY_CONSTRAIN(IMPORTDATA("https://int.soccerway.com/"), 8000, 1),
"where Col1 contains 'href'", 0)
etc.
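If you only need the URL itself rather than the whole matching row, a hedged follow-up sketch (assuming each matching row of the imported markup carries its href on the same line; the 'score-time status' filter and the regex are illustrative, not confirmed against the page):
=ARRAYFORMULA(IFNA(REGEXEXTRACT(
 QUERY(ARRAY_CONSTRAIN(IMPORTDATA("https://int.soccerway.com/"), 8000, 1),
 "where Col1 contains 'score-time status'", 0),
 "href=""([^""]+)""")))
IFNA blanks out any matched rows where the regex finds no href.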

In KQL how can I use bag_unpack to turn a serialized dictionary object in customDimensions into columns?

I'm trying to write a KQL query that will, among other things, display the contents of a serialized dictionary called Tags which has been added to the Application Insights traces table customDimensions column by application logging.
An example of the serialized Tags dictionary is:
{
  "Source": "SAP",
  "Destination": "TC",
  "SAPDeliveryNo": "0012345678",
  "PalletID": "(00)312340123456789012(02)21234987654(05)123456(06)1234567890"
}
I'd like to use evaluate bag_unpack(...) to evaluate the JSON and turn the keys into columns. We're likely to add more keys to the dictionary as the project develops and it would be handy not to have to explicitly list every column name in the query.
However, I'm already using project to reduce the number of other columns I display. How can I use both a project statement, to only display some of the other columns, and evaluate bag_unpack(...) to automatically unpack the Tags dictionary into columns?
Or is that not possible?
This is what I have so far, which doesn't work:
traces
| where datetime_part("dayOfYear", timestamp) == datetime_part("dayOfYear", now())
and message has "SendPalletData"
| extend TagsRaw = parse_json(customDimensions.["Tags"])
| evaluate bag_unpack(TagsRaw)
| project timestamp, message, ActionName = customDimensions.["ActionName"], TagsRaw
| order by timestamp desc
When it runs, it displays only the columns listed in the project statement (including TagsRaw, so I know the Tags exist in customDimensions).
evaluate bag_unpack(TagsRaw) doesn't automatically add to the result set the extra columns unpacked from the Tags in customDimensions.
EDIT: To clarify what I want to achieve, these are the columns I want to output:
timestamp
message
ActionName
TagsRaw
Source
Destination
SAPDeliveryNo
PalletID
EDIT 2: It turned out a major part of my problem was that double quotes within the Tags data were being escaped. While the Tags as viewed in the Azure portal looked like normal JSON, and copied out as normal JSON, when I copied out the whole of a customDimensions record, the Tags looked like "Tags": "{\"Source\":\"SAP\",\"Destination\":\"TC\", ... with the double quotes escaped with backslashes.
The accepted answer from David Markovitz handles this situation in the line:
TagsRaw = todynamic(tostring(customDimensions["Tags"]))
A few comments:
When filtering on timestamp, it is better to use the timestamp column As Is, and do the manipulations on the other side of the comparison.
When using the has[...] operators, prefer the case-sensitive one (if feasible).
Everything extracted from a dynamic value is also dynamic, and when given a dynamic value, parse_json() (or its equivalent, todynamic()) simply returns it, As Is.
Therefore, we need to treat customDimensions.["Tags"] in 2 steps:
first convert it to a string, then convert the result to dynamic.
To reference a field within a dynamic type you can use X.Y, X["Y"], or X['Y'].
No need to combine them as you did with customDimensions.["Tags"].
As the bag_unpack plugin doc states:
"The specified input column (Column) is removed."
In other words, TagsRaw does not exist following the bag_unpack operation.
Please note that you can add a prefix to the columns generated by bag_unpack. This might make it easier to differentiate them from the rest of the columns.
While you can use project, using project-away is sometimes easier.
// Data sample generation. Not part of the solution.
let traces =
    print c1 = "some columns"
         ,c2 = "we"
         ,c3 = "don't need"
         ,timestamp = ago(now()%1d * rand())
         ,message = "abc SendPalletData xyz"
         ,customDimensions = dynamic
         (
            {
                "Tags":"{\"Source\":\"SAP\",\"Destination\":\"TC\",\"SAPDeliveryNo\":\"0012345678\",\"PalletID\":\"(00)312340123456789012(02)21234987654(05)123456(06)1234567890\"}"
                ,"ActionName":"Action1"
            }
         )
;
// Solution starts here
traces
| where timestamp >= startofday(now())
    and message has_cs "SendPalletData"
| extend TagsRaw = todynamic(tostring(customDimensions["Tags"]))
        ,ActionName = customDimensions["ActionName"]
| project-away c*
| evaluate bag_unpack(TagsRaw, "TR_")
| order by timestamp desc
The result is a single row:

timestamp         2022-08-27T04:15:07.9337681Z
message           abc SendPalletData xyz
ActionName        Action1
TR_Destination    TC
TR_PalletID       (00)312340123456789012(02)21234987654(05)123456(06)1234567890
TR_SAPDeliveryNo  0012345678
TR_Source         SAP
If I understand correctly, you want to use project to limit the number of columns that are displayed, but you also want to include all of the unpacked columns from TagsRaw, without naming all of the tags explicitly.
The easiest way to achieve this is to switch the order of your steps, so that you first do the project (including the TagsRaw column) and then you unpack the tags. If desired, you can then use project-away to specifically remove the TagsRaw column after you've unpacked it.
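A minimal sketch of that ordering, assuming the same traces table as in the sample above (and note that, per the bag_unpack doc quoted earlier, the plugin removes TagsRaw by itself, so a separate project-away of it is only a safeguard):
traces
| where timestamp >= startofday(now())
    and message has_cs "SendPalletData"
| extend TagsRaw = todynamic(tostring(customDimensions["Tags"]))
| project timestamp, message, ActionName = tostring(customDimensions["ActionName"]), TagsRaw
| evaluate bag_unpack(TagsRaw)
| order by timestamp desc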

Telegraf drop a tag from a specific measurement

Is it possible in Telegraf, using a processor, to drop a tag from a measurement?
I am using the cisco_telemetry plugin, which takes in series; within one of the measurements (not across the whole plugin) I want to keep only one tag.
I tried using the tag_limit processor, but it didn't work. The current measurement "Cisco-IOS-XR-procfind-oper:proc-distribution/nodes/node/process/pid/filter-type" has two tags, "pid" and "proc_name", each containing around 10000 values. I only want to keep "proc_name" and drop "pid" from this measurement. Should the tag_limit processor work for this? I am on version 1.23.
[[processors.tag_limit]]
  namepass = ["Cisco-IOS-XR-procfind-oper:proc-distribution/nodes/node/process/pid/filter-type"]
  ## Maximum number of tags to preserve
  limit = 1
  ## List of tags to preferentially preserve
  keep = ["proc_name"]
"within one of the measurements"
I would probably use a starlark processor then. Use namepass as you have done, and then remove the specific tag.
[[processors.starlark]]
  namepass = ["Cisco-IOS-XR-procfind-oper:proc-distribution/nodes/node/process/pid/filter-type"]
  source = '''
def apply(metric):
    metric.tags.pop("pid")
    return metric
'''
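If some series in that measurement might arrive without a "pid" tag, a guarded variant of the same function (assuming dict-style pop with a default on metric.tags, as in standard Starlark) avoids a runtime error:
def apply(metric):
    # drop "pid" if present; the default prevents an error when the tag is absent
    metric.tags.pop("pid", None)
    return metric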
For users looking to do this to an entire measurement, tags can be dropped with metric modifiers. Specifically, you are looking for tagexclude, which will remove tags from a measurement matching those patterns. This way, you do not even need to use a processor and can add this directly to the end of your input:
[[inputs.cisco_telemetry]]
  <connection details>
  tagexclude = ["pid"]

Web scraping with R?

I have a dataframe which contains a URL in one of its columns.
test = data.frame(id = 1, url = "https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
Using this, I would like to retrieve an element in the web page. Specifically, I would like to retrieve the value of the activity state.
https://zupimages.net/viewer.php?id=20/51/t1fx.png
From my research, I found code which selects the element via its XPath.
library(rvest)
page = read_html("https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
page %>% html_nodes(xpath = '//*[@id="detailAttributFiche"]/div/p') %>% html_text() %>% as.character()
character(0)
As you can see, I always get character(0), as if it couldn't read the whole page. I suspect some JavaScript part is not loading properly.
How can I do this?
Thank you.
The data is from this link (the etatActiviteInst parameter): https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015
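A minimal sketch of reading that endpoint directly, assuming it returns JSON and that etatActiviteInst is a top-level field (the exact response structure is an assumption here):
library(jsonlite)
# fetch the establishment record and pull out the activity-state parameter
rec <- fromJSON("https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015")
rec$etatActiviteInst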

Google sheet IMPORTHTML function could not find the data

I am trying to get the data table from this site http://people.stern.nyu.edu/adamodar/New_Home_Page/datafile/vebitda.html into Google Sheets.
I have tried:
=IMPORTHTML("http://people.stern.nyu.edu/adamodar/New_Home_Page/datafile/vebitda.html", "table", 1), but this gives me an N/A.
What is wrong?
You may try to get it via:
=QUERY(IMPORTDATA("http://people.stern.nyu.edu/adamodar/New_Home_Page/datafile/vebitda.html"),
"offset 1181")
and try to remove tags with:
=ARRAYFORMULA(IFNA(REGEXREPLACE(A1:A, "</?\S+[^<>]*>", )))
and then use FILTER with MOD to get every n-th value and recreate the whole table; a sketch of that last step follows.
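A hedged sketch, assuming the cleaned values sit in column A and the flattened table repeats every 5 rows (both are assumptions; adjust the column and the step to the real layout):
=FILTER(A1:A, MOD(ROW(A1:A) - ROW(A1), 5) = 0)
Changing = 0 to = 1, = 2, and so on pulls the remaining columns of the table.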

xpath returning empty text when web-scraping in r

I'm trying to scrape information from https://www.kff.org/interactive/subsidy-calculator. For instance, put state=California, zip=90001, income=20000, no coverage, 1 people, 1 adult, no children, age=21, no tobacco.
We get the following:
https://www.kff.org/interactive/subsidy-calculator/#state=ca&zip=94704&income-type=dollars&income=20000&employer-coverage=0&people=1&alternate-plan-family=individual&adult-count=1&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=0
I would like to get the numbers for "estimated financial help" and "your cost for a silver plan" (they are bolded in blue in the "Results" grey box; for some reason I can't upload the screenshot). When I use the XPath for the numbers, I get back an empty string. This is not the case if I retrieve some other text (not in the grey box). I wonder what could be wrong with this. I have attached code below. Please forgive me if this is a stupid question, since I'm very new to web-scraping. Thank you!
library(rvest)
state = tolower('CA')
zip = 94704
income = 20000
people = 1
adult = 1
children = 0
url = paste0("https://www.kff.org/interactive/subsidy-calculator/#state=", state, "&zip=", zip, "&income-type=dollars&income=", income, "&employer-coverage=0&people=", people, "&alternate-plan-family=individual&adult-count=", adult, "&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=", children)
# This returns an empty string
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-calculator-new"]/div[5]/div/div/dl/dd[1]/span') %>%
  html_text()
# This returns "Number of children (20 and younger) enrolling in Marketplace coverage", a line that's not in the grey box.
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/p') %>%
  html_text()
The values are generated through scripts that run on the page. Your current method won't allow for this, hence your result. You are likely better off using a method which allows scripts to run, such as RSelenium.
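A minimal RSelenium sketch, assuming a local driver can be started and that the page finishes rendering within the sleep (the port, browser, and wait time are all illustrative):
library(RSelenium)
# start a local browser session
rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client
remDr$navigate(url)  # url as built above
Sys.sleep(5)  # crude wait for the page scripts to fill in the results
elem <- remDr$findElement(using = "xpath",
                          value = '//*[@id="subsidy-calculator-new"]/div[5]/div/div/dl/dd[1]/span')
elem$getElementText()
remDr$close(); rD$server$stop()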
The form you complete #subsidy-form feeds values into a template in a script tag #results-template. The associated calculations are covered in this script https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/calculator.js?ver=1.7.7 where you will find the logic and the pre-set values such as poverty lines per year.
The simplest quick view is probably to inspect the JavaScript variables when the new SubsidyCalculator object is created to process the form, i.e. the js starting with var sc = new SubsidyCalculator. You could 'reverse engineer' those variables with your values plus the values returned from the json below, which I think (but haven't confirmed) feeds the 6 variables that begin with kff_sc, according to zipcode, into the calculator, e.g. silver: kff_sc.silver. You get an idea of the ballpark figures given the default values at the top of the script.
Figures in relation to zipcode are retrieved from this: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/94.json where the last two numbers before .json are the first two numbers of zipcode. You can determine this from the input validation script: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/shared.js?ver=1.7.7
var bucket = $( this ).val().substring( 0, 2 );
if ( kff_sc.buckets[bucket] ) return;
$.ajax( '/wp-content/themes/vip/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/' + bucket + '.json',
The first two digits determine the bucket.
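A minimal sketch of that lookup in R (the URL comes from the answer above; the structure of the returned JSON is not confirmed here):
library(jsonlite)
zip <- "94704"
bucket <- substr(zip, 1, 2)  # the first two digits of the zipcode select the file
zips <- fromJSON(paste0(
  "https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/",
  bucket, ".json"))
str(zips)  # inspect the per-zipcode figures that feed the kff_sc variables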
All in all you could likely implement your own calculator but you would be re-inventing the wheel. Seems easier to just automate the browser and then extract the resultant values.
