Google Sheets IMPORTHTML function could not find the data

I am trying to get the data table from this site http://people.stern.nyu.edu/adamodar/New_Home_Page/datafile/vebitda.html into Google Sheets.
I have tried:
=IMPORTHTML("http://people.stern.nyu.edu/adamodar/New_Home_Page/datafile/vebitda.html", "table", 1)
but this gives me #N/A.
What is wrong?

You can try to pull the raw page source with IMPORTDATA and skip the rows that precede the table:
=QUERY(IMPORTDATA("http://people.stern.nyu.edu/adamodar/New_Home_Page/datafile/vebitda.html"),
"offset 1181")
then strip the HTML tags with:
=ARRAYFORMULA(IFNA(REGEXREPLACE(A1:A, "</?\S+[^<>]*>", )))
and then use FILTER with MOD to pick every n-th value, one column at a time, and rebuild the whole table.
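To make the strip-and-regroup idea concrete, here is a minimal Python sketch of the same logic outside Sheets. It is only an illustration: the sample lines are made up rather than taken from the real page, the tag pattern is a simpler one than the pattern in the formula above, and the helper names `strip_tags`, `column` and `rebuild_table` are hypothetical.

```python
import re

# Simple tag pattern (an assumption for this sketch): "<", anything but "<" or ">", then ">".
TAG_RE = re.compile(r"<[^<>]+>")


def strip_tags(lines):
    """Remove HTML tags from each line and drop lines that end up empty."""
    cleaned = (TAG_RE.sub("", line).strip() for line in lines)
    return [text for text in cleaned if text]


def column(values, n_cols, k):
    """Every n-th value starting at offset k, like FILTER combined with MOD."""
    return values[k::n_cols]


def rebuild_table(values, n_cols):
    """Group a flat list of cell values into rows of n_cols cells each."""
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]


# Hypothetical sample input, not the real page source.
sample = [
    "<tr>",
    "<td>Advertising</td>",
    "<td>58</td>",
    "</tr>",
    "<tr>",
    "<td>Aerospace/Defense</td>",
    "<td>77</td>",
    "</tr>",
]
cells = strip_tags(sample)
table = rebuild_table(cells, n_cols=2)
```

In the sheet, `column` corresponds to one FILTER/MOD formula per output column, and placing those columns side by side gives the same result as `rebuild_table`.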

Related

Web scraping with R?

I have a data frame with a column that holds a URL.
test = data.frame(id = 1, url = "https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
Using this URL, I would like to retrieve an element from the web page, specifically the value of the activity state.
https://zupimages.net/viewer.php?id=20/51/t1fx.png
While researching, I found code that selects the element by its XPath:
library(rvest)
page = read_html("https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
page %>% html_nodes(xpath = '//*[@id="detailAttributFiche"]/div/p') %>% html_text() %>% as.character()
character(0)
As you can see, I always get "character(0)", as if it couldn't read the whole page. I suspect some JavaScript part is not being loaded properly.
How can I do this?
Thank you.
The data comes from this link (see the etatActiviteInst parameter): https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015
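As a rough sketch of how that value could be pulled out once the response has been downloaded, the Python snippet below searches a parsed JSON document for the `etatActiviteInst` key. It assumes the endpoint returns JSON. The layout of the real response is not shown here, so the sample payload and its value are made up, and `find_key` is a hypothetical helper that searches recursively instead of relying on a known path.

```python
import json


def find_key(node, key):
    """Return the first value stored under `key` anywhere in nested JSON data, or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical payload: the structure and the value are invented for illustration.
sample_body = '{"etablissement": {"nom": "Example", "etatActiviteInst": "En fonctionnement"}}'
state = find_key(json.loads(sample_body), "etatActiviteInst")
```

With a real response, `sample_body` would be replaced by the text downloaded from the link above.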

Filter extracted data via ImportDATA

When I try to extract data from https://int.soccerway.com/ via IMPORTDATA, the spreadsheet sometimes returns a message saying that the result exceeds the data limit.
Instead of importing everything, I would like to keep only the values inside <td class="score-time status">, because I want to capture the links contained in that specific td class.
IMPORTXML with "//td[@class='score-time status']/@href" is not an option, because some of these links are hidden and only appear in the full page source, so only IMPORTDATA lets me search all the existing links.
=IMPORTDATA("https://int.soccerway.com/")
I have tried many ways of adding ARRAYFORMULA and FILTER so that it keeps only this data, but each attempt returns an error.
What I need to collect is the links inside:
<td class="score-time status">
You can limit the import to the first 8000 rows and one column with something like:
=ARRAY_CONSTRAIN(IMPORTDATA("https://int.soccerway.com/"), 8000, 1)
Then you can wrap it in QUERY and filter it however you need. For example:
=QUERY(ARRAY_CONSTRAIN(IMPORTDATA("https://int.soccerway.com/"), 8000, 1),
"where Col1 contains 'td'", 0)
=QUERY(ARRAY_CONSTRAIN(IMPORTDATA("https://int.soccerway.com/"), 8000, 1),
"where Col1 contains 'href'", 0)
etc.
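The same filter-then-extract idea can be sketched in Python to show what the final step would have to do. This is only an illustration under two assumptions: the td tag and its link sit on the same line of the page source, and the sample lines below are invented rather than copied from the site. `score_links` is a hypothetical helper name.

```python
import re

# Captures the value of every href="..." attribute on a line.
HREF_RE = re.compile(r'href="([^"]*)"')


def score_links(lines, css_class="score-time status"):
    """Collect href values from lines that mention the given td class."""
    marker = f'class="{css_class}"'
    links = []
    for line in lines:
        if marker in line:
            links.extend(HREF_RE.findall(line))
    return links


# Hypothetical sample lines, not real page content.
sample = [
    '<td class="team team-a"><a href="/teams/a/">A</a></td>',
    '<td class="score-time status"><a href="/matches/123/">18:00</a></td>',
    '<td class="score-time status"><a href="/matches/456/">20:45</a></td>',
]
links = score_links(sample)
```

In the sheet, the "where Col1 contains ..." clause plays the role of the `marker in line` test, and a REGEXEXTRACT on the surviving rows would play the role of `HREF_RE`.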

Adding incomplete columns to a table

I'm new to R and picking it up pretty quick, I think, but I've hit a wall and I'm not even sure what to google to figure this out for myself.
In the code excerpt below, I'm adding a few calculated columns to the table ALLDATA. The problem is with the last statement. If I have an ALLDATA table where every entry has an associated QCAnalysisNumber, the code works fine. If only SOME of the entries have a QCAnalysisNumber, that column doesn't populate at all. I would like it to find an appropriate QCAnalysisNumber, and if it can't, just be NA or let me insert text like "No QCAnalysisNumber".
Can you guys tell me where I'm going wrong or point me in the right direction? Even just appropriate search terms for google would be a huge help. Thanks!
ALLDATA$IntResult <- round(ALLDATA$Value, 0)
ALLDATA$ComboResult <- ifelse(toupper(ALLDATA$DetectedResult)=="N", ALLDATA$Value/2, round(ALLDATA$Value, 0))
ALLDATA$ND15Result <- ifelse(toupper(ALLDATA$DetectedResult)=="N", ALLDATA$Value/2, ALLDATA$Value)
ALLDATA$LogComboResult <- ifelse(ALLDATA$DetectedResult=="N", log10(abs(ALLDATA$Value/2)), log10(abs(ALLDATA$Value)))
ALLDATA$LogResult <- log10(abs(ALLDATA$Value))
ALLDATA$QCAnalysisNumber <- ALLDATA$AnalysisNumber[ALLDATA$QCSampleCode!="O" &
    ALLDATA$LongName==ALLDATAQC$LongName &
    ALLDATA$SampleDate_D==ALLDATAQC$SampleDate_D]

How to use MariaDB's REGEXP_REPLACE?

I have read the docs for MariaDB's REGEXP_REPLACE but cannot get my query to work. I am storing links in a column, link, and want to change the end of each link:
From www.example.com/<code> to www.example.com/#/results/<code> where <code> is some hexadecimal hash, e.g. 55770abb384c06ee00e0c579. What I am trying is:
SELECT REGEX_REPLACE("link", "www\\.example\\.com\\/(.*)", "www\\.example\\.com\\/#\\/results\\/\\1");
The result is:
Showing rows 0 - 0.
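I wasn't able to figure out what the first argument was; the documentation says "subject". It turns out it's just the column name, written without quotes (by default a double-quoted "link" is treated as a string literal, not as the column). So this works: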
UPDATE my_table
SET my_link = REGEXP_REPLACE(
    my_link,
    "http:\\/\\/www\\.example\\.com\\/(.*)",
    "http:\\/\\/www\\.example\\.com\\/#\\/results\\/\\1")
WHERE my_link IS NOT NULL;
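To check what the capture group and the \1 backreference do before running an UPDATE, the same rewrite can be tried on a single string. The sketch below uses Python's `re` module as a stand-in, so its escaping rules are Python's and not MariaDB's. `add_results_prefix` is a hypothetical helper name.

```python
import re

# Same idea as the SQL pattern: capture everything after the host in group 1.
PATTERN = re.compile(r"http://www\.example\.com/(.*)")


def add_results_prefix(link):
    """Insert '#/results/' between the host and the captured remainder of the link."""
    return PATTERN.sub(r"http://www.example.com/#/results/\1", link)


before = "http://www.example.com/55770abb384c06ee00e0c579"
after = add_results_prefix(before)
```

Because `(.*)` matches anything after the host, applying the rewrite to an already rewritten link adds the prefix a second time. The same pattern would do that in SQL if the UPDATE were run twice.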

Excel sheet corrupted after conditional formatting

I am generating an Excel sheet with PHPExcel. When I use conditional formatting with a text condition (search for a text or part of it), I get a validation error when trying to open the generated sheet. It works perfectly with numbers, but not with text.
Here is my code:
//conditional formatting
$objConditional1 = new PHPExcel_Style_Conditional();
$objConditional1->setConditionType(PHPExcel_Style_Conditional::CONDITION_CONTAINSTEXT)
->setOperatorType(PHPExcel_Style_Conditional::OPERATOR_CONTAINSTEXT)
->addCondition('X');
$objConditional1->getStyle()->getFont()->getColor()->setARGB(PHPExcel_Style_Color::COLOR_YELLOW);
$objConditional3 = new PHPExcel_Style_Conditional();
$objConditional3->setConditionType(PHPExcel_Style_Conditional::CONDITION_CELLIS)
->setOperatorType(PHPExcel_Style_Conditional::OPERATOR_GREATERTHANOREQUAL)
->addCondition('0');
$objConditional3->getStyle()->getFont()->getColor()->setARGB(PHPExcel_Style_Color::COLOR_GREEN);
$conditionalStyles = $objPHPExcel->getActiveSheet()->getStyle('B2')->getConditionalStyles();
array_push($conditionalStyles, $objConditional3);
array_push($conditionalStyles, $objConditional1);
$objPHPExcel->getActiveSheet()->getStyle('B2')->setConditionalStyles($conditionalStyles);
Does anyone know how to work around this? Am I doing anything wrong?
Thanks.
As far as I know (though I have not tested your case specifically), when you apply conditional formatting based on CONTAINSTEXT, you shouldn't call
->addCondition('X')
(line 5 in your code), but instead call the method:
->setText('X')
on your PHPExcel_Style_Conditional objects. This is how I specify the text to compare against in my PHPExcel sheets.
