importxml could not fetch url after repeated attempts - web-scraping

I am trying to import the weather data for a number of dates, and one zip code, in Google Sheets. I am using importxml for this in the following base formula:
=importxml("https://www.almanac.com/weather/history/zipcode/89118/2020-01-21","//*")
When using this formula with certain zip codes and certain times, it returns the full text of the page which I then query for the mean temperature and mean dew point. However, with the above example and in many other cases, it returns "Could not fetch URL" and #N/A in the cells.
Thus, the issue is, it works a number of times, but by the fifth date or so, it throws the "Could not fetch URL" error. It also fails as I change zip codes. My only guess based on reading many threads is that because I'm requesting the URL so often from Sheets, it is eventually being blocked. Is there any other error anyone can see? I have to use the formula a few times to calculate relative humidity and other things, so I need it to work multiple times. Is it possible there would be a better way to get this working using a script? Or anything else that could cause this?
Here is the spreadsheet in question (just a work in progress, but the weather part is my issue): https://docs.google.com/spreadsheets/d/1WPyyMZjmMykQ5RH3FCRVqBHPSom9Vo0eaLlff-1z58w/edit?usp=sharing
The formulas that are throwing errors start at column N.
This Sheet contains many formulas using the above base formula, in case you want to see more examples of the problem.
Thanks!

After a great deal of trial and error, I found a solution to my own problem. I'm answering this in detail for anyone who needs to find weather info by zip code and date.
I switched to using importdata, transposed it to speed up the query, and used a helper cell to hold the result for each date. I then have the other formulas searching within the result in the helper cell, instead of calling import*** many times throughout. It is slow at times, but it works. This is the updated helper formula (where O3 contains the date in "YYYY-MM-DD" form, O5 contains the URL "https://www.almanac.com/weather/history/", and O4 contains the zip code:
=if(O3="",,query(transpose(IMPORTdata($O$5&$O$4&"/"&O3)),"select Col487 where Col487 contains 'Mean'"))
And then to get the temperature (where O3 contains the date and O8 contains the above formula):
=if(O3="",,iferror(text(mid(O$8,find("Mean Temperature",O$8)+53,4),"0.0° F"),"Loading..."))
And finally, to calculate the relative humidity:
=if(O3="",,iferror(if(now()=0,,exp(((17.625*243.04)*((mid(O$8,find("Mean Dew Point",O$8)+51,4)-32)/1.8-(mid(O$8,find("Mean Temperature",O$8)+53,4)-32)/1.8))/((243.04+(mid(O$8,find("Mean Temperature",O$8)+53,4)-32)/1.8)*(243.04+(mid(O$8,find("Mean Dew Point",O$8)+51,4)-32)/1.8)))),"Loading..."))
Most importantly, importdata has not once thrown the Could not fetch URL error, so it appears to be a better fetch method for this particular site.
Hopefully this can help others who need to pull in historical weather data :)

Related

R code halts in for loop no error message

I'm working on a segment of R code which is a for loop that reads in columns from a collection of .csv files the paths of which are compiled in a file path directory I've made. As it reads in each file it stores four different columns into four different grand tables within my working environment.
I'm trying to make part of code calculates the monthly average for each month then stores it in a new column so that I can use it to replace missing data points and this involves a couple more for loops for mapping the subset of the aggregate table.
All this ends up being four nested for loops which handles a great deal of data at once before overwriting with the next large file.
After incorporating the nested 3 loops which build the monthly average vector the code started halting as soon as it tries to read in the data file with no error message it just looks like this.
halted code no errror
if I add show_col_types = FALSE to the read_csv it looks like this instead still halting the code.
halted code with show_col_types = FALSE
I can't include the data or much of the code because my company will not allow it but I would appreciate any input since there isn't any error message I can google. Thanks!

Pull data using R from a specific weather station

I've looked through many pages of how to do this and they essentially all have the same R code suggestions, which I've followed. Here's the R code I'm using for the specific weather station I'm looking for:
library(rnoaa)
options(noaakey="MyKeyHere")
ncdc(datasetid='GHCND', stationid='GHCND:USW00014739', datatypeid='dly-tmax-normal', startdate='2017-05-15', enddate='2018-01-04')
The error message I get when I run this is:
Warning message:
Sorry, no data found
I've gone directly to the NOAA site (https://www.ncdc.noaa.gov/cdo-web/search) and manaually pulled the dataset out there (using the "daily summaries" dataset, which is the same as GHCND in the API). There is in fact data there for my entire date range.
What am I missing?
The documentation says:
Note that NOAA NCDC API calls can take a long time depending on the call. The NOAA API doesn't perform well with very long timespans, and will time out and make you angry - beware.
Have you tried a smaller timespan?

Assigning observation name to a value when retrieving a variable

I want to create a dataframe that contains > 100 observations on ~20 variables. Now, this will be based on a list of html files which are saved to my local folder. I would like to make sure that are matches the correct values per variable to each observation. Assuming that R would use the same order of going through the files for constructing each variable AND not skipping variables in case of errors or there like, this should happen automatically.
But, is there a "save way" to this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make it more clear:
#Specifying the url for desired website to be scrapped
url <- 'http://www.imdb.com/search/title?
count=100&release_date=2016,2016&title_type=feature'
#Reading the HTML code from the website
webpage <- read_html(url)
title_data_html <- html_text(html_nodes(webpage,'.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage,'.text-primary'))
description_data_html <- html_text(html_nodes(webpage,'.ratings-bar+ .text-
muted'))
df <- data.frame(title_data_html, rank_data_html,description_data_html)
This would come up with a list of rank and description data, but no reference to the observation name for rank or description (before binding it in the df). Now, in my actual code one variable suddenly comes up with 1 value too much, so 201 descriptions but there are only 200 movies. Without having a reference to which movie the description belongs, it is very though to see why that happens.
A colleague suggested to extract all variables for 1 observation at a time and extend the dataframe row-wise (1 observation at a time), instead of extending column-wise (1 variable at a time), but spotting errors and clean up needs per variable seems way more time consuming this way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!
I know it's not a satisfying answer, but there is not a single strategy for solving this type of problem. This is the work of web scraping. There is no guarantee that the html is going to be structured in the way you'd expect it to be structured.
You haven't shown us a reproducible example (something we can run on our own machine that reproduces the problem you're having), so we can't help you troubleshoot why you ended up extracting 201 nodes during one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and see where the extra or duplicate description is (or where the missing movie is). Perhaps there's an odd element that has an attribute that is also matching your xpath selector text. Look at both the website as it appears in a browser, as well as the source. Right click, CTL + U (PC), or OPT + CTL + U (Mac) are some ways to pull up the source code. Use the search function to see what matches the selector text.
If the html document you're working with is like the example you used, you won't be able to use the strategy you're looking for help with (extract the name of the movie together with the description). You're already extracting the names. The names are not in the same elements as the descriptions.

Having trouble figuring out how to approach this exercise #R scraping #extracting web data

So, sometimes I need to get some data from the web organizing it into a dataframe and waste a lot of time doing it manually. I've been trying to figure out how to optimize this proccess, and I've tried with some R scraping approaches, but couldn't get to do it right and I thought there could be an easier way to do this, can anyone help me out with this?
Fictional exercise:
Here's a webpage with countries listed by continents: https://simple.wikipedia.org/wiki/List_of_countries_by_continents
Each country name is also a link that leads to another webpage (specific of each country, e.g. https://simple.wikipedia.org/wiki/Angola).
I would like as a final result to get a data frame with number of observations (rows) = number of countries listed and 4 variables (colums) as ID=Country Name, Continent=Continent it belongs to, Language=Official language (from the specific webpage of the Countries) and Population = most recent population count (from the specific webpage of the Countries).
Which steps should I follow in R in order to be able to reach to the final data frame?
This will probably get you most of the way. You'll want to play around with the different nodes and probably do some string manipulation (clean up) after you download what you need.

Weka Apriori No Large Itemset and Rules Found

I am trying to do apriori association mining with WEKA (i use 3.7) using given database table
So, i exported two columns (orderLineNumber and productCode) and load it into weka, as far as i go, i haven't got any success attempt, always ended with "No large itemsets and rules found!"
Again, i tried to convert the csv into ARFF file first using ARFF Converter and still get the same message;
I also tried using database loader in WEKA, the data loaded just fine but still give the same result;
The filter i've applied in preprocessing is only numericToNominal filter;
What have i wrongly done here, i suspiciously think it was my ARFF format though, thank you
Update
After further trial, i found out that i exported wrong column and i lack 1 filter process, which is "denormalized", i installed the plugin via packet manager and denormalized my data after converting it to nominal first;
I then compared the results with "Supermarket" sample's result; The only difference are my output came with 'f' instead of 't' (like shown below) and the confidence value seems like always 100%;
First of all, OrderLine is the wrong column.
Obviously, the position on the printed bill is not very important.
Secondly, the file format is not appropriate.
You want one line for every order, one column for every possible item in the #data section. To save memory, it may be helpful to use sparse formats (do not forget to set flags appropriately)
Other tools like ELKI can process input formats like this, that may be easier to use (it also was a lot faster than Weka):
apple banana
milk diapers beer
but last I checked, ELKI would "only" find frequent itemsets (the harder part) not compute association rules. I then used a tiny python script to produce actual association rules as desired.

Resources