Extracting Australian BOM weather data programmatically with R

Here http://www.bom.gov.au/climate/data/ I can enter a substation number, say 009572, choose the variable (say Temperature) and its type (say Maximum). Clicking "get data" brings me to a page with a link "All years of data". Click it, and you get a zip file. I am aware of this question, but here I don't have a direct link to a zip file. Can something be done to automate weather data extraction from the Australian Bureau of Meteorology website with R?

I had the same question and this S.O. question was one of the first pages to come up. After further searching I found the R package bomrang (https://github.com/ropensci/bomrang), which:
Provides functions to interface with Australian Government Bureau of Meteorology (BOM) data, fetching data and returning a tidy data frame of précis forecasts, current weather data from stations, ag information bulletins, historical weather data and downloading and importing radar or satellite imagery.
bomrang is part of rOpenSci and is actively developed. It has a good set of functions:
Several functions are provided by bomrang to retrieve Australian Bureau of Meteorology (BOM) data. A family of functions retrieve weather data and return tidy data frames:
get_precis_forecast(), which retrieves the précis (short) forecast;
get_current_weather(), which fetches the current weather for a given station;
get_ag_bulletin(), which retrieves the agriculture bulletin;
get_weather_bulletin(), which retrieves the BOM 0900 or 1500 bulletins;
get_coastal_forecast(), which returns coastal waters forecasts; and
get_historical(), which retrieves historical daily observations for a given station.
A second group of functions retrieve information pertaining to satellite and radar imagery:
get_available_imagery();
the satellite imagery itself, get_satellite_imagery();
get_available_radar(); and
the radar imagery itself, get_radar_imagery().
The function get_historical() seems to do what the OP needs. For example, getting the historical daily rainfall from a weather station in Sydney is as easy as:
> rain_066062 <- bomrang::get_historical(stationid = 066062,
+                                        type = 'rain',
+                                        meta = T)
> head(rain_066062)
$`meta`
# A tibble: 1 x 10
   site name                        lat   lon start      end        years percent AWS   ncc_obs_code
  <int> <chr>                     <dbl> <dbl> <date>     <date>     <dbl>   <int> <chr> <chr>
1 66062 SYDNEY (OBSERVATORY HILL) -33.9  151. 1858-07-01 2018-11-01  160.     100 Y     136

$historical_data
  Product_code Station_number Year Month Day Rainfall Period Quality
1   IDCJAC0009          66062 1858     1   1       NA     NA
2   IDCJAC0009          66062 1858     1   2       NA     NA
3   IDCJAC0009          66062 1858     1   3       NA     NA
4   IDCJAC0009          66062 1858     1   4       NA     NA
5   IDCJAC0009          66062 1858     1   5       NA     NA
<<SNIP>>
Another nice feature is that if you have the longitude and latitude of a place of interest, get_historical() will find the nearest weather station to that location.
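For example, a minimal sketch of that feature (this assumes the latlon argument of get_historical(), as in recent bomrang versions; the coordinates are just illustrative values for central Sydney):
# Hedged sketch: let bomrang find the nearest station to a lat/lon pair.
# Assumes get_historical() accepts a `latlon` argument; coordinates are illustrative.
rain_sydney <- bomrang::get_historical(latlon = c(-33.87, 151.21), type = "rain")
head(rain_sydney)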
To install from CRAN:
install.packages("bomrang")
Or install the development version from Github:
if (!require("remotes")) {
  install.packages("remotes", repos = "http://cran.rstudio.com/")
  library("remotes")
}
install_github("ropensci/bomrang", build_vignettes = TRUE)

Here's a function I wrote that downloads the data directly, and it also resolves your p_c problem. You can improve the function if you want and post it.
# daily code = 136
# monthly code = 139
bomdata <- function(station, code){
  for(i in 1:length(station)){
    # first request the "availableYears" page, which contains the p_c value
    p.url <- paste("http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_stn_num=",
                   station[i], "&p_display_type=availableYears&p_nccObsCode=", code, sep = "")
    download.file(p.url, "test.txt")
    filelist <- list.files(pattern = ".txt")
    foo <- file(filelist, "r")
    text <- suppressWarnings(readLines(foo))
    close(foo)
    # extract the p_c value: the text between the first ":" and the second ","
    l <- regexpr(":", text[1])
    m <- unlist(gregexpr(",", text[1], perl = TRUE))
    pc <- substr(text[1], l[[1]] + 1, l[[1]] + (m[2] - (l[[1]] + 1)))
    # then build the zip-file URL and download it
    url <- paste("http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=",
                 station[i], "&p_c=", pc, "&p_nccObsCode=", code, "&p_startYear=2013", sep = "")
    suppressWarnings(download.file(url, paste(station[i], ".zip", sep = ""), mode = "wb"))
    unlink("test.txt")
  }
}
Example
bomdata(073137,136)
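One caveat worth noting: an R numeric literal such as 073137 drops its leading zero, so if the BOM endpoint expects the zero-padded six-digit station id, it is safer to pass the ids as character strings, which paste() keeps intact. A hedged usage sketch:
# Hedged sketch: pass zero-padded station ids as character strings so the
# leading zero survives the paste() call inside bomdata(); this also shows
# looping over several stations at once.
stations <- c("009572", "073137")
bomdata(stations, 136)   # 136 = daily data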

You can try this; it is a code sequence used by the metvurst package.
## SET URL FOR DATA DOWNLOAD
url <- "http://www.bom.gov.au/ntc/IDO70004/IDO70004_"
## YEARS TO BE DOWNLOADED
yr <- 1993:2012
## READ DATA FOR ALL YEARS FROM URL INTO LIST
fijilst <- lapply(seq(yr), function(i) {
  read.csv(paste(url, yr[i], ".csv", sep = ""), na.strings = c(-9999, 999))
})
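If you then want a single data frame rather than a list of yearly data frames, a minimal sketch (assuming the yearly CSVs all share the same columns):
## COMBINE THE YEARLY DATA FRAMES INTO ONE (assumes identical columns across years)
names(fijilst) <- yr
fiji <- do.call(rbind, fijilst)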

While I still can't see how to do this with download.file(), the following almost does the job provided Chrome's "Ask where to save each file before downloading" is unticked.
system(paste('"C:/Documents and Settings/UserName/Local Settings/Application Data/Google/Chrome/Application/chrome.exe"',
             '-url http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=009572&p_c=-18465084&p_nccObsCode=136'),
       wait = FALSE)
Then I could use paste0() and loop through various station numbers if I knew what p_c=-18465084 means and how it changes from station to station.
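As the bomdata() answer above shows, p_c can apparently be scraped from the station's "availableYears" page, so it does not need to be known in advance. A hedged sketch that reuses exactly that extraction logic for a single station:
# Hedged sketch: derive p_c for a station from the availableYears page,
# mirroring the substring logic used in the bomdata() function above.
get_pc <- function(station, code = 136) {
  url <- paste0("http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_stn_num=",
                station, "&p_display_type=availableYears&p_nccObsCode=", code)
  first_line <- readLines(url, n = 1, warn = FALSE)
  l <- regexpr(":", first_line)            # position of the first ":"
  m <- unlist(gregexpr(",", first_line))   # positions of all ","
  substr(first_line, l + 1, m[2] - 1)      # text between ":" and the second ","
}
# get_pc("009572")  # should return the p_c value, e.g. "-18465084"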

Related

Extract and match sets from list of filenames

I have a dataset of 4000+ images. For the purpose of figuring out the code, I moved a small subset of them to another folder.
The files look like this:
folder
[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
I cannot remove the name prior to the -ch# as that information is important. What I want to do, however, is to filter this list of images and return only sets (i.e. r01c02f10p01) which have all four ch values (ch1-4).
I was originally thinking that we could approach the issue along the lines of this:
ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")
Then I would apply these lists with the file.remove function, similar to this:
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
file.remove(folder,final2)
However, creating new variables for each ch value fragments out each file. I am unsure how to use these to actually distinguish whether an individual image has all four ch values to meaningfully filter my images. I'm kind of at a loss, as the other sources I've seen have issues that don't quite match this problem.
Earlier, I was able to remove all the images with ch5 from my image set like this. I was thinking this may be helpful in trying to filter only images which have ch1-ch4, but I'm not sure how to proceed.
##Create folder variable which has all image files
folder <- list.files(getwd())
##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
##Remove final2 from folder
file.remove(folder,final2)
To summarize: I expect to filter the files from a random assortment without complete ch values (i.e. maybe only ch1 and ch2, or ch3 and ch4, or ch1, ch2, ch3, and ch4) down to an assortment which only contains files with a complete set (four files with ch1, ch2, ch3, and ch4).
Starting with a vector of filenames like you would get from list.files or something similar, you can create a data frame of filenames, use regex to extract the alphanumeric part at the beginning and the number that follows "-ch". Then check that all elements of an expected set (I put this in ch_set, but there might be another way you need to do this) occur in each group's set of CH values.
# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff", "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff", "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff", "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff", "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff", "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff", "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")
library(dplyr)
ch_set <- 1:4
files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
  tidyr::extract(filename, into = c("group", "ch"), regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
  mutate(ch = as.numeric(ch)) %>%
  group_by(group) %>%
  filter(all(ch_set %in% ch))
files_to_keep
#> # A tibble: 8 x 3
#> # Groups: group [2]
#>   filename              group            ch
#>   <chr>                 <chr>         <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01      1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01      2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01      3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01      4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01      1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01      2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01      3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01      4
Now that you have a dataframe of the complete groups, just pull the matching filenames back out:
files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
One thing to note is that this worked even without the mutate() line where I converted ch to numeric (i.e. comparing the character versions of those numbers against the regular numeric versions), because under the hood %in% converts to matching types. That didn't seem totally safe if you need to scale this, so I converted them to matching types explicitly.
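For illustration, %in% coerces both sides to a common type before matching, so the character/numeric comparison still behaves as expected:
c("1", "2", "5") %in% 1:4
#> [1]  TRUE  TRUE FALSE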

Trouble merging two dataframes in R (VLOOKUP)

I need help merging two data frames with R. I'm a little desperate, since I have tried everything I could. Any help would be appreciated.
The thing is that I'm doing some daily web scraping, and I need to compare today's results with yesterday's results in order to detect if there have been any changes.
I only have two variables (title of the page and url) in two dataframes (one for today and one for yesterday), and I want to merge them in one.
The possible changes are:
Changes in the name.
Changes in the url.
New programs (new name and new url).
Deleted programs.
I've tried merge, cast & melt, ifelse, etc., and I can't solve the problem. For example:
yesterday <- read.csv2("Yesterday.csv")
today <- read.csv2("Today.csv")
new <- merge(x = today, y = yesterday, all = TRUE, sort = TRUE)
But without the desired result. I'm attaching three files:
Today.csv, with the results of today's scraping
Yesterday.csv, with the results of yesterday's scraping
Results.xlsx with the desired output. A VLOOKUP in Excel, highlighting the changes I want to detect (in this case name changes).
I would need a solution for the four change options. The output could be different, I don't care about that, but I need the comparison to be correct. Even if you find that this question is a duplicate, I would need the link to the other one, because I haven't been able to find it.
Thanks in advance.
The answer has been updated in response to the comments below:
library(tidyverse)
bind_rows(
  anti_join(today, yest) %>%
    mutate(
      label = ifelse(programa %in% yest$programa, 'changed', 'added')
    ),
  anti_join(yest, select(today, programa)) %>% mutate(label = "deleted")
)
When applied to the whole data sets, this returns the following results:
# # A tibble: 6 x 3
# programa url label
# <chr> <chr> <chr>
# 1 Carrera de Derecho a distancia |~ https://universidadeuropea.es/onlin~ added
# 2 "Carrera de Criminolog\xeda a di~ https://universidadeuropea.es/onlin~ added
# 3 "Carrera Ingenier\xeda Inform\xe~ https://universidadeuropea.es/onlin~ added
# 4 Grado en Derecho a distancia | U~ https://universidadeuropea.es/onlin~ dele~
# 5 "Grado en Criminolog\xeda a dist~ https://universidadeuropea.es/onlin~ dele~
# 6 "Grado Ingenier\xeda Inform\xe1t~ https://universidadeuropea.es/onlin~ dele~
To check whether it is able to register changes in a program, we can do the following:
yest[22, 2] <- yest[23, 2]
Piping the changed data into the code above returns a table with an additional record, labelled as changed:
# # A tibble: 7 x 3
# programa url label
# <chr> <chr> <chr>
# 1 "M\xe1ster en Direcci\xf3n Hotel~ https://universidadeuropea.es/onlin~ chan~
# 2 Carrera de Derecho a distancia |~ https://universidadeuropea.es/onlin~ added
# 3 "Carrera de Criminolog\xeda a di~ https://universidadeuropea.es/onlin~ added
# 4 "Carrera Ingenier\xeda Inform\xe~ https://universidadeuropea.es/onlin~ added
# 5 Grado en Derecho a distancia | U~ https://universidadeuropea.es/onlin~ dele~
# 6 "Grado en Criminolog\xeda a dist~ https://universidadeuropea.es/onlin~ dele~
# 7 "Grado Ingenier\xeda Inform\xe1t~ https://universidadeuropea.es/onlin~ dele~
Explanation:
Everything enclosed inside bind_rows() is combined into a single tibble. Since we have two separate anti_join() statements here, and each of them returns its own tibble, we have to bind them into one;
anti_join() is a set operation which, given two sets A and B, returns another set C containing the rows of A that have no match in B. In other words, C is the difference between A and B.
When we call anti_join(today, yest) we obtain the subset of today with records that are either not present in yest at all, or whose program or url has changed compared to yest. We pipe those results into the mutate() call and assign the value changed to label if the value of programa is the same as yesterday's (programa %in% yest$programa) while the url value has changed. If programa %in% yest$programa is FALSE, it means the program name wasn't present in yest, so it is a new program, and we label it as added.
When we call anti_join() for a second time, we are looking for the difference between yest and today in program names. In other words: 'Which programs present in yest are not present in today?' We achieve this by looking for the subset of yest with program names which are not among the program names of today (that's why we need select(today, programa)). If any such records are detected, they are labelled as deleted.
Sorry if this explanation is somewhat clumsy, but I hope it will help you to navigate the code.
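As a minimal toy illustration of what anti_join() returns (hypothetical one-column data, dplyr loaded):
library(dplyr)
a <- tibble(programa = c("x", "y", "z"))
b <- tibble(programa = c("y", "z"))
anti_join(a, b, by = "programa")
#> # A tibble: 1 x 1
#>   programa
#>   <chr>
#> 1 x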
Data:
tmp <- tempfile()
download.file(
"https://drive.google.com/uc?authuser=0&id=1scYdZrGYaSDr-TE8IZsy1tKSdLjMn7jt&export=download",
tmp
)
today <- read_delim(tmp, delim = ";")
download.file(
"https://drive.google.com/uc?authuser=0&id=1uJ-ThiKykTjoY1gc3jlBHoab8WAJD-wP&export=download",
tmp
)
yest <- read_delim(tmp, delim = ";")
file.remove(tmp)

Can I filter out certain rows/records when retrieving data from Salesforce using the RForcecom function "rforcecom.retrieve"?

Thanks for helping me with my first Stack Overflow question. I am trying to retrieve all the data from several fields in an Object called "Applied Questionnaire"; however, I do not want to retrieve any records that have the name "Training Site".
Currently, this is my code, which works:
quarterly_site_scores = rforcecom.retrieve(session, "AppliedQuestionnaire__c",
                                            c("Site__c", "Site_Name__c", "Total_Score__c")) %>%
  rename(site_id = Site__c, site_name = Site_Name__c)

quarterly_site_scores = quarterly_site_scores[!(quarterly_site_scores$site_name == "TRAINING PARK SITE" |
                                                  quarterly_site_scores$status != "Completed"), ]
However, I'm wondering if there's a more elegant, streamlined solution here. Can I filter at the same time I retrieve? Or is there a better way to filter here?
(I've simplified the code here - I'm actually pulling in about ten fields and filtering on about five or six criteria, just in this one example).
Thank you.
Adding what the OP discovered as an answer, using the salesforcer package, which returns the SOQL result set as a tbl_df.
library(salesforcer)
library(tidyverse)
sf_auth(username, password, security_token)
# list all object names in a Salesforce org
ped_objects <- sf_list_objects() %>% .$sobjects %>% map_chr(~pluck(., "name"))
# list all the fields on a particular object
fields <- sf_describe_object_fields('AppliedQuestionnaireBundle2__c')
# write a query to retrieve certain records from that object
site_scores_soql <- "SELECT Site__c,
                            Site_Name__c,
                            Total_Score__c
                     FROM AppliedQuestionnaireBundle2__c
                     WHERE Site_Name__c != 'GENERIC SITE'
                       AND Site_Name__c != 'TRAINING PARK SITE'
                       AND Status__c = 'Completed'"
# run the query
quarterly_site_scores <- sf_query(site_scores_soql)
quarterly_site_scores
#> # A tibble: 3 x 3
#> Site__c Site_Name__c Total_Score__c
#> <chr> <chr> <dbl>
#> 1 A Site Name1 78
#> 2 B Site Name2 52
#> 3 C Site Name3 83
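If you would rather stay with RForcecom, the same idea should work there too; a hedged sketch, assuming rforcecom.query() (which accepts a raw SOQL string) is available in your version of the package and that the object and field names from the question exist in your org:
# Hedged sketch: push the filtering into the SOQL WHERE clause via RForcecom.
# Assumes rforcecom.query(session, soql) and the field/object names from the question.
soql <- "SELECT Site__c, Site_Name__c, Total_Score__c
         FROM AppliedQuestionnaire__c
         WHERE Site_Name__c != 'TRAINING PARK SITE'
           AND Status__c = 'Completed'"
quarterly_site_scores <- rforcecom.query(session, soql)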

Showing a map with mapview when the data.frame includes multibyte characters

I want to show the data using the mapview package, but when the data includes multibyte characters, sometimes the map cannot be shown.
What would be the best way to show the map?
library(mapview)
data(atlStorms2005)
test1 <- test2 <- atlStorms2005
test1@data$test <- as.factor(c("日本語", "てすと"))
test2@data$test <- as.factor(c("日本語", "五十嵐"))
mapview(test1) # can show the map
mapview(test2) # cannot show
re.data.frame <- function(data, encoding = "UTF-8", fileEncoding = "UTF-8"){
  write.csv(data, file("tmp.csv", encoding = encoding), row.names = F, fileEncoding = fileEncoding)
  tmp <- readr::read_csv("tmp.csv", col_types = readr::cols())
  return(tmp)
}
test2@data <- re.data.frame(test2@data)
mapview(test2) # can show
However, the text in the popup for the test column is corrupted, even though the data itself is correct:
head(test2@data)
# A tibble: 6 × 4
Name MaxWind MinPress test
<chr> <int> <int> <chr>
1 ALPHA 45 998 日本語
2 ARLENE 60 989 五十嵐
3 BRET 35 1002 日本語
4 CINDY 65 991 五十嵐
5 DELTA 60 980 日本語
6 DENNIS 130 930 五十嵐
As of commit bc2c57f, this should have been fixed. Until the next CRAN release of mapview, simply use the development version (devtools::install_github("environmentalinformatics-marburg/mapview", ref = "develop")) to solve this issue.
In brief, this behavior was related to our Rcpp routines which run under the hood in order to ensure a computationally efficient creation of popup tables. Here, the user's native encoding was used instead of UTF-8 to create JSON output files, resulting in corrupted text output on some machines where UTF-8 was not the default.

Scrape number of articles on a topic per year from NYT and WSJ?

I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:
     NYT WSJ
2011   2   3
2012  10   7
I found this tutorial for the NYT but it is not working for me :_(. When I get to line 30 I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
Thanks!
PS: This is my code that is not working (an NYT API key is needed: http://developer.nytimes.com/apps/register)
# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
  # concatenate URL for each offset
  uri <- paste("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
  raw.data <- readLines(uri, warn="F")
  res <- fromJSON(raw.data)
  dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't seem to be able to compare POSIX objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes
There is an RNYTimes package created by Duncan Temple-Lang (https://github.com/omegahat/RNYTimes), but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, but that's not relevant for you.
I'm reworking RNYTimes right now. Install it from GitHub; you need to install devtools first to get install_github:
install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")
Then try your search with that, e.g.:
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you the number of articles found:
moocs$response$meta$hits
[1] 121
You could get word counts for each article with:
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880
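To get closer to the per-year counts asked for in the question, one option is to query the current Article Search API (v2) directly and read the hits field for each year. A hedged sketch using httr and jsonlite rather than the packages above, assuming the documented v2 endpoint with its q, begin_date, end_date and api-key parameters and a response$meta$hits field:
# Hedged sketch: count NYT articles per year for a query via the Article Search API v2.
# Assumes the endpoint, parameter names, and response layout (response$meta$hits)
# as documented on the NYT developer site.
library(httr)
library(jsonlite)

nyt_hits_per_year <- function(q, years, api_key) {
  sapply(years, function(y) {
    res <- GET("https://api.nytimes.com/svc/search/v2/articlesearch.json",
               query = list(q = q,
                            begin_date = paste0(y, "0101"),
                            end_date   = paste0(y, "1231"),
                            `api-key`  = api_key))
    fromJSON(content(res, as = "text", encoding = "UTF-8"))$response$meta$hits
  })
}
# nyt_hits_per_year("MOOCs", 2011:2012, api_key = "<yourkey>")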
