Scraping table from timeanddate.com using R - r

I'm trying to scrape weather data (in R) for the 2nd of March on the following web page: https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020 I am interested in the table at the end, below "Stockholm Weather History for..."
Just above and to the right of that table is a drop-down list where I chose the 2nd of March. But when I scrape using RSelenium I only get the data for the 1st of March.
How can I get the data for the 2nd (and any other date except the 1st)?
I have also tried to scrape the entire page using read_html but I can't find a way to extract the data I want from that.
The following code only seems to work for the 1st, not for any other date in the month.
library(tidyverse)
library(rvest)
library(RSelenium)
library(stringr)
library(dplyr)
rD <- rsDriver(browser="chrome", port=4234L, chromever ="85.0.4183.83")
remDr <- rD[["client"]]
remDr$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
webElems <- remDr$findElements(using="class name", value="sticky-wr")
s<-webElems[[1]]$getElementText()
s<-as.character(s)
print(s)

Here's an approach with RSelenium
library(RSelenium)
library(rvest)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
client$findElement(using = "link text","Mar 2")$clickElement()
source <- client$getPageSource()[[1]]
read_html(source) %>%
html_node(xpath = '//*[@id="wt-his"]') %>%
html_table %>%
head
Conditions Conditions Conditions Comfort Comfort Comfort
1 Time Temp Weather Wind Humidity Barometer Visibility
2 12:20 amMon, Mar 2 39 °F Chilly. 7 mph ↑ 87% 29.18 "Hg N/A
3 12:50 am 37 °F Chilly. 7 mph ↑ 87% 29.18 "Hg N/A
4 1:20 am 37 °F Passing clouds. 7 mph ↑ 87% 29.18 "Hg N/A
5 1:50 am 37 °F Passing clouds. 7 mph ↑ 87% 29.18 "Hg N/A
6 2:20 am 37 °F Overcast. 8 mph ↑ 87% 29.18 "Hg N/A
You can then iterate over dates for findElement().
You can find the xpath by right clicking on the table and choosing Inspect in Chrome:
Then, you can find the table element, right click and choose Copy > Copy XPath.
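Iterating over dates could look roughly like the sketch below. It assumes every day is exposed as a link labelled like the "Mar 2" link used above; `scrape_day` and its `wait` argument are hypothetical names introduced only for illustration, and the fixed pause is a crude stand-in for proper waiting.

```r
# Link texts for every day in March: "Mar 1", "Mar 2", ...
# (assumption: the page labels each day like the "Mar 2" link used above)
day_labels <- paste(month.abb[3], 1:31)

# Hypothetical helper: click one day's link and return its table.
# `client` is an RSelenium remote driver such as the one created above.
scrape_day <- function(client, label, wait = 1) {
  client$findElement(using = "link text", label)$clickElement()
  Sys.sleep(wait)  # crude pause so the table can refresh; tune as needed
  page <- client$getPageSource()[[1]]
  rvest::html_table(
    rvest::html_node(xml2::read_html(page), xpath = '//*[@id="wt-his"]')
  )
}

# Usage (needs a running Selenium session, so it is left commented out):
# tables <- lapply(day_labels, function(lbl) scrape_day(client, lbl))
# names(tables) <- day_labels
```

Keeping the click-and-read step in one function makes it easy to wrap in `tryCatch()` if some days have no link.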

It is always useful to use your browser's "developer tools" to inspect the web page and figure out how to extract the information you need.
A couple of tutorials that explain this I just googled:
https://towardsdatascience.com/tidy-web-scraping-in-r-tutorial-and-resources-ac9f72b4fe47
https://www.scrapingbee.com/blog/web-scraping-r/
For example, on this particular webpage, when we select a new date in the dropdown list, the page sends a GET request to the server, which returns a JSON string with the data for the requested date. The page then updates the data in the table (probably using JavaScript -- I did not check this).
So, in this case you need to emulate this behavior, capture the JSON response and parse the information in it.
In Chrome, if you look at the developer tool network pane, you will see that the address of the GET request is of the form:
https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=YYYYMMDD&month=M&year=YYYY&json=1
where YYYY stands for year with 4 digits, MM(M) month with two (one) digits, and DD day of the month with two digits.
So you can set your code to do the GET request directly to this address, get the json response and parse it accordingly.
library(rjson)
library(rvest)
library(plyr)
library(dplyr)
year <- 2020
month <- 3
day <- 7
# create formatted url with desired dates
url <- sprintf('https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=%4d%02d%02d&month=%d&year=%4d&json=1', year, month, day, month, year)
webpage <- read_html(url) %>% html_text()
# json string is not formatted the way fromJSON function needs
# so I had to parse it manually
# split string on each row
x <- strsplit(webpage, "\\{c:")[[1]]
# remove first element (garbage)
x <- x[2:length(x)]
# clean last 2 characters in each row
x <- sapply(x, FUN=function(xx){substr(xx[1], 1, nchar(xx[1])-2)}, USE.NAMES = FALSE)
# function to get actual data in each row and put it into a dataframe
parse.row <- function(row.string) {
# parse columns using '},{' as divider
a <- strsplit(row.string, '\\},\\{')[[1]]
# remove some leftover characters from parsing
a <- gsub('\\[\\{|\\}\\]', '', a)
# remove what I think is metadata
a <- gsub('h:', '', gsub('s:.*,', '', a))
df <- data.frame(time=a[1], temp=a[3], weather=a[4], wind=a[5], humidity=a[7],
barometer=a[8])
return(df)
}
# use ldply to run function parse.row for each element of x and combine the results in a single dataframe
df.final <- ldply(x, parse.row)
Result:
> head(df.final)
time temp weather wind humidity barometer
1 "12:20 amSat, Mar 7" "28 °F" "Passing clouds." "No wind" "100%" "29.80 \\"Hg"
2 "12:50 am" "28 °F" "Passing clouds." "No wind" "100%" "29.80 \\"Hg"
3 "1:20 am" "28 °F" "Passing clouds." "1 mph" "100%" "29.80 \\"Hg"
4 "1:50 am" "30 °F" "Passing clouds." "2 mph" "100%" "29.80 \\"Hg"
5 "2:20 am" "30 °F" "Passing clouds." "1 mph" "100%" "29.80 \\"Hg"
6 "2:50 am" "30 °F" "Low clouds." "No wind" "100%" "29.80 \\"Hg"
I left everything as strings in the data frame, but you can convert the columns to numeric or dates as you need.
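A minimal sketch of that conversion, assuming the columns look like the strings printed above (including the embedded quotes). `df.sample` and `to_number` are hypothetical names used only for this illustration.

```r
# Two sample rows shaped like the printed result (values copied from it)
df.sample <- data.frame(
  temp      = c('"28 \u00b0F"', '"30 \u00b0F"'),
  humidity  = c('"100%"', '"100%"'),
  barometer = c('"29.80 \\"Hg"', '"29.80 \\"Hg"'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep digits, sign and decimal point, then convert.
# Strings without any digits (e.g. "No wind") become NA.
to_number <- function(x) as.numeric(gsub("[^0-9.-]", "", x))

df.sample$temp_f         <- to_number(df.sample$temp)
df.sample$humidity_pct   <- to_number(df.sample$humidity)
df.sample$barometer_inhg <- to_number(df.sample$barometer)
```

The units are dropped by the helper, so it is worth encoding them in the new column names as done here.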

Related

Compiling API outputs in XML format in R

I have searched everywhere trying to find an answer to this question and I haven't quite found what I'm looking for yet so I'm hoping asking directly will help.
I am working with the USPS Tracking API, which provides output in XML format. The API is limited to 35 results per call (i.e. you can only provide 35 tracking numbers each time you call the API) and I need information on ~90,000 tracking numbers, so I am running my calls in a for loop. I was able to store the results of the calls in a list, but then I had trouble exporting the list as-is into anything usable. When I tried to convert the results from the list into JSON, it dropped the ID attribute, which contained the tracking number I had used to generate the results.
Here is what a sample result looks like:
<TrackResponse>
<TrackInfo ID="XXXXXXXXXXX1">
<TrackSummary> Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
<TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805</TrackDetail>
<TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
<TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
Here is the script I ran to get the output I'm currently working with:
final_tracking_info <- list()
for (i in 1:x) { # where x = the number of calls to the API the loop will need to make
usps = input_tracking_info[i] # input_tracking_info = GET commands
usps = read_xml(usps)
final_tracking_info[[i]]<-usps$TrackResponse
gc()
}
final_output <- toJSON(final_tracking_info)
write(final_output,"final_tracking_info.json") # tried converting to JSON, lost the ID attribute
cat(capture.output(print(working_list),file = "Final_Tracking_Info.txt")) # exported the list to a textfile, was not an ideal format to work with
What I ultimately want to get from this data is a table containing the tracking number, the first track detail, and the last track detail. What I'm wondering is, is there a better way to compile this in XML/JSON that will make it easier to convert to a tibble/df down the line? Is there an easy way or preferred format to use, given that most of the columns will have the same name ("TrackDetail") and the data frames will have different lengths (since each package has a different number of track details), when I'm trying to compile thousands of results into one final output?
Using XML::xmlToList() will store the ID attribute in .attrs:
$TrackSummary
[1] " Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830."
$TrackDetail
[1] "February 6 6:49 am NOTICE LEFT BARTOW FL 33830"
$TrackDetail
[1] "February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830"
$TrackDetail
[1] "February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805"
$TrackDetail
[1] "February 5 7:28 pm ENROUTE 33699"
$TrackDetail
[1] "February 5 7:18 pm ACCEPT OR PICKUP 33699"
$.attrs
ID
"XXXXXXXXXXX1"
A way of using that output which assumes that the Summary and ID are always present as first and last elements, respectively, is:
library(magrittr) # provides the %>% pipe
xml_data <- XML::xmlToList("71563898.xml") %>%
unlist() %>% # flattening
unname() # removing names
data.frame (
ID = tail(xml_data, 1), # getting last element
Summary = head(xml_data, 1), # getting first element
Info = xml_data %>% head(-1) %>% tail(-1) # remove first and last elements
)
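For the table the question asks for (tracking number, first detail, last detail), a sketch with the xml2 package is shown below as an alternative to `xmlToList()`. The inline XML is a shortened copy of the sample above plus one made-up second package; `summarise_track` is a hypothetical helper, and "first" and "last" here simply mean document order.

```r
library(xml2)

# Shortened sample response; the second TrackInfo is invented for illustration
response <- '<TrackResponse>
  <TrackInfo ID="XXXXXXXXXXX1">
    <TrackSummary>Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
    <TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
    <TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
    <TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
  </TrackInfo>
  <TrackInfo ID="XXXXXXXXXXX2">
    <TrackSummary>Hypothetical second package.</TrackSummary>
    <TrackDetail>February 7 9:00 am HYPOTHETICAL EVENT</TrackDetail>
  </TrackInfo>
</TrackResponse>'

doc   <- read_xml(response)
infos <- xml_find_all(doc, ".//TrackInfo")

# Hypothetical helper: one row per TrackInfo node, keeping the ID attribute
summarise_track <- function(node) {
  details <- xml_text(xml_find_all(node, "./TrackDetail"))
  n <- length(details)
  data.frame(
    ID           = xml_attr(node, "ID"),
    first_detail = if (n > 0) details[1] else NA_character_,
    last_detail  = if (n > 0) details[n] else NA_character_,
    stringsAsFactors = FALSE
  )
}

rows <- lapply(seq_along(infos), function(i) summarise_track(infos[[i]]))
tracking_table <- do.call(rbind, rows)
```

Because every row has the same three columns regardless of how many details a package has, the per-call tables can be appended with `rbind()` inside the API loop.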

Retrieve date modified of a file from an FTP Server

Building off of this question (Retrieve modified DateTime of a file from an FTP Server), it's clear how to get the date modified value. However, the full date is not returned even though it's visible from the FTP site.
This shows how to get the date modified values for files at ftp://ftp.FreeBSD.org/pub/FreeBSD/
library(curl)
library(stringr)
con <- curl("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
dat <- readLines(con)
close(con)
no_dirs <- grep("^d", dat, value=TRUE, invert=TRUE)
date_and_name <- sub("^[[:alnum:][:punct:][:blank:]]{43}", "", no_dirs)
dates <- sub('\\s[[:alpha:][:punct:][:alpha:]]+$', '', date_and_name)
dates
## [1] "May 07 2015" "Apr 22 15:15" "Apr 22 10:00"
Some dates are in month/day/year format, others are in month/day/hour/minute format.
Looking at the FTP site, all dates are shown in month/day/year hour/minutes/seconds format.
I assume it's got something to do with Unix format standards (explained in FTP details command doesn't seem to return the year the file was modified, is there a way around this?). It would be nice to get the full date.
If you use download.file you get an html representation of the directory which you can parse with the xml2 package.
read_ftp <- function(url)
{
tmp <- tempfile()
download.file(url, tmp, quiet = TRUE)
html <- xml2::read_html(readChar(tmp, 1e6))
file.remove(tmp)
lines <- strsplit(xml2::xml_text(html), "[\n\r]+")[[1]]
lines <- grep("(\\d{2}/){2}\\d{4}", lines, value = TRUE)
result <- read.table(text = lines, stringsAsFactors = FALSE)
setNames(result, c("Date", "Time", "Size", "File"))
}
Which allows you to just do this:
read_ftp("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
#> Date Time Size File
#> 1 05/07/2015 12:00AM 4,259 README.TXT
#> 2 04/22/2020 08:00PM 35 TIMESTAMP
#> 3 04/22/2020 08:00PM Directory development
#> 4 04/22/2020 10:00AM 2,325 dir.sizes
#> 5 11/12/2017 12:00AM Directory doc
#> 6 11/12/2017 12:00AM Directory ports
#> 7 04/22/2020 08:00PM Directory releases
#> 8 11/09/2018 12:00AM Directory snapshots
Created on 2020-04-22 by the reprex package (v0.3.0)

convert data modis EVI to date yy-mm-dd in r

I work with MODIS EVI rasters for the year 2000. I have 6 rasters per year, one raster per month:
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000209_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000225_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000241_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000257_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000273_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000289_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000305_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000321_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000337_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000353_aid0001.tif"
I would like to convert the date in each file name to a format like 2000-02-09, but I don't know how to do it.
Looks like your datetime string format follows year(4digits)day_of_year(3digit) format, e.g. 2000209 means 2000 year & 209 days (27th July 2000). If it's true then the problem isn't difficult:
Extract those seven digit numbers. (str_extract)
Parse 'datetime' from it. (you need to know that j is used for parsing date from the day_of_year.)
str_subset('[:graph:]') keeps every element that contains at least one visible character; it is used here mainly to return the parsed dates as character strings.
Code
library(stringr)   # str_extract, str_replace, str_subset, %>%
library(lubridate) # parse_date_time
dt <- data.frame(string =
c("D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000209_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000225_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000241_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000257_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000273_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000289_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000305_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000321_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000337_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000353_aid0001.tif"))
dt$string %>% str_extract('\\d{7}') %>% str_replace('2000', '2000-') %>%
parse_date_time('y-j') %>% str_subset('[:graph:]')
Output
[1] "2000-07-27" "2000-08-12" "2000-08-28" "2000-09-13" "2000-09-29" "2000-10-15"
[7] "2000-10-31" "2000-11-16" "2000-12-02" "2000-12-18"
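The same conversion can be sketched in base R, assuming every file name contains `doy` followed by a four-digit year and a three-digit day of year. `%j` in `as.Date()` is the day-of-year format code; `paths` holds two of the file names from the question.

```r
paths <- c(
  "D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000209_aid0001.tif",
  "D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000353_aid0001.tif"
)

# Pull out the 7 digits after "doy" (year + day of year)
doy <- sub(".*doy([0-9]{7}).*", "\\1", basename(paths))

# %Y = 4-digit year, %j = day of year (001-366)
dates <- as.Date(doy, format = "%Y%j")
format(dates)
# [1] "2000-07-27" "2000-12-18"
```

This returns a Date vector rather than character strings, which is usually more convenient for sorting or indexing the rasters.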

How do I change the index in a csv file to a proper time format?

I have a CSV file of 1000 daily prices
They are of this format:
1 1.6
2 2.5
3 0.2
4 ..
5 ..
6
7 ..
.
.
1700 1.3
The index is from 1:1700
But I need to specify a begin date and end date this way:
The start of the period is, let's say, 25th January 2009
and the last (1700th) value corresponds to 14th May 2013.
So far I've gotten this close to solving the problem:
> dseries <- ts(dseries[,1], start = ??time??, freq = 30)
How do I go about this? thanks
UPDATE:
I managed to create a separate object with dates as suggested in the answers and plotted it, but the y axis is weird, as shown in the screenshot.
Something like this?
as.Date("25-01-2009",format="%d-%m-%Y") + (seq(1:1700)-1)
A better way, thanks to @AnandaMahto:
seq(as.Date("2009-01-25"), by="1 day", length.out=1700)
Plotting:
df <- data.frame(
myDate=seq(as.Date("2009-01-25"), by="1 day", length.out=1700),
myPrice=runif(1700)
)
plot(df)
R stores Date-classed objects as the integer offset from "1970-01-01", but the as.Date.numeric function needs an offset ('origin'), which can be any starting date:
rDate <- as.Date.numeric(dseries[,1], origin="2009-01-24")
Testing:
> rDate <- as.Date.numeric(1:10, origin="2009-01-24")
> rDate
[1] "2009-01-25" "2009-01-26" "2009-01-27" "2009-01-28" "2009-01-29"
[6] "2009-01-30" "2009-01-31" "2009-02-01" "2009-02-02" "2009-02-03"
You don't need to add the extension .numeric, since R automatically seeks out that function if you use the generic stem, as.Date, with an integer argument. I just put it in because as.Date.numeric has different arguments than as.Date.character.
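Putting this together, a small sketch that attaches daily dates to the price column and plots price against date. `prices` is a random stand-in for the CSV column (an assumption, since the real file is not shown), and the sketch assumes one value per calendar day.

```r
set.seed(1)
prices <- runif(1700)  # stand-in for the 1700 values read from the CSV

# The index 1..1700 becomes a date via the origin one day before the start
dates <- as.Date(seq_along(prices), origin = "2009-01-24")

price_df <- data.frame(date = dates, price = prices)

# Plot price against date explicitly so the y axis shows prices
if (interactive()) {
  plot(price_df$date, price_df$price, type = "l",
       xlab = "Date", ylab = "Price")
}

# Sanity check: compare the last generated date with the expected end date
range(price_df$date)
```

If the last date does not match the expected end of the series, the data probably skip some days (for example weekends), and the dates would need to come from the source instead.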

Date sequence in R spanning B.C.E. to A.D

I would like to generate a sequence of dates from 10,000 B.C.E. to the present. This is easy from 0 C.E. (or A.D.) onward:
ADtoNow <- seq.Date(from = as.Date("0/1/1"), to = Sys.Date(), by = "day")
But I am stumped as to how to generate dates before 0 AD. Obviously, I could do years before present but it would be nice to be able to graph something as BCE and AD.
To expand on Ricardo's suggestion, here is some testing of how things work. Or don't work for that matter.
I will repeat Joshua's warning taken from ?as.Date for future searchers in big bold letters:
"Note: Years before 1CE (aka 1AD) will probably not be handled correctly."
as.integer(as.Date("0/1/1"))
[1] -719528
as.integer(seq(as.Date("0/1/1"),length=2,by="-10000 years"))
[1] -719528 -4371953
seq(as.Date(-4371953,origin="1970-01-01"),Sys.Date(),by="1000 years")
# nonsense
[1] "0000-01-01" "'000-01-01" "(000-01-01" ")000-01-01" "*000-01-01"
[6] "+000-01-01" ",000-01-01" "-000-01-01" ".000-01-01" "/000-01-01"
[11] "0000-01-01" "1000-01-01" "2000-01-01"
> as.integer(seq(as.Date(-4371953,origin="1970-01-01"),Sys.Date(),by="1000 years"))
# also possibly nonsense
[1] -4371953 -4006710 -3641468 -3276225 -2910983 -2545740 -2180498 -1815255
[9] -1450013 -1084770 -719528 -354285 10957
Though this does seem to work for graphing somewhat:
yrs1000 <- seq(as.Date(-4371953,origin="1970-01-01"),Sys.Date(),by="1000 years")
plot(yrs1000,rep(1,length(yrs1000)),axes=FALSE,ann=FALSE)
box()
axis(2)
axis(1,at=yrs1000,labels=c(paste(seq(10000,1000,by=-1000),"BC",sep=""),"0AD","1000AD","2000AD"))
title(xlab="Year",ylab="Value")
Quite some time has gone by since this question was asked. With that time came a new R package, gregorian which can handle BCE time values in the as_gregorian method.
Here's an example of piecewise constructing a list of dates that range from -10000 BCE to the current year.
library(lubridate)
library(gregorian)
# Container for the dates
dates <- c()
starting_year <- year(now())
# Add the CE dates to the list
for (year in starting_year:0){
date <- sprintf("%s-%s-%s", year, "1", "1")
dates <- c(dates, gregorian::as_gregorian(date))
}
starting_year <- "-10000"
# Add the BCE dates to the list
for (year in starting_year:0){
start_date <- gregorian::as_gregorian("-10000-1-1")
date <- sprintf("%s-%s-%s", year, "1", "1")
dates <- c(dates, gregorian::as_gregorian(date))
}
How you use the list is up to you, just know that the relevant properties of the date objects are year and bce. For example, you can loop over list of dates, parse the year, and determine if it's BCE or not.
> gregorian_date <- gregorian::as_gregorian("-10000-1-1")
> gregorian_date$bce
[1] TRUE
> gregorian_date$year
[1] 10001
Notes on 0AD
The gregorian package assumes that when you say Year 0, you really mean year 1 (shown below). I personally think an exception should be thrown, but that's the mapping users need to keep in mind.
> gregorian::as_gregorian("0-1-1")
[1] "Monday January 1, 1 CE"
This is also the case with BCE
> gregorian::as_gregorian("-0-1-1")
[1] "Saturday January 1, 1 BCE"
As @JoshuaUlrich commented, the short answer is no.
However, you can splice out the year into a separate column and then convert to integer. Would this work for you?
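A rough sketch of the integer-year idea for graphing, assuming year-level resolution is enough. Years are plain signed integers (negative for BCE), `era_label` is a hypothetical helper, and 0 appears only as an axis position, not as a claim about the calendar.

```r
# One point per millennium from 10000 BCE to 2000 CE, as signed integers
years <- seq(-10000L, 2000L, by = 1000L)

# Hypothetical helper: turn a signed year into a readable axis label
era_label <- function(y) {
  ifelse(y < 0L,
         sprintf("%d BCE", abs(y)),
         sprintf("%d CE", y))
}

labels <- era_label(years)

# Plot dummy values against the integer years with custom axis labels
if (interactive()) {
  plot(years, rep(1, length(years)), xaxt = "n",
       xlab = "Year", ylab = "Value")
  axis(1, at = years, labels = labels)
}
```

Since no Date class is involved, the known problems with years before 1 CE in `as.Date()` do not arise, at the cost of losing month and day.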
The package lubridate seems to handle "negative" years ok, although it does create a year 0, which from the above comments seems to be inaccurate. Try:
library(lubridate)
start <- -10000
stop <- 2013
myrange <- NULL
for (x in start:stop) {
myrange <- c(myrange,ymd(paste0(x,'-01-01')))
}
