Changing Dates in R from webscraper but not able to convert - r

I am trying to complete a problem that pulls from two data sets that need to be combined into one data set. To get to this point, I need to rbind both data sets by the year-month information. Unfortunately, the first data set needs to be tallied by year-month info, and I can't seem to figure out how to change the date so I can have month-year info rather than month-day-year info.
This is data on avalanches and I need to write code to tally the number of avalanches each month for the Snow Season, defined as Dec-Mar. How do I do that?
I keep trying to convert the format of the date to month-year but after I change it with
as.Date(avalancheslc$Date, format="%y-%m")
all the values for Date turn to NA's....help!
# write the webscraper
library(XML)
library(RCurl)
avalanche<-data.frame()
avalanche.url<-"https://utahavalanchecenter.org/observations?page="
all.pages<-0:202
for(page in all.pages){
this.url<-paste(avalanche.url, page, sep=" ")
this.webpage<-htmlParse(getURL(this.url))
thispage.avalanche<-readHTMLTable(this.webpage, which=1, header=T)
avalanche<-rbind(avalanche,thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
avalancheslc$monthyear<-format(as.Date(avalancheslc$Date),"%Y-%m")
# How can I tally the number of avalanches?
The final output of my dataset should be something like:
date avalanches
2000-1 18
2000-2 4
2000-3 10
2000-12 12
2001-1 52

This should work (I tried it on only 1 page, not all 203). Note the use of the option stringsAsFactors = F in the readHTMLTable function, and the need to add names because 1 column does not automatically get one.
library(XML)
library(RCurl)
library(dplyr)
library(lubridate) # provides year() and month(), used in the mutate() call below
avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for(page in all.pages){
this.url <- paste0(avalanche.url, page) # paste0 adds no separator, so the URL contains no space
this.webpage <- htmlParse(getURL(this.url))
thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = T,
stringsAsFactors = F)
names(thispage.avalanche) <- c('Date','Region','Location','Observer')
avalanche <- rbind(avalanche,thispage.avalanche)
}
avalancheslc <- subset(avalanche, Region == "Salt Lake")
str(avalancheslc)
avalancheslc <- mutate(avalancheslc, Date = as.Date(Date, format = "%m/%d/%Y"),
monthyear = paste(year(Date), month(Date), sep = "-"))
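The question also asks how to tally avalanches per month for the Dec-Mar snow season, which the code above does not do. The NAs in the question most likely come from the format string: the scraped dates are month/day/year strings, and a Date always needs a day, so format="%y-%m" cannot parse them. A minimal base R sketch of the tally follows, assuming the dates have the "%m/%d/%Y" layout used above. The sample rows are made up for illustration and stand in for the scraped table.

```r
# Made-up sample rows standing in for the scraped avalanche table
avalancheslc <- data.frame(
  Date = c("12/05/2000", "12/20/2000", "01/03/2001",
           "01/15/2001", "04/02/2001", "02/10/2001"),
  Region = "Salt Lake",
  stringsAsFactors = FALSE
)

# Parse the full date first; the year-month label is derived afterwards
avalancheslc$Date <- as.Date(avalancheslc$Date, format = "%m/%d/%Y")

# Keep only the snow season months (December through March)
month_num <- as.integer(format(avalancheslc$Date, "%m"))
season <- avalancheslc[month_num %in% c(12, 1, 2, 3), ]

# Label each row with its year-month and count the rows per label
season$monthyear <- format(season$Date, "%Y-%m")
tally <- as.data.frame(table(date = season$monthyear),
                       responseName = "avalanches",
                       stringsAsFactors = FALSE)
print(tally)
```

The result has one row per year-month with a date column and an avalanches column, which is the shape needed before combining it with the second data set.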


R: extract dates and numbers from PDF

I'm really struggling to extract the proper information (some dates and numbers, to be specific) from several thousand PDF files from the NTSB. These PDFs don't need OCR, and the reports are almost identical in length and layout.
I need to extract the date and time of the accident (first page) and some other information, such as the pilot's age or flight experience. What I tried does the job for several files but does not work for every file, since the code I am using is poorly written.
# an example with a single file
library(pdftools)
library(readr)
# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(),"/example.pdf")
download.file(file, destfile)
pdf <- pdf_text(destfile)
rows <-scan(textConnection(pdf),
what="character", sep = "\n")
# Extract the date of the accident based on the 'Date & Time' occurrence.
date <-rows[grep(pattern = 'Date & Time', x = rows, ignore.case = T, value = F)]
date <- strsplit(date, " ")
date[[1]][9] #this method is not desirable since the date will not be always in that position
# Pilot age
age <- rows[grep(pattern = 'Age', x = rows, ignore.case = F, value = F)]
age <- strsplit(age, split = ' ')
age <- age[[1]][length(age[[1]])] # again, I'm using the exact position in that list
age <- readr::parse_number(age) #
The main issue is extracting the date and time of the accident. Is it possible to extract that exact piece of information without relying on a fixed position in a list, as I did here?
I think the best approach to achieve what you want is to use regex.
In this case I use the stringr library. The main idea with regex is to find
the desired string pattern, which in this case is the date 'July 29, 2014, 11:15'.
Take into account that you'll have to check the date format for each PDF file.
library(pdftools)
library(readr)
library(stringr)
# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(), "/example.pdf")
download.file(file, destfile)
pdf <- pdf_text(destfile)
## New code
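# Regex pattern for date 'July 29, 2014, 11:15'
# [Tt] matches either case; inside a character class "|" is a literal pipe, not "or"
regex_pattern <- "[Tt]ime\\:(.*\\d{2}\\:\\d{2})"
# Getting date from page 1
grouped_matched <- str_match_all(pdf[1], regex_pattern)
# This returns a list with one matrix per input string: column 1 is the full match, column 2 is the capture group
raw_date <- grouped_matched[[1]][1, 2] # First match (row 1), capture group (column 2)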
# Clean date
date <- trimws(raw_date)
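# The same steps written with the pipe (%>%), which is available once dplyr is loaded
library(dplyr)
date <- pdf[1] %>%
str_match_all(regex_pattern) %>%
.[[1]] %>% # First list element (matrix of matches)
.[1, 2] %>% # First match, capture group
trimws() # Remove extra white spaces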
You can wrap this in a function that takes the regex pattern as an argument, so that it can be adapted to files with a different date format.
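A minimal sketch of such a function in base R, so it runs without downloading a PDF. The function name extract_accident_datetime and the sample text are made up for illustration, and the pattern assumes the time is written as one or two digits, a colon, and two digits; real reports may need a different pattern.

```r
# Hypothetical helper: returns the text between "Time:" and the first hh:mm,
# or NA when the pattern is not found
extract_accident_datetime <- function(page_text,
                                      pattern = "[Tt]ime:\\s*(.*?\\d{1,2}:\\d{2})") {
  # regexec() returns the full match plus each capture group
  m <- regmatches(page_text, regexec(pattern, page_text, perl = TRUE))[[1]]
  if (length(m) < 2) {
    return(NA_character_)
  }
  trimws(m[2])
}

# Made-up text imitating the first page of a report
sample_page <- "Location: Somewhere\nDate & Time:   July 29, 2014, 11:15 Local\nAircraft: Example"

accident_datetime <- extract_accident_datetime(sample_page)
print(accident_datetime)
```

Returning NA instead of failing makes it easier to run the function over thousands of files with sapply() and inspect afterwards which reports did not match.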
Regards

Applying a custom function to multiple files and creating unique csv output in R

I am a beginner user in R and have been compiling a code to create a custom function to execute a specific task on some data that I possess. The custom function is structured to identify missing data in a csv file and patch this using the mean value. Thereafter, I want to summarize the data by year and month and export this as a csv file. I have multiple csv files that are sitting in a folder and would like to perform this task on each of these files. Thus far, I am able to get the code to perform the task at hand but don't know how to write a unique output for each csv file that has been processed and save these to a new folder. I would also like to retain the original file name in the processed output but have the words "_processed" appended to it. Additionally, any suggestions on how this code can be improved are most welcome. Thanks in advance.
# Load all packages required by the script
library(tidyverse) # data science package
library(lubridate) # work with dates
library(dplyr) # data manipulation (filter, summarize, mutate)
library(ggplot2) # graphics
library(gridExtra) # tile several plots next to each other
library(scales)
# Set the working directory #
setwd("H:/Shaeden_Post_Doc/Genus_Exchange/GEE_Data/MODIS_Product_Data_Raw/Cold_Temperate_Moist")
#create a function to summarize data by year and month
#patch missing values using the average
summarize_by_month = function(df){
# counting unique, missing and mean values in the ET column
df %>% summarise(n = n_distinct(ET),
na = sum(is.na(ET)),
med = mean(ET, na.rm = TRUE))
# assign mean values to the missing data and modify the dataframe
df = df %>%
mutate(ET = replace(ET,is.na(ET),mean(ET, na.rm = TRUE)))
df
#separate data into year, month and day
df$date = as.Date(df$date,format="%Y/%m/%d")
#summarize by year and month
df %>%
mutate(year = format(date, "%Y"), month = format(date, "%m")) %>%
group_by(year, month) %>%
summarise(mean_monthly = mean(ET))
}
#import all files and execute custom function for each
file_list = list.files(pattern="AET", full.names=TRUE)
file_list
my_AET_files = lapply(file_list, read_csv)
monthly_AET = lapply(my_AET_files, summarize_by_month)
monthly_AET
A link to the sample datasets is provided below
https://drive.google.com/drive/folders/1pLHt-vT87lxzW2We-AS1PwVcne3ALP2d?usp=sharing
You can read, manipulate data and write the csv in the same function :
library(dplyr)
summarize_by_month = function(file) {
df <- readr::read_csv(file)
# assign mean values to the missing data and modify the dataframe
df = df %>% mutate(ET = replace(ET,is.na(ET),mean(ET, na.rm = TRUE)))
#separate data into year, month and day
df$date = as.Date(df$date,format="%Y/%m/%d")
#summarize by year and month
new_df <- df %>%
mutate(year = format(date, "%Y"), month = format(date, "%m")) %>%
group_by(year, month) %>%
summarise(mean_monthly = mean(ET))
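# write.csv() does not create directories, so make sure the output folder exists
dir.create('output_folder', showWarnings = FALSE)
write.csv(new_df, sprintf('output_folder/%s_processed.csv',
tools::file_path_sans_ext(basename(file))), row.names = FALSE)
new_df # return the summary so that lapply() collects it
}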
monthly_AET = lapply(file_list, summarize_by_month)
library(stringr) # str_sub()
library(readr) # write_excel_csv()
path <- "your_preferred_path/" # set the path where you want to save the files
x <- list.files(pattern = "your_pattern") # vector of your file names
name <- str_sub(x, start = xL, end = yL) # xL and yL delimit the part of each name you want to keep
for (i in seq_along(monthly_AET)) {
write_excel_csv(monthly_AET[[i]], paste0(path, name[i], "_processed.csv")) # [[i]] extracts the i-th data frame; paste0 builds a custom name from variables and static strings
}
Note: this is only a sketch and may have to be tweaked to suit your needs.
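For a version that can be run end to end, here is a base R sketch that reads each file, patches missing ET values with the mean, summarises by year and month, and writes one "_processed" CSV per input into a separate folder. The folder names, the file name AET_site1.csv, and the sample values are made up for illustration; the column names date and ET follow the question.

```r
# Temporary input and output folders (hypothetical names)
in_dir  <- file.path(tempdir(), "aet_in")
out_dir <- file.path(tempdir(), "aet_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Made-up input file with one missing ET value
write.csv(data.frame(date = c("2019/01/01", "2019/01/02", "2019/02/01"),
                     ET = c(1, NA, 3)),
          file.path(in_dir, "AET_site1.csv"), row.names = FALSE)

process_file <- function(file, out_dir) {
  df <- read.csv(file, stringsAsFactors = FALSE)
  # Patch missing values with the mean of the observed values
  df$ET[is.na(df$ET)] <- mean(df$ET, na.rm = TRUE)
  df$date  <- as.Date(df$date, format = "%Y/%m/%d")
  df$year  <- format(df$date, "%Y")
  df$month <- format(df$date, "%m")
  # Mean ET per year and month
  monthly <- aggregate(ET ~ year + month, data = df, FUN = mean)
  names(monthly)[names(monthly) == "ET"] <- "mean_monthly"
  # Keep the original file name and append "_processed"
  out_file <- file.path(out_dir,
                        paste0(tools::file_path_sans_ext(basename(file)),
                               "_processed.csv"))
  write.csv(monthly, out_file, row.names = FALSE)
  out_file
}

file_list <- list.files(in_dir, pattern = "AET", full.names = TRUE)
written <- vapply(file_list, process_file, character(1), out_dir = out_dir)
```

Returning the output path from the function gives a simple record of which files were written.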

apply inside apply function?

I have a data frame with the start and end of each month of 2019.
I need to fetch data from an API and write one CSV file per month, named mydf plus the month (e.g. mydf-01.csv, mydf-02.csv, etc.).
For each month I need to fetch the data, write the CSV, free the memory to avoid a "not enough memory" error, and continue with the next month.
This is what I have so far, but it fails with "not enough memory" because the expected data for all of 2019 is around 3 GB.
I was thinking of writing a for loop, but maybe I could use another apply-family function?
Months: my_dates data.frame
This is how it looks:
from to
2019-01-01 2019-01-31
2019-02-01 2019-02-28
2019-03-01 2019-03-31
...
Code to generate the 12 months:
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
month_ranges <- function(from, to) {
s <- seq(som(from), as.Date(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
my_dates <- month_ranges(som("2019-01-01"), eom("2019-12-31"))
Code to fetch data:
Currently it fetches all months, holds them in memory, and rbinds them together at the end.
However, this approach fails when the range of months is too large, because the data exceeds 2 GB.
So I'd like it to save each month's data to a CSV and then continue to the next month.
library(googleAuthR)
library(googleAnalyticsR)
my_fetch <- function(ga_id, d1, d2) {
google_analytics(ga_id,
date_range = c(d1, d2),
metrics = c("totalEvents"),
dimensions = c("ga:date", "ga:eventCategory", "ga:eventAction", "ga:eventLabel"),
anti_sample = TRUE,
anti_sample_batches = 1,
rows_per_call = 400)
}
my_fetches_fetches <- mapply(my_fetch, myviewID, my_dates$from, my_dates$to, SIMPLIFY = FALSE)
total <- do.call(rbind, my_fetches_fetches)
UPDATE 1:
Would it be possible to skip an iteration that raises an error (such as an API timeout) and continue with the next month?
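One way to approach this, sketched under assumptions: a plain for loop that fetches one month, writes it to its own CSV, drops the object, and uses tryCatch() to skip a month whose fetch fails. The function fake_fetch below is a made-up stand-in for my_fetch() so the example runs without Google Analytics credentials; it deliberately fails for February to simulate a timeout.

```r
# Hypothetical stand-in for my_fetch(); the real one would call google_analytics()
fake_fetch <- function(d1, d2) {
  if (format(d1, "%m") == "02") stop("simulated API timeout")
  data.frame(date = seq(d1, d2, by = "day"), totalEvents = 1L)
}

my_dates <- data.frame(
  from = as.Date(c("2019-01-01", "2019-02-01", "2019-03-01")),
  to   = as.Date(c("2019-01-31", "2019-02-28", "2019-03-31"))
)

out_dir <- file.path(tempdir(), "monthly_csv")
dir.create(out_dir, showWarnings = FALSE)

failed <- character(0)
for (i in seq_len(nrow(my_dates))) {
  month_tag <- format(my_dates$from[i], "%m")
  # On error, report it and return NULL instead of stopping the loop
  res <- tryCatch(
    fake_fetch(my_dates$from[i], my_dates$to[i]),
    error = function(e) {
      message("Month ", month_tag, " failed: ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(res)) {
    failed <- c(failed, month_tag)
    next
  }
  write.csv(res, file.path(out_dir, paste0("mydf-", month_tag, ".csv")),
            row.names = FALSE)
  # Drop the month's data before fetching the next one
  rm(res)
  invisible(gc())
}
```

Only one month is held in memory at a time, and the failed vector records which months need to be retried.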

How do I fix my XTS (timeseries) to correlate to my CSV file?

I keep a "calendar" in an Excel spreadsheet that reminds me of things that need to be done, and I send it to myself by email (MailR). There are two columns: one is the date and the other is the task. The format of the spreadsheet looks like this:
date(dd/mm/yyyy) | Task
This is the code that I've written so far:
library(zoo)
library(xts)
library(lubridate)
library(xtable)
setwd('Documents')
getwd()
data <- read.csv(file = 'email_calendar.csv', header = TRUE, sep = ',')
## NA values for blanks to match x(nrows)
data2 <- read.csv(file = 'email_calendar.csv', header = TRUE, sep = ',', na.strings = c('','NA'))
xtsdata <- xts(data2, order.by = as.Date(data2$Date),"%d/%m/%y")
body = print(xtable(data,caption = '30 Day Tasks'),type='html',caption.placement='top')
However, the output looks like this:
Date (XTS) Date Task
0001-01-20 1/01/20 Road Trip
0001-10-19 1/10/19 Buy Groceries
0001-11-19 1/11/19 Meeting with manager
0002-01-20 2/01/20 NA
0002-03-20 2/03/20 Dentists
0002-10-19 2/10/19 NA
0002-12-19 2/12/19 Go to the Gym
0003-01-20 3/01/20 Buy Groceries
0003-02-20 3/02/20 Pick up kids
I cannot sort by the date and when I send the email to myself as a test, the dates are completely jumbled. I cannot figure out how to align the two date types together.
What I would like in the final version is the date ordered and to only show the next 30 days with no historical dates in the email.
Can anyone shed some light, thanks?
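One likely cause, judging from the output: in the xts() call the "%d/%m/%y" string is passed to xts() rather than to as.Date(), so as.Date() falls back to its default year/month/day parsing and reads "1/01/20" as year 1, January 20. A minimal base R sketch of the intended steps follows, assuming two-digit years as in the output above. The rows are copied from that output, and the fixed value of today is an assumption made only so the example is reproducible.

```r
# A few rows copied from the output shown in the question
data2 <- data.frame(
  Date = c("1/01/20", "1/10/19", "1/11/19", "2/03/20", "2/12/19"),
  Task = c("Road Trip", "Buy Groceries", "Meeting with manager",
           "Dentists", "Go to the Gym"),
  stringsAsFactors = FALSE
)

# The format string belongs inside as.Date()
data2$Date <- as.Date(data2$Date, format = "%d/%m/%y")

# Sort chronologically
data2 <- data2[order(data2$Date), ]

# Keep only tasks due within the next 30 days
today <- as.Date("2019-11-15") # fixed for the example; use Sys.Date() in practice
upcoming <- data2[data2$Date >= today & data2$Date <= today + 30, ]
print(upcoming)
```

The upcoming data frame can then be passed to xtable() in place of data when building the email body.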

Sub setting times out of time series in R

I downloaded stock market data from Yahoo (code below) - for context, at first I tried with getSymbols(^DJI) but I got error messages possibly related to Yahoo... different issue.
The point is that once downloaded, and imported into R, I massaged it into a format close enough to a time series to be able to run chartSeries(DJI):
require(RCurl)
require(foreign)
require(quantmod) # provides chartSeries() and chartTheme()
x <- getURL("https://raw.githubusercontent.com/RInterested/datasets/gh-pages/%5EDJI.csv")
DJI <- read.csv(text = x, sep =",")
DJI$Date <- as.Date(DJI$Date, format = "%m/%d/%Y") # Formatting Date as.Date
rownames(DJI) <- DJI$Date # Assigning Date to row names
DJI$Date <- NULL # Removing the Date column
chartSeries(DJI, type="auto", theme=chartTheme('white'))
even if the dataset is not really a time series:
> is.ts(DJI)
[1] FALSE
The problem comes about when I try to find out the date of, for instance, the minimum closing value of the Dow. I can do something like
> DJI[DJI$Close == min(DJI$Close),]
Open High Low Close Adj.Close Volume
1985-05-01 1257.18 1262.81 1239.07 1242.05 1242.05 10050000
yielding the entire row, including the row name (1985-05-01), which is the only part I want. However, if I insist on just getting the actual date, I have to juggle a second dataset containing the dates in one of the columns:
require(RCurl)
require(foreign)
x <- getURL("https://raw.githubusercontent.com/RInterested/datasets/gh-pages/%5EDJI.csv")
DJI <- read.csv(text = x, sep =",")
DJI$Date <- as.Date(DJI$Date, format = "%m/%d/%Y") # Formatting Date as.Date
rownames(DJI) <- DJI$Date # Assigning Date to row names
DJI.raw <- DJI # Second dataset for future subsetting
DJI$Date <- NULL # Removing the Date column
which does allow me to run
> DJI.raw$Date[DJI.raw$Close == min(DJI.raw$Close)]
[1] "1985-05-01"
Further, I don't think that turning the dataset into an .xts file would help.
I'm not sure exactly what you want, but it sounds like you just need the date. You say you don't think xts would help, but converting to xts does work:
time(as.xts(DJI))[which.min(DJI$Close)] # POSIXct format
# [1] "1985-05-01 EDT"
Otherwise a simple rownames + which.min would get the date for you?
as.Date(rownames(DJI)[which.min(DJI$Close)]) # Date format
# [1] "1985-05-01"
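The row-name approach can be tried without downloading the data. In this minimal sketch the minimum row (1985-05-01, close 1242.05) comes from the question, and the other two rows are made up for illustration.

```r
# Small stand-in for the DJI data frame, with dates stored as row names
DJI <- data.frame(
  Close = c(1300.50, 1242.05, 1280.10),
  row.names = c("1985-04-30", "1985-05-01", "1985-05-02")
)

# which.min() gives the position of the first minimum; use it to index the row names
min_date <- as.Date(rownames(DJI)[which.min(DJI$Close)])
print(min_date)

# If ties matter, a logical comparison returns every date sharing the minimum
all_min_dates <- as.Date(rownames(DJI)[DJI$Close == min(DJI$Close)])
```

This avoids keeping a second copy of the data just to look up the date.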
