R - Select files by dates in filenames


I already had a similar question here:
R - How to choose files by dates in file names?
But I have to do a little change.
I still have a list of filenames, similar to that:
list = c("AT0ACH10000700100dymax.1-1-1993.31-12-2003",
"AT0ILL10000700500dymax.1-1-1990.31-12-2011",
"AT0PIL10000700500dymax.1-1-1992.31-12-2011",
"AT0SON10000700100dymax.1-1-1990.31-12-2011",
"AT0STO10000700100dymax.1-1-1992.31-12-2006",
"AT0VOR10000700500dymax.1-1-1981.31-12-2011",
"AT110020000700100dymax.1-1-1993.31-12-2001",
"AT2HE190000700100dymax.1-1-1973.31-12-1994",
"AT2KA110000700500dymax.1-1-1991.31-12-2010",
"AT2KA410000700500dymax.1-1-1991.31-12-2011")
I already have a command to filter for files that have a certain length of recording (for example, 10 years in this case):
#Listing Files (creates the list above)
files = list.files(pattern="*00007.*dymax", recursive = TRUE)
#Making date readable
split_daymax = strsplit(files, split=".", fixed=TRUE)
from = unlist(lapply(split_daymax, "[[", 2))
to = unlist(lapply(split_daymax, "[[", 3))
from = as.POSIXct(from, format="%d-%m-%Y")
to = as.POSIXct(to, format="%d-%m-%Y")
#units must be passed by name; the third positional argument of difftime is tz
timelistmax = difftime(to, from, units="days")
#Files with more than 10 years of recording
index = timelistmax >= 10*360
filesdaymean = files[index]
My problem now is that I have far too many files, more than my computer can handle.
I only want to read in files that contain data from 1993 (or any other year I choose) onward and have 10 years of recording from then on, so the recordings should run at least until 2003.
So the file 1973-1994 should not be included, but the file 1981-2011 is fine.
I don't know how to select a year in this case.
I am thankful for any help.

library(stringr)
library(lubridate)
fileDates <- str_extract_all(files, "[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}")
find_file <- function(x, whichYear, noYears = 10) {
start <- as.Date(x[[1]], "%d-%m-%Y")
end <- as.Date(x[[2]], "%d-%m-%Y")
years <- as.numeric(end-whichYear, units = "days")/365
years > noYears & (year(start) <= year(whichYear) &
year(end) >= year(whichYear))
}
sapply(fileDates, find_file, whichYear = as.Date("1993-01-01"), noYears = 10)
You have two conditions. First calculate the number of years of recording after 1993, then use boolean logic to check whether 1993 falls within the file's date range.

Using files, to, and from as you've defined them above, this should get you the files that contain at least a ten-year span of data between 1993 and 2003:
library(lubridate)
df <- data.frame(file_name = files, file_start = from, file_end = to)
df_index <- year(df$file_start) <=1993 & year(df$file_end) >= 2003
files_to_load <- df$file_name[df_index]
If a base only solution is desired, turn the POSIXct to POSIXlt and extract the year component as such:
df <- data.frame(file_name = files,
file_start = as.POSIXlt(from),
file_end = as.POSIXlt(to))
df_index <- (df$file_start$year+1900 <=1993 &
df$file_end$year+1900 >= 2003)
files_to_load <- df$file_name[df_index]
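
As a quick check of the year-based selection above, here is a minimal base R sketch run on four of the file names from the question. The names start_year and n_years are assumptions introduced for illustration, and the rule "starts in or before start_year and ends in or after start_year + n_years" is one reading of the requirement, not the only possible one.

```r
# Four sample names copied from the question
files <- c("AT0ACH10000700100dymax.1-1-1993.31-12-2003",
           "AT0VOR10000700500dymax.1-1-1981.31-12-2011",
           "AT110020000700100dymax.1-1-1993.31-12-2001",
           "AT2HE190000700100dymax.1-1-1973.31-12-1994")

# Split on the literal dot: element 2 is the start date, element 3 the end date
parts <- strsplit(files, ".", fixed = TRUE)
from <- as.Date(sapply(parts, `[`, 2), format = "%d-%m-%Y")
to   <- as.Date(sapply(parts, `[`, 3), format = "%d-%m-%Y")

# Hypothetical parameters: first year wanted and required length in years
start_year <- 1993
n_years <- 10

from_year <- as.integer(format(from, "%Y"))
to_year   <- as.integer(format(to, "%Y"))

# Keep files that start no later than start_year and run at least n_years beyond it
keep <- from_year <= start_year & to_year >= start_year + n_years
selected <- files[keep]
print(selected)
```

With these four names, only the 1993-2003 and 1981-2011 files are kept; the 1973-1994 and 1993-2001 files are dropped.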

Related

Iterate import of excel files and averaging matched values by file name in R

I have a folder containing 630 excel files, all with similar file names. Each file represents climate data in specific geographic areas for a month of a specific year. My goal is to find a way to iterate my importing of these files and find the average of values for specific variables. All files are titled as such:
PRISM_ppt_stable_4kmM3_201201_bil
where "ppt" represents climate variable the data is about, "2012" represents the year 2012 and "01" represents the month of January. The next file in the folder is titled:
PRISM_ppt_stable_4kmM3_201202_bil
where "ppt" represents the same variable,"2012" again represents the year 2012 and "02" this time represents the month of February. These repeat for every month of every year and for 7 different variables. The variables are titled:
ppt, vpdmax, vpdmin, tmax, tmin, tdmean, tmean
Each excel file contains >1500 observations of 11 variables, where I am interested in finding the average of the MEAN variable among all matching tl_2016_us values. Some quick sample data is shown below:
tl_2016_us MEAN
14136 135.808
14158 132.435
etc. etc.
It gets tricky in that I only wish to find my averages over a designated winter season, in this case November through March. So all files with 201211, 201212, 201301, 201302 and 201303 in the file name should be matched by tl_2016_us and the corresponding MEAN variables averaged. Ideally, this process would repeat to the next year of 201311, 201312, 201401, 201402, 201403. To this point, I have used
list.files(path = "filepath", pattern ="*ppt*")
to create lists of my filenames for each of the 7 variables.

I don't really get what the "tl_2016_us" variables are/mean.
However, you can easily get the list of only winter months using a bit of regular expressions like so:
library(tidyverse)
# Assuming your files are already in your working directory
all_files <- list.files(full.names = TRUE, pattern = "*ppt*")
# Keep only files whose month code (the two digits before "_bil") is Nov-Mar
winter_mos <- str_subset(all_files, "(01|02|03|11|12)_bil")
After that, you can iterate reading in all files into a data frame with map() from purrr:
library(readxl)
data <- map(winter_mos, ~ read_xlsx(.x)) %>% bind_rows(.id = "id")
After that, you should be able to select the variables you want, use group_by() to group by id (i.e. id of each Excel file), and then summarize_all(mean)
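
To average over one winter at a time, each file also needs a season label, because November and December belong to the same winter as the following January to March. The sketch below is one possible way to do that in base R; the file names follow the pattern from the question, and the variable names (season, winter_by_season) are assumptions made for illustration.

```r
# Example names following the pattern in the question
file_names <- c("PRISM_ppt_stable_4kmM3_201211_bil",
                "PRISM_ppt_stable_4kmM3_201212_bil",
                "PRISM_ppt_stable_4kmM3_201301_bil",
                "PRISM_ppt_stable_4kmM3_201303_bil",
                "PRISM_ppt_stable_4kmM3_201306_bil",
                "PRISM_ppt_stable_4kmM3_201311_bil")

# Pull out the six-digit YYYYMM code that sits just before "_bil"
yyyymm <- sub(".*_([0-9]{6})_bil.*", "\\1", file_names)
yr <- as.integer(substr(yyyymm, 1, 4))
mo <- as.integer(substr(yyyymm, 5, 6))

# Winter months are November through March
is_winter <- mo %in% c(11, 12, 1, 2, 3)

# Label each winter by the year in which it starts:
# Nov/Dec keep their own year, Jan-Mar belong to the previous year's winter
season <- ifelse(mo >= 11, yr, yr - 1L)

# One character vector of file names per winter season
winter_by_season <- split(file_names[is_winter], season[is_winter])
print(winter_by_season)
```

Each element of winter_by_season could then be read and averaged separately, for example with lapply().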

Maybe something like (not very elegant):
library(readxl)
filetypes = c("ppt", "vpdmax", "vpdmin", "tmax", "tmin", "tdmean", "tmean")
data_years = c(2012,2013,2014)
df <- NULL
for (i in 1:length(data_years)) {
  yr <- data_years[i]
  # Winter season: Nov-Dec of yr plus Jan-Mar of the following year
  datecodes <- c(paste(yr,"11",sep=""),
                 paste(yr,"12",sep=""),
                 paste(yr+1,"01",sep=""),
                 paste(yr+1,"02",sep=""),
                 paste(yr+1,"03",sep=""))
  for (j in 1:length(filetypes)) {
    filetype <- filetypes[j]
    file_prefix <- paste("PRISM",filetype,"stable_4kmM3",sep="_")
    for (k in 1:length(datecodes)) {
      datecode <- datecodes[k]
      filename <- paste(file_prefix,datecode,"bil",sep="_")
      dk <- read_excel(filename)
      M <- dim(dk)[1]
      dk$RefYr <- rep(yr,M)
      dk$DataType <- rep(filetype,M)
      # The first file initialises df; later files are appended
      if (is.null(df)) {
        df <- dk
      } else {
        df <- rbind(df,dk)
      }
    }
  }
}
Once that has run, you will have a data frame containing all the data you need to compute your averages (I think).
You could then do something like:
df_new <- NULL
for (i in 1:length(data_years)) {
  yr <- data_years[i]
  di <- df[df$RefYr==yr,]
  for (j in 1:length(filetypes)) {
    filetype <- filetypes[j]
    dj <- di[di$DataType==filetype,]
    tls <- unique(dj$tl_2016_us)
    for (k in 1:length(tls)) {
      tl <- tls[k]
      dk <- dj[dj$tl_2016_us==tl,]
      # Average MEAN over the season for this year, variable and tl_2016_us value
      dijk <- data.frame(RefYr=yr,TL2016=tl,DataType=filetype,
                         SeasonAverage=mean(dk$MEAN))
      # The first row initialises df_new; later rows are appended
      if (is.null(df_new)){
        df_new <- dijk
      } else {
        df_new <- rbind(df_new,dijk)
      }
    }
  }
}
I'm sure there are more elegant ways to do it and that there are some bugs in the above since I couldn't really run the code, but I think you should be left with a data frame containing what you are looking for.
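
One of the more compact alternatives for the averaging step is base R's aggregate(), which groups by several columns at once. This is only a sketch: the data frame below is made-up toy data with the column names used in the loops above (RefYr, DataType, tl_2016_us, MEAN), not real PRISM values.

```r
# Toy stand-in for the combined data frame built by the reading loop
# (all values are invented for illustration)
df <- data.frame(
  RefYr      = c(2012, 2012, 2012, 2012),
  DataType   = "ppt",
  tl_2016_us = c(14136, 14136, 14158, 14158),
  MEAN       = c(100, 110, 130, 134)
)

# One row per season year, variable and tl_2016_us value,
# holding the average of MEAN across the files of that season
season_avg <- aggregate(MEAN ~ RefYr + DataType + tl_2016_us,
                        data = df, FUN = mean)
print(season_avg)
```

The result has the same information as df_new from the nested loops, with the average stored in the MEAN column.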

Include two multiple when scraping data

I wish to scrape weather data from
https://www.wetterzentrale.de/weatherdata.php?station=260&jaar=2019&maand=1&dag=1
from 2019-01-01 until today, but I don't know how to write code that changes jaar=2019 (i.e., year=2019), maand=1 (i.e., month=1) and dag=1 (i.e., day=1) to the desired dates.
I tried to work with lapply as:
library(data.table)
years <- c("2019", "2020")
urls <- rbindlist(lapply(years, function(x) {
url <- paste("https://www.wetterzentrale.de/weatherdata.php?station=260&jaar=", x, "&maand=1&dag=1", sep = "")
data.frame(url)
} ))
However, this only gives the URLs for 2019 and 2020. Is there a way to include months and days?

library(lubridate)
allYourDates <- seq(ymd(20190101), Sys.Date(), by = "days")
urls <- paste("https://www.wetterzentrale.de/weatherdata.php?station=260&jaar=", year(allYourDates)
, "&maand=", month(allYourDates)
, "&dag=", day(allYourDates)
, sep = "")
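
The same idea can be sketched without lubridate, using only base R's seq(), format() and sprintf(). The end date is fixed here so the result is reproducible; in practice Sys.Date() would take its place.

```r
# Three days only, so the output is easy to inspect
dates <- seq(as.Date("2019-01-01"), as.Date("2019-01-03"), by = "day")

# as.integer() drops the leading zeros that format() produces,
# matching the maand=1 / dag=1 style of the original URL
urls <- sprintf(
  "https://www.wetterzentrale.de/weatherdata.php?station=260&jaar=%d&maand=%d&dag=%d",
  as.integer(format(dates, "%Y")),
  as.integer(format(dates, "%m")),
  as.integer(format(dates, "%d"))
)
print(urls)
```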

Changing Dates in R from webscraper but not able to convert

I am trying to complete a problem that pulls from two data sets that need to be combined into one data set. To get to this point, I need to rbind both data sets by the year-month information. Unfortunately, the first data set needs to be tallied by year-month info, and I can't seem to figure out how to change the date so I can have month-year info rather than month-day-year info.
This is data on avalanches, and I need to write code to total the number of avalanches each month for the Snow Season, defined as Dec-Mar. How do I do that?
I keep trying to convert the format of the date to month-year but after I change it with
as.Date(avalancheslc$Date, format="%y-%m")
all the values for Date turn to NA's....help!
# write the webscraper
library(XML)
library(RCurl)
avalanche<-data.frame()
avalanche.url<-"https://utahavalanchecenter.org/observations?page="
all.pages<-0:202
for(page in all.pages){
this.url<-paste(avalanche.url, page, sep="")
this.webpage<-htmlParse(getURL(this.url))
thispage.avalanche<-readHTMLTable(this.webpage, which=1, header=T)
avalanche<-rbind(avalanche,thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
avalancheslc$monthyear<-format(as.Date(avalancheslc$Date),"%Y-%m")
# How can I tally the number of avalanches?
The final output of my dataset should be something like:
date avalanches
2000-1 18
2000-2 4
2000-3 10
2000-12 12
2001-1 52

This should work (I tried it on only 1 page, not all 203). Note the use of the option stringsAsFactors = F in the readHTMLTable function, and the need to add names because 1 column does not automatically get one.
library(XML)
library(RCurl)
library(dplyr)
library(lubridate)  # provides year() and month() used below
avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for(page in all.pages){
  # sep = "" so that no space ends up inside the URL
  this.url <- paste(avalanche.url, page, sep = "")
  this.webpage <- htmlParse(getURL(this.url))
  thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = T,
                                      stringsAsFactors = F)
  names(thispage.avalanche) <- c('Date','Region','Location','Observer')
  avalanche <- rbind(avalanche,thispage.avalanche)
}
avalancheslc <- subset(avalanche, Region == "Salt Lake")
str(avalancheslc)
# Parse the date from its original month/day/year text, then build a year-month label
avalancheslc <- mutate(avalancheslc, Date = as.Date(Date, format = "%m/%d/%Y"),
                       monthyear = paste(year(Date), month(Date), sep = "-"))
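
The format argument of as.Date() describes how the input text is laid out, not the output you want, which is why format="%y-%m" turned every value into NA. Once the dates are parsed, the tally itself can be sketched with table(). The data frame below is made-up toy data, and the month/day/year layout is taken from the format string used in the answer above rather than checked against the live site.

```r
# Toy stand-in for the scraped Salt Lake observations (invented dates)
avalancheslc <- data.frame(
  Date   = c("1/5/2000", "1/20/2000", "2/3/2000", "7/4/2000", "12/25/2000"),
  Region = "Salt Lake",
  stringsAsFactors = FALSE
)

# Parse the text using the layout it actually has
d  <- as.Date(avalancheslc$Date, format = "%m/%d/%Y")
mo <- as.integer(format(d, "%m"))

# Snow season as defined in the question: December through March
in_season <- mo %in% c(12, 1, 2, 3)

# Year-month label without leading zeros, e.g. "2000-1"
monthyear <- paste(format(d, "%Y"), mo, sep = "-")

# Count observations per year-month, keeping only snow-season months
tally <- as.data.frame(table(date = monthyear[in_season]),
                       responseName = "avalanches",
                       stringsAsFactors = FALSE)
print(tally)
```

Note that the labels are text, so the rows come out in alphabetical order ("2000-12" before "2000-2") rather than in calendar order.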

List organization by file name in R

I'm trying to create a list of data separated by month and year (40 years worth). The data currently has the name structure (Year)-(Numeric Month)-(Var).nc. I'd like to get all the data into its appropriate list created below. Not exactly sure how to proceed from here. Any guidance is appreciated.
files_nc <- list.files(pattern = ".nc")
year <- vector("list", length = 40)
month <- vector("list", length = 12)
names(year) <- c(1978:2017)
names(month) <- c("Jan","Feb","Mar","Apr","May","Jun","Jul",
"Aug","Sep","Oct","Nov","Dec")
for (i in 1:40) {
year[[i]] <- month
}

It's not entirely clear what you're asking for, but I believe this should work. I'm assuming you're loading in a list of files, and each file is associated with a year and month.
# Assumes files_nc is a named list of loaded files, with names like "1978-01-Var.nc"
file_names <- names(files_nc)
file_names_split <- strsplit(file_names, "-")
for(i in seq_along(file_names_split)) {
  # First piece is the year, second piece is the numeric month
  y <- which(names(year) == file_names_split[[i]][1])
  m <- as.numeric(file_names_split[[i]][2])
  year[[y]][[m]] <- files_nc[[i]]
}
In general, this method should work. If it works I'd take the time to rewrite the for loop as an apply statement.
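
A loop-free variant could look like the sketch below, which groups plain file names into a year-by-month nested list with split(). The three file names, including the "temp" variable part, are hypothetical examples of the (Year)-(Numeric Month)-(Var).nc pattern described in the question.

```r
# Hypothetical file names following the Year-Month-Var.nc pattern
files_nc <- c("1978-01-temp.nc", "1978-02-temp.nc", "1979-01-temp.nc")

# First piece is the year, second piece the numeric month
parts <- strsplit(files_nc, "-", fixed = TRUE)
yr <- sapply(parts, `[`, 1)
mo <- month.abb[as.integer(sapply(parts, `[`, 2))]

# Outer split by year, inner split by month; the factor levels keep
# all twelve month slots, in calendar order, even when a month has no file
by_year_month <- lapply(
  split(seq_along(files_nc), yr),
  function(idx) split(files_nc[idx], factor(mo[idx], levels = month.abb))
)

print(by_year_month[["1978"]][["Jan"]])
```

Months without a file end up as empty character vectors, so every year has the same twelve-element shape.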

Need help manipulating URL using concatenation to span 2 years of archived data in R

I would like to concatenate the following values in R.
day <- sprintf("%02d", 1:31)
month <- sprintf("%02d", 1:12)
year <- 2015:as.numeric(format(Sys.time(), "%Y"))
I need them to be in the following format 2015/01/01012015 (YYYY/MM/MMDDYYYY) where MMs would have to be equal at all times.
Ultimately I want to attach it to the end on this URL http://brocktonpolice.com/wp-content/uploads/ so I can pass it as an argument to a download function to download the files.
Here is what I have so far
links <- NULL
i <- 1
while (i <= length(year)) {
links[i] <- paste0("http://brocktonpolice.com/wp-content/uploads/",year[i], sep = "/")
i = i + 1
}
I would like it to span the entire year of 2015 and 2016.
For example:
http://brocktonpolice.com/wp-content/uploads/2015/01/01012015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/01/01022015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/01/01032015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/01/01042015.pdf
...
http://brocktonpolice.com/wp-content/uploads/2015/02/02012015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/02/02022015.pdf
http://brocktonpolice.com/wp-content/uploads/2015/02/02032015.pdf
...
etc

Use seq.Date. It's much easier.
prefix <- "http://brocktonpolice.com/wp-content/uploads/"
AllDays <- seq.Date(from = as.Date('2015-01-01'), to = Sys.Date(), by = "day")
links <- paste0(prefix, format(AllDays, '%Y/%m/%m%d%Y'), '.pdf')
print(links)
