List organization by file name in R

I'm trying to create a list of data separated by month and year (40 years' worth). The files are named with the structure (Year)-(Numeric Month)-(Var).nc. I'd like to slot each file into the appropriate place in the nested list created below, but I'm not sure how to proceed from here. Any guidance is appreciated.
files_nc <- list.files(pattern = "\\.nc$")  # escape the dot and anchor at the end
year <- vector("list", length = 40)
month <- vector("list", length = 12)
names(year) <- 1978:2017
names(month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul",
                  "Aug", "Sep", "Oct", "Nov", "Dec")
for (i in 1:40) {
  year[[i]] <- month
}

It's not entirely clear what you're asking for, but I believe this should work. I'm assuming you're loading in a list of files, and each file is associated with a year and month.
# list.files() returns an unnamed character vector, so split the names directly
file_names_split <- strsplit(files_nc, "-")
for (i in seq_along(file_names_split)) {
  y <- which(names(year) == file_names_split[[i]][1])
  m <- as.numeric(file_names_split[[i]][2])
  year[[y]][[m]] <- files_nc[i]
}
In general, this method should work. If it does, I'd take the time to rewrite the for loop as an apply statement, as sketched below.
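For instance, here is a minimal sketch of that rewrite using split() instead of the explicit loop; it assumes every file follows the (Year)-(Numeric Month)-(Var).nc naming convention described above:
parts <- strsplit(files_nc, "-")
yr <- vapply(parts, `[`, character(1), 1)
mo <- as.integer(vapply(parts, `[`, character(1), 2))
# outer split by year, inner split by calendar month (empty months are kept,
# and the factor levels keep them in calendar rather than alphabetical order)
year <- lapply(split(seq_along(files_nc), yr), function(idx)
  split(files_nc[idx], factor(month.abb[mo[idx]], levels = month.abb)))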

Related

Parsing colnames text string as expression in R

I am trying to create a large number of data frames in a for loop using the "assign" function in R. I want to use the colnames function to set the column names in the data frame. The code I am trying to emulate is the following:
county_tmax_min_df <- data.frame(array(NA,c(length(days),67)))
colnames(county_tmax_min_df) <- c('Date',sd_counties$NAME)
county_tmax_min_df$Date <- days
The code I have so far in the loop looks like this:
file_vars = c('file1','file2')
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), "days")
f = 1
for (f in 1:2){
  assign(paste0('county_', file_vars[f]),
         data.frame(array(NA, c(length(days), 67))))
}
I need to be able to set the column names similar to how I did in the above statement. How do I do this? I think it needs to be something like this, but I am unsure what goes in the text portion. The end result I need is just a bunch of data frames. Any help would be wonderful. Thank you.
expression(parse(text = ))
You can set the names within assign, like this:
file_vars = c('file1', 'file2')
days <- seq.Date(from = as.Date("1979-01-01"), to = as.Date("1979-01-02"), by = "days")
for (f in seq_along(file_vars)) {
  assign(x = paste0('county_', file_vars[f]),
         value = {
           df <- data.frame(array(NA, c(length(days), 67)))
           colnames(df) <- paste0("fancy_column_",
                                  sample(LETTERS, size = ncol(df), replace = TRUE))
           df
         })
}
Inside the {} block you can use colnames() or setNames() to assign column names in any manner you like. Your first piece of code refers to an sd_counties object that is not available here, but the same general idea should work for you.
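As an aside, a minimal sketch of an alternative that skips assign() entirely and collects the data frames in one named list (the column names here are placeholders, since sd_counties is not available):
county_dfs <- lapply(setNames(file_vars, paste0("county_", file_vars)), function(v) {
  df <- data.frame(array(NA, c(length(days), 67)))
  colnames(df) <- c("Date", paste0("col_", seq_len(66)))  # placeholders for sd_counties$NAME
  df$Date <- days
  df
})
# access the results as county_dfs$county_file1, county_dfs$county_file2, ...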

Changing variable in a string in a for loop

I have a script for opening meteorological data from a .h5 file and calculating the average windspeed (ugrd).
library(rhdf5)
windv.2014.dec <- h5read("/Users/sethparker/Documents/My_Lab/CR_met/Horizontes_2014DEC.h5", "ugrd")
a <- as.vector(windv.2014.dec)
a[which(a == 0)] = NA_character_
avg_windv.2014.dec <- mean(abs(as.numeric(na.omit(a))))
This works fine, but I have 57 of these files. I am trying to find a way to use a for loop to not have to manually change the date each time I run it. I am mainly concerned with the year changing, I do not mind doing the process 12 times. My failed attempt at a for loop is this:
for (i in 4:9)
{
windv.201i.oct <- h5read("/Users/sethparker/Documents/My_Lab/CR_met/Horizontes_201",i,"OCT.h5", "ugrd")
a <- as.vector(windv.201i.oct)
a[which(a == 0)] = NA_character_
avg_windv.201i.oct <- mean(abs(as.numeric(na.omit(a))))
}
The data is between 2014 and 2019, hence the 4:9. How do I get the variable to work in the file pathway string?
We can use paste or sprintf to create the path. In the OP's loop the output gets overwritten on each iteration, so create an empty list to store the outputs and assign into it:
out <- vector('list', 6)
names(out) <- 4:9
for (i in 4:9) {
  tmp <- h5read(sprintf("/Users/sethparker/Documents/My_Lab/CR_met/Horizontes_201%dOCT.h5", i), "ugrd")
  a <- as.vector(tmp)
  a[a == 0] <- NA  # drop zero readings before averaging
  out[[as.character(i)]] <- mean(abs(a), na.rm = TRUE)
}
names(out) <- sprintf("windv.201%s.oct", names(out))
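If you later want the other months too, here is a sketch of the same idea over both loops; it assumes the month codes in the file names are upper-case three-letter abbreviations like OCT, which only the one example confirms:
months <- toupper(month.abb)  # "JAN", "FEB", ..., assumed naming scheme
out <- list()
for (i in 4:9) {
  for (mo in months) {
    path <- sprintf("/Users/sethparker/Documents/My_Lab/CR_met/Horizontes_201%d%s.h5", i, mo)
    if (!file.exists(path)) next  # skip month/year combinations with no file
    a <- as.vector(h5read(path, "ugrd"))
    a[a == 0] <- NA
    out[[sprintf("windv.201%d.%s", i, tolower(mo))]] <- mean(abs(a), na.rm = TRUE)
  }
}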

Iterate import of excel files and averaging matched values by file name in R

I have a folder containing 630 excel files, all with similar file names. Each file represents climate data in specific geographic areas for a month of a specific year. My goal is to find a way to iterate my importing of these files and find the average of values for specific variables. All files are titled as such:
PRISM_ppt_stable_4kmM3_201201_bil
where "ppt" represents climate variable the data is about, "2012" represents the year 2012 and "01" represents the month of January. The next file in the folder is titled:
PRISM_ppt_stable_4kmM3_201202_bil
where "ppt" represents the same variable,"2012" again represents the year 2012 and "02" this time represents the month of February. These repeat for every month of every year and for 7 different variables. The variables are titled:
ppt, vpdmax, vpdmin, tmax, tmin, tdmean, tmean
Each excel file contains >1500 observations of 11 variables, where I am interested in finding the average of the MEAN variable among all matching tl_2016_us values. Some quick sample data is shown below:
tl_2016_us MEAN
14136 135.808
14158 132.435
etc. etc.
It gets tricky in that I only wish to find my averages over a designated winter season, in this case November through March. So all files with 201211, 201212, 201301, 201302 and 201303 in the file name should be matched by tl_2016_us and the corresponding MEAN variables averaged. Ideally, this process would repeat to the next year of 201311, 201312, 201401, 201402, 201403. To this point, I have used
list.files(path = "filepath", pattern ="*ppt*")
to create lists of my filenames for each of the 7 variables.
I don't really get what the "tl_2016_us" variables are/mean.
However, you can easily get the list of only winter months using a bit of regular expressions like so:
library(tidyverse)
# Assuming your files are already in your working directory
all_files <- list.files(full.names = TRUE, pattern = "ppt")
winter_mos <- str_subset(all_files, "(01|02|03|11|12)_\\w{3}$")
After that, you can iterate reading in all files into a data frame with map() from purrr:
library(readxl)
data <- map(winter_mos, ~ read_xlsx(.x)) %>% bind_rows(.id = "id")
After that, you should be able to select the variables you want, use group_by() to group by id (i.e. id of each Excel file), and then summarize_all(mean)
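A minimal sketch of that last step, assuming the columns are named tl_2016_us and MEAN exactly as in the sample data, and grouping on tl_2016_us so MEAN is averaged across the matched winter files:
winter_avgs <- data %>%
  group_by(tl_2016_us) %>%
  summarize(season_mean = mean(MEAN, na.rm = TRUE))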
Maybe something like (not very elegant):
library(readxl)

filetypes <- c("ppt", "vpdmax", "vpdmin", "tmax", "tmin", "tdmean", "tmean")
data_years <- c(2012, 2013, 2014)
df <- NULL
for (i in 1:length(data_years)) {
  yr <- data_years[i]
  datecodes <- c(paste(yr, "11", sep = ""),
                 paste(yr, "12", sep = ""),
                 paste(yr + 1, "01", sep = ""),
                 paste(yr + 1, "02", sep = ""),
                 paste(yr + 1, "03", sep = ""))
  for (j in 1:length(filetypes)) {
    filetype <- filetypes[j]
    file_prefix <- paste("PRISM", filetype, "stable_4kmM3", sep = "_")
    for (k in 1:length(datecodes)) {
      datecode <- datecodes[k]
      filename <- paste(file_prefix, datecode, "bil", sep = "_")
      dk <- read_excel(filename)
      M <- dim(dk)[1]
      dk$RefYr <- rep(yr, M)
      dk$DataType <- rep(filetype, M)
      if (is.null(df)) {   # first file read: start the data frame
        df <- dk
      } else {
        df <- rbind(df, dk)
      }
    }
  }
}
Once that has run, you will have a data frame containing all the data you need to compute your averages (I think).
You could then do something like:
df_new <- NULL
for (i in 1:length(data_years)) {
  yr <- data_years[i]
  di <- df[df$RefYr == yr, ]
  for (j in 1:length(filetypes)) {
    filetype <- filetypes[j]
    dj <- di[di$DataType == filetype, ]
    tls <- unique(dj$tl_2016_us)
    for (k in 1:length(tls)) {
      tl <- tls[k]
      dk <- dj[dj$tl_2016_us == tl, ]
      dijk <- data.frame(RefYr = yr, TL2016 = tl, DataType = filetype,
                         SeasonAverage = mean(dk$MEAN))
      if (is.null(df_new)) {   # first result: start the output data frame
        df_new <- dijk
      } else {
        df_new <- rbind(df_new, dijk)
      }
    }
  }
}
I'm sure there are more elegant ways to do it, and there may be bugs in the above since I couldn't actually run the code, but you should be left with a data frame containing what you are looking for.
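For what it's worth, a compact base-R sketch of that whole second block, assuming the combined df built above really does have RefYr, DataType, tl_2016_us, and MEAN columns:
# aggregate() computes the mean of MEAN within each year/variable/area group
df_new <- aggregate(MEAN ~ RefYr + DataType + tl_2016_us, data = df, FUN = mean)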

change a for loop to a function to scrape a website

I am trying to scrape a website using the following:
industryurl <- "https://finance.yahoo.com/industries"
library(rvest)
read <- read_html(industryurl) %>%
html_table()
library(plyr)
industries <- ldply(read, data.frame)
industries = industries[-1,]
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
store <- NULL
tbl <- NULL
for(i in links){
store[[i]] = read_html(i)
tbl[[i]] = html_table(store[[i]])
}
#################################################################################################
I am mostly interested in the code between the ########## markers. I want to apply a function instead of a for loop, since I am running into timeout issues with Yahoo, and I want to make the scraping more human-like (it is not too much data).
My question is: how can I take links, apply a function over it, and set a delay timer between reads?
I can paste my own version of the for loop, which does not work.
This is the function I came up with:
## The first argument is the link you need
## The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link ", define_link))
  return(html_table(first))
}

## I added the following tryCatch wrapper so a bad link returns NA
## instead of stopping the loop
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time),
                  error = function(e) NA)
  return(out)
}
## You can now retrieve the data using the links vector in two ways
## Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i) link_try_catch(links[i], 1))
## OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i) extract_function(i, 1))
Hope it helps
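To make the scraping feel less mechanical, as the question asks, one small tweak (a sketch, not tested against Yahoo) is to randomize the pause instead of fixing it at one second:
# runif(1, 2, 5) draws a random delay between 2 and 5 seconds for each link
p3 <- lapply(links[1:10], function(l) link_try_catch(l, runif(1, 2, 5)))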

R code to iterate through dataframe rows for google maps distance queries

I'm looking for some assistance in writing some R code to iterate through rows in a dataframe and pass the values in each row to a function and print the output either to an excel file, txt file or just in the console.
The purpose of this is to automate a bunch of distance/time queries (several hundred) to google maps using the function found at this website: http://www.nfactorialanalytics.com/r-vignette-for-the-week-finding-time-distance-between-two-places/
The function on that website is as follows:
library(XML)
library(RCurl)
distance2Points <- function(origin, destination){
  results <- list()
  xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',
                    origin, '&destinations=', destination, '&mode=driving&sensor=false')
  xmlfile <- xmlParse(getURL(xml.url))
  dist <- xmlValue(xmlChildren(xpathApply(xmlfile, "//distance")[[1]])$value)
  time <- xmlValue(xmlChildren(xpathApply(xmlfile, "//duration")[[1]])$value)
  distance <- as.numeric(sub(" km", "", dist))
  time <- as.numeric(time)/60
  distance <- distance/1000
  results[['time']] <- time
  results[['dist']] <- distance
  return(results)
}
The dataframe will contain two columns: origin postal code and destination postal code (Canada, eh?). I'm a beginner R programmer, so I know how to use read.table to load a txt file into a dataframe. I'm just not sure how iterate through the dataframe, each time passing values to the distance2Points function and executing. I think this can be done using either a for loop or one of the apply calls?
Thanks for the help!
edit:
To keep it simple lets assume I want to transform these two vectors into a dataframe
> a <- c("L5B4P2","L5B4P2")
> b <- c("M5E1E5", "A2N1T3")
> postcodetest <- data.frame(a,b)
> postcodetest
a b
1 L5B4P2 M5E1E5
2 L5B4P2 A2N1T3
How should I go about iterating over these two rows to return both distances and times from the distance2Points function?
Here's one way to do it: use lapply to produce a list with the results for each row in your data, then Reduce(rbind, [yourlist]) to concatenate that list into a data frame whose rows correspond to the ones in your original. To make this work, we also have to tweak the code in the original function to return a one-row data frame, so I've done that here.
distance2Points <- function(origin, destination){
  require(XML)
  require(RCurl)
  xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',
                    origin, '&destinations=', destination, '&mode=driving&sensor=false')
  xmlfile <- xmlParse(getURL(xml.url))
  dist <- xmlValue(xmlChildren(xpathApply(xmlfile, "//distance")[[1]])$value)
  time <- xmlValue(xmlChildren(xpathApply(xmlfile, "//duration")[[1]])$value)
  distance <- as.numeric(sub(" km", "", dist))
  time <- as.numeric(time)/60
  distance <- distance/1000
  # this gives you a one-row data frame instead of a list, b/c it's easy to rbind
  results <- data.frame(time = time, distance = distance)
  return(results)
}
# now apply that function rowwise to your data, using lapply, and roll the results
# into a single data frame using Reduce(rbind)
results <- Reduce(rbind, lapply(seq(nrow(postcodetest)), function(i)
  distance2Points(postcodetest$a[i], postcodetest$b[i])))
Result when applied to your sample data:
> results
time distance
1 27.06667 27.062
2 1797.80000 2369.311
If you would prefer to do this without creating a new object, you could also write separate functions for computing time and distance, or a single function with those outputs as options, and then use sapply or mutate to create new columns in your original data frame. Here's how that might look using sapply:
distance2Points <- function(origin, destination, output){
  require(XML)
  require(RCurl)
  xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',
                    origin, '&destinations=', destination, '&mode=driving&sensor=false')
  xmlfile <- xmlParse(getURL(xml.url))
  if(output == "distance") {
    y <- xmlValue(xmlChildren(xpathApply(xmlfile, "//distance")[[1]])$value)
    y <- as.numeric(sub(" km", "", y))/1000
  } else if(output == "time") {
    y <- xmlValue(xmlChildren(xpathApply(xmlfile, "//duration")[[1]])$value)
    y <- as.numeric(y)/60
  } else {
    y <- NA
  }
  return(y)
}
postcodetest$distance <- sapply(seq(nrow(postcodetest)), function(i)
  distance2Points(postcodetest$a[i], postcodetest$b[i], "distance"))
postcodetest$time <- sapply(seq(nrow(postcodetest)), function(i)
  distance2Points(postcodetest$a[i], postcodetest$b[i], "time"))
And here's how you could do it in a dplyr pipe with mutate:
library(dplyr)
postcodetest <- postcodetest %>%
  mutate(distance = sapply(seq(nrow(postcodetest)), function(i)
           distance2Points(a[i], b[i], "distance")),
         time = sapply(seq(nrow(postcodetest)), function(i)
           distance2Points(a[i], b[i], "time")))
