Changing a variable in a string in a for loop in R

I have a script for opening meteorological data from a .h5 file and calculating the average windspeed (ugrd).
library(rhdf5)
windv.2014.dec <- h5read("/Users/sethparker/Documents/My_Lab/CR_met/Horizontes_2014DEC.h5", "ugrd")
a <- as.vector(windv.2014.dec)
a[which(a == 0)] = NA_character_
avg_windv.2014.dec <- mean(abs(as.numeric(na.omit(a))))
This works fine, but I have 57 of these files. I am trying to find a way to use a for loop so I do not have to manually change the date each time I run it. I am mainly concerned with the year changing; I do not mind doing the process 12 times. My failed attempt at a for loop is this:
for (i in 4:9)
{
  windv.201i.oct <- h5read("/Users/sethparker/Documents/My_Lab/CR_met/Horizontes_201",i,"OCT.h5", "ugrd")
  a <- as.vector(windv.201i.oct)
  a[which(a == 0)] = NA_character_
  avg_windv.201i.oct <- mean(abs(as.numeric(na.omit(a))))
}
The data spans 2014 to 2019, hence the 4:9. How do I get the variable to work in the file path string?

We can use paste or sprintf to create the path. Also, in the OP's loop the output gets overwritten on each iteration, so we create an empty list to store the results and assign each iteration's output to it.
out <- vector('list', 6)
names(out) <- 4:9
for (i in 4:9) {
  # Build the file path for year 201<i> with sprintf, then read the dataset
  tmp <- h5read(sprintf("/Users/sethparker/Documents/My_Lab/CR_met/Horizontes_201%dOCT.h5", i), "ugrd")
  a <- as.vector(tmp)
  a[which(a == 0)] <- NA_character_   # treat zeros as missing
  out[[as.character(i)]] <- mean(abs(as.numeric(na.omit(a))))
}
names(out) <- sprintf("windv.201%s.oct", names(out))
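For reference, the same path can be built with paste0 instead of sprintf; a minimal equivalent of the h5read() line inside the loop:
# Equivalent path construction with paste0
path <- paste0("/Users/sethparker/Documents/My_Lab/CR_met/Horizontes_201", i, "OCT.h5")
tmp <- h5read(path, "ugrd")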

Turn index-based loop into name-based function

I have at my disposal a clean dataframe (1500r x 297c, named 'Data' - very inspiring) with both numeric and factor columns. However, as is often the case, my factors were encoded as numbers (each number representing a level), hence a dataframe full of numeric vectors.
To overcome this I also have a second dataframe (VarLabels) containing information about the columns of the first dataframe (which has... 297 rows, as you would imagine). In there, one specific column (VarLabels$TypeVar) helps me define what the data class should be in the main dataframe.
I wrote the following piece of code, which might not be optimal but proved to work so far:
(NB: as you can see, for data labelled 'MIX' I wish to create a copy to have one numeric and one factor)
nbcol <- ncol(Data)
indexcol <- which(colnames(VarLabels) == "TypeVar")
for (i in 1:nbcol) {
  if (colnames(Data)[[i]] %in% VarLabels$VarName) {
    if (VarLabels[i, indexcol] == "Quant") {
      Data[[i]] <- as.numeric(Data[[i]])
    } else if (VarLabels[i, indexcol] == "Qual") {
      Data[[i]] <- as.character(Data[[i]])
      Data[[i]] <- as.factor(Data[[i]])
    } else if (VarLabels[i, indexcol] == "Mix") {
      Data <- cbind(Data, Data[[i]])
      Data[[i]] <- as.character(Data[[i]])
      Data[[i]] <- as.factor(Data[[i]])
      Data[[ncol(Data)]] <- as.numeric(Data[[ncol(Data)]])
      colnames(Data)[[ncol(Data)]] <- paste(colnames(Data)[[i]], "Num", sep = "_")
    } else {
      Data[[i]] <- as.numeric(Data[[i]])
    }
  } else {
  }
}
Do you have a neater solution, possibly using a function to reduce the number of lines of code, or using names instead of column indices (which may be risky if the order changes in one of the two dataframes)? I recently got into R and am still struggling with user-defined functions.
I read other related topics like:
Change all columns from factor to numeric in R
Function to change class of columns in R to match the class of an other dataset
Convert type of multiple columns of a dataframe at once
How do I get the classes of all columns in a data frame?
but could not apply the answers to my own problem. Any idea how to make things simple? (if possible!)
The following function does what the question asks for.
It matches the column names of the input data set X against the new column types with a sequence of which/match calls, so no explicit loop over columns is needed; the coercion itself is done with lapply.
The test data set is the built-in data set mtcars.
coerceCols <- function(X, VarLabels) {
  # Columns flagged "Qual" become factors
  i <- which(VarLabels$TypeVar == "Qual")
  j <- match(VarLabels$VarName[i], names(X))
  X[j] <- lapply(X[j], factor)

  # Columns flagged "Mix": keep a numeric copy with a "_Num" suffix,
  # then turn the original column into a factor
  i <- which(VarLabels$TypeVar == "Mix")
  j <- match(VarLabels$VarName[i], names(X))
  tmp <- X[j]
  names(tmp) <- paste(names(tmp), "Num", sep = "_")
  X[j] <- lapply(X[j], factor)
  cbind(X, tmp)
}
Data <- mtcars
VarLabels <- data.frame(VarName = names(mtcars),
TypeVar = c("Quant", "Mix", "Quant",
"Quant", "Quant", "Quant",
"Quant", "Qual", "Qual",
"Mix", "Mix"),
stringsAsFactors = FALSE)
coerceCols(Data, VarLabels)
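To verify the result, inspect the classes of the returned columns, e.g.:
# Quick check that the columns now have the intended classes
sapply(coerceCols(Data, VarLabels), class)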

tryCatch in a for loop - continue to next (R dataRetrieval)

I have a list containing the following site id numbers:
sitelist <- c("02074500", "02077200", "208111310", "02081500", "02082950")
I want to use the dataRetrieval package to collect additional information about these sites and save it into individual .csv files. Site number "208111310" does not exist, so it returns an error and stops the code.
I want the code to ignore site numbers that do not return data and continue to the next number in sitelist.
I've tried tryCatch in several ways but can't get the correct syntax. Here is my for loop without tryCatch.
for (i in sitelist){
  test_gage <- readNWISdv(siteNumbers = i,
                          parameterCd = pCode)
  df <- test_gage
  df <- subset(df, select = c(site_no, Date, X_00060_00003))
  names(df)[3] <- "flow in m3/s"
  df$Year <- as.character(year(df$Date))
  write.csv(df, paste0("./gage_flow/", i, ".csv"), row.names = FALSE)
  rm(list = setdiff(ls(), c("sitelist", "pCode")))
}
You can wrap the readNWISdv() call in tryCatch() and use the error argument to specify what should happen when an error occurs; here the handler returns NULL, so the loop can detect the failure and skip to the next site.
for (i in sitelist){
  # Try to download the data; on error, return NULL instead of stopping the loop
  test_gage <- tryCatch(
    readNWISdv(siteNumbers = i, parameterCd = pCode),
    error = function(e) NULL
  )
  if (is.null(test_gage)) next   # skip site numbers that return no data
  df <- test_gage
  df <- subset(df, select = c(site_no, Date, X_00060_00003))
  names(df)[3] <- "flow in m3/s"
  df$Year <- as.character(year(df$Date))
  write.csv(df, paste0("./gage_flow/", i, ".csv"), row.names = FALSE)
  rm(list = setdiff(ls(), c("sitelist", "pCode")))
}
If you also want to catch warnings, add a warning handler as another argument to tryCatch.
tryCatch(expr, error = function(e) { ... }, warning = function(w) { ... })
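For example, applied to the readNWISdv() call above (a sketch only; whether a missing site triggers an error or a warning can depend on the dataRetrieval version):
test_gage <- tryCatch(
  readNWISdv(siteNumbers = i, parameterCd = pCode),
  error   = function(e) { message("error for site ", i); NULL },
  warning = function(w) { message("warning for site ", i); NULL }
)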

List organization by file name in R

I'm trying to create a list of data separated by month and year (40 years worth). The data currently has the name structure (Year)-(Numeric Month)-(Var).nc. I'd like to get all the data into its appropriate list created below. Not exactly sure how to proceed from here. Any guidance is appreciated.
files_nc <- list.files(pattern = ".nc")
year <- vector("list", length = 40)
month <- vector("list", length = 12)
names(year) <- c(1978:2017)
names(month) <- c("Jan","Feb","Mar","Apr","May","Jun","Jul",
"Aug","Sep","Oct","Nov","Dec")
for (i in 1:40) {
year[[i]] <- month
}
It's not entirely clear what you're asking for, but I believe this should work. I'm assuming you're loading in a list of files, and each file is associated with a year and month.
file_names <- as.list(files_nc)   # list.files() returns an unnamed character vector, so use the names themselves
file_names_split <- lapply(file_names, function(x) strsplit(x, "-")[[1]])
for (i in seq_along(file_names_split)) {
  y <- which(names(year) == file_names_split[[i]][1])   # year part of the file name
  m <- as.numeric(file_names_split[[i]][2])             # numeric month part
  year[[y]][[m]] <- files_nc[i]
}
In general, this method should work. If it does, I'd take the time to rewrite the for loop with one of the apply functions; a rough sketch of that idea follows.
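A rough sketch of that rewrite, assuming every file name matches the (Year)-(Numeric Month)-(Var).nc pattern with no extra hyphens in the Var part:
# Split each file name into its year / month / remainder parts
parts <- do.call(rbind, strsplit(files_nc, "-"))
# Group file indices by year, then drop each file into its month slot
by_year <- split(seq_along(files_nc), parts[, 1])
year[names(by_year)] <- lapply(by_year, function(idx) {
  months <- month                          # the empty 12-slot template from above
  months[as.numeric(parts[idx, 2])] <- files_nc[idx]
  months
})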

How to build subset query using a loop in R?

I'm trying to subset a big table across a number of columns, so all the rows where State_2009, State_2010, State_2011 etc. do not equal the value "Unknown."
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for(i in 2009:2016){
mysubset <- subset(mysubset, paste("State_",i," != Unknown",sep=""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?
Using dplyr with the filter_ function should get you the correct output (note that filter_ is deprecated in current dplyr releases in favour of filter, but it still works):
library(dplyr)
mysubset <- data
for (i in 2009:2016) {
  mysubset <- mysubset %>%
    filter_(paste("State_", i, " != \"Unknown\"", sep = ""))
}
To add to Matt's answer, you could also do it like this:
cols <- paste0("State_", 2009:2016)
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
# Guard against the case where no "Unknown" values are found at all
if (length(inds) > 0) mysubset <- mysubset[-unique(inds), ]
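A slightly more compact variant of the same base R idea (a sketch, assuming those columns contain no NAs):
# Keep only the rows with no "Unknown" in any of the State_ columns
keep <- rowSums(mysubset[, cols] == "Unknown") == 0
mysubset <- mysubset[keep, ]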

Looping over an API with a varying structure using R

I am using Graphite (http://graphite.wikidot.com/) to log performance statistics for various services, which we can access via an API. Each service has a few different metrics, and each metric has a few different statistics. To loop over all of them to grab the stats we want, I've written 3 nested for loops as shown below to create the necessary URL. And then it gets worse. We just introduced another level to this hierarchy because there can be more than one of each service, so they each need a unique ID. Before making this even messier, I am convinced there must be an easier way, but Googling hasn't turned up anything. Any ideas on the best way to approach it?
dir.current <- getwd()
dir.create(file.path(dir.current, "All Data"), showWarnings = FALSE)
dir.create(file.path(dir.current, "Charts"), showWarnings = FALSE)
# Set the grab parameters
graphite.ip <- "192.168.0.16:8080"
from <- list(hour="18", min="00", year="2013", month="09", day="18")
until <- list(hour="10", min="50", year="2013", month="09", day="19")
test.name <- "multinode"
# Builds the ugly parts of the URL.
graphite.ip <- paste("http://", graphite.ip, "/render?", sep="")
from <- paste("from=", from$hour, "%3A", from$min, "_", from$year, from$month, from$day, sep="")
until <- paste("&until=", until$hour, "%3A", until$min, "_", until$year, until$month, until$day, sep="")
test.name <- paste("&target=", test.name, sep="")
# A few variables for common statistics used.
stats.few <- c("count", "m1_rate", "m5_rate", "m15_rate", "mean_rate")
stats.many <- c("count", "m1_rate", "mean", "mean_rate", "p95", "stddev")
stats.memory <- c("total.used")
# Specify which metrics to grab for which services
engine.stats <- list("event-timer"=stats.many, "memory"=stats.memory)
journaler.stats <- list("journaler-rate"=stats.few, "memory"=stats.memory)
notification.stats <- list("notification-rate"=stats.few, "memory"=stats.memory, "reaction-tenant-one-PT4-time"=stats.many)
eventsin.stats <- list("Incoming"=stats.few, "memory"=stats.memory)
broker.stats <- list("memory"=stats.memory, "events"=stats.few)
# Specify which services you're interested in (should be above as well)
services <- list("engine"=engine.stats, "notification"=notification.stats, "rest"=eventsin.stats, "broker"=broker.stats)
merge.count <- 1
# Loops over everything above to grab the CSVs
for (service in names(services)) {
for (metric in names(services[[service]])) {
for (stat in services[[service]][[metric]]) {
target <- paste(test.name, service, metric, stat, sep=".")
data.name <- paste(service, metric, stat, sep=".")
print(data.name) # Visual indicator
# Download the graphs
url.png <- paste(graphite.ip, from, until, target, "&width=800&height=600", "&format=png", sep="")
setwd(file.path(dir.current, "Charts"))
download.file(url.png, paste(data.name, ".png", sep=""), quiet=TRUE)
# Download, clean and merge CSVs
url.csv <- paste(graphite.ip, from, until, target, "&format=csv", sep="")
data <- read.csv(url.csv, col.names = c("Data Name", "Date", data.name), header=FALSE)
data[1] <- NULL # Cleans up the data
# If a column has integers larger than 2^31, rewrite the data in millions.
if (sapply(data[2], max, na.rm=TRUE) >= 2^31) {
data[2] = data[2]/10^6
}
if (merge.count == 1) {
data.merged <- data
merge.count = merge.count + 1
} else {
data.merged = cbind(data.merged, data[2])
}
csv.name <- paste(service, metric, stat, "csv", sep=".")
setwd(file.path(dir.current, "All Data"))
write.csv(data, csv.name, row.names=FALSE)
}
}
}
setwd(file.path(dir.current))
write.csv(data.merged, "MergedData.csv", row.names=FALSE)
# Print summary of all statistics
# print(summary(data.merged))
# Print a mean and sd of all the columns
print("Column Means:")
print(colMeans(data.merged[,-1], na.rm=TRUE))
print("Column Standard Deviations:")
print(sapply(data.merged[,-1], sd, na.rm=TRUE))
print("Download and merging complete.")
Wildcards! The Graphite URL API supports wildcards in target expressions, which let you query whole branches of the metric tree at once.
If I have the following metrics:
stats.A.A
stats.A.B
stats.A.C
stats.B.A.1
stats.B.A.2
stats.B.A.3
stats.C.B.C.D.1
stats.C.B.C.D.2
stats.C.B.C.D.3
stats.C.B.C.D.4
Then group(stats.*.*,stats.*.*.*,stats.*.*.*.*) will resolve into all of them. Another interesting function is groupByNode.
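As an illustration (hypothetical, reusing the URL pieces already built in the question), a single wildcard target could cover every metric and stat under one service in a single request:
# Hypothetical: one render call for every metric and stat under "engine";
# each row of the returned CSV carries its series name in the first column
target <- paste(test.name, "engine", "*", "*", sep = ".")   # &target=multinode.engine.*.*
url.csv <- paste(graphite.ip, from, until, target, "&format=csv", sep = "")
wide <- read.csv(url.csv, header = FALSE, col.names = c("Data Name", "Date", "Value"))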
I think an issue with this is that it's a big loop that keeps cbind()ing data. A better approach would be to write a function that contains all the code from the inner loop and takes service, metric, and stat as parameters. Let's call this function "process.stat". It returns data, or whatever you wanted to cbind; a rough sketch of such a function is given right below.
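A hypothetical sketch of process.stat, lifted almost verbatim from the body of the question's inner loop (it assumes graphite.ip, from, until, test.name and dir.current have already been built as in the question; file.path() is used instead of setwd() so the working directory is left untouched):
process.stat <- function(service, metric, stat) {
  target <- paste(test.name, service, metric, stat, sep = ".")
  data.name <- paste(service, metric, stat, sep = ".")
  print(data.name)   # visual indicator

  # Download the chart for this target
  url.png <- paste(graphite.ip, from, until, target, "&width=800&height=600", "&format=png", sep = "")
  download.file(url.png, file.path(dir.current, "Charts", paste(data.name, "png", sep = ".")), quiet = TRUE)

  # Download and clean the CSV
  url.csv <- paste(graphite.ip, from, until, target, "&format=csv", sep = "")
  data <- read.csv(url.csv, col.names = c("Data Name", "Date", data.name), header = FALSE)
  data[1] <- NULL
  # If a column has values larger than 2^31, rewrite the data in millions (as in the original loop)
  if (max(data[[2]], na.rm = TRUE) >= 2^31) data[2] <- data[2] / 10^6

  write.csv(data, file.path(dir.current, "All Data", paste(data.name, "csv", sep = ".")), row.names = FALSE)
  data
}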
First, you need to extract the service/metric/stat tuples:
# One column (service)
mat1 <- data.frame(service=names(services))
# List (one entry per service name) of service/metric pairs
list1 <- apply(df1, 1, function(service) expand.grid(service=service, metric=names(services[[service]])))
# Two columns (service and metric)
mat2 <- do.call(rbind, list1)
# List (one entry per service/metric pair) of service/metric/stat tuples
list2 <- apply(df2, 1, function(x) expand.grid(service=x[1], metric=x[2], stat=services[[x[1]]][[x[2]]]))
# Three columns (service, metric, and stat)
tuples <- do.call(rbind, list2)
Then you would use something from the apply family to call process.stat on every combination of service/metric/stat that you want handled:
# Call process.stat once for every service/metric/stat tuple (Map keeps the results in a list)
results <- Map(process.stat, tuples$service, tuples$metric, tuples$stat)
# Keep the Date column from the first result and cbind only the value columns of the rest,
# mirroring the merge logic of the original loop
data.merged <- do.call(cbind, c(results[1], lapply(results[-1], `[`, 2)))
