Nested loop not working to gather data from NOAA - r

I'm using the R package rnoaa (along with its required other packages) to gather historical weather data. I wrote this nested loop to gather all the data sets, but I keep getting errors when I run it. It seems to run fine for a second.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("rnoaa", "ropensci")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for (range in 1990:2017) {
  for (bid in lid) {
    bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
    bid.year.data <- data.frame(bid.year$data)
    write.csv(bid.year.data, file = 'cwind/bid_range.csv')
  }
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data set, but it does not substitute the loop variables into the file name; it just names the file bid_range.csv.

This error message shows that there is no data for a given station id in 1990. Because you were using a for loop, once it hits an error, it stops.
Here I introduce the use of the tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "grid" containing all combinations of id and year
The expand function from tidyr creates all combinations of the given values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
  select(id) %>%
  expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data, and make the function suitable for map2
We will use map2 to loop through all the combinations of id and year via its .x and .y arguments, so we change the order of the arguments to create buoy_modify. We also use the safely function to create a safe version of buoy_modify: when it hits an error, it stores the error message and moves on to the next combination rather than breaking.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...) {
  buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign names to the list elements based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data are downloaded into wlevel_data. Each element of wlevel_data has two parts: $result holds the data if the download was successful, otherwise it is NULL; $error is NULL if the download was successful, otherwise it holds the error message.
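For example, here is a minimal sketch of inspecting a single element (the element name "41001_1990" is hypothetical; use any name from names(wlevel_data)):
# Inspect one download's outcome
wlevel_data[["41001_1990"]]$result  # the buoy data if the download succeeded, otherwise NULL
wlevel_data[["41001_1990"]]$error   # NULL if it succeeded, otherwise the error condition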
Step 4: Access the data
transpose turns a list "inside out", so wlevel_data2 has two elements: result and error. We can store these two parts separately and access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get the result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
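As an optional follow-up sketch (not part of the original answer), the names assigned in Step 3 make it easy to list which id/year combinations failed:
# id_year combinations whose download produced an error
failed <- names(wlevel_error)[!map_lgl(wlevel_error, is.null)]
failed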

Related

Create a function with a column name as an argument (with grouping)

I want to write a function that takes a data frame, a grouping variable (a column), and a variable (also a column).
From reading multiple Stack Overflow attempts at cracking this, most recommended passing the arguments as strings.
My initial code to check for normality using the Shapiro-Wilk test for multiple data frames and variables was unsuccessful.
check_normality <- function(d, x_grouping_variable, y_cont_var){
  d %>%
    group_by([[x_grouping_variable]]) %>%
    summarise(`W Statistic` = shapiro.test([[y_cont_var]])$statistic,
              `p-value` = shapiro.test([[y_cont_var]])$p.value)
  return(shapiro.test([[y_cont_var]])$p.value)
}
ERROR:
Error: unexpected '[[' in " return(shapiro.test([["
My attempt to fix it using this code was also unsuccessful.
check_normality <- function(d, x_grouping_variable, y_cont_var){
  d %>%
    group_by(((!! sym(x_grouping_variable)))) %>%
    summarise(`W Statistic` = shapiro.test((!! sym(y_cont_var)))$statistic,
              `p-value` = shapiro.test((!! sym(y_cont_var)))$p.value)
  return(shapiro.test((!! sym(y_cont_var)))$p.value)
}
check_normality(df, "RHF", "duratoin_days")
The error:
Error in !sym(y_cont_var) : invalid argument type
3.stopifnot(is.numeric(x))
2.shapiro.test((!!sym(y_cont_var)))
1.check_normality(df, "RHF", "duratoin_days")
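Here is a minimal sketch of a version that runs (an assumption on my part, not from the original post): !!sym() only works inside data-masking verbs such as group_by() and summarise(), so the stand-alone return() line is dropped and the .data pronoun is used to look up the columns by their string names.
library(dplyr)
check_normality <- function(d, x_grouping_variable, y_cont_var) {
  d %>%
    group_by(.data[[x_grouping_variable]]) %>%
    summarise(`W Statistic` = shapiro.test(.data[[y_cont_var]])$statistic,
              `p-value` = shapiro.test(.data[[y_cont_var]])$p.value)
}
# Hypothetical call mirroring the question:
# check_normality(df, "RHF", "duratoin_days")
The summarised tibble is returned implicitly, one row per group, instead of the single p-value from the original return() line.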

Using an R function to hash values produces a repeating value across rows

I'm using the following query:
let
Source = {1..5},
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), {"Numbers"}, null, ExtraValues.Error),
#"Added Custom" = Table.AddColumn(#"Converted to Table", "Letters", each Character.FromNumber([Numbers] + 64)),
#"Run R script" = R.Execute("# 'dataset' holds the input data for this script#(lf)#(lf)library(""digest"")#(lf)#(lf)dataset$SuffixedLetters <- paste(dataset$Letters, ""_suffix"")#(lf)dataset$HashedLetters <- digest(dataset$Letters, ""md5"", serialize = TRUE)#(lf)output<-dataset",[dataset=#"Added Custom"]),
output = #"Run R script"{[Name="output"]}[Value]
in
output
which leads to the resulting table.
And here is the R script with better formatting:
# 'dataset' holds the input data for this script
library("digest")
dataset$SuffixedLetters <- paste(dataset$Letters, "_suffix")
dataset$HashedLetters <- digest(dataset$Letters, "md5", serialize = TRUE)
output<-dataset
The 'paste' function appears to iterate over rows and resolve on each row with the new input. But the 'digest' function only appears to return the first value in the table across all rows.
I don't know why the behavior of the two functions would seem to operate differently. Can anyone advise how to get the 'HashedLetters' column to resolve using the values from each row instead of just the initial one?
Use:
dataset$HashedLetters <- sapply(dataset$Letters, digest, algo = "md5", serialize = TRUE)
digest works on a whole object at a time, not individual elements of a vector.
vec <- letters[1:3]
digest::digest(vec, algo="md5", serialize=TRUE)
# [1] "38ce1fe9e19a222505e693e8bdd8aeec"
sapply(vec, digest::digest, algo="md5", serialize=TRUE)
# a b c
# "127a2ec00989b9f7faf671ed470be7f8" "ddf100612805359cd81fdc5ce3b9fbba" "6e7a8c1c098e8817e3df3fd1b21149d1"

Creating a for loop in R from a list

I'm trying to create a for loop in R to iterate through a list of genetic variants, labeled with rsID's, and filter the results by patient ID.
ace2_snps <- c("rs4646121", "rs4646127", "rs1996225", "rs2158082", "rs4830974", "rs148271868", "rs113539251", "rs4646135", "rs4646179", "rs2301693", "rs16980031", "rs12689012", "rs4646141", "rs142049267", "rs16979971", "rs12007623", "rs4646182", "rs147214574", "rs6632677", "rs139469582", "rs149000434", "rs148805807", "rs112032651", "rs144314464", "rs147077778", "rs182259051", "rs112621533", "rs35803318", "rs35304868", "rs113848176", "rs145345877", "rs12009805", "rs233570", "rs73635824", "rs73635823", "rs4646142", "rs4646157", "rs2074192", "rs79878075", "rs144239059", "rs67635467", "rs183583165", "rs137910448", "rs116419580", "rs2097723", "rs4646170")
for (snps in ace2_snps) {
  genotype_snps <- as.data.frame(bgen_ACE2$data[snps,,])
  idfromcsv <- read.csv("/Users/keeseyyyyy/Desktop/Walley/pospatid.csv")
  id <- as.character(idfromcsv[[1]])
  filtered_snps <- genotype_snps[id,]
}
I need to run genotype_rs146217251 <- as.data.frame(bgen_ACE2$data["rs146217251",,]) for each rsID, and then I'd like the filtered_snps variable to be labeled with its rsID in place of "snps" in the variable name for each variant.
I'm not very familiar with R syntax. Can anyone give me some tips?
For one variant, the process would go like this:
genotype_rs146217251 <- as.data.frame(bgen_ACE2$data["rs146217251",,])
idfromcsv <- read.csv("/Users/keeseyyyyy/Desktop/Walley/pospatid.csv")
id <- as.character(idfromcsv[[1]])
filtered <- genotype_rs146217251[id,]
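One possible approach, as a hedged sketch (not from the original post; it assumes bgen_ACE2 and the CSV path shown in the question exist): instead of creating a separate variable for each rsID, store one filtered data frame per rsID in a named list.
# Read the patient IDs once, outside the loop
idfromcsv <- read.csv("/Users/keeseyyyyy/Desktop/Walley/pospatid.csv")
id <- as.character(idfromcsv[[1]])

# Build a named list with one filtered data frame per rsID
filtered_list <- lapply(ace2_snps, function(snp) {
  genotype <- as.data.frame(bgen_ACE2$data[snp, , ])
  genotype[id, ]
})
names(filtered_list) <- ace2_snps

# Access a single variant's filtered data, for example:
# filtered_list[["rs4646121"]]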

How to extract columns from a row and save the output as a variable dplyr

I am trying to extract a specific column from a specific row on my excel sheet (df). However, when I try to do so I get the message:
Error: ... must evaluate to column positions or names, not a list
Call `rlang::last_error()` to see a backtrace.
When I call rlang::last_error() I get:
Backtrace:
1. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
9. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
10. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n not { first_type }")
11. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
12. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
At this point, I am lost. What can I do to make my code work?
library(readxl)
Lakers_Overall_Stats <- read_excel("Desktop/Lakers Overall Stats.xlsx")
library(readxl)
Lakers_Record <- read_excel("Desktop/Lakers Record.xlsx")
require(dplyr)
require(ggplot2)
##WinPercentage of the Team after season
mydata <- Lakers_Record %>% select(Pts, Opp, W, L) %>%
  mutate(wpct = Pts^13.91/(Pts^13.91 + Opp^13.91), expwin = round(wpct*(W+L)), diff = W - expwin)
head(mydata)
##Specifying
Lakers_Overall_Stats[23,6] <- TmMin
Lakers_Overall_Stats[23,8] <- TmFga
Lakers_Overall_Stats[23,18] <- TmFta
Lakers_Overall_Stats[23,26] <- TmTov
rlang::last_error()
##Usage Percentage
Usgpct <- Lakers_Overall_Stats %>% select(FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin) %>%
  mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP)
##head(Usgpct)
##filter(rank(desc(Usgpct))==1)
Also, am I filtering correctly? or should it be written as
Usgpct <- Lakers_Overall_Stats %>% select(FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin) %>%
  filter(rank(desc(Usgpct)) == 1) %>%
  mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP)
head(Usgpct)
You have
Lakers_Overall_Stats[23,6] <- TmMin
This will modify the Lakers_Overall_Stats data frame by setting the element at row 23, column 6 (and so on for the other lines) to TmMin, where TmMin is an object outside of your data frame.
Maybe you want:
TmMin <- Lakers_Overall_Stats[23,6]
?
Also, you cannot select TmFga,TmFta,TmTov,TmMin since these variables are not part of your data frame. You can refer to those variables in your mutate equation, but because of the way you've set it up, they're stand-alone variables.
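Putting that together, a minimal sketch (the cell positions come from the question; the usage formula is rewritten as ordinary R arithmetic, and its exact grouping, as well as the new column name usg, are my assumptions):
# Pull the team totals out of row 23 as stand-alone values
# ([[ with two indices returns a single value rather than a 1x1 tibble)
TmMin <- Lakers_Overall_Stats[[23, 6]]
TmFga <- Lakers_Overall_Stats[[23, 8]]
TmFta <- Lakers_Overall_Stats[[23, 18]]
TmTov <- Lakers_Overall_Stats[[23, 26]]

# Select only columns that exist in the data frame and refer to the
# stand-alone team totals inside mutate()
Usgpct <- Lakers_Overall_Stats %>%
  select(FGA, FTA, TOV, MP) %>%
  mutate(usg = 100 * (FGA + 0.44 * FTA + TOV) * TmMin /
           ((TmFga + 0.44 * TmFta + TmTov) * 5 * MP))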

Function to iterate over list, merging results into one data frame

I've completed the first couple of R courses on DataCamp, and in order to build up my skills I've decided to use R to prep for fantasy football this season, so I have begun playing around with the nflscrapR package.
With the nflscrapR package, one can pull game information using the season_games() function, which simply returns a data frame with the gameID, game date, and the home and away team abbreviations.
Example:
games.2012 = season_games(2012)
head(games.2012)
GameID date home away season
1 2012090500 2012-09-05 NYG DAL 2012
2 2012090900 2012-09-09 CHI IND 2012
3 2012090908 2012-09-09 KC ATL 2012
4 2012090907 2012-09-09 CLE PHI 2012
5 2012090906 2012-09-09 NO WAS 2012
6 2012090905 2012-09-09 DET STL 2012
Initially I copied and pasted the original function call and changed the last digit manually for each season, then rbinded all the seasons into one data frame, games.
games.2012 <- season_games(2012)
games.2013 <- season_games(2013)
games.2014 <- season_games(2014)
games.2015 <- season_games(2015)
games = rbind(games.2012, games.2013, games.2014, games.2015)
I'd like to write a function to simplify this process.
My failed attempt:
gameID <- function(years) {
  for (i in years) {
    games[i] = season_games(years[i])
  }
}
With years = list(2012, 2013) for testing purposes, produced the following:
Error in strsplit(headers, "\r\n") : non-character argument
Called from: strsplit(headers, "\r\n")
Thanks in advance!
While @Gregor has an apparent solution, he didn't run it because this wasn't a minimal example. I googled for, found, and tried to use that code, and it doesn't work, at least not in a reasonable amount of time.
On the other hand, I took this code from Vivek Patil's blog.
library(XML)
weeklystats = as.data.frame(matrix(ncol = 14)) # Initializing our empty dataframe
names(weeklystats) = c("Week", "Day", "Date", "Blank",
"Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose",
"YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose",
"Year") # Naming columns
URLpart1 = "http://www.pro-football-reference.com/years/"
URLpart3 = "/games.htm"
#### Our workhorse function ####
getData = function(URLpart1, URLpart3) {
for (i in 2012:2015) {
URL = paste(URLpart1, as.character(i), URLpart3, sep = "")
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]
names(table) = c("Week", "Day", "Date", "Blank", "Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose", "YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose")
table$Year = i # Inserting a value for the year
weeklystats = rbind(table, weeklystats) # Appending happening here
}
return(weeklystats)
}
I posted this because it works, you might learn something about web scraping you didn't know, and it runs in about 11 seconds.
system.time(weeklystats <- getData(URLpart1, URLpart3))
user system elapsed
0.870 0.014 10.926
You should probably take a look at some popular answers for working with lists, specifically How do I make a list of data frames? and What's the difference between [ and [[?.
There's no reason to put your years in a list. They're just integers, so just do a normal vector.
years = 2012:2015
Then we can get your function to work (we'll need to initialize an empty list before the for loop):
gameID <- function(years) {
  games = list()
  for (i in seq_along(years)) {
    games[[i]] = season_games(years[i])
  }
  return(games)
}
Read my link above for why we're using [[ with the list and [ with the vector. And we could run it like this:
game_list = gameID(2012:2015)
But this is such a simple function that it's easier to use lapply. Your function is just a wrapper around a for loop that returns a list, and that's precisely what lapply is too. But where your function has season_games hard-coded in, lapply can work with any function.
game_list = lapply(2012:2015, season_games)
# should be the same result as above
In either case, we have the list of data frames and want to combine it into one big data frame. The base R way is rbind with do.call, but dplyr and data.table have more efficient versions.
# pick your favorite
games = do.call(rbind, args = game_list) # base
games = dplyr::bind_rows(game_list)
games = data.table::rbindlist(game_list)
