Using grepl in a string replace function in R Programming - r

Trying to create function using grepl to replace a word in a string.
Airport_ID<-c("3001","3002","3003","3004")
Airport_Name<-c("Adelaide Airport GOODFIND","GOODFIND DTS Land Airport Land ADTS",
"Washington DTS INC GOODFINDAirport DTSUpdated",
"DALLAS Airport TDS GOODFIND")
TF_Data<-data.frame(Airport_ID,Airport_Name)
Created the below function
STR_Manip_F_M_L_V1 <- function(data=NULL,by_text1="GOODFIND",by_text2="Updated") {
if(!require(glue)) {library(glue) }
TFD <- TF_Data %>%
filter(grepl(glue("^{by_text1} "),Airport_Name) |
grepl(glue(" \\({by_text1}\\) "),Airport_Name) |
grepl(glue(" \\({by_text1}\\$"),Airport_Name) )
TFD$Airport_Name <- str_replace(TFD$Airport_Name = glue("^{by_text1} "),replacement = glue("^{by_text2} ") )
return(TFD)
}
Error :
Error: object 'TFD' not found.

Several problems here:
You define data in the function arguments but use TF_Data inside the function. This is not causing the error, but it is a logical mistake that makes reproducibility and troubleshooting more difficult.
Your use of str_replace is not quite right, don't "name" the arguments as you've done here. (Or ... perhaps just change your = to a comma ...).
Since you're using dplyr earlier, I think it's good to stay in that mode ... try this:
library(dplyr)   # filter, mutate, %>%
library(stringr) # str_replace
library(glue)    # glue
STR_Manip_F_M_L_V1 <- function(data=NULL,by_text1="GOODFIND",by_text2="Updated") {
out <- data %>%
filter(grepl(glue("^{by_text1} "),Airport_Name) |
grepl(glue(" \\({by_text1}\\) "),Airport_Name) |
grepl(glue(" \\({by_text1}\\$"),Airport_Name) ) %>%
mutate(Airport_Name = str_replace(Airport_Name, paste0("^", by_text1), by_text2))
out
}
STR_Manip_F_M_L_V1(TF_Data)
# Airport_ID Airport_Name
# 1 3002 Updated DTS Land Airport Land ADTS
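For comparison, here is a minimal base R sketch of the same idea, assuming you only care about the case where the word starts the string. The helper name replace_leading is made up for this illustration, and it ignores the two parenthesised patterns from the original filter.

```r
# Sample data from the question
Airport_ID <- c("3001", "3002", "3003", "3004")
Airport_Name <- c("Adelaide Airport GOODFIND",
                  "GOODFIND DTS Land Airport Land ADTS",
                  "Washington DTS INC GOODFINDAirport DTSUpdated",
                  "DALLAS Airport TDS GOODFIND")
TF_Data <- data.frame(Airport_ID, Airport_Name, stringsAsFactors = FALSE)

# Hypothetical helper: keep rows whose name starts with by_text1,
# then swap that leading word for by_text2
replace_leading <- function(data, by_text1 = "GOODFIND", by_text2 = "Updated") {
  pattern <- paste0("^", by_text1, " ")
  keep <- grepl(pattern, data$Airport_Name)
  out <- data[keep, , drop = FALSE]
  out$Airport_Name <- sub(pattern, paste0(by_text2, " "), out$Airport_Name)
  out
}

res <- replace_leading(TF_Data)
res
```

Note that the function uses its data argument throughout, so it works on any data frame with an Airport_Name column.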

Related

Having problem with ggmap's mapdist() function

I have this code. I have my google API set up already, registered as well in R, Distance Matrix API has been initiated as well in the Google Cloud console.
Here is the dataframe I have, random 25 postal codes FROM and TO postal codes.
Dataset_test = data.frame(
FROM_POSTAL = c("V8A 0E5","T4G 6M4","V1N 8X3",
"C1B 5G1","R5H 2L4","H9S 8L4","L8E 4Y0","H2Y 7N6",
"K1B 7C0","G4A 5B0","E4P 3T2","E4V 5P4","H3J 1R5",
"G0B 4J7","E7A 6E7","E5B 2Y9","S4H 1T8","A2V 4G5",
"V8L 2A9","T9E 1M5","A5A 5M2","E4T 5B4","S2V 6C4",
"S9H 5P8","B1Y 0V0"),
TO_POSTAL = c("G0J 0B8","N0H 9N4","J9B 4Y4",
"L3Z 2Y7","E8K 4R4","B4P 7X9","S4H 2M0","A1Y 0B8",
"A1W 1E9","P9N 7X1","E4R 4B0","N0P 0M8","E1W 9Y7",
"T9W 8E2","G6X 4S9","A0E 0V4","J5X 7N8","N4N 8A1",
"V9K 0B9","L4G 3H7","E1W 0T2","G5R 9G3","L7C 9S2",
"E8P 2X6","E2A 2M1")
)
Here is the simple script I have to try to calculate the distance between the two postal codes by driving using Google's Distance Matrix API.
Driving_Distance = mapdist(from = Dataset_test[["FROM_POSTAL"]], to = Dataset_test[["TO_POSTAL"]], mode = c("driving")) %>% distinct()
When I run this, it throws an error in the Driving_Distance - says
Error: Argument 1 is a list, must contain atomic vectors
Your Canadian postal codes work with the code below, which calls google_distance() from the googleway package rather than mapdist().
The number of addresses used here were shortened for the sake of brevity.
A tibble was used instead of a dataframe so that the variables were character data types rather than factor data types. The actual Google API key that was used has been replaced with some text.
This was a good mapping question. The working code and output below:
library(ggmap)
library(plyr)
library(googleway)
library(tidyverse)
df = tibble(
FROM_POSTAL = c("V8A 0E5","T4G 6M4","V1N 8X3",
"C1B 5G1","R5H 2L4","H9S 8L4"),
TO_POSTAL = c("G0J 0B8","N0H 9N4","J9B 4Y4",
"L3Z 2Y7","E8K 4R4","B4P 7X9"))
dd <- apply(df, 1, function(x){
google_distance(origins = list(x[["FROM_POSTAL"]]),
destinations = list(x[["TO_POSTAL"]]),
key="My_secret_key")
})
dd
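The factor-versus-character point can be checked without calling the API. This is only a sketch with two of the postal codes. It assumes the original data frame was built where strings became factors, which was the data.frame() default before R 4.0.

```r
# With stringsAsFactors = TRUE the column is a factor
df_factor <- data.frame(FROM_POSTAL = c("V8A 0E5", "T4G 6M4"),
                        stringsAsFactors = TRUE)
class(df_factor$FROM_POSTAL)

# With stringsAsFactors = FALSE the column stays character
df_chr <- data.frame(FROM_POSTAL = c("V8A 0E5", "T4G 6M4"),
                     stringsAsFactors = FALSE)
class(df_chr$FROM_POSTAL)

# An existing factor column can be converted before passing it on
from_chr <- as.character(df_factor$FROM_POSTAL)
from_chr
```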

R - Dynamically Switch Between Variables in a Data Frame?

I have a data frame that has a collection of many zip codes and "city,state" in the US. So for instance it might read (note that the actual data frame has like 25000 observations):
zip.codes = c(33603, 33701, 32835)
place.names = c("Tampa, FL", "Saint Petersburg, FL", "Orlando, FL")
df = data.frame(zip.codes, place.names)
I am using this as part of a Shiny App. I'd like the user to be able to select a zip code, but to only have it return the place name. So if the user inputs 33603, then the code would return "Tampa, FL".
I've tried researching the switch function to try to do this, but that seems to only accept hard-coded lists (which is kind of a nonstarter with so many entries in the DF). Is there a way to set up the switch function using variables? Otherwise, does anybody have another idea that might work?
First create vectors place.names and zip.codes and then try any of the following alternatives:
place.names <- as.character(df$place.names)
zip.codes <- df$zip.codes
1) match
place.names[match(33603, zip.codes)]
## [1] "Tampa, FL"
2) logical condition
place.names[33603 == zip.codes]
## [1] "Tampa, FL"
3) lookup name
v <- setNames(place.names, zip.codes)
unname(v[as.character(33603)])
## [1] "Tampa, FL"
4) switch
L <- setNames(as.list(place.names), zip.codes)
do.call("switch", c(as.character(33603), L))
## [1] "Tampa, FL"
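For a Shiny app it may help to wrap the match approach in a small function. This is a sketch only, and the name lookup_place is made up for this example. A zip code that is not in the table comes back as NA, which the app can test for.

```r
zip.codes <- c(33603, 33701, 32835)
place.names <- c("Tampa, FL", "Saint Petersburg, FL", "Orlando, FL")

# Hypothetical helper: return the place name(s) for one or more zip codes
lookup_place <- function(zip, zips, places) {
  places[match(zip, zips)]
}

one <- lookup_place(33603, zip.codes, place.names)
many <- lookup_place(c(32835, 33603), zip.codes, place.names)
missing_zip <- lookup_place(99999, zip.codes, place.names)

one
many
missing_zip
```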

Nestled Loop not Working to gather data from NOAA

I'm using the R package rnoaa (along with its other required packages) to gather historical weather data. I wrote this nested loop to gather all the data sets but I keep getting errors when I run it. It seems to run fine for a second.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("rnoaa", "ropensci")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
for(bid in lid){
bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
bid.year.data <- data.frame(bid.year$data)
write.csv(bid.year.data, file='cwind/bid_range.csv')
}
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data-set but it does not apply the for in the file name it just names it bid_range.csv.
This error message shows that there is no data for a given station id in 1990. Because you are using a for loop, it stops as soon as it hits an error.
Here I introduce the use of tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "Grid" containing all combination of id and year
The expand function from tidyr can create the combination of different values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
select(id) %>%
expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data.
Also make this function suitable for the map2 function
We will use map2 to loop through all combinations of id and year, which it passes through its .x and .y arguments, so we reorder the arguments of buoy to create buoy_modify. We also use the safely function to create a safe version of buoy_modify. Now when it meets an error, it stores the error message and moves on to the next combination rather than stopping.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign name for the element in the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the downloaded data are stored in wlevel_data. Each element in wlevel_data has two parts. $result holds the data if the download was successful; otherwise it is NULL. $error is NULL if the download was successful; otherwise it holds the error message.
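To see this result/error structure without hitting the NOAA servers, here is a small sketch using a made-up function, fake_download, that fails for odd years. It is only a stand-in for buoy_modify.

```r
library(purrr)

# Hypothetical stand-in for a download function: errors for odd years
fake_download <- function(id, year) {
  if (year %% 2 == 1) stop("no data for ", id, " in ", year)
  data.frame(id = id, year = year, value = 1)
}
safe_download <- safely(fake_download)

ids <- c("a", "a", "b")
years <- c(1990, 1991, 1992)

# The loop carries on past the failing combination
res <- map2(ids, years, safe_download)
names(res) <- paste(ids, years, sep = "_")

res$a_1990$result                   # a one-row data frame
res$a_1991$result                   # NULL, because the call failed
conditionMessage(res$a_1991$error)  # the stored error message
```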
Step 4: Access the data
transpose can turn a list "inside out". So now wlevel_data2 has two elements: result and error. We can store these two and access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get the result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
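The question also mentions that every file was saved as bid_range.csv. One way around that is to build the file name from the list names. The sketch below uses a toy named list of data frames and a temporary folder. With the real results you would probably write the $data part of each element, as the question's own loop tries to.

```r
# Toy stand-in for a named list of successful downloads
results <- list(
  a_1990 = data.frame(x = 1:2),
  b_1992 = data.frame(x = 3:4)
)

# Write into a temporary folder (an assumption made for this example)
out_dir <- file.path(tempdir(), "wlevel")
dir.create(out_dir, showWarnings = FALSE)

# One CSV per element, named after the id_year label
for (nm in names(results)) {
  write.csv(results[[nm]],
            file = file.path(out_dir, paste0(nm, ".csv")),
            row.names = FALSE)
}

list.files(out_dir)
```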

Function to iterate over list, merging results into one data frame

I've completed the first couple R courses on DataCamp and in order to build up my skills I've decided to use R to prep for fantasy football this season, thus I have began playing around with the nflscrapR package.
With the nflscrapR package, one can pull Game Information using the season_games() function which simply returns a data frame with the gameID, game date, the home and away team abbreviations.
Example:
games.2012 = season_games(2012)
head(games.2012)
GameID date home away season
1 2012090500 2012-09-05 NYG DAL 2012
2 2012090900 2012-09-09 CHI IND 2012
3 2012090908 2012-09-09 KC ATL 2012
4 2012090907 2012-09-09 CLE PHI 2012
5 2012090906 2012-09-09 NO WAS 2012
6 2012090905 2012-09-09 DET STL 2012
Initially I copy and pasted the original function and changed the last digit manually for each season, then rbinded all the seasons into one data frame, games.
games.2012 <- season_games(2012)
games.2013 <- season_games(2013)
games.2014 <- season_games(2014)
games.2015 <- season_games(2015)
games = rbind(games.2012,games.2013,games.2014,games.2015)
I'd like to write a function to simplify this process.
My failed attempt:
gameID <- function(years) {
for (i in years) {
games[i] = season_games(years[i])
}
}
With years = list(2012, 2013) for testing purposes, produced the following:
Error in strsplit(headers, "\r\n") : non-character argument Called
from: strsplit(headers, "\r\n")
Thanks in advance!
While @Gregor has an apparent solution, he didn't run it because this wasn't a minimal example. I googled, found, and tried to use this code, and it doesn't finish, at least not in a reasonable amount of time.
On the other hand, I took this code from Vivek Patil's blog.
library(XML)
weeklystats = as.data.frame(matrix(ncol = 14)) # Initializing our empty dataframe
names(weeklystats) = c("Week", "Day", "Date", "Blank",
"Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose",
"YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose",
"Year") # Naming columns
URLpart1 = "http://www.pro-football-reference.com/years/"
URLpart3 = "/games.htm"
#### Our workhorse function ####
getData = function(URLpart1, URLpart3) {
for (i in 2012:2015) {
URL = paste(URLpart1, as.character(i), URLpart3, sep = "")
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]
names(table) = c("Week", "Day", "Date", "Blank", "Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose", "YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose")
table$Year = i # Inserting a value for the year
weeklystats = rbind(table, weeklystats) # Appending happening here
}
return(weeklystats)
}
I posted this because it works, you might learn something about web scraping you didn't know, and it runs in 11 seconds.
system.time(weeklystats <- getData(URLpart1, URLpart3))
user system elapsed
0.870 0.014 10.926
You should probably take a look at some popular answers for working with lists, specifically How do I make a list of data frames? and What's the difference between [ and [[?.
There's no reason to put your years in a list. They're just integers, so just do a normal vector.
years = 2012:2015
Then we can get your function to work (we'll need to initialize an empty list before the for loop):
gameID <- function(years) {
games = list()
# loop over positions in years, not over the year values themselves
for (i in seq_along(years)) {
games[[i]] = season_games(years[i])
}
return(games)
}
Read my link above for why we're using [[ with the list and [ with the vector. And we could run it like this:
game_list = gameID(2012:2015)
But this is such a simple function that it's easier to use lapply. Your function is just a wrapper around a for loop that returns a list, and that's precisely what lapply is too. But where your function has season_games hard-coded in, lapply can work with any function.
game_list = lapply(2012:2015, season_games)
# should be the same result as above
In either case, we have the list of data frames and want to combine it into one big data frame. The base R way is rbind with do.call, but dplyr and data.table have more efficient versions.
# pick your favorite
games = do.call(rbind, args = game_list) # base
games = dplyr::bind_rows(game_list)
games = data.table::rbindlist(game_list)
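If you want to try the pattern without downloading anything, here is a sketch with a made-up stand-in for season_games. The function fake_season_games and its two columns are invented for this illustration.

```r
# Hypothetical stand-in for season_games(): two games per season
fake_season_games <- function(year) {
  data.frame(GameID = paste0(year, c("090500", "090900")),
             season = year,
             stringsAsFactors = FALSE)
}

years <- 2012:2015

# One data frame per season, collected in a list
game_list <- lapply(years, fake_season_games)

# Stack them into a single data frame (base R version)
games <- do.call(rbind, game_list)

nrow(games)
unique(games$season)
```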

Creating a function by taking few arguments and calculating

I'm still working on a question from couple of days ago and would like to receive feedback/support on how I could create a function. Your expertise is highly appreciated.
I have created the following:
##### 1)
> raceIDs
[1] "GER" "SUI" "NZ2" "US1" "US2" "POR" "FRA" "AUS" "NZ1" "SWE"
##### 2)
#For each "raceIDs", there is a csv file which I have made a loop to read and created a list of data frames (assigned to the symbol "boatList")
#For example, if I select "NZ1" the output is:
> head(boatList[[9]]) #Only selected the first six lines as there is more than 30000 rows
Boat Date Secs LocalTime SOG
1 NZ1 01:09:2013 38150.0 10:35:49.997 22.17
2 NZ1 01:09:2013 38150.2 10:35:50.197 22.19
3 NZ1 01:09:2013 38150.4 10:35:50.397 22.02
4 NZ1 01:09:2013 38150.6 10:35:50.597 21.90
5 NZ1 01:09:2013 38150.8 10:35:50.797 21.84
6 NZ1 01:09:2013 38151.0 10:35:50.997 21.95
##### 3)
# A matrix showing the race times for each raceIDs
> raceTimes
start finish
GER "11:10:02" "11:35:05"
SUI "11:10:02" "11:35:22"
NZ2 "11:10:02" "11:34:12"
US1 "11:10:01" "11:33:29"
US2 "11:10:01" "11:36:05"
POR "11:10:02" "11:34:31"
FRA "11:10:02" "11:34:45"
AUS "11:10:03" "11:36:48"
NZ1 "11:10:01" "11:35:16"
SWE "11:10:03" "11:35:08"
What I need to do is I need to calculate the average speed (SOG) of a boat "while it was racing" (between start and finish times) by creating a function called meanRaceSpeed and having three arguments:
What I have tried so far is to create a function with 3 arguments (with a bit of help from experts here):
meanRaceSpeed <- function(raceIDs, boatList, raceTimes)
{
#Probably need to compare times, and thought it might be useful to convert character values into `DateTime` values but not to sure how to use it
#DateTime <- as.POSIXct(paste(boatList$Date, boatList$Time), format="%Y%m%d %H%M%S")
#To get the times for each boat
start_time <- raceTimes$start[rownames(raceTimes) = raceIDs]
finish_time <- raceTimes$finish[rownames(raceTimes) = raceIDs]
start_LocalTime <- min(grep(start_time, boatList$LocalTime))
finish_LocalTime <- max(grep(finish_time, boatList$LocalTime))
#which `SOG`s contain all the `LocalTimes` between start and finish
#take their `mean`
mean(boatList$SOG[start_LocalTime : finish_LocalTime])
}
### Obviously, my code does not work :( and I don't know where.
So basically, I need to create a function with three arguments and the expected result is:
#e.g For NZ1
> meanRaceSpeed("NZ1", boatList, raceTimes)
[1] 18.32 #Mean speed for NZ1 between 11:10:01 - 11:35:16
#e.g for US1
> meanRaceSpeed("US1", boatList, raceTimes)
[1] 17.23 #Mean speed for US1 between 11:10:01 - 11:33:29
Any helps where I could have gone wrong? Highly appreciate your help please.
I'm going to give some general advice for R, but I will also help you with your specific question. Whenever I have a problem in R, I usually find that it helps to make things more explicit.
If the function isn't working with these methods (is that a data frame or a matrix in your function?), try a different table manipulation method. How?
Here's a few different things you can do to test your function, and a few suggestions that may move you along a bit. (I don't want to fix the whole thing for you, since it's your homework, but rather get you on your way.)
1) Why not try using a loop instead of brackets?
start_time <- raceTimes$start[rownames(raceTimes) = raceIDs]
Make that into a for loop. It's not too hard to do.
2) Debug your functions. There are a lot of tools for this built into R and in add-on packages. Since you likely don't have time for that with your homework, I'd suggest this instead: take apart the function and run each part of it with the variables you want. Are they of the right length? Are they the right data type? Are they getting the right answer before you put them all together? Make sure of that.
3) If all else fails, don't be afraid if the function and code is not elegant. R is not always an elegant language. (Actually, it's rarely an elegant language.) Especially when you're a beginner, your code will likely be ugly. Just make sure it works.
Since I already had experience with your data, I sat down to make a complete example.
First, data that look like yours:
raceIDs <- c("GER", "SUI", "NZ2", "US1", "US2", "POR", "FRA", "AUS", "NZ1", "SWE")
raceTimes <- as.matrix(read.table(text = ' start finish
GER "11:10:02" "11:35:05"
SUI "11:10:02" "11:35:22"
NZ2 "11:10:02" "11:34:12"
US1 "11:10:01" "11:33:29"
US2 "11:10:01" "11:36:05"
POR "11:10:02" "11:34:31"
FRA "11:10:02" "11:34:45"
AUS "11:10:03" "11:36:48"
NZ1 "11:10:01" "11:35:16"
SWE "11:10:03" "11:35:08"', header = T))
#turn matrix to data.frame or, else, `$` won't work
raceTimes <- as.data.frame(raceTimes, stringsAsFactors = F)
blDF <- data.frame(Boat = rep(raceIDs, 3),
LocalTime = c(raceTimes$start, rep("11:20:25", length(raceIDs)), raceTimes$finish),
SOG = runif(3 * length(raceIDs), 15, 25), stringsAsFactors = F)
boatList <- split(blDF, blDF$Boat)
#remove `names` to create them from scratch
names(boatList) <- NULL
Then:
#create `names` by searching each element of
#`boatList` of what `boat` it contains
names(boatList) <- unlist(lapply(boatList, function(x) unique(x$Boat)))
#the function
meanRaceSpeed <- function(ID, boatList, raceTimes)
{ #named the first argument `ID` instead of `raceIDs`
start_time <- raceTimes$start[rownames(raceTimes) == ID]
finish_time <- raceTimes$finish[rownames(raceTimes) == ID]
start_LocalTime <- min(grep(start_time, boatList[[ID]]$LocalTime))
finish_LocalTime <- max(grep(finish_time, boatList[[ID]]$LocalTime))
mean(boatList[[ID]]$SOG[start_LocalTime : finish_LocalTime])
}
Test:
meanRaceSpeed("US1", boatList, raceTimes)
#[1] 19.7063
meanRaceSpeed("NZ1", boatList, raceTimes)
#[1] 21.74729
mean(boatList$NZ1$SOG) #to test function
#[1] 21.74729
mean(boatList$US1$SOG) #to test function
#[1] 19.7063
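The grep approach relies on the exact start and finish strings appearing in LocalTime. An alternative, sketched here on made-up rows, is to convert the times to seconds and compare them numerically. The helper to_secs is hypothetical and assumes times in HH:MM:SS or HH:MM:SS.fff form.

```r
# Hypothetical helper: convert "HH:MM:SS(.fff)" strings to seconds since midnight
to_secs <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Made-up rows for one boat, including readings just outside the race
boat <- data.frame(
  LocalTime = c("11:09:59.800", "11:10:01.000", "11:20:00.500",
                "11:35:16.000", "11:35:16.200"),
  SOG = c(10, 20, 22, 24, 30),
  stringsAsFactors = FALSE
)
start <- "11:10:01"
finish <- "11:35:16"

# Keep only readings between start and finish, inclusive
t <- to_secs(boat$LocalTime)
in_race <- t >= to_secs(start) & t <= to_secs(finish)
race_mean <- mean(boat$SOG[in_race])
race_mean
```

With this comparison a reading at 11:35:16.200 counts as after the finish. Whether it should is a choice you would need to make for the real data.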
