R - Dynamically Switch Between Variables in a Data Frame?

I have a data frame that contains many US zip codes and their "city, state" place names. For instance, it might read (note that the actual data frame has about 25,000 observations):
zip.codes = c(33603, 33701, 32835)
place.names = c("Tampa, FL", "Saint Petersburg, FL", "Orlando, FL")
df = data.frame(zip.codes, place.names)
I am using this as part of a Shiny App. I'd like the user to be able to select a zip code, but to only have it return the place name. So if the user inputs 33603, then the code would return "Tampa, FL".
I've tried researching the switch function to do this, but it seems to accept only hard-coded lists (which is a nonstarter with so many entries in the data frame). Is there a way to set up the switch function using variables? Otherwise, does anybody have another idea that might work?

First create vectors place.names and zip.codes and then try any of the following alternatives:
place.names <- as.character(df$place.names)
zip.codes <- df$zip.codes
1) match
place.names[match(33603, zip.codes)]
## [1] "Tampa, FL"
2) logical condition
place.names[33603 == zip.codes]
## [1] "Tampa, FL"
3) lookup name
v <- setNames(place.names, zip.codes)
unname(v[as.character(33603)])
## [1] "Tampa, FL"
4) switch
L <- setNames(as.list(place.names), zip.codes)
do.call("switch", c(as.character(33603), L))
## [1] "Tampa, FL"
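For the Shiny use case, the match approach can be wrapped in a small helper that also copes with zip codes that are not in the table. This is only a sketch: the name lookup_place and the "Unknown" default are assumptions made for illustration, not part of the answer above.

```r
# Sample data from the question
zip.codes <- c(33603, 33701, 32835)
place.names <- c("Tampa, FL", "Saint Petersburg, FL", "Orlando, FL")
df <- data.frame(zip.codes, place.names, stringsAsFactors = FALSE)

# Hypothetical helper: return the place name for each zip code,
# or `default` when a zip code is not in the table
lookup_place <- function(zip, data, default = NA_character_) {
  idx <- match(zip, data$zip.codes)           # position of each zip, NA if absent
  out <- as.character(data$place.names[idx])  # an NA index yields NA
  out[is.na(idx)] <- default
  out
}

lookup_place(33603, df)
lookup_place(c(32835, 99999), df, default = "Unknown")
```

In a Shiny server function this could be called with the user's input, for example lookup_place(input$zip, df), where input$zip is a hypothetical input ID.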

Related

Using grepl in a string replace function in R Programming

Trying to create function using grepl to replace a word in a string.
Airport_ID<-c("3001","3002","3003","3004")
Airport_Name<-c("Adelaide Airport GOODFIND","GOODFIND DTS Land Airport Land ADTS",
"Washington DTS INC GOODFINDAirport DTSUpdated",
"DALLAS Airport TDS GOODFIND")
TF_Data<-data.frame(Airport_ID,Airport_Name)
Created the below function
STR_Manip_F_M_L_V1 <- function(data=NULL,by_text1="GOODFIND",by_text2="Updated") {
if(!require(glue)) {library(glue) }
TFD <- TF_Data %>%
filter(grepl(glue("^{by_text1} "),Airport_Name) |
grepl(glue(" \\({by_text1}\\) "),Airport_Name) |
grepl(glue(" \\({by_text1}\\$"),Airport_Name) )
TFD$Airport_Name <- str_replace(TFD$Airport_Name = glue("^{by_text1} "),replacement = glue("^{by_text2} ") )
return(TFD)
}
Error :
Error: object 'TFD' not found.
Several problems here:
You define data in the function arguments but use TF_Data inside the function. This is not causing the error, but it is a logical mistake that makes reproducibility and troubleshooting more difficult.
Your use of str_replace is not quite right: don't "name" the first argument as you've done here. (Or ... perhaps just change your = to a comma ...).
Since you're using dplyr earlier, I think it's good to stay in that mode ... try this:
library(dplyr)   # filter(), mutate(), %>%
library(glue)    # glue()
library(stringr) # str_replace()
STR_Manip_F_M_L_V1 <- function(data=NULL,by_text1="GOODFIND",by_text2="Updated") {
  out <- data %>%
    filter(grepl(glue("^{by_text1} "),Airport_Name) |
             grepl(glue(" \\({by_text1}\\) "),Airport_Name) |
             grepl(glue(" \\({by_text1}\\$"),Airport_Name) ) %>%
    mutate(Airport_Name = str_replace(Airport_Name, paste0("^", by_text1), by_text2))
  out
}
STR_Manip_F_M_L_V1(TF_Data)
# Airport_ID Airport_Name
# 1 3002 Updated DTS Land Airport Land ADTS
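For comparison, the same filter-and-replace can be sketched in base R with grepl and sub, which avoids the dplyr, glue, and stringr dependencies. The function name replace_leading_word is made up for this illustration, and it only covers the case where the word starts the string.

```r
TF_Data <- data.frame(
  Airport_ID = c("3001", "3002", "3003", "3004"),
  Airport_Name = c("Adelaide Airport GOODFIND",
                   "GOODFIND DTS Land Airport Land ADTS",
                   "Washington DTS INC GOODFINDAirport DTSUpdated",
                   "DALLAS Airport TDS GOODFIND"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose name starts with `old` followed by
# a space, then swap that leading word for `new`.
# Note: `old` is used as a regular expression, so special characters
# in it would need escaping.
replace_leading_word <- function(data, old = "GOODFIND", new = "Updated") {
  pattern <- paste0("^", old, " ")
  keep <- grepl(pattern, data$Airport_Name)
  out <- data[keep, , drop = FALSE]
  out$Airport_Name <- sub(pattern, paste0(new, " "), out$Airport_Name)
  out
}

res <- replace_leading_word(TF_Data)
res
```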

Function to iterate over list, merging results into one data frame

I've completed the first couple of R courses on DataCamp, and in order to build up my skills I've decided to use R to prep for fantasy football this season, so I have begun playing around with the nflscrapR package.
With the nflscrapR package, one can pull Game Information using the season_games() function which simply returns a data frame with the gameID, game date, the home and away team abbreviations.
Example:
games.2012 = season_games(2012)
head(games.2012)
GameID date home away season
1 2012090500 2012-09-05 NYG DAL 2012
2 2012090900 2012-09-09 CHI IND 2012
3 2012090908 2012-09-09 KC ATL 2012
4 2012090907 2012-09-09 CLE PHI 2012
5 2012090906 2012-09-09 NO WAS 2012
6 2012090905 2012-09-09 DET STL 2012
Initially I copied and pasted the original function call and changed the last digit manually for each season, then rbinded all the seasons into one data frame, games.
games.2012 <- season_games(2012)
games.2013 <- season_games(2013)
games.2014 <- season_games(2014)
games.2015 <- season_games(2015)
games = rbind(games.2012, games.2013, games.2014, games.2015)
I'd like to write a function to simplify this process.
My failed attempt:
gameID <- function(years) {
for (i in years) {
games[i] = season_games(years[i])
}
}
With years = list(2012, 2013) for testing purposes, produced the following:
Error in strsplit(headers, "\r\n") : non-character argument
Called from: strsplit(headers, "\r\n")
Thanks in advance!
While @Gregor has an apparent solution, he didn't run it because this wasn't a minimal example. I googled, found, and tried to use this code, and it doesn't work, at least not in a reasonable amount of time.
On the other hand, I took this code from Vivek Patil's blog.
library(XML)
weeklystats = as.data.frame(matrix(ncol = 14)) # Initializing our empty dataframe
names(weeklystats) = c("Week", "Day", "Date", "Blank",
"Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose",
"YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose",
"Year") # Naming columns
URLpart1 = "http://www.pro-football-reference.com/years/"
URLpart3 = "/games.htm"
#### Our workhorse function ####
getData = function(URLpart1, URLpart3) {
for (i in 2012:2015) {
URL = paste(URLpart1, as.character(i), URLpart3, sep = "")
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]
names(table) = c("Week", "Day", "Date", "Blank", "Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose", "YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose")
table$Year = i # Inserting a value for the year
weeklystats = rbind(table, weeklystats) # Appending happening here
}
return(weeklystats)
}
I posted this because it works, you might learn something about web scraping you didn't know, and it runs in 11 seconds.
system.time(weeklystats <- getData(URLpart1, URLpart3))
user system elapsed
0.870 0.014 10.926
You should probably take a look at some popular answers for working with lists, specifically How do I make a list of data frames? and What's the difference between [ and [[?.
There's no reason to put your years in a list. They're just integers, so just do a normal vector.
years = 2012:2015
Then we can get your function to work (we'll need to initialize an empty list before the for loop):
gameID <- function(years) {
  games = list()
  # loop over positions, not values: with i = 2012, years[i] would be NA
  for (i in seq_along(years)) {
    games[[i]] = season_games(years[i])
  }
  return(games)
}
Read my link above for why we're using [[ with the list and [ with the vector. And we could run it like this:
game_list = gameID(2012:2015)
But this is such a simple function that it's easier to use lapply. Your function is just a wrapper around a for loop that returns a list, and that's precisely what lapply is too. But where your function has season_games hard-coded in, lapply can work with any function.
game_list = lapply(2012:2015, season_games)
# should be the same result as above
In either case, we have the list of data frames and want to combine it into one big data frame. The base R way is rbind with do.call, but dplyr and data.table have more efficient versions.
# pick your favorite
games = do.call(rbind, args = game_list) # base
games = dplyr::bind_rows(game_list)
games = data.table::rbindlist(game_list)
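The lapply-then-rbind pattern can be tried without nflscrapR or network access. In the sketch below, fake_season_games is a made-up stand-in for season_games that returns a one-row data frame of dummy values; only the shape of the workflow is meant to carry over.

```r
# Stand-in for nflscrapR's season_games(): returns a dummy one-row
# data frame so the pattern can run without the package
fake_season_games <- function(year) {
  data.frame(GameID = year * 1e6, season = year)
}

years <- 2012:2015

# One data frame per season, collected in a list
game_list <- lapply(years, fake_season_games)

# Stack the list into a single data frame (base R)
games <- do.call(rbind, game_list)
games
```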

Convert R JSON Twitter data to list

When using searchTwitteR, I converted the results to a data frame and then exported it to JSON. However, all the text ends up in one line (sample below). I need to separate it so that each tweet is its own record.
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <-toJSON(phishdf)
write(exportJson, file = "phishdf.json")
json_phishdf <- fromJSON(file="phishdf.json")
I tried converting to a list and am wondering if maybe converting to a data frame is a mistake.
However, for a list, I tried:
newlist['text']=phish[[1]]$getText()
But this will just give me the text for the first tweet. Is there a way to iterate over the entire data set, maybe in a for loop?
{"text":["#ilazer #abbijacobson I do feel compelled to say that I phind phish awphul... sorry, Abbi!","#phish This on-sale was an embarrassment. Something needs to change.","FS: Have 2 Tix To Phish In Chula Vista #Phish #facevaluetickets #phish #facevalue GO: https://t.co/dFdrpyaotp","RT #WKUPhiDelt: Come unwind from a busy week of class and kick off the weekend with a Phish Fry! 4:30-7:30 at the Phi Delt house. Cost is $\u2026","RT #phish: Tickets for Phish's July 15 & 16 shows at The Gorge go on sale in fifteen minutes at 1PM ET: https://t.co/tEKLNjI5u7 https://t.c\u2026"],
"favorited":[false,false,false,false,false],
"favoriteCount":[0,0,0,0,0],
"replyToSN":["rAlexandria","phish","NA","NA","NA"],
"created":[1456521159,1456521114,1456521022,1456521016,1456520988],
"truncated":[false,false,false,false,false],
"replyToSID":["703326502629277696","703304948990222337","NA","NA","NA"],
"id":["703326837720662016","703326646074343424","703326261045829632","703326236722991105","703326119328686080"],
"replyToUID":["26152867","14503997","NA","NA","NA"],"statusSource":["Mobile Web (M5)","Twitter for iPhone","CashorTrade - Face Value Tickets","Twitter for iPhone","Twitter for Android"],
"screenName":["rAlexandria","adamgelvan","CashorTrade","Kyle_Smith1087","timogrennell"],
"retweetCount":[0,0,0,2,5],
"isRetweet":[false,false,false,true,true],
"retweeted":[false,false,false,false,false],
"longitude":["NA","NA","NA","NA","NA"],
"latitude":["NA","NA","NA","NA","NA"]}
I followed your code and don't have the issue you're describing. Are you using library(twitteR) and library(jsonlite)?
Here is the code that works for me:
library(twitteR)
library(jsonlite)
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <-toJSON(phishdf)
write(exportJson, file = "./../phishdf.json")
## note the `txt` argument, as opposed to `file` used in the question
json_phishdf <- fromJSON(txt="./../phishdf.json")
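On the question of getting each tweet as its own element: once the tweets are in a data frame, a per-tweet list can be built in base R without a for loop. The sketch below uses a made-up two-row data frame whose column names mirror the question's output; the values are placeholders.

```r
# Placeholder data standing in for the data frame built from searchTwitteR()
tweets <- data.frame(
  text = c("first tweet", "second tweet"),
  screenName = c("user_a", "user_b"),
  retweetCount = c(0, 2),
  stringsAsFactors = FALSE
)

# One list element per tweet, each holding that row's fields
tweet_list <- lapply(seq_len(nrow(tweets)), function(i) as.list(tweets[i, ]))

tweet_list[[2]]$text

# The text of every tweet is simply the column
tweets$text
```

When writing JSON with jsonlite, toJSON(phishdf, dataframe = "rows") produces one JSON object per row, which is that package's default layout for data frames.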

Creating a function by taking few arguments and calculating

I'm still working on a question from a couple of days ago and would like feedback/support on how I could create a function. Your expertise is highly appreciated.
I have created the following:
##### 1)
> raceIDs
[1] "GER" "SUI" "NZ2" "US1" "US2" "POR" "FRA" "AUS" "NZ1" "SWE"
##### 2)
#For each "raceIDs", there is a csv file which I have made a loop to read and created a list of data frames (assigned to the symbol "boatList")
#For example, if I select "NZ1" the output is:
> head(boatList[[9]]) #Only selected the first six lines as there is more than 30000 rows
Boat Date Secs LocalTime SOG
1 NZ1 01:09:2013 38150.0 10:35:49.997 22.17
2 NZ1 01:09:2013 38150.2 10:35:50.197 22.19
3 NZ1 01:09:2013 38150.4 10:35:50.397 22.02
4 NZ1 01:09:2013 38150.6 10:35:50.597 21.90
5 NZ1 01:09:2013 38150.8 10:35:50.797 21.84
6 NZ1 01:09:2013 38151.0 10:35:50.997 21.95
##### 3)
# A matrix showing the race times for each raceIDs
> raceTimes
start finish
GER "11:10:02" "11:35:05"
SUI "11:10:02" "11:35:22"
NZ2 "11:10:02" "11:34:12"
US1 "11:10:01" "11:33:29"
US2 "11:10:01" "11:36:05"
POR "11:10:02" "11:34:31"
FRA "11:10:02" "11:34:45"
AUS "11:10:03" "11:36:48"
NZ1 "11:10:01" "11:35:16"
SWE "11:10:03" "11:35:08"
What I need to do is calculate the average speed (SOG) of a boat "while it was racing" (between its start and finish times) by creating a function called meanRaceSpeed that takes three arguments.
What I have tried so far is to create a function with 3 arguments (with a bit of help from experts here):
meanRaceSpeed <- function(raceIDs, boatList, raceTimes)
{
#Probably need to compare times, and thought it might be useful to convert character values into `DateTime` values but not to sure how to use it
#DateTime <- as.POSIXct(paste(boatList$Date, boatList$Time), format="%Y%m%d %H%M%S")
#To get the times for each boat
start_time <- raceTimes$start[rownames(raceTimes) = raceIDs]
finish_time <- raceTimes$finish[rownames(raceTimes) = raceIDs]
start_LocalTime <- min(grep(start_time, boatList$LocalTime))
finish_LocalTime <- max(grep(finish_time, boatList$LocalTime))
#which `SOG`s contain all the `LocalTimes` between start and finish
#take their `mean`
mean(boatList$SOG[start_LocalTime : finish_LocalTime])
}
### Obviously, my code does not work :( and I don't know where.
So basically, I need to create a function with three arguments and the expected result is:
#e.g For NZ1
> meanRaceSpeed("NZ1", boatList, raceTimes)
[1] 18.32 #Mean speed for NZ1 between 11:10:01 - 11:35:16
#e.g for US1
> meanRaceSpeed("US1", boatList, raceTimes)
[1] 17.23 #Mean speed for US1 between 11:10:01 - 11:33:29
Any help spotting where I have gone wrong would be highly appreciated.
I'm going to give some general advice for R, but I will also help you with your specific question. Whenever I have a problem in R, I usually find that it helps to make things more explicit.
If the function isn't working with these indexing methods (is raceTimes a data frame or a matrix in your function?), try a different method. How?
Here are a few different things you can do to test your function, and a few suggestions that may move you along a bit. (I don't want to fix the whole thing for you, since it's your homework, but rather get you on your way.)
1) Why not try using a loop instead of brackets?
start_time <- raceTimes$start[rownames(raceTimes) = raceIDs]
Make that into a for loop. It's not too hard to do.
2) Debug your functions. There are a lot of tools for this built into R and in add-on packages, but since you likely don't have time for that with your homework, I'd suggest this: take apart the function and run each part of it with the values you want. Are the results of the right length? Are they the right data type? Do they give the right answer before you put them all together? Make sure of that.
3) If all else fails, don't worry if the function and code are not elegant. R is not always an elegant language. (Actually, it's rarely an elegant language.) Especially when you're a beginner, your code will likely be ugly. Just make sure it works.
Since I already had experience with your data, I sat down to make a complete example.
First, data that look like yours:
raceIDs <- c("GER", "SUI", "NZ2", "US1", "US2", "POR", "FRA", "AUS", "NZ1", "SWE")
raceTimes <- as.matrix(read.table(text = ' start finish
GER "11:10:02" "11:35:05"
SUI "11:10:02" "11:35:22"
NZ2 "11:10:02" "11:34:12"
US1 "11:10:01" "11:33:29"
US2 "11:10:01" "11:36:05"
POR "11:10:02" "11:34:31"
FRA "11:10:02" "11:34:45"
AUS "11:10:03" "11:36:48"
NZ1 "11:10:01" "11:35:16"
SWE "11:10:03" "11:35:08"', header = T))
# convert the matrix to a data.frame; otherwise `$` won't work
raceTimes <- as.data.frame(raceTimes, stringsAsFactors = FALSE)
blDF <- data.frame(Boat = rep(raceIDs, 3),
LocalTime = c(raceTimes$start, rep("11:20:25", length(raceIDs)), raceTimes$finish),
SOG = runif(3 * length(raceIDs), 15, 25), stringsAsFactors = F)
boatList <- split(blDF, blDF$Boat)
#remove `names` to create them from scratch
names(boatList) <- NULL
Then:
#create `names` by searching each element of
#`boatList` of what `boat` it contains
names(boatList) <- unlist(lapply(boatList, function(x) unique(x$Boat)))
#the function
meanRaceSpeed <- function(ID, boatList, raceTimes)
{ #named the first argument `ID` instead of `raceIDs`
start_time <- raceTimes$start[rownames(raceTimes) == ID]
finish_time <- raceTimes$finish[rownames(raceTimes) == ID]
start_LocalTime <- min(grep(start_time, boatList[[ID]]$LocalTime))
finish_LocalTime <- max(grep(finish_time, boatList[[ID]]$LocalTime))
mean(boatList[[ID]]$SOG[start_LocalTime : finish_LocalTime])
}
Test:
meanRaceSpeed("US1", boatList, raceTimes)
#[1] 19.7063
meanRaceSpeed("NZ1", boatList, raceTimes)
#[1] 21.74729
mean(boatList$NZ1$SOG) #to test function
#[1] 21.74729
mean(boatList$US1$SOG) #to test function
#[1] 19.7063
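The function above locates the race window with grep, so it relies on the exact start and finish strings appearing in LocalTime. The question's other idea, comparing times directly, can be sketched by converting "HH:MM:SS" strings to seconds since midnight. The helper names to_seconds and mean_speed_between and the small data frame are assumptions made for illustration.

```r
# Convert "HH:MM:SS" or "HH:MM:SS.fff" strings to seconds since midnight
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Hypothetical helper: mean SOG over rows whose LocalTime lies in [start, finish]
mean_speed_between <- function(boat, start, finish) {
  t <- to_seconds(boat$LocalTime)
  in_race <- t >= to_seconds(start) & t <= to_seconds(finish)
  mean(boat$SOG[in_race])
}

# Made-up readings: the first and last rows fall outside the race window
boat <- data.frame(
  LocalTime = c("11:09:59.800", "11:10:01.000", "11:20:25.400",
                "11:35:16.000", "11:35:16.200"),
  SOG = c(5, 20, 22, 18, 3),
  stringsAsFactors = FALSE
)

mean_speed_between(boat, "11:10:01", "11:35:16")
```

With the objects from the answer, the call would look like mean_speed_between(boatList[[ID]], raceTimes[ID, "start"], raceTimes[ID, "finish"]).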

How to Vectorize this R code Using Plyr, Apply, or Similar?

I wrote the following R code that identifies duplicate files in a directory. How can one vectorize the for-loop using the plyr package (or similar)? I would like to achieve a more idiomatic R solution than the one I came up with.
library("digest") # to compute the MD5 digest
test_dir = "/Users/user/Dropbox/kaggle/r_projects/test_photo"
filelist <- dir(test_dir, pattern = "JPG|AVI", recursive=TRUE,
all.files =TRUE, full.names=TRUE)
fl = list() # create an empty list to hold md5's and filenames
for (itm in filelist) {
file_digest = digest(itm, file=TRUE, algo="md5")
fl[[file_digest]]= c(fl[[file_digest]],itm)
}
fl
The output is (using a small test directory):
> fl
$`5715b719723c5111b3a38a6ff8b7ca56`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480.JPG"
$`24fd4d7d252ca66c8d7a88b539c55112`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3481.JPG"
$`2a1d668c874dc856b9df0fbf3f2e81ec`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482 copy.JPG"
[4] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482.JPG"
I tried:
h=ldply(filelist, digest, file=TRUE, algo="md5")
h$filenames=filelist
but ended up with a unique row for every key value pair of (MD5, filename). I was not able to get the compact output desired.
(Background: As an exercise, I converted the python code presented by Raymond Hettinger in his PyCon AU 2011 keynote "What Makes Python Awesome". The slides are here: http://slidesha.re/WKkh9M . I was able to cut the LOC in half, but I think I can do better - and learn more - by vectorizing).
Here is a solution using base R functions (plus digest) that is a little more concise:
md5s<-sapply(filelist,digest,file=TRUE,algo="md5")
split(filelist,md5s)
Here's one answer using plyr. First put the MD5 digests and file names into a data.frame with ldply. Then create the list you want with dlply.
library(plyr)
fl <- ldply(seq_along(filelist), function(idx)
c(digest(filelist[idx], file=TRUE, algo="md5"),
filelist[idx]))
fl <- dlply(fl, .(V1), function(x) x$V2)
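If adding packages is not desirable, the same grouping can be sketched with tools::md5sum, which ships with R and is vectorized over file paths. The example below creates its own throwaway files (their names and contents are made up) and, as an extra step not in the answers above, keeps only hashes shared by more than one file.

```r
# Build a throwaway directory with two identical files and one distinct file
tmp <- tempfile("dupes_")
dir.create(tmp)
writeLines("same content", file.path(tmp, "a.txt"))
writeLines("same content", file.path(tmp, "b.txt"))
writeLines("different content", file.path(tmp, "c.txt"))

files <- list.files(tmp, full.names = TRUE)

# tools::md5sum() returns one MD5 hash per file path
md5s <- unname(tools::md5sum(files))

# Group file names by hash, then keep only hashes shared by several files
groups <- split(files, md5s)
dupes <- Filter(function(g) length(g) > 1, groups)
dupes
```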
