How to have bar labels be names in Plotly for R - r

So I'm trying to make a bar chart that displays the most popular airports that flew to Chicago. For some reason, I'm finding it to be extremely difficult to have my bars be labeled by the airport names specifically.
I have a data frame called ty
> ty
Name
1 Atlanta, GA: Hartsfield-Jackson Atlanta International
2 New York, NY: LaGuardia
3 Minneapolis, MN: Minneapolis-St Paul International
4 Los Angeles, CA: Los Angeles International
5 Denver, CO: Denver International
6 Washington, DC: Ronald Reagan Washington National
7 Orlando, FL: Orlando International
8 Phoenix, AZ: Phoenix Sky Harbor International
9 Detroit, MI: Detroit Metro Wayne County
10 Las Vegas, NV: McCarran International
11 San Francisco, CA: San Francisco International
12 Dallas/Fort Worth, TX: Dallas/Fort Worth International
13 Boston, MA: Logan International
14 Philadelphia, PA: Philadelphia International
15 Newark, NJ: Newark Liberty International
I also have a data frame called df
id numArrivals
1 10397 964
2 12953 962
3 13487 883
4 12892 823
5 11292 776
6 11278 771
7 13204 725
8 14107 700
9 11433 672
10 12889 647
11 14771 611
12 11298 580
13 10721 569
14 14100 567
15 11618 488
The id corresponds to the airport name: 10397 is Atlanta, GA: Hartsfield-Jackson Atlanta International, and the rest follow in the same order.
However, when I run:
plotly::plot_ly(df,x=ty["Name"],y=df$numArrivals,type="bar",color=I("rgba(0,92,124,1)"))
I am given this chart.
How can I make the labels of my bars into the names of the airport rather than just numbers?

Feel free to use ggplotly() to create your plot. I used the code below to create a small example.
example <- data.frame(airport = c("Atlanta, GA: Hartsfield-Jackson Atlanta International","New York, NY: LaGuardia","Minneapolis, MN: Minneapolis-St Paul International"),
id = c(10397,12953,13487),
numArrivals = c(964,962,883),stringsAsFactors = F)
library(ggplot2)
library(plotly)
a <- ggplot(example,aes(x=airport,y=numArrivals,fill=id)) + geom_bar(stat = "identity") + coord_flip()
ggplotly(a)
The final result looks like this.
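If you would rather stay with plot_ly(), a minimal sketch along the following lines may also work. It assumes, as the question states, that the rows of ty and df are in the same order; only the first three airports are reproduced here to keep the example short. The names are stored as a factor so that plotly keeps the data order instead of sorting the labels alphabetically.

```r
library(plotly)

# First three rows of the question's data (assumed to be in matching order)
ty <- data.frame(
  Name = c("Atlanta, GA: Hartsfield-Jackson Atlanta International",
           "New York, NY: LaGuardia",
           "Minneapolis, MN: Minneapolis-St Paul International"),
  stringsAsFactors = FALSE
)
df <- data.frame(id = c(10397, 12953, 13487),
                 numArrivals = c(964, 962, 883))

# Attach the names as a vector (not a one-column data frame like ty["Name"]);
# the factor levels fix the order of the bars
df$Name <- factor(ty$Name, levels = ty$Name)

p <- plot_ly(df, x = ~Name, y = ~numArrivals, type = "bar",
             marker = list(color = "rgba(0,92,124,1)"))
p
```

The likely culprit in the original call is x=ty["Name"], which passes a data frame rather than a character vector; ty$Name or a column of df referenced with ~ avoids that.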

Related

Rvest read table with cells that span multiple rows

I'm trying to scrape an irregular table from Wikipedia using rvest. The table has cells that span multiple rows. The documentation for html_table clearly states that this is a limitation. I'm just wondering if there's a workaround.
The table looks like this:
My code:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
html_table(fill=TRUE) %>% # fill=FALSE yields the same results
.[[1]]
Returns this:
Where there are several errors, for example: row 4 under "City" should be "Mesa", NOT "Chicago Cubs". I'd be happy with blank cells as I could "fill down" as needed, but the wrong data is a problem. Help is much appreciated.
I have a way to code it.
It is not perfect, a bit long but it does the trick:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
# get the lines of the table
lines <- url %>%
read_html() %>%
html_nodes(xpath="//table[starts-with(@class, 'wikitable')]") %>%
html_nodes(xpath = 'tbody/tr')
#define the empty table
ncol <- lines %>%
.[[1]] %>%
html_children()%>%
length()
nrow <- length(lines)
table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))
# fill the table
for(i in 1:nrow){
# get content of the line
linecontent <- lines[[i]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
# attribute the content to free columns
colselect <- is.na(table[i,])
table[i,colselect] <- linecontent
# get the line repetition of each columns
repetition <- lines[[i]]%>%
html_children()%>%
html_attr("rowspan")%>%
ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
as.numeric
# repeat the cells of the multiple rows down
for(j in 1:length(repetition)){
span <- repetition[j]
if(span > 1){
table[(i+1):(i+span-1),colselect][,j] <- rep(linecontent[j],span-1)
}
}
}
The idea is to have the html lines of the table in the lines variable by getting the /tr nodes. I then create an empty table: the number of columns is the number of children of the first row (because it contains the titles), and the number of rows is the length of lines. I fill it by hand in a for loop (I didn't manage to find a nicer way here).
The difficulty is that the number of text values given in a row changes when a cell from an earlier row already spans the current row. For example:
lines[[3]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
gives only 5 values :
[1] "Arizona League Athletics Gold" "Oakland Athletics" "Mesa" "Fitch Park"
[5] "10,000"
instead of 6, because the first column holds "East" across 8 rows. This "East" value appears only in the first row it spans.
The trick is to repeat the cells down in the table when they have a rowspan attribute (meaning they span several rows). On the next row we can then select only the NA columns, so that the number of text values given by the html line matches the number of free columns in the table we fill.
This is done with the colselect variable, which is a logical vector marking the free columns of the current row before the cells of that row are repeated down.
The result :
V1 V2 V3 V4 V5 V6
1 Division Team MLB Affiliation City Stadium Capacity
2 East Arizona League Angels Los Angeles Angels Tempe Tempe Diablo Stadium 9,785
3 East Arizona League Athletics Gold Oakland Athletics Mesa Fitch Park 10,000
4 East Arizona League Athletics Green Oakland Athletics Mesa Fitch Park 10,000
5 East Arizona League Cubs 1 Chicago Cubs Mesa Sloan Park 15,000
6 East Arizona League Cubs 2 Chicago Cubs Mesa Sloan Park 15,000
7 East Arizona League Diamondbacks Arizona Diamondbacks Scottsdale Salt River Fields at Talking Stick 11,000
8 East Arizona League Giants Black San Francisco Giants Scottsdale Scottsdale Stadium 12,000
9 East Arizona League Giants Orange San Francisco Giants Scottsdale Scottsdale Stadium 12,000
10 Central Arizona League Brewers Gold Milwaukee Brewers Phoenix American Family Fields of Phoenix 8,000
11 Central Arizona League Dodgers Lasorda Los Angeles Dodgers Phoenix Camelback Ranch 12,000
12 Central Arizona League Indians Blue Cleveland Indians Goodyear Goodyear Ballpark 10,000
13 Central Arizona League Padres 2 San Diego Padres Peoria Peoria Sports Complex 12,882
14 Central Arizona League Reds Cincinnati Reds Goodyear Goodyear Ballpark 10,000
15 Central Arizona League White Sox Chicago White Sox Phoenix Camelback Ranch 12,000
16 West Arizona League Brewers Blue Milwaukee Brewers Phoenix American Family Fields of Phoenix 8,000
17 West Arizona League Dodgers Mota Los Angeles Dodgers Phoenix Camelback Ranch 12,000
18 West Arizona League Indians Red Cleveland Indians Goodyear Goodyear Ballpark 10,000
19 West Arizona League Mariners Seattle Mariners Peoria Peoria Sports Complex 12,882
20 West Arizona League Padres 1 San Diego Padres Peoria Peoria Sports Complex 12,882
21 West Arizona League Rangers Texas Rangers Surprise Surprise Stadium 10,500
22 West Arizona League Royals Kansas City Royals Surprise Surprise Stadium 10,500
Edit
I made a shorter version of the function, with more explanation here
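The fill-down logic described in this answer can be sketched without any HTML at all. In the hypothetical example below, each row lists only the cells that are physically present in its tr element, together with their rowspan values; the row contents are a shortened, made-up excerpt in the spirit of the Wikipedia table, not the real page.

```r
# Each element is one <tr>: the cell texts it contains and their rowspans
rows <- list(
  list(text = c("Division", "Team", "City"), rowspan = c(1, 1, 1)),
  list(text = c("East", "Angels", "Tempe"),  rowspan = c(3, 1, 1)),
  list(text = c("Athletics Gold", "Mesa"),   rowspan = c(1, 2)),
  list(text = c("Athletics Green"),          rowspan = c(1))
)

n_col  <- length(rows[[1]]$text)   # the header row has no spanning cells
filled <- matrix(NA_character_, nrow = length(rows), ncol = n_col)

for (i in seq_along(rows)) {
  cells <- rows[[i]]
  # columns not already occupied by a cell spanning down from an earlier row
  free <- which(is.na(filled[i, ]))
  filled[i, free] <- cells$text
  # copy every spanning cell into the rows below it
  for (j in seq_along(cells$text)) {
    span <- cells$rowspan[j]
    if (span > 1) {
      filled[(i + 1):(i + span - 1), free[j]] <- cells$text[j]
    }
  }
}

filled
```

The sketch assumes that the number of cells in each row equals the number of free columns, which holds for well-formed tables that use rowspan only (no colspan).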

Map zip codes to their respective city and state in R?

I have a data frame of zip codes that I'm looking to map to a city & state for each specific zip code. Currently, I have played around with the zipcode package a bit but I'm not sure that can solve this specific issue.
Here's sample data of what I have now:
str(all_key$zip)
chr [1:406] "43031" "24517" "43224" "43832" "53022" "60185" "84104" "43081"
"85226" "85193" "54656" "43215" "94533" "95826" "64804" "49548" "54467"
The expected output would be adding a city & state column to each row of the data frame referring to the individual zips:
head(all_key)
zip city state
1 43031 city1 state1
2 24517 city2 state2
3 43224 city3 state3
4 43832 city4 state4
5 53022 city5 state5
6 60185 city6 state6
Thanks in advance for your help.
Another Update - February 2023
Another package (zipcodeR) has been added that makes this easier. See below.
Answer updated - January 2020
The zipcode package seems to have disappeared, so this answer has been updated to show how to add lat-lon from an external file. New answer at bottom.
Original answer
You can get the data from the zipcode package and just do a merge to look things up.
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
ZC = data.frame(zip)
library(zipcode)
data(zipcode)
merge(ZC, zipcode)
zip city state latitude longitude
1 24517 Altavista VA 37.12754 -79.27409
2 43031 Johnstown OH 40.15198 -82.66944
3 43081 Westerville OH 40.10951 -82.91606
4 43215 Columbus OH 39.96513 -83.00431
5 43224 Columbus OH 40.03991 -82.96772
6 43832 Newcomerstown OH 40.27738 -81.59662
7 49548 Grand Rapids MI 42.86823 -85.66391
8 53022 Germantown WI 43.21916 -88.12043
9 54467 Plover WI 44.45228 -89.54399
10 54656 Sparta WI 43.96977 -90.80796
11 60185 West Chicago IL 41.89198 -88.20502
12 64804 Joplin MO 37.04716 -94.51124
13 84104 Salt Lake City UT 40.75063 -111.94077
14 85193 Casa Grande AZ 32.86000 -111.83000
15 85226 Chandler AZ 33.31221 -111.93177
16 94533 Fairfield CA 38.26958 -122.03701
17 95826 Sacramento CA 38.55010 -121.37492
If you need to keep the rows in the same order, you can just set the rownames on the zipcode data and use that to select the desired rows and columns.
rownames(zipcode) = zipcode$zip
zipcode[zip, 1:3]
zip city state
43031 43031 Johnstown OH
24517 24517 Altavista VA
43224 43224 Columbus OH
43832 43832 Newcomerstown OH
53022 53022 Germantown WI
60185 60185 West Chicago IL
84104 84104 Salt Lake City UT
43081 43081 Westerville OH
85226 85226 Chandler AZ
85193 85193 Casa Grande AZ
54656 54656 Sparta WI
43215 43215 Columbus OH
94533 94533 Fairfield CA
95826 95826 Sacramento CA
64804 64804 Joplin MO
49548 49548 Grand Rapids MI
54467 54467 Plover WI
Updated Answer - January 2020
Since the zipcode package has disappeared, this shows how to add lat-lon information from a downloaded data set. The file that I am using exists today but the method should work for other files. See the GIS StackExchange for some leads on where to download data.
## Original Data to match
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
ZC = data.frame(zip)
## Download source file, unzip and extract into table
ZipCodeSourceFile = "http://download.geonames.org/export/zip/US.zip"
temp <- tempfile()
download.file(ZipCodeSourceFile , temp)
ZipCodes <- read.table(unz(temp, "US.txt"), sep="\t")
unlink(temp)
names(ZipCodes) = c("CountryCode", "zip", "PlaceName",
"AdminName1", "AdminCode1", "AdminName2", "AdminCode2",
"AdminName3", "AdminCode3", "latitude", "longitude", "accuracy")
## merge extra info onto original data
ZC_Info = merge(ZC, ZipCodes[,c(2:6,10:11)])
head(ZC_Info)
zip PlaceName AdminName1 AdminCode1 AdminName2 latitude longitude
1 24517 Altavista Virginia VA Campbell 37.1222 -79.2911
2 43031 Johnstown Ohio OH Licking 40.1445 -82.6973
3 43081 Westerville Ohio OH Franklin 40.1146 -82.9105
4 43215 Columbus Ohio OH Franklin 39.9671 -83.0044
5 43224 Columbus Ohio OH Franklin 40.0425 -82.9689
6 43832 Newcomerstown Ohio OH Tuscarawas 40.2739 -81.5940
Second Update - February 2023
Another package, zipcodeR, is now available that makes this easier. Here is some simple code to demonstrate it.
library(zipcodeR)
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
reverse_zipcode(zip)[,c(1,3,7)]
# A tibble: 17 × 3
zipcode major_city state
<chr> <chr> <chr>
1 85193 Casa Grande AZ
2 85226 Chandler AZ
3 94533 Fairfield CA
4 95826 Sacramento CA
5 60185 West Chicago IL
6 49548 Grand Rapids MI
7 64804 Joplin MO
8 43031 Johnstown OH
9 43081 Westerville OH
10 43215 Columbus OH
11 43224 Columbus OH
12 43832 Newcomerstown OH
13 84104 Salt Lake City UT
14 24517 Altavista VA
15 53022 Germantown WI
16 54467 Plover WI
17 54656 Sparta WI
You can still use the "zipcode" package by downloading it from the archives
https://cran.r-project.org/src/contrib/Archive/zipcode/
Once you download the tar.gz file to your computer, you can install it from the RStudio GUI Packages pane. After clicking "Install", you can change the option to "Package Archive File" and point to the downloaded tar.gz file.
Install/use the usa package, also described here, which contains a tibble (zips and lats/longs) from the archived zipcode package.
library(usa)
zcs <- usa::zipcodes
head(zcs)
# A tibble: 6 x 5
zip city state lat long
<chr> <chr> <chr> <dbl> <dbl>
1 00210 Portsmouth NH 43.0 -71.0
2 00211 Portsmouth NH 43.0 -71.0
3 00212 Portsmouth NH 43.0 -71.0
4 00213 Portsmouth NH 43.0 -71.0
5 00214 Portsmouth NH 43.0 -71.0
6 00215 Portsmouth NH 43.0 -71.0
You can use the data frame in the R package zipcodeR.
To add the city and state to your data frame, you can select the variables you want from the data frame provided in zipcodeR (called zip_code_db), then join it with your data frame:
library(dplyr)
library(zipcodeR)
zip_code_db_selected =
zip_code_db %>%
select(zipcode, major_city, state)
all_key_with_city_st =
left_join(all_key, zip_code_db_selected, by = c("zip" = "zipcode"))
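Whichever lookup table you end up with, the join itself can also be done in base R with match(), which keeps the original row order and leaves unmatched ZIP codes as NA. The sketch below uses a tiny hand-made lookup table built from three rows of the output shown earlier; "99999" is a made-up value included only to show what happens when there is no match.

```r
# Small stand-in for a full ZIP lookup table (three rows from the output above)
lookup <- data.frame(
  zip   = c("24517", "43031", "43224"),
  city  = c("Altavista", "Johnstown", "Columbus"),
  state = c("VA", "OH", "OH"),
  stringsAsFactors = FALSE
)

# ZIP codes kept as character so leading zeros are never lost;
# "99999" is a hypothetical ZIP with no entry in the lookup table
all_key <- data.frame(zip = c("43031", "24517", "43224", "99999"),
                      stringsAsFactors = FALSE)

idx <- match(all_key$zip, lookup$zip)   # position of each zip in the lookup
all_key$city  <- lookup$city[idx]
all_key$state <- lookup$state[idx]

all_key
```

Unlike merge(), this does not reorder the rows, so no extra step is needed to restore the original order.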

R: Mission impossible? How to assign "New York" to a county

I run into problems assigning a county to some city places. When querying via the acs package
> geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
As you can see, "New York", for instance, has a bunch of counties. So do Los Angeles, Portland, Oklahoma, Columbus etc. How can such data be assigned to a "county"?
The following code is currently used to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works when the query returns a single county name.
Script
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat
library(tigris)
library(acs)
library(dplyr) # for bind_rows(), left_join() and %>%
data(fips_codes) # FIPS codes with state, code, county information
GeoLookup <- lapply(dat,function(x) {
geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})
df <- bind_rows(GeoLookup)
#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")
# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).
df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df
Returns:
state state.name county.name place place.name state.abb statefips countyfips
1 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city <NA> <NA> <NA>
2 25 Massachusetts Suffolk County 7000 Boston city MA 25 025
3 6 California Los Angeles County 20802 East Los Angeles CDP CA 06 037
4 48 Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000 Dallas city <NA> <NA> <NA>
5 6 California San Mateo County 20956 East Palo Alto city CA 06 081
In order to retain data, the left_join might better be matched as "look for a county.name that contains place.name (without the trailing "city" in the name), or choose the first item by default". It would be great to see how this could be done.
In general: I assume, there's no better way than this approach?
Thanks for your help!
What about something like the code below to create a "long" data frame for joining. We use the tidyverse pipe operator to chain operations. strsplit returns a list, which we unnest to stack the list values (the county names that go with each combination of state.name and place.name) into a long data frame where each county.name now gets its own row.
library(tigris)
library(acs)
library(tidyverse)
dat = geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
dat = dat %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York NA <NA> <NA>
2 36 New York 51000 New York city Bronx County
3 36 New York 51000 New York city Kings County
4 36 New York 51000 New York city New York County
5 36 New York 51000 New York city Queens County
6 36 New York 51000 New York city Richmond County
7 36 New York 51011 New York Mills village Oneida County
UPDATE: Regarding the second question in your comment, assuming you have the vector of metro areas already, how about this:
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
df <- map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
})
df
state state.name place place.name county.name
1 36 New York 51000 New York city Bronx County
2 36 New York 51000 New York city Kings County
3 36 New York 51000 New York city New York County
4 36 New York 51000 New York city Queens County
5 36 New York 51000 New York city Richmond County
6 36 New York 51011 New York Mills village Oneida County
7 25 Massachusetts 7000 Boston city Suffolk County
8 25 Massachusetts 7000 Boston city Suffolk County
9 6 California 20802 East Los Angeles CDP Los Angeles County
10 6 California 39612 Lake Los Angeles CDP Los Angeles County
11 6 California 44000 Los Angeles city Los Angeles County
12 48 Texas 19000 Dallas city Collin County
13 48 Texas 19000 Dallas city Dallas County
14 48 Texas 19000 Dallas city Denton County
15 48 Texas 19000 Dallas city Kaufman County
16 48 Texas 19000 Dallas city Rockwall County
17 48 Texas 40516 Lake Dallas city Denton County
18 6 California 20956 East Palo Alto city San Mateo County
19 6 California 55282 Palo Alto city Santa Clara County
UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that includes the same name as the city (for example, New York County in the case of New York city), or the first county in the list otherwise. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might have to tweak it a bit to make it work for the entire U.S. For example, for it to work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"....
map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest %>%
slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name), gsub(" County", "", county.name))), na.rm=TRUE))
})
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York 51000 New York city New York County
2 36 New York 51011 New York Mills village Oneida County
3 25 Massachusetts 7000 Boston city Suffolk County
4 6 California 20802 East Los Angeles CDP Los Angeles County
5 6 California 39612 Lake Los Angeles CDP Los Angeles County
6 6 California 44000 Los Angeles city Los Angeles County
7 48 Texas 19000 Dallas city Dallas County
8 48 Texas 40516 Lake Dallas city Denton County
9 6 California 20956 East Palo Alto city San Mateo County
10 6 California 55282 Palo Alto city Santa Clara County
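The selection rule from UPDATE 2 can also be sketched in base R, without geo.lookup(), on a small hand-made data frame. The three rows below are copied from the outputs shown in the question; pick_county is a hypothetical helper name. Note that this sketch compares names with exact equality rather than grepl(), so it is a slightly stricter variant of the rule, not the same code.

```r
places <- data.frame(
  place.name  = c("New York city", "Dallas city", "Boston city"),
  county.name = c(
    "Bronx County, Kings County, New York County, Queens County, Richmond County",
    "Collin County, Dallas County, Denton County, Kaufman County, Rockwall County",
    "Suffolk County"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the county named like the place, else the first one
pick_county <- function(place, counties) {
  candidates <- strsplit(counties, ", ", fixed = TRUE)[[1]]
  city <- sub(" [A-Za-z]*$", "", place)   # drop trailing "city", "village", ...
  hit  <- which(sub(" County$", "", candidates) == city)
  if (length(hit) > 0) candidates[hit[1]] else candidates[1]
}

places$county.name <- mapply(pick_county, places$place.name,
                             places$county.name, USE.NAMES = FALSE)
places
```

With one county per place, the left_join on state.name and county.name from the question should then find a match for every row.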
Could you prep the data by using something like the below code?
new_york_data <- geo.lookup(state = "NY", place = "New York")
prep_data <- function(full_data){
output <- data.frame()
for(row in 1:nrow(full_data)){
new_rows <- replicateCounty(full_data[row, ])
output <- plyr::rbind.fill(output, new_rows)
}
return(output)
}
replicateCounty <- function(row){
counties <- str_trim(unlist(str_split(row$county.name, ",")))
output <- data.frame(state = row$state,
state.name = row$state.name,
county.name = counties,
place = row$place,
place.name = row$place.name)
return(output)
}
prep_data(new_york_data)
It's a little messy and you'll need the plyr and stringr packages. Once you prep the data, you should be able to join on it.

Convert string vector to dataframe in R

I am working on a quick scraping project that involves grabbing historical NFL football data. Here is a quick glance of what my data looks like:
allgames_thisweek = c("Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score", "Cleveland Browns 28, Cincinnati Bengals 20 -- Box Score",
"Dallas Cowboys 26, Pittsburgh Steelers 9 -- Box Score", "Detroit Lions 31, Atlanta Falcons 28 (OT) -- Box Score",
"Green Bay Packers 16, Minnesota Vikings 10 -- Box Score", "Indianapolis Colts 45, Houston Oilers 21 -- Box Score",
"Kansas City Chiefs 30, New Orleans Saints 17 -- Box Score",
"Los Angeles Rams 14, Arizona Cardinals 12 -- Box Score", "Miami Dolphins 39, New England Patriots 35 -- Box Score",
"New York Giants 28, Philadelphia Eagles 23 -- Box Score", "New York Jets 23, Buffalo Bills 3 -- Box Score",
"San Diego Chargers 37, Denver Broncos 34 -- Box Score", "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score",
"Seattle Seahawks 28, Washington Redskins 7 -- Box Score")
allgames_thisweek[1]
"Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score"
Each row has the following data [team1, team1score, team2, team2score, --, Box Score]
My data is all formatted the exact same way, meaning there's always a comma after the first team's score, and there's always a -- after the 2nd team's score. I'd like to create a dataframe that has 4 columns (team1, team1score, team2, team2score), so an output might look like this:
output_df
team1 team1score team2 team2score
1. Chicago Bears 21 Tampa Bay Buccaneers 9
Any thoughts on how I could achieve this? Any help is appreciated! Thanks
You can do this with dplyr + stringr:
library(dplyr)
library(stringr)
string %>%
str_replace("(?<=\\d)\\s.*--.+$", "") %>%
str_replace_all("\\s(?=\\d+\\b)", ",") %>%
strsplit(",") %>%
do.call(rbind, .) %>%
data.frame() %>%
setNames(c("team1", "team1score", "team2", "team2score"))
Result:
team1 team1score team2 team2score
1 Chicago Bears 21 Tampa Bay Buccaneers 9
2 Cleveland Browns 28 Cincinnati Bengals 20
3 Dallas Cowboys 26 Pittsburgh Steelers 9
4 Detroit Lions 31 Atlanta Falcons 28
5 Green Bay Packers 16 Minnesota Vikings 10
6 Indianapolis Colts 45 Houston Oilers 21
7 Kansas City Chiefs 30 New Orleans Saints 17
8 Los Angeles Rams 14 Arizona Cardinals 12
9 Miami Dolphins 39 New England Patriots 35
10 New York Giants 28 Philadelphia Eagles 23
11 New York Jets 23 Buffalo Bills 3
12 San Diego Chargers 37 Denver Broncos 34
13 San Francisco 49ers 44 Los Angeles Raiders 14
14 Seattle Seahawks 28 Washington Redskins 7
Notes:
(?<=\\d)\\s.*--.+$ matches a space (\\s) followed by any character zero or more times (.*), the literal --, and any character one or more times (.+) up to the end of the string ($). The extra condition (?<=\\d) requires that the space comes immediately after a digit.
(?<=...) is called a positive lookbehind: it checks that what comes immediately before the current position matches the pattern in ....
\\s(?=\\d+\\b) matches a space that is immediately followed ((?=...)) by one or more digits and a word boundary (\\b). So this matches the space between the team names and the team scores.
(?=...) is a positive lookahead: it checks that what comes immediately after the current position matches the pattern in ....
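For comparison, here is a base R sketch that captures the four fields in a single pattern with regexec() and regmatches(). It is an alternative to the approach above, not a restatement of it, and it assumes that "(OT)" is the only extra marker that can appear between the second score and the "--". The vector name games and the three sample strings (taken from the question) are only for illustration.

```r
games <- c("Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score",
           "Detroit Lions 31, Atlanta Falcons 28 (OT) -- Box Score",
           "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score")

# team1, score1, team2, score2, optional " (OT)" marker
pattern <- "^(.+) ([0-9]+), (.+) ([0-9]+)( \\(OT\\))? -- Box Score$"

# One character vector per string: the full match followed by the groups
parts <- do.call(rbind, regmatches(games, regexec(pattern, games)))

output_df <- data.frame(
  team1      = parts[, 2],
  team1score = as.integer(parts[, 3]),
  team2      = parts[, 4],
  team2score = as.integer(parts[, 5]),
  stringsAsFactors = FALSE
)
output_df
```

Because the scores are captured as separate groups, they can be converted to integers directly, and the team names come out without leading spaces.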
Data:
string = c("Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score", "Cleveland Browns 28, Cincinnati Bengals 20 -- Box Score",
"Dallas Cowboys 26, Pittsburgh Steelers 9 -- Box Score", "Detroit Lions 31, Atlanta Falcons 28 (OT) -- Box Score",
"Green Bay Packers 16, Minnesota Vikings 10 -- Box Score", "Indianapolis Colts 45, Houston Oilers 21 -- Box Score",
"Kansas City Chiefs 30, New Orleans Saints 17 -- Box Score",
"Los Angeles Rams 14, Arizona Cardinals 12 -- Box Score", "Miami Dolphins 39, New England Patriots 35 -- Box Score",
"New York Giants 28, Philadelphia Eagles 23 -- Box Score", "New York Jets 23, Buffalo Bills 3 -- Box Score",
"San Diego Chargers 37, Denver Broncos 34 -- Box Score", "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score",
"Seattle Seahawks 28, Washington Redskins 7 -- Box Score")

R append function

I'm writing an R script that parses out a state abbreviation from a column in a data.frame. It then uses the which() function to determine the index of the found state abbreviation in a lookup data frame that contains state abbreviations and their corresponding full state names. I then use the found index to access the full state name and append it to a vector called completeList. I then add the vector completeList, which should contain the full state names, to my original data frame under a newly created column STATE_NAME.
However, for some reason completeList only contains the indexes that were found earlier and not the full state names that I expected. What did I do wrong?
#read in csv weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
#read in csv state Abbreviation file
abbreviationsFile<-read.csv(header=TRUE, file="C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
completeList<-append(completeList, addCompleteStateName)
}
file["STATE_NAME"]<-completeList
>completeList
[1] 27 17 17 29 42 50 20 53 45 19 22 52 9 29 26 37 8 58 35
Here is the csv file where the abbreviation of the station is found
STATION STATION_NAME ELEVATION
GHCND:USC00202381 EAST JORDAN MI US 180.1
GHCND:USC00111290 CARLYLE RESERVOIR IL US 153
GHCND:USC00116661 PAW PAW 2 S IL US 274.9
GHCND:USC00228556 SUMRALL MS US 88.1
GHCND:USC00340292 ARDMORE OK US 267.9
GHCND:USC00408522 SPARTA WASTEWATER PLANT TN US 289.9
GHCND:USC00148341 VALLEY FALLS KS US 283.5
GHCND:USW00014742 BURLINGTON INTERNATIONAL AIRPORT VT US 101.2
GHCND:USC00367782 SALINA 3 W PA US 338
GHCND:USC00134142 IOWA FALLS IA US 356.9
GHCND:USC00161565 CARVILLE 2 SW LA US 9.1
GHCND:USC00421446 CITY CRK WATER PLANT UT US 1628.9
GHCND:USW00013781 WILMINGTON NEW CASTLE CO AIRPORT DE US 22.6
GHCND:USC00229400 WATER VALLEY MS US 116.1
GHCND:USC00190562 BELCHERTOWN MA US 171
GHCND:USW00094728 NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US 40.2
GHCND:USC00060973 BURLINGTON CT US 155.4
GHCND:USC00475516 MINOCQUA WI US 484.9
GHCND:USC00286055 NEW BRUNSWICK 3 SE NJ US 38.1
Here is the csv file where we look up abbreviations and find the corresponding full state name
State/Possession Abbreviation
Alabama AL
Alaska AK
American Samoa AS
Arizona AZ
Arkansas AR
California CA
Colorado CO
Connecticut CT
Delaware DE
District of Columbia DC
Federated States of Micronesia FM
Florida FL
Georgia GA
Guam GU
Hawaii HI
Idaho ID
Illinois IL
Indiana IN
Iowa IA
Kansas KS
Kentucky KY
Louisiana LA
Maine ME
Marshall Islands MH
Maryland MD
Massachusetts MA
Michigan MI
Minnesota MN
Mississippi MS
Missouri MO
Montana MT
Nebraska NE
Nevada NV
New Hampshire NH
New Jersey NJ
New Mexico NM
New York NY
North Carolina NC
North Dakota ND
Northern Mariana Islands MP
Ohio OH
Oklahoma OK
Oregon OR
Palau PW
Pennsylvania PA
Puerto Rico PR
Rhode Island RI
South Carolina SC
South Dakota SD
Tennessee TN
Texas TX
Utah UT
Vermont VT
Virgin Islands VI
Virginia VA
Washington WA
West Virginia WV
Wisconsin WI
Wyoming WY
Why am I not getting the full state name?
figured it out 😎
#read in csv weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
#read in csv state Abbreviation file
abbreviationsFile<-read.csv(header=TRUE, file="C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
completeList<-append(completeList, toString(addCompleteStateName))
}
file["STATE_NAME"]<-completeList
the type was being forced to an integer
The variable addCompleteStateName is a factor. You can convert it to a character to append the labels.
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
# modified to convert addCompleteStateName to character
completeList<-append(completeList, as.character(addCompleteStateName))
}
file["STATE_NAME"]<-completeList
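The loop can also be replaced by a vectorized lookup with match(), which avoids growing a vector with append(). The sketch below uses a few rows from the two files shown in the question instead of reading the CSVs; the data frame names weather and abbreviations and the column name State are stand-ins chosen for this example. The lookup table is deliberately built with factors to mirror what read.csv produced in the question.

```r
# Stand-in for the abbreviation file; factors mimic the original read.csv result
abbreviations <- data.frame(
  State        = c("Michigan", "Illinois", "Mississippi"),
  Abbreviation = c("MI", "IL", "MS"),
  stringsAsFactors = TRUE
)

# Stand-in for the weather file
weather <- data.frame(
  STATION_NAME = c("EAST JORDAN MI US", "CARLYLE RESERVOIR IL US", "SUMRALL MS US"),
  stringsAsFactors = FALSE
)

# Same substring positions as the loop, applied to the whole column at once
n    <- nchar(weather$STATION_NAME)
abbr <- substring(weather$STATION_NAME, n - 4, n - 3)

# Convert the factor to character BEFORE indexing, so names (not codes) are stored
weather$STATE_NAME <- as.character(abbreviations$State)[
  match(abbr, abbreviations$Abbreviation)
]
weather
```

Reading the files with stringsAsFactors = FALSE (the default since R 4.0.0) would also avoid the factor-to-integer surprise in the first place.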

Resources