Convert string vector to dataframe in R - r

I am working on a quick scraping project that involves grabbing historical NFL football data. Here is a quick glance of what my data looks like:
allgames_thisweek = c("Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score", "Cleveland Browns 28, Cincinnati Bengals 20 -- Box Score",
"Dallas Cowboys 26, Pittsburgh Steelers 9 -- Box Score", "Detroit Lions 31, Atlanta Falcons 28 (OT) -- Box Score",
"Green Bay Packers 16, Minnesota Vikings 10 -- Box Score", "Indianapolis Colts 45, Houston Oilers 21 -- Box Score",
"Kansas City Chiefs 30, New Orleans Saints 17 -- Box Score",
"Los Angeles Rams 14, Arizona Cardinals 12 -- Box Score", "Miami Dolphins 39, New England Patriots 35 -- Box Score",
"New York Giants 28, Philadelphia Eagles 23 -- Box Score", "New York Jets 23, Buffalo Bills 3 -- Box Score",
"San Diego Chargers 37, Denver Broncos 34 -- Box Score", "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score",
"Seattle Seahawks 28, Washington Redskins 7 -- Box Score")
allgames_thisweek[1]
"Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score"
Each row has the following data [team1, team1score, team2, team2score, --, Box Score]
My data is all formatted the exact same way, meaning there's always a comma after the first team's score, and there's always a -- after the 2nd team's score. I'd like to create a dataframe that has 4 columns (team1, team1score, team2, team2score), so an output might look like this:
output_df
team1 team1score team2 team2score
1. Chicago Bears 21 Tampa Bay Buccaneers 9
Any thoughts on how I could achieve this? Any help is appreciated! Thanks

You can do this with dplyr + stringr:
library(dplyr)
library(stringr)
string %>%
str_replace("(?<=\\d)\\s.*--.+$", "") %>%
str_replace_all("\\s(?=\\d+\\b)", ",") %>%
strsplit(",\\s*") %>% # split on the comma plus any following space, so team2 has no leading blank
do.call(rbind, .) %>%
data.frame() %>%
setNames(c("team1", "team1score", "team2", "team2score"))
Result:
team1 team1score team2 team2score
1 Chicago Bears 21 Tampa Bay Buccaneers 9
2 Cleveland Browns 28 Cincinnati Bengals 20
3 Dallas Cowboys 26 Pittsburgh Steelers 9
4 Detroit Lions 31 Atlanta Falcons 28
5 Green Bay Packers 16 Minnesota Vikings 10
6 Indianapolis Colts 45 Houston Oilers 21
7 Kansas City Chiefs 30 New Orleans Saints 17
8 Los Angeles Rams 14 Arizona Cardinals 12
9 Miami Dolphins 39 New England Patriots 35
10 New York Giants 28 Philadelphia Eagles 23
11 New York Jets 23 Buffalo Bills 3
12 San Diego Chargers 37 Denver Broncos 34
13 San Francisco 49ers 44 Los Angeles Raiders 14
14 Seattle Seahawks 28 Washington Redskins 7
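If the scores are needed as numbers rather than strings, one alternative is a single pattern with four capture groups in base R. This is only a sketch under the assumption that every line follows the "team score, team score ... -- Box Score" layout from the question; the names games, pattern, parts and scores_df are made up for the illustration.

```r
# Hypothetical three-game sample in the same format as the question's data
games <- c("Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score",
           "Detroit Lions 31, Atlanta Falcons 28 (OT) -- Box Score",
           "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score")

# Four capture groups: team1, team1score, team2, team2score
pattern <- "^(.+) (\\d+), (.+) (\\d+)\\b.*--.*$"
m <- regmatches(games, regexec(pattern, games, perl = TRUE))

# Each list element holds the full match followed by the four groups
parts <- do.call(rbind, m)

scores_df <- data.frame(
  team1      = parts[, 2],
  team1score = as.integer(parts[, 3]),
  team2      = parts[, 4],
  team2score = as.integer(parts[, 5]),
  stringsAsFactors = FALSE
)
scores_df
```

Lines that do not match the pattern come back as empty vectors and are silently dropped by rbind, so it is worth comparing nrow(scores_df) with length(games) on real data.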
Notes:
(?<=\\d)\\s.*--.+$ matches a whitespace character (\\s) followed by any characters zero or more times (.*), the literal --, and any characters one or more times (.+) up to the end of the string ($). The extra condition (?<=\\d) requires that the whitespace immediately follows a digit.
(?<=...) is called a positive lookbehind: it checks that the text immediately before the current position matches the pattern in ..., without including that text in the match.
\\s(?=\\d+\\b) matches a whitespace character that is immediately followed ((?=...)) by one or more digits and then a word boundary (\\b). So this matches the space between each team name and its score.
(?=...) is a positive lookahead: it checks that the text immediately after the current position matches the pattern in ..., without including that text in the match.
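To see what each of the two patterns does on its own, here is a small sketch using base R's sub and gsub with perl = TRUE (which supports the same lookarounds) instead of stringr. The variable names x, step1 and step2 are only for illustration.

```r
# Two sample lines from the question, including the tricky "(OT)" and "49ers" cases
x <- c("Detroit Lions 31, Atlanta Falcons 28 (OT) -- Box Score",
       "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score")

# Step 1: drop everything from the whitespace after the last score onwards.
# The lookbehind (?<=\\d) makes sure that whitespace directly follows a digit.
step1 <- sub("(?<=\\d)\\s.*--.+$", "", x, perl = TRUE)
step1
# "Detroit Lions 31, Atlanta Falcons 28"
# "San Francisco 49ers 44, Los Angeles Raiders 14"

# Step 2: turn the space in front of each score into a comma.
# The lookahead (?=\\d+\\b) skips " 49ers" because no word boundary follows the digits.
step2 <- gsub("\\s(?=\\d+\\b)", ",", step1, perl = TRUE)
step2
# "Detroit Lions,31, Atlanta Falcons,28"
# "San Francisco 49ers,44, Los Angeles Raiders,14"
```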
Data:
string = c("Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score", "Cleveland Browns 28, Cincinnati Bengals 20 -- Box Score",
"Dallas Cowboys 26, Pittsburgh Steelers 9 -- Box Score", "Detroit Lions 31, Atlanta Falcons 28 (OT) -- Box Score",
"Green Bay Packers 16, Minnesota Vikings 10 -- Box Score", "Indianapolis Colts 45, Houston Oilers 21 -- Box Score",
"Kansas City Chiefs 30, New Orleans Saints 17 -- Box Score",
"Los Angeles Rams 14, Arizona Cardinals 12 -- Box Score", "Miami Dolphins 39, New England Patriots 35 -- Box Score",
"New York Giants 28, Philadelphia Eagles 23 -- Box Score", "New York Jets 23, Buffalo Bills 3 -- Box Score",
"San Diego Chargers 37, Denver Broncos 34 -- Box Score", "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score",
"Seattle Seahawks 28, Washington Redskins 7 -- Box Score")

Related

Replace For Loop to fill column depending on other column value

I have a two-column dataframe (HOME & AWAY) called 'gamelist' with sports games. The HOME column also includes some dates with the corresponding games listed below.
HOME AWAY
15 Oct 2019 Pre-season
Phoenix Suns Denver Nuggets
Utah Jazz Sacramento Kings
Dallas Mavericks Oklahoma City Thunder
Memphis Grizzlies Charlotte Hornets
14 Oct 2019 Pre-season
Miami Heat Atlanta Hawks
13 Oct 2019 Pre-season
Orlando Magic Philadelphia 76ers
Toronto Raptors Chicago Bulls
Washington Wizards Milwaukee Bucks
I want to create a new column with the dates for each game. Coming from an Excel VBA background, I've used a for loop, which gives the intended result, but I was wondering if there is a more efficient approach in R, and I'm sure there is.
This is the code I've used:
gamelist<-add_column(gamelist,SDATE="",.before = 1)
for(i in 1:nrow(gamelist)){
if(str_count(gamelist[[i,"HOME"]],"\\d")==6){
gamelist[i,"SDATE"]<-gamelist[[i,"HOME"]]
}else{
gamelist[i,"SDATE"]<-gamelist[[i-1,"SDATE"]]
}
}
Which gives me this as intended
SDATE HOME AWAY
15 Oct 2019 15 Oct 2019 Pre-season
15 Oct 2019 Phoenix Suns Denver Nuggets
15 Oct 2019 Utah Jazz Sacramento Kings
15 Oct 2019 Dallas Mavericks Oklahoma City Thunder
15 Oct 2019 Memphis Grizzlies Charlotte Hornets
14 Oct 2019 14 Oct 2019 Pre-season
14 Oct 2019 Miami Heat Atlanta Hawks
13 Oct 2019 13 Oct 2019 Pre-season
13 Oct 2019 Orlando Magic Philadelphia 76ers
13 Oct 2019 Toronto Raptors Chicago Bulls
13 Oct 2019 Washington Wizards Milwaukee Bucks
My apologies for the dataframe formatting, couldn't figure out how to reproduce one properly here.
Thanks for your help
We could use str_extract to pull out only the dates, so that rows without a match get NA, and then use fill to replace each NA with the previous non-NA value.
library(dplyr)
library(tidyr)
library(stringr)
gamelist %>%
mutate(SDATE = str_extract(HOME, "^\\d+ [A-Za-z]+ \\d{4}")) %>%
fill(SDATE)
# HOME AWAY SDATE
#1 15 Oct 2019 Pre-season 15 Oct 2019
#2 Phoenix Suns Denver Nuggets 15 Oct 2019
#3 Utah Jazz Sacramento Kings 15 Oct 2019
#4 Dallas Mavericks Oklahoma City Thunder 15 Oct 2019
#5 Memphis Grizzlies Charlotte Hornets 15 Oct 2019
#6 14 Oct 2019 Pre-season 14 Oct 2019
#7 Miami Heat Atlanta Hawks 14 Oct 2019
#8 13 Oct 2019 Pre-season 13 Oct 2019
#9 Orlando Magic Philadelphia 76ers 13 Oct 2019
#10 Toronto Raptors Chicago Bulls 13 Oct 2019
#11 Washington Wizards Milwaukee Bucks 13 Oct 2019
If we need the SDATE column first, we can use select
gamelist %>%
mutate(SDATE = str_extract(HOME, "^\\d+ [A-Za-z]+ \\d{4}")) %>%
fill(SDATE) %>%
select(SDATE, everything())
Or use add_column from tibble with either .after or .before
library(tibble)
gamelist %>%
add_column(SDATE = str_extract(.$HOME, "^\\d+ [A-Za-z]+ \\d{4}"),
.before = 1 ) %>%
fill(SDATE)
data
gamelist <- structure(list(HOME = c("15 Oct 2019", "Phoenix Suns", "Utah Jazz",
"Dallas Mavericks", "Memphis Grizzlies", "14 Oct 2019", "Miami Heat",
"13 Oct 2019", "Orlando Magic", "Toronto Raptors", "Washington Wizards"
), AWAY = c("Pre-season", "Denver Nuggets", "Sacramento Kings",
"Oklahoma City Thunder", "Charlotte Hornets", "Pre-season", "Atlanta Hawks",
"Pre-season", "Philadelphia 76ers", "Chicago Bulls", "Milwaukee Bucks"
)), class = "data.frame", row.names = c(NA, -11L))
If the date is always in the HOME column when the AWAY column is "Pre-season" (or some other predictable condition), then you could do something like:
# data
gamelist <- data.frame(
stringsAsFactors = FALSE,
HOME = c("15-Oct-19","Phoenix Suns",
"Utah Jazz","Dallas Mavericks","Memphis Grizzlies",
"14-Oct-19","Miami Heat","13-Oct-19","Orlando Magic",
"Toronto Raptors","Washington Wizards"),
AWAY = c("Pre-season","Denver Nuggets",
"Sacramento Kings","Oklahoma City Thunder",
"Charlotte Hornets","Pre-season","Atlanta Hawks","Pre-season",
"Philadelphia 76ers","Chicago Bulls","Milwaukee Bucks")
)
# create blank column to fill in
gamelist$date <- NA
# fill cases where there's a date
gamelist$date[gamelist$AWAY=="Pre-season"] <- gamelist$HOME[gamelist$AWAY=="Pre-season"]
# use zoo::na.locf() to fill in missing values
gamelist$date <- zoo::na.locf(gamelist$date)
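If you would rather avoid extra packages, the same "carry the last date forward" idea can be sketched in base R with grepl and cumsum. This assumes the first row is always a date row (otherwise the index would start at zero), and it uses a shortened, hypothetical copy of gamelist.

```r
# Hypothetical five-row subset of the question's data
gamelist <- data.frame(
  HOME = c("15 Oct 2019", "Phoenix Suns", "Utah Jazz", "14 Oct 2019", "Miami Heat"),
  AWAY = c("Pre-season", "Denver Nuggets", "Sacramento Kings", "Pre-season", "Atlanta Hawks"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose HOME value looks like "15 Oct 2019"
is_date <- grepl("^[0-9]+ [A-Za-z]+ [0-9]{4}$", gamelist$HOME)

# Assumption: the table starts with a date row
stopifnot(is_date[1])

# cumsum(is_date) counts how many date rows have been seen so far,
# which is exactly the index of the date that applies to each row
gamelist$SDATE <- gamelist$HOME[is_date][cumsum(is_date)]
gamelist
```

Adding gamelist[!is_date, ] afterwards would drop the date header rows, if only the games themselves are wanted.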

Rvest read table with cells that span multiple rows

I'm trying to scrape an irregular table from Wikipedia using rvest. The table has cells that span multiple rows. The documentation for html_table clearly states that this is a limitation. I'm just wondering if there's a workaround.
The table looks like this:
My code:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
html_table(fill=TRUE) %>% # fill=FALSE yields the same results
.[[1]]
Returns this:
Where there are several errors, for example: row 4 under "City" should be "Mesa", NOT "Chicago Cubs". I'd be happy with blank cells as I could "fill down" as needed, but the wrong data is a problem. Help is much appreciated.
I have a way to code it.
It is not perfect, a bit long but it does the trick:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
# get the lines of the table
lines <- url %>%
read_html() %>%
html_nodes(xpath="//table[starts-with(@class, 'wikitable')]") %>%
html_nodes(xpath = 'tbody/tr')
#define the empty table
ncol <- lines %>%
.[[1]] %>%
html_children()%>%
length()
nrow <- length(lines)
table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))
# fill the table
for(i in 1:nrow){
# get content of the line
linecontent <- lines[[i]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
# attribute the content to free columns
colselect <- is.na(table[i,])
table[i,colselect] <- linecontent
# get the line repetition of each columns
repetition <- lines[[i]]%>%
html_children()%>%
html_attr("rowspan")%>%
ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
as.numeric
# repeat the cells of the multiple rows down
for(j in 1:length(repetition)){
span <- repetition[j]
if(span > 1){
table[(i+1):(i+span-1),colselect][,j] <- rep(linecontent[j],span-1)
}
}
}
The idea is to get the HTML rows of the table into the lines variable by selecting the tr nodes. I then create an empty table: the number of columns is the number of children of the first row (because it contains the titles), and the number of rows is the length of lines. I fill it by hand in a for loop (I didn't manage to find a nicer way here).
The difficulty is that the number of cells given in a row changes when a multi-row cell from an earlier row already spans the current row. For example:
lines[[3]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
gives only 5 values :
[1] "Arizona League Athletics Gold" "Oakland Athletics" "Mesa" "Fitch Park"
[5] "10,000"
instead of 6, because the first column is East over 8 rows. This East value appears only in the first row it spans.
The trick is to repeat cells down the table when they have a rowspan attribute (meaning they span several rows). On the next row we can then select only the NA columns, so that the number of values given by the HTML row matches the number of free columns in the table we fill.
This is done with the colselect variable, a logical vector marking the free columns of the current row before the cells of that row are repeated downwards.
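The fill logic can be tried without any scraping. The sketch below replaces the HTML with a hypothetical list of rows, where each row lists only the cells present in its tr together with their rowspan values (1 for an ordinary cell). The names rows, grid and free are made up for this illustration.

```r
# Hypothetical parsed rows: text of each cell and its rowspan
rows <- list(
  list(text = c("Division", "Team", "City"), span = c(1, 1, 1)),
  list(text = c("East", "Angels", "Tempe"),  span = c(3, 1, 1)),
  list(text = c("Athletics Gold", "Mesa"),   span = c(1, 2)),
  list(text = c("Athletics Green"),          span = c(1))
)

n_col <- length(rows[[1]]$text)  # the header row has one cell per column
grid <- matrix(NA_character_, nrow = length(rows), ncol = n_col)

for (i in seq_along(rows)) {
  cells <- rows[[i]]
  # columns of this row not already filled by a span coming from above
  free <- which(is.na(grid[i, ]))
  stopifnot(length(free) == length(cells$text))
  for (j in seq_along(cells$text)) {
    # copy the cell into its own row and into the rows it spans
    last <- min(i + cells$span[j] - 1, nrow(grid))
    grid[i:last, free[j]] <- cells$text[j]
  }
}
grid
```

Because spanned cells are written into later rows ahead of time, each row only has to place its own cells into the columns that are still NA, which is the same role colselect plays above.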
The result :
V1 V2 V3 V4 V5 V6
1 Division Team MLB Affiliation City Stadium Capacity
2 East Arizona League Angels Los Angeles Angels Tempe Tempe Diablo Stadium 9,785
3 East Arizona League Athletics Gold Oakland Athletics Mesa Fitch Park 10,000
4 East Arizona League Athletics Green Oakland Athletics Mesa Fitch Park 10,000
5 East Arizona League Cubs 1 Chicago Cubs Mesa Sloan Park 15,000
6 East Arizona League Cubs 2 Chicago Cubs Mesa Sloan Park 15,000
7 East Arizona League Diamondbacks Arizona Diamondbacks Scottsdale Salt River Fields at Talking Stick 11,000
8 East Arizona League Giants Black San Francisco Giants Scottsdale Scottsdale Stadium 12,000
9 East Arizona League Giants Orange San Francisco Giants Scottsdale Scottsdale Stadium 12,000
10 Central Arizona League Brewers Gold Milwaukee Brewers Phoenix American Family Fields of Phoenix 8,000
11 Central Arizona League Dodgers Lasorda Los Angeles Dodgers Phoenix Camelback Ranch 12,000
12 Central Arizona League Indians Blue Cleveland Indians Goodyear Goodyear Ballpark 10,000
13 Central Arizona League Padres 2 San Diego Padres Peoria Peoria Sports Complex 12,882
14 Central Arizona League Reds Cincinnati Reds Goodyear Goodyear Ballpark 10,000
15 Central Arizona League White Sox Chicago White Sox Phoenix Camelback Ranch 12,000
16 West Arizona League Brewers Blue Milwaukee Brewers Phoenix American Family Fields of Phoenix 8,000
17 West Arizona League Dodgers Mota Los Angeles Dodgers Phoenix Camelback Ranch 12,000
18 West Arizona League Indians Red Cleveland Indians Goodyear Goodyear Ballpark 10,000
19 West Arizona League Mariners Seattle Mariners Peoria Peoria Sports Complex 12,882
20 West Arizona League Padres 1 San Diego Padres Peoria Peoria Sports Complex 12,882
21 West Arizona League Rangers Texas Rangers Surprise Surprise Stadium 10,500
22 West Arizona League Royals Kansas City Royals Surprise Surprise Stadium 10,500
Edit
I made a shorter version of the function, with more explanation here

How to have bar labels be names in Plotly for R

So I'm trying to make a bar chart that displays the most popular airports that flew to Chicago. For some reason, I'm finding it to be extremely difficult to have my bars be labeled by the airport names specifically.
I have a data frame called ty
> ty
Name
1 Atlanta, GA: Hartsfield-Jackson Atlanta International
2 New York, NY: LaGuardia
3 Minneapolis, MN: Minneapolis-St Paul International
4 Los Angeles, CA: Los Angeles International
5 Denver, CO: Denver International
6 Washington, DC: Ronald Reagan Washington National
7 Orlando, FL: Orlando International
8 Phoenix, AZ: Phoenix Sky Harbor International
9 Detroit, MI: Detroit Metro Wayne County
10 Las Vegas, NV: McCarran International
11 San Francisco, CA: San Francisco International
12 Dallas/Fort Worth, TX: Dallas/Fort Worth International
13 Boston, MA: Logan International
14 Philadelphia, PA: Philadelphia International
15 Newark, NJ: Newark Liberty International
I also have a data frame called df
id numArrivals
1 10397 964
2 12953 962
3 13487 883
4 12892 823
5 11292 776
6 11278 771
7 13204 725
8 14107 700
9 11433 672
10 12889 647
11 14771 611
12 11298 580
13 10721 569
14 14100 567
15 11618 488
Each id corresponds to an airport name: 10397 is Atlanta, GA: Hartsfield-Jackson Atlanta International, and the rest follow in the same order.
However, when I run:
plotly::plot_ly(df,x=ty["Name"],y=df$numArrivals,type="bar",color=I("rgba(0,92,124,1)"))
I am given this chart.
How can I make the labels of my bars into the names of the airport rather than just numbers?
Feel free to use ggplotly() to create your plot. I used the code below to create a small example.
example <- data.frame(airport = c("Atlanta, GA: Hartsfield-Jackson Atlanta International","New York, NY: LaGuardia","Minneapolis, MN: Minneapolis-St Paul International"),
id = c(10397,12953,13487),
numArrivals = c(964,962,883),stringsAsFactors = F)
library(ggplot2)
library(plotly)
a <- ggplot(example,aes(x=airport,y=numArrivals,fill=id)) + geom_bar(stat = "identity") + coord_flip()
ggplotly(a)
The final result looks like this.
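If you want to stay with plot_ly itself, note that ty["Name"] returns a one-column data frame rather than a vector, which is probably why the axis did not show the names. A minimal sketch, assuming the rows of ty and df line up one to one as described in the question (the data below is a hypothetical three-row subset, and arrivals is a made-up name):

```r
# Hypothetical three-row subset of the question's two data frames
ty <- data.frame(
  Name = c("Atlanta, GA: Hartsfield-Jackson Atlanta International",
           "New York, NY: LaGuardia",
           "Minneapolis, MN: Minneapolis-St Paul International"),
  stringsAsFactors = FALSE
)
df <- data.frame(id = c(10397, 12953, 13487),
                 numArrivals = c(964, 962, 883))

# Put the names next to the counts; rows are assumed to be in the same order
arrivals <- df
# A factor with explicit levels keeps the bars in data order instead of alphabetical order
arrivals$Name <- factor(ty$Name, levels = ty$Name)

# Build the chart only if plotly is installed; print p to display it
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(arrivals, x = ~Name, y = ~numArrivals, type = "bar")
}
```

Passing ty$Name (a vector) as x in the original call should also work; keeping everything in one data frame just makes the formula interface (~Name) available.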

R: Mission impossible? How to assign "New York" to a county

I run into problems assigning a county to some city places. When querying via the acs package
> geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
, you can see that "New York", for instance, has a bunch of counties. So do Los Angeles, Portland, Oklahoma, Columbus etc. How can such data be assigned to a "county"?
Following code is currently used to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works for cases of only one county name output in the query.
Script
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat
library(tigris)
library(acs)
library(dplyr) # needed for bind_rows(), left_join() and %>%
data(fips_codes) # FIPS codes with state, code, county information
GeoLookup <- lapply(dat,function(x) {
geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})
df <- bind_rows(GeoLookup)
#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")
# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).
df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df
Returns:
state state.name county.name place place.name state.abb statefips countyfips
1 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city <NA> <NA> <NA>
2 25 Massachusetts Suffolk County 7000 Boston city MA 25 025
3 6 California Los Angeles County 20802 East Los Angeles CDP CA 06 037
4 48 Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000 Dallas city <NA> <NA> <NA>
5 6 California San Mateo County 20956 East Palo Alto city CA 06 081
In order to retain data, the left_join might better be matched as "look for the county.name that contains the place.name (without the trailing word such as city), or choose the first item by default". It would be great to see how this could be done.
In general: I assume, there's no better way than this approach?
Thanks for your help!
What about something like the code below to create a "long" data frame for joining. We use the tidyverse pipe operator to chain operations. strsplit returns a list, which we unnest to stack the list values (the county names that go with each combination of state.name and place.name) into a long data frame where each county.name now gets its own row.
library(tigris)
library(acs)
library(tidyverse)
dat = geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
dat = dat %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York NA <NA> <NA>
2 36 New York 51000 New York city Bronx County
3 36 New York 51000 New York city Kings County
4 36 New York 51000 New York city New York County
5 36 New York 51000 New York city Queens County
6 36 New York 51000 New York city Richmond County
7 36 New York 51011 New York Mills village Oneida County
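The same reshaping can be sketched in base R, which may help to see what strsplit plus unnest is doing. The lookup data frame below is a hypothetical stand-in for the geo.lookup result (only two columns, typed in by hand), so no acs call is needed to run it.

```r
# Hypothetical stand-in for part of the geo.lookup() output
lookup <- data.frame(
  place.name  = c("New York city", "New York Mills village"),
  county.name = c("Bronx County, Kings County, New York County, Queens County, Richmond County",
                  "Oneida County"),
  stringsAsFactors = FALSE
)

# One character vector of counties per place
counties <- strsplit(lookup$county.name, ", ", fixed = TRUE)

# Repeat each place once per county, then stack the counties
long <- data.frame(
  place.name  = rep(lookup$place.name, lengths(counties)),
  county.name = unlist(counties),
  stringsAsFactors = FALSE
)
long
```

Each county now sits on its own row, so a join on county.name can find a match for every one of them.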
UPDATE: Regarding the second question in your comment, assuming you have the vector of metro areas already, how about this:
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
df <- map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
})
df
state state.name place place.name county.name
1 36 New York 51000 New York city Bronx County
2 36 New York 51000 New York city Kings County
3 36 New York 51000 New York city New York County
4 36 New York 51000 New York city Queens County
5 36 New York 51000 New York city Richmond County
6 36 New York 51011 New York Mills village Oneida County
7 25 Massachusetts 7000 Boston city Suffolk County
8 25 Massachusetts 7000 Boston city Suffolk County
9 6 California 20802 East Los Angeles CDP Los Angeles County
10 6 California 39612 Lake Los Angeles CDP Los Angeles County
11 6 California 44000 Los Angeles city Los Angeles County
12 48 Texas 19000 Dallas city Collin County
13 48 Texas 19000 Dallas city Dallas County
14 48 Texas 19000 Dallas city Denton County
15 48 Texas 19000 Dallas city Kaufman County
16 48 Texas 19000 Dallas city Rockwall County
17 48 Texas 40516 Lake Dallas city Denton County
18 6 California 20956 East Palo Alto city San Mateo County
19 6 California 55282 Palo Alto city Santa Clara County
UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that includes the same name as the city (for example, New York County in the case of New York city), or the first county in the list otherwise. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might have to tweak it a bit to make it work for the entire U.S. For example, for it to work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"....
map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest %>%
slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name), gsub(" County", "", county.name))), na.rm=TRUE))
})
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York 51000 New York city New York County
2 36 New York 51011 New York Mills village Oneida County
3 25 Massachusetts 7000 Boston city Suffolk County
4 6 California 20802 East Los Angeles CDP Los Angeles County
5 6 California 39612 Lake Los Angeles CDP Los Angeles County
6 6 California 44000 Los Angeles city Los Angeles County
7 48 Texas 19000 Dallas city Dallas County
8 48 Texas 40516 Lake Dallas city Denton County
9 6 California 20956 East Palo Alto city San Mateo County
10 6 California 55282 Palo Alto city Santa Clara County
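The selection rule can also be written as a small stand-alone helper, which makes it easier to test on a few cases. This is a sketch only: pick_county is a hypothetical name, and unlike the grepl call above it uses an exact comparison of the county name (without " County") against the place name (without its last word).

```r
# Hypothetical helper: choose the county named after the place, else the first one
pick_county <- function(place, counties) {
  city <- sub(" [A-Za-z]+$", "", place)    # drop the trailing "city", "village", "CDP", ...
  stem <- sub(" County$", "", counties)    # "New York County" -> "New York"
  hit <- which(stem == city)
  if (length(hit) > 0) counties[hit[1]] else counties[1]
}

pick_county("New York city",
            c("Bronx County", "Kings County", "New York County",
              "Queens County", "Richmond County"))
# "New York County"

pick_county("Lake Dallas city", "Denton County")
# "Denton County"
```

As noted above, states that use a different suffix (for example Parish in Louisiana) would need the sub pattern extended.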
Could you prep the data by using something like the below code?
new_york_data <- geo.lookup(state = "NY", place = "New York")
prep_data <- function(full_data){
output <- data.frame()
for(row in 1:nrow(full_data)){
new_rows <- replicateCounty(full_data[row, ])
output <- plyr::rbind.fill(output, new_rows)
}
return(output)
}
replicateCounty <- function(row){
counties <- str_trim(unlist(str_split(row$county.name, ",")))
output <- data.frame(state = row$state,
state.name = row$state.name,
county.name = counties,
place = row$place,
place.name = row$place.name)
return(output)
}
prep_data(new_york_data)
It's a little messy, and you'll need the plyr and stringr packages. Once you prep the data, you should be able to join on it.

R append function

I'm writing an R script that parses out a state abbreviation from a column in a data.frame. It then uses the which() function to determine the index of the found state abbreviation in a lookup data frame that contains state abbreviations and their corresponding full state names. I then use the found index to access the full state name and append it to a vector called completeList. Finally, I add the vector completeList, which should contain the full state names, to my original data frame under a newly created column STATE_NAME.
However, for some reason completeList only contains the indexes that were found earlier and not the full state names that I expected. What did I do wrong?
#read in csv weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
#read in csv state Abbreviation file
abbreviationsFile<-read.csv(header=TRUE, file="C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
completeList<-append(completeList, addCompleteStateName)
}
file["STATE_NAME"]<-completeList
>completeList
[1] 27 17 17 29 42 50 20 53 45 19 22 52 9 29 26 37 8 58 35
Here is the csv file where the abbreviation of the station is found
STATION STATION_NAME ELEVATION
GHCND:USC00202381 EAST JORDAN MI US 180.1
GHCND:USC00111290 CARLYLE RESERVOIR IL US 153
GHCND:USC00116661 PAW PAW 2 S IL US 274.9
GHCND:USC00228556 SUMRALL MS US 88.1
GHCND:USC00340292 ARDMORE OK US 267.9
GHCND:USC00408522 SPARTA WASTEWATER PLANT TN US 289.9
GHCND:USC00148341 VALLEY FALLS KS US 283.5
GHCND:USW00014742 BURLINGTON INTERNATIONAL AIRPORT VT US 101.2
GHCND:USC00367782 SALINA 3 W PA US 338
GHCND:USC00134142 IOWA FALLS IA US 356.9
GHCND:USC00161565 CARVILLE 2 SW LA US 9.1
GHCND:USC00421446 CITY CRK WATER PLANT UT US 1628.9
GHCND:USW00013781 WILMINGTON NEW CASTLE CO AIRPORT DE US 22.6
GHCND:USC00229400 WATER VALLEY MS US 116.1
GHCND:USC00190562 BELCHERTOWN MA US 171
GHCND:USW00094728 NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US 40.2
GHCND:USC00060973 BURLINGTON CT US 155.4
GHCND:USC00475516 MINOCQUA WI US 484.9
GHCND:USC00286055 NEW BRUNSWICK 3 SE NJ US 38.1
Here is the csv file where we look up abbreviations and find the corresponding full state name
State/Possession Abbreviation
Alabama AL
Alaska AK
American Samoa AS
Arizona AZ
Arkansas AR
California CA
Colorado CO
Connecticut CT
Delaware DE
District of Columbia DC
Federated States of Micronesia FM
Florida FL
Georgia GA
Guam GU
Hawaii HI
Idaho ID
Illinois IL
Indiana IN
Iowa IA
Kansas KS
Kentucky KY
Louisiana LA
Maine ME
Marshall Islands MH
Maryland MD
Massachusetts MA
Michigan MI
Minnesota MN
Mississippi MS
Missouri MO
Montana MT
Nebraska NE
Nevada NV
New Hampshire NH
New Jersey NJ
New Mexico NM
New York NY
North Carolina NC
North Dakota ND
Northern Mariana Islands MP
Ohio OH
Oklahoma OK
Oregon OR
Palau PW
Pennsylvania PA
Puerto Rico PR
Rhode Island RI
South Carolina SC
South Dakota SD
Tennessee TN
Texas TX
Utah UT
Vermont VT
Virgin Islands VI
Virginia VA
Washington WA
West Virginia WV
Wisconsin WI
Wyoming WY
Why am I not getting the full state name?
figured it out 😎
#read in csv weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
#read in csv state Abbreviation file
abbreviationsFile<-read.csv(header=TRUE, file="C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
completeList<-append(completeList, toString(addCompleteStateName))
}
file["STATE_NAME"]<-completeList
the type was being forced to an integer
The variable addCompleteStateName is a factor. You can convert it to a character to append the labels.
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
# modified to convert addCompleteStateName to character
completeList<-append(completeList, as.character(addCompleteStateName))
}
file["STATE_NAME"]<-completeList
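The loop can also be avoided altogether with match(), which does the lookup for the whole column at once. A minimal sketch with small hypothetical data frames (stations and abbr stand in for the two CSV files, and the column name State is an assumption, since read.csv would turn the State/Possession header into something like State.Possession). Reading the lookup with stringsAsFactors = FALSE, which is the default from R 4.0.0 onwards, keeps the names as character so no factor codes can leak in.

```r
# Hypothetical stand-ins for the two CSV files
stations <- data.frame(
  STATION_NAME = c("EAST JORDAN MI US", "PAW PAW 2 S IL US", "SUMRALL MS US"),
  stringsAsFactors = FALSE
)
abbr <- data.frame(
  State        = c("Illinois", "Michigan", "Mississippi"),
  Abbreviation = c("IL", "MI", "MS"),
  stringsAsFactors = FALSE
)

# The state code sits in the two characters before the trailing " US"
n <- nchar(stations$STATION_NAME)
code <- substring(stations$STATION_NAME, n - 4, n - 3)

# match() returns, for each code, its row position in the lookup table
stations$STATE_NAME <- abbr$State[match(code, abbr$Abbreviation)]
stations
```

Codes that are missing from the lookup table come back as NA instead of silently shortening the result, which is what happens in the loop when which() finds nothing.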
