How to filter by a given phrase in a single column in R

Below is some sample data. It seems pretty basic, but my internet searches have not yielded a clear answer. In this case, how would I create a new data frame where areaname ends with the phrase MSA, and not MicroSA or other possibilities?
areaname <- c("Albany NY MSA", "Albany GA MSA", "Aberdeen SD MicroSA", "Reno NV MSA", "Fernley NV MicroSA", "Syracuse NY MSA")
Employment <- c(100,104,108,112,116,88)
testitem <- data.frame(areaname, Employment)

library(dplyr)

testitem %>%
  filter(stringr::str_ends(areaname, "MSA"))
areaname Employment
1 Albany NY MSA 100
2 Albany GA MSA 104
3 Reno NV MSA 112
4 Syracuse NY MSA 88

Another option is to use stringi:
library(dplyr)
testitem %>%
  filter(stringi::stri_endswith_fixed(areaname, "MSA"))
Output
areaname Employment
1 Albany NY MSA 100
2 Albany GA MSA 104
3 Reno NV MSA 112
4 Syracuse NY MSA 88
Or you can use grepl:
library(dplyr)
testitem %>%
  filter(grepl("MSA$", areaname))
Or use endsWith:
testitem %>%
  filter(endsWith(areaname, "MSA"))
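For completeness, the same filter works in base R without any pipes; a minimal sketch using subset() and the grepl() pattern from above:
# Base R: keep only the rows whose areaname ends in "MSA"
subset(testitem, grepl("MSA$", areaname))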

Related

Is it possible to convert lines from a text file into columns to get a dataframe?

I have a text file containing information on book title, author name, and country of birth, which appear on separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"
There are built-in functions for dealing with this kind of data (xx here is the book data as a single string, e.g. as built in the last answer below):
data.frame(scan(text = xx, multi.line = TRUE,
                what = list(Author = "", Book = "", Country = ""), sep = "\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
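The same call should also work directly on a file by passing scan() a file name instead of text = (a minimal sketch; the file name is an assumption):
# Read Author/Book/Country triples straight from a file
data.frame(scan("books.txt", multi.line = TRUE,
                what = list(Author = "", Book = "", Country = ""), sep = "\n"))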
You can create a 3-column matrix from one column of data.
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
  data.frame() |>
  setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
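A closely related variant reads the file with readLines() instead of read.table(), which avoids the sep = ',' assumption if a title ever contains a comma (a sketch; the file name is an assumption):
# Read raw lines and reshape them into a 3-column data frame
lines <- readLines("books.txt")
result <- data.frame(matrix(lines, ncol = 3, byrow = TRUE)) |>
  setNames(c("Author", "Book", "Country"))
result <- cbind(ID = seq_len(nrow(result)), result)
result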
There aren't any built-in functions that handle data in exactly this shape, but you can reshape your data after importing it.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
  mutate(
    rid = ((row_number() - 1) %% 3) + 1,
    pid = (row_number() - 1) %/% 3 + 1) %>%
  mutate(col = case_when(rid == 1 ~ "Author", rid == 2 ~ "Book", rid == 3 ~ "Country")) %>%
  select(-rid) %>%
  pivot_wider(names_from = col, values_from = V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA

Find airports that have flights connected to them

library(tidyverse)
library(nycflights13)
I want to find out which airports have flights to them. My attempt is below, but it is not correct (it yields far more rows than there are airports):
airPortFlights <- airports %>% rename(dest=faa) %>% left_join(flights, "dest"=faa)
If anyone wonders why I do the rename above, that's because it won't let me do
airports %>% left_join(flights, "dest"=faa)
It gives
Error: `by` required, because the data sources have no common variables
I even tried airports %>% left_join(flights, by=c("dest"=faa)) and several other attempts, which also did not work.
Thanks in advance.
You want an inner_join and then either count the distinct flights, or just list the airports using distinct. Here I count them.
library(dplyr)
inner_join(airports, flights, by = c("faa" = "dest")) %>%
  count(faa, name) %>% # number of flights
  arrange(-n)
# A tibble: 101 x 3
faa name n
<chr> <chr> <int>
1 ORD Chicago Ohare Intl 17283
2 ATL Hartsfield Jackson Atlanta Intl 17215
3 LAX Los Angeles Intl 16174
4 BOS General Edward Lawrence Logan Intl 15508
5 MCO Orlando Intl 14082
6 CLT Charlotte Douglas Intl 14064
7 SFO San Francisco Intl 13331
8 FLL Fort Lauderdale Hollywood Intl 12055
9 MIA Miami Intl 11728
10 DCA Ronald Reagan Washington Natl 9705
# ... with 91 more rows
So 101 of the 1,458 airports in this dataset have at least 1 record in the flights dataset, with Chicago's O'Hare Intl airport having the most flights from New York.
And just for fun, the following lists the airports that don't have any flights from NY:
anti_join(airports, flights, by=c("faa"="dest"))
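If you only need the list of airports rather than the counts, a semi_join keeps exactly the airports that have at least one matching flight; a minimal sketch:
# Airports that appear as a destination in at least one flight
semi_join(airports, flights, by = c("faa" = "dest"))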

Create a Column that Counts (one at a time) Repeated Occurrences Within Groups? (in R)

mydata <- read.table(header=TRUE, text="
Away Home Game.ID Points.A Points.H Series.ID Series.Wins
Denver Utah aaa123 121 123 aaabbb Utah
Denver Utah aaa124 132 116 aaabbb Denver
Utah Denver aaa125 117 121 aaabbb Denver
Utah Denver aaa126 112 120 aaabbb Denver
Denver Utah aaa127 115 122 aaabbb Utah
Atlanta Boston aab123 112 114 aaaccc Boston
")
I am trying to create an additional column that counts, one at a time, the Series.Wins within each Series.ID group. So, from the data above, that column would look like:
new.column <- c(1, 1, 2, 3, 2, 1)
The ultimate goal is to come up with a series record column of "home wins - away wins":
Record <- c("1-0", "1-1", "2-1", "3-1", "2-3", "1-0")
This seemed to work:
library(dplyr)

mydata <- mydata %>%
  group_by(Series.Wins) %>%
  group_by(Series.ID, .add = TRUE) %>% # `add =` in older dplyr; `.add =` since dplyr 1.0
  mutate(id = seq_len(n()))
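The running "home wins - away wins" record from the question isn't covered above; one way to build it is with a small helper that counts, for each game, how many of the series games so far were won by that game's home (or away) team. This is a sketch assuming the column names shown in the question:
library(dplyr)

# Running count of wins by team[i] over the first i games of the series
running_wins <- function(winner, team) {
  sapply(seq_along(winner), function(i) sum(winner[seq_len(i)] == team[i]))
}

mydata %>%
  group_by(Series.ID) %>%
  mutate(Record = paste(running_wins(Series.Wins, Home),
                        running_wins(Series.Wins, Away),
                        sep = "-")) %>%
  ungroup()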

Convert List to Tibble Plus Add Column With List Names

I'm working on a web scraping / mapping project where I've scraped address data from a restaurant website and I've stored the results as a list - in this example, called loc_list.
The question is: how best to convert these list items into a single data.frame / tibble (currently using bind_rows()), but ALSO have a column titled metro in the new data.frame that corresponds to each list item's name? In my example, the output would have 3 alpharetta rows, followed by 3 atlanta rows, then 1 buford row.
loc_list
$alpharetta
# A tibble: 3 x 2
names address
<chr> <chr>
1 East Roswell US 2630 Holcomb Bridge Rd Alpharetta, GA 30022
2 Old Milton US 4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022
3 Windward US 875 N Main Street Ste 306 Alpharetta, GA 30009
$atlanta
# A tibble: 3 x 2
names address
<chr> <chr>
1 Philips Arena US 100 Techwood Drive Atlanta, GA 30303
2 Virginia Highlands US 1006 N Highland Ave Atlanta, GA 30306
3 Perimeter US 1211 Ashford Crossing Atlanta, GA 30346
$buford
# A tibble: 1 x 2
names address
<chr> <chr>
1 Woodward US 3250 Woodward Crossing Blvd Buford, GA 30519
Targeted output:
names address metro
East Ros... US 2630... alpharetta
As alistaire pointed out, bind_rows() with the .id argument is enough. Here is some example data:
library(dplyr)

alpharetta <- tibble(names = c("East Roswell", "Old Milton"),
                     address = c("US 2630 Holcomb Bridge Rd Alpharetta, GA 30022",
                                 "4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022"))
atlanta <- tibble(names = c("Philips Arena", "Virginia Highlands"),
                  address = c("US 100 Techwood Drive Atlanta, GA 30303",
                              "US 1006 N Highland Ave Atlanta, GA 30306"))
loc_list <- list(alpharetta = alpharetta, atlanta = atlanta)

bind_rows(loc_list, .id = "metro")
# A tibble: 4 x 3
metro names address
<chr> <chr> <chr>
1 alpharetta East Roswell US 2630 Holcomb Bridge Rd Alpharetta, GA 30022
2 alpharetta Old Milton 4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022
3 atlanta Philips Arena US 100 Techwood Drive Atlanta, GA 30303
4 atlanta Virginia Highlands US 1006 N Highland Ave Atlanta, GA 30306
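Note that .id puts metro first; if you want it last, as in the targeted output, you could move it afterwards, e.g. with dplyr::relocate() (a small sketch, assuming dplyr >= 1.0.0):
bind_rows(loc_list, .id = "metro") %>%
  relocate(metro, .after = address) # metro becomes the last column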

R - Finding a specific row of a dataframe and then adding data from that row to another data frame

I have two data frames: one full of data about individuals, including their street name and house number but not their house size, and another with information about each house, including street name, house number, and house size, but no data on the individuals living in each house. I'd like to add the size information to the first data frame as a new column so I can see the house size for each individual.
I have over 200,000 individuals and around 100,000 houses, and the methods I've tried so far (cutting down the second data frame for each individual) are painfully slow. Is there an efficient way to do this? Thank you.
Using @jazzurro's example, another option for larger datasets would be to use data.table:
library(data.table)
setkey(setDT(df1), street, num)
setkey(setDT(df2), street, num)
df2[df1]
# size street num person
#1: large liliha st 3 bob
#2: NA mahalo st 32 dan
#3: small makiki st 15 ana
#4: NA nehoa st 11 ellen
#5: medium nuuanu ave 8 cathy
Here is my suggestion. Given what you described, I created some sample data; however, please try to provide sample data yourself next time. When you provide sample data and your code, you are more likely to receive help and you save people time. You have two key variables for merging the two data frames: street name and house number. Here, I chose to keep all data points in df1.
df1 <- data.frame(person = c("ana", "bob", "cathy", "dan", "ellen"),
                  street = c("makiki st", "liliha st", "nuuanu ave", "mahalo st", "nehoa st"),
                  num = c(15, 3, 8, 32, 11),
                  stringsAsFactors = FALSE)
#person street num
#1 ana makiki st 15
#2 bob liliha st 3
#3 cathy nuuanu ave 8
#4 dan mahalo st 32
#5 ellen nehoa st 11
df2 <- data.frame(size = c("small", "large", "medium"),
                  street = c("makiki st", "liliha st", "nuuanu ave"),
                  num = c(15, 3, 8),
                  stringsAsFactors = FALSE)
# size street num
#1 small makiki st 15
#2 large liliha st 3
#3 medium nuuanu ave 8
library(dplyr)
left_join(df1, df2)
# street num person size
#1 makiki st 15 ana small
#2 liliha st 3 bob large
#3 nuuanu ave 8 cathy medium
#4 mahalo st 32 dan <NA>
#5 nehoa st 11 ellen <NA>
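If you prefer to stay in base R, merge() does the same left join when you spell out the key columns and keep all rows of df1; a minimal sketch:
# Base R left join on the two key columns; all.x = TRUE keeps every row of df1
merge(df1, df2, by = c("street", "num"), all.x = TRUE)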
