library(tidyverse)
library(nycflights13)
I want to find out which airports have flights to them. My attempt is below, but it is not correct: it yields a row count far larger than the number of airports.
airPortFlights <- airports %>% rename(dest=faa) %>% left_join(flights, "dest"=faa)
If anyone wonders why I do the rename above, that's because it won't let me do
airports %>% left_join(flights, "dest"=faa)
It gives
Error: by required, because the data sources have no common variables
I even tried airports %>% left_join(flights, by=c("dest"=faa)) and several other variations, none of which work.
Thanks in advance.
You want an inner_join and then either count the distinct flights, or just list the airports using distinct. Here I count them.
library(dplyr)
inner_join(airports, flights, by=c("faa"="dest")) %>%
count(faa, name) %>% # number of flights
arrange(-n)
# A tibble: 101 x 3
faa name n
<chr> <chr> <int>
1 ORD Chicago Ohare Intl 17283
2 ATL Hartsfield Jackson Atlanta Intl 17215
3 LAX Los Angeles Intl 16174
4 BOS General Edward Lawrence Logan Intl 15508
5 MCO Orlando Intl 14082
6 CLT Charlotte Douglas Intl 14064
7 SFO San Francisco Intl 13331
8 FLL Fort Lauderdale Hollywood Intl 12055
9 MIA Miami Intl 11728
10 DCA Ronald Reagan Washington Natl 9705
# ... with 91 more rows
So 101 of the 1,458 airports in this dataset have at least 1 record in the flights dataset, with Chicago's O'Hare Intl airport having the most flights from New York.
And just for fun, the following lists the airports that don't have any flights from NY:
anti_join(airports, flights, by=c("faa"="dest"))
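To make the row-count issue concrete: a left_join returns one row per matching flight, so the result is roughly as long as flights rather than airports. Note also that the names in `by` go in the order `c("column in x" = "column in y")`, both quoted. The sketch below uses two tiny made-up tables (`airports_toy` and `flights_toy` are hypothetical stand-ins, not the nycflights13 data) and shows two ways to list each airport once: `semi_join`, which is an alternative to the approach above, and `inner_join` followed by `distinct`.

```r
library(dplyr)

# Hypothetical stand-ins for nycflights13::airports and nycflights13::flights
airports_toy <- tibble(
  faa  = c("ORD", "ATL", "XYZ"),
  name = c("Chicago Ohare Intl", "Hartsfield Jackson Atlanta Intl", "Made-up Field")
)
flights_toy <- tibble(
  dest   = c("ORD", "ORD", "ATL"),
  flight = c(1L, 2L, 3L)
)

# semi_join keeps each row of x that has at least one match in y,
# without duplicating it once per flight
with_flights <- semi_join(airports_toy, flights_toy, by = c("faa" = "dest"))

# Same set of airports via inner_join + distinct
with_flights2 <- inner_join(airports_toy, flights_toy, by = c("faa" = "dest")) %>%
  distinct(faa, name)

# left_join, by contrast, repeats ORD once per flight and keeps XYZ with NA
left_rows <- nrow(left_join(airports_toy, flights_toy, by = c("faa" = "dest")))
```

Here `with_flights` and `with_flights2` both contain ORD and ATL only, while the left join has four rows for three airports.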
I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.
population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin
I expect the output to look like this:
population city country
500,000 Oslo Norway
750,000 Bristol England
500,000 Liverpool England
1,000,000 Dublin Ireland
How can I add a column of country names based on the city and population to a large dataset in R?
I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.
library(maps)
library(dplyr)
data("world.cities")
df <- readr::read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df |>
inner_join(
select(world.cities, name, country.etc, pop),
by = c("city" = "name")
) |> group_by(city) |>
filter(
abs(pop - population) == min(abs(pop - population))
)
# A tibble: 4 x 4
# Groups: city [4]
# population city country.etc pop
# <dbl> <chr> <chr> <int>
# 1 500000 Oslo Norway 821445
# 2 750000 Bristol UK 432967
# 3 500000 Liverpool UK 468584
# 4 1000000 Dublin Ireland 1030431
As others have noted, cities with these names exist in other countries as well.
library(tidyverse)
library(maps)
data("world.cities")
df <- read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df %>%
merge(., world.cities %>%
select(name, country.etc),
by.x = "city",
by.y = "name")
# A tibble: 7 × 3
city population country.etc
<chr> <dbl> <chr>
1 Bristol 750000 UK
2 Bristol 750000 USA
3 Dublin 1000000 USA
4 Dublin 1000000 Ireland
5 Liverpool 500000 UK
6 Liverpool 500000 Canada
7 Oslo 500000 Norway
I think your best bet would be to add a new column called country to your dataset and fill it in. This is part of the data preparation phase of the CRISP-DM process, so it is not uncommon. If that does not answer your question, please let me know and I will do my best to help.
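If the country column is filled in by hand, one way to keep it manageable is a small lookup table that is matched against the city column. This is only a sketch: the `lookup` table is something you would have to build yourself, and the values here are taken from the expected output in the question.

```r
# Data from the question
cities <- data.frame(
  population = c(500000, 750000, 500000, 1000000),
  city       = c("Oslo", "Bristol", "Liverpool", "Dublin"),
  stringsAsFactors = FALSE
)

# Hand-maintained lookup table (an assumption: you supply this yourself)
lookup <- data.frame(
  city    = c("Oslo", "Bristol", "Liverpool", "Dublin"),
  country = c("Norway", "England", "England", "Ireland"),
  stringsAsFactors = FALSE
)

# match() returns, for each city, its row position in the lookup table;
# cities missing from the lookup get NA
cities$country <- lookup$country[match(cities$city, lookup$city)]
```

Unlike a join on city name against world.cities, this cannot produce extra rows for ambiguous names, because each city appears in the lookup exactly once.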
Below is the sample data. This seems pretty basic, but my internet searches have not yielded a clear answer. How would I create a new data frame that keeps only the rows where areaname ends with the phrase MSA, and not MicroSA or other possibilities?
areaname <- c("Albany NY MSA", "Albany GA MSA", "Aberdeen SD MicroSA", "Reno NV MSA", "Fernley NV MicroSA", "Syracuse NY MSA")
Employment <- c(100,104,108,112,116,88)
testitem <- data.frame(areaname, Employment)
library(dplyr)
testitem %>%
filter(stringr::str_ends(areaname, "MSA"))
areaname Employment
1 Albany NY MSA 100
2 Albany GA MSA 104
3 Reno NV MSA 112
4 Syracuse NY MSA 88
Another option is to use stringi:
library(dplyr)
testitem %>%
filter(stringi::stri_endswith_fixed(areaname, "MSA"))
Output
areaname Employment
1 Albany NY MSA 100
2 Albany GA MSA 104
3 Reno NV MSA 112
4 Syracuse NY MSA 88
Or you can use grepl:
library(dplyr)
testitem %>%
filter(grepl("MSA$", areaname))
Or use endsWith:
testitem %>%
filter(endsWith(areaname, "MSA"))
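For reference, the same filter can be done without any packages. The sketch below rebuilds the sample data and checks that the regex and `endsWith` approaches select the same rows; "MicroSA" is excluded because it ends in "oSA", not "MSA".

```r
areaname <- c("Albany NY MSA", "Albany GA MSA", "Aberdeen SD MicroSA",
              "Reno NV MSA", "Fernley NV MicroSA", "Syracuse NY MSA")
Employment <- c(100, 104, 108, 112, 116, 88)
testitem <- data.frame(areaname, Employment, stringsAsFactors = FALSE)

# Two equivalent tests for "ends with MSA"
keep_regex  <- grepl("MSA$", testitem$areaname)     # $ anchors the match at the end
keep_suffix <- endsWith(testitem$areaname, "MSA")   # needs a character vector

# Logical indexing on rows gives the filtered data frame
msa_only <- testitem[keep_suffix, ]
```

If the real data may contain trailing whitespace, wrapping the column in `trimws()` before testing would be a reasonable precaution.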
I have a text file containing information on book title, author name, and country of birth, which appear on separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"
There are built-in functions for dealing with this kind of data:
data.frame(scan(text=xx, multi.line=TRUE,
what=list(Author="", Book="", Country=""), sep="\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
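The call above relies on an `xx` string that is not defined in this answer. Here is a self-contained sketch of the same idea, with a shortened `xx` (two records only) and an ID column added. With `sep = "\n"` every line is one field, and `what` names the three fields that make up a record; `multi.line = TRUE` lets a record span several lines.

```r
# Shortened version of the input: one field per line, three lines per record
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA"

# scan() returns a named list with one character vector per field
fields <- scan(text = xx, multi.line = TRUE, quiet = TRUE,
               what = list(Author = "", Book = "", Country = ""), sep = "\n")

books <- data.frame(fields, stringsAsFactors = FALSE)
books <- cbind(ID = seq_len(nrow(books)), books)
```

To read from a file instead of a string, replace `text = xx` with `file = "data.txt"` (the file name is an assumption).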
You can create a 3-column matrix from one column of data.
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
data.frame() |>
setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
There aren't any built in functions that handle data like this. But you can reshape your data after importing.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
mutate(
rid = ((row_number()-1) %% 3)+1,
pid = (row_number()-1) %/%3 + 1) %>%
mutate(col=case_when(rid==1~"Author",rid==2~"Book", rid==3~"Country")) %>%
select(-rid) %>%
pivot_wider(names_from=col, values_from=V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA
I am trying to find data entry errors in the names and locations in my dataset by fuzzy matching. I have a unique key from the original data, siterow_id, and have made a new key, pi_key, where I already identified some hard matches (no fuzzy matching). After running the fuzzy matching I get duplicate rows: for some siterow_ids, the same match appears twice, once from the left side of the join and once from the right. I can manually look at the data, see where this occurs, and hard-code the removal of those rows, but I want a more algorithmic way of doing this as I move to a larger dataset with many more matches.
I tried doing it this way but it removes the matches on the left and the right. If possible I would love a tidyverse way to do this and not a loop.
The table output is included below. You can see a duplicate in rows 8 and 9.
for(site in three_letter_matches$siterow_id.x){
if (any(three_letter_matches$siterow_id.y == site)) {
three_letter_matches <- three_letter_matches[!three_letter_matches$siterow_id.y == site,]
}
}
pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 6309 1-9CH29M kim kevin san f~ united s~ 11870 1-HC3YY6
2 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-2QBRZ2
3 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3AHHSU
4 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3JYF8V
5 7567 1-CW4DXI bar jair ramat~ israel 8822 1-E3UILG
6 8822 1-E3UILG bar jair ramat~ israel 7567 1-CW4DXI
7 11870 1-HC3YY6 kim kevin san f~ united s~ 6309 1-9CH29M
8 12357 1-HUUEA6 lee hyojin daeje~ korea re~ 13460 1-IGKCPP
9 13460 1-IGKCPP lee hyo jin daeje~ korea re~ 12357 1-HUUEA6
I found another way to do it
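library(dplyr)
library(magrittr) # provides the %<>% assignment pipe
# rows whose left-hand id also appears somewhere on the right-hand side
update <- three_letter_matches[!is.na(match(three_letter_matches$siterow_id.x, three_letter_matches$siterow_id.y)),]
# after sorting by name, keep every other row of the mirrored matches
update %<>% arrange(last_name.x, first_name.x) %>%
filter(row_number() %% 2 != 0)
# drop those rows from the full set of matches
three_letter_matches_update <- three_letter_matches %>%
anti_join(update)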
Still open to suggestions.
Not the easiest problem, but there are a few ways to do this. The first that comes to mind is a bit slow, because it uses rowwise(), which processes one row at a time much like map() or lapply():
NOTE: This only works if siterow_id.x/y are character vectors. Won't work for factors.
three_letter_matches <- three_letter_matches %>%
rowwise() %>%
mutate(both_values = paste0(sort(c(siterow_id.x,siterow_id.y)),collapse = ",")) %>%
ungroup() %>%
distinct(both_values,.keep_all = TRUE) %>%
select(-both_values)
# pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
# 6309 1-9CH29M kim kevin san f~ united s~ 11870 1-HC3YY6
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-2QBRZ2
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3AHHSU
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3JYF8V
# 7567 1-CW4DXI bar jair ramat~ israel 8822 1-E3UILG
# 12357 1-HUUEA6 lee hyojin daeje~ korea re~ 13460 1-IGKCPP
What this does: rowwise() makes the mutate work on one row at a time. Within each row, the two siterow_id values are sorted so that both orderings of a pair come out the same, then pasted together into a single string that is easy to compare. ungroup() removes the rowwise grouping so that the later steps see all rows again. distinct() then keeps only the first row for each value of the new column, with .keep_all = TRUE retaining all the other columns. Finally, select() removes the helper column.
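A vectorized variant of the same idea avoids rowwise() altogether. This is a sketch of a different technique than the one above: `pmin()` and `pmax()` compare the two id columns element by element, so the smaller id always comes first in the key. The toy table below reuses a few of the siterow_id pairs from the question and omits the other columns.

```r
# A subset of the id pairs from the question (other columns omitted)
matches <- data.frame(
  siterow_id.x = c("1-9CH29M", "1-CJGRSZ", "1-CW4DXI", "1-E3UILG",
                   "1-HC3YY6", "1-HUUEA6", "1-IGKCPP"),
  siterow_id.y = c("1-HC3YY6", "1-2QBRZ2", "1-E3UILG", "1-CW4DXI",
                   "1-9CH29M", "1-IGKCPP", "1-HUUEA6"),
  stringsAsFactors = FALSE
)

# Order-independent key: smaller id first, larger id second
pair_key <- paste(pmin(matches$siterow_id.x, matches$siterow_id.y),
                  pmax(matches$siterow_id.x, matches$siterow_id.y),
                  sep = ",")

# Keep only the first occurrence of each unordered pair
deduped <- matches[!duplicated(pair_key), ]
```

As with the rowwise version, this assumes the id columns are character vectors, not factors. Inside a dplyr pipeline the same key can be built with `mutate(pair_key = paste(pmin(siterow_id.x, siterow_id.y), pmax(siterow_id.x, siterow_id.y), sep = ","))` followed by `distinct(pair_key, .keep_all = TRUE)`.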
I'm working on a web scraping / mapping project where I've scraped address data from a restaurant website and I've stored the results as a list - in this example, called loc_list.
Question is, how best to convert these list items into a single data.frame / tibble (currently using bind_rows()) but ALSO have, in the new data.frame, a column titled metro that holds the name of the list item each row came from. In my example, the output would have 3 alpharetta rows, followed by 3 atlanta rows, then 1 buford row.
loc_list
$alpharetta
# A tibble: 3 x 2
names address
<chr> <chr>
1 East Roswell US 2630 Holcomb Bridge Rd Alpharetta, GA 30022
2 Old Milton US 4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022
3 Windward US 875 N Main Street Ste 306 Alpharetta, GA 30009
$atlanta
# A tibble: 3 x 2
names address
<chr> <chr>
1 Philips Arena US 100 Techwood Drive Atlanta, GA 30303
2 Virginia Highlands US 1006 N Highland Ave Atlanta, GA 30306
3 Perimeter US 1211 Ashford Crossing Atlanta, GA 30346
$buford
# A tibble: 1 x 2
names address
<chr> <chr>
1 Woodward US 3250 Woodward Crossing Blvd Buford, GA 30519
Targeted output:
names address metro
East Ros... US 2630... alpharetta
As alistaire pointed out, bind_rows() with the .id argument is enough. Here is some example data:
library(dplyr) # provides tibble() and bind_rows()
alpharetta <- tibble(names=c("East Roswell", "Old Milton"),
address = c("US 2630 Holcomb Bridge Rd Alpharetta, GA 30022", "4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022"))
atlanta <- tibble(names=c("Philips Arena", "Virginia Highlands"),
address = c("US 100 Techwood Drive Atlanta, GA 30303", "US 1006 N Highland Ave Atlanta, GA 30306"))
loc_list <- list(alpharetta = alpharetta, atlanta = atlanta)
bind_rows(loc_list, .id="metro")
# A tibble: 4 x 3
metro names address
<chr> <chr> <chr>
1 alpharetta East Roswell US 2630 Holcomb Bridge Rd Alpharetta, GA 30022
2 alpharetta Old Milton 4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022
3 atlanta Philips Arena US 100 Techwood Drive Atlanta, GA 30303
4 atlanta Virginia Highlands US 1006 N Highland Ave Atlanta, GA 30306
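For comparison, here is a rough base R equivalent, in case dplyr is not available. It is a sketch with a cut-down `loc_list` (names only, no addresses): each element is tagged with its list name, then the pieces are stacked with `rbind`.

```r
# Cut-down version of loc_list for illustration
loc_list <- list(
  alpharetta = data.frame(names = c("East Roswell", "Old Milton"),
                          stringsAsFactors = FALSE),
  atlanta    = data.frame(names = "Philips Arena",
                          stringsAsFactors = FALSE)
)

# Add a metro column to each element, using the element's name in the list
tagged <- Map(function(df, metro) data.frame(metro = metro, df,
                                             stringsAsFactors = FALSE),
              loc_list, names(loc_list))

# Stack the pieces and drop the row names that rbind derives from the list names
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
```

bind_rows(loc_list, .id = "metro") remains the shorter option, and it also copes with elements whose columns differ, which plain rbind does not.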