Convert List to Tibble Plus Add Column With List Names - r

I'm working on a web scraping / mapping project where I've scraped address data from a restaurant website and I've stored the results as a list - in this example, called loc_list.
Question is, how best to convert these list items into a single data.frame / tibble (currently using bind_rows( )) but ALSO, in the new data.frame, have a column titled metro which corresponds to each list item name. In my example, the output would have 3 alpharettas, followed by 3 atlanta, then 1 buford.
loc_list
$alpharetta
# A tibble: 3 x 2
names address
<chr> <chr>
1 East Roswell US 2630 Holcomb Bridge Rd Alpharetta, GA 30022
2 Old Milton US 4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022
3 Windward US 875 N Main Street Ste 306 Alpharetta, GA 30009
$atlanta
# A tibble: 3 x 2
names address
<chr> <chr>
1 Philips Arena US 100 Techwood Drive Atlanta, GA 30303
2 Virginia Highlands US 1006 N Highland Ave Atlanta, GA 30306
3 Perimeter US 1211 Ashford Crossing Atlanta, GA 30346
$buford
# A tibble: 1 x 2
names address
<chr> <chr>
1 Woodward US 3250 Woodward Crossing Blvd Buford, GA 30519
Targeted output:
names address metro
East Ros... US 2630... alpharetta

As alistaire pointed out bind_rows is enough with .id. Here is example data:
alpharetta <- tibble(names=c("East Roswell", "Old Milton"),
address = c("US 2630 Holcomb Bridge Rd Alpharetta, GA 30022", "4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022"))
atlanta <- tibble(names=c("Philips Arena", "Virginia Highlands"),
address = c("US 100 Techwood Drive Atlanta, GA 30303", "US 1006 N Highland Ave Atlanta, GA 30306"))
loc_list <- list(alpharetta = alpharetta, atlanta = atlanta)
bind_rows(loc_list, .id="metro")
# A tibble: 4 x 3
metro names address
<chr> <chr> <chr>
1 alpharetta East Roswell US 2630 Holcomb Bridge Rd Alpharetta, GA 30022
2 alpharetta Old Milton 4305 Old Milton Parkway Ste 101 Alpharetta, GA 30022
3 atlanta Philips Arena US 100 Techwood Drive Atlanta, GA 30303
4 atlanta Virginia Highlands US 1006 N Highland Ave Atlanta, GA 30306

Related

How to assign one dataframe column's value to be the same as another column's value in r?

I am trying to run this line of code below to copy the city.output column to pm.city where it is not NA (in my sample dataframe, nothing is NA though) because city.output contains the correct city spellings.
resultdf <- dplyr::mutate(df, pm.city = ifelse(is.na(city.output) == FALSE, city.output, pm.city))
df:
pm.uid pm.address pm.state pm.zip pm.city city.output
<int> <chr> <chr> <chr> <chr> <fct>
1 1 1809 MAIN ST OH 63312 NORWOOD NORWOOD
2 2 123 ELM DR CA NA BRYAN BRYAN
3 3 8970 WOOD ST UNIT 4 LA 33333 BATEN ROUGE BATON ROUGE
4 4 4444 OAK AVE OH 87481 CINCINATTI CINCINNATI
5 5 3333 HELPME DR MT 87482 HELENA HELENA
6 6 2342 SOMEWHERE RD LA 45103 BATON ROUGE BATON ROUGE
resultdf (pm.city should be the same as city.output but it's an integer)
pm.uid pm.address pm.state pm.zip pm.city city.output
<int> <chr> <chr> <chr> <int> <fct>
1 1 1809 MAIN ST OH 63312 7 NORWOOD
2 2 123 ELM DR CA NA 2 BRYAN
3 3 8970 WOOD ST UNIT 4 LA 33333 1 BATON ROUGE
4 4 4444 OAK AVE OH 87481 3 CINCINNATI
5 5 4444 HELPME DR MT 87482 4 HELENA
6 6 2342 SOMEWHERE RD LA 45103 1 BATON ROUGE
An integer is instead assigned to pm.city. It appears the integer is the order number of the cities when they're in alphabetical order. Prior to this, I used the dplyr left_join method to attach city.output column from another dataframe but even there, there was no row number that I supplied explicitly.
This works on my computer in r studio but not when I run it from a server. Maybe it has something to do with my version of dplyr or the factor data type under city.output? I am pretty new to r.
The city.output is factor which gets coerced to integer storage values. Instead, convert to character with as.character
dplyr::mutate(df, pm.city = ifelse(!is.na(city.output), as.character(city.output), pm.city))

Find aiports that have flights connected to them

library(tidyverse)
library(nycflights13)
I want to find out which airports have flights to them. My attempt is seen below, but it is not correct (it yields a number that is way bigger than the amount of airports)
airPortFlights <- airports %>% rename(dest=faa) %>% left_join(flights, "dest"=faa)
If anyone wonders why I do the rename above, that's because it won't let me do
airports %>% left_join(flights, "dest"=faa)
It gives
Error: by required, because the data sources have no common variables`
I even tried airports %>% left_join(flights, by=c("dest"=faa)) and several other attempts, which are also not working.
Thanks in advance.
You want an inner_join and then either count the distinct flights, or just list the airports using distinct. Here I count them.
library(dplyr)
inner_join(airports, flights, by=c("faa"="dest")) %>%
count(faa, name) %>% # number of flights
arrange(-n)
# A tibble: 101 x 3
faa name n
<chr> <chr> <int>
1 ORD Chicago Ohare Intl 17283
2 ATL Hartsfield Jackson Atlanta Intl 17215
3 LAX Los Angeles Intl 16174
4 BOS General Edward Lawrence Logan Intl 15508
5 MCO Orlando Intl 14082
6 CLT Charlotte Douglas Intl 14064
7 SFO San Francisco Intl 13331
8 FLL Fort Lauderdale Hollywood Intl 12055
9 MIA Miami Intl 11728
10 DCA Ronald Reagan Washington Natl 9705
# ... with 91 more rows
So 101 of the 1,458 airports in this dataset have at least 1 record in the flights dataset, with Chicago's O'Hare Intl airport having the most flights from New York.
And just for fun, the following lists the airports that don't have any flights from NY:
anti_join(airports, flights, by=c("faa"="dest"))

Map zip codes to their respective city and state in R?

I have a data frame of zip codes that I'm looking to map to a city & state for each specific zip code. Currently, I have played around with the zipcode package a bit but I'm not sure that can solve this specific issue.
Here's sample data of what I have now:
str(all_key$zip)
chr [1:406] "43031" "24517" "43224" "43832" "53022" "60185" "84104" "43081"
"85226" "85193" "54656" "43215" "94533" "95826" "64804" "49548" "54467"
The expected output would be adding a city & state column to each row of the data frame referring to the individual zips:
head(all_key)
zip city state
1 43031 city1 state1
2 24517 city2 state2
3 43224 city3 state3
4 43832 city4 state4
5 53022 city5 state5
6 60185 city6 state6
Thanks in advance for your help.
Another Update - February 2023
Another package (zipcodeR) has been added that makes this easier. See below.
Answer updated - January 2020
The zipcode package seems to have disappeared, so this answer has been updated to show how to add lat-lon from an external file. New answer at bottom.
Original answer
You can get the data from the zipcode package and just do a merge to look things up.
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
ZC = data.frame(zip)
library(zipcode)
data(zipcode)
merge(ZC, zipcode)
zip city state latitude longitude
1 24517 Altavista VA 37.12754 -79.27409
2 43031 Johnstown OH 40.15198 -82.66944
3 43081 Westerville OH 40.10951 -82.91606
4 43215 Columbus OH 39.96513 -83.00431
5 43224 Columbus OH 40.03991 -82.96772
6 43832 Newcomerstown OH 40.27738 -81.59662
7 49548 Grand Rapids MI 42.86823 -85.66391
8 53022 Germantown WI 43.21916 -88.12043
9 54467 Plover WI 44.45228 -89.54399
10 54656 Sparta WI 43.96977 -90.80796
11 60185 West Chicago IL 41.89198 -88.20502
12 64804 Joplin MO 37.04716 -94.51124
13 84104 Salt Lake City UT 40.75063 -111.94077
14 85193 Casa Grande AZ 32.86000 -111.83000
15 85226 Chandler AZ 33.31221 -111.93177
16 94533 Fairfield CA 38.26958 -122.03701
17 95826 Sacramento CA 38.55010 -121.37492
If you need to keep the rows in the same order, you can just set the rownames on the zipcode data and use that to select the desired rows and columns.
rownames(zipcode) = zipcode$zip
zipcode[zip, 1:3]
zip city state
43031 43031 Johnstown OH
24517 24517 Altavista VA
43224 43224 Columbus OH
43832 43832 Newcomerstown OH
53022 53022 Germantown WI
60185 60185 West Chicago IL
84104 84104 Salt Lake City UT
43081 43081 Westerville OH
85226 85226 Chandler AZ
85193 85193 Casa Grande AZ
54656 54656 Sparta WI
43215 43215 Columbus OH
94533 94533 Fairfield CA
95826 95826 Sacramento CA
64804 64804 Joplin MO
49548 49548 Grand Rapids MI
54467 54467 Plover WI
Updated Answer - January 2020
Since the zipcode package has disappeared, this shows how to add lat-lon information from a downloaded data set. The file that I am using exists today but the method should work for other files. See the GIS StackExchange for some leads on where to download data.
## Original Data to match
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
ZC = data.frame(zip)
## Download source file, unzip and extract into table
ZipCodeSourceFile = "http://download.geonames.org/export/zip/US.zip"
temp <- tempfile()
download.file(ZipCodeSourceFile , temp)
ZipCodes <- read.table(unz(temp, "US.txt"), sep="\t")
unlink(temp)
names(ZipCodes) = c("CountryCode", "zip", "PlaceName",
"AdminName1", "AdminCode1", "AdminName2", "AdminCode2",
"AdminName3", "AdminCode3", "latitude", "longitude", "accuracy")
## merge extra info onto original data
fZC_Info = merge(ZC, ZipCodes[,c(2:6,10:11)])
head(ZC_Info)
zip PlaceName AdminName1 AdminCode1 AdminName2 latitude longitude
1 24517 Altavista Virginia VA Campbell 37.1222 -79.2911
2 43031 Johnstown Ohio OH Licking 40.1445 -82.6973
3 43081 Westerville Ohio OH Franklin 40.1146 -82.9105
4 43215 Columbus Ohio OH Franklin 39.9671 -83.0044
5 43224 Columbus Ohio OH Franklin 40.0425 -82.9689
6 43832 Newcomerstown Ohio OH Tuscarawas 40.2739 -81.5940
Second Update - February 2023
Another package, zipcodeR, is now available that makes this easier. Here is some simple code to demonstrate it.
library(zipcodeR)
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
reverse_zipcode(zip)[,c(1,3,7)]
# A tibble: 17 × 3
zipcode major_city state
<chr> <chr> <chr>
1 85193 Casa Grande AZ
2 85226 Chandler AZ
3 94533 Fairfield CA
4 95826 Sacramento CA
5 60185 West Chicago IL
6 49548 Grand Rapids MI
7 64804 Joplin MO
8 43031 Johnstown OH
9 43081 Westerville OH
10 43215 Columbus OH
11 43224 Columbus OH
12 43832 Newcomerstown OH
13 84104 Salt Lake City UT
14 24517 Altavista VA
15 53022 Germantown WI
16 54467 Plover WI
17 54656 Sparta WI
You can still use the "zipcode" package by downloading it from the archives
https://cran.r-project.org/src/contrib/Archive/zipcode/
Once you download the tar.gz file to your computer, you can install it from the RStudio GUI Packages pane. After clicking "Install", you can change the option to "Package Archive File" and point to the downloaded tar.gz file.
Install/use the USA package, also described here, which contains a tibble (zips and lats/longs) from the archived zipcode package.
library(usa)
zcs <- usa::zipcodes
head(zcs)
# A tibble: 6 x 5
zip city state lat long
<chr> <chr> <chr> <dbl> <dbl>
1 00210 Portsmouth NH 43.0 -71.0
2 00211 Portsmouth NH 43.0 -71.0
3 00212 Portsmouth NH 43.0 -71.0
4 00213 Portsmouth NH 43.0 -71.0
5 00214 Portsmouth NH 43.0 -71.0
6 00215 Portsmouth NH 43.0 -71.0
You can use the data frame in the R package zipcodeR.
To add the city and state to your data frame, you can select the variables you want from the data frame provided in zipcodeR (called zip_code_db), then join it with your data frame:
library(dplyr)
library(zipcodeR)
zip_code_db_selected =
zip_code_db %>%
select(zipcode, major_city, state)
all_key_with_city_st =
left_join(all_key, zip_code_db_selected, by = c("zip" = "zipcode"))

How to have bar labels be names in Plotly for R

So I'm trying to make a bar chart that displays the most popular airports that flew to Chicago. For some reason, I'm finding it to be extremely difficult to have my bars be labeled by the airport names specifically.
I have a data frame called ty
> ty
Name
1 Atlanta, GA: Hartsfield-Jackson Atlanta International
2 New York, NY: LaGuardia
3 Minneapolis, MN: Minneapolis-St Paul International
4 Los Angeles, CA: Los Angeles International
5 Denver, CO: Denver International
6 Washington, DC: Ronald Reagan Washington National
7 Orlando, FL: Orlando International
8 Phoenix, AZ: Phoenix Sky Harbor International
9 Detroit, MI: Detroit Metro Wayne County
10 Las Vegas, NV: McCarran International
11 San Francisco, CA: San Francisco International
12 Dallas/Fort Worth, TX: Dallas/Fort Worth International
13 Boston, MA: Logan International
14 Philadelphia, PA: Philadelphia International
15 Newark, NJ: Newark Liberty International
I also have a data frame called df
id numArrivals
1 10397 964
2 12953 962
3 13487 883
4 12892 823
5 11292 776
6 11278 771
7 13204 725
8 14107 700
9 11433 672
10 12889 647
11 14771 611
12 11298 580
13 10721 569
14 14100 567
15 11618 488
The id corresponds to the airport name 10397 is Atlanta, GA: Hartsfield-Jackson Atlanta International and they continue in that order.
However, when I run:
plotly::plot_ly(df,x=ty["Name"],y=df$numArrivals,type="bar",color=I("rgba(0,92,124,1)"))
I am given this chart.
How can I make the labels of my bars into the names of the airport rather than just numbers?
Feel free to use ggplotly() to create your plot. I used the code below to create a small example.
example <- data.frame(airport = c("Atlanta, GA: Hartsfield-Jackson Atlanta International","New York, NY: LaGuardia","Minneapolis, MN: Minneapolis-St Paul International"),
id = c(10397,12953,13487),
numArrivals = c(964,962,883),stringsAsFactors = F)
library(ggplot2)
library(plotly)
a <- ggplot(example,aes(x=airport,y=numArrivals,fill=id)) + geom_bar(stat = "identity") + coord_flip()
ggplotly(a)
The final result looks like this.

R:Fuzzy Logic Name match

I have been working on large data set which has names of customers , each of this has to be checked with the master file which has correct names (300 KB) and if matched append the master file name to names of customer file as new column value. My prev Question worked for small data sets
Both Customer & Master file has been cleaned using tm and have tried different logic , but only works on small set of data when applied to huge files not effective, pattern matching doesn't help here my opinion cause no names comes with exact pattern
Cus File
1 chang chun petrochemical
2 chang chun plastics
3 church dwight
4 citrix systems asia pacific
5 cnh industrial services srl
6 conoco phillips
7 conocophillips
8 dfk laurence varnay
9 dtz worldwide
10 electro motive maintenance operati
11 enterasys networks
12 esso resources
13 expedia
14 expedia
15 exponential interactive aust
16 exxonmobil asia pacific pte
17 exxonmobil chemical asia pac div
18 exxonmobil png
19 formula world championship
20 fortitech asia pacific sdn bhd
Master
1 chang chun group
2 church dwight
3 citrix systems asia pacific
4 cnh industrial nv
5 conoco phillips
6 dfk laurence varnay
7 dtz group zealand
8 caterpillar
9 enterasys networks
10 exxon mobil group
11 expedia group
12 exponential interactive aust
13 formula world championship
14 fortitech asia pacific sdn bhd
15 frhi hotels resorts
16 gardner denver industries
17 glencore xstrata international plc
18 grace
19 incomm nz
20 information resources
21 kbr holdings llc
22 kennametal
23 komatsu
24 leonhard hofstetter pelzdesign
25 communications corporation
26 manhattan associates
27 mattel
28 mmg finance
29 nokia oyj group
30 nortek
i have tried with this simple loop
for (i in 1:100){
result$x[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
#result$Y[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
*result *
1 chang chun petrochemical <NA> NA
2 chang chun plastics <NA> NA
3 church dwight church dwight 2
4 citrix systems asia pacific citrix systems asia pacific 3
5 cnh industrial services srl <NA> NA
6 conoco phillips church dwight 2
7 conocophillips <NA> NA
8 dfk laurence varnay <NA> NA
9 dtz worldwide church dwight 2
10 electro motive maintenance operati <NA> NA
11 enterasys networks <NA> NA
12 esso resources church dwight 2
13 expedia <NA> NA
14 expedia <NA> NA
15 exponential interactive aust church dwight 2
16 exxonmobil asia pacific pte <NA> NA
17 exxonmobil chemical asia pac div <NA> NA
18 exxonmobil png church dwight 2
19 formula world championship <NA> NA
20 fortitech asia pacific sdn bhd
tried with lapply but no use , as you can notice my master file is large and some times i get error of rows length doesn't match!
mm<-dt[lapply(result, function(x) levenshteinDist(x ,lapply(result1, function(x) x)))]
#using looping stat. for checking each cus name with all the master names
for(i in seq(nrow(result)) )
{
if((levenshteindist(result[i],lapply(result1, function(x) String(x))))==0)
sprintf("%s", x)
}
which method would be best for this ? similar to my Q but not much helpfullI referd few Q from STO
it might be naive but when applied with huge data sets it mis behaves, can anybody familiar with R could correct me with the above code for levenshteinDist
code:
#check with each value of master file and if matches more than .90 then return master value.
for(i in seq(1:nrow(gr1))
{
for(j in seq(1:nrow(gr2))
{
gr1$jar[i,j]<-jarowinkler(gr1$ICIS_Cust_Names[i],gr2$Master_Names[j])
if(gr1$jar[i,j]>.90)
gr1$res[i] = gr2$Master_Names[j]
}
}
#Please let know if there is any minute error with this code
Please if anybody has worked with such data in R please help !
achieved partial result by
code :
df$result<-data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names,df$Master_Names))])

Resources