Can't match two dataframe values - r

I am not sure why the dataframe values do not match with each other.
I have a df named fileUpload which looks like this (the columns are aligned correctly):
Destination City  Year  Adults
Amsterdam         2015  2
Amsterdam         2016  2
Amsterdam         2015  2
Amsterdam         2016  2
Amsterdam         2015  3
There is a space after each city name.
I have another dataframe that is not uploaded, like this:
cities <- read.csv(text = "
City,Lat,Long,Pop
Amsterdam ,4.8952,52.3702,779808
Bali ,115.1889,-8.4095,4225000")
I need to merge the two dataframes, but I realized that the city values do not match (the merge returns NA). Checking with fileUpload$`Destination City` %in% cities$City returns FALSE.
I tried removing the space after the city name; that also did not work.
typeof() on the city column is integer for both dataframes, which suggests they are factors.
How can I make the cities name match together?

As pointed out in the comments, you should convert your columns from factors to character strings (factors are stored internally as integers, which is why typeof() reports integer). With the trailing spaces trimmed as well, the merge works:
mergedCities <- merge(fileUpload, cities, by.x = "Destination City", by.y = "City", all = TRUE)
Set the all parameter to specify whether you want to keep all cities, just the ones from x or y, or only the cities present in both.
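A minimal sketch of the full fix, assuming the column names from the question (on R versions before 4.0, read.csv and file uploads default to stringsAsFactors = TRUE, which is the likely cause here):
# Convert factors to character and drop the stray whitespace around city names
fileUpload$`Destination City` <- trimws(as.character(fileUpload$`Destination City`))
cities$City <- trimws(as.character(cities$City))

# Now the keys compare equal and the merge can fill in Lat/Long/Pop
mergedCities <- merge(fileUpload, cities,
                      by.x = "Destination City", by.y = "City", all = TRUE)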

Related

How to get US county name from Address, city and state in R?

I have a dataset of around 10,000 rows. I have the Address, City, State and Zipcode values. I do not have lat/long coordinates. I would like to retrieve the county name without it taking a large amount of time. I have tried library(tidygeocoder), but it takes around 14 seconds for 100 values and gives a 'time-out' error when I put in the entire dataset. Plus, it outputs a FIPS code, which I then have to join on to get the actual county name. Reproducible example:
library(tidygeocoder)
library(dplyr)

df <- tidygeocoder::louisville[, 1:4]
county_fips <- data.frame(fips = c("111", "112"),
                          county = c("Jefferson", "Montgomery"))

geocoded <- df %>%
  geocode(street = street, city = city, state = state,
          method = 'census', full_results = TRUE,
          api_options = list(census_return_type = 'geographies'))

df$fips <- geocoded$county_fips
df_new <- merge(x = df, y = county_fips, by = "fips", all.x = TRUE)
You can use a public dataset that links city and/or zipcode to county. I found these websites with such data:
https://www.unitedstateszipcodes.org/zip-code-database
https://simplemaps.com/data/us-cities
You can then do a left join on the linking column (presumably city or zipcode, but it will depend on the dataset):
df = merge(x=df, y=public_dataset, by="City", all.x=T)
If performance is an issue, you can select just the county and linking columns from the public data set before you do the merge.
public_dataset = public_dataset %>% select(County, City)
The slow performance is due to tidygeocoder's use of the Census Bureau's API to match data. Asking the API to match thousands of addresses is the slowdown, and I'm not aware of a different way to do this.
However, we can at least pare down the number of addresses that you send to the API. Maybe if we get that number low enough, the code will run.
The ZIP Code Tabulation Areas (ZCTA) file shows the relationships between ZIP codes and county names (as well as FIPS codes). A "|"-delimited file with a description of the data can be found on the Bureau's website.
Counting the number of times a ZIP code shows up tells us if a ZIP code spans multiple counties. If the frequency == 1, then you can freely translate the ZIP code to the county.
ZCTA <- read.delim("tab20_zcta520_county20_natl.txt", sep="|")
n_occur <- data.frame(table(ZCTA$GEOID_ZCTA5_20))
head(n_occur, 10)
   Var1 Freq
1   601    2
2   602    2
3   603    2
4   606    3
5   610    4
6   611    1
7   612    3
8   616    1
9   617    2
10  622    1
In these results, addresses with ZIP codes 00611 and 00622 can be mapped to their counties without sending the addresses through the API. If your addresses are very urban, you may be lucky: urban ZIP codes are small area-wise and typically may not span multiple counties.
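A hedged sketch of the lookup step, assuming the zip column from the louisville data and the GEOID_ZCTA5_20 and NAMELSAD_COUNTY_20 columns from the relationship file (check the Bureau's record layout for the exact column names):
# Keep only ZIP codes that map to exactly one county
single_zips <- n_occur$Var1[n_occur$Freq == 1]
lookup <- ZCTA[ZCTA$GEOID_ZCTA5_20 %in% single_zips,
               c("GEOID_ZCTA5_20", "NAMELSAD_COUNTY_20")]

# Resolve those addresses locally; only the remainder needs the API
df_local <- merge(df, lookup, by.x = "zip", by.y = "GEOID_ZCTA5_20", all.x = TRUE)
df_remaining <- df_local[is.na(df_local$NAMELSAD_COUNTY_20), ]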

Pattern matching character vectors in R

I am trying to match the character values between two vectors in two separate dataframes; let's call the dataframes "rentals" and "parcels". Both contain a character vector "address": the addresses of all rental parcels in a county, and the addresses of all parcels in a city, respectively. We would like to figure out which addresses in the "parcels" dataframe match an address in the "rentals" dataframe by searching through the vector of addresses in "parcels" for matches with an address in "rentals".
The values in rentals$address look like this:
rentals$address <- c("110 SW ARTHUR ST", "1610 NE 66TH AVE", "1420 SE 16TH AVE",...)
And the values in parcels$address look like this:
parcels$address <- c("635 N MARINE DR, PORTLAND, OR, 97217", "7023 N BANK ST, PORTLAND, OR, 97203", "5410 N CECELIA ST, PORTLAND, OR, 97203",...)
There are about 172,000 entries in the "parcels" dataframe and 285 in the "rentals" dataframe. My first solution was to match character values using grepl, which I don't think worked:
matches = grepl(rentals$address, parcels$address, fixed = TRUE)
This returns FALSE for every entry in parcels$address, but when I copy and paste some "address" values from "rentals" into Excel's Ctrl+F window while viewing the "parcels" dataframe, I do find a few addresses. So some appear to match.
How would I best be able to find which observation's values in the "address" column of the "rentals" dataframe is a matching character sequence in the "parcels" dataframe?
Are the addresses all exact matches? That is, no variations in spacing, capitalization, or apartment number? If so, you might be able to use the dplyr function left_join to create a new df, using the address as the key, like so:
library(dplyr)
df_compare <- df_rentals %>%
  left_join(df_parcels, by = "address")
Additionally, if you have columns along the lines of df_rentals$rentals = "yes" and df_parcels$parcels = "yes", you can filter the resulting new dataframe:
df_both <- filter(df_compare, rentals == "yes", parcels == "yes")
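Note that grepl only uses the first element when given a multi-element pattern (with a warning), which is why the original attempt returned FALSE everywhere. And since the parcel strings embed the street address plus city, state, and ZIP, an exact join won't match either. A hedged alternative, given the modest size of "rentals", is to loop the 285 rental addresses over the parcel strings:
# One grepl call per rental address; fixed = TRUE treats them literally
matches <- sapply(rentals$address,
                  function(a) grepl(a, parcels$address, fixed = TRUE))

# matches is a 172,000 x 285 logical matrix: rows are parcels, columns rentals
parcels_matched <- parcels[rowSums(matches) > 0, ]  # parcels hit by any rental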

Jupyter lab sorting

I have a csv file which contains the airports of the world, with columns for country name, city name, elevation, region, number of runways, and so on. I am trying to display the city names which have at least 5 runways in total, sort the city list by the number of runways in descending order, and also display the number of runways in each city. I can manage the "at least 5 runways" part, but I cannot sort them in decreasing order. Could you help?
I solved it:
byrunways = df.sort_values(by = 'runways', ascending = False)
df_indexed = byrunways.set_index('city')
condition = df_indexed['runways'] > 4
display(df_indexed[condition]['runways'])
But the point is: what if a city has multiple airports?

Categorizing types of duplicates in R

Let's say I have the following data frame:
df <- data.frame(address = c('654 Peachtree St', '890 River Rd', '890 River Rd',
                             '890 River Rd', '1234 Main St', '1234 Main St',
                             '567 1st Ave', '567 1st Ave'),
                 city = c('Atlanta', 'Eugene', 'Eugene', 'Eugene', 'Portland',
                          'Portland', 'Pittsburgh', 'Etna'),
                 state = c('GA', 'OR', 'OR', 'OR', 'OR', 'OR', 'PA', 'PA'),
                 zip5 = c('30308', '97404', '97404', '97404', '97201', '97201',
                          '15223', '15223'),
                 zip9 = c('30308-1929', '97404-3253', '97404-3253', '97404-3253',
                          '97201-5717', '97201-5000', '15223-2105', '15223-2105'),
                 stringsAsFactors = FALSE)
           address       city state  zip5       zip9
1 654 Peachtree St    Atlanta    GA 30308 30308-1929
2     890 River Rd     Eugene    OR 97404 97404-3253
3     890 River Rd     Eugene    OR 97404 97404-3253
4     890 River Rd     Eugene    OR 97404 97404-3253
5     1234 Main St   Portland    OR 97201 97201-5717
6     1234 Main St   Portland    OR 97201 97201-5000
7      567 1st Ave Pittsburgh    PA 15223 15223-2105
8      567 1st Ave       Etna    PA 15223 15223-2105
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
       address       city state  zip5       zip9           type
1 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
2 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
3 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
4 1234 Main St   Portland    OR 97201 97201-5717 Different Zip9
5 1234 Main St   Portland    OR 97201 97201-5000 Different Zip9
6  567 1st Ave Pittsburgh    PA 15223 15223-2105 Different City
7  567 1st Ave       Etna    PA 15223 15223-2105 Different City
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.
I think the key is grouping by a "reference" variable (here address makes sense) and then counting the number of unique values in each of the other columns. It's not a perfect solution, since case_when prioritizes the earlier conditions (i.e. if one address has two different cities AND two different ZIP codes, you'll only see that there are two different cities; you'll need additional case_when statements if that matters). However, the length of the unique values is a reasonable heuristic if you don't need a perfectly granular solution.
df %>%
  group_by(address) %>%
  mutate(
    match_type = case_when(
      all(length(unique(city)) == 1,
          length(unique(state)) == 1,
          length(unique(zip5)) == 1,
          length(unique(zip9)) == 1) ~ "Exact Match",
      length(unique(city)) > 1 ~ "Different City",
      length(unique(state)) > 1 ~ "Different State",
      length(unique(zip5)) > 1 ~ "Different Zip5",
      length(unique(zip9)) > 1 ~ "Different Zip9"
    ))
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
Edit
One additional approach I just thought of, if you need a more granular solution, is to add an id column (df %>% rowid_to_column("ID")) and then full join the table to itself by address with suffixes (e.g. suffix = c("a", "b")), filtering out identical IDs and calling distinct (since each comparison is there twice); then you can make Boolean columns with mutate for the pairwise comparisons, as in the sketch below. It may be too computationally intensive depending on the size of your dataset, but it should work on the scale of a few thousand rows if you have a reasonable amount of RAM.
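A hedged sketch of that idea (the suffixes and comparison columns are illustrative):
library(dplyr)
library(tibble)

df_id <- df %>% rowid_to_column("ID")

pairs <- df_id %>%
  full_join(df_id, by = "address", suffix = c("_a", "_b")) %>%
  filter(ID_a < ID_b) %>%  # drops self-matches and keeps each pair only once
  mutate(same_city = city_a == city_b,
         same_zip9 = zip9_a == zip9_b)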

Averaging rows based upon a known, irregular relationship using R

I have data on energy companies whose jurisdictions overlap in places. I want to be able to compute an average of sales for the places where these companies overlap. These companies will always overlap, so how can I use this information to calculate the averages just for those pairs? There are about 20 pairs of companies.
data <- data.frame(Company = c("Energy USA", "Good Energy",
                               "Hydropower 4 U",
                               "Coal Town",
                               "Energy USA/Good Energy",
                               "Good Energy/Coal Town"),
                   Sales = c(100, 2500, 550, 6000, "?", "?"))
                 Company Sales
1             Energy USA   100
2            Good Energy  2500
3         Hydropower 4 U   550
4              Coal Town  6000
5 Energy USA/Good Energy     ?  (Answer: 1300)
6  Good Energy/Coal Town     ?  (Answer: 4250)
We use grepl to get the index of 'Company' elements that have more than one entry, i.e. separated by '/'. Then we split those elements by the delimiter (the output will be a list), loop through the list with sapply, match the elements with the 'Company' column to get their positions, and use those to get the corresponding 'Sales' elements. As the 'Sales' column is a factor, we need to convert it to numeric to take the mean. When we convert factor to numeric class, all non-numeric elements (i.e. ?) become NA. Finally, we replace those NA elements with the mean values.
i1 <- grepl('/', data$Company)
v1 <- sapply(strsplit(as.character(data$Company[i1]), '/'),
             function(x) mean(as.numeric(as.character(
               data$Sales[match(x, data$Company)]))))
data$Sales <- as.numeric(as.character(data$Sales))
data$Sales[is.na(data$Sales)] <- v1
data
# Company Sales
#1 Energy USA 100
#2 Good Energy 2500
#3 Hydropower 4 U 550
#4 Coal Town 6000
#5 Energy USA/Good Energy 1300
#6 Good Energy/Coal Town 4250
Without knowing what your original data looks like, it is hard to give a working answer. However, assuming your data has Company and Sales columns with multiple rows for each company, you can do something like this:
mean(data$Sales[data$Company %in% c('Energy USA', 'Good Energy')])
mean(data$Sales[data$Company %in% c('Good Energy', 'Coal Town')])
You could create a new column "jurisdiction" in "data", if your dataset is rather small:
MeansByJurisdiction <- tapply(data$Sales, data$jurisdiction, mean)
Then you can convert the vector to a dataframe:
MeansByJurisdiction <- data.frame(MeansByJurisdiction)
The row names in the MeansByJurisdiction dataframe will be populated with the jurisdictions, and you can extract them with a simple line of code:
MeansByJurisdiction$jurisdictions <- row.names(MeansByJurisdiction)
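A small end-to-end sketch of this approach, using a made-up long-format table in which each sale row carries the overlap pair it belongs to (the layout and values are assumptions, not from the question):
# One row per sale, labelled with its jurisdiction pair (illustrative values)
data <- data.frame(Sales = c(100, 2500, 2500, 6000),
                   jurisdiction = c("Energy USA/Good Energy",
                                    "Energy USA/Good Energy",
                                    "Good Energy/Coal Town",
                                    "Good Energy/Coal Town"))

MeansByJurisdiction <- data.frame(tapply(data$Sales, data$jurisdiction, mean))
names(MeansByJurisdiction) <- "mean_sales"
MeansByJurisdiction$jurisdictions <- row.names(MeansByJurisdiction)
MeansByJurisdiction
#                        mean_sales          jurisdictions
# Energy USA/Good Energy       1300 Energy USA/Good Energy
# Good Energy/Coal Town        4250  Good Energy/Coal Town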
