Jupyter Lab sorting

I have a CSV file containing the world's airports, with columns for country name, city name, elevation, region, runways, and so on. I am trying to display the cities that have at least 5 runways in total, sorted by the number of runways in descending order, along with the runway count for each city. I can filter for at least 5 runways, but I can't sort them in descending order. Could you help?

I solved it:
byrunways = df.sort_values(by = 'runways', ascending = False)
df_indexed = byrunways.set_index('city')
condition = df_indexed['runways'] > 4
display(df_indexed[condition]['runways'])
But the point is: what if a city has multiple airports?
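If a city has multiple airports, it appears on several rows and the runway counts need to be summed per city before filtering and sorting. A minimal sketch with made-up data (the column names `city` and `runways` are assumptions based on the question):

```python
import pandas as pd

# Hypothetical data: one row per airport, so a city can appear more than once.
df = pd.DataFrame({
    "city": ["Amsterdam", "Amsterdam", "Bali", "Oslo"],
    "runways": [3, 4, 6, 2],
})

# Sum runways per city, keep cities with at least 5 in total,
# and sort in descending order.
runways_per_city = (
    df.groupby("city")["runways"]
    .sum()
    .loc[lambda s: s >= 5]
    .sort_values(ascending=False)
)
print(runways_per_city)
```

With this data, Amsterdam (3 + 4 = 7) and Bali (6) survive the filter, in that order.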

Create a subset of the dataset from Part 1 with only the top 5 departments based on the number of employees working in that department

Question:
Initialize the city of Boston earnings dataset as shown below:
boston <- read.csv( "https://people.bu.edu/kalathur/datasets/bostonCityEarnings.csv", colClasses = c("character", "character", "character", "integer", "character"))
Generate a subset of the dataset from Boston earnings dataset with only the top 5 departments based on the number of employees working in that department. The top 5 departments should be computed using R code. Then, use %in% operator to create the required subset.
Use a sample size of 50 for each of the following.
Set the start seed for random numbers as the last 4 digits of 1000
a) Show the sample drawn using simple random sampling without replacement.
Show the frequencies for the selected departments.
Show the percentages of these with respect to sample size.
I've tried to write the code, but I still don't know how to create the subset with the top 5 departments, and I don't know how to turn the frequencies into percentages either.
Thank you all for the help!
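A sketch of one way to build the top-5 subset with %in% and then report frequencies and percentages. The data frame below is a made-up stand-in for the Boston earnings data; only a Department column is assumed:

```r
# Hypothetical stand-in for the Boston earnings data.
set.seed(42)
boston <- data.frame(Department = sample(paste("Dept", LETTERS[1:8]), 200,
                                         replace = TRUE,
                                         prob = c(8, 7, 6, 5, 4, 1, 1, 1)))

# Top 5 departments by number of employees.
top5 <- names(sort(table(boston$Department), decreasing = TRUE))[1:5]

# Required subset via %in%.
boston_top5 <- boston[boston$Department %in% top5, , drop = FALSE]

# a) Simple random sample of size 50 without replacement.
set.seed(1000)
s <- boston_top5[sample(nrow(boston_top5), 50, replace = FALSE), , drop = FALSE]

# Frequencies of the selected departments, and percentages
# with respect to the sample size.
freqs <- table(s$Department)
pcts  <- 100 * freqs / 50
freqs
pcts
```

Dividing each frequency by the sample size (50) and multiplying by 100 gives the requested percentages.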

How to get US county name from Address, city and state in R?

I have a dataset of around 10,000 rows with Address, City, State, and Zipcode values, but no lat/long coordinates. I would like to retrieve the county name without it taking a long time. I have tried library(tidygeocoder), but it takes around 14 seconds for 100 values and gives a 'time-out' error when I run the entire dataset. Also, it outputs a FIPS code, which I then have to join on to get the actual county name. Reproducible example:
library(tidygeocoder)
library(dplyr)

df <- tidygeocoder::louisville[, 1:4]
county_fips <- data.frame(fips = c("111", "112"),
                          county = c("Jefferson", "Montgomery"))

geocoded <- df %>% geocode(street = street, city = city, state = state,
                           method = 'census', full_results = TRUE,
                           api_options = list(census_return_type = 'geographies'))

df$fips <- geocoded$county_fips
df_new <- merge(x = df, y = county_fips, by = "fips", all.x = TRUE)
You can use a public dataset that links city and/or zipcode to county. I found these websites with such data:
https://www.unitedstateszipcodes.org/zip-code-database
https://simplemaps.com/data/us-cities
You can then do a left join on the linking column (presumably city or zipcode but will depend on the dataset):
df = merge(x=df, y=public_dataset, by="City", all.x=T)
If performance is an issue, you can select just the county and linking columns from the public data set before you do the merge.
public_dataset = public_dataset %>% select(County, City)
The slow performance is due to tidygeocoder's use of the Census Bureau's API to match data. Asking the API to match thousands of addresses is the slowdown, and I'm not aware of a different way to do this.
However, we can at least pare down the number of addresses that you are putting into the API. Maybe if we get that number low enough the code will run.
The ZIP Code Tabulation Areas (ZCTA) shows the relationships between ZIP Codes and county names (as well as FIPS). A "|" delimited file with a description of the data can be found on the Bureau's website.
Counting the number of times a ZIP code shows up tells us if a ZIP code spans multiple counties. If the frequency == 1, then you can freely translate the ZIP code to the county.
ZCTA <- read.delim("tab20_zcta520_county20_natl.txt", sep="|")
n_occur <- data.frame(table(ZCTA$GEOID_ZCTA5_20))
head(n_occur, 10)
   Var1 Freq
1   601    2
2   602    2
3   603    2
4   606    3
5   610    4
6   611    1
7   612    3
8   616    1
9   617    2
10  622    1
In these results, addresses with ZIP codes 00611 and 00622 can be mapped to the corresponding counties without sending the addresses through the API. If your addresses are very urban, then you may be lucky in that the ZIP codes are small area-wise and may not typically span multiple counties.
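Translating that idea into code, a minimal sketch. The data frame below is a made-up stand-in for the relationship file, and the county names are placeholders; the real file's columns may differ:

```r
# Sketch: map only the unambiguous ZIP codes (those appearing once in the
# relationship file) straight to a county; the rest go to the API.
ZCTA <- data.frame(GEOID_ZCTA5_20 = c("00601", "00601", "00611", "00622"),
                   county_name    = c("County A", "County B",
                                      "County C", "County D"))

n_occur <- as.data.frame(table(ZCTA$GEOID_ZCTA5_20))
unambiguous <- as.character(n_occur$Var1[n_occur$Freq == 1])

# Rows whose ZIP maps to exactly one county can skip geocoding entirely.
zip_to_county <- ZCTA[ZCTA$GEOID_ZCTA5_20 %in% unambiguous, ]
zip_to_county
```

Only the remaining ambiguous ZIPs then need to be sent through the geocoder, which should cut the API workload substantially.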

Pattern matching character vectors in R

I am trying to match addresses between two dataframes, call them "rentals" and "parcels". Both contain a character column "address": one holds the addresses of all rental parcels in a county, the other the addresses of all parcels in a city. I would like to figure out which addresses in the "parcels" dataframe match an address in the "rentals" dataframe by searching the "parcels" address vector for matches.
The values in rentals$address look like this:
rentals$address <- c("110 SW ARTHUR ST", "1610 NE 66TH AVE", "1420 SE 16TH AVE",...)
And the values in parcels$address look like this:
parcels$address <- c("635 N MARINE DR, PORTLAND, OR, 97217", "7023 N BANK ST, PORTLAND, OR, 97203", "5410 N CECELIA ST, PORTLAND, OR, 97203",...)
There are about 172,000 entries in the "parcels" dataframe and 285 in the "rentals" dataframe. My first solution was to match character values using grepl, which I don't think worked:
matches = grepl(rentals$address, parcels$address, fixed = TRUE)
This returns FALSE for every entry in parcels$address, but when I copy and paste some "address" values from "rentals" into Excel's Ctrl+F window while viewing the "parcels" dataframe, I do find a few of them. So some appear to match.
How would I best be able to find which observation's values in the "address" column of the "rentals" dataframe is a matching character sequence in the "parcels" dataframe?
Are the addresses all exact matches? That is, no variations in spacing, capitalization, apartment number? If so, you might be able to use the dplyr function left_join to create a new df, using the address as the key, like so
library(dplyr)
df_compare <- df_rentals %>%
  left_join(df_parcels, by = "address")
Additionally, if you have columns along the lines of df_rentals$rentals = "yes" and df_parcels$parcels = "yes", you can filter the resulting new dataframe:
df_both <- filter(df_compare, rentals == "yes", parcels == "yes")
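If the addresses are not exact matches (the rentals addresses above lack the city/state/ZIP suffix), a join won't hit. Since grepl() only uses the first element when given a vector of patterns, one alternative is to loop over the 285 rental addresses and substring-match each against parcels$address, a sketch of which (with made-up data) is:

```r
rentals <- data.frame(address = c("110 SW ARTHUR ST", "1610 NE 66TH AVE"))
parcels <- data.frame(address = c("635 N MARINE DR, PORTLAND, OR, 97217",
                                  "110 SW ARTHUR ST, PORTLAND, OR, 97201"))

# For each rental address, check whether it occurs as a substring of any
# parcel address (fixed = TRUE avoids regex interpretation).
hits <- sapply(rentals$address,
               function(a) any(grepl(a, parcels$address, fixed = TRUE)))
rentals$address[hits]
```

285 patterns against 172,000 strings is well within what this brute-force loop can handle in a few seconds.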

Filter factor variable based on counts

I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.
For example:
table(df$sub_area)
 La Jolla   Carlsbad  Escondido        ...
        2          5          1
I want to filter out those places that have only 1 count, since they don't offer much predictive power but add lots of computation time. However, I want to replace the sub_area entry for that property with blank or NA, since I still want to use the rest of the information for that property, such as bedrooms, bathrooms, etc.
For reference, an individual property entry might look like:
ID  Beds  Baths  City       Sub_area  sqm  ...
1   4     2      San Diego  La Jolla  100  ...
Then I can do
lm(price ~ beds + baths + city + sub_area)
under the new, smaller sub_area variable with fewer levels.
I want to do this because most of the predictive price power is contained in sub_area for the locations I'm working on.
One way:
areas <- names(which(table(df$Sub_area) > 10))
df$Sub_area[! df$Sub_area %in% areas] <- NA
Create a new dataframe with the number of occurrences for each subarea and keep the subareas that occur at least twice.
Then add NAs to the original dataframe if the subarea does not appear in the filtered sub_area_count.
library(dplyr)
sub_area_count <- df %>%
  count(sub_area) %>%
  filter(n > 1)
boo <- !df$sub_area %in% sub_area_count$sub_area
df[boo, ]$sub_area <- NA
You didn't give a reproducible example, but I think this will work for identifying the places where the count == 1:
count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq==1)]
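One detail the snippets above leave out: after rare sub-areas are replaced with NA, the factor still carries the now-empty levels, so drop them before fitting. A sketch with made-up data:

```r
# After replacing rare sub_areas with NA, the factor still remembers the
# empty levels; drop them so lm() doesn't carry ~3000 unused levels.
df <- data.frame(sub_area = factor(c("La Jolla", "La Jolla",
                                     "Carlsbad", "Escondido")),
                 price = c(1.2, 1.4, 0.9, 0.8))

counts <- table(df$sub_area)
df$sub_area[df$sub_area %in% names(counts)[counts == 1]] <- NA
df$sub_area <- droplevels(df$sub_area)
levels(df$sub_area)
```

Here "Carlsbad" and "Escondido" occur once, become NA, and are dropped, leaving only "La Jolla" as a level; the rows themselves (beds, baths, etc.) are kept.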

Can't match two dataframe values

I am not sure why the dataframe values do not match with each other.
I have a df named fileUpload which looks like this (the columns are aligned correctly):
Destination City  Year  Adults
Amsterdam         2015  2
Amsterdam         2016  2
Amsterdam         2015  2
Amsterdam         2016  2
Amsterdam         2015  3
There is a space after each city name.
I have another dataframe that is not uploaded, like this:
cities <- read.csv(text = "
City,Lat,Long,Pop
Amsterdam ,4.8952,52.3702,779808
Bali ,115.1889,-8.4095,4225000")
I need to merge the two dataframes, but I realized the city values come back as not matching (NA). I tried checking with fileUpload %in% cities, which returns FALSE.
I tried removing the space after the city, also did not work.
The typeof(df$city) for both is integer.
How can I make the city names match?
As pointed out in the comments you should convert your columns to strings from factors.
mergedCities <- merge(fileUpload, cities, by.x ="Destination City", by.y = "City", all = TRUE)
Set the all parameter to specify whether you want to keep all cities, just the ones from x or y, or only the cities present in both.
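A sketch of the cleanup before the merge (trimming the trailing spaces plus converting factors to character); the data below mirrors the question's examples:

```r
# Hypothetical stand-in for the uploaded file; stringsAsFactors = TRUE
# reproduces the factor columns from the question.
fileUpload <- data.frame(`Destination City` = c("Amsterdam ", "Amsterdam "),
                         Year = c(2015, 2016), Adults = c(2, 2),
                         check.names = FALSE, stringsAsFactors = TRUE)
cities <- read.csv(text = "
City,Lat,Long,Pop
Amsterdam ,4.8952,52.3702,779808
Bali ,115.1889,-8.4095,4225000", stringsAsFactors = TRUE)

# Convert factors to character and strip stray whitespace on the keys.
fileUpload$`Destination City` <- trimws(as.character(fileUpload$`Destination City`))
cities$City <- trimws(as.character(cities$City))

merged <- merge(fileUpload, cities,
                by.x = "Destination City", by.y = "City", all.x = TRUE)
merged
```

With both keys trimmed and of type character, every Amsterdam row picks up its Lat/Long/Pop values instead of NA.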