From state and county names to FIPS in R

I have the following data frame in R and would like to get FIPS codes from it. I tried the fips function in usmap (https://rdrr.io/cran/usmap/man/fips.html), but I could not get it to work because I thought the values needed to be enclosed in double quotes. I then tried paste0(""", df$state, """), but that did not work either. Is there an efficient way to get the FIPS codes?
> df1
state county
1 california napa
2 florida palm beach
3 florida collier
4 florida duval
UPDATE
I can get "\"california\"" by using dQuote. Thanks. After converting each column, I tried the following. How do I deal with these errors?
> df1$state <- dQuote(df1$state, FALSE)
> df1$county <- dQuote(df1$county, FALSE)
> fips(state = df1$state, county = df1$county)
Error in fips(state = df1$state, county = df1$county) :
`county` parameter cannot be used with multiple states.
> fips(state = df1$state[1], county = df1$county[1])
Error in fips(state = df1$state[1], county = df1$county[1]) :
"napa" is not a valid county in "california".
> fips(state = "california", county = "napa")
[1] "06055"

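We can split the dataset by state and apply fips to each group. As the errors above show, fips accepts several counties only when a single state is given, and the values must be plain strings without the extra quotes added by dQuote.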
library(usmap)
lapply(split(df1, df1$state), function(x)
fips(state = x$state[1], county = x$county))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with Map
lst1 <- split(df1$county, df1$state)
Map(fips, lst1, state = names(lst1))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with tidyverse
library(dplyr)
library(tidyr)
df1 %>%
group_by(state) %>%
summarise(new = list(fips(state = first(state), county = county))) %>%
unnest(c(new))
# A tibble: 4 x 2
# state new
# <chr> <chr>
#1 california 06055
#2 florida 12099
#3 florida 12021
#4 florida 12031
data
df1 <- structure(list(state = c("california", "florida", "florida",
"florida"), county = c("napa", "palm beach", "collier", "duval"
)), class = "data.frame", row.names = c("1", "2", "3", "4"))
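If calling fips group by group is not convenient, another option is a plain join against a lookup table. The sketch below uses base R only. The lookup data frame is a hand-built, hypothetical stand-in that contains just the four codes shown above; in practice it would be replaced by a complete county FIPS table, which is an assumption here.

```r
# Input data, as in the question
df1 <- data.frame(
  state  = c("california", "florida", "florida", "florida"),
  county = c("napa", "palm beach", "collier", "duval"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: only the four codes shown above.
# A real table would cover every state/county pair.
lookup <- data.frame(
  state  = c("california", "florida", "florida", "florida"),
  county = c("napa", "collier", "duval", "palm beach"),
  fips   = c("06055", "12021", "12031", "12099"),
  stringsAsFactors = FALSE
)

# merge() sorts by the join keys, so remember the original row order
df1$row <- seq_len(nrow(df1))
out <- merge(df1, lookup, by = c("state", "county"), all.x = TRUE)

# Restore the original order and drop the helper column
out <- out[order(out$row), c("state", "county", "fips")]
out
```

Rows without a match in the lookup table keep an NA in the fips column because of all.x = TRUE, which makes misspelled names easy to spot.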

Related

How to order data frame by a partial string (or first word)

I'm very new to R and I'm self-learning basic operations.
I would like to yield the following:
County Population
ACounty, Alabama 106242
BCounty, Alabama 362845
ACounty, Texas 242342
BCounty, Texas 293729
I've tried:
df<-df %>% arrange(County)
view(df)
Which ends up as:
County Population
ACounty, Alabama 106242
ACounty, Texas 242342
BCounty, Alabama 362845
BCounty, Texas 293729
You can divide County and States and arrange data based on State.
library(dplyr)
library(tidyr)
df %>%
separate(County, c('County', 'State'), sep = ",\\s*") %>%
arrange(State) %>%
unite(County, County, State, sep = ", ")
In base R, you can keep only the state information by removing everything till comma and use order to arrange data by state.
df[order(sub('.*,', '', df$County)), ]
We can do this without splitting or uniting
library(dplyr)
library(stringr)
df1 %>%
arrange(str_remove(County, ".*,\\s*"))
# County Population
#1 ACounty, Alabama 106242
#2 BCounty, Alabama 362845
#3 ACounty, Texas 242342
#4 BCounty, Texas 293729
data
df1 <- structure(list(County = c("ACounty, Alabama", "BCounty, Alabama",
"ACounty, Texas", "BCounty, Texas"), Population = c(106242L,
362845L, 242342L, 293729L)), class = "data.frame", row.names = c(NA,
-4L))
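To sort by state first and break ties by county name, both parts of the string can be used as sort keys. This is a minimal base R sketch; the starting order is the unwanted county-first order from the question, and the helper vectors state and county are names introduced here for illustration.

```r
# Data in the county-first order the asker ended up with
df <- data.frame(
  County = c("ACounty, Alabama", "ACounty, Texas",
             "BCounty, Alabama", "BCounty, Texas"),
  Population = c(106242L, 242342L, 362845L, 293729L),
  stringsAsFactors = FALSE
)

# Everything after the comma (and any following spaces) is the state
state <- sub("^.*,\\s*", "", df$County)
# Everything before the comma is the county name
county <- sub(",.*$", "", df$County)

# Order by state, then by county within each state
sorted <- df[order(state, county), ]
rownames(sorted) <- NULL
sorted
```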

Iterate in R with a function that requires four vectors

I'm trying to find the distance between multiple cities using the distHaversine function in the geosphere package. This code requires a variety of arguments:
The longitude and latitude of the first place.
The longitude and latitude of the second place.
The radius of the earth in whatever unit (I'm using r = 3961 for miles).
When I input this as a vector, it works easily:
HongKong <- c(114.17, 22.31)
GrandCanyon <- c(-112.11, 36.11)
library(geosphere)
distHaversine(HongKong, GrandCanyon, r=3961)
#[1] 7399.113 distance in miles
However, my actual datasets look like this:
library(dplyr)
location1 <- tibble(person = c("Sally", "Jane", "Lisa"),
current_loc = c("Bogota Colombia", "Paris France", "Hong Kong China"),
lon = c(-74.072, 2.352, 114.169),
lat = c(4.710, 48.857, 22.319))
location2 <- tibble(destination = c("Atlanta United States", "Rome Italy", "Bangkok Thailand", "Grand Canyon United States"),
lon = c(-84.388, 12.496, 100.501, -112.113),
lat = c(33.748, 41.903, 13.756, 36.107))
What I want is for there to be rows that say how far each destination is from the person's current location.
I know there has to be a way using purrr's pmap_dbl(), but I'm unable to figure it out.
Bonus points if your code uses the tidyverse and if there's any easy way to make a column that identifies the closest destination. Thank you!
In an ideal world, I would get this:
solution <- tibble(person = c("Sally", "Jane", "Lisa"),
current_loc = c("Bogota Colombia", "Paris France", "Hong Kong China"),
lon = c(-74.072, 2.352, 114.169),
lat = c(4.710, 48.857, 22.319),
dist_Atlanta = c(1000, 2000, 7000),
dist_Rome = c(2000, 500, 3000),
dist_Bangkok = c(7000, 5000, 1000),
dist_Grand = c(1500, 4000, 7500),
nearest = c("Atlanta United States", "Rome Italy", "Bangkok Thailand"))
Note: The numbers in the dist columns are random; however, they would be the output from the distHaversine() function. The name of those columns is arbitrary--it does not need to be called that. Also, if the nearest column is out of the scope of this question, I think that I can figure that one out.
distHaversine compares points pairwise (row i of the first argument against row i of the second), so to get every combination of location1 and location2 rows we can call it once per pair. One way using sapply would be
library(geosphere)
location1[paste0("dist_", stringr::word(location2$destination))] <-
t(sapply(seq_len(nrow(location1)), function(i)
sapply(seq_len(nrow(location2)), function(j) {
distHaversine(location1[i, c("lon", "lat")], location2[j, c("lon", "lat")], r=3961)
})))
location1$nearest <- location2$destination[apply(location1[5:8], 1, which.min)]
location1
# A tibble: 3 x 9
# person current_loc lon lat dist_Atlanta dist_Rome dist_Bangkok dist_Grand nearest
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 Sally Bogota Colombia -74.1 4.71 2114. 5828. 11114. 3246. Atlanta United States
#2 Jane Paris France 2.35 48.9 4375. 687. 5871. 5329. Rome Italy
#3 Lisa Hong Kong China 114. 22.3 8380. 5768. 1075. 7399. Bangkok Thailand
Using the tidyverse and the map functions from purrr as you asked, I found a solution, all in one pipeline.
library(tidyverse)
library(geosphere)
# renaming lon and lat variables in each df
location1 <- location1 %>%
rename(lon.act = lon, lat.act = lat)
location2 <- location2 %>%
rename(lon.dest = lon, lat.dest = lat)
# getting distances
merge(location1, location2, all = TRUE) %>%
group_by(person,current_loc, destination) %>%
nest() %>%
mutate( act = map(data, `[`, c("lon.act", "lat.act")) %>%
map(as.numeric),
dest = map(data, `[`, c("lon.dest", "lat.dest")) %>%
map(as.numeric),
dist = map2(act, dest, ~distHaversine(.x, .y, r = 3961))) %>%
unnest(c(data, dist)) %>%
group_by(person) %>%
mutate(mindis = dist == min(dist))
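For readers who want to see the whole distance matrix computed in one step without extra packages, here is a base R sketch. It writes out the haversine formula by hand instead of calling geosphere, so the function haversine below is an illustrative helper and not part of any package. The radius r = 3961 miles is taken from the question.

```r
# Great-circle distance via the haversine formula.
# Inputs are in degrees; r is the sphere radius (3961 = miles, as in the question).
haversine <- function(lon1, lat1, lon2, lat2, r = 3961) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

location1 <- data.frame(
  person = c("Sally", "Jane", "Lisa"),
  lon = c(-74.072, 2.352, 114.169),
  lat = c(4.710, 48.857, 22.319),
  stringsAsFactors = FALSE
)
location2 <- data.frame(
  destination = c("Atlanta United States", "Rome Italy",
                  "Bangkok Thailand", "Grand Canyon United States"),
  lon = c(-84.388, 12.496, 100.501, -112.113),
  lat = c(33.748, 41.903, 13.756, 36.107),
  stringsAsFactors = FALSE
)

# outer() expands every (person, destination) index pair;
# haversine() is vectorised, so one call fills the whole matrix
dist_mat <- outer(
  seq_len(nrow(location1)), seq_len(nrow(location2)),
  function(i, j) haversine(location1$lon[i], location1$lat[i],
                           location2$lon[j], location2$lat[j])
)
dimnames(dist_mat) <- list(location1$person, location2$destination)

# Closest destination for each person
nearest <- location2$destination[apply(dist_mat, 1, which.min)]
```

The matrix has one row per person and one column per destination, so it can be bound to location1 with cbind if the wide layout from the question is wanted.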

How can I fuzzy string match multiple strings from different sized data frames?

I would like to match the strings from my first dataset with all of their closest common matches.
Data looks like:
dataset1:
California
Texas
Florida
New York
dataset2:
Californiia
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york
desired result is:
col_1 col_2 col_3 col4
California Californiia callifoornia
Texas T3xas texas Te xas
Florida folrida Fl0 rida
New York New york new york
The question is:
How do I search for common strings between the first dataset and the
second dataset, and generate a list of terms in the second dataset
that align with each term in the first?
Thanks in advance.
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
stringdist_left_join(dataset2,
max_dist = 3) %>%
rename(col_1 = "states.x") %>%
group_by(col_1) %>%
mutate(col = paste0("col_", row_number() + 1)) %>%
spread(col, states.y)
#Joining by: "states"
## A tibble: 4 x 4
## Groups: col_1 [4]
# col_1 col_2 col_3 col_4
# <chr> <chr> <chr> <chr>
#1 California Californiia callifoornia NA
#2 Florida Fl0 rida folrida NA
#3 New York New york new york NA
#4 Texas T3xas Te xas texas
data:
dataset1 <- data.frame(states = c("California",
"Texas",
"Florida",
"New York"),
stringsAsFactors = F)
dataset2 <- data.frame(stringsAsFactors = F,
states = c(
"Californiia",
"callifoornia",
"T3xas",
"Te xas",
"texas",
"Fl0 rida",
"folrida",
"New york",
"new york"
)
)
I read a bit about stringdist and came up with this. It's a workaround, but I like it. Can definitely be improved:
library(stringdist)
library(janitor)
ds1a <- read.csv('dataset1')
ds2a <- read.csv('dataset2')
# rows are the dataset2 strings, columns are the dataset1 strings
distancematrix <- stringdistmatrix(ds2a$name, ds1a$name, useNames = T)
df <- as.data.frame(distancematrix)
# go through this df column by column: every cell that's < 4 is replaced with the column name, otherwise with an empty string
for (j in seq_len(ncol(df))) {
trigger <- df[, j] < 4
df[, j] <- ifelse(trigger, names(df)[j], "")
}
df <- remove_constant(df)
write.csv(df, file="~/Desktop/df.csv")
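The same idea can be sketched with base R alone, using adist from the utils package, which computes a generalized Levenshtein distance. The threshold of 3 mirrors the max_dist used above; ignoring case is an extra assumption made here so that "texas" and "new york" count as exact matches.

```r
dataset1 <- c("California", "Texas", "Florida", "New York")
dataset2 <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
              "Fl0 rida", "folrida", "New york", "new york")

# Edit distance between every dataset1 string (rows)
# and every dataset2 string (columns), ignoring case
d <- adist(dataset1, dataset2, ignore.case = TRUE)
dimnames(d) <- list(dataset1, dataset2)

# Keep, for each dataset1 term, the dataset2 terms within the threshold
max_dist <- 3
matches <- lapply(seq_along(dataset1), function(i) dataset2[d[i, ] <= max_dist])
names(matches) <- dataset1
matches
```

The result is a named list, so terms with different numbers of matches do not need NA padding.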

Using a variable number of groups with do in function

I would like to understand if and how this could be achieved using the tidyverse framework.
Assume I have the following simple function:
my_fn <- function(list_char) {
data.frame(comma_separated = rep(paste0(list_char, collapse = ","),2),
second_col = "test",
stringsAsFactors = FALSE)
}
Given the below list:
list_char <- list(name = "Chris", city = "London", language = "R")
my function works fine if you run:
my_fn(list_char)
However, if some of the list's elements are character vectors, we can use the dplyr::do function in the following way to run the function on every combination:
list_char_mult <- list(name = c("Chris", "Mike"),
city = c("New York", "London"), language = "R")
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
tbl_df() %>%
group_by_all() %>%
do(my_fn(list(name = .$name, city = .$city, language = "R")))
The question is how to write a function that could do this for a list with a variable number of elements. For example:
my_fn_generic <- function(list_char_mult) {
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
tbl_df() %>%
group_by_all() %>%
do(my_fn(...))
}
Thanks
Regarding how to use the function with a variable number of arguments:
my_fn_generic <- function(list_char) {
expand.grid(list_char, stringsAsFactors = FALSE) %>%
tbl_df() %>%
group_by_all() %>%
do(do.call(my_fn, list(.)))
}
my_fn_generic(list_char_mult)
# A tibble: 4 x 4
# Groups: name, city, language [4]
# name city language comma_separated
# <chr> <chr> <chr> <chr>
#1 Chris London R Chris,London,R
#2 Chris New York R Chris,New York,R
#3 Mike London R Mike,London,R
#4 Mike New York R Mike,New York,R
Or use the pmap
library(tidyverse)
list_char_mult %>%
expand.grid(., stringsAsFactors = FALSE) %>%
mutate(comma_separated = purrr::pmap_chr(.l = ., .f = paste, sep=", ") )
# name city language comma_separated
#1 Chris New York R Chris, New York, R
#2 Mike New York R Mike, New York, R
#3 Chris London R Chris, London, R
#4 Mike London R Mike, London, R
If I understand your question, you could use apply without grouping:
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
mutate(comma_separated = apply(., 1, paste, collapse=","))
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
mutate(comma_separated = apply(., 1, my_fn))
name city language comma_separated
1 Chris London R Chris,London,R
2 Chris New York R Chris,New York,R
3 Mike London R Mike,London,R
4 Mike New York R Mike,New York,R
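If only the comma-separated column is needed, a base R sketch that works for any number of list elements is to hand all columns of the grid to paste through do.call. No grouping is involved, and nothing here depends on how many elements list_char_mult has.

```r
list_char_mult <- list(name = c("Chris", "Mike"),
                       city = c("New York", "London"),
                       language = "R")

# One row per combination of the list elements
grid <- expand.grid(list_char_mult, stringsAsFactors = FALSE)

# c(grid, sep = ",") turns the columns into a list of arguments for paste(),
# so each row is pasted together regardless of the number of columns
grid$comma_separated <- do.call(paste, c(grid, sep = ","))
grid
```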

Use aggregate function to calculate output in data frame

I've been trying this myself and searching the net and Stack Overflow for a while now, without success. I have a data frame that I subset by a condition and project onto selected columns, but I cannot get aggregated output from it.
Dataframe mydf:
mydf = list()
mydf = cbind(mydf,
c("New York", "New York", "San Francisco"),
c(4000, 7600, 2500),
c("Bartosz", "Damian", "Maciej"))
mydf = as.data.frame(mydf)
colnames(mydf) = c("city","salary","name")
Let's assume this part of the data frame is returned with:
subset(mydf, city == "New York", select = c(salary, name))
which returns a data frame such as:
salary name
9 4000 Bartosz
10 7600 Damian
Now I need to calculate the sum and average of the given salaries and pick the employee with the lowest salary from the above data frame, preferably using a one-liner that modifies the above code (I'm guessing it's possible), so that it returns:
for sum: 11600
for avg: 5800
for least: 4000 Bartosz
I've tried things such as (1)
subset(mydf, city == "New York", select = sum(salary))
or (2)
x = subset(mydf, city == "New York", select = salary)
min(x)
and many more combinations, which only yield errors saying that the summary function is only defined on a data frame with all variables being numbers (2), or the same output as the first code without the sum (1).
The problem might be that your dataframe object actually contains a bunch of lists. So if you take
ny.df = subset(mydf, city == "New York", select = c(salary, name))
then any of the subsequent work needs to be peppered with as.numeric calls to translate your lists into vectors. These will give you your answers:
sum(as.numeric(ny.df$salary)) # sum
mean(as.numeric(ny.df$salary)) # avg
ny.df[which(as.numeric(ny.df$salary) == min(as.numeric(ny.df$salary))),] # row with min salary
Alternatively, you can define mydf as a dataframe of vectors instead of a dataframe of lists:
mydf = data.frame(c("New York", "New York", "San Francisco"),
c(4000, 7600, 2500),
c("Bartosz", "Damian", "Maciej"))
colnames(mydf) = c("city","salary","name")
ny.df = subset(mydf, city == "New York", select = c(salary, name))
sum(ny.df$salary)
mean(ny.df$salary)
ny.df[which(ny.df$salary == min(ny.df$salary)),]
Your mydf was weird so I made my own. I split mydf by city and then obtained the necessary data from running necessary operations (mean, sum, etc.) on each subgroup.
#DATA
mydf = structure(list(city = structure(c(1L, 1L, 2L), .Label = c("New York",
"San Francisco"), class = "factor"), salary = c(4000, 7600, 2500
), name = structure(1:3, .Label = c("Bartosz", "Damian", "Maciej"
), class = "factor")), .Names = c("city", "salary", "name"), row.names = c(NA,
-3L), class = "data.frame")
do.call(rbind, lapply(split(mydf, mydf$city), function(a)
data.frame(employee = a$name[which.min(a$salary)], #employee with least salary
mean = mean(a$salary), #mean salary
sum = sum(a$salary)))) #sum of salary
# employee mean sum
#New York Bartosz 5800 11600
#San Francisco Maciej 2500 2500
There is a simple and fast solution using data.table
library(data.table)
setDT(mydf)[, .( salary_sum = sum(salary),
salary_avg = mean(salary),
name = name[which.min(salary)]), by= city]
> city salary_sum salary_avg name
> 1: New York 11600 5800 Bartosz
> 2: San Francisco 2500 2500 Maciej
your dataset:
mydf = data.frame(city=c("New York", "New York", "San Francisco"),
salary=c(4000, 7600, 2500),
name=c("Bartosz", "Damian", "Maciej"))
Your data frame is structured unusually, as lists within the data frame, which may be causing you issues. Here is a dplyr solution (now edited to find the lowest salary)
library(dplyr)
mydf <- data.frame(
city = c("New York", "New York", "San Francisco"),
salary = c(4000, 7600, 2500),
name = c("Bartosz", "Damian", "Maciej"))
mydf %>%
group_by(city) %>%
mutate(avg = mean(salary),
sum = sum(salary)) %>%
top_n(-1, wt = salary)
# city salary name avg sum
# <fctr> <dbl> <fctr> <dbl> <dbl>
# 1 New York 4000 Bartosz 5800 11600
# 2 San Francisco 2500 Maciej 2500 2500
I think dplyr is what you might be looking for:
library(dplyr)
mydf %>%
group_by(city) %>%
filter (city =="New York") %>%
summarise(mean(salary), sum(salary))
# A tibble: 1 x 3
# city mean(salary) sum(salary)
# <fctr> <dbl> <dbl>
#1 New York 5800 11600
There is a good tutorial at https://rpubs.com/justmarkham/dplyr-tutorial
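Since the question title mentions aggregate, here is a minimal base R sketch that uses it. It assumes mydf is built as a regular data frame of vectors, as in the answers above. The names sums, avgs and lowest are introduced here for illustration.

```r
mydf <- data.frame(
  city   = c("New York", "New York", "San Francisco"),
  salary = c(4000, 7600, 2500),
  name   = c("Bartosz", "Damian", "Maciej"),
  stringsAsFactors = FALSE
)

# Sum and mean of salary per city
sums <- aggregate(salary ~ city, data = mydf, FUN = sum)
avgs <- aggregate(salary ~ city, data = mydf, FUN = mean)

# ave() returns the per-city minimum aligned with the rows of mydf,
# so comparing against it keeps the lowest-paid employee(s) in each city
lowest <- mydf[mydf$salary == ave(mydf$salary, mydf$city, FUN = min), ]

sums
avgs
lowest
```

Filtering any of the three results with city == "New York" gives the single-city values asked for in the question.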
