Need to ID states from mixed names /IDs in location data - r

Need to ID states from mixed location data
Need to search for 50 states abbreviations & 50 states full names, and return state abbreviation
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL" , "Houston, TX",
+ "Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF" )
df <- data.frame(N, Loc)
> # Objective create variable state such
> # state contains abbreviated names of states from Loc:
> # for "Los Angeles, CA", state = CA
> # for "Florida, USA", sate = FL
> # for "WV NY NJ", state = NA
> # for "qwerty NJuy PO DOPL JKF", sate = NA (inspite of containing the srting NJ, it is not wrapped in spaces)
>
# End result should be Newdf
State <- c("CA", "NY", "FL", "IL", "TX","TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
> Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package? or can a loop be written? Even if the schema could be demonstrated with a few states, that would be sufficient - I will post the full solution when I get to it. Btw, this is for a Twitter dataset downloaded using rtweet package, and the variable is: place_full_name

There are default constants in R, state.abb and state.name which can be used.
vars <- stringr::str_extract(df$Loc, paste0('\\b',c(state.abb, state.name),
'\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do :
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, we can see that in 9th row you expect output as NA but here it returns "WV" because it is a state name. In such cases, you need to prepare rules which are strict enough so that it only extracts state names and nothing else.

Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
Now if the state abbreviations are not in any of the built-in ones, then we can use match to find the positions of our state.names that are in any of the items in the built-in state.name vector, and use that to index state.abb, else keep what we already have. Those that don't match either return NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>

Related

How do I rename the values in my column as I have misspelt them and cant rename them in R or Colab

I have a data frame that was given to me. Under the column titled state, there are two components with the same name but with different case sensitivities ie one is "London" and the other is "LONDON". How would i be able to rename "LONDON" to become "London" in order to total them up together and not separately. reminder, I am trying to change the name of the input not the name of the column.
You can use the following code, df is your current dataframe, in which you want to substitute "LONDON" for "London"
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil","US", "Brazil", "UK", "Germany"),
State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
Candidate = c(1:8))
print(df)
output
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK LONDON 7
8 Germany Berlin 8
then run the following code to substitute London to all the instances where State is equal to "LONDON"
df[df$State == "LONDON", "State"] <- "London"
Now the output will be as
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK London 7
8 Germany Berlin 8
Maybe you could try using the case_when function. I would do something like this:
ยดยดยดยด
mutate(data, State_def=case_when(State=="LONDON" ~ "London",
State=="London" ~ "London",
TRUE ~ NA_real_)
I might misunderstand, but I think it should be as simple as this:
x$state <- sub( "LONDON", "London", x$state, fixed=TRUE )
This should change LONDON to London

Extract cell with AND without commas in R

I'm trying to extract the city and state from the Address column into 2 separate columns labeled City and State in r. This is what my data looks like:
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ")) %>%
separate(address, c("City", "State"), sep=",")
I tried using the separate function but that only gets the ones with commas. Any ideas on how to do this for both cases?
There is a pattern at the end (space, letter, letter) which I can use to exploit and then remove any commas but not sure how the syntax would work using grep.
Starting from your df
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
> df
address
1 Los Angeles, CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia, PA
6 Trenton, NJ
It's possible to use gsub to subset the string like this:
> city=gsub(',','',gsub("(.*).{3}","\\1",df[,1]))
> city
[1] "Los Angeles" "Pittsburgh" "Miami" "Baltimore" "Philadelphia"
[6] "Trenton"
> state=gsub(".*(\\w{2})","\\1",df[,1])
> state
[1] "CA" "PA" "FL" "MD" "PA" "NJ"
df=data.frame(City=city,State=state)
> df
City State
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
This is a little unorthodox but it works well. It assumes that all states are 2 characters long and that there is at least 1 space between the city and state. Comma's are ignored
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
df$city <- substring(sub(",","",df$address),1,nchar(sub(",","",df$address))-3)
df$state <- substring(as.character(df$address),nchar(as.character(df$address))-1,nchar(as.character(df$address)))
df <- within(df,rm(address))
output:
city state
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ

Function to subset dataframe on optional arguments

I have a dataframe as follows:
df1 <- data.frame(
Country = c("France", "England", "India", "America", "England"),
City = c("Paris", "London", "Mumbai", "Los Angeles", "London"),
Order_No = c("1", "2", "3", "4", "5"),
delivered = c("Yes", "no", "Yes", "No", "yes"),
stringsAsFactors = FALSE
)
and multiple other columns as well (around 50)
I want to write a function which is generic and can take in as many parameters as the user wants and return a subset of only those specific columns. So the user should technically be able to pass 1 column or 30 columns to get the result back from function
With what I was able to find online on optional arguments, I wrote this following code but I am running into issues. Can anyone help me out here?
SubsetFunction <- function(inputdf, ...)
{
params <- vector(...)
subset.df <- subset(inputdf, select = params)
return(subset.df)
}
This is the error I am getting -
error in vector(...) :
vector: cannot make a vector of mode 'Country'.
The use of vector(...) is making problems here. The ellipsis has to be converted into a list instead. Therefore, in order to finally obtain a vector out of the three-dot parameters, the seemingly awkward construction unlist(list(...)) should be used instead of vector(...):
SubsetFunction <- function(inputdf, ...){
params <- unlist(list(...))
subset.df <- subset(inputdf, select=params)
return(subset.df)
}
This allows to call the function SubsetFunction() with an arbitrary number of parameters:
> SubsetFunction(df1, "City")
# City
#1 Paris
#2 London
#3 Mumbai
#4 Los Angeles
#5 London
> SubsetFunction (df1, "City", "delivered")
# City delivered
#1 Paris Yes
#2 London no
#3 Mumbai Yes
#4 Los Angeles No
#5 London yes
We can use missing function here to check if arguments are present or not
select_cols <- function(df, cols) {
if(missing(cols))
df
else
df[cols]
}
select_cols(df1, c("Country", "City"))
# Country City
#1 France Paris
#2 England London
#3 India Mumbai
#4 America Los Angeles
#5 England London
select_cols(df1)
# Country City Order_No delivered
#1 France Paris 1 Yes
#2 England London 2 no
#3 India Mumbai 3 Yes
#4 America Los Angeles 4 No
#5 England London 5 yes

How to clean the city and state(both full and abbreviation) using R

I have a list of uncleaned city and state from "Location" in twitter, for example:
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
How to clean the data to make a clean list of two columns with city and state?
You can do this:
splitted_list <- strsplit(location,",")
wide_matrix <- sapply(splitted_list,function(x) c(rep(NA,length(x)==1),x))
res <- setNames(data.frame(t(wide_matrix),stringsAsFactors = FALSE),c("city","state"))
res
# city state
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
Assuming your data (location) is already part of a data.frame which you want to clean up, then tidyr::separate can be suitable option.
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
library(tidyverse)
as.data.frame(location) %>% # I created a data.frame, which is not needed in actual data
tidyr::separate(location, c("City", "State"), sep=",", fill="left")
# City State
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI

Empty rows in list as NA values in data.frame in R

I have a dataframe as follows:
hospital <- c("PROVIDENCE ALASKA MEDICAL CENTER", "ALASKA REGIONAL HOSPITAL", "FAIRBANKS MEMORIAL HOSPITAL",
"CRESTWOOD MEDICAL CENTER", "BAPTIST MEDICAL CENTER EAST", "ARKANSAS HEART HOSPITAL",
"MEDICAL CENTER NORTH LITTLE ROCK", "CRITTENDEN MEMORIAL HOSPITAL")
state <- c("AK", "AK", "AK", "AL", "AL", "AR", "AR", "AR")
rank <- c(1,2,3,1,2,1,2,3)
df <- data.frame(hospital, state, rank)
df
hospital state rank
1 PROVIDENCE ALASKA MEDICAL CENTER AK 1
2 ALASKA REGIONAL HOSPITAL AK 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 3
4 CRESTWOOD MEDICAL CENTER AL 1
5 BAPTIST MEDICAL CENTER EAST AL 2
6 ARKANSAS HEART HOSPITAL AR 1
7 MEDICAL CENTER NORTH LITTLE ROCK AR 2
8 CRITTENDEN MEMORIAL HOSPITAL AR 3
I would like to create a function, rankall, that takes rank as an argument and returns the hospitals of that rank for each state, with NAs returned if the state does not have a hospital that matches the given rank. For example, I want output of rankall(rank=3) to look like this:
hospital state
AK FAIRBANKS MEMORIAL HOSPITAL AK
AL <NA> AL
AR CRITTENDEN MEMORIAL HOSPITAL AR
I've tried:
rankall <- function(rank) {
split_by_state <- split(df, df$state)
ranked_hospitals <- lapply(split_by_state, function (x) {
x[(x$rank==rank), ]
})
combined_ranked_hospitals <- do.call(rbind, ranked_hospitals)
return(combined_ranked_hospitals[ ,1:2])
}
But rankall(rank=3) returns:
hospital state
AK FAIRBANKS MEMORIAL HOSPITAL AK
AR CRITTENDEN MEMORIAL HOSPITAL AR
This leaves out the NA values that I need to keep track of. Is there a way for R to recognize the empty rows in my list object within my function as NAs, rather than as empty rows? Is there another function besides lapply that would be more useful for this task?
[ Note: This dataframe is from the Coursera R Programming course. This is also my first post on Stackoverflow, and my first time learning programming. Thank you to all who offered solutions and advice, this forum is fantastic. ]
You just need an in/else in your function:
rankall <- function(rank) {
split_by_state <- split(df, df$state)
ranked_hospitals <- lapply(split_by_state, function (x) {
indx <- x$rank==rank
if(any(indx)){
return(x[indx, ])
else{
out = x[1, ]
out$hospital = NA
return(out)
}
}
}
Here's an alternative approach:
rankall <- function(rank) {
do.call(rbind, lapply(split(df, df$state), function(df) {
tmp <- df[df$rank == rank, 1:2]
if (!nrow(tmp)) return(transform(df[1, 1:2], hospital = NA)) else return(tmp)
}))
}
rankall(3)
# hospital state
# AK FAIRBANKS MEMORIAL HOSPITAL AK
# AL <NA> AL
# AR CRITTENDEN MEMORIAL HOSPITAL AR
Here is another dplyr approach.
fun1 <- function(x) {
group_by(df, state) %>%
summarise(hospital = hospital[x],
rank = nth(rank, x))
}
# fun1(3)
#Source: local data frame [3 x 3]
#
# state hospital rank
#1 AK FAIRBANKS MEMORIAL HOSPITAL 3
#2 AL NA NA
#3 AR CRITTENDEN MEMORIAL HOSPITAL 3
I think this is a good use of dplyr. Only thing that's weird is summarize complains when I use NA instead of "NA". Anyone have thoughts on why?
library(dplyr)
rankall <- function(chosen_rank){
group_by(df, state) %>%
summarize(hospital = ifelse(length(hospital[rank==chosen_rank])!=0,
as.character(hospital[rank==chosen_rank]), "NA"),
rank = chosen_rank)
}
rankall(1)
rankall(2)
rankall(3)

Resources