Making a new variable column using if/else statements - r

I have a dataset that contains a column with the state in which a particular office is located. I would like to use that column to make a new column denoting the region of the US in which that office is located. The state column has the postal abbreviation for each state (e.g. NY stands for New York), and I am using the US Census Bureau's Regions.
Here's a mock example of the data. I don't have a Region column, but I want to create it:
Store State Region
A FL South
B NY Northeast
C CA West
D IL Midwest
E MA Northeast
Let's make it simpler and let's just say I want to denote only offices in the Northeast. I used the following syntax:
stores$Northeast<-if(
stores$state=="ME"|"NH"|"VT"|"MA"|"RI"|"CT"|"NY"|"PA"|"NJ"){
print("Northeast")
} else{print("Non-northeast")
}
but I get an error message saying that the | operation doesn't work on characters. Is there a different function I should be using instead?

I'm posting in the interest of saving people's typing time. Two vectors that ship with the base R installation can be used to do this very efficiently: state.abb and state.region. A named vector can be indexed by its names, which makes it a look-up table. state.region is a factor, so it needs to be converted to character; state.abb is already a character vector, so wrapping it in as.character() is harmless but not required. The index column needs to be de-factorized as well if it was read in as a factor:
# Do read `?state`. Hey, S was invented in the US, but why not some Yuropean constants?
mock <- read.table(text = "Store State
A FL
B NY
C CA
D IL
E MA", header = TRUE)
stat <- as.character(state.region)
names(stat) <- as.character(state.abb)
mock$Region <- stat[as.character(mock$State)]
mock
Store State Region
1 A FL South
2 B NY Northeast
3 C CA West
4 D IL North Central
5 E MA Northeast
If you want to "edit" the regional assignments, do this:
stat["IL"] <- "Midwest"
mock$Region <- stat[as.character(mock$State)]
mock
Store State Region
1 A FL South
2 B NY Northeast
3 C CA West
4 D IL Midwest
5 E MA Northeast

You should probably use the %in% operator here:
NE = c("ME","NH","VT","MA","RI","CT","NY","PA","NJ")
# if() tests a single TRUE/FALSE, so use the vectorized ifelse() for a whole column
ifelse(stores$state %in% NE, "Northeast", "Non-northeast")
You can also define a new variable this way, especially if you are going to go on to define other regions:
stores$region = "Non-northeast"
stores$region[stores$state %in% NE] = "Northeast"
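As a rough sketch of how this pattern extends to more regions: each region gets its own vector of abbreviations and its own assignment. The `stores` data frame below is made up to mirror the question, the `MW` vector and the "Other" default label are assumptions added for illustration.

```r
# Hypothetical data standing in for the asker's `stores` data frame
stores <- data.frame(store = c("A", "B", "C", "D", "E"),
                     state = c("FL", "NY", "CA", "IL", "MA"),
                     stringsAsFactors = FALSE)

# Census Bureau Northeast and Midwest states (postal abbreviations)
NE <- c("ME", "NH", "VT", "MA", "RI", "CT", "NY", "PA", "NJ")
MW <- c("OH", "MI", "IN", "IL", "WI", "MN", "IA", "MO", "ND", "SD", "NE", "KS")

# Start with a default label, then overwrite the rows that belong to each region
stores$region <- "Other"
stores$region[stores$state %in% NE] <- "Northeast"
stores$region[stores$state %in% MW] <- "Midwest"
stores
```

Rows whose state is in neither vector keep the default label, which makes gaps in the region definitions easy to spot.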

You need the %in% operator!
stores$Northeast <- ifelse(stores$state %in% c("ME", "NH", "VT", "MA", "RI", "CT", "NY", "PA", "NJ"), "Northeast", "Non-northeast")
cheers
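For anyone wondering why the original attempt fails, here is a small sketch with made-up values: `==` binds tighter than `|`, so R tries to combine a logical vector with bare strings, and `|` does not accept character operands.

```r
state <- c("FL", "NY", "CA")  # hypothetical values for illustration

# This parses as (state == "ME") | "NH", and | rejects character operands,
# so the expression raises an error; tryCatch() captures the message instead
err <- tryCatch(state == "ME" | "NH", error = function(e) conditionMessage(e))

# %in% compares every element against the whole set and returns one logical per element
in_ne <- state %in% c("ME", "NH", "VT", "MA", "RI", "CT", "NY", "PA", "NJ")
in_ne
```

The logical vector from `%in%` is what `ifelse()` or logical indexing needs, one TRUE/FALSE per row.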

Related

Assigning Value to New Variable Based on Specific Values in Another Variable in R

I have a data.frame that contains state names and I would like to create a new variable called "region" in which a value is assigned based on the state that is found under the "state" variable.
For example, if the state variable has "Alabama" or "Georgia", I would like to have "Region" assigned as "South". If state is "Washington" or "California", I would like it assigned to "West". I have to do this for each of the 48 contiguous U.S. states, and I'm having difficulty figuring out the best way to do this. Any help in this (I'm sure simple) procedure would be great. What I am looking for is something like this in the end:
State Region
Wyoming West
Michigan Midwest
Alabama South
Georgia South
California West
Texas Central
And to be clear, I don't have the regions in a separate file; I have to create this as a new variable and create the region names myself. I'm just looking for a way for the code to go through all 3000 lines that I have and automatically assign the region name once I tell it how to do so.
Rather than type the region for every state, you can use the built-in "state.name" and "state.region" variables from the 'datasets' package (like Jon Spring suggests in his comment), e.g.
library(tidyverse)
library(datasets)
state_lookup_table <- data.frame(name = state.name,
region = state.region)
my_df <- data.frame(place = c("Washington", "California"),
value = c(1000, 2000))
my_df
#> place value
#> 1 Washington 1000
#> 2 California 2000
my_df %>%
left_join(state_lookup_table, by = c("place" = "name"))
#> place value region
#> 1 Washington 1000 West
#> 2 California 2000 West
Created on 2022-09-02 by the reprex package (v2.0.1)
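A base R sketch of the same lookup, for cases where loading tidyverse is not wanted. Here `match()` is swapped in for `left_join()`, and the "Puerto Rico" row is a made-up example of a value that has no entry in state.name.

```r
# Hypothetical data; "Puerto Rico" is included to show what an unmatched name produces
my_df <- data.frame(place = c("Washington", "California", "Alabama", "Puerto Rico"),
                    stringsAsFactors = FALSE)

# match() returns the position of each place in state.name (NA when absent);
# that position then indexes state.region, which is a factor
my_df$region <- as.character(state.region[match(my_df$place, state.name)])
my_df
```

Unmatched names come back as NA rather than raising an error, so it is worth checking `anyNA(my_df$region)` afterwards.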
I would go this way:
df <- data.frame(name = c("john", "will", "thomas", "Ali"),
state = c("California", "Alabama", "Washington", "Georgia"))
region_df <- data.frame(state= c("Alabama", "Georgia", "Washington"),
region = c("south", "south", "west"))
merged.df <- merge(df, region_df, all.x = TRUE, by = "state")
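A small sketch of what `all.x = TRUE` does in this merge, using the same toy data: California has no row in the lookup table, so it is kept with an NA region instead of being dropped. The variable name `merged` is just for illustration.

```r
df <- data.frame(name = c("john", "will", "thomas", "Ali"),
                 state = c("California", "Alabama", "Washington", "Georgia"),
                 stringsAsFactors = FALSE)
region_df <- data.frame(state = c("Alabama", "Georgia", "Washington"),
                        region = c("south", "south", "west"),
                        stringsAsFactors = FALSE)

# Left join on "state": every row of df survives, unmatched rows get NA
merged <- merge(df, region_df, by = "state", all.x = TRUE)
merged

# Without all.x = TRUE the California row would be dropped
inner <- merge(df, region_df, by = "state")
nrow(inner)
```

Note that `merge()` sorts the result by the key column by default, so the row order differs from `df`.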
I think you need a reference table to do so. For your specific question, a dict-like lookup (a named vector in R) would be the best solution.
ref_ge <- character(0)
ref_ge["Georgia"]="South"
ref_ge["Alabama"]="South"
ref_ge["California"]="West"
ref_ge["Georgia"]
# Or, if you can read the state -> region information from an Excel file into a data frame
df=data.frame(state=c("Georgia","Alabama","California"),region=c("South","South","West"))
ref2 <- df$region
names(ref2) <- df$state
ref2["Georgia"]

Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R

Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)
Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):
Index  State_Name  Abbreviation
1      California  CA
2      Maryland    MD
3      New York    NY
4      Texas       TX
5      Virginia    VA
6      Washington  WA
And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :
Index  State  Article
1      NA     Texas governor, Abbott, signs new abortion bill
2      NA     Effort to recall California governor Newsome loses steam
3      NA     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      NA     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      NA     Amazon HQ2 causing housing prices to soar in northern Virginia
Task: To create an R function that loops over the rows of df2, reads the state mentioned in each df2$Article, and cross-references it with df1$State_Name so that the NA in df2$State is replaced with the matching df1$Abbreviation. I know it's quite a mouthful. I'm stuck on how to start and finish this puzzle. Hard-coding is not an option, as the real datasheet I have has thousands of rows like this and will grow as we add more articles to text-scrape.
The output should look like:
Index  State  Article
1      TX     Texas governor, Abbott, signs new abortion bill
2      CA     Effort to recall California governor Newsome loses steam
3      NY     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      VA     Amazon HQ2 causing housing prices to soar in northern Virginia
Note: The fifth entry with DC is intended to be NA.
Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!
You can create a regex pattern from State_Name and use str_extract to extract the state from Article. Then use match to get the corresponding Abbreviation from df1.
library(stringr)
df2$State <- df1$Abbreviation[match(str_extract(df2$Article,
str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use inbuilt state.name and state.abb instead of df1 to get state name and abbreviations.
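A minimal sketch of that variant, assuming the built-in state.name and state.abb vectors replace df1. To keep it dependency-free, base `regexpr()`/`regmatches()` are swapped in for `str_extract()`, and the three headlines are a subset of the question's data.

```r
articles <- c("Texas governor, Abbott, signs new abortion bill",
              "DC statehood unlikely as Manchin opposes",
              "Amazon HQ2 causing housing prices to soar in northern Virginia")

# One alternation pattern covering all 50 state names
pattern <- paste(state.name, collapse = "|")

# regexpr() gives the first match per article (-1 when there is none)
m <- regexpr(pattern, articles)
found <- rep(NA_character_, length(articles))
found[m > 0] <- regmatches(articles, m)

# Translate the extracted names to postal abbreviations
abbr <- state.abb[match(found, state.name)]
abbr
```

Articles that mention no state, such as the DC headline, stay NA, which matches the intended output in the question.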
Here's a way to do this in a for loop:
for(i in seq(nrow(df1))) {
inds <- grep(df1$State_Name[i], df2$Article)
if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia
Not as concise as above but a Base R approach:
# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
# Coerce 0 length vectors to na values of the appropriate type:
# .zero_to_nas => function()
.zero_to_nas <- function(x){
if(identical(x, character(0))){
NA_character_
}else if(identical(x, integer(0))){
NA_integer_
}else if(identical(x, numeric(0))){
NA_real_
}else if(identical(x, complex(0))){
NA_complex_
}else if(identical(x, logical(0))){
NA
}else{
x
}
}
# Unlist cleaned list: res => vector
res <- unlist(lapply(lst, .zero_to_nas))
# Explicitly define return object: vector => GlobalEnv()
return(res)
}
# Classify each article as belonging to the appropriate state:
# clean_df => data.frame
clean_df <- transform(
df2,
State = df1$Abbreviation[
match(
list_2_vec(
regmatches(
Article,
gregexpr(
paste0(df1$State_Name, collapse = "|"), Article
)
)
),
df1$State_Name
)
]
)
# Data:
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland",
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA",
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA),
Article = c("Texas governor, Abbott, signs new abortion bill",
"Effort to recall California governor Newsome loses steam",
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))

Lapply instead of for loop in r

I want to write a function, rankall, that takes two arguments: an outcome name (outcome) and a hospital ranking (num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the ranking specified in num.
rankall <- function(outcome, num = "best") {
## Read outcome data
## Check that state and outcome are valid
## For each state, find the hospital of the given rank
## Return a data frame with the hospital names and the
## (abbreviated) state name
}
head(rankall("heart attack", 20), 10)
hospital state
AK <NA> AK
AL D W MCMILLAN MEMORIAL HOSPITAL AL
AR ARKANSAS METHODIST MEDICAL CENTER AR
AZ JOHN C LINCOLN DEER VALLEY HOSPITAL AZ
CA SHERMAN OAKS HOSPITAL CA
CO SKY RIDGE MEDICAL CENTER CO
CT MIDSTATE MEDICAL CENTER CT
DC <NA> DC
DE <NA> DE
FL SOUTH FLORIDA BAPTIST HOSPITAL FL
My function works correctly, but I did the last step (formatting the 2-column data frame) with the following loop:
new_data <- vector()
for(i in sort(unique(d$State))){
new_data <- rbind(new_data,cbind(d$Hospital.Name[which(d$State == i)][num],i))
}
new_data <- as.data.frame(new_data)
It is correct, but I know that it is possible to code the same loop with the lapply function.
My attempt is wrong:
lapply(d,function(x) x <-rbind(x,d$Hospital.Name[which(d$State == i)][num]))
How can I fix that?
I'm supposing your d data is already sorted:
new_data <- do.call(rbind,
lapply(unique(d$State),
function(state){
data.frame(State = state,
Hospital.Name = d$Hospital.Name[which(d$State==state)][num],
stringsAsFactors = FALSE)
}))
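To see the pattern run end to end, here is a sketch on toy data. The data frame `d`, its hospital names and the value of `num` are all made up; the real `d` would come from the sorted outcome file. The states are sorted explicitly here, as in the original loop.

```r
# Hypothetical, already-ranked data: two hospitals in AL, three in AK
d <- data.frame(State = c("AL", "AL", "AK", "AK", "AK"),
                Hospital.Name = c("A1", "A2", "K1", "K2", "K3"),
                stringsAsFactors = FALSE)
num <- 3

# Build one single-row data frame per state, then stack them with rbind
rows <- lapply(sort(unique(d$State)), function(state) {
  data.frame(hospital = d$Hospital.Name[d$State == state][num],
             state = state,
             stringsAsFactors = FALSE)
})
new_data <- do.call(rbind, rows)
new_data
```

Indexing past the end of a state's hospitals returns NA, which is how states with fewer than `num` hospitals end up as <NA> rows in the expected output.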

Easier way to invert data from list of character vectors?

I have a data.frame that looks something like this:
states responsible
1 KS Joe, Suzie
2 MO Bob
3 CO Suzie, Bob, Ralph
4 NE Joe
5 MT Suzie, Ralph
Where each state has a list of people responsible for it in another column. I'd like to invert this to create a list of all the states that each person is responsible for.
Here's how to create a reproducible example:
states <- c("KS", "MO", "CO", "NE", "MT")
responsible <- list(c("Joe", "Suzie"), "Bob", c("Suzie", "Bob", "Ralph"), "Joe", c("Suzie", "Ralph"))
df <- as.data.frame(cbind(states, responsible))
Here's how I would like the data to look:
person states
1 Joe KS, NE
2 Suzie KS, CO, MT
3 Bob MO, CO
4 Ralph CO, MT
I have used the following to get what I want, but I feel that I'm making it more complicated than it needs to be. Using melt and split gets me almost what I want, but I take a few more steps to then convert from indices back to the values. Here's the ugly solution:
library(reshape2)  # provides melt()
people <- unique(unlist(df$responsible))
foo <- melt(responsible)
bar <- split(foo$L1, foo$value)
#This function just grabs the indices from 'bar' and gets the corresponding states.
#Really ugly and I'm guessing unnecessary.
stackoverflow_function <- function(person) {
return(states[do.call('$', list(bar, paste0(person)))])
}
answer <- lapply(people, stackoverflow_function)
as.data.frame(cbind(people, answer))
Any help is appreciated. It feels like I'm overlooking something simple.
You can use data.table:
data.table::setDT(df)
df[, .(responsible = unlist(responsible)), .(states = unlist(states))][
  , .(states = list(states)), .(responsible)]
responsible states
1: Joe KS,NE
2: Suzie KS,CO,MT
3: Bob MO,CO
4: Ralph CO,MT
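If an extra package is not wanted, a base R sketch of the same inversion is possible with `rep()` and `split()`, working from the question's `states` and `responsible` objects. The name `by_person` is only for illustration.

```r
states <- c("KS", "MO", "CO", "NE", "MT")
responsible <- list(c("Joe", "Suzie"), "Bob", c("Suzie", "Bob", "Ralph"),
                    "Joe", c("Suzie", "Ralph"))

# Repeat each state once per responsible person, then group the states by person
by_person <- split(rep(states, lengths(responsible)), unlist(responsible))
by_person

# Optional: collapse to the two-column layout shown in the question
result <- data.frame(person = names(by_person),
                     states = sapply(by_person, paste, collapse = ", "),
                     row.names = NULL, stringsAsFactors = FALSE)
result
```

`split()` orders the groups alphabetically by person, so the row order differs from the question's desired output even though the contents are the same.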

approximate string matching on condition of a match in a separate field in R

I have two dataframes from which I would like to carry out approximate string matching.
> df1
Source Name Country
A Glen fiddich United Kingdom
A Talisker dark storm United Kingdom
B johnney walker United states
D veuve clicquot brut France
E nicolas feuillatte brut France
C glen morangie united kingdom
B Talisker 54 degrees United kingdom
F Talisker dark storm United states
The second data frame:
> df2
Source Name Country
A smirnoff ice Russia
A Talisker daek strome United Kingdom
B johnney walker United states
D veuve clicquot brut Australia
E nicolea feuilate brut Italy
C glen morangie united kingdom
B Talisker 54 degrees United kingdom
The key column for the approximate matching between the two data frames is "Name". Because of the relationship between the columns, it is important to select only approximate matches that also match on the "Country" column. An extract of the code I am using is below:
dist.mat <- stringdistmatrix(tolower(df1$title), tolower(df2$title), method = "jw",
nthread = getOption("sd_num_thread"))
min.dist <- apply(dist.mat, 1, min)
matched <- data.frame(df1$title,
as.character(apply(dist.mat, 1, function(x) df2$title[which(x == min(x))])),
apply(dist.mat, 1, which.min), "jw", apply(dist.mat, 1, min))
colnames(matched) <- c("to_be_matched", "closest_match", "index_closest_match",
"distance_method", "distance")
The code above only executes approximate match between df1 and df2 based on data in the "Name" column. What I want to do is have the approximate match on "Name" column selected on the condition that for the two values, there is a match on the "Country" column.
You should check out Python's fuzzywuzzy library, which has excellent fuzzy text matching capabilities. I would iterate through the unique countries and look for matches that pass a certain fuzz threshold score, like the following (assuming df1 and df2 are pandas DataFrames):
from fuzzywuzzy import fuzz, process
matches = []
for country in df1['Country'].unique().tolist():
    dfm1 = df1[df1['Country'] == country]
    dfm2 = df2[df2['Country'] == country]
    candidates = dfm2['Name'].tolist()
    # Pair each name with its best candidate; extractOne returns None when nothing reaches the cutoff
    matches.append(dfm1['Name'].apply(lambda x: (x, process.extractOne(x, candidates, score_cutoff=90))))
You can tweak the scorer input in order to get the matches the way you like it.
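Since the question is in R, here is a rough R sketch of the same idea: compute all pairwise distances, then blank out the pairs whose countries differ before picking the closest name. Base `adist()` (generalized edit distance) is swapped in for stringdist's Jaro-Winkler to avoid a package dependency, and the two small data frames are cut-down, made-up versions of the question's data.

```r
# Hypothetical subsets of the question's data frames
df1 <- data.frame(Name = c("Talisker dark storm", "veuve clicquot brut"),
                  Country = c("United Kingdom", "France"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Talisker daek strome", "veuve clicquot brut", "glen morangie"),
                  Country = c("United Kingdom", "Australia", "united kingdom"),
                  stringsAsFactors = FALSE)

# Pairwise edit distances: rows are df1 names, columns are df2 names
d <- adist(tolower(df1$Name), tolower(df2$Name))

# Keep only pairs from the same country (compared case-insensitively)
same_country <- outer(tolower(df1$Country), tolower(df2$Country), "==")
d[!same_country] <- NA

# Closest remaining candidate per row, or NA when no candidate shares the country
best <- apply(d, 1, function(x) if (all(is.na(x))) NA_integer_ else which.min(x))
matched <- data.frame(to_be_matched = df1$Name,
                      closest_match = df2$Name[best],
                      stringsAsFactors = FALSE)
matched
```

In this toy data the exact name match for "veuve clicquot brut" is discarded because its country differs, which is the behaviour the question asks for. A distance threshold could be added on top to reject poor matches.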

Resources