Using compound statement in an "if" clause - r

I found a similar question involving templates that was beyond the scope of my question. I want to be able to say something like: if (a and b) then (do something). Here is an example:
t1 <- tribble(
~state, ~county,
"New York", "Bronx",
"New York", "Richmond",
"New York", "Albany",
"Virginia", "Richmond"
)
five_boroughs = c("Bronx", "Kings", "New York", "Queens", "Richmond")
if t1$state == "New York" && t1$county in five_boroughs
t1$county = "New York City"
Using either &, &&, in, or %in% puts New York City in Virginia. I apologize to New Yorkers for calling counties boroughs.

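We can use case_when, which is vectorized over the rows. Include a fallback so that rows that do not match keep their original county instead of becoming NA:
library(dplyr)
library(stringr)
t1 %>%
  mutate(county = case_when(
    state == 'New York' & county %in% five_boroughs ~ str_c(state, ' City'),
    TRUE ~ county  # all other rows keep their county
  ))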
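For context on why the `if` attempt misbehaves: `if` expects a single TRUE or FALSE, while the comparison here produces one value per row. A minimal base R sketch of the same idea, using a logical index instead of `if` (the data frame is rebuilt with `data.frame` so the example runs without extra packages):

```r
# Sample data from the question, rebuilt as a plain data frame
t1 <- data.frame(
  state  = c("New York", "New York", "New York", "Virginia"),
  county = c("Bronx", "Richmond", "Albany", "Richmond"),
  stringsAsFactors = FALSE
)
five_boroughs <- c("Bronx", "Kings", "New York", "Queens", "Richmond")

# Element-wise `&` gives one TRUE/FALSE per row; `&&` is for single values only
in_nyc <- t1$state == "New York" & t1$county %in% five_boroughs

# Assign only to the rows where both conditions hold
t1$county[in_nyc] <- "New York City"
t1$county
# [1] "New York City" "New York City" "Albany"        "Richmond"
```

Richmond, Virginia is left alone because the state condition is FALSE for that row.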

Related

r- Error when trying to use mutate with case_when

I am trying to add a vector to a data frame holding the region of each US state. I have tried the following code and keep getting an error message. I'm new to the tidyverse, so any help you can offer would be appreciated. I'm guessing it's something small and embarrassing. :)
df <- df %>%
mutate(region = case_when((State=="Connecticut"|State=="Maine"|State=="Massachusetts"|State=="New Hampshire"|State=="Rhode Island"|State=="Vermont"~ "New England"),
case_when((State=="Delaware"| State=="District of Columbia" | State=="Maryland"| State=="New Jersey"| State=="New York"| State=="Pennsylvania"~ "Central Atlanic"),
case_when((State=="Florida"| State=="Georgia"| State=="North Carolina"|State=="South Carolina"| State=="Virginia"| State=="West Virginia"~ "Lower Atlantic"),
case_when((State=="Illinois"| State=="Indiana"| State=="Iowa"| State=="Kansas"| State=="Kentucky"| State=="Michigan"| State=="Minnesota"| State=="Missouri"| State=="Nebraska"| State=="North Dakota"| State=="Ohio"| State=="Oklahoma"| State=="South Dakota"| State=="Tennessee" |State=="Wisconsin"~ "Midwest"),
case_when((State=="Alabama" | State=="Arkansas" | State=="Louisiana"| State=="Mississippi"| State=="New Mexico"| State=="Texas"~ "Gulf Coast"),
case_when((State=="Colorado"| State=="Idaho" | State=="Montana"| State=="Utah"| State=="Wyoming"~ "Rocky Mountain"),
case_when((State=="Alaska" | State=="Arizona" | State=="California"| State=="Hawaii" | State=="Nevada"| State=="Oregon"| State=="Washington"~ "West Coast"), TRUE~"NA"))))))))
Error in mutate():
! Problem while computing region = case_when(...).
Caused by error in case_when():
! Case 2 ((State == "Colorado" | State == "Idaho" | State == "Montana" | State == "Utah" | State == "Wyoming" ~ "Rocky Mountain")) must be a two-sided formula, not a character vector.
As the docs show, there is no need to nest case_when. Simply separate the mutually exclusive conditions with commas, each as its own two-sided formula. Also, consider %in% to avoid the many OR calls.
mutate(region = case_when(
  State %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont") ~ "New England",
  State %in% c("Delaware", "District of Columbia", "Maryland", "New Jersey", "New York", "Pennsylvania") ~ "Central Atlantic",
  ...,
  TRUE ~ NA_character_
))
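A small runnable sketch of the flat (non-nested) form, assuming dplyr is installed. The three-row `df` and the shortened state lists are made up for illustration and are not the asker's data:

```r
library(dplyr)

# Hypothetical sample data; "Guam" is included to exercise the fallback
df <- data.frame(State = c("Maine", "New York", "Guam"),
                 stringsAsFactors = FALSE)

df <- df %>%
  mutate(region = case_when(
    # Each condition is a separate two-sided formula, separated by commas
    State %in% c("Connecticut", "Maine") ~ "New England",
    State %in% c("Delaware", "New York") ~ "Central Atlantic",
    # Fallback for anything not listed above
    TRUE ~ NA_character_
  ))

df$region
# [1] "New England"      "Central Atlantic" NA
```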
In fact, consider simply merging and avoid any conditional logic:
txt = 'State Region
Connecticut "New England"
Maine "New England"
Massachusetts "New England"
"New Hampshire" "New England"
"Rhode Island" "New England"
Vermont "New England"
Delaware "Central Atlantic"
"District of Columbia" "Central Atlantic"
Maryland "Central Atlantic"
"New Jersey" "Central Atlantic"
"New York" "Central Atlantic"
Pennsylvania "Central Atlantic"
...'
region_df <- read.table(text = txt, header = TRUE)
region_df
# State Region
# 1 Connecticut New England
# 2 Maine New England
# 3 Massachusetts New England
# 4 New Hampshire New England
# 5 Rhode Island New England
# 6 Vermont New England
# 7 Delaware Central Atlantic
# 8 District of Columbia Central Atlantic
# 9 Maryland Central Atlantic
# 10 New Jersey Central Atlantic
# 11 New York Central Atlantic
# 12 Pennsylvania Central Atlantic
# ...
main_df <- merge(main_df, region_df, by = "State")
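One thing to watch with the merge approach: by default `merge` performs an inner join, so rows whose State has no entry in the lookup table are dropped. A minimal sketch, with made-up sample data, of keeping those rows by passing `all.x = TRUE`:

```r
# Hypothetical data: "Texas" is deliberately missing from the lookup table
main_df <- data.frame(State = c("Maine", "New York", "Texas"),
                      value = c(10, 20, 30),
                      stringsAsFactors = FALSE)
region_df <- data.frame(State  = c("Maine", "New York"),
                        Region = c("New England", "Central Atlantic"),
                        stringsAsFactors = FALSE)

# Default (inner join): the Texas row disappears
inner <- merge(main_df, region_df, by = "State")

# Left join: every row of main_df is kept, with NA where no region is known
left <- merge(main_df, region_df, by = "State", all.x = TRUE)
left
```

The left join mirrors the `TRUE ~ NA` fallback of the case_when version.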

Removing different words from vector in R

Let's say I have a long data frame in R like this:
var1 <- c("Los Angeles - CA", "New York - NY", "Seattle - WA", "Los Angeles - CA", "New York - NY")
var2 <- c(1, 2, 3, 4, 5)
df <- data.frame(var1, var2)
I want to remove the " - State", to get a result like:
var1 <- c("Los Angeles", "New York", "Seattle", "Los Angeles", "New York")
var2 <- c(1, 2, 3, 4, 5)
df <- data.frame(var1, var2)
I wasn't able to figure out how to do this: I have more than 5,000 rows, and I assumed that using gsub would mean spelling out every state abbreviation to remove. In other words, there are dozens of patterns (" - State") that I would have to define a priori before using such functions.
Is there an easy way to remove all "-State" from that column at once by using some splitting pattern that I haven't figured out yet?
A couple of options.
The most basic would be to just remove the last 5 characters (this assumes every entry ends in a space, a hyphen, a space, and a two-letter abbreviation).
library(stringr)
str_sub(var1, 1L, -6L)
Or maybe search for the pattern and delete that:
gsub(" - \\w+$","",var1)
or
str_remove_all(var1, " - \\w+$")
All will get you the same result
[1] "Los Angeles" "New York" "Seattle" "Los Angeles" "New York"
var1 <- c("Los Angeles - CA", "New York - NY", "Seattle - WA", "Los Angeles - CA", "New York - NY")
gsub(" - [A-Z]+$", "", var1)
[1] "Los Angeles" "New York" "Seattle" "Los Angeles" "New York"
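The snippets above operate on the standalone vector. A short sketch of applying the same pattern to the data frame column itself, in base R only; the `{2}` quantifier is an assumption that every suffix is a two-letter uppercase abbreviation:

```r
df <- data.frame(
  var1 = c("Los Angeles - CA", "New York - NY", "Seattle - WA",
           "Los Angeles - CA", "New York - NY"),
  var2 = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# One pattern covers every state: " - " followed by two capitals at the end.
# sub() is enough because the anchored pattern can match at most once.
df$var1 <- sub(" - [A-Z]{2}$", "", df$var1)
df$var1
# [1] "Los Angeles" "New York"    "Seattle"     "Los Angeles" "New York"
```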

Remove all rows that doesn't match a set of strings and recategorization of the columns

I have a set of social media data queried from the Twitter API, which also includes people's self-reported locations. However, the location strings do not follow a standard format that could be used for categorization, and some are "troll" values. Here is an example:
a1 = data.frame(x=c(1:4),y=c("181 Metro Drive San Francisco", "Wall Street New York", "Austin, TX", "The Moon"))
a1
My plan is to obtain a CSV file with the names of all cities around the world from https://www.kaggle.com/max-mind/world-cities-database and import it into R as a vector. Here is a small example:
a2 = c("New York", "Washington", "Austin")
a2
What I want to do is write an R function that cross-references a1 against a2, replaces every string in a1 that contains no entry of a2 with NA, and replaces every string that does contain an entry of a2 with that exact entry. For example, if our function is f, its output would be as follows:
x = data.frame(x=c(1:4),c("San Francisco", "New York", "Austin", NA))
x
Can I write a function in R for this, or is there an existing R package built for this task? Thank you for the help.
We can paste all the city names into a single regex pattern and then use str_extract to pull the first match out of each string.
library(stringr)
str_extract(a1, str_c(a2, collapse = "|"))
#[1] "San Francisco" "New York" "Austin" NA
data
a2 = c("New York", "Washington", "Austin", "San Francisco")
a1 = c("181 Metro Drive San Francisco", "Wall Street New York",
"Austin, TX", "The Moon")
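The same idea can be sketched in base R against the data frame from the question, without stringr. The new column name `city` is an assumption made for this example, and "San Francisco" is added to a2 as in the data above:

```r
a1 <- data.frame(
  x = 1:4,
  y = c("181 Metro Drive San Francisco", "Wall Street New York",
        "Austin, TX", "The Moon"),
  stringsAsFactors = FALSE
)
a2 <- c("New York", "Washington", "Austin", "San Francisco")

# Build one alternation pattern: "New York|Washington|Austin|San Francisco"
pattern <- paste(a2, collapse = "|")

# regexpr() returns the position of the first match, or -1 when there is none
m <- regexpr(pattern, a1$y)

# Start with NA everywhere, then fill in the rows that matched
a1$city <- NA_character_
a1$city[m > 0] <- regmatches(a1$y, m)
a1$city
# [1] "San Francisco" "New York"      "Austin"        NA
```

With a real world-cities list, names containing regex metacharacters would need escaping first, and a pattern with many thousands of alternatives may be slow, so this is best treated as a starting point.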

Using grepl to subset dataframe containing the same mentioning of some text in two columns

I'm working on a dataframe (account) with two columns: the "posting" IP location (in the column city) and the location at the time when the account was first registered (in the column register). I'm using grepl() to subset rows whose posting location and register location are both in the state of New York (NY). Below is part of the data and my code for subsetting the desired output:
account <- data.frame(city = c("Beijing, China", "New York, NY", "Hoboken, NJ", "Los Angeles, CA", "New York, NY", "Bloomington, IN"),
register = c("New York, NY", "New York, NY", "Wilwaukee, WI", "Rochester, NY", "New York, NY", "Tokyo, Japan"))
sub_data <- subset(account, grepl("NY", city) == "NY" & grepl("NY", register) == "NY")
sub_data
[1] city register
<0 rows> (or 0-length row.names)
My code didn't work and returned 0 row (while at least two rows should have met my selection criterion). What went wrong in my code?
I have referenced this previous thread before lodging this question.
The function grepl already returns a logical vector, so just use the following:
sub_data <- subset(account,
grepl("NY", city) & grepl("NY", register)
)
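By using something like grepl("NY", city) == "NY", you are asking R whether each value in FALSE TRUE FALSE FALSE TRUE FALSE is equal to "NY". For the comparison the logical values are coerced to the strings "FALSE" and "TRUE", neither of which equals "NY", so every element is FALSE and no rows are kept.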
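A short sketch that makes the coercion visible and confirms the corrected subset, using the data from the question and base R indexing in place of `subset`:

```r
account <- data.frame(
  city = c("Beijing, China", "New York, NY", "Hoboken, NJ",
           "Los Angeles, CA", "New York, NY", "Bloomington, IN"),
  register = c("New York, NY", "New York, NY", "Wilwaukee, WI",
               "Rochester, NY", "New York, NY", "Tokyo, Japan"),
  stringsAsFactors = FALSE
)

hits <- grepl("NY", account$city)
hits
# [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

# Comparing logicals with a string coerces them to "TRUE"/"FALSE" first,
# so nothing can ever equal "NY"
hits == "NY"
# [1] FALSE FALSE FALSE FALSE FALSE FALSE

# Use the logical vectors directly and combine them element-wise with &
sub_data <- account[grepl("NY", account$city) & grepl("NY", account$register), ]
nrow(sub_data)
# [1] 2
```

If a stricter match is wanted, a pattern such as ", NY$" would restrict hits to strings that end with the state abbreviation.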

R script, how do i assign 3 values from a collection with the same label to sort into levels

I am trying to do something like the following. I want every name in the england collection to be relabelled as England, so that when the code runs, everything in that collection is counted towards England's total. As you can see below, there are 9 region labels that I want to become the single England label. I hope this makes sense to someone out there; I really didn't know how to explain this.
area_c <- factor(Outlets2016_local$Region,levels = c("England","Scotland","Wales"),labels = c("England" = england,"Scotland","Wales"))
Here is England's collection:
england <- c("London","North East","East of England","West Midlands","South East","North West","East Midlands","South West","Yorkshire and The Humber")
You can do the following if you don't mind recoding Outlets2016_local$Region.
england <- c("London", "North East", "East of England", "West Midlands", "South East", "North West", "East Midlands", "South West", "Yorkshire and The Humber")
Outlets2016_local$Region[Outlets2016_local$Region %in% england] <- "England"
area_c <- factor(Outlets2016_local$Region, levels = c("England", "Scotland", "Wales"), labels = c("England", "Scotland", "Wales"))
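One caveat: if Region is already stored as a factor, assigning "England" directly yields NA with a warning, because "England" is not yet one of its levels. A minimal sketch that converts to character first; the four-row `outlets` data frame is hypothetical and stands in for Outlets2016_local:

```r
# Hypothetical stand-in for Outlets2016_local, with Region stored as a factor
outlets <- data.frame(
  Region = factor(c("London", "Scotland", "North West", "Wales"))
)
england <- c("London", "North East", "East of England", "West Midlands",
             "South East", "North West", "East Midlands", "South West",
             "Yorkshire and The Humber")

# Work on a character copy so that new values are not rejected as unknown levels
region <- as.character(outlets$Region)
region[region %in% england] <- "England"

# Rebuild the factor with the three target levels
area_c <- factor(region, levels = c("England", "Scotland", "Wales"))
table(area_c)
```

In this sample, `table(area_c)` counts two rows under England and one each under Scotland and Wales.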
