filter data by values (common values but different data entry) stored in another dataframe - r

Based on the data below, how can I filter rows by values stored in another data frame?
Sample data:
# Data to be filtered
Dest_FIPS = c(1,2,3,4)
Dest_county = c("West Palm Beach County","Brevard County","Bay County","Miami-Dade County")
Dest_State = c("FL", "FL", "FL", "FL")
OutFlow = c(111, 222, 333, 444)
Orig_county = c("Broward County", "Broward County", "Broward County", "Broward County")
Orig_FIPS = c(5,5,5,5)
Orig_State = c("FL", "FL", "FL", "FL")
df = data.frame(Dest_FIPS, Dest_county, Dest_State, OutFlow, Orig_county, Orig_FIPS, Orig_State)
# rows to be filtered in column Dest_county based on the values in val_df
COUNTY_NAM = c("WEST PALM BEACH","BAY","MIAMI-DADE") #(values are actually stored in a CSV, so will be imported as a dataframe)
val_df = data.frame(COUNTY_NAM) # will use val_df to filter df
Desired output:
Dest_FIPS Dest_county OutFlow Orig_county
1 West Palm Beach County 111 Broward County
3 Bay County 333 Broward County
4 Miami-Dade County 444 Broward County

Transform df$Dest_county to match the format in val_df, then check which values are %in% val_df$COUNTY_NAM.
Base R:
df[toupper(gsub(" County", "", df$Dest_county)) %in% val_df$COUNTY_NAM,]
tidyverse:
library(dplyr)
library(stringr) # provides str_to_upper() and str_remove()
filter(df, str_to_upper(str_remove(Dest_county, " County")) %in% val_df$COUNTY_NAM)
Output for both:
Dest_FIPS Dest_county Dest_State OutFlow Orig_county Orig_FIPS Orig_State
1 1 West Palm Beach County FL 111 Broward County 5 FL
2 3 Bay County FL 333 Broward County 5 FL
3 4 Miami-Dade County FL 444 Broward County 5 FL
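A slightly more defensive variant of the base R approach could look like the sketch below. It assumes the only differences between the two sources are letter case, stray whitespace and a trailing " County" suffix; normalize_county is a hypothetical helper name, not part of any package. Anchoring the pattern with $ means only a suffix is removed, and applying the same helper to both sides keeps the comparison symmetric.

```r
# Sample data reduced to the columns needed for the illustration
df <- data.frame(
  Dest_FIPS = c(1, 2, 3, 4),
  Dest_county = c("West Palm Beach County", "Brevard County",
                  "Bay County", "Miami-Dade County"),
  OutFlow = c(111, 222, 333, 444)
)
val_df <- data.frame(COUNTY_NAM = c("WEST PALM BEACH", "BAY", "MIAMI-DADE"))

# Hypothetical helper: drop a trailing " County", trim whitespace, upper-case
normalize_county <- function(x) {
  toupper(trimws(sub("\\s+County$", "", x, ignore.case = TRUE)))
}

# Keep the rows whose normalized county name appears in val_df
keep <- normalize_county(df$Dest_county) %in% normalize_county(val_df$COUNTY_NAM)
result <- df[keep, ]
result
```

If the real CSV differs in other ways (abbreviations such as "St." versus "Saint", for example), the helper would need extra rules.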


Categorizing Data Based on Text Value in New Column [duplicate]

I'm trying to take an existing data frame that has a column for state and add a new column called Region depending on what the row's state is. So for example any row that has "CA" should be categorized "West" and any row that has "IL" should be Midwest. There are 4 regions: West, South, Midwest, and Northeast.
I had tried doing this separately in 4 code chunks like this:
south <- c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL", "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV") #16 states
south.mdata <- mdata %>% filter(state %in% south) #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())
But this seems repetitive and not the most efficient way to do this. Plus I'd like to be able to group_by both Year and Region so I can compare across regions.
I'm having trouble implementing this and the first thing that comes to mind is to do some sort of if/else loop using filter but I know loops aren't really R's style.
The original data looks like this:
Field.1 ID title description streetaddress city state
1 74 DE074 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
2 75 DE075 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
3 23 DE023 Dog House 1200 DuPont Hwy. Wilmington DE
4 19 DE019 Dog House 1200 DuPont Hwy Wilmington DE
5 26 DE026 Dog House 1200 Dupont Wilmington DE
6 65 DE065 Henlopen Hotel Bar Boardwalk & Surf Rehoboth Beach DE
amenityfeatures type Year notes lon lat
1 (M),(R) Restaurant 1977 <NA> -75.07601 38.72095
2 (M),(R) Restaurant 1976 <NA> -75.07601 38.72095
3 (M),(R) Restaurant 1975 <NA> -75.58243 39.68839
4 (M),(R) Restaurant 1976 <NA> -75.58243 39.68839
5 (M),(R) Restaurant 1974 <NA> -75.58723 39.76705
6 (M) Bars/Clubs,Hotel 1972 <NA> -75.07712 38.72280
status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3 Google Verified Location
4 Google Verified Location
5 Google Verified Location
6 Verified Location
I want to add a new column called "Region" that would loop through each row, look at the state, and then add a value to Region.
Any suggestions on the right syntax to do something like this would be so appreciated! Thanks so much!
This is a snippet of what the solution suggested by Gregor’s comment could look like.
library(tidyverse)
orig_data <-
tribble(~ID, ~state,
1, "CA",
2, "FL",
3, "DE")
region_lookup <-
tribble(~state, ~region,
"CA", "west",
"FL", "south",
"DE", "south")
left_join(orig_data, region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#> ID state region
#> <dbl> <chr> <chr>
#> 1 1 CA west
#> 2 2 FL south
#> 3 3 DE south
Created on 2020-11-02 by the reprex package (v0.3.0)
The simplest solution would be a join. For that you need a data.frame/tibble that lists all the states and their regions. Fortunately, that data already ships with base R:
library(dplyr)
# build the tibble of state abbreviation and region from base R data
state_region <- dplyr::tibble(state.abb, state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>%
dplyr::left_join(state_region, by = c("state" = "state.abb"))
Now you should have a new column called "state.region" that you can group by. Be aware that the states have to be in upper case.
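To get from the join to the per-year, per-region totals the question asks about, something like the following sketch may help. Two details of the built-in data are worth knowing: state.abb covers the 50 states only (no "DC"), and state.region uses the label "North Central" rather than "Midwest". The toy mdata rows, the relabelling, and the choice to put DC in the South (as in the asker's own list) are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the asker's data (hypothetical rows)
mdata <- tibble(
  state = c("DE", "DE", "CA", "IL", "DC"),
  Year  = c(1976, 1976, 1976, 1977, 1977)
)

# Lookup table built from base R constants
state_region <- tibble(state = state.abb,
                       Region = as.character(state.region)) %>%
  # assumption: relabel "North Central" to match the asker's "Midwest"
  mutate(Region = ifelse(Region == "North Central", "Midwest", Region)) %>%
  # assumption: DC is missing from state.abb, so add it as South
  bind_rows(tibble(state = "DC", Region = "South"))

# Join, then count rows per Year and Region
by_region <- mdata %>%
  left_join(state_region, by = "state") %>%
  count(Year, Region, name = "total")
by_region
```

Any state code not found in the lookup ends up with Region NA, which makes unmatched rows easy to spot.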

Previous marking nullified while using ifelse condition in R

Here is my data -
library(data.table)
basefile2 = data.table(States = c("California","California", "California", "Texas","Texas","Texas", "Ohio", "Ohio", "Ohio"),
Cities = c("LA", "California City", "San Fran", "Houston", "Dallas", "Austin", "Columbus", "Cleaveland", "Wooster"))
And here is my code -
Market = function(state, city){
if (missing(state)) stop("Enter State",
call. = FALSE)
if (missing(city)) stop("Enter City(ies)",
call. = FALSE)
basefile2 <<- basefile2[, "Consideration" := ifelse(States == state & Cities %in% city, "Y",
ifelse("Consideration" %in% colnames(basefile) & "Consideration" == "Y", "Y", "N"))]
}
Market(state = "California",
city = c("LA", "California City"))
Market(state = "Texas",
city = c("Dallas", "Austin"))
The previous marking in the Consideration column, set when state was California, is getting nullified. Yes, I need to pass different states in separate function calls due to certain input constraints.
Here is my output
States Cities Consideration
1: California LA N
2: California California City N
3: California San Fran N
4: Texas Houston N
5: Texas Dallas Y
6: Texas Austin Y
7: Ohio Columbus N
8: Ohio Cleaveland N
9: Ohio Wooster N
Whereas the output I need has "Y" in the Consideration column for California City, LA, Austin & Dallas.
One option is to add the "Consideration" column to the data.table at the start, and then use that as a condition to update within the function so that previous updates are not replaced.
library(data.table)
basefile2 <- data.table(...) # as you had
basefile2[, Consideration := "N"] # initialize the column
Market <- function(state, city){
basefile2 <<- basefile2[Consideration=="N", # Only update if this is "N"
"Consideration" := ifelse(States == state & Cities %in% city, "Y", "N")]
}
Or maybe like this:
Market <- function(state, city){
basefile2 <<- basefile2[States == state & Cities %in% city, Consideration := "Y"]
}
Market(state = "California", city = c("LA", "California City"))
Market(state = "Texas", city = c("Dallas", "Austin"))
basefile2
States Cities Consideration
1: California LA Y
2: California California City Y
3: California San Fran N
4: Texas Houston N
5: Texas Dallas Y
6: Texas Austin Y
7: Ohio Columbus N
8: Ohio Cleaveland N
9: Ohio Wooster N
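Because := updates a data.table by reference, the global assignment with <<- is not strictly required. As a sketch, a variant that takes the table as an argument could look like this; mark_market is a hypothetical name, and the shortened sample table is only for illustration.

```r
library(data.table)

# Shortened sample table (subset of the question's data)
basefile2 <- data.table(
  States = c("California", "California", "Texas", "Texas"),
  Cities = c("LA", "San Fran", "Houston", "Dallas")
)
basefile2[, Consideration := "N"]  # initialize the column once

# Hypothetical variant of Market(): only matching rows are touched,
# so marks from earlier calls are left alone
mark_market <- function(dt, state, city) {
  if (missing(state)) stop("Enter State", call. = FALSE)
  if (missing(city)) stop("Enter City(ies)", call. = FALSE)
  dt[States == state & Cities %in% city, Consideration := "Y"]
  invisible(dt)
}

mark_market(basefile2, state = "California", city = "LA")
mark_market(basefile2, state = "Texas", city = "Dallas")
basefile2
```

Passing the table in explicitly also makes the function usable on other tables with the same columns.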

Need to ID states from mixed names /IDs in location data

Need to ID states from mixed location data
Need to search for 50 states abbreviations & 50 states full names, and return state abbreviation
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL" , "Houston, TX",
"Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF" )
df <- data.frame(N, Loc)
# Objective: create a variable state such that
# state contains abbreviated names of states from Loc:
# for "Los Angeles, CA", state = CA
# for "Florida, USA", state = FL
# for "WV NY NJ", state = NA
# for "qwerty NJuy PO DOPL JKF", state = NA (in spite of containing the string NJ, it is not wrapped in spaces)
# End result should be Newdf
State <- c("CA", "NY", "FL", "IL", "TX","TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package? or can a loop be written? Even if the schema could be demonstrated with a few states, that would be sufficient - I will post the full solution when I get to it. Btw, this is for a Twitter dataset downloaded using rtweet package, and the variable is: place_full_name
There are default constants in R, state.abb and state.name which can be used.
vars <- stringr::str_extract(df$Loc, paste0('\\b',c(state.abb, state.name),
'\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do :
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, in the 9th row you expect NA, but this returns "WV" because WV is a valid state abbreviation. For cases like this, you need to define rules that are strict enough to extract only the states you intend and nothing else.
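One possible stricter rule, shown here only as a sketch and not as the answerer's method, is to collect every match and accept a row only when all matches point to a single state. The rule itself and the helper name to_abb are assumptions for illustration; the code uses base R only.

```r
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL",
         "Houston, TX", "Texas, USA", "Corona, CA", "Georgia, USA",
         "WV NY NJ", "qwerty uy PO DOPL JKF")

# Whole-word pattern over all abbreviations and full names
pattern <- paste0("\\b(", paste(c(state.abb, state.name), collapse = "|"), ")\\b")

# All matches per element, not just the first one
hits <- regmatches(Loc, gregexpr(pattern, Loc, perl = TRUE))

# Hypothetical helper: map full names to abbreviations, keep abbreviations as-is
to_abb <- function(x) ifelse(x %in% state.name, state.abb[match(x, state.name)], x)

# Assumed rule: keep the state only if exactly one distinct state was found
State <- vapply(hits, function(h) {
  abb <- unique(to_abb(h))
  if (length(abb) == 1) abb else NA_character_
}, character(1), USE.NAMES = FALSE)
State
```

Under this rule "WV NY NJ" becomes NA because three different states match, while a string such as "Houston, Texas, TX" would still resolve to TX.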
Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
If an extracted abbreviation is not one of the built-in state.abb values, we use match to look up the extracted name in the built-in state.name vector and take the corresponding element of state.abb; otherwise we keep the abbreviation we already have. Entries that match neither come back as NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>

How to clean the city and state(both full and abbreviation) using R

I have a list of uncleaned city and state from "Location" in twitter, for example:
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
How to clean the data to make a clean list of two columns with city and state?
You can do this:
splitted_list <- strsplit(location,",")
wide_matrix <- sapply(splitted_list,function(x) c(rep(NA,length(x)==1),x))
res <- setNames(data.frame(t(wide_matrix),stringsAsFactors = FALSE),c("city","state"))
res
# city state
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
Assuming your data (location) is already part of a data.frame that you want to clean up, tidyr::separate can be a suitable option.
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
library(tidyverse)
as.data.frame(location) %>% # I created a data.frame, which is not needed in actual data
tidyr::separate(location, c("City", "State"), sep=",", fill="left")
# City State
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
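Both approaches above split on the comma only, so "Harrisburg, Pennsylvania" keeps a leading space in the state part, and full state names stay mixed with abbreviations. A possible follow-up step, sketched here in base R under the assumption that the last comma-separated piece is the state, trims the whitespace and converts full names to abbreviations using the built-in state.abb and state.name constants.

```r
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
              "Pennsylvania", "MI", "Detroit,MI")

# Split on a comma followed by optional whitespace
parts <- strsplit(location, ",\\s*")

# Assumption: the last piece is the state, the first piece (if any) the city
state_raw <- vapply(parts, function(p) p[length(p)], character(1))
city <- vapply(parts, function(p) if (length(p) > 1) p[1] else NA_character_,
               character(1))

# Keep valid abbreviations, translate full names, everything else becomes NA
state <- ifelse(state_raw %in% state.abb,
                state_raw,
                state.abb[match(state_raw, state.name)])

res <- data.frame(city, state)
res
```

Nicknames such as "the Great Lake State" come out as NA here; resolving them would need a separate lookup table.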

Finding Average of One Column Based on 2 Other Columns RStudio

I currently have a data frame with three columns (City, State and Income). An example of the data is below:
City State Income
Addison Illinois 71,000
Addison Illinois 101,000
Addison Illinois 81,000
Addison Texas 74,000
As you can see there are repeats of the cities. There are several Addison, IL's because income differs by the zip-code/area of the city.
I want to take the average of all incomes in a given city and state. In this example I want the average of all Addison IL's but NOT including Addison, Texas.
I am looking for this (in this given example)
City State MeanIncome
Addison Illinois 84,333
Addison Texas 74,000
I tried this:
Income_By_City <- aggregate( Income ~ City, df, mean )
But it gave me the average of ALL Addison's, including Texas...
Is there a way to take the average of Income Column, based on City AND State??
I am pretty new to coding, so I'm not sure if this is a simple question. But I would appreciate any help I can get.
df <- data.frame(City = c("Addison", "Addison", "Addison", "Addison"), State = c("Illinois", "Illinois", "Illinois", "Texas"), Income = c(71000, 101000, 81000, 74000))
library(dplyr)
df %>%
group_by(City, State) %>%
summarise(MeanIncome=(mean(Income)))
# City State MeanIncome
#1 Addison Illinois 84333.33
#2 Addison Texas 74000.00
Here is a dplyr solution:
library(tidyverse)
df <- tribble(
~City, ~State, ~Income,
"Addison", "Illinois", 71000,
"Addison", "Illinois", 101000,
"Addison", "Illinois", 81000,
"Addison", "Texas", 74000
)
df %>%
group_by(City, State) %>%
mutate(AverageIncome = mean(Income))
# A tibble: 4 x 4
# Groups: City, State [2]
City State Income AverageIncome
<chr> <chr> <dbl> <dbl>
1 Addison Illinois 71000 84333.33
2 Addison Illinois 101000 84333.33
3 Addison Illinois 81000 84333.33
4 Addison Texas 74000 74000.00
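The asker's original aggregate() attempt can also be kept in base R by adding State to the right-hand side of the formula. The sketch below additionally assumes the incomes may have been read in as text with thousands separators (as in the "71,000" shown in the question), which is why they are converted first; if Income is already numeric, that step can be dropped.

```r
# Assumption: Income arrives as character with commas, as displayed in the question
df <- data.frame(
  City = c("Addison", "Addison", "Addison", "Addison"),
  State = c("Illinois", "Illinois", "Illinois", "Texas"),
  Income = c("71,000", "101,000", "81,000", "74,000")
)

# Strip the thousands separators and convert to numeric
df$Income <- as.numeric(gsub(",", "", df$Income))

# Group by both City and State
Income_By_City <- aggregate(Income ~ City + State, data = df, FUN = mean)
Income_By_City
```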
