Categorizing Data Based on Text Value in New Column [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I'm trying to take an existing data frame that has a column for state and add a new column called Region depending on what the row's state is. So for example any row that has "CA" should be categorized "West" and any row that has "IL" should be Midwest. There are 4 regions: West, South, Midwest, and Northeast.
I had tried doing this separately in 4 code chunks like this:
south <- c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL", "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV") #16 states
south.mdata <- mdata %>% filter(state %in% south) #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())
But this seems repetitive and not the most efficient way to do this. Plus I'd like to be able to group_by both Year and Region so I can compare across regions.
I'm having trouble implementing this and the first thing that comes to mind is to do some sort of if/else loop using filter but I know loops aren't really R's style.
The original data looks like this:
Field.1 ID title description streetaddress city state
1 74 DE074 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
2 75 DE075 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
3 23 DE023 Dog House 1200 DuPont Hwy. Wilmington DE
4 19 DE019 Dog House 1200 DuPont Hwy Wilmington DE
5 26 DE026 Dog House 1200 Dupont Wilmington DE
6 65 DE065 Henlopen Hotel Bar Boardwalk & Surf Rehoboth Beach DE
amenityfeatures type Year notes lon lat
1 (M),(R) Restaurant 1977 <NA> -75.07601 38.72095
2 (M),(R) Restaurant 1976 <NA> -75.07601 38.72095
3 (M),(R) Restaurant 1975 <NA> -75.58243 39.68839
4 (M),(R) Restaurant 1976 <NA> -75.58243 39.68839
5 (M),(R) Restaurant 1974 <NA> -75.58723 39.76705
6 (M) Bars/Clubs,Hotel 1972 <NA> -75.07712 38.72280
status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3 Google Verified Location
4 Google Verified Location
5 Google Verified Location
6 Verified Location
I want to add a new column called "Region" that would loop through each row, look at the state, and then add a value to Region.
Any suggestions on the right syntax to do something like this would be so appreciated! Thanks so much!

This is a snippet of what the solution suggested by Gregor’s comment could look like.
library(tidyverse)
orig_data <-
tribble(~ID, ~state,
1, "CA",
2, "FL",
3, "DE")
region_lookup <-
tribble(~state, ~region,
"CA", "west",
"FL", "south",
"DE", "south")
left_join(orig_data, region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#> ID state region
#> <dbl> <chr> <chr>
#> 1 1 CA west
#> 2 2 FL south
#> 3 3 DE south
Created on 2020-11-02 by the reprex package (v0.3.0)
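When the lookup is small, a named character vector can do the same job as the join without loading any packages. This is a minimal sketch, assuming the same three example states as above; the name `region_of` is made up for illustration.

```r
# Example data mirroring the snippet above
orig_data <- data.frame(
  ID = 1:3,
  state = c("CA", "FL", "DE"),
  stringsAsFactors = FALSE  # keep state as character so it indexes by name
)

# Named vector: names are state codes, values are regions (illustrative subset)
region_of <- c(CA = "west", FL = "south", DE = "south")

# Index the lookup by state code; codes missing from the lookup become NA
orig_data$region <- unname(region_of[orig_data$state])
orig_data
```

The join scales better once the lookup has more than one attribute per state, so this is mainly useful for quick one-column recodes.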

The simplest solution is a join. For that you need a data.frame/tibble that maps every state to its region. Fortunately, that data already ships with base R:
library(dplyr)
# build the tibble of state abbreviations and regions from base R data
state_region <- dplyr::tibble(state.abb, state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>%
  dplyr::left_join(state_region, by = c("state" = "state.abb"))
Now you should have a new column called "state.region" that you can group by. Be aware that the states have to be in upper case.
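Two details of the built-in data are worth checking against the question: `state.abb` covers the 50 states only, so "DC" comes back as NA after the join, and `state.region` labels the Midwest as "North Central". The following is a rough sketch of one way to handle both. Placing DC in the South is an assumption taken from the asker's `south` vector, and `mdata_example` is made-up data standing in for the real table.

```r
library(dplyr)

# Lookup built from base R data; as.character() drops the factor so labels can be edited
state_region <- tibble(
  state  = state.abb,
  Region = as.character(state.region)
) %>%
  # state.abb has no entry for DC; assigning it to the South is an assumption
  bind_rows(tibble(state = "DC", Region = "South")) %>%
  # relabel to match the four regions named in the question
  mutate(Region = if_else(Region == "North Central", "Midwest", Region))

# Made-up rows standing in for the real data
mdata_example <- tibble(
  state = c("CA", "IL", "DC", "DE", "DE"),
  Year  = c(1975, 1975, 1976, 1976, 1976)
)

# One row per Year/Region combination with its number of locations
region_counts <- mdata_example %>%
  left_join(state_region, by = "state") %>%
  count(Year, Region, name = "yearly.total")
region_counts
```

Counting by both columns in one step replaces the four separate per-region chunks from the question.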

Related

How can I add the country name to a dataset based on city name and population? [duplicate]

This question already has answers here:
extracting country name from city name in R
(3 answers)
Closed 7 months ago.
I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.
population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin
I expect the output to look like this:
population city country
500,000 Oslo Norway
750,000 Bristol England
500,000 Liverpool England
1,000,000 Dublin Ireland
How can I add a column of country names based on the city and population to a large dataset in R?
I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.
library(maps)
library(dplyr)
data("world.cities")
df <- readr::read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df |>
inner_join(
select(world.cities, name, country.etc, pop),
by = c("city" = "name")
) |> group_by(city) |>
filter(
abs(pop - population) == min(abs(pop - population))
)
# A tibble: 4 x 4
# Groups: city [4]
# population city country.etc pop
# <dbl> <chr> <chr> <int>
# 1 500000 Oslo Norway 821445
# 2 750000 Bristol UK 432967
# 3 500000 Liverpool UK 468584
# 4 1000000 Dublin Ireland 1030431
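The `filter()` step above keeps, for each city, the candidate whose reference population is closest to the figure in the asker's data. The same idea can be sketched in base R on a small candidate table. The UK and Ireland populations are copied from the output above; the USA figures are placeholders chosen only to illustrate the tie-break, not real values.

```r
# Candidate matches after the join: one row per (city, country) pair
candidates <- data.frame(
  city       = c("Bristol", "Bristol", "Dublin", "Dublin"),
  country    = c("UK", "USA", "Ireland", "USA"),
  population = c(750000, 750000, 1000000, 1000000),  # figures from the asker's table
  pop        = c(432967, 50000, 1030431, 40000),     # USA values are placeholders
  stringsAsFactors = FALSE
)

# Absolute gap between the asker's figure and the reference figure
candidates$gap <- abs(candidates$pop - candidates$population)

# Within each city, keep the row(s) with the smallest gap
is_closest <- candidates$gap == ave(candidates$gap, candidates$city, FUN = min)
closest <- candidates[is_closest, c("city", "country")]
closest
```

Because the asker's populations are rounded, this is a heuristic: it can pick the wrong country if two same-named cities are of similar size.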
As others have noted, cities with these names exist in other countries as well.
library(tidyverse)
library(maps)
data("world.cities")
df <- read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df %>%
merge(., world.cities %>%
select(name, country.etc),
by.x = "city",
by.y = "name")
# A tibble: 7 × 3
city population country.etc
<chr> <dbl> <chr>
1 Bristol 750000 UK
2 Bristol 750000 USA
3 Dublin 1000000 USA
4 Dublin 1000000 Ireland
5 Liverpool 500000 UK
6 Liverpool 500000 Canada
7 Oslo 500000 Norway
I think your best bet would be to add a new column called country to your dataset and fill it in yourself. This is part of the data preparation phase of the CRISP-DM process, so it is not uncommon. If that does not answer your question, please let me know and I will do my best to help.

How to remove values in a column based on other column values equaling the column values above it?

I am currently coding in R and merged two data frames so that all the information is in one place, but I don't want the "Cost" column to be repeated on every row (the repetition comes from the unique values in the last three columns). I want the cost of 100 to appear only in the first row, and to be blank in every later row where the columns "State", "Market", "Date", and "Cost" are the same as in the row above. I attached what the data frame looks like now and what I want it to be changed to. Thank you!
What it currently looks like
What it should look like
You can assign NA by row and column index, as in this example:
name_of_your_dataset[nrow_init:nrow_fin, ncol] <- NA
In your case, assuming your dataset is named 'data':
data[2:4, 4] <- NA
Here is a solution using duplicated() on your data frame (df):
State Market Date Cost Word format Type
1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
2 AZ Phoenix 10-21-2020 100 GOODBYE PM Non Sports related
3 AZ Phoenix 10-22-2020 100 YES FM Country
4 AZ Phoenix 10-23-2020 100 NONE CM Rock
Set duplicates to NA
df$Cost[duplicated(df$Cost)] <- NA
Output:
State Market Date Cost Word format Type
1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
3 AZ Phoenix 10-22-2020 NA YES FM Country
4 AZ Phoenix 10-23-2020 NA NONE CM Rock
The Date column differs from row to row, so I think you want to replace duplicated Cost values within each combination of State and Market.
library(dplyr)
df <- df %>%
  group_by(State, Market) %>%
  mutate(Cost = replace(Cost, duplicated(Cost), NA)) %>%
  ungroup()
df
# State Market Date Cost Word format Type
# <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
#2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
#3 AZ Phoenix 10-22-2020 NA YES FM Country
#4 AZ Phoenix 10-23-2020 NA NONE CM Rock
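The grouped version matters as soon as there is more than one market: a plain `duplicated(df$Cost)` would also blank the first row of a second market that happens to have the same cost. A base R sketch of the per-group behaviour, assuming a made-up second market ("Fresno") added purely for illustration:

```r
# Small stand-in for the merged data: two markets, cost repeated within each
df <- data.frame(
  State  = c("AZ", "AZ", "AZ", "CA", "CA"),
  Market = c("Phoenix", "Phoenix", "Phoenix", "Fresno", "Fresno"),
  Cost   = c(100, 100, 100, 100, 100),
  stringsAsFactors = FALSE
)

# duplicated() on several columns flags rows that repeat an earlier
# State/Market/Cost combination, so the first row of each market keeps its cost
repeat_rows <- duplicated(df[, c("State", "Market", "Cost")])
df$Cost[repeat_rows] <- NA
df
```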
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(State = c("AZ", "AZ", "AZ", "AZ"), Market = c("Phoenix",
"Phoenix", "Phoenix", "Phoenix"), Date = c("10-20-2020", "10-21-2020",
"10-22-2020", "10-23-2020"), Cost = c(100, 100, 100, 100), Word = c("HELLO",
"GOODBYE", "YES", "NONE"), format = c("AM", "PM", "FM", "CM"),
Type = c("Sports related", "Non Sports related", "Country",
"Rock")), row.names = c(NA, -4L), class = "data.frame")

How do I rename the values in my column as I have misspelt them and cant rename them in R or Colab

I have a data frame that was given to me. Under the column titled state, there are two entries with the same name but different capitalization, i.e. one is "London" and the other is "LONDON". How can I rename "LONDON" to "London" so that they are totaled together rather than separately? As a reminder, I am trying to change the values in the column, not the name of the column.
You can use the following code, where df is your current data frame, in which you want to replace "LONDON" with "London":
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil","US", "Brazil", "UK", "Germany"),
State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
Candidate = c(1:8))
print(df)
output
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK LONDON 7
8 Germany Berlin 8
Then run the following code to set State to "London" wherever it is equal to "LONDON":
df[df$State == "LONDON", "State"] <- "London"
The output is now:
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK London 7
8 Germany Berlin 8
Maybe you could try the case_when function. I would do something like this:
library(dplyr)
# assumes State is a character column; recode "LONDON" and keep every other value as is
data <- mutate(data, State_def = case_when(State == "LONDON" ~ "London",
                                           TRUE ~ State))
I might misunderstand, but I think it should be as simple as this:
x$state <- sub( "LONDON", "London", x$state, fixed=TRUE )
This should change LONDON to London
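If other capitalizations such as "london" might also appear, comparing in a single case catches them all at once. This is a small base R sketch using the State values from the example data frame above; it is one possible approach, not the only one.

```r
# State values from the example data frame in the earlier answer
df <- data.frame(
  State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
  stringsAsFactors = FALSE
)

# Compare in upper case so "London", "LONDON" and "london" all match
is_london <- toupper(df$State) == "LONDON"
df$State[is_london] <- "London"

# Both rows are now counted under a single label
table(df$State)
```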

Create a ggplot with grouped factor levels

This is a variation on a question asked here: Group factor levels in ggplot.
I have a dataframe:
df <- data.frame(respondent = factor(c(1, 2, 3, 4, 5, 6, 7)),
location = factor(c("California", "Oregon", "Mexico",
"Texas", "Canada", "Mexico", "Canada")))
There are three separate levels related to the US. I don't want to collapse them, as the distinction between states is useful for data analysis. I would like, however, a basic barplot that combines the three US states and stacks them on top of one another, so that there are three bars in the barplot (Canada, Mexico, and US), with the US bar divided into the three states.
If the state factor levels had the "US" in their names, e.g. "US: California", I could use
library(tidyverse)
with_states <- df %>%
  separate(location, into = c("Country", "State"), sep = ": ") %>%
  replace_na(list(State = "Other")) %>%
  mutate(State = as.factor(State) %>%
           fct_relevel("Other", after = Inf))
to achieve the desired outcome. But how can this be done when R doesn't know that the three states are in the US?
If you look at the previous example, all the separate and replace_na functions do is separate the location variable into a country and state variable:
df
respondent location
1 1 US: California
2 2 US: Oregon
3 3 Mexico
...
df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other"))
respondent Country State
1 1 US California
2 2 US Oregon
3 3 Mexico Other
...
So really all you need to do is get your data into this format, with a column for country and a column for state/province.
There are many ways to do this yourself. Many times your data will already be in this format. If it isn't, the easiest way to fix it is to do a join to a table which maps location to country:
df
respondent location
1 1 California
2 2 Oregon
3 3 Mexico
4 4 Texas
5 5 Canada
6 6 Mexico
7 7 Canada
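state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
                            country = c("US", "US", "US"),
                            stringsAsFactors = FALSE)
df %>%
  # location is a factor in df; convert it so the join key and the fallback are character
  mutate(location = as.character(location)) %>%
  left_join(state_mapping, by = c("location" = "state")) %>%
  # locations with no match in the mapping are countries already
  mutate(country = if_else(is.na(country), location, country))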
respondent location country
1 1 California US
2 2 Oregon US
3 3 Mexico Mexico
4 4 Texas US
5 5 Canada Canada
6 6 Mexico Mexico
7 7 Canada Canada
Once you've got it in this format, you can just do what the other question suggested.
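For completeness, here is a minimal sketch of the plotting step once a country column exists. It assumes ggplot2 is installed and derives the country with a hypothetical `us_states` vector instead of the join, just to keep the example short.

```r
library(ggplot2)

df <- data.frame(
  respondent = factor(1:7),
  location = c("California", "Oregon", "Mexico", "Texas",
               "Canada", "Mexico", "Canada"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of the US states present in the data
us_states <- c("California", "Oregon", "Texas")
df$country <- ifelse(df$location %in% us_states, "US", df$location)

# One bar per country; fill splits each bar by the original location,
# so the US bar is stacked from its three states
p <- ggplot(df, aes(x = country, fill = location)) +
  geom_bar()
p
```

With `fill = location`, the Canada and Mexico bars are coloured under their own names; the linked question's approach of recoding non-US rows to "Other" would give them a shared legend entry instead.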

How to clean the city and state(both full and abbreviation) using R

I have a list of uncleaned city and state from "Location" in twitter, for example:
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
How to clean the data to make a clean list of two columns with city and state?
You can do this:
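# split on the comma plus any whitespace after it, so " Pennsylvania" gets no leading space
splitted_list <- strsplit(location, ",\\s*")
# pad single-element entries with a leading NA so every entry is c(city, state)
wide_matrix <- sapply(splitted_list, function(x) c(rep(NA, length(x) == 1), x))
res <- setNames(data.frame(t(wide_matrix), stringsAsFactors = FALSE), c("city", "state"))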
res
# city state
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
Assuming your data (location) is already part of a data.frame that you want to clean up, tidyr::separate can be a suitable option.
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
              "Pennsylvania", "MI", "Detroit,MI")
library(tidyverse)
as.data.frame(location) %>% # I created a data.frame, which is not needed in actual data
  # sep is a regular expression; \\s* also drops the space after the comma
  tidyr::separate(location, c("City", "State"), sep = ",\\s*", fill = "left")
# City State
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
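After the split, the state column still mixes full names and abbreviations. One possible follow-up, sketched here in base R, is to map both forms to the two-letter code using the built-in `state.name` and `state.abb` vectors. The helper name `to_abb` is made up for this example, and nicknames such as "the Great Lake State" are not covered by the built-in data, so they stay NA unless you add your own lookup.

```r
# State column as produced by the split above
state_raw <- c("the Great Lake State", "PA", "Pennsylvania",
               "Pennsylvania", "MI", "MI")

# Hypothetical helper: return the two-letter code for a full name or an abbreviation
to_abb <- function(x) {
  x <- trimws(x)
  # try the value as a full state name (case-insensitive)
  by_name <- state.abb[match(tolower(x), tolower(state.name))]
  # try the value as an abbreviation (case-insensitive)
  by_abb <- state.abb[match(toupper(x), state.abb)]
  # prefer the abbreviation match, fall back to the name match
  ifelse(is.na(by_abb), by_name, by_abb)
}

state_clean <- to_abb(state_raw)
state_clean
```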
