Create a ggplot with grouped factor levels - r

This is a variation on a question asked here: Group factor levels in ggplot.
I have a dataframe:
df <- data.frame(respondent = factor(c(1, 2, 3, 4, 5, 6, 7)),
location = factor(c("California", "Oregon", "Mexico",
"Texas", "Canada", "Mexico", "Canada")))
There are three separate levels related to the US. I don't want to collapse them, as the distinction between states is useful for data analysis. I would, however, like a basic barplot that combines the three US states and stacks them on top of one another, so that there are three bars in the barplot (Canada, Mexico, and US), with the US bar divided into the three states.
If the state factor levels had the "US" in their names, e.g. "US: California", I could use
library(tidyverse)
with_states <- df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other")) %>%
mutate(State = as.factor(State)
%>% fct_relevel("Other", after = Inf))
to achieve the desired outcome. But how can this be done when R doesn't know that the three states are in the US?

If you look at the previous example, all the separate and replace_na functions do is separate the location variable into a country and state variable:
df
respondent location
1 1 US: California
2 2 US: Oregon
3 3 Mexico
...
df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other"))
respondent Country State
1 1 US California
2 2 US Oregon
3 3 Mexico Other
...
So really all you need to do is get your data into this format: with a column for country and a column for state/province.
There are many ways to do this yourself. Many times your data will already be in this format. If it isn't, the easiest way to fix it is to do a join to a table which maps location to country:
df
respondent location
1 1 California
2 2 Oregon
3 3 Mexico
4 4 Texas
5 5 Canada
6 6 Mexico
7 7 Canada
state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
country = c('US', 'US', 'US'),
stringsAsFactors = FALSE)
df %>%
left_join(state_mapping, by = c('location' = 'state')) %>%
# locations missing from the mapping are already countries
mutate(country = if_else(is.na(country),
as.character(location),
country))
respondent location country
1 1 California US
2 2 Oregon US
3 3 Mexico Mexico
4 4 Texas US
5 5 Canada Canada
6 6 Mexico Mexico
7 7 Canada Canada
Once you've got it in this format, you can just do what the other question suggested.
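Putting the pieces together, here is a minimal sketch of the whole pipeline, from the raw locations to the stacked bar chart. It assumes that every location missing from state_mapping is itself a country; the names plot_data and p and the "Other" label for non-US rows are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(
  respondent = factor(1:7),
  location = c("California", "Oregon", "Mexico", "Texas",
               "Canada", "Mexico", "Canada"),
  stringsAsFactors = FALSE
)

# Lookup table: one row per location that belongs to a larger country
state_mapping <- data.frame(
  state = c("California", "Oregon", "Texas"),
  country = "US",
  stringsAsFactors = FALSE
)

plot_data <- df %>%
  left_join(state_mapping, by = c("location" = "state")) %>%
  mutate(
    # Rows without a mapping are countries, so they get a placeholder state
    state = if_else(is.na(country), "Other", location),
    # ...and their own name as the country
    country = coalesce(country, location)
  )

# One bar per country; the US bar is split by state through the fill aesthetic
p <- ggplot(plot_data, aes(x = country, fill = state)) +
  geom_bar()
```

Printing p draws the chart. The original location column is left untouched, so the per-state distinction stays available for analysis.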

Related

Create several columns from a complex column in R

Imagine dataset:
df1 <- tibble::tribble(~City, ~Population,
"United Kingdom > Leeds", 1500000,
"Spain > Las Palmas de Gran Canaria", 200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal", 250000,
"United States > Minneapolis, MN", 700000,
"United States > Milwaukee, WI", NA,
"United States > Milwaukee", 400000)
I would like to:
Split column City into three columns: City, Country, State (if available, NA otherwise)
Check that Milwaukee has data in both State and Population: the two Milwaukee rows should collapse into a single row with State WI and Population 400000.
Could you please suggest the easiest way to do this?
Here's another solution that uses extract to pull out Country, City, and State in a single step, with State captured by an optional capture group (the rest of the task is done as in Allen Cameron's code):
library(tidyr)
library(dplyr)
df1 %>%
extract(City,
into = c("Country", "City", "State"),
regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?"
) %>%
# as in Allen Cameron's answer:
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
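To see what the optional capture group does in isolation, the same regex can be tried on two of the strings with stringr::str_match. This is only an illustrative sketch; the cities and parts objects and the column labels are assumptions added here, and tidyr's extract may represent an unmatched group differently depending on its version.

```r
library(stringr)

cities <- c("Canada > Nanaimo, BC", "Canada > Montreal")

# Column 1 is the full match; columns 2 to 4 are the three capture groups
parts <- str_match(cities, "([^>]+) > ([^,]+),? ?([A-Z]+)?")
colnames(parts) <- c("match", "Country", "City", "State")

# The last group is optional, so a row without ", XX" gets NA for State
parts[, "State"]
```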
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
separate(City, sep = " > ", into = c("Country", "City")) %>%
separate(City, sep = ', ', into = c('City', 'State')) %>%
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups: Country [4]
#> Country City State Population
#> <chr> <chr> <chr> <dbl>
#> 1 Canada Montreal <NA> 250000
#> 2 Canada Nanaimo BC 150000
#> 3 Spain Las Palmas de Gran Canaria <NA> 200000
#> 4 United Kingdom Leeds <NA> 1500000
#> 5 United States Milwaukee WI 400000
#> 6 United States Minneapolis MN 700000

Categorizing Data Based on Text Value in New Column [duplicate]

I'm trying to take an existing data frame that has a column for state and add a new column called Region depending on what the row's state is. So for example any row that has "CA" should be categorized "West" and any row that has "IL" should be Midwest. There are 4 regions: West, South, Midwest, and Northeast.
I had tried doing this separately in 4 code chunks like this:
south <- c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL", "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV") #16 states
south.mdata <- mdata %>% filter(state %in% south) #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())
But this seems repetitive and not the most efficient way to do this. Plus I'd like to be able to group_by both Year and Region so I can compare across regions.
I'm having trouble implementing this and the first thing that comes to mind is to do some sort of if/else loop using filter but I know loops aren't really R's style.
The original data looks like this:
Field.1 ID title description streetaddress city state
1 74 DE074 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
2 75 DE075 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
3 23 DE023 Dog House 1200 DuPont Hwy. Wilmington DE
4 19 DE019 Dog House 1200 DuPont Hwy Wilmington DE
5 26 DE026 Dog House 1200 Dupont Wilmington DE
6 65 DE065 Henlopen Hotel Bar Boardwalk & Surf Rehoboth Beach DE
amenityfeatures type Year notes lon lat
1 (M),(R) Restaurant 1977 <NA> -75.07601 38.72095
2 (M),(R) Restaurant 1976 <NA> -75.07601 38.72095
3 (M),(R) Restaurant 1975 <NA> -75.58243 39.68839
4 (M),(R) Restaurant 1976 <NA> -75.58243 39.68839
5 (M),(R) Restaurant 1974 <NA> -75.58723 39.76705
6 (M) Bars/Clubs,Hotel 1972 <NA> -75.07712 38.72280
status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3 Google Verified Location
4 Google Verified Location
5 Google Verified Location
6 Verified Location
I want to add a new column called "Region" that would loop through each row, look at the state, and then add a value to Region.
Any suggestions on the right syntax to do something like this would be so appreciated! Thanks so much!
This is a snippet of what the solution suggested by Gregor’s comment could look like.
library(tidyverse)
orig_data <-
tribble(~ID, ~state,
1, "CA",
2, "FL",
3, "DE")
region_lookup <-
tribble(~state, ~region,
"CA", "west",
"FL", "south",
"DE", "south")
left_join(orig_data, region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#> ID state region
#> <dbl> <chr> <chr>
#> 1 1 CA west
#> 2 2 FL south
#> 3 3 DE south
Created on 2020-11-02 by the reprex package (v0.3.0)
The simplest solution would be a join. For that you need a data.frame/tibble that has all the states and regions. Fortunately, the data is already in base R:
library(dplyr)
# build the tibble of state abbreviations and regions from base R data
state_region <- dplyr::tibble(state.abb, state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>%
dplyr::left_join(state_region, by = c("state" = "state.abb"))
Now you should have a new column called "state.region" that you can group by. Be aware that the states have to be in upper case.
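Two details of the base R data are worth handling if the four regions from the question are wanted: state.region calls the Midwest "North Central", and state.abb covers the 50 states only, so "DC" would come back as NA. A hedged sketch follows; the small mdata tibble is a made-up stand-in for the asker's data, and placing DC in the South follows the asker's own south vector.

```r
library(dplyr)

# Lookup built from base R's state.abb / state.region
state_region <- tibble(
  state  = state.abb,
  Region = as.character(state.region)
) %>%
  # Rename the census label to the one used in the question
  mutate(Region = if_else(Region == "North Central", "Midwest", Region)) %>%
  # DC is not in state.abb; the asker lists it under the South
  bind_rows(tibble(state = "DC", Region = "South"))

# Hypothetical stand-in for the real data
mdata <- tibble(
  state = c("DE", "CA", "IL", "DC", "DE"),
  Year  = c(1977, 1976, 1976, 1975, 1976)
)

# One row per Year/Region combination with its number of locations
by_region_year <- mdata %>%
  left_join(state_region, by = "state") %>%
  count(Year, Region, name = "yearly.total")
```

Because Region is now an ordinary column, the four separate filter chunks collapse into a single grouped count.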

finding shared column information - a least common ancestor question

I have a data.frame object consisting of columns of information that is tree-like. For instance, I have performed a search of a set of features (query_name) and returned a set of potential matches (match_name). Every match has an associated location that is split into continent, country, region, and town.
The problem I'd like to resolve is finding, for a given query_name, the location information that all potential matches have in common.
For example, with this bit of example data:
query_name <- c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4))
match_name <- paste0("match", 1:9)
continent <- c(rep("NorthAmerica", 3), rep("NorthAmerica", 2), rep("Europe", 4))
country <- c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4))
region <- c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2))
town <- c("Manhattan", "Albany", "Buffalo", "Toronto", NA, "Munich", "Nuremberg", "Berlin", "Frankfurt")
data <- data.frame(query_name, match_name, continent, country, region, town)
We'd generate this data.frame object:
query_name match_name continent country region town
1 feature1 match1 NorthAmerica UnitedStates NewYork Manhattan
2 feature1 match2 NorthAmerica UnitedStates NewYork Albany
3 feature1 match3 NorthAmerica UnitedStates NewYork Buffalo
4 feature2 match4 NorthAmerica Canada Ontario Toronto
5 feature2 match5 NorthAmerica Canada <NA> <NA>
6 feature3 match6 Europe Germany Bayern Munich
7 feature3 match7 Europe Germany Bayern Nuremberg
8 feature3 match8 Europe Germany Berlin Berlin
9 feature3 match9 Europe Germany Berlin Frankfurt
I'm hoping to get advice on how to construct a function that will produce the result below. Note that shared location information is now concatenated and separated with a ; delimiter.
Feature1 differs only at the town information, thus the returned string contains the continent through region information.
Feature2 doesn't differ at region or town in the two matches here because one of the two matches contains no information. Nevertheless, lack of information is considered distinct from values with information, so the only thing shared in common for feature2 matches are continent and country.
Feature3 contains shared continent and country information, but distinct region and town, so just continent and country are retained.
Hoping for an output file that looks like this:
query_name location_output
feature1 NorthAmerica;UnitedStates;NewYork;
feature2 NorthAmerica;Canada;;
feature3 Europe;Germany;;
Thanks for any advice you can spare.
Cheers!
Here is an option
library(tidyverse)
data %>%
gather(key, val, -query_name, -match_name) %>%
select(-match_name, -key) %>%
group_by(query_name, val) %>%
add_count() %>%
group_by(query_name) %>%
filter(n == max(n)) %>%
summarise(location_output = paste0(unique(val[!is.na(val)]), collapse = ";"))
## A tibble: 3 x 2
# query_name location_output
# <fct> <chr>
#1 feature1 NorthAmerica;UnitedStates;NewYork
#2 feature2 NorthAmerica;Canada
#3 feature3 Europe;Germany
This is less elegant than MauritsEvers' solution (it doesn't automatically handle an arbitrary number of levels), but it ensures that every location_output has all four fields, separated by three ; delimiters.
library(dplyr)
data %>%
group_by(query_name) %>%
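# as.character() guards against factor columns, which ifelse() would turn into integer codes
summarize(continent = ifelse(n_distinct(continent) == 1, as.character(first(continent)), ""),
country = ifelse(n_distinct(country) == 1, as.character(first(country)), ""),
region = ifelse(n_distinct(region) == 1, as.character(first(region)), ""),
town = ifelse(n_distinct(town) == 1, as.character(first(town)), "")) %>%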
mutate(location_output = paste(continent, country, region, town, sep = ";")) %>%
select(query_name, location_output)
A base R option using split and rle:
lapply(split(data, data$query_name), function(x){
x = x[,-(1:2)]
r = rle(sapply(x, function(d) length(unique(d))))
x[1, seq(r$lengths[1])]
})
#$feature1
# continent country region
#1 NorthAmerica UnitedStates NewYork
#$feature2
# continent country
#4 NorthAmerica Canada
#$feature3
# continent country
#6 Europe Germany
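For comparison, here is a base R sketch that works for any number of levels and reproduces the requested format, including the empty trailing fields. The helper name common_prefix and the rule that a level counts as shared only when it has exactly one non-missing value are assumptions based on the question's description, not code from any of the answers.

```r
query_name <- c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4))
data <- data.frame(
  query_name,
  match_name = paste0("match", 1:9),
  continent = c(rep("NorthAmerica", 5), rep("Europe", 4)),
  country = c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4)),
  region = c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2)),
  town = c("Manhattan", "Albany", "Buffalo", "Toronto", NA,
           "Munich", "Nuremberg", "Berlin", "Frankfurt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shared levels of one query, joined with ";"
common_prefix <- function(d, cols) {
  # A level is shared when all matches agree on one non-missing value
  shared <- vapply(d[cols],
                   function(v) !anyNA(v) && length(unique(v)) == 1,
                   logical(1))
  # Keep only the leading run of shared levels (the common ancestor path)
  keep <- cumsum(!shared) == 0
  first_vals <- vapply(d[cols], function(v) as.character(v[1]), character(1))
  paste(ifelse(keep, first_vals, ""), collapse = ";")
}

cols <- c("continent", "country", "region", "town")
out <- vapply(split(data, data$query_name), common_prefix, character(1), cols = cols)
result <- data.frame(query_name = names(out),
                     location_output = unname(out),
                     stringsAsFactors = FALSE)
```

The cumsum step is what makes this a prefix rather than a per-column check: once one level differs, every deeper level is blanked even if its values happen to agree.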

Combine rows together that are duplicates [duplicate]

I have a data frame with 4 columns and several thousand rows. The first two columns are geographical identifiers, the third one is a date, and the last one is the number of shipments on that date.
For example:
London UK 4/4/2018 1
London UK 4/4/2018 1
London UK 4/5/2018 3
London UK 4/5/2018 2
I would like to combine the rows so as to have only one row per city, country, and date.
For example, the above data would become:
London UK 4/4/2018 2
London UK 4/5/2018 5
Thank you for all help in advance.
Here is your solution:
library(dplyr)
# 1. Data set
df <- data.frame(
country = c("UK", "UK", "UK", "UK"),
city = c("London", "London", "London", "London"),
date = c("4/4/2018", "4/4/2018", "4/5/2018", "4/5/2018"),
shipment = c(1, 1, 3, 2))
# 2. Group by 'country', 'city', and 'date' features
df %>%
group_by(country, city, date) %>%
summarise(shipment = sum(shipment))
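If adding a package is not desirable, the same grouping can be sketched in base R with aggregate. The result name totals is an illustrative choice.

```r
df <- data.frame(
  country = c("UK", "UK", "UK", "UK"),
  city = c("London", "London", "London", "London"),
  date = c("4/4/2018", "4/4/2018", "4/5/2018", "4/5/2018"),
  shipment = c(1, 1, 3, 2)
)

# Sum shipment within every country/city/date combination
totals <- aggregate(shipment ~ country + city + date, data = df, FUN = sum)
```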

R column mapping

How can I map a column of one CSV file to a column of another CSV file in R, given that both have the same data type?
For example, the first column of data frame A consists of text with a country name in it, while a column of the second data frame B contains a standard list of all countries. I have to map every row of the first data frame to the standard country column.
For example, the location column of data frame A consists of 10000 rows of data like this:
Sydney, Australia
Aarhus C, Central Region, Denmark
Auckland, New Zealand
Mumbai Area, India
Singapore
df1 <- data.frame(col1 = 1:5, col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"))
Now I have another column (country) in data frame B:
India
USA
New Zealand
UK
Singapore
Denmark
China
df2 <- data.frame(col1=1:7, col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"))
If the location column matches the country column, I want to replace that location with the country name; otherwise it should remain as it is. Sample output:
Sydney, Australia
Denmark
New Zealand
India
Singapore
Initially, it looked like a trivial question but it's not. This approach works like this:
1. We convert the location string into vector using unlist, strsplit.
2. Then we check if any string in the vector is available in country column. If it is available, we store the country name in res and if not we store notfound.
3. Finally, we check if res contains a country name or not.
df1 <- data.frame(location = c('Sydney, Australia',
'Aarhus C, Central Region, Denmark',
'Auckland, New Zealand',
'Mumbai Area, India',
'Singapore'),stringsAsFactors = F)
df2 <- data.frame(country = c('India',
'USA',
'New Zealand',
'UK',
'Singapore',
'Denmark',
'China'),stringsAsFactors = F)
get_values <- function(i)
{
val <- unlist(strsplit(i, split = ','))
val <- trimws(val) # base R; avoids the unloaded stringr::str_trim
res <- c()
for(j in val)
{
if(j %in% df2$country) res <- append(res, j)
else res <- append(res, 'notfound')
}
if(all(res == 'notfound')) return (i)
else return (res[res!='notfound'])
}
df1$location2 <- sapply(df1$location, get_values)
location location2
1 Sydney, Australia Sydney, Australia
2 Aarhus C, Central Region, Denmark Denmark
3 Auckland, New Zealand New Zealand
4 Mumbai Area, India India
5 Singapore Singapore
A solution using tidyverse. First, please convert your col2 to character by setting stringsAsFactors = FALSE because that is easier to work with.
We can use str_extract to extract the matched country name, and then create a new col2 with mutate and ifelse.
library(tidyverse)
df3 <- df1 %>%
mutate(Country = str_extract(col2, paste0(df2$col2, collapse = "|")),
col2 = ifelse(is.na(Country), col2, Country)) %>%
select(-Country)
df3
# col1 col2
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
We can also start with df1, use separate_rows to separate the country name. After that, use semi_join to check if the country names are in df2. Finally, we can combine the data frame with the original df1 by rows, and then filter the first one for each id in col1. df3 is the final output.
library(tidyverse)
df3 <- df1 %>%
separate_rows(col2, sep = ", ") %>%
semi_join(df2, by = "col2") %>%
bind_rows(df1) %>%
group_by(col1) %>%
slice(1) %>%
ungroup() %>%
arrange(col1)
df3
# # A tibble: 5 x 2
# col1 col2
# <int> <chr>
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
DATA
df1 <- data.frame(col1 = 1:5,
col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
stringsAsFactors = FALSE)
df2 <- data.frame(col1=1:7,
col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
stringsAsFactors = FALSE)
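One caveat with an alternation pattern built from paste(..., collapse = "|") is that it also matches inside longer words, so "India" would be found in "Indiana". A possible refinement, sketched here in base R, wraps the alternation in word boundaries. The sample locations vector is made up for illustration, and the sketch assumes the country names contain no regex metacharacters.

```r
countries <- c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China")
locations <- c("Sydney, Australia", "Mumbai Area, India", "Indianapolis, Indiana")

# \\b on both sides: a country only matches as a whole word
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

hit <- regexpr(pattern, locations, perl = TRUE)

# Replace only the rows that matched; the others keep their original text
result <- locations
result[hit > 0] <- regmatches(locations, hit)
```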
If you are looking for the countries, and they come after the cities then you can do something like this.
transform(df1,col3= sub(paste0(".*,\\s*(",paste0(df2$col2,collapse="|"),")"),"\\1",col2))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
Breakdown:
> A=sub(".*,\\s(.*)","\\1",df1$col2)
> B=sapply(A,grep,df2$col2,value=T)
> transform(df1,col3=replace(A,!lengths(B),col2[!lengths(B)]))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
