How to count multiple text values in a column in R? - r

I have a dataframe with a column of city names, in each cell of this column there are multiple text values separated by ",".
For example the first 4 rows of the cities column of my df are:
"Barcelona, Milaan, Londen, Paris, Berlin"
"Barcelona"
"Milaan, Barcelona, Berlin"
"London, Berlin"
For each row of this column, I want to count how many cities occur.
For example, the output needs to look like this:
count_cities
5
1
3
2
Thank you in advance!

DATA:
cities <- data.frame(names = c("Barcelona, Milaan, Londen, Paris, Berlin","Barcelona",
"Milaan, Barcelona, Berlin","London, Berlin"), stringsAsFactors = FALSE)
To count how many city names there are, you can first split each string at "," and then count the pieces using lengths:
cities$count <- lengths(strsplit(cities$names, ","))
The resulting dataframe is this:
cities
names count
1 Barcelona, Milaan, Londen, Paris, Berlin 5
2 Barcelona 1
3 Milaan, Barcelona, Berlin 3
4 London, Berlin 2
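One thing worth checking before relying on the split-and-count approach is how missing or empty cells behave: strsplit() returns NA for an NA input (so lengths() reports 1) and character(0) for an empty string (so lengths() reports 0). A minimal sketch, assuming you want NA rows to stay NA; count_items is a hypothetical helper name, not part of the answer above:

```r
# Hypothetical helper: count comma-separated items per string.
# Assumption: NA inputs should yield NA rather than 1.
count_items <- function(x, sep = ",") {
  n <- lengths(strsplit(x, sep, fixed = TRUE))
  n[is.na(x)] <- NA_integer_  # strsplit(NA) gives NA, which has length 1
  n
}

counts <- count_items(c("Barcelona, Milaan", "Barcelona", "", NA))
counts
```

If your column never contains NA or empty strings, the one-liner above is enough.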
EDIT:
If the strings contain not only city names but also additional information, you can use str_count to match words that start with an upper-case letter (because city names begin with an upper-case letter but other words don't, at least not in the example you've given):
cities <- data.frame(names = c("Barcelona, Milaan, Londen, Paris, Berlin","Barcelona (a big city)",
"Milaan, Barcelona, Berlin","London, Berlin (are all capitals, are big cities)"), stringsAsFactors = FALSE)
library(stringr)
cities$count <- str_count(cities$names, "[A-Z][a-z]+")
Alternatively, use str_extract_all and count the matches:
cities$count <- lengths(str_extract_all(cities$names, "[A-Z][a-z]+"))
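The upper-case pattern would count a multi-word name such as "New York" twice. A possible alternative, sketched here in base R under the assumption that the extra information always sits inside parentheses (as in the example data), is to strip the parenthesised text first and then split on commas:

```r
cities <- data.frame(
  names = c("Barcelona, Milaan, Londen, Paris, Berlin",
            "Barcelona (a big city)",
            "Milaan, Barcelona, Berlin",
            "London, Berlin (are all capitals, are big cities)"),
  stringsAsFactors = FALSE
)

# Assumption: anything in parentheses is a comment, not a city name.
cleaned <- gsub("\\s*\\([^)]*\\)", "", cities$names)

# Count the comma-separated pieces that remain.
cities$count <- lengths(strsplit(cleaned, ",", fixed = TRUE))
cities$count
```

This does not handle nested parentheses or commentary outside parentheses; those cases would need a different rule.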

library(tidyverse)
travel <- tibble(CITYS = c("Barcelona, Milaan, Londen, Paris, Berlin",
"Barcelona",
"Milaan, Barcelona, Berlin",
"London, Berlin"))
travel %>%
mutate(CITY.COUNT = map_dbl(str_split(CITYS, ",\\s*"), length))
Yields
# A tibble: 4 x 2
CITYS CITY.COUNT
<chr> <dbl>
1 Barcelona, Milaan, Londen, Paris, Berlin 5
2 Barcelona 1
3 Milaan, Barcelona, Berlin 3
4 London, Berlin 2

Another option is str_count, here counting runs of word characters:
library(stringr)
str_count(travel$CITYS, "\\w+")
#[1] 5 1 3 2
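Counting words with "\\w+" treats each word of a multi-word city as a separate city. If that matters for your data, one hedged alternative is to count the separators instead and add one. The sketch below uses base R only; the "New York" row is an illustrative input that is not in the original data:

```r
# Illustrative input; the last element is an assumed multi-word example.
x <- c("Barcelona, Milaan, Londen, Paris, Berlin", "Barcelona", "New York, Berlin")

# gregexpr() finds every comma; regmatches() returns character(0) when there is none.
n_commas <- lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))

# Number of items = number of separators + 1 (assumes no empty or NA strings).
city_count <- n_commas + 1L
city_count
```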

Related

New Column Based on Conditions

To set the scene, I have a set of data where two columns of the data have been mixed up. To give a simple example:
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"), City=c("Apple", "Paris", "Orange", "Berlin"), Fruit=c("London", "Pear", "Madrid", "Orange"))
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"))
As a result, we have two small data sets:
> df1
Name City Fruit
1 Bob Apple London
2 John Paris Pear
3 Mark Orange Madrid
4 Will Berlin Orange
> df2
Cities
1 Paris
2 London
3 Berlin
4 Madrid
5 Moscow
6 Warsaw
My aim is to create a new column where the cities are in the correct place using df2. I am a bit new to R so I don't know how this would work.
I don't really know where to even start with this sort of a problem. My full dataset is much larger and it would be good to have an efficient method of unpicking this issue!
If the only problem is that the 'City' and 'Fruit' values are swapped in some rows, we can loop over the rows, build a logical vector marking which values match 'Cities' in 'df2', and reassemble each row so that the matched city sits in the second position, with the remaining values around it:
df1[] <- t(apply(df1, 1, function(x)
{
i1 <- x %in% df2$Cities
i2 <- !i1
x1 <- x[i2]
c(x1[1], x[i1], x1[2])}))
-output
> df1
Name City Fruit
1 Bob London Apple
2 John Paris Pear
3 Mark Madrid Orange
4 Will Berlin Orange
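Since apply() walks over the rows one at a time, a vectorised swap may scale better on a larger data set. This is a sketch of that idea, not the answer's own method, and it assumes that a row needs fixing exactly when 'Fruit' holds a known city and 'City' does not:

```r
df1 <- data.frame(Name  = c("Bob", "John", "Mark", "Will"),
                  City  = c("Apple", "Paris", "Orange", "Berlin"),
                  Fruit = c("London", "Pear", "Madrid", "Orange"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"),
                  stringsAsFactors = FALSE)

# Rows where the two columns appear to be swapped (assumed rule, see lead-in).
swapped <- df1$Fruit %in% df2$Cities & !(df1$City %in% df2$Cities)

# Swap the two columns for those rows only.
tmp <- df1$City[swapped]
df1$City[swapped]  <- df1$Fruit[swapped]
df1$Fruit[swapped] <- tmp
df1
```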
Here is a solution using the dplyr package. It checks the City and Fruit values in df1 and takes whichever one exists in the df2 cities list.
If neither of the two is a city name, an empty string is returned; you can replace that with anything you prefer.
library(dplyr)
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
df1$Fruit%in% df2$Cities ~ df1$Fruit,
TRUE ~ "")
Output: a new column, as you wanted, holding the city name for each row.
> df1
Name City Fruit corrected_City
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
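If you would rather not depend on dplyr, the same lookup can be sketched with nested ifelse() calls in base R. Here the fallback is NA instead of an empty string, which is an assumption about what is most convenient downstream:

```r
df1 <- data.frame(Name  = c("Bob", "John", "Mark", "Will"),
                  City  = c("Apple", "Paris", "Orange", "Berlin"),
                  Fruit = c("London", "Pear", "Madrid", "Orange"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"),
                  stringsAsFactors = FALSE)

# Take City if it is a known city, otherwise Fruit if that is one, otherwise NA.
df1$corrected_City <- ifelse(df1$City %in% df2$Cities, df1$City,
                      ifelse(df1$Fruit %in% df2$Cities, df1$Fruit, NA_character_))
df1
```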
Another way is:
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(1:3, ~case_when(. %in% df2$Cities ~ .), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ')
Name City Fruit New_Col
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin

Insert a value to a column by condition

I am attempting to fill in a new column in my dataset. I have a dataset containing information on football matches. There is a column called "Stadium", which has various stadium names. I wish to add a new column containing the country in which each stadium is located. My dataset looks something like this:
Match ID Stadium
1 Anfield
2 Camp Nou
3 Stadio Olimpico
4 Anfield
5 Emirates
I am attempting to create a new column looking like this:
Match ID Stadium Country
1 Anfield England
2 Camp Nou Spain
3 Stadio Olimpico Italy
4 Anfield England
5 Emirates England
There is only a handful of stadiums but many rows, meaning I am trying to find a way to avoid inserting the values manually. Any tips?
You want to get the unique stadium names from your data, manually create a vector with the country for each of those stadiums, then join them using Stadium as a key.
library(dplyr)
# Example data
df <- data.frame(`Match ID` = 1:12,
Stadium = rep(c("Stadio Olympico", "Anfield",
"Emirates"), 4))
# Get the unique stadium names in a vector
unique_stadiums <- df %>% pull(Stadium) %>% unique()
unique_stadiums
#> [1] "Stadio Olympico" "Anfield" "Emirates"
# Manually create a vector of country names corresponding to each element of
# the unique stadium name vector. Ordering matters here!
countries <- c("Italy", "England", "England")
# Place them both into a data.frame
lookup <- data.frame(Stadium = unique_stadiums, Country = countries)
# Join the country names to the original data on the stadium key
left_join(x = df, y = lookup, by = "Stadium")
#> Match.ID Stadium Country
#> 1 1 Stadio Olympico Italy
#> 2 2 Anfield England
#> 3 3 Emirates England
#> 4 4 Stadio Olympico Italy
#> 5 5 Anfield England
#> 6 6 Emirates England
#> 7 7 Stadio Olympico Italy
#> 8 8 Anfield England
#> 9 9 Emirates England
#> 10 10 Stadio Olympico Italy
#> 11 11 Anfield England
#> 12 12 Emirates England
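Because the join approach depends on the two vectors being in the same order, a named vector can serve as the lookup instead, so each stadium is written next to its country. A minimal base R sketch of that variation, reusing the example data from the answer:

```r
df <- data.frame(Match.ID = 1:12,
                 Stadium  = rep(c("Stadio Olympico", "Anfield", "Emirates"), 4),
                 stringsAsFactors = FALSE)

# Named vector: names are stadiums, values are countries (order no longer matters).
country_by_stadium <- c("Stadio Olympico" = "Italy",
                        "Anfield"         = "England",
                        "Emirates"        = "England")

# Index the lookup by stadium name; unknown stadiums would come back as NA.
df$Country <- unname(country_by_stadium[df$Stadium])
head(df, 3)
```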

How can I add the country name to a dataset based on city name and population? [duplicate]

I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.
population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin
I expect the output to look like this:
population city country
500,000 Oslo Norway
750,000 Bristol England
500,000 Liverpool England
1,000,000 Dublin Ireland
How can I add a column of country names based on the city and population to a large dataset in R?
I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.
library(maps)
library(dplyr)
data("world.cities")
df <- readr::read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df |>
inner_join(
select(world.cities, name, country.etc, pop),
by = c("city" = "name")
) |> group_by(city) |>
filter(
abs(pop - population) == min(abs(pop - population))
)
# A tibble: 4 x 4
# Groups: city [4]
# population city country.etc pop
# <dbl> <chr> <chr> <int>
# 1 500000 Oslo Norway 821445
# 2 750000 Bristol UK 432967
# 3 500000 Liverpool UK 468584
# 4 1000000 Dublin Ireland 1030431
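The key step above is "join on city name, then keep the candidate whose population is closest". To make that step visible without the maps package, here is a base R sketch on a small hand-made lookup table. The four populations marked as taken from the output above are copied from it; the values of 1000 for the namesake cities are placeholders chosen only for illustration and are not real figures:

```r
df <- data.frame(population = c(500000, 750000, 500000, 1000000),
                 city = c("Oslo", "Bristol", "Liverpool", "Dublin"),
                 stringsAsFactors = FALSE)

# Toy stand-in for world.cities. Rows with pop = 1000 are PLACEHOLDERS, not real data.
lookup <- data.frame(
  name        = c("Oslo", "Bristol", "Bristol", "Liverpool", "Liverpool", "Dublin", "Dublin"),
  country.etc = c("Norway", "UK", "USA", "UK", "Canada", "Ireland", "USA"),
  pop         = c(821445, 432967, 1000, 468584, 1000, 1030431, 1000),
  stringsAsFactors = FALSE
)

# All candidate matches by name (one row per city/country combination).
cand <- merge(df, lookup, by.x = "city", by.y = "name")

# Keep, per city, the candidate with the smallest population gap.
gap  <- abs(cand$pop - cand$population)
best <- cand[gap == ave(gap, cand$city, FUN = min), ]
best
```

Ties would keep more than one row per city, just as the filter() version does.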
As stated by others, cities with these names exist in other countries as well.
library(tidyverse)
library(maps)
data("world.cities")
df <- read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df %>%
merge(., world.cities %>%
select(name, country.etc),
by.x = "city",
by.y = "name")
# A tibble: 7 × 3
city population country.etc
<chr> <dbl> <chr>
1 Bristol 750000 UK
2 Bristol 750000 USA
3 Dublin 1000000 USA
4 Dublin 1000000 Ireland
5 Liverpool 500000 UK
6 Liverpool 500000 Canada
7 Oslo 500000 Norway
I think your best bet would be to add a new column called country to your dataset and fill it in yourself. This is part of the data preparation phase of the CRISP-DM process, so it is not uncommon. If that does not answer your question, please let me know and I will do my best to help.

Keep specific rows of a data frame based on word sequence in R

I have a dataframe (df) like this. What I want to do is to go through the values for each ID and if there are two strings starting with the same word, I want to compare them to keep distinct values.
df <- data.frame(id = c(1,1,2,3,3,4,4,4,4,5),
value = c('australia', 'australia sydney', 'brazil',
'australia', 'usa', 'australia sydney', 'australia sydney randwick', 'australia', 'australia sydney circular quay', 'australia sydney'))
I want to compare the first words: if they are different, keep both values, but if they are the same, move on to compare the second words, and so on.
For example, for ID 1 I want to keep the row with the value 'australia sydney', and for ID 4 I want to keep both 'australia sydney circular quay' and 'australia sydney randwick'.
For this example I need to get rows 2:5, 7, 9,10
Based on your edit, you can check within groups if any entry matches the start of any other entry and remove entries that do:
library(tidyverse)
df %>%
group_by(id) %>%
filter(!map_lgl(seq_along(value), ~ any(if (length(value) == 1) FALSE else str_detect(value[-.x], paste0("^", value[.x])))))
# A tibble: 7 x 2
# Groups: id, value [7]
id value
<dbl> <chr>
1 1 australia sydney
2 2 brazil
3 3 australia
4 3 usa
5 4 australia sydney randwick
6 4 australia sydney circular quay
7 5 australia sydney
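The same "drop any value that is the start of another value in its group" rule can also be sketched in base R with startsWith(), which compares literal text and so avoids building a regular expression from the data. keep_longest is a hypothetical helper name. Like the pipeline above, this compares string prefixes rather than whole words:

```r
df <- data.frame(id = c(1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
                 value = c("australia", "australia sydney", "brazil",
                           "australia", "usa", "australia sydney",
                           "australia sydney randwick", "australia",
                           "australia sydney circular quay", "australia sydney"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for values that are not a prefix of another value in v.
keep_longest <- function(v) {
  vapply(seq_along(v),
         function(i) !any(startsWith(v[-i], v[i])),
         logical(1))
}

# Apply the helper within each id and put the results back in row order.
keep <- unsplit(lapply(split(df$value, df$id), keep_longest), df$id)
df[keep, ]
```

Note that exact duplicates within a group are each a prefix of the other, so both would be dropped; deduplicate first if that can occur.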

Regular Expressions to Unmerge row entries

I have an example data set given by
df <- data.frame(
country = c("GermanyBerlin", "England (UK)London", "SpainMadrid", "United States of AmericaWashington DC", "HaitiPort-au-Prince", "country66city"),
capital = c("#Berlin", "NA", "#Madrid", "NA", "NA", "NA"),
url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03", "country/united-states-of-america/04", "country/haiti/05", "country/country6/06"),
stringsAsFactors = FALSE
)
country capital url
1 GermanyBerlin #Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 SpainMadrid #Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
5 HaitiPort-au-Prince NA country/haiti/05
6 country66city NA country/country6/06
The aim is to tidy this so that the columns are as one would expect from their names:
the first should contain only the country name.
the second should contain the capital (without a # sign).
the third should remain unchanged.
So my desired output is:
country capital url
1 Germany Berlin /country/germany/01
2 England (UK) London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 country6 6city country/country6/06
In the cases where there are non-NA entries in the capital column, I have a piece of code that achieves this (see bottom of post).
Therefore I am looking for a solution that recognises that the pattern of the url column can be used to split the capital out of the country column.
This needs to account for the fact that
the URL text is all lower case, whilst the country name as it appears in the country column has mixed cases.
the text in the URL replaces spaces with hyphens.
the url removes special characters (such as the brackets around UK).
I would be interested to see how this aim can be achieved, presumably using regular expressions (though open to any options).
Partial solution when capital column is non-NA
Where there are non-NA entries in the capital column the following code achieves my aim:
df %>% mutate( capital = str_replace(capital, "#", ""),
country = str_replace(country, capital,"")
)
country capital url
1 Germany Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
You can do:
transform(df,capital=sub(".*[A-Z]\\S+([A-Z])","\\1",country))
country capital url
1 GermanyBerlin Berlin /country/germany/01
2 England (UK)London London /country/england-uk/02
3 SpainMadrid Madrid /country/spain/03
4 United States of AmericaWashington DC Washington DC country/united-states-of-america/04
You could start with something like this and keep on refining until you get the (100%) correct results and then see if you can skip/merge any steps.
library(magrittr)
df$country2 <- df$url %>%
gsub("-", " ", .) %>%
gsub(".+try/(.+)/.+", "\\1", .) %>%
gsub("(\\b[a-z])", "\\U\\1", ., perl = TRUE)
df$capital <- df$country %>%
gsub("[()]", " ", .) %>%
gsub(" +", " ", .) %>%
gsub(paste(df$country2, collapse = "|"), "", ., ignore.case = TRUE)
df$country <- df$country2
df$country2 <- NULL
df
country capital url
1 Germany Berlin /country/germany/01
2 England Uk London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States Of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 Country6 6city country/country6/0
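The approach above rebuilds the country name from the URL, which changes its capitalisation and drops the brackets. Another option, sketched here as an assumption-laden alternative, is to use the URL slug only to decide where the country ends inside the original string. The sketch assumes that the letters and digits of the slug appear in the same order in the country name (no transliteration), and split_by_slug is a hypothetical helper name:

```r
# Hypothetical helper: split "CountryCapital" using the slug from the url column.
split_by_slug <- function(country, url) {
  slug <- basename(dirname(url))  # e.g. "england-uk"
  one <- function(cn, sl) {
    # Keep only the letters and digits of the slug, one character per element.
    chars <- strsplit(gsub("[^a-z0-9]", "", sl), "")[[1]]
    # Allow any run of non-alphanumerics (spaces, brackets) after each character.
    pattern <- paste0("^", paste0(chars, "[^[:alnum:]]*", collapse = ""))
    len <- attr(regexpr(pattern, cn, ignore.case = TRUE), "match.length")
    if (len < 0) return(c(country = cn, capital = NA_character_))
    c(country = trimws(substr(cn, 1, len)),
      capital = trimws(substr(cn, len + 1, nchar(cn))))
  }
  as.data.frame(t(mapply(one, country, slug, USE.NAMES = FALSE)),
                stringsAsFactors = FALSE)
}

country <- c("GermanyBerlin", "England (UK)London", "SpainMadrid",
             "United States of AmericaWashington DC", "HaitiPort-au-Prince",
             "country66city")
url <- c("/country/germany/01", "/country/england-uk/02", "/country/spain/03",
         "country/united-states-of-america/04", "country/haiti/05",
         "country/country6/06")

res <- split_by_slug(country, url)
res
```

Rows whose slug does not line up with the country text get NA as the capital, so they can be inspected by hand.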
