Regular expressions to unmerge row entries in R

I have an example data set given by
df <- data.frame(
  country = c("GermanyBerlin", "England (UK)London", "SpainMadrid",
              "United States of AmericaWashington DC", "HaitiPort-au-Prince",
              "country66city"),
  capital = c("#Berlin", "NA", "#Madrid", "NA", "NA", "NA"),
  url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03",
          "country/united-states-of-america/04", "country/haiti/05",
          "country/country6/06"),
  stringsAsFactors = FALSE
)
country capital url
1 GermanyBerlin #Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 SpainMadrid #Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
5 HaitiPort-au-Prince NA country/haiti/05
6 country66city NA country/country6/06
The aim is to tidy this so that the columns are as one would expect from their names:
the first should contain only the country name.
the second should contain the capital (without a # sign).
the third should remain unchanged.
So my desired output is:
country capital url
1 Germany Berlin /country/germany/01
2 England (UK) London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 country6 6city country/country6/06
In the cases where there are non-NA entries in the capital column, I have a piece of code that achieves this (see bottom of post).
Therefore I am looking for a solution that recognises that the pattern of the url column can be used to split the capital out of the country column.
This needs to account for the fact that
the URL text is all lower case, whilst the country name as it appears in the country column has mixed cases.
the text in the URL replaces spaces with hyphens.
the url removes special characters (such as the brackets around UK).
I would be interested to see how this aim can be achieved, presumably using regular expressions (though open to any options).
Partial solution when capital column is non-NA
Where there are non-NA entries in the capital column the following code achieves my aim:
library(dplyr)
library(stringr)

df %>% mutate(capital = str_replace(capital, "#", ""),
              country = str_replace(country, capital, ""))
country capital url
1 Germany Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04

You can do the following. The pattern greedily matches everything up to and including the first letter of the capital (capturing that letter), and replacing the match with "\\1" leaves just that letter plus the unmatched remainder, i.e. the capital itself:
transform(df,capital=sub(".*[A-Z]\\S+([A-Z])","\\1",country))
country capital url
1 GermanyBerlin Berlin /country/germany/01
2 England (UK)London London /country/england-uk/02
3 SpainMadrid Madrid /country/spain/03
4 United States of AmericaWashington DC Washington DC country/united-states-of-america/04
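If you only want to fill in the rows where capital is missing (the literal string "NA" in the example data), you could combine this with the OP's partial approach. A sketch using ifelse (my own addition; as with the answer above, multi-part capitals such as Port-au-Prince are only partially recovered, and rows with no internal capital letter, like country66city, still need separate handling):
df$capital <- ifelse(df$capital == "NA",
                     sub(".*[A-Z]\\S+([A-Z])", "\\1", df$country),  # derive capital from country
                     sub("#", "", df$capital, fixed = TRUE))        # just drop the leading #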

You could start with something like this and keep on refining until you get the (100%) correct results and then see if you can skip/merge any steps.
library(magrittr)
df$country2 <- df$url %>%
gsub("-", " ", .) %>%
gsub(".+try/(.+)/.+", "\\1", .) %>%
gsub("(\\b[a-z])", "\\U\\1", ., perl = TRUE)
df$capital <- df$country %>%
gsub("[()]", " ", .) %>%
gsub(" +", " ", .) %>%
gsub(paste(df$country2, collapse = "|"), "", ., ignore.case = TRUE)
df$country <- df$country2
df$country2 <- NULL
df
country capital url
1 Germany Berlin /country/germany/01
2 England Uk London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States Of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 Country6 6city country/country6/06
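Depending on where the country text sits inside the merged string, the removal step can leave a stray leading or trailing space in capital; a final trimws() pass (a small addition, not part of the original steps) cleans that up:
df$capital <- trimws(df$capital)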

Related

How to count multiple text values in a column in R?

I have a dataframe with a column of city names, in each cell of this column there are multiple text values separated by ",".
For example the first 4 rows of the cities column of my df are:
"Barcelona, Milaan, Londen, Paris, Berlin"
"Barcelona"
"Milaan, Barcelona, Berlin"
"London, Berlin"
I want to count, for each row of this column, how many of these cities occur.
For example, the output needs to look like this:
count_cities
5
1
3
2
Thank you in advance!
DATA:
cities <- data.frame(names = c("Barcelona, Milaan, Londen, Paris, Berlin","Barcelona",
"Milaan, Barcelona, Berlin","London, Berlin"), stringsAsFactors = F)
To count how many city names there are, you can first split the string at "," and count the splits using lengths:
cities$count <- lengths(strsplit(cities$names, ","))
The resulting dataframe is this:
cities
names count
1 Barcelona, Milaan, Londen, Paris, Berlin 5
2 Barcelona 1
3 Milaan, Barcelona, Berlin 3
4 London, Berlin 2
EDIT:
If the strings contain not only city names but additional information, you can use str_count to match upper-case letters (because city names begin with an upper-case letter but other words don't, at least not in the example you've given):
cities <- data.frame(names = c("Barcelona, Milaan, Londen, Paris, Berlin","Barcelona (a big city)",
"Milaan, Barcelona, Berlin","London, Berlin (are all capitals, are big cities)"), stringsAsFactors = F)
library(stringr)
cities$count <- str_count(cities$names, "[A-Z][a-z]+")
Alternatively, use str_extract_all and count the matches with lengths:
cities$count <- lengths(str_extract_all(cities$names, "[A-Z][a-z]+"))
library(tidyverse)
travel <- tibble(CITYS = c("Barcelona, Milaan, Londen, Paris, Berlin",
"Barcelona",
"Milaan, Barcelona, Berlin",
"London, Berlin"))
travel %>%
mutate(CITY.COUNT = map_dbl(str_split(CITYS, ",\\s*"), length))
Yields
# A tibble: 4 x 2
CITYS CITY.COUNT
<chr> <dbl>
1 Barcelona, Milaan, Londen, Paris, Berlin 5
2 Barcelona 1
3 Milaan, Barcelona, Berlin 3
4 London, Berlin 2
Another option is str_count
library(stringr)
str_count(travel$CITYS, "\\w+")
#[1] 5 1 3 2
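Note that counting word matches with "\\w+" only works while every city name is a single word; a name like "New York" would count as two. Counting the separators instead is more robust (a small variation on the same idea):
str_count(travel$CITYS, ",") + 1
#[1] 5 1 3 2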

Switching values to labels in a new column

I have a column of labelled values; let's call it Country.
When I run:
attr(dat[["Country"]], "labels")
I get the following table:
USA Germany France UK Spain India Saudi Arabia
1 2 3 4 5 6 7
Now I have a new column of integer values that are not labelled; let's call it newCountry. I would like to change those integer values to the corresponding labels of the original Country column. In other words, I would like to go, in an efficient way, from this...
3
2
2
1
5
4
to this...
France
Germany
Germany
USA
Spain
UK
The problem is that the data frame has a column, Country, with the attribute "labels" set. In its turn, this attribute, which is just a vector, has the attribute "names" set. So the steps to get the "names" of the "labels" are:
Get the "labels" of column Country;
Get the "names" of the vector of labels;
Extract the names corresponding to a vector of indices, the vector i.
First read in the posted data.
nms <- scan(text = "USA Germany France UK Spain India 'Saudi Arabia'",
what = character())
i <- scan(text = "3 2 2 1 5 4")
Now create a data set example.
labs <- setNames(1:7, nms)
dat <- data.frame(Country = sample(letters, 7))
attr(dat[["Country"]], "labels") <- labs
And extract what the question asks for, following the steps above.
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
Or a one-liner:
names(attr(dat[["Country"]], "labels"))[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
To see that this does not depend on the values of the labels, create a second example.
labs2 <- setNames(101:107, nms)
attr(dat[["Country"]], "labels") <- labs2
And though the "labels" are different, the same instructions work:
attr(dat[["Country"]], "labels")
# USA Germany France UK Spain India Saudi Arabia
# 101 102 103 104 105 106 107
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]

How can I write a function that is iterable?

I need to modify a function (below) so that it can be applied row-wise with dplyr::mutate to remove any '_' characters and capitalise the first letter of each word.
My function
simple_cap <- function(x) {
s <- strsplit(x, "_")[[1]]
paste(toupper(substring(s, 1,1)), substring(s, 2),
sep="", collapse=" ")
}
My data
df <- read.table(text = c('
location obs
1 australia 12454.
2 new_south_wales 3931.
3 victoria 3244.
4 queensland 2477.
5 south_australia 834.
6 western_australia 1335.
7 tasmania 246.'), stringsAsFactors = F)
The dplyr::mutate call:
df %>% mutate(
location = simple_cap(location)
)
The output
location obs
1 Australia 12454
2 Australia 3931
3 Australia 3244
4 Australia 2477
5 Australia 834
6 Australia 1335
7 Australia 246
How can I change my function so that it can be used to iterate over the values in df$location rather than replacing them all with the output from the first element?
1) With gsub
We can use gsub to find a lower-case letter ([a-z]) that either starts the string (^) or (|) follows an underscore (_), capture it as a group ((...)), and replace it with its backreference converted to upper case (\U).
Then wrap this in another gsub to replace the _ with " ":
df %>%
mutate(location = gsub("_", " ", gsub("(^|_)([a-z])", "\\1\\U\\2", location, perl = TRUE)))
# location obs
#1 Australia 12454
#2 New South Wales 3931
#3 Victoria 3244
#4 Queensland 2477
#5 South Australia 834
#6 Western Australia 1335
#7 Tasmania 246
2) With stringi
Or another option is stri_trans_totitle from stringi
library(stringi)
df %>%
mutate(location = stri_trans_totitle(stri_replace_all_fixed(location, "_", " ")))
# location obs
#1 Australia 12454
#2 New South Wales 3931
#3 Victoria 3244
#4 Queensland 2477
#5 South Australia 834
#6 Western Australia 1335
#7 Tasmania 246
3) Using OP's modified function
The strsplit output is a list. In the OP's code, only the first element is kept by extracting [[1]], but here we have a list of length 7. So one option is to use map from purrr (or lapply/sapply from base R) and then do the pasting of the substrings.
simple_cap <- function(x) {
s <- strsplit(x, "_")
purrr::map_chr(s, ~
paste(toupper(substring(.x, 1,1)), substring(.x, 2),
sep="", collapse=" "))
}
df %>%
mutate(location = simple_cap(location))
# location obs
#1 Australia 12454
#2 New South Wales 3931
#3 Victoria 3244
#4 Queensland 2477
#5 South Australia 834
#6 Western Australia 1335
#7 Tasmania 246
4) OP's modified function with sapply
simple_cap <- function(x) {
s <- strsplit(x, "_")
sapply(s, function(.s)
paste(toupper(substring(.s, 1,1)), substring(.s, 2),
sep="", collapse=" "))
}
5) No external packages
But, this can be done without using any external package
df$location <- gsub("_", " ", gsub("(^|_)([a-z])", "\\1\\U\\2", df$location, perl = TRUE))
There is a str_to_title function in stringr which capitalises the first character of word and with gsub we replace all the "_" (underscore) with " " (blank space).
library(stringr)
library(dplyr)
df %>%
mutate(location = str_to_title(gsub("_", " ", location)))
# location obs
#1 Australia 12454
#2 New South Wales 3931
#3 Victoria 3244
#4 Queensland 2477
#5 South Australia 834
#6 Western Australia 1335
#7 Tasmania 246
Ronak Shah and akrun have solved your specific problem. Here's the general solution to your title question (how do I write a function that is iterable).
In the parlance of R, you want a vectorized function -- a function that accepts a vector input and returns a vector output. There are two ways to do this.
1) Make sure each step in your function can accept a vector input and return a vector output. #akrun's 4th answer identifies the step in your code that prevents it from doing this, s <- strsplit(x, "_")[[1]].
2) Turn a non-vectorized function into a vectorized one with Vectorize. Option 1 is more efficient, but sometimes it's not possible. This is clearly an example where it's possible, but to show you how this works, let's vectorize your function with Vectorize:
simple_cap <- function(x) {
s <- strsplit(x, "_")[[1]]
paste(toupper(substring(s, 1,1)), substring(s, 2),
sep="", collapse=" ")
}
simple_cap_v <- Vectorize(simple_cap, USE.NAMES = FALSE)
simple_cap(df$location)
# [1] "Australia"
simple_cap_v(df$location)
# [1] "Australia" "New South Wales" "Victoria" "Queensland"
# [5] "South Australia" "Western Australia" "Tasmania"
df %>% mutate(
location = simple_cap_v(location)
)
# location obs
# 1 Australia 12454
# 2 New South Wales 3931
# 3 Victoria 3244
# 4 Queensland 2477
# 5 South Australia 834
# 6 Western Australia 1335
# 7 Tasmania 246
Vectorize returns a function that is a wrapper to mapply. Effectively, a call to simple_cap_v(x) is now mapply(simple_cap, x, USE.NAMES = FALSE)
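You can check that equivalence directly; a minimal demonstration, reusing the simple_cap and df defined above:
mapply(simple_cap, df$location, USE.NAMES = FALSE)
# [1] "Australia"         "New South Wales"   "Victoria"          "Queensland"
# [5] "South Australia"   "Western Australia" "Tasmania"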

Extracting country name from city name in R

This question may look like a duplicate, but I am facing an issue while extracting country names from strings. I have gone through this link [link]Extracting Country Name from Author Affiliations but was not able to solve my problem. I have tried grepl and a for loop for text matching and replacement, but my data column consists of more than 300k rows, so using grepl and a for loop for pattern matching is very slow.
I have a column like this.
org_loc
Zug
Zug Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza
York United Kingdom
Delhi
Yalleroi Queensland
Waterloo Ontario
Waterloo ON
Washington D.C.
Washington D.C. Metro
New York
df$org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom", "Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
The string may contain the name of a state, city, or country. I just want the country as output, like this:
org_loc
Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United State
United state
United state
I am trying to convert the state (if a match is found) to its country using the countrycode library but am not able to do so. Any help would be appreciated.
You can use your City_and_province_list.csv as a custom dictionary for countrycode. The custom dictionary cannot have duplicates in the origin vector (the City column in your City_and_province_list.csv), so you'll have to remove them or deal with them somehow first (as in my example below). Currently, you don't have all of the possible strings from your example in your lookup CSV, so they are not all converted, but if you added all of the possible strings to the CSV, it would work completely.
library(countrycode)
org_loc <- c("Zug", "Zug Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
"Zaragoza", "York United Kingdom", "Delhi",
"Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
"Washington D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)
city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")
# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]
df$country <- countrycode(df$org_loc, "City", "Country",
custom_dict = city_country)
df
# org_loc country
# 1 Zug Switzerland
# 2 Zug Canton of Zug <NA>
# 3 Zimbabwe <NA>
# 4 Zigong China
# 5 Zhuhai China
# 6 Zaragoza Spain
# 7 York United Kingdom <NA>
# 8 Delhi India
# 9 Yalleroi Queensland <NA>
# 10 Waterloo Ontario <NA>
# 11 Waterloo ON <NA>
# 12 Washington D.C. <NA>
# 13 Washington D.C. Metro <NA>
# 14 New York United States of America
library(countrycode)
df <- c("zug switzerland", "zug canton of zug switzerland", "zimbabwe",
"zigong chengdu pr china", "zhuhai guangdong china", "zaragoza","York United Kingdom", "Yamunanagar","Yalleroi Queensland Australia","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","USA")
df1 <- countrycode(df, 'country.name', 'country.name')
It didn't match a lot of them, but that should do what you're looking for, based on the reference manual for countrycode.
With the geocode function from the ggmap package you can accomplish your task with good, though not total, accuracy; you must also use your own judgement to decide that "Zaragoza" is a city in Spain (which is what geocode returns) and not somewhere in Argentina; geocode tends to give you the biggest city when there are several homonyms.
(remove the $country to see all of the output)
library(ggmap)
org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom",
"Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
geocode(org_loc, output = "more")$country
As geocode is provided by Google, it has a query limit of 2,500 per day per IP address; if it returns NAs it may be because of an inconsistent limit check, so just try it again.
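If you do hit that limit, a minimal retry sketch (my own addition, assuming geocode keeps returning a country column as in the call above) is to re-query only the rows that came back NA:
res <- geocode(org_loc, output = "more")
failed <- is.na(res$country)
if (any(failed)) {
  # re-submit just the locations that were not resolved on the first pass
  res$country[failed] <- geocode(org_loc[failed], output = "more")$country
}
res$country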

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 world cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany Europe 4 Portugal Europe 0 w
So those are the games with the highest goal differential, but now I need to make a new data frame that has the team name and abs(football$home_score - football$away_score).
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
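To then pull out the rows with the largest differential as a small data frame of team name and margin (a short follow-up, mirroring the OP's max() subsetting):
football[football$score_diff == max(football$score_diff), c("winner", "score_diff")]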
You could save some typing this way. First get the score differences and winners: when the result indicates w, home is the winner, so you do not have to look at the scores at all. Once you have added the score difference and winner columns, you can subset the data with max().
mydf <- read.csv(file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE, strip.white = TRUE)
mydf <- head(mydf,n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
winner = ifelse(result == "w", as.character(home),
ifelse(result == "l", as.character(away), "draw"))) %>%
filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option without using ifelse for creating the "winner" column. This is based on row/column indexes. The numeric column index is created by matching the result column with its unique elements (match(football$result, ...)), and the row index is just 1:nrow(football). Subset the "football" dataset to the columns 'home' and 'away', and cbind it with an additional column 'draw' filled with NA so that the 'd' elements in "result" map to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')],draw=NA)[
cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
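Plugged into the same row/column indexing as above (a sketch; chmatch is simply a drop-in replacement for match here):
football$winner <- cbind(football[c('home', 'away')], draw = NA)[
  cbind(seq_len(nrow(football)),
        chmatch(as.character(football$result), c('w', 'l', 'd')))]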
NOTE: I used the full dataset in the link
