Is it possible to get R to identify countries in a dataframe? - r

This is what my dataset currently looks like. I'm hoping to add a column with the country names that appear in the 'paragraph' column, but I don't know where to start. Should I upload a list of all country names and then use the match function?
Any suggestions for a more optimal way would be appreciated! Thank you.
The output of dput(head(dataset, 20)) is as follows:
structure(list(category = c("State Ownership and Privatization;...row.names = c(NA, 20L), class = "data.frame")

Use the package "countrycode":
Toy data:
df <- data.frame(entry_number = 1:5,
text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
"More text that might contain myanmar or burma, as well as thailand",
"sentences that do not contain a country name can be returned as NA",
"some variant of U.S or the united states",
"something with an accent samóoa"))
This is how you can match the country names in a separate column:
library(tidyr)
library(dplyr)
#install.packages("countrycode")
library(countrycode)
all_country <- countryname_dict %>%
  # keep only names that contain at least one ASCII letter
  # (drops names written entirely in non-Latin scripts):
  filter(grepl('[A-Za-z]', country.name.alt)) %>%
  # extract column `country.name.alt` as an atomic vector:
  pull(country.name.alt) %>%
  # change to lower-case:
  tolower()
# define alternation pattern of all country names:
library(stringr)
pattern <- str_c(all_country, collapse = '|') # A huge alternation pattern!
df %>%
  # extract all country name matches (returns a list column)
  mutate(country = str_extract_all(tolower(text), pattern))
entry_number text
1 1 a few paragraphs that might contain the country name congo or democratic republic of congo
2 2 More text that might contain myanmar or burma, as well as thailand
3 3 sentences that do not contain a country name can be returned as NA
4 4 some variant of U.S or the united states
5 5 something with an accent samóoa
country
1 congo, democratic republic of congo
2 myanma, burma, thailand
3
4 united states
5 samóoa
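The same idea can be sketched in base R without the dictionary, to show how the alternation pattern behaves. This is only an illustration: the short `countries` vector is a hypothetical stand-in for the full list of names, and sorting the longer names first and adding word boundaries are assumptions of this sketch, not part of the answer above.

```r
# Hypothetical short list of names; a real list would come from a dictionary
countries <- c("congo", "democratic republic of congo", "myanmar",
               "burma", "thailand", "united states")

# Put longer names first so the alternation prefers the most specific name
countries <- countries[order(nchar(countries), decreasing = TRUE)]

# Word boundaries stop a name from matching inside a longer word
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

text <- c("a few paragraphs that might contain the country name congo or democratic republic of congo",
          "More text that might contain myanmar or burma, as well as thailand",
          "sentences that do not contain a country name can be returned as NA",
          "some variant of U.S or the united states")

# Find every match per element, then collapse to one string (NA if none)
hits <- regmatches(tolower(text), gregexpr(pattern, tolower(text), perl = TRUE))
country <- vapply(hits, function(h) {
  if (length(h) > 0) paste(unique(h), collapse = ", ") else NA_character_
}, character(1))
```

Collapsing to a single string keeps `country` an ordinary character column; keeping `hits` as a list would preserve the individual matches instead.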

Related

How to clean up data in R using strings?

I need to clean up gender and dates columns of the dataset found here.
They apparently contain some misspellings and ambiguities. I am new to R and data cleaning so I am not sure how to go about doing this. For starters, I have tried to correct the misspellings using
factor(data$artist_data$gender)
str_replace_all(data$artist_data$gender, pattern = "femle", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "f.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "F.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "female", replacement = "Female")
But it doesn't seem to work as I still have f., F. and femle in my output. Secondly, there seem to be empty cells inside. Do I need to remove them or is it alright to leave them there. If I need to remove them, how?
Thirdly, for the dates column, how do I make it clearer? i.e. change the format of born in xxxx to maybe xxxx-yyyy if died or xxxx-present if still alive. e.g. born in 1940 - is it safe to assume that they are still alive? Also one of the data has the word active in it. Would like to make this data more straight-forward.
Please help,
Thank you.
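In a regular expression an unescaped dot matches any character, so we have to escape the dot in f. and F.
library(dplyr)
library(stringr)
library(tibble)
# one alternative per spelling variant, joined into a single pattern
pattern <- paste(c("f\\.", "F\\.", "female", "femle"), collapse = "|")
df[[2]] %>%
  mutate(gender = str_replace(string = gender,
                              pattern = pattern,
                              replacement = "Female")) %>%
  as_tibble()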
name gender dates placeOfBirth placeOfDeath
<chr> <chr> <chr> <chr> <chr>
1 Abakanowicz, Magdalena Female born 1930 Polska ""
2 Abbey, Edwin Austin Male 1852–1911 Philadelphia, United States "London, United Kingdom"
3 Abbott, Berenice Female 1898–1991 Springfield, United States "Monson, United States"
4 Abbott, Lemuel Francis Male 1760–1803 Leicestershire, United Kingdom "London, United Kingdom"
5 Abrahams, Ivor Male born 1935 Wigan, United Kingdom ""
6 Absalon Male 1964–1993 Tel Aviv-Yafo, Yisra'el "Paris, France"
7 Abts, Tomma Female born 1967 Kiel, Deutschland ""
8 Acconci, Vito Male born 1940 New York, United States ""
9 Ackling, Roger Male 1947–2014 Isleworth, United Kingdom ""
10 Ackroyd, Norman Male born 1938 Leeds, United Kingdom ""
# ... with 3,522 more rows
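A minimal base R sketch of the same clean-up, which also touches the other two points of the question (empty cells and the dates column). The `gender` and `dates` vectors are made-up samples modelled on the output above, and the column names `born` and `died` are hypothetical. Note that the result has to be assigned, otherwise the original vector stays unchanged.

```r
gender <- c("Female", "femle", "f.", "F.", "female", "Male", "")

# Anchored, case-insensitive pattern: only whole-cell variants are replaced
clean <- sub("^(f\\.?|femle|female)$", "Female", trimws(gender), ignore.case = TRUE)

# Treat empty cells as missing rather than dropping the rows
clean[clean == ""] <- NA

dates <- c("born 1930", "1852–1911", "1964–1993")

# Pull out every four-digit year; the first is the birth year,
# the second (if present) is the year of death
years <- regmatches(dates, gregexpr("[0-9]{4}", dates))
born <- as.integer(vapply(years, function(y) y[1], character(1)))
died <- as.integer(vapply(years, function(y) {
  if (length(y) > 1) y[2] else NA_character_
}, character(1)))
```

Leaving `died` as NA avoids asserting that someone listed as "born 1940" is still alive, which the data alone cannot confirm.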

remove list of strings from string column in R

I have a dataframe like so:
df = data.frame('name' = c('California parks', 'bear lake', 'beautiful tree house', 'banana plant'), 'extract' = c('parks', 'bear', 'tree', 'plant'))
How do I remove the strings of the 'extract' column from the name column to get the following result:
name_new = California, lake, beautiful house, banana
I'm suspecting this demands a combination of str_extract and lapply but can quite figure it out.
Thanks!
str_remove and str_replace are vectorized over both string and pattern. So, with two columns, just pass 'name' as the string and 'extract' as the pattern to remove each substring from the corresponding element of 'name'. Removing a substring can leave spaces at the start or end, or doubled spaces in the middle; trimws removes the leading/trailing spaces and str_replace_all collapses the repeated ones.
library(dplyr)
library(stringr)
df %>%
  mutate(name_new = str_remove(name, extract),
         name_new = str_replace_all(trimws(name_new), "\\s{2,}", " "))
# name extract name_new
#1 California parks parks California
#2 bear lake bear lake
#3 beautiful tree house tree beautiful house
#4 banana plant plant banana
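A base R option using gsub + Vectorize; the word boundaries (\b) let the pattern match at the start or end of the string as well as in the middle:
within(df, name_new <- trimws(unname(Vectorize(gsub)(paste0("\\s*\\b", extract, "\\b"), "", name))))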
which gives
name extract name_new
1 California parks parks California
2 bear lake bear lake
3 beautiful tree house tree beautiful house
4 banana plant plant banana
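Both answers treat the 'extract' values as regular expressions. If those values may contain characters such as "." or "(", a literal match can be safer. A minimal base R sketch, assuming literal matching is what is wanted (the `removed` helper name is hypothetical):

```r
df <- data.frame(name = c("California parks", "bear lake",
                          "beautiful tree house", "banana plant"),
                 extract = c("parks", "bear", "tree", "plant"),
                 stringsAsFactors = FALSE)

# Remove the first literal occurrence of each 'extract' value from its own row
removed <- mapply(function(p, x) sub(p, "", x, fixed = TRUE),
                  df$extract, df$name, USE.NAMES = FALSE)

# Tidy up the whitespace left behind by the removal
df$name_new <- gsub("\\s+", " ", trimws(removed))
```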

How do I extract part of the data from a column and paste it in another column using R?

I want to extract part of the data from a column and paste it in another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the separate function from tidyr, which is loaded as part of the tidyverse.
Note that I use NA rather than NULL inside the mobile vector, because c() drops NULL elements.
Also, I use the option stringsAsFactors = FALSE when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names, country, mobile, stringsAsFactors = FALSE)
library(tidyverse)
data %>% as_tibble() %>%
  separate(country, c("country", "number"), sep = " ", fill = "right") %>%
  mutate(mobile = coalesce(mobile, number)) %>%
  select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
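Splitting on a single space works for the sample data but would break a place name that itself contains a space. A hedged base R alternative is to match the trailing phone number with a regular expression. The pattern below assumes that numbers always start with "+" and sit at the end of the string, which holds for the example but is an assumption about the real data.

```r
data <- data.frame(names = c("Sia", "Ryan", "J", "Ricky"),
                   country = c("London +1234567890", "Paris",
                               "Sydney +0123458796", "Delhi"),
                   mobile = c(NA, "+3579514862", NA, "+5554848123"),
                   stringsAsFactors = FALSE)

# Rows whose country cell ends in "+digits"
has_phone <- grepl("\\+[0-9]+$", data$country)

# Extract the number where present, NA otherwise
found <- ifelse(has_phone, sub("^.*(\\+[0-9]+)$", "\\1", data$country), NA_character_)

# Fill only the missing mobile values, then strip the number from country
data$mobile <- ifelse(is.na(data$mobile), found, data$mobile)
data$country <- sub("\\s*\\+[0-9]+$", "", data$country)
```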

approximate string matching on condition of a match in a separate field in R

I have two dataframes from which I would like to carry out approximate string matching.
> df1
Source Name Country
A Glen fiddich United Kingdom
A Talisker dark storm United Kingdom
B johnney walker United states
D veuve clicquot brut France
E nicolas feuillatte brut France
C glen morangie united kingdom
B Talisker 54 degrees United kingdom
F Talisker dark storm United states
The second data frame:
> df2
Source Name Country
A smirnoff ice Russia
A Talisker daek strome United Kingdom
B johnney walker United states
D veuve clicquot brut Australia
E nicolea feuilate brut Italy
C glen morangie united kingdom
B Talisker 54 degrees United kingdom
The key column for the approximate matching between the two data frames is "Name". Because of the relationship between the columns, it is important to select only approximate matches that also match on the "Country" column. An extract of the code I am using is below:
dist.mat <- stringdistmatrix(tolower(df1$title), tolower(df2$title), method = "jw",
nthread = getOption("sd_num_thread"))
min.dist <- apply(dist.mat, 1, min)
matched <- data.frame(df1$title,
as.character(apply(dist.mat, 1, function(x) df2$title[which(x == min(x))])),
apply(dist.mat, 1, which.min), "jw", apply(dist.mat, 1, min))
colnames(matched) <- c("to_be_matched", "closest_match", "index_closest_match",
"distance_method", "distance")
The code above only executes approximate match between df1 and df2 based on data in the "Name" column. What I want to do is have the approximate match on "Name" column selected on the condition that for the two values, there is a match on the "Country" column.
You should check out the fuzzywuzzy library (a Python package, used here with pandas data frames), which has excellent fuzzy text matching capabilities. Then I would iterate through the unique countries and look for matches that pass a certain fuzz threshold score, like the following:
from fuzzywuzzy import fuzz, process
matches = []
for country in df1['Country'].unique().tolist():
    # restrict both frames to the current country
    dfm1 = df1[df1['Country'] == country]
    dfm2 = df2[df2['Country'] == country]
    candidates = dfm2['Name'].tolist()
    # pair each name with its best candidate (None if no score reaches the cutoff)
    matches.append(dfm1['Name'].apply(lambda x: (x, process.extractOne(x, candidates, score_cutoff=90))))
You can tweak the scorer input in order to get the matches the way you like it.
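Since the question is about R, the same per-country idea can be sketched in base R. This is a rough illustration rather than a drop-in replacement: it uses adist (Levenshtein distance) instead of the Jaro-Winkler method from the question, compares countries case-insensitively, and works on a trimmed-down copy of the sample data. The helper name `closest_in_country` is hypothetical.

```r
df1 <- data.frame(Name = c("Talisker dark storm", "johnney walker",
                           "nicolas feuillatte brut"),
                  Country = c("United Kingdom", "United states", "France"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("smirnoff ice", "Talisker daek strome",
                           "johnney walker", "nicolea feuilate brut"),
                  Country = c("Russia", "United Kingdom", "United states", "Italy"),
                  stringsAsFactors = FALSE)

# Closest candidate name among rows of `candidates` with the same country
closest_in_country <- function(name, country, candidates) {
  pool <- candidates[tolower(candidates$Country) == tolower(country), ]
  if (nrow(pool) == 0) {
    # no candidate shares the country, so there is nothing to match against
    return(data.frame(closest_match = NA_character_, distance = NA_real_,
                      stringsAsFactors = FALSE))
  }
  d <- adist(tolower(name), tolower(pool$Name))[1, ]
  best <- which.min(d)
  data.frame(closest_match = pool$Name[best], distance = d[best],
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, Map(closest_in_country, df1$Name, df1$Country,
                          MoreArgs = list(candidates = df2)))
matched <- data.frame(to_be_matched = df1$Name, res,
                      row.names = NULL, stringsAsFactors = FALSE)
```

Here "nicolas feuillatte brut" gets NA because its only near match in df2 is listed under a different country, which is the behaviour the question asks for.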

aggregates variables into new variable

I have a column in a dataframe which includes 30 different countries. I want to group these countries into 5 new values.
For example,
I have
China
Japan
US
Canada
....
Aggregate to new variables:
Asia
Asia
North America
North America
....
One solution I am thinking about is using nested ifelse. However it seems that I need 4 or 5 nested ifelse to get what I need. I don't think that's a good way. I want to know other efficient solutions.
One option would be to use a key/value dataset. The countrycode_data dataset from library(countrycode) can be used for this purpose. We match the example data column ('Col1') against the 'country.name' column in 'countrycode_data'; elements with no match return NA. In the OP's example, 'US' returns NA because the 'country.name' is 'United States'. The abbreviated form is in the 'cowc' column, but there it is 'USA', so we find it with grep and an anchored pattern. I would suggest applying grep to all NA elements in 'indx'. The completed 'indx' can then be used to return 'region' from 'countrycode_data'.
library(countrycode)
indx <- match(df1$Col1, countrycode_data$country.name)
pat <- paste0('^',paste(df1$Col1[is.na(indx)], collapse='|'))
indx[is.na(indx)] <- grep(pat, countrycode_data$cowc)
countrycode_data$region[indx]
#[1] "Eastern Asia" "Eastern Asia" "Northern America" "Northern America"
NOTE: This returns regions that are a bit more specific than the general 'Asia'.
If we use the 'continent' column,
countrycode_data$continent[indx]
#[1] "Asia" "Asia" "Americas" "Americas"
data
df1 <- structure(list(Col1 = c("China", "Japan", "US", "Canada")),
.Names = "Col1", class = "data.frame", row.names = c(NA, -4L))
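For only a handful of groups, the key/value idea can also be sketched with a plain named vector, which avoids nested ifelse calls. The `region_of` lookup and the extra 'Atlantis' row are hypothetical and only illustrate what happens to a value without an entry.

```r
# Hypothetical lookup: names are the keys, elements are the new values
region_of <- c(China = "Asia", Japan = "Asia",
               US = "North America", Canada = "North America")

df1 <- data.frame(Col1 = c("China", "Japan", "US", "Canada", "Atlantis"),
                  stringsAsFactors = FALSE)

# Index the lookup by name; keys that are not in the lookup come back as NA
df1$Region <- unname(region_of[df1$Col1])
```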
Another approach is to use the recode function from the car package:
library(car)
dat$Region <- recode(dat$Country, "c('China', 'Japan') = 'Asia'; c('US','Canada') = 'North America'")
Country Region
1 China Asia
2 Japan Asia
3 US North America
4 Canada North America
There are only 30 countries, so you can define a few vectors as shown below, create a new column, and replace its values according to the vectors.
asia <- c("India", "china")
NorthAmerica <- c("US", "canada")
df$continent <- df$countries
df$continent <- with(df, replace(continent, countries%in%asia,"Asia"))
df$continent <- with(df, replace(continent, countries%in%NorthAmerica,"North America"))
'continent' is a built-in destination code of the countrycode package. You can pass a vector of country names and get a vector of continent names back with...
library(countrycode)
countries <- c('China', 'Japan', 'US', 'Canada')
countrycode(countries, 'country.name', 'continent')
returns...
[1] "Asia" "Asia" "Americas" "Americas"
When using the replace- and recode-based approaches above, make sure the column is a character vector rather than a factor; assigning a value that is not an existing level to a factor gives NA:
df$continent <- as.character(df$countries)
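A short base R sketch of why the column type matters when replacing values. The two-element `countries` vector is made up for illustration, and suppressWarnings is used only to keep the example quiet.

```r
countries <- factor(c("China", "US"))

# "Asia" is not an existing level, so the factor element becomes NA
as_factor <- suppressWarnings(replace(countries, countries == "China", "Asia"))

# Converting to character first lets the new value through
as_char <- replace(as.character(countries), countries == "China", "Asia")
```

The character result can be converted back with factor() afterwards if a factor is needed.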
