Find levels of a factors that appear more than once - r

I have this dataframe:
data <- data.frame(countries=c(rep('UK', 5),
rep('Netherlands 1a', 5),
rep('Netherlands', 5),
rep('USA', 5),
rep('spain', 5),
rep('Spain', 5),
rep('Spain 1a', 5),
rep('spain 1a', 5)),
var=rnorm(40))
countries var
1 UK 0.506232270
2 UK 0.976348808
3 UK -0.752151769
4 UK 1.137267199
5 UK -0.363406715
6 Netherlands 1a -0.800835463
7 Netherlands 1a 1.767724231
8 Netherlands 1a 0.810757929
9 Netherlands 1a -1.188975114
10 Netherlands 1a -0.763144245
11 Netherlands 0.428511920
12 Netherlands 0.835184425
13 Netherlands -0.198316780
14 Netherlands 1.108191193
15 Netherlands 0.946819500
16 USA 0.226786121
17 USA -0.466886468
18 USA -2.217910876
19 USA -0.003472937
20 USA -0.784264921
21 spain -1.418014562
22 spain 1.002412706
23 spain 0.472621627
24 spain -1.378960222
25 spain -0.197020702
26 Spain 1.197971896
27 Spain 1.227648883
28 Spain -0.253083684
29 Spain -0.076562960
30 Spain 0.338882352
31 Spain 1a 0.074459521
32 Spain 1a -1.136391220
33 Spain 1a -1.648418916
34 Spain 1a 0.277264011
35 Spain 1a -0.568411569
36 spain 1a 0.250151646
37 spain 1a -1.527885883
38 spain 1a -0.452190849
39 spain 1a 0.454168927
40 spain 1a 0.889401396
I want to be able to find levels of countries that appear in different forms more than once. Forms that levels of countries might appear in are:
lowercase, for example "spain"
titlecase, for example "Spain"
lowercase with a different word attached, for example "spain 1a"
titlecase with a different word attached, for example "Spain 1a"
So I need to function to return a vector listing levels countries that appear more than once. In data, the vector that should be returned is:
"Netherlands 1a", "Netherlands", "spain", "Spain", "spain 1a", "Spain 1a"
Is it possible to make a function that would return this vector?

A quick solution that should meet all requirements (assuming that the country name is always the first element of your data$country entries):
# Country substrings
country.substr <- sapply(strsplit(tolower(levels(data$countries)), " "), "[[", 1)
# Duplicated country substrings
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a"
Update:
Assuming that the country name is not always to be found at the first position, you need to apply a different approach that I took from here. Note that I slightly modified your sample data to clarify what I'm doing:
data <- data.frame(countries=c(rep('United Kingdom', 5),
rep('united kingdom', 5),
rep('Netherlands', 5),
rep('Netherlands 1a', 5),
rep('1a Netherlands', 5),
rep('USA', 5),
rep('spain', 5),
rep('Spain', 5),
rep('Spain 1a', 5),
rep('spain 1a', 5)),
var=rnorm(50))
Now let's identify all country substrings that do NOT contain any numerics. The subsequent steps remain the same. Is that what you need?
# Remove mixed numeric/alphabetic parts from country names
country.substr <- lapply(strsplit(tolower(levels(data$countries)), " "), function(i) {
# Identify, paste and return alphabetic-only components
tmp <- grep("^[[:alpha:]]*$", i)
if (length(tmp) == 1)
return(i[tmp])
else
return(paste(i[tmp], collapse = " "))
})
# Identify douplicated country names
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "1a Netherlands" "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a" "united kingdom" "United Kingdom"

Why not use grep? The ignore.case argument is just what you need here.
> uch <- unique(as.character(data$countries))
> found <- sapply(seq(uch), function(i){
if(!grepl("\\s|[0-9]", uch[i]))
grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
})
> ff <- found[sapply(found, function(x) length(x) > 1)]
> unique(unlist(ff))
# [1] "Netherlands 1a" "Netherlands" "spain"
# [4] "Spain" "Spain 1a" "spain 1a"
Here's my logic: Take the unique factor levels of the column as a character vector. Then, compare it with itself, looking only at those levels that do not contain a space or a digit. grep will catch those, but the other way around is a bit more tough. Then, we just find the unique matches. So here's a function and a test run,
find.matches <- function(column)
{
uch <- unique(as.character(column))
found <- sapply(seq(uch), function(i){
if(!grepl("\\s|[0-9]", uch[i]))
grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
})
ff <- found[sapply(found, function(x) length(x) > 1)]
unique(unlist(ff))
}
> dat <- data.frame(x = c("a", "a1", "a 1b", "c", "d"),
y = c("fac", "tor", "fac 1a", "tor1a", "fac"))
> sapply(dat, find.matches)
# $x
# [1] "a" "a1" "a 1b"
#
# $y
# [1] "fac" "fac 1a" "tor" "tor1a"

Related

Create several columns from a complex column in R

Imagine dataset:
df1 <- tibble::tribble(~City, ~Population,
"United Kingdom > Leeds", 1500000,
"Spain > Las Palmas de Gran Canaria", 200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal", 250000,
"United States > Minneapolis, MN", 700000,
"United States > Milwaukee, WI", NA,
"United States > Milwaukee", 400000)
The same dataset for visual representation:
I would like to:
Split column City into three columns: City, Country, State (if available, NA otherwise)
Check that Milwaukee has data in state and population (the NA for Milwaukee should have a value of 400000 and then split [City-State-Country] :).
Could you, please, suggest the easiest method to do so :)
Here's another solution with extract to do the extraction of Country, City, and State in a single go with State extracted by an optional capture group (the remainder of the task is done as by #Allen's code):
library(tidyr)
library(dplyr)
df1 %>%
extract(City,
into = c("Country", "City", "State"),
regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?"
) %>%
# as by #Allen Cameron:
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
separate(City, sep = " > ", into = c("Country", "City")) %>%
separate(City, sep = ', ', into = c('City', 'State')) %>%
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups: Country [4]
#> Country City State Population
#> <chr> <chr> <chr> <dbl>
#> 1 Canada Montreal <NA> 250000
#> 2 Canada Nanaimo BC 150000
#> 3 Spain Las Palmas de Gran Canaria <NA> 200000
#> 4 United Kingdom Leeds <NA> 1500000
#> 5 United States Milwaukee WI 400000
#> 6 United States Minneapolis MN 700000

Changing spelling for multiple words at a time in R/replacing many words at once

I have a dataset (survey) and a column of birth_country, where people have written their country of birth. An example of it:
1 america
2 usa
3 american
4 us of a
5 united states
6 england
7 english
8 great britain
9 uk
10 united kingdom
how I would like it to look:
1 america
2 america
3 america
4 america
5 america
6 uk
7 uk
8 uk
9 uk
10 uk
I have tried using str_replace to manually insert the different spellings, to replace them with 'america' but when I look at my dataset, nothing has changed
e.g.
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
survey$birth_country <- str_replace(survey$birth_country, ' "united state"|"united statea"|"united states of america"', "america")
thank you in advance
Come up with some patterns that only match for each country and basically loop over what you are already doing (you can change the replacement below with your favorite function)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
## use a _named_ list of regular expressions
## the name will be the replacement string
l <- list(
america = 'amer|us|states',
uk = 'eng|brit|king|uk',
'another country' = 'ano|an co',
chaz = 'chaz|chop'
)
f <- function(x, list) {
for (ii in seq_along(list)) {
x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
}
x
}
## test it
f(survey$birth_country, l)
# [1] "america" "america" "america" "america" "america" "uk" "uk" "uk" "uk" "uk"
within(survey, {
clean <- f(birth_country, l)
})
# birth_country clean
# 1 america america
# 2 usa america
# 3 american america
# 4 us of a america
# 5 united states america
# 6 england uk
# 7 english uk
# 8 great britain uk
# 9 uk uk
# 10 united kingdom uk
Note that 1) if you don't give a pattern that matches, nothing will change, but 2) if you give a pattern that matches both countries (e.g., "united"), the first in the list will be used (unless the replacement itself is also matched)
Looks like the problem is in how you specified your regular expression. Try this (updated based on #Gabriella 's comment, and another tidyverse approach, similar to #MarBIo ):
library(tidyverse)
survey <- survey %>%
mutate(birth_country = if_else(
str_detect(birth_country,
"(united state)|(united statea)|(united states of america)"), #If your regular expression matches any in birth_country
"america", #Change it to "america"
birth_country #Otherwise, keep as is.
) #end of if_else
) #end of mutate
Other people are suggesting you come up with a more complex regular expression, which you can certainly do as well. Consecutive "or" (i.e. "|") statements in your regular expression works though.
In case you allow tidyverse`s mutate you can do:
library(tidyverse)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
americas <- c("america", "usa", "american", "us of a", "united states")
englands <- c("england", "english", "great britain")
survey %>%
mutate(birth_country = ifelse(birth_country %in% americas, 'america', 'UK'))
#> birth_country
#> 1 america
#> 2 america
#> 3 america
#> 4 america
#> 5 america
#> 6 UK
#> 7 UK
#> 8 UK
#> 9 UK
#> 10 UK

Switching values to labels in a new column

I got a column of labelled values. Let's call it country.
When I run:
attr(dat[["Country"]], "labels")
I get the next table:
USA Germany France UK Spain India Saudi Arabia
1 2 3 4 5 6 7
Now I got a new column of int values that are not labelled. Let's call it newCountry. I would like to change those int values to the label of the original Country column. In other words, I would like to go from this in an efficient way...
3
2
2
1
5
4
to this...
France
Germany
Germany
USA
Spain
UK
The problem is that the data frame has a column, Country, with the attribute "labels" set. In its turn, this attribute, which is just a vector, has the attribute "names" set. So the steps to get the "names" of the "labels" are:
Get the "labels" of column Country;
Get the "names" of the vector of labels;
Extract the names corresponding to a vector of indices, the vector i.
First read in the posted data.
nms <- scan(text = "USA Germany France UK Spain India 'Saudi Arabia'",
what = character())
i <- scan(text = "3 2 2 1 5 4")
Now create a data set example.
labs <- setNames(1:7, nms)
dat <- data.frame(Country = sample(letters, 7))
attr(dat[["Country"]], "labels") <- labs
And extract what the question asks for, following the steps above.
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
Or a one-liner:
names(attr(dat[["Country"]], "labels"))[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
To see that this does not depend on the values of the labels, create a second example.
labs2 <- setNames(101:107, nms)
attr(dat[["Country"]], "labels") <- labs2
And though the "labels" are different, the same instructions work:
attr(dat[["Country"]], "labels")
# USA Germany France UK Spain India Saudi Arabia
# 101 102 103 104 105 106 107
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]

Read csv file with too many commas

Maybe it's easy, but I have a csv file with a lot of commas and R doesn't read it correctly, it puts all data in the first column and doesn't present it as a table.
Do you know how I can make to read the file correctly as a classic csv file?
You can download the file here from world bank
Well, it takes some time to clean the file, fortunately I have the time.
gdp2012 <- read.csv("getdata_data_GDP.csv", stringsAsFactors = FALSE)
cnames <- gdp2012[3, ]
names(cnames) <- NULL
cnames[1] <- "Abbrev"
cnames[5] <- "Millions.USD"
names(gdp2012) <- cnames
names(gdp2012)
[1] "Abbrev" "Ranking" "NA" "Economy" "Millions.USD"
[6] "" "NA" "NA" "NA" "NA"
gdp2012 <- gdp2012[, -grep("NA", names(gdp2012))]
gdp2012 <- gdp2012[, -ncol(gdp2012)]
gdp2012 <- gdp2012[-c(1:4, 237:nrow(gdp2012)), ]
dim(gdp2012)
[1] 232 4
str(gdp2012)
'data.frame': 232 obs. of 4 variables:
$ Abbrev : chr "USA" "CHN" "JPN" "DEU" ...
$ Ranking : chr "1" "2" "3" "4" ...
$ Economy : chr "United States" "China" "Japan" "Germany" ...
$ Millions.USD: chr " 16,244,600 " " 8,227,103 " " 5,959,718 " " 3,428,131 " ...
gdp2012[[4]] <- as.numeric(gsub(",", "", gdp2012[[4]]))
Warning message:
NAs introduced by coercion
gdp2012[[2]] <- as.numeric(gdp2012[[2]])
head(gdp2012)
Abbrev Ranking Economy Millions.USD
5 USA 1 United States 16244600
6 CHN 2 China 8227103
7 JPN 3 Japan 5959718
8 DEU 4 Germany 3428131
9 FRA 5 France 2612878
10 GBR 6 United Kingdom 2471784
If you want the row numbers to start at 1, just do
rownames(gdp2012) <- NULL
Delete row 1,-3,238-241 in csv itself
use rio library.
data=import("filename").
There are lots of blank rows in data. so you can use
data[rowSums(is.na(data)) == 0,]
Using data.table, fread function, which has many import arguments and also faster:
library(data.table)
res <- fread("myFile.csv",
sep = ",", # separator is comma
skip = 5, # skip first 5 rows
select = c(1, 2, 4, 5), # select columns by index
na.strings = c("", ".."), # convert blanks to NA
# set column names
col.names = c("Country", "Ranking", "Economy", "USD_Mln"))
# remove blank rows
res <- res[ !is.na(Country), ]
# convert character numbers to numbers
res[ , USD_Mln := as.numeric(gsub(",", "", USD_Mln))]
head(res)
# Country Ranking Economy USD_Mln
# 1: USA 1 United States 16244600
# 2: CHN 2 China 8227103
# 3: JPN 3 Japan 5959718
# 4: DEU 4 Germany 3428131
# 5: FRA 5 France 2612878
# 6: GBR 6 United Kingdom 2471784

How do I change contents of a column in data.frame

I’m using data from World Development Indicators (WDI) and want to merge this data with some other data. My problem is that the spelling of country names in the two datasets is different. How do I change the country variable?
library('WDI')
df <- WDI(country="all", indicator= c("NY.GDP.MKTP.CD", "EN.ATM.CO2E.KD.GD", 'SE.TER.ENRR'), start=1998, end=2011, extra=FALSE)
head(df)
country iso2c year NY.GDP.MKTP.CD EN.ATM.CO2E.KD.GD SE.TER.ENRR
99 ArabWorld 1A 1998 575369488074 1.365953 NA
100 ArabWorld 1A 1999 627550544566 1.355583 19.54259
101 ArabWorld 1A 2000 723111925659 1.476619 NA
102 ArabWorld 1A 2001 703688747656 1.412750 NA
103 ArabWorld 1A 2002 713021728054 1.413733 NA
104 ArabWorld 1A 2003 803017236111 1.469197 NA
How do i change ArabWorld to Arab World?
There are a lot of names I need to change so doing this with the use of row.numbers will not give me enough flexibility. I want something that is similar to the replace function in Stata.
This would work for character or factors.
df$country <- sub("ArabWorld", "Arab World", df$country)
This is equivalent:
> df[,1] <- sub("ArabWorld", "Arab World", df[,1] )
> head(df)
country iso2c year NY.GDP.MKTP.CD EN.ATM.CO2E.KD.GD
99 Arab World 1A 1998 575369488074 1.365953
100 Arab World 1A 1999 627550544566 1.355583
101 Arab World 1A 2000 723111925659 1.476619
102 Arab World 1A 2001 703688747656 1.412750
If you create a dataframe with the desired changes you can loop through to change them. Note that I have updated this so that it shows how to enter the parentheses in that column so they would be correctly passed to sub:
name.cng <- data.frame(orig = c("AntiguaandBarbuda", "AmericanSamoa",
"EastAsia&Pacific\\(developingonly\\)",
"Europe&CentralAsia\\(developingonly\\)",
"UnitedArabEmirates"),
spaced=c("Antigua and Barbuda", "American Samoa",
"East Asia & Pacific (developing only)",
"Europe&CentralAsia (developing only)",
"United Arab Emirates") )
for (i in 1:NROW(name.cng)){
df$country <- sub(name.cng[i,1], name.cng[i,2], df$country) }
The easiest, especially if you have many names to change, is probably to put your correspondance table in a data.frame, and join it with the data, with the merge command.
For instance, if you wanted to change the name of the Koreas:
# Correspondance table
countries <- data.frame(
iso2c = c("KR", "KP"),
country = c("South Korea", "North Korea")
)
# Join the data.frames
d <- merge( df, countries, by="iso2c", all.x=TRUE )
# Compute the new country name
d$country <- ifelse(is.na(d$country.y), as.character(d$country.x), as.character(d$country.y))
# Remove the columns we no longer need
d <- d[, setdiff(names(d), c("country.x", "country.y"))]
# Check that the result looks correct
head(d)
head(d[ d$iso2c %in% c("KR", "KP"), ])
However, it may be safer to join your two datasets on the country ISO code, which is more standard, than on the country name.
Using subsetting:
df[df[, "country"] == "ArabWorld", "country"] <- "Arab World"
head(df)
country iso2c year NY.GDP.MKTP.CD EN.ATM.CO2E.KD.GD SE.TER.ENRR
99 Arab World 1A 1998 575369488074 1.365953 NA
100 Arab World 1A 1999 627550544566 1.355583 19.54259
101 Arab World 1A 2000 723111925659 1.476619 NA
102 Arab World 1A 2001 703688747656 1.412750 NA
103 Arab World 1A 2002 713021728054 1.413733 NA
104 Arab World 1A 2003 803017236111 1.469197 NA

Resources