I’m using data from World Development Indicators (WDI) and want to merge this data with some other data. My problem is that the spelling of country names in the two datasets is different. How do I change the country variable?
library('WDI')
df <- WDI(country="all", indicator= c("NY.GDP.MKTP.CD", "EN.ATM.CO2E.KD.GD", 'SE.TER.ENRR'), start=1998, end=2011, extra=FALSE)
head(df)
country iso2c year NY.GDP.MKTP.CD EN.ATM.CO2E.KD.GD SE.TER.ENRR
99 ArabWorld 1A 1998 575369488074 1.365953 NA
100 ArabWorld 1A 1999 627550544566 1.355583 19.54259
101 ArabWorld 1A 2000 723111925659 1.476619 NA
102 ArabWorld 1A 2001 703688747656 1.412750 NA
103 ArabWorld 1A 2002 713021728054 1.413733 NA
104 ArabWorld 1A 2003 803017236111 1.469197 NA
How do I change ArabWorld to Arab World?
There are a lot of names I need to change, so doing this by row number will not give me enough flexibility. I want something similar to the replace function in Stata.
This works for character vectors and factors; for a factor, the column becomes character after the assignment.
df$country <- sub("ArabWorld", "Arab World", df$country)
This is equivalent:
> df[,1] <- sub("ArabWorld", "Arab World", df[,1] )
> head(df)
country iso2c year NY.GDP.MKTP.CD EN.ATM.CO2E.KD.GD
99 Arab World 1A 1998 575369488074 1.365953
100 Arab World 1A 1999 627550544566 1.355583
101 Arab World 1A 2000 723111925659 1.476619
102 Arab World 1A 2001 703688747656 1.412750
If you create a dataframe with the desired changes you can loop through to change them. Note that I have updated this so that it shows how to enter the parentheses in that column so they would be correctly passed to sub:
name.cng <- data.frame(orig = c("AntiguaandBarbuda", "AmericanSamoa",
"EastAsia&Pacific\\(developingonly\\)",
"Europe&CentralAsia\\(developingonly\\)",
"UnitedArabEmirates"),
spaced=c("Antigua and Barbuda", "American Samoa",
"East Asia & Pacific (developing only)",
"Europe & Central Asia (developing only)",
"United Arab Emirates") )
for (i in 1:NROW(name.cng)){
df$country <- sub(name.cng[i,1], name.cng[i,2], df$country) }
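As an alternative to the loop above, here is a minimal sketch that uses a named character vector as the lookup and replaces only whole-string matches, so the parentheses need no escaping. The data frame toy and the vector name_map are hypothetical names introduced for illustration; they are not part of the WDI package.

```r
# Toy data standing in for the WDI country column (hypothetical values)
toy <- data.frame(country = c("ArabWorld", "AmericanSamoa",
                              "EastAsia&Pacific(developingonly)", "France"),
                  stringsAsFactors = FALSE)

# Lookup: names are the current spellings, values are the desired ones
name_map <- c("ArabWorld" = "Arab World",
              "AmericanSamoa" = "American Samoa",
              "EastAsia&Pacific(developingonly)" = "East Asia & Pacific (developing only)")

# Replace only whole-string matches; names not in the lookup are left untouched
hit <- toy$country %in% names(name_map)
toy$country[hit] <- unname(name_map[toy$country[hit]])
toy$country
```

Unlike sub, this does not treat the original spelling as a regular expression and does not touch partial matches.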
The easiest approach, especially if you have many names to change, is probably to put your correspondence table in a data.frame and join it with the data using the merge command.
For instance, if you wanted to change the name of the Koreas:
# Correspondence table
countries <- data.frame(
iso2c = c("KR", "KP"),
country = c("South Korea", "North Korea")
)
# Join the data.frames
d <- merge( df, countries, by="iso2c", all.x=TRUE )
# Compute the new country name
d$country <- ifelse(is.na(d$country.y), as.character(d$country.x), as.character(d$country.y))
# Remove the columns we no longer need
d <- d[, setdiff(names(d), c("country.x", "country.y"))]
# Check that the result looks correct
head(d)
head(d[ d$iso2c %in% c("KR", "KP"), ])
However, it may be safer to join your two datasets on the country ISO code, which is more standard, than on the country name.
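A minimal sketch of joining on the ISO code instead of the name, assuming the second dataset also carries an iso2c column. The data frames wdi_like and other, the columns x and y, and all values are made up for illustration.

```r
# Toy stand-ins for the two datasets; all values are made up for illustration
wdi_like <- data.frame(iso2c = c("KR", "KP", "FR"),
                       country = c("Korea, Rep.", "Korea, Dem. Rep.", "France"),
                       x = c(1, 2, 3),
                       stringsAsFactors = FALSE)
other <- data.frame(iso2c = c("KR", "FR"),
                    country = c("South Korea", "France"),
                    y = c(10, 20),
                    stringsAsFactors = FALSE)

# Join on the ISO code and keep only one spelling of the country name
merged <- merge(wdi_like, other[, c("iso2c", "y")], by = "iso2c", all.x = TRUE)
merged
```

With all.x = TRUE, rows that have no partner in the second dataset are kept and get NA in y, which makes unmatched codes easy to spot.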
Using subsetting:
df[df[, "country"] == "ArabWorld", "country"] <- "Arab World"
head(df)
country iso2c year NY.GDP.MKTP.CD EN.ATM.CO2E.KD.GD SE.TER.ENRR
99 Arab World 1A 1998 575369488074 1.365953 NA
100 Arab World 1A 1999 627550544566 1.355583 19.54259
101 Arab World 1A 2000 723111925659 1.476619 NA
102 Arab World 1A 2001 703688747656 1.412750 NA
103 Arab World 1A 2002 713021728054 1.413733 NA
104 Arab World 1A 2003 803017236111 1.469197 NA
I have a data frame that was given to me. In the column titled state, two values have the same name but different capitalization: one is "London" and the other is "LONDON". How can I rename "LONDON" to "London" so that they are totalled together rather than separately? Note that I am trying to change the values in the column, not the name of the column.
You can use the following code, where df is your current data frame, in which you want to replace "LONDON" with "London":
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil","US", "Brazil", "UK", "Germany"),
State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
Candidate = c(1:8))
print(df)
output
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK LONDON 7
8 Germany Berlin 8
Then run the following code to set State to "London" wherever it equals "LONDON":
df[df$State == "LONDON", "State"] <- "London"
Now the output is:
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK London 7
8 Germany Berlin 8
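Maybe you could try the case_when function from dplyr. I would do something like this:
library(dplyr)
# The fallback must be NA_character_ (not NA_real_) because the other branches return strings
mutate(data, State_def = case_when(State == "LONDON" ~ "London",
                                   State == "London" ~ "London",
                                   TRUE ~ NA_character_))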
I might misunderstand, but I think it should be as simple as this:
x$state <- sub( "LONDON", "London", x$state, fixed=TRUE )
This should change LONDON to London
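If other capitalizations might also occur, a minimal sketch is to compare on an upper-cased copy of the column, so every variant is mapped to the same spelling. The vector state and the extra value "london" are assumptions added for illustration.

```r
# Hypothetical values; "london" is added to show a third capitalization
state <- c("NY", "London", "Bavaria", "LONDON", "london", "Berlin")

# Compare case-insensitively, then write the preferred spelling
state[toupper(state) == "LONDON"] <- "London"
table(state)
```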
I have a column of labelled values. Let's call it Country.
When I run:
attr(dat[["Country"]], "labels")
I get the following table:
USA Germany France UK Spain India Saudi Arabia
1 2 3 4 5 6 7
Now I have a new column of int values that are not labelled. Let's call it newCountry. I would like to change those int values to the labels of the original Country column. In other words, I would like to go from this, in an efficient way...
3
2
2
1
5
4
to this...
France
Germany
Germany
USA
Spain
UK
The data frame has a column, Country, with the attribute "labels" set. In turn, this attribute, which is just a vector, has the attribute "names" set. So the steps to get the "names" of the "labels" are:
Get the "labels" of column Country;
Get the "names" of the vector of labels;
Extract the names corresponding to a vector of indices, the vector i.
First read in the posted data.
nms <- scan(text = "USA Germany France UK Spain India 'Saudi Arabia'",
what = character())
i <- scan(text = "3 2 2 1 5 4")
Now create a data set example.
labs <- setNames(1:7, nms)
dat <- data.frame(Country = sample(letters, 7))
attr(dat[["Country"]], "labels") <- labs
And extract what the question asks for, following the steps above.
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
Or a one-liner:
names(attr(dat[["Country"]], "labels"))[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
To see that this does not depend on the values of the labels, create a second example.
labs2 <- setNames(101:107, nms)
attr(dat[["Country"]], "labels") <- labs2
And though the "labels" are different, the same instructions work:
attr(dat[["Country"]], "labels")
# USA Germany France UK Spain India Saudi Arabia
# 101 102 103 104 105 106 107
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]
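The instructions above index the names by position. If the new column instead holds the label values themselves (for example 103 rather than 3), a minimal sketch is to look the values up with match. The vector newCountry below is a hypothetical example, and the label values 101 to 107 are the ones from the second example.

```r
nms <- c("USA", "Germany", "France", "UK", "Spain", "India", "Saudi Arabia")
labs2 <- setNames(101:107, nms)

# Hypothetical column holding label values rather than positions
newCountry <- c(103, 102, 102, 101, 105, 104)

# match() finds where each value sits in the labels vector
by_value <- names(labs2)[match(newCountry, labs2)]
by_value
```

Values that do not occur among the labels come back as NA instead of silently picking the wrong name.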
Specifically, I'm trying to combine the two data frames UN_M.49_Countries and UN_M.49_Regions; the latter contains the country codes in nested lists.
> UN_M.49_Countries
Code Name ISO_Alpha_3
1 004 Afghanistan AFG
2 248 Åland Islands ALA
3 008 Albania ALB
...
> UN_M.49_Regions
Code Name Parent Children Type
1 001 World 002, 019, 010, 142, 150, 009 Region
2 002 Africa 001 015, 202 Region
3 015 Northern Africa 002 012, 818, 434, 504, 729, 788, 732 Region
...
I would like to build a new table which adds two columns to UN_M.49_Countries.
> new_table
Code Name ISO_Alpha_3 Region Subregion
1 004 Afghanistan AFG Asia Southern Asia
2 248 Åland Islands ALA Europe Northern Europe
3 008 Albania ALB Europe Southern Europe
...
I am new to programming and R and, to be honest, I do not even know where to start. Any help would be much appreciated!
install.packages("ISOcodes")
library(ISOcodes)
UN_M.49_Countries
UN_M.49_Regions
If you need a specific region, you can change "Southern Europe" to any region name you like; if you don't subset, you get the whole world.
Check out the package documentation.
https://cran.r-project.org/web/packages/ISOcodes/ISOcodes.pdf
data("UN_M.49_Regions")
data("UN_M.49_Countries")
region <- subset(UN_M.49_Regions, Name == "Southern Europe")
codes <- unlist(strsplit(region$Children, ", "))
subset(UN_M.49_Countries, Code %in% codes)
Using the tidyverse
library(ISOcodes)
library(tidyverse)
library(stringr)
countries <- UN_M.49_Countries
regions <- UN_M.49_Regions
# Children holds codes separated by ", ", so split on the comma plus any
# following whitespace; otherwise the codes keep a leading space and never join
region_focused <- regions %>%
  mutate(codes = str_split(Children, ",\\s*")) %>%
  unnest(codes) %>%
  left_join(countries, by = c("codes" = "Code"))
countr_focused <- regions %>%
  mutate(codes = str_split(Children, ",\\s*")) %>%
  unnest(codes) %>%
  right_join(countries, by = c("codes" = "Code"))
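To get the Region and Subregion columns asked for in the question, one possible base R sketch is to expand Children into one row per child code, join that to the countries, and then look up the parent of each subregion. The two small data frames below are hypothetical stand-ins that mimic the layout shown in the question; the sketch assumes every country sits directly under a subregion whose parent is the region, which may not hold for every entry in the real tables.

```r
# Hypothetical miniature versions of the two tables
countries <- data.frame(Code = c("012", "818", "008"),
                        Name = c("Algeria", "Egypt", "Albania"),
                        stringsAsFactors = FALSE)
regions <- data.frame(Code = c("002", "015", "150", "039"),
                      Name = c("Africa", "Northern Africa", "Europe", "Southern Europe"),
                      Parent = c("001", "002", "001", "150"),
                      Children = c("015, 202", "012, 818", "039, 154", "008, 380"),
                      stringsAsFactors = FALSE)

# One row per (child code, containing region)
children <- strsplit(regions$Children, ", ", fixed = TRUE)
lookup <- data.frame(Code = unlist(children),
                     Subregion = rep(regions$Name, lengths(children)),
                     SubregionParent = rep(regions$Parent, lengths(children)),
                     stringsAsFactors = FALSE)

# Attach the subregion, then resolve its parent code to a region name
new_table <- merge(countries, lookup, by = "Code", all.x = TRUE)
new_table$Region <- regions$Name[match(new_table$SubregionParent, regions$Code)]
new_table$SubregionParent <- NULL
new_table
```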
I am having some problems with the following task
I have a data frame of this type with 99 different countries for thousands of IDs
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable corresponding to a given macroregion of Nationality
so I created a matrix of this kind with the rule to follow
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work, it reports the following error
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me; applying this code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
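Note that the tables in the question spell "USA" and "Usa" differently, so an exact join would leave that row without a continent. A minimal base R sketch, assuming the only difference between the two spellings is capitalization, is to look the continent up with match on upper-cased keys. The data frames dataset and continent below are toy versions of the ones described in the question.

```r
# Toy versions of the two tables from the question
dataset <- data.frame(ID = 1:4,
                      Nationality = c("Italy", "Eritrea", "Italy", "USA"),
                      stringsAsFactors = FALSE)
continent <- data.frame(Nationality = c("Italy", "Eritrea", "Usa", "France"),
                        Continent = c("Europe", "Africa", "America", "Europe"),
                        stringsAsFactors = FALSE)

# Case-insensitive lookup: position of each nationality in the rule table
idx <- match(toupper(dataset$Nationality), toupper(continent$Nationality))
dataset$Continent <- continent$Continent[idx]

# Rows with no rule, if any, can be inspected here
dataset[is.na(idx), ]
dataset
```

This adds a single column to the existing data frame rather than building a merged copy, which may ease memory pressure, although that is not guaranteed to resolve the allocation error.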
I have this dataframe:
data <- data.frame(countries=c(rep('UK', 5),
rep('Netherlands 1a', 5),
rep('Netherlands', 5),
rep('USA', 5),
rep('spain', 5),
rep('Spain', 5),
rep('Spain 1a', 5),
rep('spain 1a', 5)),
var=rnorm(40))
countries var
1 UK 0.506232270
2 UK 0.976348808
3 UK -0.752151769
4 UK 1.137267199
5 UK -0.363406715
6 Netherlands 1a -0.800835463
7 Netherlands 1a 1.767724231
8 Netherlands 1a 0.810757929
9 Netherlands 1a -1.188975114
10 Netherlands 1a -0.763144245
11 Netherlands 0.428511920
12 Netherlands 0.835184425
13 Netherlands -0.198316780
14 Netherlands 1.108191193
15 Netherlands 0.946819500
16 USA 0.226786121
17 USA -0.466886468
18 USA -2.217910876
19 USA -0.003472937
20 USA -0.784264921
21 spain -1.418014562
22 spain 1.002412706
23 spain 0.472621627
24 spain -1.378960222
25 spain -0.197020702
26 Spain 1.197971896
27 Spain 1.227648883
28 Spain -0.253083684
29 Spain -0.076562960
30 Spain 0.338882352
31 Spain 1a 0.074459521
32 Spain 1a -1.136391220
33 Spain 1a -1.648418916
34 Spain 1a 0.277264011
35 Spain 1a -0.568411569
36 spain 1a 0.250151646
37 spain 1a -1.527885883
38 spain 1a -0.452190849
39 spain 1a 0.454168927
40 spain 1a 0.889401396
I want to be able to find levels of countries that appear in different forms more than once. Forms that levels of countries might appear in are:
lowercase, for example "spain"
titlecase, for example "Spain"
lowercase with a different word attached, for example "spain 1a"
titlecase with a different word attached, for example "Spain 1a"
So I need a function that returns a vector listing the levels of countries that appear more than once. In data, the vector that should be returned is:
"Netherlands 1a", "Netherlands", "spain", "Spain", "spain 1a", "Spain 1a"
Is it possible to make a function that would return this vector?
A quick solution that should meet all requirements (assuming that the country name is always the first element of your data$countries entries, and that countries is a factor, which requires stringsAsFactors = TRUE in R 4.0.0 and later):
# Country substrings
country.substr <- sapply(strsplit(tolower(levels(data$countries)), " "), "[[", 1)
# Duplicated country substrings
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a"
Update:
Assuming that the country name is not always to be found at the first position, you need to apply a different approach that I took from here. Note that I slightly modified your sample data to clarify what I'm doing:
data <- data.frame(countries=c(rep('United Kingdom', 5),
rep('united kingdom', 5),
rep('Netherlands', 5),
rep('Netherlands 1a', 5),
rep('1a Netherlands', 5),
rep('USA', 5),
rep('spain', 5),
rep('Spain', 5),
rep('Spain 1a', 5),
rep('spain 1a', 5)),
var=rnorm(50))
Now let's identify all country substrings that do NOT contain any numerics. The subsequent steps remain the same. Is that what you need?
# Remove mixed numeric/alphabetic parts from country names
country.substr <- lapply(strsplit(tolower(levels(data$countries)), " "), function(i) {
# Identify, paste and return alphabetic-only components
tmp <- grep("^[[:alpha:]]*$", i)
if (length(tmp) == 1)
return(i[tmp])
else
return(paste(i[tmp], collapse = " "))
})
# Identify duplicated country names
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "1a Netherlands" "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a" "united kingdom" "United Kingdom"
Why not use grep? The ignore.case argument is just what you need here.
> uch <- unique(as.character(data$countries))
> found <- sapply(seq(uch), function(i){
if(!grepl("\\s|[0-9]", uch[i]))
grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
})
> ff <- found[sapply(found, function(x) length(x) > 1)]
> unique(unlist(ff))
# [1] "Netherlands 1a" "Netherlands" "spain"
# [4] "Spain" "Spain 1a" "spain 1a"
Here's my logic: Take the unique factor levels of the column as a character vector. Then, compare it with itself, looking only at those levels that do not contain a space or a digit. grep will catch those, but the other way around is a bit tougher. Then, we just find the unique matches. So here's a function and a test run:
find.matches <- function(column)
{
uch <- unique(as.character(column))
found <- sapply(seq(uch), function(i){
if(!grepl("\\s|[0-9]", uch[i]))
grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
})
ff <- found[sapply(found, function(x) length(x) > 1)]
unique(unlist(ff))
}
> dat <- data.frame(x = c("a", "a1", "a 1b", "c", "d"),
y = c("fac", "tor", "fac 1a", "tor1a", "fac"))
> sapply(dat, find.matches)
# $x
# [1] "a" "a1" "a 1b"
#
# $y
# [1] "fac" "fac 1a" "tor" "tor1a"
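Another way to think about the problem is to reduce every value to a normalized key and then group the original spellings by that key. The sketch below assumes the extra word, when present, is a trailing token that contains a digit (as in "Spain 1a"); the names key, groups and variants are introduced here for illustration.

```r
# The distinct values from the question's data
countries <- c("UK", "Netherlands 1a", "Netherlands", "USA",
               "spain", "Spain", "Spain 1a", "spain 1a")

# Key: drop a trailing token that contains a digit, then lower-case
key <- tolower(sub("\\s+\\S*[0-9]\\S*$", "", countries))

# Group the original spellings by key and keep keys with several spellings
groups <- split(countries, key)
variants <- unlist(groups[lengths(groups) > 1], use.names = FALSE)
variants
```

Because the comparison is on whole keys rather than on grep patterns, a short name such as "a" does not match every value that merely contains that letter.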