How to create a data group (factor variable) in my dataframe based on categorical variables in R

I want to create a factor variable in my data frame based on a categorical variable.
My data:
# A tibble: 159 x 3
name.country gpd rate_suicide
<chr> <dbl> <dbl>
1 Afghanistan 2129. 6.4
2 Albania 12003. 5.6
3 Algeria 11624. 3.3
4 Angola 7103. 8.9
5 Antigua and Barbuda 19919. 0.5
6 Argentina 20308. 9.1
7 Armenia 10704. 5.7
8 Australia 47350. 11.7
9 Austria 52633. 11.4
10 Azerbaijan 14371. 2.6
# ... with 149 more rows
I want to create a factor variable, region, with the following levels:
region <- c('Asian', 'Europe', 'South America', 'North America', 'Africa')
region = factor(region, levels = c('Asian', 'Europe', 'South America', 'North America', 'Africa'))
I want to do this with the dplyr package, choosing the factor level based on name.country, but my attempt doesn't work. Example:
if (new_data$name.country[new_data$name.country == "N"]) {
mutate(new_data, region_ = region[1])
}
How can I solve this problem?

Here is how I would approach your problem:
Create a reproducible example (see "How to make a great R reproducible example"). Since you already have the data, use dput to make it easier for people like me to recreate your data in their environment.
dput(yourdf)
structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
raw_data<-structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
Define vectors that specify your regions
Use case_when to separate countries into regions
Use as.factor to convert your character variable to a factor
library(dplyr)
# Lookup vectors: list every country that belongs to each region
asia <- c("Afghanistan", "India", "...", "Rest of countries in Asia")
europe <- c("Albania", "France", "...", "Rest of countries in Europe")
africa <- c("Algeria", "Egypt", "...", "Rest of countries in Africa")
df <- raw_data %>%
  mutate(region = case_when(
    name.country %in% asia ~ "asia",
    name.country %in% europe ~ "europe",
    name.country %in% africa ~ "africa",
    TRUE ~ "other" # any country not listed above
  )) %>%
  # convert the character column to a factor
  mutate(region = as.factor(region))
You can check that your variable region is a factor using str
str(df)
'data.frame': 3 obs. of 4 variables:
$ name.country: chr "Afghanistan" "Albania" "Algeria"
$ gpd : int 2129 12003 11624
$ rate_suicide: num 6.4 5.6 3.3
$ region : Factor w/ 3 levels "africa","asia",..: 2 3 1
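Note that as.factor orders the levels alphabetically. If the order of the levels matters (the question lists the regions in a specific order), factor() with an explicit levels argument is one option. The following is a minimal sketch, assuming the same three-country sample; region_levels is an illustrative name that does not appear in the original answer, and the lookup vectors are cut down to one country each.

```r
library(dplyr)

raw_data <- data.frame(
  name.country = c("Afghanistan", "Albania", "Algeria"),
  gpd = c(2129, 12003, 11624),
  rate_suicide = c(6.4, 5.6, 3.3),
  stringsAsFactors = FALSE
)

# Shortened lookup vectors, for illustration only
asia <- c("Afghanistan")
europe <- c("Albania")
africa <- c("Algeria")

# Hypothetical helper: the level order we want the factor to have
region_levels <- c("asia", "europe", "africa", "other")

df <- raw_data %>%
  mutate(
    region = case_when(
      name.country %in% asia ~ "asia",
      name.country %in% europe ~ "europe",
      name.country %in% africa ~ "africa",
      TRUE ~ "other"
    ),
    # explicit levels keep the chosen order instead of alphabetical order
    region = factor(region, levels = region_levels)
  )

levels(df$region)
```

Levels with no matching rows (here "other") are kept in the factor; droplevels() removes them if that is not wanted.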

Here is a working example that combines data from the question with a file of countries and region information from GitHub. H/T to Luke Duncalfe for maintaining the region data, which is:
...a combination of the Wikipedia ISO-3166 article for alpha and numeric country codes and the UN Statistics site for countries' regional and sub-regional codes.
regionFile <- "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
regionData <- read.csv(regionFile,header=TRUE)
textFile <- "rowID|country|gdp|suicideRate
1|Afghanistan|2129.|6.4
2|Albania|12003.|5.6
3|Algeria|11624.|3.3
4|Angola|7103.|8.9
5|Antigua and Barbuda|19919.|0.5
6|Argentina|20308.|9.1
7|Armenia|10704.|5.7
8|Australia|47350.|11.7
9|Austria|52633.|11.4
10|Azerbaijan|14371.|2.6"
data <- read.csv(text=textFile,sep="|")
library(dplyr)
data %>%
left_join(.,regionData,by = c("country" = "name"))
...and the output:
rowID country gdp suicideRate alpha.2 alpha.3 country.code
1 1 Afghanistan 2129 6.4 AF AFG 4
2 2 Albania 12003 5.6 AL ALB 8
3 3 Algeria 11624 3.3 DZ DZA 12
4 4 Angola 7103 8.9 AO AGO 24
5 5 Antigua and Barbuda 19919 0.5 AG ATG 28
6 6 Argentina 20308 9.1 AR ARG 32
7 7 Armenia 10704 5.7 AM ARM 51
8 8 Australia 47350 11.7 AU AUS 36
9 9 Austria 52633 11.4 AT AUT 40
10 10 Azerbaijan 14371 2.6 AZ AZE 31
iso_3166.2 region sub.region intermediate.region
1 ISO 3166-2:AF Asia Southern Asia
2 ISO 3166-2:AL Europe Southern Europe
3 ISO 3166-2:DZ Africa Northern Africa
4 ISO 3166-2:AO Africa Sub-Saharan Africa Middle Africa
5 ISO 3166-2:AG Americas Latin America and the Caribbean Caribbean
6 ISO 3166-2:AR Americas Latin America and the Caribbean South America
7 ISO 3166-2:AM Asia Western Asia
8 ISO 3166-2:AU Oceania Australia and New Zealand
9 ISO 3166-2:AT Europe Western Europe
10 ISO 3166-2:AZ Asia Western Asia
region.code sub.region.code intermediate.region.code
1 142 34 NA
2 150 39 NA
3 2 15 NA
4 2 202 17
5 19 419 29
6 19 419 5
7 142 145 NA
8 9 53 NA
9 150 155 NA
10 142 145 NA
At this point one can decide whether to use the region, sub region, or intermediate region and convert it to a factor.
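A join on country names only works when both sources spell the names the same way, and any row without a match ends up with NA in the region columns. One way to check this before going further is anti_join, which returns the rows that found no match. This is a sketch with a hypothetical miniature lookup table (region_lookup) standing in for the downloaded file; the mismatched spelling is invented for illustration.

```r
library(dplyr)

# Hypothetical miniature of the region file; the real file has many more rows
region_lookup <- data.frame(
  name = c("Afghanistan", "Albania", "United States of America"),
  region = c("Asia", "Europe", "Americas"),
  stringsAsFactors = FALSE
)

# The last name is deliberately spelled differently from the lookup table
countries <- data.frame(
  country = c("Afghanistan", "Albania", "United States"),
  stringsAsFactors = FALSE
)

# Rows of countries that have no partner in region_lookup
unmatched <- anti_join(countries, region_lookup, by = c("country" = "name"))
unmatched$country
```

Any names listed here can be recoded in one of the two tables before the left_join.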
We can set region to a factor by adding a mutate() function to the dplyr pipeline:
data %>%
left_join(.,regionData,by = c("country" = "name")) %>%
mutate(region = factor(region)) -> mergedData
At this point mergedData$region is a factor.
str(mergedData$region)
table(mergedData$region)
> str(mergedData$region)
Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 2 3 5 4 3
> table(mergedData$region)
Africa Americas Asia Europe Oceania
2 2 3 2 1
Now the data is ready for further analysis. We will generate a table of average suicide rates by region.
library(knitr) # for kable
mergedData %>% group_by(region) %>%
summarise(suicideRate = mean(suicideRate)) %>%
kable(.)
...and the output:
|region | suicideRate|
|:--------|-----------:|
|Africa | 6.1|
|Americas | 4.8|
|Asia | 4.9|
|Europe | 8.5|
|Oceania | 11.7|
When rendered in an HTML / markdown viewer, the result looks like this:

Related

R Split strings into two columns based on TWO regular expressions

I have some data that is structured something like this:
ID Region Value
1 Europe 8
2 Europe: Class 1 6
3 Asia: System 2 6
4 North America 7
5 Europe: System 1 5
6 Africa 7
7 Africa: Class 2 5
8 South America 9
9 Europe: System 1 3
10 Europe 7
What I want to do is create a new column called Class which adds instances of where "Class" AND "System" are mentioned in the Region column - if it's not clear what I mean, take a look at my expected output below. I know this can be done with the separate function but I think you can only specify one value for the separator part of the code. E.g. sep = ": Class" will only split instances that mention "class" but I also want to split any instances where "system" is mentioned too. Can this be done in one line of code, or do I need to do something a bit more complicated here? Here's how my final data should look:
ID Region Class Value
1 Europe 8
2 Europe 1 6
3 Asia 2 6
4 North America 7
5 Europe 1 5
6 Africa 7
7 Africa 2 5
8 South America 9
9 Europe 1 3
10 Europe 7
Please note, I want to remove any reference to "class" or "system" (including colons) from the Region column, and simply add the numerical value to a new Class column.
You can do it with base functions by using strsplit with a regular expression that treats either ": System " or ": Class " as the separator. Using sapply rather than lapply keeps the new columns as plain character vectors instead of lists:
splitted <- strsplit(df$Region, ": (Class|System) ")
df$Region <- sapply(splitted, function(x) x[1])
df$Class <- sapply(splitted, function(x) x[2])
The result is:
> df
ID Region Value Class
1 1 Europe 8 NA
2 2 Europe 6 1
3 3 Asia 6 2
4 4 North America 7 NA
5 5 Europe 5 1
6 6 Africa 7 NA
7 7 Africa 5 2
8 8 South America 9 NA
9 9 Europe 3 1
10 10 Europe 7 NA
You can use str_extract to extract the number and str_remove to drop the text that you don't want.
library(dplyr)
library(stringr)
df %>%
mutate(Class = str_extract(Region, '(?<=(Class|System)\\s)\\d+'),
Region = str_remove(Region, ':\\s*(Class|System)\\s*\\d+'))
# ID Region Value Class
#1 1 Europe 8 <NA>
#2 2 Europe 6 1
#3 3 Asia 6 2
#4 4 North America 7 <NA>
#5 5 Europe 5 1
#6 6 Africa 7 <NA>
#7 7 Africa 5 2
#8 8 South America 9 <NA>
#9 9 Europe 3 1
#10 10 Europe 7 <NA>
str_extract extracts the number which comes after 'Class'
or 'System'. If these words are not present then it returns NA.
str_remove removes colon followed by zero or more whitespace (\\s*) followed by either 'Class' or 'System' and a number (\\d+).
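Since the question asks about separate specifically: its sep argument is interpreted as a regular expression, so both keywords can be handled in one call. The following is a sketch of that idea, assuming the tidyr package is installed; the pattern is my own and is not taken from the answers above.

```r
library(tidyr)

df <- data.frame(
  ID = 1:10,
  Region = c("Europe", "Europe: Class 1", "Asia: System 2", "North America",
             "Europe: System 1", "Africa", "Africa: Class 2", "South America",
             "Europe: System 1", "Europe"),
  Value = c(8L, 6L, 6L, 7L, 5L, 7L, 5L, 9L, 3L, 7L),
  stringsAsFactors = FALSE
)

out <- separate(
  df, Region,
  into = c("Region", "Class"),
  sep = ":\\s*(Class|System)\\s*", # split on ": Class " or ": System "
  fill = "right",                  # rows without a separator get NA in Class
  convert = TRUE                   # turn the extracted digits into integers
)

out
```

The new Class column is placed directly after Region, which matches the column order in the expected output.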
data
It is easier to help if you provide data in a reproducible format which is easier to copy.
df <- structure(list(ID = 1:10, Region = c("Europe", "Europe: Class 1",
"Asia: System 2", "North America", "Europe: System 1", "Africa",
"Africa: Class 2", "South America", "Europe: System 1", "Europe"
), Value = c(8L, 6L, 6L, 7L, 5L, 7L, 5L, 9L, 3L, 7L)),
class = "data.frame", row.names = c(NA, -10L))

Joining two dataframes to plot a map with ggplot2

I want to make a world map visualization using a data frame, which looks like this:
Country Year Sex Age Suicides Population Suicides_per_100k Country_Year HDI/Year Year_GDP
1 Albania 1987 Male 15-24 years 21 312900 6.71 Albania1987 NA 2156624900
2 Albania 1987 Male 35-54 years 16 308000 5.19 Albania1987 NA 2156624900
3 Albania 1987 Female 15-24 years 14 289700 4.83 Albania1987 NA 2156624900
4 Albania 1987 Male 75+ years 1 21800 4.59 Albania1987 NA 2156624900
5 Albania 1987 Male 25-34 years 9 274300 3.28 Albania1987 NA 2156624900
6 Albania 1987 Female 75+ years 1 35600 2.81 Albania1987 NA 2156624900
GDP_Per_Capita Generation Continent
1 796 Generation X Europe
2 796 Silent Europe
3 796 Generation X Europe
4 796 G.I. Generation Europe
5 796 Boomers Europe
6 796 G.I. Generation Europe
I tried to use the following code:
world <- ggplot2::map_data('world')
worldstart <- left_join(df,world,by = c("Country"="region"))
This code created a new dataframe with 14 million observations.
But, I'd like to keep the same number of the dataset "df".
What is the best approach?
Indeed, the map_data function returns one row for each point of each polygon in the world (~10k rows), so the join repeats every row of df once per matching point. You cannot choose which point to keep.
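The row count after such a join is, for each country, the number of data rows times the number of outline points. A small sketch of that effect, using six rows from the question and a hypothetical four-point outline (the coordinates are placeholders, not real map data); outline, per_country and rate_per_100k are illustrative names.

```r
library(dplyr)

# Six demographic rows for one country (values from the question)
df <- tibble(
  Country = "Albania",
  Suicides = c(21, 16, 14, 1, 9, 1),
  Population = c(312900, 308000, 289700, 21800, 274300, 35600)
)

# Hypothetical outline with four points; placeholder coordinates
outline <- tibble(
  region = "Albania",
  long = c(19, 21, 21, 19),
  lat = c(40, 40, 42, 42)
)

# Joining directly pairs every data row with every outline point: 6 * 4 rows.
# Recent dplyr versions may warn about a many-to-many relationship here.
n_direct <- nrow(left_join(df, outline, by = c("Country" = "region")))

# Collapsing to one row per country first keeps the join at 1 * 4 rows
per_country <- df %>%
  group_by(Country) %>%
  summarise(rate_per_100k = 1e5 * sum(Suicides) / sum(Population))

joined <- left_join(per_country, outline, by = c("Country" = "region"))

c(direct = n_direct, aggregated = nrow(joined))
```

Whichever map representation is used, reducing the data to one row per country (for a given year or summary) before joining avoids most of the blow-up.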
You can use the sf library to go around this difficulty, keeping the geometry (here multipolygons) on one side and your data on the other.
My proposal would be the following :
library(dplyr)
library(sf)
library(ggplot2)
df <- tibble(Country = "Albania",
GDP_per_Capita = 796)
world <- maps::map('world', plot = FALSE, fill = TRUE) %>% st_as_sf(stringsAsFactors = FALSE)
world_df <- df %>%
left_join(world, by = c("Country" = "ID"))
In my example, you would have only one row of data, but the geometry column contains all the information needed for plotting.
sf and ggplot2 are well linked so you are good to go.
Best regards

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task
I have a data frame of this type with 99 different countries for thousands of IDs
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable corresponding to a given macroregion of Nationality
so I created a matrix of this kind with the rule to follow
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work, it reports the following error
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me. Applying the same code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
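If installing dplyr is not an option, or memory is tight, a lookup with base R's match is another possibility: it only adds one column and never duplicates rows. The sketch below also upper-cases the keys, because the question's data has "USA" while the rule table has "Usa"; whether that normalisation is appropriate for the real data is an assumption. continent_lookup is an illustrative name for the rule table.

```r
dataset <- data.frame(
  ID = 1:7,
  Nationality = c("Italy", "Eritrea", "Italy", "USA", "France", "France", "Eritrea"),
  stringsAsFactors = FALSE
)

continent_lookup <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "France", "Germany"),
  Continent = c("Europe", "Africa", "America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Position of each Nationality in the lookup table, ignoring case
idx <- match(toupper(dataset$Nationality), toupper(continent_lookup$Nationality))

# Nationalities missing from the lookup table get NA
dataset$Continent <- continent_lookup$Continent[idx]

dataset
```

match uses the first occurrence of each key, so duplicated entries in the lookup table cannot multiply rows the way they can with merge.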

Count origin-destination relationships (without direct) with R

I have a origin-destination table like this.
library(dplyr)
set.seed(1983)
namevec <- c('Portugal', 'Romania', 'Nigeria', 'Peru', 'Texas', 'New Jersey', 'Colorado', 'Minnesota')
## Create OD pairs
df <- tibble(origins = sample(namevec, size = 100, replace = TRUE),
destinations = sample(namevec, size = 100, replace = TRUE))
Question
I got stuck counting the relationships for each origin-destination pair (ignoring direction).
How can I get output that Colorado-Minnesota and Minnesota-Colorado are seen as one group?
What I have tried so far:
## Counts for each OD-pairs
df %>%
group_by(origins, destinations) %>%
summarize(counts = n()) %>%
ungroup() %>%
arrange(desc(counts))
Source: local data frame [48 x 3]
origins destinations counts
(chr) (chr) (int)
1 Nigeria Colorado 5
2 Colorado Portugal 4
3 New Jersey Minnesota 4
4 New Jersey New Jersey 4
5 Peru Nigeria 4
6 Peru Peru 4
7 Romania Texas 4
8 Texas Nigeria 4
9 Minnesota Minnesota 3
10 Nigeria Portugal 3
.. ... ... ...
One way is to combine the sorted combination of the two locations into a single field. Summarizing on that will remove your two original columns, so you'll need to join them back in.
paired <- df %>%
mutate(
orderedpair = paste(pmin(origins, destinations), pmax(origins, destinations), sep = "::")
)
paired
# # A tibble: 100 × 3
# origins destinations orderedpair
# <chr> <chr> <chr>
# 1 Peru Colorado Colorado::Peru
# 2 Romania Portugal Portugal::Romania
# 3 Romania Colorado Colorado::Romania
# 4 New Jersey Minnesota Minnesota::New Jersey
# 5 Minnesota Texas Minnesota::Texas
# 6 Romania Texas Romania::Texas
# 7 Peru Peru Peru::Peru
# 8 Romania Nigeria Nigeria::Romania
# 9 Portugal Minnesota Minnesota::Portugal
# 10 Nigeria Colorado Colorado::Nigeria
# # ... with 90 more rows
left_join(
paired,
group_by(paired, orderedpair) %>% count(),
by = "orderedpair"
) %>%
select(-orderedpair) %>%
distinct() %>%
arrange(desc(n))
# # A tibble: 48 × 3
# origins destinations n
# <chr> <chr> <int>
# 1 Romania Portugal 6
# 2 New Jersey Minnesota 6
# 3 Portugal Romania 6
# 4 Minnesota New Jersey 6
# 5 Romania Texas 5
# 6 Nigeria Colorado 5
# 7 Texas Nigeria 5
# 8 Texas Romania 5
# 9 Nigeria Texas 5
# 10 Peru Peru 4
# # ... with 38 more rows
(The only reason I used "::" as the separator is in the unlikely event you need to parse orderedpair; using the default " " (space) won't work with (e.g.) "New Jersey" in the mix.)
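If one row per undirected pair is enough, and the original origin and destination columns do not need to be kept, the sorted pair can also be counted directly. This is a sketch of that variation on a small hand-made table; place_a and place_b are illustrative column names.

```r
library(dplyr)

od <- tibble(
  origins      = c("Colorado", "Minnesota", "Peru", "Colorado"),
  destinations = c("Minnesota", "Colorado", "Peru", "Peru")
)

# pmin/pmax put each pair into alphabetical order, so A-B and B-A
# fall into the same group before counting
pair_counts <- od %>%
  count(place_a = pmin(origins, destinations),
        place_b = pmax(origins, destinations),
        sort = TRUE)

pair_counts
```

Here Colorado-Minnesota and Minnesota-Colorado are counted together as a single pair with n = 2.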

How to reshape this complicated data frame?

Here is first 4 rows of my data;
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each Indicator code, and I want each row to correspond to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.
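Outside base R, tidyr's pivot_wider does the same reshaping and keeps NA as NA. A minimal sketch, assuming tidyr is installed and using only the four rows shown in the question with the cleaned-up column names from above:

```r
library(tidyr)

mydata <- data.frame(
  CountryName = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  CountryCode = c("TUR", "TUR", "AFG", "AFG"),
  IndicatorCode = c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS",
                    "NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"),
  X2010 = c(5.675740, 11.9, 9.437322, NA),
  stringsAsFactors = FALSE
)

# One row per country, one column per indicator code
wide <- pivot_wider(
  mydata,
  id_cols = CountryName,
  names_from = IndicatorCode,
  values_from = X2010
)

wide
```

The result has three columns: CountryName plus one column for each indicator code.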
