I have a dataframe with data about the US States.
One of the columns in the df is "Division", which indicates the division each state belongs to ("East North Central", "East South Central", "Middle Atlantic", "Mountain", "New England", "Pacific", "South Atlantic", "West North Central", "West South Central").
I created an array with the average life expectancy for each division, using an existing column called "Life Exp":
avg.life.exp = tapply(df[["Life Exp"]], df$Division, mean, na.rm=TRUE)
This returns the following:
East North Central East South Central Middle Atlantic
70.99000 69.33750 70.63667
Mountain New England Pacific
70.94750 71.57833 71.69400
South Atlantic West North Central West South Central
69.52625 72.32143 70.43500
Now I would like to add a new column to the df with the average life expectancy of each Division. So basically I would like to do a left join, where if the state belonged to East North Central, it would return 70.99000, and so on.
I need to do this without using packages.
Thank you in advance for any help you can provide!
One option would be to use merge:
merge(df, data.frame(Division = names(avg.life.exp), avg.life.exp), all.x = TRUE)
A second option would be to use match:
df$avg.life.exp <- avg.life.exp[match(df$Division, names(avg.life.exp))]
Using the gapminder dataset as example data:
library(gapminder)
# Example data
df <- gapminder[gapminder$year == 2007, c("country", "continent", "lifeExp")]
avg.life.exp <- tapply(df[["lifeExp"]], df$continent, mean, na.rm=TRUE)
avg.life.exp
#> Africa Americas Asia Europe Oceania
#> 54.80604 73.60812 70.72848 77.64860 80.71950
# Using merge
df1 <- merge(df, data.frame(continent = names(avg.life.exp), avg.life.exp), all.x = TRUE)
head(df1)
#> continent country lifeExp avg.life.exp
#> 1 Africa Reunion 76.442 54.80604
#> 2 Africa Eritrea 58.040 54.80604
#> 3 Africa Algeria 72.301 54.80604
#> 4 Africa Congo, Rep. 55.322 54.80604
#> 5 Africa Equatorial Guinea 51.579 54.80604
#> 6 Africa Malawi 48.303 54.80604
# Using match
df$avg.life.exp <- avg.life.exp[match(df$continent, names(avg.life.exp))]
head(df)
#> # A tibble: 6 × 4
#> country continent lifeExp avg.life.exp
#> <fct> <fct> <dbl> <dbl>
#> 1 Afghanistan Asia 43.8 70.7
#> 2 Albania Europe 76.4 77.6
#> 3 Algeria Africa 72.3 54.8
#> 4 Angola Africa 42.7 54.8
#> 5 Argentina Americas 75.3 73.6
#> 6 Australia Oceania 81.2 80.7
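A further base-R option (my addition, not part of the original answer) is ave(), which computes the per-group mean and assigns it in one step:

# Sketch: group means via ave(); same result as the merge/match approaches above
df$avg.life.exp <- ave(df$lifeExp, df$continent,
                       FUN = function(x) mean(x, na.rm = TRUE))

For the state data in the question, the same pattern would be ave(df[["Life Exp"]], df$Division, FUN = function(x) mean(x, na.rm = TRUE)).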
I am trying to create a distance matrix from a dataframe. The dataframe is like so:
i<-c("South Korea", "South Korea", "France", "France","France")
j <-c("Rwanda", "France", "Rwanda", "South Korea","France")
distance <-c(10844.6822,9384,6003,9384,0)
dis_matrix<-data.frame(i,j,distance)
dis_matrix
            i           j distance
1 South Korea      Rwanda 10844.68
2 South Korea      France  9384.00
3      France      Rwanda  6003.00
4      France South Korea  9384.00
5      France      France     0.00
I am trying to create a matrix that will look like this:
South Korea France Rwanda
South Korea 0 9384.1793 10844.6822
France 9384.1793 0 6003.3498
Rwanda 10844.6822 6003.3498 0
I have tried using sparseMatrix() from the Matrix package as described here (Create sparse matrix from data frame).
The issue is that the i and j have to be integers, and I have character strings. I am unable to find another function that does what I am looking for. I would appreciate any help. Thank you
A possible solution:
tidyr::pivot_wider(dis_matrix, id_cols = i, names_from = j,
values_from = distance, values_fill = 0)
#> # A tibble: 2 × 4
#> i Rwanda France `South Korea`
#> <chr> <dbl> <dbl> <dbl>
#> 1 South Korea 10845. 9384 0
#> 2 France 6003 0 9384
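If an actual matrix object is needed rather than a tibble, a small follow-up sketch (my addition): move the i column into row names. Note that, as above, Rwanda only shows up as a column; the answers below build the full symmetric matrix.

wide <- tidyr::pivot_wider(dis_matrix, id_cols = i, names_from = j,
                           values_from = distance, values_fill = 0)
m <- as.matrix(wide[, -1])  # drop the i column and convert to a numeric matrix
rownames(m) <- wide$i
m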
You can use igraph::get.adjacency to create the desired matrix. You can also create a sparse matrix with sparse = TRUE.
library(igraph)
g <- graph.data.frame(dis_matrix, directed = FALSE)
get.adjacency(g, attr="distance", sparse = FALSE)
South Korea France Rwanda
South Korea 0.00 9384 10844.68
France 9384.00 0 6003.00
Rwanda 10844.68 6003 0.00
We may convert the first two columns to factor with levels specified as the unique values from both columns, and then use xtabs from base R
un1 <- unique(unlist(dis_matrix[1:2]))
dis_matrix[1:2] <- lapply(dis_matrix[1:2], factor, levels = un1)
xtabs(distance ~ i + j, dis_matrix)
-output
j
i South Korea France Rwanda
South Korea 0.00 9384.00 10844.68
France 9384.00 0.00 6003.00
Rwanda 0.00 0.00 0.00
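The Rwanda row is all zeros because Rwanda never appears in column i. A small follow-up sketch (my addition, assuming non-negative distances): symmetrize the table by taking the elementwise maximum of the table and its transpose.

out <- xtabs(distance ~ i + j, dis_matrix)
pmax(out, t(out))  # copy each distance into its mirrored, currently empty cell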
Overview
I am analyzing incidents of protest in a dataset in which each observation indicates a single protest. Each observation has information about the date, country, and protest group that participated. I am using R.
Data
The data look like this:
Date Country Group
---------- ----------- ------------
7/1/2015 Algeria Labour Union
7/10/2015 Algeria Labour Union
9/15/2015 Algeria Labour Union
9/9/2016 Benin Political Party
10/1/2016 Benin Political Party
10/2/2016 Benin Political Party
10/3/2016 Benin Political Party
Objective
I want to do two things:
First, I am trying to create a variable that tracks the cumulative number of protests that each group has performed.
Second, I am trying to count the number of days between events per group.
I want the data to look like this:
Date Country Group Cumul Days
---------- ----------- ------------ --------- ------
7/1/2015 Algeria Labour Union 1 NA
7/10/2015 Algeria Labour Union 2 9
9/15/2015 Algeria Labour Union 3 67
9/9/2016 Benin Political Party 1 NA
10/1/2016 Benin Political Party 2 22
10/2/2016 Benin Political Party 3 1
10/3/2016 Benin Political Party 4 1
Simply put, I have no idea where to start. Any help would be appreciated!
An option would be to group by 'Country' and 'Group', create 'Cumul' as the sequence of rows, and compute 'Days' by taking the diff of 'Date' after converting it to Date class.
library(dplyr)
library(lubridate)
df1 %>%
group_by(Country, Group) %>%
mutate(Cumul = row_number(), Days = c(NA, diff(mdy(Date))))
# A tibble: 7 x 5
# Groups: Country, Group [2]
# Date Country Group Cumul Days
# <chr> <chr> <chr> <int> <dbl>
#1 7/1/2015 Algeria Labour Union 1 NA
#2 7/10/2015 Algeria Labour Union 2 9
#3 9/15/2015 Algeria Labour Union 3 67
#4 9/9/2016 Benin Political Party 1 NA
#5 10/1/2016 Benin Political Party 2 22
#6 10/2/2016 Benin Political Party 3 1
#7 10/3/2016 Benin Political Party 4 1
or with data.table
library(data.table)
setDT(df1)[, c("Cumul", "Days") := .(seq_len(.N),
    c(NA, diff(as.IDate(Date, "%m/%d/%Y")))), by = .(Country, Group)]
data
df1 <- structure(list(Date = c("7/1/2015", "7/10/2015", "9/15/2015",
"9/9/2016", "10/1/2016", "10/2/2016", "10/3/2016"), Country = c("Algeria",
"Algeria", "Algeria", "Benin", "Benin", "Benin", "Benin"), Group = c("Labour Union",
"Labour Union", "Labour Union", "Political Party", "Political Party",
"Political Party", "Political Party")), class = "data.frame", row.names = c(NA,
-7L))
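For completeness, a base-R sketch (my addition, assuming the df1 defined above and that rows are already in date order within each group):

# Base R: cumulative protest count and day gaps per Country/Group
df1$Date2 <- as.Date(df1$Date, "%m/%d/%Y")
df1$Cumul <- ave(rep(1, nrow(df1)), df1$Country, df1$Group, FUN = cumsum)
df1$Days <- ave(as.numeric(df1$Date2), df1$Country, df1$Group,
                FUN = function(x) c(NA, diff(x)))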
I am having some problems with the following task.
I have a data frame of this type, with 99 different countries for thousands of IDs:
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable corresponding to a given macroregion of Nationality,
so I created a matrix of this kind with the rule to follow:
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work; it reports the following error:
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me, and applying the code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function (from the dplyr package):
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
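As for the original "cannot allocate vector" error: merge itself is usually fine here, and one common cause of memory blow-ups is a lookup table with duplicated Nationality rows, which turns the merge into a many-to-many join. A hedged base-R sketch (assuming the question's dataset and continent.matrix objects share a Nationality column):

# Deduplicate the lookup first, then left-join on Nationality only
continent.matrix <- unique(continent.matrix)
datasubset <- merge(dataset, continent.matrix, by = "Nationality", all.x = TRUE)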
This seems like a simple problem, but I'm having trouble wrapping my mind around it. I have a data frame of locations with population by region of birth, and I'm trying to filter for the regions whose combined population exceeds a threshold—in this case, 50%.
For example, for each location I need to be able to say something like, "In Fairfield County, a majority of the foreign-born population were born in Central and South America or the Caribbean." To be able to phrase it that way, I need to include the first country that gets over the 50% mark.
An abridged version of my data, along with the first few rows for each location, is here:
library(tidyverse)
df <- structure(list(name = c("Fairfield County", "Fairfield County",
"Fairfield County", "Fairfield County", "Greater Hartford", "Greater Hartford",
"Greater Hartford", "Greater Hartford", "Greater Hartford"),
subregion = c("South America", "Central America", "Caribbean",
"South Central Asia", "Caribbean", "Eastern Europe", "South Central Asia",
"South America", "Southern Europe"),
pop = c(40565, 33919, 32044, 17031, 26939, 23765, 20153, 14384, 9309),
cum_share = c(0.2, 0.38, 0.54, 0.62, 0.2, 0.37, 0.51, 0.62, 0.69)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L))
df %>%
group_by(name) %>%
top_n(4, pop)
#> # A tibble: 8 x 4
#> # Groups: name [2]
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Fairfield County Caribbean 32044 0.54
#> 4 Fairfield County South Central Asia 17031 0.62
#> 5 Greater Hartford Caribbean 26939 0.2
#> 6 Greater Hartford Eastern Europe 23765 0.37
#> 7 Greater Hartford South Central Asia 20153 0.51
#> 8 Greater Hartford South America 14384 0.62
My first plan was to filter for rows where the cumulative share was less than or equal to 51%, meaning the top-ranking regions up to a majority of the population. The problem is that these shares don't form a continuous distribution, so a fixed cutoff like this doesn't work: I need to include the first region at which the cumulative share reaches a majority.
df %>%
filter(cum_share <= 0.51)
#> # A tibble: 5 x 4
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Greater Hartford Caribbean 26939 0.2
#> 4 Greater Hartford Eastern Europe 23765 0.37
#> 5 Greater Hartford South Central Asia 20153 0.51
As you can see by comparing to the first snapshot, Greater Hartford works as I'd expect. But Fairfield County should include the Caribbean, at which the cumulative share is 54%; by filtering with a set threshold of 51%, Caribbean isn't included. What I'd like to get is instead like this:
#> # A tibble: 6 x 4
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Fairfield County Caribbean 32044 0.54
#> 4 Greater Hartford Caribbean 26939 0.2
#> 5 Greater Hartford Eastern Europe 23765 0.37
#> 6 Greater Hartford South Central Asia 20153 0.51
Here, the first place at which the share exceeds 50% is also included. I could filter manually, but I'm actually doing this by country, not region of the world, and for 18 locations, so it becomes unwieldy.
Thanks in advance!
Edit: Wow, I'm realizing my own foolishness—I could have calculated cumulative shares from populations in ascending order, not descending, and then easily filtered for where this threshold exceeds 50%. I'll leave this up, though, to help out someone who doesn't have control over their data in this way.
For example, for each location I need to be able to say something like, "In Fairfield County, a majority of the foreign-born population were born in Central and South America or the Caribbean."
For the general case of stopping after a condition is first met, there's filter(cumsum(lag(cond, default = FALSE)) == 0):
> df %>% group_by(name) %>% filter(cumsum(lag(cum_share > 0.5, default = FALSE)) == 0)
# A tibble: 6 x 4
# Groups: name [2]
name subregion pop cum_share
<chr> <chr> <dbl> <dbl>
1 Fairfield County South America 40565 0.20
2 Fairfield County Central America 33919 0.38
3 Fairfield County Caribbean 32044 0.54
4 Greater Hartford Caribbean 26939 0.20
5 Greater Hartford Eastern Europe 23765 0.37
6 Greater Hartford South Central Asia 20153 0.51
The OP identified a simpler filter for the case of a monotone condition (i.e., one such that after first meeting the condition, later elements of the vector also do so): filter(lag(cum_share, default = 0) <= 0.5).
There's probably a good way to wrap this in a function (mutate .cond from user input; mutate a .keep criterion cumsum(lag(.cond, default = FALSE)) == 0; filter; drop .cond and .keep), but I don't have the tidyverse NSE skills for the first step.
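For what it's worth, here is a sketch of such a wrapper (my addition, using dplyr's embracing operator {{ }} rather than the mutate/drop route described above), followed by a usage example on the question's df:

library(dplyr)

# Sketch: keep rows up to and including the first one where `cond` becomes TRUE,
# within each existing group
filter_until <- function(data, cond) {
  data %>% filter(cumsum(lag({{ cond }}, default = FALSE)) == 0)
}

df %>% group_by(name) %>% filter_until(cum_share > 0.5)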
I've been given a set of country groups and I'm trying to get a set of mutually exclusive regions so that I can compare them. The problem is that my data contains several groups, many of which overlap. How can I get a set of groups which contain all countries, but do not overlap with each other?
For example, assume that this is the list of countries in the world:
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")
Assume that this is my set of groups:
df <- data.frame(group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania", "Oceania", "Commonwealth Countries"),
element = c("Angola", "France", "Germany", "France", "Australia", "New Zealand", "Australia"))
group element
1 Africa Angola
2 Western Europe France
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
7 Commonwealth Countries Australia
How could I remove overlapping groups (in this case Western Europe) to get a set of groups that contains all countries like the following:
df_solved <- data.frame(group = c("Africa", "Europe", "Europe", "Oceania", "Oceania"),
element = c("Angola", "France", "Germany", "Australia", "New Zealand"))
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
One possible rule could be to minimize the number of groups, e.g., to associate each element with the group that includes the most elements.
library(data.table)
setDT(df)[, n.elements := .N, by = group][
order(-n.elements), .(group = group[1L]), by = element]
element group
1: Germany Europe
2: France Europe
3: Australia Oceania
4: New Zealand Oceania
5: Angola Africa
Explanation
setDT(df)[, n.elements := .N, by = group][]
returns
group element n.elements
1: Africa Angola 1
2: Western Europe France 1
3: Europe Germany 2
4: Europe France 2
5: Oceania Australia 2
6: Oceania New Zealand 2
7: Commonwealth Countries Australia 1
Now, the rows are ordered by decreasing number of elements and for each country the first, i.e., the "largest", group is picked. This should return a group for each country as requested.
In case of ties, i.e., when two groups contain equally many elements, you can add additional criteria when ordering, e.g., the length of the group name, or just alphabetical order.
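For instance, a sketch of one possible tie-break (my addition): after group size, prefer shorter group names, then alphabetical order.

setDT(df)[, n.elements := .N, by = group][
  order(-n.elements, nchar(as.character(group)), as.character(group)),
  .(group = group[1L]), by = element]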
1) If you want to simply eliminate duplicate elements then use !duplicated(...) as shown. No packages are used.
subset(df, !duplicated(element))
giving:
group element
1 Africa Angola
2 Western Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
2) set partitioning If each group must be wholly in or wholly out and each element may only appear once then this is a set partitioning problem:
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, "=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
3) set covering Of course there may be no exact set partition, so we could consider the set covering problem (same code except "=" is replaced by ">=" in the lp line).
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, ">=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
and we could optionally then apply (1) to remove any duplicates in the cover.
4) Non-dominated groups Another approach is to remove any group whose elements form a strict subset of the elements of some other group. For example, every element in Western Europe is in Europe and Europe has more elements than Western Europe so the elements of Western Europe are a strict subset of the elements of Europe and we remove Western Europe. Using const.mat from above:
# returns TRUE if jth column of const.mat is dominated by some other column
is_dom_fun <- function(j) any(apply(const.mat[, j] <= const.mat[, -j], 2, all) &
sum(const.mat[, j]) < colSums(const.mat[, -j]))
is_dom <- sapply(seq_len(ncol(const.mat)), is_dom_fun)
subset(df, group %in% colnames(const.mat)[!is_dom])
giving:
group element
1 Africa Angola
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
If there are any duplicates left we can use (1) to remove them.
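A sketch of that combination (my addition), reusing const.mat and is_dom from above:

# Keep only non-dominated groups, then drop any remaining duplicate elements as in (1)
kept <- subset(df, group %in% colnames(const.mat)[!is_dom])
subset(kept, !duplicated(element))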
library(dplyr)
df %>% distinct(element, .keep_all=TRUE)
group element
1 Africa Angola
2 Western Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
Shoutout to Axeman for beating me with this answer.
Update
Your question is ill-defined. Why is 'Europe' preferred over 'Western Europe'? Put another way, each country is assigned several groups. You want to reduce it to one group per country. How do you decide which group?
Here's one way, we always prefer the biggest:
groups <- df %>% count(group)
df %>% inner_join(groups, by='group') %>%
arrange(desc(n)) %>% distinct(element, .keep_all=TRUE)
group element n
1 Europe France 2
2 Europe Germany 2
3 Oceania Australia 2
4 Oceania New Zealand 2
5 Africa Angola 1
Here is one option with data.table
library(data.table)
setDT(df)[, head(.SD, 1), element]
Or with unique
unique(setDT(df), by = 'element')
# group element
#1: Africa Angola
#2: Western Europe France
#3: Europe Germany
#4: Oceania Australia
#5: Oceania New Zealand
A package is used here, and it is data.table.
A completely different approach would be to ignore the given groups and instead look up just the country names in the catalogue of UN regions, which is available in the countrycode or ISOcodes packages.
The countrycode package seems to offer the simpler interface, and it also warns about country names which cannot be found in its database:
# given country names - note the deliberately misspelled last entry
World <- c("Angola", "France", "Germany", "Australia", "New Zealand", "New Sealand")
# regions
countrycode::countrycode(World, "country.name.en", "region")
[1] "Middle Africa" "Western Europe" "Western Europe" "Australia and New Zealand"
[5] "Australia and New Zealand" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "region") :
Some values were not matched unambiguously: New Sealand
# continents
countrycode::countrycode(World, "country.name.en", "continent")
[1] "Africa" "Europe" "Europe" "Oceania" "Oceania" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "continent") :
Some values were not matched unambiguously: New Sealand
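A possible usage sketch (my addition, assuming the question's df with its element column of country names):

# Attach the looked-up continent to each row of df
df$Continent <- countrycode::countrycode(as.character(df$element),
                                          "country.name.en", "continent")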