Filter data by threshold, including first value surpassing threshold - r

This seems like a simple problem, but I'm having trouble wrapping my mind around it. I have a data frame of locations with population by region of birth, and I'm trying to filter for the regions whose combined population exceeds a threshold—in this case, 50%.
For example, for each location I need to be able to say something like, "In Fairfield County, a majority of the foreign-born population were born in Central and South America or the Caribbean." To be able to phrase it that way, I need to include the first country that gets over the 50% mark.
An abridged version of my data, along with the first few rows for each location, is here:
library(tidyverse)
df <- structure(list(name = c("Fairfield County", "Fairfield County",
"Fairfield County", "Fairfield County", "Greater Hartford", "Greater Hartford",
"Greater Hartford", "Greater Hartford", "Greater Hartford"),
subregion = c("South America", "Central America", "Caribbean",
"South Central Asia", "Caribbean", "Eastern Europe", "South Central Asia",
"South America", "Southern Europe"),
pop = c(40565, 33919, 32044, 17031, 26939, 23765, 20153, 14384, 9309),
cum_share = c(0.2, 0.38, 0.54, 0.62, 0.2, 0.37, 0.51, 0.62, 0.69)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L))
df %>%
group_by(name) %>%
top_n(4, pop)
#> # A tibble: 8 x 4
#> # Groups: name [2]
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Fairfield County Caribbean 32044 0.54
#> 4 Fairfield County South Central Asia 17031 0.62
#> 5 Greater Hartford Caribbean 26939 0.2
#> 6 Greater Hartford Eastern Europe 23765 0.37
#> 7 Greater Hartford South Central Asia 20153 0.51
#> 8 Greater Hartford South America 14384 0.62
My first plan was to filter for rows where the cumulative share was less than or equal to 51%, that is, the top-ranking regions up to a majority of the population. The problem is that the shares jump in discrete steps, so a fixed cutoff like this doesn't work: I need to include the first region at which the cumulative share becomes a majority.
df %>%
filter(cum_share <= 0.51)
#> # A tibble: 5 x 4
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Greater Hartford Caribbean 26939 0.2
#> 4 Greater Hartford Eastern Europe 23765 0.37
#> 5 Greater Hartford South Central Asia 20153 0.51
As you can see by comparing to the first snapshot, Greater Hartford works as I'd expect. But Fairfield County should include the Caribbean, at which the cumulative share is 54%; by filtering with a set threshold of 51%, Caribbean isn't included. What I'd like to get is instead like this:
#> # A tibble: 6 x 4
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Fairfield County Caribbean 32044 0.54
#> 4 Greater Hartford Caribbean 26939 0.2
#> 5 Greater Hartford Eastern Europe 23765 0.37
#> 6 Greater Hartford South Central Asia 20153 0.51
Here, the first place at which the share exceeds 50% is also included. I could filter manually, but I'm actually doing this by country, not region of the world, and for 18 locations, so it becomes unwieldy.
Thanks in advance!
Edit: Wow, I'm realizing my own foolishness—I could have calculated cumulative shares from populations in ascending order, not descending, and then easily filtered for where this threshold exceeds 50%. I'll leave this up, though, to help out someone who doesn't have control over their data in this way.

For the general case of stopping after a condition is first met, there's filter(cumsum(lag(cond, default = FALSE)) == 0):
> df %>% group_by(name) %>% filter(cumsum(lag(cum_share > 0.5, default = FALSE)) == 0)
# A tibble: 6 x 4
# Groups: name [2]
name subregion pop cum_share
<chr> <chr> <dbl> <dbl>
1 Fairfield County South America 40565 0.20
2 Fairfield County Central America 33919 0.38
3 Fairfield County Caribbean 32044 0.54
4 Greater Hartford Caribbean 26939 0.20
5 Greater Hartford Eastern Europe 23765 0.37
6 Greater Hartford South Central Asia 20153 0.51
The OP identified a simpler filter for the case of a monotone condition (i.e., one where every element after the first that meets the condition also meets it): filter(lag(cum_share, default = 0) <= 0.5).
There's probably a good way to wrap this in a function (mutate .cond from user input; mutate .keep = cumsum(lag(.cond, default = FALSE)) == 0; filter on .keep; drop .cond and .keep), but I don't have the tidyverse NSE skills for the first step.
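One way the wrapper could look is sketched here, using dplyr's embrace operator `{{ }}` to pass the user's condition through. The helper name filter_through is made up for this example and is not part of dplyr, and the data frame is a trimmed copy of the question's data.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): keep rows up to and including
# the first row where `cond` is TRUE, within each group of a grouped frame.
filter_through <- function(.data, cond) {
  # lag() shifts the condition down one row, so the first TRUE row still has
  # a running count of 0 and is kept; every row after it is dropped.
  filter(.data, cumsum(dplyr::lag({{ cond }}, default = FALSE)) == 0)
}

# Trimmed copy of the question's data (pop column omitted)
df <- data.frame(
  name = rep(c("Fairfield County", "Greater Hartford"), times = c(4, 5)),
  subregion = c("South America", "Central America", "Caribbean",
                "South Central Asia", "Caribbean", "Eastern Europe",
                "South Central Asia", "South America", "Southern Europe"),
  cum_share = c(0.2, 0.38, 0.54, 0.62, 0.2, 0.37, 0.51, 0.62, 0.69),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(name) %>%
  filter_through(cum_share > 0.5) %>%
  ungroup()

print(res)
```

Because the condition is embraced rather than hard-coded, the same helper should accept any expression over the data frame's columns, not only a cumulative-share threshold.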

Related

Build identity matrix from dataframe (sparsematrix) in R

I am trying to create an identity matrix from a dataframe. The dataframe is like so:
i<-c("South Korea", "South Korea", "France", "France","France")
j <-c("Rwanda", "France", "Rwanda", "South Korea","France")
distance <-c(10844.6822,9384,6003,9384,0)
dis_matrix<-data.frame(i,j,distance)
dis_matrix
1 South Korea South Korea 0.0000
2 South Korea Rwanda 10844.6822
3 South Korea France 9384.1793
4 France Rwanda 6003.3498
5 France South Korea 9384.1793
6 France France 0.0000
I am trying to create a matrix that will look like this:
South Korea France Rwanda
South Korea 0 9384.1793 10844.6822
France 9384.1793 0 6003.3498
Rwanda 10844.6822 6003.3498 0
I have tried using sparseMatrix from the Matrix package, as described here (Create sparse matrix from data frame).
The issue is that the i and j have to be integers, and I have character strings. I am unable to find another function that does what I am looking for. I would appreciate any help. Thank you
A possible solution:
tidyr::pivot_wider(dis_matrix, id_cols = i, names_from = j,
values_from = distance, values_fill = 0)
#> # A tibble: 2 × 4
#> i Rwanda France `South Korea`
#> <chr> <dbl> <dbl> <dbl>
#> 1 South Korea 10845. 9384 0
#> 2 France 6003 0 9384
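You can use igraph::get.adjacency to create the desired matrix (newer igraph releases call it as_adjacency_matrix, and graph.data.frame is graph_from_data_frame there). Pass sparse = TRUE to get a sparse matrix instead.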
library(igraph)
g <- graph.data.frame(dis_matrix, directed = FALSE)
get.adjacency(g, attr="distance", sparse = FALSE)
South Korea France Rwanda
South Korea 0.00 9384 10844.68
France 9384.00 0 6003.00
Rwanda 10844.68 6003 0.00
We may convert the first two columns to factor with levels specified as the unique values from both columns, and then use xtabs from base R
un1 <- unique(unlist(dis_matrix[1:2]))
dis_matrix[1:2] <- lapply(dis_matrix[1:2], factor, levels = un1)
xtabs(distance ~ i + j, dis_matrix)
Output:
j
i South Korea France Rwanda
South Korea 0.00 9384.00 10844.68
France 9384.00 0.00 6003.00
Rwanda 0.00 0.00 0.00
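The sparseMatrix attempt from the question can probably be made to work as well, by mapping each country name to an integer position with match(). This is only a sketch: the names countries, dense and sym are introduced here for illustration, and the mirroring step assumes distances are symmetric and non-negative, which the question implies but does not state.

```r
library(Matrix)

i <- c("South Korea", "South Korea", "France", "France", "France")
j <- c("Rwanda", "France", "Rwanda", "South Korea", "France")
distance <- c(10844.6822, 9384, 6003, 9384, 0)

# Shared lookup of every country name, in order of first appearance
countries <- unique(c(i, j))

# match() turns each name into its integer position in `countries`,
# which is the form sparseMatrix() expects for its i and j arguments
m <- sparseMatrix(
  i = match(i, countries),
  j = match(j, countries),
  x = distance,
  dims = c(length(countries), length(countries)),
  dimnames = list(countries, countries)
)

# Assumption: distances are symmetric and non-negative, so a cell missing
# from the data (stored as 0) can be filled from its mirror cell
dense <- as.matrix(m)
sym <- pmax(dense, t(dense))
print(sym)
```

The mirroring is done on a dense copy for simplicity; for a large set of countries it would be worth staying in sparse form instead.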

Assign values from array to dataframe in R

I have a dataframe with data about the US States.
One of the columns in the df is "Division", which tells the location where each state belongs to ("East North Central", "East South Central", "Middle Atlantic", "Mountain", "New England", "Pacific", "South Atlantic", "West North Central", "West South Central").
I created an array with the average life expectancy for each division, using an existing column called "Life Exp":
avg.life.exp = tapply(df[["Life Exp"]], df$Division, mean, na.rm=TRUE)
Which returns the following:
East North Central East South Central Middle Atlantic
70.99000 69.33750 70.63667
Mountain New England Pacific
70.94750 71.57833 71.69400
South Atlantic West North Central West South Central
69.52625 72.32143 70.43500
Now I would like to add a new column to the df with the average life expectancy of each Division. So basically I would like to do a left join, where a state belonging to East North Central would get 70.99000, and so on.
I need to do this without using packages.
Thank you in advance for any help you can provide!
One option would be to use merge:
merge(df, data.frame(Division = names(avg.life.exp), avg.life.exp), all.x = TRUE)
A second option would be to use match
df$avg.life.exp <- avg.life.exp[match(df$Division, names(avg.life.exp))]
Using the gapminder dataset as example data:
library(gapminder)
# Example data
df <- gapminder[gapminder$year == 2007, c("country", "continent", "lifeExp")]
avg.life.exp <- tapply(df[["lifeExp"]], df$continent, mean, na.rm=TRUE)
avg.life.exp
#> Africa Americas Asia Europe Oceania
#> 54.80604 73.60812 70.72848 77.64860 80.71950
# Using merge
df1 <- merge(df, data.frame(continent = names(avg.life.exp), avg.life.exp), all.x = TRUE)
head(df1)
#> continent country lifeExp avg.life.exp
#> 1 Africa Reunion 76.442 54.80604
#> 2 Africa Eritrea 58.040 54.80604
#> 3 Africa Algeria 72.301 54.80604
#> 4 Africa Congo, Rep. 55.322 54.80604
#> 5 Africa Equatorial Guinea 51.579 54.80604
#> 6 Africa Malawi 48.303 54.80604
# Using match
df$avg.life.exp <- avg.life.exp[match(df$continent, names(avg.life.exp))]
head(df)
#> # A tibble: 6 × 4
#> country continent lifeExp avg.life.exp
#> <fct> <fct> <dbl> <dbl>
#> 1 Afghanistan Asia 43.8 70.7
#> 2 Albania Europe 76.4 77.6
#> 3 Algeria Africa 72.3 54.8
#> 4 Angola Africa 42.7 54.8
#> 5 Argentina Americas 75.3 73.6
#> 6 Australia Oceania 81.2 80.7
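A third base R possibility is ave(), which computes the group mean and spreads it back over the rows in one step, without the intermediate array. The sketch below assumes the data resembles R's built-in state.x77 matrix combined with state.division. That is a guess based on the "Life Exp" column name, not something the question confirms.

```r
# Assumed stand-in for the asker's data: built-in US state datasets
df <- data.frame(state.x77, Division = state.division, check.names = FALSE)

# ave() returns one value per row: the mean of "Life Exp" within that row's Division
df$avg.life.exp <- ave(df[["Life Exp"]], df$Division,
                       FUN = function(x) mean(x, na.rm = TRUE))

# Cross-check against the tapply() + lookup approach from the question
avg.by.division <- tapply(df[["Life Exp"]], df$Division, mean, na.rm = TRUE)
lookup <- as.vector(avg.by.division[as.character(df$Division)])

head(df[, c("Division", "Life Exp", "avg.life.exp")])
```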

R- delete the tail word

Can someone show me how to delete the trailing word? Thanks.
from
1 North Africa
2 Algeria
3 Canary Islands (Spain)[153]
4 Ceuta (Spain)[154]
to
1 North Africa
2 Algeria
3 Canary Islands
4 Ceuta
I'm sad with my poor English.
It seems that you want to trim a trailing name in parentheses, along with anything which follows to the end of the string. We can use sub for this purpose:
df <- data.frame(id=c(1:4),
places=c("North Africa", "Algeria", "Canary Islands (Spain)[153]", "Ceuta (Spain)[154]"),
stringsAsFactors=FALSE)
df$places <- sub("\\s*\\(.*\\).*$", "", df$places)
df
id places
1 1 North Africa
2 2 Algeria
3 3 Canary Islands
4 4 Ceuta
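The pattern above needs a parenthesised part to be present. If some entries carry only a bracketed footnote, a slightly looser pattern that cuts at the first "(" or "[" may be useful. This is a sketch: the "Egypt[12]" entry is invented for illustration and does not come from the question.

```r
places <- c("North Africa", "Algeria", "Canary Islands (Spain)[153]",
            "Ceuta (Spain)[154]", "Egypt[12]")  # last entry is hypothetical

# \\s*          optional whitespace before the marker
# (\\(|\\[)     the first opening parenthesis or square bracket
# .*$           everything from there to the end of the string
cleaned <- sub("\\s*(\\(|\\[).*$", "", places)
print(cleaned)
```

Note that this also truncates names that legitimately contain a parenthesis, so it is only appropriate when everything after the first bracket is known to be annotation.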

Conditional cumulative and time series columns in R

Overview
I am analyzing incidents of protest in a dataset in which each observation indicates a single protest. Each observation has information about the date, country, and protest group that participated. I am using R.
Data
The data look like this:
Date Country Group
---------- ----------- ------------
7/1/2015 Algeria Labour Union
7/10/2015 Algeria Labour Union
9/15/2015 Algeria Labour Union
9/9/2016 Benin Political Party
10/1/2016 Benin Political Party
10/2/2016 Benin Political Party
10/3/2016 Benin Political Party
Objective
I want to do two things:
First, I am trying to create a variable that tracks the cumulative number of protests that each group has performed.
Second, I am trying to count the number of days between events per group.
I want the data to look like this:
Date Country Group Cumul Days
---------- ----------- ------------ --------- ------
7/1/2015 Algeria Labour Union 1 NA
7/10/2015 Algeria Labour Union 2 9
7/15/2015 Algeria Labour Union 3 5
9/9/2016 Benin Political Party 1 NA
10/1/2016 Benin Political Party 2 22
10/2/2016 Benin Political Party 3 1
10/3/2016 Benin Political Party 4 1
Simply put, I have no idea where to start. Any help would be appreciated!
One option is to group by 'Country' and 'Group', create 'Cumul' as the row sequence within each group, and compute 'Days' as the diff of 'Date' after converting it to Date class:
library(dplyr)
library(lubridate)
df1 %>%
group_by(Country, Group) %>%
mutate(Cumul = row_number(), Days = c(NA, diff(mdy(Date))))
# A tibble: 7 x 5
# Groups: Country, Group [2]
# Date Country Group Cumul Days
# <chr> <chr> <chr> <int> <dbl>
#1 7/1/2015 Algeria Labour Union 1 NA
#2 7/10/2015 Algeria Labour Union 2 9
#3 9/15/2015 Algeria Labour Union 3 67
#4 9/9/2016 Benin Political Party 1 NA
#5 10/1/2016 Benin Political Party 2 22
#6 10/2/2016 Benin Political Party 3 1
#7 10/3/2016 Benin Political Party 4 1
or with data.table
library(data.table)
# seq_len(.N) gives the running count within each group (.N alone is the group size)
setDT(df1)[, .(Cumul = seq_len(.N), Days = c(NA, diff(as.IDate(Date,
"%m/%d/%Y")))), .(Country, Group)]
data
df1 <- structure(list(Date = c("7/1/2015", "7/10/2015", "9/15/2015",
"9/9/2016", "10/1/2016", "10/2/2016", "10/3/2016"), Country = c("Algeria",
"Algeria", "Algeria", "Benin", "Benin", "Benin", "Benin"), Group = c("Labour Union",
"Labour Union", "Labour Union", "Political Party", "Political Party",
"Political Party", "Political Party")), class = "data.frame", row.names = c(NA,
-7L))
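For anyone who would rather avoid extra packages, the same two columns can be sketched in base R with ave(). This assumes the rows are already in date order within each group, as they are in the sample data. The helper names grp and dates are introduced here only for illustration.

```r
df1 <- data.frame(
  Date = c("7/1/2015", "7/10/2015", "9/15/2015",
           "9/9/2016", "10/1/2016", "10/2/2016", "10/3/2016"),
  Country = c("Algeria", "Algeria", "Algeria",
              "Benin", "Benin", "Benin", "Benin"),
  Group = c("Labour Union", "Labour Union", "Labour Union",
            "Political Party", "Political Party", "Political Party",
            "Political Party"),
  stringsAsFactors = FALSE
)

# One grouping factor per Country/Group pair; drop = TRUE removes unused pairs
grp <- interaction(df1$Country, df1$Group, drop = TRUE)
dates <- as.Date(df1$Date, format = "%m/%d/%Y")

# Running count of protests within each group
df1$Cumul <- ave(seq_along(grp), grp, FUN = seq_along)

# Days since the previous protest in the same group (NA for the first one)
df1$Days <- ave(as.numeric(dates), grp, FUN = function(d) c(NA, diff(d)))

print(df1)
```

If the rows are not sorted, ordering the data frame by group and parsed date first would be needed for the differences to make sense.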

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task
I have a data frame of this type with 99 different countries for thousands of IDs
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable corresponding to a given macroregion of Nationality
so I created a matrix of this kind with the rule to follow
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work, it reports the following error
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me. Applying the code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
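A lighter-weight base R alternative is a match() lookup, which adds the column in place without building a merged copy. One possible cause of the memory error is that merge() falls back to a full cross join when the two tables share no column name, so it may be worth checking names() on both. The sketch below also upper-cases both keys before matching, on the assumption that "USA" in the data and "Usa" in the lookup are meant to be the same country.

```r
dataset <- data.frame(
  ID = 1:7,
  Nationality = c("Italy", "Eritrea", "Italy", "USA",
                  "France", "France", "Eritrea"),
  stringsAsFactors = FALSE
)

continent.matrix <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "France", "Germany"),
  Continent = c("Europe", "Africa", "America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Row of the lookup table that each record points to (NA if no match).
# Assumption: keys differing only in letter case refer to the same country.
idx <- match(toupper(dataset$Nationality), toupper(continent.matrix$Nationality))

dataset$Continent <- continent.matrix$Continent[idx]
print(dataset)
```

Nationalities missing from the lookup table come back as NA, which makes gaps in the mapping easy to spot with is.na().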
