I have a dataframe that lists political mandates with their start and end year. The question I want to answer is "How many people (person_id) have had mandates that overlap in terms of years active?"
I've tried sequencing the active years as vectors, and then grouping and summarising by intersecting the vectors of active years.
My input table:
id  person_id  start_year  end_year
1   6          1987        1992
2   6          1989        1995
mandates_active <- mandates %>% mutate(active_years = map2(start_year, end_year, seq))
I get an additional column with the sequenced active years:
id  person_id  start_year  end_year  active_years
1   6          1987        1992      [1987, 1988, 1989 ...]
2   6          1989        1995      [1989, 1990, 1991 ...]
Then I try to group this by person_id and summarise by intersecting the active years list, but the intersect function needs two arguments, hence this doesn't work:
mandates_test <- mandates_active %>% group_by(person_id) %>% summarise(intersect(active_years))
My output would show which person_ids have had multiple mandates during overlapping years.
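The attempt above fails because intersect() compares exactly two vectors. One possible way to fold it over a list column is Reduce(); the sketch below assumes the columns are named start_year and end_year as in the input table, and the name common_years is made up for illustration. Note that this returns the years shared by all of a person's mandates, which is only the same as "any overlap" when a person has exactly two mandates.

```r
library(dplyr)
library(purrr)

# Input table from the question
mandates <- data.frame(
  id = c(1, 2),
  person_id = c(6, 6),
  start_year = c(1987, 1989),
  end_year = c(1992, 1995)
)

overlaps <- mandates %>%
  # one vector of active years per mandate
  mutate(active_years = map2(start_year, end_year, seq)) %>%
  group_by(person_id) %>%
  # Reduce() applies intersect() pairwise across the list of vectors
  summarise(common_years = list(Reduce(intersect, active_years)))

overlaps$common_years[[1]]
```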
Maybe this works for you. It shows the person_id and the overlapping years in the final result.
data.frame(df %>%
rowwise() %>%
summarize(id, person_id, f = seq(start_year, end_year, 1)) %>%
group_by(person_id) %>%
summarize(overlapping_years = list(f[duplicated(f)])))
person_id overlapping_years
1 6 1989, 1990, 1991, 1992
2 7
3 8 1992
If you just want a sum of years
df %>%
rowwise() %>%
summarize(id, person_id, f = seq(start_year, end_year, 1)) %>%
group_by(person_id) %>%
summarize(overlapping_years = length(f[duplicated(f)]))
# A tibble: 3 × 2
person_id overlapping_years
<dbl> <int>
1 6 4
2 7 0
3 8 1
Extended Data
df <- structure(list(id = c(1L, 2L, 1L, 2L, 1L, 2L), person_id = c(6,
6, 7, 7, 8, 8), start_year = c(1987, 1989, 1987, 1993, 1987,
1992), end_year = c(1992L, 1995L, 1992L, 1995L, 1992L, 1995L)), row.names = c(NA,
6L), class = "data.frame")
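If only the headline number is needed (how many people have overlapping mandates), the years do not have to be expanded at all. A minimal sketch, assuming closed intervals so that a mandate ending in 1992 and one starting in 1992 count as overlapping: sort each person's mandates by start year and check whether any mandate starts before the latest end year seen so far. The names overlap_flags, has_overlap and n_overlapping are illustrative only.

```r
library(dplyr)

# Same values as the extended data above
df <- data.frame(
  id = c(1L, 2L, 1L, 2L, 1L, 2L),
  person_id = c(6, 6, 7, 7, 8, 8),
  start_year = c(1987, 1989, 1987, 1993, 1987, 1992),
  end_year = c(1992, 1995, 1992, 1995, 1992, 1995)
)

overlap_flags <- df %>%
  arrange(person_id, start_year) %>%
  group_by(person_id) %>%
  summarise(
    # a mandate overlaps an earlier one if it starts no later than
    # the latest end year among the mandates before it
    has_overlap = any(start_year <= lag(cummax(end_year)), na.rm = TRUE)
  )

# number of people with at least one overlap
n_overlapping <- sum(overlap_flags$has_overlap)
```

With this data, persons 6 and 8 are flagged and person 7 is not, which agrees with the per-year counts shown above.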
library(dplyr)
library(tidyr)
set.seed(123)
#Data example
start_year <- sample(1957:2003, 12, T)
end_year <- start_year + sample(1:4, 12,T)
data <- data.frame(person_id = 1:12, start_year, end_year)
data
#> person_id start_year end_year
#> 1 1 1987 1990
#> 2 2 1971 1974
#> 3 3 1970 1971
#> 4 4 1959 1963
#> 5 5 1998 1999
#> 6 6 1999 2000
#> 7 7 1993 1994
#> 8 8 1970 1973
#> 9 9 1981 1985
#> 10 10 1982 1984
#> 11 11 1983 1986
#> 12 12 1961 1963
# Unroll the intervals into registers:
pers_years <- data %>% rowwise() %>%
mutate(years = list(start_year:end_year)) %>%
unnest(years)
Number of distinct persons by year:
pers_years %>% group_by(years) %>%
summarise(persons = n_distinct(person_id))
#> # A tibble: 25 × 2
#> years persons
#> <int> <int>
#> 1 1959 1
#> 2 1960 1
#> 3 1961 2
#> 4 1962 2
#> 5 1963 2
#> 6 1970 2
#> 7 1971 3
#> 8 1972 2
#> 9 1973 2
#> 10 1974 1
#> # … with 15 more rows
Number of distinct persons whose mandates overlap with those of at least one other person in the dataset:
pers_years %>%
inner_join(
pers_years %>% group_by(years) %>%
summarise(persons = n_distinct(person_id)) %>%
filter(persons > 1),
by = "years") %>%
summarise(n_persons = n_distinct(person_id))
#> # A tibble: 1 × 1
#> n_persons
#> <int>
#> 1 10
Related
I have a data frame called "CERS":
CERS = data.frame(IDs = c(seq(1, 10, by = 1)),
city = c("Vienna" , "Bratislava" , "Prague", "Budapest", "Amsterdam", "Turin", "Barcelona", "Bratislava", "Budapest", "Prague" ),
state = c("AT" , "SK" , "CZ", "HU", "NL", "IT", "ES", "SK", "HU", "CZ"),
year = c(seq(2011, 2020, by = 1)))
I would like to count the number of distinct cities and states, within the increasing intervals of years, that is c(2011, 2011:2012, 2011:2013, 2011:2014, 2011:2015, ... 2011:2020) and save the results as a table.
How can I expand the code below to do the job? Answers outside of dplyr are also useful.
library(dplyr)
CERS %>%
filter(year>=2011 & year <= 2015) %>%
summarise(n_dist=n_distinct(city), n_dist_country = n_distinct(state))
You could do:
CERS %>%
arrange(year) %>%
mutate(across(c(city, state), ~ !duplicated(.), .names = "n_dist_{.col}")) %>%
group_by(year) %>%
summarise(across(starts_with("n_dist"), sum)) %>%
ungroup() %>%
mutate(across(starts_with("n_dist"), cumsum))
year n_dist_city n_dist_state
<dbl> <int> <int>
1 2011 1 1
2 2012 2 2
3 2013 3 3
4 2014 4 4
5 2015 5 5
6 2016 6 6
7 2017 7 7
8 2018 7 7
9 2019 7 7
10 2020 7 7
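The idea behind this answer is that a value contributes to the running distinct count only the first time it appears. A base R sketch of the same logic for the city column, using the CERS data from the question (the intermediate names is_new_city and new_per_year are made up for illustration):

```r
CERS <- data.frame(
  IDs = 1:10,
  city = c("Vienna", "Bratislava", "Prague", "Budapest", "Amsterdam",
           "Turin", "Barcelona", "Bratislava", "Budapest", "Prague"),
  state = c("AT", "SK", "CZ", "HU", "NL", "IT", "ES", "SK", "HU", "CZ"),
  year = 2011:2020
)

# Sort by year so "first appearance" means earliest year
CERS <- CERS[order(CERS$year), ]

# TRUE the first time each city is seen, FALSE for repeats
is_new_city <- !duplicated(CERS$city)

# New cities per year, then a running total across years
new_per_year <- tapply(is_new_city, CERS$year, sum)
n_dist_city <- cumsum(new_per_year)
n_dist_city
```

The tapply() step is there so the sketch also works when a year has several rows.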
Count the number of distinct city/state combinations in the 2011 to y range, where y varies from 2011 to 2020:
data.frame(
year = 2011:2020,
n = sapply(2011:2020, \(y) CERS %>% filter(between(year,2011,y)) %>% with(n_distinct(city,state)))
)
Output:
year n
1 2011 1
2 2012 2
3 2013 3
4 2014 4
5 2015 5
6 2016 6
7 2017 7
8 2018 7
9 2019 7
10 2020 7
I have a dataframe that looks like
country  sector  data1  data2
France   1       7      .
France   2       10     .
belgium  1       12     7
belgium  2       14     8
I want to drop columns that are missing for a country in all sectors. In this example I would like to drop/exclude data2 because it is missing for sectors 1 and 2 for France. To be clear, I would also be throwing out the values of data2 for belgium in this example.
My expected output would look like
country  sector  data1
France   1       7
France   2       10
belgium  1       12
belgium  2       14
data2 is now excluded because it had a complete set of missing values for all sectors in France.
We can group by country, create logical columns that flag whether the number of NA elements equals the group size, ungroup, replace the flagged columns with NA, and then drop those columns in select:
library(dplyr)
library(stringr)
df1 %>%
group_by(country) %>%
mutate(across(everything(), ~ sum(is.na(.x)) == n(),
.names = "{.col}_lgl")) %>%
ungroup %>%
mutate(across(names(df1)[-1], ~ if(any(get(str_c(cur_column(),
"_lgl")) )) NA else .x)) %>%
select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))
-output
# A tibble: 4 × 3
country sector data1
<chr> <int> <int>
1 France 1 7
2 France 2 10
3 belgium 1 12
4 belgium 2 14
If we don't use group_by, the steps can be simplified as shown in Maël's post, i.e. do the grouping with a base R function within select; either tapply or ave can work:
df1 %>%
select(where(~ !any(tapply(is.na(.x), df1[["country"]],
FUN = all))))
data
df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, NA, 14L), data2 = c(NA,
NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")
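To see what the tapply() version is doing, it can help to run the check on a single column first. The sketch below uses the values from the question (data2 missing for both French sectors) under the name df2; all_na_by_country and drop_col are illustrative names, not part of the answer above.

```r
df2 <- data.frame(
  country = c("France", "France", "belgium", "belgium"),
  sector = c(1L, 2L, 1L, 2L),
  data1 = c(7L, 10L, 12L, 14L),
  data2 = c(NA, NA, 7L, 8L)
)

# For one column: is it entirely NA within each country?
all_na_by_country <- tapply(is.na(df2$data2), df2$country, all)
all_na_by_country

# Same check for every column: drop it if any country is entirely NA
drop_col <- vapply(
  df2,
  function(x) any(tapply(is.na(x), df2$country, all)),
  logical(1)
)
result <- df2[, !drop_col, drop = FALSE]
result
```

Using vapply() with logical(1) instead of sapply() is a small design choice so that the result is guaranteed to be a logical vector.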
In base R:
df1 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 2
belgium 1 12 7
belgium 2 14 8")
df2 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 NA
belgium 1 12 7
belgium 2 14 8")
df1[!sapply(df1, \(x) any(ave(x, df1$country, FUN = \(y) all(is.na(y)))))]
# country sector data1 data2
# 1 France 1 7 NA
# 2 France 2 10 2
# 3 belgium 1 12 7
# 4 belgium 2 14 8
df2[!sapply(df2, \(x) any(ave(x, df2$country, FUN = \(y) all(is.na(y)))))]
# country sector data1
# 1 France 1 7
# 2 France 2 10
# 3 belgium 1 12
# 4 belgium 2 14
Note: \(x) is shorthand for function(x) (available from R 4.1.0).
For a base R solution, you can use the apply family on the column names and keep only the columns that contain no NA at all (note that this is stricter than dropping columns that are entirely missing for one country):
keep_remove <- sapply(names(data), \(x) all(!is.na(data[[x]])))
data <- data[, keep_remove]
My dataframe consists of monthly weather data as follows for a given location
set.seed(123)
dat <-
data.frame(Year = rep(1980:1985, each = 12),
Month = rep(1:12, times = 6),
value = runif(12*6))
I have split the year into seasons as shown below.
s1 <- c(11, 12, 1, 2) # season 1 consists of month 11, 12, 1 and 2 i.e. cuts across years
s2 <- c(3, 4, 5) # season 2 consists of month 3, 4, 5
s3 <- c(6, 7, 8, 9, 10) # season 3 consists of month 6, 7, 8, 9, 10
Taking example for 1980 -
season 1 is Nov-Dec from 1979 and Jan-Feb from 1980
season 2 is from March - May of 1980
season 3 is June - Oct of 1980
However, for year 1980, season 1 is incomplete since it only has months 1 and 2 and is missing months 11 and 12 from 1979.
In contrast, for year 1985, seasons 1 to 3 are complete, so I do not need months 11 and 12 from 1985, since they contribute to season 1 of 1986.
With this background, I want to sum the monthly values of each season by year, so that the dataframe is in year X season format instead of year-month format.
In doing so there will be no values for 1980 season 1, since it has missing months.
For seasons whose months cut across years, I don't know how to sum the individual months.
library(dplyr)
season_list <- list(s1, s2, s3)
temp_list <- list()
for(s in seq_along(season_list)){
season_ref <- unlist(season_list[s])
if(sum(diff(season_ref) < 0) != 0){ # check if season cuts across years
dat %>%
dplyr::filter(Month %in% season_ref) %>%
# how do I sum across years for this exception
} else {
# if season does not cut across years, simply filter the months in each year and add
temp_list[[s]] <-
dat %>%
dplyr::filter(Month %in% season_ref) %>%
dplyr::group_by(Year) %>%
dplyr::summarise(season_value = sum(value)) %>%
dplyr::mutate(season = s)
}
}
Assuming that you want to sum the values for each season, calculate the Season and endYear (the year in which the season ends) and then sum by those.
dat %>%
group_by(endYear = Year + (Month %in% 11:12),
Season = 1 * (Month %in% s1) +
2 * (Month %in% s2) +
3 * (Month %in% s3)) %>%
summarize(value = sum(value), .groups = "drop")
giving:
# A tibble: 19 x 3
endYear Season value
<int> <dbl> <dbl>
1 1980 1 1.08
2 1980 2 2.23
3 1980 3 2.47
4 1981 1 2.66
5 1981 2 1.25
6 1981 3 2.91
7 1982 1 3.00
8 1982 2 1.43
9 1982 3 3.50
10 1983 1 1.48
11 1983 2 0.693
12 1983 3 1.49
13 1984 1 1.82
14 1984 2 1.29
15 1984 3 1.77
16 1985 1 2.03
17 1985 2 1.47
18 1985 3 3.31
19 1986 1 1.38
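The output above still contains the two incomplete seasons (1980 season 1 and 1986 season 1, each built from only two months). If those should be dropped, as the question asks, one possible extension is to count the months in each group and keep only groups with the full number. This is a sketch on top of the answer, not part of it; season_len and n_months are made-up names.

```r
library(dplyr)

set.seed(123)
dat <- data.frame(Year = rep(1980:1985, each = 12),
                  Month = rep(1:12, times = 6),
                  value = runif(12 * 6))

s1 <- c(11, 12, 1, 2)
s2 <- c(3, 4, 5)
s3 <- c(6, 7, 8, 9, 10)

# expected number of months in seasons 1, 2 and 3
season_len <- c(length(s1), length(s2), length(s3))

seasonal <- dat %>%
  mutate(endYear = Year + (Month %in% 11:12),   # Nov/Dec belong to next year's season 1
         Season = 1 * (Month %in% s1) +
                  2 * (Month %in% s2) +
                  3 * (Month %in% s3)) %>%
  group_by(endYear, Season) %>%
  summarise(value = sum(value), n_months = n(), .groups = "drop") %>%
  # keep only seasons for which every month is present
  filter(n_months == season_len[Season]) %>%
  select(-n_months)
```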
I have a data frame which has distances from a unit's centroid to different points. The points are identified by numbers, and I am trying to obtain a new column with the distance to the closest object.
So the data frame looks like this:
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
centroid_distance_1 <- c(220.3, 220.3, 515.6, NA, 200.2, 22)
centroid_distance_2 <- c(520, 520, 24.3, NA , NA, 51.8)
centroid_distance_3 <- c(NA, 12.8, 124.2, NA, NA, 18.8)
centroid_distance_4 <- c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
sample2 <- data.frame(FID, year, centroid_distance_1, centroid_distance_2, centroid_distance_3, centroid_distance_4)
sample2
FID year centroid_distance_1 centroid_distance_2 centroid_distance_3 centroid_distance_4
1 12 1990 220.3 520.0 NA 725.3
2 12 1994 220.3 520.0 12.8 725.3
3 14 1983 515.6 24.3 124.2 44.2
4 15 1953 NA NA NA NA
5 17 1957 200.2 NA NA 62.9
6 18 2000 22.0 51.8 18.8 217.9
FID is an identifier of each unit and year is a year indicator. Each row is a FID*year pair. centroid_distance_x is the row's distance between its centroid and the object x. This is a small sample of the data frame, which contains many more columns and rows.
What I am looking for is something like this:
short_distance <- c(220.3, 12.8, 24.3, NA, 62.9,18.8)
unit <- c(1, 3, 2, NA, 4, 3)
ideal.df <- data.frame(FID, year, short_distance, unit)
ideal.df
FID year short_distance unit
1 12 1990 220.3 1
2 12 1994 12.8 3
3 14 1983 24.3 2
4 15 1953 NA NA
5 17 1957 62.9 4
6 18 2000 18.8 3
Basically, I add one column named short_distance, which is the lowest value a row takes across all the centroid_distance_* columns above, and one named unit, which identifies the object to which each row has the smallest distance (so if a row has its smallest value in centroid_distance_1, unit takes the value 1).
I have tried a bunch of things with dplyr and pivot and re-pivoting the dataframe but I'm really not getting there.
Thanks a lot for the help!
Another solution based in the tidyverse - using pivot_longer - could look as follows.
library(dplyr)
library(tidyr)
library(stringr)
sample2 %>%
pivot_longer(-c(FID, year)) %>%
group_by(year, FID) %>%
slice_min(value, n = 1, with_ties = FALSE) %>%
mutate(unit = str_sub(name, -1)) %>%
select(-name, short_distance = value)
# Groups: year, FID [6]
# FID year short_distance unit
# <dbl> <dbl> <dbl> <chr>
# 1 15 1953 NA 1
# 2 17 1957 62.9 4
# 3 14 1983 24.3 2
# 4 12 1990 220. 1
# 5 12 1994 12.8 3
# 6 18 2000 18.8 3
My first couple of attempts at this weren't working like I imagined, either - couldn't always get the NA behavior you want - but here's one that works:
library(dplyr)
library(reshape2) # Or use tidyr if you prefer
sample2 %>%
# Melt/unpivot to one value per row
melt(id.vars = c("FID", "year")) %>%
# Extract the unit number
mutate(
unit = sub(x = variable,
pattern = "^centroid_distance_",
replacement = "")
) %>%
group_by(FID, year) %>% # Group by FID and year to get one row of output for each
arrange(value) %>% # Put smallest distance at the top of each group
slice_head(n = 1) # Take one row from the top of each group
Base R solution
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
centroid_distance_1 <- c(220.3, 220.3, 515.6, NA, 200.2, 22)
centroid_distance_2 <- c(520, 520, 24.3, NA , NA, 51.8)
centroid_distance_3 <- c(NA, 12.8, 124.2, NA, NA, 18.8)
centroid_distance_4 <- c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
sample2 <- data.frame(FID, year, centroid_distance_1, centroid_distance_2, centroid_distance_3, centroid_distance_4)
Apply function min for each row and add it to the data frame as column short_distance. Ignore the warning and handle it in the next operation.
sample2$short_distance <- apply(sample2[,3:6], 1, min, na.rm = TRUE)
#> Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
sample2$short_distance[is.infinite(sample2$short_distance)] <- NA # Change `Inf` created by the `min` function to `NA`
Get units with which.min. ifelse is required because which.min returns a zero-length result for rows that are entirely NA.
sample2$unit <- apply(sample2[,3:6], 1, function(x) ifelse(length(which.min(x)) == 0, NA, which.min(x)))
Keep only relevant columns
sample2 <- sample2[, c(1,2,7,8)]
sample2
#> FID year short_distance unit
#> 1 12 1990 220.3 1
#> 2 12 1994 12.8 3
#> 3 14 1983 24.3 2
#> 4 15 1953 NA NA
#> 5 17 1957 62.9 4
#> 6 18 2000 18.8 3
Created on 2021-01-18 by the reprex package (v0.3.0)
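A variant of the same base R idea that avoids the Inf warning: find the position of the minimum first, then look the value up by matrix indexing, so all-NA rows fall through to NA on their own. This is only a sketch; safe_which_min, dist_cols and m are hypothetical helper names, and it assumes the distance columns all start with centroid_distance_ and are numbered in column order.

```r
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
sample2 <- data.frame(
  FID, year,
  centroid_distance_1 = c(220.3, 220.3, 515.6, NA, 200.2, 22),
  centroid_distance_2 = c(520, 520, 24.3, NA, NA, 51.8),
  centroid_distance_3 = c(NA, 12.8, 124.2, NA, NA, 18.8),
  centroid_distance_4 = c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
)

# which.min() returns integer(0) when every value is NA; map that to NA
safe_which_min <- function(x) {
  i <- which.min(x)
  if (length(i) == 0) NA_integer_ else i
}

dist_cols <- grep("^centroid_distance_", names(sample2))
m <- as.matrix(sample2[, dist_cols])

# column position of the smallest distance in each row
unit <- unname(apply(m, 1, safe_which_min))

# (row, column) matrix indexing; a row with an NA index yields NA
short_distance <- m[cbind(seq_len(nrow(m)), unit)]

result <- data.frame(FID, year, short_distance, unit)
result
```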
Here is a solution using the dplyr, tidyr & stringr packages (but you can just import tidyverse):
library(tidyverse)
df <- sample2 %>%
gather('centroid', 'dist', 3:length(.)) %>%
group_by(year) %>%
slice(if(all(is.na(dist))) 1L else which.min(dist)) %>%
mutate(centroid = str_replace(centroid, "centroid_distance_", ""))
df
Returns:
# A tibble: 6 x 4
# Groups: year [6]
FID year centroid dist
<dbl> <dbl> <chr> <dbl>
1 15 1953 1 NA
2 17 1957 4 62.9
3 14 1983 2 24.3
4 12 1990 1 220.
5 12 1994 3 12.8
6 18 2000 3 18.8
A data.table solution
library(data.table)
setDT(sample2)
s <- melt(sample2, id = 1:2, variable.name = "object", value.name = "distance") ## pivot
s[, obj := as.numeric(object) ## transform factor into numeric
][, .(shortest = min(distance, na.rm=TRUE), unit= which.min(distance)), by = .(FID, year) ## calculate the shortest and which
][is.infinite(shortest), shortest:= NA # transform Inf into NA
][] ## report
Let me illustrate my question with an example:
Sample data:
df<-data.frame(BirthYear = c(1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005), Number= c(1,1,1,1,1,1,1,1,1,1,1), Group = c("g", "g", "g", "g", "g", "g","t","t","t","t","t"))
df
BirthYear Number Group
1 1995 1 g
2 1996 1 g
3 1997 1 g
4 1998 1 g
5 1999 1 g
6 2000 1 g
7 2001 1 t
8 2002 1 t
9 2003 1 t
10 2004 1 t
11 2005 1 t
and
df1<- structure(list(Year = c(2015, 2016, 2017, 2018, 2019, 2020)), class = "data.frame", row.names = c(NA,
-6L))
df1
Year
1 2015
2 2016
3 2017
4 2018
5 2019
6 2020
Now I want to add new columns to df1: g1, g2, t1 and t2.
g1 and t1 respectively represent the sum of df$Number for all instances of a group (g or t in df) where df1$Year - df$BirthYear is greater than 18 and lower than 21, so basically if someone is in the age between 19 & 20.
g2 and t2 represent the sum of df$Number for all instances of a group where the difference in years is lower than 19.
I want to end up with the following:
df1
Year g1 g2 t1 t2
1 2015 2 4 0 5
2 2016 2 3 0 5
3 2017 2 2 0 5
4 2018 2 1 0 5
5 2019 2 0 0 5
6 2020 1 0 1 4
I know I could make a for-loop over df1 to create the new columns but I don't know how to specify the condition to get the correct group sums for each year.
I hope this example makes clear what I'm trying to achieve.
I'd be very grateful for any help cause I'm really stuck at this point.
If what you want to do is just to calculate year differences across 2015:2020 and BirthYear, then you don't have to create a separate dataframe. Perhaps just
library(tidyr)
library(dplyr)
df %>%
expand(Year = 2015:2020, nesting(BirthYear, Number, Group)) %>%
group_by(Year, Group) %>%
summarise(
`1` = sum(between(Year - BirthYear, 19, 20) * Number),
`2` = sum((Year - BirthYear < 19) * Number)
) %>%
pivot_wider(names_from = "Group", values_from = c("1", "2"), names_glue = "{Group}{.value}")
Output
`summarise()` regrouping output by 'Year' (override with `.groups` argument)
# A tibble: 6 x 5
# Groups: Year [6]
Year g1 t1 g2 t2
<int> <dbl> <dbl> <dbl> <dbl>
1 2015 2 0 4 5
2 2016 2 0 3 5
3 2017 2 0 2 5
4 2018 2 0 1 5
5 2019 2 0 0 5
6 2020 1 1 0 4
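For comparison, the same table can be built in base R by computing every Year minus BirthYear difference at once with outer() and then summing Number under each condition. This is a sketch of an alternative, not the method used above; the helper count_if is a made-up name.

```r
df <- data.frame(
  BirthYear = 1995:2005,
  Number = rep(1, 11),
  Group = rep(c("g", "t"), times = c(6, 5))
)
years <- 2015:2020

# age[i, j] = age of person j in year i
age <- outer(years, df$BirthYear, "-")

# For each year, sum Number over people in group `grp` where `cond` is TRUE
count_if <- function(cond, grp) {
  as.vector((cond * 1) %*% (df$Number * (df$Group == grp)))
}

df1 <- data.frame(
  Year = years,
  g1 = count_if(age >= 19 & age <= 20, "g"),
  g2 = count_if(age < 19, "g"),
  t1 = count_if(age >= 19 & age <= 20, "t"),
  t2 = count_if(age < 19, "t")
)
df1
```

The column order here follows the layout requested in the question (g1, g2, t1, t2).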