Sum overlapping groups in R

My data frame consists of monthly weather data for a given location, as follows:

set.seed(123)
dat <- data.frame(Year  = rep(1980:1985, each = 12),
                  Month = rep(1:12, times = 6),
                  value = runif(12 * 6))
I have split the year into seasons as shown below.
s1 <- c(11, 12, 1, 2) # season 1 consists of month 11, 12, 1 and 2 i.e. cuts across years
s2 <- c(3, 4, 5) # season 2 consists of month 3, 4, 5
s3 <- c(6, 7, 8, 9, 10) # season 3 consists of month 6, 7, 8, 9, 10
Taking 1980 as an example:
- season 1 is Nov-Dec of 1979 and Jan-Feb of 1980
- season 2 is March-May of 1980
- season 3 is June-Oct of 1980
However, for year 1980, season 1 is incomplete since it only has months 1 and 2 and is missing
months 11 and 12 from 1979.
In contrast, for year 1985 seasons 1 to 3 are complete, and
I do not need months 11 and 12 of 1985 since they contribute to season 1 of 1986.
With this background, I want to sum the monthly values of each season by year,
so that the data frame is in year x season format instead of year-month format.
In doing so there will be no value for 1980 season 1, since it has missing months.
For seasons that cut across years, how do I sum the individual months?
library(dplyr)

season_list <- list(s1, s2, s3)
temp_list <- list()
for (s in seq_along(season_list)) {
  season_ref <- season_list[[s]]
  if (any(diff(season_ref) < 0)) { # check if season cuts across years
    dat %>%
      dplyr::filter(Month %in% season_ref)
    # how do I sum across years for this exception?
  } else {
    # if season does not cut across years, simply filter the months in each year and add
    temp_list[[s]] <-
      dat %>%
      dplyr::filter(Month %in% season_ref) %>%
      dplyr::group_by(Year) %>%
      dplyr::summarise(season_value = sum(value)) %>%
      dplyr::mutate(season = s)
  }
}
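For reference, the cross-year branch can be completed by rolling November and December forward one year before grouping, so each month counts toward the season it finishes in (a sketch in the same spirit as the answer below):

temp_list[[s]] <-
  dat %>%
  dplyr::filter(Month %in% season_ref) %>%
  dplyr::mutate(Year = Year + (Month %in% 11:12)) %>% # Nov/Dec count toward the following year
  dplyr::group_by(Year) %>%
  dplyr::summarise(season_value = sum(value)) %>%
  dplyr::mutate(season = s)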

Assuming that you want to sum the values for each season: calculate the Season and the endYear (the year in which the season ends; Year + (Month %in% 11:12) rolls November and December forward into the following year) and then sum by those.
dat %>%
  group_by(endYear = Year + (Month %in% 11:12),
           Season = 1 * (Month %in% s1) +
                    2 * (Month %in% s2) +
                    3 * (Month %in% s3)) %>%
  summarize(value = sum(value), .groups = "drop")
giving:
# A tibble: 19 x 3
endYear Season value
<int> <dbl> <dbl>
1 1980 1 1.08
2 1980 2 2.23
3 1980 3 2.47
4 1981 1 2.66
5 1981 2 1.25
6 1981 3 2.91
7 1982 1 3.00
8 1982 2 1.43
9 1982 3 3.50
10 1983 1 1.48
11 1983 2 0.693
12 1983 3 1.49
13 1984 1 1.82
14 1984 2 1.29
15 1984 3 1.77
16 1985 1 2.03
17 1985 2 1.47
18 1985 3 3.31
19 1986 1 1.38
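Note that this keeps partial seasons: 1980 season 1 (only Jan-Feb) and 1986 season 1 (only Nov-Dec of 1985) both appear. If, as the question requires, incomplete seasons should be dropped, one option (a sketch, not part of the original answer) is to count the months per group and keep only full seasons:

season_len <- lengths(list(s1, s2, s3)) # 4, 3 and 5 months per season
dat %>%
  group_by(endYear = Year + (Month %in% 11:12),
           Season = 1 * (Month %in% s1) +
                    2 * (Month %in% s2) +
                    3 * (Month %in% s3)) %>%
  summarize(value = sum(value), n_months = n(), .groups = "drop") %>%
  filter(n_months == season_len[Season]) %>% # keep only seasons with all their months
  select(-n_months)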

Related

Find out if time periods of years overlap

I have a data frame that lists political mandates with their start and end years. The question I want to answer is: "How many people (person_id) have had mandates that overlap in terms of years active?"
I've tried sequencing the active years as vectors, and then grouping and summarising by intersecting the vectors of active years.
My input table:

id  person_id  start_year  end_year
 1          6        1987      1992
 2          6        1989      1995
library(dplyr)
library(purrr)

mandates_active <- mandates %>%
  mutate(active_years = map2(start_year, end_year, seq))
I get an additional column with the sequenced active years:

id  person_id  start_year  end_year  active_years
 1          6        1987      1992  [1987, 1988, 1989 ...]
 2          6        1989      1995  [1989, 1990, 1991 ...]
Then I try to group this by person_id and summarise by intersecting the active-years lists, but intersect() needs two arguments, hence this doesn't work:

mandates_test <- mandates_active %>%
  group_by(person_id) %>%
  summarise(intersect(active_years))
My output would show which person_ids have had multiple mandates during overlapping years.
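As a side note on that attempt: intersect() takes exactly two vectors, but it can be folded over the list column with Reduce(). A sketch:

mandates_test <- mandates_active %>%
  group_by(person_id) %>%
  summarise(overlap = list(Reduce(intersect, active_years)))

For a person with a single mandate, Reduce() simply returns that mandate's years unchanged, so persons with one mandate may need filtering out first.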
Maybe this works for you. It shows the person_id and the overlapping years in the final result.
data.frame(df %>%
  rowwise() %>%
  summarize(id, person_id, f = seq(start_year, end_year, 1)) %>%
  group_by(person_id) %>%
  summarize(overlapping_years = list(f[duplicated(f)])))
person_id overlapping_years
1 6 1989, 1990, 1991, 1992
2 7
3 8 1992
If you just want the number of overlapping years:
df %>%
  rowwise() %>%
  summarize(id, person_id, f = seq(start_year, end_year, 1)) %>%
  group_by(person_id) %>%
  summarize(overlapping_years = length(f[duplicated(f)]))
# A tibble: 3 × 2
person_id overlapping_years
<dbl> <int>
1 6 4
2 7 0
3 8 1
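A compatibility note on the two pipelines above: summarise() returning more than one row per group was deprecated in dplyr 1.1.0, which introduced reframe() as the drop-in replacement for that first step:

df %>%
  rowwise() %>%
  reframe(id, person_id, f = seq(start_year, end_year, 1))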
Extended Data

df <- structure(list(id = c(1L, 2L, 1L, 2L, 1L, 2L),
                     person_id = c(6, 6, 7, 7, 8, 8),
                     start_year = c(1987, 1989, 1987, 1993, 1987, 1992),
                     end_year = c(1992L, 1995L, 1992L, 1995L, 1992L, 1995L)),
                row.names = c(NA, 6L), class = "data.frame")
library(dplyr)
library(tidyr)

set.seed(123)

# Data example
start_year <- sample(1957:2003, 12, TRUE)
end_year <- start_year + sample(1:4, 12, TRUE)
data <- data.frame(person_id = 1:12, start_year, end_year)
data
#> person_id start_year end_year
#> 1 1 1987 1990
#> 2 2 1971 1974
#> 3 3 1970 1971
#> 4 4 1959 1963
#> 5 5 1998 1999
#> 6 6 1999 2000
#> 7 7 1993 1994
#> 8 8 1970 1973
#> 9 9 1981 1985
#> 10 10 1982 1984
#> 11 11 1983 1986
#> 12 12 1961 1963
# Unroll the intervals into one row per person-year:
pers_years <- data %>%
  rowwise() %>%
  mutate(years = list(start_year:end_year)) %>%
  unnest(years)
Number of distinct persons by year:

pers_years %>%
  group_by(years) %>%
  summarise(persons = n_distinct(person_id))
#> # A tibble: 25 × 2
#> years persons
#> <int> <int>
#> 1 1959 1
#> 2 1960 1
#> 3 1961 2
#> 4 1962 2
#> 5 1963 2
#> 6 1970 2
#> 7 1971 3
#> 8 1972 2
#> 9 1973 2
#> 10 1974 1
#> # … with 15 more rows
Number of distinct persons whose mandates overlap with those of at least one other person in the dataset:
pers_years %>%
  inner_join(
    pers_years %>%
      group_by(years) %>%
      summarise(persons = n_distinct(person_id)) %>%
      filter(persons > 1),
    by = "years") %>%
  summarise(n_persons = n_distinct(person_id))
#> # A tibble: 1 × 1
#> n_persons
#> <int>
#> 1 10

Attribute value to new column based on values in similarly named columns

I have a data frame which has distances from a unit's centroid to different points. The points are identified by numbers, and what I am trying to obtain is a new column with the distance to the closest object.
So the data frame looks like this:
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
centroid_distance_1 <- c(220.3, 220.3, 515.6, NA, 200.2, 22)
centroid_distance_2 <- c(520, 520, 24.3, NA , NA, 51.8)
centroid_distance_3 <- c(NA, 12.8, 124.2, NA, NA, 18.8)
centroid_distance_4 <- c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
sample2 <- data.frame(FID, year, centroid_distance_1, centroid_distance_2, centroid_distance_3, centroid_distance_4)
sample2
FID year centroid_distance_1 centroid_distance_2 centroid_distance_3 centroid_distance_4
1 12 1990 220.3 520.0 NA 725.3
2 12 1994 220.3 520.0 12.8 725.3
3 14 1983 515.6 24.3 124.2 44.2
4 15 1953 NA NA NA NA
5 17 1957 200.2 NA NA 62.9
6 18 2000 22.0 51.8 18.8 217.9
FID is an identifier of each unit and year is a year indicator. Each row is a FID*year pair. centroid_distance_x is the row's distance between its centroid and object x. This is a small sample of the data frame, which contains many more columns and rows.
What I am looking for is something like this:
short_distance <- c(220.3, 12.8, 24.3, NA, 62.9,18.8)
unit <- c(1, 3, 2, NA, 4, 3)
ideal.df <- data.frame(FID, year, short_distance, unit)
ideal.df
FID year short_distance unit
1 12 1990 220.3 1
2 12 1994 12.8 3
3 14 1983 24.3 2
4 15 1953 NA NA
5 17 1957 62.9 4
6 18 2000 18.8 3
Basically, I add a column named short_distance, which takes the lowest value across all the centroid_distance_* columns in each row, and a column named unit, which identifies the object each row is closest to (so if a row has its smallest value in centroid_distance_1, unit takes the value 1).
I have tried a bunch of things with dplyr and pivot and re-pivoting the dataframe but I'm really not getting there.
Thanks a lot for the help!
Another solution based on the tidyverse, using pivot_longer, could look as follows.
library(dplyr)
library(tidyr)
library(stringr)
sample2 %>%
  pivot_longer(-c(FID, year)) %>%
  group_by(year, FID) %>%
  slice_min(value, n = 1, with_ties = FALSE) %>%
  mutate(unit = str_sub(name, -1)) %>%
  select(-name, short_distance = value)
# Groups: year, FID [6]
# FID year short_distance unit
# <dbl> <dbl> <dbl> <chr>
# 1 15 1953 NA 1
# 2 17 1957 62.9 4
# 3 14 1983 24.3 2
# 4 12 1990 220. 1
# 5 12 1994 12.8 3
# 6 18 2000 18.8 3
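One caveat (an addition, not in the original answer): slice_min() still returns a row for the all-NA group, so FID 15 gets unit "1" where the question asked for NA. Appending one step blanks the unit when no distance exists:

sample2 %>%
  pivot_longer(-c(FID, year)) %>%
  group_by(year, FID) %>%
  slice_min(value, n = 1, with_ties = FALSE) %>%
  mutate(unit = str_sub(name, -1)) %>%
  select(-name, short_distance = value) %>%
  mutate(unit = ifelse(is.na(short_distance), NA_character_, unit)) # no distance, no unit

Also note that str_sub(name, -1) keeps only the last character, so it assumes single-digit unit numbers.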
My first couple of attempts at this weren't working like I imagined, either - couldn't always get the NA behavior you want - but here's one that works:
library(dplyr)
library(reshape2) # Or use tidyr if you prefer
sample2 %>%
  # Melt/unpivot to one value per row
  melt(id.vars = c("FID", "year")) %>%
  # Extract the unit number
  mutate(unit = sub(x = variable,
                    pattern = "^centroid_distance_",
                    replacement = "")) %>%
  group_by(FID, year) %>% # group by FID and year to get one row of output for each
  arrange(value) %>%      # put the smallest distance at the top of each group
  slice_head(n = 1)       # take one row from the top of each group
Base R solution
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
centroid_distance_1 <- c(220.3, 220.3, 515.6, NA, 200.2, 22)
centroid_distance_2 <- c(520, 520, 24.3, NA , NA, 51.8)
centroid_distance_3 <- c(NA, 12.8, 124.2, NA, NA, 18.8)
centroid_distance_4 <- c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
sample2 <- data.frame(FID, year, centroid_distance_1, centroid_distance_2, centroid_distance_3, centroid_distance_4)
Apply the function min to each row and add the result to the data frame as column short_distance. Ignore the warning and handle it in the next step.
sample2$short_distance <- apply(sample2[, 3:6], 1, min, na.rm = TRUE)
#> Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
sample2$short_distance[is.infinite(sample2$short_distance)] <- NA # change `Inf` created by `min` to `NA`
Get the units with which.min. ifelse is required because which.min would drop all-NA rows.

sample2$unit <- apply(sample2[, 3:6], 1,
                      function(x) ifelse(length(which.min(x)) == 0, NA, which.min(x)))
Keep only relevant columns
sample2 <- sample2[, c(1,2,7,8)]
sample2
#> FID year short_distance unit
#> 1 12 1990 220.3 1
#> 2 12 1994 12.8 3
#> 3 14 1983 24.3 2
#> 4 15 1953 NA NA
#> 5 17 1957 62.9 4
#> 6 18 2000 18.8 3
Created on 2021-01-18 by the reprex package (v0.3.0)
Here is a solution using the dplyr & stringr packages (but you can just load the whole tidyverse):
library(tidyverse)
df <- sample2 %>%
  gather('centroid', 'dist', 3:length(.)) %>%
  group_by(year) %>%
  slice(if (all(is.na(dist))) 1L else which.min(dist)) %>%
  mutate(centroid = str_replace(centroid, "centroid_distance_", ""))
df
Returns:
# A tibble: 6 x 4
# Groups: year [6]
FID year centroid dist
<dbl> <dbl> <chr> <dbl>
1 15 1953 1 NA
2 17 1957 4 62.9
3 14 1983 2 24.3
4 12 1990 1 220.
5 12 1994 3 12.8
6 18 2000 3 18.8
A data.table solution (with the all-NA group handled explicitly, since which.min() returns a zero-length result there):

library(data.table)

setDT(sample2)
s <- melt(sample2, id = 1:2, variable.name = "object", value.name = "distance") ## pivot to long form
s[, obj := as.numeric(object)                 ## transform the factor into a numeric unit id
][, .(shortest = min(distance, na.rm = TRUE), ## shortest distance per FID-year
      unit = obj[which.min(distance)][1]),    ## [1] yields NA for all-NA groups
  by = .(FID, year)
][is.infinite(shortest), shortest := NA       ## all-NA groups give Inf; turn into NA
][]                                           ## report

Assign value from one variable to another new variable

My data is here:

x <- data.frame(Year  = c(1945, 1945, 1945, 1946, 1946, 1946, 1947, 1947, 1947),
                Age   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                Value = c(4, 5, 6, 7, 8, 9, 10, 11, 12))
I would like to assign the value from "year + 1 and age + 1" to a new variable. For example, for the case with year = 1945 and age = 1, I would like to assign the value 8 (from year = 1946, age = 2) to the new variable.
My ideal result will be like this:
x <- data.frame("Year" = c(1945,1945,1945,1946,1946,1946, 1947,1947,1947), "Age" = c(1,2,3,1,2,3,1,2,3), "Value" = c(4,5,6,7,8,9,10,11,12),"Year1moereandAge1more"= c(8,9,NA, 11, 12, NA, NA, NA,NA))
Thank you for helping a beginner.
Using a modified self-join:
library(dplyr)
x %>%
  transmute(Year = Year - 1, Age = Age - 1, Year1moereandAge1more = Value) %>%
  right_join(x) %>%
  arrange(Year, Age)
# Joining, by = c("Year", "Age")
# Year Age Year1moereandAge1more Value
# 1 1945 1 8 4
# 2 1945 2 9 5
# 3 1945 3 NA 6
# 4 1946 1 11 7
# 5 1946 2 12 8
# 6 1946 3 NA 9
# 7 1947 1 NA 10
# 8 1947 2 NA 11
# 9 1947 3 NA 12
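For comparison, the same lookup works in base R with match() on a pasted Year/Age key (a sketch equivalent to the join above):

x$Year1moereandAge1more <- x$Value[match(paste(x$Year + 1, x$Age + 1),
                                         paste(x$Year, x$Age))]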

Get the Standard Deviation of a Group where the Count is in a Column

I am looking to get the standard deviation grouped by year. All the examples I have seen do not involve an aggregated count column.
I want to use the count column as part of the standard deviation calculation, as if each row were repeated count times.
year  count  age
2018      2    0
2018      3    1
2018      4    2
2017      1    0
2017      4    1
2017      2    2

The expected answer for the above would be:

Year 2018 = 0.78567420131839
Year 2017 = 0.63887656499994
The following should do the trick.
library(dplyr)
library(purrr)

data <- tibble(year  = c(2018, 2018, 2018, 2017, 2017, 2017),
               count = c(2, 3, 4, 1, 4, 2),
               age   = c(0, 1, 2, 0, 1, 2))

data %>%
  mutate(vec = map2(age, count, ~ rep(.x, .y))) %>% # expand each age `count` times
  group_by(year) %>%
  mutate(concs = list(unlist(vec))) %>%             # collect the expanded ages per year
  ungroup() %>%
  mutate(age_sd = map_dbl(concs, sd)) %>%
  select(-vec, -concs)
# year count age age_sd
# <dbl> <dbl> <dbl> <dbl>
# 1 2018 2 0 0.833
# 2 2018 3 1 0.833
# 3 2018 4 2 0.833
# 4 2017 1 0 0.690
# 5 2017 4 1 0.690
# 6 2017 2 2 0.690
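A side note on the expected values: the numbers in the question (0.7857 and 0.6389) are population standard deviations (dividing by n), while R's sd() computes the sample version (dividing by n - 1), which is why the code above returns 0.833 and 0.690. To reproduce the question's numbers, rescale:

pop_sd <- function(x) sd(x) * sqrt((length(x) - 1) / length(x))
pop_sd(rep(c(0, 1, 2), times = c(2, 3, 4)))
#> [1] 0.7856742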

R: isolation of initial 10%

My data frame has 10 columns and 100,000 rows; each row is an observation and the columns are data pertaining to each observation. One of the columns has the date of an observation as a Julian day (i.e. Feb 4 = day 34). I want to reduce my data set so I have the first 10% of observations per year per species. I.e., for species 1 in the year 1901, I want the average day of appearance based on the first 10% of observations.
Example of what I have (note: id is the species as a number, i.e. blue = 1):

date    = c(3, 84, 98, 100, 34, 76, 86, ...)
species = c(blue, purple, grey, purple, green, pink, pink, white, ...)
id      = c(1, 2, 3, 2, 4, 5, 5, 6, ...)
year    = c(1901, 2000, 1901, 1996, 1901, 2000, 1986, ...)
habitat = c(forest, plain, mountain, ...)
etc.

What I want:

date    = c(3, 84, 76, 86, ...)
species = c(purple, pink, pink, white, ...)
id      = c(2, 5, 5, 6, ...)
year    = c(1901, 2000, 2000, 1986, ...)
habitat = c(forest, plain, mountain, ...)
new     = c(3, 84, 79, 86, ...)
Assuming the data set dd defined below:

set.seed(123)
n <- 100000
dd <- data.frame(year    = sample(1901:2000, n, replace = TRUE),
                 date    = sample(0:364, n, replace = TRUE),
                 species = sample(1:5, n, replace = TRUE))
1) base: Aggregate dd with the function below. No packages are used:

avg10 <- function(date) {
  ok <- seq_along(date) <= length(date) / 10
  if (any(ok)) mean(date[ok]) else NA
}
aggregate(date ~ species + year, dd, avg10)
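One assumption worth flagging: avg10 takes the first 10% of rows in their stored order, so if dd is not already sorted by date within year and species, sort it first:

dd <- dd[order(dd$year, dd$species, dd$date), ]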
2) data.table: Here is a data.table solution (seq_len(.N) gives the within-group row position; .I would give positions in the whole table):

library(data.table)
data.table(dd)[, {ok <- seq_len(.N) <= 0.10 * .N
                  if (any(ok)) mean(date[ok]) else NA},
               by = "species,year"]
Note: If you don't want NA's then use this instead of either of the if statements above to get the first point in that case:
if (any(ok)) mean(date[ok]) else date[1]
Just as for your last question, dplyr may work well for you:
Some data:

library(dplyr)
set.seed(42)
n <- 500
dat <- data.frame(date    = sample(365, size = n, replace = TRUE),
                  species = sample(5, size = n, replace = TRUE),
                  year    = 1980 + sample(20, size = n, replace = TRUE))
How it looks without filtering:
dat %>% group_by(year, species) %>% arrange(year, date)
## Source: local data frame [500 x 3]
## Groups: year, species
## date species year
## 1 50 1 1981
## 2 138 1 1981
## 3 174 1 1981
## 4 179 1 1981
## 5 200 1 1981
## 6 332 1 1981
## 7 31 2 1981
## 8 52 2 1981
## 9 196 2 1981
## 10 226 2 1981
## .. ... ... ...
How it looks with the first 10% by date within each year:
dat %>%
  group_by(year, species) %>%
  filter(ntile(date, 10) == 1) %>%
  arrange(year, date)
## Source: local data frame [100 x 3]
## Groups: year, species
## date species year
## 1 50 1 1981
## 2 31 2 1981
## 3 63 3 1981
## 4 112 4 1981
## 5 1 5 1981
## 6 40 1 1982
## 7 103 2 1982
## 8 40 3 1982
## 9 86 4 1982
## 10 48 5 1982
## .. ... ... ...
I think the ntile trick is doing what you want: it breaks the data into roughly equal-sized bins, so it should be giving you the lowest 10% of your dates.
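A tiny demonstration of that: ntile() ranks first and then bins, so bin 1 holds the lowest tenth (here, fifth) of the values:

library(dplyr)
ntile(c(5, 1, 9, 3, 7), 5)
#> [1] 3 1 5 2 4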
EDIT
Sorry, I missed the mean in there:
dat %>%
  group_by(year, species) %>%
  filter(ntile(date, 10) == 1) %>%
  summarise(date = mean(date)) %>%
  arrange(year, date)
## Source: local data frame [99 x 3]
## Groups: year
## year species date
## 1 1981 5 1
## 2 1981 2 31
## 3 1981 1 50
## 4 1981 3 63
## 5 1981 4 112
## 6 1982 1 40
## 7 1982 3 40
## 8 1982 5 48
## 9 1982 4 86
## 10 1982 2 103
## .. ... ... ...
